<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sample Blogging Exercise

_Author: Ben Shaver (DC)_

---


<a id=goals></a>

##  Overview

A great way to build your skills, portfolio, and online presence as a data scientist is to regularly play with new datasets and publish interesting observations you have made. Every Monday morning, you'll be examining a dataset of your choosing in order to practice this skill. Below is a series of questions to get you started. You may use this notebook as a template to complete the exercise every week.

For this exercise, you are highly encouraged to use a 'fresh' dataset like one featured in [Data is Plural](https://tinyletter.com/data-is-plural/letters/).  

## Required Objectives

Every Monday, Slack the dataset you have chosen to Ben by 10 am. You are required to examine a different dataset every week.

You are required to write 10 blog posts. At least 3(?) of them, and no more than 6(?), must explore interesting datasets.

## Additional Resources

[100 Interesting Datasets for Statisticians](http://rs.io/100-interesting-data-sets-for-statistics/)

[Open Data Inception - 2600+ Open Data Portals Around the World](https://opendatainception.io/)

[Data For Everyone Library](https://www.crowdflower.com/data-for-everyone/)

[Our World in Data](https://ourworldindata.org/)

[ML Friendly Public Datasets](https://www.kaggle.com/annavictoria/ml-friendly-public-datasets/)

[Top Datasets on r/Datasets](https://www.reddit.com/r/datasets/top/?sort=top&t=all)




In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='Data Cleaning'></a>
## Data Cleaning

---


### Data Description

Begin with a short description of your dataset. Where did you get it? (Provide a link.) What does it show?

What are the observational units? What are the variables? (If your dataset has many variables, try to focus on just 5-10 for the purposes of this exercise.)

Is your data tidy? Is each row an observation and each column a variable?

In [None]:
# My data comes from the Open University Learning Analytics dataset. Specifically,
# I'm using the Studentinfo table, which contains demographic data on students
# who took courses remotely via the Open University.

# The observational units are students. Variables include which course the student
# was taking, their age, geographic region, highest education attained, the number
# of credits the student is currently enrolled in, and the final result of the
# student in that module. Some students may show up more than once if they are enrolled
# in more than one module.

In [4]:
df = pd.read_csv('assets/Studentinfo.csv')

In [7]:
df.head()

|   | code_module | code_presentation | id_student | gender | region | highest_education | imd_band | age_band | num_of_prev_attempts | studied_credits | disability | final_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAA | 2013J | 11391 | M | East Anglian Region | HE Qualification | 90-100% | 55<= | 0 | 240 | N | Pass |
| 1 | AAA | 2013J | 28400 | F | Scotland | HE Qualification | 20-30% | 35-55 | 0 | 60 | N | Pass |
| 2 | AAA | 2013J | 30268 | F | North Western Region | A Level or Equivalent | 30-40% | 35-55 | 0 | 60 | Y | Withdrawn |
| 3 | AAA | 2013J | 31604 | F | South East Region | A Level or Equivalent | 50-60% | 35-55 | 0 | 60 | N | Pass |
| 4 | AAA | 2013J | 32885 | F | West Midlands Region | Lower Than A Level | 50-60% | 0-35 | 0 | 60 | N | Pass |


In [5]:
df.shape

(32593, 12)
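
The description above notes that a student can appear more than once when enrolled in several modules. Here is a minimal sketch of how that could be checked. The `toy` frame and its rows are made up for illustration and stand in for `df`; only the column names come from the `df.head()` output.

```python
import pandas as pd

# Hypothetical rows standing in for the real Studentinfo table.
toy = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB", "BBB"],
    "code_presentation": ["2013J", "2013J", "2013J", "2014J"],
    "id_student": [11391, 28400, 11391, 30268],
})

# Rows per student; a count above 1 means the student appears more than once.
rows_per_student = toy["id_student"].value_counts()
repeat_students = rows_per_student[rows_per_student > 1]

# Rows repeating the same student, module, and presentation would point to
# true duplicates rather than enrollment in several modules.
key = ["id_student", "code_module", "code_presentation"]
exact_repeats = toy.duplicated(subset=key).sum()

print(repeat_students)
print("exact repeats:", exact_repeats)
```

If repeated students turn out to be common, it may be worth deciding early whether the unit of analysis is the student or the student-module enrollment.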

### Data Types and Missing Values

What data type is each column? Are they what you expect?

Are there any null values? What about missing values or impossible values?

(Don't go down too many rabbit holes here. In a real project, you'd want to check all the variables for outliers and make sure every column is the right data type. For this exercise, it is OK to ignore some variables that are giving you trouble if you feel you can make interesting observations using the other variables.)

In [6]:
df.dtypes

# Basically, the data types are what I expect. Age is captured as a categorical
# variable, so it is a string. Disability is also a string, although it could be
# encoded as boolean instead.

code_module             object
code_presentation       object
id_student               int64
gender                  object
region                  object
highest_education       object
imd_band                object
age_band                object
num_of_prev_attempts     int64
studied_credits          int64
disability              object
final_result            object
dtype: object
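
As noted in the cell above, `disability` could be encoded as a boolean. One possible way to do that is sketched here on a small stand-in frame (the `toy` frame is illustrative and the new column name `disability_bool` is hypothetical; the Y/N values are the ones visible in `df.head()`).

```python
import pandas as pd

# Stand-in for df, using the disability values from the first five rows.
toy = pd.DataFrame({"disability": ["N", "N", "Y", "N", "N"]})

# Map the Y/N flags to booleans. Any value other than "Y" or "N" would
# become NaN, which makes unexpected codes easy to spot afterwards.
toy["disability_bool"] = toy["disability"].map({"Y": True, "N": False})

# Confirm that nothing was left unmapped.
n_unmapped = toy["disability_bool"].isnull().sum()

print(toy)
print("unmapped values:", n_unmapped)
```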

In [9]:
# There are null values in my dataset:
df.isnull().sum().sum() > 0 

True

In [10]:
# They're all in the 'deprivation band' column (imd_band). This is convenient because
# we can just ignore that one for now.
df.isnull().sum()

code_module                0
code_presentation          0
id_student                 0
gender                     0
region                     0
highest_education          0
imd_band                1111
age_band                   0
num_of_prev_attempts       0
studied_credits            0
disability                 0
final_result               0
dtype: int64
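
With 1111 missing values out of 32593 rows, roughly 3.4% of `imd_band` is null. A short sketch of how the missing share could be computed and the column set aside follows. The `toy` frame is a made-up stand-in for `df`, and dropping the column is just one option, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with some missing imd_band values.
toy = pd.DataFrame({
    "imd_band": ["90-100%", "20-30%", np.nan, "50-60%", np.nan],
    "final_result": ["Pass", "Pass", "Withdrawn", "Pass", "Pass"],
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = toy.isnull().mean()

# Set the column aside for now without modifying the original frame.
trimmed = toy.drop(columns=["imd_band"])

print(missing_share)
print(trimmed.columns.tolist())
```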

In [12]:
# Minimums of numerical columns look sensible:
df[['num_of_prev_attempts','studied_credits']].min()

num_of_prev_attempts     0
studied_credits         30
dtype: int64

In [13]:
df[['num_of_prev_attempts','studied_credits']].max()
# The maximum value for enrolled credits looks a little high.
# But this is an online university... Maybe a lot of students sign
# up for more courses than they ever plan to complete?

num_of_prev_attempts      6
studied_credits         655
dtype: int64
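
To judge whether a maximum like 655 is an isolated outlier or part of a long tail, it can help to look at upper quantiles rather than only the maximum. A rough sketch follows; the `credits` values are invented for illustration (only 30, 60, 240, and 655 appear in the outputs above), and the `threshold` cutoff is an arbitrary, hypothetical choice.

```python
import pandas as pd

# Invented credit values standing in for df['studied_credits'].
credits = pd.Series([30, 60, 60, 60, 60, 120, 240, 655], name="studied_credits")

# Median and upper quantiles show how far the maximum sits from the bulk.
summary = credits.quantile([0.5, 0.9, 0.99])

# Count how many rows exceed an arbitrary cutoff (hypothetical value).
threshold = 360
n_high = (credits > threshold).sum()

print(summary)
print("rows above threshold:", n_high)
```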

<a id='EDA'></a>
## Exploratory Data Analysis

---


What exactly are you trying to figure out using this data? What is your population of interest?

What are some questions you could ask of this data? Write down a few questions before you write another line of code.

In [None]:
# My population of interest is students of the Open University, a distance
# learning institution in the UK. I'm trying to figure out what patterns in
# student demographics are associated with module performance.

# Some questions I could ask of this data:
## Are students with higher educational attainment more likely to pass a module?
## Are students with a disability less likely to enroll in a lot of modules at the same time?
## Which region in the UK performs the best? Which has the oldest enrolled students?
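
The first question above could be approached with a grouped pass rate. This is only a sketch on made-up rows: the `toy` frame is hypothetical, and it assumes that only the label "Pass" counts as passing. The real `final_result` column may contain other labels, so checking `df['final_result'].value_counts()` first would be sensible.

```python
import pandas as pd

# Hypothetical rows standing in for df.
toy = pd.DataFrame({
    "highest_education": [
        "HE Qualification", "HE Qualification",
        "A Level or Equivalent", "A Level or Equivalent",
        "Lower Than A Level",
    ],
    "final_result": ["Pass", "Pass", "Withdrawn", "Pass", "Pass"],
})

# Assumption: a row counts as passed only when final_result is "Pass".
toy["passed"] = toy["final_result"] == "Pass"

# Pass rate and group size per education level; the size matters because
# a rate computed from very few rows is not very informative.
pass_rate = (
    toy.groupby("highest_education")["passed"]
    .agg(["mean", "size"])
    .sort_values("mean", ascending=False)
)

print(pass_rate)
```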

Of the questions you've brainstormed, which are the easiest to answer? Which are the hardest?

What answers would surprise you? For which questions do you lack strong pre-existing opinions?

In [None]:
# Since students either have a disability or don't, it would be really easy to see
# if the average number of credits (studied_credits, a proxy for course load) for
# students with a disability is less than the average for students without one.

# Since age is categorical in this dataset, it would require some munging in order to 
# figure out which region has the oldest students. I would have to decide what counts
# as 'old,' maybe.

# I would be surprised if students with higher educational attainment aren't more likely
# to pass a module. I don't really know which region in the UK has students who are more
# likely to pass. It would be interesting to find out.
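
Both comparisons described above can be sketched in a few lines. The `toy` frame below is invented for illustration. The ordering in `age_order` assumes that the three bands seen in `df.head()` are the only ones, and using the share of students in the oldest band is just one possible definition of "old".

```python
import pandas as pd

# Hypothetical rows standing in for df.
toy = pd.DataFrame({
    "region": ["Scotland", "Scotland", "South East Region", "South East Region"],
    "age_band": ["35-55", "55<=", "0-35", "35-55"],
    "disability": ["N", "Y", "N", "N"],
    "studied_credits": [60, 30, 120, 60],
})

# Easy question: average enrolled credits by disability status.
credits_by_disability = toy.groupby("disability")["studied_credits"].mean()

# Harder question: age is categorical, so give the bands an explicit order.
age_order = ["0-35", "35-55", "55<="]  # assumed to be the complete set of bands
toy["age_band"] = pd.Categorical(toy["age_band"], categories=age_order, ordered=True)

# One possible definition of "oldest": share of students in the top band.
share_oldest = (toy["age_band"] == "55<=").groupby(toy["region"]).mean()

print(credits_by_disability)
print(share_oldest)
```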

<a id='Visualization'></a>
## Visualization
---
