# Working with the Fragile Families Challenge data

# Table of contents
[1. Reading in the data](#1.-Reading-in-the-Data)

[2. Subselecting variables](#2.-Subselecting-variables)

[3. Imputation](#3.-Imputation)

In [3]:
# First, we import the libraries we will use in this notebook and load the Fragile Families data. 
%matplotlib inline 
import pandas as pd
import sys

# 1. Reading in the data
When you download the challenge data, you should have the following data files:
- **`background.csv`**: the set of questionnaire answers from years 0, 1, 3, 5 and 9
- **`train.csv`**: the set of 6 features from year 15 to be predicted (train split)
- **`test.csv`**: the set of 6 features from year 15 to be predicted (test split)

Let's read these files one by one and inspect them.

In [8]:
# Give the path to each data file (adjust these to wherever you saved the data)
background = "../../ai4all_data/background.csv"
train = "../../ai4all_data/train.csv"

# we will not use the test data for now
test = "../../ai4all_data/test.csv"

In [34]:
# Read in data
data_frame = pd.read_csv(background, low_memory=False)
train_outcome = pd.read_csv(train, low_memory=False)
test_outcome = pd.read_csv(test, low_memory=False)

We can display the shapes of the dataframes through the `shape` attribute (notice no parentheses: it is an attribute, not a function call).

In [19]:
print("Background data frame shape is:", data_frame.shape)
print("Train outcome data frame shape is:", train_outcome.shape)
print("Test outcome data frame shape is:", test_outcome.shape)

Background data frame shape is: (4242, 12943)
Train outcome data frame shape is: (2121, 7)
Test outcome data frame shape is: (1591, 7)


Calculate how many rows of the background data have no outcome row in either split (data frame rows - (train + test rows)):

In [25]:
data_frame.shape[0] - (train_outcome.shape[0]+test_outcome.shape[0])

530

Notice that **`train_outcome`** contains 2121 rows, **`test_outcome`** contains 1591 rows, and **`data_frame`** contains 4242 rows. This is because the original **`outcome`** dataset has been split into three groups:
- challenge training set, on which we train our algorithms: 2121 (1/2*4242)
- challenge test set, on which we test before we submit: 1591 (about 3/8*4242)
- real (hidden) test set, used to evaluate the challenge submissions once they're uploaded on the website: 530 (about 1/8*4242)
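
As a quick arithmetic check, the share of each group can be computed from the row counts printed above. The variable names here are illustrative and the counts are copied by hand rather than read from the files.

```python
# Row counts copied from the shape output above
n_background = 4242
n_train = 2121
n_test = 1591

# Rows of the background data that have no released outcome row
n_hidden = n_background - (n_train + n_test)

for name, n in [("train", n_train), ("test", n_test), ("hidden", n_hidden)]:
    print(f"{name}: {n} rows, {n / n_background:.3f} of the background rows")
```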

We will primarily work with the training set.

We can display the first few rows of the data frame by calling the `.head()` method:

In [6]:
data_frame.head()

Unnamed: 0,challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-3,,-3,40,,0,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
1,2,-3,,0,40,,1,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
2,3,-3,,0,35,,1,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
3,4,-3,,0,30,,1,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
4,5,-3,,0,25,,1,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466


In [9]:
train_outcome.head()

Unnamed: 0,challengeID,gpa,grit,materialHardship,eviction,layoff,jobTraining
0,1,,,,,,
1,3,,,,,,
2,6,,3.5,0.090909,0.0,0.0,0.0
3,7,2.5,3.25,0.0,0.0,0.0,0.0
4,8,,,,,,


Notice that we have a few NaN (not a number) values in the DataFrame. NaN entries appear very often in real-world datasets, usually signifying missing data. NaNs are also produced by undefined floating-point operations such as 0/0, or when a non-numerical value is coerced to a number.

We will discuss dealing with NaNs in the Imputation section (below).
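
Before getting to imputation, it can help to see how NaNs are counted and where they come from. The following is a minimal sketch on a toy frame; the values and the names `toy`, `nan_counts` and `coerced` are made up for illustration and are not part of the challenge data.

```python
import numpy as np
import pandas as pd

# A toy frame shaped like the first rows of train_outcome (values are illustrative)
toy = pd.DataFrame(
    {
        "gpa": [np.nan, np.nan, np.nan, 2.5, np.nan],
        "grit": [np.nan, np.nan, 3.5, 3.25, np.nan],
    },
    index=pd.Index([1, 3, 6, 7, 8], name="challengeID"),
)

# Count the missing entries in each column
nan_counts = toy.isna().sum()
print(nan_counts)

# NaN never compares equal to itself, so use isna() rather than == to detect it
print(np.nan == np.nan)

# Coercing a non-numeric string to a number yields NaN
coerced = pd.to_numeric(pd.Series(["3.5", "n/a"]), errors="coerce")
print(coerced)
```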

Notice that in the **`train_outcome`** data frame, the challengeID values (1, 3, 6, 7, 8, ...) are not aligned with the index on the left. That is because the individual samples have been assigned to the training and test sets at random. It's always good practice to split our data into training and testing sets that way, because splitting them into top and bottom slices (for example) may carry over biases (such as: the first individuals in the study could all come from the same city, which has better schools than the other cities).

We can change the indexing such that it reflects the actual family id:

In [35]:
data_frame = data_frame.set_index('challengeID')
train_outcome = train_outcome.set_index('challengeID')

See the changes below:

In [16]:
data_frame.head()

challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,cm1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
1,-3,,-3,40,,0,,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
2,-3,,0,40,,1,,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
3,-3,,0,35,,1,,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
4,-3,,0,30,,1,,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
5,-3,,0,25,,1,,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466


In [36]:
train_outcome.head()

challengeID,gpa,grit,materialHardship,eviction,layoff,jobTraining
1,,,,,,
3,,,,,,
6,,3.5,0.090909,0.0,0.0,0.0
7,2.5,3.25,0.0,0.0,0.0,0.0
8,,,,,,


Now, notice that **`data_frame`** contains IDs that do not appear in **`train_outcome`**. If we want to train a supervised algorithm (one for which we have labeled examples) to predict one of the six challenge outcomes, we need to subselect only those rows of **`data_frame`** for which labels are provided. The rest will be used for testing.

In [37]:
# get a list of indices that appear in the training set
select_index = train_outcome.index

In [38]:
select_index

Int64Index([   1,    3,    6,    7,    8,    9,   10,   13,   14,   16,
            ...
            4217, 4220, 4222, 4224, 4229, 4235, 4236, 4239, 4240, 4241],
           dtype='int64', name='challengeID', length=2121)

In [44]:
# select only those rows in data_frame which appear in the training set
train_X = data_frame.loc[select_index]
train_X.shape

# let's name the label frame train_y for consistency
train_y = train_outcome.copy()
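
The same selection can be sketched on toy data, together with a check that every labeled ID exists in the background frame and a list of the IDs left without labels. The frames and names below (`toy_background`, `toy_outcome`, `unlabeled_ids`) are hypothetical stand-ins, not the challenge data.

```python
import pandas as pd

# Toy stand-ins for data_frame and train_outcome, both indexed by challengeID
toy_background = pd.DataFrame(
    {"m1lenmin": [40, 40, 35, 30, 25]},
    index=pd.Index([1, 2, 3, 4, 5], name="challengeID"),
)
toy_outcome = pd.DataFrame(
    {"gpa": [2.5, 3.0]},
    index=pd.Index([1, 3], name="challengeID"),
)

# Check that every labeled ID exists in the background data before selecting;
# recent pandas versions raise a KeyError in .loc for labels that are missing
all_present = toy_outcome.index.isin(toy_background.index).all()

# Rows of the background data that have labels
toy_train_X = toy_background.loc[toy_outcome.index]

# IDs in the background data that have no label in the training outcomes
unlabeled_ids = toy_background.index.difference(toy_outcome.index)

print(toy_train_X)
print(list(unlabeled_ids))
```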

In [42]:
train_X.head()

challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,cm1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
1,-3,,-3,40,,0,,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
3,-3,,0,35,,1,,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
6,-3,,0,25,,1,,,,,...,8.5157,10.558813,-3.0,-3.0,7.022328,-3.0,10.564085,-3,-3.0,10.255825
7,-3,,0,35,,1,,,,,...,-3.0,-3.0,9.660643,9.861125,-3.0,10.991854,-3.0,-3,10.972726,10.8598
8,-3,,1,10,,1,,,,,...,-3.0,10.558813,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0


Notice that the train_X frame now has only 2121 rows, the same ones as the training labels.

We can keep the train_X and train_y frames separate, but if for any reason (such as imputation) we want to merge them, we can do it by calling:

In [None]:
train_full = p
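
One way to do such a merge, assuming the goal is to place the feature and label columns side by side on the shared `challengeID` index, is `DataFrame.join`. This is only a sketch on toy frames; `toy_X`, `toy_y` and `toy_full` are illustrative names, and `pd.concat([toy_X, toy_y], axis=1)` would be an equivalent choice here.

```python
import numpy as np
import pandas as pd

# Toy feature and label frames sharing the same challengeID index
idx = pd.Index([1, 3, 6], name="challengeID")
toy_X = pd.DataFrame({"m1lenmin": [40, 35, 25]}, index=idx)
toy_y = pd.DataFrame({"gpa": [np.nan, np.nan, 3.0]}, index=idx)

# join aligns rows on the index and appends the label columns to the features
toy_full = toy_X.join(toy_y)
print(toy_full)
```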

Pick out the students' language and literacy skills (`t5c13a`), social science skills (`t5c13b`), math skills (`t5c13c`)

In [5]:
# .copy() gives us an independent frame, so adding a column does not trigger a SettingWithCopyWarning
edu = train_X[["t5c13a", "t5c13b", "t5c13c"]].copy()
edu["gpa"] = train_y["gpa"]


In [6]:
edu.head()

challengeID,t5c13a,t5c13b,t5c13c,gpa
1,-9,-9,-9,
3,-9,-9,-9,
6,-9,-9,-9,
7,-9,-9,-9,2.5
8,-9,-9,-9,


In [7]:
print(edu.t5c13a.unique())
print(edu.t5c13b.unique())
print(edu.t5c13c.unique())

[-9  1  3  2  4  5 -2 -1]
[-9  2  4  3  1  5 -2 -1]
[-9  2  4  3  1  5 -2 -1]


In this subsection, we will predict GPA based on a single feature: language and literacy skills (`t5c13a`). Intuitively, better language and literacy skills should go with a better GPA. Of course, GPA depends on many factors, and language and literacy skill is only one of them.

Valid answers for the language and literacy skill are 1, 2, 3, 4 and 5, so this is a categorical variable; the negative values in the output above are filtered out below. The GPA values in the data are rounded to steps of 0.25, so GPA is a discrete variable. We emphasize that regression is a tool that estimates the relationship between two continuous variables, so keep this in mind when interpreting the results.

Because of the nature of the Fragile Families study, most of the collected data are discretized, like GPA.
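
The unique values printed above include negative codes (-9, -2, -1) next to the answers 1 to 5. Assuming, as the filtering in the next cell suggests, that negative values mark responses we do not want to use, they can be turned into NaN and dropped. The sketch below uses a made-up toy frame; `toy_edu`, `cleaned` and `complete` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the edu frame (values are made up for illustration)
toy_edu = pd.DataFrame(
    {
        "t5c13a": [-9, 1, 3, -2, 5],
        "gpa": [np.nan, 2.5, 3.0, 3.25, 3.75],
    }
)

# where() keeps entries that satisfy the condition and replaces the rest with NaN
cleaned = toy_edu.where(toy_edu >= 0)

# Keep only the rows in which every column has a usable value
complete = cleaned.dropna()
print(complete)
```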

In [8]:
import ff_functions as helper
edu_nonan = helper.remove_nan(edu)
edu_nonan = helper.select_above_zero(edu_nonan)

In [9]:
edu_nonan.head()
edu_nonan.shape

(734, 4)

We will put all these steps together in a single function so that we can just do it all at once next time:

In [10]:
def pick_out_variables(background, output, background_vars, outcome_vars, remove_nans=False, remove_negatives=False):
    '''
    This function takes in the background and outcome DataFrames,
    a list of desired background variables and a list of desired outcome_vars,
    and subselects them from the background and output frames.
    It returns a single DataFrame containing the desired columns, where
    corresponding rows between the two DataFrames have been subselected.
    '''
    # keep only the background rows whose index appears in the outcome frame
    train_X = background.loc[output.index]
    # copy so that adding the outcome columns does not write into a slice of background
    new_frame = train_X[background_vars].copy()
    new_frame[outcome_vars] = output[outcome_vars]
    if remove_nans:
        # keep only rows with no NaN in any selected column
        new_frame = new_frame[new_frame.notna().all(axis=1)]
    if remove_negatives:
        # keep only rows where every selected value is non-negative
        new_frame = new_frame[(new_frame >= 0).all(axis=1)]
    return new_frame

We can reproduce the steps by calling:

**`edu_nonan = pick_out_variables(data_frame, train_outcome, ["t5c13a", "t5c13b", "t5c13c"], ["gpa"], remove_nans=True, remove_negatives=True)`**

You can find this function in **`ff_functions.py`** and call it after running **`import ff_functions`**.

We have our data ready! We can now move to perform our first regression task - linear regression! 

### Linear regression for GPA using language and literacy skills ('t5c13a') as predictor 
In our regression analysis, we will first split the 734 samples into training set and test set, and then investigate the correlation between the language and literacy skill and the *average GPA* using linear regression techniques.

In [None]:
# First, let's split our data into train and test sets
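# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split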
# split the cleaned samples: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(edu_nonan.t5c13a, edu_nonan.gpa, test_size = 0.3, random_state = 100)
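
To make the idea of regressing the *average GPA* on the skill level concrete, here is a minimal sketch on made-up numbers. It fits the line with `numpy.polyfit` rather than scikit-learn, purely to keep the example short; the toy values and the names `toy`, `mean_gpa`, `slope` and `intercept` are assumptions for illustration, not results from the challenge data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for edu_nonan (values are made up for illustration)
toy = pd.DataFrame(
    {
        "t5c13a": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
        "gpa": [2.0, 2.5, 2.5, 2.75, 2.75, 3.25, 3.0, 3.5, 3.5, 3.75],
    }
)

# Average GPA at each skill level
mean_gpa = toy.groupby("t5c13a")["gpa"].mean()

# Least-squares line through the averages: gpa ~ slope * skill + intercept
skill = mean_gpa.index.to_numpy(dtype=float)
slope, intercept = np.polyfit(skill, mean_gpa.to_numpy(), deg=1)

print(mean_gpa)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

A positive slope on the real data would be consistent with the intuition that stronger language and literacy skills go with a higher average GPA.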

# 3. Imputation