# Working with the Fragile Families Challenge data

# Table of contents
[1. Reading in the data](#1.-Reading-in-the-data)

[2. Understanding column names](#2.-Understanding-column-names)

[3. Subselecting variables](#3.-Subselecting-variables)

[4. Imputation](#4.-Imputation)

[5. Feature engineering](#5.-Feature-engineering)

In [1]:
# First, we import the libraries we will use in this notebook and load the Fragile Families data. 
%matplotlib inline 
import pandas as pd
import numpy as np
import sys

# 1. Reading in the data
When you download the challenge data, you should have the following data files:
- **`background.csv`**: the set of questionnaire answers from years 0, 1, 3, 5 and 9
- **`train.csv`**: the set of 6 features from year 15 to be predicted (train split)
- **`test.csv`**: the set of 6 features from year 15 to be predicted (test split)

Let's read these files one by one and inspect them.

In [2]:
# Give the path (relative or absolute) to the background data file
background = "../../ai4all_data/background.csv"

The Fragile Families data is saved in CSV files. CSV ("comma-separated values") is a very common format for saving tabular (Excel-like) data. To create a DataFrame from a CSV file, we use the pandas `read_csv` function.

You can read more about it here:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
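
To get a feel for `read_csv` without the full 12,000-column file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The values and the name `toy_frame` are made up for illustration; they are not real study data.

```python
import io
import pandas as pd

# A tiny stand-in for background.csv (made-up values, not real study data)
csv_text = """challengeID,m1intmon,m1lenmin
1,-3,40
2,-3,40
3,-3,35
"""

# read_csv accepts a file path or any file-like object
toy_frame = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
print(toy_frame.shape)
```

The first line of the CSV becomes the column labels, and every following line becomes one row.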

In [3]:
# Read in data
data_frame = pd.read_csv(background, low_memory=False)

We can display the shape of the data frame (number of rows, number of columns) by accessing its `shape` attribute (notice no parentheses, because it is an attribute rather than a method!):

In [4]:
print("Background data frame shape is:", data_frame.shape)

Background data frame shape is: (4242, 12943)


We can display the first few rows of the data frame by calling the `.head()` method:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

In [5]:
data_frame.head()

Unnamed: 0,challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
0,1,-3,,-3,40,,0,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
1,2,-3,,0,40,,1,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
2,3,-3,,0,35,,1,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
3,4,-3,,0,30,,1,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
4,5,-3,,0,25,,1,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466


# 2. Understanding column names

Each row in the data frame represents a single family enrolled in the Fragile Families study. Each column represents a different *variable* - information collected about that family. The row labels are the default numbering of rows, starting at 0, while the column labels are the names of the variables. However, these names don't make much sense!

The first column, labelled "challengeID", is a unique identifier for each family. So challengeID=5 stands for the family identified as Family 5.

First, let's use the challengeID as the labels for the rows. This way, we can refer to Family 5 by referring to the row of data_frame with label "5". To do this, we can use the "data_frame.set_index('row label')" method, where we set the row label to 'challengeID':

In [6]:
data_frame = data_frame.set_index('challengeID')
data_frame.head()

challengeID,m1intmon,m1intyr,m1lenhr,m1lenmin,cm1twoc,cm1fint,cm1tdiff,cm1natsm,m1natwt,cm1natsmx,...,m4d9,m4e23,f4d6,f4d7,f4d9,m5c6,m5d20,m5k10,f5c6,k5f1
1,-3,,-3,40,,0,,,,,...,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3.0,-3,-3.0,-3.0
2,-3,,0,40,,1,,,,,...,-3.0,8.473318,-3.0,-3.0,-3.0,-3.0,9.845074,-3,-3.0,9.723551
3,-3,,0,35,,1,,,,,...,-3.0,-3.0,9.097495,10.071504,-3.0,-3.0,-3.0,-3,-3.0,-3.0
4,-3,,0,30,,1,,,,,...,-3.0,-3.0,9.512706,10.286578,-3.0,10.677285,-3.0,-3,8.522331,10.608137
5,-3,,0,25,,1,,,,,...,-3.0,-3.0,11.076016,9.615958,-3.0,9.731979,-3.0,-3,10.115313,9.646466
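
Once the index is set, a family can be looked up by its label with `.loc`. The sketch below uses a small made-up frame (the values are hypothetical) to show the idea:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "challengeID": [1, 2, 5],
    "m1lenmin": [40, 40, 25],
    "cm1fint": [0, 1, 1],
})

# Use challengeID as the row labels
toy = toy.set_index("challengeID")

# .loc selects by row label, so this is the family with challengeID 5,
# not the row at position 5
family_5 = toy.loc[5]
print(family_5["m1lenmin"])
```

The same pattern, `data_frame.loc[5]`, should work on the full data frame after the `set_index` call above.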


The names of all the other variables are not very informative on their own! Luckily for us, the Fragile Families project has a fantastic website that provides information on what each variable name represents, what type of variable it is, and what the values of the variable mean:

[http://metadata.fragilefamilies.princeton.edu/variables](http://metadata.fragilefamilies.princeton.edu/variables)

## 2.1. Checking if a feature is present in the data frame
Note that not all variables listed on the website are accessible in our dataset. Some have been removed because they were considered sensitive data, or have not yet been released publicly. If you want to check if a given column name (metadata variable) is in the data frame, you can type:

In [7]:
"p6b21" in data_frame.columns

False

In [8]:
"k5e1a" in data_frame.columns

True

## 2.2. What about NaN values?

Notice that we have a few NaN ("not a number") values in the DataFrame. NaN entries appear very often in real-world datasets, usually signifying missing data. NaNs are also produced by undefined numerical operations such as dividing zero by zero, or when a non-numerical value is coerced to a number.

We will discuss dealing with NaNs in the Imputation section (below).
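
As a small illustration of where NaNs come from, the sketch below produces them in two ways and shows how to detect them. The variable names are hypothetical and the inputs are made up.

```python
import numpy as np
import pandas as pd

# 0/0 is undefined, so NumPy returns NaN; a nonzero number divided by zero
# gives infinity instead. Both normally emit a RuntimeWarning, silenced here.
with np.errstate(invalid="ignore", divide="ignore"):
    undefined = np.float64(0.0) / np.float64(0.0)
    infinite = np.float64(1.0) / np.float64(0.0)

# Coercing text that is not a number also yields NaN
coerced = pd.to_numeric(pd.Series(["3", "n/a", "7"]), errors="coerce")

print(np.isnan(undefined), np.isinf(infinite))
print(coerced.isna().tolist())

# NaN is not equal to itself, so test with isna()/isnan() rather than ==
print(undefined == undefined)
```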

# 3. Subselecting variables

It is very hard to work with all 12,942 features at once when trying to predict outcomes - we would have to use very advanced methods to learn which features to discard, otherwise we'll just have a ton of features impacting outcomes to a very small extent.

What we can do instead is to pick out a few potential features and try to predict the target variable using only those features.

## Problem 3.1. 
Let's assume we want to study the impact of various features on variable **`k5g2i`**: *I (child) worry about doing well in school (in year 9).*

Take a few minutes to look through the metadata website and find 3 features that we would like to study along with **`k5g2i`**, among them:
- one continuous
- one binary 
- and one categorical (ordered or unordered)

(Hint: use the Variable Type option on the metadata website to select the desired type)

Let's write down their names so that you can all share the features we found.

## Answer:

In [9]:
feature_continuous = "" # what it means
feature_binary = "" # what it means
feature_categorical = "" # what it means

############ TODO: comment these out #############
feature_continuous = "t5e13" # % of children that complete hw
feature_binary = "p5i2a" # participated in athletic activity
feature_categorical = "t5c13a" # math skills
##################################################


Let's now make a list that contains the names of all the variables we're interested in

In [10]:
selected_features = ["k5g2i", feature_continuous, feature_binary, feature_categorical]
selected_features

['k5g2i', 't5e13', 'p5i2a', 't5c13a']

## Problem 3.2
Check if all the features you selected are in the data frame columns (see 2.1).

(Hint: use the following loop format)

In [11]:
######################### TODO: remove this ###################
for feature in selected_features:
    if feature in data_frame.columns:
        print("Feature " + feature + " is in the columns!")
    else:
        print("Feature " + feature + " is NOT in the columns!")
###############################################################

Feature k5g2i is in the columns!
Feature t5e13 is in the columns!
Feature p5i2a is in the columns!
Feature t5c13a is in the columns!


In [12]:
for feature in selected_features:
    # if feature is in the columns:
        print("Feature " + feature + " is in the columns!")
    # otherwise (remember the keyword for that in python?):
        print("Feature " + feature + " is NOT in the columns!")

Feature k5g2i is in the columns!
Feature k5g2i is NOT in the columns!
Feature t5e13 is in the columns!
Feature t5e13 is NOT in the columns!
Feature p5i2a is in the columns!
Feature p5i2a is NOT in the columns!
Feature t5c13a is in the columns!
Feature t5c13a is NOT in the columns!


Now, let's create a new data frame that contains only these features.

In [13]:
df = data_frame[selected_features]
df.head()

challengeID,k5g2i,t5e13,p5i2a,t5c13a
1,-9,-9,-9,-9
2,1,90,2,3
3,-9,-9,2,-9
4,1,-9,1,-9
5,2,-9,1,-9


## Problem 3.3
You can change the column names to make them more descriptive:

In [14]:
# change the names to be more descriptive
column_names = ["name1", "name2", "name3", "name4"]
df.columns = column_names
df.head()

challengeID,name1,name2,name3,name4
1,-9,-9,-9,-9
2,1,90,2,3
3,-9,-9,2,-9
4,1,-9,1,-9
5,2,-9,1,-9
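
Assigning a list to `df.columns` relies on the order of the columns. A possible alternative, sketched here on a made-up frame, is `rename` with a dictionary, which maps each old name to a new one explicitly. The descriptive names below are hypothetical choices, not official labels.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"k5g2i": [1, 2], "t5e13": [90, 25]})

# Map old column names to (hypothetical) descriptive names
name_map = {"k5g2i": "worries_about_school", "t5e13": "pct_completing_hw"}

# rename returns a new DataFrame and leaves the original untouched
renamed = toy.rename(columns=name_map)
print(list(renamed.columns))
```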


Let's take a closer look at the values displayed for each feature. Go back to the metadata website and look at response codes for all variables.

Notice that the negative values in the Fragile Families dataset are reserved for missing values. There are various reasons for which data is missing.

We need to remember that we cannot interpret missing values the same way we do actual responses.

We talk more about dealing with missing values and NaN values in the imputation part of the notebook. For now, let's just remove all the rows which contain any missing values. You can use the functions provided below:

In [15]:
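def remove_nan(data):
    '''
    Remove rows containing NaNs from a DataFrame
    (or NaN entries from a Series).
    '''
    if len(data.shape) > 1:
        # DataFrame: keep only the rows in which no value is NaN
        return data[(~np.isnan(data)).all(axis=1)]
    else:
        # Series: keep only the entries that are not NaN
        return data[~np.isnan(data)]

def select_nonnegative(data):
    '''
    Remove rows with values below 0 from a DataFrame
    (or negative entries from a Series).
    '''
    if len(data.shape) > 1:
        # DataFrame: keep only the rows in which every value is >= 0
        return data[(data >= 0).all(axis=1)]
    else:
        # Series: keep only the non-negative entries
        return data[data >= 0]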
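
To see what these two filtering steps do, here is a minimal sketch on a made-up frame. It uses the built-in pandas `dropna` method for the NaN step, which is an alternative to the helper above rather than the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Made-up values; negative numbers stand in for missing-data codes
toy = pd.DataFrame(
    {"k5g2i": [1.0, -9.0, 2.0, np.nan], "t5e13": [90.0, 50.0, -9.0, 75.0]},
    index=pd.Index([2, 3, 4, 5], name="challengeID"),
)

# Step 1: drop every row that contains a NaN
no_nan = toy.dropna()

# Step 2: keep only the rows where every value is non-negative
clean = no_nan[(no_nan >= 0).all(axis=1)]
print(clean)
```

Only the family with challengeID 2 survives, because each of the other rows has a NaN or a negative code in at least one column.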

Let's print the dimensions of the data frame again for comparison:

In [16]:
print("Background data frame shape is:", df.shape)

Background data frame shape is: (4242, 4)


In [17]:
df = remove_nan(df)
df = select_nonnegative(df)
df.head()

challengeID,name1,name2,name3,name4
2,1,90,2,3
9,0,95,2,1
11,0,25,1,4
12,0,98,2,4
14,0,90,1,3


## Problem 3.4
Print the shape again to compare. How many rows (families) have been removed?

If your data frame at this point is empty, don't despair - this simply means that your features have a lot of missing values in them. You can go back and look for better features, or plug in the ones we've already tested:

In [18]:
feature_1 = "t5e13"  # % children that complete hw
feature_2 = "p5i2a"  # participated in athletic activity
feature_3 = "t5c13a" # math skills

We will put all these steps together in a single function so that we can just do it all at once next time:

In [24]:
def pick_ff_variables(dataframe, features, remove_nans=False, remove_negatives=False):
    '''
    This function takes in the background DataFrame and
    a list of desired background variables,
    and subselects them from the background frame.
    It returns a single DataFrame containing only the desired columns.

    The function also provides the options to remove rows with
    NaN values and rows with negative values.

    Input arguments:

    Required:
    dataframe: the original DataFrame (to be reduced)
    features: a list of column names to subselect from the dataframe

    Additional (with default values filled in):
    remove_nans: if True, remove rows containing NaN values
    remove_negatives: if True, remove rows containing negative values

    Output:
    a pandas dataframe containing only selected columns (features).
    '''
    # For every feature inside the list of features, make sure it's contained in the columns
    for ft in features:
        if ft not in dataframe.columns:
            print("Feature " + ft + " is NOT in the columns - provide other features.")

    # select only the columns corresponding to desired features
    new_frame = dataframe[features]

    # option to remove NaNs
    if remove_nans:
        if len(new_frame.shape) > 1:
            new_frame = new_frame[(~np.isnan(new_frame)).all(axis=1)]
        else:
            new_frame = new_frame[~np.isnan(new_frame)]

    # option to remove negative values
    if remove_negatives:
        if len(new_frame.shape) > 1:
            new_frame = new_frame[(new_frame >= 0).all(axis=1)]
        else:
            new_frame = new_frame[new_frame >= 0]

    print("Data frame with ", new_frame.shape[0], " rows and ", new_frame.shape[1], "columns.")
    return new_frame

Now you can reproduce the dataframe we created earlier by calling the function with your read-in **data_frame** and **selected_features**.

In [23]:
selected_features

['k5g2i', 't5e13', 'p5i2a', 't5c13a']

In [25]:
df1 = pick_ff_variables(data_frame, selected_features, remove_nans=True, remove_negatives=True)

Data frame with  1932  rows and  4 columns.


In [26]:
df1.head()

challengeID,k5g2i,t5e13,p5i2a,t5c13a
2,1,90,2,3
9,0,95,2,1
11,0,25,1,4
12,0,98,2,4
14,0,90,1,3


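Check whether the two data frames are indeed equivalent. Note that `equals` also compares the column labels, so it returns `False` if `df` still carries the renamed columns from Problem 3.3: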

In [None]:
df.equals(df1)
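
Because `equals` checks the row and column labels as well as the values, two frames with identical numbers but different column names are reported as different. The sketch below shows this on made-up data; the descriptive column names are hypothetical.

```python
import pandas as pd

# Made-up values for illustration only
a = pd.DataFrame({"k5g2i": [1, 0], "t5e13": [90, 95]}, index=[2, 9])

# Same values, but with (hypothetical) descriptive column names
b = a.copy()
b.columns = ["worry_school", "pct_hw_done"]

# Labels differ, so equals reports a mismatch
labels_and_values_match = a.equals(b)

# Comparing the underlying arrays ignores the labels
values_match = bool((a.to_numpy() == b.to_numpy()).all())

# Restoring the original labels makes equals succeed
b_restored = b.copy()
b_restored.columns = a.columns
restored_match = a.equals(b_restored)

print(labels_and_values_match, values_match, restored_match)
```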

# 4. Imputation

## What should we do about missing values?
Some researchers simply discard data samples where NaN values are present. This is problematic, because in relatively small datasets, this means getting rid of a large portion of the data.

The alternative solution is to *impute* - or fill in - missing data points. However, correct imputation requires advanced statistical knowledge. Sometimes, the average of a given column is used to replace NaN values. Other times, values are copied from other rows which have similar entries in the non-missing columns (the K Nearest Neighbors algorithm, which you'll learn about in week 2).

During this project, we will use two ways of dealing with missing data:
* removing rows that contain NaN or missing values (see above)
* filling in the average value of the column (which makes a simplifying assumption)

Later, you'll learn a machine learning method for imputation:
* K Nearest Neighbors (KNN)

We built functions which will deal with missing data in the examples you'll be seeing during the next two weeks.

### 4.1. Filling values in each column
You can use the pandas `fillna` method:

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.fillna.html

Let's try imputing our missing values with the mean value of each feature:

In [None]:
# .copy() gives us an independent frame, so the edits below do not touch data_frame
df_means = data_frame[selected_features].copy()

Let's change all the negative values to NaN values (so that they don't get mistaken for numerical values):

In [None]:
df_means[df_means<0] = np.nan

Calculate the mean of df_means (hint: google the pandas mean function usage)

In [None]:
# data_mean = ...  (your code here)

##### TODO: remove this #####
data_mean = df_means.mean()
#############################
data_mean

In [None]:
df_means = df_means.fillna(data_mean)
df_means.head()
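
The three steps above (negative codes to NaN, column means, `fillna`) can be sketched end to end on a tiny made-up frame, which makes it easy to check the arithmetic by hand. The values are hypothetical, and `mask` is used here as one possible way to replace the negative codes.

```python
import numpy as np
import pandas as pd

# Made-up values; -9 stands in for a missing-data code
toy = pd.DataFrame({"t5e13": [90.0, -9.0, 30.0], "p5i2a": [1.0, 2.0, -9.0]})

# Step 1: negative codes become NaN (mask replaces values where the condition is True)
toy_nan = toy.mask(toy < 0)

# Step 2: column means; NaNs are skipped by default
column_means = toy_nan.mean()

# Step 3: each NaN is filled with the mean of its own column
toy_filled = toy_nan.fillna(column_means)
print(toy_filled)
```

Here the mean of `t5e13` is (90 + 30) / 2 = 60 and the mean of `p5i2a` is (1 + 2) / 2 = 1.5, and those are the values that replace the two missing entries.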

## Question 4.1 
What do you think about this method? What may be the possible advantages and disadvantages of imputing missing values with a single value? What may be the possible advantages and disadvantages of removing rows with missing values altogether?

## Answer:

# 5. Feature engineering

Feature engineering refers to creating new features by combining existing features. For example, you might have found 3 different binary features that you want to use. You could then combine these 3 features into one single feature by adding them all together. This way your new feature takes on values from 0 to 3 and it might be more strongly correlated to the outcome than any of the individual 3 features is.
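
A minimal sketch of this idea, assuming three hypothetical binary features coded as 0 or 1 (the column names and values are made up):

```python
import pandas as pd

# Three hypothetical binary features coded 0/1 (made-up values)
toy = pd.DataFrame({
    "feature_a": [1, 0, 1, 0],
    "feature_b": [1, 1, 0, 0],
    "feature_c": [1, 0, 0, 0],
})

# Add the three features row by row to get a single score from 0 to 3
toy["combined"] = toy[["feature_a", "feature_b", "feature_c"]].sum(axis=1)
print(toy["combined"].tolist())
```

Summing treats all three features as equally important; other weightings are possible if you have a reason to prefer them.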

A similar approach has been used in studies using the Fragile Families data before. You can check out the paper below to learn more:
https://dx.doi.org/10.1007/s10995-018-2521-2

The paper, "Father Early Engagement Behaviors and Infant Low Birth Weight" by Lee et al., combined the following three features into a single "father involvement" variable.

In [None]:
ft1 = "f1b16" # During BM preg, did you give her money or buy things for the baby/ies?
ft2 = "f1b17" # Did you help in other ways, like providing transportation / doing chores?
ft3 = "f1a2" # Were you present at the birth?
response = "cm1lbw" # Constructed - Low Birth Weight?

selected = [ft1, ft2, ft3, response]

df_father = pick_ff_variables(data_frame, selected, remove_nans=True, remove_negatives=True)
df_father.head()

Observe (in the metadata website) that all three features are binary (1 or 2), so each of them can be weighted by 1/3 to generate a new feature in the range from 1 to 2.

Generate a new variable by adding the feature values and weighting them all equally.

In [None]:
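df_father = df_father.copy()  # explicit copy, so that adding a column does not trigger a SettingWithCopyWarning
df_father["father_engagement"] = (df_father.f1b16 + df_father.f1b17 + df_father.f1a2)/3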
df_father.head()

## Problem 5.1
Find four variables which you hypothesize may predict variable **f4b4b9**: Child (Year 5) is nervous, high strung, or tense?

Create a new feature which captures the combination of the four features you found.

In [None]:
ft1 = ""
ft2 = ""
ft3 = ""
ft4 = ""