# Lab 6: Feature Generation


In this lab we'll get some hands-on experience with generating features.


## Goals for this lab

- Understand the types of features that are usually used in ML projects
- Practice generating those features
- Explore the effect of different types of features on model performance



In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier,AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import roc_curve, auc
import graphviz # If you don't have this, install via pip/conda
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.linear_model import LogisticRegression
%matplotlib inline

# exercise: what additional modules should you import?

## Data
We'll use the DonorsChoose data that we used in Assignment 3.

In [5]:
# Change this to wherever you're storing your data
datafile = "../data/projects_2012_2013.csv"
df = pd.read_csv(datafile, parse_dates=['date_posted', 'datefullyfunded'])

In [6]:
df.head()

Unnamed: 0,projectid,teacher_acctid,schoolid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_metro,school_district,...,secondary_focus_subject,secondary_focus_area,resource_type,poverty_level,grade_level,total_price_including_optional_support,students_reached,eligible_double_your_impact_match,date_posted,datefullyfunded
0,00001ccc0e81598c4bd86bacb94d7acb,96963218e74e10c3764a5cfb153e6fea,9f3f9f2c2da7edda5648ccd10554ed8c,170993000000.0,41.807654,-87.673257,Chicago,IL,urban,Pershing Elem Network,...,Visual Arts,Music & The Arts,Supplies,highest poverty,Grades PreK-2,1498.61,31.0,f,2013-04-14,2013-05-02
1,0000fa3aa8f6649abab23615b546016d,2a578595fe351e7fce057e048c409b18,3432ed3d4466fac2f2ead83ab354e333,64098010000.0,34.296596,-119.296596,Ventura,CA,urban,Ventura Unif School District,...,Literature & Writing,Literacy & Language,Books,highest poverty,Grades 3-5,282.47,28.0,t,2012-04-07,2012-04-18
2,000134f07d4b30140d63262c871748ff,26bd60377bdbffb53a644a16c5308e82,dc8dcb501c3b2bb0b10e9c6ee2cd8afd,62271000000.0,34.078625,-118.257834,Los Angeles,CA,urban,Los Angeles Unif Sch Dist,...,Social Sciences,History & Civics,Technology,high poverty,Grades 3-5,1012.38,56.0,f,2012-01-30,2012-04-15
3,0001f2d0b3827bba67cdbeaa248b832d,15d900805d9d716c051c671827109f45,8bea7e8c6e4279fca6276128db89292e,360009000000.0,40.687286,-73.988217,Brooklyn,NY,urban,New York City Dept Of Ed,...,,,Books,high poverty,Grades PreK-2,175.33,23.0,f,2012-10-11,2012-12-05
4,0004536db996ba697ca72c9e058bfe69,400f8b82bb0143f6a40b217a517fe311,fbdefab6fe41e12c55886c610c110753,360687000000.0,40.793018,-73.205635,Central Islip,NY,suburban,Central Islip Union Free SD,...,Literature & Writing,Literacy & Language,Technology,high poverty,Grades PreK-2,3591.11,150.0,f,2013-01-08,2013-03-25


## 1. Create label/outcome
Same as in the homework: predict whether a project on DonorsChoose will not get fully funded within 60 days of posting.

In [39]:
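# Time between posting and full funding (NaT if a project has no funding date)
df['duration'] = df.datefullyfunded - df.date_posted
# label = 1 if the project was NOT fully funded within 60 days of posting.
# Comparisons with NaT are False, so rows with no funding date also get label 1.
df['label'] = np.where(df['duration'] <= pd.Timedelta('60 days'), 0, 1)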

In [41]:
df.head()

Unnamed: 0,projectid,teacher_acctid,schoolid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_metro,school_district,...,resource_type,poverty_level,grade_level,total_price_including_optional_support,students_reached,eligible_double_your_impact_match,date_posted,datefullyfunded,duration,label
0,00001ccc0e81598c4bd86bacb94d7acb,96963218e74e10c3764a5cfb153e6fea,9f3f9f2c2da7edda5648ccd10554ed8c,170993000000.0,41.807654,-87.673257,Chicago,IL,urban,Pershing Elem Network,...,Supplies,highest poverty,Grades PreK-2,1498.61,31.0,f,2013-04-14,2013-05-02,18 days,0
1,0000fa3aa8f6649abab23615b546016d,2a578595fe351e7fce057e048c409b18,3432ed3d4466fac2f2ead83ab354e333,64098010000.0,34.296596,-119.296596,Ventura,CA,urban,Ventura Unif School District,...,Books,highest poverty,Grades 3-5,282.47,28.0,t,2012-04-07,2012-04-18,11 days,0
2,000134f07d4b30140d63262c871748ff,26bd60377bdbffb53a644a16c5308e82,dc8dcb501c3b2bb0b10e9c6ee2cd8afd,62271000000.0,34.078625,-118.257834,Los Angeles,CA,urban,Los Angeles Unif Sch Dist,...,Technology,high poverty,Grades 3-5,1012.38,56.0,f,2012-01-30,2012-04-15,76 days,1
3,0001f2d0b3827bba67cdbeaa248b832d,15d900805d9d716c051c671827109f45,8bea7e8c6e4279fca6276128db89292e,360009000000.0,40.687286,-73.988217,Brooklyn,NY,urban,New York City Dept Of Ed,...,Books,high poverty,Grades PreK-2,175.33,23.0,f,2012-10-11,2012-12-05,55 days,0
4,0004536db996ba697ca72c9e058bfe69,400f8b82bb0143f6a40b217a517fe311,fbdefab6fe41e12c55886c610c110753,360687000000.0,40.793018,-73.205635,Central Islip,NY,suburban,Central Islip Union Free SD,...,Technology,high poverty,Grades PreK-2,3591.11,150.0,f,2013-01-08,2013-03-25,76 days,1


In [52]:
df.date_posted[0]

Timestamp('2013-04-14 00:00:00')

## 2. Feature Generation
We'll do this in a few iterations and run some models between iterations to see how performance changes.
 - Models: Let's run a few simple models: logistic regression (L2) and random forests (n_estimators = 1000)
 - Training and Test Sets: For now, create one six-month test set and use the data before it as the training set (same as in the homework)
 - Metrics: Try AUC-ROC, and precision at 10% and 20%
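
Precision at 10% or 20% is not a built-in sklearn metric, so here is a minimal sketch of one way to compute it. The helper name `precision_at_k` and the toy arrays are assumptions made for illustration; it ranks rows by predicted score and measures the share of true positives in the top k fraction.

```python
import numpy as np

def precision_at_k(y_true, y_score, k=0.10):
    """Precision among the top k fraction of rows, ranked by score (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Number of rows flagged as positive: the top k fraction, at least one row
    n_top = max(1, int(round(k * len(y_score))))
    # Stable sort on negated scores so ties keep their original row order
    top_idx = np.argsort(-y_score, kind="stable")[:n_top]
    return y_true[top_idx].mean()

# Toy labels and scores, already ordered from highest to lowest score
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

print(precision_at_k(y_true, y_score, k=0.10))
print(precision_at_k(y_true, y_score, k=0.20))
```

How ties in the scores are broken affects the result, so it is worth fixing one convention and using it for every model you compare.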
 
Feature Generation iterations:

The main thing to remember here is that every feature is generated as of the project's "date_posted" and can only use information available up to that date.

1. Select columns that already exist in the raw data and prep them to run with sklearn models. This should be very similar to what you did in Assignment 3. You'll create dummy variables from categorical variables.

2. Could discretizing some of the variables help? Try discretizing "total_price_including_optional_support" and "students_reached".

3. Aggregation:
 - Let's try simple aggregations such as the number and percentage (2 different features) of projects that got fully funded in the last x days, for several values of x (say 10, 30, 60)
 - You can extend the previous features to spatial aggregations by restricting them to the same city/state/school as the project you are generating features for
 - You can use the latitude and longitude to generate the same features for projects within some distance y
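
For the distance-based variant, one possible approach is the haversine (great-circle) distance between two latitude/longitude points. The sketch below is an illustration, not the lab's reference solution: the function names `haversine_km` and `count_funded_nearby`, the default values of x and y, and the toy rows are all assumptions. The column names match the DonorsChoose data.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def count_funded_nearby(df, posted, lat, lon, x_days=30, y_km=50.0):
    """Projects fully funded in the x_days before `posted` and within y_km of (lat, lon)."""
    since = posted - df["datefullyfunded"]
    # Strictly before the posting date, so no future information is used
    recent = (since > pd.Timedelta(0)) & (since <= pd.Timedelta(days=x_days))
    dist = haversine_km(lat, lon, df["school_latitude"], df["school_longitude"])
    return int((recent & (dist <= y_km)).sum())

# Toy rows: two schools in Chicago, one in Brooklyn
toy = pd.DataFrame({
    "school_latitude": [41.807654, 40.687286, 41.9],
    "school_longitude": [-87.673257, -73.988217, -87.7],
    "datefullyfunded": pd.to_datetime(["2013-05-02", "2013-05-01", "2013-01-01"]),
})
n_nearby = count_funded_nearby(toy, pd.Timestamp("2013-05-10"), 41.807654, -87.673257)
print(n_nearby)
```

Calling this once per project is quadratic in the number of rows, so on the full dataset you would probably want to pre-filter by state or date first.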
 


In [34]:
# feature generation code
str_columns = [column for column in df.columns if (df[column].dtype=='O') and (len(df[column].unique())<=51)]
float_columns = ['total_price_including_optional_support', 'students_reached']

In [35]:
print(str_columns)
print(float_columns)

['school_state', 'school_metro', 'school_charter', 'school_magnet', 'teacher_prefix', 'primary_focus_subject', 'primary_focus_area', 'secondary_focus_subject', 'secondary_focus_area', 'resource_type', 'poverty_level', 'grade_level', 'eligible_double_your_impact_match']
['total_price_including_optional_support', 'students_reached']


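The code above builds two lists of columns: str_columns for dummy variables and float_columns for discretized variables. str_columns is selected automatically by dtype, which is less error-prone than typing column names by hand, and it is further restricted to columns with at most 51 distinct values so that high-cardinality columns such as IDs and city names are left out. float_columns simply lists the two numeric columns we want to discretize.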

In [36]:
# Generate dummy variables from str_columns
features = pd.get_dummies(df[str_columns], dummy_na=True, columns=str_columns, drop_first=True)

In [37]:
# Generate discretized variables from float_columns (5 equal-width bins)
for column in float_columns:
    features[column] = pd.cut(df[column], bins=5, labels=['low', 'medium low', 'medium', 'medium high', 'high'])
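
With equal-width bins, a right-skewed column such as a price can end up almost entirely in the lowest bin. A possible alternative, sketched below on synthetic data, is to use quantile bins whose edges are learned from the training rows only and then reused on the test rows. The synthetic `prices` series and the 800/200 split are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a price column (not the real data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=1000))

labels = ['low', 'medium low', 'medium', 'medium high', 'high']

# Equal-width bins, as in the cell above
equal_width = pd.cut(prices, bins=5, labels=labels)

# Quantile bins: learn the edges on the training rows only
train, test = prices.iloc[:800], prices.iloc[800:]
_, edges = pd.qcut(train, q=5, retbins=True)
# Open-ended outer edges so unseen test values still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf
train_binned = pd.cut(train, bins=edges, labels=labels)
test_binned = pd.cut(test, bins=edges, labels=labels)

print(equal_width.value_counts())
print(train_binned.value_counts())
```

Either way, the binned column is categorical, so it still needs to be converted to dummy variables before it can be passed to an sklearn model.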

Now we generate an aggregation feature. We will use the number of projects that got fully funded in the 10 days before posting as an example.

In [54]:
date_posted_list = pd.to_datetime(df.date_posted.unique())
num_projects_funded_dict = {}
# use a dictionary to store, for each posting date, the number of projects fully funded in the 10 days before it
for date_posted in date_posted_list:
    since = date_posted - df.datefullyfunded
    num_projects_funded_dict[date_posted.strftime("%Y%m%d")] = np.sum((since>pd.Timedelta('0 days')) & (since<=pd.Timedelta('10 days')))


In [57]:
# create an array holding the aggregation feature for each project by looking up its posting date
aggr_list = df.date_posted.dt.strftime("%Y%m%d").map(num_projects_funded_dict).to_numpy(dtype=float)

Append the newly created aggregation feature to the feature dataframe:

In [59]:
features['num_projects_funded_within10day'] = aggr_list
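
The loop over unique posting dates handles one window at a time. As a rough sketch of how the same count could be computed for several windows at once, the function below sorts the funding dates and uses `np.searchsorted` to count how many fall in the interval [date_posted - x days, date_posted). The function name `funded_counts` and the toy frame are assumptions; the output column names follow the pattern used above.

```python
import numpy as np
import pandas as pd

def funded_counts(date_posted, date_funded, windows=(10, 30, 60)):
    """Count projects fully funded in the x days before each posting date (hypothetical helper)."""
    funded_sorted = np.sort(date_funded.dropna().to_numpy())
    posted = date_posted.to_numpy()
    out = pd.DataFrame(index=date_posted.index)
    # Number of funding dates strictly before each posting date
    upper = np.searchsorted(funded_sorted, posted, side="left")
    for x in windows:
        # Number of funding dates strictly before the start of the window
        start = posted - np.timedelta64(x, "D")
        lower = np.searchsorted(funded_sorted, start, side="left")
        out[f"num_projects_funded_within{x}day"] = upper - lower
    return out

toy = pd.DataFrame({
    "date_posted": pd.to_datetime(["2013-01-01", "2013-01-10", "2013-01-20", "2013-03-01"]),
    "datefullyfunded": pd.to_datetime(["2013-01-15", "2013-01-20", "2013-02-25", None]),
})
counts = funded_counts(toy["date_posted"], toy["datefullyfunded"])
print(counts)
```

Because only funding dates strictly before the posting date are counted, the feature respects the rule that it may only use information available at posting time.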

In [61]:
features.head()

Unnamed: 0,school_state_AL,school_state_AR,school_state_AZ,school_state_CA,school_state_CO,school_state_CT,school_state_DC,school_state_DE,school_state_FL,school_state_GA,...,poverty_level_nan,grade_level_Grades 6-8,grade_level_Grades 9-12,grade_level_Grades PreK-2,grade_level_nan,eligible_double_your_impact_match_t,eligible_double_your_impact_match_nan,total_price_including_optional_support,students_reached,num_projects_funded_within10day
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,low,low,1281.0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,low,low,1435.0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,low,low,568.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,low,low,1973.0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,low,low,1688.0


## 3. Let's test models now

In [None]:
raw_features = []
discretized_features = []
simple_aggregate_features = []
spatial_aggregate_features = []

# now select which one(s) you want to test models with
selected_feature_groups  = []


### Create one (temporal) train and test split 


In [45]:
# split train and test data based on a date threshold in mid 2013
split_threshold = pd.Timestamp(2013,6,30)
train_filter = (df.date_posted <= split_threshold)
test_filter = (df.date_posted > split_threshold)
train_x, train_y = features[train_filter], df.label[train_filter]
test_x, test_y = features[test_filter], df.label[test_filter]

### Imputation
Impute features that may be missing (separately on the train and test sets to avoid leakage). Each feature may be missing for a different reason, so fill each one appropriately, and generate missing flags as separate variables when necessary (remember what we talked about in class about this).

In [None]:
# code
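
As one possible starting point, the sketch below fills a numeric column with the median computed on the training rows and adds a missing-value flag. This is only one reading of the instruction above: it assumes that avoiding leakage means the fill value must never be computed from test rows. The helper name `impute_with_train_median` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_train_median(train, test, column):
    """Fill NaNs in `column` with the training median and add a 0/1 missing flag."""
    train, test = train.copy(), test.copy()
    # The fill value comes from the training rows only (assumption, see lead-in)
    fill_value = train[column].median()
    for part in (train, test):
        part[column + "_missing"] = part[column].isna().astype(int)
        part[column] = part[column].fillna(fill_value)
    return train, test

# Toy frames standing in for the train and test splits
toy_train = pd.DataFrame({"students_reached": [10.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"students_reached": [np.nan, 100.0]})

imp_train, imp_test = impute_with_train_median(toy_train, toy_test, "students_reached")
print(imp_train)
print(imp_test)
```

Whether the median is the right fill value depends on why the feature is missing, so treat this as a template to adapt per column.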

### Train and Test models
- Build model(s) using the selected feature groups
- Test model(s)
- Evaluate

You should do this for different subsets of the feature groups above to get an idea of the performance impact of each.

In [None]:
# code
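
The following is a minimal sketch of the build, test, and evaluate loop, run on synthetic data so that it is self-contained. The feature matrix, the split, and the `results` dictionary are assumptions; in the lab you would pass in the train and test sets built from your selected feature groups. The forest uses 100 trees here only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data: 5 numeric features, label driven by the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# Earlier rows for training, later rows for testing (mimics a temporal split)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

models = {
    # LogisticRegression applies an L2 penalty by default
    "logistic regression (L2)": LogisticRegression(C=1.0, max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class, used to rank the test rows
    scores = model.predict_proba(X_test)[:, 1]
    results[name] = roc_auc_score(y_test, scores)
    print(name, results[name])
```

Running the same loop once per subset of feature groups and collecting `results` in a table makes the comparison between feature sets easier to read.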

### Add more features
Can you think of other features (especially aggregate ones) that will be helpful?
  - Average amount for fully funded projects in the last x days within y distance (or in the same geographical area)?
  - Difference between what this project is asking for and the feature above?
  - ...
  
Now create a new feature group and see how well the models do with the additional feature(s).
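
To make the first two ideas in the list above concrete, here is a hedged sketch that uses the same state as the geographical area. The function name `avg_funded_price_same_state`, the 30-day default, and the toy rows are assumptions. The row-by-row loop is written for readability and would be slow on the full dataset.

```python
import pandas as pd

PRICE = "total_price_including_optional_support"

def avg_funded_price_same_state(df, x_days=30):
    """Mean price of same-state projects fully funded in the x_days before each posting date."""
    out = pd.Series(float("nan"), index=df.index)
    for idx, row in df.iterrows():
        since = row["date_posted"] - df["datefullyfunded"]
        mask = (
            (df["school_state"] == row["school_state"])
            & (since > pd.Timedelta(0))
            & (since <= pd.Timedelta(days=x_days))
        )
        # Leave NaN when no project qualifies; impute it later like any other feature
        if mask.any():
            out.loc[idx] = df.loc[mask, PRICE].mean()
    return out

toy = pd.DataFrame({
    "school_state": ["IL", "IL", "CA", "IL"],
    "date_posted": pd.to_datetime(["2013-01-01", "2013-01-05", "2013-01-05", "2013-02-01"]),
    "datefullyfunded": pd.to_datetime(["2013-01-10", "2013-01-20", "2013-01-12", "2013-03-01"]),
    PRICE: [100.0, 300.0, 500.0, 400.0],
})

avg_price = avg_funded_price_same_state(toy, x_days=30)
# Second idea: how far this project's ask is from the recent local average
price_diff = toy[PRICE] - avg_price
print(avg_price)
print(price_diff)
```

Rows with no recently funded project in the same state come out as NaN, which is a case the imputation step needs to handle.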