## Feature Engineering
In this notebook, we engineer the features that the Exploratory Data Analysis notebook flagged as needing transformation.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")



#### Loading Data

In [2]:
df = pd.read_csv('train.csv')

df.head()

Unnamed: 0,project_id,name,desc,goal,keywords,disable_communication,country,currency,deadline,state_changed_at,created_at,launched_at,backers_count,final_status
0,kkst1451568084,drawing for dollars,I like drawing pictures. and then i color them...,20.0,drawing-for-dollars,False,US,USD,1241333999,1241334017,1240600507,1240602723,3,1
1,kkst1474482071,Sponsor Dereck Blackburn (Lostwars) Artist in ...,"I, Dereck Blackburn will be taking upon an inc...",300.0,sponsor-dereck-blackburn-lostwars-artist-in-re...,False,US,USD,1242429000,1242432018,1240960224,1240975592,2,0
2,kkst183622197,Mr. Squiggles,So I saw darkpony's successfully funded drawin...,30.0,mr-squiggles,False,US,USD,1243027560,1243027818,1242163613,1242164398,0,0
3,kkst597742710,Help me write my second novel.,Do your part to help out starving artists and ...,500.0,help-me-write-my-second-novel,False,US,USD,1243555740,1243556121,1240963795,1240966730,18,1
4,kkst1913131122,Support casting my sculpture in bronze,"I'm nearing completion on a sculpture, current...",2000.0,support-casting-my-sculpture-in-bronze,False,US,USD,1243769880,1243770317,1241177914,1241180541,1,0


In [3]:
df.shape

(108129, 14)

#### Dropping Missing & Unneeded Columns
As seen in the data analysis, only a few rows have missing project names or descriptions, so we can safely remove them from our set.

In [4]:
df = df.dropna().reset_index(drop=True)
df.shape

(108119, 14)
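Before dropping rows, it can help to confirm which columns actually hold the missing values. The sketch below uses a small made-up frame (`toy` is a stand-in, not the Kickstarter data) to show the same `dropna().reset_index(drop=True)` pattern along with a per-column count of missing entries.

```python
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "desc": ["x", "y", None, "w"],
    "goal": [10.0, 20.0, 30.0, 40.0],
})

# Count missing values per column to see where the gaps are.
missing_per_column = toy.isnull().sum()

# Drop every row with at least one missing value, then renumber the index.
clean = toy.dropna().reset_index(drop=True)
rows_removed = len(toy) - len(clean)

print(missing_per_column)
print("rows removed:", rows_removed)
```

In the notebook itself, the two shapes printed above differ by 10 rows, which is the number of rows removed by `dropna()`.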

We also saw that the 'project_id' and 'name' columns are not useful, and that the date columns can be dropped because they will not be available at prediction time.

In [5]:
df.drop(['project_id', 'name', 'deadline', 'state_changed_at', 'created_at', 'launched_at'], axis=1, inplace=True)

df.head()

Unnamed: 0,desc,goal,keywords,disable_communication,country,currency,backers_count,final_status
0,I like drawing pictures. and then i color them...,20.0,drawing-for-dollars,False,US,USD,3,1
1,"I, Dereck Blackburn will be taking upon an inc...",300.0,sponsor-dereck-blackburn-lostwars-artist-in-re...,False,US,USD,2,0
2,So I saw darkpony's successfully funded drawin...,30.0,mr-squiggles,False,US,USD,0,0
3,Do your part to help out starving artists and ...,500.0,help-me-write-my-second-novel,False,US,USD,18,1
4,"I'm nearing completion on a sculpture, current...",2000.0,support-casting-my-sculpture-in-bronze,False,US,USD,1,0


#### Splitting Data

In [6]:
from sklearn.model_selection import train_test_split
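X = df  # the target column 'final_status' stays in X; later cells use it and it is saved with the features
y = df.final_status
# First split: 70% train, 30% held out
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=200, test_size=0.3)
# Second split of the held-out 30%: roughly 20% of all rows for validation, 10% for test
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, random_state=200, test_size=0.33)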

In [7]:
X_train.shape

(75683, 8)

In [8]:
X_val.shape

(21732, 8)
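As a quick sanity check on the two-stage split, the shares of each set can be worked out from the shapes printed above. This is only arithmetic on the reported numbers; the test-set size is derived here, since `X_test.shape` is not printed in the notebook.

```python
# Shapes reported by the cells above
n_total = 108119   # rows left after dropna()
n_train = 75683    # X_train.shape[0]
n_val = 21732      # X_val.shape[0]

# X_test.shape is not printed in the notebook, so its size is derived here.
n_test = n_total - n_train - n_val

# Share of the full dataset that ends up in each split
shares = {
    "train": round(n_train / n_total, 3),
    "val": round(n_val / n_total, 3),
    "test": round(n_test / n_total, 3),
}
print(n_test, shares)
```

The validation share comes from 0.3 × 0.67 ≈ 0.20 and the test share from 0.3 × 0.33 ≈ 0.10.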

#### Categoricals - Replacing Rare Labels

In [9]:
CATEGORICALS = [var for var in X_train.columns if X_train[var].nunique()<20 and X_train[var].dtype=='O']

CATEGORICALS

['country', 'currency']

In [10]:
def get_frequent_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)['final_status'].count()/len(df)
    return tmp[tmp>rare_perc].index

for var in CATEGORICALS:
    frequent_labels = get_frequent_labels(X_train, var, 0.01)
    
    X_train[var] = np.where(X_train[var].isin(frequent_labels), X_train[var], 'Rare')
    X_val[var] = np.where(X_val[var].isin(frequent_labels), X_val[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_labels), X_test[var], 'Rare')
    
X_train.head()

Unnamed: 0,desc,goal,keywords,disable_communication,country,currency,backers_count,final_status
103955,Creating a top notch music video for time wont...,15000.0,time-wont-heal,False,CA,CAD,2,0
40881,A Futuristic Cyberspace Utopia with free Bitco...,10000.0,project-babylon-20,False,GB,GBP,7,0
81146,An outdoor amphitheater for performing arts wi...,1575000.0,the-hill,False,US,USD,3,0
31778,Tilt your phone to victory steering Noah's Ark...,4000.0,noahs-ark-a-silly-stacking-matching-game-for-your,False,US,USD,17,0
56767,The game where you must pick the best of the w...,2500.0,famous-missions-a-card-game,False,US,USD,113,1


In [11]:
X_train.country.unique()

array(['CA', 'GB', 'US', 'Rare', 'AU'], dtype=object)

In [12]:
X_train.currency.unique()

array(['CAD', 'GBP', 'USD', 'Rare', 'AUD'], dtype=object)
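To make the rare-label step concrete, here is a minimal sketch on made-up data (`toy_train` and `toy_val` are hypothetical). The frequent labels are learned from the training split only, so a label that is rare in training, or never seen there at all, is mapped to 'Rare' in the other splits.

```python
import numpy as np
import pandas as pd

# Made-up training split: 'NZ' covers 0.5% of rows, below the 1% threshold.
toy_train = pd.DataFrame({"country": ["US"] * 195 + ["GB"] * 4 + ["NZ"]})
# Made-up validation split: 'DE' never appears in training.
toy_val = pd.DataFrame({"country": ["US", "NZ", "DE"]})

# Share of rows per label, learned from the training split only.
label_shares = toy_train["country"].value_counts(normalize=True)
frequent = label_shares[label_shares > 0.01].index

# Any label outside the frequent set becomes 'Rare'.
toy_val["country"] = np.where(
    toy_val["country"].isin(frequent), toy_val["country"], "Rare"
)
print(toy_val["country"].tolist())
```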

#### Categoricals - Encoding

In [225]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)
ohe.fit(X_train[CATEGORICALS])

def add_onehot(df):
    # One-hot encode CATEGORICALS with the fitted encoder and append the new columns,
    # then drop the original categorical columns and both 'Rare' indicator columns
    onehot = pd.DataFrame(ohe.transform(df[CATEGORICALS]),
                          columns=np.concatenate(ohe.categories_).ravel())
    return pd.concat([df.reset_index(drop=True), onehot], axis=1).drop(CATEGORICALS + ['Rare'], axis=1)

X_train = add_onehot(X_train)
X_val = add_onehot(X_val)
X_test = add_onehot(X_test)

X_train.head()

Unnamed: 0,desc,goal,keywords,disable_communication,backers_count,final_status,AU,CA,GB,US,AUD,CAD,GBP,USD
0,Creating a top notch music video for time wont...,15000.0,time-wont-heal,False,2,0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1,A Futuristic Cyberspace Utopia with free Bitco...,10000.0,project-babylon-20,False,7,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,An outdoor amphitheater for performing arts wi...,1575000.0,the-hill,False,3,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,Tilt your phone to victory steering Noah's Ark...,4000.0,noahs-ark-a-silly-stacking-matching-game-for-your,False,17,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,The game where you must pick the best of the w...,2500.0,famous-missions-a-card-game,False,113,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
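The cell above names each one-hot column after its raw category, so the 'Rare' column from `country` and the 'Rare' column from `currency` share a name. A possible variation, sketched on made-up data, prefixes each category with its source column. The prefix scheme and `handle_unknown="ignore"` are assumptions for illustration, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; 'Rare' appears in both columns, as in the notebook.
toy = pd.DataFrame({
    "country": ["US", "GB", "Rare"],
    "currency": ["USD", "GBP", "Rare"],
})
cols = ["country", "currency"]

# handle_unknown="ignore" encodes unseen labels as all zeros instead of raising.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(toy[cols]).toarray()  # convert the sparse output to dense

# Prefix every category with its source column so the names stay unique.
names = [f"{col}_{cat}" for col, cats in zip(cols, enc.categories_) for cat in cats]
onehot = pd.DataFrame(encoded, columns=names)
print(onehot)
```

Calling `.toarray()` on the default sparse output avoids relying on the encoder's dense-output argument, whose name differs between scikit-learn versions.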


We can also encode the 'disable_communication' column as 1 if True and 0 if False.

In [226]:
X_train.disable_communication = X_train.disable_communication.apply(lambda x: 1 if x is True else 0)
X_val.disable_communication = X_val.disable_communication.apply(lambda x: 1 if x is True else 0)
X_test.disable_communication = X_test.disable_communication.apply(lambda x: 1 if x is True else 0)

X_train.head()

Unnamed: 0,desc,goal,keywords,disable_communication,backers_count,final_status,AU,CA,GB,US,AUD,CAD,GBP,USD
0,Creating a top notch music video for time wont...,15000.0,time-wont-heal,0,2,0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1,A Futuristic Cyberspace Utopia with free Bitco...,10000.0,project-babylon-20,0,7,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,An outdoor amphitheater for performing arts wi...,1575000.0,the-hill,0,3,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,Tilt your phone to victory steering Noah's Ark...,4000.0,noahs-ark-a-silly-stacking-matching-game-for-your,0,17,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,The game where you must pick the best of the w...,2500.0,famous-missions-a-card-game,0,113,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
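A vectorized alternative to the `apply` call is to cast the column directly. This is a minimal sketch on a made-up frame, assuming the column is boolean with no missing values (rows with missing values were dropped earlier).

```python
import pandas as pd

# Made-up boolean column standing in for 'disable_communication'
toy = pd.DataFrame({"disable_communication": [True, False, False]})

# Casting a boolean column to int maps True to 1 and False to 0.
toy["disable_communication"] = toy["disable_communication"].astype(int)
print(toy["disable_communication"].tolist())
```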


#### Numericals - Outlier Removal
In the data analysis phase, we saw that the numerical variables have some extreme outliers that need to be handled carefully.

In [227]:
NUM_VARS = [var for var in X_train.columns if X_train[var].dtype!='O' and X_train[var].nunique()>20]
NUM_VARS

['goal', 'backers_count']

For the sake of simplicity, we will only keep projects whose numerical values fall within the central 95% of the training distribution (that is, between the 2.5th and 97.5th percentiles). The others will be dismissed, as there is too little data to predict those extreme points reliably.

In [228]:
X_train.shape

(75683, 14)

In [229]:
def filter_extremes(df, var, percentiles):
    df = df.copy()
    df = df.loc[df[var].between(percentiles[0], percentiles[1]), :]
    df.reset_index(drop=True, inplace=True)
    return df

for var in NUM_VARS:
    (p1, p2) = np.percentile(X_train[var], (2.5, 97.5))
    
    X_train = filter_extremes(X_train, var, (p1, p2))
    X_val = filter_extremes(X_val, var, (p1, p2))
    X_test = filter_extremes(X_test, var, (p1, p2))
    
X_train.shape

(70374, 14)
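The percentile filter can be illustrated on a made-up column (`toy_train` and `toy_val` are hypothetical). As in the cell above, the bounds are computed on the training split and then reused for the other splits, so validation rows outside the training range are removed as well.

```python
import numpy as np
import pandas as pd

# Made-up training values 1..100 and a few validation values
toy_train = pd.DataFrame({"goal": np.arange(1, 101, dtype=float)})
toy_val = pd.DataFrame({"goal": [0.5, 50.0, 250.0]})

# Bounds come from the training split only.
low, high = np.percentile(toy_train["goal"], (2.5, 97.5))

# Keep rows inside [low, high] and renumber the index.
train_kept = toy_train[toy_train["goal"].between(low, high)].reset_index(drop=True)
val_kept = toy_val[toy_val["goal"].between(low, high)].reset_index(drop=True)

print(low, high, len(train_kept), val_kept["goal"].tolist())
```

Because `final_status` is still a column of each frame, the target is filtered together with the features. The separate `y_train`, `y_val` and `y_test` series created at the split are not filtered by this step.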

#### Strings - Engineering
As seen in the data analysis phase, we will merge 'keywords' and 'desc' into a single string column and transform the text using a TF-IDF word representation.

In [230]:
STR_VARS = [var for var in X_train.columns if X_train[var].dtype=='O']
STR_VARS

['desc', 'keywords']

In [231]:
X_train['kw_desc'] = (X_train['keywords'] + ' ' + X_train['desc']).str.replace('-', ' ')
X_val['kw_desc'] = (X_val['keywords'] + ' ' + X_val['desc']).str.replace('-', ' ')
X_test['kw_desc'] = (X_test['keywords'] + ' ' + X_test['desc']).str.replace('-', ' ')

X_train.drop(STR_VARS, axis=1, inplace=True)
X_val.drop(STR_VARS, axis=1, inplace=True)
X_test.drop(STR_VARS, axis=1, inplace=True)

X_train.head()

Unnamed: 0,goal,disable_communication,backers_count,final_status,AU,CA,GB,US,AUD,CAD,GBP,USD,kw_desc
0,15000.0,0,2,0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,time wont heal Creating a top notch music vide...
1,10000.0,0,7,0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,project babylon 20 A Futuristic Cyberspace Uto...
2,4000.0,0,17,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,noahs ark a silly stacking matching game for y...
3,2500.0,0,113,1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,famous missions a card game The game where you...
4,8000.0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,momentum ldb and the tyrant producing album w ...


In [232]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

tfidf = TfidfVectorizer(stop_words=stopwords.words('english'), ngram_range=(1,2), max_df=0.7, max_features=1000)

tfidf.fit(X_train['kw_desc'])
feat_names = tfidf.get_feature_names()

X_train = pd.concat([X_train, pd.DataFrame(tfidf.transform(X_train['kw_desc']).todense().tolist(), columns=feat_names)],
                    axis=1).drop(['kw_desc'], axis=1)
X_val = pd.concat([X_val, pd.DataFrame(tfidf.transform(X_val['kw_desc']).todense().tolist(), columns=feat_names)], 
                  axis=1).drop(['kw_desc'], axis=1)
X_test = pd.concat([X_test, pd.DataFrame(tfidf.transform(X_test['kw_desc']).todense().tolist(), columns=feat_names)], 
                   axis=1).drop(['kw_desc'], axis=1)
X_train.head()

Unnamed: 0,goal,disable_communication,backers_count,final_status,AU,CA,GB,US,AUD,CAD,...,year,year old,years,yet,york,young,youth,youtube,zombie,zombies
0,15000.0,0,2,0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10000.0,0,7,0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4000.0,0,17,0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2500.0,0,113,1,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8000.0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
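The TF-IDF step can be sketched on a few made-up sentences. The vocabulary is learned from the training text only, and terms unseen in training are ignored at transform time. The `tfidf_` prefix is a hypothetical addition, not part of the notebook: it keeps a term such as "goal", if it makes it into the vocabulary, from duplicating the name of the existing numeric `goal` column.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts standing in for the merged 'kw_desc' column
train_text = pd.Series([
    "card game for friends",
    "music video album goal",
    "card game expansion goal",
])
val_text = pd.Series(["new card game with a funding goal"])

# Unigrams and bigrams, capped at the 5 most frequent terms
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5)
vec.fit(train_text)

# vocabulary_ maps each term to its column index; sorting by index gives the column order.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Hypothetical prefix so text features cannot clash with existing column names
tfidf_cols = pd.DataFrame(
    vec.transform(val_text).toarray(),
    columns=["tfidf_" + term for term in terms],
)

base = pd.DataFrame({"goal": [500.0]})  # made-up numeric feature
combined = pd.concat([base, tfidf_cols], axis=1)
print(combined.columns.tolist())
```

Reading the terms from `vocabulary_` sidesteps the feature-name accessor, whose name differs between scikit-learn versions.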


Now that our engineered dataset is all numeric, we apply min-max scaling to the features before persisting the data, which makes it suitable for feature selection in the later steps.

In [233]:
from sklearn.preprocessing import MinMaxScaler

SCALED_VARS = [var for var in X_train.columns if var!='final_status']

scaler = MinMaxScaler()

X_train[SCALED_VARS] = scaler.fit_transform(X_train[SCALED_VARS])
X_val[SCALED_VARS] = scaler.transform(X_val[SCALED_VARS])
X_test[SCALED_VARS] = scaler.transform(X_test[SCALED_VARS])

X_train.head()

Unnamed: 0,goal,disable_communication,backers_count,final_status,AU,CA,GB,US,AUD,CAD,...,year,year old,years,yet,york,young,youth,youtube,zombie,zombies
0,0.0,0.0,0.002695,0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.009434,0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.022911,0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.152291,1,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
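Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the training split. A minimal sketch on made-up numbers shows that the training values land in [0, 1], while validation values beyond the training maximum can exceed 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for one numeric feature
toy_train = pd.DataFrame({"backers_count": [0.0, 10.0, 40.0]})
toy_val = pd.DataFrame({"backers_count": [20.0, 60.0]})

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns min=0 and max=40 from train
val_scaled = scaler.transform(toy_val)          # reuses the training min and max

print(train_scaled.ravel(), val_scaled.ravel())
```

Fitting on the training split alone keeps information from the validation and test sets out of the scaling parameters.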


In [234]:
X_train.to_csv('Xtrain.csv')
X_val.to_csv('Xval.csv')
X_test.to_csv('Xtest.csv')
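`to_csv` writes the index as an unnamed first column by default, so a later notebook needs to account for it when loading these files. The sketch below uses a made-up frame and a temporary, hypothetical file name to show the difference between a plain `read_csv` and one with `index_col=0`.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for one of the persisted sets
toy = pd.DataFrame({"goal": [0.1, 0.9], "final_status": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Xtoy.csv")  # hypothetical file name
    toy.to_csv(path)  # the index is written as an unnamed first column

    naive = pd.read_csv(path)                  # index returns as an 'Unnamed: 0' column
    restored = pd.read_csv(path, index_col=0)  # first column is used as the index again

print(naive.columns.tolist())
print(restored.columns.tolist())
```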

#### Conclusion
In this notebook, we engineered the features in our train, validation and test sets. Now that the sets are persisted, we are ready to move to the next step, where we select the few most important features, and finally to model building.