Baseline Model

<b>Introduction</b><br>
In this course, you will learn a practical approach to feature engineering. You'll be able to apply what you learn to Kaggle competitions and other machine learning applications.



*Load the data* <br>
We'll work with data from Kickstarter projects. The first few rows of the data look like this:

In [3]:
import pandas as pd
ks = pd.read_csv('C:/Users/Admin/Documents/Datasets/ks-projects-201801-extra.csv', parse_dates=['deadline', 'launched'])
ks.head(6)

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,...,n_polysyllable_words,flesch_kincaid_grade_level,flesch_reading_ease,smog_index,gunning_fog_index,coleman_liau_index,automated_readability_index,lix,gulpease_index,wiener_sachtextformel
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,...,1,5.24,66.4,8.841846,10.0,7.680995,4.62,45.0,99.0,7.057
1,1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,...,0,0.72,97.025,3.1291,1.6,3.996687,2.35375,29.0,117.75,0.5838
2,2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,...,0,-2.62,119.19,3.1291,1.2,-4.103777,-2.66,3.0,152.333333,-3.6434
3,3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,...,1,10.74,30.53,8.841846,8.514286,16.091526,11.002857,49.857143,70.428571,7.216829
4,4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,...,3,9.655,40.09,13.023867,18.2,17.249855,12.0075,58.0,64.0,12.1601
5,5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.0,...,0,1.313333,90.99,3.1291,1.2,9.615875,8.33,69.666667,129.0,6.093267


When importing, we pass parse_dates so that pandas reads the listed columns (here deadline and launched) as datetimes instead of plain strings.
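To see what parse_dates changes, here is a small sketch on made-up data. The two rows and the `csv_text` variable are invented for illustration; only the column names mirror the Kickstarter file.

```python
import io
import pandas as pd

# Two made-up rows that mimic the `deadline` and `launched` columns.
csv_text = (
    "name,deadline,launched\n"
    "Project A,2015-10-09,2015-08-11 12:12:28\n"
    "Project B,2017-11-01,2017-09-02 04:43:57\n"
)

# Without parse_dates, the date columns are read as text.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, they become datetime64 columns.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["deadline", "launched"])

print(raw.dtypes)
print(parsed.dtypes)
```

Only the parsed version supports the .dt accessor (for example `parsed.launched.dt.hour`), which is used later to build time features.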

The state column shows the outcome of the project.

In [4]:
print('Unique values in `state` column:', list(ks.state.unique()))

Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']


In [5]:
ks['state'].unique()

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

In [6]:
ks.state

0           failed
1           failed
2           failed
3           failed
4         canceled
            ...   
378656    canceled
378657      failed
378658      failed
378659      failed
378660      failed
Name: state, Length: 378661, dtype: object

In [7]:
ks.category

0                  Poetry
1          Narrative Film
2          Narrative Film
3                   Music
4            Film & Video
               ...       
378656        Documentary
378657     Narrative Film
378658     Narrative Film
378659         Technology
378660    Performance Art
Name: category, Length: 378661, dtype: object

In [8]:
ks['category']

0                  Poetry
1          Narrative Film
2          Narrative Film
3                   Music
4            Film & Video
               ...       
378656        Documentary
378657     Narrative Film
378658     Narrative Film
378659         Technology
378660    Performance Art
Name: category, Length: 378661, dtype: object

ks.category and ks['category'] are the same thing. Both return the column as a pandas Series, which stores its values in an array.

Arrays and lists are both used in Python to store data, but they don't serve exactly the same purposes. Both can store numbers, strings, and other data types, and both can be indexed and iterated through, but the similarities don't go much further. The main difference between a list and an array is the operations you can perform on them. For example, dividing an array by 3 divides each element by 3 and returns a new array. Dividing a list by 3 is not supported, so Python raises a TypeError.
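A minimal sketch of that difference, using a made-up list of three numbers. A pandas Series (the type of a DataFrame column) behaves like the array here.

```python
import numpy as np
import pandas as pd

values = [3, 6, 9]

# A NumPy array divides element by element.
array_result = np.array(values) / 3

# A pandas Series behaves the same way.
series_result = pd.Series(values) / 3

# A plain list does not support division and raises TypeError.
try:
    values / 3
    list_error = None
except TypeError as exc:
    list_error = exc

print(array_result)
print(series_result.tolist())
print(type(list_error).__name__)
```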

Using this data, how can we use features such as project category, currency, funding goal, and country to predict if a Kickstarter project will succeed?

<b>Prepare the target column</b>
First we'll convert the state column into a target we can use in a model. Data cleaning isn't the current focus, so we'll simplify this example by:

- Dropping projects that are "live"
- Counting "successful" states as outcome = 1
- Combining every other state as outcome = 0

In [9]:
# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
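To check the mapping, the same two steps can be run on a tiny made-up frame with one row per state. The `toy` frame is only for illustration.

```python
import pandas as pd

# One made-up project per state value seen in the data.
toy = pd.DataFrame(
    {"state": ["failed", "canceled", "successful", "live", "undefined", "suspended"]}
)

# Drop live projects, then mark only "successful" as 1.
toy = toy.query('state != "live"')
toy = toy.assign(outcome=(toy["state"] == "successful").astype(int))

print(toy)
```

Note that query keeps the original index labels, so the row that held "live" leaves a gap in the index rather than renumbering the rows.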

In [10]:
ks.head()

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,...,flesch_kincaid_grade_level,flesch_reading_ease,smog_index,gunning_fog_index,coleman_liau_index,automated_readability_index,lix,gulpease_index,wiener_sachtextformel,outcome
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,...,5.24,66.4,8.841846,10.0,7.680995,4.62,45.0,99.0,7.057,0
1,1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,...,0.72,97.025,3.1291,1.6,3.996687,2.35375,29.0,117.75,0.5838,0
2,2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,...,-2.62,119.19,3.1291,1.2,-4.103777,-2.66,3.0,152.333333,-3.6434,0
3,3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,...,10.74,30.53,8.841846,8.514286,16.091526,11.002857,49.857143,70.428571,7.216829,0
4,4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,...,9.655,40.09,13.023867,18.2,17.249855,12.0075,58.0,64.0,12.1601,0


In [11]:
ks.state

0           failed
1           failed
2           failed
3           failed
4         canceled
            ...   
378656    canceled
378657      failed
378658      failed
378659      failed
378660      failed
Name: state, Length: 375862, dtype: object

## Convert timestamps

We use the .dt accessor (.dt.hour, .dt.day, .dt.month, .dt.year) to extract parts of the launched timestamp as separate features.

In [12]:
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
ks.head()

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,...,coleman_liau_index,automated_readability_index,lix,gulpease_index,wiener_sachtextformel,outcome,hour,day,month,year
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,...,7.680995,4.62,45.0,99.0,7.057,0,12,11,8,2015
1,1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,...,3.996687,2.35375,29.0,117.75,0.5838,0,4,2,9,2017
2,2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,...,-4.103777,-2.66,3.0,152.333333,-3.6434,0,0,12,1,2013
3,3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,...,16.091526,11.002857,49.857143,70.428571,7.216829,0,3,17,3,2012
4,4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,...,17.249855,12.0075,58.0,64.0,12.1601,0,8,4,7,2015


<b>Prep categorical variables</b><br>
Now for the categorical variables -- category, currency, and country -- we'll need to convert them into integers so our model can use the data. For this we'll use scikit-learn's LabelEncoder. This assigns an integer to each value of the categorical feature.

In [13]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
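A small sketch of what the encoder does, on made-up values. LabelEncoder sorts the distinct values of a column and numbers them from 0, so the integers carry no meaning beyond identity.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the column names mirror the Kickstarter data.
toy = pd.DataFrame({
    "category": ["Poetry", "Music", "Music", "Poetry"],
    "currency": ["USD", "GBP", "USD", "EUR"],
})

encoder = LabelEncoder()

# fit_transform runs once per column, so the encoder is refit each time.
toy_encoded = toy.apply(encoder.fit_transform)
print(toy_encoded)

# After apply, the encoder only remembers the classes of the last column.
print(list(encoder.classes_))
```

Because the single encoder is refit for every column, it cannot map the integers of earlier columns back to their labels afterwards. If that is needed, one option is to keep a separate encoder per column.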

In [14]:
# ks and encoded share the same index, so they can be joined directly
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
data.head()

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country
0,1000.0,12,11,8,2015,0,108,5,9
1,30000.0,4,2,9,2017,0,93,13,22
2,45000.0,0,12,1,2013,0,93,13,22
3,5000.0,3,17,3,2012,0,90,13,22
4,19500.0,8,4,7,2015,0,55,13,22


Take note of how this was done:
1. Split the launch timestamp into hour, day, month, and year features.
2. Created a list of the categorical features and label-encoded them.
3. Joined the time features, the selected numerical variables, and the encoded categorical variables.

#### Create training, validation, and test splits
We need to create data sets for training, validation, and testing. We'll use a fairly simple approach and split the data using slices. We'll use 10% of the data as a validation set, 10% for testing, and the other 80% for training.

In [15]:
valid_fraction = 0.1   # fraction of the data used for validation
valid_size = int(len(data) * valid_fraction)   # number of rows in the validation set

train = data[:-2 * valid_size]   # from the beginning up to the last 20% of rows
valid = data[-2 * valid_size:-valid_size]   # the next 10% of rows
test = data[-valid_size:]   # the final 10% of rows
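The slicing is easier to follow on a toy frame. This sketch uses 20 made-up rows, so each 10% slice holds exactly 2 rows.

```python
import pandas as pd

# 20 made-up rows numbered 0..19.
toy = pd.DataFrame({"row": range(20)})

valid_fraction = 0.1
valid_size = int(len(toy) * valid_fraction)

toy_train = toy[:-2 * valid_size]
toy_valid = toy[-2 * valid_size:-valid_size]
toy_test = toy[-valid_size:]

print(len(toy_train), len(toy_valid), len(toy_test))
print(toy_valid["row"].tolist(), toy_test["row"].tolist())
```

One caveat of this approach: with fewer than 10 rows, valid_size becomes 0, and slices such as `toy[:-0]` return an empty frame, so it only makes sense on reasonably large data.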

<b>Train a model</b><br>
For this course we'll be using a LightGBM model. This is a tree-based gradient boosting model that often performs very well on tabular data, frequently on par with or better than XGBoost. It's also relatively fast to train.

We won't do hyperparameter optimization because that isn't the goal of this course, so our models won't reach the best performance possible. But you'll still see model performance improve as we do feature engineering.

In [16]:
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
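# Stop when the validation AUC has not improved for 10 rounds.
# Older LightGBM releases took early_stopping_rounds= and verbose_eval= here;
# LightGBM 4.0 removed those arguments in favor of callbacks.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)])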

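Note: in my environment this cell failed with `ModuleNotFoundError: No module named 'lightgbm'`. The package has to be installed first (for example with `pip install lightgbm`) before the cell will run.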

In [None]:
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")
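For a feel of what the AUC score measures, here is a sketch on four made-up labels and scores. With a binary objective, bst.predict returns probabilities, and roc_auc_score accepts such scores directly without thresholding them first.

```python
from sklearn import metrics

# Made-up true labels and predicted probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs ranked in the right order.
# Here 3 of the 4 pairs are ordered correctly (0.35 vs 0.4 is not).
toy_auc = metrics.roc_auc_score(toy_labels, toy_scores)

print(f"Toy AUC score: {toy_auc}")
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.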