# Marketing Analytics - Shopee Code League 2020

### Introduction
This is the last challenge, challenge number 8, for the Shopee Code League 2020! Reviewing the problem set & tasks from [Shopee Code League 2020 - Marketing Analytics](https://www.kaggle.com/c/student-shopee-code-league-marketing-analytics/overview), we are to build a model to predict if users, with provided data we have on them, would open their newsletters from us. 

#### Problem Set
The aim of this project is to build a model that can predict whether a user opens the emails sent by Shopee.

Sending emails is one of the marketing channels Shopee uses to reach out to our users. Being able to predict whether a user opens an email allows Shopee to forecast and evaluate the performance of future marketing campaigns before launch. This is because when a user opens an email, the probability of the user knowing the campaign increases and this in turn increases the probability of the user making a checkout during the campaign period. Therefore, with the predicted open rates, Shopee can better develop, strategize and implement future marketing campaigns.

#### Task
We provide you with data related to marketing emails (Electronic Direct Mail) that were sent to Shopee users over a certain period. It contains information about:
- User-specific information
- Email nature
- Users’ engagement on the platform
- User’s reaction to the email, including whether users opened the email
- Based on the data provided, you must predict whether each user will open an email sent to him/her.

#### Team Introduction
Team Name: **JNNY** <br/>
Team Members: **Natalie, James, Yong Xian, Nicky** <br/>
Script Prepared by: **Nicky** [@ahjimomo](https://github.com/ahjimomo)

## 1. Data Preparation & Wrangling

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import re

from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### 1.1. Importing Data

In [2]:
os.listdir("input/")

['sample_submission_0_1.csv', 'test.csv', 'train.csv', 'users.csv']

In [3]:
train_raw = pd.read_csv("input/train.csv")
test_raw = pd.read_csv("input/test.csv")
user_raw = pd.read_csv("input/users.csv")

In [4]:
# Check count of columns imported
print(len(train_raw)) # 73539 rows imported
print(len(test_raw)) # 55970 rows imported
print(len(user_raw)) # 127886 rows imported

73539
55970
127886


In [5]:
# Review raw training data imported
train_raw.head()

Unnamed: 0,country_code,grass_date,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,login_count_last_30_days,login_count_last_60_days,checkout_count_last_10_days,checkout_count_last_30_days,checkout_count_last_60_days,open_flag,row_id
0,4,2019-07-16 00:00:00+08:00,43,44,19,6,18,0,2,4,12,43,99,0,5,10,0,0
1,4,2019-07-16 00:00:00+08:00,102,44,9,4,8,2,9,17,18,48,90,1,1,4,1,1
2,6,2019-07-16 00:00:00+08:00,177,49,14,5,5,0,4,12,24,69,119,5,19,27,0,2
3,1,2019-07-16 00:00:00+08:00,184,49,49,9,53,0,0,1,9,23,69,1,3,6,0,3
4,6,2019-07-16 00:00:00+08:00,221,49,227,6,221,0,0,0,2,5,5,0,0,0,0,4


In [6]:
test_raw.head()

Unnamed: 0,country_code,grass_date,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,login_count_last_30_days,login_count_last_60_days,checkout_count_last_10_days,checkout_count_last_30_days,checkout_count_last_60_days,row_id
0,6,2019-09-03 00:00:00+08:00,0,35,27,2,13,2,3,4,10,34,134,0,6,18,0
1,6,2019-09-03 00:00:00+08:00,130,35,7,5,383,1,1,1,5,5,5,0,0,0,1
2,5,2019-09-03 00:00:00+08:00,150,25,34,1,3,0,0,0,13,19,38,2,2,2,2
3,1,2019-09-03 00:00:00+08:00,181,36,63,5,5,0,0,0,43,110,173,2,5,5,3
4,5,2019-09-03 00:00:00+08:00,192,23,6,5,54,0,0,0,4,12,39,0,0,2,4


In [7]:
user_raw.head()

Unnamed: 0,user_id,attr_1,attr_2,attr_3,age,domain
0,0,,1.0,0.0,,@gmail.com
1,1,1.0,1.0,2.0,50.0,@gmail.com
2,2,,1.0,0.0,,other
3,3,,1.0,0.0,,@gmail.com
4,4,1.0,1.0,2.0,33.0,@gmail.com


## 2. Data Processing

Looking into the data, we will want to append both the training & testing data set together with the users for both the modeling & later for the testing & submission. This is also part of the ETL procedure where we will work on cleaning up the data to find as many outliers or anomalies as possible and prepare a cleaner dataset. 

**Basic Procedures we will execute:**
1. Merging of dataset & users set
2. Processing dates
3. Encoding categorical data for modeling
4. Removing `user_id` to avoid confusing predictions
5. Processing anomalies/outliers
6. Preparing the training data


### 2.1. Merging of dataset & users data

Reviewing the columns of our dataset, we can see that the `user_id` is the identification key across all df. We will now merge the data from the `user_raw` to create both the training & testing set respectively

In [8]:
train_set = train_raw.merge(user_raw, left_on = "user_id", right_on = "user_id")
test_set = test_raw.merge(user_raw, left_on = "user_id", right_on = "user_id")

In [9]:
print(len(train_set)) # 73539 rows
print(len(test_set)) # 55970 rows

73539
55970


In [10]:
train_set.head() # 23 columns

Unnamed: 0,country_code,grass_date,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,...,checkout_count_last_10_days,checkout_count_last_30_days,checkout_count_last_60_days,open_flag,row_id,attr_1,attr_2,attr_3,age,domain
0,4,2019-07-16 00:00:00+08:00,43,44,19,6,18,0,2,4,...,0,5,10,0,0,1.0,1.0,2.0,47.0,@gmail.com
1,4,2019-07-16 00:00:00+08:00,102,44,9,4,8,2,9,17,...,1,1,4,1,1,1.0,1.0,2.0,25.0,@hotmail.com
2,6,2019-07-16 00:00:00+08:00,177,49,14,5,5,0,4,12,...,5,19,27,0,2,,1.0,0.0,,@yahoo.com
3,1,2019-07-16 00:00:00+08:00,184,49,49,9,53,0,0,1,...,1,3,6,0,3,1.0,1.0,2.0,24.0,@yahoo.com
4,6,2019-07-16 00:00:00+08:00,221,49,227,6,221,0,0,0,...,0,0,0,0,4,,1.0,0.0,,@hotmail.com


In [11]:
test_set.head() # 22 columns -> As test_set does not have the open_flag attribute for testing

Unnamed: 0,country_code,grass_date,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,...,login_count_last_60_days,checkout_count_last_10_days,checkout_count_last_30_days,checkout_count_last_60_days,row_id,attr_1,attr_2,attr_3,age,domain
0,6,2019-09-03 00:00:00+08:00,0,35,27,2,13,2,3,4,...,134,0,6,18,0,,1.0,0.0,,@gmail.com
1,6,2019-09-03 00:00:00+08:00,130,35,7,5,383,1,1,1,...,5,0,0,0,1,,1.0,0.0,,@gmail.com
2,5,2019-09-03 00:00:00+08:00,150,25,34,1,3,0,0,0,...,38,2,2,2,2,1.0,1.0,2.0,33.0,@gmail.com
3,1,2019-09-03 00:00:00+08:00,181,36,63,5,5,0,0,0,...,173,2,5,5,3,1.0,1.0,2.0,22.0,@yahoo.com
4,5,2019-09-03 00:00:00+08:00,192,23,6,5,54,0,0,0,...,39,0,0,2,4,,1.0,0.0,,@gmail.com


### 2.2. Processing Dates
For a marketing team - the data and timing of the newsletter could potentially play a big factor on the open rate. Thanks to some forums online - we identify that we can utilize [fast.ai](https://dev.fast.ai/tabular.core#add_datepart) `add_datepart` function to help us break the `grass_date` column into more specific parts for our modeling

**Self-Note:** Look into how to integrate Holidays & special dates in different countries might create a even better training set

In [12]:
def add_datepart(df, fldname, drop = True):
    fld = df[fldname]
    
    # Updates outliers into a pd.datetime64 type if field type is not a datetime
    if not np.issubdtype(fld.dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format = True)
        
    tar_pre = re.sub("[Dd]ate$", "", fldname)
    
    for n in ("Year", "Month", "Week", "Day", "Dayofweek", "Dayofyear", \
             "Is_month_end", "Is_month_start", "Is_quarter_end", "Is_quarter_start", \
             "Is_year_end", "Is_year_start"):
        df[tar_pre+n] = getattr(fld.dt, n.lower())
        
    # After including breaking up the grass_date row, we will delete the data
    if drop:
        df.drop(fldname, axis = 1, inplace = True)
        
add_datepart(train_set, 'grass_date')
add_datepart(test_set, 'grass_date')

In [13]:
train_set.head() # Now we have 34 columns for the train set while grass_date has been dropped & divided up

Unnamed: 0,country_code,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,...,grass_Week,grass_Day,grass_Dayofweek,grass_Dayofyear,grass_Is_month_end,grass_Is_month_start,grass_Is_quarter_end,grass_Is_quarter_start,grass_Is_year_end,grass_Is_year_start
0,4,43,44,19,6,18,0,2,4,12,...,29,16,1,197,False,False,False,False,False,False
1,4,102,44,9,4,8,2,9,17,18,...,29,16,1,197,False,False,False,False,False,False
2,6,177,49,14,5,5,0,4,12,24,...,29,16,1,197,False,False,False,False,False,False
3,1,184,49,49,9,53,0,0,1,9,...,29,16,1,197,False,False,False,False,False,False
4,6,221,49,227,6,221,0,0,0,2,...,29,16,1,197,False,False,False,False,False,False


In [14]:
list(train_set.columns)

['country_code',
 'user_id',
 'subject_line_length',
 'last_open_day',
 'last_login_day',
 'last_checkout_day',
 'open_count_last_10_days',
 'open_count_last_30_days',
 'open_count_last_60_days',
 'login_count_last_10_days',
 'login_count_last_30_days',
 'login_count_last_60_days',
 'checkout_count_last_10_days',
 'checkout_count_last_30_days',
 'checkout_count_last_60_days',
 'open_flag',
 'row_id',
 'attr_1',
 'attr_2',
 'attr_3',
 'age',
 'domain',
 'grass_Year',
 'grass_Month',
 'grass_Week',
 'grass_Day',
 'grass_Dayofweek',
 'grass_Dayofyear',
 'grass_Is_month_end',
 'grass_Is_month_start',
 'grass_Is_quarter_end',
 'grass_Is_quarter_start',
 'grass_Is_year_end',
 'grass_Is_year_start']

In [15]:
test_set.head()

Unnamed: 0,country_code,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,...,grass_Week,grass_Day,grass_Dayofweek,grass_Dayofyear,grass_Is_month_end,grass_Is_month_start,grass_Is_quarter_end,grass_Is_quarter_start,grass_Is_year_end,grass_Is_year_start
0,6,0,35,27,2,13,2,3,4,10,...,36,3,1,246,False,False,False,False,False,False
1,6,130,35,7,5,383,1,1,1,5,...,36,3,1,246,False,False,False,False,False,False
2,5,150,25,34,1,3,0,0,0,13,...,36,3,1,246,False,False,False,False,False,False
3,1,181,36,63,5,5,0,0,0,43,...,36,3,1,246,False,False,False,False,False,False
4,5,192,23,6,5,54,0,0,0,4,...,36,3,1,246,False,False,False,False,False,False


Now we can remove columns that was added but was non-essential for our modeling, but we would first like to check if the years might play a difference if there our dataset contains more than 1-year of newsletters result

In [16]:
train_set.grass_Year.unique() # Only 2019 data is included, thus we can remove the grass_Year

array([2019], dtype=int64)

In [17]:
# Dropping non-essential columns

def remove_grasscol(df):
    fld = ["grass_Year", "grass_Is_quarter_end", "grass_Is_quarter_start", \
           "grass_Is_year_start", "grass_Is_year_end"]
    df.drop(fld, axis = 1, inplace = True)
    
remove_grasscol(train_set)
remove_grasscol(test_set)

In [18]:
train_set.head() # 29 columns -> Confirm that 5 columns has been removed from 34 columns

Unnamed: 0,country_code,user_id,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,...,attr_3,age,domain,grass_Month,grass_Week,grass_Day,grass_Dayofweek,grass_Dayofyear,grass_Is_month_end,grass_Is_month_start
0,4,43,44,19,6,18,0,2,4,12,...,2.0,47.0,@gmail.com,7,29,16,1,197,False,False
1,4,102,44,9,4,8,2,9,17,18,...,2.0,25.0,@hotmail.com,7,29,16,1,197,False,False
2,6,177,49,14,5,5,0,4,12,24,...,0.0,,@yahoo.com,7,29,16,1,197,False,False
3,1,184,49,49,9,53,0,0,1,9,...,2.0,24.0,@yahoo.com,7,29,16,1,197,False,False
4,6,221,49,227,6,221,0,0,0,2,...,0.0,,@hotmail.com,7,29,16,1,197,False,False


### 2.3. Encoding categorical data for modeling

As part of the requirement of scikit-learn, all the data has to be numerical. Thus, we will need to look into our dataset to identify any non-numeric types data and also all columns & determine if they are of importance else we can replace them with dummy variables. 

As part of reviewing the dataset, the column found is `domain` from the `user.csv` provided.

In [19]:
train_set = pd.concat([train_set.drop("domain", axis = 1), pd.get_dummies(train_set["domain"])], axis = 1)
test_set = pd.concat([test_set.drop("domain", axis = 1), pd.get_dummies(test_set["domain"])], axis = 1)

In [20]:
list(train_set.columns)

['country_code',
 'user_id',
 'subject_line_length',
 'last_open_day',
 'last_login_day',
 'last_checkout_day',
 'open_count_last_10_days',
 'open_count_last_30_days',
 'open_count_last_60_days',
 'login_count_last_10_days',
 'login_count_last_30_days',
 'login_count_last_60_days',
 'checkout_count_last_10_days',
 'checkout_count_last_30_days',
 'checkout_count_last_60_days',
 'open_flag',
 'row_id',
 'attr_1',
 'attr_2',
 'attr_3',
 'age',
 'grass_Month',
 'grass_Week',
 'grass_Day',
 'grass_Dayofweek',
 'grass_Dayofyear',
 'grass_Is_month_end',
 'grass_Is_month_start',
 '@163.com',
 '@gmail.com',
 '@hotmail.com',
 '@icloud.com',
 '@live.com',
 '@outlook.com',
 '@qq.com',
 '@rocketmail.com',
 '@yahoo.com',
 '@ymail.com',
 'other']

### 2.4. Removing user_id to avoid confusing predictions
As the `user_id` is no longer required for the training or submission, it could reduce the performance of our model. Thus, we will remove it.

In [21]:
train_set.drop("user_id", axis = 1, inplace = True)
test_set.drop("user_id", axis = 1, inplace = True)

In [22]:
train_set.head()

Unnamed: 0,country_code,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,login_count_last_30_days,...,@gmail.com,@hotmail.com,@icloud.com,@live.com,@outlook.com,@qq.com,@rocketmail.com,@yahoo.com,@ymail.com,other
0,4,44,19,6,18,0,2,4,12,43,...,1,0,0,0,0,0,0,0,0,0
1,4,44,9,4,8,2,9,17,18,48,...,0,1,0,0,0,0,0,0,0,0
2,6,49,14,5,5,0,4,12,24,69,...,0,0,0,0,0,0,0,1,0,0
3,1,49,49,9,53,0,0,1,9,23,...,0,0,0,0,0,0,0,1,0,0
4,6,49,227,6,221,0,0,0,2,5,...,0,1,0,0,0,0,0,0,0,0


### 2.5. Processing anomalies/outliers

When using the `SSIS` to look through the dataset provided, we were able to identify outliers that included text instead of numric values. Initially I thought it would've been set to 0 but realized that it does make sense that if they just act upon, there will be a day 0. Thus the text does make sense. 

**The columns with text include:**
1. last_open_day 
2. last_login_day
3. last_checkout_day

**The columns with missing data include:**
1. last_open_day
2. last_login_day
3. last_checkout_day
4. attr_1
5. attr_2
6. attr_3
7. age

In [23]:
# We will first replace the texts into nan
train_set.replace(["Never open", "Never login", "Never checkout"], np.nan, inplace = True)
test_set.replace(["Never open", "Never login", "Never checkout"], np.nan, inplace = True)

# We will then convert all the nan into numeric values for our model training
train_set[["last_open_day", "last_login_day", "last_checkout_day"]] = train_set[["last_open_day", \
                                                                                 "last_login_day", \
                                                                                 "last_checkout_day"]].apply(pd.to_numeric)
test_set[["last_open_day", "last_login_day", "last_checkout_day"]] = test_set[["last_open_day", \
                                                                                 "last_login_day", \
                                                                                 "last_checkout_day"]].apply(pd.to_numeric)

In [24]:
train_set.head()

Unnamed: 0,country_code,subject_line_length,last_open_day,last_login_day,last_checkout_day,open_count_last_10_days,open_count_last_30_days,open_count_last_60_days,login_count_last_10_days,login_count_last_30_days,...,@gmail.com,@hotmail.com,@icloud.com,@live.com,@outlook.com,@qq.com,@rocketmail.com,@yahoo.com,@ymail.com,other
0,4,44,19.0,6.0,18.0,0,2,4,12,43,...,1,0,0,0,0,0,0,0,0,0
1,4,44,9.0,4.0,8.0,2,9,17,18,48,...,0,1,0,0,0,0,0,0,0,0
2,6,49,14.0,5.0,5.0,0,4,12,24,69,...,0,0,0,0,0,0,0,1,0,0
3,1,49,49.0,9.0,53.0,0,0,1,9,23,...,0,0,0,0,0,0,0,1,0,0
4,6,49,227.0,6.0,221.0,0,0,0,2,5,...,0,1,0,0,0,0,0,0,0,0


While there are some missing values, it is quite concerning as they seem to be quite important.

Looking on the competition's discussion board, we were able to find suggested replacement values for the missing values. We will use them for now. 

**Self-Note:** To find/learn best practices on dealing with missing values/attributes

In [25]:
# Identify the different values we can use for the missing values

### attr_1 ####
print("attr_1 :\n", train_set["attr_1"].dtypes)
print(train_set.attr_1.unique()) # Looks like a boolean

### attr_2 ####
print("attr_2 :\n", train_set["attr_2"].dtypes)
print(train_set.attr_2.unique()) # Looks like a boolean

### attr_3 ####
print("attr_3 :\n", train_set["attr_3"].dtypes)
print(train_set.attr_3.unique()) # Looks like a integer of range 0 - 4

attr_1 :
 float64
[ 1. nan  0.]
attr_2 :
 float64
[ 1. nan  0.]
attr_3 :
 float64
[2. 0. 1. 4. 3.]


In [26]:
# Credit: Kaggle User - Bobby Muljono

# Fill missing values
train_set.fillna({"last_open_day": train_set['last_open_day'].max(),
                 "last_login_day": train_set['last_login_day'].max(),
                 "last_checkout_day": train_set['last_checkout_day'].max(),
                 "attr_1": 2,
                 "attr_2": 2,
                 "attr_3": 2,
                 "age": train_set['age'].median()}, inplace = True)

test_set.fillna({"last_open_day":test_set['last_open_day'].max(),
                 "last_login_day":test_set['last_login_day'].max(),
                 "last_checkout_day":test_set['last_checkout_day'].max(),
                 "attr_1": 2,
                 "attr_2": 2,
                 "attr_3": 2,
                 "age": test_set['age'].median()}, inplace = True)

### 2.6. Preparing the training data

We have prepared our data & now we can move into splitting the data into train & test for the model training. We also know that the results is in the ``open_flag`` column. 

The notebook provided by [Nathaniel Ng](https://www.kaggle.com/nathaniel) was helpful in helping to extract the features vs. the label sets for the model training.

We will use 20% of the total dataset for the `test_size`

In [27]:
def get_xy(df, target_col = 'open_flag', **kwargs):
    # All columns except 'open_flag' will be tagged as features
    feature_cols = [ col for col in df.columns if col != target_col ]
    
    # When processing training dataframe, extract features & labels
    if target_col in df:
        X = df[feature_cols]
        y = df[target_col]
        X_train, X_valid, y_train, y_valid = train_test_split(X, y)
        return X_train, X_valid, y_train, y_valid
    
    # When processing challenge's test dataframe, all columns will be features
    else:
        X_test = df[feature_cols]
        return X_test

In [28]:
X_train, X_valid, y_train, y_valid = get_xy(train_set, test_size = 0.2, random_state = 12345)
X_test = get_xy(test_set)

## 3. Training & Saving our Model!

Now that the data are ready, it's time for us to start training our model using `RandomForestClassifier` from `Scikit-Learn`

In [29]:
model = RandomForestClassifier(max_depth = 200, n_estimators = 300, n_jobs = -1, bootstrap = True, \
                               random_state = 12345, verbose = 2)

trained = model.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


building tree 1 of 300
building tree 2 of 300
building tree 3 of 300
building tree 4 of 300
building tree 5 of 300
building tree 6 of 300
building tree 7 of 300
building tree 8 of 300
building tree 9 of 300
building tree 10 of 300
building tree 11 of 300
building tree 12 of 300
building tree 13 of 300
building tree 14 of 300
building tree 15 of 300
building tree 16 of 300
building tree 17 of 300
building tree 18 of 300
building tree 19 of 300
building tree 20 of 300
building tree 21 of 300
building tree 22 of 300
building tree 23 of 300
building tree 24 of 300
building tree 25 of 300
building tree 26 of 300
building tree 27 of 300
building tree 28 of 300
building tree 29 of 300
building tree 30 of 300
building tree 31 of 300
building tree 32 of 300
building tree 33 of 300
building tree 34 of 300
building tree 35 of 300
building tree 36 of 300
building tree 37 of 300
building tree 38 of 300

[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    0.6s



building tree 39 of 300
building tree 40 of 300
building tree 41 of 300
building tree 42 of 300
building tree 43 of 300
building tree 44 of 300
building tree 45 of 300
building tree 46 of 300
building tree 47 of 300
building tree 48 of 300
building tree 49 of 300
building tree 50 of 300
building tree 51 of 300
building tree 52 of 300
building tree 53 of 300
building tree 54 of 300
building tree 55 of 300
building tree 56 of 300
building tree 57 of 300
building tree 58 of 300
building tree 59 of 300
building tree 60 of 300
building tree 61 of 300
building tree 62 of 300
building tree 63 of 300
building tree 64 of 300
building tree 65 of 300
building tree 66 of 300
building tree 67 of 300
building tree 68 of 300
building tree 69 of 300
building tree 70 of 300
building tree 71 of 300
building tree 72 of 300
building tree 73 of 300
building tree 74 of 300
building tree 75 of 300
building tree 76 of 300
building tree 77 of 300
building tree 78 of 300
building tree 79 of 300
building tree 8

[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:    3.5s



building tree 155 of 300
building tree 156 of 300
building tree 157 of 300
building tree 158 of 300
building tree 159 of 300
building tree 160 of 300
building tree 161 of 300
building tree 162 of 300
building tree 163 of 300
building tree 164 of 300
building tree 165 of 300
building tree 166 of 300
building tree 167 of 300
building tree 168 of 300
building tree 169 of 300
building tree 170 of 300
building tree 171 of 300
building tree 172 of 300
building tree 173 of 300
building tree 174 of 300
building tree 175 of 300
building tree 176 of 300
building tree 177 of 300
building tree 178 of 300
building tree 179 of 300
building tree 180 of 300
building tree 181 of 300
building tree 182 of 300
building tree 183 of 300
building tree 184 of 300
building tree 185 of 300
building tree 186 of 300
building tree 187 of 300
building tree 188 of 300
building tree 189 of 300
building tree 190 of 300
building tree 191 of 300
building tree 192 of 300
building tree 193 of 300
building tree 194 of 300

[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    7.4s finished


In [30]:
# Saving (Pickle!) the model
import pickle

filename = "rfc_ma_model_01.pkl"
with open(filename, 'wb') as file:
    pickle.dump(model, file)

## 4. Evaluating our Model

Now that our model is trained & pickled! Time to test it out ;)

In [31]:
predictions = trained.predict(X_valid)

matthews_corrcoef(y_valid, predictions)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    0.2s finished


0.5226621598820339

Decent accuracy score at 52.2% :D

## 5. Applying model to test for submission! 

In [32]:
# Run model on challene's test set to get prediction results
test_predict = trained.predict(X_test)

# Create new dataframe in submission format
subm_df = pd.DataFrame({"row_id": np.arange(len(test_set)), \
                       "open_flag": test_predict})

# Saving file for submission
subm_df.to_csv("submission.csv", index = False)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 146 tasks      | elapsed:    0.1s
[Parallel(n_jobs=8)]: Done 300 out of 300 | elapsed:    0.4s finished


In [33]:
subm_df.head()

Unnamed: 0,row_id,open_flag
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0


# Congratulations! We are done ;)

Reviewing the accuracy of our predictions, we are almost certain that the model can be even better with improved accuracy. It was great being able to learn about something new like this even though our scores are probably not fantastic this time round.