# Statistical Learning Model - Production


Import libraries

In [None]:
#!conda update scikit-learn

In [76]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt

import time

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

# `from sklearn.feature_extraction import TfidfVectorizer` raises an ImportError;
# use the import below instead
from sklearn.feature_extraction.text import TfidfVectorizer

from scipy.sparse import hstack, vstack                            # used to combine the bag of words built from the vectorized 'title' column

from sklearn.metrics import roc_auc_score, average_precision_score


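%matplotlib inline
# caution: %pylab runs `from numpy import *`, so names such as hstack and vstack imported above
# are replaced by the NumPy versions; re-import them from scipy.sparse before stacking sparse matrices
%pylab inline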

Populating the interactive namespace from numpy and matplotlib




In [None]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

## Load final raw data with labels

In [17]:
df = pd.read_csv('./raw_data/final_labels_dataset.csv', index_col=0)

print('Dataset shape before: ', df.shape)
print('Dataset columns: ', df.columns)

# drop duplicated data
df.drop_duplicates(inplace=True)

print('Dataset shape after: ', df.shape)

print('Check duplicates: ', df.duplicated().mean())
df.tail(2)

Dataset shape before:  (727, 5)
Dataset columns:  Index(['title', 'y', 'upload_date', 'view_count', 'new'], dtype='object')
Dataset shape after:  (701, 5)
Check duplicates:  0.0


Unnamed: 0,title,y,upload_date,view_count,new
1016,Data Scientists vs Data Engineers: Which one i...,1.0,20191219,183723,1.0
1468,#kaggle #DataScience Machine Learning MicroCou...,0.0,20190730,405,1.0


# 1. Data Cleanup

Build a NEW dataframe that holds only cleaned data, ready to fit the ML model.

In [18]:
# create a clean dataframe with the same index as the original dataframe (raw data)
df_clean = pd.DataFrame(index=df.index)

In [19]:
df_clean['date'] = pd.to_datetime(df['upload_date'], format='%Y%m%d')

# note: upload_date is stored as YYYYMMDD, so format='%Y%m%d' parses it straight into a date (YYYY-MM-DD) - easy!

In [20]:
df_clean['date']

# dtype: datetime64[ns] used by numpy and pandas

0      2021-05-05
1      2021-05-05
2      2021-05-05
3      2021-05-05
4      2021-05-05
          ...    
1041   2019-11-18
488    2018-11-25
528    2018-07-24
1016   2019-12-19
1468   2019-07-30
Name: date, Length: 701, dtype: datetime64[ns]
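
As a small illustration of the parsing step above, here is a minimal sketch on two hypothetical values in the same YYYYMMDD layout. The values are cast to strings first, which is an assumption made here so that the format string is matched character by character regardless of how the column was read.

```python
import pandas as pd

# Hypothetical sample in the same YYYYMMDD layout as the upload_date column
raw = pd.Series([20191219, 20190730], index=[1016, 1468], name="upload_date")

# Cast to string first so the format string is matched character by character
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d")

print(parsed.dtype)                               # a datetime64 dtype
print(parsed.dt.strftime("%Y-%m-%d").tolist())    # dates without a time component
```

The index of the input series is preserved, which is what keeps `df_clean` aligned with `df`.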

In [21]:
# views column: convert every NaN to 0 and cast to an integer data type
df_clean['views'] = df['view_count'].fillna(0).astype(np.int64)

In [22]:
# add the title column; it will be vectorized later and used by the model
df_clean['title'] = df['title']

In [23]:
df_clean.dtypes

date     datetime64[ns]
views             int64
title            object
dtype: object

## 2. Features & Labels

Create a separate features dataframe. This is JUST an extra step to make sure the features are ready.

**Reason**: Keep the features dataframe aligned with the cleaned data. The cleaning process can skip rows or columns, so the features reuse the index of `df_clean`.


In [25]:
# features: it's similar to df_clean, just an extra step
features = pd.DataFrame(index=df_clean.index)

# labels/targets
y = df['y'].copy()

In [26]:
print('Features shape: {}'.format(features.shape))
print('Labels shape: {}'.format(y.shape))

Features shape: (701, 0)
Labels shape: (701,)


## Important: sklearn can't use *date* as a feature.

Let's derive a numeric feature from the raw date: **views_per_day**.

Sklearn needs a number.

In [27]:
# time_since_pub: time since the video was published, measured from a fixed reference date.
# The reference date is arbitrary; the date this code was written is used: 2021-05-09

# np.timedelta64(1, 'D'): one day as a NumPy time delta; dividing by it gives the difference in days
# the dates have daily granularity, so differences smaller than one day do not occur
features['time_since_pub'] = (pd.to_datetime("2021-05-09") - df_clean['date']) / np.timedelta64(1, 'D')

# used features
features['views'] = df_clean['views']
features['views_per_day'] = features['views'] / features['time_since_pub']

features = features.drop(['time_since_pub'], axis=1)   # time_since_pub only used for the calculation

# time_since_pub is not kept as a feature: its values change a lot along the time series,
# so with a date-based split the training and validation datasets would not share the same distribution of values.
# That could give the feature an unbalanced weight; comparable samples are important to train and fit an ML model


In [28]:
features.tail()

Unnamed: 0,views,views_per_day
1041,31788,59.085502
488,65724,73.352679
528,34286,33.613725
1016,183723,362.372781
1468,405,0.624037
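
The `views_per_day` arithmetic can be checked by hand on the last two rows shown above. This is a minimal sketch that rebuilds only those two rows (dates, views and index values are copied from the outputs in this notebook); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Two rows copied from the outputs above (index, upload date, views)
sample = pd.DataFrame(
    {"date": pd.to_datetime(["2019-12-19", "2019-07-30"]), "views": [183723, 405]},
    index=[1016, 1468],
)

reference_date = pd.Timestamp("2021-05-09")  # fixed date used in this notebook

# Dividing a timedelta by a one-day timedelta yields a plain float number of days
days_since_pub = (reference_date - sample["date"]) / np.timedelta64(1, "D")
views_per_day = sample["views"] / days_since_pub

print(days_since_pub.tolist())            # [507.0, 649.0]
print(views_per_day.round(6).tolist())    # [362.372781, 0.624037]
```

Note that a video uploaded on the reference date itself would give a division by zero; that case does not occur in the rows shown here.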


In [29]:
features.describe()

Unnamed: 0,views,views_per_day
count,701.0,701.0
mean,152549.0,521.359455
std,2042103.0,4284.958273
min,0.0,0.0
25%,339.0,4.977941
50%,2892.0,37.956522
75%,19972.0,180.0
max,48724170.0,95913.728346


## 3. Data Preparation

Let's try to split the data 50/50 into training and validation datasets.

How do the 2 features **views** and **views_per_day** impact the ML model? 
Does a simple model with only 2 features change which YouTube videos are selected?

In [30]:
# check all data on df_clean
# pd.set_option('display.max_rows', 527)
# df_clean

median_date = df_clean['date'].quantile(0.5, interpolation="midpoint")
median_date

# previously the median date was 2021-03-12

Timestamp('2021-01-10 00:00:00')

# Increasing the validation set

**Note**: in prior models, without active learning samples, the dataset was split into training and validation sets using only the **median date**.

## Active Learning samples - usually are added into the training dataset

## However, we'll try to add them into the validation dataset

In [32]:
# split the features dataset - aiming for 50/50 using the median date
# a balanced split is important!!!

# the code below can also be used
# Xtrain, Xval = features[df_clean['date'] < '2021-03-12'], features[df_clean['date'] >= '2021-03-12']
# ytrain, yval = y[df_clean['date'] < '2021-03-12'], y[df_clean['date'] >= '2021-03-12']

# approach used here: boolean masks select the rows
mask_train = df_clean['date'] < '2021-01-10'
mask_val = df_clean['date'] >= '2021-01-10'

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]

Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

# datasets, training & validation, not huge. But,...

((349, 2), (352, 2), (349,), (352,))
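
The date-based split can be sketched on toy data. Everything below is hypothetical (six made-up videos); it only illustrates how a median cutoff and two complementary boolean masks divide features and labels consistently.

```python
import pandas as pd

# Hypothetical toy data: six videos with an upload date, one feature and a label
dates = pd.Series(pd.to_datetime(
    ["2020-01-01", "2020-06-01", "2020-12-01", "2021-02-01", "2021-03-01", "2021-05-01"]
))
X = pd.DataFrame({"views": [10, 20, 30, 40, 50, 60]})
y = pd.Series([0, 1, 0, 1, 0, 1])

# Midpoint interpolation averages the two middle dates when the count is even
cutoff = dates.quantile(0.5, interpolation="midpoint")

mask_train = dates < cutoff
mask_val = ~mask_train          # everything on or after the cutoff

X_train, X_val = X[mask_train], X[mask_val]
y_train, y_val = y[mask_train], y[mask_val]

print(cutoff, X_train.shape, X_val.shape)
```

Because both masks come from the same series, every row lands in exactly one of the two sets, and older videos always go to training.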

## Add the title feature

**Important**: transform the title strings into numbers.

Build a matrix in which each column holds the weight of one word (or word combination) from the title feature.

It is important to notice that common words like machine+learning will get a low weight.



In [92]:
title_train = df_clean[mask_train]['title']
title_val = df_clean[mask_val]['title']

# Vectorize the title feature
title_vec = TfidfVectorizer(min_df=2, ngram_range=(1,3))   # object defined
# min_df=2: a term must appear in at least 2 titles to get its own column
# ngram_range=(1,3): use single words and combinations of up to 3 consecutive words

# bow: bag of words
title_bow_train = title_vec.fit_transform(title_train)     # fit + transform: learn the vocabulary and weights from the training titles, then vectorize them
title_bow_val = title_vec.transform(title_val)             # validation set: ONLY transform. The vocabulary must NOT be learned from validation data

Without ngram_range=(1,3): Shape for title bag of words matrix:  (349, 289)
with ngram_range=(1,3):   Shape for title bag of words matrix:  (349, 719)
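
To make the effect of `min_df` and `ngram_range` concrete, here is a minimal sketch on three hypothetical titles (not taken from the dataset). It uses `ngram_range=(1, 2)` only to keep the vocabulary short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles, not taken from the dataset
train_titles = ["machine learning tutorial", "machine learning course", "deep learning"]
val_titles = ["kaggle machine learning"]

vec = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
bow_train = vec.fit_transform(train_titles)   # learns the vocabulary and IDF weights
bow_val = vec.transform(val_titles)           # reuses them; unseen terms are dropped

# Only terms found in at least two training titles keep a column
print(sorted(vec.vocabulary_))                # ['learning', 'machine', 'machine learning']
print(bow_train.shape, bow_val.shape)         # (3, 3) (1, 3)
```

Words such as "tutorial" or "kaggle" get no column: the first appears in a single training title, the second was never seen during fitting.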

In [65]:
# checking
print('Shape for title bag of words matrix: ', title_bow_train.shape)
title_bow_train

Shape for title bag of words matrix:  (349, 719)


<349x719 sparse matrix of type '<class 'numpy.float64'>'
	with 3968 stored elements in Compressed Sparse Row format>

TfidfVectorizer returns a sparse matrix. It is a memory-optimized SciPy matrix in which only the values NOT equal to zero are stored.

In [66]:
# fraction of ZERO elements, computed for 2286 stored (NOT ZERO) values in a 349x719 matrix
1 - 2286 /(349*719)   
# note: the 'title_bow_train' shown above stores 3968 NOT ZERO elements; with that count the result is about 0.984.
# Either way fewer than 2% of the elements are NOT ZERO, meaning the matrix is sparse both computationally and mathematically

0.9908899259158892
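
Rather than typing the counts by hand, the same fraction can be read from the matrix itself. A minimal sketch on a hypothetical 3x4 matrix, using the `nnz` attribute (number of stored values) of SciPy sparse matrices:

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x4 matrix with three non-zero entries
dense = np.array([[0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.7],
                  [0.2, 0.0, 0.0, 0.0]])
m = sparse.csr_matrix(dense)

n_rows, n_cols = m.shape
zero_fraction = 1 - m.nnz / (n_rows * n_cols)   # nnz = number of stored values

print(m.nnz, zero_fraction)   # 3 0.75
```

Applied to `title_bow_train`, the same expression would always use the counts of the matrix currently in memory.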

## IMPORTANT to note: 
Combining the dense matrices (Xtrain & Xval) with the sparse matrices (title_bow_train & title_bow_val)

Use scipy.sparse hstack and vstack

hstack and vstack stack matrices (vectors) horizontally and vertically.

Sample:

hstack - [1 2]    [3 4]  -> [1 2 3 4]

vstack [1 2]      [3 4]  -> [1 2]
                            [3 4]
                            
USE *scipy.sparse hstack and vstack*. The NumPy functions of the same name do not handle sparse matrices: they may take TOO LONG, or not compute at all!!
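
A minimal sketch of the horizontal stacking described above, on hypothetical data. The dataframe is converted with `to_numpy()` here as a cautious assumption; the notebook passes the dataframe directly.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical numeric features and a small sparse bag-of-words matrix
X_num = pd.DataFrame({"views": [100, 200], "views_per_day": [1.5, 2.5]})
bow = sparse.csr_matrix(np.array([[0.0, 0.3, 0.0], [0.9, 0.0, 0.0]]))

# Columns are appended side by side; the row counts must match
X_all = sparse.hstack([X_num.to_numpy(), bow])

print(X_all.shape)               # (2, 5)
print(sparse.issparse(X_all))    # True: the result stays sparse
```

The result has 2 + 3 columns and remains sparse, so the large bag-of-words part is never expanded into a dense array.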

In [67]:
# combining sparse matrix with original features
from scipy.sparse import hstack, vstack  

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

In [68]:
Xtrain_wtitle.shape, Xval_wtitle.shape

# 2 numerical features plus 719 columns from 'title'

((349, 721), (352, 721))

## Random Forest

In [69]:
# check the number of positive (1) samples in the training dataset
print('Positive samples - videos selected: {}'.format(ytrain.sum()))
print(' % of positive samples - videos selected: {}'.format(ytrain.mean() * 100))

# definitely unbalanced

Positive samples - videos selected: 107.0
 % of positive samples - videos selected: 30.659025787965614


The training dataset is not big!!

In [70]:
clf_rf = RandomForestClassifier(n_estimators=1000, random_state=0, min_samples_leaf=1, class_weight='balanced', n_jobs=6)    # defined object
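
To see what `class_weight='balanced'` does with this class ratio, here is a minimal sketch using scikit-learn's `compute_class_weight` helper. The label vector is rebuilt from the counts reported above (107 positives out of 349 rows); the row order is irrelevant for the weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts reported above: 107 positives out of 349 training rows
y_demo = np.array([1] * 107 + [0] * 242)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# 'balanced' uses n_samples / (n_classes * count_per_class)
print(dict(zip([0, 1], weights.round(3))))
```

Each positive sample ends up weighted roughly 2.3 times as much as a negative one (242 / 107), which compensates for the 30/70 split.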

## Fitting the model against the train dataset

NOW: 3 features - views, views per day, and title

In [71]:
clf_rf.fit(Xtrain_wtitle, ytrain)



RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=1000, n_jobs=6, oob_score=False,
                       random_state=0, verbose=0, warm_start=False)

In [52]:
print('ML model already trained/fitted and ready to be used!')

ML model already trained/fitted and ready to be used!


## Predicting whether a video will be selected

Target: probability of class 1

predict_proba: returns a NumPy array with the probability of class 0 and the probability of class 1

In [72]:
pred = clf_rf.predict_proba(Xval_wtitle)[:, 1]   # keep only the probability of class 1
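
The layout of the `predict_proba` output can be inspected on a toy problem. This is a minimal sketch with hypothetical data; it looks up the column of class 1 through `classes_` instead of assuming it is the second column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one feature that separates the two classes
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

proba = clf.predict_proba(np.array([[1.5], [10.5]]))
p_positive = proba[:, list(clf.classes_).index(1)]   # column that belongs to class 1

print(clf.classes_, proba.shape)   # one column per class, one row per sample
```

Each row of `proba` sums to 1, so keeping the class 1 column loses no information in the binary case.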



## Metrics - validating the model


 Before: Random Forest Model

average precision: 0.4566659359075169

Roc Auc: 0.5182456140350877

In [73]:
# average precision (area under the precision-recall curve) for the random forest
print('Random Forest baseline model - average precision')
average_precision_score(yval, pred)

Random Forest baseline model - average precision


0.4605227155328958

## IMPORTANT: any future model in PRD should score higher than the baseline model: **0.50**

In [74]:
# area under the ROC curve
print('Random Forest baseline model - roc auc')
roc_auc_score(yval, pred)

Random Forest baseline model - roc auc


0.5253960396039604
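
For reference, both metrics can be verified by hand on a tiny example. The labels and scores below are toy values, not model output.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and scores, small enough to check by hand
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

ap = average_precision_score(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(round(ap, 4), auc)   # 0.8333 0.75
```

ROC AUC is the share of (positive, negative) pairs ranked correctly: 3 of 4 here. A scorer that ranks at random sits near 0.5 for ROC AUC, while its average precision sits near the share of positive samples, so the two numbers have different chance levels.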

## Validating metrics before and after the active learning samples

Metrics after increasing the validation dataset:

roc auc: 0.47

average precision: 0.41

----------------------------------------------------------
Original decision metric - baseline model

roc auc: 0.49

average precision: 0.42

---------------------------------------------------------

The metrics seem to be within the normal variation range, meaning AP & ROC AUC are close enough between the baseline model and the new model with active learning samples.
Please note that the number of samples is not large, so the metrics might not change a lot...small steps definitely count!!!

---------------------------------------------------------

## Production Baseline Model

roc auc: 0.46

average precision: 0.52

**Details**: TfidfVectorizer(min_df=2, ngram_range=(1,3)) 

---------------------------------------------------------


## LightGBM Model

In [79]:
cl_lgbm = LGBMClassifier(random_state=0, class_weight='balanced', n_jobs=6)   # defined object

In [81]:
cl_lgbm.fit(Xtrain_wtitle, ytrain)

LGBMClassifier(boosting_type='gbdt', class_weight='balanced',
               colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
               max_depth=-1, min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=100, n_jobs=6, num_leaves=31,
               objective=None, random_state=0, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)

In [82]:
pred = cl_lgbm.predict_proba(Xval_wtitle)[:,1]



In [84]:
# average precision (area under the precision-recall curve)
print('LightGBM model - average precision: ', average_precision_score(yval, pred))

# area under the ROC curve
print('LightGBM model - roc auc: ', roc_auc_score(yval, pred))

LightGBM model - average precision:  0.4052786105139529
LightGBM model - roc auc:  0.48495049504950494


# Bayesian Optimization

In [102]:
title_train
title_val

title_vec = TfidfVectorizer(min_df=2)
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [109]:
# approach used here: boolean masks select the rows
mask_train = df_clean['date'] < '2021-01-10'
mask_val = df_clean['date'] >= '2021-01-10'

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]

title_train = df_clean[mask_train]['title']
title_val = df_clean[mask_val]['title']


In [110]:
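from skopt import forest_minimize

# re-import after %pylab inline: pylab's `from numpy import *` replaces these names
# with the NumPy versions, which cannot stack sparse matrices
from scipy.sparse import hstack, vstack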

In [111]:
def tune_lgbm(params):
    print(params)
    lr = params[0]
    max_depth = params[1]
    min_child_samples = params[2]
    subsample = params[3]
    colsample_bytree = params[4]
    n_estimators = params[5]
    
    min_df = params[6]
    ngram_range = (1, params[7])
    
    title_vec = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    title_bow_train = title_vec.fit_transform(title_train)
    title_bow_val = title_vec.transform(title_val)
    
    Xtrain_wtitle = hstack([Xtrain, title_bow_train])
    Xval_wtitle = hstack([Xval, title_bow_val])
    
    mdl = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth, 
                         min_child_samples=min_child_samples, subsample=subsample,
                         colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                         class_weight="balanced", n_jobs=6)
    mdl.fit(Xtrain_wtitle, ytrain)
    
    p = mdl.predict_proba(Xval_wtitle)[:, 1]
    
    print(roc_auc_score(yval, p))
    
    return -average_precision_score(yval, p)

space = [(1e-3, 1e-1, 'log-uniform'), # lr
          (1, 10), # max_depth
          (1, 20), # min_child_samples
          (0.05, 1.), # subsample
          (0.05, 1.), # colsample_bytree
          (100,1000), # n_estimators
          (1,5), # min_df
          (1,5)] # upper bound of ngram_range

res = forest_minimize(tune_lgbm, space, random_state=160745, n_random_starts=20, n_calls=50, verbose=1)


Iteration No: 1 started. Evaluating function at random point.
[0.009944912110647982, 5, 1, 0.4677107511929402, 0.49263223036174764, 272, 3, 1]


ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
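
The message above is the one NumPy's `hstack` produces when it is handed a sparse matrix. One plausible cause, stated here as an assumption: the import cell containing `%pylab inline` (execution count 76) was run again after the cell that imported `hstack` from `scipy.sparse` (count 67), so the name `hstack` pointed to the NumPy function when `tune_lgbm` was called. A minimal sketch that reproduces the difference on hypothetical data:

```python
import numpy as np
from scipy import sparse

dense = np.array([[1.0, 2.0], [3.0, 4.0]])
bow = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]))

try:
    np.hstack([dense, bow])          # NumPy treats the sparse matrix as a single object
    numpy_error = None
except ValueError as err:
    numpy_error = str(err)

stacked = sparse.hstack([dense, bow])   # sparse-aware version

print(numpy_error)
print(stacked.shape)   # (2, 5)
```

Calling the function through its module (`sparse.hstack`) or importing it again right before use avoids depending on which import ran last.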

# Overall
## Model 4: Random Forest with active learning samples
## Changing validation and training dataset


Active learning is impacting the model a bit, but not enough. The metrics stay within the same range.

| Model Number | Model | Number of Features | Features | Metric | Value | Remark |
|---|---|---|---|---|---|---|
| 1 | Baseline Decision Tree | 2 | views; views per day | Average Precision | 0.42 | |
| 1 | Baseline Decision Tree | 2 | views; views per day | ROC AUC | 0.49 | |
| 2 | Random Forest | 2 | views; views per day | Average Precision | 0.45 | |
| 2 | Random Forest | 2 | views; views per day | ROC AUC | 0.51 | |
| 3 | Random Forest | 3 | views; views per day; title | Average Precision | 0.41 | increase validation dataset with active learning samples |
| 3 | Random Forest | 3 | views; views per day; title | ROC AUC | 0.47 | increase validation dataset with active learning samples |
| 3 | Random Forest | 3 | views; views per day; title | Average Precision | 0.42 | increase training dataset with active learning samples |
| 3 | Random Forest | 3 | views; views per day; title | ROC AUC | 0.5 | increase training dataset with active learning samples |