# Statistical Learning Model with active learning data


Import libraries

In [1]:
#!conda update scikit-learn

In [2]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt

import time

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# from sklearn.feature_extraction import TfidfVectorizer ===> gets an error
# pls use the import below
from sklearn.feature_extraction.text import TfidfVectorizer 

from scipy.sparse import hstack, vstack                            # used to combine the bag of words (vectorized 'title' column) with the numeric features

from sklearn.metrics import roc_auc_score, average_precision_score


%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.21.3.


## Load raw data with labels

In [4]:
df1 = pd.read_csv('./raw_data/raw_data_with_labels.csv', index_col=0)

# select only data with labels
df1 = df1[df1['y'].notnull()]
print('Num of labelled data: ', df1.shape)

Num of labelled data:  (527, 4)


## Load raw data with active learning samples labelled

The active learning dataset contains an additional `p` column.

In [5]:
df2 = pd.read_csv('./raw_data/raw_data_active_learning_with_labels.csv', index_col=0)

# select only data with labels
df2 = df2[df2['y'].notnull()]
df2['new'] = 1
print('Num of labelled data - active learning: ', df2.shape)

Num of labelled data - active learning:  (200, 6)


In [6]:
df2.head()
df2.tail()

# right p values are from the random selection

Unnamed: 0,title,y,upload_date,view_count,p,new
1041,9 Ways You Can Make Extra Income as a Data Sci...,0,20191118,31788,0.754,1
488,Keras vs Tensorflow vs PyTorch | Deep Learning...,1,20181125,65724,0.701605,1
528,Predictive Maintenance & Monitoring using Mach...,0,20180724,34286,0.750433,1
1016,Data Scientists vs Data Engineers: Which one i...,1,20191219,183723,0.562605,1
1468,#kaggle #DataScience Machine Learning MicroCou...,0,20190730,405,0.233,1


In [7]:
# just checking the metrics for the active learning samples
# comparing against the Random Forest metrics: ap: 0.45; roc auc: 0.51
average_precision_score(df2['y'], df2['p']), roc_auc_score(df2['y'], df2['p'])

# is the model generalizing?!
# we can't tell yet whether the model is better, since this is the first time we evaluate the samples from the active learning method

active learning 
roc auc: 0.65 
average precision: 0.31    

1st random forest model
roc auc: 0.51
average precision: 0.45


(0.31629718316293826, 0.6513978494623656)

## Active learning details

The metrics for the active learning samples look reasonable, which suggests the new samples may help the model generalize.
ROC AUC is better on the active learning samples (0.65 vs 0.51), but average precision is worse (0.31 vs 0.45). This may indicate that the model is very sensitive to the number of samples.
Please note that the active learning set contains only 200 items.
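
To make the two metrics concrete, here is a minimal pure-Python sketch of how ROC AUC and average precision can be computed from labels and scores. The helper names `roc_auc` and `average_precision` and the toy data are illustrative assumptions; the notebook itself relies on `roc_auc_score` and `average_precision_score` from scikit-learn, which may handle tied scores differently from this sketch.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits


# Toy example (hypothetical data, not taken from the notebook)
y_toy = [1, 0, 0, 1, 0]
p_toy = [0.9, 0.8, 0.7, 0.4, 0.2]

auc_toy = roc_auc(y_toy, p_toy)
ap_toy = average_precision(y_toy, p_toy)
print(auc_toy, ap_toy)
```

The two numbers answer different questions: ROC AUC looks at every positive/negative pair, while average precision focuses on how early the positives appear in the ranking, so one can improve while the other gets worse.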


## Concatenating raw data with raw active learning data

In [9]:
df = pd.concat([df1, df2.drop('p', axis=1)])   # drop p column in df2

In [62]:
df.shape
df.tail()

Unnamed: 0,title,y,upload_date,view_count,new
1041,9 Ways You Can Make Extra Income as a Data Sci...,0.0,20191118,31788,1.0
488,Keras vs Tensorflow vs PyTorch | Deep Learning...,1.0,20181125,65724,1.0
528,Predictive Maintenance & Monitoring using Mach...,0.0,20180724,34286,1.0
1016,Data Scientists vs Data Engineers: Which one i...,1.0,20191219,183723,1.0
1468,#kaggle #DataScience Machine Learning MicroCou...,0.0,20190730,405,1.0


# 1. Data Cleanup

Use a NEW dataframe that holds only the cleaned data, ready to fit the ML model.

In [13]:
# create a clean dataframe with the same index as the original dataframe - raw data
df_clean = pd.DataFrame(index=df.index)

In [14]:
df_clean['date'] = pd.to_datetime(df['upload_date'], format='%Y%m%d')

# note: format='%Y %m %d' shows the time; format='%Y%m%d' brings only YYYY-MM-DD - easy!

In [15]:
df_clean['date']

# dtype: datetime64[ns] used by numpy and pandas

0      2021-05-05
1      2021-05-05
2      2021-05-05
3      2021-05-05
4      2021-05-05
          ...    
1041   2019-11-18
488    2018-11-25
528    2018-07-24
1016   2019-12-19
1468   2019-07-30
Name: date, Length: 727, dtype: datetime64[ns]

In [16]:
# views column: make sure every NaN is converted to 0 and the column is cast to an integer type
df_clean['views'] = df['view_count'].fillna(0).astype(int)

In [17]:
# adding the title column. It will be used by the model and vectorized later
df_clean['title'] = df['title']

# 'new' column, originally from df2
df_clean['new'] = df['new'].fillna(0)    # new == 1 marks data from the active learning step; used to compare the impact of active learning

In [18]:
df_clean.dtypes

date     datetime64[ns]
views             int32
title            object
new             float64
dtype: object

## 2. Features & Labels

Create a dedicated features dataframe. This is JUST an extra step to make sure the features are ready.

**Reason**: Align the features dataframe with the cleanest data available - the raw data collected & cleaned. The cleaning process can skip rows or columns.


In [19]:
# features: it's similar to df_clean, just an extra step
features = pd.DataFrame(index=df_clean.index)

# labels/targets
y = df['y'].copy()

In [20]:
print('Features shape: {}'.format(features.shape))   # .shape attribute avoids relying on the %pylab namespace
print('Labels shape: {}'.format(y.shape))

Features shape: (727, 0)
Labels shape: (727,)


## Important: sklearn can't use *date* as a feature.

Let's derive a numeric feature from the raw date instead - **views_per_day**.

Sklearn needs numeric inputs.

In [21]:
# time_since_pub: time since the video was published, measured against a fixed reference date.
# The reference is an arbitrary choice: the date this code was written, 2021-05-09

# np.timedelta64(1, 'D'): a one-day time delta in numpy, so the division yields the difference in days
# the data has daily granularity, so the difference is expressed in days.
features['time_since_pub'] = (pd.to_datetime("2021-05-09") - df_clean['date']) / np.timedelta64(1, 'D')

# used features
features['views'] = df_clean['views']
features['views_per_day'] = features['views'] / features['time_since_pub']

features = features.drop(['time_since_pub'], axis=1)   # time_since_pub only used for the calculation

# time_since_pub as a feature may hurt the model, since its values seem to grow a lot towards the end of the time series.
# The training & validation datasets may then not have normally distributed values, hence unbalanced feature weights,
# and random samples are important to train and fit an ML model


In [22]:
features.head()
features.tail()

Unnamed: 0,views,views_per_day
1041,31788,59.085502
488,65724,73.352679
528,34286,33.613725
1016,183723,362.372781
1468,405,0.624037
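
As a small worked example of the `views_per_day` feature, the sketch below recomputes it for the first row shown above (31788 views, uploaded 2019-11-18, i.e. 538 days before the reference date). The second row and the `clip(lower=1)` guard are assumptions added for illustration: the notebook does not guard against a video uploaded on the reference date, which would lead to a division by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, only for illustration
toy = pd.DataFrame({
    "upload_date": [20191118, 20210509],
    "view_count": [31788, 120],
})

reference_date = pd.to_datetime("2021-05-09")   # fixed reference date used in the notebook
dates = pd.to_datetime(toy["upload_date"].astype(str), format="%Y%m%d")

# Age of each video in days (float)
days_since_pub = (reference_date - dates) / np.timedelta64(1, "D")

# Assumption: treat a video uploaded on the reference date as one day old
# so that the division never hits zero
toy["views_per_day"] = toy["view_count"] / days_since_pub.clip(lower=1)
print(toy)
```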


In [23]:
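### not working on my personal computer ###
# error: from_bounds() argument after * must be an iterable, not float
# cause: figsize=(20.10) is a single float; figsize expects a (width, height) tuple

# TODO: re-run with the corrected call
#df_clean['date'].value_counts().plot(figsize=(20, 10))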

features.describe()

Unnamed: 0,views,views_per_day
count,727.0,727.0
mean,151081.0,521.378377
std,2005739.0,4208.615856
min,0.0,0.0
25%,367.5,5.422161
50%,3235.0,42.530973
75%,20957.0,212.160526
max,48724170.0,95913.728346


## 3. Data Preparation

Let's try to split the train & validation datasets 50/50.

How do the 2 features **views** and **views_per_day** impact the ML model?
Does a simple model with only 2 features change which YouTube videos get selected?

In [24]:
# check all data on df_clean
# pd.set_option('display.max_rows', 527)
# df_clean

median_date = df_clean['date'].quantile(0.5, interpolation="midpoint")
median_date

# the median date was 2021-03-12 before adding the active learning samples

Timestamp('2021-01-13 00:00:00')

# Increasing the validation set

**Note**: for the prior models, without active learning samples, we split the dataset into train and validation using only the **median date**.

## Active learning samples are usually added to the training dataset

## However, we'll try to add them to the validation dataset

In [45]:
# splitting the features dataset - trying a 50/50 split using a median date
# a balanced dataset is important!!!
# the code below can also be used
# Xtrain, Xval = features[df_clean['date'] < '2021-03-12'], features[df_clean['date'] >= '2021-03-12']
# ytrain, yval = y[df_clean['date'] < '2021-03-12'], y[df_clean['date'] >= '2021-03-12']

# approach used here - boolean masks to select the data
mask_train = (df_clean['date'] < '2021-01-13') & (df_clean['new'] == 0)   # (df_clean['new'] == 0) ==> without active learning samples
mask_val = df_clean['date'] >= '2021-01-13'

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]

Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

# before, without active learning: ((263, 2), (264, 2), (263,), (264,))
# here the split uses the current median date, 2021-01-13

((174, 2), (364, 2), (174,), (364,))
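
With these masks, active learning rows dated before the cutoff end up in neither set (174 + 364 is less than 727). The sketch below shows one way to check how a pair of boolean masks partitions a dataframe. The toy dataframe and the names `n_unused` and `overlap` are assumptions for illustration; only the mask logic mirrors the cell above.

```python
import pandas as pd

# Hypothetical rows: 'new' == 1 marks an active learning sample
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-06-01", "2020-12-31", "2021-01-13", "2021-03-01", "2020-08-15"]
    ),
    "new": [0, 0, 0, 1, 1],
})
cutoff = pd.Timestamp("2021-01-13")

# Same logic as the notebook: train excludes active learning rows
mask_train = (toy["date"] < cutoff) & (toy["new"] == 0)
mask_val = toy["date"] >= cutoff

n_train = int(mask_train.sum())
n_val = int(mask_val.sum())
n_unused = int((~mask_train & ~mask_val).sum())   # rows that land in neither set
overlap = int((mask_train & mask_val).sum())      # must be 0 to avoid leakage

print(n_train, n_val, n_unused, overlap)
```

A check like this makes it explicit that the split is leak-free and how many labelled rows are left out of both sets.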

## Add the title feature

**Important**: transforming the title strings into numbers.

Building a matrix in which each column holds the TF-IDF weight of one word from the title feature.

Important to note that common words like machine + learning will have a low weight.



In [28]:
title_train = df_clean[mask_train]['title']
title_val = df_clean[mask_val]['title']

# Vectorizing the title feature
title_vec = TfidfVectorizer(min_df=2)   # object defined; min_df=2 means a word must appear in at least 2 titles to get its own column

# bow: bag of words
title_bow_train = title_vec.fit_transform(title_train)     # fit + transform: learn the vocabulary and IDF weights from the training titles, then vectorize them
title_bow_val = title_vec.transform(title_val)             # validation set: ONLY transform. The vectorizer should NOT learn words from the validation set

In [29]:
# checking
title_bow_train.shape
title_bow_train

<174x161 sparse matrix of type '<class 'numpy.float64'>'
	with 1065 stored elements in Compressed Sparse Row format>

`TfidfVectorizer` returns a sparse matrix: an optimized SciPy structure in which only the non-zero values are stored.

In [45]:
# numbers from the earlier run: the sparse matrix 'title_bow_train' was 263x241 with 1659 NON-ZERO elements
1 - 1659/(263*241)   # fraction of ZERO elements: ~97%, i.e. only ~3% are NON-ZERO. The matrix is sparse both computationally and mathematically speaking

0.9738257892494833
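
The same calculation can be read directly from the matrix instead of typing the numbers by hand: for the 174x161 matrix shown above with 1065 stored elements it gives 1 - 1065/(174*161), roughly 0.962. The sketch below illustrates this together with the effect of `min_df=2`; the toy titles are hypothetical and only `TfidfVectorizer` and its `vocabulary_` attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles, only for illustration
titles_train = [
    "machine learning tutorial",
    "deep learning with keras",
    "machine learning for beginners",
]
titles_val = ["keras tutorial for machine learning"]

vec = TfidfVectorizer(min_df=2)            # keep words that appear in at least 2 training titles
bow_train = vec.fit_transform(titles_train)
bow_val = vec.transform(titles_val)        # reuses the training vocabulary, learns nothing new

vocab = sorted(vec.vocabulary_)            # only 'learning' and 'machine' survive min_df=2

# Fraction of zero entries, read from the matrix itself
n_cells = bow_train.shape[0] * bow_train.shape[1]
sparsity = 1 - bow_train.nnz / n_cells

print(vocab, bow_train.shape, bow_val.shape, sparsity)
```

Note that "keras" and "tutorial" appear in the validation title but get no column, because each occurs in only one training title.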

In [30]:
Xtrain.head()

# train dataset so far. But, now the 'Title' will be added to train&validation

Unnamed: 0,views,views_per_day
167,7244,61.389831
168,707648,5753.235772
169,2659,21.272
170,55759,442.531746
171,16390,129.055118


## IMPORTANT to note:
Combining a dense matrix - Xtrain & Xval - with a sparse matrix - title_bow_train & title_bow_val

Use scipy.sparse hstack and vstack

More details on hstack and vstack: they stack matrices (vectors) horizontally and vertically

Sample:

    hstack: [1 2], [3 4]  ->  [1 2 3 4]

    vstack: [1 2], [3 4]  ->  [1 2]
                              [3 4]

USE *scipy.sparse hstack and vstack*; the NumPy versions may take TOO LONG on sparse data, or not compute at all!!

In [31]:
# combining sparse matrix with original features
from scipy.sparse import hstack, vstack  

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

In [32]:
Xtrain_wtitle.shape, Xval_wtitle.shape

# 2 numerical features on the training dataset plus 161 columns from 'title'

((174, 163), (364, 163))
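
A minimal sketch of the stacking step on toy data, assuming a small dense block for the numeric features and a small sparse block standing in for the TF-IDF matrix (all values are made up). It shows that the column counts add up, just as 2 + 161 = 163 above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical numeric features (e.g. views, views_per_day) for 2 videos
dense_features = np.array([[7244.0, 61.4],
                           [2659.0, 21.3]])

# Hypothetical TF-IDF block with 3 word columns, mostly zeros
sparse_titles = csr_matrix(np.array([[0.0, 0.7, 0.0],
                                     [0.5, 0.0, 0.0]]))

# Horizontal stack: same rows, columns placed side by side
combined = hstack([dense_features, sparse_titles])

# Convert to CSR to allow row/column indexing
combined_csr = combined.tocsr()
print(combined.shape, combined_csr[0, 3])
```

The result stays sparse, so only the non-zero entries are stored even after the dense columns are attached.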

## Random Forest

In [33]:
# check the number of positive (1) samples in the train dataset
print('Positive samples - videos selected: {}'.format(ytrain.sum()))   # sum of the labels; the old hard-coded 263 was the previous training size
print(' % of positive samples - videos selected: {}'.format(ytrain.mean() * 100))

# definitely unbalanced

Positive samples - videos selected: 65.0
 % of positive samples - videos selected: 37.35632183908046


The training dataset is not big!!

In [34]:
clf_rf = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=6)    # defined object
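
For reference, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python for a label vector with the same proportions as the training set above (174 rows, 37.36 % positives, i.e. 65 positives). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * class_count)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Same class proportions as the training set: 65 positives, 109 negatives
labels = [1] * 65 + [0] * 109
weights = balanced_class_weights(labels)

# After weighting, both classes carry the same total weight
total_pos = weights[1] * 65
total_neg = weights[0] * 109
print(weights, total_pos, total_neg)
```

The minority class (selected videos) gets a weight above 1 and the majority class a weight below 1, so both contribute equally during training.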

## Fitting the model against the train dataset

NOW: 3 features - views, views per day, and title

In [35]:
clf_rf.fit(Xtrain_wtitle, ytrain)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  y_store_unique_indices = np.zeros(y.shape, dtype=np.int)
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  y_encoded = np.zeros(y.shape, dtype=np.int)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=1000, n_jobs=6, oob_score=False,
                       random_state=0, verbose=0, warm_start=False)

In [38]:
print('ML model already trained/fitted and ready to be used!')

ML model already trained/fitted and ready to be used!


## Predicting whether a video will be selected

Probability of class 1

predict_proba: returns a numpy array with the probability of class 0 and the probability of class 1

In [39]:
pred = clf_rf.predict_proba(Xval_wtitle)[:, 1]   # keep only the probability of class 1

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.int)
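
The `[:, 1]` slice works because the columns of `predict_proba` follow the order of `clf.classes_`. The sketch below checks this on a toy model; the random data and the variable names are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 40 samples, 3 features, label depends on the first feature
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)          # shape (n_samples, n_classes)

# Look up the column of the positive class instead of assuming it is column 1
pos_col = list(clf.classes_).index(1)
p_selected = proba[:, pos_col]

print(clf.classes_, proba.shape, pos_col)
```

For 0/1 labels the positive class lands in column 1, which matches the slice used above; looking the column up via `classes_` is simply a safer habit when labels are not 0/1.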


## Metrics - validating the model


 Before: Random Forest Model

average precision: 0.4566659359075169

Roc Auc: 0.5182456140350877

In [40]:
# average precision for the random forest
print('Random Forest Model')
average_precision_score(yval, pred)

Random Forest Model


0.41913349906528646

## IMPORTANT: any future model in PRD should score higher than the baseline model's ROC AUC of **0.49**

In [41]:
# area under the ROC curve
roc_auc_score(yval, pred)

0.4738385553326131

## Validating metrics before and after the active learning samples

Metrics after increasing the validation dataset:

roc auc: 0.47

average precision: 0.41

----------------------------------------------------------
Original decision tree metrics - baseline model

roc auc: 0.49

average precision: 0.42

---------------------------------------------------------

The metrics seem to be within the normal range of variation: AP & ROC AUC are close for the baseline model and the new model evaluated with active learning samples.
Please note that the number of samples is not large, so the metrics might not change a lot...small steps definitely count!!!
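
One way to put a number on "range of variation" is a bootstrap over the validation rows. The sketch below is only an illustration under stated assumptions: the helper `bootstrap_auc_interval` is hypothetical, and the labels and scores are random stand-ins with the same size as the validation set (364 rows), not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_interval(y_true, scores, n_boot=500, seed=0):
    """Percentile interval for ROC AUC obtained by resampling the validation rows."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    rng = np.random.RandomState(seed)
    values = []
    for _ in range(n_boot):
        idx = rng.randint(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # AUC is undefined with a single class
            continue
        values.append(roc_auc_score(y_true[idx], scores[idx]))
    return np.percentile(values, [2.5, 97.5])


# Hypothetical validation set: 364 rows with random labels and random scores
rng = np.random.RandomState(1)
y_fake = rng.randint(0, 2, size=364)
p_fake = rng.rand(364)

low, high = bootstrap_auc_interval(y_fake, p_fake)
print(low, high)
```

If the interval obtained on the real `yval` and `pred` is wider than the gap between two models, the difference between them is probably within resampling noise.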


## OK, let's increase the training dataset with active learning samples

In [58]:
# splitting the features dataset - trying a 50/50 split using a median date
# a balanced dataset is important!!!
# the code below can also be used
# Xtrain, Xval = features[df_clean['date'] < '2021-03-12'], features[df_clean['date'] >= '2021-03-12']
# ytrain, yval = y[df_clean['date'] < '2021-03-12'], y[df_clean['date'] >= '2021-03-12']

# approach used here - boolean masks to select the data
mask_train = (df_clean['date'] < '2021-03-12')  # increasing the training dataset
mask_val = (df_clean['date'] >= '2021-03-12') & (df_clean['new'] == 0)   # (df_clean['new'] == 0) ==> without active learning samples

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]

Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

# before, without active learning: ((263, 2), (264, 2), (263,), (264,))
# I'm keeping my old date to split the train & validation datasets
# 2021-01-13 is the median date for the current df_clean

((458, 2), (264, 2), (458,), (264,))

In [59]:
title_train = df_clean[mask_train]['title']
title_val = df_clean[mask_val]['title']

# Vectorizing the title feature
title_vec = TfidfVectorizer(min_df=2)   # object defined; min_df=2 means a word must appear in at least 2 titles to get its own column

# bow: bag of words
title_bow_train = title_vec.fit_transform(title_train)     # fit + transform: learn the vocabulary and IDF weights from the training titles, then vectorize them
title_bow_val = title_vec.transform(title_val)             # validation set: ONLY transform

# combining sparse matrix with original features
from scipy.sparse import hstack, vstack  

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

clf_rf = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=6)    # defined object

clf_rf.fit(Xtrain_wtitle, ytrain)


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  y_store_unique_indices = np.zeros(y.shape, dtype=np.int)
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  y_encoded = np.zeros(y.shape, dtype=np.int)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=1000, n_jobs=6, oob_score=False,
                       random_state=0, verbose=0, warm_start=False)

In [60]:
pred = clf_rf.predict_proba(Xval_wtitle)[:, 1]   # keep only the probability of class 1

# average precision for the random forest
print('Random Forest Model - average precision: ', average_precision_score(yval, pred))


# area under the ROC curve
print('Random Forest Model - roc-auc: ', roc_auc_score(yval, pred))


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.int)


Random Forest Model - average precision:  0.4263062579128017
Random Forest Model - roc-auc:  0.501812865497076


# Overall
## Model 4: Random Forest with active learning samples
## Changing validation and training dataset


Active learning has a small impact on the model, but not enough. The metrics stay within the same range.

| Model Number | Model | Number of Features | Features | Metric | Value | Remark |
|---|---|---|---|---|---|---|
| 1 | Baseline Decision Tree | 2 | views; views per day | Average Precision | 0.42 | |
| 1 | Baseline Decision Tree | 2 | views; views per day | ROC AUC | 0.49 | |
| 2 | Random Forest | 2 | views; views per day | Average Precision | 0.45 | |
| 2 | Random Forest | 2 | views; views per day | ROC AUC | 0.51 | |
| 3 | Random Forest | 3 | views; views per day; title | Average Precision | 0.41 | increase validation dataset with active learning samples |
| 3 | Random Forest | 3 | views; views per day; title | ROC AUC | 0.47 | increase validation dataset with active learning samples |
| 3 | Random Forest | 3 | views; views per day; title | Average Precision | 0.42 | increase training dataset with active learning samples |
| 3 | Random Forest | 3 | views; views per day; title | ROC AUC | 0.5 | increase training dataset with active learning samples |

In [66]:
# saving df (df1 and df2 concatenated) to be used in the final PRD ML model
# df 727 rows × 5 columns
# df.columns: Index(['title', 'y', 'upload_date', 'view_count', 'new'], dtype='object')

df.to_csv('./raw_data/final_labels_dataset.csv')

#pd.concat([difficult_samples, random_samples]).to_csv('./raw_data/active_label.csv')