<h1>A second model</h1>
<p>After creating an initial model that serves as the baseline, we want to create another model that beats it.</p>

<h5>Importing the libraries</h5>
<p>Just like before, we are importing the libraries.</p>

In [1]:
import pandas as pd
import numpy as np
import re
import time

import bs4
import json

import glob
import tqdm

pd.set_option("display.max_columns", 131)

#https://strftime.org/
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


<p>We read the CSV file, which now has an extra column named "y", and keep only the rows where that column is not null. That's our target variable.</p>

In [2]:
df = pd.read_csv("raw_data_with_labels.csv", index_col=0)
df = df[df['y'].notnull()]
df.shape

(498, 16)

In [3]:
df_limpo = pd.DataFrame(index=df.index)


<h1>Data Cleaning</h1>
<p>Same as before. No changes here.</p>

In [4]:
clean_date = df['watch-time-text'].str.extract(r"(\d+) de ([a-z]+)\. de (\d+)")


In [5]:
clean_date[0] = clean_date[0].map(lambda x: "0"+x[0] if len(x) == 1 else x)


In [6]:
'''
It worked.
'''
clean_date[0]

0      03
1      16
2      02
3      13
4      30
       ..
496    01
497    31
498    10
499    25
500    21
Name: 0, Length: 498, dtype: object

In [7]:
mapa_meses = {"jan": "Jan",
              "fev": "Feb",
              "mar": "Mar", 
              "abr": "Apr", 
              "mai": "May", 
              "jun": "Jun",
              "jul": "Jul",
              "ago": "Aug", 
              "set": "Sep", 
              "out": "Oct", 
              "nov": "Nov",
              "dez": "Dec"}

clean_date[1] = clean_date[1].map(mapa_meses)

In [8]:
clean_date[1]

0      Sep
1      Nov
2      May
3      Aug
4      Nov
      ... 
496    Mar
497    May
498    Nov
499    Apr
500    Mar
Name: 1, Length: 498, dtype: object

In [9]:
clean_date = clean_date.apply(lambda x: " ".join(x), axis=1)

df_limpo['date'] = pd.to_datetime(clean_date, format="%d %b %Y")
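
<p>The date steps above can be condensed into a single helper. This is a minimal sketch in plain Python rather than pandas. The function name "parse_pt_date" and the sample string are assumptions made for illustration, and the input is assumed to follow the "3 de set. de 2018" pattern matched by the regular expression above.</p>

```python
import re
from datetime import date

# Portuguese month abbreviations mapped to month numbers
# (same keys as "mapa_meses" above).
MONTHS_PT = {"jan": 1, "fev": 2, "mar": 3, "abr": 4, "mai": 5, "jun": 6,
             "jul": 7, "ago": 8, "set": 9, "out": 10, "nov": 11, "dez": 12}

DATE_RE = re.compile(r"(\d+) de ([a-z]+)\. de (\d+)")


def parse_pt_date(text):
    """Return a date parsed from text like '3 de set. de 2018', or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month_abbr, year = match.groups()
    month = MONTHS_PT.get(month_abbr)
    if month is None:
        return None
    return date(int(year), month, int(day))


# Hypothetical input string, for illustration only.
example = parse_pt_date("Publicado em 3 de set. de 2018")
print(example)
```

<p>Mapping the abbreviation straight to a month number avoids depending on the machine's locale when parsing month names.</p>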

<h1>View Cleaning</h1>
<p>Again, cleaning the views using the same old regular expression. No changes here either.</p>

In [10]:
# regex=False treats "." as a literal thousands separator, not a regex wildcard
views = df['watch-view-count'].str.extract(r"(\d+\.?\d*)", expand=False).str.replace(".", "", regex=False).fillna(0).astype(int)




In [11]:
df_limpo['views'] = views
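
<p>The same cleaning can be sketched for a single string in plain Python. The function name "parse_views" and the sample strings are assumptions made for illustration. Note that the pattern used in the notebook captures at most one "." separator, so this sketch swaps in a slightly broader pattern that also handles counts of one million or more.</p>

```python
import re


def parse_views(text):
    """Extract an integer view count from text such as '28.028 visualizações'."""
    if not isinstance(text, str):
        return 0  # missing value, mirrors fillna(0)
    # A digit followed by any run of digits and "." thousands separators.
    match = re.search(r"\d[\d.]*", text)
    if match is None:
        return 0
    # In Brazilian number formatting "." separates thousands, so drop it.
    return int(match.group(0).replace(".", ""))


# Hypothetical inputs, for illustration only.
print(parse_views("28.028 visualizações"))
print(parse_views("1.234.567 visualizações"))
print(parse_views(None))
```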

<h1>Creating Features</h1>
<p>The process of creating features is the same as before.</p>

In [12]:
features = pd.DataFrame(index=df_limpo.index)
y = df['y'].copy()

<h5>Converting the date into time since publication</h5>

In [13]:
pd.to_datetime("2021-01-01") - df_limpo["date"]

0      851 days
1      777 days
2      610 days
3      507 days
4      763 days
         ...   
496   1037 days
497    946 days
498    418 days
499    617 days
500    652 days
Name: date, Length: 498, dtype: timedelta64[ns]

In [14]:
features['time_since_pub'] = (pd.to_datetime("2021-01-01") - df_limpo['date']) / np.timedelta64(1, 'D')


In [15]:
features['time_since_pub'].head()

0    851.0
1    777.0
2    610.0
3    507.0
4    763.0
Name: time_since_pub, dtype: float64

<h5>Cleaning the views</h5>

In [16]:
features['views'] = df_limpo['views']

In [17]:
features['views_per_day'] = features['views'] / features['time_since_pub']

In [18]:
features = features.drop(['time_since_pub'], axis=1)

In [19]:
features.head()

   views  views_per_day
0  28028       32.93537
1   1131       1.455598
2   1816       2.977049
3   1171       2.309665
4   1228       1.609436
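
<p>The "views_per_day" feature is plain arithmetic: total views divided by the number of days between publication and the reference date. A minimal sketch for one video follows. The function name is an assumption, and the publication date is inferred from the day, month and day count shown for the first row above.</p>

```python
from datetime import date

REFERENCE_DATE = date(2021, 1, 1)  # same reference date as the cells above


def views_per_day(views, published):
    """Average daily views between publication and the reference date."""
    days = (REFERENCE_DATE - published).days
    if days <= 0:
        raise ValueError("publication date must be before the reference date")
    return views / days


# Values consistent with the first row shown above: 28028 views, 851 days.
rate = views_per_day(28028, date(2018, 9, 3))
print(round(rate, 5))
```

<p>Dividing by the age of the video lets an old video with many accumulated views be compared fairly with a recent one.</p>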


<h1>Creating some variables for our model</h1>
<p>Now we are going to create the variables used by our new model: the training and validation sets, split by publication date, as always.</p>

In [22]:
mask_train = df_limpo['date'] < "2019-04-01"
mask_val = df_limpo['date'] >= "2019-04-01"

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

((228, 2), (270, 2), (228,), (270,))
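
<p>The split above is time-based: everything published before the cutoff goes to training, everything on or after it goes to validation. A small sketch of the same idea in plain Python, with hypothetical rows and a hypothetical function name:</p>

```python
from datetime import date

CUTOFF = date(2019, 4, 1)  # same cutoff as the masks above


def time_split(rows):
    """Split (published_date, payload) pairs into train and validation lists."""
    train = [r for r in rows if r[0] < CUTOFF]
    val = [r for r in rows if r[0] >= CUTOFF]
    return train, val


# Hypothetical rows, for illustration only.
rows = [(date(2018, 9, 3), "a"), (date(2019, 3, 31), "b"),
        (date(2019, 4, 1), "c"), (date(2020, 1, 15), "d")]
train, val = time_split(rows)
print(len(train), len(val))
```

<p>Splitting by date rather than at random keeps the model from being trained on videos published after the ones it is validated on.</p>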

<h1>Turn strings into numbers</h1>
<p>Our models can only understand numbers, not strings, and the "title" column contains only words. To solve this problem, we can build a matrix with one row per video and one column per word, where each cell holds a score for how often that word appears in that title. To do that we use the "TfidfVectorizer" class from scikit-learn.</p>

<p>"TfidfVectorizer" gives more weight to words that appear often in a single video's title but rarely across all the videos. The "min_df" parameter sets the minimum number of videos in which a word must appear to be kept. We can adjust that number freely, but doing so changes the vocabulary and therefore affects the model's performance.</p>
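
<p>To make the weighting concrete, here is a minimal TF-IDF sketch in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which, to the best of our knowledge, matches scikit-learn's default settings. The function names and titles are hypothetical, and this is not the library's actual implementation.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(docs, min_df=2):
    """Return (vocabulary, rows), where each row maps word -> TF-IDF weight."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Words appearing in fewer than min_df documents are dropped.
    vocab = sorted(w for w, count in df.items() if count >= min_df)
    # Smoothed inverse document frequency: rarer words get a larger value.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in idf)
        weights = {w: c * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        # L2-normalise each row; rows with no known words stay empty.
        rows.append({w: v / norm for w, v in weights.items()} if norm else {})
    return vocab, rows


# Hypothetical titles, for illustration only.
titles = ["machine learning tutorial", "deep learning tutorial",
          "cooking pasta", "machine learning course"]
vocab, rows = tfidf(titles, min_df=2)
print(vocab)
print(rows[0])
```

<p>With "min_df=2", words that occur in a single title are dropped, and in the first title "machine" ends up with a higher weight than the more common "learning".</p>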

<h5>Sparsity of the matrix</h5>
<p>By default, the vectorizer gives us a sparse matrix, for optimization purposes. This means that only the non-zero values are stored, which saves memory by not allocating space for the zeroes.</p>

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

df_limpo['title'] = df['watch-title']

title_train = df_limpo[mask_train]['title']
title_val = df_limpo[mask_val]['title']

#Title Bow is a bag of words
title_vec = TfidfVectorizer(min_df=2)
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [28]:
title_bow_train

<228x193 sparse matrix of type '<class 'numpy.float64'>'
	with 1277 stored elements in Compressed Sparse Row format>

In [29]:
title_bow_val

<270x193 sparse matrix of type '<class 'numpy.float64'>'
	with 1266 stored elements in Compressed Sparse Row format>

<p>We have a matrix of 228 by 193. This means that the full matrix, counting the zeroes, has 228 * 193 = 44,004 cells.</p>

In [26]:
title_bow_train.shape

(228, 193)

In [27]:
title_bow_train

<228x193 sparse matrix of type '<class 'numpy.float64'>'
	with 1277 stored elements in Compressed Sparse Row format>

<p>Almost all of the matrix (about 97%) is composed of zeroes: only 1,277 of its 44,004 cells hold a value. To save memory, we don't need to store the zeroes.</p>

In [25]:
1 - 1277/(228*193)

0.9709799109171894
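
<p>The idea of storing only the non-zero cells can be sketched with a plain dictionary keyed by (row, column). This is only an illustration of the principle, not how the Compressed Sparse Row format is laid out internally. The tiny matrix and the function name are hypothetical.</p>

```python
def sparsity(n_rows, n_cols, n_stored):
    """Fraction of cells in an n_rows x n_cols matrix that are zero."""
    return 1 - n_stored / (n_rows * n_cols)


# A tiny hypothetical matrix, written out densely.
dense = [[0.0, 0.5, 0.0],
         [0.0, 0.0, 0.0],
         [0.7, 0.0, 0.0]]

# Sparse storage: keep only the non-zero cells, keyed by (row, column).
stored = {(i, j): v
          for i, row in enumerate(dense)
          for j, v in enumerate(row) if v != 0}

print(len(stored), "stored values out of", 3 * 3)
print(sparsity(3, 3, len(stored)))
# Numbers from the training matrix above: 228 x 193 with 1277 stored values.
print(sparsity(228, 193, 1277))
```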

<h5>Joining some matrices</h5>
<p>To join matrices, we can use the "hstack" function from "scipy.sparse". There are two ways to join matrices: horizontally with "hstack" or vertically with "vstack". The difference between them is shown below.</p>

In [None]:
'''
hstack - [1 2]     [3 4]   -> [1 2 3 4] - 1x4

vstack - [1 2]     [3 4]   -> [1 2]
                              [3 4] - 2x2
'''

In [30]:
# Effectively joining the matrices
from scipy.sparse import hstack, vstack

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

In [31]:
Xtrain_wtitle.shape, Xval_wtitle.shape

((228, 195), (270, 195))
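
<p>The shape arithmetic is worth spelling out: stacking horizontally keeps the number of rows and adds up the columns, so 2 feature columns plus 193 word columns give 195. The sketch below shows both kinds of stacking with plain Python lists. It only illustrates the shapes and is not the "scipy.sparse" implementation. The function names are hypothetical.</p>

```python
def hstack_rows(left, right):
    """Join two row-major matrices side by side (same number of rows)."""
    if len(left) != len(right):
        raise ValueError("both matrices need the same number of rows")
    return [l + r for l, r in zip(left, right)]


def vstack_rows(top, bottom):
    """Stack two row-major matrices vertically (same number of columns)."""
    if top and bottom and len(top[0]) != len(bottom[0]):
        raise ValueError("both matrices need the same number of columns")
    return top + bottom


a = [[1, 2]]
b = [[3, 4]]
print(hstack_rows(a, b))  # [[1, 2, 3, 4]]   -> 1x4
print(vstack_rows(a, b))  # [[1, 2], [3, 4]] -> 2x2
```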

<h1>Creating a model with Random Forest</h1>
<p>In order to beat the scores in our baseline, now we are going to use the Random Forest algorithm.</p>

In [34]:
#from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

mdl = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight="balanced", n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=6,
                       random_state=0)
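
<p>The "class_weight" parameter deserves a note. According to the scikit-learn documentation, the "balanced" mode weights each class by n_samples / (n_classes * count of that class), so the rarer class counts for more during training. A minimal sketch of that formula, with hypothetical labels and a hypothetical function name:</p>

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical labels: 8 negatives and 2 positives.
weights = balanced_class_weights([0] * 8 + [1] * 2)
print(weights)  # {0: 0.625, 1: 2.5}
```

<p>This matters when positive labels are scarce, since otherwise the model could score well by mostly predicting the majority class.</p>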

<h5>Verifying the results</h5>
<p>After the training we are going to analyse the results with the "roc_auc_score" and the "average_precision_score". </p>

In [35]:
from sklearn.metrics import roc_auc_score, average_precision_score

In [37]:
p = mdl.predict_proba(Xval_wtitle)[:, 1]

In [38]:
average_precision_score(yval, p)

0.18367911080129234

In [39]:
roc_auc_score(yval, p)

0.5761094224924012
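
<p>To read these numbers it helps to see what each metric measures. The ROC AUC is the fraction of (positive, negative) pairs that the model ranks in the right order, so 0.5 corresponds to a random ranking. The average precision averages the precision measured at the rank of each positive example. The sketch below computes both in plain Python. It is not the scikit-learn implementation, the function names and data are hypothetical, and the average precision part assumes there are no tied scores.</p>

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Hypothetical labels and scores, for illustration only.
y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_demo, p_demo))            # 0.75
print(average_precision(y_demo, p_demo))  # (1/1 + 2/3) / 2
```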

<h1>Conclusion</h1>
<p>We have a higher average precision score but a lower ROC AUC score than the baseline, so we cannot yet conclude that this model is better. For that, we need both metrics to be higher than our baseline.</p>