<h1>Creating a new model</h1>
<p>To improve further on our baseline, we will build another model and try some new things. The process is mostly the same as for the previous models, so I will only point out the steps that differ and otherwise just follow along.</p>

In [1]:
import pandas as pd
import numpy as np
import re
import time

import bs4 as bs4
import json

import glob
import tqdm

pd.set_option("display.max_columns", 131)

#https://strftime.org/
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
df = pd.read_csv("labels_curso - to_label_2.csv", index_col=0).dropna(subset=["y"])

In [3]:
df.shape

(1164, 16)

In [4]:
df_limpo = pd.DataFrame(index=df.index)
df_limpo['title'] = df['watch-title']

<h1>Data Cleaning</h1>

In [5]:
clean_date = df['watch-time-text'].str.extract(r"(\d+) de ([a-z]+)\. de (\d+)")
clean_date[0] = clean_date[0].map(lambda x: "0"+x[0] if len(x) == 1 else x)
#clean_date[1] = clean_date[1].map(lambda x: x[0].upper()+x[1:])

mapa_meses = {"jan": "Jan",
              "fev": "Feb",
              "mar": "Mar", 
              "abr": "Apr", 
              "mai": "May", 
              "jun": "Jun",
              "jul": "Jul",
              "ago": "Aug", 
              "set": "Sep", 
              "out": "Oct", 
              "nov": "Nov",
              "dez": "Dec"}

clean_date[1] = clean_date[1].map(mapa_meses)

clean_date = clean_date.apply(lambda x: " ".join(x), axis=1)
clean_date.head()
df_limpo['date'] = pd.to_datetime(clean_date, format="%d %b %Y")

<h1>Cleaning the views</h1>

In [6]:
# "." is the thousands separator here; regex=False removes it as a literal character.
views = df['watch-view-count'].str.extract(r"(\d+\.?\d*)", expand=False).str.replace(".", "", regex=False).fillna(0).astype(int)
df_limpo['views'] = views


<h1>Creating new features</h1>

In [7]:
features = pd.DataFrame(index=df_limpo.index)
y = df['y'].copy()

In [8]:
features['time_since_pub'] = (pd.to_datetime("2019-12-03") - df_limpo['date']) / np.timedelta64(1, 'D')
features['views'] = df_limpo['views']
features['views_per_day'] = features['views'] / features['time_since_pub']
features = features.drop(['time_since_pub'], axis=1)

In [9]:
features.head()

Unnamed: 0,views,views_per_day
0,28028,61.464912
394,1161,21.109091
393,141646,809.405714
392,325,21.666667
391,61,7.625


In [10]:
mask_train = df_limpo['date'] < "2019-04-01"
mask_val = (df_limpo['date'] >= "2019-04-01")

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

((555, 2), (609, 2), (555,), (609,))

<h5>Trying a new parameter for our vectorizer</h5>
<p>Now we try a new parameter for the vectorizer: "ngram_range". It takes two values, the minimum and the maximum number of consecutive words that are joined into a single token (an n-gram) taken from each title. With "ngram_range=(1,2)", every single word is a token and every pair of adjacent words is also a token. With "ngram_range=(2,2)", only pairs of adjacent words are used as tokens.</p>

<p>Some examples are shown below.</p>

In [11]:
'''
intro to machine learning -> intro, to, machine, learning  -> ngram_range=(1,1)  
intro to machine learning -> intro, to, machine, learning, intro to, to machine, machine learning -> ngram_range=(1,2)  
intro to machine learning -> intro to, to machine, machine learning -> ngram_range=(2,2)  

'''
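
<p>The listing above can be reproduced with a few lines of plain Python. This is only a sketch: "word_ngrams" is a hypothetical helper, not part of scikit-learn, and it simply splits on whitespace, whereas TfidfVectorizer applies its own tokenizer and then weights the resulting tokens.</p>

```python
def word_ngrams(text, ngram_range):
    """Return the word n-grams of `text` for every n in ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    words = text.lower().split()  # assumption: plain whitespace tokenization
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words over the title.
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


title = "intro to machine learning"
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    print(ngram_range, word_ngrams(title, ngram_range))
```

<p>A wider range produces more tokens per title, which is why the vocabulary grows quickly as the maximum n increases.</p>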


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

title_train = df_limpo[mask_train]['title']
title_val = df_limpo[mask_val]['title']

title_vec = TfidfVectorizer(min_df=2, ngram_range=(1,4))
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [13]:
title_bow_train.shape

(555, 1333)

<h1>Combining the feature matrices</h1>

In [14]:
from scipy.sparse import hstack, vstack

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

In [15]:
Xtrain_wtitle.shape, Xval_wtitle.shape

((555, 1335), (609, 1335))

<h1>Random Forest</h1>

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [17]:
mdl = RandomForestClassifier(n_estimators=1000, random_state=0, min_samples_leaf=1, class_weight="balanced", n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=6,
                       random_state=0)

<h5>Showing some results</h5>

In [18]:
p = mdl.predict_proba(Xval_wtitle)[:, 1]

In [19]:
from sklearn.metrics import roc_auc_score, average_precision_score

In [20]:
average_precision_score(yval, p)

0.21042440914051194

In [21]:
roc_auc_score(yval, p)

0.6897446482278705

<h5>Tuning some parameters</h5>
<p>Some tuning was done by hand; the results are listed below. 
    There are much better ways to do this, though.</p>

In [22]:
'''
ap 0.1887361518270232, auc 0.6766232234475438
ap 0.1773056089416357, auc 0.66359025771068 - min_samples_leaf=2

ap 0.17595568029276232, auc 0.6454708969747007 - min_df=1  
ap 0.17508239417116303, auc 0.650247685321696 - min_df=3  
ap 0.21140611241061869, auc 0.6785398360559061 - min_df=2, ngram_range=(1,2)  
ap 0.22228951304206077, auc 0.6914990859232175 - min_df=2, ngram_range=(1,3)  
ap 0.20564709615419985, auc 0.6848794008374124 - min_df=2, ngram_range=(1,4)  


ap 0.1834073532858915, auc 0.6716990033614437 - n_estimators=100
ap 0.1762251704396226, auc 0.6700772542312909 - n_estimators=100, min_samples_leaf=2  


RF ap 0.22228951304206077, auc 0.6914990859232175 - min_df=2, ngram_range=(1,3)  
'''


<h1>LightGBM</h1>
<p>Next we test a new algorithm. LightGBM usually performs well across many problems. Let's check whether that holds here.</p>

In [23]:
from lightgbm import LGBMClassifier

In [24]:
mdl = LGBMClassifier(random_state=0, class_weight="balanced", n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

LGBMClassifier(class_weight='balanced', n_jobs=6, random_state=0)

In [25]:
p = mdl.predict_proba(Xval_wtitle)[:, 1]



In [26]:
average_precision_score(yval, p), roc_auc_score(yval, p)

(0.1665181138269886, 0.5684378132924456)

<h1>Bayesian Optimization</h1>
<p>So far the LightGBM parameters were chosen without any real criterion. We need a better way to pick the parameters that give the best performance and the best results. To do that, we will use something called Bayesian Optimization.</p>
<p>Bayesian Optimization is a way to optimize a model by tuning many parameters at once. The goal is to reach the best possible metric for our model. There are simpler alternatives. One is random search: pick the parameters at random until a predefined number of steps is reached. Another is a brute-force search over every possible combination of parameters, which is extremely costly in both time and computational resources.</p>
<p>The optimization can also be done in a smarter way, using the results of previous evaluations to focus on the parameters that are improving the model until it reaches an optimum. That is what we try to do below, using the "forest_minimize" function from the scikit-optimize (skopt) library.</p>
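
<p>For comparison, the random strategy mentioned above can be sketched with the standard library alone. Everything here is illustrative: "toy_objective" is a made-up stand-in for a real train-and-score function, and the two ranges are loosely borrowed from the search space used later in this notebook.</p>

```python
import math
import random

# Hypothetical search space with two parameters.
LR_RANGE = (1e-3, 1e-1)      # sampled log-uniformly
MAX_DEPTH_RANGE = (1, 10)    # sampled uniformly over integers


def sample_params(rng):
    """Draw one random point from the search space."""
    low, high = LR_RANGE
    lr = 10 ** rng.uniform(math.log10(low), math.log10(high))
    max_depth = rng.randint(*MAX_DEPTH_RANGE)
    return lr, max_depth


def toy_objective(params):
    """Stand-in for training and scoring a model (lower is better)."""
    lr, max_depth = params
    return (math.log10(lr) + 2) ** 2 + (max_depth - 4) ** 2


def random_search(objective, n_calls, seed=0):
    """Evaluate n_calls random points and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_calls):
        params = sample_params(rng)
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(toy_objective, n_calls=50)
print(best_params, best_score)
```

<p>A model-based optimizer keeps the same evaluate-and-keep-the-best loop, but proposes each new point based on the scores already observed instead of sampling blindly. Because the loop minimizes, a metric that should be maximized has to be negated first.</p>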

In [27]:
from skopt import forest_minimize

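<p>The "tune_lgbm" function below is the objective to be optimized. It receives one set of parameters, fits the vectorizer and the model with them, and returns the negative average precision on the validation set. The value is negated because "forest_minimize" minimizes the function it is given.</p>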

In [31]:
def tune_lgbm(params):
    print(params)
    lr = params[0]
    max_depth = params[1]
    min_child_samples = params[2]
    subsample = params[3]
    colsample_bytree = params[4]
    n_estimators = params[5]
    
    min_df = params[6]
    ngram_range = (1, params[7])
    
    title_vec = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    title_bow_train = title_vec.fit_transform(title_train)
    title_bow_val = title_vec.transform(title_val)
    
    Xtrain_wtitle = hstack([Xtrain, title_bow_train])
    Xval_wtitle = hstack([Xval, title_bow_val])
    
    mdl = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth, 
                         min_child_samples=min_child_samples, subsample=subsample,
                         colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                         class_weight="balanced", n_jobs=6)
    mdl.fit(Xtrain_wtitle, ytrain)
    
    p = mdl.predict_proba(Xval_wtitle)[:, 1]
    
    print(roc_auc_score(yval, p))
    
    return -average_precision_score(yval, p)


space = [(1e-3, 1e-1, 'log-uniform'), # lr
          (1, 10), # max_depth
          (1, 20), # min_child_samples
          (0.05, 1.), # subsample
          (0.05, 1.), # colsample_bytree
          (100,1000), # n_estimators
          (1,5), # min_df
          (1,5)] # ngram_range



In [32]:
# The optimization result is stored in "res"; the best parameters (res.x) follow the order defined in "space"
res = forest_minimize(tune_lgbm, space, random_state=160745, n_random_starts=20, n_calls=50, verbose=1)

Iteration No: 1 started. Evaluating function at random point.


TypeError: '<' not supported between instances of 'Version' and 'tuple'

In [30]:
res.x

NameError: name 'res' is not defined

<h1>Logistic Regression</h1>
<p>Now we are going to test the Logistic Regression algorithm.</p>
<p>Logistic regression is a statistical method used to predict the outcome of a dependent variable based on previous observations. It's a type of regression analysis and is a commonly used algorithm for solving binary classification problems.</p>
<p>Regression Analysis is a type of predictive modeling technique used to find the relationship between a dependent variable and one or more independent variables.</p>

<p><a href="https://learn.g2.com/logistic-regression">Reference</a></p>
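
<p>To make the idea concrete, here is a minimal sketch of how a logistic regression turns features into a probability: a weighted sum of the features goes through the logistic (sigmoid) function. The weights and bias are invented for illustration; a fitted model learns them from the data, and the "predict_proba" helper here is a hypothetical function, not the scikit-learn method of the same name.</p>

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for a single example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters for a model with two features.
weights = [1.5, -0.5]
bias = -1.0

print(predict_proba([2.0, 1.0], weights, bias))  # z = 1.5, so above 0.5
print(predict_proba([0.0, 2.0], weights, bias))  # z = -2.0, so below 0.5
```

<p>Thresholding this probability is what turns the regression into a binary classifier.</p>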


In [34]:
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from scipy.sparse import csr_matrix

<h1>Scalers</h1>
<p>Features with very large or very small values can hurt the predictions, because the model becomes too dependent on those extreme numbers. To avoid that, we rescale the features. There are many ways to do this. 
Several of them are compared here: <a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#results">All scaling methods</a></p>
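
<p>As an illustration, max-abs scaling (the idea behind MaxAbsScaler) can be sketched in plain Python: each column is divided by its largest absolute value, so values end up in [-1, 1] and zeros stay zero. The helper names and the sample rows are hypothetical, and keeping a scale of 1 for an all-zero column is a choice made for this sketch.</p>

```python
def max_abs_fit(rows):
    """Compute the largest absolute value of each column."""
    n_cols = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    # An all-zero column keeps a scale of 1 to avoid dividing by zero.
    return [s if s != 0 else 1.0 for s in scales]


def max_abs_transform(rows, scales):
    """Divide every value by the scale of its column."""
    return [[value / s for value, s in zip(row, scales)] for row in rows]


# Hypothetical rows: [views, views_per_day, one TF-IDF weight]
train = [[28028, 61.5, 0.0],
         [1161, 21.1, 0.4],
         [141646, 809.4, 0.8]]
val = [[325, 21.7, 0.2]]

scales = max_abs_fit(train)                 # learned from the training rows only
train_scaled = max_abs_transform(train, scales)
val_scaled = max_abs_transform(val, scales)  # reuses the training scales

print(scales)
print(train_scaled[2])
```

<p>Since zeros are preserved, this kind of scaling does not destroy the sparsity of a matrix that is mostly TF-IDF columns.</p>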

In [35]:
Xtrain_wtitle2 = csr_matrix(Xtrain_wtitle.copy())
Xval_wtitle2 = csr_matrix(Xval_wtitle.copy())

#scaler = StandardScaler()
scaler = MaxAbsScaler()



#Xtrain_wtitle2[:, :2] = scaler.fit_transform(Xtrain_wtitle2[:, :2].todense())
#Xval_wtitle2[:, :2] = scaler.transform(Xval_wtitle2[:, :2].todense())

Xtrain_wtitle2 = scaler.fit_transform(Xtrain_wtitle2)
Xval_wtitle2 = scaler.transform(Xval_wtitle2)

In [36]:
Xval_wtitle2.shape

(609, 1335)

In [37]:

mdl = LogisticRegression(C=0.5,n_jobs=6, random_state=0)
mdl.fit(Xtrain_wtitle2, ytrain)

LogisticRegression(C=0.5, n_jobs=6, random_state=0)

In [38]:
p = mdl.predict_proba(Xval_wtitle2)[:, 1]

In [39]:
average_precision_score(yval, p), roc_auc_score(yval, p)

(0.2162535946903888, 0.6861178274458927)

In [40]:
'''
(0.20616279322296893, 0.6606416229285841) - no tuning, standardscaler

(0.20757989629841797, 0.6862357728371764) - no tuning, maxabsscaler
(0.18953863996413214, 0.6741463702305833) - C=10, maxabsscaler

(0.21340786207179874, 0.6835525151854692) - C=0.5, maxabsscaler
'''