<h1>Creating a new model</h1>
<p>To improve on our baseline further, we will create another model and try some new things. The process is much the same as for the previous models. Whenever a step differs from before, I will point it out; otherwise I will simply follow the same steps.</p>

In [1]:
import pandas as pd
import numpy as np
import re
import time

import bs4 as bs4
import json

import glob
import tqdm

pd.set_option("display.max_columns", 131)  # full option name; the short pattern can match more than one option

#https://strftime.org/
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
df = pd.read_csv("labels_curso - to_label_2.csv", index_col=0).dropna(subset=["y"])

In [5]:
df.shape

(1164, 16)

In [6]:
df_limpo = pd.DataFrame(index=df.index)
df_limpo['title'] = df['watch-title']

<h1>Data Cleaning</h1>

In [7]:
clean_date = df['watch-time-text'].str.extract(r"(\d+) de ([a-z]+)\. de (\d+)")
clean_date[0] = clean_date[0].map(lambda x: "0"+x[0] if len(x) == 1 else x)
#clean_date[1] = clean_date[1].map(lambda x: x[0].upper()+x[1:])

mapa_meses = {"jan": "Jan",
              "fev": "Feb",
              "mar": "Mar", 
              "abr": "Apr", 
              "mai": "May", 
              "jun": "Jun",
              "jul": "Jul",
              "ago": "Aug", 
              "set": "Sep", 
              "out": "Oct", 
              "nov": "Nov",
              "dez": "Dec"}

clean_date[1] = clean_date[1].map(mapa_meses)

clean_date = clean_date.apply(lambda x: " ".join(x), axis=1)
clean_date.head()
df_limpo['date'] = pd.to_datetime(clean_date, format="%d %b %Y")
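
<p>As a minimal sketch of the date cleaning above, the snippet below runs the same steps on two made-up strings. The sample values in raw are assumptions chosen only for illustration, and the parsing assumes an English locale so that %b matches "Dec" and "Feb".</p>

```python
import pandas as pd

# Hypothetical samples in the same shape as the 'watch-time-text' column.
raw = pd.Series(["Publicado em 3 de dez. de 2019",
                 "Publicado em 14 de fev. de 2018"])

# Portuguese month abbreviations mapped to the English ones that %b expects.
month_map = {"jan": "Jan", "fev": "Feb", "mar": "Mar", "abr": "Apr",
             "mai": "May", "jun": "Jun", "jul": "Jul", "ago": "Aug",
             "set": "Sep", "out": "Oct", "nov": "Nov", "dez": "Dec"}

parts = raw.str.extract(r"(\d+) de ([a-z]+)\. de (\d+)")
parts[0] = parts[0].str.zfill(2)       # "3" -> "03"
parts[1] = parts[1].map(month_map)     # "dez" -> "Dec"

parsed = pd.to_datetime(parts[0] + " " + parts[1] + " " + parts[2],
                        format="%d %b %Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```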

<h1>Cleaning the views</h1>

In [8]:
views = df['watch-view-count'].str.extract(r"(\d+\.?\d*)", expand=False).str.replace(".", "", regex=False).fillna(0).astype(int)
df_limpo['views'] = views
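
<p>To make the behaviour of this pattern concrete, here is a small sketch that uses only the standard library. The sample strings are assumptions, not rows from the dataset, and parse_views is a hypothetical helper. It also shows that (\d+\.?\d*) accepts at most one thousands separator, so a count above 999.999 would be cut short. The wider pattern ([\d.]+) is one possible alternative, not what this notebook uses.</p>

```python
import re

def parse_views(text, pattern=r"(\d+\.?\d*)"):
    """Extract a dot-separated view count from text and return it as an int."""
    match = re.search(pattern, text)
    if match is None:
        return 0  # mirrors the fillna(0) in the cell above
    return int(match.group(1).replace(".", ""))

print(parse_views("28.028 visualizações"))                          # one separator
print(parse_views("1.234.567 visualizações"))                       # truncated at the second dot
print(parse_views("1.234.567 visualizações", pattern=r"([\d.]+)"))  # wider pattern keeps all digits
```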


<h1>Creating new features</h1>

In [9]:
features = pd.DataFrame(index=df_limpo.index)
y = df['y'].copy()

In [10]:
features['time_since_pub'] = (pd.to_datetime("2019-12-03") - df_limpo['date']) / np.timedelta64(1, 'D')
features['views'] = df_limpo['views']
features['views_per_day'] = features['views'] / features['time_since_pub']
features = features.drop(['time_since_pub'], axis=1)
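
<p>As a quick sanity check of this feature, the sketch below recomputes views_per_day for a single video. The publication date is an assumption picked to be consistent with the first row of the output that follows (28028 views); the real date is not shown in this notebook.</p>

```python
import numpy as np
import pandas as pd

reference_date = pd.to_datetime("2019-12-03")           # same reference date as the cell above
published = pd.Series(pd.to_datetime(["2018-09-03"]))   # assumed publication date
views = pd.Series([28028])

# Timedelta divided by one day gives the age of the video in days as a float.
days_online = (reference_date - published) / np.timedelta64(1, "D")
views_per_day = views / days_online
print(days_online.iloc[0], round(views_per_day.iloc[0], 6))
```

<p>Note that a video published on the reference date itself would have zero days online, and the division would then produce an infinite value, so that case may need a guard.</p>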

In [11]:
features.head()

Unnamed: 0,views,views_per_day
0,28028,61.464912
394,1161,21.109091
393,141646,809.405714
392,325,21.666667
391,61,7.625


In [13]:
mask_train = df_limpo['date'] < "2019-04-01"
mask_val = (df_limpo['date'] >= "2019-04-01")

Xtrain, Xval = features[mask_train], features[mask_val]
ytrain, yval = y[mask_train], y[mask_val]
Xtrain.shape, Xval.shape, ytrain.shape, yval.shape

((555, 2), (609, 2), (555,), (609,))

<h5>Trying a new parameter for our vectorizer</h5>
<p>Now we try a new parameter for the vectorizer: "ngram_range". It takes two values, the minimum and maximum number of consecutive words that are grouped into a single token (here, tokens taken from the titles). For example, ngram_range=(1,2) treats every single word as a token and every pair of adjacent words as a token. With ngram_range=(2,2), only pairs of adjacent words are used as tokens.</p>

<p>Some examples are shown below.</p>

In [None]:
'''
intro to machine learning -> intro, to, machine, learning  -> ngram_range=(1,1)  
intro to machine learning -> intro, to, machine, learning, intro to, to machine, machine learning -> ngram_range=(1,2)  
intro to machine learning -> intro to, to machine, machine learning -> ngram_range=(2,2)  

'''
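
<p>The same breakdown can be reproduced with scikit-learn. This is a minimal sketch that uses CountVectorizer instead of TfidfVectorizer only to keep the output simple; both accept the same ngram_range parameter. The helper ngrams_for is a hypothetical name introduced for this example.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

def ngrams_for(text, ngram_range):
    """Return the sorted n-grams that the vectorizer extracts from one text."""
    vec = CountVectorizer(ngram_range=ngram_range)
    vec.fit([text])
    return sorted(vec.vocabulary_)

title = "intro to machine learning"
for rng in [(1, 1), (1, 2), (2, 2)]:
    print(rng, ngrams_for(title, rng))
```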

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

title_train = df_limpo[mask_train]['title']
title_val = df_limpo[mask_val]['title']

title_vec = TfidfVectorizer(min_df=2, ngram_range=(1,4))
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [15]:
title_bow_train.shape

(555, 1333)

<h1>Combining the feature matrices</h1>

In [16]:
from scipy.sparse import hstack, vstack

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

In [17]:
Xtrain_wtitle.shape, Xval_wtitle.shape

((555, 1335), (609, 1335))
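
<p>A small sketch of what hstack does here, on toy data: the dense numeric columns are placed to the left of the sparse text columns, so the column counts add up while the row count stays the same. The arrays below are made up for illustration.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# 3 rows, 2 numeric features (stand-in for views and views_per_day).
dense = np.array([[10.0, 0.5], [20.0, 1.5], [30.0, 2.5]])
# 3 rows, 4 sparse text features (stand-in for the TF-IDF matrix).
sparse = csr_matrix(np.array([[0, 1, 0, 0], [1, 0, 0, 2], [0, 0, 3, 0]]))

# Convert the stacked result to CSR, which supports row slicing.
combined = hstack([dense, sparse]).tocsr()
print(combined.shape)
print(combined.toarray()[0])
```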

<h1>Random Forest</h1>

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [19]:
mdl = RandomForestClassifier(n_estimators=1000, random_state=0, min_samples_leaf=1, class_weight="balanced", n_jobs=6)
mdl.fit(Xtrain_wtitle, ytrain)

RandomForestClassifier(class_weight='balanced', n_estimators=1000, n_jobs=6,
                       random_state=0)

<h5>Showing some results</h5>

In [20]:
p = mdl.predict_proba(Xval_wtitle)[:, 1]

In [21]:
from sklearn.metrics import roc_auc_score, average_precision_score

In [22]:
average_precision_score(yval, p)

0.21042440914051194

In [23]:
roc_auc_score(yval, p)

0.6897446482278705
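
<p>For intuition about the two metrics, here is a toy example with four made-up predictions. ROC AUC is the fraction of (positive, negative) pairs that are ranked in the right order, and average precision averages the precision reached at each positive found when going down the ranking.</p>

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities

# Of the 4 (positive, negative) pairs, 3 are ordered correctly.
auc = roc_auc_score(y_true, scores)
# Precision is 1/1 at the first positive and 2/3 at the second.
ap = average_precision_score(y_true, scores)
print(auc, ap)
```

<p>A random ranking scores about 0.5 on ROC AUC, while its average precision is close to the share of positive examples, which is one reason the two numbers above sit on such different scales.</p>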

<h5>Tuning some parameters</h5>
<p>Some manual tuning was done, and the results are listed below. 
    There are much better ways to do this, though.</p>

In [None]:
'''
ap 0.1887361518270232, auc 0.6766232234475438
ap 0.1773056089416357, auc 0.66359025771068 - min_samples_leaf=2

ap 0.17595568029276232, auc 0.6454708969747007 - min_df=1  
ap 0.17508239417116303, auc 0.650247685321696 - min_df=3  
ap 0.21140611241061869, auc 0.6785398360559061 - min_df=2, ngram_range=(1,2)  
ap 0.22228951304206077, auc 0.6914990859232175 - min_df=2, ngram_range=(1,3)  
ap 0.20564709615419985, auc 0.6848794008374124 - min_df=2, ngram_range=(1,4)  


ap 0.1834073532858915, auc 0.6716990033614437 - n_estimators=100
ap 0.1762251704396226, auc 0.6700772542312909 - n_estimators=100, min_samples_leaf=2  


RF ap 0.22228951304206077, auc 0.6914990859232175 - min_df=2, ngram_range=(1,3)  
'''
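
<p>One more systematic option is to loop over a parameter grid and record the validation score for each combination. The sketch below does this with ParameterGrid on synthetic data from make_classification. The data, the grid values and the small n_estimators are assumptions chosen to keep it fast, not the settings used above.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

grid = ParameterGrid({"min_samples_leaf": [1, 2], "n_estimators": [50, 100]})

results = []
for params in grid:
    model = RandomForestClassifier(random_state=0, class_weight="balanced", **params)
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_va)[:, 1]
    results.append((average_precision_score(y_va, proba), params))

# Keep the combination with the highest validation average precision.
best_ap, best_params = max(results, key=lambda r: r[0])
print(best_ap, best_params)
```

<p>In this notebook the vectorizer settings (min_df, ngram_range) were tuned as well, so a real version of this loop would also need to refit the vectorizer inside it.</p>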

<h1>LightGBM</h1>
<p>A new algorithm will be tested. LightGBM usually performs well overall, so let's check how it does here.</p>