# ML regression

I'm going to:

- Normalise the dataset before applying ML. 
- Use **RFE** (Recursive Feature Elimination) to select the best 15 columns to work with, from the 133 I have.
- Split the data for validation (I'm still deciding between k-fold cross-validation and a train/validation/test split).
- Once the model is trained, run inference on the whole dataset, to build a dashboard comparing the *real keyword* with the *inferred keyword*.
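
To make the two validation options in the plan concrete, here is a minimal sketch on synthetic data. The array names (`X_demo`, `y_demo`), the 60/20/20 split and the 5 folds are assumptions for illustration, not choices made in this notebook. Both options shuffle the rows, which matches the decision to ignore the *date* column but would not be appropriate if time order mattered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the real data: 90 rows, 15 features (hypothetical sizes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 15))
y_demo = X_demo[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=90)

# Option A: train / validation / test split (60% / 20% / 20%).
X_train, X_rest, y_train, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_val, y_val)

# Option B: 5-fold cross-validation, one R^2 value per fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("hold-out R^2:", round(holdout_r2, 3))
print("k-fold R^2 mean:", round(kfold_r2.mean(), 3))
```

With only around 90 weekly rows, k-fold tends to make better use of the data, while the three-way split keeps a test set that is never touched during model selection.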

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split
#from sklearn import preprocessing

In [2]:
df=pd.read_csv("input/dataset_final_processed.csv")
df.drop(columns='Unnamed: 0', inplace=True)
df.head()

Unnamed: 0,date,protestas_x,extremismo_x,rebelion_x,refugiados_x,precio_petroleo_x,juicio_x,corrupcion_x,inestabilidad_politica_x,terrorismo_x,...,teletrabajo,tinder,uber,uber eats,videoconferencia,videollamada,vox,yoga,zoom,unemployment
0,2019-01-27,-0.986667,-29.993333,-21.453333,-6.425,0.0,-111.88,-92.164,-7.77,-82.14,...,2.0,56.0,27.0,10.0,2.0,2.0,35.0,50.0,5.0,21.0
1,2019-02-03,-16.754286,-2.26,-17.413333,-15.266667,0.0,-485.394286,-168.828571,-23.5,-11.586667,...,1.0,51.0,35.0,15.0,3.0,1.0,27.0,44.0,4.0,23.0
2,2019-02-10,-98.83,-5.644,-26.305,0.0,3.72,-232.251429,-106.468571,-36.12,-145.26,...,2.0,50.0,100.0,9.0,3.0,2.0,16.0,41.0,3.0,23.0
3,2019-02-17,-41.448571,-4.226667,-26.57,1.276,0.0,-233.122857,-106.342857,-24.146667,-34.084,...,2.0,52.0,78.0,11.0,3.0,3.0,11.0,49.0,4.0,23.0
4,2019-02-24,-35.362857,0.0,-28.106667,-20.31,0.0,-173.451429,-112.497143,-5.16,-24.512,...,1.0,50.0,40.0,11.0,3.0,2.0,13.0,48.0,4.0,20.0


In [3]:
print("I have ", len(df.columns), " columns to play with")

I have  132  columns to play with


# Normalise

- The dataset combines sources measured on different scales, so the values need to be normalised.
- Treating this as a time series problem would, as I see it, mean working with neural networks, which tend to be computationally intensive, so I will work without the *date* column. It will be used later for the dashboard.

In [4]:
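# Mean normalisation: centre each column on its mean and divide by its range (max - min).
# Note: this is not a z-score, which would divide by the standard deviation instead.
X = df.drop(columns=["date","unemployment"])  # feature DataFrame (drop returns a DataFrame, not a numpy array)
X = X.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
target = df["unemployment"]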
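
The cell above divides each column by its range rather than by its standard deviation. As a small sketch of the difference, the toy frame below (`demo`, a made-up example, not data from this notebook) is scaled both ways.

```python
import pandas as pd

# Toy data with two columns on very different scales (made-up values).
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 100.0]})

# Mean normalisation: centred on 0, and each column spans a range of exactly 1.
mean_norm = (demo - demo.mean()) / (demo.max() - demo.min())

# z-score: centred on 0, and each column has a standard deviation of 1.
z_score = (demo - demo.mean()) / demo.std(ddof=0)

print(mean_norm)
print(z_score)
```

Either scaling puts the columns on a comparable footing for a linear model; they differ mainly in how strongly a single extreme value compresses the rest of the column.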

In [5]:
X.head()

Unnamed: 0,protestas_x,extremismo_x,rebelion_x,refugiados_x,precio_petroleo_x,juicio_x,corrupcion_x,inestabilidad_politica_x,terrorismo_x,vigilancia_x,...,taxi,teletrabajo,tinder,uber,uber eats,videoconferencia,videollamada,vox,yoga,zoom
0,0.079454,-0.227,-0.04851,0.004997,0.055642,0.126841,-0.005161,0.04284,-0.40833,-0.129363,...,-0.059938,-0.125725,-0.244889,-0.18617,-0.394903,-0.090157,-0.069955,0.19115,0.040725,-0.098157
1,0.018951,0.038086,-0.013077,-0.073158,0.055642,-0.642666,-0.290909,-0.007129,0.077374,-0.53705,...,0.025428,-0.135826,-0.342929,-0.083606,-0.339958,-0.080056,-0.080056,0.106043,-0.053025,-0.108467
2,-0.295988,0.00574,-0.091062,0.06179,0.324817,-0.121146,-0.058478,-0.047218,-0.842861,-0.335707,...,0.647379,-0.125725,-0.362537,0.749727,-0.405892,-0.080056,-0.069955,-0.010978,-0.0999,-0.118776
3,-0.075805,0.019288,-0.093386,0.073069,0.055642,-0.122941,-0.05801,-0.009183,-0.077503,0.065125,...,0.464453,-0.125725,-0.323321,0.467676,-0.383914,-0.080056,-0.059854,-0.064169,0.0251,-0.108467
4,-0.052453,0.059688,-0.106864,-0.117738,0.055642,-7e-06,-0.080948,0.051131,-0.011607,-0.014582,...,0.24494,-0.135826,-0.362537,-0.019504,-0.383914,-0.080056,-0.069955,-0.042893,0.009475,-0.108467


In [6]:
target.head()

0    21.0
1    23.0
2    23.0
3    23.0
4    20.0
Name: unemployment, dtype: float64

# Selecting best features using RFE

(Recursive Feature Elimination)

In [8]:
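# Base estimator: ordinary least squares linear regression
rfc = LinearRegression()
# Note: despite the variable name, this is plain RFE, not RFECV (no cross-validation is involved).
# One feature is dropped per iteration until 15 remain.
rfecv = RFE(estimator=rfc, step=1, n_features_to_select=15)
rfecv.fit(X, target)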

RFE(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                               normalize=False),
    n_features_to_select=15, step=1, verbose=0)

0.9087502809346615
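
To show what a fitted `RFE` object exposes, here is a minimal sketch on synthetic data where the informative columns are known in advance. The names `X_demo`, `y_demo` and `selector`, and the sizes used, are assumptions for illustration. The score at the end is computed on the same rows used for fitting, so it is an in-sample R² and will usually be more optimistic than a held-out score.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, but only columns 0, 1 and 2 drive the target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = (
    3.0 * X_demo[:, 0]
    - 2.0 * X_demo[:, 1]
    + 1.5 * X_demo[:, 2]
    + rng.normal(scale=0.1, size=200)
)

selector = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask of the kept features; ranking_ is 1 for kept
# features and grows for features that were eliminated earlier.
selected = np.flatnonzero(selector.support_)
X_reduced = selector.transform(X_demo)         # keeps only the selected columns
in_sample_r2 = selector.score(X_demo, y_demo)  # R^2 on the training rows

print("selected columns:", selected)
print("ranking:", selector.ranking_)
print("in-sample R^2:", round(in_sample_r2, 4))
```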

# CSV with inferred keyword vs real keyword

- The last 3 rows are the truly interesting ones: they forecast 3 weeks in advance.

In [14]:
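result=pd.DataFrame()
result["date"]=df["date"]
result["real_searches"]=df["unemployment"]
# Predictions from the fitted RFE model for every row in the dataset
result["infered_searches"]=rfecv.predict(X)
# Search interest cannot be negative: clip at 0 and round to 2 decimals
result["infered_searches"]=result["infered_searches"].apply(lambda x: 0 if x<0 else round(x,2))
result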

Unnamed: 0,date,real_searches,infered_searches
0,2019-01-27,21.0,17.92
1,2019-02-03,23.0,22.33
2,2019-02-10,23.0,22.30
3,2019-02-17,23.0,22.23
4,2019-02-24,20.0,14.77
...,...,...,...
89,2020-10-11,30.0,36.72
90,2020-10-18,27.0,32.49
91,2020-10-25,0.0,1.47
92,2020-11-01,0.0,14.87


## CSV to append the weekly score

In [11]:
score = pd.DataFrame({"date": [max(df["date"])], 'score': [round(rfecv.score(X, target),4)]})
score.to_csv("input/weekly_score.csv")
score

Unnamed: 0,date,score
0,2020-11-08,0.9088
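
As written, `to_csv` uses its default write mode, so the file is replaced on every run rather than extended. A minimal sketch of one way to append instead is shown below. The helper name `append_score`, the demo file name, and the dates and scores are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd


def append_score(path, date, score):
    """Append one (date, score) row to a CSV, writing the header only once."""
    row = pd.DataFrame({"date": [date], "score": [round(score, 4)]})
    write_header = not os.path.exists(path)  # header only when the file is new
    row.to_csv(path, mode="a", header=write_header, index=False)


# Demo in a temporary directory with made-up dates and scores.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "weekly_score_demo.csv")
    append_score(demo_path, "2000-01-02", 0.5)
    append_score(demo_path, "2000-01-09", 0.75)
    history = pd.read_csv(demo_path)

print(history)
```

Writing with `index=False` avoids accumulating an extra unnamed index column each time the file is read back.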


## CSV to overwrite the weekly ranking of features by importance

In [15]:
features=pd.DataFrame()
features["features"]=X.columns
features["top_important"]=rfecv.ranking_
features.sort_values(by=["top_important"], inplace=True)
features.to_csv("input/ranking_of_features.csv")
features.head(25)
# Ranking of how important each keyword is for inferring Google searches
# for the keyword "unemployment" in Spain (rank 1 = kept by RFE)

Unnamed: 0,features,top_important
30,banco_mundial,1
125,videoconferencia,1
124,uber eats,1
57,bullying,1
118,skype,1
114,refugiados,1
59,caritas,1
63,comparecencia,1
105,pandemia,1
102,nacionalismo,1
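
All features kept by RFE share rank 1, so `ranking_` alone does not order them. One possible way to order the selected features is by the absolute value of the final model's coefficients, sketched below on synthetic data. The column names `kw_0` to `kw_5` and the variable names are hypothetical, and comparing coefficient sizes only makes sense when the features are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 features on the same scale, 3 of which drive the target.
rng = np.random.default_rng(7)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 6)), columns=[f"kw_{i}" for i in range(6)]
)
y_demo = (
    5.0 * X_demo["kw_0"]
    - 3.0 * X_demo["kw_1"]
    + 1.0 * X_demo["kw_2"]
    + rng.normal(scale=0.1, size=200)
)

selector = RFE(LinearRegression(), n_features_to_select=3).fit(X_demo, y_demo)

# estimator_ is the model refitted on the selected columns only,
# so its coefficients line up with the columns where support_ is True.
selected = X_demo.columns[selector.support_]
coefs = pd.Series(selector.estimator_.coef_, index=selected)
ordered = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ordered)
```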
