# Data processing:

- Merge 4 datasets of different lengths by date
- Google searches dataset:
    - There are 3 columns: Date, keyword, and score, with one row per keyword per week. It needs to be restructured into one column per keyword, with the score as the row values (the score is the Google Trends trend index).
- Spanish news datasets:
    - There are 3 columns: Date, keyword and sentiment. The same keyword can appear many times the same day.
    - A column counting the occurrences of each keyword per day must be created, along with a column holding the mean sentiment per day.
    - Then a column called score will be created by multiplying occurrences * mean.
    - The final dataset will have Date plus one column per keyword, each column holding the score described above as its rows. Everything is grouped by week.

In [1]:
import pandas as pd

In [2]:
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df_political=pd.read_csv("./input/dashboard_spanish_news_political.csv.gz",compression='gzip', header=0, quotechar='"', on_bad_lines='skip')
df_political.sort_values(by=["Date"],inplace=True)

df_economical=pd.read_csv("./input/dashboard_spanish_news_economical.csv.gz",compression='gzip', header=0, quotechar='"', on_bad_lines='skip')
df_economical.sort_values(by=["Date"],inplace=True)

df_social=pd.read_csv("./input/dashboard_spanish_news_social.csv.gz",compression='gzip', header=0, quotechar='"', on_bad_lines='skip')
df_social.sort_values(by=["Date"],inplace=True)

df_google=pd.read_csv("./input/data_pytrends.csv")
df_google.sort_values(by=["date"],inplace=True)

In [3]:
# let's check dates out because they will be a pain
df_google["date"].dtypes

dtype('O')

In [4]:
df_political["Date"].dtypes

dtype('O')

In [5]:
df_social["Date"].dtypes

dtype('O')

In [6]:
df_economical["Date"].dtypes

dtype('O')
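
All four date columns are read as plain strings (`dtype('O')`). A minimal sketch of converting such a column to real datetimes before sorting or resampling, using a made-up three-row sample rather than the actual CSV files:

```python
import pandas as pd

# Hypothetical sample mimicking the string dates in the CSV files
sample = pd.DataFrame({"Date": ["2019-01-03", "2019-01-01", "2019-01-02"]})

# Parse the strings into datetime64 values; the format is assumed to be ISO (YYYY-MM-DD)
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Sorting now works on real dates rather than on text
sample = sample.sort_values("Date").reset_index(drop=True)
print(sample["Date"].dtype)
print(sample)
```

With ISO-formatted strings, sorting the text happens to give the same order, but resampling by week only works once the column is a datetime type.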

# Dataset previews:

In [7]:
# count occurrences, average of sentiment, multiply both for new column
df_political.head()

Unnamed: 0,political,Date,Sentiment
15773,juicio,2019-01-01,0.18
52050,seguridad_nacional,2019-01-01,-6.55
52051,seguridad_nacional,2019-01-01,-1.43
52052,seguridad_nacional,2019-01-01,-1.43
53246,inestabilidad_politica,2019-01-01,-0.3


In [8]:
df_google.drop(columns='Unnamed: 0', inplace=True)
#df_google.rename(columns={'date':"Date"}, inplace=True)
df_google.head()

Unnamed: 0,keyword,date,trend_index
0,zoom,2019-01-06,5
1410,bildu,2019-01-06,3
6204,uber eats,2019-01-06,10
4982,medico,2019-01-06,89
2632,productividad,2019-01-06,39


# Google Dataset: 

- Creating a column for each keyword with the trend_index value.

I'm going to create a dataframe with the set of dates and add each keyword's score to it as a column (a bit like get_dummies, but I can't use that here because I already have the values I want in each column).

In [9]:
# creating df
df_google_dates=pd.DataFrame()

# creating the Date column in new dataset
df_google_dates["date"]=list(set(df_google["date"]))
df_google_dates["date"]=pd.to_datetime(df_google_dates["date"])
df_google_dates.sort_values(by=["date"],inplace=True)
df_google_dates.head()

Unnamed: 0,date
36,2019-01-06
7,2019-01-13
64,2019-01-20
38,2019-01-27
48,2019-02-03


In [10]:
df_google_dates.date.dtypes

dtype('<M8[ns]')

- Add the keyword columns to the dataframe of dates

In [11]:
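# Creating the new columns. Trend index with the name of the corresponding keyword
# note: assigning a list is positional, so this relies on df_google being sorted by date
# and on every keyword having exactly one row per date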
keyword_list=list(set(df_google["keyword"]))
keyword_list.sort()
for k in keyword_list:
    df_google_dates[k]=df_google[(df_google['keyword'] == k)]["trend_index"].tolist()
#df_google_dates.index=df_google_dates["date"]
#df_google_dates.drop(columns="date",inplace=True)
df_google_dates.head()

Unnamed: 0,date,amazon,autonomo,ayuda alquiler,badi,banco alimentos,barometro,bildu,bullying,cabify,...,taxi,teletrabajo,tinder,uber,uber eats,videoconferencia,videollamada,vox,yoga,zoom
36,2019-01-06,59,47,38,38,5,32,3,24,20,...,42,2,56,27,10,2,2,35,50,5
7,2019-01-13,50,44,32,51,12,28,4,27,30,...,49,1,51,35,15,3,1,27,44,4
64,2019-01-20,44,41,20,38,13,31,2,25,100,...,100,2,50,100,9,3,2,16,41,3
38,2019-01-27,46,51,19,45,6,44,4,30,68,...,85,2,52,78,11,3,3,11,49,4
48,2019-02-03,48,47,21,32,7,58,5,24,45,...,67,1,50,40,11,3,2,13,48,4


In [12]:
# lets check it out if it's right
print(list(df_google_dates["zoom"][:10]),
      "<==>",
      df_google[df_google["keyword"]=="zoom"]["trend_index"].tolist()[:10],
      ", allright then"
     )

[5, 4, 3, 4, 4, 4, 4, 4, 3, 3] <==> [5, 4, 3, 4, 4, 4, 4, 4, 3, 3] , allright then
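
The same long-to-wide reshaping can probably be done in one step with `DataFrame.pivot`, which aligns values by date instead of by position. This is a sketch on a made-up four-row sample with the same column names as `df_google`, not a rerun of the real data:

```python
import pandas as pd

# Hypothetical long-format sample with the same columns as df_google
long_df = pd.DataFrame({
    "keyword": ["zoom", "bildu", "zoom", "bildu"],
    "date": ["2019-01-06", "2019-01-06", "2019-01-13", "2019-01-13"],
    "trend_index": [5, 3, 4, 4],
})
long_df["date"] = pd.to_datetime(long_df["date"])

# One row per date, one column per keyword, trend_index as the cell values
wide_df = (
    long_df.pivot(index="date", columns="keyword", values="trend_index")
    .sort_index()
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "keyword" label on the columns
print(wide_df)
```

`pivot` raises an error if a keyword has two rows for the same date, which doubles as a sanity check on the input.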


# Sliding "unemployment" column.

- Now I have to remove the first 2 rows of the keyword "desempleo" and shift the rest of the column up to fill the gap, so the last 2 rows will be empty.
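
A minimal sketch of that shift using `Series.shift`, on made-up values standing in for the real "desempleo" column. With `shift(-2)` the emptied rows at the end become NaN rather than 0:

```python
import pandas as pd

# Hypothetical weekly values standing in for df_google_dates["desempleo"]
demo = pd.DataFrame({
    "date": pd.date_range("2019-01-06", periods=5, freq="W-SUN"),
    "desempleo": [10, 20, 30, 40, 50],
})

# shift(-2) discards the first two values and moves the rest up two rows;
# the last two rows are left empty (NaN)
demo["desempleo_shifted"] = demo["desempleo"].shift(-2)
print(demo)
```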

In [13]:
# I should perform feature engineering before doing this, to check what's really going on

#desempleo_list=list(df_google_dates["desempleo"])

# delete the first 2 positions and add empty ones at the end (not the most elegant)
#desempleo_list.pop(0)
#desempleo_list.pop(0)  # pop(0) again: after the first pop, the old second element sits at position 0
#desempleo_list.append(0)
#desempleo_list.append(0)

# add to the dataset
#df_google_dates["desempleo"]=desempleo_list

# ok, it works
#df_google_dates[["date","desempleo"]]

# Manipulating datasets with Spanish news and sentiment.

We'll need to:

- Create a column for each keyword
- Count occurrences of that keyword
- Measure average sentiment
- Group data by week (weeks ending on Sunday, pandas `W-SUN`), so the dates line up with the Google dataset
- Combine occurrences and sentiment into one column representative of both, for each keyword

In [14]:
# let's play with the 1st dataset and a random keyword, for instance
df_political[df_political["political"]=="juicio"].head()

Unnamed: 0,political,Date,Sentiment
15773,juicio,2019-01-01,0.18
15774,juicio,2019-01-01,0.18
15784,juicio,2019-01-01,-6.08
15777,juicio,2019-01-01,-4.06
15778,juicio,2019-01-01,-4.06


- So, I need to measure the average of sentiment of each keyword per day

In [15]:
df_political[df_political["political"]=="juicio"].groupby("Date").mean(numeric_only=True).head()

Unnamed: 0_level_0,Sentiment
Date,Unnamed: 1_level_1
2019-01-01,-2.93
2019-01-02,-3.986667
2019-01-03,-2.35
2019-01-04,-4.8225
2019-01-05,-4.723333


- Also, counting occurrences of that keyword

In [16]:
df_political[df_political["political"]=="juicio"].groupby("Date").count().head()

Unnamed: 0_level_0,political,Sentiment
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-01,14,14
2019-01-02,18,18
2019-01-03,40,40
2019-01-04,16,16
2019-01-05,60,60


- Let's use an aggregate to perform both

In [17]:
# select the Sentiment column so count and mean are computed on it only (no multiindex to flatten)
df2=df_political[df_political["political"]=="juicio"].groupby(["Date"])["Sentiment"].agg(['count','mean'])
df2.head()

Unnamed: 0_level_0,count,mean
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-01,14,-2.93
2019-01-02,18,-3.986667
2019-01-03,40,-2.35
2019-01-04,16,-4.8225
2019-01-05,60,-4.723333


- Great, now let's resample by week with `W-SUN` (weekly bins labelled with the Sunday they end on), so the dates match those in the Google Searches dataset

In [18]:
df2.index = pd.to_datetime(df2.index)
df2 = df2.resample('W-SUN').mean() # weekly averages of the daily count and daily mean
# score is how we are going to measure the keywords
df2["score"]=df2["count"]*df2["mean"]
df2.head()

Unnamed: 0_level_0,count,mean,score
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-01-06,27.666667,-4.097824,-113.373133
2019-01-13,68.857143,-6.805291,-468.592876
2019-01-20,36.857143,-6.178493,-227.721615
2019-01-27,37.714286,-5.767773,-217.52743
2019-02-03,36.0,-4.441655,-159.899564
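
To make the `W-SUN` labelling concrete, here is a small sketch on made-up daily values. Each weekly bin runs from Monday to Sunday and is labelled with the Sunday that closes it:

```python
import pandas as pd

# Hypothetical daily values: Monday 2018-12-31 through Monday 2019-01-07
daily = pd.Series(
    range(8),
    index=pd.date_range("2018-12-31", periods=8, freq="D"),
)

# Weekly sums; the first seven days land in the bin labelled Sunday 2019-01-06,
# and the last day (a Monday) opens the bin labelled 2019-01-13
weekly = daily.resample("W-SUN").sum()
print(weekly)
```

So a row dated 2019-01-06 in the resampled news data covers the seven days up to and including that Sunday.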


OK, now I know how to do it, so let's write a function that does this for every keyword in every Spanish news dataset.

- I need to create an empty dataframe,
- loop over each keyword in the set of keywords,
- perform what I did before for every keyword,
- merge the result into the empty dataframe,
- put all this in a function.
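
The steps above can also be sketched without an explicit loop, by grouping on date and keyword at once and then unstacking the keywords into columns. `weekly_scores` is a hypothetical helper name and the sample rows are made up; the column names `Date` and `Sentiment` are assumed to match the news datasets:

```python
import pandas as pd

def weekly_scores(df, keyword_col):
    """Hypothetical helper: weekly mean of the daily (count * mean sentiment) per keyword."""
    daily = (
        df.assign(Date=pd.to_datetime(df["Date"]))
        .groupby(["Date", keyword_col])["Sentiment"]
        .agg(["count", "mean"])
    )
    # Daily score per keyword, then one column per keyword
    score = (daily["count"] * daily["mean"]).unstack(keyword_col)
    # Weekly bins ending on Sunday; NaN days are ignored by mean()
    weekly = score.resample("W-SUN").mean()
    weekly.columns.name = None
    return weekly.rename_axis("date").reset_index()

# Made-up rows with the same layout as df_political
sample = pd.DataFrame({
    "political": ["juicio", "juicio", "juicio", "vox"],
    "Date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-08"],
    "Sentiment": [1.0, 3.0, -2.0, 0.5],
})
print(weekly_scores(sample, "political"))
```

Note that occurrences * mean for a single day is simply the sum of that day's sentiment values, so the daily score could equally be computed with `sum`.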

In [19]:
# pending: remove this and move it to a separate script
def creating_dataset(df,column):
    '''
    column is the column that holds the keywords: "political", "social" or "economical", depending on the dataset.
    Returns one row per week (W-SUN) with one score column per keyword, plus a date column.
    Note: the Date column of df is converted to datetime in place.
    '''
    df["Date"]=pd.to_datetime(df["Date"])
    # list of new columns
    list_keywords=list(set(df[column]))
    # dataframe holding every date, to which the keyword scores are joined
    df_final=pd.DataFrame()
    df_final["date"]=list(set(df["Date"]))
    
    for k in list_keywords:
        # for every keyword, get its occurrences and its mean sentiment per day
        df4=df[df[column]==k].groupby(["Date"])["Sentiment"].agg(['count','mean'])
        # this will be our score, occurrences * mean 
        df4[k]=df4["count"]*df4["mean"]
        # date column to perform the join by it
        df4["date"]=df4.index
        df4.drop(columns=["count","mean"],inplace=True)
        # left join on date, so every keyword score lands on its own day
        df_final=df_final.merge(df4,how='left', on='date')

    # resampling 
    # move the date column to the index, resample by week, then turn the index back into a column for the later join
    df_final=df_final.set_index("date")
    df_final = df_final.resample('W-SUN').mean() # weekly mean of the daily scores
    df_final.sort_index(inplace=True)
    df_final["date"]=df_final.index
    df_final.reset_index(drop=True, inplace=True)
    
    
    return df_final

In [20]:
#from my_functions import creating_dataset

In [21]:
dfp = creating_dataset(df_political,"political")
dfs = creating_dataset(df_social,"social")
dfe = creating_dataset(df_economical,"economical")

In [22]:
dfe.date.dtypes

dtype('<M8[ns]')

In [23]:
dfp.head()

Unnamed: 0,inestabilidad_politica,terrorismo,juicio,vigilancia,protestas,corrupcion,precio_petroleo,rebelion,extremismo,refugiados,ejercito,seguridad_nacional,date
0,-7.77,-82.14,-111.88,-31.525,-0.986667,-92.164,,-21.453333,-29.993333,-6.425,-20.473333,-30.555,2019-01-06
1,-23.5,-11.586667,-485.394286,-67.15,-16.754286,-168.828571,,-17.413333,-2.26,-15.266667,-31.36,-5.313333,2019-01-13
2,-36.12,-145.26,-232.251429,-49.556,-98.83,-106.468571,3.72,-26.305,-5.644,,-23.7,-3.22,2019-01-20
3,-24.146667,-34.084,-233.122857,-14.53,-41.448571,-106.342857,,-26.57,-4.226667,1.276,-25.508571,-16.08,2019-01-27
4,-5.16,-24.512,-173.451429,-21.495,-35.362857,-112.497143,,-28.106667,,-20.31,-6.876667,-1.48,2019-02-03


In [24]:
dfs.head()

Unnamed: 0,emergencia_sanitaria,enfermedades_muy_infecciosas,ciencia,emprendimiento,precio_vivienda,subsidios,agresion_sexual,energias_renovables,inmigracion,censura_en_medios,vacunas,racismo,date
0,-88.38,-195.31,-6.68,1.68,,,-52.646667,,-1.45,,-23.89,,2019-01-06
1,-44.537143,-69.154286,-36.077143,5.3,-2.64,,-109.376,,0.896667,,-11.04,,2019-01-13
2,-56.36,-109.811429,-7.24,,,,-8.7,,3.912,,-5.89,,2019-01-20
3,-28.24,-101.22,-26.273333,-3.932,,,-17.14,,-6.73,,,,2019-01-27
4,-56.934286,-120.577143,-8.63,5.64,0.34,,-164.08,-2.78,-79.54,1.74,,,2019-02-03


In [25]:
dfe.head()

Unnamed: 0,precio_petroleo,macroeconomia_deuda_y_vulnerabilidad,prosperidad_economica_y_finanzas,job_quality_&_labor_market_performance,incertidumbre_economica,inflacion_economica,banco_mundial,finanzas_y_bancos,quiebra_economica,libre_comercio,crecimiento_economico,desempleo,pobreza,stock_market,date
0,,-37.705,,,-631.063333,,,,,,,-41.995,-41.553333,-19.043333,2019-01-06
1,,-7.46,,-9.82,-750.242857,,,,,-17.82,,-5.933333,-66.84,-14.948571,2019-01-13
2,3.72,-7.76,,-2.368,-791.151429,7.44,,,-11.61,,,-35.166667,-23.345714,-7.531429,2019-01-20
3,,-21.0,,-10.792,-770.017143,,,,0.78,,1.99,-14.246667,-12.205714,-16.442857,2019-01-27
4,,-29.86,,-3.333333,-696.837143,,-0.5,,-8.773333,,,-11.875,-41.56,-12.111429,2019-02-03


-  **Don't worry about the NaN values, I'll deal with them later**

# Merging all datasets

- Create an empty dataframe.
- Create a column in it holding all the dates.
- Use a left join on the date column to put the keywords of all the other datasets in the proper place.
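
A minimal sketch of that chain of left joins using `functools.reduce`, on made-up weekly frames standing in for `dfp`, `dfe`, `dfs` and `df_google_dates`. The `validate="one_to_one"` argument is an addition of mine: it makes pandas raise an error if any date appears more than once in either frame:

```python
from functools import reduce
import pandas as pd

# Hypothetical weekly frames; values are made up
weeks = pd.date_range("2019-01-06", periods=3, freq="W-SUN")
base = pd.DataFrame({"date": weeks})
news = pd.DataFrame({"date": weeks[:2], "juicio": [-111.88, -485.39]})
google = pd.DataFrame({"date": weeks[1:], "zoom": [4, 3]})

# Left-join each frame onto the base calendar, one after another
merged = reduce(
    lambda left, right: left.merge(right, how="left", on="date", validate="one_to_one"),
    [news, google],
    base,
)
print(merged)
```

Weeks missing from a source frame come out as NaN. If two frames share a column name (`precio_petroleo` appears in both the political and the economical outputs above), `merge` keeps both columns and appends `_x` and `_y` suffixes, so it is worth checking the final column list.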

In [26]:
from datetime import datetime, date

In [27]:
# creating final dataset with everything
date1 = '2019-01-01'
date2 = datetime.now().date()
mydates = pd.date_range(date1, date2, freq="W").tolist()
df_final=pd.DataFrame()
df_final["date"]=mydates
df_final['date']=pd.to_datetime(df_final["date"])
df_final.head()

Unnamed: 0,date
0,2019-01-06
1,2019-01-13
2,2019-01-20
3,2019-01-27
4,2019-02-03


In [28]:
df_final.date.dtypes

dtype('<M8[ns]')

In [29]:
datasets = [ dfp, dfe, dfs,df_google_dates] 

In [30]:
#df.join(other.set_index('key'), on='key')

In [31]:
for d in datasets:
    #display(df_final.set_index('date').join(d.set_index('date')))
    #df_final=df_final.join(d.set_index("date"), on="date")
    df_final=df_final.merge(d,how='left', left_on="date", right_on="date")

In [40]:
df_final=df_final.fillna(0)

In [41]:
df_final.to_csv("./input/dataset_fnal.csv")

- The Google dataset has been a pain, so let's check that it merged correctly

In [35]:
pd.DataFrame(df_final[["zoom","date"]]).tail()

Unnamed: 0,zoom,date
90,20.0,2020-09-27
91,20.0,2020-10-04
92,18.0,2020-10-11
93,21.0,2020-10-18
94,,2020-10-25


In [38]:
pd.DataFrame(df_google_dates[["zoom","date"]]).tail()

Unnamed: 0,zoom,date
25,22,2020-09-20
67,20,2020-09-27
81,20,2020-10-04
74,18,2020-10-11
84,21,2020-10-18


- further checks done.

# All right! I have my dataset. Now it's time for the feature engineering fun!

- Find better features to work with "unemployment". Greedy algorithm.
- Move the "unemployment" rows 3 weeks ahead: link every row of the dataset with the unemployment value 3 weeks later, so that when the final date (this week) is reached, that row can be used to infer unemployment 3 weeks ahead.
- Normalise and standardise the dataset.
- Apply my personal weights, so that weeks closer to the current date are worth more than distant weeks.
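
A rough sketch of three of these steps (the 3-week target shift, standardisation and recency weights) on made-up numbers. The exponential decay and its value of 0.9 are assumptions for illustration only; the actual weighting scheme is not defined here:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly data; "zoom" and "desempleo" stand in for the real columns
demo = pd.DataFrame({
    "date": pd.date_range("2019-01-06", periods=6, freq="W-SUN"),
    "zoom": [5.0, 4.0, 3.0, 4.0, 6.0, 8.0],
    "desempleo": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
})

# Target: unemployment 3 weeks after each row (the last 3 rows have no target yet)
demo["desempleo_t_plus_3"] = demo["desempleo"].shift(-3)

# Standardise a feature to zero mean and unit variance
demo["zoom_std"] = (demo["zoom"] - demo["zoom"].mean()) / demo["zoom"].std(ddof=0)

# Recency weights: assumed exponential decay, the newest week gets weight 1
decay = 0.9  # hypothetical value
age_in_weeks = np.arange(len(demo))[::-1]
demo["weight"] = decay ** age_in_weeks
print(demo)
```

The rows whose target is NaN are the ones that would be used for inference once a model has been fitted on the earlier rows.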