# Data processing:

- Merge 4 datasets of different lengths by date
- Google searches dataset:
    - There are 3 columns: Date, keyword, and score, grouped by week. It needs to be restructured into one column per keyword, with the score as the row values (the score is the Google Trends trend index).
- Spanish news datasets:
    - There are 3 columns: Date, keyword and sentiment. The same keyword can appear many times the same day.
    - A column counting the occurrences of each keyword per day must be created, along with a column holding the mean sentiment per day.
    - Then a column called score will be created by multiplying occurrences * mean.
    - The final dataset will have Date plus one column per keyword, and every column will hold the score described above as its rows. Everything is grouped by week.

In [1]:
import pandas as pd

In [2]:
# on_bad_lines="warn" replaces error_bad_lines=False (deprecated in pandas 1.3, removed in 2.0):
# malformed rows are dropped and a warning is emitted
df_political=pd.read_csv("./input/dashboard_spanish_news_political.csv.gz",compression='gzip', header=0, quotechar='"', on_bad_lines="warn")
df_political.sort_values(by=["Date"],inplace=True)

df_economical=pd.read_csv("./input/dashboard_spanish_news_economical.csv.gz",compression='gzip', header=0, quotechar='"', on_bad_lines="warn")
df_economical.sort_values(by=["Date"],inplace=True)

df_social=pd.read_csv("./input/dashboard_spanish_news_social.csv.gz",compression='gzip', header=0, quotechar='"', on_bad_lines="warn")
df_social.sort_values(by=["Date"],inplace=True)

df_google=pd.read_csv("./input/data_pytrends.csv")
df_google.sort_values(by=["date"],inplace=True)

# Dataset previews:

In [3]:
df_political.head()

Unnamed: 0,political,Date,Sentiment
15773,juicio,2019-01-01,0.18
52050,seguridad_nacional,2019-01-01,-6.55
52051,seguridad_nacional,2019-01-01,-1.43
52052,seguridad_nacional,2019-01-01,-1.43
53246,inestabilidad_politica,2019-01-01,-0.3


In [5]:
df_google.drop(columns='Unnamed: 0', inplace=True)
#df_google.rename(columns={'date':"Date"}, inplace=True)
df_google.head()

Unnamed: 0,keyword,date,trend_index
0,zoom,2019-01-06,5
1410,bildu,2019-01-06,3
6204,uber eats,2019-01-06,10
4982,medico,2019-01-06,89
2632,productividad,2019-01-06,39


# Google Dataset: 

- Creating a column for each keyword with the trend_index value.

I'm going to create a dataframe with the set of dates and append each keyword's score to it as a new column (similar to get_dummies, but I can't use that here because I already have the values I want for each column).
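As an alternative to building the columns one by one, the same long-to-wide reshaping can be sketched with `DataFrame.pivot`. The toy values below are made up and only mirror the column names of `df_google`; `long_df` and `wide_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy long-format data mirroring the columns of df_google (values are made up)
long_df = pd.DataFrame({
    "keyword": ["zoom", "bildu", "zoom", "bildu"],
    "date": ["2019-01-06", "2019-01-06", "2019-01-13", "2019-01-13"],
    "trend_index": [5, 3, 4, 4],
})
long_df["date"] = pd.to_datetime(long_df["date"])

# One row per date, one column per keyword, trend_index as the cell value
wide_df = (
    long_df.pivot(index="date", columns="keyword", values="trend_index")
    .sort_index()
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "keyword" axis label
print(wide_df)
```

`pivot` aligns values by date instead of relying on every keyword having the same number of rows in the same order. It raises an error if a (date, keyword) pair appears more than once, which can serve as a sanity check on the input.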

In [6]:
# creating df
df_google_dates=pd.DataFrame()

# creating the Date column in new dataset
df_google_dates["date"]=list(set(df_google["date"]))
df_google_dates["date"]=pd.to_datetime(df_google_dates["date"])
df_google_dates.sort_values(by=["date"],inplace=True)
df_google_dates.head()

Unnamed: 0,date
22,2019-01-06
18,2019-01-13
31,2019-01-20
67,2019-01-27
86,2019-02-03


In [7]:
# Creating the new columns. Trend index with the name of the corresponding keyword
keyword_list=list(set(df_google["keyword"]))
keyword_list.sort()
for k in keyword_list:
    df_google_dates[k]=df_google[(df_google['keyword'] == k)]["trend_index"].tolist()
#df_google_dates.index=df_google_dates["date"]
#df_google_dates.drop(columns="date",inplace=True)
df_google_dates.head()

Unnamed: 0,date,amazon,autonomo,ayuda alquiler,badi,banco alimentos,barometro,bildu,bullying,cabify,...,taxi,teletrabajo,tinder,uber,uber eats,videoconferencia,videollamada,vox,yoga,zoom
22,2019-01-06,59,47,38,38,5,32,3,24,20,...,42,2,56,27,10,2,2,35,50,5
18,2019-01-13,50,44,32,51,12,28,4,27,30,...,49,1,51,35,15,3,1,27,44,4
31,2019-01-20,44,41,20,38,13,31,2,25,100,...,100,2,50,100,9,3,2,16,41,3
67,2019-01-27,46,51,19,45,6,44,4,30,68,...,85,2,52,78,11,3,3,11,49,4
86,2019-02-03,48,47,21,32,7,58,5,24,45,...,67,1,50,40,11,3,2,13,48,4


In [8]:
# let's check that it's right
print(list(df_google_dates["zoom"][:10]),
      "<==>",
      df_google[df_google["keyword"]=="zoom"]["trend_index"].tolist()[:10],
      ", allright then"
     )

[5, 4, 3, 4, 4, 4, 4, 4, 3, 3] <==> [5, 4, 3, 4, 4, 4, 4, 4, 3, 3] , allright then


# Sliding "unemployment" column.

- Now I have to remove the first 2 rows of the keyword "desempleo" and shift the rest of the column up to fill that space, so the last 2 rows will be empty
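A minimal sketch of that shift on a toy series, assuming `Series.shift` is acceptable here. The values are made up, and the vacated rows at the end become NaN rather than 0.

```python
import pandas as pd

# Toy stand-in for the "desempleo" column (values are made up)
s = pd.Series([10, 20, 30, 40, 50], name="desempleo")

# A negative shift drops the first 2 values and moves the rest up;
# the last 2 positions are left empty (NaN)
shifted = s.shift(-2)
print(shifted.tolist())
```

If zeros are preferred over NaN for the empty rows, `shifted.fillna(0)` would do that.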

In [10]:
# I should perform feature engineering before doing this, to check what's really going on

#desempleo_list=list(df_google_dates["desempleo"])

# delete the first 2 positions and pad the end with 0 (not the most elegant)
#desempleo_list.pop(0)
#desempleo_list.pop(0)
#desempleo_list.append(0)
#desempleo_list.append(0)

# add to the dataset
#df_google_dates["desempleo"]=desempleo_list

# ok, it works
#df_google_dates[["date","desempleo"]]

# Manipulating datasets with Spanish news and sentiment.

We'll need to:

- Create a column for each keyword
- Count occurrences of that keyword
- Measure average sentiment
- Group data by week, starting on Monday, to merge with the Google dataset
- Combine occurrences and sentiment into one column representative of both, for each keyword

In [9]:
# let's play with the 1st dataset and a random keyword, for instance
df_political[df_political["political"]=="juicio"].head()

Unnamed: 0,political,Date,Sentiment
15773,juicio,2019-01-01,0.18
15774,juicio,2019-01-01,0.18
15784,juicio,2019-01-01,-6.08
15777,juicio,2019-01-01,-4.06
15778,juicio,2019-01-01,-4.06


- So, I need to measure the average sentiment of each keyword per day

In [10]:
df_political[df_political["political"]=="juicio"].groupby("Date").mean().head()

Unnamed: 0_level_0,Sentiment
Date,Unnamed: 1_level_1
2019-01-01,-2.93
2019-01-02,-3.986667
2019-01-03,-2.35
2019-01-04,-4.8225
2019-01-05,-4.723333


- Also, counting occurrences of that keyword

In [11]:
df_political[df_political["political"]=="juicio"].groupby("Date").count().head()

Unnamed: 0_level_0,political,Sentiment
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-01,14,14
2019-01-02,18,18
2019-01-03,40,40
2019-01-04,16,16
2019-01-05,60,60


- Let's use an aggregate to perform both

In [12]:
df2=df_political[df_political["political"]=="juicio"].groupby(["Date"]).agg(['count','mean'])
# erase multiindex
df2.columns=df2.columns.droplevel(0)
df2.head()

Unnamed: 0_level_0,count,mean
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-01-01,14,-2.93
2019-01-02,18,-3.986667
2019-01-03,40,-2.35
2019-01-04,16,-4.8225
2019-01-05,60,-4.723333


- Great, now let's resample by week, with weeks ending on Sunday (W-SUN), so the row labels are Sundays like the dates in the Google Searches dataset

In [13]:
df2.index = pd.to_datetime(df2.index)
df2 = df2.resample('W-SUN').mean() # weekly averages of the daily count and mean
# score is how we are going to measure the keywords
df2["score"]=df2["count"]*df2["mean"]
df2.head()

Unnamed: 0_level_0,count,mean,score
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-01-06,27.666667,-4.097824,-113.373133
2019-01-13,68.857143,-6.805291,-468.592876
2019-01-20,36.857143,-6.178493,-227.721615
2019-01-27,37.714286,-5.767773,-217.52743
2019-02-03,36.0,-4.441655,-159.899564
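To make the `W-SUN` binning concrete, here is a small sketch on made-up daily values. With the default settings, each bin is labeled by the Sunday that closes it, so 2019-01-01 (a Tuesday) through 2019-01-06 land in the row labeled 2019-01-06.

```python
import pandas as pd

# Eight consecutive days starting on Tuesday 2019-01-01 (values are made up)
daily = pd.Series(
    [1, 2, 3, 4, 5, 6, 7, 8],
    index=pd.date_range("2019-01-01", periods=8, freq="D"),
)

# Weekly bins that end on Sunday and are labeled with that Sunday
weekly = daily.resample("W-SUN").mean()
print(weekly)
```

The first row averages the six days up to Sunday 2019-01-06, and the second row averages the remaining two days under the label 2019-01-13.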


OK, now I know how to do it, so let's continue by creating a function that performs this for every keyword in every Spanish news dataset

- I need to create an empty dataframe
- loop over each keyword from a set of keywords
- perform what I did before for all keywords
- concat to the empty dataframe
- put all this in a function
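The steps above can also be sketched without an explicit loop, by grouping on date and keyword at once and then unstacking. This is an alternative sketch, not the notebook's function: `keyword_scores` is a hypothetical helper, and the toy data only mirrors the `Date` and `Sentiment` column names with made-up values.

```python
import pandas as pd

def keyword_scores(df, column):
    """Hypothetical helper: daily score (count * mean sentiment), one column per keyword."""
    # Count and mean of Sentiment for every (date, keyword) pair
    daily = df.groupby(["Date", column])["Sentiment"].agg(["count", "mean"])
    daily["score"] = daily["count"] * daily["mean"]
    # Move the keyword level into the columns: one column per keyword
    wide = daily["score"].unstack(column)
    wide.index = pd.to_datetime(wide.index)
    wide.index.name = "date"
    wide.columns.name = None
    return wide.sort_index()

# Toy data (values are made up)
news = pd.DataFrame({
    "political": ["juicio", "juicio", "vox", "juicio"],
    "Date": ["2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02"],
    "Sentiment": [1.0, 3.0, -2.0, -4.0],
})
scores = keyword_scores(news, "political")
print(scores)
```

Keywords with no articles on a given day come out as NaN, as in the looped version. Because the index is already a datetime index, a weekly `resample` can be applied directly to the result.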

In [64]:
# pending: remove this and move it to a separate script
def creating_dataset(df,column):
    '''
    column is the name of the column that holds the keywords: political, social or economical, depending on the dataset
    '''
    
    # list of new columns
    list_keywords=list(set(df[column]))
    # creating empty dataframe to append info
    df_final=pd.DataFrame()
    df_final["date"]=list(set(df["Date"]))
    
    #date1 = '2019-01-06'
    #date2 = max(df["date"])
    #mydates = pd.date_range(date1, date2, freq="D").tolist()
    #df_final=pd.DataFrame()
    #df_final["date"]=mydates
    
    for k in list_keywords:
        # creating a new dataframe for every keyword in the column, getting the occurrences of keyword and mean of sentiment
        df4=pd.DataFrame()
        df4=df[df[column]==k].groupby(["Date"]).agg(['count','mean'])
        # erase multiindex
        df4.columns=df4.columns.droplevel(0)
        # this will be our score, occurrences * mean 
        df4[k]=df4["count"]*df4["mean"]
        # date column to perform the join by it
        df4["date"]=df4.index
        df4.drop(columns=["count","mean"],inplace=True)
        # this is where we combine the empty dataset, every keyword in its place
        df_final=df_final.merge(df4,how='left', left_on='date', right_on='date')
        
        
        # PROBLEMHERE DATES !!!! df_final["date"]=pd.to_datetime(df_final["date"])

        df_final.sort_values(by=["date"],inplace=True)


    # make datetime index for weekly resampling
    #df_final["date"]=pd.to_datetime(df_final['date']) 
    #df_final.index=df_final["date"]
    # resampling
    #df_final = df_final.resample('W-SUN').mean() #weekly totals
    #df_final.sort_values(by="date", ascending=True, inplace=True)
    # filling gaps
    #df_final=df_final.fillna(0)
    
    # this is for the future join
    
    return df_final

In [65]:
#from my_functions import creating_dataset

In [66]:
dfp = creating_dataset(df_political,"political")
dfs = creating_dataset(df_social,"social")
dfe = creating_dataset(df_economical,"economical")

In [26]:
dfp.head()

Unnamed: 0,date,juicio,precio_petroleo,rebelion,protestas,corrupcion,seguridad_nacional,extremismo,vigilancia,ejercito,inestabilidad_politica,refugiados,terrorismo
0,2019-01-01,-41.02,,,10.12,,-15.96,,-3.64,-26.98,-0.6,,
1,2019-01-02,-71.76,,-43.28,-19.34,-98.72,-5.9,,-93.02,-64.98,,-0.84,-7.34
2,2019-01-03,-94.0,,,-20.36,-7.56,-102.26,,-7.44,-6.44,,,-110.1
3,2019-01-04,-77.16,,-15.94,-20.28,-71.02,1.9,-20.18,-22.0,2.74,-14.94,-14.06,-147.9
4,2019-01-05,-283.4,,,38.16,-154.82,,3.42,,-9.34,,-0.2,-117.96


In [27]:
dfs.head()

Unnamed: 0,date,ciencia,censura_en_medios,emprendimiento,agresion_sexual,racismo,emergencia_sanitaria,vacunas,inmigracion,subsidios,precio_vivienda,enfermedades_muy_infecciosas,energias_renovables
0,2019-01-01,,,,,,-35.6,-0.32,-7.9,,,-53.3,
1,2019-01-02,-1.42,,,-10.52,,-324.6,-47.46,-4.68,,,-147.4,
2,2019-01-03,,,,,,-24.8,,,,,-43.34,
3,2019-01-04,,,,,,-20.82,,-5.24,,,-21.3,
4,2019-01-05,-10.78,,,-141.32,,-101.9,,,,,-715.36,


In [29]:
dfe.head()

Unnamed: 0,date,prosperidad_economica_y_finanzas,precio_petroleo,finanzas_y_bancos,quiebra_economica,job_quality_&_labor_market_performance,stock_market,crecimiento_economico,desempleo,macroeconomia_deuda_y_vulnerabilidad,libre_comercio,incertidumbre_economica,banco_mundial,pobreza,inflacion_economica
0,2019-01-01,,,,,,-1.54,,,,,-190.64,,-7.02,
1,2019-01-02,,,,,,-9.48,,-113.86,-97.2,,-537.68,,-169.42,
2,2019-01-03,,,,,,-5.84,,-34.38,-8.04,,-1184.68,,-7.28,
3,2019-01-04,,,,,,-30.34,,-12.4,-25.96,,-805.96,,-18.42,
4,2019-01-05,,,,,,-42.8,,-7.34,-19.62,,-611.46,,-5.2,


-  **Don't worry about the NaN values, I'll deal with them later**

# Merging all datasets

- Create an empty dataframe.
- Create a column in it with all the dates
- Use a left join on the date column to put the keywords of all the other datasets in the proper place
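One way to sketch the chained left joins is with `functools.reduce`, assuming every frame has a `date` column of the same dtype. The frames `dates`, `a` and `b` are made-up stand-ins for the date scaffold and the keyword datasets.

```python
from functools import reduce

import pandas as pd

# Scaffold with all the dates we want to keep (three Sundays)
dates = pd.DataFrame({"date": pd.date_range("2019-01-06", periods=3, freq="W-SUN")})

# Toy keyword frames with gaps (values are made up)
a = pd.DataFrame({"date": pd.to_datetime(["2019-01-06", "2019-01-20"]),
                  "juicio": [-113.4, -227.7]})
b = pd.DataFrame({"date": pd.to_datetime(["2019-01-13"]), "zoom": [4]})

# Left-join each frame onto the scaffold in turn; missing dates become NaN
merged = reduce(
    lambda left, right: left.merge(right, how="left", on="date"),
    [a, b],
    dates,
)
print(merged)
```

Starting the reduction from the date scaffold keeps every date exactly once, as long as each joined frame has at most one row per date.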

In [32]:
from datetime import datetime, date

In [33]:
# creating final dataset with everything
date1 = '2019-01-01'
date2 = datetime.now().date()
mydates = pd.date_range(date1, date2, freq="D").tolist()
df_final=pd.DataFrame()
df_final["date"]=mydates
df_final.head()

Unnamed: 0,date
0,2019-01-01
1,2019-01-02
2,2019-01-03
3,2019-01-04
4,2019-01-05


In [36]:
datasets = [ dfp, dfe, dfs, df_google_dates] 

In [41]:
df_google_dates["date"]=df_google_dates["date"].astype(int)

In [45]:
for d in datasets:
    d['date']=d['date'].astype(int)
    #print(d["date"].dtype)

ValueError: invalid literal for int() with base 10: '2019-01-01'
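The error message shows that at least one `date` column holds strings such as '2019-01-01', which cannot be cast to int. A possible fix, sketched here on made-up frames, is to convert every `date` column to datetime so that all frames share one key dtype before merging. `news_like` and `google_like` are illustrative names.

```python
import pandas as pd

# One frame with string dates, one already in datetime (values are made up)
news_like = pd.DataFrame({"date": ["2019-01-01", "2019-01-02"],
                          "juicio": [-41.02, -71.76]})
google_like = pd.DataFrame({"date": pd.to_datetime(["2019-01-06"]), "zoom": [5]})

# pd.to_datetime handles both cases, so every frame ends up with datetime64 keys
for frame in (news_like, google_like):
    frame["date"] = pd.to_datetime(frame["date"])

print(news_like["date"].dtype, google_like["date"].dtype)
```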

In [35]:
#df42=df_final.merge(dfp,how='left', left_on='date', right_on='date')

In [25]:
# i need to concat instead of merging
'''for d in dfs:
    df_final=df_final.merge(dfe,how='left', left_on='date', right_on='date')
df_final=df_final.fillna(0)'''


"for d in dfs:\n    df_final=df_final.merge(dfe,how='left', left_on='date', right_on='date')\ndf_final=df_final.fillna(0)"

In [26]:
df_final.head()

Unnamed: 0,date
0,2019-01-06
1,2019-01-13
2,2019-01-20
3,2019-01-27
4,2019-02-03


In [27]:
df_final.columns

Index(['date'], dtype='object')