# Sentiment Analysis project.

I have requested data from **The GDELT Project** (see the Readme for details). The goals are:

- Get a sentiment analysis solution for Spanish and compare it with the sentiment metrics provided by GDELT.
- Automate it in Google Cloud Platform.
- Plot results with a sweet Data Studio dashboard.

![ Alt text](https://thumbs.gfycat.com/LeafyVibrantFreshwatereel-small.gif)

- Well, getting this dataset cost me 2.15€. Let's see what I've brought from BigQuery.

In [48]:
import pandas as pd
import re 

In [44]:
df = pd.read_csv("input/sentiment_analysis_160221.csv")
df

Unnamed: 0,Date,SourceCommonName,DocumentIdentifier,Sentiment,news_in_Spain
0,2019-01-24,ideal.es,https://www.ideal.es/economia/taxistas-barcelo...,-4.25,desempleo
1,2019-01-24,publico.es,https://www.publico.es/sociedad/proteccion-dat...,-3.22,desempleo
2,2019-01-24,regio7.cat,https://www.regio7.cat/arreu-catalunya-espanya...,-3.23,desempleo
3,2019-01-24,larioja.com,https://www.larioja.com/economia/taxistas-barc...,-4.25,desempleo
4,2019-01-24,vilaweb.cat,https://www.vilaweb.cat/noticies/els-taxistes-...,-3.26,desempleo
...,...,...,...,...,...
6792,2020-06-18,regiondigital.com,http://regiondigital.com/noticias/economia/329...,0.70,desempleo
6793,2020-06-18,elconfidencial.com,https://www.elconfidencial.com/espana/2020-06-...,-0.07,desempleo
6794,2020-06-18,elconfidencial.com,https://blogs.elconfidencial.com/economia/cons...,-0.13,desempleo
6795,2020-06-18,libertaddigital.com,https://www.libertaddigital.com/espana/2020-06...,-5.05,desempleo


- This is going to be fun. I have news in Spanish, Catalan, Valencian... Let's pray there's no Basque or Galician.
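
A quick way to get a feel for the language mix is to count sources by top-level domain. This is only a rough sketch: the helper `top_level_domain` is hypothetical, and the domain is a weak hint at best (a `.cat` site suggests Catalan, but `.es` and `.com` sites can publish in any language).

```python
from collections import Counter


def top_level_domain(source: str) -> str:
    """Return the text after the last dot of a source name, e.g. 'cat'."""
    return source.rsplit(".", 1)[-1].lower()


# Source names taken from the first rows displayed above.
sources = ["ideal.es", "publico.es", "regio7.cat", "larioja.com", "vilaweb.cat"]

# Count how many sources fall under each top-level domain.
print(Counter(top_level_domain(s) for s in sources))
```

On the full dataframe, the same helper could be applied with `df["SourceCommonName"].map(top_level_domain).value_counts()`.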

### 1. Data cleaning.

I need to extract as much information as possible from the DocumentIdentifier column, so I need to:
- Remove the name of the source.
- Strip the leftover https://www./ prefix.
- Create a list with the words that make up the URL.
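
As an alternative to chained string replacements, the three steps above can be sketched in one function with the standard library's `urllib.parse`. This is a minimal sketch, not the approach used in the cells below: `url_to_words` is a hypothetical helper, and the example URL is made up because the real ones are truncated in the output.

```python
import re
from typing import List
from urllib.parse import urlparse


def url_to_words(url: str) -> List[str]:
    """Return the words found in the path of a URL, ignoring scheme and host."""
    # urlparse separates scheme, host and path, so subdomains need no special case.
    path = urlparse(url).path
    # Split on the same separators used in this notebook and drop empty pieces.
    return [piece for piece in re.split(r"[-_/!]", path) if piece]


# Hypothetical URL, used only for illustration.
example = "https://blogs.example.com/economia/consultorio-laboral/2020-06-18/"
print(url_to_words(example))
```

Because the host is discarded as a whole, this variant does not depend on SourceCommonName matching the URL exactly. A file extension such as .html would stay attached to the last word.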

In [54]:
df = pd.read_csv("input/sentiment_analysis_160221.csv")
# the last column is not needed
df.drop(columns= 'news_in_Spain', inplace=True)
df.columns

Index(['Date', 'SourceCommonName', 'DocumentIdentifier', 'Sentiment'], dtype='object')

In [55]:
# removing the source from the url
df["DocumentIdentifier"] = [b.replace(a, '').strip() for a, b in zip(df["SourceCommonName"], df["DocumentIdentifier"])]
df.head(2)

Unnamed: 0,Date,SourceCommonName,DocumentIdentifier,Sentiment
0,2019-01-24,ideal.es,https://www./economia/taxistas-barcelona-desco...,-4.25
1,2019-01-24,publico.es,https://www./sociedad/proteccion-datos-aepd-pe...,-3.22


In [56]:
# removing the leftover protocol prefix
# (URLs on a subdomain, such as row 6794, are not matched by these patterns)
df["DocumentIdentifier"] = [x.replace("https://www./", "") for x in df["DocumentIdentifier"]]
df["DocumentIdentifier"] = [x.replace("http:///", "") for x in df["DocumentIdentifier"]]
df.head(2)

Unnamed: 0,Date,SourceCommonName,DocumentIdentifier,Sentiment
0,2019-01-24,ideal.es,economia/taxistas-barcelona-desconvocan-huelga...,-4.25
1,2019-01-24,publico.es,sociedad/proteccion-datos-aepd-perdona-multa-4...,-3.22


In [59]:
# split what is left of the URL into a list of words
df["DocumentIdentifier"] = [re.split('-|_|/|!', x) for x in df["DocumentIdentifier"]]

In [67]:
df

Unnamed: 0,Date,SourceCommonName,DocumentIdentifier,Sentiment
0,2019-01-24,ideal.es,"[economia, taxistas, barcelona, desconvocan, h...",-4.25
1,2019-01-24,publico.es,"[sociedad, proteccion, datos, aepd, perdona, m...",-3.22
2,2019-01-24,regio7.cat,"[arreu, catalunya, espanya, mon, 2019, 01, 24,...",-3.23
3,2019-01-24,larioja.com,"[economia, taxistas, barcelona, desconvocan, h...",-4.25
4,2019-01-24,vilaweb.cat,"[noticies, els, taxistes, desconvoquen, la, va...",-3.26
...,...,...,...,...
6792,2020-06-18,regiondigital.com,"[noticias, economia, 329040, mas, de, 500000, ...",0.70
6793,2020-06-18,elconfidencial.com,"[espana, 2020, 06, 18, mallorca, alemania, apl...",-0.07
6794,2020-06-18,elconfidencial.com,"[https:, , blogs., economia, consultorio, labo...",-0.13
6795,2020-06-18,libertaddigital.com,"[espana, 2020, 06, 18, la, salvaje, campana, d...",-5.05
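
The lists still contain dates, numeric ids, empty strings and, in row 6794, pieces of the protocol and subdomain. A possible next step, sketched here under the assumption that only purely alphabetic tokens are useful for sentiment, is to filter them. The helper `clean_tokens` is hypothetical and not part of the original notebook.

```python
def clean_tokens(tokens):
    """Keep only alphabetic tokens, dropping '', numbers and pieces such as 'https:'."""
    # str.isalpha() is False for empty strings and for anything with digits or punctuation.
    return [t.lower() for t in tokens if t.isalpha()]


# Tokens modelled on rows 6792 and 6794 shown above.
print(clean_tokens(["noticias", "economia", "329040", "mas", "de", "500000"]))
print(clean_tokens(["https:", "", "blogs.", "economia", "consultorio"]))
```

It could be applied with `df["DocumentIdentifier"].map(clean_tokens)`. Note that this filter also drops mixed tokens such as a word followed by digits, which may or may not be desirable.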
