# Sentiment Analysis project.

I requested data from **The GDELT Project** (see the Readme for details). The goals are:

- Build a sentiment analysis solution for the Spanish language and compare it with the sentiment metrics provided by GDELT.
- Automate it in Google Cloud Platform.
- Plot results with a sweet Data Studio dashboard.

![ Alt text](https://thumbs.gfycat.com/LeafyVibrantFreshwatereel-small.gif)

- Well, this dataset cost me €2.15. Let's see what I've brought from BigQuery.

In [136]:
import pandas as pd
import re 

In [137]:
df = pd.read_csv("input/sentiment_analysis_160221.csv")
df

Unnamed: 0,Date,SourceCommonName,DocumentIdentifier,Sentiment,news_in_Spain
0,2019-01-24,ideal.es,https://www.ideal.es/economia/taxistas-barcelo...,-4.25,desempleo
1,2019-01-24,publico.es,https://www.publico.es/sociedad/proteccion-dat...,-3.22,desempleo
2,2019-01-24,regio7.cat,https://www.regio7.cat/arreu-catalunya-espanya...,-3.23,desempleo
3,2019-01-24,larioja.com,https://www.larioja.com/economia/taxistas-barc...,-4.25,desempleo
4,2019-01-24,vilaweb.cat,https://www.vilaweb.cat/noticies/els-taxistes-...,-3.26,desempleo
...,...,...,...,...,...
6792,2020-06-18,regiondigital.com,http://regiondigital.com/noticias/economia/329...,0.70,desempleo
6793,2020-06-18,elconfidencial.com,https://www.elconfidencial.com/espana/2020-06-...,-0.07,desempleo
6794,2020-06-18,elconfidencial.com,https://blogs.elconfidencial.com/economia/cons...,-0.13,desempleo
6795,2020-06-18,libertaddigital.com,https://www.libertaddigital.com/espana/2020-06...,-5.05,desempleo


- This is going to be fun. I have news in Spanish, Catalan, Valencian ... Let's pray there's no Basque or Galician.

### 1. Data cleaning.

I need to extract as much information as possible from the `DocumentIdentifier` column, so I need to:
- Strip the `https://www.` prefix.
- Remove the name of the source.
- Create a list with the words that make up the URL.
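
As a rough, single-URL illustration of these three steps, here is a minimal sketch that leans on the standard library's `urllib.parse` instead of string replacement. The helper name `url_to_words` and the example URL are hypothetical, and this is not the approach used in the cells that follow: it discards the whole host (subdomains such as `blogs` included) and any query string, so its output can differ slightly from theirs.

```python
import re
from urllib.parse import urlparse


def url_to_words(url):
    """Return the word tokens found in the path of a URL (hypothetical helper)."""
    # urlparse(...).path drops the scheme and the host, i.e. the source name
    path = urlparse(url).path
    # split on the same separators used later in this notebook: - _ / ! .
    tokens = re.split(r"[-_/!.]", path)
    # strip digits, then discard the empty strings that are left over
    tokens = [re.sub(r"[0-9]", "", t) for t in tokens]
    return [t for t in tokens if t]


# example.es is a placeholder domain, not one of the sources in the dataset
print(url_to_words("https://www.example.es/economia/taxistas-barcelona-2019.html"))
```

File extensions such as `html` survive as tokens here, so they would still need to be filtered out in a later step.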

In [212]:
# reload the raw export, drop the news_in_Spain column and use shorter column names
df = pd.read_csv("input/sentiment_analysis_160221.csv")
df.drop(columns="news_in_Spain", inplace=True)
df.rename(columns={"Date": "date", "SourceCommonName": "source", "DocumentIdentifier": "url",
                   "Sentiment": "gdelt_sentiment"}, inplace=True)
df.columns

Index(['date', 'source', 'url', 'gdelt_sentiment'], dtype='object')

In [214]:
# removing the source from the url
df["url"] = [b.replace(a, '').strip() for a, b in zip(df["source"], df["url"])]
df.head(2)

Unnamed: 0,date,source,url,gdelt_sentiment
0,2019-01-24,ideal.es,https://www./economia/taxistas-barcelona-desco...,-4.25
1,2019-01-24,publico.es,https://www./sociedad/proteccion-datos-aepd-pe...,-3.22


In [215]:
# remove the URL scheme and the leftover "www." prefix
deleting_list = ["https://www./", "http:///", "https:", "http:", "//www./", "///"]
for d in deleting_list:
    df["url"]= [x.replace(d,"") for x in df["url"]]
df.head(2)

Unnamed: 0,date,source,url,gdelt_sentiment
0,2019-01-24,ideal.es,economia/taxistas-barcelona-desconvocan-huelga...,-4.25
1,2019-01-24,publico.es,sociedad/proteccion-datos-aepd-perdona-multa-4...,-3.22


In [216]:
# split each url into a list of words on -, _, /, ! and .
df["url"] = [re.split(r"[-_/!.]", x) for x in df["url"]]
df.head(2)

Unnamed: 0,date,source,url,gdelt_sentiment
0,2019-01-24,ideal.es,"[economia, taxistas, barcelona, desconvocan, h...",-4.25
1,2019-01-24,publico.es,"[sociedad, proteccion, datos, aepd, perdona, m...",-3.22


In [217]:
def remove_numbers(words):
    '''Remove every digit from each string in a list.'''
    pattern = '[0-9]'
    return [re.sub(pattern, '', w) for w in words]

In [218]:
# strip the digits from every token
df["url"] = [remove_numbers(x) for x in df["url"]]

# drop the empty strings left behind by the split and the digit removal
df["url"] = [' '.join(x).split() for x in df["url"]]

In [219]:
df["url"]

0       [economia, taxistas, barcelona, desconvocan, h...
1       [sociedad, proteccion, datos, aepd, perdona, m...
2       [arreu, catalunya, espanya, mon, els, taxistes...
3       [economia, taxistas, barcelona, desconvocan, h...
4       [noticies, els, taxistes, desconvoquen, la, va...
                              ...                        
6792    [noticias, economia, mas, de, clientes, recibe...
6793     [espana, mallorca, alemania, aplausos, turistas]
6794    [blogs, economia, consultorio, laboral, erte, ...
6795    [espana, la, salvaje, campana, de, acoso, y, d...
6796    [andalucia, andalucia, permitira, abrir, zonas...
Name: url, Length: 6797, dtype: object

### 2. Data processing.
- Remove articles and other redundant words.
- Generate one column per word with its occurrence count.
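
One possible way to approach both steps, shown as a minimal standard-library sketch. The `STOPWORDS` set is a tiny hypothetical placeholder (a real list for Spanish and Catalan would be much longer), and `count_words` and `word_count_table` are illustrative names rather than functions from this project.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set used only for illustration
STOPWORDS = {"el", "la", "los", "las", "de", "y", "els", "html"}


def count_words(tokens, stopwords=STOPWORDS):
    """Count the tokens of one row, skipping stop words."""
    return Counter(t for t in tokens if t not in stopwords)


def word_count_table(token_lists, stopwords=STOPWORDS):
    """Map every remaining word to a list holding its count in each row."""
    counts = [count_words(tokens, stopwords) for tokens in token_lists]
    vocabulary = sorted(set().union(*counts))
    # a Counter returns 0 for words that a row does not contain
    return {word: [c[word] for c in counts] for word in vocabulary}


rows = [
    ["espana", "la", "salvaje", "campana", "de", "acoso"],
    ["andalucia", "andalucia", "permitira", "abrir"],
]
table = word_count_table(rows)
print(table["andalucia"])
```

Since every list in the returned dictionary has one entry per row, passing it to `pd.DataFrame(...)` should give one column per word, ready to be joined back onto `df`.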