# Sentiment Analysis

This notebook cleans and describes the tweet data (e.g. key statistics) and then scores the sentiment of each tweet.<br>
**Data**: CSV saved in '../data/raw/rawdata.csv'. This file contains all tweets collected so far (old and new), as produced by the 'acquisition module'.

**Key actions** 
<br> <hr>
- Identify and remove duplicate records
- Identify and remove tweets posted by the user BiciMAD
- Examine data for potential issues
- Identify and fill in missing values
- *Remove low variance columns (potentially not needed)*
- Identify potential outliers *(potentially not needed)*
- Correct incorrect data types *(potentially only text variable)*
- Remove special characters and clean categorical variables *(potentially only text variable)*
<br>
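
The first, second and fourth key actions can be sketched on a toy frame. This is only an illustration under assumptions: the `tweets` frame and the `basic_clean` helper are hypothetical and are not part of this notebook's pipeline; only the column names mirror the ones used further down.

```python
import pandas as pd

# Hypothetical toy frame that mimics the columns used later in this notebook
tweets = pd.DataFrame({
    'id': [1, 1, 2, 3, 4],
    'text': ['hola', 'hola', 'aviso', None, 'bici rota'],
    'user_name': ['Javi', 'Javi', 'BiciMAD', 'Sito', 'Carlos'],
})

def basic_clean(df, official_user='BiciMAD'):
    """Drop duplicate ids, drop the official account, fill missing text."""
    out = df.drop_duplicates(subset=['id'])          # keeps the first copy of each id
    out = out[out['user_name'] != official_user]     # remove the operator's own tweets
    out = out.assign(text=out['text'].fillna(''))    # missing text becomes an empty string
    return out.reset_index(drop=True)

cleaned = basic_clean(tweets)
print(cleaned.isna().sum())  # quick check for remaining missing values
```

Filling missing text with an empty string is one possible choice; dropping those rows would be equally defensible.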

## 1. Read & clean data

#### 1.1 Read data
<hr>

In [1]:
# all modules
import pandas as pd
# if packages need upgrading
# pip install --upgrade pip

In [2]:
# load dataset
data = pd.read_csv('../data/raw/rawdata.csv')

In [3]:
# change the 'date' column from dtype 'object' to datetime64
data['date'] = pd.to_datetime(data['date'])

In [4]:
# getting today's Timestamp
today = pd.Timestamp.today().floor('D')
# .normalize() does the same thing

In [5]:
today

Timestamp('2020-10-24 00:00:00')

In [6]:
# keep only tweets posted after midnight today
data = data[data['date'] > today]

In [7]:
data.shape[0]

53
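
The midnight cut-off used above can be checked in isolation. The sketch below uses a fixed, hypothetical timestamp instead of `pd.Timestamp.today()` so the result is reproducible. It confirms that `.floor('D')` and `.normalize()` agree, and shows that the strict `>` comparison excludes a tweet stamped exactly at midnight.

```python
import pandas as pd

now = pd.Timestamp('2020-10-24 09:15:30')   # hypothetical "current" time
midnight = now.floor('D')
assert midnight == now.normalize()          # both truncate to 00:00:00

dates = pd.Series(pd.to_datetime([
    '2020-10-23 23:59:59',
    '2020-10-24 00:00:00',
    '2020-10-24 01:47:04',
]))

todays = dates[dates > midnight]        # strict: drops the 00:00:00 row
todays_incl = dates[dates >= midnight]  # inclusive alternative
print(len(todays), len(todays_incl))
```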

In [8]:
# drop the unnamed index column left over from the CSV
data = data.drop(columns=['Unnamed: 0'])

In [9]:
data.shape

(53, 6)
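
The read, date conversion and column drop in this section could also be done in a single `read_csv` call. The sketch below assumes the raw file starts with an unnamed index column, as the 'Unnamed: 0' name suggests; the in-memory CSV is a hypothetical stand-in for '../data/raw/rawdata.csv'.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the raw file (first column is a saved index)
raw_csv = io.StringIO(
    ",date,id,text\n"
    "0,2020-10-24 01:47:04,1,hola\n"
    "1,2020-10-24 08:39:16,2,adios\n"
)

# index_col=0 consumes the unnamed column; parse_dates converts 'date' on load
sample = pd.read_csv(raw_csv, index_col=0, parse_dates=['date'])
print(sample.dtypes)
```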

#### 1.2 Clean data
<hr>

#### 1.3 Take out BiciMAD tweets
<hr>

In [11]:
# data analysis => sorting
data = data.sort_values('user_name', ascending=False)

In [12]:
data['user_name'].value_counts()

Javi                          9
BICIMAD EN LUCHA              3
Sito                          3
japarra                       3
Plataforma Sindical           3
Carlos                        3
CarlosIS                      2
villalba1200                  2
Saúl TM                       1
itu                           1
talanquera                    1
Indiario                      1
Pedro                         1
Taxi Barcelona #NoOnProp22    1
BiciMAD                       1
Robert                        1
Maribel Nuñez                 1
Miguel Ángel Gómez            1
j_medina                      1
deteibols                     1
Laborioso Gandul🇪🇸🏁🇪🇸         1
Pablo Carrascón 🔻             1
David González T.             1
Trotamundos                   1
oelon                         1
paola #RegulacionJustaYa      1
#MADBikeStatus 🛎🚲⏱            1
yosisoy                       1
Gacetín Madrid                1
🚌🚲🚶🏽‍♀️ Marta Serrano 💚       1
Ricardo Jose Serrano          1
jesusric

In [13]:
data = data[data.user_name != 'BiciMAD']

In [14]:
data.columns

Index(['date', 'id', 'text', 'user_name', 'user_id', 'user_screen_name'], dtype='object')

## 2. Explore data

#### 2.1 Reset index
<hr>

In [15]:
data = data.reset_index()

In [16]:
data = data.drop(columns =['index'])

#### 2.2 Check most recent tweets from users
<hr>

In [17]:
# First tweet available date
data['date'].min()

Timestamp('2020-10-24 01:47:04')

In [18]:
# Most recent tweet date
data['date'].max()

Timestamp('2020-10-24 08:39:16')

## 3. Sentiment analysis

#### 3.1 Prepare text
<hr>

In [19]:
import re

##### TODO: decide how to handle 'ñ'
<hr>

In [20]:
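def clean_tweet(tweet):
    # Replace @mentions, URLs and every character outside [0-9A-Za-z \t] with a space,
    # then collapse repeated whitespace. Note: this also strips 'ñ' and accented vowels.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())
    # return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+://\S+)", "", tweet).split())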

In [21]:
text = 'Señores de @BiciMAD @MADRID las bicis están muy descuidadas'

In [22]:
# experiment: the same pattern with Spanish letters added to the allowed set
re.sub(r"(@[A-Za-zñÑüÜáéíóú0-9]+)|([^0-9A-ZñÑüÜáéíóúa-z \t])|(\w+://\S+)", " ", text)

'Señores de     las bicis están muy descuidadas'
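
One possible answer to the 'ñ' question is to rely on Python's Unicode-aware `\w` instead of listing letters by hand. The sketch below is an assumption, not the cleaner this notebook actually applies: `clean_tweet_es` is a hypothetical name, and unlike `clean_tweet` it keeps underscores and any Unicode letter or digit.

```python
import re

# For str patterns in Python 3, \w matches Unicode letters and digits,
# so 'ñ', 'ü' and accented vowels survive. Order of alternatives:
# @mentions, then URLs, then any remaining non-word, non-space character.
_TWEET_NOISE = re.compile(r"@\w+|\w+://\S+|[^\w\s]")

def clean_tweet_es(tweet):
    """Remove mentions, URLs and punctuation; collapse whitespace."""
    return ' '.join(_TWEET_NOISE.sub(' ', tweet).split())

print(clean_tweet_es('Señores de @BiciMAD @MADRID las bicis están muy descuidadas'))
```

Because `@\w+` also matches underscores, a handle such as `@_AguilarM` is removed whole rather than leaving `AguilarM` behind.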

In [23]:
# Add the cleaned text as a new column 'tweets_clean'
data['tweets_clean'] = data['text'].apply(clean_tweet) 

In [24]:
# Print the updated dataframe 
# data

In [25]:
from transformers import pipeline


In [26]:
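# NOTE: without a `model` argument the pipeline loads its default English checkpoint,
# while the tweets scored below are in Spanish
classifier = pipeline('sentiment-analysis')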

In [27]:
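# Spanish BERT (BETO) tokenizer and masked-LM model.
# NOTE: neither object is passed to `classifier`, so they do not affect the scores below.
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForMaskedLM.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")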

Some weights of BertForMaskedLM were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
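
Since the tweets are in Spanish, it may be worth naming the checkpoint explicitly rather than relying on the pipeline default. The sketch below is hedged on several points: `build_classifier` and `add_sentiment` are hypothetical helpers, no particular Spanish checkpoint is assumed (the caller supplies `model_name`), and `fake_classifier` is a stub that only imitates the list-of-dicts output shape so the example runs without downloading a model.

```python
import pandas as pd

def build_classifier(model_name):
    """Build a sentiment pipeline for an explicitly named checkpoint."""
    from transformers import pipeline  # lazy import: heavy dependency
    return pipeline('sentiment-analysis', model=model_name, tokenizer=model_name)

def add_sentiment(df, classifier, text_col='tweets_clean'):
    """Score all texts in one call and add 'label' and 'score' columns."""
    results = classifier(df[text_col].tolist())  # one dict per input text
    out = df.copy()
    out['label'] = [r['label'] for r in results]
    out['score'] = [r['score'] for r in results]
    return out

def fake_classifier(texts):
    # Stub standing in for a real pipeline; labels are arbitrary
    return [{'label': 'POSITIVE' if 'bien' in t else 'NEGATIVE', 'score': 0.9}
            for t in texts]

demo = add_sentiment(pd.DataFrame({'tweets_clean': ['muy bien', 'fatal']}),
                     fake_classifier)
print(demo)
```

Passing the whole list in one call also avoids invoking the model once per row through `apply`.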


In [28]:
def transform(x):
    return classifier(x)

# Apply the classifier to every cleaned tweet
data['sentiment'] = data['tweets_clean'].apply(transform)

In [29]:
# each 'sentiment' entry is a one-element list holding a dict
data["score"] = data["sentiment"].apply(lambda result: result[0]['score'])

In [30]:
data["label"] = data["sentiment"].apply(lambda result: result[0]['label'])

In [31]:
sum(data["label"] == "POSITIVE")

0

In [32]:
sum(data["label"] == "NEGATIVE")

52

In [33]:
data['score'].mean()

0.958027469424101

In [34]:
score = data['score']

In [35]:
# boolean mask (watch the index); the mean is NaN because no tweet is labelled POSITIVE
positive = (data["label"] == "POSITIVE")
score[positive].mean()

nan

In [36]:
# series (watch the index)
negative = (data["label"] == "NEGATIVE")
score[negative].mean()

0.958027469424101
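
The per-label counts and means computed cell by cell above can be collected in one table. This is a sketch on hypothetical scores; reindexing on the two expected labels keeps the empty class visible instead of only surfacing as a `nan` from an empty mask.

```python
import pandas as pd

# Hypothetical scores; every row is NEGATIVE, as in the run above
scored = pd.DataFrame({
    'label': ['NEGATIVE', 'NEGATIVE', 'NEGATIVE'],
    'score': [0.9, 0.8, 1.0],
})

summary = (scored.groupby('label')['score']
                 .agg(['count', 'mean'])
                 .reindex(['POSITIVE', 'NEGATIVE']))   # show absent labels too
summary['count'] = summary['count'].fillna(0).astype(int)
print(summary)
```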

In [37]:
# data.label == 'POSITIVE'

##### 3.3.2 Code 'negative' and 'positive' as -1 and 1

In [38]:
data['label_coded'] = data['label'].apply(lambda x: 1 if x == 'POSITIVE' else -1)

In [39]:
# data

In [40]:
data['score_coded'] = data['label_coded'] * data['score']
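
The `lambda` used for `label_coded` maps every label other than 'POSITIVE' to -1. With a two-class model that is fine, but a model that also emits a third label would have it silently counted as negative. A dictionary lookup, sketched below on hypothetical data (the 'NEUTRAL' label is an assumption for illustration), makes unknown labels show up as NaN instead.

```python
import pandas as pd

SIGN = {'POSITIVE': 1, 'NEGATIVE': -1}

scored = pd.DataFrame({
    'label': ['NEGATIVE', 'POSITIVE', 'NEUTRAL'],   # 'NEUTRAL' is hypothetical
    'score': [0.95, 0.80, 0.60],
})

scored['label_coded'] = scored['label'].map(SIGN)            # unmapped labels become NaN
scored['score_coded'] = scored['label_coded'] * scored['score']
print(scored)
```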

In [41]:
# data

In [42]:
# Drop duplicates before saving
# Read already existing data
df_old = pd.read_csv('../data/results/data_sentiment.csv')
df_old = df_old.astype(str)

In [43]:
df_str = data.astype(str)

In [44]:
df = pd.merge(df_old, df_str, how ='outer')

In [45]:
# drop rows whose 'date' value is the literal string 'date' (presumably stray header lines)
df = df[df.date != 'date']

In [46]:
df.drop_duplicates(subset=['id'],keep='last', inplace= True)
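
The merge-and-deduplicate step can also be written as a plain concatenation followed by `drop_duplicates` on the key. The sketch below uses two hypothetical frames and keeps `id` as a string throughout, an assumption made here because very large integer ids can lose precision if they pass through a float dtype.

```python
import pandas as pd

# Hypothetical previously saved results and newly scored tweets
old = pd.DataFrame({
    'id': ['1319895822499303424', '1319895847413436416'],
    'label': ['NEGATIVE', 'NEGATIVE'],
})
new = pd.DataFrame({
    'id': ['1319895847413436416', '1319904513063571456'],
    'label': ['NEGATIVE', 'NEGATIVE'],
})

combined = (pd.concat([old, new], ignore_index=True)
              .drop_duplicates(subset=['id'], keep='last')   # newest copy wins
              .reset_index(drop=True))

n_new = len(combined) - len(old)   # rows that were not in the old file
print(combined.shape, n_new)
```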

In [47]:
df.reset_index()

Unnamed: 0.2,index,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded,user_id,user_screen_name
0,0,0,0.0,0.0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.9859856963157654,NEGATIVE,-1,-0.9859856963157654,,
1,1,1,1.0,1.0,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.9817882180213928,NEGATIVE,-1,-0.9817882180213928,,
2,2,2,2.0,2.0,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.9809810519218444,NEGATIVE,-1,-0.9809810519218444,,
3,3,3,3.0,3.0,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.9831892251968384,NEGATIVE,-1,-0.9831892251968384,,
4,4,4,4.0,4.0,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.9943681955337524,NEGATIVE,-1,-0.9943681955337524,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9022,9022,,,,2020-10-24 08:25:02,1319917806775898113,RT @PlataformaEMT: Hemos salido a la calle a p...,Carlos,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656067,NEGATIVE,-1,-0.9584372639656067,1.2021517960320532e+18,Carlos45343133
9023,9023,,,,2020-10-24 06:57:41,1319895822499303424,RT @PlataformaEMT: Hemos salido a la calle a p...,BICIMAD EN LUCHA,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656067,NEGATIVE,-1,-0.9584372639656067,1.199827990361858e+18,BicimadL
9024,9024,,,,2020-10-24 06:57:47,1319895847413436416,RT @PlataformaEMT: Hemos salido a la calle a p...,BICIMAD EN LUCHA,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656067,NEGATIVE,-1,-0.9584372639656067,1.199827990361858e+18,BicimadL
9025,9025,,,,2020-10-24 07:32:13,1319904513063571456,RT @diego_rebollo: Esta es la realidad de @Bic...,BICIMAD EN LUCHA,RT rebollo Esta es la realidad de Estaciones a...,"[{'label': 'NEGATIVE', 'score': 0.983878135681...",0.9838781356811523,NEGATIVE,-1,-0.9838781356811523,1.199827990361858e+18,BicimadL


In [48]:
# check new Tweets are in df
df.sort_values('date', ascending = False).head(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded,user_id,user_screen_name
8984,,,,2020-10-24 08:39:16,1319921388438900741,RT @PlataformaEMT: Hemos salido a la calle a p...,japarra,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1272437244.0,japarra1633
8983,,,,2020-10-24 08:39:10,1319921363931549696,RT @PlataformaEMT: Hemos salido a la calle a p...,japarra,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1272437244.0,japarra1633
8985,,,,2020-10-24 08:39:01,1319921325683736576,RT @PlataformaEMT: Hemos salido a la calle a p...,japarra,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1272437244.0,japarra1633
9022,,,,2020-10-24 08:25:02,1319917806775898113,RT @PlataformaEMT: Hemos salido a la calle a p...,Carlos,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1.2021517960320532e+18,Carlos45343133
9021,,,,2020-10-24 08:24:56,1319917780985151488,RT @PlataformaEMT: Hemos salido a la calle a p...,Carlos,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1.2021517960320532e+18,Carlos45343133
9020,,,,2020-10-24 08:24:51,1319917761410273281,RT @PlataformaEMT: Hemos salido a la calle a p...,Carlos,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1.2021517960320532e+18,Carlos45343133
8981,,,,2020-10-24 08:16:42,1319915710634741761,RT @PlataformaEMT: Hemos salido a la calle a p...,oelon,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,1395477475.0,oelon65
9014,,,,2020-10-24 08:01:50,1319911968950673409,@MADRID @BiciMAD @JMDLatina Deseando que llegu...,Indiario,Deseando que lleguen por metro La Peseta,"[{'label': 'NEGATIVE', 'score': 0.967384099960...",0.9673840999603271,NEGATIVE,-1,-0.9673840999603271,4504395567.0,indiario
8990,,,,2020-10-24 07:45:59,1319907978984755201,RT @PlataformaEMT: Hemos salido a la calle a p...,Taxi Barcelona #NoOnProp22,RT Hemos salido a la calle a preguntar y aqu e...,"[{'label': 'NEGATIVE', 'score': 0.958437263965...",0.9584372639656068,NEGATIVE,-1,-0.9584372639656068,375032643.0,dammkring
8996,,,,2020-10-24 07:42:31,1319907106917679105,RT @BiciMAD: Hoy se ha puesto en marcha la nue...,Ricardo Jose Serrano,RT Hoy se ha puesto en marcha la nueva estaci ...,"[{'label': 'NEGATIVE', 'score': 0.963619709014...",0.9636197090148926,NEGATIVE,-1,-0.9636197090148926,274948847.0,ricardomas45


### 2.3 Save to csv

- **Check DataFrame shape** 
<br>
        Check df shape
<br>
        Check new tweets (i.e. difference between old and updated df)
- **Save to existing 'data_sentiment.csv'** 
<br> 
        Save only additional tweets (i.e. the updated df) 
<br>

In [49]:
# Updated df shape (rows cols)
df.shape

(9027, 15)

In [50]:
# New tweets 
df.shape[0] - df_old.shape[0]

52

In [51]:
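# save to csv - overwrite the results file with the merged dataframe
# (index=False avoids adding another 'Unnamed: 0' column on the next read)
df.to_csv('../data/results/data_sentiment.csv', header=True, index=False)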
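
If the intent is to add only the new tweets to the existing file rather than rewrite it, an append-based variant is possible. The sketch below is an assumption about how that could look: `append_new_rows` is a hypothetical helper, it writes to a temporary directory rather than '../data/results/', and it assumes the new frame has the same columns in the same order as the file.

```python
import os
import tempfile
import pandas as pd

def append_new_rows(new_df, path, key='id'):
    """Append rows of new_df whose key is not yet present in the CSV at path."""
    new_df = new_df.astype({key: str})
    if os.path.exists(path):
        seen = set(pd.read_csv(path, usecols=[key], dtype={key: str})[key])
        fresh = new_df[~new_df[key].isin(seen)]
        # mode='a' appends; header and index are omitted to match the file layout
        fresh.to_csv(path, mode='a', header=False, index=False)
    else:
        fresh = new_df
        fresh.to_csv(path, index=False)
    return len(fresh)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data_sentiment.csv')
    first = append_new_rows(
        pd.DataFrame({'id': ['1', '2'], 'label': ['NEGATIVE', 'NEGATIVE']}), path)
    second = append_new_rows(
        pd.DataFrame({'id': ['2', '3'], 'label': ['NEGATIVE', 'POSITIVE']}), path)
    stored = pd.read_csv(path, dtype={'id': str})

print(first, second, list(stored.columns))
```

Writing with `index=False` keeps the file free of extra index columns, so repeated runs do not accumulate 'Unnamed' columns.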