# Sentiment Analysis

This notebook cleans the collected tweets, describes the data (e.g. key statistics) and scores each tweet's sentiment.<br>
**Data**: CSV saved in '../data/raw/rawdata.csv'. This file contains all tweets (old and new) as updated by the 'acquisition module'.

**Key actions** 
<br> <hr>
- Identify and remove duplicate records
- Identify and remove tweets posted by the user BiciMAD
- Examine data for potential issues
- Identify and fill in missing values
- *Remove low variance columns (potentially not needed)*
- Identify potential outliers *(potentially not needed)*
- Correct incorrect data types *(potentially only text variable)*
- Remove special characters and clean categorical variables *(potentially only text variable)*
<br>
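The first few key actions could be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's actual pipeline: the column names mirror those used later in this notebook, while the sample rows and the `tidy` helper are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the notebook's dataset.
sample = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "text": ["hola", "hola", None, "bici rota"],
    "user_name": ["ana", "ana", "BiciMAD", "luis"],
})

def tidy(df):
    """Drop duplicate tweets, drop BiciMAD's own tweets, fill missing text."""
    out = df.drop_duplicates(subset=["id"])
    out = out[out["user_name"] != "BiciMAD"]
    return out.fillna({"text": ""}).reset_index(drop=True)

cleaned = tidy(sample)
print(cleaned)
```

Keeping the steps in one small function makes it easy to rerun them whenever the raw file is refreshed.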

## 1. Read & clean data

#### 1.1 Read data
<hr>

In [1]:
# all modules
import pandas as pd
# if packages need upgrading:
# pip install --upgrade pip

In [2]:
# load dataset
data = pd.read_csv('../data/raw/rawdata.csv')

In [3]:
# change 'date' column type from 'object' to 'datetime'
data['date'] = pd.to_datetime(data['date'])

In [4]:
# getting today's Timestamp
today = pd.Timestamp.today().floor('D')
# .normalize() does the same thing

In [5]:
today

Timestamp('2020-10-26 00:00:00')
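As a quick check of the comment in cell 4, the sketch below compares `floor('D')` with `normalize()` on a fixed, timezone-naive timestamp, and shows how a strict `>` comparison treats a tweet stamped exactly at midnight. The three sample dates are illustrative assumptions.

```python
import pandas as pd

ts = pd.Timestamp("2020-10-26 18:34:00")
midnight_floor = ts.floor("D")
midnight_norm = ts.normalize()
print(midnight_floor == midnight_norm)

# Illustrative dates around midnight.
dates = pd.Series(pd.to_datetime([
    "2020-10-25 23:59:59",
    "2020-10-26 00:00:00",
    "2020-10-26 03:02:23",
]))
kept_strict = dates[dates > midnight_floor]      # excludes exactly-midnight rows
kept_inclusive = dates[dates >= midnight_floor]  # includes them
print(len(kept_strict), len(kept_inclusive))
```

If tweets stamped exactly at 00:00:00 should count as "today", `>=` would be the safer comparison.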

In [6]:
# keep only tweets posted after today's midnight
data = data[data['date'] > today]

In [7]:
data.shape[0]

287

In [8]:
# select required columns
data = data.drop(columns =['Unnamed: 0'])

In [9]:
data.shape

(287, 6)

#### 1.2 Clean data
<hr>

#### 1.3 Remove BiciMAD tweets
<hr>

In [10]:
# data analysis => sorting
data = data.sort_values('user_name', ascending=False)

In [11]:
data['user_name'].value_counts()

BICIMAD EN LUCHA                              17
BiciMAD                                       14
villalba1200                                   9
Israel M                                       9
j_medina                                       9
                                              ..
La Casa de Campo de Madrid                     1
José Luis Nieto Bueno                          1
Kike                                           1
Kamchatka #Mequedoencasa #SaldremosenComun     1
Más Madrid Hortaleza                           1
Name: user_name, Length: 159, dtype: int64

In [12]:
data = data[data.user_name != 'BiciMAD']

In [13]:
data.columns

Index(['date', 'id', 'text', 'user_name', 'user_id', 'user_screen_name'], dtype='object')
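The filter in cell 12 matches the exact string 'BiciMAD'. If the account name could appear with different capitalisation, a case-insensitive comparison is one option. The sketch below is an assumption about how that might look; the lowercase 'bicimad' row is hypothetical.

```python
import pandas as pd

users = pd.DataFrame(
    {"user_name": ["BiciMAD", "bicimad", "BICIMAD EN LUCHA", "Kike"]}
)

# Compare case-folded names so 'BiciMAD' and 'bicimad' are treated alike.
is_official = users["user_name"].str.casefold() == "bicimad"
others = users[~is_official]
print(others["user_name"].tolist())
```

Because the comparison is still an equality test, other accounts whose names merely contain "BiciMAD" (such as 'BICIMAD EN LUCHA') are kept.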

## 2. Explore data

#### 2.1 Reset index
<hr>

In [14]:
data = data.reset_index()

In [15]:
data = data.drop(columns =['index'])

#### 2.2 Check most recent tweets from users
<hr>

In [16]:
# First tweet available date
data['date'].min()

Timestamp('2020-10-26 03:02:23')

In [17]:
# Most recent tweet date
data['date'].max()

Timestamp('2020-10-26 18:34:00')

## 3. Sentiment analysis

#### 3.1 Prepare text
<hr>

In [18]:
import re

##### TODO: decide how to handle 'ñ'
<hr>

In [19]:
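def clean_tweet(tweet):
    # Replace @mentions, characters outside [0-9A-Za-z], and URLs with spaces, then collapse whitespace.
    # Note: accented letters and 'ñ' fall outside [0-9A-Za-z], so they are removed as well.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    #return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())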

In [20]:
text = 'Señores de @BiciMAD @MADRID las bicis están muy descuidadas'

In [21]:
re.sub(r"(@[A-Za-zñÑüÜáéíóúÁÉÍÓÚ0-9]+)|([^0-9A-Za-zñÑüÜáéíóúÁÉÍÓÚ \t])|(\w+:\/\/\S+)", " ", text)

'Señores de     las bicis están muy descuidadas'
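One possible way to settle the 'ñ' question is to rely on `\w`, which matches Unicode letters and digits for `str` patterns in Python 3, instead of listing accented characters by hand. The function below is a hypothetical variant, not the `clean_tweet` used in the rest of this notebook; note that `\w` also keeps underscores.

```python
import re

# Hypothetical Unicode-aware variant of clean_tweet (not the notebook's function).
# Order of alternatives: @mentions, URLs, then any remaining non-word character.
_PATTERN = re.compile(r"(@\w+)|(\w+://\S+)|([^\w \t])")

def clean_tweet_unicode(tweet):
    """Remove mentions, URLs and punctuation while keeping accented letters."""
    return " ".join(_PATTERN.sub(" ", tweet).split())

text = 'Señores de @BiciMAD @MADRID las bicis están muy descuidadas'
print(clean_tweet_unicode(text))
```

Whether keeping accents helps depends on the model used downstream, so this is worth testing on a few tweets before replacing the original function.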

In [22]:
# Add the cleaned text as a new column 'tweets_clean'
data['tweets_clean'] = data['text'].apply(clean_tweet) 

In [23]:
# Print the updated dataframe 
# data

In [24]:
from transformers import pipeline


In [25]:
classifier = pipeline('sentiment-analysis')

In [26]:
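# Load the Spanish BERT tokenizer and masked-LM model.
# Note: these are not passed to `classifier`, which was created in the previous cell with the pipeline's default model.
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForMaskedLM.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")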

Some weights of BertForMaskedLM were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
def transform(x):
    # run the sentiment classifier on a single cleaned tweet
    return classifier(x)
# Apply transform function to all tweets 
data['sentiment'] = data['tweets_clean'].apply(transform)

In [28]:
data["score"] = [result[0]['score'] for result in data["sentiment"]]

In [29]:
data["label"] = [result[0]['label'] for result in data["sentiment"]]
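Each entry in the `sentiment` column is a one-element list holding a dict with `label` and `score` keys, as the tables further down show. The sketch below expands that structure into columns in one step. To stay runnable without downloading a model it uses `fake_classifier`, a hypothetical stand-in that only mimics the output shape; its keyword rule and fixed score are made up.

```python
import pandas as pd

def fake_classifier(text):
    """Hypothetical stand-in that mimics the output shape of classifier(x)."""
    label = "POSITIVE" if "gracias" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 0.9}]

demo = pd.DataFrame(
    {"tweets_clean": ["Gracias por el servicio", "las bicis estan descuidadas"]}
)
demo["sentiment"] = demo["tweets_clean"].apply(fake_classifier)

# Expand the first (only) dict of each result into its own columns.
expanded = pd.DataFrame(
    [result[0] for result in demo["sentiment"]], index=demo.index
)
demo = demo.join(expanded)
print(demo[["tweets_clean", "label", "score"]])
```

Building the new columns from the frame's own index avoids relying on the index being a clean 0..n-1 range.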

In [30]:
sum(data["label"] == "POSITIVE")

3

In [31]:
sum(data["label"] == "NEGATIVE")

270

In [32]:
data['score'].mean()

0.9554958655720666

In [33]:
score = data['score']

In [34]:
# series (watch the index)
positive = (data["label"] == "POSITIVE")
score[positive].mean()

0.8439774910608927

In [35]:
# series (watch the index)
negative = (data["label"] == "NEGATIVE")
score[negative].mean()

0.9567349586221907
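The per-label counts and means computed in the cells above can also be obtained in a single pass with `groupby`. A minimal sketch with hypothetical scores:

```python
import pandas as pd

# Hypothetical scores; labels follow the classifier output used above.
demo = pd.DataFrame({
    "label": ["NEGATIVE", "NEGATIVE", "POSITIVE"],
    "score": [0.98, 0.96, 0.84],
})
summary = demo.groupby("label")["score"].agg(["count", "mean"])
print(summary)
```

Applied to `data`, the same call would give one row per label with its tweet count and mean confidence.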

In [36]:
# data.label == 'POSITIVE'

##### 3.3.2 Code 'negative' and 'positive' as -1 and 1

In [37]:
data['label_coded'] = data['label'].apply(lambda x: 1 if x == 'POSITIVE' else -1)

In [38]:
# data

In [39]:
data['score_coded'] = data['label_coded'] * data['score']
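The signed score is the classifier's confidence multiplied by the sign of the label, so a confident negative tweet lands near -1 and a confident positive one near +1. The sketch below reproduces the two steps on hypothetical values, using a dictionary `map` in place of the lambda.

```python
import pandas as pd

demo = pd.DataFrame({
    "label": ["NEGATIVE", "POSITIVE"],
    "score": [0.98, 0.84],  # hypothetical confidence values
})
demo["label_coded"] = demo["label"].map({"POSITIVE": 1, "NEGATIVE": -1})
demo["score_coded"] = demo["label_coded"] * demo["score"]
print(demo)
```

One difference from the lambda: `map` yields NaN for any label missing from the dictionary, whereas the lambda codes every non-'POSITIVE' label as -1.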

In [40]:
# data

In [41]:
# Drop duplicates before saving
# Read already existing data 
df_old = pd.read_csv('../data/results/data_sentiment.csv')
df_old = df_old.astype(str)

In [42]:
df_str = data.astype(str)

In [43]:
df = pd.merge(df_old, df_str, how ='outer')

In [44]:
# drop rows whose 'date' field is the literal string 'date' (e.g. repeated header rows)
df = df[df.date != 'date']

In [45]:
df.drop_duplicates(subset=['id'],keep='last', inplace= True)
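The combine-then-deduplicate step can be illustrated on two tiny batches. This sketch stacks the frames with `pd.concat` rather than the outer `pd.merge` used above, and the rows are hypothetical; the point is to show what `keep='last'` does when the same `id` appears in both batches.

```python
import pandas as pd

# Hypothetical old and new batches sharing one tweet id.
old = pd.DataFrame({"id": ["1", "2"], "label": ["NEGATIVE", "NEGATIVE"]})
new = pd.DataFrame({"id": ["2", "3"], "label": ["POSITIVE", "NEGATIVE"]})

combined = pd.concat([old, new], ignore_index=True)
# keep='last' means the newest row wins when an id appears in both batches.
combined = combined.drop_duplicates(subset=["id"], keep="last").reset_index(drop=True)
n_new = len(combined) - len(old)
print(combined)
print(n_new)
```

Here the tweet with id "2" keeps its newer label, and the difference in row counts gives the number of genuinely new tweets.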

In [46]:
df.reset_index()

Unnamed: 0.2,index,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1.1,date,...,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded,user_id,user_screen_name
0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2020-09-29 06:34:23,...,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.9859856963157654,NEGATIVE,-1,-0.9859856963157654,,
1,1,1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2020-09-29 07:01:33,...,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.9817882180213928,NEGATIVE,-1,-0.9817882180213928,,
2,2,2,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2020-09-29 07:43:50,...,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.9809810519218444,NEGATIVE,-1,-0.9809810519218444,,
3,3,3,3.0,3.0,3.0,3.0,3.0,3.0,3.0,2020-09-29 07:53:20,...,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.9831892251968384,NEGATIVE,-1,-0.9831892251968384,,
4,4,4,4.0,4.0,4.0,4.0,4.0,4.0,4.0,2020-09-29 08:05:56,...,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.9943681955337524,NEGATIVE,-1,-0.9943681955337524,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9430,9664,,,,,,,,,2020-10-26 12:51:41,...,RT @PlataformaEMT: #27O MAÑANA ES EL DÍA #27O\...,Afectados @biciMAD,RT 27O MA ANA ES EL D A 27O TOD SOMOS BICIMAD ...,"[{'label': 'NEGATIVE', 'score': 0.988935887813...",0.9889358878135681,NEGATIVE,-1,-0.9889358878135681,1.2962069425290363e+18,Afectadosbicim1
9431,9665,,,,,,,,,2020-10-26 07:59:26,...,Deterioro de BiciMAD: 700 bicicletas pendiente...,Actualidad IT,Deterioro de BiciMAD 700 bicicletas pendientes...,"[{'label': 'NEGATIVE', 'score': 0.989495456218...",0.9894954562187195,NEGATIVE,-1,-0.9894954562187195,2729380093.0,ActualidadIT
9432,9666,,,,,,,,,2020-10-26 07:03:27,...,RT @PlataformaEMT: Buenos días plataformer@s.\...,Abre los ojos,RT Buenos d as plataformer Comenzamos TOD CON ...,"[{'label': 'NEGATIVE', 'score': 0.982271611690...",0.9822716116905212,NEGATIVE,-1,-0.9822716116905212,1.0660293724218737e+18,Abrelos05774477
9433,9667,,,,,,,,,2020-10-26 14:20:09,...,RT @PlataformaEMT: #27O MAÑANA ES EL DÍA #27O\...,ARKAITZ,RT 27O MA ANA ES EL D A 27O TOD SOMOS BICIMAD ...,"[{'label': 'NEGATIVE', 'score': 0.988935887813...",0.9889358878135681,NEGATIVE,-1,-0.9889358878135681,373994352.0,ARKAITX14_4_69


In [47]:
# check new Tweets are in df
df.sort_values('date', ascending = False).head(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Unnamed: 0.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1,Unnamed: 0.1.1.1.1.1.1.1,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded,user_id,user_screen_name
9430,,,,,,,,,2020-10-26 18:34:00,1320795831642083331,RT @BicimadL: Buenos días x usuarios de Bicima...,provis,RT Buenos d as x usuarios de BicimaD Ma ana 27...,"[{'label': 'NEGATIVE', 'score': 0.969375550746...",0.9693755507469176,NEGATIVE,-1,-0.9693755507469176,827400128.0,provisoscar
9497,,,,,,,,,2020-10-26 18:31:31,1320795207932235779,RT @Coordinadora25S: Es el mismo guión de siem...,Relegados,RT Es el mismo gui n de siempre el Ayuntamient...,"[{'label': 'NEGATIVE', 'score': 0.987592160701...",0.9875921607017516,NEGATIVE,-1,-0.9875921607017516,262643800.0,Relegados
9611,,,,,,,,,2020-10-26 18:29:14,1320794633937559553,RT @Coordinadora25S: Es el mismo guión de siem...,David ♜☭,RT Es el mismo gui n de siempre el Ayuntamient...,"[{'label': 'NEGATIVE', 'score': 0.987592160701...",0.9875921607017516,NEGATIVE,-1,-0.9875921607017516,315179397.0,DavidComunero
9410,,,,,,,,,2020-10-26 18:29:10,1320794616493428741,Me da pena no poder asistir mañana a la concen...,⚡CARLOS⚡,Me da pena no poder asistir ma ana a la concen...,"[{'label': 'NEGATIVE', 'score': 0.983483374118...",0.9834833741188048,NEGATIVE,-1,-0.9834833741188048,497006366.0,CarlosRMA94
9600,,,,,,,,,2020-10-26 18:29:07,1320794604011225088,@protestona1 y el sabotaje a bicimad para volv...,Filetesaurio,y el sabotaje a bicimad para volver a privatiz...,"[{'label': 'NEGATIVE', 'score': 0.915158450603...",0.9151584506034852,NEGATIVE,-1,-0.9151584506034852,2903103118.0,Filetesaurio
9616,,,,,,,,,2020-10-26 18:29:05,1320794594016186368,"Es el mismo guión de siempre, el Ayuntamiento ...",Coordinadora 25S,Es el mismo gui n de siempre el Ayuntamiento d...,"[{'label': 'NEGATIVE', 'score': 0.989747285842...",0.9897472858428956,NEGATIVE,-1,-0.9897472858428956,780717692.0,Coordinadora25S
9657,,,,,,,,,2020-10-26 18:25:56,1320793801896697856,RT @Coordinadora25S: Es el mismo guión de siem...,Angel,RT Es el mismo gui n de siempre el Ayuntamient...,"[{'label': 'NEGATIVE', 'score': 0.987592160701...",0.9875921607017516,NEGATIVE,-1,-0.9875921607017516,856127714.0,ablascos
9632,,,,,,,,,2020-10-26 18:25:51,1320793781998931969,RT @PlataformaEMT: Los usuarios no encuentran ...,Busero 20 🔻,RT Los usuarios no encuentran bicicletas dispo...,"[{'label': 'NEGATIVE', 'score': 0.983690381050...",0.98369038105011,NEGATIVE,-1,-0.98369038105011,588917468.0,busero20
9617,,,,,,,,,2020-10-26 18:24:54,1320793543993106433,RT @Coordinadora25S: Es el mismo guión de siem...,Chirrin Dul Ari,RT Es el mismo gui n de siempre el Ayuntamient...,"[{'label': 'NEGATIVE', 'score': 0.987592160701...",0.9875921607017516,NEGATIVE,-1,-0.9875921607017516,512927168.0,ChirrinDul
9437,,,,,,,,,2020-10-26 18:24:36,1320793466218184705,RT @Coordinadora25S: Es el mismo guión de siem...,lluna vermella,RT Es el mismo gui n de siempre el Ayuntamient...,"[{'label': 'NEGATIVE', 'score': 0.987592160701...",0.9875921607017516,NEGATIVE,-1,-0.9875921607017516,243377060.0,llunavermella


### 2.3 Save to csv

- **Check DataFrame shape** 
<br>
        Check the shape of the updated df
<br>
        Count new tweets (i.e. difference between old and updated df)
- **Save to existing 'data_sentiment.csv'** 
<br> 
        Write the merged df (existing plus additional tweets) back to the file
<br>

In [48]:
# Updated df shape (rows cols)
df.shape

(9435, 20)

In [49]:
# New tweets 
df.shape[0] - df_old.shape[0]

29

In [50]:
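# save to csv: overwrite the existing results file with the merged dataframe
# index=False avoids writing the index, which otherwise comes back as an extra 'Unnamed: 0' column
df.to_csv('../data/results/data_sentiment.csv', header=True, index=False)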
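The 'Unnamed: 0…' columns visible in the tables above are what pandas produces when a frame is written to CSV together with its index and then read back. The sketch below demonstrates the effect on a hypothetical two-row frame, using in-memory buffers instead of files.

```python
import io
import pandas as pd

demo = pd.DataFrame({"id": [1, 2], "label": ["NEGATIVE", "POSITIVE"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Each save-and-reload cycle that writes the index adds one more such column, which would explain the growing 'Unnamed: 0.1.1…' names.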