# Data Analysis

This notebook covers the steps used to describe the data (key statistics, etc.).<br>
**Data**: CSV saved in '../data/raw/rawdata.csv'. This file contains all tweets collected so far (old and new) by the 'acquisition module'.

**Key actions** 
<br> <hr>
- Identify and remove duplicate records
- Identify and remove tweets posted by the user BiciMAD
- Examine data for potential issues
- Identify and fill in missing values
- *Remove low variance columns (potentially not needed)*
- Identify potential outliers *(potentially not needed)*
- Correct incorrect data types *(potentially only text variable)*
- Remove special characters and clean categorical variables *(potentially only text variable)*
<br>

## 1. Read & clean data

#### 1.1 Read data
<hr>

In [44]:
# all modules
import pandas as pd
# if packages need upgrading
# pip install --upgrade pip

In [45]:
# load dataset
data = pd.read_csv('../data/raw/rawdata.csv')

In [46]:
# drop the unnamed index column read from the CSV
data = data.drop(columns =['Unnamed: 0'])

In [47]:
data.shape

(8525, 4)

In [48]:
# data.dtypes

In [49]:
# data.head(5)

#### 1.2 Clean data
<hr>

In [50]:
# change 'date' dtype from 'object' to datetime64
data['date'] = pd.to_datetime(data['date'])
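
As a small illustration of the conversion above, the sketch below parses a hypothetical two-element series (not taken from the dataset). It assumes unparseable values should become `NaT` rather than raise, which is what `errors="coerce"` does; whether that is desirable here is a judgment call.

```python
import pandas as pd

# Hypothetical sample: one valid timestamp and one malformed value.
raw = pd.Series(["2020-09-29 06:34:23", "not a date"])

# errors="coerce" turns values that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the `NaT` values afterwards gives a quick check that no tweet lost its date during parsing.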

In [51]:
# data.dtypes

In [52]:
data.isnull().sum()

date         0
id           0
text         0
user_name    0
dtype: int64

In [53]:
# count duplicate 'id' values to be sure
len(data['id'])-len(data['id'].drop_duplicates())

0

In [55]:
# data['id'].describe()

In [56]:
# drop duplicates
data.drop_duplicates(subset=['id'],keep='last', inplace= True)

In [57]:
data.shape

(8525, 4)
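
The duplicate check and removal above can be sketched on a toy frame. The sample rows are hypothetical; the only assumption carried over from the notebook is that `id` identifies a tweet and that the last copy should be kept.

```python
import pandas as pd

# Hypothetical sample: tweet id 2 appears twice.
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "text": ["a", "b (old)", "b (new)", "c"],
})

# duplicated() flags every repeat after the first occurrence of an id.
n_duplicates = int(sample["id"].duplicated().sum())

# keep="last" retains the most recent copy of each id, as in the cell above.
deduplicated = sample.drop_duplicates(subset=["id"], keep="last")

print(n_duplicates)
print(deduplicated)
```

In the notebook the shape is unchanged after `drop_duplicates`, which is consistent with the duplicate count of 0 reported earlier.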

#### 1.3 Remove BiciMAD tweets
<hr>

In [58]:
# data analysis => sorting
data = data.sort_values('user_name', ascending=False)

In [59]:
data['user_name'].value_counts()

BiciMAD                174
ErBoteRojo             144
BICIMAD EN LUCHA       141
Jesús García Diaz      127
Plataforma Sindical    116
                      ... 
Nancy                    1
Oscar Guzman             1
Efectivamente y no       1
Iván S. 🔻🌹☭              1
Manuel Mos               1
Name: user_name, Length: 3520, dtype: int64

In [60]:
data = data[data.user_name != 'BiciMAD']

In [61]:
data.columns

Index(['date', 'id', 'text', 'user_name'], dtype='object')
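
The filter above can be checked with a short sketch. The frame below is a hypothetical sample; the exact-match comparison mirrors the notebook and would not catch other spellings of the account name.

```python
import pandas as pd

# Hypothetical sample with two tweets from the official account.
sample = pd.DataFrame({
    "user_name": ["BiciMAD", "ErBoteRojo", "BiciMAD", "Nancy"],
    "text": ["t1", "t2", "t3", "t4"],
})

# Keep only the rows whose user_name is not exactly 'BiciMAD'.
filtered = sample[sample["user_name"] != "BiciMAD"]

# Number of rows removed by the filter.
removed = len(sample) - len(filtered)

print(removed)
print(filtered["user_name"].tolist())
```

Comparing `removed` with the count shown by `value_counts()` is a cheap way to confirm that the filter dropped exactly the expected rows.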

## 2. Explore data

#### 2.1 Sort values by 'date' and reset index
<hr>

In [62]:
data = data.sort_values(by ='date', ascending=True)

In [63]:
data = data.reset_index()

In [64]:
data = data.drop(columns =['index'])
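
The three cells above (sort, reset the index, drop the old index column) can be written as one chained expression. This is a sketch on a hypothetical frame; `reset_index(drop=True)` discards the old index instead of adding it as a column.

```python
import pandas as pd

# Hypothetical sample, deliberately out of chronological order.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-02", "2020-09-29", "2020-10-01"]),
    "user_name": ["c", "a", "b"],
})

# Sort by date, then renumber the rows 0..n-1 without keeping the old index.
ordered = sample.sort_values(by="date", ascending=True).reset_index(drop=True)

print(ordered)
```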

In [65]:
data.tail(20)

Unnamed: 0,date,id,text,user_name
8331,2020-10-22 11:28:42,1319239253012918273,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,andresamtb
8332,2020-10-22 11:33:28,1319240450746863617,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,jam
8333,2020-10-22 11:34:45,1319240774878461954,El cementerio de BiciMad https://t.co/g88Q2Eua...,The Neo Worker
8334,2020-10-22 11:35:51,1319241050494541824,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Busero 20 🔻
8335,2020-10-22 11:38:26,1319241699462467584,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,j_medina
8336,2020-10-22 11:39:23,1319241941918298118,@PlataformaEMT @BiciMAD @AlmeidaPP_ @bcarabant...,ErBoteRojo
8337,2020-10-22 11:40:01,1319242100807000065,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,ErBoteRojo
8338,2020-10-22 11:40:59,1319242342797352960,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Julio
8339,2020-10-22 11:44:45,1319243290705842178,RT @villalba1200: @PlataformaEMT @BiciMAD @bca...,Juan talavera
8340,2020-10-22 11:45:44,1319243537565814786,RT @Rita_Maestre: En el Centro de Operaciones ...,Alberto Martin Guio


#### 2.2 Check most recent tweets from users
<hr>

In [66]:
# data.loc[data['user_name'] == 'Blanca Fernandez']

In [67]:
# Extract subsets
# data_subset = data[data['user_name'].isin(['Blanca Fernandez', 'BICIMAD EN LUCHA'])]

In [68]:
# data_subset

In [69]:
# First tweet available date
data['date'].min()

Timestamp('2020-09-29 06:34:23')

In [70]:
# Most recent tweet date
data['date'].max()

Timestamp('2020-10-22 12:06:06')

In [71]:
# saving data
# data.to_csv('../data/processed/data.csv', index=False)

## 3. Sentiment analysis

#### 3.1 Prepare text
<hr>

In [72]:
import re

##### Decide how to handle 'ñ'
<hr>

'''
from unicodedata import normalize
s = "Pingüino: Málãgà ês uñ̺ã cíudãd fantástica y èn Logroño me pica el... moñǫ̝̘̦̞̟̩̐̏̋͌́ͬ̚͡õ̪͓͍̦̓ơ̤̺̬̯͂̌͐͐͟o͎͈̳̠̼̫͂̊"
# NFD, then strip diacritics (the regex keeps the tilde on 'ñ')
s = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1", 
        normalize( "NFD", s), 0, re.I
    )
# back to NFC
s = normalize( 'NFC', s)
print( s )
'''

from unicodedata import normalize

def clean_tweet(tweet):
    # NFD-decompose and strip diacritics, keeping the tilde on 'ñ'
    text = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
        normalize("NFD", tweet), count=0, flags=re.I
    )
    text = normalize('NFC', text)
    # drop @mentions, URLs and every remaining non-ASCII character (this still removes 'ñ')
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
    #return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", text).split())

In [73]:
def clean_tweet(tweet):
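    # replace @mentions, non-ASCII/punctuation characters and URLs with a space, then collapse whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    #return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())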
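
Because the character class in `clean_tweet` only allows ASCII letters and digits, accented vowels and 'ñ' are replaced by spaces (for example "Señores" becomes "Se ores"). The sketch below is one possible alternative, not the notebook's method: the function name `clean_tweet_keep_spanish` is hypothetical, and it assumes that keeping the accented Spanish letters is acceptable for the downstream model. It also uses `@\w+`, which differs from the original pattern by matching handles that contain underscores.

```python
import re
from unicodedata import normalize

def clean_tweet_keep_spanish(tweet: str) -> str:
    """Hypothetical variant of clean_tweet that keeps Spanish letters."""
    # Remove URLs, then @mentions (\w also covers underscores in handles).
    text = re.sub(r"\w+://\S+", " ", tweet)
    text = re.sub(r"@\w+", " ", text)
    # Compose characters (NFC) so 'ñ' and accented vowels are single code points,
    # then replace everything that is not a letter, digit, space or tab.
    text = re.sub(r"[^0-9A-Za-zÁÉÍÓÚÜÑáéíóúüñ \t]", " ", normalize("NFC", text))
    # Collapse runs of whitespace.
    return " ".join(text.split())

print(clean_tweet_keep_spanish(
    "Señores de @BiciMAD las bicis están muy descuidadas https://t.co/abc"
))
```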

In [74]:
# Store the cleaned text in a new 'tweets_clean' column
data['tweets_clean'] = data['text'].apply(clean_tweet) 

In [75]:
# Print the updated dataframe 
data

Unnamed: 0,date,id,text,user_name,tweets_clean
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...
...,...,...,...,...,...
8346,2020-10-22 11:59:00,1319246876642840577,"RT @dvlnosmz: Mientras Madrid sigue colapsada,...",🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT Mientras Madrid sigue colapsada sin rastrea...
8347,2020-10-22 11:59:48,1319247077705175040,RT @Rita_Maestre: En el Centro de Operaciones ...,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT Maestre En el Centro de Operaciones de EMT ...
8348,2020-10-22 12:00:15,1319247190204780544,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Stielike,RT Estaci n 98 de 1 d a despu s Qu nos encontr...
8349,2020-10-22 12:03:41,1319248055162621958,RT @cso191: #RT @MasmadridS: RT @MasMadrid__: ...,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT RT RT Visitamos el Centro de Operaciones de...


In [76]:
from transformers import pipeline


In [77]:
# no model is passed, so transformers loads its default sentiment-analysis checkpoint
classifier = pipeline('sentiment-analysis')

In [78]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForMaskedLM.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

Some weights of BertForMaskedLM were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [79]:
def transform (x):
    return classifier(x)
# Apply transform function to all tweets 
data['sentiment']=data['tweets_clean'].apply(transform)

In [None]:
def tokenize_tweet(x):
    # separate name so the loaded `tokenizer` object is not shadowed
    return tokenizer.tokenize(x)
tokens = data['tweets_clean'].apply(tokenize_tweet)

In [80]:
data["score"] = [data["sentiment"][i][0]['score'] for i in range(data.shape[0])]

In [81]:
data["label"] = [data["sentiment"][i][0]['label'] for i in range(data.shape[0])]
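
The two list comprehensions above look each row up by the labels 0..n-1, which works here because the index was reset earlier. A sketch that does not depend on the index labels is shown below; the sample frame is hypothetical and only mimics the list-of-one-dict shape visible in the `sentiment` column.

```python
import pandas as pd

# Hypothetical results in the same shape as the 'sentiment' column:
# a list holding one dict per tweet. The index is non-default on purpose.
sample = pd.DataFrame(
    {
        "sentiment": [
            [{"label": "NEGATIVE", "score": 0.98}],
            [{"label": "POSITIVE", "score": 0.66}],
        ]
    },
    index=[10, 20],
)

# map() works element by element, so the index labels do not matter.
sample["label"] = sample["sentiment"].map(lambda result: result[0]["label"])
sample["score"] = sample["sentiment"].map(lambda result: result[0]["score"])

print(sample[["label", "score"]])
```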

In [82]:
data.head(5)

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE


In [83]:
sum(data["label"] == "POSITIVE")

619

In [84]:
sum(data["label"] == "NEGATIVE")

7732
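
Both counts can also be obtained in a single call. The sketch below uses a hypothetical label series; `value_counts(normalize=True)` returns shares instead of counts.

```python
import pandas as pd

# Hypothetical labels standing in for data["label"].
labels = pd.Series(["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"], name="label")

# Absolute counts per label.
counts = labels.value_counts()

# Relative frequencies per label.
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```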

In [85]:
data['score'].mean()

0.9246914345352196

In [86]:
score = data['score']

In [87]:
# series (watch the index)
positive = (data["label"] == "POSITIVE")

score[positive].mean()

0.7399847657692067

In [88]:
# series (watch the index)
negative = (data["label"] == "NEGATIVE")

score[negative].mean()

0.9394784790212727
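
The per-label means computed with boolean masks above can be sketched as a single `groupby`. The values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical scores standing in for the 'label' and 'score' columns.
sample = pd.DataFrame({
    "label": ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
    "score": [1.0, 0.75, 0.75, 0.5],
})

# One row per label with the number of tweets and the mean confidence score.
summary = sample.groupby("label")["score"].agg(["count", "mean"])

print(summary)
```

Note that `score` is the model's confidence in whichever label it chose, so the two means are not on a shared positive-to-negative scale.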

In [106]:
data['label'].all()   # labels are non-empty strings, so every element is truthy

'NEGATIVE'

In [107]:
data['label'].any()   # likewise, at least one element is truthy

'NEGATIVE'

In [108]:
data.label == 'POSITIVE'

0       False
1       False
2       False
3       False
4       False
        ...  
8346    False
8347     True
8348    False
8349    False
8350    False
Name: label, Length: 8351, dtype: bool

In [137]:
data['label_coded'] = data['label'].apply(lambda x: 1 if x == 'POSITIVE' else -1)

In [138]:
data

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE,-1,0.985986
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE,-1,0.981788
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE,-1,0.980981
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE,-1,0.983189
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE,-1,0.994368
...,...,...,...,...,...,...,...,...,...,...
8346,2020-10-22 11:59:00,1319246876642840577,"RT @dvlnosmz: Mientras Madrid sigue colapsada,...",🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT Mientras Madrid sigue colapsada sin rastrea...,"[{'label': 'NEGATIVE', 'score': 0.814892470836...",0.814892,NEGATIVE,-1,0.814892
8347,2020-10-22 11:59:48,1319247077705175040,RT @Rita_Maestre: En el Centro de Operaciones ...,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT Maestre En el Centro de Operaciones de EMT ...,"[{'label': 'POSITIVE', 'score': 0.661616325378...",0.661616,POSITIVE,1,0.661616
8348,2020-10-22 12:00:15,1319247190204780544,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Stielike,RT Estaci n 98 de 1 d a despu s Qu nos encontr...,"[{'label': 'NEGATIVE', 'score': 0.967031657695...",0.967032,NEGATIVE,-1,0.967032
8349,2020-10-22 12:03:41,1319248055162621958,RT @cso191: #RT @MasmadridS: RT @MasMadrid__: ...,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT RT RT Visitamos el Centro de Operaciones de...,"[{'label': 'NEGATIVE', 'score': 0.553843438625...",0.553843,NEGATIVE,-1,0.553843


In [139]:
data['score_coded'] = data['label_coded'] * data['score']
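
The signed score is the confidence multiplied by +1 for positive tweets and -1 for negative ones. The sketch below shows the same idea with an explicit mapping on hypothetical data. One design difference from the `lambda` used earlier: `map` yields `NaN` for any label that is not in the dictionary, so an unexpected label becomes visible instead of being coded as -1.

```python
import pandas as pd

# Hypothetical sample standing in for the 'label' and 'score' columns.
sample = pd.DataFrame({
    "label": ["NEGATIVE", "POSITIVE"],
    "score": [0.75, 0.5],
})

# Explicit sign per label; labels missing from the dict would map to NaN.
sign = sample["label"].map({"POSITIVE": 1, "NEGATIVE": -1})

# Signed score in [-1, 1]: negative tweets get a negative value.
sample["score_coded"] = sign * sample["score"]

print(sample)
```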

In [140]:
data

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE,-1,-0.985986
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE,-1,-0.981788
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE,-1,-0.980981
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE,-1,-0.983189
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE,-1,-0.994368
...,...,...,...,...,...,...,...,...,...,...
8346,2020-10-22 11:59:00,1319246876642840577,"RT @dvlnosmz: Mientras Madrid sigue colapsada,...",🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT Mientras Madrid sigue colapsada sin rastrea...,"[{'label': 'NEGATIVE', 'score': 0.814892470836...",0.814892,NEGATIVE,-1,-0.814892
8347,2020-10-22 11:59:48,1319247077705175040,RT @Rita_Maestre: En el Centro de Operaciones ...,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT Maestre En el Centro de Operaciones de EMT ...,"[{'label': 'POSITIVE', 'score': 0.661616325378...",0.661616,POSITIVE,1,0.661616
8348,2020-10-22 12:00:15,1319247190204780544,RT @PlataformaEMT: Estación 98 de @BiciMAD 1 d...,Stielike,RT Estaci n 98 de 1 d a despu s Qu nos encontr...,"[{'label': 'NEGATIVE', 'score': 0.967031657695...",0.967032,NEGATIVE,-1,-0.967032
8349,2020-10-22 12:03:41,1319248055162621958,RT @cso191: #RT @MasmadridS: RT @MasMadrid__: ...,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲),RT RT RT Visitamos el Centro de Operaciones de...,"[{'label': 'NEGATIVE', 'score': 0.553843438625...",0.553843,NEGATIVE,-1,-0.553843


In [89]:
data.to_csv('../data/results/data_sentiment.csv', index=False)

#### 3.3 Hypothesis testing
<hr>

##### 3.3.1 Separate today's tweets from earlier ones

In [90]:
# getting today's Timestamp
today = pd.Timestamp.today().floor('D')
# .normalize() does the same thing

In [91]:
today

Timestamp('2020-10-22 00:00:00')

In [92]:
# '>=' so that a tweet stamped exactly at midnight counts as today's
data_today = data[(data['date'] >= today )]

In [93]:
data_past = data[(data['date'] < today )]

In [94]:
data_past.shape[0]

8064

In [95]:
data_today.shape[0]

287
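
One way to guarantee that the two subsets are exhaustive and do not overlap is to derive both from a single boolean mask. The sketch below uses a hypothetical frame and a fixed cutoff date in place of `pd.Timestamp.today()` so that the result is reproducible.

```python
import pandas as pd

# Hypothetical sample with one tweet stamped exactly at midnight.
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-10-21 23:59:59",
        "2020-10-22 00:00:00",
        "2020-10-22 11:30:00",
    ])
})

# Fixed cutoff standing in for pd.Timestamp.today().normalize().
cutoff = pd.Timestamp("2020-10-22")

# A single mask and its negation place every row in exactly one subset.
is_today = sample["date"] >= cutoff
data_today = sample[is_today]
data_past = sample[~is_today]

print(len(data_today), len(data_past))
```

Checking that the two lengths add up to the total number of rows (8064 + 287 = 8351 in the notebook) confirms that no tweet was dropped by the split.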

##### 3.3.2 Code 'negative' and 'positive' with 0 and 1