# Data Analysis

This notebook covers the steps used to describe the data (e.g. key statistics).<br>
**Data**: CSV saved in '../data/raw/rawdata.csv'. This file contains all collected tweets (old and new), as produced by the 'acquisition module'.

**Key actions** 
<br> <hr>
- Identify and remove duplicate records
- Identify and remove tweets posted by the user BiciMAD
- Examine data for potential issues
- Identify and fill in missing values
- *Remove low variance columns (potentially not needed)*
- Identify potential outliers *(potentially not needed)*
- Correct incorrect data types *(potentially only text variable)*
- Remove special characters and clean categorical variables *(potentially only text variable)*
<br>

## 1. Read & clean data

#### 1.1 Read data
<hr>

In [1]:
# all modules
import pandas as pd
# if packages need upgrading
# pip install --upgrade pip

In [2]:
# load dataset
data = pd.read_csv('../data/raw/rawdata.csv')

In [3]:
# drop the leftover index column ('Unnamed: 0') and keep the required columns
data = data.drop(columns =['Unnamed: 0'])
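An `Unnamed: 0` column usually appears when a DataFrame was saved together with its index. A minimal sketch, assuming that is what happened here, of reading the first column as the index so that no separate `drop` is needed. The inline CSV is made-up sample data, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking a CSV that was saved with its index
csv_text = (
    ",date,id,text,user_name\n"
    "0,2020-10-10 11:07:26,1,hello,alice\n"
    "1,2020-10-10 10:38:11,2,world,bob\n"
)

# index_col=0 treats the unnamed first column as the index,
# so no 'Unnamed: 0' column is created
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```

Whether this is preferable to the explicit `drop` depends on how the acquisition step writes the file.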

In [4]:
data.shape

(8109, 4)

In [5]:
data.dtypes

date         object
id            int64
text         object
user_name    object
dtype: object

In [6]:
data.head(5)

Unnamed: 0,date,id,text,user_name
0,2020-10-10 11:07:26,1314885243531399169,@PlataformaEMT @FRAVM @BiciMAD @AlmeidaPP_ @bc...,VICENTE RODRIGUEZ MU
1,2020-10-10 10:38:11,1314877884067188736,@PlataformaEMT @MADRID @AlmeidaPP_ @bcarabante...,Stielike
2,2020-10-10 10:35:49,1314877290489184261,@mcascallares Ha podido tratarse de un fallo p...,BiciMAD
3,2020-10-10 10:35:14,1314877141658542083,"@AnxoOroisPhoto Te pedimos disculpas, ha podid...",BiciMAD
4,2020-10-10 10:13:18,1314871622487212033,"Hola @BiciMAD la app no funciona, llevo 1 hora...",AnxoOroisPhotography


#### 1.2 Clean data
<hr>

In [7]:
# change date type from 'object' to 'datetime'
data['date'] = pd.to_datetime(data['date'])
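The conversion above lets pandas infer the date format. A short sketch of a stricter variant, assuming every value follows the `YYYY-MM-DD HH:MM:SS` layout seen in the `head()` output; the third value is a made-up malformed entry included to show what `errors="coerce"` does.

```python
import pandas as pd

# Two dates from the sample output plus one deliberately malformed value
raw_dates = pd.Series(["2020-10-10 11:07:26", "2020-10-10 10:38:11", "not a date"])

# An explicit format avoids ambiguous parsing; errors="coerce" turns bad values into NaT
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that could not be parsed
print(parsed.isna().sum())
```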

In [8]:
data.dtypes

date         datetime64[ns]
id                    int64
text                 object
user_name            object
dtype: object

In [9]:
data.isnull().sum()

date         0
id           0
text         0
user_name    0
dtype: int64

In [10]:
# count duplicated 'id' values to be sure (0 means every tweet id is unique)
len(data['id'])-len(data['id'].drop_duplicates())

0
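The same count can be obtained with `Series.duplicated`, which also makes it easy to inspect the offending rows. A minimal sketch on made-up data, with one repeated `id`:

```python
import pandas as pd

# Hypothetical sample: id 2 appears twice
sample = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "text": ["a", "b", "b (newer)", "c"],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = sample["id"].duplicated().sum()

# keep="last" retains the most recently appended copy of each id
deduped = sample.drop_duplicates(subset=["id"], keep="last")

print(n_duplicates)
print(deduped["text"].tolist())
```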

In [12]:
data['id'].describe()

count    8.109000e+03
mean     1.315903e+18
std      2.516499e+15
min      1.310830e+18
25%      1.313579e+18
50%      1.316399e+18
75%      1.318196e+18
max      1.318995e+18
Name: id, dtype: float64
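The summary above treats `id` as a quantity, but tweet ids are identifiers, and `describe()` reports them as `float64`, which cannot represent integers of this size exactly. A small sketch of the precision issue and of one possible remedy, reading the column as text. The `dtype` choice is a suggestion, not something the notebook does.

```python
import io
import pandas as pd

tweet_id = 1314885243531399169  # first id shown in the head() output

# Converting to float and back does not return the same integer
as_float = float(tweet_id)
print(int(as_float) == tweet_id)  # False: the id is not exactly representable as float64

# Reading ids as strings keeps them intact and signals that they are labels
csv_text = "id,text\n1314885243531399169,hola\n"
sample = pd.read_csv(io.StringIO(csv_text), dtype={"id": "string"})
print(sample["id"].iloc[0])
```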

In [13]:
# drop duplicates by 'id', keeping the last occurrence
data.drop_duplicates(subset=['id'],keep='last', inplace= True)

In [14]:
data.shape

(8109, 4)

#### 1.3 Take out BiciMAD Tweets
<hr>

In [15]:
# sort by user name before inspecting the number of tweets per user
data = data.sort_values('user_name', ascending=False)

In [16]:
data['user_name'].value_counts()

BiciMAD                 171
ErBoteRojo              134
Jesús García Diaz       120
Plataforma Sindical     109
Julio                    96
                       ... 
Llámame como quieras      1
hola                      1
jmgm                      1
Chebastos                 1
incondicional             1
Name: user_name, Length: 3392, dtype: int64

In [17]:
data = data[data.user_name != 'BiciMAD']
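The filter above relies on an exact match with `'BiciMAD'`. A hedged sketch of a more tolerant comparison, in case the name ever appears with different casing or stray whitespace. The sample rows, including the `' bicimad '` variant, are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with two spellings of the operator account
sample = pd.DataFrame({
    "user_name": ["BiciMAD", "Stielike", " bicimad ", "Julio"],
    "text": ["a", "b", "c", "d"],
})

# Normalise whitespace and case before comparing
is_operator = sample["user_name"].str.strip().str.casefold() == "bicimad"

# Keep only tweets that do not come from the operator account
citizens = sample[~is_operator]
print(citizens["user_name"].tolist())
```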

In [18]:
data.columns

Index(['date', 'id', 'text', 'user_name'], dtype='object')

## 2. Explore data

#### 2.1 Sort values by 'date' and reset index
<hr>

In [19]:
data = data.sort_values(by ='date', ascending=True)

In [20]:
data = data.reset_index()

In [21]:
data = data.drop(columns =['index'])
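The three cells above (sort, reset the index, drop the old index column) can be expressed as one chain, since `reset_index(drop=True)` discards the old index instead of adding it as a column. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical unsorted sample
sample = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-10", "2020-09-29", "2020-10-01"]),
    "text": ["c", "a", "b"],
})

# Sort chronologically and renumber rows from 0 without keeping the old index
ordered = sample.sort_values(by="date").reset_index(drop=True)
print(ordered)
```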

In [22]:
data.tail(20)

Unnamed: 0,date,id,text,user_name
7918,2020-10-21 18:36:49,1318984603177279490,RT @rojobote: @Canito_80 @bcarabante @ParkingK...,Trotamundos
7919,2020-10-21 18:37:04,1318984665701732353,RT @Esther_Gomez_M: Hoy hemos visitado la coch...,Angel Callejo Muñoz
7920,2020-10-21 18:37:20,1318984731913015296,RT @MasCarabanchel: ☹️ Abandonadas! Así están ...,Nora
7921,2020-10-21 18:38:03,1318984912544858112,RT @MasCarabanchel: ☹️ Abandonadas! Así están ...,Señor de la cebolla✌🏼🇪🇸✌🏼
7922,2020-10-21 18:38:17,1318984971416182784,RT @MasCarabanchel: ☹️ Abandonadas! Así están ...,@DESIERTA64 ✌🏻✌🏻🇪🇦
7923,2020-10-21 18:39:26,1318985263306113030,RT @CeciliaGessa: Que verguenzaaaa @BiciMAD !!...,Bruno
7924,2020-10-21 18:41:05,1318985675358818304,RT @MasCarabanchel: ☹️ Abandonadas! Así están ...,Pablo Sastre Olmos
7925,2020-10-21 18:44:50,1318986621560233984,"RT @ElSaltoDiario: La Plataforma Sindical, sin...",Jesús García Diaz
7926,2020-10-21 18:47:34,1318987308113289216,RT @MasCarabanchel: ☹️ Abandonadas! Así están ...,José Luis Nieto Bueno
7927,2020-10-21 18:53:26,1318988784067878912,RT @MasCarabanchel: ☹️ Abandonadas! Así están ...,Más Madrid Vicálvaro


#### 2.2 Check most recent tweets from users
<hr>

In [23]:
# data.loc[data['user_name'] == 'Blanca Fernandez']

In [24]:
# Extract subsets
# data_subset = data[data['user_name'].isin(['Blanca Fernandez', 'BICIMAD EN LUCHA'])]

In [25]:
# data_subset

In [23]:
# First tweet available date
data['date'].min()

Timestamp('2020-09-29 06:34:23')

In [24]:
# Most recent tweet date
data['date'].max()

Timestamp('2020-10-21 19:19:05')
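From the first and last timestamps one can also derive the length of the collection window and a per-day tweet count. A short sketch using three timestamps taken from the outputs in this notebook; the real `data` frame would be used in practice.

```python
import pandas as pd

# Three timestamps copied from the outputs shown in this notebook
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-09-29 06:34:23",
        "2020-09-29 07:01:33",
        "2020-10-21 19:19:05",
    ])
})

# Length of the observation window as a Timedelta
span = sample["date"].max() - sample["date"].min()

# Number of tweets per calendar day, in chronological order
tweets_per_day = sample["date"].dt.date.value_counts().sort_index()

print(span.days)
print(tweets_per_day)
```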

In [25]:
# saving data
# data.to_csv('../data/processed/data.csv', index=False)

## 3. Sentiment analysis

#### 3.1 Prepare text
<hr>

In [29]:
import re

##### Decide how to handle 'ñ'
<hr>

'''
from unicodedata import normalize
s = "Pingüino: Málãgà ês uñ̺ã cíudãd fantástica y èn Logroño me pica el... moñǫ̝̘̦̞̟̩̐̏̋͌́ͬ̚͡õ̪͓͍̦̓ơ̤̺̬̯͂̌͐͐͟o͎͈̳̠̼̫͂̊"
# -> NFD and remove diacritics
s = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1", 
        normalize( "NFD", s), 0, re.I
    )
# -> NFC
s = normalize( 'NFC', s)
print( s )
'''

from unicodedata import normalize

def clean_tweet(tweet):
    # Draft: strip diacritics but keep 'ñ' (decompose with NFD, drop combining marks, recompose with NFC)
    tweet = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
        normalize("NFD", tweet), flags=re.I
    )
    tweet = normalize('NFC', tweet)
    # Remove mentions, URLs and anything that is not alphanumeric, 'ñ' or whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-zñÑ \t])|(\w+:\/\/\S+)", " ", tweet).split())
    #return " ".join(re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())

In [30]:
def clean_tweet(tweet):
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    #return " ".join(re.sub("([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())

In [31]:
# Create the tweets_clean column from the raw text
data['tweets_clean'] = data['text'].apply(clean_tweet) 

In [32]:
# Print the updated dataframe 
data

Unnamed: 0,date,id,text,user_name,tweets_clean
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...
...,...,...,...,...,...
6944,2020-10-19 22:24:16,1318317065267253249,RT @MADRID: .@BiciMAD 🚲 llega en octubre a Lat...,Gbj,RT llega en octubre a Latina y Fuencarral El P...
6945,2020-10-19 22:25:32,1318317385502367746,RT @Rita_Maestre: La foto que resume lo que ha...,JaviSeCa,RT Maestre La foto que resume lo que ha hecho ...
6946,2020-10-19 22:26:53,1318317727140974598,RT @Rita_Maestre: La foto que resume lo que ha...,J. L.,RT Maestre La foto que resume lo que ha hecho ...
6947,2020-10-19 22:27:59,1318318002102849536,RT @Rita_Maestre: La foto que resume lo que ha...,Corto Maltés,RT Maestre La foto que resume lo que ha hecho ...
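In the output above, accented letters and 'ñ' are replaced by spaces ("Se ores", "verg enza"), and mentions containing an underscore are only partly removed ("RT Maestre"), because the character classes are ASCII-only. A sketch of an alternative cleaner that relies on Python's Unicode-aware `\w` instead. The function name `clean_tweet_unicode` is hypothetical and this is not the cleaning used for the results in this notebook.

```python
import re

# Mentions (including underscores) and URLs
MENTION_OR_URL = re.compile(r"@\w+|\w+://\S+")
# Anything that is neither a Unicode word character nor whitespace, plus the underscore
NON_WORD = re.compile(r"[^\w\s]|_")

def clean_tweet_unicode(tweet: str) -> str:
    """Remove mentions, URLs, punctuation and emoji while keeping accented letters."""
    tweet = MENTION_OR_URL.sub(" ", tweet)
    tweet = NON_WORD.sub(" ", tweet)
    # Collapse the runs of whitespace left behind by the substitutions
    return " ".join(tweet.split())

print(clean_tweet_unicode("Señores de @BiciMAD @MADRID las bicis están muy descuidadas"))
print(clean_tweet_unicode("RT @Rita_Maestre: La foto https://t.co/abc"))
```

Whether keeping accents helps depends on the tokenizer of the downstream model, so this should be checked against the classifier actually used.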


In [33]:
from transformers import pipeline


In [34]:
# No model is specified, so the pipeline loads its default checkpoint (trained on English text)
classifier = pipeline('sentiment-analysis')

In [35]:
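from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load the Spanish BERT checkpoint; note that these objects are not passed to `classifier`,
# which keeps using its own model
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForMaskedLM.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")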

Some weights of BertForMaskedLM were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
def transform(x):
    # Classify one cleaned tweet; the result is a list holding one {'label', 'score'} dict
    return classifier(x)
# Apply transform function to all tweets
data['sentiment'] = data['tweets_clean'].apply(transform)
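Calling the classifier once per row works but can be slow for thousands of tweets. A sketch of batching, assuming the classifier accepts a list of strings and returns one result dict per string, as `transformers` pipelines do. `classify_in_batches` and `fake_classifier` are hypothetical names; the fake stands in for the real pipeline so the example runs without downloading a model.

```python
def classify_in_batches(texts, classifier, batch_size=32):
    """Send texts to the classifier in chunks and collect one result per text."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(classifier(batch))
    return results

def fake_classifier(batch):
    # Stand-in for the real pipeline: returns a fixed label and score for every text
    return [{"label": "NEGATIVE", "score": 0.5} for _ in batch]

results = classify_in_batches(["uno", "dos", "tres"], fake_classifier, batch_size=2)
print(len(results))
```

With this shape each entry is a single dict rather than a one-element list, so the score would be read as `result['score']` instead of `result[0]['score']`.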

In [37]:
data["score"] = data["sentiment"].apply(lambda result: result[0]['score'])

In [38]:
data["label"] = data["sentiment"].apply(lambda result: result[0]['label'])

In [39]:
data.head(5)

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE


In [40]:
sum(data["label"] == "POSITIVE")

341

In [42]:
sum(data["label"] == "NEGATIVE")

6608
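The two counts above can be obtained in one call, together with their relative shares, through `value_counts`. A minimal sketch that rebuilds a label column from the counts reported in this notebook (341 positive, 6608 negative):

```python
import pandas as pd

# Label column reconstructed from the counts reported above
labels = pd.Series(["POSITIVE"] * 341 + ["NEGATIVE"] * 6608, name="label")

# Absolute counts per label
counts = labels.value_counts()

# Relative share of each label (fractions that sum to 1)
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```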

In [46]:
data['score'].mean()

0.9329787673821877

In [48]:
score = data['score']

In [51]:
# boolean mask; it shares the DataFrame index, so it aligns with `score`
positive = (data["label"] == "POSITIVE")

score[positive].mean()

0.7833219177562121

In [52]:
# boolean mask; it shares the DataFrame index, so it aligns with `score`
negative = (data["label"] == "NEGATIVE")

score[negative].mean()

0.9407016768438187
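The per-label means computed with boolean masks can also be produced in a single step with `groupby`. A short sketch on invented scores, not on the notebook's results:

```python
import pandas as pd

# Hypothetical labels and confidence scores
sample = pd.DataFrame({
    "label": ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
    "score": [0.7, 0.9, 0.99, 0.8],
})

# Count and mean confidence for each predicted label
summary = sample.groupby("label")["score"].agg(["count", "mean"])
print(summary)
```

Note that `score` is the model's confidence in the predicted label, so a high mean for both groups does not by itself say how positive or negative the tweets are.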

In [41]:
data.to_csv('../data/results/data_sentiment.csv', index=False)
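When saved this way, the `sentiment` column (a list of dicts) is written as its string representation, which is awkward to parse back. A hedged sketch that drops it, since `score` and `label` already hold the same information, and creates the output folder if it is missing. A temporary directory stands in for `../data/results`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row result frame with the same columns of interest
sample = pd.DataFrame({
    "id": [1],
    "sentiment": [[{"label": "NEGATIVE", "score": 0.98}]],
    "score": [0.98],
    "label": ["NEGATIVE"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "results" / "data_sentiment.csv"
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Save only flat columns; the nested 'sentiment' column is redundant
    sample.drop(columns=["sentiment"]).to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(restored.columns.tolist())
```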

#### 3.2 Predictions
<hr>

"""import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim=-1)"""
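The disabled snippet above would turn raw model outputs (logits) into probabilities with a softmax. A dependency-free sketch of that transformation, written in plain Python for illustration only; the logits are made-up numbers and `softmax` here is a local helper, not the PyTorch function.

```python
import math

def softmax(logits):
    """Map a list of logits to probabilities that sum to 1."""
    # Subtracting the maximum keeps the exponentials numerically stable
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes (NEGATIVE, POSITIVE)
probs = softmax([-1.2, 2.3])
print(probs)
```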