# Data Analysis

This notebook contains the steps used to describe the data (key statistics, etc.).<br>
**Data**: CSV saved in '../data/raw/rawdata.csv'. This file contains all tweets collected so far (old and new), as produced by the 'acquisition module'.

**Key actions** 
<br> <hr>
- Identify and remove duplicate records
- Identify and remove tweets posted by the user BiciMAD
- Examine data for potential issues
- Identify and fill in missing values
- *Remove low variance columns (potentially not needed)*
- Identify potential outliers *(potentially not needed)*
- Correct incorrect data types *(potentially only text variable)*
- Remove special characters and clean categorical variables *(potentially only text variable)*
<br>

## 1. Read & clean data

#### 1.1 Read data
<hr>

In [182]:
# all modules
import pandas as pd
# if packages need upgrading:
# pip install --upgrade pip

In [183]:
# load dataset
data = pd.read_csv('../data/raw/rawdata.csv')

In [184]:
# drop the unnamed index column read from the CSV
data = data.drop(columns =['Unnamed: 0'])

In [185]:
data.shape

(8874, 4)

In [186]:
# data.dtypes

In [187]:
# data.head(5)

#### 1.2 Clean data
<hr>

In [188]:
# convert 'date' from object (string) to datetime
data['date'] = pd.to_datetime(data['date'])
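The two preparation steps so far (dropping `Unnamed: 0` and converting `date`) can also be done while reading the file. The sketch below assumes `Unnamed: 0` is a saved row index, and uses a made-up one-row CSV held in memory in place of `rawdata.csv`.

```python
import io

import pandas as pd

# Made-up stand-in for '../data/raw/rawdata.csv'; the empty first header
# is what pandas would otherwise name 'Unnamed: 0'.
csv_text = (
    ",date,id,text,user_name\n"
    "0,2020-10-23 00:02:23,1319428924116434944,hola,Ana\n"
)

# index_col=0 turns the first column into the index instead of keeping it
# as data; parse_dates converts 'date' to datetime while reading.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["date"])

print(toy.dtypes)
```

With a real path in place of the `StringIO` object, the separate `drop` and `to_datetime` cells would no longer be needed.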

In [189]:
# data.dtypes

In [190]:
data.isnull().sum()

date         0
id           0
text         0
user_name    0
dtype: int64

In [191]:
# count duplicated 'id' values to be sure
len(data['id'])-len(data['id'].drop_duplicates())

0

In [192]:
data.isnull().sum()

date         0
id           0
text         0
user_name    0
dtype: int64

In [193]:
# data['id'].describe()

In [194]:
# drop duplicates
data.drop_duplicates(subset=['id'],keep='last', inplace= True)

In [195]:
data.shape

(8874, 4)
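The duplicate check and the `keep='last'` behaviour are easier to see on a tiny frame. This is a minimal sketch with made-up rows; `Series.duplicated()` is used as an alternative way of counting repeated ids.

```python
import pandas as pd

# Made-up rows: id 2 appears twice, the later row being the newer version.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "text": ["a", "b-old", "b-new", "c"],
})

# duplicated() flags every repeat after the first occurrence of an id.
n_dupes = int(toy["id"].duplicated().sum())

# keep='last' retains the most recently appended row for each id.
deduped = toy.drop_duplicates(subset=["id"], keep="last")

print(n_dupes)
print(deduped)
```

In the notebook's data the count is 0, which is why the shape stays at (8874, 4) after `drop_duplicates`.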

#### 1.3 Remove BiciMAD tweets
<hr>

In [196]:
# data analysis => sorting
data = data.sort_values('user_name', ascending=False)

In [197]:
data['user_name'].value_counts()

BiciMAD                181
BICIMAD EN LUCHA       147
ErBoteRojo             145
Jesús García Diaz      140
Plataforma Sindical    128
                      ... 
RG                       1
jj                       1
Pablo Rivas              1
LA ESPERANZA❤CRECE       1
Bego CL                  1
Name: user_name, Length: 3594, dtype: int64

In [198]:
data = data[data.user_name != 'BiciMAD']

In [199]:
data.columns

Index(['date', 'id', 'text', 'user_name'], dtype='object')
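The filter above matches the exact string 'BiciMAD'. If the account name could ever appear with different casing or stray whitespace (an assumption; the counts above only show the exact spelling), a normalised comparison is one option. The rows below are made up.

```python
import pandas as pd

# Made-up user names; 'BICIMAD EN LUCHA' is a different account and must stay.
toy = pd.DataFrame(
    {"user_name": ["BiciMAD", " bicimad ", "BICIMAD EN LUCHA", "Kike"]}
)

# Strip surrounding whitespace and casefold before comparing, so only
# variants of the full name 'BiciMAD' are matched.
is_official = toy["user_name"].str.strip().str.casefold() == "bicimad"
filtered = toy[~is_official]

print(filtered)
```

Because the comparison is still an equality test on the whole name, accounts that merely contain "bicimad" are kept.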

## 2. Explore data

#### 2.1 Sort values by 'date' and reset index
<hr>

In [200]:
data = data.sort_values(by ='date', ascending=True)

In [201]:
data = data.reset_index()

In [202]:
data = data.drop(columns =['index'])
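The three cells above (sort, reset index, drop the old index column) can be combined into one chained call. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with a non-sequential index, as left behind by earlier filtering.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-10-23 11:21:19", "2020-09-29 06:34:23", "2020-10-23 00:02:23"]
        ),
        "user_name": ["a", "b", "c"],
    },
    index=[7, 3, 5],
)

# drop=True discards the old index instead of adding it as an 'index' column.
toy = toy.sort_values(by="date", ascending=True).reset_index(drop=True)

print(toy)
```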

In [234]:
data.tail(20)

Unnamed: 0,date,id,text,user_name
8673,2020-10-23 10:57:08,1319593695839223810,"RT @ignprados: Hola, @BiciMAD, hoy a las 9:10 ...",Alberto.
8674,2020-10-23 10:58:47,1319594110144204801,@BiciMAD en Latina🥰 https://t.co/TtvpUOO5dw,🏳️‍🌈🇪🇸 (🛴DEJA EL COCHE EN CASA🚲)
8675,2020-10-23 11:06:13,1319595980975648769,RT @PlataformaEMT: Cientos de bicicletas de @B...,Jbp
8676,2020-10-23 11:19:16,1319599263979024389,RT @PlataformaEMT: Buenos días plataformer@s.\...,jason adonis
8677,2020-10-23 11:21:19,1319599779979186176,"RT @carmartinco: La 204 de nuevo, @BiciMAD sup...",Jesús García Diaz
8678,2020-10-23 11:21:30,1319599828742119426,RT @diego_rebollo: Esta es la realidad de @Bic...,Jesús García Diaz
8679,2020-10-23 11:21:35,1319599849373863937,RT @PlataformaEMT: Buenos días plataformer@s.\...,Jesús García Diaz
8680,2020-10-23 11:21:44,1319599886959054848,RT @PlataformaEMT: Cientos de bicicletas de @B...,Jesús García Diaz
8681,2020-10-23 11:22:27,1319600066546597888,RT @MomoChan_says: @BiciMAD estación 42 con 6 ...,Jesús García Diaz
8682,2020-10-23 11:24:28,1319600576091574272,RT @PlataformaEMT: Buenos días plataformer@s.\...,Nomientas28


#### 2.2 Check most recent tweets from users
<hr>

In [235]:
# getting today's Timestamp
today = pd.Timestamp.today().floor('D')
# .normalize() does the same thing

In [236]:
today

Timestamp('2020-10-23 00:00:00')

In [237]:
data_today = data[data['date'] >= today]  # >= keeps a tweet stamped exactly at midnight

In [238]:
data_past = data[(data['date'] < today )]

In [239]:
data_past.shape[0]

8543

In [240]:
data_today.shape[0]

150

In [241]:
data_today

Unnamed: 0,date,id,text,user_name
8543,2020-10-23 00:02:23,1319428924116434944,RT @Esther_Gomez_M: Parece que @EMTmadrid pref...,Ana Moradillo Alonso
8544,2020-10-23 00:11:28,1319431206086332419,RT @edugaresp: El servicio de alquiler de bici...,antonio 🏳️‍🌈antonio MAD
8545,2020-10-23 00:13:54,1319431821067694082,RT @Esther_Gomez_M: Prometió @AlmeidaPP_ 50 nu...,Kike
8546,2020-10-23 01:03:15,1319444240099934208,RT @Esther_Gomez_M: Para conocer la difícil si...,Rober
8547,2020-10-23 02:26:26,1319465173086818305,RT @Rita_Maestre: La foto que resume lo que ha...,Dani Aparicio
...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez
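One way to guarantee that the "today" and "past" subsets never overlap or lose a row is to derive both from a single boolean mask. The sketch below uses made-up timestamps and a fixed cutoff in place of `pd.Timestamp.today()`.

```python
import pandas as pd

# Made-up timestamps around the day boundary, including exact midnight.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-10-22 23:59:59",
        "2020-10-23 00:00:00",
        "2020-10-23 00:02:23",
    ])
})

# Fixed cutoff so the example is reproducible.
cutoff = pd.Timestamp("2020-10-23")

# A single mask and its negation always partition the frame.
is_today = toy["date"] >= cutoff
toy_today = toy[is_today]
toy_past = toy[~is_today]

print(len(toy_past), len(toy_today))
```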


In [242]:
# data.loc[data['user_name'] == 'Blanca Fernandez']

In [243]:
# Extract subsets
# data_subset = data[data['user_name'].isin(['Blanca Fernandez', 'BICIMAD EN LUCHA'])]

In [244]:
# data_subset

In [245]:
# First tweet available date
# data['date'].min()

In [246]:
# Most recent tweet date
# data['date'].max()

In [247]:
# saving data
# data.to_csv('../data/processed/data.csv', index=False)

## 3. Sentiment analysis

#### 3.1 Prepare text
<hr>

In [248]:
import re

##### To do: decide how to handle 'ñ'
<hr>

In [249]:
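def clean_tweet(tweet):
    # Replace @mentions, URLs and every character outside ASCII letters/digits with a space,
    # then collapse whitespace. Note: this also strips 'ñ' and accented vowels.
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    #return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())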
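For the open question about 'ñ', one option is a Unicode-aware variant of the cleaner. This is a sketch, not the notebook's method: `clean_tweet_unicode` is a hypothetical name, and it relies on `\w` matching accented letters for `str` patterns in Python 3.

```python
import re

# Hypothetical Unicode-aware variant of clean_tweet.
_MENTION = re.compile(r"@\w+")        # whole handle, including underscores
_URL = re.compile(r"\w+://\S+")       # scheme://rest-of-url
_NON_WORD = re.compile(r"[^\w\s]")    # punctuation, emoji and other symbols


def clean_tweet_unicode(tweet: str) -> str:
    """Remove mentions, URLs and symbols while keeping accented letters."""
    text = _MENTION.sub(" ", tweet)
    text = _URL.sub(" ", text)
    text = _NON_WORD.sub(" ", text)   # note: underscores count as \w and stay
    return " ".join(text.split())


example = "RT @Esther_Gomez_M: Prometió 50 nuevas estaciones https://t.co/abc"
print(clean_tweet_unicode(example))
```

Unlike the ASCII-only pattern, this removes the full handle (no leftover "Gomez M") and keeps words such as "Prometió" intact.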

In [250]:
data_today_clean_tweet = data_today.copy()

In [251]:
# Add the cleaned text as a new 'tweets_clean' column
data_today_clean_tweet['tweets_clean'] = data_today['text'].apply(clean_tweet)

In [252]:
# Print the updated dataframe 
data_today_clean_tweet

Unnamed: 0,date,id,text,user_name,tweets_clean
8543,2020-10-23 00:02:23,1319428924116434944,RT @Esther_Gomez_M: Parece que @EMTmadrid pref...,Ana Moradillo Alonso,RT Gomez M Parece que prefiere usar bicicletas...
8544,2020-10-23 00:11:28,1319431206086332419,RT @edugaresp: El servicio de alquiler de bici...,antonio 🏳️‍🌈antonio MAD,RT El servicio de alquiler de bicicletas de Ma...
8545,2020-10-23 00:13:54,1319431821067694082,RT @Esther_Gomez_M: Prometió @AlmeidaPP_ 50 nu...,Kike,RT Gomez M Prometi 50 nuevas estaciones de est...
8546,2020-10-23 01:03:15,1319444240099934208,RT @Esther_Gomez_M: Para conocer la difícil si...,Rober,RT Gomez M Para conocer la dif cil situaci n d...
8547,2020-10-23 02:26:26,1319465173086818305,RT @Rita_Maestre: La foto que resume lo que ha...,Dani Aparicio,RT Maestre La foto que resume lo que ha hecho ...
...,...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...


In [254]:
df_today = data_today_clean_tweet.copy()

In [255]:
from transformers import pipeline

In [256]:
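# With no model argument, pipeline() loads its default English sentiment model,
# so the Spanish tweets below are scored by a model trained on English text.
classifier = pipeline('sentiment-analysis')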

In [257]:
# Spanish BERT with a masked-LM head; it is loaded here but never passed to `classifier`.
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForMaskedLM.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

Some weights of BertForMaskedLM were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [258]:
def transform(x):
    # For a single string the classifier returns a one-element list:
    # [{'label': ..., 'score': ...}]
    return classifier(x)
# Apply transform function to all tweets
df_today['sentiment'] = data_today_clean_tweet['tweets_clean'].apply(transform)
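Calling the classifier once per row works, but a `transformers` pipeline also accepts a list of strings and returns one dict per input. The sketch below batches the calls. `classify_in_batches` and `fake_classifier` are hypothetical names, and the fake classifier only imitates the return shape so the example runs without downloading a model.

```python
from typing import Callable, Dict, List


def classify_in_batches(
    classifier: Callable[[List[str]], List[Dict]],
    texts: List[str],
    batch_size: int = 32,
) -> List[Dict]:
    """Run `classifier` over `texts` in chunks and flatten the results."""
    results: List[Dict] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(classifier(batch))
    return results


def fake_classifier(batch: List[str]) -> List[Dict]:
    # Hypothetical stand-in that mimics the pipeline's output shape only.
    return [{"label": "NEGATIVE", "score": 0.9} for _ in batch]


texts = ["uno", "dos", "tres", "cuatro", "cinco"]
results = classify_in_batches(fake_classifier, texts, batch_size=2)
print(len(results), results[0])
```

Each result here is a plain dict rather than a one-element list, so `label` and `score` can be turned into columns directly.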

In [292]:
df_today

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label,sentiment2
8543,2020-10-23 00:02:23,1319428924116434944,RT @Esther_Gomez_M: Parece que @EMTmadrid pref...,Ana Moradillo Alonso,RT Gomez M Parece que prefiere usar bicicletas...,"[{'label': 'NEGATIVE', 'score': 0.940226495265...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9402264952..."
8544,2020-10-23 00:11:28,1319431206086332419,RT @edugaresp: El servicio de alquiler de bici...,antonio 🏳️‍🌈antonio MAD,RT El servicio de alquiler de bicicletas de Ma...,"[{'label': 'NEGATIVE', 'score': 0.556998789310...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.5569987893..."
8545,2020-10-23 00:13:54,1319431821067694082,RT @Esther_Gomez_M: Prometió @AlmeidaPP_ 50 nu...,Kike,RT Gomez M Prometi 50 nuevas estaciones de est...,"[{'label': 'NEGATIVE', 'score': 0.915848910808...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9158489108..."
8546,2020-10-23 01:03:15,1319444240099934208,RT @Esther_Gomez_M: Para conocer la difícil si...,Rober,RT Gomez M Para conocer la dif cil situaci n d...,"[{'label': 'NEGATIVE', 'score': 0.977445721626...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9774457216..."
8547,2020-10-23 02:26:26,1319465173086818305,RT @Rita_Maestre: La foto que resume lo que ha...,Dani Aparicio,RT Maestre La foto que resume lo que ha hecho ...,"[{'label': 'NEGATIVE', 'score': 0.944254934787...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9442549347..."
...,...,...,...,...,...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9307441115..."
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9820828437..."
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.9820828437..."
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...,"[{'label': 'NEGATIVE', 'score': 0.794018328189...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>,"[[{'label': 'NEGATIVE', 'score': 0.7940183281..."


In [284]:
df_today['sentiment'] = df_today.sentiment.apply(str)

In [290]:
df_today['sentiment2'] = df_today['sentiment'].str.split(',') 

In [291]:
df_today['sentiment2']

8543    [[{'label': 'NEGATIVE',  'score': 0.9402264952...
8544    [[{'label': 'NEGATIVE',  'score': 0.5569987893...
8545    [[{'label': 'NEGATIVE',  'score': 0.9158489108...
8546    [[{'label': 'NEGATIVE',  'score': 0.9774457216...
8547    [[{'label': 'NEGATIVE',  'score': 0.9442549347...
                              ...                        
8688    [[{'label': 'NEGATIVE',  'score': 0.9307441115...
8689    [[{'label': 'NEGATIVE',  'score': 0.9820828437...
8690    [[{'label': 'NEGATIVE',  'score': 0.9820828437...
8691    [[{'label': 'NEGATIVE',  'score': 0.7940183281...
8692    [[{'label': 'NEGATIVE',  'score': 0.9820828437...
Name: sentiment2, Length: 150, dtype: object

In [288]:
# re.split expects a single string, not a Series, hence the TypeError below
print(re.split('XXX|YYY|ZZZ', df_today.sentiment))

TypeError: expected string or bytes-like object

In [263]:
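# Caution: assigning a generator expression stores the generator object itself in every row
# (see the output below) rather than the per-row scores.
df_today["score"] = (df_today["sentiment"][i][0]['score'] for i in range(df_today.shape[0]))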

In [264]:
df_today

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score
8543,2020-10-23 00:02:23,1319428924116434944,RT @Esther_Gomez_M: Parece que @EMTmadrid pref...,Ana Moradillo Alonso,RT Gomez M Parece que prefiere usar bicicletas...,"[{'label': 'NEGATIVE', 'score': 0.940226495265...",<generator object <genexpr> at 0x1449eb0c0>
8544,2020-10-23 00:11:28,1319431206086332419,RT @edugaresp: El servicio de alquiler de bici...,antonio 🏳️‍🌈antonio MAD,RT El servicio de alquiler de bicicletas de Ma...,"[{'label': 'NEGATIVE', 'score': 0.556998789310...",<generator object <genexpr> at 0x1449eb0c0>
8545,2020-10-23 00:13:54,1319431821067694082,RT @Esther_Gomez_M: Prometió @AlmeidaPP_ 50 nu...,Kike,RT Gomez M Prometi 50 nuevas estaciones de est...,"[{'label': 'NEGATIVE', 'score': 0.915848910808...",<generator object <genexpr> at 0x1449eb0c0>
8546,2020-10-23 01:03:15,1319444240099934208,RT @Esther_Gomez_M: Para conocer la difícil si...,Rober,RT Gomez M Para conocer la dif cil situaci n d...,"[{'label': 'NEGATIVE', 'score': 0.977445721626...",<generator object <genexpr> at 0x1449eb0c0>
8547,2020-10-23 02:26:26,1319465173086818305,RT @Rita_Maestre: La foto que resume lo que ha...,Dani Aparicio,RT Maestre La foto que resume lo que ha hecho ...,"[{'label': 'NEGATIVE', 'score': 0.944254934787...",<generator object <genexpr> at 0x1449eb0c0>
...,...,...,...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",<generator object <genexpr> at 0x1449eb0c0>
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",<generator object <genexpr> at 0x1449eb0c0>
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",<generator object <genexpr> at 0x1449eb0c0>
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...,"[{'label': 'NEGATIVE', 'score': 0.794018328189...",<generator object <genexpr> at 0x1449eb0c0>


In [266]:
# Caution: same problem as for "score"; every row receives the generator object, not a label.
df_today["label"] = (df_today["sentiment"][i][0]['label'] for i in range(df_today.shape[0]))

In [267]:
df_today

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label
8543,2020-10-23 00:02:23,1319428924116434944,RT @Esther_Gomez_M: Parece que @EMTmadrid pref...,Ana Moradillo Alonso,RT Gomez M Parece que prefiere usar bicicletas...,"[{'label': 'NEGATIVE', 'score': 0.940226495265...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8544,2020-10-23 00:11:28,1319431206086332419,RT @edugaresp: El servicio de alquiler de bici...,antonio 🏳️‍🌈antonio MAD,RT El servicio de alquiler de bicicletas de Ma...,"[{'label': 'NEGATIVE', 'score': 0.556998789310...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8545,2020-10-23 00:13:54,1319431821067694082,RT @Esther_Gomez_M: Prometió @AlmeidaPP_ 50 nu...,Kike,RT Gomez M Prometi 50 nuevas estaciones de est...,"[{'label': 'NEGATIVE', 'score': 0.915848910808...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8546,2020-10-23 01:03:15,1319444240099934208,RT @Esther_Gomez_M: Para conocer la difícil si...,Rober,RT Gomez M Para conocer la dif cil situaci n d...,"[{'label': 'NEGATIVE', 'score': 0.977445721626...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8547,2020-10-23 02:26:26,1319465173086818305,RT @Rita_Maestre: La foto que resume lo que ha...,Dani Aparicio,RT Maestre La foto que resume lo que ha hecho ...,"[{'label': 'NEGATIVE', 'score': 0.944254934787...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
...,...,...,...,...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...,"[{'label': 'NEGATIVE', 'score': 0.794018328189...",<generator object <genexpr> at 0x1449eb0c0>,<generator object <genexpr> at 0x1449eb2a0>
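The `score` and `label` columns above contain generator objects rather than values. Because each `sentiment` entry is a one-element list of dicts, one way to extract the fields is `Series.apply`, which works row by row regardless of the index labels. The values below are made up.

```python
import pandas as pd

# Made-up classifier outputs in the same shape as the 'sentiment' column,
# with a non-zero-based index like df_today's.
toy = pd.DataFrame(
    {
        "sentiment": [
            [{"label": "NEGATIVE", "score": 0.94}],
            [{"label": "POSITIVE", "score": 0.74}],
        ]
    },
    index=[8543, 8544],
)

# result[0] is the single dict returned for that tweet.
toy["score"] = toy["sentiment"].apply(lambda result: result[0]["score"])
toy["label"] = toy["sentiment"].apply(lambda result: result[0]["label"])

print(toy[["score", "label"]])
```

This has to run before the column is converted with `apply(str)`, since after that the entries are strings rather than lists.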


Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE


In [40]:
sum(data["label"] == "POSITIVE")

637

In [41]:
sum(data["label"] == "NEGATIVE")

8056

In [42]:
data['score'].mean()

0.9242670292631058

In [43]:
score = data['score']

In [44]:
# series (watch the index)
positive = (data["label"] == "POSITIVE")

score[positive].mean()

0.7425865419618376

In [45]:
# series (watch the index)
negative = (data["label"] == "NEGATIVE")

score[negative].mean()

0.9386327778245391
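The counts and mean scores per label computed in the cells above can also be obtained in one step with `groupby`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows with the same 'label' and 'score' columns as the notebook's data.
toy = pd.DataFrame({
    "label": ["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
    "score": [0.98, 0.94, 0.74, 0.90],
})

# One row per label, with the number of tweets and their mean score.
summary = toy.groupby("label")["score"].agg(["count", "mean"])

print(summary)
```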

In [46]:
data['label'].all()   # on a string column this only checks that no label is empty; it does not compare label values

'NEGATIVE'

In [47]:
data['label'].any()   # likewise, this only checks that at least one label is non-empty

'NEGATIVE'

In [48]:
data.label == 'POSITIVE'

0       False
1       False
2       False
3       False
4       False
        ...  
8688    False
8689    False
8690    False
8691    False
8692    False
Name: label, Length: 8693, dtype: bool

In [49]:
data['label_coded'] = data['label'].apply(lambda x: 1 if x == 'POSITIVE' else -1)

In [50]:
data

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE,-1
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE,-1
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE,-1
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE,-1
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE,-1
...,...,...,...,...,...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",0.930744,NEGATIVE,-1
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...,"[{'label': 'NEGATIVE', 'score': 0.794018328189...",0.794018,NEGATIVE,-1


In [51]:
data['score_coded'] = data['label_coded'] * data['score']
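The signed score can also be built with `Series.map` instead of a lambda. This is a sketch on made-up rows; the design difference is that `map` yields `NaN` for any unexpected label, whereas the lambda above silently codes everything that is not 'POSITIVE' as -1.

```python
import pandas as pd

# Made-up rows with the notebook's 'label' and 'score' columns.
toy = pd.DataFrame({
    "label": ["NEGATIVE", "POSITIVE"],
    "score": [0.98, 0.74],
})

# Explicit mapping: labels outside the dict would become NaN.
sign = toy["label"].map({"POSITIVE": 1, "NEGATIVE": -1})
toy["score_coded"] = sign * toy["score"]

print(toy)
```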

In [52]:
data

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded
0,2020-09-29 06:34:23,1310830261450539009,RT @carnecrudaradio: Quiero felicitar al alcal...,alex vega,RT Quiero felicitar al alcalde por su exitosa ...,"[{'label': 'NEGATIVE', 'score': 0.985985696315...",0.985986,NEGATIVE,-1,-0.985986
1,2020-09-29 07:01:33,1310837099189473280,Señores de @BiciMAD @MADRID las bicis están mu...,Neuroneater,Se ores de las bicis est n muy descuidadas lo ...,"[{'label': 'NEGATIVE', 'score': 0.981788218021...",0.981788,NEGATIVE,-1,-0.981788
2,2020-09-29 07:43:50,1310847740386201600,@JMDLatina Espero de este distrito no solo que...,Andrés Pina,Espero de este distrito no solo que proteja el...,"[{'label': 'NEGATIVE', 'score': 0.980981051921...",0.980981,NEGATIVE,-1,-0.980981
3,2020-09-29 07:53:20,1310850131344920576,RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab...,ElMaNDaLoRiaNo,RT AguilarM O la fecha de la ltima OPE para Av...,"[{'label': 'NEGATIVE', 'score': 0.983189225196...",0.983189,NEGATIVE,-1,-0.983189
4,2020-09-29 08:05:56,1310853301810888704,La misma vergüenza de TODOS los días. Una esta...,Diego Azul,La misma verg enza de TODOS los d as Una estac...,"[{'label': 'NEGATIVE', 'score': 0.994368195533...",0.994368,NEGATIVE,-1,-0.994368
...,...,...,...,...,...,...,...,...,...,...
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",0.930744,NEGATIVE,-1,-0.930744
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1,-0.982083
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1,-0.982083
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...,"[{'label': 'NEGATIVE', 'score': 0.794018328189...",0.794018,NEGATIVE,-1,-0.794018


In [53]:
data.to_csv('../data/results/data_sentiment.csv', index=False)

#### 3.3 Hypothesis testing
<hr>

##### 3.3.1 Separate tweets from today and before

In [60]:
# check new Tweets are in df
data.sort_values('date', ascending = False).head(10)

Unnamed: 0,date,id,text,user_name,tweets_clean,sentiment,score,label,label_coded,score_coded
8692,2020-10-23 11:44:42,1319605665401360385,RT @PlataformaEMT: La colaboración público pri...,Trotamundos,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1,-0.982083
8691,2020-10-23 11:32:20,1319602554146357251,RT @verdechamberi: Para no perderse el seguimi...,Gonzalo Lopez,RT Para no perderse el seguimiento que est hac...,"[{'label': 'NEGATIVE', 'score': 0.794018328189...",0.794018,NEGATIVE,-1,-0.794018
8690,2020-10-23 11:31:22,1319602309316382721,RT @PlataformaEMT: La colaboración público pri...,Jbp,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1,-0.982083
8689,2020-10-23 11:31:01,1319602221357596675,RT @PlataformaEMT: La colaboración público pri...,Marcos_HTZ,RT La colaboraci n p blico privada de los gobi...,"[{'label': 'NEGATIVE', 'score': 0.982082843780...",0.982083,NEGATIVE,-1,-0.982083
8688,2020-10-23 11:29:32,1319601848693739520,RT @PlataformaEMT: Buenos días plataformer@s.\...,Plataforma Sindical,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",0.930744,NEGATIVE,-1,-0.930744
8687,2020-10-23 11:28:27,1319601575778811904,La colaboración público privada de los gobiern...,Plataforma Sindical,La colaboraci n p blico privada de los gobiern...,"[{'label': 'NEGATIVE', 'score': 0.985178053379...",0.985178,NEGATIVE,-1,-0.985178
8686,2020-10-23 11:28:19,1319601542501224448,RT @PlataformaEMT: Cientos de bicicletas de @B...,Javi,RT Cientos de bicicletas de se amontonan en la...,"[{'label': 'NEGATIVE', 'score': 0.880333662033...",0.880334,NEGATIVE,-1,-0.880334
8685,2020-10-23 11:28:08,1319601494841327616,RT @PlataformaEMT: Buenos días plataformer@s.\...,Javi,RT Buenos d as plataformer y contin an infland...,"[{'label': 'NEGATIVE', 'score': 0.930744111537...",0.930744,NEGATIVE,-1,-0.930744
8684,2020-10-23 11:27:12,1319601262975909893,"RT @carmartinco: La 204 de nuevo, @BiciMAD sup...",Javi,RT La 204 de nuevo supongo que avisar is a red...,"[{'label': 'NEGATIVE', 'score': 0.979150831699...",0.979151,NEGATIVE,-1,-0.979151
8683,2020-10-23 11:26:57,1319601198765363200,RT @diego_rebollo: Esta es la realidad de @Bic...,Javi,RT rebollo Esta es la realidad de Estaciones a...,"[{'label': 'NEGATIVE', 'score': 0.983878135681...",0.983878,NEGATIVE,-1,-0.983878


##### 3.3.2 Code 'negative' and 'positive' with 0 and 1

In [61]:
# series (watch the index): align the mask with data_past before filtering
negative = (data["label"] == "NEGATIVE")

data_past.loc[negative.reindex(data_past.index), 'score_coded'].mean()

-0.9390167360510748
In [62]:
data_today.loc[negative.reindex(data_today.index), 'score_coded'].mean()

-0.9178306574690832
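The two means above differ by roughly 0.02. To judge whether a gap of that size could arise by chance, one option is a permutation test on the difference of means. This technique is an assumption brought in for illustration; the notebook itself does not run a test. `permutation_test` is a hypothetical helper and the scores are made up.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means of a and b."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random, keeping the group sizes.
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The +1 terms count the observed split as one of the permutations.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up signed scores standing in for the past and today subsets.
past_scores = [-0.95, -0.93, -0.94, -0.96, -0.92]
today_scores = [-0.91, -0.93, -0.90, -0.92, -0.94]

observed, p_value = permutation_test(past_scores, today_scores)
print(observed, p_value)
```

On the real data, the inputs would be the `score_coded` values of `data_past` and `data_today`.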

In [63]:
api.update_status(f'this is a comment', tweet_id)

NameError: name 'api' is not defined