# Twitter Sentiment Analysis

This notebook covers every step needed to run sentiment analysis with Transformers.<br>
**Data**: CSV saved in '../data/processed/data.csv'. This file contains all updated tweets (old and new) as produced by the 'analysis module'.

**Key actions** 
<br> <hr>
- Identify and remove duplicate records
- Identify and remove tweets posted by the user BiciMAD
- Examine data for potential issues
- Identify and fill in missing values
- *Remove low variance columns (potentially not needed)*
- Identify potential outliers *(potentially not needed)*
- Correct incorrect data types *(potentially only text variable)*
- Remove special characters and clean categorical variables *(potentially only text variable)*
<br>

## 1. Read & clean data

#### 1.1 Read data
<hr>

In [180]:
# all modules
import pandas as pd
# if packages need upgrading
# pip install --upgrade pip

In [196]:
# load dataset
data = pd.read_csv('../data/processed/data.csv')

In [198]:
data.shape

(4494, 4)

In [184]:
data.dtypes

date         object
id            int64
text         object
user_name    object
dtype: object

In [185]:
data.head(5)

|  | date | id | text | user_name |
|---|---|---|---|---|
| 0 | 2020-10-10 11:07:26 | 1314885243531399169 | @PlataformaEMT @FRAVM @BiciMAD @AlmeidaPP_ @bc... | VICENTE RODRIGUEZ MU |
| 1 | 2020-10-10 10:38:11 | 1314877884067188736 | @PlataformaEMT @MADRID @AlmeidaPP_ @bcarabante... | Stielike |
| 2 | 2020-10-10 10:35:49 | 1314877290489184261 | @mcascallares Ha podido tratarse de un fallo p... | BiciMAD |
| 3 | 2020-10-10 10:35:14 | 1314877141658542083 | @AnxoOroisPhoto Te pedimos disculpas, ha podid... | BiciMAD |
| 4 | 2020-10-10 10:13:18 | 1314871622487212033 | Hola @BiciMAD la app no funciona, llevo 1 hora... | AnxoOroisPhotography |


#### 1.2 Clean data
<hr>

In [186]:
# change 'date' dtype from 'object' to 'datetime64'
data['date'] = pd.to_datetime(data['date'])

In [187]:
data.dtypes

date         datetime64[ns]
id                    int64
text                 object
user_name            object
dtype: object

In [188]:
data.isnull().sum()

date         0
id           0
text         0
user_name    0
dtype: int64

In [189]:
# count duplicate 'id' values to be sure
len(data['id'])-len(data['id'].drop_duplicates())

0
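The same check can also be written with `Series.duplicated`, which makes it easy to look at the repeated rows themselves. The sketch below uses a made-up toy frame (the ids and texts are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy frame standing in for the tweets data; ids and texts are invented.
toy = pd.DataFrame({"id": [1, 2, 2, 3], "text": ["a", "b", "b again", "c"]})

# duplicated() flags every repeat of an id after its first occurrence
n_duplicates = int(toy["id"].duplicated().sum())

# keep=False flags all rows that share an id, so they can be inspected together
repeated_rows = toy[toy["id"].duplicated(keep=False)]

print(n_duplicates)
print(repeated_rows)
```

On the real data, `data['id'].duplicated().sum()` should agree with the length difference computed above.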

In [190]:
data.isnull().sum()

date         0
id           0
text         0
user_name    0
dtype: int64

In [191]:
data['id'].describe()

count    4.643000e+03
mean     1.314161e+18
std      1.951459e+15
min      1.310830e+18
25%      1.312023e+18
50%      1.314581e+18
75%      1.315936e+18
max      1.317360e+18
Name: id, dtype: float64

In [192]:
# drop duplicates
data.drop_duplicates(subset=['id'],keep='last', inplace= True)

In [193]:
data.shape

(4643, 4)

#### 1.3 Remove BiciMAD tweets
<hr>

In [194]:
# data analysis => sorting
data = data.sort_values('user_name', ascending=False)

In [195]:
data['user_name'].value_counts()

BiciMAD                149
ErBoteRojo              91
deteibols               87
Jesús García Diaz       77
Plataforma Sindical     76
                      ... 
#NiOlvidoNiPerdon        1
Unai Zarraolandia        1
pili bu                  1
Carabanchel              1
CarlAmado                1
Name: user_name, Length: 1761, dtype: int64

In [137]:
data = data[data.user_name != 'BiciMAD']

In [138]:
data.columns

Index(['date', 'id', 'text', 'user_name'], dtype='object')

## 2. Explore data

#### 2.1 Sort values by 'date' and reset index
<hr>

In [139]:
data = data.sort_values(by ='date', ascending=True)

In [140]:
data = data.reset_index()

In [141]:
data = data.drop(columns =['index'])
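The three cells above (sort, reset the index, drop the old index column) can be collapsed into one chained call, since `reset_index(drop=True)` discards the old index instead of turning it into a column. A minimal sketch on an invented toy frame:

```python
import pandas as pd

# Toy frame with dates deliberately out of order; values are invented.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-02", "2020-09-29", "2020-10-01"]),
    "text": ["third", "first", "second"],
})

# Sort chronologically and renumber rows from 0 without keeping the old index
toy_sorted = toy.sort_values(by="date", ascending=True).reset_index(drop=True)

print(toy_sorted)
```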

In [142]:
data.tail(20)

|  | date | id | text | user_name |
|---|---|---|---|---|
| 4474 | 2020-10-17 06:03:09 | 1317345386223276032 | RT @PlataformaEMT: Llego @bcarabante y paraliz... | Angel Salinas |
| 4475 | 2020-10-17 06:04:20 | 1317345683767164928 | RT @imberbe67: Aquí un ejemplo de cómo se la p... | Juan talavera |
| 4476 | 2020-10-17 06:13:41 | 1317348035182080003 | RT @Batperro82: Hoy, gracias a mi odisea por l... | Miguel Ángel Medina |
| 4477 | 2020-10-17 06:13:56 | 1317348099883438080 | RT @PlataformaEMT: @bcarabante @EMTmadrid Hay ... | ErBoteRojo |
| 4478 | 2020-10-17 06:14:02 | 1317348124239810560 | RT @PlataformaEMT: @bcarabante @EMTmadrid @Bic... | ErBoteRojo |
| 4479 | 2020-10-17 06:14:18 | 1317348189394128896 | RT @JessGarcaDiaz3: @bcarabante @EMTmadrid Lo ... | ErBoteRojo |
| 4480 | 2020-10-17 06:14:43 | 1317348295929483265 | RT @Bacoher: @bcarabante @EMTmadrid Será sinve... | ErBoteRojo |
| 4481 | 2020-10-17 06:16:22 | 1317348711303942144 | RT @Batperro82: Hoy, gracias a mi odisea por l... | movilidadleganes |
| 4482 | 2020-10-17 06:29:38 | 1317352049676341253 | RT @PlataformaEMT: @bcarabante @EMTmadrid Hay ... | Mayi77 |
| 4483 | 2020-10-17 06:35:38 | 1317353559046934533 | @bcarabante @EMTmadrid Han incorporado conduct... | Mayi77 |


#### 2.2 Check most recent tweets from users
<hr>

In [157]:
data.loc[data['user_name'] == 'Blanca Fernandez']

|  | date | id | text | user_name | tweets_clean |
|---|---|---|---|---|---|
| 141 | 2020-09-30 16:27:39 | 1311341950835007493 | un mes más... y @BiciMAD no parece que vaya po... | Blanca Fernandez | un mes m s y no parece que vaya por buen camino |
| 1392 | 2020-10-02 19:39:15 | 1312114944427479040 | de mal en peor cads@vez que me meto a bóchese ... | Blanca Fernandez | de mal en peor cads que me meto a b chese como... |
| 2379 | 2020-10-10 12:18:52 | 1314903221127786496 | que pena de verdad @BiciMAD https://t.co/ZJocU... | Blanca Fernandez | que pena de verdad |
| 2720 | 2020-10-11 18:15:02 | 1315355242876280834 | dejar o no dejar de seguir a @BiciMAD para no ... | Blanca Fernandez | dejar o no dejar de seguir a para no leer por ... |
| 3181 | 2020-10-12 20:04:29 | 1315745172119060481 | pues no, no dejaré de seguir a @BiciMAD porqu... | Blanca Fernandez | pues no no dejar de seguir a porque sus trabaj... |


In [158]:
# Extract subsets
# data_subset = data[data['user_name'].isin(['Blanca Fernandez', 'BICIMAD EN LUCHA'])]

In [159]:
# data_subset

In [160]:
# First tweet available date
data['date'].min()

Timestamp('2020-09-29 06:34:23')

In [161]:
# Most recent tweet date
data['date'].max()

Timestamp('2020-10-17 07:02:08')

In [82]:
# saving data
# data.to_csv('../data/processed/data.csv', index=False)

In [172]:
data

|  | date | id | text | user_name | tweets_clean |
|---|---|---|---|---|---|
| 0 | 2020-09-29 06:34:23 | 1310830261450539009 | RT @carnecrudaradio: Quiero felicitar al alcal... | alex vega | RT Quiero felicitar al alcalde por su exitosa ... |
| 1 | 2020-09-29 07:01:33 | 1310837099189473280 | Señores de @BiciMAD @MADRID las bicis están mu... | Neuroneater | Se ores de las bicis estan muy descuidadas lo ... |
| 2 | 2020-09-29 07:43:50 | 1310847740386201600 | @JMDLatina Espero de este distrito no solo que... | Andrés Pina | Espero de este distrito no solo que proteja el... |
| 3 | 2020-09-29 07:53:20 | 1310850131344920576 | RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab... | ElMaNDaLoRiaNo | RT AguilarM O la fecha de la ultima OPE para A... |
| 4 | 2020-09-29 08:05:56 | 1310853301810888704 | La misma vergüenza de TODOS los días. Una esta... | Diego Azul | La misma verguenza de TODOS los dias Una estac... |
| ... | ... | ... | ... | ... | ... |
| 4489 | 2020-10-17 06:49:45 | 1317357111261728768 | RT @Batperro82: Hoy, gracias a mi odisea por l... | AP-Madrid Nosevende | RT Hoy gracias a mi odisea por las estaciones ... |
| 4490 | 2020-10-17 06:53:35 | 1317358078686265345 | La pésima gestión de Bicimad nos obligan a vol... | BICIMAD EN LUCHA | La pesima gestion de Bicimad nos obligan a vol... |
| 4491 | 2020-10-17 07:01:21 | 1317360030895738880 | RT @Bacoher: @bcarabante @EMTmadrid Será sinve... | Julio | RT Sera sinverguenza tiene usted la como un so... |
| 4492 | 2020-10-17 07:01:52 | 1317360160378159104 | RT @PlataformaEMT: @bcarabante @EMTmadrid @Bic... | Julio | RT Si no lo hace se encontrara muy pronto con ... |


## 3. Sentiment analysis

#### 3.1 Prepare text
<hr>

In [None]:
import re
from unicodedata import normalize

In [None]:
# strip line breaks from each tweet
[re.sub(r'[\n\r]+', '', str(x)) for x in data['text']]

In [173]:
# strip diacritics (keeping ñ) from every tweet, then recompose to NFC
data['tweets_clean'] = data['text'].apply(
    lambda s: normalize('NFC', re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
        normalize("NFD", s), flags=re.I
    ))
)

In [167]:
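s = (
    "Pingüino: Málãgà ês uñ\u033aã cíudãd fantástica y èn Logroño me pica el... moño"
    "\u0310\u031a\u030f\u030b\u034c\u0301\u036c\u0361\u0328\u031d\u0318\u0326\u031e\u031f\u0329"
    "o\u0303\u0313\u032a\u0353\u034d\u0326"
    "o\u0342\u030c\u0350\u0350\u035f\u031b\u0324\u033a\u032c\u032f"
    "o\u0342\u030a\u034e\u0348\u0333\u0320\u033c\u032b"
)


# -> NFD and strip diacritics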
s = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1", 
        normalize( "NFD", s), 0, re.I
    )

# -> NFC
s = normalize( 'NFC', s)

print( s )

Pinguino: Malaga es una ciudad fantastica y en Logroño me pica el... moñoooo
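The snippet above works on a single string. To reuse it on every tweet it helps to wrap it in a function that takes and returns one string. The sketch below does that with the same regular expression; the name `strip_diacritics` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re
from unicodedata import normalize

# Combining marks (U+0300 to U+036F) are removed unless the mark is the
# tilde (U+0303) directly after an "n", so that "ñ" survives.
_DIACRITICS = re.compile(
    r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+",
    flags=re.I,
)

def strip_diacritics(text: str) -> str:
    """Return `text` without accents, keeping ñ/Ñ intact."""
    decomposed = normalize("NFD", text)            # split letters from their marks
    stripped = _DIACRITICS.sub(r"\1", decomposed)  # drop the marks, keep the letter
    return normalize("NFC", stripped)              # recompose what is left

print(strip_diacritics("Pingüino en Málaga y Logroño"))
```

Because the function operates on plain strings, it can be passed directly to `data['text'].apply(strip_diacritics)`.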


In [176]:
def clean_tweet(tweet):
    # strip diacritics (keeping ñ) from a single tweet string
    tweet = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
        normalize("NFD", tweet), flags=re.I
    )
    tweet = normalize('NFC', tweet)
    # drop mentions, URLs and any remaining non-alphanumeric characters
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

In [177]:
data['tweets_clean'] = data['text'].apply(clean_tweet)

In [170]:
def clean_tweet(tweet):
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
    #return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", tweet).split())

In [171]:
# Add the cleaned tweet text as a new column
data['tweets_clean'] = data['text'].apply(clean_tweet) 
  
# Print the updated dataframe 
display(data)

|  | date | id | text | user_name | tweets_clean |
|---|---|---|---|---|---|
| 0 | 2020-09-29 06:34:23 | 1310830261450539009 | RT @carnecrudaradio: Quiero felicitar al alcal... | alex vega | RT Quiero felicitar al alcalde por su exitosa ... |
| 1 | 2020-09-29 07:01:33 | 1310837099189473280 | Señores de @BiciMAD @MADRID las bicis están mu... | Neuroneater | Se ores de las bicis estan muy descuidadas lo ... |
| 2 | 2020-09-29 07:43:50 | 1310847740386201600 | @JMDLatina Espero de este distrito no solo que... | Andrés Pina | Espero de este distrito no solo que proteja el... |
| 3 | 2020-09-29 07:53:20 | 1310850131344920576 | RT @_AguilarM: @PlataformaEMT @BiciMAD @bcarab... | ElMaNDaLoRiaNo | RT AguilarM O la fecha de la ultima OPE para A... |
| 4 | 2020-09-29 08:05:56 | 1310853301810888704 | La misma vergüenza de TODOS los días. Una esta... | Diego Azul | La misma verguenza de TODOS los dias Una estac... |
| ... | ... | ... | ... | ... | ... |
| 4489 | 2020-10-17 06:49:45 | 1317357111261728768 | RT @Batperro82: Hoy, gracias a mi odisea por l... | AP-Madrid Nosevende | RT Hoy gracias a mi odisea por las estaciones ... |
| 4490 | 2020-10-17 06:53:35 | 1317358078686265345 | La pésima gestión de Bicimad nos obligan a vol... | BICIMAD EN LUCHA | La pesima gestion de Bicimad nos obligan a vol... |
| 4491 | 2020-10-17 07:01:21 | 1317360030895738880 | RT @Bacoher: @bcarabante @EMTmadrid Será sinve... | Julio | RT Sera sinverguenza tiene usted la como un so... |
| 4492 | 2020-10-17 07:01:52 | 1317360160378159104 | RT @PlataformaEMT: @bcarabante @EMTmadrid @Bic... | Julio | RT Si no lo hace se encontrara muy pronto con ... |


In [166]:
data['tweets_clean'][1]

'Se ores de las bicis est n muy descuidadas lo saben A menor uso de bicis m s de metro y bus y'
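The gaps in "Se ores" and "est n" come from the character class `[^0-9A-Za-z \t]`, which treats every non-ASCII letter as a symbol and replaces it with a space. If keeping the Spanish letters is preferred over stripping them, one option is to whitelist them explicitly. The sketch below is an illustrative variant, not the notebook's method: `clean_tweet_keep_spanish` is a hypothetical name, and it matches mentions with `@\w+` (so handles containing underscores are removed whole), which differs slightly from the original pattern.

```python
import re
from unicodedata import normalize

# Alternatives, in order: @mentions, URLs, then any character that is not an
# ASCII letter/digit, a Spanish accented letter, ñ/Ñ, ü/Ü, space or tab.
_NOISE = re.compile(
    r"(@\w+)|(\w+://\S+)|([^0-9A-Za-zÁÉÍÓÚÜÑáéíóúüñ \t])"
)

def clean_tweet_keep_spanish(tweet: str) -> str:
    """Remove mentions, URLs and punctuation while keeping Spanish letters."""
    # NFC first, so accented letters are single code points the class can match
    tweet = normalize("NFC", tweet)
    return " ".join(_NOISE.sub(" ", tweet).split())

print(clean_tweet_keep_spanish("Señores de @BiciMAD las bicis están mal https://t.co/abc"))
```

Whether the downstream model benefits from the accents depends on its tokenizer, so this is worth comparing against the accent-stripped version rather than assuming either is better.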

In [1]:
from transformers import pipeline


In [None]:
classifier = pipeline('sentiment-analysis')

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

In [None]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [8]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
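The `nlptown/bert-base-multilingual-uncased-sentiment` model labels text with star ratings ("1 star" to "5 stars") rather than POSITIVE/NEGATIVE. A minimal sketch of how those labels might be folded into three sentiment classes and applied to the cleaned tweets in batches follows. The helper names, the batch size, and the thresholds (1 to 2 stars negative, 3 neutral, 4 to 5 positive) are assumptions made for illustration.

```python
from typing import Callable, Dict, List

def stars_to_sentiment(label: str) -> str:
    """Map a label such as '4 stars' to a coarse class (assumed thresholds)."""
    stars = int(label.split()[0])
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

def label_tweets(
    texts: List[str],
    classify: Callable[[List[str]], List[Dict]],
    batch_size: int = 32,
) -> List[str]:
    """Run `classify` over `texts` in chunks and return one class per text."""
    sentiments = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # `classify` is expected to return one {'label': ..., 'score': ...} per text
        for result in classify(batch):
            sentiments.append(stars_to_sentiment(result["label"]))
    return sentiments
```

With the pipeline defined above, usage would look like `data['sentiment'] = label_tweets(data['tweets_clean'].tolist(), classifier)`.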

In [10]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [11]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

In [12]:
print(inputs)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [13]:
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    return_tensors="pt")

In [14]:
for key, value in pt_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
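In the output above the second sentence is shorter, so `padding=True` appends zeros to its `input_ids` and marks those positions with 0 in `attention_mask`. The plain-Python sketch below mimics that behaviour to show what the two lists mean. It is an illustration only, not the tokenizer's actual implementation, and the pad id of 0 is assumed from the printed output.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    # Real tokens are kept; missing positions are filled with pad_id
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 11312, 119, 102], [101, 102]])
print(ids)
print(mask)
```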


In [15]:
pt_outputs = model(**pt_batch)

In [None]:
print(pt_outputs)

#### 3.2 Predictions
<hr>

In [None]:
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim=-1)
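`F.softmax` turns the model's raw logits into probabilities that sum to 1 across the classes. The plain-Python sketch below shows the same computation for a single example and picks the most likely class. The logits are invented, and mapping class index 0 to "1 star" is an assumption about the model's label order that should be checked against `model.config.id2label`.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_stars(logits):
    """Return the 1-based index of the most probable class (assumed star rating)."""
    probs = softmax(logits)
    return probs.index(max(probs)) + 1

example_logits = [0.1, 0.2, 3.0, 0.5, 0.1]  # made-up values for five classes
print(softmax(example_logits))
print(predicted_stars(example_logits))
```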