### Preprocessing of the data 

In [1]:
run ./preprocessing.ipynb

Total tweets to evaluate: 177
Evaluated tweets so far: 411


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  "File names:"
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  ]


Total corpus tweets: 8978
Total corpus tweets after cleaning: 7356


### Tokenization and stemming

Download the Spanish stopwords:

In [2]:
# Download spanish stopwords
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
spanish_stopwords = stopwords.words('spanish')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/david.santosg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Build the list of non-word characters: ASCII punctuation, the Spanish marks `¿` and `¡`, and the digits 0 to 9.

In [3]:
from string import punctuation
non_words = list(punctuation)

# Add spanish punctuation
non_words.extend(['¿', '¡'])
non_words.extend(map(str,range(10)))
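
To make the effect of this character list concrete, here is a minimal sketch that applies the same filter to a single string. The helper name `strip_non_words` and the sample sentence are assumptions made for illustration; the notebook itself applies this filter inside `tokenize`.

```python
from string import punctuation

# Same construction as in the cell above: ASCII punctuation,
# Spanish opening marks, and the digits 0-9.
non_words = list(punctuation)
non_words.extend(['¿', '¡'])
non_words.extend(map(str, range(10)))

def strip_non_words(text):
    """Hypothetical helper: drop every character listed in non_words."""
    return ''.join(c for c in text if c not in non_words)

cleaned = strip_non_words('¿Qué tal, 2 veces?')
print(repr(cleaned))
```

Spaces are kept, so removing a character between two spaces leaves a double space; the tokenizer splits on whitespace afterwards, so this does not affect the resulting tokens.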

Define stemmer and tokenizer, based on previous steps.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer       
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = SnowballStemmer('spanish')
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = ''.join([c for c in text if c not in non_words])
    # tokenize
    tokens =  word_tokenize(text)

    # stem
    try:
        stems = stem_tokens(tokens, stemmer)
    except Exception as e:
        print(e)
        print(text)
        stems = ['']
    return stems

In [5]:
tweets_corpus.sample(10)

Unnamed: 0,content,polarity
6920,“El Número Uno” confirma su éxito y gana otro ...,P
371,La suerte de la fea la guapa la desea #tipicas...,NEU
3790,Da a entender que es partidario de recurrir a ...,P
1844,Los recortes y recargos deprimen la economía. ...,N
1596,“: tu insistencia ha tenido recompensa: 99 € q...,N
3205,"Mi artículo en ESD: ""La cita de Camps y Rajoy ...",N
513,🐄,P
4435,Curro Romero y su mujer Carmen Tello en el #17...,P
5653,Con esta visita relámpago queremos participar ...,P
2833,"Vaya, el PP vasco también confirma la deriva d...",N


### Model Evaluation

Import libraries:

In [45]:
# sklearn.cross_validation was superseded by sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

We convert the polarity labels from strings to numeric values:

In [46]:
tweets_corpus['polarity_bin'] = 0
tweets_corpus.polarity_bin[tweets_corpus.polarity.isin(['P'])] = 1
tweets_corpus.polarity_bin[tweets_corpus.polarity.isin(['N'])] = -1
tweets_corpus.polarity_bin.value_counts(normalize=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


 1    0.500816
-1    0.358347
 0    0.140837
Name: polarity_bin, dtype: float64
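
The `SettingWithCopyWarning` shown above comes from chained indexing (`df.col[mask] = value`). As a hedged alternative, the sketch below performs the same label conversion with `Series.map` and with `.loc`, neither of which triggers the warning. The toy DataFrame `df` and the column name `polarity_loc` are assumptions for illustration only; they stand in for `tweets_corpus`.

```python
import pandas as pd

# Toy stand-in for tweets_corpus (hypothetical data)
df = pd.DataFrame({'polarity': ['P', 'N', 'NEU', 'P']})

# Option 1: map labels in one vectorized step; unmapped labels become 0
polarity_to_num = {'P': 1, 'N': -1}
df['polarity_bin'] = df['polarity'].map(polarity_to_num).fillna(0).astype(int)

# Option 2: explicit assignment through .loc instead of chained indexing
df['polarity_loc'] = 0
df.loc[df['polarity'] == 'P', 'polarity_loc'] = 1
df.loc[df['polarity'] == 'N', 'polarity_loc'] = -1

print(df)
```

Both options write directly into the original DataFrame, so the result does not depend on whether an intermediate selection is a view or a copy.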

In [48]:
g = tweets_corpus.groupby('polarity_bin')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

Unnamed: 0_level_0,Unnamed: 1_level_0,content,polarity,polarity_bin
polarity_bin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
-1,0,"Ya está, es que se ha roto el alternador. El t...",N,-1
-1,1,“: tweeteando en la gala?Q vergüenza!! Y pidie...,N,-1
-1,2,"Como en la Edad Media, el populismo penal prov...",N,-1
-1,3,Amonestado el público argentino en La Cartuja!!!,N,-1
-1,4,RT : en 6 meses los españoles sacaron 54.000 m...,N,-1
-1,5,RT : Esta madrugada se ha roto el glaciar Peri...,N,-1
-1,6,¿Qué se le estará pasando x la cabeza a esa ge...,N,-1
-1,7,"Báñez: ""Los minijobs no caben. Jornadas a tiem...",N,-1
-1,8,"A las 3 en , se ultiman los preparativos para ...",N,-1
-1,9,Creéis que hay o no choque de trenes entre Mon...,N,-1


In [51]:
g.value_counts(normalize=True)

AttributeError: 'DataFrameGroupBy' object has no attribute 'value_counts'

Now we use a linear SVC model and tune its hyperparameters with `GridSearchCV`:

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = spanish_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])

Since this is not a binary classification problem, we must binarize the polarity and use a multiclass learning algorithm.

In [9]:
'''from sklearn.preprocessing import label_binarize
tweets_corpus.polarity_bin = label_binarize(tweets_corpus.polarity_bin, classes=[-1, 0, 1])'''

'from sklearn.preprocessing import label_binarize\ntweets_corpus.polarity_bin = label_binarize(tweets_corpus.polarity_bin, classes=[-1, 0, 1])'

In [10]:
'''params = {
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000)
}
gs = GridSearchCV(pipeline, params, n_jobs=-1, cv=5)
gs.fit(tweets_corpus.content, tweets_corpus.polarity_bin)'''

"params = {\n    'cls__C': (0.2, 0.5, 0.7),\n    'cls__loss': ('hinge', 'squared_hinge'),\n    'cls__max_iter': (500, 1000)\n}\ngs = GridSearchCV(pipeline, params, n_jobs=-1, cv=5)\ngs.fit(tweets_corpus.content, tweets_corpus.polarity_bin)"

In [11]:
'''gs.best_params_'''

'gs.best_params_'

The best parameters found are:

{'cls__estimator__C': 0.2,

 'cls__estimator__loss': 'hinge',
 
 'cls__estimator__max_iter': 500,
 
 'vect__max_df': 1.9,
 
 'vect__max_features': 1000,
 
 'vect__min_df': 10,
 
 'vect__ngram_range': (1, 1)}

In [12]:
'''from sklearn.externals import joblib
joblib.dump(gs, 'grid_search.pkl')'''

"from sklearn.externals import joblib\njoblib.dump(gs, 'grid_search.pkl')"

Import cross validation:

In [13]:
# sklearn.cross_validation was superseded by sklearn.model_selection
from sklearn.model_selection import cross_val_predict

In [14]:
model = LinearSVC(
    C=.2, 
    loss='hinge', 
    max_iter=500, 
    random_state=None, 
    penalty='l2'
)

# Define vectorizer with the previously created tokenizer and stopwords array
vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = spanish_stopwords,
    ngram_range=(1, 1),
    max_features=1000
)

corpus_data_features = vectorizer.fit_transform(tweets_corpus.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [15]:
y=tweets_corpus.polarity_bin

In [16]:
'''scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(tweets_corpus)],
    y=tweets_corpus.polarity_bin,
    scoring='roc_auc',
    cv=5
    )

scores.mean()'''

"scores = cross_val_score(\n    model,\n    corpus_data_features_nd[0:len(tweets_corpus)],\n    y=tweets_corpus.polarity_bin,\n    scoring='roc_auc',\n    cv=5\n    )\n\nscores.mean()"

### Polarity Prediction

In [17]:
tweets_no_label = pd.read_csv(test_tweets_raw, encoding='utf-8')
from cleaner import clean_tweets
tweets_no_label = clean_tweets(tweets_no_label, 'text')
print('Number of tweets: %d' % tweets_no_label.shape[0])
tweets_no_label.sample(10)

Number of tweets: 177


Unnamed: 0,id,text
81,a1be2bde,Madrid: Cuatro manifestaciones por el 8-M... y...
140,7cc88b8f,"Tienes razón hermano, no puedo criticar eso. ..."
12,66d69741,"Como hará Griezmann para jugar en Madrid, Barç..."
175,96ef2a30,Y de k vale cumplir siemore salimos maltratad...
152,c8cda282,Igual que los del Barça hacerse del PSG.
144,16466a71,Dale un rato más y será más q el Barça (?)
158,97e7b943,Eso decidselo a Marca que cada día dan más ve...
126,58685dd8,Pavor tengo que ahora los del #PSG miren hacia...
32,c221d218,Coño sin ampliar parecía un preso
145,306e4bc2,Valverde es responsable de este Barça fiable a...


Next we clean the data again, removing links, usernames, newline characters, repeated spaces and emojis.

In [18]:
tweets_no_label.sample(10)

Unnamed: 0,id,text
27,5ea3e4b5,McGuane podría debutar con el Barça y converti...
0,aa24173d,Han robado por el método del alunizaje en la t...
133,26c47161,Félix Brych ayer en el partido de champions #P...
124,9cd8b232,LO QUE PASA ES QUE EL QUE HABLA PAJA SOY VOH ...
143,a9ad7a20,"No vale, no saben lo feliz que estuve cuando ..."
149,b7cf2bde,No pusieron los 5 al barcaAH NO PARAA
32,c221d218,Coño sin ampliar parecía un preso
52,94687a81,Es un torneito molero con premio mas o menos ...
36,3b5f1919,Todavía duele que con el Barca haya hecho ped...
10,a122a538,Un jugador brasileño de 21 años muy bueno q j...
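
The `clean_tweets` helper is imported from `cleaner` and its code is not shown here. As a rough illustration of the kind of cleaning described above, a minimal sketch using only the standard `re` module might look like this. The function name `clean_text`, the regular expressions and the emoji ranges are all assumptions, not the notebook's actual implementation.

```python
import re

# Hypothetical patterns; the real cleaner may use different rules.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
USER_RE = re.compile(r'@\w+')
# Covers common emoji and symbol blocks only, not every emoji.
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')
SPACE_RE = re.compile(r'\s+')

def clean_text(text):
    """Remove links, usernames and emojis, then collapse whitespace."""
    text = URL_RE.sub(' ', text)
    text = USER_RE.sub(' ', text)
    text = EMOJI_RE.sub(' ', text)
    # \s+ also matches newlines, so this handles line breaks too
    return SPACE_RE.sub(' ', text).strip()

print(clean_text('Hola @user mira https://t.co/abc\n  genial 😀'))
```

Replacing matches with a space rather than an empty string avoids gluing neighbouring words together; the final whitespace pass then collapses any runs that result.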


### Language detection

Because some tweets are in Catalan, we detect the language of each tweet and process only the Spanish ones for sentiment analysis.

We use three different libraries for language detection and keep those tweets on which at least two of these libraries agree on the language being Spanish.

In [19]:
import langid
from langdetect import detect
import textblob

def langid_safe(tweet):
    try:
        return langid.classify(tweet)[0]
    except Exception as e:
        pass
        
def langdetect_safe(tweet):
    try:
        return detect(tweet)
    except Exception as e:
        pass

def textblob_safe(tweet):
    try:
        return textblob.TextBlob(tweet).detect_language()
    except Exception as e:
        pass

ModuleNotFoundError: No module named 'langid'

Create 3 new columns specifying the detected language of the tweet.

In [20]:
tweets_no_label['lang_langid'] = tweets_no_label.text.apply(langid_safe)
tweets_no_label['lang_langdetect'] = tweets_no_label.text.apply(langdetect_safe)
tweets_no_label['lang_textblob'] = tweets_no_label.text.apply(textblob_safe)

NameError: name 'langid_safe' is not defined

Save as CSV.

In [21]:
tweets_no_label.to_csv('tweets_parsed.csv', encoding='utf-8')

We select the tweets in Spanish as follows:
- If at least 2 libraries detect Spanish, keep the tweet.
- If exactly 1 library detects Spanish, print the tweet and, after manual review, append it to the dataset.
- If no library detects Spanish, remove the tweet.
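
The selection rule above amounts to counting votes for `'es'` across the three detectors. A minimal plain-Python sketch of that rule follows; the function name `spanish_vote` and the return labels are assumptions for illustration, and a detector that failed is represented by `None`, as the `*_safe` wrappers return.

```python
def spanish_vote(langs):
    """Classify a tweet from the language codes returned by the detectors.

    langs: iterable of language codes (or None when a detector failed).
    Returns 'keep', 'review' or 'drop'.
    """
    votes = sum(1 for lang in langs if lang == 'es')
    if votes >= 2:
        return 'keep'      # majority says Spanish
    if votes == 1:
        return 'review'    # doubtful: inspect manually
    return 'drop'          # no detector says Spanish

print(spanish_vote(['es', 'es', 'ca']))
print(spanish_vote(['es', None, 'ca']))
print(spanish_vote(['ca', 'ca', None]))
```

The pandas `query` strings in the next cell express the same three cases as boolean conditions over the `lang_*` columns.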

In [22]:
# Leave tweets whose detected language is Spanish (majority):
spanish_query = ''' (lang_langdetect == 'es' and lang_langid == 'es') or (lang_langdetect == 'es' and lang_textblob == 'es') or (lang_textblob == 'es' and lang_langid == 'es') '''
tweets_spanish = tweets_no_label.query(spanish_query)

print('Tweets in Spanish: %d' % tweets_spanish.shape[0])

# Print tweets in doubtful language:
nonspanish_query = ''' ((lang_langdetect != 'es' and lang_langid != 'es') or (lang_langdetect != 'es' and lang_textblob != 'es') or (lang_textblob != 'es' and lang_langid != 'es')) and (lang_textblob == 'es' or lang_langid == 'es' or lang_langdetect == 'es') '''
tweets_doubtful = tweets_no_label.query(nonspanish_query)

print('Tweets whose language is not clear: %d' % tweets_doubtful.shape[0])

tweets_doubtful

UndefinedVariableError: name 'lang_langdetect' is not defined

In [23]:
# Append the remaining Spanish tweets, selected manually by id
manual_spanish_ids = [
    '79cdded5', '26fe7471', 'cd0d8bcb', '97af720a', '09c0f4cc',
    '5a533794', '9046f222', '5df2d140', 'c5343fa0', '12d82762',
    'dcc02374', '8f9d73cf', '6f30beca', '9cd8b232', '3c78bdb5',
    '3beadb3a', 'c8cda282', 'fce60e59', '7bd204cc',
]
# Concatenate once, preserving the order of the ids listed above
tweets_spanish = pd.concat(
    [tweets_spanish]
    + [tweets_doubtful[tweets_doubtful.id == tweet_id] for tweet_id in manual_spanish_ids]
)

print('Tweets in Spanish: %d' % tweets_spanish.shape[0])

NameError: name 'tweets_spanish' is not defined

Define pipeline:

In [24]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = spanish_stopwords,
            ngram_range=(1, 1),
            max_features=1000
            )),
    ('cls', LinearSVC(C=.2, loss='hinge',max_iter=500,multi_class='ovr',
             random_state=None,
             penalty='l2',
             )),
])

In [25]:
pipeline.fit(tweets_corpus.content, tweets_corpus.polarity_bin)
tweets_no_label['polarity'] = pipeline.predict(tweets_no_label.text)

In [26]:
tweets_no_label[['text', 'polarity']].sample(30)

Unnamed: 0,text,polarity
50,Veremos si es tan superior cuándo juegue cont...,-1
167,"Lo siento, pero 3-0. Lo otro son campitos men...",-1
123,Con poco suerte tendremos tambièn previa de la...,1
28,El Espanyol ha sacado más puntos contra el Ma...,-1
165,"El Barça no te necesita, mejor ya vete a Chin...",1
77,"Neymar se fue al PSG en busca de “títulos”, si...",1
38,Tanto lo alababan que fue el creador del fútb...,1
15,Y un mundo en el que Madrí y Barca no estén e...,-1
152,Igual que los del Barça hacerse del PSG.,1
146,"Y el nota ademas es del Barca, jajajaja",1


Convert the numeric polarity back to a string label:

In [27]:
tweets = tweets_no_label.copy()
tweets['polarity_bin'] = 'Neutral'
tweets.polarity_bin[tweets.polarity.isin([1])] = 'Positive'
tweets.polarity_bin[tweets.polarity.isin([-1])] = 'Negative'
tweets.polarity_bin.value_counts(normalize=True)
tweets[['text', 'polarity_bin']].sample(30)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,text,polarity_bin
54,No tranquilo que el chollo del atleti este añ...,Positive
21,"Este es un cagon, la verdad ayer esperaba un ...",Negative
11,MESSI tira del carro del Barça y cuando se va...,Negative
121,"Se le viene Real Madrid, Barça o Bayern Munich...",Positive
94,Grande #Messi KX,Positive
158,Eso decidselo a Marca que cada día dan más ve...,Positive
5,Pero que te crees? Xk os sorprende lo de este...,Negative
150,📌 Informa : ha fet cursa contínua avui a la C...,Negative
114,Vaya cabezazo del niño. Me recordó al de la f...,Negative
62,entonces por qué lo llamas realmadridización?...,Negative


Remove the auxiliary columns:

In [28]:
tweets.drop(['lang_langid', 'lang_langdetect','lang_textblob','polarity'], axis=1, inplace=True)

ValueError: labels ['lang_langid' 'lang_langdetect' 'lang_textblob'] not contained in axis

In [40]:
tweets.sample(10)

Unnamed: 0,id,text,polarity,polarity_bin
156,42d8ce05,"Como dijo Draxler, el planteamiento era una v...",-1,Negative
87,88f58b8e,No soy del Atleti pero me da pena lo que han e...,-1,Negative
37,67ae6b97,"#JavierMascherano sobre #Messi en :""Es el jug...",1,Positive
102,e703f7b2,Cuando el PSG permitió entrar a sus ultras jus...,-1,Negative
109,6d1bd293,"Yo soy del Barça, hinchaba por Neymar",1,Positive
129,7fa82da8,Entonces Arthur ya esta prácticamente cerrado...,-1,Negative
49,1d485d6c,"Lo bueno es q fue expulsado, ese ejemplo a tu...",1,Positive
32,c221d218,Coño sin ampliar parecía un preso,-1,Negative
169,1a938e84,VIDEO: La brutal exhibición de Koke en el entr...,1,Positive
24,977bf140,Si jugaras en el cielo moriría por verte! #Fo...,1,Positive


Drop the numeric `polarity` column, then rename `polarity_bin` to `polarity`:

In [41]:
tweets.drop(['polarity'], axis=1, inplace=True)

In [42]:
tweets = tweets.rename(columns={'polarity_bin': 'polarity'})

In [43]:
tweets.sample(10)

Unnamed: 0,id,text,polarity
96,8f9d73cf,Cualquiera que no estuvieran ni Barça ni Madrid.,Positive
9,02802aa0,Hoy juega mi querido Barça y mi cuerpo y garga...,Positive
115,ea75493e,"????? A ver subnormal, creo que no se te da e...",Negative
12,66d69741,"Como hará Griezmann para jugar en Madrid, Barç...",Positive
125,3c78bdb5,"Sport, el Barça descartó en el pasado a Lucas ...",Positive
167,a3798203,"Lo siento, pero 3-0. Lo otro son campitos men...",Negative
47,484c36cf,La victoria de ayer a la prensa tampoco le val...,Positive
33,bb0ee4ad,No me preocupa tanto el planteamiento táctico...,Negative
66,23303f58,"📷 [GALERIA] El recupera efectius / Roger, Aic...",Positive
101,21813244,"M. Bartra debió quedarse en el Barcelona, una ...",Positive


Export tweets as CSV:

In [44]:
tweets[['id', 'polarity']].to_csv('tweets_polarity_bin.csv', encoding='utf-8', index=False)