### Dataset Contents

Twitter Sentiment140: Tweets related to brands/keywords. Website includes papers and research ideas. (77 MB)

The data is a CSV with emoticons removed. Data file format has 6 fields:
- **0** - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- **1** - the id of the tweet (2087)
- **2** - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- **3** - the query (lyx). If there is no query, then this value is NO_QUERY.
- **4** - the user that tweeted (robotickilldozr)
- **5** - the text of the tweet (Lyx is cool)
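
As a rough illustration of the six-field layout above, the following sketch parses one row with the standard `csv` module and maps the polarity code to a label. The column names, the quoting style, and the polarity value `4` in the sample row are assumptions made for this example; the other values are the examples given in the field list.

```python
import csv
import io

# Column names are an assumption; the source only numbers the fields 0-5.
FIELDS = ["polarity", "id", "date", "query", "user", "tweet"]
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}

def parse_rows(csv_text):
    """Yield one dict per CSV row, with polarity mapped to a readable label."""
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        record = dict(zip(FIELDS, row))
        record["polarity"] = POLARITY_LABELS[int(record["polarity"])]
        record["id"] = int(record["id"])
        yield record

# Sample row built from the example values above (polarity 4 chosen for illustration).
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
rows = list(parse_rows(sample))
print(rows[0]["polarity"], "-", rows[0]["tweet"])
```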

If you use this data, please cite Sentiment140 as your source.
- **Link** : http://help.sentiment140.com/for-students/
- **CSV** : https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit

## 1. Loading the data

In [20]:
import pandas as pd
header_names = ["polarity","id","date","query","user","tweet"]
df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None)
test = pd.read_csv("trainingandtestdata/testdata.manual.2009.06.14.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None)

In [21]:
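def df_size(df):
    """Return the size of a DataFrame in megabytes (1 MB = 1048576 bytes)."""
    total = 0.0
    for col in df:
        # Note: for object (string) columns, nbytes counts only the pointers, not the text itself.
        total += df[col].nbytes
    return total/1048576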

In [22]:
print('DataFrame size:',df_size(df), 'MB')

DataFrame size: 73.2421875 MB


In [23]:
df.head(5)

| | polarity | id | date | query | user | tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [24]:
df.shape

(1600000, 6)

We check that all columns were loaded correctly.

In [25]:
for x in df:
    print (x)

polarity
id
date
query
user
tweet


In [26]:
category = df['polarity'].values
tweets = df['tweet'].values

In [27]:
print (category)

[0 0 0 ..., 4 4 4]


We define a data-loading function.

In [28]:
def cargar_datos():
    """Load the training and test CSV files and return (df, test, category, tweets)."""
    import pandas as pd
    header_names = ["polarity","id","date","query","user","tweet"]
    # Tip: add nrows=10000 to the first read_csv call to load only a sample while experimenting.
    df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None)
    test = pd.read_csv("trainingandtestdata/testdata.manual.2009.06.14.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None)
    category = df['polarity'].values
    tweets = df['tweet'].values
    return df,test,category,tweets


## 2. Preprocessing

In [29]:
print(df['polarity'].value_counts())

4    800000
0    800000
Name: polarity, dtype: int64


In [30]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


We keep the apostrophe (') as a valid character because it is part of several *stopwords*, a concept explained later. Every other punctuation mark, as well as every digit, is removed.

In [31]:
import re

tweets_alphanum=[]
for i in tweets:
    x = re.sub(r"[^a-zA-Z' ]","",i)
    tweets_alphanum.append(x)
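
To make the effect of this regular expression concrete, here is a minimal sketch applied to a single string. The sample is a shortened version of the first tweet in the dataset, and the helper name `keep_letters` is hypothetical. Note that URLs collapse into one run of letters and that digits are dropped along with punctuation.

```python
import re

def keep_letters(text):
    """Drop every character that is not an ASCII letter, an apostrophe, or a space."""
    return re.sub(r"[^a-zA-Z' ]", "", text)

# Shortened sample based on the first tweet shown by df.head(5).
sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
cleaned = keep_letters(sample)
print(repr(cleaned))  # "switchfoot httptwitpiccomyzl  Awww that's a bummer"
```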

In [32]:
tweets_lower=[]
for i in tweets_alphanum:
    tweets_lower.append(i.lower())
tweets_lower

["switchfoot httptwitpiccomyzl  awww that's a bummer  you shoulda got david carr of third day to do it d",
 "is upset that he can't update his facebook by texting it and might cry as a result  school today also blah",
 'kenichan i dived many times for the ball managed to save   the rest go out of bounds',
 'my whole body feels itchy and like its on fire ',
 "nationwideclass no it's not behaving at all i'm mad why am i here because i can't see you all over there ",
 'kwesidei not the whole crew ',
 'need a hug ',
 "loltrish hey  long time no see yes rains a bit only a bit  lol  i'm fine thanks  how's you ",
 "tatianak nope they didn't have it ",
 'twittera que me muera  ',
 "spring break in plain city it's snowing ",
 'i just repierced my ears ',
 "caregiving i couldn't bear to watch it  and i thought the ua loss was embarrassing     ",
 'octolinz it it counts idk why i did either you never talk to me anymore ',
 "smarrison i would've been the first but i didn't have a gun    not really

In [33]:
import collections as co
c = co.Counter()

for text in tweets_lower:
    tokens = text.split()
    c.update(tokens)

print('10 most common tokens:',c.most_common(10))
print('Number of tokens:',sum(c.values()))
print('Vocabulary size:',len(c))

10 most common tokens: [('i', 750906), ('to', 564540), ('the', 519719), ('a', 377907), ('my', 314054), ('and', 298430), ('you', 269918), ('is', 235930), ('it', 230782), ('for', 215705)]
Number of tokens: 20664174
Vocabulary size: 808732


*Stopwords* are terms that generally carry little relevant information for text processing (unless the text is analyzed at the contextual level). For this reason, in this case the stopwords are removed from the tweets. Report the number of stopwords being removed.

In [34]:
import nltk
nltk.download('stopwords')
stopwords_eng = nltk.corpus.stopwords.words('english')
print(stopwords_eng)

[nltk_data] Downloading package stopwords to /home/alulab/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'an

In [35]:
tweets_lower_clean=[]
count = 0
stopwords_set = set(stopwords_eng)  # set membership is much faster than scanning a list
for tweet in tweets_lower:
    tweet_clean = ''
    for token in tweet.split():
        if token not in stopwords_set:
            tweet_clean += token
            tweet_clean +=' '
        else:
            count+=1
    tweets_lower_clean.append(tweet_clean)
print("Number of stopwords removed:",count)

Number of stopwords removed: 8509854
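
The filtering step can be sketched on a single tweet as follows. The small `STOPWORDS` set and the helper `remove_stopwords` are hypothetical and only stand in for NLTK's English list used in the notebook. Unlike the loop above, this version joins tokens without a trailing space.

```python
# A tiny hand-picked stopword set, for illustration only;
# the notebook uses NLTK's English stopword list instead.
STOPWORDS = {"is", "that", "he", "his", "by", "it", "and", "as", "a"}

def remove_stopwords(tweet, stopwords=STOPWORDS):
    """Return the tweet without stopwords, plus how many tokens were dropped."""
    tokens = tweet.split()
    kept = [tok for tok in tokens if tok not in stopwords]
    return " ".join(kept), len(tokens) - len(kept)

text, removed = remove_stopwords("is upset that he can't update his facebook by texting it")
print(text, "| removed:", removed)
```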


With the tweets reduced to lowercase letters and apostrophes and cleaned of stopwords, we proceed to analyze the data.

In [36]:
import collections as co
c = co.Counter()

for text in tweets_lower_clean:
    tokens = text.split()
    c.update(tokens)

print('10 most common tokens:',c.most_common(10))
print('Number of tokens:',sum(c.values()))
print('Vocabulary size:',len(c))
## Note: the vocabulary size can also be computed with
## len(list(set(" ".join(tweets_lower_clean).split())))

10 most common tokens: [("i'm", 128139), ('good', 89368), ('day', 84761), ('get', 81573), ('like', 77766), ('go', 72945), ('today', 64615), ('going', 64079), ('love', 63430), ('work', 62857)]
Number of tokens: 12154320
Vocabulary size: 808557


We define preprocessing and data-cleaning functions that summarize all the previous analysis, as well as a function that computes statistics from the token dictionary.

In [37]:
def preprocesar(tweets):
    """Clean the tweets and return (token Counter, list of cleaned tweets)."""
    import re
    import collections as co
    import nltk
    nltk.download('stopwords')
    # A set makes the membership test below much faster than a list.
    stopwords_eng = set(nltk.corpus.stopwords.words('english'))
    tweets_lower_clean=[]
    for tweet in tweets:
        # Keep only letters, apostrophes and spaces, then lowercase.
        tweet = re.sub(r"[^a-zA-Z' ]","",tweet).lower()
        tweet_clean = ''
        for token in tweet.split():
            if token not in stopwords_eng:
                tweet_clean += token + ' '
        tweets_lower_clean.append(tweet_clean)
    # Count how often each remaining token appears across all tweets.
    diccionario = co.Counter()
    for text in tweets_lower_clean:
        diccionario.update(text.split())
    return diccionario, tweets_lower_clean

In [38]:
def estadisticas(diccionario):
    print('10 most common tokens:',diccionario.most_common(10))
    print('Number of tokens:',sum(diccionario.values()))
    print('Vocabulary size:',len(diccionario))

In [39]:
counters, tweets_lower_clean = preprocesar(tweets)

[nltk_data] Downloading package stopwords to /home/alulab/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [40]:
estadisticas(counters)

10 most common tokens: [("i'm", 128139), ('good', 89368), ('day', 84761), ('get', 81573), ('like', 77766), ('go', 72945), ('today', 64615), ('going', 64079), ('love', 63430), ('work', 62857)]
Number of tokens: 12154320
Vocabulary size: 808557


## 3. Classification

Show a classification report (accuracy) for several traditional supervised learning algorithms (beyond those shown in class) using the following vector representations:

In [41]:
import nltk
import re
def clean_tokens_profesor(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [42]:
import nltk
import re
def clean_tokens(text):
    filtered_tokens = set(" ".join(text).split())
    return filtered_tokens

- **1**. Bag of Words (boolean)

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

import numpy as np
text = np.array(tweets_lower_clean)
text

array([ "switchfoot httptwitpiccomyzl awww that's bummer shoulda got david carr third day ",
       "upset can't update facebook texting might cry result school today also blah ",
       'kenichan dived many times ball managed save rest go bounds ', ...,
       'ready mojo makeover ask details ',
       'happy th birthday boo alll time tupac amaru shakur ',
       'happy charitytuesday thenspcc sparkscharity speakinguphh '], 
      dtype='<U176')

In [46]:
vectorizer = CountVectorizer(max_features = 10000)
txd_matrix = vectorizer.fit_transform(text)
txd_matrix.shape

(1600000, 10000)

In [47]:
txd_matrix = CountVectorizer(binary=True,max_features = 10000).fit_transform(text)
print(txd_matrix)

  (0, 2146)	1
  (0, 8756)	1
  (0, 1318)	1
  (0, 2138)	1
  (0, 3595)	1
  (0, 7779)	1
  (0, 1182)	1
  (0, 8712)	1
  (0, 568)	1
  (1, 868)	1
  (1, 242)	1
  (1, 8882)	1
  (1, 7542)	1
  (1, 7236)	1
  (1, 2012)	1
  (1, 5504)	1
  (1, 8694)	1
  (1, 2968)	1
  (1, 9280)	1
  (1, 1266)	1
  (1, 9296)	1
  (2, 3543)	1
  (2, 7227)	1
  (2, 7510)	1
  (2, 5270)	1
  :	:
  (1599994, 9472)	1
  (1599994, 3566)	1
  (1599994, 9773)	1
  (1599994, 8832)	1
  (1599995, 795)	1
  (1599995, 2875)	1
  (1599995, 9737)	1
  (1599995, 3064)	1
  (1599995, 7542)	1
  (1599996, 4369)	1
  (1599996, 6082)	1
  (1599996, 1850)	1
  (1599996, 3884)	1
  (1599997, 5251)	1
  (1599997, 5613)	1
  (1599997, 2300)	1
  (1599997, 458)	1
  (1599997, 7032)	1
  (1599998, 221)	1
  (1599998, 8698)	1
  (1599998, 846)	1
  (1599998, 968)	1
  (1599998, 3814)	1
  (1599998, 8832)	1
  (1599999, 3814)	1


In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_matrix = TfidfVectorizer(max_features = 10000, use_idf=False).fit_transform(text)
tfidf_matrix = TfidfVectorizer().fit_transform(text)
tf_matrix.shape

(1600000, 10000)

We define a function that returns the Bag of Words (boolean) matrix.

In [49]:
def Bag_Of_Words(tweets_lower_clean, max_words_dictionary = None):
    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np
    text = np.array(tweets_lower_clean)
    txd_matrix = CountVectorizer(binary=True,max_features = max_words_dictionary).fit_transform(text)
    return txd_matrix
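
Conceptually, a boolean Bag of Words records only whether each vocabulary word occurs in a document, not how often. The pure-Python sketch below illustrates that idea on three toy documents. The helper `binary_bag_of_words` is hypothetical and uses plain whitespace tokenization, so it is not a drop-in replacement for `CountVectorizer(binary=True)`, which tokenizes differently and returns a sparse matrix.

```python
def binary_bag_of_words(documents):
    """Build a sorted vocabulary and one 0/1 presence vector per document."""
    vocabulary = sorted({tok for doc in documents for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            row[index[tok]] = 1  # presence only: repeated words still count as 1
        matrix.append(row)
    return vocabulary, matrix

# Toy documents, invented for illustration.
docs = ["good day good", "need hug", "good hug"]
vocab, matrix = binary_bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```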

- **2** TF-IDF

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_1 = TfidfVectorizer(max_features=10000).fit_transform(tweets_lower_clean)
print(X_1)
X_1.shape

  (0, 569)	0.297447998948
  (0, 8710)	0.235296305296
  (0, 1181)	0.355227854861
  (0, 7779)	0.418023218636
  (0, 3593)	0.190409828461
  (0, 2137)	0.334006343766
  (0, 1317)	0.481393667494
  (0, 8754)	0.380197072952
  (0, 2145)	0.176420743146
  (1, 9293)	0.312089210869
  (1, 1265)	0.193146011584
  (1, 9277)	0.292964831383
  (1, 2965)	0.292873668745
  (1, 8692)	0.366465402238
  (1, 5502)	0.257980594255
  (1, 2011)	0.291913346463
  (1, 7235)	0.37366234893
  (1, 7540)	0.228525503212
  (1, 8878)	0.177516669887
  (1, 243)	0.255267514947
  (1, 870)	0.345117596773
  (2, 5292)	0.339619681078
  (2, 8835)	0.347237270736
  (2, 633)	0.443687621568
  (2, 5270)	0.45806327023
  :	:
  (1599994, 2773)	0.355319278557
  (1599994, 807)	0.292498614727
  (1599994, 4665)	0.332436095793
  (1599994, 907)	0.3785636548
  (1599995, 7540)	0.418949004935
  (1599995, 3062)	0.439367601521
  (1599995, 9735)	0.48779613319
  (1599995, 2873)	0.453231123445
  (1599995, 796)	0.433675366124
  (1599996, 3883)	0.43217538643
  

(1600000, 10000)
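
For intuition about the weights printed above, the sketch below computes TF-IDF by hand on three toy documents. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which to the best of our understanding corresponds to the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`). The helper `tfidf_rows` is hypothetical and uses whitespace tokenization, so its numbers will not match `TfidfVectorizer` exactly on real tweets.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Toy documents, invented for illustration.
vocab, rows = tfidf_rows(["good day", "good hug", "need hug today"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "day" receives a higher weight than "good" because it appears in fewer documents.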

We define a function that returns the TF-IDF matrix.

In [51]:
def TF_IDF(tweets_lower_clean, max_words_dictionary = None):
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Use the max_words_dictionary argument instead of a hard-coded vocabulary size.
    X_1 = TfidfVectorizer(stop_words='english',max_features=max_words_dictionary).fit_transform(tweets_lower_clean)
    return X_1

In [52]:
y = np.array(category)
y.shape

(1600000,)

- **3** spaCy word vectors (embeddings)

In [53]:
import spacy

In [54]:
# The "en_core_web_lg" model must be installed first
#nlp = spacy.load('en_core_web_lg')

In [55]:
#doc=nlp(u'this is a sentence.')

We define a function that returns the spaCy word-vector (embeddings) matrix. For now it is only a placeholder that returns 0.

In [56]:
def Word_Vectors(tweets_lower_clean, max_words_dictionary = None):
    return 0
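
Since `Word_Vectors` is left as a placeholder, the following is only a sketch of one common way to turn word embeddings into a fixed-length tweet vector: averaging the vectors of the tokens. The `TOY_VECTORS` table and the `average_vector` helper are invented for illustration. A real implementation would take its vectors from a model such as `en_core_web_lg`, and nothing here should be read as the notebook's intended solution.

```python
# Toy 3-dimensional embeddings, invented for illustration only.
TOY_VECTORS = {
    "good": [1.0, 0.0, 0.0],
    "day": [0.0, 1.0, 0.0],
    "hug": [0.0, 0.0, 1.0],
}

def average_vector(tweet, vectors=TOY_VECTORS, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[tok] for tok in tweet.split() if tok in vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(values) / len(known) for values in zip(*known)]

print(average_vector("good day"))
print(average_vector("unknownword"))
```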

## 4. Tests

First, we define a helper function that runs a model and prints its classification result (accuracy) using cross-validation (k_fold=5):

In [57]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from collections import Counter

In [58]:
def run_model(clf, X, y):
    # 5-fold cross-validation; report mean accuracy and two standard deviations.
    scores = cross_val_score(clf, X, y, cv=5)
    print("%s accuracy: %0.2f (+/- %0.2f)" % \
          (clf.__class__.__name__,
          scores.mean(), scores.std() * 2))
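
To show what the 5-fold scheme and the `+/-` figure mean, here is a simplified pure-Python sketch. The helpers `k_fold_indices` and `summarize` are hypothetical, and the scores are made up rather than taken from the notebook. The sketch uses plain contiguous folds, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds. That difference matters for this dataset, because the `category` array printed earlier suggests the rows are ordered by polarity.

```python
import statistics

def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

def summarize(scores):
    """Format mean accuracy with a +/- of two population standard deviations."""
    return "%0.2f (+/- %0.2f)" % (statistics.mean(scores), statistics.pstdev(scores) * 2)

folds = k_fold_indices(10, k=5)
print(folds[0])  # first fold: train on indices 2..9, test on indices 0 and 1

made_up_scores = [0.75, 0.76, 0.76, 0.77, 0.76]  # hypothetical, not notebook results
print(summarize(made_up_scores))
```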

In [59]:
def run_models(X,y):
    #run_model(LinearSVC(), X, y)
    run_model(SGDClassifier(), X, y)
    run_model(Perceptron(), X, y)
    run_model(PassiveAggressiveClassifier(), X, y)
    run_model(BernoulliNB(), X, y)
    #run_model(MultinomialNB(), X, y)
    #run_model(KNeighborsClassifier(), X, y)
    #run_model(NearestCentroid(), X, y)
    #run_model(RandomForestClassifier(n_estimators=100, max_depth=10), X, y)

Now we run the tests.

- **1**. Bag of Words (boolean)

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
text = np.array(tweets_lower_clean)
txd_matrix = CountVectorizer(binary=True,max_features = 10000).fit_transform(text)
run_models(txd_matrix,y)

SGDClassifier accuracy: 0.76 (+/- 0.00)
Perceptron accuracy: 0.69 (+/- 0.01)
PassiveAggressiveClassifier accuracy: 0.69 (+/- 0.01)
BernoulliNB accuracy: 0.76 (+/- 0.01)


- **2** TF-IDF

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer
X_1 = TfidfVectorizer(stop_words='english',max_features=10000).fit_transform(tweets_lower_clean)
run_models(X_1,y)

SGDClassifier accuracy: 0.75 (+/- 0.01)
Perceptron accuracy: 0.68 (+/- 0.01)
PassiveAggressiveClassifier accuracy: 0.72 (+/- 0.01)
BernoulliNB accuracy: 0.75 (+/- 0.01)
