### Contenido del Dataset

Twitter Sentiment140: Tweets related to brands/keywords. Website includes papers and research ideas. (77 MB)

The data is a CSV with emoticons removed. Data file format has 6 fields:
- **0** - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- **1** - the id of the tweet (2087)
- **2** - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- **3** - the query (lyx). If there is no query, then this value is NO_QUERY.
- **4** - the user that tweeted (robotickilldozr)
- **5** - the text of the tweet (Lyx is cool)

If you use this data, please cite Sentiment140 as your source.
- **Link** : http://help.sentiment140.com/for-students/
- **CSV** : https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit

## 1. Carga de los datos

In [28]:
import pandas as pd
import random

p = 0.95  # 1% of the lines
# keep the header, then take only 1% of lines

header_names = ["polarity","id","date","query","user","tweet"]
df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None, skiprows=lambda i: i>0 and random.random() < p)
test = pd.read_csv("trainingandtestdata/testdata.manual.2009.06.14.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None)

In [29]:
def df_size(df):
    """Return the size of a DataFrame in Megabyes"""
    total = 0.0
    for col in df:
        total += df[col].nbytes
    return total/1048576

In [30]:
print('Tamaño de archivo:',df_size(df), 'MB')

Tamaño de archivo: 3.64288330078125 MB


In [31]:
df.head(10)

Unnamed: 0,polarity,id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467814119,Mon Apr 06 22:20:40 PDT 2009,NO_QUERY,cooliodoc,@angry_barista I baked you a cake but I ated it
2,0,1467817225,Mon Apr 06 22:21:27 PDT 2009,NO_QUERY,crosland_12,@cocomix04 ill tell ya the story later not a ...
3,0,1467837762,Mon Apr 06 22:26:48 PDT 2009,NO_QUERY,Dogbook,Emily will be glad when Mommy is done training...
4,0,1467838189,Mon Apr 06 22:26:54 PDT 2009,NO_QUERY,missannabanana,laying in bed with no voice..
5,0,1467840386,Mon Apr 06 22:27:31 PDT 2009,NO_QUERY,MadameCrow,I am in pain. My back and sides hurt. Not to m...
6,0,1467843215,Mon Apr 06 22:28:18 PDT 2009,NO_QUERY,mikecogh,@ozesteph1992 Shame to hear this Stephan
7,0,1467854345,Mon Apr 06 22:31:09 PDT 2009,NO_QUERY,ace587,@pinkserendipity yes sprint has 4g only in bal...
8,0,1467857975,Mon Apr 06 22:32:06 PDT 2009,NO_QUERY,szrhnds602,Borders closed at 10
9,0,1467859025,Mon Apr 06 22:32:22 PDT 2009,NO_QUERY,LeeseEllen,So many channels.... yet so so boring... lazy ...


In [32]:
df.shape

(79580, 6)

Comprobamos de que todas las columnas estar cargando correctamente.

In [33]:
for x in df:
    print (x)

polarity
id
date
query
user
tweet


In [34]:
category = df['polarity'].values
tweets = df['tweet'].values

In [36]:
print (category)

[0 0 0 ... 4 4 4]


Definimos funcion de carga de datos.

In [37]:
def cargar_datos(porcentaje):
    import pandas as pd
    import random

    p = porcentaje  # 1% of the lines
    # keep the header, then take only 1% of lines
    header_names = ["polarity","id","date","query","user","tweet"]
    df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None, skiprows=lambda i: i>0 and random.random() < p)
    test = pd.read_csv("trainingandtestdata/testdata.manual.2009.06.14.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None)
    category = df['polarity'].values
    tweets = df['tweet'].values
    return df,test,category,tweets
#    df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", sep=',', names = header_names, encoding="ISO-8859-1",header=None,nrows = 10000)


## 2. Preprocesamiento

In [38]:
print(df['polarity'].value_counts())

4    39797
0    39783
Name: polarity, dtype: int64


In [39]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Consideramos válido el apóstrofe (') debido a que este es parte de varios *stopwords*, concepto que será explicado luego.

In [40]:
import re

tweets_alphanum=[]
for i in tweets:
    x = re.sub(r"http\S+|www\S+"," ",i)
    x = re.sub(r"[^\w'@ ]|@\W+"," ",x)
    x = re.sub(r"^(\d ?)+| (\d ?)+"," ",x)
    tweets_alphanum.append(x)

In [41]:
tweets_lower=[]
for i in tweets_alphanum:
    tweets_lower.append(i.lower())

Los *stopwords* o palabras vacías son términos que por lo general no guardan mucha información relevante para el procesamiento de la información textual (si no se analiza a nivel contextual). Por ello, para este caso, se removerán los stopwords de los títulos. Indique la cantidad de stopwords que está removiendo

In [42]:
import nltk
nltk.download('stopwords')
stopwords_eng = nltk.corpus.stopwords.words('english')
print(stopwords_eng)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alirapal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'whe

In [43]:
tweets_lower_clean=[]
count = 0
for tweet in tweets_lower:
    tweet_clean = ''
    for token in tweet.split():
        if token not in stopwords_eng:
            tweet_clean += token
            tweet_clean += ' '
        else:
            count += 1    
        tweets_lower_clean.append(tweet_clean)
print("Cantidad de stopwords removidos:",count)

Cantidad de stopwords removidos: 430167


Con los tweets con solamente carácteres alfanuméricos en minuscula limpios de stopwords, se procederá a analizar los datos.

In [44]:
import collections as co
tokens_list = []
c = co.Counter() 

for text in tweets_lower_clean:
    tokens = text.split()
    c.update(tokens)

print('10 tokens mas comunes:',c.most_common(10))
print('Numero de tokens:',sum(c.values()))
print('Tamaño del vocabulario:',len(c))
## Observacion: También se puede calcular el tamaño del vocabulario con
## len(list(set(" ".join(tweets_lower_clean).split())))

10 tokens mas comunes: [("i'm", 67099), ('get', 36750), ('like', 35372), ('quot', 35331), ('good', 34730), ('got', 33796), ('day', 33264), ('going', 32864), ('go', 29578), ('know', 26618)]
Numero de tokens: 5343919
Tamaño del vocabulario: 80311


Definimos funciones de preprocesamiento y limpieza de datos resumiendo todo el análisis previo, como también una función de cálculo de estádisticas según el diccionario.

In [65]:
def preprocesar(tweets):
    tweets_alphanum=[]
    for i in tweets:
        x = re.sub(r"http\S+|www\S+"," ",i)
        x = re.sub(r"[^\w'@ ]|@\W+|[^a-zA-Z]'[^a-zA-Z]"," ",x)
        x = re.sub(r"^(\d ?)+| (\d ?)+"," ",x)
        tweets_alphanum.append(x)
    tweets_lower=[]
    for tweet in tweets_alphanum:
        tweets_lower.append(tweet.lower())
    import nltk
    nltk.download('stopwords')
    stopwords_eng = nltk.corpus.stopwords.words('english')
    tweets_lower_clean=[]
    for tweet in tweets_lower:
        tweet_clean = ''
        for token in tweet.split():
            if token not in stopwords_eng:
                tweet_clean += token + ' '
        tweets_lower_clean.append(tweet_clean)
    import collections as co
    diccionario = co.Counter() 
    for text in tweets_lower_clean:
        tokens = text.split()
        diccionario.update(tokens)
    return diccionario, tweets_lower_clean

In [46]:
def estadisticas(diccionario):
    print('10 tokens más comunes:',diccionario.most_common(10))        
    print('Número de tokens:',sum(diccionario.values()))
    print('Tamaño del vocabulario:',len(diccionario))

## 3. Clasificación

Deben mostrar un reporte de clasificación (accuracy) con diversos algoritmos de aprendizaje supervisado tradicionales (adicionales a los mostrados en clase) usando los siguientes vectores:

- **1**. Bag of Words (booleano)

In [70]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

text = np.array(tweets_lower_clean)
txd_matrix = CountVectorizer(binary=True,max_features = 10000).fit_transform(text)
tf_matrix.shape

(1042116, 10000)

Definimos función que nos devuelve la matriz Bag of Words (booleano).

In [71]:
def Bag_Of_Words(tweets_lower_clean, max_words_dictionary = None):
    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np
    text = np.array(tweets_lower_clean)
    txd_matrix = CountVectorizer(binary=True,max_features = max_words_dictionary).fit_transform(text)
    return txd_matrix

- **2** TF-IDF

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_1 = TfidfVectorizer(max_features=10000).fit_transform(tweets_lower_clean)

Definimos función que nos devuelve la matriz TF-IDF.

In [73]:
def TF_IDF(tweets_lower_clean, max_words_dictionary = None):
    from sklearn.feature_extraction.text import TfidfVectorizer
    X_1 = TfidfVectorizer(stop_words='english',max_features=10000).fit_transform(tweets_lower_clean)
    return (X_1)

- **3** Word Vectors (embeddings) de spacy

Instalar spacy con las instrucciones en el siguiente link:
<br>Spacy: https://spacy.io/usage/
<br>Package: https://spacy.io/usage/models#section-available

In [55]:
import spacy

In [56]:
#Deben instalar el modelo "en_core_web_lg"
nlp = spacy.load('en_core_web_lg')

In [79]:
##Importante: No sobrepasar lons 1 000 000 de caracteres, por lo cual hay que comprobar con un len(text)
diccionario, tweets_lower_clean = preprocesar(tweets)
text = ''.join(tweets_lower_clean)
len(text)

3947815

In [78]:
doc = nlp(text)

ValueError: [E088] Text of length 3947815 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

Definimos función que nos devuelve la matriz Word Vectors (embeddings) de spacy.

In [21]:
def Word_Vectors(tweets_lower_clean, max_words_dictionary = None):
    return 0

## 4. Pruebas

Primero se define una función de ayuda para ejecutar un modelo y que se imprima el resultado de clasificación (accuracy) mediante validación cruzada (k_fold=5):

In [57]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from collections import Counter

In [75]:
y = np.array(category)
y.shape

(79580,)

In [58]:
def run_model(clf, X, y):
    scores = cross_val_score(clf, X, y, cv=5)
    print("%s accuracy: %0.2f (+/- %0.2f)" % \
          (str(clf.__class__).split('.')[-1].replace('>','').replace("'",''), 
          scores.mean(), scores.std() * 2))

In [59]:
def run_models(X,y):
    #run_model(LinearSVC(), X, y)
    run_model(SGDClassifier(), X, y)
    run_model(Perceptron(), X, y)
    run_model(PassiveAggressiveClassifier(), X, y)
    run_model(BernoulliNB(), X, y)
    #run_model(MultinomialNB(), X, y)
    #run_model(KNeighborsClassifier(), X, y)
    #run_model(NearestCentroid(), X, y)
    #run_model(RandomForestClassifier(n_estimators=100, max_depth=10), X, y)

Ahora efectuamos la prueba.

- **1**. Bag of Words (booleano)

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
text = np.array(tweets_lower_clean)
txd_matrix = CountVectorizer(binary=True,max_features = 10000).fit_transform(text)
run_models(txd_matrix,y)

SGDClassifier accuracy: 0.76 (+/- 0.00)
Perceptron accuracy: 0.69 (+/- 0.01)
PassiveAggressiveClassifier accuracy: 0.69 (+/- 0.01)
BernoulliNB accuracy: 0.76 (+/- 0.01)


- **2** TF-IDF

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer
X_1 = TfidfVectorizer(stop_words='english',max_features=10000).fit_transform(tweets_lower_clean)
run_models(X_1,y)

SGDClassifier accuracy: 0.75 (+/- 0.01)
Perceptron accuracy: 0.68 (+/- 0.01)
PassiveAggressiveClassifier accuracy: 0.72 (+/- 0.01)
BernoulliNB accuracy: 0.75 (+/- 0.01)
