## Tweet classification

File names:

In [1]:
# To read
test_tweets_raw = 'datasets/test_nolabel.csv'
train_tweets_raw = 'datasets/train.csv'
corpus_tweets_2012_xml = 'general-train-tagged-3l.xml'
corpus_tweets_2017_xml = 'intertass-train-tagged.xml'

# To generate
corpus_tweets_2012_csv = 'general-train-tagged-3l.csv'
corpus_tweets_2017_csv = 'intertass-train-tagged.csv'
corpus_tweets_csv = 'corpus_tweets_prprocessed.csv'

Import Pandas and Numpy:

In [2]:
import pandas as pd
import numpy as np

### Load datasets

In [3]:
tweets_test = pd.read_csv(test_tweets_raw, encoding='utf-8')
tweets_train = pd.read_csv(train_tweets_raw, encoding='utf-8')

print('Total tweets to evaluate: %d' % len(tweets_test))
print('Evaluated tweets so far: %d' % len(tweets_train))

Total tweets to evaluate: 177
Evaluated tweets so far: 411


### POS Tagging

Import the library used to read XML:

In [4]:
from lxml import objectify

Read the most recent corpus (2017) and export it as CSV:

In [5]:
# 4 values of sentiment: N, P, NONE, NEU
# Pass the path directly so lxml opens and closes the file itself
xml = objectify.parse(corpus_tweets_2017_xml)
root = xml.getroot()
# Collect one dict per tweet and build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = [
    {'content': tweet.content.text, 'polarity': tweet.sentiment.polarity.value.text}
    for tweet in root.iterchildren()
]
general_tweets_corpus_train_2017 = pd.DataFrame(rows, columns=['content', 'polarity'])

general_tweets_corpus_train_2017.to_csv(corpus_tweets_2017_csv, index=False, encoding='utf-8')

Read the largest corpus (2012), which is concatenated with the previous one further down:

In [6]:
# 4 values of sentiment: N, P, NONE, NEU
# Pass the path directly so lxml opens and closes the file itself
xml = objectify.parse(corpus_tweets_2012_xml)
root = xml.getroot()
# Note: this file wraps the polarity in <sentiments>, not <sentiment>
# Collect one dict per tweet and build the DataFrame in a single step
rows = [
    {'content': tweet.content.text, 'polarity': tweet.sentiments.polarity.value.text}
    for tweet in root.iterchildren()
]
general_tweets_corpus_train_2012 = pd.DataFrame(rows, columns=['content', 'polarity'])

general_tweets_corpus_train_2012.to_csv(corpus_tweets_2012_csv, index=False, encoding='utf-8')
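
The two cells above differ only in the element that wraps the polarity: `sentiment` in the 2017 file and `sentiments` in the 2012 one. A minimal sketch of a single helper covering both layouts follows. It uses the standard library's `xml.etree.ElementTree` rather than `lxml.objectify`, and the helper name `extract_rows` and the two inline XML snippets are made up for illustration; they are modelled only on the element paths read in the code above, not on the real corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets that mimic the element paths read in the cells above;
# the real corpus files are assumed to contain more fields per tweet.
SAMPLE_2017 = """<tweets>
  <tweet><content>Buen dia</content>
    <sentiment><polarity><value>P</value></polarity></sentiment></tweet>
</tweets>"""

SAMPLE_2012 = """<tweets>
  <tweet><content>Mal dia</content>
    <sentiments><polarity><value>N</value></polarity></sentiments></tweet>
</tweets>"""


def extract_rows(xml_text, wrapper):
    """Return one {'content', 'polarity'} dict per child of the root."""
    root = ET.fromstring(xml_text)
    rows = []
    for tweet in root:
        rows.append({
            'content': tweet.findtext('content'),
            # Text of the first <value> found under wrapper/polarity
            'polarity': tweet.findtext(f'{wrapper}/polarity/value'),
        })
    return rows


rows_2017 = extract_rows(SAMPLE_2017, 'sentiment')
rows_2012 = extract_rows(SAMPLE_2012, 'sentiments')
print(rows_2012 + rows_2017)
```

Passing the wrapper name as a parameter keeps the per-file difference in one visible place instead of in two near-identical loops.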

Concatenate the general (2012) corpus with the 2017 one to obtain a larger labelled corpus:

In [7]:
tweets_corpus = pd.concat([
        general_tweets_corpus_train_2012,
        general_tweets_corpus_train_2017
    ])
tweets_corpus.sample(10)

| | content | polarity |
|---|---|---|
| 4917 | 2011: la circulación agregada de los diarios d... | N |
| 5829 | Genial J.Coll haciendo pedagogía vs mentira dl... | N |
| 3254 | Con el PSOE Extremadura formaba parte de la ca... | NEU |
| 661 | Si, soy democrata radical y ellos no condenan ... | N |
| 3231 | Durao Barroso pide a Raoy que explique cuanto ... | NONE |
| 333 | Bdías. Que el Reino Unido se quede fuera indic... | P |
| 1043 | Curioso. Al de Amaiur no le han aplaudido ni l... | N |
| 5274 | Buenos días, hoy en Sevilla firmamos el conven... | P |
| 70 | Por desgracia el sorteo que tenía pensado hace... | N |
| 136 | Menudo break en blanco ha logrado DelPotro. Va... | NEU |


In [8]:
print('Total corpus tweets: %d' % len(tweets_corpus))

Total corpus tweets: 8227
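
One detail worth noting about the concatenation: both source frames are indexed from 0, so a plain `pd.concat` keeps repeated index labels. The sketch below, on two made-up frames (`a` and `b` are invented for the example), shows the difference that `ignore_index=True` makes. Whether a fresh index is wanted here depends on how the index is used later.

```python
import pandas as pd

# Made-up stand-ins for the 2012 and 2017 frames
a = pd.DataFrame({'content': ['x', 'y'], 'polarity': ['N', 'P']})
b = pd.DataFrame({'content': ['z'], 'polarity': ['NEU']})

kept = pd.concat([a, b])                      # index labels 0, 1, 0
fresh = pd.concat([a, b], ignore_index=True)  # index labels 0, 1, 2

print(list(kept.index))
print(list(fresh.index))
```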


Remove tweets from the corpus where `polarity` is `NONE`:

In [9]:
# Remove tweets with polarity 'NONE'
tweets_corpus = tweets_corpus.query('polarity != "NONE"')
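
As a quick illustration of this filter, the sketch below applies the same `query` expression to a small made-up DataFrame and compares it with the equivalent boolean mask. The frame `toy` and its rows are invented for the example.

```python
import pandas as pd

# Made-up rows, one per polarity label used in the corpus
toy = pd.DataFrame({
    'content': ['a', 'b', 'c', 'd'],
    'polarity': ['N', 'P', 'NONE', 'NEU'],
})

filtered_query = toy.query('polarity != "NONE"')
filtered_mask = toy[toy['polarity'] != 'NONE']

# Both spellings keep the same rows
print(filtered_query.equals(filtered_mask))
print(list(filtered_query['polarity']))
```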

### Data cleaning

First, we import the cleaning function from the local `cleaner` module:

In [11]:
# clean tweets tools
from cleaner import clean_tweets
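
The `cleaner` module is not included in this notebook, so its behaviour is not shown. As a rough idea of what a tweet-cleaning step commonly involves, here is a minimal sketch that assumes the cleaning strips URLs and @mentions, drops the `#` from hashtags, and collapses whitespace. The function `clean_text` is hypothetical and is not the `clean_tweets` implementation.

```python
import re

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')
SPACES_RE = re.compile(r'\s+')


def clean_text(text):
    """Hypothetical cleaner: not the notebook's clean_tweets."""
    text = URL_RE.sub('', text)        # drop links
    text = MENTION_RE.sub('', text)    # drop @user mentions
    text = text.replace('#', '')       # keep hashtag words, drop the symbol
    return SPACES_RE.sub(' ', text).strip()


cleaned = clean_text('Gracias #VisitaElHierro RT @user: mira http://t.co/abc')
print(cleaned)
```

A DataFrame-level wrapper could apply a function like this to the `content` column and drop rows left empty; whether `clean_tweets` does so is not visible here.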

Now we clean the `content` column of the corpus with that function:

In [12]:
tweets_corpus = clean_tweets(tweets_corpus, 'content')

In [13]:
print('Total corpus tweets after cleaning: %d' % len(tweets_corpus))

Total corpus tweets after cleaning: 6605


In [14]:
tweets_corpus.sample(10)

| | content | polarity |
|---|---|---|
| 5129 | ” POLONIA D AYER. Primer gag, nuevo personaje ... | P |
| 3865 | Uyy que se me olvidaba dejaros esta pagina par... | NEU |
| 6503 | . " en Andalucía a través de las urnas es pos... | P |
| 5636 | Gracias VisitaElHierro RT : Para los que no s... | P |
| 4600 | Encuesta cachonda en la web dl PSOE-A: Ves jus... | NEU |
| 1821 | Se ha montado la de Dios! A ver: habrá que sor... | N |
| 349 | FF para . Ella ya sabe... | P |
| 6275 | Sólo Grecia, Malta, Chipre y España no tienen ... | N |
| 37 | Confirmao tengo el pie roto, he notado como el... | N |
| 108 | Preciosa de verdad - Perdóname by Pablo Alborá... | P |


Export the cleaned corpus as CSV:

In [17]:
tweets_corpus.to_csv(corpus_tweets_csv, encoding='utf-8', index=False)