## Tweet classification

File names:

In [1]:
# To read
test_tweets_raw = 'datasets/test_nolabel.csv'
train_tweets_raw = 'datasets/train.csv'
corpus_tweets_2012_xml = 'general-train-tagged-3l.xml'
corpus_tweets_2017_xml = 'intertass-train-tagged.xml'
emojis_csv = 'emojis.csv'

# To generate
corpus_tweets_2012_csv = 'general-train-tagged-3l.csv'
corpus_tweets_2017_csv = 'intertass-train-tagged.csv'
corpus_tweets_csv = 'corpus_tweets_preprocessed.csv'

Import pandas and NumPy:

In [2]:
import pandas as pd
import numpy as np

### Load datasets

In [3]:
tweets_test = pd.read_csv(test_tweets_raw, encoding='utf-8')
tweets_train = pd.read_csv(train_tweets_raw, encoding='utf-8')

print('Total tweets to evaluate: %d' % len(tweets_test))
print('Evaluated tweets so far: %d' % len(tweets_train))

Total tweets to evaluate: 177
Evaluated tweets so far: 411


### POS Tagging

Import libraries to read XML:

In [4]:
from lxml import objectify

Read the most recent corpus (2017):

In [44]:
# 4 values of sentiment: N, P, NONE, NEU
xml = objectify.parse(corpus_tweets_2017_xml)
root = xml.getroot()
# Collect one row per tweet and build the DataFrame once at the end;
# DataFrame.append was removed in pandas 2.0 and was slow inside a loop
rows = [
    {'content': tweet.content.text, 'polarity': tweet.sentiment.polarity.value.text}
    for tweet in root.getchildren()
]
general_tweets_corpus_train_2017 = pd.DataFrame(rows, columns=['content', 'polarity'])

Read the largest corpus (2012), which is later concatenated with the previous one:

In [45]:
# 4 values of sentiment: N, P, NONE, NEU
xml = objectify.parse(corpus_tweets_2012_xml)
root = xml.getroot()
# Collect one row per tweet and build the DataFrame once at the end;
# note that this corpus wraps the polarity in <sentiments>, not <sentiment>
rows = [
    {'content': tweet.content.text, 'polarity': tweet.sentiments.polarity.value.text}
    for tweet in root.getchildren()
]
general_tweets_corpus_train_2012 = pd.DataFrame(rows, columns=['content', 'polarity'])
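
The two loaders above differ only in the file they read and in the name of the element that wraps the polarity (`sentiment` in the 2017 corpus, `sentiments` in the 2012 one). As a rough sketch of that shared structure, the following uses the standard library's `xml.etree.ElementTree` instead of `lxml.objectify`, on a tiny made-up XML snippet. The helper name `read_polarity_rows` and the sample tweets are assumptions for illustration and are not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the element layout the loaders above rely on
SAMPLE_2017 = """
<tweets>
  <tweet>
    <content>Buenos dias</content>
    <sentiment><polarity><value>P</value></polarity></sentiment>
  </tweet>
  <tweet>
    <content>Que mal dia</content>
    <sentiment><polarity><value>N</value></polarity></sentiment>
  </tweet>
</tweets>
""".strip()


def read_polarity_rows(xml_text, sentiment_tag):
    """Return one {'content', 'polarity'} dict per <tweet> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for tweet in root.iter('tweet'):
        rows.append({
            'content': tweet.findtext('content'),
            # findtext returns the text of the first matching element
            'polarity': tweet.findtext('%s/polarity/value' % sentiment_tag),
        })
    return rows


rows_2017 = read_polarity_rows(SAMPLE_2017, 'sentiment')
print(rows_2017)
```

A list like `rows_2017` can be handed to `pd.DataFrame(rows_2017, columns=['content', 'polarity'])`. For a file laid out like the 2012 corpus, the second argument would be `'sentiments'`.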

Read the emoji sentiment dataset, to be concatenated with the previous corpora. Build the `polarity` column from the sentiment score according to the following criteria:
- If the score is below -0.2 (down to -1), consider it a **negative** sentiment (`N`).
- If the score is between -0.2 and 0.2, both inclusive, consider it a **neutral** sentiment (`NEU`).
- If the score is above 0.2 (up to 1), consider it a **positive** sentiment (`P`).
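
The thresholds above can be sketched as a small pure-Python function, which makes the behaviour at the boundaries explicit: the cell that follows uses strict comparisons (`< -0.2` and `> 0.2`), so scores of exactly -0.2 and 0.2 stay neutral. The function name `score_to_polarity` and its `threshold` parameter are hypothetical and only serve this illustration.

```python
def score_to_polarity(score, threshold=0.2):
    """Map a sentiment score in [-1, 1] to 'N', 'NEU' or 'P'."""
    if score < -threshold:
        return 'N'
    if score > threshold:
        return 'P'
    # Scores within [-threshold, threshold], boundaries included, stay neutral
    return 'NEU'


for score in (-0.75, -0.2, 0.0, 0.2, 0.5):
    print(score, score_to_polarity(score))
```

Assuming the `sentiment` column holds plain floats, the same mapping could be applied with `emoji_dataset.sentiment.map(score_to_polarity)`.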

In [46]:
# Read emojis CSV
emoji_dataset = pd.read_csv(emojis_csv, encoding='utf-8')

# Build column 'polarity'; .loc avoids chained assignment,
# which raises SettingWithCopyWarning and may silently fail to update the frame
emoji_dataset['polarity'] = 'NEU'
emoji_dataset.loc[emoji_dataset.sentiment < -0.2, 'polarity'] = 'N'
emoji_dataset.loc[emoji_dataset.sentiment > 0.2, 'polarity'] = 'P'

# Build dataframe to append to corpus: each hexadecimal code point
# in column 'emoji' is decoded into the emoji character itself
emoji_corpus = pd.DataFrame({
    'content': emoji_dataset.emoji.map(lambda code: chr(int(code, 16))),
    'polarity': emoji_dataset.polarity,
})


Concatenate the general corpus (2012) with the 2017 corpus and the emoji corpus, since a larger corpus should give a better result:

In [47]:
tweets_corpus = pd.concat([
        general_tweets_corpus_train_2012,
        general_tweets_corpus_train_2017,
        emoji_corpus
    ])

In [48]:
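# Import regex tools
import re

# Build emoji regex: one alternation over every emoji in the corpus.
# re.escape guards against entries that contain regex metacharacters
emoji_string = '|'.join(re.escape(emoji) for emoji in emoji_corpus['content'])
emoji_regex = re.compile(r'(%s)' % emoji_string)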
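
As a small sketch of what this pattern is for, the following builds the same kind of alternation from three hand-picked emojis and uses it to pad every match with spaces, so that emojis glued to words become separate tokens. The names `sample_emojis` and `sample_emoji_regex` are hypothetical stand-ins for `emoji_corpus['content']` and `emoji_regex`. In this sketch each entry goes through `re.escape`, so any character with a special meaning in a regex is matched literally.

```python
import re

# Hypothetical stand-in for emoji_corpus['content']
sample_emojis = ['😂', '❤', '👍']

# Escape each entry so it is matched literally, then join into one alternation
sample_emoji_regex = re.compile('(%s)' % '|'.join(re.escape(e) for e in sample_emojis))

text = 'genial😂😂 gracias👍'
# Surround every matched emoji with spaces; \1 is the emoji that was matched
spaced = sample_emoji_regex.sub(r' \1 ', text)
print(spaced.split())  # ['genial', '😂', '😂', 'gracias', '👍']
```

The extra spaces this introduces are harmless as long as runs of whitespace are collapsed afterwards.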

In [49]:
tweets_corpus.sample(10)

| | content | polarity |
|---|---|---|
| 6839 | Buenos dias amigos!direccion mallorca a dar nu... | NEU |
| 6436 | Planeta creativo: nuevo placer otra interesant... | P |
| 3025 | No somos telesucesos, si quiere información co... | P |
| 3344 | Nuestro grupo quiere acercar más la actividad ... | P |
| 2671 | ¿Donde está Urdangarín? Toda Barcelona le busc... | N |
| 3934 | Los medios extranjeros coinciden en que el jui... | N |
| 623 | Os doy una pista: Se va a sentir muy identific... | P |
| 5890 | A día de hoy 53.220 “@cora_alvarez: @pedroj_ra... | P |
| 2626 | Nunca te acostarás sin tener un desengaño más. | N |
| 578 | 📹 | P |


In [50]:
print('Total corpus tweets: %d' % len(tweets_corpus))

Total corpus tweets: 8978


Remove tweets from the corpus where `polarity` is `NONE`:

In [51]:
# Remove tweets with polarity 'NONE'
tweets_corpus = tweets_corpus.query('polarity != "NONE"')

### Data cleaning

In [52]:
# Define tweet cleaning function
def clean_tweets(tweets, col_name):
    # Work on a copy, so a filtered slice of another DataFrame is not modified in place
    tweets = tweets.copy()
    # Remove links (\w+ instead of exactly 8 characters, so longer t.co paths are removed whole)
    tweets[col_name] = tweets[col_name].map(lambda x: re.sub(r'https?://t\.co/\w+', '', x))
    # Remove usernames
    tweets[col_name] = tweets[col_name].map(lambda x: re.sub(r'@[A-Za-z0-9_]+', '', x))
    # Remove newline characters
    tweets[col_name] = tweets[col_name].map(lambda x: re.sub(r'[\n\r]+', '', x))
    # Insert spaces around emojis
    tweets[col_name] = tweets[col_name].map(lambda x: emoji_regex.sub(r' \1 ', x))
    # Replace runs of whitespace with a single space
    tweets[col_name] = tweets[col_name].map(lambda x: re.sub(r'\s+', ' ', x))
    return tweets
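
To see what the individual steps do, here is a sketch that applies the same kind of substitutions to a single made-up string instead of a DataFrame column. The function `clean_text`, the constant names and the sample tweet are hypothetical. The emoji step is left out, and the link pattern uses `\w+`, an assumption of this sketch, so that t.co paths of any length are matched.

```python
import re

# Patterns are compiled once and reused for every string
LINK_RE = re.compile(r'https?://t\.co/\w+')
USER_RE = re.compile(r'@[A-Za-z0-9_]+')
NEWLINE_RE = re.compile(r'[\n\r]+')
SPACES_RE = re.compile(r'\s+')


def clean_text(text):
    """Strip t.co links, usernames and newlines, then collapse whitespace."""
    text = LINK_RE.sub('', text)
    text = USER_RE.sub('', text)
    text = NEWLINE_RE.sub('', text)
    return SPACES_RE.sub(' ', text)


sample = 'RT @usuario: Buenos  dias!\nhttps://t.co/AbCd123456 #hola'
print(repr(clean_text(sample)))  # 'RT : Buenos dias! #hola'
```

The leftover `RT :` is the same residue visible in the cleaned corpus samples: the username is removed, but the retweet marker and the colon are kept.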

Now, we clean the corpus with the previous function.

In [53]:
tweets_corpus = clean_tweets(tweets_corpus, 'content')

In [54]:
print('Total corpus tweets after cleaning: %d' % len(tweets_corpus))

Total corpus tweets after cleaning: 7356


In [55]:
tweets_corpus.sample(10)

| | content | polarity |
|---|---|---|
| 54 | Granado reconoce que da los datos de paro "con... | N |
| 1914 | RT : Con el carro de Manolo Escobar? “: ¿Dónde... | N |
| 361 | lo mas gracioso es que diciendo esas cosas so... | N |
| 6040 | La encuesta de EP da mayoría absoluta al PP en... | NEU |
| 2961 | El jurado popular considera inocentes a Camps ... | P |
| 4275 | Ana Diaz , grupo Abengoa : " En #Andalucia som... | P |
| 5089 | RT : Jueves de REVISTA! En Portada HOY Mujer :... | P |
| 6226 | Buenas noches twitteros,como me gusta el twitt... | P |
| 620 | Un placer! ;-) RT : Vemos a , habitual en nues... | P |
| 4971 | Ofrenda floral a Blas Infante | P |


Export corpus tweets as CSV:

In [56]:
tweets_corpus.to_csv(corpus_tweets_csv, encoding='utf-8', index=False)