## Tweet classification

File names:

In [1]:
# To read
test_tweets_raw = 'datasets/test_nolabel.csv'
train_tweets_raw = 'datasets/train.csv'
corpus_tweets_2012_xml = 'general-train-tagged-3l.xml'
corpus_tweets_2017_xml = 'intertass-train-tagged.xml'
emojis_csv = 'emojis.csv'

# To generate
corpus_tweets_2012_csv = 'general-train-tagged-3l.csv'
corpus_tweets_2017_csv = 'intertass-train-tagged.csv'
corpus_tweets_csv = 'corpus_tweets_prprocessed.csv'

Import Pandas and Numpy:

In [2]:
import pandas as pd
import numpy as np

### Load datasets

In [3]:
tweets_test = pd.read_csv(test_tweets_raw, encoding='utf-8')
tweets_train = pd.read_csv(train_tweets_raw, encoding='utf-8')

print('Total tweets to evaluate: %d' % len(tweets_test))
print('Evaluated tweets so far: %d' % len(tweets_train))

Total tweets to evaluate: 177
Evaluated tweets so far: 411


### POS Tagging

Import libraries to read XML:

In [4]:
from lxml import objectify

Import/read most recent corpus (2017):

In [5]:
# 4 values of sentiment: N, P, NONE, NEU
with open(corpus_tweets_2017_xml, 'rb') as xml_file:
    root = objectify.parse(xml_file).getroot()

# Collect all rows first and build the DataFrame once: DataFrame.append
# was removed in pandas 2.0, and appending row by row is slow.
rows = [
    {'content': tweet.content.text, 'polarity': tweet.sentiment.polarity.value.text}
    for tweet in root.getchildren()
]
general_tweets_corpus_train_2017 = pd.DataFrame(rows, columns=['content', 'polarity'])

Import/read biggest corpus (2012), to concatenate it with the previous one:

In [6]:
# 4 values of sentiment: N, P, NONE, NEU
with open(corpus_tweets_2012_xml, 'rb') as xml_file:
    root = objectify.parse(xml_file).getroot()

# Note the plural <sentiments> element in the 2012 corpus.
# Rows are collected first because DataFrame.append was removed in pandas 2.0.
rows = [
    {'content': tweet.content.text, 'polarity': tweet.sentiments.polarity.value.text}
    for tweet in root.getchildren()
]
general_tweets_corpus_train_2012 = pd.DataFrame(rows, columns=['content', 'polarity'])
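
To make the element paths used in the two cells above concrete, here is a minimal, self-contained sketch that parses an inline XML string with the standard library's `xml.etree.ElementTree` rather than `lxml.objectify`. The sample document and its tag layout are assumptions inferred from the attribute access above (`tweet.sentiment.polarity.value` for 2017, `tweet.sentiments.polarity.value` for 2012); the real corpus files may well contain more fields. The helper name `read_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the assumed 2017 layout; not taken from the real corpus.
SAMPLE_XML = """<tweets>
  <tweet>
    <content>Qué buen día</content>
    <sentiment><polarity><value>P</value></polarity></sentiment>
  </tweet>
  <tweet>
    <content>Otro atasco en la A-6</content>
    <sentiment><polarity><value>N</value></polarity></sentiment>
  </tweet>
</tweets>"""


def read_rows(xml_text, polarity_path):
    """Return one {'content', 'polarity'} dict per <tweet> element."""
    root = ET.fromstring(xml_text)
    return [
        {
            'content': tweet.findtext('content'),
            # findtext returns the text of the first element matching the path
            'polarity': tweet.findtext(polarity_path),
        }
        for tweet in root.findall('tweet')
    ]


rows = read_rows(SAMPLE_XML, 'sentiment/polarity/value')
print(rows)
```

For the assumed 2012 layout the second argument would be `'sentiments/polarity/value'`. The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.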

Import/read the emoji sentiment dataset, to concatenate it with the previous ones. Scores range from -1 to 1. Build the `polarity` column according to the following criteria:
- If the sentiment score is below -0.2, consider it a **negative** sentiment (`N`).
- If the sentiment score is between -0.2 and 0.2 (both included), consider it a **neutral** sentiment (`NEU`).
- If the sentiment score is above 0.2, consider it a **positive** sentiment (`P`).
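
The criteria above can be sketched as a small standalone function. This is only an illustration: the name `score_to_polarity` and the sample scores are hypothetical, and it assumes that the boundary values -0.2 and 0.2 count as neutral, which is how the comparisons in the cell below treat them.

```python
def score_to_polarity(score, threshold=0.2):
    """Map a sentiment score in [-1, 1] to 'N', 'NEU' or 'P'."""
    if score < -threshold:
        return 'N'
    if score > threshold:
        return 'P'
    # Everything in [-threshold, threshold], boundaries included, is neutral.
    return 'NEU'


# Hypothetical scores, chosen to include both boundary values.
for score in (-0.75, -0.2, 0.0, 0.2, 0.6):
    print(score, score_to_polarity(score))
```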

In [7]:
# Read emojis CSV
emoji_dataset = pd.read_csv(emojis_csv, encoding='utf-8')

# Build column 'polarity' with .loc, which avoids chained assignment
# (and the SettingWithCopyWarning it triggers)
emoji_dataset['polarity'] = 'NEU'
emoji_dataset.loc[emoji_dataset.sentiment < -0.2, 'polarity'] = 'N'
emoji_dataset.loc[emoji_dataset.sentiment > 0.2, 'polarity'] = 'P'

# Convert each hexadecimal code point to its character and build the
# dataframe to append to the corpus in one step
emoji_corpus = pd.DataFrame({
    'content': emoji_dataset['emoji'].apply(lambda code: chr(int(code, 16))),
    'polarity': emoji_dataset['polarity'],
})


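Concatenate the 2012 general corpus, the 2017 corpus and the emoji corpus into a single training corpus. Each source keeps its own index, so index values can repeat in the result: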

In [8]:
tweets_corpus = pd.concat([
        general_tweets_corpus_train_2012,
        general_tweets_corpus_train_2017,
        emoji_corpus
    ])

In [9]:
# Build emojis regex
import re
# Escape each emoji so that no character is misread as character-class syntax
emoji_regex = re.compile(r'[%s]' % ''.join(re.escape(e) for e in emoji_corpus['content']))
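
As a minimal sketch of how a character-class regex like this behaves, the snippet below builds one from a few code points and runs it on a sample string. The hexadecimal codes and the sample text are hypothetical and stand in for the contents of `emojis.csv`, whose exact format is not shown here.

```python
import re

# Hypothetical hexadecimal code points, in a form that int(code, 16) accepts.
hex_codes = ['0x1f600', '0x2764', '0x1f622']
emojis = [chr(int(code, 16)) for code in hex_codes]

# One character class containing every emoji; re.escape guards against
# characters that have a special meaning inside [...].
emoji_pattern = re.compile('[%s]' % ''.join(re.escape(e) for e in emojis))

text = 'Buenos días \U0001F600 y gracias \u2764'
found = emoji_pattern.findall(text)
print(found)
```

A character class matches one code point at a time, so an emoji made of several code points (for example a hand gesture followed by a skin-tone modifier) is matched only on the code points that appear in the class.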

In [10]:
tweets_corpus.sample(10)

| | content | polarity |
|---|---|---|
| 4242 | @estherpalomera se sabe quién comparece por pa... | NONE |
| 3845 | Aquí tenéis un vídeo de mi amigo Pinto Colorad... | P |
| 6660 | Como siempre, hay quien intenta manchar el éxi... | N |
| 23 | Accidente en BUS-VAO A-6 km. 12. Motorista de ... | N |
| 5049 | Periodistas de informativos T5 agredidos ayer,... | N |
| 768 | Buenos días Magaluf ! 👌🏻✅\n• @BHmallorca • #B... | P |
| 67 | 😀 | P |
| 3917 | http://t.co/FUleUc9K | NONE |
| 2501 | Curiosa imagen veremos mañana. El Ministro Cañ... | NONE |
| 5590 | Primero se les trata de vividores, luego de v... | N |


In [11]:
print('Total corpus tweets: %d' % len(tweets_corpus))

Total corpus tweets: 8978


Remove tweets from the corpus where `polarity` is `NONE`:

In [12]:
# Remove tweets with polarity 'NONE'
tweets_corpus = tweets_corpus.query('polarity != "NONE"')

### Data cleaning

First, we import the cleaning function `clean_tweets` from the local `cleaner` module:

In [13]:
# clean tweets tools
from cleaner import clean_tweets

Now, we clean the corpus with this function:

In [14]:
tweets_corpus = clean_tweets(tweets_corpus, 'content', emoji_regex)
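
The `cleaner` module is not shown in this notebook, so what `clean_tweets` actually does cannot be read from the text. Purely as an illustration of the kind of step such a function might perform, the sketch below removes URLs and @mentions and drops tweets that end up empty. The names `clean_text` and `clean_all`, the regular expressions and the first sample string are all assumptions; the sketch works on plain strings rather than a DataFrame and does not use the `emoji_regex` argument.

```python
import re

URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')
SPACES_RE = re.compile(r'\s+')


def clean_text(text):
    """Hypothetical per-tweet cleaner: drop URLs and @mentions, tidy whitespace."""
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    return SPACES_RE.sub(' ', text).strip()


def clean_all(texts):
    """Clean every tweet and discard the ones left empty."""
    cleaned = (clean_text(t) for t in texts)
    return [t for t in cleaned if t]


# The first string is made up; the second is a URL-only tweet like the one in the sample above.
sample = ['@user mira esto http://t.co/abc123 #Andalucía', 'http://t.co/FUleUc9K']
print(clean_all(sample))
```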

In [15]:
print('Total corpus tweets after cleaning: %d' % len(tweets_corpus))

Total corpus tweets after cleaning: 7356


In [16]:
tweets_corpus.sample(10)

| | content | polarity |
|---|---|---|
| 4642 | . llama a la regeneración en #Andalucía frente... | N |
| 714 | La vida nos lleva por caminos inextricables j... | P |
| 720 | Ⓔ | P |
| 695 | +1 RT :que entreguen las armas, que se entregu... | N |
| 5016 | Una de fakes de esos que todos los medios nos ... | N |
| 2999 | EL BARRIO en : "Cádiz es tierra de embajadores... | N |
| 5938 | Y a ti ;-)) RT : Contándote a ti :-) RT : Qué ... | P |
| 4218 | "Redacción simulada" para aprender periodismo ... | NEU |
| 720 | el sábado unas risas todos juntos... A por ot... | P |
| 27 | 2 horas esperando en Comisaría Gandia y han a... | N |


Export corpus tweets as CSV:

In [17]:
tweets_corpus.to_csv(corpus_tweets_csv, encoding='utf-8', index=False)