<a href="https://colab.research.google.com/github/elisasmenendez/ds-tweet-sentiment/blob/master/classifier/pt_tweets_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Test

The test consists of creating a tweet dashboard to showcase text mining, data visualization, statistical analysis and development skills.

* The app needs to stream tweets written in Portuguese from the Twitter API.
* The app needs to classify tweet sentiment.
  * Example classes are positive and negative denoting the tweet sentiment, but feel free to add more classes as you see fit.
* The app needs to show tweets being classified in real-time in a dashboard.
* The classifier must be written in Python, as it is our language of choice regarding data science. The rest can be written in your language of choice.
* Feel free to use Python libraries; there is no need to write things from scratch.
* Accuracy report
  * What metric are you using? Why?
  * Which type of test did you choose?
  * Include the test dataset.
* Your code needs to be on your GitHub profile.

Bonus points if you do the following:

* Dashboard with metrics (e.g. charts with tweet sentiment, time series, etc.).
* Your code is scalable.
* Your code is hosted on a Cloud provider.
* Your code runs inside docker containers, bonus points if running inside a Kubernetes Cluster.

# Imports

First, let's do some imports.

In [2]:
import glob
import pandas as pd 
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize, TweetTokenizer

import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Dataset

We chose the dataset proposed by Brum and Nunes (2017), since it is the largest manually annotated corpus in Portuguese for 3-polarity classification: negative, positive and neutral. Although the dataset is not publicly available (due to the Twitter Privacy Policy), you can contact the first author to get the original dataset. For further reference, see: http://bitbucket.org/HBrum/tweetsentbr/

Once you have the dataset, drag and drop the files into the Google Colab folder in the left menu for temporary use.

In [4]:
# Load all files into a single DataFrame (positive, negative and neutral)
all_files = glob.glob("tweets*") 
df_files = (pd.read_fwf(f, header=None) for f in all_files)
df1 = pd.concat(df_files, ignore_index=True)
df1.rename(columns = {0: 'id', 1: 'tweet'}, inplace=True)

# Load the evaluation from a separate file
df2 = pd.read_csv('tweetSentimentBR.txt', sep='\t', header=None)
df2.rename(columns = {0: 'id', 1: 'hashtag', 2: 'evaluations', 4:'sentiment'}, inplace=True)

# Merge both data frames by the tweet ID
df = pd.merge(df1, df2, on='id')
df = df[['id','tweet','hashtag','evaluations','sentiment']]

In [5]:
df[['tweet','sentiment']]

| | tweet | sentiment |
|---|---|---|
| 0 | tô passada com esse cara quanta merda pode sai... | -1 |
| 1 | coitada da namorada | -1 |
| 2 | esse japa não entendi porra nenhuma de orquíde... | -1 |
| 3 | aí vc fica até NUMBER assistindo e acorda cedo... | -1 |
| 4 | imagina que insuportável ter de dar de comer p... | -1 |
| ... | ... | ... |
| 14995 | lazaro falou bale fitness e ana maria braga es... | 0 |
| 14996 | simpatia na trama das seis ingrid guimarães mo... | 0 |
| 14997 | ocidentais tem mta dificuldade pra aceitar com... | 0 |
| 14998 | USERNAME que horas vc chega em belém / aeropor... | 0 |


# Pre-processing

We first apply some basic pre-processing steps: lowercasing the text, removing URLs and punctuation, and loading the Portuguese stop-word list that the vectorizers use later.

In [6]:
def preprocess_tweet_text(tweet):
    # remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet.lower())

    # remove punctuations (fastest way)
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    
    return tweet

tweets = df['tweet'].apply(preprocess_tweet_text)
classes = df['sentiment']
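
To make the effect of this step concrete, the sketch below applies the same function to a made-up tweet (the text and URL are invented for illustration). Note that `string.punctuation` only covers ASCII punctuation, so `#` and `@` are stripped along with commas and exclamation marks.

```python
import re
import string

def preprocess_tweet_text(tweet):
    # lowercase, then drop anything that looks like a URL
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet.lower())
    # delete ASCII punctuation characters
    return tweet.translate(str.maketrans('', '', string.punctuation))

# Invented example tweet, not taken from the dataset
sample = "Olha isso: https://t.co/abc123 Que LEGAL!!! #top"
cleaned = preprocess_tweet_text(sample)
print(repr(cleaned))
```

The double space left behind by the removed URL is harmless here, because the tokenizers used later split on whitespace.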

In [7]:
# Kept as a list, the type the scikit-learn vectorizers document for stop_words
stop_words = nltk.corpus.stopwords.words('portuguese')

# Pipeline, Tests and Metrics

Here, we create a pipeline to run and evaluate our classifier. 
<br><br>
About testing: we chose k-fold cross-validation (cross_val_predict) with k = 10. The dataset is split into k folds, and each fold is used once as the test set while the model is trained on the remaining folds. This gives a better indication of how the model would perform on unseen data than a single train/test split.
<br><br>
About metrics: the classification report shows several metrics, such as precision, recall and F-measure. We chose the average F-measure as our main evaluation metric, since it combines precision and recall in a single score.
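
As a reminder of what the F-measure computes, here is a minimal sketch of the standard formula F1 = 2PR / (P + R), the harmonic mean of precision P and recall R. The helper name `f_measure` and the input values are illustrative and not part of the notebook.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precise but missing most cases: the score stays low
print(round(f_measure(0.9, 0.1), 2))
# Balanced precision and recall: the score equals both
print(round(f_measure(0.6, 0.6), 2))
```

The harmonic mean penalizes a large gap between precision and recall: in the first call the arithmetic mean would be 0.50, while the F-measure is far lower.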

In [8]:
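# Class names in the sorted order of the labels: -1, 0, 1
sentiments = ['negative', 'neutral', 'positive']

def run_pipeline(pipeline, tweets, classes):
  # Fit on the full dataset; cross_val_predict below clones and refits the pipeline for each fold
  pipeline.fit(tweets, classes)
  # Out-of-fold predictions from 10-fold cross-validation
  results = cross_val_predict(pipeline, tweets, classes, cv=10)
  print(classification_report(y_true=classes, y_pred=results, target_names=sentiments))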
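
The idea behind k-fold cross-validation can be sketched in plain Python: the data is cut into k folds, and each fold serves once as the test set while the remaining folds form the training set, so every sample receives exactly one out-of-fold prediction. This is an illustration only. The helper `k_fold_indices` is hypothetical, and scikit-learn's own splitter additionally stratifies the folds by class when it is given a classifier and an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```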

# Classifier - Naive Bayes

We start with a simple Naive Bayes classifier and try different combinations of tokenizers (Word vs. Tweet) and vectorizers (Count vs. TF-IDF). The results show that there is no significant difference between a simple word tokenizer and a tweet-specific tokenizer. They also show that the simple Count vectorizer performs better than the TF-IDF vectorizer.
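
To illustrate the difference between the two vectorizers, the sketch below computes raw counts and TF-IDF weights by hand for two invented documents. It assumes the smoothed IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and skips the final L2 normalization step, so it is a simplification rather than the library's implementation.

```python
import math
from collections import Counter

docs = ["bom bom filme", "filme ruim"]  # invented examples
tokenized = [d.split() for d in docs]

# Count vectorizer: each term is weighted by how often it occurs
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in doc_freq}

# TF-IDF: counts rescaled so that terms found in every document weigh less
tfidf = [{t: c * idf[t] for t, c in doc.items()} for doc in counts]

print(dict(counts[1]))
print({t: round(w, 2) for t, w in tfidf[1].items()})
```

In the second document both terms have a count of 1, but TF-IDF gives `ruim` a higher weight than `filme`, because `filme` appears in every document.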

In [None]:
print("------ Pipeline: Word Tokenizer and Count Vectorizer ------\n")
pipeline = Pipeline([
  ('counts', CountVectorizer(analyzer="word", tokenizer=word_tokenize, stop_words=stop_words)),
  ('classifier', MultinomialNB())
])
run_pipeline(pipeline, tweets, classes)

print("------ Pipeline: Word Tokenizer and Tfidf Vectorizer ------\n")
pipeline = Pipeline([
  ('counts', TfidfVectorizer(analyzer="word", tokenizer=word_tokenize, stop_words=stop_words)),
  ('classifier', MultinomialNB())
])
run_pipeline(pipeline, tweets, classes)

print("------ Pipeline: Tweet Tokenizer and Count Vectorizer ------\n")
tweet_tokenizer = TweetTokenizer() 
pipeline = Pipeline([
  ('counts', CountVectorizer(analyzer="word", tokenizer=tweet_tokenizer.tokenize, stop_words=stop_words)),
  ('classifier', MultinomialNB())
])
run_pipeline(pipeline, tweets, classes)

print("------ Pipeline: Tweet Tokenizer and Tfidf Vectorizer ------\n")
tweet_tokenizer = TweetTokenizer() 
pipeline = Pipeline([
  ('counts', TfidfVectorizer(analyzer="word", tokenizer=tweet_tokenizer.tokenize, stop_words=stop_words)),
  ('classifier', MultinomialNB())
])
run_pipeline(pipeline, tweets, classes)

# Note: no tokenizer is passed below, so CountVectorizer falls back to its default token pattern
print("------ Pipeline: Word Tokenizer and Count Vectorizer + Ngrams ------\n")
pipeline = Pipeline([
  ('counts', CountVectorizer(ngram_range = (1, 2), stop_words=stop_words)),
  ('classifier', MultinomialNB())
])
run_pipeline(pipeline, tweets, classes)

------ Pipeline: Word Tokenizer and Count Vectorizer ------

              precision    recall  f1-score   support

    negative       0.62      0.69      0.65      4426
     neutral       0.55      0.31      0.40      3926
    positive       0.69      0.81      0.75      6648

    accuracy                           0.65     15000
   macro avg       0.62      0.61      0.60     15000
weighted avg       0.63      0.65      0.63     15000

------ Pipeline: Word Tokenizer and Tfidf Vectorizer ------

              precision    recall  f1-score   support

    negative       0.67      0.60      0.63      4426
     neutral       0.64      0.19      0.29      3926
    positive       0.61      0.91      0.73      6648

    accuracy                           0.63     15000
   macro avg       0.64      0.56      0.55     15000
weighted avg       0.64      0.63      0.59     15000

------ Pipeline: Tweet Tokenizer and Count Vectorizer ------

              precision    recall  f1-score   support


# Classifier - Logistic Regression

Brum and Nunes (2017) ran preliminary tests on their dataset, which showed that the Logistic Regression algorithm had the best performance. Hence, we chose this approach as our final classifier, using the word tokenizer and the Count vectorizer.

In [None]:
print("------ Pipeline: LBFGS Solver ------\n")
pipeline = Pipeline([
  ('counts', CountVectorizer(analyzer="word", tokenizer=word_tokenize, stop_words=stop_words)),
  ('classifier', LogisticRegression(solver='lbfgs', max_iter=300))
])
run_pipeline(pipeline, tweets, classes)

------ Pipeline: LBFGS Solver ------

              precision    recall  f1-score   support

    negative       0.65      0.63      0.64      4426
     neutral       0.51      0.48      0.49      3926
    positive       0.73      0.77      0.75      6648

    accuracy                           0.65     15000
   macro avg       0.63      0.63      0.63     15000
weighted avg       0.65      0.65      0.65     15000
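
Since the average F-measure is our main metric, it helps to see how the `macro avg` row is derived: it is the unweighted mean of the per-class F-scores. The sketch below recomputes it from the rounded per-class values printed in the complete reports above, so the last digit could differ from scikit-learn's own figure because of rounding.

```python
# Per-class f1-scores (negative, neutral, positive) copied from the reports above
reports = {
    "Naive Bayes + Count": [0.65, 0.40, 0.75],
    "Naive Bayes + TF-IDF": [0.63, 0.29, 0.73],
    "Logistic Regression + Count": [0.64, 0.49, 0.75],
}

# Macro average: plain mean over classes, ignoring class sizes
macro_f1 = {name: round(sum(scores) / len(scores), 2) for name, scores in reports.items()}
for name, value in macro_f1.items():
    print(f"{name}: {value:.2f}")
```

Logistic Regression matches Naive Bayes with the Count vectorizer on accuracy (0.65) but scores higher on the macro F-measure, mainly because its F-score on the neutral class is higher (0.49 versus 0.40).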



# Saving the model

Now that our experiments are done, we can train the Logistic Regression classifier on the entire dataset and save the model and the vectorizer for use in the tweet sentiment analysis task.

In [9]:
vectorizer = CountVectorizer(analyzer="word", tokenizer=word_tokenize, stop_words=stop_words)
freq_tweets = vectorizer.fit_transform(tweets)

In [None]:
classifier = LogisticRegression(solver='lbfgs', max_iter=300)
classifier.fit(freq_tweets, classes)

In [11]:
with open('vectorizer.sav', 'wb') as f:
    pickle.dump(vectorizer, f)
with open('classifier_regression.sav', 'wb') as f:
    pickle.dump(classifier, f)

# Manual tests

Finally, we run a few manual tests to check our model.

In [12]:
with open('vectorizer.sav', 'rb') as f:
    vectorizer_sav = pickle.load(f)
with open('classifier_regression.sav', 'rb') as f:
    classifier_sav = pickle.load(f)

In [13]:
tests = ["jojo todynho cancelada",
         "jojo todynho chata",
         "jojo todynho lenda",
         "jojo todynho ícone",
         "jojo todynho perfeita",
         "jojo todynho rainha"]
freq_tests = vectorizer_sav.transform(tests)

In [14]:
for t, c in zip(tests, classifier_sav.predict(freq_tests)):
    print(f"{t}, {c}")  # f-string works whether the label is a string or an integer

jojo todynho cancelada, -1
jojo todynho chata, -1
jojo todynho lenda, 1
jojo todynho ícone, 1
jojo todynho perfeita, 1
jojo todynho rainha, 1
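
For display in a dashboard, the numeric labels can be mapped back to readable names. A minimal sketch, assuming the label encoding used in this dataset (-1 negative, 0 neutral, 1 positive); the helper `label_name` is hypothetical and accepts labels stored either as integers or as strings.

```python
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive"}

def label_name(label):
    """Map a predicted label such as -1 or "1" to a readable class name."""
    return LABEL_NAMES[int(label)]

# Stand-in predictions; in the notebook these would come from classifier_sav.predict
predictions = [("jojo todynho chata", -1), ("jojo todynho rainha", "1")]
for text, label in predictions:
    print(f"{text}: {label_name(label)}")
```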


# References

BRUM, Henrico Bertini; NUNES, Maria das Graças Volpe. (2017). Building a Sentiment Corpus of Tweets in Brazilian Portuguese. Available at: https://www.aclweb.org/anthology/L18-1658.pdf

Gaurav Singhal Tutorial: https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python

Felipe Santana Tutorial: https://minerandodados.com.br/analise-de-sentimentos-utilizando-dados-do-twitter/

Best way to strip punctuation from a string: https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string

Which metrics to use: https://blog.infegy.com/understanding-sentiment-analysis-and-sentiment-accuracy

Test Train vs. Cross Validation: https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f