In [1]:
# Utils
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import clear_output

# Preprocessing
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem.porter import *
import treetaggerwrapper as ttpw
import re
import string
import emoji

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split

# Models
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB

# Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

# from num2words import num2words
# from nltk import word_tokenize
# from nltk.stem import WordNetLemmatizer
# from sklearn.utils import resample
# from sklearn.pipeline import Pipeline
# from sklearn.feature_selection import SelectKBest, chi2

clear_output()

## 1. Data Exploration

The dataset chosen for this text classification project contains tweets related to the Black Lives Matter movement, which was founded in 2013 and gained international attention in May 2020 in USA.

The dataset is composed of 80000 labelled tweets and for each tweet there are also some informations about the user that published it and the tweet itself.

The classes are:

* ”1” if the tweet shows a positive sentiment towards the movement.
* ”0” if the tweet shows a negative sentiment towards the movement.

The distribution of the tweets between the two classes is balanced: there are 40069 tweets classified as 0 and 39931 tweets classified as 1.

Also the average length of tweets between the two classes is similar: class 0 tweets have an average length of 121 characters, while class 1 tweets have an average length of 133 characters.

Analysing the dataset, it pops also out that there are a lot of retweets and most of the tweets are duplicates, so, inspecting more the dataset, we can see that there are 35006 distinct tweets.

The quality of the tweets is also quite low: there are often unicode characters, emojis and web-links, most of them are trimmed and they are also hardly classifiable even for a human:

`yes if u mean this one https://t.co/r9MaeWmRy7`<br/>
`@magicalhyunjin USE TOO`<br/>
`RT @NotReallyaDr: Happy birthday, Chris!`<br/>

The informations about the tweets mentioned before are a subset of the attributes of the Tweet object [1].

The correlation between the labels and the informations about tweets and users resulted very low. <br/>
Furthermore, the quality of the informations is low: there are a lot of missing values, often incorrigible by applying some interpolations.

For these reasons, I decided to drop every information and develop a model only with the tweets and them labels.

Due to the previously explained problems about the tweets, in the following steps (especially in the Preprocessing one) we will see that a non-invasive text cleaning approach pays more for this text classification task, since in some tweets there are so few features that even the user mentioned (’@’) could be useful to classify a tweet.

## 2. Preprocessing 

The full preprocessing pipeline scheduled for this project is the following: 

* Cut unicode characters and make text lowercase
* Cut mentions (’@’)
* Apply a Tagger [2] on the text or apply a Stemmer [3], [4]
* Cut websites’ links
* Convert emojis [5] to text
* Remove punctuation
* Remove stopwords
* Remove ”RT” word

Although, this pipeline was too invasive and so I decided to maintain the essential steps, in particular those that don’t remove the words used:

* Cut unicode characters and make text lowercase 
* Apply a Tagger [2] on the text
* Cut websites’ links
* Convert emojis [5] to text

Unicode characters were mainly present at the end of the tweets as suspension points. Since they do not provide any information to the classifier, I decided to cut them off.

Removing the user mentioned and the ”RT” word led to a clean text. However, as anticipated in the Data Exploration step, the features extractable for certain tweets are few and even the user mentioned could be used as a feature. This is also true for stopwords, that when removed could lead to almost empty tweets. <br/>
Of course this approach should not be followed when using a bigger dataset, because the need of the usernames (or indiscriminate features, in general) to classify a tweet is certainly synonym of overfitting.

I also involved a lemmatization process in order to reduce each word to its base form and to flat out different words to represent the same feature.<br/>
To do so I used the TreeTagger tool [2]. Its effects are the following:

`This is a very short text`
`this be a very short text`

Finally, emojis to text conversion was helpful for two main reasons:

1. A certain sequence of emojis are used for the users to express appreciation for the Black Lives Matter movement
2. Instead of being discarded, emojis have become a feature.

After this preprocessing pipeline, I used TfidfVectorizer [6] from sklearn in order to transform the reviews into a TF-IDF features matrix.


In [3]:
# Load data
train_data = load_dataset("development.jsonl")
test_data = load_dataset("evaluation.jsonl")

# Explore data
explore_data(train_data)

# Preprocessing
train_data["full_text"] = train_data["full_text"].apply(lambda tweet: preprocess(tweet))
test_data["full_text"] = test_data["full_text"].apply(lambda tweet: preprocess(tweet))

train_data.head()

There are 40069 tweets classified as 0.
There are 39931 tweets classified as 1.

Distinct tweets inside train dataset: 35006

The average length of tweets labelled as 0 is 121.8228056602361
The average length of tweets labelled as 1 is 133.4803285667777


Unnamed: 0,class,full_text,text_length
0,0,bawdzisnaughty sgfgjay radioshadilay you be a ...,93
1,1,rt bigipipi : ive be stay largely offline beca...,139
2,1,rt emanisblazed : this website have a huge lis...,140
3,0,rt elijahdaniel : why be it 8am and i be liter...,123
4,0,rt youranoncentral : report from the battle : ...,124


## 3. Algorithm choice

Initially, I make an inspection using 5 different classifiers: 

* SGD Classifier [7]
* Random Forest Classifier [8] 
* Logistic Regressor [9]
* SVC [10]
* LinearSVC [11]

The evaluation set was made up of the 20% of the development set.

![title](Accuracies.png)

Before doing hyperparameters tuning, the SGDClassifier reached 89,31% of accuracy. After this first attempt, I compared it to the other classifiers and decided to not tune it because it wasn’t promising.

After the SGDClassifier, I tried the RandomForestClassifier, which performed well on evaluation set and led to an accuracy of 91.475%. So, I decided to do some hyperparameters tuning on it (more on next section), but the accuracy did not increase so much: it gained just 0.05% of accuracy (91.525%).

In addition, I tried to use the LogisticRegressor; I thought that this regressor could be a good one for this project because of its capability on modelling binary classification problems. Using the LogisticRegressor the accuracy reached 92%. <br/>
Once tuned, this model was the only in which I can highlight a significant increase, since the accuracy raised up to 93.275%.

Finally, I decided to try the SVC and the LinearSVC classifiers, assuming that even with the small preprocessing steps done before the dataset was splitted as clearly as possible into two different clusters.<br/>
Initially, the SVC classifier and the LinearSVC achieved an accuracy of 92.7% and 93.24%, respectively.<br/>
Since training the SVC is quite time consuming, the accuracy found was lower than its linear version and so the dataset seemed to be linearly separable, I chose to only tune the latter.<br/>
By the way, the improvements achieved by means of the hyperparameters tuning on LinearSVC were inexistent on evaluation set, and the final accuracy remained the same.

Even if Support Vector Machine algorithm (with linear kernel) perform similarly to the Logistic Regression, I think that the first one performs better on this project because of the sensitivity to marginal values. The sigmoid function of the LogisticRegressor tends to not properly identify simil-neutral values, while the Support Vector Machine algorithm tries to construct the best widest possible separating line to split this two clusters.

All the evaluation for the accuracy were done with accuracy score, as suggested in the assignment, and all the classifiers were provided by the Scikit Learn package.

In [194]:
vectorizer, X_train, X_test, y_train, y_test = split_dataset(train_data)
X_train_scaled, X_test_scaled = scale_dataset(X_train, X_test)
classifiers = ["SGD", "RFC", "LR", "LSVC", "SVC"] # "SGD", "RFC", "LR", "LSVC", "SVC"

# SGD Classifier

if "SGD" in classifiers:
    print("SGD Classifier\n")
    classifier = SGDClassifier(n_jobs=4, max_iter=300000, random_state=42) # , loss="hinge", penalty="l2"
    _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test)

    print("- - -\n")

# Random Forest Classifier

if "RFC" in classifiers:
    print("Random Forest Classifier\n")
    classifier = RandomForestClassifier(n_jobs=4, random_state=42) # n_estimators=100, 
    _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test)

    print("- - -\n")

# Logistic Regressor

if "LR" in classifiers:
    print("Logistic Regressor\n")
    classifier = LogisticRegression(n_jobs=4, max_iter=300000, random_state=42) # tol=0.01, 
    _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test)

    print("- - -\n")

# LinearSVC

if "LSVC" in classifiers:
    print("Linear SVC\n")
    classifier = svm.LinearSVC(max_iter=300000, random_state=42)
    _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test)

    print("- - -\n")

# SVM

if "SVC" in classifiers:
    print("SVC")
    classifier = svm.SVC(max_iter = 300000, random_state=42) #tol=0.1, 
    _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test)

    print("- - -\n")

SGD Classifier

Accuracy: 
(accuracy_score):0.89675
(f1_score): 0.8960246376811595
- - -

Random Forest Classifier

Accuracy: 
(accuracy_score):0.91275
(f1_score): 0.9125036332058113
- - -

Logistic Regressor

Accuracy: 
(accuracy_score):0.92075
(f1_score): 0.920639457234545
- - -

Linear SVC

Accuracy: 
(accuracy_score):0.9314375
(f1_score): 0.9314208595510693
- - -

SVC
Accuracy: 
(accuracy_score):0.927125
(f1_score): 0.9270016380730695
- - -



## 4. Tuning and validation

* Random Forest Classifier: <br/>
    num estimators in [20, 50, 80, 100, 120]
* Logistic Regressor: <br/>
    C in range(1,101)
* LinearSVC:<br/>
    C in [0.1, 0.5, 0.7, 1, 10, 20, 30, 40],<br/>
    tol in [0.1, 0.01, 0.001, 0.0001, 0.00005, 0.00001, 0.000005]
    
For the TfidfVectorizer, I fixed ngram range to (1, 2), especially because I didn’t dropped any word from the tweets.

![title](Accuracies-2.png)


In order to tune properly the classifiers I tried different values of one main parameter, C (for Logistic Regressor and LinearSVC) and the number of estimators for the Random Forest Classifier.

The reason why in the previous step I mentioned that the Random Forest Classifier when tuned didn’t increased the accuracy is that the best number of estimators found was 100, which is also the default value for this classifier.

For the other classifiers, LinearSVC and Logistic Regressor, I thought that C was the most important parameter I should have worked on, since I developed a small preprocessing pipeline and marginal values could have been a problem.

After some attempts, for the LinearSVC classifier I ended up that C = 0.7 was the best in terms of accuracy. So, this classifier works better with a large margin around the hyperplane. Moreover, the best tollerance value found is 0.01. The accuracy reached on evaluation set is 93.24%, while in test set the accuracy is 93.39%.

The best C value found for the Logistic Regressor is 35. This means that this model is more prone to overfit the data, and the confirmation comes also from the leaderboard: in fact, the accuracy on evaluation set is 93.27% and on test set is also 93.27%.

So, comparing the Logistic Regressor with the LinearSVC and their best C values, we can see that the Logistic Regressor performs better on the evaluation set because it overfits the data (C = 35), but then on test set the LinearSVC achieves better results.

In [35]:
vectorizer, X_train, X_test, y_train, y_test = split_dataset(train_data)
X_train_scaled, X_test_scaled = scale_dataset(X_train, X_test)
classifiers = ["RFC", "LR", "LSVC"] # "RFC", "LR", "LSVC"

# Random Forest Classifier

if "RFC" in classifiers:

    print("Random Forest Classifier\n")

    accuracies = {}

    for num_estimators in [20, 50, 80, 100, 120]:
        classifier = RandomForestClassifier(n_jobs=4, n_estimators=num_estimators)
        _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test, print_accuracy=False)
        accuracies[str(num_estimators)] = float(accuracy)

    #relevant_features(vectorizer, classifier)
    best_num_estimators, max_accuracy = get_best_parameter(accuracies, "Best number of estimators:")

    print("- - -\n")

# Logistic Regressor

if "LR" in classifiers:

    print("Logistic Regressor\n")

    accuracies = {}

    for c in range(1,101):
        classifier = LogisticRegression(C=c, n_jobs=4, max_iter=300000)
        _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test, print_accuracy=False)
        accuracies[str(c)] = float(accuracy)

    #relevant_features(vectorizer, classifier)
    best_c, max_accuracy = get_best_parameter(accuracies, "Best C:")

    print("- - -\n")

# Linear SVC

if "LSVC" in classifiers:
    
    joint_tuning = True
    tune_c = False
    tune_tol = False

    print("Linear SVC\n")

    accuracies = {}
    
    if joint_tuning == True:
        for c in [0.1, 0.5, 0.7, 1, 10, 20, 30, 40]:
            for tol in [0.1, 0.01, 0.001, 0.0001, 0.00005, 0.00001, 0.000005]:        
                classifier = svm.LinearSVC(C=c, tol=tol, max_iter=300000)
                _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test, print_accuracy=False)
                accuracies[str(c) + "," + str(tol)] = float(accuracy)

        #relevant_features(vectorizer, classifier)
        best_c, max_accuracy = get_best_parameter(accuracies, "Best C and tol:")
    
    accuracies = {}
    if tune_c == True:
        for c in [0.1, 0.5, 0.7, 1, 10, 20, 30, 40]:
            classifier = svm.LinearSVC(C=c, max_iter=300000)
            _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test, print_accuracy=False)
            accuracies[str(c)] = float(accuracy)

        #relevant_features(vectorizer, classifier)
        best_c, max_accuracy = get_best_parameter(accuracies, "Best C:")

    accuracies = {}
    if tune_tol == True:
        for tol in [0.1, 0.01, 0.001, 0.0001, 0.00005, 0.00001, 0.000005]:
            classifier = svm.LinearSVC(tol=tol, max_iter=300000)
            _, accuracy, predictions = train(classifier, X_train, X_test, y_train, y_test, print_accuracy=False)
            accuracies[str(tol)] = float(accuracy)

        #relevant_features(vectorizer, classifier)
        best_c, max_accuracy = get_best_parameter(accuracies, "Best tol:")

    print("- - -\n")


Random Forest Classifier

Best number of estimators: 100
Max accuracy: 0.91525
- - -

Logistic Regressor

Best C: 35
Max accuracy: 0.93275
- - -

Linear SVC

Best C and tol: 1,0.1
Max accuracy: 0.9324375
- - -


In [197]:
classifier0 = svm.LinearSVC(C=0.7, tol = 0.01, max_iter=300000) # 0.9314375 on local, 0.9339 on leaderboard

classifier1 = svm.LinearSVC(C=1, tol = 0.1, max_iter=300000) # 0.9324375 on local, 0.9335 on leaderboard 

predictions = predict(classifier0, train_data, test_data)
export(predictions, test_data)

Done! :)


### Code

In [2]:
def load_dataset(filename):
    data = pd.read_json(filename, lines=True)
    
    data.drop(['id',\
               'id_str',\
               'metadata',\
               'source',\
               'lang',\
               'coordinates',\
               'place',\
               'created_at',\
               'in_reply_to_status_id_str',\
               'in_reply_to_status_id',\
               'in_reply_to_user_id_str',\
               'in_reply_to_user_id',\
               'quoted_status_id_str',\
               'quoted_status_id',\
               'quoted_status',\
               'in_reply_to_screen_name',\
               'withheld_in_countries',\
               'display_text_range',\
               'retweeted',\
               'favorited',\
               'truncated',\
               'contributors',\
               'user',\
               'is_quote_status',\
               'retweet_count',\
               'favorite_count',\
               'possibly_sensitive'], axis=1, inplace=True)
    return data

def explore_data(train_data, plot = False):
    print("There are " + str(train_data["class"].value_counts()[0]) + " tweets classified as 0.")
    print("There are " + str(train_data["class"].value_counts()[1]) + " tweets classified as 1.")
    print()
    print("Distinct tweets inside train dataset:", str(len(train_data[train_data['full_text'].duplicated() == False]["full_text"])))

    train_data['text_length'] = [len(t) for t in train_data.full_text]
    
    mean_length_class_0 = train_data[train_data["class"] == 0].text_length.mean()
    mean_length_class_1 = train_data[train_data["class"] == 1].text_length.mean()
    
    print()
    print("The average length of tweets labelled as 0 is " + str(mean_length_class_0))
    print("The average length of tweets labelled as 1 is " + str(mean_length_class_1))
    
    if plot == True:
        fig, ax = plt.subplots(figsize=(5, 5))
        plt.boxplot(train_data.text_length)
        plt.show()
    
    return 

punc_list = [".","_",";",":","!","?","/","\\",",","$","&",")","(","'","\"", "#"]
stopwords = [stopword.replace("\n", "") for stopword in open("stopwords.txt", "r")]
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
stemmer = PorterStemmer()
tagger = ttpw.TreeTagger(TAGLANG='en', TAGDIR ="/Users/carlopeluso/Downloads")

def preprocess(text):
    # Exclude non-ascii characters
    text = ''.join([word for word in text for i in range(len(word)) if ord(word[i]) <= 127])
    
    # Make text lowercase
    text = text.lower().replace("@", "")
    
    # Tagger
    tags = tagger.tag_text(text)
    text = ' '.join([t.split('\t')[-1] for t in tags])
    
    # Cut mentions
    #text = ' '.join([word for word in text.split(" ") if not word.startswith("@")])
    
    # Cut websites' links
    #text = re.sub('https?://[A-Za-z0-9./]+','',text)
    
    # Transform emojis into text
    text = emoji.demojize(text)
    
    # Remove punctuations defined inside punctuation list
    #t = str.maketrans(dict.fromkeys(punc_list, " "))
    #text = text.translate(t)
    
    # Stemmer
    #text = ' '.join([stemmer.stem(word) for word in text.split()])
    
    # Cut of "rt" and "twitter"
    #text = " ".join([word for word in text.lower().split() if word not in ["rt", "twitter"]])
    
    return text

def full_preprocess(text):
    
    text = ''.join([word for word in text for i in range(len(word)) if ord(word[i]) <= 127])
    
    # Make text lowercase
    text = text.lower().replace("@", "")
    
    # Cut mentions
    text = ' '.join([word for word in text.split(" ") if not word.startswith("@")])
    
    # Tagger
    tags = tagger.tag_text(text)
    text = ' '.join([t.split('\t')[-1] for t in tags])
    
    # Cut websites' links
    text = re.sub('https?://[A-Za-z0-9./]+','',text)
    
    # Transform emojis into text
    text = emoji.demojize(text)
    
    # Remove punctuations defined inside punctuation list
    t = str.maketrans(dict.fromkeys(punc_list, " "))
    text = text.translate(t)
    
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stopwords])
    
    # Stemmer
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    
    # Cut of "rt" and "twitter"
    text = " ".join([word for word in text.lower().split() if word not in ["rt", "twitter"]])
    
    return text

# - - - 

def split_dataset(data):
    vectorizer = TfidfVectorizer(ngram_range=(1,2))
    X = vectorizer.fit_transform(data["full_text"])

    X_train, X_test, y_train, y_test = train_test_split(X, data.loc[:, "class"], test_size=0.2, random_state=0)
    
    return vectorizer, X_train, X_test, y_train, y_test
    
def train(classifier, X_train, X_test, y_train, y_test, print_accuracy = True, *args):
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    if print_accuracy == True:
        print("Accuracy: \n(accuracy_score):%s" % (accuracy_score(y_test, predictions)))
        print("(f1_score):", f1_score(y_test, predictions, average='weighted'))
    
    return classifier, accuracy, predictions

def grid_search(classifier, train_data, *args):
    vectorizer, X_train, X_test, y_train, y_test = split_dataset(train_data)
    classifier.fit(X_train, y_train)
    
    print("Max accuracy: ", classifier.best_score_)
    print("Best model: ", classifier.best_estimator_)
    
def relevant_features(vectorizer, classifier, *args):
    feature_to_coef = {
        word: coef for word, coef in zip(
            vectorizer.get_feature_names(), classifier.coef_[0]
        )
    }

    print("\nBest words for BLM")
    for best_positive in sorted(
            feature_to_coef.items(),
            key=lambda x: x[1],
            reverse=True)[:5]:
        print(best_positive)
    
    print("\nBest words against BLM")
    for best_negative in sorted(
            feature_to_coef.items(),
            key=lambda x: x[1])[:5]:
        print(best_negative)

def get_best_parameter(accuracies, title, plot=False):
    
    if plot == True:
        accuracies_df = pd.DataFrame.from_dict(accuracies, orient='index',columns=["accuracy"])
        sns.lineplot(x=accuracies_df.index, y = "accuracy", data= accuracies_df)
    
    best_param = max(accuracies, key=accuracies.get)
    max_accuracy = accuracies[str(best_param)]

    print(title, best_param)
    print("Max accuracy:", max_accuracy)
    
    return best_param, max_accuracy
        
def predict(classifier, train_data, test_data, scaled = None, *args):
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(train_data["full_text"])
    X_test = vectorizer.transform(test_data["full_text"])

    if scaled is not None:
        scaler = MaxAbsScaler()
        scaler.fit(X_train)
        X_train = scaler.transform(X_train)
        X_test = scaler.transform(X_test)
    
    classifier.fit(X_train, train_data.loc[:, "class"])

    predictions = classifier.predict(X_test)
    return predictions

def export(predictions, test_data):
    
    with open('exam_export.csv', 'w') as file:
        file.write("Id,Predicted\n")
        for index in test_data.index:
            s = predictions[index]
            file.write(str(index) + "," + str(s) + "\n")
    print("Done! :)")
    return