#### This notebook trains models on tweets with binary sentiment labels. None of these models was used later.  
#### The main code is in the `training_multiclass.ipynb` notebook, which is very similar to this one.

In [1]:
from nltk import download
download('punkt')
download('stopwords')

import zipfile
import pandas as pd
from nltk.corpus import stopwords
import re
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from numpy import average as avg
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\flori\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\flori\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


# Read Training Data
Data source: https://www.kaggle.com/datasets/kazanova/sentiment140  

Columns:
1. label:   The polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. id:      The id of the tweet
3. date:     The date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag:     The query (lyx). If there is no query, then this value is NO_QUERY.
5. user:     The user that tweeted
6. raw_text: The text of the tweet

In [2]:
archive = zipfile.ZipFile("data.zip", "r")
data = pd.read_csv(archive.open("train_binary.csv"), header=None, encoding_errors="replace")
data.columns = ["label", "id", "date", "flag", "user", "raw_text"]
data.head()

| | label | id | date | flag | user | raw_text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


Number of tweets per label:

In [3]:
data["label"].value_counts()

0    800000
4    800000
Name: label, dtype: int64

There are no tweets labelled neutral, so we convert the labels to binary form  
(0 = negative, 1 = positive).

In [4]:
data.loc[data["label"]==4, "label"] = 1

# Data Preprocessing

Turning everything into lowercase characters

In [5]:
data["text"] = [entry.lower() for entry in data["raw_text"]]

Removing Stopwords (Common words like "my", "he", "is", ...)

In [6]:
stop = stopwords.words('english')
# we think 'no' and 'not' might be important words for the sentiment and don't want them to be removed,
# so we remove them from the list of stopwords
stop.remove("no")
stop.remove("not")

# stopwords are applied later by the tf-idf vectorizer
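
To illustrate why "no" and "not" are kept, here is a minimal sketch that uses two short, made-up stopword lists as stand-ins for the much longer NLTK list. It calls the analyzer of `TfidfVectorizer` directly to show which tokens survive stopword filtering; the example tweet is also an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini stopword lists for illustration only (not the NLTK list)
stop_with_not = ["this", "is", "not"]
stop_without_not = ["this", "is"]

tweet = "this is not good"

# build_analyzer() returns the function the vectorizer uses to turn text into tokens
tokens_with_not_removed = TfidfVectorizer(stop_words=stop_with_not).build_analyzer()(tweet)
tokens_with_not_kept = TfidfVectorizer(stop_words=stop_without_not).build_analyzer()(tweet)

print(tokens_with_not_removed)  # ['good']
print(tokens_with_not_kept)     # ['not', 'good']
```

With "not" filtered out, the negative tweet is reduced to the single token "good", which is the opposite of its sentiment.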

Removing links, user mentions and some punctuation characters from the tweets using regular expressions

In [7]:
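# raw strings keep escapes such as \S from being treated as (invalid) string escapes
data["text"] = data["text"].apply(lambda x: re.sub(r"http[s]?://\S+", "", x))    # links
data["text"] = data["text"].apply(lambda x: re.sub(r"@\S+", "", x))              # user mentions
data["text"] = data["text"].apply(lambda x: re.sub(r"-|\.|,|'|\?|\!|`|\*", "", x))  # punctuation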
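
The cleaning steps so far can be sketched as one small function applied to a single string. The function name `clean_tweet` and the sample tweet are assumptions for illustration; the punctuation pattern is written as a character class that matches the same characters as the alternation used in the cell above.

```python
import re

# Precompiled patterns, equivalent to the three substitutions above
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\S+")
PUNCT_RE = re.compile(r"[-.,'?!`*]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip links, mentions and selected punctuation."""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    return text


# Hypothetical sample tweet in the style of the dataset
sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
cleaned = clean_tweet(sample)
print(cleaned.split())  # ['awww', 'thats', 'a', 'bummer']
```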

Word stemming  
Stemming did not have the desired effect; it made the model worse, so the stemming line below is commented out.

In [8]:
sstemmer = SnowballStemmer("english")
# pstemmer = PorterStemmer()

# data['text'] = data['text'].apply(lambda x: ' '.join([sstemmer.stem(word) for word in x.split()]))

Tokenize Words (transform sentence into list of words)

In [9]:
# This step takes a few minutes, so run it only once: it saves the result to a CSV file that is read back below.
### comment out these lines after running once:
data["text"] = [str(word_tokenize(entry)) for entry in data["text"]]
data.to_csv('tokenized_data.csv', index=False)
###

data = pd.read_csv('tokenized_data.csv')
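
Instead of commenting lines in and out by hand, the run-once behaviour could be sketched with a file-existence check. The helper `load_or_compute` and the stand-in function `fake_tokenize` are hypothetical names, and the demo writes to a temporary directory rather than to the notebook's working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_compute(cache_path, compute):
    """Return the cached frame if the CSV exists, otherwise compute and cache it."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        # The expensive step runs only when no cache file is present
        compute().to_csv(cache_path, index=False)
    # Always read from disk so the result has the same dtypes on every run
    return pd.read_csv(cache_path)


calls = []


def fake_tokenize():
    # Stand-in for the slow word_tokenize step in the cell above
    calls.append(1)
    return pd.DataFrame({"label": [0, 1], "text": ["['awww', 'thats']", "['no', 'its']"]})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenized_data.csv"
    first = load_or_compute(path, fake_tokenize)   # computes and writes the file
    second = load_or_compute(path, fake_tokenize)  # only reads the file

print(len(calls))  # 1
```

Deleting the cache file forces the tokenization to run again, for example after changing the preprocessing.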

Our finished data:

In [10]:
data.head()

| | label | id | date | flag | user | raw_text | text |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | ['awww', 'thats', 'a', 'bummer', 'you', 'shoul... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | ['is', 'upset', 'that', 'he', 'cant', 'update'... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | ['i', 'dived', 'many', 'times', 'for', 'the', ... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire | ['my', 'whole', 'body', 'feels', 'itchy', 'and... |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... | ['no', 'its', 'not', 'behaving', 'at', 'all', ... |


# Model Training

Split the data into training and test sets and transform the text into TF-IDF vectors

In [11]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(data["text"], data["label"], test_size=0.2, random_state=42)

In [12]:
# define and apply the TF-IDF vectorizer from sklearn
tfidf_vect = TfidfVectorizer(analyzer="word", strip_accents="unicode", stop_words=stop, min_df=10)
data_tfidf = tfidf_vect.fit_transform(data["text"])
x_train_tfidf = tfidf_vect.transform(x_train)
x_test_tfidf = tfidf_vect.transform(x_test)

In [13]:
print("# of Features (Words):", len(tfidf_vect.get_feature_names_out()))

# of Features (Words): 35019
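
Note that the vectorizer above is fitted on all of `data["text"]`, so its vocabulary and IDF weights also depend on the test tweets. A common alternative is to fit on the training split only and reuse the fitted vectorizer for the test split. The following is a minimal sketch on a made-up toy corpus, not a rerun of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only (not from Sentiment140)
train_texts = ["good day", "bad day", "not good"]
test_texts = ["good movie", "terrible day"]

vect = TfidfVectorizer(analyzer="word")
train_tfidf = vect.fit_transform(train_texts)  # learns vocabulary and IDF from the training texts
test_tfidf = vect.transform(test_texts)        # reuses them; unseen words are ignored

vocabulary = sorted(vect.vocabulary_)
print(vocabulary)         # ['bad', 'day', 'good', 'not']
print(train_tfidf.shape)  # (3, 4)
print(test_tfidf.shape)   # (2, 4)
```

The words "movie" and "terrible" appear only in the test texts, so they do not become features.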


Naive Bayes Classifier with cross-validation

In [14]:
NB = naive_bayes.MultinomialNB()
xval = model_selection.cross_validate(NB, data_tfidf, data["label"], cv=10)

# fit model on all data
NB.fit(data_tfidf, data["label"])

# Average Accuracy
print(f"Average Accuracy of X-Val: {round(avg(xval['test_score'])*100, 2)}%")

Average Accuracy of X-Val: 76.3%
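
The cross-validation above runs on features from a vectorizer that was fitted on the full dataset. One way to refit the vectorizer inside every fold is to wrap both steps in a scikit-learn pipeline. This is a minimal sketch on made-up toy data, so its accuracy says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (not from Sentiment140)
texts = [
    "good day", "great movie", "love this", "so happy",
    "bad day", "awful movie", "hate this", "so sad",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted on the training part of each fold
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
results = cross_validate(pipeline, texts, labels, cv=2)
scores = results["test_score"]

print(f"Average accuracy: {scores.mean() * 100:.2f}%")
```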


Support Vector Machine

We did not use GridSearch here, because a single training run takes about 20 minutes on our local computers.  
Some manual hyperparameter tuning was done, and the parameters below gave the best result we could find.  
Naive Bayes also seems to work much better on binary data, so we saw no need to tune the SVM further.
  
We also need to set the `max_iter` parameter so that training does not run indefinitely. You can adjust `cache_size` (kernel cache size in MB) and `max_iter` according to your hardware.  

In [17]:
# Because of the large dataset this can take a while
SVM = svm.SVC(C=0.9, kernel='rbf', gamma='auto', decision_function_shape="ovo", random_state=10, cache_size=2500, max_iter=2500)
SVM.fit(x_train_tfidf, y_train)

predictions_SVM = SVM.predict(x_test_tfidf)
# accuracy_score expects (y_true, y_pred); the value is the same in either order
print(f"SVM Accuracy Score: {round(accuracy_score(y_test, predictions_SVM)*100, 2)}%")



SVM Accuracy Score: 52.93%
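
For reference, a grid search of the kind that was skipped here could be sketched as follows, on a small synthetic dataset so that it finishes quickly. The data and the parameter grid are assumptions for illustration; they are not the data or the values tried in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset (assumption: stands in for a subsample of the TF-IDF features)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid; these values are not the ones tried in this notebook
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}

# Every parameter combination is trained and scored with 3-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_ * 100:.2f}%")
```

This grid has 6 combinations and 3 folds, so it trains 18 models plus one final refit. At roughly 20 minutes per training run on the full dataset, that cost is what made the search impractical.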
