<h1>Sentiment Analysis</h1>

This notebook demonstrates how a text vectorizer can turn a corpus of text into a matrix of numeric values that can then be used in regular machine learning applications. This way a model can be trained to predict a label from a piece of text. At the end of the notebook, test strings are used as input and the model returns a probability for each of the possible labels.

The dataset used in this notebook originates from [Kaggle](https://www.kaggle.com/datasets/tariqsays/sentiment-dataset-with-1-million-tweets). A random sample of 100,000 observations was taken from the English tweets labelled negative, uncertainty or positive, so tweets labelled litigious and tweets in other languages are excluded.

Bas Michielsen MSc 2023

In [1]:
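import pandas
import sklearn

# Show up to 200 characters per text column and two decimals for floats.
pandas.set_option("display.max_colwidth", 200)
pandas.set_option("display.float_format", '{:.2f}'.format)
random_state = 42  # fixed seed so that sampling and splitting are reproducible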

# 📃 Sample the data
A random sample of 25 observations is taken from the dataset.

In [2]:
df = pandas.read_csv("./data.zip")
df.sample(25)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Language,Label
36830,354122,480396,"“I see— so. . . It’s as if you were sharing? . . Your soul I mean?” \n\nWide, curious eyes only seem to get bigger as the cat asks questions. \n\nHe’s. So. Interesting ! https://t.co/DtrIf3ajbI",en,negative
63516,315895,428832,@JDMills19 Awwww truth hurts poor baby...,en,negative
17816,78248,106131,"Tfw Tool knew you can guide ppl with art so they created a hive mind ""by accident""",en,negative
52258,306650,416213,#OpportunityAlert: We’re looking for new Board members with leadership and business management skills to oversee the next phase of our development. If you think you can help improve people’s lives...,en,positive
17662,373308,506382,"@AP_Politics Competition? You're tripping! The Republicans cut healthcare for children. The CHIP program saved so many children and made things a little easier for single parents, like myself, who...",en,positive
66444,403954,547911,@PansLavender Maybe you do?,en,uncertainty
57653,327199,444101,"@bolarinwa___ Go for Dembele now, oh your club is poor and can't attract quality players.",en,negative
57224,472779,641309,"@atinyseongstar I got ateez audience 🥹 I’m so excited to see them, but still a little sad at how fast m&amp;g went 🥲🖤",en,positive
29452,590727,801392,"ya'll be out here looking fine, smelling amazing, dressed in incredible clothing... with a nigga in a pair of pacsun skinny jeans, some jordan 11's &amp; a graphic t-shirt from zumies. that's so u...",en,positive
66524,45318,61463,Now this is a take that is not spoken of enough. Too may people focusing on the wrong things. https://t.co/uioR8yoAwt,en,negative


# Preprocessing
## 🆔 Encoding

Here the labels are mapped to integers. Because `uncertainty` is the neutral value, it is mapped to `0`; `positive` then becomes `+1` and `negative` becomes `-1`. The result is stored in a new `Target` column.

In [3]:
label_map = {"negative": -1, "uncertainty": 0, "positive": 1}
df["Target"] = df["Label"].map(label_map)
df.sample(10)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Language,Label,Target
72331,18279,24762,@gidimide I watch you sleep everytime so what’s bad there?,en,negative,-1
92383,268673,364669,Sonya Pitts stole me from somewhere the he,en,uncertainty,0
26704,233678,316964,@belmontfraud FR?? I MISS BULLYING YOU SO BAD PLEASE DONT GET SUSPENDED BROTHER,en,negative,-1
34012,620654,842114,@AlboMP @Trish_Corry His business case is based on kicking the can down the road so fossil fuel can rip as much as possible out of the economy for as long as possible. That’s why they (and you) ge...,en,uncertainty,0
84730,113126,153542,@NickRadoux @mich_yao @JoelLovelock Maybe the were trying to guess yours,en,uncertainty,0
10576,317248,430648,"@JessieJadeCooks It is wild how a company with infinite resources and man power can fail to make the streaming system as good as Twitch, after years of trying to improve it",en,positive,1
69422,206328,279797,@CassiusGren There's that line where Whitebeard almost disagrees when she says she heard Arthur was the best. Maybe there was some jealousy and the Kingwood was part of it.,en,uncertainty,0
31423,234960,318739,@CS6543 @pmdfoster But the EU’s farm to fork policy means that there has to be a clear trail. The UK is no longer part of that same system. Ofc they could sell it illegally in the EU but given the...,en,positive,1
29419,176225,238942,"@Crazybabe11 Absolutely! He saved 6 people with his organs so his heart is still beating somewhere! Yeah, the social media is definitely part of the problem these days, always need some reality ch...",en,uncertainty,0
90348,51263,69527,@dh4onethingonly I'll suck up to the Indian based mobs.....at least with them I get another go.....Christianity seems a bit too focused on telling where I went wrong.........,en,negative,-1
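
One detail of `Series.map` with a dictionary is worth keeping in mind: a label that is not a key of the dictionary becomes `NaN` instead of raising an error. The sketch below illustrates this on a small made-up frame; the name `toy` and its rows are assumptions for illustration and are not taken from the dataset.

```python
import pandas

label_map = {"negative": -1, "uncertainty": 0, "positive": 1}

# Hypothetical toy data; "litigious" is deliberately absent from label_map.
toy = pandas.DataFrame({"Label": ["negative", "positive", "uncertainty", "litigious"]})
toy["Target"] = toy["Label"].map(label_map)

# Labels that are not keys of label_map are mapped to NaN.
unmapped = toy.loc[toy["Target"].isna(), "Label"].tolist()
print(unmapped)
```

A check like this can be run after the mapping to confirm that no observation was left without a target.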


To reduce training time, a sample of 5000 observations is used and the vectorizer is limited to a maximum of 100 words. Both values can be increased at the cost of training time, which may improve the outcome quality; in this case, however, even dramatic increases seemed to yield only insignificant improvements. The `min_df` value of `.01` additionally drops words that occur in fewer than 1% of the observations.

The vectorizer then turns the corpus of texts (5000 observations) into numeric representations of the 100 most frequent words, excluding the stop words of the English language. For every observation it gives a `0` for any word that is not present in the observation, and a higher TF-IDF value for a word that is. The expected shape of the output is therefore 5000 rows by 100 columns.

In [4]:
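from sklearn.feature_extraction.text import TfidfVectorizer

sample_size = 5000  # number of observations kept for training and testing
max_words = 100     # maximum vocabulary size of the vectorizer
min_df = .01        # ignore words found in fewer than 1% of the observations

df = df.sample(sample_size, random_state=random_state)
vectorizer = TfidfVectorizer(max_features=max_words, min_df=min_df, stop_words="english")
# Dense array with one row per observation and one column per kept word.
X_vectorized = vectorizer.fit_transform(df["Text"]).toarray()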
X_vectorized.shape

(5000, 100)

The vectorizer output no longer contains the original text, so it is added back here as an extra column. This way the same data can also be used by humans to evaluate the test predictions.

In [5]:
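X = pandas.DataFrame(X_vectorized)
# Columns 0 to max_words - 1 hold the TF-IDF values; the extra column named max_words holds the text.
X[max_words] = df["Text"].values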
y = df["Target"]
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,0.00,0.00,0.00,0.00,0.00,0.28,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,4/ Product Risk is the risk that your founding insight will not be powerful enough for you to achieve product-market fit.\n\nThe best market-risk companies have evidence that there will be demand ...
1,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,People are not having Good nutritional food in school and old age centers \n\nAgencies take Bottled water and throw in Airport \n\nThanks to Rules\n\nCooked food is thrown away as Trash each night...
2,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,@fletch_biggsss Fletcher u probably still suck dick for Xanax
3,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@WlTCHOFSCARLET ""Your right. That wouldn't be a good idea."""
4,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@elgoonishshive @shadowraptor51 99.999% of all the drama and problems in the Herald's books are the direct result evil acts by evil people for the sake of power, almost nothing (with a very few ve..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,@PACinTX @DNC @SpeakerPelosi @SenSchumer @RepAdamSchiff @KamalaHarris @DickDurbin @TeamPelosi @JoeBiden @tedlieu @CNN Which is why the president is probably promoting Johnson… Because ...
4996,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,@Zakiyyah6 I almost made the tomb raider uncharted comparison but went with assassins creed and witcher because i am playing witcher 3. Also take out the swinging and spiderman PS4 is basically a ...
4997,0.00,0.00,0.00,0.45,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@atinyseongstar I got ateez audience 🥹 I’m so excited to see them, but still a little sad at how fast m&amp;g went 🥲🖤"
4998,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,"@PennieRoyalTea @StandingforXX @RepanseDe Nope. But keep trying to pigeon-hole me if it helps you. I don't mind replying.\n\nHere's a question for you, since I'm answering all yours. What is a tra..."


## 🪓 Splitting into train/test

Here the dataset is split into a train set (80%) and a test set (20%). The original text is then removed from the train set again, as it is not a numeric feature and cannot be used in training.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=random_state)
X_train = X_train.drop([max_words], axis=1)
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
4227,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4676,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
800,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3671,0.00,0.40,0.00,0.00,0.00,0.00,0.00,0.48,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4193,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4426,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
466,0.00,0.00,0.00,0.00,0.66,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3092,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3772,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.00,0.70,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
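
As an alternative to attaching the text as an extra column and dropping it afterwards, `train_test_split` accepts any number of arrays and splits them all with the same indices. The sketch below is not what this notebook does; it uses made-up stand-ins (`features`, `targets`, `texts`) to show that the rows stay aligned.

```python
import numpy
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the vectorized features, the targets and the raw texts.
features = numpy.arange(20).reshape(10, 2)
targets = numpy.array([-1, 0, 1, -1, 0, 1, -1, 0, 1, -1])
texts = numpy.array([f"tweet {i}" for i in range(10)])

# All arrays passed in one call are split with the same shuffled indices,
# so row k of each test part still refers to the same observation.
(features_train, features_test,
 targets_train, targets_test,
 texts_train, texts_test) = train_test_split(
    features, targets, texts, test_size=0.2, random_state=42
)

print(features_test.shape, texts_test)
```

With this approach the feature matrix stays purely numeric, while the texts remain available for inspecting the test predictions.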


# Modelling
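The target variable is a scale going from negative to positive, but it is expressed as classes; for that reason the sigmoid kernel was judged likely to be well suited for this problem. Setting the `class_weight` hyperparameter to `"balanced"` weights each class inversely proportional to its frequency in the training data, so that less frequent classes are not neglected.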

In [7]:
from sklearn.svm import SVC
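model = SVC(kernel="sigmoid", probability=True, class_weight="balanced")
model.fit(X_train, y_train)
# Keep the original messages aside, then drop the text column so X_test is numeric only.
X_test_messages = X_test[max_words]
X_test = X_test.drop([max_words], axis=1)
# Mean accuracy on the held-out test set.
score = model.score(X_test, y_test)
print("Accuracy:", score)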

Accuracy: 0.934
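
To see what `class_weight="balanced"` amounts to, the weights can be computed separately with scikit-learn's `compute_class_weight`. The sketch below rebuilds a target vector from the train-set class counts that appear in the classification report of this notebook (1451, 1116 and 1433); the names `counts` and `y_counts` are assumptions for illustration.

```python
import numpy
from sklearn.utils.class_weight import compute_class_weight

# Train-set class counts as reported in the classification report of this notebook.
counts = {-1: 1451, 0: 1116, 1: 1433}
y_counts = numpy.repeat(list(counts.keys()), list(counts.values()))

# "balanced" weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=numpy.array([-1, 0, 1]), y=y_counts)
print(dict(zip(counts.keys(), weights.round(3))))
```

The least frequent class (`uncertainty`) receives the largest weight, so errors on it count slightly more during training.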


# Evaluation
Now a prediction is given for every observation in the test set, together with the true label and the original text. For brevity, only a random sample of 50 is displayed.

In [8]:
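pred = model.predict_proba(X_test)
# predict_proba orders its columns like model.classes_ (-1, 0, 1), which matches label_map.
predictions = pandas.DataFrame(pred, columns=list(label_map.keys()))
# Translate the integer targets back to their label names.
predictions["truth"] = y_test.map({v: k for k, v in label_map.items()}).values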
predictions["text"] = X_test_messages.values
predictions.sample(50)

Unnamed: 0,negative,uncertainty,positive,truth,text
254,0.08,0.8,0.11,uncertainty,@Him_HerSports Oh I almost forgot... The Zebras need a sky judge
557,0.0,0.05,0.95,positive,"@uscreentv @finrobinson @mikeeschwartz Start thinking systems throughout every aspect of the biz - your people, your customers, your marketing, your content - how can you make all of that easier? ..."
660,1.0,0.0,0.0,negative,QUESTION 🥹
964,0.01,0.99,0.0,uncertainty,Probably the most underrated superstar of all time. Fuck Scott Stevens. https://t.co/E0zmpBiQyO
221,0.07,0.9,0.04,uncertainty,literally EVERYONE calls when my mind is somewhere else
943,0.07,0.87,0.06,negative,"@Chris2_0_0_9 @BenABunchofNums @ChristinaPushaw @wrong_speak @KJTorrance @nypost Then call him an asshole, not a racist insult. Doing that tells everyone he is not better than any other racist."
753,0.0,0.99,0.0,uncertainty,@digitaldutta @SaketGokhale He needs all possible support. That's upto us. We owe it to him for exposing many facts and showing a legal way forward
629,0.94,0.05,0.01,negative,@slefander @Highfieldoval @K200494 @TheBishF1 @LewisHamilton Not sure why you are giving him the befit of the doubt. My original question is still unanswered. Any other centimillionaire claiming...
441,0.21,0.75,0.04,uncertainty,"@ScottDworkin , man!Don't you know what planet the Tramp Defense and Bottle Washing Team is on??Of course, on a*flat earth,somewhere!\n\n@texson6886 @DesignationSix @american2084 @RabbiJill @Macle..."
488,1.0,0.0,0.0,negative,@SeanChislom20 The problem with the LIV tour is it’s literally funded by the Saudi Government. So the golfers that act like this isn’t political are just wrong because it’s backed and funded by th...
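
A table like this becomes easier to scan when the most probable label is put next to the truth. The sketch below does that with `idxmax` on three rows copied from the output above (254, 557 and 943); the `predicted` and `correct` columns are additions for illustration and are not part of the notebook.

```python
import pandas

# Three rows copied from the prediction table above.
predictions = pandas.DataFrame(
    {
        "negative": [0.08, 0.00, 0.07],
        "uncertainty": [0.80, 0.05, 0.87],
        "positive": [0.11, 0.95, 0.06],
        "truth": ["uncertainty", "positive", "negative"],
    },
    index=[254, 557, 943],
)

label_columns = ["negative", "uncertainty", "positive"]
# The predicted label is the column holding the highest probability.
predictions["predicted"] = predictions[label_columns].idxmax(axis=1)
predictions["correct"] = predictions["predicted"] == predictions["truth"]
print(predictions[["truth", "predicted", "correct"]])
```

Note that for `SVC` the probabilities come from a separate calibration step, so the label with the highest probability may occasionally differ from what `model.predict` returns.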


A classification report gives the precision and recall for each class. The report can be run on the test set as well as on the train set. If the two outcomes differ strongly, the model may be overfitted. Here the outcomes are rather similar, so the model is likely to generalize to unseen data.

In [9]:
from sklearn.metrics import classification_report

pred = model.predict(X_train)
report = classification_report(y_train, pred, target_names=label_map.keys())
print("Train set")
print(report)

pred = model.predict(X_test)
report = classification_report(y_test, pred, target_names=label_map.keys())
print("Test set")
print(report)

Train set
              precision    recall  f1-score   support

    negative       0.94      0.94      0.94      1451
 uncertainty       0.87      0.86      0.87      1116
    positive       0.93      0.94      0.93      1433

    accuracy                           0.92      4000
   macro avg       0.91      0.91      0.91      4000
weighted avg       0.92      0.92      0.92      4000

Test set
              precision    recall  f1-score   support

    negative       0.97      0.95      0.96       335
 uncertainty       0.89      0.91      0.90       273
    positive       0.94      0.94      0.94       392

    accuracy                           0.93      1000
   macro avg       0.93      0.93      0.93      1000
weighted avg       0.93      0.93      0.93      1000
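
For readers who want to see where the precision and recall columns come from, here is a minimal sketch on six made-up predictions. The labels in `y_true` and `y_pred` are hypothetical and only use the same encoding as this notebook; precision is the share of predictions for a class that are correct, and recall is the share of true members of a class that are found.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels, using the notebook's encoding (-1, 0, 1).
y_true = [-1, -1, 0, 0, 1, 1]
y_pred = [-1, 0, 0, 0, 1, -1]
labels = [-1, 0, 1]

# average=None returns one score per class, in the order given by labels.
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

for label, p, r in zip(labels, precision, recall):
    print(f"class {label:>2}: precision={p:.2f} recall={r:.2f}")
```

For class `0`, three observations are predicted as `0` and two of them are right (precision 2/3), while both true `0` observations are found (recall 1).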



# Inference

In [10]:
message = "the broken car is useless"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,1.0,0.0,0.0


In [11]:
message = "the sun shines and everything is good"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,0.0,0.0,1.0


In [12]:
message = "anything may happen at any given moment"
message_vectorized = vectorizer.transform([message]).toarray()
inference = model.predict_proba(message_vectorized)
result = pandas.DataFrame(inference, columns=label_map.keys())
result

Unnamed: 0,negative,uncertainty,positive
0,0.07,0.9,0.04
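
The three inference cells above repeat the same steps, which could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `predict_sentiment` and the tiny made-up training set (used only so the example runs on its own) are assumptions and not part of the notebook.

```python
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

label_map = {"negative": -1, "uncertainty": 0, "positive": 1}


def predict_sentiment(message, vectorizer, model):
    """Return one row of class probabilities for a single message."""
    vectorized = vectorizer.transform([message]).toarray()
    probabilities = model.predict_proba(vectorized)
    # model.classes_ is sorted (-1, 0, 1), which matches the order of label_map.
    return pandas.DataFrame(probabilities, columns=list(label_map.keys()))


# Hypothetical toy training data so that the sketch is self-contained.
toy_texts = (["bad awful broken"] * 10
             + ["maybe perhaps possibly"] * 10
             + ["good great nice"] * 10)
toy_targets = [-1] * 10 + [0] * 10 + [1] * 10

toy_vectorizer = TfidfVectorizer()
toy_model = SVC(kernel="sigmoid", probability=True, class_weight="balanced", random_state=42)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts).toarray(), toy_targets)

result = predict_sentiment("the car is broken", toy_vectorizer, toy_model)
print(result)
```

In the notebook itself the helper would be called with the fitted `vectorizer` and `model` from the earlier cells instead of the toy objects.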
