### Twitter Sentiment Analysis

Dataset imported from https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech

### Importing

#### Libraries

In [1]:
import pandas as pd
import numpy as np

#Text representation
from sklearn.feature_extraction.text import TfidfVectorizer

#### Dataset

In [2]:
#Train dataset
df = pd.read_csv('train.csv')

In [3]:
df

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


In [4]:
df.loc[[31957]]['tweet']

31957    ate @user isz that youuu?ðððððð...
Name: tweet, dtype: object

### Preparing data

In [5]:
df.rename(columns={'tweet': 'text'},inplace=True)

In [6]:
df['length'] = df['text'].apply(lambda x: len(x))

In [7]:
df.drop('id',inplace=True,axis=1)

In [8]:
df.head()

Unnamed: 0,label,text,length
0,0,@user when a father is dysfunctional and is s...,102
1,0,@user @user thanks for #lyft credit i can't us...,122
2,0,bihday your majesty,21
3,0,#model i love u take with u all the time in ...,86
4,0,factsguide: society now #motivation,39


In [9]:
df['label'].value_counts() # data is imbalanced

0    29720
1     2242
Name: label, dtype: int64
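
The counts above show roughly 13 negative tweets for every positive one. The following is a rough sketch of how that imbalance translates into per-class weights, assuming the "balanced" heuristic `n_samples / (n_classes * count)` that scikit-learn documents for `class_weight='balanced'`. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 29720, 1: 2242}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: the rarer a class, the larger its weight.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

imbalance_ratio = counts[0] / counts[1]
print(f"imbalance ratio: {imbalance_ratio:.1f} : 1")
print({label: round(weight, 3) for label, weight in class_weights.items()})
```

With weights like these, a mistake on a positive tweet costs a model far more than a mistake on a negative one, which is one common way to counter the imbalance.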

### Text cleaning

I plan to:
- remove stop words (an NLP model tends to work better when it is not distracted by uninformative words)
- convert the text to lower case
- remove numbers
- remove extra white space
- remove punctuation
- tokenize (split the text into words or sentences)
- lemmatize (group the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form)
- stem (reduce a word to its base or root form)
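
Stemming is on the list above, but none of the following cells apply it. A minimal sketch of what that step could look like, assuming NLTK's `PorterStemmer` and plain whitespace tokenization; the helper name `stem_text` is hypothetical and not part of the notebook.

```python
from nltk.stem import PorterStemmer

def stem_text(text):
    """Stem each whitespace-separated token with the Porter algorithm."""
    stemmer = PorterStemmer()
    return " ".join(stemmer.stem(token) for token in text.split())

print(stem_text("listening sad songs monday morning"))
```

Stemming and lemmatization overlap in purpose, so in practice one of the two is usually enough.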

In [10]:
import re

In [11]:
import nltk
nltk.download("stopwords")
nltk.download('punkt') # tokenization of text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\beori\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\beori\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
from nltk import word_tokenize #tokenization by nltk

In [13]:
from nltk.stem import WordNetLemmatizer #lemmatization by nltk

In [14]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\beori\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\beori\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [15]:
df

Unnamed: 0,label,text,length
0,0,@user when a father is dysfunctional and is s...,102
1,0,@user @user thanks for #lyft credit i can't us...,122
2,0,bihday your majesty,21
3,0,#model i love u take with u all the time in ...,86
4,0,factsguide: society now #motivation,39
...,...,...,...
31957,0,ate @user isz that youuu?ðððððð...,68
31958,0,to see nina turner on the airwaves trying to...,131
31959,0,listening to sad songs on a monday morning otw...,63
31960,1,"@user #sikh #temple vandalised in in #calgary,...",67


#### Lower

In [16]:
def convert_to_lower(text):
    return text.lower()

In [17]:
df['text'] = df['text'].apply(lambda x: convert_to_lower(x))

#### Removing numbers

In [18]:
def remove_numbers(text):
    number_pattern = r'\d+'
    without_number = re.sub(pattern=number_pattern, repl=" ", string=text)
    return without_number

In [19]:
df['text'] = df['text'].apply(lambda x: remove_numbers(x))

In [20]:
df

Unnamed: 0,label,text,length
0,0,@user when a father is dysfunctional and is s...,102
1,0,@user @user thanks for #lyft credit i can't us...,122
2,0,bihday your majesty,21
3,0,#model i love u take with u all the time in ...,86
4,0,factsguide: society now #motivation,39
...,...,...,...
31957,0,ate @user isz that youuu?ðððððð...,68
31958,0,to see nina turner on the airwaves trying to...,131
31959,0,listening to sad songs on a monday morning otw...,63
31960,1,"@user #sikh #temple vandalised in in #calgary,...",67


#### Removing non-ascii characters

In [21]:
def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7f]',r'', text) 

In [22]:
df['text'] = df['text'].apply(lambda x: remove_non_ascii(x))

####  Removing punctuation

In [23]:
import string

def remove_punctuation(text):
    # Delete every ASCII punctuation character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))


In [24]:
df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

In [25]:
df

Unnamed: 0,label,text,length
0,0,user when a father is dysfunctional and is so...,102
1,0,user user thanks for lyft credit i cant use ca...,122
2,0,bihday your majesty,21
3,0,model i love u take with u all the time in u...,86
4,0,factsguide society now motivation,39
...,...,...,...
31957,0,ate user isz that youuu,68
31958,0,to see nina turner on the airwaves trying to...,131
31959,0,listening to sad songs on a monday morning otw...,63
31960,1,user sikh temple vandalised in in calgary wso ...,67


#### Removing extra white spaces

In [26]:
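def remove_extra_white_spaces(text):
    # Despite the name, this pattern matches a single letter surrounded by
    # whitespace, so it drops one-letter tokens (e.g. "a", "i", "u") and
    # replaces each match with one space; other whitespace runs are untouched.
    single_char_pattern = r'\s+[a-zA-Z]\s+'
    without_sc = re.sub(pattern=single_char_pattern, repl=" ", string=text)
    return without_sc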

In [27]:
df['text'] = df['text'].apply(lambda x: remove_extra_white_spaces(x))
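
The pattern used above targets isolated single letters. To also collapse runs of spaces, tabs and newlines, a helper along these lines could be added; `collapse_whitespace` is a hypothetical name and only a sketch, not the notebook's own function.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("  model   love  take \t time  "))
```

The later tokenize-and-join steps also normalise spacing as a side effect, so this mainly matters if the text is used before those steps.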

In [28]:
df

Unnamed: 0,label,text,length
0,0,user when father is dysfunctional and is so s...,102
1,0,user user thanks for lyft credit cant use caus...,122
2,0,bihday your majesty,21
3,0,model love take with all the time in ur,86
4,0,factsguide society now motivation,39
...,...,...,...
31957,0,ate user isz that youuu,68
31958,0,to see nina turner on the airwaves trying to...,131
31959,0,listening to sad songs on monday morning otw t...,63
31960,1,user sikh temple vandalised in in calgary wso ...,67


#### Removing stopwords and making tokens

In [29]:
from nltk.corpus import stopwords
stopWords = stopwords.words('english')

In [30]:
def remove_stopwords(text):
    # A set gives constant-time membership checks for each token
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text)
    return " ".join(token for token in tokens if token not in stop_words)

In [31]:
df['text'] = df['text'].apply(lambda x: remove_stopwords(x))

#### Lemmatizing

In [32]:
def lemmatizing(text):
    # Map each token to its WordNet lemma (treated as a noun by default) and rejoin
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(token) for token in tokens)

In [33]:
df['text'] = df['text'].apply(lambda x: lemmatizing(x))

#### Length after cleaning

In [34]:
df['length_after_cleaning'] = df['text'].apply(lambda x: len(x))

In [35]:
df

Unnamed: 0,label,text,length,length_after_cleaning
0,0,user father dysfunctional selfish drag kid dys...,102,58
1,0,user user thanks lyft credit cant use cause do...,122,96
2,0,bihday majesty,21,14
3,0,model love take time ur,86,23
4,0,factsguide society motivation,39,29
...,...,...,...,...
31957,0,ate user isz youuu,68,18
31958,0,see nina turner airwave trying wrap mantle gen...,131,92
31959,0,listening sad song monday morning otw work sad,63,46
31960,1,user sikh temple vandalised calgary wso condem...,67,52


#### TF-IDF Vectorizer

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [37]:
tf_wb= TfidfVectorizer()
X_tf = tf_wb.fit_transform(df['text'])
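
To give a feel for the weights behind the vectorizer, here is a small sketch of the smoothed inverse document frequency, assuming scikit-learn's default settings (`smooth_idf=True`), under which `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The function name `smooth_idf` and the example document frequencies are illustrative.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 31962  # number of tweets in the dataset
for doc_freq in (1, 100, 10000):
    print(doc_freq, round(smooth_idf(n_docs, doc_freq), 3))
```

Rare terms receive a high weight and very common terms a weight close to 1. By default the vectorizer then multiplies by the term frequency and scales each row to unit L2 norm.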

In [38]:
X_tf

<31962x37149 sparse matrix of type '<class 'numpy.float64'>'
	with 247058 stored elements in Compressed Sparse Row format>

#### Converting to a dense matrix

In [44]:
X_tf = X_tf.toarray()
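
Converting to a dense array is expensive for a matrix of this shape. A back-of-the-envelope estimate, using only the shape and the stored-element count reported for `X_tf` above and assuming 8 bytes per `float64` entry:

```python
n_rows, n_cols = 31962, 37149   # shape reported for X_tf above
stored = 247058                 # non-zero entries reported above

density = stored / (n_rows * n_cols)
dense_bytes = n_rows * n_cols * 8   # float64 uses 8 bytes per entry

print(f"density: {density:.6f}")
print(f"dense size: {dense_bytes / 1e9:.1f} GB")
```

Almost every entry is zero, so keeping the matrix sparse where the model allows it saves several gigabytes of memory.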

### Splitting data

In [39]:
from sklearn.model_selection import train_test_split

In [45]:
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(X_tf, df['label'].values, test_size=0.3)

In [46]:
X_train_tf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Implementing models

In [42]:
from sklearn.naive_bayes import GaussianNB

In [47]:
NB = GaussianNB()
NB.fit(X_train_tf, y_train_tf)
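
`GaussianNB` needs the dense array created earlier. As an alternative that is not used in this notebook, `MultinomialNB` accepts the sparse TF-IDF matrix directly. The sketch below runs on a tiny made-up corpus; all `toy_*` names and texts are illustrative assumptions and not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, for illustration only.
toy_texts = [
    "love sunny day",
    "happy friend birthday",
    "hate ugly people",
    "hate stupid people",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)  # stays a SciPy sparse matrix

toy_nb = MultinomialNB()
toy_nb.fit(toy_X, toy_labels)                    # no .toarray() needed

toy_pred = toy_nb.predict(toy_vectorizer.transform(["hate people"]))
print(toy_pred)
```

Whether this variant scores better on the real tweets would have to be checked; the sketch only shows that the dense conversion can be avoided.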

#### Predict

In [48]:
NB_pred= NB.predict(X_test_tf)
print(NB_pred)

[0 0 0 ... 1 0 0]


#### Metrics

In [52]:
from sklearn.metrics import accuracy_score, classification_report

In [50]:
print(accuracy_score(y_test_tf, NB_pred))

0.8382521639378454


In [53]:
print(classification_report(y_test_tf, NB_pred, target_names=['tweet is not racist/sexist', 'tweet is racist/sexist']))

                            precision    recall  f1-score   support

tweet is not racist/sexist       0.96      0.86      0.91      8916
    tweet is racist/sexist       0.24      0.59      0.34       673

                  accuracy                           0.84      9589
                 macro avg       0.60      0.72      0.62      9589
              weighted avg       0.91      0.84      0.87      9589
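
The f1-score column in the report above can be re-derived from the precision and recall columns, since F1 is their harmonic mean. A short check using the rounded values from the table (so the last digit can differ slightly from the unrounded result):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

p_neg, r_neg = 0.96, 0.86   # "tweet is not racist/sexist" row
p_pos, r_pos = 0.24, 0.59   # "tweet is racist/sexist" row

print(round(f1(p_neg, r_neg), 2))
print(round(f1(p_pos, r_pos), 2))
print(round((p_neg + p_pos) / 2, 2))   # macro-averaged precision
```

The low F1 on the positive class is what the 84% accuracy hides: most of that accuracy comes from the much larger negative class.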



#### Making pipelines with several models

In [79]:
# X = df.drop('label',axis=1)
X = df['text']
y = df['label']

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,shuffle=True, stratify=y)

In [103]:
y_test.value_counts()

0    5945
1     448
Name: label, dtype: int64
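
A quick check that `stratify=y` kept the class proportions, using only the counts printed in this notebook (the variable names are illustrative):

```python
full_pos, full_total = 2242, 31962        # from df['label'].value_counts()
test_pos, test_total = 448, 5945 + 448    # from y_test.value_counts()

full_rate = full_pos / full_total
test_rate = test_pos / test_total

print(f"positive share, full data: {full_rate:.4f}")
print(f"positive share, test set:  {test_rate:.4f}")
```

Both shares are close to 7%, so the test set mirrors the imbalance of the full dataset.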

In [82]:
from sklearn.pipeline import Pipeline
# pipeline lets us unite transformer and a model in one block
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics

from sklearn.model_selection import GridSearchCV


In [104]:
sgd_ppl_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('sgd_clf', SGDClassifier(random_state=42))])
knb_ppl_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knb_clf', KNeighborsClassifier(n_neighbors=10))])
sgd_ppl_clf.fit(X_train, y_train)
knb_ppl_clf.fit(X_train, y_train)

##### SGDClassifier

In [105]:
predicted_sgd = sgd_ppl_clf.predict(X_test)
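# Note: classification_report expects (y_true, y_pred). The arguments are
# reversed here, so precision and recall are swapped in the output below.
print(metrics.classification_report(predicted_sgd, y_test))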

              precision    recall  f1-score   support

           0       1.00      0.96      0.98      6205
           1       0.38      0.91      0.54       188

    accuracy                           0.95      6393
   macro avg       0.69      0.93      0.76      6393
weighted avg       0.98      0.95      0.96      6393
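
scikit-learn's `classification_report` takes its arguments as `(y_true, y_pred)`, while the calls in this section pass the predictions first. The effect is that the precision and recall columns trade places and the support column counts predicted labels (6205 + 188 here, rather than the 5945 + 448 shown by `y_test.value_counts()`). A small self-contained sketch on made-up labels illustrates the swap; `precision_recall`, `demo_true` and `demo_pred` are hypothetical names.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp / (tp + fp), tp / (tp + fn)

demo_true = [0, 0, 0, 1, 1]
demo_pred = [0, 0, 1, 1, 1]

print(precision_recall(demo_true, demo_pred))  # arguments in the expected order
print(precision_recall(demo_pred, demo_true))  # arguments reversed
```

Read with this in mind, the 0.91 listed as recall for class 1 in the SGD report is actually the precision of the positive predictions, and the 0.38 is the recall.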



##### K-neighbors classifier

In [85]:
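predicted_knb = knb_ppl_clf.predict(X_test)
# Note: arguments are in (y_pred, y_true) order, so precision and recall are swapped in the output below.
print(metrics.classification_report(predicted_knb, y_test))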

              precision    recall  f1-score   support

           0       1.00      0.94      0.97      6343
           1       0.11      1.00      0.20        50

    accuracy                           0.94      6393
   macro avg       0.55      0.97      0.58      6393
weighted avg       0.99      0.94      0.96      6393



As far as I can see, the linear model shows better results. Let's change some of its parameters.

##### Changing parameters of the best model

In [106]:
sgd_ppl_clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('sgd_clf', SGDClassifier(penalty='elasticnet', class_weight='balanced', random_state=42))])
sgd_ppl_clf.fit(X_train, y_train)
predicted_sgd = sgd_ppl_clf.predict(X_test)
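# Note: arguments are in (y_pred, y_true) order, so precision and recall are swapped in the output below.
print(metrics.classification_report(predicted_sgd, y_test))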

              precision    recall  f1-score   support

           0       0.96      0.98      0.97      5796
           1       0.77      0.58      0.66       597

    accuracy                           0.94      6393
   macro avg       0.86      0.78      0.82      6393
weighted avg       0.94      0.94      0.94      6393
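
`GridSearchCV` is imported above but the parameters were changed by hand. A minimal sketch of how the same pipeline could be tuned automatically, assuming an F1 score on the positive class as the selection metric. The tiny corpus, the `demo_*` names and the parameter grid are illustrative assumptions; in the notebook, `X_train` and `y_train` would take the place of the toy data, with a larger `cv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs on its own.
demo_texts = [
    "love sunny day", "happy friend birthday", "great morning work",
    "thanks lovely gift", "fun weekend family", "good music today",
    "hate ugly people", "hate stupid people", "racist awful comment",
    "disgusting hate speech", "awful racist people", "stupid ugly comment",
]
demo_labels = [0] * 6 + [1] * 6

demo_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('sgd_clf', SGDClassifier(class_weight='balanced', random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'sgd_clf__penalty': ['l2', 'elasticnet'],
}

search = GridSearchCV(demo_pipeline, param_grid, scoring='f1', cv=2)
search.fit(demo_texts, demo_labels)
print(search.best_params_)
```

After fitting, `search.best_estimator_` is a pipeline refitted on all the data passed to `fit`, and can be used for prediction like `sgd_ppl_clf`.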



### Summary

- I cleaned, tokenized, and lemmatized the text. This could be done better, but it is my best attempt for now.
- Then I used TF-IDF to transform the text into vectors.
- Finally, I tried several models; the linear model showed the best results.



### Experiments...

In [54]:
from transformers import AutoTokenizer, AutoModel
import torch



In [55]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
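
To make the masking step concrete, here is the same idea as a NumPy sketch on a hand-made example: one sentence with three token vectors, the last of which is padding. `mean_pooling_np` and the example arrays are illustrative and not part of the notebook.

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)      # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1), avoids 0-division
    return summed / counts

emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])

print(mean_pooling_np(emb, mask))
```

The padded vector does not influence the result; only the two real tokens are averaged.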

In [None]:
df['text'].tolist()

In [62]:
sentences = df['text'].tolist()

In [65]:
#Load AutoModel from huggingface model repository
#Note: this checkpoint is a Russian-language model, while the tweets are in English
tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/sbert_large_nlu_ru")
model = AutoModel.from_pretrained("sberbank-ai/sbert_large_nlu_ru")

Downloading:   0%|          | 0.00/1.71G [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [66]:
#Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')

In [67]:
#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

KeyboardInterrupt: 
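
The run above pushes all tweets through the model in a single forward pass and was interrupted. One common way around this is to process the sentences in small batches. The sketch below shows only the batching logic with a stand-in encoder; `embed_in_batches`, `fake_embed` and the batch size are hypothetical, and in the notebook the stand-in would be replaced by a function that tokenizes one batch, runs the model and applies `mean_pooling`.

```python
def embed_in_batches(sentences, embed_fn, batch_size=64):
    """Call embed_fn on consecutive slices and collect one result per sentence."""
    results = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        results.extend(embed_fn(batch))
    return results

def fake_embed(batch):
    # Stand-in for the real encoder: one 1-dimensional "embedding" per sentence.
    return [[float(len(sentence))] for sentence in batch]

demo = embed_in_batches(
    ["ate user isz youuu", "bihday majesty", "model love take time ur"],
    fake_embed,
    batch_size=2,
)
print(demo)
```

Smaller batches keep peak memory bounded and make it possible to report progress or stop and resume between batches.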