
# COVID-19 Tweet Sentiment Classification

This notebook classifies COVID-19 related tweets into five sentiment classes. The steps are:
1. Data Loading
2. Preprocessing (cleaning, mentions, hashtags, compound words, lemmatization)
3. Train/Test Split
4. Label Encoding
5. TF-IDF Vectorization
6. Oversampling, Dimensionality Reduction, and Model Comparison (Logistic Regression, Naive Bayes, Linear SVC)
7. Hyperparameter Tuning with Grid Search
8. Retraining the Final Logistic Regression Model and Evaluating It on the Test Set

In [1]:
pip install -U scikit-learn imbalanced-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (11 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.14.0-py3-none-any.whl.metadata (8.8 kB)
Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
Downloading imbalanced_learn-0.14.0-py3-none-any.whl (239 kB)
Installing collected packages: scikit-learn, imbalanced-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
  Attempting uninstall: imbalanced-learn
    Found existing installation: imbalanced-learn 0.13.0
    

In [2]:
# Import required libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

## 1. Load Dataset

In [3]:
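# Load CSV dataset
df = pd.read_csv('/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding='latin1')

# Select the last two columns (OriginalTweet, Sentiment); copy so that adding
# a column later does not trigger pandas' SettingWithCopyWarning
text_data = df.iloc[:, -2:].copy()
text_data.head(8)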

| | OriginalTweet | Sentiment |
|---|---|---|
| 0 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | advice Talk to your neighbours family to excha... | Positive |
| 2 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | My food stock is not the only one which is emp... | Positive |
| 4 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
| 5 | As news of the regionÂs first confirmed COVID... | Positive |
| 6 | Cashier at grocery store was sharing his insig... | Positive |
| 7 | Was at the supermarket today. Didn't buy toile... | Neutral |


In [4]:
# Checking imbalance
print(text_data['Sentiment'].value_counts())

Sentiment
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: count, dtype: int64
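
The counts above show a moderate class imbalance. As a rough illustration of what `class_weight='balanced'` does with such counts, the sketch below applies the heuristic scikit-learn documents for that option, `n_samples / (n_classes * class_count)`. It uses the full-dataset counts printed above; the models later in the notebook are fit on the training split only, so their actual weights may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Positive": 11422,
    "Negative": 9917,
    "Neutral": 7713,
    "Extremely Positive": 6624,
    "Extremely Negative": 5481,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Heuristic behind class_weight='balanced':
# weight = n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Print classes from most to least frequent (lowest to highest weight).
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    share = counts[label] / n_samples
    print(f"{label:<20} share={share:.3f} weight={w:.3f}")
```

Rarer classes receive weights above 1 and frequent classes weights below 1, so each class contributes roughly equally to the loss.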


## 2. Text Preprocessing

In [5]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# List of known keywords to help split compound words in hashtags or concatenated words
KEYWORDS = ['coronavirus', 'vaccine', 'lockdown', 'outbreak', 'airline', 'webcheckin']

# Function to split compound words using known keywords
def split_compound_words(text):
    for kw in KEYWORDS:
        text = re.sub(f'({kw})([a-z]+)', r'\1 \2', text)
    return text

def text_preprocessing(text):
    # Convert all text to lowercase to standardize words
    text = text.lower()

    # Remove words followed by a colon
    text = re.sub(r'\w+:','', text)
    # Remove hashtags symbol (#) but keep the word following it
    text = re.sub(r'#(\w+)', r'\1', text)
    # Split known compound words in hashtags or concatenated words (like 'coronavirusoutbreak')
    text = split_compound_words(text)
    # Replace all mentions (@username) with a placeholder 'MENTION'
    text = re.sub(r'@\w+', 'MENTION', text)
    # Remove URLs and links from the text
    text = re.sub(r'https\S+|www\S+|\/\/t\.co/\S+', '', text)
    # Remove any text inside parentheses
    text = re.sub(r'\([^)]*\)','', text)
    # Replace all numbers with a placeholder '<NUM>'
    text = re.sub(r'\d+', ' <NUM> ', text)
    # Remove punctuation characters to simplify text
    text = re.sub(r'[.,!?;:&$|=]', ' ', text)
    # Replace multiple spaces with a single space and remove leading/trailing spaces
    text = re.sub(r'\s+', ' ', text).strip()


    # Lemmatize each token
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    text = ' '.join(tokens)
    
    return text

# Apply preprocessing
text_data['CleanedTweet'] = text_data['OriginalTweet'].apply(text_preprocessing)
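
Two details of the regex steps above are easy to miss, and the sketch below illustrates them with the standard library only. First, the colon rule runs before URL removal, so `https://t.co/...` has already lost its `https:` prefix by the time the URL pattern is applied; the `\/\/t\.co/\S+` alternative is what catches the remainder. Second, the keyword splitter also splits plurals such as `vaccines`. The function `split_compound_words_strict` and its `min_tail` parameter are hypothetical and are not part of the notebook; they only show one possible way to leave short suffixes attached.

```python
import re

KEYWORDS = ['coronavirus', 'vaccine', 'lockdown', 'outbreak', 'airline', 'webcheckin']

def split_compound_words(text):
    # Same logic as the notebook's helper.
    for kw in KEYWORDS:
        text = re.sub(f'({kw})([a-z]+)', r'\1 \2', text)
    return text

def split_compound_words_strict(text, min_tail=3):
    # Hypothetical variant: split only when at least `min_tail` letters follow
    # the keyword, so plural endings such as "s" stay attached.
    for kw in KEYWORDS:
        text = re.sub(rf'({re.escape(kw)})([a-z]{{{min_tail},}})', r'\1 \2', text)
    return text

# 1. Order of operations: the colon rule strips "https:" first ...
after_colon = re.sub(r'\w+:', '', 'see https://t.co/abc123')
# ... and the "//t.co/..." alternative removes what is left of the link.
after_url = re.sub(r'https\S+|www\S+|\/\/t\.co/\S+', '', after_colon)
print(repr(after_colon))  # 'see //t.co/abc123'
print(repr(after_url))    # 'see '

# 2. Compound splitting and the plural side effect.
print(split_compound_words('coronavirusoutbreak'))         # coronavirus outbreak
print(split_compound_words('vaccines'))                    # vaccine s
print(split_compound_words_strict('vaccines'))             # vaccines
print(split_compound_words_strict('coronavirusoutbreak'))  # coronavirus outbreak
```

Switching to a stricter splitter would change the cleaned text and therefore the scores reported below, so it would need to be re-evaluated rather than assumed to help.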

## 3. Train/Test Split

In [6]:
X = text_data['CleanedTweet']
y = text_data['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
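
Because the split uses `stratify=y`, each class should make up about the same share of the test set as of the full dataset. A quick sanity check, using only numbers printed elsewhere in this notebook (the class counts from `value_counts()` and the `support` column of the classification reports), might look like this:

```python
# Full-dataset class counts (from value_counts()) and test-set supports
# (from the classification reports), both copied from the notebook output.
counts = {
    "Extremely Negative": 5481,
    "Extremely Positive": 6624,
    "Negative": 9917,
    "Neutral": 7713,
    "Positive": 11422,
}
reported_support = {
    "Extremely Negative": 1096,
    "Extremely Positive": 1325,
    "Negative": 1983,
    "Neutral": 1543,
    "Positive": 2285,
}
test_size = 0.2

# With stratification, each test support should be close to 20% of the class count.
for label, n in counts.items():
    expected = n * test_size
    print(f"{label:<20} expected={expected:7.1f} reported={reported_support[label]}")

max_gap = max(abs(counts[l] * test_size - reported_support[l]) for l in counts)
print("Largest deviation (tweets):", round(max_gap, 1))
```

Every class lands within one tweet of its expected 20% share, which is what a stratified split is meant to achieve.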

## 4. Label Encoding

In [7]:
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
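
`LabelEncoder` assigns integer codes to the sorted class names, which is why the classification reports below list the classes alphabetically. The pure-Python sketch below mimics that mapping for the five sentiment labels; it is an illustration of the behaviour, not the library's implementation.

```python
labels = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]

# Sorted unique labels define the integer codes, as LabelEncoder.classes_ would.
classes = sorted(set(labels))
to_id = {name: i for i, name in enumerate(classes)}

for name, idx in to_id.items():
    print(idx, name)

encoded = [to_id[name] for name in labels]
print(encoded)  # [3, 4, 0, 2, 1]
```

The codes follow alphabetical order, not sentiment intensity (for example, `Extremely Positive` is 1 and `Negative` is 2), so they should be treated as arbitrary class identifiers rather than an ordinal scale.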

## 5. TF-IDF Vectorization

In [8]:
tfidf = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1,2),
    max_features=3000,
    min_df=3,
    max_df=0.85
)


X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)
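
To make the weights produced by `TfidfVectorizer` less opaque, here is a hand-rolled sketch of its default weighting: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three-document corpus is made up, and the sketch ignores the stop-word removal, bigrams, `max_features`, `min_df`, and `max_df` settings used above.

```python
import math
from collections import Counter

# Tiny made-up corpus, tokenised by whitespace for simplicity.
docs = [
    "panic buying empties supermarket shelves",
    "supermarket restocks shelves overnight",
    "online shopping grows during lockdown",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    tf = Counter(tokens)
    raw = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {t: v / norm for t, v in raw.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:<12} {weight:.3f}")
```

Terms that occur in only one document (`panic`, `buying`, `empties`) end up with higher weights than terms shared across documents (`supermarket`, `shelves`).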

### Optional: Inspect features of a sample tweet

In [9]:
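feature_names = tfidf.get_feature_names_out()
# Dense TF-IDF vector of the training tweet at position 3
sample_row = X_train_vec[3].toarray()[0]
non_zero_indices = sample_row.nonzero()[0]

# Print each term present in the tweet together with its TF-IDF weight
for idx in non_zero_indices:
    print(feature_names[idx], sample_row[idx])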

cost 0.23546308923109677
insight 0.25881675186755987
instead 0.24522822568529665
inventory 0.6129768078089474
management 0.2890562191177521
mention 0.1008789834113556
minimize 0.31965466407992127
need 0.2967742912663483
retail 0.18051833428537906
risk 0.20534966872051374
supplychain 0.2880174754972808


## 6. Oversampling, Dimensionality Reduction, and Model Comparison

In [10]:
from imblearn.over_sampling import RandomOverSampler

# Oversample the minority classes in the training set only; the test set is left untouched
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train_vec, y_train_enc)
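
Conceptually, random oversampling duplicates randomly chosen examples of the smaller classes until every class is as large as the largest one. The pure-Python sketch below illustrates that idea on made-up data; `random_oversample` is a hypothetical helper written for this illustration and is not how `imbalanced-learn` implements `RandomOverSampler`.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate random members of each class until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing examples with replacement from the existing ones.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Made-up imbalanced data: 5 / 3 / 1 examples per class.
rows = [f"tweet_{i}" for i in range(9)]
labels = ["Positive"] * 5 + ["Neutral"] * 3 + ["Extremely Negative"] * 1

res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

No new information is created, only duplicates. Note also that once all classes have the same size, the `n_samples / (n_classes * class_count)` heuristic behind `class_weight='balanced'` gives every class a weight of 1, so combining that option with oversampled data adds nothing further.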

In [11]:
from sklearn.decomposition import TruncatedSVD

# Applying TruncatedSVD for dimensionality reduction
svd = TruncatedSVD(n_components=400, random_state=42)
X_train_vec_svd = svd.fit_transform(X_train_res)
X_test_vec_svd = svd.transform(X_test_vec)

In [12]:
from sklearn.metrics import balanced_accuracy_score

models = {
    'LogisticRegression': LogisticRegression(C=0.5, max_iter=1000, class_weight='balanced', solver='lbfgs', random_state=42),
    'Naive Bayes': MultinomialNB(),
    'Linear SVC': LinearSVC(C=0.5, class_weight='balanced', random_state=42)
}

for label, model in models.items():
    print(f"\nTraining Model With {label}....")

    if label == "Naive Bayes":
        # MultinomialNB requires non-negative features and SVD components can be
        # negative, so it is trained on the original (non-resampled) TF-IDF matrix
        model.fit(X_train_vec, y_train_enc)
        y_pred = model.predict(X_test_vec)
    else:
        # Logistic Regression and Linear SVC use the oversampled, SVD-reduced features
        model.fit(X_train_vec_svd, y_train_res)
        y_pred = model.predict(X_test_vec_svd)

    print(f"\nResults for {label}:")
    print(classification_report(y_test_enc, y_pred, target_names=le.classes_))
    print(f"Balanced Accuracy ({label}):", balanced_accuracy_score(y_test_enc, y_pred))




Training Model With LogisticRegression....

Results for LogisticRegression:
                    precision    recall  f1-score   support

Extremely Negative       0.44      0.62      0.52      1096
Extremely Positive       0.50      0.62      0.55      1325
          Negative       0.42      0.26      0.32      1983
           Neutral       0.45      0.73      0.55      1543
          Positive       0.49      0.28      0.35      2285

          accuracy                           0.46      8232
         macro avg       0.46      0.50      0.46      8232
      weighted avg       0.46      0.46      0.44      8232

Balanced Accuracy (LogisticRegression): 0.5011910979937176

Training Model With Naive Bayes....

Results for Naive Bayes:
                    precision    recall  f1-score   support

Extremely Negative       0.69      0.23      0.34      1096
Extremely Positive       0.68      0.28      0.40      1325
          Negative       0.42      0.50      0.46      1983
           Neutra

## 7. Hyperparameter Tuning with Grid Search

In [13]:
from sklearn.model_selection import GridSearchCV

# Tune C on the original TF-IDF features (no oversampling, no SVD),
# scoring each candidate by 3-fold cross-validated balanced accuracy
param_grid = {'C': [0.5, 1, 2, 5]}
grid = GridSearchCV(LogisticRegression(max_iter=2000, class_weight='balanced', random_state=42),
                    param_grid, cv=3, scoring='balanced_accuracy', n_jobs=-1)
grid.fit(X_train_vec, y_train_enc)
print("Best Params:", grid.best_params_)
print("Best Score:", grid.best_score_)


Best Params: {'C': 5}
Best Score: 0.565425290638423
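
Two observations on this search may be worth following up. The grid is run on a matrix that was vectorized with the whole training set, so the validation part of each fold has already influenced the vocabulary and IDF values. Also, the best value `C=5` sits at the upper edge of the grid, so larger values might score higher still. The sketch below shows one way to address both, assuming a `Pipeline` that refits the vectorizer inside each fold and a grid extended with the arbitrary values 10 and 20. The toy corpus is made up purely so the snippet runs on its own; its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the sketch self-contained.
texts = [
    "great news stores are restocked", "so happy with the fast delivery",
    "love how neighbours help each other", "thankful for supermarket staff",
    "terrible panic buying everywhere", "empty shelves again so angry",
    "prices are awful and unfair", "worried and scared about shortages",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# The vectorizer is part of the estimator, so it is refit on the
# training portion of every cross-validation fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced", random_state=42)),
])

# Step-prefixed parameter name; 10 and 20 are assumed extensions of the grid.
param_grid = {"clf__C": [0.5, 1, 2, 5, 10, 20]}

grid = GridSearchCV(pipe, param_grid, cv=2, scoring="balanced_accuracy")
grid.fit(texts, labels)
print("Best Params:", grid.best_params_)
```

In the notebook itself, the same pattern would take the cleaned training tweets (`X_train`) instead of the pre-vectorized matrix, with the `TfidfVectorizer` settings from step 5.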


## 8. Retrain Final Model with C=5

In [14]:
final_model = LogisticRegression(C=5, max_iter=2000, class_weight='balanced', random_state=42)
final_model.fit(X_train_vec, y_train_enc)
y_pred = final_model.predict(X_test_vec)

print(classification_report(y_test_enc, y_pred, target_names=le.classes_))
print("Balanced Accuracy:", balanced_accuracy_score(y_test_enc, y_pred))


                    precision    recall  f1-score   support

Extremely Negative       0.54      0.68      0.60      1096
Extremely Positive       0.60      0.69      0.64      1325
          Negative       0.53      0.41      0.47      1983
           Neutral       0.59      0.76      0.66      1543
          Positive       0.57      0.44      0.50      2285

          accuracy                           0.57      8232
         macro avg       0.56      0.60      0.57      8232
      weighted avg       0.56      0.57      0.56      8232

Balanced Accuracy: 0.5964872451832405
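
Balanced accuracy is the unweighted mean of the per-class recalls, while plain accuracy equals the support-weighted mean of the same recalls. The check below recomputes both from the rounded values in the report above, so small rounding differences from the printed figures are expected.

```python
# Per-class recall and support copied from the final classification report.
report = {
    "Extremely Negative": (0.68, 1096),
    "Extremely Positive": (0.69, 1325),
    "Negative": (0.41, 1983),
    "Neutral": (0.76, 1543),
    "Positive": (0.44, 2285),
}

recalls = [recall for recall, _ in report.values()]
total_support = sum(support for _, support in report.values())

# Unweighted mean of recalls -> balanced accuracy (macro-averaged recall).
balanced_acc = sum(recalls) / len(recalls)

# Support-weighted mean of recalls -> overall accuracy.
accuracy = sum(recall * support for recall, support in report.values()) / total_support

print("Balanced accuracy:", round(balanced_acc, 3))  # 0.596
print("Accuracy:", round(accuracy, 3))
```

The gap between the two metrics comes from the largest classes, `Positive` and `Negative`, having the lowest recall: they pull accuracy down more than they pull down the unweighted average.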
