In [1]:
import pandas as pd
import re
import nltk
import string
import contractions
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download once
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Load data
df = pd.read_csv("test_twitter_x_test.csv")[['text']]
df.columns = ['Text']
df = df.dropna()

# Initialize
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_tweet(text):
    text = contractions.fix(text.lower())
    text = re.sub(r'http\S+|@\w+|#\w+|[^a-zA-Z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and t not in string.punctuation]
    return tokens

# Apply preprocessing
df['Tokens'] = df['Text'].apply(preprocess_tweet)
print(df.head())


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                                Text  \
0  @AmericanAir In car gng to DFW. Pulled over 1h...   
1  @AmericanAir after all, the plane didn’t land ...   
2  @SouthwestAir can't believe how many paying cu...   
3  @USAirways I can legitimately say that I would...   
4  @AmericanAir still no response from AA. great ...   

                                              Tokens  
0  [car, gng, dfw, pulled, hr, ago, icy, road, on...  
1  [plane, land, identical, worse, condition, grk...  
2  [believe, many, paying, customer, left, high, ...  
3  [legitimately, say, would, rather, driven, cro...  
4             [still, response, aa, great, job, guy]  


#### (Python 3.11 Environment)

### Step 1: Preprocessing Tweets

We begin by preprocessing each tweet using the following steps:
- Convert to lowercase
- Expand contractions (e.g., "can't" → "cannot")
- Remove URLs, mentions, hashtags, and every remaining non-letter character (punctuation, digits, emoji)
- Tokenize into words using `nltk.word_tokenize()`
- Remove stopwords and lemmatize tokens

This transforms raw tweets into clean word lists for further vectorization.
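
To make the cleaning step concrete, here is a minimal sketch that applies the same regular expression as `preprocess_tweet` to a made-up tweet. The helper name `strip_noise` and the sample text are illustrative assumptions, and the sketch skips contraction expansion, stopword removal, and lemmatization.

```python
import re

# Same pattern as in preprocess_tweet: URLs, @mentions, #hashtags,
# then any character that is not a letter or whitespace.
CLEAN_PATTERN = re.compile(r'http\S+|@\w+|#\w+|[^a-zA-Z\s]')

def strip_noise(text: str) -> str:
    """Lowercase the text and delete everything the pattern matches."""
    return CLEAN_PATTERN.sub('', text.lower())

# Hypothetical tweet, not taken from the dataset.
sample = "@AmericanAir Flight 123 delayed 2hrs! #fail http://t.co/abc"
words = strip_noise(sample).split()
print(words)  # ['flight', 'delayed', 'hrs']
```

Note that a hashtag is dropped together with its word (`#fail` disappears entirely), and digits are deleted even inside a token (`2hrs` becomes `hrs`).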


In [2]:
from gensim.models import KeyedVectors

# Load Google News Word2Vec binary model
word2vec_model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)

# Confirm dimensions
print(f"Word2Vec model loaded. Vector size: {word2vec_model.vector_size}")


Word2Vec model loaded. Vector size: 300


### Step 2: Load Google News Word2Vec Model using Gensim

We load the pretrained **Google News Word2Vec** model using `gensim.models.KeyedVectors`. This model contains 3 million English word and phrase vectors, each with 300 dimensions.

The model is loaded from the binary `.bin.gz` file using:

```python
KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
```

This provides vector representations for individual words, which we will use to build tweet-level embeddings.
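
Since tweets contain abbreviations and slang, it can help to check how many tokens the embedding vocabulary actually covers before building tweet vectors. The sketch below is one way to do that. `vocabulary_coverage` is a hypothetical helper, and the toy vocabulary is a stand-in set rather than the real model. It relies only on the `in` operator, which the loaded `KeyedVectors` object also supports.

```python
def vocabulary_coverage(token_lists, vocabulary):
    """Return (fraction of tokens found in vocabulary, sorted distinct missing tokens)."""
    total = 0
    covered = 0
    missing = set()
    for tokens in token_lists:
        for token in tokens:
            total += 1
            if token in vocabulary:
                covered += 1
            else:
                missing.add(token)
    fraction = covered / total if total else 0.0
    return fraction, sorted(missing)

# Toy stand-ins: a made-up vocabulary and two short token lists.
toy_vocabulary = {"car", "pulled", "ago", "icy", "road"}
toy_tweets = [["car", "gng", "dfw", "pulled"], ["icy", "road"]]

coverage, missing_tokens = vocabulary_coverage(toy_tweets, toy_vocabulary)
print(f"{coverage:.2f}", missing_tokens)  # 0.67 ['dfw', 'gng']
```

With the real data, the call would presumably be `vocabulary_coverage(df['Tokens'], word2vec_model)`.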


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Build a word -> IDF weight lookup from CleanText (tokens joined into strings)
df['CleanText'] = df['Tokens'].apply(lambda tokens: ' '.join(tokens))

tfidf = TfidfVectorizer()
tfidf.fit(df['CleanText'])
tfidf_dict = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

# Weighted average vector using Word2Vec + TF-IDF
def tfidf_weighted_word2vec(tokens, model, tfidf_dict, vector_size):
    vectors = [
        model[word] * tfidf_dict[word]
        for word in tokens if word in model and word in tfidf_dict
    ]
    return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)

# Compute final vector for each tweet
df['Vector'] = df['Tokens'].apply(lambda tokens: tfidf_weighted_word2vec(tokens, word2vec_model, tfidf_dict, word2vec_model.vector_size))

# View result
print(df[['Text', 'Vector']].head())


                                                Text  \
0  @AmericanAir In car gng to DFW. Pulled over 1h...   
1  @AmericanAir after all, the plane didn’t land ...   
2  @SouthwestAir can't believe how many paying cu...   
3  @USAirways I can legitimately say that I would...   
4  @AmericanAir still no response from AA. great ...   

                                              Vector  
0  [-0.323791, 0.52586997, 0.02905167, 0.24832352...  
1  [0.6987391, -0.55277914, 0.060138952, -0.59743...  
2  [0.38663167, 0.36437044, -0.09021091, 0.616413...  
3  [-0.042288974, 0.36418095, 0.36169785, 0.55547...  
4  [0.50087684, 0.32249323, -0.0906745, 0.2391912...  


### Step 3: Convert Tweets to Fixed-Length Vectors Using Word2Vec + TF-IDF

We generate a 300-dimensional vector representation for each tweet by combining:

1. **Pretrained Google Word2Vec model**: provides word-level 300D embeddings.
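2. **TF-IDF weighting**: scales each word by its inverse document frequency (`tfidf.idf_`) fitted on this dataset, so rarer words weigh more. Term frequency enters only through repeated tokens in the average.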

For each tweet:
- We extract tokens.
- For each token found in both the Word2Vec vocabulary and the TF-IDF dictionary:
  - Multiply its embedding by its IDF weight.
- Average the weighted embeddings to get a single vector. A tweet with no matching tokens gets a zero vector.

This produces one 300-dimensional vector per tweet, ready for classification.
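
The arithmetic is easier to follow on a toy example. The sketch below mirrors the logic of `tfidf_weighted_word2vec` in plain Python with made-up 2-dimensional embeddings and weights. The `weighted_average` helper and all values are illustrative assumptions, not taken from the real model.

```python
def weighted_average(tokens, embeddings, weights, size):
    """Scale each known token's vector by its weight, then average over those tokens."""
    weighted = [
        [weights[token] * value for value in embeddings[token]]
        for token in tokens
        if token in embeddings and token in weights
    ]
    if not weighted:
        return [0.0] * size  # no usable token: fall back to a zero vector
    count = len(weighted)
    return [sum(column) / count for column in zip(*weighted)]

# Made-up 2D embeddings and IDF-style weights.
toy_embeddings = {"late": [1.0, 0.0], "flight": [0.0, 1.0]}
toy_weights = {"late": 2.0, "flight": 1.0}

# "zzz" is in neither lookup, so it is skipped.
tweet_vector = weighted_average(["late", "flight", "zzz"], toy_embeddings, toy_weights, 2)
print(tweet_vector)  # [1.0, 0.5]
```

The sum is divided by the number of matched tokens, not by the sum of the weights, so the result is a mean of scaled vectors rather than a normalized weighted average.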


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
import numpy as np

# Add sentiment column from original file (assignment aligns on the index,
# so rows removed by dropna() above are not reintroduced)
df_sent = pd.read_csv("test_twitter_x_test.csv")[['sentiment']]
df['Sentiment'] = df_sent['sentiment']

# Encode labels to 0,1,2
le = LabelEncoder()
y = le.fit_transform(df['Sentiment'])
X = np.vstack(df['Vector'].values)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_bal, y_train_bal)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=le.classes_))


              precision    recall  f1-score   support

    negative       0.72      0.40      0.52       545
     neutral       0.16      0.31      0.21       108
    positive       0.11      0.32      0.17        79

    accuracy                           0.38       732
   macro avg       0.33      0.34      0.30       732
weighted avg       0.57      0.38      0.43       732



### Step 4: Train a Classifier Using Tweet Vectors

We train a sentiment classifier on the tweet embeddings created in Step 3. On the held-out test set it reaches 38% accuracy (macro-averaged F1 of 0.30).

Steps:
1. Load the `sentiment` labels and encode them to numeric format (0=negative, 1=neutral, 2=positive).
2. Split the dataset into train and test sets using `train_test_split`.
3. Use `SMOTE` to balance class distribution in the training set.
4. Train a `LogisticRegression` model on the resampled vectors.
5. Evaluate the model using precision, recall, and F1-score via `classification_report`.

This Word2Vec-based model is the first of two classifiers; the prediction step at the end uses the GloVe-based model trained in Step 5.
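
To put the reported accuracy in context, it helps to compare it with a baseline that always predicts the most frequent class. The sketch below computes that baseline from the `support` column of the classification report above; the variable names are illustrative.

```python
# Per-class test-set counts, copied from the support column of the report.
support = {"negative": 545, "neutral": 108, "positive": 79}

total = sum(support.values())
majority_label = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = support[majority_label] / total
print(f"{majority_label}: {baseline_accuracy:.3f}")  # negative: 0.745
```

Accuracy alone is a weak guide on a test set this imbalanced, which is why the per-class precision, recall, and F1 values in the report are worth reading alongside it.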


In [5]:
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

# Load GloVe 100D
glove_model = api.load("glove-wiki-gigaword-100")
print(f"GloVe model loaded. Vector size: {glove_model.vector_size}")

# Reload DataFrame with Sentiment
df = pd.read_csv("test_twitter_x_test.csv")[['text', 'sentiment']]
df.columns = ['Text', 'Sentiment']
df.dropna(subset=['Text'], inplace=True)
df['Text'] = df['Text'].astype(str)

# Preprocess
df['Tokens'] = df['Text'].apply(preprocess_tweet)
df['CleanText'] = df['Tokens'].apply(lambda tokens: ' '.join(tokens))

# TF-IDF for GloVe
tfidf = TfidfVectorizer()
tfidf.fit(df['CleanText'])
tfidf_dict = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

# GloVe vectorization using TF-IDF
def tfidf_weighted_glove(tokens, model, tfidf_dict, vector_size):
    vectors = [model[word] * tfidf_dict[word] for word in tokens if word in model and word in tfidf_dict]
    return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)

# Generate tweet vectors
df['Vector'] = df['Tokens'].apply(lambda tokens: tfidf_weighted_glove(tokens, glove_model, tfidf_dict, glove_model.vector_size))
X = np.vstack(df['Vector'].values)
y = LabelEncoder().fit_transform(df['Sentiment'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Balance with SMOTE
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Train on GloVe
glove_model_classifier = LogisticRegression(max_iter=1000)
glove_model_classifier.fit(X_train_bal, y_train_bal)

# Evaluate
y_pred = glove_model_classifier.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['negative', 'neutral', 'positive']))


GloVe model loaded. Vector size: 100
              precision    recall  f1-score   support

    negative       0.71      0.35      0.47       545
     neutral       0.14      0.31      0.19       108
    positive       0.10      0.28      0.14        79

    accuracy                           0.34       732
   macro avg       0.32      0.31      0.27       732
weighted avg       0.56      0.34      0.40       732



### Step 5: Train Classifier Using GloVe 100D Vectors 

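To avoid dimension mismatches at prediction time, we train a second classifier on **GloVe 100D vectors**, so that the classifier and the prediction function both work with 100-dimensional inputs. It reaches 34% accuracy on the held-out test set (macro-averaged F1 of 0.27).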

Steps:
1. Load the `glove-wiki-gigaword-100` embedding model.
2. Preprocess tweets and fit a `TfidfVectorizer` to get a per-word IDF weight dictionary.
3. Vectorize each tweet using a **TF-IDF weighted average of GloVe vectors**.
4. Split the dataset and balance classes using `SMOTE`.
5. Train a `LogisticRegression` model on the 100D tweet vectors.

This GloVe-trained model will be used in the next step to predict sentiment for new tweets.
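
A dimension mismatch of this kind can be caught early with a small guard. The sketch below is an assumption about how such a check might look, not part of the notebook: `check_dimensions` is a hypothetical helper that compares the `vector_size` attribute of a gensim `KeyedVectors` object with the `n_features_in_` attribute that scikit-learn estimators set during `fit()`. Stand-in objects replace the real models so the example runs on its own.

```python
from types import SimpleNamespace

def check_dimensions(embedding_model, classifier) -> None:
    """Raise ValueError if the embedding width differs from the classifier's input width."""
    expected = classifier.n_features_in_   # set by scikit-learn estimators in fit()
    actual = embedding_model.vector_size   # attribute of gensim KeyedVectors
    if expected != actual:
        raise ValueError(
            f"classifier expects {expected}-dimensional input, "
            f"but the embedding model produces {actual}-dimensional vectors"
        )

# Stand-ins: a 100D embedding model paired with a classifier trained on 300D vectors.
glove_like = SimpleNamespace(vector_size=100)
word2vec_classifier_like = SimpleNamespace(n_features_in_=300)

try:
    check_dimensions(glove_like, word2vec_classifier_like)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

With the real objects, `check_dimensions(glove_model, glove_model_classifier)` should pass silently, since both sides use 100 dimensions.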


In [6]:
# GloVe-based TF-IDF weighted average vector
def tfidf_weighted_glove(tokens, glove_model, tfidf_dict, vector_size):
    vectors = [
        glove_model[word] * tfidf_dict[word]
        for word in tokens if word in glove_model and word in tfidf_dict
    ]
    return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)

# Prediction function
def predict_tweet_sentiment(tweet, model, glove_model, tfidf_dict):
    tokens = preprocess_tweet(tweet)
    vector = tfidf_weighted_glove(tokens, glove_model, tfidf_dict, glove_model.vector_size).reshape(1, -1)
    pred = model.predict(vector)[0]
    return pred  # 0 = negative, 1 = neutral, 2 = positive 


### Step 6: Predict Sentiment for a New Tweet Using GloVe-Based Model

In this final step, we use the model trained on GloVe 100D vectors to classify the sentiment of any input tweet.

Steps:
1. Preprocess the tweet (cleaning, tokenization, lemmatization)
2. Create a TF-IDF weighted average of GloVe embeddings for the tokens
3. Use the trained model to predict sentiment

The `predict_tweet_sentiment` function classifies one unseen tweet at a time and returns its integer label.

Label mapping:
- `0` → Negative
- `1` → Neutral
- `2` → Positive
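
Because the GloVe cell creates its `LabelEncoder` inline and does not keep it, the integer returned by `predict_tweet_sentiment` has to be mapped back to a name by hand. A minimal sketch, assuming the alphabetical label order listed above; `LABEL_NAMES` and `label_name` are hypothetical names:

```python
# Label order assumed to match the mapping above (alphabetical, as LabelEncoder sorts).
LABEL_NAMES = ("negative", "neutral", "positive")

def label_name(prediction) -> str:
    """Translate an integer class label (0, 1, or 2) into its sentiment name."""
    index = int(prediction)  # also accepts NumPy integer types
    if not 0 <= index < len(LABEL_NAMES):
        raise ValueError(f"unknown label: {prediction!r}")
    return LABEL_NAMES[index]

print(label_name(0), label_name(2))  # negative positive
```

In use, this would presumably wrap the prediction call, for example `label_name(predict_tweet_sentiment(tweet, glove_model_classifier, glove_model, tfidf_dict))`.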
