### Data Exploration 

Load the dataset and perform exploratory data analysis (EDA).

In [1]:
import pandas as pd

# Load the dataset to inspect its structure
tweet_data = pd.read_csv('../data_file/tweet_sentiments.csv', encoding='ISO-8859-1')

# Display the first few rows of the dataset
tweet_data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


The dataset consists of the following key columns:

* **tweet_text:** The actual text of the Tweet.
* **emotion_in_tweet_is_directed_at:** The product or brand mentioned in the Tweet (e.g., iPhone, iPad, Google).
* **is_there_an_emotion_directed_at_a_brand_or_product:** The sentiment or emotion expressed in the Tweet (e.g., Positive emotion, Negative emotion).

In [2]:
# Check for missing values and get a summary of the dataset
missing_values = tweet_data.isnull().sum()

# Check the distribution of sentiment classes
sentiment_distribution = tweet_data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

# Display the missing values and sentiment distribution
print("Missing Values:")
print(missing_values)

print("\nSentiment Distribution:")
print(sentiment_distribution)

Missing Values:
tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

Sentiment Distribution:
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64


Key observations from the output:

1. There is one missing value in the `tweet_text` column and 5,802 missing values in the `emotion_in_tweet_is_directed_at` column. The latter column is not needed for sentiment classification, since the target is the sentiment label.

2. The sentiment distribution shows a significant imbalance, with:
   * **5389** instances labeled as "No emotion toward brand or product."
   * **2978** labeled as "Positive emotion."
   * **570** labeled as "Negative emotion."
   * **156** instances labeled as "I can't tell."
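To make the imbalance concrete, the counts can be turned into class shares. The following is a minimal sketch that uses only the numbers printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
sentiment_counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(sentiment_counts.values())

# Share of each class in the full dataset.
class_shares = {label: count / total for label, count in sentiment_counts.items()}
for label, share in class_shares.items():
    print(f"{label:<36} {share:6.1%}")

# Ratio of positive to negative Tweets, the two classes used later for binary classification.
pos_neg_ratio = sentiment_counts["Positive emotion"] / sentiment_counts["Negative emotion"]
print(f"Positive : Negative ratio = {pos_neg_ratio:.1f} : 1")
```

Negative Tweets make up only about 6% of the data, and positive Tweets outnumber them by roughly 5 to 1, which is worth keeping in mind when reading accuracy figures later on.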

### Data Preparation

**Step 1:** Data Cleaning

In [3]:
# Drop rows with missing tweet_text and drop the column 'emotion_in_tweet_is_directed_at' as it is not necessary for sentiment analysis
cleaned_tweet_data = tweet_data.dropna(subset=['tweet_text']).drop(columns=['emotion_in_tweet_is_directed_at'])

# Display the cleaned dataset for further inspection
cleaned_tweet_data.head()

| | tweet_text | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Positive emotion |


In [4]:
# Cross-checking that there are no missing values
missing_values = cleaned_tweet_data.isnull().sum()

missing_values

tweet_text                                            0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

The dataset has been successfully cleaned by removing the missing entries from the `tweet_text` column and dropping the irrelevant `emotion_in_tweet_is_directed_at` column.

**Step 2:** Data Preprocessing

In [5]:
# Import necessary libraries for text preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Define stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function for text preprocessing
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Remove every character that is not a letter or whitespace (punctuation, digits, symbols)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    words = word_tokenize(text)
    # Remove stop words and lemmatize
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Apply the preprocessing function to the 'tweet_text' column
cleaned_tweet_data['cleaned_text'] = cleaned_tweet_data['tweet_text'].apply(preprocess_text)

# Display the first few rows of the preprocessed data
cleaned_tweet_data[['tweet_text', 'cleaned_text']].head()

| | tweet_text | cleaned_text |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | wesley g iphone hr tweeting riseaustin dead ne... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | jessedee know fludapp awesome ipadiphone app y... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | swonderlin wait ipad also sale sxsw |
| 3 | @sxsw I hope this year's festival isn't as cra... | sxsw hope year festival isnt crashy year iphon... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | sxtxstate great stuff fri sxsw marissa mayer g... |


The `cleaned_text` column now contains preprocessed Tweet text, which has been:

* Converted to lowercase.
* Stripped of punctuation, digits, and special characters (only letters and whitespace are kept).
* Tokenized, with stop words removed.
* Lemmatized to reduce words to their base form.
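The character-stripping step has a few side effects that are easy to miss. The sketch below isolates the same regular expression used in `preprocess_text`, without the NLTK steps, and applies it to short fragments modeled on the sample Tweets; the helper name `strip_non_letters` and the fragments are illustrative only.

```python
import re

def strip_non_letters(text):
    """Lowercase the text and drop every character that is not a letter or whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse the gaps left behind by removed characters.
    return " ".join(text.split())

samples = [
    "I have a 3G iPhone.",
    "festival isn't as crashy",
    "Can not wait for #iPad 2",
]
for sample in samples:
    print(f"{sample!r} -> {strip_non_letters(sample)!r}")
```

Digits vanish (`3G` becomes `g`), hashtags and mentions lose their markers, and apostrophes are removed before stop-word filtering, so a contraction such as `isn't` becomes `isnt`. That is consistent with `isnt` appearing in the cleaned output above.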

**Step 3:** Feature Engineering

Convert the cleaned text into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the TF-IDF vectorizer, keeping at most 5000 features (stop words were already removed during preprocessing)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the cleaned text data into TF-IDF features
X_tfidf = tfidf_vectorizer.fit_transform(cleaned_tweet_data['cleaned_text'])

# Display the shape of the resulting TF-IDF matrix
X_tfidf.shape

(9092, 5000)

The resulting matrix contains 9,092 rows (one per Tweet) and 5,000 columns (one per term kept in the vocabulary). Each cell holds the TF-IDF score of a specific term in a specific Tweet.
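To show what a TF-IDF score is, here is a small hand-rolled sketch on a hypothetical three-document corpus. It follows what I understand to be the default `TfidfVectorizer` settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization per row); it is meant as an illustration of the arithmetic, not a replacement for the library.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned "Tweets" (hypothetical, for illustration only).
docs = [
    "iphone battery dead",
    "ipad app awesome",
    "iphone app great app",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, matrix):
    nonzero = {term: round(score, 3) for term, score in zip(vocabulary, row) if score > 0}
    print(doc, "->", nonzero)
```

Terms that occur in fewer documents (such as `battery`) receive a higher weight than terms shared across documents (such as `iphone`), and every term absent from a document scores zero.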

In [7]:
# Convert the sparse TF-IDF matrix to a dense array
tfidf_sample = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the first 5 rows and first 10 columns of the TF-IDF matrix
print(tfidf_sample.iloc[:5, :10])

   aapl  aaron  aarpbulletin   ab  abacus  abba  abc  ability  able  abnormal
0   0.0    0.0           0.0  0.0     0.0   0.0  0.0      0.0   0.0       0.0
1   0.0    0.0           0.0  0.0     0.0   0.0  0.0      0.0   0.0       0.0
2   0.0    0.0           0.0  0.0     0.0   0.0  0.0      0.0   0.0       0.0
3   0.0    0.0           0.0  0.0     0.0   0.0  0.0      0.0   0.0       0.0
4   0.0    0.0           0.0  0.0     0.0   0.0  0.0      0.0   0.0       0.0


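The output above shows a small portion of the matrix: the TF-IDF scores of the first 10 terms in the vocabulary for the first 5 Tweets. All values are 0.0 because none of these Tweets contain any of those 10 terms; most entries in a TF-IDF matrix are zero, since each Tweet uses only a handful of the 5,000 terms.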

## Modeling

### 1. Binary Classification

The sentiment labels are converted into a binary classification problem (positive vs. negative), and binary classifiers such as Logistic Regression, Random Forest, and SVM are trained on the TF-IDF features.
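One caveat: the cells that follow reuse the `tfidf_vectorizer` that was fitted on the full dataset before the train/test split, so its vocabulary and IDF weights have already seen the test Tweets. A possible alternative, sketched here on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that TF-IDF is fitted on the training split only. The texts, labels, and the `_demo` variable names are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned Tweets (1 = positive, 0 = negative).
texts = [
    "love new ipad", "great app launch", "awesome iphone demo", "google party fun",
    "battery dead again", "app keeps crashing", "iphone line awful", "hate slow wifi",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test Tweets stay unseen by the vectorizer.
texts_train, texts_test, y_train_demo, y_test_demo = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF and the classifier on the training split only.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
text_clf.fit(texts_train, y_train_demo)

# At prediction time the fitted vocabulary is reused; unseen words are ignored.
predictions = text_clf.predict(texts_test)
print(predictions)
```

With this layout the same pipeline object can also be passed to cross-validation utilities, which refit the vectorizer inside each fold.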

**1(a). Logistic Regression**

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Convert sentiment labels into a binary classification task (positive vs. negative)
# Exclude 'No emotion' and 'I can't tell' classes for this binary task
binary_data = cleaned_tweet_data[cleaned_tweet_data['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Positive emotion', 'Negative emotion'])]

# Re-apply the TF-IDF transformation on the filtered data
X_tfidf_binary = tfidf_vectorizer.transform(binary_data['cleaned_text'])

# Prepare target variable (1 for positive, 0 for negative)
y_binary = binary_data['is_there_an_emotion_directed_at_a_brand_or_product'].apply(lambda x: 1 if x == 'Positive emotion' else 0)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary)

# Train a Logistic Regression model as a baseline
logreg = LogisticRegression(class_weight='balanced', max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy_log = accuracy_score(y_test, y_pred)
report_log = classification_report(y_test, y_pred, target_names=['Negative', 'Positive'])

# Display the evaluation scores
print(f"Accuracy: {accuracy_log}")
print("Classification Report:\n", report_log)


Accuracy: 0.8394366197183099
Classification Report:
               precision    recall  f1-score   support

    Negative       0.50      0.68      0.57       114
    Positive       0.93      0.87      0.90       596

    accuracy                           0.84       710
   macro avg       0.72      0.77      0.74       710
weighted avg       0.86      0.84      0.85       710



Results for the baseline logistic regression binary classification task (positive vs. negative sentiment):

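* **Accuracy:** 84%. This matches the share of positive Tweets in the test set (596 of 710), so a model that always predicted "positive" would score the same; the per-class metrics are more informative.

* **Negative Class (Precision: 0.50, Recall: 0.68, F1-score: 0.57):** The model struggles with negative sentiment: only half of the Tweets it flags as negative are truly negative, and it misses about a third of the actual negative Tweets. This is expected given the class imbalance.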

* **Positive Class (Precision: 0.93, Recall: 0.87, F1-score: 0.90):** The model performs much better with positive sentiment.
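The relationship between accuracy, per-class recall, and the averaged rows of the report can be checked with a few lines of arithmetic. This sketch uses only the rounded values printed in the classification report above, so the reconstructed numbers are approximate.

```python
# Test-set support per class, taken from the classification report above.
support = {"Negative": 114, "Positive": 596}
# Per-class recall and F1 as printed in the report (rounded to two decimals).
recall = {"Negative": 0.68, "Positive": 0.87}
f1 = {"Negative": 0.57, "Positive": 0.90}

n_test = sum(support.values())

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(support.values()) / n_test

# Accuracy equals the support-weighted mean of per-class recall.
accuracy_from_recall = sum(recall[c] * support[c] for c in support) / n_test

# Macro average treats both classes equally; weighted average follows class size.
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in support) / n_test

print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"Accuracy rebuilt from recall:     {accuracy_from_recall:.4f}")
print(f"Macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
```

Because the positive class dominates the test set, accuracy and the weighted averages mostly reflect positive-class performance. The macro average, which gives the negative class equal weight, is the more sensitive summary for this task.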

**1(b). Logistic Regression with SMOTE**

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from imblearn.over_sampling import SMOTE

# Convert sentiment labels into a binary classification task (positive vs. negative)
binary_data = cleaned_tweet_data[cleaned_tweet_data['is_there_an_emotion_directed_at_a_brand_or_product'].isin(['Positive emotion', 'Negative emotion'])]

# Re-apply the TF-IDF transformation on the filtered data
X_tfidf_binary = tfidf_vectorizer.transform(binary_data['cleaned_text'])

# Prepare target variable (1 for positive, 0 for negative)
y_binary = binary_data['is_there_an_emotion_directed_at_a_brand_or_product'].apply(lambda x: 1 if x == 'Positive emotion' else 0)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary)

# Apply SMOTE to the training data to handle class imbalance
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train a Logistic Regression model with SMOTE-applied data
logreg = LogisticRegression(class_weight='balanced', max_iter=1000)
logreg.fit(X_train_smote, y_train_smote)

# Predict on the test set
y_pred_smote = logreg.predict(X_test)

# Evaluate the model
accuracy_log_smote = accuracy_score(y_test, y_pred_smote)
report_log_smote = classification_report(y_test, y_pred_smote, target_names=['Negative', 'Positive'])

# Display the evaluation scores
print(f"Accuracy after SMOTE: {accuracy_log_smote}")
print("Classification Report after SMOTE:\n", report_log_smote)

Accuracy after SMOTE: 0.8492957746478873
Classification Report after SMOTE:
               precision    recall  f1-score   support

    Negative       0.53      0.61      0.57       114
    Positive       0.92      0.89      0.91       596

    accuracy                           0.85       710
   macro avg       0.73      0.75      0.74       710
weighted avg       0.86      0.85      0.85       710



Results for the logistic regression model on the binary classification task with SMOTE applied:

* **Accuracy:** 85%

* **Negative Class (Precision: 0.53, Recall: 0.61, F1-score: 0.57):** Precision for negative sentiment (the minority class) improved slightly after applying SMOTE (0.50 to 0.53), but recall dropped (0.68 to 0.61), leaving the F1-score unchanged. The model's negative predictions are correct a little more often, yet it now misses a larger share of the actual negative Tweets.

* **Positive Class (Precision: 0.92, Recall: 0.89, F1-score: 0.91):** Performance for the positive class remains strong and is essentially unchanged from the baseline: precision dipped from 0.93 to 0.92 while recall rose from 0.87 to 0.89, giving a marginally higher F1-score (0.90 to 0.91).
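For intuition about what SMOTE adds to the training set: it creates synthetic minority samples by picking a minority point, choosing one of its nearest minority neighbors, and placing a new point somewhere on the line segment between them. The following is a simplified standard-library sketch of that idea on hypothetical 2-D points; it is not the `imblearn` implementation, and the function name and parameters are illustrative.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical 2-D feature vectors for the minority (negative) class.
minority_points = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.7), (0.3, 0.85)]
new_points = smote_like_samples(minority_points, n_new=5)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Since every synthetic point lies between two existing minority samples, SMOTE fills in the region the minority class already occupies rather than adding genuinely new negative Tweets, which may help explain why the gains above are modest.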

**1(c). Tuned Logistic Regression**

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the hyperparameter grid to search
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller values regularize more)
    'penalty': ['l2'],  # Penalty norm (L1 norm is typically supported with 'liblinear' or 'saga')
    'solver': ['liblinear', 'saga']  # Solvers compatible with penalty
}

# Initialize the Logistic Regression model
logreg = LogisticRegression(class_weight='balanced', max_iter=1000)

# Set up GridSearchCV with 3-fold cross-validation
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=3, verbose=1, n_jobs=-1)

# Fit the GridSearchCV to the SMOTE-balanced training data
grid_search.fit(X_train_smote, y_train_smote)

# Display the best hyperparameters found
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# Retrieve the best model (already refit on the full training data) and predict on the test set
best_logreg = grid_search.best_estimator_
y_pred_best = best_logreg.predict(X_test)

# Evaluate the tuned model
accuracy_tuned = accuracy_score(y_test, y_pred_best)
report_tuned = classification_report(y_test, y_pred_best, target_names=['Negative', 'Positive'])

# Display the evaluation scores
print(f"Accuracy after Tuning: {accuracy_tuned}")
print("Classification Report after Tuning:\n", report_tuned)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
Accuracy after Tuning: 0.8774647887323944
Classification Report after Tuning:
               precision    recall  f1-score   support

    Negative       0.62      0.61      0.61       114
    Positive       0.92      0.93      0.93       596

    accuracy                           0.88       710
   macro avg       0.77      0.77      0.77       710
weighted avg       0.88      0.88      0.88       710
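The line "Fitting 3 folds for each of 10 candidates, totalling 30 fits" follows directly from the size of the parameter grid. As a small illustration, the candidates can be enumerated with `itertools.product`; the grid below is copied from the cell above, while `cv_folds` and `candidates` are names introduced here for the sketch.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l2"],
    "solver": ["liblinear", "saga"],
}
cv_folds = 3

# Every combination of values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
for candidate in candidates[:3]:
    print(candidate)
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the full training data, which is the model returned by `best_estimator_`.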

