# NLP Text Classification

## Objective:
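Develop and compare machine learning models (Naive Bayes and SVM) that classify short text comments by emotion. The labels that appear in the results are anger, fear, and joy.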

# 1. Loading and Preprocessing

### 1.1: Load the Dataset

In [11]:
import pandas as pd

# Load the dataset
file_path = r"C:\Users\asus\Downloads\nlp_dataset.csv" 
df = pd.read_csv(file_path)

# Display the first few rows and the columns of the dataframe
print(df.head())
print(df.columns)

                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear
Index(['Comment', 'Emotion'], dtype='object')


### 1.2: Preprocess the Text

In [13]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Ensure NLTK resources are downloaded
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of rebuilding it for every row
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

# Apply the preprocessing to the 'Comment' column
df['cleaned_text'] = df['Comment'].apply(preprocess_text)

# Display the cleaned text
print(df[['Comment', 'cleaned_text']].head())


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                             Comment  \
0  i seriously hate one subject to death but now ...   
1                 im so full of life i feel appalled   
2  i sit here to write i start to dig out my feel...   
3  ive been really angry with r and i feel like a...   
4  i feel suspicious if there is no one outside l...   

                                        cleaned_text  
0  seriously hate one subject death feel reluctan...  
1                         im full life feel appalled  
2  sit write start dig feelings think afraid acce...  
3  ive really angry r feel like idiot trusting fi...  
4  feel suspicious one outside like rapture happe...  


### Explanation of Preprocessing

* Lowercasing: Converting the text to lowercase helps maintain consistency (e.g., "Happy" and "happy" should be treated the same).

* Punctuation removal: Punctuation generally adds little value for text classification, and removing it keeps tokens uniform (e.g., "happy!" and "happy" become the same word).

* Number removal: Since numbers often don't hold significance in text classification (unless the context requires it), they are removed.

* Tokenization: Breaking text into individual words allows us to analyze each word separately.

* Stopwords removal: Words like "the", "is", "at", and "and" are common but don't contribute much to the classification task, so removing them focuses on the more meaningful content.
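
The steps above can be traced on a single sentence. The sketch below is a simplified stand-in rather than the notebook's `preprocess_text`: it assumes whitespace splitting in place of `word_tokenize` and a small hand-picked `DEMO_STOPWORDS` set in place of NLTK's English list, so that it runs without any downloads. The function name `trace_preprocessing` and the sample sentence are made up for illustration.

```python
import re
import string

# Hypothetical mini stopword list; NLTK's English list is much longer.
DEMO_STOPWORDS = {"i", "am", "so", "the", "is", "at", "and"}

def trace_preprocessing(text):
    """Return the text after each cleaning step, in order."""
    stages = {}
    stages["lowercased"] = text.lower()
    stages["no_punctuation"] = stages["lowercased"].translate(
        str.maketrans("", "", string.punctuation)
    )
    stages["no_numbers"] = re.sub(r"\d+", "", stages["no_punctuation"])
    # Whitespace split stands in for word_tokenize in this sketch.
    stages["tokens"] = stages["no_numbers"].split()
    stages["no_stopwords"] = [
        t for t in stages["tokens"] if t not in DEMO_STOPWORDS
    ]
    return stages

for step, value in trace_preprocessing("I am SO happy, 100% sure!").items():
    print(f"{step:>15}: {value}")
```

Printing each stage makes it easy to see which step removed which piece of the input, which helps when deciding whether a step is worth keeping for a given dataset.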

### Impact on Model Performance

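Cleaning the text removes noise (stopwords, punctuation, numbers), which shrinks the vocabulary and lets the model focus on content words. This can improve emotion classification, but it is a trade-off: some removed tokens carry meaning. In row 4 of the output above, "no" is dropped from "there is no one outside", and NLTK's English stopword list also contains other negations such as "not". Informal spellings such as "im" and "ive" are not matched by the list, so they remain in the cleaned text. Tokenization turns the text into units (words) that the model can process.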

# 2. Feature Extraction

In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Initialize CountVectorizer and TfidfVectorizer
count_vectorizer = CountVectorizer()
tfidf_vectorizer = TfidfVectorizer()

# Transform the cleaned text into numerical features
X_count = count_vectorizer.fit_transform(df['cleaned_text'])
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_text'])

# Check the shape of transformed data (rows: samples, columns: unique words/features)
print("CountVectorizer shape:", X_count.shape)
print("TfidfVectorizer shape:", X_tfidf.shape)

CountVectorizer shape: (5937, 8815)
TfidfVectorizer shape: (5937, 8815)


### Explanation of Feature Extraction

#### CountVectorizer

* It creates a bag-of-words representation of the text. Each unique word in the text is assigned a feature index, and the value of that feature is the frequency of the word's occurrence in the text.
* Example: with the default settings, text is lowercased and single-character tokens such as "I" are dropped, so "I am happy" yields the vocabulary ["am", "happy"] and the count vector [1, 1].
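
A minimal sketch of the bag-of-words representation on two made-up sentences, assuming scikit-learn's default `CountVectorizer` settings (lowercasing and a token pattern that keeps only tokens of two or more characters). The `docs` list is illustrative and not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical, not from nlp_dataset.csv)
docs = ["I am happy", "I am so so angry"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)             # ['am', 'angry', 'happy', 'so']
print(counts.toarray())  # [[1 0 1 0]
                         #  [1 1 0 2]]
```

Each row is one document and each column one vocabulary term, which is why the shapes printed earlier have one row per comment and one column per unique word.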

#### TfidfVectorizer

* In addition to word counts, the TF-IDF (Term Frequency-Inverse Document Frequency) method considers how important a word is by reducing the weight of commonly used words that appear in many documents.
* It balances the frequency of a word with its informativeness. Frequent but less informative words (like "the", "is") are given lower weights.
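
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula that scikit-learn documents for its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector; it is worth checking this against the documentation of the installed version. The corpus and the helper names `idf` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized corpus (hypothetical)
docs = [
    ["feel", "happy"],
    ["feel", "angry"],
    ["feel", "afraid", "afraid"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed formula, see above)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    raw = {term: count * idf(term) for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

"feel" appears in every document, so its IDF is the minimum value of 1 and it ends up with a lower weight than "happy", which appears in only one document.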

### Impact of Feature Extraction on Performance

* CountVectorizer captures the raw frequency of words, which may work well but can give too much weight to common words.
* TfidfVectorizer down-weights words that occur in many documents, which can keep uninformative words from dominating the features. Whether this improves generalization depends on the data. The models below are trained on the TF-IDF features only, so the two representations are not compared in this notebook.

# 3. Model Development

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['Emotion'], test_size=0.2, random_state=42)

# Initialize models
nb_model = MultinomialNB()
svm_model = SVC(kernel='linear')

# Train Naive Bayes model
nb_model.fit(X_train, y_train)

# Train SVM model
svm_model.fit(X_train, y_train)

SVC(kernel='linear')

### Explanation

**Naive Bayes:** Multinomial Naive Bayes models each class as a distribution over word counts and assumes that words are conditionally independent given the class. It is designed for integer counts, but fractional features such as the TF-IDF values used here often work in practice.
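
The independence assumption means the score for a class is simply the class prior multiplied by one probability per word. A minimal from-scratch sketch on made-up data is shown below; it uses raw counts with add-one (Laplace) smoothing and is meant to illustrate the idea, not to reproduce `MultinomialNB` exactly. The training sentences and the helper names `log_posterior` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set: (text, label)
train = [
    ("feel happy today", "joy"),
    ("so happy and glad", "joy"),
    ("feel angry today", "anger"),
    ("angry and furious", "anger"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) under the independence assumption."""
    log_prob = math.log(class_docs[label] / len(train))  # class prior
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        # Each word contributes independently of the others
        log_prob += math.log(
            (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
        )
    return log_prob

def predict(text):
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("happy today"))  # joy
print(predict("angry"))        # anger
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.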

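**SVM:** The Support Vector Machine (SVM) finds the hyperplane that separates the classes with the largest margin. With more than two classes, scikit-learn's `SVC` combines pairwise (one-vs-one) classifiers. For high-dimensional, sparse text features (8,815 columns here), a linear kernel is often effective.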
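
One detail of the cells above: the vectorizer is fitted on the full dataset before the train/test split, so the vocabulary and document frequencies also reflect the test comments. The effect is probably small here, but a common alternative is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` so that both are fitted on the training split only. The sketch below uses a handful of made-up sentences in place of `df['cleaned_text']` and `df['Emotion']`; the `stratify` argument is an addition that keeps class proportions equal across the splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-in for the cleaned comments and their labels
texts = [
    "feel furious cheated", "angry rude waiter", "hate ignored", "mad annoyed noise",
    "scared dark alone", "afraid fail exam", "terrified storm", "nervous anxious flight",
    "happy sunny day", "glad see friends", "delighted good news", "joyful excited trip",
]
labels = ["anger"] * 4 + ["fear"] * 4 + ["joy"] * 4

# Split the raw text first, so the vectorizer never sees the test comments
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary and the SVM from the training split only
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

With so few sentences the predictions themselves are not meaningful; the point is the order of operations, which also makes the same object usable with cross-validation.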

# 4. Model Comparison

In [16]:
# Predictions for Naive Bayes
y_pred_nb = nb_model.predict(X_test)

# Predictions for SVM
y_pred_svm = svm_model.predict(X_test)

# Accuracy and Classification Report for Naive Bayes
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb))

# Accuracy and Classification Report for SVM
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("SVM Classification Report:\n", classification_report(y_test, y_pred_svm))

Naive Bayes Accuracy: 0.9116161616161617
Naive Bayes Classification Report:
               precision    recall  f1-score   support

       anger       0.88      0.95      0.91       392
        fear       0.92      0.92      0.92       416
         joy       0.95      0.87      0.90       380

    accuracy                           0.91      1188
   macro avg       0.91      0.91      0.91      1188
weighted avg       0.91      0.91      0.91      1188

SVM Accuracy: 0.946969696969697
SVM Classification Report:
               precision    recall  f1-score   support

       anger       0.93      0.96      0.94       392
        fear       0.97      0.91      0.94       416
         joy       0.94      0.97      0.96       380

    accuracy                           0.95      1188
   macro avg       0.95      0.95      0.95      1188
weighted avg       0.95      0.95      0.95      1188



### Explanation

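**Accuracy:** Measures the percentage of correct predictions. While useful, accuracy alone can be misleading if the dataset is imbalanced. The test split here is roughly balanced (392, 416, and 380 samples per class), so accuracy is a reasonable summary in this case.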
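
A quick illustration of why accuracy can mislead on imbalanced data, using a made-up test set rather than the one in this notebook: a classifier that always predicts the majority class scores well on accuracy while never finding the minority class.

```python
# Hypothetical imbalanced test set: 95 "joy" comments and 5 "fear" comments
y_true = ["joy"] * 95 + ["fear"] * 5
# A degenerate classifier that always predicts the majority class
y_pred = ["joy"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for "fear": share of true "fear" comments that were predicted as "fear"
fear_found = sum(t == p == "fear" for t, p in zip(y_true, y_pred))
fear_recall = fear_found / y_true.count("fear")

print(accuracy)     # 0.95
print(fear_recall)  # 0.0
```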

**F1-score:** The harmonic mean of precision and recall. It is a useful measure when classes are imbalanced (i.e., when some emotions occur more often than others), because a per-class F1-score exposes poor performance on a rare class that overall accuracy would hide.
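
The F1 values in the reports above can be recomputed by hand from the precision and recall columns. A short sketch follows, using the "anger" row of the Naive Bayes report; the helper name `f1_from` is made up. Because the report prints rounded values, recomputing from them can differ from the printed F1 by 0.01 for some rows.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# "anger" row of the Naive Bayes report: precision 0.88, recall 0.95
anger_f1 = f1_from(0.88, 0.95)
print(round(anger_f1, 2))  # 0.91

# Macro average: unweighted mean of the three per-class F1-scores
macro_f1 = (0.91 + 0.92 + 0.90) / 3
print(round(macro_f1, 2))  # 0.91
```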

**Naive Bayes:** It is computationally efficient and often works surprisingly well for text classification, even though the word-independence assumption rarely holds exactly. Here it reaches about 91% accuracy.

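**SVM:** The linear SVM is the more accurate model in this experiment (about 95% accuracy versus 91% for Naive Bayes), with a higher F1-score for each of the three classes. It fits a maximum-margin boundary between classes, but it is computationally more expensive to train than Naive Bayes.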