# 1. Loading and Preprocessing
Load the dataset and perform necessary preprocessing steps. This should include text cleaning, tokenization, and removal of stopwords. Explain the preprocessing techniques used and their impact on model performance.

In [1]:
import pandas as pd
data = pd.read_csv('C:/Users/Asus/Downloads/nlp_dataset.csv')

In [2]:
data

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear
...,...,...
5932,i begun to feel distressed for you,fear
5933,i left feeling annoyed and angry thinking that...,anger
5934,i were to ever get married i d have everything...,joy
5935,i feel reluctant in applying there because i w...,fear


In [3]:
data["Emotion"].unique()

array(['fear', 'anger', 'joy'], dtype=object)

In [4]:
text=data['Comment']

In [5]:
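import re

def clean_text(text):
    # Lowercase so that "Happy" and "happy" map to the same token.
    text = text.lower()
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r'\s+', ' ', text)
    return text

data['cleaned_text'] = data['Comment'].apply(clean_text)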


In [6]:
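# word_tokenize needs the NLTK tokenizer data to be installed,
# e.g. via nltk.download('punkt') (the resource name can differ by NLTK version).
from nltk.tokenize import word_tokenize
data['tokens'] = data['cleaned_text'].apply(word_tokenize)
data['tokens']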

0       [i, seriously, hate, one, subject, to, death, ...
1             [im, so, full, of, life, i, feel, appalled]
2       [i, sit, here, to, write, i, start, to, dig, o...
3       [ive, been, really, angry, with, r, and, i, fe...
4       [i, feel, suspicious, if, there, is, no, one, ...
                              ...                        
5932           [i, begun, to, feel, distressed, for, you]
5933    [i, left, feeling, annoyed, and, angry, thinki...
5934    [i, were, to, ever, get, married, i, d, have, ...
5935    [i, feel, reluctant, in, applying, there, beca...
5936    [i, just, wanted, to, apologize, to, you, beca...
Name: tokens, Length: 5937, dtype: object

In [7]:
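# Requires the NLTK stopword corpus, e.g. nltk.download('stopwords').
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not in the English stopword list.
data['tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])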

Preprocessing normalizes the text so that the model sees a consistent vocabulary and can focus on the words that carry emotional content.
1. Cleaning reduces noise:
`clean_text` converts the text to lowercase and collapses repeated whitespace, so that variants such as "Happy" and "happy" count as one word. It does not strip punctuation; the sample rows shown above appear to be free of punctuation already.
2. Tokenization breaks text into words:
`word_tokenize` splits each comment into individual words (tokens), which makes it possible to analyze word patterns and frequencies.
3. Removing stopwords improves focus:
Stopwords like "the" and "is" are common but carry little meaning. Removing them shrinks the vocabulary and can help the model focus on more informative words, although the effect on accuracy depends on the dataset. Note that the filtered `tokens` column is not used later in this notebook: the vectorizer in Section 2 is fit on `cleaned_text`, so the models still see stopwords.
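
The three steps can be traced on a single sentence. The sketch below is an illustration rather than the notebook's own code: `tokenize` is a plain whitespace split standing in for `word_tokenize`, and `STOP_WORDS` is a hypothetical, deliberately tiny list used so the example runs without downloading NLTK data.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook uses NLTK's much longer English list.
STOP_WORDS = {"i", "to", "the", "is", "a", "and", "of", "so", "for", "you"}

def clean_text(text):
    """Lowercase and collapse runs of whitespace, as in the notebook."""
    return re.sub(r"\s+", " ", text.lower())

def tokenize(text):
    """Whitespace split; a simple stand-in for nltk's word_tokenize."""
    return text.split()

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stop_words]

raw = "I  begun to feel\tDistressed for you"
cleaned = clean_text(raw)
tokens = tokenize(cleaned)
filtered = remove_stopwords(tokens)

print(cleaned)   # i begun to feel distressed for you
print(tokens)    # ['i', 'begun', 'to', 'feel', 'distressed', 'for', 'you']
print(filtered)  # ['begun', 'feel', 'distressed']
```

Only the emotion-bearing words survive the last step, which is the intended effect of stopword removal.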

# 2. Feature Extraction
Implement feature extraction using CountVectorizer or TfidfVectorizer. Describe how the chosen method transforms the text data into numerical features.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(data['cleaned_text'])

TF-IDF (Term Frequency-Inverse Document Frequency) scores how important a word is to a document relative to a collection of documents. Unlike Bag of Words, which only counts how often each word occurs, TF-IDF also accounts for how common or rare a word is across all documents, which makes it effective for tasks like text classification and information retrieval.

Term Frequency (TF) measures how frequently a term appears in a document. The higher the frequency, the more weight the word gets in that document.

Inverse Document Frequency (IDF) measures how rare a term is across the entire set of documents. The fewer documents a term appears in, the higher its IDF value, and the more it helps distinguish one document from another.

How TfidfVectorizer works:
1. Tokenization: Splits the text into individual terms (words or tokens).
2. Calculation of TF: Calculates the frequency of each term in a document.
3. Calculation of IDF: Computes the inverse document frequency for each term across all documents.
4. TF-IDF matrix: Multiplies the TF and IDF values for each term, creating a matrix where each row corresponds to a document and each column to a term, with values representing the TF-IDF score.

TfidfVectorizer is often preferred over raw counts for text classification because it gives a more nuanced representation of the text by considering both term frequency and document frequency.
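
To make the TF and IDF steps concrete, here is a minimal pure-Python sketch on a hypothetical three-document corpus. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which is what scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenization and the function name `tfidf` are assumptions made for the example; the real vectorizer tokenizes differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical toy corpus, not taken from the dataset.
docs = ["feel happy today", "feel angry today", "feel scared"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first document, "happy" receives the largest weight because it occurs in only one document, while "feel" receives the smallest non-zero weight because it occurs in all three.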

# 3. Model Development
Train the following machine learning models:
a) Naive Bayes
b) Support Vector Machine

In [9]:
y=data['Emotion']

In [10]:
from sklearn.naive_bayes import MultinomialNB
model_nb = MultinomialNB()
model_nb.fit(X, y)

In [11]:
from sklearn.svm import SVC
model_svm = SVC(kernel='linear')
model_svm.fit(X, y)


In [14]:
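from sklearn.model_selection import train_test_split

# Caution: model_nb and model_svm were already fit on all of X and y above,
# so the X_test rows created here were also seen during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)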
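
Since the two models above are fit on the full `X` before this split is made, the held-out rows are not really unseen. One way to avoid that is to split the raw text first and fit both the vectorizer and the classifier on the training part only. The following is a minimal sketch of that ordering, assuming a hypothetical toy corpus (`texts`, `labels`) in place of `data['cleaned_text']` and `data['Emotion']`; it is not the notebook's code, and its scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the real comments and labels.
texts = [
    "i feel scared of the dark", "i am afraid and nervous",
    "i feel terrified tonight", "i am anxious and scared",
    "i feel angry at him", "i am furious and annoyed",
    "i feel irritated and angry", "i am mad and furious",
    "i feel happy today", "i am joyful and glad",
    "i feel delighted and happy", "i am cheerful and glad",
]
labels = ["fear"] * 4 + ["anger"] * 4 + ["joy"] * 4

# 1. Split the raw text first, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# 2. Each pipeline fits its vectorizer and classifier on the training part only.
models = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "SVM": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # 3. Score on rows that played no part in fitting.
    print(name, model.score(X_test, y_test))
```

Wrapping the vectorizer in the pipeline also keeps the IDF statistics from being computed on test rows, which matters if cross-validation is added later.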


# 4. Model Comparison
Evaluate the model using appropriate metrics (e.g., accuracy, F1-score). Provide a brief explanation of the chosen model and its suitability for emotion classification.

In [13]:
from sklearn.metrics import accuracy_score, f1_score
y_pred_nb = model_nb.predict(X_test)
y_pred_svm = model_svm.predict(X_test)

accuracy_nb = accuracy_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb, average='weighted')

accuracy_svm = accuracy_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm, average='weighted')

print(f'Naive Bayes - Accuracy: {accuracy_nb}, F1-Score: {f1_nb}')
print(f'SVM - Accuracy: {accuracy_svm}, F1-Score: {f1_svm}')


Naive Bayes - Accuracy: 0.984006734006734, F1-Score: 0.9840173067226208
SVM - Accuracy: 0.9915824915824916, F1-Score: 0.9915838847398599


Model Comparison
Metrics Used:
1. Accuracy: The fraction of predictions that are correct out of the total.
2. F1-Score: The harmonic mean of precision and recall, which is useful when classes are unevenly distributed. The weighted average used here weights each class's F1 by its number of true samples.

Both models (Naive Bayes and SVM) are compared using accuracy and F1-score on the test split.
Naive Bayes works well for text classification, especially with smaller datasets, and is fast and efficient for high-dimensional data.
Support Vector Machine (SVM) with a linear kernel handles high-dimensional, sparse text features well and often yields higher accuracy for text data.
Naive Bayes is typically faster to train, while SVM often performs slightly better on text, which makes it a reasonable choice for emotion classification when accuracy is crucial. In this dataset, Naive Bayes reaches an accuracy of 0.98 and an F1-score of 0.98, and SVM reaches an accuracy of 0.99 and an F1-score of 0.99. The gap is under one percentage point, so it suggests only a small advantage for SVM. Because both models were fit on the full dataset before the train/test split, the test rows were part of the training data, and these scores likely overstate performance on unseen comments.
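
To show how the two metrics can diverge, the sketch below computes accuracy and weighted F1 by hand on a small set of hypothetical labels (not taken from the dataset). The helper names `accuracy` and `weighted_f1` are made up for the example; the weighting follows the same idea as `f1_score(..., average='weighted')`, where each class's F1 is weighted by its number of true samples.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)

# Hypothetical labels: the single "anger" comment is misclassified as "fear".
y_true = ["fear", "fear", "fear", "fear", "anger", "joy"]
y_pred = ["fear", "fear", "fear", "fear", "fear", "joy"]

print(f"accuracy:    {accuracy(y_true, y_pred):.3f}")     # 0.833
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")  # 0.759
```

Accuracy only counts the one wrong prediction, while the weighted F1 is lower because the "anger" class gets an F1 of zero and the extra false positive reduces precision for "fear".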