
# Emotion Classification in Text Samples

This notebook implements a machine learning pipeline to classify emotions in text samples. Below are the key objectives and components:
1. **Loading and Preprocessing**: Clean and prepare the text data for analysis.
2. **Feature Extraction**: Transform text into numerical representations using `TfidfVectorizer`.
3. **Model Development**: Train and compare Naive Bayes and Support Vector Machine models.
4. **Model Evaluation and Comparison**: Evaluate and compare model performance using accuracy and weighted F1-score.



## 1. Loading and Preprocessing

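The dataset is loaded from `nlp_dataset.csv`, and each comment is cleaned with the following steps:
- **Lowercasing**: Makes matching case-insensitive, so "Happy" and "happy" map to the same token.
- **Punctuation Removal**: Strips every character in `string.punctuation`.
- **Tokenization**: Splits the text on whitespace into individual words.
- **Stopword Removal**: Drops common words (e.g., "the", "and") found in scikit-learn's `ENGLISH_STOP_WORDS`. That list also contains negations such as "not", so a phrase like "not happy" loses its negation.

These steps reduce noise and shrink the vocabulary that the vectorizer has to handle.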


In [None]:
import pandas as pd
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Load the dataset; later cells expect 'Comment' (text) and 'Emotion' (label) columns
dataset = pd.read_csv('nlp_dataset.csv')

# Define preprocessing function
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = text.split()  # Tokenize
    tokens = [word for word in tokens if word not in ENGLISH_STOP_WORDS]  # Remove stopwords
    return ' '.join(tokens)

# Apply preprocessing
dataset['Processed_Comment'] = dataset['Comment'].apply(preprocess_text)
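As a quick sanity check, the sketch below runs the same four steps on a single made-up sentence. The function body is repeated so the block runs by itself, and the sample sentence is an assumption, not a row from `nlp_dataset.csv`. It also makes one side effect visible: because "not" is in `ENGLISH_STOP_WORDS`, the negation disappears along with the other stop words.

```python
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = text.split()  # Tokenize on whitespace
    tokens = [word for word in tokens if word not in ENGLISH_STOP_WORDS]  # Remove stopwords
    return ' '.join(tokens)

# Hypothetical sample sentence, used only for illustration
sample = "I am NOT happy about this!"
cleaned = preprocess_text(sample)
print(repr(cleaned))

# Negations are part of the stop word list, so they are filtered out too
print("not" in ENGLISH_STOP_WORDS)
```

If negation matters for the emotions being classified, one option is to filter against a custom stop word set that leaves such words in.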



## 2. Feature Extraction

Text data is converted into numerical features using `TfidfVectorizer`:
- **TF-IDF (Term Frequency-Inverse Document Frequency)** weights each term by how often it occurs in a document, scaled down by how many documents in the corpus contain it.
- `max_features=500` keeps only the 500 terms with the highest frequency across the corpus, which bounds the width of the feature matrix.

As a result, terms that appear in almost every comment receive low weights, while terms concentrated in a few comments receive high weights.
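To make the weighting concrete, here is a minimal sketch on a three-document toy corpus (the documents are invented for illustration and are not from the dataset). It prints the learned IDF weight of each term and then shows which terms survive when `max_features` is set to 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
docs = [
    "feel happy today",
    "feel sad today",
    "feel angry",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# idf_ holds one inverse-document-frequency weight per vocabulary term
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}
for term in sorted(idf, key=idf.get):
    print(f"{term:>6}  idf={idf[term]:.3f}")

# max_features keeps only the most frequent terms across the corpus
small = TfidfVectorizer(max_features=2).fit(docs)
print(sorted(small.vocabulary_))
```

"feel" occurs in every document and gets the lowest IDF, while "happy", "sad", and "angry" each occur once and get the highest. With `max_features=2`, only the two most frequent terms remain in the vocabulary.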


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

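# Keep only the 500 most frequent terms across the corpus
tfidf_vectorizer = TfidfVectorizer(max_features=500)

# Learn the vocabulary and IDF weights, then build the document-term matrix.
# The result stays sparse; MultinomialNB and SVC both accept sparse input.
X = tfidf_vectorizer.fit_transform(dataset['Processed_Comment'])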

# Extract target labels
y = dataset['Emotion']



## 3. Model Development

Two models are trained to classify emotions:
1. **Naive Bayes (MultinomialNB)**: A probabilistic model that works well with count-like text features such as TF-IDF.
2. **Support Vector Machine (SVC)**: A discriminative model, used here with a linear kernel, that is effective in high-dimensional feature spaces.

The data is split 80/20 into training and test sets (`test_size=0.2`, `random_state=42`) so that both models are evaluated on the same held-out samples.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
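In the cells above, the vectorizer is fitted on the full dataset before the split, so the vocabulary and IDF weights are influenced by the test comments. One common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, which fits the vectorizer on the training portion only. The sketch below assumes a small hypothetical list of texts and labels in place of `nlp_dataset.csv`; the `stratify` argument is an optional addition that keeps class proportions equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data; in the notebook these would be
# dataset['Processed_Comment'] and dataset['Emotion']
texts = [
    "love sunny day", "feel great today", "wonderful happy news",
    "really enjoy music", "smile makes happy",
    "scared dark room", "afraid loud noise", "terrified spiders",
    "fear heights", "nervous scared exam",
]
labels = ["joy"] * 5 + ["fear"] * 5

# Split the raw text first so the test set never reaches the vectorizer's fit step
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

models = {
    "Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=500)),
        ("clf", MultinomialNB()),
    ]),
    "SVM": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=500)),
        ("clf", SVC(kernel="linear")),
    ]),
}

for name, model in models.items():
    model.fit(text_train, y_train)  # fits TF-IDF and the classifier on training text only
    print(name, "test accuracy:", model.score(text_test, y_test))
```

With only ten samples the accuracies printed here carry no meaning; the point is the order of operations, not the numbers.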



## 4. Model Evaluation and Comparison

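The models are evaluated on the held-out test set using the following metrics:
- **Accuracy**: Proportion of test samples whose predicted emotion matches the true label.
- **F1-Score (Weighted)**: The F1-score (harmonic mean of precision and recall) is computed per class and then averaged, with each class weighted by its number of true samples. This accounts for class imbalance.

Both metrics are printed for each model so the two can be compared directly.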
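A small worked example may help when reading these numbers. The labels below are invented for illustration: three "joy" samples and one "fear" sample, with one "joy" misclassified as "fear". The F1-score for "joy" is 0.8 (precision 1, recall 2/3) and for "fear" is about 0.667 (precision 1/2, recall 1), so the weighted average is (3 × 0.8 + 1 × 0.667) / 4 ≈ 0.767, while the unweighted (macro) average is about 0.733.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical labels, chosen so the arithmetic is easy to follow by hand
y_true = ["joy", "joy", "joy", "fear"]
y_pred = ["joy", "joy", "fear", "fear"]

accuracy = accuracy_score(y_true, y_pred)                    # 3 of 4 correct
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class support
macro_f1 = f1_score(y_true, y_pred, average="macro")        # every class counts equally

print(f"accuracy={accuracy:.3f}  weighted_f1={weighted_f1:.3f}  macro_f1={macro_f1:.3f}")

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_pred))
```

Comparing the weighted and macro averages is a quick way to spot a model that does well on frequent emotions but poorly on rare ones.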


In [None]:
from sklearn.metrics import accuracy_score, f1_score

# Evaluate Naive Bayes
nb_predictions = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_f1 = f1_score(y_test, nb_predictions, average='weighted')

# Evaluate SVM
svm_predictions = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_f1 = f1_score(y_test, svm_predictions, average='weighted')

# Print results, rounded to four decimal places for easier comparison
print(f"Naive Bayes - Accuracy: {nb_accuracy:.4f}, F1-Score: {nb_f1:.4f}")
print(f"SVM - Accuracy: {svm_accuracy:.4f}, F1-Score: {svm_f1:.4f}")



## Conclusion

This notebook demonstrates an end-to-end pipeline for emotion classification in text samples.
- Naive Bayes and Support Vector Machine models were trained on TF-IDF features.
- Both models were evaluated on a held-out test set using accuracy and weighted F1-score to identify the better performer.

Further improvements could include hyperparameter tuning, exploring deep learning models, and visualizing results.
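As one possible starting point for hyperparameter tuning, the sketch below uses `GridSearchCV` over a TF-IDF plus linear SVM pipeline. The texts, labels, and grid values are all hypothetical placeholders chosen so the block runs by itself; a real search would use the notebook's comments and a grid suited to the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data; replace with the notebook's comments and emotions
texts = [
    "love sunny day", "feel great today", "wonderful happy news",
    "really enjoy music", "smile makes happy", "great fun party",
    "scared dark room", "afraid loud noise", "terrified spiders",
    "fear heights", "nervous scared exam", "afraid dark night",
]
labels = ["joy"] * 6 + ["fear"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="linear")),
])

# Illustrative grid: vocabulary size and SVM regularization strength
param_grid = {
    "tfidf__max_features": [50, 500],
    "clf__C": [0.1, 1, 10],
}

# 3-fold cross-validation, scored with the same weighted F1 used above
search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=3)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated weighted F1:", round(search.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so the cross-validation scores are not influenced by the held-out fold.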
