# Task 2 — Sentiment Analysis (TF-IDF + Logistic Regression)

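**Objective:** Build and evaluate a sentiment classifier on customer reviews using TF-IDF features and logistic regression. This notebook uses the NLTK `movie_reviews` corpus (reviews labelled `pos` or `neg`) as the review dataset.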

**Deliverable:** Jupyter Notebook with preprocessing, modeling, evaluation, and saved model.


In [5]:
import nltk
nltk.download('movie_reviews')


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\shruthi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


True

In [6]:
import random

import pandas as pd
from nltk.corpus import movie_reviews

# Build (tokens, label) pairs for every file in the corpus; labels are 'neg' and 'pos'
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so the two classes are interleaved (no seed is set, so the order differs between runs)
random.shuffle(documents)

# Join the tokens back into one string per review and collect the labels
reviews = [" ".join(words) for words, _ in documents]
labels = [label for _, label in documents]

df = pd.DataFrame({'review': reviews, 'sentiment': labels})
df.head()


|   | review | sentiment |
|---|--------|-----------|
| 0 | as i write the review for the new hanks / ryan... | pos |
| 1 | mighty joe young blunders about for nearly twe... | neg |
| 2 | susan granger ' s review of " big eden " ( jou... | pos |
| 3 | " the red violin " is a cold , sterile feature... | neg |
| 4 | " the animal " is a marginally inspired comedy... | neg |
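
The next cell turns each review into a TF-IDF vector with `TfidfVectorizer`. To give a rough idea of what those weights mean, the sketch below recomputes them in plain Python on three made-up sentences. It is an illustration under stated assumptions, not the notebook's actual feature extraction: `tfidf_vectors` and `toy_docs` are hypothetical names, tokens are split on whitespace only, and stop-word removal and the 5,000-feature cap are skipped. The weighting itself (raw term count times a smoothed IDF, followed by L2 normalization) follows scikit-learn's default settings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs.

    Hypothetical helper for illustration only; it is not part of the notebook.
    """
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]          # term count * idf
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0   # guard against empty docs
        vectors.append([w / norm for w in raw])            # scale to unit length
    return vocab, vectors


# Assumed toy corpus, unrelated to the movie_reviews data
toy_docs = ["great film great cast", "dull film", "great fun"]
vocab, vectors = tfidf_vectors(toy_docs)

for doc, vec in zip(toy_docs, vectors):
    weights = {t: round(w, 3) for t, w in zip(vocab, vec) if w > 0}
    print(f"{doc!r} -> {weights}")
```

In the second toy document, "dull" occurs in only one document while "film" occurs in two, so "dull" receives the larger weight. That is the property that lets rarer, more distinctive words carry more signal into the classifier.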


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)

# TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Logistic Regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.8175

Classification Report:
               precision    recall  f1-score   support

         neg       0.78      0.86      0.82       191
         pos       0.86      0.78      0.82       209

    accuracy                           0.82       400
   macro avg       0.82      0.82      0.82       400
weighted avg       0.82      0.82      0.82       400


Confusion Matrix:
 [[165  26]
 [ 47 162]]
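
The figures in the classification report can be cross-checked against the confusion matrix. The sketch below copies the four counts from the output above and recomputes accuracy, precision, recall and F1 by hand. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`neg`, then `pos`); the variable names are illustrative only.

```python
# Counts copied from the confusion matrix printed above.
# Rows = true label, columns = predicted label, order: neg, pos.
cm = [[165, 26],
      [47, 162]]
labels = ["neg", "pos"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy = {accuracy:.4f}")

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this label
    support = sum(cm[i])                   # row sum: everything that truly has this label
    precision = true_pos / predicted
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1, support)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

The recomputed values round to the ones in the report. They also show the asymmetry in the errors: 47 positive reviews were predicted as negative, against 26 negative reviews predicted as positive, which is why recall is lower for `pos` and precision is lower for `neg`.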

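
The deliverable lists a saved model, but the cells above stop at evaluation. One possible way to persist the classifier is sketched below with `joblib`. It is a self-contained example under assumptions rather than the notebook's own code: it trains on a four-sentence toy corpus instead of the movie reviews, wraps the vectorizer and classifier in a scikit-learn `Pipeline` (a swapped-in convenience; the notebook keeps them as separate objects), and writes to a temporary directory under the hypothetical file name `sentiment_pipeline.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the training data (assumption: not the movie_reviews corpus)
toy_reviews = [
    "a wonderful and charming film",
    "brilliant acting and a great story",
    "a dull and boring mess",
    "terrible plot and awful dialogue",
]
toy_labels = ["pos", "pos", "neg", "neg"]

# Same hyperparameters as the notebook, bundled so that the vectorizer
# and the classifier are saved and restored together
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(toy_reviews, toy_labels)

# Hypothetical output location; a real notebook would pick a project path
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Reload and confirm that the restored object accepts raw text
restored = joblib.load(model_path)
print(restored.predict(["a charming story with great acting"]))
```

In the notebook itself, the already fitted `vectorizer` and `model` could be dumped in the same way. Both are needed at prediction time, because new text has to pass through the same fitted vocabulary before it reaches the classifier.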