# Sentiment Analysis Lab: Movie Review Classification

**Objective:** Train a machine learning model to classify movie reviews as positive or negative.

**Dataset:** movie_reviews.csv

## Step 1: Import Libraries

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

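ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

**Note:** This error usually means a compiled dependency (such as pandas or scikit-learn) was built against a different NumPy version than the one installed. Reinstalling or upgrading the affected packages so their versions are compatible typically resolves it. The remaining cells assume the imports succeeded.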

## Step 2: Load the Dataset

In [None]:
# Load the movie reviews dataset
df = pd.read_csv('movie_reviews.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:\n{df.head()}")

## Step 3: Extract Text and Label Columns

In [None]:
# Extract features (X) and target (y)
X = df['text']  # Movie reviews
y = df['label']  # Sentiment labels (positive/negative)

print(f"Number of reviews: {len(X)}")
print(f"Label distribution:\n{y.value_counts()}")
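
Before vectorizing, it can help to check the data for missing values and duplicate rows, because `TfidfVectorizer` raises an error on missing (NaN) documents. The following is a minimal sketch on a hypothetical in-memory CSV; the rows, and the names `sample_df` and `clean_df`, are illustrative assumptions and not the contents of `movie_reviews.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for movie_reviews.csv; the real file's contents are not shown
csv_text = """text,label
"Loved every minute",positive
"Dull and far too long",negative
"Loved every minute",positive
,negative
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Count rows with a missing review or label, and exact duplicate rows
n_missing = int(sample_df[["text", "label"]].isna().any(axis=1).sum())
n_duplicates = int(sample_df.duplicated().sum())

# Drop both kinds of problem rows before extracting X and y
clean_df = sample_df.dropna(subset=["text", "label"]).drop_duplicates()

print(f"Rows with missing values: {n_missing}")
print(f"Duplicate rows: {n_duplicates}")
print(f"Rows kept: {len(clean_df)} of {len(sample_df)}")
```

Whether duplicates should be dropped depends on the dataset; identical reviews that land in both the training and test sets can inflate the test score.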

## Step 4: Transform Text into Numerical Features using TF-IDF

In [None]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

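# Fit and transform the text data.
# Caution: fitting on the full dataset before the split in Step 5 lets the test
# reviews influence the vocabulary and IDF weights, so test scores can be slightly optimistic.
X_vectorized = vectorizer.fit_transform(X)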

print(f"Feature matrix shape: {X_vectorized.shape}")
print(f"Number of unique words (features): {len(vectorizer.get_feature_names_out())}")

## Step 5: Split Data into Training and Testing Sets (80/20)

In [None]:
# Split the data 80/20; stratify=y keeps the label proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_vectorized, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
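
An alternative ordering, sketched below, splits the raw text first and lets a scikit-learn pipeline fit the vectorizer on the training reviews only, so the test reviews stay unseen until prediction. This is one possible arrangement, not the lab's required method; the toy reviews and variable names are assumptions standing in for `df['text']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['text'] and df['label']
texts = [
    "A wonderful, moving story",
    "Brilliant acting and a great script",
    "I loved the soundtrack",
    "Funny, warm, and charming",
    "One of the best films this year",
    "A boring, lifeless plot",
    "Terrible pacing and weak acting",
    "I hated the ending",
    "Dull, slow, and forgettable",
    "One of the worst films this year",
]
labels = ["positive"] * 5 + ["negative"] * 5

# Split the raw text first so the vectorizer never sees the test reviews
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary from the training text only, then trains the classifier
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(text_train, label_train)

# predict() applies the already-fitted vectorizer to the test text before classifying
pipeline_pred = pipeline.predict(text_test)
print(f"Training reviews: {len(text_train)}")
print(f"Test reviews: {len(text_test)}, predictions: {len(pipeline_pred)}")
```

Because the pipeline bundles both steps, the same object can later be applied directly to new raw review strings.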

## Step 6: Initialize and Train the Logistic Regression Model

In [None]:
# Initialize the model
model = LogisticRegression(max_iter=1000)

# Train the model
model.fit(X_train, y_train)

print("Model training complete!")

## Step 7: Make Predictions on the Test Set

In [None]:
# Predict labels for test set
y_pred = model.predict(X_test)

print(f"Predictions made for {len(y_pred)} test samples")

## Step 8: Calculate and Print Accuracy Score

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy Score: {accuracy:.2f}")
print(f"Accuracy Percentage: {accuracy * 100:.2f}%")
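
An accuracy score is easier to interpret next to a simple baseline. The sketch below, using hypothetical imbalanced labels, shows the accuracy obtained by always predicting the most frequent class; a trained model should beat this number to be considered useful. In practice the baseline would be fit on `y_train` and scored on `y_test`; here a single toy label list is used for brevity.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 8 positive, 2 negative
y_true_toy = ["positive"] * 8 + ["negative"] * 2

# The baseline ignores its input features, so a placeholder column is enough
X_placeholder = [[0]] * len(y_true_toy)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true_toy)
baseline_accuracy = accuracy_score(y_true_toy, baseline.predict(X_placeholder))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

With these labels the baseline reaches 0.80 without learning anything about the reviews, which is why accuracy should be read together with the label distribution printed in Step 3.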

## Step 9: Print Classification Report

In [None]:
# Display detailed classification metrics
print("Classification Report:")
print(classification_report(y_test, y_pred))

## Key Insight

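The goal is to assess how well the model generalizes to new, unseen reviews. Accuracy gives the overall fraction of correct predictions, and the classification report adds per-class metrics:
- **Precision**: Of all reviews predicted as a given class (positive or negative), what fraction actually belong to that class?
- **Recall**: Of all reviews that actually belong to a given class, what fraction did the model identify?
- **F1-Score**: The harmonic mean of precision and recall, which is high only when both are high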
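
The three metrics listed above can be computed by hand from counts of true positives, false positives, and false negatives. The sketch below does this for the "positive" class on hypothetical labels and cross-checks the result against scikit-learn; the `_demo` lists are illustrative assumptions, not output from the lab's model.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels
y_true_demo = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred_demo = ["positive", "positive", "negative", "positive", "negative", "negative"]

pairs = list(zip(y_true_demo, y_pred_demo))

# Outcome counts for the "positive" class
tp = sum(t == "positive" and p == "positive" for t, p in pairs)  # correctly predicted positive
fp = sum(t == "negative" and p == "positive" for t, p in pairs)  # predicted positive, actually negative
fn = sum(t == "positive" and p == "negative" for t, p in pairs)  # missed positives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn for the same class
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, pos_label="positive", average="binary"
)

print(f"Manual:       precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print(f"scikit-learn: precision={sk_precision:.3f}, recall={sk_recall:.3f}, f1={sk_f1:.3f}")
```

The classification report repeats this calculation once per class, which is why it lists a separate row for each label.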

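High accuracy on the test set suggests the model learned patterns from the training data that carry over to new movie reviews. Accuracy alone can mislead when one label is much more common than the other, so read it alongside the per-class precision and recall in the report.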