# Movie Genre Classification
This notebook predicts a movie's genre from the text of its plot summary.
We use the [Wikipedia Movie Plots dataset](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots).
The notebook covers downloading the dataset, preprocessing, model training, and evaluation.

In [None]:
# Install necessary libraries
!pip install -q scikit-learn pandas numpy


In [None]:
# Download dataset
import os
import urllib.request

dataset_url = 'https://github.com/jrobischon/wikipedia-movie-plots/raw/master/movie_plots.csv'
dataset_path = 'movie_plots.csv'

if not os.path.exists(dataset_path):
    print('Downloading dataset...')
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print('Download complete.')
else:
    print('Dataset already exists.')


In [None]:
# Load dataset
import pandas as pd

df = pd.read_csv(dataset_path)
df.head()

## Data Preprocessing
- We will use the `Plot` column as input features and `Genre` column as labels.
- Since movies can have multiple genres, we will simplify by using the first genre listed.
- We will use TF-IDF vectorization for text features.
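
The steps above can be sketched on toy data using only the standard library. This is an illustration under stated assumptions, not what scikit-learn runs internally: the helper names `primary_genre` and `tfidf`, the `'unknown'` fallback for missing genres, and the whitespace tokenizer are all hypothetical choices. The IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as the `TfidfVectorizer` default.

```python
import math
from collections import Counter


def primary_genre(genre_field):
    """Return the first comma-separated genre with whitespace stripped.

    Missing or empty values map to 'unknown' (an assumption of this sketch).
    """
    if not isinstance(genre_field, str) or not genre_field.strip():
        return "unknown"
    return genre_field.split(",")[0].strip()


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Toy inputs invented for illustration; not taken from the dataset.
genres = ["Comedy, Romance", " drama", None]
print([primary_genre(g) for g in genres])

plots = ["detective hunts killer", "robot hunts spaceship"]
vectors = tfidf(plots)
print(vectors[0])
```

In the first toy plot, `hunts` occurs in both documents and so receives a lower weight than `detective`, which occurs in only one. That is the property that makes TF-IDF useful for separating genres by their distinctive vocabulary.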


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Drop rows with missing text or labels, then keep only the first listed genre
df = df.dropna(subset=['Plot', 'Genre']).copy()
df['Primary_Genre'] = df['Genre'].str.split(',').str[0].str.strip()

X = df['Plot']
y = df['Primary_Genre']

# Encode labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Vectorize text
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


## Model Training
We will train a Logistic Regression classifier on the TF-IDF features.
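
As a minimal sketch of this idea, the vectorizer and classifier can be chained into a single scikit-learn pipeline. The toy plots, the genre labels, and the names `toy_plots`, `toy_genres`, and `toy_model` are invented for illustration and are not part of the dataset or of the cells in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; not taken from the dataset.
toy_plots = [
    "alien spaceship lands on a distant planet",
    "astronauts fight an alien on the spaceship",
    "a robot explores a distant planet",
    "two friends fall in love at a wedding",
    "a wedding planner finds love in paris",
    "friends fall in love during a summer in paris",
]
toy_genres = ["sci-fi", "sci-fi", "sci-fi", "romance", "romance", "romance"]

# The pipeline fits TF-IDF on the training text, then fits the classifier
# on the resulting sparse feature matrix.
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
toy_model.fit(toy_plots, toy_genres)

new_plot = ["an alien robot attacks the spaceship"]
predicted = toy_model.predict(new_plot)
probabilities = toy_model.predict_proba(new_plot)  # one column per class
print(predicted[0], probabilities.round(2))
```

Wrapping both steps in one object means new text goes through the same fitted vocabulary as the training text, so the vectorizer cannot accidentally be refitted on test data.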

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# Predictions
y_pred = model.predict(X_test_tfidf)

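# Evaluation
print('Accuracy:', accuracy_score(y_test, y_pred))

# Pass labels explicitly: rare genres may be absent from the test split, and
# classification_report raises an error if target_names does not match the labels.
all_labels = list(range(len(label_encoder.classes_)))
print('\nClassification Report:\n', classification_report(
    y_test, y_pred, labels=all_labels,
    target_names=label_encoder.classes_, zero_division=0))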