# Cell 1: Import Libraries

This cell imports the necessary Python libraries for the project. These libraries provide various functions for data manipulation, text processing, and machine learning.

-   `numpy`: For numerical operations.
-   `pandas`: For data manipulation using DataFrames.
-   `nltk`: Natural Language Toolkit for text processing (tokenization, stopwords, lemmatization).
-   `re`: Regular expression operations for cleaning text.
-   `string`: For string-related operations.
-   `stopwords`: From `nltk.corpus`, for removing common English words.
-   `word_tokenize`: From `nltk.tokenize`, for splitting text into words.
-   `WordNetLemmatizer`: From `nltk.stem`, for reducing words to their base form.
-   `TfidfVectorizer`: From `sklearn.feature_extraction.text`, for converting text to numerical features using TF-IDF.
-   `train_test_split`: From `sklearn.model_selection`, for splitting data into training and testing sets.
-   `SVC`: From `sklearn.svm`, the Support Vector Classifier (its default kernel is RBF). It is imported here but not used in the cells below, which train a Logistic Regression model instead.
-   `accuracy_score`, `classification_report`: From `sklearn.metrics`, for evaluating model performance.
-   NLTK resources downloaded at the end of the cell:
    -   `punkt`: Tokenizer models for splitting text into sentences and words.
    -   `punkt_tab`: Punkt data that the tokenizer relies on to determine where sentences end.
    -   `stopwords`: A list of common English words to be removed from the text.
    -   `wordnet`: A lexical database used for lemmatization (reducing words to their base or dictionary form).

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer  # Using lemmatizer instead of stemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')  # For lemmatization

[nltk_data] Downloading package punkt to C:\Users\Yashasvi
[nltk_data]     Acharya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Yashasvi
[nltk_data]     Acharya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Yashasvi
[nltk_data]     Acharya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Yashasvi
[nltk_data]     Acharya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Load Data and Initial Exploration

This cell loads the training and test datasets using pandas. The data is assumed to be in a ":::" separated format. It also displays the first few rows of each dataset and provides some basic information about the training dataset.

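-   Loads data from the specified paths, using ":::" as the separator. A multi-character separator requires `engine="python"`.
-   Assigns column names to each DataFrame; the test file is assumed to have no `GENRE` column.
-   Displays the first few rows of both datasets.
-   Prints a summary of the training data: column dtypes and non-null counts.
-   Prints the number of missing values in each column of the training data.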
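
To see what splitting on ":::" does to a single record, here is a minimal plain-Python sketch. The sample line is hypothetical and assumes the file pads the separator with spaces, as the padded values in the output suggest. It illustrates the splitting only and is not how pandas implements it.

```python
# Hypothetical record in the assumed "ID ::: TITLE ::: GENRE ::: DESCRIPTION" layout.
line = "1 ::: Oscar et la dame rose (2009) ::: drama ::: Listening in to a conversation"

# Splitting on the bare separator keeps the spaces around each field.
raw_fields = line.split(":::")

# Stripping each field removes that padding.
clean_fields = [field.strip() for field in raw_fields]

print(repr(raw_fields[2]))    # padded genre
print(repr(clean_fields[2]))  # genre without padding
```

If the padding matters downstream (for example, when comparing genre labels), a call such as `train_data["GENRE"].str.strip()` would remove it from a DataFrame column.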


In [3]:
# Load training and test data with correct column names
train_path = "train_data.txt"
test_path = "test_data.txt"

train_data = pd.read_csv(train_path, sep=":::", names=["ID", "TITLE", "GENRE", "DESCRIPTION"], engine="python")
test_data = pd.read_csv(test_path, sep=":::", names=["ID", "TITLE", "DESCRIPTION"], engine="python")  # Assuming no GENRE in test data

# Display first few rows
print("Training Data:")
print(train_data.head())
print("\nTest Data:")
print(test_data.head())

# Check for missing values and data info
print("\nTraining Data Info:")
train_data.info()  # info() prints its summary directly and returns None
print("\nMissing Values in Training Data:")
print(train_data.isnull().sum())

Training Data:
   ID                               TITLE       GENRE  \
0   1       Oscar et la dame rose (2009)       drama    
1   2                       Cupid (1997)    thriller    
2   3   Young, Wild and Wonderful (1980)       adult    
3   4              The Secret Sin (1915)       drama    
4   5             The Unrecovered (2007)       drama    

                                         DESCRIPTION  
0   Listening in to a conversation between his do...  
1   A brother and sister with a past incestuous r...  
2   As the bus empties the students for their fie...  
3   To help their unemployed father make ends mee...  
4   The film's title refers not only to the un-re...  

Test Data:
   ID                          TITLE  \
0   1          Edgar's Lunch (1998)    
1   2      La guerra de papá (1977)    
2   3   Off the Beaten Track (2010)    
3   4        Meu Amigo Hindu (2015)    
4   5             Er nu zhai (1955)    

                                         DESCRIPTION  
0   

# Data Preprocessing

## Text Cleaning Function

This cell defines a function called `clean_text` that performs text cleaning and preprocessing. This function takes a text string as input and performs the following operations:

-   Handle Non-String Input: Returns an empty string if the input is not a string (for example, NaN).
-   Lowercase Conversion: Converts the text to lowercase.
-   Remove Punctuation and Numbers: Deletes every character that is not a lowercase letter or whitespace. Because the characters are deleted rather than replaced with spaces, hyphenated words are merged (for example, "year-old" becomes "yearold").
-   Tokenization: Splits the text into individual words (tokens).
-   Stopword Removal and Lemmatization: Drops common English stop words and tokens of two or fewer characters, then reduces each remaining word to its base form (lemma).
-   Join Tokens: Joins the cleaned tokens back into a single string.

This function helps standardize the text data and remove noise, making it more suitable for machine learning models. It is applied to the "DESCRIPTION" column of both the training and test sets.
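
The effect of the punctuation step can be checked in isolation. The sketch below applies the same pattern as `clean_text` to a made-up sentence and compares it with a variant that substitutes a space instead of deleting; the sentence and the variant are illustrative assumptions, not part of the notebook.

```python
import re

# Made-up sentence containing digits and hyphenated words.
sample = "A 10-year-old meets a straight-talking lady."
lowered = sample.lower()

# Same pattern as clean_text: unwanted characters are deleted.
deleted = re.sub(r"[^a-z\s]", "", lowered)

# Variant (assumption): unwanted characters become spaces instead.
spaced = re.sub(r"[^a-z\s]", " ", lowered)

deleted_tokens = deleted.split()
spaced_tokens = spaced.split()

print(deleted_tokens)  # hyphenated words are fused into single tokens
print(spaced_tokens)   # hyphenated words are split into their parts
```

Deleting produces fused tokens such as "yearold" and "straighttalking", which also appear in the sample output of the cell below. Whether fused or split tokens work better as features would need to be tested on the validation set.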


In [4]:
# Initialize lemmatizer and stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Handle non-string input (e.g., NaN)
    if not isinstance(text, str):
        return ''
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2]
    return ' '.join(tokens)

# Apply preprocessing
train_data["TextCleaning"] = train_data["DESCRIPTION"].apply(clean_text)
test_data["TextCleaning"] = test_data["DESCRIPTION"].apply(clean_text)

# Display a sample
print("Sample Processed Training Description:")
print(train_data["TextCleaning"].iloc[0])

Sample Processed Training Description:
listening conversation doctor parent yearold oscar learns nobody courage tell week live furious refuse speak anyone except straighttalking rose lady pink meet hospital stair christmas approach rose us fantastical experience professional wrestler imagination wit charm allow oscar live life love full company friend pop corn einstein bacon childhood sweetheart peggy blue


# Preparing The Data


## TF-IDF Vectorization

This cell uses the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to convert the cleaned text data into numerical features that can be used by the machine learning model.

-   `TfidfVectorizer`: Converts text documents into a matrix of TF-IDF features.
-   `max_features=10000`: Keeps only the 10,000 terms with the highest frequency across the training corpus.
-   `ngram_range=(1, 2)`: Considers both unigrams and bigrams.

The code first holds out 20% of the training data as a validation set. It then fits the vectorizer on the remaining training split only and uses it to transform the training, validation, and test descriptions into TF-IDF feature matrices.
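
As a rough illustration of what the TF-IDF weights represent, here is a plain-Python sketch using the smoothed IDF and L2 normalization that scikit-learn documents as its defaults. The three-document corpus is hypothetical, only unigrams are counted, and the code is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned toy corpus.
docs = ["oscar meet rose", "rose lady pink", "oscar live life"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
print(vectors[0])
```

Terms that occur in fewer documents ("meet") receive a larger weight than terms shared across documents ("oscar"), which is the property the classifier relies on.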


In [5]:
X = train_data["TextCleaning"]
y = train_data["GENRE"]
X_train_split, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))

X_train = vectorizer.fit_transform(X_train_split)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(test_data["TextCleaning"])

print("Training Features Shape:", X_train.shape)
print("Validation Features Shape:", X_val.shape)
print("Test Features Shape:", X_test.shape)

Training Features Shape: (43371, 10000)
Validation Features Shape: (10843, 10000)
Test Features Shape: (54200, 10000)


## Grid Search for Logistic Regression

This cell performs hyperparameter tuning for Logistic Regression using Grid Search. It finds the best value for the regularization parameter C using cross-validation.

-   `param_grid`: Defines the hyperparameters and their candidate values.
    *   `C`: Inverse of regularization strength (smaller values regularize more). The grid tries 0.1, 1.0, and 10.0.
-   `GridSearchCV`: Performs an exhaustive search over the specified parameter grid.
    *   `LogisticRegression(max_iter=1000, class_weight='balanced')`: The base model. `class_weight='balanced'` weights classes inversely to their frequency.
    *   `param_grid`: The hyperparameter grid.
    *   `cv=3`: 3-fold cross-validation.
    *   `scoring='accuracy'`: Uses accuracy as the scoring metric.
    *   `n_jobs=-1`: Runs the fits on all available CPU cores.

The code then fits the `GridSearchCV` object on the training data to find the best value of `C`. It prints the best parameters and the best cross-validation score, and keeps the best estimator as `model`.
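
The search loop itself can be sketched in plain Python. In the sketch below, `kfold_indices` and `toy_score` are hypothetical helpers: the first produces contiguous, unstratified folds (scikit-learn stratifies folds for classifiers), and the second is an arbitrary stand-in for "fit on the training fold, score on the validation fold".

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def toy_score(c, train_idx, val_idx):
    # Hypothetical stand-in for fitting a model with parameter c and scoring it.
    return 1.0 - abs(c - 1.0) / 10.0

candidates = [0.1, 1.0, 10.0]
n_samples, k = 10, 3

mean_scores = {}
n_fits = 0
for c in candidates:
    fold_scores = []
    for train_idx, val_idx in kfold_indices(n_samples, k):
        fold_scores.append(toy_score(c, train_idx, val_idx))
        n_fits += 1
    mean_scores[c] = sum(fold_scores) / len(fold_scores)

# The candidate with the highest mean validation score wins.
best_c = max(mean_scores, key=mean_scores.get)
print(n_fits, best_c)
```

With three candidates and three folds the search performs nine fits. With its default settings, `GridSearchCV` then refits the winning configuration once more on the full training data.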


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for C
param_grid = {'C': [0.1, 1.0, 10.0]}

# Initialize the base model
base_model = LogisticRegression(max_iter=1000, class_weight='balanced')

# Set up Grid Search with 3-fold cross-validation
grid = GridSearchCV(estimator=base_model, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Train with Grid Search
print("Running Grid Search for Logistic Regression...")
grid.fit(X_train, y_train)
print("Grid Search completed.")

# Extract the best model
model = grid.best_estimator_
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.4f}")

Running Grid Search for Logistic Regression...
Grid Search completed.
Best Parameters: {'C': 10.0}
Best Cross-Validation Accuracy: 0.5375


# Model Evaluation and Metrics

This cell evaluates the performance of the trained Logistic Regression model.

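-   Training accuracy: accuracy of the model on the training data.
-   Validation accuracy: accuracy of the model on the held-out validation data.
-   Classification report: precision, recall, and F1-score for each class on the validation data.

In the output below, training accuracy (0.8076) is well above validation accuracy (0.5395), which suggests that the model overfits the training split.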
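
For reference, the per-class numbers in a classification report can be reproduced by hand. The sketch below computes them in plain Python on a small set of hypothetical labels; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
# Hypothetical true and predicted genres for six films.
y_true = ["drama", "drama", "comedy", "drama", "comedy", "horror"]
y_pred = ["drama", "comedy", "comedy", "drama", "drama", "horror"]

def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, F1, and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}

# Accuracy is the share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = {label: per_class_metrics(y_true, y_pred, label) for label in sorted(set(y_true))}

print(f"accuracy: {accuracy:.4f}")
for label, metrics in report.items():
    print(label, metrics)
```

Because accuracy counts every sample equally, it is dominated by the largest classes, so the per-class rows are the better guide to how rare genres fare.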


In [7]:
train_accuracy = model.score(X_train, y_train)
print(f"Training Accuracy: {train_accuracy:.4f}")

y_pred_val = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_pred_val)
print(f"Validation Accuracy: {val_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_val, y_pred_val))

Training Accuracy: 0.8076
Validation Accuracy: 0.5395

Classification Report:
               precision    recall  f1-score   support

      action        0.34      0.44      0.38       263
       adult        0.51      0.57      0.54       112
   adventure        0.19      0.29      0.23       139
   animation        0.24      0.24      0.24       104
   biography        0.07      0.05      0.06        61
      comedy        0.56      0.56      0.56      1443
       crime        0.17      0.23      0.19       107
 documentary        0.78      0.68      0.73      2659
       drama        0.68      0.53      0.59      2697
      family        0.19      0.29      0.23       150
     fantasy        0.18      0.12      0.14        74
   game-show        0.71      0.75      0.73        40
     history        0.05      0.04      0.05        45
      horror        0.58      0.70      0.63       431
       music        0.45      0.69      0.54       144
     musical        0.17      0.20      0

In [12]:
import joblib
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
le = LabelEncoder()

# Fit LabelEncoder on the training labels (fit only; the model itself was trained on the raw string labels)
le.fit(train_data["GENRE"])

# Save the trained model
joblib.dump(model, 'movie_genre_model.pkl')

# Save the TF-IDF vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

# Save the LabelEncoder
joblib.dump(le, 'label_encoder.pkl')

print("Model, vectorizer, and label encoder saved.")


Model, vectorizer, and label encoder saved.


# Generate and Save Predictions

This cell uses the trained Logistic Regression model to generate predictions on the test data and saves the predictions to a CSV file.

-   Predicts a genre for each description in the test data.
-   Creates a submission file, `submission.csv`, with the `ID`, `TITLE`, and `PREDICTED_GENRE` columns.


In [8]:
test_predictions = model.predict(X_test)
test_data['PREDICTED_GENRE'] = test_predictions

output = test_data[['ID', 'TITLE', 'PREDICTED_GENRE']]
output.to_csv('submission.csv', index=False)

print("Predictions saved to 'submission.csv'.")
print("Sample Predictions:")
print(output.head())

Predictions saved to 'submission.csv'.
Sample Predictions:
   ID                          TITLE PREDICTED_GENRE
0   1          Edgar's Lunch (1998)          comedy 
1   2      La guerra de papá (1977)           drama 
2   3   Off the Beaten Track (2010)     documentary 
3   4        Meu Amigo Hindu (2015)           drama 
4   5             Er nu zhai (1955)         romance 


