Movie Metadata Analysis

The Movie Metadata Analysis project leverages machine learning to predict the release year of a director's next movie and its probable genres. The project covers extensive data cleaning, feature engineering, visualization, modeling, and evaluation.

Project Overview

This project uses a dataset of movie metadata to predict key aspects of a director's next movie from their past work. It covers the full data science workflow, from cleaning and feature engineering to modeling and evaluation.

Features

Data Cleaning: Resolves missing data issues using effective imputation strategies.
Feature Engineering: Creates binary features for genres and calculates relevant statistics for directors.
Data Visualization: Provides various visualizations to understand the data better.
Predictive Modeling: Develops machine learning models for regression and classification.

Data Description

The original dataset is a collection of movie metadata. It contains 5,043 entries (one per movie) and 28 columns, each describing one aspect of a movie, such as its title, director, or financial performance. The following is a brief overview of the dataset:

Columns: The dataset contains various attributes of a movie, including:
- color: Indicates whether the movie is in color or black-and-white.
- director_name: The name of the movie's director.
- num_critic_for_reviews: The number of critic reviews received.
- duration: The duration of the movie in minutes.
- gross: The gross earnings of the movie.
- genres: The genres associated with the movie.
- title_year: The year the movie was released.
- imdb_score: The IMDb rating of the movie.
Data Quality: Some columns contain missing values, including color, director_name, gross, content_rating, budget, and title_year.
Data Types: The dataset includes a mix of categorical and numerical data:
- Categorical columns: color, director_name, genres, movie_title, language, country, content_rating, and movie_imdb_link.
- Numerical columns: num_critic_for_reviews, duration, gross, num_voted_users, cast_total_facebook_likes, facenumber_in_poster, num_user_for_reviews, budget, title_year, imdb_score, aspect_ratio, and movie_facebook_likes.

Installation

To set up the project locally:

Clone the repository:

git clone https://github.com/ascender1729/CDS-IISC-P1-DataSci-PreDoc.git

Navigate to the project directory:
```
cd CDS-IISC-P1-DataSci-PreDoc
```

Create a virtual environment and install dependencies (macOS/Linux shown; on Windows, activate with .\env\Scripts\activate instead):

python -m venv env
source env/bin/activate
pip install -r requirements.txt

Usage

Activate the virtual environment:
- On Windows: .\env\Scripts\activate
- On macOS/Linux: source env/bin/activate
Run the Jupyter notebook:
```
jupyter notebook
```
Open and execute the cells in the relevant notebook.

Analysis

Data Cleaning

To produce the enhanced dataset, missing values in both numerical and categorical columns were addressed as follows:

Numerical columns: Missing values were filled with their median values, maintaining the central tendency of the data.
Categorical columns: Missing values were filled with their mode values or placeholders where appropriate, depending on the context.
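
The two imputation rules above can be sketched with pandas as follows. This is a minimal illustration rather than the notebook's actual code: the toy values, the `impute_missing` helper, and the choice of `director_name` as the placeholder column with the value "Unknown" are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame reusing a few of the dataset's column names; the values are made up.
df = pd.DataFrame({
    "duration": [120.0, np.nan, 95.0, 110.0],
    "gross": [1.0e6, 5.0e6, np.nan, 3.0e6],
    "color": ["Color", None, "Color", "Black and White"],
    "director_name": ["A", "B", None, "A"],
})


def impute_missing(frame, placeholder_cols=("director_name",), placeholder="Unknown"):
    """Hypothetical helper: median for numeric columns, mode or placeholder otherwise."""
    out = frame.copy()
    num_cols = out.select_dtypes(include="number").columns
    # Numeric columns: fill each with its own median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # Remaining (categorical) columns: placeholder where listed, else the mode.
    for col in out.columns.difference(num_cols):
        if col in placeholder_cols:
            out[col] = out[col].fillna(placeholder)
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


clean = impute_missing(df)
print(clean)
```

A placeholder is used for `director_name` here because filling in the most frequent director would attribute movies to the wrong person; which columns get which treatment in the real project is not stated.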

Feature Engineering

New features were engineered to enhance the predictive models:

Binary features for each genre: One binary (0/1) feature was created per genre, allowing multi-label classification of movies by genre.
Director-specific features: Features such as average release intervals, average gross earnings, and intervals between consecutive releases were calculated to provide insights into a director’s typical behavior and success.
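
A minimal sketch of both feature groups, assuming the `genres` column stores pipe-separated strings (an assumption about the CSV) and using made-up rows. The names `release_interval`, `average_gross`, and `average_release_interval` are chosen for illustration and may not match the notebook exactly.

```python
import pandas as pd

# Toy data; the pipe-separated genre format is an assumption.
movies = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B"],
    "title_year": [2000, 2004, 2010, 1995, 1998],
    "gross": [10.0, 20.0, 30.0, 5.0, 15.0],
    "genres": ["Action|Drama", "Drama", "Comedy|Drama", "Horror", "Horror|Thriller"],
})

# One 0/1 column per genre (multi-label targets).
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Years between consecutive releases by the same director.
movies = movies.sort_values(["director_name", "title_year"])
movies["release_interval"] = movies.groupby("director_name")["title_year"].diff()

# Per-director aggregates; the first movie of each director has no interval,
# and the mean skips that missing value.
director_stats = movies.groupby("director_name").agg(
    average_gross=("gross", "mean"),
    average_release_interval=("release_interval", "mean"),
)

print(genre_flags)
print(director_stats)
```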

Visualization

To better understand the data, various visualizations were used:

- Correlation heatmap
- Distributions of director name, content rating, color, language, and country
- Scatter plots against gross: duration, budget, title year, and actor 1, actor 2, and actor 3 Facebook likes
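
As an illustration of the first plot type, a correlation heatmap can be drawn from `DataFrame.corr()` with plain matplotlib. The data below is synthetic and the styling is an assumption; the project's own plots may have been produced differently.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for three numerical columns of the dataset.
rng = np.random.default_rng(0)
budget = rng.uniform(1, 100, size=200)
df = pd.DataFrame({
    "budget": budget,
    "gross": 2 * budget + rng.normal(0, 20, size=200),
    "duration": rng.normal(110, 15, size=200),
})

# Pairwise Pearson correlations between the numerical columns.
corr = df.corr()

fig, ax = plt.subplots(figsize=(4, 3.5))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```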

Modeling

The project involved two predictive models for distinct yet related objectives:

Release Year Prediction

To predict the release year of a director's next movie, a Gradient Boosting Regressor was utilized. This method was chosen for its ability to handle complex relationships and its effectiveness in regression tasks. The following steps were taken:

Feature Selection:
- The relevant features, such as average_gross, average_release_interval, and director_facebook_likes, were identified to focus the model on the most impactful variables.
Hyperparameter Tuning:
- The Gradient Boosting Regressor was fine-tuned using Grid Search Cross-Validation to identify the best combination of hyperparameters.
- Parameters such as n_estimators, max_depth, and learning_rate were varied to improve the model's predictive performance.
Cross-Validation:
- The model's robustness was tested using cross-validation, ensuring that the selected hyperparameters provided consistent performance across different data subsets.
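
The tuning and cross-validation steps above might look like the following scikit-learn sketch. The data is synthetic, and the grid values, fold counts, and scoring choices are illustrative assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-ins for the three features named above.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(1e6, 1e8, n),  # average_gross
    rng.uniform(1, 6, n),      # average_release_interval
    rng.uniform(0, 2e4, n),    # director_facebook_likes
])
# Made-up target: a "next release year" driven by the release interval.
y = 1990 + 3 * X[:, 1] + rng.normal(0, 1, n)

# Small illustrative grid over the hyperparameters mentioned above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# Check how stable the selected configuration is across folds.
cv_scores = cross_val_score(search.best_estimator_, X, y, cv=5, scoring="r2")
print(search.best_params_, cv_scores.mean())
```

Scoring the tuned model on the same data that selected its hyperparameters gives a somewhat optimistic estimate; a held-out test set avoids that.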

Genre Prediction

For the multi-label classification task of predicting a movie's genres, the Classifier Chain method was employed. A classifier chain trains one binary classifier per label and passes each classifier the predictions of the earlier ones in the chain as extra features, so correlations between genres can be exploited. The approach included the following steps:

Model Selection:
- The Classifier Chain was combined with a Voting Classifier that ensembles Random Forest and XGBoost models, so the final prediction draws on the strengths of both.
Training:
- The Voting Classifier was trained using the soft voting strategy, which averages the predicted probabilities of the base models rather than counting their hard votes. This can improve accuracy when the base models produce well-calibrated probabilities.
Evaluation:
- The Classifier Chain was evaluated using several key metrics, including F1 score, accuracy, precision, recall, and Hamming loss.
- These metrics provided a comprehensive view of the model's effectiveness in predicting multiple genres simultaneously.
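
The chain-plus-voting setup described above can be sketched with scikit-learn as follows. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` is swapped in for XGBoost; the synthetic data, estimator sizes, and micro-averaged F1 are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data: each column of Y plays the role of one genre.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Soft voting averages the class probabilities of the two base models.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
    ],
    voting="soft",
)

# The chain fits one copy of the voter per label; each copy also sees the
# labels that come earlier in the chain as additional input features.
chain = ClassifierChain(voter, random_state=0)
chain.fit(X_train, Y_train)
Y_pred = chain.predict(X_test).astype(int)

f1 = f1_score(Y_test, Y_pred, average="micro")
hl = hamming_loss(Y_test, Y_pred)
print(f"micro F1 = {f1:.3f}, Hamming loss = {hl:.3f}")
```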

Results

Release Year Prediction

The Gradient Boosting Regressor achieved a mean absolute error (MAE) of 5.94465 years and an R-squared value (R²) of 0.469028.
The model's cross-validation scores ranged from 0.33920605 to 0.46901381, with a mean of 0.40512642.
In other words, predictions are off by about six years on average and the model explains roughly 47% of the variance in release year, so it captures some signal but leaves considerable room for improvement.

Genre Prediction

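The Classifier Chain model achieved an F1 score of 0.465631, an accuracy of 0.120912, a precision of 0.602423, and a recall of 0.379466. Precision well above recall suggests the model more often misses a genre than predicts one wrongly.
The Hamming loss was 0.0957155, meaning about 9.6% of individual genre labels were predicted incorrectly. The much lower accuracy is consistent with exact-match (subset) accuracy, which counts a movie as correct only when every genre label matches.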
Confusion matrices were generated for each genre to further analyze the model's performance. The visualizations are presented below:

Confusion matrix panels, one per genre, in order: Game Show, Crime, Animation, Comedy, Short, Documentary, Drama, Film-Noir, Music, Thriller, Romance, News, Musical, Action, Adventure, War, Mystery, Fantasy, Reality TV, Biography, Family, Western, Horror, Sci-Fi, History, Sport.
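
Per-genre confusion matrices like those described above can be computed with scikit-learn's `multilabel_confusion_matrix`, which returns one 2x2 matrix per label. The three genres and the label arrays below are toy values for illustration only.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

genres = ["Action", "Comedy", "Drama"]

# Rows are movies, columns are genres (1 = movie has that genre).
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]])

# Shape (n_genres, 2, 2); each matrix is laid out as [[TN, FP], [FN, TP]].
matrices = multilabel_confusion_matrix(y_true, y_pred)

for genre, matrix in zip(genres, matrices):
    tn, fp, fn, tp = matrix.ravel()
    print(f"{genre}: TN={tn} FP={fp} FN={fn} TP={tp}")
```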

Evaluation Metrics

Below is a table summarizing the model evaluation metrics:

| Model | Metric | Score |
| --- | --- | --- |
| Gradient Boosting Regressor | Mean Absolute Error (MAE) | 5.94465 |
| Gradient Boosting Regressor | Mean Squared Error (MSE) | 86.0554 |
| Gradient Boosting Regressor | R-Squared (R²) | 0.469028 |
| Classifier Chain (Voting Classifier) | F1 Score | 0.465631 |
| Classifier Chain (Voting Classifier) | Accuracy | 0.120912 |
| Classifier Chain (Voting Classifier) | Hamming Loss | 0.0957155 |
| Classifier Chain (Voting Classifier) | Precision | 0.602423 |
| Classifier Chain (Voting Classifier) | Recall | 0.379466 |
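
To make the metrics concrete, here is a small worked example on toy values using scikit-learn. The arrays are made up, and micro-averaging for precision, recall, and F1 is an assumption, since the averaging mode behind the reported scores is not stated. Note how a multi-label model can have a low Hamming loss while exact-match accuracy stays low.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Regression: true vs. predicted release years (toy values).
years_true = np.array([2000, 2005, 2010, 2015])
years_pred = np.array([2002, 2004, 2013, 2015])
mae = mean_absolute_error(years_true, years_pred)  # mean of |error|
mse = mean_squared_error(years_true, years_pred)   # mean of error squared
r2 = r2_score(years_true, years_pred)              # 1 - SS_res / SS_tot

# Multi-label: 4 movies x 5 genres; three movies are wrong in exactly one label.
y_true = np.array([[1, 0, 0, 1, 0], [0, 1, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 1, 0, 0, 1], [0, 0, 1, 0, 0]])
hl = hamming_loss(y_true, y_pred)            # share of wrong individual labels
subset_acc = accuracy_score(y_true, y_pred)  # share of movies with every label right
precision = precision_score(y_true, y_pred, average="micro")
recall = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")

print(mae, mse, r2)
print(hl, subset_acc, precision, recall, f1)
```

Here only 3 of 20 labels are wrong (Hamming loss 0.15), yet just 1 of 4 movies is matched exactly (accuracy 0.25).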

Conclusion

The Movie Metadata Analysis project developed predictive models for both the release year and the genres of a director's next movie. The enhanced dataset, produced through cleaning and feature engineering, served as the basis for analysis and model building. Performance was moderate (an R² of 0.469028 for release year and an F1 score of 0.465631 for genres) and was assessed with several complementary metrics.

Contributing

Contributions are welcome to extend the project or improve the existing methodologies.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Pavan Kumar - pavankumard.pg19.ma@nitp.ac.in

LinkedIn: @ascender1729

Project Link: CDS-IISC-P1-DataSci-PreDoc
