A data science project utilizing machine learning to predict movie release years and genres based on directors' previous works.

ascender1729/CDS-IISC-P1-DataSci-PreDoc

Movie Metadata Analysis

The Movie Metadata Analysis project leverages machine learning to predict the release year of a director's next movie and its probable genres. The project covers extensive data cleaning, feature engineering, visualization, modeling, and evaluation.

Project Overview

This project analyzes a dataset of movie metadata to predict key aspects of a director's future movies from their past works. It covers the full data science workflow, from data cleaning and feature engineering through modeling and evaluation.

Features

  • Data Cleaning: Fills missing values using median imputation for numerical columns and mode or placeholder imputation for categorical columns.
  • Feature Engineering: Creates binary features for genres and calculates per-director statistics such as average gross and average release interval.
  • Data Visualization: Provides a correlation heatmap, distribution bar charts, and scatter plots against gross earnings.
  • Predictive Modeling: Develops a regression model for release year and a multi-label classification model for genres.

Data Description

The original dataset is a collection of movie metadata with 5,043 entries and 28 columns. Each entry is one movie, and each column describes one aspect of it, such as its title, director, or financial performance. The following is a brief overview of the dataset:

  • Columns: The dataset contains various attributes of a movie, including:

    • color: Indicates whether the movie is in color or black-and-white.
    • director_name: The name of the movie's director.
    • num_critic_for_reviews: The number of critic reviews received.
    • duration: The duration of the movie in minutes.
    • gross: The gross earnings of the movie.
    • genres: The genres associated with the movie.
    • title_year: The year the movie was released.
    • imdb_score: The IMDb rating of the movie.
  • Data Quality: Some columns contain missing values, including color, director_name, gross, content_rating, budget, and title_year.

  • Data Types: The dataset includes a mix of categorical and numerical data:

    • Categorical columns: color, director_name, genres, movie_title, language, country, content_rating, and movie_imdb_link.
    • Numerical columns: num_critic_for_reviews, duration, gross, num_voted_users, cast_total_facebook_likes, facenumber_in_poster, num_user_for_reviews, budget, title_year, imdb_score, aspect_ratio, and movie_facebook_likes.

Installation

To set up the project locally:

  1. Clone the repository:

    git clone https://github.com/ascender1729/CDS-IISC-P1-DataSci-PreDoc.git
  2. Navigate to the project directory:

    cd CDS-IISC-P1-DataSci-PreDoc
  3. Create a virtual environment and install dependencies (the activation command shown is for macOS/Linux; on Windows use .\env\Scripts\activate):

    python -m venv env
    source env/bin/activate
    pip install -r requirements.txt

Usage

  1. Activate the virtual environment:

    • On Windows: .\env\Scripts\activate
    • On macOS/Linux: source env/bin/activate
  2. Run the Jupyter notebook:

    jupyter notebook
  3. Open and execute the cells in the relevant notebook.

Analysis

Data Cleaning

Missing values in both numerical and categorical columns were handled as follows:

  • Numerical columns: Missing values were filled with their median values, maintaining the central tendency of the data.
  • Categorical columns: Missing values were filled with their mode values or placeholders where appropriate, depending on the context.
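
The two imputation rules above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the helper name `impute_missing`, the `placeholders` argument, and the toy rows are assumptions, and only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd


def impute_missing(df, placeholders=None):
    """Fill numeric NaNs with the column median; fill categorical NaNs with a
    placeholder (if one is given for that column) or else the column mode.

    `placeholders` is a hypothetical mapping of column name -> fill value.
    """
    placeholders = placeholders or {}
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        elif col in placeholders:
            out[col] = out[col].fillna(placeholders[col])
        else:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out


# Toy rows (made up) using column names from the dataset description.
movies = pd.DataFrame({
    "director_name": ["A", None, "B", "A"],
    "color": ["Color", "Color", None, "Black and White"],
    "gross": [100.0, np.nan, 300.0, 500.0],
})
clean = impute_missing(movies, placeholders={"director_name": "Unknown"})
```

Here the missing `gross` value becomes the median of the observed values, the missing `color` becomes the most frequent value, and the missing `director_name` becomes the placeholder. Which columns receive a placeholder rather than the mode is a per-column judgment that the text above leaves open.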

Feature Engineering

New features were engineered to enhance the predictive models:

  • Binary genre features: A 0/1 feature was created for each genre, which allows multi-label classification of movies by genre.
  • Director-specific features: Features such as average release intervals, average gross earnings, and intervals between consecutive releases were calculated to provide insights into a director’s typical behavior and success.
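
A rough sketch of both kinds of features with pandas follows. It assumes the `genres` column stores genres as a `|`-separated string (for example `Action|Drama`); the column names `release_interval`, `average_release_interval`, and `average_gross` are chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows (made up); only the column names follow the dataset description.
movies = pd.DataFrame({
    "director_name": ["A", "A", "A", "B", "B"],
    "title_year": [2000, 2003, 2007, 2010, 2012],
    "gross": [10.0, 20.0, 30.0, 5.0, 15.0],
    "genres": ["Action|Drama", "Drama", "Action|Thriller", "Comedy", "Comedy|Drama"],
})

# One 0/1 column per genre; these serve as multi-label targets.
# Assumption: genres are separated by "|".
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Years since the same director's previous movie (NaN for a director's first movie).
movies = movies.sort_values(["director_name", "title_year"])
movies["release_interval"] = movies.groupby("director_name")["title_year"].diff()

# Per-director aggregates.
director_stats = movies.groupby("director_name").agg(
    average_release_interval=("release_interval", "mean"),
    average_gross=("gross", "mean"),
)

# Attach genre flags (aligned on the row index) and director aggregates.
features = movies.join(genre_flags).join(director_stats, on="director_name")
```

Note that these aggregates use all of a director's movies. When predicting a director's next movie, it would probably be safer to compute them from earlier releases only, so that no information from the future leaks into the features.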

Visualization

To better understand the data, various visualizations were used:

Figures (images not reproduced here):

  • Correlation heatmap
  • Bar charts: distribution of director name, content rating, color, language, and country
  • Scatter plots against gross: duration, budget, title year, and Facebook likes for actor 1, actor 2, and actor 3

Modeling

The project involved two predictive models for distinct yet related objectives:

Release Year Prediction

To predict the release year of a director's next movie, a Gradient Boosting Regressor was utilized. This method was chosen for its ability to handle complex relationships and its effectiveness in regression tasks. The following steps were taken:

  1. Feature Selection:

    • The relevant features, such as average_gross, average_release_interval, and director_facebook_likes, were identified to focus the model on the most impactful variables.
  2. Hyperparameter Tuning:

    • The Gradient Boosting Regressor was fine-tuned using Grid Search Cross-Validation to identify the best combination of hyperparameters.
    • Parameters such as n_estimators, max_depth, and learning_rate were varied to improve the model's predictive performance.
  3. Cross-Validation:

    • The model's robustness was tested using cross-validation, ensuring that the selected hyperparameters provided consistent performance across different data subsets.
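
The tuning and validation steps above could look roughly like this with scikit-learn. The data is synthetic, and the grid values, fold count, scoring metric, and train/test split are assumptions; the project's actual settings are not stated here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins for average_gross, average_release_interval,
# and director_facebook_likes (not the real data).
X = np.column_stack([
    rng.normal(50, 20, n),
    rng.uniform(1, 6, n),
    rng.integers(0, 1000, n),
])
# Synthetic "release year" target with noise.
y = 1990 + 3 * X[:, 1] + 0.05 * X[:, 0] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid over the three hyperparameters named above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation for every grid point
    scoring="r2",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the full training split.
y_pred = search.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```

After fitting, `search.best_params_` holds the selected combination and `search.cv_results_` holds the per-fold scores, which is one way to check that performance is consistent across data subsets.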

Genre Prediction

For the multi-label classification task of predicting a movie's genres, the Classifier Chain method was employed. A classifier chain trains one binary classifier per label, and each classifier also receives the labels earlier in the chain as input features, so correlations between genres can be exploited. The approach included the following steps:

  1. Model Selection:

    • The Classifier Chain was combined with a Voting Classifier, which used ensemble methods like Random Forest and XGBoost. This ensemble approach leverages the strengths of both models to improve overall performance.
  2. Training:

    • The Voting Classifier was trained using the soft voting strategy, which averages the predicted class probabilities of the individual models to make the final prediction. Soft voting can outperform hard (majority) voting when the base models produce well-calibrated probabilities.
  3. Evaluation:

    • The Classifier Chain was evaluated using several key metrics, including F1 score, accuracy, precision, recall, and Hamming loss.
    • These metrics provided a comprehensive view of the model's effectiveness in predicting multiple genres simultaneously.
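
The following is a minimal sketch of a classifier chain wrapped around a soft-voting ensemble, using scikit-learn only. To keep it self-contained, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, the data is synthetic, and micro-averaging for F1, precision, and recall is an assumption, since the averaging mode used by the project is not stated.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data: 4 labels stand in for the genre columns.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, n_labels=2, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Soft voting averages the class probabilities of the base models.
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        # Stand-in for XGBoost so the sketch needs only scikit-learn.
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)

# One copy of the voter is fitted per label; each later copy also sees
# the earlier labels in the chain as extra input features.
chain = ClassifierChain(voter, order="random", random_state=0)
chain.fit(X_train, Y_train)
Y_pred = chain.predict(X_test).astype(int)

metrics = {
    "f1": f1_score(Y_test, Y_pred, average="micro", zero_division=0),
    "precision": precision_score(Y_test, Y_pred, average="micro", zero_division=0),
    "recall": recall_score(Y_test, Y_pred, average="micro", zero_division=0),
    "subset_accuracy": accuracy_score(Y_test, Y_pred),  # exact match per sample
    "hamming_loss": hamming_loss(Y_test, Y_pred),       # fraction of wrong labels
}
```

The order of labels in a chain affects the result, so `order="random"` with a fixed seed is only one possible choice.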

Results

Release Year Prediction

  • The Gradient Boosting Regressor achieved a mean absolute error (MAE) of 5.94465 and an R-squared value (R²) of 0.469028.
  • The model's cross-validation scores ranged from 0.3392 to 0.4690, with a mean of 0.4051.
  • With an MAE of about six years and roughly 47% of the variance in release year explained, the model captures a moderate share of the signal but leaves substantial error for individual predictions.

Genre Prediction

  • The Classifier Chain model achieved an F1 score of 0.465631, an accuracy of 0.120912, a precision of 0.602423, and a recall of 0.379466.

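  • The Hamming loss for the model was 0.0957155, meaning that about 9.6% of individual genre labels were predicted incorrectly. The much lower accuracy of 0.120912 is what one would expect if accuracy was computed as exact-match (subset) accuracy, where a prediction counts as correct only when every genre label for a movie matches.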

  • Confusion matrices were generated for each genre to further analyze the model's performance. The visualizations are presented below:

Confusion matrix figures (images not reproduced here), one per genre: Game Show, Crime, Animation, Comedy, Short, Documentary, Drama, Film-Noir, Music, Thriller, Romance, News, Musical, Action, Adventure, War, Mystery, Fantasy, Reality TV, Biography, Family, Western, Horror, Sci-Fi, History, Sport.

Evaluation Metrics

Below is a table summarizing the model evaluation metrics:

| Model                                | Metric                    | Score     |
|--------------------------------------|---------------------------|-----------|
| Gradient Boosting Regressor          | Mean Absolute Error (MAE) | 5.94465   |
| Gradient Boosting Regressor          | Mean Squared Error (MSE)  | 86.0554   |
| Gradient Boosting Regressor          | R-Squared (R²)            | 0.469028  |
| Classifier Chain (Voting Classifier) | F1 Score                  | 0.465631  |
| Classifier Chain (Voting Classifier) | Accuracy                  | 0.120912  |
| Classifier Chain (Voting Classifier) | Hamming Loss              | 0.0957155 |
| Classifier Chain (Voting Classifier) | Precision                 | 0.602423  |
| Classifier Chain (Voting Classifier) | Recall                    | 0.379466  |
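
To help read the multi-label rows, the toy example below shows how a low Hamming loss and a low accuracy can coexist. It assumes accuracy is computed the way scikit-learn's `accuracy_score` treats multi-label input (exact match per sample), which is not confirmed for this project. The labels are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, multilabel_confusion_matrix

# Four movies, four hypothetical genre columns (made-up labels).
Y_true = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])
Y_pred = np.array([
    [1, 0, 0, 0],  # one label missed
    [0, 1, 0, 0],  # exact match
    [1, 0, 0, 0],  # one label missed
    [0, 0, 1, 1],  # one extra label
])

# Fraction of individual labels that are wrong: 3 of 16.
h_loss = hamming_loss(Y_true, Y_pred)

# Fraction of movies whose full label set is exactly right: 1 of 4.
subset_acc = accuracy_score(Y_true, Y_pred)

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
per_label_cm = multilabel_confusion_matrix(Y_true, Y_pred)
```

Three of the four predictions are wrong in only one label, yet each counts as a full miss for exact-match accuracy. The per-label matrices from `multilabel_confusion_matrix` correspond to the kind of per-genre confusion matrices shown in the Results section.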

Conclusion

The Movie Metadata Analysis project developed predictive models for both the release year and the genres of a director's next movie. After data cleaning and feature engineering, the release-year model reached an MAE of about 5.9 years (R² of 0.47), and the genre model reached an F1 score of 0.47 with a Hamming loss of 0.096. Both models were evaluated with several metrics to give a fuller picture of their performance.

Contributing

Contributions are welcome to extend the project or improve the existing methodologies.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Pavan Kumar - pavankumard.pg19.ma@nitp.ac.in

LinkedIn: @ascender1729

Project Link: CDS-IISC-P1-DataSci-PreDoc