# Looking For Trends In Movie Choices

## Project Overview

In this project, we'll analyze a movie dataset published on [Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) by the user *Rounak Banik*. The goal is to look for trends that the most popular movies have in common.

The movie metadata in this dataset comes from TMDB.

Here is the data dictionary:

-`adult`: Indicates if the movie is X-rated <br>
-`belongs_to_collection`: Stringified dictionary that provides information on the movie series the particular film belongs to. <br>
-`budget`: Cost to make the movie in USD <br>
-`genres`: Stringified list of dictionaries that list out all the genres associated with the movie <br>
-`homepage`: Homepage of the movie <br>
-`id`: ID of the movie <br>
-`imdb_id`: IMDB assigned ID of the movie <br>
-`original_language`: Native language of the movie <br>
-`original_title`: Original title of the movie <br>
-`overview`: Description of the movie <br>
-`poster_path`: URL of the poster image <br>
-`popularity`: How popular the movie was <br>
-`production_companies`: Companies that produced the movie <br>
-`production_countries`: Countries where the movie was produced <br>
-`release_date`: Date movie was released <br>
-`revenue`: How much the movie made <br>
-`runtime`: How long the movie was <br>
-`spoken_languages`: Spoken languages in the movie <br>
-`status`: Status of the movie (Released, To Be Released, Announced, etc.) <br>
-`tagline`: Tagline of the movie <br>
-`title`: Official title of the movie <br>
-`video`: Indicates if there is a video present of the movie with TMDB <br>
-`vote_average`: Average rating of the movie <br>
-`vote_count`: Number of votes by users <br>

Let's import our libraries now.

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tabulate as tabulate
%matplotlib inline

movies = pd.read_csv(r'C:\Users\david\Downloads\archive\movies_metadata.csv', low_memory=False)

## Initial Cleaning

The first thing we're going to do is remove the columns that have a high percentage of NaN entries, along with columns that seem redundant or inapplicable to our analysis. Let's start by checking the shape and column names.

In [2]:
print(movies.shape)
movies.columns

(45466, 24)


Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
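
To judge which columns have a high percentage of NaN entries, one option is to compute the share of missing values per column. The sketch below is only an illustration: it uses a small made-up frame named `toy` in place of `movies`, and the 50% cutoff is an assumed threshold, not a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'homepage': [np.nan, np.nan, np.nan, 'http://example.com'],
    'tagline': [np.nan, 'x', np.nan, 'y'],
})

# Share of missing entries per column, as a percentage, highest first.
nan_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Hypothetical cutoff: flag columns where more than half the entries are NaN.
threshold = 50
sparse_cols = nan_pct[nan_pct > threshold].index.tolist()

print(nan_pct)
print(sparse_cols)
```

Running the same two lines on `movies` instead of `toy` would give a ranked list to compare against the columns we decide to drop.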

In [12]:
movies['belongs_to_collection'].head(5)

0    {'id': 10194, 'name': 'Toy Story Collection', ...
1                                                  NaN
2    {'id': 119050, 'name': 'Grumpy Old Men Collect...
3                                                  NaN
4    {'id': 96871, 'name': 'Father of the Bride Col...
Name: belongs_to_collection, dtype: object
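
As the data dictionary notes, `belongs_to_collection` holds stringified dictionaries rather than real ones. If we ever wanted the collection name instead of dropping the column, a possible approach is `ast.literal_eval`. This is a sketch on a made-up two-row Series; the helper name `parse_collection_name` is hypothetical, and the toy entry keeps only the keys visible in the preview above.

```python
import ast

import numpy as np
import pandas as pd

# Toy column shaped like the belongs_to_collection preview (shortened by hand).
collection = pd.Series([
    "{'id': 10194, 'name': 'Toy Story Collection'}",
    np.nan,
])

def parse_collection_name(value):
    """Return the collection name, or None if the entry is missing or malformed."""
    if not isinstance(value, str):
        return None  # NaN entries are floats, not strings
    try:
        parsed = ast.literal_eval(value)  # safely evaluates Python literals only
    except (ValueError, SyntaxError):
        return None
    return parsed.get('name') if isinstance(parsed, dict) else None

collection_names = collection.apply(parse_collection_name)
print(collection_names)
```

`ast.literal_eval` is used here instead of `eval` because it only accepts literals, which is safer for text read from a file.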

In [4]:
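# Drop the columns with many NaN entries and the ones that seem
# redundant or inapplicable to the analysis.
movies.drop(columns=[
    'adult', 'belongs_to_collection', 'genres', 'homepage', 'imdb_id',
    'original_title', 'overview', 'poster_path', 'spoken_languages', 'tagline',
    'video', 'production_companies', 'production_countries',
])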

| | budget | id | original_language | popularity | release_date | revenue | runtime | status | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000000 | 862 | en | 21.946943 | 1995-10-30 | 373554033.0 | 81.0 | Released | Toy Story | 7.7 | 5415.0 |
| 1 | 65000000 | 8844 | en | 17.015539 | 1995-12-15 | 262797249.0 | 104.0 | Released | Jumanji | 6.9 | 2413.0 |
| 2 | 0 | 15602 | en | 11.7129 | 1995-12-22 | 0.0 | 101.0 | Released | Grumpier Old Men | 6.5 | 92.0 |
| 3 | 16000000 | 31357 | en | 3.859495 | 1995-12-22 | 81452156.0 | 127.0 | Released | Waiting to Exhale | 6.1 | 34.0 |
| 4 | 0 | 11862 | en | 8.387519 | 1995-02-10 | 76578911.0 | 106.0 | Released | Father of the Bride Part II | 5.7 | 173.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45461 | 0 | 439050 | fa | 0.072051 | NaN | 0.0 | 90.0 | Released | Subdue | 4.0 | 1.0 |
| 45462 | 0 | 111109 | tl | 0.178241 | 2011-11-17 | 0.0 | 360.0 | Released | Century of Birthing | 9.0 | 3.0 |
| 45463 | 0 | 67758 | en | 0.903007 | 2003-08-01 | 0.0 | 90.0 | Released | Betrayal | 3.8 | 6.0 |
| 45464 | 0 | 227506 | en | 0.003503 | 1917-10-21 | 0.0 | 87.0 | Released | Satan Triumphant | 0.0 | 0.0 |
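
One thing to keep in mind: by default `DataFrame.drop` returns a new DataFrame and leaves the original untouched, so a bare `drop` call only displays the trimmed table. A minimal sketch on a made-up frame (the names `toy` and `trimmed` are ours, not part of the dataset) shows the difference:

```python
import pandas as pd

# Toy frame with one column we want to get rid of.
toy = pd.DataFrame({'title': ['A', 'B'], 'video': [False, False], 'budget': [1, 2]})

# drop() returns a new DataFrame; `toy` keeps all three columns.
trimmed = toy.drop(columns=['video'])

print(toy.columns.tolist())
print(trimmed.columns.tolist())
```

Assigning the result to a new name keeps the full frame around, which is handy while we are still inspecting columns we plan to remove.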


In [16]:
movies['status'].value_counts()

Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

In [30]:
# Select the titles of the movies whose status is 'Canceled' with a boolean mask.
# Looping over the column and printing movies['title'] would print every title
# once per match, and searching the title text for 'Canceled' checks the wrong column.
canceled_movies = movies.loc[movies['status'] == 'Canceled', 'title']
print(canceled_movies)


In [31]:
type(movies['belongs_to_collection'])

pandas.core.series.Series