


# **Welcome to the IMDB Top 10,000 Movie Analysis!**




Watching movies is one of my favorite ways to de-stress, so this project was a lot of fun to put together. I hope you enjoy the analysis as I walk you through my thought process step by step!

DATA SOURCE: ***KAGGLE***

<img src = 'https://image.cnbcfm.com/api/v1/image/104768589-movies-anywhere.JPG?v=1507816437&w=740&h=416&ffmt=webp&vtcrop=y' >

PART 1: IMPORTING LIBRARIES AND DATA


In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.dates as mdates
import colorlover as cl
from numpy import linspace
import re


In [32]:
#Notebook Presentation
pd.options.display.float_format = '{:.2f}'.format


In [33]:
#Reading the Data
df = pd.read_csv('movies.csv')

PART 2: CHECKING DATA / STATS

In [34]:
#Checking whether there are any null values in any column
df.isnull().sum()

title             0
year              0
runtime           0
certificate     356
genre             0
director          0
stars             0
rating            0
metascore      1973
votes             0
gross          2834
dtype: int64
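
With 9,849 rows in total, the raw counts above work out to roughly 3.6% missing for certificate, 20% for metascore, and 29% for gross. Here is a minimal sketch of how those shares could be computed directly. The `sample` frame is a made-up stand-in for `movies.csv`, not the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for movies.csv; the values are made up
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "metascore": [82.0, None, 55.0, None],
    "gross": [28.34, None, None, None],
})

# The mean of the boolean null mask is the fraction of missing values per column
missing_pct = sample.isnull().mean().mul(100).round(1)
print(missing_pct)
```

Running the same expression on `df` should give the percentages quoted above, which makes it easier to judge how much the gross and metascore charts can be trusted.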

In [35]:
#Getting some quick stats
df.describe()

| | runtime | rating | metascore | votes | gross |
|---|---|---|---|---|---|
| count | 9849.0 | 9849.0 | 7876.0 | 9849.0 | 7015.0 |
| mean | 110.58 | 6.72 | 59.07 | 90835.41 | 40.1 |
| std | 21.88 | 0.81 | 17.22 | 166834.03 | 66.91 |
| min | 45.0 | 4.9 | 7.0 | 10000.0 | 0.0 |
| 25% | 96.0 | 6.1 | 47.0 | 16836.0 | 2.34 |
| 50% | 106.0 | 6.7 | 59.5 | 34052.0 | 17.04 |
| 75% | 120.0 | 7.3 | 72.0 | 90513.0 | 48.81 |
| max | 439.0 | 9.3 | 100.0 | 2780534.0 | 936.66 |


In [36]:
df

| | title | year | runtime | certificate | genre | director | stars | rating | metascore | votes | gross |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 142 | R | Drama | ['Frank Darabont', 'Tim Robbins', 'Morgan Free... | ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton'... | 9.30 | 82.00 | 2780534 | 28.34 |
| 1 | The Godfather | 1972 | 175 | R | Crime, Drama | ['Francis Ford Coppola', 'Marlon Brando', 'Al ... | ['Marlon Brando', 'Al Pacino', 'James Caan', '... | 9.20 | 100.00 | 1935895 | 134.97 |
| 2 | Ramayana: The Legend of Prince Rama | 1993 | 135 | PG | Animation, Action, Adventure | ['Ram Mohan', 'Yûgô Sakô', 'Koichi Saski', 'Ar... | ['Yûgô Sakô', 'Koichi Saski', 'Arun Govil', 'N... | 9.20 | NaN | 12470 | NaN |
| 3 | The Chaos Class | 1975 | 87 | NaN | Comedy, Drama | ['Ertem Egilmez', 'Kemal Sunal', 'Münir Özkul'... | ['Kemal Sunal', 'Münir Özkul', 'Halit Akçatepe... | 9.20 | NaN | 42018 | NaN |
| 4 | Daman | 2022 | 121 | NaN | Adventure, Drama | ['Lenka Debiprasad', 'Vishal Mourya', 'Karan K... | ['Vishal Mourya', 'Karan Kandhapan', 'Babushan... | 9.10 | NaN | 13372 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9844 | Welcome to the Jungle | I) (2013 | 95 | Not Rated | Action, Adventure, Comedy | ['Rob Meltzer', 'Jean-Claude Van Damme', 'Adam... | ['Jean-Claude Van Damme', 'Adam Brody', 'Rob H... | 4.90 | 25.00 | 13770 | NaN |
| 9845 | Boat Trip | 2002 | 94 | R | Comedy | ['Mort Nathan', 'Cuba Gooding Jr.', 'Horatio S... | ['Cuba Gooding Jr.', 'Horatio Sanz', 'Roselyn ... | 4.90 | 18.00 | 31972 | 8.59 |
| 9846 | Did You Hear About the Morgans? | 2009 | 103 | PG-13 | Comedy, Drama, Romance | ['Marc Lawrence', 'Hugh Grant', 'Sarah Jessica... | ['Hugh Grant', 'Sarah Jessica Parker', 'Sam El... | 4.90 | 27.00 | 41830 | 29.58 |
| 9847 | The Crow: Salvation | 2000 | 102 | R | Action, Crime, Fantasy | ['Bharat Nalluri', 'Kirsten Dunst', 'William A... | ['Kirsten Dunst', 'William Atherton', 'Debbie ... | 4.90 | NaN | 11938 | NaN |


In [37]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| runtime | 9849.0 | 110.58 | 21.88 | 45.0 | 96.0 | 106.0 | 120.0 | 439.0 |
| rating | 9849.0 | 6.72 | 0.81 | 4.9 | 6.1 | 6.7 | 7.3 | 9.3 |
| metascore | 7876.0 | 59.07 | 17.22 | 7.0 | 47.0 | 59.5 | 72.0 | 100.0 |
| votes | 9849.0 | 90835.41 | 166834.03 | 10000.0 | 16836.0 | 34052.0 | 90513.0 | 2780534.0 |
| gross | 7015.0 | 40.1 | 66.91 | 0.0 | 2.34 | 17.04 | 48.81 | 936.66 |


In [38]:
#some non-numeric stats
df.select_dtypes(include=[object]).describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| title | 9849 | 9496 | A Star Is Born | 4 |
| year | 9849 | 191 | 2018 | 308 |
| certificate | 9493 | 24 | R | 4005 |
| genre | 9849 | 423 | Comedy, Drama, Romance | 476 |
| director | 9849 | 9820 | ['Sylvester Stallone', 'Sylvester Stallone', '... | 3 |
| stars | 9849 | 9803 | ['William Shatner', 'Leonard Nimoy', 'DeForest... | 6 |


In [39]:
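def remove_non_numeric_characters(value):
    """Strip everything except digits, e.g. 'I) (2013' becomes '2013'."""
    return re.sub(r'[^0-9]', '', value)

# Some 'year' entries carry extra characters (see row 9844 above), so clean them up.
# Note: the column still holds strings after this step.
df['year'] = df['year'].apply(remove_non_numeric_characters)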
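
As a possible alternative to the regex helper above, pandas string methods can pull the four-digit year out and convert it to an integer in one step. This is only a sketch: `raw_years` is a made-up sample, and it assumes every entry contains exactly one four-digit year.

```python
import pandas as pd

# Hypothetical sample of raw 'year' strings; "I) (2013" is the pattern seen in the table above
raw_years = pd.Series(["1994", "I) (2013", "2002"])

# Extract the first run of four digits, then convert the strings to integers
clean_years = raw_years.str.extract(r"(\d{4})", expand=False).astype(int)
print(clean_years.tolist())
```

An integer year column sorts numerically and gives plotting libraries a continuous axis, whereas a string column is treated as a set of categories.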

PART 3: CHART ANALYSIS

Now that we've imported the data and reviewed some summary statistics, let's build some charts to visualize it!

In [40]:
#Generate colors
colors = cl.scales['9']['seq']['Blues']  # Get 9 colors from the 'Blues' scale
colors = cl.interp(colors, 100)  # Expand to 100 colors

# Loop over selected columns in the dataframe
for column in ['year', 'certificate', 'genre']:

    # Count how often each value occurs, ordered by value (missing values are not counted)
    count = df[column].value_counts().sort_index()

    # Create a bar trace for the current column
    trace = go.Bar(
        x=count.index,
        y=count.values,
        name=f"{column} Distribution",
    )

    # Create the figure and add the trace
    fig = go.Figure(trace)

    # Update the layout
    fig.update_layout(
        title=f"{column} Distribution",
        xaxis=dict(title=column),
        yaxis=dict(title="Count"),
    )

    # Show the plot
    fig.show()
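
To make the counting step in the loop above more concrete, here is a small sketch of what `value_counts().sort_index()` returns. The `certs` series is invented for illustration. Note that missing values are dropped by default, so the 356 movies without a certificate would not show up in the certificate chart unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical certificate values, including one missing entry
certs = pd.Series(["R", "PG", "R", "PG-13", "R", None])

# Default behaviour: count each value, drop missing entries, order by label
count = certs.value_counts().sort_index()
print(count)

# Keeping missing entries makes the total match the number of rows
count_with_missing = certs.value_counts(dropna=False)
print(count_with_missing.sum())
```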

In [41]:
# Sort the unique years in ascending order
sorted_years = sorted(df['year'].unique())

# Box plot showing the year distribution broken down by certificate
fig = px.box(
    df,
    x='year',
    y='certificate',
    title='Movie Certificate by Year',
    labels={'year': 'Year', 'certificate': 'Certificate'},
    points='all',  # Show all points
    color='certificate',
    category_orders={"year": sorted_years}  # This line ensures the years are sorted
)



# Show the plot
fig.show()

In [42]:
#Average Runtime by Year
avg_runtime_by_year = df.groupby('year')['runtime'].mean().reset_index(name='average')
fig = px.line(
    avg_runtime_by_year,
    title='Average Runtime by Year',
    x='year',
    y='average',
    labels={'year': 'Year', 'average': 'Average Runtime'}
)
fig.show()
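
A yearly average can swing a lot when only a handful of movies fall in that year. One way to check this is to compute the number of films alongside the mean, sketched below with a made-up `sample` frame. The column names `average` and `films` are my own choices for illustration.

```python
import pandas as pd

# Hypothetical sample; the runtimes and years are invented for illustration
sample = pd.DataFrame({
    "year": ["1994", "1994", "2002"],
    "runtime": [142, 154, 94],
})

# Named aggregation: mean runtime and number of films per year
summary = (
    sample.groupby("year")["runtime"]
    .agg(average="mean", films="count")
    .reset_index()
)
print(summary)
```

The `films` column could then be used to filter out sparse years before drawing the line chart.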

In [43]:

# Count movies for each (year, genre) pair; groupby returns the groups in sorted order
heatmap_data = df.groupby(['year', 'genre']).size().reset_index(name='count')

# Pivot to a year x genre grid, filling missing combinations with 0
heatmap_pivot = heatmap_data.pivot(index='year', columns='genre', values='count').fillna(0)

# Make sure the years are in ascending order
heatmap_pivot = heatmap_pivot.sort_index(ascending=True)

# Create the heat map
fig = px.imshow(
    heatmap_pivot,
    labels=dict(x="Genre", y="Year", color="Frequency"),
    title="Frequency of Movie Genres Over the Years"
)

# Customize the layout
fig.update_layout(
    xaxis=dict(title='Genre'),
    yaxis=dict(title='Year')
)

# Show the plot
fig.show()
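
Because the `genre` column stores combinations such as "Comedy, Drama, Romance", the heatmap above ends up with one column per combination (423 in this dataset). If a per-genre view is preferred, one option is to split each combination into separate rows before counting. The sketch below uses a made-up `sample` frame and assumes genres are always separated by a comma and a space.

```python
import pandas as pd

# Hypothetical sample; years and genre strings are invented for illustration
sample = pd.DataFrame({
    "year": [1994, 1972, 1972],
    "genre": ["Drama", "Crime, Drama", "Comedy, Drama"],
})

# Split each combined string into a list, then give every genre its own row
exploded = (
    sample.assign(genre=sample["genre"].str.split(", "))
    .explode("genre")
    .reset_index(drop=True)
)

# Cross-tabulate to get a year x genre count grid
genre_by_year = pd.crosstab(exploded["year"], exploded["genre"])
print(genre_by_year)
```

A grid like `genre_by_year` could be passed to `px.imshow` in place of `heatmap_pivot`, keeping in mind that a movie with three genres is then counted three times.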
