
# Movie Data Analysis Notebook

This notebook analyzes three datasets related to movies and awards. The analyses cover trends over time, correlations, and visual insights.

### Datasets:
1. **Movies Dataset**: Contains information about movies, such as title, director, genres, IMDb scores, and user reviews.
2. **Awards Dataset**: Contains information about awards, including categories, winners, and films.
3. **CMU Movie Metadata**: Contains release dates, box office revenue, runtime, languages, countries, and genres; it is joined to the Movies Dataset by title.

Each analysis is implemented in its own section for clarity.



## 1. Load and Explore the Datasets

We start by loading the datasets and exploring their structure to understand the available information.



In [2]:
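import os

import pandas as pd

ABS_PATH = os.getcwd()

# The CMU movie.metadata.tsv file has no header row, so the column names are
# supplied explicitly. Without header=None, pandas would consume the first
# movie as column labels and silently drop it from the data.
METADATA_COLUMNS = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Movie release date',
                    'Movie box office revenue', 'Movie runtime', 'Movie languages',
                    'Movie countries', 'Movie genres']

awards_df = pd.read_csv(os.path.join(ABS_PATH, 'data', 'the_oscar_award.csv'))
movies_df = pd.read_csv(os.path.join(ABS_PATH, 'data', 'movie_data.csv'))
movie_metadata = pd.read_csv(
    os.path.join(ABS_PATH, 'data', 'CMU_dataset', 'movie.metadata.tsv'),
    sep='\t', header=None, names=METADATA_COLUMNS,
)

# Display the first few rows of each dataset
print("Movies Dataset:")
print(movies_df.head())

print("\nAwards Dataset:")
print(awards_df.head())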




Movies Dataset:
   index      director_name  duration      actor_2_name  \
0      0      James Cameron     178.0  Joel David Moore   
1      1     Gore Verbinski     169.0     Orlando Bloom   
2      2         Sam Mendes     148.0      Rory Kinnear   
3      3  Christopher Nolan     164.0    Christian Bale   
4      4        Doug Walker       NaN        Rob Walker   

                            genres     actor_1_name  \
0  Action|Adventure|Fantasy|Sci-Fi      CCH Pounder   
1         Action|Adventure|Fantasy      Johnny Depp   
2        Action|Adventure|Thriller  Christoph Waltz   
3                  Action|Thriller        Tom Hardy   
4                      Documentary      Doug Walker   

                                         movie_title  num_voted_users  \
0                                            Avatar            886204   
1          Pirates of the Caribbean: At World's End            471220   
2                                           Spectre            275868   
3     
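
Beyond `head()`, a per-column summary helps when exploring structure. The following is a minimal sketch, not part of the original notebook: `summarize_frame` is a hypothetical helper, and the toy frame only imitates the shape of `movies_df` with made-up values.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy stand-in for movies_df; the values are illustrative only.
toy_movies = pd.DataFrame({
    'movie_title': ['Film A', 'Film B', 'Film C'],
    'duration': [120.0, 95.0, None],
    'genres': ['Action|Adventure', 'Drama', 'Action|Adventure'],
})

summary = summarize_frame(toy_movies)
print(toy_movies.shape)
print(summary)
```

Calling `summarize_frame(movies_df)` or `summarize_frame(awards_df)` in the notebook would give the same kind of overview for the real data.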

In [3]:
import re
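# Normalize movie names by removing special characters and extra spaces

def normalize_name(name):
    """Lowercase a title, drop non-alphanumeric characters, and collapse whitespace."""
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', name)
    return re.sub(r'\s+', ' ', cleaned).strip().lower()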

movies_df['normalized_title'] = movies_df['movie_title'].apply(normalize_name)
movie_metadata['normalized_name'] = movie_metadata['Movie name'].apply(normalize_name)

# Find common movie names in both dataframes after normalization
common_movies_normalized = set(movies_df['normalized_title']).intersection(
    set(movie_metadata['normalized_name'])
)
print(f"Number of common movies after normalization: {len(common_movies_normalized)}")
print("Common movies after normalization:", common_movies_normalized)

# Merge datasets based on normalized names
merged_normalized_df = pd.merge(movies_df, movie_metadata, how='inner', left_on='normalized_title', right_on='normalized_name')

# Keep one row per normalized title; when several films share a title, only the first match survives
merged_normalized_df.drop_duplicates(subset='normalized_title', inplace=True)

# Display the merged dataframe without duplicates
print(f"Number of common movies after removing duplicates: {len(merged_normalized_df)}")
merged_normalized_df.head()


Number of common movies after normalization: 3997
Common movies after normalization: {'the fisher king', 'bright lights big city', 'the jacket', 'desperado', 'my life in ruins', 'the pianist', 'the talented mr ripley', 'killers', 'moulin rouge', 'september dawn', 'the remains of the day', 'shalako', 'toy story', 'the mummy tomb of the dragon emperor', 'the good heart', 'spanglish', 'a hard days night', 'star wars episode ii attack of the clones', 'modern times', 'xmen the last stand', 'kama sutra a tale of love', 'the other guys', 'die hard 2', 'treasure planet', 'everything put together', 'ghost rider spirit of vengeance', 'district 9', 'set it off', 'city of life and death', 'gridiron gang', 'flushed away', 'dragon hunters', 'the circle', 'poohs heffalump movie', 'star wars episode v the empire strikes back', 'chicago', 'a civil action', 'the lion king', 'police academy', 'run lola run', 'neighbors', 'flicka', 'blades of glory', 'the great debaters', 'monsoon wedding', 'horrible boss

Unnamed: 0,index,director_name,duration,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,actor_3_name,movie_imdb_link,...,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages,Movie countries,Movie genres,normalized_name
0,0,James Cameron,178.0,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,Wes Studi,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,...,4273140,/m/0bth54,Avatar,2009-12-10,2782275000.0,178.0,"{""/m/02h40lc"": ""English Language"", ""/m/06nm1"":...","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...",avatar
3,1,Gore Verbinski,169.0,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,Jack Davenport,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,...,1689394,/m/05nlx4,Pirates of the Caribbean: At World's End,2007-05-19,963420400.0,169.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02qvnvs"": ""Swashbuckler films"", ""/m/0hj3n...",pirates of the caribbean at worlds end
4,2,Sam Mendes,148.0,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,Stephanie Sigman,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,...,3015828,/m/08kqhx,Spectre,1977,,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/03npn"": ""Horror"", ""/m/015w9s"": ""Televisio...",spectre
5,3,Christopher Nolan,164.0,Christian Bale,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,Joseph Gordon-Levitt,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,...,29075630,/m/0bpm4yw,The Dark Knight Rises,2012-07-16,1078009000.0,165.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/0lsxr"": ""Crime Fiction"", ""/m/01jfsb"": ""Th...",the dark knight rises
6,5,Andrew Stanton,132.0,Samantha Morton,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,Polly Walker,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,...,982480,/m/03whyr,John Carter,2012-03-08,282778100.0,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0cq23f0"": ""Sci-Fi Adventure"", ""/m/06n90"":...",john carter
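
In the preview above, the "Spectre" row directed by Sam Mendes is paired with a metadata entry dated 1977, which suggests that matching on title alone can join different films that share a name. One possible tightening is to match on the normalized title together with the release year. The sketch below uses toy data; the helper `merge_on_title_and_year` and the column names `title`, `name`, `year`, and `release_date` are hypothetical, and whether the year columns of the real datasets line up well enough would need to be checked.

```python
import re

import pandas as pd


def normalize_name(name: str) -> str:
    """Same normalization as in the notebook: strip punctuation, collapse spaces, lowercase."""
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', name)
    return re.sub(r'\s+', ' ', cleaned).strip().lower()


def merge_on_title_and_year(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Inner-join two toy frames on normalized title and release year (hypothetical helper)."""
    left = left.assign(key_title=left['title'].map(normalize_name))
    right = right.assign(
        key_title=right['name'].map(normalize_name),
        # Release dates may be 'YYYY' or 'YYYY-MM-DD'; the first four characters give the year.
        year=right['release_date'].str[:4].astype(int),
    )
    return left.merge(right, on=['key_title', 'year'], how='inner')


# Toy rows, illustrative only.
left = pd.DataFrame({'title': ['Spectre ', "Pirates: World's End"], 'year': [2015, 2007]})
right = pd.DataFrame({
    'name': ['Spectre', 'Pirates: Worlds End'],
    'release_date': ['1977', '2007-05-19'],
})

matched = merge_on_title_and_year(left, right)
print(matched[['title', 'name', 'year']])
```

The trade-off is recall: films whose recorded years differ between the two sources would no longer match, so the number of common movies would likely drop.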


## 2. Genre Trends Over Time

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interactive_output, VBox, HBox, Layout
from IPython.display import display

# 1. Clean and preprocess the 'title_year' column
movies_df['title_year'] = pd.to_numeric(movies_df['title_year'], errors='coerce')  # Non-numeric years become NaN
movies_df = movies_df.dropna(subset=['title_year']).copy()  # Drop rows without a year; copy() avoids SettingWithCopyWarning
movies_df['title_year'] = movies_df['title_year'].astype(int)  # Convert to integer
movies_df = movies_df[(movies_df['title_year'] >= 1880) & (movies_df['title_year'] <= 2012)]  # Keep years 1880 to 2012

# 2. Split the pipe-delimited 'genres' string into a list and explode to one row per (movie, genre).
# The split is applied to a copy, so movies_df['genres'] stays a string and this cell can be re-run safely.
movies_expanded = movies_df.assign(genres=movies_df['genres'].astype(str).str.split('|')).explode('genres')

# 3. Function to prepare data based on interval and top_n
def prepare_genre_data(interval, top_n):
    # Create year groups
    movies_expanded['year_group'] = (movies_expanded['title_year'] // interval) * interval
    
    # Group by year_group and genre to count occurrences
    genre_counts = movies_expanded.groupby(['year_group', 'genres']).size().unstack(fill_value=0)
    
    # Identify top N genres
    top_genres = genre_counts.sum().sort_values(ascending=False).head(top_n).index
    
    # Filter for top N genres
    genre_counts_top = genre_counts[top_genres]
    
    return genre_counts_top

# 4. Plotting function with fixed figure size
def plot_genre_trends(metric, top_n, interval):
    genre_counts_top = prepare_genre_data(interval, top_n)
    
    if metric == 'Percentage':
        genre_counts_plot = genre_counts_top.div(genre_counts_top.sum(axis=1), axis=0) * 100
        ylabel = 'Percentage (%)'
    else:
        genre_counts_plot = genre_counts_top
        ylabel = 'Number of Movies'
    
    # Clear the current figure
    plt.close('all')
    
    # Create a fixed-size figure
    plt.figure(figsize=(14, 8))
    
    # Plot the stacked bar chart
    ax = genre_counts_plot.plot(kind='bar', stacked=True, colormap='viridis', ax=plt.gca())
    
    plt.title(f'Genre Trends Over Time ({metric})', fontsize=16)
    plt.xlabel('Year Group', fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    
    # Position the legend outside the plot area
    plt.legend(title='Genres', bbox_to_anchor=(1.05, 1), loc='upper left')
    
    # Adjust layout manually
    plt.subplots_adjust(left=0.05, right=0.75, top=0.9, bottom=0.1)
    
    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right')
    
    # Add grid lines
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    plt.show()

# 5. Create interactive widgets with controlled layout
metric_widget = widgets.RadioButtons(
    options=['Number of Movies', 'Percentage'],
    description='Metric:',
    disabled=False
)

top_n_widget = widgets.IntSlider(
    value=10,
    min=1,
    max=20,
    step=1,
    description='Top N Genres:',
    continuous_update=False,
    style={'description_width': 'initial'}
)

interval_widget = widgets.Dropdown(
    options=[5, 10, 15, 20],
    value=10,
    description='Interval (years):',
    disabled=False,
    style={'description_width': 'initial'}
)

# 6. Lay out the controls
ui = VBox([
    HBox([metric_widget, top_n_widget, interval_widget])
])

def update_plot(metric, top_n, interval):
    try:
        plot_genre_trends(metric, top_n, interval)
    except Exception as e:
        print(f"An error occurred: {e}")

# interactive_output returns an Output widget, clears and redraws it whenever a
# control changes, and calls update_plot once on creation, so neither a separate
# Output widget nor a manual first call is needed.
out = interactive_output(
    update_plot,
    {
        'metric': metric_widget,
        'top_n': top_n_widget,
        'interval': interval_widget
    }
)

# 7. Display the widgets and plot
display(ui, out)


VBox(children=(HBox(children=(RadioButtons(description='Metric:', options=('Number of Movies', 'Percentage'), …

Output()
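
To make the grouping inside `prepare_genre_data` concrete, here is a minimal sketch on a three-movie toy frame (the values are made up). It follows the same three steps as the cell above: split and explode the genres, floor each year to the start of its interval, and count genre tags per year group.

```python
import pandas as pd

# Toy frame, illustrative only: pipe-delimited genres as in movies_df.
toy = pd.DataFrame({
    'title_year': [1994, 1999, 2003],
    'genres': ['Action|Drama', 'Drama', 'Action|Comedy'],
})

# One row per (movie, genre) pair.
expanded = toy.assign(genres=toy['genres'].str.split('|')).explode('genres')

# Floor each year to the start of its interval, e.g. 1994 -> 1990 for interval=10.
interval = 10
expanded['year_group'] = (expanded['title_year'] // interval) * interval

# Rows are year groups, columns are genres, cells are genre-tag counts.
genre_counts = expanded.groupby(['year_group', 'genres']).size().unstack(fill_value=0)
print(genre_counts)
```

A movie with two genres contributes two counts, so the row totals are numbers of genre tags and can exceed the number of movies in that year group.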

In [5]:
import pandas as pd
import plotly.graph_objs as go
from ipywidgets import widgets, interactive_output, VBox, HBox, Layout
from IPython.display import display


# 1. Clean and preprocess the 'title_year' column
movies_df['title_year'] = pd.to_numeric(movies_df['title_year'], errors='coerce')  # Non-numeric years become NaN
movies_df = movies_df.dropna(subset=['title_year']).copy()  # Drop rows without a year
movies_df['title_year'] = movies_df['title_year'].astype(int)  # Convert to integer
movies_df = movies_df[(movies_df['title_year'] >= 1880) & (movies_df['title_year'] <= 2012)].copy()  # Keep years 1880 to 2012

# 2. Ensure 'genres' holds a list per movie, then explode.
# If an earlier cell already split the column, the lists are kept as they are;
# calling astype(str) on them would turn each list into its printed form.
movies_df['genres'] = movies_df['genres'].apply(lambda g: g.split('|') if isinstance(g, str) else g)
movies_expanded = movies_df.explode('genres')

# 3. Function to prepare data based on interval and top_n
def prepare_genre_data(interval, top_n):
    # Create year groups
    movies_expanded['year_group'] = (movies_expanded['title_year'] // interval) * interval
    
    # Group by year_group and genre to count occurrences
    genre_counts = movies_expanded.groupby(['year_group', 'genres']).size().unstack(fill_value=0)
    
    # Identify top N genres
    top_genres = genre_counts.sum().sort_values(ascending=False).head(top_n).index
    
    # Filter for top N genres
    genre_counts_top = genre_counts[top_genres]
    
    return genre_counts_top

# 4. Initialize a Plotly FigureWidget with the dark theme
fig_widget = go.FigureWidget(
    layout=go.Layout(
        title=dict(
            text='Genre Trends Over Time',
            font=dict(size=20),  # Larger font size for better visibility
            x=0.5  # Center the title
        ),
        xaxis=dict(
            title='Year Group',
            title_standoff=30,  # Push the x-axis title further down
            tickangle=0,  # Ensure ticks remain horizontal for clarity
        ),
        yaxis=dict(
            title='Number of Movies',
        ),
        template='plotly_dark',  # Built-in dark theme
        legend=dict(
            title='Genres',
            orientation='h',  # Horizontal legend
            yanchor='bottom',
            y=-0.3,  # Place the legend below the plot
            xanchor='center',
            x=0.5
        ),
        margin=dict(l=50, r=50, t=100, b=100),  # Adjust margins for better spacing
    )
)

# 5. Plotting function using Plotly's Stacked Area Chart
def plot_genre_trends_plotly(metric, top_n, interval):
    genre_counts_top = prepare_genre_data(interval, top_n)
    
    if metric == 'Percentage':
        genre_counts_plot = genre_counts_top.div(genre_counts_top.sum(axis=1), axis=0) * 100
        ylabel = 'Percentage (%)'
    else:
        genre_counts_plot = genre_counts_top
        ylabel = 'Number of Movies'
    
    # Update Y-axis title
    fig_widget.layout.yaxis.title.text = ylabel
    
    # Clear existing data in the figure
    fig_widget.data = []
    
    # Add a stacked area trace for each genre
    for genre in genre_counts_plot.columns:
        fig_widget.add_trace(
            go.Scatter(
                x=genre_counts_plot.index,
                y=genre_counts_plot[genre],
                mode='lines',
                stackgroup='one',  # Stacking enabled
                name=genre,
                line=dict(width=0.5)  # Subtle line around the area
            )
        )
    
    # Update layout title
    fig_widget.layout.title.text = f'Genre Trends Over Time ({metric})'

# 6. Create interactive widgets
metric_widget = widgets.RadioButtons(
    options=['Number of Movies', 'Percentage'],
    description='Metric:',
    disabled=False
)

top_n_widget = widgets.IntSlider(
    value=10,
    min=1,
    max=20,
    step=1,
    description='Top N Genres:',
    continuous_update=False,
    style={'description_width': 'initial'},
    layout=Layout(width='50%')
)

interval_widget = widgets.Dropdown(
    options=[5, 10, 15, 20],
    value=10,
    description='Interval (years):',
    disabled=False,
    style={'description_width': 'initial'},
    layout=Layout(width='50%')
)

# 7. Define the update function
def update_plot(metric, top_n, interval):
    plot_genre_trends_plotly(metric, top_n, interval)

# 8. Link widgets to the update function using interactive_output
interactive_plot = interactive_output(
    update_plot,
    {
        'metric': metric_widget,
        'top_n': top_n_widget,
        'interval': interval_widget
    }
)

# 9. Arrange widgets and plot in a layout
ui = VBox([
    HBox([metric_widget, top_n_widget, interval_widget]),
    fig_widget  # Place the FigureWidget directly in the layout
])

# 10. Display the widgets and plot
display(ui)

# 11. No manual first call is needed: interactive_output already ran update_plot once when it was created


VBox(children=(HBox(children=(RadioButtons(description='Metric:', options=('Number of Movies', 'Percentage'), …
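
The 'Percentage' metric in both plotting cells divides each row by the sum of the selected top-N genre counts, so it shows each genre's share of the displayed genre tags rather than the share of movies. The sketch below contrasts that with one possible alternative denominator, the number of distinct movies per year group. It uses toy data, and `movie_id` is a hypothetical unique identifier column that the real dataset may name differently.

```python
import pandas as pd

# Toy data, illustrative only; 'movie_id' is a hypothetical unique identifier.
toy = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'year_group': [1990, 1990, 2000, 2000],
    'genres': [['Action', 'Drama'], ['Drama'], ['Action', 'Comedy'], ['Comedy']],
})
expanded = toy.explode('genres')

# Genre-tag counts per year group (what the notebook plots).
tag_counts = expanded.groupby(['year_group', 'genres']).size().unstack(fill_value=0)

# Share of tags: every row sums to 100, as in the notebook's 'Percentage' metric.
tag_share = tag_counts.div(tag_counts.sum(axis=1), axis=0) * 100

# Share of movies: divide by the number of distinct movies per year group instead.
# Rows may sum to more than 100 because a movie can carry several genres.
movies_per_group = toy.groupby('year_group')['movie_id'].nunique()
movie_share = tag_counts.div(movies_per_group, axis=0) * 100

print(tag_share.round(1))
print(movie_share.round(1))
```

Which denominator is appropriate depends on the question: the tag share suits a stacked chart that must total 100%, while the movie share answers "what fraction of films in this period were tagged with this genre".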