# IMDB Exploratory Data Analysis: Visual Notebook
## Project by: Cain Seriph

## This Jupyter Notebook provides a comprehensive analysis of IMDB entries using a cleaned datasets, sourced from the IMDB non-commercial collection. The analysis offers insights into content release trends, genre popularity,and more.

In [None]:
# Import required libraries
import gzip
import shutil
import pandas as pd
import plotly.express as px

In [2]:
# Extract the movies.csv.gz file
with gzip.open('data/movies.csv.gz', 'rb') as f_in:
    with open('data/movies.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

## Jumping right in we load the cleaned Movies dataset created from the IMDB non-commercial release. And verify its status.

In [3]:
# Importing Cleaned Movie data
movies = pd.read_csv('data/movies.csv')

In [4]:
# Verify dataframe loaded
movies.head()


Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Miss Jerry,1894,Romance,5.3,222
1,The Corbett-Fitzsimmons Fight,1897,"Documentary,News,Sport",5.2,551
2,The Story of the Kelly Gang,1906,"Action,Adventure,Biography",6.0,976
3,L'enfant prodigue,1907,Drama,5.6,31
4,Robbery Under Arms,1907,Drama,4.3,28


In [5]:
# Check number of entries
len(movies)

363630

In [6]:
# Check datatypes
data_types = movies.dtypes
print(data_types)

Title      object
Year        int64
Genres     object
Rating    float64
Votes       int64
dtype: object


## Genres will be our next big focus so first a little polish. Next up we count and build our list of alphas amongst the pack, based off of their share of releases.

In [7]:
# Convert 'Genres' column to string
movies['Genres'] = movies['Genres'].astype(str)

In [8]:
# Count unique entries in the 'Genres' column
unique_genres_count = movies['Genres'].nunique()
print(unique_genres_count)

1256


In [9]:
# Exploding the 'genres' column
genres_exploded = movies['Genres'].str.split(',').explode().reset_index(drop=True)

genres_exploded.head()

0        Romance
1    Documentary
2           News
3          Sport
4         Action
Name: Genres, dtype: object

In [10]:
# Count unique entries in Genres column, print count
unique_genres_count = genres_exploded.nunique()
print(unique_genres_count)

27


In [11]:
# DataFrame movies in column genres count how many times each entry from unique_genres_count exists, print results
genre_counts = movies['Genres'].str.get_dummies(sep=',').sum().reset_index()
genre_counts.columns = ['Genre', 'Count']
print(genre_counts)

          Genre   Count
0        Action   35070
1         Adult       1
2     Adventure   21473
3     Animation    7171
4     Biography   12841
5        Comedy   89291
6         Crime   32051
7   Documentary   67320
8         Drama  164613
9        Family   15579
10      Fantasy   11154
11    Film-Noir     864
12    Game-Show      17
13      History   11329
14       Horror   25647
15        Music   10419
16      Musical    7801
17      Mystery   14729
18         News     710
19   Reality-TV     184
20      Romance   39723
21       Sci-Fi    8621
22        Sport    4779
23    Talk-Show      44
24     Thriller   30760
25          War    7356
26      Western    5684


In [12]:
# from genre_counts print list of the 10 Genres using count column
top_genres = genre_counts.nlargest(10, 'Count')
print(top_genres)
with open('data/top_genres.txt', 'w') as f:
    for index, row in top_genres.iterrows():
        f.write(f"{row['Genre']}: {row['Count']}\n")
        # save top_genres as csv
top_genres.to_csv('data/top_genres.csv', index=False)

          Genre   Count
8         Drama  164613
5        Comedy   89291
7   Documentary   67320
20      Romance   39723
0        Action   35070
6         Crime   32051
24     Thriller   30760
14       Horror   25647
2     Adventure   21473
9        Family   15579


## Here we showcase the elites among genres and take it next level with a bar graph visualization utilizing the Plotly library. Who knew Drama was king?

In [13]:
# Plotly bar 'top_genres'
fig = px.bar(top_genres, x='Genre', y='Count', title='Top 10 Genres',
            color_discrete_sequence=['yellow'],) # IMDB likes yellow
# Outline bars for clarity
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
fig.show()

## Ratings quickly seemed suspicious so more care is taken towards Voting.

In [14]:
# Find 10 highest rated 'Movie' titles
top_rating = movies.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating.to_string(index=False))

                                            Title  Rating
                 Auf den Spuren des Hans im Glück    10.0
                                 Broadway Legends    10.0
                                          Kaputol    10.0
                                       D on Dance    10.0
                              Rainy in Glenageary    10.0
                                It's a Love Thang    10.0
Love Live! Series 9th Anniversary LOVE LIVE! FEST    10.0
                               Olu Bliss: Dive In    10.0
                            Tetonica Castro: Home    10.0
                                             Ixel    10.0


In [15]:
# Find the average rating amongst all titles
average_rating = movies['Rating'].mean()
print("Average Rating:", average_rating)

Average Rating: 6.231308472898275


In [16]:
# Find 10 highest voted 'Movie' titles
top_vote = movies.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote.insert(1, ' ', ' ')
# Print list
print(top_vote.to_string(index=False))

                                            Title     Votes
                         The Shawshank Redemption   3013434
                                  The Dark Knight   2989960
                                        Inception   2657138
                                       Fight Club   2436128
                                     Forrest Gump   2355891
                                     Pulp Fiction   2311032
                                     Interstellar   2296921
                                       The Matrix   2135884
                                    The Godfather   2103011
The Lord of the Rings: The Fellowship of the Ring   2088682


## Now Some expectedly familiar titles pop up tasty Pie chart also utilizing Plotly. Dunno if they would all be my top ten, but definitely among my go to list for re-watch or referral.

In [17]:
# Plotly pie for 'top_vote'
fig_pie = px.pie(top_vote, names='Title', values='Votes', title='Top 10 Votes Distribution')
fig_pie.update_layout(title_x=0.395)  # Adjust the title placement
fig_pie.show()

In [18]:
# Find the average vote count amongst all titles
average_votes = movies['Votes'].mean()
print("Average Vote Count:", average_rating)

Average Vote Count: 6.231308472898275


# Next a deeper dive into genres as we breakdown the movies dataframe into children using our top genres list for reference and order.

## First up will be a look at the head of the pack Drama films.

In [19]:
# Create a new DataFrame 'drama' from 'movies' DataFrame where 'Genres' contains 'Drama'
drama = movies[movies['Genres'].str.contains('Drama', na=False)].reset_index(drop=True)

# Write the 'Drama' Movies DataFrame to a CSV file
drama.to_csv('data/drama_mov.csv', index=False) 

In [20]:
# Intro 'drama'
drama.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,L'enfant prodigue,1907,Drama,5.6,31
1,Robbery Under Arms,1907,Drama,4.3,28
2,Amleto,1908,Drama,3.2,32
3,Don Quijote,1908,Drama,4.3,23
4,Hamlet,1910,Drama,4.5,41


In [21]:
# Count 'Drama' titles
len(drama)

164613

## From here on I will sort films from earliest to latest release. Mainly because I find it interesting to see when they began.

In [22]:
# Sorting the DataFrame from earliest to latest
drama = drama.sort_values(by='Year', ignore_index=True) # Cleaner look
drama.head()


Unnamed: 0,Title,Year,Genres,Rating,Votes
0,La vie et la passion de Jésus Christ,1903,"Biography,Drama",6.5,752
1,S. Lubin's Passion Play,1903,Drama,4.4,11
2,Dingjunshan,1905,Drama,6.3,53
3,L'enfant prodigue,1907,Drama,5.6,31
4,Violante,1907,Drama,3.4,19


In [23]:
# Grouping by decade and counting entries
drama['Decade'] = (drama['Year'] // 10) * 10
decade_counts_drama = drama.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_drama)

# Saving results to a text file
with open('data/drama_decades.txt', 'w') as f:
    f.write(decade_counts_drama.to_string())

    Decade  Count
0     1900     18
1     1910   1261
2     1920   2350
3     1930   4733
4     1940   4827
5     1950   7466
6     1960  10289
7     1970  12932
8     1980  14843
9     1990  16255
10    2000  26073
11    2010  42994
12    2020  20572


## A nice look at Drama climbing the ranks using another Bar graph thanks to Plotly. And heavy visual impact on the blow taking from COVID.

In [24]:
# Plotly bar 'Drama' title release by decade
fig = px.bar(decade_counts_drama, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['firebrick'], # Drama is usually  red
             title='Drama Releases by Decade')
# Outline bars for clarity
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_drama.loc[decade_counts_drama['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()


In [25]:
# Find 10 highest rated 'Drama' titles
top_rating_drama = drama.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating_drama.to_string(index=False))

                                 Title  Rating
                              The Poet    10.0
The Secret Diaries of the Film Censors    10.0
                         Ka Mon Bajwat    10.0
                                  Ixel    10.0
                             Displaced    10.0
                      7 Days in a Coma    10.0
                   Rainy in Glenageary    10.0
                               Kaputol    10.0
                            D on Dance    10.0
                                Rijali    10.0


In [26]:
# Find the average rating amongst all 'Drama' titles
average_rating_drama = drama['Rating'].mean()
print("Average Rating:", average_rating_drama)

Average Rating: 6.257878782356193


In [27]:
# Find 10 highest voted 'Drama' titles
top_vote_drama = drama.nlargest(10, 'Votes')[['Title', 'Votes']]
# Add space between columns for easier reading by adding a blank column with (2) spaces
top_vote_drama.insert(1, ' ', ' ')
# Print list
print(top_vote_drama.to_string(index=False))

                                            Title     Votes
                         The Shawshank Redemption   3013434
                                  The Dark Knight   2989960
                                       Fight Club   2436128
                                     Forrest Gump   2355891
                                     Pulp Fiction   2311032
                                     Interstellar   2296921
                                    The Godfather   2103011
The Lord of the Rings: The Fellowship of the Ring   2088682
    The Lord of the Rings: The Return of the King   2059827
                            The Dark Knight Rises   1896633


## Again its the Votes that seem more honest many a familiar titles from the over all list. Amazing how close the two are in the popular vote race.

In [28]:
# Plotly pie distribution of 'Drama' titles top votes
fig_pie = px.pie(top_vote_drama, names='Title', values='Votes', title='Top 10 Voted Drama Films')
fig_pie.update_layout(title_x=0.4)  # Adjust the title placement
fig_pie.show()

In [29]:
# Find the average vote count amongst all 'Drama' titles
average_votes_drama = drama['Votes'].mean()
print("Average Vote Count:", average_votes_drama)

Average Vote Count: 3708.965045288039


## Onto Comedy films, which I figured would either take the crown or be rather low in the pecking order.

In [30]:
# Create a new DataFrame 'comedy' from 'movies' DataFrame where 'Genres' contains 'Comedy'
comedy = movies[movies['Genres'].str.contains('Comedy', na=False)].reset_index(drop=True)

# Write the 'Comedy' Movies DataFrame to a CSV file
comedy.to_csv('data/comedy_mov.csv', index=False) 

In [31]:
# Intro 'comedy'
comedy.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Salome Mad,1909,Comedy,3.8,24
1,Házasodik az uram,1913,Comedy,3.5,37
2,Die Insel der Seligen,1913,"Comedy,Fantasy",4.6,77
3,A Regiment of Two,1913,"Comedy,Drama",6.3,27
4,Wo ist Coletti?,1913,"Comedy,Crime",6.3,51


In [32]:
# Count 'Comedy' titles
len(comedy)

89291

In [33]:
# Sorting the DataFrame from earliest to latest
comedy = comedy.sort_values(by='Year', ignore_index=True) 
comedy.head()


Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Solser en Hesse,1900,Comedy,2.7,11
1,Lika mot lika,1906,Comedy,3.4,31
2,Um Cavalheiro Deveras Obsequioso,1909,Comedy,4.7,21
3,Salome Mad,1909,Comedy,3.8,24
4,La Chicanera,1909,"Comedy,Musical",4.6,13


In [34]:
# Grouping by decade and counting entries
comedy['Decade'] = (comedy['Year'] // 10) * 10
decade_counts_comedy = comedy.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_comedy)

# Saving results to a text file
with open('data/comedy_decades.txt', 'w') as f:
    f.write(decade_counts_comedy.to_string())

    Decade  Count
0     1900      6
1     1910    306
2     1920    838
3     1930   3129
4     1940   2927
5     1950   3883
6     1960   5450
7     1970   6387
8     1980   7438
9     1990   8705
10    2000  15015
11    2010  24151
12    2020  11056


## Relatively close to Drama almost as if siblings. Numbers like these I would almost be okay with throwing Greek in front of it.

In [35]:
# Plotly bar 'Comedy' title release by decade
fig = px.bar(decade_counts_comedy, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['yellow'], # Comedy is usually  yellow
             title='Comedy Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_comedy.loc[decade_counts_comedy['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()

In [36]:
# Find 10 highest rated Comedy titles
top_rating_comedy = comedy.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating_comedy.to_string(index=False))

                                                              Title  Rating
                                   Auf den Spuren des Hans im Glück    10.0
                                   Las locuras del Dr. Arisos Tenes    10.0
                                                      Lost in Vyond    10.0
                                                        Introvertic    10.0
                                       Don Gil von den grünen Hosen     9.9
                                                Hauptsache Minister     9.9
                                                   Bad Psychiatrist     9.9
Was nicht im Baedecker steht: Bitte, einsteigen zu Käses Rundfahrt!     9.8
                                                Ben Blue's Brothers     9.8
                                                          Quadrille     9.8


In [37]:
# Find the average rating amongst all 'Comedy" titles
average_rating_comedy = comedy['Rating'].mean()
print("Average Rating:", average_rating_comedy)

Average Rating: 5.95986829579689


In [38]:
# Find 10 highest voted 'Comedy' titles
top_vote_comedy = comedy.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_comedy.insert(1, ' ', ' ')
# Print list
print(top_vote_comedy.to_string(index=False))

                  Title     Votes
       Django Unchained   1770268
The Wolf of Wall Street   1663985
     Back to the Future   1363981
Guardians of the Galaxy   1308133
        The Truman Show   1262442
               Deadpool   1194045
                     Up   1170170
           Finding Nemo   1150177
              Toy Story   1112155
         Monsters, Inc.   1018346


## I will give Django a pass as most of us like our Comedy a little dark even if we won't admit it. But The Wolf of Wall Street being not only in the category but so high atop the mountain is a tad scary.

In [39]:
# Plotly pie distribution of 'Comedy' titles top votes
fig_pie = px.pie(top_vote_comedy, names='Title', values='Votes', title='Top 10 Voted Comedy Films')
fig_pie.update_layout(title_x=0.45)  # Adjust the title placement
fig_pie.show()

In [40]:
# Find the average vote count amongst all 'Comedy' titles
average_votes_comedy = comedy['Votes'].mean()
print("Average Vote Count:", average_votes_comedy)

Average Vote Count: 4045.885117201062


## We now visit the Documentary genres. A category that at first seemed odd for movies til I thought about it longer. And wow 3rd in rank, who says smart films don't have a place.

In [41]:
# Create a new DataFrame 'documentary' from 'movies' DataFrame where 'Genres' contains 'Documentary'
docu = movies[movies['Genres'].str.contains('Documentary', na=False)].reset_index(drop=True)

# Write the 'Documentary' Movies DataFrame to a CSV file
docu.to_csv('data/documentary_mov.csv', index=False) 

In [42]:
# Intro docu
docu.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Corbett-Fitzsimmons Fight,1897,"Documentary,News,Sport",5.2,551
1,Life of Villa,1912,"Documentary,War",7.8,33
2,Dr. Mawson in the Antarctic,1913,Documentary,5.7,28
3,The Adventures of Buffalo Bill,1917,"Documentary,Western",6.4,27
4,"Joliet Prison, Joliet, Ill.",1914,Documentary,5.8,10


In [43]:
# Count 'Documentary' titles
len(docu)

67320

In [44]:
# Sorting the DataFrame from earliest to latest
docu = docu.sort_values(by='Year', ignore_index=True) 
docu.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Birmingham,1896,Documentary,4.0,22
1,The Corbett-Fitzsimmons Fight,1897,"Documentary,News,Sport",5.2,551
2,Reproduction of the Corbett and Fitzsimmons Fight,1897,"Documentary,News,Sport",4.3,65
3,Saída dos Operários do Arsenal da Marinha,1898,Documentary,4.7,11
4,A Rua Augusta em Dia de Festa,1898,Documentary,3.0,10


In [45]:
# Grouping by decade and counting entries
docu['Decade'] = (docu['Year'] // 10) * 10
decade_counts_docu = docu.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_docu)

# Saving results to a text file
with open('data/documentary_decades.txt', 'w') as f:
    f.write(decade_counts_docu.to_string())

    Decade  Count
0     1890     11
1     1900     16
2     1910     50
3     1920    106
4     1930    128
5     1940    170
6     1950    262
7     1960    752
8     1970   1494
9     1980   1980
10    1990   4270
11    2000  15094
12    2010  29611
13    2020  13376


## Our Documentary Bar graph is especially interesting to see how new Documentary releases are and have taken such a high seat. Considering the destruction COVID had on the movie industry, I have to speculate Documentaries will either rise higher possibly taking the crown. Also possible is a plummit as people seek escape from the quarantine boredom.

In [46]:
# Plotly bar 'Documentary' title release by decade
fig = px.bar(decade_counts_docu, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['gold'], # Historical is usually  gold
             title='Documentary Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_docu.loc[decade_counts_docu['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()

In [47]:
# Find 10 highest rated "Documentary" titles"
top_rating_docu = docu.nlargest(10, 'Rating')[['Title', 'Year', 'Rating']]
# Print list
print(top_rating_docu.to_string(index=False))

                            Title  Year  Rating
                 Broadway Legends  2002    10.0
Bio jednom jedan... Dusko Radovic  2006    10.0
                          Carraco  2022    10.0
                       COMPLEXion  2023    10.0
   Paradise (bunnies and flowers)  2023    10.0
 Retratos de República Dominicana  2024    10.0
       Opioids: The Hidden Crisis  2024    10.0
                          Inbound  2025    10.0
 Making of Sash! With My Own Eyes  2000     9.9
                         Kot ptic  2006     9.9


In [48]:
# Find the average rating amongst all 'Documentary' titles
average_rating_docu = docu['Rating'].mean()
print("Average Rating:", average_rating_docu)

Average Rating: 7.1867988710635755


In [49]:
# Find 10 highest voted 'Documentary' titles
top_vote_docu = docu.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_docu.insert(1, ' ', ' ')
# Print list
print(top_vote_docu.to_string(index=False))

                Title    Votes
Bowling for Columbine   150408
      Fahrenheit 9/11   133554
        Super Size Me   115207
   Jackass: The Movie   104011
   The Social Dilemma    91622
An Inconvenient Truth    85902
   Jackass Number Two    81327
            Free Solo    80994
           Inside Job    80863
                Senna    79995


## Voting always shakes things up a bit as titles from my youth holding their own. And having to admit the Jackass films count as documentaries feels strange.

In [50]:
# Plotly pie distribution of 'Documentary' titles top votes
fig_pie = px.pie(top_vote_docu, names='Title', values='Votes', title='Top 10 Voted Documentary Films')
fig_pie.update_layout(title_x=0.46)  # Adjust the title placement
fig_pie.show()

In [51]:
# Find the average vote count amongst all 'Comedy' titles
average_votes_docu = docu['Votes'].mean()
print("Average Vote Count:", average_votes_docu)

Average Vote Count: 262.96200237670826


## A I would have thought favorite for many, Romance comes to stage.

In [52]:
# Create a new DataFrame 'romance' from 'movies' DataFrame where 'Genres' contains 'Romance'
romance = movies[movies['Genres'].str.contains('Romance', na=False)].reset_index(drop=True)

# Write the 'Romance' Movies DataFrame to a CSV file
romance.to_csv('data/romance_mov.csv', index=False) 

In [53]:
# Intro romance
romance.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Miss Jerry,1894,Romance,5.3,222
1,La dame aux camélias,1912,"Drama,Romance",5.3,45
2,Amor fatal,1911,"Drama,Romance",7.5,24
3,Anny - en gatepiges roman,1912,"Drama,Romance",4.6,17
4,Den glade løjtnant,1912,Romance,3.8,11


In [54]:
# Count 'Romance' titles
len(romance)

39723

In [55]:
# Sorting the DataFrame from earliest to latest
romance = romance.sort_values(by='Year', ignore_index=True) 
romance.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Miss Jerry,1894,Romance,5.3,222
1,A Viúva Alegre,1909,Romance,5.3,21
2,Sumurûn,1910,Romance,5.0,31
3,Amor fatal,1911,"Drama,Romance",7.5,24
4,Arrah-Na-Pogue,1911,"Drama,Romance",3.2,28


In [56]:
# Grouping by decade and counting entries
romance['Decade'] = (romance['Year'] // 10) * 10
decade_counts_romance = romance.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_romance)

# Saving results to a text file
with open('data/romance_decades.txt', 'w') as f:
    f.write(decade_counts_romance.to_string())

    Decade  Count
0     1890      1
1     1900      1
2     1910    230
3     1920    841
4     1930   2224
5     1940   1600
6     1950   2036
7     1960   2544
8     1970   2415
9     1980   2787
10    1990   3946
11    2000   6600
12    2010   9611
13    2020   4887


## Fairly steady contender with a couple of spikes and fairing as best as one could during COVID.

In [57]:
# Plotly bar 'Romance' title release by decade
fig = px.bar(decade_counts_romance, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['orange'], # Musicals are usually  orange
             title='Romance Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_romance.loc[decade_counts_romance['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()


In [58]:
# Find 10 highest rated 'Romance' titles
top_romance_romance = romance.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_romance_romance.to_string(index=False))

                                        Title  Rating
Dekh Mujhe Bhi - Syed Fardeen and Shweta Jean    10.0
                                  The College     9.9
                        Pyar Kiya Toh Nibhana     9.8
                               Peluang Ketiga     9.8
                    Yello Jogappa Ninnaramane     9.8
                             Pop Lock 'n Roll     9.7
                                  Get Over It     9.7
                               Tahanan (Home)     9.7
                          A Ghetto Love Story     9.7
                                      Othello     9.6


In [59]:
# Find the average rating amongst all 'Romance' titles
average_rating_romance = romance['Rating'].mean()
print("Average Rating:", average_rating_romance)

Average Rating: 6.090688014500416


In [60]:
# Find 10 highest voted 'Romance' titles
top_vote_romance = romance.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_romance.insert(1, ' ', ' ')
# Print list
print(top_vote_romance.to_string(index=False))

                                Title     Votes
                         Forrest Gump   2355891
                              Titanic   1328713
Eternal Sunshine of the Spotless Mind   1125792
                    Good Will Hunting   1124684
                  Slumdog Millionaire    892227
  Le fabuleux destin d'Amélie Poulain    813889
                      La vita è bella    770987
              Silver Linings Playbook    756705
  The Curious Case of Benjamin Button    717025
                                  Her    696551


## Surprised but not judging on some entries emerging in the Romance genre. I like Slumdog and benjamin Button so nice to see them get such praise.

In [61]:
# Plotly pie distribution of 'Romance' titles top votes
fig_pie = px.pie(top_vote_romance, names='Title', values='Votes', title='Top 10 Voted Romance Films')
fig_pie.update_layout(title_x=0.46)  # Adjust the title placement
fig_pie.show()

In [62]:
# Find the average vote count amongst all 'Romance' titles
average_votes_romance = romance['Votes'].mean()
print("Average Vote Count:", average_votes_romance)

Average Vote Count: 3698.346751252423


## Heads up as we look into Action films.

In [63]:
# Create a new DataFrame 'action' from 'movies' DataFrame where 'Genres' contains 'Action'
action = movies[movies['Genres'].str.contains('Action', na=False)].reset_index(drop=True)

# Write the 'Action' Movies DataFrame to a CSV file
action.to_csv('data/action_mov.csv', index=False) 

In [64]:
# Intro action
action.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Story of the Kelly Gang,1906,"Action,Adventure,Biography",6.0,976
1,What Happened to Mary,1912,"Action,Drama,Thriller",6.2,36
2,Who Will Marry Mary?,1913,"Action,Adventure",5.2,29
3,Cameo Kirby,1914,"Action,Drama,Romance",6.5,18
4,The Exploits of Elaine,1914,Action,6.2,107


In [65]:
# Count 'Action' titles
len(action)

35070

In [66]:
# Sorting the DataFrame from earliest to latest
action = action.sort_values(by='Year', ignore_index=True) 
action.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Story of the Kelly Gang,1906,"Action,Adventure,Biography",6.0,976
1,Chûshingura,1910,"Action,Drama",5.6,29
2,Attack on the Gold Escort,1911,"Action,Drama",4.2,26
3,What Happened to Mary,1912,"Action,Drama,Thriller",6.2,36
4,Cooee and the Echo,1912,"Action,Adventure",5.4,25


In [67]:
# Grouping by decade and counting entries
action['Decade'] = (action['Year'] // 10) * 10
decade_counts_action = action.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_action)

# Saving results to a text file
with open('data/action_decades.txt', 'w') as f:
    f.write(decade_counts_action.to_string())

    Decade  Count
0     1900      1
1     1910    118
2     1920    416
3     1930    659
4     1940    507
5     1950    720
6     1960   1657
7     1970   3122
8     1980   4058
9     1990   4672
10    2000   5252
11    2010   9216
12    2020   4672


## Action is another I would have guessed higher in release numbers, but I guess it still has to be funny or serious above the fast pace.

In [68]:
# Plotly bar 'Action' title release by decade
fig = px.bar(decade_counts_action, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['dodgerblue'], # Action is usually  blue
             title='Action Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_action.loc[decade_counts_action['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()

In [69]:
# Find 10 highest rated 'Action' titles
top_action_action = action.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_action_action.to_string(index=False))

                                        Title  Rating
                          The last USSR blues    10.0
                                       Vo tme    10.0
                 The Treasure of Pancho Villa     9.9
                                Tujhko Pukare     9.8
                                      One Way     9.8
                           Azotes de Barrio 2     9.8
                                      The RVM     9.8
                         Susuko ba ako, inay?     9.7
                             The Knight Squad     9.7
OF THE SEA: a film about California Fishermen     9.7


In [70]:
# Find the average rating amongst all 'Action' titles
average_rating_action = action['Rating'].mean()
print("Average Rating:", average_rating_action)

Average Rating: 5.720379241516967


In [71]:
# Find 10 highest voted 'Action' titles
top_vote_action = action.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_action.insert(1, ' ', ' ')
# Print list
print(top_vote.to_string(index=False))

                                            Title     Votes
                         The Shawshank Redemption   3013434
                                  The Dark Knight   2989960
                                        Inception   2657138
                                       Fight Club   2436128
                                     Forrest Gump   2355891
                                     Pulp Fiction   2311032
                                     Interstellar   2296921
                                       The Matrix   2135884
                                    The Godfather   2103011
The Lord of the Rings: The Fellowship of the Ring   2088682


## I like action movies tho not always my top. Votes were much cleared on peoples tastes here.

In [72]:
# Plotly pie distribution of 'Action' titles top votes
fig_pie = px.pie(top_vote_action, names='Title', values='Votes', title='Top 10 Voted Action Films')
fig_pie.update_layout(title_x=0.4)  # Adjust the title placement
fig_pie.show()

In [73]:
# Find the average vote count amongst all 'Action' titles
average_votes_action = action['Votes'].mean()
print("Average Vote Count:", average_votes_action)

Average Vote Count: 10736.526575420587


## Crime movies steps up, with in my opinion a much further increasing variety compared to other film types.

In [74]:
# Create a new DataFrame 'crime' from 'movies' DataFrame where 'Genres' contains 'Crime'
crime = movies[movies['Genres'].str.contains('Crime', na=False)].reset_index(drop=True)

# Write the 'Crime' Movies DataFrame to a CSV file
crime.to_csv('data/crime_mov.csv', index=False) 

In [75]:
# Intro crime
crime.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Ansigttyven I,1910,Crime,3.9,21
1,Zigomar contre Nick Carter,1912,"Crime,Thriller",6.0,54
2,What 80 Million Women Want,1913,"Crime,Drama,Romance",4.1,56
3,Fantômas I: À l'ombre de la guillotine,1913,"Crime,Drama",6.9,2612
4,In the Bishop's Carriage,1913,"Crime,Drama",5.6,27


In [76]:
# Count 'Crime' titles
len(crime)

32051

In [77]:
# Sorting the DataFrame from earliest to latest
crime = crime.sort_values(by='Year', ignore_index=True) 
crime.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Ansigttyven I,1910,Crime,3.9,21
1,Harry the Footballer,1911,"Adventure,Crime,Drama",4.3,34
2,Zigomar contre Nick Carter,1912,"Crime,Thriller",6.0,54
3,Le mystère des roches de Kador,1912,"Crime,Drama",6.6,452
4,L'enfant de Paris,1913,"Crime,Drama",7.2,483


In [78]:
# Grouping by decade and counting entries
crime['Decade'] = (crime['Year'] // 10) * 10
decade_counts_crime = crime.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_crime)

# Saving results to a text file
with open('data/crime_decades.txt', 'w') as f:
    f.write(decade_counts_crime.to_string())

    Decade  Count
0     1910    138
1     1920    294
2     1930   1334
3     1940   1163
4     1950   1737
5     1960   2290
6     1970   2858
7     1980   2845
8     1990   3715
9     2000   4623
10    2010   7262
11    2020   3792


# Surprisingly close to Romance in both steady and fairly "tanky" towards the impact of COVID.

In [79]:
# Plotly bar 'Crime' title release by decade
fig = px.bar(decade_counts_crime, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['purple'], # Crime is usually purple
             title='Crime Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_crime.loc[decade_counts_crime['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()


In [80]:
# Find 10 highest rated 'Crime' titles
top_crime_crime = crime.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_crime_crime.to_string(index=False))

                                    Title  Rating
La vida por mi barrio 13 (Mafia mexicana)    10.0
                      Der Mann von drüben     9.8
                       Party im Zwielicht     9.8
                       Juventud en drogas     9.8
                            Tujhko Pukare     9.8
                                Asatveera     9.8
                            Dheera Samrat     9.8
                                   Redrum     9.7
                                      4N6     9.7
           Die Dame in der schwarzen Robe     9.6


In [81]:
# Find the average rating amongst all 'Crime' titles
average_rating_crime = crime['Rating'].mean()
print("Average Rating:", average_rating_crime)

Average Rating: 6.008676796355808


In [82]:
# Find 10 highest voted 'Crime' titles
top_vote_crime = crime.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_crime.insert(1, ' ', ' ')
# Print list
print(top_vote_crime.to_string(index=False))

                   Title     Votes
         The Dark Knight   2989960
            Pulp Fiction   2311032
           The Godfather   2103011
                   Se7en   1888593
 The Wolf of Wall Street   1663985
The Silence of the Lambs   1617769
                   Joker   1589935
            The Departed   1474380
          The Green Mile   1470910
   The Godfather Part II   1416708


## Some really good Crime films out there not sure if its an easier mark to hit or we all just favor the bad ones. Clear voice as to which film is top of the tier here.

In [83]:
# Plotly pie distribution of 'Crime' titles top votes
fig_pie = px.pie(top_vote_crime, names='Title', values='Votes', title='Top 10 Voted Crime Films')
fig_pie.update_layout(title_x=0.45)  # Adjust the title placement
fig_pie.show()

In [84]:
# Find the average vote count amongst all 'Crime' titles
average_votes_crime = crime['Votes'].mean()
print("Average Vote Count:", average_votes_crime)

Average Vote Count: 7033.99007831269


## Thriller films take focus ahead. Google taught me the main difference between Action and Thriller is as follows. Action films are often fast paced physical confrontations. Where as Thriller films take on a more suspenseful or mystery role often focusing on psychological tension or a danger to the protagonist.

In [85]:
# Create a new DataFrame 'thriller' from 'movies' DataFrame where 'Genres' contains 'Thriller'
thriller = movies[movies['Genres'].str.contains('Thriller', na=False)].reset_index(drop=True)

# Write the 'Thriller' Movies DataFrame to a CSV file
thriller.to_csv('data/thriller_mov.csv', index=False) 

In [86]:
# Intro thriller
thriller.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,What Happened to Mary,1912,"Action,Drama,Thriller",6.2,36
1,Zigomar contre Nick Carter,1912,"Crime,Thriller",6.0,54
2,Der Andere,1913,"Drama,Thriller",5.4,126
3,"The $5,000,000 Counterfeiting Plot",1914,"Crime,Thriller",6.8,29
4,After Five,1915,"Comedy,Crime,Thriller",4.8,26


In [87]:
# Count 'Thriller' titles
len(thriller)

30760

In [88]:
# Sorting the DataFrame from earliest to latest
thriller = thriller.sort_values(by='Year', ignore_index=True) 
thriller.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,One Hundred Years Ago,1911,"Drama,Thriller",2.3,19
1,Zigomar contre Nick Carter,1912,"Crime,Thriller",6.0,54
2,What Happened to Mary,1912,"Action,Drama,Thriller",6.2,36
3,Strike,1912,"Drama,Thriller",5.0,12
4,"Zigomar, peau d'anguille - Épisode 1: La résur...",1913,"Action,Thriller",5.8,23


In [89]:
# Grouping by decade and counting entries
thriller['Decade'] = (thriller['Year'] // 10) * 10
decade_counts_thriller = thriller.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_thriller)

# Saving results to a text file
with open('data/thriller_decades.txt', 'w') as f:
    f.write(decade_counts_thriller.to_string())

    Decade  Count
0     1910     34
1     1920     63
2     1930    142
3     1940    226
4     1950    457
5     1960    880
6     1970   1500
7     1980   1908
8     1990   3379
9     2000   4826
10    2010  10477
11    2020   6868


## Thrillers currently take the crown for survival during COVID. Considering the theme of both it makes sense.

In [90]:
# Plotly bar 'Thriller' title release by decade
fig = px.bar(decade_counts_thriller, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['aqua'], # Thriller is basically action so another blue
             title='Thriller Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_thriller.loc[decade_counts_thriller['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()


In [91]:
# Find 10 highest rated 'Thriller' titles
top_rating_thriller = thriller.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating_thriller.to_string(index=False))

                    Title  Rating
                   Vo tme    10.0
            Alone at Home     9.8
    The Trees of the East     9.8
                    Rugna     9.8
   The Sound of Southside     9.8
   Nasoor - Let's Restart     9.8
                    Ajaan     9.8
            Dheera Samrat     9.8
        The Platinum Loop     9.8
Virinchi Independent Film     9.7


In [92]:
# Find the average rating amongst all 'Thriller' titles
average_rating_thriller = thriller['Rating'].mean()
print("Average Rating:", average_rating_thriller)

Average Rating: 5.598881664499351


In [93]:
# Find 10 highest voted 'Thriller' titles
top_vote_thriller = thriller.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_thriller.insert(1, ' ', ' ')
# Print list
print(top_vote_thriller.to_string(index=False))

                   Title     Votes
   The Dark Knight Rises   1896633
The Silence of the Lambs   1617769
                   Joker   1589935
          Shutter Island   1531324
            The Departed   1474380
                 Memento   1367851
       Kill Bill: Vol. 1   1239769
          Reservoir Dogs   1124287
               Gone Girl   1110965
  No Country for Old Men   1109883


## Again Dark Knight taking the peoples hearts and several personal favorites made the list.

In [94]:
# Plotly pie distribution of 'Thriller' titles top votes
fig_pie = px.pie(top_vote_thriller, names='Title', values='Votes', title='Top 10 Voted Thriller Films')
fig_pie.update_layout(title_x=0.45)  # Adjust the title placement
fig_pie.show()

In [95]:
# Find the average vote count amongst all 'Thriller' titles
average_votes_thriller = thriller['Votes'].mean()
print("Average Vote Count:", average_votes_thriller)

Average Vote Count: 6627.625292587776


## Horror my above all favorite genre finally up to bat.

In [96]:
# Create a new DataFrame 'horror' from 'movies' DataFrame where 'Genres' contains 'Horror'
horror = movies[movies['Genres'].str.contains('Horror', na=False)].reset_index(drop=True)

# Write the 'Horror' Movies DataFrame to a CSV file
horror.to_csv('data/horror_mov.csv', index=False) 

In [97]:
# Intro horror
horror.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Der Student von Prag,1913,"Drama,Fantasy,Horror",6.4,2533
1,The Avenging Conscience: or 'Thou Shalt Not Kill',1914,"Crime,Drama,Horror",6.4,1504
2,The Ghost Breaker,1914,"Adventure,Horror",4.8,49
3,Der Golem,1914,Horror,6.7,1280
4,Der Hund von Baskerville,1914,"Crime,Horror,Mystery",5.6,167


In [98]:
# Count 'Horror' titles
len(horror)

25647

In [99]:
# Sorting the DataFrame from earliest to latest
horror = horror.sort_values(by='Year', ignore_index=True) 
horror.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,Hidaka iriai zakura,1909,"Drama,Horror",5.9,15
1,Botan dôrô,1910,"Drama,Horror",4.5,13
2,Trilby,1912,Horror,3.9,33
3,Satana,1912,"Drama,Horror",5.3,37
4,I misteri della psiche,1912,"Drama,Fantasy,Horror",6.3,17


In [100]:
# Grouping by decade and counting entries
horror['Decade'] = (horror['Year'] // 10) * 10
decade_counts_horror = horror.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_horror)

# Saving results to a text file
with open('data/horror_decades.txt', 'w') as f:
    f.write(decade_counts_horror.to_string())

    Decade  Count
0     1900      1
1     1910     68
2     1920    101
3     1930    140
4     1940    190
5     1950    327
6     1960    721
7     1970   1483
8     1980   1805
9     1990   1673
10    2000   3882
11    2010   9593
12    2020   5663


## Another stout contender during COVID. Oh my the 19080's horror spike is a gem. 

In [101]:
# Plotly bar 'Horror' title release by decade
fig = px.bar(decade_counts_horror, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['limegreen'], # Horror is usually  dark green oddly
             title='Horror Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_horror.loc[decade_counts_horror['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()

In [102]:
# Find 10 highest rated 'Horror' titles
top_rating_horror = horror.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating_horror.to_string(index=False))

                                     Title  Rating
            T.T.T. [Terror in Teruel Town]     9.6
              The Forest Through the Trees     9.6
                      Sandook - Ek Rahasya     9.5
                              Mashaarojinn     9.5
                   Guard: Revenge for Love     9.5
                             God Loves You     9.4
                    Dead Slate: Beginnings     9.4
Michael and Ghostface: Best Buds the Movie     9.4
                               Clownface 3     9.4
                       Happy Birthday Luci     9.4


In [103]:
# Find the average rating amongst all 'Horror' titles
average_rating_horror = horror['Rating'].mean()
print("Average Rating:", average_rating_horror)

Average Rating: 4.996252973057278


In [104]:
# Find 10 highest voted 'Horror' titles
top_vote_horror = horror.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_horror.insert(1, ' ', ' ')
# Print list
print(top_vote_horror.to_string(index=False))

          Title     Votes
    The Shining   1156771
          Alien   1006850
    I Am Legend    843154
         Aliens    802962
American Psycho    768157
        Get Out    749922
    World War Z    747182
         Psycho    746547
           Jaws    688133
     Zombieland    641092


## While I know everyone of these films, so many beloved gems most likely collecting dust near the bottom. Funny how the modern film industry barely competes. Not really tho seriously I love me some good Horror, its all shock thrills these days.

In [105]:
# Plotly pie distribution of 'Horror' titles top votes
fig_pie = px.pie(top_vote_horror, names='Title', values='Votes', title='Top 10 Voted Horror Films')
fig_pie.update_layout(title_x=0.45)  # Adjust the title placement
fig_pie.show()

In [106]:
# Find the average vote count amongst all 'Horror' titles
average_votes_horror = horror['Votes'].mean()
print("Average Vote Count:", average_votes_horror)

Average Vote Count: 4763.918119078255


## On the horizon is Adventure films, thought everybody like adventures but apparently not.    

In [107]:
# Create a new DataFrame 'adventure' from 'movies' DataFrame where 'Genres' contains 'Adventure'
adventure = movies[movies['Genres'].str.contains('Adventure', na=False)].reset_index(drop=True)

# Write the 'adventure' Movies DataFrame to a CSV file
adventure.to_csv('data/adventure_mov.csv', index=False) 

In [108]:
# Intro adventure
adventure.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Story of the Kelly Gang,1906,"Action,Adventure,Biography",6.0,976
1,The Fairylogue and Radio-Plays,1908,"Adventure,Fantasy",5.2,78
2,Don Juan de Serrallonga,1910,"Adventure,Drama",3.5,22
3,L'inferno,1911,"Adventure,Drama,Fantasy",7.0,3739
4,The Adventures of Kathlyn,1913,Adventure,5.5,48


In [109]:
# Count 'Adventure' titles
len(adventure)

21473

In [110]:
# Sorting the DataFrame from earliest to latest
adventure = adventure.sort_values(by='Year', ignore_index=True) 
adventure.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Story of the Kelly Gang,1906,"Action,Adventure,Biography",6.0,976
1,The Fairylogue and Radio-Plays,1908,"Adventure,Fantasy",5.2,78
2,Sonho de Valsa,1909,"Adventure,Drama",2.4,25
3,Don Juan de Serrallonga,1910,"Adventure,Drama",3.5,22
4,L'inferno,1911,"Adventure,Drama,Fantasy",7.0,3739


In [111]:
# Grouping by decade and counting entries
adventure['Decade'] = (adventure['Year'] // 10) * 10
decade_counts_adventure = adventure.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_adventure)

# Saving results to a text file
with open('data/adventure_decades.txt', 'w') as f:
    f.write(decade_counts_adventure.to_string())

    Decade  Count
0     1900      3
1     1910    327
2     1920    761
3     1930    800
4     1940    764
5     1950   1241
6     1960   1914
7     1970   2074
8     1980   1857
9     1990   1584
10    2000   2474
11    2010   5393
12    2020   2281


## Some of the earliest films were Adventure. Setting the stage to only be dwarfed by most of its siblings.

In [112]:
# Plotly bar 'Adventure' title release by decade
fig = px.bar(decade_counts_adventure, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['lawngreen'], # Adventure is usually  green
             title='Adventure Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_adventure.loc[decade_counts_adventure['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()

In [113]:
# Find 10 highest rated 'Adventure' titles
top_rating_adventure = adventure.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating_adventure.to_string(index=False))

                           Title  Rating
Auf den Spuren des Hans im Glück    10.0
               Independent Roads     9.9
    The Treasure of Pancho Villa     9.9
               Hansel and Gretel     9.8
             Flying Over Everest     9.8
                Buried in Tucson     9.8
                           Parto     9.8
                 The Inventurers     9.8
             McTaggart's Fortune     9.8
              Borderline Forever     9.8


In [114]:
# Find the average rating amongst all 'Adventure' titles
average_rating_adventure = adventure['Rating'].mean()
print("Average Rating:", average_rating_adventure)

Average Rating: 5.867484748288549


In [115]:
# Find 10 highest voted 'Adventure' titles
top_vote_adventure = adventure.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_adventure.insert(1, ' ', ' ')
# Print list
print(top_vote_adventure.to_string(index=False))

                                            Title     Votes
                                        Inception   2657138
                                     Interstellar   2296921
The Lord of the Rings: The Fellowship of the Ring   2088682
    The Lord of the Rings: The Return of the King   2059827
            The Lord of the Rings: The Two Towers   1856216
                                        Gladiator   1738157
                             Inglourious Basterds   1661387
                                        Star Wars   1496080
   Star Wars: Episode V - The Empire Strikes Back   1428476
                                           Avatar   1422305


## A crime has been committed looking ath this list. To heated to make personal stake, but I would love to hear peoples open opinions on what would populate their list.

In [116]:
# Plotly pie distribution of 'Adventure' titles top votes
fig_pie = px.pie(top_vote_adventure, names='Title', values='Votes', title='Top 10 Voted Adventure Films')
fig_pie.update_layout(title_x=0.397)  # Adjust the title placement
fig_pie.show()

In [117]:
# Find the average vote count amongst all 'Adventure' titles
average_votes_adventure = adventure['Votes'].mean()
print("Average Vote Count:", average_votes_adventure)

Average Vote Count: 14509.495412844037


## Onto Family films whom is rather low on the list, considering I would have thought it better to be a super type unto itself.

In [118]:
# Create a new DataFrame 'family' from 'movies' DataFrame where 'Genres' contains 'Family'
family = movies[movies['Genres'].str.contains('Family', na=False)].reset_index(drop=True)

# Write the 'Family' Movies DataFrame to a CSV file
family.to_csv('data/family_mov.csv', index=False) 

In [119]:
# Intro family
family.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Life of Moses,1909,"Biography,Drama,Family",5.5,65
1,"His Majesty, the Scarecrow of Oz",1914,"Adventure,Comedy,Family",5.3,553
2,The Patchwork Girl of Oz,1914,"Adventure,Comedy,Family",5.4,603
3,Alice in Wonderland,1915,"Adventure,Family,Fantasy",6.1,856
4,The Babes in the Woods,1917,"Drama,Family,Fantasy",5.7,43


In [120]:
# Count 'Family' titles
len(family)

15579

In [121]:
# Sorting the DataFrame from earliest to latest
family = family.sort_values(by='Year', ignore_index=True) 
family.head()

Unnamed: 0,Title,Year,Genres,Rating,Votes
0,The Life of Moses,1909,"Biography,Drama,Family",5.5,65
1,"His Majesty, the Scarecrow of Oz",1914,"Adventure,Comedy,Family",5.3,553
2,The Patchwork Girl of Oz,1914,"Adventure,Comedy,Family",5.4,603
3,Alice in Wonderland,1915,"Adventure,Family,Fantasy",6.1,856
4,Snow White,1916,"Adventure,Family,Fantasy",3.8,61


In [122]:
# Grouping by decade and counting entries
family['Decade'] = (family['Year'] // 10) * 10
decade_counts_family = family.groupby('Decade').size().reset_index(name='Count')

# Printing results
print(decade_counts_family)

# Saving results to a text file
with open('data/family_decades.txt', 'w') as f:
    f.write(decade_counts_family.to_string())

    Decade  Count
0     1900      1
1     1910     13
2     1920     44
3     1930    193
4     1940    295
5     1950    627
6     1960    824
7     1970   1280
8     1980   1632
9     1990   1543
10    2000   2271
11    2010   4920
12    2020   1936


## No surprise to see COVID really hammer down the releases of Family films. I had to check and it was 1937 when Disney stepped into the feature film ring. Would have figured a bigger bump but we all know it got there. Problem here is it kinda falls into the babysitter category.

In [123]:
# Plotly bar 'Family' title release by decade
fig = px.bar(decade_counts_family, x='Decade', y='Count', # color='Decade', 
             color_discrete_sequence=['yellow'], # Family films usually comical so yellow
             title='Family Releases by Decade')
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
# Label the bar for 2010 pre-covid vertically
fig.add_annotation(x=2010, y=decade_counts_family.loc[decade_counts_family['Decade'] == 2010, 'Count'].values[0],
                   text="Pre-COVID",
                   showarrow=True,
                   arrowhead=2,
                   ax=0,
                   ay=-40,
                   font=dict(size=12))
fig.show()


In [124]:
# Find 10 highest rated 'Family' titles
top_rating_family = family.nlargest(10, 'Rating')[['Title', 'Rating']]
# Print list
print(top_rating_family.to_string(index=False))

                           Title  Rating
                        The Poet    10.0
            Kids on Kids on Kids    10.0
Auf den Spuren des Hans im Glück    10.0
               It's a Love Thang    10.0
                     Dhh Lekacha     9.9
               Hansel and Gretel     9.8
                          Partav     9.8
               The Road to Truth     9.8
                    Amche Samsar     9.8
              An American Posada     9.8


In [125]:
# Find the average rating amongst all 'Family' titles
average_rating_family = family['Rating'].mean()
print("Average Rating:", average_rating_family)

Average Rating: 6.2178445343090045


In [129]:
# Find 10 highest voted 'Family' titles
top_vote_family = family.nlargest(10, 'Votes')[['Title', 'Votes']]
top_vote_family.insert(1, ' ', ' ')
# Print list
print(top_vote_family.to_string(index=False))

                                        Title     Votes
                                       WALL·E   1248803
Harry Potter and the Deathly Hallows - Part 2    987688
                Sen to Chihiro no kamikakushi    897618
        Harry Potter and the Sorcerer's Stone    896345
      Harry Potter and the Chamber of Secrets    722822
     Harry Potter and the Prisoner of Azkaban    722684
          Harry Potter and the Goblet of Fire    711186
                                   Home Alone    691178
    Harry Potter and the Order of the Phoenix    660589
Harry Potter and the Deathly Hallows - Part 1    625351


## Found some of those missing Adventure films. Jokes aside who all is happy Wall-E take the throne? "Sen to Chihiro no Kamikakushi" is the Japanese titles for "Spirited Away". Also how is a film series about forgetting your child make the list?

In [127]:
# Plotly pie distribution of 'Family' titles top votes
fig_pie = px.pie(top_vote_family, names='Title', values='Votes', title='Top 10 Voted Family Films')
fig_pie.update_layout(title_x=0.4)  # Adjust the title placement
fig_pie.show()

In [128]:
# Find the average vote count amongst all 'Family' titles
average_votes_family = family['Votes'].mean()
print("Average Vote Count:", average_votes_family)

Average Vote Count: 3485.1672122729315
