Import the IMDb data files and filter them down to the rows I need. From a brief look at the data I can see that (1) some movies haven't come out yet, and (2) some movies have no genres. Neither is helpful for my analysis, so I have removed both. I also only want to look at recent movies, so I have removed anything with a start year before 2014. From the ratings data I am removing any title with 100 or fewer votes, since I don't consider ratings based on so few votes significant.

In [4]:
import pandas as pd

# Basics: keep titles from 2014 through 2020 that list at least one genre
df_imdb_basics = pd.read_csv('zippedData/imdb.title.basics.csv.gz', compression='gzip')
df_filtered = df_imdb_basics[(df_imdb_basics['start_year'] <= 2020) & (df_imdb_basics['start_year'] >= 2014)]
df_filtered = df_filtered.dropna(subset=['genres'])

# Ratings: keep titles with more than 100 votes
df_imdb_ratings = pd.read_csv('zippedData/imdb.title.ratings.csv.gz', compression='gzip')
df_imdb_ratings = df_imdb_ratings[(df_imdb_ratings['numvotes'] > 100)]

# Alternate (regional) titles, used later to restrict the data to the US
df_imdb_akas = pd.read_csv('zippedData/imdb.title.akas.csv.gz', compression='gzip')

Now I am going to split the comma-separated genres string into separate columns (genre1, genre2, genre3).

In [5]:
# Rows without genres were already dropped in the previous cell.
# Split off the first genre; genres[1] holds whatever remains (or None)
genres = df_filtered['genres'].str.split(",", n=1, expand=True)
df_filtered['genre1'] = genres[0]
df_filtered['genre2'] = genres[1]
# Split the remainder once more to separate the second and third genres
genres2 = df_filtered['genre2'].str.split(",", n=1, expand=True)
df_filtered['genre2'] = genres2[0]
df_filtered['genre3'] = genres2[1]
df_filtered.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,genre1,genre2,genre3
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",Biography,Drama,
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,Drama,,
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",Comedy,Drama,
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",Comedy,Drama,Fantasy
5,tt0111414,A Thin Life,A Thin Life,2018,75.0,Comedy,Comedy,,
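
The two-step split can also be done in a single call. This is only a sketch: it assumes no title lists more than three genres (true of the rows shown here, but I have not checked the whole file), and the toy frame `demo` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_filtered
demo = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002', 'tt0000003'],
    'genres': ['Biography,Drama', 'Drama', 'Comedy,Drama,Fantasy'],
})

# n=2 allows at most two splits, so each row yields at most three pieces
parts = demo['genres'].str.split(',', n=2, expand=True)
# reindex guarantees three columns even if no row happens to have three genres
parts = parts.reindex(columns=range(3))
parts.columns = ['genre1', 'genre2', 'genre3']

demo = demo.join(parts)
print(demo)
```

Titles with fewer than three genres end up with missing values in the unused columns, the same as in the table above.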


I am using pandasql to join the dataframes and run some queries against the data. First I join the basics data with the ratings, then I join the 'akas' data. Since I want to focus on US-only data for this analysis, the region column in the akas data will let me filter out non-US markets.

In [6]:
from pandasql import sqldf
# Helper that runs a SQL query against the DataFrames in the global namespace
pysqldf = lambda q: sqldf(q, globals())
q = '''SELECT *
        FROM df_filtered
        JOIN df_imdb_ratings
        USING(tconst)
        ;'''

imdb_joined_df = pysqldf(q)
imdb_joined_df.head(2)

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,genre1,genre2,genre3,averagerating,numvotes
0,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,Drama,,,6.9,4517
1,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",Comedy,Drama,Fantasy,6.5,119
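
For comparison, the same inner join can be written directly in pandas with `DataFrame.merge`. This is a sketch on made-up toy frames (`basics` and `ratings` are hypothetical stand-ins for df_filtered and df_imdb_ratings), not the code I ran above.

```python
import pandas as pd

# Hypothetical stand-ins for df_filtered and df_imdb_ratings
basics = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'primary_title': ['A', 'B', 'C'],
})
ratings = pd.DataFrame({
    'tconst': ['tt1', 'tt3', 'tt4'],
    'averagerating': [6.9, 6.5, 7.1],
    'numvotes': [4517, 119, 250],
})

# Inner join on the shared key, like JOIN ... USING(tconst):
# only titles present in both frames survive
joined = basics.merge(ratings, on='tconst', how='inner')
print(joined)
```

As with `USING(tconst)`, the key column appears only once in the result.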


In [7]:
q = '''SELECT *
        FROM imdb_joined_df
        JOIN df_imdb_akas
        ON df_imdb_akas.title_id = imdb_joined_df.tconst
        ;'''

imdb_all = pysqldf(q)
imdb_all.head(2)

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,genre1,genre2,genre3,averagerating,numvotes,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,Drama,,,6.9,4517,tt0069049,1,O Outro Lado do Vento,BR,,imdbDisplay,,0.0
1,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,Drama,,,6.9,4517,tt0069049,2,The Other Side of the Wind,US,,imdbDisplay,,0.0
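
The output above shows the same movie repeated once per regional title. A small sketch of why, again on hypothetical toy frames (`movies` and `akas` are made up): when the right-hand table has several rows per key, an inner join returns one row per match.

```python
import pandas as pd

# Hypothetical stand-ins for imdb_joined_df and df_imdb_akas
movies = pd.DataFrame({
    'tconst': ['tt1', 'tt2'],
    'primary_title': ['A', 'B'],
})
akas = pd.DataFrame({
    'title_id': ['tt1', 'tt1', 'tt2', 'tt9'],
    'title': ['A (BR)', 'A', 'B', 'Z'],
    'region': ['BR', 'US', 'US', 'US'],
})

# The key columns have different names, so both are named explicitly;
# both are kept in the result, as in the ON-based SQL join
merged = movies.merge(akas, left_on='tconst', right_on='title_id', how='inner')
print(len(movies), 'movies ->', len(merged), 'joined rows')
```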


Now I can filter the data to US only. Some titles still appear more than once after the filter, so I also drop the duplicates and keep one row per tconst.

In [8]:
imdb_all = imdb_all[imdb_all['region'] == 'US']
imdb_all = imdb_all.drop_duplicates(subset = 'tconst')
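
A quick way to see how many rows the de-duplication removes is to count the repeats first. This sketch uses a hypothetical toy frame (`toy` and its titles are made up); `keep='first'` is the pandas default and is written out only to make the choice visible.

```python
import pandas as pd

# Hypothetical stand-in for imdb_all after the akas join
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt1', 'tt2'],
    'title': ['A (BR)', 'A', 'A working title', 'B'],
    'region': ['BR', 'US', 'US', 'US'],
})

us_only = toy[toy['region'] == 'US']

# duplicated() flags every repeat of a tconst after its first occurrence
n_repeats = int(us_only.duplicated(subset='tconst').sum())

# Keep the first US row for each title
deduped = us_only.drop_duplicates(subset='tconst', keep='first')
print(n_repeats, 'repeated rows dropped;', len(deduped), 'titles kept')
```

Which US row survives depends on row order, so the title column of the kept row may be an alternate title rather than the primary one.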

In [12]:
# Ratings for one genre: titles tagged Documentary in any of the genre columns
q = '''SELECT primary_title, averagerating
        FROM imdb_all
        WHERE genre1 = 'Documentary' OR genre2 = 'Documentary' OR genre3 = 'Documentary'
        ;'''

docs = pysqldf(q)
docs.head(10)

Unnamed: 0,primary_title,averagerating
0,Homecoming: A Film by Beyoncé,7.5
1,Heart Like a Hand Grenade,7.4
2,The Black Godfather,6.8
3,Tell No One,8.9
4,Beautiful Noise,6.5
5,Evolution of a Criminal,6.9
6,Sticky: A (Self) Love Story,6.5
7,Defying the Nazis: The Sharps' War,7.2
8,Dawg Fight,6.3
9,Tab Hunter Confidential,7.7


In [23]:
docs['averagerating'].mean()

7.0074000000000005
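
Rather than querying one genre at a time, the three genre columns can be stacked into a single column so that every genre's mean is computed at once. A minimal sketch on a hypothetical toy frame (`toy` and its ratings are made up, not real results):

```python
import pandas as pd

# Hypothetical stand-in for imdb_all
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'genre1': ['Documentary', 'Comedy', 'Documentary'],
    'genre2': [None, 'Drama', 'Drama'],
    'genre3': [None, None, None],
    'averagerating': [7.0, 6.0, 8.0],
})

# One row per (title, genre) pair; unused genre slots are dropped
long = toy.melt(
    id_vars=['tconst', 'averagerating'],
    value_vars=['genre1', 'genre2', 'genre3'],
    value_name='genre',
).dropna(subset=['genre'])

# Mean rating per genre, highest first
genre_means = long.groupby('genre')['averagerating'].mean().sort_values(ascending=False)
print(genre_means)
```

A title with several genres counts toward each of them, which matches the OR condition in the Documentary query.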

In [35]:
def rating_by_genre(genre):
    """Return the average ratings of all titles tagged with `genre`."""
    # A title matches if the genre appears in any of the three genre columns
    genre_cols = imdb_all[['genre1', 'genre2', 'genre3']]
    matches = (genre_cols == genre).any(axis=1)
    return imdb_all.loc[matches, 'averagerating'].tolist()

In [36]:
doc_ratings = rating_by_genre('Documentary')

In [21]:
q = '''SELECT *
        FROM imdb_all
        WHERE averagerating > 7.0
        ORDER BY averagerating DESC
        ;'''

top_rated = pysqldf(q)
top_rated.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres,genre1,genre2,genre3,averagerating,numvotes,title_id,ordering,title,region,language,types,attributes,is_original_title
0,tt7131622,Once Upon a Time ... in Hollywood,Once Upon a Time ... in Hollywood,2019,159.0,"Comedy,Drama",Comedy,Drama,,9.7,5600,tt7131622,20,Untitled #9,US,,working,,0.0
1,tt6842524,"Hare Krishna! The Mantra, the Movement and the...","Hare Krishna! The Mantra, the Movement and the...",2017,90.0,Documentary,Documentary,,,9.5,829,tt6842524,3,"Hare Krishna! The Mantra, the Movement and the...",US,,,,0.0
2,tt8354112,Mosul,Mosul,2019,86.0,Documentary,Documentary,,,9.5,617,tt8354112,3,Mosul,US,,imdbDisplay,,0.0
3,tt6859280,"The Nagano Tapes: Rewound, Replayed & Reviewed","The Nagano Tapes: Rewound, Replayed & Reviewed",2018,73.0,Documentary,Documentary,,,9.4,192,tt6859280,3,"The Nagano Tapes: Rewound, Replayed & Reviewed",US,,,,0.0
4,tt7738784,Peranbu,Peranbu,2018,147.0,Drama,Drama,,,9.4,9629,tt7738784,4,Peranbu,US,,imdbDisplay,,0.0
