## Questions to answer using this dataset

1. Does age certification play a role in the ratings of a show/movie?
2. Do movies/TV shows with the genre "comedy" tend to do better than movies/TV shows without "comedy"?
3. Do 'drama' movies tend to have a longer runtime?
4. Do TV shows with more than 5 seasons tend to have a better rating than shows that do not?
5. Has the share of the 'crime' genre increased in the last 25 years compared to older years?
6. Do movies produced in the US tend to have a longer runtime?
7. Does the length of the title of the show/movie affect the ratings?
8. Is data from the last 25 years less popular than older data?
9. Do shows with more than 5 seasons tend to have a higher or lower runtime than shows with 5 or fewer seasons?
10. In terms of ratings, is 'comedy' a more successful genre than 'horror'?

In [1]:
import pandas

#pandas.options.display.max_columns = None
#pandas.options.display.max_rows = None

netflix_data = pandas.read_csv(
    r"C:/Users/alish/OneDrive/Desktop/HTH_build_sumer_2022/Finding-an-Interesting-Dataset-alisk8/titles.csv")

netflix_data

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5801,tm1014599,Fine Wine,MOVIE,A beautiful love story that can happen between...,2021,,100,"['romance', 'drama']",['NG'],,tt13857480,6.9,39.0,0.966,
5802,tm1108171,Edis Starlight,MOVIE,Rising star Edis's career journey with ups and...,2021,,74,"['music', 'documentation']",[],,,,,1.036,8.5
5803,tm1045018,Clash,MOVIE,A man from Nigeria returns to his family in Ca...,2021,,88,"['family', 'drama']","['NG', 'CA']",,tt14620732,6.5,32.0,0.709,
5804,tm1098060,Shadow Parties,MOVIE,A family faces destruction in a long-running c...,2021,,116,"['action', 'thriller']",[],,tt10168094,6.2,9.0,2.186,


## Column Descriptions

| Column | Description |
| :---------- | ----------- |
| Title | Name of the movie/TV show |
| Type | Whether the entry is a movie or a TV show |
| Description | Short description of the plot of the movie/TV show |
| Release year | Year the movie/TV show was released |
| Age certification | Rating such as "TV-MA", "R", "PG", "PG-13", or "TV-14" that defines which ages the movie/TV show is suitable for |
| Runtime | Length of the movie/TV show in minutes |
| Genres | Types of plot the movie/TV show has |
| Production countries | The country or countries the movie/TV show was produced in |
| Seasons | Number of seasons (only applies to TV shows) |
| IMDB ID | IMDB ID number |
| IMDB score | A rating out of 10 from IMDB |
| IMDB votes | Number of votes for the movie/TV show on IMDB |
| TMDB popularity | Popularity of the movie/TV show on TMDB, on a scale of 0 to 182.35 |
| TMDB score | A rating out of 10 from TMDB |

In [2]:
# Question 1: Does age-certification play a role in the ratings of a show/movie?

# drop all null values within age cert, imdb score, tmdb score
clean_data = netflix_data.dropna(subset=['age_certification', 'imdb_score', 'tmdb_score'])

# separate the cleaned data according to age cert 
pg_13_data = clean_data[ clean_data['age_certification'] == "PG-13" ]
tv_ma_data = clean_data[ clean_data['age_certification'] == "TV-MA" ]
r_data = clean_data[ clean_data['age_certification'] == "R" ]
pg_data = clean_data[ clean_data['age_certification'] == "PG" ]
tv_14_data = clean_data[ clean_data['age_certification'] == "TV-14" ]

# find the mean of each separated data's score/rating column 
pg13_mean_i = "{:.2f}".format(pg_13_data['imdb_score'].mean())
pg13_mean_t = "{:.2f}".format(pg_13_data['tmdb_score'].mean())
print("PG-13 movies/tv shows have a mean of", pg13_mean_i, "IMDB score", "and", pg13_mean_t, "of TMDB score")

tvma_mean_i = "{:.2f}".format(tv_ma_data['imdb_score'].mean())
tvma_mean_t = "{:.2f}".format(tv_ma_data['tmdb_score'].mean())
print("TV-MA movies/tv shows have a mean of", tvma_mean_i, "IMDB score", "and", tvma_mean_t, "of TMDB score")

r_mean_i = "{:.2f}".format(r_data['imdb_score'].mean())
r_mean_t = "{:.2f}".format(r_data['tmdb_score'].mean())
print("R movies/tv shows have a mean of", r_mean_i, "IMDB score", "and", r_mean_t, "of TMDB score")

pg_mean_i = "{:.2f}".format(pg_data['imdb_score'].mean())
pg_mean_t = "{:.2f}".format(pg_data['tmdb_score'].mean())
print("PG movies/tv shows have a mean of", pg_mean_i, "IMDB score", "and", pg_mean_t, "of TMDB score")

tv_mean_i = "{:.2f}".format(tv_14_data['imdb_score'].mean())
tv_mean_t = "{:.2f}".format(tv_14_data['tmdb_score'].mean())
print("TV-14 movies/tv shows have a mean of", tv_mean_i, "IMDB score", "and", tv_mean_t, "of TMDB score")

PG-13 movies/tv shows have a mean of 6.45 IMDB score and 6.59 of TMDB score
TV-MA movies/tv shows have a mean of 7.08 IMDB score and 7.36 of TMDB score
R movies/tv shows have a mean of 6.32 IMDB score and 6.47 of TMDB score
PG movies/tv shows have a mean of 6.21 IMDB score and 6.56 of TMDB score
TV-14 movies/tv shows have a mean of 7.26 IMDB score and 7.57 of TMDB score
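
The per-certification blocks above repeat the same steps for each rating, which makes it easy for a copied label to drift out of sync with the data it reports. A minimal sketch of the same computation with `groupby` follows. The `toy` frame and the helper name `mean_scores_by_certification` are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for netflix_data; the values are invented for illustration only.
toy = pd.DataFrame({
    "age_certification": ["R", "R", "PG", "TV-14", None],
    "imdb_score": [8.0, 6.0, 7.0, 7.5, 5.0],
    "tmdb_score": [7.0, 7.0, 6.0, None, 6.5],
})

def mean_scores_by_certification(df):
    """Mean IMDB/TMDB score per age certification, rounded to 2 decimals."""
    # Same cleaning step as the cell above: drop rows missing any of the three columns.
    cleaned = df.dropna(subset=["age_certification", "imdb_score", "tmdb_score"])
    # One group per certification, so each label comes from the data itself.
    return (
        cleaned.groupby("age_certification")[["imdb_score", "tmdb_score"]]
        .mean()
        .round(2)
    )

summary = mean_scores_by_certification(toy)
print(summary)
```

Calling `mean_scores_by_certification(netflix_data)` should cover every certification present in the data, not only the five selected by hand.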


In [3]:
# Question 2: Do movies/TV shows with the genre "comedy" tend to do better than movies/TV shows without "comedy"?

# drop all null values within genre, imdb score, tmdb score
clean_data1 = netflix_data.dropna(subset=['genres', 'imdb_score', 'tmdb_score'])

# get data with and without comedy
data_w_comedy = clean_data1[ clean_data1['genres'].str.contains('comedy')] 
data_wo_comedy = clean_data1[ ~ clean_data1['genres'].str.contains('comedy')] 

# find means 
w_com_mean_i = "{:.2f}".format(data_w_comedy['imdb_score'].mean())
w_com_mean_t = "{:.2f}".format(data_w_comedy['tmdb_score'].mean())

wo_com_mean_i = "{:.2f}".format(data_wo_comedy['imdb_score'].mean())
wo_com_mean_t = "{:.2f}".format(data_wo_comedy['tmdb_score'].mean())

print("Movies/Tv shows that have comedy as a genre have an IMDB mean rating of", w_com_mean_i, "and TMDB mean rating of", w_com_mean_t)
print("Movies/Tv shows that don't have comedy as a genre have an IMDB mean rating of", wo_com_mean_i, "and TMDB mean rating of", wo_com_mean_t)

Movies/Tv shows that have comedy as a genre have an IMDB mean rating of 6.42 and TMDB mean rating of 6.74
Movies/Tv shows that don't have comedy as a genre have an IMDB mean rating of 6.62 and TMDB mean rating of 6.86
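
The `genres` column is read from the CSV as a string that looks like a Python list (for example `"['comedy', 'fantasy']"`), so `str.contains('comedy')` is a substring match and would also match any genre name that happens to contain that word. A minimal sketch of an exact-membership check, assuming every value parses cleanly with `ast.literal_eval`, is shown below on invented toy rows.

```python
import ast
import pandas as pd

# Toy rows; the genre strings mimic the CSV format, the scores are invented.
toy = pd.DataFrame({
    "genres": ["['comedy', 'fantasy']", "['crime', 'drama']", "['comedy']", "[]"],
    "imdb_score": [8.2, 8.3, 8.0, 6.0],
})

# Parse each string into a real Python list, then test for exact membership.
genre_lists = toy["genres"].apply(ast.literal_eval)
has_comedy = genre_lists.apply(lambda genres: "comedy" in genres)

with_comedy_mean = toy.loc[has_comedy, "imdb_score"].mean()
without_comedy_mean = toy.loc[~has_comedy, "imdb_score"].mean()

print("With comedy:", round(with_comedy_mean, 2))
print("Without comedy:", round(without_comedy_mean, 2))
```

On the real data, rows with missing `genres` would need to be dropped first, as the cell above already does with `dropna`.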


In [4]:
# Question 3: Do 'drama' movies tend to have a higher runtime?

clean_data2 = netflix_data.dropna(subset=['genres', 'runtime'])

# get data with only movies. Divide into two sections: with drama and without drama
movies_w_drama = clean_data2[ (clean_data2['type'] == 'MOVIE') & (clean_data2['genres'].str.contains('drama')) ]
movies_wo_drama = clean_data2[ (clean_data2['type'] == 'MOVIE') & (~ clean_data2['genres'].str.contains('drama')) ]

w_dra_mean = "{:.2f}".format(movies_w_drama['runtime'].mean())
wo_dra_mean = "{:.2f}".format(movies_wo_drama['runtime'].mean())

print("Movies without drama have a mean runtime of", wo_dra_mean, "while movies with drama have a mean of", w_dra_mean)

Movies without drama have a mean runtime of 85.98 while movies with drama have a mean of 111.80


In [5]:
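# Question 4: Do TV shows with more than 5 seasons tend to have a better rating than shows that do not?
# Note: the split below is seasons >= 5 vs. seasons < 5, so the first group includes shows with exactly 5 seasons.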

clean_data3 = netflix_data.dropna(subset=['imdb_score', 'tmdb_score', 'seasons'])

shows_more_than_5 = clean_data3[ (clean_data3['type'] == 'SHOW') & (clean_data3['seasons'] >= 5)]
shows_less_than_5 = clean_data3[ (clean_data3['type'] == 'SHOW') & (clean_data3['seasons'] < 5)]

l_five_mean_i = "{:.2f}".format(shows_less_than_5['imdb_score'].mean())
l_five_mean_t = "{:.2f}".format(shows_less_than_5['tmdb_score'].mean())

m_five_mean_i = "{:.2f}".format(shows_more_than_5['imdb_score'].mean())
m_five_mean_t = "{:.2f}".format(shows_more_than_5['tmdb_score'].mean())

print("Shows with fewer than 5 seasons")
print("IMDB: ", l_five_mean_i)
print("TMDB: ", l_five_mean_t)

print()

print("Shows with 5 or more seasons")
print("IMDB: ", m_five_mean_i)
print("TMDB: ", m_five_mean_t)

Shows with fewer than 5 seasons
IMDB:  6.98
TMDB:  7.47

Shows with 5 or more seasons
IMDB:  7.40
TMDB:  7.65


In [6]:
# Question 5: Has the 'crime' genre increased in the last 25 years compared to the older years?

clean_data3 = netflix_data.dropna(subset=['release_year', 'genres'])

data_win_25_years = clean_data3[ (clean_data3['release_year'] > 1996) ]
data_b_25_years = clean_data3[ clean_data3['release_year'] <= 1996 ]

data_win_25_years_c = data_win_25_years[ data_win_25_years['genres'].str.contains('crime') ]
data_b_25_years_c = data_b_25_years[ data_b_25_years['genres'].str.contains('crime') ]

# percentage of titles tagged 'crime' in each period, formatted to 2 decimals
crime_per_win = "{:.2f}".format(data_win_25_years_c['genres'].count() / data_win_25_years['genres'].count() * 100)
crime_per_b = "{:.2f}".format(data_b_25_years_c['genres'].count() / data_b_25_years['genres'].count() * 100)

print("About", crime_per_win, "% of the movies/TV shows had 'crime' genre within the last 25 years")
print("About", crime_per_b, "% of the movies/TV shows had 'crime' genre before last 25 years")

About 15.24 % of the movies/TV shows had 'crime' genre within the last 25 years
About 18.75 % of the movies/TV shows had 'crime' genre before last 25 years
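
The share computed above is the number of 'crime' titles in a period divided by all titles in that period. Since the mean of a boolean column is the fraction of `True` values, the same calculation can be sketched in one grouped step. The helper name `genre_share_by_period`, its `cutoff_year` parameter, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy rows with invented years and genres, for illustration only.
toy = pd.DataFrame({
    "release_year": [1976, 1990, 2005, 2015, 2021, 2021],
    "genres": ["['crime', 'drama']", "['comedy']", "['crime']",
               "['drama']", "['action']", "['romance']"],
})

def genre_share_by_period(df, genre, cutoff_year):
    """Percentage of titles tagged with `genre`, split at `cutoff_year`."""
    cleaned = df.dropna(subset=["release_year", "genres"])
    # Label each row as 'recent' (after the cutoff) or 'older'.
    period = (cleaned["release_year"] > cutoff_year).map(
        {True: "recent", False: "older"}
    )
    # Plain substring match, the same test the cell above uses.
    has_genre = cleaned["genres"].str.contains(genre, regex=False)
    # Mean of booleans = fraction of True values; scale to a percentage.
    return has_genre.groupby(period).mean() * 100

shares = genre_share_by_period(toy, "crime", cutoff_year=1996)
print(shares.round(2))
```

Passing the cutoff as a parameter keeps the "last 25 years" boundary in one place instead of repeating `1996` in each filter.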


In [7]:
# Question 6: Do movies produced in the US tend to have a longer runtime?

clean_data4 = netflix_data.dropna(subset=['type', 'runtime', 'production_countries'])

movies_in_us = clean_data4[ (clean_data4['type'] == 'MOVIE') & (clean_data4['production_countries'].str.contains('US') )]
movies_not_us = clean_data4[ (clean_data4['type'] == 'MOVIE') & (~ clean_data4['production_countries'].str.contains('US') )]

mean_us = "{:.2f}".format(movies_in_us['runtime'].mean())
mean_n_us = "{:.2f}".format(movies_not_us['runtime'].mean())

print("Movies produced fully or partially in the US have a mean runtime of", mean_us, "while movies not produced in US have a mean runtime of", mean_n_us)

Movies produced fully or partially in the US have a mean runtime of 89.94 while movies not produced in US have a mean runtime of 104.88


In [8]:
# Question 7: Does the length of the title of the show/movie affect the ratings?

clean_data5 = netflix_data.dropna(subset=['title', 'imdb_score', 'tmdb_score'])

data_w_less_than_15 = clean_data5[ clean_data5['title'].str.len() <= 15 ]
data_w_more_than_15 = clean_data5[ clean_data5['title'].str.len() > 15 ]

mean_i_w_less_than_15 = data_w_less_than_15['imdb_score'].mean()
mean_t_w_less_than_15 = data_w_less_than_15['tmdb_score'].mean()

mean_i_w_more_than_15 = data_w_more_than_15['imdb_score'].mean()
mean_t_w_more_than_15 = data_w_more_than_15['tmdb_score'].mean()

print("Data that have title names with less than or equal to 15 characters: ")
print("IMDB rating mean:", mean_i_w_less_than_15)
print("TMDB rating mean:", mean_t_w_less_than_15)

print()

print("Data that have title names with more than 15 characters: ")
print("IMDB rating mean:", mean_i_w_more_than_15)
print("TMDB rating mean:", mean_t_w_more_than_15)

Data that have title names with less than or equal to 15 characters: 
IMDB rating mean: 6.490647212869435
TMDB rating mean: 6.75379723157501

Data that have title names with more than 15 characters: 
IMDB rating mean: 6.58623005877414
TMDB rating mean: 6.8745172124265315
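
Splitting titles at 15 characters compares two groups, but the result depends on where the threshold is placed. One complementary check, sketched here under the assumption that a linear relationship is what we care about, is the Pearson correlation between title length and score using `Series.corr`. The toy rows and scores are invented.

```python
import pandas as pd

# Toy rows with invented scores, for illustration only.
toy = pd.DataFrame({
    "title": ["Clash", "Taxi Driver", "Life of Brian", None],
    "imdb_score": [6.5, 8.3, 8.0, 7.0],
})

# Drop rows missing either value, as the cell above does.
cleaned = toy.dropna(subset=["title", "imdb_score"])

# Title length in characters, aligned row by row with the scores.
title_length = cleaned["title"].str.len()

# Pearson correlation (the default for Series.corr); values near 0 suggest
# little linear relationship between title length and rating.
length_score_corr = title_length.corr(cleaned["imdb_score"])
print("Correlation between title length and IMDB score:", length_score_corr)
```

A correlation close to zero on the real data would be consistent with the small gap between the two group means reported above.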


In [13]:
# Question 8: Is data from the last 25 years less popular than older data?
# Popularity is measured here by the mean IMDB and TMDB ratings.

clean_data7 = netflix_data.dropna(subset=['release_year', 'imdb_score', 'tmdb_score'])

data_in_25_years = clean_data7[ (clean_data7['release_year'] > 1996) ]
older_data = clean_data7[ clean_data7['release_year'] <= 1996 ]

i_rat_in25 = data_in_25_years['imdb_score'].mean()
t_rat_in25 = data_in_25_years['tmdb_score'].mean()

i_rat_older = older_data['imdb_score'].mean()
t_rat_older = older_data['tmdb_score'].mean()

print("Mean ratings of data from last 25 years:")
print("IMDB:", i_rat_in25)
print("TMDB:", t_rat_in25)

print()

print("Mean ratings of older data:")
print("IMDB:", i_rat_older)
print("TMDB:", t_rat_older)

Mean ratings of data from last 25 years:
IMDB: 6.529948927477017
TMDB: 6.817711950970377

Mean ratings of older data:
IMDB: 6.71125
TMDB: 6.595625
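
The comparison above uses mean ratings as a proxy for popularity and does not show how many titles fall into each period. A minimal sketch that reports the mean, median, and row count per period for any numeric column, including `tmdb_popularity`, is given below. The helper name `summarize_by_period` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy rows with invented values, for illustration only.
toy = pd.DataFrame({
    "release_year": [1975, 1979, 2019, 2020, 2021],
    "imdb_score": [8.2, 8.0, 6.0, 7.0, None],
    "tmdb_popularity": [18.0, 17.0, 2.0, 4.0, 1.0],
})

def summarize_by_period(df, column, cutoff_year=1996):
    """Mean, median, and count of `column` for older vs. recent titles."""
    cleaned = df.dropna(subset=["release_year", column])
    # Label each row as 'recent' (after the cutoff) or 'older'.
    period = (cleaned["release_year"] > cutoff_year).map(
        {True: "recent", False: "older"}
    )
    # The count column shows how many titles back each mean.
    return cleaned.groupby(period)[column].agg(["mean", "median", "count"])

popularity_summary = summarize_by_period(toy, "tmdb_popularity")
rating_summary = summarize_by_period(toy, "imdb_score")
print(popularity_summary)
print(rating_summary)
```

Reporting the count next to each mean helps when one period has far fewer titles than the other, since a mean over a small group is more sensitive to individual titles.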


In [12]:
# Question 9: Do shows with more than 5 seasons tend to have a higher or lower runtime than shows with 5 or fewer seasons?

clean_data6 = netflix_data.dropna(subset=['type', 'runtime', 'seasons'])

shows_mt_5 = clean_data6[ (clean_data6['type'] == 'SHOW') & (clean_data6['seasons'] > 5)]
shows_lt_5 = clean_data6[ (clean_data6['type'] == 'SHOW') & (clean_data6['seasons'] <= 5)]

mean_f_mt5 = shows_mt_5['runtime'].mean()
mean_f_lt5 = shows_lt_5['runtime'].mean()

print("Shows with more than 5 seasons tend to have a per episode runtime of", mean_f_mt5)
print("Shows with 5 or fewer seasons tend to have a per episode runtime of", mean_f_lt5)

Shows with more than 5 seasons tend to have a per episode runtime of 32.69911504424779
Shows with 5 or fewer seasons tend to have a per episode runtime of 39.178903826266804


In [20]:
# Question 10: In terms of ratings, is 'comedy' a more successful genre than 'horror'?

clean_data8 = netflix_data.dropna(subset=['genres', 'imdb_score', 'tmdb_score'])

comedy_data = clean_data8[ clean_data8['genres'].str.contains('comedy') ]
horror_data = clean_data8[ clean_data8['genres'].str.contains('horror') ]

mean_c_i = comedy_data['imdb_score'].mean()
mean_c_t = comedy_data['tmdb_score'].mean()

mean_h_i = horror_data['imdb_score'].mean()
mean_h_t = horror_data['tmdb_score'].mean()

print('Comedy data ratings: ')
print('IMDB', mean_c_i)
print('TMDB', mean_c_t)

print()

print('Horror data ratings: ')
print('IMDB', mean_h_i)
print('TMDB', mean_h_t)

Comedy data ratings: 
IMDB 6.420749279538905
TMDB 6.736023054755044

Horror data ratings: 
IMDB 6.015168539325843
TMDB 6.461235955056179
