## Assignment 3 Data Analysis using Pandas

This assignment will contain 1 question with details as below. The due date is October 16 (Friday), 2020 23:59PM. Each late day will result in 20% loss of total points.

### Question 1 (100 points) Celluloid ceiling

Wonder Woman             |  Captain Marvel
:-------------------------:|:-------------------------:
![wonderwoman](https://upload.wikimedia.org/wikipedia/en/e/ed/Wonder_Woman_%282017_film%29.jpg) | ![marvel](https://upload.wikimedia.org/wikipedia/pt/5/59/Captain_Marvel_%282018%29.jpg)

Women are involved in the film industry in all roles, including as film directors, actresses, cinematographers, film producers, film critics, and other film industry professions, though women have been underrepresented in all these positions. Studies found that women have always had a presence in film acting, but have consistently been underrepresented, and on average significantly less well paid. 

In 2015, Forbes reported that "...just 21 of the 100 top-grossing films of 2014 featured a female lead or co-lead, while only 28.1% of characters in 100 top-grossing films were female... This means it’s much rarer for women to get the sort of blockbuster role which would warrant the massive backend deals many male counterparts demand (Tom Cruise in Mission: Impossible or Robert Downey Jr. in Iron Man, for example)".

Also, Forbes' analysis of US acting salaries in 2013 determined that the "...men on Forbes’ list of top-paid actors for that year made 2½ times as much money as the top-paid actresses. That means that Hollywood's best-compensated actresses made just 40 cents for every dollar that the best-compensated men made. 


In this assignment, we want to examine whether and how women representation is lacking in the film industry. We will adopt The Bechdel test as a measure of the representation of women in the film industry. The test is named after the American cartoonist Alison Bechdel in whose 1985 comic strip Dykes to Watch Out For the test first appeared. **A movie is said to meet the Bechdel test  following three criteria: (1) it has to have at least two women in it, who (2) who talk to each other, about (3) something besides a man.**

We are going to obtain the data ourselves to perform the analysis. Specifically, we will retrieve the movie metadata from IMDB (Internet Movie Database), an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.


The IMDb Top 250 is a list of the top rated 250 films, based on ratings by the registered users of the website using the methods described. We will focus on these famous movies in this analysis:

**Question 1.1** (20 points): We will retrieve the metadata of IMDb Top 250 movies from the [IMDb charts](https://www.imdb.com/chart/top/). For each movie on the list, we can scrape the following characteristics from the information page. For example, from the [page of top rated movie "The Shawshank Redemption"](https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=F4QFC0SVZN1HTDHCY3C0&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1), we want to extract the metadata about this movie as:
- IMDb id (0111161)
- Movie name (The Shawshank Redemption)
- Year (1994)
- Director (Frank Darabont)
- Starring (Tim Robbins, Morgan Freeman, Bob Gunton)
- Rating (9.3)
- Number of reviews (2,291,324)
- Genres (Drama)
- Country (USA)
- Language (English)
- Budget (\$25,000,000)
- Box Office Revenue (\$28,815,291)
- Runtime (142 min)

![imdb](https://mrfloris.com/files/images/imdb-top250-page-start.png)


After scraping the 250 movies, save the data as a dataframe ```imdb_top_movies```. 
Also, saving the dataframe to a local file ```imdb_top_movies.csv``` so that later you can load it without scraping the website twice.

Hint: You can get the links to these movies from the IMDb top chart page, and then scrape each movie page by sending the request to these links. At each movie page, the information requested are located at different sections. 

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from collections import Counter, OrderedDict
from scipy import stats
import json

In [2]:
# Question 1.1
top250page= requests.get("https://www.imdb.com/chart/top/")
top250page.status_code

200

In [3]:
soup = BeautifulSoup(top250page.content, "html.parser")

#### __Note__: I have noticed that the last movie from the Top 250 changes frequently (2/3 days at most). As such, the movies included on the dataset will vary depending on the date the code is run.

In [4]:
# Retrieving Page Links

page_links=[]
all_links = soup.select(".titleColumn a")
for i in range(len(all_links)):
    page_links.append("https://www.imdb.com"+ all_links[i]['href'])

In [5]:
# Gathering IMDB ids

movie_id=[]
all_links = soup.select(".titleColumn a")
for i in range(len(all_links)):
    movie_id.append(all_links[i]['href'])
    movie_id[i]= movie_id[i][9:]
    movie_id[i]= movie_id[i][:-1]

In [6]:
# Movie Names

movie_names=[]
for i in range(len(all_links)):
    movie_names.append(all_links[i].get_text())

In [7]:
# Director and Stars names

all_names = []
director_names = []
starring_names = []
for i in range(len(all_links)):
    all_names.append(all_links[i]['title'].split(","))


for i in range(len(all_names)):
    director_names.append(all_names[i][0])
    starring_names.append(all_names[i][1:])
    
for i in range(len(director_names)):
    director_names[i] = director_names[i].replace("(dir.)","")
    
for i in range(len(starring_names)):
    starring_names[i] = str(starring_names[i])
    starring_names[i] = starring_names[i].replace("['","").replace("]","")\
    .replace("']","").replace('"','').replace('[ ','').replace("'","")\
    .replace("  "," ").replace('"]','').replace("[' ","").replace(' ""','')\
    .replace(' "]', '').replace('" ','').replace(' "','').rstrip().lstrip()

In [8]:
# Launch Year

years_list = soup.find_all(class_="secondaryInfo")
year=[]
for i in range(250):
    year.append(int(years_list[i].get_text().replace("(","").replace(")","")))

In [9]:
# IMDB Ratings

ratings=[]

html_ratings = soup.select(".ratingColumn.imdbRating strong")

for i in range(len(html_ratings)):
    ratings.append(html_ratings[i].get_text())

ratings = list(map(float, ratings))

In [10]:
# Number of Reviews

reviews_number = []

for i in range(len(html_ratings)):
    reviews_number.append(html_ratings[i]["title"].strip(" user ratings"))

for i in range(len(reviews_number)):
    reviews_number[i] = reviews_number[i][13:]

for i in range(len(reviews_number)):
    reviews_number[i] = reviews_number[i].replace(",", "")
    
reviews_count = [int(i) for i in reviews_number]

In [11]:
# Retrieving all 250 movies html 

movie_pages=[]
for i in range(len(page_links)):
    movie_pages.append(requests.get(page_links[i]))

In [12]:
# Parsing all html

soup=[]

for i in range(len(movie_pages)):
    soup.append(BeautifulSoup(movie_pages[i].content, "html.parser"))

In [13]:
# Genres

genre_list=[]

for i in range(len(movie_pages)):
    temp= []
    for j in range(len(soup[i].select(".subtext a"))-1):
        temp.append(soup[i].select(".subtext a")[j].get_text())
    genre_list.append(temp)
    
for i in range(len(genre_list)):
    genre_list[i] = str(genre_list[i])
    genre_list[i] = genre_list[i].replace("['","").replace("']","")\
    .replace("'","").replace("  "," ").rstrip().lstrip()
for i in range(len(genre_list)):
    genre_list[i]= genre_list[i].split(",")
    
    
genre_1=[]
genre_2=[]
genre_3=[]
   
for i in range(len(genre_list)):
    if len(genre_list[i]) ==1:
        genre_1.append(genre_list[i][0])
        genre_2.append("null")
        genre_3.append("null")
    elif len(genre_list[i]) == 2:
        genre_1.append(genre_list[i][0])
        genre_2.append(genre_list[i][1])
        genre_3.append("null")
    elif len(genre_list[i]) == 3:
        genre_1.append(genre_list[i][0])
        genre_2.append(genre_list[i][1])
        genre_3.append(genre_list[i][2])

for i in range(len(genre_1)):
    genre_1[i].replace(" ","")
    
for i in range(len(genre_2)):
    genre_2[i] = genre_2[i].replace(" ","")

for i in range(len(genre_3)):
    genre_3[i] = genre_3[i].replace(" ","")
    
for i in range(len(genre_2)):
    if genre_2[i] == "null":
        genre_2[i] = np.nan

for i in range(len(genre_3)):
    if genre_3[i] == "null":
        genre_3[i] = np.nan

In [14]:
# Runtime

run_time=[]
temp=[]
for i in range(len(movie_pages)):
    temp = soup[i].find_all("time")
    temp = temp[1].get_text()
    temp = temp.replace(" min","")
    temp = int(temp)
    run_time.append(temp)

In [15]:
# Countries

all_countries=[]

for i in range(len(movie_pages)):
    for x in soup[i].find_all(class_ = "txt-block"):
        temp=[]
        if "Country" in x.text:
            temp.append(x.text)
            temp = temp[0].lstrip("Country:\n")
            temp = temp.rstrip("\n")
            temp = temp.replace("|",",")
            temp = temp.replace("\n","")
            all_countries.append(temp)

In [16]:
len(all_countries)

251

##### __Note__: From this code (and depending on the time that the requests are run) I obtain a list that varies from 250 to 253 elements. I should have 250. I deal with this by removing the elements that are longer than expected.

##### This is not foolproof, but it worked well in this specific situation.

In [17]:
for i in range(len(all_countries)):
    try:
        if len(all_countries[i]) > 40:
            all_countries.pop(i)
    except IndexError:
            pass

In [18]:
len(all_countries)

250

In [19]:
# Languages

languages=[None] * 250
for i in range(len(movie_pages)):
    for x in soup[i].find_all(class_ = "txt-block"):
        temp=[]
        if "Language" in x.text:
            temp.append(x.text)
            temp = temp[0].lstrip("Language:\n")
            temp = temp.rstrip("\n")
            temp = temp.replace("|",",")
            temp = temp.replace("\n","")
            languages[i] = temp

In [20]:
# Budgets

budget= [0] * 250

for i in range(len(movie_pages)):
    for x in soup[i].find_all(class_ = "txt-block"):
        temp=[]
        if "Budget" in x.text:
            temp.append(x.text)
            temp = temp[0].lstrip("Budget:\n")
            temp = temp.replace("(estimated)","")
            temp = temp.rstrip()
            temp = temp.replace(",","")
            if "$" in temp:
                temp = temp.replace("$","")
                budget[i] = float(temp)
            elif "TRL" in temp:
                temp = temp.replace("TRL","")
                budget[i] = float(temp) * 0.13
            elif "JPY" in temp:
                temp = temp.replace("JPY","")
                #Yen to dollar conversion rate
                budget[i] = float(temp) * 0.0095
            elif "RL" in temp:
                temp = temp.replace("RL","")
                temp = float(temp) * 0.18
                budget[i] = temp
            elif "FRF" in temp:
                temp = temp.replace("FRF","")
                budget[i] = float(temp)* 1.18 * 0.152449
            elif "EUR" in temp:
                temp = temp.replace("EUR","")
                temp = float(temp)* 1.18
                budget[i] = float(temp)* 1.18
            elif "INR" in temp:
                temp = temp.replace("INR","")
                budget[i] = float(temp)*0.014
            elif "DEM" in temp:
                temp = temp.replace("DEM","")
                budget[i] = float(temp) * 0.60482
            elif "GBP" in temp:
                temp = temp.replace("GBP","")
                budget[i]= float(temp) * 1.30
            elif "RUR" in temp:
                temp = temp.replace("RUR","")
                budget[i] = float(temp) * 0.013
            elif "AUD" in temp:
                temp = temp.replace("AUD","")
                budget[i] = float(temp) * 0.72
            elif "KRW" in temp:
                temp = temp.replace("KRW","")
                budget[i] = float(temp) * 0.00087
                
# Later, I will need to replace the zeros (value appended when Budget info is not available) with `np.nan`.

In [21]:
# Box Office Revenue

global_revenue = [0] * 250

for i in range(len(movie_pages)):
    for x in soup[i].find_all(class_ = "txt-block"):
        temp=[]
        if "Cumulative Worldwide Gross" in x.text:
            temp.append(x.text)
            temp = temp[0].lstrip("Cumulative Worldwide Gross:\n")
            temp = temp.rstrip()
            temp = temp.replace(",","")
            temp = temp.replace("$","")
            global_revenue[i] = int(temp)
            
# Later, I will need to replace the zeros (value appended when Box Office Revenue info is not available) with `np.nan`.

In [22]:
# Merging data into Pandas Data Frame

imdb_top_movies = pd.DataFrame({"Movie Name": movie_names, "Year": year,
                                "Director": director_names, "Starring": starring_names,
                                "Rating": ratings, "Number of Reviews": reviews_number,
                                "All Genres": genre_list,
                                "Genre 1": genre_1, "Genre 2": genre_2, 
                                "Genre 3": genre_3 ,"Country": all_countries, 
                                "Language": languages, "Budget ($)": budget, 
                                "Box Office Revenue ($)":global_revenue, "Runtime": run_time},
                               index = movie_id)

imdb_top_movies.index.name = "IMDB id"

imdb_top_movies["Budget ($)"].replace(0, np.nan, inplace=True)
imdb_top_movies["Box Office Revenue ($)"].replace(0, np.nan, inplace=True)

**Question 1.2** (5 points) If you group the movies by release years, show the number of movies at each decade in a descendingu order.

In [23]:
#Question 1.2

imdb_top_movies["Decades"] = imdb_top_movies["Year"]//10*10

In [24]:
imdb_top_movies.groupby("Decades")["Decades"].count().sort_values(ascending= False)

Decades
2010    50
2000    48
1990    40
1980    29
1950    23
1970    18
1960    18
1940    10
1920     7
1930     6
2020     1
Name: Decades, dtype: int64

**Quesion 1.3** (5 points) Show the number of movies by the distribution of runtime at quartile (0-25%, 25-50%, 50-75%, 75-100%).

In [25]:
# Question 1.3

dist_labels = ["Fourth", "Third", "Second", "First"]
imdb_top_movies["Quartile_Runtime"] = pd.qcut(imdb_top_movies["Runtime"], q=[0, .25, .5, .75, 1], labels=dist_labels)
imdb_top_movies.groupby("Quartile_Runtime")["Runtime"].count()

Quartile_Runtime
Fourth    63
Third     62
Second    62
First     63
Name: Runtime, dtype: int64

**Question 1.4** (5 points) What is the proportion of movies that have Budget higher than 75% of all movies (i.e. the third quartile)?

In [26]:
# Quesion 1.4

dist_labels = ["Fourth", "Third", "Second", "First"]
imdb_top_movies["Movies by Budget"] = pd.qcut(imdb_top_movies["Budget ($)"], q= [0, .25, .5, 0.75, 1], labels = dist_labels)

In [27]:
# Absolute number of movies
imdb_top_movies.groupby("Movies by Budget")["Budget ($)"].count()["Third"]

57

In [28]:
# Relative number of movies
relative_nr = imdb_top_movies.groupby("Movies by Budget")["Budget ($)"].count()["Third"] / 250 

In [29]:
print(f"The proportion of movies that have Budget higher than 75% of all movies is {relative_nr}.")

The proportion of movies that have Budget higher than 75% of all movies is 0.228.


**Question 1.5** (5 points) Show the top 10 most popular actor/actresses in terms of number of movies they have starred. 

In [30]:
# Question 1.5

list_star1=[]
list_star2=[]

for i in range(len(imdb_top_movies)):
        list_star1.append(imdb_top_movies["Starring"][i].split(",")[0])
        list_star2.append(imdb_top_movies["Starring"][i].split(",")[1].lstrip())

In [31]:
list_stars = list_star1 + list_star2

In [32]:
Counter(list_stars).most_common(10)

[('Robert De Niro', 9),
 ('Tom Hanks', 6),
 ('Charles Chaplin', 6),
 ('Harrison Ford', 6),
 ('Christian Bale', 5),
 ('Clint Eastwood', 5),
 ('Leonardo DiCaprio', 5),
 ('Al Pacino', 4),
 ('Brad Pitt', 4),
 ('Toshirô Mifune', 4)]

**Question 1.6** (5 points) Show the top 5 directors with the most total box office revenues.

In [33]:
# Question 1.6

imdb_top_movies.groupby("Director")["Box Office Revenue ($)"].sum().sort_values(ascending=False).head(5)

Director
Anthony Russo         4.846160e+09
Christopher Nolan     4.143007e+09
Steven Spielberg      3.055116e+09
Peter Jackson         2.973633e+09
Pete Docter           2.172152e+09
Name: Box Office Revenue ($), dtype: float64

**Question 1.7** (5 points) Show the average ratings of movies across the genres and decades.

In [34]:
# Question 1.7

# Genres

unordered_unique_genres = list(set(genre_1 + genre_2 + genre_3))

unordered_unique_genres.pop(0)

unique_genres = []

for i in range(len(unordered_unique_genres)):
    unique_genres.append(unordered_unique_genres[i].replace(" ",""))
    

ratings_dict = OrderedDict.fromkeys(unique_genres, 0)
list_keys = list(ratings_dict)
list_keys.sort()
films_per_genre_dict = OrderedDict.fromkeys(unique_genres,0)

for i in range(len(imdb_top_movies)):
    for x in range(len(list_keys)):
        if imdb_top_movies["Genre 1"][i] == list_keys[x]:
            ratings_dict[list_keys[x]] += imdb_top_movies["Rating"][i]
            films_per_genre_dict[list_keys[x]] += 1
            
        if imdb_top_movies["Genre 2"][i] == list_keys[x]:
            ratings_dict[list_keys[x]] += imdb_top_movies["Rating"][i]
            films_per_genre_dict[list_keys[x]] += 1
            
        if imdb_top_movies["Genre 3"][i] == list_keys[x]:
            ratings_dict[list_keys[x]] += imdb_top_movies["Rating"][i]
            films_per_genre_dict[list_keys[x]] += 1

genres_avg_ratings = {a: round(float(ratings_dict[a] / films_per_genre_dict[a]),2)\
                      for a in ratings_dict if a in films_per_genre_dict}
genres_avg_ratings = sorted(genres_avg_ratings.items())

In [35]:
# Genres

genres_avg_ratings

[('Action', 8.33),
 ('Adventure', 8.28),
 ('Animation', 8.22),
 ('Biography', 8.22),
 ('Comedy', 8.19),
 ('Crime', 8.28),
 ('Drama', 8.26),
 ('Family', 8.23),
 ('Fantasy', 8.29),
 ('Film-Noir', 8.23),
 ('History', 8.21),
 ('Horror', 8.35),
 ('Music', 8.4),
 ('Musical', 8.2),
 ('Mystery', 8.26),
 ('Romance', 8.2),
 ('Sci-Fi', 8.28),
 ('Sport', 8.09),
 ('Thriller', 8.21),
 ('War', 8.23),
 ('Western', 8.35)]

In [36]:
# Decades

round(imdb_top_movies.groupby("Decades")["Rating"].mean(), 2)

Decades
1920    8.10
1930    8.25
1940    8.24
1950    8.23
1960    8.26
1970    8.31
1980    8.22
1990    8.37
2000    8.26
2010    8.19
2020    8.50
Name: Rating, dtype: float64

**Question 1.8** (5 points) Creat a new column ```ROI``` that measures the return on investment using the (box revenue-budget)/budget, and compare the ROI between movies in English and those in non-English. Use the t-test to examine whether such difference is statistically significant (You can use ```scipy.stats.ttest_ind``` to test the mean difference of two distributions)

##### __Note__: I will present two different solutions to this question. 
##### The first, considering "movie in english" as a movie in which the language "English" is present, even if other languages are present too.
##### The second, considering that if there are english and other languages present, it is labeled non-english.

In [37]:
# Solution 1 (If English present, consider English)

imdb_top_movies["ROI"] = imdb_top_movies["Box Office Revenue ($)"]/imdb_top_movies["Budget ($)"]

english = []
other_langs = []


for i in range(len(imdb_top_movies)):
    if "English" in imdb_top_movies["Language"].iloc[i]:
        english.append(imdb_top_movies["ROI"].iloc[i])
    else:
        other_langs.append(imdb_top_movies["ROI"].iloc[i])

english = np.array(english)
other_langs = np.array(other_langs)

stats.ttest_ind(english,other_langs, nan_policy="omit")

Ttest_indResult(statistic=-1.3676897103304824, pvalue=0.17283053389639766)

##### As seen from the t-test, as we have a p-value of >0.05 and also >0.1, we cannot reject the null hypothesis that the ROI for english and other languages is on average the same, all else constant.

In [38]:
# Solution 2 (If english and more languages, consider non-english)

imdb_top_movies["ROI"] = imdb_top_movies["Box Office Revenue ($)"]/imdb_top_movies["Budget ($)"]

english = []
other_langs = []

for i in range(len(imdb_top_movies)):
    if imdb_top_movies["Language"].iloc[i] == "English":
        english.append(imdb_top_movies["ROI"].iloc[i])
    else:
        other_langs.append(imdb_top_movies["ROI"].iloc[i])
        
english = np.array(english)
other_langs = np.array(other_langs)

stats.ttest_ind(english,other_langs, nan_policy="omit")

Ttest_indResult(statistic=1.8673107495970935, pvalue=0.06321282822691394)

##### Here, the results are different. We reject the null hypothesis for a significance level of 10% (however, we would not reject it for the 5% level).

##### This means there is a statistically significant different between the ROI for english and non-english movies, for a significance level of 10%.  

**Question 1.9** (5 points) Do the commercially successfuly movies also receive higher ratings. Check the correlations between box office revenues and ratings using Pearman and Spearman correlations.

In [39]:
imdb_top_movies["Box Office Revenue ($)"].corr(imdb_top_movies["Rating"], method = "pearson")

0.2366342853827657

##### There seems to be a low correlation between Box Office Revenue and Rating, using the Pearson correlation method. The correlation is positive, but small.

In [40]:
imdb_top_movies["Box Office Revenue ($)"].corr(imdb_top_movies["Rating"], method = "spearman")

0.25247283563144834

##### The same as above is true, using the Spearman method. There is a slightly higher correlation, but still not very high.

**Question 1.10** (10 points) Now let's retrieve data from Bechdel Test Movie website [for each movie](https://bechdeltest.com/). You can send the requests to the API: https://bechdeltest.com/api/v1/doc#getMovieByImdbId. For example, for the movie The Shawshank Redemption (the IMDb id: 0111161), you can simply call: http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid=0111161. 

Create a dataframe ```bechdel_imdb_top``` that merge the bechdel test info with the ```imdb_top_movies``` show how many top 250 movies are also in the bechdel test website.

In [41]:
responses=[]

for i in range(len(movie_id)):
    url ="http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid="+str(movie_id[i])
    responses.append(requests.get(url))


bechdel_info=[]
for i in range(len(responses)):
    bechdel_info.append(json.loads(responses[i].text))

bechdel_imdb_top  = pd.DataFrame(data = bechdel_info)
bechdel_imdb_top.rename(columns={"rating": "Bechdel_test"}, inplace=True)
bechdel_test = np.array(bechdel_imdb_top["Bechdel_test"])
imdb_top_movies["Bechdel Test"] = bechdel_test

In [42]:
counter=0
for i in range(len(bechdel_info)):
    if len(bechdel_info[i]) > 3:
           counter+=1

In [43]:
print(f"Out of the Top 250 IMDB movies, {counter} are in the bechdel test website.")

Out of the Top 250 IMDB movies, 236 are in the bechdel test website.


**Question 1.11** (5 points) Show how many movies in terms of percentage) that has passed the test in different ways (Number from 0 to 3 (0 means no two women, 1 means no talking, 2 means talking about a man, 3 means it passes the test)

In [44]:
# Pass at 1

imdb_top_movies.loc[imdb_top_movies["Bechdel Test"] == 1]["Bechdel Test"].count() / 250 

0.324

In [45]:
# Pass at 2

imdb_top_movies.loc[imdb_top_movies["Bechdel Test"] == 2]["Bechdel Test"].count() / 250

0.092

In [46]:
# Pass at 3

imdb_top_movies.loc[imdb_top_movies["Bechdel Test"] == 3]["Bechdel Test"].count() / 250 

0.332

**Question 1.12** (5 points) Show the percenage of movies given differen genres that has passed the test in different ways (Number from 0 to 3 (0 means no two women, 1 means no talking, 2 means talking about a man, 3 means it passes the test))

In [47]:
films_per_genre={}

for key in unique_genres:
    films_per_genre[key] = 0    
    
for i in range(len(imdb_top_movies)):
    for j in range(len(unique_genres)):
        if imdb_top_movies["Genre 1"][i] == unique_genres[j]:
            films_per_genre[imdb_top_movies["Genre 1"][i]]+=1
            
        if imdb_top_movies["Genre 2"][i] == unique_genres[j]:
            films_per_genre[imdb_top_movies["Genre 2"][i]]+=1
    
        if imdb_top_movies["Genre 3"][i] == unique_genres[j]:
            films_per_genre[imdb_top_movies["Genre 3"][i]]+=1    

            
pass1_genre1 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==1][["Genre 1"]]\
                    .sort_values("Genre 1").groupby("Genre 1")["Genre 1"].count())
pass1_genre2 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==1][["Genre 2"]]\
                    .sort_values("Genre 2").groupby("Genre 2")["Genre 2"].count())
pass1_genre3 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==1][["Genre 3"]]\
                    .sort_values("Genre 3").groupby("Genre 3")["Genre 3"].count())

pass2_genre1 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==2][["Genre 1"]]\
                    .sort_values("Genre 1").groupby("Genre 1")["Genre 1"].count())
pass2_genre2 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==2][["Genre 2"]]\
                    .sort_values("Genre 2").groupby("Genre 2")["Genre 2"].count())
pass2_genre3 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==2][["Genre 3"]]\
                    .sort_values("Genre 3").groupby("Genre 3")["Genre 3"].count())

pass3_genre1 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==3][["Genre 1"]]\
                    .sort_values("Genre 1").groupby("Genre 1")["Genre 1"].count())
pass3_genre2 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==3][["Genre 2"]]\
                    .sort_values("Genre 2").groupby("Genre 2")["Genre 2"].count())
pass3_genre3 = dict(imdb_top_movies.loc[imdb_top_movies['Bechdel Test'] ==3][["Genre 3"]]\
                    .sort_values("Genre 3").groupby("Genre 3")["Genre 3"].count())


final_countlist_pass1 = Counter(pass1_genre1) + Counter(pass1_genre2) + Counter(pass1_genre3)
final_countlist_pass2 = Counter(pass2_genre1) + Counter(pass2_genre2) + Counter(pass2_genre3)
final_countlist_pass3 = Counter(pass3_genre1) + Counter(pass3_genre2) + Counter(pass3_genre3)

percent_pass1 = {k: round(float(final_countlist_pass1[k])/films_per_genre[k], 4) for k in final_countlist_pass1}
percent_pass2 = {k: round(float(final_countlist_pass2[k])/films_per_genre[k], 4) for k in final_countlist_pass2}
percent_pass3 = {k: round(float(final_countlist_pass3[k])/films_per_genre[k], 4) for k in final_countlist_pass3}

for i in range(len(unique_genres)):
    if unique_genres[i] not in percent_pass1.keys():
        percent_pass1[unique_genres[i]] = 0
        
    if unique_genres[i] not in percent_pass2.keys():
        percent_pass2[unique_genres[i]] = 0
        
    if unique_genres[i] not in percent_pass3.keys():
        percent_pass3[unique_genres[i]] = 0

percent_pass1 = OrderedDict(sorted(percent_pass1.items(), key=lambda x: x[0]))
percent_pass2 = OrderedDict(sorted(percent_pass2.items(), key=lambda x: x[0]))
percent_pass3 = OrderedDict(sorted(percent_pass3.items(), key=lambda x: x[0]))

In [48]:
# Pass at 1 by genre (in percentage)

percent_pass1

OrderedDict([('Action', 0.3256),
             ('Adventure', 0.3571),
             ('Animation', 0.4545),
             ('Biography', 0.3846),
             ('Comedy', 0.3488),
             ('Crime', 0.3585),
             ('Drama', 0.326),
             ('Family', 0.2),
             ('Fantasy', 0.2857),
             ('Film-Noir', 0.3333),
             ('History', 0.2143),
             ('Horror', 0.25),
             ('Music', 0.6667),
             ('Musical', 0),
             ('Mystery', 0.4667),
             ('Romance', 0.1739),
             ('Sci-Fi', 0.2727),
             ('Sport', 0.5714),
             ('Thriller', 0.3077),
             ('War', 0.25),
             ('Western', 0.3333)])

In [49]:
# Pass at 2 by genre (in percentage)

percent_pass2

OrderedDict([('Action', 0),
             ('Adventure', 0.0536),
             ('Animation', 0),
             ('Biography', 0.1154),
             ('Comedy', 0.0698),
             ('Crime', 0.1321),
             ('Drama', 0.105),
             ('Family', 0.0667),
             ('Fantasy', 0.1429),
             ('Film-Noir', 0),
             ('History', 0.0714),
             ('Horror', 0.25),
             ('Music', 0),
             ('Musical', 0),
             ('Mystery', 0.0333),
             ('Romance', 0.087),
             ('Sci-Fi', 0.0909),
             ('Sport', 0),
             ('Thriller', 0.1026),
             ('War', 0),
             ('Western', 0)])

In [50]:
# Pass at 3 by genre (in percentage)

percent_pass3

OrderedDict([('Action', 0.4651),
             ('Adventure', 0.3571),
             ('Animation', 0.4545),
             ('Biography', 0.2692),
             ('Comedy', 0.3256),
             ('Crime', 0.283),
             ('Drama', 0.3039),
             ('Family', 0.4),
             ('Fantasy', 0.5),
             ('Film-Noir', 0.3333),
             ('History', 0.4286),
             ('Horror', 0.25),
             ('Music', 0.3333),
             ('Musical', 0.5),
             ('Mystery', 0.2333),
             ('Romance', 0.4783),
             ('Sci-Fi', 0.5455),
             ('Sport', 0.2857),
             ('Thriller', 0.3846),
             ('War', 0.2),
             ('Western', 0.1667)])

**Question 1.13** (5 points) Show the top 10 highest-rated movies that passed the test completely (rating=3) 

In [51]:
imdb_top_movies[imdb_top_movies["Bechdel Test"] == 3][["Movie Name","Rating"]]\
.sort_values("Rating", ascending=False).head(10)

Unnamed: 0_level_0,Movie Name,Rating
IMDB id,Unnamed: 1_level_1,Unnamed: 2_level_1
468569,O Cavaleiro das Trevas,9.0
108052,A Lista de Schindler,8.9
110912,Pulp Fiction,8.8
167261,O Senhor dos Anéis - As Duas Torres,8.7
1375666,A Origem,8.7
133093,Matrix,8.6
99685,Tudo Bons Rapazes,8.6
102926,O Silêncio dos Inocentes,8.6
38650,Do Céu Caiu Uma Estrela,8.6
245429,A Viagem de Chihiro,8.5


**Question 1.14** (5 points) Compareing the movies that passed (rating=3) and failed the test (rating=0), are their ROI different? Explain.

In [52]:
ROI_passed = imdb_top_movies.loc[imdb_top_movies["Bechdel Test"] == 3]["ROI"]
ROI_passed = np.array(ROI_passed)
ROI_failed = imdb_top_movies.loc[imdb_top_movies["Bechdel Test"] == 0]["ROI"]
ROI_failed = np.array(ROI_failed)

In [53]:
stats.ttest_ind(ROI_passed, ROI_failed, nan_policy="omit")

Ttest_indResult(statistic=1.5871024012802712, pvalue=0.11525933460978338)

##### Acording to the t-test, we cannot reject the null hypothesis for a 10% level of significance. That means there is no statistically significant difference in ROI between movies that passed and movies that failed the text. 

**Question 1.15** (10 points) Now load the ```bechdel_imdb.json``` that contains the all movies that are rated by the Bechdel Test website. Are women representation improved over the decades? Create a dataframe ```bechdel_imdb```, comparing the top 250 and other movies, in terms of percentage, how many passed/failed the test? 

In [54]:
bechdel_imdb = pd.read_json("bechdel_imdb.json")

bechdel_imdb["decade"] = bechdel_imdb["year"]//10*10

round(bechdel_imdb[["rating", "decade"]].groupby("decade")["rating"].mean(), 2) # Women representation over the decades

# Note: The values below can only range from 0 to 3, according to the Bechdel Test ratings

decade
1880    0.00
1890    0.25
1900    0.10
1910    1.58
1920    1.20
1930    2.07
1940    1.98
1950    1.97
1960    1.81
1970    1.96
1980    2.07
1990    2.17
2000    2.21
2010    2.28
2020    2.58
Name: rating, dtype: float64

##### There is an observale improvement in women representation since the 1960s, and it has picked up the pace in the last two decades.

In [55]:
# Percentage of movies that passed the test in some way, relative to all the 250 top movies.

imdb_top_movies[imdb_top_movies["Bechdel Test"] > 0]["Bechdel Test"].count() / len(imdb_top_movies)

0.748

In [56]:
# Percentage of movies that passed the test in some way, relative to the 250 top movies that have a Bechdel Test rating (235, as seen in Question 1.10)

round(imdb_top_movies[imdb_top_movies["Bechdel Test"] > 0]["Bechdel Test"].count() / 235, 4)

0.7957

In [57]:
# Percentage of movies that passed the test in some way, relative to all the movies rated on the Bechdel Test website.

round(bechdel_imdb[bechdel_imdb["rating"]> 0]["rating"].count() / len(bechdel_imdb), 4) 

0.8984