## Assignment 3 Data Analysis using Pandas

This assignment contains 1 question, detailed below. The due date is Friday, October 16, 2020, at 11:59 PM. Each late day results in a loss of 20% of the total points.

### Question 1 (100 points) Celluloid ceiling

Wonder Woman             |  Captain Marvel
:-------------------------:|:-------------------------:
![wonderwoman](https://upload.wikimedia.org/wikipedia/en/e/ed/Wonder_Woman_%282017_film%29.jpg) | ![marvel](https://upload.wikimedia.org/wikipedia/pt/5/59/Captain_Marvel_%282018%29.jpg)

Women are involved in the film industry in all roles, including as film directors, actresses, cinematographers, film producers, film critics, and other film industry professions, though women have been underrepresented in all these positions. Studies found that women have always had a presence in film acting, but have consistently been underrepresented, and on average significantly less well paid. 

In 2015, Forbes reported that "...just 21 of the 100 top-grossing films of 2014 featured a female lead or co-lead, while only 28.1% of characters in 100 top-grossing films were female... This means it’s much rarer for women to get the sort of blockbuster role which would warrant the massive backend deals many male counterparts demand (Tom Cruise in Mission: Impossible or Robert Downey Jr. in Iron Man, for example)".

Also, Forbes' analysis of US acting salaries in 2013 determined that the "...men on Forbes’ list of top-paid actors for that year made 2½ times as much money as the top-paid actresses. That means that Hollywood's best-compensated actresses made just 40 cents for every dollar that the best-compensated men made."


In this assignment, we examine whether and how women's representation is lacking in the film industry. We will adopt the Bechdel test as a measure of the representation of women in the film industry. The test is named after the American cartoonist Alison Bechdel, in whose 1985 comic strip *Dykes to Watch Out For* the test first appeared. **A movie is said to pass the Bechdel test if it meets three criteria: (1) it has at least two women in it, (2) who talk to each other, (3) about something besides a man.**

We are going to obtain the data ourselves to perform the analysis. Specifically, we will retrieve the movie metadata from IMDb (Internet Movie Database), an online database of information related to films, television programs, home videos, video games, and streaming content online, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.


The IMDb Top 250 is a list of the top rated 250 films, based on ratings by the registered users of the website using the methods described. We will focus on these famous movies in this analysis:

**Question 1.1** (20 points): We will retrieve the metadata of IMDb Top 250 movies from the [IMDb charts](https://www.imdb.com/chart/top/). For each movie on the list, we can scrape the following characteristics from the information page. For example, from the [page of top rated movie "The Shawshank Redemption"](https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=F4QFC0SVZN1HTDHCY3C0&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1), we want to extract the metadata about this movie as:
- IMDb id (0111161)
- Movie name (The Shawshank Redemption)
- Year (1994)
- Director (Frank Darabont)
- Starring (Tim Robbins, Morgan Freeman, Bob Gunton)
- Rating (9.3)
- Number of reviews (2,291,324)
- Genres (Drama)
- Country (USA)
- Language (English)
- Budget (\$25,000,000)
- Box Office Revenue (\$28,815,291)
- Runtime (142 min)

![imdb](https://mrfloris.com/files/images/imdb-top250-page-start.png)


After scraping the 250 movies, save the data as a dataframe ```imdb_top_movies```. 
Also save the dataframe to a local file ```imdb_top_movies.csv``` so that you can load it later without scraping the website again.

Hint: You can get the links to these movies from the IMDb top chart page, and then scrape each movie page by sending a request to each link. On each movie page, the requested information is located in different sections.
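
One way to collect the movie links without relying on fixed positions in the list of tags is to match the `/title/tt…/` pattern directly. The sketch below uses only the standard library. The helper name `extract_title_ids` and the sample HTML are illustrative assumptions, and the live IMDb page may be laid out differently.

```python
import re

def extract_title_ids(html):
    """Return the IMDb title ids found in href attributes, in page order, without duplicates."""
    ids = re.findall(r'href="/title/tt(\d+)/', html)
    # dict.fromkeys keeps the first occurrence of each id and preserves order
    return list(dict.fromkeys(ids))

# Hypothetical fragment shaped like the chart page: each movie is linked twice
# (once around the poster image and once around the title text).
sample_html = '''
<a href="/title/tt0111161/"><img src="poster1.jpg"></a>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<a href="/title/tt0068646/"><img src="poster2.jpg"></a>
<a href="/title/tt0068646/">The Godfather</a>
'''

title_ids = extract_title_ids(sample_html)
print(title_ids)
```

With a real response, the same helper could be called on `page.text`. Matching on the link pattern should be less fragile than hard-coded tag offsets if the number of navigation links on the page changes.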

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
# Question 1.1

page = requests.get("https://www.imdb.com/chart/top/")
page

<Response [200]>

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')

In [4]:
# By inspection, the first movie link is at index 59 in the list of <a> tags
tag = soup.find_all("a")[59]

# Find the 'href' that contains the link
link = tag['href']
link

'/title/tt0111161/'

In [5]:
link_dic = {}
rank = 0

# Each title is linked twice (image and url), and both links go to the same place, so we step by 2
anchors = soup.find_all("a")
for row in range(59, len(anchors)-54, 2):
    rank += 1
    link_dic[rank] = anchors[row]['href']

In [6]:
half_url = link_dic[3]
# half_url already starts with "/", so the base URL must not end with one
page = requests.get("https://www.imdb.com"+half_url)

In [7]:
new_soup = BeautifulSoup(page.content, 'html.parser')

In [113]:
data = {}
rank = []
imdb_id = []
title = []
year = []
director = []
starring = []
rating = []
n_reviews = []
genre = []
country = []
box_off_rev = []
language = []
budget = []
runtime = []
for i in range(1,250+1):
    
    # Link to be scraped
    half_url = link_dic[i]
    headers = {'Accept-Language': 'en-US, en;q=0.5'}
    # half_url already starts with "/", so the base URL must not end with one
    page = requests.get("https://www.imdb.com"+half_url, headers = headers)
    new_soup = BeautifulSoup(page.content, 'html.parser')
    
    # Rank
    rank.append(i)
    data['Rank'] = rank
    
    # IMDBid
    imdb_id.append(str(half_url[9:-1]))
    data['imdb_id'] = imdb_id
    
    # Movie Name: keep the text before the first non-breaking space
    t = new_soup.find(class_='title_wrapper').get_text().strip()
    tit = t.split('\xa0')[0]
    title.append(tit)
    data['Movie'] = title

    # Find the Year
    y = new_soup.find(attrs = {'id':'titleYear'}).get_text()
    year_1 = [i for i in y if i not in ['(',')']]
    year.append(''.join(year_1))
    data['Year'] = year

    # Find the Director
    d = new_soup.find_all('div', attrs = {'class':'credit_summary_item'})[0].get_text().split()
    director_1 = d[1:]
    director.append(' '.join(director_1))
    data['Director'] = director

    # Find Starring
    s = new_soup.find_all(class_ = 'credit_summary_item')[2].get_text().split()
    starring_1 = [i for i in s if i not in ['Stars:','|','See','full','cast','&','crew','»']]
    starring.append(' '.join(starring_1))
    data['Starring'] = starring
    

    # Finding the movie Rating
    rate = new_soup.find(attrs = {'itemprop':"ratingValue"}).get_text()
    rating.append(rate)
    data['Rating'] = rating

    # Number of reviews
    n_rev = new_soup.find(attrs = {'itemprop':'ratingCount'}).get_text()
    n_reviews.append(n_rev)
    data['#Reviews'] = n_reviews

    # Finding the Genre
    gen = new_soup.find_all('div', attrs = {'class':'inline'})[-1].get_text().strip().split()
    help_gen =[]
    for i in gen:
        if i not in ['Genres:','|']:
            help_gen.append(i)
    help_gen = ','.join(help_gen)
    genre.append(help_gen)
    data['Genre'] = genre
    
#     # Finding the Country
#     c = new_soup.find_all('div', attrs = {'class':'txt-block'})[-13].get_text().split()[1]
#     country.append(c)
#     data['Country'] = country
    

    # Finding Country and Language
    count = new_soup.find_all('div', attrs = {'class':'article', 'id':'titleDetails'})[0].get_text().split()
    counter = 0
    for i in count:
        counter+=1
        if i == 'Country:':
            if count[counter] == 'New':
                special = count[counter]+' '+count[counter+1]
                country.append(special)
            else:
                country.append(count[counter])
        if i == 'Language:':
            language.append(count[counter])
            break
        data['Country'] = country
        data['Language'] = language

    # Finding the Box Office Revenue
    try:
        Box_Off = new_soup.find_all('div', attrs = {'class':'txt-block'})[-7].get_text().split()[3]
    except IndexError:
        # The block is missing or too short for this title
        Box_Off = np.nan
    box_off_rev.append(Box_Off)
    data['Box_Off_Rev'] = box_off_rev
        
#     # Finding the language
#     lang = new_soup.find_all('div', attrs = {'class':'txt-block'})[-14].get_text().strip().split()[1]
#     language.append(lang)
#     data['Language'] = language
    
    # Finding the Budget
    a = new_soup.find_all('div', attrs = {'class':'txt-block'})[-10].get_text().strip()
    budget_list = a.split()[0]
    budget.append(budget_list[7:len(budget_list)])
    data['Budget'] = budget

    # Finding the Runtime
    run = new_soup.find_all('div', attrs = {'class':'txt-block'})[-4].get_text().strip()
    runtime_1 = run[9:]
    runt = ''
    for i in runtime_1:
        if i == '\n':
            break
        runt +=i
    runtime.append(runt)
    data['Runtime'] = runtime

new_data = pd.DataFrame(data)
new_data.to_csv('top_250_imdb.csv')

In [114]:
new_data

Unnamed: 0,Rank,imdb_id,Movie,Year,Director,Starring,Rating,#Reviews,Genre,Country,Language,Box_Off_Rev,Budget,Runtime
0,1,0111161,The Shawshank Redemption,1994,Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton",9.3,2295658,Drama,USA,English,"$28,815,291","$25,000,000",142 min
1,2,0068646,The Godfather,1972,Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan",9.2,1584598,"Crime,Drama",USA,English,"$246,120,974","$6,000,000",175 min
2,3,0071562,The Godfather: Part II,1974,Francis Ford Coppola,"Al Pacino, Robert De Niro, Robert Duvall",9.0,1107109,"Crime,Drama",USA,English,"$48,035,783","$13,000,000",202 min
3,4,0468569,The Dark Knight,2008,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart",9.0,2260383,"Action,Crime,Drama,Thriller",USA,English,"$1,005,456,758","$185,000,000",152 min
4,5,0050083,12 Angry Men,1957,Sidney Lumet,"Henry Fonda, Lee J. Cobb, Martin Balsam",8.9,674534,"Crime,Drama",USA,English,$576,,96 min
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,246,0088247,The Terminator,1984,James Cameron,"Arnold Schwarzenegger, Linda Hamilton, Michael...",8.0,789343,"Action,Sci-Fi",UK,English,"$78,680,331","$6,400,000",107 min
246,247,0103639,Aladdin,1992,"Ron Clements, John Musker","Scott Weinger, Robin Williams, Linda Larkin",8.0,367441,"Animation,Adventure,Comedy,Family,Fantasy,Musi...",USA,English,"$504,050,219","$28,000,000",90 min
247,248,2991224,Tangerines,2013,Zaza Urushadze,"Lembit Ulfsak, Elmo Nüganen, Giorgi Nakashidze",8.2,38250,"Drama,War",Estonia,Estonian,"$1,024,132","EUR650,000",87 min
248,249,0050613,Throne of Blood,1957,Akira Kurosawa,"Toshirô Mifune, Minoru Chiaki, Isuzu Yamada",8.1,45565,"Drama,History",Japan,Japanese,"$46,808",,110 min


In [206]:
# Movie Name
t = new_soup.find(class_='title_wrapper').get_text().strip()
tit = ''
for i in t:
    if i == '\xa0':
        break
    tit+=i
tit
# title = [i for i in t if i not in ['(original', 'title)']]
# title = ' '.join(title)
# title

'Three Colors: Red'

In [448]:
# Find the Year
y = new_soup.find(attrs = {'id':'titleYear'}).get_text()
year = [i for i in y if i not in ['(',')']]
year = ''.join(year)
year

'1994'

In [423]:
# Find the Director
d = new_soup.find_all('div', attrs = {'class':'credit_summary_item'})[0].get_text().split()
director = d[1:]
director = ' '.join(director)
director

'Frank Darabont'

In [429]:
# Find Starring
a = new_soup.find_all(class_ = 'credit_summary_item')[2].get_text().split()
starring = [i for i in a if i not in ['Stars:','|','See','full','cast','&','crew','»']]
starring = ' '.join(starring)
starring

'Tim Robbins, Morgan Freeman, Bob Gunton'

In [392]:
# Finding the movie Rating
rating = new_soup.find(attrs = {'itemprop':"ratingValue"}).get_text()
rating

'9.3'

In [386]:
# Number of reviews
n_reviews = new_soup.find(attrs = {'itemprop':'ratingCount'}).get_text()
n_reviews

'2,292,364'

In [98]:
# Finding the Genre
a = new_soup.find_all('div', attrs = {'class':'inline'})[-1].get_text().strip().split()
genre = []
for i in a:
    if i not in ['Genres:','|']:
        genre.append(i)
        
genre = ','.join(genre)
genre

'Crime,Drama'

In [337]:
# Finding the Country
country = new_soup.find_all('div', attrs = {'class':'txt-block'})[-13].get_text().split()[1]
country

'Switzerland'

In [533]:
# Finding the Box Office Revenue
Box_Off_rev = new_soup.find_all('div', attrs = {'class':'txt-block'})[-7].get_text().split()[3]
Box_Off_rev

'$576'

In [434]:
# Finding the language
language = new_soup.find_all('div', attrs = {'class':'txt-block'})[-14].get_text().strip().split()[1]
language

'English'

In [435]:
# Finding the Budget
budget = new_soup.find_all('div', attrs = {'class':'txt-block'})[-10].get_text().strip()
a = budget.split()[0]
budget_list = a[7:len(a)]
budget_list

'$25,000,000'

In [30]:
# Finding the Runtime
run = new_soup.find_all('div', attrs = {'class':'txt-block'})[-4].get_text().strip()
runt = run[9:]
runti = ''
for i in runt:
    if i == '\n':
        break
    runti +=i
runti

'202 min'

**Question 1.2** (5 points) If you group the movies by release year, show the number of movies in each decade in descending order.

In [2]:
# Question 1.2
dataset = pd.read_csv('top_250_imdb.csv')

In [17]:
# Check the min Year and the max Year
# Define the Bins
# Get the data from the dataframe 
bins = [i for i in range(1920, 2020+1, 10)]
year_list = list(dataset.Year)
decades = pd.cut(year_list, bins)

# Descending order of number of movies
pd.value_counts(decades)

# Descending order of Decades?


(2010, 2020]    46
(2000, 2010]    46
(1990, 2000]    45
(1980, 1990]    26
(1950, 1960]    23
(1970, 1980]    22
(1960, 1970]    16
(1940, 1950]    11
(1930, 1940]     8
(1920, 1930]     7
dtype: int64
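
The `(2000, 2010]` style bins above are closed on the right, so a film released in 2010 is counted with the 2000s. A possible alternative, sketched below with made-up years rather than the scraped data, derives the decade by integer division so that each decade starts at a year ending in 0.

```python
import pandas as pd

# Illustrative years only, not the scraped data
years = pd.Series([1994, 1972, 1974, 2008, 1957, 2010, 2000, 1999])

# Integer division maps 1990-1999 -> 1990, 2000-2009 -> 2000, and so on
decade = (years // 10) * 10

# Number of movies per decade, listed from the most recent decade backwards
by_decade = decade.value_counts().sort_index(ascending=False)
print(by_decade)
```

Replacing `sort_index(ascending=False)` with a plain `value_counts()` would instead order the decades by number of movies, which is the other reading of "descending order".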

**Question 1.3** (5 points) Show the number of movies in each runtime quartile (0-25%, 25-50%, 50-75%, 75-100%).

In [18]:
# Question 1.3
run_list = []
run = list(dataset.Runtime)
for i in run:
    run_list.append(int(i[:-4]))

quartiles = pd.qcut(run_list, 4)  # Cut into quartiles

pd.value_counts(quartiles)

(107.25, 126.0]     63
(44.999, 107.25]    63
(145.0, 321.0]      62
(126.0, 145.0]      62
dtype: int64
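
The output above lists the bins by count, which makes it hard to see which bin is which quartile. The sketch below, on made-up runtimes in the same `NNN min` format, extracts the minutes with a regular expression and labels the quartiles explicitly. The label strings are an assumption taken from the question text.

```python
import pandas as pd

# Illustrative runtimes in the scraped "NNN min" format
runtime = pd.Series(['142 min', '175 min', '202 min', '152 min',
                     '96 min', '107 min', '90 min', '87 min'])

# Pull out the leading digits and convert them to integers
minutes = runtime.str.extract(r'(\d+)', expand=False).astype(int)

# Label the four quartile bins and report them in bin order
labels = ['0-25%', '25-50%', '50-75%', '75-100%']
quartile = pd.qcut(minutes, 4, labels=labels)
counts = quartile.value_counts().sort_index()
print(counts)
```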

**Question 1.4** (5 points) What is the proportion of movies that have Budget higher than 75% of all movies (i.e. the third quartile)?

In [19]:
# Question 1.4

# Get a clean data
not_null_budget = list(dataset.Budget[dataset.Budget.notnull()])
clean_budget = []
for i in not_null_budget:
    clean = []
    for j in i:
        if j in '1234567890':
            clean.append(j)
    if ''.join(clean) != '':
        clean_budget.append(int(''.join(clean)))
        
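budget_quart = pd.qcut(clean_budget, 4) # Cut into quartiles
# Caution: value_counts orders the bins by count, not by position, so index 3 is the
# smallest bin rather than necessarily the top quartile (pass sort=False to keep bin order)
quart_list = [i for i in pd.value_counts(budget_quart)]
third_q = quart_list[3]/sum(quart_list)*100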
print('{}% of movies have a Budget on the third quartile ( > 75% )'.format(round(third_q,2)))

23.46% of movies have a Budget on the third quartile ( > 75% )
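
A more direct reading of the question compares each budget with the 75th percentile. By construction the share is close to 25%, and ties at the threshold can push it lower. The sketch below uses made-up budgets, not the scraped values.

```python
import pandas as pd

# Made-up budgets in dollars; None marks a movie without a reported budget
budget = pd.Series([25_000_000, 6_000_000, 13_000_000, 185_000_000,
                    None, 6_400_000, 28_000_000, 63_000_000, 94_000_000])

known = budget.dropna()
q3 = known.quantile(0.75)           # third quartile of the known budgets
share_above = (known > q3).mean()   # fraction of movies strictly above it

print(f'Q3 = {q3:,.0f}; share above Q3 = {share_above:.1%}')
```

Whether movies with a missing budget belong in the denominator is a judgment call. Here they are excluded.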


**Question 1.5** (5 points) Show the top 10 most popular actors/actresses in terms of the number of movies they have starred in.

In [20]:
# Question 1.5
from collections import Counter
stars_list = list(dataset.Starring)
starring = [i.split(',') for i in stars_list]

starring_clean = []
for i in starring:
    for j in i:
        starring_clean.append(j)
        
# Remove the leading space that appears at the beginning of some names
new_star = []
for i in range(1, len(starring_clean), 3):
    new_star.append(starring_clean[i][1:])
    
for i in range(2, len(starring_clean), 3):
    new_star.append(starring_clean[i][1:])
    
for i in range(0, len(starring_clean), 3):
    new_star.append(starring_clean[i])
    
# Check
# len(new_star) == len(starring_clean)

actors_n = dict(Counter(new_star))
pd.Series(actors_n).sort_values(ascending = False).head(10)

Robert De Niro       9
Harrison Ford        6
Leonardo DiCaprio    6
Charles Chaplin      6
Tom Hanks            6
Toshirô Mifune       5
Christian Bale       5
Clint Eastwood       5
Al Pacino            4
James Stewart        4
dtype: int64
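
The same count can be sketched more compactly with `str.split`, `explode` and `str.strip`, which does not assume exactly three names per movie. The rows below are illustrative and use the same `Name, Name, Name` format as the `Starring` column.

```python
import pandas as pd

# Illustrative rows in the same format as the Starring column
starring = pd.Series([
    'Tim Robbins, Morgan Freeman, Bob Gunton',
    'Marlon Brando, Al Pacino, James Caan',
    'Al Pacino, Robert De Niro, Robert Duvall',
])

# Split each row into a list, give every name its own row, and trim whitespace
actors = starring.str.split(',').explode().str.strip()

top_actors = actors.value_counts().head(10)
print(top_actors)
```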

**Question 1.6** (5 points) Show the top 5 directors with the highest total box office revenue.

In [95]:
# Question 1.6
# Clean box office revenues
# Create a Series with the sum of box office revenues per director
# get the top 5

In [3]:
# Clean box office revenues: keep the digits only, then convert to float
box = dataset.Box_Off_Rev.replace('Rodgers', np.nan)  # non-numeric value in this column, treated as missing
box = box.str.replace(r'\D', '', regex=True)           # regex=True is required in pandas >= 2.0
dataset['Box_Off_Rev'] = box.replace('', np.nan).astype(float)

In [22]:
dataset.groupby(['Director'])['Box_Off_Rev'].sum().sort_values(ascending = False).head(5)

Director
Anthony Russo, Joe Russo    4.846160e+09
Christopher Nolan           4.143007e+09
Steven Spielberg            3.055116e+09
Peter Jackson               2.973633e+09
David Yates                 1.342207e+09
Name: Box_Off_Rev, dtype: float64
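
Stripping every non-digit character treats a value such as `EUR650,000` as if it were 650,000 dollars. One possible way to keep the currencies apart is sketched below. The column names `currency` and `amount` are illustrative assumptions.

```python
import pandas as pd

# Values in the formats seen in the scraped columns, plus a missing entry
money = pd.Series(['$28,815,291', 'EUR650,000', None, '$576'])

# Capture a non-digit prefix (the currency marker) and the digits-with-commas amount
parts = money.str.extract(r'^(?P<currency>\D*)(?P<amount>[\d,]+)$')
parts['amount'] = parts['amount'].str.replace(',', '', regex=False).astype(float)

# Keep only dollar amounts so that sums and ratios stay in one currency
usd_only = parts['amount'].where(parts['currency'] == '$')
print(parts)
print(usd_only)
```

Non-dollar amounts become missing here rather than being converted, since a conversion would need an exchange rate that the scraped data does not provide.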

**Question 1.7** (5 points) Show the average ratings of movies across the genres and decades.

In [5]:
# Question 1.7
bins = [i for i in range(1920, 2020+1, 10)]
year_list = list(dataset.Year)
decades = pd.cut(year_list, bins)
dataset['decade'] = pd.Series(decades)

In [26]:
df = pd.DataFrame(dataset.groupby(['Genre','decade'])['Rating'].mean().dropna())
df.head(25).sort_values('Rating', ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Rating
Genre,decade,Unnamed: 2_level_1
"Action,Crime,Drama,Thriller","(2000, 2010]",9.0
"Action,Adventure,Sci-Fi,Thriller","(2000, 2010]",8.8
"Action,Adventure,Drama,Fantasy","(2000, 2010]",8.8
"Action,Adventure,Fantasy,Sci-Fi","(1970, 1980]",8.65
"Action,Drama,Mystery","(1960, 1970]",8.6
"Action,Adventure,Drama","(1950, 1960]",8.6
"Action,Adventure,Drama","(1990, 2000]",8.5
"Action,Crime,Drama,Thriller","(1990, 2000]",8.5
"Action,Drama,Mystery,Thriller","(2000, 2010]",8.4
"Action,Adventure","(2010, 2020]",8.4


In [37]:
# What I can do here is to just... Create a new Dataframe with 
#  - Decades - Rating - Main Genre - Second ...
genre_list = list(dataset.Genre)
genre = [i.split(',') for i in genre_list]     

genres_df = (dataset['Genre'].str.split(',', expand=True).rename(columns=lambda x: f"genre_{x+1}"))
genres = pd.concat([genres_df, dataset[['Rating','decade','Movie']]], axis = 1)
genres.rename(columns = {'genre_1':'Main'})

Unnamed: 0,Main,genre_2,genre_3,genre_4,genre_5,genre_6,genre_7,Rating,decade,Movie
0,Drama,,,,,,,9.3,"(1990, 2000]",The Shawshank Redemption
1,Crime,Drama,,,,,,9.2,"(1970, 1980]",The Godfather
2,Crime,Drama,,,,,,9.0,"(1970, 1980]",The Godfather: Part II
3,Action,Crime,Drama,Thriller,,,,9.0,"(2000, 2010]",The Dark Knight
4,Crime,Drama,,,,,,8.9,"(1950, 1960]",12 Angry Men
...,...,...,...,...,...,...,...,...,...,...
245,Action,Sci-Fi,,,,,,8.0,"(1980, 1990]",The Terminator
246,Animation,Adventure,Comedy,Family,Fantasy,Musical,Romance,8.0,"(1990, 2000]",Aladdin
247,Drama,War,,,,,,8.2,"(2010, 2020]",Tangerines
248,Drama,History,,,,,,8.1,"(1950, 1960]",Throne of Blood


In [38]:
pd.DataFrame(genres.groupby(['genre_1','genre_2','genre_3','genre_4','genre_5','decade'])['Rating'].mean().dropna()).head(10).sort_values('Rating', ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Rating
genre_1,genre_2,genre_3,genre_4,genre_5,decade,Unnamed: 6_level_1
Animation,Adventure,Drama,Family,Musical,"(1990, 2000]",8.5
Animation,Action,Adventure,Family,Sci-Fi,"(2010, 2020]",8.4
Adventure,Biography,Drama,History,War,"(1960, 1970]",8.3
Adventure,Drama,History,Thriller,War,"(1960, 1970]",8.2
Animation,Adventure,Comedy,Family,Fantasy,"(1990, 2000]",8.15
Animation,Adventure,Comedy,Family,Fantasy,"(2000, 2010]",8.15
Action,Adventure,Comedy,Drama,War,"(1920, 1930]",8.1
Action,Crime,Drama,Mystery,Thriller,"(2000, 2010]",8.1
Animation,Action,Adventure,Family,Fantasy,"(2000, 2010]",8.1
Animation,Adventure,Comedy,Drama,Family,"(2010, 2020]",8.1


In [7]:
dataset.Genre
gen = dataset['Genre'].str.get_dummies(',')
gen['Rank'] = dataset.Rank

In [21]:
genres = pd.merge(dataset[['decade','Rating','Rank']], gen, on = 'Rank')

In [12]:
genres_dec = list(genres.columns[3:])
genres_dec.append('decade')

In [35]:
genres

Unnamed: 0,decade,Rating,Rank,Action,Adventure,Animation,Biography,Comedy,Crime,Drama,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western
0,"(1990, 2000]",9.3,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,"(1970, 1980]",9.2,2,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,"(1970, 1980]",9.0,3,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,"(2000, 2010]",9.0,4,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,1,0,0
4,"(1950, 1960]",8.9,5,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,"(1980, 1990]",8.0,246,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
246,"(1990, 2000]",8.0,247,0,1,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
247,"(2010, 2020]",8.2,248,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
248,"(1950, 1960]",8.1,249,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [56]:
genres_mean_rat = pd.DataFrame(genres.groupby(['Action','Adventure','Animation','Biography','decade'])['Rating'].mean().dropna())
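
Grouping on the full genre string or on a few dummy columns mixes genre combinations together. A possible alternative, sketched below on a small made-up frame, gives each (movie, genre) pair its own row and then pivots to a genre-by-decade table of mean ratings. For brevity the decade is computed by integer division rather than with the `pd.cut` bins used above.

```python
import pandas as pd

# Small illustrative frame with the same column layout as the scraped data
movies = pd.DataFrame({
    'Movie':  ['A', 'B', 'C', 'D'],
    'Genre':  ['Crime,Drama', 'Drama', 'Action,Crime', 'Drama'],
    'Year':   [1972, 1994, 2008, 1999],
    'Rating': [9.2, 9.3, 9.0, 8.8],
})

movies['decade'] = (movies['Year'] // 10) * 10

# One row per (movie, genre) pair, so a movie counts once for each of its genres
long_form = movies.assign(Genre=movies['Genre'].str.split(',')).explode('Genre')

# Rows are genres, columns are decades, cells are mean ratings
genre_decade = long_form.pivot_table(index='Genre', columns='decade',
                                     values='Rating', aggfunc='mean')
print(genre_decade)
```

Cells with no movies for a given genre and decade are left as `NaN`.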

**Question 1.8** (5 points) Create a new column ```ROI``` that measures the return on investment as (box office revenue - budget)/budget, and compare the ROI of movies in English with that of movies in other languages. Use a t-test to examine whether the difference is statistically significant (you can use ```scipy.stats.ttest_ind``` to test the difference in means of two samples).

In [208]:
# Clean the Budget:
budget_clean = dataset.Budget
budget_clean = budget_clean.str.replace(r'\D', '', regex=True)  # regex=True is required in pandas >= 2.0
budget_clean = budget_clean.replace('', np.nan)
budget_clean = budget_clean.astype(float)
dataset['Budget'] = budget_clean

In [209]:
# Question 1.8
dataset['ROI (%)'] = ((dataset.Box_Off_Rev - dataset.Budget)/(dataset.Budget))*100
roi_eng = pd.DataFrame(dataset.groupby(dataset['Language'] == 'English')['ROI (%)'].mean())
roi_eng

Unnamed: 0_level_0,ROI (%)
Language,Unnamed: 1_level_1
False,732.548906
True,720.845947


In [210]:
roi_eng = np.array(dataset.loc[dataset['Language']=='English']['ROI (%)'].dropna())
roi_not_eng = np.array(dataset.loc[dataset['Language'] != 'English']['ROI (%)'].dropna())

In [211]:
from scipy import stats
stats.ttest_ind(roi_eng, roi_not_eng)

Ttest_indResult(statistic=-0.05541189826369033, pvalue=0.9558728147873792)

In [212]:
print('''The p-value (about 0.96) is far above 0.05, so the difference in mean ROI between English and non-English movies is not statistically significant''')

The p-value (about 0.96) is far above 0.05, so the difference in mean ROI between English and non-English movies is not statistically significant
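
By default `ttest_ind` assumes that both groups have equal variances. The two language groups differ in size and ROI values can be very spread out, so Welch's variant (`equal_var=False`) may be a safer choice. The sketch below uses made-up ROI percentages, not the assignment's data.

```python
import numpy as np
from scipy import stats

# Made-up ROI percentages with one missing value, for illustration only
roi_english = np.array([15.3, 4002.0, 269.5, 443.5, np.nan, 1129.4])
roi_other = np.array([57.6, 820.1, 2310.0, 95.2])

# Drop missing values before testing
roi_english = roi_english[~np.isnan(roi_english)]
roi_other = roi_other[~np.isnan(roi_other)]

# equal_var=False selects Welch's t-test, which does not assume equal variances
result = stats.ttest_ind(roi_english, roi_other, equal_var=False)
print(result.statistic, result.pvalue)
```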


**Question 1.9** (5 points) Do commercially successful movies also receive higher ratings? Check the correlation between box office revenue and ratings using the Pearson and Spearman correlations.

In [213]:
# Pearson measures linear association; Spearman measures monotonic (rank-based) association
rating_revenue = dataset[['Rating', 'Box_Off_Rev']]
pearson = rating_revenue.Box_Off_Rev.corr(rating_revenue.Rating, method = 'pearson')
spearman = rating_revenue.Box_Off_Rev.corr(rating_revenue.Rating, method = 'spearman')

print('The pearson coefficient is {} and the spearman coefficient is {}'.format(pearson, spearman))
print('''\nBoth coefficients are positive but small, which gives only weak support for the claim that 
commercially successful movies tend to receive higher ratings''')
print('''\nPearson measures the strength of a linear relationship and is sensitive to extreme values, such as a few 
very large box office revenues. Spearman works on ranks and only measures whether the relationship is monotonic. 
This likely explains why the two coefficients differ.''')

The pearson coefficient is 0.20835809449531878 and the spearman coefficient is 0.14908299285023477

Both coefficients are positive but small, which gives only weak support for the claim that 
commercially successful movies tend to receive higher ratings

Pearson measures the strength of a linear relationship and is sensitive to extreme values, such as a few 
very large box office revenues. Spearman works on ranks and only measures whether the relationship is monotonic. 
This likely explains why the two coefficients differ.
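
To illustrate how the two coefficients can disagree, the sketch below uses made-up data in which `y` always increases with `x` but not along a straight line. The ranks agree perfectly, so Spearman is 1, while Pearson is lower.

```python
import pandas as pd

# Made-up data: y grows with x, but not along a straight line
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = x ** 3

pearson = x.corr(y, method='pearson')    # strength of the linear relationship
spearman = x.corr(y, method='spearman')  # strength of the monotonic (rank) relationship

print(f'pearson = {pearson:.3f}, spearman = {spearman:.3f}')
```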


**Question 1.10** (10 points) Now let's retrieve data from Bechdel Test Movie website [for each movie](https://bechdeltest.com/). You can send the requests to the API: https://bechdeltest.com/api/v1/doc#getMovieByImdbId. For example, for the movie The Shawshank Redemption (the IMDb id: 0111161), you can simply call: http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid=0111161. 

Create a dataframe ```bechdel_imdb_top``` that merges the Bechdel test info with ```imdb_top_movies```, and show how many Top 250 movies are also on the Bechdel Test website.

In [106]:
dataset = pd.read_csv('top_250_imdb.csv')

In [126]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

bechdel_list = []
for imdb_id in dataset.imdb_id:
    # read_csv parses the ids as integers, so restore the leading zeros
    imdb_id = str(imdb_id).zfill(7)
    bechdel = requests.get("http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid="+imdb_id)
    bechdel = bechdel.json()
    bechdel_list.append(bechdel)

In [241]:
bechdel_imdb_aux = pd.DataFrame(bechdel_list)

In [242]:
bechdel_imdb_aux['Rank'] = dataset.Rank

In [243]:
bechdel_imdb_top = pd.merge(bechdel_imdb_aux, dataset, on = 'Rank')

In [218]:
bechdel_imdb_top.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 250 entries, 0 to 249
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   date         236 non-null    object  
 1   title        236 non-null    object  
 2   year         236 non-null    float64 
 3   rating       236 non-null    float64 
 4   visible      236 non-null    object  
 5   dubious      200 non-null    object  
 6   imdbid       236 non-null    object  
 7   submitterid  236 non-null    float64 
 8   id           236 non-null    float64 
 9   description  14 non-null     object  
 10  version      14 non-null     object  
 11  status       14 non-null     object  
 12  Rank         250 non-null    int64   
 13  Unnamed: 0   250 non-null    int64   
 14  imdb_id      250 non-null    int64   
 15  Movie        250 non-null    object  
 16  Year         250 non-null    int64   
 17  Director     250 non-null    object  
 18  Starring     250 non-null    o
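
The question also asks how many Top 250 movies appear on the Bechdel Test website. One way to count the matches is a left merge on the IMDb id with `indicator=True`, sketched below on placeholder data. The ids, titles and ratings are invented for illustration.

```python
import pandas as pd

# Placeholder tables; the ids are kept as zero-padded strings so that they match
imdb = pd.DataFrame({'imdb_id': ['0000001', '0000002', '0000003'],
                     'Movie': ['Movie A', 'Movie B', 'Movie C']})
bechdel = pd.DataFrame({'imdbid': ['0000001', '0000002'],
                        'rating': [0, 3]})

# how='left' keeps every Top 250 row; _merge records whether a Bechdel record was found
merged = imdb.merge(bechdel, how='left', left_on='imdb_id', right_on='imdbid',
                    indicator=True)

n_matched = (merged['_merge'] == 'both').sum()
print(f'{n_matched} of {len(merged)} movies have a Bechdel record')
```

Merging on the id avoids relying on both tables being in the same row order.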

**Question 1.11** (5 points) Show the percentage of movies at each level of the test (a number from 0 to 3: 0 means no two women, 1 means no talking, 2 means talking about a man, 3 means it passes the test).

In [219]:
pass_test_percent = bechdel_imdb_top.groupby('rating')['Movie'].count()
pass_test_percent / pass_test_percent.sum() * 100

rating
0.0    21.186441
1.0    34.322034
2.0     9.745763
3.0    34.745763
Name: Movie, dtype: float64

**Question 1.12** (5 points) For each genre, show the percentage of movies at each level of the test (a number from 0 to 3: 0 means no two women, 1 means no talking, 2 means talking about a man, 3 means it passes the test).

In [224]:
# You need to be able to separate genres into different columns
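
One possible approach, sketched below on made-up genres and ratings, is to give each (movie, genre) pair its own row and then build a row-normalised cross-tabulation, so that each genre's percentages sum to 100.

```python
import pandas as pd

# Made-up genres and Bechdel ratings, for illustration only
movies = pd.DataFrame({
    'Genre':  ['Crime,Drama', 'Drama', 'Action,Crime', 'Drama,War'],
    'rating': [1, 3, 0, 3],
})

# One row per (movie, genre) pair
long_form = movies.assign(Genre=movies['Genre'].str.split(',')).explode('Genre')

# normalize='index' turns each genre's row into shares that sum to 1
share = pd.crosstab(long_form['Genre'], long_form['rating'], normalize='index') * 100
print(share.round(1))
```

Because a movie with several genres appears in several rows, the genre rows are not mutually exclusive.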


**Question 1.13** (5 points) Show the top 10 highest-rated movies that passed the test completely (rating=3) 

In [221]:
bechdel_imdb_top.loc[bechdel_imdb_top['rating']==3.0].sort_values(['Rating'], ascending = False)[['Movie','Rating']].head(10)

Unnamed: 0,Movie,Rating
3,The Dark Knight,9.0
7,Pulp Fiction,8.9
5,Schindler's List,8.9
12,Inception,8.8
13,The Lord of the Rings: The Two Towers,8.7
15,The Matrix,8.7
16,Goodfellas,8.7
26,Spirited Away,8.6
29,Interstellar,8.6
28,Parasite,8.6


**Question 1.14** (5 points) Comparing the movies that passed (rating=3) and failed the test (rating=0), is their ROI different? Explain.

In [240]:
rat_3 = bechdel_imdb_top.loc[bechdel_imdb_top['rating']==3.0]['ROI (%)'].mean()
rat_0 = bechdel_imdb_top.loc[bechdel_imdb_top['rating']==0.0]['ROI (%)'].mean()
pd.DataFrame({'rating = 3': [rat_3], 'rating = 0': [rat_0]}, index = ['ROI %'])

Unnamed: 0,rating = 3,rating = 0
ROI %,846.750834,617.732173


In [249]:
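print('''Mean ROI is higher for movies that pass the Bechdel Test (about 847%) than for movies with rating 0 (about 618%).
No significance test was run, so the difference may not be statistically significant.''')

Mean ROI is higher for movies that pass the Bechdel Test (about 847%) than for movies with rating 0 (about 618%).
No significance test was run, so the difference may not be statistically significant.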



**Question 1.15** (10 points) Now load ```bechdel_imdb.json```, which contains all movies rated by the Bechdel Test website. Has women's representation improved over the decades? Create a dataframe ```bechdel_imdb``` and, comparing the top 250 with the other movies, show what percentage passed/failed the test.

In [253]:
with open('bechdel_imdb.json') as json_file:
    bechdel_imdb = pd.read_json(json_file)

In [307]:
all_data_pass_rate = bechdel_imdb.groupby('rating')['title'].count()
a = pd.DataFrame(all_data_pass_rate.apply(lambda x: (x/all_data_pass_rate.sum())*100))
a.rename(columns = {'title': 'Pass rate (%)'})

Unnamed: 0_level_0,Pass rate (%)
rating,Unnamed: 1_level_1
0,10.158619
1,21.950082
2,10.181945
3,57.709354


In [306]:
bechdel_imdb.year.min()
bechdel_imdb.year.max()

2020

In [272]:
bins = [i for i in range(1880, 2020+1, 10)]
year_list_bech = list(bechdel_imdb.year)
decades_bech = pd.cut(year_list_bech, bins)

bechdel_imdb['decade'] = pd.Series(decades_bech)

In [304]:
pass_test_decades = bechdel_imdb_top.loc[bechdel_imdb_top['rating']==3.0].sort_values(['Rating'], ascending = False)[['decade','Rating','title']]
pass_test_decades.groupby('decade').count()

Unnamed: 0_level_0,Rating,title
decade,Unnamed: 1_level_1,Unnamed: 2_level_1
"(1920, 1930]",0,0
"(1930, 1940]",6,6
"(1940, 1950]",3,3
"(1950, 1960]",6,6
"(1960, 1970]",2,2
"(1970, 1980]",3,3
"(1980, 1990]",8,8
"(1990, 2000]",13,13
"(2000, 2010]",18,18
"(2010, 2020]",23,23


In [305]:
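print('More recent decades contribute more Top 250 movies that pass the test. These are raw counts, so they do not show on their own that the pass rate has increased')

More recent decades contribute more Top 250 movies that pass the test. These are raw counts, so they do not show on their own that the pass rate has increased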
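
Counts per decade grow partly because recent decades have more rated movies, so a pass rate is easier to compare across decades. The sketch below computes the share of movies with rating 3 per decade and per group on made-up records. The `top250` flag and the `group` labels are illustrative assumptions.

```python
import pandas as pd

# Made-up records shaped like the Bechdel data (year, rating) plus a Top 250 flag
films = pd.DataFrame({
    'year':   [1955, 1958, 1994, 1997, 1999, 2012, 2015, 2018],
    'rating': [0, 3, 1, 3, 3, 3, 2, 3],
    'top250': [True, False, True, False, False, True, False, False],
})

films['decade'] = (films['year'] // 10) * 10
films['passed'] = films['rating'] == 3
films['group'] = films['top250'].map({True: 'Top 250', False: 'Other'})

# The mean of a boolean column is the share of True values
pass_rate_by_decade = films.groupby('decade')['passed'].mean() * 100
pass_rate_by_group = films.groupby('group')['passed'].mean() * 100

print(pass_rate_by_decade.round(1))
print(pass_rate_by_group.round(1))
```

With the real data, the flag could be derived from the IMDb ids, for example by checking which ids in `bechdel_imdb` also appear in `bechdel_imdb_top`, assuming both tables store the id in the same format.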
