# Marvel Cinematic Universe Movie Ratings

Author: Eze Ahunanya 

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, the ratings of the Marvel Cinematic Universe movies will be explored. In particular, the report aims to answer the following questions: 'How do the scores from critics and general audience compare?' and 'What is the relationship of box office earnings and movie ratings?'. The data used will be sourced from the Rotten Tomatoes website. 

<a id='intro'></a>
## Data Wrangling

In [2]:
import requests 
from bs4 import BeautifulSoup

In [59]:
# url contains movie titles
url = 'https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe#Films'

# save html file in response variable
response = requests.get(url)

In [60]:
# parse html file and save to soup variable
soup = BeautifulSoup(response.content, 'lxml')

In [104]:
movies_list = []

for i in [x for x in range(14, 37)]:
    
    # extract movie titles inside 'th' tag from 15th to 37th elements in the list
    movie_line = soup.find_all('th', scope="row")[i] 
    movie_title = movie_line.contents[0].contents[0].contents[0] 
    print(i, movie_title)
    movies_list.append(movie_title)

14 Iron Man
15 The Incredible Hulk
16 Iron Man 2
17 Thor
18 Captain America: The First Avenger
19 Marvel's The Avengers
20 Iron Man 3
21 Thor: The Dark World
22 Captain America: The Winter Soldier
23 Guardians of the Galaxy
24 Avengers: Age of Ultron
25 Ant-Man
26 Captain America: Civil War
27 Doctor Strange
28 Guardians of the Galaxy Vol. 2
29 Spider-Man: Homecoming
30 Thor: Ragnarok
31 Black Panther
32 Avengers: Infinity War
33 Ant-Man and the Wasp
34 Captain Marvel
35 Avengers: Endgame
36 Spider-Man: Far From Home


In [105]:
urls_list = []

for movie_title in movies_list:
    
    # format movie strings for urls
    movie_title = movie_title.lower().replace(" ", "_").replace("-", "_")\
    .replace(":", "").replace("'", "").replace(".", "")
    url = 'https://www.rottentomatoes.com/m/{}'.format(movie_title)
    print(url)
    urls_list.append(url) 

https://www.rottentomatoes.com/m/iron_man
https://www.rottentomatoes.com/m/the_incredible_hulk
https://www.rottentomatoes.com/m/iron_man_2
https://www.rottentomatoes.com/m/thor
https://www.rottentomatoes.com/m/captain_america_the_first_avenger
https://www.rottentomatoes.com/m/marvels_the_avengers
https://www.rottentomatoes.com/m/iron_man_3
https://www.rottentomatoes.com/m/thor_the_dark_world
https://www.rottentomatoes.com/m/captain_america_the_winter_soldier
https://www.rottentomatoes.com/m/guardians_of_the_galaxy
https://www.rottentomatoes.com/m/avengers_age_of_ultron
https://www.rottentomatoes.com/m/ant_man
https://www.rottentomatoes.com/m/captain_america_civil_war
https://www.rottentomatoes.com/m/doctor_strange
https://www.rottentomatoes.com/m/guardians_of_the_galaxy_vol_2
https://www.rottentomatoes.com/m/spider_man_homecoming
https://www.rottentomatoes.com/m/thor_ragnarok
https://www.rottentomatoes.com/m/black_panther
https://www.rottentomatoes.com/m/avengers_infinity_war
https://w

In [107]:
# correct faulty url addresses
urls_list[13] = 'https://www.rottentomatoes.com/m/doctor_strange_2016'
urls_list[17] = 'https://www.rottentomatoes.com/m/black_panther_2018'

In [108]:
urls_list

['https://www.rottentomatoes.com/m/iron_man',
 'https://www.rottentomatoes.com/m/the_incredible_hulk',
 'https://www.rottentomatoes.com/m/iron_man_2',
 'https://www.rottentomatoes.com/m/thor',
 'https://www.rottentomatoes.com/m/captain_america_the_first_avenger',
 'https://www.rottentomatoes.com/m/marvels_the_avengers',
 'https://www.rottentomatoes.com/m/iron_man_3',
 'https://www.rottentomatoes.com/m/thor_the_dark_world',
 'https://www.rottentomatoes.com/m/captain_america_the_winter_soldier',
 'https://www.rottentomatoes.com/m/guardians_of_the_galaxy',
 'https://www.rottentomatoes.com/m/avengers_age_of_ultron',
 'https://www.rottentomatoes.com/m/ant_man',
 'https://www.rottentomatoes.com/m/captain_america_civil_war',
 'https://www.rottentomatoes.com/m/doctor_strange_2016',
 'https://www.rottentomatoes.com/m/guardians_of_the_galaxy_vol_2',
 'https://www.rottentomatoes.com/m/spider_man_homecoming',
 'https://www.rottentomatoes.com/m/thor_ragnarok',
 'https://www.rottentomatoes.com/m/bla

In [20]:
import numpy as np
import pandas as pd

In [100]:
def get_movie_data(urls_list):
    """Get the movie data from its url.
    
    Arg: 
    url: link containing movie data. This should be a list.
    
    Returns:
    df: A dataframe containing movie data."""
    
    for i, url in enumerate(urls_list):    
        if i == 0:
            # save html file in respone variable
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'lxml')

            # extract scores data as dictionary
            scores_text = soup.find_all('script', type= 'text/javascript')[2].text
            scores_text_lines_list = scores_text.split('\n')
            for line in scores_text_lines_list:
                if 'root.RottenTomatoes.context.scoreInfo' in line:
                    scores_data_dict = line
                    break
            scores_data_dict = scores_data_dict.replace('root.RottenTomatoes.context.scoreInfo = ', '')\
            .replace('true', 'True').replace('false', 'False').replace('null', 'np.nan')[:-1]

            # convert dictionary into the first dataframe
            df = pd.DataFrame.from_dict(eval(dict), orient='index').reset_index()
        
            # extract other movie info and add to the dataframe
            movie_title = soup.title.text[:-len(' - Rotten Tomatoes')]
            df['movie_title'] = pd.Series([movie_title] * len(df['index']))
            release_date_theaters = soup.find_all('li', class_='meta-row clearfix')[6].contents[3].contents[1].text
            df['release_date_theaters'] = pd.Series([release_date_theaters] * len(df['index']))
            box_office_gross_usa = soup.find_all('li', class_='meta-row clearfix')[8].contents[3].text
            df['box_office_gross_usa'] = pd.Series([box_office_gross_usa] * len(df['index']))
            runtime = soup.find_all('li', class_='meta-row clearfix')[9].contents[3].text.split('\n')[2].replace(' ', '')
            df['runtime'] = pd.Series([runtime] * len(df['index']))
        
        else:
            # save html file in respone variable
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'lxml')

            # extract scores data as dictionary 
            scores_text = soup.find_all('script', type= 'text/javascript')[2].text
            scores_text_lines_list = scores_text.split('\n')
            for line in scores_text_lines_list:
                if 'root.RottenTomatoes.context.scoreInfo' in line:
                    scores_data_dict = line
                    break
            scores_data_dict = scores_data_dict.replace('root.RottenTomatoes.context.scoreInfo = ', '')\
            .replace('true', 'True').replace('false', 'False').replace('null', 'np.nan')[:-1]

            # convert dictionary into the second dataframe
            df1 = pd.DataFrame.from_dict(eval(dict), orient='index').reset_index()

            # extract other movie info and add to the dataframe
            movie_title = soup.title.text[:-len(' - Rotten Tomatoes')]
            df1['movie_title'] = pd.Series([movie_title] * len(df1['index']))
            release_date_theaters = soup.find_all('li', class_='meta-row clearfix')[6].contents[3].contents[1].text
            df1['release_date_theaters'] = pd.Series([release_date_theaters] * len(df1['index']))
            box_office_gross_usa = soup.find_all('li', class_='meta-row clearfix')[8].contents[3].text
            df1['box_office_gross_usa'] = pd.Series([box_office_gross_usa] * len(df1['index']))
            runtime = soup.find_all('li', class_='meta-row clearfix')[9].contents[3].text.split('\n')[2].replace(' ', '')
            df1['runtime'] = pd.Series([runtime] * len(df1['index']))

            # combine two dataframes
            df = pd.concat([df, df1], ignore_index = 'True')
            
    return df

In [103]:
df = get_movie_data(urls_list)
df

Unnamed: 0,index,score,averageRating,scoreSentiment,reviewCount,ratingCount,scoreType,likedCount,notLikedCount,certified,tomatometerState,audienceClass,movie_title,release_date_theaters,box_office_gross_usa,runtime
0,tomatometerAllCritics,90,7.44,POSITIVE,441,441,,399,42,True,certified-fresh,,Iron Man (2008),"May 2, 2008",$318.3M,2h6m
1,tomatometerTopCritics,88,6.78,POSITIVE,51,51,,45,6,True,certified-fresh,,Iron Man (2008),"May 2, 2008",$318.3M,2h6m
2,audienceAll,93,4.53,POSITIVE,14596,94118,ALL,87824,6294,False,,upright,Iron Man (2008),"May 2, 2008",$318.3M,2h6m
3,audienceVerified,95,4.63,POSITIVE,8947,69242,VERIFIED,65928,3314,False,,,Iron Man (2008),"May 2, 2008",$318.3M,2h6m
4,tomatometerAllCritics,90,7.44,POSITIVE,441,441,,399,42,True,certified-fresh,,The Incredible Hulk (2008),"Jun 13, 2008",$134.5M,1h52m
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,audienceVerified,95,4.63,POSITIVE,8947,69242,VERIFIED,65928,3314,False,,,Avengers: Endgame (2019),"Apr 26, 2019",$858.4M,3h1m
88,tomatometerAllCritics,90,7.44,POSITIVE,441,441,,399,42,True,certified-fresh,,Spider-Man: Far From Home (2019),"Jul 2, 2019",$390.7M,2h9m
89,tomatometerTopCritics,88,6.78,POSITIVE,51,51,,45,6,True,certified-fresh,,Spider-Man: Far From Home (2019),"Jul 2, 2019",$390.7M,2h9m
90,audienceAll,93,4.53,POSITIVE,14596,94118,ALL,87824,6294,False,,upright,Spider-Man: Far From Home (2019),"Jul 2, 2019",$390.7M,2h9m
