# Web Scraper and API calls

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Web-Scraper-and-API-calls" data-toc-modified-id="Web-Scraper-and-API-calls-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Web Scraper and API calls</a></span><ul class="toc-item"><li><span><a href="#The-Numbers.com-Scraper" data-toc-modified-id="The-Numbers.com-Scraper-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>The-Numbers.com Scraper</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Summary</a></span></li></ul></li><li><span><a href="#IMDb-Dataset-Downloader" data-toc-modified-id="IMDb-Dataset-Downloader-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>IMDb Dataset Downloader</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Summary</a></span></li></ul></li></ul></li></ul></div>

## The-Numbers.com Scraper 

### Introduction

This is the first notebook for data collection. We will use data provided by [The Numbers](https://www.the-numbers.com), a useful source for the production budgets and the domestic and worldwide gross of movies. To support better, up-to-date recommendations, this scraper takes data from the [The Numbers Movie Budgets](https://www.the-numbers.com/movie/budgets/all) page and stores it in a Pandas dataframe. The scraped data is also more recent than the dataset provided in the project repository.

In [1]:
# Importing necessary libraries 
import requests 
import pandas as pd 
from bs4 import BeautifulSoup
import re

In [2]:
#Exploring the movies' budgets landing page

BUDGET_PAGE_URL = "https://www.the-numbers.com/movie/budgets/all"
html_page = requests.get(BUDGET_PAGE_URL)
soup = BeautifulSoup(html_page.content, 'html.parser') 

The landing page for movie budgets looks like the image shown below. The first `<center>` element contains a `<tbody>` block, which is where the tabulated data is stored.

![Landing page](../images/the_numbers_budget_inspect.png)

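Let's explore the enclosing `<table>` element. Note that the raw HTML returned to the scraper has no `<tbody>` wrapper, as the output below shows; browsers add it when building the page.

In [3]:
# Find the first <table> element and display its markup
table = soup.find('table')
table

<table>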
<tr><th> </th><th>Release<br/>Date</th><th>Movie</th><th>Production<br/>Budget</th><th>Domestic<br/>Gross</th><th>Worldwide<br/>Gross</th></tr>
<tr><td class="data">1</td>
<td><a href="/box-office-chart/daily/2019/04/23">Apr 23, 2019</a></td>
<td><b><a href="/movie/Avengers-Endgame-(2019)#tab=summary">Avengers: Endgame</a></b></td>
<td class="data"> $400,000,000</td>
<td class="data"> $858,373,000</td>
<td class="data"> $2,797,800,564</td>
</tr>
<tr><td class="data">2</td>
<td><a href="/box-office-chart/daily/2011/05/20">May 20, 2011</a></td>
<td><b><a href="/movie/Pirates-of-the-Caribbean-On-Stranger-Tides#tab=summary">Pirates of the Caribbean: On Stranger Tides</a></b></td>
<td class="data"> $379,000,000</td>
<td class="data"> $241,071,802</td>
<td class="data"> $1,045,713,802</td>
</tr>
<tr><td class="data">3</td>
<td><a href="/box-office-chart/daily/2015/04/22">Apr 22, 2015</a></td>
<td><b><a href="/movie/Avengers-Age-of-Ultron#tab=summary">Ave

Next, extract the table rows, which are held in `<tr>` tags.

In [4]:
#Getting the table contents under <tr>
content = table.findAll('tr')
content[0]

<tr><th> </th><th>Release<br/>Date</th><th>Movie</th><th>Production<br/>Budget</th><th>Domestic<br/>Gross</th><th>Worldwide<br/>Gross</th></tr>

The first `<tr>` element holds the column names in `<th>` elements. The remaining rows hold the values in `<td>` elements, some of which carry the class `data`.

In [5]:
#Extracting column names
col_names = content[0].findAll('th')

#Initialize the first column as 'id' 
columns = ['id']

#Iteratively extract column names 
for col in col_names[1:]:
    name = col.get_text(strip=True)
    #Split the string at upper cases  
    split_name = re.findall('[A-Z][^A-Z]*',name)
    columns.append('_'.join(split_name).lower())
columns        

['id',
 'release_date',
 'movie',
 'production_budget',
 'domestic_gross',
 'worldwide_gross']
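
The upper-case split used above can be checked in isolation. The sketch below wraps it in a helper; the name `to_snake_case` is an assumption made for illustration and is not part of the notebook.

```python
import re

def to_snake_case(header):
    """Split a header at each capital letter and join the parts with '_'."""
    # 'ProductionBudget' -> ['Production', 'Budget'] -> 'production_budget'
    parts = re.findall('[A-Z][^A-Z]*', header)
    return '_'.join(parts).lower()

for header in ['ReleaseDate', 'Movie', 'ProductionBudget']:
    print(header, '->', to_snake_case(header))
```

One caveat of this pattern: any text before the first capital letter is dropped, so it suits headers like these, where every word starts with a capital.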

The movie data itself is in the remaining rows, `content[1:]`.

In [6]:
#Populating the movie data 
movie_data = content[1:]

#Create a 2D table for movies 
data = []

#Iteratively extract movie information
for movie in movie_data:
    movie_info = []
    for info in movie.findAll('td'):
        movie_info.append(info.get_text(strip=True))
    data.append(movie_info)
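
The row-extraction loop can also be tried without a network request. The sketch below runs the same idea on a small inline table; `SAMPLE_HTML` and `table_rows` are hypothetical names, and the markup is a shortened imitation of the page structure shown earlier, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the structure of the budget page
SAMPLE_HTML = """
<table>
<tr><th> </th><th>Release<br/>Date</th><th>Movie</th></tr>
<tr><td class="data">1</td><td><a href="#">Apr 23, 2019</a></td><td><b><a href="#">Avengers: Endgame</a></b></td></tr>
<tr><td class="data">2</td><td><a href="#">May 20, 2011</a></td><td><b><a href="#">Pirates of the Caribbean: On Stranger Tides</a></b></td></tr>
</table>
"""

def table_rows(html):
    """Return the stripped text of every <td>, one list per data row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find('table').find_all('tr')
    # Skip the header row, which holds <th> cells rather than <td> cells
    return [[cell.get_text(strip=True) for cell in row.find_all('td')]
            for row in rows[1:]]

rows = table_rows(SAMPLE_HTML)
print(rows[0])
```

Because `get_text` collects the text of nested tags, the `<a>` and `<b>` wrappers around dates and titles need no special handling.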

Assign values to a dataframe: 

In [7]:
df = pd.DataFrame(data,columns=columns)
df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
2,3,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
3,4,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,064,615,817"
4,5,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,044,540,523"


The link to the next page is the second `<a>` element inside the `<div>` with `class='pagination'`, as shown below.

![Next page element](../images/the_numbers_next_page.png)

In [8]:
#Find the next page link
next_pages = soup.find('div',class_='pagination')

#Get the href of the second link
next_page = next_pages.findAll('a',href=True)[1]['href']
next_page

'/movie/budgets/all/101'
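
The extracted link is relative to the site root. As a minimal sketch, the standard library can turn it into a full URL and read the page offset from its last path segment; the helper names `absolute_url` and `page_offset` are assumptions for illustration.

```python
from urllib.parse import urljoin

BUDGET_PAGE_URL = "https://www.the-numbers.com/movie/budgets/all"

def absolute_url(page_url, href):
    """Resolve a relative href against the page it was found on."""
    return urljoin(page_url, href)

def page_offset(href):
    """Read the numeric offset from the last path segment (1 if absent)."""
    last = href.rstrip('/').split('/')[-1]
    return int(last) if last.isdigit() else 1

print(absolute_url(BUDGET_PAGE_URL, '/movie/budgets/all/101'))
print(page_offset('/movie/budgets/all/101'))
```

The landing page carries no offset in its URL, so the sketch treats it as offset 1, matching the 100-row step between `/101` and `/201`.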

Now that all the pieces are in place, let's combine them into reusable functions.

In [9]:
#Track the URLs

#Return the suffix of the next page ('/101', '/201' and so on), or the
#current suffix unchanged when there is no further page
def next_budget_page(soup,current_url):
    if current_url:
        current_page_index = current_url.split('/')[-1]
    else:
        current_page_index = '1'
    next_page_index = '/{}'.format(int(current_page_index)+100)
    next_pages = soup.find('div',class_='pagination')
    #Without a pagination block there is nothing further to crawl
    if next_pages is None:
        return current_url
    for link in next_pages.findAll('a',href=True):
        url = '/'+link['href'].split('/')[-1]
        #Check if the next page index exists
        if url==next_page_index:
            return url
    return current_url

#Get data from the parsed page and return a dataframe
def get_data(soup):
    table = soup.find('table')
    content = table.findAll('tr')
    columns = ['id']
    #Skip the first, empty header cell; 'id' takes its place
    for col in content[0].findAll('th')[1:]:
        name = col.get_text(strip=True)
        #Split the string at upper cases
        split_name = re.findall('[A-Z][^A-Z]*',name)
        columns.append('_'.join(split_name).lower())
    data = []
    for movie in content[1:]:
        movie_info = []
        for info in movie.findAll('td'):
            movie_info.append(info.get_text(strip=True))
        data.append(movie_info)
    df = pd.DataFrame(data,columns=columns)
    return df

#Download and parse the page at base_url+current_url
def crawl_page(base_url,current_url):
    html_page = requests.get(base_url+current_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    return soup

#An all in one function
def all_in_one(base_url):
    current_url = ''
    #Fetch each page once and reuse the soup for both data and pagination
    soup = crawl_page(base_url,current_url)
    df = get_data(soup)
    next_url = next_budget_page(soup,current_url)
    #Stop when the next page suffix no longer changes
    while (current_url!=next_url):
        soup = crawl_page(base_url,next_url)
        df = pd.concat([df,get_data(soup)],ignore_index=True)
        current_url = next_url
        next_url = next_budget_page(soup,current_url)
    return df

df = all_in_one(BUDGET_PAGE_URL)
df

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
2,3,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
3,4,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,064,615,817"
4,5,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,044,540,523"
...,...,...,...,...,...,...
6178,6179,Unknown,Red 11,"$7,000",$0,$0
6179,6180,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
6180,6181,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
6181,6182,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0
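
The loop in `all_in_one` stops once the next page suffix no longer changes. The sketch below replays that termination logic offline; `FAKE_LINKS` is a hypothetical stand-in for the site's pagination links, and `next_suffix` and `crawl_order` are illustrative names, not functions from the notebook.

```python
# Hypothetical pagination: each page suffix maps to the suffixes that its
# pagination block links to. The real site has many more pages.
FAKE_LINKS = {
    '': ['/101', '/201'],
    '/101': ['/1', '/201'],
    '/201': ['/1', '/101'],
}

def next_suffix(links, current):
    """Return the suffix 100 rows ahead if it is linked, else `current`."""
    index = int(current.lstrip('/')) if current else 1
    wanted = '/{}'.format(index + 100)
    return wanted if wanted in links else current

def crawl_order(fake_links):
    """List the suffixes in the order the crawl loop would visit them."""
    current = ''
    visited = [current]
    nxt = next_suffix(fake_links[current], current)
    while nxt != current:
        visited.append(nxt)
        current = nxt
        nxt = next_suffix(fake_links[current], current)
    return visited

print(crawl_order(FAKE_LINKS))  # ['', '/101', '/201']
```

On the last page no link matches `/301`, so the suffix stays at `/201` and the loop exits.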


Let's convert the `id` column to integers and write the dataframe to a compressed CSV file for later use. But first, let's create the `zippedData` folder.

In [10]:
#Create zippedData folder if it doesn't exist
! mkdir zippedData

A subdirectory or file zippedData already exists.
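
The shell command above reports a message when the folder already exists, and shell syntax can differ between platforms. A portable alternative can be sketched with the standard library; `ensure_folder` is a hypothetical helper name, and the demonstration runs in a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path

def ensure_folder(path):
    """Create the folder (and any parents) if missing; do nothing otherwise."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder

# Demonstrated inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'zippedData'
    ensure_folder(target)
    ensure_folder(target)  # the second call is a no-op and raises no error
    print(target.is_dir())
```

In the notebook itself, the equivalent call would be `ensure_folder('zippedData')`.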


In [11]:
#Strip thousands separators from the 'id' column and convert it to integers
df['id'] = df['id'].map(lambda x: x.replace(',','')).astype(int)
#Write the dataframe, without its index, to a gzip-compressed CSV
df.to_csv('zippedData/tn_movie_budgets_updated.csv.gz',index=False, compression='gzip')
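
The budget and gross columns are also scraped as strings such as `$400,000,000`. If numeric values are needed later, the same comma-stripping idea can be extended to them. This is a minimal sketch, assuming every value follows the `$1,234` pattern seen in the output above; `parse_dollars` is a hypothetical helper and is not part of the notebook.

```python
def parse_dollars(text):
    """Convert a string such as '$400,000,000' to the integer 400000000."""
    # Assumes a leading '$' and ',' as the thousands separator
    return int(text.replace('$', '').replace(',', '').strip())

print(parse_dollars('$400,000,000'))
print(parse_dollars('$0'))
```

Applied to a dataframe, this could look like `df['production_budget'].map(parse_dollars)`; whether to convert before or after saving is a design choice left open here.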

### Summary
In this notebook, we saw how to scrape data from [The-Numbers](https://www.the-numbers.com/) and save the dataset to a folder.

## IMDb Dataset Downloader

### Introduction 

In this notebook, we will download the IMDb datasets that are available for [download](https://datasets.imdbws.com/). To work with the latest data, these downloads are used instead of the dataset provided in the project repository. The webpage should look like the screen capture below.

![Landing page](../images/imdb_dataset.PNG)

In [2]:
#URL where updated IMDb dataset is located and load website with BeautifulSoup
imdb_dataset_url = 'https://datasets.imdbws.com/'
html_page = requests.get(imdb_dataset_url)
soup = BeautifulSoup(html_page.content, 'html.parser')
dataset_link = soup.find_all('a')

The datasets needed for this project are distributed as `.tsv.gz` files. Since the scraped data covers movies only, the TV episode data is excluded from the download.

In [3]:
#Iterate through the links and download each dataset, skipping the
#'episode' file, which covers TV series
#Files are streamed to disk in 1 KB chunks
for link in dataset_link:
    #Anchors without an href are treated as empty and skipped
    url = link.get('href', '')
    if '.gz' not in url or 'episode' in url:
        continue
    path = 'zippedData/'+url.split('/')[-1]
    r = requests.get(url, stream=True)
    #Stop on an HTTP error instead of saving an error page to disk
    r.raise_for_status()
    with open(path,'wb') as gz_file:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                gz_file.write(chunk)
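
The link filter can be tested without downloading anything. The sketch below applies a slightly stricter variant of the rule (the file name must end in `.gz`) to a short list of sample links; `select_downloads` is a hypothetical helper, and the sample list is illustrative rather than a scrape of the live page.

```python
def select_downloads(urls, skip='episode'):
    """Pair each wanted gzip URL with its local path under zippedData/."""
    selected = []
    for url in urls:
        filename = url.split('/')[-1]
        # Keep gzip files only, and drop any whose name contains `skip`
        if filename.endswith('.gz') and skip not in filename:
            selected.append((url, 'zippedData/' + filename))
    return selected

# Illustrative sample of links, including one non-dataset link
sample = [
    'https://datasets.imdbws.com/title.basics.tsv.gz',
    'https://datasets.imdbws.com/title.episode.tsv.gz',
    'https://datasets.imdbws.com/title.ratings.tsv.gz',
    'https://www.imdb.com/interfaces/',
]

for url, path in select_downloads(sample):
    print(url, '->', path)
```

Separating the selection from the download makes it easy to review which files will be fetched before any request is made.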

### Summary
Unlike [The-Numbers](https://www.the-numbers.com/), IMDb provides its datasets for public consumption, which makes things surprisingly easy.