# Creating a Webscraping Tool

The ability to access messy data from a variety of sources is a hallmark of a good data scientist. Below I build a generic webscraper. While a webscraper should be tailored to the site and data you are trying to scrape, I will use this one as a template that I can refer back to and tweak as needed. The goal is a scraper that grabs the data I need and converts it to a useful `DataFrame` using `pandas`. I am following a tutorial from Angelica Dietzel, [The Only Step-by-Step Guide You’ll Need to Build a Web Scraper With Python](https://medium.com/better-programming/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a).

## Notes
* Webscraping is not illegal per se, but make sure to check the site's terms and conditions. Look at the site's `robots.txt` file (for example, `www.example.com/robots.txt`) to see which paths it asks crawlers to avoid.

* Have a sense of the data you are trying to scrape prior to writing a scraper.

*This will be a generic script but should be sufficiently flexible to be applied elsewhere.*

* We want to inspect the HTML code of the site to determine which elements we would like to extract.
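
The `robots.txt` check mentioned in the notes above can also be done from Python with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up `robots.txt` offline; the rules and the `www.example.com` URLs are assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
# For a live site you would instead call parser.set_url(".../robots.txt")
# followed by parser.read().
sample_robots = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(sample_robots)

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL
allowed = parser.can_fetch("*", "https://www.example.com/search/title/")
blocked = parser.can_fetch("*", "https://www.example.com/private/data")
print(allowed, blocked)
```

A `robots.txt` file only states the site's crawling preferences; the terms and conditions still need to be read separately.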

## Example site to be scraped

We will be scraping from the IMDb "Top 1,000 Movies" list, targeting the top 50 movies. I am using this as an example and will need to customize the scraper to accommodate other target data sources. The information we are interested in:

* Title
* Year
* Movie length
* IMDb rating
* Movie Metascore
* # of votes
* U.S. Gross Earnings

## The URL of the Top 1,000

[Top 1,000](https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv)


## Examine the HTML

The first step once we are on the site we want to scrape is to examine the HTML. This can be achieved in Chrome by right-clicking on the page and hitting "Inspect".

## Tools and Packages Used

*We will not be using all of the tools suggested in the tutorial. Specifically, we will not be using `Repl`, as we are using `Jupyter Notebooks` as our IDE.*

* `Repl` - Simple web-based IDE (NOT USING)
* `Requests` - allows us to send HTTP requests to get HTML
* `Beautiful Soup` - allows the parsing of HTML files
* `pandas` - assemble data into a `DataFrame` for cleaning and manipulation
* `NumPy` - support for mathematical functions and working with arrays


## Install and Load Tools

In [2]:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Request contents from the URL

**Deconstructing URL Requests**

`url` is the variable we create and assign the URL to

`results` is the variable we create to store the response returned by `requests.get`

`requests.get(url, headers=headers)` is the call we use to grab the contents of the URL. The `headers` argument, defined on the line just before the call, asks the server to send English-language content.

In [3]:
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
headers = {"Accept-Language": "en-US, en;q=0.5"} # This is common for accepting English language results.
results = requests.get(url, headers=headers)
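
As a small sketch of what the request above sends, `requests` can build a request without sending it, which makes the URL and headers easy to inspect. The `fetch_html` helper and its `timeout` default are hypothetical additions, not part of the tutorial; it is defined here but not called, so nothing goes over the network.

```python
import requests

url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
headers = {"Accept-Language": "en-US, en;q=0.5"}

# Build (but do not send) the request to inspect what would go over the wire
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["Accept-Language"])


def fetch_html(url, headers, timeout=10):
    """Hypothetical helper: fetch a page and fail loudly on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Checking the status before parsing avoids silently feeding an error page to the parser.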

## Use Beautiful Soup to get data

`soup` is the variable that holds the `BeautifulSoup` object, created from the response text using Python's built-in HTML parser (`html.parser`). This lets Python navigate the components of the page rather than treating it as one long string

`print(soup.prettify())` will print what we’ve grabbed in a more structured tree format, making it easier to read

In [4]:
soup = BeautifulSoup(results.text, "html.parser")
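
To make the difference between a raw string and a parsed tree concrete, here is a minimal sketch on a made-up HTML snippet. The snippet and the `sample_` names are assumptions for illustration; the real page is far larger.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for results.text
sample_html = (
    "<html><body>"
    "<h3><a href='/title/x'>Example Movie</a></h3>"
    "<p>A <b>short</b> description.</p>"
    "</body></html>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Once parsed, elements can be reached by tag name instead of string slicing
print(sample_soup.h3.a.text)   # text inside the first <a> within <h3>
print(sample_soup.a["href"])   # attribute lookup on a tag
print(sample_soup.prettify())  # indented, tree-like view of the document
```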

## Store scraped data as variables

When we write code to extract our data, we need somewhere to store that data. Create a variable for each type of data you’ll extract, and assign an empty list to it, indicated by square brackets `[]`. Recall the list of information we wanted to grab for each movie:

In [5]:
#initiate data storage
titles = []
years = []
time = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []

movie_div = soup.find_all('div', class_='lister-item mode-advanced')


## Putting the HTML in the appropriate `div` container

Since we are interested in the top 50 movies, we can use the "Inspect" Chrome feature to explore the different `div` tags. A little exploration reveals that `lister-item mode-advanced` is the class of the `div` tags we are interested in. We want to tell our scraper to grab all 50 of those tags on the page.

**Breaking `find_all` down:**

`movie_div` is the variable we’ll use to store all of the `div` containers with a class of `lister-item mode-advanced`

The `find_all()` method extracts all the `div` containers that have a class attribute of `lister-item mode-advanced` from what we have stored in our variable `soup`.
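
A minimal sketch of how `find_all()` selects containers, run on a made-up snippet (the HTML and the `sample_` names are assumptions for illustration). Passing the full string `lister-item mode-advanced` to `class_` matches that exact class attribute value, while passing a single class name matches any tag that carries it.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: two movie containers and one unrelated div
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
<div class="sidebar"><h3><a>Not a movie</a></h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the whole class attribute string
sample_divs = sample_soup.find_all("div", class_="lister-item mode-advanced")
sample_names = [div.h3.a.text for div in sample_divs]
print(len(sample_divs), sample_names)

# Matching on one of the classes also finds the same two containers here
by_single_class = sample_soup.find_all("div", class_="lister-item")
print(len(by_single_class))
```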

**Getting into each `lister-item mode-advanced` div**

When we grab each of the items we need in a single `lister-item mode-advanced` `div` container, we need the scraper to loop to the next `lister-item mode-advanced` `div` container and grab those movie items too. Then it needs to loop to the next one and so on, 50 times for each page. For this to execute, we’ll need to wrap our scraper in a `for` loop.

**Breaking down the `for` loop:**
A `for` loop is used for iterating over a sequence. Our sequence is every `lister-item mode-advanced` `div` container that we stored in `movie_div`. `container` is the name of the variable that enters each `div`. You can name this whatever you want (`x`, `loop`, `banana`, `cheese`), and it won't change the function of the loop.
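
The loop pattern can be tried on a tiny made-up fragment before running it on the real page. In this sketch the HTML and the `sample_` names are assumptions; the second container deliberately lacks a metascore to show why a fallback value is useful when an element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second movie has no metascore span
sample_html = """
<div class="lister-item mode-advanced">
  <h3><a>Movie A</a></h3><span class="metascore">96</span>
</div>
<div class="lister-item mode-advanced">
  <h3><a>Movie B</a></h3>
</div>
"""
sample_divs = BeautifulSoup(sample_html, "html.parser").find_all(
    "div", class_="lister-item mode-advanced"
)

sample_titles = []
sample_metascores = []

for container in sample_divs:
    sample_titles.append(container.h3.a.text)

    # find() returns None when nothing matches, so test the tag before using .text
    score_tag = container.find("span", class_="metascore")
    sample_metascores.append(score_tag.text.strip() if score_tag is not None else "-")

print(sample_titles, sample_metascores)
```

Appending a placeholder on every pass keeps all lists the same length, which matters later when they become columns of one `DataFrame`.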

### Note:

In an effort to keep this shorter, I won't go into great detail about what the `for` loop is doing in each `div`. Refer to the original tutorial, [The Only Step-by-Step Guide You’ll Need to Build a Web Scraper With Python](https://medium.com/better-programming/the-only-step-by-step-guide-youll-need-to-build-a-web-scraper-with-python-e79066bd895a), for a more detailed explanation.

In [6]:
#our loop through each container
for container in movie_div:

        #name
        name = container.h3.a.text
        titles.append(name)
        
        #year
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)

        # runtime
        runtime = container.p.find('span', class_='runtime').text if container.p.find('span', class_='runtime') else '-'
        time.append(runtime)

        #IMDb rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)

        #metascore
        m_score = container.find('span', class_='metascore').text if container.find('span', class_='metascore') else '-'
        metascores.append(m_score)

        #there are two NV containers, grab both of them as they hold both the votes and the grosses
        nv = container.find_all('span', attrs={'name': 'nv'})
        
        #filter nv for votes
        vote = nv[0].text
        votes.append(vote)
        
        #filter nv for gross
        grosses = nv[1].text if len(nv) > 1 else '-'
        us_gross.append(grosses)


This stores each field in its own plain Python list, which is not very helpful for data analysis. Next we will want to organize the data into a `DataFrame` and clean it.

## Creating a `DataFrame` using `pandas`

`movies` is what we’ll name our `DataFrame`.

`pd.DataFrame` is how we initialize the creation of a `DataFrame` with `pandas`.


In [7]:
movies = pd.DataFrame({
'movie': titles,
'year': years,
'timeMin': time,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes,
'us_grossMillions': us_gross,
})

# Print DataFrame

print(movies)

                                                movie        year  timeMin  \
0                                       The Gentlemen      (2019)  113 min   
1                    Once Upon a Time... in Hollywood      (2019)  161 min   
2                                            Parasite      (2019)  132 min   
3                                          Knives Out      (2019)  131 min   
4                                   Avengers: Endgame      (2019)  181 min   
5                                                1917      (2019)  119 min   
6                                               Joker      (2019)  122 min   
7                                        Little Women      (2019)  135 min   
8                                         The Goonies      (1985)  114 min   
9                            The Shawshank Redemption      (1994)  142 min   
10                                      The Godfather      (1972)  175 min   
11                                   Django Unchained      (2012

## Data Cleaning

A review of the `DataFrame` from the previous call reveals that some of the columns are not stored as appropriate data types (e.g. the year, runtime, and votes are strings when they should be integers or floats).
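
One way to see the problem and a possible fix is to try the conversion on a few made-up rows first. This sketch uses hypothetical values and swaps in `pd.to_numeric` throughout (rather than `astype(int)`) so that a `'-'` placeholder becomes a missing value instead of raising an error.

```python
import pandas as pd

# Hypothetical rows mimicking the scraped strings
sample = pd.DataFrame({
    "year": ["(2019)", "(1994)", "(1985)"],
    "metascore": ["96        ", "-", "62        "],
    "votes": ["1,234", "56,789", "101"],
})
print(sample.dtypes)  # every column starts out as a string column

# Pull the digits out of "(2019)" and convert them to numbers
sample["year"] = pd.to_numeric(sample["year"].str.extract(r"(\d+)", expand=False))

# errors="coerce" turns unparseable values such as "-" into NaN
sample["metascore"] = pd.to_numeric(sample["metascore"].str.strip(), errors="coerce")

# Drop thousands separators before converting
sample["votes"] = pd.to_numeric(sample["votes"].str.replace(",", "", regex=False))

print(sample.dtypes)
```

Because `NaN` is a float, a column with missing values ends up as float rather than integer.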


In [8]:
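#cleaning data
# raw strings (r'...') keep Python from treating \d as a string escape
movies['year'] = movies['year'].str.extract(r'(\d+)').astype(int)
movies['timeMin'] = movies['timeMin'].str.extract(r'(\d+)').astype(int)
# missing metascores were stored as '-', which int() cannot parse; coerce them to NaN
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')
movies['votes'] = movies['votes'].str.replace(',', '').astype(int)
# strip the leading '$' and trailing 'M'; missing grosses ('-') become NaN
movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')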

# View cleaned data
print(movies)

                                                movie  year  timeMin  imdb  \
0                                       The Gentlemen  2019      113   7.9   
1                    Once Upon a Time... in Hollywood  2019      161   7.7   
2                                            Parasite  2019      132   8.6   
3                                          Knives Out  2019      131   7.9   
4                                   Avengers: Endgame  2019      181   8.4   
5                                                1917  2019      119   8.3   
6                                               Joker  2019      122   8.5   
7                                        Little Women  2019      135   7.9   
8                                         The Goonies  1985      114   7.8   
9                            The Shawshank Redemption  1994      142   9.3   
10                                      The Godfather  1972      175   9.2   
11                                   Django Unchained  2012     

## Save out `DataFrame` to a .csv

We will want to save out the cleaned `DataFrame` to a .csv format for future manipulation, visualization, and modeling if required.

In [10]:
#add dataframe to csv file named 'movies_top50.csv'
movies.to_csv('movies_top50.csv')
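
By default `to_csv` also writes the row index as an unnamed first column. The sketch below uses a small made-up frame and an in-memory buffer (instead of a file on disk) to show the difference and to read the data back; the `sample`, `buffer`, and `restored` names are assumptions for illustration.

```python
import io

import pandas as pd

sample = pd.DataFrame({"movie": ["Movie A", "Movie B"], "year": [2019, 1985]})

# With no path given, to_csv returns the CSV text as a string
header_with_index = sample.to_csv().splitlines()[0]
header_without_index = sample.to_csv(index=False).splitlines()[0]
print(header_with_index)
print(header_without_index)

# Round trip through an in-memory buffer
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` is optional; it simply avoids an extra unnamed column when the file is read back later.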