## Data Collection

In this assignment, ...

We'll start by importing some libraries.

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


## Before You Start
You will see some lines of code that use Python's `assert` statement. **DO NOT** change or delete the assert statements. An `assert` checks that a condition is true and raises an error if it is not, so these statements help you confirm that your code is working correctly.
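
As a quick illustration of how `assert` behaves, here is a minimal sketch. The variable `seats` and the message text are made up for this example and are not part of the assignment.

```python
# A passing assert does nothing; execution simply continues.
seats = 44500
assert seats > 0

# A failing assert raises AssertionError. The optional message after the
# comma is attached to the error, which helps explain what went wrong.
try:
    assert seats > 100000, "seats should be over 100,000"
except AssertionError as error:
    message = str(error)
    print(f"Assertion failed: {message}")
```

In the notebook the asserts are not wrapped in `try`, so a failing check stops the cell and shows the error.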

## Part 1: Web Scraping
Boston College plays its home football games in Alumni Stadium. Many BC fans think this is a big stadium, as it can seat 44,500 people. That's a lot of people, but it's nowhere near the largest in the world. You may know that Michigan Stadium, also known as "The Big House", is the largest stadium in the United States with a capacity of 107,601. Big, indeed.  

Like so many things, Wikipedia has a [page](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity) dedicated to the largest stadiums in the world. You'll see some interesting things on that page. In fact, the data may be counter to your initial expectations.

1. The largest stadium in the world is Narendra Modi Stadium, a cricket ground in India. Cricket is enormously popular in countries that don't start with "USA," so this might not be a surprise.

2. Number 2 is Rungrado 1st of May Stadium in North Korea. The North Korean national football (American translation: soccer) team plays in this stadium. Football makes sense because of its even more enormous popularity around the globe. North Korea...that one you may not have guessed.

3. Numbers 3-10 are all in the United States, and their tenants are...college football teams. Not NFL teams--*college* football teams. These stadiums are so big that they may have a larger capacity than the population of the college town they sit in.

And that's a question we might want to explore. What is the ratio of a stadium's capacity to the population of its city? To study that, we need to turn this Wikipedia page into data we can analyze.


### Scraping
Let's start by grabbing the contents of the Wikipedia page. 
  
1. Start by making a new variable called `wikipedia_URL`. Assign the following URL to that variable:     
[https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity). Make sure the URL is a string (in quotes).   

2. Use the Python function `requests.get()` to get the contents of the page. Use `wikipedia_URL` as the function's input value. Assign the result to a variable called `wikipedia_page`.
3. Make a variable named `soup_page`. Remember that the web scraping tool is called `BeautifulSoup`. You'll need to call `BeautifulSoup` with some arguments to pull the content from the page.

In [None]:
## YOUR CODE HERE
wikipedia_URL = "https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity"
wikipedia_page = requests.get(wikipedia_URL)
soup_page = BeautifulSoup(wikipedia_page.content, 'html.parser')

In [None]:
## DON'T CHANGE THIS CELL...Doing some simple asserts to make sure your code is working.
assert wikipedia_URL
assert wikipedia_page
assert soup_page


### Do we Have the Right Content?

Get the title from `soup_page` and store it in a variable named `title_page`. Unfortunately, BeautifulSoup returns the title as a `Tag` object rather than a string. You'll want to convert the value to a string before storing it in `title_page`.
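
To see the difference between a `Tag` and a string, here is a small sketch that parses a hand-written HTML snippet instead of the real page. The names `sample_html` and `sample_soup` are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page (hypothetical; not the real Wikipedia HTML)
sample_html = "<html><head><title>Sample page</title></head><body></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title     # a bs4 Tag object
title_as_str = str(title_tag)     # the whole tag as a string, markup included
title_text = title_tag.text       # only the text inside the tag

print(type(title_tag).__name__)
print(title_as_str)
print(title_text)
```

Note that `str()` keeps the surrounding `<title>` markup, which is the form the assert in this section compares against.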

In [None]:
## YOUR CODE HERE
title_page = str(soup_page.title)
print(title_page)

In [None]:
assert title_page == "<title>List of stadiums by capacity - Wikipedia</title>"
assert type(title_page) == str

### Finding the First Table
Take a look at the [web page](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity). You'll see **multiple** tables with stadium information. This would be a lot easier if all the data were in one table, but we don't have that luxury. Let's focus on finding the first table, the one listing stadiums with a capacity of 100,000 or more.  

You should call the method `find` on the `soup_page` variable. Store the result in a variable named `over_100000_table`. 

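What is it we want to find? It's a `table` element whose HTML `class` is `sortable wikitable`. Pass the tag name as the first argument to `find` and the class through the `class_` keyword argument (with a trailing underscore, because `class` is a reserved word in Python).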
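
Here is a minimal sketch of `find` with `class_`, run on a hand-written snippet rather than the Wikipedia page. The HTML and the names `sample_soup`, `first_table`, and `missing` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Three small tables; only the last two carry the class we are after
sample_html = """
<table class="infobox"><tr><td>not a data table</td></tr></table>
<table class="sortable wikitable"><tr><td>first data table</td></tr></table>
<table class="sortable wikitable"><tr><td>second data table</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns only the FIRST element that matches
first_table = sample_soup.find("table", class_="sortable wikitable")
print(first_table.td.text)

# find returns None when nothing matches
missing = sample_soup.find("table", class_="no-such-class")
print(missing)
```

Because `find` returns `None` on a miss, an `assert` on the result is a cheap way to catch a mistyped class name.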




In [None]:
over_100000_table = soup_page.find('table', class_='sortable wikitable')

In [None]:
# Some asserts to make sure the code is working as expected.
assert over_100000_table
assert over_100000_table.name == 'table'

In [None]:
# Print the table to see what we have
print(over_100000_table)

When you print `over_100000_table`, you see HTML code defining a table. You can see the headers, the title of each column (e.g., "Country") and the contents of each row. Those data values in the rows are what we want to collect into a pandas dataframe.  

But wait. We only got the *first* table, the one containing stadiums with capacities >= 100,000. What if there's a way to grab **all** of the tables? We used BeautifulSoup's `find` method to get one table. Maybe there's a `findAll` method for this? There is, and its modern spelling is `find_all` (BeautifulSoup accepts both names). Let's give it a try.  

Make a variable named `all_tables` and use BeautifulSoup and `find_all()` to see what happens. Hint: The arguments will be the same as you used for `find()`.



In [None]:
all_tables = soup_page.find_all('table', class_='sortable wikitable')

In [None]:
assert all_tables

# There are seven tables on the Wikipedia page, check to see we have 7
assert len(all_tables) == 7

`find_all()` returns a list of all the HTML elements that match what you specify. In this case, there are seven tables on the Wikipedia page, so `all_tables` should hold a list of seven table elements.
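
The list-like result can be sketched on a small hand-written snippet. The HTML and the names `sample_tables` and `rows_per_table` are assumptions for this example only.

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="sortable wikitable"><tr><td>A</td></tr><tr><td>B</td></tr></table>
<table class="sortable wikitable"><tr><td>C</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_tables = sample_soup.find_all("table", class_="sortable wikitable")
print(len(sample_tables))

# The result behaves like a list: you can index it or loop over it
rows_per_table = [len(table.find_all("tr")) for table in sample_tables]
print(rows_per_table)

# The first item is the same element that find() would have returned
same_first = sample_tables[0] == sample_soup.find("table", class_="sortable wikitable")
print(same_first)
```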

It's great to have the tables. Now we have to pull the data out of them. Let's try doing this slowly with a single table, the one stored in `over_100000_table`. Here's a simple way to understand what we need to do.

```
For every row in the table:
    Let row_data = All of the data entries ('td') in the row
    Use array indexing to store items in row_data into lists
```
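
One wrinkle in that loop is the header row: it holds `<th>` cells rather than `<td>` cells, so its list of data entries is empty and indexing it raises an `IndexError`. The sketch below shows this on a made-up two-column table and skips empty rows with a length check, which is one alternative to catching the error with `try`/`except`.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row and two data rows
sample_html = """
<table>
  <tr><th>Stadium</th><th>Capacity</th></tr>
  <tr><td>Stadium A</td><td>100,000</td></tr>
  <tr><td>Stadium B</td><td>90,000</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

cells_per_row = []
first_cells = []
for row in sample_table.find_all("tr"):
    row_data = row.find_all("td")
    cells_per_row.append(len(row_data))

    # The header row has no <td> cells, so row_data is an empty list
    # and row_data[0] would raise IndexError. Skip such rows.
    if not row_data:
        continue
    first_cells.append(row_data[0].text)

print(cells_per_row)
print(first_cells)
```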

This is a little tricky, so the code is provided. First, a simple example where we just pick up the stadium names.

In [None]:
# make an empty list called stadium_names
stadium_names = []

# for every row (<tr>) in the HTML table
for row in over_100000_table.findAll('tr'):
    
    # find all table data (<td>) in the row
    row_data = row.findAll('td')

    # The try statement will run code unless there's an error
    # If there is, code execution jumps to the except statement
    try:
        stadium_name = row_data[0]
        stadium_names.append(stadium_name)
    except IndexError:
        continue

# return stadium_names to see what we collected
stadium_names

Look at what we have in `stadium_names`: It's a list containing all of the data items in the first column of the table...perfect! Except it's not really what we want. We *really* want the stadium names as text. Instead, we have HTML code that contains the name of the stadium and a URL pointing to the stadium's Wikipedia page. 

Luckily, BeautifulSoup provides a way to extract just the `text` from an HTML anchor (i.e., \<a\>). You can use `find()` to look for the \<a\> reference and then extract its text. Something like this:

```
data.find('a').text
```

But we can also just ask for the text directly like this:

```
data.text
```
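
The two approaches can give slightly different results when a cell holds more than a link. Here is a sketch on a single made-up cell with a footnote marker; `cell_html` and the stadium name are invented for illustration.

```python
from bs4 import BeautifulSoup

# A hypothetical cell: a link, a footnote marker, and a trailing newline
cell_html = '<td><a href="/wiki/Example_Stadium">Example Stadium</a><sup>[1]</sup>\n</td>'
cell = BeautifulSoup(cell_html, "html.parser").find("td")

link_text = cell.find("a").text          # text of the link only
cell_text = cell.text                    # all text in the cell, newline included
clean_text = cell.get_text(strip=True)   # all text, with whitespace trimmed

print(repr(link_text))
print(repr(cell_text))
print(repr(clean_text))
```

Asking for `.text` on the whole cell is simpler and works even when a cell has no link, at the cost of picking up footnote markers and stray whitespace.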

Below is the same code we used above. See if you can add the `text` call where we append the first entry of `row_data` to `stadium_names`.

In [None]:
# make an empty list called stadium_names
stadium_names = []

# for every row (<tr>) in the HTML table
for row in over_100000_table.findAll('tr'):
    
    # find all table data (<td>) in the row
    row_data = row.findAll('td')

    # The try statement will run code unless there's an error
    # If there is, code execution jumps to the except statement
    try:
        ### HERE IS WHERE YOU WANT TO ADD THE CALL TO .text TO PULL OUT THE 
        ### DATA CELL'S TEXT
        stadium_name = row_data[0].text
        stadium_names.append(stadium_name)
    except IndexError:
        continue

# return stadium_names to see what we collected
stadium_names

That's much better. We now have a list of stadium names that we can add to a pandas dataframe. Here's how we'd do that. We'll create a variable named `stadium_df` and add the names to it. We'll also add a column name, `stadium_name`.

In [None]:
# make a list containing the column names for the dataframe
column_names = ['stadium_name']

# make a dataframe with the column name and stadium names
stadium_df = pd.DataFrame(columns=column_names, data=stadium_names)

stadium_df.head()

So there's a dataframe with stadium names. You might imagine that we can continue grabbing other data columns like the capacity, city, country, etc. These are all accessible by indexing the list named `row_data` in our code. Here's an example where we'll grab the `Region` column from the data.

In [None]:
# make an empty list called stadium_names
stadium_names = []

# also make an empty list called region_names
region_names = []

# for every row (<tr>) in the HTML table
for row in over_100000_table.findAll('tr'):
    
    # find all table data (<td>) in the row
    row_data = row.findAll('td')

    # The try statement will run code unless there's an error
    # If there is, code execution jumps to the except statement
    try:
        # Pull out the data cell's text with .text
        stadium_name = row_data[0].text
        stadium_names.append(stadium_name)

        # Let's grab the region data. It's in the 5th column. But we use zero 
        # indexing, so its index is 4
        region_name = row_data[4].text
        region_names.append(region_name)

    except IndexError:
        continue

# return region_names to see what we collected
region_names

Now we have the names of the 11 largest stadiums and their regions. We can combine these into a dataframe. We did this earlier for the stadium names. This time, we'll rebuild `stadium_df` with two columns, `stadium_names` and `region_names`.

Here's the trick. The `DataFrame` constructor function essentially wants you to pass data in row by row. We have two lists now that contain column data (`stadium_names` and `region_names`). What we really want is a list of lists that looks something like this:

```
    [['Narendra Modi Stadium[1]', 'South Asia'],
     ['Rungrado 1st of May Stadium', 'East Asia'],
     ['Michigan Stadium', 'North America'],
    ...
    ]

```

Notice that each one of those lists is a partial row from the original Wikipedia table. It's easy to make these  by hand if the lists are small. But it'd be time-consuming to do this for any sizeable amount of data.


### Python's zip function
Enter `zip`. `zip` takes multiple lists and returns an object containing new tuples that combine elements from each. The "object" part is a little confusing, so let's take a look at an example.

In [None]:
# Here's a list of letters
letters = ['a', 'b', 'c']

# And here's a list of numbers
numbers = [1, 2, 3]

# zip will return a zip object. Not quite what we want...yet.
zip_object = zip(letters, numbers)
print(zip_object)

Printing a zip object only shows its type and memory address, because `zip` produces its tuples lazily. We can see the contents by converting it to a list:

In [None]:
zipped_list = list(zip_object)
print(zipped_list)

That's what we want to see...a list of tuples that contain the first elements of the input lists, the second elements, etc.   
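
Two behaviors of `zip` are worth knowing before you use it on scraped data. The sketch below uses made-up lists (`short_list`, `long_list`) to show that `zip` stops at the shortest input and that a zip object can be consumed only once.

```python
short_list = ['a', 'b', 'c']
long_list = [1, 2, 3, 4]   # one item longer than short_list

# zip stops at the shortest input, so the 4 is silently dropped
pairs = list(zip(short_list, long_list))
print(pairs)

# A zip object is used up after one pass
demo_zip = zip(short_list, long_list)
first_pass = list(demo_zip)
second_pass = list(demo_zip)   # nothing left to read
print(first_pass)
print(second_pass)
```

The first point matters for scraping: if two column lists end up with different lengths, `zip` will not complain, so it is worth comparing their lengths before combining them.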

Your turn. Make three lists containing:

1) The names of three people you know. Store these in a variable named `people`.
2) Your three favorite animals in a variable named `animals`.
3) Three cities you've visited or would like to visit in a variable named `cities`.

Then use `zip` to make a new zip object. Calling `list` with the zip object should reveal a list of tuples with this structure:

```
    [(1st element of people, 1st element of animals, 1st element of cities),
     (2nd element of people, 2nd element of animals, 2nd element of cities),
     (3rd element of people, 3rd element of animals, 3rd element of cities)
    ]
```

In [None]:
#### YOUR CODE HERE

# three people you know
people = ['Keith', 'Greg', 'Carl']

# three animals
animals = ['Cat', 'Elephant', 'Lemur']

# three cities
cities = ['Beijing', 'Rome', 'Leeds']

# call zip below to combine the three lists you just made. 
zip_object = zip(people, animals, cities)

# Use that zip object as the input to 'list' to reveal the 
# zipped content
print(list(zip_object))


In [None]:
## Check code with asserts
assert people, "You need to add people to the empty list"
assert animals, "You need to add animals to the empty list"
assert cities, "You need to add cities to the empty list"
assert isinstance(list(zip_object), list)

### Using zip to create a pandas dataframe
Now we can try this with real data. Fill in the code below so that the `DataFrame` function receives a list of data rows built with `zip`.
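
As a warm-up on toy data, here is a sketch of the zip-to-rows pattern next to a common alternative, a dictionary that maps each column name to its list. The sample lists are invented, and the dictionary form is shown only for comparison; it is not what the assignment asks for.

```python
import pandas as pd

# Made-up column data standing in for the scraped lists
sample_names = ['Stadium A', 'Stadium B', 'Stadium C']
sample_regions = ['Region X', 'Region Y', 'Region Z']

# Row-by-row: zip pairs up the columns into (name, region) tuples
rows = list(zip(sample_names, sample_regions))
df_from_rows = pd.DataFrame(data=rows, columns=['stadium_names', 'region_names'])

# Column-by-column: a dict of column name -> list of values
df_from_columns = pd.DataFrame({'stadium_names': sample_names,
                                'region_names': sample_regions})

print(df_from_rows)
print(df_from_rows.equals(df_from_columns))
```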

In [None]:
# make a list containing the column names for the dataframe
column_names = ['stadium_names', 'region_names']

# make a dataframe with the column name and stadium names
# The data should be a zipped list
stadium_df = pd.DataFrame(columns=column_names, data=list(zip(stadium_names, region_names)))

# We'll check some things here with assert statements
assert stadium_df.shape == (11,2)

stadium_df.head()

### Grabbing all of the tables
So far, we've extracted data from a single Wikipedia table. Let's go get the remaining tables so we can have a complete set of data.  

Remember how we used BeautifulSoup's `find` to get a table with the class `sortable wikitable`? Let's see what happens if we use `findAll`:

In [None]:
all_tables = soup_page.findAll('table', class_='sortable wikitable')

# print the length of all_tables
print(f"Beautiful Soup findAll returned {len(all_tables)} tables!")


Just as hoped, `findAll` retrieved all of the data tables on the Wikipedia page. Now we can extract the data from each table exactly as we did earlier!

In [None]:
# make an empty list called stadium_names
stadium_names = []

# also make an empty list called region_names
region_names = []

# for every table in the list named all_tables, go get the data!
for table in all_tables:

    # for every row (<tr>) in the HTML table
    for row in table.findAll('tr'):
    
        # find all table data (<td>) in the row
        row_data = row.findAll('td')

        # The try statement will run code unless there's an error
        # If there is, code execution jumps to the except statement
        try:
            # Get the stadium name in the 0th column
            stadium_name = row_data[0].text
            stadium_names.append(stadium_name)

            # Let's grab the region data. It's in the 5th column. 
            # But we use zero indexing, so its index is 4
            region_name = row_data[4].text
            region_names.append(region_name)

        except IndexError:
            continue

print(f"There are {len(stadium_names)} stadiums on the Wikipedia page!")



Now for the final pieces of the assignment. To the code below, please add:

1. Variables named `capacities`, `city_names`, and `country_names`, each starting as an empty list.
2. In the Python `try:` statement, the code needed to extract those values from each table row and append them to the matching lists.
3. A new DataFrame called `stadium_data` that holds the complete set of data with appropriate column names.
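
Capacity values arrive as text, which gets in the way of later arithmetic such as a capacity-to-population ratio. As an optional extra, here is a sketch of one way to turn that text into an integer. The helper `parse_capacity` is hypothetical, and it assumes cells look like `107,601`, possibly followed by a footnote marker such as `[2]`; real cells may need more cleaning.

```python
import re

def parse_capacity(raw_text):
    # Hypothetical helper: convert capacity text to an int.
    # Step 1: drop footnote markers such as [2]
    without_footnotes = re.sub(r'\[[^\]]*\]', '', raw_text)
    # Step 2: drop thousands separators and surrounding whitespace
    digits_only = without_footnotes.replace(',', '').strip()
    return int(digits_only)

print(parse_capacity('107,601[2]\n'))
print(parse_capacity(' 44,500 '))
```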


In [None]:
# make an empty list called stadium_names
stadium_names = []

# also make an empty list called region_names
region_names = []

# YOUR CODE BELOW SHOULD CREATE NEW LISTS
# capacities, city_names, country_names

# for every table in the list named all_tables, go get the data!
for table in all_tables:

    # for every row (<tr>) in the HTML table
    for row in table.findAll('tr'):
    
        # find all table data (<td>) in the row
        row_data = row.findAll('td')

        # The try statement will run code unless there's an error
        # If there is, code execution jumps to the except statement
        try:
            # Get the stadium name in the 0th column
            stadium_name = row_data[0].text
            stadium_names.append(stadium_name)

            ## INSERT YOUR CODE FOR CAPACITY, CITY, & COUNTRY BELOW


            # Let's grab the region data. It's in the 5th column. 
            # But we use zero indexing, so its index is 4
            region_name = row_data[4].text
            region_names.append(region_name)

        except IndexError:
            continue

# make a list containing the column names for the dataframe
# YOU FILL IN THE COLUMN NAMES
column_names = []

# make a dataframe with the column name and stadium names
# The data should be a zipped list
# YOUR CODE GOES IN THE DATAFRAME FUNCTION
stadium_data = pd.DataFrame()

print(f"There are {stadium_data.shape[0]} rows, {stadium_data.shape[1]} columns in stadium_data.")


Let's look at the last few entries in `stadium_data`. Compare these with the final entries in the [last Wikipedia table](https://en.wikipedia.org/wiki/List_of_stadiums_by_capacity#Capacity_of_40,000–50,000)...they should be the same.

In [None]:
stadium_data.tail()

Everything looks good. Your final task is to save the table as a CSV file named `stadium_data.csv`. Remember how to do that?
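
If you need a reminder, here is a sketch of the general pattern on a made-up dataframe. The names `sample_df` and `sample_stadiums.csv` are invented, and the file goes to a temporary folder so that nothing in your working directory is touched.

```python
import os
import tempfile

import pandas as pd

# A small stand-in dataframe (not the real stadium data)
sample_df = pd.DataFrame({'stadium_name': ['Stadium A', 'Stadium B'],
                          'capacity': [100000, 90000]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'sample_stadiums.csv')

    # index=False leaves the row numbers out of the file
    sample_df.to_csv(csv_path, index=False)

    file_exists = os.path.exists(csv_path)
    round_trip = pd.read_csv(csv_path)

print(file_exists)
print(round_trip.shape)
```

For the assignment itself, the file should be written to the current directory under the name `stadium_data.csv`, since that is what the final check looks for.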

In [None]:
### YOUR CODE TO SAVE THE STADIUM_DATA HERE

In [None]:
## doing a check to see that the file has been written to the current directory
import os
assert os.path.exists("stadium_data.csv"), "File named stadium_data.csv is not in the current directory."