In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab10.ipynb")

# Lab 10: Web Scraping and Manipulating Data

Welcome to Lab 10 of Data Wrangling and Visualization!

## Overview
The web provides a rich source of data, so we often want to work with that information programmatically in order to make sense of it. Sometimes that data is provided to us by website creators as CSV files or through an API (Application Programming Interface). Other times, we need to collect text from the web ourselves.

It is easy to pull HTML from a website but more difficult to extract the information we want from it. Parsing the HTML for targeted information and then storing that information in a structured format will be the focus of this activity. Once we have our data in a structured format, we will practice cleaning and reshaping the data.
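
As a rough illustration of what "parsing for targeted information" can look like, here is a minimal sketch on a tiny, made-up HTML string. The snippet and the names `demo_html`, `demo_soup`, `heading`, and `paragraphs` are assumptions for illustration only; the page used in this lab is structured differently.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet (not taken from any real page).
demo_html = """
<html><body>
  <h1 class="title">Example page</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

# Parse the raw text into a tree we can search.
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Target specific tags, then store their text in a structured form.
heading = demo_soup.find("h1").get_text()
paragraphs = [p.get_text() for p in demo_soup.find_all("p")]

print(heading)
print(paragraphs)
```

`find` returns the first matching tag and `find_all` returns every match, which is why the paragraphs end up in a list.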


## In today's lab, we will
- Become familiar with the structure of HTML and CSS by inspecting source code
- Use the Requests and Beautiful Soup libraries to acquire and parse data from websites
- Work through the challenges of web scraping (i.e. turning messy, unstructured data into workable form)
- Use Pandas methods to clean, manipulate, and reshape data for a brief EDA

This activity will make use of the Requests and Beautiful Soup Python modules.
Documentation can be found here: 
- https://docs.python-requests.org/en/latest/
- https://www.crummy.com/software/BeautifulSoup/

### Import packages


In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
import numpy as np

### 1. Scrape and Beautify Cal Poly Humboldt Data

**Question 1.1:** Go to https://pine.humboldt.edu/anstud/cgi-bin/filter.pl?relevant=ft1yr_retention.out. Read through the webpage and inspect its source code. What information does this page provide? Once you have a feel for what is on the webpage, import its contents with the requests module.

In [None]:
# Import the data
page = ...

In [None]:
grader.check("q1_1")

**Question 1.2:** Inspect the text of the data you just imported. 

In [None]:
# inspect the text
page_text = ...
page_text

In [None]:
grader.check("q1_2")

**Question 1.3:** Create a Beautiful Soup object from `page_text`. 

In [None]:
# Create soup object
soup = ...
soup

In [None]:
grader.check("q1_3")

<!-- BEGIN QUESTION -->

**Question 1.4:** Use the `prettify()` method to turn the Beautiful Soup parse tree into a nicely formatted string with each HTML tag on its own line. Look through the output, note any tags or attributes you are unfamiliar with, and look them up.
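
To get a feel for what `prettify()` returns, here is a small sketch on a made-up one-line table. The names `demo_html`, `demo_soup`, and `pretty` are hypothetical and not part of the lab's graded variables.

```python
from bs4 import BeautifulSoup

# Made-up HTML written on a single line.
demo_html = "<table><tr><td>1</td><td>2</td></tr></table>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# prettify() returns a string, so print it to see the indentation.
pretty = demo_soup.prettify()
print(pretty)
```

Printing the string, rather than leaving it as the last expression in a cell, shows the line breaks instead of `\n` escape sequences.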

In [None]:
# Print pretty data

<!-- END QUESTION -->

### 2. Parse and clean the data

**Question 2.1:** Use a BeautifulSoup method to determine how many tables are on the webpage. 

In [None]:
num_tables = ...
num_tables

In [None]:
grader.check("q2_1")

**Question 2.2:** Access the second-to-last table and put its contents in `table1_code`. Inspect it, and then give a brief description of what the table contains.

_Type your answer here, replacing this text._

In [None]:
table1_code = ...
table1_code

In [None]:
grader.check("q2_2")

**Question 2.3:** Put the contents of the second-to-last table (`table1_code`) into a Pandas DataFrame.

*HINT 1:* You will probably find it helpful to first create a nested list of the data. Each inner list should contain the contents of one row in the table. The outer list should contain all the rows. Use this nested list to create the Pandas DataFrame.

*HINT 2:* Notice that the first two rows in the table have headers (a slightly different structure from the rest of the table). You might find it helpful to first create the data frame to contain the body of the table. Then create the column names with the headers afterward.
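
The two hints can be sketched on a made-up table as follows. This is only an illustration under stated assumptions: the toy table has a single header row of `<th>` cells and body rows of `<td>` cells, and the names `demo_table`, `body`, `header`, and `demo_df` are hypothetical. The real table has two header rows and may need different handling.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table: one header row (<th>) followed by body rows (<td>).
demo_html = """
<table>
  <tr><th>Program</th><th>Year 1</th><th>Year 2</th></tr>
  <tr><td>Program A</td><td>10</td><td>12</td></tr>
  <tr><td>Program B</td><td>20</td><td>18</td></tr>
</table>
"""
demo_table = BeautifulSoup(demo_html, "html.parser").find("table")
rows = demo_table.find_all("tr")

# Body: one inner list per row that contains <td> cells.
body = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in rows
    if tr.find_all("td")
]

# Header: the text of the <th> cells in the first row.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Build the frame from the body first, then attach the column names.
demo_df = pd.DataFrame(body)
demo_df.columns = header
print(demo_df)
```

Note that every value is still a string at this point, since it came from HTML text.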

In [None]:
# Create a nested list of data
nested_list = ...


# Put the data in a Pandas dataframe and rename columns as needed
table1_df = ...
table1_df.columns = ...
table1_df

In [None]:
grader.check("q2_3")

**Question 2.4:** Notice that in `table1_df` the even-indexed rows contain numbers and the odd-indexed rows contain percents. Separate `table1_df` into two Pandas DataFrames: one containing only the numbers and one containing only the percents.

*NOTE:* Make sure that both tables have data in the `Major Program` column.
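
One possible approach is sketched below on made-up data. It assumes the program label appears only on the number rows and is an empty string on the percent rows; check whether that holds for your own `table1_df`. The names `demo`, `demo_nums`, and `demo_percents` are hypothetical.

```python
import pandas as pd

# Made-up frame where numbers and percents alternate row by row.
demo = pd.DataFrame({
    "Program": ["Program A", "", "Program B", ""],
    "Year 1": ["10", "50%", "20", "40%"],
})

# Turn empty labels into missing values, then carry the last label forward.
demo["Program"] = demo["Program"].mask(demo["Program"] == "").ffill()

# Slice by position: rows 0, 2, ... and rows 1, 3, ...
demo_nums = demo.iloc[::2].reset_index(drop=True)
demo_percents = demo.iloc[1::2].reset_index(drop=True)

print(demo_nums)
print(demo_percents)
```

Filling the labels before slicing means both resulting frames keep a populated label column.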

In [None]:
# Create numbers data frame
table1_nums = ...
table1_nums

In [None]:
# Create percents data frame
table1_percents = ...
table1_percents

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

**Question 2.5:** Check the data types of each column in both tables from the previous question. 

In [None]:
# check nums dtypes

In [None]:
# check percents dtypes

<!-- END QUESTION -->

**Question 2.6:** Cast the data types in your tables appropriately. In other words, if a column contains numbers, cast it to an appropriate numeric data type.

*NOTE:* The percents in the `table1_percents` DataFrame should be floats.
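
A minimal sketch of casting string columns, using made-up frames. It assumes the percent strings end in a `%` sign, which may not match the scraped data exactly; the names `demo_nums` and `demo_percents` are hypothetical.

```python
import pandas as pd

# Made-up frames whose numeric columns are still stored as strings.
demo_nums = pd.DataFrame({
    "Program": ["Program A", "Program B"],
    "Year 1": ["10", "20"],
})
demo_percents = pd.DataFrame({
    "Program": ["Program A", "Program B"],
    "Year 1": ["50%", "40.5%"],
})

# Whole-number strings convert directly to an integer dtype.
demo_nums["Year 1"] = pd.to_numeric(demo_nums["Year 1"])

# Strip the trailing percent sign before casting to float.
demo_percents["Year 1"] = demo_percents["Year 1"].str.rstrip("%").astype(float)

print(demo_nums.dtypes)
print(demo_percents.dtypes)
```

If a column contains values that cannot be parsed, `pd.to_numeric` raises an error by default; passing `errors="coerce"` turns those values into `NaN` instead.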

In [None]:
# Cast datatypes in nums table
    
table1_nums.dtypes

In [None]:
# Cast datatypes in percents table

table1_percents.dtypes

In [None]:
grader.check("q2_6")

### 3. Visualize

**Question 3.1:** Visualize the relationship between the 1st Year retention rates (percents) in Fall 13 and the 1st Year retention rates (percents) in Fall 22. Make sure your visualization has a title and your axes are labeled. 

In [None]:
scatter_13_vs_22 = ...
plt.show()

In [None]:
grader.check("q3_1")

<!-- BEGIN QUESTION -->

**Question 3.2:** Visualize the trend in 1st Year retention rates (percents) over time for Biology. On the same axes, plot the trend for English in a different color. Make sure your plot has an informative legend, axis labels, and a title.

*HINT:* You will probably find it helpful to reshape your `table1_percents` dataframe to create this plot. 
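
As one way to reshape, here is a sketch that uses `melt` to turn a made-up wide frame (one column per term) into a long frame (one row per program and term). The column names `Program`, `Term`, and `Rate` and the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up wide frame: one row per program, one column per term.
wide = pd.DataFrame({
    "Program": ["Program A", "Program B"],
    "Year 1": [50.0, 40.0],
    "Year 2": [55.0, 45.0],
})

# Long form: the term columns become values in a single "Term" column.
long = wide.melt(id_vars="Program", var_name="Term", value_name="Rate")
print(long)
```

A long frame like this can be passed to `sns.lineplot` with `x="Term"`, `y="Rate"`, and `hue="Program"` to draw one colored line per program.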

In [None]:
# reshape the data as needed


In [None]:
# plot the trends
retention_trends = ...
plt.show()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.3:** Create one more data visualization of your choice. This might require reshaping your data. Describe what insights you gain from your visualization.

In [None]:
# Visualize

<!-- END QUESTION -->

## You're done! 

Congratulations on finishing the lab! Gus is proud of you! Run the cell below and submit to Canvas. 

<img src="gus_big_stretch.JPG" alt="drawing" width="500"/>

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)