# Assignment \#03

## Formatting from J!Archive

**Outcomes**
1. Scrape structured data  
1. Begin getting data into a useable format

**Scenario:** You have recently been accepted to appear on the game show Jeopardy! To prepare, you decide to review old games, and you find a wealth of information on the website [J! Archive](https://j-archive.com/).

Complete Section 0, either section 1 or section 2, and section 3. 

So either do (0, 1, 3) or (0, 2, 3). 

(0, 1, 2, 3) is also acceptable, though not required. 

0. Section 0: Navigating the Body
1. Section 1: Getting metadata from the head
1. Section 2: Using regular expressions to find clues
1. Section 3: Finding links out of the webpage

In [1]:
from urllib.request import urlopen
import time
from bs4 import BeautifulSoup, NavigableString, Tag
import pandas as pd
import re

## Section 0: Navigating the Body

***
We will use the same page as the last homework. The code below is given as part of the solution from Homework 2.

In [2]:
url = "https://www.j-archive.com/showgame.php?game_id=6293"
sleep_time = 20  # seconds to pause after the request, to be polite to the server
if isinstance(sleep_time, int):
    html = urlopen(url)
    time.sleep(sleep_time)
bs = BeautifulSoup(html, 'html.parser')  # parse the downloaded page

***
We are now going to navigate through the soup to try and find more about its contents without exploring the webpage. Using the `isinstance` function helps to determine if a Tag or a terminal node has been reached. This will help us throughout the assignment. It also allows us to determine how to proceed as we explore the tree. 
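As a small illustration of the `isinstance` check described above, here is a sketch on a toy document. The HTML string and the helper name `describe` are made up for this example and are not part of the J! Archive page.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Toy document (hypothetical, not the J! Archive page).
toy_html = "<div><p>Hello</p>plain text<span>World</span></div>"
toy_soup = BeautifulSoup(toy_html, "html.parser")


def describe(node):
    """Return a short label saying whether a node is a Tag or a text node."""
    if isinstance(node, Tag):
        return f"Tag: {node.name}"
    if isinstance(node, NavigableString):
        # A NavigableString is a terminal node: it has no children to explore.
        return f"Text: {str(node)!r}"
    return f"Other: {type(node).__name__}"


# Label every direct child of the <div>.
labels = [describe(child) for child in toy_soup.div.children]
print(labels)
```

The same check can decide whether to keep descending (a `Tag`) or stop (a string).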

### Q1: Loop over the children of `bs` to find how many children it has. If a child is a `Tag`, get its name. 

__Answer:__

There are _ children. The name of the tag(s) are _. 


***
Notice that we have not yet found the head or body of the text. Extend the code from Q1 to check whether any of the children of the `Tag` objects have the name `head` or `body`.

### Q2: Modify the code from Q1 to determine whether any of the `Tag` object children have the name `head` or `body`. Your code should print the additional tag references needed to directly access the head and body. 

e.g. if `head` is located at bs.sometag1.head, you should print `sometag1.head`. 

Now that we have the location of the head and body, we may continue. 

Alternatively, BeautifulSoup does a decent job of navigating tags when they are nested. For instance, bs.head can skip the reference to the middle layer. 

***
***
## Section 1: Getting metadata from the head

Our goal is to take the information from the head and store it in a dictionary that describes the content. Notice that we only want the `Tag` elements from the head. 

In order to do this, we will start with the first object. 

### Q1: Look at the `contents` of the `head` tag. What is the first non-"\n" element? 

Answer: title

***
We can begin at this item, and move through the rest using `.next_sibling` while skipping over the '\n' with an `if` statement. 
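The sibling walk described above might look like the following sketch. It runs on a toy list rather than the real `head`, and the helper name `siblings_after` is hypothetical.

```python
from bs4 import BeautifulSoup

# Toy document (hypothetical); the newlines between tags become "\n" strings.
toy_html = "<ul>\n<li>one</li>\n<li>two</li>\n<li>three</li>\n</ul>"
toy_soup = BeautifulSoup(toy_html, "html.parser")


def siblings_after(first):
    """Collect the siblings that follow `first`, skipping newline strings."""
    found = []
    node = first.next_sibling
    while node is not None:          # next_sibling is None after the last node
        if node == "\n":
            pass                     # whitespace between tags, nothing to keep
        else:
            found.append(node)
        node = node.next_sibling
    return found


first_item = toy_soup.ul.li          # the first <li>
texts = [li.get_text() for li in siblings_after(first_item)]
print(texts)
```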

### Q2: Start a dictionary, `metadata`, whose first key is the `name` of the first non-"\n" `Tag` and takes the value of its `contents`. 

***
### Q3: Iterate through the siblings following the first non-"\n" element. If the element is "\n", skip it. If the element is not "\n", then print its name and contents. 

Hint: Use a for loop over the siblings. You should notice a pattern here. 

***

The `.name` attribute only gives the name of the tag. Here, we will want to handle what we add to the dictionary based on the tag. This will involve more logic statements. 

Links may add a reference to a style sheet or a favorite icon (used by the browser). The attribute `rel` defines the relationship between the link and the current webpage, whereas `href` defines the link itself. In our case, we will want to use `rel` as the key and `href` as the value for our dictionary. 

Meta tags offer information about the webpage. We will want to use the `name` as the key, and `content` as the value for our dictionary. 

### Q4: Following the example of A04, where we found the links to the images, loop over the siblings and print the correct key-value pair depending on the type of tag. 


***

To add new items to a dictionary named `my_dict`, we can use `my_dict.update({'key':'value'})`. Remember that keys cannot be lists. 
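A minimal sketch of `update` and of the list-as-key restriction follows. The values are invented for illustration; the list mimics the form BeautifulSoup uses for a multi-valued attribute such as `rel`. Joining the list into one string is just one possible workaround, not the required one.

```python
my_dict = {"title": "Example page"}

# Hypothetical multi-valued attribute, as a list of strings.
rel_value = ["shortcut", "icon"]

try:
    # Building a dictionary with a list as a key raises TypeError.
    my_dict.update({rel_value: "favicon.ico"})
except TypeError as err:
    print("Lists cannot be keys:", err)

# One option: join the list into a single (hashable) string first.
key = " ".join(rel_value)
my_dict.update({key: "favicon.ico"})
print(my_dict)
```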

### Q5: Modify your code from Q4 to update the dictionary with these pairs. 

***
***
## Section 2: Using regular expressions to find clues. 

When scraping data, we likely want to acknowledge the structure that may already exist. The `.find_all()` method, while useful, may return a myriad of results. To be more precise in our searches, we can use regular expressions (regex) to specify with a degree of tolerance what we want. 

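For example, I may want to get all tags that have a class containing the word `clue`. This could include classes such as `clueless`, `clue95`, `clue`, etc. A regular expression for this is `clue+`. Strictly, the plus applies only to the preceding character, so `clue+` means `clu` followed by one or more `e`. Anything trailing is still accepted because BeautifulSoup tests a compiled pattern with its `search()` method, which looks for the pattern anywhere in the value; `clue` itself must appear at least once. 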
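A quick check of this pattern with the standard `re` module, on a made-up list of class names, is sketched below. It compares `search`, which looks anywhere in the string, with `fullmatch`, which must consume the whole string; the latter makes it visible that `+` repeats only the character directly before it.

```python
import re

pattern = re.compile("clue+")

# Hypothetical class names to test against.
candidates = ["clue", "clueless", "clue95", "clueeee", "clu", "my_clue"]

# search() succeeds if the pattern occurs anywhere in the string.
matched = [c for c in candidates if pattern.search(c)]

# fullmatch() succeeds only if the whole string is "clu" plus one or more "e".
full = [c for c in candidates if pattern.fullmatch(c)]

print(matched)
print(full)
```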

To find all clues from the Jeopardy round, whose `id` contains `clue_J`, we could use `bs.body.find_all('td', {'id':re.compile('clue_J+')})`. 

### Q1: Try running this expression. What ids are returned? Should we modify the code? 

**Answer:**

***
Hint for the previous question: you should modify the code. 

### Q2: Modify the code to return only the tags with the questions. This should be done in 1 line of code. 

***

Regular expressions work well for finding well-defined attributes and text with a little bit of forgiveness. 

This has allowed us to find the questions, but what if we want to find the `response` contestants give as well? If we try searching for any tag with synonyms of `response`, we find there are no matches. This is due to the way the responses are stored. 

To convince you of this, note that multiple regular expressions can be combined using the pipe delimiter `|`. So to express multiple constraints, I could use `re.compile("a+|b+")`, which will match strings that include at least one `a` or `b`. 
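The alternation can be tried on plain strings first. This is a sketch with invented test words, using only the standard `re` module.

```python
import re

# Either one or more "a", or one or more "b".
pattern = re.compile("a+|b+")

words = ["cat", "bob", "dog", "xyz"]

# Keep the words that contain at least one "a" or "b".
hits = [w for w in words if pattern.search(w)]
print(hits)

# group() shows which part of the string satisfied the pattern.
first_match = pattern.search("aab").group()
print(first_match)
```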

To search all tags, we replace `'td'` with `True`. 

### Q3: Use a regular expression within a find_all statement to search all tags for anything that may contain a class with response or answer. 

***

To find the answers we seek, we will have to use a `lambda` expression. They are similar to mini functions and we will consider them in later assignments with functions.

We will use a series of functions and regex expressions to determine how to get the answers. This problem is generally seen as being more challenging and can also be handled via Selenium. 

An example of trying to find the answers using functions and multiple regex expressions is shown here for those interested:

> bs.body(True, {'class':re.compile("round+")})[0].find_all(lambda tag: bool(re.search("correct_+", str(tag))) & (isinstance(tag, Tag)))

***

Using regular expressions, however, we can get the containers once we know they are `div` tags with an `onmouseover` attribute. 

To accept any contents, we can use `r".*"` in our regex, which allows for any values. 

### Q4: Use a regular expression to get all the `div` tags that contain a correct answer. Save this list as `answers_within`. 

***

The questions are stored in the contents of `onmouseout`, whereas the answers are stored in `onmouseover`. Recall from HW2 that we can get the contents of an attribute using `.get(attr)`. 

### Q5: Using a loop, print the questions stored in the `onmouseout` attribute. 

You will need to format these correctly in a systematic way that removes the toggle and clue portions. Note that using the string method `.split` will not work on a comma, but there are other strings you can split on. 
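To see why a bare comma is a poor separator, consider the sketch below. The string `raw` is a hypothetical attribute value, shaped like a JavaScript call with three quoted arguments; the real attribute values on the page may differ (for example, they can contain HTML or escaped quotes), so treat this only as an illustration of splitting on a longer string.

```python
# Hypothetical attribute value: a call with three quoted arguments.
raw = "toggle('box_1', 'box_1_alt', 'Red, white, and blue')"

# Splitting on "," also cuts inside the last argument.
by_comma = raw.split(",")
print(len(by_comma))

# Splitting on the quote-comma-quote boundary keeps each argument whole.
parts = raw.split("', '")
last = parts[-1]

# Drop the closing quote and parenthesis left over from the call.
if last.endswith("')"):
    last = last[:-2]
print(last)
```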

***

### Q6: Using a loop, print the answers stored in the `onmouseover` attribute. 

Hint: This is similar to homework 2. 

***
***

## Section 3: Finding links out of webpages. 

When building web crawlers, we may want to move through a series of webpages to map a site and the way they link together. These networks require being able to find links. The easiest way to find a link is using the `href` attribute. 

To find any tag with an `href`, we can combine two pieces. The first is a search-all-tags statement, `.find_all(True)`. The second is the argument `{'href': re.compile(".*")}`, which tells Python to find all tags that have any `href`. 
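On a toy document, the two pieces behave as sketched below. The HTML is invented for this example and is not taken from J! Archive; tags without an `href` attribute are left out of the result.

```python
import re
from bs4 import BeautifulSoup

# Toy document (hypothetical).
toy_html = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
<a href="https://example.com/page" title="Example">Example page</a>
<a href="local.html">Local page</a>
<p>No link here</p>
<a name="anchor-only">No href</a>
</body></html>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# True matches every tag; the regex accepts any value of href,
# but a tag with no href attribute at all does not match.
with_href = toy_soup.find_all(True, {"href": re.compile(".*")})

hrefs = [(tag.name, tag.get("href")) for tag in with_href]
print(hrefs)
```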

### Q1: Combine these statements in 1 line to get all references below. What is interesting about the hyperlinks?

Some hyperlinks do not begin with `http`.

***

Now, we may want to get some information about these hyperlinks. You may notice that some have titles and others do not. Link tags typically have a `rel` attribute that offers us more information about them. 

If the tag name is `a`, these tags typically have text associated with them or a title. In this case, we would prioritize getting the title over the text. If we were amassing data, we may want to include both systematically. 

### Q2: Write a loop that prints the value of `rel` for `link` tags, or the `title` for `a` tags. If the title is not there, then print the text. 