# Lab Assignment 2
## Data Engineering 1
### 20 points, Due Sunday, October 9 11:59pm

(1) In class we began the GitHub workflow by creating a new GitHub repository from scratch on github.com. But another way to begin a new GitHub repository is to start by copying someone else's repository into your own account (this is called "forking" a repository). Follow these steps to initialize your repository for this lab:

* Go to https://github.com/jkropko/lab2. Find the button marked Fork in the upper-right corner of the screen and push it. You will be taken to a new GitHub page that looks exactly the same, except that the URL now has your GitHub user name instead of "jkropko". (Side note: most open software has a public GitHub repo, and you can get your own copy of it by forking the repository. So for example, suppose you wish pandas had a more intuitive function for finding missing values in a data frame: you could fork https://github.com/pandas-dev/pandas and add your own code to your version of this repository to make your own custom version of pandas. If your code works extremely well, you can also suggest the original pandas repo incorporate your code by issuing a "pull request".)

* On your computer, choose a location for working with the local files and navigate to that location in a terminal. Use the `git clone` command to download your repository and activate the git commands. (No need to create a "lab2" folder yourself -- one will be created when you use `git clone`.)

* Use `cd` to enter the lab2 folder (but not the data folder inside this one). Then create three files:

    * A requirements.txt file that installs jupyterlab==3.4.7, pandas==1.4.4, numpy==1.23.3, requests==2.28.1, and beautifulsoup4==4.8.1

    * A Dockerfile that installs python:3.10.7-bullseye, copies the requirements.txt file into the container and runs pip install for the packages in the requirements.txt file, sets a default working directory inside the container with an appropriate name, exposes port 8888 and launches jupyter lab.
    
    * A .env file. It can be empty for now, but you will add API keys to this file.
    
* Use the `git add`, `git commit`, and `git push` commands to save these files to your GitHub repository. (Note that I already added a .gitignore file to the repository, which tells git not to upload your .env file.)

* Build the Docker image from your Dockerfile, then run the Docker container from this image. Be careful to specify a port for running Jupyter Lab locally, attach the local folder as a volume to the container's working directory, and specify the .env file. 

* Use the containerized version of Jupyter Lab to work on the rest of this lab. Save your notebook in the container, and add, commit, and push this notebook to GitHub as well.

To receive credit for this problem, just type the URL of your GitHub repository in your notebook. [3 points]

https://github.com/beauleblond/lab2

(2) Now that you have a local copy of your GitHub repository, notice that there is a data folder. Inside this folder are 11 flat files, each containing another version of the same dataset. Only one of the files, `data_clean.csv`, can be loaded properly with a straightforward use of `pd.read_csv()`. The other 10 files all need either an argument within `pd.read_csv()` or a different pandas method to load properly. Find the code that correctly loads each of the 10 files. Display the head of each data frame to confirm that it is properly loaded. 

(Note: all of the arguments and methods you need are discussed in https://jkropko.github.io/surfing-the-data-pipeline/ch2.html. Don't worry about small differences in the column names, such as capitalization, if it remains clear which column is which. A couple of the data files have more missing data than `data_clean.csv`, which is fine so long as the missing values are coded as missing, and not as numeric values.) [3 points]
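
As an illustration of the kind of arguments involved, here is a minimal sketch that reads a small made-up file from a string rather than from the lab's data folder. The file contents, the `$` delimiter, the two junk lines at the top, and the 999 missing-value code are all assumptions chosen for the example; the real files may need different settings.

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines, a '$' delimiter, and 999 as a missing-value code
raw = """this line is not data
neither is this one
Country$Happiness score$Generosity
Finland$7.632$0.192
Norway$999$0.286
"""

toy = pd.read_csv(
    io.StringIO(raw),
    sep='$',          # non-comma delimiter
    header=2,         # the column names sit on the third line (index 2)
    na_values=[999],  # treat the numeric code 999 as missing
)
print(toy)
```

Calling `toy.isna().sum()` afterwards is a quick way to confirm that the placeholder code was converted to a missing value instead of being kept as a number.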

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json
import os
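import collections.abc
# Workaround for older beautifulsoup4 releases that still reference collections.Callable,
# an alias that Python 3.10 removed in favor of collections.abc.Callable
collections.Callable = collections.abc.Callable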
oldpath = os.getcwd()

In [2]:
os.chdir("data")

In [3]:
data_clean = pd.read_csv('data_clean.csv')
data_clean.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [4]:
data1 = pd.read_csv('data1.csv', header=2)
data1.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [5]:
data2 = pd.read_csv('data2.txt', header=[2], comment = '/')
data2.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [6]:
data3 = pd.read_csv('data3.txt', sep='\t', header=2)
data3.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [7]:
# The file has no header row; add_prefix names the columns x0, x1, ... without the deprecated prefix argument
data4 = pd.read_csv('data4.txt', sep='$', header=None).add_prefix('x')
data4.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [8]:
# skipfooter is only supported by the Python parsing engine, so request it explicitly
data5 = pd.read_csv('data5.csv', skipfooter=2, engine='python')
data5.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [9]:
# This file codes missing values as 999, so tell pandas to read that code as missing
data6 = pd.read_csv('data6.dat', na_values=[999])
data6.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,NaN,NaN,NaN,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,NaN,NaN,1.582,NaN,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,NaN,0.683,0.284,0.408
3,Iceland,7.495,7.593,NaN,2.426,1.343,1.644,0.914,0.677,0.353,NaN
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [10]:
!pip install openpyxl
data7 = pd.read_excel('data7.xlsx', sheet_name='Data')
data7.head()

Unnamed: 0,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.92) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [11]:
data8 = pd.read_stata('data8.dta')
data8.head()

Unnamed: 0,country,happinessscore,whiskerhigh,whiskerlow,dystopia192residual,explainedbygdppercapita,explainedbysocialsupport,explainedbyhealthylifeexpectancy,explainedbyfreedomtomakelifechoi,explainedbygenerosity,explainedbyperceptionsofcorrupti
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [12]:
!pip install pyreadstat
data9 = pd.read_spss('data9.sav')
data9.head()

Unnamed: 0,country,happiness,whiskerhigh,whiskerlow,dystopia,gdpPC,socsupport,lifeexp,lifechoice,generous,corrupt
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


In [13]:
# Pass an encoding so that text columns are decoded to strings instead of raw bytes
data10 = pd.read_sas('data10.xpt', encoding='utf-8')
data10.head()

Unnamed: 0,COUNTRY,HAPPINES,WHISKERH,WHISKERL,DYSTOPIA,EXPLAINE,EXPLAIN2,EXPLAIN3,EXPLAIN4,EXPLAIN5,EXPLAIN6
0,Finland,7.632,7.695,7.569,2.595,1.305,1.592,0.874,0.681,0.192,0.393
1,Norway,7.594,7.657,7.53,2.383,1.456,1.582,0.861,0.686,0.286,0.34
2,Denmark,7.555,7.623,7.487,2.37,1.351,1.59,0.868,0.683,0.284,0.408
3,Iceland,7.495,7.593,7.398,2.426,1.343,1.644,0.914,0.677,0.353,0.138
4,Switzerland,7.487,7.57,7.405,2.32,1.42,1.549,0.927,0.66,0.256,0.357


(3) For this problem, you will be accessing the open APIs maintained by NASA. All of the API endpoints share a credentialing system with the same API key. First, register for an API key here: https://api.nasa.gov/. Once you have the key, save it in your .env file. 

Then click on Browse APIs and find the information for the Asteroids - NeoWs API. This API reports data on "near Earth objects" in space, such as asteroids. It reports the size of the objects, their speed, direction, and distance to Earth. It even codes whether or not NASA considers the object to be potentially hazardous to Earth.

Write the code to access this API and generate a single data frame with all of the near Earth objects reported by NASA over the last 7 days.
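
One way to avoid typing dates by hand is to compute the window from today's date. The sketch below is only an illustration: `last_n_days` is a hypothetical helper, "last 7 days" is assumed to mean seven calendar dates ending today, and `end_date` is the parameter name as I understand it from NASA's documentation, so check it there before relying on it.

```python
from datetime import date, timedelta

def last_n_days(n, today=None):
    """Return (start, end) as YYYY-MM-DD strings covering the n dates ending today."""
    end = today or date.today()
    start = end - timedelta(days=n - 1)
    return start.isoformat(), end.isoformat()

# A fixed date is passed here so the example is reproducible; omit it to use the real current date
start_date, end_date = last_n_days(7, today=date(2022, 10, 9))
params = {'start_date': start_date, 'end_date': end_date}
print(params)
```

The resulting dictionary can be handed to the `params` argument of `requests.get()` together with the API key.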

To receive full credit for this problem, make sure you

* Supply an accurate user-agent in the headers along with your email address in the 'From' field

* Provide your NASA API key in the way the documentation instructs

* Use `pd.json_normalize()` and `pd.concat()` to extract and combine several data frames from the JSON output.
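
The last requirement can be sketched on a small made-up dictionary that imitates the nesting of the feed, where each date maps to a list of objects. The ids, names, and the shortened `estimated_diameter` field are invented for illustration; only the `near_earth_objects` key follows the real response.

```python
import pandas as pd

# Made-up response with the same nesting as the feed: dates map to lists of objects
toy_json = {
    'near_earth_objects': {
        '2022-10-07': [{'id': '2', 'name': 'B', 'estimated_diameter': {'meters': {'min': 5.0}}}],
        '2022-10-06': [{'id': '1', 'name': 'A', 'estimated_diameter': {'meters': {'min': 3.0}}},
                       {'id': '3', 'name': 'C', 'estimated_diameter': {'meters': {'min': 9.0}}}],
    }
}

# One data frame per date, in date order, then stacked into a single frame
frames = [
    pd.json_normalize(toy_json, record_path=['near_earth_objects', day])
    for day in sorted(toy_json['near_earth_objects'])
]
neo = pd.concat(frames, ignore_index=True)
print(neo)
```

Looping over the keys of `near_earth_objects` means the code does not need to know in advance which dates the API returned.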

Note: in your output, the `close_approach_data` field will still be in a dictionary within a list in the final data frame. That's okay for this problem. If you want to extract those data and store them as additional columns, please do: I recommend using list comprehensions as follows:
```
df['close_approach_date'] = [d[0]['close_approach_date'] for d in df['close_approach_data']]
```
[3 points]

In [14]:
os.chdir(oldpath)
nasa_token = os.environ['nasa_token']

In [15]:
useragent_url = 'https://httpbin.org/user-agent'
r = requests.get(useragent_url)
useragent = json.loads(r.text)['user-agent']

In [16]:
headers = {'User-Agent' : useragent,
          'From': 'bwl5cd@virginia.edu'}

In [17]:
root = 'https://api.nasa.gov'
asteroids_endpoint = '/neo/rest/v1/feed'
# Dates use the YYYY-MM-DD format; without an end_date the feed covers the 7 days after start_date
params = {'start_date': '2022-10-06',
          'api_key': nasa_token}
r = requests.get(root + asteroids_endpoint,
                 params = params,
                 headers = headers)
r

<Response [200]>

In [18]:
myjson = json.loads(r.text)
# Build one data frame per date in the response, then stack them in date order
days = sorted(myjson['near_earth_objects'])
asteroids = pd.concat([pd.json_normalize(myjson, record_path = ['near_earth_objects', day])
                       for day in days], ignore_index=True)
# Pull the approach date out of the first close_approach_data entry for each object
asteroids['close_approach_date'] = [d[0]['close_approach_date'] for d in asteroids['close_approach_data']]
asteroids

Unnamed: 0,id,neo_reference_id,name,nasa_jpl_url,absolute_magnitude_h,is_potentially_hazardous_asteroid,close_approach_data,is_sentry_object,links.self,estimated_diameter.kilometers.estimated_diameter_min,estimated_diameter.kilometers.estimated_diameter_max,estimated_diameter.meters.estimated_diameter_min,estimated_diameter.meters.estimated_diameter_max,estimated_diameter.miles.estimated_diameter_min,estimated_diameter.miles.estimated_diameter_max,estimated_diameter.feet.estimated_diameter_min,estimated_diameter.feet.estimated_diameter_max,close_approach_date
0,2163692,2163692,163692 (2003 CY18),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=2163692,18.26,False,"[{'close_approach_date': '2022-10-06', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/2163692?ap...,0.592318,1.324463,592.318063,1324.463452,0.368049,0.822983,1943.300793,4345.352673,2022-10-06
1,2516428,2516428,516428 (2003 UR12),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=2516428,17.90,False,"[{'close_approach_date': '2022-10-06', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/2516428?ap...,0.699125,1.563292,699.125232,1563.291544,0.434416,0.971384,2293.718027,5128.909430,2022-10-06
2,3291224,3291224,(2005 SP9),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=3291224,21.64,False,"[{'close_approach_date': '2022-10-06', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/3291224?ap...,0.124898,0.279280,124.897854,279.280092,0.077608,0.173537,409.769876,916.273297,2022-10-06
3,3826795,3826795,(2018 PZ21),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=3826795,26.70,False,"[{'close_approach_date': '2022-10-06', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/3826795?ap...,0.012149,0.027167,12.149404,27.166893,0.007549,0.016881,39.860251,89.130231,2022-10-06
4,3831613,3831613,(2018 TL3),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=3831613,24.40,False,"[{'close_approach_date': '2022-10-06', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/3831613?ap...,0.035039,0.078350,35.039264,78.350176,0.021772,0.048685,114.958219,257.054393,2022-10-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,54075318,54075318,(2020 US1),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=54075318,25.09,False,"[{'close_approach_date': '2022-10-12', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/54075318?a...,0.025501,0.057022,25.500869,57.021676,0.015846,0.035432,83.664270,187.078996,2022-10-12
79,2395289,2395289,395289 (2011 BJ2),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=2395289,18.30,False,"[{'close_approach_date': '2022-10-13', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/2395289?ap...,0.581507,1.300289,581.507040,1300.289270,0.361332,0.807962,1907.831556,4266.041049,2022-10-13
80,2519354,2519354,519354 (2011 KR12),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=2519354,21.51,False,"[{'close_approach_date': '2022-10-13', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/2519354?ap...,0.132603,0.296510,132.603497,296.510433,0.082396,0.184243,435.050856,972.803288,2022-10-13
81,54104724,54104724,(2021 AE2),http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=54104724,24.87,False,"[{'close_approach_date': '2022-10-13', 'close_...",False,http://api.nasa.gov/neo/rest/v1/neo/54104724?a...,0.028220,0.063102,28.219868,63.101543,0.017535,0.039209,92.584871,207.026066,2022-10-13


(4) Read the following blog post about how to read a robots.txt file: https://yoast.com/ultimate-guide-robots-txt/ Then look at the robots.txt files for the following websites, and report whether they allow the following kinds of web-scraping. For each question, explain your answer and copy-and-paste the relevant parts of the robots.txt file if necessary:

a. Pulling the current listed prices for houses for sale in Charlottesville from zillow.com

This is disallowed. The following directive disallows web scraping of pages under the path /homes/:

"Disallow: /homes/*/?"

b. Scraping current stock prices off of google.com/finance

This is allowed. The following Allow statement appears in the robots.txt file and permits scraping any path that begins with /finance.

"Allow: /finance"

c. Copying the list of twitter accounts that each NBA player follows

This is not allowed. The robots.txt file for Twitter has a statement that disallows any path ending in /following, whatever comes before it. The rule is repeated for common search engines and also appears under the comment "# Every bot that might possibly read and respect this file":

"Disallow: /*/following"

d. The lyrics to Lizzo's Good as Hell from https://genius.com/Lizzo-good-as-hell-lyrics

This is allowed. No disallow statement covers the lyrics pages for general user agents. The exceptions are the Uniscan and Searchie agents, which are disallowed from crawling anything on the website.

[2 points]
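
Simple rules like these can also be checked in code with Python's standard library. The sketch below parses a made-up set of rules from a string, so it runs without network access; the rules, the `example.com` URLs, and the `my-lab-scraper` agent name are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; they are not copied from any real site
rules = """
User-agent: *
Allow: /finance
Disallow: /search
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse rules from a list of lines instead of fetching a URL

finance_ok = parser.can_fetch('my-lab-scraper', 'https://example.com/finance/quote')
search_ok = parser.can_fetch('my-lab-scraper', 'https://example.com/search?q=stocks')
print(finance_ok, search_ok)
```

As far as I know, this parser matches rules by plain path prefix, so wildcard patterns such as `/*/following` may not be interpreted the way a search-engine crawler would read them. Reading the file by hand remains the safer check for those cases.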

(5) In class we accessed the Genius API to find the most popular songs by Lizzo. But the Genius API does not allow us to download the lyrics to any songs. This looks like a job for web scraping. 

Good As Hell is the 15th track on Lizzo's 2019 album Cuz I Love You. In this problem, you will web-scrape genius.com to bring the lyrics to Good As Hell into Python, find links to the lyrics to the other songs on the Cuz I Love You album, and build a spider to extract all the song lyrics from this album.

A few tricks I found useful to complete these tasks:

**Trick 1**: A list comprehension is a for-loop across items of a list that places the outputs into another list. The general syntax is
```
[f(x) for x in list]
```
where `list` is the list of input values, `x` is a token that stands in for one general element of the input list, and `f(x)` is what we want to do to each element. For example, if this is my list of inputs:

In [19]:
inputs = [3,5,8,9,12]

and I want to square then subtract 1 from each element, I can use the syntax

In [20]:
[(x**2) - 1 for x in inputs]

[8, 24, 63, 80, 143]

For this problem, you will need to use a *filtered* list comprehension, which uses `if` to filter down the list of outputs. For example, if I only want to keep the even input numbers in this list (8 and 12), I can use `if x/2 == round(x/2)` (no decimals when dividing by 2) to narrow the list down:

In [21]:
[(x**2) - 1 for x in inputs if x/2 == round(x/2)]

[63, 143]
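
An equivalent filter, and probably the more common way to test for even integers, uses the modulo operator. This is a small variation on the example above, not a different technique.

```python
inputs = [3, 5, 8, 9, 12]

# x % 2 is the remainder after dividing by 2, so it is 0 exactly for even integers
evens_squared_minus_one = [(x**2) - 1 for x in inputs if x % 2 == 0]
print(evens_squared_minus_one)
```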

**Trick 2**: The second trick that will be needed is the `attrs` argument of BeautifulSoup's `.find_all()` method. When searching for particular HTML tags, you may want to narrow down the search to only those tags with a particular attribute. For example, if I saved the parsed HTML as `mysoup`, and I want to find all the `a` tags that also have an attribute `rel="noopener"`, I can type
```
mysoup.find_all('a', attrs={'rel':'noopener'})
```
If the attribute name has no hyphens and does not clash with a Python reserved word such as `class` or with an argument of `.find_all()` such as `name`, then a shortcut for the same function is
```
mysoup.find_all('a', rel='noopener')
```
but if the attribute name has hyphens or clashes in that way, then the `attrs` approach still works.
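
To see the hyphen case in action, here is a minimal sketch on a made-up HTML snippet. The `data-role` attribute and its value are invented for the example and are not taken from genius.com.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two links, only one of which has the hyphenated data-role attribute
html = """
<div>
  <a href="/one" data-role="track">One</a>
  <a href="/two" rel="noopener">Two</a>
</div>
"""
mysoup = BeautifulSoup(html, 'html.parser')

# data-role contains a hyphen, so it cannot be written as a keyword argument
tracks = mysoup.find_all('a', attrs={'data-role': 'track'})
hrefs = [a['href'] for a in tracks]
print(hrefs)
```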

**Trick 3**: Finally, suppose you have a text document stored so that each paragraph is a separate item of a list. The `.join()` method can combine all these elements into a single string. For example, if I have a list

In [22]:
words = ['I','do','my','hair','toss,', 'check','my','nails','Baby,','how','you','feelin?','Feelin','good','as','hell']
words

['I',
 'do',
 'my',
 'hair',
 'toss,',
 'check',
 'my',
 'nails',
 'Baby,',
 'how',
 'you',
 'feelin?',
 'Feelin',
 'good',
 'as',
 'hell']

and I want to combine them into a single string in which each word is separated by a space, I type a space in quotes, then apply the `.join(words)` method to it:  

In [23]:
' '.join(words)

'I do my hair toss, check my nails Baby, how you feelin? Feelin good as hell'

Using these three tricks, perform the following tasks:

(a) Use `requests` and `BeautifulSoup` to scrape the lyrics to Good As Hell from https://genius.com/Lizzo-good-as-hell-lyrics

(b) Notice that toward the bottom of https://genius.com/Lizzo-good-as-hell-lyrics there is a menu containing a track list for the Cuz I Love You album. Each link here takes you to the lyrics for that song. Use `requests` and `BeautifulSoup` to save the list of song lyric URLs in a Python list.

(c) Build a spider: first write a function that applies the code you wrote for part (a) to a user-supplied URL; then apply this function to each of the URLs in the list you constructed in part (b).

[5 points]
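
The structure of the spider in part (c) can be sketched independently of the scraping details. In the sketch below, `crawl` and `fake_scrape` are hypothetical names, the URLs are placeholders, and the pause between requests is a courtesy I am assuming is appropriate, not something the assignment requires.

```python
import time

def crawl(urls, scrape, delay=1.0):
    """Apply scrape(url) to each URL in turn, pausing between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite: do not hammer the server
        results[url] = scrape(url)
    return results

# Stand-in for a real scraping function, so the sketch runs without network access
def fake_scrape(url):
    return 'lyrics for ' + url.rsplit('/', 1)[-1]

pages = crawl(['https://example.com/song-a', 'https://example.com/song-b'],
              fake_scrape, delay=0)
print(pages)
```

In the lab itself, the function written for part (a) would take the place of `fake_scrape`, and the list from part (b) would supply the URLs.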

(a)

In [24]:
genius_token = os.environ['genius_token']
urltoscrape = 'https://genius.com/Lizzo-good-as-hell-lyrics'
#urltoscrape = 'https://genius.com/Lizzo-cuz-i-love-you-lyrics'
r = requests.get(urltoscrape,
               headers = {'User-Agent': useragent,
                         'From': 'bwl5cd@virginia.edu'})
my_html = BeautifulSoup(r.text, 'html.parser')

In [25]:
lyrics = my_html.find_all(class_ = 'Lyrics__Container-sc-1ynbvzw-6 YYrds')
lyrics = [search_result.get_text(separator="\n") for search_result in lyrics]
lyrics = '\n'.join(lyrics)
print(lyrics)

[Chorus]
I do my hair toss, check my nails
Baby, how you feelin'? (Feelin' good as hell)
Hair toss, check my nails
Baby, how you feelin'? (Feelin' good as hell)
[Verse 1]
Woo, child, tired of the bullshit
Go on, dust your shoulders off, keep it moving
Yes, Lord, tryna get some new shit
In there, swimwear, going-to-the-pool shit
Come now, come dry your eyes
You know you a star, you can touch the sky
I know that it's hard, but you have to try
If you need advice, let me simplify
[Pre-Chorus]
If he don't love you anymore
Just walk your fine ass out the door
[Chorus]
I do my hair toss, check my nails
Baby, how you feelin'? (Feelin' good as hell)
Hair toss, check my nails
Baby, how you feelin'? (Feelin' good as hell)
(Feeling good as hell)
Baby, how you feelin'? (Feelin' good as hell)
[Verse 2]
Woo, girl, need to kick off your shoes
Gotta take a deep breath, time to focus on you
All the big fights, long nights that you been through
I got a bottle of tequila I been saving for you
Boss up and 

(b)

In [26]:
links = my_html.find_all(class_ = 'AlbumTracklist__Container-sc-123giuo-0 kGJQLs')
links = [link.a for link in links[0].contents]
links = [link['href'] for link in links if link is not None]
links

['https://genius.com/Lizzo-cuz-i-love-you-lyrics',
 'https://genius.com/Lizzo-like-a-girl-lyrics',
 'https://genius.com/Lizzo-juice-lyrics',
 'https://genius.com/Lizzo-soulmate-lyrics',
 'https://genius.com/Lizzo-jerome-lyrics',
 'https://genius.com/Lizzo-cry-baby-lyrics',
 'https://genius.com/Lizzo-tempo-lyrics',
 'https://genius.com/Lizzo-exactly-how-i-feel-lyrics',
 'https://genius.com/Lizzo-better-in-color-lyrics',
 'https://genius.com/Lizzo-heaven-help-me-lyrics',
 'https://genius.com/Lizzo-lingerie-lyrics',
 'https://genius.com/Lizzo-boys-lyrics',
 'https://genius.com/Lizzo-truth-hurts-lyrics',
 'https://genius.com/Lizzo-water-me-lyrics',
 'https://genius.com/Lizzo-good-as-hell-remix-lyrics']

(c)

In [27]:
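def get_lyrics(url, useragent):
    """Download a Genius lyrics page and return its lyrics as one string."""
    r = requests.get(url,
                     headers = {'User-Agent': useragent,
                                'From': 'bwl5cd@virginia.edu'})
    my_html = BeautifulSoup(r.text, 'html.parser')
    # Each lyrics container holds one block of the song; join the blocks with newlines
    lyrics = my_html.find_all(class_ = 'Lyrics__Container-sc-1ynbvzw-6 YYrds')
    lyrics = [search_result.get_text(separator="\n") for search_result in lyrics]
    lyrics = '\n'.join(lyrics)
    return lyrics

# The spider: apply the scraper to every track URL collected in part (b)
all_lyrics = [get_lyrics(url, useragent) for url in links]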

In [28]:
for i in all_lyrics:
    print(i)
    print('\n')

[Intro]
I'm cryin’ 'cause I love you, oh
(Ya ya ya, ya ya, ya ya)
[Verse 1]
Never been in love before
What the fuck are fucking feelings, yo?
Once upon a time, I was a ho
I don't even wanna ho no mo’
Got you something from the liquor store
Little bit of Lizzo and some Mo
Tryna open up a little mo'
Sorry if my heart a little slow
[Chorus]
I thought that I didn't care
I thought I was love-impaired
But baby, baby
I don't know what I'm gon' do
I'm cryin' ’cause I love you, oh
Yes, you (Ya ya, ya ya)
[Verse 2]
Got me standing in the rain
Gotta get my hair pressed again
I would do it for you all, my friend
Ready, baby? Will you be my man?
Wanna put you on a plane
Fly you out to wherever I am
Catch you on the low, I was ashamed
Now I’m crazy, 'bout to tat your name
[Chorus]
I thought that I didn’t care
I thought I was love-impaired
But baby, oh baby
I don't know what I'm gon' do
I’m cryin' 'cause I love you, yeah
I'm cryin', hey
[Chorus]
I thought that I didn't care
I thought I was love-impai

(6) Finding the hidden API:

APIs are the primary mechanism for transferring data over the internet, but most APIs are for internal use for a website and are not intended for outside users. When this is the case, there won't be an obvious link to the API and there won't be any documentation. You can still sometimes get access to the API. This exercise will guide you through one instance of finding and using a hidden API.

It's the time of year for Spirit Halloween stores to pop up. They often take over the stores where large chains have recently met their demise. Go to the store locator page: https://stores.spirithalloween.com/ Notice that the top hit says "Former Sears". I want to know how often Sears appears in the descriptions of these stores, what other dead chains show up, and how frequently. If you right click on this page and view source, the data about the nearby stores does not appear. Instead, this website is calling a hidden API that we can access.

For this problem, use the webpage inspector in the Mozilla Firefox web browser. If you don't have Firefox, download it here: https://www.mozilla.org/en-US/firefox/new/

**Step 1**: With Firefox, go to https://stores.spirithalloween.com/ Right click on this page, and select "Inspect". (You can still use Chrome or anything you want for Jupyter Lab, just have Firefox open for the inspect tool).

**Step 2**: Inspect is a complicated but extremely useful tool. It reports all APIs and other web-based connections that a website makes. Click on Network to see these connections. These are all the calls the Spirit Halloween website makes to various APIs to display images, maps, ads, and addresses. 

**Step 3**: In the right-hand window within the Inspect tool, click on Response. Our task is to sift through the various API calls and to look at the responses until we see the specific address information we need. This took me a long, long time, and I'll save you that trouble -- find the Domain maps.spirithalloween.com, click it, and look at the JSON that appears under Response. Scroll down until you see the addresses and the phrase "Former Sears". If you do not see maps.spirithalloween.com under Domains, reload the page and look again.

**Step 4**: Now that we know the API call that Spirit Halloween used to get the addresses, we can deduce the root, endpoint, and some of the parameters. Hover your mouse over file. It displays: https://maps.spirithalloween.com/api/getAsyncLocations?template=domain&level=domain. So the root should be https://maps.spirithalloween.com, the endpoint should be /api/getAsyncLocations, and two of the parameters should be template and level, both set equal to "domain".
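
The same decomposition can be done in code with the standard library, which is handy when the URL is long. This is a small sketch using the URL quoted in Step 4; the variable names are my own.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://maps.spirithalloween.com/api/getAsyncLocations?template=domain&level=domain'
parts = urlsplit(url)

root = parts.scheme + '://' + parts.netloc   # the API root
endpoint = parts.path                         # the endpoint
# parse_qs returns a list of values per key; keep the first value of each
params = {k: v[0] for k, v in parse_qs(parts.query).items()}
print(root, endpoint, params)
```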

**Step 5**: Take a closer look at Response. There are other parameters here we can use. At the bottom of the JSON output is a key named "options". I didn't know this for sure, but my bet was that the key-value pairs inside "options" can be changed as parameters in the call to the API. My goal is to get all of the Spirit Halloween stores, not just the ones near me. So some of the values I wanted to change are lat and lng, which define my location, radius which defines the distance in miles from my location, and limit, which I bet specifies the maximum number of results. I want these values to be 'lat': 40.380028, 'lng': -97.910156, which places my location in the middle of the country, 'radius': 1800 which captures the entire lower 48 states, and 'limit': 1500 which exceeds the number of Spirit Halloween stores (I had to guess-and-check that).

For this problem, take it from here. Issue an API call to https://maps.spirithalloween.com with endpoint /api/getAsyncLocations. Provide your user-agent and email address in the headers, and set the parameters as discussed in step 5. Find a way to extract the descriptive phrases such as 'Former Sears' and 'Next to Ashley Furniture' from each store's address and store them in a list. Filter this list to only the elements that contain the word "Former". Then report the frequencies. If you saved these phrases as a list named `formerlocs` you can get frequencies by typing:
```
formerlocsDF = pd.DataFrame()
formerlocsDF['formerlocs'] = formerlocs
formerlocsDF['formerlocs'].value_counts()
```
Hint: even though the output is JSON, some of the data inside the JSON is encoded as HTML, so you will need to use `BeautifulSoup` as well as the `json` package. [4 points]
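
The hint can be illustrated with a tiny made-up payload: decode the JSON first, then parse each embedded HTML fragment. The `markers`, `info`, and `address-two` names mirror the ones used in this notebook's solution, but the payload itself is invented, so the structure of the real response should be confirmed in the browser inspector.

```python
import json
from bs4 import BeautifulSoup

# Made-up payload for illustration: JSON whose 'info' values are HTML fragments
payload = json.dumps({'markers': [
    {'info': '<div class="address-one">1 Main St</div><div class="address-two">Former Sears</div>'},
    {'info': '<div class="address-one">2 Oak Ave</div><div class="address-two">Next to Ashley Furniture</div>'},
]})

data = json.loads(payload)  # step 1: decode the JSON
phrases = []
for marker in data['markers']:
    snippet = BeautifulSoup(marker['info'], 'html.parser')  # step 2: parse the embedded HTML
    tag = snippet.find(class_='address-two')
    if tag is not None:  # skip markers that have no descriptive phrase
        phrases.append(tag.get_text(strip=True))

former = [p for p in phrases if 'former' in p.lower()]
print(former)
```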

In [82]:
root = 'https://maps.spirithalloween.com'
spirit_endpoint = '/api/getAsyncLocations'
# Pass the query as a dictionary so that no stray whitespace ends up inside the URL
params = {'template': 'domain', 'level': 'domain',
          'lat': 40.380028, 'lng': -97.910156,
          'radius': 1800, 'limit': 1500}
r = requests.get(root + spirit_endpoint,
                 params = params,
                 headers = headers)

In [83]:
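my_json = json.loads(r.text)
formerlocs = []
for i in my_json['markers']:
    # Each marker's 'info' field is an HTML snippet; the descriptive phrase is in the address-two element
    my_html = BeautifulSoup(i['info'], 'html.parser')
    formerlocs.append(my_html.find_all(class_='address-two')[0].text)
# Keep only the phrases that mention "Former", regardless of capitalization
formerlocs = [loc for loc in formerlocs if 'former' in loc.lower()]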

In [84]:
formerlocsDF = pd.DataFrame()
formerlocsDF['formerlocs'] = formerlocs
formerlocsDF['formerlocs'].value_counts()

Former Sears                                      63
Former Pier 1                                     32
Former Dress Barn                                 25
Former Office Depot                               23
Former Party City                                 20
                                                  ..
Former Jo Anns Fabrics                             1
Former North Face in Outlet Shoppes of El Paso     1
Former Mazda Dealership                            1
Former E G Amish Furniture                         1
Former F21                                         1
Name: formerlocs, Length: 436, dtype: int64