In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab8.ipynb")

# Lab 8

**Note: Gradescope autograder will not work with this notebook because of the difference in how command line commands run in your notebook vs. on Gradescope. You will rely solely on the built-in notebook for feedback.**

**Because of this, Lab 8 will not be graded; however, you are expected to know the material.**

# Question 1: Basics

<!-- BEGIN QUESTION -->

## Question 1a: Downloading from URL

So far, we have used `wget` to download from URLs. There is another command, `curl`, that also downloads from URLs. In this assignment, we will use `curl` and other command line tools to scrape [annual air quality summary data from the EPA](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual).

![aqi-chart](images/aqi-chart.png)

* Visit and [inspect the source code](https://www.lifewire.com/view-web-source-code-4151702) of this EPA webpage: https://aqs.epa.gov/aqsweb/airdata/download_files.html
* Use the `curl` command to download the same URL and [save the output to a file using the `>` redirect](http://swcarpentry.github.io/shell-novice/04-pipefilter/index.html). Name the output file `files.html`
* Use `tail` to print the last 10 lines of `files.html`

_Use `!` or `%%bash` when running any shell command inside the notebook, and replace the `...` placeholders with your answer._


In [None]:
! curl -s https://... > files.html # complete the URL of `download_files.html` and redirect to `files.html`
! tail -n10 files.html # use tail to show last 10 lines of `files.html`


<!-- END QUESTION -->

## Question 1b: Chaining commands

Instead of saving the output to a file and then printing the last ten lines, you can execute a [sequence of commands together using pipes](http://swcarpentry.github.io/shell-novice/04-pipefilter/index.html). Using pipes, write a one-line shell command to replace Question 1a.

You can see from the output that the download progress is printed to the screen, something like this:
```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  307k  100  307k    0     0   319k      0 --:--:-- --:--:-- --:--:--  319k
```
We don't want this to be a part of the html code, so use `curl`'s silent mode to eliminate the progress information.

Save the result to a python variable named `lastlines`.
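
For intuition about the last stage of the pipe, here is a minimal Python sketch (not the shell answer itself) of what `tail -n` does to a stream of lines: it reads everything but keeps only the final few. The helper name `tail` and the sample lines are made up for illustration.

```python
from collections import deque

def tail(lines, n=10):
    """Return the last n items of an iterable, like `tail -n`."""
    # A bounded deque discards older lines as new ones arrive,
    # so the whole stream never needs to be held in memory.
    return list(deque(lines, maxlen=n))

# Hypothetical stand-in for the lines that curl would write into the pipe
sample = [f"line {i}" for i in range(1, 26)]
last_three = tail(sample, n=3)
print(last_three)
```

The shell pipe works the same way: the downstream command only sees the text written to standard output, which is why the progress meter needs to be silenced rather than filtered.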

In [None]:
lastlines = ! curl -s https://... | tail -n10 # combine 1a into one command


In [None]:
grader.check("q1b")

# Question 2: Text Processing

## Question 2a: `grep`

The `grep` command searches lines in a text file. Inspect what kinds of options are available using the help page: `grep --help`.

Use `grep` to find lines with the substring `zip` and print the first five lines using `head`. To avoid downloading the html page over and over again, use `files.html` as your input to `grep`.

Save the resulting list of strings to a python variable named `ziplines`.
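
As a rough mental model (a sketch, not the shell answer), filtering with `grep` and truncating with `head` behaves like the Python below. The function name `grep_head` is hypothetical, and the sample lines mix two lines modeled on the EPA page with made-up filler.

```python
from itertools import islice

# Two lines modeled on the page source, plus hypothetical filler lines
html_lines = [
    "<html>",
    "    the size of the (zipped) file, the number of data rows in the file,",
    '<a href="aqs_sites.zip">Site Listing</a> (20,611 Rows, 982 KB)<BR>',
    "<p>Some unrelated text</p>",
    '<a href="aqs_monitors.zip">Monitor Listing</a> (353,283 Rows, 6,307 KB)<BR><BR>',
]

def grep_head(lines, substring, n):
    """Keep lines containing `substring`, then stop after the first n matches."""
    matches = (line for line in lines if substring in line)
    return list(islice(matches, n))

first_two = grep_head(html_lines, "zip", 2)
print("\n".join(first_two))
```

Note that the first match is plain prose containing "zipped", not a file link, which is the problem Question 2b addresses.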

In [None]:
ziplines = ! grep ... files.html | head -n5 # look for lines containing string, `zip`


In [None]:
grader.check("q2a")

## Question 2b: Regular expression

Note how the word `zip` doesn't just appear as a part of zip file names. We want our `grep` to match a more specific pattern that looks like zip file names. The part of the html that becomes the download link is [`href`](https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_a_href). Let's search for filenames specified by `href`.

Take, for example, the following line:
```
<a href="aqs_monitors.zip">Monitor Listing</a> (353,283 Rows, 6,307 KB)<BR><BR>
```
There are many different and better ways of doing this. We will use `grep` and regular expression patterns to extract the links. A good way to learn regular expressions is to know the basics and experiment.

* [Quick reference of regular expression pattern](https://cheatography.com/davechild/cheat-sheets/regular-expressions/) showing the basic building blocks
* [Online utility for testing regular expression patterns](https://regexr.com)

One quick way could be to find the pattern `href="[some file name]` to find all substrings containing links. Then filter the lines again that contain the word `zip`.

Follow this sequence of commands using pipes:
1. **Find lines with the substring `zip` using `grep` and `files.html`.**  
    The first few lines would look similar to this:
```
    the size of the (zipped) file, the number of data rows in the file,
<a href="aqs_sites.zip">Site Listing</a> (20,611 Rows, 982 KB)<BR>
<a href="aqs_monitors.zip">Monitor Listing</a> (353,283 Rows, 6,307 KB)<BR><BR>
            <TD style="font-size:90%;text-align:center;"><a href="annual_conc_by_monitor_2019.zip">annual_conc_by_monitor_2019.zip</a><BR>51,707 Rows<BR>3,081 KB<BR>As of 2019-11-13</TD>
```
1. **Find lines with `grep` and a [regular expression](http://swcarpentry.github.io/shell-novice/07-find/index.html) pattern that matches URLs: `href="[some-url]"`.**  
    _The pattern `'href=\"[^\"]*'` matches `href="` followed by every character up to (but not including) the next double quote; in a regular expression, `[^a]` matches any single character except `a`. When using `grep` with regular expressions, it is easier to use the `-E` option (for extended regular expressions)._ After adding the `grep` option that prints only the matching substrings (`-o`), you would get something similar to this:
```
href="aqs_sites.zip
href="aqs_monitors.zip
href="annual_conc_by_monitor_2019.zip
href="annual_aqi_by_cbsa_2019.zip
href="annual_aqi_by_county_2019.zip    
```
1. **Strip `href="` by using `sed`**  
    We want just the URLs, so strip the unnecessary `href="` portion. Test using something like `echo "hello there" | sed 's/ll/rr/'`, which replaces `ll` with `rr`. For example, giving `'s/href=\"//g'` tells sed to replace `href="` with an empty string (nothing appears where `rr` was in the test command) globally (trailing `g`).  
    After running `sed`, you would get something like this:
```
aqs_sites.zip
aqs_monitors.zip
annual_conc_by_monitor_2019.zip
annual_aqi_by_cbsa_2019.zip
annual_aqi_by_county_2019.zip    
```
    [_Refer to Unix Power Tools for more information (UCSB NetID required)_](https://proquest-safaribooksonline-com.proxy.library.ucsb.edu:9443/book/operating-systems-and-server-administration/unix/0596003307/34dot-the-sed-stream-editor/upt3_chp_34_sect_3_html)
1. **Save the resulting list of strings (each string is a URL for a zip file) to a Python variable named `flist`.**
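
To check your understanding of the regular expression independently of the shell, the three steps can be mirrored with Python's `re` module. This is only a sketch on sample lines copied from the page excerpt above, not the `grep`/`sed` answer; the variable names are made up.

```python
import re

sample_lines = [
    "    the size of the (zipped) file, the number of data rows in the file,",
    '<a href="aqs_sites.zip">Site Listing</a> (20,611 Rows, 982 KB)<BR>',
    '<a href="aqs_monitors.zip">Monitor Listing</a> (353,283 Rows, 6,307 KB)<BR><BR>',
]

# Step 1: keep lines containing "zip" (like the first grep)
zip_lines = [line for line in sample_lines if "zip" in line]

# Step 2: extract only the matching substrings (like grep -o -E)
href_pattern = re.compile(r'href="[^"]*')
matches = [m for line in zip_lines for m in href_pattern.findall(line)]

# Step 3: delete the leading href=" (like sed 's/href=\"//g')
names = [re.sub(r'href="', "", m) for m in matches]
print("\n".join(names))
```

The prose line containing "(zipped)" survives step 1 but contributes nothing in step 2, because it has no `href="` to match.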



In [None]:
# uncomment one # at a time to see intermediate results
flist = ! grep ... files.html \
#    | grep -o -E ... \
#    | sed ...

print('\n'.join(flist[:5]))

In [None]:
grader.check("q2b")

## Question 3

In the previous question, we created a variable `flist` that contains 1705 file names by scraping a webpage. Hypothetically, you can loop through the list variable and download any file you need for your analysis.
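
A minimal sketch of that loop, which only builds the URLs and downloads nothing: it assumes the scraped names are relative to the directory of the download page, and `flist_sample` is a hypothetical stand-in for the real `flist`.

```python
from urllib.parse import urljoin

# Assumed base: the directory that holds the EPA download files
BASE_URL = "https://aqs.epa.gov/aqsweb/airdata/"

# Hypothetical stand-in for a few entries of flist
flist_sample = ["aqs_sites.zip", "annual_aqi_by_county_2019.zip"]

# Resolve each relative file name against the base URL
urls = [urljoin(BASE_URL, name) for name in flist_sample]
for url in urls:
    print(url)
```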

In this problem, we will work with a small portion of the data: `https://aqs.epa.gov/aqsweb/airdata/annual_aqi_by_county_2019.zip`. Download this using `wget` and unzip it to find `annual_aqi_by_county_2019.csv`.

<!-- BEGIN QUESTION -->

## Question 3a: Inspecting the Header

Use `head` to print the header line of `annual_aqi_by_county_2019.csv` and use `sed` to print each column name on its own line by replacing each comma (`,`) with a newline character (`\n`). Save the result to a Python list variable named `aqiheader`. Each header name will become a list element.
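
The effect of the substitution can be previewed in Python. This sketch uses only the first seven header names (the real file has more columns) and is not the `sed` answer; it also assumes no header name contains a comma, which holds for these seven.

```python
# First seven header fields only; the real header line is longer
header_line = (
    '"State","County","Year","Days with AQI","Good Days",'
    '"Moderate Days","Unhealthy for Sensitive Groups Days"'
)

# Replacing each comma with a newline puts one column name per line
one_per_line = header_line.replace(",", "\n")

# Splitting on newlines gives a list, similar to what the notebook
# stores when shell output is assigned to a variable
columns = one_per_line.splitlines()
print(one_per_line)
```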



In [None]:
aqiheader = ! head -n ... annual_aqi_by_county_2019.csv | sed 's/.../.../g' # your command

print('\n'.join(aqiheader))

In [None]:
grader.check("q3a")

<!-- END QUESTION -->

## Question 3b: Count Locations

How many counties are there in each state? Each row is a county, so you can extract the `State` column and count how many rows there are for each state using `uniq`. Then, use `cat` to number the output lines. Save the result to a python variable `county_counts`. 

First few lines should look like this:
```
   1	     16 "Alabama"
   2	      6 "Alaska"
   3	     13 "Arizona"
   4	     13 "Arkansas"
   5	     53 "California"
```
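
One detail worth knowing: `uniq -c` counts runs of adjacent identical lines, so it only gives per-state totals when rows for the same state sit next to each other. The Python sketch below imitates that behavior with `itertools.groupby` on made-up values; the helper name `uniq_c` is hypothetical and the printed spacing is only approximate.

```python
from itertools import groupby

# Hypothetical State column values; the last "Alabama" is deliberately out of order
states = ['"Alabama"', '"Alabama"', '"Alaska"', '"Arizona"', '"Arizona"', '"Alabama"']

def uniq_c(values):
    """Count runs of adjacent equal values, like `uniq -c`."""
    return [(len(list(run)), key) for key, run in groupby(values)]

counts = uniq_c(states)

# Number the output lines starting at 1, like `cat -n`
for number, (count, state) in enumerate(counts, start=1):
    print(f"{number:6d}\t{count:7d} {state}")
```

Because the stray `"Alabama"` is not adjacent to the others, it is reported as a separate run of length 1. If the input were not already grouped by state, sorting it first would be needed.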


In [None]:
# uncomment one line at a time to see what each line does
county_counts = ! cut -d '...' -f ... annual_aqi_by_county_2019.csv \
#    | grep -v 'State' \
#    | uniq -c \
#    | cat -n


print('\n'.join(county_counts))

In [None]:
grader.check("q3b")

## Question 3c: View CSV in Table

Take the first 5 lines of `annual_aqi_by_county_2019.csv`, extract the first seven columns, and use the `column` command to view the data as a table. Assign the result to a Python variable named `aqi_table`. The result looks like this:
```
"State"    "County"   "Year"  "Days with AQI"  "Good Days"  "Moderate Days"  "Unhealthy for Sensitive Groups Days"
"Alabama"  "Baldwin"  2019    166              140          26               0
"Alabama"  "Clay"     2019    63               58           5                0
"Alabama"  "Colbert"  2019    171              161          10               0
"Alabama"  "DeKalb"   2019    208              188          20               0
```
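
To see how the alignment in this output arises, here is a rough Python imitation of `column -t`: each column is padded to the width of its widest cell, and columns are separated by two spaces. The helper `format_table` is hypothetical, and the rows are a four-column subset of the table above.

```python
rows = [
    ['"State"', '"County"', '"Year"', '"Days with AQI"'],
    ['"Alabama"', '"Baldwin"', "2019", "166"],
    ['"Alabama"', '"Clay"', "2019", "63"],
]

def format_table(rows, gap=2):
    """Left-align each column to its widest cell, roughly like `column -t`."""
    # zip(*rows) transposes the rows so each tuple is one column
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    lines = []
    for row in rows:
        padded = (cell.ljust(width) for cell, width in zip(row, widths))
        # Strip trailing padding left over from the last column
        lines.append((" " * gap).join(padded).rstrip())
    return lines

table = format_table(rows)
print("\n".join(table))
```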


In [None]:
aqi_table = ! head -n ... annual_aqi_by_county_2019.csv | cut -d '...' -f ... | column -t -s','

print('\n'.join(aqi_table))

In [None]:
grader.check("q3c")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(filtering=False, pdf=False, run_tests=True)