# Information Retrieval from Web Scraping

This assignment follows our in-class scenario for scraping sculpture data from the Smithsonian Institution. Recall the story: one of our colleagues in the environmental science program asked for help extracting the Smithsonian Institution's database of outdoor sculptures, available at the following link:

https://collections.si.edu/search/results.htm?view=&dsort=&date.slider=&fq=object_type%3A%22Outdoor+sculpture%22&fq=data_source%3A%22Art+Inventories+Catalog%2C+Smithsonian+American+Art+Museum%22&q=

We have scraped all the result pages from the website and stored them as [Base64](https://docs.python.org/3/library/base64.html#base64.b64decode)-encoded HTML contents (one page per line).
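
As a minimal sketch of what "one Base64-encoded page per line" means in practice, the snippet below decodes a single line back into HTML. The helper name `decode_page` and the sample page are hypothetical, and UTF-8 is an assumption about how the scraped pages were encoded.

```python
import base64


def decode_page(line: str) -> str:
    """Decode one Base64-encoded input line into an HTML string."""
    # Assumption: the scraped pages are UTF-8; undecodable bytes are replaced
    # rather than raising, so one bad page does not fail the whole job.
    return base64.b64decode(line.strip()).decode("utf-8", errors="replace")


# Round trip on a tiny made-up page (not taken from the real data set).
sample_html = '<div class="record"><h2>Example, (sculpture)</h2></div>'
encoded_line = base64.b64encode(sample_html.encode("utf-8")).decode("ascii")
print(decode_page(encoded_line) == sample_html)
```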

Your task is to help us parse these contents and transform each record on the page into a JSON object using DataProc + PySpark. A record is a **`div`** tag with the class name **`record`** (that is, the CSS selector `div.record`).

The JSON object should have the following structure for keys and values.

 * **`Label`**: a string that stores the title of the record (the text of the **`h2`** tag)
 * Key/value pairs extracted from the **`dl`** tag of the record, where each key is the text of a **`dt`** tag and each value is a list of the texts of all its **`dd`** tags. More information on the description list can be found here: [`dl`](https://www.w3schools.com/tags/tag_dl.asp).
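
The structure above can be sketched with BeautifulSoup as follows. This is only an illustration on made-up markup: the helper name `parse_record` is hypothetical, and the real pages may differ in detail (for example nested links inside `dd` tags, or extra punctuation in `dt` text), so inspect the actual HTML before relying on it.

```python
from bs4 import BeautifulSoup


def parse_record(record_tag) -> dict:
    """Turn one div.record tag into a dict shaped like the target JSON object."""
    record = {}
    title = record_tag.find("h2")
    record["Label"] = title.get_text(" ", strip=True) if title else ""

    for dl in record_tag.find_all("dl"):
        key = None
        # Walk dt/dd tags in document order: a dt starts a new key,
        # and each following dd is appended to that key's list.
        for tag in dl.find_all(["dt", "dd"]):
            if tag.name == "dt":
                key = tag.get_text(" ", strip=True)
                record.setdefault(key, [])
            elif key is not None:
                record[key].append(tag.get_text(" ", strip=True))
    return record


# Made-up markup that mimics the described layout (an assumption).
html = """
<div class="record">
  <h2>Old Testament Children's Doors, (sculpture)</h2>
  <dl><dt>Sculptor</dt><dd>Moore, Bruce 1905-1980</dd></dl>
  <dl><dt>Founder</dt><dd>Modern Art Foundry</dd><dd>Associated Ironworkers</dd></dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
records = [parse_record(div) for div in soup.select("div.record")]
print(records[0])
```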

Note that we do not want to include records with `'Owner/Location'` listed as any of `'Unlocated'`, `'Destroyed'`, `'Stolen'`, or `'Anonymous Collection'`.
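
One possible reading of this rule is sketched below. The names `EXCLUDED` and `keep_record` are hypothetical, and the matching rule is an assumption: a record is dropped when any of its `'Owner/Location'` values equals one of the four labels exactly. If the real data embeds these labels inside longer strings, prefix or substring matching may be needed instead to reach the expected record count.

```python
EXCLUDED = {"Unlocated", "Destroyed", "Stolen", "Anonymous Collection"}


def keep_record(record: dict) -> bool:
    """Return True unless any Owner/Location value is an excluded label."""
    # Assumption: records with no 'Owner/Location' key at all are kept.
    locations = record.get("Owner/Location", [])
    return not any(loc.strip() in EXCLUDED for loc in locations)


print(keep_record({"Label": "A", "Owner/Location": ["Located Grace Cathedral"]}))
print(keep_record({"Label": "B", "Owner/Location": ["Destroyed"]}))
```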

**INPUT:** the data is available on our class storage bucket (which can be read directly by PySpark).

**`gs://f22-csc-445/si_by_place.b64.gz`**

A copy of the data is also available on Google Drive for your inspection (please check out the first cell below).

**OUTPUT:** You must output 1 JSON line per input record if the sculpture is not classified as `'Unlocated'`, `'Destroyed'`, `'Stolen'` or `'Anonymous Collection'`. The output should be written to **`gs://f22-csc-445-fc/output-<EMPLID>_<LastName>`**.

Sample output for each line (already prettified for readability):
```JavaScript
{'Label': "Old Testament Children's Doors, (sculpture)",
 'Sculptor': ['Moore, Bruce 1905-1980'],
 'Architect': ['Fox, William B.'],
 'Founder': ['Modern Art Foundry', 'Associated Ironworkers'],
 'Medium': ['Bronze'],
 'Culture': ['French'],
 'Type': ['Sculptures-Outdoor Sculpture', 'Sculptures-Door', 'Sculptures'],
 'Owner/Location': ['Administered by Episcopal Diocese of California 1051 Taylor Street San Francisco California 94108',
  'Located Grace Cathedral Taylor & California Streets Entrance to south tower San Francisco California'],
 'Date': ['1964'],
 'Topic': ['Religion--Old Testament--Joseph',
  'Religion--Old Testament--Moses',
  'Religion--Old Testament--Samuel',
  'Religion--Old Testament--David',
  'Religion--Old Testament--Goliath',
  'Religion--Old Testament--Eli',
  'Allegory--Arts & Sciences--Industry',
  'Allegory--Quality--Fortitude',
  'Religion--Saint--St. Joan of Arc',
  'Occupation--Military--Commander',
  'Ethnic',
  'History--Medieval--France'],
 'Control number': ['IAS CA000992'],
 'Data Source': ['Art Inventories Catalog, Smithsonian American Art Museums'],
 'EDAN-URL': ['edanmdm:siris_ari_331668']}
```
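
The sample above is pretty-printed across many lines; in the output file each record occupies exactly one line. A minimal sketch of that serialization with the standard `json` module follows (the shortened record is illustrative only, and `ensure_ascii=False` is an optional choice that keeps non-ASCII characters readable).

```python
import json

# A shortened record, used only for illustration.
record = {
    "Label": "Old Testament Children's Doors, (sculpture)",
    "Medium": ["Bronze"],
    "Date": ["1964"],
}

# json.dumps emits a single line by default (no indent argument),
# with double-quoted keys and strings as JSON requires.
json_line = json.dumps(record, ensure_ascii=False)
print(json_line)
```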

**SUBMISSION:**
Similar to Homework 3, your submission includes 2 files:
1.  A notebook named `BDM_FC_<EMPL_ID>_<LastName>.ipynb` to show that your script can run successfully on the cluster (use the directive `%%writefile` to store your script contents there)

2. A stand-alone Python file `BDM_FC_<EMPL_ID>_<LastName>.py` that can be run on my cluster setup (similar to yours) using the following command:
```bash
gcloud --quiet dataproc jobs submit pyspark --cluster bdm-fc <YOUR_FILE.PY>
```

**CLUSTER CONFIGURATION:**
The cluster `bdm-fc` will be created with the BeautifulSoup [`bs4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package for your convenience. If you need additional packages to run your task please specify that in your submission.

Instructions for configuring your cluster with additional packages are available [here](https://cloud.google.com/dataproc/docs/tutorials/python-configuration).

## 🔴 IMPORTANT
You CANNOT `collect()` (or `take()` and the like) data at any stage of your pipeline (to retrieve data to your driver code). You MUST process the data entirely using Spark's transformations. After all, this is expected to be a big data problem where we "bring compute to data".
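
A rough sketch of what "transformations only" can look like: the per-partition work lives in a plain generator, and the driver just chains `textFile`, `mapPartitions`, and `saveAsTextFile` without ever pulling records back. All names here (`pages_to_json_lines`, `main`, `extract_records`, `keep`) are hypothetical, the demo uses toy stand-ins for the real parsing and filtering logic, and UTF-8 is an assumption about the page encoding.

```python
import base64
import json


def pages_to_json_lines(lines, extract_records, keep):
    """Lazily turn Base64-encoded pages into JSON lines (runs on executors)."""
    for line in lines:
        html = base64.b64decode(line.strip()).decode("utf-8", errors="replace")
        for record in extract_records(html):
            if keep(record):
                yield json.dumps(record)


def main(input_path, output_path, extract_records, keep):
    # Imported here so the helper above can be tried without a Spark install.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    (sc.textFile(input_path)
       .mapPartitions(lambda lines: pages_to_json_lines(lines, extract_records, keep))
       .saveAsTextFile(output_path))  # an action that writes from the executors


# Local check with toy stand-ins for the real parser and filter.
toy_extract = lambda html: [{"Label": part} for part in html.split("|")]
toy_keep = lambda record: record["Label"] != "skip"
demo_lines = [base64.b64encode(b"a|skip|b").decode("ascii")]
demo_output = list(pages_to_json_lines(demo_lines, toy_extract, toy_keep))
print(demo_output)
```

Because the generator yields one JSON string at a time, no partition has to be materialized as a list, and nothing is returned to the driver.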

## ✔️ SANITY CHECK:
**83,170** records (after filter) in total

# Data download

In [1]:
%%shell
gdown "1--4kNCWgMnHogQP_wFx_XQVYWFgDHjVf&confirm=t"
gunzip si_by_place.b64.gz

UsageError: Cell magic `%%shell` not found.


# Your work

In [None]:
%%writefile BDM_FC_.py

import base64
import json

import pyspark
from bs4 import BeautifulSoup



In [9]:
from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link2">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    print('----')
    print(tag.text.strip())

----
Elsie
----
Lacie
----
Tillie


In [3]:
sistertags

[<a class="sister" href="http://example.com/elsie" id="link1">
      Elsie
     </a>,
 <a class="sister" href="http://example.com/lacie" id="link2">
      Lacie
     </a>,
 <a class="sister" href="http://example.com/tillie" id="link2">
      Tillie
     </a>]