# Asynchronous Tech Assignment

The top half of this document walks you through simple HTML web scraping techniques. Follow along with the video. After you have finished the video, complete the assignment at the bottom of this notebook. Answer the questions on Canvas and upload your saved, executed notebook to receive full credit.

### Simple HTML Web Scraping
The objective of this example is to find one link on a webpage and extract information from it. Here, we will identify the total number of datasets available on data.gov. You could easily adapt this example to another simple task, such as finding a CSV file on a webpage and downloading it.

In [1]:
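#Need to install these for HTML parsing and identification using CSS selectors
!pip install lxml
!pip install cssselect
!pip install bs4
!pip install spacy
#the model name must match the one passed to spacy.load() in the next cell
!python -m spacy download en_core_web_sm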

In [None]:
import pandas as pd

#package for accessing a website and retrieving its HTML source file
import requests

#package for parsing HTML - parsing means that Python will recognize HTML tags for you
from lxml import html

#alternative parsing package
from bs4 import BeautifulSoup as bsoup

#package for regular expressions - more on that later, but think of wildcard searches
import re

#importing a natural language processing package to count words and sentences
import spacy
nlp = spacy.load("en_core_web_sm")

#### Step 1
Get the HTML source file from the website you are interested in.

In [None]:
response = requests.get('http://www.data.gov/')
#print(response.text)

#### Step 2
Have Python parse (interpret) the HTML tags.

In [None]:
#letting Python do the HTML parsing
doc = html.fromstring(response.text)

#### Step 3
Identify the specific element you are interested in. Here, that is the first link (`a` tag) nested inside a `small` tag.

In [None]:
link = doc.cssselect('small a')[0]
print(link.text)

#### Step 4
Develop a regular expression (pattern matching, similar to wildcard matching) that pulls the number out of the link text. You can test how your regular expression behaves on example text at https://pythex.org/

In [None]:
rxDatasets = re.compile(r"(?P<number>\b\d{0,3},?\d{0,3},?\d{1,3}\b)",
    re.IGNORECASE | re.DOTALL,
)

In [None]:
numDataSets = rxDatasets.search(link.text).group("number")
print(numDataSets)
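
The following sketch tries the same pattern on a few sample strings so you can see what the named group `number` captures. The sample strings and the helper `first_number` are hypothetical and only stand in for `link.text`. Note that `search` returns `None` when nothing matches, so calling `.group(...)` directly would fail on text without a number.

```python
import re

# Same pattern as rxDatasets: up to three groups of digits, optionally comma-separated
rx_number = re.compile(r"(?P<number>\b\d{0,3},?\d{0,3},?\d{1,3}\b)")

# Hypothetical stand-ins for link.text
samples = [
    "250,000 datasets",
    "Search over 1,234,567 records",
    "42 datasets",
    "no digits here",
]

def first_number(text):
    """Return the first matched number as a string, or None if there is no match."""
    match = rx_number.search(text)
    return match.group("number") if match else None

results = [first_number(s) for s in samples]
print(results)

# Strip the commas to get integers you can do arithmetic with
counts = [int(r.replace(",", "")) for r in results if r is not None]
print(counts)
```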

### How to do this over multiple websites
We will now turn to the task of scraping Wikipedia articles and generating simple word, number, and sentence counts.

In [None]:
#Starting with 2 Wikipedia articles
websites = ["https://en.wikipedia.org/wiki/Enron",
           "https://en.wikipedia.org/wiki/Enron_scandal"]

#making a blank dataframe to store our results
df_results = pd.DataFrame()

#rx for number of 1 digit numbers
rxOneDig = re.compile(r"\b\d\b",
    re.IGNORECASE | re.DOTALL,
)

#for loop for our websites
for url in websites:
    resp = requests.get(url)
    document = html.fromstring(resp.text)
    
    #identifying just the body of the article by its HTML element ID
    body = document.get_element_by_id("mw-content-text")
    
    #Using beautiful soup to parse HTML and return just text
    actual_text = bsoup(html.tostring(body),'html').text
    #print(actual_text)
    
    #Using a natural language processor for number of words and sentences
    nlp_document = nlp(actual_text)
    num_words=len(nlp_document)
    num_sent=len(list(nlp_document.sents))
    #print(num_words, num_sent)
    
    #using regular expression to find number of one digit numbers
    num_OneDig = len(rxOneDig.findall(actual_text))
    
    #storing into final df (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_results = pd.concat([df_results, pd.DataFrame([{
         "link": url,
         "numberWords": num_words,
         "num1Dig": num_OneDig,
         "numSents": num_sent
     }])], ignore_index=True)
    
df_results
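
A common alternative to growing a dataframe inside the loop is to collect one dictionary per website in a plain list and build the dataframe once at the end. The sketch below shows that pattern under simplifying assumptions: the texts are short made-up strings keyed by placeholder URLs instead of downloaded articles, and words are counted with a whitespace split rather than spaCy, so the counts would differ from the cell above.

```python
import re
import pandas as pd

rx_one_digit = re.compile(r"\b\d\b")

# Hypothetical stand-ins for the text of each scraped article
texts = {
    "https://example.org/a": "Alpha had 3 offices. It closed 2 in 2001.",
    "https://example.org/b": "Beta hired 15 people. No one left.",
}

rows = []
for url, text in texts.items():
    rows.append({
        "link": url,
        "numberWords": len(text.split()),             # rough whitespace word count
        "num1Dig": len(rx_one_digit.findall(text)),   # standalone single digits
    })

# Build the dataframe once, after the loop
df_demo = pd.DataFrame(rows)
print(df_demo)
```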

# Assignment to turn in on Canvas
Modify the code in the cell above to scrape the 5 Wikipedia articles listed below.

You should build a dataframe that stores the "link", the "numberWords", the number of *1 or 2* digit numbers "num1or2Dig", and the "numSents". Using this dataframe and descriptive (summary) statistics, answer the questions on Canvas. Make sure you save and upload this executed notebook to receive full credit.
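
Once the dataframe is built, pandas can produce the summary statistics directly. The sketch below uses a small dataframe with made-up placeholder values (the names `df_example`, `article_a`, and so on are hypothetical) to show `describe()` and a couple of single-column lookups. Your own numbers will come from the scraped articles.

```python
import pandas as pd

# Placeholder values for illustration only; not real article counts
df_example = pd.DataFrame({
    "link": ["article_a", "article_b", "article_c"],
    "numberWords": [100, 200, 300],
    "numSents": [10, 20, 30],
})

# count, mean, std, min, quartiles, and max for every numeric column
summary = df_example.describe()
print(summary)

# Individual statistics can also be pulled out one at a time
mean_words = df_example["numberWords"].mean()
longest = df_example.loc[df_example["numberWords"].idxmax(), "link"]
print(mean_words, longest)
```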

Hint: for one OR two digit numbers, you should modify the regular expression rxOneDig such that it searches for one or two consecutive digits surrounded by a boundary.
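
To see what "surrounded by a boundary" means in practice, the sketch below runs the existing one-digit pattern on two made-up sentences. `\b` matches only where a word character (letter, digit, or underscore) meets a non-word character or the edge of the text, so a digit inside a longer number or glued to a letter is skipped.

```python
import re

# Same pattern as rxOneDig: a single digit with a boundary on each side
rx_one_digit = re.compile(r"\b\d\b")

# Hypothetical sample text
sample = "In 2001, 3 executives sold 12 shares at $5 each; room 7B was empty."
matches = rx_one_digit.findall(sample)
print(matches)

# A period is a non-word character, so both digits of a decimal count separately
decimal_matches = rx_one_digit.findall("Version 3.5")
print(decimal_matches)
```

Only the standalone `3` and the `5` after the dollar sign match in the first sentence; `2001`, `12`, and `7B` do not.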

In [None]:
#5 Wikipedia articles to scrape
websites = ["https://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act",
           "https://en.wikipedia.org/wiki/Dodd%E2%80%93Frank_Wall_Street_Reform_and_Consumer_Protection_Act",
            "https://en.wikipedia.org/wiki/Michael_Lewis",
            "https://en.wikipedia.org/wiki/Tulane_University",
           "https://en.wikipedia.org/wiki/Mardi_Gras_in_New_Orleans"]