[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/assignment_notebooks/Webscraping.ipynb)

# Webscraping Assignment

Reminder: you are permitted to work with another classmate on this assignment. If you do, please submit a single notebook with both of your names at the top.

## Due date

Friday, February 24, 2023 (12:00 pm)

## Assignment description

In this project you will write a Jupyter Notebook or R Markdown file to scrape a selected website. You will need to:

1. Write a function that takes a URL as input and returns the HTML of the page as a string.
2. Inspect the HTML of the page and use regular expressions to extract the documents within the page.
3. Model the documents as a corpus.
4. Analyze the corpus using the bag-of-words model.
5. Implement a TF-IDF model to extract the n most important words for each document in the corpus.

### Objective

This assignment reinforces previous lecture topics on the linguistic background, properties of language, information theory, and Regular Expressions.


## Submission medium

Jupyter Notebook or R Markdown file. See additional instructions in the final section of this document.

## Code Dependencies

You will need to install the following packages:

- `requests`
- `re` (part of the Python standard library; no installation needed)
- `beautifulsoup4`
- `nltk`
- `pandas`
- `numpy`
- `matplotlib`


## Grading

This assignment is worth 10 points. One extra-credit point is added to your final grade if you create a heatmap of the TF-IDF matrix.

## Write a function that takes a URL as input and returns the HTML of the page as a string

### 1.1 Write a function that takes a URL as input and returns the HTML of the page as a string

In [135]:
import requests

def get_html(url) -> str:
    """Get the HTML of a webpage and return the HTML as a string.
    
    Parameters
    ----------
    url : str
        The URL of the webpage to scrape.
    
    Returns
    -------
    str
        The HTML of the webpage as a string.
    """
    ## YOUR CODE HERE
    # The response body is the page's HTML, decoded to a string.
    html = requests.get(url).text
    return html
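
A slightly more defensive variant can be sketched as follows. This is an illustration, not part of the required solution: the name `get_html_checked`, the 10-second default timeout, and the injectable `fetch` parameter (present only so the function can be exercised without network access) are assumptions.

```python
from typing import Callable, Optional


def get_html_checked(url: str, timeout: float = 10.0,
                     fetch: Optional[Callable] = None) -> str:
    """Return the body of `url` as text, raising on HTTP error statuses."""
    if fetch is None:
        # Imported lazily so the sketch can be tried with a stand-in fetcher.
        import requests
        fetch = requests.get
    response = fetch(url, timeout=timeout)
    # Response.raise_for_status() raises an HTTPError for 4xx/5xx replies.
    response.raise_for_status()
    return response.text
```

Calling `get_html_checked(url)` behaves like `get_html(url)` for successful requests, but a missing page or a stalled server surfaces as an exception instead of an error page being parsed as if it were the document.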

### 1.2 Inspect the HTML of the page. Can you identify any patterns in the HTML that might be useful for extracting the documents within the page?

In [136]:
# Extract the HTML source code from the URL (this is the same URL we used in class)
url = "https://www.gutenberg.org/files/1/1-0.txt"

html_source = get_html(url)

### 1.3 Use the BeautifulSoup library to create a BeautifulSoup object from the HTML string

In [137]:
from bs4 import BeautifulSoup as bs4

# YOUR CODE HERE
soup = bs4(html_source, features = 'lxml')

### 1.3 Extract the HTML body text and examine the contents.

In [138]:
# Please explain what the following line of code does in the cell below.
body = soup.find("body")

This line finds the first `<body>` tag in the parsed HTML and returns it as a BeautifulSoup `Tag` object. The text inside that tag is then available as a string through `body.text`.

### 1.4 Use regular expressions to extract the documents within the page

In [139]:
import re

# Your regex here to capture the documents

# Success option 1
doc_extractor = r"(?<=\[Etext #\d])((.|\n)*?)(?=\[Etext #\d]|\*\*\*End of)"

# Explain this line of code in the cell below.
# __Note:__ `re.MULTILINE` only changes how `^` and `$` behave. The pattern
# matches across lines because `(.|\n)` consumes newlines explicitly.
found_documents: list = re.findall(doc_extractor, body.text, flags=re.MULTILINE)

assert len(found_documents) == 9, "Please check your regex. You should have found a total 9 documents."

## If you are having trouble with the regex, remember that you can use regex101.com to test and debug.


Explain: `documents = re.findall(doc_extractor, body.text, re.MULTILINE)`

`re.findall` returns every non-overlapping match of the regular expression, which we pass in as `doc_extractor`, found in the string `body.text`. The pattern starts with a lookbehind, `(?<=\[Etext #\d])`, so a match must begin immediately after a marker such as `[Etext #5]`, where `\d` is a single digit from 0 to 9. From that point on, `((.|\n)*?)` lazily captures any characters, including newlines, in a group. The lookahead `(?=\[Etext #\d]|\*\*\*End of)` ends the group just before the next `[Etext #n]` marker or before `***End of`, which is the case for the final article in the text. This isolates the text of each article in the document and saves it. Because the pattern contains two capture groups, each result is a tuple whose first element is the article text. `re.MULTILINE` makes `^` and `$` match at the beginning and end of each line; this pattern uses neither anchor, so the flag does not change the result here.
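
The behaviour described above can be checked on a small string. The sketch below reuses the same pattern; the `sample` text is a hypothetical miniature of the page, not the real Gutenberg file.

```python
import re

# Same pattern as above: lookbehind for a marker, lazy body, lookahead for the
# next marker or the end-of-file banner.
doc_extractor = r"(?<=\[Etext #\d])((.|\n)*?)(?=\[Etext #\d]|\*\*\*End of)"

# Hypothetical miniature of the page text.
sample = "[Etext #1]\nFirst doc\nline two\n[Etext #2]\nSecond doc\n***End of file"

matches = re.findall(doc_extractor, sample)
# Two capture groups, so each match is a (whole_body, last_character) tuple.
documents = [m[0] for m in matches]
print(documents)  # ['\nFirst doc\nline two\n', '\nSecond doc\n']

# The flag only affects ^ and $, so the result is identical with it.
assert matches == re.findall(doc_extractor, sample, flags=re.MULTILINE)
```

This is also why the later loop reads `doc[0]`: the second element of each tuple is only the last character consumed by the inner `(.|\n)` group.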

## 1.5 Explore the contents of the Documents

In the matched documents, you will find a heading that Project Gutenberg prepended to each text. For this assignment, I have provided a cleaning function that removes these Gutenberg headings for you.

In [140]:
def clean_gutenberg(text: str) -> str:
    """Clean the text of a Gutenberg document.
    
    Parameters
    ----------
    text : str
        The text of a Gutenberg document.
    
    Returns
    -------
    str
        The cleaned text of the document.
    """
    text = re.sub(r"\[Etext #\d+\]", "", text)
    text = re.sub(r"(\r\n)+", " ", text)
    text = re.sub(r"^ ?The Project Gutenberg.*?Independence\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*The Project Gutenberg Etext of The U. S. Bill of Rights\*\*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?November.*?EST", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, USA", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*\*The Project.*?corrections\. \*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?1775\.", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?Officially.*?calendar\]", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, 1865", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?, 1861", "", text, flags=re.MULTILINE)
    
    return text.strip()

In [141]:
corpus = []

for i, doc in enumerate(found_documents):
    # YOUR CODE HERE
    cleaned_doc = clean_gutenberg(doc[0])
    corpus.append(cleaned_doc)

In [142]:
corpus

["THE DECLARATION OF INDEPENDENCE OF THE UNITED STATES OF AMERICA When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume, among the Powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on 

# Analyze the above corpus of documents using TF-IDF

In the following cells, I would like you to complete these preprocessing steps:

1. Tokenize the documents
2. Lemmatize the tokens
3. Remove stop words
4. Remove punctuation
5. Apply TF-IDF to the corpus
    * You can write a TF-IDF model from scratch or use the `sklearn` library

_tip: see lecture notebooks 4, 5, and 6 for examples of how to work with pandas_


In [143]:
### TIP ###
## if you want to work with pandas create a dataframe with documents as rows and columns for the document number and the text
import pandas as pd
corpus = pd.DataFrame({"docID": range(len(corpus)), "text": corpus})

## Tokenize the documents

In [144]:
## Your code here
#Trying to use NLTK to tokenize
import nltk
from nltk.tokenize import word_tokenize

In [145]:
tokens = []
for i in range(len(corpus['text'])):
    doc_tokens = word_tokenize(corpus['text'][i].lower())
    tokens.append(doc_tokens)

In [146]:
corpus['tokens'] = tokens

In [147]:
corpus

Unnamed: 0,docID,text,tokens
0,0,THE DECLARATION OF INDEPENDENCE OF THE UNITED ...,"[the, declaration, of, independence, of, the, ..."
1,1,The United States Bill of Rights. The Ten Orig...,"[the, united, states, bill, of, rights, ., the..."
2,2,We observe today not a victory of party but a ...,"[we, observe, today, not, a, victory, of, part..."
3,3,"Four score and seven years ago, our fathers br...","[four, score, and, seven, years, ago, ,, our, ..."
4,4,THE CONSTITUTION OF THE UNITED STATES OF AMERI...,"[the, constitution, of, the, united, states, o..."
5,5,No man thinks more highly than I do of the pat...,"[no, man, thinks, more, highly, than, i, do, o..."
6,6,"In the name of God, Amen. We, whose names are...","[in, the, name, of, god, ,, amen, ., we, ,, wh..."
7,7,Fellow countrymen: At this second appearing t...,"[fellow, countrymen, :, at, this, second, appe..."
8,8,Fellow citizens of the United States: in comp...,"[fellow, citizens, of, the, united, states, :,..."


In [148]:
token_corpus = (corpus
               .explode('tokens')
               .drop(columns = ['text']))
token_corpus

Unnamed: 0,docID,tokens
0,0,the
0,0,declaration
0,0,of
0,0,independence
0,0,of
...,...,...
8,8,angels
8,8,of
8,8,our
8,8,nature


## Lemmatize the tokens

In [149]:
## Your code here
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import string
nltk.download('stopwords')

# A Porter stemmer is used here in place of a true lemmatizer.
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ethan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [150]:
lemmas = []
for token in token_corpus['tokens'].to_list():
    lemmas.append(stemmer.stem(token))

In [151]:
token_corpus['lemmas'] = lemmas

In [152]:
token_corpus

Unnamed: 0,docID,tokens,lemmas
0,0,the,the
0,0,declaration,declar
0,0,of,of
0,0,independence,independ
0,0,of,of
...,...,...,...
8,8,angels,angel
8,8,of,of
8,8,our,our
8,8,nature,natur


## Remove stop words

You can use the `nltk` library to remove stop words. You can also use the `SpaCy` library to remove stopwords.

In [153]:
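## Your code here
# A set makes the membership test below fast.
stop_words = set(stopwords.words('english'))
non_stop_words = []
# Note: the comparison uses stems, so stop words whose stem differs from the
# original word (e.g. "has" -> "ha") are not removed.
for lemma in token_corpus['lemmas'].to_list():
    if lemma in stop_words:
        non_stop_words.append('')
    else:
        non_stop_words.append(lemma)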

In [154]:
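token_corpus['non_stop_words'] = non_stop_words
# .copy() makes the filtered frame independent, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
token_corpus = token_corpus[token_corpus['non_stop_words'] != ''].copy()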
token_corpus

Unnamed: 0,docID,tokens,lemmas,non_stop_words
0,0,declaration,declar,declar
0,0,independence,independ,independ
0,0,united,unit,unit
0,0,states,state,state
0,0,america,america,america
...,...,...,...,...
8,8,",",",",","
8,8,better,better,better
8,8,angels,angel,angel
8,8,nature,natur,natur
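
One caveat of filtering stems rather than tokens: the Porter stemmer rewrites some stop words (for example, `has` becomes `ha`, which appears later in this notebook's TF-IDF output), so they no longer match the stop-word list. The sketch below illustrates the effect in plain Python. The three-word stop list and the hard-coded token/stem pairs are illustrative assumptions standing in for `stopwords.words('english')` and `PorterStemmer`.

```python
# Hypothetical stand-ins: a tiny stop-word list and (token, stem) pairs.
stop_words = {"the", "of", "has"}
pairs = [("the", "the"), ("declaration", "declar"), ("of", "of"), ("has", "ha")]

# Filtering on the stem lets the stemmed stop word "ha" through.
kept_by_stem = [stem for token, stem in pairs if stem not in stop_words]

# Filtering on the original token removes it before the stem is used.
kept_by_token = [stem for token, stem in pairs if token not in stop_words]

print(kept_by_stem)   # ['declar', 'ha']
print(kept_by_token)  # ['declar']
```

Comparing the `tokens` column against the stop-word list, or removing stop words before stemming, would be one way to avoid this.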


## Remove punctuation

In [155]:
## Your code here
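# Single punctuation characters and single digits only; longer tokens such as
# "--" or "1972" are not in this list and are therefore kept.
punct = list(string.punctuation) + list(string.digits)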
non_punct = []
for word in token_corpus['non_stop_words'].to_list():
    if word in punct:
        non_punct.append('')
    else:
        non_punct.append(word)

In [156]:
token_corpus['non_punct'] = non_punct
token_corpus = token_corpus[token_corpus['non_punct'] != '']
token_corpus


Unnamed: 0,docID,tokens,lemmas,non_stop_words,non_punct
0,0,declaration,declar,declar,declar
0,0,independence,independ,independ,independ
0,0,united,unit,unit,unit
0,0,states,state,state,state
0,0,america,america,america,america
...,...,...,...,...,...
8,8,touched,touch,touch,touch
8,8,surely,sure,sure,sure
8,8,better,better,better,better
8,8,angels,angel,angel,angel
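
The membership test above only removes tokens that are exactly one punctuation character or one digit, so multi-character tokens such as `--` or `1972` remain in the table. A possible alternative, sketched here with a hypothetical helper `has_letter`, keeps only tokens that contain at least one alphabetic character.

```python
def has_letter(token: str) -> bool:
    """Return True if the token contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in token)


# Hypothetical sample of tokens like those in the table above.
sample_tokens = ["declar", "--", "1972", "'s", ",", "''", "state"]
kept = [tok for tok in sample_tokens if has_letter(tok)]
print(kept)  # ['declar', "'s", 'state']
```

Whether clitics such as `'s` should also be dropped is a separate modelling choice; this sketch keeps them.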


## Analyze the documents and corpus using TF-IDF

In [157]:
## Your code here
term_frequency = (token_corpus
                  .groupby(by=['docID', 'non_punct'])
                  .agg({'non_punct': 'count'})
                  .rename(columns={'non_punct': 'term_frequency'})
                  .reset_index()
                  .rename(columns={'non_punct': 'term'})
                 )

In [158]:
term_frequency

Unnamed: 0,docID,term,term_frequency
0,0,'s,1
1,0,--,1
2,0,1972,1
3,0,abdic,1
4,0,abolish,4
...,...,...,...
3383,8,written,4
3384,8,wrong,1
3385,8,year,4
3386,8,yet,3
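
The `groupby(...).agg(...)` chain above counts how often each term occurs in each document. The same idea can be sketched without pandas using `collections.Counter`; the `rows` list is hypothetical data shaped like the filtered token table.

```python
from collections import Counter

# Hypothetical (docID, term) rows, one per token occurrence.
rows = [(0, "state"), (0, "state"), (0, "right"), (1, "state"), (1, "juri")]

# Counting identical (docID, term) pairs gives the per-document term frequency.
term_counts = Counter(rows)
print(term_counts[(0, "state")])  # 2
print(term_counts[(1, "juri")])   # 1
```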


In [159]:
document_frequency = (term_frequency
                      .groupby(['docID', 'term'])
                      .size()
                      .unstack()
                      .sum()
                      .reset_index()
                      .rename(columns={0: 'document_frequency'})
                     )

In [160]:
document_frequency

Unnamed: 0,term,document_frequency
0,'',4.0
1,'d,1.0
2,'s,4.0
3,--,5.0
4,.a,1.0
...,...,...
1986,year,6.0
1987,yet,3.0
1988,york,1.0
1989,young,1.0
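
Because `term_frequency` holds one row per (docID, term) pair, `.size()` yields 1 for every pair, `.unstack()` pivots these into a document-by-term grid with `NaN` where a term is absent, and `.sum()` then counts the documents containing each term (as floats, because of the `NaN` values). A plain-Python sketch of the same count, on hypothetical pairs:

```python
from collections import defaultdict

# Hypothetical (docID, term) pairs, one per row of the term-frequency table.
pairs = [(0, "state"), (0, "right"), (1, "state"), (1, "juri"), (2, "state")]

# Collect the set of documents in which each term appears.
docs_per_term = defaultdict(set)
for doc_id, term in pairs:
    docs_per_term[term].add(doc_id)

# Document frequency = number of distinct documents containing the term.
document_frequency = {term: len(docs) for term, docs in docs_per_term.items()}
print(document_frequency)  # {'state': 3, 'right': 1, 'juri': 1}
```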


In [161]:
term_frequency = term_frequency.merge(document_frequency)
term_frequency

Unnamed: 0,docID,term,term_frequency,document_frequency
0,0,'s,1,4.0
1,2,'s,2,4.0
2,4,'s,2,4.0
3,7,'s,4,4.0
4,0,--,1,5.0
...,...,...,...,...
3383,8,withal,1,1.0
3384,8,withhold,1,1.0
3385,8,wors,1,1.0
3386,8,written,4,1.0


In [162]:
import numpy as np
documents_in_corpus = term_frequency['docID'].nunique()
term_frequency['idf'] = np.log((1 + documents_in_corpus) / (1 + term_frequency['document_frequency'])) + 1
term_frequency['tfidf'] = term_frequency['term_frequency'] * term_frequency['idf']
term_frequency.sort_values(by=['term_frequency'], ascending=False)

Unnamed: 0,docID,term,term_frequency,document_frequency,idf,tfidf
970,4,shall,191,9.0,1.000000,191.000000
989,4,state,132,5.0,1.510826,199.428982
1108,4,unit,55,5.0,1.510826,83.095409
64,4,ani,42,8.0,1.105361,46.425142
735,4,offic,34,5.0,1.510826,51.368071
...,...,...,...,...,...,...
1463,7,almighti,1,4.0,1.693147,1.693147
1464,8,almighti,1,4.0,1.693147,1.693147
1466,7,alway,1,3.0,1.916291,1.916291
1469,7,american,1,3.0,1.916291,1.916291
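
The `idf` column uses the smoothed form idf = ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the document frequency of the term. The values in the table can be reproduced by hand; the helper name `smoothed_idf` is an assumption made for this sketch.

```python
import math


def smoothed_idf(n_documents: int, document_frequency: int) -> float:
    """idf = ln((1 + N) / (1 + df)) + 1, the formula used in the cell above."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_documents = 9  # the corpus holds nine documents

# A term found in all nine documents ("shall") gets the minimum weight of 1.
print(round(smoothed_idf(n_documents, 9), 6))  # 1.0

# A term found in five documents ("state").
print(round(smoothed_idf(n_documents, 5), 6))  # 1.510826

# tf-idf of "state" in document 4, where its term frequency is 132.
print(round(132 * smoothed_idf(n_documents, 5), 6))
```

The `+ 1` terms keep a term that occurs in every document from being weighted to zero.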


In [175]:
from sklearn import preprocessing
# axis=0 scales the whole tfidf column by one global L2 norm (not per document).
term_frequency['tfidf_norm'] = preprocessing.normalize(term_frequency[['tfidf']], axis=0, norm='l2')
top_n_terms = term_frequency.sort_values(by=['docID', 'tfidf'], ascending=[True, False]).groupby(['docID']).head(5)
top_n_terms

Unnamed: 0,docID,term,term_frequency,document_frequency,idf,tfidf,tfidf_norm
482,0,ha,20,5.0,1.510826,30.216512,0.066227
1118,0,us,11,5.0,1.510826,16.619082,0.036425
986,0,state,10,5.0,1.510826,15.108256,0.033114
929,0,right,10,6.0,1.356675,13.566749,0.029735
465,0,govern,10,7.0,1.223144,12.231436,0.026808
967,1,shall,17,9.0,1.0,17.0,0.03726
987,1,state,8,5.0,1.510826,12.086605,0.026491
930,1,right,7,6.0,1.356675,9.496725,0.020815
607,1,law,6,6.0,1.356675,8.14005,0.017841
581,1,juri,4,3.0,1.916291,7.665163,0.0168
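
In the table above, `tfidf_norm` is `tfidf` divided by the same constant in every row, so the ranking within each document is unchanged. If the aim were unit-length document vectors instead, each document would be normalized separately. A plain-Python sketch with made-up weights (the `rows` data is hypothetical):

```python
import math
from collections import defaultdict

# Hypothetical (docID, term, tfidf) rows; the weights are made up.
rows = [(0, "state", 3.0), (0, "right", 4.0), (1, "shall", 6.0), (1, "juri", 8.0)]

# Sum of squared weights per document.
squared = defaultdict(float)
for doc_id, _, weight in rows:
    squared[doc_id] += weight ** 2

# Divide each weight by the L2 norm of its own document.
normalized = [(doc_id, term, weight / math.sqrt(squared[doc_id]))
              for doc_id, term, weight in rows]
print(normalized)
```

After this step, the squared weights of each document sum to 1, which makes weights comparable across documents of different lengths.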


In [179]:
# Import altair for graphing the n highest terms in a heatmap

import altair as alt

# adding a little randomness to break ties in term ranking
top_tfidf_rand = top_n_terms.copy()
top_tfidf_rand['tfidf'] = top_tfidf_rand['tfidf_norm'] + np.random.rand(top_n_terms.shape[0])*0.0001

base = alt.Chart(top_tfidf_rand).encode(
    x = 'rank:O',
    y = 'docID:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["docID"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the two superimposed layers (heatmap and text labels)
(heatmap + text).properties(width = 600)

# Submission Instructions

Please submit your assignment as a Jupyter Notebook or R Markdown file. You can submit your assignment as a link to a Google Colab notebook or a link to a GitHub repository. If you are submitting a link to a GitHub repository, please make sure that your repository is public. If you email the notebook to me, please zip the file before sending it.