# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name: Anthony Schomer 
### https://github.com/anythonyschomer/web-scraping

Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

In [12]:
Complete the tasks in the Python Notebook in this repository.
Make sure to add and push the pkl or text file of your scraped html (this is specified in the notebook)

## Rubric

* (Question 1) Article html stored in separate file that is committed and pushed: 1 pt
* (Question 2) Article text is correct: 1 pt
* (Question 3) Correct (or equivalent in the case of multiple tokens with same frequency) tokens printed: 1 pt
* (Question 4) Correct (or equivalent in the case of multiple lemmas with same frequency) lemmas printed: 1 pt
* (Question 5) Correct scores for first sentence printed: 2 pts (1 / function)
* (Question 6) Histogram shown with appropriate labelling: 1 pt
* (Question 7) Histogram shown with appropriate labelling: 1 pt
* (Question 8) Thoughtful answer provided: 1 pt


import requests
import pickle

# Make a GET request to the URL
url = 'https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Extract the HTML content from the response
    html_content = response.text

    # Dump the HTML content into a pickle file
    with open('article.pkl', 'wb') as f:
        pickle.dump(html_content, f)
        print('Article HTML dumped into "article.pkl" file.')
else:
    print('Failed to retrieve article HTML.')


import pickle
from collections import Counter

import matplotlib.pyplot as plt
import requests
import spacy
from bs4 import BeautifulSoup

# Using the Counter class from the collections module
my_list = [1, 2, 3, 3, 3, 4, 4, 5]
counter = Counter(my_list)
print(counter)

# Using the pickle module
data = {'name': 'John', 'age': 30}
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

# Using the requests module
response = requests.get('https://example.com')
print(response.status_code)

# Using the spaCy library (the package is imported as lower-case `spacy`)
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is an example sentence')
print([token.text for token in doc])

# Using the BeautifulSoup class from the bs4 module
# (the markup needs a <title> tag for soup.title to be found)
html = '<html><head><title>Title</title></head></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)

# Using the pyplot module from matplotlib
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y)
plt.show()

!pip list

print('All prereqs installed.')

2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

In [None]:
import pickle
from bs4 import BeautifulSoup

# Read the HTML content from the pickle file
with open('article.pkl', 'rb') as f:
    html_content = pickle.load(f)

# Create a BeautifulSoup object for parsing the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the text from the HTML using .get_text()
text = soup.get_text()

# Print the extracted text
print(text)

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [None]:
import pickle
from collections import Counter

import spacy
from bs4 import BeautifulSoup

# Load the article text from the HTML saved in question 1
with open('article.pkl', 'rb') as f:
    article_text = BeautifulSoup(pickle.load(f), 'html.parser').get_text()

# Load the spaCy English pipeline
nlp = spacy.load("en_core_web_sm")

# Process the article text using the spaCy pipeline
doc = nlp(article_text)

# Tokenize the text, convert to lowercase, and filter out unwanted tokens
tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_space and not token.is_stop]

# Get the most frequent tokens
most_common_tokens = Counter(tokens).most_common(5)

# Print the most common tokens
print("5 Most Frequent Tokens:")
for token, frequency in most_common_tokens:
    print(token, "-", frequency)

# Get all tokens with their frequencies
all_token_frequencies = Counter(tokens)

# Print all tokens with their frequencies
print("\nToken Frequencies:")
for token, frequency in all_token_frequencies.items():
    print(token, "-", frequency)

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).
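
One way to approach question 4 is to repeat the token count from question 3 with `token.lemma_` in place of `token.text`. The sketch below is only an illustration: the helper name `frequent_lemmas` and the stand-in `Tok` tuple are hypothetical and exist so the snippet runs without a model. In the notebook, `doc` would be `nlp(article_text)` from the `en_core_web_sm` pipeline.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in exposing the spaCy token attributes used below.
Tok = namedtuple("Tok", "lemma_ is_punct is_space is_stop")


def frequent_lemmas(doc, n=5):
    """Return the n most common lower-cased lemmas, skipping
    punctuation, whitespace, and stop words."""
    lemmas = [
        tok.lemma_.lower()
        for tok in doc
        if not (tok.is_punct or tok.is_space or tok.is_stop)
    ]
    return Counter(lemmas).most_common(n)


# Tiny demo; with spaCy, pass nlp(article_text) instead of this list.
demo = [
    Tok("laser", False, False, False),
    Tok("be", False, False, True),       # stop word, skipped
    Tok("Laser", False, False, False),   # counted together with "laser"
    Tok(".", True, False, False),        # punctuation, skipped
    Tok("headlight", False, False, False),
]

print("Most frequent lemmas:")
for lemma, freq in frequent_lemmas(demo):
    print(f"{lemma}: {freq}")
```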

5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting words appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in-class notes useful; feel free to reuse those methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in questions 3 and 4.
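
A minimal sketch of the two scoring functions, under one stated assumption: "words in the sentence" is taken to mean tokens that are neither punctuation nor whitespace. The shared helper `_count_hits` and the stand-in `Tok` tuple are hypothetical names introduced for illustration; in the notebook the sentence would be a spaCy span such as `list(doc.sents)[0]`.

```python
from collections import namedtuple

# Hypothetical stand-in with the spaCy token attributes used below.
Tok = namedtuple("Tok", "text lemma_ is_punct is_space")


def _count_hits(sentence, interesting, attr):
    """Share of words in `sentence` whose `attr` (lower-cased) is interesting."""
    # Assumption: "words" are tokens that are not punctuation or whitespace.
    words = [t for t in sentence if not (t.is_punct or t.is_space)]
    if not words:
        return 0.0
    wanted = {w.lower() for w in interesting}
    hits = sum(1 for t in words if getattr(t, attr).lower() in wanted)
    return hits / len(words)


def score_sentence_by_token(sentence, interesting_token):
    return _count_hits(sentence, interesting_token, "text")


def score_sentence_by_lemma(sentence, interesting_lemmas):
    return _count_hits(sentence, interesting_lemmas, "lemma_")


# Toy sentence "Lasers are bright." standing in for the article's first sentence.
first_sentence = [
    Tok("Lasers", "laser", False, False),
    Tok("are", "be", False, False),
    Tok("bright", "bright", False, False),
    Tok(".", ".", True, False),
]
print("Token score:", score_sentence_by_token(first_sentence, ["lasers"]))
print("Lemma score:", score_sentence_by_lemma(first_sentence, ["laser", "bright"]))
```

Returning `0.0` for a sentence with no words avoids a division by zero on whitespace-only sentences, which scraped pages often produce.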

6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
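
The list-then-plot step might be sketched as follows. The names `sentence_scores`, `plot_score_histogram`, and `toy_score` are hypothetical; `toy_score` is only a stand-in for `score_sentence_by_token` that works on plain word lists so the snippet runs without spaCy. matplotlib is imported inside the plotting function, so it is needed only when a plot is actually drawn.

```python
def sentence_scores(sentences, score_fn, interesting):
    """Apply score_fn(sentence, interesting) to every sentence."""
    return [score_fn(sent, interesting) for sent in sentences]


def plot_score_histogram(scores, title):
    """Draw a labelled histogram of sentence scores."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    plt.hist(scores, bins=10)
    plt.title(title)
    plt.xlabel("Sentence score")
    plt.ylabel("Number of sentences")
    plt.show()


def toy_score(words, interesting):
    # Stand-in for score_sentence_by_token, operating on plain word lists.
    if not words:
        return 0.0
    return sum(w in interesting for w in words) / len(words)


sentences = [
    ["laser", "headlights", "work"],
    ["they", "are", "bright"],
    ["laser", "laser"],
]
scores = sentence_scores(sentences, toy_score, {"laser"})
print(scores)
```

In the notebook, the call would look something like `plot_score_histogram(sentence_scores(doc.sents, score_sentence_by_token, frequent_tokens), "Token scores per sentence")`, where `frequent_tokens` is an assumed name for the list from question 3.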

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
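
To back up the "most common range" comment with a number rather than reading it off the plot, the scores can be binned directly. This is a sketch under the assumption that scores lie in [0, 1] (true for a hits-per-word ratio) and that ten equal-width bins are used; `most_common_range` is a hypothetical helper name.

```python
from collections import Counter


def most_common_range(scores, bins=10):
    """Return (low, high, count) for the fullest equal-width bin on [0, 1]."""
    if not scores:
        raise ValueError("scores must not be empty")
    # min(...) keeps a score of exactly 1.0 inside the last bin.
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    idx, count = counts.most_common(1)[0]
    return idx / bins, (idx + 1) / bins, count


demo_scores = [0.0, 0.05, 0.08, 0.25, 0.5]
low, high, count = most_common_range(demo_scores)
print(f"Most common range: {low:.1f} to {high:.1f} ({count} sentences)")
```

For the result to match the plotted bars, the histogram would need the same bin edges, for example `plt.hist(scores, bins=10, range=(0, 1))`; by default `plt.hist` spans only the minimum to maximum of the data.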

8. Which tokens and lemmas would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).
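
One possible code change for the noun-only case is to add a part-of-speech test to the existing filter. spaCy exposes the coarse tag as `token.pos_`, where common nouns are `"NOUN"` and proper nouns are `"PROPN"`, so this sketch excludes proper nouns unless that tag is added. The helper `frequent_nouns` and the stand-in `Tok` tuple are hypothetical names used only so the example runs without a model.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in with the spaCy token attributes used below.
Tok = namedtuple("Tok", "text lemma_ pos_ is_punct is_space is_stop")


def frequent_nouns(doc, n=5, use_lemma=False):
    """Most common lower-cased nouns (tokens or lemmas) in doc."""
    items = [
        (tok.lemma_ if use_lemma else tok.text).lower()
        for tok in doc
        if tok.pos_ == "NOUN"
        and not (tok.is_punct or tok.is_space or tok.is_stop)
    ]
    return Counter(items).most_common(n)


demo = [
    Tok("Lasers", "laser", "NOUN", False, False, False),
    Tok("shine", "shine", "VERB", False, False, False),   # dropped: not a noun
    Tok("headlights", "headlight", "NOUN", False, False, False),
    Tok("lasers", "laser", "NOUN", False, False, False),
    Tok(".", ".", "PUNCT", True, False, False),           # dropped: punctuation
]
print("Noun tokens:", frequent_nouns(demo))
print("Noun lemmas:", frequent_nouns(demo, use_lemma=True))
```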