# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name: Carlos Montoya III
#### Repo url: https://github.com/carlosmontoya3/web-scraping

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure your code cells have all been run (and have output beneath them) and that you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

In [2]:
# Create and activate a Python virtual environment. 
# Before starting the project, try all these imports FIRST
# Address any errors you get running this code cell 
# by installing the necessary packages into your active Python environment.
# Try to resolve issues using your materials and the web.
# If that doesn't work, ask for help in the discussion forums.
# You can't complete the exercises until you import these - start early! 
# We also import pickle and Counter (included in the Python Standard Library).

from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

!pip list

print('All prereqs installed.')

Package                   Version
------------------------- --------------
annotated-types           0.6.0
anyio                     4.3.0
appnope                   0.1.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
attrs                     23.2.0
Babel                     2.14.0
beautifulsoup4            4.12.3
bleach                    6.1.0
blis                      0.7.11
catalogue                 2.0.10
certifi                   2024.2.2
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
cloudpathlib              0.16.0
comm                      0.2.2
confection                0.1.4
contourpy                 1.2.1
cycler                    0.12.1
cymem                     2.0.8
debugpy                   1.8.1
decorator                 5.1.1
defusedxml                0.7.1
en-core-web-sm            3.7.1
executing       

1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

### **Question 1 answer:**

In [3]:
url = "https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    article_content = soup.find("article")
    article_html = str(article_content)
    
    with open("article_html.pkl", "wb") as file:
        pickle.dump(article_html, file)
        
    print("Article HTML extracted and saved to 'article_html.pkl'")
else:
    print("Failed to retrieve the webpage. Status code:", response.status_code)

Article HTML extracted and saved to 'article_html.pkl'
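As a quick sanity check on the save step, the pickle round trip can be sketched in isolation. This is a minimal sketch, assuming a short hand-written HTML string in place of the real scraped article; `save_html`, `load_html`, and `sample_html` are hypothetical names used only for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_html(html: str, path: Path) -> None:
    """Serialize an HTML string to disk with pickle."""
    with open(path, "wb") as file:
        pickle.dump(html, file)


def load_html(path: Path) -> str:
    """Read back an HTML string written by save_html."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Stand-in for the scraped <article> markup (hypothetical sample, not the real page)
sample_html = "<article><h1>How Laser Headlights Work</h1><p>Body text.</p></article>"

# Write to a temporary directory so the sketch does not touch the real .pkl file
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "article_html.pkl"
    save_html(sample_html, pkl_path)
    restored_html = load_html(pkl_path)

print(restored_html == sample_html)
```

Because the article is stored as a plain string, the value read back should compare equal to the value written.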


2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

### **Question 2 answer:**

In [4]:
# Read the HTML content from the saved file
try:
    with open('article_html.pkl', 'rb') as file:
        article_html = pickle.load(file)
except FileNotFoundError:
    print("File not found. Please make sure the file 'article_html.pkl' exists.")
    raise  # Re-raise so the failure is visible instead of calling exit() inside a notebook
except Exception as e:
    print(f"An error occurred: {e}")
    raise

# Parse HTML content using BeautifulSoup
soup = BeautifulSoup(article_html, 'html.parser')

# Extract text from HTML and print it
article_text = soup.get_text()
print(article_text)



How Laser Headlights Work


                130 Comments            

by:
Lewin Day



March 22, 2021








When we think about the onward march of automotive technology, headlights aren’t usually the first thing that come to mind. Engines, fuel efficiency, and the switch to electric power are all more front of mind. However, that doesn’t mean there aren’t thousands of engineers around the world working to improve the state of the art in automotive lighting day in, day out.
Sealed beam headlights gave way to more modern designs once regulations loosened up, while bulbs moved from simple halogens to xenon HIDs and, more recently, LEDs. Now, a new technology is on the scene, with lasers!

Laser Headlights?!
BWM’s prototype laser headlight assemblies undergoing testing.
The first image brought to mind by the phrase “laser headlights” is that of laser beams firing out the front of an automobile. Obviously, coherent beams of monochromatic light would make for poor illumination outside o

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

### **Question 3 answer:**

In [5]:
# Load the article HTML saved in question 1 and extract its text
# (no 'article_text.pkl' file is ever written, so read 'article_html.pkl' instead)
with open('article_html.pkl', 'rb') as file:
    article_html = pickle.load(file)
article_text = BeautifulSoup(article_html, 'html.parser').get_text()

# Load spaCy English model
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Process the text using spaCy pipeline
doc = nlp(article_text)

# Filter out stopwords, punctuation, and whitespace, and convert tokens to lowercase
tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]

# Count the frequencies of each token
token_counter = Counter(tokens)

# Get the 5 most frequent tokens
most_common_tokens = token_counter.most_common(5)

# Print the most common tokens with their frequencies
print("Most common tokens:")
for token, frequency in most_common_tokens:
    print(f"{token}: {frequency}")

# Print all tokens with their frequencies
print("\nAll tokens with frequencies:")
for token, frequency in token_counter.items():
    print(f"{token}: {frequency}")

Most common tokens:
laser: 35
headlights: 19
headlight: 11
technology: 10
led: 10

All tokens with frequencies:
laser: 35
headlights: 19
work: 2
comments: 1
lewin: 1
day: 3
march: 2
think: 1
onward: 1
automotive: 6
technology: 10
usually: 1
thing: 2
come: 5
mind: 3
engines: 1
fuel: 1
efficiency: 3
switch: 2
electric: 1
power: 3
mean: 1
thousands: 1
engineers: 2
world: 2
working: 1
improve: 1
state: 2
art: 1
lighting: 4
sealed: 2
beam: 7
gave: 1
way: 4
modern: 2
designs: 3
regulations: 1
loosened: 1
bulbs: 1
moved: 2
simple: 3
halogens: 1
xenon: 1
hids: 1
recently: 1
leds: 6
new: 3
scene: 1
lasers: 5
bwm: 2
prototype: 1
headlight: 11
assemblies: 1
undergoing: 1
testing: 1
image: 2
brought: 1
phrase: 1
beams: 5
firing: 1
automobile: 1
obviously: 1
coherent: 1
monochromatic: 1
light: 9
poor: 1
illumination: 2
outside: 1
specific: 1
spot: 2
distance: 1
away: 1
thankfully: 2
eyes: 1
instead: 1
consist: 1
solid: 1
diodes: 2

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

### **Question 4 answer:**

In [None]:
def get_most_common_lemmas(text, n=5):
    """
    Get the n most common lemmas from the given text.

    Args:
        text (str): The input text.
        n (int): The number of most common lemmas to retrieve.

    Returns:
        list: A list of tuples containing the most common lemmas and their frequencies.
    """
    # Process the text using spaCy
    doc = nlp(text)

    # Initialize a Counter to count lemma frequencies
    lemma_counter = Counter()

    # Collect lemmas from the whole document, skipping stopwords, punctuation,
    # and whitespace (iterating doc.sents would fail here because the pipeline
    # was loaded with the parser disabled)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]

    # Update the lemma counter
    lemma_counter.update(lemmas)

    # Get the most common lemmas
    most_common_lemmas = lemma_counter.most_common(n)

    return most_common_lemmas

# Get the article text
with open("article_html.pkl", "rb") as file:
    article_html = pickle.load(file)

# Parse HTML content using BeautifulSoup
soup = BeautifulSoup(article_html, 'html.parser')

# Extract text from HTML
article_text = soup.get_text()

# Get the 5 most common lemmas
most_common_lemmas = get_most_common_lemmas(article_text)

# Print the most common lemmas with their frequencies
print("Most common lemmas:")
for lemma, frequency in most_common_lemmas:
    print(f"{lemma}: {frequency}")



5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting words appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in class notes useful; feel free to use methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in question 3.

### **Question 5 answer:**

In [1]:
def score_sentence_by_token(sentence, interesting_tokens):
    """
    Calculate the score of a sentence based on the occurrence of interesting tokens.

    Args:
        sentence (str): The input sentence.
        interesting_tokens (list): List of interesting tokens to search for in the sentence.

    Returns:
        float: The score of the sentence, calculated as the number of interesting tokens divided by the number of words in the sentence.
    """
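    # Tokenize with spaCy's tokenizer and keep alphabetic tokens only, so that
    # punctuation attached to a word (e.g. "laser,") does not block a match
    tokens = [token.text.lower() for token in nlp.make_doc(sentence) if token.is_alpha]

    # Count the occurrences of interesting tokens in the sentence
    interesting_token_count = sum(1 for token in tokens if token in interesting_tokens)

    # Calculate the score, guarding against sentences that contain no words
    score = interesting_token_count / len(tokens) if tokens else 0.0

    return score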


def score_sentence_by_lemma(sentence, interesting_lemmas):
    """
    Calculate the score of a sentence based on the occurrence of interesting lemmas.

    Args:
        sentence (str): The input sentence.
        interesting_lemmas (list): List of interesting lemmas to search for in the sentence.

    Returns:
        float: The score of the sentence, calculated as the number of interesting lemmas divided by the number of words in the sentence.
    """
    # Process the sentence using spaCy
    doc = nlp(sentence)

    # Get lemmas from the sentence
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]

    # Count the occurrences of interesting lemmas in the sentence
    interesting_lemma_count = sum(1 for lemma in lemmas if lemma in interesting_lemmas)

    # Calculate the score, guarding against sentences that contain no words
    score = interesting_lemma_count / len(lemmas) if lemmas else 0.0

    return score
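To make the scoring formula concrete, here is a worked example on one sentence from the article, using the five frequent tokens printed in question 3. It is a minimal sketch that swaps in a regular-expression word splitter for spaCy, so `score_by_words` is a hypothetical helper and not one of the two required functions.

```python
import re


def score_by_words(sentence: str, interesting_words: set[str]) -> float:
    """Fraction of the sentence's words that appear in interesting_words."""
    # Regex word splitting is an assumption standing in for spaCy tokenization
    words = re.findall(r"[a-z]+", sentence.lower())
    if not words:
        return 0.0  # An empty sentence has no words to score
    hits = sum(1 for word in words if word in interesting_words)
    return hits / len(words)


# The five most frequent tokens reported in the question 3 output
interesting = {"laser", "headlights", "headlight", "technology", "led"}

sentence = "Now, a new technology is on the scene, with lasers!"
score = score_by_words(sentence, interesting)
print(score)  # 0.1: 1 match ("technology") out of 10 words
```

Note that "lasers" does not match the token "laser" here. A lemma-based score would likely count it, which is one reason the token and lemma scores of the same sentence can differ.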


6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

### **Question 6 answer:**

In [1]:
from collections import Counter
import pickle

import matplotlib.pyplot as plt
import spacy
from bs4 import BeautifulSoup

# Load the full pipeline; the parser is needed for sentence boundaries (doc.sents)
nlp = spacy.load("en_core_web_sm")

def score_sentence_by_token(sentence, interesting_tokens):
    """
    Calculate the score of a sentence based on the frequency of interesting tokens.

    Args:
        sentence (str): The input sentence.
        interesting_tokens (list): A list of interesting (lowercase) tokens.

    Returns:
        float: The score of the sentence.
    """
    # Tokenize the sentence, keeping lowercase alphabetic tokens only
    tokens = [token.text.lower() for token in nlp.make_doc(sentence) if token.is_alpha]

    # Count the number of interesting tokens in the sentence
    interesting_token_count = sum(1 for token in tokens if token in interesting_tokens)

    # Calculate the score
    if len(tokens) > 0:
        score = interesting_token_count / len(tokens)
    else:
        score = 0  # Avoid division by zero

    return score

# Get the article text
with open("article_html.pkl", "rb") as file:
    article_html = pickle.load(file)

# Parse HTML content using BeautifulSoup and extract the text
article_text = BeautifulSoup(article_html, 'html.parser').get_text()

# Process the article once with spaCy
doc = nlp(article_text)

# Recompute the 5 most common tokens (as in question 3) so this cell runs on its own
token_counter = Counter(token.text.lower() for token in doc if token.is_alpha and not token.is_stop)
most_common_tokens = token_counter.most_common(5)
interesting_tokens = [token for token, _ in most_common_tokens]

# Split the article into sentences using spaCy's sentence boundaries
sentences = [sentence.text for sentence in doc.sents]

# Calculate scores for each sentence
sentence_scores = [score_sentence_by_token(sentence, interesting_tokens) for sentence in sentences]

# Plot histogram
plt.hist(sentence_scores, bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Sentence Scores based on Token Frequency')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

### **Question 7 answer:**

8. Which tokens and lexemes would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).

### **Question 8 answer:**
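One possible way to restrict the counts to nouns is to add a part-of-speech check to the existing filter. spaCy tokens expose a `pos_` attribute, so the list comprehensions in questions 3 and 4 could gain a condition such as `token.pos_ == "NOUN"`. The sketch below shows the effect on hand-tagged stand-in tokens; `FakeToken` is hypothetical and the part-of-speech tags are assigned by hand as an assumption, so the tags spaCy actually produces for the article may differ.

```python
from collections import Counter
from typing import NamedTuple


class FakeToken(NamedTuple):
    """Hypothetical stand-in exposing a few attributes that spaCy tokens also have."""
    text: str
    lemma_: str
    pos_: str
    is_stop: bool = False


def noun_lemma_counts(tokens) -> Counter:
    """Count lowercase lemmas, keeping only non-stopword tokens tagged as nouns."""
    return Counter(
        token.lemma_.lower()
        for token in tokens
        if token.pos_ == "NOUN" and not token.is_stop
    )


# Hand-tagged examples (assumed tags, for illustration only)
tokens = [
    FakeToken("Laser", "laser", "NOUN"),
    FakeToken("headlights", "headlight", "NOUN"),
    FakeToken("work", "work", "VERB"),
    FakeToken("come", "come", "VERB"),
    FakeToken("headlight", "headlight", "NOUN"),
]
print(noun_lemma_counts(tokens).most_common())  # [('headlight', 2), ('laser', 1)]
```

Under this filter, words tagged as verbs, adjectives, or adverbs (for example "come" or "usually" in the question 3 output) would drop out of the lists, while the noun entries would remain.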