# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name: Courtney Pigford 
https://github.com/cboss320/Web-scraping

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure all of your code cells have been run (and have output beneath them), and ensure you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

for script in soup(["script", "style"]):
    script.extract()
    
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
          
print(text)

How
Laser
Headlights
Work
|
Hackaday
38
captures
22
Mar
2021
-
26
May
2023
Feb
MAR
Aug
27
2020
2021
2022
success
fail
About
this
capture
COLLECTED
BY
Organization:
Internet
Archive
Focused
crawls
are
collections
of
frequently-updated
webcrawl
data
from
narrow
(as
opposed
to
broad
or
wide)
web
crawls,
often
focused
on
a
single
domain
or
subdomain.
Collection:
top_domains-00250
TIMESTAMPS
The
Wayback
Machine
-
https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/
Skip
to
content
Hackaday
Primary
Menu
Home
Blog
Hackaday.io
Tindie
Hackaday
Prize
Submit
About
Search
for:
March
27,
2021
How
Laser
Headlights
Work
130
Comments
by:
Lewin
Day
March
22,
2021
When
we
think
about
the
onward
march
of
automotive
technology,
headlights
aren’t
usually
the
first
thing
that
come
to
mind.
Engines,
fuel
efficiency,
and
the
switch
to
electric
power
are
all
more
front
of
mind.
However,
that
doesn’t
mean
there
aren’t
thousands
of
engineers
around
the
world
worki
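
The cell above prints the page's visible text but does not write anything to disk. A minimal sketch of the missing step follows. It assumes the post is wrapped in an `<article>` element, and the file name `article.pkl` and both helper names are hypothetical:

```python
import pickle

ARTICLE_URL = (
    "https://web.archive.org/web/20210327165005/"
    "https://hackaday.com/2021/03/22/how-laser-headlights-work/"
)


def fetch_article_html(url):
    """Download the page and return the HTML of its <article> element as a string."""
    # Third-party imports live here so save_html() works without them installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    article = soup.find("article")  # assumption: the post sits inside <article>
    if article is None:
        raise ValueError("no <article> element found on the page")
    return str(article)


def save_html(html, path="article.pkl"):
    """Pickle the HTML string to `path` (hypothetical file name)."""
    with open(path, "wb") as f:
        pickle.dump(html, f)


# Usage (needs network access):
# save_html(fetch_article_html(ARTICLE_URL))
```

Pickling the HTML as a plain string, rather than the `BeautifulSoup` object, keeps the file small and avoids depending on the parser's internals when it is loaded later.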

2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

In [12]:
from bs4 import BeautifulSoup 
import requests 

url = "https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/"

html_content = requests.get(url).text 

soup = BeautifulSoup(html_content, 'html.parser')

texts = soup.find_all('p')

for text in texts:
    print(text.get_text())

When we think about the onward march of automotive technology, headlights aren’t usually the first thing that come to mind. Engines, fuel efficiency, and the switch to electric power are all more front of mind. However, that doesn’t mean there aren’t thousands of engineers around the world working to improve the state of the art in automotive lighting day in, day out.
Sealed beam headlights gave way to more modern designs once regulations loosened up, while bulbs moved from simple halogens to xenon HIDs and, more recently, LEDs. Now, a new technology is on the scene, with lasers!

The first image brought to mind by the phrase “laser headlights” is that of laser beams firing out the front of an automobile. Obviously, coherent beams of monochromatic light would make for poor illumination outside of a very specific spot quite some distance away. Thankfully for our eyes, laser headlights don’t work in this way at all.
Instead, laser headlights consist of one or more solid state laser diode
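
The cell above downloads the page a second time rather than reading a saved file. Assuming the HTML was pickled as a string to a hypothetical `article.pkl`, reading it back and extracting its text could be sketched as follows (both function names are illustrative):

```python
import pickle


def load_html(path="article.pkl"):
    """Return the HTML string stored in the pickle file at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def html_to_text(html):
    """Parse an HTML string and return its visible text."""
    # Imported here so load_html() works even without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, "html.parser").get_text()


# Usage, once article.pkl exists:
# print(html_to_text(load_html()))
```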

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [18]:
import spacy 
import glob

nlp = spacy.load("en_core_web_sm")
path = '/desktop/44-620/web-scraping/article\\*.txt'

for file in glob.glob(path): 
    with open(file, encoding='utf-8', errors='ignore') as file_in: 
        text = file_in.read()
        lines = text.split('\n')
        for line in lines:
            # Run each line through the pipeline, then print its tokens.
            doc = nlp(line)
            for token in doc:
                print(token)
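
The cell above prints every token but does not filter or count them. One way to get the five most frequent tokens is to filter with spaCy's `is_punct`, `is_stop`, and `is_space` token attributes and count with `collections.Counter`. This is a sketch, not the required solution; the function names are hypothetical and `article_text` is assumed to hold the article's plain text:

```python
from collections import Counter


def is_interesting(token):
    """True for tokens that are not punctuation, stop words, or whitespace."""
    return not (token.is_punct or token.is_stop or token.is_space)


def most_common_tokens(doc, n=5):
    """Return the n most frequent lower-cased tokens as (token, count) pairs."""
    counts = Counter(tok.text.lower() for tok in doc if is_interesting(tok))
    return counts.most_common(n)


def report_tokens(article_text, n=5):
    """Run the text through spaCy and print labeled tokens and frequencies."""
    import spacy  # imported here so the counting helpers work without spaCy

    nlp = spacy.load("en_core_web_sm")
    top = most_common_tokens(nlp(article_text), n)
    print("Most common tokens:", [token for token, _ in top])
    for token, count in top:
        print(f"Token: {token!r}  Frequency: {count}")
```

Processing the whole article as one `doc`, instead of line by line, also keeps sentence boundaries intact for the later scoring questions.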

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [19]:
import nltk 
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet 

lemmatizer = WordNetLemmatizer()



SyntaxError: unterminated string literal (detected at line 4) (3516372884.py, line 4)
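
The question asks for a spaCy pipeline, whereas the cell above starts from NLTK. In spaCy, a trained pipeline exposes each token's lemma as `token.lemma_`, so the lemma count can mirror the token count. A minimal sketch, with hypothetical function names and `article_text` assumed to hold the article's plain text:

```python
from collections import Counter


def most_common_lemmas(doc, n=5):
    """Return the n most frequent lower-cased lemmas as (lemma, count) pairs."""
    counts = Counter(
        tok.lemma_.lower()
        for tok in doc
        if not (tok.is_punct or tok.is_stop or tok.is_space)
    )
    return counts.most_common(n)


def report_lemmas(article_text, n=5):
    """Run the text through spaCy and print labeled lemmas and frequencies."""
    import spacy  # imported here so most_common_lemmas() works without spaCy

    nlp = spacy.load("en_core_web_sm")
    top = most_common_lemmas(nlp(article_text), n)
    print("Most common lemmas:", [lemma for lemma, _ in top])
    for lemma, count in top:
        print(f"Lemma: {lemma!r}  Frequency: {count}")
```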

5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting tokens appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in-class notes useful; feel free to reuse those methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in question 3.
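
The two scoring functions described above could be sketched as follows. This version assumes `sentence` is a spaCy `Span` (for example, an item from `doc.sents`) and that "words" means tokens that are neither punctuation nor whitespace; both are interpretations, not part of the task statement:

```python
def _words(sentence):
    """Tokens of the sentence that count as words (assumption: no punctuation/whitespace)."""
    return [tok for tok in sentence if not (tok.is_punct or tok.is_space)]


def score_sentence_by_token(sentence, interesting_token):
    """Fraction of the sentence's words whose lower-cased text is an interesting token."""
    words = _words(sentence)
    if not words:
        return 0.0  # avoid dividing by zero on empty sentences
    interesting = set(interesting_token)
    hits = sum(1 for tok in words if tok.text.lower() in interesting)
    return hits / len(words)


def score_sentence_by_lemma(sentence, interesting_lemmas):
    """Fraction of the sentence's words whose lower-cased lemma is an interesting lemma."""
    words = _words(sentence)
    if not words:
        return 0.0
    interesting = set(interesting_lemmas)
    hits = sum(1 for tok in words if tok.lemma_.lower() in interesting)
    return hits / len(words)
```

With a processed `doc`, the first sentence is available as `next(iter(doc.sents))`, which can be passed to either function together with the frequent tokens or lemmas.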

6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
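
Questions 6 and 7 share the same shape: score every sentence, then plot the distribution. A possible sketch, assuming a scoring function with the `(sentence, interesting)` signature from question 5 is already defined; the helper names and the choice of 10 bins are illustrative:

```python
def sentence_scores(sentences, score_fn, interesting):
    """Apply score_fn(sentence, interesting) to every sentence and collect the results."""
    return [score_fn(sentence, interesting) for sentence in sentences]


def plot_score_histogram(scores, title):
    """Draw a histogram of sentence scores with a title and axis labels."""
    import matplotlib.pyplot as plt  # imported here so sentence_scores() has no dependencies

    fig, ax = plt.subplots()
    ax.hist(scores, bins=10)  # assumption: 10 bins is fine for scores in [0, 1]
    ax.set_title(title)
    ax.set_xlabel("Sentence score")
    ax.set_ylabel("Number of sentences")
    plt.show()


# Usage, given a spaCy doc and the frequent tokens from question 3:
# scores = sentence_scores(doc.sents, score_sentence_by_token, frequent_tokens)
# plot_score_histogram(scores, "Sentence scores by token")
```

The most common range of scores can then be read off the tallest bar and recorded in a comment, as the questions ask.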

8. Which tokens and lexemes would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).
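
One way to restrict the counts to nouns is to filter on spaCy's coarse part-of-speech tag, `token.pos_`. The sketch below keeps only tokens tagged `"NOUN"`; whether proper nouns (`"PROPN"`) should also count is a judgment call, and the function name is hypothetical:

```python
from collections import Counter


def most_common_nouns(doc, n=5):
    """Return the n most frequent lower-cased noun lemmas as (lemma, count) pairs."""
    counts = Counter(
        tok.lemma_.lower()
        for tok in doc
        if tok.pos_ == "NOUN" and not tok.is_stop  # add "PROPN" to include proper nouns
    )
    return counts.most_common(n)
```

Comparing this output with the unfiltered lists from questions 3 and 4 shows directly which entries (verbs, adjectives, and so on) drop out.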