# Web Mining and Applied NLP (44-620)

## Final Project: Article Summarizer

### Student Name: **Caleb Sellinger**

### GitHub Repo [HERE](https://github.com/crsellinger/article-summarizer)

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure all of your code cells have been run (and have output beneath them) and that you have committed and pushed ALL of your changes to your assignment repository.

You should bring in code from previous assignments to help you answer the questions below.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

In [50]:
from collections import Counter
import pickle
import requests
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import selenium

!pip list

print('All prereqs installed.')

Package                 Version
----------------------- -----------
annotated-types         0.7.0
asttokens               3.0.0
attrs                   25.3.0
beautifulsoup4          4.13.4
blis                    1.3.0
catalogue               2.0.10
certifi                 2025.8.3
cffi                    1.17.1
charset-normalizer      3.4.2
click                   8.2.1
cloudpathlib            0.21.1
colorama                0.4.6
comm                    0.2.3
confection              0.1.5
contourpy               1.3.3
cycler                  0.12.1
cymem                   2.0.11
debugpy                 1.8.15
decorator               5.2.1
en_core_web_sm          3.8.0
executing               2.2.0
fonttools               4.59.0
h11                     0.16.0
idna                    3.10
ipykernel               6.30.1
ipython                 9.4.0
ipython_pygments_lexers 1.1.1
jedi                    0.19.2
Jinja2                  3.1.6
joblib                  1.5.1
jupyter_client    

1. Find an article or blog post on the internet about a topic that interests you and whose text you can retrieve using the technologies we have applied in the course. Get the html for the article and store it in a file (which you must submit with your project).

In [51]:
# Webpage uses JavaScript to load content, need Selenium to run JS to pull content
# BS4 does not run scripts and was only pulling root div with scripts
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.kaggle.com/datasets/sudan007kaggler/reddit-rwallstreet-bets-posts-dataset-labelled/data"

# Set up the Selenium WebDriver
driver = webdriver.Chrome()

# Load the page
driver.get(url)

try:
    # Wait for the specific div with role="rowgroup" to be present
    # Wait a max of 20 seconds.
    # This only pulls what JS initially loads onto the page. From testing, it's around 80 rows.
    # To get all 41,981 rows, I would need to scroll down the table and wait for more rows to load.
    # Selenium can run JavaScript via driver.execute_script, so a scripted scroll should be possible; it is not done here.
    wait = WebDriverWait(driver, 20)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[role="rowgroup"]')))

    # Get the page source after JavaScript has loaded
    html_content = driver.page_source

    # Now, parse the fully loaded HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all table cells (<td>) in the loaded page
    element = soup.find_all('td')
    # print(element[8])

    if element:
        with open('content.pkl','wb') as file:
            # Strip text before pickling, was running into max recursion depth error bc html was too complicated
            # Using join() method to combine all strings in list into one string for Doc element
            pickle.dump("\n".join([e.get_text(strip=True) for e in element]), file)
        # print(element.prettify)
    else:
        # This branch runs when the rowgroup loaded but no <td> cells were found
        print("No <td> elements found in the loaded page source.")

finally:
    # Close browser window
    driver.quit()
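
The comments in the cell above mention that only the first ~80 rows are loaded and that more would require scrolling. Selenium's `execute_script` runs JavaScript in the page, so a scroll loop can be scripted. The following is a minimal sketch, not something this notebook ran: the helper name `scroll_until_stable` and its parameters are hypothetical, and it assumes the page grows as the window scrolls. If Kaggle's table scrolls inside its own container, the script would need to target that element instead.

```python
import time


def scroll_until_stable(driver, max_rounds=50, pause=1.0):
    """Scroll to the bottom until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scrolls that caused new content to load.
    (Hypothetical helper; assumes window-level scrolling.)
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for lazily loaded rows to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, assume the end was reached
        last_height = new_height
        loaded += 1
    return loaded
```

It would be called after the `wait.until(...)` line and before reading `driver.page_source`, for example `scroll_until_stable(driver, max_rounds=600)`.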

2. Read in your article's html source from the file you created in question 1 and do sentiment analysis on the article/post's text (use `.get_text()`). Print the polarity score with an appropriate label. Additionally, print the number of sentences in the original article (with an appropriate label).

In [52]:
# Spacy Pipeline
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")

# Reading file into python object
with open('content.pkl','rb') as file:
    content = pickle.load(file)

# Skip the first 8 lines, then keep every other line (the post text)
# Convert each kept line to a Doc; the Doc type is needed for the polarity score
list_sentences = content.splitlines()[8::2]
list_sentences = [nlp(line) for line in list_sentences]
# print(list_sentences)

# Print number of sentences
print(f"Number of posts: {len(list_sentences)}")

# Polarity scores for each line
list_of_scores = [line._.blob.polarity for line in list_sentences]
for i, s in enumerate(list_of_scores):
    print(f"Post {i}: {s}")

Number of posts: 76
Post 0: 0.0
Post 1: -0.6
Post 2: 0.0
Post 3: -0.25
Post 4: 0.0
Post 5: 0.0
Post 6: 0.0
Post 7: 0.0
Post 8: 0.0
Post 9: 0.7
Post 10: 0.0
Post 11: 0.0
Post 12: -0.5
Post 13: -0.025
Post 14: 0.14325396825396827
Post 15: 0.0
Post 16: 0.0
Post 17: -0.5
Post 18: 0.0
Post 19: 0.0
Post 20: 0.0
Post 21: 0.2857142857142857
Post 22: 0.0
Post 23: 0.0
Post 24: 0.125
Post 25: -0.30000000000000004
Post 26: 0.0
Post 27: -0.033333333333333326
Post 28: 0.0
Post 29: -0.2
Post 30: 0.0
Post 31: 0.0
Post 32: 0.0
Post 33: 0.0
Post 34: -0.15555555555555556
Post 35: 0.0
Post 36: 0.0
Post 37: 0.0
Post 38: -0.09583333333333333
Post 39: 0.0
Post 40: 0.6
Post 41: 0.0
Post 42: 0.0
Post 43: 0.0
Post 44: -0.011111111111111108
Post 45: -0.8
Post 46: 0.0
Post 47: 0.8
Post 48: 0.2
Post 49: 0.0
Post 50: 0.0
Post 51: 0.0
Post 52: 0.0
Post 53: 0.0
Post 54: 0.0
Post 55: 0.0
Post 56: 0.0
Post 57: 0.0
Post 58: 0.0
Post 59: 0.0
Post 60: 0.0
Post 61: 0.0
Post 62: 0.0
Post 63: 0.0
Post 64: 0.0
Post 65: 0.0
Po

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case). Print the common tokens with an appropriate label. Additionally, print the tokens with their frequencies (with appropriate labels).

In [53]:
import re

doc = nlp(content)
# print(doc)

# Remove whitespace/punctuation/stopwords and lower case tokens
tokens = [
    word.lower_
    for word in doc
    if not (
        word.is_space
        or word.is_punct
        or word.is_stop
        or word.is_digit
        or word.is_currency
        or re.search(r"\d", word.text)  # Drop any token that contains a digit
    )
]
# print(tokens)

freq = Counter(map(str, tokens))
# Top 5 most common tokens put into a list
common_tokens = [f"{x}: {y}" for x, y in freq.most_common(5)]
print("Top 5 most common tokens:")
print(*common_tokens, sep="\n")
print(*tokens,sep=' | ')

Top 5 most common tokens:
jul: 20
aug: 6
dip: 4
buy: 4
hold: 3
values | aren | covering | hold | fucking | stocks | party | dudes | shady | going | pre | market | hype | diamond | hands | relax | printer | sings | song | people | army | dogeeeee | rkt | dividends | dividends | stimmy | zom | looks | good | basically | moonkey | mertgag | ageddon | uwmc | finally | play | market | meaningless | robinhood | aal | american | airlines | year | long | consolidation | breakout | cheers | moom | proud | holdings | m | comparing | big | boys | single | page | sell | ll | january | pst | forget | webull | htmw | won | let | trade | fake | money | blackberry | dip | market | level | data | uk | brothers | sisters | freetrade | allowing | gme | bb | amc | trading | instant | account | topups | o | counterfeit | shares | right | public | won | let | buys | dip | unite | machine | ve | won | evil | doers | left | end | trade | forced | yolo | gamestop | like | stock | flowers | dd | palantir | pltr

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels).

In [54]:
# Remove whitespace/punctuation/stopwords/numbers; keep Token objects so lemmas can be read below
tokens = [
    word
    for word in doc
    if not (
        word.is_space
        or word.is_punct
        or word.is_stop
        or word.is_digit
        or word.is_currency
        or re.search(r"\d", word.text)  # Regex to remove digits
    )
]

# Lemmatize and lowercase
lemmas = [word.lemma_.lower() for word in tokens]

freq = Counter(map(str, lemmas))
# Top 5 most common lemmas put into a list
common_lemmas = [f"{x}: {y}" for x, y in freq.most_common(5)]
print("Top 5 most common lemmas:")
print(*common_lemmas, sep="\n")
print(*lemmas, sep=" | ")

Top 5 most common lemmas:
jul: 20
buy: 6
aug: 6
hold: 5
dip: 4
value | aren | cover | hold | fucking | stock | party | dude | shady | go | pre | market | hype | diamond | hand | relax | printer | sing | song | people | army | dogeeeee | rkt | dividend | dividend | stimmy | zom | look | good | basically | moonkey | mertgag | ageddon | uwmc | finally | play | market | meaningless | robinhood | aal | american | airlines | year | long | consolidation | breakout | cheer | moom | proud | holding | m | compare | big | boy | single | page | sell | ll | january | pst | forget | webull | htmw | win | let | trade | fake | money | blackberry | dip | market | level | datum | uk | brother | sister | freetrade | allow | gme | bb | amc | trading | instant | account | topup | o | counterfeit | share | right | public | win | let | buy | dip | unite | machine | ve | win | evil | doer | leave | end | trade | force | yolo | gamestop | like | stock | flower | dd | palantir | pltr | low | company | important

5. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
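
No code cell was submitted for this question. As a minimal sketch of one way to score sentences, assume the score of a sentence is the number of "interesting" (frequent) tokens it contains divided by the number of words in the sentence; that definition is an assumption, since this notebook does not state it. Plain lists of strings stand in for spaCy sentences here, and the sample sentences are made up for illustration.

```python
def score_sentence(words, interesting):
    """Fraction of the words in one sentence that appear in `interesting`."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in interesting)
    return hits / len(words)


# Made-up sentences, already split into words (stand-in for spaCy tokens)
sentences = [
    ["Buy", "the", "dip"],
    ["Hold", "GME", "and", "buy", "more"],
    ["Nothing", "here"],
]
# Assumed set of frequent tokens, e.g. taken from freq.most_common(5)
interesting = {"buy", "dip", "hold"}

scores = [score_sentence(s, interesting) for s in sentences]
print(scores)
```

With spaCy, the word lists could come from `doc.sents`, filtering each sentence's tokens the same way as in question 3, and the resulting `scores` list could be passed to `plt.hist(scores)` with a title and axis labels.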

6. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

7. Using the histograms from questions 5 and 6, decide a "cutoff" score for tokens and lemmas such that fewer than half the sentences would have a score greater than the cutoff score. Record the scores in this Markdown cell.

* Cutoff Score (tokens): 
* Cutoff Score (lemmas):

Feel free to change these scores as you generate your summaries.  Ideally, we're shooting for at least 6 sentences for our summary, but don't want more than 10 (these numbers are rough estimates; they depend on the length of your article).
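
One way to pick a cutoff that satisfies "fewer than half the sentences score higher" without reading it off the histogram is to use the upper median of the scores. This is a sketch under that assumption, not the method the assignment prescribes; `pick_cutoff` and `count_above` are hypothetical helper names and the scores are made up.

```python
import statistics


def pick_cutoff(scores):
    """Return a cutoff that strictly fewer than half of the scores exceed."""
    # median_high returns an actual data point; for an even count it is the
    # larger of the two middle values, so fewer than half lie strictly above it
    return statistics.median_high(scores)


def count_above(scores, cutoff):
    """Number of scores strictly greater than the cutoff."""
    return sum(1 for s in scores if s > cutoff)


scores = [0.0, 0.0, 0.1, 0.25, 0.4, 0.5]  # made-up sentence scores
cutoff = pick_cutoff(scores)
print(f"Cutoff: {cutoff}, sentences above: {count_above(scores, cutoff)}")
```

The cutoff can then be raised by hand until the number of sentences above it falls in the 6 to 10 range mentioned above.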

8. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on tokens) is greater than the cutoff score you identified in question 7. If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences. Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space: `' '.join(sentence_list)`).
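
The filtering step described in this question can be sketched as follows. It assumes the sentence texts and their scores are already available as two parallel lists; `build_summary` is a hypothetical helper name and the sample data is made up.

```python
def build_summary(sentences, scores, cutoff):
    """Join every sentence whose score is strictly greater than the cutoff."""
    kept = [
        text.strip()
        for text, score in zip(sentences, scores)
        if score > cutoff
    ]
    return " ".join(kept)


# Made-up sentence texts and scores for illustration
sentences = ["  Buy the dip. ", "Nothing here.", "Hold GME."]
scores = [0.67, 0.0, 0.5]

summary = build_summary(sentences, scores, cutoff=0.25)
print(summary)
```

The same helper would work for the lemma-based summary by passing the lemma scores and the lemma cutoff instead.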

9. Print the polarity score of your summary you generated with the token scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

10. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on lemmas) is greater than the cutoff score you identified in question 7. If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences. Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space: `' '.join(sentence_list)`).

11. Print the polarity score of your summary you generated with the lemma scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

12. Compare the polarity scores of your summaries to the polarity score of the initial article. Is there a difference? Why do you think that may or may not be? Answer in this Markdown cell.

13. Based on your reading of the original article, which summary do you think is better (if there's a difference).  Why do you think this might be?