Title: DSC360 Project 
Exercise Author: Dipanjan Sarkar  
Date: 19 October 2024  
Modified By: Caleb Trimble  
Description: Retrieves news articles for a user-supplied topic from NewsAPI, preprocesses and normalizes the text, vectorizes it with CountVectorizer and TF-IDF, generates spaCy embeddings and named entities, and produces source-cited abstractive summaries with Hugging Face Transformers.  
Code adapted from Text Analytics with Python - Second Edition (Sarkar D., 2019) and modified using Copilot.

In [116]:
import logging
import requests
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
import numpy as np
import spacy
import nltk
from nltk.corpus import stopwords
from transformers import pipeline, logging as transformers_logging
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import text_analysis as ta
from normalizer import normalize_corpus  # Ensure proper import
import re
import urllib.parse

def preprocess_text(text):
    # Remove fenced code snippets (```...```), including multi-line ones
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    # Remove Markdown links of the form [label](url)
    text = re.sub(r'\[.*?\]\(.*?\)', '', text)
    return text

def preprocess_articles(articles):
    processed_articles = []
    for article in articles:
        processed_article = preprocess_text(article)
        processed_articles.append(processed_article)
    return processed_articles

transformers_logging.set_verbosity_error()

# Download necessary NLTK data
nltk.download('stopwords')

# Load the pre-trained spaCy model
nlp = spacy.load('en_core_web_md')
query = input('Topic to summarize: ')

encoded_query = urllib.parse.quote_plus(query)

# Fetch data from the API
url = 'https://newsapi.org/v2/everything'
import os  # needed only to read the key from the environment
api_key = os.environ['NEWSAPI_KEY']  # assumed variable name; set it before running
params = {
    'q': encoded_query, 
    'apiKey': api_key, 
    'excludeDomains': 'youtube.com'
}
response = requests.get(url, params=params)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\caleb\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [117]:
if response.status_code == 200:
    data = response.json()
    articles_data = [
        {'content': article['content'], 
         'publishedAt': article['publishedAt'],
         'source': article['source']['name']
         }
        for article in data['articles'] if article.get('content')  # Use .get to avoid KeyError
    ]

    # Print the articles_data to inspect the keys
    print("Articles Data:")
    for article in articles_data:
        print(article)

Articles Data:
{'content': 'Florida is one of 10 states where abortion is on the ballot this election\r\nA closely watched proposal to restore abortion rights in Florida is on track for defeat, in a significant blow to efforts to… [+2399 chars]', 'publishedAt': '2024-11-06T02:59:20Z', 'source': 'BBC News'}
{'content': 'A 23-year-old man in Florida has been charged with animal cruelty after allegedly leaving his dog tied to a fence as people were evacuating for Hurricane Milton, according to a press release from the… [+2799 chars]', 'publishedAt': '2024-10-16T18:00:42Z', 'source': 'Gizmodo.com'}
{'content': 'In an interview with BBC Newsnight in 2016, the legendary musician spoke about growing up in Chicago and his friendship with celebrities.', 'publishedAt': '2024-11-06T04:42:20Z', 'source': 'BBC News'}
{'content': "This as-told-to essay is based on a conversation with Rick Shiver, a 72-year-old retired firefighter in Port Orange, part of the Daytona Beach metropolitan area on Florida

In [118]:

# Proceed with your normal processing
df_articles = pd.DataFrame(articles_data)

# Print DataFrame columns to check for 'content'
print("DataFrame Columns:", df_articles.columns)

if 'content' in df_articles.columns:
    # Preprocess the articles
    articles = df_articles['content'].tolist()
    processed_articles = preprocess_articles(articles)
    df_articles['processed_content'] = processed_articles
else:
    print("Error: 'content' column not found in DataFrame")


if 'publishedAt' in df_articles.columns:
    # Parse timestamps and index the articles by publication time
    df_articles['publishedAt'] = pd.to_datetime(df_articles['publishedAt'])
    df_articles.set_index('publishedAt', inplace=True)
else:
    print("Error: 'publishedAt' column not found in DataFrame")

DataFrame Columns: Index(['content', 'publishedAt', 'source'], dtype='object')


In [119]:

# Normalize the corpus, starting from the preprocessed text
normalized_articles = ta.normalize_articles(df_articles['processed_content'].tolist(), normalize_corpus)
stop_words = set(stopwords.words('english'))

# Tokenize the normalized articles
tokenized_articles = [article.split() for article in normalized_articles]
# Drop NLTK stop-words from each tokenized article
filtered_articles = [[word for word in article if word not in stop_words] for article in tokenized_articles]



In [120]:

# Vectorize using CountVectorizer
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(normalized_articles)
cv_matrix = cv_matrix.toarray()
vocab = cv.get_feature_names_out()
cv_df = pd.DataFrame(cv_matrix, columns=vocab)
print(cv_df.head())

   abc  abercrombie  ability  abortion  abroad  accept  access  accord  \
0    0            0        0         2       0       0       0       0   
1    0            0        0         0       0       0       0       0   
2    0            0        0         0       0       0       0       1   
3    0            0        0         0       0       0       0       0   
4    0            0        0         0       0       0       0       0   

   accordi  account  ...  xin  yahoo  yard  ye  year  years  yet  york  youth  \
0        0        0  ...    0      0     0   0     0      0    0     0      0   
1        0        0  ...    0      0     0   0     0      0    0     0      0   
2        0        0  ...    0      0     0   0     1      0    0     0      0   
3        0        0  ...    0      0     0   0     0      0    0     0      0   
4        0        0  ...    0      0     0   0     1      0    0     0      0   

   yuen  
0     0  
1     0  
2     0  
3     0  
4     0  

[5 rows

In [121]:

# Apply TF-IDF transformation
tt = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True)
tt_matrix = tt.fit_transform(cv_matrix)
tt_matrix = tt_matrix.toarray()
tt_df = pd.DataFrame(np.round(tt_matrix, 2), columns=vocab)
print(tt_df.head())

   abc  abercrombie  ability  abortion  abroad  accept  access  accord  \
0  0.0          0.0      0.0      0.43     0.0     0.0     0.0    0.00   
1  0.0          0.0      0.0      0.00     0.0     0.0     0.0    0.00   
2  0.0          0.0      0.0      0.00     0.0     0.0     0.0    0.22   
3  0.0          0.0      0.0      0.00     0.0     0.0     0.0    0.00   
4  0.0          0.0      0.0      0.00     0.0     0.0     0.0    0.00   

   accordi  account  ...  xin  yahoo  yard   ye  year  years  yet  york  \
0      0.0      0.0  ...  0.0    0.0   0.0  0.0  0.00    0.0  0.0   0.0   
1      0.0      0.0  ...  0.0    0.0   0.0  0.0  0.00    0.0  0.0   0.0   
2      0.0      0.0  ...  0.0    0.0   0.0  0.0  0.19    0.0  0.0   0.0   
3      0.0      0.0  ...  0.0    0.0   0.0  0.0  0.00    0.0  0.0   0.0   
4      0.0      0.0  ...  0.0    0.0   0.0  0.0  0.16    0.0  0.0   0.0   

   youth  yuen  
0    0.0   0.0  
1    0.0   0.0  
2    0.0   0.0  
3    0.0   0.0  
4    0.0   0.0  

[

In [122]:

# Generate embeddings with spaCy: one mean vector per article, ignoring stop-words
embeddings = []
for article in normalized_articles:
    doc = nlp(article)
    mean_vector = np.mean([token.vector for token in doc if not token.is_stop], axis=0)
    embeddings.append(mean_vector)
df_embeddings = pd.DataFrame(embeddings, index=df_articles.index)

# Apply NER
for article in normalized_articles:
    doc = nlp(article)
    print(f"Entities in Article: {article}")
    for ent in doc.ents:
        print(f"{ent.text}: {ent.label_}")

# Check embedding for a specific word
word = query.lower()
if word in nlp.vocab:
    print(f"spaCy embedding for '{word}': {nlp.vocab[word].vector}")
else:
    print(f"Word '{word}' not found in vocabulary.")

Entities in Article: florida one state abortion ballot election closely watch proposal restore abortion right florida track defeat significant blow effort [ char ]
florida: GPE
one: CARDINAL
florida: GPE
Entities in Article: hurricane helene milton leave something particularly nasty behind wake bacteria know cause flesh eat infection florida health official warn resident stay awa [ char ]
helene milton: PERSON
florida: GPE
awa: PERSON
Entities in Article: year old man florida charge animal cruelty allegedly leave dog tie fence people evacuate hurricane milton accord press release [ char ]
year old: DATE
florida: GPE
hurricane milton: ORG
Entities in Article: interview bbc newsnight legendary musician speak grow chicago friendship celebrity
bbc newsnight: PERSON
chicago: GPE
Entities in Article: tell essay base conversation rick shiver year old retire firefighter port orange part daytona beach metropolitan area floridas atlantic coast follo [ char ]
rick: PERSON
year old: DATE
daytona b

In [123]:
# Summarization using Hugging Face's Transformers
summarizer = pipeline('summarization', model='t5-large', framework='pt')
generator = pipeline('text-generation', model='gpt2')

def generate_summaries(normalized_articles, df_articles, summarizer, generator):
    summaries = []
    for article, (_, article_row) in zip(normalized_articles, df_articles.iterrows()):
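        # Word count, used as a rough proxy for the input's token count
        input_len = len(article.split())
        # Cap the summary at half the input (at least 50, never more than the input itself)
        max_len = min(max(input_len // 2, 50), input_len)
        # At least a quarter of the input or 30, but never above max_len on short articles
        min_len = min(max(input_len // 4, 30), max_len)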
        summary = summarizer(article, max_length=max_len, min_length=min_len, do_sample=False, truncation=True)[0]['summary_text']
        coherent_summary = generator(summary, max_length=50, num_return_sequences=1)[0]['generated_text']
        source = article_row['source']
        final_summary = f"{coherent_summary} (Source: {source})"
        summaries.append(final_summary if coherent_summary else "No Summary Available")
    return summaries

# Main script

# Assuming normalized_articles and df_articles are already defined
summaries = generate_summaries(normalized_articles, df_articles, summarizer, generator)

# Print summaries
for i, summary in enumerate(summaries):
    print(f"Summary {i+1}: {summary}")

# Report a failed API request
if response.status_code != 200:
    print(f'Error: {response.status_code} - {response.text}')




Summary 1: abortion ballot election closely watch proposal restore abortion right florida track defeat significant blow effort [ charlykevist file ] The California Supreme Court denied marriage equality to same-sex couples who have filed briefs in federal appeals court challenging a state law that (Source: BBC News)
Summary 2: bacteria know cause flesh eat infection florida health official warn resident stay awa . helene milton-spaer

Read more: More: New information reveals Ebola virus may come to England through Africa doctor

What does it mean (Source: Gizmodo.com)
Summary 3: year old man florida charge animal cruelty allegedly leave dog tie fence people evacuate hurricane milton accordion man finds dog tied fence he takes dog

A homeowner who claims he is an animal lover told WTOC he got "wounded" by (Source: Gizmodo.com)
Summary 4: interview bbc newsnight legendary musician and former member of the "Unification" band!

This event will feature music from the band members "Marianne,

This project was a rollercoaster of highs and lows, culminating in a somewhat lackluster ending compared to the effort invested. The objective was straightforward: create an abstractive summarizer for news articles. While I technically met this objective early on, the results were initially unsatisfactory, indicating significant room for improvement. I faced various challenges, including compatibility issues that led me to create a new environment, generate a requirements.txt file, and reinstall all dependencies. These setbacks were frustrating but also provided valuable learning experiences in managing project environments and dependencies effectively.

To enhance user input and experience, I implemented modifications such as using the urllib module for encoded multi-word queries and excluding YouTube results due to formatting issues. Initially, the basic query parameters worked fine, but they were limited in handling multi-word inputs. Switching to an encoded query presented challenges, particularly with urllib.parse not identifying proper column names. This issue was resolved by using quote_plus, which correctly encoded multi-word queries. This adjustment significantly improved the flexibility and user-friendliness of the application.
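
One detail worth checking when encoding queries by hand: `requests` already URL-encodes the values passed through `params`, so a value that has first been run through `quote_plus` may end up encoded twice. The sketch below uses the standard library's `urlencode` as a stand-in for that encoding step; it does not call NewsAPI, and the sample query is illustrative.

```python
from urllib.parse import quote_plus, urlencode

query = "Ukraine war"  # illustrative multi-word query

# Encoding the raw query once: the space becomes '+'
encoded_once = urlencode({"q": query})

# Pre-encoding with quote_plus and then encoding again: the '+' itself is escaped
encoded_twice = urlencode({"q": quote_plus(query)})

print(encoded_once)   # q=Ukraine+war
print(encoded_twice)  # q=Ukraine%2Bwar
```

If the second form causes trouble for a given API, passing the raw query string in `params` and leaving the encoding to `requests` is the simpler option.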

Once I successfully retrieved the articles, I normalized them using a modified preprocessor and normalizer. This step was crucial in ensuring that the text was clean and consistent, which is essential for accurate analysis and summarization. The preprocessing involved removing code snippets, special characters, and unnecessary whitespace, making the text more uniform. Following normalization, I vectorized the results using both CountVectorizer and TfidfTransformer. However, the resulting matrices were sparse: most features were zero for any given article, and the articles shared few terms. This highlighted the challenge of capturing meaningful features from textual data, especially when the text is diverse.
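
To make the sparsity concrete, here is a minimal pure-Python sketch of the same two steps: term counts, then smoothed TF-IDF with L2 normalization, using the formula scikit-learn documents for `smooth_idf=True`. The three short documents are a made-up toy corpus, not the project data.

```python
import math
from collections import Counter

# Toy corpus standing in for the normalized articles (illustrative, not project data)
docs = [
    "florida abortion ballot election",
    "florida hurricane milton dog",
    "chicago musician interview",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Term counts per document: the bag-of-words matrix
counts = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight counts by IDF, then scale the row to unit L2 length."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm if norm else 0.0 for value in weighted]

tfidf = [tfidf_row(row) for row in counts]

# Share of zero entries: a rough measure of how sparse the matrix is
zeros = sum(value == 0 for row in counts for value in row)
sparsity = zeros / (len(counts) * len(vocab))
print(f"{len(vocab)} terms, {sparsity:.0%} of entries are zero")
```

Even with three short documents, most entries are zero, and a term shared by two documents ("florida") receives a lower weight than a term unique to one.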

Next, I tackled embeddings and Named Entity Recognition (NER). The NER performed reasonably well on place names, labeling "florida" and "chicago" as GPE, which demonstrated that the model can recognize entities even in normalized text. The output above also shows mislabels, however, such as "hurricane milton" tagged as ORG and "bbc newsnight" tagged as PERSON; running NER on lowercased, lemmatized text likely contributes to these errors. The keyword embedding function also struggled with multi-word queries. For example, while the model correctly identified "Ukraine" in the vocabulary, it failed to recognize "Ukraine war." This discrepancy pointed to the need for better handling of phrases and compound words within the embedding model.
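
One common way to handle a multi-word query is to average the vectors of its individual words instead of looking up the whole phrase as a single vocabulary entry. The sketch below illustrates the idea with a hypothetical `toy_vectors` table whose values are made up; with spaCy, the analogous call would be `nlp(query).vector`, since `Doc.vector` defaults to the average of the token vectors.

```python
# Toy word vectors standing in for a real embedding table (values are made up)
toy_vectors = {
    "ukraine": [0.2, 0.8, 0.1],
    "war": [0.6, 0.0, 0.5],
}

def phrase_vector(phrase, vectors):
    """Average the vectors of the words in `phrase` that have an entry.

    Returns None when no word in the phrase is known.
    """
    known = [vectors[word] for word in phrase.lower().split() if word in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]

print(phrase_vector("Ukraine", toy_vectors))
print(phrase_vector("Ukraine war", toy_vectors))
print(phrase_vector("unknown phrase", toy_vectors))
```

A single known word returns its own vector, a phrase returns the element-wise mean of its known words, and a phrase with no known words returns `None` so the caller can report it.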

Creating the summarizer was the most challenging part of the project. Selecting the optimal transformer model from Hugging Face and using a generator to reintegrate stop-words was tricky. I experimented with various models: t5-small was quick and provided moderate accuracy, while t5-base was slower but more precise. Ultimately, t5-large emerged as the best option, offering the highest accuracy despite requiring significant processing power. The choice of model greatly influenced the quality and coherence of the summaries.

I developed a function to generate summaries, deriving the maximum and minimum summary lengths from the input length to balance detail and efficiency. To improve readability, each summary was then passed through a GPT-2 text-generation pipeline, with the aim of restoring the stop-words removed during normalization. Because a text-generation model continues its input rather than rewriting it, this step can append unrelated text, which is visible in some of the summaries above. Additionally, I included source citations in the summaries, allowing users to reference the original articles.
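
One case to watch with length rules of this kind is a short article, where a fixed minimum can end up larger than the maximum. The sketch below shows one way to compute the bounds so that the minimum never exceeds the maximum; the function name `summary_length_bounds` and its default floors are illustrative assumptions, and word counts are used as a rough proxy for tokens.

```python
def summary_length_bounds(input_len, floor_max=50, floor_min=30):
    """Return (min_len, max_len) for a summary of an input with `input_len` words.

    max_len is half the input, but at least `floor_max` and never above the input length.
    min_len is a quarter of the input, at least `floor_min`, and clamped to max_len.
    """
    max_len = min(max(input_len // 2, floor_max), input_len)
    min_len = min(max(input_len // 4, floor_min), max_len)
    return min_len, max_len

for words in (20, 80, 400):
    print(words, summary_length_bounds(words))
```

For a 20-word input both bounds collapse to 20, for 80 words they are 30 and 50, and for 400 words they are 100 and 200.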

The project was successful in generating clear, source-cited summaries. The results were printed with keywords and sources, providing users with a concise overview of the articles. However, several aspects still require refinement. For instance, the summaries would benefit from proper capitalization of nouns and more consistent integration of stop-words. Some summaries were fluid and coherent, while others were fragmented and incomplete. This variation highlighted the challenge of achieving uniform quality in automated summarization.

In conclusion, while the project met its primary objective, it also revealed areas for improvement. The experience was a valuable learning journey, providing insights into handling textual data, optimizing models, and enhancing user experience. Future iterations of the project will focus on refining the summarization process, improving the handling of multi-word queries, and ensuring more consistent and coherent outputs. Despite the challenges, the project was a significant step forward in developing a functional and flexible summarizer for news articles.