News Headlines NLP Lab: Bag-of-Words and Document Similarity
Objective

Analyze a collection of news headlines by building a Bag-of-Words representation to extract features, explore word frequency, and compute document similarity.

Dataset

Use the following list of news headlines:

headlines = [
    "AI outperforms doctors in diagnosing rare diseases",
    "Stock markets hit new record highs amid global optimism",
    "New vaccine shows promise in early trials",
    "Climate change impacts agriculture across multiple continents",
    "Scientists develop biodegradable plastic from seaweed",
    "Sports teams adapt strategies with big data analytics",
    "Electric vehicles set new sales record worldwide",
    "Breakthrough in quantum computing boosts encryption security"
]

Tasks

    Preprocessing
        Write a function to lowercase all text, remove punctuation, and normalize whitespace in each headline.
    Bag-of-Words Analysis
        Use scikit-learn’s CountVectorizer with stop word removal and vocabulary limited to 50 words.
        Fit and transform the preprocessed headlines into a Bag-of-Words matrix.
        Display the vocabulary, shape, and sparsity of the matrix.
    Word Frequency and Visualization
        Compute total word frequency across all headlines.
        Plot the top 10 most frequent words using matplotlib or seaborn.
    Document Similarity
        Calculate cosine similarity between headline vectors.
        Display the similarity matrix in tabular form.
        Identify the two most similar headlines and explain their similarity based on shared vocabulary.

Deliverables

    A notebook implementing the preprocessing function and Bag-of-Words construction.
    Printed output showing vocabulary and matrix characteristics.
    A bar chart of the top 10 words by frequency.
    A similarity matrix with highlighted most similar headline pairs.
    A short commentary explaining the results.

This exercise provides hands-on experience with core NLP techniques including text cleaning, feature extraction via Bag-of-Words, and comparing documents using cosine similarity on vectorized features.

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

In [3]:
headlines = [
    "AI outperforms doctors in diagnosing rare diseases",
    "Stock markets hit new record highs amid global optimism",
    "New vaccine shows promise in early trials",
    "Climate change impacts agriculture across multiple continents",
    "Scientists develop biodegradable plastic from seaweed",
    "Sports teams adapt strategies with big data analytics",
    "Electric vehicles set new sales record worldwide",
    "Breakthrough in quantum computing boosts encryption security"
]

print("Sample Headlines for Vectorization:")
print("=" * 50)
for i, doc in enumerate(headlines, 1):
    print(f"{i}. {doc}")

Sample Headlines for Vectorization:
==================================================
1. AI outperforms doctors in diagnosing rare diseases
2. Stock markets hit new record highs amid global optimism
3. New vaccine shows promise in early trials
4. Climate change impacts agriculture across multiple continents
5. Scientists develop biodegradable plastic from seaweed
6. Sports teams adapt strategies with big data analytics
7. Electric vehicles set new sales record worldwide
8. Breakthrough in quantum computing boosts encryption security


In [4]:
# Preprocessing
def simple_preprocess(text):
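    """Lowercase text, strip punctuation, and normalize whitespace."""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation (any character that is neither a word character nor whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse runs of whitespace into one space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()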
    return text

# Preprocess documents
processed_headlines = [simple_preprocess(headline) for headline in headlines]
print("\nPreprocessed Documents:")
for i, headline in enumerate(processed_headlines, 1):
    print(f"{i}. {headline}")


Preprocessed Documents:
1. ai outperforms doctors in diagnosing rare diseases
2. stock markets hit new record highs amid global optimism
3. new vaccine shows promise in early trials
4. climate change impacts agriculture across multiple continents
5. scientists develop biodegradable plastic from seaweed
6. sports teams adapt strategies with big data analytics
7. electric vehicles set new sales record worldwide
8. breakthrough in quantum computing boosts encryption security
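
The headlines in this dataset contain no punctuation or irregular spacing, so the output above only demonstrates lowercasing. As a quick check of the other two steps, the sketch below runs the same function on two hypothetical strings (`sample` and the hyphenated phrase are illustrative inputs, not part of the lab data).

```python
import re

def simple_preprocess(text):
    """Same preprocessing as in the cell above, repeated so this block runs on its own."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)       # delete punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # normalize whitespace
    return text

# Hypothetical input with punctuation and irregular spacing
sample = "  Breaking:   U.S. stocks rally -- up 3%!  "
cleaned = simple_preprocess(sample)
print(cleaned)  # breaking us stocks rally up 3

# Punctuation is deleted, not replaced by a space, so hyphenated words fuse
print(simple_preprocess("state-of-the-art AI"))  # stateoftheart ai
```

Two side effects are worth keeping in mind: "U.S." collapses to "us", which an English stop word list may then remove, and hyphenated terms merge into a single token. If either matters for a given corpus, one option is to replace punctuation with a space instead of an empty string.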
