**CIS 4190: Applied Machine Learning**

# **Final Project - News Source Classification (B)**


## Data Collection Notebook

This notebook handles the web scraping pipeline to collect headlines from Fox News and NBC News.

**Notebook Output:** `scraped_data.csv`

Contains the scraped headlines. The file is available in the shared Google Drive [folder](https://drive.google.com/drive/folders/1aL1XihVewKl_Uz_-mAllUobb0SH2UMy_?usp=drive_link).

**For the main project notebook and ML pipeline, see:**

**[Main Notebook](main_notebook.ipynb)** - ML pipeline and model development




---

## **1. Setup and Installation**

Install the required Python packages for web scraping, data processing, and machine learning, and verify that the required input files are present.

#### **1.1 Import Libraries**

Import all necessary libraries for data processing, web scraping, and machine learning.


In [None]:
%pip install beautifulsoup4 requests scikit-learn pandas numpy matplotlib



In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import numpy as np

#### **1.2 Required Files Setup**

This notebook requires data files from a shared Google Drive folder. **Please upload them manually before running the notebook.**

**Required files:**
- `url_only_data.csv` - Contains URLs for data collection

**Instructions:**
1. Download files from [Google Drive](https://drive.google.com/drive/folders/1aL1XihVewKl_Uz_-mAllUobb0SH2UMy_?usp=drive_link)
2. In Colab, click the **Files** icon (left sidebar)
3. Click **Upload** button
4. Select and upload the required files
5. Run the next cell to verify files are present

**Note:** The notebook will stop execution if required files are missing.

In [None]:
# Check for required files
# This cell will stop execution if files are missing

import os

# Google Drive folder
DRIVE_FOLDER_URL = "https://drive.google.com/drive/folders/1aL1XihVewKl_Uz_-mAllUobb0SH2UMy_?usp=drive_link"

# Required files (must be present for notebook to run)
REQUIRED_FILES = {
    "url_only_data.csv": "REQUIRED - Contains URLs for data collection"
}

print("=" * 60)
print("Required Files Setup")
print("=" * 60)

# Check existing files
existing = []
missing = []
for filename, desc in REQUIRED_FILES.items():
    if os.path.exists(filename):
        existing.append(filename)
        print(f"\nFound: {filename}")
    else:
        missing.append(filename)
        print(f"\nMissing: {filename} - {desc}")

# CRITICAL: Stop execution if files are missing
if missing:
    print("\n" + "=" * 60)
    print("ERROR: Required Files Missing")
    print("=" * 60)
    print("\nThe following files are required but not found:")
    for filename in missing:
        print(f"   - {filename}")
    print(f"\nPlease download from Google Drive:")
    print(f"   {DRIVE_FOLDER_URL}")
    print("\nUpload to Colab:")
    print("   1. Click Files icon (left sidebar)")
    print("   2. Click Upload button")
    print("   3. Select and upload missing files")
    print("   4. Re-run this cell")
    print("\n" + "=" * 60)
    raise FileNotFoundError(
        f"Missing required files: {', '.join(missing)}. "
        f"Please download from Google Drive and upload to Colab."
    )

print("\n" + "=" * 60)
print("All required files are present!")
print("=" * 60)

Required Files Setup

Found: url_only_data.csv

All required files are present!


## **2. Data Collection**

This section covers the data collection process and the web scraping pipeline. We load the provided URLs, inspect the dataset, and implement the scraper.

### **2.1 Load and Inspect the Dataset**

**Objective:** Load the provided CSV file containing URLs from Fox News and NBC News, and perform initial data inspection to understand the dataset's structure and distribution.

**Tasks:**
- Load the CSV file
- Check for missing values
- Identify and count URLs by source (FoxNews vs NBC)
- Display sample URLs from each source


In [None]:
## -------------------------------------------------
## Task 1: Load URL file
## -------------------------------------------------

csv_path = "url_only_data.csv"

print("=" * 60)
print("STEP 2.1: Load and Inspect the Dataset")
print("=" * 60)

# Load the CSV
print(f"\nLoading CSV from: {csv_path}")
df_urls = pd.read_csv(csv_path)

# Basic info
print(f"\nDataset Overview:")
print(f"   Total rows: {len(df_urls)}")
print(f"   Columns: {list(df_urls.columns)}")

df_urls.head()

STEP 2.1: Load and Inspect the Dataset

Loading CSV from: url_only_data.csv

Dataset Overview:
   Total rows: 3805
   Columns: ['url']


Unnamed: 0,url
0,https://www.foxnews.com/lifestyle/jack-carrs-e...
1,https://www.foxnews.com/entertainment/bruce-wi...
2,https://www.foxnews.com/politics/blinken-meets...
3,https://www.foxnews.com/entertainment/emily-bl...
4,https://www.foxnews.com/media/the-view-co-host...


In [None]:
## -------------------------------------------------
## Task 2: Check for missing values and label each URL's source
## -------------------------------------------------
missing_count = df_urls.isnull().sum()
if missing_count.sum() > 0:
    print(f"\nMissing values:")
    print(missing_count[missing_count > 0])
else:
    print(f"\nNo missing values found")

# Identify source (Fox News vs NBC News)
print(f"\nAnalyzing URL sources...")
df_urls['source'] = df_urls['url'].apply(
    lambda x: 'FoxNews' if 'foxnews.com' in x.lower()
    else 'NBC' if 'nbcnews.com' in x.lower()
    else 'Unknown'
)


No missing values found

Analyzing URL sources...
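
The cell above labels a URL by checking whether the domain name appears anywhere in the URL string, so a URL that merely mentions `foxnews.com` in its path or query would also match. A stricter variant, shown here as a minimal sketch, compares only the parsed hostname. The names `DOMAIN_TO_SOURCE` and `label_source` are hypothetical and not part of the notebook, and the sketch assumes every URL is a string.

```python
from urllib.parse import urlparse

# Hypothetical mapping from domain to the label used in this notebook.
DOMAIN_TO_SOURCE = {
    "foxnews.com": "FoxNews",
    "nbcnews.com": "NBC",
}

def label_source(url):
    """Return 'FoxNews', 'NBC', or 'Unknown' based on the URL's hostname only."""
    host = (urlparse(url).hostname or "").lower()
    for domain, label in DOMAIN_TO_SOURCE.items():
        # Match the domain itself or any subdomain such as www.foxnews.com.
        if host == domain or host.endswith("." + domain):
            return label
    return "Unknown"

print(label_source("https://www.foxnews.com/politics/blinken-meets-with-qatars-prime-minister.print"))
print(label_source("https://example.com/why-nbcnews.com-is-down"))
```

For the provided dataset both approaches should give the same labels, since every URL shown above starts with one of the two domains. The hostname check only matters if the input list is later extended with less regular URLs.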


In [None]:
## -------------------------------------------------
## Task 3: Identify URLs by source
## -------------------------------------------------

# Count by source
source_counts = df_urls['source'].value_counts()
print(f"\nURLs by Source:")
for source, count in source_counts.items():
    percentage = (count / len(df_urls)) * 100
    print(f"   {source}: {count} ({percentage:.1f}%)")

# Show sample URLs from each source
print(f"\nSample URLs from each source:")
print(f"\n   Fox News samples:")
fox_samples = df_urls[df_urls['source'] == 'FoxNews']['url'].head(3)
for i, url in enumerate(fox_samples, 1):
    print(f"      {i}. {url}")

print(f"\n   NBC News samples:")
nbc_samples = df_urls[df_urls['source'] == 'NBC']['url'].head(3)
for i, url in enumerate(nbc_samples, 1):
    print(f"      {i}. {url}")

# Check for any unknown sources
if 'Unknown' in source_counts:
    print(f"\n Warning: Found {source_counts['Unknown']} URLs with unknown source")
    unknown_samples = df_urls[df_urls['source'] == 'Unknown']['url'].head(3)
    print(f"   Sample unknown URLs:")
    for i, url in enumerate(unknown_samples, 1):
        print(f"      {i}. {url}")

print(f"\n" + "=" * 60)
print("Step 2.1 Complete: Dataset loaded and inspected")
print("=" * 60)

# Display first few rows with source column
print(f"\nFirst 5 rows of the dataset with source labels:")
df_urls.head()


URLs by Source:
   FoxNews: 2000 (52.6%)
   NBC: 1805 (47.4%)

Sample URLs from each source:

   Fox News samples:
      1. https://www.foxnews.com/lifestyle/jack-carrs-eisenhower-d-days-memo-noble-undertaking
      2. https://www.foxnews.com/entertainment/bruce-willis-demi-moore-avoided-doing-one-thing-while-co-parenting-daughter-says
      3. https://www.foxnews.com/politics/blinken-meets-with-qatars-prime-minister.print

   NBC News samples:
      1. https://www.nbcnews.com/news/world/helicopter-carrying-irans-president-suffers-hard-landing-state-tv-says-rcna152961
      2. https://www.nbcnews.com/pop-culture/celebrity/kristen-cavallari-husband-jay-cutler-divorce-after-10-years-together-n1192966
      3. https://www.nbcnews.com/news/asian-america/why-atlanta-spa-shooter-s-asian-acquaintances-can-t-tell-n1275348

Step 2.1 Complete: Dataset loaded and inspected

First 5 rows of the dataset with source labels:


Unnamed: 0,url,source
0,https://www.foxnews.com/lifestyle/jack-carrs-e...,FoxNews
1,https://www.foxnews.com/entertainment/bruce-wi...,FoxNews
2,https://www.foxnews.com/politics/blinken-meets...,FoxNews
3,https://www.foxnews.com/entertainment/emily-bl...,FoxNews
4,https://www.foxnews.com/media/the-view-co-host...,FoxNews


### **2.2 Web Scraping Pipeline**

**Objective:** Implement a clean scraping pipeline using requests + BeautifulSoup to extract headlines from news articles.

**Steps:**
1. Scraping Function: Extracts headlines from URLs, with error handling and retry logic.
2. Test Scraping Function: Tests the scraping function with sample URLs.
3. Scraping Process: Applies the scraping function to all URLs in the dataset.
4. Log Failed URLs: Saves the URLs that could not be scraped to a timestamped CSV log for later review.
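
The retry logic mentioned in step 1 can be sketched in isolation from the network code. The notebook's `scrape_headline` waits a fixed `delay` between attempts. The sketch below swaps in exponential backoff instead, which is an assumption and not what the notebook does, and it also retries when the callable returns `None`. The helper `with_retries` and the `flaky` test function are hypothetical names used only for illustration.

```python
import time

def with_retries(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or attempts run out.

    fetch: zero-argument callable returning a result, or None on failure.
    sleep: injectable so the sketch can be exercised without real waiting.
    """
    for attempt in range(max_retries):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception:
            # Treat any exception as a failed attempt.
            pass
        if attempt < max_retries - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None

# Simulate a source that fails twice and then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "Example headline"

waits = []  # record the requested delays instead of sleeping
print(with_retries(flaky, sleep=waits.append))
print(waits)
```

Passing `sleep` as a parameter keeps the example fast and makes the wait schedule easy to inspect. With the default `time.sleep`, the same call would pause for one second and then two seconds before the third attempt.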

In [None]:
## -------------------------------------------------
## Step 1: Scraping Function
## -------------------------------------------------

import time
from datetime import datetime

def scrape_headline(url, source, max_retries=3, delay=1):
    """
    Scrape headline from an article's URL.
    Retries on failure, with error handling.

    Args:
        url: The URL to scrape
        source: 'FoxNews' or 'NBC' to determine which selector to use
        max_retries: Maximum number of retry attempts (default: 3)
        delay: Delay between retries in seconds (default: 1)

    Returns:
        str: The headline text if successful, None otherwise
    """
    for attempt in range(max_retries):
        try:
            # Add headers to mimic a browser request
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }

            response = requests.get(url, timeout=10, headers=headers)

            # Check if request was successful
            if response.status_code != 200:
                if attempt < max_retries - 1:
                    time.sleep(delay)
                    continue
                else:
                    return None

            # Parse HTML
            soup = BeautifulSoup(response.text, "html.parser")

            # Try different selectors based on source
            headline = None

            if source == 'FoxNews':
                # Fox News: Try the specific class first
                h1 = soup.find("h1", class_="headline speakable")
                if h1:
                    headline = h1.text.strip()
                else:
                    # Fallback: try other common Fox News patterns
                    h1 = soup.find("h1", class_="headline")
                    if h1:
                        headline = h1.text.strip()
                    else:
                        # Last resort: any h1
                        h1 = soup.find("h1")
                        if h1:
                            headline = h1.text.strip()

            elif source == 'NBC':
                # NBC News: Try main headline selectors
                # Common NBC patterns
                h1 = soup.find("h1", class_="article-hero-headline")
                if h1:
                    headline = h1.text.strip()
                else:
                    h1 = soup.find("h1", class_="headline")
                    if h1:
                        headline = h1.text.strip()
                    else:
                        # Fallback: any h1
                        h1 = soup.find("h1")
                        if h1:
                            headline = h1.text.strip()

            # Clean up the headline
            if headline:
                headline = headline.replace('\n', ' ').replace('\r', ' ')
                headline = ' '.join(headline.split())  # Remove extra whitespace
                return headline

            return None

        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                time.sleep(delay)
                continue
            else:
                return None
        except requests.exceptions.RequestException as e:
            if attempt < max_retries - 1:
                time.sleep(delay)
                continue
            else:
                return None
        except Exception as e:
            # Unexpected error
            if attempt < max_retries - 1:
                time.sleep(delay)
                continue
            else:
                return None

    return None

In [None]:
## -------------------------------------------------
## Step 2: Test Scraping Function
## -------------------------------------------------

print("Testing scraping function with sample URLs...")
print("=" * 60)

# Test with one Fox News URL
if len(df_urls) > 0:
    test_fox_url = df_urls[df_urls['source'] == 'FoxNews']['url'].iloc[0]
    print(f"\nTesting Fox News URL:")
    print(f"  {test_fox_url}")
    test_headline = scrape_headline(test_fox_url, 'FoxNews')
    if test_headline:
        print(f"Success: {test_headline[:80]}...")
    else:
        print(f"Failed to scrape headline")

    # Test with one NBC URL
    if len(df_urls[df_urls['source'] == 'NBC']) > 0:
        test_nbc_url = df_urls[df_urls['source'] == 'NBC']['url'].iloc[0]
        print(f"\nTesting NBC News URL:")
        print(f"  {test_nbc_url}")
        test_headline = scrape_headline(test_nbc_url, 'NBC')
        if test_headline:
            print(f"Success: {test_headline[:80]}...")
        else:
            print(f"Failed to scrape headline")

    print("\n" + "=" * 60)
    print("Function test complete. Ready to scrape all URLs.")
else:
    print("No URLs found in df_urls. Please run section 2.1 first.")

Testing scraping function with sample URLs...

Testing Fox News URL:
  https://www.foxnews.com/lifestyle/jack-carrs-eisenhower-d-days-memo-noble-undertaking
Success: Jack Carr recalls Gen. Eisenhower's D-Day memo about 'great and noble undertakin...

Testing NBC News URL:
  https://www.nbcnews.com/news/world/helicopter-carrying-irans-president-suffers-hard-landing-state-tv-says-rcna152961
Success: Iranian President Raisi is killed in helicopter crash...

Function test complete. Ready to scrape all URLs.


In [None]:
## -------------------------------------------------
##  Step 3: Scraping Process
## -------------------------------------------------
# Scrapes all headlines from the URLs
# NOTE: Scraping only runs if scraped_data.csv does not exist yet
# Otherwise, this cell loads the existing scraped_data.csv


# Check if scraped_data.csv exists
import os
failed_urls = []  # defined up front so the logging cell also works when scraping is skipped
if not os.path.exists("scraped_data.csv"):
    print("=" * 60)
    print("STEP 2.2: Scraping Headlines from All URLs")
    print("=" * 60)
    print(f"\nTotal URLs to scrape: {len(df_urls)}")
    print(f"   Fox News URLs: {len(df_urls[df_urls['source'] == 'FoxNews'])}")
    print(f"   NBC News URLs: {len(df_urls[df_urls['source'] == 'NBC'])}")
    print(f"\nEstimated time: ~{len(df_urls) * 2 // 60} minutes (assuming ~2 seconds per URL)")
    print(f"\nStarting scraping process...")
    print("=" * 60)

    # Initialize lists to track progress
    headlines = []
    failed_urls = []
    success_count = 0
    fail_count = 0

    # Scrape each URL
    start_time = time.time()

    for i, row in df_urls.iterrows():
        url = row['url']
        source = row['source']

        # Scrape the headline
        headline = scrape_headline(url, source)

        if headline:
            headlines.append(headline)
            success_count += 1
        else:
            headlines.append(None)
            failed_urls.append({'url': url, 'source': source, 'index': i})
            fail_count += 1

        # Progress update every 100 URLs
        if (i + 1) % 100 == 0:
            elapsed = time.time() - start_time
            rate = (i + 1) / elapsed if elapsed > 0 else 0
            remaining = (len(df_urls) - (i + 1)) / rate if rate > 0 else 0
            print(f"  Progress: {i + 1}/{len(df_urls)} URLs scraped | "
                  f"Success: {success_count} | Failed: {fail_count} | "
                  f"ETA: {remaining/60:.1f} minutes")

    # Add headlines to dataframe
    df_urls['headline'] = headlines

    # Calculate final statistics
    elapsed_time = time.time() - start_time
    success_rate = (success_count / len(df_urls)) * 100

    print("\n" + "=" * 60)
    print("Scraping Complete!")
    print("=" * 60)
    print(f"\nFinal Statistics:")
    print(f"   Total URLs: {len(df_urls)}")
    print(f"   Successfully scraped: {success_count} ({success_rate:.1f}%)")
    print(f"   Failed: {fail_count} ({100 - success_rate:.1f}%)")
    print(f"   Time elapsed: {elapsed_time/60:.1f} minutes")
    print(f"   Average time per URL: {elapsed_time/len(df_urls):.2f} seconds")

    # Display failed URLs summary
    if failed_urls:
        print(f"\nFailed URLs Summary:")
        fox_failed = sum(1 for f in failed_urls if f['source'] == 'FoxNews')
        nbc_failed = sum(1 for f in failed_urls if f['source'] == 'NBC')
        print(f"   Fox News failed: {fox_failed}")
        print(f"   NBC News failed: {nbc_failed}")
        print(f"\n   First 5 failed URLs:")
        for i, failed in enumerate(failed_urls[:5], 1):
            print(f"      {i}. [{failed['source']}] {failed['url']}")

    # Display sample of successfully scraped headlines
    print(f"\nSample of Successfully Scraped Headlines:")
    successful_df = df_urls[df_urls['headline'].notna()]
    if len(successful_df) > 0:
        for i, row in successful_df.head(5).iterrows():
            print(f"   [{row['source']}] {row['headline'][:80]}...")
    else:
        print("No headlines were successfully scraped!")

    print("\n" + "=" * 60)
else:
    print("=" * 60)
    print("STEP 2.2: Scraping Headlines from All URLs")
    print("=" * 60)
    print("\nScraping skipped - scraped data already exists.")
    print("Loading existing scraped data...")

    # Load existing data (this branch only runs when the file exists)
    df_scraped = pd.read_csv("scraped_data.csv")
    print(f"\nLoaded {len(df_scraped)} headlines from scraped_data.csv")

    # Merge with df_urls to add headlines
    if 'url' in df_scraped.columns:
        # Drop a stale 'headline' column and duplicate URLs first, so re-running
        # this cell does not create headline_x/headline_y or duplicate rows
        df_urls = df_urls.drop(columns=['headline'], errors='ignore').merge(
            df_scraped[['url', 'headline']].drop_duplicates(subset='url'),
            on='url', how='left'
        )
        print("Headlines merged with df_urls")
    else:
        print("Warning: scraped_data.csv does not have 'url' column")

    print("\nTo re-run scraping, delete scraped_data.csv and re-run this cell")
    print("=" * 60)

In [None]:
## -------------------------------------------------
## Step 4: Save failed URLs to a log file for later review
## -------------------------------------------------
if failed_urls:
    import csv

    log_filename = f"failed_urls_log_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
    with open(log_filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['index', 'url', 'source'])
        writer.writeheader()
        writer.writerows(failed_urls)

    print(f"Failed URLs logged to: {log_filename}")
    print(f"   Total failed URLs: {len(failed_urls)}")
else:
    print("No failed URLs to log!")

# Display dataframe with scraped headlines
print(f"\nPreview of scraped data (first 10 rows with headlines):")
df_urls[['url', 'source', 'headline']].head(10)


### **2.3 Save Scraped Data**

**Objective:** Save all scraped headlines to a unified CSV file.

**Output Format:** CSV with columns: `headline`, `source` (`FoxNews` or `NBC`), `url`
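
A quick way to confirm that a saved file matches this format is sketched below using only the standard library. The function name `validate_scraped_csv` and the constants are hypothetical, and the sample rows are made up for illustration rather than taken from the real dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ["headline", "source", "url"]
ALLOWED_SOURCES = {"FoxNews", "NBC"}

def validate_scraped_csv(handle):
    """Return a list of problems found in a scraped-data CSV (empty if none)."""
    reader = csv.DictReader(handle)
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    # Number data rows from 1 (the header is not counted).
    for row_no, row in enumerate(reader, start=1):
        if not (row["headline"] or "").strip():
            problems.append(f"row {row_no}: empty headline")
        if row["source"] not in ALLOWED_SOURCES:
            problems.append(f"row {row_no}: unknown source {row['source']!r}")
    return problems

# Made-up sample: the first row is valid, the second has two problems.
sample = io.StringIO(
    "headline,source,url\n"
    "Example headline,FoxNews,https://www.foxnews.com/example\n"
    ",Other,https://example.com/x\n"
)
print(validate_scraped_csv(sample))
```

To check the real output, the same function could be called on a file opened with `open("scraped_data.csv", newline="", encoding="utf-8")` after the save cell has run.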


In [None]:
# Save the scraped data to CSV file

print("=" * 60)
print("STEP 2.3: Save Scraped Data")
print("=" * 60)

# Prepare the final dataset with only the required columns
# Remove rows with missing headlines (failed scrapes)
df_scraped = df_urls[df_urls['headline'].notna()].copy()

# Keep only the output columns, in the expected order
df_scraped = df_scraped[['headline', 'source', 'url']].copy()

# Save to CSV
output_filename = "scraped_data.csv"
df_scraped.to_csv(output_filename, index=False)

print(f"\nData saved successfully!")
print(f"\nFinal Dataset Statistics:")
print(f"   Total rows: {len(df_scraped)}")
print(f"   Columns: {list(df_scraped.columns)}")
print(f"   Saved to: {output_filename}")

# Show distribution by source
source_dist = df_scraped['source'].value_counts()
print(f"\nDistribution by Source:")
for source, count in source_dist.items():
    percentage = (count / len(df_scraped)) * 100
    print(f"   {source}: {count} ({percentage:.1f}%)")

# Display sample of saved data
print(f"\nSample of saved data (first 5 rows):")
df_scraped.head()


### **2.4 URL Parsing Pipeline**

**Objective:** Implement a clean URL parsing pipeline that extracts and cleans URL slugs (pseudo-headlines) from raw article URLs for use as model input.
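
As a rough illustration of what this pipeline produces, the sketch below collapses the steps into one function and applies it to URLs from the samples shown earlier. The function name `url_to_pseudo_headline` is hypothetical, and the ID pattern is an assumption inferred from the sample NBC URLs, which end in suffixes such as `-rcna152961` and `-n1192966`.

```python
import re
from urllib.parse import urlparse

# Assumed pattern for NBC article IDs at the end of a slug.
NBC_ID_SUFFIX = re.compile(r"-(?:rcna|ncna|n)\d+$")

def url_to_pseudo_headline(url):
    """Turn an article URL into a space-separated pseudo-headline, or None."""
    # Take the last path segment, e.g. "blinken-meets-with-qatars-prime-minister.print".
    slug = urlparse(url).path.strip("/").split("/")[-1]
    # Drop the ".print" suffix that some Fox News URLs carry.
    if slug.endswith(".print"):
        slug = slug[: -len(".print")]
    # Drop the trailing article ID, which would otherwise reveal the source.
    slug = NBC_ID_SUFFIX.sub("", slug)
    # Replace hyphens with spaces and normalize whitespace.
    text = " ".join(slug.replace("-", " ").split())
    return text or None

samples = [
    "https://www.foxnews.com/politics/blinken-meets-with-qatars-prime-minister.print",
    "https://www.nbcnews.com/news/world/helicopter-carrying-irans-president-suffers-hard-landing-state-tv-says-rcna152961",
    "https://www.nbcnews.com/pop-culture/celebrity/kristen-cavallari-husband-jay-cutler-divorce-after-10-years-together-n1192966",
]
for url in samples:
    print(url_to_pseudo_headline(url))
```

Removing the ID matters because a classifier could otherwise learn to recognize NBC purely from the numeric suffix instead of from the wording of the headline.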

In [None]:
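# Helper Functions for URL Parsing and Cleaning
from urllib.parse import urlparse
import re

def extract_slug(url):
    """Return the last path segment of the URL, or None if the path is empty."""
    path = urlparse(url).path
    parts = path.strip("/").split("/")

    # str.split always returns at least one element, so check for an empty slug
    slug = parts[-1]
    return slug or None

def clean_slug(slug):
    """Strip the '.print' suffix and trailing NBC article IDs, then replace hyphens."""
    if slug is None:
        return None
    slug = re.sub(r'\.print$', '', slug)
    # Remove NBC article IDs (e.g. -rcna152961, -n1192966), which would leak the source label
    slug = re.sub(r'-?(?:rcna|ncna)\d+$|-n\d+$', '', slug)
    slug = slug.replace("-", " ")
    return slug.strip()

def slug_to_text(slug):
    """Normalize whitespace; hyphens are already replaced by clean_slug."""
    if slug is None:
        return None
    text = " ".join(slug.replace("-", " ").split())
    return text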

In [None]:
# Extract and Clean titles from URLs
df_parsed = df_urls.copy()

df_parsed['slug'] = df_parsed['url'].apply(extract_slug)
df_parsed['slug'] = df_parsed['slug'].apply(clean_slug)
df_parsed['headline'] = df_parsed['slug'].apply(slug_to_text)

In [None]:
# Display Data
df_parsed = df_parsed.drop(columns=['slug'], errors='ignore')  # safe to re-run
df_parsed

Unnamed: 0,url,source,headline
0,https://www.foxnews.com/lifestyle/jack-carrs-e...,FoxNews,jack carrs eisenhower d days memo noble undert...
1,https://www.foxnews.com/entertainment/bruce-wi...,FoxNews,bruce willis demi moore avoided doing one thin...
2,https://www.foxnews.com/politics/blinken-meets...,FoxNews,blinken meets with qatars prime minister
3,https://www.foxnews.com/entertainment/emily-bl...,FoxNews,emily blunt says toes curl when people their k...
4,https://www.foxnews.com/media/the-view-co-host...,FoxNews,the view co host cnn commentator ana navarro h...
...,...,...,...
3800,https://www.nbcnews.com/politics/2024-election...,NBC,trumps lawyers seek post election day delay co...
3801,https://www.nbcnews.com/select/shopping/best-a...,NBC,best acne scar treatments
3802,https://www.nbcnews.com/select/shopping/best-v...,NBC,best vegetarian vegan meal delivery
3803,https://www.nbcnews.com/politics/2024-election...,NBC,trump says presidential civilian award better ...


### **2.5 Save Parsed Data**
**Objective:** Save the URL-derived pseudo-headlines to `parsed_data.csv`, with columns `url`, `source`, and `headline`.

In [None]:
df_parsed.to_csv("parsed_data.csv", index=False)