# Sentiment Analysis Beta Prototyping for Data Processing

## Notebook Overview
This notebook is designed as a tool for our sentiment analysis project. Its purpose is to simplify the data processing pipeline by providing functions to retrieve and extract text from various sources—including PDFs, news articles (with pop-up removal), DOC/DOCX files, and plain text files. Each extraction function includes proper error handling and detailed explanations to ensure clarity during further development and analysis.

In [1]:
import spacy

In [2]:
#example of how we can use spacy as a tokenizer to help us save time without having to write the tokenizer ourselves
nlp = spacy.load("en_core_web_sm")
doc = nlp("some text")
# token.dep_ (with the trailing underscore) gives the readable label; token.dep is its integer ID
print([(token.pos_, token.text, token.dep_, token.is_alpha) for token in doc])

[('DET', 'some', 'det', True), ('NOUN', 'text', 'ROOT', True)]


# The following question then arises: what would our data pipeline look like with spaCy?

Suppose we want to retrieve text from news articles. Once the text is retrieved, it has to pass through the data pipeline so that it is ready to be fed into the model.
<ol>
    <li>Firstly, we have a document, or any other type of file that contains the text. We extract it. For the news it would have to be through scraping. Otherwise, the main model should also work fine for the </li>
    <li>Once the data has been extracted from these formats, we cleanse it of noise and normalise it.</li>
    <li>The extracted data then needs to be stored (we have not yet decided whether as a simple txt file, JSON, or something else). From the stored data, we run the spaCy pipeline to tokenise the document.</li>
    <li>After tokenising, we remove stop words: common words (e.g., “the,” “and”) that are unlikely to contribute meaningful sentiment. This matters because the sentiment analysis should rely only on the informative words.</li>
    <li>The last step is lemmatising the words, i.e. reducing each word to its base form, so that the vocabulary is standardised before feature extraction and transformation.</li>
</ol>
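
The steps above can be sketched end to end in a few lines. This is only a rough illustration under stated assumptions: it uses the standard library instead of spaCy, the stop-word list is a toy one, and `crude_stem` is a naive suffix stripper standing in for real lemmatisation. All function names here are hypothetical.

```python
import re

# Toy stop-word list for illustration only; spaCy ships a much larger one.
STOP_WORDS = {"the", "and", "a", "an", "is", "it", "of", "to", "in", "was"}


def normalise(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenise(text):
    """Keep runs of letters and apostrophes (a stand-in for spaCy's tokenizer)."""
    return re.findall(r"[a-z']+", text)


def remove_stop_words(tokens):
    """Drop tokens that appear in the toy stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Very rough suffix stripping; a placeholder for real lemmatisation."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def run_pipeline(raw_text):
    """Chain the steps: normalise -> tokenise -> filter stop words -> stem."""
    tokens = tokenise(normalise(raw_text))
    return [crude_stem(t) for t in remove_stop_words(tokens)]


print(run_pipeline("The markets  rallied and the investors cheered."))
# ['market', 'ralli', 'investor', 'cheer']
```

Note how the crude stemmer turns "rallied" into "ralli"; a proper lemmatiser such as spaCy's would be expected to do better, which is one reason to rely on the library rather than hand-written rules.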

# First Part: The Extraction

Suppose we want to scrape a news article and run sentiment analysis on a certain topic. First, we extract the title of the article together with its main body. For general text, the source is some document, and the title is extracted along with the content and stored.

In [3]:
import textract
import PyPDF2

def extract_txt(file):
    """
    Extracts a text from a normal .txt file

    Args:
        file (str): The path to the .txt file.
    
    Returns:
        str: The extracted text.
    """
    try:
        with open(file, 'r', encoding='utf-8') as f:
            data = f.read()
        return data
    except Exception as e:
        print(f"Error reading file: {e}")
        return None

def extract_doc(file_path):
    """
    Extracts text from a .doc or .docx file using the textract library.

    Args:
        file_path (str): The path to the .doc or .docx file.

    Returns:
        str: The extracted text.
    """
    try:
        text = textract.process(file_path)
        return text.decode("utf-8")
    except Exception as e:
        print(f"Error extracting text from {file_path}: {e}")
        return None
    
def extract_pdf(file_path):
    """
    Extracts text from a .pdf file using the PyPDF2 library.

    Args:
        file_path (str): The path to the .pdf file.

    Returns:
        str: The extracted text.
    """
    try:
        text = ""
        with open(file_path, 'rb') as file:
            # PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
        return text
    except Exception as e:
        print(f"Error extracting text from {file_path}: {e}")
        return None

## PDF Text Extraction
The function above extracts text from PDF files using the PyPDF2 library. It reads the file in binary mode, processes each page, and concatenates the extracted text into a single string.

## DOC/DOCX Text Extraction
The second function uses the textract library to extract text from DOC and DOCX files. textract returns the content as bytes, which the function decodes into a UTF-8 string for further text processing.

## Plain Text Extraction
A simple function at the beginning that reads a .txt file and returns its content as a string. This is useful for cases where the data is already in text format.
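
Because the extraction functions return `None` on failure rather than raising, callers need to check the result. The sketch below illustrates this with a temporary file; `read_text_file` is a hypothetical stand-in that mirrors the plain text function but catches narrower exception types.

```python
import os
import tempfile


def read_text_file(path):
    """Return the file's content as a string, or None if it cannot be read."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as e:
        print(f"Error reading file: {e}")
        return None


# Write a small temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("Stocks rose sharply today.")

    content = read_text_file(sample_path)
    # A missing file prints an error message and yields None.
    missing = read_text_file(os.path.join(tmp_dir, "does_not_exist.txt"))

print(repr(content))
print(missing)
```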

## News Article Scraping

Using Selenium with a headless Chrome browser, the function below loads a news article URL, removes common registration pop-ups (by targeting typical modal CSS selectors), and scrapes the main article text. Key metadata (such as title, publication date, and author) can additionally be extracted by parsing the page's meta tags with BeautifulSoup; this standalone function returns only the text, and the metadata extraction is handled by the unified class later in the notebook. You may need to tweak the selectors based on the website's structure.
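
To make the meta-tag idea concrete, here is a minimal sketch of pulling title, date, and author out of page source. It is an illustration under assumptions: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages, the `MetaTagParser` class is hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser

# Meta properties of interest, mapped to the metadata keys we want to store.
META_KEYS = {
    "og:title": "title",
    "article:published_time": "date",
    "article:author": "author",
}


class MetaTagParser(HTMLParser):
    """Collects selected <meta property=... content=...> values."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = META_KEYS.get(attrs.get("property"))
        if key and attrs.get("content"):
            self.metadata[key] = attrs["content"]


# Made-up page source for illustration.
sample_html = """
<html><head>
<meta property="og:title" content="Sample Headline">
<meta property="article:published_time" content="2023-01-01T08:00:00Z">
<meta property="article:author" content="Author Name">
</head><body><article>Body text.</article></body></html>
"""

parser = MetaTagParser()
parser.feed(sample_html)
print(parser.metadata)
```

Not every site exposes these properties, so each key should be treated as optional.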

In [4]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_article_text(url):
    """
    Loads a news article URL using Selenium, attempts to remove
    any registration pop-up, and returns the extracted article text.
    
    Args:
        url (str): URL of the news article.
        
    Returns:
        str: The extracted text of the article.
    """
    # Set up headless Chrome
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        driver.get(url)
        
        # Wait for the page to load by waiting for an <article> element
        # Adjust the tag or selector if the site uses a different structure
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
    except Exception as e:
        print("Error waiting for the article to load:", e)
    
    # Attempt to remove common modal/pop-up elements.
    # The selectors here are examples; you may need to tweak them for your target site.
    modal_selectors = [".modal", ".popup", ".overlay", "#registration-modal"]
    for selector in modal_selectors:
        try:
            modals = driver.find_elements(By.CSS_SELECTOR, selector)
            for modal in modals:
                driver.execute_script("""
                    var element = arguments[0];
                    element.parentNode.removeChild(element);
                """, modal)
        except Exception as e:
            print(f"Error removing element with selector {selector}: {e}")

    # Extract text: try to get text from an <article> tag if available,
    # otherwise fall back to the full body text.
    article_text = ""
    try:
        try:
            article = driver.find_element(By.TAG_NAME, "article")
            article_text = article.text
        except Exception as e:
            print("No <article> tag found; falling back to <body> text.", e)
            article_text = driver.find_element(By.TAG_NAME, "body").text
    finally:
        # Always close the browser, even if the fallback extraction fails
        driver.quit()
    return article_text


#Check to make sure chromedriver works
chrome_options = Options()
chrome_options.add_argument("--headless")  # optional: run in headless mode

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.google.com")
print("Title:", driver.title)
driver.quit()


Title: Google


## Future Enhancements

<p>Ok great, how can we improve this?</p>

One promising improvement is to enhance our data pipeline by also processing and saving metadata. We can extract key metadata—such as the article name, publication date, and author—from the source (when available) and store it along with the main text. The complete data will be saved as a JSON file in a dedicated folder (e.g., processed_articles). This folder will act as a staging area for our data pipeline, allowing us to later analyze who wrote each piece, their views on different topics, and how the overall sentiment or consensus forms around these topics.
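
As a rough sketch of what such a staging step could look like: the snippet below writes one record (text plus metadata) to a JSON file named after the article title. The `slugify` and `save_record` helpers and the record layout are assumptions for illustration, and a temporary directory stands in for `processed_articles`.

```python
import json
import os
import re
import tempfile


def slugify(title):
    """Turn a title into a filesystem-friendly file stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_record(text, metadata, output_dir):
    """Write one article (text plus metadata) to <output_dir>/<slug>.json."""
    os.makedirs(output_dir, exist_ok=True)
    file_name = slugify(metadata.get("title", "")) + ".json"
    file_path = os.path.join(output_dir, file_name)
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump({"text": text, "metadata": metadata}, f, ensure_ascii=False, indent=4)
    return file_path


# A temporary directory stands in for processed_articles here.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "processed_articles")
    path = save_record(
        "Body of the article.",
        {"title": "Sample Article!", "author": "Author Name", "date": "2023-01-01"},
        out_dir,
    )
    # Read the file back to confirm the round trip.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(os.path.basename(path))
print(loaded["metadata"]["author"])
```

Deriving the filename from the title keeps the folder browsable, though two articles with the same title would overwrite each other; a date or hash suffix would be one way around that.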

## Unified Text Extraction Class
Below is a Python class that unifies the extraction process. It determines whether the input is a URL or a file path and calls the appropriate extraction method. Each method returns both the main text and a metadata dictionary: news articles and PDFs yield whatever metadata is available, while DOC/DOCX and plain text files currently yield an empty dictionary. You can later extend the metadata extraction for TXT files if you decide on a header format. The class also offers a method to save the data as JSON.
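
The dispatch logic can be looked at in isolation first. The following is a minimal sketch, assuming a hypothetical `classify_source` helper that is not part of the class; it uses `urlparse` for the URL check and the file extension for everything else.

```python
import os
from urllib.parse import urlparse


def classify_source(source):
    """Return 'url', 'pdf', 'doc', 'txt', or 'unsupported' for a source string."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Not a web URL, so decide by file extension (case-insensitive).
    ext = os.path.splitext(source)[1].lower()
    if ext == ".pdf":
        return "pdf"
    if ext in (".doc", ".docx"):
        return "doc"
    if ext == ".txt":
        return "txt"
    return "unsupported"


for example in ["https://example.com/news/story", "report.PDF", "notes.docx", "data.csv"]:
    print(example, "->", classify_source(example))
```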

In [5]:
import os
import re
import json
from urllib.parse import urlparse

import textract
from bs4 import BeautifulSoup

# Selenium for URL (news article) extraction
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Library for PDF extraction
import PyPDF2

class TextExtractor:
    """
    A unified text extraction tool for sentiment analysis data processing.
    
    This class supports extracting text and metadata from various sources:
      - PDF files (using PyPDF2)
      - DOC and DOCX files (using textract)
      - Plain text files (.txt)
      - News articles via URL (using Selenium & BeautifulSoup)
    
    For each source, the method returns a tuple: (text, metadata)
    where `metadata` is a dictionary containing key properties like title, author, date, etc.
    Data is later saved as JSON in a dedicated output directory.
    """
    def __init__(self, output_dir="processed_articles"):
        self.output_dir = output_dir
        # exist_ok avoids an error if the directory already exists
        os.makedirs(self.output_dir, exist_ok=True)
    
    def extract_from_pdf(self, file_path):
        """
        Extracts text and metadata from a PDF file.
        
        Metadata is extracted from the PDF's document info if available (e.g., Title, Author).
        """
        text = ""
        metadata = {}
        try:
            with open(file_path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                # Extract text from each page
                for page in reader.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text + "\n"
                # Extract metadata from the PDF document
                if hasattr(reader, "metadata") and reader.metadata:
                    pdf_meta = reader.metadata
                    for key, value in pdf_meta.items():
                        if value:
                            # Remove the leading slash (e.g., '/Title') and convert key to lower case
                            cleaned_key = key.lstrip('/').lower()
                            metadata[cleaned_key] = value
        except Exception as e:
            print(f"Error extracting from PDF: {e}")
        return text.strip(), metadata
    
    def extract_from_doc(self, file_path):
        """
        Extracts text from DOC and DOCX files using textract.
        
        Metadata extraction is not supported by textract, so an empty dictionary is returned.
        """
        text = ""
        metadata = {}
        try:
            # textract returns a bytestring so decode it to UTF-8.
            text = textract.process(file_path).decode("utf-8")
        except Exception as e:
            print(f"Error extracting from DOC/DOCX: {e}")
        return text.strip(), metadata
    
    def extract_from_txt(self, file_path):
        """
        Reads text from a plain text (.txt) file.
        
        Currently, no metadata is extracted from TXT files.
        """
        text = ""
        metadata = {}
        try:
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()
        except Exception as e:
            print(f"Error reading TXT file: {e}")
        return text.strip(), metadata
    
    def scrape_article_text(self, url):
        """
        Loads a news article URL using Selenium and extracts both the article
        text and key metadata. Unlike the standalone function above, this
        version does not yet remove pop-ups.
        
        Metadata is extracted from meta tags (e.g., og:title, article:published_time, article:author)
        and falls back to <title> if necessary.
        """
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        driver = webdriver.Chrome(options=chrome_options)
        
        article_text = ""
        metadata = {}
        
        try:
            driver.get(url)
            # Wait for content to load (preferably an <article> tag)
            try:
                WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.TAG_NAME, "article"))
                )
                article = driver.find_element(By.TAG_NAME, "article")
                article_text = article.text
            except Exception:
                article_text = driver.find_element(By.TAG_NAME, "body").text
            
            # Use BeautifulSoup to parse the page and extract metadata
            soup = BeautifulSoup(driver.page_source, "html.parser")
            # Extract title using og:title or <title>
            title_tag = soup.find("meta", property="og:title")
            if title_tag and title_tag.get("content"):
                metadata["title"] = title_tag["content"]
            elif soup.title and soup.title.string:
                metadata["title"] = soup.title.string.strip()
            
            # Extract publication date (if available)
            date_tag = soup.find("meta", property="article:published_time")
            if date_tag and date_tag.get("content"):
                metadata["date"] = date_tag["content"]
            
            # Extract author information (if available)
            author_tag = soup.find("meta", property="article:author")
            if author_tag and author_tag.get("content"):
                metadata["author"] = author_tag["content"]
        
        except Exception as e:
            print(f"Error scraping article: {e}")
        finally:
            driver.quit()
        
        return article_text.strip(), metadata
    
    def extract_text(self, source):
        """
        Determines whether the source is a URL or a file path,
        then uses the appropriate extraction method.
        
        Returns:
            tuple: (text, metadata)
        """
        # Check if source is a URL
        if re.match(r'^https?://', source):
            return self.scrape_article_text(source)
        else:
            ext = os.path.splitext(source)[1].lower()
            if ext == ".pdf":
                return self.extract_from_pdf(source)
            elif ext in [".doc", ".docx"]:
                return self.extract_from_doc(source)
            elif ext == ".txt":
                return self.extract_from_txt(source)
            else:
                print("Unsupported file format.")
                return "", {}
    
    def save_as_json(self, data, filename):
        """
        Saves the provided data (text and metadata) as a JSON file in the output directory.
        
        Args:
            data (dict): A dictionary containing extracted text and metadata.
            filename (str): The filename for the saved JSON.
        """
        file_path = os.path.join(self.output_dir, filename)
        try:
            with open(file_path, "w", encoding="utf-8") as f:
                json.dump(data, f, ensure_ascii=False, indent=4)
            print(f"Data successfully saved to {file_path}")
        except Exception as e:
            print(f"Error saving JSON file: {e}")

# Example usage:
if __name__ == '__main__':
    extractor = TextExtractor()
    
    # Example 1: Extract from a news article URL
    source_url = "https://www.bbc.com/news/articles/c4gdwgjkk1no"
    text, metadata = extractor.extract_text(source_url)
    data = {"text": text, "metadata": metadata}
    extractor.save_as_json(data, "sample_article.json")
    
    # Example 2: Extract from a local DOCX file (also supports .doc)
    # source_file = "path/to/sample.docx"
    # text, metadata = extractor.extract_text(source_file)
    # data = {"text": text, "metadata": metadata}
    # extractor.save_as_json(data, "sample_docx.json")
    
    # Example 3: Extract from a local PDF file
    # source_file = "path/to/sample.pdf"
    # text, metadata = extractor.extract_text(source_file)
    # data = {"text": text, "metadata": metadata}
    # extractor.save_as_json(data, "sample_pdf.json")
    
    # Example 4: Extract from a local TXT file
    # source_file = "path/to/sample.txt"
    # text, metadata = extractor.extract_text(source_file)
    # data = {"text": text, "metadata": metadata}
    # extractor.save_as_json(data, "sample_txt.json")


Data successfully saved to processed_articles/sample_article.json


## And voilà
it works!🚀🚀🚀
As you can see, we create an instance of the class, pass the source_url of an article to extract_text, and that method dispatches to the helper for the specific data format. The result is then saved as JSON. Metadata is not supported for TXT and DOC/DOCX files, so the title and similar details will most likely sit inside the text itself. This is not a problem for now, and we will tackle it later.
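
One possible way to tackle metadata for TXT files is to agree on a simple header format. The sketch below assumes a hypothetical convention that this notebook has not decided on: `Key: value` lines at the top of the file, separated from the body by one blank line. The `split_header` function is likewise only an illustration.

```python
def split_header(raw_text):
    """Split 'Key: value' header lines (up to the first blank line) from the body.

    Returns (body, metadata). If the first block is not a well-formed header,
    the whole text is treated as the body and metadata is empty.
    """
    head, sep, body = raw_text.partition("\n\n")
    metadata = {}
    for line in head.splitlines():
        key, colon, value = line.partition(":")
        if not colon or not key.strip() or not value.strip():
            # Not a well-formed header block: keep everything as body text.
            return raw_text.strip(), {}
        metadata[key.strip().lower()] = value.strip()
    if not sep:
        # No blank line at all, so there is no separate body.
        return raw_text.strip(), {}
    return body.strip(), metadata


sample = "Title: Sample Article\nAuthor: Author Name\n\nThis is the body of the article."
body, metadata = split_header(sample)
print(metadata)
print(body)
```

A limitation of this sketch is that an ordinary first paragraph consisting only of lines like `Note: something` would be mistaken for a header, so a stricter format (for example, a fixed set of allowed keys) may be preferable.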

This is the result of the test:
<img src="dummy_files/figure1.png">

# Second Part: Noise Cleansing and Normalisation
In this section, we focus on cleaning the extracted data by removing noise (e.g., HTML tags, punctuation, stopwords, etc.) and normalizing the text using spaCy. The goal is to prepare the text for downstream sentiment analysis by standardizing it (through processes such as lemmatization and lowercasing) and stripping out unnecessary noise.

Below, we create a sample JSON file containing some text and metadata. We then load the JSON, process the text to remove unwanted elements and normalize it using spaCy, and finally display the results. Later on, this cleaning process can be refactored into its own function or class for integration into a larger data processing pipeline.

In [8]:
from bs4 import BeautifulSoup

# ---------------------------
# Create a sample JSON file
# ---------------------------
sample_json = {
    "text": "This is a sample article! It includes various, punctuation and UPPERCASE letters. And some noise... <script>alert('noise');</script>",
    "metadata": {
        "title": "Sample Article",
        "author": "Author Name",
        "date": "2023-01-01"
    }
}

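# Pass the encoding explicitly, since ensure_ascii=False may write non-ASCII characters
with open('sample_article.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json, f, ensure_ascii=False, indent=4)

#load the json file
with open('sample_article.json', 'r', encoding='utf-8') as f:
    data = json.load(f)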

#initialise the spacy nlp model
nlp = spacy.load("en_core_web_sm")
text = re.sub(r'<[^>]+>', '', data['text'])

doc = nlp(text)
cleaned_tokens = []
for token in doc:
    if not token.is_stop and not token.is_punct and not token.like_num:
        lemma = token.lemma_.lower().strip()
        #skip pronouns or empty tokens
        if lemma and lemma != '-pron-':
            cleaned_tokens.append(lemma)
cleaned_text = ' '.join(cleaned_tokens)

# Update the JSON data with the cleaned text
data["cleaned_text"] = cleaned_text

# Display the original and cleaned text
print("Original Text:")
print(data["text"])
print("\nCleaned and Normalized Text:")
print(data["cleaned_text"])


Original Text:
This is a sample article! It includes various, punctuation and UPPERCASE letters. And some noise... <script>alert('noise');</script>

Cleaned and Normalized Text:
sample article include punctuation uppercase letter noise alert('noise


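OK, great, but the content of the script (alert('noise) is still in the output: the regular expression strips only the tags, not what sits between them. We can improve this by using Beautiful Soup to drop script and style elements entirely, plus another regular expression to tidy up the whitespace.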
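
The difference between the two approaches can be shown side by side. This is a minimal sketch that uses the standard library's `html.parser` as a stand-in for Beautiful Soup so it runs without extra packages; the `VisibleTextParser` class is hypothetical.

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text content while skipping everything inside <script>/<style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not nested inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


raw = "And some noise... <script>alert('noise');</script>"

# Regex approach: removes the tags but keeps the script body.
regex_result = re.sub(r"<[^>]+>", "", raw)

# Parser approach: drops the script element together with its content.
parser = VisibleTextParser()
parser.feed(raw)
parser.close()
parser_result = re.sub(r"\s+", " ", "".join(parser.parts)).strip()

print(regex_result)   # And some noise... alert('noise');
print(parser_result)  # And some noise...
```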

In [9]:
nlp = spacy.load("en_core_web_sm")
with open("sample_article.json", "w", encoding="utf-8") as f:
    json.dump(sample_json, f, ensure_ascii=False, indent=4)

# ---------------------------
# Load the sample JSON file and clean the text
# ---------------------------
with open("sample_article.json", "r", encoding="utf-8") as f:
    data = json.load(f)

text = data['text']
# Use BeautifulSoup to parse the text and remove unwanted tags
soup = BeautifulSoup(text, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()  # Remove script and style elements completely
# Extract text using BeautifulSoup's get_text method
text = soup.get_text(separator=" ")
# Remove extra whitespace and newlines
text = re.sub(r'\s+', ' ', text)

# Process the cleaned text with spaCy
doc = nlp(text)
cleaned_tokens = []
for token in doc:
    if not token.is_stop and not token.is_punct and not token.like_num:
        lemma = token.lemma_.lower().strip()
        if lemma and lemma != '-pron-':
            cleaned_tokens.append(lemma)
    
cleaned_text = " ".join(cleaned_tokens)

data["cleaned_text"] = cleaned_text

print("Original Text:")
print(data["text"])
print("\nCleaned and Normalized Text:")
print(data["cleaned_text"])
    

Original Text:
This is a sample article! It includes various, punctuation and UPPERCASE letters. And some noise... <script>alert('noise');</script>

Cleaned and Normalized Text:
sample article include punctuation uppercase letter noise


# Much Better! 🚀🚀🚀 ✅
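
As mentioned at the start of this part, the token filtering can be refactored into its own function. The sketch below is one possible shape, under the assumption that the function receives an already-parsed document; `clean_tokens` and `fake_token` are hypothetical names. To keep the example runnable without a spaCy model, it is exercised with stand-in tokens, whereas in the notebook you would pass `nlp(text)` instead.

```python
from types import SimpleNamespace


def clean_tokens(doc):
    """Return lowercased lemmas, skipping stop words, punctuation, and numbers.

    `doc` is any iterable of token-like objects exposing is_stop, is_punct,
    like_num, and lemma_ (a spaCy Doc satisfies this).
    """
    cleaned = []
    for token in doc:
        if token.is_stop or token.is_punct or token.like_num:
            continue
        lemma = token.lemma_.lower().strip()
        if lemma:
            cleaned.append(lemma)
    return " ".join(cleaned)


def fake_token(lemma, is_stop=False, is_punct=False, like_num=False):
    """Build a stand-in token so the example runs without a spaCy model."""
    return SimpleNamespace(
        lemma_=lemma, is_stop=is_stop, is_punct=is_punct, like_num=like_num
    )


fake_doc = [
    fake_token("this", is_stop=True),
    fake_token("be", is_stop=True),
    fake_token("a", is_stop=True),
    fake_token("Sample"),
    fake_token("article"),
    fake_token("!", is_punct=True),
    fake_token("3", like_num=True),
]

print(clean_tokens(fake_doc))  # sample article
```

Keeping the function independent of how the document was produced makes it easy to test and to plug into the larger pipeline later.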