# Data Preprocessing Pipeline

For our mini-challenge, we use the Cleantech Media Dataset, a collection of media texts on cleantech and sustainability aimed at businesses, researchers, and students who want to apply Natural Language Processing and Large Language Models in this field. Cleantech evolves quickly, so timely and accurate information is crucial, and the dataset is designed to meet that need.

This dataset is accessible on Kaggle and is credited to [Janna Lipenkova](https://www.kaggle.com/datasets/jannalipenkova/cleantech-media-dataset).

This notebook contains the data preprocessing pipeline for the Cleantech Media Dataset. In our playground notebook (`/playground/data-preprocessing.ipynb`), we explored the dataset and documented our findings on how the preprocessing pipeline should be built. Here, we implement that pipeline, which cleans and preprocesses the data before we use it for training machine learning models.

## 1. Imports

In [1]:
import os
import re
import pandas as pd
import numpy as np

## 2. Data

### 2.1 Training Data

- Comprehensive coverage: a wide range of media texts on cleantech topics, from renewable energy to carbon reduction.
- Efficiency: suited to quick and accurate question answering that supports informed decision-making.
- Regular updates: monthly updates reflect the latest trends in cleantech.
- Sustainability focus: insights from the dataset can support work on sustainability.

In [2]:
data = pd.read_csv('../data/raw/cleantech_media_dataset_v2_2024-02-23.csv', index_col=0).reset_index(drop=True)
data.head()

| | title | date | author | content | domain | url |
|---|---|---|---|---|---|---|
| 0 | Qatar to Slash Emissions as LNG Expansion Adva... | 2021-01-13 | | ["Qatar Petroleum ( QP) is targeting aggressiv... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1 | India Launches Its First 700 MW PHWR | 2021-01-15 | | ["• Nuclear Power Corp. of India Ltd. ( NPCIL)... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 2 | New Chapter for US-China Energy Trade | 2021-01-20 | | ["New US President Joe Biden took office this ... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 3 | Japan: Slow Restarts Cast Doubt on 2030 Energy... | 2021-01-22 | | ["The slow pace of Japanese reactor restarts c... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 4 | NYC Pension Funds to Divest Fossil Fuel Shares | 2021-01-25 | | ["Two of New York City's largest pension funds... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |


The dataset has six columns: `title`, `date`, `author`, `content`, `domain`, and `url`. In the preview above, `author` is empty, and `content` holds each article as a stringified list of text segments.
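
Before any cleaning, it can help to confirm that the expected columns are present and to see how sparse each one is. The sketch below is illustrative only: `summarize_schema`, `expected_columns`, and the two-row `sample` frame (including its `example.com` URLs) are hypothetical stand-ins and not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the raw CSV, using the column names shown above
sample = pd.DataFrame(
    {
        "title": ["Qatar to Slash Emissions", "India Launches Its First 700 MW PHWR"],
        "date": ["2021-01-13", "2021-01-15"],
        "author": [None, None],
        "content": ['["Sentence one.", "Sentence two."]', '["Another article."]'],
        "domain": ["energyintel", "energyintel"],
        "url": ["https://example.com/a", "https://example.com/b"],
    }
)

expected_columns = ["title", "date", "author", "content", "domain", "url"]


def summarize_schema(df, expected):
    """Return the expected columns that are missing and the share of nulls per column."""
    missing = [col for col in expected if col not in df.columns]
    null_share = df.isna().mean()  # fraction of missing values in each column
    return missing, null_share


missing_columns, null_share = summarize_schema(sample, expected_columns)
print(missing_columns)
print(null_share)
```

A column whose null share is close to 1 is a natural candidate for removal before further processing.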

## 3. Text Preprocessing

In [3]:
data_cleaned = data.copy()

# remove author column from data_cleaned
data_cleaned = data_cleaned.drop(columns=["author"])
data_cleaned.head()

| | title | date | content | domain | url |
|---|---|---|---|---|---|
| 0 | Qatar to Slash Emissions as LNG Expansion Adva... | 2021-01-13 | ["Qatar Petroleum ( QP) is targeting aggressiv... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 1 | India Launches Its First 700 MW PHWR | 2021-01-15 | ["• Nuclear Power Corp. of India Ltd. ( NPCIL)... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 2 | New Chapter for US-China Energy Trade | 2021-01-20 | ["New US President Joe Biden took office this ... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 3 | Japan: Slow Restarts Cast Doubt on 2030 Energy... | 2021-01-22 | ["The slow pace of Japanese reactor restarts c... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |
| 4 | NYC Pension Funds to Divest Fossil Fuel Shares | 2021-01-25 | ["Two of New York City's largest pension funds... | energyintel | https://www.energyintel.com/0000017b-a7dc-de4c... |


We create a copy of the original dataset to ensure the raw data remains unchanged during preprocessing. The "author" column is removed from the copied dataset, as it is not needed for further analysis or modeling. This step ensures the dataset is streamlined for subsequent tasks, focusing only on relevant information.
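
The effect of working on a copy can be checked on a toy frame. This is a minimal sketch with hypothetical `raw` and `working` frames. The `errors="ignore"` argument is an optional addition, not used in the original cell, that keeps the drop from failing if the column has already been removed.

```python
import pandas as pd

# Hypothetical stand-in for the raw data
raw = pd.DataFrame({"title": ["A"], "author": [None], "content": ["text"]})

# Work on a copy so that the raw frame is never modified
working = raw.copy()

# errors="ignore" makes the drop a no-op if "author" is already gone
working = working.drop(columns=["author"], errors="ignore")
working = working.drop(columns=["author"], errors="ignore")  # second call does nothing

print(list(raw.columns))
print(list(working.columns))
```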

### 3.1 Cleaning Text

After inspecting the data, we noticed many formatting flaws and a lot of irrelevant content, such as advertisements, special characters, and cookie consent notices. This text is not useful for our RAG system and should be removed without losing valuable information.

Much of the irrelevant content is domain-specific, so we examined every domain carefully and created methods to remove it. In this notebook, we integrate these methods into a single pipeline that cleans all the text data.
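
One way to carry out such a per-domain inspection is to count how many articles in each domain contain a candidate boilerplate phrase. The following is a rough sketch and not the exploration code from the playground notebook: the helper `phrase_counts_by_domain`, the toy `sample` frame, and the domain labels used in it are assumptions for illustration.

```python
import pandas as pd


def phrase_counts_by_domain(df, phrases):
    """Count, per domain, how many articles contain each candidate phrase."""
    columns = {}
    for phrase in phrases:
        # regex=False treats the phrase as a literal string, not a pattern
        hits = df["content"].str.contains(phrase, regex=False)
        columns[phrase] = hits.groupby(df["domain"]).sum()
    return pd.DataFrame(columns).fillna(0).astype(int)


# Hypothetical articles; the domain labels are placeholders
sample = pd.DataFrame(
    {
        "domain": ["cleantechnica", "cleantechnica", "pv-magazine"],
        "content": [
            "Article text. Advertise with CleanTechnica to reach readers.",
            "Another article without a footer.",
            "Article text. This website uses cookies to anonymously count visitors.",
        ],
    }
)
candidates = ["Advertise with CleanTechnica", "This website uses cookies to anonymously"]
counts = phrase_counts_by_domain(sample, candidates)
print(counts)
```

Phrases that occur in a large share of one domain's articles and nowhere else are likely boilerplate for that domain.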

In [4]:
def clean_content(content):
    # Remove surrounding list notation
    content = content.strip("[]")

    # Split by list delimiters and sentence-ending punctuation
    sentences = re.split(r"',\s*'|\",\s*\"|\", '|', |(?<=[.!?])\s+(?=[A-Z])", content)

    if "Sign in to get the best natural gas news and data" in sentences[0]:
        # Remove the first 11 sentences because they are not part of the article (naturalgasintel domain)
        sentences = sentences[11:]

    # Cut the text at the first sentence containing a stop phrase (advertisements, cookie consents, etc.)
    stop_phrases = [
        "The Sensi", 
        "Recharge is part of NHST Global Publications AS and we are responsible for the data that you register with us",
        "Recharge is part of DN Media Group", 
        "Ecofriend.Org", 
        "EcoFriend.com",
        "Thank you for subscribing to the email newsletter.",
        "To use the full function of this web site, JavaScript needs to be enabled in your browser.",
        "This site uses Akismet to reduce spam. Learn how your comment data is processed.",
        "Please join in the discussion in the comments below.", 
        "Advertise with CleanTechnica", 
        "Copyright © 2023 CleanTechnica",
        "Solar Industry offers industry participants probing",
        "This website uses cookies to anonymously", 
        "By submitting this form you agree to pv magazine using your data",
        "This content is protected by copyright and may not be reused",
        "© 2021 Natural Gas Intelligence. All rights reserved.",
    ]
    for i, sentence in enumerate(sentences):
        if any(stop_phrase in sentence for stop_phrase in stop_phrases):
            sentences = sentences[:i]
            break

    # Join sentences and normalize apostrophes
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)
    cleaned_text = re.sub(r'[‘’]', "'", cleaned_text)  # Normalize curly apostrophes to a straight '

    # Remove code-like blocks such as `{ ... })` and `window.dojoRequire(...)` calls
    cleaned_text = re.sub(r'\{.*?\}\s*\),?', '', cleaned_text, flags=re.DOTALL)  # Removes blocks like `{ ... })`
    cleaned_text = re.sub(r'window\.dojoRequire\s*\(.*?\)\s*[,;]?', '', cleaned_text, flags=re.DOTALL)  # Removes `window.dojoRequire(...)`

    # Combined regex substitutions for multiple patterns
    cleaned_text = re.sub(r'\( |\’ |”|“|…', lambda x: {'( ': '(', '’ ': '’', '”': '', '“': '', '…': ''}.get(x.group(), ''), cleaned_text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    # Consolidated special character and whitespace cleanup
    cleaned_text = re.sub(r'[“”;:"\[\]`]', '', cleaned_text)  # Remove quotes, semicolons, colons, brackets, and backticks
    cleaned_text = re.sub(r'\s+([.,?!])', r'\1', cleaned_text)  # Remove space before punctuation
    cleaned_text = re.sub(r'\s*[-–—]\s*', ' ', cleaned_text)  # Replace hyphens and dashes with a single space
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Collapse repeated whitespace
    cleaned_text = re.sub(r'\( ', '(', cleaned_text)  # Remove space after (
    cleaned_text = re.sub(r'\.,', '.', cleaned_text)  # Replace any ".," with "."
    cleaned_text = re.sub(r'[^\x00-\x7F]+', '', cleaned_text)  # Remove non-ASCII characters
    cleaned_text = re.sub(r'\.{2,}', '.', cleaned_text)  # Replace multiple periods with single period
    cleaned_text = re.sub(r"(?<!s)' ", '', cleaned_text)  # Remove "' " unless it is preceded by an s

    # Remove the entire word string if it contains u200b
    cleaned_text = ' '.join([word for word in cleaned_text.split() if 'u200b' not in word])

    # Expand country and organization abbreviations to full names
    abbreviations = {
        'U.S.': 'United States', 'U.K.': 'United Kingdom',
        'E.U.': 'European Union', 'U.A.E.': 'United Arab Emirates',
        'U.N.': 'United Nations', 'U.S.S.R.': 'Soviet Union',
    }

    # Replace longer abbreviations first so that 'U.S.' does not match inside 'U.S.S.R.'
    for abbr in sorted(abbreviations, key=len, reverse=True):
        cleaned_text = cleaned_text.replace(abbr, abbreviations[abbr])

    # Additional character replacements to clean single quotes and spaces
    cleaned_text = cleaned_text.replace(" – ", " ")
    cleaned_text = cleaned_text.replace("', ", " ")
    cleaned_text = cleaned_text.replace(" ' ", " ")
    cleaned_text = cleaned_text.replace(" '", " ")
    cleaned_text = cleaned_text.replace("' s", "'s")
    cleaned_text = cleaned_text.replace(" '", " ")
    cleaned_text = cleaned_text.replace("  ", " ")
    cleaned_text = cleaned_text.replace("}), ", "")
    cleaned_text = cleaned_text.replace("'.", ".")
    cleaned_text = cleaned_text.replace("pv magazine", "PV Magazine")

    # Remove dots that are not followed by a space and a capital letter
    # (this also strips decimal points, e.g. "3.5" becomes "35", and the final period of the text)
    cleaned_text = re.sub(r'\.(?! [A-Z])', '', cleaned_text)

    # Remove the cookie consent text if it is present
    cookie_text = "By clicking Allow All you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info. "
    cleaned_text = cleaned_text.replace(cookie_text, "")

    return cleaned_text

data_cleaned["content"] = data_cleaned["content"].apply(clean_content)

We defined a comprehensive `clean_content` function to preprocess the content column, addressing various issues such as:
- Removing extraneous list notations and splitting the text into sentences.
- Eliminating irrelevant sections (e.g., advertisements and cookie consent notices) by truncating the text at the first stop phrase and skipping the sign-in preamble of the naturalgasintel domain.
- Normalizing punctuation, removing special characters, and cleaning up code-like blocks.
- Expanding common abbreviations (e.g., 'U.S.' to 'United States') to improve consistency.
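
The first two steps in the list above can be tried in isolation. The sketch below reuses the splitting pattern from `clean_content`. The helper name `split_and_truncate` and the example string are hypothetical and only meant to make the behavior visible on a small input.

```python
import re

# Same splitting pattern as in clean_content: list delimiters or sentence boundaries
SPLIT_PATTERN = r"',\s*'|\",\s*\"|\", '|', |(?<=[.!?])\s+(?=[A-Z])"


def split_and_truncate(raw, stop_phrases):
    """Split a stringified list into sentences and cut at the first stop phrase."""
    sentences = re.split(SPLIT_PATTERN, raw.strip("[]"))
    for i, sentence in enumerate(sentences):
        if any(phrase in sentence for phrase in stop_phrases):
            # Drop the matching sentence and everything after it
            return sentences[:i]
    return sentences


# Hypothetical article content in the same stringified-list form as the dataset
raw = '["Solar capacity grew last year. Prices fell again.", "Advertise with CleanTechnica today.", "Unrelated footer text."]'
kept = split_and_truncate(raw, ["Advertise with CleanTechnica"])
print(kept)
```

Only the two article sentences are kept. The first one still carries its leading double quote, which the later character clean-up in `clean_content` removes.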

The cleaned text ensures a more uniform and usable dataset, free from unnecessary noise, making it suitable for downstream tasks such as embeddings or model training.
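
Whether the cleaned column is really free of the targeted noise can be spot-checked after the pipeline has run. This is a minimal sketch under stated assumptions: `residual_noise_report` is a hypothetical helper, and the three example rows are invented to show one clean row, one with a leftover stop phrase, and one with a leftover non-ASCII character.

```python
import pandas as pd


def residual_noise_report(df, stop_phrases, text_col="content"):
    """Flag rows whose text still contains a stop phrase or non-ASCII characters."""
    text = df[text_col].astype(str)
    has_phrase = text.apply(lambda t: any(p in t for p in stop_phrases))
    has_non_ascii = text.apply(lambda t: not t.isascii())
    return pd.DataFrame({"has_stop_phrase": has_phrase, "has_non_ascii": has_non_ascii})


# Hypothetical cleaned rows
cleaned = pd.DataFrame(
    {
        "content": [
            "Solar capacity grew last year",
            "Prices fell. This website uses cookies to anonymously track visits",
            "Capacity reached 5 GW \u2013 a record",
        ]
    }
)
report = residual_noise_report(cleaned, ["This website uses cookies to anonymously"])
print(report.sum())  # number of flagged rows per check
```

Any flagged rows point to a stop phrase or character class that the cleaning function does not yet cover.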

## 4. Save Processed Data

In [None]:
data_cleaned.to_csv('../data/processed/cleantech_processed.csv', index=False)

We save the cleaned dataset to a CSV file in the processed data directory. This step finalizes the data cleaning process, making the dataset ready for further analysis or modeling.
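
A simple safeguard when saving is to read the file back and compare it with the frame that was written. The sketch below is one possible way to do this and is not part of the original notebook: `save_and_verify` is a hypothetical helper, and a temporary directory stands in for `../data/processed/`.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(df, path):
    """Write df to CSV without the index, read it back, and compare shape and columns."""
    directory = os.path.dirname(path)
    if directory:
        # Create the output directory if it does not exist yet
        os.makedirs(directory, exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "processed", "cleantech_processed.csv")
    frame = pd.DataFrame({"title": ["A", "B"], "content": ["first text", "second text"]})
    ok = save_and_verify(frame, out_path)

print(ok)
```

Writing with `index=False`, as in the cell above, avoids an extra unnamed index column when the file is loaded again.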