# Data Preprocessing

For our mini-challenge, we use the Cleantech Media Dataset, a resource for businesses, researchers, and students who want to apply Natural Language Processing and Large Language Models to cleantech and sustainability topics. The industry evolves quickly, so access to timely and accurate information is crucial, and the dataset is designed to address that need.

This dataset is accessible on Kaggle and is credited to [Janna Lipenkova](https://www.kaggle.com/datasets/jannalipenkova/cleantech-media-dataset).

## Imports

In [478]:
import os
import pandas as pd
import numpy as np

import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Ensure the NLTK stopwords and the WordNet data (required by WordNetLemmatizer) are available
nltk.download('stopwords')
nltk.download('wordnet', quiet=True)
stop_words = set(stopwords.words('english'))

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alexanderschilling/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data

### Training Data

- Comprehensive Coverage: Access a wide range of media texts on cleantech topics, from renewable energy to carbon reduction.
- Efficiency: Utilize the dataset for quick and accurate question-answering, aiding informed decision-making.
- Regular Updates: Stay current with monthly updates reflecting the latest trends in cleantech.
- Sustainability Focus: Contribute to the sustainability movement by leveraging valuable insights from the dataset.

In [500]:
data = pd.read_csv('../data/raw/cleantech_media_dataset_v2_2024-02-23.csv', index_col=0)
data.head()

Unnamed: 0,title,date,author,content,domain,url
1280,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1281,India Launches Its First 700 MW PHWR,2021-01-15,,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1283,New Chapter for US-China Energy Trade,2021-01-20,,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1284,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1285,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...


### Evaluation Data

- The dataset comes with a small, high-quality evaluation dataset for the "Retrieval" step in Retrieval-Augmented Generation
- In addition, a small collection of human gold-standard query-passage-answer triplets will be provided for further evaluation

In [480]:
data_eval = pd.read_csv('../data/evaluation/cleantech_rag_evaluation_data_2024-09-20.csv', delimiter=";")
data_eval.head()

Unnamed: 0,example_id,question_id,question,relevant_text,answer,article_url
0,1,1,What is the innovation behind Leclanché's new ...,Leclanché said it has developed an environment...,Leclanché's innovation is using a water-based ...,https://www.sgvoice.net/strategy/technology/23...
1,2,2,What is the EU’s Green Deal Industrial Plan?,The Green Deal Industrial Plan is a bid by the...,The EU’s Green Deal Industrial Plan aims to en...,https://www.sgvoice.net/policy/25396/eu-seeks-...
2,3,2,What is the EU’s Green Deal Industrial Plan?,The European counterpart to the US Inflation R...,The EU’s Green Deal Industrial Plan aims to en...,https://www.pv-magazine.com/2023/02/02/europea...
3,4,3,What are the four focus areas of the EU's Gree...,The new plan is fundamentally focused on four ...,The four focus areas of the EU's Green Deal In...,https://www.sgvoice.net/policy/25396/eu-seeks-...
4,5,4,When did the cooperation between GM and Honda ...,What caught our eye was a new hookup between G...,July 2013,https://cleantechnica.com/2023/05/08/general-m...


## Initial Exploration

In [481]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9593 entries, 0 to 9592
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  9593 non-null   int64 
 1   title       9593 non-null   object
 2   date        9593 non-null   object
 3   author      31 non-null     object
 4   content     9593 non-null   object
 5   domain      9593 non-null   object
 6   url         9593 non-null   object
dtypes: int64(1), object(6)
memory usage: 524.7+ KB
None


In [482]:
# Check for missing values
print(data.isnull().sum())

Unnamed: 0       0
title            0
date             0
author        9562
content          0
domain           0
url              0
dtype: int64


In [483]:
print(data["domain"].unique())

['energyintel' 'energyvoice' 'eurosolar' 'indorenergy' 'cleantechnica'
 'decarbxpo' 'ecofriend' 'greenprophet' 'azocleantech' 'naturalgasintel'
 'businessgreen' 'rechargenews' 'solarquarter' 'thinkgeoenergy'
 'pv-magazine' 'solarpowerworldonline' 'pv-tech' 'solarpowerportal.co'
 'solarindustrymag']


Now we want to see what the text in the column "content" looks like. This gives us a better idea of how the texts are structured, so that we can plan the data cleaning process.

In [484]:
data["content"][1]

"['Velocys has signed two offtake agreements that cover the entire sustainable aviation fuel ( SAF) production from its Bayou Fuels plant in Mississippi for at least ten years from start up in 2026. US Southwest Airlines will take-or-pay for 219 million gallons of carbon-negative SAF — generating 575 million gallons of carbon neutral blended SAF — over 15 years. Europe’ s IAG will take 73 million gallons of carbon negative SAF — 192 million gallons of carbon-neutral SAF after blending — over 10 years. Velocys said the deal with IAG was worth $ 800 million. Both deals include a price support mechanism for the greenhouse gas credits associated with Bayou’ s SAF production, which claims a carbon reduction of 6.5 million tons over the term of the contract. The plant has negative lifecycle emissions thanks to its biogenic waste feedstock, renewable power and carbon capture and storage. “ This long-dated offtake, encompassing support for environmental credits, will provide certainty of reven

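The sample above starts with `['`, which suggests that each `content` cell holds the string representation of a Python list of paragraphs. The following is a minimal sketch, assuming that this holds for the whole column; the helper name `join_paragraphs` is hypothetical and not part of the notebook.

```python
import ast

def join_paragraphs(raw):
    """Parse a stringified list of paragraphs and join it into one string."""
    try:
        parts = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the text unchanged
        return raw
    if isinstance(parts, list):
        return ' '.join(str(part) for part in parts)
    return raw

print(join_paragraphs("['First paragraph.', 'Second one.']"))
```

Parsing the list first would remove the brackets and quote characters structurally instead of relying on the special-character filter applied later.
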
## Text Preprocessing

Effective text preprocessing is a critical step in preparing our data for further analysis, as it directly impacts the quality and accuracy of the results we can achieve. After conducting an initial exploration of the dataset, we identified several areas that require attention to ensure our text data is clean, consistent and usable for analysis.

Text preprocessing in this case involves handling various nuances that are common in unstructured textual data, such as URLs, special characters and abbreviations. Each preprocessing decision must strike a balance between cleaning the data and preserving key information that could affect the integrity of the analysis.

**The following steps outline our approach to cleaning the text data:**
- Converting all text to lowercase ensures uniformity and reduces redundancy, as "Carbon" and "carbon" would be treated as the same word.
- Removing URLs, because they don't provide meaningful insight.
- Special characters (for example ", !, ?, ', ` etc.) can introduce noise if not handled carefully. However, not all punctuation marks should be removed entirely. For example, percentages (like "0.2%") or decimal numbers ("2.2 million") contain important information that must be preserved: if we deleted the decimal point, the text would no longer be factually correct ("2.2 million" would turn into "22 million").
- The handling of symbols like `/` and `-` also requires nuance. A forward slash (`/`) used in phrases such as "tons/yr" should be replaced with "per," while hyphens connecting words (for example "carbon-negative") should be replaced with spaces rather than being removed, to retain the meaning of compound terms. Hyphens surrounded by spaces, on the other hand, can be safely removed as they often act as separators rather than meaningful connectors.
- Finally, careful consideration needs to be given to country abbreviations that could be misinterpreted as common words (for example "US" and "us"). Ensuring context-driven handling of these abbreviations is necessary to avoid misclassification or distortion of the data.
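
The symbol rules listed above can be sketched as a single function. This is an illustration under assumptions, not the notebook's implementation: the name `normalize_symbols` and the exact regular expressions are hypothetical, and the sketch only covers slashes, hyphens, percent signs and periods.

```python
import re

def normalize_symbols(text):
    """Hypothetical helper illustrating the slash, hyphen, percent and period rules."""
    # "tons/yr" -> "tons per yr": slash directly between two word characters
    text = re.sub(r'(?<=\w)/(?=\w)', ' per ', text)
    # Expand the unit abbreviation: "per yr" -> "per year"
    text = re.sub(r'\bper yr\b', 'per year', text)
    # "carbon-negative" -> "carbon negative": hyphen directly between word characters
    text = re.sub(r'(?<=\w)-(?=\w)', ' ', text)
    # Any remaining hyphens (" - ", " -- ") act as separators and become a space
    text = re.sub(r'-+', ' ', text)
    # "0.2%" -> "0.2 percent"
    text = text.replace('%', ' percent')
    # Drop periods that are not decimal points, i.e. not between two digits
    text = re.sub(r'(?<!\d)\.|\.(?!\d)', '', text)
    # Collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_symbols("carbon-negative fuel - about 2.2 million tons/yr or 0.2%."))
```

Lookbehind and lookahead assertions are used so that only the symbol itself is replaced and the neighbouring characters stay untouched.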

### Lowercase

In [485]:
def to_lowercase(text):
    return text.lower()

data['content_cleaned'] = data['content'].apply(to_lowercase)
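
Lowercasing makes the country abbreviation "US" indistinguishable from the pronoun "us". One possible way to keep them apart is to replace selected uppercase abbreviations with an unambiguous token before lowercasing. The sketch below assumes a small hand-written mapping; `ABBREVIATIONS` and `to_lowercase_keep_abbreviations` are hypothetical names, and the replacement tokens avoid special characters so that later cleaning steps leave them intact.

```python
import re

# Hypothetical mapping from case-sensitive abbreviations to unambiguous tokens
ABBREVIATIONS = {"US": "usa", "UN": "unitednations"}
_ABBREVIATION_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, ABBREVIATIONS)) + r')\b'
)

def to_lowercase_keep_abbreviations(text):
    # Substitute before lowercasing, while "US" and "us" are still distinguishable
    text = _ABBREVIATION_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)
    return text.lower()

print(to_lowercase_keep_abbreviations("The US plan gives us more time"))
```

A limitation of this approach is that text written entirely in capitals (for example a headline) would be mapped as well.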

### Remove HTML tags

In [486]:
def remove_html_tags(text):
    return BeautifulSoup(text, "html.parser").get_text()

data['content_cleaned'] = data['content_cleaned'].apply(remove_html_tags)

### Remove URLs

In [487]:
def remove_urls(text):
    return re.sub(r'http\S+|www\S+|https\S+', '', text)

data['content_cleaned'] = data['content_cleaned'].apply(remove_urls)

### Remove Special Characters

In [488]:
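def remove_special_characters(text):
    # Keep letters, digits, whitespace, hyphens, slashes, periods, percent signs and ampersands.
    # Every other character (quotes, backslashes, brackets, commas, "?", "!", ...) is removed.
    # The first alternative is redundant with the negated class but makes the intent explicit.
    text = re.sub(r"[\\\'\"`]|[^A-Za-z0-9\s\-\/\.\%&]", '', text)

    # Replace "/" between two words with " per " and expand "yr" ("tons/yr" -> "tons per year")
    text = re.sub(r'(\w+)/(\w+)', lambda m: f"{m.group(1)} per {'year' if m.group(2) == 'yr' else m.group(2)}", text)

    # Remove hyphens together with any surrounding whitespace.
    # NOTE: `\s*` also matches zero spaces, so this strips every hyphen and fuses
    # hyphenated compounds ("kakrapar-3" -> "kakrapar3") instead of splitting them.
    text = re.sub(r'\s*-\s*', '', text)

    # Remove hyphens that are not enclosed by word characters
    # (a no-op here, because the previous step already removed all hyphens)
    text = re.sub(r'(?<!\w)-|-(?!\w)', '', text)

    text = text.replace('%', ' percent')

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text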

data['content_cleaned'] = data['content_cleaned'].apply(remove_special_characters)

### Remove punctuation

In [489]:
def remove_periods(text):
    # This regex pattern matches a period that is NOT followed by a digit
    pattern = r'\.(?!\d)'

    # Replace matched periods with an empty string
    result = re.sub(pattern, '', text)
    
    return result

data['content_cleaned'] = data['content_cleaned'].apply(remove_periods)

### Remove stop words

In [490]:
def remove_stop_words(text):
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

data['content_cleaned'] = data['content_cleaned'].apply(remove_stop_words)

### Tokenization

Tokenization is the process of breaking text down into individual units (tokens), typically words. This step is essential for most text preprocessing tasks, as it allows us to manipulate and analyze the content at the word level. By splitting the content into tokens, we can more easily apply further processing techniques like stemming and lemmatization. Here we simply split the cleaned text on whitespace.

In [491]:
def tokenize(text):
    return text.split()

data['content_tokenized'] = data['content_cleaned'].apply(tokenize)

### Stemming

Stemming is the process of reducing words to a common stem by stripping off word endings with fixed rules. This shrinks the vocabulary, because different forms of a word (e.g., "running" and "runs") are mapped to the same stem ("run"). The resulting stems are not always dictionary words: in the results below, "gas" becomes "ga" and "emissions" becomes "emiss".

In [492]:
def stem_text(tokens):
    return ' '.join(stemmer.stem(word) for word in tokens)

data['content_stemmed'] = data['content_tokenized'].apply(stem_text)
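
A small illustration of what the Porter stemmer does to individual words, assuming NLTK is installed (the stemmer itself needs no additional data download). The word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Print each word next to its stem to see which forms collapse together
for word in ["running", "runs", "runner", "emissions", "gas"]:
    print(word, "->", stemmer.stem(word))
```

Inspecting a handful of domain terms this way is a quick check of whether stemming merges the forms that should be merged before it is applied to the whole corpus.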

### Lemmatization

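Lemmatization, similar to stemming, is a technique used to reduce words to their base form. However, unlike stemming, lemmatization maps each word to a dictionary form (lemma) and can take the grammatical role of the word into account, providing a more accurate base form. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so as used here plural nouns are reduced ("emissions" becomes "emission") while verb forms such as "targeting" are left unchanged. This process helps normalize the text while maintaining proper word meaning, which is particularly beneficial for more nuanced text analysis tasks.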

In [493]:
def lemmatize_text(tokens):
    return ' '.join(lemmatizer.lemmatize(word) for word in tokens)

data['content_lemmatized'] = data['content_tokenized'].apply(lemmatize_text)

### Result

In this section, we present the results of our data preprocessing, comparing the original content with the cleaned, tokenized, stemmed and lemmatized versions. A notable observation is the presence of numerous abbreviations, such as company or organization names. Since these abbreviations do not represent standard words, we anticipate that their frequency within the corpus will be low. As a result, they are likely to be filtered out during vocabulary pruning, thereby streamlining the text and enhancing the language model's ability to interpret and process the content more effectively.
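
The vocabulary pruning mentioned above can be sketched as follows. This is a minimal illustration, assuming pruning by document frequency; the function name `prune_vocabulary`, the threshold `min_doc_freq` and the toy documents are hypothetical.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_doc_freq=2):
    """Keep only tokens that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    vocabulary = {tok for tok, n in doc_freq.items() if n >= min_doc_freq}
    pruned = [[tok for tok in tokens if tok in vocabulary] for tokens in tokenized_docs]
    return pruned, vocabulary

docs = [
    ["npcil", "nuclear", "power"],
    ["nuclear", "power", "plant"],
    ["solar", "power", "plant"],
]
pruned_docs, vocab = prune_vocabulary(docs, min_doc_freq=2)
print(pruned_docs)
```

In this toy example the abbreviation "npcil" appears in only one document and is dropped, which is the effect anticipated for rare abbreviations.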

#### Initial Content

In [494]:
data["content"].iloc[1]

'["• Nuclear Power Corp. of India Ltd. ( NPCIL) synchronized Kakrapar-3 in the western state of Gujarat to the grid on Jan. 10, making it the first of India\'s 700 megawatt indigenously developed pressurized heavy water reactors ( PHWRs) to reach this milestone ( NIW Sep.18\'20). The news was tweeted by Anil Kakodkar, former chairman of the Department of Atomic Energy, who said that 15 more units of the same design will follow. Three of these are currently under construction -- another at Kakrapar, and two at NPCIL\'s Rajasthan plant. They will be followed by two at the greenfield Gorakhpur site in Haryana, and then a planned 10-unit fleet at Gorakhpur and three other sites. Kakrapar-3 was five years past its 2015 completion date, achieving criticality in July 2020, 10 years after construction began. Commercial operations are slated to begin in March, according to NPCIL\'s website, although that deadline will likely not be met. India\'s nuclear suppliers should be feeling some relief o

#### Cleaned Content

In [495]:
data['content_cleaned'].iloc[1]

'nuclear power corp india ltd npcil synchronized kakrapar3 western state gujarat grid jan 10 making first indias 700 megawatt indigenously developed pressurized heavy water reactors phwrs reach milestone niw sep.1820 news tweeted anil kakodkar former chairman department atomic energy said 15 units design follow three currently constructionanother kakrapar two npcils rajasthan plant followed two greenfield gorakhpur site haryana planned 10unit fleet gorakhpur three sites kakrapar3 five years past 2015 completion date achieving criticality july 2020 10 years construction began commercial operations slated begin march according npcils website although deadline likely met indias nuclear suppliers feeling relief kakrapar3s startup although order flows depend quickly npcil get projects moving course covid19 pandemic niw dec.1120 inaugural us small modular reactor smr project moved second phase fluor nuscale subsidiary announcing jan 11 per agreements utah associated municipal power systems u

#### Cleaned & Tokenized Content

In [496]:
data['content_tokenized'].iloc[1]

['nuclear',
 'power',
 'corp',
 'india',
 'ltd',
 'npcil',
 'synchronized',
 'kakrapar3',
 'western',
 'state',
 'gujarat',
 'grid',
 'jan',
 '10',
 'making',
 'first',
 'indias',
 '700',
 'megawatt',
 'indigenously',
 'developed',
 'pressurized',
 'heavy',
 'water',
 'reactors',
 'phwrs',
 'reach',
 'milestone',
 'niw',
 'sep.1820',
 'news',
 'tweeted',
 'anil',
 'kakodkar',
 'former',
 'chairman',
 'department',
 'atomic',
 'energy',
 'said',
 '15',
 'units',
 'design',
 'follow',
 'three',
 'currently',
 'constructionanother',
 'kakrapar',
 'two',
 'npcils',
 'rajasthan',
 'plant',
 'followed',
 'two',
 'greenfield',
 'gorakhpur',
 'site',
 'haryana',
 'planned',
 '10unit',
 'fleet',
 'gorakhpur',
 'three',
 'sites',
 'kakrapar3',
 'five',
 'years',
 'past',
 '2015',
 'completion',
 'date',
 'achieving',
 'criticality',
 'july',
 '2020',
 '10',
 'years',
 'construction',
 'began',
 'commercial',
 'operations',
 'slated',
 'begin',
 'march',
 'according',
 'npcils',
 'website',
 'alt

#### Cleaned & Stemmed Content

In [497]:
data['content_stemmed'].iloc[0]

'qatar petroleum qp target aggress cut greenhous ga emiss prepar launch phase 2 plan 48 million ton per year lng expans latest sustain report publish wednesday qp said goal includ reduc emiss intens qatar lng facil 25 percent upstream facil least 15 percent compani also aim reduc ga flare intens across upstream facil 75 percent rais carbon captur storag ambit 5 million ton per year 7 million ton per year 2027 2.2 million ton per year carbon captur goal come 32 million ton per year phase 1 lng expans also known north field east project 1.1 million ton per year come phase 2 known north field south project rais qatar lng capac 16 million ton per year qatar current lng product capac around 78 million ton per year eye phase expans 126 million ton per year qp say abl elimin routin ga flare 2030 methan emiss limit set methan intens target 0.2 percent across facil 2025 compani also plan build 1.6 gigawatt solar energi capac 2025 half come siraj solar power project next year eif jan.2220 month 

#### Cleaned & Lemmatized Content

In [498]:
data['content_lemmatized'].iloc[0]

'qatar petroleum qp targeting aggressive cut greenhouse gas emission prepares launch phase 2 planned 48 million ton per year lng expansion latest sustainability report published wednesday qp said goal include reducing emission intensity qatar lng facility 25 percent upstream facility least 15 percent company also aiming reduce gas flaring intensity across upstream facility 75 percent raised carbon capture storage ambition 5 million ton per year 7 million ton per year 2027 2.2 million ton per year carbon capture goal come 32 million ton per year phase 1 lng expansion also known north field east project 1.1 million ton per year come phase 2 known north field south project raise qatar lng capacity 16 million ton per year qatar currently lng production capacity around 78 million ton per year eyeing phased expansion 126 million ton per year qp say able eliminate routine gas flaring 2030 methane emission limited setting methane intensity target 0.2 percent across facility 2025 company also p

### Save Processed Data
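
The notebook does not show a save step. The following is a minimal sketch, assuming the processed text columns are written to CSV with pandas; the helper name `save_processed` and the demo data are hypothetical, and in practice `target` would be a file path of your choice instead of an in-memory buffer.

```python
import io
import pandas as pd

def save_processed(df, target,
                   columns=('content_cleaned', 'content_stemmed', 'content_lemmatized')):
    """Write the selected processed columns to a CSV file path or file-like object."""
    df[list(columns)].to_csv(target, index=True)

# Demo with a tiny frame and an in-memory buffer instead of a real file
demo = pd.DataFrame({
    'content_cleaned': ['solar power'],
    'content_stemmed': ['solar power'],
    'content_lemmatized': ['solar power'],
})
buffer = io.StringIO()
save_processed(demo, buffer)
print(buffer.getvalue())
```

The tokenized column is left out here because lists are written to CSV as strings and would have to be parsed again on load.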

### Next steps

- Text Analysis / NLP Task Preparation
- Sentiment Analysis
- Text Summarization
- Save the cleaned data
- Exploratory Data Analysis (EDA)


The next steps could also include limiting the vocabulary, for example by pruning rare tokens.