# Data Preprocessing

For our mini-challenge, we use the Cleantech Media Dataset, a resource for businesses, researchers, and students who want to apply Natural Language Processing and Large Language Models to cleantech and sustainability. The industry evolves quickly, so timely and accurate information is crucial, and the dataset is designed to meet that need.

This dataset is accessible on Kaggle and is credited to [Janna Lipenkova](https://www.kaggle.com/datasets/jannalipenkova/cleantech-media-dataset).

## 1. Imports

In [2716]:
import os
import re
import pandas as pd
import numpy as np

import nltk
import spacy
from nltk.tokenize import sent_tokenize

## 2. Data

### 2.1 Training Data

- Comprehensive Coverage: Access a wide range of media texts on cleantech topics, from renewable energy to carbon reduction.
- Efficiency: Utilize the dataset for quick and accurate question-answering, aiding informed decision-making.
- Regular Updates: Stay current with monthly updates reflecting the latest trends in cleantech.
- Sustainability Focus: Contribute to the sustainability movement by leveraging valuable insights from the dataset.

In [2717]:
data = pd.read_csv('../data/raw/cleantech_media_dataset_v2_2024-02-23.csv', index_col=0).reset_index(drop=True)
data.head()

Unnamed: 0,title,date,author,content,domain,url
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1,India Launches Its First 700 MW PHWR,2021-01-15,,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
2,New Chapter for US-China Energy Trade,2021-01-20,,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...


### 2.2 Evaluation data

- The dataset comes with a small, high-quality evaluation dataset for the "Retrieval" step in Retrieval-Augmented Generation.
- In addition, a small collection of human gold-standard query-passage-answer triplets will be provided for further evaluation.

In [2718]:
data_eval = pd.read_csv('../data/evaluation/cleantech_rag_evaluation_data_2024-09-20.csv', delimiter=";")
data_eval.head()

Unnamed: 0,example_id,question_id,question,relevant_text,answer,article_url
0,1,1,What is the innovation behind Leclanché's new ...,Leclanché said it has developed an environment...,Leclanché's innovation is using a water-based ...,https://www.sgvoice.net/strategy/technology/23...
1,2,2,What is the EU’s Green Deal Industrial Plan?,The Green Deal Industrial Plan is a bid by the...,The EU’s Green Deal Industrial Plan aims to en...,https://www.sgvoice.net/policy/25396/eu-seeks-...
2,3,2,What is the EU’s Green Deal Industrial Plan?,The European counterpart to the US Inflation R...,The EU’s Green Deal Industrial Plan aims to en...,https://www.pv-magazine.com/2023/02/02/europea...
3,4,3,What are the four focus areas of the EU's Gree...,The new plan is fundamentally focused on four ...,The four focus areas of the EU's Green Deal In...,https://www.sgvoice.net/policy/25396/eu-seeks-...
4,5,4,When did the cooperation between GM and Honda ...,What caught our eye was a new hookup between G...,July 2013,https://cleantechnica.com/2023/05/08/general-m...


### 2.3 Initial Exploration

In [2719]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9593 entries, 0 to 9592
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    9593 non-null   object
 1   date     9593 non-null   object
 2   author   31 non-null     object
 3   content  9593 non-null   object
 4   domain   9593 non-null   object
 5   url      9593 non-null   object
dtypes: object(6)
memory usage: 449.8+ KB
None


In [2720]:
# Check for missing values
print(data.isnull().sum())

title         0
date          0
author     9562
content       0
domain        0
url           0
dtype: int64


The author is missing for 9,562 of the 9,593 articles, so the `author` column carries almost no information. The remaining columns could be prepended to the content as metadata, which would help the model contextualize each article and reference its source more accurately, improving the quality and traceability of responses:

`[Title: <title>] [Date: <date>] [Domain: <domain>] [URL: <url>] <content>`
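The template above could be filled in with a small helper. This is only a sketch: `format_with_metadata` is a hypothetical name, and the record below is a placeholder rather than a row from the dataset.

```python
def format_with_metadata(record):
    """Prefix an article's content with its metadata in bracketed fields."""
    return (
        f"[Title: {record['title']}] "
        f"[Date: {record['date']}] "
        f"[Domain: {record['domain']}] "
        f"[URL: {record['url']}] "
        f"{record['content']}"
    )

# Placeholder record (not taken from the dataset)
example = {
    "title": "Example title",
    "date": "2021-01-13",
    "domain": "example-domain",
    "url": "https://example.com/article",
    "content": "Example content.",
}
formatted = format_with_metadata(example)
print(formatted)
```

Because the helper only uses key lookups, it should also work row-wise on a DataFrame, for example with `data.apply(format_with_metadata, axis=1)`.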

In [2721]:
# count how many times each domain appears in the dataset
data["domain"].value_counts()

cleantechnica            1861
azocleantech             1627
pv-magazine              1206
energyvoice              1017
solarindustrymag          673
naturalgasintel           658
thinkgeoenergy            645
rechargenews              559
solarpowerworldonline     505
energyintel               234
pv-tech                   232
businessgreen             158
greenprophet               80
ecofriend                  38
solarpowerportal.co        34
eurosolar                  28
decarbxpo                  19
solarquarter               17
indorenergy                 2
Name: domain, dtype: int64

Now we want to see what the text in the `content` column looks like. This gives us a better idea of how the texts are structured, so that we can plan the data cleaning process.

In [2722]:
print(data[data["domain"] == "solarquarter"]["content"].iloc[2])

['Q1. How has the ‘ Make in India’ initiative contributed to boosting the recognition and growth of India’ s energy storage sector, especially in the context of Sungrow’ s 5-year manufacturing journey as a provider of solar inverters and energy storage systems?', 'Ans: Sungrow’ s achievement of completing 5 years of Indian manufacturing underscores the positive impact of their participation in the “ Make in India ” initiative. This participation has not only fueled their own growth but has also played a significant role in advancing India’ s broader renewable energy and energy storage goals. By manufacturing solar inverters locally, Sungrow has been able to address the specific requirements of the Indian market more effectively and contribute to the nation’ s transition to clean and sustainable energy sources. The expansion of domestic manufacturing has created job opportunities within the energy storage sector, benefiting the local workforce and fostering economic development. Further
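The printed cell shows that `content` holds the string form of a Python list rather than a real list. One possible way to recover the elements is `ast.literal_eval` from the standard library. The sketch below assumes that most cells are valid list literals and falls back to a one-element list otherwise; `parse_content` is a hypothetical helper and the sample cell is made up.

```python
import ast

def parse_content(raw):
    """Turn the string form of a list into a real list of strings.

    Falls back to a one-element list if the cell is not a valid literal.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]
    if isinstance(parsed, list):
        return [str(item) for item in parsed]
    return [str(parsed)]

# Made-up cell that mixes single- and double-quoted elements
raw_cell = "['First sentence.', \"It's the second.\", 'Third one.']"
sentences = parse_content(raw_cell)
print(len(sentences))
print(sentences[1])
```

Compared with splitting on `', '`, this handles elements quoted with either quote style, at the cost of failing on cells that are not well-formed literals.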

In [2723]:
# show all rows whose content mentions "eurosolar"
data[data["content"].str.contains("eurosolar")]

Unnamed: 0,title,date,author,content,domain,url
1253,Energy Storage Europe Conference: Focus on the...,2021-07-09,,['The new Energy Storage Europe app provides a...,eurosolar,https://www.eurosolar.de/2017/02/16/energy-sto...
1254,The world's largest energy storage event from ...,2021-07-09,,"['On Tuesday, 14 March, the two conferences wi...",eurosolar,https://www.eurosolar.de/2017/02/16/the-world-...
1255,IRES and ESE Conferences: Final conclusions an...,2021-07-09,,['The more than 260 contributions from interna...,eurosolar,https://www.eurosolar.de/2017/03/16/ires-and-e...
1256,The European Solar Prize 2020: Changemakers of...,2021-07-09,,"['The prize was awarded, for example, to an is...",eurosolar,https://www.eurosolar.de/2020/12/02/the-europe...
1257,Renewable energies: European Solar Prize 2018 ...,2021-07-09,,['For EUROSOLAR President and Chairman of the ...,eurosolar,https://www.eurosolar.de/2018/11/15/renewable-...
1258,European Solar Prize 2018 – Announcement of th...,2021-07-09,,['The European Solar Prize has been awarded an...,eurosolar,https://www.eurosolar.de/2018/11/06/european-s...
1261,Saving the World with Renewables: European Sol...,2021-07-09,,"['In his welcoming address, Henri Kox, former ...",eurosolar,https://www.eurosolar.de/2019/11/14/saving-the...
1262,IRES & ESE 2019: Storage applications on the r...,2021-07-09,,"['For its 8th event, the international trade f...",eurosolar,https://www.eurosolar.de/2019/03/22/ires-ese-2...
1263,IRES WEB SUMMIT 2020 Radical system change to ...,2021-07-09,,['This event marks EUROSOLARs preparations for...,eurosolar,https://www.eurosolar.de/2020/05/28/ires-web-s...
1265,European Solar Prize 2017 awarded to ten laure...,2021-07-09,,['“ The number and quality of innovations and ...,eurosolar,https://www.eurosolar.de/2017/11/15/european-s...


### 2.4 Findings

Domain-specific cleaning

**cleantechnica.com**
- remove the last string in the list: `'Copyright © 2023 CleanTechnica. The content produced by this site is for entertainment purposes only. Opinions and comments published on this site may not be sanctioned by and do not necessarily represent the views of CleanTechnica, its owners, sponsors, affiliates, or subsidiaries.'`
- remove all following elements in the list after an element contains the string `"Advertise with CleanTechnica"`
- remove some special characters like `'\xa0'` and `'\n'`

**azocleantech**
- remove the first sentence in the list which is always: `"By clicking `` Allow All \'\' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info."`
- remove all following elements in the list after an element contains the string `"The Sensi"` -> this seems to be an advertisement on all pages of azocleantech

**pv-magazine**
- remove all following elements in the list after a sentence contains `This website uses cookies to anonymously`
- remove all following elements in the list after a sentence contains `By submitting this form you agree to pv magazine using your data`

**energyvoice**
- cleaning of special characters and whitespaces

**solarindustrymag**
- Normal cleaning of special characters and whitespaces
- remove `'Solar Industry offers industry participants probing, comprehensive assessments of the technology, tools and trends that are driving this dynamic energy sector. From raw materials straight through to end-user applications, we capture and analyze the critical details that help professionals stay current and navigate the solar market.', '© Copyright Zackin Publications Inc. All Rights Reserved.'`

**naturalgasintel**
- Remove signup and header text: `'Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily emails.', 'Your email address *', 'Your password *', 'Remember me Continue', 'Reset password', 'Featured Content', 'News & Data Services', 'Client Support', 'Daily GPI', 'Regulatory | E & P | Infrastructure | NGI All News Access | NGI The Weekly Gas Market Report'`

**thinkgeoenergy**
- cleaning of special characters and whitespaces

**rechargenews**
- remove all following elements after `Recharge is part of NHST Global Publications AS and we are responsible for the data that you register with us`

**solarpowerworldonline**
- Remove everything after `'Copyright © 2023 WTWH Media, LLC. All Rights Reserved`

**energyintel**
- remove references like `( LNGI Nov.9'20). Rafiq Latta, Nicosia`, ` ( EIF Jan.22'20)`

**pv-tech**
- cleaning of special characters and whitespaces

**businessgreen**
- cleaning of special characters and whitespaces

**greenprophet**
- Remove signup form: `"window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( { `` baseUrl '': '' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', '' lid '': '' 1e23cc3ebd '', '' uniqueMethods '': true }) })"`

**ecofriend**
- remove everything after `EcoFriend.com` or `Ecofriend.Org`

**solarpowerportal.co**
- remove everything after `'Thank you for subscribing to the email newsletter.`

**eurosolar**
- Has some German content -> maybe translate it to English
- Has some Cyrillic content, such as `Відкритий лист президентам`
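
Cyrillic articles could be flagged with a simple character-range check, sketched below. The 0.5 threshold and both function names are assumptions for illustration. Detecting German would need a language-identification step, which this sketch does not cover.

```python
import re

# Basic Cyrillic Unicode block
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def cyrillic_ratio(text):
    """Share of alphabetic characters that fall in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if CYRILLIC.match(ch)) / len(letters)

def looks_cyrillic(text, threshold=0.5):
    """Heuristic flag; the threshold is an arbitrary assumption."""
    return cyrillic_ratio(text) >= threshold

print(looks_cyrillic("Відкритий лист президентам"))
print(looks_cyrillic("The European Solar Prize"))
```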

**decarbxpo**
- remove everything after `'To use the full function of this web site, JavaScript needs to be enabled in your browser.`

**solarquarter**
- remove everything after `This site uses Akismet to reduce spam. Learn how your comment data is processed.'`
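
Many of the rules above share one shape: drop every list element from the first one that contains a marker phrase. A minimal sketch of that shared step follows; `truncate_at_marker` is a hypothetical name and the sample article is made up apart from the marker phrase.

```python
def truncate_at_marker(sentences, markers):
    """Return the elements before the first one containing any marker."""
    for i, sentence in enumerate(sentences):
        if any(marker in sentence for marker in markers):
            return sentences[:i]
    return sentences

# Made-up article body followed by boilerplate
article = [
    "Body one.",
    "Body two.",
    "This site uses Akismet to reduce spam. Learn how your comment data is processed.",
    "Footer.",
]
markers = ["This site uses Akismet to reduce spam."]
print(truncate_at_marker(article, markers))
```

The marker element itself is dropped along with everything after it, which matches how the domain-specific functions below slice with `sentences[:i]`.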

## 3. Text Preprocessing

In [2724]:
data_cleaned = data.copy()

# remove author column from data_cleaned
data_cleaned = data_cleaned.drop(columns=["author"])
data_cleaned.head()

Unnamed: 0,title,date,content,domain,url
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,"[""Qatar Petroleum ( QP) is targeting aggressiv...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1,India Launches Its First 700 MW PHWR,2021-01-15,"[""• Nuclear Power Corp. of India Ltd. ( NPCIL)...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
2,New Chapter for US-China Energy Trade,2021-01-20,"[""New US President Joe Biden took office this ...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,"[""The slow pace of Japanese reactor restarts c...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,"[""Two of New York City's largest pension funds...",energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...


In [2725]:
def clean_content_cleantechnica(content):
    # Step 1: Remove surrounding list notation
    content = content.strip("[]")
    
    # Step 2: Split content by the typical list delimiter
    sentences = content.split("', '")
    
    # Step 3: Remove the last element if it matches the specific copyright text
    copyright_text = "Copyright © 2023 CleanTechnica. The content produced by this site is for entertainment purposes only. Opinions and comments published on this site may not be sanctioned by and do not necessarily represent the views of CleanTechnica, its owners, sponsors, affiliates, or subsidiaries."
    # Strip the quote left over from the list notation before comparing
    if sentences and sentences[-1].strip("'\"") == copyright_text:
        sentences = sentences[:-1]  # Remove last element
    
    # Step 4: Find the index of the "Advertise with CleanTechnica" string and remove all elements after it
    for i, sentence in enumerate(sentences):
        if "Advertise with CleanTechnica" in sentence:
            sentences = sentences[:i]
            break
    
    # Join the cleaned sentences back into a single text block
    cleaned_text = ' '.join(sentences)

    # Combined regex substitutions for multiple patterns
    cleaned_text = re.sub(r'\( |\’ |”|“|…', lambda x: {'( ': '(', '’ ': '’', '”': '', '“': '', '…': ''}.get(x.group(), ''), cleaned_text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    # Replace ’ with '
    cleaned_text = cleaned_text.replace("’", "'")

    # Strip any leading/trailing single or double quotes
    cleaned_text = cleaned_text.strip("'\"")

    return cleaned_text

data_cleaned.loc[data_cleaned['domain'] == 'cleantechnica', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'cleantechnica']['content'].apply(clean_content_cleantechnica)

# Inspect first 500 characters of the cleaned content from cleantechnica
data_cleaned[data_cleaned["domain"] == "cleantechnica"]["content"].iloc[40][:500]

"During the first four months of 2022, electrical generation by renewable energy sources accounted for over 25% of the nation's electricity. In April alone, renewables accounted for 29.3% — an all-time high. And for the first time ever, the combination of just wind and solar produce more electricity in April than the nation's nuclear power plants — 17.96% more. This is according to a SUN DAY Campaign analysis of data in EIA's Electric Power Monthly report. The report also reveals that during the "

In [2726]:
def clean_content_azocleantech(content):
    # Remove surrounding list notation
    content = content.strip("[]")
    
    # Split sentences by .', ' or .', " or .", ' or .", "
    sentences = re.split(r"\.', |.'\", |.\", |.\', |.\", ", content)

    # Remove the first element, which is always the cookie-consent message
    sentences = sentences[1:]

    # Drop everything from the first element containing "The Sensi" (an advertisement)
    for i, sentence in enumerate(sentences):
        if "The Sensi" in sentence:
            sentences = sentences[:i]
            break
    
    # Join and clean up text
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)

    # Apply combined replacements to clean up unwanted characters and patterns
    cleaned_text = re.sub(r"\( |’ |' |''|` |`|” |“ |\"| -;|:", '', cleaned_text)
    cleaned_text = re.sub(r"’", "'", cleaned_text)  # Replace special apostrophe
    cleaned_text = re.sub(r"\s+", ' ', cleaned_text)  # Remove extra spaces

    # Replace  ' with a space
    cleaned_text = cleaned_text.replace(" '", " ")

    # Remove non-ASCII characters
    cleaned_text = re.sub(r'[^\x00-\x7F]+', '', cleaned_text)

    # Strip any leading/trailing single or double quotes
    cleaned_text = cleaned_text.strip("'\"")

    return cleaned_text

data_cleaned.loc[data_cleaned['domain'] == 'azocleantech', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'azocleantech']['content'].apply(clean_content_azocleantech)

print(data_cleaned[data_cleaned['domain'] == 'azocleantech']['content'].iloc[250][:500])

Leading renewable energy company RES has announced a collaboration with nge Municipality in Sweden to deliver a new green hydrogen plant in Alby. Together RES and nge Municipality aim to supply green hydrogen to the local industry and accelerate the energy transition The aim of the partnership between nge Municipality and RES is to utilise current available grid capacity to give the local industry access to green hydrogen in nge by 2024. The new green hydrogen plant will utilise renewable electr


In [2727]:
def clean_content_pv_magazine(content):
    # Step 1: Remove surrounding list notation and split by ', '
    content = content.strip("[]")
    sentences = re.split(r"',\s*'", content)

    # Step 2: Remove elements following the specified phrases
    for i, sentence in enumerate(sentences):
        if "This website uses cookies to anonymously" in sentence or "By submitting this form you agree to pv magazine using your data" in sentence or "This content is protected by copyright and may not be reused" in sentence:
            sentences = sentences[:i]
            break

    # Step 3: Join sentences and clean up text
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)
    
    # Additional cleaning if needed
    cleaned_text = re.sub(r"\s+", ' ', cleaned_text).strip()  # Remove extra spaces

    # remove space after (
    cleaned_text = re.sub(r"\( ", "(", cleaned_text)

    # replace “ and ” with "
    cleaned_text = cleaned_text.replace("“", "\"").replace("”", "\"")

    # replace a free-standing " (space on both sides) with a single space
    cleaned_text = cleaned_text.replace(' " ', " ")

    # remove "* "
    cleaned_text = cleaned_text.replace("* ", "")

    # remove "" followed by a space
    cleaned_text = cleaned_text.replace('"" ', "")

    # remove ',
    cleaned_text = cleaned_text.replace("',", "")

    # remove ",
    cleaned_text = cleaned_text.replace('",', "")

    # replace ’ with '
    cleaned_text = cleaned_text.replace("’", "'")

    # remove `` 
    cleaned_text = cleaned_text.replace("`` ", "")

    # remove '" 
    cleaned_text = cleaned_text.replace('\'" ', "")

    # remove " "
    cleaned_text = cleaned_text.replace('" "', "")

    # rejoin suffixes split from their apostrophe (' s -> 's, ' t -> 't, ...);
    # the word boundary keeps cases like "companies' solar" untouched
    cleaned_text = re.sub(r"' (s|t|ve|re|ll|d)\b", r"'\1", cleaned_text)

    return cleaned_text

# Apply the function to 'pv-magazine' domain entries
data_cleaned.loc[data_cleaned['domain'] == 'pv-magazine', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'pv-magazine']['content'].apply(clean_content_pv_magazine)

print(data_cleaned[data_cleaned["domain"] == "pv-magazine"]["content"].iloc[700][:500])

'Copenhagen Infrastructure Partners (CIP) has announced plans to develop an AUD 30 billion ($ 19,95 billion) green hydrogen production hub on South Australia's Eyre Peninsula, while a new research study shows the feasibility of gas-to-hydrogen pipeline conversion in Western Australia. Map obtained from the Global Solar Atlas 2.0, a free web-based application developed and operated by the company Solargis s.r.o. on behalf of the World Bank Group, utilizing Solargis data, with funding provided by 


In [2728]:
def clean_content(content):
    # Step 1: Remove surrounding list notation and split by ', '
    content = content.strip("[]")
    sentences = re.split(r"',\s*'", content)

    # Step 2: Join sentences back together
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)
    
    # Step 3: Remove special characters and apply cleanup
    cleaned_text = re.sub(r'[“”‘’":\'`]', '', cleaned_text)  # Remove special quotes and ticks
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Remove extra spaces

    # Remove space after (
    cleaned_text = re.sub(r"\( ", "(", cleaned_text)

    # Replace " – " with space
    cleaned_text = cleaned_text.replace(" – ", " ")

    return cleaned_text

# Apply the function to 'energyvoice' domain entries
data_cleaned.loc[data_cleaned['domain'] == 'energyvoice', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'energyvoice']['content'].apply(clean_content)

# Display an example to confirm the changes
print(data_cleaned[data_cleaned["domain"] == "energyvoice"]["content"].iloc[10])

Energean has taken final investment decision (FID) on its Karish North gas development, offshore Israel. The company forecast the development would start in the second half of 2023. Energean has also announced the signing of a $ 700 million loan, of which $ 150mn will go on Karish North. The company discovered Karish North 21 months ago. The find will be tied back to the Energean Power floating production, storage and offloading (FPSO) vessel, which is 5.4 km away. DeGolyer and MacNaughton provided a report in November 2020 giving 2P reserves at Karish North of 32 billion cubic metres of gas and 34 million barrels of liquids. The first well should produce around 3 bcm per year. The main Karish development is due to begin producing in the fourth quarter of 2021. Subsea and onshore work is due to be completed in the second quarter, while the FPSO should leave Singapore in the third quarter. Adding Karish North to the FPSO will mean that potential flow is more than the vessel s 8 bcm per 
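
The sample above contains `vessel s`: `clean_content` deletes apostrophes, so a possessive that was presumably written as `vessel’ s` in the raw text loses its apostrophe but keeps the space. An alternative is to normalize curly quotes to ASCII and rejoin the suffix instead. The sketch below is a heuristic, not the notebook's method; `normalize_apostrophes` and the suffix list are assumptions.

```python
import re

# Map curly quotes to their ASCII counterparts
QUOTE_MAP = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"'})

def normalize_apostrophes(text):
    """Normalize curly quotes and rejoin suffixes such as "vessel' s"."""
    text = text.translate(QUOTE_MAP)
    # Only rejoin short, common suffixes that end at a word boundary
    return re.sub(r"(\w)' (s|t|d|ll|re|ve|m)\b", r"\1'\2", text)

print(normalize_apostrophes("the vessel’ s capacity"))
print(normalize_apostrophes("companies’ solar assets"))
```

The word boundary after the suffix is what keeps a plural possessive followed by an ordinary word, as in the second call, from being glued together.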

In [2729]:
def clean_content_solarindustrymag(content):
    # Step 1: Remove surrounding list notation and split by ', '
    content = content.strip("[]")
    sentences = re.split(r"',\s*'", content)

    # Step 2: Remove specified sentences if they exist
    # Use regex to match text more flexibly
    sentences = [
        sentence for sentence in sentences 
        if not re.search(r"Solar Industry offers industry participants probing.*?solar market|© Copyright Zackin Publications Inc\. All Rights Reserved", sentence, re.IGNORECASE)
    ]

    # Step 3: Join sentences and clean up special characters and whitespace
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)
    
    # Step 4: Remove special characters
    cleaned_text = re.sub(r'[“”‘’"\'`]', '', cleaned_text)  # Remove special quotes and ticks
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Remove extra spaces

    # remove " - "
    cleaned_text = cleaned_text.replace(" - ", " ")

    # remove space after (
    cleaned_text = re.sub(r"\( ", "(", cleaned_text)

    return cleaned_text

data_cleaned.loc[data_cleaned['domain'] == 'solarindustrymag', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'solarindustrymag']['content'].apply(clean_content_solarindustrymag)

print(data_cleaned[data_cleaned["domain"] == "solarindustrymag"]["content"].iloc[0])

Clearway Energy Group, an owner and operator of renewable assets, has completed construction and reached commercial operations on the 192 MW Rosamond Central solar project in Kern County, Calif. Rosamond Central is contracted under power purchase agreements with East Bay Community Energy and Clean Power Alliance, both community choice aggregators providing a diverse range of power options to regional customers, and the City of Palo Alto Utilities, which has administered Palo Alto s electric power system for 120 years. We are proud to continue contributing to California s goal of 100% clean electricity and keep the state on the forefront of climate leadership, said Valerie Wooley, vice president of origination at Clearway Energy Group. Rosamond Central came together thanks to the dedication and effort of many partners, including McCarthy s swift action to create a safe working environment, and the trust of East Bay Community Energy, Clean Power Alliance and Palo Alto to meet their custo

In [2730]:
def clean_content_naturalgasintel(content):
    # Step 1: Remove surrounding list notation and split by ', '
    content = content.strip("[]")
    sentences = re.split(r"',\s*'", content)

    # Step 2: Remove signup form sentences at the beginning
    signup_or_header_phrases = [
        "Sign in to get the best natural gas news and data",
        "Your email address *", "Your password *", "Remember me Continue",
        "Reset password"
    ]
    while sentences and any(phrase in sentences[0] for phrase in signup_or_header_phrases):
        sentences.pop(0)

    # Step 3: Remove footer information starting from the copyright phrase
    footer_start = "© 2021 Natural Gas Intelligence. All rights reserved."
    for i, sentence in enumerate(sentences):
        if footer_start in sentence:
            sentences = sentences[:i]
            break

    # Step 4: Join sentences and clean up special characters and whitespace
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)
    cleaned_text = re.sub(r'[“”‘’"\[\[\'`]\s*', '', cleaned_text)  # Remove special quotes and ticks
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Remove extra spaces

    # remove space after (
    cleaned_text = re.sub(r"\( ", "(", cleaned_text)

    # replace a spaced em dash with a single space
    cleaned_text = cleaned_text.replace(" — ", " ")

    # remove everything before  | 
    cleaned_text = cleaned_text.split(" | ")[-1]

    return cleaned_text

data_cleaned.loc[data_cleaned['domain'] == 'naturalgasintel', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'naturalgasintel']['content'].apply(clean_content_naturalgasintel)

print(data_cleaned[data_cleaned["domain"] == "naturalgasintel"]["content"].iloc[0][:500])

NGI All News Access Major fluctuations in the latest weather models resulted in big swings in natural gas bidweek prices, with solid gains on the East Coast and out West. However, much of the countrys midsection posted hefty losses amid healthy storage levels, leaving NGIs January Bidweek National Avg. down 2.5 cents from December 2020 bidweek to $ 2.695/MMBtu. While just shy of the previous months average, the January 2021 bidweek average came in nearly 15.0 cents higher than year-ago levels an


In [2731]:
data_cleaned.loc[data_cleaned['domain'] == 'thinkgeoenergy', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'thinkgeoenergy']['content'].apply(clean_content)

print(data_cleaned[data_cleaned["domain"] == "thinkgeoenergy"]["content"].iloc[0][:500])

IRENA has released an assessment report on the renewables readiness for El Salvador, highlighting opportunity for improving conditions for geothermal development, including better renumeration schemes and add direct use options to the regulatory framework, among others. The main renewable resources used in El Salvador for electricity generation are geothermal and hydropower. While variable renewable power is growing considerably, there is much more potential for these resources, either for elect


The next function also handles the different apostrophe variants and the stand-alone letters they leave behind (such as a detached `s`).

In [2732]:
def clean_content_rechargenews(content):
    # Step 1: Remove surrounding list notation and split by ', '
    content = content.strip("[]")
    sentences = re.split(r"',\s*'", content)

    # Step 2: Remove all sentences following any stop phrase
    stop_phrases = [
        "Recharge is part of NHST Global Publications AS and we are responsible for the data that you register with us",
        "Recharge is part of DN Media Group",
        "Ecofriend.Org", "EcoFriend.com",
        "Thank you for subscribing to the email newsletter.",
        "To use the full function of this web site, JavaScript needs to be enabled in your browser.",
        "This site uses Akismet to reduce spam. Learn how your comment data is processed.",
    ]
    for i, sentence in enumerate(sentences):
        if any(stop_phrase in sentence for stop_phrase in stop_phrases):
            sentences = sentences[:i]
            break

    # Step 3: Join sentences and apply consolidated clean-up operations
    cleaned_text = ' '.join(sentence.strip() for sentence in sentences)
    cleaned_text = re.sub(r'[‘’]', "'", cleaned_text)  # Normalize all apostrophes to single '

    # Step 4: Remove any code-like blocks that start with `{ L.start`, `window.dojoRequire`, or other patterns
    cleaned_text = re.sub(r'\{.*?\}\s*\),?', '', cleaned_text, flags=re.DOTALL)  # Removes blocks like `{ ... })`
    cleaned_text = re.sub(r'window\.dojoRequire\s*\(.*?\)\s*[,;]?', '', cleaned_text, flags=re.DOTALL)  # Removes `window.dojoRequire(...)`

    # Step 5: Consolidated special character and whitespace cleanup
    cleaned_text = re.sub(r'[“”"\[\]`]', '', cleaned_text)  # Remove other quotes and brackets
    cleaned_text = re.sub(r'\s+([.,;:?!])', r'\1', cleaned_text)  # Remove space before punctuation
    cleaned_text = re.sub(r'\s*[-–—]\s*', ' ', cleaned_text)  # Replace dashes with single space
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Remove extra spaces
    cleaned_text = re.sub(r'\( ', '(', cleaned_text)  # Remove space after (
    cleaned_text = re.sub(r'\.,', '.', cleaned_text)  # Replace any ".,"

    # Additional character replacements to clean single quotes and spaces
    cleaned_text = cleaned_text.replace(" – ", " ")
    cleaned_text = cleaned_text.replace("', ", " ")
    cleaned_text = cleaned_text.replace(" ' ", " ")
    cleaned_text = cleaned_text.replace(" '", " ")
    cleaned_text = cleaned_text.replace("' ", "'")
    cleaned_text = cleaned_text.replace(" '", " ")
    cleaned_text = cleaned_text.replace("  ", " ")
    cleaned_text = cleaned_text.replace("}), ", "")
    
    return cleaned_text

data_cleaned.loc[data_cleaned['domain'] == 'rechargenews', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'rechargenews']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "rechargenews"]["content"].iloc[3][:500])

'The US state of New York has selected Equinor to supply almost 2.5GW of power from two projects in its highly anticipated second offshore wind solicitation, with the Norwegian energy giant winning a bid which includes large scale port redevelopment that will position the state as a hub for the fast emerging sector in the region. Equinor which will take over the top slot in power capacity under contract from Denmark's Orsted with the award until its recently agreed strategic partnership BP is fo


In [2733]:
data_cleaned.loc[data_cleaned['domain'] == 'energyintel', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'energyintel']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "energyintel"]["content"].iloc[0][:500])

Qatar Petroleum (QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. In its latest Sustainability Report published on Wednesday, QP said its goals include reducing the emissions intensity of Qatar's LNG facilities by 25% and of its upstream facilities by at least 15%. The company is also aiming to reduce gas flaring intensity across its upstream facilities by more than 75% and has raised its carbo


In [2734]:
data_cleaned.loc[data_cleaned['domain'] == 'pv-tech', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'pv-tech']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "pv-tech"]["content"].iloc[0][:500])

'The Global Wind Energy Council (GWEC) and Global Solar Council (GSC) have called on Mexican lawmakers to prevent changes to the country's Electricity Act, Ley de la Industria Electrica (LIE), which pose what they call an unequivocal threat to private investment in clean energy. The groups issued a joint statement this week, responding to controversial reforms that Mexico's lower house of Congress approved on Wednesday (24 February). Currently, the Federal Electricity Commission (CFE) has to buy


In [2735]:
data_cleaned.loc[data_cleaned['domain'] == 'businessgreen', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'businessgreen']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "businessgreen"]["content"].iloc[0][:500])

Geothermal energy in the UK took a major step forward this week, after the developer of the nation's first geothermal power plant struck landmark deals to supply electricity to green energy supplier Ecotricity and heat to a geothermal rum startup. The deal to provide a minimum 3MW of baseload electricity to Ecotricity customers marks the first time geothermal electricity will be purchased and sold through the UK's grid, according to Geothermal Engineering Limited, the company that owns the pione


In [2736]:
data_cleaned.loc[data_cleaned['domain'] == 'greenprophet', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'greenprophet']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "greenprophet"]["content"].iloc[0][:500])

American consumers are more concerned about the planet than steady economic growth, new report. Your company wants to be a part of this. What steps do you take? Each company should create detailed reports that evaluate the environmental impact of the business, numerous social responsibilities and factors that can improve corporate governance. Once the investors review these reports, the shareholders may provide additional investments, request more information, examine the value of the company an


In [2737]:
data_cleaned.loc[data_cleaned['domain'] == 'ecofriend', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'ecofriend']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "ecofriend"]["content"].iloc[0][:500])

'Green construction is all about using the resource efficiently and responsibly in the processes of construction to ensure lifetime sustainability of the house. The construction practices that are eco friendly, ensures the least harm on the environment while the construction takes place. Generally, the eco friendly building and construction is cost effective and at the same time durable. This type of construction ensures to lower the overall adverse impact on the environment and mankind. The mai


In [2738]:
data_cleaned.loc[data_cleaned['domain'] == 'solarpowerportal.co', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'solarpowerportal.co']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "solarpowerportal.co"]["content"].iloc[0][:500])

'Statera's 50MW battery storage facility in Wiltshire enters commercial operations. Image: Santander UK. A 50MW battery energy storage facility being developed by Statera Energy has entered commercial operations in Malmesbury, Wiltshire. Dubbed Minety South Storage 2, the project received funding of around £86 million from Santander UK and NatWest. On top of this, the project received an additional £30 million from an accordion facility. The completion of Minety South Storage 2 takes Statera Ene


In [2739]:
data_cleaned.loc[data_cleaned['domain'] == 'eurosolar', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'eurosolar']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "eurosolar"]["content"].iloc[0][:500])

'EUROSOLAR veranstaltet vom 16. bis 18. März 2021 seine 15. International Renewable Energy Storage Conference (IRES 2021) als globales Online Event. Wissenschaftlern aus der ganzen Welt wird so die Teilnahme ermöglicht. Die IRES Konferenz widmet sich neusten wissenschaftlichen Erkenntnissen zu Speichersystemen in der Welt der intelligenten und verteilten Energieressourcen der zentrale Fokus liegt nicht nur auf Speichertechnologien, sondern auch auf den entsprechenden rechtlichen, politischen, Ne


In [2740]:
data_cleaned.loc[data_cleaned['domain'] == 'decarbxpo', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'decarbxpo']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "decarbxpo"]["content"].iloc[0])

'Product Details: Place of origin: Jiangxi, China Brand name: HLY Certification: CE/RoHS/UN38.3/MSDS/FCC/BIS Model Number: rechargeable Li Ion 18650 2600mAhPayment and Shipping T & C: Min order Product Details: Place of origin: Jiangxi, China Brand name: HLY Certification: CE/RoHS/UN38.3/MSDS/FCC/BIS Model Number: rechargeable Li Ion 26650 5000mAhPayment and Shipping T & C: Min order SUSTAINABLE AND COMMITTED TO THE FUTURE. The 17 Sustainable Development Goals (SDGs) are political targets set by the United Nations (UN) to ensure sustainable development worldwide at the economic, The CHARX connect mode 3 charging cables are suitable for AC charging of electric vehicles from all manufacturers. They achieve charging powers of up to 26 kW and are available with Type 1 and Type 2 WE LOVE INDIVIDUAL AND COMPLETE SOLUTIONS WITH ALTERNATIVE ENERGIES FOR SMART APPLICATIONS IN COMPANIES, PUBLIC FACILITIES AND PRIVATE BUILDINGS. WE ARE HAPPY TO HELP! For more than 40 years we have specialized in 

In [2741]:
data_cleaned.loc[data_cleaned['domain'] == 'solarquarter', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'solarquarter']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "solarquarter"]["content"].iloc[0])

'A groundbreaking achievement in solar energy research has been unveiled as scientists from the Fraunhofer Institute for Solar Energy Research ISE and NWO Institute AMOLF (Amsterdam) jointly created a multijunction solar cell with an unprecedented efficiency rate of 36.1%. This remarkable milestone marks the highest efficiency ever achieved for a solar cell based on silicon. The team presented their groundbreaking achievement at the European Photovoltaic Solar Energy Conference (EU PVSEC) held in Lisbon on September 21, 2023. The research project received funding through the Fraunhofer ICON program. Albert Polman, who spearheaded the AMOLF segment of the project, highlighted the unique collaboration between Fraunhofer ISE and AMOLF, which began in 2020. Fraunhofer ISE is renowned for its work on ultra high efficiency solar cells using silicon and III V semiconductors, while AMOLF has extensive expertise in optimizing light management within solar cells. This collaboration allowed them 

In [2745]:
data_cleaned.loc[data_cleaned['domain'] == 'indorenergy', 'content'] = \
    data_cleaned[data_cleaned['domain'] == 'indorenergy']['content'].apply(clean_content_rechargenews)

print(data_cleaned[data_cleaned["domain"] == "indorenergy"]["content"].iloc[0])

'For our 13th Annual Event Indo Renergy 2023 Expo & Forum, we will be bringing together businesses, sustainable energy industry trade associations, government agencies, and energy policy research organizations to showcase the status and near term potential of the cross section of renewable energy (bio fuels/biomass, geothermal, solar, water, wind) and energy efficiency technologies. Renewable energy is energy that is collected from renewable resources, which are naturally replenished on a human timescale, such as sunlight, wind, rain, tides, waves, and geothermal heat. Renewable energy often provides energy in four important areas: electricity generation, air and water heating/cooling, transportation, and rural (off grid) energy services. Based on REN21's 2016 report, renewables contributed 19.2% to humans'global energy consumption and 23.7% to their generation of electricity in 2014 and 2015, respectively. This energy consumption is divided as 8.9% coming from traditional biomass, 4.2

# Which preprocessing steps are actually necessary?

### Remove copyright notices

In [272]:
# Remove generic copyright notices of the form "© ... All rights reserved."
data_cleaned['content'] = data_cleaned['content'].apply(
    lambda text: re.sub(r'©.*?All rights reserved\.?', '', text, flags=re.IGNORECASE))

# For cleantechnica articles, also remove the site's standard disclaimer
copyright_text_cleantechnica = 'Copyright © 2023 CleanTechnica. The content produced by this site is for entertainment purposes only. Opinions and comments published on this site may not be sanctioned by and do not necessarily represent the views of CleanTechnica, its owners, sponsors, affiliates, or subsidiaries.'
is_cleantechnica = data_cleaned['domain'] == 'cleantechnica'
data_cleaned.loc[is_cleantechnica, 'content'] = data_cleaned.loc[is_cleantechnica, 'content'].apply(
    lambda text: re.sub(re.escape(copyright_text_cleantechnica), '', text, flags=re.IGNORECASE))

# Inspect one article to check the result
data_cleaned.loc[1281, 'content']

"['This is the first installment of a three-part series on community solar for low- and moderate-income costumers. The next piece will focus on how partnerships with community-based organizations can catalyze low- and moderate-income solar, while the third will explore the role local governments can play in accelerating solar adoption in those communities.', 'Residential clean energy projects are typically associated with affluent customers. Community solar, however, offers a pathway for low- and moderate-income customers to benefit from clean energy projects too.', 'Residential rooftop solar projects often only benefit the households who live under the array — but community solar projects can reach beyond the property where a project is located, impacting many more customers who otherwise might have a hard time accessing renewable power.', 'Here, we explain how community solar works and how it can deliver benefits for low- and moderate-income customers where projects are sited and bey

### Remove Irrelevant Content

- Cookie-consent sentences such as "By clicking \`\` Allow All '' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info." should be removed because they are not part of the article content.

    -> Found in 1627 rows in total.

- Copyright notices such as "© Copyright Zackin Publications Inc. All Rights Reserved." should be removed for the same reason.

    -> Found in 672 rows in total.
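
As a rough illustration of the two removals listed above, the following sketch strips both kinds of boilerplate from plain strings and counts how many texts contain each. The patterns `COOKIE_BANNER` and `COPYRIGHT_NOTICE` and the helpers `remove_boilerplate` and `count_matches` are hypothetical names introduced here, and the patterns are modeled only on the two quoted examples, so the actual notices in the dataset may need broader patterns.

```python
import re

# Assumed patterns, modeled on the two examples quoted above; the real
# notices in the dataset may be worded differently.
COOKIE_BANNER = re.compile(
    r"By clicking\W+Allow All\W+you agree to the storing of cookies.*?More info\.",
    flags=re.IGNORECASE | re.DOTALL,
)
COPYRIGHT_NOTICE = re.compile(r"©.*?All rights reserved\.?", flags=re.IGNORECASE)


def remove_boilerplate(text):
    """Strip cookie banners and copyright notices, then collapse whitespace."""
    text = COOKIE_BANNER.sub("", text)
    text = COPYRIGHT_NOTICE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def count_matches(texts, pattern):
    """Count how many texts contain at least one match of the pattern."""
    return sum(1 for text in texts if pattern.search(text))


sample = [
    "Solar output rose. By clicking `` Allow All '' you agree to the storing "
    "of cookies on your device to enhance site navigation, analyse site usage "
    "and support us in providing free open access scientific content. More info.",
    "Wind capacity doubled. © Copyright Zackin Publications Inc. All Rights Reserved.",
    "No boilerplate here.",
]
cleaned = [remove_boilerplate(text) for text in sample]
print(cleaned)
print(count_matches(sample, COOKIE_BANNER), count_matches(sample, COPYRIGHT_NOTICE))
```

On the notebook's DataFrame, the same function could presumably be applied with `data_cleaned['content'].apply(remove_boilerplate)`. Because the copyright pattern is non-greedy but unbounded, it can swallow article text when a `©` appears well before the phrase "All rights reserved", so spot-checking a few rows after applying it is advisable.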

### Tokenization

Tokenization breaks text down into individual units called tokens, typically words. It is essential for most text preprocessing tasks because it lets us manipulate and analyze the content at the word level. Once the content is split into tokens, further techniques such as stemming and lemmatization are easier to apply. The cell below uses a plain whitespace split, so punctuation stays attached to the adjacent word.

In [14]:
def tokenize(text):
    return text.split()

data['content_tokenized'] = data['content_cleaned'].apply(tokenize)
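
A whitespace split leaves punctuation attached to words, so "25%." stays one token. One possible alternative is a small regex tokenizer, sketched below using only the standard library. `TOKEN_PATTERN` and `tokenize_regex` are hypothetical names that are not part of the notebook, and the pattern is an assumption that would need tuning for this dataset.

```python
import re

# Hypothetical pattern: a run of word characters (optionally continued by an
# apostrophe or period followed by more word characters), or a single
# punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['.]\w+)*|[^\w\s]")


def tokenize_regex(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Qatar's LNG facilities cut emissions by 25%."
whitespace_tokens = sentence.split()
regex_tokens = tokenize_regex(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Here the whitespace split keeps `25%.` as a single token, while the regex version yields `25`, `%` and `.` separately and keeps `Qatar's` intact. The notebook already imports `nltk` and `spacy`, whose tokenizers would be the more robust choice if their language resources are installed.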

### Result

### Save Processed Data

In [22]:
data.to_csv('../data/processed/cleantech_processed.csv', index=False)
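
One thing to keep in mind when saving: a CSV cell can only hold text, so a column of token lists such as `content_tokenized` is written as the string form of each list and comes back as a string when the file is reloaded. The sketch below illustrates this with the standard-library `csv` module as a stand-in for pandas (an assumption made to keep the example dependency-free); the row contents are made up for illustration.

```python
import ast
import csv
import io

# Made-up example row with a list-valued column
rows = [{"title": "Example article", "content_tokenized": ["solar", "power", "grows"]}]

# Write to an in-memory CSV; non-string values are stringified with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "content_tokenized"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the token list is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["content_tokenized"]

# ast.literal_eval safely parses the string back into a Python list
restored = ast.literal_eval(raw_value)
print(type(raw_value).__name__, type(restored).__name__)
```

With pandas, the same recovery could presumably be done after `pd.read_csv` by applying `ast.literal_eval` to the column. Alternatively, a format that preserves nested types could be used instead of CSV.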

### Next steps

- Text Analysis / NLP Task Preparation
- Sentiment Analysis
- Text Summarization
- Save the cleaned data
- Exploratory Data Analysis (EDA)


The next steps could also include limiting the vocabulary.
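
As one possible reading of "limiting the vocabulary", the sketch below keeps only the most frequent tokens across all documents and maps everything else to a placeholder. The helper names `build_vocabulary` and `limit_vocabulary`, the `<unk>` placeholder and the toy documents are all assumptions for illustration. The notebook does not specify how the vocabulary would be limited.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, max_size):
    """Return the set of the max_size most frequent tokens across all documents."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token for token, _ in counts.most_common(max_size)}


def limit_vocabulary(tokens, vocabulary, unk_token="<unk>"):
    """Replace every token outside the vocabulary with a placeholder token."""
    return [token if token in vocabulary else unk_token for token in tokens]


# Toy tokenized documents standing in for the content_tokenized column
docs = [
    ["solar", "power", "grows"],
    ["solar", "wind", "power"],
    ["solar", "storage"],
]
vocab = build_vocabulary(docs, max_size=2)
limited = [limit_vocabulary(doc, vocab) for doc in docs]
print(sorted(vocab))
print(limited)
```

Other common choices would be a minimum-frequency threshold or removing stop words. Which one fits depends on the downstream task.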