# Stage 1: Enhanced Data Cleaning, Preprocessing, and Exploratory Analysis
In this notebook, we perform **data cleaning, preprocessing, and exploratory analysis (EDA)** on the Cleantech Media and Google Patent datasets. The goal is to identify **trends, key technologies, and innovation gaps** by analyzing media publications and patents.

In [None]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install contractions
!pip install unidecode
!pip install num2words
!pip install pyspellchecker
!pip install langdetect


In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import re
import nltk
import spacy
import string
import contractions
import unidecode
from tqdm import tqdm
from num2words import num2words
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from spellchecker import SpellChecker
from langdetect import detect

## Data Collection and Cleaning (Joshua)
Before analyzing the data, we first **load, inspect, and clean** the datasets:  

- **Load datasets**: We import the **Cleantech Media Dataset** and the **Cleantech Google Patent Dataset** into Pandas DataFrames.  
- **Remove duplicates**: Identical or near-identical entries are removed to prevent data bias.  
- **Handle missing values**: We check for null or incomplete entries and decide whether to impute, replace, or remove them.  
- **Filter relevant information**: Non-informative texts (e.g., generic statements) are removed to ensure high-quality analysis.  
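
Exact duplicates can be found with a plain equality check, but near-identical entries need a similarity measure. The sketch below is not the method used in this notebook; it only illustrates one possible approach with `difflib.SequenceMatcher` from the standard library. The helper name `find_near_duplicates`, the sample texts, and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(texts, threshold=0.9):
    """Return index pairs (i, j) whose similarity ratio is >= threshold.

    Hypothetical helper for illustration. It compares every pair of texts,
    so it is only practical for small samples.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((i, j))
    return pairs


# Made-up sample texts (not taken from the datasets)
sample_texts = [
    "Solar capacity in the region doubled last year.",
    "Solar capacity in the region doubled last year!",  # differs by one character
    "A new offshore wind farm was approved by regulators.",
]

near_duplicates = find_near_duplicates(sample_texts)
print(near_duplicates)
```

Because the comparison is pairwise, its cost grows quadratically with the number of texts. For a dataset of this size, it would probably need to be restricted to candidate groups first (for example, articles from the same domain and date).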

In [None]:
# Load raw data
data_folder = Path("/content/drive/MyDrive/computational-language-technology-project")
media_dataset_path = data_folder / "data/cleantech_media_dataset_v3_2024-10-28.csv"
google_patent_dataset_path = data_folder / "data/CleanTech_22-24_updated.json"

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## Cleantech Media Dataset

In [None]:
# Load CSV files
df_media = pd.read_csv(media_dataset_path, header = 0)

print(df_media.info())
df_media

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20111 entries, 0 to 20110
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  20111 non-null  int64  
 1   title       20111 non-null  object 
 2   date        20111 non-null  object 
 3   author      0 non-null      float64
 4   content     20111 non-null  object 
 5   domain      20111 non-null  object 
 6   url         20111 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 1.1+ MB
None


Unnamed: 0.1,Unnamed: 0,title,date,author,content,domain,url
0,93320,"XPeng Delivered ~100,000 V...",2022-01-02,,['Chinese automotive start...,cleantechnica,https://cleantechnica.com/...
1,93321,Green Hydrogen: Drop In Bu...,2022-01-02,,['Sinopec has laid plans t...,cleantechnica,https://cleantechnica.com/...
2,98159,World’ s largest floating ...,2022-01-03,,['Huaneng Power Internatio...,pv-magazine,https://www.pv-magazine.co...
3,98158,Iran wants to deploy 10 GW...,2022-01-03,,['According to the Iranian...,pv-magazine,https://www.pv-magazine.co...
4,31128,Eastern Interconnection Po...,2022-01-03,,['Sign in to get the best ...,naturalgasintel,https://www.naturalgasinte...
...,...,...,...,...,...,...,...
20106,104263,US Treasury finalises 45X ...,2024-10-24,,['The US Department of the...,pv-tech,https://www.pv-tech.org/us...
20107,104264,EDP trials robotic constru...,2024-10-24,,['Developer EDP is pilotin...,pv-tech,https://www.pv-tech.org/ed...
20108,101434,Australia has 7.8 GW of ut...,2024-10-24,,['The volume of large-scal...,pv-magazine,https://www.pv-magazine.co...
20109,101428,Residential PV prices in G...,2024-10-24,,['The comparison site Self...,pv-magazine,https://www.pv-magazine.co...


In [None]:
# Count occurrences of each unique value in the 'domain' column
domain_counts = df_media['domain'].value_counts()

# Display the counts
print(domain_counts)

domain
energy-xprt              4181
pv-magazine              3093
azocleantech             2488
cleantechnica            2089
pv-tech                  1969
thinkgeoenergy           1052
solarpowerportal.co       850
energyvoice               828
solarpowerworldonline     785
solarindustrymag          621
solarquarter              606
rechargenews              573
naturalgasintel           298
iea                       173
energyintel               171
greenprophet              130
greenairnews               59
ecofriend                  55
all-energy                 39
decarbxpo                  20
storagesummit              15
eurosolar                   9
indorenergy                 4
bex-asia                    2
biofuels-news               1
Name: count, dtype: int64


In [None]:
# Create a new dataframe for the processed data
df_media_preprocessed = df_media.rename(columns={df_media.columns[0]: 'id'})
df_media_preprocessed.drop(columns=['author'], inplace=True)

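Some entries start with website boilerplate (e.g., cookie notices) rather than article text. As a spot check, we look at the content of the entries with IDs 23099 and 23100: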

In [None]:
# Extract the 'content' for each ID
content_23099 = df_media_preprocessed[df_media_preprocessed['id'] == 23099]['content'].values[0]
content_23100 = df_media_preprocessed[df_media_preprocessed['id'] == 23100]['content'].values[0]

# Create a comparison table
comparison_table = pd.DataFrame({
    'ID 23099 Content': [content_23099],
    'ID 23100 Content': [content_23100]
})

# Display the table
print(comparison_table)

                ID 23099 Content               ID 23100 Content
0  ['We use cookies to enhanc...  ['We use cookies to enhanc...


In [None]:
# Convert columns to the required data types
df_media_preprocessed['title'] = df_media_preprocessed['title'].astype(str)
df_media_preprocessed['content'] = df_media_preprocessed['content'].astype(str)
df_media_preprocessed['domain'] = df_media_preprocessed['domain'].astype(str)
df_media_preprocessed['url'] = df_media_preprocessed['url'].astype(str)
df_media_preprocessed['date'] = pd.to_datetime(df_media_preprocessed['date'], errors='coerce')
df_media_preprocessed['id'] = df_media_preprocessed['id'].astype(int)

# Check for duplicates
duplicate_ids = df_media_preprocessed[df_media_preprocessed.duplicated(subset=['id'])]
print(duplicate_ids)


df_media_preprocessed

Empty DataFrame
Columns: [id, title, date, content, domain, url]
Index: []


Unnamed: 0,id,title,date,content,domain,url
0,93320,"XPeng Delivered ~100,000 V...",2022-01-02,['Chinese automotive start...,cleantechnica,https://cleantechnica.com/...
1,93321,Green Hydrogen: Drop In Bu...,2022-01-02,['Sinopec has laid plans t...,cleantechnica,https://cleantechnica.com/...
2,98159,World’ s largest floating ...,2022-01-03,['Huaneng Power Internatio...,pv-magazine,https://www.pv-magazine.co...
3,98158,Iran wants to deploy 10 GW...,2022-01-03,['According to the Iranian...,pv-magazine,https://www.pv-magazine.co...
4,31128,Eastern Interconnection Po...,2022-01-03,['Sign in to get the best ...,naturalgasintel,https://www.naturalgasinte...
...,...,...,...,...,...,...
20106,104263,US Treasury finalises 45X ...,2024-10-24,['The US Department of the...,pv-tech,https://www.pv-tech.org/us...
20107,104264,EDP trials robotic constru...,2024-10-24,['Developer EDP is pilotin...,pv-tech,https://www.pv-tech.org/ed...
20108,101434,Australia has 7.8 GW of ut...,2024-10-24,['The volume of large-scal...,pv-magazine,https://www.pv-magazine.co...
20109,101428,Residential PV prices in G...,2024-10-24,['The comparison site Self...,pv-magazine,https://www.pv-magazine.co...


In [None]:
# List of unwanted suffixes
unwanted_suffixes = [
    "Need help finding the right suppliers?",
    "Energy XPRT is part of XPRT Media All Rights Reserved. Terms Privacy",
    "Your email address will not be published. Required fields are marked",
    "Hi, I 'm Azthena, you can trust me to find commercial scientific answers",
    "Copyright",
    "To use the full function of this web site",
    "EcoFriend.com ",
    "EuropÃ¤ische Vereinigung",
    "Get updates on the IEA",
    "About us",
    "This website uses cookies to",
    "Recharge is part of DN Media Group",
    "Solar Industry offers industry participants probing",
    "Thank you for subscribing to the email newsletter",
    "This site uses Akismet to reduce spam"
]

# Define the prefixes to remove
unwanted_prefixes = [
    "By clicking `` Allow All '' you agree ",
    "We use cookies to enhance your experience",
    "Sign in to get the best natural gas news and data",
    "Your email address *",
    "Your password *",
    "Remember me Continue",
    "Reset password",
    "Featured Content",
    "News & Data Services",
    "Client Support",
    "- May 27, 2022 - With the innovation and development",
    "A consortium of Mitsubishi Power Americas",
    "Advanced Energy's WaveCapture",
    "Check out a tour of Solar Turbines ' Configurable Modular",
    "Benefits of Combined Heat & Power ( Cogeneration) Website",
    "window.dojoRequire",
    "Accurate Wind Resource Assessment. Power Performance Verification",
    "Create a free IEA account to download",
    "Equip yourself with various operating voltages and advanced control",
    "FLOWSTAR-Energy is a practical, high resolution model ",  # comma was missing: the next string was silently concatenated
    "Hydrogen Technology Expo & Carbon Capture Technology Expo was held in Bremen",
    "By Power Vision Engineering Sarl based in Ecublens",
    "Power Vision Engineering provides various...",
    "Shenzhen Power Kingdom Co., Ltd. is one of the subsidiaries of Henan Yuguang Gold",
    "This course investigates",
    "This course examines",
    "Welcome to Edinburgh Instruments",
    "Create a free IEA account to download our reports or subcribe"
]

import ast

def remove_unwanted_prefixes(content):
    """Drop list items that start with one of the unwanted prefixes."""
    if isinstance(content, str):
        try:
            # literal_eval parses the stringified list without executing arbitrary code
            content_list = ast.literal_eval(content)
        except (ValueError, SyntaxError):
            return content  # Not a valid list literal (e.g. truncated by the suffix removal)
        if isinstance(content_list, list):
            content_list = [
                item for item in content_list
                if not (isinstance(item, str) and item.startswith(tuple(unwanted_prefixes)))
            ]
            return str(content_list)  # Convert back to string
    return content

# Function to remove everything after any of the unwanted phrases
def remove_unwanted_suffixes(content):
    if isinstance(content, str):
        for phrase in unwanted_suffixes:
            index = content.find(phrase)
            if index != -1:
                return content[:index].strip()  # Keep only the part before the phrase
    return content

df_media_preprocessed['content'] = df_media_preprocessed['content'].apply(remove_unwanted_suffixes)
df_media_preprocessed['content'] = df_media_preprocessed['content'].apply(remove_unwanted_prefixes)

In [None]:
# Extract the 'content' for each ID
content_23099 = df_media_preprocessed[df_media_preprocessed['id'] == 23099]['content'].values[0]
content_23100 = df_media_preprocessed[df_media_preprocessed['id'] == 23100]['content'].values[0]

# Create a comparison table
comparison_table = pd.DataFrame({
    'ID 23099 Content': [content_23099],
    'ID 23100 Content': [content_23100]
})

# Display the table
print(comparison_table)

                ID 23099 Content               ID 23100 Content
0  ['In a groundbreaking leap...  ['There are multiple metho...


In [None]:
save_path = "/content/drive/MyDrive/computational-language-technology-project/cleaned_data/media_dataset_pre-cleaned.csv"
df_media_preprocessed.to_csv(save_path, index=False)

## Cleantech Google Patent Dataset

In [None]:
df_google_patents = pd.read_json(google_patent_dataset_path, lines=True)
print(df_google_patents.info())
df_google_patents

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406857 entries, 0 to 406856
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   publication_number  406857 non-null  object
 1   application_number  406857 non-null  object
 2   country_code        406857 non-null  object
 3   title               406857 non-null  object
 4   abstract            406857 non-null  object
 5   publication_date    406857 non-null  int64 
 6   inventor            406857 non-null  object
 7   cpc_code            406857 non-null  object
dtypes: int64(1), object(7)
memory usage: 24.8+ MB
None


Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc_code
0,CN-117138249-A,CN-202311356270-A,CN,一种石墨烯光疗面罩,The application provides a...,20231201,"[LI HAITAO, CAO WENQIANG]",A61N2005/0654
1,CN-117151396-A,CN-202311109834-A,CN,Distributed economic sched...,The invention discloses a ...,20231201,"[HU PENGFEI, LI ZIMENG]",G06Q50/06
2,CN-117141530-A,CN-202310980795-A,CN,氢能源动力轨道车辆组,The invention discloses a ...,20231201,"[XIE BO, ZHANG SHUIQING, Z...",Y02T90/40
3,CN-117141244-A,CN-202311177651-A,CN,一种汽车太阳能充电系统、方法及新能源汽车,The application discloses ...,20231201,[ZHAO PENGCHENG],B60K16/00
4,CN-117146094-A,CN-202311272549-A,CN,一种水利水电管道连接装置,The invention provides a w...,20231201,"[LYU SHUOSHUO, LI PANFENG,...",F16L55/02
...,...,...,...,...,...,...,...,...
406852,CN-113887946-A,CN-202111162179-A,CN,Emergency material transpo...,The application discloses ...,20220104,"[QI QINGJIE, LIU HAN, LIU ...",G06Q10/0631
406853,TW-202201895-A,TW-109121032-A,TW,Fish farming system capabl...,The present invention rela...,20220101,"[WANG, YI-FENG]",Y02E10/50
406854,TW-202201895-A,TW-109121032-A,TW,魚類養殖系統,The present invention rela...,20220101,"[WANG, YI-FENG]",Y02E10/50
406855,TW-202201895-A,TW-109121032-A,TW,Fish farming system capabl...,The present invention rela...,20220101,"[WANG, YI-FENG]",Y02B10/10


In [None]:
df_counts = df_google_patents["publication_number"].value_counts().reset_index()
df_counts.columns = ["publication_number", "count"]
df_counts = df_counts.sort_values(by="count", ascending=False)

print(df_counts)

      publication_number  count
0          EP-3969821-A2    550
1          EP-4237625-A1    450
2       WO-2022089858-A1    450
3          EP-4326671-A1    390
4       WO-2023124030-A1    370
...                  ...    ...
31355     ZA-202305543-B      1
31354      NL-2032498-B1      1
31353       TW-M626789-U      1
31351      NL-2028183-B1      1
31365         SI-26447-A      1

[31366 rows x 2 columns]


The duplicates differ either in their language or in their CPC code. The CPC (Cooperative Patent Classification) code belongs to a classification system that categorizes patents by their technical content.

**Structure of CPC Codes:**

CPC codes are hierarchical and consist of:

- Section (e.g., Y)
- Class (e.g., Y02)
- Subclass (e.g., Y02B)
- Group (e.g., Y02B10)
- Subgroup (e.g., Y02B10/10 or Y02B10/20)

A single patent can be classified under multiple CPC codes if its invention covers aspects of different categories.
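
As a small illustration of this hierarchy, the sketch below splits a full subgroup code into the five levels listed above. The function name `split_cpc_code` and the regular expression are assumptions made for this example; they only cover codes of the form shown in the `cpc_code` column (such as `Y02B10/10`).

```python
import re

# Assumed pattern: section letter, two class digits, subclass letter,
# group digits, a slash, and subgroup digits (e.g. "Y02B10/10")
CPC_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z])(\d+)/(\d+)$")


def split_cpc_code(code):
    """Split a full CPC subgroup code into its hierarchy levels (hypothetical helper)."""
    match = CPC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"Not a full CPC subgroup code: {code!r}")
    section, class_digits, subclass_letter, group_digits, subgroup_digits = match.groups()
    cpc_class = section + class_digits
    subclass = cpc_class + subclass_letter
    group = subclass + group_digits
    return {
        "section": section,
        "class": cpc_class,
        "subclass": subclass,
        "group": group,
        "subgroup": f"{group}/{subgroup_digits}",
    }


levels = split_cpc_code("Y02B10/10")
print(levels)
```

Grouping patents by a coarser level (for example, the subclass) could be one way to compare technology areas without treating every subgroup as a separate category.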

To remove these duplicates, we check whether the text in the "title" and "abstract" columns is English. Non-English rows are dropped, and for each publication number only the first English row is kept:

In [None]:
# Function to check if text is in English
def is_english(text):
    try:
        return detect(text) == "en"
    except Exception:
        return False  # Handle cases where text is empty or cannot be detected

# Select all publication numbers from df_counts
publication_numbers_to_check = df_counts["publication_number"].tolist()

# Filter the main dataframe to only include these publication numbers.
# .copy() avoids pandas' SettingWithCopyWarning when a column is added below.
df_filtered = df_google_patents[df_google_patents["publication_number"].isin(publication_numbers_to_check)].copy()

# Initialize progress bar for language detection
tqdm.pandas(desc="Checking language")

# Apply language detection on the filtered dataset
df_filtered["is_english"] = df_filtered.progress_apply(
    lambda row: is_english(str(row["title"])) and is_english(str(row["abstract"])), axis=1
)

# Count total and English entries
total_count = len(df_filtered)
english_count = df_filtered["is_english"].sum()

print(f"Total rows processed: {total_count}")
print(f"English rows detected: {english_count}")

# Filter only English rows
df_english = df_filtered[df_filtered["is_english"]]

# Drop duplicates and keep the first English entry per publication_number
df_google_patents_preprocessed = df_english.drop_duplicates(subset="publication_number", keep="first")

# Check the results
df_google_patents_preprocessed.info()


Checking language: 100%|██████████| 406857/406857 [53:54<00:00, 125.80it/s]


Total rows processed: 406857
English rows detected: 188212
<class 'pandas.core.frame.DataFrame'>
Index: 28827 entries, 1 to 406853
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   publication_number  28827 non-null  object
 1   application_number  28827 non-null  object
 2   country_code        28827 non-null  object
 3   title               28827 non-null  object
 4   abstract            28827 non-null  object
 5   publication_date    28827 non-null  int64 
 6   inventor            28827 non-null  object
 7   cpc_code            28827 non-null  object
 8   is_english          28827 non-null  bool  
dtypes: bool(1), int64(1), object(7)
memory usage: 2.0+ MB


In [None]:
df_google_patents_preprocessed

Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc_code,is_english
1,CN-117151396-A,CN-202311109834-A,CN,Distributed economic sched...,The invention discloses a ...,20231201,"[HU PENGFEI, LI ZIMENG]",G06Q50/06,True
5,CN-117147382-A,CN-202310985511-A,CN,Device for monitoring hydr...,The invention provides a d...,20231201,"[MA ZHAOXIANG, WANG CHENGX...",G01N13/00,True
6,CN-113344288-B,CN-202110717505-A,CN,Cascade hydropower station...,The invention discloses a ...,20231201,[],G06Q10/04,True
8,CN-117153944-A,CN-202311209193-A,CN,"Heterojunction solar cell,...",The application provides a...,20231201,"[TONG HONGBO, JIN YUPENG]",H01L31/074,True
9,CN-116911695-B,CN-202311167289-A,CN,Flexible resource adequacy...,The invention relates to a...,20231201,[],H02J2203/20,True
...,...,...,...,...,...,...,...,...,...
406698,CN-215416083-U,CN-202121495971-U,CN,Combined linear Fresnel li...,The utility model disclose...,20220104,[Qin Taohua],Y02B10/20,True
406767,CN-215412583-U,CN-202121129938-U,CN,Solar air heat collection ...,The utility model disclose...,20220104,[THE INVENTOR HAS WAIVED T...,Y02E10/40,True
406815,CN-215420159-U,CN-202120890502-U,CN,But angle regulation&#39;s...,The utility model disclose...,20220104,"[YU RONGSHENG, ZHANG YONGS...",Y02E10/50,True
406841,CN-215412573-U,CN-202120748049-U,CN,Commercial solar energy an...,The utility model disclose...,20220104,"[ZHANG LIANGLIANG, XU MENG...",Y02E10/40,True


In [None]:
save_path = "/content/drive/MyDrive/computational-language-technology-project/cleaned_data/google_patent_pre-cleaned.csv"
df_google_patents_preprocessed.to_csv(save_path, index=False)

## Text Preprocessing (Joshua)
To ensure that the text data is **ready for NLP tasks**, we preprocess it using common natural language processing (NLP) techniques:  

- **Tokenization**: Split text into individual words or subwords for better analysis.  
- **Stopword Removal**: Common but uninformative words (e.g., "the", "is", "and") are removed.  
- **Stemming**: Words are reduced to their root form (e.g., "developing" → "develop").  
- **Lowercasing**: Convert all text to lowercase so that differently cased variants of a word (e.g., "Solar" and "solar") count as the same token.  

These steps improve the quality of text-based analysis and ensure consistency across datasets.
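
The steps are applied one after another, and their order matters: stopwords, for example, have to be removed after lowercasing because the stopword list is lowercase. The following is a minimal sketch of chaining such steps as a list of functions. The step functions are simplified stand-ins that use only the standard library, and `TINY_STOPWORDS`, `PIPELINE`, and `apply_pipeline` are hypothetical names, not the NLTK resources used in the cells below.

```python
import string
from functools import reduce

# Assumption: a tiny illustrative stopword list, not NLTK's English list
TINY_STOPWORDS = {"the", "is", "and", "a", "of"}


def lowercase(text):
    return text.lower()


def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def drop_stopwords(text):
    return " ".join(word for word in text.split() if word not in TINY_STOPWORDS)


# The order of this list is the order in which the steps run
PIPELINE = [lowercase, strip_punctuation, drop_stopwords]


def apply_pipeline(text, steps=PIPELINE):
    """Feed the text through each step in turn."""
    return reduce(lambda acc, step: step(acc), steps, text)


cleaned = apply_pipeline("The Turbine is developing, and the Grid is ready.")
print(cleaned)  # turbine developing grid ready
```

With pandas, a function like this could be passed to a single `.map(apply_pipeline)` call on a text column instead of calling `.map` once per step.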

In [None]:
PUNCTUATIONS = string.punctuation
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()
nltk.download('wordnet')
spell = SpellChecker()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
def remove_emails(text):
    return re.sub(r'\S+@\S+', '', text) if isinstance(text, str) else text

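def remove_dates(text):
    if not isinstance(text, str):
        return text
    # Numeric dates such as 12-05-2022 or 1/2/22
    text = re.sub(r'\d{1,2}(st|nd|rd|th)?[-./]\d{1,2}[-./]\d{2,4}', '', text)
    # Dates with a month name. The lookarounds keep month abbreviations from matching
    # inside longer words (e.g. "mar" in "market"); the standalone word "may" is still removed.
    pattern = re.compile(r'(\d{1,2})?(st|nd|rd|th)?[-./,]?\s?(of)?\s?(?<![A-Za-z])([Jj]an(uary)?|[Ff]eb(ruary)?|[Mm]ar(ch)?|[Aa]pr(il)?|[Mm]ay|[Jj]un(e)?|[Jj]ul(y)?|[Aa]ug(ust)?|[Ss]ep(tember)?|[Oo]ct(ober)?|[Nn]ov(ember)?|[Dd]ec(ember)?)(?![A-Za-z])\s?(\d{1,2})?(st|nd|rd|th)?\s?[-./,]?\s?(\d{2,4})?')
    return pattern.sub('', text)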

def remove_html(text):
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    clean_text = BeautifulSoup(text, "html.parser").get_text()
    return clean_text

def remove_tags_mentions(text):
    pattern = re.compile(r'(@\S+|#\S+)')
    return pattern.sub('', text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCTUATIONS))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in STOPWORDS])

def remove_whitespaces(text):
    return " ".join(text.split())

def freq_words(text):
    tokens = word_tokenize(text)
    counter = Counter(tokens)
    FrequentWords = [word for word, _ in counter.most_common(10)]
    return FrequentWords

def remove_fw(text):
    FrequentWords = freq_words(text)
    tokens = word_tokenize(text)
    without_fw = [word for word in tokens if word not in FrequentWords]
    return ' '.join(without_fw)

def rare_words(text):
    tokens = word_tokenize(text)
    counter = Counter(tokens)
    RareWords = []
    number_rare_words = 10
    for word, word_count in counter.most_common()[-number_rare_words:]:
        RareWords.append(word)
    return RareWords

def remove_rw(text):
    RareWords = rare_words(text)
    tokens = word_tokenize(text)
    without_rw = [word for word in tokens if word not in RareWords]
    return ' '.join(without_rw)

def nums_to_words(text):
    new_text = []
    for word in text.split():
        # Check if the word has a number followed by a non-digit (e.g., 45x, 122gw, 122%, etc.)
        match = re.match(r"(-?\d+)([a-zA-Z%]+)?", word)

        if match:
            num_part = match.group(1)  # The number part
            suffix = match.group(2)    # The suffix (if any)

            try:
                num_in_words = num2words(num_part)
                if suffix:
                    # Ensure the number and suffix are correctly formatted
                    new_text.append(f"{num_in_words} {suffix}")
                else:
                    new_text.append(num_in_words)
            except Exception as e:
                new_text.append(word)  # In case of an error, keep the original word
        else:
            new_text.append(word)

    return " ".join(new_text)

def stem_words(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

def accented_to_ascii(text):
    return unidecode.unidecode(text)

### df_media

In [None]:
data_folder = Path("/content/drive/MyDrive/computational-language-technology-project")
media_preprocessed_path = data_folder / "cleaned_data/media_dataset_pre-cleaned.csv"

In [None]:
df_media_processed = pd.read_csv(media_preprocessed_path, header = 0)

In [None]:
# Lowercasing; fillna('') guards against empty contents that were read back from the CSV as NaN
df_media_processed['content'] = df_media_processed['content'].fillna('').str.lower()

# Call all removals
df_media_processed['content'] = df_media_processed['content'].map(remove_emails)
df_media_processed['content'] = df_media_processed['content'].map(remove_dates)
df_media_processed['content'] = df_media_processed['content'].map(remove_html)
df_media_processed['content'] = df_media_processed['content'].map(remove_tags_mentions)
df_media_processed['content'] = df_media_processed['content'].map(remove_punctuation)
df_media_processed['content'] = df_media_processed['content'].map(remove_stopwords)
df_media_processed['content'] = df_media_processed['content'].map(remove_whitespaces)
df_media_processed['content'] = df_media_processed['content'].map(remove_fw)
df_media_processed['content'] = df_media_processed['content'].map(remove_rw)
df_media_processed['content'] = df_media_processed['content'].map(nums_to_words)
df_media_processed['content'] = df_media_processed['content'].map(stem_words)
df_media_processed['content'] = df_media_processed['content'].map(accented_to_ascii)

df_media_processed

Unnamed: 0,id,title,date,content,domain,url
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,chines startup shown dramat auto product rampup histori good news produc one hundr st ev mere se...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-delivered-100000-vehicles-in-2021/
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,laid plan build largest product facil world interest fossilsourc continu apac skeptic abound fie...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-green-hydrogen-drop-in-the-bucket-but-it-could-still-...
2,98159,World’ s largest floating PV plant goes online in China – pv magazine International,2022-01-03,intern switch array china ' shandong deploy two phase reservoir near two hundr and sixty-f gw de...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-largest-floating-pv-plant-goes-online-in-china/
3,98158,Iran wants to deploy 10 GW of renewables over the next four years – pv magazine International,2022-01-03,iranian author current eighti gw project submit privat investor review four hundr and fourteen m...,pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wants-to-deploy-10-gw-of-renewables-over-the-next-fo...
4,31128,Eastern Interconnection Power Grid Said ‘ Being Challenged in New Ways’,2022-01-03,daili gpi infrastructur ngi news access electr planner major power serv twothird unit state cana...,naturalgasintel,https://www.naturalgasintel.com/eastern-interconnection-power-grid-said-being-challenged-in-new-...
...,...,...,...,...,...,...
20106,104263,US Treasury finalises 45X Advanced Manufacturing tax credit rules,2024-10-24,depart treasuri dot finalis rule forty-f x advanc product credit part ongo effort encourag great...,pv-tech,https://www.pv-tech.org/us-treasury-finalises-45x-advanced-manufacturing-tax-credit-rules/
20107,104264,EDP trials robotic construction on Spanish PV project,2024-10-24,develop pilot robot solut one hundr and twenty-two mw power plant valladolid spain hope signific...,pv-tech,https://www.pv-tech.org/edp-trials-robotic-construction-spanish-pv-project/
20108,101434,Australia has 7.8 GW of utility-scale batteries under construction – pv magazine International,2024-10-24,volum largescal project construct pass solar wind project combin two thousand and twenty-thre tr...,pv-magazine,https://www.pv-magazine.com/2024/10/24/australia-has-7-8-gw-of-utility-scale-batteries-under-con...
20109,101428,Residential PV prices in Germany drop 25% within 12 months – pv magazine International,2024-10-24,comparison site selfmad energi show addit fall obtain significantli one two year ago manag direc...,pv-magazine,https://www.pv-magazine.com/2024/10/24/residential-pv-prices-in-germany-drop-25-within-12-months...


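Next, we run a spell checker over the preprocessed text. Only the first 1,001 rows (`.loc[:1000]`) are corrected. Because the checker sees stemmed tokens, it also rewrites valid stems into unrelated dictionary words, as the output below shows (e.g., "dramat" → "drama", "rampup" → "campus").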

In [None]:
def correct_spelling(text):
    if pd.isna(text):  # Leave missing values (NaN/None) untouched
        return text

    words = text.split()
    misspelled_words = spell.unknown(words)

    correct_text = []
    for word in words:
        if word in misspelled_words:
            corrected_word = spell.correction(word)
            correct_text.append(corrected_word if corrected_word else word)  # correction() may return None
        else:
            correct_text.append(word)

    return " ".join(correct_text)

# Register the progress bar once, outside the per-row function
tqdm.pandas(desc="Checking spelling")

# .loc slicing is label-based and inclusive, so this corrects rows 0..1000 (1,001 rows)
df_media_processed.loc[:1000, 'content'] = df_media_processed.loc[:1000, 'content'].progress_apply(correct_spelling)


In [None]:
df_media_processed

Unnamed: 0,id,title,date,content,domain,url
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,chines startup shown drama auto product campus history good news produce one under st e mere sev...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-delivered-100000-vehicles-in-2021/
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,laid plan build largest product fail world interest fossilsourc continue apace skeptic abound fi...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-green-hydrogen-drop-in-the-bucket-but-it-could-still-...
2,98159,World’ s largest floating PV plant goes online in China – pv magazine International,2022-01-03,intern switch array china ' shandong deploy two phase reservoir near two hundr and sixty-f gw de...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-largest-floating-pv-plant-goes-online-in-china/
3,98158,Iran wants to deploy 10 GW of renewables over the next four years – pv magazine International,2022-01-03,iranian author current eighti gw project submit privat investor review four hundr and fourteen m...,pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wants-to-deploy-10-gw-of-renewables-over-the-next-fo...
4,31128,Eastern Interconnection Power Grid Said ‘ Being Challenged in New Ways’,2022-01-03,daili gpi infrastructur ngi news access electr planner major power serv twothird unit state cana...,naturalgasintel,https://www.naturalgasintel.com/eastern-interconnection-power-grid-said-being-challenged-in-new-...
...,...,...,...,...,...,...
20106,104263,US Treasury finalises 45X Advanced Manufacturing tax credit rules,2024-10-24,depart treasuri dot finalis rule forty-f x advanc product credit part ongo effort encourag great...,pv-tech,https://www.pv-tech.org/us-treasury-finalises-45x-advanced-manufacturing-tax-credit-rules/
20107,104264,EDP trials robotic construction on Spanish PV project,2024-10-24,develop pilot robot solut one hundr and twenty-two mw power plant valladolid spain hope signific...,pv-tech,https://www.pv-tech.org/edp-trials-robotic-construction-spanish-pv-project/
20108,101434,Australia has 7.8 GW of utility-scale batteries under construction – pv magazine International,2024-10-24,volum largescal project construct pass solar wind project combin two thousand and twenty-thre tr...,pv-magazine,https://www.pv-magazine.com/2024/10/24/australia-has-7-8-gw-of-utility-scale-batteries-under-con...
20109,101428,Residential PV prices in Germany drop 25% within 12 months – pv magazine International,2024-10-24,comparison site selfmad energi show addit fall obtain significantli one two year ago manag direc...,pv-magazine,https://www.pv-magazine.com/2024/10/24/residential-pv-prices-in-germany-drop-25-within-12-months...


To save time in later steps, we store the cleaned DataFrame so that it can be loaded directly:

In [None]:
save_path = "/content/drive/MyDrive/computational-language-technology-project/cleaned_data/media_dataset_cleaned.csv"
df_media_processed.to_csv(save_path, index=False)
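
A later step can then reload the cleaned file instead of repeating the preprocessing. The snippet below is a minimal sketch of that round trip. It uses a small stand-in dataframe and a temporary directory rather than the Drive path above; the names `demo_df` and `demo_path` and the column names `title` and `date` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned dataframe (hypothetical rows, not real data)
demo_df = pd.DataFrame({
    "title": ["example article a", "example article b"],
    "date": ["2022-01-03", "2024-10-24"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_cleaned.csv"

    # index=False keeps the row index out of the file, so no
    # "Unnamed: 0" column appears when the file is read back
    demo_df.to_csv(demo_path, index=False)

    # parse_dates converts the date column back to datetime64 on load
    reloaded = pd.read_csv(demo_path, parse_dates=["date"])

print(reloaded.dtypes)
```

Writing with `index=False` matters here: a CSV saved with its index comes back with an extra `Unnamed: 0` column unless `index_col=0` is passed to `pd.read_csv`.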

### df_google_patents

In [None]:
data_folder = Path("/content/drive/MyDrive/computational-language-technology-project")
google_patents_preprocessed_path = data_folder / "cleaned_data/google_patent_pre-cleaned.csv"

In [None]:
df_google_patents_processed = pd.read_csv(google_patents_preprocessed_path, header = 0)

In [None]:
# Lowercase the abstracts (the .str accessor leaves missing values untouched)
df_google_patents_processed["abstract"] = df_google_patents_processed["abstract"].str.lower()

# Apply the cleaning functions defined above, in this order
cleaning_steps = [
    remove_emails,
    remove_dates,
    remove_html,
    remove_tags_mentions,
    remove_punctuation,
    remove_stopwords,
    remove_whitespaces,
    remove_fw,
    remove_rw,
    nums_to_words,
    stem_words,
    accented_to_ascii,
]
for step in cleaning_steps:
    df_google_patents_processed["abstract"] = df_google_patents_processed["abstract"].map(step)

df_google_patents_processed

Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc_code,is_english
0,CN-117151396-A,CN-202311109834-A,CN,Distributed economic scheduling method for win...,disclos method solar methan firstli solar prov...,20231201,"['HU PENGFEI', 'LI ZIMENG']",G06Q50/06,True
1,CN-117147382-A,CN-202310985511-A,CN,Device for monitoring hydrogen atom crossing g...,invent provid atom use skpfm relat technic fie...,20231201,"['MA ZHAOXIANG', 'WANG CHENGXU', 'LIU ZHONGLI']",G01N13/00,True
2,CN-113344288-B,CN-202110717505-A,CN,Cascade hydropower station group water level p...,disclos devic comput readabl storag medium tec...,20231201,[],G06Q10/04,True
3,CN-117153944-A,CN-202311209193-A,CN,"Heterojunction solar cell, preparation method ...",applic provid heterion solar cell prepar metho...,20231201,"['TONG HONGBO', 'JIN YUPENG']",H01L31/074,True
4,CN-116911695-B,CN-202311167289-A,CN,Flexible resource adequacy evaluation method a...,invent relat method devic electr system belong...,20231201,[],H02J2203/20,True
...,...,...,...,...,...,...,...,...,...
28822,CN-215416083-U,CN-202121495971-U,CN,Combined linear Fresnel light condensing device,util model disclos linear fresnel focus devic ...,20220104,['Qin Taohua'],Y02B10/20,True
28823,CN-215412583-U,CN-202121129938-U,CN,Solar air heat collection control equipment fo...,util model disclos air heat collect control co...,20220104,['THE INVENTOR HAS WAIVED THE RIGHT TO BE MENT...,Y02E10/40,True
28824,CN-215420159-U,CN-202120890502-U,CN,But angle regulation&#39;s photovoltaic solar ...,util model disclos photovolta adjust relat fie...,20220104,"['YU RONGSHENG', 'ZHANG YONGSHENG', 'CAI QUN',...",Y02E10/50,True
28825,CN-215412573-U,CN-202120748049-U,CN,Commercial solar energy and air can integratio...,util model disclos commerci integr heat compri...,20220104,"['ZHANG LIANGLIANG', 'XU MENG', 'MIAO XINGCHON...",Y02E10/40,True


In [None]:
# Register tqdm's pandas integration once, so that progress_map is available
tqdm.pandas(desc="Checking spelling")

def correct_spelling(text):
    """Replace words unknown to the spell checker with its best suggestion."""
    if pd.isna(text):  # Leave missing values (NaN/None) untouched
        return text

    words = text.split()
    # `spell` is assumed to be the SpellChecker instance defined earlier in the notebook
    misspelled_words = spell.unknown(words)

    corrected_text = []
    for word in words:
        if word in misspelled_words:
            corrected_word = spell.correction(word)
            corrected_text.append(corrected_word if corrected_word else word)  # correction() may return None
        else:
            corrected_text.append(word)

    return " ".join(corrected_text)

# Only rows 101 to 5000 are corrected here (.loc slicing is label-based and inclusive)
df_google_patents_processed.loc[101:5000, 'abstract'] = df_google_patents_processed.loc[101:5000, 'abstract'].progress_map(correct_spelling)

In [None]:
save_path = "/content/drive/MyDrive/computational-language-technology-project/cleaned_data/google_patent_cleaned.csv"
df_google_patents_processed.to_csv(save_path, index=False)

## Exploratory Data Analysis
EDA helps us **understand data patterns and distributions** before applying complex NLP models. We perform:  

- **Temporal Analysis**: Examine **publication trends** over time to detect emerging Cleantech topics.  
- **Named Entity Recognition (NER)**: Identify key **companies, organizations, and technologies** frequently mentioned in the datasets.  
- **Word Frequency Analysis**: Find the most common words and phrases across media and patents.  
- **Visualization**:  
  - **Word Clouds** to showcase frequently occurring terms  
  - **Bar Charts** to compare key industry players and technology mentions  
  - **Network Graphs** to analyze relationships between companies and technologies  
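
As a rough illustration of the temporal and word-frequency steps listed above, the following sketch counts articles per month and the most frequent tokens. It runs on a hypothetical three-row dataframe; the name `demo_media` and the column names `date` and `content` are assumptions, and the actual analysis lives in the separate EDA notebooks.

```python
from collections import Counter

import pandas as pd

# Hypothetical mini-corpus; column names "date" and "content" are assumptions
demo_media = pd.DataFrame({
    "date": ["2022-01-03", "2022-01-15", "2022-02-01"],
    "content": ["solar array china solar", "wind power grid", "solar grid storag"],
})
demo_media["date"] = pd.to_datetime(demo_media["date"])

# Temporal analysis: number of articles per calendar month
articles_per_month = demo_media.groupby(demo_media["date"].dt.to_period("M")).size()

# Word frequency analysis: count whitespace-separated tokens across all texts
word_counts = Counter(
    token for text in demo_media["content"].dropna() for token in text.split()
)

print(articles_per_month)
print(word_counts.most_common(3))
```

Because the cleaned text is already lowercased, stemmed, and stripped of stopwords, a plain whitespace split is enough for counting in this sketch.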

For this part, see `2.2_exploratory_data_and_text_analysis`, file `2.2.1_exploratory_text_analysis_I`.

## Topic Modeling
To **identify hidden themes and emerging trends**, we apply topic modeling techniques to both datasets:  

- **Latent Dirichlet Allocation (LDA)** and **Non-Negative Matrix Factorization (NMF)** to uncover broad thematic structures.  
- **Top2Vec** and **BERTopic** for **more dynamic and context-aware topic modeling**.  
- **Comparing Media vs. Patents**:  
  - Which Cleantech topics are **gaining media attention** but **not patented** yet?  
  - Are **patents aligned with market trends**, or do they focus on different areas?  
  - **What are the innovation gaps** between research and real-world applications?  

By the end of this step, we will have a **structured view of the Cleantech landscape**, highlighting **key trends, players, and technological opportunities**.


For this part, see `2.2_exploratory_data_and_text_analysis`, file `2.2.2_exploratory_text_analysis_II`.
