# Stage 2: Advanced Embedding Models Training and Analysis
This notebook explores advanced embedding models to analyze and compare the content of the Cleantech Media and Google Patent datasets. The goal is to develop meaningful vector representations of the text data using word embeddings, sentence embeddings, and transfer learning techniques.

- Deadline 2 (Stage 2): 6 April 2025 23:59

## Data Preparation for Embeddings
At this stage, we need to ensure that our dataset is properly cleaned and preprocessed to generate high-quality embeddings.


In [76]:
#Packages
import pandas as pd
import re
from bs4 import BeautifulSoup
import unidecode
import spacy
from spacy.lang.en import English
from sklearn.model_selection import train_test_split
import string
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
PUNCTUATIONS = string.punctuation
stemmer = PorterStemmer()

In [None]:
!pip install gensim
!pip install spacy
!pip install unidecode

In [77]:
#load the pre-cleaned dataset
media_preprocessed_path = "../cleaned_data/media_dataset_pre-cleaned.csv"
patent_preprocessed_path = "../cleaned_data/google_patent_pre-cleaned.csv"

df_media_processed = pd.read_csv(media_preprocessed_path, header = 0)
df_patent_processed = pd.read_csv(patent_preprocessed_path, header = 0)

First, all function for the preprocessing steps are loaded:

In [78]:
def remove_emails(text):
    return re.sub(r'\S+@\S+', '', text) if isinstance(text, str) else text

def remove_dates(text):
    text = re.sub(r'\d{1,2}(st|nd|rd|th)?[-./]\d{1,2}[-./]\d{2,4}', '', text)
    pattern = re.compile(r'(\d{1,2})?(st|nd|rd|th)?[-./,]?\s?(of)?\s?([J|j]an(uary)?|[F|f]eb(ruary)?|[Mm]ar(ch)?|[Aa]pr(il)?|[Mm]ay|[Jj]un(e)?|[Jj]ul(y)?|[Aa]ug(ust)?|[Ss]ep(tember)?|[Oo]ct(ober)?|[Nn]ov(ember)?|[Dd]ec(ember)?)\s?(\d{1,2})?(st|nd|rd|th)?\s?[-./,]?\s?(\d{2,4})?')
    text = pattern.sub(r'', text)
    return text if isinstance(text, str) else text

def remove_html(text):
    clean_text = BeautifulSoup(text).get_text()
    return clean_text

def remove_tags_mentions(text):
    pattern = re.compile(r'(@\S+|#\S+)')
    return pattern.sub('', text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCTUATIONS))


def remove_whitespaces(text):
    return " ".join(text.split())

#probably not used
# def stem_words(text):
#     return ' '.join([stemmer.stem(word) for word in text.split()])

def accented_to_ascii(text):
    return unidecode.unidecode(text)

# Disable all the annotators except the tokenizer so its fast
nlp = English(disable=['tagger', 'parser', 'ner'])

def tokenize(text):
  return [t.text.lower() for t in nlp(text)]

Now we apply the function to both datasets but this time we are not removing the stopwords because we will use them later.

In [79]:
# lower casing
df_patent_processed['abstract'] = df_patent_processed['abstract'].apply(lambda x: x.lower())

# Call all removals
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(remove_emails)
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(remove_dates)
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(remove_html)
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(remove_tags_mentions)
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(remove_punctuation)
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(remove_whitespaces)
#df_patent_processed['abstract'] = df_patent_processed['abstract'].map(stem_words)
df_patent_processed['abstract'] = df_patent_processed['abstract'].map(accented_to_ascii)
df_patent_processed['tokens'] = df_patent_processed['abstract'].apply(tokenize)

df_patent_processed

Unnamed: 0,publication_number,application_number,country_code,title,abstract,publication_date,inventor,cpc_code,is_english,tokens
0,CN-117151396-A,CN-202311109834-A,CN,Distributed economic scheduling method for win...,the invention discloses a distributed economic...,20231201,"['HU PENGFEI', 'LI ZIMENG']",G06Q50/06,True,"[the, invention, discloses, a, distributed, ec..."
1,CN-117147382-A,CN-202310985511-A,CN,Device for monitoring hydrogen atom crossing g...,the invention provides a device and a method f...,20231201,"['MA ZHAOXIANG', 'WANG CHENGXU', 'LIU ZHONGLI']",G01N13/00,True,"[the, invention, provides, a, device, and, a, ..."
2,CN-113344288-B,CN-202110717505-A,CN,Cascade hydropower station group water level p...,the invention discloses a cascade hydropower s...,20231201,[],G06Q10/04,True,"[the, invention, discloses, a, cascade, hydrop..."
3,CN-117153944-A,CN-202311209193-A,CN,"Heterojunction solar cell, preparation method ...",the application provides a heteroction solar c...,20231201,"['TONG HONGBO', 'JIN YUPENG']",H01L31/074,True,"[the, application, provides, a, heteroction, s..."
4,CN-116911695-B,CN-202311167289-A,CN,Flexible resource adequacy evaluation method a...,the invention relates to a flexible resource a...,20231201,[],H02J2203/20,True,"[the, invention, relates, to, a, flexible, res..."
...,...,...,...,...,...,...,...,...,...,...
28822,CN-215416083-U,CN-202121495971-U,CN,Combined linear Fresnel light condensing device,the utility model discloses a combined linear ...,20220104,['Qin Taohua'],Y02B10/20,True,"[the, utility, model, discloses, a, combined, ..."
28823,CN-215412583-U,CN-202121129938-U,CN,Solar air heat collection control equipment fo...,the utility model discloses a solar air heat c...,20220104,['THE INVENTOR HAS WAIVED THE RIGHT TO BE MENT...,Y02E10/40,True,"[the, utility, model, discloses, a, solar, air..."
28824,CN-215420159-U,CN-202120890502-U,CN,But angle regulation&#39;s photovoltaic solar ...,the utility model discloses a photovoltaic sol...,20220104,"['YU RONGSHENG', 'ZHANG YONGSHENG', 'CAI QUN',...",Y02E10/50,True,"[the, utility, model, discloses, a, photovolta..."
28825,CN-215412573-U,CN-202120748049-U,CN,Commercial solar energy and air can integratio...,the utility model discloses a commercial solar...,20220104,"['ZHANG LIANGLIANG', 'XU MENG', 'MIAO XINGCHON...",Y02E10/40,True,"[the, utility, model, discloses, a, commercial..."


In [73]:
# lower casing
df_media_processed['content'] = df_media_processed['content'].apply(lambda x: x.lower())

# Call all removals
df_media_processed['content'] = df_media_processed['content'].map(remove_emails)
df_media_processed['content'] = df_media_processed['content'].map(remove_dates)
df_media_processed['content'] = df_media_processed['content'].map(remove_html)
df_media_processed['content'] = df_media_processed['content'].map(remove_tags_mentions)
df_media_processed['content'] = df_media_processed['content'].map(remove_punctuation)
df_media_processed['content'] = df_media_processed['content'].map(remove_whitespaces)
#df_media_processed['content'] = df_media_processed['content'].map(stem_words)
df_media_processed['content'] = df_media_processed['content'].map(accented_to_ascii)
df_media_processed['tokens'] = df_media_processed['content'].apply(tokenize)


df_media_processed

Unnamed: 0,id,title,date,content,domain,url,tokens
0,93320,"XPeng Delivered ~100,000 Vehicles In 2021",2022-01-02,chinese automotive startup xpeng has shown one...,cleantechnica,https://cleantechnica.com/2022/01/02/xpeng-del...,"[chinese, automotive, startup, xpeng, has, sho..."
1,93321,Green Hydrogen: Drop In Bucket Or Big Splash?,2022-01-02,sinopec has laid plans to build the largest gr...,cleantechnica,https://cleantechnica.com/2022/01/02/its-a-gre...,"[sinopec, has, laid, plans, to, build, the, la..."
2,98159,World’ s largest floating PV plant goes online...,2022-01-03,huaneng power international has switched on a ...,pv-magazine,https://www.pv-magazine.com/2022/01/03/worlds-...,"[huaneng, power, international, has, switched,..."
3,98158,Iran wants to deploy 10 GW of renewables over ...,2022-01-03,according to the iranian authorities there are...,pv-magazine,https://www.pv-magazine.com/2022/01/03/iran-wa...,"[according, to, the, iranian, authorities, the..."
4,31128,Eastern Interconnection Power Grid Said ‘ Bein...,2022-01-03,daily gpi infrastructure ngi all news access e...,naturalgasintel,https://www.naturalgasintel.com/eastern-interc...,"[daily, gpi, infrastructure, ngi, all, news, a..."
...,...,...,...,...,...,...,...
20106,104263,US Treasury finalises 45X Advanced Manufacturi...,2024-10-24,the us department of the treasury dot has fina...,pv-tech,https://www.pv-tech.org/us-treasury-finalises-...,"[the, us, department, of, the, treasury, dot, ..."
20107,104264,EDP trials robotic construction on Spanish PV ...,2024-10-24,developer edp is piloting a robotic constructi...,pv-tech,https://www.pv-tech.org/edp-trials-robotic-con...,"[developer, edp, is, piloting, a, robotic, con..."
20108,101434,Australia has 7.8 GW of utility-scale batterie...,2024-10-24,the volume of largescale battery energy storag...,pv-magazine,https://www.pv-magazine.com/2024/10/24/austral...,"[the, volume, of, largescale, battery, energy,..."
20109,101428,Residential PV prices in Germany drop 25% with...,2024-10-24,the comparison site selfmade energy shows in a...,pv-magazine,https://www.pv-magazine.com/2024/10/24/residen...,"[the, comparison, site, selfmade, energy, show..."


In [80]:
df_media = df_media_processed.copy()
df_patent= df_patent_processed.copy()

As a last step of our cleaning we will now split the data into 20 % test and 80 % train.

In [74]:
# Split the patent dataset (80% train, 20% test)
train_patent, test_patent = train_test_split(df_patent_processed, test_size=0.2, random_state=42)
# Print the size of each split\n"
print(f"Training set: {len(train_patent)} rows")
print(f"Testing set: {len(test_patent)} rows")

Training set: 23061 rows
Testing set: 5766 rows


In [75]:
# Split the media dataset (80% train, 20% test)
train_media, test_media = train_test_split(df_media_processed, test_size=0.2, random_state=42)
# Print the size of each split\n"
print(f"Training set: {len(train_media)} rows")
print(f"Testing set: {len(test_media)} rows")

Training set: 16088 rows
Testing set: 4023 rows


## Word Embedding Training
- Train separate word embedding models on each dataset using techniques such as Word2Vec, FastText, or GloVe.
- Experiment with hyperparameters such as vector dimensions, context window size, and training epochs to optimize word embeddings evaluated using intrinsic methods such as word similarity tasks, analogy tasks and clustering and visualization.
- Use the trained embeddings to explore thematic overlaps and differences between the two datasets and identify unique insights and innovation gaps.

Baseline Model

with the tf- idf in stage 1 we already took into account the non-semantic

we use word2vec

Remember inahbility to handle unknwon or OOV words 

We stick to english, as we could not use the embedding for a new matrice

Word order is respected

is a sematntic & static, so does not take into account the order


In [None]:
# import warnings
# warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib
import matplotlib.pyplot as plt
from pathlib import Path

# Default Style Settings
matplotlib.rcParams['figure.dpi'] = 300
pd.options.display.max_colwidth = 200
#%matplotlib inline

## Sentence Embedding Training
- Train separate sentence embedding models on each dataset using methods such as averaging word vectors, Doc2Vec, or BERT embeddings.
- Experiment with hyperparameters such as vector dimensions, context window size, learning rate, batch size and training epochs to optimize sentence embeddings evaluated using intrinsic methods such as sentence similarity tasks and clustering and visualization.
- Use the trained embeddings to explore thematic overlaps and differences between the two datasets and identify unique insights and innovation gaps.

## Transfer Learning with Advanced Open-Source Models
- Implement transfer learning by fine-tuning pre-trained open-source models such as RoBERTa, XLNet, Longformer, FLAN-T5, and BART on the text data. Evaluate the model performance using intrinsic measures (e.g., word similarity, clustering quality) before and after fine-tuning. Analyze and quantify the insights gained from the fine-tuned model regarding emerging trends and innovation gaps in cleantech.
- Compare the performance of transfer learning with the in-house embeddings. This comparison could be done through evaluating the effectiveness of the embeddings in domain-specific tasks like topic classification.