# Stage 2: Advanced Embedding Models Training and Analysis
This notebook explores advanced embedding models to analyze and compare the content of the Cleantech Media and Google Patent datasets. The goal is to develop meaningful vector representations of the text data using word embeddings, sentence embeddings, and transfer learning techniques.

- Deadline 2 (Stage 2): 6 April 2025 23:59

## Data Preparation for Embeddings
At this stage, we need to ensure that our dataset is properly cleaned and preprocessed to generate high-quality embeddings.

Because both datasets were already normalized, tokenized, and stripped of stopwords and special characters in the previous stage, we do not repeat these steps. Instead, we load the preprocessed data, convert the `content` column (media) and the `abstract` column (patents) into lists of token lists, which is the input format Word2Vec expects, and split both datasets into training and test sets.

In [12]:
# imports
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import word_tokenize
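# Tokenizer data for word_tokenize: older NLTK releases load 'punkt',
# newer ones look for 'punkt_tab', so both are requested here.
nltk.download('punkt')
nltk.download('punkt_tab')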

In [25]:
media_modeling = pd.read_csv("../cleaned_data/media_dataset_cleaned_entity.csv")  
patent_modeling = pd.read_csv("../cleaned_data/google_patent_cleaned_entity.csv") 

We now convert the text data into lists of tokens, one token list per document.

In [26]:
## first for the media dataset

# Tokenize the content column
media_modeling['tokenized_content'] = media_modeling['content'].apply(lambda x: word_tokenize(str(x).lower()))

# Convert the tokenized content into a list of token lists (one per article)
sentences_media = media_modeling['tokenized_content'].tolist()
print(sentences_media[:3])

[['chines', 'startup', 'shown', 'drama', 'auto', 'product', 'campus', 'history', 'good', 'news', 'product', 'one', 'under', 'st', 'eve', 'mere', 'seven', 'year', 'age', 'year', 'launch', 'first', 'vehicle', 'go', 'went', 'wrap', 'total', 'almost', 'one', 'under', 'thousand', 'two', 'thousand', 'and', 'twenty-on', 'deli', 'ninety-eight', 'thousands', 'one', 'under', 'and', 'fifty', 'st', 'eve', 'two', 'under', 'and', 'sixty-thre', 'yearoveryear', 'in000', 'deli', 'one', 'under', 'and', 'eighty-on', 'yearoveryear', 'compar', 'forty-on', 'thousands', 'seven', 'under', 'and', 'fifty-on', 'deli', 'of', 'two', 'thousand', 'and', 'twenty-on', 'two', 'under', 'and', 'twenty-two', 'yearoveryear', 'reinforce', 'impress', 'ninety-eight', 'thousands', 'one', 'under', 'and', 'fifty', 'delivery', 'figure', 'two', 'thousand', 'and', 'twenty-on', 'end', 'reach', 'one', 'under', 'and', 'thirty-seven', 'thousands', 'nine', 'under', 'and', 'fifty-thre', 'cumuli', 'delivery', 'impress', 'monthly', 'delive

In [27]:
## now for the patent dataset

# Tokenize the abstract column
patent_modeling['tokenized_abstract'] = patent_modeling['abstract'].apply(lambda x: word_tokenize(str(x).lower()))

# Convert the tokenized abstracts into a list of token lists (one per patent)
sentences_patent = patent_modeling['tokenized_abstract'].tolist()

print(sentences_patent[:3])

[['disclose', 'method', 'solar', 'ethan', 'firstly', 'solar', 'provide', 'apply', 'two', 'control', 'renew', 'source', 'biomass', 'absorb', 'light', 'renew', 'source', 'promote', 'problem', 'abandon', 'light', 'abandon', 'avoid', 'far', 'possible', 'meanwhile', 'couple', 'electro', 'heat', 'a', 'consider', 'realize', 'complement', 'mutual', 'aid', 'until', 'office', 'improve', 'carbon', 'miss', 'reduce', 'base', 'multi', 'consist', 'theory', 'provide', 'method', 'consist', 'algorithm', 'adopt', 'consist', 'algorithm', 'adopt'], ['invent', 'provide', 'atom', 'use', 'spam', 'relay', 'technic', 'field', 'energy', 'until', 'atom', 'use', 'spam', 'comprise', 'target', 'research', 'cantilever', 'beam', 'electrolyte', 'cell', 'platinum', 'wire', 'direct', 'current', 'power', 'supply', 'accord', 'columnar', 'mater', 'direct', 'solidify', 'certain', 'determine', 'sample', 'cantilever', 'beam', 'common', 'electrolyte', 'cell', 'prepare', 'atom', 'diffs', 'gradual', 'cross', 'electrolyte', 'proce

As the last preparation step, we split each dataset into 80% training data and 20% test data.

In [29]:
# Split the media dataset (80% train, 20% test)
train_media, test_media = train_test_split(media_modeling, test_size=0.2, random_state=42)

# Print the size of each split
print(f"Training set: {len(train_media)} rows")
print(f"Testing set: {len(test_media)} rows")

Training set: 16088 rows
Testing set: 4023 rows


In [31]:
# Split the patent dataset (80% train, 20% test)
train_patent, test_patent = train_test_split(patent_modeling, test_size=0.2, random_state=42)

# Print the size of each split
print(f"Training set: {len(train_patent)} rows")
print(f"Testing set: {len(test_patent)} rows")

Training set: 23061 rows
Testing set: 5766 rows


## Word Embedding Training
- Train separate word embedding models on each dataset using techniques such as Word2Vec, FastText, or GloVe.
- Experiment with hyperparameters such as vector dimensions, context window size, and training epochs, and evaluate the resulting word embeddings with intrinsic methods: word similarity tasks, analogy tasks, and clustering with visualization.
- Use the trained embeddings to explore thematic overlaps and differences between the two datasets and identify unique insights and innovation gaps.
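
Because the two models are trained separately, their vector spaces are not aligned, so raw vectors from the media model cannot be compared directly with vectors from the patent model. One possible way to explore overlaps and differences is to compare the nearest neighbours of a shared word in each space. The sketch below illustrates this idea with plain Python; the helper names (`cosine`, `nearest_neighbours`, `neighbour_overlap`) and the tiny toy vectors are assumptions made for illustration and are not taken from the trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbours(word, vectors, topn=2):
    """Return the `topn` words closest to `word` by cosine similarity."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]

def neighbour_overlap(word, vectors_a, vectors_b, topn=2):
    """Jaccard overlap of the neighbour sets of `word` in two embedding spaces."""
    a = set(nearest_neighbours(word, vectors_a, topn))
    b = set(nearest_neighbours(word, vectors_b, topn))
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical toy vectors standing in for two separately trained models.
media_vectors = {
    "solar": [0.9, 0.1, 0.0],
    "panel": [0.8, 0.2, 0.1],
    "subsidy": [0.7, 0.3, 0.0],
    "battery": [0.1, 0.9, 0.2],
}
patent_vectors = {
    "solar": [0.9, 0.0, 0.1],
    "panel": [0.8, 0.1, 0.2],
    "subsidy": [0.0, 0.2, 0.9],
    "battery": [0.7, 0.1, 0.3],
}

media_nn = nearest_neighbours("solar", media_vectors)
patent_nn = nearest_neighbours("solar", patent_vectors)
overlap = neighbour_overlap("solar", media_vectors, patent_vectors)
print(media_nn, patent_nn, round(overlap, 3))
```

A low overlap for a shared word hints that the two corpora use it in different contexts, which may be worth inspecting manually. With real models, the dictionaries would be replaced by lookups into the trained word vectors.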

## Sentence Embedding Training
- Train separate sentence embedding models on each dataset using methods such as averaging word vectors, Doc2Vec, or BERT embeddings.
- Experiment with hyperparameters such as vector dimensions, context window size, learning rate, batch size, and training epochs, and evaluate the resulting sentence embeddings with intrinsic methods: sentence similarity tasks and clustering with visualization.
- Use the trained embeddings to explore thematic overlaps and differences between the two datasets and identify unique insights and innovation gaps.
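
Of the methods listed above, averaging word vectors is the simplest: a document embedding is the mean of the vectors of its in-vocabulary tokens. The following is a minimal sketch of that idea in plain Python. The function names and the two-dimensional toy vectors are hypothetical; in the notebook, the lookup table would come from a trained word embedding model.

```python
import math

def average_vector(tokens, word_vectors):
    """Mean of the vectors of all known tokens; None if no token is in the vocabulary."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical word vectors (illustration only).
word_vectors = {
    "solar": [1.0, 0.0],
    "panel": [0.8, 0.2],
    "wind": [0.0, 1.0],
    "turbine": [0.2, 0.8],
}

doc_a = ["solar", "panel", "unknownword"]  # out-of-vocabulary tokens are skipped
doc_b = ["wind", "turbine"]
doc_c = ["solar", "panel"]

vec_a = average_vector(doc_a, word_vectors)
vec_b = average_vector(doc_b, word_vectors)
vec_c = average_vector(doc_c, word_vectors)

print(cosine(vec_a, vec_c), cosine(vec_a, vec_b))
```

Averaging ignores word order and weights every token equally, which is one reason to compare it against Doc2Vec or BERT-based embeddings.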

## Transfer Learning with Advanced Open-Source Models
- Implement transfer learning by fine-tuning pre-trained open-source models such as RoBERTa, XLNet, Longformer, FLAN-T5, and BART on the text data. Evaluate the model performance using intrinsic measures (e.g., word similarity, clustering quality) before and after fine-tuning. Analyze and quantify the insights gained from the fine-tuned model regarding emerging trends and innovation gaps in cleantech.
- Compare the performance of transfer learning with the in-house embeddings, for example by evaluating how well each set of embeddings performs on domain-specific tasks such as topic classification.
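
One way to run such a comparison is to feed each set of document embeddings into the same simple classifier and compare the test accuracy. The sketch below uses a nearest-centroid classifier with cosine similarity, chosen here only because it is easy to follow; it is not prescribed by the task. All names, labels, and numbers are hypothetical toy values, not results from the datasets.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fit_centroids(embeddings, labels):
    """Average the training embeddings of each label into one centroid per label."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]
        for label, vecs in grouped.items()
    }

def predict(vec, centroids):
    """Assign the label whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def accuracy(train_x, train_y, test_x, test_y):
    """Fit centroids on the training split and score predictions on the test split."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(vec, centroids) == y for vec, y in zip(test_x, test_y))
    return hits / len(test_y)

# Hypothetical topic labels shared by both embedding sets.
train_y = ["solar", "solar", "wind", "wind"]
test_y = ["solar", "wind"]

# Two hypothetical embedding sets for the same documents (toy values).
emb_a_train = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]]
emb_a_test = [[0.7, 0.2], [0.3, 0.9]]
emb_b_train = [[0.9, 0.1], [0.6, 0.5], [0.5, 0.6], [0.4, 0.6]]
emb_b_test = [[0.5, 0.5], [0.3, 0.9]]

acc_a = accuracy(emb_a_train, train_y, emb_a_test, test_y)
acc_b = accuracy(emb_b_train, train_y, emb_b_test, test_y)
print(f"embedding set A: {acc_a:.2f}, embedding set B: {acc_b:.2f}")
```

Keeping the classifier and the train/test split fixed means that differences in accuracy can be attributed to the embeddings rather than to the downstream model.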