<a href="https://colab.research.google.com/github/daniel-hain/bibliometrics_EIST_2021/blob/master/python/BERTopic_EIST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic EIST 

* First tryout on EIST topic modelling using BERT and representation learning.


## Setup

In [3]:
# Monkey patch: make locale.getpreferredencoding() always report UTF-8
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# When using openai (does not work right now)
# !pip install --upgrade openai

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Change directory
import os
os.chdir("00_projects")
# Print out the current directory
!pwd

In [2]:
# Installing BERTopic
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.14.1-py2.py3-none-any.whl (120 kB)
Collecting hdbscan>=0.8.29
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting umap-learn>=0.5.0
  Downloading umap-l

## Data

* Title + Abstracts of all EIST articles up to 02.2023

In [11]:
import os
import pandas as pd
docs = pd.read_csv("https://raw.githubusercontent.com/daniel-hain/bibliometrics_EIST_2021/master/data/data_text.csv")
docs.head()

Unnamed: 0,UT,PY,text
0,2-S2.0-85117715938,2021,common-pool resources and governance in sustai...
1,2-S2.0-85137288685,2022,introduction to ‘markets in sustainability tra...
2,2-S2.0-85128534377,2022,policy instruments for green-growth of cluster...
3,2-S2.0-85130369331,2022,the transition of agriculture to low carbon pa...
4,2-S2.0-85126615038,2022,hard coal phase-out and the labour market tran...


# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

* We start by instantiating BERTopic. 
* We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead. 
* We will also calculate the topic probabilities. 


In [12]:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance
from bertopic.representation import KeyBERTInspired
from bertopic.representation import TextGeneration

from sentence_transformers import SentenceTransformer
from transformers.pipelines import pipeline

from hdbscan import HDBSCAN
from umap import UMAP

from sklearn.feature_extraction.text import CountVectorizer

In [13]:
# Own stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

new_stopwords = ["study", "studies", "paper", "insight", "insights", "research", "literature", "framework", "frameworks"]
stpwrd = nltk.corpus.stopwords.words('english')
stpwrd.extend(new_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [33]:
# Some hyperparameters
umap_n_neighbors = 5
umap_n_components = 5
umap_min_dist = 0
hdbscan_min_cluster_size = 5
hdbscan_min_samples = hdbscan_min_cluster_size

# EMBEDDING MODEL:
embedding_model = "allenai-specter"

# DIMENSIONALITY REDUCTION
umap_model = UMAP(n_neighbors=umap_n_neighbors, 
                  n_components=umap_n_components, 
                  min_dist=umap_min_dist, 
                  metric='cosine', 
                  random_state=1337)

# CLUSTERING: reduce min_cluster_size and min_samples
hdbscan_model = HDBSCAN(min_cluster_size=hdbscan_min_cluster_size, 
                        metric='euclidean', 
                        cluster_selection_method='eom', 
                        prediction_data=True, 
                        min_samples=hdbscan_min_samples)

# VECTORIZER: custom vectorizer to get rid of stopwords
vectorizer_model = CountVectorizer(stop_words=stpwrd,
                                   min_df = 10)

# TFIDF Weighting: weighting model that reduces the impact of frequent words
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True,
                                     reduce_frequent_words=True)

In [None]:
#import openai
#from bertopic.representation import OpenAI

#!pip install cohere # if you use cohere
#import cohere
# from bertopic.representation import Cohere

In [27]:
from getpass import getpass
# get from: https://docs.google.com/spreadsheets/d/1Zjy7IFRqowBsB1W7vPRQRYcqANOo82LJEvNR22iAsA8/edit#gid=0

#openai.api_key = getpass('Enter OpenAi secret value: ')
#cohere_api_key = getpass('Enter Cohere secret value: ')

In [34]:
## REPRESENTATION MODEL: 
representation_model = MaximalMarginalRelevance(diversity=0.8)
#representation_model = KeyBERTInspired()
#representation_model = OpenAI()
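
For intuition about what `MaximalMarginalRelevance(diversity=0.8)` is doing, the following is a simplified, pure-Python sketch of maximal marginal relevance: candidate words are picked one at a time, trading off similarity to the topic against similarity to words already chosen. The word vectors and the topic vector are invented for illustration, and the `mmr` function is a stand-in for the general technique, not BERTopic's own code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(topic_vec, word_vecs, top_n=2, diversity=0.8):
    """Greedy maximal marginal relevance over a dict {word: vector}."""
    relevance = {w: cosine(vec, topic_vec) for w, vec in word_vecs.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # Penalise words that resemble something already selected.
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

# Hypothetical 2-d embeddings (placeholders, not from the fitted model).
topic_vec = [1.0, 0.1]
word_vecs = {
    "energy": [1.0, 0.1],
    "energies": [1.0, 0.12],  # near-duplicate of "energy"
    "policy": [0.3, 1.0],
}

diverse = mmr(topic_vec, word_vecs, diversity=0.8)
relevant_only = mmr(topic_vec, word_vecs, diversity=0.0)
print(diverse)
print(relevant_only)
```

With a high `diversity`, the near-duplicate word is skipped in favour of a less similar one; with `diversity=0.0` the selection falls back to pure relevance ranking.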

In [35]:
# BERTopic: specify all custom models and n-grams
# NOTE: per the BERTopic docs, n_gram_range is not used when a custom vectorizer_model is passed
topic_model = BERTopic(language="english", 
                       calculate_probabilities=True,
                       verbose=True, 
                       n_gram_range=(1, 3),
                       #
                       embedding_model=embedding_model, 
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model, 
                       vectorizer_model=vectorizer_model,
                       ctfidf_model=ctfidf_model,
                       representation_model=representation_model)

In [36]:
topics, probs = topic_model.fit_transform(docs.loc[:,'text'])

Batches:   0%|          | 0/19 [00:00<?, ?it/s]

2023-03-13 15:44:09,852 - BERTopic - Transformed documents to Embeddings
2023-03-13 15:44:12,574 - BERTopic - Reduced dimensionality
2023-03-13 15:44:12,659 - BERTopic - Clustered reduced embeddings


## Extracting Topics
* After fitting our model, we can start by looking at the results. 
* Typically, we look at the most frequent topics first as they best represent the collection of documents. 
* -1 refers to all outliers and should typically be ignored. 
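
As a small illustration of the last point, the sketch below drops the outlier row and computes each topic's share of the clustered documents. It uses plain Python with the topic IDs and counts copied from the `get_topic_info()` output shown in this section; the variable names are illustrative and not part of BERTopic.

```python
# Topic IDs and document counts copied from the get_topic_info() output
# in this section. -1 is the outlier bucket.
topic_counts = {-1: 13, 0: 252, 1: 154, 2: 50, 3: 21, 4: 19, 5: 19, 6: 17, 7: 14, 8: 12}

# Keep only real topics (drop the outlier bucket).
clustered = {topic: n for topic, n in topic_counts.items() if topic != -1}

# Share of clustered documents per topic, largest first.
total = sum(clustered.values())
shares = sorted(((topic, n / total) for topic, n in clustered.items()),
                key=lambda item: item[1], reverse=True)

for topic, share in shares[:3]:
    print(f"topic {topic}: {share:.1%}")
```

On the notebook's `freq` DataFrame, the same filter would be `freq[freq.Topic != -1]`.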

In [37]:
freq = topic_model.get_topic_info(); freq.head(50)

Unnamed: 0,Topic,Count,Name
0,-1,13,-1_knowledge_learning_provide_industry
1,0,252,0_management_attention_social_governance
2,1,154,1_economic_impacts_policies_climate
3,2,50,2_industry_alternative_countries_firms
4,3,21,3_dynamic_innovation_challenges_technological
5,4,19,4_regime_legitimacy_knowledge_alternative
6,5,19,5_market_institutional_innovation_countries
7,6,17,6_value_network_learning_innovations
8,7,14,7_environmental_diffusion_countries_innovation
9,8,12,8_value_sustainable_firms_addresses


## Attributes

* There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, `topic_model.topics_[:10]` returns the predicted topics for the first 10 documents.

# **Visualization**
* There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. 
* Topic modeling is, to a certain extent, quite subjective.
* Visualizations help understand the topics that were created.

## General topic relation

In [38]:
topic_model.visualize_topics()

# Optional: Fine-tune topics

In [39]:
# VECTORIZER: custom vectorizer to get rid of stopwords
vectorizer_model = CountVectorizer(stop_words=stpwrd,
                                   min_df = 10)

# TFIDF Weighting: weighting model that reduces the impact of frequent words
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True,
                                     reduce_frequent_words=True)

In [40]:
## REPRESENTATION MODEL: 
rep_mm = MaximalMarginalRelevance(diversity=0.8)
rep_kb = KeyBERTInspired()
# rep_oai = OpenAI()
representation_model = rep_mm # [rep_mm, rep_kb]

In [41]:
topic_model.update_topics(docs.loc[:,'text'], 
                          n_gram_range=(1, 3),
                          vectorizer_model=vectorizer_model, 
                          ctfidf_model=ctfidf_model, 
                          representation_model=representation_model) 

In [None]:
# Reduce topics if wanted
# topic_model.reduce_topics(docs.loc[:,'text'], nr_topics=30)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.
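
To make the scores behind these bar charts more concrete, here is a minimal pure-Python sketch of the basic c-TF-IDF idea: all documents of a topic are treated as one class, term frequencies are normalised within each class and then weighted by log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency over all classes. The toy token lists are invented for illustration, and the sketch deliberately ignores the `bm25_weighting` and `reduce_frequent_words` options used in this notebook, so it should not be read as BERTopic's actual implementation.

```python
import math
from collections import Counter

# Hypothetical toy classes: each topic's documents already merged and tokenised.
classes = {
    0: ["energy", "policy", "energy", "transition"],
    1: ["market", "policy", "firm", "market"],
}

def c_tf_idf(classes):
    """Return {topic: {term: score}} using tf * log(1 + A / f_t)."""
    counts = {topic: Counter(tokens) for topic, tokens in classes.items()}
    # f_t: frequency of each term across all classes.
    term_freq = Counter()
    for counter in counts.values():
        term_freq.update(counter)
    # A: average number of words per class.
    avg_words = sum(term_freq.values()) / len(classes)
    scores = {}
    for topic, counter in counts.items():
        n_tokens = sum(counter.values())
        scores[topic] = {
            term: (freq / n_tokens) * math.log(1 + avg_words / term_freq[term])
            for term, freq in counter.items()
        }
    return scores

scores = c_tf_idf(classes)
for topic, term_scores in scores.items():
    top_terms = sorted(term_scores, key=term_scores.get, reverse=True)[:2]
    print(topic, top_terms)
```

In this toy setup, a term shared by both classes ("policy") scores lower than an equally frequent term that is specific to one class, which is the behaviour the bar charts rely on.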

In [42]:
topic_model.visualize_barchart(top_n_topics=50, n_words = 10)

In [43]:
# Representative docs
rep_doc = topic_model.get_representative_docs()
rep_doc = pd.DataFrame(rep_doc)
rep_doc.T

Unnamed: 0,0,1,2
-1,learning and diffusion of knowledge in clean e...,non-disruptive regime changes—the case of comp...,solar business model adoption by energy incumb...
0,the changing landscape of deep transitions: so...,pinning it down? measuring innovation for sust...,households in sustainability transitions: a sy...
1,evaluating the global impact of low-carbon ene...,explaining inclusivity in energy transitions: ...,trust gaps in energy transitions: japan's nati...
2,shifting gears on sustainable transport transi...,the rise of china's new energy vehicle lithium...,the ev paradox - a multilevel study of why sto...
3,towards sustainable urban basic services in lo...,overcoming the harmony fallacy: how values sha...,analysis of institutional work on innovation t...
4,sustainability transitions in the agri-food se...,understanding the protein transition: the rise...,analysing the role of consumers within technol...
5,coordination and legitimacy in the australian ...,the institutional evolution process of the glo...,dynamics of a policy-driven market: the co-evo...
6,a network perspective to niche-regime interact...,understanding the limits to forming policy-dri...,using lead market factors to assess the potent...
7,"the concept of ""lead markets"" revisited: contr...",diffusion of environmental innovations: sector...,assessing the impact of environmental innovati...
8,opening up the black box of learning-by-doing ...,sustainable business model innovation and scal...,understanding and governing learning in sustai...


In [44]:
embeddings = topic_model.embedding_model.embed_documents(docs.loc[:,'text'])
#pd.DataFrame(embeddings).to_csv('embeddings_specter.csv')

In [45]:
topic_model.visualize_documents(docs.loc[:,'text'], embeddings=embeddings)

## Visualize Topic Hierarchy

* The topics that were created can be hierarchically reduced. 
* In order to understand the potential hierarchical structure of the topics, we can create clusters and visualize how they relate to one another. 
* This might also help selecting an appropriate nr_topics when reducing the number of topics that you have created.

In [46]:
# Calculate hierarchy
hierarchical_topics = topic_model.hierarchical_topics(docs.loc[:,'text'])

100%|██████████| 12/12 [00:00<00:00, 21.55it/s]


In [47]:
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

In [52]:
# Run the visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs.loc[:,'text'], hierarchical_topics, embeddings=embeddings)

## Topics over time

In [53]:
# Calculate topics over time
topics_over_time = topic_model.topics_over_time(docs.loc[:,'text'],
                                                docs.loc[:,'PY'])

13it [00:05,  2.28it/s]


In [None]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply applying cosine similarities through those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.
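
As a rough sketch of that idea, the snippet below builds a cosine-similarity matrix from a few topic vectors in plain Python. The three vectors are made-up placeholders, not values from this model; with the fitted model one would presumably start from `topic_model.topic_embeddings_` or `topic_model.c_tf_idf_` instead.

```python
import math

# Hypothetical topic vectors (placeholders, not taken from the fitted model).
topic_vectors = {
    0: [1.0, 0.0, 1.0],
    1: [1.0, 0.0, 0.5],
    2: [0.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Pairwise similarities, rows and columns ordered by topic ID.
topics = sorted(topic_vectors)
similarity = [[cosine(topic_vectors[i], topic_vectors[j]) for j in topics]
              for i in topics]

for topic, row in zip(topics, similarity):
    print(topic, [round(value, 2) for value in row])
```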

In [None]:
topic_model.visualize_heatmap(n_clusters=5, width=1000, height=1000)

## Visualize Topic Probabilities

- The `probs` array returned by `fit_transform()` above (or by `transform()` on new documents) can be used to understand how confident BERTopic is that certain topics can be found in a document. 
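
A minimal sketch of how such a probability vector can be read: keep only the topics above a threshold and rank them. The vector below is a hypothetical example, not output from this model, and `confident_topics` is an illustrative helper name; the 0.01 default mirrors the `min_probability=0.01` used later in this section.

```python
# Hypothetical topic probabilities for one document (index = topic ID).
doc_probs = [0.02, 0.005, 0.60, 0.30, 0.0, 0.075]

def confident_topics(probs, min_probability=0.01):
    """Return (topic, probability) pairs at or above the threshold, highest first."""
    kept = [(topic, p) for topic, p in enumerate(probs) if p >= min_probability]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

ranked = confident_topics(doc_probs)
print(ranked)
```

With the notebook's variables, `confident_topics(probs[doc_nr])` would apply the same filter to a real document.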



In [None]:
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs.loc[:,'text'], calculate_tokens=True)

100%|██████████| 1/1 [00:01<00:00,  1.40s/it]


In [None]:
# Select a document
doc_nr = 102
docs.loc[doc_nr,'text']

'influences of technological and sectoral contexts on technological innovation systems. this paper analyses how technological and sectoral context structures influence the functional pattern of a focal technological innovation system (tis), focusing on value chain interdependencies. through a case study of the ‘embryonic’ bio-succinate from mixed food waste tis in sweden, we show that all identified context structures exert both negative and positive influences on tis functions by influencing resource availability and accessibility, market conditions and the wider selection environment. contextual influences result from interdependencies throughout the value chain, but in contrast to previous studies, direct structural overlaps are not as relevant as competition for resources, markets and policy support. competition does not only come from the regime but also from other related tiss and sectors. the study also confirms the importance of contextual influences from the international tis.

In [None]:
# Get the probability distribution
topic_model.visualize_distribution(probs[doc_nr], min_probability=0.01)

In [None]:
# Visualize the token-level distributions
doc_topic_dist = topic_model.visualize_approximate_distribution(docs.loc[doc_nr,'text'], topic_token_distr[doc_nr])
doc_topic_dist

Unnamed: 0,influences,of,technological,and,sectoral,contexts,on,technological.1,innovation,systems,this,paper,analyses,how,technological.2,and.1,sectoral.1,context,structures,influence,the,functional,pattern,of.1,focal,technological.3,innovation.1,system,tis,focusing,on.1,value,chain,interdependencies,through,case,study,of.2,the.1,embryonic,bio,succinate,from,mixed,food,waste,tis.1,in,sweden,we,show,that,all,identified,context.1,structures.1,exert,both,negative,and.2,positive,influences.1,on.2,tis.2,functions,by,influencing,resource,availability,and.3,accessibility,market,conditions,and.4,the.2,wider,selection,environment,contextual,influences.2,result,from.1,interdependencies.1,throughout,the.3,value.1,chain.1,but,in.1,contrast,to,previous,studies,direct,structural,overlaps,are,not,as,relevant,as.1,competition,for,resources,markets,and.5,policy,support,competition.1,does,not.1,only,come,from.2,the.4,regime,but.1,also,from.3,other,related,tiss,and.6,sectors,the.5,study.1,also.1,confirms,the.6,importance,of.3,contextual.1,influences.3,from.4,the.7,international,tis.3,these,findings,suggest,that.1,wide,perspective,on.3,context.2,structures.2,and.7,selection.1,pressures,should,be,considered,in.2,future,research
0_socio_understanding_challenges_change,0.0,0.0,0.102,0.102,0.204,0.318,0.334,0.451,0.45,0.336,0.218,0.1,0.0,0.0,0.109,0.109,0.225,0.341,0.231,0.231,0.116,0.0,0.0,0.0,0.115,0.23,0.23,0.23,0.115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.108,0.216,0.216,0.216,0.108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.11,0.11,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.203,0.304,0.42,0.318,0.343,0.368,0.253,0.253,0.126,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_economic_power_national_climate,0.0,0.0,0.0,0.0,0.0,0.104,0.219,0.335,0.335,0.231,0.115,0.103,0.205,0.205,0.308,0.206,0.207,0.31,0.207,0.207,0.104,0.0,0.0,0.0,0.121,0.241,0.241,0.241,0.121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117,0.234,0.234,0.234,0.117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.107,0.213,0.32,0.32,0.213,0.107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.232,0.353,0.474,0.362,0.241,0.121,0.0,0.0,0.0,0.0,0.116,0.232,0.232,0.232,0.116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.11,0.11,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.109,0.218,0.327,0.327,0.218,0.212,0.206,0.206,0.206,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2_value_alternative_current_countries,0.0,0.0,0.111,0.111,0.222,0.354,0.39,0.537,0.537,0.405,0.259,0.213,0.202,0.202,0.316,0.215,0.226,0.339,0.225,0.225,0.113,0.0,0.0,0.128,0.272,0.416,0.524,0.396,0.38,0.363,0.383,0.511,0.383,0.255,0.128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.221,0.221,0.221,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.128,0.255,0.383,0.511,0.383,0.255,0.128,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102,0.203,0.305,0.305,0.203,0.102,0.0,0.0,0.0,0.0,0.11,0.22,0.22,0.22,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.223,0.223,0.223,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3_regime_legitimacy_public_knowledge,0.0,0.0,0.0,0.0,0.0,0.0,0.161,0.322,0.462,0.597,0.436,0.397,0.379,0.244,0.373,0.25,0.128,0.128,0.0,0.0,0.0,0.0,0.0,0.107,0.249,0.391,0.504,0.398,0.255,0.113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.223,0.223,0.223,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.215,0.33,0.445,0.345,0.23,0.115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105,0.105,0.105,0.105,0.0,0.0,0.0,0.0,0.0,0.178,0.355,0.523,0.691,0.514,0.336,0.168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.108,0.108,0.108,0.108,0.0,0.0,0.0,0.0,0.0,0.0,0.124,0.248,0.373,0.484,0.36,0.342,0.324,0.212,0.212,0.106,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4_formation_value_innovations_sustainable,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111,0.111,0.285,0.459,0.522,0.696,0.522,0.348,0.174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.224,0.224,0.224,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.252,0.391,0.53,0.67,0.418,0.279,0.139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.174,0.348,0.522,0.696,0.522,0.348,0.174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.108,0.217,0.325,0.452,0.343,0.235,0.127,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.22,0.22,0.22,0.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.105,0.105,0.105,0.105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.131,0.261,0.261,0.261,0.131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5_environmental_countries_market_innovation,0.0,0.0,0.0,0.0,0.0,0.13,0.27,0.411,0.521,0.391,0.25,0.233,0.246,0.246,0.354,0.231,0.209,0.31,0.203,0.203,0.101,0.0,0.0,0.143,0.277,0.411,0.512,0.37,0.236,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104,0.209,0.209,0.209,0.104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.135,0.135,0.135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.201,0.201,0.201,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6_technological_formation_institutional_knowledge,0.173,0.345,0.522,0.522,0.526,0.552,0.585,0.794,0.74,0.541,0.332,0.264,0.283,0.456,0.668,0.661,0.666,0.64,0.428,0.293,0.147,0.0,0.173,0.374,0.585,0.796,0.748,0.546,0.335,0.124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.114,0.229,0.35,0.472,0.357,0.243,0.256,0.27,0.404,0.539,0.404,0.27,0.135,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102,0.233,0.363,0.494,0.392,0.261,0.131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.114,0.227,0.341,0.463,0.349,0.236,0.122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16,0.32,0.479,0.618,0.458,0.437,0.416,0.413,0.547,0.408,0.27,0.135,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7_technological_formation_diffusion_knowledge,0.18,0.36,0.36,0.36,0.18,0.129,0.354,0.579,0.717,0.588,0.363,0.286,0.298,0.478,0.628,0.479,0.33,0.15,0.0,0.0,0.0,0.0,0.18,0.397,0.638,0.88,0.861,0.752,0.613,0.473,0.413,0.406,0.305,0.203,0.102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102,0.204,0.204,0.204,0.102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.109,0.234,0.358,0.483,0.374,0.249,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102,0.203,0.305,0.406,0.305,0.203,0.102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.108,0.217,0.325,0.451,0.463,0.475,0.486,0.461,0.341,0.221,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.217,0.322,0.427,0.315,0.21,0.105,0.0,0.0,0.0,0.104,0.208,0.312,0.312,0.208,0.104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8_innovation_conceptual_within_germany,0.138,0.276,0.276,0.276,0.138,0.117,0.349,0.581,0.772,0.78,0.548,0.455,0.402,0.415,0.566,0.428,0.417,0.407,0.256,0.256,0.128,0.0,0.138,0.334,0.571,0.807,0.868,0.804,0.567,0.331,0.132,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117,0.233,0.392,0.551,0.544,0.536,0.377,0.218,0.109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103,0.103,0.103,0.103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117,0.234,0.234,0.234,0.117,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9_need_private_social_climate,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.269,0.464,0.464,0.33,0.195,0.0,0.0,0.0,0.0,0.145,0.29,0.415,0.54,0.395,0.25,0.125,0.0,0.107,0.214,0.371,0.524,0.418,0.311,0.153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.224,0.365,0.507,0.396,0.284,0.142,0.0,0.0,0.142,0.284,0.426,0.426,0.284,0.142,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.127,0.127,0.242,0.356,0.23,0.23,0.115,0.0,0.0,0.0,0.0,0.111,0.223,0.223,0.223,0.111


# Work in progress: ChatGPT topic summary:

In [None]:
top_n = 12
n_doc = 5

In [None]:
T = topic_model.get_document_info(docs.loc[:,'text'])

In [None]:
promt_doc_intro = "I want you to act as an academic in social science. I want you to identify a topic in the following documents (separated by '---'), which are abstracts of scientific journal publications in a journal called 'Environmental Innovation and Societal Transitions'."

promt_doc = [' --- '.join(rep_doc.loc[:,top_n])]

promt_key_intro = "Now that you have received the corresponding documents, I will also provide you with some keywords describing the topic. They are separated by ','."

promt_key = pd.DataFrame(topic_model.get_topic(top_n))
promt_key = [', '.join(promt_key.loc[:,0])]

promt_final = "Your task is to summarize the topic by a short label of 2-7 words, plus a short description of 3-5 sentences. This summary should highlight the commonality of the documents in light of the provided keywords. It should indicate the main theoretical theme, research framework, context, or implications. It should also indicate if the research in this topic is more quantitative or qualitative, on a macro or micro level, and aimed at theoretical, policy, or industry implications."

In [None]:
print(promt_doc_intro, promt_doc, promt_key_intro, promt_key, promt_final, sep = " \n \n ")

I want you to act as an academic in social science. I want you to identify a topic in the following documents (separated by '---'), which are abstracts of scientific journal publications in a journal called 'Environmental Innovation and Societal Transitions'. 
 
 ["user innovation, niche construction and regime destabilization in heat pump transitions. domestic heating systems require a rapid shift to low-carbon options to meet global climate targets. we analyse a heat pump transition in two contrasting case studies: finland and the united kingdom, utilizing original data from interviews, document analysis, and archival online data. finland has an almost completed transition, while the united kingdom can be considered a stalled one. building on previous research that has highlighted the importance of context, policy and users in transitions, we explore various user roles within low-carbon transitions, and how they shape processes of niche construction and regime destabilisat
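
The prompt assembly above could be wrapped in a small helper so that it can be repeated per topic. The sketch below is one possible shape; the function name `build_topic_prompt`, its parameters, and the sample inputs are illustrative assumptions, and the instruction strings are shortened placeholders rather than the full prompts used in this notebook.

```python
def build_topic_prompt(documents, keywords, intro, key_intro, task):
    """Join representative documents and keywords into one prompt string."""
    doc_block = " --- ".join(documents)  # same document separator as in the notebook
    key_block = ", ".join(keywords)      # same keyword separator as in the notebook
    return " \n \n ".join([intro, doc_block, key_intro, key_block, task])

# Shortened placeholder inputs (not real abstracts or model output).
prompt = build_topic_prompt(
    documents=["first abstract ...", "second abstract ..."],
    keywords=["energy", "policy", "transition"],
    intro="Identify a topic in the following documents, separated by '---'.",
    key_intro="Keywords describing the topic, separated by ',':",
    task="Summarize the topic with a short label and a short description.",
)
print(prompt)
```

With the notebook's objects, `documents` would presumably be one topic's column of `rep_doc` and `keywords` the first column of the `topic_model.get_topic(...)` result, as in the cells above.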

## Results

### Topic 0
Topic: Sustainable Transitions: Actors, Processes, and Innovations

Description: The selected articles address sustainability transitions by examining the role of actors, processes, and innovations. They present analytical frameworks for generating actionable knowledge and offer refined versions of concepts like systemic and landscape imprinting to understand the dynamics of transition processes. The articles explore neglected factors like war and military in sociotechnical transitions and propose new research avenues for understanding households' roles in shaping social innovation and environmental policies. Overall, the articles contribute to emerging understandings of sustainable transitions, emphasizing the need for a strategic approach to generate actionable knowledge and identify opportunities for sustainable innovation. The research in this topic is qualitative, focused on a macro level, and aimed at theoretical and policy implications.

### Topic 1

Topic: Institutional Challenges in Sustainable Energy Transitions

Description: The selected articles explore the challenges of creating sustainable energy systems, particularly in the context of institutional changes and actors' roles. They apply dialectic and multi-level perspectives to understand the tensions and contradictions driving niche creation and the interactions between niche and regime actors. The articles also examine the risks and uncertainties involved in implementing renewable energy systems in off-grid communities and the importance of trust in deliberative policy-making to enhance policy legitimacy and ensure effective post-disaster energy transitions. Overall, the articles contribute to emerging understandings of the institutional challenges in sustainable energy transitions, highlighting the need for transdisciplinary approaches and policy interventions to address these challenges. The research in this topic is qualitative, focused on a micro level, and aimed at theoretical and policy implications.

### Topic 2

Topic: Sustainable Innovations in the Energy Sector

This collection of journal articles explores various aspects of sustainable innovations in the energy sector, particularly in the area of e-mobility and electric vehicles. The papers examine the potential for e-mobility as a sustainable system innovation, explain variance in national electric vehicle policies, and analyze the coevolution of battery technological innovation systems and policies in China's new energy vehicle lithium-ion battery industry. The research framework is theoretical and empirical, combining qualitative and quantitative methods to assess the impact of actors, technologies, and policies on sustainable innovations in the industry. The implications of the research are relevant for policymakers and industry leaders seeking to promote sustainable energy transitions.

### Topic 3

Label: Sustainability Transitions in Food Processing and Agriculture

Description: This collection of articles explores sustainability transitions in the food processing and agricultural sectors. Using the Technological Innovation Systems (TIS) framework, the articles analyze the roles of consumers, producers, and institutional actors in driving and facilitating sustainability transitions. They highlight the importance of cognitive and normative legitimacy, place-based embeddedness, and the public goods' character of biodiversity in shaping transition dynamics. The research is primarily qualitative and conducted on a micro-level, with implications for policy and industry stakeholders seeking to facilitate sustainable food systems.

### Topic 4

Label: Market Formation for Transformative Policy

This collection of papers focuses on understanding market formation in the context of sustainability transitions, drawing on perspectives from economics, sociology, and marketing studies. The papers use case studies to demonstrate the dynamic and interconnected nature of market formation and propose frameworks for analyzing and characterizing new market formation processes. The research emphasizes the importance of identifying potential misalignments, bottlenecks, and failures in market formation and suggests entry points for policy intervention to support transformative innovation. The research is largely qualitative, focusing on the theoretical and conceptual implications for policy and industry, with a focus on market formation as a critical component of sustainability transitions.

### Topic 5

Topic: Environmental Innovations in Industry and Regional Markets

This set of articles examines environmental innovations in industry and regional markets, focusing on the diffusion processes, impacts, and factors influencing their adoption. The research employs mixed-methods approaches to explore the role of regional lead markets, sector-specific factors, and incumbent firms in shaping the diffusion and impact of environmental innovations. The studies demonstrate the importance of contextual factors in understanding the diffusion and impact of environmental innovations, and provide insights into effective policies for promoting their adoption. The research is mainly empirical and quantitative, with a focus on industry and regional-level implications.

## Topic 6

Label: Global Dynamics of Innovation in Emerging Industries

This collection of documents focuses on the innovation and institutional dynamics of emerging industries in the global economy, with particular attention to the energy and environmental sectors. The papers employ various theoretical frameworks, including the technological innovation system and institutional evolution approaches, and examine the role of both public and private actors in driving innovation and market formation. The research is largely qualitative and macro-level, with implications for industry strategy and policy development in a global context.

## Topic 7

Label: Sustainable Business Models and Transitions

This topic covers the challenges and opportunities of sustainable business models in the context of transitions towards a circular economy and reduced environmental impact. The papers highlight the importance of understanding consumer behaviors, industry value chains, and collaboration among actors to enable sustainable innovation and scaling strategies. The research adopts qualitative methods, theoretical frameworks such as systems of practice and the industry value chain perspective, and case studies of upstream incumbent firms and mobile phone product service systems. The implications of this research are relevant for practitioners, policymakers, and scholars interested in sustainability transitions and strategic innovation for sustainable development.

## Topic 8

Label: Spatial Diffusion of Innovation in Wind Energy

Description: These articles focus on the spatial diffusion of wind energy innovation across countries and regions. The research utilizes theoretical frameworks such as the innovation systems and value chain perspectives to examine the factors that drive innovation and diffusion in the wind energy sector. The articles highlight the importance of regional actor networks, absorptive capacity, and transnational linkages in the formation of technological innovation systems. The research is mainly qualitative and is relevant to industry practitioners, policymakers, and researchers interested in understanding the factors that influence the international diffusion of sustainable energy technologies.

## Topic 9

Label: Innovation Systems and Policy

This set of articles explores various aspects of innovation systems and policy. They focus on the actors, processes, and institutional frameworks involved in creating, implementing, and evaluating policies that foster transformative innovation. The articles also examine the spatial and temporal dimensions of innovation systems, the role of values and institutional logics in shaping innovation trajectories, and the challenges of scaling up sustainable innovations. The research is qualitative and theoretical in nature, and has implications for policymakers, practitioners, and scholars interested in understanding and promoting transformative innovation.

## Topic 10

Label: Just Transitions in Sustainable Food Systems

This collection of articles explores the challenges and opportunities for achieving sustainable food systems that are just and equitable. The articles analyze contrasting perspectives and proposals on how to address issues of environmental and food justice, social equity, and climate change in different countries and contexts. They offer a multidimensional understanding of justice that includes distributional, recognition, and procedural aspects. The research is qualitative and aims to provide theoretical and policy implications for achieving just transitions towards sustainable food systems that do not exacerbate existing inequalities and unsustainabilities.

## Topic 11

Label: Intermediaries and Institutional Change for Sustainability Transitions

The three articles analyze the role of intermediaries in driving sustainability transitions through institutional change. The first article examines the use of intermediaries to accelerate the diffusion of wooden multi-storey construction in Finland. The second article investigates the different strategies used by institutional entrepreneurs in diverse niche contexts in the Netherlands, while the third article focuses on the actors and platforms that facilitate change towards zero carbon residential buildings in the UK. The articles highlight the importance of actors and networks in shaping institutional change, and the challenges that arise from differing agendas and goals. The research is qualitative and micro-level, with implications for practitioners and policymakers seeking to drive sustainability transitions through institutional change.

## Topic 12

Label: User Innovation and Policy Implications

These articles examine the role of user innovation and policy processes in sustainability transitions, specifically in the context of low-carbon heating systems, energy-efficient homes, and energy production in the Dutch glasshouse sector. The studies use qualitative methods such as interviews, document analysis, and archival data to analyze the role of users in niche construction and regime destabilization, the linking processes between socio-technical regimes and technological niches, and the co-evolutionary dynamics of policy mixes and socio-technical systems. The research highlights the importance of context, policy, and users in transitions and suggests that policy should aim to mobilize users and create effective initial designs that generate positive feedback to induce system innovation for sustainability. The studies have implications for policymakers, practitioners, and researchers in the field of sustainability transitions.