With this notebook we intend to ingest a JSON list of AI-related papers and output a summary by industry, together with the contributions we find in those papers.

In [43]:
!pip install chromadb
!pip install hdbscan


Collecting hdbscan
  Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.40


In [46]:
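import sys
import json
import os

import kagglehub
import pandas as pd  # used below to hold the papers as a DataFrame
# Assumed imports (not shown in the original cell): `genai.configure` matches the
# google-generativeai package and `genai_embed.Client` matches the google-genai package.
import google.generativeai as genai
from google import genai as genai_embed

# Download the capstone utilities dataset and make its `utils` package importable
gen_ai_capstone_utils_path = kagglehub.dataset_download('helgaguerreiro/gen-ai-capstone-utils')
sys.path.append(gen_ai_capstone_utils_path)

from utils import lib

print(os.listdir(gen_ai_capstone_utils_path))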

['utils', 'industry-list.json', 'papers_sample.json']


In [32]:
from kaggle_secrets import UserSecretsClient

#
# Config 
#
config = {
    "data_dir": gen_ai_capstone_utils_path,  # where the input data is stored
    "out_dir": "/kaggle/working/",  # where intermediate data and the final document are written
    "papers_file": "papers_sample.json",  # which paper list to ingest
    "GOOGLE_API_KEY": UserSecretsClient().get_secret("GOOGLE_API_KEY")
}

genai.configure(api_key=config['GOOGLE_API_KEY'])
papers = [] 

#
# Load the papers from config['papers_file'] in the data directory.
# If an intermediate (enriched) file exists, that one is loaded instead.
filepath = os.path.join(config['data_dir'], config['papers_file'])


Make sure we can restore intermediate results if something goes wrong while testing changes.

In [34]:
if os.path.exists(os.path.join(config['out_dir'], 'papers_enriched.json')):
    filepath = os.path.join(config['out_dir'], 'papers_enriched.json')

print("Loading papers from:", filepath)
with open(filepath, 'r') as f:
    for line in f:
        papers.append(json.loads(line))

papers = pd.DataFrame(papers)
print("Loaded papers:", len(papers))


Loading papers from: /kaggle/input/gen-ai-capstone-utils/papers_sample.json
Loaded papers: 400
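
The loader reads one JSON object per line (the JSON Lines layout) and prefers papers_enriched.json when it exists. As a minimal sketch of that checkpoint pattern, the snippet below writes records in the same layout and reads them back. The helper names `save_checkpoint` and `load_checkpoint` are hypothetical and are not part of utils; the records are toy values.

```python
import json
import os
import tempfile


def save_checkpoint(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_checkpoint(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy records for illustration only; not taken from papers_sample.json.
records = [
    {"id": "paper-a", "industry_list": []},
    {"id": "paper-b", "industry_list": [{"industry": "Healthcare"}]},
]

with tempfile.TemporaryDirectory() as out_dir:
    checkpoint = os.path.join(out_dir, "papers_enriched.json")
    save_checkpoint(records, checkpoint)
    restored = load_checkpoint(checkpoint)
```

Writing the checkpoint after each batch of model calls means a failed run can resume from the last saved state instead of re-querying the model for every paper.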


We want to ground the model output to a pre-defined set of industries. 

Each entry in industry_meta contains:
* industry - the industry name
* keywords - keywords associated with the industry
* examples - examples of what should match the industry

We will use this data to compose the prompt when querying the model. 
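
A minimal sketch of how such a grounded prompt could be composed. The function name `build_industry_prompt` and the prompt wording are assumptions for illustration; the prompt actually sent to the model lives in utils and is not shown here. The two sample entries are copied from industry-list.json, and the abstract is a placeholder.

```python
def build_industry_prompt(industry_meta, title, abstract):
    """Compose a classification prompt that lists the allowed industries."""
    lines = []
    for entry in industry_meta:
        keywords = ", ".join(entry["keywords"])
        examples = " ".join(entry["examples"])
        lines.append(f"- {entry['industry']} (keywords: {keywords}; e.g. {examples})")
    allowed = "\n".join(lines)
    return (
        "Assign the paper to one or more of the industries listed below.\n"
        "Use only names from the list; answer 'Other' if none of them fits.\n\n"
        f"Allowed industries:\n{allowed}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
    )


# Two entries copied from industry-list.json.
sample_meta = [
    {
        "industry": "Healthcare",
        "keywords": ["doctor", "hospital", "medical", "imaging", "diagnosis", "clinical"],
        "examples": ["Detecting tumors from MRI scans."],
    },
    {
        "industry": "Education",
        "keywords": ["education", "student", "learning", "tutoring", "school", "curriculum"],
        "examples": ["AI tutor for personalized mathematics education."],
    },
]

prompt = build_industry_prompt(
    sample_meta,
    "Multi-Disease-Aware Training Strategy for Cardiac MR Image Segmentation",
    "(abstract text goes here)",  # placeholder, not a real abstract
)
print(prompt)
```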

In [39]:
# Load the industry metadata from industry-list.json in the data directory
with open(os.path.join(config['data_dir'], 'industry-list.json'), 'r') as f:
    industry_meta = json.load(f)

print(industry_meta)

[{'industry': 'Healthcare', 'keywords': ['doctor', 'hospital', 'medical', 'imaging', 'diagnosis', 'clinical'], 'examples': ['Detecting tumors from MRI scans.']}, {'industry': 'Finance & Banking', 'keywords': ['finance', 'banking', 'fraud', 'credit', 'insurance', 'investment', 'stock'], 'examples': ['Predicting credit default risk using machine learning.']}, {'industry': 'Education', 'keywords': ['education', 'student', 'learning', 'tutoring', 'school', 'curriculum'], 'examples': ['AI tutor for personalized mathematics education.']}, {'industry': 'Energy & Utilities', 'keywords': ['energy', 'grid', 'power', 'electricity', 'utility'], 'examples': ['Forecasting power demand in smart grids.']}, {'industry': 'Retail & E-commerce', 'keywords': ['retail', 'shopping', 'e-commerce', 'consumer', 'purchase', 'sales'], 'examples': ['Optimizing product recommendations for online stores.']}, {'industry': 'Agriculture', 'keywords': ['agriculture', 'crop', 'farming', 'irrigation', 'soil'], 'examples':

To have the model assign an industry to a paper, we should consider the following:
* When using flash models, it is best to separate the task of deciding whether a paper is industry-relevant from the task of deciding which industry fits.
* Grounding the prompt helps direct the model to specific industry names but does not prevent it from hallucinating.
* Flash models have poor abstraction capabilities and will often drift into inferring industry associations even at low temperatures; we should always check the model output to exclude hallucinations.
* We instruct the model to assign the category Other as a fallback if none of the allowed categories match.
* When the flash model returns three or more categories, or returns the category "Other" alongside other categories, we treat this as a "suspicious classification" and re-run the prompt with a pro model, whose assessment replaces the flash model's.

The implementation is defined in utils/industry.py.
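
The contents of utils/industry.py are not shown here, so the following is only a sketch of two of the checks described in the list: dropping labels that are not in the allowed set, and flagging a "suspicious classification" for a second pass with a pro model. The function names and the flat list-of-names input are assumptions, and "Space Mining" is an invented label standing in for a hallucinated industry.

```python
def filter_allowed(labels, allowed):
    """Keep only labels that appear in the allowed set (drops hallucinated names)."""
    return [label for label in labels if label in allowed]


def is_suspicious(labels):
    """Flag three or more categories, or 'Other' returned alongside other categories."""
    return len(labels) >= 3 or ("Other" in labels and len(labels) > 1)


# A small subset of allowed names, for illustration.
allowed = {"Healthcare", "Education", "Agriculture", "Other"}

print(filter_allowed(["Healthcare", "Space Mining"], allowed))
print(is_suspicious(["Healthcare"]))
print(is_suspicious(["Healthcare", "Other"]))
print(is_suspicious(["Healthcare", "Education", "Agriculture"]))
```

Keeping both checks as plain functions outside the model call makes them cheap to run on every response, so only the flagged papers incur the cost of the pro model.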

In [53]:
from utils import industry 
# Run each paper through the model so we can assign an industry label
industry.classify_industry(papers, industry_meta, config['out_dir'])
print(papers.head(10))

################################################## 
ℹ️ Saving intermediate results...


 ################################################## 
Paper 1/400: 2503.17894 Generative AI for Validating Physics Laws 


Paper already classified, skipping... [{'industry': 'N/A', 'relevanceScore': 0, 'summary': 'No industry or application domain mentioned'}]


 ################################################## 
Paper 2/400: 2503.17896 Multi-Disease-Aware Training Strategy for Cardiac MR Image Segmentation 


Paper already classified, skipping... [{'industry': 'N/A', 'relevanceScore': 0, 'summary': 'No industry or application domain mentioned'}]


 ################################################## 
Paper 3/400: 2503.17897 Real-time Global Illumination for Dynamic 3D Gaussian Scenes 


Paper already classified, skipping... [{'industry': 'Entertainment & Media', 'relevanceScore': 85, 'summary': 'This paper presents a real-time global illumination approach for dynamic 3D Gaussian models and meshes. 

In [None]:

# What % of papers are industry-related? Select those whose first industry label is not N/A
industry_related = papers[
    papers['industry_list'].apply(lambda x: len(x) > 0 and x[0]['industry'] != 'N/A')
]
print("Papers related to industry development:", len(industry_related))
print("Percentage of papers related to industry development:", len(industry_related)/len(papers)*100)

Some industries will contain many papers; every summary implies information compression. We can increase resolution without exploding the summary size if we can cluster similar papers and evaluate each cluster separately. 
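
As a rough sketch of that idea, the snippet below groups papers by industry and cluster label so that each group can be summarized on its own. The ids and cluster labels are toy values; treating -1 as "not assigned to any cluster" follows the noise-label convention of HDBSCAN (installed above). The function name and record fields are assumptions, not the notebook's actual schema.

```python
from collections import defaultdict


def group_by_industry_and_cluster(records):
    """Map (industry, cluster) to the list of paper ids in that group."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["industry"], record["cluster"])].append(record["id"])
    return dict(groups)


# Toy records: ids and cluster labels are invented for illustration.
records = [
    {"id": "p1", "industry": "Healthcare", "cluster": 0},
    {"id": "p2", "industry": "Healthcare", "cluster": 0},
    {"id": "p3", "industry": "Healthcare", "cluster": 1},
    {"id": "p4", "industry": "Education", "cluster": -1},  # -1: not assigned to any cluster
]

groups = group_by_industry_and_cluster(records)
for (industry, cluster), paper_ids in sorted(groups.items()):
    print(industry, cluster, paper_ids)
```

Each group can then be passed to the model separately, which keeps every prompt small while preserving detail that a single per-industry summary would lose.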

To cluster papers together, we first create semantic embeddings.

In [None]:
collection = lib.get_chroma_collection(name="paper_abstracts", base_path=config['out_dir'])
# Probe for a known record to decide whether the papers were already embedded
sample_embedding = lib.get_chroma_record(collection, '2504.01981__Manufacturing')
if sample_embedding is not None:
    print("\n\n", "#" * 20, "ChromaDB collection already exists. Skipping embedding.", sample_embedding)
else:
    print("\n\n", "#" * 20, "Embedding papers.", sample_embedding)
    client = genai_embed.Client(api_key=config['GOOGLE_API_KEY'])
    # embed_papers is not defined in the cells shown here; it is assumed to be defined elsewhere
    embed_papers(client, papers)