With this notebook we ingest a JSON list of AI-related papers and output a per-industry summary of the contributions found in those papers.

In [2]:
!pip install chromadb
!pip install hdbscan



In [3]:
import sys
import kagglehub 
gen_ai_capstone_utils_path = kagglehub.dataset_download('helgaguerreiro/gen-ai-capstone-utils')
sys.path.append(gen_ai_capstone_utils_path)

import json 
import os  
from utils import lib 

print(os.listdir(gen_ai_capstone_utils_path))

['utils', 'industry-list.json', 'papers_sample.json']


In [8]:
from kaggle_secrets import UserSecretsClient
import google.generativeai as genai  
#
# Config 
#
config = {
    "data_dir": gen_ai_capstone_utils_path,  # where we store the input data
    "out_dir": "/kaggle/working/",  # where we write intermediate data and the final document
    "papers_file": "papers_sample.json",  # which paper list we are ingesting
    "GOOGLE_API_KEY": UserSecretsClient().get_secret("GOOGLE_API_KEY"),
    "reset": True,  # when True, delete intermediate results from earlier runs
}

genai.configure(api_key=config['GOOGLE_API_KEY'])
papers = [] 

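#
# Load the papers from the file named in config['papers_file'];
# if an intermediate file exists, that one is loaded instead (see the next cell)
filepath = os.path.join(config['data_dir'], config['papers_file'])

if config['reset']:
    # Remove intermediate results from a previous run; skip files that do not exist
    for name in ('papers_enriched.json', 'suspicious_classifications.json'):
        stale_path = os.path.join(config['out_dir'], name)
        if os.path.exists(stale_path):
            os.remove(stale_path)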

Make sure we can restore intermediate results if something goes wrong while testing changes.

In [9]:
import pandas as pd 

if  os.path.exists(os.path.join(config['out_dir'],'papers_enriched.json')):
    filepath =  os.path.join(config['out_dir'],'papers_enriched.json')

print("Loading papers from:", filepath)
with open(filepath, 'r') as f:
    for line in f:
        papers.append(json.loads(line))

papers = pd.DataFrame(papers)
print("Loaded papers:", len(papers))


Loading papers from: /kaggle/input/gen-ai-capstone-utils/papers_sample.json
Loaded papers: 20
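The loader above reads one JSON object per line. The code that writes `papers_enriched.json` lives in the utils package and is not shown here, so the following is only a minimal sketch of a checkpoint round trip in that same line-per-record format. The helper names `save_jsonl` and `load_jsonl` and the sample records are hypothetical.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line, matching the line-by-line loader."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dicts, ignoring blank lines."""
    with open(path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the real papers carry more fields.
records = [
    {"id": "2503.17894", "industry_list": []},
    {"id": "2503.17897", "industry_list": [{"industry": "Entertainment & Media"}]},
]

# Write to a temporary directory so the sketch does not touch real outputs.
checkpoint = os.path.join(tempfile.mkdtemp(), "papers_enriched.json")
save_jsonl(records, checkpoint)
restored = load_jsonl(checkpoint)
```

Writing the checkpoint after each processed paper, rather than once at the end, is what would make a restart cheap when a model call fails midway.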


We want to ground the model output to a pre-defined set of industries. 

Each entry in industry_meta contains:
* industry - the industry name
* keywords - keywords associated with the industry
* examples - examples of what should match the industry

We will use this data to compose the prompt when querying the model. 

In [10]:
# Load the industry metadata from the industry-list.json file in the data directory
with open(os.path.join(config['data_dir'], 'industry-list.json'), 'r') as f:
    industry_meta = json.load(f)

print(industry_meta)

[{'industry': 'Healthcare', 'keywords': ['doctor', 'hospital', 'medical', 'imaging', 'diagnosis', 'clinical'], 'examples': ['Detecting tumors from MRI scans.']}, {'industry': 'Finance & Banking', 'keywords': ['finance', 'banking', 'fraud', 'credit', 'insurance', 'investment', 'stock'], 'examples': ['Predicting credit default risk using machine learning.']}, {'industry': 'Education', 'keywords': ['education', 'student', 'learning', 'tutoring', 'school', 'curriculum'], 'examples': ['AI tutor for personalized mathematics education.']}, {'industry': 'Energy & Utilities', 'keywords': ['energy', 'grid', 'power', 'electricity', 'utility'], 'examples': ['Forecasting power demand in smart grids.']}, {'industry': 'Retail & E-commerce', 'keywords': ['retail', 'shopping', 'e-commerce', 'consumer', 'purchase', 'sales'], 'examples': ['Optimizing product recommendations for online stores.']}, {'industry': 'Agriculture', 'keywords': ['agriculture', 'crop', 'farming', 'irrigation', 'soil'], 'examples':
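Given entries shaped like the ones printed above (keys `industry`, `keywords`, `examples`), a grounding prompt could be composed roughly as follows. This is a sketch only: the real prompt is built inside utils/industry.py, and both the function name `build_industry_prompt` and the prompt wording are assumptions made for illustration.

```python
def build_industry_prompt(abstract, industry_meta):
    """Compose a classification prompt that lists the allowed industries."""
    lines = [
        "Classify the paper into one or more of the allowed industries.",
        'Answer "Other" if none of them match.',
        "",
        "Allowed industries:",
    ]
    for entry in industry_meta:
        keywords = ", ".join(entry["keywords"])
        examples = "; ".join(entry["examples"])
        lines.append(f"- {entry['industry']} (keywords: {keywords}; e.g. {examples})")
    lines += ["", "Abstract:", abstract]
    return "\n".join(lines)


# Two entries copied from the loaded metadata.
sample_meta = [
    {
        "industry": "Healthcare",
        "keywords": ["doctor", "hospital", "medical", "imaging", "diagnosis", "clinical"],
        "examples": ["Detecting tumors from MRI scans."],
    },
    {
        "industry": "Education",
        "keywords": ["education", "student", "learning", "tutoring", "school", "curriculum"],
        "examples": ["AI tutor for personalized mathematics education."],
    },
]

abstract = "A placeholder abstract used only for this sketch."
prompt = build_industry_prompt(abstract, sample_meta)
print(prompt)
```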

To have the model assign an industry to a paper, we should consider the following:
* When using flash models, it is best to separate the task of deciding whether a paper is industry-relevant from the task of deciding which industry fits
* Grounding the prompt helps direct the model to specific industry names but does not prevent it from hallucinating
* Flash models have poor abstraction capabilities and will often drift into inferring industry associations even at low temperatures; we should always check the model output to exclude hallucinations
* We instruct the model to assign the category Other as a fallback if none of the allowed categories match
* When the flash model returns three or more categories, or returns the category "Other" alongside other categories, we flag this as a "suspicious classification" and use a pro model to re-run the prompt and replace the flash model's assessment

The implementation is defined in utils/industry.py.
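The "suspicious classification" rule can be sketched as a small predicate. This is an illustration of the rule as stated above, not the code in utils/industry.py; the function name `is_suspicious` is hypothetical, and the input is assumed to be a list of dicts with an `industry` key, as in the model output shown in this notebook.

```python
def is_suspicious(industries):
    """Return True when a flash-model result should be re-run on a pro model."""
    names = [item["industry"] for item in industries]
    too_many = len(names) >= 3                           # three or more categories
    other_mixed = "Other" in names and len(names) > 1    # "Other" next to real categories
    return too_many or other_mixed


single = [{"industry": "Healthcare"}]
only_other = [{"industry": "Other"}]
mixed = [{"industry": "Other"}, {"industry": "Healthcare"}]
three = [{"industry": "Healthcare"}, {"industry": "Education"}, {"industry": "Agriculture"}]

print(is_suspicious(single), is_suspicious(only_other), is_suspicious(mixed), is_suspicious(three))
```

Note that "Other" on its own is treated as a valid fallback; it only becomes suspicious when it is combined with other categories.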

In [11]:
from utils import industry 
# Run each paper through the model so we can assign an industry label
industry.classify_industry(papers,industry_meta,config['out_dir'])
print(papers.head(10))

################################################## 
ℹ️ Saving intermediate results...


 ################################################## 
Paper 1/20: 2503.17894 Generative AI for Validating Physics Laws 


Paper is not related to industry development, skipping...


 ################################################## 
Paper 2/20: 2503.17896 Multi-Disease-Aware Training Strategy for Cardiac MR Image Segmentation 


Paper is not related to industry development, skipping...


 ################################################## 
Paper 3/20: 2503.17897 Real-time Global Illumination for Dynamic 3D Gaussian Scenes 


Time taken to query model: 0.9729924201965332 seconds
Industry list:
 [{'industry': 'Entertainment & Media', 'relevanceScore': 85, 'summary': 'The paper presents a real-time global illumination approach for dynamic 3D Gaussian models and meshes. This is relevant to the entertainment and media industry as it enables high-quality, real-time rendering of dynamic scenes with intera

In [12]:

# What % of papers are industry-related? Select those whose first industry is not 'N/A'
industry_related = papers[
    papers['industry_list'].apply(lambda x: len(x) > 0 and x[0]['industry'] != 'N/A')
]
print("Papers related to industry development:", len(industry_related))
print("Percentage of papers related to industry development:", len(industry_related)/len(papers)*100)

Papers related to industry development: 11
Percentage of papers related to industry development: 55.00000000000001


Some industries will contain many papers, and every summary compresses information. We can increase resolution without inflating the summary size by clustering similar papers and summarizing each cluster separately.

To cluster papers together, we first create semantic embeddings.
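Clustering on embeddings works because abstracts with similar meaning map to vectors that lie close together. The toy example below illustrates this with hand-written three-dimensional vectors and cosine similarity. The vectors are made up, real embeddings have far more dimensions, and the distance measure actually used by the utils package is not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embedded abstracts.
rendering_a = [0.9, 0.1, 0.0]
rendering_b = [0.8, 0.2, 0.1]
finance = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(rendering_a, rendering_b)  # two papers on similar topics
sim_far = cosine_similarity(rendering_a, finance)        # two papers on unrelated topics
print(round(sim_close, 3), round(sim_far, 3))
```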

In [17]:
import google.genai as genai_embed
from utils import embedings

collection = lib.get_chroma_collection(name="paper_abstracts", base_path=config['out_dir'])
sample_embeding = lib.get_chroma_record(collection,'2504.01981__Manufacturing')
if sample_embeding is not None:
    print("\n\n", "#" * 20, "ChromaDB collection already exists. Skipping embedding.", sample_embeding)
else:
    print("\n\n","#" * 20, "Embedding papers.", sample_embeding)
    client = genai_embed.Client(api_key=config['GOOGLE_API_KEY']) 
    embedings.embed_papers(client,papers)



 #################### Embedding papers. None
✅ Collection 'paper_abstracts' contains 0 items.
🧠 Found 9 industries to embed.
\n🛠️ Embedding 3 papers for industry: Entertainment & Media
🚀 Embedding 3 papers for Entertainment & Media: ['2503.17897__Entertainment & Media', '2503.17907__Entertainment & Media', '2503.17934__Entertainment & Media']
\n🛠️ Embedding 1 papers for industry: Construction & Real Estate
🚀 Embedding 1 papers for Construction & Real Estate: ['2503.17897__Construction & Real Estate']
\n🛠️ Embedding 3 papers for industry: Healthcare
🚀 Embedding 3 papers for Healthcare: ['2503.17900__Healthcare', '2503.17903__Healthcare', '2503.17933__Healthcare']
\n🛠️ Embedding 1 papers for industry: Social Sciences & Humanities
🚀 Embedding 1 papers for Social Sciences & Humanities: ['2503.17903__Social Sciences & Humanities']
\n🛠️ Embedding 1 papers for industry: Geospatial & Remote Sensing
🚀 Embedding 1 papers for Geospatial & Remote Sensing: ['2503.17907__Geospatial & Remote Sensin

The goal is to create one section per industry.

Each section contains one summary per cluster plus one "General Outlook" summary representing the papers that did not fit any cluster.

We also ask the model to assign a representative title for each cluster.
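As a rough sketch of how cluster labels could be turned into section groups: papers labelled `-1` (the conventional "noise" label in HDBSCAN, and the label that appears in the output of this notebook) go into "General Outlook", and every other label becomes its own group. The function name `group_by_cluster` and the `cluster_<n>` placeholder keys are hypothetical; in the notebook the model later supplies the real cluster titles.

```python
from collections import defaultdict


def group_by_cluster(paper_ids, labels):
    """Group paper ids by cluster label; label -1 becomes 'General Outlook'."""
    groups = defaultdict(list)
    for paper_id, label in zip(paper_ids, labels):
        key = "General Outlook" if label == -1 else f"cluster_{label}"
        groups[key].append(paper_id)
    return dict(groups)


# The Entertainment & Media case from this run: all three papers are unclustered.
media = group_by_cluster(["2503.17897", "2503.17907", "2503.17934"], [-1, -1, -1])

# A hypothetical case with one real cluster and one leftover paper.
mixed = group_by_cluster(["A", "B", "C"], [0, 0, -1])
print(media)
print(mixed)
```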

In [18]:
from utils import summary
summary.get_summaries(papers,config['out_dir'])

🧠 Found 11 industries to summarize. ['N/A', 'Entertainment & Media', 'Other', 'Finance & Banking', 'Healthcare', 'Manufacturing', 'Social Sciences & Humanities', 'Energy & Utilities', 'Construction & Real Estate', 'Cybersecurity', 'Geospatial & Remote Sensing']


Summarizing industry: Entertainment & Media 
get_sections for industry: Entertainment & Media


Clustering papers for industry: Entertainment & Media
           id               industry
0  2503.17897  Entertainment & Media
1  2503.17907  Entertainment & Media
2  2503.17934  Entertainment & Media
🧠 Found 3 papers for industry: Entertainment & Media with 1 clusters
Cluster -1 has 3 papers  ['2503.17897', '2503.17907', '2503.17934']
Small cluster, summarizing directly
Summarizing batch of papers 3  papers
get_cluster_summary prompt: 4617
Title: AI Innovations for High-Fidelity and Efficient Media in Entertainment
returning sections  [{'title': 'General Outlook', 'summary': "These studies highlight AI's role in advancing media cr

[{'industry': 'Entertainment & Media',
  'sections': [{'title': 'General Outlook',
    'summary': "These studies highlight AI's role in advancing media creation and processing for entertainment. A common thread is the use of novel AI techniques, including generative models like diffusion and 3D Gaussian representations, to enhance visual content. Key contributions focus on generating higher-quality, more realistic, and specialized media formats, such as dynamic 3D scenes with complex lighting, human-perceivable images derived from machine-optimized data, and videos incorporating transparency for visual effects. The research emphasizes improving efficiency, enabling real-time performance in rendering and optimizing data transmission through scalable coding. Furthermore, these approaches offer enhanced control and integration capabilities, bridging different data types or generation methods to meet specific industry demands in areas like gaming, visual effects, and content delivery.",
  

Finally, we want to output an HTML file that allows us to preview the data in a more user-friendly way.

In [20]:
from utils import report 
report.generate_html_report(
    os.path.join(config['out_dir'], 'industry_sections.json'),
    os.path.join(config['out_dir'], 'final_report.html')
)

✅ Report generated: /kaggle/working/final_report.html
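`report.generate_html_report` is defined in the utils package and is not shown here. As a minimal sketch of the idea, the section structure printed earlier (a list of industries, each with `sections` holding a `title` and a `summary`) can be rendered with the standard library alone. The function name `render_sections` and the choice of HTML tags are assumptions.

```python
import html


def render_sections(industries):
    """Render industries and their sections as a simple HTML page."""
    parts = ["<html><body>"]
    for entry in industries:
        # Escape all text so characters such as '&' do not break the markup.
        parts.append(f"<h1>{html.escape(entry['industry'])}</h1>")
        for section in entry["sections"]:
            parts.append(f"<h2>{html.escape(section['title'])}</h2>")
            parts.append(f"<p>{html.escape(section['summary'])}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


sample = [
    {
        "industry": "Entertainment & Media",
        "sections": [
            {
                "title": "General Outlook",
                "summary": "These studies highlight AI's role in advancing media "
                           "creation and processing for entertainment.",
            }
        ],
    }
]
page = render_sections(sample)
print(page)
```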
