## OpenAI label generation for topics obtained by BERTopic

Using the keywords (KeyBERT, MMR, POS, and default representations from the `tmod2` model) and the representative docs (top 20 per topic, stored in `rep_docs2.json`) obtained in `5_1_Bertopic.ipynb`, we now generate a label for each topic that captures the overall context of its headlines.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import json

### importing representative docs

In [None]:
repdocs = {}
with open('datasets/rep_docs2.json') as f:
    repdocs = json.load(f)

In [None]:
repdocs

{'0': ['Best Practice Insights for Using AI in Healthcare',
  'Advancements in pediatrics: Uses of ai in mental health diagnosis and treatment',
  'AI in Health Care: Powering Patient Outcomes',
  'AI in healthcare: The future of patient care and health management',
  'Health Equity and Ethical Considerations in Using ai in Public Health and Medicine',
  'Explainable ai in breast cancer detection and risk prediction: A systematic scoping review',
  'NHS to trial AI tool that predicts health risks and early death',
  'New APA CEO on uses of ai in mental health, the future of psychiatry and more',
  'ai in Health Care',
  'AI-based selection of individuals for supplemental MRI in population-based breast cancer screening: the randomized ScreenTrustMRI trial',
  'Harnessing ai (AI) in Anaesthesiology: Enhancing Patient Outcomes and Clinical Efficiency',
  'Development and validation of ai -based analysis software to support screening system of cervical intraepithelial neoplasia',
  'Custom

### preparing keywords

In [2]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)
Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)

In [3]:
from bertopic import BERTopic

In [4]:
tmod = BERTopic.load('tmod2')

In [9]:
tmod.embedding_model,tmod.umap_model,tmod.hdbscan_model,tmod.vectorizer_model,tmod.ctfidf_model

(<bertopic.backend._sentencetransformers.SentenceTransformerBackend at 0x7c9517dcae90>,
 UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=100, n_jobs=1, random_state=42, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}),
 HDBSCAN(min_cluster_size=20, prediction_data=True),
 CountVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english'),
 ClassTfidfTransformer(reduce_frequent_words=True))

In [None]:
df = tmod.get_topic_info()

In [None]:
df.head()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,3310,-1_human_chief_good_ibm,"[human, chief, good, ibm, officer, ai officer,...","[ai officer, ai policy, launches ai, chief ai,...","[human, ibm, ai officer, chief ai, sap, white ...","[human, chief, good, officer, house, machine, ...",[News | ESSC | UAH scientist earns National ai...
1,0,808,0_health_care_healthcare_medical,"[health, care, healthcare, medical, medicine, ...","[healthcare ai, ai healthcare, ai medical, hea...","[healthcare, ai healthcare, ai health, healthc...","[health, care, healthcare, medical, medicine, ...","[ai in Health Care, AI in Health Care: Powerin..."
2,1,412,1_education_schools_school_students,"[education, schools, school, students, classro...","[ai education, ai schools, ai academic, ai cla...","[education, schools, students, ai education, a...","[education, schools, school, students, classro...",[Exploring the impact of ai on higher educatio...
3,2,281,2_generative ai_generative_ai generative_use g...,"[generative ai, generative, ai generative, use...","[ai generative, generative ai, ai creative, br...","[generative ai, ai generative, use generative,...","[generative, usage, enterprise, creative, diff...",[Executive Conversations: Putting generative A...
4,3,260,3_jobs_job_hr_hiring,"[jobs, job, hr, hiring, workers, ai jobs, empl...","[ai workplace, ai jobs, workforce ai, ai job, ...","[jobs, hr, ai jobs, employers, ai workplace, w...","[jobs, job, hiring, workers, employers, employ...",[AI will affect 40% of jobs and probably worse...


In [None]:
df['keywords'] = df['KeyBERT'] + df['MMR'] + df['POS'] + df['Representation']  # concatenate the four keyword lists of each topic

In [None]:
df['keywords']=df['keywords'].apply(set)
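Converting to a `set` removes duplicates, but the iteration order of a set of strings can differ between interpreter sessions, so the keyword order inside each prompt may change from run to run. If a stable order matters, one option is an order-preserving merge. The sketch below is an illustration only: `merge_keywords` is a hypothetical helper and the four lists are toy values, not rows from `df`.

```python
def merge_keywords(*keyword_lists):
    """Concatenate keyword lists and drop duplicates, keeping first-seen order."""
    merged = [kw for kws in keyword_lists for kw in kws]
    # dict keys keep insertion order, so duplicates are dropped deterministically
    return list(dict.fromkeys(merged))


# Toy values standing in for one row of the KeyBERT, MMR, POS and Representation columns
keybert = ["healthcare ai", "ai healthcare", "ai medical"]
mmr = ["healthcare", "ai healthcare", "ai health"]
pos = ["health", "care", "healthcare"]
representation = ["health", "care", "healthcare", "medical"]

keywords = merge_keywords(keybert, mmr, pos, representation)
print(keywords)
```

Applied row-wise to `df`, a helper like this would replace the `apply(set)` step while keeping the KeyBERT keywords first.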

In [None]:
repdocs['0']

['Best Practice Insights for Using AI in Healthcare',
 'Advancements in pediatrics: Uses of ai in mental health diagnosis and treatment',
 'AI in Health Care: Powering Patient Outcomes',
 'AI in healthcare: The future of patient care and health management',
 'Health Equity and Ethical Considerations in Using ai in Public Health and Medicine',
 'Explainable ai in breast cancer detection and risk prediction: A systematic scoping review',
 'NHS to trial AI tool that predicts health risks and early death',
 'New APA CEO on uses of ai in mental health, the future of psychiatry and more',
 'ai in Health Care',
 'AI-based selection of individuals for supplemental MRI in population-based breast cancer screening: the randomized ScreenTrustMRI trial',
 'Harnessing ai (AI) in Anaesthesiology: Enhancing Patient Outcomes and Clinical Efficiency',
 'Development and validation of ai -based analysis software to support screening system of cervical intraepithelial neoplasia',
 'Customizable AI tool dev

In [None]:
df['keywords'].iloc[1]

{'ai based',
 'ai health',
 'ai healthcare',
 'ai insurance',
 'ai medical',
 'ai mental',
 'care',
 'care ai',
 'clinical',
 'disease',
 'doctors',
 'drug',
 'health',
 'health ai',
 'healthcare',
 'healthcare ai',
 'medical',
 'medicine',
 'mental health',
 'patient',
 'screening',
 'treatment',
 'using ai'}

### preparing prompts

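**The prompt used is:** "I have topic that contains the following headlines: " + docs + ". The topic is described by the following keywords: " + kwords + ". Based on the above information, can you give a short label of the topic?" Here `docs` is the topic's representative headlines joined with `. ` and `kwords` is its keywords joined with `, `.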

In [None]:
prompts = []  # one prompt per topic, in topic order

# Topics 0-87; row 0 of df is the outlier topic -1, hence the i+1 offset
for i in range(88):
    docs = '. '.join(repdocs[str(i)])  # representative headlines of topic i
    kwords = ', '.join(df['keywords'].iloc[i+1])  # merged keywords of topic i
    s = "I have topic that contains the following headlines: "+ docs+ '. The topic is described by the following keywords: '+ kwords+ '. Based on the above information, can you give a short label of the topic?'
    prompts.append(s)
prompts

['I have topic that contains the following headlines: Best Practice Insights for Using AI in Healthcare. Advancements in pediatrics: Uses of ai in mental health diagnosis and treatment. AI in Health Care: Powering Patient Outcomes. AI in healthcare: The future of patient care and health management. Health Equity and Ethical Considerations in Using ai in Public Health and Medicine. Explainable ai in breast cancer detection and risk prediction: A systematic scoping review. NHS to trial AI tool that predicts health risks and early death. New APA CEO on uses of ai in mental health, the future of psychiatry and more. ai in Health Care. AI-based selection of individuals for supplemental MRI in population-based breast cancer screening: the randomized ScreenTrustMRI trial. Harnessing ai (AI) in Anaesthesiology: Enhancing Patient Outcomes and Clinical Efficiency. Development and validation of ai -based analysis software to support screening system of cervical intraepithelial neoplasia. Customiz

In [None]:
prompts[0]

'I have topic that contains the following headlines: Best Practice Insights for Using AI in Healthcare. Advancements in pediatrics: Uses of ai in mental health diagnosis and treatment. AI in Health Care: Powering Patient Outcomes. AI in healthcare: The future of patient care and health management. Health Equity and Ethical Considerations in Using ai in Public Health and Medicine. Explainable ai in breast cancer detection and risk prediction: A systematic scoping review. NHS to trial AI tool that predicts health risks and early death. New APA CEO on uses of ai in mental health, the future of psychiatry and more. ai in Health Care. AI-based selection of individuals for supplemental MRI in population-based breast cancer screening: the randomized ScreenTrustMRI trial. Harnessing ai (AI) in Anaesthesiology: Enhancing Patient Outcomes and Clinical Efficiency. Development and validation of ai -based analysis software to support screening system of cervical intraepithelial neoplasia. Customiza
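The loop above pairs headlines and keywords by position (`iloc[i+1]`), which works only while the outlier topic stays in row 0. A possible alternative is to key everything by topic id. The sketch below assumes a hypothetical helper `build_prompt` and toy dictionaries in place of `repdocs` and `df['keywords']`; the prompt wording is copied unchanged from the notebook.

```python
def build_prompt(headlines, keywords):
    """Fill the labelling prompt with one topic's headlines and keywords."""
    docs = ". ".join(headlines)
    kwords = ", ".join(keywords)
    return (
        "I have topic that contains the following headlines: " + docs
        + ". The topic is described by the following keywords: " + kwords
        + ". Based on the above information, can you give a short label of the topic?"
    )


# Toy stand-ins for the representative docs and keywords, keyed by topic id
toy_docs = {0: ["AI in Health Care", "AI in healthcare: The future of patient care"]}
toy_keywords = {0: ["health", "care"]}

toy_prompts = {
    topic: build_prompt(toy_docs[topic], toy_keywords[topic])
    for topic in toy_docs
}
print(toy_prompts[0])
```

With prompts stored in a dictionary keyed by topic id, the later mapping back onto the `Topic` column needs no index arithmetic.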

### openAI prompt implementation

In [None]:
pip install openai



In [None]:
from openai import OpenAI

client = OpenAI(
  api_key="Enter your openAI API key"
)

completion = client.chat.completions.create(
  model="gpt-4o-mini",
  store=True,
  messages=[
    {"role": "user", "content": prompts[0]}
  ]
)

print(completion.choices[0].message.content)

"Transformative Applications of AI in Healthcare: Enhancing Diagnosis, Treatment, and Patient Outcomes"


In [None]:
# Store responses, keyed by topic id
responses = {}

for i, prompt in enumerate(prompts):
    # Generate completion for each prompt
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,  # Save interactions for future analysis
        temperature=0,  # Reduce sampling randomness (identical output is still not guaranteed)
        messages=[{"role": "user", "content": prompt}]
    )
    # Store the response content
    responses[i] = completion.choices[0].message.content
    print(f"response for prompt {i}: {responses[i]}")

# Save responses to a JSON file for future use (integer keys are written as strings)
with open("responses.json", "w") as file:
    json.dump(responses, file, indent=4)

# Print the responses for verification
print("Responses saved:", responses)

response for prompt 0: "AI Innovations in Healthcare: Enhancing Patient Care, Diagnosis, and Treatment"
response for prompt 1: "Integrating AI in Education: Opportunities, Challenges, and Innovations"
response for prompt 2: "Exploring the Impact and Adoption of Generative AI Across Industries"
response for prompt 3: "Impact of AI on Job Security and Workforce Dynamics"
response for prompt 4: "AI Chip Market Competition: Intel, AMD, and Alternative Stocks to Nvidia"
response for prompt 5: "Impact of AI on Search Engines: Google's Innovations and Challenges"
response for prompt 6: "Wall Street Insights: Promising AI Stocks and Investment Opportunities"
response for prompt 7: "AI Chatbots: Innovations, Controversies, and Legal Challenges"
response for prompt 8: "AI Startup Funding and Market Growth Trends"
response for prompt 9: "AI in the Legal Field: Impacts, Challenges, and Ethical Considerations"
response for prompt 10: "AI's Impact on the 2024 Presidential Election: Challenges and Co
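The labels come back wrapped in double quotes, and a long loop of API calls can be interrupted by a transient error. A minimal sketch of one way to handle both follows. It assumes two hypothetical helpers, `clean_label` and `label_with_retry`, and takes the API call as an injected function so the example runs without a network connection; `fake_complete` is a stand-in, not the OpenAI client.

```python
import time


def clean_label(text):
    """Trim whitespace and one pair of wrapping double quotes from a reply."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1].strip()
    return text


def label_with_retry(prompt, complete, max_attempts=3, base_delay=1.0):
    """Call complete(prompt), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return clean_label(complete(prompt))
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)


# Stand-in for the API call: fails once, then returns a quoted label
calls = {"n": 0}


def fake_complete(prompt):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated transient error")
    return '"AI Startup Funding and Market Growth Trends"'


label = label_with_retry("example prompt", fake_complete, base_delay=0.0)
print(label)
```

In the notebook, `complete` would be a small function that calls `client.chat.completions.create(...)` with the same arguments as the loop and returns `completion.choices[0].message.content`.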

In [None]:
responses[-1] = ''  # the outlier topic (-1) was not sent to the model, so it gets an empty label

In [None]:
df['openAI'] = df['Topic'].map(responses)
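The mapping above works because `responses` still holds integer keys in memory. If the labels are instead reloaded from `responses.json` in a later session, the keys come back as strings and would no longer match the integer `Topic` column. The sketch below shows the round trip on toy data (`toy_responses` is an illustrative name) and one way to restore integer keys.

```python
import json

# Toy data with integer topic ids, including the outlier topic -1
toy_responses = {-1: "", 0: "label a", 1: "label b"}

# JSON object keys are always strings, so a save/load round trip changes the key type
restored = json.loads(json.dumps(toy_responses))
print(list(restored))

# Convert the keys back to integers before mapping onto the Topic column
labels = {int(topic): label for topic, label in restored.items()}
print(labels)
```

After this conversion, `df['Topic'].map(labels)` would behave like the in-memory version.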

In [None]:
df.head()

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs,keywords,openAI
0,-1,3310,-1_human_chief_good_ibm,"[human, chief, good, ibm, officer, ai officer,...","[ai officer, ai policy, launches ai, chief ai,...","[human, ibm, ai officer, chief ai, sap, white ...","[human, chief, good, officer, house, machine, ...",[News | ESSC | UAH scientist earns National ai...,"{white, ai officer, sam altman, officer, chief...",
1,0,808,0_health_care_healthcare_medical,"[health, care, healthcare, medical, medicine, ...","[healthcare ai, ai healthcare, ai medical, hea...","[healthcare, ai healthcare, ai health, healthc...","[health, care, healthcare, medical, medicine, ...","[ai in Health Care, AI in Health Care: Powerin...","{drug, health, screening, mental health, care,...","""AI Innovations in Healthcare: Enhancing Patie..."
2,1,412,1_education_schools_school_students,"[education, schools, school, students, classro...","[ai education, ai schools, ai academic, ai cla...","[education, schools, students, ai education, a...","[education, schools, school, students, classro...",[Exploring the impact of ai on higher educatio...,"{ai learn, ai higher, ai schools, teaching, hu...","""Integrating AI in Education: Opportunities, C..."
3,2,281,2_generative ai_generative_ai generative_use g...,"[generative ai, generative, ai generative, use...","[ai generative, generative ai, ai creative, br...","[generative ai, ai generative, use generative,...","[generative, usage, enterprise, creative, diff...",[Executive Conversations: Putting generative A...,"{use generative, usage, guide generative, ai g...","""Exploring the Impact and Adoption of Generati..."
4,3,260,3_jobs_job_hr_hiring,"[jobs, job, hr, hiring, workers, ai jobs, empl...","[ai workplace, ai jobs, workforce ai, ai job, ...","[jobs, hr, ai jobs, employers, ai workplace, w...","[jobs, job, hiring, workers, employers, employ...",[AI will affect 40% of jobs and probably worse...,"{job ai, ai workplace, workforce ai, ai job, j...","""Impact of AI on Job Security and Workforce Dy..."


In [None]:
df.to_csv('datasets/topic_info2.csv') # storing the topics information
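Note that a CSV file stores the list and set columns as their text representation, so they are plain strings when the file is read back. One way to recover the Python objects is `ast.literal_eval`. The sketch below uses a hypothetical helper, `parse_cell`, on toy cell values.

```python
import ast


def parse_cell(text):
    """Turn the text form of a list or set back into a Python object.

    Falls back to the original string when the cell is not a Python literal.
    """
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


print(parse_cell("['health', 'care']"))
print(parse_cell("{'ai health', 'care'}"))
print(parse_cell("0_health_care_healthcare_medical"))
```

When reloading, such a helper could be passed through the `converters` argument of `pd.read_csv` for the list and set columns only (for example `keywords` and `KeyBERT`); applying it to the `openAI` column would also strip the quotes around the labels.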