# GPT5nano – Clean Annotated Notebook
*Last updated: October 28, 2025*

This notebook has been organized for clarity and presentation. Section headers explain the purpose of each step so a reviewer can quickly follow the workflow and rationale.

**What’s included:**
- Clear section headers injected before relevant code blocks
- Lightweight explanations to guide readers
- A tidy footer with recommended next steps

---


<details>
<summary><strong>Table of Contents</strong></summary>

1. Setup & Imports  
2. Data Loading  
3. Next Steps  

</details>


### Setup & Imports
Import the OpenAI client, authenticate, and send a test request to confirm the connection works.


In [6]:
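import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding it in the notebook.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Send a minimal request to confirm the client and model are reachable.
response = client.responses.create(
    model="gpt-5-nano-2025-08-07",
    input="Hello",
    store=True,
)

print(response.output_text)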

Hi there! 👋 How can I help today? I can explain concepts, answer questions, help with writing or editing, brainstorm ideas, debug code, plan a project, translate, or just chat. Tell me what you’re working on or ask me something specific.


### Data Loading
Load datasets or artifacts and validate shapes/schemas.


In [22]:
import numpy as np
import pandas as pd

path = r'C:\Users\dbal\anaconda_projects\PotentialTalentsNLP\potentialtalents.csv'
df = pd.read_csv(path)
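The section description mentions validating shapes and schemas, but the cell above only reads the file. A minimal sketch of such a check follows. The helper name `validate_candidates` is hypothetical, the required columns are the ones the later cells reference, and the sample rows are made up so the block runs without the CSV.

```python
import pandas as pd

# Columns the later cells read; taken from the notebook's own column references.
REQUIRED_COLUMNS = ["id", "job_title", "location", "connection"]


def validate_candidates(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> pd.DataFrame:
    """Hypothetical helper: fail fast on missing columns, drop rows without a job title."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Rows with no job title cannot be embedded, so drop them up front.
    return df.dropna(subset=["job_title"]).reset_index(drop=True)


# Tiny in-memory stand-in for the CSV (illustrative values only).
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "job_title": ["HR Coordinator", None, "Data Analyst"],
    "location": ["A", "B", "C"],
    "connection": ["85", "500+", "12"],
})
validated = validate_candidates(sample)
print(validated.shape)  # (2, 4)
```

In the notebook this could be called as `df = validate_candidates(df)` right after `pd.read_csv`.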

In [33]:
search_term = "Aspiring Human Resources Specialist"

# Get embedding for the search term
search_embedding = client.embeddings.create(
    model="text-embedding-3-large",
    input=[search_term]
).data[0].embedding

search_embedding = np.array(search_embedding)  # Convert to NumPy array for easier computation

In [32]:
def get_embeddings(texts, model="text-embedding-3-large"):
    """Embed a list of strings in a single API request; returns one NumPy vector per input."""
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(item.embedding) for item in response.data]

# Get embeddings for all job titles
job_titles = df['job_title'].tolist()
title_embeddings = get_embeddings(job_titles)

# Add embeddings to the DataFrame
df['title_embedding'] = title_embeddings
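`get_embeddings` sends every title in one request, which is fine for a small file but may hit request-size limits on larger ones. The sketch below shows one way to chunk the inputs while preserving order. `embed_in_batches` and `fake_embed` are hypothetical names, the batch size of 256 is an arbitrary assumption, and the stand-in embedder exists only so the example runs offline.

```python
from typing import Callable, List, Sequence

import numpy as np


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 256,  # assumed value; tune to the provider's limits
) -> List[np.ndarray]:
    """Hypothetical helper: call embed_fn on fixed-size chunks and keep input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        embeddings.extend(np.asarray(vec, dtype=float) for vec in embed_fn(chunk))
    return embeddings


# Stand-in embedder so the sketch runs offline; it is NOT a real model.
def fake_embed(chunk: List[str]) -> List[List[float]]:
    return [[float(len(text)), 1.0] for text in chunk]


vectors = embed_in_batches(["hr", "data", "sales lead"], fake_embed, batch_size=2)
print(len(vectors))  # 3
```

In the notebook, `embed_fn` could wrap the existing call, for example a function that returns `[item.embedding for item in client.embeddings.create(model=model, input=chunk).data]`.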

In [27]:
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))
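The per-pair function above works, but all scores can also be computed in one matrix operation. This is a sketch rather than the notebook's method: `cosine_similarity_matrix` is a hypothetical name, and returning 0 for all-zero vectors is an assumption made to avoid division by zero.

```python
import numpy as np


def cosine_similarity_matrix(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `matrix` and the vector `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Zero vectors have no direction; report similarity 0 instead of dividing by zero.
    safe = np.where(denom == 0, 1.0, denom)
    return np.where(denom == 0, 0.0, (matrix @ query) / safe)


rows = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]])
scores = cosine_similarity_matrix(rows, np.array([1.0, 0.0]))
print(scores.round(4))
```

With the notebook's data, the matrix could be built with `np.vstack(df['title_embedding'].tolist())` and the query would be `search_embedding`.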

In [28]:
# Compute cosine similarity between each title embedding and the search embedding
df['relevance_score'] = df['title_embedding'].apply(
    lambda emb: cosine_similarity(emb, search_embedding)
)

# Optional: min-max normalize scores to 0-100 (overwrites the raw cosine values)
score_min = df['relevance_score'].min()
score_max = df['relevance_score'].max()
df['relevance_score'] = 100 * (df['relevance_score'] - score_min) / (score_max - score_min)
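The min-max formula divides by `max - min`, which is zero when every candidate has the same score. It also makes the scale relative to the pool: 0 means "least similar in this file", not "unrelated". A guarded version might look like the sketch below. `min_max_scale` is a hypothetical helper, and mapping a constant column to the upper bound is an arbitrary choice.

```python
import pandas as pd


def min_max_scale(scores: pd.Series, upper: float = 100.0) -> pd.Series:
    """Rescale scores linearly to [0, upper]; constant input maps to `upper`."""
    span = scores.max() - scores.min()
    if span == 0:
        # All scores are equal, so there is no ordering to preserve; use a constant.
        return pd.Series(upper, index=scores.index, dtype=float)
    return upper * (scores - scores.min()) / span


print(min_max_scale(pd.Series([0.25, 0.5, 0.75])).tolist())  # [0.0, 50.0, 100.0]
print(min_max_scale(pd.Series([0.4, 0.4])).tolist())         # [100.0, 100.0]
```

Storing the result in a separate column keeps the raw cosine values available for later thresholding.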

In [29]:
# Rank by relevance
ranked_df = df.sort_values(by='relevance_score', ascending=False).reset_index(drop=True)

# Add rank column
ranked_df['rank'] = ranked_df.index + 1

# Preview top 10
print(ranked_df[['id', 'job_title', 'location', 'connection', 'relevance_score', 'rank']].head(10))

   id                              job_title  \
0   6    Aspiring Human Resources Specialist   
1  36    Aspiring Human Resources Specialist   
2  49    Aspiring Human Resources Specialist   
3  24    Aspiring Human Resources Specialist   
4  60    Aspiring Human Resources Specialist   
5  58  Aspiring Human Resources Professional   
6  46  Aspiring Human Resources Professional   
7  33  Aspiring Human Resources Professional   
8   3  Aspiring Human Resources Professional   
9  17  Aspiring Human Resources Professional   

                              location connection  relevance_score  rank  
0           Greater New York City Area          1       100.000000     1  
1           Greater New York City Area          1       100.000000     2  
2           Greater New York City Area          1        99.999988     3  
3           Greater New York City Area          1        99.999980     4  
4           Greater New York City Area          1        99.999980     5  
5  Raleigh-Durham, No

In [49]:
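# Keep candidates above a cutoff on the 0-100 normalized scale.
# Note: a cutoff of 5 excludes very little; the output below lists 99 rows, including unrelated titles.
threshold = 5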
hr_candidates = ranked_df[ranked_df['relevance_score'] > threshold]

# Print predicted HR candidates
print("Predicted HR-Based Candidates:")
print(hr_candidates[['id', 'job_title', 'location', 'connection', 'relevance_score']])

# Save to new CSV if needed
hr_candidates.to_csv('hr_candidates_ranked.csv', index=False)

Predicted HR-Based Candidates:
    id                                          job_title  \
0    6                Aspiring Human Resources Specialist   
1   36                Aspiring Human Resources Specialist   
2   49                Aspiring Human Resources Specialist   
3   24                Aspiring Human Resources Specialist   
4   60                Aspiring Human Resources Specialist   
..  ..                                                ...   
94  16  Native English Teacher at EPIK (English Progra...   
95  20  Native English Teacher at EPIK (English Progra...   
96  91       Lead Official at Western Illinois University   
97  85  RRP Brand Portfolio Executive at JTI (Japan To...   
98  87  Bachelor of Science in Biology from Victoria U...   

                      location connection  relevance_score  
0   Greater New York City Area          1       100.000000  
1   Greater New York City Area          1       100.000000  
2   Greater New York City Area          1        99.9
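Because the normalized score is relative to the candidate pool, a low fixed cutoff admits titles that have little to do with HR. One alternative is to filter on the raw cosine value, keep only the top k rows, or both. This is a sketch under assumptions: `select_candidates` and the `raw_similarity` column are hypothetical names, and the demo scores are made up.

```python
import pandas as pd


def select_candidates(df, score_col="raw_similarity", min_score=None, top_k=None):
    """Hypothetical helper: apply a raw-score cutoff, then optionally keep the best k rows."""
    selected = df
    if min_score is not None:
        selected = selected[selected[score_col] >= min_score]
    selected = selected.sort_values(score_col, ascending=False)
    if top_k is not None:
        selected = selected.head(top_k)
    return selected.reset_index(drop=True)


demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job_title": ["HR Specialist", "English Teacher", "HR Generalist", "Biologist"],
    "raw_similarity": [0.91, 0.35, 0.62, 0.20],  # made-up values for illustration
})
shortlist = select_candidates(demo, min_score=0.3, top_k=2)
print(shortlist["id"].tolist())  # [1, 3]
```

A sensible `min_score` depends on the embedding model and the data, so it is worth inspecting the score distribution before fixing a value.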

In [42]:
# Keep the best-scoring row for each (job_title, location) pair
df_unique = (
    hr_candidates
    .sort_values('relevance_score', ascending=False)
    .drop_duplicates(subset=['job_title', 'location'], keep='first')
    .reset_index(drop=True)
)
df_unique['rank'] = df_unique.index + 1

# Round relevance_score for cleaner display
df_unique['relevance_score'] = df_unique['relevance_score'].round(2)

# Select display columns; copy so renaming does not touch df_unique
df_display = df_unique[['rank', 'id', 'job_title', 'location', 'connection', 'relevance_score']].copy()
df_display.columns = ['Rank', 'ID', 'Job Title', 'Location', 'Connections', 'Relevance Score']

# Generate Markdown table (to_markdown requires the tabulate package)
markdown_table = df_display.to_markdown(index=False, tablefmt='pipe', stralign='left')

# Print and save the table
print(markdown_table)
with open('ranked_candidates.md', 'w', encoding='utf-8') as f:
    f.write(markdown_table)

|   Rank |   ID | Job Title                                                                                                             | Location                            | Connections   |   Relevance Score |
|-------:|-----:|:----------------------------------------------------------------------------------------------------------------------|:------------------------------------|:--------------|------------------:|
|      1 |    6 | Aspiring Human Resources Specialist                                                                                   | Greater New York City Area          | 1             |            100    |
|      2 |   58 | Aspiring Human Resources Professional                                                                                 | Raleigh-Durham, North Carolina Area | 44            |             85.05 |
|      3 |   97 | Aspiring Human Resources Professional                                                                                 | Kokomo, Indian

---

## Next Steps

- Add a short **Experiment Config** cell that prints model/version, data snapshot, and seed.
- Include **error analysis** (top FP/FN or low‑confidence cases) where relevant.
- Provide a minimal **inference demo** with input/output schemas.
- Export a final **results summary** for easy sharing (tables/plots).
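
The first item above could start from something like the sketch below. The class and function names are hypothetical. The model names, search term, and threshold are the ones used in this notebook, and the data snapshot is represented by a SHA-256 hash of the file bytes. No seed is recorded because the workflow shown here has no random step.

```python
import hashlib
import json
import platform
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    chat_model: str
    embedding_model: str
    search_term: str
    threshold: float
    data_sha256: str
    python_version: str


def build_config(data_bytes: bytes) -> ExperimentConfig:
    """Hypothetical helper: collect the settings that determine a run's results."""
    return ExperimentConfig(
        chat_model="gpt-5-nano-2025-08-07",
        embedding_model="text-embedding-3-large",
        search_term="Aspiring Human Resources Specialist",
        threshold=5,
        data_sha256=hashlib.sha256(data_bytes).hexdigest(),
        python_version=platform.python_version(),
    )


# Illustrative bytes only; in the notebook these would be the CSV file's contents.
config = build_config(b"id,job_title\n1,HR Coordinator\n")
print(json.dumps(asdict(config), indent=2))
```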
