# Final Project - Part 2

## 1. Title: “CareerChat: Your AI Job Outreach Assistant”

Team:
1.   Anshika Bajpai
2.   Brendan Kelly
3.   Cassie Cagwin

11/9/2025

## 2. Preprocessing 30pts

---

Provide all essential steps that you deem necessary for your application

### Parse resume to build the user's profile.

We first tried several Python PDF-parsing packages combined with regex parsing. On our test resumes this worked very well for name and email and somewhat well for education and skills, but it failed to capture job information reliably because resume formatting varies so much. After reading several articles on the topic, we tried OpenAI's API to extract the resume information, which improved our results dramatically.

We used this reference: https://platform.openai.com/docs/api-reference/
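
A minimal sketch of what regex-based extraction can look like for a well-structured field such as email. The pattern and the helper name `extract_email` are illustrative assumptions, not the exact code we ran. Fields with a predictable shape suit this approach, while free-form job history does not.

```python
import re

# Illustrative, simplified pattern (an assumption; not a full RFC 5322 matcher)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_email(resume_text):
    """Return the first email-like string in the text, or None if there is none."""
    match = EMAIL_RE.search(resume_text)
    return match.group(0) if match else None

# Made-up sample line standing in for text pulled out of a PDF
sample = "Jane Doe | jane.doe@example.com | Bloomington, IN"
print(extract_email(sample))
```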

In [None]:
import os, time
from getpass import getpass
from google.colab import files
from openai import OpenAI
import json

In [None]:
#Enter OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = OpenAI()

Enter your OpenAI API key: ··········


In [None]:
#Upload resume
print("Please upload your resume as a pdf.")
uploaded = files.upload()
pdf_path = next(iter(uploaded))

Please upload your resume as a pdf.


Saving Anshika_old.pdf to Anshika_old (2).pdf


In [None]:
#Create vector store for resume information.
resume_vector_store = client.vector_stores.create(
    name="resume_store"
)

file_obj = client.files.create(
    file=open(pdf_path, "rb"),
    purpose="assistants"
)

client.vector_stores.files.create(
    vector_store_id=resume_vector_store.id,
    file_id=file_obj.id
)

VectorStoreFile(id='file-G4Ym5bEshhEuMGhbHrVy4w', created_at=1762721284, last_error=None, object='vector_store.file', status='in_progress', usage_bytes=0, vector_store_id='vs_6910fe019b6881919d2837e3f3182c53', attributes={}, chunking_strategy=StaticFileChunkingStrategyObject(static=StaticFileChunkingStrategy(chunk_overlap_tokens=400, max_chunk_size_tokens=800), type='static'))

In [None]:
vector_store_files = client.vector_stores.files.list(
  vector_store_id=resume_vector_store.id
)
print(vector_store_files)

SyncCursorPage[VectorStoreFile](data=[], has_more=False, object='list', first_id=None, last_id=None)


In [None]:
#Set up an assistant using gpt-4.1-mini and the resume's vector store
assistant = client.beta.assistants.create(
    name="recruiting assistant",
    model="gpt-4.1-mini",
    tools=[{"type": "file_search"}],
    tool_resources={
        "file_search": { "vector_store_ids": [resume_vector_store.id] }
    }
)

In [None]:
#Get file id for the uploaded resume (the vector store listing can still be empty while the file is processing)
file_id = file_obj.id

#Create thread and prompt for resume json
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content=(
        "Please use the attached resume to extract this information:\n"
        "- name\n- email\n- education (degree, field, institution, graduation_date if present)\n"
        "- skills (list)\n"
        "- work_experience (title, company, start_date, end_date, responsibilities list)\n\n"
        "Respond ONLY with a single valid JSON object. Only include information from the resume, don't include markdown, don't include prose."
    ),
    attachments=[
        {
            "file_id": file_id,
            "tools": [{"type": "file_search"}]
        }
    ],
)

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

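# We need to make sure it completes before the next step (we got errors otherwise)
while run.status in ("queued", "in_progress"):
    time.sleep(5)
    run = client.beta.threads.runs.retrieve(
        thread_id=thread.id,
        run_id=run.id
    )

# Stop if the run failed, was cancelled, or expired instead of looping forever
if run.status != "completed":
    raise RuntimeError(f"Run ended with status: {run.status}")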




In [None]:
msgs = client.beta.threads.messages.list(thread_id=thread.id)
text = None
for m in msgs.data:
    if m.role == "assistant":
        for part in m.content:
            if part.type == "text":
                text = part.text.value.strip()



In [None]:
#Formatting json
start = text.find("{")
end = text.rfind("}") + 1
json_str = text[start:end]
data = json.loads(json_str)
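
The brace-slicing step above assumes the reply always contains a valid JSON object. A slightly more defensive variant is sketched here. The helper name `extract_json_object` is hypothetical, and returning `None` on failure is one possible design choice, not necessarily the right one for the final app.

```python
import json

def extract_json_object(reply_text):
    """Parse the outermost {...} span of a model reply; return None if that fails."""
    if not reply_text:
        return None  # no assistant text was captured
    start = reply_text.find("{")
    end = reply_text.rfind("}")
    if start == -1 or end < start:
        return None  # no JSON object found in the reply
    try:
        return json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError:
        return None  # braces were present but the content was not valid JSON

# Works even if the model wraps the object in a markdown fence
print(extract_json_object('```json\n{"name": "Jane Doe"}\n```'))
```

Returning `None` lets the caller decide whether to retry the run or ask the user to re-upload the resume.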

In [None]:
#Printing json
print(json.dumps(data, indent=2, ensure_ascii=False))

{
  "name": "Anshika Bajpai",
  "email": "anshikabajpai23@gmail.com",
  "education": [
    {
      "degree": "Master of Science",
      "field": "Data Science",
      "institution": "Indiana University Bloomington",
      "graduation_date": "May 2026"
    },
    {
      "degree": "Bachelor of Technology",
      "field": "Computer Science and Engineering",
      "institution": "Jaypee Institute of Information Technology",
      "graduation_date": "May 2021"
    }
  ],
  "skills": [
    "Python",
    "R",
    "SQL",
    "Java",
    "C++",
    "C",
    "PostgreSQL",
    "BigQuery",
    "Jenkins",
    "Git",
    "GCP",
    "DevOps",
    "Docker",
    "Kubernetes",
    "PySpark",
    "Data Visualisation",
    "Prometheus",
    "Data Integration",
    "AWS",
    "GCP",
    "PyTorch",
    "NLP",
    "Computer Vision",
    "Neural Networks (including CNN, RNN, LSTM, GRU, FeedForward Networks)",
    "Deep Learning",
    "Generative AI",
    "Recommendation System",
    "MLOps",
    "Anomaly Det

In [None]:
#Save json to file called payload.json
with open("payload.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

We saved the resume information as "payload.json" for later use by our application.

In our final application we will offer users a chance to review how their resume was parsed in a form and make manual changes as desired before moving on to the next step.
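
Before showing the parsed resume in a review form, it may help to normalize the payload so every expected field exists and repeated skills (such as "GCP" in the output above) appear once. The sketch below is one way to do that. The helper name `normalize_profile` and the choice of empty defaults are assumptions, and it assumes skills are plain strings.

```python
# Fields requested in the extraction prompt, with an empty default for each
EXPECTED_FIELDS = {
    "name": "",
    "email": "",
    "education": [],
    "skills": [],
    "work_experience": [],
}

def normalize_profile(data):
    """Return a copy of the parsed resume with all expected fields present."""
    profile = {}
    for key, default in EXPECTED_FIELDS.items():
        value = data.get(key)
        # Fall back to a fresh empty string or list when the field is missing or empty
        profile[key] = value if value else type(default)()
    # Drop repeated skills while keeping first-seen order
    profile["skills"] = list(dict.fromkeys(profile["skills"]))
    return profile

print(normalize_profile({"name": "Jane Doe", "skills": ["GCP", "AWS", "GCP"]}))
```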

## 3. Feature Extraction 30pts

---

Implement any existing feature extraction tools and methods (term frequency, word embeddings etc)

For feature extraction, we built a process to obtain and summarize news articles about the company the user is applying to. These articles give users of our app the option to acknowledge recent events in the document they generate. For example, a cover letter for a software engineer position could mention the company's recent investments in technology to highlight how the applicant's skill set aligns with the company's focus.

Once we build out the app, users will provide the name of the company and the role they are applying for, and the app will use these inputs to retrieve relevant news. Users will be able to select which articles to incorporate, and the selected articles will feed into the core document-generation step. Because this interface is not yet available, we have tested with a hard-coded company and role.
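
As a sketch of how the hard-coded values could later become user inputs, the request parameters can be built by a small function. The name `build_news_params`, its arguments, and the `DEFAULT_TOPICS` list are assumptions for illustration. The API key is deliberately left out so the caller can add it separately.

```python
from datetime import datetime, timedelta, timezone

# Topic keywords used to focus the search on business news (illustrative default)
DEFAULT_TOPICS = [
    "earnings", "results", "revenue", "profit", "acquisition", "merger",
    "investment", "partnership", "strategy", "forecast", "guidance",
]

def build_news_params(company, days_back=30, max_articles=25, now=None, topics=DEFAULT_TOPICS):
    """Build search parameters for a company; `now` can be fixed for testing."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    return {
        "q": f"{company} AND ({' OR '.join(topics)})",
        "lang": "en",
        "country": "us",
        "max": max_articles,
        "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "sortby": "publishedAt",
    }

params = build_news_params("Microsoft", now=datetime(2025, 12, 4, tzinfo=timezone.utc))
print(params["q"])
print(params["from"])
```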

In [None]:
!pip install python-dateutil newspaper3k GoogleNews pandas lxml_html_clean nltk transformers



In [None]:
import time
import random
import requests
import pandas as pd
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from datetime import datetime
from dateutil.relativedelta import relativedelta
from newspaper import Article, Config
from GoogleNews import GoogleNews
from nltk.tokenize import sent_tokenize
from transformers import pipeline
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

[nltk_data] Downloading package punkt to C:\Users\New
[nltk_data]     User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\New
[nltk_data]     User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [31]:
# configuration
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Edge/120.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/120.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"
]
config = Config()
config.request_timeout = 30

import os
gnews_api_key = os.environ["GNEWS_API_KEY"]  # assumed variable name; avoids hard-coding the key in the notebook
gnews_url = "https://gnews.io/api/v4/search"

company = "Microsoft" # to eventually be replaced with user input
role = "Software Engineer" # to eventually be replaced with user input

start_date_str = (datetime.now() - relativedelta(days=30)).strftime("%Y-%m-%dT%H:%M:%SZ")

# establish request session
session = requests.Session()
retries = Retry(total=5, backoff_factor=1)
adapter = HTTPAdapter(max_retries=retries)
session.mount('http://', adapter)
session.mount('https://', adapter)
news_items = []
seen_articles = []

# create company query (can be modified for testing other ways to retrieve articles)
company_query = f'{company} AND (earnings OR results OR revenue OR profit OR acquisition OR merger OR investment OR partnership OR strategy OR forecast OR guidance)'

# format gnews request
params = {
    "q": company_query,
    "lang": "en",
    "country": "us",
    "max": 25,
    "from": start_date_str,
    "apikey": gnews_api_key,
    "sortby": "publishedAt"
}

# make gnews request
response = session.get(gnews_url, params=params, timeout=config.request_timeout)
response.raise_for_status()
gnews_data = response.json()

print(response)
print(gnews_data)

articles = gnews_data.get("articles",[])
print(articles)
parsed_articles = []

for article in articles:
    if len(parsed_articles) >= 3:
        break
    title = article.get('title')
    link = article.get('url')
    content = (article.get('content') or '')[:1000]  # guard against a missing content field
    published = article.get('publishedAt')

    if not link or title in seen_articles:
        continue

    seen_articles.append(title)

    if not title or company not in title:
        # skip over any articles that do not mention the company in the title
        continue
    parsed_articles.append({
        'Title': title,
        'Link': link,
        'Published': published,
        'Full_Text': content,
    })    

print(parsed_articles)

df = pd.DataFrame(parsed_articles)
print(df)

<Response [200]>
[{'Title': 'Microsoft faces complaint in EU over Israeli surveillance data', 'Link': 'https://www.al-monitor.com/originals/2025/12/microsoft-faces-complaint-eu-over-israeli-surveillance-data', 'Published': '2025-12-04T13:03:43Z', 'Full_Text': 'Microsoft is facing a complaint in the European Union filed by a non-profit organisation alleging it illegally stored data on Palestinians used for Israeli military surveillance.\nThe Irish Data Protection Commission (DPC) confirmed Thursday it had received the complaint against the US tech giant, saying it was "currently under assessment".\nSince Microsoft\'s European headquarters are located in Ireland, the DPC is the EU\'s lead data regulator for the company.\nThe organisation that brought the complaint, Eko -- which says it fights for "people and planet over profits" -- accused Microsoft of violating Europe\'s data protection law.\n"Microsoft unlawfully processed personal data belonging to Palestinians and EU citizens, enabli

In [29]:
summarizer = pipeline("text2text-generation", model="google/flan-t5-base", dtype="auto")

# summarize articles
summaries = []
prompts = []
links = []
titles = []

for article in parsed_articles:
    text = article.get('Full_Text').strip()

    if not text:
        summaries.append((article.get('Link'), article.get('Title'), 'Summary failure.'))
        continue

    prompt = f"Write a one sentence summary of the content contained in the following article; specifically highlight its impact or relevance on {company} in a context that would be useful for a job interview of a position as a {role} at the company:\n{text}"
    prompts.append(prompt)
    links.append(article.get('Link'))
    titles.append(article.get('Title'))

for i in range(0, len(prompts), 10):
    # batch 10 prompts at a time
    batch_prompts = prompts[i:i+10]
    batch_links = links[i:i+10]
    batch_titles = titles[i:i+10]

    summary_results = summarizer(batch_prompts, truncation=True, do_sample=False)

    for link, title, summary_result in zip(batch_links, batch_titles, summary_results):
        summaries.append((link, title, summary_result.get('generated_text').split('.')[0]))


Device set to use cpu


In [30]:
summaries

[('https://www.al-monitor.com/originals/2025/12/microsoft-faces-complaint-eu-over-israeli-surveillance-data',
  'Microsoft faces complaint in EU over Israeli surveillance data',
  'Microsoft is facing a complaint in the European Union filed by a non-profit organisation alleging it illegally stored data on Palestinians used for Israeli military surveillance'),
 ('https://www.cnbc.com/2025/12/03/hightowers-stephanie-link-says-market-is-failing-to-appreciate-microsofts-ai-value.html',
  "Hightower’s Stephanie Link says market is failing to appreciate Microsoft's AI value",
  "Hightower Advisors' Stephanie Link is finding investment opportunities in underappreciated technology stocks such as Microsoft and Palo Alto Networks"),
 ('https://www.newsbreak.com/fortune-561435/4377238318418-meet-amar-subramanya-the-46-year-old-google-and-microsoft-veteran-who-will-now-steer-apple-s-supremely-important-ai-strategy',
  'old Google and Microsoft veteran who will now steer Apple’s supremely important

## 4. Main Functionality 10pts

---

### Main Functionality

This project builds a personalized message-generation assistant for job hunting that uses prompt engineering to help users craft engaging outreach messages. Users specify the recipient (name, relationship, and the context of how they know them) and choose a creativity level, which maps to a temperature setting that controls tone and originality. The system offers curated content options, such as recent news highlights or trending topics, so the user can select relevant hooks to include in the message. The tool then generates a tailored message or cover letter using optimized prompts and optional code-based automation, aiming for personalization, relevance, and creativity in the communication.
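
One possible way to connect the creativity level and recipient context to the generation call is sketched below. The level names, the temperature values, and the helper `build_generation_settings` are all hypothetical. The final app may expose a numeric slider instead.

```python
# Hypothetical creativity levels; both the labels and the values are assumptions
CREATIVITY_TO_TEMPERATURE = {
    "conservative": 0.2,
    "balanced": 0.7,
    "creative": 1.0,
}

def build_generation_settings(recipient_name, relationship, context,
                              creativity="conservative", hooks=None):
    """Bundle the user's choices into settings for the message-generation step."""
    if creativity not in CREATIVITY_TO_TEMPERATURE:
        raise ValueError(f"Unknown creativity level: {creativity}")
    return {
        "temperature": CREATIVITY_TO_TEMPERATURE[creativity],
        "recipient": {
            "name": recipient_name,
            "relationship": relationship,
            "context": context,
        },
        # News summaries or topics the user chose to mention (may be empty)
        "hooks": list(hooks or []),
    }

settings = build_generation_settings(
    "Alex Johnson", "recruiter", "met at a conference", creativity="balanced"
)
print(settings["temperature"])
```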


In [None]:
#Import libraries
import json
import os
from jinja2 import Environment, FileSystemLoader
from dotenv import load_dotenv
import openai
from openai import OpenAI


In [None]:

#Load OpenAI API key from .env file (variables already set in the environment are kept)
load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]


In [None]:


#Load message payload template
with open("/content/payload.json","r", encoding="utf-8") as f:
    payload = json.load(f)


In [None]:
#Set user variables (we will use Streamlit to get this input later on)
role = 'recruiter'
history = ['indiana university', 'society of women engineers conference 2025']
position = 'machine learning engineer'
message_type = 'Cover Letters' #'LinkedIn connection notes' # 'Cover Letters'


In [None]:

#Load Jinja2 template and render prompt
env = Environment(loader=FileSystemLoader('/content/'))


In [None]:

# template=env.get_template("template.j2")

if message_type == 'LinkedIn connection notes':
    template = env.get_template("linkedin_msg.j2")
elif message_type == 'Cover Letters':
    template = env.get_template("cover_letter.j2")

prompt = template.render(payload=payload,role=role, history=history, position=position, news=summaries, company_name="Microsoft", Recipient_name="Alex Johnson")
print(prompt)


You are given below JSON payload:
{
    "education": [
        {
            "degree": "Master of Science",
            "field": "Data Science",
            "graduation_date": "May 2026",
            "institution": "Indiana University Bloomington"
        },
        {
            "degree": "Bachelor of Technology",
            "field": "Computer Science and Engineering",
            "graduation_date": "May 2021",
            "institution": "Jaypee Institute of Information Technology"
        }
    ],
    "email": "anshikabajpai23@gmail.com",
    "name": "Anshika Bajpai",
    "skills": [
        "Python",
        "R",
        "SQL",
        "Java",
        "C++",
        "C",
        "PostgreSQL",
        "BigQuery",
        "Jenkins",
        "Git",
        "GCP",
        "DevOps",
        "Docker",
        "Kubernetes",
        "PySpark",
        "Data Visualisation",
        "Prometheus",
        "Data Integration",
        "AWS",
        "GCP",
        "PyTorch",
        "NLP",
    

In [None]:

#Generate message using OpenAI API
client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini", #"gpt-4o-mini",
    input=[{
        "role": "system",
        "content": f"You are a helpful assistant that helps people draft {message_type} based on their background and the job description."
    }, {
        "role": "user",
        "content": prompt
    }],
    max_output_tokens=512,
    temperature=0.2
)
print(response)



Response(id='resp_0ef56a2fdabee8a200691107f41c6881a293052701d1b9c564', created_at=1762723828.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-4.1-mini-2025-04-14', object='response', output=[ResponseOutputMessage(id='msg_0ef56a2fdabee8a200691107f4ddd881a2afeafc08e59dbb63', content=[ResponseOutputText(annotations=[], text='Subject: Interest in Machine Learning Engineer Position at Microsoft\n\nHi Alex,\n\nI hope this message finds you well. It was great connecting with you at Indiana University and the Society of Women Engineers Conference 2025.\n\nI am writing to express my strong interest in the Machine Learning Engineer position at Microsoft. Currently, I am pursuing my Master of Science in Data Science at Indiana University Bloomington, and I bring hands-on experience from internships and roles at Palo Alto Networks, Optum, and Taiyo LLC. My background includes developing transformer-based large language models, building scalable systems with a focu

In [None]:



reply = response.output[0].content[0].text
print("Generated Email Draft:\n", reply)



Generated Email Draft:
 Subject: Interest in Machine Learning Engineer Position at Microsoft

Hi Alex,

I hope this message finds you well. It was great connecting with you at Indiana University and the Society of Women Engineers Conference 2025.

I am writing to express my strong interest in the Machine Learning Engineer position at Microsoft. Currently, I am pursuing my Master of Science in Data Science at Indiana University Bloomington, and I bring hands-on experience from internships and roles at Palo Alto Networks, Optum, and Taiyo LLC. My background includes developing transformer-based large language models, building scalable systems with a focus on MLOps, and expertise in Python, PyTorch, TensorFlow, NLP, and cloud platforms like GCP and AWS.

I am particularly excited about Microsoft’s recent advancements in AI-driven cloud expansion and investments in AI infrastructure, as highlighted in the latest fiscal reports and the collaboration with Boeing on virtual training programs 

## 5. Personal Contribution Statement 10pts

---

*  I contributed to the core idea and system architecture and designed the prompt engineering workflow for personalized message generation. I developed two tailored prompt templates, one for cover letters and one for LinkedIn outreach messages, ensuring that the tone, structure, and content fit each use case. This included integrating recipient context (who the message is for and how the user knows them) and adding a creativity control based on temperature settings. I also worked on the message generation pipeline, in which the selected information, user inputs, and optional content hooks are passed to the LLM through the designed prompts to generate a personalized, well-structured response. In addition, I contributed to the code for automated message generation to create a smooth end-to-end user experience.

*   Proofreading: All team members participated in proofreading to ensure clarity, correctness, and cohesiveness across documentation and outputs.



