---
format:
    html:
        embed-resources: true
---

# Cleaning: Part-2 

The goal here is exactly the same as in `HW-3.2-cleaning-1.ipynb`, except that this time we repeat the exercise using LLM APIs and prompt engineering to streamline the cleaning process.

Essentially, our job is to write an LLM wrapper to clean the job descriptions. 

We can use any LLM API we want, along with any prompt engineering techniques.


Here is an example of how to use OpenAI's API:

[https://jfh.georgetown.domains/centralized-lecture-content/content/computer-science/general-concepts/openAI-API-example/notes.html](https://jfh.georgetown.domains/centralized-lecture-content/content/computer-science/general-concepts/openAI-API-example/notes.html)

Various other LLM APIs offer partial access that we can wrap. Do some searching and find a tool that seems to fit our needs, for example:

* [https://ai.google.dev/gemini-api/docs/quickstart?lang=python](https://ai.google.dev/gemini-api/docs/quickstart?lang=python)



In [66]:
import json
import pandas as pd
from tqdm import tqdm
import time
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()


In [67]:
def process_job(job):
    """Ask the LLM to extract a fixed set of fields from one raw job record."""
    # Template of the fields to extract. This is a plain (non-f) string, so the
    # braces are written once; doubled braces would reach the model literally.
    job_fields = """
    {
        "job_title": "",
        "company_name": "",
        "sector": "",
        "location": "",
        "job_type": "",
        "salary_range": "",
        "experience_level": "",
        "education_requirements": "",
        "required_skills": "",
        "responsibilities": "",
        "years_experience": "",
        "benefits": "",
        "work_mode": "",
        "posting_date": "",
        "job_description_length": "",
        "required_certifications": "",
        "team_size": "",
        "company_size": "",
        "posting_platform": "",
        "company_culture": "",
        "visa_sponsorship": "",
        "working_hours": "",
        "language_requirements": "",
        "travel_requirements": "",
        "collaboration_tools": "",
        "reporting_structure": "",
        "learning_opportunities": "",
        "stock_options": "",
        "soft_skills": "",
        "perks": "",
        "job_id": ""
    }
    """

    prompt = f'''Here is some job data loaded from a json file, please analyze this job listing and extract the following information in JSON format. 
    If a field is not available, use null. Be concise and factual:
    {job_fields}

    Job Data:
    {json.dumps(job,indent=2)}
    '''

    try:
        response = client.chat.completions.create(
            model='gpt-3.5-turbo',
            messages=[
                {"role": "system", "content": "You are a helpful assistant that extracts and cleans job posting data. Return only the JSON object with the extracted information."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
            # Replies longer than this limit are cut off, which leaves invalid JSON.
            max_tokens=1000)

        cleaned_data = json.loads(response.choices[0].message.content)
        return cleaned_data

    except Exception as e:
        # Covers API failures as well as replies that are not valid JSON.
        print(f"Error processing job: {e}")
        return None
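
`process_job` passes the raw reply straight to `json.loads`, which fails if the model wraps the JSON in a Markdown code fence or adds a sentence around it. Below is a minimal sketch of a more tolerant parser, assuming that such wrapping can occur with the model in use. The helper name `extract_json_object` is hypothetical and not part of the original notebook.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the source readable
FENCED_REPLY = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", flags=re.DOTALL
)


def extract_json_object(reply_text):
    """Return the JSON object contained in reply_text, or None if none can be parsed."""
    if not reply_text:
        return None
    text = reply_text.strip()
    # Drop a surrounding Markdown code fence if the whole reply is wrapped in one.
    fenced = FENCED_REPLY.match(text)
    if fenced:
        text = fenced.group(1)
    # Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Inside `process_job`, the call `json.loads(response.choices[0].message.content)` could then be swapped for `extract_json_object(response.choices[0].message.content)`. This returns `None` on failure without raising, so the surrounding `try` would only need to cover the API call.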

In [68]:
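def clean_job_data_llm(crawled_data):
    """Clean every job posting in the crawled data with the LLM."""
    all_cleaned_jobs = []

    # Gather the job postings from every crawled unit into one flat list.
    jobs_results = []
    for crawled_unit in crawled_data:
        new_unit = crawled_unit['results'].get('jobs_results', [])
        jobs_results.extend(new_unit)

    for job in tqdm(jobs_results, desc="Processing jobs"):
        cleaned_job = process_job(job)
        # Skip jobs whose request or JSON parsing failed (process_job returned None).
        if cleaned_job:
            all_cleaned_jobs.append(cleaned_job)
        time.sleep(0.1)  # brief pause between API calls

    return all_cleaned_jobs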
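
The loop above drops any job for which `process_job` returns `None`, and it waits a fixed 0.1 s between calls. A minimal sketch of one alternative is to retry failed calls with exponential backoff. The helper `call_with_retries` and its parameters are hypothetical, and the sketch assumes that some failures are transient (for example rate limits) so that a retry can succeed.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args) until it returns a non-None value or the attempts run out."""
    for attempt in range(max_attempts):
        result = func(*args)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return None
```

In the loop this would read `cleaned_job = call_with_retries(process_job, job)`. Retries are unlikely to help when the failure is reproducible, such as a reply that is always too long for the token limit.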

In [69]:
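def save_data(cleaned_data, output_file):
    """Write the cleaned records to a CSV file and return them as a DataFrame."""
    df = pd.DataFrame(cleaned_data)
    # Create the output folder (e.g. data/) if it does not exist yet.
    os.makedirs(os.path.dirname(output_file) or '.', exist_ok=True)
    df.to_csv(output_file, index=False)

    return df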
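
The prompt template asks for plain string values, but nothing prevents the model from returning a list (say, for `required_skills`) or a nested object, and those would land in the CSV as Python reprs. A minimal sketch of a normalisation step, assuming such values do show up. The helper `flatten_record` is hypothetical.

```python
import json


def flatten_record(record, list_separator="; "):
    """Return a copy of record in which list and dict values are turned into strings."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Join list items into one delimited string.
            flat[key] = list_separator.join(str(item) for item in value)
        elif isinstance(value, dict):
            # Serialise nested objects as compact JSON text.
            flat[key] = json.dumps(value, sort_keys=True)
        else:
            flat[key] = value
    return flat
```

It could be applied just before building the frame, for example `pd.DataFrame([flatten_record(r) for r in cleaned_data])`.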

In [70]:
if __name__ == "__main__":
    with open('googlejobs_alltitles_2024-11-05_22-34-20.json', 'r') as file:
        crawled_data = json.load(file)

    cleaned_data = clean_job_data_llm(crawled_data)
    
    output_file = 'data/processed-jobs-2.csv'
    df = save_data(cleaned_data, output_file)

    print(df.head())

Processing jobs:   4%|▍         | 37/847 [03:51<1:47:45,  7.98s/it]

Error processing job: Unterminated string starting at: line 32 column 15 (char 1709)


Processing jobs:  22%|██▏       | 186/847 [20:28<1:22:11,  7.46s/it]

Error processing job: Unterminated string starting at: line 32 column 15 (char 1528)


Processing jobs:  85%|████████▍ | 717/847 [1:19:32<16:35,  7.66s/it]

Error processing job: Unterminated string starting at: line 32 column 15 (char 1344)


Processing jobs:  86%|████████▋ | 732/847 [1:21:10<14:04,  7.35s/it]

Error processing job: Unterminated string starting at: line 32 column 15 (char 2251)


Processing jobs:  87%|████████▋ | 738/847 [1:21:56<14:52,  8.19s/it]

Error processing job: Unterminated string starting at: line 32 column 15 (char 2361)


Processing jobs: 100%|██████████| 847/847 [1:37:31<00:00,  6.91s/it]  

                                           job_title     company_name  \
0                       Data Scientist, Data Science  Cardinal Health   
1             Usability Researcher 2- Data Scientist    Jobs via Dice   
2  Senior Data Scientist – Financial Crimes and T...             USAA   
3                                Data Scientist, R&D      Eight Sleep   
4                       Data Scientist I B - GBS IND  Bank of America   

               sector                  location   job_type  \
0                None  United States (+1 other)  Full-time   
1          Technology                  Anywhere  Full-time   
2  Financial Services      Colorado Springs, CO  Full-time   
3   Consumer Wellness                  Anywhere  Full-time   
4             Finance                  Anywhere  Full-time   

          salary_range                                   experience_level  \
0   $93,500 - $133,600                                           3+ years   
1                 None            
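
Every error shown in the log above points at line 32, column 15 of the reply. If the model reproduces the layout of the 31-field template, that position is where the value of the last field, `job_id`, opens. One plausible reading is therefore that the reply reached `max_tokens=1000` partway through that value. Below is a minimal sketch of a check, assuming that is the cause: each chat completion choice carries a `finish_reason`, and the value `"length"` marks a reply that was cut off at the token limit. The helper name `classify_reply` is hypothetical.

```python
import json


def classify_reply(content, finish_reason):
    """Return (parsed_dict_or_None, status) for one model reply."""
    if finish_reason == "length":
        # The reply stopped at the token limit, so the JSON is likely incomplete.
        return None, "truncated"
    try:
        return json.loads(content), "ok"
    except (json.JSONDecodeError, TypeError):
        # TypeError covers a reply whose content is None.
        return None, "invalid_json"
```

In `process_job` this would be called as `classify_reply(response.choices[0].message.content, response.choices[0].finish_reason)`. Jobs flagged as `"truncated"` could be rerun with a larger `max_tokens`, or `job_id` could be copied from the input record so the model does not have to generate it.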


