## System Overview
This system is designed to enable end-to-end machine learning workflows, integrating data collection, labeling, model fine-tuning, and prediction into a cohesive pipeline. It supports both automated and manual steps to adapt to varying degrees of supervision and task complexity.

1. Data Collection
The system begins by collecting raw data from two websites. A standardized ingestion module ensures consistent formatting, cleansing, and storage. 

2. Labeling
Collected data is passed through a labeling module, which supports manual labeling through an interactive UI.

3. Model Fine-tuning
Once labeled data is available, it is used to fine-tune an LLM foundation model so that it reflects your interests.

4. Prediction
The range of jobs we are interested in is quite narrow, so the model's predictions are used to filter out the less relevant postings.
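
The filtering step can be pictured as a simple cut on the predicted scores. The snippet below is only a sketch: the helper name `filter_by_score`, the example job ids, and the threshold value are all hypothetical and are not part of this repository.

```python
from typing import Dict, List


def filter_by_score(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the job ids whose score is at least `threshold`, best first."""
    kept = [(job_id, score) for job_id, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [job_id for job_id, _ in kept]


# Hypothetical scores, standing in for the model's predictions.
example_scores = {"job_a": 0.82, "job_b": 0.86, "job_c": 0.84}
shortlist = filter_by_score(example_scores, threshold=0.84)
print(shortlist)
```

A sensible threshold depends on how the scores are distributed for your own labels, so it is best treated as a tunable value.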

In [None]:
import sys
import os
import hashlib
import pandas as pd
from elasticsearch import Elasticsearch
sys.path.append(os.path.join(os.getcwd(), ".."))  # make the project root importable
from data_collections.run import crawling 
from utils.dataset import Job, Resume
from preprocess.loader import DataStream
from nlp_functions.llm_utils import Agent

## Set your data up
1. Call `crawling()` to search the websites with the keywords ["Data", "Scientist", "Machine"] and save the raw data to your database.
2. Run `label_tool.py` to open a GUI in which you assign a binary label to each job.
3. Write the labels back to your dataset.
4. Summarize each job with any capable LLM API.

In [None]:

crawling()
# use GUI tool to label
# $ cd JOB_RECOMMENDATION/utils
# $ python label_tool.py

In [None]:
db_url = "Url to access your collection"  # typically https://localhost:9200
es = Elasticsearch(
    [db_url],
    basic_auth=("$username", "$password"),
    verify_certs=False,  # skips TLS certificate verification
)
# you can access a single document as follows
respond = es.get(index="jobs_db", id="Your_job_id")

In [None]:
# write your labels to the database
import json

label_path = "Path_to_your_label"
with open(label_path, "r") as f:
    labels = json.load(f)  # mapping of job_id -> label

for job_id, label in labels.items():
    # partial update: only the "Labels" field of the document is changed
    update_body = {
        "doc": {
            "Labels": [label]
        }
    }
    es.update(index="jobs_db", id=job_id, body=update_body)
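
When there are many labels, one request per document becomes slow. A possible alternative, sketched here under assumptions, is to build a list of update actions and send them with the `elasticsearch.helpers.bulk` helper. The function name `build_label_actions` is hypothetical; the action layout (`_op_type`, `_index`, `_id`, `doc`) follows the format the bulk helper documents for partial updates.

```python
from typing import Any, Dict, List


def build_label_actions(labels: Dict[str, Any], index: str = "jobs_db") -> List[Dict[str, Any]]:
    """Turn a {job_id: label} mapping into bulk 'update' actions."""
    return [
        {
            "_op_type": "update",          # partial update of an existing document
            "_index": index,
            "_id": job_id,
            "doc": {"Labels": [label]},    # same field the loop above writes
        }
        for job_id, label in labels.items()
    ]


# Hypothetical labels, standing in for the contents of the label file.
actions = build_label_actions({"job_1": 1, "job_2": 0})
print(actions[0])

# With a live cluster, the actions could then be sent in one call:
# from elasticsearch import helpers
# helpers.bulk(es, actions)
```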



In [None]:
# Use a state-of-the-art LLM API to summarize each job's information.
# For me, this is a good way to reduce dimensionality and make the patterns more organized.
from openai import OpenAI
import copy
import time

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key="<API-Key>",
)

job_ids = list(labels.keys())
train_size = 10
data_size = 30
summaries = {}
for i in range(data_size):
    print(f"parsing row={i}")
    respond = es.get(index="jobs_db", id=job_ids[i])  # look up the i-th labeled job
    cache = copy.copy(respond["_source"])
    del cache["Labels"]
    job = Job(**cache)
    completion = client.chat.completions.create(
    extra_body={},
    model="deepseek/deepseek-v3-base:free",
    messages=[
        {
        "role": "user",
        "content":f"""
                    You are a good headhunter. Summarize the following description:

                    {job.form()}
                    
                    The content should be as simple as possible, including:
                    
                    1. what's the requirements ?

                    2. what's the main task ?

                    3. Salary Range 
                    """
        
        }
    ]
    )
    summaries[job_ids[i]] = completion.choices[0].message.content

_job_ids = []
_summaries = []
_labels = []
for k,v in summaries.items():
    _job_ids.append(k)
    _summaries.append(v)
    _labels.append(labels[k])

model_input = pd.DataFrame({"job_id": _job_ids, "job_description": _summaries, "labels": _labels})
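
Hosted LLM APIs can fail transiently or rate-limit requests, which would abort the summarization loop partway through. One way to make it more robust is a small retry wrapper with exponential backoff. This is only a sketch: `with_retries`, its parameters, and the `flaky_summary` stand-in are hypothetical and not part of this repository.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Run `call`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Stand-in for an API call that fails twice before succeeding.
calls: Dict[str, int] = {"count": 0}


def flaky_summary() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "summary text"


result = with_retries(flaky_summary, attempts=3, base_delay=0.0)
print(result, calls["count"])
```

In the loop above, the call to `client.chat.completions.create(...)` could be wrapped in a `lambda` and passed to such a helper.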


## Learning from your label

1. Load Mistral-7B, which is used by default
2. Fine-tune it with your labels
3. Testing and prediction

In [None]:
# fine-tuning
model_path = "Path_to_your_model"
agent = Agent(model_path=model_path)
agent.load_models()
agent.finetuning(model_input["job_description"].tolist(), model_input["labels"].tolist())
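
The cell above fine-tunes on every summarized row, while `train_size` is defined earlier but never used. If the intent was to hold some rows back for testing, a split along the following lines would do it. This is an assumption about the intended workflow, and `split_rows` is a hypothetical helper, not something this repository provides.

```python
from typing import List, Sequence, Tuple

Rows = Tuple[List[str], List[int]]


def split_rows(texts: Sequence[str], labels: Sequence[int], train_size: int) -> Tuple[Rows, Rows]:
    """Split parallel sequences into (train, test) at position `train_size`."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if not 0 <= train_size <= len(texts):
        raise ValueError("train_size is out of range")
    train = (list(texts[:train_size]), list(labels[:train_size]))
    test = (list(texts[train_size:]), list(labels[train_size:]))
    return train, test


# Hypothetical summaries and labels.
example_texts = ["summary 1", "summary 2", "summary 3", "summary 4"]
example_labels = [1, 0, 0, 1]
(train_x, train_y), (test_x, test_y) = split_rows(example_texts, example_labels, train_size=3)
print(len(train_x), len(test_x))
```

With the notebook's own data, the inputs would be `model_input["job_description"].tolist()` and `model_input["labels"].tolist()`.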

In [None]:
# set your test ids
test_ids = [
    "Experian,Staff Software Engineer, Data & AI Platform Architecture",
    "Avanade,Data & AI Solutions Architect",
    "Avanade,Manager, Data Engineering",
    "Samsara Inc.,Principal, Product Manager - AI Assistant",
    "CrowdStrike,Sr. Software Engineer - Charlotte AI (Remote, ROU)"
]
# prediction
test_data = [summaries[test_id] for test_id in test_ids]
prediction = agent.do_grade(test_data)
print(prediction)

The output should look like:

```
array([0.8152325 , 0.8152325 , 0.83601975, 0.8587186 , 0.8601343 ],
      dtype=float32)
```

My rank for the test data is: [2, 2, 2, 2, 1]

Prediction rank: [4, 4, 3, 2, 1]
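
The prediction rank can be derived from the scores by ranking them in descending order and giving tied scores the same rank. The sketch below reproduces the list above from the printed scores; the helper name `dense_rank_desc` is hypothetical.

```python
from typing import List, Sequence


def dense_rank_desc(scores: Sequence[float]) -> List[int]:
    """Rank scores so the highest gets 1; equal scores share a rank."""
    ordered = sorted(set(scores), reverse=True)
    rank_of = {score: rank for rank, score in enumerate(ordered, start=1)}
    return [rank_of[score] for score in scores]


scores = [0.8152325, 0.8152325, 0.83601975, 0.8587186, 0.8601343]
ranks = dense_rank_desc(scores)
print(ranks)  # [4, 4, 3, 2, 1]
```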

For further fine-tuning, change the loss function to **Bayesian Personalized Ranking (BPR) loss**.
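
BPR loss works on pairs of items in which one is preferred over the other, and penalizes the model when the preferred item does not score higher: for each pair, the loss is `-log(sigmoid(s_pos - s_neg))`. The following is a minimal pure-Python sketch of that formula, not the training code of `Agent`; the helper names are hypothetical, and the pairs are built from the manual ranks listed above.

```python
import math
from typing import List, Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def preference_pairs(ranks: Sequence[int]) -> List[Tuple[int, int]]:
    """Index pairs (i, j) where item i is ranked strictly better than item j."""
    n = len(ranks)
    return [(i, j) for i in range(n) for j in range(n) if ranks[i] < ranks[j]]


def bpr_loss(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Mean of -log(sigmoid(pos - neg)) over all pairs."""
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need equally sized, non-empty score lists")
    # -log(sigmoid(d)) equals softplus(-d), with d = pos - neg
    total = sum(softplus(neg - pos) for pos, neg in zip(pos_scores, neg_scores))
    return total / len(pos_scores)


my_ranks = [2, 2, 2, 2, 1]
predicted = [0.8152325, 0.8152325, 0.83601975, 0.8587186, 0.8601343]
pairs = preference_pairs(my_ranks)
loss = bpr_loss([predicted[i] for i, _ in pairs], [predicted[j] for _, j in pairs])
print(pairs, loss)
```

In a PyTorch training loop, the same quantity is commonly written as `-torch.nn.functional.logsigmoid(pos - neg).mean()`.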
