# Part 4: CatBoost

 - Objective: Analyze changes in Data Science and Analytics (DSA) job content over the past 10 months in Taipei.

 - Data Collection: Compilation of job data in Taipei, focusing on positions related to Data Science and Analytics.

 - Data Scope: The dataset includes some positions that may not be directly relevant to DSA.

 - Initial Step: Filter out job postings that are not pertinent to DSA.
 
 - Analysis Method: Employ dynamic topic modeling to identify shifts in the content of DSA job descriptions during the 10-month period.

In [1]:
import utils, json
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.utils import shuffle

In [2]:
job = pd.read_csv("data/20231119_job_market_job.csv")
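# Load the precomputed job-name embeddings; the context manager closes the file afterwards.
with open("data/20231119_job_market_job_name_embeddings.json", "r") as f:
    job_name_embeddings = json.load(f)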

In [3]:
keywords = ["數據科學", "資料科學", "數據工程", "資料工程", "資料庫", "演算法", "機器學習", "人工智慧", "人工智能", "數據分析", "自然語言", "電腦視覺", "資料分析", "商業分析", "產品分析", "data science", "data engineering", "database", "algorithm", "machine learning", "artificial intelligence", "natural language", "computer vision", "data analysis", "business analysis", "product analysis"]
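# Lowercase the job names so the English keywords match regardless of case.
job["job_name"] = job["job_name"].str.lower()
# Flag a job when its name contains at least one keyword (the keywords are joined into one regex alternation).
job["contain_keyword"] = job["job_name"].str.contains("|".join(keywords))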
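
The labelling rule above is a plain substring match, so a keyword such as "algorithm" also matches "algorithmic". The following is a minimal pure-Python sketch of the same rule, not the notebook's pandas code. It uses a subset of the keyword list; the `re.escape` call is an added precaution that the original does not have, the first three job names come from the notebook output further down (their casing here is illustrative), and "Sales Manager" is a hypothetical non-DSA title.

```python
import re

# A subset of the keyword list used above.
keywords = ["資料庫", "電腦視覺", "database", "algorithm", "machine learning"]

# Join the keywords into one alternation pattern. re.escape is an added
# precaution (not in the original) in case a keyword contains regex characters.
pattern = re.compile("|".join(re.escape(k) for k in keywords))

job_names = [
    "Database Engineer",
    "電腦視覺軟體工程師",
    "Sr. Backend Developer - Quantitative Algorithmic Trading Team",
    "Sales Manager",  # hypothetical title with no keyword
]

# Lowercase first, then flag any name containing at least one keyword.
contain_keyword = [bool(pattern.search(name.lower())) for name in job_names]

for name, flag in zip(job_names, contain_keyword):
    print(flag, name)
```

The third title is flagged only because "algorithmic" contains "algorithm", which shows how loose a substring label can be.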

In [4]:
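# Features: one embedding vector per job name.
x = job_name_embeddings
# Labels: 1 if the job name matched a keyword, 0 otherwise.
y = job["contain_keyword"].astype(int).tolist()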

# Randomized Search in Colab

In [5]:
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 69)

# param_grid = {
#     "iterations": [10, 30, 50, 70, 90, 100, 300, 500, 700, 900, 1000, 3000, 5000],
#     "learning_rate": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001],
#     "subsample": [0.3, 0.5, 0.7],
#     "depth": [3, 5, 7, 9, 11],
#     "l2_leaf_reg": [0, 5, 10, 15, 20],
#     "min_data_in_leaf": [10, 30, 50, 70, 90],
#     "random_strength": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001],
#     "boosting_type": ["Plain"],
#     "bootstrap_type": ["Bernoulli"],
#     "one_hot_max_size": [255],
#     "border_count": [32]
# }

# catboost_model = CatBoostClassifier(task_type = "GPU", loss_function = "Logloss", auto_class_weights = "SqrtBalanced", verbose = 500)
# search_result = catboost_model.randomized_search(param_distributions = param_grid, X = x_train, y = y_train, cv = 5, n_iter = 300, search_by_train_test_split = True, verbose = False)

# best_params = search_result["params"]
# print(best_params)
# final_model = CatBoostClassifier(**best_params, task_type = "GPU", loss_function = "Logloss", auto_class_weights = "SqrtBalanced", verbose = 0)
# final_model.fit(x_train, y_train)

# final_model.save_model("20231119_job_type_classifier")
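
To give a sense of scale, the grid above spans 312,000 parameter combinations, so `n_iter = 300` evaluates about 0.1% of them. The sketch below illustrates the general idea of a randomized search, drawing random combinations from the grid; it is not CatBoost's internal implementation. The helper `sample_params` and its `seed` argument are hypothetical names introduced here, and the sampling is done with replacement, so duplicates are possible.

```python
import math
import random

# Grid copied from the commented-out search above.
param_grid = {
    "iterations": [10, 30, 50, 70, 90, 100, 300, 500, 700, 900, 1000, 3000, 5000],
    "learning_rate": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001],
    "subsample": [0.3, 0.5, 0.7],
    "depth": [3, 5, 7, 9, 11],
    "l2_leaf_reg": [0, 5, 10, 15, 20],
    "min_data_in_leaf": [10, 30, 50, 70, 90],
    "random_strength": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001],
    "boosting_type": ["Plain"],
    "bootstrap_type": ["Bernoulli"],
    "one_hot_max_size": [255],
    "border_count": [32],
}


def sample_params(grid, n_iter, seed=0):
    """Draw n_iter random parameter combinations (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


# Total number of distinct combinations in the grid.
grid_size = math.prod(len(values) for values in param_grid.values())

# The same number of draws as n_iter in the search above.
candidates = sample_params(param_grid, n_iter=300)

print(grid_size)
print(candidates[0])
```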

# Final Model

 - Accuracy, sensitivity, and specificity all exceed 95%. These metrics are computed on the full dataset `x`, which includes any rows used for training.

 - The labels are the keyword matches themselves, so a very high sensitivity suggests that the model largely reproduces the direct keyword search.

 - In that case, it may overlook related jobs that a straightforward keyword search cannot identify.

In [6]:
model = CatBoostClassifier()
model.load_model("20231119_job_type_classifier")
y_pred = model.predict(x)

In [7]:
utils.print_binary_classifier_metrics(y, y_pred)

Evaluation Metrics:
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative               26832                  82
Actual Positive                  79                1784
--------------------
Accuracy:
0.9944052541960594
--------------------
Sensitivity:
0.9575952764358562
--------------------
Specificity:
0.9969532585271605
--------------------
Precision:
0.9560557341907824
--------------------
Negative Predictive Value:
0.9970643974582885
--------------------
False Positive Rate:
0.0030467414728394143
--------------------
False Discovery Rate:
0.04394426580921758
--------------------
False Negative Rate:
0.042404723564143855
--------------------
F1 Score:
0.9568248860284259
--------------------
Matthews Correlation Coefficient:
0.9538340659199472
--------------------
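
The reported figures can be re-derived from the four confusion-matrix counts. The implementation of `utils.print_binary_classifier_metrics` is not shown in this notebook, so the sketch below assumes the standard textbook definitions; the `metrics` dictionary and its key names are introduced here for illustration only.

```python
import math

# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 26832, 82, 79, 1784

metrics = {
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "sensitivity": tp / (tp + fn),                 # true positive rate
    "specificity": tn / (tn + fp),                 # true negative rate
    "precision": tp / (tp + fp),
    "negative_predictive_value": tn / (tn + fn),
    "f1": 2 * tp / (2 * tp + fp + fn),
    "mcc": (tp * tn - fp * fn)
    / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because only 1,863 of the 28,777 jobs are positive, accuracy is dominated by the negative class; sensitivity, precision, and the Matthews correlation coefficient are more informative here.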


In [8]:
job["pred_contain_keyword"] = y_pred.astype(bool)

These jobs match a keyword, but the model did not classify them as DSA (false negatives relative to the keyword label):

In [9]:
print("\n".join(job.query("contain_keyword == True and pred_contain_keyword == False").job_name.tolist()))

camera isp algorithm system engineer, senior to staff (taipei)(3018388)
【系統事業】演算法工程師-台北辦公室-r33
mis-database engineer
[ java  ]  senior data engineer 資料工程師   #分散式系統  #技術大神 #內外部社群發展
deep learning and computer vision embedded software engineer (cd2390)
直流無刷馬達(bldc)演算法工程師 
【新契約服務】規劃人員_資深商業分析師(ba)
dior- crm assistant 客戶服務助理 (資料分析/客戶關係管理/需細心耐心喜歡數字)
clinical database administrator
software qa engineer (sqa) ｜港資上市公司｜ai & machine learning
電腦視覺軟體工程師
【寵物公園】-數據工程師 / bi工程師
總公司 assistant manager of  market research and business analysis_台北
港資上市公司 ｜ai & machine learning｜sr.full-stack,tech lead
數據分析師ー日系知名廣告代理商(16683)
【專案管理部】people partner解決方案專員 ｜資深解決方案專員（hr數據分析與自動化）
r2: axi 演算法研發工程師-241
accounting manager (financial service/big data analysis)
數據分析人員 | 日系知名國際金融業 | 台北 508004
etl 數據工程師【os】
cross-border business analyst 蝦皮跨境商業分析師
【營運企劃】團險數據分析暨通路企劃人員
sr. backend developer - quantitative algorithmic trading team
【ubereats▲ 營運專員ops specialist】文案企劃背景者佳/數據分析/底薪38k-45k+獎金_sms_865
database engineer
資料科學工程師-集團徵才-
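
The query `contain_keyword == True and pred_contain_keyword == False` selects keyword matches that the model rejected; swapping the two conditions gives the opposite group, jobs the model picked without a keyword match. A minimal pure-Python sketch of the two disagreement groups and of the union kept in the final cell follows. All rows are hypothetical toy values (only the title "database engineer" appears in the output above), and the variable names are introduced here for illustration.

```python
# Hypothetical toy rows: (job_name, contain_keyword, pred_contain_keyword).
rows = [
    ("database engineer", True, False),  # keyword match, rejected by the model
    ("data scientist", False, True),     # no keyword match, picked by the model
    ("機器學習工程師", True, True),        # both agree: DSA
    ("sales manager", False, False),     # both agree: not DSA
]

# Keyword matches that the model did not flag.
keyword_only = [name for name, kw, pred in rows if kw and not pred]

# Jobs the model flagged that the keyword search missed.
model_only = [name for name, kw, pred in rows if pred and not kw]

# Union of both flags, as in the final cell that saves the DSA job IDs.
selected = [name for name, kw, pred in rows if kw or pred]

print(keyword_only)
print(model_only)
print(selected)
```

Note that "data scientist" contains none of the listed keywords ("data science" is not a substring of it), which is the kind of title only the model can recover.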

In [10]:
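# Keep every job flagged by either the keyword search or the model, and save the unique IDs.
dsa_job_ids = job.query("contain_keyword == True or pred_contain_keyword == True")["id"].unique().tolist()
with open("data/20231119_job_market_dsa_job_id.json", "w") as f:
    json.dump(dsa_job_ids, f)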