# Part 4: Lightgbm

 - Objective: Analyze changes in Data Science and Analytics (DSA) job content over the past 10 months in Taipei.

 - Data Collection: Compilation of job data in Taipei, focusing on positions related to Data Science and Analytics.

 - Data Scope: The dataset includes some positions that may not be directly relevant to DSA.

 - Initial Step: Filter out job postings that are not pertinent to DSA.
 
 - Analysis Method: Employ dynamic topic modeling to identify shifts in the content of DSA job descriptions during the 10-month period.

In [1]:
import utils, umap, json
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.cluster import HDBSCAN
from sklearn.metrics import roc_curve, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, GridSearchCV
from scipy.spatial.distance import cosine
import plotly.express as px
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("data/20231119_job_market_job.csv")
list_job_name_embeddings = json.load(open("data/20231119_job_market_job_name_embeddings.json", "r"))

In [3]:
keywords = ["數據科學", "資料科學", "數據工程", "資料工程", "資料庫", "演算法",
            "機器學習", "人工智慧", "人工智能", "數據分析", "自然語言", "電腦視覺",
            "資料分析", "商業分析", "產品分析", "data science",
            "data engineering", "database", "algorithm", "machine learning",
            "artificial intelligence", "natural language", "computer vision",
            "data analysis", "business analysis", "product analysis"]

In [5]:
data["job_name"].str.lower()
data["contain_keyword"] = data["job_name"].str.contains("|".join(keywords))
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28777 entries, 0 to 28776
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               28777 non-null  object
 1   job_name         28777 non-null  object
 2   job_desc         28752 non-null  object
 3   contain_keyword  28777 non-null  bool  
dtypes: bool(1), object(3)
memory usage: 702.7+ KB


In [7]:
X_train, X_test, y_train, y_test = train_test_split(list_job_name_embeddings, data["contain_keyword"].tolist(),
                                                    test_size=0.3, random_state=69)

In [8]:
param_grid = {
    "boosting_type": ["gbdt", "dart"],
    "num_leaves": [10, 20, 30],
    "learning_rate": [0.1, 0.01, 0.001],
    "n_estimators": [100, 200, 300],
    "objective": ["binary"],
    "random_state": [69]
}

In [102]:
lgb_model = lgb.LGBMClassifier()

grid_search = GridSearchCV(estimator=lgb_model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 1043, number of negative: 15071
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.100856 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 391680
[LightGBM] [Info] Number of data points in the train set: 16114, number of used features: 1536
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.064726 -> initscore=-2.670671
[LightGBM] [Info] Start training from score -2.670671
[LightGBM] [Info] Number of positive: 1043, number of negative: 15071
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.101625 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 391680
[LightGBM] [Info] Number of data points in the train set: 16114, number of used features: 1536
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.064726 -> initscore=-2.670671
[LightGBM] [Info] Start training from score -2.670671
[Light

The Best Model Paramaters were:
```
best_params = {
    'boosting_type': 'gbdt',
    'learning_rate': 0.1,
    'n_estimators': 300,
    'num_leaves': 10,
    'objective': 'binary',
    'random_state': 69
    }
 ```

In [103]:
best_params = grid_search.best_params_
print(best_params)

{'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 300, 'num_leaves': 10, 'objective': 'binary', 'random_state': 69}


In [104]:
final_model = lgb.LGBMClassifier(**best_params)
final_model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 1304, number of negative: 18839
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.112409 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 391680
[LightGBM] [Info] Number of data points in the train set: 20143, number of used features: 1536
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.064737 -> initscore=-2.670493
[LightGBM] [Info] Start training from score -2.670493


In [118]:
y_pred = final_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
print(pd.DataFrame(confusion_matrix(y_test, y_pred), index = ["Actual Negative", "Actual Positive"], columns = ["Predicted Negative", "Predicted Positive"]))

0.9813527912902479
                 Predicted Negative  Predicted Positive
Actual Negative                8028                  47
Actual Positive                 114                 445
