
In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

**Predicting Career Domain and Seniority from LinkedIn Profiles**

Project Overview:

In this semester’s capstone project, your task is to develop an end-to-end machine-learning pipeline that predicts

(1) the current professional domain and

(2) the current seniority level of an individual based solely on the information contained in their LinkedIn CV.

Your models will be evaluated using a hand-labeled dataset provided by SnapAddy.
The project encourages you to creatively combine modern NLP techniques, programmatic labeling strategies, and supervised or zero-shot approaches to extract meaningful signals from semi-structured career data.

Details:
The target is the domain and seniority of the current job. The current job is the entry whose status is "ACTIVE" in the CV.
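A minimal sketch of picking the current job out of one CV. It assumes each CV is a list of position dicts and that the status is stored under a key named `status`; that key name is an assumption, and the sample record is made up for illustration.

```python
def current_positions(cv, status_key="status"):
    """Return the entries of one CV whose status is ACTIVE.

    `status_key` is an assumed field name, not confirmed by the data.
    """
    return [job for job in cv if str(job.get(status_key, "")).upper() == "ACTIVE"]


# Made-up CV for illustration only.
example_cv = [
    {"position": "Head of Sales", "status": "ACTIVE"},
    {"position": "Sales Intern", "status": "INACTIVE"},
]

active = current_positions(example_cv)
print([job["position"] for job in active])
```

Returning a list rather than a single entry leaves room for CVs with zero or several active jobs.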

Possible Approaches (non-exhaustive)

1. Rule-based matching (baseline): Identify relevant job titles and text passages using predefined label lists and assign domain and seniority accordingly.

2. Embedding-based labeling: Use the provided label lists to generate embeddings (e.g., via LLMs or sentence transformers). Compute similarity between profile text and label embeddings and perform zero-shot classification.

3. *Fine-tuned classification model: Use the CSV files to fine-tune a pre-trained classification model. Apply the model to the LinkedIn data.*

4. Programmatic labeling + supervised learning: Use rule-based or embedding-based predictions to create pseudo-labels for a large set of LinkedIn profiles, then fine-tune a classifier on this expanded dataset.

5. Feature engineering and conventional machine learning: Look at the LinkedIn data and generate meaningful features (e.g., number of previous jobs as an indicator of seniority). Then train conventional algorithms (e.g., random forests) to predict the labels.

6. Simple interpretable baseline: E.g., a bag-of-words or TF-IDF + logistic regression classifier for domain or seniority.

7. Your own approach: Be creative and find your own solution.

Note that for each of these approaches, two models are required: one for predicting the department and one for predicting the seniority.
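As an illustration of approach 1, a rule-based matcher can be sketched in a few lines. The keyword lists and the `rule_based_label` helper below are hypothetical placeholders, not the provided label lists.

```python
import re

# Hypothetical keyword lists; replace with the provided label lists.
SENIORITY_KEYWORDS = {
    "Director": ["director", "head of"],
    "Senior": ["senior", "sr"],
    "Junior": ["junior", "jr", "intern"],
}


def rule_based_label(title, keyword_map, default="Other"):
    """Return the first label whose keyword occurs as a whole word in the title."""
    text = title.lower()
    for label, keywords in keyword_map.items():
        for kw in keywords:
            # \b keeps "intern" from matching inside "international".
            if re.search(rf"\b{re.escape(kw)}\b", text):
                return label
    return default


print(rule_based_label("Sr. Software Engineer", SENIORITY_KEYWORDS))
print(rule_based_label("Head of Marketing", SENIORITY_KEYWORDS))
print(rule_based_label("Accountant", SENIORITY_KEYWORDS))
```

The order of the dictionary decides which label wins when several keywords match, so more specific labels should come first.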

Upload the data (JSON and CSV)

In [None]:
from google.colab import files
files.upload()

In [None]:
!ls

In [None]:
import json

with open("test_json.txt", "r", encoding="utf-8") as f:
    data_json = json.load(f)

print(type(data_json))
print(len(data_json))
print(data_json[0])

That was the JSON file. Next, the CSV files.

In [None]:
from google.colab import files

# Upload department-v2.csv
files.upload()

In [None]:
from google.colab import files

# Upload seniority-v2.csv
files.upload()

In [None]:
!ls

In [None]:
import pandas as pd

df_department = pd.read_csv("department-v2.csv")
df_seniority = pd.read_csv("seniority-v2.csv")

df_department.head(), df_seniority.head()

First, build the model for DEPARTMENT.

In [None]:
df_department['text'].str.len().describe()

In [None]:
df_department['label'].value_counts(normalize=True)

Method: **TF-IDF + Logistic Regression**

TF-IDF stands for Term Frequency – Inverse Document Frequency.

Deep learning is not a good fit here because:

1. The text is extremely short (average 34 characters)

2. The dataset size is approximately 10k (too small for DL)

3. Category noise is high (the title itself is ambiguous)

4. The task objective leans toward semantic matching rather than generation

In [None]:
from sklearn.model_selection import train_test_split

X = df_department["text"]
y = df_department["label"]

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(len(X_train), len(X_val))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    lowercase=True,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.9
)

In [None]:
X_train_tfidf = tfidf.fit_transform(X_train)

In [None]:
X_val_tfidf = tfidf.transform(X_val)

In [None]:
print(X_train_tfidf.shape)
print(X_val_tfidf.shape)

Modeling

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42
)

In [None]:
clf.fit(X_train_tfidf, y_train)

In [None]:
y_pred = clf.predict(X_val_tfidf)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_val, y_pred))

In [None]:
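import random
import re

random.seed(42)  # make the augmentation reproducible

def text_augment_case_symbol(text):
    """Return a random casing/separator variant of `text` (or `text` if none differs)."""
    candidates = {
        text.lower(),
        text.upper(),
        text.title(),
        text.replace(" ", "-"),
        text.replace(" ", "_"),
        re.sub(r"\s+", "  ", text),  # turn each whitespace run into two spaces
    }

    # Drop variants identical to the input; sort so the choice is reproducible.
    candidates = sorted(candidates - {text})

    if not candidates:
        return text

    return random.choice(candidates)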


In [None]:
import pandas as pd

df = df_department.copy()

class_counts = df['label'].value_counts()

small_classes = class_counts[class_counts < 50].index.tolist()

augmented_rows = []

for cls in small_classes:
    subset = df[df['label'] == cls]

    for _, row in subset.iterrows():
        if random.random() < 0.8:
            new_text = text_augment_case_symbol(row['text'])
            augmented_rows.append({
                "text": new_text,
                "label": cls
            })

df_augmented = pd.concat([df, pd.DataFrame(augmented_rows)], ignore_index=True)

print("Original:", len(df))
print("After:", len(df_augmented))


“We applied light text augmentation limited to casing and formatting variations to improve robustness against real-world writing inconsistencies, particularly for underrepresented classes, without introducing semantic noise.”
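To see what these variants change at the feature level, the sketch below runs the analyzer of a default `TfidfVectorizer` over a made-up title. With default settings the analyzer lowercases the text and keeps runs of two or more word characters, so case, hyphen and spacing variants yield the same tokens, while the underscore variant becomes one new token (an underscore counts as a word character).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Default analyzer: lowercases, then keeps tokens of 2+ word characters.
analyze = TfidfVectorizer().build_analyzer()

# Made-up title in the five surface forms the augmentation can produce.
variants = {
    "original": "Senior Data Engineer",
    "upper": "SENIOR DATA ENGINEER",
    "hyphen": "Senior-Data-Engineer",
    "double_space": "Senior  Data  Engineer",
    "underscore": "Senior_Data_Engineer",
}

tokens = {name: analyze(text) for name, text in variants.items()}
for name, toks in tokens.items():
    print(f"{name:>12}: {toks}")
```

If the goal is robustness to underscores as well, a custom `token_pattern` or a preprocessing step that replaces underscores with spaces would be one option.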

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

X = df_augmented['text']
y = df_augmented['label']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42
)

clf.fit(X_train_tfidf, y_train)

y_pred = clf.predict(X_val_tfidf)

print(classification_report(y_val, y_pred))


The results above are from the model trained on the augmented data (case, spacing, and symbol variations).
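Because the augmentation runs before `train_test_split`, a variant of the same title can land in both the training and the validation set, which may make validation scores look better than they are. A sketch of the alternative order (split first, augment only the training rows) follows. The DataFrame is made up, and upper-casing stands in for the real augmenter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data for illustration only.
toy = pd.DataFrame({
    "text": [f"title {i}" for i in range(20)],
    "label": ["A"] * 10 + ["B"] * 10,
})

# 1) Split first, so validation rows are never touched by augmentation.
train_df, val_df = train_test_split(
    toy, test_size=0.2, stratify=toy["label"], random_state=42
)

# 2) Augment only the training rows (upper-casing stands in for the real augmenter).
extra = train_df.assign(text=train_df["text"].str.upper())
train_aug = pd.concat([train_df, extra], ignore_index=True)

print(len(train_df), len(train_aug), len(val_df))
```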

Next, try n-gram TF-IDF. By default, `TfidfVectorizer` uses unigrams (single words); adding bigrams also captures two-word phrases.

In [None]:
df_department

In [None]:
from sklearn.model_selection import train_test_split

X = df_department['text']
y = df_department['label']

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigram + bigram
    min_df=2,
    max_df=0.95,          # ignore terms that appear in more than 95% of documents
    lowercase=True
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

print(X_train_tfidf.shape)
print(X_val_tfidf.shape)

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42
)

clf.fit(X_train_tfidf, y_train)

In [None]:
from sklearn.metrics import classification_report

y_pred = clf.predict(X_val_tfidf)

print(classification_report(y_val, y_pred))


Next, combine **augmented data + n-grams**.

In [None]:
X_aug = df_augmented['text']
y_aug = df_augmented['label']

print("Original size:", len(df_department))
print("Augmented size:", len(df_augmented))

y_aug.value_counts(normalize=True)

In [None]:
from sklearn.model_selection import train_test_split

X_train_aug, X_val_aug, y_train_aug, y_val_aug = train_test_split(
    X_aug,
    y_aug,
    test_size=0.2,
    stratify=y_aug,
    random_state=42
)

print(len(X_train_aug), len(X_val_aug))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_aug_ngram = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=2
)

X_train_aug_tfidf = vectorizer_aug_ngram.fit_transform(X_train_aug)
X_val_aug_tfidf = vectorizer_aug_ngram.transform(X_val_aug)

print(X_train_aug_tfidf.shape)
print(X_val_aug_tfidf.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf_aug_ngram = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42
)

# Training with TF-IDF features enhanced by n-grams
clf_aug_ngram.fit(X_train_aug_tfidf, y_train_aug)

y_pred_aug_ngram = clf_aug_ngram.predict(X_val_aug_tfidf)

print(classification_report(y_val_aug, y_pred_aug_ngram))

Try L1 regularization.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Reuses X_train_tfidf, X_val_tfidf, y_train, y_val from the n-gram cells above
clf_l1 = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    C=1.0,               # inverse regularization strength: first value tried
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)

clf_l1.fit(X_train_tfidf, y_train)

y_pred_l1 = clf_l1.predict(X_val_tfidf)

print(classification_report(y_val, y_pred_l1))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf_l1 = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    C=0.5,               # stronger regularization
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)

clf_l1.fit(X_train_tfidf, y_train)

y_pred_l1 = clf_l1.predict(X_val_tfidf)

print(classification_report(y_val, y_pred_l1))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf_l1 = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    C=2,                 # weaker regularization
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)

clf_l1.fit(X_train_tfidf, y_train)

y_pred_l1 = clf_l1.predict(X_val_tfidf)

print(classification_report(y_val, y_pred_l1))

Of the three values tried (C = 0.5, 1, 2), C=1 performed best on the validation set.
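A cross-validated search is one way to make this choice less dependent on a single split. The sketch below wraps the vectorizer and the classifier in a `Pipeline` and searches the same three values of `C` with macro F1 as the score. The titles and labels are made up, so the selected value says nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up titles and labels for illustration only.
texts = [
    "sales manager", "sales representative", "account executive sales",
    "sales director", "inside sales associate", "regional sales lead",
    "software engineer", "backend engineer", "data engineer",
    "engineer in test", "senior software engineer", "platform engineer",
]
labels = ["Sales"] * 6 + ["Engineering"] * 6

search = GridSearchCV(
    Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", LogisticRegression(penalty="l1", solver="liblinear",
                                   class_weight="balanced", max_iter=1000,
                                   random_state=42)),
    ]),
    param_grid={"clf__C": [0.5, 1.0, 2.0]},  # same values as tried above
    scoring="f1_macro",
    cv=3,  # stratified 3-fold for classifiers
)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary leaks from the held-out fold.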

In [None]:
X = df_augmented["text"]
y = df_augmented["label"]

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [None]:
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),     # unigram + bigram
    min_df=2,
    max_df=0.9
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

print("Train TF-IDF shape:", X_train_tfidf.shape)
print("Val   TF-IDF shape:", X_val_tfidf.shape)

In [None]:
clf = LogisticRegression(
    penalty="l1",
    solver="liblinear",
    C=1.0,
    class_weight="balanced",
    max_iter=1000,
    random_state=42
)

clf.fit(X_train_tfidf, y_train)

In [None]:
y_pred = clf.predict(X_val_tfidf)

print("===== Final Department Model (Validation) =====")
print(classification_report(y_val, y_pred))

Final department model: augmented data + n-gram TF-IDF + Logistic Regression (L1).
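`joblib` is imported at the top of the notebook but not used afterwards. One possible use is persisting the final model, sketched below on made-up data. The file name is hypothetical, and the `Pipeline` wrapper is a suggestion rather than what the cells above do.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data for illustration only.
texts = ["sales manager", "sales director", "software engineer", "data engineer"]
labels = ["Sales", "Sales", "Engineering", "Engineering"]

# Bundling vectorizer and classifier keeps them in sync when reloaded.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

# Hypothetical file name; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "department_model.joblib")
joblib.dump(model, path)

restored = joblib.load(path)
print(restored.predict(["senior sales manager"]))
```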

Now test on the JSON data.

In [None]:
type(data_json)

In [None]:
type(data_json[0][0])

In [None]:
data_json[0][0].keys()

In [None]:
texts_test = []

for group in data_json:
    for item in group:
        texts_test.append(item["position"])

print("Number of test samples:", len(texts_test))

In [None]:
X_test = vectorizer.transform(texts_test)
print("TF-IDF test shape:", X_test.shape)

In [None]:
# Evaluate on the first entry of each CV only (record[0]), not on every position
X_test = [record[0]['position'] for record in data_json]
y_true_department = [record[0]['department'] for record in data_json]

print("Number of samples:", len(X_test), len(y_true_department))

X_test_tfidf = vectorizer.transform(X_test)
y_pred_department = clf.predict(X_test_tfidf)

from sklearn.metrics import classification_report, accuracy_score, f1_score

print("Accuracy:", accuracy_score(y_true_department, y_pred_department))
print("Macro F1:", f1_score(y_true_department, y_pred_department, average='macro'))
print("Weighted F1:", f1_score(y_true_department, y_pred_department, average='weighted'))

print("\nDetailed report:")
print(classification_report(y_true_department, y_pred_department))

Next, the model for SENIORITY.

In [None]:
df_seniority['label'].value_counts(normalize=True)

In [None]:
df_seniority['text'].str.len().describe()

In [None]:
df_seniority.head(5)

In [None]:
from sklearn.model_selection import train_test_split

X = df_seniority['text']
y = df_seniority['label']

df_seniority_train, df_seniority_test = train_test_split(
    df_seniority,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train label distribution:")
print(df_seniority_train['label'].value_counts(normalize=True))

print("\nTest label distribution:")
print(df_seniority_test['label'].value_counts(normalize=True))


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, f1_score

# X / y
X_train = df_seniority_train['text']
y_train = df_seniority_train['label']

X_test = df_seniority_test['text']
y_test = df_seniority_test['label']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),
        min_df=5,
        max_df=0.9
    )),
    ('clf', LogisticRegression(
        penalty='l1',
        solver='liblinear',
        class_weight='balanced',
        max_iter=1000
    ))
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Macro F1:", f1_score(y_test, y_pred, average='macro'))
print("Weighted F1:", f1_score(y_test, y_pred, average='weighted'))

print("\nDetailed report:\n")
print(classification_report(y_test, y_pred))


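For comparison, train the same L1-regularized classifier on default (unigram) TF-IDF features. The pipeline above already uses an L1 penalty; the difference here is the vectorizer settings.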

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = df_seniority['text']
y = df_seniority['label']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

clf_seniority_l1 = LogisticRegression(
    penalty='l1',
    solver='liblinear',
    C=1.0,
    class_weight='balanced',
    max_iter=1000,
    random_state=42
)

# train
clf_seniority_l1.fit(X_train_tfidf, y_train)

y_pred_l1 = clf_seniority_l1.predict(X_val_tfidf)

print("===== L1 Regularized Seniority Model Performance =====")
print(f"Number of samples: {X_val_tfidf.shape[0]} {len(y_pred_l1)}")
print(classification_report(y_val, y_pred_l1))


Test the unigram TF-IDF + L1 model on the JSON data.

In [None]:
# Seniority model evaluation on JSON (unigram TF-IDF + L1)
import numpy as np
from sklearn.metrics import classification_report, f1_score, accuracy_score

# Collect every position of every CV (not only the first entry per CV).
texts_json = []
y_true = []

for sublist in data_json:
    for entry in sublist:
        texts_json.append(entry['position'])
        y_true.append(entry['seniority'])

# Vectorize and predict in one batch instead of one call per entry.
y_pred = clf_seniority_l1.predict(vectorizer.transform(texts_json))

y_true = np.array(y_true)
y_pred = np.array(y_pred)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average='macro')
weighted_f1 = f1_score(y_true, y_pred, average='weighted')

print(f"Number of samples: {len(y_true)} {len(y_pred)}")
print(f"Accuracy: {accuracy}")
print(f"Macro F1: {macro_f1}")
print(f"Weighted F1: {weighted_f1}")
print("\nDetailed report:")
print(classification_report(y_true, y_pred))

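For comparison, the earlier n-gram pipeline on the JSON data. It also uses an L1 penalty; it differs in its TF-IDF settings. This cell scores only the first position of each CV, while the cell above scores every position, so the two reports are not directly comparable.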

In [None]:
X_test_json = [item[0]['position'] for item in data_json]

y_pred_json = pipeline.predict(X_test_json)
y_true_json = [item[0]['seniority'] for item in data_json]

from sklearn.metrics import accuracy_score, f1_score, classification_report

print("Number of samples:", len(y_true_json), len(y_pred_json))
print("Accuracy:", accuracy_score(y_true_json, y_pred_json))
print("Macro F1:", f1_score(y_true_json, y_pred_json, average='macro'))
print("Weighted F1:", f1_score(y_true_json, y_pred_json, average='weighted'))
print("\nDetailed report:\n")
print(classification_report(y_true_json, y_pred_json))
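To compare models on identical inputs, the metric reporting can be factored into a small helper that is called with the same label list for each model. `summarize` is a hypothetical name, and the labels below are made up.

```python
from sklearn.metrics import accuracy_score, f1_score


def summarize(y_true, y_pred):
    """Return accuracy, macro F1 and weighted F1 as a dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
    }


# Made-up labels for illustration only.
y_true_demo = ["Senior", "Junior", "Senior", "Lead"]
y_pred_demo = ["Senior", "Junior", "Junior", "Lead"]
print(summarize(y_true_demo, y_pred_demo))
```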