# Part 3 - Predict: Layoff Risk Demo

Use the trained layoff-risk model to score a sample CSV and summarize results.

## Setup
- Data: `Data/cleaned_dataset.csv` (with target) and `Data/prediction_sample.csv` (features only).
- Best model: Decision Tree from Part 2 (`class_weight='balanced'`, `max_depth=4`, `min_samples_leaf=2`).
- Task: load sample CSV, predict risk, print predictions, and provide a brief takeaway.

In [1]:
from pathlib import Path
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pd.set_option('display.max_columns', None)
NOTEBOOK_DIR = Path(__file__).resolve().parent if '__file__' in globals() else Path().resolve()
DATA_DIR = NOTEBOOK_DIR.parent / 'Data'
TRAIN_PATH = DATA_DIR / 'cleaned_dataset.csv'
PRED_PATH = DATA_DIR / 'prediction_sample.csv'
print('Train path:', TRAIN_PATH)
print('Predict input:', PRED_PATH)


Train path: /home/udaniel/school/NEU_6105_Finall/Data/cleaned_dataset.csv
Predict input: /home/udaniel/school/NEU_6105_Finall/Data/prediction_sample.csv


In [2]:
# Load training data

df = pd.read_csv(TRAIN_PATH)
target_col = 'target_high_risk'
X = df.drop(columns=[target_col])
y = df[target_col]

categorical_cols = ['company', 'industry', 'headquarter_location', 'status']
X.head()


| | company | industry | headquarter_location | status | layoffs_12m | layoffs_last90d | days_since_last_layoff | total_employees_est | layoff_ratio | impacted_pct_recent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | glorifi | fintech | dallas | private | 100.0 | 100.0 | 1094 | 1209.090909 | 0.12 | 1.0 |
| 1 | assure | fintech | salt lake city | private | 100.0 | 100.0 | 1092 | 1209.090909 | 0.12 | 1.0 |
| 2 | ncx | renewable energy, forestry | san francisco | private | 100.0 | 100.0 | 1087 | 1209.090909 | 0.12 | 0.4 |
| 3 | blockfi | crypto | jersey city | private | 100.0 | 100.0 | 1087 | 4906.976744 | 0.12 | 1.0 |
| 4 | candy digital | cryptocurrency | new york | private | 33.0 | 33.0 | 1087 | 100.0 | 0.33 | 0.33 |


In [3]:
# Train the best model (Decision Tree)

preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough'
)

dt_model = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=2,
    class_weight='balanced',
    random_state=42,
)

clf = Pipeline([
    ('prep', preprocess),
    ('model', dt_model),
])
clf.fit(X, y)
print('Model trained on full dataset. Class balance:')
print(y.value_counts())


Model trained on full dataset. Class balance:
target_high_risk
1    115
0     27
Name: count, dtype: int64
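
The training set is imbalanced (115 high-risk rows against 27 low-risk rows), which is why the tree uses `class_weight='balanced'`. As a rough illustration of what that setting does, the sketch below reproduces scikit-learn's documented "balanced" heuristic, `n_samples / (n_classes * count_of_class)`, using the class counts printed above. The helper name `balanced_class_weights` is hypothetical and is not part of the notebook.

```python
# Class counts copied from the training output above.
counts = {1: 115, 0: 27}


def balanced_class_weights(class_counts):
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


weights = balanced_class_weights(counts)
for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.3f}")
```

Under this heuristic the minority class (0) receives a weight of roughly 2.63 and the majority class (1) roughly 0.62, so both classes contribute the same total weight when the tree evaluates splits.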


In [4]:
# Load prediction CSV and run inference

pred_df = pd.read_csv(PRED_PATH)
preds = clf.predict(pred_df)
pred_df_with_preds = pred_df.copy()
pred_df_with_preds['pred_high_risk'] = preds

pred_df_with_preds[['company', 'industry', 'layoff_ratio', 'pred_high_risk']]


| | company | industry | layoff_ratio | pred_high_risk |
|---|---|---|---|---|
| 0 | addepar | fintech, data analytics | 0.03 | 0 |
| 1 | aqua security | secops, security | 0.1 | 1 |
| 2 | superrare | blockchain, cryptocurrency | 0.12 | 1 |
| 3 | informatica | big data, cloud computing | 0.07 | 1 |
| 4 | bigcommerce | enterprise software, ecommerce | 0.12 | 1 |
| 5 | wonder | food delivery, ecommerce | 0.07 | 1 |
| 6 | citizen | surveillance | 0.027293 | 0 |
| 7 | vimeo | mediaentertainment | 0.12 | 1 |
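
The sample file may contain categorical values that never occurred in the training data (new company names are an obvious case). The pipeline tolerates this because the encoder is built with `handle_unknown='ignore'`. The sketch below shows that behavior in isolation. The category values are toy inputs chosen for illustration and are not read from the actual CSV files.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy training categories (illustrative only, not the real dataset).
train_industries = [["fintech"], ["crypto"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_industries)

# "surveillance" was not seen during fit, so it has no column of its own.
new_industries = [["fintech"], ["surveillance"]]
encoded = encoder.transform(new_industries).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes an all-zero row
```

An unseen category is encoded as all zeros, so the tree has to rely on the numeric features for that row. With `handle_unknown='error'` (the default), `transform` would raise an error instead.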


## Prediction summary
- The model flags 6 of the 8 sample companies as `1 = high layoff risk`, based on historical layoff ratios and recency of layoffs. The two companies predicted `0` (addepar and citizen) have the lowest `layoff_ratio` values in the sample, about 0.03.
- Expected features: the same columns as `cleaned_dataset.csv` minus `target_high_risk`; see `Data/prediction_sample.csv` for the format.
- To score a new file, replace `prediction_sample.csv` with your own CSV (same columns) and rerun the notebook.
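
Before scoring a new file, it can help to confirm that its columns match what the pipeline was trained on. The following is a minimal sketch of such a check. The helper `compare_columns` and the example `new_file_columns` list are hypothetical; the expected column names are taken from the training preview above.

```python
def compare_columns(expected, actual):
    """Hypothetical helper: return (missing, unexpected) column names."""
    missing = [col for col in expected if col not in actual]
    unexpected = [col for col in actual if col not in expected]
    return missing, unexpected


# Feature columns shown in the training preview (target column excluded).
expected_columns = [
    "company", "industry", "headquarter_location", "status",
    "layoffs_12m", "layoffs_last90d", "days_since_last_layoff",
    "total_employees_est", "layoff_ratio", "impacted_pct_recent",
]

# Hypothetical header of a new CSV: one column renamed, target left in.
new_file_columns = [
    "company", "industry", "hq", "status",
    "layoffs_12m", "layoffs_last90d", "days_since_last_layoff",
    "total_employees_est", "layoff_ratio", "impacted_pct_recent",
    "target_high_risk",
]

missing, unexpected = compare_columns(expected_columns, new_file_columns)
print("Missing columns:", missing)
print("Unexpected columns:", unexpected)
```

In the notebook, the same check could be run as `compare_columns(list(X.columns), list(pred_df.columns))` just before calling `clf.predict`, assuming `X` and `pred_df` are loaded as in the cells above.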