# Predicting Financial Stress (Part 1)


### Description

Let’s explore a realistic and socially impactful machine learning task — predicting financial stress levels among gig economy workers using data on digital behavior, income streams, and financial activity. Your objective is to analyze the provided dataset, build a classification model, and submit your predictions in a competition-style format.


Financial stress prediction is increasingly important for fintech startups, mental health platforms, and labor unions seeking to identify at-risk individuals. With the growing prevalence of gig work (e.g., ride-sharing, freelance platforms, delivery services), understanding how digital and financial patterns correlate with stress is crucial for designing timely interventions and support systems.


The dataset includes various behavioral, occupational, and financial features collected via consent-based app usage and surveys. Your model should classify each worker into one of three financial stress levels: low, moderate, or high.

The main goal of the task is to achieve the highest possible accuracy score, but unconventional approaches and creative ideas will also be taken into account.

### Evaluation

The target variable is a 3-class categorical variable:

financial_stress_level ∈ {Low, Moderate, High}

The evaluation metric is **Accuracy**.
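Accuracy is the share of workers whose predicted class matches the true class. A minimal sketch of the computation follows; the `accuracy` helper and the labels are made up for illustration and are not part of the competition tooling.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Illustrative labels only; these are not taken from the dataset.
y_true = ["Low", "Moderate", "High", "Low"]
y_pred = ["Low", "High", "High", "Low"]

score = accuracy(y_true, y_pred)
print(score)  # 3 of the 4 predictions match, so this prints 0.75
```

Because every error counts the same, confusing Low with High costs no more than confusing Low with Moderate.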

### Files

**train.csv**: labeled training data


**test.csv**: unlabeled test data


**sample_submission.csv**: submission format


### Submission Format

Save a CSV file (in the current folder) with the following columns:

| worker_id | financial_stress_level |
|---|---|
| 0 | Low |
| 1 | Moderate |
| 2 | High |
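Before submitting, it can help to sanity-check the file against this format. The sketch below is a hypothetical helper, not an official validator. It assumes the header must match exactly, labels must be capitalized as in the example rows, and each `worker_id` should appear once; none of these rules is stated by the task beyond the example itself.

```python
import csv
import io

EXPECTED_COLUMNS = ["worker_id", "financial_stress_level"]
ALLOWED_LABELS = {"Low", "Moderate", "High"}  # assumed to be case-sensitive


def check_submission(csv_text):
    """Return a list of problems found in a submission; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        worker_id = row["worker_id"]
        label = row["financial_stress_level"]
        if worker_id in seen_ids:
            problems.append(f"line {line_no}: duplicate worker_id {worker_id}")
        seen_ids.add(worker_id)
        if label not in ALLOWED_LABELS:
            problems.append(f"line {line_no}: unknown label {label!r}")
    return problems


# The example rows from the format description pass the check.
example = "worker_id,financial_stress_level\n0,Low\n1,Moderate\n2,High\n"
problems = check_submission(example)

# A lowercase label is flagged under the case-sensitivity assumption.
bad = check_submission("worker_id,financial_stress_level\n0,low\n")
print(problems, bad)
```

To check a saved file, read its text first, for example `check_submission(open("submission.csv").read())`.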


## Dataset Description

**worker_id** — unique identifier of each gig economy worker

**survey_month** — month when the data was collected

**worker_age** — age of the worker in years

**job_sector** — type of gig job (e.g., delivery, ride-hailing, freelance)

**estimated_annual_income** — self-estimated total income for the year

**monthly_gig_income** — actual monthly earnings from gig work

**num_savings_accounts** — number of savings or checking accounts owned

**num_credit_cards** — total number of active credit cards

**avg_credit_interest** — average interest rate across all credit cards

**num_active_loans** — total number of ongoing personal or payday loans

**avg_loan_delay_days** — average delay in loan repayments, measured in days

**missed_payment_events** — number of missed or late payment events

**recent_credit_checks** — number of credit inquiries in the past 3 months

**current_total_liability** — total amount of outstanding debt

**credit_utilization_rate** — ratio of current credit used to credit limit

**credit_age_months** — duration (in months) since the first credit account was opened

**min_payment_flag** — 1 if only the minimum payment was made, 0 if otherwise

**monthly_investments** — total amount invested or saved in the current month

**spending_behavior** — category describing current spending habits

**end_of_month_balance** — available account balance at month’s end

**financial_stress_level** — [Target] financial stress classification: low / moderate / high

In [1]:
%%capture
!pip install catboost[gpu] optuna

In [2]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import GroupKFold
import optuna


def to_months(x):
    """Convert a credit-age string to a total number of months.

    Assumes values shaped like '<years>y. <months>m.' (inferred from the
    parsing below); non-string values such as NaN stay missing.
    """
    if isinstance(x, str):
        y, rest = x.split('y.')
        m = rest.replace('m.', '').strip()
        return int(y) * 12 + (int(m) if m else 0)
    return np.nan


train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")


train['credit_age_months'] = train['credit_age_months'].apply(to_months)
test['credit_age_months'] = test['credit_age_months'].apply(to_months)


target_mapping = {'Low':0, 'Moderate':1, 'High':2}
train['financial_stress_level_encoded'] = train['financial_stress_level'].map(target_mapping)


cat_features = ['job_sector', 'min_payment_flag', 'spending_behavior', 'survey_month']


for col in cat_features:
    train[col] = train[col].astype(str)
    test[col] = test[col].astype(str)

X = train.drop(columns=['worker_id', 'financial_stress_level', 'financial_stress_level_encoded'])
y = train['financial_stress_level_encoded']
groups = train['worker_id']

def objective(trial):
    params = {
        'iterations': 2000,
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.1, log=True),
        'depth': trial.suggest_int('depth', 3, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-3, 10.0, log=True),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),
        'random_strength': trial.suggest_float('random_strength', 0.0, 10.0),
        'border_count': trial.suggest_int('border_count', 32, 255),
        'task_type': 'GPU',
        'devices': '0',
        'loss_function': 'MultiClass',
        'eval_metric': 'Accuracy',
        'random_seed': 42,
        'early_stopping_rounds': 100,
        'verbose': 0
    }


    gkf = GroupKFold(n_splits=5)
    accuracies = []

    for train_idx, valid_idx in gkf.split(X, y, groups):
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

        train_pool = Pool(data=X_train, label=y_train, cat_features=cat_features)
        valid_pool = Pool(data=X_valid, label=y_valid, cat_features=cat_features)

        model = CatBoostClassifier(**params)
        model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

        preds = model.predict(valid_pool).flatten().astype(int)
        accuracy = (preds == y_valid).mean()
        accuracies.append(accuracy)

    return np.mean(accuracies)


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30, timeout=3600)

print("Best trial:")
print(study.best_trial.params)


best_params = study.best_trial.params
best_params.update({
    'iterations': 2000,
    'loss_function': 'MultiClass',
    'eval_metric': 'Accuracy',
    'random_seed': 42,
    'early_stopping_rounds': 100,
    'task_type': 'GPU',
    'devices': '0',
    'verbose': 100,
})

full_train_pool = Pool(data=X, label=y, cat_features=cat_features)
test_pool = Pool(data=test.drop(columns=['worker_id']), cat_features=cat_features)

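final_model = CatBoostClassifier(**best_params)
# No eval_set is passed here, so CatBoost cannot pick a best iteration or stop early:
# it switches use_best_model off (see the warning in the output) and trains all 2000 iterations.
final_model.fit(full_train_pool, use_best_model=True)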

preds_test = final_model.predict(test_pool).flatten().astype(int)

inv_target_mapping = {v:k for k,v in target_mapping.items()}
preds_labels = [inv_target_mapping[p] for p in preds_test]

submission = pd.DataFrame({
    'worker_id': test['worker_id'],
    'financial_stress_level': preds_labels
})

submission.to_csv('submission.csv', index=False)


[I 2025-09-09 16:39:45,838] A new study created in memory with name: no-name-86fa5cf3-2914-4fdd-9af3-7eddb52eb917
[I 2025-09-09 16:40:17,771] Trial 0 finished with value: 0.6648392857142856 and parameters: {'learning_rate': 0.001348008213435413, 'depth': 8, 'l2_leaf_reg': 4.1910810043764855, 'bagging_temperature': 0.4836204982386705, 'random_strength': 3.082636461648595, 'border_count': 76}. Best is trial 0 with value: 0.6648392857142856.
[I 2025-09-09 16:41:44,650] Trial 1 finished with value: 0.6741607142857142 and parameters: {'learning_rate': 0.0017052914298124101, 'depth': 10, 'l2_leaf_reg': 0.4317630576243327, 'bagging_temperature': 0.26589670873336013, 'random_strength': 3.629733529905926, 'border_count': 252}. Best is trial 1 with value: 0.6741607142857142.
[I 2025-09-09 16:42:00,608] Trial 2 finished with value: 0.6793214285714285 and parameters: {'learning_rate': 0.07219243812661935, 'depth': 8, 'l2_leaf_reg': 0.0049039821306937756, 'bagging_temperature': 0.9661213557490759, 

Best trial:
{'learning_rate': 0.02478515450323388, 'depth': 9, 'l2_leaf_reg': 0.051381494594595026, 'bagging_temperature': 0.5290528442715288, 'random_strength': 6.942581054988685, 'border_count': 57}


You should provide test set for use best model. use_best_model parameter has been switched to false value.


0:	learn: 0.6661786	total: 22ms	remaining: 43.9s
100:	learn: 0.7006607	total: 2.8s	remaining: 52.7s
200:	learn: 0.7164286	total: 5.33s	remaining: 47.7s
300:	learn: 0.7317321	total: 6.43s	remaining: 36.3s
400:	learn: 0.7447500	total: 7.5s	remaining: 29.9s
500:	learn: 0.7662500	total: 9.34s	remaining: 27.9s
600:	learn: 0.7861071	total: 11.7s	remaining: 27.3s
700:	learn: 0.8026429	total: 14.1s	remaining: 26.1s
800:	learn: 0.8178929	total: 18.2s	remaining: 27.3s
900:	learn: 0.8300000	total: 20.7s	remaining: 25.2s
1000:	learn: 0.8415714	total: 23.1s	remaining: 23s
1100:	learn: 0.8524643	total: 25.5s	remaining: 20.8s
1200:	learn: 0.8611071	total: 29.5s	remaining: 19.6s
1300:	learn: 0.8702857	total: 31.9s	remaining: 17.1s
1400:	learn: 0.8786429	total: 34.3s	remaining: 14.7s
1500:	learn: 0.8869107	total: 36.7s	remaining: 12.2s
1600:	learn: 0.8945179	total: 39.6s	remaining: 9.86s
1700:	learn: 0.9013393	total: 43.1s	remaining: 7.58s
1800:	learn: 0.9080000	total: 45.5s	remaining: 5.03s
1900:	lear

# Probability Theory Problems (Part 2)

## Task 1

A standard deck of **52** cards is well shuffled. Find the probability that all **4 aces are positioned next to each other.**

Provide the solution below, with the answer accurate to 10^-5.

In [3]:
1/5525

0.00018099547511312217
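The cell above gives only the final value. One way to reach it is to treat the four aces as a single block: the block can start at any of 49 positions, the aces can be ordered in 4! ways inside it, and the other 48 cards in 48! ways, out of 52! equally likely orderings. The sketch below checks that count with exact fractions and cross-checks it by looking only at which 4 of the 52 positions hold aces; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

# Block argument: 49 start positions, 4! ace orders, 48! orders of the other cards.
favorable = 49 * factorial(4) * factorial(48)
total = factorial(52)
p_block = Fraction(favorable, total)

# Cross-check: all C(52, 4) sets of ace positions are equally likely,
# and exactly 49 of them consist of four consecutive positions.
p_positions = Fraction(49, comb(52, 4))

print(p_block)         # exact value as a reduced fraction
print(float(p_block))  # decimal value to compare with the cell output above
```

Both counts reduce to the same fraction, 1/5525, which is about 0.00018 at the requested precision.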

## Task 2

From a collection of letter tiles spelling the word **STATISTICS (10 letters in total)**, 4 tiles are randomly drawn without replacement.

What is the probability that the selected letters can be rearranged to form the word **CAST**?

Provide the solution below, with the answer accurate to 10^-5.

In [4]:
3/70

0.04285714285714286
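The reasoning behind 3/70 can be made explicit. STATISTICS contains S three times, T three times, I twice, and A and C once each. The draw must contain exactly one each of C, A, S and T, so there are 1 · 1 · 3 · 3 = 9 favorable sets of tiles out of C(10, 4) = 210. The sketch below computes this and confirms it by enumerating every 4-tile draw; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb, prod

word = "STATISTICS"
target = "CAST"
counts = Counter(word)  # S: 3, T: 3, I: 2, A: 1, C: 1

# Counting argument: pick one tile for each letter of the target word.
favorable = prod(counts[ch] for ch in target)
total = comb(len(word), len(target))
p_formula = Fraction(favorable, total)

# Brute force: enumerate all 4-tile draws by tile position and compare letters.
hits = sum(
    1
    for idx in combinations(range(len(word)), len(target))
    if sorted(word[i] for i in idx) == sorted(target)
)
p_brute = Fraction(hits, total)

print(p_formula, float(p_formula))
```

The enumeration treats tiles with the same letter as distinct, which matches drawing physical tiles without replacement.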