# In this notebook we perform subsampling of the huge dataset.

In [1]:
import pandas as pd
import numpy as np
import polars as pl

In [2]:
train = pd.read_csv('Data/training.csv')
test = pd.read_csv('Data/test.csv')

In [3]:
train.shape, test.shape

((2056160, 44), (7890, 44))

## Is a customer repaying or not? Average over 20 periods.

One idea is to focus only on those clients that had _at least_ one period in which they were classified as "non payers". This lets us concentrate on the observations we want our model to predict correctly, while avoiding the noise introduced by the very large number of disciplined clients, who are not our primary focus.

For each individual client, we average the values of "repays_debt" over the 20 periods.

In [5]:
average_repaying = pd.DataFrame(train.groupby('client_id')['repays_debt'].mean())

In [6]:
average_repaying

client_id,repays_debt
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
103481,0.0
103482,0.0
103483,0.0
103484,0.0
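
To make the averaging step above concrete, here is a minimal sketch on toy data. The frame `toy` and its values are invented for illustration, and the sketch assumes "repays_debt" is a 0/1 flag; under that assumption a per-client mean greater than 0 is equivalent to having at least one flagged period.

```python
import pandas as pd

# Toy data (hypothetical): 3 clients observed over 4 periods each.
toy = pd.DataFrame({
    "client_id":   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "repays_debt": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

# Per-client mean of the binary flag, i.e. the share of flagged periods.
toy_mean = toy.groupby("client_id")["repays_debt"].mean()

# Assuming a 0/1 column, mean > 0 holds exactly when some period is flagged,
# so the per-client maximum gives the same selection.
flagged_by_mean = toy_mean > 0
flagged_by_max = toy.groupby("client_id")["repays_debt"].max() > 0

print(toy_mean)
```

Here client 0 is never flagged, so its mean is 0 and it would be dropped; clients 1 and 2 would be kept.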


We then retain only those clients whose average value is greater than 0, i.e. clients with at least one period in which "repays_debt" is nonzero.

In [7]:
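# .copy() gives an independent frame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
reduced_df = average_repaying.loc[average_repaying['repays_debt'] > 0].copy()
# expose the index as a regular column so it survives to_csv(index=False)
reduced_df['client_id'] = reduced_df.index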

Save the results to a CSV file, which will be used in the notebook **Task 1 - Predictions**.

In [9]:
reduced_df.to_csv('Data/average_repaying.csv', index=False)
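
The downstream notebook is not shown here, so the following is only a sketch of how the saved client ids might be used to subsample the training data. The frames `toy_train` and `toy_reduced` are hypothetical stand-ins for `train` and the saved table; in practice the latter would presumably be loaded with `pd.read_csv('Data/average_repaying.csv')`.

```python
import pandas as pd

# Hypothetical stand-in for the full training data.
toy_train = pd.DataFrame({
    "client_id":   [0, 0, 1, 1, 2, 2],
    "repays_debt": [0, 0, 0, 1, 1, 1],
})

# Hypothetical stand-in for the saved table of retained clients.
toy_reduced = pd.DataFrame({"repays_debt": [0.5, 1.0], "client_id": [1, 2]})

# Keep every row that belongs to a retained client.
subsample = toy_train[toy_train["client_id"].isin(toy_reduced["client_id"])]

print(subsample)
```

Filtering with `isin` keeps all periods of each retained client, not just the flagged ones, which preserves each client's full history.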