# EDA

In [17]:
import pandas as pd
from ydata_profiling import ProfileReport
import numpy as np

In [18]:
df = pd.read_csv('raw_data.csv')

In [19]:
df['product02'].value_counts()

product02
Nee    5618
Ja     3906
Name: count, dtype: int64

In [20]:
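# Hold out a balanced test set: 100 random rows per class.
# No random_state is set, so the split differs on every run.
ja_sample = df[df.product02 == 'Ja'].sample(100)
nee_sample = df[df.product02 == 'Nee'].sample(100)

test_sample = pd.concat([ja_sample, nee_sample])
test_sample.to_csv('./data/input/test_data.csv')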

In [21]:
# Training set: every row whose subscriber was not drawn into the test sample.
train_data = df[~df.subscriber.isin(test_sample['subscriber'].values)]
train_data.to_csv('./data/input/train_data.csv')
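
The hold-out split above can also be written so that it is reproducible. The following is a minimal sketch, not the notebook's actual code: the helper name `split_balanced_holdout`, the `seed` parameter, and the synthetic `demo` frame are assumptions made for illustration.

```python
import pandas as pd

def split_balanced_holdout(df, target="product02", key="subscriber",
                           n_per_class=100, seed=0):
    """Return (train, test); test holds n_per_class random rows per class."""
    # groupby(...).sample draws n rows from every class in a single call
    test = df.groupby(target).sample(n=n_per_class, random_state=seed)
    # Every row whose key is not in the test set becomes training data
    train = df[~df[key].isin(test[key])].copy()
    return train, test

# Tiny synthetic frame standing in for raw_data.csv (hypothetical values)
demo = pd.DataFrame({
    "subscriber": range(1, 21),
    "product02": ["Ja"] * 8 + ["Nee"] * 12,
})
train, test = split_balanced_holdout(demo, n_per_class=3, seed=0)
print(test["product02"].value_counts().to_dict())
print(len(train), len(test))
```

Fixing the seed makes the test set identical across runs, and the `.copy()` makes the training frame independent of `df`, which typically avoids pandas' copy-of-a-slice warning in later steps.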

In [22]:
train_data.describe()

statistic,subscriber,income,age,var1
count,9324.0,8778.0,8786.0,8769.0
mean,4761.093737,122265.978974,49.030304,-0.107199
std,2747.972157,88956.810777,16.278518,1.482041
min,1.0,985.8498,18.20965,-17.1403
25%,2383.75,52970.478855,36.240522,-0.853514
50%,4756.5,101551.55,47.83053,-0.11968
75%,7137.25,171041.05,59.629265,0.676826
max,9524.0,632975.0,101.0954,13.60078


In [23]:
# Get count of missing values in the dataset
train_data.isnull().sum()

subscriber      0
income        546
age           538
var1          555
gender        545
house_type    532
lastVisit       0
product02       0
dtype: int64

In [24]:
# Checking the distribution of the null values across rows 
train_data.isnull().sum(axis=1).value_counts()

0    8120
2     422
3     329
1     314
4     124
5      15
Name: count, dtype: int64

In [25]:
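# Missing values per column after excluding rows with more than 2 missing values
# (preview only: the result is not assigned, so train_data is unchanged).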
train_data[train_data.isnull().sum(axis=1) <= 2].isnull().sum()

subscriber      0
income        221
age           238
var1          233
gender        241
house_type    225
lastVisit       0
product02       0
dtype: int64
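
If the filter is meant to be applied rather than just previewed, `DataFrame.dropna` with `thresh` expresses the same rule. This is a small sketch on a made-up frame; `demo` and `max_missing` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 0, 2, 3 and 1 missing values respectively
demo = pd.DataFrame({
    "subscriber": [1, 2, 3, 4],
    "income": [50_000.0, np.nan, np.nan, 72_000.0],
    "age": [34.0, 51.0, np.nan, np.nan],
    "var1": [0.2, np.nan, np.nan, -1.1],
})

max_missing = 2
# thresh is the minimum number of NON-missing values a row needs to be kept
filtered = demo.dropna(thresh=demo.shape[1] - max_missing)

# Equivalent boolean-mask form, matching the cell above
same = demo[demo.isnull().sum(axis=1) <= max_missing]
print(filtered["subscriber"].tolist())
```

Assigning the result (for example back to `train_data`) is what actually drops the rows.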

In [26]:
profile = ProfileReport(train_data, title="Profiling Report")
profile.to_file("dataReport.html")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={"index": "df_index"}, inplace=True)


## EDA Results
1 - Income, as expected, is right-skewed. It needs a log transform and rounding if we are going to use a parametric model.

2 - Age is not an integer, which should be addressed. Age also looks log-normal.

3 - We need to extract date and time features from the date column.

4 - Several columns have missing data: income, age, var1, gender, and house_type each have roughly 530 to 555 missing values out of 9,324 training rows.

5 - Moderate positive correlation between var1 and house_type; moderate negative correlation between var1 and age.

6 - Moderate positive correlation among age, income, and gender.

7 - The target variable is imbalanced, and we need to address this. (Currently using class weights plus weighted scoring; SMOTE was tried but gave no lift.)
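
Findings 1 to 3 translate into a few simple transformations. The sketch below is one possible way to do it, under assumptions: the date column is taken to be `lastVisit` in a parseable timestamp format, `log1p` stands in for "log scaling", rounding to two decimals is an arbitrary choice, and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; the lastVisit format is an assumption
demo = pd.DataFrame({
    "income": [985.85, 101551.55, 632975.0],
    "age": [18.20965, 47.83053, 101.0954],
    "lastVisit": ["2019-01-05 14:30:00", "2019-07-21 09:05:00",
                  "2020-02-29 23:59:00"],
})

out = demo.copy()
# Finding 1: compress the right tail of income, then round
out["log_income"] = np.log1p(out["income"]).round(2)
# Finding 2: age in whole years; nullable Int64 keeps missing values as <NA>
out["age_years"] = np.floor(out["age"]).astype("Int64")
# Finding 3: split the timestamp into calendar features
visit = pd.to_datetime(out["lastVisit"], errors="coerce")
out["visit_month"] = visit.dt.month
out["visit_dayofweek"] = visit.dt.dayofweek  # Monday = 0
out["visit_hour"] = visit.dt.hour
print(out.drop(columns=["lastVisit"]))
```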

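For finding 7, the sketch below shows how class weights could be derived and used in a score. It is an illustration, not the project's actual setup: the training counts assume each subscriber appears once (so 100 rows per class were held out), the weights follow the common "balanced" heuristic `n_samples / (n_classes * class_count)`, and `weighted_accuracy` is a hypothetical stand-in for whatever weighted metric was used.

```python
import pandas as pd

# Training-split class counts: full-data counts minus 100 held out per class
counts = pd.Series({"Nee": 5618 - 100, "Ja": 3906 - 100})

# "Balanced" heuristic: the rarer class gets the larger weight
class_weight = (counts.sum() / (len(counts) * counts)).to_dict()

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each row counts in proportion to its class weight."""
    y_true = pd.Series(list(y_true))
    y_pred = pd.Series(list(y_pred))
    w = y_true.map(weights)
    correct = y_true.eq(y_pred)
    return float((w * correct).sum() / w.sum())

# One correct "Ja" and one "Nee" misclassified as "Ja"
score = weighted_accuracy(["Ja", "Nee"], ["Ja", "Ja"], class_weight)
print(class_weight, score)
```

With these weights, both classes contribute the same total weight to the score, so errors on the minority class are not drowned out by the majority.
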

## Future Improvement Ideas:

1 - Try filling in missing values using smarter imputation (KNN or model-based prediction).

2 - Try other ways of handling the class imbalance instead of class weights.

3 - Try out different models with extra feature engineering (transformations).

4 - Use logging instead of print statements.

5 - Use interaction variables.
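
As a starting point for the KNN imputation idea, here is a minimal sketch using scikit-learn's `KNNImputer` on the numeric columns. The data is made up, `n_neighbors=2` is an arbitrary choice, and the manual standardization is an added assumption so that income does not dominate the distance computation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric rows with scattered missing values
demo = pd.DataFrame({
    "income": [52_000.0, np.nan, 171_000.0, 101_000.0, 98_000.0, np.nan],
    "age": [36.0, 48.0, np.nan, 47.0, 45.0, 60.0],
    "var1": [-0.8, -0.1, 0.7, np.nan, -0.2, 0.5],
})

num = demo[["income", "age", "var1"]]
# Standardize first (pandas skips NaN when computing mean and std)
mu, sigma = num.mean(), num.std()
scaled = (num - mu) / sigma

# Each missing entry becomes the mean of that feature over the 2 nearest rows
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map back to the original units
imputed = pd.DataFrame(imputed_scaled, columns=num.columns,
                       index=num.index) * sigma + mu
print(imputed.round(2))
```

Categorical columns such as gender and house_type would need encoding or a separate strategy; this sketch does not cover them.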

## Final Notes:

Whether to optimize for precision or recall in model training depends on the business context. For example, if you are using the model to send marketing offers and want to avoid sending too many (for cost or convenience reasons), optimize for precision. If, on the other hand, the goal is to maximize revenue and you don't want to lose out on potential sales, optimize for recall.
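
The trade-off can be made concrete by moving the decision threshold on the model's predicted probabilities. The labels and scores below are invented for illustration, and `precision_recall_at` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(scores) >= threshold
    tp = np.sum(pred & y_true)    # offers sent to real buyers
    fp = np.sum(pred & ~y_true)   # offers sent to non-buyers
    fn = np.sum(~pred & y_true)   # buyers who got no offer
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# 1 = bought the product ("Ja"); scores are made-up model probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.40, 0.35, 0.20, 0.10]

strict = precision_recall_at(y_true, scores, 0.8)  # few, well-targeted offers
loose = precision_recall_at(y_true, scores, 0.3)   # many offers, few misses
print(strict, loose)
```

A high threshold favors precision (fewer wasted offers), while a low threshold favors recall (fewer missed sales).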