In [1]:
import os
import pandas as pd
import numpy  as np

In [2]:
import warnings
warnings.filterwarnings("ignore")

# Exploratory Data Analysis

This notebook assumes:
- The data was downloaded without any problems.
- The data is located in the **data** directory, at the same level as the **notebooks** directory that hosts this notebook.
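
As a minimal sketch, the assumptions above could be checked before running the rest of the notebook. The helper name `missing_files` is hypothetical and not part of the original notebook; the directory and file names are the ones this notebook uses.

```python
import os

def missing_files(directory, filenames):
    """Return the names in `filenames` that do not exist in `directory`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(directory, name))
    ]

# Hypothetical usage with the files this notebook expects to find
expected = ['PAKDD2010_Modeling_Data.txt', 'PAKDD2010_VariablesList.XLS']
missing = missing_files('../data', expected)
if missing:
    print(f"Missing from ../data: {missing}")
```

An empty list means both files were found and the loading cells should be able to read them.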

In [3]:
DATA_REPOSITORY= '../data'

In [4]:
modeling_data_path= os.path.join(
    DATA_REPOSITORY,
    'PAKDD2010_Modeling_Data.txt'
)
variables_data_path= os.path.join(
    DATA_REPOSITORY,
    'PAKDD2010_VariablesList.XLS'
)

### Load the modeling data into a pd.DataFrame object

In [5]:
modeling_data= pd.read_csv(
    filepath_or_buffer= modeling_data_path, 
    encoding= "ISO-8859-1", 
    header= None, 
    delimiter='\t'
)

In [6]:
modeling_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,1,C,5,Web,0,1,F,6,1,0,...,0,0,0,0,1,N,32,595,595,1
1,2,C,15,Carga,0,1,F,2,0,0,...,0,0,0,0,1,N,34,230,230,1
2,3,C,5,Web,0,1,F,2,0,0,...,0,0,0,0,1,N,27,591,591,0
3,4,C,20,Web,0,1,F,2,0,0,...,0,0,0,0,1,N,61,545,545,0
4,5,C,10,Web,0,1,M,2,0,0,...,0,0,0,0,1,N,48,235,235,1


### Load the variables data into a pd.DataFrame object

In [7]:
df_variables= pd.read_excel(
    io=variables_data_path,  
    index_col= 0,
    header= 0
)
df_variables= df_variables\
    .set_index(df_variables.index - 1)

df_variables.index.name= 'COL_ID'

In [8]:
df_variables.head()

COL_ID,Var_Title,Var_Description,Field_Content
0,ID_CLIENT,Sequential number for the applicant (to be use...,"1-50000, 50001-70000, 70001-90000"
1,CLERK_TYPE,Not informed,C
2,PAYMENT_DAY,"Day of the month for bill payment, chosen by t...",1510152025
3,APPLICATION_SUBMISSION_TYPE,Indicates if the application was submitted via...,"Web, Carga"
4,QUANT_ADDITIONAL_CARDS,Quantity of additional cards asked for in the ...,"1,2,NULL"


### Rename modeling data columns

The **df_variables** dataframe holds the title and description of each column in the **modeling_data** dataframe.

We need to:
- Check whether the variable titles are unique.
    - If they aren't, rename them so that there is a 1:1 title-to-column relationship.
- Rename the columns in the **modeling_data** dataframe to make the dataset easier to read.
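
One possible way to run the uniqueness check is sketched below on a toy Series. The titles are illustrative stand-ins for `df_variables.Var_Title`, not the real variable list.

```python
import pandas as pd

# Toy stand-in for df_variables.Var_Title (values are illustrative only)
titles = pd.Series(
    ['ID_CLIENT', 'EDUCATION_LEVEL', 'AGE', 'EDUCATION_LEVEL'],
    name='Var_Title'
)

# True only when every title appears exactly once
titles_are_unique = titles.is_unique

# keep=False flags every occurrence of a repeated title, not just the later ones
repeated = titles[titles.duplicated(keep=False)]

print(titles_are_unique)        # False
print(repeated.index.tolist())  # [1, 3]
```

The index of `repeated` gives the column IDs that need a custom name.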

In [9]:
# Count how often each title appears and keep only the repeated ones
title_counts= df_variables.Var_Title.value_counts()
pd.DataFrame(title_counts[title_counts > 1])

Unnamed: 0,Var_Title
EDUCATION_LEVEL,2


There are two columns with the **EDUCATION_LEVEL** title.

Let's look at the description of each of these columns.

In [10]:
ed_level_col_descrip_df = pd.DataFrame(
    df_variables[
        df_variables.Var_Title == "EDUCATION_LEVEL"
    ].Var_Description
)

In [11]:
for i, row in ed_level_col_descrip_df.iterrows():
    print(
        f"--- Column ID: {i:2d} ---",
        f"{row.Var_Description}",
        sep='\n'
    )

--- Column ID:  9 ---
Edducational level in gradual order not informed
--- Column ID: 43 ---
Mate's educational level in gradual order not informed


#### As we can see, COL_ID 9 refers to the client's educational level and COL_ID 43 refers to the educational level of the client's mate. We will then:

- Create a dictionary that maps each Column ID to its Variable Title, to rename the **modeling_data** dataframe columns
- Give COL_ID 9 the custom name CLI_EDUCATION_LEVEL
- Give COL_ID 43 the custom name MATE_EDUCATION_LEVEL
- Apply the same names in **df_variables** and save it as a CSV file so it can be retrieved later

#### Finally, we will rename the columns using the dictionary as a mapper and set ID_CLIENT as the dataframe index
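
The steps above can be sketched on toy data as follows. The frames `variables` and `data`, and their column IDs and values, are illustrative assumptions and not the real dataset.

```python
import pandas as pd

# Toy stand-ins; column IDs, titles and values are illustrative only
variables = pd.DataFrame(
    {'Var_Title': ['ID_CLIENT', 'EDUCATION_LEVEL', 'EDUCATION_LEVEL']},
    index=pd.Index([0, 1, 2], name='COL_ID'),
)
data = pd.DataFrame([[1, 3, 4], [2, 5, 0]])  # integer column labels 0..2

# Map each column ID to its title, then disambiguate the repeated title
mapper = variables['Var_Title'].to_dict()
mapper[1] = f"CLI_{mapper[1]}"
mapper[2] = f"MATE_{mapper[2]}"

# Rename the columns and use the client ID as the index
renamed = data.rename(columns=mapper).set_index('ID_CLIENT')
print(renamed.columns.tolist())
# ['CLI_EDUCATION_LEVEL', 'MATE_EDUCATION_LEVEL']
```

Renaming before calling `set_index` matters here: the index column is looked up by its title, which only exists after the mapper has been applied.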


In [12]:
rename_dict= {column: df_variables.loc[column].Var_Title for column in df_variables.index}
rename_dict[9]  = f"CLI_{rename_dict[9]}" 
rename_dict[43] = f"MATE_{rename_dict[43]}" 

In [13]:
modeling_data = modeling_data.rename(
    columns= rename_dict
).set_index('ID_CLIENT')

In [17]:
df_variables.loc[9, 'Var_Title']= "CLI_EDUCATION_LEVEL"
df_variables.loc[43, 'Var_Title']= "MATE_EDUCATION_LEVEL"

In [19]:
VAR_DESCRIPTIONS_FILENAME= 'var_descriptions.csv'
VAR_DESCRIPTIONS_PATH= os.path.join(
    DATA_REPOSITORY,
    VAR_DESCRIPTIONS_FILENAME
)

In [20]:
df_variables.to_csv(
    path_or_buf= VAR_DESCRIPTIONS_PATH
)

With the raw data structured, we will save it as a CSV file under the name defined in **RAW_DATA_FILENAME**.

In [118]:
RAW_DATA_FILENAME= 'raw_modeling_data.csv'

In [119]:
RAW_DATA_PATH= os.path.join(
    DATA_REPOSITORY,
    RAW_DATA_FILENAME
)

In [120]:
RAW_DATA_PATH

'../data/raw_modeling_data.csv'

In [121]:
modeling_data.to_csv(
    path_or_buf= RAW_DATA_PATH
)
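
To retrieve a file saved this way, the index column has to be named when reading it back, because `to_csv` writes the index as an ordinary first column. The sketch below shows the round trip on a toy frame, using an in-memory buffer instead of a file on disk. The frame and its values are illustrative assumptions.

```python
import io
import pandas as pd

# Toy frame indexed by ID_CLIENT; values are illustrative only
frame = pd.DataFrame(
    {'PAYMENT_DAY': [5, 15]},
    index=pd.Index([1, 2], name='ID_CLIENT'),
)

# Write to an in-memory buffer in place of a path such as RAW_DATA_PATH
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col restores ID_CLIENT as the index instead of a regular column
restored = pd.read_csv(buffer, index_col='ID_CLIENT')
print(restored.index.name)
```

Without `index_col`, `ID_CLIENT` would come back as a data column and the frame would get a fresh 0-based index.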