### Environment Setup

#### Conda + Python  
https://docs.conda.io/en/latest/

#### PyCharm/VS Code + Jupyter  
https://www.jetbrains.com/pycharm/download/?section=windows  
https://code.visualstudio.com/  
Jupyter support can be installed directly from within either IDE.

#### GitHub  
https://gitforwindows.org/  
Register a GitHub account  
Download Git for Windows  
Configure your credentials locally  
Connect to the remote repository  

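### Exploratory Data Analysis (EDA)  
Get a quick understanding of the data  
Produce cleaner, more usable data to support later feature engineering  
The application table is used as the worked example; the bureau table is left as an exercise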

#### Reading the Data  
Depends on pandas  
Read the files into DataFrames  
Build a table that describes each column

In [54]:
import numpy as np 
import pandas as pd 
# import polars

## Read the data
# No encoding argument is passed, so pandas falls back to its default (UTF-8)
app_train = pd.read_csv('../data/application_train.csv', sep = ',')
app_test = pd.read_csv('../data/application_test.csv')
des = pd.read_csv('../data/HomeCredit_columns_description.csv', usecols = ['Row', 'Table', 'Description'])
print('shape of application train = ', app_train.shape)
print('shape of application test = ', app_test.shape)

shape of application train =  (307511, 122)
shape of application test =  (48744, 121)
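
The two shapes differ by one column (122 versus 121). A quick way to confirm that the only extra column is the target is sketched below. The frames `train` and `test` and their values are made-up stand-ins for `app_train` and `app_test`, not the real data.

```python
import pandas as pd

# Toy stand-ins for app_train / app_test (hypothetical values).
train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "AMT_CREDIT": [100.0, 200.0, 300.0],
})
test = pd.DataFrame({
    "SK_ID_CURR": [4, 5],
    "AMT_CREDIT": [150.0, 250.0],
})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(train.columns) - set(test.columns))
only_in_test = sorted(set(test.columns) - set(train.columns))

# Shared columns whose dtypes disagree between the two frames.
shared = [c for c in train.columns if c in test.columns]
dtype_mismatch = [c for c in shared if train[c].dtype != test[c].dtype]

print(only_in_train)   # ['TARGET']
print(only_in_test)    # []
print(dtype_mismatch)  # []
```

If `only_in_test` or `dtype_mismatch` came back non-empty on the real files, that would be worth resolving before any cleaning step.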


In [55]:
## Separate the prediction target from the training set
y = app_train[['SK_ID_CURR', 'TARGET']]
app_train = app_train.drop(labels = 'TARGET', axis = 1)

In [56]:
## Build the feature description table
feature_des = pd.DataFrame(app_train.columns, columns = ['Row'])
feature_des = pd.merge(feature_des, des, on = 'Row', how = 'left')
feature_des = feature_des[feature_des['Table'] == 'application_{train|test}.csv'].reset_index(drop = True)
feature_des

| | Row | Table | Description |
|---|---|---|---|
| 0 | SK_ID_CURR | application_{train\|test}.csv | ID of loan in our sample |
| 1 | NAME_CONTRACT_TYPE | application_{train\|test}.csv | Identification if loan is cash or revolving |
| 2 | CODE_GENDER | application_{train\|test}.csv | Gender of the client |
| 3 | FLAG_OWN_CAR | application_{train\|test}.csv | Flag if the client owns a car |
| 4 | FLAG_OWN_REALTY | application_{train\|test}.csv | Flag if client owns a house or flat |
| ... | ... | ... | ... |
| 116 | AMT_REQ_CREDIT_BUREAU_DAY | application_{train\|test}.csv | Number of enquiries to Credit Bureau about the... |
| 117 | AMT_REQ_CREDIT_BUREAU_WEEK | application_{train\|test}.csv | Number of enquiries to Credit Bureau about the... |
| 118 | AMT_REQ_CREDIT_BUREAU_MON | application_{train\|test}.csv | Number of enquiries to Credit Bureau about the... |
| 119 | AMT_REQ_CREDIT_BUREAU_QRT | application_{train\|test}.csv | Number of enquiries to Credit Bureau about the... |
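
Because the description file lists the same column name under several tables, the join needs a filter on `Table`. The sketch below is a variation on the cell above rather than the notebook's exact method: it filters the description table first and then left-joins, so a column with no description is kept with empty fields instead of being dropped. The frames and the column name `NEW_FEATURE` are hypothetical.

```python
import pandas as pd

# Column names of the data set (toy example; NEW_FEATURE has no description).
cols = pd.DataFrame({"Row": ["SK_ID_CURR", "AMT_CREDIT", "NEW_FEATURE"]})

# Toy description table: SK_ID_CURR is documented under two different tables.
des = pd.DataFrame({
    "Row": ["SK_ID_CURR", "SK_ID_CURR", "AMT_CREDIT"],
    "Table": ["application_{train|test}.csv", "bureau.csv", "application_{train|test}.csv"],
    "Description": ["ID of loan in our sample", "ID of loan in our sample", "Credit amount of the loan"],
})

# Keep only descriptions for the application table, then left-join on the column name.
app_des = des[des["Table"] == "application_{train|test}.csv"]
feature_des = cols.merge(app_des, on="Row", how="left")

# Columns that ended up without a description.
undocumented = feature_des.loc[feature_des["Description"].isna(), "Row"].tolist()
print(undocumented)  # ['NEW_FEATURE']
```

Filtering after the merge, as in the cell above, would silently drop `NEW_FEATURE`, because its `Table` value is missing and fails the equality test.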


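#### Data Cleaning    
Depends on pandas  
Inspect each column's data type and its missing rate in the training and test sets  
Consider dropping columns whose missing rate differs too much between the training and test sets  
Consider filling missing values according to data type  
Columns with a high missing rate need to be analyzed separately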
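
One possible form of "separate analysis" for a column with many gaps is to check whether the gaps themselves carry signal, by comparing the target rate of rows where the value is missing against rows where it is present. This is only a sketch on made-up values; `EXT_SOURCE_1` is used here purely as a label, and the frame `df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: a sparsely filled feature next to the binary target.
df = pd.DataFrame({
    "EXT_SOURCE_1": [0.5, np.nan, np.nan, 0.7, np.nan, 0.2],
    "TARGET": [0, 1, 1, 0, 0, 0],
})

# Label every row by whether the feature is missing.
group = df["EXT_SOURCE_1"].isna().map({True: "missing", False: "present"})

# Mean of a 0/1 target is the share of positive cases in each group.
target_rate = df.groupby(group)["TARGET"].mean()
print(target_rate)

# If the rates differ noticeably, a 0/1 indicator column can preserve that information.
df["EXT_SOURCE_1_missing"] = df["EXT_SOURCE_1"].isna().astype(int)
```

On the real data the target would first have to be joined back from `y` on `SK_ID_CURR`, since it was separated from `app_train` earlier.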

In [64]:
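## Data types & missing rates
# Look values up by column name so each row stays matched to its column even if the row order changes
feature_des['Dtype'] = feature_des['Row'].map(app_train.dtypes)
feature_des['Missing_train'] = feature_des['Row'].map(app_train.isna().mean())
feature_des['Missing_test'] = feature_des['Row'].map(app_test.isna().mean())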
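
The following toy example shows the same summary on two small frames whose columns are deliberately in a different order. It is a sketch with made-up data; the names `train`, `test` and `summary` are illustrative only. Looking values up by column name keeps the rates attached to the right column regardless of order.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"A": [1.0, np.nan, 3.0, np.nan], "B": ["x", "y", None, "z"]})
test = pd.DataFrame({"B": ["x", None], "A": [1.0, 2.0]})  # column order differs on purpose

summary = pd.DataFrame({"Row": train.columns})
summary["Dtype"] = summary["Row"].map(train.dtypes)

# isna().mean() gives the fraction of missing values per column, indexed by column name.
summary["Missing_train"] = summary["Row"].map(train.isna().mean())
summary["Missing_test"] = summary["Row"].map(test.isna().mean())
print(summary)
```

Here column `A` is 50% missing in `train` and complete in `test`, while `B` is 25% missing in `train` and 50% missing in `test`.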

In [71]:
## Drop columns whose missing rate differs too much between the training and test sets
drop_cols = {}
feature_des['Missing_diff'] = np.abs(feature_des.Missing_train - feature_des.Missing_test)
feature_des.sort_values(by = 'Missing_diff')
# drop_cols['EXT_SOURCE_1'] = 'missing rate gap'
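
Instead of reading the sorted table by eye, the columns to drop could be collected with an explicit cutoff. The sketch below assumes a threshold of 0.05, which is an arbitrary illustrative value and not one taken from the notebook; the column names and rates are made up.

```python
import pandas as pd

# Toy missing-rate table in the same layout as feature_des.
feature_des = pd.DataFrame({
    "Row": ["COL_A", "COL_B", "COL_C"],
    "Missing_train": [0.56, 0.10, 0.00],
    "Missing_test": [0.42, 0.11, 0.00],
})

GAP_THRESHOLD = 0.05  # assumed cutoff; tune it for the real data

feature_des["Missing_diff"] = (feature_des["Missing_train"] - feature_des["Missing_test"]).abs()
flagged = feature_des[feature_des["Missing_diff"] > GAP_THRESHOLD]

# Record each flagged column together with the reason for dropping it.
drop_cols = {name: "missing rate gap" for name in flagged["Row"]}
print(drop_cols)  # {'COL_A': 'missing rate gap'}
```

Keeping the reason next to each name makes it easy to extend `drop_cols` later with columns dropped for other reasons.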

In [84]:
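## Fill missing values in numeric columns; string columns are left unfilled for now
def missing_padding(train, test, feature, padding_type, custom_value = -999):
    # The fill value is always computed from the training set, then applied to both sets
    padding_value = np.nan
    if padding_type == 'mean':
        padding_value = train[feature].mean()
    elif padding_type == 'median':
        padding_value = train[feature].median()
    elif padding_type == 'custom':
        padding_value = custom_value
    train[feature] = train[feature].fillna(padding_value)
    test[feature] = test[feature].fillna(padding_value)

for index, row in feature_des.iterrows():
    # Comparing a dtype with 'int' or 'float' depends on the platform's default integer width;
    # is_numeric_dtype matches every integer and float width
    if pd.api.types.is_numeric_dtype(row.Dtype):
        missing_padding(app_train, app_test, row.Row, 'mean')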

SK_ID_CURR
NAME_CONTRACT_TYPE
CODE_GENDER
FLAG_OWN_CAR
FLAG_OWN_REALTY
CNT_CHILDREN
NAME_TYPE_SUITE
NAME_INCOME_TYPE
NAME_EDUCATION_TYPE
NAME_FAMILY_STATUS
NAME_HOUSING_TYPE
DAYS_BIRTH
DAYS_EMPLOYED
DAYS_ID_PUBLISH
FLAG_MOBIL
FLAG_EMP_PHONE
FLAG_WORK_PHONE
FLAG_CONT_MOBILE
FLAG_PHONE
FLAG_EMAIL
OCCUPATION_TYPE
REGION_RATING_CLIENT
REGION_RATING_CLIENT_W_CITY
WEEKDAY_APPR_PROCESS_START
HOUR_APPR_PROCESS_START
REG_REGION_NOT_LIVE_REGION
REG_REGION_NOT_WORK_REGION
LIVE_REGION_NOT_WORK_REGION
REG_CITY_NOT_LIVE_CITY
REG_CITY_NOT_WORK_CITY
LIVE_CITY_NOT_WORK_CITY
ORGANIZATION_TYPE
FONDKAPREMONT_MODE
HOUSETYPE_MODE
WALLSMATERIAL_MODE
EMERGENCYSTATE_MODE
FLAG_DOCUMENT_2
FLAG_DOCUMENT_3
FLAG_DOCUMENT_4
FLAG_DOCUMENT_5
FLAG_DOCUMENT_6
FLAG_DOCUMENT_7
FLAG_DOCUMENT_8
FLAG_DOCUMENT_9
FLAG_DOCUMENT_10
FLAG_DOCUMENT_11
FLAG_DOCUMENT_12
FLAG_DOCUMENT_13
FLAG_DOCUMENT_14
FLAG_DOCUMENT_15
FLAG_DOCUMENT_16
FLAG_DOCUMENT_17
FLAG_DOCUMENT_18
FLAG_DOCUMENT_19
FLAG_DOCUMENT_20
FLAG_DOCUMENT_21
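
A vectorised alternative to the per-column loop is sketched below on toy data. It uses the median rather than the mean, purely as an illustration; the frames and column names (`AMT`, `CNT`, `NAME`) are hypothetical. As in `missing_padding`, the fill values are learned from the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

train = pd.DataFrame({"AMT": [1.0, np.nan, 3.0], "CNT": [1, 2, 3], "NAME": ["a", None, "c"]})
test = pd.DataFrame({"AMT": [np.nan, 10.0], "CNT": [4, 5], "NAME": [None, "b"]})

# Select numeric columns by dtype and learn one fill value per column from train.
numeric_cols = [c for c in train.columns if is_numeric_dtype(train[c])]
fill_values = train[numeric_cols].median().to_dict()

# fillna with a dict fills only the listed columns; string columns keep their gaps.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

The missing `AMT` value in `test` receives the training median (2.0), while the missing `NAME` entries stay empty in both frames.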


#### Data Distribution
Depends on pandas and numpy  
Generate a

0
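
As a rough sketch of what a distribution summary with pandas might look like, the example below builds one table of descriptive statistics for numeric columns and one frequency table for a categorical column. This is an assumption about the intended output rather than the notebook's own code, and the values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0],
    "CODE_GENDER": ["F", "M", "F", "F"],
})

# One row per numeric column: count, mean, std, min, quartiles, max.
numeric_summary = df.select_dtypes(include="number").describe().T

# Share of each category in a non-numeric column.
gender_share = df["CODE_GENDER"].value_counts(normalize=True)

print(numeric_summary)
print(gender_share)
```

Transposing the `describe()` output puts one feature per row, which makes it easy to join onto a per-feature table such as `feature_des`.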

#### Visualization