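**Frame the problem and look at the big picture**

1. Define the objective in business terms
    * Predict, as a probability, a user's ability to repay given the information available about them
2. How will your solution be used?
    * To predict whether future loan applicants are good or bad customers, as supporting input for application decisions.
3. What are the current solutions/workarounds (if any)?
    * Unknown
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
    * Supervised learning, treated here as a regression problem: in the label, '0' marks a good customer and '1' a bad one, and the model outputs a number between the two. Once the features are cleaned, several models are tested.
5. How should performance be measured?
    * Area under the ROC curve (ROC AUC)
6. Is the performance measure aligned with the business objective?
    * This is a pure model-training exercise with no real business objective, so assume it is aligned for now
7. What is the minimum performance needed to reach the business objective?
    * Unknown; AUC = 0.75 is typical for this kind of task
8. How are comparable problems solved? Can you reuse experience or tools?
    * pass
9. Is human expertise available?
    * pass
10. How would you solve the problem manually?
    * pass
11. List the assumptions you (or others) have made so far
    * pass
12. Verify assumptions if possible.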
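
The metric chosen in item 5 only depends on how the scores rank the samples, which is why a model that outputs an unbounded number can still be scored with it. A minimal sketch with made-up labels and scores (none of these values come from the dataset):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = bad customer) and model scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# A monotonic rescaling keeps the ranking, so the AUC stays the same
auc_rescaled = roc_auc_score(y_true, 10 * y_score - 3)

print(f"ROC AUC: {auc:.2f}, after rescaling: {auc_rescaled:.2f}")
```

Three of the four (bad, good) pairs are ranked correctly here, so the score is 3/4.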

---

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

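**Get the data**

Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.
    * Already given
2. Find and document where you can get that data.
    * pass
3. Check how much space it will take.
    * 2.5G
4. Check legal obligations, and get authorization if necessary.
    * Open dataset, no authorization needed
5. Get access authorizations.
    * kaggle
6. Create a workspace (with enough storage space).
    * pass
7. Get the data.
    * See below
8. Convert the data to a format you can easily manipulate (without changing the data itself).
    * pass
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
    * Already handled in the dataset itself
10. Check the size and type of data (time series, sample, geographical, etc.).
    * See below
11. Sample a test set, put it aside, and never look at it (no data snooping!).
    * See below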

---

In [2]:
# Download the data with the kaggle CLI and unzip it
# &&kaggle competitions download -c home-credit-default-risk\
# &&mkdir data&&unzip *.zip -d ./data/

In [3]:
# Check how much disk space the data takes
!du -sh ./data/

2.5G	./data/


In [4]:
def load_data(data_set_name:str):
    return pd.read_csv('./data/{}.csv'.format(data_set_name))

In [5]:
application = load_data('application_train')
# application_test = load_data('application_test')
application.shape
# application.columns.difference(application_test.columns)
# application_test.shape
# del application_test

(307511, 122)

In [6]:
def reduce_mem_usage(df, ignore_cols=['SK_ID_CURR','SK_ID_BUREAU','SK_ID_PREV']):
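    """Reduce a DataFrame's memory usage by downcasting numeric dtypes.

    Integer columns are downcast without loss. Float columns are downcast to
    float16/float32 based on their value range only, which can lose precision.
    """
    # Memory usage before conversion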
    start_mem = df.memory_usage().sum() / 1024**2
    print('初始内存使用: {:.2f} MB'.format(start_mem))
    # Exclude the ID columns
    cols = [ col for col in df.columns if col not in ignore_cols]
    # Iterate over the remaining columns
    for col in cols:
        col_type = df[col].dtype
        
        # Only numeric columns are downcast (skip object, category and datetime)
        if col_type != object and col_type.name != 'category' and 'datetime' not in col_type.name:
            c_min = df[col].min()
            c_max = df[col].max()
            
            # Integers: pick the smallest integer type that holds both min and max
            if str(col_type)[:3] == 'int' and c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                df[col] = df[col].astype(np.int8)
            elif str(col_type)[:3] == 'int' and c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                df[col] = df[col].astype(np.int16)
            elif str(col_type)[:3] == 'int' and c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                df[col] = df[col].astype(np.int32)
            elif str(col_type)[:3] == 'int' and c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                df[col] = df[col].astype(np.int64)
            
            # Floats: pick the smallest float type whose range covers min and max
            elif str(col_type)[:5] == 'float' and c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif str(col_type)[:5] == 'float' and c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                pass

    # Memory usage after conversion
    end_mem = df.memory_usage().sum() / 1024**2
    print('优化后的内存使用: {:.2f} MB'.format(end_mem))
    return df

In [7]:
application = reduce_mem_usage(application)

初始内存使用: 286.23 MB
优化后的内存使用: 93.55 MB
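
The float branch of `reduce_mem_usage` checks only the value range, so a column can pass the check and still be rounded once it is stored as float16. A small sketch of the effect, using made-up amounts rather than values from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, not taken from the dataset
amounts = pd.Series([2049.0, 40000.5, 0.1], dtype="float64")

# Same range test as in reduce_mem_usage: everything fits into float16's range
f16 = np.finfo(np.float16)
fits_range = bool(amounts.min() >= f16.min and amounts.max() <= f16.max)

# float16 has an 11-bit significand, so the stored values are rounded
as_f16 = amounts.astype(np.float16)
max_abs_error = float((as_f16.astype("float64") - amounts).abs().max())

print("fits range:", fits_range)
print("stored as :", as_f16.tolist())
print("max abs error:", max_abs_error)
```

Whether this rounding matters depends on the model; tree-based models are usually less sensitive to it than distance-based ones.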


**Take a quick look at the data**

In [8]:
application.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float16(61), float32(4), int16(2), int32(1), int64(1), int8(37), object(16)
memory usage: 93.6+ MB


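**Create a test set**

The Kaggle competition's application_test has no target, so the area under the ROC curve cannot be measured on it. It is therefore discarded, and application_train is split into a training set and a test set.

There are several strategies for splitting a dataset; in general you can split randomly or split with stratification. In this scenario the dataset is fairly large and contains a relatively large number of bad samples, so a random split is used.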
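
For comparison, the two strategies can be tried side by side on synthetic data. This is only a sketch: the frame below is a stand-in for application_train, and the 8% positive rate is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with an assumed 8% positive rate
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "TARGET": np.r_[np.ones(800, dtype=int), np.zeros(9_200, dtype=int)],
})

# Random split: the positive rate of each part can drift a little
rand_train, rand_test = train_test_split(demo, test_size=0.2, random_state=42)

# Stratified split: both parts keep the overall positive rate
strat_train, strat_test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["TARGET"]
)

print("random test positive rate    :", rand_test["TARGET"].mean())
print("stratified test positive rate:", strat_test["TARGET"].mean())
```

On a dataset of this size the two rates are usually close, which is consistent with the choice of a plain random split here.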

**Random split**

In [9]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(application, test_size=0.2, random_state= 42)

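**Explore the data**

Note: try to get insights from a domain expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
  * Name;
  * Type (categorical, int/float, bounded/unbounded, text, structured, etc.);
  * Percentage of missing values;
  * Noisiness and type of noise (stochastic, outliers, rounding errors, etc.);
  * Is it likely to be useful for the task?
  * Type of distribution (Gaussian, uniform, logarithmic, etc.).
4. For supervised learning tasks, identify the target attribute.
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to the previous step).
10. Document what you have learned.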


In [10]:
app = train_set.copy()

In [11]:
app.dtypes.value_counts()

float16    61
int8       37
object     16
float32     4
int16       2
int64       1
int32       1
dtype: int64

In [12]:
def missing_values_summary(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100*df.isnull().sum() / len(df)
    mis_val_table =pd.concat([mis_val, mis_val_percent], axis=1)   
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] !=0].sort_values(
    '% of Total Values', ascending = False).round(1)
    mis_val_table_ren_columns = mis_val_table_ren_columns.merge(df.dtypes.rename('dtype').to_frame(),left_index=True, right_index=True)
    print('Your selected dataframe has ' + str(df.shape[1])+ ' columns.\n'
         "There are " + str(mis_val_table_ren_columns.shape[0])+ ' columns that have missing values.')
    return mis_val_table_ren_columns

In [13]:
missing_values_summary(app)

Your selected dataframe has 122 columns.
There are 67 columns that have missing values.


| Column | Missing Values | % of Total Values | dtype |
|---|---|---|---|
| COMMONAREA_MEDI | 171929 | 69.9 | float16 |
| COMMONAREA_AVG | 171929 | 69.9 | float16 |
| COMMONAREA_MODE | 171929 | 69.9 | float16 |
| NONLIVINGAPARTMENTS_MEDI | 170868 | 69.5 | float16 |
| NONLIVINGAPARTMENTS_MODE | 170868 | 69.5 | float16 |
| NONLIVINGAPARTMENTS_AVG | 170868 | 69.5 | float16 |
| FONDKAPREMONT_MODE | 168286 | 68.4 | object |
| LIVINGAPARTMENTS_MODE | 168196 | 68.4 | float16 |
| LIVINGAPARTMENTS_MEDI | 168196 | 68.4 | float16 |
| LIVINGAPARTMENTS_AVG | 168196 | 68.4 | float16 |


-----

**A quick first pass at feature selection**

In [14]:
# Feature lists: categorical and numeric
cate_cols = []
num_cols = []

In [15]:
# Add the object-typed (categorical) columns
cate_cols.extend(app.dtypes[app.dtypes == 'object'].index.tolist())

In [16]:
def select_low_cardinality_numeric_features(df, label_col, threshold=5):
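    """
    Select the numeric features other than label_col whose number of unique
    values is at most threshold, and return their names in the list
    low_cardinality_feats.

    Parameters:
    ----------
    df: pandas.DataFrame
        Data table containing the features and the target variable
    label_col: str
        Name of the target variable
    threshold: int
        Unique-value threshold; features with at most this many unique values
        are treated as low-cardinality features. Defaults to 5.

    Returns:
    ----------
    low_cardinality_feats: list
        Names of the low-cardinality numeric features
    """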
    numeric_feats = df.select_dtypes(include='number').columns.tolist()
    low_cardinality_feats = []
    for feat in numeric_feats:
        if feat == label_col:
            continue
        if df[feat].nunique() <= threshold:
            low_cardinality_feats.append(feat)
    return low_cardinality_feats


In [17]:
# Also add numeric variables with at most 5 unique values
cate_cols.extend(select_low_cardinality_numeric_features(app, 'TARGET'))

In [18]:
num_cols.extend(app.columns.difference(cate_cols))

In [19]:
num_cols.remove('TARGET')

In [20]:
cat_cols_object = app[cate_cols].select_dtypes(include=['object']).columns
cat_cols_number = app[cate_cols].select_dtypes(include=['number']).columns

---

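**Prepare the data**

Note:
* Work on copies of the data (keep the original dataset intact).
* Write functions for all the data transformations you apply, for five reasons:
    * So you can easily prepare the data the next time you get a fresh dataset
    * So you can apply these transformations in future projects
    * To clean and prepare the test set
    * To clean and prepare new data instances once the project is live
    * To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
    * Fix or remove outliers (optional).
    * Fill in missing values (e.g., with zero, mean, median, etc.) or drop their rows (or columns).
2. Feature selection (optional):
    * Drop the attributes that provide no useful information;
3. Feature engineering, where appropriate:
    * Discretize continuous features.
    * Decompose features (e.g., categorical, date/time, etc.).
    * Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.)
    * Aggregate features into promising new features
4. Feature scaling:
    * Standardize or normalize features.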
**Make a copy of the data**

In [21]:
app = train_set.drop('TARGET', axis=1)
app_labels = train_set['TARGET'].copy()

In [22]:
app.shape

(246008, 121)

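**Data cleaning**

1. The columns with missing values are listed above; decide how to fill each one case by case.
2. For numeric columns, choose the fill value case by case. As a rule of thumb, use the mean when the data is roughly normally distributed and the median when it is not; for discrete values, the mode is an option. When few values are missing, the mean or median preserves the overall distribution; when many are missing, the mode keeps the distribution more stable.
3. For object columns, fill with the special marker 'UKN'.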
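
The rules of thumb above map onto `SimpleImputer` strategies. A minimal sketch on a made-up frame (the column names and values are hypothetical, chosen only to show one strategy each):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "skewed_amount": [1.0, 2.0, 3.0, 100.0, np.nan],  # skewed, so use the median
    "child_count": [0.0, 0.0, 1.0, 2.0, np.nan],      # discrete, so use the mode
    "occupation": ["Laborers", np.nan, "Drivers", "Laborers", "Managers"],
})

median_imp = SimpleImputer(strategy="median")
mode_imp = SimpleImputer(strategy="most_frequent")
flag_imp = SimpleImputer(strategy="constant", fill_value="UKN")

filled = toy.copy()
filled[["skewed_amount"]] = median_imp.fit_transform(toy[["skewed_amount"]])
filled[["child_count"]] = mode_imp.fit_transform(toy[["child_count"]])
filled[["occupation"]] = flag_imp.fit_transform(toy[["occupation"]])

# The mean would be pulled up by the outlier; the median is not
print("mean  :", toy["skewed_amount"].mean())
print("median:", filled.loc[4, "skewed_amount"])
print(filled)
```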

In [23]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,OneHotEncoder,FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

In [24]:
import sklearn
sklearn.set_config(display="diagram")

In [25]:
num_pipeline = make_pipeline(
                SimpleImputer(strategy='median'),
                StandardScaler()
                )

In [26]:
cat_number_pipeline = make_pipeline(
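                # Caveat: where astype(str) renders NaN as the string 'nan', the imputer below has nothing left to fill
                FunctionTransformer(lambda X: X.astype(str),feature_names_out='one-to-one'),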
                SimpleImputer(strategy='constant', fill_value='UKN'),
                OneHotEncoder(handle_unknown='ignore')
)

In [27]:
cat_object_pipeline = make_pipeline(
                SimpleImputer(strategy='constant', fill_value='UKN'),
                OneHotEncoder(handle_unknown='ignore')
)

In [28]:
processing = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cate_object',cat_object_pipeline,cat_cols_object),
    ('cate_number', cat_number_pipeline, cat_cols_number)
], remainder='passthrough')

In [29]:
app_transform = processing.fit_transform(app)

In [30]:
app_transform.shape

(246008, 291)

In [31]:
app_transformed = pd.DataFrame(app_transform, columns=processing.get_feature_names_out())

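**Shortlist promising models**

Note:

  * If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or random forests).
  * Once again, try to automate these steps as much as possible.

1. Train many quick-and-dirty models from different categories (e.g., linear, naive Bayes, SVM, random forest, neural net, etc.) using standard parameters.
2. Measure and compare their performance.
  * For each model, use N-fold cross-validation and compute the mean and variance of the performance measure over the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make.
  * What data would a human have used to avoid these errors?
5. Perform a quick round of feature selection and engineering.
6. Perform one or two more quick iterations of the five previous steps.
7. Shortlist the top three to five most promising models, preferring models that make different types of errors.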
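
Step 2 asks for the mean and spread of the score across folds. One way to get both is `cross_val_score`; the sketch below runs on synthetic data, and the two classifiers are stand-ins chosen for the illustration, not the models used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (about 10% positives)
X, y = make_classification(
    n_samples=2_000, n_features=20, weights=[0.9, 0.1], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "logreg": LogisticRegression(max_iter=1_000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

summary = {}
for name, model in candidates.items():
    # One ROC AUC value per fold
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Comparing the means alongside the standard deviations helps tell a real difference between models from fold-to-fold noise.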


In [32]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import time


def cross_validate_with_feature_importance(models, data, labels, n_folds=5):
    """Run stratified K-fold CV for each model, printing the ROC AUC of every fold.

    For models exposing feature_importances_, the importances are summed over
    the folds and the top 10 are printed. A model whose first fold takes more
    than 180 s to train is dropped after that fold.
    """
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    
    for model_idx, model in enumerate(models):
        print(f"Model {model_idx + 1}: {type(model).__name__}")
        total_feature_importance = None
        
        for fold, (train_idx, val_idx) in enumerate(skf.split(data, labels)):
            print(f"Fold {fold + 1}")
            X_train, y_train = data.iloc[train_idx], labels.iloc[train_idx].values
            X_val, y_val = data.iloc[val_idx], labels.iloc[val_idx].values

            start_time = time.time()
            model.fit(X_train, y_train)
            train_time = time.time() - start_time
            print(f"Training time: {train_time:.2f}s")

            if fold == 0 and train_time > 180:
                print(f"Model training skipped because it took {train_time:.2f}s")
                break

            if hasattr(model, "predict_proba"):
                y_pred = model.predict_proba(X_val)[:, 1]
            else:
                y_pred = model.predict(X_val)
            score = roc_auc_score(y_val, y_pred)
            print(f"Validation ROC AUC score: {score:.4f}")

            if hasattr(model, "feature_importances_"):
                fold_feature_importance = pd.Series(model.feature_importances_, index=data.columns)
                if total_feature_importance is None:
                    total_feature_importance = fold_feature_importance
                else:
                    total_feature_importance += fold_feature_importance

        if total_feature_importance is not None:
            top_features = total_feature_importance.sort_values(ascending=False)[:10]
            print("Top 10 most important features:")
            print(top_features)


## Baseline

In [33]:
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPRegressor

lr = LinearRegression()
sgd = SGDRegressor()
tree = DecisionTreeRegressor()
svr = SVR()
# gpr = GaussianProcessRegressor()
gnb = GaussianNB()
nn = MLPRegressor()

In [34]:
models = [lr, sgd, tree, gnb, nn]

In [35]:
# cross_validate_with_feature_importance(models, app_transformed, app_labels)

In [36]:
del app,app_transform,app_transformed,app_labels

A rough first pass over the models with no further processing: LR and NN reach a score of about 0.75, which is not bad.

## Add more information

In [37]:
import pandas as pd
import gc

def concat_df_by_name(name_str):
    all_vars = globals()
    df_list = []
    keys_to_delete = []  # names of the globals to delete afterwards
    for var_name, var_value in all_vars.items():
        if isinstance(var_value, pd.DataFrame) and name_str in var_name:
            df_list.append(var_value)
            keys_to_delete.append(var_name)  # remember the name for deletion
    for key in keys_to_delete:  # delete only after the loop, not while iterating
        del all_vars[key]
    if not df_list:
        return pd.DataFrame()
    result = pd.concat(df_list, axis=1)
    gc.collect()
    return result
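
`concat_df_by_name` finds its inputs by scanning `globals()` for a name fragment. An alternative that avoids depending on variable names is to collect the feature frames in a dict as they are built. This is a sketch of that idea, not the approach used in this notebook; `concat_feature_frames` and the toy data are hypothetical.

```python
import pandas as pd


def concat_feature_frames(frames):
    """Join per-customer feature frames column-wise on their shared index."""
    if not frames:
        return pd.DataFrame()
    return pd.concat(list(frames.values()), axis=1)


# Hypothetical toy stand-in for the bureau table
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "SK_ID_BUREAU": [10, 11, 12, 13],
    "CREDIT_ACTIVE": ["Active", "Closed", "Active", "Closed"],
})

features = {}
features["BUREAU_NUM"] = (
    toy.groupby("SK_ID_CURR")["SK_ID_BUREAU"].count().rename("BUREAU_NUM").to_frame()
)
features["BUREAU_ACTIVE_NUM"] = (
    toy[toy["CREDIT_ACTIVE"] == "Active"]
    .groupby("SK_ID_CURR")["SK_ID_BUREAU"].count()
    .rename("BUREAU_ACTIVE_NUM").to_frame()
)

combined = concat_feature_frames(features)
print(combined)
```

Customers missing from one of the frames get NaN in that column, the same as with a column-wise `pd.concat` over the globals.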

In [38]:
bureau = load_data('bureau')
bureau = reduce_mem_usage(bureau)
bureau.CREDIT_TYPE.value_counts()

初始内存使用: 222.62 MB
优化后的内存使用: 126.04 MB


Consumer credit                                 1251615
Credit card                                      402195
Car loan                                          27690
Mortgage                                          18391
Microloan                                         12413
Loan for business development                      1975
Another type of loan                               1017
Unknown type of loan                                555
Loan for working capital replenishment              469
Cash loan (non-earmarked)                            56
Real estate loan                                     27
Loan for the purchase of equipment                   19
Loan for purchase of shares (margin lending)          4
Mobile operator loan                                  1
Interbank credit                                      1
Name: CREDIT_TYPE, dtype: int64

In [39]:
# bureau table
bureau = load_data('bureau')
bureau = reduce_mem_usage(bureau)

# Total number of credits returned by the bureau
BUREAU_NUM = bureau[['SK_ID_CURR','SK_ID_BUREAU']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_NUM'})
# Number of active credits in the bureau report
BUREAU_ACTIVE_NUM = bureau[bureau['CREDIT_ACTIVE']=='Active'][['SK_ID_CURR','SK_ID_BUREAU']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_ACTIVE_NUM'})
# Number of closed credits in the bureau report
BUREAU_COLSED_NUM = bureau[bureau['CREDIT_ACTIVE']=='Closed'][['SK_ID_CURR','SK_ID_BUREAU']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_Closed_NUM'})
# Number of sold credits in the bureau report
BUREAU_SOLD_NUM = bureau[bureau['CREDIT_ACTIVE']=='Sold'][['SK_ID_CURR','SK_ID_BUREAU']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_SOLD_NUM'})
# Number of bad-debt credits in the bureau report
BUREAU_BAD_DEBT_NUM = bureau[bureau['CREDIT_ACTIVE']=='Bad debt'][['SK_ID_CURR','SK_ID_BUREAU']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_BAD_DEBT_NUM'})
# Days since the earliest credit
BUREAU_MINDAY_APPLICATION = bureau[['SK_ID_CURR','DAYS_CREDIT']].groupby(['SK_ID_CURR']).min().rename(columns={'DAYS_CREDIT':'BUREAU_MINDAY_APPLICATION'})
# Days since the most recent credit
BUREAU_MAXDAY_APPLICATION = bureau[['SK_ID_CURR','DAYS_CREDIT']].groupby(['SK_ID_CURR']).max().rename(columns={'DAYS_CREDIT':'BUREAU_MAXDAY_APPLICATION'})
# Number of credits with a nonzero days-past-due count
BUREAU_NUM_OVERDUE = bureau[bureau['CREDIT_DAY_OVERDUE'] > 0][['SK_ID_CURR','CREDIT_DAY_OVERDUE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_DAY_OVERDUE':'BUREAU_NUM_OVERDUE'})
# Largest days-past-due count
BUREAU_MAXDAY_OVERDUE = bureau[bureau['CREDIT_DAY_OVERDUE'] > 0][['SK_ID_CURR','CREDIT_DAY_OVERDUE']].groupby(['SK_ID_CURR']).max().rename(columns={'CREDIT_DAY_OVERDUE':'BUREAU_MAXDAY_OVERDUE'})
# Number of credits repaid early
BUREAU_NUM_PREPAY = bureau[bureau['DAYS_CREDIT_ENDDATE'] > bureau['DAYS_ENDDATE_FACT'] ][['SK_ID_CURR','DAYS_ENDDATE_FACT']].groupby(['SK_ID_CURR']).count().rename(columns={'DAYS_ENDDATE_FACT':'BUREAU_NUM_PREPAY'})
# Number of credits repaid on the scheduled end date
BUREAU_NUM_NORMAL = bureau[bureau['DAYS_CREDIT_ENDDATE'] == bureau['DAYS_ENDDATE_FACT'] ][['SK_ID_CURR','DAYS_ENDDATE_FACT']].groupby(['SK_ID_CURR']).count().rename(columns={'DAYS_ENDDATE_FACT':'BUREAU_NUM_NORMAL'})
# Number of credits repaid late
BUREAU_NUM_DELAY = bureau[bureau['DAYS_CREDIT_ENDDATE'] < bureau['DAYS_ENDDATE_FACT'] ][['SK_ID_CURR','DAYS_ENDDATE_FACT']].groupby(['SK_ID_CURR']).count().rename(columns={'DAYS_ENDDATE_FACT':'BUREAU_NUM_DELAY'})
# Largest overdue amount on record
BUREAU_MAXAMT_OVERDUE = bureau[bureau['AMT_CREDIT_MAX_OVERDUE'] > 0][['SK_ID_CURR','AMT_CREDIT_MAX_OVERDUE']].groupby(['SK_ID_CURR']).max().rename(columns={'AMT_CREDIT_MAX_OVERDUE':'BUREAU_MAXAMT_OVERDUE'})
# Number of prolongations (note: this takes the max over a customer's credits, not the sum)
BUREAU_PROLONG_NUM = bureau[['SK_ID_CURR','CNT_CREDIT_PROLONG']].groupby(['SK_ID_CURR']).max().rename(columns={'CNT_CREDIT_PROLONG':'BUREAU_PROLONG_NUM'})
# Total credit amount (sum of AMT_CREDIT_SUM)
BUREAU_LOAN_AMT = bureau[['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_LOAN_AMT'})
# Total outstanding debt
BUREAU_DEBT_AMT = bureau[['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBT_AMT'})
# Largest credit limit
BUREAU_LIMIT_MAX = bureau[['SK_ID_CURR','AMT_CREDIT_SUM_LIMIT']].groupby(['SK_ID_CURR']).max().rename(columns={'AMT_CREDIT_SUM_LIMIT':'BUREAU_LIMIT_MAX'})
# Total overdue amount
BUREAU_SUM_OVERDUE = bureau[['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_SUM_OVERDUE'})
# Number of consumer credits
BUREAU_NUM_CONSUMER = bureau[bureau['CREDIT_TYPE'] == 'Consumer credit' ][['SK_ID_CURR','CREDIT_TYPE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_TYPE':'BUREAU_NUM_CONSUMER'})
# Number of credit cards
BUREAU_NUM_CARD = bureau[bureau['CREDIT_TYPE'] == 'Credit card' ][['SK_ID_CURR','CREDIT_TYPE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_TYPE':'BUREAU_NUM_CARD'})
# Number of car loans
BUREAU_NUM_CAR = bureau[bureau['CREDIT_TYPE'] == 'Car loan' ][['SK_ID_CURR','CREDIT_TYPE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_TYPE':'BUREAU_NUM_CAR'})
# Number of mortgages
BUREAU_NUM_MORTGAGE = bureau[bureau['CREDIT_TYPE'] == 'Mortgage' ][['SK_ID_CURR','CREDIT_TYPE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_TYPE':'BUREAU_NUM_MORTGAGE'})
# Number of microloans
BUREAU_NUM_MiCROLOAN = bureau[bureau['CREDIT_TYPE'] == 'Microloan' ][['SK_ID_CURR','CREDIT_TYPE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_TYPE':'BUREAU_NUM_MiCROLOAN'})
# Number of other credits
BUREAU_NUM_OTHER = bureau[~bureau.CREDIT_TYPE.isin(['Consumer credit','Credit card','Car loan','Mortgage','Microloan']) ][['SK_ID_CURR','CREDIT_TYPE']].groupby(['SK_ID_CURR']).count().rename(columns={'CREDIT_TYPE':'BUREAU_NUM_OTHER'})
# Total credit amount of consumer credits
BUREAU_AMT_CONSUMER = bureau[bureau['CREDIT_TYPE'] == 'Consumer credit' ][['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_AMT_CONSUMER'})
# Total credit amount of credit cards
BUREAU_AMT_CARD = bureau[bureau['CREDIT_TYPE'] == 'Credit card' ][['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_AMT_CARD'})
# Total credit amount of car loans
BUREAU_AMT_CAR = bureau[bureau['CREDIT_TYPE'] == 'Car loan' ][['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_AMT_CAR'})
# Total credit amount of mortgages
BUREAU_AMT_MORTGAGE = bureau[bureau['CREDIT_TYPE'] == 'Mortgage' ][['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_AMT_MORTGAGE'})
# Total credit amount of microloans
BUREAU_AMT_MiCROLOAN = bureau[bureau['CREDIT_TYPE'] == 'Microloan' ][['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_AMT_MiCROLOAN'})
# Total credit amount of other credits
BUREAU_AMT_OTHER = bureau[~bureau.CREDIT_TYPE.isin(['Consumer credit','Credit card','Car loan','Mortgage','Microloan']) ][['SK_ID_CURR','AMT_CREDIT_SUM']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM':'BUREAU_AMT_OTHER'})
# Total outstanding debt on consumer credits
BUREAU_DEBTAMT_CONSUMER = bureau[bureau['CREDIT_TYPE'] == 'Consumer credit' ][['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBTAMT_CONSUMER'})
# Total outstanding debt on credit cards
BUREAU_DEBTAMT_CARD = bureau[bureau['CREDIT_TYPE'] == 'Credit card' ][['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBTAMT_CARD'})
# Total outstanding debt on car loans
BUREAU_DEBTAMT_CAR = bureau[bureau['CREDIT_TYPE'] == 'Car loan' ][['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBTAMT_CAR'})
# Total outstanding debt on mortgages
BUREAU_DEBTAMT_MORTGAGE = bureau[bureau['CREDIT_TYPE'] == 'Mortgage' ][['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBTAMT_MORTGAGE'})
# Total outstanding debt on microloans
BUREAU_DEBTAMT_MiCROLOAN = bureau[bureau['CREDIT_TYPE'] == 'Microloan' ][['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBTAMT_MiCROLOAN'})
# Total outstanding debt on other credits
BUREAU_DEBTAMT_OTHER = bureau[~bureau.CREDIT_TYPE.isin(['Consumer credit','Credit card','Car loan','Mortgage','Microloan']) ][['SK_ID_CURR','AMT_CREDIT_SUM_DEBT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_DEBT':'BUREAU_DEBTAMT_OTHER'})
# Total overdue amount on consumer credits
BUREAU_OVERDUEAMT_CONSUMER = bureau[bureau['CREDIT_TYPE'] == 'Consumer credit' ][['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_OVERDUEAMT_CONSUMER'})
# Total overdue amount on credit cards
BUREAU_OVERDUEAMT_CARD = bureau[bureau['CREDIT_TYPE'] == 'Credit card' ][['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_OVERDUEAMT_CARD'})
# Total overdue amount on car loans
BUREAU_OVERDUEAMT_CAR = bureau[bureau['CREDIT_TYPE'] == 'Car loan' ][['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_OVERDUEAMT_CAR'})
# Total overdue amount on mortgages
BUREAU_OVERDUEAMT_MORTGAGE = bureau[bureau['CREDIT_TYPE'] == 'Mortgage' ][['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_OVERDUEAMT_MORTGAGE'})
# Total overdue amount on microloans
BUREAU_OVERDUEAMT_MiCROLOAN = bureau[bureau['CREDIT_TYPE'] == 'Microloan' ][['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_OVERDUEAMT_MiCROLOAN'})
# Total overdue amount on other credits
BUREAU_OVERDUEAMT_OTHER = bureau[~bureau.CREDIT_TYPE.isin(['Consumer credit','Credit card','Car loan','Mortgage','Microloan']) ][['SK_ID_CURR','AMT_CREDIT_SUM_OVERDUE']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT_SUM_OVERDUE':'BUREAU_OVERDUEAMT_OTHER'})
# Most recent record update (in days relative to the application)
BUREAU_LAST_UPDATE = bureau[['SK_ID_CURR','DAYS_CREDIT_UPDATE']].groupby(['SK_ID_CURR']).max().rename(columns={'DAYS_CREDIT_UPDATE':'BUREAU_LAST_UPDATE'})


初始内存使用: 222.62 MB
优化后的内存使用: 126.04 MB


In [40]:
bureau_extra = concat_df_by_name('BUREAU')
bureau_extra.shape

(305811, 43)
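
Several of the per-customer aggregates above share the same grouping key, so they could also be computed in one pass with pandas named aggregation. A minimal sketch on a made-up frame (the rows are hypothetical; the output column names follow the ones used above):

```python
import pandas as pd

# Hypothetical toy stand-in for the bureau table
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "DAYS_CREDIT": [-900, -30, -400],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],
})

# One groupby, several output columns: new_name=(source_column, aggregation)
bureau_features = toy_bureau.groupby("SK_ID_CURR").agg(
    BUREAU_NUM=("SK_ID_BUREAU", "count"),
    BUREAU_MINDAY_APPLICATION=("DAYS_CREDIT", "min"),
    BUREAU_MAXDAY_APPLICATION=("DAYS_CREDIT", "max"),
    BUREAU_LOAN_AMT=("AMT_CREDIT_SUM", "sum"),
)
print(bureau_features)
```

Features that need a row filter first (for example only active credits) still need their own filtered groupby, or a helper column that is aggregated instead.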

In [41]:
# bureau_balance table
bureau_balance = load_data('bureau_balance')
bureau_balance = reduce_mem_usage(bureau_balance)

tmp = bureau[['SK_ID_BUREAU','SK_ID_CURR']]
bureau_balance_union = pd.merge(bureau_balance, tmp, how='left', on='SK_ID_BUREAU')
bureau_balance_union['SK_ID_CURR'] = bureau_balance_union['SK_ID_CURR'].fillna(0).astype('int64')
bureau_balance_union = bureau_balance_union[bureau_balance_union.STATUS.isin(['0','1','2','3','4','5'])]

# Number of credits by worst overdue status (0-5) within the last month
tmp = bureau_balance_union[bureau_balance_union.MONTHS_BALANCE >= -1][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_1_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_1_0'})
BUREAU_OVERDUE_1_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_1_1'})
BUREAU_OVERDUE_1_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_1_2'})
BUREAU_OVERDUE_1_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_1_3'})
BUREAU_OVERDUE_1_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_1_4'})
BUREAU_OVERDUE_1_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_1_5'})

# Number of credits by worst overdue status (0-5), 1-3 months back
tmp = bureau_balance_union[ (bureau_balance_union.MONTHS_BALANCE < -1) & (bureau_balance_union.MONTHS_BALANCE >= -3) ][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_3_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_3_0'})
BUREAU_OVERDUE_3_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_3_1'})
BUREAU_OVERDUE_3_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_3_2'})
BUREAU_OVERDUE_3_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_3_3'})
BUREAU_OVERDUE_3_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_3_4'})
BUREAU_OVERDUE_3_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_3_5'})

# Number of credits by worst overdue status (0-5), 3-6 months back
tmp = bureau_balance_union[ (bureau_balance_union.MONTHS_BALANCE < -3) & (bureau_balance_union.MONTHS_BALANCE >= -6) ][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_6_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_6_0'})
BUREAU_OVERDUE_6_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_6_1'})
BUREAU_OVERDUE_6_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_6_2'})
BUREAU_OVERDUE_6_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_6_3'})
BUREAU_OVERDUE_6_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_6_4'})
BUREAU_OVERDUE_6_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_6_5'})

# Number of credits by worst overdue status (0-5), 6-12 months back
tmp = bureau_balance_union[ (bureau_balance_union.MONTHS_BALANCE < -6) & (bureau_balance_union.MONTHS_BALANCE >= -12) ][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_12_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_12_0'})
BUREAU_OVERDUE_12_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_12_1'})
BUREAU_OVERDUE_12_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_12_2'})
BUREAU_OVERDUE_12_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_12_3'})
BUREAU_OVERDUE_12_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_12_4'})
BUREAU_OVERDUE_12_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_12_5'})

# Number of credits by worst overdue status (0-5), 12-24 months back
tmp = bureau_balance_union[ (bureau_balance_union.MONTHS_BALANCE < -12) & (bureau_balance_union.MONTHS_BALANCE >= -24) ][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_24_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_24_0'})
BUREAU_OVERDUE_24_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_24_1'})
BUREAU_OVERDUE_24_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_24_2'})
BUREAU_OVERDUE_24_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_24_3'})
BUREAU_OVERDUE_24_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_24_4'})
BUREAU_OVERDUE_24_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_24_5'})

# Number of credits by worst overdue status (0-5), 24-36 months back
tmp = bureau_balance_union[ (bureau_balance_union.MONTHS_BALANCE < -24) & (bureau_balance_union.MONTHS_BALANCE >= -36) ][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_36_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36_0'})
BUREAU_OVERDUE_36_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36_1'})
BUREAU_OVERDUE_36_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36_2'})
BUREAU_OVERDUE_36_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36_3'})
BUREAU_OVERDUE_36_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36_4'})
BUREAU_OVERDUE_36_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36_5'})

# Number of credits by worst overdue status (0-5), more than 36 months back
tmp = bureau_balance_union[ bureau_balance_union.MONTHS_BALANCE < -36][['SK_ID_BUREAU','SK_ID_CURR','STATUS']].groupby(['SK_ID_BUREAU','SK_ID_CURR'], as_index=False).max()
BUREAU_OVERDUE_36plus_0 = tmp[tmp.STATUS =='0'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36plus_0'})
BUREAU_OVERDUE_36plus_1 = tmp[tmp.STATUS == '1'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36plus_1'})
BUREAU_OVERDUE_36plus_2 = tmp[tmp.STATUS == '2'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36plus_2'})
BUREAU_OVERDUE_36plus_3 = tmp[tmp.STATUS == '3'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36plus_3'})
BUREAU_OVERDUE_36plus_4 = tmp[tmp.STATUS == '4'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36plus_4'})
BUREAU_OVERDUE_36plus_5 = tmp[tmp.STATUS == '5'][['SK_ID_BUREAU','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_BUREAU':'BUREAU_OVERDUE_36plus_5'})

# Delete the tables that are no longer needed
del bureau
del bureau_balance
del bureau_balance_union
del tmp

Initial memory usage: 624.85 MB
Optimized memory usage: 442.60 MB


In [42]:
bureau_balance_extra = concat_df_by_name('BUREAU')
bureau_balance_extra.shape

(130774, 42)
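
The per-status counts above repeat the same filter, group and rename once per status value. As a minimal sketch of an alternative, assuming `tmp` holds one row per loan with its worst `STATUS`, `pd.crosstab` can produce all six columns in one pass. The helper name `status_counts` and the toy data are hypothetical and not part of the original notebook. Unlike the original cells, this version fills missing combinations with 0 rather than leaving them absent.

```python
import pandas as pd

# Toy stand-in for `tmp`: one row per (loan, customer) with its worst STATUS.
tmp = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3, 4, 5],
    "SK_ID_CURR":   [100, 100, 100, 200, 200],
    "STATUS":       ["0", "0", "1", "5", "0"],
})

def status_counts(frame: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Count loans per customer for each STATUS value '0'..'5' (hypothetical helper)."""
    counts = pd.crosstab(frame["SK_ID_CURR"], frame["STATUS"])
    # Keep a fixed column set so every time window yields the same feature names.
    counts = counts.reindex(columns=list("012345"), fill_value=0)
    counts.columns = [f"{prefix}_{status}" for status in counts.columns]
    return counts

features = status_counts(tmp, "BUREAU_OVERDUE_36plus")
print(features)
```

Calling the helper once per time window with a different prefix would reproduce the `BUREAU_OVERDUE_*` family with far less repetition.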

In [43]:
# previous_application table
previous_application = load_data('previous_application')
previous_application = reduce_mem_usage(previous_application)

previous_application['RATE_INTEREST_ACTUAL'] = ((previous_application.AMT_ANNUITY * previous_application.CNT_PAYMENT) - previous_application.AMT_CREDIT) / previous_application.AMT_CREDIT
previous_application.loc[previous_application.RATE_INTEREST_ACTUAL == -1.0,'RATE_INTEREST_ACTUAL'] = np.nan
# Total number of previous loans
PRE_CREDIT_NUM = previous_application[['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_NUM'})
# Total credit amount of previous loans
PRE_CREDIT_AMT = previous_application[['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_AMT'})
# Total monthly payment (annuity) of previous loans
PRE_CREDIT_ANNUITY = previous_application[['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_ANNUITY'})
# Number of previous POS loans
PRE_CREDIT_POS_NUM = previous_application[previous_application['NAME_PORTFOLIO'] == 'POS'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_POS_NUM'})
# Total credit amount of previous POS loans
PRE_CREDIT_POS_AMT = previous_application[previous_application['NAME_PORTFOLIO'] == 'POS'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_POS_AMT'})
# Total monthly payment of previous POS loans
PRE_CREDIT_POS_ANNUITY = previous_application[previous_application['NAME_PORTFOLIO'] == 'POS'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_POS_ANNUITY'})
# Number of previous cash loans
PRE_CREDIT_CASH_NUM = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cash'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_CASH_NUM'})
# Total credit amount of previous cash loans
PRE_CREDIT_CASH_AMT = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cash'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_CASH_AMT'})
# Total monthly payment of previous cash loans
PRE_CREDIT_CASH_ANNUITY = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cash'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_CASH_ANNUITY'})
# Number of previous loans in other portfolios (XNA)
PRE_CREDIT_XNA_NUM = previous_application[previous_application['NAME_PORTFOLIO'] == 'XNA'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_XNA_NUM'})
# Total credit amount of previous loans in other portfolios (XNA)
PRE_CREDIT_XNA_AMT = previous_application[previous_application['NAME_PORTFOLIO'] == 'XNA'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_XNA_AMT'})
# Total monthly payment of previous loans in other portfolios (XNA)
PRE_CREDIT_XNA_ANNUITY = previous_application[previous_application['NAME_PORTFOLIO'] == 'XNA'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_XNA_ANNUITY'})
# Number of previous credit card loans
PRE_CREDIT_Cards_NUM = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cards'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_Cards_NUM'})
# Total credit amount of previous credit card loans
PRE_CREDIT_Cards_AMT = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cards'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_Cards_AMT'})
# Total monthly payment of previous credit card loans
PRE_CREDIT_Cards_ANNUITY = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cards'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_Cards_ANNUITY'})
# Number of previous car loans
PRE_CREDIT_Cars_NUM = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cars'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_Cars_NUM'})
# Total credit amount of previous car loans
PRE_CREDIT_Cars_AMT = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cars'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_Cars_AMT'})
# Total monthly payment of previous car loans
PRE_CREDIT_Cars_ANNUITY = previous_application[previous_application['NAME_PORTFOLIO'] == 'Cars'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_Cars_ANNUITY'})
# Number of previous approved applications
PRE_CREDIT_Approved_NUM = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Approved'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_Approved_NUM'})
# Total credit amount of previous approved applications
PRE_CREDIT_Approved_AMT = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Approved'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_Approved_AMT'})
# Total monthly payment of previous approved applications
PRE_CREDIT_Approved_ANNUITY = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Approved'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_Approved_ANNUITY'})
# Number of previous canceled applications
PRE_CREDIT_Canceled_NUM = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Canceled'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_Canceled_NUM'})
# Total credit amount of previous canceled applications
PRE_CREDIT_Canceled_AMT = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Canceled'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_Canceled_AMT'})
# Total monthly payment of previous canceled applications
PRE_CREDIT_Canceled_ANNUITY = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Canceled'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_Canceled_ANNUITY'})
# Number of previous refused applications
PRE_CREDIT_Refused_NUM = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Refused'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_Refused_NUM'})
# Total credit amount of previous refused applications
PRE_CREDIT_Refused_AMT = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Refused'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_Refused_AMT'})
# Total monthly payment of previous refused applications
PRE_CREDIT_Refused_ANNUITY = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Refused'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_Refused_ANNUITY'})
# Number of previous unused offers
PRE_CREDIT_Unused_NUM = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Unused offer'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_CREDIT_Unused_NUM'})
# Total credit amount of previous unused offers
PRE_CREDIT_Unused_AMT = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Unused offer'][['SK_ID_CURR','AMT_CREDIT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_CREDIT':'PRE_CREDIT_Unused_AMT'})
# Total monthly payment of previous unused offers
PRE_CREDIT_Unused_ANNUITY = previous_application[previous_application['NAME_CONTRACT_STATUS'] == 'Unused offer'][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_CREDIT_Unused_ANNUITY'})
# Number of applications refused with reason HC
PRE_HC_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'HC')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_HC_Refused_NUM'})
# Number of applications refused with reason LIMIT
PRE_LIMIT_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'LIMIT')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_LIMIT_Refused_NUM'})
# Number of applications refused with reason SCO
PRE_SCO_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'SCO')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_SCO_Refused_NUM'})
# Number of applications refused with reason SCOFR
PRE_SCOFR_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'SCOFR')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_SCOFR_Refused_NUM'})
# Number of applications refused with reason XNA
PRE_XNA_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'XNA')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_XNA_Refused_NUM'})
# Number of applications refused with reason VERIF
PRE_VERIF_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'VERIF')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_VERIF_Refused_NUM'})
# Number of applications refused with reason SYSTEM
PRE_SYSTEM_Refused_NUM = previous_application[(previous_application['NAME_CONTRACT_STATUS'] == 'Refused') & (previous_application['CODE_REJECT_REASON'] == 'SYSTEM')][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_SYSTEM_Refused_NUM'})
# Maximum interest rate across previous applications
PRE_MAX_INTEREST_RATE = previous_application[['SK_ID_CURR','RATE_INTEREST_ACTUAL']].groupby(['SK_ID_CURR']).max().rename(columns={'RATE_INTEREST_ACTUAL':'PRE_MAX_INTEREST_RATE'})
# Minimum interest rate across previous applications
PRE_MIN_INTEREST_RATE = previous_application[['SK_ID_CURR','RATE_INTEREST_ACTUAL']].groupby(['SK_ID_CURR']).min().rename(columns={'RATE_INTEREST_ACTUAL':'PRE_MIN_INTEREST_RATE'})
# Mean interest rate across previous applications
PRE_AVG_INTEREST_RATE = previous_application[['SK_ID_CURR','RATE_INTEREST_ACTUAL']].groupby(['SK_ID_CURR']).mean().rename(columns={'RATE_INTEREST_ACTUAL':'PRE_AVG_INTEREST_RATE'})

# Number of previous loans still being repaid
PRE_REPAY_NUM = previous_application[previous_application.DAYS_TERMINATION == 365243.0][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_NUM_REPAY'})
# Total amount of loans still being repaid
PRE_REPAY_AMT = previous_application[previous_application.DAYS_TERMINATION == 365243.0][['SK_ID_CURR','AMT_APPLICATION']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_APPLICATION':'PRE_REPAY_AMT'})
# Total monthly payment of loans still being repaid
PRE_REPAY_ANNUITY = previous_application[previous_application.DAYS_TERMINATION == 365243.0][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_REPAY_ANNUITY'})

# Number of loans still being repaid that are overdue
PRE_REAPY_OVERDUR_NUM = previous_application[(previous_application.DAYS_TERMINATION == 365243.0) & (previous_application.DAYS_LAST_DUE != 365243.0)][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'PRE_NUM_REAPY_OVERDUR'})
# Total amount of loans still being repaid that are overdue
PRE_REAPY_OVERDUR_AMT = previous_application[(previous_application.DAYS_TERMINATION == 365243.0) & (previous_application.DAYS_LAST_DUE != 365243.0)][['SK_ID_CURR','AMT_APPLICATION']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_APPLICATION':'PRE_REAPY_OVERDUR_AMT'})
# Total monthly payment of loans still being repaid that are overdue
PRE_REAPY_OVERDUR_ANNUITY = previous_application[(previous_application.DAYS_TERMINATION == 365243.0) & (previous_application.DAYS_LAST_DUE != 365243.0)][['SK_ID_CURR','AMT_ANNUITY']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_ANNUITY':'PRE_REAPY_OVERDUR_ANNUITY'})

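# Last 3 months, 6 months, 1 year, 2 years, 3 years, 3+ years: number of loans, refusals, approvals, total credit amount, monthly payment (not yet implemented)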

del previous_application

Initial memory usage: 471.48 MB
Optimized memory usage: 321.75 MB
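
The derived column `RATE_INTEREST_ACTUAL` is the total scheduled repayment (`AMT_ANNUITY * CNT_PAYMENT`) minus the principal, divided by the principal. It is therefore the interest over the whole life of the loan, not an annualized rate. The sketch below works through the formula on three made-up rows (the values are illustrative assumptions, not taken from the dataset) and shows why a result of exactly -1 is treated as missing: it means nothing is scheduled to be repaid.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal loan, a loan with zero payments, a loan with a missing annuity.
loans = pd.DataFrame({
    "AMT_ANNUITY": [1100.0, 500.0, np.nan],
    "CNT_PAYMENT": [12.0, 0.0, 24.0],
    "AMT_CREDIT":  [12000.0, 6000.0, 9000.0],
})

total_repaid = loans["AMT_ANNUITY"] * loans["CNT_PAYMENT"]
loans["RATE_INTEREST_ACTUAL"] = (total_repaid - loans["AMT_CREDIT"]) / loans["AMT_CREDIT"]

# A rate of exactly -1 means total_repaid == 0, so the value carries no information.
loans.loc[loans["RATE_INTEREST_ACTUAL"] == -1.0, "RATE_INTEREST_ACTUAL"] = np.nan

print(loans)
```

For the first row, 1100 * 12 = 13200 is repaid on a principal of 12000, giving a lifetime rate of 1200 / 12000 = 0.1. The other two rows end up as NaN.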


In [44]:
pre_extra = concat_df_by_name('PRE')
pre_extra.shape

(338857, 46)
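
The count, credit sum and annuity sum are computed separately for each `NAME_PORTFOLIO` value. As a sketch under assumptions (toy data, and column names that keep the category's original spelling, so `Cash` rather than the `CASH` used in the notebook), a single grouped aggregation followed by `unstack` produces the same family of features. Customers with no loans in a category get NaN, which matches what the separate per-category frames yield once they are joined.

```python
import pandas as pd

# Toy stand-in for previous_application.
prev = pd.DataFrame({
    "SK_ID_CURR":     [100, 100, 100, 200],
    "SK_ID_PREV":     [1, 2, 3, 4],
    "NAME_PORTFOLIO": ["POS", "POS", "Cash", "Cards"],
    "AMT_CREDIT":     [1000.0, 2000.0, 5000.0, 300.0],
    "AMT_ANNUITY":    [100.0, 200.0, 250.0, 30.0],
})

# One pass: count, credit sum and annuity sum per customer and portfolio.
agg = prev.groupby(["SK_ID_CURR", "NAME_PORTFOLIO"]).agg(
    NUM=("SK_ID_PREV", "count"),
    AMT=("AMT_CREDIT", "sum"),
    ANNUITY=("AMT_ANNUITY", "sum"),
)

# Move the portfolio level into the columns and flatten the names.
wide = agg.unstack("NAME_PORTFOLIO")
wide.columns = [f"PRE_CREDIT_{portfolio}_{stat}" for stat, portfolio in wide.columns]

print(wide)
```

The same pattern would apply to `NAME_CONTRACT_STATUS` by swapping the grouping column.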

In [45]:
# POS_CASH_balance table
POS_CASH_balance = load_data('POS_CASH_balance')
POS_CASH_balance = reduce_mem_usage(POS_CASH_balance)

# POS: total number of loans
POS_CREDIT_NUM = POS_CASH_balance[['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'POS_CREDIT_NUM'})
# POS: number of loans already paid off
POS_FINISH_NUM = POS_CASH_balance[POS_CASH_balance.NAME_CONTRACT_STATUS == 'Completed'][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'POS_FINISH_NUM'})
# POS: number of loans currently being repaid
POS_REPAY_NUM = POS_CASH_balance[ (POS_CASH_balance.MONTHS_BALANCE == -1) & (POS_CASH_balance.NAME_CONTRACT_STATUS == 'Active')][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'POS_REPAY_NUM'})
# POS: maximum days past due on paid-off loans
tmp = pd.DataFrame(POS_CASH_balance[POS_CASH_balance.NAME_CONTRACT_STATUS == 'Completed']['SK_ID_PREV'].unique(), columns=['SK_ID_PREV'])
tmp = pd.merge(POS_CASH_balance, tmp, how='inner', on='SK_ID_PREV')
POS_DAYS_MAXOVERDUE_FINISH = tmp[['SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_CURR']).max().rename(columns={'SK_DPD_DEF':'POS_DAYS_MAXOVERDUE_FINISH'})
# POS: number of paid-off loans that were ever overdue
POS_NUM_MAXOVERDUE_FINISH = tmp[tmp.SK_DPD_DEF > 0][['SK_ID_PREV','SK_ID_CURR']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'POS_NUM_MAXOVERDUE_FINISH'})
# POS: maximum days past due on loans currently being repaid
tmp = pd.DataFrame(POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE == -1) & (POS_CASH_balance.NAME_CONTRACT_STATUS == 'Active')]['SK_ID_PREV'].unique(), columns=['SK_ID_PREV'])
tmp = pd.merge(POS_CASH_balance, tmp, how='inner', on='SK_ID_PREV')
POS_DAYS_MAXOVERDUE_REPAY = tmp[['SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_CURR']).max().rename(columns={'SK_DPD_DEF':'POS_DAYS_MAXOVERDUE_REPAY'})
# POS: number of loans currently being repaid that were ever overdue
POS_NUM_MAXOVERDUE_REPAY = tmp[tmp.SK_DPD_DEF > 0][['SK_ID_PREV','SK_ID_CURR']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'POS_NUM_MAXOVERDUE_REPAY'})

# POS, last 6 months: number of loans whose worst days past due is 0, 1-7, 8-14, 15-30, 31-90, 90+
tmp = POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE >= -6)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
POS_OVERDUE_6_0 = tmp[tmp.SK_DPD_DEF ==0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_6_0'})
POS_OVERDUE_6_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_6_7'})
POS_OVERDUE_6_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_6_14'})
POS_OVERDUE_6_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_6_30'})
POS_OVERDUE_6_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_6_90'})
POS_OVERDUE_6_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_6_90plus'})

# POS, months 7-12: number of loans whose worst days past due is 0, 1-7, 8-14, 15-30, 31-90, 90+
tmp = POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE >= -12) & (POS_CASH_balance.MONTHS_BALANCE < -6)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
POS_OVERDUE_12_0 = tmp[tmp.SK_DPD_DEF ==0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_12_0'})
POS_OVERDUE_12_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_12_7'})
POS_OVERDUE_12_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_12_14'})
POS_OVERDUE_12_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_12_30'})
POS_OVERDUE_12_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_12_90'})
POS_OVERDUE_12_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_12_90plus'})

# POS, months 13-24: number of loans whose worst days past due is 0, 1-7, 8-14, 15-30, 31-90, 90+
tmp = POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE >= -24) & (POS_CASH_balance.MONTHS_BALANCE < -12)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
POS_OVERDUE_24_0 = tmp[tmp.SK_DPD_DEF ==0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_24_0'})
POS_OVERDUE_24_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_24_7'})
POS_OVERDUE_24_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_24_14'})
POS_OVERDUE_24_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_24_30'})
POS_OVERDUE_24_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_24_90'})
POS_OVERDUE_24_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_24_90plus'})

# POS, months 25-36: number of loans whose worst days past due is 0, 1-7, 8-14, 15-30, 31-90, 90+
tmp = POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE >= -36) & (POS_CASH_balance.MONTHS_BALANCE < -24)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
POS_OVERDUE_36_0 = tmp[tmp.SK_DPD_DEF ==0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36_0'})
POS_OVERDUE_36_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36_7'})
POS_OVERDUE_36_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36_14'})
POS_OVERDUE_36_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36_30'})
POS_OVERDUE_36_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36_90'})
POS_OVERDUE_36_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36_90plus'})

# POS, more than 36 months ago: number of loans whose worst days past due is 0, 1-7, 8-14, 15-30, 31-90, 90+
tmp = POS_CASH_balance[POS_CASH_balance.MONTHS_BALANCE < -36][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
POS_OVERDUE_36plus_0 = tmp[tmp.SK_DPD_DEF ==0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36plus_0'})
POS_OVERDUE_36plus_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36plus_7'})
POS_OVERDUE_36plus_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36plus_14'})
POS_OVERDUE_36plus_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36plus_30'})
POS_OVERDUE_36plus_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36plus_90'})
POS_OVERDUE_36plus_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_OVERDUE_36plus_90plus'})

# POS: number of loans currently still overdue, and their maximum days past due
POS_NUM_OVERDUE_STILL = POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE == -1) & (POS_CASH_balance.SK_DPD_DEF > 30)][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'POS_NUM_OVERDUE_STILL'})
POS_DAYS_MAXOVERDUE_STILL = POS_CASH_balance[(POS_CASH_balance.MONTHS_BALANCE == -1) & (POS_CASH_balance.SK_DPD_DEF > 30)][['SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_CURR']).max().rename(columns={'SK_DPD_DEF':'POS_DAYS_MAXOVERDUE_STILL'})

del POS_CASH_balance
del tmp

Initial memory usage: 610.43 MB
Optimized memory usage: 314.76 MB
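
The days-past-due buckets (0, 1-7, 8-14, 15-30, 31-90, 90+) are spelled out as explicit range comparisons for every time window. A minimal sketch of the same bucketing with `pd.cut` follows, assuming toy data and hypothetical label names. With the default `right=True`, each bin is closed on the right, so the edges `[-1, 0, 7, 14, 30, 90, inf]` reproduce the `> lower & <= upper` conditions used in the cell.

```python
import pandas as pd

# Toy stand-in for one time window of POS_CASH_balance.
pos = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2, 3, 4],
    "SK_ID_CURR": [100, 100, 100, 200, 200],
    "SK_DPD_DEF": [0, 5, 0, 45, 120],
})

# Worst days past due per loan within the window.
worst = pos.groupby(["SK_ID_PREV", "SK_ID_CURR"], as_index=False)["SK_DPD_DEF"].max()

labels = ["0", "7", "14", "30", "90", "90plus"]
bins = [-1, 0, 7, 14, 30, 90, float("inf")]
# Right-closed bins: (-1, 0], (0, 7], (7, 14], (14, 30], (30, 90], (90, inf].
worst["bucket"] = pd.cut(worst["SK_DPD_DEF"], bins=bins, labels=labels).astype(str)

# Loans per customer and bucket, with a fixed column set.
counts = pd.crosstab(worst["SK_ID_CURR"], worst["bucket"]).reindex(columns=labels, fill_value=0)
counts.columns = [f"POS_OVERDUE_6_{label}" for label in counts.columns]

print(counts)
```

This version writes 0 for empty buckets, whereas the original cells leave customers without a matching loan out of the per-bucket frame.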


In [46]:
pos_extra = concat_df_by_name('POS')
pos_extra.shape

(337252, 39)
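
A note on boolean masks: in Python, `&` binds more tightly than comparison operators such as `<`, so every comparison in a combined pandas mask needs its own parentheses. An expression like `df.A < -1080 & (df.B > 0)` is parsed as `df.A < (-1080 & (df.B > 0))`, which is not the intended filter. The toy example below (made-up values, reusing the `installments_payments` column names) compares the two forms.

```python
import pandas as pd

# Toy stand-in for installments_payments.
inst = pd.DataFrame({
    "DAYS_INSTALMENT": [-2000, -1500, -500, -100],
    "DAYS_DIFF":       [5, -3, 10, 0],
})

# Missing parentheses: parsed as DAYS_INSTALMENT < (-1080 & (DAYS_DIFF > 0)).
loose = inst.DAYS_INSTALMENT < -1080 & (inst.DAYS_DIFF > 0)

# Intended filter: older than 1080 days AND paid late.
strict = (inst.DAYS_INSTALMENT < -1080) & (inst.DAYS_DIFF > 0)

print("rows selected without parentheses:", int(loose.sum()))
print("rows selected with parentheses:   ", int(strict.sum()))
```

Only the first row is both older than 1080 days and late, so the parenthesized mask selects exactly one row; the unparenthesized mask selects a different set.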

In [47]:
# installments_payments table
installments_payments = load_data('installments_payments')
installments_payments = reduce_mem_usage(installments_payments)

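# Many installment records have an amount due of 67.5 and a large share of them are overdue, which looks like a data problem.
# Drop them (note that the filter below removes every installment of 100 or less); .copy() avoids a SettingWithCopyWarning on the next assignment.
installments_payments = installments_payments[installments_payments.AMT_INSTALMENT > 100].copy()
installments_payments['DAYS_DIFF'] = installments_payments['DAYS_ENTRY_PAYMENT'] - installments_payments['DAYS_INSTALMENT']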
# Installments: number of loans
INST_NUM = installments_payments[['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM'})
# Installments, last 6 months: number of loans with installments due
INST_NUM_6m = installments_payments[installments_payments.DAYS_INSTALMENT >= -180][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m'})
# Installments, last 6 months: number of loans with a late payment
INST_NUM_6m_all = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m_all'})
# Installments, last 6 months: number of loans with a payment 1-7 days late
INST_NUM_6m_7d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 0) & (installments_payments.DAYS_DIFF <= 7)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m_7d'})
# Installments, last 6 months: number of loans with a payment 8-14 days late
INST_NUM_6m_14d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 7) & (installments_payments.DAYS_DIFF <= 14)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m_14d'})
# Installments, last 6 months: number of loans with a payment 15-30 days late
INST_NUM_6m_30d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 14) & (installments_payments.DAYS_DIFF <= 30)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m_30d'})
# Installments, last 6 months: number of loans with a payment 31-90 days late
INST_NUM_6m_90d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 30) & (installments_payments.DAYS_DIFF <= 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m_90d'})
# Installments, last 6 months: number of loans with a payment more than 90 days late
INST_NUM_6m_90dplus = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_6m_90dplus'})

# Installments, months 6-12: number of loans with installments due
INST_NUM_12m = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m'})
# Installments, months 6-12: number of loans with a late payment
INST_NUM_12m_all = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m_all'})
# Installments, months 6-12: number of loans with a payment 1-7 days late
INST_NUM_12m_7d = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 0) & (installments_payments.DAYS_DIFF <= 7)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m_7d'})
# Installments, months 6-12: number of loans with a payment 8-14 days late
INST_NUM_12m_14d = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 7) & (installments_payments.DAYS_DIFF <= 14)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m_14d'})
# Installments, months 6-12: number of loans with a payment 15-30 days late
INST_NUM_12m_30d = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 14) & (installments_payments.DAYS_DIFF <= 30)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m_30d'})
# Installments, months 6-12: number of loans with a payment 31-90 days late
INST_NUM_12m_90d = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 30) & (installments_payments.DAYS_DIFF <= 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m_90d'})
# Installments, months 6-12: number of loans with a payment more than 90 days late
INST_NUM_12m_90dplus = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_12m_90dplus'})

# Installments, months 12-24: number of loans with installments due
# (window is -720 <= DAYS_INSTALMENT < -360; the lower bound must use >=, not <=)
INST_NUM_24m = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m'})
# Installments, months 12-24: number of loans with a late payment
INST_NUM_24m_all = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m_all'})
# Installments, months 12-24: number of loans with a payment 1-7 days late
INST_NUM_24m_7d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_DIFF > 0) & (installments_payments.DAYS_DIFF <= 7)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m_7d'})
# Installments, months 12-24: number of loans with a payment 8-14 days late
INST_NUM_24m_14d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_DIFF > 7) & (installments_payments.DAYS_DIFF <= 14)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m_14d'})
# Installments, months 12-24: number of loans with a payment 15-30 days late
INST_NUM_24m_30d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_DIFF > 14) & (installments_payments.DAYS_DIFF <= 30)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m_30d'})
# Installments, months 12-24: number of loans with a payment 31-90 days late
INST_NUM_24m_90d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_DIFF > 30) & (installments_payments.DAYS_DIFF <= 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m_90d'})
# Installments, months 12-24: number of loans with a payment more than 90 days late
INST_NUM_24m_90dplus = installments_payments[(installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_DIFF > 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_24m_90dplus'})

# Installments, months 24-36: number of loans with installments due
INST_NUM_36m = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m'})
# Installments, months 24-36: number of loans with a late payment
INST_NUM_36m_all = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m_all'})
# Installments, months 24-36: number of loans with a payment 1-7 days late
INST_NUM_36m_7d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_DIFF > 0) & (installments_payments.DAYS_DIFF <= 7)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m_7d'})
# Installments, months 24-36: number of loans with a payment 8-14 days late
INST_NUM_36m_14d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_DIFF > 7) & (installments_payments.DAYS_DIFF <= 14)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m_14d'})
# Installments, months 24-36: number of loans with a payment 15-30 days late
INST_NUM_36m_30d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_DIFF > 14) & (installments_payments.DAYS_DIFF <= 30)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m_30d'})
# Installments, months 24-36: number of loans with a payment 31-90 days late
INST_NUM_36m_90d = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_DIFF > 30) & (installments_payments.DAYS_DIFF <= 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m_90d'})
# Installments, months 24-36: number of loans with a payment more than 90 days late
INST_NUM_36m_90dplus = installments_payments[(installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_DIFF > 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36m_90dplus'})

# Installments, more than 36 months ago: number of loans with installments due
INST_NUM_36mplus = installments_payments[installments_payments.DAYS_INSTALMENT < -1080][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus'})
# Installments, more than 36 months ago: number of loans with a late payment
# (each comparison is parenthesized because & binds more tightly than <)
INST_NUM_36mplus_all = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus_all'})
# Installments, more than 36 months ago: number of loans with a payment 1-7 days late
INST_NUM_36mplus_7d = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 0) & (installments_payments.DAYS_DIFF <= 7)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus_7d'})
# Installments, more than 36 months ago: number of loans with a payment 8-14 days late
INST_NUM_36mplus_14d = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 7) & (installments_payments.DAYS_DIFF <= 14)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus_14d'})
# Installments, more than 36 months ago: number of loans with a payment 15-30 days late
INST_NUM_36mplus_30d = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 14) & (installments_payments.DAYS_DIFF <= 30)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus_30d'})
# Installments, more than 36 months ago: number of loans with a payment 31-90 days late
INST_NUM_36mplus_90d = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 30) & (installments_payments.DAYS_DIFF <= 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus_90d'})
# Installments, more than 36 months ago: number of loans with a payment more than 90 days late
INST_NUM_36mplus_90dplus = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 90)][['SK_ID_CURR','SK_ID_PREV']].groupby('SK_ID_CURR').nunique().rename(columns={'SK_ID_PREV':'INST_NUM_36mplus_90dplus'})

# Overdue amount (sum of AMT_PAYMENT on late installments) due in the last 6 months
INST_AMT_6m = installments_payments[(installments_payments.DAYS_INSTALMENT >= -180) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','AMT_PAYMENT']].groupby('SK_ID_CURR').sum().rename(columns={'AMT_PAYMENT':'INST_AMT_6m'})
# Overdue amount (sum of AMT_PAYMENT on late installments) due 6-12 months ago
INST_AMT_12m = installments_payments[(installments_payments.DAYS_INSTALMENT < -180) & (installments_payments.DAYS_INSTALMENT >= -360) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','AMT_PAYMENT']].groupby('SK_ID_CURR').sum().rename(columns={'AMT_PAYMENT':'INST_AMT_12m'})
# Overdue amount (sum of AMT_PAYMENT on late installments) due 12-24 months ago
INST_AMT_24m = installments_payments[(installments_payments.DAYS_INSTALMENT < -360) & (installments_payments.DAYS_INSTALMENT >= -720) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','AMT_PAYMENT']].groupby('SK_ID_CURR').sum().rename(columns={'AMT_PAYMENT':'INST_AMT_24m'})
# Overdue amount (sum of AMT_PAYMENT on late installments) due 24-36 months ago
INST_AMT_36m = installments_payments[(installments_payments.DAYS_INSTALMENT < -720) & (installments_payments.DAYS_INSTALMENT >= -1080) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','AMT_PAYMENT']].groupby('SK_ID_CURR').sum().rename(columns={'AMT_PAYMENT':'INST_AMT_36m'})
# Overdue amount (sum of AMT_PAYMENT on late installments) due more than 36 months ago
INST_AMT_36mplus = installments_payments[(installments_payments.DAYS_INSTALMENT < -1080) & (installments_payments.DAYS_DIFF > 0)][['SK_ID_CURR','AMT_PAYMENT']].groupby('SK_ID_CURR').sum().rename(columns={'AMT_PAYMENT':'INST_AMT_36mplus'})

# Number of loans currently still overdue (installments with no recorded payment date)
INST_NUM_STILL = installments_payments[installments_payments.DAYS_ENTRY_PAYMENT.isnull()][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'INST_NUM_STILL'})
# Total number of unpaid installments across loans currently still overdue
INST_NUM_SEQ_STILL = installments_payments[installments_payments.DAYS_ENTRY_PAYMENT.isnull()][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'INST_NUM_SEQ_STILL'})
# Longest current delinquency: earliest DAYS_INSTALMENT among unpaid installments (the most negative value)
INST_DAYS_MAX_STILL = installments_payments[installments_payments.DAYS_ENTRY_PAYMENT.isnull()][['SK_ID_CURR','DAYS_INSTALMENT']].groupby(['SK_ID_CURR']).min().rename(columns={'DAYS_INSTALMENT':'INST_DAYS_MAX_STILL'})
# Total installment amount currently still overdue
INST_AMT_STILL = installments_payments[installments_payments.DAYS_ENTRY_PAYMENT.isnull()][['SK_ID_CURR','AMT_INSTALMENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_INSTALMENT':'INST_AMT_STILL'})

del installments_payments

Initial memory usage: 830.41 MB
Memory usage after optimization: 415.20 MB


In [48]:
inst_extra = concat_df_by_name('INST')
inst_extra.shape

(339572, 45)
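A note on the boolean filters used for these installment features: in Python, `&` binds more tightly than comparison operators such as `<`, so an expression like `df.a < -1080 & (df.b > 0)` is parsed as `df.a < (-1080 & (df.b > 0))`. Wrapping every comparison in its own parentheses avoids this. The following is a minimal sketch on made-up toy data; the helper name `late_and_old` is hypothetical, and `DAYS_DIFF` is assumed to be the lateness column built earlier in this notebook.

```python
import pandas as pd

def late_and_old(df: pd.DataFrame, days: int = -1080) -> pd.Series:
    """Boolean mask: installment due before `days` AND paid late."""
    # Each comparison is wrapped in its own parentheses before combining with &.
    return (df["DAYS_INSTALMENT"] < days) & (df["DAYS_DIFF"] > 0)

# Toy data (made up for illustration; not taken from the competition files).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "SK_ID_PREV": [10, 11, 20, 30],
    "DAYS_INSTALMENT": [-1200, -100, -1500, -2000],
    "DAYS_DIFF": [5, 5, 0, 12],
})

mask = late_and_old(toy)
# Distinct late loans per customer, mirroring the nunique() pattern used above.
late_loans = toy[mask].groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()
print(late_loans)
```

Building the mask in a small function also keeps the time-window threshold in one place instead of repeating it on every line.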

In [49]:
# credit_card_balance table
credit_card_balance = load_data('credit_card_balance')
credit_card_balance = reduce_mem_usage(credit_card_balance)

# Number of credit cards
CREDIT_NUM = credit_card_balance[['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM'})
# Number of credit cards by contract status
CREDIT_NUM_ACTIVE = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Active'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM_ACTIVE'})
CREDIT_NUM_COMPLETED = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Completed'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM_COMPLETED'})
CREDIT_NUM_SIGNED = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Signed'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM_SIGNED'})
CREDIT_NUM_DEMAND = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Demand'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM_DEMAND'})
CREDIT_NUM_SENT = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Sent proposal'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM_SENT'})
CREDIT_NUM_REFUSED = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Refused'][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).nunique().rename(columns={'SK_ID_PREV':'CREDIT_NUM_REFUSED'})
# Credit card history length: number of monthly records per card
tmp = credit_card_balance[['SK_ID_PREV','SK_ID_CURR','MONTHS_BALANCE']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).count()
CREDIT_MONTHS_MAX = tmp[['SK_ID_CURR','MONTHS_BALANCE']].groupby('SK_ID_CURR').max().rename(columns={'MONTHS_BALANCE':'CREDIT_MONTHS_MAX'})
CREDIT_MONTHS_MIN = tmp[['SK_ID_CURR','MONTHS_BALANCE']].groupby('SK_ID_CURR').min().rename(columns={'MONTHS_BALANCE':'CREDIT_MONTHS_MIN'})
CREDIT_MONTHS_AVG = tmp[['SK_ID_CURR','MONTHS_BALANCE']].groupby('SK_ID_CURR').mean().rename(columns={'MONTHS_BALANCE':'CREDIT_MONTHS_AVG'})
CREDIT_MONTHS_SUM = tmp[['SK_ID_CURR','MONTHS_BALANCE']].groupby('SK_ID_CURR').sum().rename(columns={'MONTHS_BALANCE':'CREDIT_MONTHS_SUM'})
# Average monthly credit card balance over the last 1, 3, 6, 12, 24, 36 months and beyond 36 months
CREDIT_AMT_1m = credit_card_balance[credit_card_balance.MONTHS_BALANCE == -1][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_1m'})
CREDIT_AMT_3m = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -3][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_3m'})
CREDIT_AMT_6m = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -6][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_6m'})
CREDIT_AMT_12m = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -12][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_12m'})
CREDIT_AMT_24m = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -24][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_24m'})
CREDIT_AMT_36m = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -36][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_36m'})
CREDIT_AMT_36mplus = credit_card_balance[credit_card_balance.MONTHS_BALANCE < -36][['SK_ID_CURR','AMT_BALANCE']].groupby(['SK_ID_CURR']).mean().rename(columns={'AMT_BALANCE':'CREDIT_AMT_36mplus'})
# Total credit card drawings over the last 1, 3, 6, 12, 24, 36 months and beyond 36 months
CREDIT_AMT_1m_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE == -1][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_1m_CURRENT'})
CREDIT_AMT_3m_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -3][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_3m_CURRENT'})
CREDIT_AMT_6m_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -6][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_6m_CURRENT'})
CREDIT_AMT_12m_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -12][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_12m_CURRENT'})
CREDIT_AMT_24m_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -24][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_24m_CURRENT'})
CREDIT_AMT_36m_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -36][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_36m_CURRENT'})
CREDIT_AMT_36mplus_CURRENT = credit_card_balance[credit_card_balance.MONTHS_BALANCE < -36][['SK_ID_CURR','AMT_DRAWINGS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_CURRENT':'CREDIT_AMT_36mplus_CURRENT'})
# Total credit card POS drawings over the last 1, 3, 6, 12, 24, 36 months and beyond 36 months
CREDIT_AMT_1m_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE == -1][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_1m_POS'})
CREDIT_AMT_3m_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -3][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_3m_POS'})
CREDIT_AMT_6m_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -6][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_6m_POS'})
CREDIT_AMT_12m_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -12][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_12m_POS'})
CREDIT_AMT_24m_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -24][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_24m_POS'})
CREDIT_AMT_36m_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -36][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_36m_POS'})
CREDIT_AMT_36mplus_POS = credit_card_balance[credit_card_balance.MONTHS_BALANCE < -36][['SK_ID_CURR','AMT_DRAWINGS_POS_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_POS_CURRENT':'CREDIT_AMT_36mplus_POS'})
# Total credit card ATM drawings over the last 1, 3, 6, 12, 24, 36 months and beyond 36 months
CREDIT_AMT_1m_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE == -1][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_1m_ATM'})
CREDIT_AMT_3m_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -3][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_3m_ATM'})
CREDIT_AMT_6m_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -6][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_6m_ATM'})
CREDIT_AMT_12m_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -12][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_12m_ATM'})
CREDIT_AMT_24m_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -24][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_24m_ATM'})
CREDIT_AMT_36m_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE >= -36][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_36m_ATM'})
CREDIT_AMT_36mplus_ATM = credit_card_balance[credit_card_balance.MONTHS_BALANCE < -36][['SK_ID_CURR','AMT_DRAWINGS_ATM_CURRENT']].groupby(['SK_ID_CURR']).sum().rename(columns={'AMT_DRAWINGS_ATM_CURRENT':'CREDIT_AMT_36mplus_ATM'})
# Maximum and minimum credit limit on active cards (the average is not computed here)
CREDIT_LIMIT_MAX = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Active'][['SK_ID_CURR','AMT_CREDIT_LIMIT_ACTUAL']].groupby(['SK_ID_CURR']).max().rename(columns={'AMT_CREDIT_LIMIT_ACTUAL':'CREDIT_LIMIT_MAX'})
CREDIT_LIMIT_MIN = credit_card_balance[credit_card_balance.NAME_CONTRACT_STATUS == 'Active'][['SK_ID_CURR','AMT_CREDIT_LIMIT_ACTUAL']].groupby(['SK_ID_CURR']).min().rename(columns={'AMT_CREDIT_LIMIT_ACTUAL':'CREDIT_LIMIT_MIN'})
# Number of credit cards by worst days past due (0, 1-7, 8-14, 15-30, 31-90, 90+) over the last 6 months
tmp = credit_card_balance[(credit_card_balance.MONTHS_BALANCE >= -6)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
CREDIT_OVERDUE_6_0 = tmp[tmp.SK_DPD_DEF == 0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_6_0'})
CREDIT_OVERDUE_6_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_6_7'})
CREDIT_OVERDUE_6_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_6_14'})
CREDIT_OVERDUE_6_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_6_30'})
CREDIT_OVERDUE_6_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_6_90'})
CREDIT_OVERDUE_6_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_6_90plus'})

# Number of credit cards by worst days past due (0, 1-7, 8-14, 15-30, 31-90, 90+) 7-12 months ago
tmp = credit_card_balance[(credit_card_balance.MONTHS_BALANCE >= -12) & (credit_card_balance.MONTHS_BALANCE < -6)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
CREDIT_OVERDUE_12_0 = tmp[tmp.SK_DPD_DEF == 0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_12_0'})
CREDIT_OVERDUE_12_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_12_7'})
CREDIT_OVERDUE_12_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_12_14'})
CREDIT_OVERDUE_12_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_12_30'})
CREDIT_OVERDUE_12_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_12_90'})
CREDIT_OVERDUE_12_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_12_90plus'})

# Number of credit cards by worst days past due (0, 1-7, 8-14, 15-30, 31-90, 90+) 13-24 months ago
tmp = credit_card_balance[(credit_card_balance.MONTHS_BALANCE >= -24) & (credit_card_balance.MONTHS_BALANCE < -12)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
CREDIT_OVERDUE_24_0 = tmp[tmp.SK_DPD_DEF == 0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_24_0'})
CREDIT_OVERDUE_24_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_24_7'})
CREDIT_OVERDUE_24_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_24_14'})
CREDIT_OVERDUE_24_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_24_30'})
CREDIT_OVERDUE_24_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_24_90'})
CREDIT_OVERDUE_24_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_24_90plus'})

# Number of credit cards by worst days past due (0, 1-7, 8-14, 15-30, 31-90, 90+) 25-36 months ago
tmp = credit_card_balance[(credit_card_balance.MONTHS_BALANCE >= -36) & (credit_card_balance.MONTHS_BALANCE < -24)][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
CREDIT_OVERDUE_36_0 = tmp[tmp.SK_DPD_DEF == 0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36_0'})
CREDIT_OVERDUE_36_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36_7'})
CREDIT_OVERDUE_36_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36_14'})
CREDIT_OVERDUE_36_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36_30'})
CREDIT_OVERDUE_36_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36_90'})
CREDIT_OVERDUE_36_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36_90plus'})

# Number of credit cards by worst days past due (0, 1-7, 8-14, 15-30, 31-90, 90+) more than 36 months ago
tmp = credit_card_balance[credit_card_balance.MONTHS_BALANCE < -36][['SK_ID_PREV','SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_PREV','SK_ID_CURR'], as_index=False).max()
CREDIT_OVERDUE_36plus_0 = tmp[tmp.SK_DPD_DEF == 0][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36plus_0'})
CREDIT_OVERDUE_36plus_7 = tmp[(tmp.SK_DPD_DEF > 0) & (tmp.SK_DPD_DEF <= 7)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36plus_7'})
CREDIT_OVERDUE_36plus_14 = tmp[(tmp.SK_DPD_DEF > 7) & (tmp.SK_DPD_DEF <= 14)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36plus_14'})
CREDIT_OVERDUE_36plus_30 = tmp[(tmp.SK_DPD_DEF > 14) & (tmp.SK_DPD_DEF <= 30)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36plus_30'})
CREDIT_OVERDUE_36plus_90 = tmp[(tmp.SK_DPD_DEF > 30) & (tmp.SK_DPD_DEF <= 90)][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36plus_90'})
CREDIT_OVERDUE_36plus_90plus = tmp[tmp.SK_DPD_DEF > 90][['SK_ID_PREV','SK_ID_CURR']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_OVERDUE_36plus_90plus'})

# Number of credit cards currently still overdue and their maximum days past due (disabled)
# CREDIT_NUM_OVERDUE_STILL = credit_card_balance[(credit_card_balance.MONTHS_BALANCE == -1) & (credit_card_balance.SK_DPD_DEF > 30)][['SK_ID_CURR','SK_ID_PREV']].groupby(['SK_ID_CURR']).count().rename(columns={'SK_ID_PREV':'CREDIT_NUM_OVERDUE_STILL'})
# CREDIT_DAYS_MAXOVERDUE_STILL = credit_card_balance[(credit_card_balance.MONTHS_BALANCE == -1) & (credit_card_balance.SK_DPD_DEF > 30)][['SK_ID_CURR','SK_DPD_DEF']].groupby(['SK_ID_CURR']).max().rename(columns={'SK_DPD_DEF':'CREDIT_DAYS_MAXOVERDUE_STILL'})

del credit_card_balance
del tmp

Initial memory usage: 673.88 MB
Memory usage after optimization: 318.63 MB
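The windowed drawing totals in the cell above repeat one pattern: filter rows by `MONTHS_BALANCE`, group by `SK_ID_CURR`, and sum one amount column. A possible way to express that pattern once is sketched below. The names `WINDOWS` and `windowed_sums` are hypothetical and not part of the notebook, and the toy data is made up; the window definitions copy the filters used above (note that the 3 to 36 month windows are cumulative, not disjoint).

```python
import pandas as pd

# Window tag -> predicate on MONTHS_BALANCE, copied from the filters used above.
WINDOWS = {
    "1m": lambda m: m == -1,
    "3m": lambda m: m >= -3,
    "6m": lambda m: m >= -6,
    "12m": lambda m: m >= -12,
    "24m": lambda m: m >= -24,
    "36m": lambda m: m >= -36,
    "36mplus": lambda m: m < -36,
}

def windowed_sums(balance: pd.DataFrame, value_col: str, suffix: str) -> pd.DataFrame:
    """Sum `value_col` per customer for each time window (hypothetical helper)."""
    parts = []
    for tag, in_window in WINDOWS.items():
        rows = balance[in_window(balance["MONTHS_BALANCE"])]
        name = f"CREDIT_AMT_{tag}_{suffix}"
        parts.append(rows.groupby("SK_ID_CURR")[value_col].sum().rename(name))
    # Outer-align on SK_ID_CURR; customers without rows in a window get NaN.
    return pd.concat(parts, axis=1)

# Toy data, made up for illustration.
toy_balance = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "MONTHS_BALANCE": [-1, -2, -40, -5],
    "AMT_DRAWINGS_ATM_CURRENT": [100.0, 50.0, 7.0, 20.0],
})
atm = windowed_sums(toy_balance, "AMT_DRAWINGS_ATM_CURRENT", "ATM")
print(atm)
```

This returns one frame per amount column instead of seven separate variables, so it would need a small adaptation before being used with `concat_df_by_name`.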


In [50]:
credit_extra = concat_df_by_name('CREDIT')
credit_extra.shape

(103558, 71)
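The overdue-bucket features above all follow the same steps: take the worst `SK_DPD_DEF` per card within a time window, assign it to a days-past-due bucket, then count cards per customer and bucket. As a rough sketch, `pd.cut` can do the bucketing in one pass. The function name `overdue_bucket_counts`, the constants `BINS` and `LABELS`, and the toy data are assumptions made for illustration. Unlike the cell above, this version yields 0 (not NaN) for empty buckets of customers who have at least one card in the window.

```python
import pandas as pd

# Bucket edges for days past due: 0, 1-7, 8-14, 15-30, 31-90, 90+.
BINS = [-1, 0, 7, 14, 30, 90, float("inf")]
LABELS = ["0", "7", "14", "30", "90", "90plus"]

def overdue_bucket_counts(balance: pd.DataFrame, month_from: int,
                          month_to: int, tag: str) -> pd.DataFrame:
    """Count cards per customer by worst SK_DPD_DEF for month_from <= MONTHS_BALANCE < month_to."""
    months = balance["MONTHS_BALANCE"]
    window = balance[(months >= month_from) & (months < month_to)]
    # Worst days past due per card within the window.
    worst = window.groupby(["SK_ID_PREV", "SK_ID_CURR"], as_index=False)["SK_DPD_DEF"].max()
    # Right-closed bins: 0 falls in (-1, 0], 5 in (0, 7], and so on.
    worst["bucket"] = pd.cut(worst["SK_DPD_DEF"], bins=BINS, labels=LABELS).astype(str)
    counts = worst.groupby(["SK_ID_CURR", "bucket"]).size().unstack(fill_value=0)
    counts = counts.reindex(columns=LABELS, fill_value=0)
    counts.columns = [f"CREDIT_OVERDUE_{tag}_{label}" for label in LABELS]
    return counts

# Toy data, made up for illustration.
toy_cards = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2, 2],
    "SK_ID_PREV": [10, 10, 11, 20, 20],
    "MONTHS_BALANCE": [-1, -2, -3, -1, -8],
    "SK_DPD_DEF": [0, 5, 40, 0, 100],
})
last_6m = overdue_bucket_counts(toy_cards, month_from=-6, month_to=0, tag="6")
print(last_6m)
```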

In [51]:
extra_info = [ bureau_extra, bureau_balance_extra, pre_extra, pos_extra, inst_extra, credit_extra ]

In [52]:
def merge_info(df,ls):
    # Left-join every per-customer feature table onto the application data by SK_ID_CURR
    print('Shape before merge: {}'.format(df.shape))
    res = df.set_index('SK_ID_CURR')
    for extra in ls:
        res = pd.merge(res, extra, how = 'left', left_index=True, right_index=True)
    print('Shape after merge: {}'.format(res.shape))
    return res

In [53]:
app = train_set.copy()
app_labels = train_set['TARGET'].copy()

In [54]:
app_extra = merge_info(app, extra_info)

Shape before merge: (246008, 122)
Shape after merge: (246008, 407)
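The merge keeps the row count unchanged only if every feature table has at most one row per `SK_ID_CURR`. One way to make that assumption explicit is the `validate` argument of `DataFrame.merge`, which raises `pandas.errors.MergeError` when keys are duplicated. The sketch below uses a hypothetical helper `merge_features` and made-up toy frames; it is not the notebook's `merge_info`.

```python
import pandas as pd

def merge_features(base: pd.DataFrame, extras: list, key: str = "SK_ID_CURR") -> pd.DataFrame:
    """Left-join feature tables (indexed by `key`) onto `base`, checking key uniqueness."""
    res = base.set_index(key)
    for extra in extras:
        # validate="one_to_one" raises MergeError if either side has duplicate keys,
        # which would otherwise silently multiply rows.
        res = res.merge(extra, how="left", left_index=True, right_index=True,
                        validate="one_to_one")
    return res

# Toy data, made up for illustration.
toy_base = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
feat_a = pd.DataFrame({"A": [10.0, 30.0]}, index=pd.Index([1, 3], name="SK_ID_CURR"))
feat_b = pd.DataFrame({"B": [5.0]}, index=pd.Index([2], name="SK_ID_CURR"))

merged = merge_features(toy_base, [feat_a, feat_b])
print(merged.shape)
```

Customers missing from a feature table simply get NaN in its columns, which is the source of most missing values reported after the merge.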


In [55]:
missing_values_summary(app_extra)

Your selected dataframe has 407 columns.
There are 353 columns that have missing values.


Column,Missing Values,% of Total Values,dtype
CREDIT_OVERDUE_24_90,246007,100.0,float64
CREDIT_OVERDUE_12_90,246006,100.0,float64
POS_OVERDUE_24_90,246005,100.0,float64
CREDIT_OVERDUE_6_90,246003,100.0,float64
CREDIT_OVERDUE_12_90plus,246001,100.0,float64
CREDIT_OVERDUE_24_90plus,246001,100.0,float64
POS_OVERDUE_12_90,246000,100.0,float64
CREDIT_OVERDUE_36_90plus,246000,100.0,float64
CREDIT_OVERDUE_6_30,245999,100.0,float64
CREDIT_NUM_DEMAND,245997,100.0,float64
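Several engineered columns are missing for almost every applicant, so they carry very little signal in their current form. A minimal sketch for listing such columns by missing fraction follows; the helper name `high_missing_columns`, the 0.99 default threshold, and the toy frame are assumptions chosen for illustration, not part of the notebook. For count-style features, a missing value after the left join usually means "no matching record", so filling with 0 may be a reasonable alternative to dropping them.

```python
import pandas as pd

def high_missing_columns(df: pd.DataFrame, threshold: float = 0.99) -> pd.Series:
    """Return the missing-value fraction of columns at or above `threshold`."""
    frac = df.isna().mean()  # fraction of missing values per column
    return frac[frac >= threshold].sort_values(ascending=False)

# Toy data, made up for illustration.
nan = float("nan")
toy_features = pd.DataFrame({
    "mostly_missing": [1.0, nan, nan, nan],
    "all_missing": [nan, nan, nan, nan],
    "complete": [1, 2, 3, 4],
})
sparse = high_missing_columns(toy_features, threshold=0.75)
print(sparse)
```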


In [82]:
app_extra.head()

Unnamed: 0_level_0,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,F
LAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,BUREAU_NUM,BUREAU_ACTIVE_NUM,BUREAU_Closed_NUM,BUREAU_SOLD_NUM,BUREAU_BAD_DEBT_NUM,BUREAU_MINDAY_APPLICATION,BUREAU_MAXDAY_APPLICATION,BUREAU_NUM_OVERDUE,BUREAU_MAXDAY_OVERDUE,BUREAU_NUM_PREPAY,BUREAU_NUM_NORMAL,BUREAU_NUM_DELAY,BUREAU_MAXAMT_OVERDUE,BUREAU_PROLONG_NUM,BUREAU_LOAN_AMT,BUREAU_DEBT_AMT,BUREAU_LIMIT_MAX,BUREAU_SUM_OVERDUE,BUREAU_NUM_CONSUMER,BUREAU_NUM_CARD,BUREAU_NUM_CAR,BUREAU_NUM_MORTGAGE,BUREAU_NUM_MiCROLOAN,BUREAU_NUM_OTHER,BUREAU_AMT_CONSUMER,BUREAU_AMT_CARD,BUREAU_AMT_CAR,BUREAU_AMT_MORTGAGE,BUREAU_AMT_MiCROLOAN,BUREAU_AMT_OTHER,BUREAU_DEBTAMT_CONSUMER,BUREAU_DEBTAMT_CARD,BUREAU_DEBTAMT_CAR,BUREAU_DEBTAMT_MORTGAGE,BUREAU_DEBTAMT_MiCROLOAN,BUREAU_DEBTAMT_OTHER,BUREAU_OVERDUEAMT_CONSUMER,BUREAU_OVERDUEAMT_CARD,BUREAU_OVERDUEAMT_CAR,BUREAU_OVERDUEAMT_MORTGAGE,BUREAU_OVERDUEAMT_MiCROLOAN,BUREAU_OVERDUEAMT_OTHER,BUREAU_LAST_UPDATE,BUREAU_OVERDUE_1_0,BUREAU_OVERDUE_1_1,BUREAU_OVERDUE_1_2,BUREAU_OVERDUE_1_3,BUREAU_OVERDUE_1_4,BUREAU_OVERDUE_1_5,BUREAU_OVERDUE_3_0,BUREAU_OVERDUE_3_1,BUREAU_OVERDUE_3_2,BUREAU_OVERDUE_3_3,BUREAU_OVERDUE_3_4,BUREAU_OVERDUE_3_5,BUREAU_OVERDUE_6_0,BUREAU_OVERDUE_6_1,BUREAU_OVERDUE_6_2,BUREAU_OVERDUE_6_3,BUREAU_OVERDUE_6_4,BUREAU_OVERDUE_6_5,BUREAU_OVERDUE_12_0,BUREAU_OVERDUE_12_1,BUREAU_OVERDUE_12_2,BUREAU_OVERDUE_12_3,BUREAU_OVERDUE_12_4,BUREAU_OVERDUE_12_5,BUREAU_OVERDUE_24_0,BUREAU_OVERDUE_24_1,BUREAU_OVERDUE_24_2,BUREAU_OVERDUE_24_3,BUREAU_OVERDUE_24_4,BUREAU_OVERDUE_24_5,BUREAU_OVERDUE_36_0,BUREAU_OVERDUE_36_1,BUREAU_OVERDUE_36_2,BUREAU_OVERDUE_36_3,BUREAU_OVERDUE_36_4,BUREAU_OVERDUE_36_5,BUREAU_OVERDUE_36plus_0,BUREAU_OVERDUE_36plus_1,BUREAU_OVERDUE_36plus_2,BUREAU_OVERDUE_36plus_3,BUREAU_OVERDUE_36plus_4,BUREAU_OVERDUE_36plus_5,PRE_CREDIT_NUM,PRE_CREDIT_AMT,PRE_CREDIT_ANNUITY,PRE_CREDIT_POS_NUM,PRE_CREDIT_POS_AMT
,PRE_CREDIT_POS_ANNUITY,PRE_CREDIT_CASH_NUM,PRE_CREDIT_CASH_AMT,PRE_CREDIT_CASH_ANNUITY,PRE_CREDIT_XNA_NUM,PRE_CREDIT_XNA_AMT,PRE_CREDIT_XNA_ANNUITY,PRE_CREDIT_Cards_NUM,PRE_CREDIT_Cards_AMT,PRE_CREDIT_Cards_ANNUITY,PRE_CREDIT_Cars_NUM,PRE_CREDIT_Cars_AMT,PRE_CREDIT_Cars_ANNUITY,PRE_CREDIT_Approved_NUM,PRE_CREDIT_Approved_AMT,PRE_CREDIT_Approved_ANNUITY,PRE_CREDIT_Canceled_NUM,PRE_CREDIT_Canceled_AMT,PRE_CREDIT_Canceled_ANNUITY,PRE_CREDIT_Refused_NUM,PRE_CREDIT_Refused_AMT,PRE_CREDIT_Refused_ANNUITY,PRE_CREDIT_Unused_NUM,PRE_CREDIT_Unused_AMT,PRE_CREDIT_Unused_ANNUITY,PRE_HC_Refused_NUM,PRE_LIMIT_Refused_NUM,PRE_SCO_Refused_NUM,PRE_SCOFR_Refused_NUM,PRE_XNA_Refused_NUM,PRE_VERIF_Refused_NUM,PRE_SYSTEM_Refused_NUM,PRE_MAX_INTEREST_RATE,PRE_MIN_INTEREST_RATE,PRE_AVG_INTEREST_RATE,PRE_NUM_REPAY,PRE_REPAY_AMT,PRE_REPAY_ANNUITY,PRE_NUM_REAPY_OVERDUR,PRE_REAPY_OVERDUR_AMT,PRE_REAPY_OVERDUR_ANNUITY,POS_CREDIT_NUM,POS_FINISH_NUM,POS_REPAY_NUM,POS_DAYS_MAXOVERDUE_FINISH,POS_NUM_MAXOVERDUE_FINISH,POS_DAYS_MAXOVERDUE_REPAY,POS_NUM_MAXOVERDUE_REPAY,POS_OVERDUE_6_0,POS_OVERDUE_6_7,POS_OVERDUE_6_14,POS_OVERDUE_6_30,POS_OVERDUE_6_90,POS_OVERDUE_6_90plus,POS_OVERDUE_12_0,POS_OVERDUE_12_7,POS_OVERDUE_12_14,POS_OVERDUE_12_30,POS_OVERDUE_12_90,POS_OVERDUE_12_90plus,POS_OVERDUE_24_0,POS_OVERDUE_24_7,POS_OVERDUE_24_14,POS_OVERDUE_24_30,POS_OVERDUE_24_90,POS_OVERDUE_24_90plus,POS_OVERDUE_36_0,POS_OVERDUE_36_7,POS_OVERDUE_36_14,POS_OVERDUE_36_30,POS_OVERDUE_36_90,POS_OVERDUE_36_90plus,POS_OVERDUE_36plus_0,POS_OVERDUE_36plus_7,POS_OVERDUE_36plus_14,POS_OVERDUE_36plus_30,POS_OVERDUE_36plus_90,POS_OVERDUE_36plus_90plus,POS_NUM_OVERDUE_STILL,POS_DAYS_MAXOVERDUE_STILL,CREDIT_NUM,CREDIT_NUM_ACTIVE,CREDIT_NUM_COMPLETED,CREDIT_NUM_SIGNED,CREDIT_NUM_DEMAND,CREDIT_NUM_SENT,CREDIT_NUM_REFUSED,CREDIT_MONTHS_MAX,CREDIT_MONTHS_MIN,CREDIT_MONTHS_AVG,CREDIT_MONTHS_SUM,CREDIT_AMT_1m,CREDIT_AMT_3m,CREDIT_AMT_6m,CREDIT_AMT_12m,CREDIT_AMT_24m,CREDIT_AMT_36m,CREDIT_AMT_36mplus,CREDIT_AMT_1m_CURRENT,CREDIT_AMT
_3m_CURRENT,CREDIT_AMT_6m_CURRENT,CREDIT_AMT_12m_CURRENT,CREDIT_AMT_24m_CURRENT,CREDIT_AMT_36m_CURRENT,CREDIT_AMT_36mplus_CURRENT,CREDIT_AMT_1m_POS,CREDIT_AMT_3m_POS,CREDIT_AMT_6m_POS,CREDIT_AMT_12m_POS,CREDIT_AMT_24m_POS,CREDIT_AMT_36m_POS,CREDIT_AMT_36mplus_POS,CREDIT_AMT_1m_ATM,CREDIT_AMT_3m_ATM,CREDIT_AMT_6m_ATM,CREDIT_AMT_12m_ATM,CREDIT_AMT_24m_ATM,CREDIT_AMT_36m_ATM,CREDIT_AMT_36mplus_ATM,CREDIT_LIMIT_MAX,CREDIT_LIMIT_MIN,CREDIT_OVERDUE_6_0,CREDIT_OVERDUE_6_7,CREDIT_OVERDUE_6_14,CREDIT_OVERDUE_6_30,CREDIT_OVERDUE_6_90,CREDIT_OVERDUE_6_90plus,CREDIT_OVERDUE_12_0,CREDIT_OVERDUE_12_7,CREDIT_OVERDUE_12_14,CREDIT_OVERDUE_12_30,CREDIT_OVERDUE_12_90,CREDIT_OVERDUE_12_90plus,CREDIT_OVERDUE_24_0,CREDIT_OVERDUE_24_7,CREDIT_OVERDUE_24_14,CREDIT_OVERDUE_24_30,CREDIT_OVERDUE_24_90,CREDIT_OVERDUE_24_90plus,CREDIT_OVERDUE_36_0,CREDIT_OVERDUE_36_7,CREDIT_OVERDUE_36_14,CREDIT_OVERDUE_36_30,CREDIT_OVERDUE_36_90,CREDIT_OVERDUE_36_90plus,CREDIT_OVERDUE_36plus_0,CREDIT_OVERDUE_36plus_7,CREDIT_OVERDUE_36plus_14,CREDIT_OVERDUE_36plus_30,CREDIT_OVERDUE_36plus_90,CREDIT_OVERDUE_36plus_90plus,CREDIT_NUM_OVERDUE_STILL,CREDIT_DAYS_MAXOVERDUE_STILL
243191,0,Cash loans,F,Y,N,0,171000.0,555273.0,16366.5,463500.0,Unaccompanied,Pensioner,Secondary / secondary special,Widow,House / apartment,0.035797,-23349,365243,-3596.0,-4408,31.0,1,0,0,1,0,0,,1.0,2,2,TUESDAY,9,0,0,0,0,0,0,XNA,0.524902,0.358643,0.563965,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-2058.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0,8.0,4.0,4.0,,,-2902.0,-599.0,,,1.0,1.0,2.0,,0.0,1488066.0,517774.5,0.0,0.0,6.0,2.0,,,,,1110066.0,378000.0,,,,,269644.5,248130.0,,,,,0.0,0.0,,,,,-12.0,1.0,,,,,,3.0,,,,,,1.0,,,,,,1.0,,,,,,1.0,,,,,,2.0,,,,,,,,,,,,7.0,797179.5,43781.039062,4.0,150147.0,18192.058594,,,,,,,,,,,,,7.0,797179.5,43781.039062,,,,,,,,,,,,,,,,,0.724159,0.111004,0.449755,1.0,288000.0,12584.474609,,,,8.0,6.0,1.0,0.0,,0.0,,3.0,,,,,,3.0,,,,,,4.0,,,,,,3.0,,,,,,7.0,,,,,,,,1.0,1.0,,,,,,96.0,96.0,96.0,96.0,0.0,0.0,0.0,0.0,0.0,0.0,141536.34375,0.0,0.0,0.0,0.0,0.0,0.0,284400.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,284400.0,180000.0,0.0,1.0,,,,,,1.0,,,,,,1.0,,,,,,1.0,,,,,,,1.0,,,,,,
111778,0,Cash loans,M,N,Y,1,157500.0,198085.5,23638.5,171000.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.010033,-10921,-117,-4280.0,-3399,,1,1,1,1,1,0,Laborers,3.0,2,2,SATURDAY,7,0,0,0,0,0,0,Business Entity Type 2,0.244873,0.490234,0.595215,0.07843,0.063293,0.974121,0.646484,0.026596,0.0,0.137939,0.166748,0.208252,0.040894,0.062988,0.059387,0.003901,0.0149,0.079773,0.065674,0.974121,0.660156,0.026901,0.0,0.137939,0.166748,0.208252,0.041809,0.068909,0.06189,0.003901,0.015793,0.079102,0.063293,0.974121,0.650879,0.026794,0.0,0.137939,0.166748,0.208252,0.041595,0.064087,0.060486,0.003901,0.015297,reg oper account,block of flats,0.064514,"Stone, brick",No,1.0,0.0,1.0,0.0,-73.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,13.0,4.0,9.0,,,-2786.0,-292.0,,,5.0,2.0,2.0,,0.0,1289114.0,206473.5,0.0,0.0,10.0,3.0,,,,,1167614.0,121500.0,,,,,206473.5,0.0,,,,,0.0,0.0,,,,,-8.0,,,,,,,1.0,,,,,,2.0,,,,,,2.0,,,,,,1.0,,,,,,2.0,,,,,,1.0,,,,,,4.0,365175.0,51446.878906,4.0,365175.0,51446.878906,,,,,,,,,,,,,3.0,307647.0,41241.421875,,,,1.0,57528.0,10205.459961,,,,,,1.0,,,,,0.380999,0.064399,0.178889,,,,,,,3.0,3.0,,10.0,1.0,,,,,,,,,,,,,,,1.0,,,,,,2.0,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
175057,1,Cash loans,M,Y,Y,0,135000.0,776304.0,25173.0,648000.0,Unaccompanied,Working,Lower secondary,Civil marriage,House / apartment,0.035797,-23213,-2157,-5680.0,-5009,8.0,1,1,0,1,0,0,Drivers,2.0,2,2,FRIDAY,13,0,0,0,0,0,0,Self-employed,,0.643555,0.706055,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-1959.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,5.0,4.0,,4.0,,,-1707.0,-1025.0,,,3.0,,1.0,,0.0,126180.2,0.0,0.0,0.0,4.0,,,,,,126180.2,,,,,,0.0,,,,,,0.0,,,,,,-691.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,18.0,1452433.5,105317.867188,13.0,483871.5,57101.671875,,,,3.0,0.0,0.0,,,,,,,13.0,1393605.0,98264.296875,3.0,0.0,0.0,2.0,58828.5,7053.570312,,,,1.0,1.0,,,,,,0.946347,0.087003,0.329275,1.0,522000.0,23638.769531,,,,15.0,14.0,1.0,0.0,,0.0,,3.0,,,,,,3.0,,,,,,3.0,,,,,,2.0,,,,,,10.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
372147,0,Cash loans,M,Y,Y,1,164133.0,900000.0,36787.5,900000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.030762,-10703,-2530,-2618.0,-2751,15.0,1,1,1,1,1,0,High skill tech staff,3.0,2,2,TUESDAY,10,0,0,0,0,1,1,Trade: type 3,0.288574,0.426514,0.506348,0.149536,0.113586,0.983887,0.782227,0.094177,0.160034,0.137939,0.333252,0.041687,0.037415,0.120972,0.091675,0.003901,0.236816,0.152344,0.11792,0.983887,0.791016,0.095093,0.161133,0.137939,0.333252,0.041687,0.038208,0.132202,0.09552,0.003901,0.250732,0.150879,0.113586,0.983887,0.785156,0.094788,0.160034,0.137939,0.333252,0.041687,0.037994,0.123108,0.093323,0.003901,0.241821,reg oper account,terraced house,0.122192,Panel,No,0.0,0.0,0.0,0.0,-531.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,3.0,8.0,3.0,5.0,,,-1642.0,-382.0,,,1.0,2.0,2.0,,0.0,2263310.0,1641505.5,0.0,0.0,7.0,1.0,,,,,1939310.0,324000.0,,,,,1401264.0,240241.5,,,,,0.0,0.0,,,,,-15.0,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,1.0,,,,,,,,,,,,15.0,2654490.5,123697.710938,8.0,757435.5,63452.609375,,,,5.0,97055.101562,0.0,,,,,,,7.0,718033.5,58078.167969,4.0,0.0,0.0,3.0,1839402.0,65619.539062,1.0,97055.101562,0.0,2.0,1.0,,,,,,0.659667,0.081199,0.326188,2.0,261630.0,13712.30957,,,,7.0,4.0,,0.0,,,,2.0,,,,,,2.0,,,,,,3.0,,,,,,1.0,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
373412,0,Cash loans,M,N,Y,0,225000.0,533668.5,21294.0,477000.0,"Spouse, partner",Commercial associate,Secondary / secondary special,Married,House / apartment,0.025162,-15798,-3520,-8008.0,-5001,,1,1,0,1,0,0,Laborers,2.0,2,2,SATURDAY,12,0,0,0,0,0,0,Industry: type 11,0.790039,0.445801,0.52832,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,6.0,1.0,6.0,0.0,-9.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,4.0,8.0,4.0,4.0,,,-1653.0,-43.0,,,4.0,,,,0.0,5663809.0,3267585.0,0.0,0.0,6.0,2.0,,,,,5483809.0,180000.0,,,,,3267585.0,0.0,,,,,0.0,0.0,,,,,-19.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,317844.0,26319.554688,1.0,205344.0,20694.554688,,,,1.0,0.0,0.0,1.0,112500.0,5625.0,,,,2.0,317844.0,26319.554688,1.0,0.0,0.0,,,,,,,,,,,,,,0.209359,0.209359,0.209359,1.0,0.0,5625.0,1.0,0.0,5625.0,1.0,1.0,,0.0,,,,,,,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,1.0,1.0,,,,,,93.0,93.0,93.0,93.0,0.0,0.0,0.0,0.0,0.0,0.0,46445.023438,0.0,0.0,0.0,0.0,0.0,0.0,139500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,139500.0,112500.0,0.0,1.0,,,,,,1.0,,,,,,1.0,,,,,,1.0,,,,,,,1.0,,,,,,
