### Task: predicting payment of issued fines.
The administration of a certain city is trying to combat poor upkeep of residential properties by their owners by issuing fines. Every year the city issues millions of dollars in fines, but many of them remain unpaid. Because enforcing payment of a fine is a long and expensive process, the city would like to improve how fines are collected.
A first step towards this improvement is to understand why a property owner might refuse to pay a fine. A predictive model of fine payment can help answer this question. 
Task: build a model that predicts whether a fine will be paid on time. 
The data comes in two files: train_2.csv for training the model and test_2.csv for validating it. Each record in these files corresponds to one issued fine and describes when, to whom and for what the fine was issued. The target variable compliance is present only in the training set. It is True if the fine was paid on time, i.e. within a month of being issued, False if it was paid after that deadline or not paid at all, and Null if the fine was issued in error.  
Note: fines issued in error should not be used to train the model. They are included in the training set only as an additional source of data for visualization or for clustering the raw data (if needed). 

File descriptions:
* train_2.csv – training set (all fines issued in 2004-2011)
* test_2.csv – test set (all fines issued in 2012-2016)
* addresses.csv & latlons.csv – lookup tables of addresses and geographic coordinates. 
 
Data file fields:
* train_2.csv & test_2.csv
  * ticket_id – unique identifier of the fine
  * agency_name – agency that issued the fine
  * inspector_name – name of the inspector who issued the fine
  * violator_name – name of the person / organization that was fined
  * violation_street_number, violation_street_name, violation_zip_code – address where the violation occurred
  * mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country – mailing address of the violator
  * ticket_issued_date – date and time the fine was issued
  * hearing_date – date of the scheduled court hearing
  * violation_code, violation_description – type of violation
  * disposition – type of ruling on the fine
  * fine_amount – fine amount excluding fees
  * admin_fee – administrative fee
  * state_fee – state fee
  * late_fee – additional (late) fee
  * discount_amount – discount
  * clean_up_cost – cost of repairs or graffiti removal
  * judgment_amount – total amount due
  * grafitti_status – flag indicating the presence of graffiti
* additional fields in train_2.csv
  * payment_amount – amount paid
  * payment_date – date of payment
  * payment_status – payment status as of 01/02/2017
  * balance_due – outstanding balance of the fine
  * collection_status – flag for payments that are in collections
  * compliance [target variable] 
    * Null = not responsible, nothing to pay
    * 0 = Responsible, not paid
    * 1 = Responsible, paid
  * compliance_detail – short description of the target variable status


In [44]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import ADASYN
from sklearn.utils import shuffle
from sklearn.model_selection import cross_val_score
%matplotlib inline

*Load the source datasets*

In [45]:
pd.set_option('display.max_columns', None)
train = pd.read_csv(r'C:\datasets\Сбербанк\train_2.csv', low_memory=False)
test = pd.read_csv(r'C:\datasets\Сбербанк\test_2.csv', low_memory=False)
latlon = pd.read_csv(r'C:\datasets\Сбербанк\latlons.csv', low_memory=False)
addresses = pd.read_csv(r'C:\datasets\Сбербанк\addresses.csv', low_memory=False)

***
***

## Data analysis and preprocessing

*Take a first look at the data*

In [46]:
train.head(3)

Unnamed: 0.1,Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,state,zip_code,non_us_str_code,country,ticket_issued_date,hearing_date,violation_code,violation_description,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,IL,60606,,USA,2004-03-16 11:40:00,2005-03-21 10:30:00,9-1-36(a),Failure of owner to obtain certificate of comp...,Responsible by Default,250.0,20.0,10.0,25.0,0.0,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,MI,48208,,USA,2004-04-23 12:30:00,2005-05-06 13:30:00,61-63.0600,Failed To Secure Permit For Lawful Use Of Buil...,Responsible by Determination,750.0,20.0,10.0,75.0,0.0,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
2,2,22062,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","SANDERS, DERRON",1449.0,LONGFELLOW,,23658.0,P.O. BOX,DETROIT,MI,48223,,USA,2004-04-26 13:40:00,2005-03-29 10:30:00,9-1-36(a),Failure of owner to obtain certificate of comp...,Not responsible by Dismissal,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,


In [47]:
test.head(3)

Unnamed: 0.1,Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,state,zip_code,non_us_str_code,country,ticket_issued_date,hearing_date,violation_code,violation_description,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status
0,225001,259669,"Buildings, Safety Engineering & Env Department","Sloane, Bennie J","Jobczyk, Richard",5502.0,CHOPIN,,556.0,Chopin,Detroit,MI,48210,,USA,2010-08-27 08:50:00,2010-10-04 09:00:00,9-1-36(a),Failure of owner to obtain certificate of comp...,Responsible by Default,250.0,20.0,10.0,25.0,0.0,0.0,305.0,
1,225002,259733,"Buildings, Safety Engineering & Env Department","Sloane, Bennie J","Jobczyk, Richard",5502.0,CHOPIN,,556.0,Chopin,Detroit,MI,48210,,USA,2010-08-27 09:00:00,2010-10-04 09:00:00,9-1-81(a),Failure to obtain certificate of registration ...,Responsible by Default,250.0,20.0,10.0,25.0,0.0,0.0,305.0,
2,225003,258776,"Buildings, Safety Engineering & Env Department","Addison, Michael","REAL ESTATE, S & N",7661.0,VERNOR,,319.0,WATERFALL,TROY,MI,48083,,USA,2010-08-27 13:00:00,2011-03-22 09:00:00,9-1-36(a),Failure of owner to obtain certificate of comp...,Responsible by Default,250.0,20.0,10.0,25.0,0.0,0.0,305.0,


*We can see straight away that there is an uninformative feature, Unnamed: 0. Let's drop it*

In [48]:
train = train.drop('Unnamed: 0', axis=1)
test = test.drop('Unnamed: 0', axis=1)

*Check the size of the data*

In [49]:
print("Train size: {}".format(train.shape))
print("Test size: {}".format(test.shape))

Train size: (225000, 34)
Test size: (25305, 27)


***

*Let's see which features are present in the training set but missing from the test set*

In [50]:
columns_outside_test = set(train) - set(test) - {'compliance'}
columns_outside_test

{'balance_due',
 'collection_status',
 'compliance_detail',
 'payment_amount',
 'payment_date',
 'payment_status'}

*It makes sense to drop these features from the training set, since they are absent from the test set, and keep only the target*

In [51]:
total_columns = set(train) - columns_outside_test

*Update the training set*

In [52]:
train = train[list(total_columns)]  # recent pandas versions reject a set as an indexer
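
A set has no defined order, so selecting columns through one can shuffle them, which is why the listing of missing values below is not in the original column order. A minimal sketch of an order-preserving alternative on a tiny made-up frame (`train_demo` and `test_demo` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up)
train_demo = pd.DataFrame({
    'ticket_id': [1, 2],
    'judgment_amount': [305.0, 855.0],
    'payment_amount': [0.0, 780.0],
    'compliance': [0.0, 1.0],
})
test_demo = pd.DataFrame({'ticket_id': [3], 'judgment_amount': [305.0]})

# Columns that exist only in train, except the target
train_only = set(train_demo.columns) - set(test_demo.columns) - {'compliance'}

# drop() keeps the remaining columns in their original order
train_demo = train_demo.drop(columns=sorted(train_only))
print(list(train_demo.columns))
```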

*Look at the number of missing values in the data*

In [53]:
train.isna().sum()

mailing_address_str_number      2895
inspector_name                     0
compliance                     80472
violation_description              0
state_fee                          0
judgment_amount                    0
violator_name                     31
mailing_address_str_name           4
agency_name                        0
zip_code                           1
violation_street_name              0
hearing_date                   10896
discount_amount                    0
ticket_id                          0
disposition                        0
grafitti_status               225000
city                               0
violation_code                     0
country                            0
fine_amount                        1
violation_zip_code            225000
ticket_issued_date                 0
state                             21
late_fee                           0
clean_up_cost                      0
violation_street_number            0
admin_fee                          0
n

In [54]:
test.isna().sum()

ticket_id                         0
agency_name                       0
inspector_name                    0
violator_name                     3
violation_street_number           0
violation_street_name             0
violation_zip_code            25305
mailing_address_str_number      707
mailing_address_str_name          0
city                              0
state                            72
zip_code                          0
non_us_str_code               25303
country                           0
ticket_issued_date                0
hearing_date                   1595
violation_code                    0
violation_description             0
disposition                       0
fine_amount                       0
admin_fee                         0
state_fee                         0
late_fee                          0
discount_amount                   0
clean_up_cost                     0
judgment_amount                   0
grafitti_status               25304
dtype: int64

*We can see that three features have no data at all, or almost none: grafitti_status, non_us_str_code and violation_zip_code. Let's drop them*

In [55]:
train = train.drop(['grafitti_status', 'non_us_str_code', 'violation_zip_code'], axis=1)
test = test.drop(['grafitti_status', 'non_us_str_code', 'violation_zip_code'], axis=1)

*Since the graffiti feature is absent from the data, clean_up_cost is pointless as well. Let's drop it.*

In [56]:
train = train.drop('clean_up_cost', axis=1)
test = test.drop('clean_up_cost', axis=1)

*The next thing we notice is a pair of features that duplicate each other: violation_code and violation_description. Let's drop the text description of the violation.*

In [57]:
train = train.drop('violation_description', axis=1)
test = test.drop('violation_description', axis=1)

*Next, the country feature looks suspect: it may be uninformative, since the events all take place within the USA*

In [58]:
print("Train:\n{}\n\nTest:\n{}".format(train.country.value_counts(normalize=True), 
                                     test.country.value_counts(normalize=True)))

Train:
USA     0.999960
Cana    0.000022
Egyp    0.000009
Aust    0.000009
Name: country, dtype: float64

Test:
USA     0.999842
Cana    0.000079
Aust    0.000040
Germ    0.000040
Name: country, dtype: float64


*This feature is dominated by a single value and carries no information, so we drop it*

In [59]:
train = train.drop('country', axis=1)
test = test.drop('country', axis=1)

*Drop the records whose target variable compliance is NaN, since we cannot train on them*

In [60]:
train = train.dropna(subset=['compliance'])

*The ticket_id feature is better used as the index*

In [61]:
train.set_index('ticket_id', inplace=True)
test.set_index('ticket_id', inplace=True)

*Analyzing the numeric features, I concluded that some of them are redundant because they can be derived from one another.*

*Namely:*
 * *admin_fee, late_fee, state_fee and fine_amount add up to judgment_amount (the total amount owed, including all additional fees), less the discount*
 * *We keep the discount amount, since the presence of a discount may be an important factor in whether a fine is paid*

*We drop admin_fee, late_fee, state_fee and fine_amount from the data because they are* **multicollinear**
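
The relationship can be checked on the two tickets from the `train.head(3)` preview that have a target value. This is only a sketch on those two rows, not a check of the whole dataset; the dismissed ticket in the same preview has judgment_amount 0 and no target, so it is left out.

```python
import pandas as pd

# Two rows copied from the train.head(3) preview (tickets 22056 and 27586)
rows = pd.DataFrame({
    'fine_amount': [250.0, 750.0],
    'admin_fee': [20.0, 20.0],
    'state_fee': [10.0, 10.0],
    'late_fee': [25.0, 75.0],
    'discount_amount': [0.0, 0.0],
    'judgment_amount': [305.0, 855.0],
}, index=pd.Index([22056, 27586], name='ticket_id'))

# Sum the fine and the fees, then subtract the discount
parts = ['fine_amount', 'admin_fee', 'state_fee', 'late_fee']
reconstructed = rows[parts].sum(axis=1) - rows['discount_amount']
matches = reconstructed.eq(rows['judgment_amount'])
print(matches)
```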

In [62]:
train = train.drop(['admin_fee', 'late_fee', 'state_fee', 'fine_amount'], axis=1)
test = test.drop(['admin_fee', 'late_fee', 'state_fee', 'fine_amount'], axis=1)

*Check how many duplicates we have*

In [63]:
train.duplicated().sum(), test.duplicated().sum()

(1044, 270)

*Remove the duplicates from the training data*

In [64]:
train.drop_duplicates(inplace=True)
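
Because ticket_id is now the index, `duplicated()` compares only the remaining columns, so two different tickets with identical feature values count as duplicates. A small illustration on made-up data:

```python
import pandas as pd

# Two different tickets with identical feature values (made-up data)
demo = pd.DataFrame(
    {'judgment_amount': [305.0, 305.0, 855.0], 'compliance': [0.0, 0.0, 1.0]},
    index=pd.Index([101, 102, 103], name='ticket_id'),
)

# duplicated() ignores the index, so ticket 102 counts as a duplicate of 101
by_columns = int(demo.duplicated().sum())

# With ticket_id back as a column, every row is unique again
with_id = int(demo.reset_index().duplicated().sum())
print(by_columns, with_id)
```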

***

*Now let's preprocess the categorical variables*

*The violation_code feature holds the article of the violated ordinance, with various values in parentheses (special cases). We only need the general article, so we strip every suffix that may denote a subsection of the article.*
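
The effect of this normalization can be seen on a few sample codes. The helper `general_article` is a hypothetical name used only for this sketch; it follows the same rules as the cell below but works on a single string.

```python
import pandas as pd

def general_article(code: str) -> str:
    """Strip sub-clause suffixes; return '' when the result does not look like an article."""
    for sep in ('(', '/', ' '):
        code = code.split(sep)[0]
    # A valid article has '-' somewhere after the first character
    return code if code.find('-') > 0 else ''

# The first three codes appear in the previews above; '20130901' is a made-up malformed code
codes = pd.Series(['9-1-36(a)', '9-1-81(a)', '61-63.0600', '20130901'])
normalized = codes.map(general_article).tolist()
print(normalized)
```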

In [65]:
def delete_special_case(data):
    # Keep only the general article: cut everything after '(', '/' or a space
    data['violation_code'] = data['violation_code'].apply(lambda x: x.split('(')[0])
    data['violation_code'] = data['violation_code'].apply(lambda x: x.split('/')[0])
    data['violation_code'] = data['violation_code'].apply(lambda x: x.split(' ')[0])
    # Blank out codes where '-' is missing or is the first character;
    # .loc assigns in place and avoids the chained-assignment warning
    data.loc[data['violation_code'].apply(lambda x: x.find('-') <= 0), 'violation_code'] = ''

delete_special_case(train)
delete_special_case(test)


*Next we drop features that are specific to this particular locality, may change over time and carry no particular weight: proper names, the names of the agencies that issued the fines, the location of the violation as an address, and the violator's postal code. These are too specific to generalize.*

In [66]:
specific_columns = ['violator_name', 'city', 'zip_code', 'mailing_address_str_name', 'state', 'inspector_name', 'violation_street_number', 'agency_name', 'mailing_address_str_number', 'violation_street_name']
train = train.drop(specific_columns, axis=1)
test = test.drop(specific_columns, axis=1)

*Articles with fewer than 100 fines are merged into a single category, Other*
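
A sketch of the same idea on made-up codes, with the threshold lowered from 100 to 2 so that the toy data shows the effect. Note one deliberate difference from the cell below: this variant also sends codes that never occur in the training set to Other.

```python
import pandas as pd

# Made-up codes; min_count stands in for the threshold of 100 used in the notebook
train_codes = pd.Series(['9-1-36', '9-1-36', '9-1-81', '9-1-81', '22-2-88'])
test_codes = pd.Series(['9-1-36', '22-2-88', '9-1-104'])
min_count = 2

counts = train_codes.value_counts()
frequent = counts[counts >= min_count].index

# where() keeps values that satisfy the condition and replaces the rest
train_grouped = train_codes.where(train_codes.isin(frequent), 'Other')
test_grouped = test_codes.where(test_codes.isin(frequent), 'Other')
print(train_grouped.tolist(), test_grouped.tolist())
```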

In [67]:
counts = train['violation_code'].value_counts()
rare_codes = counts[counts < 100].index
# .loc assigns in place and avoids the chained-assignment warning
train.loc[train['violation_code'].isin(rare_codes), 'violation_code'] = 'Other'
test.loc[test['violation_code'].isin(rare_codes), 'violation_code'] = 'Other'


***

*Now let's process the date columns. We add a new feature, the difference between the date the fine was issued and the date of the court hearing, but first we convert the columns to datetime*

In [68]:
train.ticket_issued_date = pd.to_datetime(train.ticket_issued_date)
test.ticket_issued_date = pd.to_datetime(test.ticket_issued_date)
train.hearing_date = pd.to_datetime(train.hearing_date)
test.hearing_date = pd.to_datetime(test.hearing_date)

In [69]:
train['timedelta'] = (train.hearing_date - train.ticket_issued_date).dt.days
test['timedelta'] = (test.hearing_date - test.ticket_issued_date).dt.days

*Now extract the month, the day of the month and the day of the week from the dates. We add 1 to the day of the week so that it never equals 0. Where no court hearing has been scheduled we fill in -1, so that the model can tell a missing value apart from a genuine 0.*

In [70]:
train['issued_day'] = train.ticket_issued_date.dt.day
train['issued_month'] = train.ticket_issued_date.dt.month
train['issued_weekday'] = train.ticket_issued_date.dt.weekday+1
train['hearing_day'] = train.hearing_date.dt.day
train['hearing_month'] = train.hearing_date.dt.month
train['hearing_weekday'] = train.hearing_date.dt.weekday+1

test['issued_day'] = test.ticket_issued_date.dt.day
test['issued_month'] = test.ticket_issued_date.dt.month
test['issued_weekday'] = test.ticket_issued_date.dt.weekday+1
test['hearing_day'] = test.hearing_date.dt.day
test['hearing_month'] = test.hearing_date.dt.month
test['hearing_weekday'] = test.hearing_date.dt.weekday+1

train = train.drop(['ticket_issued_date', 'hearing_date'], axis=1)
train = train.fillna(-1)

test = test.drop(['ticket_issued_date', 'hearing_date'], axis=1)
test = test.fillna(-1)
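
The date features can be traced by hand on one ticket. In this sketch the first row uses the dates of ticket 22056 from the preview above, and the second row is a made-up ticket with no hearing, to show where the -1 comes from.

```python
import pandas as pd

demo = pd.DataFrame({
    # Row 0: ticket 22056 from the preview; row 1: made-up ticket without a hearing
    'ticket_issued_date': ['2004-03-16 11:40:00', '2004-04-26 13:40:00'],
    'hearing_date': ['2005-03-21 10:30:00', None],
})
demo['ticket_issued_date'] = pd.to_datetime(demo['ticket_issued_date'])
demo['hearing_date'] = pd.to_datetime(demo['hearing_date'])

feats = pd.DataFrame({
    # Whole days between issue and hearing; NaN when the hearing is missing
    'timedelta': (demo['hearing_date'] - demo['ticket_issued_date']).dt.days,
    # pandas counts Monday as 0, so +1 gives 1..7
    'issued_weekday': demo['ticket_issued_date'].dt.weekday + 1,
    'hearing_weekday': demo['hearing_date'].dt.weekday + 1,
}).fillna(-1)
print(feats)
```

The first row reproduces the values shown for ticket 22056 in the processed training table: 369 days, issued on weekday 2, hearing on weekday 1.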

*The disposition feature has few unique values, so we one-hot encode it*

In [71]:
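# In scikit-learn 1.2+ this parameter is called sparse_output
encoder = OneHotEncoder(sparse=False)

disposition_one_hot_train = train['disposition'].values.reshape(-1, 1)
disposition_one_hot_test = test['disposition'].values.reshape(-1, 1)

# Fit once on the values of both sets: a second fit() call replaces the first
# instead of extending it, so two consecutive fits would learn test categories only
encoder.fit(np.concatenate([disposition_one_hot_train, disposition_one_hot_test]))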

train_disposition = pd.DataFrame(encoder.transform(disposition_one_hot_train), 
                                 columns=encoder.categories_, 
                                 index=train.index)
train = train.drop('disposition', axis=1)

test_disposition = pd.DataFrame(encoder.transform(disposition_one_hot_test), 
                                columns=encoder.categories_, 
                                index=test.index)
test = test.drop('disposition', axis=1)

In [72]:
train = train.merge(train_disposition, left_index=True, right_index=True)
test = test.merge(test_disposition, left_index=True, right_index=True)
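
For comparison, here is a pandas-only sketch of one-hot encoding that takes its columns from the training set alone. This is an alternative, not what the notebook does: categories seen only in the test set are dropped here, whereas the encoder above gives them their own columns. The sample values are made up, though the category names come from the tables in this notebook.

```python
import pandas as pd

# Made-up samples using disposition names that occur in the data
train_disp = pd.Series(['Responsible by Default', 'Responsible by Admission'], name='disposition')
test_disp = pd.Series(['Responsible by Default', 'PENDING JUDGMENT'], name='disposition')

train_oh = pd.get_dummies(train_disp, dtype=float)
# Align test to the training columns: unseen categories are dropped,
# categories missing from test become all-zero columns
test_oh = pd.get_dummies(test_disp, dtype=float).reindex(columns=train_oh.columns, fill_value=0.0)
print(test_oh)
```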

*Let's write a function that encodes a category by its relative frequency in the training set and apply it to violation_code, since that feature has many unique values*

In [73]:
# Share of training rows for each violation code
violation_code_dict = train['violation_code'].value_counts(normalize=True).to_dict()

In [74]:
def violation_code_categorizer(row):
    # Codes that were not seen in training fall back to the frequency of 'Other'
    try:
        frequency = violation_code_dict[row['violation_code']]
    except KeyError:
        frequency = violation_code_dict['Other']
    return frequency

In [75]:
test['violation_code'] = test.apply(violation_code_categorizer, axis=1)
train['violation_code'] = train.apply(violation_code_categorizer, axis=1)
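
The same encoding can be written without a row-wise `apply`. This is a sketch on made-up codes, assuming rare codes have already been merged into 'Other':

```python
import pandas as pd

# Made-up codes after rare ones have been merged into 'Other'
train_codes = pd.Series(['9-1-36', '9-1-36', '9-1-81', 'Other'])
test_codes = pd.Series(['9-1-81', '61-63.0600'])

# Share of training rows per code
freq = train_codes.value_counts(normalize=True)

# map() looks each code up in freq; codes never seen in training
# become NaN and then fall back to the share of 'Other'
train_enc = train_codes.map(freq)
test_enc = test_codes.map(freq).fillna(freq['Other'])
print(train_enc.tolist(), test_enc.tolist())
```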

***

*Now let's work a little with the supplementary data*

*Suppose the probability that a fine is paid depends in some way on the neighborhood. Let's cluster the coordinates of the violation addresses.*

*I wanted to use DBSCAN, but my desktop computer does not have enough RAM to run that clustering, so I use KMeans instead, splitting the city into 100 districts.*

In [76]:
latlon = latlon.dropna()
#lat_sc = latlon[['lat', 'lon']]
#db = KMeans(n_clusters=100)
#clusters = db.fit_predict(lat_sc)
#latlon = latlon.merge(pd.DataFrame(clusters, index=latlon.index), left_index=True, right_index=True)
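
A minimal sketch of the KMeans step on synthetic coordinates. The three centres, the point spread and the column name `district` are assumptions made for the example; three clusters suit the toy data, while the commented-out cell above asks for 100.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic coordinates around three made-up centres (not real addresses)
rng = np.random.default_rng(0)
centres = np.array([[42.33, -83.05], [42.40, -83.20], [42.45, -82.95]])
points = np.vstack([c + rng.normal(scale=0.005, size=(50, 2)) for c in centres])
coords = pd.DataFrame(points, columns=['lat', 'lon'])

# Assign each point to one of three districts
km = KMeans(n_clusters=3, n_init=10, random_state=0)
coords['district'] = km.fit_predict(coords[['lat', 'lon']])
print(coords['district'].value_counts())
```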

*Add the clustering result to the training and test datasets*

In [77]:
total_ad = latlon.merge(addresses)
total_ad.set_index('ticket_id', inplace=True)
train = train.merge(total_ad, left_index=True, right_index=True)
test = test.merge(total_ad, left_index=True, right_index=True)
train = train.drop(['address', 'lat', 'lon'], axis=1)
test = test.drop(['address', 'lat', 'lon'], axis=1)
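
`merge` defaults to an inner join, so tickets without a row in the coordinate table drop out of the result. A small sketch on made-up tickets shows the difference from a left join:

```python
import pandas as pd

# Made-up tickets; ticket 103 has no coordinates in the lookup table
tickets = pd.DataFrame({'judgment_amount': [305.0, 855.0, 140.0]},
                       index=pd.Index([101, 102, 103], name='ticket_id'))
coords = pd.DataFrame({'lat': [42.33, 42.40], 'lon': [-83.05, -83.20]},
                      index=pd.Index([101, 102], name='ticket_id'))

# Default how='inner': ticket 103 silently disappears
inner = tickets.merge(coords, left_index=True, right_index=True)

# how='left' keeps every ticket and leaves NaN where coordinates are missing
left = tickets.merge(coords, left_index=True, right_index=True, how='left')
print(len(inner), len(left), int(left['lat'].isna().sum()))
```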

In [81]:
train

Unnamed: 0_level_0,compliance,judgment_amount,discount_amount,violation_code,timedelta,issued_day,issued_month,issued_weekday,hearing_day,hearing_month,hearing_weekday,"(Not responsible by City Dismissal,)","(Not responsible by Determination,)","(Not responsible by Dismissal,)","(PENDING JUDGMENT,)","(Responsible (Fine Waived) by Deter,)","(Responsible by Admission,)","(Responsible by Default,)","(Responsible by Determination,)","(SET-ASIDE (PENDING JUDGMENT),)"
ticket_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
22056,0.0,305.0,0.0,0.420848,369.0,16,3,2,21.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
27586,1.0,855.0,0.0,0.010545,378.0,23,4,5,6.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
22046,0.0,305.0,0.0,0.420848,323.0,1,5,6,21.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
18738,0.0,855.0,0.0,0.010545,253.0,14,6,1,22.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
18735,0.0,140.0,0.0,0.010545,251.0,16,6,3,22.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
260885,0.0,305.0,0.0,0.152435,55.0,27,8,5,21.0,10.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
260886,0.0,305.0,0.0,0.420848,55.0,27,8,5,21.0,10.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
258588,0.0,1130.0,0.0,0.144371,69.0,27,8,5,5.0,11.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
258590,0.0,85.0,0.0,0.095739,83.0,27,8,5,19.0,11.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [1530]:
train = shuffle(train)

In [1531]:
train.duplicated().sum(), test.duplicated().sum()

(28629, 750)

In [1532]:
train.drop_duplicates(inplace=True)

*Split the data into features and target*

In [1533]:
X = train.drop('compliance', axis=1)
y = train['compliance']

*The classes in our data are imbalanced. We fix this by adaptively synthesizing additional training samples (ADASYN) with the imblearn library*

In [1534]:
sm = ADASYN(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

In [1535]:
X_res, y_res = shuffle(X_res, y_res)
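
To give a feel for what synthetic oversampling does, here is a simplified NumPy sketch of the interpolation step. It is not ADASYN itself: ADASYN additionally generates more points around minority samples that have many majority-class neighbors, and that weighting is omitted here. The function name and the data are made up for illustration.

```python
import numpy as np

def interpolate_minority(X_min, n_new, rng):
    """Return n_new synthetic points, each on the segment between a random
    minority sample and its nearest minority neighbor."""
    # Pairwise squared distances between minority samples
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    nearest = d.argmin(axis=1)
    base = rng.integers(0, len(X_min), size=n_new)
    lam = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return X_min[base] + lam * (X_min[nearest[base]] - X_min[base])

rng = np.random.default_rng(42)
X_minority = rng.normal(size=(20, 3))  # made-up minority-class features
X_new = interpolate_minority(X_minority, n_new=80, rng=rng)
print(X_new.shape)
```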

***

*Let's build the model*

*We use CatBoost for the model*

In [1536]:
model = CatBoostClassifier()

In [1537]:
model.fit(X_res, y_res)

Learning rate set to 0.092543
0:	learn: 0.5904738	total: 69.7ms	remaining: 1m 9s
1:	learn: 0.5451132	total: 126ms	remaining: 1m 2s
2:	learn: 0.5100457	total: 184ms	remaining: 1m 1s
3:	learn: 0.4888548	total: 246ms	remaining: 1m 1s
4:	learn: 0.4483038	total: 312ms	remaining: 1m 2s
5:	learn: 0.4276473	total: 373ms	remaining: 1m 1s
6:	learn: 0.4042890	total: 441ms	remaining: 1m 2s
7:	learn: 0.3838036	total: 500ms	remaining: 1m 1s
8:	learn: 0.3710300	total: 558ms	remaining: 1m 1s
9:	learn: 0.3604119	total: 631ms	remaining: 1m 2s
10:	learn: 0.3516529	total: 692ms	remaining: 1m 2s
11:	learn: 0.3452638	total: 754ms	remaining: 1m 2s
12:	learn: 0.3348319	total: 811ms	remaining: 1m 1s
13:	learn: 0.3262238	total: 901ms	remaining: 1m 3s
14:	learn: 0.3183315	total: 960ms	remaining: 1m 3s
15:	learn: 0.3137913	total: 1.02s	remaining: 1m 2s
16:	learn: 0.3102617	total: 1.09s	remaining: 1m 3s
17:	learn: 0.3012827	total: 1.15s	remaining: 1m 2s
18:	learn: 0.2923471	total: 1.21s	remaining: 1m 2s
19:	learn:

<catboost.core.CatBoostClassifier at 0x17b734a08>

In [1455]:
scores = cross_val_score(model, X_res, y_res, cv=5, scoring='roc_auc')

Learning rate set to 0.101499
0:	learn: 0.6021770	total: 77.6ms	remaining: 1m 17s
1:	learn: 0.5490701	total: 144ms	remaining: 1m 11s
2:	learn: 0.4766812	total: 212ms	remaining: 1m 10s
3:	learn: 0.4453305	total: 291ms	remaining: 1m 12s
4:	learn: 0.4244098	total: 357ms	remaining: 1m 10s
5:	learn: 0.4020233	total: 425ms	remaining: 1m 10s
6:	learn: 0.3854852	total: 506ms	remaining: 1m 11s
7:	learn: 0.3667159	total: 577ms	remaining: 1m 11s
8:	learn: 0.3544144	total: 650ms	remaining: 1m 11s
9:	learn: 0.3456064	total: 730ms	remaining: 1m 12s
10:	learn: 0.3221557	total: 808ms	remaining: 1m 12s
11:	learn: 0.3161338	total: 879ms	remaining: 1m 12s
12:	learn: 0.3076154	total: 959ms	remaining: 1m 12s
13:	learn: 0.2984750	total: 1.02s	remaining: 1m 12s
14:	learn: 0.2931726	total: 1.09s	remaining: 1m 11s
15:	learn: 0.2899667	total: 1.17s	remaining: 1m 12s
16:	learn: 0.2859503	total: 1.24s	remaining: 1m 11s
17:	learn: 0.2774414	total: 1.32s	remaining: 1m 11s
18:	learn: 0.2738307	total: 1.39s	remaining

In [1457]:
scores.mean()

0.9857921066291574

In [1538]:
roc_auc_score(y_res, model.predict_proba(X_res)[:, 1])

0.9852914570987755
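
ROC AUC can be read as the share of (positive, negative) pairs in which the positive example receives the higher score. A pure-Python sketch of that definition on made-up labels and probabilities (`pairwise_auc` is an illustrative helper, not a library function):

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities: 3 of the 4 pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```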

In [1539]:
model.predict_proba(test)[:, 1]

array([0.12738234, 0.04510847, 0.02669485, ..., 0.92052685, 0.92052685,
       0.92052685])

In [1574]:
test

Unnamed: 0_level_0,violation_code,discount_amount,judgment_amount,timedelta,issued_day,issued_month,issued_weekday,hearing_day,hearing_month,hearing_weekday,"(Not responsible by City Dismissal,)","(Not responsible by Determination,)","(Not responsible by Dismissal,)","(PENDING JUDGMENT,)","(Responsible (Fine Waived) by Deter,)","(Responsible by Admission,)","(Responsible by Default,)","(Responsible by Determination,)","(SET-ASIDE (PENDING JUDGMENT),)"
ticket_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
259669,0.420848,0.0,305.0,38.0,27,8,5,4.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
259733,0.152435,0.0,305.0,38.0,27,8,5,4.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
258776,0.420848,0.0,305.0,206.0,27,8,5,22.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
258777,0.010545,0.0,855.0,38.0,27,8,5,5.0,10.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
259663,0.420848,0.0,280.0,35.0,27,8,5,1.0,10.0,5.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325555,0.019006,0.0,0.0,1495.0,2,12,4,6.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
325557,0.019006,0.0,0.0,1495.0,2,12,4,6.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
325562,0.019006,0.0,0.0,1495.0,2,12,4,6.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
325559,0.019006,0.0,0.0,1495.0,2,12,4,6.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
