You have picked up the basic data-processing skills; now it is time to try them out in practice. Your next task is a classification problem.

You are given a dataset from an animal shelter center. Your task is to train a model that, from the available features, predicts the 'Adoption' and 'Transfer' labels (the "outcome_type" column) as confidently as possible.

You are free to do whatever you like here. What I want to see from you:
1. Check for missing values and handle them.
2. Check the relationships between features.
3. Try creating your own features.
4. Remove the redundant ones.
5. Pay attention to the text columns. Think about what useful information can be extracted from them.
6. Using a profiler will help.
7. Don't forget that you have PCA (principal component analysis). It may come in handy.

Recall everything I covered in the previous sessions. Not all of it will be useful, but in real life nobody will tell you what to use :)

A good classifier for this task is the "Random Forest" (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

You don't need to understand how the "forest" works at this stage, but the quality of its predictions is expected to be higher than with a linear classifier. (If you are interested, here is a guide: https://adataanalyst.com/scikit-learn/linear-classification-method/)

### Outlier detection and feature generation

In [38]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from warnings import filterwarnings
filterwarnings('ignore')

import re
import datetime as dt
from functools import reduce

In [39]:
df = pd.read_csv('aac_shelter_outcomes.csv')
df.shape

(78256, 12)

In [40]:
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male
4,5 months,A683115,Other,Bat Mix,Brown,2014-01-07T00:00:00,2014-07-07T14:04:00,2014-07-07T14:04:00,,Rabies Risk,Euthanasia,Unknown


* age_upon_outcome - age of the animal at outcome   || (convert to days)
* animal_id - animal identifier   || (not needed, drop)
* animal_type - kind of animal   || (encode with OneHotEncoding)
* breed - breed   || (could be cleaned of the "mix" value and encoded with OneHotEncoding)
* color - color   || (encode with OneHotEncoding)
* date_of_birth - the animal's date of birth   || (not really needed since we have "datetime", drop)
* datetime - date and time the animal left the shelter
* monthyear - month and year the animal left the shelter   || (an exact copy of datetime, drop)
* name - the animal's name   || (not needed, drop)
* outcome_subtype - a more specific description of why the animal left the shelter   || (not obvious what to do with it)
* outcome_type - how exactly the animal left the shelter   (our target)
* sex_upon_outcome - sex of the animal and whether its reproductive function is preserved   || (extract the "intact" flag and apply OneHotEncoding to the sex)

Let's look at the unique values of the "outcome_type" column.

In [41]:
df.outcome_type.unique()

array(['Transfer', 'Adoption', 'Euthanasia', 'Return to Owner', 'Died',
       'Disposal', 'Relocate', 'Missing', nan, 'Rto-Adopt'], dtype=object)

To simplify the work, we discard the values we don't need.

In [42]:
df = df[df['outcome_type'].isin(['Transfer', 'Adoption'])]
df.shape

(56611, 12)
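
The first item in the task list is a missing-value check. As a minimal sketch, the snippet below counts gaps per column on a small made-up frame; `toy` and its values are assumptions for illustration, not rows from the shelter data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the shelter data.
toy = pd.DataFrame({
    'name': ['Lucy', np.nan, '*Johnny', np.nan],
    'outcome_subtype': ['Partner', 'Partner', np.nan, 'Partner'],
    'animal_type': ['Dog', 'Cat', 'Dog', 'Dog'],
})

# Number of missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

The same `isna().sum()` call applied to `df` would show which columns need imputation or can be dropped.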

### age_upon_outcome

Let's look at the unique values.

In [43]:
age_upon_outcome_unique = df.age_upon_outcome.unique()
age_upon_outcome_unique

array(['2 weeks', '1 year', '9 years', '4 months', '3 years', '1 month',
       '3 months', '2 years', '2 months', '3 weeks', '8 months',
       '5 months', '12 years', '4 years', '7 years', '5 years', '5 days',
       '10 months', '4 weeks', '2 days', '10 years', '6 months',
       '8 years', '11 months', '15 years', '7 months', '6 years',
       '16 years', '9 months', '6 days', '4 days', '1 week', '3 days',
       '14 years', '13 years', '1 day', '1 weeks', '0 years', '11 years',
       '5 weeks', '20 years', '17 years', '19 years', '18 years',
       '25 years', nan], dtype=object)

Now let's bring all values to a single unit, namely days.

In [44]:
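def calc_age_in_days(age_upon_outcome):
    # Missing ages are mapped to 0 days.
    if pd.isna(age_upon_outcome):
        return 0

    # Values look like "2 weeks" or "1 year": a count followed by a period.
    count, period = age_upon_outcome.split()
    multiplier = 0

    # Approximate conversion: a month is taken as 30 days, a year as 365.
    if 'day' in period:
        multiplier = 1
    elif 'week' in period:
        multiplier = 7
    elif 'month' in period:
        multiplier = 30
    elif 'year' in period:
        multiplier = 365

    return int(count) * multiplier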

df['days_upon_outcome'] = df.age_upon_outcome.apply(calc_age_in_days)

df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome,days_upon_outcome
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male,14
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female,365
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male,365
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male,3285
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,Intact Male,120
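
The same conversion can also be written without a Python-level function. The sketch below is an alternative under stated assumptions, not the approach used in this notebook: `ages` is a made-up sample, `DAYS_PER_UNIT` is a hypothetical helper name, and missing ages are kept as NaN instead of being mapped to 0.

```python
import numpy as np
import pandas as pd

# Made-up sample in the same "count unit" format as age_upon_outcome.
ages = pd.Series(['2 weeks', '1 year', '9 years', '4 months', '5 days', np.nan])

# Same approximate multipliers as in calc_age_in_days.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

# Pull the number and the singular unit out of each string.
parts = ages.str.extract(r'(?P<count>\d+)\s+(?P<unit>day|week|month|year)')

# NaN inputs stay NaN here, so "unknown age" is not confused with "0 days".
days = parts['count'].astype(float) * parts['unit'].map(DAYS_PER_UNIT)
print(days.tolist())
```

Keeping NaN lets you decide later how to impute the unknown ages, for example with the median.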


### breed

Let's remove the "mix" value.

In [45]:
len(df.breed.unique())

1803

The variety of values turned out to be large.

In [46]:
breed_words = []

for words_list in df.breed.str.lower().str.split():
    for word in words_list:
        breed_words.append(word)
        
len(breed_words)

165465

In [47]:
df_breed_words = pd.DataFrame(breed_words)
len(df_breed_words[0].unique())

1348

In [48]:
df_breed_words[0].value_counts().head(20)

mix           48939
shorthair     25391
domestic      24391
chihuahua      4364
labrador       4280
retriever      4240
bull           4117
pit            3921
terrier        2571
shepherd       2315
hair           2224
medium         2180
german         1693
australian     1682
longhair       1452
cattle         1238
dog            1153
miniature      1045
siamese         909
dachshund       867
Name: 0, dtype: int64

In [49]:
breed_v_counts = df.breed.value_counts()

breed_v_counts.head(50)

Domestic Shorthair Mix                20809
Pit Bull Mix                           3509
Chihuahua Shorthair Mix                3399
Labrador Retriever Mix                 3258
Domestic Medium Hair Mix               2049
German Shepherd Mix                    1265
Domestic Longhair Mix                  1027
Siamese Mix                             857
Australian Cattle Dog Mix               788
Dachshund Mix                           581
Border Collie Mix                       459
Boxer Mix                               437
Miniature Poodle Mix                    392
Catahoula Mix                           336
Domestic Shorthair                      323
Rat Terrier Mix                         310
Australian Shepherd Mix                 301
Jack Russell Terrier Mix                296
Yorkshire Terrier Mix                   282
Beagle Mix                              267
Cairn Terrier Mix                       266
Pointer Mix                             263
Chihuahua Longhair Mix          

Looking at the words that make up the breed, we can also extract other features, such as hair length.
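
The notebook drops `breed` later without building these features. As a rough sketch of what the extraction could look like, the snippet below derives a "mix" flag, a hair-length category and a cleaned breed name. The sample `breeds` and the helper `hair_length` are assumptions for illustration.

```python
import pandas as pd

# Made-up sample of breed strings in the dataset's format.
breeds = pd.Series([
    'Domestic Shorthair Mix', 'Pit Bull', 'Domestic Medium Hair Mix',
    'Chihuahua Longhair Mix', 'Beagle Mix',
])
lower = breeds.str.lower()

# 1 if the word "mix" appears in the breed, else 0.
is_mix = lower.str.contains(r'\bmix\b').astype(int)

def hair_length(breed: str) -> str:
    """Hypothetical helper: map breed wording to a hair-length category."""
    if 'shorthair' in breed:
        return 'short'
    if 'medium hair' in breed:
        return 'medium'
    if 'longhair' in breed:
        return 'long'
    return 'unknown'

hair = lower.map(hair_length)

# Breed name with the "mix" word removed.
base_breed = lower.str.replace(r'\s*\bmix\b', '', regex=True).str.strip()
print(pd.DataFrame({'is_mix': is_mix, 'hair': hair, 'base_breed': base_breed}))
```

Given the 1803 distinct breeds, it may make sense to one-hot encode only the most frequent `base_breed` values and group the rest as "other".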


### outcome_subtype

In [50]:
outcome_subtype = df.outcome_subtype.unique()

outcome_subtype.shape, outcome_subtype

((7,),
 array(['Partner', nan, 'Offsite', 'Foster', 'SCRP', 'Barn', 'Snr'],
       dtype=object))

There aren't that many unique values, so we can encode them with OneHotEncoding.

Now let's convert our target value.

In [51]:
df['is_adopted'] = df.outcome_type.str.lower().str.contains('adopt').astype(int)
df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome,days_upon_outcome,is_adopted
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,Intact Male,14,0
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,Spayed Female,365,0
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,Neutered Male,365,1
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,Neutered Male,3285,0
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,Intact Male,120,0


Now let's convert the sex and the reproductive status.

In [52]:
sex_upon_outcome = df.sex_upon_outcome.unique()

sex_upon_outcome.shape, sex_upon_outcome

((5,),
 array(['Intact Male', 'Spayed Female', 'Neutered Male', 'Intact Female',
        'Unknown'], dtype=object))

"intact" means the reproductive function is preserved, while "spayed" and "neutered" mean it is not.

In [53]:
df['sex_upon_outcome'] = df.sex_upon_outcome.str.lower()

In [54]:
df['sex_intact'] = df.sex_upon_outcome.str.contains('intact').astype(int)

df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome,days_upon_outcome,is_adopted,sex_intact
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,intact male,14,0,1
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,spayed female,365,0,0
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,neutered male,365,1,0
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,neutered male,3285,0,0
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,intact male,120,0,1


In [55]:
# The pattern is a regular expression, so request regex matching explicitly.
df['sex'] = df.sex_upon_outcome.str.replace('intact|spayed|neutered', '', regex=True).str.strip()

df.head()

Unnamed: 0,age_upon_outcome,animal_id,animal_type,breed,color,date_of_birth,datetime,monthyear,name,outcome_subtype,outcome_type,sex_upon_outcome,days_upon_outcome,is_adopted,sex_intact,sex
0,2 weeks,A684346,Cat,Domestic Shorthair Mix,Orange Tabby,2014-07-07T00:00:00,2014-07-22T16:04:00,2014-07-22T16:04:00,,Partner,Transfer,intact male,14,0,1,male
1,1 year,A666430,Dog,Beagle Mix,White/Brown,2012-11-06T00:00:00,2013-11-07T11:47:00,2013-11-07T11:47:00,Lucy,Partner,Transfer,spayed female,365,0,0,female
2,1 year,A675708,Dog,Pit Bull,Blue/White,2013-03-31T00:00:00,2014-06-03T14:20:00,2014-06-03T14:20:00,*Johnny,,Adoption,neutered male,365,1,0,male
3,9 years,A680386,Dog,Miniature Schnauzer Mix,White,2005-06-02T00:00:00,2014-06-15T15:50:00,2014-06-15T15:50:00,Monday,Partner,Transfer,neutered male,3285,0,0,male
5,4 months,A664462,Dog,Leonberger Mix,Brown/White,2013-06-03T00:00:00,2013-10-07T13:06:00,2013-10-07T13:06:00,*Edgar,Partner,Transfer,intact male,120,0,1,male


Now let's get rid of the unneeded columns.

In [56]:
df.drop(['age_upon_outcome', 'animal_id', 'date_of_birth', 'monthyear', 'name', 'outcome_type', 'sex_upon_outcome', 'breed'], axis=1, inplace=True)

In [57]:
df.drop(['outcome_subtype'], axis=1, inplace=True)

In [58]:
df.head()

Unnamed: 0,animal_type,color,datetime,days_upon_outcome,is_adopted,sex_intact,sex
0,Cat,Orange Tabby,2014-07-22T16:04:00,14,0,1,male
1,Dog,White/Brown,2013-11-07T11:47:00,365,0,0,female
2,Dog,Blue/White,2014-06-03T14:20:00,365,1,0,male
3,Dog,White,2014-06-15T15:50:00,3285,0,0,male
5,Dog,Brown/White,2013-10-07T13:06:00,120,0,1,male


### datetime

Let's convert the "datetime" value to a simpler numeric value.

In [59]:
df['datetime'] = pd.to_datetime(df.datetime).map(dt.datetime.toordinal)

In [60]:
df.head()

Unnamed: 0,animal_type,color,datetime,days_upon_outcome,is_adopted,sex_intact,sex
0,Cat,Orange Tabby,735436,14,0,1,male
1,Dog,White/Brown,735179,365,0,0,female
2,Dog,Blue/White,735387,365,1,0,male
3,Dog,White,735399,3285,0,0,male
5,Dog,Brown/White,735148,120,0,1,male
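
The ordinal keeps only the calendar day. As a sketch of additional features that could be derived from the same timestamp (an illustration, not something the notebook does), the snippet below also extracts month, weekday and hour; the `stamps` sample reuses two datetime values shown in the table above.

```python
import pandas as pd

# Two timestamps copied from the head() output above.
stamps = pd.to_datetime(pd.Series(['2014-07-22T16:04:00', '2013-11-07T11:47:00']))

features = pd.DataFrame({
    # Day number, as in the cell above.
    'ordinal': stamps.map(lambda ts: ts.toordinal()),
    # Seasonal and weekly patterns that the ordinal alone does not expose.
    'month': stamps.dt.month,
    'weekday': stamps.dt.dayofweek,  # Monday == 0
    'hour': stamps.dt.hour,
})
print(features)
```

Whether these help the classifier is an open question; they are cheap to try.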


### color

In [61]:
df.color.value_counts()

Black/White               6111
Black                     5141
Brown Tabby               3975
Brown Tabby/White         2083
Orange Tabby              1914
                          ... 
Gray Tabby/Orange            1
Blue/Calico                  1
Buff/Red                     1
Black Tabby/Gray Tabby       1
Orange Tabby/Brown           1
Name: color, Length: 475, dtype: int64

In [62]:
splited_colors_series = df.color.str.split('/')
max(map(len, splited_colors_series))

2

Let's arrange things so that the color can be reduced to two categorical features.

In [63]:
splited_colors = []

for colors_list in splited_colors_series:
    for color in colors_list:
        splited_colors.append(color)

In [64]:
len(pd.DataFrame(splited_colors)[0].unique())

57

We have reduced the number of categories to 57. Now let's split the categories.

In [65]:
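# NOTE: df.color holds plain strings, so colors[0] and colors[1] are the first
# and second characters of the string, not the first and second color.
df['color_common'] = df.color.map(lambda colors: colors[0])
df['color_second'] = df.color.map(lambda colors: colors[1] if len(colors) > 1 else np.nan)
df.drop(['color'], axis=1, inplace=True)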

In [66]:
df.head()

Unnamed: 0,animal_type,datetime,days_upon_outcome,is_adopted,sex_intact,sex,color_common,color_second
0,Cat,735436,14,0,1,male,O,r
1,Dog,735179,365,0,0,female,W,h
2,Dog,735387,365,1,0,male,B,l
3,Dog,735399,3285,0,0,male,W,h
5,Dog,735148,120,0,1,male,B,r


In [67]:
df['color_common'] = df['color_common'].str.replace(' ', '_')
df['color_second'] = df['color_second'].str.replace(' ', '_')

In [68]:
df.head()

Unnamed: 0,animal_type,datetime,days_upon_outcome,is_adopted,sex_intact,sex,color_common,color_second
0,Cat,735436,14,0,1,male,O,r
1,Dog,735179,365,0,0,female,W,h
2,Dog,735387,365,1,0,male,B,l
3,Dog,735399,3285,0,0,male,W,h
5,Dog,735148,120,0,1,male,B,r


We have split the colors into a primary and a secondary one. Now we can convert our categorical data with OneHotEncoder. All of it except "color_second", since that feature will have to be set separately.
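
The `head()` output above shows single letters in `color_common` and `color_second`, which suggests the lambdas index characters of the color string rather than items of a split list. A minimal sketch of a split that yields whole color names, on a made-up sample (`colors` is an assumption):

```python
import pandas as pd

# Made-up sample in the same format as the color column.
colors = pd.Series(['Orange Tabby', 'White/Brown', 'Blue/White', 'White'])

# Split on the first '/' into two columns; a missing second color becomes NaN/None.
parts = colors.str.split('/', n=1, expand=True)

color_common = parts[0].str.replace(' ', '_')
color_second = parts[1].str.replace(' ', '_')

print(pd.DataFrame({'color_common': color_common, 'color_second': color_second}))
```

Using this in the notebook would change the encoded columns and therefore the reported score, so the numbers below apply only to the single-letter version.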

In [69]:
def one_hot_encode_new_columns(df: pd.DataFrame, col_name: str):
    enc = OneHotEncoder(categories='auto')
    
    encoded_data = enc.fit_transform(
        np.array( df[col_name] ).reshape(-1, 1)
    ).todense()
    
    encoded_feature_names = list(map(lambda val: re.sub(r'^.+_', f'{col_name}_', val), enc.get_feature_names()))
    
    return pd.DataFrame(data=encoded_data, columns=encoded_feature_names)

In [70]:
df = df.reset_index(drop=True)

ohe_col_names = ['animal_type', 'sex', 'color_common']

df_ohe = pd.concat([
    df,
    *map(lambda col_name: one_hot_encode_new_columns(df, col_name), ohe_col_names)
], sort=False, axis=1)

df_ohe.drop(ohe_col_names, axis=1, inplace=True)

df_ohe.head()

Unnamed: 0,datetime,days_upon_outcome,is_adopted,sex_intact,color_second,animal_type_Bird,animal_type_Cat,animal_type_Dog,animal_type_Livestock,animal_type_Other,...,color_common_F,color_common_G,color_common_L,color_common_O,color_common_P,color_common_R,color_common_S,color_common_T,color_common_W,color_common_Y
0,735436,14,0,1,r,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,735179,365,0,0,h,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,735387,365,1,0,l,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,735399,3285,0,0,h,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,735148,120,0,1,r,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
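
For comparison, pandas can do the same encoding in one call. The sketch below is an alternative to the `one_hot_encode_new_columns` helper, not a replacement the notebook uses; `toy` is a made-up frame.

```python
import pandas as pd

# Made-up frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    'animal_type': ['Cat', 'Dog', 'Dog'],
    'sex': ['male', 'female', 'male'],
    'days_upon_outcome': [14, 365, 365],
})

# Each listed column is replaced by indicator columns named "<column>_<value>".
# dtype=int yields integer 0/1 directly, so no float-to-int conversion is needed later.
encoded = pd.get_dummies(toy, columns=['animal_type', 'sex'], dtype=int)
print(encoded)
```

`OneHotEncoder` is still the better fit when the same encoding has to be re-applied to new data, since it remembers the categories seen during `fit`.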


In [71]:
def set_second_color(row):
    second_color = row.get('color_second')

    if second_color:
        common_color_col_name = f'color_common_{second_color}'
        col_is_exist = row.get(common_color_col_name)
        if col_is_exist is not None:
            row[common_color_col_name] = 1

    return row

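color_second_index = df_ohe[df_ohe.color_second.notna()].index

# NOTE: apply() returns a new frame; the result is only displayed here and is
# not assigned back, so df_ohe itself is left unchanged by this cell.
df_ohe.loc[color_second_index].apply(set_second_color, axis=1)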

Unnamed: 0,datetime,days_upon_outcome,is_adopted,sex_intact,color_second,animal_type_Bird,animal_type_Cat,animal_type_Dog,animal_type_Livestock,animal_type_Other,...,color_common_F,color_common_G,color_common_L,color_common_O,color_common_P,color_common_R,color_common_S,color_common_T,color_common_W,color_common_Y
0,735436,14,0,1,r,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,735179,365,0,0,h,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,735387,365,1,0,l,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,735399,3285,0,0,h,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,735148,120,0,1,r,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56606,736726,30,1,0,r,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56607,736726,30,1,0,r,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56608,736726,1095,1,0,l,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56609,736726,60,1,0,e,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
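
In the cell above the frame returned by `apply` is displayed but not written back to `df_ohe`. One way to mark both colors at once is to treat the color string as a set of labels. A minimal sketch with `Series.str.get_dummies` on a made-up sample (`colors` is an assumption):

```python
import pandas as pd

# Made-up sample in the same format as the original color column.
colors = pd.Series(['Orange Tabby', 'White/Brown', 'Blue/White', 'White'])

# One indicator column per distinct color; a row with "A/B" gets 1 in both A and B.
color_flags = (
    colors.str.replace(' ', '_')
          .str.get_dummies(sep='/')
          .add_prefix('color_')
)
print(color_flags)
```

This loses the primary/secondary order of the colors, which may or may not matter for the model.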


In [72]:
df_ohe.shape

(56611, 26)

In [73]:
df_ohe.drop(['color_second'], axis=1, inplace=True)

In [74]:
df_ohe.head()

Unnamed: 0,datetime,days_upon_outcome,is_adopted,sex_intact,animal_type_Bird,animal_type_Cat,animal_type_Dog,animal_type_Livestock,animal_type_Other,sex_female,...,color_common_F,color_common_G,color_common_L,color_common_O,color_common_P,color_common_R,color_common_S,color_common_T,color_common_W,color_common_Y
0,735436,14,0,1,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,735179,365,0,0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,735387,365,1,0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,735399,3285,0,0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,735148,120,0,1,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [75]:
df_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56611 entries, 0 to 56610
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   datetime               56611 non-null  int64  
 1   days_upon_outcome      56611 non-null  int64  
 2   is_adopted             56611 non-null  int64  
 3   sex_intact             56611 non-null  int64  
 4   animal_type_Bird       56611 non-null  float64
 5   animal_type_Cat        56611 non-null  float64
 6   animal_type_Dog        56611 non-null  float64
 7   animal_type_Livestock  56611 non-null  float64
 8   animal_type_Other      56611 non-null  float64
 9   sex_female             56611 non-null  float64
 10  sex_male               56611 non-null  float64
 11  sex_unknown            56611 non-null  float64
 12  color_common_A         56611 non-null  float64
 13  color_common_B         56611 non-null  float64
 14  color_common_C         56611 non-null  float64
 15  co

Just in case, let's convert all values to int.

In [76]:
float_column_names = df_ohe.select_dtypes(float).columns

for col_name in float_column_names:
    df_ohe[col_name] = df_ohe[col_name].astype('int32')

In [77]:
df_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56611 entries, 0 to 56610
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   datetime               56611 non-null  int64
 1   days_upon_outcome      56611 non-null  int64
 2   is_adopted             56611 non-null  int64
 3   sex_intact             56611 non-null  int64
 4   animal_type_Bird       56611 non-null  int32
 5   animal_type_Cat        56611 non-null  int32
 6   animal_type_Dog        56611 non-null  int32
 7   animal_type_Livestock  56611 non-null  int32
 8   animal_type_Other      56611 non-null  int32
 9   sex_female             56611 non-null  int32
 10  sex_male               56611 non-null  int32
 11  sex_unknown            56611 non-null  int32
 12  color_common_A         56611 non-null  int32
 13  color_common_B         56611 non-null  int32
 14  color_common_C         56611 non-null  int32
 15  color_common_F         56611 non-nul

Let's move on to training our model.

In [78]:
X = df_ohe.drop(['is_adopted'], axis=1)
Y = df_ohe['is_adopted']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

In [79]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((39627, 24), (16984, 24), (39627,), (16984,))

In [80]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [81]:
clf.score(X_test, Y_test)

0.7809114460668864

##### Our model's accuracy is 78.1%, which is quite decent
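
An accuracy figure is easier to judge next to a trivial baseline and with a fixed random seed. The sketch below shows that setup on synthetic data from `make_classification`. The data, the `random_state=0` seed and all variable names are assumptions for illustration, and the resulting numbers say nothing about the shelter dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for X and Y.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.6, 0.4], random_state=0,
)

# stratify keeps the class ratio equal in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0,
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(f'baseline: {baseline_acc:.3f}, forest: {forest_acc:.3f}')
```

On the real data, the gap between the two scores is more informative than the forest's accuracy alone.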