Horse Survival:
------------------
Predict the survival of a horse based on various medical conditions.
Try a Decision Tree classifier and a Random Forest classifier and compare their accuracy.

In [1]:
#Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
#Read the input dataset
animals = pd.read_csv('horse.csv')

In [3]:
#View first few records
animals.head()

Unnamed: 0,surgery,age,hospital_number,rectal_temp,pulse,respiratory_rate,temp_of_extremities,peripheral_pulse,mucous_membrane,capillary_refill_time,...,packed_cell_volume,total_protein,abdomo_appearance,abdomo_protein,outcome,surgical_lesion,lesion_1,lesion_2,lesion_3,cp_data
0,no,adult,530101,38.5,66.0,28.0,cool,reduced,,more_3_sec,...,45.0,8.4,,,died,no,11300,0,0,no
1,yes,adult,534817,39.2,88.0,20.0,,,pale_cyanotic,less_3_sec,...,50.0,85.0,cloudy,2.0,euthanized,no,2208,0,0,no
2,no,adult,530334,38.3,40.0,24.0,normal,normal,pale_pink,less_3_sec,...,33.0,6.7,,,lived,no,0,0,0,yes
3,yes,young,5290409,39.1,164.0,84.0,cold,normal,dark_cyanotic,more_3_sec,...,48.0,7.2,serosanguious,5.3,died,yes,2208,0,0,yes
4,no,adult,530255,37.3,104.0,35.0,,,dark_cyanotic,more_3_sec,...,74.0,7.4,,,died,no,4300,0,0,no


In [4]:
#View size of the dataset
animals.shape

(299, 28)

In [5]:
# Target is the column named 'outcome'
target = animals['outcome']

In [6]:
#Check unique values in target
target.unique()

array(['died', 'euthanized', 'lived'], dtype=object)

In [7]:
# Remove the target column so that only the features remain
animals = animals.drop(['outcome'], axis=1)

In [8]:
animals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   surgery                299 non-null    object 
 1   age                    299 non-null    object 
 2   hospital_number        299 non-null    int64  
 3   rectal_temp            239 non-null    float64
 4   pulse                  275 non-null    float64
 5   respiratory_rate       241 non-null    float64
 6   temp_of_extremities    243 non-null    object 
 7   peripheral_pulse       230 non-null    object 
 8   mucous_membrane        252 non-null    object 
 9   capillary_refill_time  267 non-null    object 
 10  pain                   244 non-null    object 
 11  peristalsis            255 non-null    object 
 12  abdominal_distention   243 non-null    object 
 13  nasogastric_tube       195 non-null    object 
 14  nasogastric_reflux     193 non-null    object 
 15  nasoga

In [9]:
# scikit-learn's decision trees need numeric input, so we convert the category variables to numeric using dummies
category_variables = ['surgery','age','temp_of_extremities','peripheral_pulse',
                     'mucous_membrane','capillary_refill_time','pain','peristalsis',
                     'abdominal_distention','nasogastric_tube','nasogastric_reflux',
                      'rectal_exam_feces','abdomen','abdomo_appearance','surgical_lesion',
                      'cp_data']

In [10]:
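for category in category_variables:
    # get_dummies returns one indicator column per category level. Only the first
    # is kept here so that each variable stays a single numeric column; assigning
    # the whole multi-column result to one column raises an error in recent pandas.
    animals[category] = pd.get_dummies(animals[category], dtype=int).iloc[:, 0]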
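
A fuller one-hot encoding keeps one indicator column for every category level rather than a single column per variable. The following is a minimal sketch on a toy frame (the values are hypothetical), using the `columns` argument of `pd.get_dummies`. Applying it to `animals` would change the feature count, so the shapes and accuracies recorded in this notebook would no longer apply.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror three dataset fields
toy = pd.DataFrame({
    "pulse": [66.0, 88.0, 40.0],
    "age": ["adult", "adult", "young"],
    "pain": ["mild_pain", None, "severe_pain"],
})

# One indicator column per category level; the original text columns are dropped.
# A missing value becomes a row of zeros across that variable's indicators.
encoded = pd.get_dummies(toy, columns=["age", "pain"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Numeric columns such as `pulse` pass through unchanged, so the result can be handed to the imputer and the classifiers as before.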

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X, y = animals.values, target.values
label_encoder = LabelEncoder() # Applying label encoder to change categorical values into numeric values
print(y)
y = label_encoder.fit_transform(y)
print(y)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1)

['died' 'euthanized' 'lived' 'died' 'died' 'lived' 'lived' 'died'
 'euthanized' 'lived' 'lived' 'lived' 'lived' 'died' 'lived' 'died' 'died'
 'lived' 'lived' 'lived' 'lived' 'lived' 'lived' 'lived' 'lived' 'lived'
 'lived' 'lived' 'died' 'lived' 'died' 'euthanized' 'lived' 'lived'
 'lived' 'euthanized' 'euthanized' 'lived' 'lived' 'died' 'died' 'lived'
 'lived' 'euthanized' 'euthanized' 'died' 'lived' 'lived' 'died' 'died'
 'lived' 'died' 'lived' 'lived' 'euthanized' 'died' 'lived' 'died' 'died'
 'died' 'lived' 'lived' 'died' 'euthanized' 'lived' 'lived' 'lived'
 'lived' 'lived' 'lived' 'euthanized' 'lived' 'died' 'died' 'died'
 'euthanized' 'lived' 'lived' 'died' 'lived' 'died' 'lived' 'lived' 'died'
 'lived' 'lived' 'died' 'euthanized' 'lived' 'lived' 'lived' 'died'
 'lived' 'died' 'lived' 'lived' 'lived' 'euthanized' 'lived' 'lived'
 'lived' 'euthanized' 'lived' 'lived' 'died' 'lived' 'lived' 'lived'
 'euthanized' 'died' 'died' 'lived' 'lived' 'died' 'lived' 'lived' 'lived'
 'euthan
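
`LabelEncoder` assigns codes by sorting the class names, so for this target the mapping is died -> 0, euthanized -> 1, lived -> 2. A short sketch, on a hand-written list of labels, that recovers the mapping and decodes integer codes back to names:

```python
from sklearn.preprocessing import LabelEncoder

# Small hand-written sample of the three outcome labels
labels = ["died", "euthanized", "lived", "died"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes (e.g. model predictions) back into names
decoded = encoder.inverse_transform(codes)
print(decoded)
```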

In [12]:
from sklearn.tree import DecisionTreeClassifier
print(X_train.shape)

(239, 27)


In [13]:
#Replace NaN values in both train & test data using most_frequent values
from sklearn.impute import SimpleImputer
import numpy as np
imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
X_train = imp.fit_transform(X_train)  # learn the most frequent values from the training data
X_test = imp.transform(X_test)  # reuse the training statistics instead of refitting on the test data
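
Imputation statistics are normally learned from the training split only and then reused on the test split, so that no information from the test rows influences preprocessing. A small sketch on hypothetical toy arrays shows which values the imputer learns and how they are applied:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays standing in for X_train and X_test
train = np.array([[1.0, 5.0], [1.0, np.nan], [2.0, 5.0]])
test = np.array([[np.nan, 7.0], [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
train_filled = imputer.fit_transform(train)  # learn each column's mode from train only
test_filled = imputer.transform(test)        # fill test gaps with the training modes

print(imputer.statistics_)  # per-column most frequent values learned from train
print(test_filled)
```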

In [14]:
classifier = DecisionTreeClassifier()

In [15]:
classifier.fit(X_train,y_train)

DecisionTreeClassifier()

In [16]:
y_predict = classifier.predict(X_test)

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_predict)
print('Accuracy of DT:', accuracy)

Accuracy of DT: 0.5833333333333334
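
Accuracy is the fraction of test rows whose predicted label matches the true label. With 60 test rows (299 - 239), a score of 0.5833 corresponds to 35 correct predictions. A minimal sketch with hypothetical labels, checking the manual calculation against `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration (0 = died, 1 = euthanized, 2 = lived)
y_true = np.array([0, 2, 2, 1, 2, 0])
y_pred = np.array([0, 2, 1, 1, 0, 0])

manual = np.mean(y_true == y_pred)        # fraction of matching labels
library = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn
print(manual, library)
```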


In [19]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

In [20]:
classifier.fit(X_train,y_train)

RandomForestClassifier()

In [21]:
y_predict = classifier.predict(X_test)

In [22]:
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_predict)
print('Accuracy of RFC:', accuracy)

Accuracy of RFC: 0.7166666666666667


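In this run, the ensemble algorithm (RFC, accuracy about 0.72) is more accurate than the single DT (about 0.58). Neither classifier was given a random_state and the test set has only 60 rows, so the exact values will change from run to run.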
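
A single 60-row test split gives a noisy estimate, so one way to compare the two models more reliably is cross-validated accuracy with fixed seeds. The sketch below uses synthetic three-class data from `make_classification` as a stand-in for the imputed horse features. The data, the seed values, and the choice of 5 folds are assumptions for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the imputed horse features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=27, n_informative=8,
    n_classes=3, random_state=0,
)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; folds are stratified for classifiers
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    print(name, round(scores[name].mean(), 3), round(scores[name].std(), 3))
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced by the encoded features and target, with the imputer placed inside a pipeline so that it is refit on each training fold.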