## All you need is love… And a pet!

<img src="img/dataset-cover.jpg" width="920">

Here we build a classifier that predicts whether an animal taken in by an animal shelter will be adopted. The data come from `aac_intakes_outcomes.csv`, available at: https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes/version/1#aac_intakes_outcomes.csv. You will be working with the following features:

1. *animal_type:* Type of animal. May be one of 'cat', 'dog', 'bird', etc.
2. *intake_year:* Year of intake
3. *intake_condition:* The intake condition of the animal. Can be one of 'normal', 'injured', 'sick', etc.
4. *intake_number:* The number of times the animal has been brought into the shelter. Values higher than 1 indicate that the animal has been taken in on more than one occasion.
5. *intake_type:* The type of intake, for example, 'stray', 'owner surrender', etc.
6. *sex_upon_intake:* The sex of the animal and whether it had been spayed or neutered at the time of intake
7. *age_upon\_intake_(years):* The age of the animal upon intake represented in years
8. *time_in_shelter_days:* Numeric value denoting the number of days the animal remained at the shelter from intake to outcome.
9. *sex_upon_outcome:* The sex of the animal and whether it had been spayed or neutered at the time of outcome
10. *age_upon\_outcome_(years):* The age of the animal upon outcome represented in years
11. *outcome_type:* The outcome type. Can be one of ‘adopted’, ‘transferred’, etc.

In [95]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
from itertools import combinations 
import ast
from sklearn.linear_model import LogisticRegression
import seaborn as sn
%matplotlib inline

data_folder = './data/'

### A) Load the dataset and convert categorical features to a suitable numerical representation (use dummy-variable encoding). 
- Split the data into a training set (80%) and a test set (20%). Pair each feature vector with the corresponding label, i.e., whether the outcome_type is adoption or not. 
- Standardize the values of each feature in the data to have mean 0 and variance 1.

The use of external libraries is not permitted in part A, except for numpy and pandas. 
You can drop entries with missing values.

In [96]:
columns = ['animal_type', 'intake_year', 'intake_condition', 'intake_number', 'intake_type', 'sex_upon_intake', \
          'age_upon_intake_(years)', 'time_in_shelter_days', 'sex_upon_outcome', 'age_upon_outcome_(years)', \
          'outcome_type']
original_data = pd.read_csv(data_folder+'aac_intakes_outcomes.csv', usecols=columns)

In [97]:
# First let's take a look at the dataset
original_data.head()

|   | outcome_type | sex_upon_outcome | age_upon_outcome_(years) | animal_type | intake_condition | intake_type | sex_upon_intake | age_upon_intake_(years) | intake_year | intake_number | time_in_shelter_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Return to Owner | Neutered Male | 10.0 | Dog | Normal | Stray | Neutered Male | 10.0 | 2017 | 1.0 | 0.588194 |
| 1 | Return to Owner | Neutered Male | 7.0 | Dog | Normal | Public Assist | Neutered Male | 7.0 | 2014 | 2.0 | 1.259722 |
| 2 | Return to Owner | Neutered Male | 6.0 | Dog | Normal | Public Assist | Neutered Male | 6.0 | 2014 | 3.0 | 1.113889 |
| 3 | Transfer | Neutered Male | 10.0 | Dog | Normal | Owner Surrender | Neutered Male | 10.0 | 2014 | 1.0 | 4.970139 |
| 4 | Return to Owner | Neutered Male | 16.0 | Dog | Injured | Public Assist | Neutered Male | 16.0 | 2013 | 1.0 | 0.119444 |


In [98]:
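# The dataset contains categorical variables.
# We need to convert them to a numerical representation. pandas offers the function get_dummies, which takes care of this.
# Note that outcome_type is encoded too: the outcome_type_* columns hold the label, not features.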
X = pd.get_dummies(original_data[columns])
X.head()

|   | intake_year | intake_number | age_upon_intake_(years) | time_in_shelter_days | age_upon_outcome_(years) | animal_type_Bird | animal_type_Cat | animal_type_Dog | animal_type_Other | intake_condition_Aged | ... | sex_upon_outcome_Unknown | outcome_type_Adoption | outcome_type_Died | outcome_type_Disposal | outcome_type_Euthanasia | outcome_type_Missing | outcome_type_Relocate | outcome_type_Return to Owner | outcome_type_Rto-Adopt | outcome_type_Transfer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 1.0 | 10.0 | 0.588194 | 10.0 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1 | 2014 | 2.0 | 7.0 | 1.259722 | 7.0 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 2 | 2014 | 3.0 | 6.0 | 1.113889 | 6.0 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 3 | 2014 | 1.0 | 10.0 | 4.970139 | 10.0 | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | 2013 | 1.0 | 16.0 | 0.119444 | 16.0 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |


In [99]:
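# Let's check whether the encoded data contains undefined values.
# Caveat: get_dummies encodes a missing category as all-zero dummy columns,
# so this check only detects NaNs in the numeric columns.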
len(X[X.isna().any(axis=1)])

0

In [100]:
# Standardize the continuous features to mean 0 and variance 1
def standardize(df, column):
    df[column] = (df[column] - df[column].mean()) / df[column].std()
    return df

columns_std = ['intake_year', 'intake_number', 'age_upon_intake_(years)', 'time_in_shelter_days', 'age_upon_outcome_(years)']
for column in columns_std:
    X = standardize(X, column)

# Display the new standardized dataframe
X.head()

|   | intake_year | intake_number | age_upon_intake_(years) | time_in_shelter_days | age_upon_outcome_(years) | animal_type_Bird | animal_type_Cat | animal_type_Dog | animal_type_Other | intake_condition_Aged | ... | sex_upon_outcome_Unknown | outcome_type_Adoption | outcome_type_Died | outcome_type_Disposal | outcome_type_Euthanasia | outcome_type_Missing | outcome_type_Relocate | outcome_type_Return to Owner | outcome_type_Rto-Adopt | outcome_type_Transfer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.200085 | -0.278079 | 2.727873 | -0.387936 | 2.709378 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1 | -1.102017 | 1.914629 | 1.69095 | -0.371824 | 1.674923 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 2 | -1.102017 | 4.107338 | 1.345309 | -0.375323 | 1.330105 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 3 | -1.102017 | -0.278079 | 2.727873 | -0.282801 | 2.709378 | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | -1.869384 | -0.278079 | 4.801719 | -0.399183 | 4.778288 | False | False | True | False | False | ... | False | False | False | False | False | False | False | True | False | False |


In [110]:
# Split the data into a training set (80%) and a test set (20%) :
# Set a seed for reproducibility
np.random.seed(42)

# Shuffle the indices of your data
index = np.arange(len(X))
np.random.shuffle(index)

# Define the proportion for the training set
train_ratio = 0.8
train_size = int(len(X) * train_ratio)

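# Split the data, keeping the outcome_type_* dummies out of the features:
# they encode the label, so leaving them in would leak it to the classifier
label_columns = [c for c in X.columns if c.startswith('outcome_type_')]
X_train = X.drop(columns=label_columns).iloc[index[:train_size]]
X_test = X.drop(columns=label_columns).iloc[index[train_size:]]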

# Display 
# display(X_train)
# display(X_test)
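
The cells above standardize with the mean and standard deviation of the full dataset, before the split. A common alternative is to estimate both statistics on the training rows only and reuse them for the test rows, so that no test-set information reaches training. The following is a minimal sketch on a made-up frame; the helper name `standardize_with_train_stats` and the toy values are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd


def standardize_with_train_stats(train, test, columns):
    """Scale `columns` in both frames using statistics from `train` only."""
    train, test = train.copy(), test.copy()
    mean = train[columns].mean()
    std = train[columns].std()  # pandas default: sample std (ddof=1)
    train[columns] = (train[columns] - mean) / std
    test[columns] = (test[columns] - mean) / std
    return train, test


# Made-up values, only to exercise the helper
toy = pd.DataFrame({"age": [1.0, 2.0, 3.0, 4.0, 10.0],
                    "days": [0.5, 1.5, 2.5, 3.5, 8.0]})
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]
toy_train_std, toy_test_std = standardize_with_train_stats(
    toy_train, toy_test, ["age", "days"])
```

With this variant the training columns have mean 0 and variance 1 exactly, while the test columns are only approximately standardized.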

### B) Train a logistic regression classifier on your training set. Logistic regression returns probabilities as predictions, so in order to arrive at a binary prediction, you need to put a threshold on the predicted probabilities. 
- For the decision threshold of 0.5, present the performance of your classifier on the test set by displaying the confusion matrix. Based on the confusion matrix, manually calculate accuracy, precision, recall, and F1-score with respect to the positive and the negative class. 

In [111]:
# The label used for training: whether the outcome is an adoption
y = X['outcome_type_Adoption']

y_train = y.iloc[index[:train_size]]
y_test = y.iloc[index[train_size:]]

# Display 
# display(y_train)
# display(y_test)

In [112]:
logistic = LogisticRegression(solver='lbfgs')
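
The manual calculation asked for in B can be sketched as follows. This is an illustration under stated assumptions rather than the reference solution: the helpers `confusion_counts` and `metrics` and the toy arrays are hypothetical, a probability equal to the threshold is counted as positive, and a ratio with a zero denominator is reported as 0. Once `logistic` has been fit on the training set, `logistic.predict_proba(X_test)[:, 1]` would supply the probabilities of the positive class.

```python
import numpy as np


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (TP, FP, TN, FN) for probabilities thresholded at `threshold`."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    return tp, fp, tn, fn


def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def metrics(tp, fp, tn, fn):
    """Accuracy, plus precision, recall and F1 for each class."""
    def prf(hits, wrong_predictions, misses):
        p = safe_div(hits, hits + wrong_predictions)
        r = safe_div(hits, hits + misses)
        return {"precision": p, "recall": r, "f1": safe_div(2 * p * r, p + r)}

    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "positive": prf(tp, fp, fn),
        # For the negative class the roles of the two error types swap
        "negative": prf(tn, fn, fp),
    }


# Made-up labels and probabilities, only to exercise the helpers
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
tp, fp, tn, fn = confusion_counts(toy_true, toy_prob)
report = metrics(tp, fp, tn, fn)
```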

### C) Vary the value of the threshold in the range from 0 to 1 and visualize the value of accuracy, precision, recall, and F1-score (with respect to both classes) as a function of the threshold.
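
One way to obtain the curves for C is to recompute the metrics at each threshold. The sketch below is a minimal version under assumptions: the function names and toy arrays are hypothetical, undefined ratios are set to 0, and the negative-class metrics are obtained by flipping both labels and predictions.

```python
import numpy as np


def prf(y_true, y_pred):
    """Precision, recall and F1 for the class marked True in both arrays."""
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def sweep_thresholds(y_true, y_prob, thresholds):
    """Return a dict of metric name -> array of values, one per threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob)
    names = ["accuracy"] + [f"{m}_{c}" for c in ("pos", "neg")
                            for m in ("precision", "recall", "f1")]
    curves = {name: [] for name in names}
    for t in thresholds:
        y_pred = y_prob >= t
        curves["accuracy"].append(np.mean(y_pred == y_true))
        # Flipping labels and predictions gives the negative-class metrics
        for c, (yt, yp) in {"pos": (y_true, y_pred), "neg": (~y_true, ~y_pred)}.items():
            p, r, f = prf(yt, yp)
            curves[f"precision_{c}"].append(p)
            curves[f"recall_{c}"].append(r)
            curves[f"f1_{c}"].append(f)
    return {name: np.array(values) for name, values in curves.items()}


# Made-up labels and probabilities, only to exercise the sweep
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
thresholds = np.linspace(0, 1, 11)
curves = sweep_thresholds(toy_true, toy_prob, thresholds)
```

Each array in `curves` can then be drawn against `thresholds`, for example with `plt.plot(thresholds, curves["f1_pos"], label="F1 (adopted)")`.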

### D) Plot in a bar chart the coefficients of the logistic regression sorted by their contribution to the prediction.
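
For D, the coefficients need to be paired with their feature names and ordered before plotting. The sketch below assumes that "contribution" means the absolute value of the coefficient, which is one reasonable reading of the task; the helper name and the coefficient values are made up. For a binary problem, the `coef_` attribute of a fitted `LogisticRegression` has shape `(1, n_features)`, which is why the array is flattened first.

```python
import numpy as np
import pandas as pd


def sort_by_contribution(coefficients, feature_names):
    """Return the coefficients as a Series sorted by decreasing magnitude."""
    coef = pd.Series(np.ravel(coefficients), index=feature_names)
    order = coef.abs().sort_values(ascending=False).index
    return coef.reindex(order)


# Made-up coefficients, shaped like coef_ for a binary problem
toy_coef = np.array([[0.2, -1.5, 0.7]])
toy_names = ["intake_number", "time_in_shelter_days", "animal_type_Dog"]
ranked = sort_by_contribution(toy_coef, toy_names)
```

Calling `ranked.plot.bar()` would then draw the bar chart with the signs preserved. Coefficient magnitudes are only comparable across features that are on similar scales, which is one reason the continuous features were standardized.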


## Question 1: Which of the following metrics is most suitable when you are dealing with unbalanced classes?

- a) F1 Score
- b) Recall
- c) Precision
- d) Accuracy

In [106]:
# The answer is a)
# The F1 Score is a metric that balances both precision and recall, making it suitable for evaluating 
# the performance of a model with unbalanced classes. It considers both false positives and false negatives, 
# providing a more comprehensive measure in situations where one class significantly outnumbers the other.
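
A small made-up example shows why accuracy can mislead here. Assume 100 animals of which only 5 belong to the positive class, and a degenerate classifier that always predicts the negative class; the numbers are hypothetical and chosen only for illustration. Precision is undefined when nothing is predicted positive and is set to 0 in this sketch.

```python
import numpy as np

# 5 positives, 95 negatives, and a classifier that always answers "negative"
y_true_toy = np.array([True] * 5 + [False] * 95)
y_pred_toy = np.zeros(100, dtype=bool)

tp = np.sum(y_pred_toy & y_true_toy)
fp = np.sum(y_pred_toy & ~y_true_toy)
fn = np.sum(~y_pred_toy & y_true_toy)

accuracy = np.mean(y_pred_toy == y_true_toy)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print("accuracy: {:.2f}, recall: {:.2f}, F1: {:.2f}".format(accuracy, recall, f1))
```

Accuracy looks high only because the majority class dominates the count, while recall and F1 for the positive class reveal that no positive example is ever found.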

## Question 2: You are working on a binary classification problem. You trained a model on a training dataset and got the following confusion matrix on the test dataset. What is true about the evaluation metrics (rounded to the second decimal point):

|            | Pred = NO|Pred=YES|
|------------|----------|--------|
| Actual NO  |    50    |   10   |
| Actual YES |    5     |   100  |

- a) Accuracy is 0.95
- b) Accuracy is 0.85
- c) False positive rate is 0.95
- d) True positive rate is 0.95

In [107]:
TN = 50
FN = 5
TP = 100
FP = 10
N = TN + FN + TP + FP

# Accuracy :
ACC = (TP + TN)/N
print("The accuracy is : {:.2f}".format(ACC))

# True positive rate :
TPR = TP/(TP+FN)
print("The True positive rate is : {:.2f}".format(TPR))

# False positive rate :
FPR = FP/(FP+TN)
print("The False positive rate is : {:.2f}".format(FPR))

# The answer is therefore d)


The accuracy is : 0.91
The True positive rate is : 0.95
The False positive rate is : 0.17
