# Importing Data

Mount Google Drive folder.

In [1]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


Enable displaying all columns when we print dataframes.

In [2]:
# Display all columns in output
pd.set_option('display.max_columns', None)

Import training data. Change path if needed.

In [3]:
TRAIN_PATH = '/content/drive/MyDrive/petfinder-adoption-prediction/train.csv'
df = pd.read_csv(TRAIN_PATH)

# General Pre-Processing

These are the pre-processing steps that the whole team followed:
- Drop unnecessary columns:
  - `Name` was dropped because it has many missing values.
  - `RescuerID` and `PetID` were dropped because they are identifiers that carry no meaning for our models.
  - `Description` was dropped because we do not intend to perform sentiment analysis on text in this project.
- Drop rows with missing data.
- Remove outliers for the numerical columns `Fee` and `Age`. A value counts as an outlier if it lies more than 5 standard deviations from the column mean, in either direction.

In [4]:
import numpy as np
from scipy import stats

# Drop unnecessary columns
columns_to_drop = ['Name', 'RescuerID', 'PetID', 'Description']
df = df.drop(columns=columns_to_drop)  # columns= already targets columns, no axis needed

# Drop rows with missing data
df = df.dropna()

# Find z-scores for numerical columns
z_scores_fee = np.abs(stats.zscore(df['Fee']))
z_scores_age = np.abs(stats.zscore(df['Age']))

# Find outliers based on threshold of 5 std. deviations
outlier_rows_fee = z_scores_fee > 5
outlier_rows_age = z_scores_age > 5
combined_outlier_rows = outlier_rows_fee | outlier_rows_age

# Remove outliers
df = df[~combined_outlier_rows]

print(df)

       Type  Age  Breed1  Breed2  Gender  Color1  Color2  Color3  \
0         2    3     299       0       1       1       7       0   
1         2    1     265       0       1       1       2       0   
2         1    1     307       0       1       2       7       0   
3         1    4     307       0       2       1       2       0   
4         1    1     307       0       1       1       0       0   
...     ...  ...     ...     ...     ...     ...     ...     ...   
14988     2    2     266       0       3       1       0       0   
14989     2   60     265     264       3       1       4       7   
14990     2    2     265     266       3       5       6       7   
14991     2    9     266       0       2       4       7       0   
14992     1    1     307     307       1       2       0       0   

       MaturitySize  FurLength  Vaccinated  Dewormed  Sterilized  Health  \
0                 1          1           2         2           2       1   
1                 2          2 
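To make the 5-standard-deviation rule concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration and are not part of the PetFinder data). It applies the same absolute z-score mask used in the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 49 free listings and one extreme fee
toy = pd.DataFrame({'Fee': [0] * 49 + [1000]})

# stats.zscore uses the population standard deviation (ddof=0) by default
z_scores = np.abs(stats.zscore(toy['Fee']))

# The mean is 20 and the standard deviation is 140, so the 1000 fee has z = 7
is_outlier = z_scores > 5
filtered = toy[~is_outlier]

print(f'Rows before: {len(toy)}, rows after: {len(filtered)}')
```

Because the mask uses the absolute value, a row would also be removed if it were far below the mean, not only far above it.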

# Neural Network Pre-Processing

This is pre-processing that is specific to my neural network model.

Scale numerical attributes with `StandardScaler`. According to the `scikit-learn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), `StandardScaler` standardizes each feature by removing the mean and scaling to unit variance. This prevents features with large values from dominating the learning process.

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_to_scale = ['Age', 'Quantity', 'Fee', 'VideoAmt', 'PhotoAmt']
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])
print(df[features_to_scale])

            Age  Quantity       Fee  VideoAmt  PhotoAmt
0     -0.430954 -0.393225  1.515422 -0.164405 -0.828418
1     -0.558874 -0.393225 -0.308343 -0.164405 -0.542567
2     -0.558874 -0.393225 -0.308343 -0.164405  0.886685
3     -0.366994 -0.393225  2.427305 -0.164405  1.172536
4     -0.558874 -0.393225 -0.308343 -0.164405 -0.256717
...         ...       ...       ...       ...       ...
14988 -0.494914  1.633300 -0.308343 -0.164405 -0.256717
14989  3.214782  0.282284 -0.308343 -0.164405 -0.256717
14990 -0.494914  2.308808  0.238787 -0.164405  0.314984
14991 -0.047192 -0.393225 -0.308343 -0.164405 -0.256717
14992 -0.558874 -0.393225 -0.308343 -0.164405 -0.828418

[14796 rows x 5 columns]
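As a rough illustration of what the scaling step computes, the sketch below compares `StandardScaler` with the formula written out by hand. The `ages` array is hypothetical toy data, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column of ages
ages = np.array([[1.0], [2.0], [3.0], [6.0], [48.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```

One design point worth considering: a stricter setup would fit the scaler on the training split only and call `transform` on the validation and test splits, so that statistics from held-out rows do not influence training.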


One-hot encode the categorical features below with `pd.get_dummies()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)). `Vaccinated`, `Dewormed`, and `Sterilized` take the values [1, 2, 3] for "Yes", "No", and "Not Sure"; `Type` and `Gender` are likewise stored as integer codes. One-hot encoding prevents the neural network from assuming an ordering such as 1 < 2 < 3 among these codes.

In [6]:
features_to_encode = ['Type', 'Gender', 'Vaccinated', 'Dewormed', 'Sterilized']
df = pd.get_dummies(df, columns=features_to_encode)
print(df)

            Age  Breed1  Breed2  Color1  Color2  Color3  MaturitySize  \
0     -0.430954     299       0       1       7       0             1   
1     -0.558874     265       0       1       2       0             2   
2     -0.558874     307       0       2       7       0             2   
3     -0.366994     307       0       1       2       0             2   
4     -0.558874     307       0       1       0       0             2   
...         ...     ...     ...     ...     ...     ...           ...   
14988 -0.494914     266       0       1       0       0             2   
14989  3.214782     265     264       1       4       7             2   
14990 -0.494914     265     266       5       6       7             3   
14991 -0.047192     266       0       4       7       0             1   
14992 -0.558874     307     307       2       0       0             2   

       FurLength  Health  Quantity       Fee  State  VideoAmt  PhotoAmt  \
0              1       1 -0.393225  1.515422  41
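For a smaller view of what one-hot encoding does, here is a minimal sketch on a hypothetical toy frame (the rows are invented for illustration). Each distinct code in the encoded column becomes its own indicator column, and untouched columns are kept as they are.

```python
import pandas as pd

# Hypothetical toy rows using the 1 = Yes, 2 = No, 3 = Not Sure coding
toy = pd.DataFrame({'Vaccinated': [1, 2, 3, 1], 'Fee': [0, 50, 0, 100]})

encoded = pd.get_dummies(toy, columns=['Vaccinated'])

# Each original value becomes its own indicator column
print(list(encoded.columns))
# ['Fee', 'Vaccinated_1', 'Vaccinated_2', 'Vaccinated_3']
```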

Split `Breed1`, `Breed2`, and `State` features into N features each based on frequency of values.

The breed columns are label encoded with about 300 distinct integer IDs (IDs as high as 307 appear in the output below). We don't want the neural network to think there's a relationship between the numbers, like 100 < 101, but one-hot encoding every value would create too many columns.

Instead, we create a column for each of the top 3 most common values, and another for "other".

In [7]:
n = 3
features_to_split = ['Breed1', 'Breed2', 'State']

for feature in features_to_split:
    top_N_values = df[feature].value_counts().head(n)
    print(f'Top {n} values for {feature}:\n{top_N_values}\n')

    # Keep values in the top N by frequency, replace the rest with -1 (other).
    # This vectorized form is equivalent to looping over rows, but much faster.
    is_top_N = df[feature].isin(top_N_values.index)
    df[feature] = df[feature].where(is_top_N, -1)

    df = pd.get_dummies(df, columns=[feature])

print(df)

Top 3 values for Breed1:
307    5903
266    3623
265    1257
Name: Breed1, dtype: int64

Top 3 values for Breed2:
0      10613
307     1723
266      597
Name: Breed2, dtype: int64

Top 3 values for State:
41326    8595
41401    3800
41327     830
Name: State, dtype: int64

            Age  Color1  Color2  Color3  MaturitySize  FurLength  Health  \
0     -0.430954       1       7       0             1          1       1   
1     -0.558874       1       2       0             2          2       1   
2     -0.558874       2       7       0             2          2       1   
3     -0.366994       1       2       0             2          1       1   
4     -0.558874       1       0       0             2          1       1   
...         ...     ...     ...     ...           ...        ...     ...   
14988 -0.494914       1       0       0             2          2       1   
14989  3.214782       1       4       7             2          2       1   
14990 -0.494914       5       6       7   
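The "top N plus other" idea can be seen more easily on a handful of rows. The sketch below uses a hypothetical toy `Breed1` column (the counts are made up) and shows which columns the encoding produces.

```python
import pandas as pd

# Hypothetical toy column: three frequent breed IDs and two rare ones
toy = pd.DataFrame({'Breed1': [307] * 4 + [266] * 3 + [265] * 2 + [100, 200]})

n = 3
top_values = toy['Breed1'].value_counts().head(n).index

# Values outside the top n are bucketed into -1 ("other")
toy['Breed1'] = toy['Breed1'].where(toy['Breed1'].isin(top_values), -1)

encoded = pd.get_dummies(toy, columns=['Breed1'])
print(list(encoded.columns))
# ['Breed1_-1', 'Breed1_265', 'Breed1_266', 'Breed1_307']
```

With `n = 3`, each split feature therefore contributes four columns: one per frequent value and one for everything else.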

# Neural Network Testing

Set up the neural network training, validation, and test data. 85% is training, 5% is validation, and 10% is test.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

target_feature = 'AdoptionSpeed'
features = df.drop(columns=[target_feature])
target = df[target_feature]

# Hold out 15% of the data for testing & validation
x_train, x_test_val, y_train, y_test_val = train_test_split(
    features, target, test_size=0.15, shuffle=True,
    stratify=target, random_state=42
)

# train_test_split returns the larger (1 - test_size) part first, so
# x_test gets 2/3 of the held-out rows (10% of the data) and
# x_val gets the remaining 1/3 (5% of the data)
x_test, x_val, y_test, y_val = train_test_split(
    x_test_val, y_test_val, test_size=1/3, shuffle=True,
    stratify=y_test_val, random_state=42
)
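Because `train_test_split` returns the remaining portion first and the `test_size` portion second, it can help to check the resulting sizes directly. The sketch below does this on hypothetical toy arrays; the names `first_part` and `second_part` are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 5 balanced classes
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 5

# First split: 85% training, 15% held out
x_train, x_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Second split: the first return value is the (1 - test_size) portion,
# the second return value is the test_size portion
first_part, second_part, y_first, y_second = train_test_split(
    x_hold, y_hold, test_size=1/3, stratify=y_hold, random_state=42
)

print(len(x_train), len(first_part), len(second_part))
```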

Fit the neural network and assess accuracy on training, validation, and test sets.

- `max_iter` was raised to 500 from the default of 200 so that the optimizer has enough iterations to converge.

- `hidden_layer_sizes` was determined after experimenting with different parameters with `GridSearchCV`. I ran a grid search on my local machine with a variety of layer sizes and found that `(38, 5)` gave the highest accuracy on test and validation data. 38 corresponds to the number of input features, and 5 corresponds to the number of categories we're predicting.

- The default `solver`, `adam`, was kept because the `scikit-learn` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) notes that `adam` works well on relatively large datasets with thousands of training samples or more. I also tried the `lbfgs` solver, but the neural network never converged with it.

- `random_state` was set to 42 to make the predictions reproducible.
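The grid search mentioned above was run separately and its exact settings are not shown here. As a rough idea of what such a search looks like, here is a minimal sketch on synthetic data; the candidate layer sizes, `cv=3`, and `max_iter=300` are assumptions chosen to keep the example fast, not the values actually used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the PetFinder dataset)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)

# Hypothetical candidate architectures
param_grid = {'hidden_layer_sizes': [(5,), (10,), (10, 5)]}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=42),
    param_grid, cv=3, scoring='accuracy'
)
search.fit(X, y)

print(search.best_params_)
print(f'Best cross-validated accuracy: {search.best_score_:.1%}')
```

`GridSearchCV` scores each candidate with cross-validation on the data passed to `fit`, so the test set can stay untouched until the final evaluation.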

In [9]:
from sklearn.metrics import precision_score, recall_score

classifier = MLPClassifier(
    max_iter=500,
    hidden_layer_sizes=(38, 5),
    random_state=42
)

# Fit to training data and make predictions on test data
classifier.fit(x_train, y_train)
test_predictions = classifier.predict(x_test)

accuracy = classifier.score(x_train, y_train)
print(f'Accuracy on training: {accuracy:.1%}')

accuracy = classifier.score(x_val, y_val)
print(f'Accuracy on validation: {accuracy:.1%}\n')

accuracy = classifier.score(x_test, y_test)
print(f'Accuracy on test: {accuracy:.1%}')

# zero_division=0 reports 0 when a class has no predicted samples (precision)
# or no true samples (recall), instead of emitting a warning
avg_precision = precision_score(
    y_test, test_predictions, average='weighted', zero_division=0
)
avg_recall = recall_score(
    y_test, test_predictions, average='weighted', zero_division=0
)
precision_per_class = precision_score(
    y_test, test_predictions, average=None, zero_division=0
)
recall_per_class = recall_score(
    y_test, test_predictions, average=None, zero_division=0
)

print(f'Average weighted precision: {avg_precision:.1%}')
print(f'Average weighted recall: {avg_recall:.1%}')
for class_label, (precision, recall) in enumerate(zip(precision_per_class, recall_per_class)):
    print(f'Class {class_label}: Precision = {precision:.1%}, Recall = {recall:.1%}')

Accuracy on training: 45.0%
Accuracy on validation: 40.0%

Accuracy on test: 38.7%
Average weighted precision: 36.9%
Average weighted recall: 38.7%
Class 0: Precision = 0.0%, Recall = 0.0%
Class 1: Precision = 30.9%, Recall = 29.5%
Class 2: Precision = 33.7%, Recall = 43.1%
Class 3: Precision = 37.7%, Recall = 13.3%
Class 4: Precision = 47.4%, Recall = 64.9%
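Two features of this output can be reproduced on a tiny example: a class that is never predicted gets 0 precision when `zero_division=0`, and the weighted-average recall always equals plain accuracy. The labels below are hypothetical and chosen only to show both effects.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: class 0 is never predicted
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]

precision_per_class = precision_score(
    y_true, y_pred, average=None, zero_division=0
)
weighted_recall = recall_score(
    y_true, y_pred, average='weighted', zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

# Class 0 has no predicted samples, so its precision is reported as 0
print(precision_per_class)

# Weighting each class's recall by its support gives correct / total
print(np.isclose(weighted_recall, accuracy))
```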


(Optional) Perform k-fold cross-validation for the modified parameters.

In [10]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, accuracy_score

num_folds = 10

scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(
        precision_score, average='weighted', zero_division=0
    ),
    'recall': make_scorer(
        recall_score, average='weighted', zero_division=0
    )
}

scores = cross_validate(
    classifier, x_train, y_train,
    cv=num_folds, scoring=scoring_metrics
)

Output cross-validation accuracy, precision, and recall (weighted averages).

In [11]:
print(f'\n{num_folds}-fold validation accuracy mean: {scores["test_accuracy"].mean():.1%}')
print(f'{num_folds}-fold validation precision mean: {scores["test_precision"].mean():.1%}')
print(f'{num_folds}-fold validation recall mean: {scores["test_recall"].mean():.1%}')


10-fold validation accuracy mean: 37.8%
10-fold validation precision mean: 35.9%
10-fold validation recall mean: 37.8%
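The `scores` object returned by `cross_validate` is a dictionary with one array per metric, keyed as `test_<name>` for each name in the scoring dictionary. A minimal sketch on synthetic data follows; the data, the small network, and `cv=5` are assumptions made to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer, recall_score
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the PetFinder dataset)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)

scoring = {
    'accuracy': make_scorer(accuracy_score),
    'recall': make_scorer(recall_score, average='weighted', zero_division=0),
}

scores = cross_validate(
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=42),
    X, y, cv=5, scoring=scoring
)

# Keys include fit_time, score_time, and one test_<name> entry per metric
print(sorted(scores))
print(scores['test_accuracy'].shape)  # one score per fold
```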


Repeat the same experiment with an otherwise default instance of `MLPClassifier`. `max_iter` was raised to 500 because training did not converge within the default 200 iterations.

In [12]:
classifier = MLPClassifier(max_iter=500, random_state=42)

# Fit to training data and make predictions on test data
classifier.fit(x_train, y_train)
test_predictions = classifier.predict(x_test)

accuracy = classifier.score(x_train, y_train)
print(f'Accuracy on training: {accuracy:.1%}')

accuracy = classifier.score(x_val, y_val)
print(f'Accuracy on validation: {accuracy:.1%}\n')

accuracy = classifier.score(x_test, y_test)
print(f'Accuracy on test: {accuracy:.1%}')

# zero_division=0 reports 0 when a class has no predicted samples (precision)
# or no true samples (recall), instead of emitting a warning
avg_precision = precision_score(
    y_test, test_predictions, average='weighted', zero_division=0
)
avg_recall = recall_score(
    y_test, test_predictions, average='weighted', zero_division=0
)
precision_per_class = precision_score(
    y_test, test_predictions, average=None, zero_division=0
)
recall_per_class = recall_score(
    y_test, test_predictions, average=None, zero_division=0
)

print(f'Average weighted precision: {avg_precision:.1%}')
print(f'Average weighted recall: {avg_recall:.1%}')
for class_label, (precision, recall) in enumerate(zip(precision_per_class, recall_per_class)):
    print(f'Class {class_label}: Precision = {precision:.1%}, Recall = {recall:.1%}')

Accuracy on training: 51.2%
Accuracy on validation: 40.7%

Accuracy on test: 37.1%
Average weighted precision: 35.3%
Average weighted recall: 37.1%
Class 0: Precision = 0.0%, Recall = 0.0%
Class 1: Precision = 30.5%, Recall = 23.9%
Class 2: Precision = 33.0%, Recall = 38.6%
Class 3: Precision = 35.9%, Recall = 21.4%
Class 4: Precision = 44.0%, Recall = 61.3%
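A note on printing percentages: multiplying a rounded float by 100 can reintroduce floating-point noise, which is why a value may print with a long decimal tail. A minimal sketch of the difference, using a hypothetical accuracy value:

```python
accuracy = 0.407  # hypothetical value, already rounded to three decimals

# Multiplying by 100 after rounding can reintroduce floating-point noise
noisy = f'{round(accuracy, 3) * 100}%'

# The percent format spec scales by 100 and rounds in a single step
clean = f'{accuracy:.1%}'

print(noisy)
print(clean)  # 40.7%
```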


(Optional) Perform k-fold cross-validation for the default parameters.

In [13]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, accuracy_score

num_folds = 10

scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(
        precision_score, average='weighted', zero_division=0
    ),
    'recall': make_scorer(
        recall_score, average='weighted', zero_division=0
    )
}

scores = cross_validate(
    classifier, x_train, y_train,
    cv=num_folds, scoring=scoring_metrics
)

Output cross-validation accuracy, precision, and recall (weighted averages).

In [14]:
print(f'\n{num_folds}-fold validation accuracy mean: {scores["test_accuracy"].mean():.1%}')
print(f'{num_folds}-fold validation precision mean: {scores["test_precision"].mean():.1%}')
print(f'{num_folds}-fold validation recall mean: {scores["test_recall"].mean():.1%}')


10-fold validation accuracy mean: 36.2%
10-fold validation precision mean: 34.9%
10-fold validation recall mean: 36.2%
