# Data Pre-Processing

Citations:
- [TA provided sample project](https://elearn.ucr.edu/courses/104198/pages/sample-project-material?module_item_id=1752456)
- [Google Drive Mounting and Folder Path](https://stackoverflow.com/questions/72199130/google-colab-import-data-from-google-drive-and-make-it-possible-to-share-it)
- [SKLearn Libraries for Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
- Preprocessing/Metric Representation - similar to the team
---

**Goal:** Predict the adoption speed of an animal from the features provided by adoption centers, in order to improve the allocation of resources for long-term animals.

---
Here is a key for the given attributes:
- adoption speed (value to predict) (0 = same day, 1 = between one and seven days, 2 = between eight and thirty days, 3 = between thirty-one and ninety days, 4 = no adoption after 100 days)
- animal type (1 = dog/2 = cat)
- age (in months)
- breed (refer to labels)
- gender (1 = Male, 2 = Female, 3 = Multiple in one Post)
- color (refer to labels)
- maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- fur length  (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- vaccination status  (1 = Yes, 2 = No, 3 = Not Sure)
- dewormed status  (1 = Yes, 2 = No, 3 = Not Sure)
- sterilization  (1 = Yes, 2 = No, 3 = Not Sure)
- health condition  (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- quantity (number of pets in listing)
- adoption fee (in RM)
- location (state in Malaysia, refer to labels)
---


In [1]:
# Setup Libraries
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, make_scorer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.preprocessing import LabelEncoder

# Setup Dataframes
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)
df_train = pd.read_csv("/content/drive/My Drive/Colab Notebooks/train.csv")

Mounted at /content/drive/


In [2]:
# Drop columns
df_train = df_train.drop(['Name', 'RescuerID', 'PetID', 'Description'], axis=1)

# Drop rows with missing information
df_train = df_train.dropna()

# Calculate absolute z-scores for the numerical columns Fee and Age
z_scores_fee = np.abs(stats.zscore(df_train['Fee']))
z_scores_age = np.abs(stats.zscore(df_train['Age']))

# Flag rows more than 5 standard deviations from the mean as outliers
outlier_rows_fee = z_scores_fee > 5
outlier_rows_age = z_scores_age > 5
combined_outlier_rows = outlier_rows_fee | outlier_rows_age

# Remove rows with outliers
df_train = df_train[~combined_outlier_rows]
print(f"Removed {combined_outlier_rows.sum()} rows due to outliers. There are now {df_train.shape[0]} rows in our unsplit training data.")

# Separate the target column (AdoptionSpeed) from the training features
adoption_speed_train = df_train.pop('AdoptionSpeed')

Removed 197 rows due to outliers. There are now 14796 rows in our unsplit training data.
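
As a minimal sketch of the z-score filter used above, the toy example below applies the same threshold of 5 to a small, made-up DataFrame. The values and the names `toy`, `threshold`, and `toy_clean` are assumptions for illustration only and do not come from the adoption dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data (hypothetical values): 29 typical listings plus one extreme fee.
toy = pd.DataFrame({
    "Age": [3] * 15 + [4] * 15,
    "Fee": [0] * 29 + [1000],
})

# Absolute z-score per numeric column; a row is an outlier if any column
# lies more than `threshold` standard deviations from that column's mean.
threshold = 5
z = np.abs(stats.zscore(toy[["Age", "Fee"]]))
is_outlier = (z > threshold).any(axis=1)

toy_clean = toy[~is_outlier]
print(f"Removed {is_outlier.sum()} of {len(toy)} rows")
```

A single extreme value among n rows can reach a z-score of at most sqrt(n - 1), so a threshold of 5 only removes rows in reasonably large samples.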


Categorical features have to be encoded so that the number associated with a category does not imply an order where none exists. I separated them into binary, ordinal, and non-ordinal categories.

- Gender, Vaccinated, Dewormed, and Sterilized have only three options each (ex: Yes, No, Not Sure). The code calls them binary, and we can easily one-hot encode them through pandas' get_dummies function
- MaturitySize, FurLength, Health, and the Color columns are treated as ordinal (ex: color runs from 1 = black to 7 = white, i.e. from dark to light), so we take that as a natural order and use label encoding
- Breed is non-ordinal, so for each of Breed1 and Breed2 we keep the top n most frequent breeds (n = 3 in the code below), group all other breeds into a single "other" value (-1), and one-hot encode the result.
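
The three encoding strategies described above can be sketched on a toy frame. The column names match the dataset, but the values and `n = 2` are made up for illustration. This is only a sketch of the idea, not the notebook's exact cell.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values) using column names from the dataset.
toy = pd.DataFrame({
    "Vaccinated": [1, 2, 3, 1, 2, 3],
    "FurLength": [3, 1, 2, 1, 2, 3],
    "Breed1": [307, 307, 307, 266, 266, 141],
})

# One-hot encode a low-cardinality column: one indicator column per level.
toy = pd.get_dummies(toy, columns=["Vaccinated"])

# LabelEncoder maps the sorted unique values to 0..k-1, so the order is kept.
toy["FurLength"] = LabelEncoder().fit_transform(toy["FurLength"])

# Keep the n most frequent breeds and collapse the rest into -1 ("other").
n = 2
top_breeds = toy["Breed1"].value_counts().head(n).index
toy["Breed1"] = toy["Breed1"].where(toy["Breed1"].isin(top_breeds), -1)
toy = pd.get_dummies(toy, columns=["Breed1"])

print(toy.columns.tolist())
```

Note that LabelEncoder only renumbers the existing codes. It preserves their order but does not by itself decide whether that order is meaningful.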

In [3]:
# Separating categorical features
binary_categorical_columns = ['Gender','Vaccinated','Dewormed', 'Sterilized']
ordinal_categorical_columns = ['MaturitySize', 'FurLength', 'Health', 'Color1', 'Color2', 'Color3']
nonordinal_categorical_columns = ['Breed1', 'Breed2']

# Encoding
label_encoder = LabelEncoder()
df_train = pd.get_dummies(df_train, columns=binary_categorical_columns)
for column in ordinal_categorical_columns:
  df_train[column] = label_encoder.fit_transform(df_train[column])

# Split non-ordinal categorical columns into n features (similar to Benjamin's preprocessing)
n = 3

for feature in nonordinal_categorical_columns:
  top_N_values = df_train[feature].value_counts().head(n)
  print(f'Top {n} values for {feature}:\n{top_N_values}\n')

  top_N_value_names = top_N_values.index
  # If value isn't top N frequency, replace with -1 (other); vectorized instead of a row-by-row loop
  df_train[feature] = df_train[feature].where(df_train[feature].isin(top_N_value_names), -1)

  df_train = pd.get_dummies(df_train, columns=[feature])

Top 3 values for Breed1:
307    5903
266    3623
265    1257
Name: Breed1, dtype: int64

Top 3 values for Breed2:
0      10613
307     1723
266      597
Name: Breed2, dtype: int64



In [4]:
# Split 85% training, 10% test, 5% validation
df_train_85, X_temp, adoption_speed_train_85, y_temp = train_test_split(df_train, adoption_speed_train, test_size=0.15, random_state=42, stratify=adoption_speed_train)    # 85% training set, 15% temp set
df_test_10, df_val_5, adoption_speed_test_10, adoption_speed_val_5 = train_test_split(X_temp, y_temp, test_size=1/3, random_state=42, stratify=y_temp)       # 10% test set, 5% validation set
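
To sanity-check the two-step split, the sketch below repeats the same `train_test_split` calls on synthetic data (1000 rows, 5 balanced classes, both assumptions for illustration) and prints the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 1000 rows, 5 balanced classes.
X = np.arange(1000).reshape(-1, 1)
y = np.repeat(np.arange(5), 200)

# Step 1: hold out 15% of the rows.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

# Step 2: one third of the held-out 15% becomes validation (5% overall),
# the remaining two thirds become the test set (10% overall).
X_test, X_val, y_test, y_val = train_test_split(
    X_temp, y_temp, test_size=1/3, random_state=42, stratify=y_temp)

print(len(X_train), len(X_test), len(X_val))
```

Because both calls pass `stratify`, each subset keeps roughly the same class proportions as the full data.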

Out of curiosity, I also plotted a pairwise plot of the features with continuous values (Age, Quantity, Fee). The graphs show that many of the distributions are right-skewed.

I did not plot the other features: their values are categorical, so the points group into vertical lines and the graphs are not very informative.

In [5]:
sns.pairplot(df_train_85[['Age', 'Quantity', 'Fee']], diag_kind="hist")
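
The right skew seen in the pair plot could also be quantified numerically. A minimal sketch, assuming a made-up `Fee` column rather than the real data, uses pandas' `skew()`, where a positive value indicates a longer right tail.

```python
import pandas as pd

# Toy values (hypothetical): most are small, a few are large (long right tail).
toy = pd.DataFrame({"Fee": [0, 0, 0, 0, 0, 10, 20, 50, 300]})

# Positive sample skewness indicates a right-skewed distribution.
fee_skew = toy["Fee"].skew()
print(f"Fee skewness: {fee_skew:.2f}")
```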

For the decision tree model:
- max_depth: restricts the maximum depth that the decision tree can reach (decrease to help with overfitting)
- min_samples_split: the minimum number of samples an internal node must contain before it can be split into child nodes (increase to help with overfitting)
- min_samples_leaf: the minimum number of samples that must remain in each leaf (external) node after a split (increase to help with overfitting)
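
To illustrate how these three parameters constrain a tree, the sketch below fits an unrestricted and a restricted `DecisionTreeClassifier` on synthetic data from `make_classification`. The data and the specific parameter values are assumptions chosen for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the adoption dataset), used only to show the effect.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=20,
                                min_samples_leaf=10, random_state=0).fit(X, y)

# Compare tree size and training accuracy of the two models.
for name, tree in [("unpruned", unpruned), ("pruned", pruned)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train accuracy={tree.score(X, y):.2f}")
```

The restricted tree is shallower and has fewer leaves, which usually lowers training accuracy. Whether it generalizes better has to be checked on held-out data.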

In [6]:
# Create a decision tree model (default)
decision_tree_default = DecisionTreeClassifier()
decision_tree_default.fit(df_train_85, adoption_speed_train_85)
adoption_speed_train_prediction_default = decision_tree_default.predict(df_train_85)
adoption_speed_val_prediction_default = decision_tree_default.predict(df_val_5)
adoption_speed_test_prediction_default = decision_tree_default.predict(df_test_10)


# Create a decision tree model (pre-pruning)
decision_tree = DecisionTreeClassifier(max_depth=10, min_samples_split=2, min_samples_leaf=2)
decision_tree.fit(df_train_85, adoption_speed_train_85)
adoption_speed_train_prediction = decision_tree.predict(df_train_85)
adoption_speed_val_prediction = decision_tree.predict(df_val_5)
adoption_speed_test_prediction = decision_tree.predict(df_test_10)

In [7]:
# OFF THE SHELF BASELINE MODEL
print(f"Before Pre-Pruning (default baseline model):")

datasets = [
    (adoption_speed_train_85, adoption_speed_train_prediction_default, 'Training'),
    (adoption_speed_val_5, adoption_speed_val_prediction_default, 'Validation'),
    (adoption_speed_test_10, adoption_speed_test_prediction_default, 'Test')
]

for data, predictions, label in datasets:
  # Calculate and print accuracy for the current split (training, validation, or test)
  accuracy = accuracy_score(data, predictions)
  print(f"\nAccuracy ({label}): {accuracy}")
  # Calculate and print average weighted precision and recall for classes overall
  avg_precision = precision_score(data, predictions, average='weighted', zero_division=0)
  avg_recall = recall_score(data, predictions, average='weighted', zero_division=0)
  print(f'Average weighted precision: {avg_precision}')
  print(f'Average weighted recall: {avg_recall}')
  # Calculate precision and recall for each class separately; zero_division=0 reports 0.0 for classes with no predicted samples
  precision_per_class = precision_score(data, predictions, average=None, zero_division=0)
  recall_per_class = recall_score(data, predictions, average=None, zero_division=0)
  # Print precision and recall for each class
  for class_label, precision, recall in zip(range(len(precision_per_class)), precision_per_class, recall_per_class):
      print(f'Class {class_label}: Precision = {precision:.2f}, Recall = {recall:.2f}')

Before Pre-Pruning (default baseline model):

Accuracy (Training): 0.9838581424936387
Average weighted precision: 0.9840000811568964
Average weighted recall: 0.9838581424936387
Class 0: Precision = 0.98, Recall = 0.99
Class 1: Precision = 0.97, Recall = 0.99
Class 2: Precision = 0.98, Recall = 0.98
Class 3: Precision = 0.99, Recall = 0.98
Class 4: Precision = 1.00, Recall = 0.98

Accuracy (Validation): 0.3364864864864865
Average weighted precision: 0.3345237353546888
Average weighted recall: 0.3364864864864865
Class 0: Precision = 0.12, Recall = 0.15
Class 1: Precision = 0.29, Recall = 0.29
Class 2: Precision = 0.31, Recall = 0.28
Class 3: Precision = 0.27, Recall = 0.27
Class 4: Precision = 0.46, Recall = 0.50

Accuracy (Test): 0.3290540540540541
Average weighted precision: 0.33730794223059996
Average weighted recall: 0.3290540540540541
Class 0: Precision = 0.06, Recall = 0.10
Class 1: Precision = 0.27, Recall = 0.28
Class 2: Precision = 0.35, Recall = 0.34
Class 3: Precision = 0.30, 

In [8]:
# USING ADJUSTED MODEL PARAMETERS (Pre-pruning)
print(f"After Pre-Pruning:")
datasets = [
    (adoption_speed_train_85, adoption_speed_train_prediction, 'Training'),
    (adoption_speed_val_5, adoption_speed_val_prediction, 'Validation'),
    (adoption_speed_test_10, adoption_speed_test_prediction, 'Test')
]

for data, predictions, label in datasets:
  # Calculate and print accuracy for the current split (training, validation, or test)
  accuracy = accuracy_score(data, predictions)
  print(f"\nAccuracy ({label}): {accuracy}")
  # Calculate and print average weighted precision and recall for classes overall
  avg_precision = precision_score(data, predictions, average='weighted', zero_division=0)
  avg_recall = recall_score(data, predictions, average='weighted', zero_division=0)
  print(f'Average weighted precision: {avg_precision}')
  print(f'Average weighted recall: {avg_recall}')
  # Calculate precision and recall for each class separately; zero_division=0 reports 0.0 for classes with no predicted samples
  precision_per_class = precision_score(data, predictions, average=None, zero_division=0)
  recall_per_class = recall_score(data, predictions, average=None, zero_division=0)
  # Print precision and recall for each class
  for class_label, precision, recall in zip(range(len(precision_per_class)), precision_per_class, recall_per_class):
      print(f'Class {class_label}: Precision = {precision:.2f}, Recall = {recall:.2f}')


After Pre-Pruning:

Accuracy (Training): 0.47892811704834604
Average weighted precision: 0.4807584954752817
Average weighted recall: 0.47892811704834604
Class 0: Precision = 0.47, Recall = 0.08
Class 1: Precision = 0.42, Recall = 0.43
Class 2: Precision = 0.42, Recall = 0.50
Class 3: Precision = 0.48, Recall = 0.36
Class 4: Precision = 0.59, Recall = 0.62

Accuracy (Validation): 0.3878378378378378
Average weighted precision: 0.3776627687053279
Average weighted recall: 0.3878378378378378
Class 0: Precision = 0.00, Recall = 0.00
Class 1: Precision = 0.33, Recall = 0.35
Class 2: Precision = 0.34, Recall = 0.40
Class 3: Precision = 0.33, Recall = 0.27
Class 4: Precision = 0.53, Recall = 0.54

Accuracy (Test): 0.3587837837837838
Average weighted precision: 0.3576067787118627
Average weighted recall: 0.3587837837837838
Class 0: Precision = 0.25, Recall = 0.03
Class 1: Precision = 0.30, Recall = 0.32
Class 2: Precision = 0.31, Recall = 0.39
Class 3: Precision = 0.31, Recall = 0.22
Class 4: Pr
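
The per-class precision and recall printed by the two evaluation cells could also be produced more compactly with scikit-learn's `classification_report`. The sketch below uses made-up labels and predictions (`y_true`, `y_pred`) rather than the notebook's results.

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions for the five adoption-speed classes.
y_true = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 2, 2, 1, 3, 4, 4, 4]

# Text table with per-class precision/recall plus weighted averages.
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a nested dict, convenient for further processing.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["weighted avg"]["recall"])
```

Class 0 is never predicted in this toy example, so its precision is reported as 0 because of `zero_division=0`.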

In [18]:
# Perform cross-validation on default and pre-pruned decision trees (Worked with Ben)
scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted', zero_division=0),
    'recall': make_scorer(recall_score, average='weighted', zero_division=0)
}

scores_default = cross_validate(decision_tree_default, df_train_85, adoption_speed_train_85, cv=10, scoring=scoring_metrics)
print(f'10-fold validation accuracy mean: {scores_default["test_accuracy"].mean()}')
print(f'10-fold validation precision mean: {scores_default["test_precision"].mean()}')
print(f'10-fold validation recall mean: {scores_default["test_recall"].mean()}')

scores_pruned = cross_validate(decision_tree, df_train_85, adoption_speed_train_85, cv=10, scoring=scoring_metrics)
print(f'\n10-fold validation accuracy mean: {scores_pruned["test_accuracy"].mean()}')
print(f'10-fold validation precision mean: {scores_pruned["test_precision"].mean()}')
print(f'10-fold validation recall mean: {scores_pruned["test_recall"].mean()}')

10-fold validation accuracy mean: 0.32394868545366934
10-fold validation precision mean: 0.32500862490708754
10-fold validation recall mean: 0.32394868545366934

10-fold validation accuracy mean: 0.3663309315211603
10-fold validation precision mean: 0.35646408440380384
10-fold validation recall mean: 0.3663309315211603


In [21]:
# Perform paired t-test (ttest_rel) on the per-fold scores for statistical significance
t_stat_accuracy, p_value_accuracy = stats.ttest_rel(scores_default["test_accuracy"], scores_pruned["test_accuracy"])
t_stat_precision, p_value_precision = stats.ttest_rel(scores_default["test_precision"], scores_pruned["test_precision"])
t_stat_recall, p_value_recall = stats.ttest_rel(scores_default["test_recall"], scores_pruned["test_recall"])

print(f'p_value_accuracy = {p_value_accuracy}')
print(f'p_value_precision = {p_value_precision}')
print(f'p_value_recall = {p_value_recall}')


p_value_accuracy = 9.099466736064099e-07
p_value_precision = 1.3764931213505677e-05
p_value_recall = 9.099466736064099e-07
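
The accuracy and recall p-values above are identical because weighted recall reduces to accuracy: weighting each class's recall by its share of the samples gives the total number of correct predictions divided by the total number of samples. The sketch below checks this on made-up labels and then runs `ttest_rel` on hypothetical per-fold scores. All numbers are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import accuracy_score, recall_score

# Weighted recall: sum_c (n_c / N) * (TP_c / n_c) = sum_c TP_c / N = accuracy.
y_true = [0, 1, 1, 2, 2, 2, 3, 4, 4, 0]
y_pred = [0, 1, 2, 2, 2, 1, 3, 4, 0, 0]
acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)
print(acc, weighted_recall)

# Paired t-test on per-fold scores (hypothetical numbers, one pair per fold).
model_a = np.array([0.31, 0.33, 0.32, 0.30, 0.34])
model_b = np.array([0.36, 0.37, 0.35, 0.36, 0.38])
t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Cross-validation folds share most of their training data, so the independence assumption behind the paired t-test holds only approximately and the p-values are best read as indicative.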


In [None]:
from sklearn.tree import export_text
r = export_text(decision_tree, feature_names=df_train_85.columns.tolist())
print(r)

In [None]:
plot_tree(decision_tree)