# Grader Guidelines

This notebook __must__ be run first, as the CSVs generated here are used by some deep learning models.  You will also need to run the other notebooks from the same directory, since they will need the output CSVs.

Additionally, some cells in this notebook are resource intensive (in particular, you will probably want 16 GB RAM for data augmentation).  Use the constants in the next cell to control which parts are run.  Be aware that if you set these constants to `False`, *other notebooks will __not__ work.*

In [17]:
# Controls whether preprocessed, transformed data is saved to a CSV--the CSV is used by some of our deep learning models
EXPORT_TO_CSV = False

# Controls whether data augmentation is performed.  This is also saved to a CSV.
# 16 GB of RAM and at least 4 GB of storage space should be sufficient.
USE_DATA_AUGMENTATION = False

In [1]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from PIL import Image
from scipy.stats import randint
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_val_predict,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    LabelEncoder,
    Normalizer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.utils import shuffle
from torchvision.io import ImageReadMode, read_image
from torchvision.transforms import v2 as transforms

# Preprocessing

In [2]:
# Image size constants for convenience
IMAGE_X, IMAGE_Y = 200, 200     # Resize all images to these dimensions
IMAGE_DIM = (IMAGE_X, IMAGE_Y)
N_PIXELS = IMAGE_X * IMAGE_Y    # Number of pixels in resized images

N_IMAGES = 1584
AUGMENTED_ORIGINAL_RATIO = 3    # 3 augmented images for every original image

In [3]:
# Read in training and test data
train_df = pd.read_csv("CSV/train.csv", index_col='id')
test_df = pd.read_csv("CSV/test.csv", index_col='id')

#### Image Preprocessing

In [8]:
# Loads an image and applies some basic preprocessing
# to it (resizing, black and white)
def load_image(filename):
    return Image.open(filename) \
                .resize(IMAGE_DIM) \
                .convert(
                    mode='1',   # black and white
                    dither=Image.Dither.NONE,
                )

# Converts a Pillow image to a 1D numpy array of pixel data
def image_to_flat_array(img):
    return np.array(img).reshape((-1))

# Loads images with specified indices into one-dimensional
# pixel data and concatenates it all into a single dataframe
def images_to_df(indices):
    # TODO filepath hard-coded, change if needed
    imgs = [
        load_image(f'images/{i}.jpg') for i in indices
    ]

    # Converts each image to flat 1D representation
    df = pd.DataFrame(
               np.asarray(imgs).reshape((len(imgs), -1))
           ).set_index(indices)
    df.columns = df.columns.astype(str)     # Prevents some obscure errors later
    return df

In [11]:
# Load the data corresponding to each dataset
img_train_df = images_to_df(train_df.index).astype(int)
img_test_df = images_to_df(test_df.index).astype(int)

#### Traditional Machine Learning Pipeline

In [14]:
# Add the img_train columns to the preprocessor lists
# Specify the type of each column for us to know what preprocessing needs to be done to each column
numeric_features = [i for i in train_df.drop(columns='species').columns]
passthrough_features = [i for i in img_train_df]

In [15]:
# Define the pipelines for each type of data we have
numeric_pipeline = make_pipeline(
    # Normalize all of them to unit norm
    Normalizer()
)

# Define our column transformer/preprocessor itself
preprocessor = make_column_transformer(
    (numeric_pipeline, numeric_features),
    ('passthrough', passthrough_features)

)
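
Note that `Normalizer` rescales each row (sample) to unit norm; it does not standardize each column the way `StandardScaler` does. The following small sketch on made-up numbers illustrates the difference. The toy matrix is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy data: three samples, two features
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 2.0]])

# Normalizer works row by row: every sample ends up with L2 norm 1
row_normalized = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler works column by column: every feature ends up with mean 0
col_standardized = StandardScaler().fit_transform(X)
col_means = col_standardized.mean(axis=0)

print(row_normalized[0])  # the first sample [3, 4] becomes [0.6, 0.8]
print(row_norms)
print(col_means)
```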

#### Concatenated Dataframes & Column Transformations

In [16]:
# Concatenates the image data with the metadata into one dataframe
full_train_df = pd.concat([train_df, img_train_df], axis=1)
full_test_df = pd.concat([test_df, img_test_df], axis=1)

# We want to split X (big X because it is a matrix) and y from each other
full_X_train = full_train_df.drop(columns=['species'])
full_y_train = full_train_df['species']

# Funnily enough, the test has no ground truth...
full_X_test = full_test_df

# Now we want to fit our preprocessor onto our data, so we can actually transform it (then cast it to a DF)
full_X_train_transformed = pd.DataFrame(preprocessor.fit_transform(full_X_train))
full_X_test_transformed = pd.DataFrame(preprocessor.transform(full_X_test))

### Save the pre-processed data to csv for access in other jupyter notebooks

In [18]:
# This MUST be run (i.e. EXPORT_TO_CSV must be True) for deep learning models
# to work.
if EXPORT_TO_CSV:
    # Create the output directory first; to_csv does not create it
    os.makedirs('CSV/pre-processed', exist_ok=True)

    full_X_train_transformed.to_csv('CSV/pre-processed/full_X_train_transformed')
    full_y_train.to_csv('CSV/pre-processed/full_y_train')
    full_X_test_transformed.to_csv('CSV/pre-processed/full_X_test_transformed')
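
When reading these files back in another notebook, keep in mind that `to_csv` writes the index as the first column by default. The sketch below round-trips a tiny made-up dataframe through a temporary file to show the effect; the file name and values are hypothetical and not the real exported data.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for an exported dataframe (values are made up)
df = pd.DataFrame({"0": [0.1, 0.2], "1": [0.3, 0.4]}, index=[10, 11])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_export")  # hypothetical file name
    df.to_csv(path)

    # Without index_col, the saved index comes back as an extra data column
    naive = pd.read_csv(path)

    # With index_col=0, the first column is restored as the index
    restored = pd.read_csv(path, index_col=0)

print(list(naive.columns))
print(restored.shape, list(restored.index))
```

Forgetting `index_col=0` would silently add one extra feature column to the model input.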

# Models

In [78]:
# Borrowed from the class notebooks with a few modifications.
# Uses accuracy for the CV metric
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):

    scores = cross_validate(
        model, X_train, y_train, scoring='accuracy', return_train_score=True, **kwargs
    )

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()

    # Format each metric as "mean (+/- std)"
    out_col = [
        "%0.3f (+/- %0.3f)" % (mean, std)
        for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index).rename({'test_score': 'validation_score'})
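
For reference, here is a small sketch of what `cross_validate` returns before the helper summarizes it. It uses scikit-learn's bundled iris data and a trivial classifier purely as placeholders; neither is part of this project.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Placeholder data and model, only to show the structure of the result
X, y = load_iris(return_X_y=True)

scores = cross_validate(
    DummyClassifier(strategy="most_frequent"),
    X, y,
    scoring="accuracy",
    return_train_score=True,
    cv=5,
)

# One array per metric, with one entry per fold
print(sorted(scores))

# Mean and standard deviation across folds, one row per metric
summary = pd.DataFrame(scores).agg(["mean", "std"]).T
print(summary)
```

The helper above produces the same per-metric summary, formatted as strings and with `test_score` renamed to `validation_score`.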

### Manual Dummy Classifier

In [34]:
# Dummy classifier that randomly guesses the species a leaf belongs to
dummy = DummyClassifier(
    strategy="uniform",
    random_state=7         # A constant is used to improve reproducibility.
)

In [79]:

mean_std_cross_val_scores(dummy, full_X_train_transformed, full_y_train)

fit_time            0.212 (+/- 0.071)
score_time          0.005 (+/- 0.007)
validation_score    0.014 (+/- 0.012)
train_score         0.009 (+/- 0.005)
dtype: object

With 99 classes, uniform random guessing should be right about 1 time in 99 (roughly 1%). The DummyClassifier's CV accuracy of 1.4% is a little over that, as expected.
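
As a quick sanity check on the chance level, the sketch below simulates uniform random guessing over 99 classes with NumPy. It assumes the true labels are spread evenly over the classes; the sample count and seed are arbitrary choices.

```python
import numpy as np

N_CLASSES = 99
N_SAMPLES = 200_000  # large sample so the estimate is stable

rng = np.random.default_rng(0)
y_true = rng.integers(0, N_CLASSES, size=N_SAMPLES)   # evenly spread labels
y_guess = rng.integers(0, N_CLASSES, size=N_SAMPLES)  # uniform random guesses

expected = 1 / N_CLASSES
simulated = float(np.mean(y_true == y_guess))

print(f"expected {expected:.4f}, simulated {simulated:.4f}")
```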

### Random Forest

In [82]:
# Testing a random forest
forest = RandomForestClassifier(
    n_jobs=-1,
    warm_start=False,
    random_state = 42,  # For reproducibility
)

# No preprocessing/pipeline is required here, since DecisionTrees don't need scaling
forest_cv_scores = mean_std_cross_val_scores(forest, full_X_train, full_y_train, cv=5)
print(forest_cv_scores)

fit_time            2.827 (+/- 0.063)
score_time          0.293 (+/- 0.023)
validation_score    0.894 (+/- 0.011)
train_score         1.000 (+/- 0.000)
dtype: object


The gap between the training accuracy (1.000) and the validation accuracy (0.894) suggests some overfitting.

We then wanted to evaluate the RandomForest's accuracy on the image data, to see whether it would be better to drop those features altogether.  To that end, we trained two additional RandomForests:

- One on image data only
- One on metadata only

In [83]:
# Evaluate performance on the image data only
forest_img_pipeline = make_pipeline(
    make_column_transformer(
        ('passthrough', passthrough_features),  # Preserve image data features
        remainder='drop',   # Drop metadata features
    ),
    forest,
)

forest_img_cv_scores = mean_std_cross_val_scores(forest_img_pipeline, full_X_train, full_y_train, cv=5)
print(forest_img_cv_scores)

fit_time            2.736 (+/- 0.148)
score_time          0.178 (+/- 0.056)
validation_score    0.557 (+/- 0.015)
train_score         1.000 (+/- 0.000)
dtype: object


In [84]:
# Evaluate performance on the metadata only
forest_metadata_pipeline = make_pipeline(
    make_column_transformer(
        ('passthrough', numeric_features),  # Don't need to normalize numeric data when using a RandomForest
        remainder='drop',   # Drop image features
    ),
    forest,
)

forest_meta_cv_scores = mean_std_cross_val_scores(forest_metadata_pipeline, full_X_train, full_y_train, cv=5)
print(forest_meta_cv_scores)

fit_time            1.634 (+/- 0.111)
score_time          0.079 (+/- 0.009)
validation_score    0.976 (+/- 0.006)
train_score         1.000 (+/- 0.000)
dtype: object


Of the three RandomForests, the metadata-only one has the best CV accuracy (0.976, versus 0.894 with all features and 0.557 with image data only), although this did not necessarily translate into a good Kaggle score.  It seems that there is still some overfitting, despite the cross-validation score.  Some research indicates that sklearn's OOB score might be a better indicator of how well the model will generalize, which is something to try if we were to do this project again.
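
For anyone who wants to try the OOB idea, here is a minimal sketch of how the out-of-bag score can be requested from a `RandomForestClassifier`. It runs on synthetic data from `make_classification`, not on the leaf dataset, so the numbers say nothing about this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the leaf dataset)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# oob_score=True scores each sample using only the trees whose bootstrap
# sample did not include it, so no separate validation split is needed
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob = forest.oob_score_
train = forest.score(X, y)
print(f"train accuracy {train:.3f}, OOB accuracy {oob:.3f}")
```

The OOB accuracy is typically well below the training accuracy, which makes it a cheap second opinion alongside cross-validation.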

We then tried hyperparameter optimization on the metadata-only forest.

In [96]:
rscv = RandomizedSearchCV(
    forest_metadata_pipeline,
    {
        # Although more estimators would always seem to be better, even when we
        # searched between 20 and 1000 estimators, the best-scoring model
        # consistently had between 430 and 480 estimators, regardless of other parameters.
        'randomforestclassifier__n_estimators': randint(450, 500),
        'randomforestclassifier__max_depth': [10, 30, 80, None],
        # sqrt ~= 14, log2 ~= 8
        'randomforestclassifier__max_features': [5, 'log2', 'sqrt', 21],
    },
    n_jobs=-1,
    n_iter=20,
    random_state=42,
)

# Performs CV implicitly
rscv.fit(full_X_train, full_y_train)

In [98]:
print(rscv.best_params_)
print(rscv.best_score_)

{'randomforestclassifier__max_depth': 80, 'randomforestclassifier__max_features': 21, 'randomforestclassifier__n_estimators': 452}
0.9848484848484848
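
The `randomforestclassifier__` prefix in these parameter names comes from `make_pipeline`, which names each step after its lowercased class name. The sketch below shows one way to list the valid names; the column indices passed to the transformer are placeholders.

```python
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Same structure as the metadata-only pipeline, with placeholder columns
pipeline = make_pipeline(
    make_column_transformer(("passthrough", [0, 1]), remainder="drop"),
    RandomForestClassifier(),
)

# Step names are generated automatically from the class names
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Any key listed here can be used in a search space as <step>__<parameter>
forest_params = sorted(
    p for p in pipeline.get_params() if p.startswith("randomforestclassifier__")
)
print(forest_params[:5])
```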


*Since the RandomForest has the highest accuracy with metadata only, we may have better luck just throwing out the image data for this classifier type.*

### Voting Classifier

Can a RandomForest trained on the image data help improve the score of the metadata-only RandomForest by catching some edge cases?

In [106]:
voting = VotingClassifier(
    estimators=[
        ('meta', rscv.best_estimator_),
        ('img', forest_img_pipeline)
    ],
    voting='soft',
    n_jobs=-1,
)

X_train, X_val, y_train, y_val = train_test_split(full_X_train, full_y_train, train_size=0.8)

mean_std_cross_val_scores(voting, full_X_train, full_y_train)

fit_time            17.996 (+/- 2.557)
score_time           0.531 (+/- 0.075)
validation_score     0.986 (+/- 0.004)
train_score          1.000 (+/- 0.000)
dtype: object

Performance barely changes (0.986 validation accuracy, versus 0.985 for the tuned metadata-only forest), and the perfect training score suggests there may still be some overfitting.
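
With `voting='soft'` and no weights, the ensemble's predicted probabilities are the plain average of its members' probabilities. The sketch below checks this on scikit-learn's bundled iris data; the two member models are arbitrary stand-ins, not the forests used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Two arbitrary members, chosen only for illustration
voting = VotingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
).fit(X, y)

# Average the fitted members' class probabilities by hand
member_probas = [est.predict_proba(X) for est in voting.estimators_]
manual_average = np.mean(member_probas, axis=0)

ensemble_proba = voting.predict_proba(X)
print(np.allclose(manual_average, ensemble_proba))
```

This is why a strong member can dominate: a weaker member only changes the prediction when it is confident enough to shift the averaged probabilities.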

## Test Set Predictions for Kaggle

In [99]:
# Outputs a CSV of predictions for each classifier's submission to Kaggle
def create_report(clf, filename):
    # Predict probabilities for each class; the columns of predict_proba
    # follow the order of the fitted classifier's own classes_
    proba = pd.DataFrame(
        clf.predict_proba(full_X_test),
        columns=clf.classes_,
        index=test_df.index
    )

    # Output to CSV for submission
    os.makedirs('Reports', exist_ok=True)
    proba.to_csv(f'Reports/{filename}')
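
The column headers of the report have to line up with the columns of `predict_proba`, which follow the fitted classifier's `classes_` attribute (the sorted unique labels). A small sketch with made-up data shows this; the species names and feature values here are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up one-feature data; the labels are deliberately not in sorted order
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array(["Quercus", "Quercus", "Acer", "Acer", "Betula", "Betula"])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# classes_ holds the sorted labels, in the same order as predict_proba columns
proba = pd.DataFrame(clf.predict_proba(X), columns=clf.classes_)

print(list(proba.columns))
print(proba.sum(axis=1).round(6).tolist())  # each row of probabilities sums to 1
```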

In [102]:
# Dummy Classifier
create_report(
    dummy.fit(full_X_train_transformed, full_y_train), 
    'DummyClassifier.csv'
)

In [107]:
# Hyperparameter-optimized metadata-only RandomForest
create_report(
    rscv.best_estimator_.fit(full_X_train, full_y_train),
    'RandomForestMetadataOnly.csv'
)

In [109]:
# VotingClassifier with two RandomForests
create_report(
    voting.fit(full_X_train, full_y_train),
    'VotingClassifier.csv'
)