# dog_breeds_train

## Training models for running dog breeds app

Let us train the models needed for the dog breeds app.

In [1]:
# basic imports 
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

# importing core
from app.core import (
    chunks_df,
    build_metadata,
    extract_features
)

## Data

The data is organized into 120 folders, one per breed, each containing several dog pictures. The `build_metadata` function builds a simple dataframe with a single column, `breed`, indexed by the path to the corresponding image.
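To make the folder-to-dataframe mapping concrete, here is a minimal sketch of what such a function could look like. It is not the implementation in `app.core`: the name `build_metadata_sketch`, the `IMAGE_EXTENSIONS` set, and the rule for deriving the breed from the folder name are assumptions based on the paths shown in the output below.

```python
from pathlib import Path

import pandas as pd

# Assumption: only these extensions count as images.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}


def build_metadata_sketch(root='data/img'):
    """Hypothetical stand-in for build_metadata: one row per image file."""
    rows = []
    for path in sorted(Path(root).rglob('*')):
        # Skip directories and non-image files (e.g. macOS '.DS_Store').
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # Folder names look like 'n02097658-silky_terrier':
        # everything after the first '-' is taken as the breed.
        breed = path.parent.name.split('-', 1)[-1]
        rows.append((path.as_posix(), breed))
    df = pd.DataFrame(rows, columns=['pet_id', 'breed'])
    return df.set_index('pet_id')
```

Unlike the output below, this sketch filters by file extension, so a stray `.DS_Store` file would not end up as a row.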

In [2]:
# reading data
meta_df = build_metadata()
meta_df.head()

121it [00:06, 17.43it/s]


| pet_id | breed |
| --- | --- |
| data/img/.DS_Store | |
| data/img/n02097658-silky_terrier/n02097658_26.jpg | silky_terrier |
| data/img/n02097658-silky_terrier/n02097658_4869.jpg | silky_terrier |
| data/img/n02097658-silky_terrier/n02097658_595.jpg | silky_terrier |
| data/img/n02097658-silky_terrier/n02097658_9222.jpg | silky_terrier |


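As expected, we have 120 breeds and 20580 images. I joined the train and test sets of the original dataset to get as much data as possible. The counts printed below are each one higher (121 and 20581), apparently because the stray `data/img/.DS_Store` entry shown above is counted as an extra row with its own breed value.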

In [3]:
# number of unique breeds after filter
print('number of unique breeds:', meta_df['breed'].nunique())
print('number of rows in the dataframe:', meta_df['breed'].shape[0])

number of unique breeds: 121
number of rows in the dataframe: 20581


We reserve the images' paths for use later:

In [4]:
# creating list with paths
paths = meta_df.index.values

## Feature extraction 

The first step is extracting features from the images using a pretrained neural network. I chose `Xception` based on its good results in this [Kaggle Kernel](https://www.kaggle.com/gaborfodor/dog-breed-pretrained-keras-models-lb-0-3/#data), and because it is relatively lightweight, which allows quick inference.

In [5]:
# using a pre-trained net
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing import image

# instance of feature extractor
extractor = Xception(include_top=False, pooling='avg')

The function `extract_features` takes a list of paths and an `extractor` (the Xception net in this case) and returns a dataframe with the features. We save the dataframe so we don't need to rerun the extraction every time (it takes ~15 minutes on my machine).

In [6]:
# if we haven't extracted features yet, do it
if not os.path.exists('data/features.zip'):
    features_df = extract_features(paths, extractor)
    features_df.to_csv(
        'data/features.zip',
        compression=dict(method='zip', archive_name='features.csv')
    )
    
# read features
features_df = pd.read_csv('data/features.zip', index_col='pet_id')
features_df = features_df.sort_index()
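For readers without access to `app.core`, the following is a rough sketch of the shape such an extraction loop could take. It is an assumption, not the actual `extract_features`: the names `extract_features_sketch`, `predict`, and `load_image` are hypothetical. With Keras, `load_image` could wrap `image.load_img`, `image.img_to_array`, and `preprocess_input`, and `predict` could be `extractor.predict`.

```python
import numpy as np
import pandas as pd


def extract_features_sketch(paths, predict, load_image):
    """Hypothetical extraction loop: one feature row per readable image.

    predict:    callable mapping a batch (1, H, W, C) to features (1, D)
    load_image: callable mapping a path to a preprocessed (H, W, C) array,
                raising OSError or ValueError if the file is not a usable image
    """
    rows, kept = [], []
    for path in paths:
        try:
            img = load_image(path)
        except (OSError, ValueError):
            # Unreadable file: skip it, so the result may have fewer rows
            # than the list of paths that was passed in.
            continue
        rows.append(np.asarray(predict(img[np.newaxis]))[0])
        kept.append(path)
    return pd.DataFrame(rows, index=pd.Index(kept, name='pet_id'))
```

Because unreadable files are skipped, the returned index can be a strict subset of the input paths.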

As the extraction pipeline can't process some of the images, we need to realign our metadata index with the extraction's index, so they have the same images, in the same order:

In [7]:
# realign index with main df
meta_df = meta_df.loc[features_df.index]
meta_df.to_csv('data/metadata.csv')
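The realignment relies on label-based selection with `.loc`: passing the features index returns the metadata rows in exactly that order and drops any row that is not listed. A toy illustration with made-up file names and breeds:

```python
import pandas as pd

# Toy metadata: three images, in arbitrary order.
meta = pd.DataFrame(
    {'breed': ['pug', 'beagle', 'collie']},
    index=pd.Index(['c.jpg', 'a.jpg', 'b.jpg'], name='pet_id'),
)

# Toy features: 'b.jpg' could not be processed, and the index is sorted.
features = pd.DataFrame(
    {'f0': [0.3, 0.1]},
    index=pd.Index(['c.jpg', 'a.jpg'], name='pet_id'),
).sort_index()

# Select metadata rows by the features index: same images, same order.
aligned = meta.loc[features.index]
print(aligned)
```

After this step, row `i` of the features and row `i` of the metadata refer to the same image, which is what the modeling code below depends on.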

## Modeling

### Data preparation

In [8]:
# label encoder for target and splitter
from sklearn.preprocessing import LabelEncoder

# defining design matrix
X = features_df.copy().values

# defining target
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(meta_df['breed'])

### Training logistic regression and PCA

In [9]:
# PCA + logistic regression pipeline
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# instance of PCA
pca = PCA(n_components=1024, whiten=True)

# instance of logistic regression
lr = LogisticRegression(C=1e-2, multi_class='multinomial', penalty='l2', max_iter=200)

# model
model = Pipeline([('pca', pca), ('lr', lr)])

In [10]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import cross_val_predict

# evaluating with cross-validation (out-of-fold predicted probabilities)
preds = cross_val_predict(model, X, y, method='predict_proba', n_jobs=-1, verbose=3)

# cross-validated metrics
print(f'Accuracy: {accuracy_score(y, np.argmax(preds, axis=1)):.3f}')
print(f'Log-loss: {log_loss(y, preds):.3f}')

Accuracy: 0.883
Log-loss: 0.534


In [11]:
# fit model with all the data
model.fit(X, y)

# getting just the pca
pca = model.steps[0][1]

### Testing nearest neighbor model

In [12]:
from sklearn.neighbors import KNeighborsClassifier

# 'supervise' the PCA embedding: scale each component by the sum over
# classes of the absolute logistic regression coefficients
def supervised_transform(x):
    return np.abs(lr.coef_).sum(axis=0) * pca.transform(x)

# building nearest neighbor model
nn = KNeighborsClassifier(n_neighbors=30)

# evaluating with cross-validation (out-of-fold predicted probabilities)
preds = cross_val_predict(nn, supervised_transform(X), y, method='predict_proba', n_jobs=-1, verbose=3)

# cross-validated metrics
print(f'Accuracy: {accuracy_score(y, np.argmax(preds, axis=1)):.3f}')
print(f'Log-loss: {log_loss(y, preds):.3f}')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  3.0min remaining:  4.5min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.0min finished


Accuracy: 0.877
Log-loss: 0.740
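To see what the weighting in `supervised_transform` does to distances, here is a small numeric illustration with made-up coefficients (3 classes, 2 components). The values are hypothetical; only the arithmetic mirrors the function above.

```python
import numpy as np

# Made-up logistic regression coefficients: shape (n_classes, n_components).
coef = np.array([[ 2.0,  0.1],
                 [-1.0,  0.0],
                 [ 0.5, -0.1]])

# One weight per component: sum of absolute coefficients across classes.
weights = np.abs(coef).sum(axis=0)  # component 0 -> 3.5, component 1 -> 0.2

# Three points in the (already PCA-transformed) space.
Z = np.array([[ 1.0,  1.0],
              [ 1.0, -1.0],   # differs from point 0 only in component 1
              [-1.0,  1.0]])  # differs from point 0 only in component 0

# Broadcasting multiplies every row of Z by the per-component weights.
Z_weighted = weights * Z

# Euclidean distances from point 0, before and after weighting.
before = np.linalg.norm(Z[1:] - Z[0], axis=1)
after = np.linalg.norm(Z_weighted[1:] - Z_weighted[0], axis=1)
print(before, after)
```

Before weighting, both points are equally far from point 0. After weighting, the point that differs along the component the classifier relies on is much farther away, so the nearest-neighbor search favors neighbors that agree on the discriminative components.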


### Saving nearest neighbor model

In [13]:
from joblib import dump, load

# saving transform data
dump(lr.coef_, 'data/coef.compressed', compress=True)
dump(pca, 'data/pca.compressed', compress=True)

['data/pca.compressed']
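The two saved artifacts are enough to rebuild the transform elsewhere, for example at inference time. The sketch below shows the round trip on small random data in a temporary directory. The demo arrays and the name `supervised_transform_loaded` are assumptions for illustration; how the app actually loads the files is not shown in this notebook.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.decomposition import PCA

# Small random stand-ins for the real features and coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8))
pca_demo = PCA(n_components=3, whiten=True).fit(X_demo)
coef_demo = rng.normal(size=(4, 3))  # (n_classes, n_components)

with tempfile.TemporaryDirectory() as tmp:
    coef_path = os.path.join(tmp, 'coef.compressed')
    pca_path = os.path.join(tmp, 'pca.compressed')

    # Save both artifacts the same way as in the cell above.
    dump(coef_demo, coef_path, compress=True)
    dump(pca_demo, pca_path, compress=True)

    # Load them back, as a separate process would.
    coef_loaded = load(coef_path)
    pca_loaded = load(pca_path)


def supervised_transform_loaded(x):
    """Same weighting as supervised_transform, using the loaded artifacts."""
    return np.abs(coef_loaded).sum(axis=0) * pca_loaded.transform(x)


embedded = supervised_transform_loaded(X_demo[:5])
print(embedded.shape)
```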

### Saving transformed features

In [21]:
X_transformed = supervised_transform(X)
X_transformed = pd.DataFrame(X_transformed, index=features_df.index)

In [22]:
# saving in chunks of up to 5500 rows, one HDF5 file per chunk
for i, ck in enumerate(chunks_df(X_transformed, 5500)):
    ck.to_hdf(
        f'data/features/features_{i}.h5',
        key=f'features_{i}',
        complevel=5
    )
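The `chunks_df` helper comes from `app.core` and is not shown here. A plausible minimal version, assuming it simply yields consecutive row slices of at most the given size, could look like this (the name `chunks_df_sketch` is hypothetical):

```python
import pandas as pd


def chunks_df_sketch(df, size):
    """Yield consecutive row slices of df with at most `size` rows each."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]


# Toy example: 12 rows split into chunks of 5.
demo = pd.DataFrame({'x': range(12)})
sizes = [len(ck) for ck in chunks_df_sketch(demo, 5)]
print(sizes)  # [5, 5, 2]
```

If the chunks are written as above, they could be read back with `pd.read_hdf` using the same keys and stitched together with `pd.concat`.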