# Skeleton

This skeleton gives you a starting point.
It loads the training data and writes the submission file required by Kaggle,
so you can concentrate fully on data analysis and model building.

In [1]:
import pandas as pd
import numpy as np
import pickle

from collections import Counter

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

# Prepare data

First, we load and transform our data. Transformations can include splitting the data, adding new features, or other basic feature engineering. In this skeleton, we only split the data into a training set (X_train, y_train) and a validation set (X_val, y_val).

In [2]:
# Load the data and split into features and labels
with open('../../data/train.pkl', 'rb') as f:
    data_train = pickle.load(f)
X_data = data_train["images"]
y_data = data_train["labels"]

In [3]:
# Split features and labels into a train set (X_train, y_train) and a validation set (X_val, y_val).
# With no test_size given, train_test_split holds out 25% of the samples for validation.
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, random_state=42)
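If you want both parts of the split to keep the same class proportions, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on made-up data; the array shapes, class names, and the `_demo` variables are assumptions for illustration only and are not taken from the real data set.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 120 tiny "images" with 3 balanced classes.
X_demo = np.zeros((120, 4, 4, 3))
y_demo = np.repeat(["deer", "frog", "ship"], 40)

# stratify=y_demo asks for (approximately) the same class proportions
# in the train part and the validation part.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

train_counts = Counter(y_tr.tolist())
val_counts = Counter(y_va.tolist())
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```

Whether stratification matters here depends on how balanced the classes really are, which is something the data analysis step should tell you.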

# Data Analysis

We recommend you always start with data analysis to understand the data better.

In [4]:
# TODO: Do your data analysis here
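As one possible starting point, the sketch below collects a few basic facts: the number of samples, the shape and value range of one sample, and the class distribution. It runs on randomly generated stand-in data; the image shape (32x32x3), the class names, and the helper `summarize` are assumptions for illustration. In the notebook you would call it with `X_train` and `y_train` instead.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-in data so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(200, 32, 32, 3), dtype=np.uint8)
y_demo = rng.choice(["deer", "frog", "ship"], size=200)


def summarize(X, y):
    """Return basic facts about an image array and its labels."""
    counts = Counter(np.asarray(y).flatten().tolist())
    return {
        "n_samples": X.shape[0],
        "sample_shape": X.shape[1:],
        "dtype": str(X.dtype),
        "pixel_range": (int(X.min()), int(X.max())),
        "class_counts": dict(counts),
    }


summary = summarize(X_demo, y_demo)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The class counts are the part to look at first, because the baseline discussed later relies on the classes being balanced.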

# Define and train model

After the data analysis, we are ready to train a model. We recommend starting with a very simple model. We can try more complex ones later.

Try for example a Logistic Regression, Decision Tree, or Linear SVM.

In [5]:
# TODO: Define and train your model here
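A simple first model could look like the sketch below: each image is flattened into one feature vector, the features are standardized, and a logistic regression is fitted. This is only one option among many. The stand-in data, its shape, the helper `flatten_images`, and the choice of `max_iter=1000` are assumptions for illustration; in the notebook you would fit on `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 120 small "images" with two classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8, 8, 3))
y_demo = rng.choice(["deer", "frog"], size=120)
# Shift one class so there is a signal for the model to learn.
X_demo[y_demo == "frog"] += 1.0


def flatten_images(X):
    """Reshape (n_samples, height, width, channels) to (n_samples, n_features)."""
    return X.reshape(X.shape[0], -1)


# Standardize the pixel features, then fit a linear classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(flatten_images(X_demo), y_demo)

train_acc = model.score(flatten_images(X_demo), y_demo)
print(f"training accuracy: {train_acc:.3f}")
```

Training accuracy alone says little about generalization, so the number that matters is the one computed on the validation set in the next section.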

# Evaluate the model

After training our model, we evaluate it on the validation set. Note that this is a multi-class classification setting. Suitable metrics for multi-class classification include accuracy, precision, recall, and F1 score. They can all be imported from sklearn.metrics.
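For more than two classes, precision, recall, and F1 need an averaging strategy. The sketch below uses macro averaging (the unweighted mean over classes) on a tiny hand-made example; the labels and predictions are invented for illustration, and in the notebook you would pass `y_val` and your model's predictions instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions for six samples.
y_true = np.array(["deer", "frog", "ship", "deer", "frog", "ship"])
y_pred = np.array(["deer", "frog", "deer", "deer", "ship", "ship"])

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    # average="macro" computes the metric per class and takes the plain mean.
    "precision_macro": precision_score(y_true, y_pred, average="macro", zero_division=0),
    "recall_macro": recall_score(y_true, y_pred, average="macro", zero_division=0),
    "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
}

for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Macro averaging treats every class equally, which is a reasonable default when the classes are balanced.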

It is good practice to compare your first result to a simple baseline. For classification, rather than predicting random classes, it makes sense to always predict the majority class (the most frequent class in the training data). Here, this yields an accuracy of roughly 10% because the classes are balanced (if you haven't verified that, go back to your data analysis).

In [6]:
# Find the most frequent class in the training labels
counts = Counter(y_train.flatten())
mode = counts.most_common(1)[0][0]
print(mode)

deer


In [7]:
# Predict the majority class for every validation sample and score the result
y_val_pred = np.repeat(mode, y_val.shape[0])
accuracy_score(y_true=y_val, y_pred=y_val_pred)

0.09456
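The same majority-class baseline is also available in scikit-learn as `DummyClassifier` with `strategy="most_frequent"`, which has the advantage of offering the usual `fit`/`predict` interface. The sketch below uses invented labels and placeholder features (the dummy model ignores the feature values); all `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training labels in which "deer" is the most frequent class.
y_train_demo = np.array(["deer"] * 5 + ["frog"] * 3 + ["ship"] * 2)
X_train_demo = np.zeros((len(y_train_demo), 1))  # placeholder features

# Hypothetical validation labels.
y_val_demo = np.array(["deer", "frog", "ship", "deer"])
X_val_demo = np.zeros((len(y_val_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

y_val_pred_demo = baseline.predict(X_val_demo)
baseline_acc = accuracy_score(y_true=y_val_demo, y_pred=y_val_pred_demo)
print(y_val_pred_demo, baseline_acc)
```

Because the baseline behaves like any other estimator, it can be swapped for a real model later without changing the evaluation code.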

# Refit the model on the entire training set

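To make use of all available labeled data, we retrain the model on the entire training set (including the validation set). Note that the majority class can change when the validation data is added, because the classes are nearly balanced.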

In [8]:
# Find the most frequent class in all labeled data (train + validation)
counts = Counter(y_data.flatten())
mode_train_val = counts.most_common(1)[0][0]
print(mode_train_val)

frog
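For a real model, refitting could be sketched as follows: `sklearn.base.clone` creates an unfitted copy with the same hyperparameters, which is then fitted on all labeled data. The decision tree, its settings, and the random stand-in data are assumptions for illustration; in the notebook you would refit on `X_data` and `y_data`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 100 labeled samples, 75 of them used for training.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 5))
y_all = rng.choice(["deer", "frog"], size=100)
X_tr, y_tr = X_all[:75], y_all[:75]

# Model selected using the train/validation split.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# clone() copies the hyperparameters but not the fitted state,
# so the final model is trained from scratch on all labeled data.
final_model = clone(model).fit(X_all, y_all)

print(model.tree_.n_node_samples[0], final_model.tree_.n_node_samples[0])
```

After this step there is no held-out data left, so the validation score of the earlier model is the only performance estimate you have.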


# Predict classes for test set

If we are happy with the performance of our model on the validation set, we can apply it to the test set. 

In [9]:
with open('../../data/test.pkl', 'rb') as f:
    X_test = pickle.load(f)

In [10]:
y_test_pred = np.repeat(mode_train_val, X_test.shape[0])
y_test_pred_df = pd.DataFrame(y_test_pred, columns=['label'])
print(y_test_pred_df.shape)

(10000, 1)


To submit the predictions to Kaggle, we write them to a .csv file, which you can then upload manually.

In [11]:
y_test_pred_df.to_csv('../../out/train_mode_submission.csv', header=True, index_label='id')
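Before uploading, it can be worth reading the file back to confirm its layout. The sketch below writes a small hypothetical submission with the same `to_csv` arguments as above into a temporary directory and checks the resulting columns; the file name and the five made-up predictions are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for y_test_pred.
y_pred_demo = np.repeat("frog", 5)
submission = pd.DataFrame(y_pred_demo, columns=["label"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_submission.csv")
    # Same arguments as in the cell above: the index is written as column "id".
    submission.to_csv(path, header=True, index_label="id")
    loaded = pd.read_csv(path)

print(loaded.head())
```

The row index becomes the `id` column, so the order of the predictions must match the order of the test samples.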