# Extreme multi-class classification

ID2223 Scalable Machine Learning and Deep Learning

**Federico Baldassarre (fedbal@kth.se) and Beatrice Ionascu (bionascu@kth.se)**

In [None]:
%matplotlib inline

import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

plt.rcParams['figure.figsize'] = (15, 5)

# add the 'src' directory so we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

from utils.paths import data_raw_dir
from utils.paths import data_processed_dir

## Problem
The goal of this project is to classify products from [Cdiscount](https://www.cdiscount.com/), France’s largest non-food e-commerce company, based on the product images on the company's website. Predicting the category automatically helps ensure that new products are classified correctly.

## Data set

![data](../figures/data.png)

The Cdiscount dataset consists of 15 million images at 180x180 resolution, covering almost 9
million products. The training data consists of a list of 7,069,896 dictionaries, one per product. Each dictionary contains a product id, the category id of the product, and between
one and four images, stored in a list. In addition, each category id has a corresponding level1,
level2, and level3 name, in French.
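To make the record layout concrete, here is a minimal sketch of what one training record might look like and how its images could be counted. The key names (`_id`, `category_id`, `imgs`, `picture`), the placeholder values, and the helper `count_images` are assumptions for illustration and are not taken from this notebook.

```python
# Hypothetical training record: the key names and values below are
# assumptions made for illustration only.
example_product = {
    '_id': 0,                       # product id (placeholder)
    'category_id': 1000000000,      # category id (placeholder)
    'imgs': [                       # between one and four images
        {'picture': b'<jpeg bytes>'},
        {'picture': b'<jpeg bytes>'},
    ],
}


def count_images(products):
    """Return the total number of images across a list of product records."""
    return sum(len(product['imgs']) for product in products)


print(count_images([example_product]))
```

Because every product carries its own list of images, the number of images is always at least the number of products.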

In [None]:
categories = pd.read_pickle(os.path.join(data_processed_dir, 'categories.pickle'))
display(HTML(categories.filter(like='category', axis=1).head().to_html(index=False)))

In [None]:
product_distrib = pd.read_pickle(os.path.join(data_processed_dir, 'cat_id_prod_distrib.pickle'))
print('There are {} products.'.format(product_distrib.sum()))

In [None]:
image_distrib = pd.read_pickle(os.path.join(data_processed_dir, 'cat_id_img_distrib.pickle'))
print('There are {} images.'.format(image_distrib.sum()))

In [None]:
categ_counts = pd.read_csv(os.path.join(data_raw_dir, 'category_names.csv'))
print(categ_counts.nunique().to_frame('Category counts'))

The products are very unevenly distributed among the 5270 available categories. As seen below, many categories contain just a few products, while a few categories contain very many.

In [None]:
cat_id_distrib = pd.read_pickle(os.path.join(data_processed_dir, 'cat_id_prod_distrib.pickle'))
cat_1_distrib = pd.read_pickle(os.path.join(data_processed_dir, 'cat_1_prod_distrib.pickle'))
cat_2_distrib = pd.read_pickle(os.path.join(data_processed_dir, 'cat_2_prod_distrib.pickle'))
cat_3_distrib = pd.read_pickle(os.path.join(data_processed_dir, 'cat_3_prod_distrib.pickle'))
fig, ax = plt.subplots(2, 2, figsize=(16, 10))
ax = ax.ravel()
cat_1_distrib.hist(log=True, bins=50, ax=ax[0])
ax[0].set_title('Category Level 1 ({})'.format(cat_1_distrib.size))
ax[0].set_xlabel('Number of products')
cat_2_distrib.hist(log=True, bins=50, ax=ax[1])
ax[1].set_title('Category Level 2 ({})'.format(cat_2_distrib.size))
ax[1].set_xlabel('Number of products')
cat_3_distrib.hist(log=True, bins=50, ax=ax[2])
ax[2].set_title('Category Level 3 ({})'.format(cat_3_distrib.size))
ax[2].set_xlabel('Number of products')
cat_id_distrib.hist(log=True, bins=50, ax=ax[3])
ax[3].set_title('Category id ({})'.format(cat_id_distrib.size))
ax[3].set_xlabel('Number of products')

The categories with the most and least products are the following:

In [None]:
pd.merge(categories.filter(like='category'), 
         cat_id_distrib.sort_values(ascending=False).iloc[np.r_[0:5, -5:0]].to_frame('counts'),
         left_index=True, right_index=True).sort_values('counts', ascending=False)

## Approach
- Use Xception as a feature extractor.
- Architecture: train a few convolutional layers and a dense softmax layer.
- Compare negative sampling with the regular cross-entropy loss.
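The difference between the two losses can be sketched in NumPy. This is an illustration under stated assumptions, not the project's implementation: negatives are drawn uniformly without replacement, no sampling-probability correction is applied, and the function names, the feature size of 32, and the random weights are all hypothetical.

```python
import numpy as np


def full_softmax_loss(features, weights, biases, target):
    """Cross-entropy over all classes for one example."""
    logits = weights @ features + biases            # shape: (n_classes,)
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]


def sampled_softmax_loss(features, weights, biases, target, n_sampled, rng):
    """Cross-entropy over the target class plus `n_sampled` random negatives.

    Negatives are drawn uniformly without replacement from the other classes,
    so no sampling-probability correction is applied in this sketch.
    """
    n_classes = weights.shape[0]
    candidates = np.delete(np.arange(n_classes), target)
    negatives = rng.choice(candidates, size=n_sampled, replace=False)
    classes = np.concatenate(([target], negatives))  # target sits at index 0
    logits = weights[classes] @ features + biases[classes]
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]


rng = np.random.default_rng(0)
n_classes, n_features = 5270, 32                     # 32 is an arbitrary toy size
weights = rng.normal(scale=0.01, size=(n_classes, n_features))
biases = np.zeros(n_classes)
features = rng.normal(size=n_features)

print(full_softmax_loss(features, weights, biases, target=42))
print(sampled_softmax_loss(features, weights, biases, target=42,
                           n_sampled=100, rng=rng))
```

The sampled variant only touches `n_sampled + 1` rows of the weight matrix per example, which is the motivation for using it when the number of classes is large. In this sketch the sampled loss is never larger than the full loss, because its normaliser sums over a subset of the classes.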

## Pre-processing
### Label encoding
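
A minimal sketch of one possible label encoding, assuming the goal is to map the raw category ids to contiguous class indices that a softmax layer can use. The function `build_label_encoder` and the ids below are hypothetical; the encoding actually used in the project may differ.

```python
def build_label_encoder(category_ids):
    """Map each distinct category id to a contiguous index in [0, n_classes)."""
    unique_ids = sorted(set(category_ids))
    id_to_index = {cat_id: index for index, cat_id in enumerate(unique_ids)}
    index_to_id = unique_ids          # list position doubles as the class index
    return id_to_index, index_to_id


# Placeholder ids for illustration; the real ids come from the category table.
id_to_index, index_to_id = build_label_encoder(
    [1000000003, 1000000001, 1000000003])
print(id_to_index)
```

Keeping the inverse mapping (`index_to_id`) makes it possible to turn a predicted class index back into a category id.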

## Feature Extraction
### TfRecords



### Computation graph
![computation graph](../figures/computational_graph.png)

## Training

- We trained the network using two different loss functions.
- explain training and testing split
- explain epoch / step / batch size
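
The relation between epochs, steps, and batch size can be illustrated with a short calculation. The numbers below are hypothetical and chosen only to show the arithmetic; they are not the values used in training.

```python
import math


def steps_per_epoch(n_examples, batch_size):
    """Number of optimisation steps needed to see every example once."""
    return math.ceil(n_examples / batch_size)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
n_train_images = 1_000_000
batch_size = 256
print(steps_per_epoch(n_train_images, batch_size))
```

One step processes one batch, and one epoch is the number of steps needed to pass over the whole training set once; the last batch may be smaller than the others.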


## Results


![results](../figures/metrics_v2.png)

- classification accuracy and loss per image
- can't test on the test data since we don't have access to it ...



## Future work

- group images into products -- report results per product
- test the influence of the negative sample size in sampled softmax
- use the category hierarchy
- use OCR for books
- ...
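
As a sketch of the first item, per-image predictions could be grouped into per-product predictions by averaging the class probabilities of all images belonging to the same product. Averaging is only one possible aggregation rule, and the function `predict_per_product` and the toy values are assumptions for illustration.

```python
import numpy as np


def predict_per_product(image_probs, product_ids):
    """Average per-image class probabilities within each product.

    image_probs: array of shape (n_images, n_classes).
    product_ids: sequence of length n_images giving each image's product.
    Returns a dict mapping product id -> predicted class index.
    """
    image_probs = np.asarray(image_probs)
    product_ids = np.asarray(product_ids)
    predictions = {}
    for product_id in np.unique(product_ids):
        mean_probs = image_probs[product_ids == product_id].mean(axis=0)
        predictions[int(product_id)] = int(mean_probs.argmax())
    return predictions


# Toy probabilities for three images of two products (illustrative only).
toy_probs = [[0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]
toy_product_ids = [7, 7, 8]
print(predict_per_product(toy_probs, toy_product_ids))
```

In the toy data the two images of product 7 disagree, and the more confident image decides the product-level prediction.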