# Today you are a Machine Learning Engineer at the Department of New Products at Target Cosmetics!
This work relies on processed data from Kaggle https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop

This work is motivated by the publication https://arxiv.org/pdf/2102.01625.pdf

Further details are at: https://arxiv.org/pdf/2010.02503.pdf

### So far you have seen user-product interaction data that can be used to classify a user-product relationship as ending in purchase or no purchase, and to cluster (categorize) user behaviors.
### In this assignment, you have access to user-product level interactions without any insight into user behaviors. Your goal is to classify whether each product will sell at least 5 pieces in a month (denoted by `Purchased?` = 1) or not. The intention is to use as little product-level information as possible at first (price and product category only), and then to design a more complex system that ingests more product-level information.
### Labeled data is sparse, and the intention is to maximize Recall (so that no popular cosmetic is understocked). Digital overstocking is allowed since it will not cause disengagement in customers.
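
As a reminder of why recall is the metric of interest here: recall measures the share of truly popular products that the classifier flags. Every missed popular product (a false negative) lowers it, while overstocked products (false positives) do not. With $TP$ and $FN$ denoting true positives and false negatives for the `Purchased?` = 1 class:

$$\text{Recall} = \frac{TP}{TP + FN}$$

For example, if 10 products truly sell at least 5 pieces and the classifier flags 8 of them, recall is $8 / 10 = 0.8$, regardless of how many unpopular products are also flagged.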

In [None]:
# The session-level data that is mined for this work is as follows:
from IPython.display import Image
Image(filename='image10.png')

## This week you are helping plan the launch of new products! You start with minimal product information and then identify what other information is helpful for the task!

## The minimal product-level information available to you about the new products is their cost range and product category (cream, foundation, lipcolor, etc.).

## You have to figure out how to mine last month's cosmetic sales data, use the relevant features, and estimate which products will sell more (`Purchased?` = 1)

## Task 0: Getting to know the Data!

In [None]:
## Importing required Libraries
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sb

In [None]:
# Load the data from previous months (past)
Past = pd.read_csv("Past_month_products.csv")
print(Past.shape)
Past.head()

In [None]:
# Next, load the data regarding products to be launched next month
Next = pd.read_csv("Next_month_products.csv")
print(Next.shape)
Next.head()


### Only the `product_id`, `maxPrice`, `minPrice`, and `Category` columns are common to both the training and test data

# Task 1: Exploratory Data Analysis (EDA) and Data Preparation
## EDA: Doing your due diligence. Find the following:
1. Percentage of Purchased events in train data: 
2. Percentage of Purchased events in test data:
3. Are there any overlaps in product ID between train and test data?
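
The three checks above can be sketched on toy data as follows. This is only an illustration: `past_toy` and `next_toy` are made-up stand-ins for the real tables, and the variable names are hypothetical. It relies on the fact that the mean of a 0/1 column is the fraction of positives.

```python
import pandas as pd

# Toy stand-ins for the past (train) and next (test) product tables; all values are made up.
past_toy = pd.DataFrame({
    "product_id": [101, 102, 103, 104],
    "Purchased?": [1, 0, 0, 1],
})
next_toy = pd.DataFrame({
    "product_id": [201, 202, 103],
    "Purchased?": [0, 1, 0],
})

# The mean of a 0/1 column is the fraction of positives; scale it to a percentage.
pct_train = 100 * past_toy["Purchased?"].mean()
pct_test = 100 * next_toy["Purchased?"].mean()

# A column contains no repeated IDs exactly when `is_unique` is True.
train_ids_unique = past_toy["product_id"].is_unique
test_ids_unique = next_toy["product_id"].is_unique

# IDs present in both tables, found with a set intersection.
overlap_ids = sorted(set(past_toy["product_id"]) & set(next_toy["product_id"]))

print(pct_train, pct_test, train_ids_unique, test_ids_unique, overlap_ids)
```

In this toy example one ID (103) is deliberately shared between the two tables; whether the real data has any overlap is for you to find out.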

In [None]:
### START CODE HERE ###
y_train = None
print(f"Percentage of Purchased in Training data = {None}")
y_test = Next['Purchased?'].values
print(f"Percentage of Purchased in Test data = {None}")

# Verify that every product ID in the training data appears only once
print(f"Every product ID in the training data appears only once: {None}")
# Verify that every product ID in the test data appears only once
print(f"Every product ID in the test data appears only once: {None}")
# Determine whether any product IDs appear in both the training and test data
overlap = None
print(f"These product IDs are present in both the training and test data: {overlap}")
### END CODE HERE ###

## Next, create `X_train`, `y_train`, `X_test`, and `y_test`. Remember the following: 
1. The `Purchased?` column is the target
2. `X_train` and `X_test` should contain the same features
3. `product_id` should NOT be one of those features. Can you see why?
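
One possible way to satisfy these three requirements is sketched below on toy data. The tables, the `extra_past_only` column, and the assumption that `Category` is already numerically encoded are all hypothetical; adapt the idea to the real columns.

```python
import pandas as pd

# Toy tables; values and the extra column are made up for illustration.
past_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "maxPrice": [5.0, 12.5, 3.2],
    "minPrice": [4.0, 10.0, 3.0],
    "Category": [0, 2, 1],           # assumed to be numerically encoded already
    "extra_past_only": [7, 8, 9],    # hypothetical column absent from the new data
    "Purchased?": [1, 0, 0],
})
next_toy = pd.DataFrame({
    "product_id": [4, 5],
    "maxPrice": [6.0, 2.0],
    "minPrice": [5.5, 1.5],
    "Category": [2, 0],
    "Purchased?": [0, 1],
})

target = "Purchased?"
# Keep only columns present in both tables, in the order of the past table,
# then drop the identifier and the target so only shared features remain.
shared = [c for c in past_toy.columns if c in next_toy.columns]
features = [c for c in shared if c not in ("product_id", target)]

X_tr = past_toy[features].to_numpy()
y_tr = past_toy[target].to_numpy()
X_te = next_toy[features].to_numpy()
y_te = next_toy[target].to_numpy()

print(features, X_tr.shape, X_te.shape)
```

Deriving the feature list from the column intersection, rather than hard-coding it, keeps `X_tr` and `X_te` aligned even if one table carries extra columns.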

In [None]:
### START CODE HERE ###
def return_train_test_data(df_old, df_new):
    X_train = None
    y_train = None
    X_test  = None
    y_test  = None
    return X_train, y_train, X_test, y_test
### END CODE HERE ###
    
X_train, y_train, X_test, y_test = return_train_test_data(Past, Next)    
print(X_train.shape, y_train.shape, X_test.shape)

# Task 2, Baselining: Build the best classifier using the Past month's data to predict whether the Next month's products will be Purchased.
## Consider using AutoML to estimate the best classifier. Which features would you use from the training data?

In [None]:
# Uncomment the following line if using Colab
# !pip install tpot

In [None]:
# TPOT for classification
from tpot import TPOTClassifier
### START CODE HERE ###
# Instantiate and train a TPOT auto-ML classifier
# Set generations to 5, population_size to 40, and verbosity to 2 (so you can see each generation's performance)
tpot = None
None
# Evaluate the classifier on the test data
# By default, the scoring function is accuracy
print(None)
### END CODE HERE ###
tpot.export('tpot_products_pipeline.py')


## Use the appropriate lines of `tpot_products_pipeline.py` (and modify the relevant names) to write a function which returns the predicted labels generated by the best classifier which TPOT found 

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

def return_tpot_results(X_train, y_train, X_test):
    ### START CODE HERE ###
    exported_pipeline = None
    
    None
    prediction = None
    ### END CODE HERE ### 
    return prediction

pred = return_tpot_results(X_train, y_train, X_test)


## Evaluate the results of the best classifier which TPOT found
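
As a quick refresher on how these metrics relate to the confusion matrix, here is a small sketch on made-up labels (the `_toy` arrays are hypothetical and unrelated to the TPOT output). For binary labels, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up ground truth and predictions for eight products.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 0, 0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

acc_toy = accuracy_score(y_true_toy, y_pred_toy)    # (TP + TN) / total
prec_toy = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP)
rec_toy = recall_score(y_true_toy, y_pred_toy)      # TP / (TP + FN)
f1_toy = f1_score(y_true_toy, y_pred_toy)           # harmonic mean of precision and recall

print(cm_toy)
print(acc_toy, prec_toy, rec_toy, f1_toy)
```

Here two of the four truly purchased products are missed, so recall is 0.5 even though accuracy is 0.625; this gap is why accuracy alone is not a sufficient score for this task.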

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision
from sklearn.metrics import f1_score

### START CODE HERE ###
# TPOT confusion matrix
cmtp = None
acc  = None
rec  = None
prec = None
f1   = None
### END CODE HERE ###
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmtp)

# Task 3, Semi-supervised learning: Apply label spreading on the data and run performance analysis by cross validation.

Step 1: Combine `X_train` and `X_test`

Step 2: Combine `y_train` with a vector of -1 labels (one per row of `X_test`), which marks the test samples as unlabeled

Step 3: Run label spreading on complete data. Use knn spreading with `n_neighbors` varying as 1,3,5,7,9,11. What's the best neighborhood?
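
The three steps can be sketched end to end on synthetic data as shown below. This is a minimal illustration, not the assignment solution: the data comes from `make_classification`, the split sizes and variable names are arbitrary assumptions, and labels are kept as 1-D arrays for simplicity. The best `n_neighbors` on the real data may well differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in: 150 "labeled" rows and 50 rows treated as unlabeled.
X_all, y_all = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_lab, y_lab = X_all[:150], y_all[:150]
X_unl, y_unl_true = X_all[150:], y_all[150:]

# Step 1: stack the labeled and unlabeled features.
X_comb = np.concatenate([X_lab, X_unl], axis=0)
# Step 2: append -1 for every unlabeled row (-1 means "unknown" to LabelSpreading).
y_comb = np.concatenate([y_lab, np.full(len(X_unl), -1)])

# Step 3: fit one model per neighborhood size and score the unlabeled part.
recall_by_k = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = LabelSpreading(kernel="knn", n_neighbors=k, alpha=0.01)
    model.fit(X_comb, y_comb)
    # transduction_ holds the inferred label of every row passed to fit.
    preds = model.transduction_[len(X_lab):]
    recall_by_k[k] = recall_score(y_unl_true, preds)

best_k = max(recall_by_k, key=recall_by_k.get)
print(recall_by_k, best_k)
```

Selecting `best_k` by recall mirrors the stated goal of the assignment; swapping in F1-score or accuracy only requires changing the scoring call.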


### Concatenate `X_train` and `X_test`

In [None]:
### START CODE HERE ###
X = None
### END CODE HERE ### 
print(X.shape[0])
print(y_train.shape)

### Create an array shaped like a column of `X_test`, with each value equal to -1
### Make sure the array is a column vector

In [None]:
### START CODE HERE ###
y_hat = None
### END CODE HERE ###

### Concatenate `y_train` and `y_hat`
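
A small sketch of the shapes involved, using made-up arrays (the `_toy` names are hypothetical): the placeholder labels are built as a column vector, so the known labels are reshaped to a column before stacking.

```python
import numpy as np

y_lab_toy = np.array([1, 0, 1])   # known labels for three samples
X_unl_toy = np.zeros((2, 3))      # two unlabeled rows with three features each

# Column vector of -1, one row per unlabeled sample.
y_pad_toy = np.full((X_unl_toy.shape[0], 1), -1)

# Stack the known labels (reshaped to a column) on top of the placeholders.
y_full_toy = np.concatenate([y_lab_toy.reshape(-1, 1), y_pad_toy], axis=0)

print(y_full_toy.shape)             # (5, 1)
print(y_full_toy.ravel().tolist())  # [1, 0, 1, -1, -1]
```

Many scikit-learn estimators expect a 1-D label array, so calling `.ravel()` on the stacked column before fitting avoids a shape warning.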

In [None]:
### START CODE HERE ###
y = None
### END CODE HERE ###

### Instantiate and train the label-spreading model. Use a KNN kernel and set `alpha` to 0.01. Try the `n_neighbors` values mentioned above.

In [None]:
from sklearn.semi_supervised import LabelSpreading
### START CODE HERE ###
lp_model = None
None
### END CODE HERE ###

### Extract the label predictions (transductions) for the test data

In [None]:
### START CODE HERE ###
semi_sup_preds = None
### END CODE HERE ###

### Evaluate the test predictions against the true test labels

In [None]:
### START CODE HERE ###
cm   = None
acc  = None
rec  = None
prec = None
f1   = None
### END CODE HERE ###
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cm)

## Observe increase in recall by running label spreading. Tabulate your results

| Method       | Recall | F1-score | Accuracy |
|--------------|--------|----------|----------|
| AutoML       |        |          |          |
| Label Spread |        |          |          |

# Task 4, System Design for Zero Shot Learning:
So far we have been looking at 3 product-level features (min price, max price, product category) to classify whether a particular product will get purchased or not.
Now, let's say you have access to more information about each cosmetic item sold in the Past month and each item in the Next month's catalogue. Design a system to enable accurate identification of items that are more likely to be purchased.
Think through the following:
1. What additional data fields do you need per cosmetic in the Past and Next catalogues? How would you process these data fields?
2. You have access to images of each cosmetic. How will you use these images to extract relevant features for gauging interest in the new cosmetics?
3. Design an end-to-end system workflow using the additional cosmetic data and cosmetic images to predict purchasing popularity.
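
To make point 2 concrete, here is one very simple possibility, offered only as a starting point for discussion: describe each image by a normalized per-channel color histogram and append it to the tabular features. The function `color_histogram`, the random `toy_image`, and the `tabular` vector are all hypothetical. In practice an embedding from a pretrained image model would be a common alternative to the histogram.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Return normalized per-channel intensity histograms for an H x W x 3 uint8 image."""
    feats = []
    for ch in range(3):
        hist, _ = np.histogram(image[..., ch], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())  # each channel's histogram sums to 1
    return np.concatenate(feats)

# Random pixels standing in for a product photo.
rng = np.random.default_rng(0)
toy_image = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)

# Hypothetical tabular features: [minPrice, maxPrice, Category].
tabular = np.array([4.0, 6.5, 2.0])

# One combined feature vector per product: tabular fields followed by image features.
combined = np.concatenate([tabular, color_histogram(toy_image)])
print(combined.shape)  # (15,)
```

The combined vector can then be fed to the same classifiers used in Tasks 2 and 3, which lets you measure whether the image features improve recall over the price and category baseline.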

# We will discuss a sample solution in https://docs.google.com/presentation/d/1yhHFZO6vvTNBICr1dkZzbV0cdhaHsaSk/edit#slide=id.p1
## Make the required changes and put the picture corresponding to Your version of the final System Diagram in the following cell.
