# PROJECT

## k-means  

Cluster customers into groups:
- explain why each group forms (e.g., men/women, old/young)

## Similarity Score: An item vs. items a customer has bought  

Specificity:

Each item of clothing that a customer buys has characteristics, such as color, and each characteristic takes a value, such as blue or black. A specific characteristic-value pair, such as "color: black", can be considered a feature.  

Each item can be represented by a $c \times v$ sparse matrix $I$, where  
$c =$ the number of characteristics,  
$v =$ the largest number of values that any characteristic has,  
and $I_{j,k} = 1$ if the item has the $k$-th value for the $j$-th characteristic (the item has that "feature"), and 0 otherwise.  
(e.g., if item $I$ had the 2nd possible value (children) for the 4th characteristic (department), then $I_{4,2}=1$)

Similarly, a person's purchases can be represented by a $c \times v$ sparse matrix $P$ of the same shape,  
where $P$ is the sum of all item matrices in their purchase history, so $P_{j,k} =$ the total number of times the $k$-th value for the $j$-th characteristic appears.  

A useful consequence is that $I \cdot P^{T}$ is a square $c \times c$ matrix whose $j$-th diagonal entry counts how often the person has bought the item's value for characteristic $j$. The sum of the diagonal entries (the trace), divided by the total number of features in the person's purchase history, $|P| = \sum_{j,k} P_{j,k}$, will be our Similarity Score:

$$\frac{\sum \mathrm{diag}(I \cdot P^{T})}{|P|}$$
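
As a minimal sketch of the score above, assuming a toy catalogue with 2 characteristics and 3 possible values (the arrays and the `similarity_score` name are illustrative and not part of the project code):

```python
import numpy as np

# Hypothetical item matrix I: one row per characteristic, one column per value.
item = np.array([[1, 0, 0],
                 [0, 1, 0]])

# Hypothetical purchase history P: element-wise sum of the matrices of past purchases.
history = np.array([[2, 1, 0],
                    [0, 3, 0]])

def similarity_score(I, P):
    """Sum of the diagonal of I @ P.T, divided by the total feature count in P."""
    total = P.sum()
    if total == 0:
        return 0.0  # empty history: define the score as 0 to avoid dividing by zero
    return float(np.trace(I @ P.T) / total)

# The diagonal of item @ history.T is [2, 3], so the score is 5/6.
score = similarity_score(item, history)
```

Because only the diagonal is used, `np.sum(I * P)` gives the same numerator without forming the full matrix product.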


## Other variables for linear regression  

The other candidate variables for the linear regression are the price difference, the absolute seasonal difference in days, and possibly the popularity of the item (perhaps within the customer's cluster).  

For each item and person, these variables form an $n$-dimensional vector $r$.  
For one person with $m$ candidate items, the vectors stack into an $m \times n$ matrix $A$.  

Combined with the $m \times 1$ vector $b$ containing the binary outcome of whether each item was purchased, we have a linear regression model:  

$$Ax = b$$
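
Since $A$ usually has more rows than columns, $Ax = b$ is typically solved in the least-squares sense. A minimal sketch with NumPy follows; the numbers, the two example variables, and the added intercept column are assumptions for illustration only:

```python
import numpy as np

# Hypothetical design matrix: m = 5 candidate items, n = 2 variables
# (for example, price difference and seasonal difference in days).
A_toy = np.array([[0.10,  5.0],
                  [0.40, 60.0],
                  [0.05, 10.0],
                  [0.30, 90.0],
                  [0.20, 30.0]])

# Binary outcome: 1 if the item was purchased, 0 otherwise.
b_toy = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

# Append a column of ones so the model can learn an intercept (an assumption).
A_aug = np.hstack([A_toy, np.ones((A_toy.shape[0], 1))])

# Least-squares solution of A_aug @ x = b_toy.
x, residuals, rank, _ = np.linalg.lstsq(A_aug, b_toy, rcond=None)

predictions = A_aug @ x
```

Note that a linear model fitted to 0/1 outcomes is not constrained to predict values between 0 and 1, so the predictions are better read as scores than as probabilities.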


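In the expressions below, $A[i]$ denotes the one-hot encoded feature vector of article $i$ (a row of the encoded article matrix), not the regression matrix above.

Vector of features they have in common (element-wise product):  
$A[i] \odot A[j]$  

Number of features in common (dot product):  
$A[i] \cdot A[j]$  


If  
$C_1 = \sum \limits_{i=1}^{8} A[i]$

then the weighted similarity vector for a random article with index $r$ is:  
$C_{1} \odot A[r]$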
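
These three quantities can be sketched with NumPy as follows, assuming three made-up one-hot rows over five features (`A_toy` and the chosen indices are hypothetical):

```python
import numpy as np

# Hypothetical one-hot rows for three articles over five features.
A_toy = np.array([[1, 0, 1, 1, 0],
                  [1, 1, 1, 0, 0],
                  [0, 0, 1, 1, 0]])

i, j = 0, 1
common = A_toy[i] * A_toy[j]          # 1 where both articles have the feature
n_common = int(A_toy[i] @ A_toy[j])   # number of shared features

# Feature counts over a customer's purchases (here: articles 0 and 1).
C1 = A_toy[:2].sum(axis=0)

# Weighted similarity vector for candidate article r: each feature of the
# candidate is weighted by how often the customer has bought it.
r = 2
weighted = C1 * A_toy[r]
```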

## Training  

The sample size for each customer is very small, so we bootstrap the data in a way inspired by k-fold cross-validation.  
When calculating the similarity score for a purchased item, subtract its matrix $I$ from $P$, along with those of a few other randomly chosen items, their number proportional to the total number of items the customer has bought.
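
One way to read this hold-out step is sketched below. The `holdout_history` helper, its parameters, and the toy vectors are assumptions for illustration, not the project's implementation:

```python
import random
import numpy as np

def holdout_history(item_vectors, target, alpha, rng):
    """Return P with the target item and a random share of other items removed.

    item_vectors: list of 0/1 feature vectors, one per purchase (hypothetical input)
    target: index of the purchase being scored
    alpha: fraction of the remaining purchases to drop as well
    """
    P = np.sum(item_vectors, axis=0)
    others = [k for k in range(len(item_vectors)) if k != target]
    n_extra = int(alpha * len(others))  # proportional to the history size
    dropped = [target] + rng.sample(others, n_extra)
    for k in dropped:
        P = P - item_vectors[k]
    return P, dropped

rng = random.Random(0)  # fixed seed so the sketch is reproducible
vectors = [np.array([1, 0, 1]), np.array([0, 1, 1]),
           np.array([1, 1, 0]), np.array([1, 0, 0])]
P_reduced, dropped = holdout_history(vectors, target=0, alpha=0.5, rng=rng)
```

Repeating the call with different random draws yields several training observations from a single short purchase history.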




In [1]:
import os 
import re
import math
import random
from random import randint, sample
import statistics
import pandas as pd
import numpy as np
from numpy import unique
from numpy.random import random_sample
from numpy import sqrt, dot, array, diagonal, mean, transpose, eye, diag, zeros
from numpy.linalg import inv, qr
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

In [2]:
articles = pd.read_csv("articles.csv")
customers = pd.read_csv("customers.csv")
transactions_raw = pd.read_csv("transactions_train.csv").drop("sales_channel_id", axis = 1)

In [9]:
def encode(data):
    """Returns a fitted one-hot encoder"""
    encoder = OneHotEncoder()
    encoder.fit(data)
    return encoder

def view(enc_article):
    """
    Takes: a similarity vector 
    Returns: a list of characteristics to print
    """
    global features
    if len(enc_article) != len(features):
        return ["Viewing: They're not the same length"]
    else:
        chars = []
        for i in range(len(enc_article)):
            if enc_article[i] == 1:
                chars.append(features[i])
        return chars
    
def prune_customers(transactions, n_transactions, min_buy):
    """
    Returns a dictionary of {customer_id: list of (article_id, price, t_dat) tuples}
    built from the last n_transactions rows, keeping only customers with at
    least min_buy purchases who also appear in the customers table.
    """
    # because dropping NA in customers first, not all transactions correspond to a customer
    customers_no_NA = set(customers["customer_id"])  # set gives fast membership tests
    transactions_new = transactions.iloc[-n_transactions:]  # at 30K:: iloc[31758324] :: last one would be that num + 29,999
    unique_customer_ids = unique(transactions_new["customer_id"])
    dd = {customer_id: [] for customer_id in unique_customer_ids}
    # access columns by name instead of by position
    for row in transactions_new.itertuples(index=False):
        dd[row.customer_id].append((row.article_id, row.price, row.t_dat))
    dd = {customer: dd[customer] for customer in dd if len(dd[customer]) >= min_buy}
    return {customer: dd[customer] for customer in dd if customer in customers_no_NA}

def prune_customersDF(_customers, _customer_transactions):
    """
    Used to shrink customersDF into only the ones selected 
    Returns: a DF
    """
    # uses indices for iloc
    indices = []
    customer_ids = list(_customers["customer_id"])
    IDs = list(_customer_transactions.keys())
    for _id in IDs:
        indices.append(customer_ids.index(_id))
    return _customers.iloc[indices]

# get all the relevant articles
def prune_articlesDF(customer_transactions, raw_articles):
    """ (for the sake of runtime)
    Returns a DF with one row per purchase (duplicates kept), in cart order,
    so that row i lines up with the i-th purchase across all carts
    """
    raw_article_ids = list(raw_articles["article_id"])
    indices = []
    for cart in customer_transactions:
        for article in customer_transactions[cart]:
            indices.append(raw_article_ids.index(article[0]))
    return raw_articles.iloc[indices]
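
The helpers above call `list.index` once per lookup, which rescans the list each time. One possible alternative, sketched here with a hypothetical `first_position_map` helper and toy ids, is to build the id-to-position map once:

```python
def first_position_map(ids):
    """Map each id to the position of its first occurrence, like list.index."""
    positions = {}
    for pos, _id in enumerate(ids):
        positions.setdefault(_id, pos)  # keep the first position only
    return positions

# Toy article ids (hypothetical), with one duplicate.
raw_ids = [101, 102, 103, 102]
lookup = first_position_map(raw_ids)

# Positions for every article in a toy cart, one dictionary lookup each.
cart = [103, 101, 103]
indices = [lookup[a] for a in cart]
```

The resulting `indices` list can be passed to `DataFrame.iloc` exactly as in the functions above.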

In [13]:
customer_transactions = prune_customers(transactions_raw, 50000, 10)
customers = prune_customersDF(customers, customer_transactions)
articles = prune_articlesDF(customer_transactions, articles)

## "product_type_name" might be very good to use; nah is 45k unique vals
drop1 = ["article_id", "product_code", "prod_name", "product_type_no", "graphical_appearance_no"]
drop2 = ["colour_group_code", "perceived_colour_value_id", "perceived_colour_master_id", "department_no"]
drop3 = ["index_code", "index_group_no", "section_no"]  
drop4 = ["garment_group_no", "detail_desc"]  # , "colour_group_name", "perceived_colour_master_name"
articles_enc = articles.drop(columns=drop1)
articles_enc = articles_enc.drop(columns=drop2)
articles_enc = articles_enc.drop(columns=drop3)

# the articles to encode
articles_enc = articles_enc.drop(columns=drop4)

# didn't drop: "product_type_name", "department_name"

articles_enc.head()

264

Unnamed: 0,product_type_name,product_group_name,graphical_appearance_name,colour_group_name,perceived_colour_value_name,perceived_colour_master_name,department_name,index_name,index_group_name,section_name,garment_group_name
104954,Top,Garment Upper body,Solid,Black,Dark,Black,Jersey fancy,Ladieswear,Ladieswear,Womens Everyday Collection,Jersey Fancy
53914,Trousers,Garment Lower body,Solid,Blue,Medium Dusty,Blue,Trousers,Divided,Divided,Divided Collection,Trousers
17171,Trousers,Garment Lower body,Denim,Black,Dark,Black,Trouser,Ladieswear,Ladieswear,Womens Everyday Collection,Trousers
104341,Bodysuit,Garment Upper body,Lace,Black,Dark,Black,Jersey fancy,Ladieswear,Ladieswear,Womens Everyday Collection,Jersey Fancy
104341,Bodysuit,Garment Upper body,Lace,Black,Dark,Black,Jersey fancy,Ladieswear,Ladieswear,Womens Everyday Collection,Jersey Fancy


# A is one hot encoded articles

In [29]:
encoder_A = encode(articles_enc)  # 19x631 max
len(encoder_A.categories_), sum([len(cat) for cat in encoder_A.categories_])
# one-hot encode every article row into a 0/1 vector, so that the dot product
# of two rows counts the features they share
A = encoder_A.transform(articles_enc).toarray()
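
To illustrate what the encoded rows look like, here is a hand-rolled one-hot encoding of a toy table. It is a sketch in the spirit of `OneHotEncoder`, not its implementation, and the table, `categories`, `offsets`, and `one_hot` names are hypothetical:

```python
import numpy as np

# Toy table (hypothetical): two categorical columns, three articles.
rows = [("Top", "Black"),
        ("Bodysuit", "Black"),
        ("Top", "Blue")]

# One sorted vocabulary per column.
categories = [sorted({row[c] for row in rows}) for c in range(2)]

# Column offset of each category block in the encoded vector.
offsets = np.cumsum([0] + [len(cat) for cat in categories[:-1]])

def one_hot(row):
    """Encode one row as a 0/1 vector with a single 1 per category block."""
    vec = np.zeros(sum(len(cat) for cat in categories))
    for c, value in enumerate(row):
        vec[offsets[c] + categories[c].index(value)] = 1.0
    return vec

encoded = np.array([one_hot(row) for row in rows])
shared = encoded[0] @ encoded[1]  # articles 0 and 1 share only "Black"
```

Every encoded row sums to the number of columns in the table, which is why the dot product of two rows can never exceed that number.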

In [30]:
# a list of every feature
features = []
article_ids = []
purchases = []

for cat in encoder_A.categories_:
    for feature in cat:
        features.append(feature)

for cart in customer_transactions.values():
    for purchase in cart:
        article_ids.append(purchase[0])
        purchases.append(purchase)

In [34]:
#### what features looks like
features[:9]

['Bag',
 'Ballerinas',
 'Beanie',
 'Belt',
 'Bikini top',
 'Blazer',
 'Blouse',
 'Bodysuit',
 'Boots']

In [35]:
article_ids[:8]

[929397001,
 706016038,
 573085057,
 921671001,
 921671001,
 923534001,
 914441001,
 573085028]

In [36]:
purchases[:8]

[(929397001, 0.0220169491525423, '2020-09-22'),
 (706016038, 0.0338813559322033, '2020-09-22'),
 (573085057, 0.0338813559322033, '2020-09-22'),
 (921671001, 0.0508305084745762, '2020-09-22'),
 (921671001, 0.0508305084745762, '2020-09-22'),
 (923534001, 0.0169322033898305, '2020-09-22'),
 (914441001, 0.0338813559322033, '2020-09-22'),
 (573085028, 0.0338813559322033, '2020-09-22')]

In [24]:
# after encoding
len(A[0]), len(features)

(382, 382)

In [37]:
#### checking
len(articles_enc) == len(A), len(articles) == len(A)

(True, True)

In [38]:
category_names = [col for col in articles_enc.columns]
# the black lingerie one
print(article_ids[3], "\n")

for item in zip(category_names,view(A[3])):
    print(item[0],":" ,item[1])

921671001 

product_type_name : Bodysuit
product_group_name : Garment Upper body
graphical_appearance_name : Lace
colour_group_name : Black
perceived_colour_value_name : Dark
perceived_colour_master_name : Black
department_name : Jersey fancy
index_name : Ladieswear
index_group_name : Ladieswear
section_name : Womens Everyday Collection
garment_group_name : Jersey Fancy


In [39]:
####

for i in range(500):
    if 11 > dot(A[i], A[3]) > 8:
        print(article_ids[i])
        print(f"VIEWING: i: {i}, j: {3},  dot(A[i], A[j]): {dot(A[i], A[3])}")
        print(f"view(A[i]): {view(A[i])}")
        print(f"view(A[j]): {view(A[3])}", "--\n\n")

929397001
VIEWING: i: 0, j: 3,  dot(A[i], A[j]): 9.0
view(A[i]): ['Top', 'Garment Upper body', 'Solid', 'Black', 'Dark', 'Black', 'Jersey fancy', 'Ladieswear', 'Ladieswear', 'Womens Everyday Collection', 'Jersey Fancy']
view(A[j]): ['Bodysuit', 'Garment Upper body', 'Lace', 'Black', 'Dark', 'Black', 'Jersey fancy', 'Ladieswear', 'Ladieswear', 'Womens Everyday Collection', 'Jersey Fancy'] --


918890002
VIEWING: i: 64, j: 3,  dot(A[i], A[j]): 9.0
view(A[i]): ['Top', 'Garment Upper body', 'Solid', 'Black', 'Dark', 'Black', 'Jersey fancy', 'Ladieswear', 'Ladieswear', 'Womens Everyday Collection', 'Jersey Fancy']
view(A[j]): ['Bodysuit', 'Garment Upper body', 'Lace', 'Black', 'Dark', 'Black', 'Jersey fancy', 'Ladieswear', 'Ladieswear', 'Womens Everyday Collection', 'Jersey Fancy'] --


918890002
VIEWING: i: 233, j: 3,  dot(A[i], A[j]): 9.0
view(A[i]): ['Top', 'Garment Upper body', 'Solid', 'Black', 'Dark', 'Black', 'Jersey fancy', 'Ladieswear', 'Ladieswear', 'Womens Everyday Collection', '

In [40]:
n_features = len(features)

In [41]:
transactions_raw.iloc[0]

t_dat                                                 2018-09-20
customer_id    000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...
article_id                                             663713001
price                                                   0.050831
Name: 0, dtype: object

In [42]:
i = 0
category_lengths = []
for col in articles_enc.columns:
    i += len(list(unique(articles[col])))
    category_lengths.append(i)

def convertToCat(calculated):
    """
    Collapses a length-n_features vector into one sum per category,
    using the cumulative category boundaries in category_lengths
    """
    ret = []
    start = 0
    for end in category_lengths:
        ret.append(sum(calculated[start:end]))
        start = end
    return ret
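
The per-category sums can also be computed in one vectorised call. The sketch below uses `np.add.reduceat` on toy values and assumes every category has at least one value; `boundaries` and `calculated` are hypothetical stand-ins for `category_lengths` and the element-wise product:

```python
import numpy as np

# Hypothetical cumulative boundaries: 3 categories of sizes 2, 3 and 1.
boundaries = [2, 5, 6]

# Element-wise product of a customer's feature counts and an article's one-hot row.
calculated = np.array([0.0, 4.0, 0.0, 2.0, 0.0, 1.0])

# Start index of each category block: 0, then every boundary but the last.
starts = [0] + boundaries[:-1]

# Sum each block calculated[start:next_start]; the last block runs to the end.
per_category = np.add.reduceat(calculated, starts)
```

Since an article has exactly one value per category, each entry of the result is the number of times the customer bought that value.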

In [43]:
len(articles)

3583

In [45]:
article_prices = {}
for val in customer_transactions.values():
    for tr in val:
        article_prices[tr[0]] = tr[1]

In [46]:
n_articles = len(articles)
#all_customer_inds = [i for i in range(len())]
all_article_inds = [i for i in range(len(A))]

In [32]:
class Customer(object):
    
    def initialize(self):
        global article_ids, n_features
        self.article_inds = [article_ids.index(transaction[0]) for transaction in self.cart]
        for ind in self.article_inds:
            self.CF += A[ind]

    def __init__(self, customer_id, purchases):
        self.id = customer_id
        self.cart = purchases  # [(art_id),(price),(tdat)]
        self.article_inds = []  # to index A
        self.n = len(self.cart)
        
        self.CF = zeros(n_features)
        self.X = [] # list of CF copies
        self.Y = []
        self.initialize()

    def generateTrueObs(self, alpha, rep):  # True and False now
        sample_size = int(self.n * alpha)
        for i in range(rep):  # 15
            selected = sample(self.article_inds, sample_size)  # selected inds
            remain = [ind for ind in self.article_inds if ind not in selected]
            # make X
            CF_true = sum([A[ind] for ind in selected])
            # TRUE OBS
            for rem in remain: # remaining inds
                # formula
                calculated = list(CF_true*A[rem])
                calculated = convertToCat(calculated)
                # added price
                price = self.cart[self.article_inds.index(rem)][1]
                Xi = [price] + calculated
                self.X.append(Xi)
                self.Y.append([1])
            # NEWLY ADDING :: FALSE!
            for ind in [randint(0,n_articles-1) for i in range(rep)]:
                if ind not in self.article_inds:
                    calculated = list(CF_true*A[ind])
                    calculated = convertToCat(calculated)
                    price = article_prices[article_ids[ind]]
                    Xi = [price] + calculated
                    self.X.append(Xi)
                    self.Y.append([0])

    
class Model(object):
    def __init__(self):
        global customer_transactions
        self.customers = createCustomers(customer_transactions)
        self.n = len(self.customers)
                        
    def convertToVec(self, customer, ind):
        """converts indice of article to formula vec """
        calculated = list(customer.CF*A[ind])
        calculated = convertToCat(calculated)
        price = article_prices[article_ids[ind]]
        Xi = [price] + calculated
        return Xi
    
    def fit(self, alpha, rep):
        for i in range(self.n):
            self.customers[i].generateTrueObs(alpha, rep)
            #self.customers[i].generateFalseObs(rep)
    
        

def createCustomers(__customer_transactions):
    """Builds one Customer per entry of the {customer_id: purchases} dictionary"""
    Customers = []
    for __customer_id in __customer_transactions:
        Customers.append(Customer(__customer_id, __customer_transactions[__customer_id]))
    return Customers

model = None
model = Model()
model.fit(0.5, 4)

In [33]:
model.convertToVec(model.customers[0], 1)

[0.0338813559322033, 4.0, 4.0, 5.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 3.0]

In [34]:
model.customers[0].cart

[(929397001, 0.0220169491525423, '2020-09-22'),
 (706016038, 0.0338813559322033, '2020-09-22'),
 (573085057, 0.0338813559322033, '2020-09-22'),
 (921671001, 0.0508305084745762, '2020-09-22'),
 (921671001, 0.0508305084745762, '2020-09-22'),
 (923534001, 0.0169322033898305, '2020-09-22'),
 (914441001, 0.0338813559322033, '2020-09-22'),
 (573085028, 0.0338813559322033, '2020-09-22'),
 (817110002, 0.0338813559322033, '2020-09-22'),
 (919273002, 0.0423559322033898, '2020-09-22')]

In [35]:
article_ids[0], view(A[0])

(929397001,
 ['Top',
  'Garment Upper body',
  'Solid',
  'Black',
  'Dark',
  'Black',
  'Jersey fancy',
  'Ladieswear',
  'Ladieswear',
  'Womens Everyday Collection',
  'Jersey Fancy'])

In [36]:
from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares model on customer 0's bootstrapped observations.
# Y is kept as a column of [0]/[1] labels, so predictions come back 2-D.
skM = LinearRegression()
skM.fit(array(model.customers[0].X), array(model.customers[0].Y))

LinearRegression()

In [48]:
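def test(model):
    """Prints the articles that skM (trained on customer 0) scores above 4.3"""
    already = [arr[0] for arr in model.customers[0].cart]
    for i in range(n_articles):
        # i = article index (a row of A)
        Xtest = model.convertToVec(model.customers[0], i)  # person 0
        pred = skM.predict([Xtest])
        # skip articles the customer already bought; the "0" prefix presumably
        # restores a leading zero dropped when the id was parsed as an integer
        if pred[0] > 4.3 and article_ids[i] not in already:
            print("0",article_ids[i], pred, sep="")
        
test(model)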

0768229001[[4.35557513]]
0915292001[[4.44008715]]
0791587001[[4.44008715]]
0885547001[[4.36698792]]
0876043001[[4.4454965]]
0868283001[[4.51669516]]
0915921002[[4.51318639]]
0915921002[[4.51318639]]
0915921002[[4.51318639]]
0441386001[[4.66019331]]
0924453003[[4.36698792]]
0915292001[[4.44008715]]
0915292001[[4.44008715]]
0851370002[[4.44008715]]
0915921002[[4.51318639]]
0915292001[[4.44008715]]
0907702002[[4.30148169]]
0907702002[[4.30148169]]
0907951001[[4.30148169]]
0907951001[[4.30148169]]
0740519002[[4.55054445]]
0875350001[[4.52780624]]
0924453003[[4.36698792]]
0862063001[[4.39622761]]
0862063001[[4.39622761]]
0907951001[[4.30148169]]
0791587001[[4.44008715]]
0740519002[[4.55054445]]
0768147001[[4.34534123]]
0927576002[[4.44008715]]
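
The threshold of 4.3 is specific to this run, and the same article can be printed several times because the article rows contain one entry per purchase. One alternative, sketched here with hypothetical ids and scores and a hypothetical `top_k` helper, is to rank unique articles and keep the best k:

```python
def top_k(article_ids, scores, already_bought, k):
    """Return the k best-scoring unique article ids not yet purchased."""
    best = {}
    for art_id, score in zip(article_ids, scores):
        if art_id in already_bought:
            continue
        # keep the highest score seen for each article id
        if art_id not in best or score > best[art_id]:
            best[art_id] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy inputs (hypothetical ids and scores); id 12 appears twice.
ids = [11, 12, 12, 13, 14]
scores = [0.2, 0.9, 0.9, 0.5, 0.7]
recommended = top_k(ids, scores, already_bought={14}, k=2)
```

Ranking avoids having to retune the cut-off whenever the model or the customer changes.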


In [None]:
# trying the entire model with customer c1
testItems = purchases[:100] # (3)(3)(3)
testArray = [c1.calc_array(ti) for ti in testItems]  # [2][2][2]
pred = lm.predict(testArray)
for i in range(len(pred)):
    if pred[i] > 0.55:
        v = article_ids.index(testItems[i][0])
        print(view(A[v]))
        print(pred[i])



In [None]:
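def createCustomers(__customer_transactions):
    Customers = []
    for c_id in __customer_transactions:
        Customers.append(Customer(c_id, __customer_transactions[c_id]))
    return Customers

def getXi(CF, ind):
    # assumed helper (not defined in the original cell): same feature row as
    # Model.convertToVec, but built from an arbitrary feature-count vector CF
    calculated = convertToCat(list(CF*A[ind]))
    price = article_prices[article_ids[ind]]
    return [price] + calculated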

class Customer(object):
    def initialize(self):
        self.article_inds = [article_ids.index(transaction[0]) for transaction in self.cart]
        for ind in self.article_inds:
            self.CF += A[ind]

    def __init__(self, customer_id, purchases):
        self.id = customer_id
        self.cart = purchases
        self.article_inds = []
        self.CF = zeros(n_features)
        self.X, self.Y, self.n = [], [], len(self.cart)
        self.initialize()

    def generateObs(self, alpha, rep):
        sample_size = int(self.n * alpha)
        for i in range(rep):
            selected = sample(self.article_inds, sample_size)  
            remain = [ind for ind in self.article_inds if ind not in selected]
            CF = sum([A[ind] for ind in selected])
            for rem in remain:
                Xi = getXi(CF, rem)
                self.X.append(Xi)
                self.Y.append([1])
            for ind in [randint(0,n_articles-1) for i in range(rep*2)]:
                if ind not in self.article_inds:
                    Xi = getXi(CF, ind)
                    self.X.append(Xi)
                    self.Y.append([0])

class Model(object):
    def __init__(self):
        self.customers = createCustomers(customer_transactions)
        self.n = len(self.customers)
                        
    def convertToVec(self, customer, ind):
        calculated = list(customer.CF*A[ind])
        calculated = convertToCat(calculated)
        price = article_prices[article_ids[ind]]
        Xi = [price] + calculated
        return Xi
    
    def fit(self, alpha, rep):
        for i in range(self.n):
            self.customers[i].generateObs(alpha, rep)
        print("Model Has been fitted with", len(self.customers))

In [None]:
model = Model()
model.fit(0.5, 4)

In [334]:
model.customers[0].cart

[(929397001, 0.0220169491525423, '2020-09-22'),
 (706016038, 0.0338813559322033, '2020-09-22'),
 (573085057, 0.0338813559322033, '2020-09-22'),
 (921671001, 0.0508305084745762, '2020-09-22'),
 (921671001, 0.0508305084745762, '2020-09-22'),
 (923534001, 0.0169322033898305, '2020-09-22'),
 (914441001, 0.0338813559322033, '2020-09-22'),
 (573085028, 0.0338813559322033, '2020-09-22'),
 (817110002, 0.0338813559322033, '2020-09-22'),
 (919273002, 0.0423559322033898, '2020-09-22')]