# Description of algorithm

### Data preparation

The notebook loads the data set, shuffles it and filters out samples that do not have a category_name. It then replaces each NaN in the brand_name column with 'No Brand'.
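
The preparation steps can be sketched on a toy frame. This is a minimal sketch, not the notebook's own code: the rows are made up for illustration, and pandas' `sample(frac=1)` stands in for sklearn's `shuffle` so that the block depends only on pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.tsv (values are illustrative only)
toy = pd.DataFrame({
    'category_name': ['Men/Tops/T-shirts', None, 'Beauty/Makeup/Lips', 'Beauty/Makeup/Lips'],
    'brand_name': [np.nan, 'Nike', 'Chanel', np.nan],
    'price': [10.0, 25.0, 30.0, 12.0],
})

# Shuffle the rows reproducibly (stand-in for sklearn.utils.shuffle)
toy = toy.sample(frac=1, random_state=42)

# Keep only the rows that have a category; copy so the frame can be modified safely
prepared = toy.loc[toy['category_name'].notnull()].copy()

# Missing brands become the explicit label 'No Brand'
prepared['brand_name'] = prepared['brand_name'].fillna('No Brand')
```

After these steps `prepared` has no missing values in either column, which is what the grouping step relies on.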

### Grouping brands

The algorithm first creates three dictionaries for storing brand occurrences, votes and prestige, keyed by brand name. The occurrence and vote dictionaries start with every value at 0; the prestige dictionary is filled in at the end.
The algorithm retrieves the unique categories from the category_name column and loops through them. For each category, it performs the following:
* Loops through all the brands in that category and, for every brand, obtains the mean price (across all the items in that category and brand)
* Computes the median of the brand mean prices
* Loops through all the brands in that category again and performs the following:
    * Adds 1 to the brand votes dictionary if the mean price for the given brand is at least threshold*median (the current threshold is 1.5)
    * Adds 1 to the brand occurrences dictionary for the given brand

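Brand prestige is then computed as (number of votes) / (number of occurrences), where these values come from the brand votes and brand occurrences dictionaries. A prestige of 1 means the brand's mean price reached the threshold in every category in which it appears; a prestige of 0 means it never did.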
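
To make the voting rule concrete, here is a small plain-Python walk-through. It is only a sketch: the categories, brands and mean prices are hypothetical, and the comparison uses `>=`, matching the `vote` function defined later in the notebook.

```python
from statistics import median

THRESHOLD = 1.5  # multiplier applied to the median of the brand means

# Hypothetical mean price per brand within each category
category_brand_means = {
    'Beauty/Makeup/Lips': {'Chanel': 30.0, 'No Brand': 10.0, 'Maybelline': 8.0},
    'Men/Shoes/Fashion Sneakers': {'Nike': 90.0, 'Adidas': 50.0, 'Chanel': 40.0, 'No Brand': 20.0},
}

votes, occurrences = {}, {}
for means in category_brand_means.values():
    cutoff = THRESHOLD * median(means.values())  # per-category cutoff
    for brand, mean_price in means.items():
        # One vote if the brand's mean price reaches the cutoff in this category
        votes[brand] = votes.get(brand, 0) + int(mean_price >= cutoff)
        # One occurrence for every category the brand appears in
        occurrences[brand] = occurrences.get(brand, 0) + 1

prestige = {brand: votes[brand] / occurrences[brand] for brand in occurrences}
```

In this toy data Chanel clears the cutoff in one of its two categories, so its prestige is 0.5, while 'No Brand' never clears it and ends at 0.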

# Settings

In [1]:
settings = {
            'test_size': 0.25, # fraction of data set to be assigned as test data
            'del': True, # Delete variables that are no longer needed for later computations, to free memory
            'filename_str': 'learning_curve.csv', # File for saving training and test RMSEs; this name is appended to the current date string (yymmdd)
            'random_state': { # Set random states so that the results are repeatable
                'shuffle': 42, # sklearn's shuffle method
                'split': 17 # sklearn's train_test_split method
            }
           }
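
The comment on `filename_str` says the name is appended to a yymmdd date string. One way this could look is sketched below; `dated_filename` is a hypothetical helper and the underscore separator is an assumption, since the notebook does not show how the two parts are joined.

```python
from datetime import datetime

settings = {'filename_str': 'learning_curve.csv'}

def dated_filename(suffix, now=None):
    """Prefix `suffix` with the date as yymmdd (hypothetical helper, separator assumed)."""
    now = now or datetime.now()
    return f"{now.strftime('%y%m%d')}_{suffix}"

# Uses today's date when no explicit datetime is passed
filename = dated_filename(settings['filename_str'])
```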

# Use helper functions

In [2]:
%run ../src/helper_functions.py

Following functions has been loaded:

rmse
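
The contents of `helper_functions.py` are not shown here. As an assumption about what a helper named `rmse` typically computes, a root mean squared error function could be sketched like this; the signature is illustrative and may differ from the real helper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```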



# Load data set and shuffle it

In [3]:
import numpy as np
import pandas as pd

In [4]:
PATH = "../../data/"
data_full = pd.read_csv(f'{PATH}train.tsv', sep='\t')

In [5]:
data_full.head()

| | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
| 1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
| 2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
| 3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
| 4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |


## Shuffle

In [6]:
from sklearn.utils import shuffle
data_shuffled = shuffle(data_full, random_state=settings['random_state']['shuffle'])

if (settings['del']):
    del data_full

In [7]:
len(data_shuffled)

1482535

In [8]:
print("Number of unique fields:\n")

print("category_name: \t%d" % data_shuffled['category_name'].nunique())
print("brand_name: \t%d" % data_shuffled['brand_name'].nunique())
print()
print("%d items have no brand" % data_shuffled['brand_name'].isna().sum())
print("%d items have no category_name" % data_shuffled['category_name'].isna().sum())

Number of unique fields:

category_name: 	1287
brand_name: 	4809

632682 items have no brand
6327 items have no category_name


# Filter out samples that do not have category_name

In [9]:
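# .copy() makes data_reduced an independent frame, so the later brand_name assignment does not trigger SettingWithCopyWarning
data_reduced = data_shuffled.loc[data_shuffled['category_name'].notnull()].copy()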

if (settings['del']):
    del data_shuffled

In [10]:
# double check
print("%d items have no category_name" % data_reduced['category_name'].isna().sum())

0 items have no category_name


# Replace NaN in brand_name by 'No Brand'

In [11]:
data_reduced['brand_name'] = data_reduced['brand_name'].fillna('No Brand')

In [12]:
data_reduced

| | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
|---|---|---|---|---|---|---|---|---|
| 777341 | 777341 | F/ship 4 Totoro Washi + 1 pen | 1 | Handmade/Paper Goods/Stationery | No Brand | 12.0 | 1 | This listing is for all 4 Totoro washi tape fo... |
| 1463629 | 1463629 | UCLA Men's Bundle + Shorts | 1 | Women/Other/Other | Adidas | 76.0 | 1 | 7 items. 1: XL. 2: 2XL. 3:2XL. 4: XL. 5: 2XL. ... |
| 350669 | 350669 | Listing for lol | 1 | Beauty/Makeup/Lips | No Brand | 12.0 | 1 | - sunglasses and necklace :) |
| 310222 | 310222 | 25 pcs kawaii sticker flakes | 1 | Kids/Toys/Arts & Crafts | No Brand | 3.0 | 1 | I ordered a bunch of stickers so you will reci... |
| 759257 | 759257 | Chanel Mini Lipgloss Set | 2 | Beauty/Makeup/Lips | Chanel | 30.0 | 1 | Brand new never used authentic Mini Lipgloss g... |
| 288846 | 288846 | Maroon Foamposites | 3 | Men/Shoes/Fashion Sneakers | Nike | 225.0 | 1 | 9/10 Condition N Sz 12 |
| 1178450 | 1178450 | INC studdedHeart Black Blouse Dolman | 2 | Women/Tops & Blouses/Blouse | INC International Concepts | 16.0 | 1 | New without tags INC International Concepts Sh... |
| 726296 | 726296 | Leggo silicone molds | 3 | Home/Kitchen & Dining/Bakeware | No Brand | 12.0 | 0 | I used these for my son's leggo birthday party... |
| 840510 | 840510 | Supreme Uzi Chain | 1 | Handmade/Accessories/Men | No Brand | 15.0 | 1 | 10/10 New |
| 1473033 | 1473033 | Women Gold Palm Pendant Necklace FC | 1 | Vintage & Collectibles/Jewelry/Necklace | No Brand | 17.0 | 1 | High quality Immediate purchase Ok? Free shipp... |


In [13]:
unique_cns = data_reduced.category_name.unique() # array of unique category names
unique_brands = data_reduced.brand_name.unique() # array of unique brand names

# Compute brand votes, occurrences and prestige

In [50]:
# Every time a brand occurs in a category, its occurrence count is increased by one (the loop below iterates over categories)
brand_occurences = dict(zip(unique_brands, len(unique_brands)*[0])) # {brand_name: occurrence count}, initialised to zeros

# Every time a brand is judged prestigious in a category, its vote count is increased by one; otherwise it is left unchanged
brand_votes = dict(zip(unique_brands, len(unique_brands)*[0])) # {brand_name: vote count}, initialised to zeros

brand_prestige = {} # {brand_name: prestige}, where prestige is the number of votes divided by the number of occurrences

from tqdm import tqdm_notebook
for cat_name in tqdm_notebook(unique_cns): # iterate through all categories
    
    data_filtered_cn = data_reduced.loc[data_reduced.category_name == cat_name] # get data subset for the given category
    brands = data_filtered_cn.brand_name.unique() # array of unique brand names for the given category
    
    # Get mean for each brand in the given category
    brand_means = {} # {brand_name: mean} dictionary to store mean price for every brand in the given category
    for b in brands:
        data_brand = data_filtered_cn.loc[data_filtered_cn['brand_name'] == b] # data frame containing only one specific brand for one category
        brand_means[b] = data_brand.price.mean()
    
    # Add one vote for prestigious brands; otherwise keep the current vote
    
    def vote(val, treshold):
        if (val >= treshold):
            return 1
        return 0
    
    treshold = 1.5 * np.median(list(brand_means.values())) # 1.5 * (median of the brands means)
    
    for brand in brands:
        # vote
        votes_so_far = brand_votes[brand]
        new_vote = vote(brand_means[brand], treshold)
        brand_votes[brand] = votes_so_far + new_vote
        
        # add occurence
        occurence_so_far = brand_occurences[brand]
        brand_occurences[brand] = occurence_so_far + 1
    
for brand in unique_brands:
    brand_prestige[brand] = brand_votes[brand] / brand_occurences[brand]
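
The same computation can also be expressed with pandas `groupby`, which avoids the nested Python loops. The following is a sketch on a toy frame rather than a drop-in replacement for the cell above: the data and the names `items`, `means` and `summary` are illustrative, and it assumes brand_name has no missing values (as is the case after the 'No Brand' fill).

```python
import pandas as pd

THRESHOLD = 1.5  # same multiplier as in the loop

# Toy stand-in for data_reduced (illustrative values)
items = pd.DataFrame({
    'category_name': ['Lips', 'Lips', 'Lips', 'Lips', 'Sneakers', 'Sneakers', 'Sneakers'],
    'brand_name': ['Chanel', 'Chanel', 'No Brand', 'Maybelline', 'Nike', 'No Brand', 'Adidas'],
    'price': [28.0, 32.0, 10.0, 8.0, 90.0, 20.0, 50.0],
})

# Mean price per (category, brand) pair
means = (items.groupby(['category_name', 'brand_name'])['price']
              .mean()
              .reset_index(name='mean_price'))

# Median of the brand means within each category, repeated on every row of that category
cat_median = means.groupby('category_name')['mean_price'].transform('median')
means['vote'] = (means['mean_price'] >= THRESHOLD * cat_median).astype(int)

# votes: number of categories where the brand reached the cutoff
# occurrences: number of categories the brand appears in
summary = means.groupby('brand_name')['vote'].agg(votes='sum', occurrences='count')
summary['prestige'] = summary['votes'] / summary['occurrences']
```

On the full data set this form should give the same votes and occurrences as the loop, since each (category, brand) pair contributes exactly one row to `means`.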

In [28]:
# brand_votes

In [29]:
# brand_occurences

In [30]:
# brand_prestige

In [41]:
# Collect the votes, occurrences and prestige dictionaries into column lists for pandas
parsed_brands = {'brand': [], 'votes':[], 'occurences':[], 'prestige':[]}
for brand in unique_brands:
    parsed_brands['brand'].append(brand)
    parsed_brands['votes'].append(brand_votes[brand])
    parsed_brands['occurences'].append(brand_occurences[brand])
    parsed_brands['prestige'].append(brand_prestige[brand])

In [42]:
prestige_df = pd.DataFrame.from_dict(parsed_brands)
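
As an alternative to filling the column lists by hand, the dictionaries can be turned into a frame directly, because `pd.Series` accepts a dict and aligns on its keys. This is a sketch with made-up counts; it keeps the notebook's column names, including the `occurences` spelling.

```python
import pandas as pd

# Toy dictionaries in the same {brand_name: value} shape as in the notebook
brand_votes = {'No Brand': 1, 'Chanel': 3, 'Nike': 2}
brand_occurences = {'No Brand': 10, 'Chanel': 4, 'Nike': 8}

# Each dict becomes a column; rows are aligned on brand name
prestige_df = pd.DataFrame({
    'votes': pd.Series(brand_votes),
    'occurences': pd.Series(brand_occurences),
})
prestige_df['prestige'] = prestige_df['votes'] / prestige_df['occurences']

# Move the brand names from the index into a regular 'brand' column
prestige_df = prestige_df.rename_axis('brand').reset_index()
```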

In [45]:
prestige_df.head()

| | brand | votes | occurences | prestige |
|---|---|---|---|---|
| 0 | No Brand | 114 | 1270 | 0.089764 |
| 1 | Adidas | 43 | 175 | 0.245714 |
| 2 | Chanel | 80 | 97 | 0.824742 |
| 3 | Nike | 62 | 259 | 0.239382 |
| 4 | INC International Concepts | 5 | 70 | 0.071429 |


In [48]:
sorted_df = prestige_df.sort_values(['prestige'], ascending=False)
sorted_df.to_csv('sorted.csv')