## 3.1. Basket Analysis improved

As mentioned in notebook **3.Basket Analysis**, product data has been cleaned and *standardized* based on the most popular US grocery products. For example, "Dryer Sheets Geranium Scent" becomes "scent". The goal of this process is to improve the results provided by FP Growth.

Libraries:

In [1]:
import pandas as pd
import re
import yaml

from functions import keywords_match
from functions import get_items
from functions import pack_items_by_order

from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

YAML parameters

In [2]:
try:
    with open("./../params.yaml", 'r') as file:
        config = yaml.safe_load(file)
except (OSError, yaml.YAMLError) as e:
    # Later cells need `config`, so report the cause and stop here
    print('Error reading the config file:', e)
    raise

In [3]:
instacart = pd.read_csv(config['data']['instacart_sample'])
print(instacart.shape)
instacart.drop('Unnamed: 0', axis=1, inplace=True)
instacart.head()

(5204393, 15)


Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,department_id,aisle_id,aisle,department
0,6,22352,4,1,12,30.0,15873,2,0,Dryer Sheets Geranium Scent,17,75.0,laundry,household
1,8,3107,5,4,6,17.0,23423,1,1,Original Hawaiian Sweet Rolls,3,43.0,buns rolls,bakery
2,13,45082,2,6,17,1.0,3800,12,0,Hampshire 100% Natural Sour Cream,16,108.0,other creams cheeses,dairy eggs
3,13,45082,2,6,17,1.0,25783,7,0,Lemon Lime Thirst Quencher,7,64.0,energy sports drinks,beverages
4,13,45082,2,6,17,1.0,23020,10,0,Diet Tonic Water,7,77.0,soft drinks,beverages


In [4]:
instacart_ba = instacart.copy()
prods = instacart_ba[['order_id', 'product_name']]
prods.head()

Unnamed: 0,order_id,product_name
0,6,Dryer Sheets Geranium Scent
1,8,Original Hawaiian Sweet Rolls
2,13,Hampshire 100% Natural Sour Cream
3,13,Lemon Lime Thirst Quencher
4,13,Diet Tonic Water


The selection of keywords has been made on the basis of two criteria:

- List of the most purchased products in the United States.
- List of the most repeated products in the dataset.

The process is as follows: 

1. Keyword selection (manual process).
2. Creation of a function to identify which keywords exist in the product.
3. Because the product names are in English, the last word usually refers to the product itself (and not to a brand or an adjective). If the previous function finds more than one keyword, only the last one is kept.

Because of time constraints, there is some margin of error; the next step would be to refine the list of keywords and the subsequent matching. Rows in which no keyword matched are excluded from the subsequent model.
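
The helpers `keywords_match` and `get_items` live in the project's `functions` module and are not shown in this notebook. The sketch below is only an illustration of steps 2 and 3 under stated assumptions: the names `match_keywords_sketch` and `last_keyword_sketch` are hypothetical, matches are ordered by their position in the product name, and the string `'no match'` is returned when nothing is found. Unlike the notebook output further down, where "hampshire" yields "ham" (which suggests substring matching), this sketch matches whole words only.

```python
import re

def match_keywords_sketch(product_name, keywords):
    """Return the keywords found in product_name, ordered by position.

    Hypothetical stand-in for the project's `keywords_match`; returns
    'no match' when nothing is found.
    """
    name = product_name.lower()
    hits = []
    for kw in set(keywords):  # set() ignores duplicated keywords
        # \b anchors the keyword to word boundaries, so 'ham' does not
        # match inside 'hampshire'
        m = re.search(r'\b' + re.escape(kw) + r'\b', name)
        if m:
            hits.append((m.start(), kw))
    if not hits:
        return 'no match'
    return [kw for _, kw in sorted(hits)]

def last_keyword_sketch(matches):
    """Keep only the last matched keyword (step 3)."""
    return matches[-1]

example = match_keywords_sketch('Diet Tonic Water', ['water', 'tonic', 'soda'])
```

On a DataFrame column, a function like this could be applied row by row, for instance with `prods['product_name'].apply(lambda n: match_keywords_sketch(n, keywords))`.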

In [5]:
keywords = ['soda','milk', 'chips', 'eggs', 'bread', 'cereal', 'cheese', 'beer', 'water', 'chocolate', 
            'cookies','ham','bacon','jerky', 'wine','cupcakes','bananas','apple','lemon','lime','strawberries','mango','orange','juice','broccoli',
            'yams','potato', 'potatoes','tonic','tomato','sriracha','tomatoes','sauce','spaghetti','pasta','cucumber','kale','salad','spinach','arugula','dressing','onion','garlic','pepper',
            'carrots','avocados','artichoke','chicken','coffee','yoghurt','milkshake','peanut butter','beef','hot dog','wipes','cleaner','garbage',
            'bleach','baby food','mayonnaise','ice cream','sandwich','pizza','sausage','burger','veggie','macaroni','rolls','waffles','pancakes',
            'biscuits','crackers','fish','salmon','cod','cat food','pollock','cake','rice','vinegar','herring','lentil','soup','chickpea','tea','popcorn',
            'pumpkin','dog food','canned','beans','tuna','olive','oil','toilet paper', 'detergent', 'softener', 'applesauce', 'honey', 'maple syrup', 'sports drink',
            'energy drink', 'bar', 'bars', 'kombucha', 'disposable', 'tofu', 'edamame', 'sunflower oil', 'soybean oil', 'flour', 'meatballs','ketchup',
            'bbq', 'mustard', 'gum', 'noodles', 'tissues', 'soy sauce', 'fish sauce', 'tabasco', 'shampoo', 'skincare', 'toothpaste', 'soap', 'tomato paste', 
            'tomato sauce', 'ice cubes', 'vinegar', 'herbs', 'spices', 'gel', 'sugar','tampax','tampon', 'soft drink', 'butter', 'orzo', 'bagel', 'grape', 'nectarine', 
            'peach-pear', 'sushi','clementine','lasagna','meatless','eggplant', 'squash', 'scent', 'light', 'lettuce', 'banana', 'yogurt', 'cola', 'sticks',
           'cream', 'salsa', 'snack', 'snacks', 'chiles', 'avocado', 'roll', 'half & half', 'trash bags', 'parmesan', 'granola', 'hummus','pesto',
           'plates', 'cups', 'cherries', 'chili', 'peas', 'blueberries', 'half and half', 'prosciutto', 'blueberry', 'arancita', 'mint', 'egg', 'marshmallows','cilantro',
           'salami', 'raspberries', 'sea salt', 'beets', 'pot', 'walnut', 'anchovies', 'celery', 'blackberries', 'asparagus', 'cauliflower',
           'turkey', 'romaine', 'mozzarella', 'penne', 'fries', 'saffron', 'baking paper', 'matcha', 'radish', 'nuts', 'paper towels', 'paper', 'kefir',
           'parsley', 'bathroom tissue', 'smoothie', 'pears', 'mushrooms', 'apricot', 'salame', 'crab', 'chorizo', 'meat', 'tortillas', 'tortilla', 'corn', 'forks',
           'carrot', 'cinnamon toast', 'cinnamon','oatmeal', 'diapers', 'bok choy', 'sorbetto', 'whiskey', 'peaches', 'noodle', 'raisins','palm', 'pickles',
           'pickle', 'tequila', 'prunes', 'spread', 'kiwi', 'fusili', 'cocoa', 'cashews', 'raspberry', 'cold', 'flu', 'muffins', 'muffin', 'cracker', 'donut', 'peach',
           'ginger', 'turmeric', 'cheddar', 'moisturizing', 'pork', 'seltzer', 'burrito', 'pudding', 'pecorino romano', 'coconut', 'oats', 'chutney', 'mate', 'probiotic',
           'protein powder', 'mozarella', 'tisue', 'pomegranate', 'creps','rosemary','tarragon','napkins', 'brussel sprouts', 'probiotic', 'tarragon', 'cabbage', 'broth', 'ice bag',
           'tahini', 'quinoa', 'leek', 'almonds', 'shallot', 'bath tissue', 'basil', 'tilapia', 'medjool', 'coke', 'collard greens', 'vainilla extract', 'dish liquid','black plum',
           'red plums', 'emmentaler', 'sage', 'cantaloupe', 'chip', 'quencher']

In [6]:
prods['product_name'] = prods['product_name'].apply(lambda x: x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prods['product_name'] = prods['product_name'].apply(lambda x: x.lower())


In [7]:
prods['prod_1'] = keywords_match(prods['product_name'], keywords)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prods['prod_1'] = keywords_match(prods['product_name'], keywords)
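
The `SettingWithCopyWarning` messages in this notebook appear because `prods` is created by selecting columns from another DataFrame and is then modified. One common way to avoid the warning is to take an explicit copy before assigning new columns. The sketch below shows the idea on a toy frame; `instacart_toy` and `prods_toy` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Toy stand-in for the Instacart sample (two rows taken from the head above)
instacart_toy = pd.DataFrame({
    'order_id': [6, 8],
    'product_name': ['Dryer Sheets Geranium Scent', 'Original Hawaiian Sweet Rolls'],
    'aisle': ['laundry', 'buns rolls'],
})

# .copy() makes `prods_toy` an independent DataFrame, so the column
# assignment below is unambiguous and does not touch `instacart_toy`
prods_toy = instacart_toy[['order_id', 'product_name']].copy()
prods_toy['product_name'] = prods_toy['product_name'].str.lower()
```

The same pattern would apply to `prods_sel`, which is a filtered selection of `prods`.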


Drop the rows with "no match".

In [8]:
prods_sel = prods[prods['prod_1'] != 'no match']
print('Shape:', prods_sel.shape)
prods_sel.head()

Shape: (4910870, 3)


Unnamed: 0,order_id,product_name,prod_1
0,6,dryer sheets geranium scent,[scent]
1,8,original hawaiian sweet rolls,[rolls]
2,13,hampshire 100% natural sour cream,"[ham, cream]"
3,13,lemon lime thirst quencher,"[lemon, lime, quencher]"
4,13,diet tonic water,"[tonic, water]"


In [9]:
prods_sel['prod_2'] = prods_sel['prod_1'].apply(get_items)
print('Shape:', prods_sel.shape)
prods_sel.head()

Shape: (4910870, 4)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prods_sel['prod_2'] = prods_sel['prod_1'].apply(get_items)


Unnamed: 0,order_id,product_name,prod_1,prod_2
0,6,dryer sheets geranium scent,[scent],scent
1,8,original hawaiian sweet rolls,[rolls],rolls
2,13,hampshire 100% natural sour cream,"[ham, cream]",cream
3,13,lemon lime thirst quencher,"[lemon, lime, quencher]",quencher
4,13,diet tonic water,"[tonic, water]",water


In [10]:
pd.options.display.max_rows = 150
item_freq = prods_sel['prod_2'].value_counts()[0:100]
item_freq

milk             256402
cheese           184742
yogurt           179437
water            162566
banana           121634
strawberries     112395
bread            104757
bananas           93099
bar               89732
pepper            88501
apple             85597
onion             85315
spinach           80274
chicken           79012
chips             78090
eggs              74630
tomato            72120
juice             68564
sauce             63255
beans             57259
cereal            53621
crackers          51352
cream             50569
potato            48314
butter            46527
carrots           46462
lemon             45495
garlic            45134
broccoli          45017
snack             43880
oil               43377
hummus            43295
lime              42184
raspberries       42036
cucumber          41780
honey             40870
kale              40638
tea               40482
grape             36084
corn              34684
coffee            32583
pizza           

Let's reduce the number of products by keeping only the 100 most frequent ones.

In [11]:
prods_to_keep = item_freq.index.tolist()

In [12]:
prods_for_fp = prods_sel.loc[prods_sel['prod_2'].isin(prods_to_keep)]
print('Shape:', prods_for_fp.shape)
prods_for_fp.head()

Shape: (4228431, 4)


Unnamed: 0,order_id,product_name,prod_1,prod_2
1,8,original hawaiian sweet rolls,[rolls],rolls
2,13,hampshire 100% natural sour cream,"[ham, cream]",cream
4,13,diet tonic water,"[tonic, water]",water
5,13,chunky salsa medium,[salsa],salsa
6,13,light,[light],light


### Testing FP Growth

In [14]:
order_num = prods_for_fp['order_id'].unique()
prod_lst = pack_items_by_order(order_num, prods_for_fp, 'prod_2')
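
`pack_items_by_order` is a project helper that is not shown here. Assuming it returns one list of items per order, which is the input format `TransactionEncoder` expects, the grouping could be sketched as follows. The name `pack_items_sketch` is hypothetical, and the real helper may order or de-duplicate items differently.

```python
from collections import defaultdict

def pack_items_sketch(rows):
    """Group (order_id, item) pairs into one item list per order.

    Hypothetical stand-in for `pack_items_by_order`.
    """
    baskets = defaultdict(list)
    for order_id, item in rows:
        baskets[order_id].append(item)
    # dicts preserve insertion order, so baskets follow first appearance
    return list(baskets.values())

# Rows taken from the head of `prods_for_fp` above
rows = [(8, 'rolls'), (13, 'cream'), (13, 'water'), (13, 'salsa')]
baskets = pack_items_sketch(rows)
```

With pandas, a similar result could be obtained with `prods_for_fp.groupby('order_id')['prod_2'].apply(list).tolist()`; note that `groupby` sorts by `order_id` by default.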

In [15]:
te = TransactionEncoder()
te_ary = te.fit(prod_lst).transform(prod_lst)
te_ary

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False,  True, False],
       [False, False, False, ...,  True, False, False],
       ...,
       [False, False, False, ..., False,  True, False],
       [False,  True, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
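
To make the boolean array easier to read, here is a pure-Python sketch of the same kind of one-hot encoding: one row per order, one column per product, `True` where the order contains the product. It is an illustration, not mlxtend's implementation; the function name `one_hot_sketch` is hypothetical, and the columns are sorted alphabetically to mirror the column order seen in this notebook.

```python
def one_hot_sketch(transactions):
    """Encode a list of item lists as (columns, boolean matrix)."""
    # One column per distinct item, in alphabetical order
    columns = sorted({item for basket in transactions for item in basket})
    matrix = []
    for basket in transactions:
        present = set(basket)
        matrix.append([col in present for col in columns])
    return columns, matrix

columns, matrix = one_hot_sketch([['rolls'], ['cream', 'water', 'salsa']])
```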

In [16]:
encoded_df = pd.DataFrame(te_ary, columns = te.columns_)
encoded_df.head()

Unnamed: 0,almonds,apple,arugula,avocado,avocados,baby food,bacon,bagel,banana,bananas,...,strawberries,sugar,tea,tomato,tortillas,turkey,vinegar,waffles,water,yogurt
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [18]:
print('Testing FP Growth with {} transactions and {} products.'.format(len(order_num), encoded_df.shape[1]))

Testing FP Growth with 740899 transactions and 100 products.


#### Support to be considered

Support is a fraction of all transactions, so with 740,899 transactions a minimum support of 0.02 means that an itemset must appear in at least about 14,800 orders.
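
As a quick arithmetic check, the sketch below converts a support threshold into a minimum number of orders, using the transaction count printed above. The variable names are illustrative.

```python
import math

n_transactions = 740899   # from the cell output above
min_support = 0.02

# Smallest number of orders an itemset needs in order to reach the threshold
min_orders = math.ceil(n_transactions * min_support)
print(min_orders)  # 14818
```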

In [19]:
freq_items_fp = fpgrowth(encoded_df, min_support=0.02, use_colnames=True)
freq_items_fp

Unnamed: 0,support,itemsets
0,0.021998,(rolls)
1,0.180451,(water)
2,0.092342,(chips)
3,0.064250,(cream)
4,0.038676,(soda)
...,...,...
162,0.026326,"(pepper, onion)"
163,0.024222,"(beans, milk)"
164,0.023180,"(milk, butter)"
165,0.020429,"(milk, honey)"


In [20]:
rules_fp = association_rules(freq_items_fp, metric='confidence', min_threshold=0.3)

In [21]:
rules_fp.sort_values(by='lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
12,"(cheese, milk)",(yogurt),0.07397,0.176336,0.023278,0.314703,1.784682,0.010235,1.201909
16,(crackers),(cheese),0.062031,0.199964,0.020765,0.334755,1.674077,0.008361,1.202619
14,"(milk, yogurt)",(cheese),0.070429,0.199964,0.023278,0.330523,1.652912,0.009195,1.195016
5,(sauce),(cheese),0.076894,0.199964,0.025168,0.327307,1.63683,0.009792,1.189303
9,(bread),(cheese),0.131231,0.199964,0.041045,0.312767,1.564117,0.014803,1.164141
13,"(cheese, yogurt)",(milk),0.050625,0.29456,0.023278,0.459822,1.561048,0.008366,1.30594
24,(cereal),(milk),0.062585,0.29456,0.027695,0.442515,1.502295,0.00926,1.265399
28,(raspberries),(milk),0.05651,0.29456,0.023324,0.41275,1.401242,0.006679,1.20126
11,(yogurt),(milk),0.176336,0.29456,0.070429,0.399405,1.355937,0.018488,1.174568
10,(eggs),(milk),0.100174,0.29456,0.039885,0.39816,1.351711,0.010378,1.172138


Interpreting the results:
 
 - Support: the fraction of all transactions in which the itemset (all of its products together) occurs. It helps us identify the rules worth considering for further analysis.
 
 - Confidence: the conditional probability of the consequent given the antecedent, that is, how often the rule holds. Of all transactions containing product A, what fraction also contain product B?
 
 - Lift: the strength of the association (the closer to 1, the weaker the relationship). Lift is the ratio between the probability of having {Y} in the cart when {X} is known to be present and the probability of having {Y} in the cart without any knowledge about {X}. A lift greater than 1 indicates a positive association between {X} and {Y}: the higher the lift, the more likely a customer who has already bought {X} is to buy {Y}. A lift below 1 shows that having {X} in the cart does not increase the chances of {Y} being in the cart, even if the rule shows a high confidence. Lift is the measure that helps store managers decide on product placement in the aisles.
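
As a worked example, the metrics can be recomputed by hand from the supports of one rule. The sketch below uses the values of the (crackers) -> (cheese) row of the rules table and the standard definitions of confidence, lift, leverage and conviction; the variable names are illustrative. Because the inputs are rounded to six decimals, the results agree with the table only to about three or four decimal places.

```python
# Values read from the (crackers) -> (cheese) row of the rules table
antecedent_support = 0.062031   # fraction of orders containing crackers
consequent_support = 0.199964   # fraction of orders containing cheese
support = 0.020765              # fraction of orders containing both

# confidence = P(cheese | crackers)
confidence = support / antecedent_support
# lift = confidence relative to the baseline probability of cheese
lift = confidence / consequent_support
# leverage = observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction = (1 - P(cheese)) / (1 - confidence)
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 3), round(leverage, 4), round(conviction, 3))
```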
    