### Questions we have asked

1. What is the current inventory of a particular store?

2. What are the top-selling products at a particular store?

3. Which store has the highest total sales revenue?

4. What are the 5 stores with the most sales so far this month?

5. How many customers are currently enrolled in the frequent-shopper program?

6. What is the average order value for online orders compared to in-store purchases?

7. Which products have the highest profit margin across all stores?

8. How does the sales performance of a particular product compare between different store locations?

9. Which store locations have the highest percentage of repeat customers?

10. What are the most popular product combinations purchased together by customers?

### Before doing some analytics

1. What fields are missing?

- Quantity
- is_active

2. Are there duplicates in the dataset?

Potentially yes, because products are scraped from different collections. For example, 'New arrivals' and 'Womens' have many products in common.
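
One way to confirm this is to count repeated titles directly. The sketch below is a minimal illustration on a hypothetical four-row sample (the titles and the `sample` frame are made up, not taken from `products.csv`); it assumes the scraped file has one row per collection and product pair, with `Collection` and `Title` columns.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped layout:
# one row per (collection, product) pair.
sample = pd.DataFrame({
    "Collection": ["new-arrivals", "womens", "womens", "mens"],
    "Title": ["Linen Shirt", "Linen Shirt", "Wool Scarf", "Chino Pants"],
})

# Rows whose title already appeared in an earlier row.
dup_mask = sample.duplicated(subset="Title", keep="first")
n_duplicate_rows = int(dup_mask.sum())

# Titles that show up in more than one distinct collection.
collections_per_title = sample.groupby("Title")["Collection"].nunique()
shared_titles = collections_per_title[collections_per_title > 1].index.tolist()

print(n_duplicate_rows)
print(shared_titles)
```

Running the same two checks on the real frame would show how many rows are repeats and which titles are shared between collections.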

3. How do we identify duplicates?

By title: rows that share the same `Title` are treated as the same product.

4. How do we organize the data?

The goal is to keep the collection information while processing the data without duplicates:
- Create a new dataset of all products
- Add collections as a new field
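
The two steps above can be sketched with pandas alone. This is only an illustration on a hypothetical sample (the rows and the `sample`, `collections`, and `products` names are assumptions), not the approach used in the cells further down, which build the mapping with a `defaultdict`. Unlike that loop, this sketch also drops repeated collection names within a title.

```python
import pandas as pd

# Hypothetical sample: the same product scraped from several collections.
sample = pd.DataFrame({
    "Collection": ["new-arrivals", "womens", "womens", "new-arrivals"],
    "Title": ["Linen Shirt", "Linen Shirt", "Wool Scarf", "Linen Shirt"],
    "Current price": ["$29.90", "$29.90", "$19.90", "$29.90"],
})

# Collect the collections of each title, dropping repeats but keeping order.
collections = (
    sample.groupby("Title", sort=False)["Collection"]
    .apply(lambda s: list(dict.fromkeys(s)))
    .rename("Collections")
)

# Keep one row per title and attach the collected list as a new field.
products = (
    sample.drop_duplicates(subset="Title")
    .drop(columns="Collection")
    .merge(collections, left_on="Title", right_index=True)
    .reset_index(drop=True)
)

print(products)
```

Storing the collections as a list keeps the product table at one row per title while still allowing later filtering by collection.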

5. If we only do data analytics, do the description and material & care fields matter?

Generally no, but they do matter if we use semantic analysis. It's better to keep everything, just not in the dataset we are currently working on.

6. What is our plan?

- Filter out description, product details, material & care
- Add missing fields and generate data
- Create a new dataset for all products
- Add collections as a new field

In [None]:
import pandas as pd

df = pd.read_csv('../webscrape/data/products.csv')
print(df.columns)  # list the column names
df.info()          # info is a method; call it to print dtypes and non-null counts

In [15]:
# Filter out description, product details, material & care
# by keeping only the first five columns
new_df = df.iloc[:, :5].copy()
new_df.info()  # info is a method; call it to print the summary
new_df.columns

Index(['Collection', 'Title', 'Current price', 'Colors', 'Sizes'], dtype='object')

In [34]:
# Filter duplicates by item titles and save collections for each item
from collections import defaultdict

# Map each title to the list of collections it was scraped from
item_map = defaultdict(list)
for title, collection in zip(new_df['Title'], new_df['Collection']):
    item_map[title].append(collection)
print(len(item_map))  # number of unique titles
print(item_map["Men's Non-Iron Button Down Shirt"])


1561
['mens-shirts-polos', 'mens-shirts-polos', 'new-arrivals', 'mens', 'mens', 'mens-tops', 'mens-tops', 'workwear']


In [50]:
# Create a new dataset for all products
data = []
visited = set()
for index, row in new_df.iterrows():
    title = row.loc['Title']
    if title not in visited:
        price = row.loc['Current price']
        colors = row.loc['Colors']
        sizes = row.loc['Sizes']
        data.append([title, item_map[title], price, colors, sizes])
        visited.add(title)

filtered_df = pd.DataFrame(data, columns=['Title', 'Collections', 'Current price', 'Colors', 'Sizes'])
filtered_df

| | Title | Collections | Current price | Colors | Sizes |
|---|---|---|---|---|---|
| 0 | Sesame Paste Ramen 5 Pack | [public-goods-food, food, public-goods, dry-so... | $12.90 | [] | [] |
| 1 | Organic Linden Flower Raw Honey | [public-goods-food, food, kitchen-dining, publ... | $9.90 | [] | [] |
| 2 | Seaweed Snacks | [public-goods-food, food, public-goods, home-e... | $9.90 | [] | [] |
| 3 | Veggie Chips | [public-goods-food, food, public-goods] | $6.90 | [] | [] |
| 4 | Organic Wildflower Raw Honey | [public-goods-food, home-essentials-new-arrivals] | $18.90 | [] | [] |
| ... | ... | ... | ... | ... | ... |
| 1556 | Cord Hair Band 2 Pieces Set | [hair-care, makeup] | $4.90 | [] | [] |
| 1557 | Pine Shelf Unit - Cross Bar - Large (33.9") | [pine-shelving-unit] | $15.00 | [] | [] |
| 1558 | SUS Shelving Unit - Walnut - Regular - Medium | [home-essentials-new-arrivals] | $350.00 | [] | [] |
| 1559 | Heatproof Glass Pot - 25.3 oz | [glassware] | $29.90 | [] | [] |
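
The prices above are stored as strings such as `$12.90`, so they cannot be summed or averaged as they are. A minimal sketch of a numeric conversion follows; it assumes every price is a dollar amount with an optional thousands separator, and the `price_usd` column name is hypothetical.

```python
import pandas as pd

# A few rows copied from the output above.
sample = pd.DataFrame({
    "Title": [
        "Veggie Chips",
        "Seaweed Snacks",
        "SUS Shelving Unit - Walnut - Regular - Medium",
    ],
    "Current price": ["$6.90", "$9.90", "$350.00"],
})

# Strip the currency symbol and any thousands separators, then convert.
# errors="coerce" turns anything unparseable into NaN instead of raising.
sample["price_usd"] = pd.to_numeric(
    sample["Current price"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(sample["price_usd"].tolist())
```

Checking `sample["price_usd"].isna().sum()` afterwards is a quick way to spot prices in an unexpected format.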


In [None]:
# Add missing fields and generate data
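
A possible way to fill in the two missing fields, `Quantity` and `is_active`, with generated data is sketched below. Everything about the generation is an assumption made for illustration: the stock range of 0 to 100, the fixed seed, and the rule that a product is active when it has stock on hand.

```python
import numpy as np
import pandas as pd

# Seeded generator so the synthetic data is reproducible (seed is arbitrary).
rng = np.random.default_rng(seed=42)

# Stand-in for the de-duplicated product table.
sample = pd.DataFrame({
    "Title": ["Veggie Chips", "Seaweed Snacks", "Heatproof Glass Pot - 25.3 oz"],
})

# Assumption: stock levels are uniform integers between 0 and 100 inclusive.
sample["Quantity"] = rng.integers(low=0, high=101, size=len(sample))

# Assumption: a product counts as active whenever it has stock on hand.
sample["is_active"] = sample["Quantity"] > 0

print(sample)
```

If `is_active` should instead reflect whether a product is still listed, it would need its own source or its own random draw rather than being derived from `Quantity`.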