# Reduce Data

> Stratified random sampling by category is a common technique for imbalanced datasets, where some categories have far more rows than others. Here, the per-category sample size is anchored to the mean category size, that is, the total number of rows divided by the number of categories:
>
> $$\text{mean} = \frac{\text{total number of rows}}{\text{total number of categories}}$$
>
> The original dataset has 396,210 rows across 20 categories, which gives a mean of 19,810.5 rows per category. We round this up to a cap of 20,000 rows per category: categories with at least 20,000 rows are randomly downsampled to 20,000, and smaller categories are kept in full. Capping the largest categories keeps them from dominating the sample, which can help a downstream model generalize across categories, although categories below the cap remain smaller than the rest. Using a cap slightly above the mean (20,000 vs. 19,810.5) retains a little more data and may reduce sampling variability.

**Note:** Readers of this project can access the original dataset [here](https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset?select=Amazon-Products.csv).

In [1]:
import pandas as pd

# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("data/amazon_products_raw.csv")
df.head()

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,"Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1 Convertible, Copper, Anti-Viral + Pm 2.5 Filter, 2023 Model, White, Gls18I3...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sYL._AC_UL320_.jpg,https://www.amazon.in/Lloyd-Inverter-Convertible-Anti-Viral-GLS18I3FWAMC/dp/B0BRKXTSBT/ref=sr_1_4?qid=1679134237&s=kitchen&sr=1-4,4.2,2255,"₹32,999","₹58,990"
1,"LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19YNZE/dp/B0BQ3MXML8/ref=sr_1_5?qid=1679134237&s=kitchen&sr=1-5,4.2,2948,"₹46,490","₹75,990"
2,"LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Copper, Super Convertible 6-In-1 Cooling, Hd Filter With Anti Virus Protection,...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Inverter-Convertible-protection-RS-Q13JNYE/dp/B0BPYN9JGF/ref=sr_1_6?qid=1679134237&s=kitchen&sr=1-6,4.2,1206,"₹34,490","₹61,990"
3,"LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19JNXE/dp/B0BQ3MJ1TG/ref=sr_1_7?qid=1679134237&s=kitchen&sr=1-7,4.0,69,"₹37,990","₹68,990"
4,"Carrier 1.5 Ton 3 Star Inverter Split AC (Copper,ESTER Dxi, 4-in-1 Flexicool Inverter, 2022 Model,R32,White)",appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiWL._AC_UL320_.jpg,https://www.amazon.in/Carrier-Inverter-Split-Copper-Flexicool/dp/B0B67RLLJC/ref=sr_1_8?qid=1679134237&s=kitchen&sr=1-8,4.1,630,"₹34,490","₹67,790"


In [3]:
df.shape

(396210, 9)

In [4]:
df["main_category"].value_counts()

accessories                84913
tv, audio & cameras        52721
women's clothing           50253
men's shoes                46275
men's clothing             42772
appliances                 21614
stores                     17375
home & kitchen             13356
sports & fitness           10995
kids' fashion              10954
beauty & health             9611
bags & luggage              8348
car & motorbike             6798
toys & baby products        5611
women's shoes               4930
industrial supplies         3849
grocery & gourmet foods     3196
pet supplies                1549
music                       1066
home, kitchen, pets           24
Name: main_category, dtype: int64

In [5]:
df["main_category"].value_counts().mean()

19810.5

In [6]:
# Group the dataset by category
grouped = df.groupby("main_category")

# Maximum number of rows to keep per category
sample_size = 20_000

sampled_groups = []

# Iterate over each group and perform sampling
for group_name, group_df in grouped:
    if len(group_df) >= sample_size:
        # Enough rows: draw a random sample of sample_size rows
        sampled_rows = group_df.sample(n=sample_size)
    else:
        # Fewer rows than sample_size: keep every row
        sampled_rows = group_df
    sampled_groups.append(sampled_rows)

# Concatenate once instead of growing a DataFrame inside the loop
sampled_data = pd.concat(sampled_groups)
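
The capped sampling logic in this cell can also be wrapped in a small reusable function. The following is a minimal sketch, not the notebook's own code: the function name `sample_per_category`, the `seed` parameter, and the toy DataFrame are assumptions introduced for illustration. Passing a fixed seed to `DataFrame.sample` through `random_state` makes the draw reproducible between runs, which the cell above does not guarantee.

```python
import pandas as pd


def sample_per_category(df, column, cap, seed=0):
    """Keep at most `cap` randomly chosen rows from each group of `column`.

    Hypothetical helper: the name and the `seed` argument are illustrative.
    """
    parts = []
    for _, group in df.groupby(column):
        if len(group) > cap:
            # Large group: draw `cap` rows without replacement.
            parts.append(group.sample(n=cap, random_state=seed))
        else:
            # Small group: keep every row unchanged.
            parts.append(group)
    return pd.concat(parts)


# Toy data standing in for the real dataset (made-up values).
toy = pd.DataFrame({
    "main_category": ["a"] * 5 + ["b"] * 2 + ["c"] * 1,
    "value": range(8),
})

capped = sample_per_category(toy, "main_category", cap=3)
print(capped["main_category"].value_counts().sort_index())
```

On the toy data, category `a` is cut from five rows to three, while `b` and `c` are kept whole. Applied to the real data, the equivalent call would be `sample_per_category(df, "main_category", cap=20_000)`.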

In [7]:
sampled_data

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
33430,Gear Polyester 23 Cms Travel Bag(DUFCRSTNG0104_Black),accessories,Bags & Luggage,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/51yLMxZu5wS._AC_UL320_.jpg,https://www.amazon.in/Gear-Training-Travel-Black-Grey-DUFCRSTNG0104/dp/B098FFTDKT/ref=sr_1_173?qid=1679143891&s=luggage&sr=1-173,4.2,1920,₹569,"₹1,299"
137273,Atasi International Superb Trendy Pink Diamond Gold Plated Alloy Princess Style Necklace Set For Women,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/81Nnff9c8+L._AC_UL320_.jpg,https://www.amazon.in/Atasi-International-Diamond-Necklace-Jewellery/dp/B08GG5MJB5/ref=sr_1_18938?qid=1679160606&s=jewelry&sr=1-18938,5.0,3,₹359,"₹1,999"
39723,Overnice Solid 8 Cms Cosmetic Pouch (S_GTYJB_Black),accessories,Bags & Luggage,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/619TwxvyuKL._AC_UL320_.jpg,https://www.amazon.in/Overnice-Shaving-Polyester-Travel-Women/dp/B07YXYQ56H/ref=sr_1_7834?qid=1679144154&s=luggage&sr=1-7834,3.5,125,₹109,₹299
131296,Priyaasi Gold Plated Charming Colorful Stone Hair Pin Hair Clip for Girls and Women (Juda Pin),accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/710eLRt2KmL._AC_UL320_.jpg,https://www.amazon.in/Priyaasi-Plated-Charming-Colorful-Stone/dp/B083BWF3BY/ref=sr_1_12314?qid=1679160370&s=jewelry&sr=1-12314,3.9,9,₹759,"₹2,265"
245468,Anuj Sales South Sea Pearl 13.00 Carat Natural Pearl Gemstone Original Certified Moti Adjustable Astrological panchhdhaat...,accessories,Jewellery,https://m.media-amazon.com/images/I/61k0j++6kkL._AC_UL320_.jpg,https://www.amazon.in/Anuj-Sales-Adjustable-Astrological-panchhdhaatu/dp/B08W9MHM1J/ref=sr_1_17761?qid=1679145753&s=jewelry&sr=1-17761,,,₹550,"₹2,199"
...,...,...,...,...,...,...,...,...,...
314023,BATA Women's Simo Pu Slipper,women's shoes,Shoes,https://m.media-amazon.com/images/I/61RepswFlCL._AC_UL320_.jpg,https://www.amazon.in/BATA-Womens-Fashion-Slippers-3-6715937/dp/B0837BX1DN/ref=sr_1_19004?qid=1679211652&s=shoes&sr=1-19004,3.6,11,,"₹1,066"
314024,new balance womens Pesu Running Shoe,women's shoes,Shoes,https://m.media-amazon.com/images/I/61xjPTmn0RL._AC_UL320_.jpg,https://www.amazon.in/new-balance-WPESULK1-Black/dp/B0817HTWR7/ref=sr_1_19005?qid=1679211652&s=shoes&sr=1-19005,4.0,16,"₹3,599","₹7,999"
314025,A&Z Dream Women's & Girls' Fashion Sandal,women's shoes,Shoes,https://m.media-amazon.com/images/I/61-GiZbnBJL._AC_UL320_.jpg,https://www.amazon.in/drem-Women-Girls-Sandal-Yellow/dp/B0815WYPMS/ref=sr_1_19006?qid=1679211652&s=shoes&sr=1-19006,3.5,346,₹497,₹999
314026,DEEANNE LONDON Women's Block Heels Bellies,women's shoes,Shoes,https://m.media-amazon.com/images/I/71u4MPMPzjL._AC_UL320_.jpg,https://www.amazon.in/DEEANNE-LONDON-Womens-Block-Bellies/dp/B07ZVS52R3/ref=sr_1_19007?qid=1679211652&s=shoes&sr=1-19007,3.2,121,₹699,₹999


In [8]:
sampled_data.to_csv("data/amazon_products_sampled_raw.csv", index=True)
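
As a quick sanity check on the saved sample, its expected size can be worked out from the `value_counts()` output and the 20,000-row cap alone. The sketch below hardcodes the counts printed earlier in this notebook; the variable names are illustrative assumptions and not part of the original code.

```python
# Category sizes copied from the value_counts() output shown earlier.
category_counts = {
    "accessories": 84913,
    "tv, audio & cameras": 52721,
    "women's clothing": 50253,
    "men's shoes": 46275,
    "men's clothing": 42772,
    "appliances": 21614,
    "stores": 17375,
    "home & kitchen": 13356,
    "sports & fitness": 10995,
    "kids' fashion": 10954,
    "beauty & health": 9611,
    "bags & luggage": 8348,
    "car & motorbike": 6798,
    "toys & baby products": 5611,
    "women's shoes": 4930,
    "industrial supplies": 3849,
    "grocery & gourmet foods": 3196,
    "pet supplies": 1549,
    "music": 1066,
    "home, kitchen, pets": 24,
}

cap = 20_000  # same value as sample_size in the sampling cell

total_rows = sum(category_counts.values())
mean_rows = total_rows / len(category_counts)

# Each category contributes its full size or the cap, whichever is smaller.
capped_counts = {name: min(count, cap) for name, count in category_counts.items()}
expected_rows = sum(capped_counts.values())

print(total_rows, mean_rows, expected_rows)
# 396210 19810.5 217662
```

Six categories reach the cap and contribute 20,000 rows each, and the remaining fourteen contribute all of their rows, so `len(sampled_data)` should equal `expected_rows` if the sampling cell ran on the same file.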