# Reduce Data

**revise and add to paper**

> Stratified random sampling by category is commonly used to analyze imbalanced datasets, in which some categories contain far more data points than others. Sampling within each category guarantees that every category is represented in the sample, which supports a more balanced analysis and modeling process. However, the smaller sample may limit the generalizability of the results and introduces some sampling variability.

> In this project, the dataset consists of 396,210 rows across 20 main categories, and the objective is to create a representative sample for analysis and modeling. Because of resource constraints and the need to balance computational feasibility against sample size, the sample was capped at 4,500 rows per category; categories with fewer than 4,500 rows are kept in full. This sample size may not capture every intricacy of each category, but it provides a reasonable representation of the data and allows the analysis to proceed.

> Although the cap of 4,500 is lower than the mean of 19,810.5 rows per category (the total number of rows divided by the number of categories), it still provides an adequate number of observations for analysis. The sample does not cover the entire population, but it allows key category characteristics to be explored and reduces the class imbalance in the original data.

> Despite these limitations, the selected sample size allows for meaningful insights and reliable conclusions within the scope of the project's constraints.
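
Stratified sampling can allocate rows to categories in more than one way. As a point of comparison with the fixed per-category cap used in this notebook, the sketch below shows proportional allocation on a small made-up frame. The `toy` data, the 10% fraction, and the seed are assumptions for illustration only and are not part of the project.

```python
import pandas as pd

# Hypothetical toy data: three categories with very different sizes
toy = pd.DataFrame({
    "main_category": ["a"] * 1000 + ["b"] * 100 + ["c"] * 10,
    "value": range(1110),
})

# Proportional allocation: draw the same fraction (10%) from every category
proportional = toy.groupby("main_category").sample(frac=0.1, random_state=0)

# Category shares before and after sampling match, so the imbalance remains
shares_before = toy["main_category"].value_counts(normalize=True)
shares_after = proportional["main_category"].value_counts(normalize=True)
print(shares_before.round(3).to_dict())
print(shares_after.round(3).to_dict())
```

Proportional allocation keeps the original category mix, whereas a fixed cap trades that mix for a flatter distribution across categories.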

**Note:** To those reading our project, you can access the original dataset [here](https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset?select=Amazon-Products.csv).

In [1]:
import pandas as pd

# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("data/amazon_products_raw.csv")
df.head()

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,"Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1 Convertible, Copper, Anti-Viral + Pm 2.5 Filter, 2023 Model, White, Gls18I3...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sYL._AC_UL320_.jpg,https://www.amazon.in/Lloyd-Inverter-Convertible-Anti-Viral-GLS18I3FWAMC/dp/B0BRKXTSBT/ref=sr_1_4?qid=1679134237&s=kitchen&sr=1-4,4.2,2255,"₹32,999","₹58,990"
1,"LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19YNZE/dp/B0BQ3MXML8/ref=sr_1_5?qid=1679134237&s=kitchen&sr=1-5,4.2,2948,"₹46,490","₹75,990"
2,"LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Copper, Super Convertible 6-In-1 Cooling, Hd Filter With Anti Virus Protection,...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Inverter-Convertible-protection-RS-Q13JNYE/dp/B0BPYN9JGF/ref=sr_1_6?qid=1679134237&s=kitchen&sr=1-6,4.2,1206,"₹34,490","₹61,990"
3,"LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19JNXE/dp/B0BQ3MJ1TG/ref=sr_1_7?qid=1679134237&s=kitchen&sr=1-7,4.0,69,"₹37,990","₹68,990"
4,"Carrier 1.5 Ton 3 Star Inverter Split AC (Copper,ESTER Dxi, 4-in-1 Flexicool Inverter, 2022 Model,R32,White)",appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiWL._AC_UL320_.jpg,https://www.amazon.in/Carrier-Inverter-Split-Copper-Flexicool/dp/B0B67RLLJC/ref=sr_1_8?qid=1679134237&s=kitchen&sr=1-8,4.1,630,"₹34,490","₹67,790"


In [3]:
df.shape

(396210, 9)

In [4]:
df["main_category"].value_counts()

accessories                84913
tv, audio & cameras        52721
women's clothing           50253
men's shoes                46275
men's clothing             42772
appliances                 21614
stores                     17375
home & kitchen             13356
sports & fitness           10995
kids' fashion              10954
beauty & health             9611
bags & luggage              8348
car & motorbike             6798
toys & baby products        5611
women's shoes               4930
industrial supplies         3849
grocery & gourmet foods     3196
pet supplies                1549
music                       1066
home, kitchen, pets           24
Name: main_category, dtype: int64

In [5]:
df["main_category"].value_counts().mean()

19810.5
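
The mean above, and the size of the sample that a 4,500-row cap would produce, can be cross-checked from the category counts alone. The sketch below copies the counts from the `value_counts()` output into a standalone Series; the variable names `counts`, `expected_rows`, and `small` are introduced here for illustration.

```python
import pandas as pd

# Category counts copied from the value_counts() output above
counts = pd.Series({
    "accessories": 84913,
    "tv, audio & cameras": 52721,
    "women's clothing": 50253,
    "men's shoes": 46275,
    "men's clothing": 42772,
    "appliances": 21614,
    "stores": 17375,
    "home & kitchen": 13356,
    "sports & fitness": 10995,
    "kids' fashion": 10954,
    "beauty & health": 9611,
    "bags & luggage": 8348,
    "car & motorbike": 6798,
    "toys & baby products": 5611,
    "women's shoes": 4930,
    "industrial supplies": 3849,
    "grocery & gourmet foods": 3196,
    "pet supplies": 1549,
    "music": 1066,
    "home, kitchen, pets": 24,
})

sample_size = 4_500

# Total rows and mean rows per category
print(counts.sum(), counts.mean())

# Each category contributes min(count, sample_size) rows to the sample
expected_rows = counts.clip(upper=sample_size).sum()
print(expected_rows)

# Categories smaller than the cap are kept in full
small = counts[counts < sample_size]
print(len(small), small.sum())
```

With these counts, 15 categories are cut down to 4,500 rows each and 5 categories (9,684 rows in total) are kept whole, for 77,184 rows overall.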

In [6]:
# Group the dataset by main category
grouped = df.groupby("main_category")

# Maximum number of rows to keep per category
sample_size = 4_500

# Collect one sampled frame per category, then concatenate once at the end
sampled_frames = []

# Iterate over each group, and perform sampling
for group_name, group_df in grouped:
    if len(group_df) >= sample_size:
        # Enough rows: draw a random sample of the specified size.
        # random_state is an added assumption so that reruns give the same sample.
        sampled_rows = group_df.sample(n=sample_size, random_state=42)
    else:
        # Fewer rows than the cap: keep every row
        sampled_rows = group_df
    sampled_frames.append(sampled_rows)

sampled_data = pd.concat(sampled_frames)
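
The same capped sampling can be written without an explicit loop. The sketch below is one possible alternative, not the notebook's method: it shuffles all rows once and then keeps the first `cap` rows of each group, which yields a uniform random subset per category. The helper name `sample_per_category`, the `seed` parameter, and the `toy` frame are hypothetical.

```python
import pandas as pd

def sample_per_category(frame, column, cap, seed=0):
    """Return at most `cap` randomly chosen rows per value of `column`."""
    # Shuffle all rows once, then keep the first `cap` rows of each group
    shuffled = frame.sample(frac=1, random_state=seed)
    return shuffled.groupby(column).head(cap)

# Hypothetical toy data: one large and one small category
toy = pd.DataFrame({
    "main_category": ["a"] * 20 + ["b"] * 3,
    "value": range(23),
})

result = sample_per_category(toy, "main_category", cap=5)
print(result["main_category"].value_counts().to_dict())
```

Unlike the loop, this version returns rows in shuffled order rather than grouped by category, so a sort on the category column may be needed if the grouping matters downstream.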

In [7]:
sampled_data

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
39932,Caprese ZOLA women's Satchel (YELLOW),accessories,Bags & Luggage,https://m.media-amazon.com/images/I/717z0jxPDmL._AC_UL320_.jpg,https://www.amazon.in/Caprese-SLZOLLGYLW-Womens-Western-Yellow/dp/B0B3YRYFP7/ref=sr_1_8081?qid=1679144163&s=luggage&sr=1-8081,5.0,1,"₹1,244.71","₹4,599"
33680,Fastrack Brown Leather Men's Wallet (C0408LBR01),accessories,Bags & Luggage,https://m.media-amazon.com/images/I/81WgnqcRnzL._AC_UL320_.jpg,https://www.amazon.in/Fastrack-Brown-Mens-Wallet-C0408LBR01/dp/B07BKYB2DX/ref=sr_1_553?qid=1679143906&s=luggage&sr=1-553,4.3,2162,₹821,"₹1,095"
375608,TEKZIE Butterfly Colourful Combo Set of - 3 Watch for Girls & Women.,accessories,Watches,https://m.media-amazon.com/images/I/41XT3Ck+IhL._AC_UL320_.jpg,https://www.amazon.in/TEKZIE-Butterfly-Colourful-Combo-Set/dp/B09NGJ8ZG5/ref=sr_1_9743?qid=1679155977&s=watches&sr=1-9743,,,₹323,₹999
136055,Sorellaz Womens Rose Gold Open Branch Ring: SR/FAJEWLK21-L80/1,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/51KbDbULuDL._AC_UL320_.jpg,https://www.amazon.in/Sorellaz-Womens-Rose-Gold-Branch/dp/B0B3RR7CLH/ref=sr_1_17546?qid=1679160557&s=jewelry&sr=1-17546,3.4,3,₹189,₹730
133347,Astroghar Evil Eye Pendant for Protection & Prosperity for Men and Women,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/51Ug3-OyR+L._AC_UL320_.jpg,https://www.amazon.in/ASTROGHAR-Protection-Multicolour-Pendant-Prosperity/dp/B08FF3DDVL/ref=sr_1_14583?qid=1679160450&s=jewelry&sr=1-14583,3.4,38,₹240,₹600
...,...,...,...,...,...,...,...,...,...
138329,Cleo by Khadim's Synthetic PVC Sole Blue Decorative Sandal for Women,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/61lOsS1GH2L._AC_UL320_.jpg,https://www.amazon.in/Cleo-Khadims-Synthetic-Decorative-Sandal/dp/B092WVLSZH/ref=sr_1_7901?qid=1679211724&s=shoes&sr=1-7901,2.0,1,₹389,₹649
49478,"Walky Wear Dashing Bellies, Ballet Flat Belly for Womens and Girl's",women's shoes,Ballerinas,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/6130NBC92YL._AC_UL320_.jpg,https://www.amazon.in/Walky-Wear-Dashing-Bellies-Ballet/dp/B08F37V6ZT/ref=sr_1_697?qid=1679211836&s=shoes&sr=1-697,3.7,7,₹499,₹999
312842,BEREALSlate Black Sandal Women,women's shoes,Shoes,https://m.media-amazon.com/images/I/51VeJPTxggL._AC_UL320_.jpg,https://www.amazon.in/BEREALSlate-Black-Sandal-Women-numeric_5/dp/B09GNKFFPZ/ref=sr_1_5725?qid=1679211540&s=shoes&sr=1-5725,3.1,3,₹959,"₹1,599"
137974,Stride girls Anok Floaters,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/71-IgueZKNL._AC_UL320_.jpg,https://www.amazon.in/Stride-Womens-Black-Floaters-5-16119093/dp/B07XKGSNMP/ref=sr_1_6034?qid=1679211702&s=shoes&sr=1-6034,5.0,1,₹373.10,"₹2,999"


In [8]:
sampled_data.to_csv("data/amazon_products_sampled_raw.csv", index=False)
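
Before relying on the saved file, it may be worth confirming that every category holds the expected number of rows. The sketch below is a minimal check under the assumption that the sample should contain `min(original count, cap)` rows per category; `check_sample` and the small frames are hypothetical and introduced only for illustration.

```python
import pandas as pd

def check_sample(sampled, original, column, cap):
    """Check that each category holds min(original count, cap) rows."""
    expected = original[column].value_counts().clip(upper=cap)
    actual = sampled[column].value_counts()
    # Align on category name; a category missing from the sample counts as 0
    actual = actual.reindex(expected.index, fill_value=0)
    return bool((actual == expected).all())

# Hypothetical toy data: category "a" is cut to 4 rows, "b" is kept whole
original = pd.DataFrame({"main_category": ["a"] * 10 + ["b"] * 2})
sampled = pd.concat([
    original[original["main_category"] == "a"].sample(n=4, random_state=0),
    original[original["main_category"] == "b"],
])

print(check_sample(sampled, original, "main_category", cap=4))
print(check_sample(original, original, "main_category", cap=4))
```

In this notebook the equivalent call would be `check_sample(sampled_data, df, "main_category", sample_size)`.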