# Reduce Data

**revise and add to paper**

> Stratified random sampling by category is commonly used to analyze imbalanced datasets, in which some categories contain far more data points than others. Drawing the sample within each category ensures that every category is represented, which supports a more balanced analysis and modeling process. However, the smaller sample may limit the generalizability of the results and introduces some sampling variability.

> In this project, the dataset consists of 396,210 rows, and the objective is to create a representative sample for analysis and modeling. Because of resource constraints, and to balance computational feasibility against sample size, the sample was limited to at most 4,500 rows per category; categories with fewer than 4,500 rows are kept in full. This sample size may not capture every intricacy of each category, but it provides a reasonable representation of the data and allows the analysis to proceed.

> Although the per-category sample size of 4,500 is well below the mean category size of 19,810.5 rows (the total number of rows divided by the number of categories), it still provides an adequate number of observations for analysis. The sample does not capture the entire population, but it allows the key characteristics of each category to be explored and reduces the class imbalance present in the full dataset.

> Despite these limitations, the selected sample size allows for meaningful insights and reliable conclusions within the scope of the project's constraints.

**Note:** To those reading our project, you can access the original dataset [here](https://www.kaggle.com/datasets/lokeshparab/amazon-products-dataset?select=Amazon-Products.csv).

In [1]:
import pandas as pd

# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("data/amazon_products_raw.csv")
df.head()

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,"Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1 Convertible, Copper, Anti-Viral + Pm 2.5 Filter, 2023 Model, White, Gls18I3...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sYL._AC_UL320_.jpg,https://www.amazon.in/Lloyd-Inverter-Convertible-Anti-Viral-GLS18I3FWAMC/dp/B0BRKXTSBT/ref=sr_1_4?qid=1679134237&s=kitchen&sr=1-4,4.2,2255,"₹32,999","₹58,990"
1,"LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19YNZE/dp/B0BQ3MXML8/ref=sr_1_5?qid=1679134237&s=kitchen&sr=1-5,4.2,2948,"₹46,490","₹75,990"
2,"LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Copper, Super Convertible 6-In-1 Cooling, Hd Filter With Anti Virus Protection,...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Inverter-Convertible-protection-RS-Q13JNYE/dp/B0BPYN9JGF/ref=sr_1_6?qid=1679134237&s=kitchen&sr=1-6,4.2,1206,"₹34,490","₹61,990"
3,"LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...",appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg,https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19JNXE/dp/B0BQ3MJ1TG/ref=sr_1_7?qid=1679134237&s=kitchen&sr=1-7,4.0,69,"₹37,990","₹68,990"
4,"Carrier 1.5 Ton 3 Star Inverter Split AC (Copper,ESTER Dxi, 4-in-1 Flexicool Inverter, 2022 Model,R32,White)",appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiWL._AC_UL320_.jpg,https://www.amazon.in/Carrier-Inverter-Split-Copper-Flexicool/dp/B0B67RLLJC/ref=sr_1_8?qid=1679134237&s=kitchen&sr=1-8,4.1,630,"₹34,490","₹67,790"


In [3]:
df.shape

(396210, 9)

In [4]:
df["main_category"].value_counts()

accessories                84913
tv, audio & cameras        52721
women's clothing           50253
men's shoes                46275
men's clothing             42772
appliances                 21614
stores                     17375
home & kitchen             13356
sports & fitness           10995
kids' fashion              10954
beauty & health             9611
bags & luggage              8348
car & motorbike             6798
toys & baby products        5611
women's shoes               4930
industrial supplies         3849
grocery & gourmet foods     3196
pet supplies                1549
music                       1066
home, kitchen, pets           24
Name: main_category, dtype: int64

In [5]:
df["main_category"].value_counts().mean()

19810.5
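
The category counts printed above are enough to work out how large the capped sample will be before any sampling is done. The sketch below hard-codes those counts (copied from the `value_counts()` output) and assumes the cap of 4,500 rows per category described in the text; the variable names `counts`, `cap`, and `capped` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Category counts copied from the value_counts() output above
counts = pd.Series({
    "accessories": 84913,
    "tv, audio & cameras": 52721,
    "women's clothing": 50253,
    "men's shoes": 46275,
    "men's clothing": 42772,
    "appliances": 21614,
    "stores": 17375,
    "home & kitchen": 13356,
    "sports & fitness": 10995,
    "kids' fashion": 10954,
    "beauty & health": 9611,
    "bags & luggage": 8348,
    "car & motorbike": 6798,
    "toys & baby products": 5611,
    "women's shoes": 4930,
    "industrial supplies": 3849,
    "grocery & gourmet foods": 3196,
    "pet supplies": 1549,
    "music": 1066,
    "home, kitchen, pets": 24,
})

cap = 4_500  # per-category cap described in the text

# Each category contributes min(count, cap) rows to the sample
capped = counts.clip(upper=cap)

print("mean category size:", counts.mean())
print("categories at or above the cap:", int((counts >= cap).sum()))
print("expected sample rows:", int(capped.sum()))
print("share of original rows kept:", round(capped.sum() / counts.sum(), 3))
```

Under these assumptions, 15 of the 20 categories reach the cap and the sample holds 77,184 rows, roughly 19% of the original data. The five smallest categories stay below 4,500 rows, so the sample is less imbalanced than the source but not perfectly balanced.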

In [9]:
# Group the dataset by category
grouped = df.groupby("main_category")

# Maximum number of rows to keep per category
sample_size = 4_500

sampled_parts = []

# Iterate over each group and sample it independently
for group_name, group_df in grouped:
    if len(group_df) >= sample_size:
        # Enough rows: draw sample_size rows at random, without replacement
        sampled_rows = group_df.sample(n=sample_size)
    else:
        # Fewer rows than the cap: keep the whole group
        sampled_rows = group_df
    sampled_parts.append(sampled_rows)

# Concatenate once instead of growing a DataFrame inside the loop
sampled_data = pd.concat(sampled_parts)
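
The sampling cell above passes no seed to `sample`, so each run draws different rows. A minimal sketch of a seeded variant follows. The function name `sample_up_to`, its `seed` parameter, and the toy DataFrame are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd


def sample_up_to(df, by, cap, seed=0):
    """Keep at most `cap` randomly chosen rows per group of column `by`."""
    parts = []
    for _, group in df.groupby(by):
        if len(group) > cap:
            # Seeded draw without replacement, so reruns return the same rows
            group = group.sample(n=cap, random_state=seed)
        # Groups at or below the cap are kept whole
        parts.append(group)
    return pd.concat(parts)


# Toy data (hypothetical): one large and one small category
toy = pd.DataFrame({
    "main_category": ["a"] * 10 + ["b"] * 3,
    "value": range(13),
})

first = sample_up_to(toy, "main_category", cap=5, seed=42)
second = sample_up_to(toy, "main_category", cap=5, seed=42)

print(first["main_category"].value_counts().to_dict())
print(first.equals(second))
```

In this notebook the equivalent call would be `sample_up_to(df, "main_category", cap=4_500)`. Fixing the seed matters mainly if the saved CSV has to be regenerated exactly, for example when the results are reported in the paper.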

In [10]:
sampled_data

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
382476,Kids Smart Watch for Girls Toy for Kids Gift for Girls Watches,accessories,Watches,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/61p-xAxgsXL._AC_UL320_.jpg,https://www.amazon.in/Kids-Smart-Watch-Girls-Watches/dp/B0BLZ37G2T/ref=sr_1_17225?qid=1679156236&s=watches&sr=1-17225,,,"₹9,884","₹12,884"
173693,Satya Paul Faux Leather Green Olive Women Tote Bag,accessories,Handbags & Clutches,https://m.media-amazon.com/images/I/81SI2dD+-9L._AC_UL320_.jpg,https://www.amazon.in/Satya-Paul-Leather-Green-Olive/dp/B09QKL2QR9/ref=sr_1_18381?qid=1679159207&s=shoes&sr=1-18381,,,"₹3,497","₹4,995"
131498,GIVA 925 Oxidised Silver Evil Eye Pendant with Box Chain Necklace to Gifts Women with Certificate of Authenticity and 925 ...,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/61H-wdXIaRL._AC_UL320_.jpg,https://www.amazon.in/GIVA-Oxidised-Necklace-Certificate-Authenticity/dp/B08VJFY3DB/ref=sr_1_12596?qid=1679160379&s=jewelry&sr=1-12596,3.7,13,"₹1,709","₹3,599"
43885,RangTeq Multipurpose Different Size Combo of 2 Yoga Print Jute Lunch Carry Bag with Zipper Closure Beige,accessories,Bags & Luggage,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/71T-5-hNc5L._AC_UL320_.jpg,https://www.amazon.in/RangTeq-Combo-Yoga-Jute-Bag/dp/B081JF6S4R/ref=sr_1_12582?qid=1679144312&s=luggage&sr=1-12582,3.0,16,₹280,₹599
120217,GIVA 925 Sterling Silver Pearl Tiny Stud Earrings | Studs to Gift Women & Girls | With Certificate of Authenticity and 925...,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/61QsaH5AZNL._AC_UL320_.jpg,https://www.amazon.in/GIVA-Sterling-Earrings-Certificate-Authenticity/dp/B08XGLHBHL/ref=sr_1_268?qid=1679159937&s=jewelry&sr=1-268,4.0,257,₹949,"₹1,798"
...,...,...,...,...,...,...,...,...,...
137877,Cleo by Khadim's Women Navy Casual Strap-On Sandal,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/61PCnE+2JbL._AC_UL320_.jpg,https://www.amazon.in/CLEO-Khadims-Casual-Strap-Sandal/dp/B0893K3G8R/ref=sr_1_4544?qid=1679211690&s=shoes&sr=1-4544,,,₹329,₹549
138274,FootStreet Fancy Stylish women fashion comfortable casual block heels Slipper,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/51fEPmfBejL._AC_UL320_.jpg,https://www.amazon.in/Footstreet-Stylish-fashion-comfortable-numeric_7/dp/B09RWPWF9W/ref=sr_1_7556?qid=1679211720&s=shoes&sr=1-7556,3.0,1,₹499,₹999
139473,Flat n heels Womens Black Sandals FnH 798-BK,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/51L3vcKf9GL._AC_UL320_.jpg,https://www.amazon.in/Flat-Womens-Sandals-FnH-798-BK/dp/B0B1MTH334/ref=sr_1_18735?qid=1679211823&s=shoes&sr=1-18735,,,"₹1,699","₹3,399"
139269,Do Bhai Women's and Girls Casual Comfortable Fashion Heel Sandal/Ruchi,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/81ME3scfDgL._AC_UL320_.jpg,https://www.amazon.in/Do-Bhai-Womens-Comfortable-Fashion/dp/B0BH4SLGMF/ref=sr_1_15733?qid=1679211799&s=shoes&sr=1-15733,,,"₹1,199","₹2,999"


In [11]:
sampled_data.to_csv("data/amazon_products_sampled_raw.csv", index=False)