# Data Preprocessing

In [1]:
# install some libraries below

# upgrade pip
!{sys.executable} -m pip install --upgrade pip

# install external libraries
import sys
!{sys.executable} -m pip install nltk # text processing

/bin/bash: {sys.executable}: command not found
Defaulting to user installation because normal site-packages is not writeable


In [2]:
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# notebook configurations
pd.options.display.max_colwidth = 1000

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/bzekeria/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



**----**
**delete this cell once the notebook is completed**
**----**

### Tasks
guide to data cleaning:
- [theoretical](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)
- [code example](https://medium.com/bitgrit-data-science-publication/data-cleaning-with-python-f6bc3da64e45)

1. Check for duplicate rows
1. Convert data types &#10003;
1. Remove outliers &#10003;
1. Remove the unecessary columns: ```image``` and  ```link```
1. Confirm ```ratings``` is between 1 to 5 (ik during the mtg i said 0 but that was dumb lmao)
1. Check for null values
    - Approach: Do we remove the rows completely with X number of null values or do we impute the data? Something to think about
1. Convert ```discount_price``` and ```actual_price``` to USD (just in case if we need these columns later) &#10003;
1. ```name``` column
    - [NLTK library](https://www.nltk.org)
        - Stop words are already configured
    - lowkey hardest column to clean
        - In one hand intuitvely it'd make sense as, generally speaking, stop words carry little to no meaning though at times it may:
           - Looking at the first two rows of the dataset
               - I Jewels 18K Silver Oxidised Traditional Style Coin Necklace **With** Earrings **For** Women & Girls (MC061OX)
                   - **With** indicates the necklace comes with the earrings
                   - **For** is important as it signifies this product specifically is the one catered to women. if we removed it, we may get 2 products (one for males and one for females)
                   - **i'm sure if you search up these on amazon by removing those two words the same exact product will appear but our model won't be as advanced as Amazon's**
            
            - Sunglasses CH **by** Carolina Herrera SHE 148 Black Red 300Y
                - **By** doesn't affect the interpretion of the product if we remove it
1. **There might be more cleaning that needs to be done.** Go with your intuition and knowledge about products in general 
7. When everything is done, add this code to export the final dataset*: 

    - ```df.to_csv("data/amazon_products_sampled_cleaned.csv", index = False)```

In [3]:
df = pd.read_csv("data/amazon_products_sampled_raw.csv")
df

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,Omsheela Lac multi 16 Bangle Set -Rajasthani Beautiful lac material bangle- Women & Girls Multi color Bangle set,accessories,Fashion & Silver Jewellery,https://m.media-amazon.com/images/I/71QX9rMyRYL._AC_UL320_.jpg,https://www.amazon.in/Omsheela-Lac-Rajasthani-Multi-set/dp/B09HSBY58K/ref=sr_1_8165?qid=1679160222&s=jewelry&sr=1-8165,3.6,4,₹599,"₹1,599"
1,Aristocrat Airpro 66cm Polypropylene Hardsided Medium Size 8 Wheels Blue Trolley,accessories,Bags & Luggage,https://m.media-amazon.com/images/I/71u9uL8nzML._AC_UL320_.jpg,https://www.amazon.in/Aristocrat-Airpro-Polypropylene-Hardsided-Trolley/dp/B0BRN2SGRM/ref=sr_1_1534?qid=1679143939&s=luggage&sr=1-1534,,,"₹2,699","₹10,000"
2,Pilot18 777 Gold Aircraft Lapel Pin with Butterfly Clasp,accessories,Jewellery,https://m.media-amazon.com/images/I/51hcoFh0GML._AC_UL320_.jpg,https://www.amazon.in/Pilot18-Aircraft-Lapel-Butterfly-Clasp/dp/B07PWCK98L/ref=sr_1_15204?qid=1679145664&s=jewelry&sr=1-15204,5.0,4,₹449,₹750
3,NEUTRON Professional Analog Black and Blue Color Dial Girls Watch - GX1-(54-L-10) (Pack of 2),accessories,Watches,https://m.media-amazon.com/images/I/71Q-c7mGwYL._AC_UL320_.jpg,https://www.amazon.in/NEUTRON-Professional-Analog-Black-Color/dp/B0BMW6D216/ref=sr_1_15161?qid=1679156164&s=watches&sr=1-15161,,,₹197,₹546
4,PAGWIN Women Leather Handbags Flower Embroidered Bags Fashion Satchel Bags Top Handle,accessories,Bags & Luggage,https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/61blfwzVAcL._AC_UL320_.jpg,https://www.amazon.in/Leather-Handbags-Embroidered-Fashion-Satchel/dp/B09WN4BFNN/ref=sr_1_5201?qid=1679144059&s=luggage&sr=1-5201,3.6,154,₹339,₹999
...,...,...,...,...,...,...,...,...,...
19019,ABER & Q Cheer Women's Ballerina,women's shoes,Ballerinas,https://m.media-amazon.com/images/I/318fH1IVxoL._AC_UL320_.jpg,https://www.amazon.in/ABER-Cheer-Womens-Ballerina-Blue/dp/B07KZXGX3W/ref=sr_1_13053?qid=1679211941&s=shoes&sr=1-13053,3.0,1,"₹1,299","₹2,499"
19020,Elle womens Black Slipper Flat Sandal,women's shoes,Shoes,https://m.media-amazon.com/images/I/61CZp70HfuL._AC_UL320_.jpg,https://www.amazon.in/Elle-Womens-Slipper-Black-8/dp/B0B4VSPP1Y/ref=sr_1_4910?qid=1679211535&s=shoes&sr=1-4910,3.0,2,₹559,"₹1,299"
19021,Lazera Fashion Block Hill Sandals for Woman's,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/71GA+5bwmEL._AC_UL320_.jpg,https://www.amazon.in/Lazera-Fashion-Sandals-Womans-numeric_8/dp/B0BT1PG2QR/ref=sr_1_13612?qid=1679211776&s=shoes&sr=1-13612,,,"₹2,250","₹2,499"
19022,Carlton London womens Cll-7025 Heeled Sandal,women's shoes,Fashion Sandals,https://m.media-amazon.com/images/I/717t9VhaBHL._AC_UL320_.jpg,https://www.amazon.in/Carlton-London-Womens-Sandal-3-CLL-7025/dp/B0B636465H/ref=sr_1_17752?qid=1679211814&s=shoes&sr=1-17752,,,"₹1,498","₹2,995"


In [4]:
df.columns

Index(['name', 'main_category', 'sub_category', 'image', 'link', 'ratings',
       'no_of_ratings', 'discount_price', 'actual_price'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19024 entries, 0 to 19023
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            19024 non-null  object
 1   main_category   19024 non-null  object
 2   sub_category    19024 non-null  object
 3   image           19024 non-null  object
 4   link            19024 non-null  object
 5   ratings         13652 non-null  object
 6   no_of_ratings   13652 non-null  object
 7   discount_price  16502 non-null  object
 8   actual_price    18498 non-null  object
dtypes: object(9)
memory usage: 1.3+ MB


In [6]:
df["ratings"] = pd.to_numeric(df["ratings"], errors = "coerce")
df["no_of_ratings"] = pd.to_numeric(df["no_of_ratings"], errors = "coerce")

In [7]:
def detect_outliers(data, column):
    q1 = np.percentile(data[column], 25)
    q3 = np.percentile(data[column], 75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers

In [8]:
detect_outliers(df, "ratings")

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price


In [9]:
detect_outliers(df, "no_of_ratings")

Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price


The linkage between ratings and the number of ratings (```no_of_ratings```) is important when analyzing outliers because it helps identify anomalies and biases. Products with a low number of ratings are more susceptible to extreme or biased ratings, which can significantly impact the overall rating score. On the other hand, a high number of ratings provides more robustness to the rating score, making it less susceptible to individual outliers and more representative of the collective sentiment of the customers.