<a id="contents"></a>
# Session 2 - The Machine Learning Workflow



### [Preparing a "rich" dataset](#rich)
- [Importing with pandas](#import)
- [Prices variable](#prices)
- [Pandas exercise](#pandas_exercise)

### [Recoding Categorical Data](#recoding)
- [Recoding categorical variables with OneHotEncoder](#ohe)
- [Recoding categorical variables with pandas's get_dummies](#dummy)
- [Importing `mini_victoria.txt` data](#mini_victoria)

### [Handling Missing Data](#missing_data)
- [Importing datasets](#import_datasets)
- [Preparing datasets](#prepare_data)
- [Imputation with the median](#median)
- [Imputation with the mean](#mean)
- [Imputation with linear interpolation](#linear)
- [Simple imputation](#simple)
- [Multiple imputation](#multiple)
- [K-Nearest Neighbors](#neighbors)

<a id="import"></a>
### Importing with pandas

- Save the `mini_victoria.txt` file
- Check the data in a text editor such as Notepad++ or Visual Studio Code
- Import it using pandas
- Print a comprehensive summary
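
The import steps above can be sketched on a small in-memory sample. The three-line text and the name `sample_df` are made up for illustration; only the `*` separator comes from the notebook's own `read_csv` call.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file (same '*' separator)
sample_text = (
    "product_name*price*color\n"
    "Demi Bra*$19.99*Black\n"
    "Easy Plunge Bra*$29.50*White\n"
)

# read_csv accepts any file-like object, so StringIO replaces the file path here
sample_df = pd.read_csv(io.StringIO(sample_text), sep="*", header="infer")

# info() lists dtypes and non-null counts; describe() adds per-column statistics
sample_df.info()
print(sample_df.describe(include="all"))
```

Opening the file in a text editor first is what reveals the unusual separator, which is why that step comes before the import.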

In [1]:
import pandas as pd
import os

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',None)

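Absolute and relative paths

- homework
  - notebook
  - data
 
- A relative path is resolved from the current working directory: from the `notebook` folder, the data file is reached with r"..\data\victoria.txt". A bare file name such as "victoria.txt" is also a relative path and only works when the file sits in the working directory.
- An absolute path starts at the drive root and does not depend on the working directory.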
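
As a quick check of the distinction, here is a minimal sketch using the standard `pathlib` module. The folder layout is the one listed above; the variable names are illustrative only.

```python
from pathlib import Path

# Both of these are relative: they are resolved against the working directory
relative = Path("..") / "data" / "victoria.txt"
bare_name = Path("victoria.txt")

# Prefixing the working directory yields an absolute path
absolute = Path.cwd() / bare_name

print(relative.is_absolute())
print(bare_name.is_absolute())
print(absolute.is_absolute())
```

`Path` objects can be passed directly to `pd.read_csv`, and they use the correct separator on both Windows and Unix-like systems.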

In [2]:
# Data_path="../data"
# os.chdir(Data_path)

In [3]:
os.getcwd()

'C:\\Users\\Abhishek\\Downloads\\EMLYON CLASSES\\PYTHON\\3.Intro to ML'

In [4]:
df = pd.read_csv('mini_victoria.txt', sep="*", header="infer", encoding='latin1')

In [5]:
df.head()

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,retailer,description,rating,review_count,style_attributes,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,$36.50,$36.50,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Victoriassecret US,Game-changer: your favorite maximum-support sp...,3.6,25.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,$54.50,$19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Victoriassecret US,Sexy comfort and a sleek shape start with low-...,,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,$29.50,$29.50,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,Victoriassecret US,This supersoft bra is easy to love with fully ...,4.4,260.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,$39.50,$39.50,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,Victoriassecret US,The everyday go-to bra pairs sexy lift and the...,,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,$32.95,$32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,Victoriassecret US,"A super flirty new style, with more push than ...",,,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


In [6]:
df.shape

(45339, 14)

[Table of Contents](#contents)

<a id="prices"></a>
### The price variables (`mrp` and `price`) are not recognized as quantitative
- Do the necessary pre-processing to read them as numbers
- Create a function that removes the $ symbol from the USD amounts and replaces all other amounts with missing values
- Apply it to each of the price columns
- Check again with `df.info()`
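
One possible way to carry out these steps is sketched below with vectorized string methods. The function name `usd_to_float` and the sample values are hypothetical, and the sketch assumes that every USD amount starts with `$` and that anything else should become missing.

```python
import numpy as np
import pandas as pd


def usd_to_float(column: pd.Series) -> pd.Series:
    """Strip the leading '$' from USD amounts; turn everything else into NaN."""
    text = column.astype(str).str.strip()
    is_usd = text.str.startswith("$")
    # where() keeps USD strings and replaces the other entries with NaN
    cleaned = text.where(is_usd).str.replace("$", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical sample mixing USD amounts with other currencies
sample = pd.Series(["$36.50", "Rp 120000", "$19.99", "45¢"])
converted = usd_to_float(sample)
print(converted)
```

Passing `errors="coerce"` means any leftover unparsable string also becomes `NaN` instead of raising, which may or may not be what you want when auditing the data.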


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45339 non-null  object 
 2   price             45339 non-null  object 
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   retailer          45339 non-null  object 
 7   description       45339 non-null  object 
 8   rating            13662 non-null  float64
 9   review_count      13662 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45339 non-null  object 
 12  available_size    45339 non-null  object 
 13  color             45339 non-null  object 
dtypes: float64(3), object(11)
memory usage: 4.8+ MB


In [8]:
df["price"].value_counts()

price
$10.50       4599
$36.50       3239
$34.50       2569
$34.95       2450
$19.99       2255
$54.50       1845
$32.95       1660
$20.00       1569
$29.50       1438
$39.50       1399
$49.50       1385
$14.99       1290
$44.50       1173
$29.99       1128
$59.50        972
$24.50        967
$46.50        914
$24.99        913
$42.50        799
$62.50        781
$56.50        774
$58.50        658
$9.99         652
$52.50        624
$16.99        547
$14.50        535
$48.50        535
$8.50         503
$12.99        471
$32.50        437
$36.00        430
$16.50        414
$34.99        382
$38.00        345
$25.00        343
$36.95        317
$3.99         269
$24.95        268
$7.99         264
$17.99        238
$58.00        198
$48.00        180
$55.50        179
$64.50        171
$5.00         166
$15.00        165
$49.95        138
$5.99         135
$38.50        129
$52.00        125
$68.00        112
$32.00        109
$30.00        102
$42.00        101
$39.95         93
$22.

In [9]:
## your code here ##

import numpy as np

def remove_strings(df, x):
    # Strip the dollar sign so that "$36.50" becomes "36.50"
    df[x] = df[x].apply(lambda val: str(val).replace("$", "") if isinstance(val, (str, float, int)) else val)
    # Amounts in other currencies (containing "Rp" or "¢") become missing values
    df[x] = df[x].apply(lambda val: np.nan if "Rp" in str(val) or "¢" in str(val) else val)
    # Convert the remaining strings to floats
    df[x] = pd.to_numeric(df[x])
    return df

# Applying the function
df = remove_strings(df, "price")
df = remove_strings(df, "mrp")



df.head()

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,retailer,description,rating,review_count,style_attributes,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Victoriassecret US,Game-changer: your favorite maximum-support sp...,3.6,25.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,54.5,19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Victoriassecret US,Sexy comfort and a sleek shape start with low-...,,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,29.5,29.5,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,Victoriassecret US,This supersoft bra is easy to love with fully ...,4.4,260.0,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,39.5,39.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,Victoriassecret US,The everyday go-to bra pairs sexy lift and the...,,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,32.95,32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,Victoriassecret US,"A super flirty new style, with more push than ...",,,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


Check how many missing values we have

In [10]:
df.isna().sum()

product_name            0
mrp                    39
price                  39
pdp_url                 0
brand_name              0
product_category        0
retailer                0
description             0
rating              31677
review_count        31677
style_attributes    45339
total_sizes             0
available_size          0
color                   0
dtype: int64

In [11]:
## your code here ##
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   retailer          45339 non-null  object 
 7   description       45339 non-null  object 
 8   rating            13662 non-null  float64
 9   review_count      13662 non-null  float64
 10  style_attributes  0 non-null      float64
 11  total_sizes       45339 non-null  object 
 12  available_size    45339 non-null  object 
 13  color             45339 non-null  object 
dtypes: float64(5), object(9)
memory usage: 4.8+ MB


Now, replace the two non-numerical price columns by numerical price columns (quantitative data)

In [12]:
## your code here ##

In [13]:
## your code here ##

Count the number of unique modalities in each variable of the dataframe

In [14]:
## your code here ##
df.nunique()

product_name         599
mrp                   72
price                 89
pdp_url             1410
brand_name             2
product_category     445
retailer               1
description          536
rating                31
review_count         333
style_attributes       0
total_sizes           30
available_size        44
color               1300
dtype: int64

In [15]:
df['available_size']

0        32D3
1         38C
2        34DD
3         32D
4         32D
5           L
6         32A
7          XL
8         32A
9         32C
10         XL
11        38D
12        36C
13        36C
14        38B
15         XL
16          L
17       34DD
18       XS/S
19       38D3
20        34A
21         XS
22        38D
23        36A
24        40C
25         XL
26          S
27         XS
28        36D
29        32A
30          S
31        34B
32        32B
33        36A
34        32A
35          S
36         XS
37        32D
38          S
39       40DD
40        34C
41        36C
42        34D
43         XS
44        32D
45          L
46       32D3
47       34AA
48        36D
49        32A
50        34B
51        34A
52        34D
53          S
54       34D3
55       40D3
56       34D3
57          L
58         XL
59       32DD
60       40DD
61        32B
62        38B
63        32A
64        36B
65        32A
66        38B
67          M
68        32A
69        34C
70        36B
71    

Check the modalities of the `brand_name` variable

In [16]:
## your code here ##
df['brand_name'].value_counts()

brand_name
Victoria's Secret         34240
Victoria's Secret Pink    11099
Name: count, dtype: int64

Were we to continue the analysis of this dataset, we would certainly remove the following columns:
- `retailer`: it has no variability, so it is useless
- `style_attributes`: it has no values (all data are missing)
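
Such columns can also be detected programmatically. The sketch below runs on a hypothetical three-row frame; the rule "at most one distinct non-missing value" is an assumption that covers both cases listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset
toy = pd.DataFrame({
    "price": [36.5, 19.99, 29.5],
    "retailer": ["Victoriassecret US"] * 3,           # constant column
    "style_attributes": [np.nan, np.nan, np.nan],     # entirely missing
})

# nunique() ignores NaN: 0 means all missing, 1 means constant
uninformative = [col for col in toy.columns if toy[col].nunique(dropna=True) <= 1]
reduced = toy.drop(columns=uninformative)

print(uninformative)
print(reduced.columns.tolist())
```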

In [17]:
df.drop(["retailer","style_attributes"],axis=1,inplace=True)
df.head()

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,description,rating,review_count,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,54.5,19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Sexy comfort and a sleek shape start with low-...,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,29.5,29.5,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,This supersoft bra is easy to love with fully ...,4.4,260.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,39.5,39.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,The everyday go-to bra pairs sexy lift and the...,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,32.95,32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,"A super flirty new style, with more push than ...",,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


[Table of Contents](#contents)

<a id="pandas_exercise"></a>
### Pandas Exercise

1. Write the lines of code to provide the name of the cheapest product 
2. Write the lines of code to count the number of products with available size equal to '38A'
3. Write the lines of code to list and count the type and color of the most expensive products containing 'sport bra'
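
The three questions map onto a few common pandas idioms, sketched here on a made-up five-row frame (all values are hypothetical). The case-insensitive match with `case=False` is a choice made for this sketch, not something the exercise prescribes.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_name": ["Thong Panty", "Demi Bra", "Sport Bra", "Brief Panty", "Sport Bra"],
    "product_category": ["Thong Panty", "Demi Bra", "Front-close Sport Bra", "Brief Panty", "Sport Bra"],
    "price": [3.99, 19.99, 36.5, 3.99, 36.5],
    "available_size": ["S", "38B", "32D", "M", "38B"],
    "color": ["Black", "White", "Black", "Fir", "Black"],
})

# 1. Names of the cheapest products (several rows may share the minimum)
cheapest = toy.loc[toy["price"] == toy["price"].min(), "product_name"].unique()

# 2. Number of rows with a given available size
n_38b = int((toy["available_size"] == "38B").sum())

# 3. Most expensive sport bras, counted by product name and color
sport = toy[toy["product_category"].str.contains("sport bra", case=False)]
top_sport = sport[sport["price"] == sport["price"].max()]
counts = top_sport.groupby(["product_name", "color"]).size().reset_index(name="count")

print(cheapest, n_38b)
print(counts)
```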

In [18]:
# Write the lines of code to provide the name of the cheapest product 
## your code here ##
set(df[df["price"]==min(df["price"])]['product_name'])

{'Cotton Lingerie Lace-waist Brief Panty',
 'Cotton Lingerie Mesh Thong Panty',
 'Cotton Lingerie String Bikini Panty',
 'Seamless Cheekini Panty'}

In [19]:
len(set(df[df["price"]==min(df["price"])]['product_name']))

4

In [20]:
# Write the lines of code to count the number of products with available size equal to '38A'
## your code here ##
len(df[df['available_size']=="38A"])

0

In [21]:
# Write the lines of code to count the number of products with available size equal to '38B'
## your code here ##
len(df[df['available_size']=="38B"])

630

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   description       45339 non-null  object 
 7   rating            13662 non-null  float64
 8   review_count      13662 non-null  float64
 9   total_sizes       45339 non-null  object 
 10  available_size    45339 non-null  object 
 11  color             45339 non-null  object 
dtypes: float64(4), object(8)
memory usage: 4.2+ MB


In [23]:
# df['product_category'].value_counts()

In [24]:
# Write the lines of code to list and count the type and color of the products containing 'sport bra' worth price = 36.50 $
## your code here ##
df[(df['product_category'].str.contains("Sport Bra")) & (df['price']==36.5)  ][["product_name","color"]].head()



Unnamed: 0,product_name,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,White
17,Victoria Sport NEW! Knockout by Victoria Sport...,Burnished Lilac
39,Victoria Sport Knockout by Victoria Sport Fron...,Hello Lovely
40,Victoria Sport Incredible by Victoria Sport Bra,Trilobel Marl
46,Victoria Sport NEW! Incredible by Victoria Spo...,Radiating Aztec


In [25]:
df_bra_cost=df[(df['product_category'].str.contains("Sport Bra")) & (df['price']==36.5)  ][["product_name","color"]].groupby(["product_name","color"]).size().reset_index(name='count')
df_bra_cost.head()

Unnamed: 0,product_name,color,count
0,Victoria Sport Incredible by Victoria Sport Bra,Almost Nude,59
1,Victoria Sport Incredible by Victoria Sport Bra,Black,43
2,Victoria Sport Incredible by Victoria Sport Bra,Black Blocked Curves,23
3,Victoria Sport Incredible by Victoria Sport Bra,Blackberry,25
4,Victoria Sport Incredible by Victoria Sport Bra,Burnished Lilac,49


In [26]:
print(f"the count is {len(df[(df['product_category'].str.contains('Sport Bra')) & (df['price']==36.5)]['product_name'])}")


the count is 2742


In [27]:
# Write the lines of code to list and count the type and color of the most expensive products containing 'sport bra' 
## your code here ##



In [28]:
df_sb=df[(df['product_category'].str.contains("Sport Bra"))]
df_sb[df_sb['price']==max(df_sb['price'])]['product_name'].head()

0     Victoria Sport NEW! Incredible by Victoria Spo...
17    Victoria Sport NEW! Knockout by Victoria Sport...
39    Victoria Sport Knockout by Victoria Sport Fron...
40      Victoria Sport Incredible by Victoria Sport Bra
46    Victoria Sport NEW! Incredible by Victoria Spo...
Name: product_name, dtype: object

In [29]:
df_sb[df_sb['price']==max(df_sb['price'])][['product_name',"color"]].groupby(["product_name","color"]).size().reset_index(name='count')


Unnamed: 0,product_name,color,count
0,Victoria Sport Incredible by Victoria Sport Bra,Almost Nude,59
1,Victoria Sport Incredible by Victoria Sport Bra,Black,43
2,Victoria Sport Incredible by Victoria Sport Bra,Black Blocked Curves,23
3,Victoria Sport Incredible by Victoria Sport Bra,Blackberry,25
4,Victoria Sport Incredible by Victoria Sport Bra,Burnished Lilac,49
5,Victoria Sport Incredible by Victoria Sport Bra,Fir,4
6,Victoria Sport Incredible by Victoria Sport Bra,Hello Lovely,52
7,Victoria Sport Incredible by Victoria Sport Bra,Laced Arrows,23
8,Victoria Sport Incredible by Victoria Sport Bra,Radiating Aztec,54
9,Victoria Sport Incredible by Victoria Sport Bra,Trilobel Marl,48


In [30]:
len(df_sb[df_sb['price']==max(df_sb['price'])]['product_name'])

2742

In [31]:
max(df_sb['price'])

36.5

[Table of Contents](#contents)

<a id="recoding"></a>
## Recoding Categorical Data

### Import the `Credit.csv` dataset
- Recode all the categorical variables using sklearn onehotencoder and pandas get_dummies
- Compare your results

In [32]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_info_columns', 300)

import warnings
warnings.filterwarnings('ignore')

In [33]:
df_credit= pd.read_csv("Credit.csv")

In [34]:
df_credit.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance
0,14.891,3606,283,2,34,11,No,No,Yes,South,333
1,106.025,6645,483,3,82,15,Yes,Yes,Yes,West,903
2,104.593,7075,514,4,71,11,No,No,No,West,580
3,148.924,9504,681,3,36,11,Yes,No,No,West,964
4,55.882,4897,357,2,68,16,No,No,Yes,South,331


[Table of Contents](#contents)

<a id="ohe"></a>
### Recode all the categorical variables using sklearn onehotencoder

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   description       45339 non-null  object 
 7   rating            13662 non-null  float64
 8   review_count      13662 non-null  float64
 9   total_sizes       45339 non-null  object 
 10  available_size    45339 non-null  object 
 11  color             45339 non-null  object 
dtypes: float64(4), object(8)
memory usage: 4.2+ MB


In [36]:
df_cat = ["Region","Married","Student","Own"]
df_num = ["Income","Limit","Rating","Cards","Age","Education","Balance"]

In [37]:
## your code here ##
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(df_credit[df_cat])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(df_cat))
df_encoded = pd.concat([df_credit, one_hot_df], axis=1)
df_encoded = df_encoded.drop(df_cat, axis=1)


df_encoded.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Region_East,Region_South,Region_West,Married_No,Married_Yes,Student_No,Student_Yes,Own_No,Own_Yes
0,14.891,3606,283,2,34,11,333,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
1,106.025,6645,483,3,82,15,903,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
2,104.593,7075,514,4,71,11,580,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
3,148.924,9504,681,3,36,11,964,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
4,55.882,4897,357,2,68,16,331,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0


In [38]:
## your code here ##

In [39]:
## your code here ##

<a id="dummy"></a>
### Recode all the categorical variables using pandas get_dummies

In [40]:
## your code here ##
one_hot = pd.get_dummies(df_credit, columns = df_cat,dtype=int)
one_hot.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Region_East,Region_South,Region_West,Married_No,Married_Yes,Student_No,Student_Yes,Own_No,Own_Yes
0,14.891,3606,283,2,34,11,333,0,1,0,0,1,1,0,1,0
1,106.025,6645,483,3,82,15,903,0,0,1,0,1,0,1,0,1
2,104.593,7075,514,4,71,11,580,0,0,1,1,0,1,0,1,0
3,148.924,9504,681,3,36,11,964,0,0,1,1,0,1,0,0,1
4,55.882,4897,357,2,68,16,331,0,1,0,0,1,1,0,1,0


Check equivalence of the two dataframes

In [41]:
## your code here ##

comparison=one_hot==df_encoded
comparison.head()

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Balance,Region_East,Region_South,Region_West,Married_No,Married_Yes,Student_No,Student_Yes,Own_No,Own_Yes
0,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
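
The element-wise `==` above compares values only, so `0` and `0.0` count as equal. A stricter check is sketched below on a hypothetical two-column frame, where float dummies stand in for the `OneHotEncoder` output (an assumption made to keep the sketch pandas-only).

```python
import pandas as pd

toy = pd.DataFrame({"Student": ["No", "Yes", "No"], "Balance": [333, 903, 580]})

as_int = pd.get_dummies(toy, columns=["Student"], dtype=int)      # like get_dummies above
as_float = pd.get_dummies(toy, columns=["Student"], dtype=float)  # stand-in for OneHotEncoder

# Value-only comparison collapsed to a single boolean
same_values = bool((as_int == as_float).all().all())

# equals() also requires matching dtypes
same_values_and_dtypes = as_int.equals(as_float)

# Raises AssertionError on any value mismatch, ignoring dtypes
pd.testing.assert_frame_equal(as_int, as_float, check_dtype=False)

print(same_values, same_values_and_dtypes)
```

Collapsing with `.all().all()` avoids having to scan the boolean table by eye, which matters once the frame has more than a handful of rows.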


[Table of Contents](#contents)

<a id="mini_victoria"></a>
### Import the `mini_victoria.txt` dataset
- Which categorical variables should be one-hot encoded?
- Which categorical variables should be label encoded?


In [42]:
df.head() ## your code here ##

Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,description,rating,review_count,total_sizes,available_size,color
0,Victoria Sport NEW! Incredible by Victoria Spo...,36.5,36.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Incredible by Victoria Sport Front-close Sport...,Game-changer: your favorite maximum-support sp...,3.6,25.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D3,White
1,Body by Victoria Demi Bra,54.5,19.99,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Demi Bra,Sexy comfort and a sleek shape start with low-...,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",38C,cadette green
2,Easy Plunge Bra,29.5,29.5,https://www.victoriassecret.com/bras/bralette/...,Victoria's Secret,Easy Plunge Bra,This supersoft bra is easy to love with fully ...,4.4,260.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",34DD,Black
3,The T-Shirt Perfect Shape Bra,39.5,39.5,https://www.victoriassecret.com/bras/shop-all-...,Victoria's Secret,Perfect Shape Bra,The everyday go-to bra pairs sexy lift and the...,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",32D,Coconut White Matte Print
4,PINK NEW! Wear Everywhere Super Push,32.95,32.95,https://www.victoriassecret.com/pink/panties/w...,Victoria's Secret Pink,Wear Everywhere Super Push,"A super flirty new style, with more push than ...",,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",32D,bayberry


[Table of Contents](#contents)

Now, replace the two non-numerical price columns by numerical price columns (quantitative data)

In [43]:
## your code here 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45339 entries, 0 to 45338
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      45339 non-null  object 
 1   mrp               45300 non-null  float64
 2   price             45300 non-null  float64
 3   pdp_url           45339 non-null  object 
 4   brand_name        45339 non-null  object 
 5   product_category  45339 non-null  object 
 6   description       45339 non-null  object 
 7   rating            13662 non-null  float64
 8   review_count      13662 non-null  float64
 9   total_sizes       45339 non-null  object 
 10  available_size    45339 non-null  object 
 11  color             45339 non-null  object 
dtypes: float64(4), object(8)
memory usage: 4.2+ MB


[Table of Contents](#contents)

### Count the number of modalities for each categorical variable

In [44]:
## your code here ##
df.nunique()

product_name         599
mrp                   72
price                 89
pdp_url             1410
brand_name             2
product_category     445
description          536
rating                31
review_count         333
total_sizes           30
available_size        44
color               1300
dtype: int64

[Table of Contents](#contents)

Any categorical variable with more than 20 modalities should be label-encoded <br>
Why 20 modalities, and not more or fewer? It is a rule of thumb that depends on the number of remaining features: the more features there already are, the less room there is for the extra columns that one-hot encoding creates.
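
The rule of thumb can be turned into a small helper, sketched below. The function `split_by_cardinality`, the toy data and the treatment of a column with exactly `threshold` modalities (sent to one-hot encoding) are all assumptions of this sketch. Category codes from pandas are used here as a stand-in for `LabelEncoder`.

```python
import pandas as pd


def split_by_cardinality(frame, columns, threshold=20):
    """Return (columns to label-encode, columns to one-hot encode)."""
    n_modalities = frame[columns].nunique()
    to_label = n_modalities[n_modalities > threshold].index.tolist()
    to_one_hot = n_modalities[n_modalities <= threshold].index.tolist()
    return to_label, to_one_hot


# Hypothetical data; a threshold of 3 keeps the example small
toy = pd.DataFrame({
    "brand_name": ["VS", "VS Pink"] * 3,
    "color": ["Black", "White", "Fir", "Blackberry", "Almost Nude", "bayberry"],
})
to_label, to_one_hot = split_by_cardinality(toy, ["brand_name", "color"], threshold=3)

encoded = toy.copy()
for col in to_label:
    # Integer code per modality, assigned in sorted order of the modalities
    encoded[col] = encoded[col].astype("category").cat.codes
encoded = pd.get_dummies(encoded, columns=to_one_hot, dtype=int)

print(to_label, to_one_hot)
print(encoded)
```

Keep in mind that integer codes impose an arbitrary order on the modalities, which some models will interpret as a real ordering.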

In [45]:
## your code here ##
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder() 
df['pdp_url']= label_encoder.fit_transform(df['pdp_url']) 
df['product_category']= label_encoder.fit_transform(df['product_category']) 
df['description']= label_encoder.fit_transform(df['description']) 
df['available_size']= label_encoder.fit_transform(df['available_size']) 
df['color']= label_encoder.fit_transform(df['color']) 
df['product_name']= label_encoder.fit_transform(df['product_name']) 

df.head()


Unnamed: 0,product_name,mrp,price,pdp_url,brand_name,product_category,description,rating,review_count,total_sizes,available_size,color
0,567,36.5,36.5,338,Victoria's Secret,122,170,3.6,25.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",8,553
1,4,54.5,19.99,264,Victoria's Secret,67,302,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",25,705
2,165,29.5,29.5,89,Victoria's Secret,73,475,4.4,260.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",16,23
3,439,39.5,39.5,491,Victoria's Secret,283,406,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",7,159
4,324,32.95,32.95,1247,Victoria's Secret Pink,427,68,,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",7,610


In [47]:
df_dummy = pd.get_dummies(df, columns = ["brand_name"],dtype=int)
df_dummy.head()

Unnamed: 0,product_name,mrp,price,pdp_url,product_category,description,rating,review_count,total_sizes,available_size,color,brand_name_Victoria's Secret,brand_name_Victoria's Secret Pink
0,567,36.5,36.5,338,122,170,3.6,25.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",8,553,1,0
1,4,54.5,19.99,264,67,302,,,"[""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""30DDD"", ...",25,705,1,0
2,165,29.5,29.5,89,73,475,4.4,260.0,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""34A"", ""3...",16,23,1,0
3,439,39.5,39.5,491,283,406,,,"[""32A"", ""32B"", ""32C"", ""32D"", ""32DD"", ""32DDD"", ...",7,159,1,0
4,324,32.95,32.95,1247,427,68,,,"[""30AA"", ""30A"", ""30B"", ""30C"", ""30D"", ""30DD"", ""...",7,610,0,1


In [None]:
## your code here ##

[Table of Contents](#contents)

In [None]:
## your code here ##

[Table of Contents](#contents)

Any categorical variable with fewer than 20 modalities should be one-hot encoded <br>
Why 20 modalities, and not more or fewer? It is a rule of thumb that depends on the number of remaining features: the more features there already are, the less room there is for the extra columns that one-hot encoding creates.

Create a function that cleans and formats the `total_sizes` column

In [48]:
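import re

def clean(row):
    # Replace everything except capital letters and digits with a space
    row = re.sub(r'[^A-Z0-9]', " ", row)
    # Split on whitespace and drop the empty strings left at both ends
    row = re.split(r'\s+', row)
    return [item for item in row if item != '']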
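
Because each `total_sizes` cell looks like a JSON list, parsing it with the standard `json` module may be an alternative to the regular expression. This is only a sketch: it assumes every cell is valid JSON, and `parse_sizes` and the sample string are hypothetical. The regex version is restated so that the two can be compared side by side.

```python
import json
import re


def clean_with_regex(row):
    """Same logic as the notebook's clean(), restated with the standard re module."""
    row = re.sub(r"[^A-Z0-9]", " ", row)
    return [item for item in re.split(r"\s+", row) if item != ""]


def parse_sizes(cell):
    """Parse a JSON-style list of sizes; assumes the cell is valid JSON."""
    return json.loads(cell)


sample = '["32A", "32B", "XS/S"]'
print(clean_with_regex(sample))
print(parse_sizes(sample))
```

The difference shows up on a size such as `XS/S`, which appears in `available_size`: the regex splits it into two items, while the JSON parser keeps it whole. Whether such sizes also occur in `total_sizes` would need to be checked on the real data.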

In [None]:
## your code here ##

In [52]:
df_dummy["total_sizes"]=df_dummy["total_sizes"].apply(lambda x: clean(x))
df_dummy.head()

Unnamed: 0,product_name,mrp,price,pdp_url,product_category,description,rating,review_count,total_sizes,available_size,color,brand_name_Victoria's Secret,brand_name_Victoria's Secret Pink
0,567,36.5,36.5,338,122,170,3.6,25.0,"[32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...",8,553,1,0
1,4,54.5,19.99,264,67,302,,,"[30A, 30B, 30C, 30D, 30DD, 30DDD, 32A, 32B, 32...",25,705,1,0
2,165,29.5,29.5,89,73,475,4.4,260.0,"[32A, 32B, 32C, 32D, 32DD, 34A, 34B, 34C, 34D,...",16,23,1,0
3,439,39.5,39.5,491,283,406,,,"[32A, 32B, 32C, 32D, 32DD, 32DDD, 34A, 34B, 34...",7,159,1,0
4,324,32.95,32.95,1247,427,68,,,"[30AA, 30A, 30B, 30C, 30D, 30DD, 32AA, 32A, 32...",7,610,0,1


[Table of Contents](#contents)

### Explore the .explode() method with the `total_sizes` column

In [53]:
df_ohe=df_dummy.copy()

In [54]:
df_exp = df_ohe.explode('total_sizes')

In [55]:
df_exp.info()
df_exp.head()

<class 'pandas.core.frame.DataFrame'>
Index: 885689 entries, 0 to 45338
Data columns (total 13 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   product_name                       885689 non-null  int32  
 1   mrp                                884916 non-null  float64
 2   price                              884916 non-null  float64
 3   pdp_url                            885689 non-null  int32  
 4   product_category                   885689 non-null  int32  
 5   description                        885689 non-null  int32  
 6   rating                             272585 non-null  float64
 7   review_count                       272585 non-null  float64
 8   total_sizes                        885671 non-null  object 
 9   available_size                     885689 non-null  int32  
 10  color                              885689 non-null  int32  
 11  brand_name_Victoria's Secret       885689 non

Unnamed: 0,product_name,mrp,price,pdp_url,product_category,description,rating,review_count,total_sizes,available_size,color,brand_name_Victoria's Secret,brand_name_Victoria's Secret Pink
0,567,36.5,36.5,338,122,170,3.6,25.0,32A,8,553,1,0
0,567,36.5,36.5,338,122,170,3.6,25.0,32B,8,553,1,0
0,567,36.5,36.5,338,122,170,3.6,25.0,32C,8,553,1,0
0,567,36.5,36.5,338,122,170,3.6,25.0,32D,8,553,1,0
0,567,36.5,36.5,338,122,170,3.6,25.0,32DD,8,553,1,0


In [56]:
df_exp.nunique()

product_name                          599
mrp                                    72
price                                  89
pdp_url                              1410
product_category                      445
description                           536
rating                                 31
review_count                          333
total_sizes                            52
available_size                         44
color                                1300
brand_name_Victoria's Secret            2
brand_name_Victoria's Secret Pink       2
dtype: int64
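
To see what `.explode()` does on a scale that can be checked by hand, here is a minimal sketch on a hypothetical three-row frame. The last step, which folds the exploded rows back into one indicator column per size, is one possible follow-up and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "product": ["a", "b", "c"],
    "total_sizes": [["S", "M"], ["M", "L", "XL"], []],
})

# One row per list element; the original index is repeated,
# and an empty list becomes a single row holding NaN
exploded = toy.explode("total_sizes")

# Fold back to one row per product with a 0/1 column per size
multi_hot = pd.get_dummies(exploded["total_sizes"], dtype=int).groupby(level=0).max()

print(exploded)
print(multi_hot)
```

The empty list explains why `total_sizes` has slightly fewer non-null values than rows in the `info()` output above.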

[Table of Contents](#contents)

<a id="missing_data"></a>
## Handling Missing Data 

In [57]:
import pandas as pd
import os
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_info_columns', 300)

import warnings
warnings.filterwarnings('ignore')

[Table of Contents](#contents)

<a id="import_datasets"></a>
### Importing datasets

Import the `Credit.dat` dataset

In [7]:
file_path=r"Credit"
with open(file_path, 'r') as file:
    content = file.read()
    print(content)
df_miss = ## your code here ##

Import the `Credit.csv` dataset

In [8]:
df = ## your code here ##

[Table of Contents](#contents)

<a id="median"></a>
### Imputation with the median

In [9]:
## your code here ##

Compute the overall imputation error using the MSE
- Suggestion: use a function...
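
A possible shape for such a function is sketched below. It assumes that a complete reference table is available alongside the table with gaps, and it measures the error only on the cells that were originally missing. The name `imputation_mse` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def imputation_mse(truth, with_gaps, imputed):
    """Mean squared error over the cells that were missing in `with_gaps`."""
    mask = with_gaps.isna().to_numpy()
    errors = (imputed.to_numpy(dtype=float) - truth.to_numpy(dtype=float))[mask]
    return float(np.mean(errors ** 2))


# Hypothetical complete data and the same data with two gaps
truth = pd.DataFrame({"Income": [1.0, 2.0, 3.0, 4.0]})
with_gaps = pd.DataFrame({"Income": [1.0, np.nan, 3.0, np.nan]})

# Median imputation: the median of the observed values (1 and 3) is 2
imputed = with_gaps.fillna(with_gaps.median())

mse = imputation_mse(truth, with_gaps, imputed)
print(mse)  # squared errors are 0 and 4, so the MSE is 2.0
```

Restricting the error to the imputed cells keeps the score comparable across methods, since the observed cells are identical in every imputed table.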

In [1]:
# your code here ##

[Table of Contents](#contents)

<a id="mean"></a>
### Imputation with the mean

In [10]:
## your code here ##

Compute the overall imputation error using the MSE
- Suggestion: use a function...

In [2]:
## your code here ##

[Table of Contents](#contents)

<a id="linear"></a>
### Imputation with linear interpolation
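
Linear interpolation fills each gap from the nearest observed values before and after it in the same column. A minimal sketch on a hypothetical column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Income": [10.0, np.nan, 30.0, np.nan, np.nan, 60.0]})

# Each run of NaN is replaced by evenly spaced values between its neighbors
filled = toy.interpolate(method="linear")

print(filled["Income"].tolist())
```

The result depends on the row order, so this method is most meaningful when consecutive rows are related, for example in a time series.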

In [11]:
## your code here ##

Compute the overall imputation error using the MSE
- Suggestion: use a function...

In [3]:
## your code here ##

[Table of Contents](#contents)

<a id="simple"></a>
### Simple imputation

Using the mean as constant
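
A minimal sketch with scikit-learn's `SimpleImputer` on a hypothetical array. The same class also covers the mode through `strategy="most_frequent"`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one gap per column
X = np.array([
    [1.0, 10.0],
    [np.nan, 10.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Each gap is replaced by a statistic of its own column
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_filled)
print(mode_filled)
```

Fitting the imputer on the training data and reusing it with `transform` on new data keeps the fill values consistent between the two sets.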

In [4]:
## your code here ##

In [5]:
## your code here ##

[Table of Contents](#contents)

Using the mode as constant

In [44]:
## your code here ##

In [12]:
## your code here ##

[Table of Contents](#contents)

<a id="multiple"></a>
### Multiple imputation
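
One way to approach this section is scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the other features. The sketch below assumes this is the intended tool and uses hypothetical data. Note that a single call returns one imputed table; drawing several tables would need `sample_posterior=True` with different seeds.

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data in which the second column is roughly twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```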

In [36]:
## your code here ##

In [13]:
## your code here ##

[Table of Contents](#contents)

<a id="neighbors"></a>
### K-Nearest Neighbors

With the default 5 neighbors
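
A minimal sketch with scikit-learn's `KNNImputer` on hypothetical data. Each gap is filled from the rows that are closest on the observed features and that have the missing feature available.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with a single gap in the second column
X = np.array([
    [1.0, np.nan],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
])

# n_neighbors defaults to 5; it is spelled out here for clarity
imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Distances are computed on the raw values, so features on very different scales may need to be standardized before imputation.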

In [48]:
## your code here ##

In [6]:
## your code here ##

[Table of Contents](#contents)

## Conclusion

**On average, the multiple (iterative) and the KNN imputation methods are clearly the best**