# Products - Data Wrangling and Data Consistency Checks

1. Importing libraries and dataset
2. Checking for columns, datatype, shape using .info()
3. Addressing missing values
4. Addressing duplicates
5. Checking for mixed datatype
6. Changing datatypes to reduce memory usage
7. Performing Descriptive Analysis
8. Exporting wrangled, consistency checked dataframe

## 1. Importing libraries and dataset

In [1]:
# Importing libraries

import pandas as pd
import os

In [2]:
# Accessing EnvFile for path

%run EnvFile.ipynb

Stored 'path' (str)


In [3]:
# Importing products.csv to dataframe

df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))

In [4]:
# Checking the head

df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


## 2. Checking for columns, datatype, shape using .info()

In [5]:
# Checking the info for columns, datatypes, shape of dataframe

df_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  int64  
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.9+ MB


#### The shape of df_prods before consistency checks is (49693, 5) with memory usage of 1.9+ MB.
#### There are no columns to be removed or renamed in this dataframe.
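The `+` in `1.9+ MB` means pandas only counted the pointers of the `object` column, not the strings they refer to. A minimal sketch of how the fuller figure could be obtained with `memory_usage(deep=True)`; the `toy` frame is a hypothetical stand-in for `df_prods`, not the real data:

```python
import pandas as pd

# Hypothetical toy frame standing in for df_prods
toy = pd.DataFrame({
    "product_id": [1, 2, 5],
    # dtype=object mirrors how product_name is stored in this notebook
    "product_name": pd.Series(
        ["Chocolate Sandwich Cookies", "All-Seasons Salt", "Green Chile Anytime Sauce"],
        dtype=object,
    ),
    "prices": [5.8, 9.3, 4.3],
})

# Shallow: object columns are counted as 8-byte pointers only
shallow = toy.memory_usage(deep=False).sum()

# Deep: the Python string objects themselves are measured as well
deep = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

On the real dataframe, `df_prods.info(memory_usage='deep')` reports the same deep figure in the usual `.info()` layout.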

## 3. Addressing missing values

In [6]:
# Finding missing values

df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [7]:
# Creating a subset of missing values in product_name

df_nan = df_prods[df_prods['product_name'].isnull()]

In [8]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [9]:
# Number of rows before removing missing values in df_prods

df_prods.shape

(49693, 5)

In [10]:
# Reassigning the subset of non-missing values to df_prods

df_prods = df_prods[df_prods['product_name'].notnull()]

In [11]:
# Number of rows after removing missing values in df_prods

df_prods.shape

(49677, 5)
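The row count drops from 49693 to 49677, which matches the 16 missing product names found above. As a sketch of the same step with a built-in check, `dropna(subset=...)` can replace the boolean mask; the `toy` frame below is hypothetical and only illustrates the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing product names
toy = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "product_name": ["Ranger IPA", np.nan, "All-Seasons Salt", np.nan],
})

# Count the missing names before dropping anything
n_missing = int(toy["product_name"].isnull().sum())

# Drop only the rows where product_name is missing
cleaned = toy.dropna(subset=["product_name"])

# The number of removed rows should equal the number of missing names
removed = len(toy) - len(cleaned)
print(n_missing, removed, cleaned.shape)
```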

## 4. Addressing duplicates

In [12]:
# Checking for duplicates

df_prods[df_prods.duplicated()]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [13]:
# Dropping the duplicate rows identified above using drop_duplicates()

df_prods = df_prods.drop_duplicates()

In [14]:
# Number of rows after removing duplicate values

df_prods.shape

(49672, 5)
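Five duplicate rows were listed and the shape went from 49677 to 49672 rows, so the counts agree. A small sketch of how that reconciliation could be made explicit, using a hypothetical `toy` frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame containing one fully duplicated row
toy = pd.DataFrame({
    "product_id": [462, 462, 18458],
    "product_name": [
        "Fiber 4g Gummy Dietary Supplement",
        "Fiber 4g Gummy Dietary Supplement",
        "Ranger IPA",
    ],
})

# duplicated() marks every repeat of an earlier row (the first occurrence is kept)
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() removes exactly the rows that duplicated() marked
deduped = toy.drop_duplicates()

print(n_dupes, len(toy), len(deduped))
```

Note that `duplicated()` compares all columns by default; rows that share a `product_id` but differ elsewhere would not be flagged unless `subset=['product_id']` is passed.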

## 5. Checking for mixed datatype

In [15]:
# Check for mixed-type data in the df_prods dataframe

for col in df_prods.columns.tolist():
    mixeddata = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_prods[mixeddata]) > 0:
        print(col)

#### No column names were printed, so there is no mixed-type data.
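The loop above compares every value's Python type with the type of the first row. An alternative sketch that reaches the same verdict by counting distinct types per column; the `toy` frame and the stray integer in `product_name` are hypothetical, added only so that the check has something to find:

```python
import pandas as pd

# Hypothetical toy frame: product_name deliberately mixes str and int
toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": pd.Series(["Ranger IPA", 42, "All-Seasons Salt"], dtype=object),
})

# A column is mixed-type if its values map to more than one Python type
mixed_cols = [col for col in toy.columns if toy[col].map(type).nunique() > 1]

print(mixed_cols)
```

This version uses `Series.map`, which avoids `DataFrame.applymap`; the latter is deprecated in recent pandas releases.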

## 6. Changing datatypes to reduce memory usage

In [16]:
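# Changing datatype to reduce memory usage
# Caution: int8 only holds -128 to 127; larger IDs can wrap around to negative values, so check each column's max before casting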

df_prods['product_id'] = df_prods['product_id'].astype('int32')
df_prods['aisle_id'] = df_prods['aisle_id'].astype('int8')
df_prods['department_id'] = df_prods['department_id'].astype('int8')

In [17]:
# Checking df info to confirm memory usage reduction

df_prods.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49672 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49672 non-null  int32  
 1   product_name   49672 non-null  object 
 2   aisle_id       49672 non-null  int8   
 3   department_id  49672 non-null  int8   
 4   prices         49672 non-null  float64
dtypes: float64(1), int32(1), int8(2), object(1)
memory usage: 1.4+ MB


#### The shape of df_prods after consistency checks is (49672, 5) with memory usage of 1.4+ MB.
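An `int8` column can only hold values from -128 to 127, and the `describe()` output in the next section reports an `aisle_id` minimum of -128, which suggests some IDs did not fit. A minimal sketch of a range-checked downcast, assuming a hypothetical helper `fits` and a hypothetical `toy` frame (the value 134 is invented to exceed the `int8` range):

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """Return True if every value in series lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    return info.min <= series.min() and series.max() <= info.max

# Hypothetical toy frame; 134 is an illustrative ID above the int8 maximum of 127
toy = pd.DataFrame({
    "aisle_id": [61, 104, 127, 134],
    "department_id": [19, 13, 7, 1],
})

for col in toy.columns:
    # Try the narrowest type first and stop at the first one that is safe
    for candidate in ("int8", "int16", "int32"):
        if fits(toy[col], candidate):
            toy[col] = toy[col].astype(candidate)
            break

print(toy.dtypes)
```

`pd.to_numeric(series, downcast='integer')` performs a similar smallest-safe-type selection in a single call.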

## 7. Performing Descriptive Analysis

In [18]:
# Checking descriptive analysis

df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,58.320664,11.728942,9.993282
std,14340.705287,51.109052,5.850779,453.615536
min,1.0,-128.0,1.0,1.0
25%,12432.75,28.0,7.0,4.1
50%,24850.5,65.0,13.0,7.1
75%,37268.25,98.0,17.0,11.1
max,49688.0,127.0,21.0,99999.0


#### The max and std values of prices are not as expected and need to be investigated further.

In [19]:
# Investigate record with max 'prices' value

df_prods.loc[df_prods['prices'] == 99999]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [20]:
# Check for products that cost more than $50

df_prods[df_prods["prices"] > 50.0]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [21]:
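# Replace the 'prices' value 14900.0 ('product_id' 21553) with 14.9 and 99999.0 ('product_id' 33664) with 9.99
# Note: replace() matches on the price value, so any other row with one of these prices would change too

df_prods = df_prods.replace({"prices": {99999.0: 9.99, 14900.0: 14.9}})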

In [22]:
# Check that changes have been successfully made

df_prods[df_prods["product_id"]==21553]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14.9


In [23]:
# Check that changes have been successfully made

df_prods[df_prods["product_id"]==33664]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33666,33664,2 % Reduced Fat Milk,84,16,9.99
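The `replace()` call above matches rows by price value. As a sketch of a more targeted alternative, the correction could be keyed on `product_id` with `.loc`, followed by a re-check of the 50-dollar threshold; the `toy` frame and the `corrections` mapping are hypothetical names that reuse the two records identified above:

```python
import pandas as pd

# Hypothetical toy frame holding the two outliers plus one ordinary product
toy = pd.DataFrame({
    "product_id": [21553, 33664, 1],
    "prices": [14900.0, 99999.0, 5.8],
})

# Corrected prices keyed by product_id (values taken from the notebook's fix)
corrections = {21553: 14.9, 33664: 9.99}

for pid, price in corrections.items():
    # Only the row with this product_id is touched
    toy.loc[toy["product_id"] == pid, "prices"] = price

# Re-run the threshold check to confirm no price above 50 remains
n_outliers = int((toy["prices"] > 50).sum())
print(toy, n_outliers)
```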


In [24]:
# Rechecking descriptive analysis

df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,58.320664,11.728942,7.680629
std,14340.705287,51.109052,5.850779,4.199434
min,1.0,-128.0,1.0,1.0
25%,12432.75,28.0,7.0,4.1
50%,24850.5,65.0,13.0,7.1
75%,37268.25,98.0,17.0,11.1
max,49688.0,127.0,21.0,25.0


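#### After the correction, the prices column looks as expected (max 25.0, std about 4.2). The aisle_id minimum of -128, however, is not a valid ID: it sits exactly at the lower bound of int8, which suggests that aisle IDs above 127 wrapped around during the int8 cast in step 6. That column should be recast from the original data with a wider type such as int16.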

## 8. Exporting wrangled, consistency checked dataframe

In [25]:
# Export cleaned dataframe

df_prods.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'), index = False)
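CSV files store no dtype information, so the reduced integer types from step 6 do not survive the export; a plain `read_csv` infers `int64` again. A minimal sketch of the round trip, assuming a temporary directory instead of the project's `path` and a hypothetical `toy` frame:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame with the reduced dtypes from step 6
toy = pd.DataFrame({"product_id": [1, 2], "department_id": [19, 13]}).astype(
    {"product_id": "int32", "department_id": "int8"}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "products_checked.csv")
    toy.to_csv(out_path, index=False)

    # Default read: pandas infers int64 for integer columns
    default_read = pd.read_csv(out_path)

    # Passing dtype restores the compact types at load time
    typed_read = pd.read_csv(
        out_path, dtype={"product_id": "int32", "department_id": "int8"}
    )

print(default_read.dtypes)
print(typed_read.dtypes)
```

If the compact dtypes matter downstream, they can be reapplied on import as shown, or the dataframe can be saved with `to_pickle`, which preserves dtypes.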