# Data Consistency Checks

### This script contains the following points:

#### 01. Importing libraries
#### 02. Importing products.csv (original) and orders_wrangled.pkl (wrangled) data
#### 03. Examining orders_wrangled.pkl and products.csv data
#### 04. Data Consistency Checks
##### 4.1 Mixed Type Data
##### 4.2 Missing Values
#### 05. Duplicates
#### 06. Products data wrangling 
#### 07. Tidying Up and Exporting Changes

## 01. Importing libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

## 02. Importing products.csv (original) and orders_wrangled.pkl (wrangled) data

In [2]:
# Assign the main project folder path to the variable path and view
path = r'/Users/elsaekevall/Jupyter_Notebook/Career_Foundry/06_2022_Instacart_Basket_Analysis/'
path

'/Users/elsaekevall/Jupyter_Notebook/Career_Foundry/06_2022_Instacart_Basket_Analysis/'

In [3]:
# Use os.path.join() to build the file paths, then load products.csv and orders_wrangled.pkl as pandas dataframes (index_col = False stops pandas from using the first column as the index)
df_prods = pd.read_csv(os.path.join(path, '02_Data', '02_1_Original_Data', 'products.csv'), index_col = False)
df_ords = pd.read_pickle(os.path.join(path, '02_Data', '02_2_Prepared_Data', 'orders_wrangled.pkl'))

## 03. Examining orders_wrangled.pkl and products.csv data

In [4]:
# Use the df.head function to view the first five rows of the dataframe
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [5]:
# Use the df.tail function to view the last five rows of the dataframe
df_ords.tail()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
3421082,272231,206209,14,6,14,30.0


In [6]:
# Use the df.head function to view the first five rows of the dataframe
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [7]:
# Use the df.tail function to view the last five rows of the dataframe
df_prods.tail()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7
49692,49688,Fresh Foaming Cleanser,73,11,13.5


## 04. Data Consistency Checks

In [8]:
# Investigate the accuracy of the columns in the df_ords dataframe
df_ords.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3214874.0
mean,17.15486,2.776219,13.45202,11.11484
std,17.73316,2.046829,4.226088,9.206737
min,1.0,0.0,0.0,0.0
25%,5.0,1.0,10.0,4.0
50%,11.0,3.0,13.0,7.0
75%,23.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


In [9]:
# Dataframe information
df_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  int64  
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.9+ MB


In [10]:
# Convert the product_id column to string (it is an identifier, not a quantity) and view the data types
df_prods['product_id'] = df_prods['product_id'].astype('str')
df_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  object 
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.9+ MB


In [11]:
# Convert the aisle_id column to string and view the data types
df_prods['aisle_id'] = df_prods['aisle_id'].astype('str')
df_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  object 
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  object 
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 1.9+ MB


In [12]:
# Convert the department_id column to string and view the data types
df_prods['department_id'] = df_prods['department_id'].astype('str')
df_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  object 
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  object 
 3   department_id  49693 non-null  object 
 4   prices         49693 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1.9+ MB
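
The three conversions above can also be done in a single call. The sketch below is only an illustration on a made-up stand-in for df_prods (the name `toy_prods` and its values are assumptions, not the project data); it passes a column-to-dtype mapping to `astype` and then lists the columns that are still numeric.

```python
import pandas as pd

# Toy stand-in for df_prods; the values are invented for illustration.
toy_prods = pd.DataFrame({
    'product_id': [1, 2, 3],
    'aisle_id': [61, 104, 94],
    'department_id': [19, 13, 7],
    'prices': [5.8, 9.3, 4.5],
})

# Identifier columns that should be treated as labels rather than numbers
id_cols = ['product_id', 'aisle_id', 'department_id']

# astype accepts a {column: dtype} mapping, so one call converts all three
toy_prods = toy_prods.astype({col: 'str' for col in id_cols})

# Only genuinely numeric columns remain for describe() to summarise
numeric_cols = toy_prods.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['prices']
```

Once the identifiers are strings, `describe()` reports only the remaining numeric columns.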


In [13]:
df_prods.shape

(49693, 5)

In [14]:
# Dataframe information
df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                object 
 1   user_id                 object 
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
dtypes: float64(1), int64(3), object(2)
memory usage: 156.6+ MB


### 4.1 Mixed Type Data

In [15]:
# Create a dataframe
df_test = pd.DataFrame()

In [16]:
# Create a mixed type column and view
df_test['mix'] = ['a', 'b', 1, True]
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [17]:
# Run code to check for mixed types on df_test
for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_test[weird]) > 0:
    print (col)

mix


In [18]:
# Change the mix column data type to str using astype and confirm the conversion using dtype
df_test['mix'] = df_test['mix'].astype('str')
df_test['mix'].dtype

dtype('O')
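
As an alternative to the loop above, the same idea can be written with `Series.map(type)`. This is a sketch rather than the check used in this notebook: the helper name `mixed_type_columns` is hypothetical, and unlike the loop above it ignores missing values, so a text column containing NaN is not reported as mixed.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose non-missing values have more than one Python type."""
    mixed = []
    for col in df.columns:
        # Map every non-missing value to its Python type and count the distinct types
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed

# Same mixed column as df_test above, plus a clean integer column for contrast
toy = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
print(mixed_type_columns(toy))  # ['mix']
```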

### 4.2 Missing Values

**Finding Missing Values**

In [19]:
# Check for missing values in the df_prods dataframe
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [20]:
# Create a dataframe of the 16 rows with a missing product_name
df_nan = df_prods[df_prods['product_name'].isnull()]
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


**Addressing Missing Values**

*If you don't specify an inplace argument in your code, the function uses the default setting, inplace = False. With inplace = False, the command returns a new, changed dataframe and leaves the original dataframe untouched, so the result has to be assigned to a variable to be kept.*
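
A small illustration of the inplace behaviour, using a made-up three-row dataframe (`toy` is an assumption for demonstration only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'product_name': ['Salt', np.nan, 'Tea'],
    'prices': [9.3, 12.2, 4.5],
})

# inplace defaults to False: a new dataframe is returned, toy keeps all 3 rows
dropped = toy.dropna()
print(toy.shape, dropped.shape)  # (3, 2) (2, 2)

# inplace = True: toy itself is modified and the call returns None
result = toy.dropna(inplace=True)
print(result, toy.shape)  # None (2, 2)
```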

*# Replacing missing values with the mean - use df.describe() or df['column with missings'].mean() to find the mean*<br>
df['column with missings'] = df['column with missings'].fillna(mean_value)

*# Replacing missing values with the median - use df['column with missings'].median() to find the median*<br>
df['column with missings'] = df['column with missings'].fillna(median_value)
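
To make the two imputation options concrete, here is a minimal sketch on invented numbers (the column name `prices` and the values are assumptions). The mean is pulled upwards by the largest value, while the median is not, which is why the two fills differ.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'prices': [1.0, np.nan, 2.0, 9.0]})

# mean() and median() skip missing values by default
mean_value = toy['prices'].mean()      # (1 + 2 + 9) / 3 = 4.0
median_value = toy['prices'].median()  # middle of 1, 2, 9 = 2.0

# Assign the filled column back instead of relying on inplace
toy['filled_with_mean'] = toy['prices'].fillna(mean_value)
toy['filled_with_median'] = toy['prices'].fillna(median_value)
print(toy)
```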

In [21]:
# Find the number of rows and columns in the df_prods dataframe
df_prods.shape

(49693, 5)

In [22]:
# Create a dataframe df_prods_clean without the 16 rows that have a missing product_name, then find the number of rows and columns
df_prods_clean = df_prods[df_prods['product_name'].notnull()]
df_prods_clean.shape

(49677, 5)

In [23]:
# Run code to check for mixed types on df_prods_clean
for col in df_prods_clean.columns.tolist():
  weird = (df_prods_clean[[col]].applymap(type) != df_prods_clean[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_prods_clean[weird]) > 0:
    print(col)

*# Drop all rows that contain any missing value and overwrite the dataset* <br>
df_prods.dropna(inplace = True)

*# Drop only the rows with NaNs in a particular column and overwrite the dataset* <br>
df_prods.dropna(subset = ['product_name'], inplace = True)
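
The difference between the two `dropna` variants only shows up when more than one column has gaps. A sketch on invented data (`toy` and its values are assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'product_name': ['Salt', np.nan, 'Tea', 'Stout'],
    'prices': [9.3, 12.2, np.nan, 13.4],
})

# Drops every row with a NaN in any column: rows 1 and 2 are removed
drop_any = toy.dropna()

# Drops only rows where product_name is NaN: row 2 (missing price) is kept
drop_name_only = toy.dropna(subset=['product_name'])

print(drop_any.shape, drop_name_only.shape)  # (2, 2) (3, 2)
```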

## 05. Duplicates

In [24]:
# Look for duplicates within the df_prods_clean dataframe and view them
df_dups = df_prods_clean[df_prods_clean.duplicated()]
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [25]:
# Find the number of rows and columns in the df_prods_clean dataframe
df_prods_clean.shape

(49677, 5)

In [26]:
# Create new df_prods_clean_no_dups dataframe without duplicates and then view the shape
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()
df_prods_clean_no_dups.shape

(49672, 5)
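
By default `duplicated()` marks only the second and later copies of a row, which is why five rows were listed and five rows were removed. The sketch below, on made-up data, shows the `keep=False` option for viewing every copy side by side before deciding what to drop.

```python
import pandas as pd

toy = pd.DataFrame({
    'product_id': ['1', '2', '2', '3'],
    'product_name': ['Salt', 'Tea', 'Tea', 'Stout'],
    'prices': [9.3, 4.5, 4.5, 13.4],
})

# Default keep='first': only the repeated copy (index 2) is flagged
later_copies = toy[toy.duplicated()]

# keep=False: the first occurrence and its repeat (index 1 and 2) are both flagged
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(len(later_copies), len(all_copies), deduped.shape)  # 1 2 (3, 3)
```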

## 06. Products data wrangling

1. If you haven’t performed the consistency checks covered in this Exercise on your df_prods dataframe, do so now. **- see sections above**

2. Run the df.describe() function on your df_prods dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.
Tip: Keep an eye on min and max values!

In [27]:
# Investigate the accuracy of the columns in the df_prods dataframe
df_prods.describe()

Unnamed: 0,prices
count,49693.0
mean,9.994136
std,453.519686
min,1.0
25%,4.1
50%,7.1
75%,11.2
max,99999.0


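* **The maximum price of 99999.00 is far above the rest of the distribution (mean 9.99, median 7.10) and is probably an error rather than a real price; it also inflates the standard deviation to 453.52.**<br> **Only the prices column is summarised here because the id columns were converted to strings above. The row count (49693) is larger than the highest product_id seen (49688), which suggests that some product_id values appear more than once and should be checked for duplicates.**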
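
One way to follow up on a suspicious maximum is to look at the rows above a plausibility threshold before deciding what to do with them. This is a sketch on invented data; the threshold `PRICE_CEILING = 100` is an assumption chosen for illustration, not a value taken from this project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'product_name': ['Cookies', 'Salt', 'Mystery item'],
    'prices': [5.8, 9.3, 99999.0],
})

PRICE_CEILING = 100  # hypothetical upper bound for a plausible grocery price

# Inspect the suspect rows first
suspect = toy[toy['prices'] > PRICE_CEILING]
print(suspect)

# One option: mark the implausible prices as missing so they no longer skew the statistics
toy.loc[toy['prices'] > PRICE_CEILING, 'prices'] = np.nan
print(toy['prices'].max())  # 9.3
```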

3. Check for mixed-type data in your df_ords dataframe.<br>
4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

In [28]:
# Run the code to check for mixed types on df_ords
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_ords[weird]) > 0:
    print (col)

* **There is no mixed-type data in the df_ords dataframe**

5. Run a check for missing values in your df_ords dataframe.<br>
In a markdown cell, report your findings and propose an explanation for any missing values you find.

In [29]:
# Check for missing values in the df_ords dataframe
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

* **There are 206209 missing values in the days_since_prior_order column. Some missing values are to be expected, since new customers have not placed a prior order. <br> It would be good to have an additional column that distinguishes between new customers and returning customers, making it easy to identify values that are genuinely missing.**

6. Address the missing values using an appropriate method.<br>
In a markdown cell, explain why you used your method of choice.

In [30]:
# Create a dataframe of the rows with a missing days_since_prior_order value and view it
df_nan_ords = df_ords[df_ords['days_since_prior_order'].isnull()]
df_nan_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


In [31]:
# View value counts for the user_id column in the df_nan_ords dataframe
df_nan_ords['user_id'].value_counts(dropna = False)

140458    1
104374    1
32917     1
44700     1
25879     1
         ..
19483     1
185       1
34611     1
49493     1
111219    1
Name: user_id, Length: 206209, dtype: int64

In [32]:
# Check whether there are any duplicates in the user_id column of the df_nan_ords dataframe
boolean = df_nan_ords['user_id'].duplicated().any()
boolean

False

In [33]:
# Check whether there are any duplicates in the user_id column of the df_ords dataframe
boolean = df_ords['user_id'].duplicated().any()
boolean

True

* **The first and last five rows of the value counts for the user_id column in the df_nan_ords dataframe all show a count of 1. Checking for duplicates in the user_id column of df_nan_ords returns False, i.e. there are no duplicates, while the same check on the df_ords dataframe returns True, as expected. I am therefore assuming that every missing value belongs to a first order placed under a new user_id.**<br>
* **To deal with this, you could either drop the 'days_since_prior_order' column or add a 'first_order' column that can be used as a filter in later analyses.**
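
The assumption that every missing value is a first order could also be tested directly against the order_number column. The sketch below uses a made-up five-row stand-in for df_ords (`toy_ords` and its values are assumptions) and compares the two conditions row by row.

```python
import numpy as np
import pandas as pd

toy_ords = pd.DataFrame({
    'user_id': ['1', '1', '2', '2', '2'],
    'order_number': [1, 2, 1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 7.0, 30.0],
})

missing = toy_ords['days_since_prior_order'].isnull()
first = toy_ords['order_number'] == 1

# True only if a value is missing exactly where order_number is 1
masks_agree = bool((missing == first).all())

# True only if there is exactly one missing value per user
one_per_user = int(missing.sum()) == toy_ords['user_id'].nunique()

print(masks_agree, one_per_user)  # True True
```

If either check came back False on the real data, some of the gaps would need a different explanation.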


In [34]:
# Address the missing values by leaving them in and adding a first_order column that can be used to filter the dataset, then view the df_ords_clean dataframe
df_ords_clean = df_ords.copy() # create a new dataframe as a copy, so that df_ords itself is not modified
df_ords_clean['first_order'] = df_ords_clean['days_since_prior_order'].isnull() # create the first_order column: True where days_since_prior_order is NaN, False otherwise
df_ords_clean

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0,False
3421079,1854736,206209,11,4,10,30.0,False
3421080,626363,206209,12,1,12,18.0,False
3421081,2977660,206209,13,1,12,7.0,False


* **I chose to add another column, 'first_order', because the missing values are valid and should be retained. If days_since_prior_order is required for the analysis, the flag allows the first orders to be filtered out at that stage; if it is not required, both columns (days_since_prior_order and first_order) can be dropped at that point.**
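
A sketch of how the flag might be used later, again on invented data (`toy_ords` is an assumption). It works on an explicit copy, so the source dataframe does not gain the new column, and then filters out first orders before summarising days_since_prior_order.

```python
import numpy as np
import pandas as pd

toy_ords = pd.DataFrame({
    'user_id': ['1', '1', '1'],
    'order_number': [1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, 21.0],
})

# Work on a copy so that toy_ords itself is left unchanged
toy_ords_clean = toy_ords.copy()
toy_ords_clean['first_order'] = toy_ords_clean['days_since_prior_order'].isnull()

# Keep only repeat orders for analyses that need days_since_prior_order
repeat_orders = toy_ords_clean[~toy_ords_clean['first_order']]

print('first_order' in toy_ords.columns)                # False
print(repeat_orders['days_since_prior_order'].mean())   # 18.0
```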

7. Run a check for duplicate values in your df_ords data.<br>
8. Address the duplicates using an appropriate method.<br>
In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

In [35]:
# Look for duplicates within the df_ords_clean dataframe and view them
df_dups_ords = df_ords_clean[df_ords_clean.duplicated()]
df_dups_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order


* **No duplicates were found in the df_ords_clean dataframe and therefore there is nothing to address**

## 07. Tidying Up and Exporting Changes

9. Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

In [36]:
# View df_prods_clean_no_dups dataframe before exporting
df_prods_clean_no_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [37]:
# View df_ords_clean dataframe before exporting
df_ords_clean

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0,False
3421079,1854736,206209,11,4,10,30.0,False
3421080,626363,206209,12,1,12,18.0,False
3421081,2977660,206209,13,1,12,7.0,False


In [38]:
# Export the df_prods_clean_no_dups dataframe into the Prepared_Data folder as products_checked.pkl
df_prods_clean_no_dups.to_pickle(os.path.join(path, '02_Data','02_2_Prepared_Data', 'products_checked.pkl'))
# Export the df_ords_clean dataframe into the Prepared_Data folder as orders_checked.pkl
df_ords_clean.to_pickle(os.path.join(path, '02_Data','02_2_Prepared_Data', 'orders_checked.pkl'))
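
A possible reason to prefer pickle over CSV at this stage is that pickle keeps the data types set during the checks, whereas a CSV round trip re-infers them on import. The sketch below illustrates this on a made-up dataframe written to a temporary folder; the file names are placeholders, not the project files.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'product_id': ['1', '2'], 'prices': [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy_checked.pkl')
    csv_path = os.path.join(tmp, 'toy_checked.csv')

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle preserves the string identifiers exactly
print(from_pkl['product_id'].tolist())  # ['1', '2']

# read_csv infers the type again, so the identifiers come back as integers
print(pd.api.types.is_integer_dtype(from_csv['product_id']))  # True
```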