# Table of Contents

This notebook contains the following: 

* Importing libraries 
* Turning project path into a string
* Importing data sets 
* Data Consistency Checks (identification and treatment) 
    * Mixed Type Data 
    * Missing Values 
    * Duplicates 
* Exporting Dataset 



# Step 1 - Consistency Checks from Exercise

# 01. Importing libraries

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import os

# 02. Turning project path into a string

In [3]:
#Turn project folder path into a string
'/Users/aysha/Documents/Instacart Basket Analysis/'

'/Users/aysha/Documents/Instacart Basket Analysis/'

In [4]:
path = r'/Users/aysha/Documents/Instacart Basket Analysis/'

In [5]:
path

'/Users/aysha/Documents/Instacart Basket Analysis/'

# 03. Importing data sets

In [6]:
# Import the “products.csv” file into Jupyter as df_prods
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [7]:
# Import the orders_wrangled.csv from 'Prepared Data' folder as dfords
dfords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

# 04. Data Consistency Checks

In [None]:
dfords.describe()

# 05. Checking for Mixed-Type Data

In [10]:
#Create a dataframe
df_test = pd.DataFrame()

In [11]:
# Create a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [None]:
# Check first 5 rows
df_test.head()

In [13]:
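# Check for mixed type data
# A column is printed if any row holds a value whose type differs from the type of the column's first row
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)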

mix


In [14]:
# Fix the mixed type data (by converting every value in the column to a string)
df_test['mix'] = df_test['mix'].astype('str')
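
The mixed-type check can also be written per column by counting how many distinct Python types a column holds. This is only a sketch on a toy dataframe; `df_demo` and `mixed_cols` are names made up for the illustration and are not part of the exercise.

```python
import pandas as pd

# Toy frame: 'mix' holds str, int and bool values; 'clean' holds only str
df_demo = pd.DataFrame({
    'mix': ['a', 'b', 1, True],
    'clean': ['x', 'y', 'z', 'w'],
})

# A column counts as mixed-type when its values map to more than one Python type
mixed_cols = [col for col in df_demo.columns
              if df_demo[col].map(type).nunique() > 1]
print(mixed_cols)
```

Like the loop above, this treats a missing value (`NaN`, a float) inside a text column as a second type, so a flagged column may simply contain missing values.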


# 06. Missing Values

In [15]:
# Finding missing values in df_prods
df_prods.isnull().sum()


product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [17]:
# Creating a subset of the dataframe containing only the rows with a missing product name
df_nan = df_prods[df_prods['product_name'].isnull()]

In [None]:
df_nan

In [26]:
# Alternative: fill in missing values by imputing the mean or median (only if the column is numeric)

#df['column with missings'].fillna(mean value, inplace=True)

#df['column with missings'].fillna(median value, inplace=True)
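
As a rough illustration of the imputation templates above, here is median imputation on a toy numeric column. The data is invented for the example. The result is assigned back to the column rather than using `inplace=True`, which is one way to avoid the chained-assignment warnings that recent pandas versions raise.

```python
import pandas as pd
import numpy as np

# Toy numeric column with one missing value (invented data)
df_demo = pd.DataFrame({'prices': [1.0, 2.0, np.nan, 100.0]})

# median() skips NaN by default, so this is the median of 1, 2 and 100
median_price = df_demo['prices'].median()

# Assign the filled column back to the dataframe
df_demo['prices'] = df_demo['prices'].fillna(median_price)
print(df_demo)
```

The median is often preferred over the mean when a column has extreme values, as in this toy column where a single 100.0 would pull the mean far upward.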

In [22]:
# Find number of rows and columns in dataframe
df_prods.shape

(49693, 5)

In [23]:
# Create a new dataframe df_prods_clean that keeps only the rows with a product name
df_prods_clean = df_prods[df_prods['product_name'].notnull()]


In [None]:
# Find number of rows and columns in the newly created dataframe
df_prods_clean.shape

In [25]:
# Another way of dropping missing data (removes every row that has a NaN in any column)
#df_prods.dropna(inplace = True)
# To drop only the rows with a NaN in a particular column, the code would look like this:
#df_prods.dropna(subset = ['product_name'], inplace = True)
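
The difference between the two `dropna()` variants can be seen on a small made-up dataframe. This is a sketch only; the values and the names `df_named` and `df_complete` are hypothetical.

```python
import pandas as pd
import numpy as np

# Toy products table: one missing product_name, one missing aisle_id
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Apples', np.nan, 'Bread'],
    'aisle_id': [10, 20, np.nan],
})

# Drop only the rows where product_name is missing
df_named = df_demo.dropna(subset=['product_name'])

# Drop every row that has a missing value in any column
df_complete = df_demo.dropna()

print(df_named.shape, df_complete.shape)
```

Without `inplace = True`, `dropna()` returns a new dataframe and leaves the original untouched, which makes it easy to compare row counts before and after.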

# 07. Duplicates

In [27]:
# Finding duplicates
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [None]:
df_dups

In [None]:
# Addressing duplicates
# First check the number of rows and columns in the dataframe
df_prods_clean.shape

In [30]:
# Addressing duplicates 
# Dropping duplicates
# Create a new dataframe that doesn’t include the duplicates you just identified using the drop_duplicates() function:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [None]:
# Check number of rows and columns of new dataframe
df_prods_clean_no_dups.shape
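
To make the relationship between `duplicated()` and `drop_duplicates()` concrete, here is a small sketch on invented data. The number of rows removed should equal the number of rows flagged as duplicates.

```python
import pandas as pd

# Toy products table in which the 'Bread' row appears twice
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Apples', 'Bread', 'Bread', 'Milk'],
})

# duplicated() marks the second and later occurrences of a fully identical row
dup_mask = df_demo.duplicated()

# drop_duplicates() keeps the first occurrence of each row
df_no_dups = df_demo.drop_duplicates()

# Rows removed should match rows flagged
rows_removed = len(df_demo) - len(df_no_dups)
print(rows_removed, int(dup_mask.sum()))
```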

# 08. Exporting Data

In [32]:
#df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))


# Step 2 

# Run the df.describe() function on your df_prods dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.

# Tip: Keep an eye on min and max values!

In [None]:
# Running the describe function for statistical values
df_prods.describe()

### The maximum price looks like a data-entry error: it is far higher than one would expect for goods sold by an online grocery company, so it should be investigated further.
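
One way to follow up on a suspicious maximum is to list the rows above a plausibility threshold. The sketch below uses invented data, and the cutoff `price_cap` is a hypothetical value chosen for the illustration, not one taken from the exercise.

```python
import pandas as pd

# Toy products table with one implausibly high price (invented data)
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'prices': [2.5, 7.9, 99999.0, 12.3],
})

# Hypothetical cutoff above which a grocery price is treated as suspicious
price_cap = 100

# Keep only the rows whose price exceeds the cutoff so they can be inspected
df_suspect = df_demo[df_demo['prices'] > price_cap]
print(df_suspect)
```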

# Step 3 

# Check for mixed-type data in your df_ords dataframe.

In [34]:
# Check for mixed type data in dfords dataframe
# A column is printed if any row holds a value whose type differs from the type of the column's first row
for col in dfords.columns.tolist():
    weird = (dfords[[col]].applymap(type) != dfords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(dfords[weird]) > 0:
        print(col)

### The check printed nothing, so there is no mixed-type data in the dfords dataframe.

# Step 4 

# If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

### No mixed-type data was found, so nothing needs fixing here.

# Step 5 

# Run a check for missing values in your df_ords dataframe. In a markdown cell, report your findings and propose an explanation for any missing values you find.


In [35]:
# Finding missing values in dfords
dfords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_time_of_day              0
days_since_prior_order    206209
dtype: int64

### The column 'days_since_prior_order' has 206,209 missing values. The most likely explanation is that these rows are customers' first orders: with no earlier order in Instacart's database, there is no prior order to count days from.
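
The first-order explanation can be checked rather than assumed. The sketch below runs on invented data and assumes that `order_number == 1` marks a customer's first order, which is an assumption about the dataset rather than something shown in this notebook.

```python
import pandas as pd
import numpy as np

# Toy orders table: two users, each with a first order that has no prior order
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, 14.0, np.nan, 30.0],
})

# Rows where the number of days since the prior order is missing
missing = df_demo['days_since_prior_order'].isnull()

# Check 1: every missing value sits on an order with order_number == 1 (assumed to mean first order)
all_missing_are_first = bool((df_demo.loc[missing, 'order_number'] == 1).all())

# Check 2: there is exactly one missing value per user
one_missing_per_user = bool(missing.sum() == df_demo['user_id'].nunique())

print(all_missing_are_first, one_missing_per_user)
```

If both checks came back `True` on the real data, that would support treating the missing values as meaningful rather than as errors.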

# Step 6 

# Address the missing values using an appropriate method. In a markdown cell, explain why you used your method of choice.


In [36]:
# Creating a subset of the dataframe containing only the rows with a missing 'days_since_prior_order'
dfords_nan = dfords[dfords['days_since_prior_order'].isnull()]

In [None]:
dfords_nan

### My approach is to create a new variable that flags the missing values: a new column, 'first_time_customers', that is True where 'days_since_prior_order' is missing and False otherwise.

### Imputing the mean or median wouldn't make sense here, because a missing value in this column means the customer has no prior purchase history.

### The new dataframe with the new column is created as follows:

In [38]:
# Create a new dataframe as an independent copy of dfords (plain assignment would only create a second name for the same dataframe)
dfords_clean = dfords.copy()

In [39]:
# Adding a column 'first_time_customers' that is True where 'days_since_prior_order' is missing
dfords_clean['first_time_customers'] = dfords_clean['days_since_prior_order'].isnull()

In [None]:
dfords_clean
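
The flagging approach can be sketched on a toy dataframe as follows. The names `df_orig` and `df_flagged` are hypothetical. Note that plain assignment (`a = b`) only gives the same dataframe a second name, whereas `.copy()` produces an independent one, so the sketch uses `.copy()` to keep the source data unchanged.

```python
import pandas as pd
import numpy as np

# Toy orders table with two missing values (invented data)
df_orig = pd.DataFrame({
    'order_id': [1, 2, 3],
    'days_since_prior_order': [np.nan, 5.0, np.nan],
})

# Independent copy, so adding a column does not alter df_orig
df_flagged = df_orig.copy()

# isnull() already returns True/False, so it can be used directly as the flag
df_flagged['first_time_customers'] = df_flagged['days_since_prior_order'].isnull()

print(df_flagged)
print('first_time_customers' in df_orig.columns)
```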

# Step 7 

# Run a check for duplicate values in your df_ords data. In a markdown cell, report your findings and propose an explanation for any duplicate values you find.


In [41]:
# Finding duplicates in the dfords dataframe
dfords_dups = dfords_clean[dfords_clean.duplicated()]

In [None]:
dfords_dups

### There are no duplicates in the dfords_clean dataframe

# Step 8 

# Address the duplicates using an appropriate method. In a markdown cell, explain why you used your method of choice.

### Since there are no duplicates in the dfords_clean dataframe, there is nothing to address.

# Step 9 

# Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

In [43]:
# Exporting df_prods
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))

In [None]:
dfords_clean

In [45]:
# Exporting dfords
dfords_clean.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))
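
A side note on the 'Unnamed: 0' column seen in the missing-values output above: `to_csv()` writes the row index by default, and the index comes back as an unnamed column when the file is read again. The sketch below shows the effect on a toy dataframe, using an in-memory buffer instead of a real file. Whether to pass `index=False` in the exports above is a judgment call that depends on what later notebooks expect.

```python
import io
import pandas as pd

# Toy orders table (invented data)
df_demo = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 2]})

# Default export: the row index is written as an extra, unnamed first column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# Export with index=False: only the real columns are written
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```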