# Consistency checks - Products

### List of contents
1. Importing file
2. Changing column type
3. Missing values?
   3.1 Addressing NaN values
4. Duplicates?
   4.1 Addressing duplicates
5. Exploring 'prices' column
   5.1 Flagging abnormally high prices
   5.2 Addressing abnormally high-priced items
6. Final check
7. Exporting clean file

## Importing libraries

In [4]:
import pandas as pd
import numpy as np
import os

## Creating path for folder

In [6]:
path = r'/Users/ceciliamoura/Desktop/Career Foundry/Achievement4/Instacart Basket Analysis'

## 1. Importing file

In [94]:
# Importing original file - products

In [9]:
prods = pd.read_csv(os.path.join(path, 'Data', 'Original Data', 'products.csv'), index_col=False)

In [11]:
prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [13]:
prods.shape

(49693, 5)

In [15]:
prods.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object
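
The three ID columns come in as `int64`. As a minimal sketch, assuming the same column names as above, the `dtype` argument of `pd.read_csv` can read them as strings at import time. The in-memory CSV text is a toy stand-in for `products.csv`, with two rows copied from the `head()` output.

```python
import io
import pandas as pd

# Toy stand-in for products.csv (two rows copied from the head() output above)
csv_text = (
    "product_id,product_name,aisle_id,department_id,prices\n"
    "1,Chocolate Sandwich Cookies,61,19,5.8\n"
    "2,All-Seasons Salt,104,13,9.3\n"
)

# Columns that are identifiers rather than quantities
id_cols = ['product_id', 'aisle_id', 'department_id']

# Read the identifier columns as strings; 'prices' is still inferred as float
toy = pd.read_csv(io.StringIO(csv_text), dtype={c: str for c in id_cols})

print(toy.dtypes)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the `os.path.join(...)` path used in the import cell.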

## 2. Changing column type

In [18]:
# 'product_id' changed from int64 to object

prods['product_id'] = prods['product_id'].astype('str')

In [22]:
# 'department_id' changed from int64 to object

prods['department_id'] = prods['department_id'].astype('str')

In [20]:
# 'aisle_id' changed from int64 to object

prods['aisle_id'] = prods['aisle_id'].astype('str')

In [24]:
# Checking that the type changes were applied
prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  object 
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  object 
 3   department_id  49693 non-null  object 
 4   prices         49693 non-null  float64
dtypes: float64(1), object(4)
memory usage: 1.9+ MB
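
The three conversions above can also be written as one call. This is a sketch on a toy frame (three rows copied from the `head()` output), not a change to the cells above: `DataFrame.astype` accepts a dictionary that maps column names to target types.

```python
import pandas as pd

# Toy frame standing in for prods (rows copied from the head() output above)
toy = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt',
                     'Robust Golden Unsweetened Oolong Tea'],
    'aisle_id': [61, 104, 94],
    'department_id': [19, 13, 7],
    'prices': [5.8, 9.3, 4.5],
})

# Convert every identifier column to string in a single astype call
id_cols = ['product_id', 'aisle_id', 'department_id']
toy = toy.astype({col: 'str' for col in id_cols})

print(toy.dtypes)
```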


## 3. Missing values?

In [26]:
prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [28]:
# Creating a subset with missing values

prods_missing = prods[prods['product_name'].isnull()]

In [30]:
prods_missing

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


### 3.1 Addressing NaN values

Create a new dataframe without missing values in 'product_name'.

In [35]:
prods_clean = prods[prods['product_name'].notnull()]

In [37]:
prods_clean.shape

(49677, 5)

In [42]:
# 16 fewer rows than prods (49693 - 16 = 49677), matching the 16 missing product names.
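
As a cross-check on the filter above, `dropna` with `subset` should give the same rows as the boolean mask. This sketch uses a toy frame: the product names and two of the prices are hypothetical, and only the row with the missing name mirrors the real data.

```python
import numpy as np
import pandas as pd

# Toy frame: one row has a missing product_name (other names are hypothetical)
toy = pd.DataFrame({
    'product_id': ['33', '34', '35'],
    'product_name': ['Name A', np.nan, 'Name B'],
    'prices': [1.0, 12.2, 2.0],
})

# Approach used in this notebook: keep rows where the name is not null
by_mask = toy[toy['product_name'].notnull()]

# Equivalent approach: drop rows with NaN in the chosen column only
by_dropna = toy.dropna(subset=['product_name'])

print(by_mask.equals(by_dropna))
print(len(toy) - len(by_dropna), 'row(s) removed')
```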

## 4. Duplicates?

In [53]:
dups = prods_clean[prods_clean.duplicated()]

In [55]:
dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9
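
By default `duplicated()` flags only the second and later copies of a row, which is why each product appears once in the table above. To see both copies side by side, `keep=False` can be passed. The sketch below uses a toy frame in which only the duplicated row is taken from the output above; "Product A" and "Product B" are hypothetical.

```python
import pandas as pd

# Toy frame with one fully duplicated row (rows at index 1 and 2)
toy = pd.DataFrame({
    'product_id': ['461', '462', '462', '463'],
    'product_name': ['Product A', 'Fiber 4g Gummy Dietary Supplement',
                     'Fiber 4g Gummy Dietary Supplement', 'Product B'],
    'prices': [3.0, 4.8, 4.8, 7.0],
})

# Default (keep='first'): only the later copy is flagged
later_copies = toy[toy.duplicated()]

# keep=False: every member of a duplicated group is flagged
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies)
print(all_copies)
```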


In [57]:
dups.to_clipboard()

### 4.1 Addressing duplicates

In [60]:
prods_clean.shape

(49677, 5)

In [62]:
# Creating a new dataframe without duplicates

In [88]:
prods_clean_nodups = prods_clean.drop_duplicates().copy()  # .copy() avoids a SettingWithCopyWarning when prices are edited in 5.2

In [66]:
prods_clean_nodups.shape

(49672, 5)

## 5. Exploring 'prices' column

In [73]:
prods_clean_nodups['prices'].describe()

count    49672.000000
mean         9.993282
std        453.615536
min          1.000000
25%          4.100000
50%          7.100000
75%         11.100000
max      99999.000000
Name: prices, dtype: float64
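
The maximum of 99999 sits far above the 75th percentile of 11.1, and the standard deviation (about 453.6) is large next to the mean (about 10), which suggests a few extreme values. The sketch below shows the effect on a toy series: three values are the quartiles from the output above, two are the extreme prices found in section 5.1, and the cutoff of 100 is the one used there.

```python
import pandas as pd

# Toy series: three ordinary prices plus the two extreme values from 5.1
prices = pd.Series([4.1, 7.1, 11.1, 14900.0, 99999.0], name='prices')

threshold = 100  # same cutoff as in section 5.1

# Summary statistics with and without the extreme values
plausible = prices[prices <= threshold]

print(prices.describe())
print(plausible.describe())
```

The median barely moves between the two summaries, while the mean and standard deviation change by orders of magnitude.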

### 5.1 Flagging abnormally high prices

In [76]:
# Investigating high prices

high_prices_flag = prods_clean_nodups.loc[prods_clean_nodups['prices'] > 100]

In [78]:
high_prices_flag

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [80]:
high_prices_flag.to_clipboard()

In [136]:
high_price_subset = prods_clean_nodups.loc[(prods_clean_nodups['product_id'] == '21553') | (prods_clean_nodups['product_id'] == '33664')]

In [138]:
high_price_subset

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


### 5.2 Addressing abnormally high-priced items

In [142]:
# Turning these two items' prices into NaN

prods_clean_nodups.loc[prods_clean_nodups['prices'] > 100, 'prices'] = np.nan
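
A quick way to confirm the replacement is to count the NaN prices and re-check the maximum. This sketch repeats the step on a toy frame: the first row is hypothetical, and the other two carry the product IDs and prices flagged in section 5.1.

```python
import numpy as np
import pandas as pd

# Toy frame: one ordinary (hypothetical) row and the two flagged products
toy = pd.DataFrame({
    'product_id': ['21552', '21553', '33664'],
    'prices': [7.1, 14900.0, 99999.0],
})

# Same operation as above: prices over 100 become NaN
toy.loc[toy['prices'] > 100, 'prices'] = np.nan

# Verification: two NaN prices, and the maximum is back in a plausible range
n_nan = toy['prices'].isnull().sum()
max_price = toy['prices'].max()  # max() skips NaN by default

print(n_nan, max_price)
```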

## 6. Final Check

In [149]:
prods_clean_nodups.shape

(49672, 5)

In [151]:
# 5 fewer observations than prods_clean (49677) because the duplicates were dropped; 21 fewer than the original prods (49693).
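
The shape check can be extended to cover every step in this notebook. The helper below is a hypothetical sketch (`final_check` and `price_cap` are illustrative names, not part of the original notebook) that reports row count, missing names, duplicate rows, and prices above the cutoff.

```python
import numpy as np
import pandas as pd

def final_check(df, price_cap=100):
    """Return summary counts for a cleaned products frame (hypothetical helper)."""
    return {
        'rows': len(df),
        'missing_names': int(df['product_name'].isnull().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
        # NaN > price_cap evaluates to False, so blanked prices are not counted
        'prices_above_cap': int((df['prices'] > price_cap).sum()),
    }

# Toy frame standing in for prods_clean_nodups (third row is hypothetical)
toy = pd.DataFrame({
    'product_id': ['1', '2', '3'],
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt',
                     'Hypothetical Item'],
    'prices': [5.8, 9.3, np.nan],
})

report = final_check(toy)
print(report)
```

For the real data, `final_check(prods_clean_nodups)` would be expected to return zeros for everything except `rows`.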

## 7. Exporting clean file

In [154]:
prods_clean_nodups.to_csv(os.path.join(path, 'Data', 'Final Prepared Data', 'prods_clean.csv'))
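
`to_csv` writes the dataframe index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below compares exports with and without `index=False`, using a toy frame and a temporary folder in place of the project path.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for prods_clean_nodups
toy = pd.DataFrame({'product_id': ['1', '2'], 'prices': [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    toy.to_csv(with_index)                   # default: index is written
    toy.to_csv(without_index, index=False)   # index is left out

    # Column names as seen when each file is read back
    cols_with = pd.read_csv(with_index).columns.tolist()
    cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to keep the index is a choice for the downstream notebooks. Either way, the string ID columns are inferred as integers again on re-import unless `dtype` is passed to `pd.read_csv`.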