### Contents:
    01 Importing libraries and data
    02 Initial exploration
    03 Cleaning up
        a) Data types
        b) Outlier identification
        c) Missing values
        d) Duplicates
    04 Export

# Cleaning products df

## 01 Importing libraries and data

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
path = r'/Users/Emily/Documents/CF Data Analysis Program/Immersion 4/Instacart Basket Analysis'

In [3]:
df = pd.read_csv(os.path.join(path, '02 Data', 'original data', 'products.csv'), index_col = False)

## 02 Initial exploration

In [4]:
# view the top 5 rows and all column names
df.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [5]:
# check out the shape of the df (rows and columns)
df.shape

(49693, 5)

In [6]:
# check out the descriptive stats for every column
# plain df.describe() would only cover the numeric columns
df.describe(include = 'all')

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
count,49693.0,49677,49693.0,49693.0,49693.0
unique,,49672,,,
top,,Black House Coffee Roasty Stout Beer,,,
freq,,2,,,
mean,24844.345139,,67.770249,11.728433,9.994136
std,14343.717401,,38.316774,5.850282,453.519686
min,1.0,,1.0,1.0,1.0
25%,12423.0,,35.0,7.0,4.1
50%,24845.0,,69.0,13.0,7.1
75%,37265.0,,100.0,17.0,11.2


In [7]:
# check out the data type of each column
# can also use df.dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  int64  
 1   product_name   49677 non-null  object 
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.9+ MB


## 03 Cleaning up

### a) Data types

In [8]:
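# Check for mixed-type data: print any column where some value's type
# differs from the type of the column's first value
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
for col in df.columns.tolist():
    weird = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df[weird]) > 0:
        print(col)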

product_name
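
The loop above only reports which column is mixed, not which types it contains. A minimal sketch of a variant that also lists the types found, run on a small made-up frame (`toy` and `mixed_type_columns` are hypothetical names, not part of this notebook):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'name' mixes str with a float NaN,
# 'code' mixes str with int, 'price' is uniformly float.
toy = pd.DataFrame({
    "name": pd.Series(["salt", np.nan, "tea"], dtype=object),
    "code": pd.Series(["61", 104, "94"], dtype=object),
    "price": [5.8, 9.3, 4.5],
})


def mixed_type_columns(frame):
    """Return {column: set of Python type names} for columns with more than one type."""
    mixed = {}
    for col in frame.columns:
        type_names = set(frame[col].map(lambda value: type(value).__name__))
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed


result = mixed_type_columns(toy)
print(result)
```

A column whose only extra type is `float` is most likely just holding NaN values; a column mixing `str` and `int` is the kind that would need converting, for example with `astype(str)`.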


In [9]:
# product_name contains some NaN values, which pandas stores as floats next to the strings
# NaN is the only kind of mixed-type data that's ok to leave
# (some names also appear twice; see the Duplicates section)
df['product_name'].value_counts(dropna = False)

NaN                                                     16
Black House Coffee Roasty Stout Beer                     2
Adore Forever Body Wash                                  2
Gluten Free Organic Peanut Butter & Chocolate Cereal     2
Fiber 4g Gummy Dietary Supplement                        2
                                                        ..
Kosher Dill Mini Pickles                                 1
Chili Rice Scooter                                       1
All Natural Chicken Apple Sausage                        1
Quartered Artichoke Hearts                               1
Fresh Foaming Cleanser                                   1
Name: product_name, Length: 49673, dtype: int64

### b) Outlier identification

In [10]:
# there's something strange going on with the price column
# there are two products that have outrageously high prices ($99,999 & $14,900)
df['prices'].value_counts().sort_index(ascending = False)

99999.0      1
14900.0      1
25.0         5
24.9         6
24.8         5
          ... 
1.4        321
1.3        286
1.2        297
1.1        296
1.0        116
Name: prices, Length: 242, dtype: int64

In [11]:
# treat every price above $100 as an outlier and set it to NaN
df.loc[df['prices'] > 100, 'prices'] = np.nan

In [13]:
df['prices'].max()

25.0
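
As a rough illustration of the same idea on made-up data, the sketch below counts the outliers before masking them, so the number of affected records is known. The frame `prices_toy` and the constant `PRICE_CEILING` are hypothetical; the cutoff of 100 simply mirrors the cell above.

```python
import pandas as pd

# Toy data (hypothetical) with two implausible prices.
prices_toy = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "prices": [5.8, 99999.0, 14900.0, 25.0],
})

PRICE_CEILING = 100  # assumed cutoff for a plausible price

too_high = prices_toy["prices"] > PRICE_CEILING
n_outliers = int(too_high.sum())

# Series.mask replaces values with NaN wherever the condition is True.
prices_toy["prices"] = prices_toy["prices"].mask(too_high)

max_price = prices_toy["prices"].max()  # max() skips NaN by default
print(n_outliers, max_price)
```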

In [12]:
# department_id looks fine: ids run from 1 to 21 with no missing values
df['department_id'].value_counts(dropna = False).sort_index()

1     4007
2      548
3     1516
4     1684
5     1056
6     1139
7     4365
8      972
9     1858
10      38
11    6565
12     907
13    5371
14    1116
15    2092
16    3449
17    3085
18    1081
19    6264
20    1322
21    1258
Name: department_id, dtype: int64

In [14]:
# aisle_id also looks fine: 134 distinct ids (1 to 134) and no missing values
df['aisle_id'].value_counts(dropna = False).sort_index()

1      146
2      271
3      832
4      543
5      409
      ... 
130    303
131    457
132    178
133    172
134     95
Name: aisle_id, Length: 134, dtype: int64

### c) Missing values

In [15]:
# find which column any missing values are in
df.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            2
dtype: int64

In [16]:
# take a quick look at the records with a missing product_name
df[df['product_name'].isnull()]

# this does the same thing
# df.loc[df['product_name'].isnull()]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


Options for dealing with missing values:
1. create a flag column that marks the affected records
2. impute a replacement value
3. remove or filter out the affected records
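
The three options can be sketched on a small made-up frame as follows. The frame `toy`, the column `name_missing` and the placeholder label are assumptions for illustration, not something this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing product name.
toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": pd.Series(["Salt", np.nan, "Tea"], dtype=object),
    "prices": [9.3, 12.2, 4.5],
})

# 1. Flag: add a boolean column marking rows with a missing name.
flagged = toy.assign(name_missing=toy["product_name"].isna())

# 2. Impute: fill the gap with a placeholder label (assumed value).
imputed = toy.assign(product_name=toy["product_name"].fillna("Unknown product"))

# 3. Remove: keep only rows where the name is present.
filtered = toy.dropna(subset=["product_name"])

print(flagged, imputed, filtered, sep="\n\n")
```

None of the three modifies `toy` itself, so the options can be compared side by side before committing to one.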

In [16]:
# I am choosing to leave these records as they are because they still hold valuable info
# about the price and the department for all these products

### d) Duplicates

In [17]:
# check to see if any records are exact duplicates
df[df.duplicated()]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [21]:
# drop the duplicates with pandas' handy function
df_nodups = df.drop_duplicates()

In [23]:
# some product_id numbers are still duplicated (same id, different product)! What to do?
df_nodups[df_nodups.duplicated(subset='product_id')]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
6800,6800,Sprouted Quinoa Flakes Baby Cereal,92,18,14.0
26521,26520,Cheese Shredded Sharp Cheddar Reduced Fat 2%,21,16,2.9


In [25]:
df_nodups.loc[df_nodups['product_id'] == 6800]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
6799,6800,Revive Zero Vitamin Water,64,7,6.4
6800,6800,Sprouted Quinoa Flakes Baby Cereal,92,18,14.0


In [26]:
df_nodups.loc[df_nodups['product_id'] == 26520]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
26520,26520,Clinical Advanced Solid Ultimate Fresh Anti-Pe...,80,11,10.6
26521,26520,Cheese Shredded Sharp Cheddar Reduced Fat 2%,21,16,2.9


In [27]:
# keep only the first record for each product_id and drop the later ones
df_nodups2 = df_nodups[~df_nodups.duplicated(subset='product_id')]

In [29]:
# for the two product_ids that were duplicated, the top record was kept
df_nodups2.loc[df_nodups2['product_id'] == 6800]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
6799,6800,Revive Zero Vitamin Water,64,7,6.4
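
A minimal sketch of how the `keep` argument controls which record survives, using made-up data modelled on the 6800 case (the frame `toy` and the two placeholder names are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): product_id 6800 appears twice with different names.
toy = pd.DataFrame({
    "product_id": [6799, 6800, 6800, 6801],
    "product_name": [
        "Placeholder A",
        "Revive Zero Vitamin Water",
        "Sprouted Quinoa Flakes Baby Cereal",
        "Placeholder B",
    ],
})

# keep=False marks every row that shares a product_id, so both sides of the conflict show up.
conflicts = toy[toy.duplicated(subset="product_id", keep=False)]

# keep="first" (the default) retains the first row for each product_id.
deduped = toy.drop_duplicates(subset="product_id", keep="first")

print(conflicts)
print(deduped)
```

Keeping the first record is a judgment call; inspecting `conflicts` first makes it easier to decide whether the first or the last row is the one worth keeping.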


In [30]:
# check to see if 7 records were removed (5 exact duplicates + 2 duplicate product_ids)
df_nodups2.shape

(49686, 5)

In [31]:
df.shape

(49693, 5)

## 04 Export

In [32]:
df_nodups2.to_csv(os.path.join(path, '02 Data', 'prepared data', 'products_clean.csv'))
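
One thing to be aware of when exporting: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again without `index_col`. The sketch below shows the difference on made-up data, using an in-memory buffer instead of a real file (the frame `toy` is hypothetical):

```python
import io

import pandas as pd

toy = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whether to pass `index=False` depends on whether the original row positions are still needed downstream.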