# Creating charts and visualisations in Python

    1. Importing libraries and datasets
    2. Addressing our new data set
        A. Wrangling
        B. Cleaning
        C. Merging with existing dataset
    3. Creating visualisations (included in Part2 notebook)

# 1. Importing libraries and datasets

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os

In [2]:
#folder path into usable string
path = r'/Users/manuellituma/01-2023 Instacart Basket Analysis'

In [3]:
#import customers data set
df = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col=False)

In [4]:
#checking output
print('Sample of new customers data set')
df.head(10)

Sample of new customers data set


Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [5]:
print('Number of rows and columns in customers data set')
df.shape

Number of rows and columns in customers data set


(206209, 10)

Initial observations:

Our new customers data set provides personal information about Instacart's customer base including first name, surname, gender, state location, age, when they joined Instacart, the number of dependents they have, their family status, and finally their income. Already there are some spelling and syntactical errors in the column names which we will be able to fix to make the data set easier to understand.

There are 206,209 rows, so lots of customers!

# 2. Addressing our dataset

#### A. Data Wrangling

Approach

    1. Some simple exploratory analysis to understand a couple of the columns.
    2. Removal of the first_name, and surname columns. User_id is a more useful customer identifier than first or surname.
    3. Amend any spelling and syntactical errors in the column titles. This will include use of capital and lowercase letters, and use of underscores and spaces.
    4. Check data types of columns to ensure they are correct and appropriate for the type of data collected
    5. Use a data dictionary to change the data within the STATE column to two-letter codes. Although not strictly needed, two-letter codes are cleaner and would make it easier to, for example, search for specific states using the isin function.
    6. Create a new column with a flag for whether or not the customer has dependents. Instacart could provide targeted advertising to individuals based on whether they have dependents or not.

1. Initial exploratory analysis to understand some of the columns

In [6]:
#checking gender column
df['Gender'].value_counts()

Male      104067
Female    102142
Name: Gender, dtype: int64

In [7]:
#checking age column
df['Age'].value_counts()

19    3329
55    3317
51    3317
56    3306
32    3305
      ... 
65    3145
25    3127
66    3114
50    3102
36    3101
Name: Age, Length: 64, dtype: int64

In [8]:
df['Age'].describe()

count    206209.000000
mean         49.501646
std          18.480962
min          18.000000
25%          33.000000
50%          49.000000
75%          66.000000
max          81.000000
Name: Age, dtype: float64

In [9]:
#checking fam_status column
df['fam_status'].value_counts()

married                             144906
single                               33962
divorced/widowed                     17640
living with parents and siblings      9701
Name: fam_status, dtype: int64

In [10]:
#checking income column
df['income'].describe()

count    206209.000000
mean      94632.852548
std       42473.786988
min       25903.000000
25%       59874.000000
50%       93547.000000
75%      124244.000000
max      593901.000000
Name: income, dtype: float64

In [11]:
#checking n_dependants
df['n_dependants'].describe()

count    206209.000000
mean          1.499823
std           1.118433
min           0.000000
25%           0.000000
50%           1.000000
75%           3.000000
max           3.000000
Name: n_dependants, dtype: float64

Observations:

I wanted to understand more about the types of values contained within these columns before I started amending anything. Everything so far seems reasonable.

2. Removal of first name and surname columns

In [12]:
#preview the dataframe without the First Name and Surnam columns using drop function (result not yet assigned)
df.drop(columns = ['First Name', 'Surnam'])

Unnamed: 0,user_id,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374
...,...,...,...,...,...,...,...,...
206204,168073,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Female,California,27,4/1/2020,1,married,99799


In [13]:
#redefine dataframe without the First Name and Surnam columns
df = df.drop(columns = ['First Name', 'Surnam'])

In [14]:
#check output
df.head()

Unnamed: 0,user_id,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374


In [15]:
#checking shape
df.shape

(206209, 8)
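
The two drop cells above behave differently because `drop` returns a new dataframe rather than changing the original. A minimal sketch of that behaviour, using a toy frame with made-up values (the names `toy` and `preview` are illustrative only):

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the original column names
toy = pd.DataFrame({
    'user_id': [1, 2],
    'First Name': ['Ann', 'Bob'],
    'Surnam': ['Lee', 'Ray'],
    'Age': [30, 41],
})

# drop() returns a new dataframe; toy itself is untouched
preview = toy.drop(columns=['First Name', 'Surnam'])
print(toy.shape)      # still four columns
print(preview.shape)  # two columns

# Reassigning is what makes the change stick
toy = toy.drop(columns=['First Name', 'Surnam'])
print(list(toy.columns))
```

This is why the first cell only displays a preview, while the second cell actually updates `df`.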

3. Amending spelling and syntax errors in columns names

In [16]:
#amending Gender, STATE, and Age column names to lowercase
df.rename(columns = {'Gender':'gender', 'STATE':'state', 'Age':'age'}, inplace = True)

In [17]:
df.head()

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,fam_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374
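
With only three columns to fix, an explicit rename dictionary is easy to read. For wider tables, the same tidy-up could be done programmatically. The sketch below is an alternative I did not use in this notebook; it runs on an empty toy frame that has the original column names.

```python
import pandas as pd

# Empty toy frame carrying the original customers column names
toy = pd.DataFrame(columns=['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age'])

# Lowercase everything and swap spaces for underscores in one pass
toy.columns = toy.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)

# Spelling mistakes still need an explicit fix
toy = toy.rename(columns={'surnam': 'surname'})
print(list(toy.columns))
```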


4. Checking datatypes

In [18]:
#returning the datatypes for all columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   gender        206209 non-null  object
 2   state         206209 non-null  object
 3   age           206209 non-null  int64 
 4   date_joined   206209 non-null  object
 5   n_dependants  206209 non-null  int64 
 6   fam_status    206209 non-null  object
 7   income        206209 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 12.6+ MB


Observations:

The datatypes are mostly correct, but the integer columns can be stored in smaller types to save memory. The date_joined column is currently stored as object (text) and should be changed to datetime.
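
To decide how far each integer column can be reduced, the maximum values from the describe() outputs above can be compared against the range of each integer type. This is a rough sketch; `smallest_int_dtype` is a helper written for this illustration, not a pandas or NumPy function.

```python
import numpy as np

# Maximum values taken from the describe() outputs earlier in this notebook
observed_max = {'age': 81, 'n_dependants': 3, 'income': 593901}

def smallest_int_dtype(max_value, min_value=0):
    """Return the name of the narrowest signed integer type that holds the range."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return np.dtype(dtype).name
    raise ValueError('value does not fit in a 64-bit integer')

for col, col_max in observed_max.items():
    print(col, smallest_int_dtype(col_max))
```

By this check int8 would be enough for n_dependants as well as age; the int16 used in this notebook is simply a more cautious choice.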

In [19]:
#amending datatype for user_id, age, date_joined, n_dependants, and income
df['user_id'] = df['user_id'].astype('int32')
df['age'] = df['age'].astype('int8')
#converting the whole column at once is faster than applying pd.to_datetime row by row
df['date_joined'] = pd.to_datetime(df['date_joined'])
df['n_dependants'] = df['n_dependants'].astype('int16')
df['income'] = df['income'].astype('int32')

In [20]:
#checking output
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   user_id       206209 non-null  int32         
 1   gender        206209 non-null  object        
 2   state         206209 non-null  object        
 3   age           206209 non-null  int8          
 4   date_joined   206209 non-null  datetime64[ns]
 5   n_dependants  206209 non-null  int16         
 6   fam_status    206209 non-null  object        
 7   income        206209 non-null  int32         
dtypes: datetime64[ns](1), int16(1), int32(2), int8(1), object(3)
memory usage: 8.5+ MB


In [21]:
df.head()

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,fam_status,income
0,26711,Female,Missouri,48,2017-01-01,3,married,165665
1,33890,Female,New Mexico,36,2017-01-01,0,single,59285
2,65803,Male,Idaho,35,2017-01-01,2,married,99568
3,125935,Female,Iowa,40,2017-01-01,0,single,42049
4,130797,Female,Maryland,26,2017-01-01,1,married,40374


5. Using a data dictionary to change state names to two-letter abbreviations

In [22]:
#creating list of all state names
states_listed = df['state'].unique()

In [23]:
#checking output
print(sorted(states_listed))

['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']


In [24]:
#number of each state
df['state'].value_counts().sort_index()

Alabama                 4044
Alaska                  4044
Arizona                 4044
Arkansas                4044
California              4044
Colorado                4044
Connecticut             4044
Delaware                4044
District of Columbia    4044
Florida                 4044
Georgia                 4044
Hawaii                  4044
Idaho                   4044
Illinois                4044
Indiana                 4044
Iowa                    4044
Kansas                  4043
Kentucky                4043
Louisiana               4043
Maine                   4043
Maryland                4043
Massachusetts           4043
Michigan                4043
Minnesota               4043
Mississippi             4043
Missouri                4043
Montana                 4043
Nebraska                4043
Nevada                  4043
New Hampshire           4043
New Jersey              4043
New Mexico              4043
New York                4043
North Carolina          4043
North Dakota  

In [25]:
#counting number of states
state_count = df['state'].nunique()
print(state_count)

51


In [26]:
#creating data dictionary, abbreviations were retrieved from US government website
state_dict = {'Alabama': 'AL',
              'Alaska': 'AK', 
              'Arizona': 'AZ',
              'Arkansas': 'AR',
              'California': 'CA',
              'Colorado': 'CO',
              'Connecticut': 'CT',
              'Delaware': 'DE',
              'District of Columbia': 'DC',
              'Florida': 'FL',
              'Georgia': 'GA',
              'Hawaii': 'HI',
              'Idaho': 'ID',
              'Illinois': 'IL',
              'Indiana': 'IN',
              'Iowa': 'IA',
              'Kansas': 'KS',
              'Kentucky': 'KY',
              'Louisiana': 'LA',
              'Maine': 'ME',
              'Maryland': 'MD',
              'Massachusetts': 'MA',
              'Michigan': 'MI',
              'Minnesota': 'MN',
              'Mississippi': 'MS',
              'Missouri': 'MO',
              'Montana': 'MT',
              'Nebraska': 'NE',
              'Nevada': 'NV',
              'New Hampshire': 'NH',
              'New Jersey': 'NJ',
              'New Mexico': 'NM',
              'New York': 'NY',
              'North Carolina': 'NC',
              'North Dakota': 'ND',
              'Ohio': 'OH',
              'Oklahoma': 'OK',
              'Oregon': 'OR',
              'Pennsylvania': 'PA',
              'Rhode Island': 'RI',
              'South Carolina': 'SC',
              'South Dakota': 'SD',
              'Tennessee': 'TN',
              'Texas': 'TX',
              'Utah': 'UT',
              'Vermont': 'VT',
              'Virginia': 'VA',
              'Washington': 'WA',
              'West Virginia': 'WV',
              'Wisconsin': 'WI',
              'Wyoming': 'WY'}

In [27]:
#replace values in state column with state dictionary 
df2 = df.replace({'state': state_dict})

In [28]:
df2.sample(10)

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,fam_status,income
134131,97366,Male,CT,36,2019-02-12,1,married,97480
53251,151698,Male,NH,56,2017-11-02,2,married,30859
154992,146265,Female,WA,19,2019-06-12,2,living with parents and siblings,54254
91679,38327,Female,NM,70,2018-06-11,1,married,107250
193014,2557,Male,CT,36,2020-01-17,1,married,42776
29239,156607,Male,MS,27,2017-06-18,1,married,42631
65682,69416,Female,CA,26,2018-01-13,3,married,52579
117929,63704,Female,CA,39,2018-11-10,3,married,95421
195441,26175,Female,HI,65,2020-02-01,3,married,128988
23310,152468,Female,ND,72,2017-05-15,2,married,134963


In [29]:
df2['state'].value_counts().sort_index()

AK    4044
AL    4044
AR    4044
AZ    4044
CA    4044
CO    4044
CT    4044
DC    4044
DE    4044
FL    4044
GA    4044
HI    4044
IA    4044
ID    4044
IL    4044
IN    4044
KS    4043
KY    4043
LA    4043
MA    4043
MD    4043
ME    4043
MI    4043
MN    4043
MO    4043
MS    4043
MT    4043
NC    4043
ND    4043
NE    4043
NH    4043
NJ    4043
NM    4043
NV    4043
NY    4043
OH    4043
OK    4043
OR    4043
PA    4043
RI    4043
SC    4043
SD    4043
TN    4043
TX    4043
UT    4043
VA    4043
VT    4043
WA    4043
WI    4043
WV    4043
WY    4043
Name: state, dtype: int64

The value counts for the abbreviated state column match those for the state column in the original dataframe.
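
Comparing value counts works here because all 51 names are in the dictionary. One caveat is that `replace` leaves any name missing from the dictionary unchanged, so a gap could go unnoticed. A sketch of a stricter check using `map`, on a toy series with one deliberately unmapped value (the shortened dictionary and the 'Puerto Rico' entry are illustrative only):

```python
import pandas as pd

# Shortened dictionary and toy data, for illustration only
state_dict_small = {'Missouri': 'MO', 'New Mexico': 'NM', 'Idaho': 'ID'}
states = pd.Series(['Missouri', 'New Mexico', 'Idaho', 'Missouri', 'Puerto Rico'])

# map() returns NaN for names missing from the dictionary
mapped = states.map(state_dict_small)
unmapped = states[mapped.isna()].unique()
print(unmapped)

# The two-letter codes are then easy to filter with isin
print(mapped.isin(['MO', 'ID']).sum())
```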

6. Create a new flag column on whether the customer has dependants

In [30]:
df2.loc[df2['n_dependants'] >= 1, 'dependants_loc'] = 'Has dependants'

In [31]:
df2.loc[df2['n_dependants'] < 1, 'dependants_loc'] = 'No dependants'

In [32]:
df2['dependants_loc'].value_counts(dropna = False)

Has dependants    154607
No dependants      51602
Name: dependants_loc, dtype: int64
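
The two loc statements could also be written as a single conditional assignment. A minimal sketch with `np.where`, on a toy frame with made-up dependant counts:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical dependant counts
toy = pd.DataFrame({'n_dependants': [3, 0, 2, 0, 1]})

# One pass: label rows with at least one dependant, everything else gets the other label
toy['dependants_loc'] = np.where(toy['n_dependants'] >= 1,
                                 'Has dependants',
                                 'No dependants')
print(toy['dependants_loc'].value_counts(dropna=False))
```

One difference to keep in mind: with `np.where`, a missing n_dependants value would fall into 'No dependants', whereas the loc approach would leave it as NaN. That does not matter here, since the column has no missing values.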

In [33]:
df2.head()

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,fam_status,income,dependants_loc
0,26711,Female,MO,48,2017-01-01,3,married,165665,Has dependants
1,33890,Female,NM,36,2017-01-01,0,single,59285,No dependants
2,65803,Male,ID,35,2017-01-01,2,married,99568,Has dependants
3,125935,Female,IA,40,2017-01-01,0,single,42049,No dependants
4,130797,Female,MD,26,2017-01-01,1,married,40374,Has dependants


Although the code has worked, I would prefer for my two columns related to dependents to be situated next to one another.

In [34]:
#creating a list of my column titles
titles = list(df2.columns)
titles

['user_id',
 'gender',
 'state',
 'age',
 'date_joined',
 'n_dependants',
 'fam_status',
 'income',
 'dependants_loc']

In [35]:
#creating a new list, with the new changed order of columns
new_titles = ['user_id',
 'gender',
 'state',
 'age',
 'date_joined',
 'n_dependants',
 'dependants_loc',
 'fam_status',
 'income']

In [36]:
#assigning my new column title order to my dataframe
df3 = df2[new_titles]

In [37]:
#checking my output
df3.sample(5)

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,dependants_loc,fam_status,income
149967,30062,Female,NV,26,2019-05-14,3,Has dependants,married,44404
129737,58317,Female,NH,27,2019-01-18,0,No dependants,single,58120
1258,196227,Male,OH,81,2017-01-08,2,Has dependants,married,83578
111709,198952,Female,AL,71,2018-10-06,1,Has dependants,married,97188
173551,75149,Male,NM,64,2019-09-27,2,Has dependants,married,116420
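
Typing out the new column order by hand is fine for nine columns. For larger dataframes, the position could instead be worked out from the list itself. A sketch on an empty toy frame with the same column names:

```python
import pandas as pd

# Empty toy frame with the same columns as df2
toy = pd.DataFrame(columns=['user_id', 'gender', 'state', 'age', 'date_joined',
                            'n_dependants', 'fam_status', 'income', 'dependants_loc'])

# Move dependants_loc so it sits directly after n_dependants
cols = list(toy.columns)
cols.remove('dependants_loc')
cols.insert(cols.index('n_dependants') + 1, 'dependants_loc')

toy = toy[cols]
print(list(toy.columns))
```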


Final check of shape and size of dataframe

In [38]:
df3.sample(10)

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,dependants_loc,fam_status,income
33658,196183,Female,MS,20,2017-07-14,1,Has dependants,living with parents and siblings,68801
41875,101326,Male,RI,54,2017-08-30,3,Has dependants,married,109183
146224,69106,Male,AL,80,2019-04-23,0,No dependants,divorced/widowed,117733
5027,46765,Female,WV,77,2017-01-29,0,No dependants,divorced/widowed,169688
148393,31554,Male,MN,41,2019-05-05,2,Has dependants,married,80480
4509,44802,Female,NH,68,2017-01-26,0,No dependants,divorced/widowed,39656
7931,198561,Male,KY,33,2017-02-15,2,Has dependants,married,73825
54759,47494,Male,ID,21,2017-11-11,0,No dependants,single,90189
23,55567,Female,NC,67,2017-01-01,2,Has dependants,married,48645
109022,170854,Female,AR,39,2018-09-20,0,No dependants,single,87879


In [39]:
df3.shape

(206209, 9)

Final Observations

In terms of the shape of our dataframe, we have the same number of rows as before. We have dropped two columns, first name and surname, and created another, dependants_loc, taking the column count from 10 to 9. The date_joined column was kept and converted to datetime.

We have also ensured consistency in our column naming conventions and changed the state names to abbreviations.

#### B. Data Quality and Consistency

##### Approach

We will check three aspects of data quality and consistency:

    1. Checking for null values
    2. Checking for duplicates
    3. Checking for mixed-type data

1. Checking for null values

In [40]:
df3.isnull().sum()

user_id           0
gender            0
state             0
age               0
date_joined       0
n_dependants      0
dependants_loc    0
fam_status        0
income            0
dtype: int64

Observations:

No missing data observed

2. Checking for duplicates

In [41]:
df_dups = df3[df3.duplicated()]

In [42]:
df_dups

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,dependants_loc,fam_status,income


Observations:

No duplicate rows found
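
The check above only flags rows that are identical in every column. Since user_id is the key for the merge later on, it may also be worth confirming that no user_id appears twice. A sketch of the difference on toy data (values are made up):

```python
import pandas as pd

# Toy data: user 2 appears twice with different incomes
toy = pd.DataFrame({'user_id': [1, 2, 2],
                    'income': [50000, 60000, 61000]})

# Full-row check: no two rows match in every column
full_row_dups = toy.duplicated().sum()

# Key-only check: both rows for user 2 are flagged
id_dups = toy.duplicated(subset='user_id', keep=False).sum()

print(full_row_dups, id_dups)
```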

3. Checking for mixed-type data

In [43]:
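#compare the type of every value in a column against the type of the first value
for col in df3.columns.tolist():
    weird = (df3[[col]].applymap(type) != df3[[col]].iloc[0].apply(type)).any(axis = 1)
    #print the column name if any value has a different type
    if len(df3[weird]) > 0:
        print(col)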

Observations:

No mixed-type data identified.

Interestingly, when I ran this code on the original data there were over 11,000 mixed-type data points in the first name field, representing around 5.5% of the data set. Given that name is not important to our statistical analysis, we have removed the column.
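
A more compact way to run the same kind of check is to count the distinct Python types in each column. The sketch below uses toy data; the missing name stored as NaN is an assumption about one possible cause of mixed types, not something confirmed for the original file.

```python
import pandas as pd

# Toy data: a text column with one missing value (NaN is a float), plus a clean column
toy = pd.DataFrame({
    'first_name': pd.Series(['Ann', float('nan'), 'Bob'], dtype=object),
    'age': [30, 41, 25],
})

# A column is mixed-type if its values have more than one Python type
mixed_cols = [col for col in toy.columns
              if toy[col].map(type).nunique() > 1]
print(mixed_cols)
```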

#### C. Merging

Merging our new customers dataset with our existing orders and products dataset

In [45]:
#import orders and products dataset 
df_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'full_merged5.pkl'))

In [46]:
#check output
df_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,...,prices,price_range_loc,busiest_day,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,order_freq,order_freq_flag
0,2539329,1,1,2,8,,196,1,0,both,...,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,both,...,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,both,...,9.0,Mid-range product,Regularly busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,both,...,9.0,Mid-range product,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,both,...,9.0,Mid-range product,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer


In [47]:
df_merge.shape

(32404859, 23)

In [48]:
df_merge.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,aisle_id,department_id,prices,max_order,avg_price,order_freq
count,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,30328760.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404850.0
mean,1710745.0,102937.2,17.1423,2.738867,13.42515,11.10408,25598.66,8.352547,0.5895873,71.19612,9.919792,11.98023,33.05217,11.98023,10.39776
std,987298.8,59466.1,17.53532,2.090077,4.24638,8.779064,14084.0,7.127071,0.4919087,38.21139,6.281485,495.6554,25.15525,83.24227,7.131754
min,2.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,855947.0,51422.0,5.0,1.0,10.0,5.0,13544.0,3.0,0.0,31.0,4.0,4.2,13.0,7.39,6.0
50%,1711049.0,102616.0,11.0,3.0,13.0,8.0,25302.0,6.0,1.0,83.0,9.0,7.4,26.0,7.82,8.0
75%,2565499.0,154389.0,24.0,5.0,16.0,15.0,37947.0,11.0,1.0,107.0,16.0,11.3,47.0,8.25,13.0
max,3421083.0,206209.0,99.0,6.0,23.0,30.0,49688.0,145.0,1.0,134.0,21.0,99999.0,99.0,25005.42,30.0


In [49]:
#reminder of our new data set to be merged 
df3.head()

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,dependants_loc,fam_status,income
0,26711,Female,MO,48,2017-01-01,3,Has dependants,married,165665
1,33890,Female,NM,36,2017-01-01,0,No dependants,single,59285
2,65803,Male,ID,35,2017-01-01,2,Has dependants,married,99568
3,125935,Female,IA,40,2017-01-01,0,No dependants,single,42049
4,130797,Female,MD,26,2017-01-01,1,Has dependants,married,40374


In [50]:
df3.shape

(206209, 9)

To merge our two dataframes, we need to have a common variable. In this case, it is user_id.

In [51]:
#merging our datasets
df_combo = df_merge.merge(df3, on = ['user_id'])

In [52]:
df_combo.sample(5)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,...,order_freq,order_freq_flag,gender,state,age,date_joined,n_dependants,dependants_loc,fam_status,income
15209086,3316170,189762,3,2,22,6.0,15872,7,0,both,...,11.0,Regular customer,Female,SD,61,2018-06-06,1,Has dependants,married,51811
16337631,1572825,18003,4,1,16,18.0,12341,6,1,both,...,10.0,Frequent customer,Female,WY,50,2019-06-03,0,No dependants,single,78568
26174813,111261,167534,31,1,15,16.0,48697,32,0,both,...,9.0,Frequent customer,Male,WI,42,2018-01-11,2,Has dependants,married,114972
27480992,3420040,117639,15,2,16,7.0,35722,7,1,both,...,13.0,Regular customer,Female,MD,58,2018-10-15,3,Has dependants,married,109582
6766120,2781409,65413,32,5,17,2.0,37938,1,1,both,...,6.0,Frequent customer,Female,OK,54,2018-02-02,3,Has dependants,married,104390


In [53]:
df_combo.shape

(32404859, 31)

In [54]:
#checking value counts of _merge (this flag was carried over from the earlier orders and products merge)
df_combo['_merge'].value_counts()

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64
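
Note that the _merge column was already present in df_merge before this step, so the counts above describe the earlier merge. The unchanged row count (32,404,859 before and after) is what shows that every order row found a matching customer. To test the customer merge directly, a separately named indicator and a relationship check could be used. A sketch on toy frames; the name `customer_merge` is illustrative:

```python
import pandas as pd

# Toy frames: many orders per user, exactly one customer row per user
orders = pd.DataFrame({'order_id': [10, 11, 12], 'user_id': [1, 1, 2]})
customers = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['MO', 'NM', 'ID']})

combo = orders.merge(customers,
                     on='user_id',
                     how='left',                  # keep every order row
                     validate='m:1',              # error if a user_id repeats in customers
                     indicator='customer_merge')  # custom name avoids clashing with _merge

print(combo['customer_merge'].value_counts())
```

With `how='left'`, any order without a customer record would appear as left_only instead of being silently dropped.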

In [55]:
df_combo.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,aisle_id,department_id,prices,max_order,avg_price,order_freq,age,n_dependants,income
count,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,30328760.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404850.0,32404860.0,32404860.0,32404860.0
mean,1710745.0,102937.2,17.1423,2.738867,13.42515,11.10408,25598.66,8.352547,0.5895873,71.19612,9.919792,11.98023,33.05217,11.98023,10.39776,49.46527,1.501896,99437.73
std,987298.8,59466.1,17.53532,2.090077,4.24638,8.779064,14084.0,7.127071,0.4919087,38.21139,6.281485,495.6554,25.15525,83.24227,7.131754,18.48558,1.118865,43057.27
min,2.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,18.0,0.0,25903.0
25%,855947.0,51422.0,5.0,1.0,10.0,5.0,13544.0,3.0,0.0,31.0,4.0,4.2,13.0,7.39,6.0,33.0,1.0,67004.0
50%,1711049.0,102616.0,11.0,3.0,13.0,8.0,25302.0,6.0,1.0,83.0,9.0,7.4,26.0,7.82,8.0,49.0,2.0,96618.0
75%,2565499.0,154389.0,24.0,5.0,16.0,15.0,37947.0,11.0,1.0,107.0,16.0,11.3,47.0,8.25,13.0,65.0,3.0,127912.0
max,3421083.0,206209.0,99.0,6.0,23.0,30.0,49688.0,145.0,1.0,134.0,21.0,99999.0,99.0,25005.42,30.0,81.0,3.0,593901.0


In [56]:
#checking on space constraints
df_combo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404859 entries, 0 to 32404858
Data columns (total 31 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   order_id                int64         
 1   user_id                 int64         
 2   order_number            int64         
 3   orders_day_of_week      int64         
 4   order_hour_of_day       int64         
 5   days_since_prior_order  float64       
 6   product_id              int64         
 7   add_to_cart_order       int64         
 8   reordered               int64         
 9   _merge                  category      
 10  product_name            object        
 11  aisle_id                int64         
 12  department_id           int64         
 13  prices                  float64       
 14  price_range_loc         object        
 15  busiest_day             object        
 16  busiest_period_of_day   object        
 17  max_order               int64         
 18  

# Export dataframe

In [57]:
#export to pickle 
df_combo.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_all.pkl'))
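
Pickle is a useful export format here because it keeps the reduced datatypes and the datetime column exactly as they are, so they do not need to be converted again after loading. A small round-trip sketch on toy data, written to a temporary folder rather than the project path:

```python
import os
import tempfile

import pandas as pd

# Toy frame with a reduced integer type and a datetime column
toy = pd.DataFrame({
    'age': pd.Series([48, 36], dtype='int8'),
    'date_joined': pd.to_datetime(['2017-01-01', '2017-01-01']),
})

# Write to a temporary folder and read straight back
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'toy.pkl')
    toy.to_pickle(target)
    restored = pd.read_pickle(target)

print(restored.dtypes)
```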