### 4.9. Task Submission - Intro to Data Viz - part 1.2 - Kaggle

#### Directions

##### Part 1.1

Download the customer data set and add it to your “Original Data” folder.

Create a new notebook in your “Scripts” folder for part 1 of this task.

Import your analysis libraries, as well as your new customer data set as a dataframe.

Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

##### Part 1.2

Combine your customer data with the rest of your prepared Instacart data. (Hint: Make sure the key columns are the same data type!)

Ensure your notebook contains logical titles, section headings, and descriptive code comments.

Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.

Save your notebook so that you can send it to your tutor for review after completing part 2.

#### Importing libraries and files

In [1]:
import pandas as pd
import numpy as np
import os
import polars as pl

In [2]:
path = "/kaggle/input/instacart/"

In [3]:
# import the cleaned customer data (customers_clean.csv) as a pandas dataframe
df_cust = pd.read_csv(os.path.join(path, 'customers_clean.csv'), index_col = False)

In [5]:
# import the prepared orders/products data (parquet) as a Polars dataframe
ords_prods_pq = pl.read_parquet(os.path.join(path, 'ords_prods_merge_groups.parquet'))

In [6]:
ords_prods_pq.shape

(32404859, 25)

In [7]:
ords_prods_pq.columns

['order_id',
 'user_id',
 'order_number',
 'orders_day_of_week',
 'order_hour_of_day',
 'days_since_prior_order',
 'product_id',
 'add_to_cart_order',
 'reordered',
 '_merge',
 'product_name',
 'aisle_id',
 'department_id',
 'prices',
 'match',
 'price_range_loc',
 'busiest_day',
 'orders_affluence',
 'busiest_period_of_day',
 'max_order',
 'loyalty_flag',
 'user_avg_price',
 'spending_flag',
 'user_ord_freq',
 'order_frequency_flag']

In [12]:
# convert the pandas customer dataframe to a Polars dataframe
df_cust = pl.from_pandas(df_cust)

In [13]:
# inner join on user_id in Polars; this runs more efficiently than the pandas merge,
# but Polars' join has no indicator option
ddf_merge = ords_prods_pq.join(df_cust, on = ['user_id'], how='inner')

In [14]:
ddf_merge.head()

| order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | _merge | product_name | aisle_id | department_id | prices | match | price_range_loc | busiest_day | orders_affluence | busiest_period_of_day | max_order | loyalty_flag | user_avg_price | spending_flag | user_ord_freq | order_frequency_flag | first_name | surname | gender | state | age | date_joined | n_dependants | fam_status | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | i64 | i64 | i64 | i64 | f64 | i64 | i64 | i64 | cat | str | i64 | i64 | f64 | cat | str | str | str | str | i64 | str | f64 | str | f64 | str | str | str | str | str | i64 | str | i64 | str | i64 |
| 2539329 | 1 | 1 | 2 | 8 | null | 196 | 1 | 0 | "both" | "Soda" | 77 | 7 | 9.0 | "both" | "Mid-range product" | "Regularly busy" | "Regularly busy" | "Average orders" | 10 | "New customer" | 6.367797 | "Low Spender" | 20.5 | "Non-frequent customer" | "Linda" | "Nguyen" | "Female" | "Alabama" | 31 | "2/17/2019" | 3 | "married" | 40423 |
| 2539329 | 1 | 1 | 2 | 8 | null | 14084 | 2 | 0 | "both" | "Organic Unsweetened Vanilla Al…" | 91 | 16 | 12.5 | "both" | "Mid-range product" | "Regularly busy" | "Regularly busy" | "Average orders" | 10 | "New customer" | 6.367797 | "Low Spender" | 20.5 | "Non-frequent customer" | "Linda" | "Nguyen" | "Female" | "Alabama" | 31 | "2/17/2019" | 3 | "married" | 40423 |
| 2539329 | 1 | 1 | 2 | 8 | null | 12427 | 3 | 0 | "both" | "Original Beef Jerky" | 23 | 19 | 4.4 | "both" | "Low-range product" | "Regularly busy" | "Regularly busy" | "Average orders" | 10 | "New customer" | 6.367797 | "Low Spender" | 20.5 | "Non-frequent customer" | "Linda" | "Nguyen" | "Female" | "Alabama" | 31 | "2/17/2019" | 3 | "married" | 40423 |
| 2539329 | 1 | 1 | 2 | 8 | null | 26088 | 4 | 0 | "both" | "Aged White Cheddar Popcorn" | 23 | 19 | 4.7 | "both" | "Low-range product" | "Regularly busy" | "Regularly busy" | "Average orders" | 10 | "New customer" | 6.367797 | "Low Spender" | 20.5 | "Non-frequent customer" | "Linda" | "Nguyen" | "Female" | "Alabama" | 31 | "2/17/2019" | 3 | "married" | 40423 |
| 2539329 | 1 | 1 | 2 | 8 | null | 26405 | 5 | 0 | "both" | "XL Pick-A-Size Paper Towel Rol…" | 54 | 17 | 1.0 | "both" | "Low-range product" | "Regularly busy" | "Regularly busy" | "Average orders" | 10 | "New customer" | 6.367797 | "Low Spender" | 20.5 | "Non-frequent customer" | "Linda" | "Nguyen" | "Female" | "Alabama" | 31 | "2/17/2019" | 3 | "married" | 40423 |


In [15]:
ddf_merge.shape

(32404859, 34)

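I've also noticed that Polars displays each column's data type directly beneath its name in the table header. That line is part of the header, not a row of data, which is why `ddf_merge.shape` still reports 32,404,859 rows.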

In [17]:
ddf_merge.columns

['order_id',
 'user_id',
 'order_number',
 'orders_day_of_week',
 'order_hour_of_day',
 'days_since_prior_order',
 'product_id',
 'add_to_cart_order',
 'reordered',
 '_merge',
 'product_name',
 'aisle_id',
 'department_id',
 'prices',
 'match',
 'price_range_loc',
 'busiest_day',
 'orders_affluence',
 'busiest_period_of_day',
 'max_order',
 'loyalty_flag',
 'user_avg_price',
 'spending_flag',
 'user_ord_freq',
 'order_frequency_flag',
 'first_name',
 'surname',
 'gender',
 'state',
 'age',
 'date_joined',
 'n_dependants',
 'fam_status',
 'income']

In [None]:
# drop the leftover merge-indicator columns; in Polars, drop() takes the column names
# directly and has no axis argument (unlike pandas, where axis = 1 is needed)
ddf_merge = ddf_merge.drop(['_merge', 'match'])

In [16]:
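# note: without parentheses this returns the bound method, whose repr is the frame preview below;
# call ddf_merge.describe() to get summary statistics instead
ddf_merge.describe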

<bound method DataFrame.describe of shape: (32_404_859, 34)
┌──────────┬─────────┬────────────┬────────────┬───┬────────────┬────────────┬────────────┬────────┐
│ order_id ┆ user_id ┆ order_numb ┆ orders_day ┆ … ┆ date_joine ┆ n_dependan ┆ fam_status ┆ income │
│ ---      ┆ ---     ┆ er         ┆ _of_week   ┆   ┆ d          ┆ ts         ┆ ---        ┆ ---    │
│ i64      ┆ i64     ┆ ---        ┆ ---        ┆   ┆ ---        ┆ ---        ┆ str        ┆ i64    │
│          ┆         ┆ i64        ┆ i64        ┆   ┆ str        ┆ i64        ┆            ┆        │
╞══════════╪═════════╪════════════╪════════════╪═══╪════════════╪════════════╪════════════╪════════╡
│ 2539329  ┆ 1       ┆ 1          ┆ 2          ┆ … ┆ 2/17/2019  ┆ 3          ┆ married    ┆ 40423  │
│ 2539329  ┆ 1       ┆ 1          ┆ 2          ┆ … ┆ 2/17/2019  ┆ 3          ┆ married    ┆ 40423  │
│ 2539329  ┆ 1       ┆ 1          ┆ 2          ┆ … ┆ 2/17/2019  ┆ 3          ┆ married    ┆ 40423  │
│ 2539329  ┆ 1       ┆ 1       

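Polars' `join` has no equivalent of the `indicator` argument in pandas' `merge`, so I could not run my usual merge check. The row counts offer a partial substitute: the inner join returned 32,404,859 rows, the same as `ords_prods_pq`, so (assuming `user_id` is unique in the customer data) every order row found a matching customer.
With `_merge` and `match` dropped, the dataframe has 32 columns and is ready to rock and roll.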

In [None]:
# exporting merged file (polars)
ddf_merge.write_parquet("/kaggle/working/ords_prods_custs_merge.parquet")
