# Customer Behavior Analysis

## What is customer behavior?
Customer behavior describes the decisions and instincts that lead a customer to buy a particular product or service.

## The dataset
The dataset used in this project was collected from an e-commerce store that sells products in multiple categories. It covers only October and November **2019**. The dataset description can be found [here](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store).


In [1]:
import pandas as pd

## Working with large CSV files in Python

In [2]:
import time

### Using Pandas chunksize

In [3]:
# Start the timer
s_time_chunk = time.time()

# With chunksize set, read_csv returns an iterator of 10,000-row DataFrames
chunks = pd.read_csv('../cBADatasets/2019-Oct.csv', chunksize=10000)

# Concatenate the chunks into a single DataFrame
df_chunk = pd.concat(chunks)
display(df_chunk.tail())

# Stop the timer (the measurement includes the display call)
e_time_chunk = time.time()

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 42448759 | 2019-10-31 23:59:58 UTC | view | 2300275 | 2053013560530830019 | electronics.camera.video | gopro | 527.4 | 537931532 | 22c57267-da98-4f28-9a9c-18bb5b385193 |
| 42448760 | 2019-10-31 23:59:58 UTC | view | 10800172 | 2053013554994348409 | | redmond | 61.75 | 527322328 | 5054190a-46cb-4211-a8f1-16fc1a060ed8 |
| 42448761 | 2019-10-31 23:59:58 UTC | view | 5701038 | 2053013553970938175 | auto.accessories.player | kenwood | 128.7 | 566280422 | 05b6c62b-992f-4e8e-91f7-961bcb4719cd |
| 42448762 | 2019-10-31 23:59:59 UTC | view | 21407424 | 2053013561579406073 | electronics.clocks | tissot | 689.85 | 513118352 | 4c14bf2a-2820-4504-929d-046356a5a204 |
| 42448763 | 2019-10-31 23:59:59 UTC | view | 13300120 | 2053013557166998015 | | swisshome | 155.73 | 525266378 | 6e57d2d7-6022-46e6-81d6-fa77f14cefd8 |


In [4]:
print("With chunks: ", e_time_chunk - s_time_chunk, "seconds")

With chunks:  49.85872507095337 seconds


In [5]:
df_chunk.shape

(42448764, 9)
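
Concatenating every chunk, as above, still ends with the full DataFrame in memory. When only an aggregate is needed, each chunk can be processed and then discarded instead. The following is a minimal sketch of that idea and is not part of the original notebook: the in-memory `sample_csv` rows and the `count_event_types` helper are hypothetical stand-ins, and only the column names are taken from the dataset.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical in-memory sample standing in for the real CSV file
sample_csv = io.StringIO(
    "event_time,event_type,price\n"
    "2019-10-01 00:00:00 UTC,view,10.0\n"
    "2019-10-01 00:00:01 UTC,cart,20.0\n"
    "2019-10-01 00:00:02 UTC,view,30.0\n"
    "2019-10-01 00:00:03 UTC,purchase,40.0\n"
    "2019-10-01 00:00:04 UTC,view,50.0\n"
)


def count_event_types(source, chunksize):
    """Count event_type values one chunk at a time (hypothetical helper)."""
    totals = Counter()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Only the current chunk is held in memory; its counts are merged into totals
        totals.update(chunk["event_type"].value_counts().to_dict())
    return totals


event_counts = count_event_types(sample_csv, chunksize=2)
print(dict(event_counts))
```

For the real file, the same loop could be pointed at `'../cBADatasets/2019-Oct.csv'` with a larger `chunksize`.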

### Using Dask

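Dask is an open-source Python library that adds parallelism and scalability to existing libraries such as pandas. See the [Dask documentation](https://docs.dask.org/en/stable/) for more.

Note that `dd.read_csv` is lazy: it sets up partitions and a task graph rather than loading the whole file, and `tail()` computes only the last partition. The timing below therefore does not measure a full read, and the index shown restarts within that partition.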

In [6]:
from dask import dataframe as dd

In [7]:
# Using dask
s_time_dask = time.time()
dask_df = dd.read_csv('../cBADatasets/2019-Oct.csv')
display(dask_df.tail())
e_time_dask = time.time()

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 483203 | 2019-10-31 23:59:58 UTC | view | 2300275 | 2053013560530830019 | electronics.camera.video | gopro | 527.4 | 537931532 | 22c57267-da98-4f28-9a9c-18bb5b385193 |
| 483204 | 2019-10-31 23:59:58 UTC | view | 10800172 | 2053013554994348409 | | redmond | 61.75 | 527322328 | 5054190a-46cb-4211-a8f1-16fc1a060ed8 |
| 483205 | 2019-10-31 23:59:58 UTC | view | 5701038 | 2053013553970938175 | auto.accessories.player | kenwood | 128.7 | 566280422 | 05b6c62b-992f-4e8e-91f7-961bcb4719cd |
| 483206 | 2019-10-31 23:59:59 UTC | view | 21407424 | 2053013561579406073 | electronics.clocks | tissot | 689.85 | 513118352 | 4c14bf2a-2820-4504-929d-046356a5a204 |
| 483207 | 2019-10-31 23:59:59 UTC | view | 13300120 | 2053013557166998015 | | swisshome | 155.73 | 525266378 | 6e57d2d7-6022-46e6-81d6-fa77f14cefd8 |


In [8]:
print("Read with dask: ", (e_time_dask - s_time_dask), "seconds")

Read with dask:  0.5862700939178467 seconds


In [9]:
# Print the Dask DataFrame shape: the row count is lazy and needs compute(),
# while the column count is known up front
dask_df_shape = dask_df.shape
print(dask_df_shape[0].compute(), dask_df_shape[1])

42448764 9


### Using Pandas engine

In [10]:
# Start the timer
s_time_engine = time.time()
# Read the file with the C parser engine set explicitly (it is also the pandas default)
df_engine_c = pd.read_csv("../cBADatasets/2019-Oct.csv", engine='c')
display(df_engine_c.tail())
e_time_engine = time.time()

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 42448759 | 2019-10-31 23:59:58 UTC | view | 2300275 | 2053013560530830019 | electronics.camera.video | gopro | 527.4 | 537931532 | 22c57267-da98-4f28-9a9c-18bb5b385193 |
| 42448760 | 2019-10-31 23:59:58 UTC | view | 10800172 | 2053013554994348409 | | redmond | 61.75 | 527322328 | 5054190a-46cb-4211-a8f1-16fc1a060ed8 |
| 42448761 | 2019-10-31 23:59:58 UTC | view | 5701038 | 2053013553970938175 | auto.accessories.player | kenwood | 128.7 | 566280422 | 05b6c62b-992f-4e8e-91f7-961bcb4719cd |
| 42448762 | 2019-10-31 23:59:59 UTC | view | 21407424 | 2053013561579406073 | electronics.clocks | tissot | 689.85 | 513118352 | 4c14bf2a-2820-4504-929d-046356a5a204 |
| 42448763 | 2019-10-31 23:59:59 UTC | view | 13300120 | 2053013557166998015 | | swisshome | 155.73 | 525266378 | 6e57d2d7-6022-46e6-81d6-fa77f14cefd8 |


In [11]:
print("Read with Engine: ", (e_time_engine - s_time_engine), "seconds")

Read with Engine:  49.170133113861084 seconds


In [12]:
df_engine_c.shape

(42448764, 9)
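
The `engine` parameter selects the parser that `read_csv` uses. As a small illustration, not taken from the original notebook, the sketch below reads the same hypothetical in-memory sample with `engine='c'`, with `engine='python'`, and with no `engine` argument, then checks that the resulting DataFrames match. The sample rows reuse `product_id` values from the output above; everything else is assumed for the example.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for the real CSV file
csv_text = (
    "event_type,product_id\n"
    "view,2300275\n"
    "view,10800172\n"
    "cart,5701038\n"
)

# Each call gets a fresh StringIO because a buffer can only be read once
df_c = pd.read_csv(io.StringIO(csv_text), engine="c")
df_python = pd.read_csv(io.StringIO(csv_text), engine="python")
df_default = pd.read_csv(io.StringIO(csv_text))

# The parsers differ in speed and supported options, not in the parsed result here
same_as_python = df_c.equals(df_python)
same_as_default = df_c.equals(df_default)
print(same_as_python, same_as_default)
```

Since the C engine is what pandas uses by default for a call like this, similar timings for the explicit `engine='c'` read and a plain `read_csv` call are to be expected.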

### Using Pandas only

In [13]:
# Start the timer
s_time = time.time()
# Read the file with the default pandas settings
df = pd.read_csv("../cBADatasets/2019-Oct.csv")
display(df.tail())
e_time = time.time()

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 42448759 | 2019-10-31 23:59:58 UTC | view | 2300275 | 2053013560530830019 | electronics.camera.video | gopro | 527.4 | 537931532 | 22c57267-da98-4f28-9a9c-18bb5b385193 |
| 42448760 | 2019-10-31 23:59:58 UTC | view | 10800172 | 2053013554994348409 | | redmond | 61.75 | 527322328 | 5054190a-46cb-4211-a8f1-16fc1a060ed8 |
| 42448761 | 2019-10-31 23:59:58 UTC | view | 5701038 | 2053013553970938175 | auto.accessories.player | kenwood | 128.7 | 566280422 | 05b6c62b-992f-4e8e-91f7-961bcb4719cd |
| 42448762 | 2019-10-31 23:59:59 UTC | view | 21407424 | 2053013561579406073 | electronics.clocks | tissot | 689.85 | 513118352 | 4c14bf2a-2820-4504-929d-046356a5a204 |
| 42448763 | 2019-10-31 23:59:59 UTC | view | 13300120 | 2053013557166998015 | | swisshome | 155.73 | 525266378 | 6e57d2d7-6022-46e6-81d6-fa77f14cefd8 |


In [14]:
print("Read Only: ", (e_time - s_time), "seconds")

Read Only:  53.27710461616516 seconds
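
Each timing cell above repeats the same start and stop pattern around a read. One way to keep such measurements consistent is a small wrapper that times any callable. The sketch below is a suggestion rather than part of the original notebook: the `timed` helper and the in-memory sample are hypothetical, and it uses `time.perf_counter`, which is intended for measuring short durations, in place of `time.time`.

```python
import io
import time

import pandas as pd


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical in-memory sample standing in for the real CSV file
csv_text = "event_type,product_id\n" + "view,2300275\n" * 1000

df_sample, elapsed = timed(pd.read_csv, io.StringIO(csv_text))
print(f"Read {len(df_sample)} rows in {elapsed:.4f} seconds")
```

Keeping `display` outside the timed call would also ensure that only the read itself is measured.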
