# Customer Behavior Analysis

## What is customer behavior?
Customer behavior is the set of decisions and instincts that lead a customer to buy a particular product or service.

## The dataset
The dataset used in this project was collected from an e-commerce store that sells products in multiple categories. It covers only October and November **2019**. The dataset description can be found [here](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store).


In [1]:
import pandas as pd

## Working with large CSV files in Python

In [2]:
import time

### Using Pandas chunksize

In [3]:
# Start the timer
s_time_chunk = time.time()

# With chunksize, read_csv returns an iterator that yields DataFrames of 10,000 rows each
chunk = pd.read_csv('../cBADatasets/2019-Oct.csv', chunksize=10000)

# Concatenate all chunks into a single DataFrame
df_chunk = pd.concat(chunk)
display(df_chunk.tail())

e_time_chunk = time.time()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
42448759,2019-10-31 23:59:58 UTC,view,2300275,2053013560530830019,electronics.camera.video,gopro,527.4,537931532,22c57267-da98-4f28-9a9c-18bb5b385193
42448760,2019-10-31 23:59:58 UTC,view,10800172,2053013554994348409,,redmond,61.75,527322328,5054190a-46cb-4211-a8f1-16fc1a060ed8
42448761,2019-10-31 23:59:58 UTC,view,5701038,2053013553970938175,auto.accessories.player,kenwood,128.7,566280422,05b6c62b-992f-4e8e-91f7-961bcb4719cd
42448762,2019-10-31 23:59:59 UTC,view,21407424,2053013561579406073,electronics.clocks,tissot,689.85,513118352,4c14bf2a-2820-4504-929d-046356a5a204
42448763,2019-10-31 23:59:59 UTC,view,13300120,2053013557166998015,,swisshome,155.73,525266378,6e57d2d7-6022-46e6-81d6-fa77f14cefd8


In [4]:
print("With chunks: ", e_time_chunk - s_time_chunk, "seconds")

With chunks:  49.85872507095337 seconds


In [5]:
df_chunk.shape

(42448764, 9)
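
Concatenating every chunk rebuilds the full DataFrame in memory, so the chunked read above does not reduce peak memory compared with a plain read. When only an aggregate is needed, each chunk can be reduced as it arrives instead. The following is a minimal sketch of that idea, assuming a tiny in-memory CSV in place of the real file; the helper name `count_event_types` and the sample rows are illustrative and not part of the original notebook.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory stand-in for the real CSV file (illustrative data only)
sample_csv = io.StringIO(
    "event_time,event_type,brand,price\n"
    "2019-10-01 00:00:00 UTC,view,apple,100.0\n"
    "2019-10-01 00:00:01 UTC,purchase,apple,100.0\n"
    "2019-10-01 00:00:02 UTC,view,,20.0\n"
    "2019-10-01 00:00:03 UTC,cart,samsung,50.0\n"
    "2019-10-01 00:00:04 UTC,view,samsung,50.0\n"
)


def count_event_types(source, chunksize):
    """Count rows per event_type while holding only one chunk in memory."""
    totals = Counter()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Reduce the chunk to per-type counts, then merge into the running total
        totals.update(chunk["event_type"].value_counts().to_dict())
    return dict(totals)


event_counts = count_event_types(sample_csv, chunksize=2)
print(event_counts)
```

The same loop shape works for any reduction that can be merged across chunks, such as sums or counts.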

### Using dask 

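Dask is an open-source Python library that adds parallelism and scalability on top of existing libraries such as pandas. Note that `dd.read_csv` is lazy: it does not load the file up front, and `tail()` reads only the last partition, so the timing below is not directly comparable with the pandas timings, and the index shown in the output is local to that partition. More about [dask](https://docs.dask.org/en/stable/).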

In [6]:
from dask import dataframe as dd

In [7]:
# Using dask
s_time_dask = time.time()
dask_df = dd.read_csv('../cBADatasets/2019-Oct.csv')
display(dask_df.tail())
e_time_dask = time.time()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
483203,2019-10-31 23:59:58 UTC,view,2300275,2053013560530830019,electronics.camera.video,gopro,527.4,537931532,22c57267-da98-4f28-9a9c-18bb5b385193
483204,2019-10-31 23:59:58 UTC,view,10800172,2053013554994348409,,redmond,61.75,527322328,5054190a-46cb-4211-a8f1-16fc1a060ed8
483205,2019-10-31 23:59:58 UTC,view,5701038,2053013553970938175,auto.accessories.player,kenwood,128.7,566280422,05b6c62b-992f-4e8e-91f7-961bcb4719cd
483206,2019-10-31 23:59:59 UTC,view,21407424,2053013561579406073,electronics.clocks,tissot,689.85,513118352,4c14bf2a-2820-4504-929d-046356a5a204
483207,2019-10-31 23:59:59 UTC,view,13300120,2053013557166998015,,swisshome,155.73,525266378,6e57d2d7-6022-46e6-81d6-fa77f14cefd8


In [8]:
print("Read with dask: ", (e_time_dask - s_time_dask), "seconds")

Read with dask:  0.5862700939178467 seconds


In [9]:
# Print dask dataframe shape
dask_df_shape = dask_df.shape
print(dask_df_shape[0].compute(), dask_df_shape[1])

42448764 9


### Using Pandas engine

In [10]:
# Time taken to read data
s_time_engine = time.time()
# Read the file with the C parser engine (this is also pandas' default engine)
df_engine_c = pd.read_csv("../cBADatasets/2019-Oct.csv", engine='c')
display(df_engine_c.tail())
e_time_engine = time.time()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
42448759,2019-10-31 23:59:58 UTC,view,2300275,2053013560530830019,electronics.camera.video,gopro,527.4,537931532,22c57267-da98-4f28-9a9c-18bb5b385193
42448760,2019-10-31 23:59:58 UTC,view,10800172,2053013554994348409,,redmond,61.75,527322328,5054190a-46cb-4211-a8f1-16fc1a060ed8
42448761,2019-10-31 23:59:58 UTC,view,5701038,2053013553970938175,auto.accessories.player,kenwood,128.7,566280422,05b6c62b-992f-4e8e-91f7-961bcb4719cd
42448762,2019-10-31 23:59:59 UTC,view,21407424,2053013561579406073,electronics.clocks,tissot,689.85,513118352,4c14bf2a-2820-4504-929d-046356a5a204
42448763,2019-10-31 23:59:59 UTC,view,13300120,2053013557166998015,,swisshome,155.73,525266378,6e57d2d7-6022-46e6-81d6-fa77f14cefd8


In [11]:
print("Read with Engine: ", (e_time_engine - s_time_engine), "seconds")

Read with Engine:  49.170133113861084 seconds


In [12]:
df_engine_c.shape

(42448764, 9)

### Using Pandas only

In [13]:
# Time taken to read data
s_time = time.time()
# Reading the data from file
df = pd.read_csv("../cBADatasets/2019-Oct.csv") 
display(df.tail())
e_time = time.time()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
42448759,2019-10-31 23:59:58 UTC,view,2300275,2053013560530830019,electronics.camera.video,gopro,527.4,537931532,22c57267-da98-4f28-9a9c-18bb5b385193
42448760,2019-10-31 23:59:58 UTC,view,10800172,2053013554994348409,,redmond,61.75,527322328,5054190a-46cb-4211-a8f1-16fc1a060ed8
42448761,2019-10-31 23:59:58 UTC,view,5701038,2053013553970938175,auto.accessories.player,kenwood,128.7,566280422,05b6c62b-992f-4e8e-91f7-961bcb4719cd
42448762,2019-10-31 23:59:59 UTC,view,21407424,2053013561579406073,electronics.clocks,tissot,689.85,513118352,4c14bf2a-2820-4504-929d-046356a5a204
42448763,2019-10-31 23:59:59 UTC,view,13300120,2053013557166998015,,swisshome,155.73,525266378,6e57d2d7-6022-46e6-81d6-fa77f14cefd8


In [14]:
print("Read Only: ", (e_time - s_time), "seconds")

Read Only:  53.27710461616516 seconds
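
The measurements above all repeat the same start/stop pattern, and each one also includes the time spent in `display`. A minimal sketch of a reusable helper that times only the read itself is shown here. The function name `timed_read` and the in-memory sample are assumptions made for illustration; `time.perf_counter` is used because it is intended for measuring short durations.

```python
import io
import time

import pandas as pd


def timed_read(reader, *args, **kwargs):
    """Call a reader function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = reader(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Illustrative in-memory data instead of the real 2019-Oct.csv file
sample = "event_type,price\nview,10.0\npurchase,20.0\n"
df_sample, seconds = timed_read(pd.read_csv, io.StringIO(sample))
print(f"Read {len(df_sample)} rows in {seconds:.4f} seconds")
```

With the real file, the call would take the file path and any extra keyword arguments (for example `engine='c'`) in place of the in-memory buffer.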


### Exploring the data
The data files for *October* and *November* are both very large, so this exercise uses only the October data. Let’s review which data is stored and in what format.

According to the output, there are nine columns in the **DataFrame**, which are described below:

- `event_time`: The exact time at which the user's activity occurred

- `event_type`: The type of activity; in our case there are three types: view, cart, and purchase

- `product_id`: The unique ID of a particular product

- `category_id`: The unique ID of the category to which the product belongs

- `category_code`: The code of the category to which the product belongs

- `brand`: The brand name of the selected product

- `price`: The price of the selected product

- `user_id`: The unique ID of the user

- `user_session`: The unique ID generated every time a user visits the site; it differs for every visit by the same user
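
Given the column descriptions above, several columns hold a small set of repeated values, so declaring their types at read time may reduce memory use on a file of this size. The following is a sketch under stated assumptions: the dtype choices in `column_types` are illustrative rather than the notebook's own, and the two sample rows are copied from the outputs shown in this document.

```python
import io

import pandas as pd

# Two rows taken from the notebook output, used as an in-memory stand-in
sample_csv = io.StringIO(
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session\n"
    "2019-10-01 00:02:14 UTC,purchase,1004856,2053013555631882655,"
    "electronics.smartphone,samsung,130.76,543272936,8187d148-3c41-46d4-b0c0-9c08cd9dc564\n"
    "2019-10-01 02:19:59 UTC,purchase,28100119,2053013564918072245,"
    ",,153.16,517953667,7954f58c-158d-402d-9820-c502a5eea86d\n"
)

# Assumed dtype choices: repeated strings become categoricals
column_types = {
    "event_type": "category",
    "category_code": "category",
    "brand": "category",
    "price": "float64",
}

typed_df = pd.read_csv(sample_csv, dtype=column_types)

# Parse the timestamp strings, which end in a literal "UTC" suffix
typed_df["event_time"] = pd.to_datetime(
    typed_df["event_time"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)

print(typed_df.dtypes)
```

Missing brands and category codes remain `NaN` in the categorical columns, so checks such as `notna()` and `isna()` behave as before.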

## Brand analysis
A *brand* is a name that differentiates one product from another. In this analysis, we will review whether people prefer to purchase branded products or products without a brand.

Only the products that users actually bought will be considered. In our dataset, products that have no brand are given a `NaN` value in the `brand` column.
The analysis is done in two steps:

1. Split the original `DataFrame` into two DataFrames: one with all the products that have a brand and one with all the products that do not.

2. From each of the two DataFrames, keep only the rows where the `event_type` value is `purchase`.

The final result is two DataFrames: one containing the purchased products with a brand and one containing the purchased products without a brand.

#### Step 1

In [15]:
# Fetch rows with brand
with_brand = df[df['brand'].notna()]

# Fetch rows without brand
without_brand = df[df['brand'].isna()]


#### Step 2

In [17]:
# Purchased products with brands
with_brand = with_brand[with_brand['event_type'] == 'purchase']
with_brand

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
162,2019-10-01 00:02:14 UTC,purchase,1004856,2053013555631882655,electronics.smartphone,samsung,130.76,543272936,8187d148-3c41-46d4-b0c0-9c08cd9dc564
308,2019-10-01 00:04:37 UTC,purchase,1002532,2053013555631882655,electronics.smartphone,apple,642.69,551377651,3c80f0d6-e9ec-4181-8c5c-837a30be2d68
379,2019-10-01 00:06:02 UTC,purchase,5100816,2053013553375346967,,xiaomi,29.51,514591159,0e5dfc4b-2a55-43e6-8c05-97e1f07fbb56
442,2019-10-01 00:07:07 UTC,purchase,13800054,2053013557418656265,furniture.bathroom.toilet,santeri,54.42,555332717,1dea3ee2-2ded-42e8-8e7a-4e2ad6ae942f
574,2019-10-01 00:09:26 UTC,purchase,4804055,2053013554658804075,electronics.audio.headphone,apple,189.91,524601178,2af9b570-0942-4dcd-8f25-4d84fba82553
...,...,...,...,...,...,...,...,...,...
42447959,2019-10-31 23:53:53 UTC,purchase,1004767,2053013555631882655,electronics.smartphone,samsung,242.63,542774966,957dc70c-31d3-42b7-aef0-2d2827c35251
42448173,2019-10-31 23:55:21 UTC,purchase,47500017,2110937143172923797,construction.tools.light,puckator,20.59,514622109,5724116e-365b-4ac1-9d03-b8d66e1ccc7c
42448271,2019-10-31 23:56:03 UTC,purchase,1003306,2053013555631882655,electronics.smartphone,apple,577.89,512717356,f35ac37c-9573-4e30-b3d9-c09bb0b95a2b
42448362,2019-10-31 23:56:53 UTC,purchase,1004240,2053013555631882655,electronics.smartphone,apple,1054.60,533892594,3a5a3b01-2ab1-4a1d-a202-30d336e0057b


In [18]:
# Purchased products without brands
without_brand = without_brand[without_brand['event_type'] == 'purchase']
without_brand

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
1760,2019-10-01 02:19:59 UTC,purchase,28100119,2053013564918072245,,,153.16,517953667,7954f58c-158d-402d-9820-c502a5eea86d
1884,2019-10-01 02:20:28 UTC,purchase,26601440,2053013563517174627,,,91.12,554101070,5a29c401-c05b-4dcf-b9e6-78324875dfd4
2321,2019-10-01 02:21:45 UTC,purchase,28100000,2053013564918072245,,,60.49,517953667,7954f58c-158d-402d-9820-c502a5eea86d
2778,2019-10-01 02:23:03 UTC,purchase,19100075,2053013556227473861,construction.tools.saw,,120.47,513484630,92bc0a54-4dab-4748-9a39-edbb4c760254
3978,2019-10-01 02:26:02 UTC,purchase,34800175,2062461754293617058,,,33.46,512594464,f18609cf-7cab-47cf-aaf8-8622202722bd
...,...,...,...,...,...,...,...,...,...
42446989,2019-10-31 23:45:46 UTC,purchase,26500442,2053013563550729061,,,115.58,513911691,e6e67023-2258-4341-bd66-a9500d7da596
42447626,2019-10-31 23:51:02 UTC,purchase,15600016,2053013559767466645,,,419.55,542728394,e62b779b-d70e-4468-8f35-4bcf6879e471
42447890,2019-10-31 23:53:18 UTC,purchase,15600016,2053013559767466645,,,419.55,542728394,e62b779b-d70e-4468-8f35-4bcf6879e471
42448049,2019-10-31 23:54:33 UTC,purchase,26205284,2053013563693335403,,,143.89,513040838,639dc99e-72cd-433d-ad90-24c78d71418f


In the output, we can see that the products have been correctly filtered, and two DataFrames have been obtained: one with purchased branded products and the other with purchased non-branded products.

Let’s review what percentage of the purchases were branded products and what percentage were non-branded products.

In [19]:
# Total number of purchase events in the original DataFrame
org = len(df[df['event_type'] == 'purchase'])


In [28]:
# Divide the number of branded purchases by the total number of purchases
brand_p = len(with_brand) / org
print('Brand products purchase =', brand_p * 100, '%')

Brand products purchase = 92.15116396468193 %


In [29]:
# Divide the number of non-branded purchases by the total number of purchases
brand_a = len(without_brand) / org
print('Without Brand products purchase =', brand_a * 100, '%')

Without Brand products purchase = 7.8488360353180795 %


According to the above output, approximately **92%** of the purchases were of branded products, and only about **8%** were of products without a brand.
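
The two shares can also be computed in one step, because the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a small made-up frame; the `events` data and the variable names are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Small made-up event log with the two columns the calculation needs
events = pd.DataFrame({
    "event_type": ["purchase", "view", "purchase", "purchase", "cart", "purchase"],
    "brand": ["apple", "samsung", None, "xiaomi", None, "samsung"],
})

# Keep purchases only, then take the share of rows that have a brand
purchases = events[events["event_type"] == "purchase"]
branded_share = purchases["brand"].notna().mean()
unbranded_share = 1 - branded_share

print(f"Branded: {branded_share:.1%}, unbranded: {unbranded_share:.1%}")
```

Filtering for purchases first and splitting on `brand` afterwards gives the same result as the two-step approach above, with a single pass over the purchase rows.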

### The hypothesis
Two hypotheses can be drawn from the above results:

- For marketers: most of the marketing budget should be allotted to advertising branded products.

- For inventors or entrepreneurs: always introduce a product under a brand name, because products without a brand account for only a small share of purchases.