
### Objective
1. Identify customer abandonment pages with log data

In [1]:
import pandas as pd
import numpy as np

#### Web server log data
 - A web server log is a file that records information about each request (IP, time, page visited, etc.) that the server handles for a client.
 - Standard log formats exist, but the recorded format can be changed in the server settings.
 - Log data is mainly used for debugging the web server and for data analysis.
 - Format used in this project:

   - IP, session ID, user identifier, timestamp, request method, requested page, status code, byte size
   ```
   1.0.0.1 sessionid user59 [16/Dec/2019T02:00:08] GET /checkout 200 1508
   ```
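To make the field layout concrete, here is a minimal sketch that splits one line in this format into named fields. The helper `parse_log_line` and the `FIELDS` list are hypothetical names, not part of the notebook, and the sketch assumes fields are separated by whitespace and never contain a space themselves. The sample line is the first row of the project's log file.

```python
# Hypothetical helper (not part of the original notebook): split one log line
# into named fields, assuming whitespace-separated fields with no embedded spaces.
FIELDS = ['ip', 'session_id', 'user_id', 'datetime', 'request', 'url', 'status', 'bytesize']

def parse_log_line(line):
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f'expected {len(FIELDS)} fields, got {len(values)}')
    record = dict(zip(FIELDS, values))
    # Status code and byte size are numeric; every other field stays a string.
    record['status'] = int(record['status'])
    record['bytesize'] = int(record['bytesize'])
    return record

sample = ('4.5.4.5 69de169f-6eed-4e4d-ae5b-ff997b8c889f user89 '
          '[01/Dec/2019T00:47:11] GET /product_list 200 2107')
record = parse_log_line(sample)
print(record['url'], record['status'], record['bytesize'])
```

In the notebook itself the same split is done for the whole file at once by `pd.read_csv`.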

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
logs = pd.read_csv('/content/drive/MyDrive/Data_Project/web.log',
                   sep=r'\s+',  # fields are separated by whitespace
                   engine='python',
                   names=['ip', 'session_id', 'user_id', 'datetime', 'request', 'url', 'status', 'bytesize'])

logs.head()

| | ip | session_id | user_id | datetime | request | url | status | bytesize |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.5.4.5 | 69de169f-6eed-4e4d-ae5b-ff997b8c889f | user89 | [01/Dec/2019T00:47:11] | GET | /product_list | 200 | 2107 |
| 1 | 4.5.4.5 | 69de169f-6eed-4e4d-ae5b-ff997b8c889f | user89 | [01/Dec/2019T00:51:21] | GET | /product_detail | 200 | 1323 |
| 2 | 3.3.3.3. | 3d46aad9-17eb-4af1-bc54-6ca91d7f8f6c | user2 | [01/Dec/2019T00:51:43] | GET | /product_list | 200 | 2616 |
| 3 | 1.0.1.0 | 57623182-b78b-4bdc-b977-a2b34612c6d1 | user45 | [01/Dec/2019T01:04:02] | GET | /product_list | 200 | 2303 |
| 4 | 3.3.3.3. | 3d46aad9-17eb-4af1-bc54-6ca91d7f8f6c | user2 | [01/Dec/2019T01:12:28] | GET | /product_detail | 200 | 1830 |


In [5]:
logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1290 entries, 0 to 1289
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   ip          1290 non-null   object
 1   session_id  1290 non-null   object
 2   user_id     1290 non-null   object
 3   datetime    1290 non-null   object
 4   request     1290 non-null   object
 5   url         1290 non-null   object
 6   status      1290 non-null   int64 
 7   bytesize    1290 non-null   int64 
dtypes: int64(2), object(6)
memory usage: 80.8+ KB


#### Which pages do customers abandon?
 - Once you know which pages customers abandon, you can analyze those pages and work on moving more customers through to the final step.
 - Abandonment usually happens where the barrier to the next step is high (credit card entry, information entry, etc.).

#### Convert date format
 - Strip the square brackets from the raw timestamp and parse it into a datetime.

In [8]:
# Raw value looks like [01/Dec/2019T00:47:11]: strip the brackets, then parse.
logs['datetime'] = logs['datetime'].str.strip('[]')
logs['datetime'] = pd.to_datetime(logs['datetime'], format='%d/%b/%YT%H:%M:%S')

logs.head()

| | ip | session_id | user_id | datetime | request | url | status | bytesize |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.5.4.5 | 69de169f-6eed-4e4d-ae5b-ff997b8c889f | user89 | 2019-12-01 00:47:11 | GET | /product_list | 200 | 2107 |
| 1 | 4.5.4.5 | 69de169f-6eed-4e4d-ae5b-ff997b8c889f | user89 | 2019-12-01 00:51:21 | GET | /product_detail | 200 | 1323 |
| 2 | 3.3.3.3. | 3d46aad9-17eb-4af1-bc54-6ca91d7f8f6c | user2 | 2019-12-01 00:51:43 | GET | /product_list | 200 | 2616 |
| 3 | 1.0.1.0 | 57623182-b78b-4bdc-b977-a2b34612c6d1 | user45 | 2019-12-01 01:04:02 | GET | /product_list | 200 | 2303 |
| 4 | 3.3.3.3. | 3d46aad9-17eb-4af1-bc54-6ca91d7f8f6c | user2 | 2019-12-01 01:12:28 | GET | /product_detail | 200 | 1830 |
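As a small self-contained check of the format string, the sketch below parses two raw timestamps copied from the log preview and subtracts them. The variable names `raw`, `parsed` and `elapsed` are illustrative only. Note that `%b` matches English month abbreviations such as `Dec` under the default locale.

```python
import pandas as pd

# Two raw values in the same shape as the unparsed 'datetime' column.
raw = pd.Series(['[01/Dec/2019T00:47:11]', '[01/Dec/2019T01:12:28]'])

# Strip the surrounding brackets, then parse with an explicit format string.
parsed = pd.to_datetime(raw.str.strip('[]'), format='%d/%b/%YT%H:%M:%S')

# Once parsed, the values support datetime arithmetic directly.
elapsed = parsed.iloc[1] - parsed.iloc[0]
print(parsed.dtype, elapsed)
```

Having real datetimes (rather than strings) is what makes the later "earliest visit" and "average time" steps possible.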


#### Create a funnel step dataframe
 - Used to define the order of the funnel steps.

In [10]:
funnel_dict = {'/product_list': 1, '/product_detail': 2, '/cart': 3, '/order_complete': 4}
funnel_steps = pd.DataFrame.from_dict(funnel_dict, orient='index', columns=['step_no'])
funnel_steps

| | step_no |
|---|---|
| /product_list | 1 |
| /product_detail | 2 |
| /cart | 3 |
| /order_complete | 4 |


#### Group by session and URL
 - Sessions are used instead of user_id because the same user arriving in different sessions should be treated as separate cases.
 - Group by session_id and url, and keep the earliest timestamp at which each session requested each page.

In [17]:
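# Earliest request time per (session, page).
grouped = logs.groupby(['session_id', 'url'])['datetime'].min()
# Attach the funnel step number; the default inner join drops pages that are not funnel steps.
grouped = grouped.to_frame().merge(funnel_steps, left_on='url', right_index=True)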

grouped.tail()

| session_id | url | datetime | step_no |
|---|---|---|---|
| ed374836-99eb-4e31-8b0d-40e39d38bd54 | /order_complete | 2019-12-08 03:42:01 | 4 |
| ef2c3b91-b701-4d46-85ac-96607f0fccc1 | /order_complete | 2019-12-16 05:48:56 | 4 |
| f25e918d-f47e-4704-a923-19f1e106f618 | /order_complete | 2019-12-18 07:36:20 | 4 |
| f8010232-b6c0-4364-9e9a-f8cc88588ebb | /order_complete | 2019-12-06 12:30:47 | 4 |
| f93ce85d-b7e6-4619-9756-6a7876a25520 | /order_complete | 2019-12-07 10:40:03 | 4 |
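The effect of taking the minimum per (session, page) is easier to see on a tiny example. The sketch below uses made-up data (the frame `toy` and its values are assumptions for illustration, not rows from the log): session `a` requests `/product_list` twice, and only the earlier request survives the grouping.

```python
import pandas as pd

# Made-up log: session 'a' views /product_list twice, out of chronological order.
toy = pd.DataFrame({
    'session_id': ['a', 'a', 'a', 'b'],
    'url': ['/product_list', '/product_list', '/product_detail', '/product_list'],
    'datetime': pd.to_datetime([
        '2019-12-01 00:10:00',
        '2019-12-01 00:05:00',
        '2019-12-01 00:20:00',
        '2019-12-01 01:00:00',
    ]),
})

# One row per (session, page), holding the earliest request time.
first_visit = toy.groupby(['session_id', 'url'])['datetime'].min()
print(first_visit)
```

Four log rows collapse into three (session, page) pairs, which is the shape the funnel table needs.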


#### Create a funnel table
 - Pivot the table so that each funnel step becomes a column, ordered by step number.

In [19]:
funnel = grouped.reset_index().pivot(index='session_id', columns='step_no', values='datetime')
funnel.columns = funnel_steps.index
funnel.head()

| session_id | /product_list | /product_detail | /cart | /order_complete |
|---|---|---|---|---|
| 000d99d8-d2d4-4e9a-bb06-69b1ae6442d9 | 2019-12-01 11:52:32 | 2019-12-01 12:06:39 | NaT | NaT |
| 0155049d-32e7-44de-9b0d-4c02f63d6099 | 2019-12-04 00:12:47 | 2019-12-04 00:22:44 | NaT | NaT |
| 020d4536-1341-4de1-87d3-e22ba8611af6 | 2019-12-19 06:22:54 | 2019-12-19 06:25:48 | 2019-12-19 06:58:23 | NaT |
| 0381411a-78d8-4c27-9622-3210b7ed62d6 | 2019-12-05 04:48:34 | 2019-12-05 05:09:32 | 2019-12-05 05:35:16 | NaT |
| 06268108-6228-4237-ac1d-7927dd44273d | 2019-12-11 04:15:46 | 2019-12-11 04:17:31 | 2019-12-11 04:45:05 | NaT |
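The long-to-wide reshaping can be sketched on made-up data as follows. The names `long_form`, `wide` and `step_names` are illustrative assumptions. Instead of overwriting the column labels positionally, this variant renames them through a mapping, which may be a safer choice if some step never appears in the data.

```python
import pandas as pd

# Made-up long-format table: one row per (session, step) with the first visit time.
long_form = pd.DataFrame({
    'session_id': ['a', 'a', 'b'],
    'step_no': [1, 2, 1],
    'datetime': pd.to_datetime([
        '2019-12-01 00:05:00',
        '2019-12-01 00:20:00',
        '2019-12-01 01:00:00',
    ]),
})

# One row per session, one column per step; unreached steps become NaT.
wide = long_form.pivot(index='session_id', columns='step_no', values='datetime')

# Rename step numbers to page names via an explicit mapping.
step_names = {1: '/product_list', 2: '/product_detail'}
wide = wide.rename(columns=step_names)
print(wide)
```

Session `b` never reached the second step, so its `/product_detail` cell is `NaT`, exactly like the `NaT` entries in the funnel table above.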


#### Count sessions at each funnel step
 - Count how many sessions reached each funnel step.

In [21]:
step_values = [funnel[index].notnull().sum() for index in funnel_steps.index]
step_values

[419, 351, 261, 84]
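The raw counts become easier to interpret as step-to-step continuation rates. The sketch below is one possible way to compute them from the counts printed above; the names `steps`, `counts`, `conversion` and `worst` are illustrative and do not appear in the notebook.

```python
steps = ['/product_list', '/product_detail', '/cart', '/order_complete']
counts = [419, 351, 261, 84]  # the step_values printed above

# Share of sessions that continue from each step to the next one.
conversion = {
    f'{steps[i]} -> {steps[i + 1]}': counts[i + 1] / counts[i]
    for i in range(len(counts) - 1)
}

for transition, rate in conversion.items():
    print(f'{transition}: {rate:.1%} continue, {1 - rate:.1%} drop off')

# The transition with the lowest continuation rate is the largest abandonment point.
worst = min(conversion, key=conversion.get)
print('Largest drop-off:', worst)
```

With these counts, the sketch flags `/cart -> /order_complete` as the transition where the largest share of sessions is lost.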

In [24]:
def show_funnel(funnel_values, funnel_steps):
  from plotly import graph_objects as go

  fig = go.Figure(go.Funnel(
      y = funnel_steps,
      x = funnel_values
  ))

  fig.show()

In [25]:
show_funnel(step_values, funnel_steps.index)

#### Calculate average time
 - Calculate the average time a session takes to move from one funnel step to the next.

In [26]:
np.mean(funnel['/product_detail'] - funnel['/product_list'])

Timedelta('0 days 00:16:50.635327635')

In [27]:
np.mean(funnel['/cart'] - funnel['/product_detail'])

Timedelta('0 days 00:18:42.804597701')

In [28]:
np.mean(funnel['/order_complete'] - funnel['/cart'])

Timedelta('0 days 00:33:35.904761904')
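The three cells above repeat the same calculation for each pair of adjacent steps. A loop over consecutive columns is one way to generalize it; the sketch below runs on a made-up funnel table (`toy_funnel` and `avg_time` are illustrative names), so it does not reproduce the figures above.

```python
import pandas as pd

# Made-up funnel table: columns are steps, values are first-visit times (NaT = never reached).
toy_funnel = pd.DataFrame({
    '/product_list': pd.to_datetime(['2019-12-01 00:00:00', '2019-12-01 01:00:00']),
    '/product_detail': pd.to_datetime(['2019-12-01 00:10:00', '2019-12-01 01:20:00']),
    '/cart': pd.to_datetime(['2019-12-01 00:40:00', None]),
})

steps = list(toy_funnel.columns)
avg_time = {}
for prev_step, next_step in zip(steps, steps[1:]):
    # A NaT in either column yields NaT, which mean() skips by default.
    duration = toy_funnel[next_step] - toy_funnel[prev_step]
    avg_time[f'{prev_step} -> {next_step}'] = duration.mean()

print(pd.Series(avg_time))
```

Because sessions that never reached the later step are skipped, each average describes only the sessions that completed that transition. A session that visits a later page before an earlier one would contribute a negative duration, so it may be worth checking for such cases before relying on the mean.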