## Load the libraries and the dataset

Only two libraries are needed: `pandas` and `os`. The plotting methods in `pandas` are built on `matplotlib`, so there is no need to import `matplotlib` explicitly.

The dataset is a CSV file stored in the same folder as the notebook, so it is loaded with the `read_csv` function from `pandas`.

The `head` method is then used to preview the first rows of the dataset.

In [1]:
import os
import pandas as pd

file = os.path.join(os.getcwd(), 'ecommerce-session-bigquery.csv')
df = pd.read_csv(file)
df.head()

Unnamed: 0,fullVisitorId,channelGrouping,time,country,city,totalTransactionRevenue,transactions,timeOnSite,pageviews,sessionQualityDim,...,itemQuantity,itemRevenue,transactionRevenue,transactionId,pageTitle,searchKeyword,pagePathLevel1,eCommerceAction_type,eCommerceAction_step,eCommerceAction_option
0,2515546493837534633,Organic Search,966564,Taiwan,(not set),,,1567.0,82.0,17.0,...,,,,,,,/storeitem.html,0,1,
1,9361741997835388618,Organic Search,157377,France,not available in demo dataset,,,321.0,8.0,,...,,,,,,,/storeitem.html,0,1,
2,7313828956068851679,Referral,228279,United States,San Francisco,,,927.0,11.0,63.0,...,,,,,,,/storeitem.html,0,1,
3,6036794406403793540,Organic Search,1615618,United States,Boulder,,,1616.0,13.0,38.0,...,,,,,,,/storeitem.html,0,1,
4,7847280609739507227,Organic Search,37832,Canada,not available in demo dataset,,,1222.0,45.0,53.0,...,,,,,,,/storeitem.html,0,1,


In [2]:
# Print summary information about the dataset (columns, non-null counts, dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   fullVisitorId            10000 non-null  uint64 
 1   channelGrouping          10000 non-null  object 
 2   time                     10000 non-null  int64  
 3   country                  10000 non-null  object 
 4   city                     10000 non-null  object 
 5   totalTransactionRevenue  619 non-null    float64
 6   transactions             628 non-null    float64
 7   timeOnSite               9713 non-null   float64
 8   pageviews                9999 non-null   float64
 9   sessionQualityDim        19 non-null     float64
 10  date                     10000 non-null  int64  
 11  visitId                  10000 non-null  int64  
 12  type                     10000 non-null  object 
 13  productRefundAmount      0 non-null      float64
 14  productQuantity        

## Data preprocessing

In this section, the columns that are not needed for the challenges are dropped.

Dropping rows requires more care. The best approach is to inspect each column of interest and then decide whether or not to drop the affected rows.
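One way to pick the columns worth inspecting is to rank them by how many values are missing. The following is a minimal sketch on toy data, not the notebook's own method; the helper name `missing_summary` and the toy frame are hypothetical and only borrow a few column names from the dataset.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage for each column."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),             # number of NaN values per column
        "missing_pct": frame.isna().mean() * 100,  # share of NaN values per column
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)


# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "timeOnSite": [10.0, None, 30.0, 40.0],
    "pageviews": [1.0, 2.0, 3.0, 4.0],
    "transactions": [None, None, None, 1.0],
})

summary = missing_summary(toy)
print(summary)
```

Applied to the real `df`, the same helper would list the sparsest columns first, which gives a starting point for deciding where dropping rows might matter.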

In [3]:
# Drop the columns that are unlikely to be used in the challenges

dropped_col = [
    'time',
    'type',
    'productSKU',
    'v2ProductCategory',
    'productVariant',
    'itemQuantity',
    'itemRevenue',
    'transactionRevenue',
    'transactionId',
    'pageTitle',
    'searchKeyword',
    'pagePathLevel1',
    'eCommerceAction_type',
    'eCommerceAction_step',
    'eCommerceAction_option'
]

df = df.drop(columns=dropped_col)

In [4]:
# Print summary information about the dataset after dropping the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   fullVisitorId            10000 non-null  uint64 
 1   channelGrouping          10000 non-null  object 
 2   country                  10000 non-null  object 
 3   city                     10000 non-null  object 
 4   totalTransactionRevenue  619 non-null    float64
 5   transactions             628 non-null    float64
 6   timeOnSite               9713 non-null   float64
 7   pageviews                9999 non-null   float64
 8   sessionQualityDim        19 non-null     float64
 9   date                     10000 non-null  int64  
 10  visitId                  10000 non-null  int64  
 11  productRefundAmount      0 non-null      float64
 12  productQuantity          45 non-null     float64
 13  productPrice             10000 non-null  int64  
 14  productRevenue         

##### As can be observed, the dataset is much tidier after dropping the irrelevant columns.
##### The next challenge is deciding which rows to drop.

In [5]:
# Check whether the rows with a missing 'timeOnSite' value carry any transaction data.
df[df['timeOnSite'].isnull()][['totalTransactionRevenue', 'transactions']].nunique()

totalTransactionRevenue    0
transactions               0
dtype: int64
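A count of 0 here means that every row with a missing `timeOnSite` value also has no transaction data, because `nunique` ignores missing values by default. The toy sketch below (hypothetical values, not from the dataset) shows that behavior and an equivalent, more explicit check with `isna().all()`.

```python
import pandas as pd

# Toy data: the two rows without 'timeOnSite' have no transactions either
toy = pd.DataFrame({
    "timeOnSite": [None, None, 120.0],
    "transactions": [None, None, 1.0],
})

# Same pattern as the cell above: restrict to rows with a missing 'timeOnSite'
subset = toy[toy["timeOnSite"].isnull()]

# nunique() skips NaN by default, so an all-NaN column yields 0
distinct = subset[["transactions"]].nunique()

# A more explicit way to state the same condition
all_missing = bool(subset["transactions"].isna().all())

print(distinct)
print(all_missing)
```

Under that reading, dropping these rows should not remove any revenue information.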

In [None]:
# Drop the rows with a missing 'timeOnSite' value; they carry no transaction data
df = df.dropna(subset=['timeOnSite']).reset_index(drop=True)

In [None]:
# Inspect the rows with a missing 'pageviews' value
df[df['pageviews'].isnull()]

In [None]:
# Drop the rows with a missing 'pageviews' value
df = df.dropna(subset=['pageviews']).reset_index(drop=True)

In [None]:
# Inspect sessions with revenue that are neither in USD nor from the United States
df[(df['currencyCode'] != 'USD') & (df['country'] != 'United States') & (df['totalTransactionRevenue'].notnull())]

In [None]:
# Drop the rows that are neither in USD nor from the United States
df = df[~((df['currencyCode'] != 'USD') & 
          (df['country'] != 'United States'))].reset_index(drop=True)
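One detail of this filter is worth keeping in mind: in pandas, `!=` against a string evaluates to `True` for missing values, so rows with a missing `currencyCode` also count as "not USD". The sketch below uses toy data (hypothetical values) to show this, along with the equivalent keep-condition obtained through De Morgan's law. Whether the real dataset contains missing currency codes is not shown here, so treat this as an assumption to verify.

```python
import pandas as pd

# Toy data with missing currency codes (hypothetical values)
toy = pd.DataFrame({
    "currencyCode": ["USD", None, "EUR", None],
    "country": ["Canada", "France", "Japan", "United States"],
})

# A missing currency code compares as "not equal" to 'USD'
not_usd = toy["currencyCode"] != "USD"
not_us = toy["country"] != "United States"

# Same pattern as the cell above: remove rows that are neither USD nor US
filtered = toy[~(not_usd & not_us)].reset_index(drop=True)

# Equivalent keep-condition: the row is in USD OR comes from the United States
kept = toy[(toy["currencyCode"] == "USD") |
           (toy["country"] == "United States")].reset_index(drop=True)

print(filtered)
```

In this toy frame the France row is removed even though its currency is unknown rather than known to be non-USD; if that is not the intended behavior, the mask could additionally require `currencyCode` to be non-null.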

In [None]:
df.info()