# Pandas and Matplotlib Example Using the Online Shoppers Dataset
## A case study of online shoppers' purchasing intention
Data Source: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset

Description: Online shoppers - Description.pdf

Created by Jingwei Liu and Jeff Smith


### Suppose a friend of yours wants to build an online shopping website. One day you notice that he seems worried about something. After talking with him, you learn that he is facing several questions while building the site:
#### <font color="blue">1. With a limited budget, he wants to know which is the better investment: a well-designed informational page or a well-designed product-related page.</font>
#### <font color="blue">2. He plans to run an offline promotion in one region, but he doesn't know which region is the best choice.</font>
#### <font color="blue">3. He also plans to run an online promotion during one month other than the conventional months (Nov. and Dec.), but he can't decide which month is best.</font>
#### <font color="blue">4. Are there any differences in behavior between new users and returning users?</font>

### Fortunately, you happen to have a dataset on online shoppers' purchasing intention. You decide to explore the data and look for useful information to help your friend.

### Let's import the tools first and then read the data into a dataframe using Pandas

In [None]:
# import the tools: numpy, pandas and matplotlib
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# the file path
fname = "../data/16_online_shoppers_purchasing_intention.csv"
# read the data into a pandas dataframe and show the first five rows
data = pd.read_csv(fname)
data.head()

In [None]:
# a concise summary of the dataframe (column names, dtypes, non-null counts)
data.info()

### Now, let's use this dataset to find some useful information. Let's check the mean, min, max and standard deviation of the time shoppers spend on informational pages and product-related pages.

In [None]:
# use pandas built-in methods to compute some statistics for the columns [Informational_Duration] and [ProductRelated_Duration]
print("Max value in Informational_Duration  : {:0.2f} ".format(data['Informational_Duration'].max()))
print("Min value in Informational_Duration  : {:0.2f} ".format(data['Informational_Duration'].min()))
print("Mean value in Informational_Duration : {:0.2f} ".format(data['Informational_Duration'].mean()))
print("The std in Informational_Duration    : {:0.2f}".format(data['Informational_Duration'].std()))
print(50 * '-')
print("Max value in ProductRelated_Duration  : {:0.2f} ".format(data['ProductRelated_Duration'].max()))
print("Min value in ProductRelated_Duration  : {:0.2f} ".format(data['ProductRelated_Duration'].min()))
print("Mean value in ProductRelated_Duration : {:0.2f} ".format(data['ProductRelated_Duration'].mean()))
print("The std in ProductRelated_Duration    : {:0.2f}".format(data['ProductRelated_Duration'].std()))

In [None]:
# You can also use the built-in method 'describe' to find the same values
data[['Informational_Duration','ProductRelated_Duration']].describe()
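
The same four statistics can also be collected into one small table with `DataFrame.agg`. The sketch below is only an illustration: it uses a tiny made-up dataframe (`toy`) in place of the real `data`, and the numbers in it are invented.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration
toy = pd.DataFrame({
    "Informational_Duration": [0.0, 10.0, 20.0, 30.0],
    "ProductRelated_Duration": [100.0, 200.0, 300.0, 400.0],
})

cols = ["Informational_Duration", "ProductRelated_Duration"]

# One row per statistic, one column per duration attribute
summary = toy[cols].agg(["max", "min", "mean", "std"])
print(summary.round(2))
```

With the real data, replacing `toy` with `data` should give the same numbers as the print statements above, side by side for easier comparison.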

### <font color = "red"> Now you can tell your friend that, based on this online data, shoppers spend more time on product-related pages. </font>

### Now, let's find the number of shoppers in each region

In [None]:
# Show how many shoppers there are in each region
data.groupby('Region')[['Index']].count()

In [None]:
# you can also use a histogram to show the difference
Reg = data['Region']
plt.hist(Reg);
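
One caveat: `plt.hist` splits the value range into 10 equal-width bins by default, so the bars do not necessarily line up one-to-one with the integer region codes. For a categorical column like `Region`, counting each distinct value is a possible alternative. A minimal sketch on made-up data (`toy` is hypothetical, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the Region column; the values are made up for illustration
toy = pd.DataFrame({"Region": [1, 1, 1, 2, 3, 3]})

# Count rows per region code, then sort by region code rather than by count
region_counts = toy["Region"].value_counts().sort_index()
print(region_counts)
```

Calling `region_counts.plot.bar()` on the result would then draw exactly one bar per region.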

#### We can clearly see that the number of online shoppers is very high in region 1. But do shoppers in region 1 really end up buying something at a higher rate? Let's check that!

In [None]:
# The number of shoppers in region 1
region1_shopper_a = data[data['Region'] == 1]['Index'].size
# The number of shoppers who end up buying something in region 1
region1_shopper_b = data[(data['Region'] == 1) & (data['Revenue'] == True)]['Index'].size
# The number of shoppers in other regions
regiono_shopper_a = data[data['Region'] != 1]['Index'].size
# The number of shoppers in other regions who end up buying something
regiono_shopper_b = data[(data['Region'] != 1) & (data['Revenue'] == True)]['Index'].size

# multiply the ratios by 100 so that the printed values really are percentages
print("{:0.2f} percent of shoppers in region 1 buy products".format(100 * region1_shopper_b / region1_shopper_a))
print("{:0.2f} percent of shoppers in other regions (not region 1) buy products".format(100 * regiono_shopper_b / regiono_shopper_a))
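
Because the mean of a boolean column is the fraction of `True` values, the purchase rate can also be computed without counting rows by hand. The sketch below assumes `Revenue` is stored as booleans and uses a small made-up dataframe (`toy`) instead of the real data.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 1, 1, 2, 2, 3, 3],
    "Revenue": [True, False, False, False, True, True, False, False],
})

# Purchase rate for every region at once
rate_by_region = toy.groupby("Region")["Revenue"].mean()

# Purchase rate for region 1 versus all other regions combined
is_region1 = toy["Region"] == 1
rate_region1 = toy.loc[is_region1, "Revenue"].mean()
rate_others = toy.loc[~is_region1, "Revenue"].mean()

print(rate_by_region)
print("Region 1: {:.2%}".format(rate_region1))
print("Other regions: {:.2%}".format(rate_others))
```

The `{:.2%}` format multiplies by 100 and appends the percent sign, which avoids mixing up ratios and percentages.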

#### Let's use pie charts to see the percentage of shoppers who buy something in region 1 and in other regions

In [None]:
# labels for pie chart
labels = 'Buy something','Not buy something' 
# The number of shoppers who end up not buying something in region 1
region1_shopper_n = data[(data['Region'] == 1) & (data['Revenue'] != True)]['Index'].size
# The number of shoppers who end up not buying something in other regions
regiono_shopper_n = data[(data['Region'] != 1) & (data['Revenue'] != True)]['Index'].size
# the size(number) for each label in region1
region1_sizes = [region1_shopper_b,region1_shopper_n]    
# the size(number) for each label in other regions
regiono_sizes = [regiono_shopper_b,regiono_shopper_n] 

# optional parameters for better visualization
plt.rcParams['figure.figsize'] = (20.0, 9.0) #figure size
colors = ['gold', 'yellowgreen']
explode = (0,0.1)

#plot pie chart for region 1
ax1 = plt.subplot(1, 2, 1)
patches, texts, autotexts = plt.pie(region1_sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=30) 
ax1.set_title("region 1")

#plot pie chart for other regions
ax2 = plt.subplot(1, 2, 2)
patches, texts, autotexts = plt.pie(regiono_sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=30) 
ax2.set_title("other regions");
# the semicolon suppresses the value returned by the last matplotlib call. You can remove it to see what happens.

### So although there are more online shoppers in region 1, their shopping behavior seems similar to that of shoppers in other regions

### <font color = "red"> Based on the above cells, you can tell your friend that region 1 has more online shoppers and that shoppers there behave similarly to those in other regions, so holding the promotion in region 1 could be the better choice</font>

### Now let's look for patterns related to Month

In [None]:
# Show how many shoppers shop in each month
data.groupby('Month')[['Index']].count()

#### Let's also use a bar plot to see the number of shoppers by month

In [None]:
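# re-order the months chronologically: groupby sorts the month names alphabetically
# (Aug, Dec, Feb, Jul, June, ...), so Month_No lists each month's calendar number in that order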
month = data.groupby('Month')['Index'].size().reset_index(name = 'Shoppers number')
month['Month_No'] = [8,12,2,7,6,3,5,11,10,9]
month_ordered = month.sort_values(by='Month_No')

#plot the distribution
month_label = ('Feb','Mar','May','June','Jul','Aug','Sep','Oct','Nov','Dec')
fig,ax = plt.subplots(figsize=(20,10))
plt.bar(month_label,month_ordered['Shoppers number'])
plt.ylabel('Shoppers number',fontsize = 20)
plt.title('Number of shoppers by month',fontsize = 20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=15)

#plot the mean value line
mean = month_ordered['Shoppers number'].mean()
ax.axhline(mean,color = 'red')

#plot legend
import matplotlib.patches as mpatches
import matplotlib.lines as mlines
monthly_num = mpatches.Patch(color='blue', label='The monthly numbers')
monthly_avg = mlines.Line2D([],[], color='red', label='monthly average')
plt.legend(handles=[monthly_avg,monthly_num],loc=2);
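
The hard-coded `Month_No` list only works if the months appear in exactly that alphabetical order. A less fragile option, sketched here on made-up data, is to reindex the counts with an explicit chronological list. The `month_order` list below simply reuses the month labels from the plot above; `toy` is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Chronological order, using the same month labels as the bar plot
month_order = ["Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy stand-in for the Month column; the values are made up for illustration
toy = pd.DataFrame({"Month": ["Nov", "May", "Feb", "May", "Dec", "Nov", "May"]})

# Count rows per month, then put the counts in calendar order;
# months that never occur get a count of 0 instead of being dropped
counts = toy["Month"].value_counts().reindex(month_order, fill_value=0)
print(counts)
```

The resulting series can be passed to `plt.bar(counts.index, counts.values)` without a separate label tuple.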

### We can see that, apart from Nov and Dec, online shoppers shop most in May.

### <font color = "red"> Based on the above cells, you can tell your friend that holding an online promotion in May could be the better choice</font>

### Last, let's look for differences between new visitors and returning visitors

#### Add a column to show the total time a shopper spends on all kinds of pages

In [None]:
# add a column [Total_Duration] to show how much total time
# (Administrative_Duration + Informational_Duration + ProductRelated_Duration) a shopper spends
data['Total_Duration'] = data['Administrative_Duration'] + data['Informational_Duration'] + data['ProductRelated_Duration']
data[['Total_Duration']].head()
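
One detail worth knowing: adding columns with `+` yields `NaN` for any row where one of the durations is missing, while `sum(axis=1)` skips missing values by default. Whether the real dataset contains missing durations is not checked here; the sketch below just shows the difference on a made-up dataframe (`toy`).

```python
import numpy as np
import pandas as pd

duration_cols = ["Administrative_Duration", "Informational_Duration", "ProductRelated_Duration"]

# Toy stand-in for the real dataset; the second row has a missing value on purpose
toy = pd.DataFrame({
    "Administrative_Duration": [10.0, 5.0],
    "Informational_Duration": [0.0, np.nan],
    "ProductRelated_Duration": [90.0, 45.0],
})

# Element-wise addition: a missing value makes the whole row total NaN
total_plus = (toy["Administrative_Duration"]
              + toy["Informational_Duration"]
              + toy["ProductRelated_Duration"])

# Row-wise sum: missing values are skipped by default
total_sum = toy[duration_cols].sum(axis=1)

print(pd.DataFrame({"plus": total_plus, "sum": total_sum}))
```

Which behavior is preferable depends on whether a missing duration should count as zero or should mark the total as unknown.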

In [None]:
# show the mean and median of Total_Duration for each VisitorType
data.groupby('VisitorType')[['Total_Duration']].aggregate(['mean','median'])

#### It shows that returning visitors spend more time on the online shopping website. You can also take a closer look at the differences among regions

In [None]:
# Check whether each type of visitor behaves similarly in different regions
data.groupby(['Region', 'VisitorType'])['Total_Duration'].aggregate(['mean','median']).unstack()

#### Based on the median values, there seems to be no large difference among regions

### This is only a simple analysis. You can compare other attributes across visitor types, and you may reach different conclusions based on your own analysis.
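
As one possible next step, several attributes can be compared per visitor type in a single `groupby` using named aggregation. This is only a sketch on made-up data: `toy` is hypothetical, the visitor-type labels are assumed, and `Revenue` is assumed to be boolean.

```python
import pandas as pd

# Toy stand-in for the real dataset; labels and values are made up for illustration
toy = pd.DataFrame({
    "VisitorType": ["New_Visitor", "New_Visitor",
                    "Returning_Visitor", "Returning_Visitor",
                    "Returning_Visitor", "Returning_Visitor"],
    "Revenue": [True, False, True, False, False, False],
    "Total_Duration": [100.0, 50.0, 400.0, 300.0, 200.0, 100.0],
})

# One row per visitor type, one named column per statistic
by_type = toy.groupby("VisitorType").agg(
    purchase_rate=("Revenue", "mean"),            # fraction of sessions with a purchase
    median_duration=("Total_Duration", "median"),
    visits=("Revenue", "size"),                   # number of rows in each group
)
print(by_type)
```

Looking at purchase rate next to time spent helps separate "stays longer" from "buys more", which need not go together.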

### <font color = "red"> Based on the simple analysis above, you can tell your friend that returning visitors seem to spend more time on the website</font>

## This notebook provides an example of using Pandas and Matplotlib to find useful information in a dataset. Pandas and Matplotlib are very powerful packages; besides the usage shown above, they provide many other useful functions. If you are interested in those functions, please refer to:
### Pandas: https://pandas.pydata.org/
### Matplotlib: https://matplotlib.org/