### Investigating customer buying patterns


- Hello. As CTO and head of Blackwell's eCommerce Team, I'd like to welcome you aboard. I'm excited to get started on this project, but I'd first like to give you a bit of background to get you up to speed. Blackwell has been a successful electronics retailer for over three decades, with numerous stores in various locations. A little over a year ago we launched our eCommerce website.


- We are starting to build up customer transaction data from the site and we want to leverage this data to inform our decisions about site-related activities, like online marketing, enhancements to the site and so on, **in order to continue to maximize the amount of revenue we generate from eCommerce sales.**


- To that end, I would like you to explore the customer transaction data we have collected from recent online and in-store sales and see if you can infer any insights about customer purchasing behavior. 

- **Specifically, I am interested in the following**:


- Do customers in different regions spend more per transaction? Which regions spend the most/least? 


- Is there a relationship between number of items purchased and amount spent? 


- To investigate this, I’d like you to use data mining methods to explore the data, look for patterns in the data and draw conclusions. I have attached a data file of customer transactions; it includes some information about the customer who made the transaction, as well as the amount of the transaction, and how many items were purchased. Once you have completed your analysis, please create a brief report of your findings and conclusions and an explanation of how you arrived at those conclusions so I can discuss them with Martin.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt 

In [6]:
data = pd.read_csv("data/Demographic_Data.csv")

In [7]:
data.head()

   in-store  age  items   amount  region
0         0   37      4   281.03       2
1         0   35      2   219.51       2
2         1   45      3  1525.70       4
3         1   46      3   715.25       3
4         1   33      4  1937.50       1


In [8]:
# check the dimensions of the data
data.shape

(80000, 5)

In [9]:
# check for the data type of columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   in-store  80000 non-null  int64  
 1   age       80000 non-null  int64  
 2   items     80000 non-null  int64  
 3   amount    80000 non-null  float64
 4   region    80000 non-null  int64  
dtypes: float64(1), int64(4)
memory usage: 3.1 MB


In [10]:
data.age.mean()

45.757925

In [12]:
group_by_region = data.groupby(['region'])

In [13]:
for i, k in group_by_region: 
    mean_age = k['age'].mean()
    print(k)
    print(mean_age)

       in-store  age  items   amount  region
4             1   33      4  1937.50       1
8             1   51      5   908.31       1
9             1   47      3   767.54       1
11            1   33      2   684.32       1
16            1   20      7  1901.30       1
...         ...  ...    ...      ...     ...
79960         1   23      4  1107.20       1
79962         1   46      5   333.48       1
79974         1   32      3   131.46       1
79995         1   71      3   558.82       1
79998         1   49      4   335.32       1

[16000 rows x 5 columns]
43.7039375
       in-store  age  items    amount  region
0             0   37      4  281.0300       2
1             0   35      2  219.5100       2
6             0   43      6    8.5472       2
12            0   32      2   58.9970       2
26            0   42      5  114.4900       2
...         ...  ...    ...       ...     ...
79987         0   39      2  449.9400       2
79989         0   69      1  404.4200       2
79990    

In [10]:
# check for NaNs (missing values) in each column
data.isna().any()

in-store    False
age         False
items       False
amount      False
region      False
dtype: bool

### 1 - Do customers in different regions spend more per transaction? 



In [11]:
# how many regions are there, and how many transactions per region?
data['region'].value_counts()

4    26000
2    20000
3    18000
1    16000
Name: region, dtype: int64

In [12]:
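# group by region so that we get one DataFrame per region;
# passing the column name as a string keeps the group keys scalar (1, 2, ...)
# rather than one-element tuples in newer pandas versions
group_by_region = data.groupby('region')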

In [14]:
# for each region, compute total spending and average spending per transaction
spend_trs_region = {}
regions_spending = {}
for i, k in group_by_region: 
    region = i
    total_spending = k['amount'].sum()
    regions_spending[region] = total_spending
    #print(total_spending)
    spend_per_trs = total_spending/len(k)
    print('Region = ', str(i)+' -->  Spending per transaction  =  '+str(spend_per_trs))
    spend_trs_region[region] = spend_per_trs

Region =  1 -->  Spending per transaction  =  745.1614908125
Region =  2 -->  Spending per transaction  =  252.10919617499997
Region =  3 -->  Spending per transaction  =  917.9696374444444
Region =  4 -->  Spending per transaction  =  1284.0520123076924


In [15]:
marklist = sorted(spend_trs_region.items(), key=lambda x:x[1])
sortdict = dict(marklist)
print(sortdict)

{2: 252.10919617499997, 1: 745.1614908125, 3: 917.9696374444444, 4: 1284.0520123076924}


### 2- Which regions spend the most/least?

In [16]:
marklist_reg = sorted(regions_spending.items(), key=lambda x:x[1])
sortdict = dict(marklist_reg)
print(sortdict)

{2: 5042183.9235, 1: 11922583.853, 3: 16523453.474, 4: 33385352.32}
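The per-region loop and the two sorted dictionaries above can also be expressed as a single grouped aggregation. The following is a minimal sketch, not the notebook's original method: the helper name `summarize_spending_by_region` and the toy rows are assumptions made for illustration, and only the `region` and `amount` column names come from the data set.

```python
import pandas as pd


def summarize_spending_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Return total, mean, and count of `amount` per region, sorted by mean.

    Hypothetical helper: assumes `df` has `region` and `amount` columns,
    as in the notebook's data set.
    """
    return (
        df.groupby("region")["amount"]
        .agg(total="sum", per_transaction="mean", transactions="count")
        .sort_values("per_transaction")
    )


# Made-up toy rows, NOT the Blackwell data; only the column names match.
toy = pd.DataFrame(
    {
        "region": [1, 1, 2, 2, 3, 4],
        "amount": [100.0, 300.0, 50.0, 70.0, 400.0, 900.0],
    }
)

summary = summarize_spending_by_region(toy)
print(summary)
```

Applied to the real data, `summarize_spending_by_region(data)` should give the same per-transaction and total figures as the loop, in one table that is already sorted from lowest to highest average spend.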


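### Yes, spending per transaction differs by region. 
##### In particular: 

- Customers in region 4 spend the most per transaction (about 1284 on average) and also account for the highest total spending. 

- Customers in region 2 spend the least per transaction (about 252 on average) and account for the lowest total spending. 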

### 3- Is there a relationship between number of items purchased and amount spent?

- To answer this question, I compute the correlation between the two columns, `items` and `amount`, within each region.

In [17]:
for i, k in group_by_region: 
    print(k['items'].corr(k['amount']))

-0.007889275578117165
-0.0016842050093184991
-0.0008439979431084291
0.008285865630564907


### The answer to the question above is NO. 

- There is no linear relationship between the number of items purchased and the amount spent: in every region the correlation coefficient is very close to zero (between about -0.008 and 0.008).
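The correlation coefficient used here only measures linear association, so a value near zero does not by itself rule out a non-linear pattern. One way to cross-check is to compute the correlation over all transactions and compare the mean amount at each item count. The sketch below is illustrative only: the helper name `items_amount_relationship` and the toy rows are assumptions, not part of the original analysis.

```python
import pandas as pd


def items_amount_relationship(df: pd.DataFrame):
    """Return the overall Pearson correlation between `items` and `amount`,
    plus the mean `amount` for each distinct item count.

    Hypothetical helper: assumes `df` has `items` and `amount` columns.
    """
    overall_r = df["items"].corr(df["amount"])  # Pearson is the default method
    mean_by_items = df.groupby("items")["amount"].mean()
    return overall_r, mean_by_items


# Made-up toy rows, NOT the Blackwell data; built so the amounts
# do not depend on the number of items.
toy = pd.DataFrame(
    {
        "items": [1, 1, 2, 2, 3, 3],
        "amount": [10.0, 30.0, 20.0, 20.0, 30.0, 10.0],
    }
)

overall_r, mean_by_items = items_amount_relationship(toy)
print(overall_r)
print(mean_by_items)
```

If the mean amount per item count on the real data is roughly flat, as it is in this toy example, that would support the conclusion that the number of items does not drive the amount spent.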