# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine **which demographic groups respond best to which offer type**. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer buy 10 dollars get 2 off on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
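One way to separate influenced completions from incidental ones is to check whether a view precedes the completion. The following is a minimal sketch on a made-up mini-transcript; the frame `events`, the offer id `disc10`, and the assumption that each customer receives a given offer only once are all hypothetical simplifications, not properties of the real data.

```python
import pandas as pd

# Hypothetical mini-transcript: customer "a" views the offer, customer "b" never does.
events = pd.DataFrame({
    "person": ["a", "a", "a", "b", "b"],
    "offer_id": ["disc10"] * 5,
    "event": ["offer received", "offer viewed", "offer completed",
              "offer received", "offer completed"],
    "time": [0, 6, 48, 0, 72],  # hours since start of test
})

# Earliest timestamp of each event type per (person, offer) pair.
first_times = events.pivot_table(
    index=["person", "offer_id"], columns="event", values="time", aggfunc="min"
)

# A completion counts as influenced only if a view exists and does not come after it.
first_times["influenced"] = (
    first_times["offer viewed"].notna()
    & (first_times["offer viewed"] <= first_times["offer completed"])
)
print(first_times["influenced"])
```

With repeated offers per customer, the pairing of received, viewed and completed events would need to be done per offer window rather than per (person, offer) pair.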

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).

# Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer ie BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (ie transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record
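
Note that `duration` in portfolio.json is measured in days while `time` in transcript.json is measured in hours, so one of them has to be converted before validity windows can be compared with event times. A small sketch with made-up numbers (the frame `received` and the column `expires_at` are illustrative names only):

```python
import pandas as pd

# Hypothetical row: an offer received at hour 168 with a 7-day duration.
received = pd.DataFrame({"time": [168], "duration": [7]})

# Convert the duration from days to hours before adding it to the receipt time.
received["expires_at"] = received["time"] + received["duration"] * 24

# Check whether a later event at hour 300 falls inside the validity window.
event_time = 300
in_window = (received["time"] <= event_time) & (event_time <= received["expires_at"])
print(received["expires_at"].iloc[0], bool(in_window.iloc[0]))
```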

**Note:** If you are using the workspace, you will need to go to the terminal and run the command `conda update pandas` before reading in the files. This is because the version of pandas in the workspace cannot read in the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.

You can see how to access the terminal and how the install works using the two images below.  First you need to access the terminal:

<img src="pic1.png"/>

Then you will want to run the above command:

<img src="pic2.png"/>

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.

# Data Cleaning 

## Datasets Loading

In [1]:
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
%matplotlib inline

# read in the json files
df1 = pd.read_json('data/portfolio.json', orient='records', lines=True)
df2 = pd.read_json('data/profile.json', orient='records', lines=True)
df3 = pd.read_json('data/transcript.json', orient='records', lines=True)

In [2]:
pd.__version__

'2.2.1'

In [3]:
df3.person.nunique()
# df3.head()

17000

## Datasets Cleaning

### Portfolio
There are no missing values or duplicates in this file. The cleaning steps are:
* rename column id to offer id
* change id to numerical offer id
* create a dummy column for offer_type (informational,discount,bogo)
* create dummy columns for channels

In [4]:
portfolio = df1.copy()
portfolio.head()

Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7


In [5]:
portfolio = portfolio.rename(columns={"id":"offer_id"})

In [6]:
#encode offer ids in portfolio dataframe from string format to integer
offer_ids = portfolio['offer_id'].unique()
offer_ids_dict = pd.Series(offer_ids).to_dict()
offer_ids_dict = dict([(value,key) for key,value in offer_ids_dict.items()])

In [7]:
offer_ids_dict

{'ae264e3637204a6fb9bb56bc8210ddfd': 0,
 '4d5c57ea9a6940dd891ad53e9dbe8da0': 1,
 '3f207df678b143eea3cee63160fa8bed': 2,
 '9b98b8c7a33c4b65b9aebfe6a799e6d9': 3,
 '0b1e1539f2cc45b7b9fa7c272da2e1d7': 4,
 '2298d6c36e964ae4a3e7e9706d1fb8c2': 5,
 'fafdcd668e3743c1bb461111dcafc2a4': 6,
 '5a8bc65990b245e5a138643cd4eb9837': 7,
 'f19421c1d4aa40978ebb69ca19b0e20d': 8,
 '2906b810c7d4411798c6938adc9daaa5': 9}
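
The same string-to-integer mapping can also be built with `pd.factorize`, which assigns codes in first-seen order. This is only an alternative sketch; the shortened ids below are made up for illustration.

```python
import pandas as pd

# Shortened, hypothetical ids; the first one appears twice.
ids = pd.Series(["ae26", "4d5c", "3f20", "ae26"])

# factorize returns integer codes plus the unique values in first-seen order.
codes, uniques = pd.factorize(ids)

# Build the lookup dictionary from the unique values.
id_to_code = {offer_id: code for code, offer_id in enumerate(uniques)}
print(id_to_code)
```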

In [8]:
#map offer id in portfolio to the encoded offer id
portfolio['offer_id'] = portfolio['offer_id'].map(offer_ids_dict)

In [9]:
# Add one 0/1 indicator column per offer type
offer_types = portfolio.offer_type.unique()

for offer in offer_types:
    # Compare against the loop variable directly instead of reading it
    # as a global from inside a helper function
    portfolio[offer] = (portfolio['offer_type'] == offer).astype(int)
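
For comparison, `pd.get_dummies` produces the same kind of indicator columns in one call. A minimal sketch on a made-up frame (the name `offers` is illustrative and stands in for `portfolio`):

```python
import pandas as pd

offers = pd.DataFrame({"offer_type": ["bogo", "informational", "discount", "bogo"]})

# One 0/1 column per distinct offer type, in alphabetical column order.
indicators = pd.get_dummies(offers["offer_type"], dtype=int)

# Attach the indicator columns next to the original column.
offers = pd.concat([offers, indicators], axis=1)
print(offers)
```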

In [11]:
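# One-hot encode every channel entry: one row per (offer, position in the channel list)
stacked = pd.get_dummies(portfolio.channels.apply(pd.Series).stack())
# Collapse back to one row per offer; Series.sum(level=0) was removed in pandas 2.0,
# so group on the outer index level instead
dummy = stacked.groupby(level=0).sum()
stacked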

Unnamed: 0,Unnamed: 1,email,mobile,social,web
0,0,True,False,False,False
0,1,False,True,False,False
0,2,False,False,True,False
1,0,False,False,False,True
1,1,True,False,False,False
1,2,False,True,False,False
1,3,False,False,True,False
2,0,False,False,False,True
2,1,True,False,False,False
2,2,False,True,False,False


In [None]:
dummy

In [None]:
portfolio = pd.concat([portfolio,dummy],axis=1)

In [None]:
portfolio.drop(['channels'],axis =1, inplace= True)

In [None]:
portfolio.head()

### Profile

In [None]:
profile = df2.copy()
profile.head()

In [None]:
profile.shape

In [None]:
profile.describe()

In [None]:
profile.info()

In [None]:
profile[profile.age == 118].sort_values('income', ascending = False)

In [None]:
#checking if it has any value for gender and income
profile[profile.age == 118].gender.notna().sum() , profile[profile.age == 118].income.notna().sum()


The outlier age of 118 occurs more than 2,000 times. This is most likely a placeholder: when a customer did not enter an age in the form, the field was automatically set to a default value.

We have to take care of missing values if we want to do modelling later. There are three options: drop the affected rows, drop the affected columns entirely, or replace the missing entries with some value. In this case we remove the rows, because the age is not real and the income and gender data are missing as well.
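
The three options can be sketched on a tiny made-up frame. The frame `people` and the use of the median for imputation are illustrative assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age": [118, 35, 60],
    "gender": [None, "F", "M"],
    "income": [np.nan, 72000.0, 51000.0],
})

# Option 1: drop the rows that have a missing income.
dropped_rows = people.dropna(subset=["income"])

# Option 2: drop the affected columns entirely.
dropped_cols = people.drop(columns=["gender", "income"])

# Option 3: fill the missing income with the median of the known incomes.
imputed = people.fillna({"income": people["income"].median()})
print(len(dropped_rows), list(dropped_cols.columns), imputed["income"].tolist())
```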

In [None]:
profile = profile[profile['age']<118].reset_index(drop=True)

In [None]:
profile.age.sort_values()

In [None]:
profile.info()

In [None]:
profile.duplicated().sum()

From the information above, we see that there are no missing values or duplicated rows.

So the next cleaning process for this dataset will be:

- change id to customer_id
- change id to encoded id
- change datatype for column became_member_on to date

In [None]:
profile = profile.rename(columns = {'id':'customer_id'})

In [None]:
customer_ids = pd.unique(profile['customer_id'])
#encode customer ids which is in string format to integers
customer_ids_dict = pd.Series(customer_ids).to_dict()
#swap value and key
customer_ids_dict = dict([(value, key) for key, value in customer_ids_dict.items()]) 
#create item iterator
itr = iter(customer_ids_dict.items())
lst = [next(itr) for i in range(10)]
lst


In [None]:
#map the new encoded customer id to the old customer id in profile table
profile['customer_id'] = profile['customer_id'].map(customer_ids_dict)

In [None]:
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'],format='%Y%m%d')

In [None]:
profile.head()

In [None]:
profile.dtypes

### Transcript

In [None]:
transcript = df3.copy()
transcript.head()

In [None]:
transcript.describe()

In [None]:
transcript.info()

In [None]:
transcript.event.unique()

In [None]:
transcript.event.value_counts() 

Data cleaning to do:

- change column name from 'person' to 'customer_id'
- convert the column 'event' into 4 indicator columns, one per event type
- convert the column 'value' into columns according to its dictionary keys
- map encoded customer ids to the ids in the transcript and profile dataframes


In [None]:
transcript = transcript.rename(columns={'person':'customer_id'})

In [None]:
transcript['event'] = transcript.event.str.replace(' ','_')

In [None]:
dummy = pd.get_dummies(transcript.event)

In [None]:
transcript = pd.concat([transcript,dummy], axis=1)

In [None]:
transcript = transcript.drop('event',axis= 1)

In [None]:
dummy_val = transcript['value'].apply(pd.Series)

Rename the 'offer id' to 'offer_id' and combine with the existing 'offer_id'. We can drop the reward column as it is already captured in portfolio dataframe.
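
On a large frame, `apply(pd.Series)` can be slow. A possible alternative is `pd.json_normalize`, sketched here on three made-up records that mimic the two spellings of the offer key; the series `values` is hypothetical. `json_normalize` returns a fresh 0..n-1 index, so the source frame would need a matching default index before concatenating.

```python
import pandas as pd

# Made-up records covering the three shapes of the 'value' dictionary.
values = pd.Series([
    {"offer id": "abc"},
    {"amount": 4.5},
    {"offer_id": "abc", "reward": 2},
])

# Expand the dictionaries into columns in one vectorised call.
expanded = pd.json_normalize(values.tolist())

# Merge the two spellings of the key into a single column, then drop the leftovers.
expanded["offer_id"] = expanded["offer_id"].fillna(expanded["offer id"])
expanded = expanded.drop(columns=["offer id", "reward"])
print(expanded)
```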

In [None]:
dummy_val[dummy_val.offer_id.notna()].head()

In [None]:
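# assign the result back; in-place fillna on a selected column is deprecated in pandas 2.x
dummy_val['offer_id'] = dummy_val['offer_id'].fillna(dummy_val['offer id'])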

In [None]:
dummy_val = dummy_val.drop(['offer id','reward'],axis = 1)

In [None]:
dummy_val.head()

In [None]:
transcript = pd.concat([transcript,dummy_val], axis = 1)
transcript = transcript.drop('value', axis = 1)

In [None]:
#map encoded customer ids to ids in the transcript dataframe
transcript['customer_id'] = transcript['customer_id'].map(customer_ids_dict)

In [None]:
#map offer id in transcript to the encoded offer id
transcript['offer_id'] = transcript['offer_id'].map(offer_ids_dict)

In [None]:
transcript.head()

In [None]:
transcript.info()

In [None]:
#drop all rows where customer_id is NA: these belong to the customers with age 118
#that we removed earlier from the profile table

transcript.dropna(axis = 0, subset = ['customer_id'], inplace = True)

In [None]:
transcript.info()

### Combining Dataset

In [None]:
df = pd.merge(profile,transcript, how = 'outer', on = 'customer_id')
df = pd.merge(df,portfolio, how = 'outer', on = 'offer_id')
df.head()

# Data Visualisation


## Offers

In the first part, let's explore the offer types. Let's look at the funnel for each offer (received, viewed, completed) and at which offer characteristics go with higher conversion.

In [None]:
# select several columns with a list; tuple-style selection was removed in pandas 2.0
df.groupby('offer_id')[['offer_received','offer_viewed','offer_completed']].sum().plot.bar()
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left', borderaxespad=0.);

In [None]:
offer_type_funnel = df.groupby('offer_type')[['offer_received','offer_viewed','offer_completed']].sum()
offer_type_funnel

To make a fair comparison, let's find the average value of each campaign and change into percentage on funnel level.

In [None]:
#divider is the number of offers of each type, sorted by type to match the groupby order
divider = portfolio.offer_type.value_counts().sort_index().values

#average received,viewed,and completed value for each offer base on their type
(offer_type_funnel.T/divider).T.plot.bar()
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left', borderaxespad=0.);

Let's normalize the above value and replot them.

In [None]:
a = pd.Series((offer_type_funnel.T/divider).iloc[0]/(offer_type_funnel.T/divider).iloc[0], name = 'Received')
b = pd.Series((offer_type_funnel.T/divider).iloc[1]/(offer_type_funnel.T/divider).iloc[0], name = 'Viewed')
c = pd.Series((offer_type_funnel.T/divider).iloc[2]/(offer_type_funnel.T/divider).iloc[1], name = 'Completed')

In [None]:
ab = pd.merge(a,b, on = 'offer_type')
pd.merge(ab,c, on = 'offer_type').plot.bar()
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left', borderaxespad=0.);

From the chart above, customers seem to be most interested in BOGO offers, which have the highest view rate, but that does not translate into the highest conversion. Discount offers seem to have the highest conversion. Meanwhile, informational offers show no conversion at all, maybe because customers realize they are not gaining anything from them; note also that these offers have a difficulty and reward of 0 in the portfolio, so there is no spend threshold to meet.
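
The normalized values are ratios between neighbouring funnel stages: viewed over received, and completed over viewed. Because both stages are divided by the same per-type offer count, that divider cancels out, so the rates can be computed from the raw counts directly. A minimal sketch with made-up counts (not taken from the data set):

```python
import pandas as pd

# Made-up funnel counts per offer type.
funnel = pd.DataFrame(
    {"offer_received": [100, 80], "offer_viewed": [90, 40], "offer_completed": [45, 30]},
    index=pd.Index(["bogo", "discount"], name="offer_type"),
)

# Stage-to-stage conversion rates.
rates = pd.DataFrame({
    "view_rate": funnel["offer_viewed"] / funnel["offer_received"],
    "completion_rate": funnel["offer_completed"] / funnel["offer_viewed"],
})
print(rates)
```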

How about the difficulty of the offer? Does it show any trend on completed offer counts? Let's explore that below.

In [None]:
df.groupby('offer_type')[['offer_received','offer_viewed','offer_completed']].sum()

In [None]:
offer_type_funnel = df.groupby('difficulty')[['offer_received','offer_viewed','offer_completed']].sum()
offer_type_funnel

In [None]:
#divider is the number of offers at each difficulty level
divider = portfolio.difficulty.value_counts().sort_index().values
#average received, viewed and completed counts per offer, based on difficulty
(offer_type_funnel.T/divider).T.plot.bar();
plt.title("Customer counts based on Difficulty")
plt.ylabel('Counts')
plt.xlabel('Difficulty')
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left', borderaxespad=0.)

a = pd.Series((offer_type_funnel.T/divider).iloc[0]/(offer_type_funnel.T/divider).iloc[0], name = 'Received')
b = pd.Series((offer_type_funnel.T/divider).iloc[1]/(offer_type_funnel.T/divider).iloc[0], name = 'Viewed')
c = pd.Series((offer_type_funnel.T/divider).iloc[2]/(offer_type_funnel.T/divider).iloc[1], name = 'Completed')

ab = pd.merge(a,b, on = 'difficulty')
pd.merge(ab,c, on = 'difficulty').plot.bar()
plt.title("Customer proportion based on Difficulty")
plt.ylabel('Percentage')
plt.xlabel('Difficulty')
plt.legend(bbox_to_anchor=(1.05, 1),loc='upper left', borderaxespad=0.);

In [None]:
portfolio.groupby(['offer_type','difficulty']).offer_id.count()

All offers with difficulty 0 are informational offers. Most views are for offers with difficulty between 5 and 10, with difficulty 7 having the highest count for both views and completions.
For the most difficult offer, completions exceed 100% of views. This is most likely because a completion is recorded whenever the spending threshold is met, even if the customer never viewed the offer (as explained in the introduction).

More ideas for the conclusion later:

 - What exactly is an informational offer? The 0 completed offers do not quite make sense. I assume that when a new product or seasonal drink is released, information about it is sent to customers. In reality some people must have bought the new drink, so a count of 0 looks wrong.
 
 - If the count is real, the informational offers give a negative return on investment (ROI) and should be stopped.

## Distributions of Customer's Age, Gender and Income


Now, let's explore the customer data. Let's look at age, gender, how much revenue certain groups of customers generate, and whether there is any correlation between customer attributes and the offers.

In [None]:
profile.head()

Let's look at the age distribution now that the placeholder age of 118 has been removed in the earlier cleaning step.

In [None]:
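# rows with age 118 were already dropped during cleaning; filtering on age.max() again
# would remove the oldest genuine customers, so no further filter is applied here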
plt.hist(profile.age)
plt.grid()
plt.title("Customer's Age Distribution");

The ages are roughly bell-shaped, with a median around 55 and a slight right skew.

In [None]:
plt.hist(profile.income)
plt.grid()
plt.title("Customer's Income Distribution");

In [None]:
profile.hist(bins =40,figsize =(30,15))
plt.show()

In [None]:
#cust age and income distribution

profile.describe()

In [None]:
#scatter plot to see relationship between age and income
plt.scatter(profile.age, profile.income, alpha = 0.05)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Income and Age Scatter Plot');

The plot above suggests that the data are capped at several levels for both age and income. After seeing this I am not sure how reliable the data are, because such caps can bias our findings and any model we train.

Income is capped at around 75k at the first level, then at 100k and 120k at the subsequent levels.

In [None]:
#just checking if the data is actually capped based on the time they became a member

profile.info()

In [None]:
plt.plot(profile.groupby('became_member_on')[['income']].max())

In [None]:
plt.plot(profile.groupby('became_member_on')[['income']].mean())

Now let's see how the customers are divided by gender.

In [None]:
profile.gender.value_counts()

In [None]:
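# profile.gender.value_counts(normalize = True).plot(kind = 'bar')
gender_counts = profile.gender.value_counts()
# take the labels from the counts themselves so they cannot get out of order
gender = gender_counts.index.map({'M': 'Male', 'F': 'Female', 'O': 'Other'})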

# Create a pie chart of the number of customers for each gender
plt.figure(figsize=(6, 6))
plt.pie(gender_counts, labels=gender, autopct="%1.1f%%")
plt.title("Pie Chart of Customer Gender Distribution")
plt.show()

The majority of customers are male, at more than 50%.

Are women more likely than men to complete an offer? Let's find out.

## Relationships between Customer's Feature and the Offers


We saw earlier that the majority of our customers are male, but does the same hold for those who completed offers? Let's find out.

In [None]:
#create a customer table that holds information on their response towards the offer
cust = df.groupby('customer_id')[['transaction','offer_received','offer_viewed','offer_completed','amount']].sum()
#sort the table by the highest value of transaction
cust = cust.sort_values(by = ['transaction'], ascending = False)
#add gender info
cust['gender'] = df.groupby('customer_id')['gender'].max()

In [None]:
#average of offer completed by each gender
gender_completed = cust.groupby('gender').offer_completed.mean()


In [None]:
#setting up values for x and y axis
gender = ['Female', 'Male', 'Other']
y_val = gender_completed.values

# Create a bar chart of the average number of completed offers for each gender
plt.figure(figsize=(6, 6))
plt.bar(gender, y_val)
plt.title("Average Offer Completed by Gender")
plt.show()

From the result above, we see that male customers actually complete the fewest offers on average per person. This is an interesting finding: women and customers of other genders tend to complete more offers.



Let's move on to the next question: how many transactions do customers make, and what is their spending pattern?

In [None]:
# adding more columns to the new cust table
cust['transaction_no_offer'] = cust.transaction - cust.offer_completed
cust['income'] = df.groupby('customer_id')['income'].max()
cust['age'] = df.groupby('customer_id')['age'].max()


In [None]:
cust.head(10)

In [None]:
cust.tail()

Since we can't track how much a customer spends on each completed offer, what we can do is look at completed offers as a share of total transactions.

In [None]:
#find the  % of offer to  total transaction.
cust['perc_complete'] = round(cust.offer_completed/cust.transaction,2)
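
A customer with zero transactions would make this ratio a division by zero. One possible guard, sketched on a made-up frame `toy`, is to turn zero denominators into NaN first so that the result is missing rather than infinite:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"offer_completed": [2, 0, 1], "transaction": [4, 0, 0]})

# Treat a zero transaction count as undefined instead of dividing by zero.
denominator = toy["transaction"].replace(0, np.nan)
toy["perc_complete"] = (toy["offer_completed"] / denominator).round(2)
print(toy)
```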


In [None]:
cust.head()

In [None]:
#encode the column 'gender' in the string format to integer
gender_dict = {'O': 0, 'M': 1, 'F': 2}
cust['gender'] = cust['gender'].map(gender_dict)

Now let's look at how strongly each attribute correlates with the number of completed offers.

In [None]:
corr_matrix = cust.corr()
corr_matrix['offer_completed'].sort_values(ascending = False)


The attribute most positively correlated with completed offers is perc_complete, the number of completed offers divided by the total number of transactions. It is followed by the total amount spent by the customer, so customers who spend more are more likely to complete offers. There is a slight negative correlation between age and completed offers, which means younger customers have a slight tendency to complete more offers than older customers.

So the top three attributes we can use to define our top customers are how many times they completed an offer, followed by perc_complete and the total amount they spent. Next time we can send offers to targeted customers.
   
**Ideas:**
- high-value customers: send offers to increase their spending
- send special offers to convert customers: high-spending customers won't bother much with vouchers, coupons or offers, but we can target customers in the 0-50th percentile whose offer completion is below 1/2, including those who haven't converted yet
- should we classify these groups of customers?


The correlation coefficients above show the strength of the linear relationship between completed offers and each of the other numerical variables. Now let's look at these relationships plotted as graphs.
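
As a reminder of what these numbers measure, the Pearson coefficient that pandas computes by default is the covariance of two variables divided by the product of their standard deviations. A small check on made-up values (the series `x` and `y` are illustrative only):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

# Pearson r = covariance / (std_x * std_y), using population moments throughout.
covariance = ((x - x.mean()) * (y - y.mean())).mean()
r_manual = covariance / (x.std(ddof=0) * y.std(ddof=0))

# pandas' built-in Pearson correlation for comparison.
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```

A value near 0 only rules out a linear relationship; a curved or clustered pattern can still exist, which is why the plots are worth looking at as well.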

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# # make the data
# x = cust.offer_completed
# y = cust.perc_complete

# # size and color:
# sizes = cust.amount
# # colors = {'M':'tab:blue', 'F':'tab:orange', 'O':'tab:green'}

# # plot
# fig, ax = plt.subplots()

# # ax.scatter(x, y, s=sizes, c=colors, vmin=0, vmax=100)
# ax.scatter(x, y, s=sizes)#, c=cust['gender'].map(colors))

# # ax.set(xlim=(0, 8), xticks=np.arange(1, 8),
# #        ylim=(0, 8), yticks=np.arange(1, 8))

# plt.show()

perc_completed = cust.perc_complete
offer_com = cust.offer_completed
total_spend = cust.amount

# Scatter plot with color coding based on promo usage
plt.figure(figsize=(10, 6))
plt.scatter(total_spend, perc_completed, c=offer_com, cmap='viridis', edgecolors='black')

# Label axes and title
plt.xlabel('Total Amount')
plt.ylabel('Percentage Offer Completed from Total Transaction')
plt.title('Relationship between Percentage Offer Completed, Offer Completed, and Total Amount Spent')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(min(offer_com), max(offer_com)))
sm.set_array([])
plt.colorbar(sm, label='Offer Completed')

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
cust[cust.perc_complete > 1].sort_values(by= "perc_complete",ascending= False)

In [None]:

transaction_counts = cust.transaction
offer_com = cust.offer_completed
total_spend = cust.amount


# Scatter plot with color coding based on promo usage
plt.figure(figsize=(10, 6))
plt.scatter(transaction_counts, total_spend, c=offer_com, cmap='viridis', edgecolors='black')

# Label axes and title
plt.xlabel('Transaction Count')
plt.ylabel('Total Amount')
plt.title('Relationship between Transaction Count, Offer Completed, and Total Amount Spent')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(min(offer_com), max(offer_com)))
sm.set_array([])
plt.colorbar(sm, label='Offer Completed')

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

There is no direct correlation visible here. A higher transaction count doesn't mean a higher spending amount. The highest promo usage sits in the middle of the distribution.

In [None]:

transaction_counts = cust.transaction
promo_usage = cust.offer_completed
income = cust.income

# Scatter plot with color coding based on promo usage
plt.figure(figsize=(10, 6))
plt.scatter(transaction_counts, income, c=promo_usage, cmap='viridis', edgecolors='black')

# Label axes and title
plt.xlabel('Transaction Count')
plt.ylabel('Income')
plt.title('Relationship between Transaction Count, Promo Usage, and Income')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(min(promo_usage), max(promo_usage)))
sm.set_array([])
plt.colorbar(sm, label='Promo Usage')

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:

transaction_counts = cust.transaction
promo_usage = cust.offer_completed
age = cust.age

# Scatter plot with color coding based on promo usage
plt.figure(figsize=(10, 6))
plt.scatter(transaction_counts, age, c=promo_usage, cmap='viridis', edgecolors='black')

# Label axes and title
plt.xlabel('Transaction Count')
plt.ylabel('Age')
plt.title('Relationship between Transaction Count, Promo Usage, and Age')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(min(promo_usage), max(promo_usage)))
sm.set_array([])
plt.colorbar(sm, label='Promo Usage')

# Show plot
plt.grid(True)
plt.tight_layout()
plt.show()

This is really interesting. The higher promo usage, marked in green to yellow, lies densely where the transaction count is roughly between 5 and 25 and income is between its 50th and 75th percentile. Customers with higher incomes rarely make more than 20 transactions.

Let's explore this further by transforming the graph into quadrants based on income and transaction count percentiles.

In [None]:
#Calculate Percentiles:

#calculate the 25th, 50th, and 75th percentiles for both income and transaction_counts:

income_25th = income.quantile(.25)
income_50th = income.quantile(.5)
income_75th = income.quantile(.75)

transaction_25th = np.percentile(transaction_counts, 25)
transaction_50th = np.percentile(transaction_counts, 50)
transaction_75th = np.percentile(transaction_counts, 75)
transaction_100th = np.percentile(transaction_counts, 100)



In [None]:
#Create Quadrant Grid
#create vertical and horizontal lines at the calculated percentiles, forming the quadrant grid
plt.figure(figsize=(10, 6))
plt.axvline(x=transaction_25th, color='gray', linestyle='--')
plt.axvline(x=transaction_50th, color='gray', linestyle='--')
plt.axvline(x=transaction_75th, color='gray', linestyle='--')
# plt.axvline(x=transaction_100th, color='gray', linestyle='--')


plt.axhline(y=income_25th, color='gray', linestyle='--')
plt.axhline(y=income_50th, color='gray', linestyle='--')
plt.axhline(y=income_75th, color='gray', linestyle='--')

# Add colorbar
sm = plt.cm.ScalarMappable(cmap='viridis', norm=plt.Normalize(min(promo_usage), max(promo_usage)))
sm.set_array([])
plt.colorbar(sm, label='Promo Usage')

#Overlay the scatter plot on the percentile grid.
#Because c= is supplied, matplotlib fills the markers from the colormap and
#ignores facecolors, so that argument is dropped here.
plt.scatter(transaction_counts, income, c=promo_usage, cmap='viridis', edgecolors='black', marker='o');
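
The percentile grid can also be summarized numerically by binning both variables into quartiles and averaging promo usage per cell. This is a sketch on made-up numbers; the frame `toy` and the labels `Q1` to `Q4` are illustrative, and `pd.qcut` is one possible way to do the binning.

```python
import pandas as pd

# Made-up customers; the values are not taken from the data set.
toy = pd.DataFrame({
    "income": [30000, 45000, 60000, 75000, 90000, 105000, 120000, 52000],
    "transaction": [2, 5, 8, 11, 14, 17, 20, 9],
    "offer_completed": [0, 1, 3, 4, 2, 1, 0, 3],
})

# Assign each customer to a quartile of income and of transaction count.
labels = ["Q1", "Q2", "Q3", "Q4"]
toy["income_q"] = pd.qcut(toy["income"], 4, labels=labels)
toy["transaction_q"] = pd.qcut(toy["transaction"], 4, labels=labels)

# Mean completed offers per (income quartile, transaction quartile) cell that occurs.
grid = toy.groupby(["income_q", "transaction_q"], observed=True)["offer_completed"].mean()
print(grid)
```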


In [None]:
def aggregate(x):
    """Replace each element with the running total up to that position, then drop the last element.

    The list is modified in place.

    Args:
        x (list): Non-empty list of integer values.

    Returns:
        list: The same list object, holding the cumulative sums with the final total removed.
    """
    for i in range(1, len(x)):
        x[i] = x[i] + x[i-1]
    x.pop()
    return x

In [None]:
newlist= [1,2,3,4,5]
aggregate(newlist)

In [None]:
newlist.pop()
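
Because `aggregate` modifies its argument in place, the call above leaves `newlist` as `[1, 3, 6, 10]`, and the extra `newlist.pop()` then removes the 10 as well. If the original list should stay untouched, a non-mutating variant could be written with `itertools.accumulate`; the function name `running_totals` is hypothetical.

```python
from itertools import accumulate


def running_totals(values):
    """Return the cumulative sums of `values` without the final total.

    The input list is left unchanged.
    """
    return list(accumulate(values))[:-1]


original = [1, 2, 3, 4, 5]
result = running_totals(original)
print(result, original)
```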