# Customer Segmentation in Python

### Assign daily acquisition cohort
As you have seen in the video, defining a cohort is the first step to cohort analysis. You will now create daily cohorts based on the day each customer has made their first transaction.

The data has been loaded as a DataFrame named online; you can print its first rows with online.head() in the console.
```python
import datetime as dt

# Define a function that truncates a timestamp to its calendar day
def get_day(x): return dt.datetime(x.year, x.month, x.day)

# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) 

# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay']

# Assign a minimum InvoiceDay value to the dataset
online['CohortDay'] = grouping.transform('min')

# View the top 5 rows
print(online.head())

```
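To make the result of these steps concrete, here is a minimal self-contained sketch on a toy table. The `toy` DataFrame and its values are invented for illustration, and `dt.normalize()` is used as a vectorized alternative to applying `get_day` row by row; it is not part of the original exercise.

```python
import pandas as pd

# Hypothetical toy data: two customers, three transactions
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceDate': pd.to_datetime(['2011-01-05 10:30',
                                   '2011-01-09 14:00',
                                   '2011-01-07 09:15']),
})

# Truncate each timestamp to midnight of the same day
toy['InvoiceDay'] = toy['InvoiceDate'].dt.normalize()

# Earliest InvoiceDay per customer, broadcast back to every row
toy['CohortDay'] = toy.groupby('CustomerID')['InvoiceDay'].transform('min')

print(toy[['CustomerID', 'InvoiceDay', 'CohortDay']])
```

Because `transform('min')` returns a result aligned with the original rows, both transactions of customer 1 receive the same `CohortDay`.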
### Calculate time offset in days - part 1
Calculating time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.

First, we will create six variables that capture the integer values of the year, month and day for the Invoice and Cohort dates, using the get_date_int() function that has already been defined for you:
```python
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day
```
The online data has been loaded, you can print its header to the console by calling online.head().

```python
# Get the integers for date parts from the `InvoiceDay` column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')

# Get the integers for date parts from the `CohortDay` column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')

```
### Calculate time offset in days - part 2
Great work! You now have six Series holding the year, month and day values of the Invoice and Cohort dates: invoice_year, cohort_year, invoice_month, cohort_month, invoice_day, and cohort_day.

In this exercise you will calculate the difference between the Invoice and Cohort dates in years, months and days separately, and then combine them into a total difference in days. This will be your days offset, which we will use in the next exercise to visualize the customer count. The online data has been loaded; you can print its header to the console by calling online.head().



```python
# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Calculate difference in days
days_diff = invoice_day - cohort_day

# Combine the differences into an approximate day offset
# (365-day years, 30-day months); the +1 makes the first day index 1
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1
print(online.head())

```
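The formula above treats every year as 365 days and every month as 30 days, so the offset can drift from the true calendar distance. If an exact offset is preferred, one option is to subtract the two datetime columns directly. The sketch below compares both approaches on a hypothetical three-row table; the column names `CohortIndexApprox` and `CohortIndexExact` are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'InvoiceDay': pd.to_datetime(['2011-01-01', '2011-03-01', '2011-12-31']),
    'CohortDay': pd.to_datetime(['2011-01-01', '2011-01-31', '2011-01-01']),
})

# Approximate offset, as in the exercise (365-day years, 30-day months)
years_diff = toy['InvoiceDay'].dt.year - toy['CohortDay'].dt.year
months_diff = toy['InvoiceDay'].dt.month - toy['CohortDay'].dt.month
days_diff = toy['InvoiceDay'].dt.day - toy['CohortDay'].dt.day
toy['CohortIndexApprox'] = years_diff * 365 + months_diff * 30 + days_diff + 1

# Exact offset: subtracting datetimes yields timedeltas, .dt.days extracts whole days
toy['CohortIndexExact'] = (toy['InvoiceDay'] - toy['CohortDay']).dt.days + 1

print(toy)
```

The two columns agree on the first row and differ slightly on the others, which is usually acceptable for cohort charts but worth knowing when exact day counts matter.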
### Calculate retention rate from scratch
You have seen how to create retention and average quantity metrics tables for the monthly acquisition cohorts. Now it's your turn to build the retention metrics by yourself.

The online dataset has been loaded for you with the monthly cohorts and cohort index assigned in this lesson. Feel free to print it in the console.

Also, we have created and loaded a groupby object named grouping with this command: grouping = online.groupby(['CohortMonth', 'CohortIndex'])

```python
# Count the number of unique values per customer ID
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()

# Create a pivot 
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')

# Select the first column and store it to cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]

# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)

```
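The same pipeline can be traced by hand on a small table. The sketch below uses invented data (three customers, two cohorts) and calls `nunique()` directly on the grouped column, which should give the same counts as `apply(pd.Series.nunique)`.

```python
import pandas as pd

# Hypothetical toy data: one row per transaction
toy = pd.DataFrame({
    'CustomerID':  [1, 2, 1, 3, 3, 3],
    'CohortMonth': ['2011-01', '2011-01', '2011-01', '2011-02', '2011-02', '2011-02'],
    'CohortIndex': [1, 1, 2, 1, 1, 2],
})

# Unique active customers per cohort and month offset
cohort_data = (toy.groupby(['CohortMonth', 'CohortIndex'])['CustomerID']
                  .nunique()
                  .reset_index())

# Cohorts as rows, month offsets as columns
cohort_counts = cohort_data.pivot(index='CohortMonth',
                                  columns='CohortIndex',
                                  values='CustomerID')

# The first column holds each cohort's starting size
cohort_sizes = cohort_counts.iloc[:, 0]

# Share of the original cohort still active at each offset
retention = cohort_counts.divide(cohort_sizes, axis=0)
print(retention)
```

The first column of `retention` is always 1.0 by construction, since every cohort is divided by its own starting size.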
### Calculate average price
You will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts.

The online dataset has been loaded for you with the monthly cohorts and cohort index assigned in this lesson. Feel free to print it to the console.

```python
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex'])

# Calculate the average of the unit price column
cohort_data = grouping['UnitPrice'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot 
average_price = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='UnitPrice')
print(average_price.round(1))

```
### Visualize average quantity metric
You are now going to visualize average quantity values in a heatmap.

We have loaded pandas package as pd, and the average quantity values DataFrame as average_quantity.

Please use the console to explore it.



```python
# Import seaborn package as sns and matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt

# Initialize an 8 by 6 inches plot figure
plt.figure(figsize=(8,6))

# Add a title
plt.title('Average Spend by Monthly Cohorts')

# Create the heatmap
sns.heatmap(data=average_quantity, annot=True, cmap='Blues')
plt.show()

```
### Calculate Spend quartiles (q=4)
We have created a dataset for you with random CustomerID and Spend values as data. You will now use this dataset to group customers into quartiles based on Spend values and assign labels to each of them.

The pandas library has been loaded as pd. Feel free to print the data to the console.


```python
CustomerID  Spend
0           0    137
1           1    335
2           2    172
3           3    355
4           4    303
5           5    233
6           6    244
7           7    229

```
```python
# Create a spend quartile with 4 groups, labeled 1 to 4 (range(1, 5) excludes 5)
spend_quartile = pd.qcut(data['Spend'], q=4, labels=range(1,5))

# Assign the quartile values to the Spend_Quartile column in data
data['Spend_Quartile'] = spend_quartile

# Print data with sorted Spend values
print(data.sort_values('Spend'))

     CustomerID  Spend Spend_Quartile
    0           0    137              1
    2           2    172              1
    7           7    229              2
    5           5    233              2
    6           6    244              3
    4           4    303              3
    1           1    335              4
    3           3    355              4

```
### Calculate Recency quartiles (q=4)
We have created a dataset for you with random CustomerID and Recency_Days values as data. You will now use this dataset to group customers into quartiles based on Recency_Days values and assign labels to each of them.

Be cautious about the labels for this exercise. The labels are inverted, so creating them requires one additional, separate step. If you need to refresh your memory on the process of creating the labels, check out the slides!

The pandas library has been loaded as pd. Feel free to print the data to the console.
```python
# Store labels from 4 to 1 in a decreasing order
r_labels = list(range(4, 0, -1))

# Create a recency quartile with 4 groups and pass the previously created labels
recency_quartiles = pd.qcut(data['Recency_Days'], q=4, labels=r_labels)

# Assign the quartile values to the Recency_Quartile column in `data`
data['Recency_Quartile'] = recency_quartiles 

# Print `data` with sorted Recency_Days values
print(data.sort_values('Recency_Days'))

    CustomerID  Recency_Days Recency_Quartile
    0           0            37                4
    3           3            72                4
    7           7           133                3
    6           6           203                3
    1           1           235                2
    4           4           255                2
    5           5           393                1
    2           2           396                1



```
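The reason for the reversed labels is that `pd.qcut` hands out labels in ascending order of the values, while a low Recency_Days (a recent purchase) should earn the highest score. The sketch below rebuilds the table from the printed output above and also asks `qcut` for the bin edges via `retbins=True`; the variable name `edges` is chosen here for illustration.

```python
import pandas as pd

# Recency_Days values copied from the printed output above
data = pd.DataFrame({
    'CustomerID': range(8),
    'Recency_Days': [37, 235, 396, 72, 255, 393, 203, 133],
})

# Labels are matched to bins from lowest to highest value, so the list is
# reversed to give the smallest Recency_Days (most recent customers) a 4
r_labels = list(range(4, 0, -1))
data['Recency_Quartile'] = pd.qcut(data['Recency_Days'], q=4, labels=r_labels)

# retbins=True additionally returns the bin edges that qcut computed
_, edges = pd.qcut(data['Recency_Days'], q=4, retbins=True)

print(data.sort_values('Recency_Days'))
print(edges)
```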
### Calculate RFM values
Calculate Recency, Frequency and Monetary values for the online dataset we have used before. It has been loaded for you with the most recent 12 months of data. There's a TotalSum column in the online dataset which has been calculated by multiplying Quantity and UnitPrice: online['Quantity'] * online['UnitPrice'].

Also, we have created a snapshot_date variable that you can use to calculate recency. Feel free to print the online dataset and the snapshot_date into the Console. The pandas library is loaded as pd, and datetime as dt.
```python
# Calculate Recency, Frequency and Monetary value for each customer 
datamart = online.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'})

# Rename the columns 
datamart.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalSum': 'MonetaryValue'}, inplace=True)

# Print top 5 rows
print(datamart.head())

            Recency  Frequency  MonetaryValue
    CustomerID                                   
    12747             3         25         948.70
    12748             1        888        7046.16
    12749             4         37         813.45
    12820             4         17         268.02
    12822            71          9         146.15


```
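For readers without the course environment, the aggregation can be tried on a toy table. In the sketch below the transactions are invented, and `snapshot_date` is assumed to be one day after the last recorded transaction; the exercise only states that the variable was provided, not how it was built.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy transactions
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-08', '2011-11-09']),
    'TotalSum': [10.0, 5.5, 20.0],
})

# Assumption: the snapshot is taken one day after the latest transaction
snapshot_date = toy['InvoiceDate'].max() + dt.timedelta(days=1)

rfm = toy.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # days since last purchase
    'InvoiceNo': 'count',                                      # number of rows
    'TotalSum': 'sum',                                         # total spend
}).rename(columns={'InvoiceDate': 'Recency',
                   'InvoiceNo': 'Frequency',
                   'TotalSum': 'MonetaryValue'})

print(rfm)
```

Note that `'count'` counts rows, so if an invoice spans several line items each one adds to Frequency; `'nunique'` would count distinct invoice numbers instead.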
### Calculate 3 groups for Recency and Frequency
You will now group the customers into three separate groups based on Recency, and Frequency.

The dataset has been loaded as datamart, you can use console to view top rows of it. Also, pandas has been loaded as pd.

We will use the result from this exercise in the next one, where you will group customers based on the MonetaryValue and finally calculate an RFM_Score.

Once completed, print the results to the screen to make sure you have successfully created the R and F columns.
```python
# Create labels for Recency and Frequency
r_labels = range(3, 0, -1); f_labels = range(1, 4)

# Assign these labels to three equal percentile groups 
r_groups = pd.qcut(datamart['Recency'], q=3, labels=r_labels)

# Assign these labels to three equal percentile groups 
f_groups = pd.qcut(datamart['Frequency'], q=3, labels=f_labels)

# Create new columns R and F 
datamart = datamart.assign(R=r_groups.values, F=f_groups.values)
print(datamart.head())

```
### Calculate RFM Score
Great work, you will now finish the job by assigning customers to three groups based on the MonetaryValue percentiles and then calculate an RFM_Score which is a sum of the R, F, and M values.

The datamart has been loaded with the R and F values you have created in the previous exercise.
```python
# Create labels for MonetaryValue
m_labels = range(1, 4)

# Assign these labels to three equal percentile groups 
m_groups = pd.qcut(datamart['MonetaryValue'], q=3, labels=m_labels)

# Create new column M
datamart = datamart.assign(M=m_groups.values)

# Calculate RFM_Score (cast the categorical labels to integers before summing)
datamart['RFM_Score'] = datamart[['R','F','M']].astype(int).sum(axis=1)
print(datamart['RFM_Score'].head())

```
### Creating custom segments
It's your turn to create a custom segmentation based on RFM_Score values. You will create a function to build segmentation and then assign it to each customer.

The dataset with the RFM values, RFM Segment and Score has been loaded as datamart, together with pandas and numpy libraries. Feel free to explore the data in the console.
```python
# Define rfm_level function
def rfm_level(df):
    if df['RFM_Score'] >= 10:
        return 'Top'
    elif ((df['RFM_Score'] >= 6) and (df['RFM_Score'] < 10)):
        return 'Middle'
    else:
        return 'Low'

# Create a new variable RFM_Level
datamart['RFM_Level'] = datamart.apply(rfm_level, axis=1)

# Print the header with top 5 rows to the console
print(datamart.head())

```
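Row-wise `apply` is easy to read but can be slow on large tables. As an alternative sketch, the same thresholds can be expressed with `np.select`, which evaluates the conditions on whole columns at once. The `toy` scores below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical RFM scores for six customers
toy = pd.DataFrame({'RFM_Score': [3, 5, 6, 9, 10, 12]})

# Conditions are checked in order; the first one that matches wins,
# so 'Middle' only applies to scores from 6 up to 9
conditions = [toy['RFM_Score'] >= 10, toy['RFM_Score'] >= 6]
choices = ['Top', 'Middle']

# Anything matching no condition falls back to the default
toy['RFM_Level'] = np.select(conditions, choices, default='Low')

print(toy)
```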
### Analyzing custom segments
As a final step, you will analyze average values of Recency, Frequency and MonetaryValue for the custom segments you've created.

We have loaded the datamart dataset with the segment values you have calculated in the previous exercise. Feel free to explore it in the console. pandas library is also loaded as pd.
```python
# Calculate average values for each RFM_Level, and return the size of each segment
rfm_level_agg = datamart.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    # Return the size of each segment
    'MonetaryValue': ['mean', 'count']
}).round(1)

# Print the aggregated dataset
print(rfm_level_agg)

RFM_Level                                      
Low         180.8       3.2          52.7  1075
Middle       73.9      10.7         202.9  1547
Top          20.3      47.1         959.7  1021

```
### Calculate statistics of variables
We have created a pandas DataFrame called data for you with three variables: var1, var2 and var3.

You will now calculate average and standard deviation values for the variables, and also print key statistics of the dataset.

You can use the console to explore the dataset.
```python
# Print the average values of the variables in the dataset
print(data.mean())

# Print the standard deviation of the variables in the dataset
print(data.std())

# Get the key statistics of the dataset
print(data.describe())

var1    251.85000
var2      1.92559
var3     12.55028
dtype: float64
var1    90.993104
var2     2.583730
var3    34.516362
dtype: float64
             var1       var2        var3
count  100.000000  100.00000  100.000000
mean   251.850000    1.92559   12.550280
std     90.993104    2.58373   34.516362
min    101.000000    0.04800    0.002000
25%    171.750000    0.61250    0.280750
50%    250.000000    1.17550    1.260500
75%    339.250000    2.20800    5.568000
max    397.000000   15.31200  228.779000



```
### Detect skewed variables
We have created a dataset called data for you with three variables: var1, var2 and var3. You will now explore their distributions.

The plt.subplot(...) call before the seaborn function call allows you to plot several subplots in one chart, you do not have to change it.

Libraries seaborn and matplotlib.pyplot have been loaded as sns and plt respectively. Feel free to explore the dataset in the console.
```python
# Plot distribution of var1
plt.subplot(3, 1, 1); sns.distplot(data['var1'])

# Plot distribution of var2
plt.subplot(3,1,2); sns.distplot(data['var2'])


# Plot distribution of var3
plt.subplot(3,1,3); sns.distplot(data['var3'])

# Show the plot
plt.show()

```
### Manage skewness
We've loaded the same dataset named data. Now your goal will be to remove skewness from var2 and var3 as they had a non-symmetric distribution as you've seen in the previous exercise plot. You will visualize them to make sure the problem is solved!

Libraries pandas, numpy, seaborn and matplotlib.pyplot have been loaded as pd, np, sns and plt respectively. Feel free to explore the dataset in the console.
```python
# Apply log transformation to var2
data['var2_log'] = np.log(data['var2'])

# Apply log transformation to var3
data['var3_log'] = np.log(data['var3'])
# Create a subplot of the distribution of var2_log
plt.subplot(2, 1, 1); sns.distplot(data['var2_log'])

# Create a subplot of the distribution of var3_log
plt.subplot(2, 1, 2); sns.distplot(data['var3_log'])

# Show the plot
plt.show()

```
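A quick numeric check can complement the plots. The sketch below generates a hypothetical right-skewed sample (it is not the course dataset) and compares the sample skewness before and after the log transformation using the pandas `skew()` method.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive sample
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# np.log is only defined for positive values; zeros or negatives
# would have to be handled before this step
logged = np.log(raw)

# Skewness near 0 indicates a roughly symmetric distribution
print(round(raw.skew(), 2), round(logged.skew(), 2))
```

If a variable contains zeros or negative values, the plain log transformation does not apply directly and the data needs to be shifted or handled separately first.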
### Center and scale manually
We've loaded the same dataset named data. Now your goal will be to center and scale its variables manually.

Libraries pandas, numpy, seaborn and matplotlib.pyplot have been loaded as pd, np, sns and plt respectively. Feel free to explore the dataset in the console.
```python
# Center the data by subtracting average values from each entry
data_centered = data - data.mean()

# Scale the data by dividing each entry by standard deviation
data_scaled = data / data.std()

# Normalize the data by applying both centering and scaling
data_normalized = (data - data.mean()) / data.std()

# Print summary statistics to make sure average is zero and standard deviation is one
print(data_normalized.describe().round(2))

```
### Center and scale with StandardScaler()
We've loaded the same dataset named data. Now your goal will be to center and scale its variables with StandardScaler from the sklearn library.

Libraries pandas, numpy, seaborn and matplotlib.pyplot have been loaded as pd, np, sns and plt respectively. We have also imported the StandardScaler.

Feel free to explore the dataset in the console.
```python
# Initialize a scaler
scaler = StandardScaler()

# Fit the scaler
scaler.fit(data)

# Scale and center the data
data_normalized = scaler.transform(data)

# Create a pandas DataFrame
data_normalized = pd.DataFrame(data_normalized, index=data.index, columns=data.columns)

# Print summary statistics
print(data_normalized.describe().round(2))

```
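One detail to keep in mind when comparing this result with the manual approach: the pandas `std()` method defaults to the sample standard deviation (`ddof=1`), whereas StandardScaler divides by the population standard deviation (`ddof=0`). The difference is small for large datasets but visible on short ones. The sketch below uses only pandas and a hypothetical five-value column to show both conventions.

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({'var1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# pandas default: sample standard deviation (ddof=1)
manual = (toy - toy.mean()) / toy.std()

# Population standard deviation (ddof=0), the convention StandardScaler uses
population = (toy - toy.mean()) / toy.std(ddof=0)

# Each version has unit spread only under its own convention
print(manual['var1'].std(), population['var1'].std(ddof=0))
print(population['var1'].std())
```

This explains why `describe()` on StandardScaler output can report a standard deviation slightly above 1 before rounding.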
### Visualize RFM distributions
We have loaded the dataset with RFM values you calculated previously as datamart_rfm. You will now explore their distributions.

The plt.subplot(...) call before the seaborn function call allows you to plot several subplots in one chart, you do not have to change it.

Libraries seaborn and matplotlib.pyplot have been loaded as sns and plt respectively. Feel free to explore the dataset in the console.
```python
# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_rfm['Recency'])

# Plot frequency distribution
plt.subplot(3,1,2); sns.distplot(datamart_rfm['Frequency'])

# Plot monetary value distribution
plt.subplot(3,1,3); sns.distplot(datamart_rfm['MonetaryValue'])

# Show the plot
plt.show()

```
### Pre-process RFM data
We have loaded the dataset with RFM values you calculated previously as datamart_rfm. Since the variables are skewed and are on different scales, you will now un-skew and normalize them.

The pandas library is loaded as pd, and numpy as np. Take some time to explore the datamart_rfm in the console.
```python
# Unskew the data
datamart_log = np.log(datamart_rfm)

# Import StandardScaler, then initialize a standard scaler and fit it
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(datamart_log)

# Scale and center the data
datamart_normalized = scaler.transform(datamart_log)

# Create a pandas DataFrame
datamart_normalized = pd.DataFrame(data=datamart_normalized, index=datamart_rfm.index, columns=datamart_rfm.columns)

```
### Visualize the normalized variables
Great work! Now you will plot the normalized and unskewed variables to see the difference in the distribution as well as the range of the values. The datamart_normalized dataset from the previous exercise is loaded.

The plt.subplot(...) call before the seaborn function call allows you to plot several subplots in one chart, you do not have to change it.

Libraries seaborn and matplotlib.pyplot have been loaded as sns and plt respectively. Feel free to explore the datamart_normalized in the console.
```python
# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_normalized['Recency'])

# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_normalized['Frequency'])

# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_normalized['MonetaryValue'])

# Show the plot
plt.show()

```
### Run KMeans
You will now build a 3-cluster solution with k-means clustering. We have loaded the pre-processed RFM dataset as datamart_normalized. We have also loaded the pandas library as pd.

You can explore the dataset in the console to get familiar with it.
```python
# Import KMeans 
from sklearn.cluster import KMeans

# Initialize KMeans
kmeans = KMeans(n_clusters=3, random_state=1)
# Fit k-means clustering on the normalized data set
kmeans.fit(datamart_normalized)

# Extract cluster labels
cluster_labels = kmeans.labels_

```
### Assign labels to raw data
You will now analyze the average RFM values of the three clusters you've created in the previous exercise. We have loaded the raw RFM dataset as datamart_rfm, and the cluster labels as cluster_labels. pandas is available as pd.

Feel free to explore the data in the console.
```python
# Create a DataFrame by adding a new cluster label column
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)

# Group the data by cluster
grouped = datamart_rfm_k3.groupby(['Cluster'])

# Calculate average RFM values and segment sizes per cluster value
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)

```
### Calculate sum of squared errors
In this exercise, you will calculate the sum of squared errors for different numbers of clusters, ranging from 1 to 20. In this example we are using a custom-created dataset to get a cleaner elbow read.

We have loaded the normalized version of data as data_normalized. The KMeans module from scikit-learn is already imported. Also, we have initialized an empty dictionary to store sum of squared errors as sse = {}.

Feel free to explore the data in the console.
```python
# Fit KMeans and calculate SSE for each k
for k in range(1, 21):
  
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    
    # Fit KMeans on the normalized dataset
    kmeans.fit(data_normalized)
    
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_

# Print the SSE values once all models have been fitted
print(sse)

```
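The `inertia_` attribute is the sum of squared distances from each sample to its closest cluster center. To make that quantity less abstract, the sketch below computes it by hand with NumPy for a hypothetical set of four points whose cluster assignment is taken as given; it does not run k-means itself.

```python
import numpy as np

# Hypothetical toy data: four points already assigned to two clusters
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

sse_manual = 0.0
for cluster in np.unique(labels):
    members = points[labels == cluster]
    # The centroid is the mean of the cluster's members
    centroid = members.mean(axis=0)
    # Add the squared Euclidean distance of each member to its centroid
    sse_manual += ((members - centroid) ** 2).sum()

print(sse_manual)
```

Every point here lies exactly one unit from its centroid, so the total is 4.0. Adding more clusters can only keep this value the same or reduce it, which is why the elbow plot looks for the point where the decrease levels off rather than for the minimum.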
### Plot sum of squared errors
Now you will plot the sum of squared errors for each value of k and identify if there is an elbow. This will guide you towards the recommended number of clusters to use.

The sum of squared errors is loaded as a dictionary called sse from the previous exercise. matplotlib.pyplot was loaded as plt, and seaborn as sns.

You can explore the dictionary in the console.
```python
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')

# Add X-axis label "k"
plt.xlabel('k')

# Add Y-axis label "SSE"
plt.ylabel('SSE')

# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

```
### Prepare data for the snake plot
Now you will prepare data for the snake plot. You will use the 3-cluster RFM segmentation solution you have built previously. You will transform the normalized RFM data into a long format by "melting" the metric columns into two columns - one for the name of the metric, and another for the actual numeric value.

We have loaded the normalized RFM data with the cluster labels already assigned. It is loaded as a pandas DataFrame named datamart_normalized. Also, pandas is imported as pd.

Explore the datamart_normalized in the console before you begin the exercise to get a good sense of its structure!
```python
# Melt the normalized dataset and reset the index
datamart_melt = pd.melt(
    datamart_normalized.reset_index(),

    # Assign CustomerID and Cluster as ID variables
    id_vars=['CustomerID', 'Cluster'],

    # Assign RFM values as value variables
    value_vars=['Recency', 'Frequency', 'MonetaryValue'],

    # Name the variable and value
    var_name='Metric', value_name='Value'
)

print(datamart_melt.head())


 CustomerID  Cluster   Metric     Value
0       12747        2  Recency -2.002202
1       12748        2  Recency -2.814518
2       12749        2  Recency -1.789490
3       12820        2  Recency -1.789490
4       12822        1  Recency  0.337315



```
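To see how melting changes the shape of the table, here is a small sketch on a hypothetical two-customer table. The values are invented; the point is that each customer row becomes one row per metric.

```python
import pandas as pd

# Hypothetical wide table: one row per customer
wide = pd.DataFrame({
    'CustomerID': [1, 2],
    'Cluster': [0, 1],
    'Recency': [-1.2, 0.4],
    'Frequency': [0.8, -0.3],
    'MonetaryValue': [1.1, -0.5],
})

# Two customers times three metrics gives six rows in the long format
long = pd.melt(wide,
               id_vars=['CustomerID', 'Cluster'],
               value_vars=['Recency', 'Frequency', 'MonetaryValue'],
               var_name='Metric', value_name='Value')

print(long)
```

The long format is what plotting functions that group by a column, such as a line plot with one line per cluster, typically expect.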
### Visualize snake plot
Good work! You will now use the melted dataset to build the snake plot. The melted data is loaded as datamart_melt.

The seaborn library is loaded as sns and matplotlib.pyplot is available as plt.

You can use the console to explore the melted dataset.
```python
# Add the plot title
plt.title('Snake plot of normalized variables')

# Add the x axis label
plt.xlabel('Metric')

# Add the y axis label
plt.ylabel('Value')

# Plot a line for each value of the cluster variable
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()
```
### Calculate relative importance of each attribute
Now you will calculate the relative importance of the RFM values within each cluster.

We have loaded datamart_rfm with raw RFM values, and datamart_rfm_k3 which has raw RFM values and the cluster labels stored as Cluster. The pandas library is also loaded as pd.

Feel free to explore the datasets in the console.

```python
# Calculate average RFM values for each cluster
cluster_avg = datamart_rfm_k3.groupby(['Cluster']).mean() 

# Calculate average RFM values for the total customer population
population_avg = datamart_rfm.mean()

# Calculate relative importance of cluster's attribute value compared to population
relative_imp = cluster_avg / population_avg - 1

# Print relative importance scores rounded to 2 decimals
print(relative_imp.round(2))

        Recency  Frequency  MonetaryValue
Cluster                                   
0           0.84      -0.84          -0.86
1          -0.15      -0.35          -0.42
2          -0.82       1.67           1.82

```
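The score is the ratio of the cluster average to the population average, minus 1, so 0 means "same as the population" and the sign shows the direction of the difference. The sketch below works through the arithmetic on a hypothetical four-customer table with two metrics.

```python
import pandas as pd

# Hypothetical toy data: four customers in two clusters
toy = pd.DataFrame({
    'Recency': [10, 20, 100, 110],
    'Frequency': [8, 12, 2, 2],
    'Cluster': [0, 0, 1, 1],
})

# Average of each metric within each cluster
cluster_avg = toy.groupby('Cluster').mean()

# Average of each metric across all customers
population_avg = toy[['Recency', 'Frequency']].mean()

# Division aligns on column names, so each metric is compared with its own average
relative_imp = cluster_avg / population_avg - 1

print(relative_imp.round(2))
```

Cluster 0 has an average Recency of 15 against a population average of 60, giving 15 / 60 - 1 = -0.75, that is, 75% below the population average.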
### Plot relative importance heatmap
Great job! Now you will build a heatmap visualizing the relative scores for each cluster.

We have loaded the relative importance scores as relative_imp. The seaborn library is loaded as sns and the pyplot module from matplotlib is available as plt.
```python
# Initialize a plot with a figure size of 8 by 2 inches 
plt.figure(figsize=(8,2 ))

# Add the plot title
plt.title('Relative importance of attributes')

# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()

```

### End-to-End Segmentation Solution
```python
# Import numpy and StandardScaler
import numpy as np
from sklearn.preprocessing import StandardScaler

# Apply log transformation
datamart_rfmt_log = np.log(datamart_rfmt)

# Initialize StandardScaler and fit it 
scaler = StandardScaler(); scaler.fit(datamart_rfmt_log)

# Transform and store the scaled data as datamart_rfmt_normalized
datamart_rfmt_normalized = scaler.transform(datamart_rfmt_log)

```


```python
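# Import KMeans and initialize an empty dictionary for the SSE values
from sklearn.cluster import KMeans
sse = {}

# Fit KMeans and calculate SSE for each k between 1 and 10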
for k in range(1, 11):
  
    # Initialize KMeans with k clusters and fit it 
    kmeans = KMeans(n_clusters=k, random_state=1).fit(datamart_rfmt_normalized)
    
    # Assign sum of squared distances to k element of the sse dictionary
    sse[k] = kmeans.inertia_   

# Add the plot title, x and y axis labels
plt.title('The Elbow Method'); plt.xlabel('k'); plt.ylabel('SSE')

# Plot SSE values for each k stored as keys in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

```


```python
# Import KMeans 
from sklearn.cluster import KMeans

# Initialize KMeans
kmeans = KMeans(n_clusters=4, random_state=1) 

# Fit k-means clustering on the normalized data set
kmeans.fit(datamart_rfmt_normalized)

# Extract cluster labels
cluster_labels = kmeans.labels_

```

```python
# Create a new DataFrame by adding a cluster label column to datamart_rfmt
datamart_rfmt_k4 = datamart_rfmt.assign(Cluster=cluster_labels)

# Group by cluster
grouped = datamart_rfmt_k4.groupby(['Cluster'])

# Calculate average RFMT values and segment sizes for each cluster
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': 'mean',
    'Tenure': ['mean', 'count']
  }).round(1)

```
