# Customer Retention Rates: Part 2 of Store Analysis
#### by ***Wayne Omondi***

## 1.1: Introduction

Returning customers are one of the best assets a business can have, so any company that wants to succeed must keep a close eye on its customer retention metrics. The economic reason is simple: keeping existing customers costs far less than winning new ones. Loyal customers also contribute to the health of the business by providing referrals, promoting the brand on social media, and giving feedback that improves the product or service.

***How do you calculate customer retention rate?***<br>
To determine the retention rate, first identify the time frame you want to study. Next, count the existing customers at the start of that period. Then count the total customers at the end of the period. Most formulations also need the number of new customers acquired during the period, so that newcomers are not counted as retained.
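One common way to turn these counts into a rate is `((E - N) / S) * 100`, where `S` is the customers at the start, `E` the customers at the end, and `N` the customers acquired during the period. The sketch below assumes that formulation; the function name and the sample numbers are illustrative and do not come from the SuperStore data.

```python
def retention_rate(customers_at_start, customers_at_end, new_customers=0):
    """Percentage of starting customers still present at the end of the period.

    Assumes the common formula ((E - N) / S) * 100.
    """
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    # Customers at the end who were already customers at the start.
    retained = customers_at_end - new_customers
    return retained / customers_at_start * 100


# Illustrative numbers only: 200 customers at the start, 230 at the end,
# 50 of whom were acquired during the period.
example_rate = retention_rate(200, 230, new_customers=50)
print(f"{example_rate:.1f}%")  # prints 90.0%
```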

For this we will explore the concept of a ***cohort***. A cohort is <ins>a group of subjects that share a defining characteristic.</ins> 
>A cohort has three main attributes:
>1. time
>2. size
>3. behaviour
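
To make the three attributes concrete, here is a small sketch on invented orders (not the SuperStore data). It assumes the defining characteristic of a cohort is the month of a customer's first order, which is one common choice; the column names are hypothetical.

```python
import pandas as pd

# Toy orders, invented for illustration.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "B"],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-20",
        "2023-02-03", "2023-03-15", "2023-03-01",
    ]),
})

# Time: each customer's cohort is the month of their first order.
first_order = orders.groupby("customer_id")["order_date"].transform("min")
orders["cohort_month"] = first_order.dt.to_period("M")

# Size: number of distinct customers in each cohort.
cohort_sizes = orders.groupby("cohort_month")["customer_id"].nunique()

# Behaviour: here, simply how many orders each cohort placed.
cohort_orders = orders.groupby("cohort_month").size()

print(cohort_sizes)
print(cohort_orders)
```

With these rows, customers A and B form the January 2023 cohort and customer C forms the February 2023 cohort.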

### 1.1.1: Steps

1. __Data Loading and Preparation__
 > importing our dataset, cleaning it and preparing it.<br> 
2. __Assigning Cohorts__

3. __Cohort Indexing__

4. __Visualization__

5. __Interpretation__
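
As a rough preview of how steps 2 and 3 fit together, the sketch below builds a retention table from invented orders. It assumes monthly cohorts based on each customer's first order; `retention_table` and the column names are hypothetical helpers for illustration, not necessarily the approach taken later in this notebook.

```python
import pandas as pd


def retention_table(orders, customer_col, date_col):
    """Return a cohort-by-month-offset table of retention fractions."""
    df = orders[[customer_col, date_col]].copy()
    # Assigning cohorts: month of each customer's first order.
    first = df.groupby(customer_col)[date_col].transform("min")
    df["cohort_month"] = first.dt.to_period("M")
    # Cohort indexing: whole calendar months elapsed since the first order.
    df["cohort_index"] = (
        (df[date_col].dt.year - first.dt.year) * 12
        + (df[date_col].dt.month - first.dt.month)
    )
    # Distinct active customers per cohort and month offset.
    counts = (
        df.groupby(["cohort_month", "cohort_index"])[customer_col]
        .nunique()
        .unstack("cohort_index")
    )
    # Divide each row by the cohort size (the count at offset 0).
    return counts.divide(counts[0], axis=0)


# Toy orders, invented for illustration.
orders = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "B"],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-20",
        "2023-02-03", "2023-03-15", "2023-03-01",
    ]),
})
table = retention_table(orders, "customer_id", "order_date")
print(table)
```

Each row is a cohort and each column is the number of months since the first purchase, so the first column is always 1.0. Cells with no activity stay as `NaN`. A table of this shape is the kind of input seaborn's `heatmap` can plot for the visualization step.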

## 2.1: Data Loading and Prep

First, we import the libraries we will need:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, we tell the notebook to suppress all warnings so they do not clutter the output:

In [2]:
import warnings
warnings.filterwarnings("ignore")

We will use the SuperStore dataset from [Part 1. EDA Project](https://github.com/WayneNyariroh/StoreSales_Analysis)<br>
Let's load the dataset using `read_csv()`.

In [4]:
store_df = pd.read_csv(r"C:\Users\FatherMammoth\Documents\PortfolioPorjects\SalesAnalysis\data\SuperStoreSales_Whole.csv")

In [5]:
store_df.info()  # inspect the row count, columns, non-null counts and datatypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
 18  Quantity

Looking at the columns, we will not need most of them. The two most important columns for our cohort analysis are *Order Date* and *Customer ID*, so we have to check them for missing values. We include *Order ID* and *Sales* in the same check.

In [8]:
store_df[['Order ID','Order Date','Customer ID','Sales']].isnull().sum()

Order ID       0
Order Date     0
Customer ID    0
Sales          0
dtype: int64
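
The `info()` output lists *Order Date* with dtype `object`, so the dates are stored as strings, and cohort work needs real datetimes. Below is a minimal sketch of the conversion on invented rows. The day-first format string is an assumption and should be checked against the actual file.

```python
import pandas as pd

# Invented sample rows; the real file's date format may differ.
sample = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017", "not a date"],
    "Customer ID": ["C-001", "C-002", "C-003"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so bad rows can be counted and inspected afterwards.
sample["Order Date"] = pd.to_datetime(
    sample["Order Date"], format="%d/%m/%Y", errors="coerce"
)
unparsed = sample["Order Date"].isna().sum()

print(sample.dtypes)
print(f"rows that failed to parse: {unparsed}")
```

Counting the `NaT` values after conversion is a useful second check, since a column can have no nulls and still contain strings that are not valid dates.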

In [None]:
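# Assumption: keep only the four columns checked above for the cohort analysis.
customer_df = store_df[['Order ID','Order Date','Customer ID','Sales']].copy()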