# Exploratory Data Analysis using Python
#### by ***Wayne Willis Omondi***

## 1.0: Introduction

In modern times, when nearly every business depends on its data *(Data is King)* to make better decisions, data analysis plays an important role in helping business entities understand their performance and spot opportunities to increase gains and minimise losses. Our objective is to gain valuable insights into the overall performance of the store.

For our analysis we will be using the [SuperStore dataset](https://community.tableau.com/s/question/0D54T00000CWeX8SAL/sample-superstore-sales-excelxls).

### 1.1: Why Python?

**Python** is a popular choice for data analysis thanks to its many analytics libraries, which also cover scientific computing, machine learning and more complex tasks. Combined with Python's overall strength for general-purpose software engineering, this makes it an excellent tool for building data applications.


### 1.2: Our Tools
For our EDA, we will be using the following libraries:
-  **pandas** (a library that makes working with structured and tabular data fast, easy and expressive).
-  **numpy** (a library that provides the data structures and algorithms useful for numerical computing).
-  **matplotlib** (a library for plots and two-dimensional visualizations).
-  **seaborn** (statistical data visualization library).

Let's not forget *Jupyter*, which allows us to present our code as an interactive notebook/document with text, plots and other outputs (even terminal commands, through magic commands).
All of these have already been installed in our Python virtual environment.

### 1.3: Importing Our Libraries and Some Necessary Functions

Since we are using the pandas and numpy libraries for data processing and manipulation, and the matplotlib and seaborn libraries for data visualization, we start off by importing them.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

Additionally, this being a Sales dataset, we will likely be working with dates and/or times. In order to manipulate those, we make an additional import from Python's 'datetime' module.

In [None]:
from datetime import datetime, date, time

We will also disable warnings in Jupyter by setting the filter to never display them.
[More on Warnings](https://www.geeksforgeeks.org/warnings-in-python/).

In [None]:
import warnings 
warnings.filterwarnings("ignore")
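
Ignoring every warning globally can also hide useful deprecation notices. As a minimal sketch of a narrower alternative, the standard library's `warnings.catch_warnings()` context manager limits the suppression to one block. The `noisy_step` function is hypothetical and only stands in for a call that emits a warning.

```python
import warnings

def noisy_step():
    # Hypothetical function standing in for any call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# Warnings can also be recorded and inspected instead of being hidden.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

Whether to silence warnings for a whole notebook or only around known noisy calls is a judgment call; the scoped form simply keeps the rest of the session informative.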

## 2.0: Our Data

### 2.1: Loading Our Data

Our data is stored in CSV format, so we will use the pd.read_csv() function.

In [None]:
store_df = pd.read_csv('data/SuperStoreSales_Whole.csv') #loading our dataset into a dataframe named store_df

### 2.2: Viewing Our Data 

We have to see what we are working with and whether any data cleaning is necessary. In this step we display our dataset and get to know how many columns and rows are present, how much data is missing, the datatypes in each column, the unique features and, last but not least, a statistical description of our dataset.
While doing this we will also get to understand more about the store, for example what products it sells, who it sells to and where, before we dive into its performance.

In [None]:
#pd.set_option('display.max_rows', 6) #display 6 rows of our dataset, 3 head, 3 tail - commented out for now
store_df

In [None]:
store_df.shape #view the shape of our dataset - total rows and columns.

Our dataset has 9800 rows and 21 columns. Depending on the objective of our EDA we could drop some columns from the analysis; for example, the *'Customer Name'* column is not necessary. Personal information is often excluded from a data analysis. An exception in our case would be if the store intends to reward its most loyal and/or top customers, but even then we can drop the *'Customer Name'* column and retain the *'Customer ID'* column for that query.

In [None]:
store_df.drop(columns = ['Customer Name', 'Row ID'], inplace=True) #remove 'Customer Name' and 'Row ID' columns from the 'store_df' dataframe
store_df

In [None]:
store_df.columns #view all the columns, in case we need to rename any

In [None]:
store_df.info() #view information on our dataframe in terms of index range and datatype for each column

In [None]:
store_df.describe() #statistical description of our dataset

#### 2.2.1: Check for any Duplicated Records

In [None]:
store_df[store_df.duplicated()]

There are no duplicate transactions in our dataset
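
For reference, here is a small sketch of how the duplicate check behaves when duplicates do exist. The rows are toy data, not taken from the SuperStore file.

```python
import pandas as pd

# Toy data, not taken from the SuperStore file.
toy = pd.DataFrame({
    "Order ID": ["A-1", "A-2", "A-2", "A-3"],
    "Sales": [10.0, 25.5, 25.5, 7.0],
})

dup_mask = toy.duplicated()          # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())   # number of repeated rows
deduped = toy.drop_duplicates()      # keeps the first occurrence of each row

print(n_duplicates, len(deduped))
```

An empty result from `store_df[store_df.duplicated()]`, as above, corresponds to `n_duplicates` being 0.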

#### 2.2.2: Missing Values

In [None]:
store_df.isnull().sum() #confirm for any null entries in the dataframe

The *'Postal Code'* column is the only column with missing data. Luckily, data such as a Postal Code can easily be looked up and inserted into the dataframe. Had the missing data been in other columns such as *'Sales'*, *'Quantity'*, *'Discount'* etc., we would have had to decide how to deal with the missing values: whether to drop them all or to calculate a value for them based on other available values.

In [None]:
store_df[store_df['Postal Code'].isnull()] #to find the specific missing Postal Codes

All 11 missing Postal Codes are for the city of Burlington in the state of Vermont. We can easily look that code up and insert it into our dataframe using the .fillna() method.

In [None]:
store_df['Postal Code'] = store_df['Postal Code'].fillna(5401) #5401 is the Postal Code for Burlington City
store_df.isnull().sum() #check to see if null values are still present in the dataframe
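
A plain `.fillna()` fills every missing value in the column, which is fine here because all the gaps belong to one city. As a hedged sketch on toy rows, a boolean mask can restrict the fill to the intended city and state, and the code can be stored as text so that the leading zero of 05401 is kept. The rows below are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the situation above; values are illustrative only.
toy = pd.DataFrame({
    "City": ["Burlington", "Houston", "Burlington"],
    "State": ["Vermont", "Texas", "Vermont"],
    "Postal Code": [np.nan, 77095.0, np.nan],
})

# Fill only the rows that are both missing and known to be Burlington, Vermont.
mask = (
    toy["Postal Code"].isna()
    & (toy["City"] == "Burlington")
    & (toy["State"] == "Vermont")
)
toy.loc[mask, "Postal Code"] = 5401

# Store the code as zero-padded text so the leading zero (05401) survives.
toy["Postal Code"] = toy["Postal Code"].astype(int).astype(str).str.zfill(5)
print(toy["Postal Code"].tolist())
```

Keeping postal codes as strings is a design choice: they are identifiers rather than quantities, so no arithmetic is lost.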

#### 2.2.3: Sorting Data Based On Order Date

In [None]:
store_df['Order Date'] = pd.to_datetime(store_df['Order Date']) #changing datatype format to datetime instead of an object
store_df.sort_values(by=['Order Date'], ascending=True, inplace=True)
store_df.head(5)
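
Date strings such as 08/11/2017 are ambiguous (8 November or 11 August), and passing an explicit format to `pd.to_datetime` removes the guesswork. The sketch below uses made-up strings and assumes a day/month/year layout; the layout of the actual CSV should be checked before adopting it.

```python
import pandas as pd

# Illustrative strings only; the assumption here is a day/month/year layout.
raw = pd.Series(["08/11/2017", "15/04/2018", "03/01/2016"])

# An explicit format removes any guessing about day/month order.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

print(parsed.dt.day.tolist(), parsed.dt.month.tolist())
```

With an explicit format, a string that does not match raises an error instead of being silently read with the day and month swapped.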

#### 2.2.4: What Do They Sell?

In [None]:
print(store_df['Category'].unique()) #check the categories of products sold by the store

The Store's products fall into these 3 Categories, with the following Sub-Categories:

In [None]:
print(store_df['Sub-Category'].unique()) #sub-categories of products sold

Later we will dive into analysing these in terms of quantities and revenue generated.

#### 2.2.5: Who Do They Sell To?

In [None]:
print(store_df['Segment'].unique())

In [None]:
store_df['Segment'].value_counts()

#### 2.2.6: Where Do They Sell?

In [None]:
store_df['State'].nunique()

Let's visualize these States

In [None]:
plt.figure(figsize=(25,5)) #size of our plot
plt.title('SuperStore - States Sold To Over The Review Period: 2015 - 2019', fontsize=12, fontweight='bold') #plot title
plt.xticks(rotation='vertical') #orientation of the State names on the x-axis
sns.countplot(x=store_df['State']) #what to plot

With this we can already see which states have more orders coming from them and which have fewer. For example, it is visible that most products have been sold to the States of California and New York, while fewer sales have gone to Wyoming and West Virginia.

In [None]:
store_df['Region'].nunique()

The Store sells its products in 49 states that fall in 4 Regions. Later we will analyse the sales for each state and the performance of each of the 4 regions.

## 3.0: Sales Analysis

In this section we will divide our tasks in order to query and visualize the following:

-  Sales Trends Over Specified Durations (Years and Months)
-  The Store's Top Customers
-  Average days taken for Order fulfilment, i.e. from Order Date to Ship Date. Is it dependent on Shipping Mode?
-  Product Performances
-  Sales Based on Cities, Regions, Categories, Sub-Category (City with highest and lowest sales)
-  Losses (Profit < 0)
-  Correlations between Columns e.g., Discount and Profit among others

In [None]:
store_df.columns #refresh our mind on our dataset's columns

In [None]:
store_df.head(2)

### 3.1: What Duration does our Dataset Fall in?

In order to get a better analysis of our dates we will need some additional columns:
-  Order Year
-  Ship Year
-  Order Month
-  Ship Month
-  Day Of Week the Order or Shipping happened

Later, when querying Order Fulfilment, we will also need a column to capture the number of days it takes to fulfil an order (the difference between 'Order Date' and 'Ship Date').

#### 3.1.1: Getting the Years from the Dates

To achieve this we will use the .dt.year accessor. You can also use the .dt.strftime('%Y') method, which returns the year as a string rather than an integer.
Checking back on our dataframe, the datatype for 'Ship Date' is object; we need to change that to the datetime datatype, as we did earlier with 'Order Date'.

In [None]:
store_df['Ship Date'] = pd.to_datetime(store_df['Ship Date'])

In [None]:
store_df.info() #lets check to see if the datatype for the selected columns changed

In [None]:
store_df['Order Year'] = store_df['Order Date'].dt.year #creates a new column with the value for the Year of the Order
store_df['Shipping Year'] = store_df['Ship Date'].dt.year #creates a new column with the value for the Year of the Shipping
store_df.sample(2)

We successfully added 'Order Year' and 'Shipping Year' to our dataframe. 
This will allow us to later focus on specific years and/or analyse trends over the years.
Let's see the years present in our dataframe.


In [None]:
print(store_df['Shipping Year'].unique()) #using the .unique() method to see the years our dataset spans

The SuperStore dataset being analysed falls in the years 2015 to 2019.

Let's proceed to add the months of each order and shipment to our dataframe using the .dt.month_name() method. Alternatively you could use .dt.to_period('M'), which returns a year-month period (for example 2018-11) instead of the month name.

In [None]:
store_df['Order Month'] = store_df['Order Date'].dt.month_name() #new column Order Month
store_df['Shipping Month'] = store_df['Ship Date'].dt.month_name() #new column Shipping Month, taken from the Ship Date
store_df[['Sales','Order Month','Order Year']].to_csv("data/SalesMonths-Years.csv") #lets save our new dataset to a csv file but only with the sales, months and year columns
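
The list at the start of this section also mentions a day-of-week column. A minimal sketch of how it could be derived with the same `.dt` accessor is shown below on toy dates; the column names `Order Day` and `Order Period` are illustrative and not part of the original dataframe.

```python
import pandas as pd

# Toy dates; 'Order Day' and 'Order Period' are illustrative column names.
toy = pd.DataFrame({"Order Date": pd.to_datetime(["2018-11-05", "2018-11-10"])})

toy["Order Day"] = toy["Order Date"].dt.day_name()                      # weekday name
toy["Order Period"] = toy["Order Date"].dt.to_period("M").astype(str)   # year-month label

print(toy)
```

The year-month label sorts chronologically as text, which can be handy when a chart needs months from several years on one axis.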


In [None]:
store_df.sample(1)

#### 3.1.2: Sales Trends Over Specified Durations (Years and Months)

The reason behind extracting the Years and Months for each sale is to enable the Store to take note of sales trends and of the months when sales were higher or lower, and even to attribute internal or external reasons to those. Maybe Sales are higher in festive months; maybe Sales were lower in a specific month of 2018 because of a local or international event that affected Customers' purchases. We can then analyse how the months perform, check whether there is any pattern, and even do a time series analysis and forecast sales.

Let's build some pivot tables of Sales (Number and Total Revenue) for the Months and Year(s).

In [None]:
# our pivot table with the total amount generated from Sales
total_sales_table = pd.pivot_table(store_df, values='Sales', index='Order Year', columns='Order Month', aggfunc='sum', margins=True).round(2)
total_sales_table

The pivot table above shows us the Sales/Revenue totals for every month of every year present in the dataset, with the *margins=True* parameter adding the 'All' row and column for the totals. All values are rounded off to 2 decimal places.
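
Because the month columns are text, a pivot table lists them alphabetically (April before January). One possible way to put them in calendar order, sketched here on toy data, is to reselect the columns using the month names from the standard `calendar` module.

```python
import calendar
import pandas as pd

# Toy data, only to show the column ordering.
toy = pd.DataFrame({
    "Order Year": [2017, 2017, 2018, 2018],
    "Order Month": ["April", "January", "March", "January"],
    "Sales": [100.0, 50.0, 75.0, 25.0],
})

table = pd.pivot_table(toy, values="Sales", index="Order Year",
                       columns="Order Month", aggfunc="sum")

# list(calendar.month_name)[1:] is January..December; keep the months present.
month_order = [m for m in list(calendar.month_name)[1:] if m in table.columns]
table = table[month_order]

print(table.columns.tolist())
```

The same reordering can be applied before plotting a heatmap so that the x-axis reads from January to December.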

In [None]:
# using the count aggregation to get the number of sales done in each month of each year. The grand total in the 'All' margins should equal the total rows in the dataframe - 9800
no_of_sales_table = pd.pivot_table(store_df, values='Sales', index='Order Year', columns='Order Month', aggfunc='count', margins=True).round(2)
no_of_sales_table

We are now able to see the number of Sales that have happened over the years. It is observable that more sales happen in November and fewer in February, with the number of sales increasing each year. We can visualize this by plotting a heatmap.

In [None]:
no_of_sales_table2 = no_of_sales_table.drop(columns='All') #drop the 'All' column; as a totals column it would distort the value scale of the heatmap
no_of_sales_table2 = no_of_sales_table2.drop(index='All') #drop the 'All' row for the same reason
no_of_sales_table2 #our new sales dataframe for the heatmap visualizations


In [None]:
plt.figure(figsize=(20,5)) #size of plot
plt.title('SuperStore - Number of Sales Each Month for the Review Period: 2015 - 2019', fontsize=11, fontweight='bold') #title
sns.heatmap(no_of_sales_table2, cmap='Blues', annot=True, annot_kws={"size":11}, fmt="d", cbar=False) #the plot. annotation true, colors for map are Blues, and color bar disabled

The lighter the shade of blue the less the Number of Sales that year for the corresponding month. We can then observe that September 2018 had the most Number of Sales in the review period; and February 2015 had the lowest Number of Sales.

We can also confirm the yearly total Revenue by creating a pivot table for Sales and Years only.

In [None]:
year_sales_table = pd.pivot_table(store_df, 'Sales', index='Order Year', aggfunc=['sum', 'mean']).round(2)
year_sales_table 

In [None]:
sales_year = store_df.groupby('Order Year')[['Sales']].sum().sort_values("Sales", ascending=False) #select 'Sales' before summing so non-numeric columns are never aggregated; highest first
sales_year = sales_year.round(2) #round off Sales to 2 decimal places
sales_year.reset_index(inplace=True) #set Order Year as a column and create a new index for the 'sales_year' dataframe
sales_year

In [None]:
plt.figure(figsize=(8,4))
plt.bar(sales_year['Order Year'], sales_year['Sales'], color='#ecb365') #what to plot
plt.title('SuperStore - Total Revenue for the Review Period: 2015 - 2019', fontsize=9, fontweight='bold') #title of our plot
plt.xlabel('Year') #our axis labels
plt.xticks(rotation='vertical')
plt.ylabel('Total Revenue') 
#for i, j in sales_year['Sales'].items(): #our index, values and enumerator
#    plt.text(i, j+3, '$'+str(j)); 

### 3.2: Who Are The Top Customers of The Store?

-  Their Top Buyers
-  Revenues and Sales in Segments

We previously removed the 'Customer Name' column from our dataframe, so for this query we will use the 'Customer ID' column to find out who the store's top customers are, based on Sales. Maybe the store can offer them rewards.

In [None]:
top_ten_customers = store_df.groupby('Customer ID')[['Sales']].sum().sort_values("Sales", ascending=False).head(10) #select 'Sales' before summing, then sort Customers by Sales and keep the top 10
top_ten_customers = top_ten_customers.round(2) #round off Sales to 2 decimal places
top_ten_customers.reset_index(inplace=True) #set Customer ID as a column and create a new index for this 'top_ten_customers' dataframe
top_ten_customers #view the top 10 customers
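
As an aside, the sort-then-head pattern can also be written with `nlargest`. The sketch below uses toy data and a hypothetical `top_customers` name to show the idea; it is an alternative spelling, not a change to the analysis.

```python
import pandas as pd

# Toy data; 'top_customers' is a hypothetical name for this sketch.
toy = pd.DataFrame({
    "Customer ID": ["C1", "C2", "C1", "C3", "C2"],
    "Order Date": pd.to_datetime(["2018-01-01"] * 5),
    "Sales": [10.0, 40.0, 15.0, 5.0, 2.5],
})

# Select the column first, then aggregate; nlargest replaces sort_values + head.
top_customers = (
    toy.groupby("Customer ID")["Sales"]
       .sum()
       .nlargest(2)
       .round(2)
       .reset_index()
)
print(top_customers)
```

Selecting the `Sales` column before calling `sum()` also means the date column is never passed to the aggregation.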

Let's plot the results of this query.

In [None]:
plt.figure(figsize=(20, 5)) #size of the plot
plt.bar(top_ten_customers['Customer ID'], top_ten_customers['Sales'], color='#99f5e0', edgecolor='green') #what to plot
plt.title('SuperStore - Top Ten Customers Over the Review Period: 2015 - 2019', fontsize=11, fontweight='bold') #title of our plot
plt.xlabel('Customer ID') #our axis labels
plt.ylabel('Total Spending') 
for i, j in top_ten_customers['Sales'].items(): #our index, values and enumerator
    plt.text(i, j-10000, '$'+str(j), fontsize=11, rotation='vertical', horizontalalignment='center'); #annotation of Sales values in each bar and specifying the position of the values

#### 3.2.1: Revenue and Sales in terms of the Segments

Let's see Revenues generated by the three Segments for the review period.

In [None]:
segment_sales = store_df.groupby(['Segment', 'Order Year'])[["Sales"]].sum().round(2).reset_index() #round off revenues and reset index
segment_sales

In [None]:
plt.figure(figsize=(7,4))
sns.countplot(x=store_df.Segment)
plt.title('SuperStore - Sales and Segments', fontsize=11, fontweight='bold')
plt.ylabel('Number of Sales')
plt.xlabel('Segment')
plt.show()

### 3.3: The Store's Order Fulfilment

-  The Shipping Modes
-  Average days taken for Order fulfilment, i.e. from Order Date to Ship Date
-  Does Fulfilment duration depend on Shipping Mode?

In [None]:
store_df['Ship Mode'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
store_df['Ship Mode'].value_counts().plot.pie()

The Standard Class shipping mode is the one most often chosen by customers for their orders.

Let's obtain the Order fulfilment days, i.e. the number of days between the Order Date and the Ship Date, which tells us how long the store takes to fulfil an order.

In [None]:
store_df['Fullfilment Days'] = (store_df['Ship Date'] - store_df['Order Date']).abs() #a new column holding the difference between the two date columns
#abs() was added because the raw difference was negative in some rows; that usually points to a date-parsing problem (e.g. day/month order), so the affected rows are worth inspecting

In [None]:
store_df.sample(2) #check if column added

In [None]:
store_df['FullfilmentDays'] = store_df['Fullfilment Days'].dt.days #extract the whole number of days directly from the timedelta column
store_df.drop(columns='Fullfilment Days', inplace=True) #then drop the old timedelta column
store_df.dtypes #confirm the datatype of the new column

In [None]:
store_df['FullfilmentDays'] = store_df['FullfilmentDays'].astype(int) #turn the new column into integer data type for calculations

In [None]:
store_df['FullfilmentDays'].sample(10)

In [None]:
round(store_df['FullfilmentDays'].median()) #the median number of days taken to fulfil an order (sorting first is not needed)

In [None]:
pd.pivot_table(store_df, 'FullfilmentDays', index='Ship Mode', aggfunc='median').round(0).reset_index() #median days taken to fulfil orders, by shipping mode
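
A single `groupby` can also report several statistics side by side, which makes it easier to compare the median with the mean. The sketch below runs on toy data; the ship mode names other than Standard Class and the `FulfilmentDays` column name are illustrative.

```python
import pandas as pd

# Toy orders; values and some ship mode names are illustrative only.
toy = pd.DataFrame({
    "Ship Mode": ["Same Day", "Standard Class", "Standard Class",
                  "First Class", "Standard Class"],
    "Order Date": pd.to_datetime(["2018-03-01"] * 5),
    "Ship Date": pd.to_datetime(["2018-03-01", "2018-03-06", "2018-03-05",
                                 "2018-03-03", "2018-03-08"]),
})

# .dt.days turns a timedelta column into whole days without string handling.
toy["FulfilmentDays"] = (toy["Ship Date"] - toy["Order Date"]).dt.days

summary = toy.groupby("Ship Mode")["FulfilmentDays"].agg(["median", "mean", "count"])
print(summary)
```

Including the count next to each statistic shows how many orders stand behind it, which matters when one shipping mode is much rarer than the others.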

The Standard Class shipping option typically takes noticeably longer than the other 3 options. Despite this it is the most popular, as seen earlier, with 5859 orders out of the 9800. So customers do not seem to mind how quickly their order is shipped.

Another way of getting a similar result (this time using the mean), sorted from largest and plotted:

In [None]:
store_df.groupby(by='Ship Mode')['FullfilmentDays'].mean().nlargest().round(0).plot.bar(ylabel='Fulfilment Days', xlabel='Shipping Mode') #select the column before averaging so non-numeric columns are never aggregated

### 3.4: Product Performances

-  Performance of Categories and Sub-Categories
-  Top Selling Products
-  Units Sold

#### 3.4.1: Categories Performance

In [None]:
top_category = store_df.groupby("Category")[["Sales"]].sum().sort_values("Sales", ascending=False)  #select 'Sales' before summing, then sort the Categories by sales
total_revenue_category = top_category["Sales"].sum() #the total revenue across all categories
total_revenue_category = str(int(total_revenue_category)) #convert the total from float to int and then to string
total_revenue_category = '$' + total_revenue_category #add the '$' sign before the value
top_category.reset_index(inplace=True) #since we used groupby, reset the index to turn Category back into a column
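
The float-to-int-to-string conversion above truncates the cents. As a small sketch with a made-up figure, an f-string format specification can round the value and add thousands separators in one step.

```python
# Made-up figure, only to show the formatting.
total_revenue = 1234567.89

truncated = '$' + str(int(total_revenue))   # the approach used above: drops the cents
formatted = f"${total_revenue:,.0f}"        # rounds and adds thousands separators

print(truncated, formatted)
```

Either label works for the chart; the second is just easier to read at a glance.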

In [None]:
plt.rcParams["figure.figsize"] = (13,5) #size of plot
plt.rcParams['font.size'] = 12 #font size
plt.rcParams['font.weight'] = 6 #font weight
# we don't want to look at the percentage distribution in the pie chart. Instead, we want to look at the exact revenue generated by the categories.
def autopct_format(values): 
    def my_format(pct): 
        total = sum(values) 
        val = int(round(pct*total/100.0))
        return ' ${v:d}'.format(v=val)
    return my_format
colors = ['#00ffed','#6454f0','#ee4d5f'] #colors the pie chart
explode = (0.05,0.05,0.05)
fig1, ax1 = plt.subplots()
ax1.pie(top_category['Sales'], colors = colors, labels=top_category['Category'], autopct= autopct_format(top_category['Sales']), startangle=90,explode=explode)
centre_circle = plt.Circle((0,0),0.82,fc='white') #drawing a circle on the pie chart to make it look better 
fig = plt.gcf()
fig.gca().add_artist(centre_circle) #add the circle on the pie chart
#equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal') 
#the total revenue generated by all the categories at the center
label = ax1.annotate('Total Revenue \n'+str(total_revenue_category),color = 'red', xy=(0, 0), fontsize=12, horizontalalignment="center")
plt.title('SuperStore - Contribution of Each Category to Total Revenue: 2015 - 2019') #title of plot
plt.tight_layout()
plt.show()

#### 3.4.2: Product Sales

##### 3.4.2.1: Products with highest Revenues

In [None]:
top_products=store_df.groupby('Product Name')[['Sales']].sum().sort_values('Sales',ascending=False).head(10) #total Sales per product, sorted highest first, keeping only the first 10
top_products=top_products.round(2) #round off Sales to 2 decimal places
top_products.reset_index(inplace=True)
top_products #show results

In [None]:
plt.figure(figsize=(20,5)) #size of plot
plt.bar(top_products['Product Name'],top_products['Sales'],color='#d57eea',edgecolor='green') #what to plot
plt.xticks(rotation='vertical') #xaxis ticks orientation
plt.title('SuperStore - Ten Products with The Highest Revenues for The review period: 2015 - 2019',fontsize=12, fontweight='bold') #title of the plot
plt.xlabel('Product',fontsize=12) #xaxis labels
plt.ylabel('Total Revenue',fontsize=12)

##### 3.4.2.2: Units Sold for Each Product, and by Category and Sub-Category

In [None]:
store_df['Quantity'].sum()

The Store sold a total of 37143 units during the period 2015 - 2019.

In [None]:
units_sold = store_df.groupby(['Product Name'])[["Quantity"]].sum().sort_values('Quantity', ascending=False).reset_index().head(15) #get quantities sold by product and sort by highest and limit to 15
units_sold

As observed, staples, staple envelopes and easy-staple paper were three of the most sold products by the store.

In [None]:
plt.figure(figsize=(20,5)) #size of plot
plt.bar(units_sold['Product Name'],units_sold['Quantity'],color='#d57eea',edgecolor='green') #what to plot
plt.xticks(rotation='vertical') #xaxis ticks orientation
plt.title('SuperStore - Most Purchased Products in the Review Period: 2015 - 2019',fontsize=12, fontweight='bold') #title of the plot
plt.xlabel('Product',fontsize=12) #xaxis labels
plt.ylabel('Units Sold',fontsize=12)
for i, j in units_sold['Quantity'].items(): #our index, values and enumerator
    if j>200:
        plt.text(i, j-50, str(j), fontsize=11, rotation='vertical', horizontalalignment='center');
    else:
        plt.text(i, j+4, str(j), fontsize=11, rotation='vertical', horizontalalignment='center');

### 3.5: Sales Based on Cities and Regions


#### 3.5.1: Regions and Sales

Regional distribution of the 9800 orders

In [None]:
store_df['Region'].value_counts() #show regions and the number of sales


In [None]:
plt.figure(figsize=(6,4)) #size of plot
plt.title('SuperStore - Sales Made in Regions: 2015 - 2019', fontsize=11, fontweight='bold') #title of plot
plt.xlabel('Region') #xaxis label
plt.ylabel('Sales Made') #ylabel of plot
sns.countplot(x=store_df['Region']) #what to plot

#### 3.5.2: Sales and Revenue based on States

##### 3.5.2.1: States with Highest Revenues

In [None]:
top_states = store_df.groupby("State")[["Sales"]].sum().sort_values("Sales", ascending=False).head(20) # total Sales per State, top 20
top_states = top_states.round(2) # Round off the Sales Value to 2 decimal places
top_states.reset_index(inplace=True)

In [None]:
plt.figure(figsize=(20,5))
plt.bar(top_states['State'],top_states['Sales'],color='#9fa5d5',edgecolor='blue')
plt.xticks(rotation='vertical')
plt.title('SuperStore - 20 States with The Highest Revenues for The Review Period: 2015 - 2019',fontsize=12, fontweight='bold')
plt.xlabel('State',fontsize=12)
plt.ylabel('Total Revenue',fontsize=12)
for i,j in top_states["Sales"].items(): #To show the exact revenue generated on the figure
    if j>400000:
        plt.text(i,j-150000,'$'+ str(j), fontsize=12,rotation=90,color='k', horizontalalignment='center'); #annotations inside chart if its above 400000
    else:
        plt.text(i,j+15000,'$'+ str(j), fontsize=12,rotation=90,color='k', horizontalalignment='center'); #else annotations above bar

As observed, the State of California generated the most revenue for SuperStore

##### 3.5.2.2: States with Lowest Sales and Revenue

In [None]:
store_df.groupby('State')['Sales'].sum().nsmallest(10) #10 states that generated the lowest revenue

### 3.7: Losses Experienced By SuperStore

Despite an overall growth in profit, the Store also had its share of losses. To analyze this we will take the Profit column and filter out a dataframe with Profit less than 0.

In [None]:
losses_df = store_df[store_df['Profit'] < 0] #a new dataframe with the records that have 'Profit' less than 0
losses_df.shape #see affected records

Of the 9800 transactions, 1847 resulted in losses for the store
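
Expressed as a share, using only the two counts reported above, the arithmetic can be checked as follows.

```python
total_transactions = 9800   # row count reported earlier
loss_transactions = 1847    # rows with Profit < 0

loss_share = loss_transactions / total_transactions
print(f"{loss_share:.2%}")  # share of transactions that made a loss
```

That is, a little under one in five transactions was loss-making.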

In [None]:
total_loss = np.negative(losses_df['Profit'].sum().round(2)) #sum of negative values

print("and the total loss for the Review Period (2015-2019) is %.2f" % total_loss) #print the total loss to 2 decimal places

In [None]:
plt.figure(figsize=(20,5))
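loss_by_subcategory = losses_df.groupby('Sub-Category')['Profit'].sum().abs().sort_values(ascending=False) #total loss per sub-category, as a positive amount
plt.bar(loss_by_subcategory.index, loss_by_subcategory.values) #plot the aggregated losses rather than the raw Sales of each row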
plt.title('SuperStore - Losses in Sub-Categories: 2015 - 2019', fontsize=11, fontweight='bold')
plt.ylabel('Total Loss')
plt.xlabel('Sub-Category')
plt.show()

### 3.8: Correlations

In [None]:
store_df_correlations = store_df.corr(method='pearson', numeric_only=True) #Pearson correlation matrix over the numeric columns only
sns.heatmap(store_df_correlations, annot=True) #our plot with annotations of results
plt.title('SuperStore - Correlation Matrix of Column Elements', fontsize=12, fontweight='bold')
plt.xlabel('Sales Features')
plt.ylabel('Sales Features')
plt.show()


Straight away from the plot, we notice that Discount and Profit have a negative correlation.

In [None]:
plt.scatter(x=store_df['Discount'], y=store_df['Profit'], alpha=0.5)
plt.title('SuperStore - Discount vs Profit', fontsize=11, fontweight='bold')
plt.xlabel('Discount')
plt.ylabel('Profit')
plt.show()
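
A scatter plot with thousands of points can be hard to read, so one complementary view is the average profit per discount band. The sketch below uses toy numbers and hypothetical band edges chosen only for illustration.

```python
import pandas as pd

# Toy numbers; the band edges and labels are hypothetical choices.
toy = pd.DataFrame({
    "Discount": [0.0, 0.1, 0.2, 0.4, 0.5, 0.8],
    "Profit": [50.0, 30.0, 10.0, -5.0, -20.0, -60.0],
})

bands = pd.cut(toy["Discount"], bins=[-0.01, 0.2, 0.5, 1.0],
               labels=["0-20%", "21-50%", "51-100%"])

# Mean profit within each discount band.
profit_by_band = toy.groupby(bands, observed=True)["Profit"].mean()
print(profit_by_band)
```

On the real data, the band where the mean profit turns negative would be a natural starting point for reviewing the discount policy.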

#### 3.8.1: Numerizing All Columns So They Can Be Included In The Correlation Matrix

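As seen above, the columns that do not contain numeric values are ignored. To include them, we have to numerize the columns, i.e. replace each distinct value with an integer code. Keep in mind that these codes are arbitrary labels, so correlations involving the encoded columns should be read as rough hints rather than measured relationships.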

In [None]:
store_df.head(1)

In [None]:
store_df_numeric = store_df.drop(columns='Country').apply(lambda x: x.factorize()[0]).corr(method='pearson')
store_df_numeric
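
To make the encoding step concrete, the sketch below shows what `factorize` returns for a small text column. Codes are assigned in order of first appearance, so the same values in a different row order receive different codes. The segment names are illustrative, and `get_dummies` is shown only as one common alternative encoding.

```python
import pandas as pd

# Illustrative values for a text column.
segments = pd.Series(["Consumer", "Corporate", "Consumer", "Home Office"])

codes, uniques = pd.factorize(segments)               # codes follow order of first appearance
codes_reversed, _ = pd.factorize(segments.iloc[::-1]) # same values, reversed row order

# One-hot encoding: one indicator column per distinct value.
dummies = pd.get_dummies(segments)

print(codes.tolist(), list(uniques))
print(codes_reversed.tolist())
print(dummies.columns.tolist())
```

Since the integer codes carry no natural ordering, a Pearson coefficient computed on them depends on that arbitrary assignment.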

In [None]:
plt.figure(figsize=(25,15))
sns.heatmap(store_df_numeric, annot=True) #our plot with annotations of results
plt.title('SuperStore - Correlation Matrix of Column Elements', fontsize=12, fontweight='bold')
plt.xlabel('Sales Features')
plt.ylabel('Sales Features')
plt.show()

In [None]:
correlated_pairs = store_df_numeric.unstack()
correlated_pairs.sort_values()

In [None]:
higher_corr = correlated_pairs[(correlated_pairs) > 0.5]
higher_corr
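
An unstacked correlation matrix lists every pair twice and also includes each column paired with itself (always 1.0). A possible way to keep each pair once, sketched on a toy frame with made-up columns, is to mask everything except the cells above the diagonal before stacking.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up columns.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})
corr = toy.corr(method="pearson")

# Keep only the cells strictly above the diagonal: each pair once, no self-pairs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()     # drop the masked (NaN) cells
strong = pairs[pairs.abs() > 0.5]  # strong positive or negative correlations

print(strong)
```

Filtering on the absolute value keeps strong negative correlations as well, which a plain `> 0.5` filter would leave out.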

[Part 2: Customer Retention](https://github.com/WayneNyariroh/customer-retention_cohortAnalysis/blob/main/RetentionAnalysis.ipynb)