# Project: Cardio Good Fitness - Data Analysis

## Objective

* Come up with a customer profile (characteristics of a customer) of the different products
* Perform univariate and multivariate analyses
* Generate a set of insights and recommendations that will help the company in targeting new customers.

The analysis is divided into four sections:

- <a href = #link1>Understanding the structure of the data</a>
- <a href = #link2>Univariate Data Analysis</a>
- <a href = #link3>Multivariate Data Analysis</a>
- <a href = #link4>Conclusion and Recommendations</a>

## <a name='link1'>**Understanding the structure of the data**</a>
An overview of the dataset's shape and data types, a statistical summary, and a check for missing values.

In [None]:
# import required packages - numpy, pandas, matplotlib and seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load the dataset
data = pd.read_csv('CardioGoodFitness.csv')
data.head(10) # Check first 10 rows of the data

In [None]:
data.info() # Print a concise summary of the data - row count, columns, nulls, data types etc.

In [None]:
data.shape # check no. of rows and columns

> **Observation**: There are **180** rows and **9** columns in the dataset

In [None]:
data.describe() # Statistical summary of the numerical columns

> **Observations**
> * Customers' ages range from 18 to 50 years with a mean of 29 years; 75% of customers are 33 or younger
> * Education ranges from 12 to 21 years with a mean of 16 years
> * Weekly usage hours range from 2 to 7 with a mean of 3.4 hours
> * Fitness ranges from 1 (lowest) to 5 (highest: very fit); because it takes only five values, it can also be treated as a categorical variable
> * Income ranges from 29k to 104k with a mean of 54k and a standard deviation of ~16k
> * Miles range from 21 to 360 with a mean of 103 and a standard deviation of ~52
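
The "75%" row of `describe()` is the 0.75 quantile of each column. As a minimal sketch, assuming a small made-up list of ages (not rows from the dataset), the same figure can be read off directly and checked against the share of values at or below it:

```python
import pandas as pd

# Hypothetical ages, made up for illustration (not rows from the dataset)
ages = pd.Series([18, 22, 25, 26, 28, 33, 35, 40, 50], name="Age")

# The "75%" row of describe() is the 0.75 quantile of the column
q75 = ages.quantile(0.75)

# Share of values at or below that quantile
share_at_or_below = (ages <= q75).mean()

print(f"75th percentile: {q75}, share at or below: {share_at_or_below:.0%}")
```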

#### Check the range of values for categorical variables - Product, Gender and MaritalStatus

In [None]:
data['Product'].value_counts().sort_values(ascending=False)

> **Observation**: There are three different types of Fitness products.

In [None]:
data['Gender'].value_counts()

> **Observation**: As expected, the Gender column has two values: Male (104 data points) and Female (76 data points)

In [None]:
data['MaritalStatus'].value_counts()

> **Observation**: Marital status has two values: Partnered (107 data points) and Single (73 data points)

In [None]:
data.isnull().sum() # Count the null values in each column

> **Observation**: There are no **null** values in the dataset

## <a name='link2'>**Univariate Data Analysis**</a>
This section shows the distribution of each feature along with important observations.

#### Check the different Cardio Good Fitness products in the sample dataset.

In [None]:
# Use countplot to plot the number of products and show their percentages
ax = sns.countplot(data=data, x='Product');
ax.set(ylabel='Product Count', title='Product Count and Percent of Total');
# add percentages to bars
for c in ax.containers:
    labels = [f'{h/data.Product.count()*100:0.2f}%' if (h := v.get_height()) > 0 else '' for v in c]
    ax.bar_label(c, labels=labels, label_type='edge')

> **Observation**: **TM195** is the most popular product with a share of 44%, followed by the **TM498** and **TM798** models.
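
The percentages drawn on the bars can also be cross-checked as a table. This is a minimal sketch on a made-up `products` series (hypothetical values, not the dataset); in the notebook the same call would presumably be applied to `data['Product']`:

```python
import pandas as pd

# Hypothetical purchases, made up for illustration (not the real counts)
products = pd.Series(["TM195"] * 4 + ["TM498"] * 3 + ["TM798"] * 3, name="Product")

# normalize=True returns fractions of the total instead of raw counts
share = products.value_counts(normalize=True).mul(100).round(2)

print(share)
```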

Check the count of products across the other categorical variables:
* Fitness
* Gender
* Marital Status

In [None]:
# Add three plots in a row using figure and subplots
sns.set_style("darkgrid", {"axes.facecolor": ".9"}) # Customize seaborn style
fig = plt.figure(figsize=(14,5))

# Add the first plot
ax1 = fig.add_subplot(131)
ax1.set_title('Products by Fitness')
sns.histplot(data=data, x='Fitness', hue='Product', multiple='stack', ax=ax1)

# Add the second plot
ax2 = fig.add_subplot(132)
ax2.set_title('Products by Gender')
sns.countplot(data=data, x='Gender', hue='Product', ax=ax2)

# Add the third plot
ax3 = fig.add_subplot(133)
ax3.set_title('Products by Marital Status')
sns.histplot(data=data, y='MaritalStatus', hue='Product', multiple='dodge', ax=ax3)

plt.tight_layout()

> **Observations**
> * Most customers belong to Fitness category 3, where the popular products are **TM195** and **TM498**
> * **TM798** is popular with higher-fitness customers, especially category 5
> * **TM798** is not very popular with Female customers
> * There are more partnered customers than single customers
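
A cross-tabulation gives the numbers behind such grouped count plots. The sketch below uses a tiny made-up frame named `toy` (hypothetical rows, not the dataset) and shows both raw counts and row-wise shares:

```python
import pandas as pd

# Hypothetical rows, made up for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Product": ["TM195", "TM498", "TM798", "TM195", "TM195", "TM498"],
})

# Raw counts of each product per gender
counts = pd.crosstab(toy["Gender"], toy["Product"])

# normalize="index" turns each row into shares that sum to 1
row_share = pd.crosstab(toy["Gender"], toy["Product"], normalize="index").round(2)

print(counts)
print(row_share)
```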

Next, let's check the spread of each numerical variable by product using box and violin plots.

In [None]:
sns.set_palette('deep')
fig = plt.figure(figsize=(15,10))

# Adding subplot arrangement to keep the visuals side by side for comparison
ax = fig.add_subplot(231)
plt.title('Spread by Age')
sns.boxplot(data=data, x='Product', y='Age', whis=1.75) # Whiskers normally extend to 1.5 times the IQR; here they are set to 1.75

ax = fig.add_subplot(232)
plt.title('Spread by Education')
sns.boxplot(data=data, x='Product', y='Education')

ax = fig.add_subplot(233)
plt.title('Spread by Income')
sns.boxplot(data=data, x='Product', y='Income')

ax = fig.add_subplot(234)
plt.title('Spread by Fitness')
sns.violinplot(data=data, x='Product', y='Fitness')

ax = fig.add_subplot(235)
plt.title('Spread by Usage')
sns.boxplot(data=data, x='Product', y='Usage')

ax = fig.add_subplot(236)
plt.title('Spread by Miles')
sns.boxplot(data=data, x='Product', y='Miles')

plt.tight_layout()

> **Observations**
> * *Age*: The interquartile range of Age for TM798 skews younger than for the other two models, although a few outliers remain even with whiskers at 1.75 times the IQR
> * *Education*: TM798 is more popular with customers who have a higher level of education.
> * *Income*: TM798 is more popular with customers who have a higher level of income.
> * *Fitness*: As seen earlier, TM798 is preferred by higher-fitness customers.
> * *Usage*: TM498 is used less than the other two models. TM798 has the highest usage.
> * *Miles*: Customers with the TM798 model also run more miles than customers with the other two models
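
The effect of the whisker length on what counts as an outlier can be reproduced without a plot. This is a minimal sketch, assuming the usual box-plot rule (points beyond `Q1 - whis*IQR` or `Q3 + whis*IQR` are drawn as outliers); the helper `iqr_outliers` and the ages are hypothetical:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values lying beyond whis * IQR from the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical ages, made up for illustration
ages = pd.Series([18, 22, 24, 25, 26, 27, 28, 30, 35])

# The same data flagged with the default and the widened whiskers
default_outliers = iqr_outliers(ages, whis=1.5)
wide_outliers = iqr_outliers(ages, whis=1.75)

print(list(default_outliers), list(wide_outliers))
```

Widening the whiskers only changes which points are drawn individually; the box (quartiles and median) stays the same.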

#### Let's segment the customers into smaller groups based on age, income and education.
* Age Groups: <30, 30-40, 40-50
* Income Groups: <40k, 40k-80k, >80k
* Education Groups: HS (High School, up to 12 years of education), UG (Undergraduate, up to 16 years of education), PG (Postgraduate, more than 16 years)

In [None]:
# Use pandas' cut function to segment the values into labelled bins
data['AgeGroup'] = pd.cut(data.Age, bins=[10, 30, 40, 50], include_lowest=False, labels=['<30', '30-40', '40-50'])
data['IncomeGroup'] = pd.cut(data.Income, bins=[25000,40000,80000,120000], include_lowest=False, labels=['<40k', '40k-80k', '>80k'])
data['EducationGroup'] = pd.cut(data.Education, bins=[10,12,16,30], include_lowest=False, labels=['HS', 'UG', 'PG'])
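
By default `pd.cut` builds right-closed intervals, so a value equal to a bin edge falls into the lower group. The sketch below uses made-up ages (hypothetical, chosen to sit on the edges) with the same age bins as above:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([18, 30, 31, 40, 50])

# Bins are (10, 30], (30, 40], (40, 50] because right=True is the default
groups = pd.cut(ages, bins=[10, 30, 40, 50], labels=["<30", "30-40", "40-50"])

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

With these bins a 30-year-old is labelled `<30` and a 40-year-old `30-40`, which is worth keeping in mind when reading the group labels.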

Check the product counts across these custom groups using simple countplots:

In [None]:
fig = plt.figure(figsize=(14,5))
# Adding subplot arrangement to keep the visuals side by side for comparison
ax = fig.add_subplot(131)
plt.title('Products by AgeGroup')
sns.countplot(data=data, x='AgeGroup', hue='Product')

ax = fig.add_subplot(132)
plt.title('Products by IncomeGroup')
sns.countplot(data=data, x='IncomeGroup', hue='Product')

ax = fig.add_subplot(133)
plt.title('Products by EducationGroup')
g = sns.countplot(data=data, x='EducationGroup', hue='Product')

plt.tight_layout()

> **Observations**
> * Most customers are 30 or younger. TM195 is strongly preferred by young customers.
> * Few customers fall in the 40 to 50 age group.
> * There are no data points for TM798 among customers who earn less than 40k.
> * There are no data points for TM195 and TM498 among customers who earn more than 80k.
> * There are very few observations for customers with HS education; if the sample is purely random, this suggests that customers with only HS education are less interested in buying exercise equipment.
> * As seen previously, TM798 is preferred by more highly educated customers.

The plot below shows the probability distribution of Miles for each Product:

In [None]:
fig = plt.figure(figsize=(10,5))
sns.set_style('whitegrid')
sns.histplot(data=data, x='Miles', kde=True, hue='Product', element='step', stat='probability');

> **Observations**
> * Customers with the TM195 and TM498 models have the highest probability at around 100 miles
> * Customers with the TM798 model have a higher probability of running more than 150 miles
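
With `stat='probability'` the bar heights are bin counts divided by the total, so they sum to 1. A minimal NumPy sketch of that normalisation, on made-up miles values (hypothetical, not the dataset):

```python
import numpy as np

# Hypothetical miles values, made up for illustration
miles = np.array([47, 66, 75, 85, 94, 103, 113, 120, 160, 200])

# Bin edges 0, 50, 100, 150, 200 (the last bin includes its right edge)
edges = np.arange(0, 250, 50)

counts, _ = np.histogram(miles, bins=edges)

# Probability per bin = count in the bin / total number of observations
prob = counts / counts.sum()

print(dict(zip(edges[:-1].tolist(), prob.tolist())))
```

As far as I know, `sns.histplot` normalises over all hue levels together by default (`common_norm=True`), so a product with fewer customers gets lower bars overall; passing `common_norm=False` would normalise each product separately.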

The below plot shows frequency distribution of Age across products:

In [None]:
fig = plt.figure(figsize=(14,5))
ax1 = fig.add_subplot(121)
sns.histplot(data=data, x='Age', kde=True, hue='Product', stat='frequency', ax=ax1);
ax1.set(title='Frequency Distribution of Age');

ax2 = fig.add_subplot(122)
sns.histplot(data=data, x='Age', kde=True, hue='Product', stat='frequency', element='poly', ax=ax2);
ax2.set(title='Frequency Distribution of Age (Zoomed In)');
# Zoom in between 20 and 40 years of age
ax2.set_xlim([20,40]);
ax2.set_xticks(range(20,41,2));

# Show the median age of TM195 as a dashed blue line
TM195_Median_Age = data[data['Product']=='TM195']['Age'].median()
ax2.axvline(TM195_Median_Age, color='b', linestyle='--');

# Show the median age of TM498 as a dash-dot red line
TM498_Median_Age = data[data['Product']=='TM498']['Age'].median()
ax2.axvline(TM498_Median_Age, color='r', linestyle='-.');

# Show the median age of TM798 as a dotted green line
TM798_Median_Age = data[data['Product']=='TM798']['Age'].median()
ax2.axvline(TM798_Median_Age, color='g', linestyle=':');


> **Observations** 
> * Customers around 25 to 26 years of age are the most frequent buyers of these products. Age has a right skew.
> * TM498 also has frequent buyers around 34 years of age.
> * The median ages of TM195 and TM498 customers coincide at 26.
> * TM798 has a median age of around 27 years.
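
The per-product medians can be computed in one step with a group-by rather than one filter per product. A minimal sketch on a made-up frame named `toy` (hypothetical rows, not the dataset):

```python
import pandas as pd

# Hypothetical rows, made up for illustration
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM195", "TM498", "TM498", "TM798", "TM798", "TM798"],
    "Age": [23, 26, 31, 25, 27, 24, 27, 30],
})

# One median per product, indexed by product name
median_age = toy.groupby("Product")["Age"].median()

print(median_age)
```

The resulting series can be iterated with `median_age.items()` to draw one vertical line per product.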

The plot below shows the income density distribution of customers across products:

In [None]:
g=sns.displot(data=data, x='Income', kind='kde', hue='Product', rug=True, aspect=2);
g.set_xticklabels(rotation=90);

# g.axes[0][0] returns the underlying Axes from this FacetGrid
ax = g.axes[0][0]

# Show the median income of TM195 as a dashed blue line
TM195_Median_Income = data[data['Product']=='TM195']['Income'].median()
ax.axvline(TM195_Median_Income, color='b', linestyle='--');

# Show the median income of TM498 as a dash-dot orange line
TM498_Median_Income = data[data['Product']=='TM498']['Income'].median()
ax.axvline(TM498_Median_Income, color='#FF5733', linestyle='-.');

# Show the median income of TM798 as a dashed green line
TM798_Median_Income = data[data['Product']=='TM798']['Income'].median()
ax.axvline(TM798_Median_Income, color='g', linestyle='--');

> **Observations**
> * Most TM195 customers have incomes between 30k and 60k; for TM498 the range is 40k to 60k.
> * As seen previously, most TM798 customers have incomes over 60k
> * Median income for TM498 is a little higher than for TM195
> * Median income for TM798 is much higher than for TM498 and TM195

We can also use pandas-profiling to quickly profile every variable, as shown below:

In [None]:
# Install pandas-profiling if it is not already installed
# (newer releases of this package are published as ydata-profiling)
# !pip install pandas-profiling
import pandas_profiling

In [None]:
data.profile_report() # Generate a profile report for Univariate analysis

## <a name='link3'>**Multivariate Data Analysis**</a>
Analysis of the interactions between features in the dataset, with important observations.

We first compute the correlations between the numerical variables and then show them as a heat map. This gives important insight into how the variables are related.

In [None]:
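# Restrict to numeric columns: the string and binned group columns cannot be correlated
data.select_dtypes(include='number').corr()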

In [None]:
plt.figure(figsize=(12,6))
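# Correlate numeric columns only; the string and binned group columns are excluded
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, cmap='Spectral');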

> **Observations**
> * Income generally increases with age (corr=0.51)
> * The data does not suggest any significant correlation between Age and Fitness
> * Customers with more education have higher income (corr=0.63)
> * Customers with higher usage run more miles, and vice versa (corr=0.76)
> * Customers who use the products more have higher fitness (corr=0.67)
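
The coefficients reported by `corr()` are Pearson correlations by default: the covariance of two columns divided by the product of their standard deviations. A minimal sketch that computes it by hand and compares it with pandas, on made-up `usage` and `miles` values (hypothetical, not the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical paired values, made up for illustration
usage = pd.Series([2, 3, 3, 4, 5, 6])
miles = pd.Series([40, 75, 85, 100, 150, 200])

# Deviations from the mean
dx = usage - usage.mean()
dy = miles - miles.mean()

# Pearson r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = usage.corr(miles)

print(round(r_manual, 4), round(r_pandas, 4))
```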

The following visuals examine some of these relationships in more detail.

In [None]:
# Linear regression fits of income on education, per product model and marital status
sns.lmplot(data=data, x='Education', y='Income', col='MaritalStatus', hue='Product');

In [None]:
plt.figure(figsize=(10,5))
sns.stripplot(data=data, x='Education', y='Income', hue='Product', jitter=0.2);

> **Observations**: (above two plots)
> * Customers with higher education and higher income prefer TM798.
> * At the same education level, single customers with higher income prefer TM498 over TM195
> * Based on customer income levels and model preferences, TM798 appears to be the premium model, followed by TM498 and TM195
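
The regression lines in such plots are straight-line fits computed separately for each group. As a rough sketch of the idea (not necessarily the exact routine seaborn uses), `np.polyfit` with degree 1 gives the slope and intercept per product on a made-up frame named `toy` (hypothetical values):

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration
toy = pd.DataFrame({
    "Product": ["TM195"] * 4 + ["TM798"] * 4,
    "Education": [12, 14, 16, 18, 14, 16, 18, 20],
    "Income": [30000, 40000, 50000, 60000, 60000, 75000, 90000, 105000],
})

# Fit Income = slope * Education + intercept separately for each product
fits = {}
for product, grp in toy.groupby("Product"):
    slope, intercept = np.polyfit(grp["Education"].to_numpy(), grp["Income"].to_numpy(), deg=1)
    fits[product] = (slope, intercept)

for product, (slope, intercept) in fits.items():
    print(f"{product}: income rises by about {slope:.0f} per extra year of education")
```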

In [None]:
# Relationship between usage and fitness of customers
sns.relplot(data=data, x='Usage', y='Fitness', col='MaritalStatus', size='Gender', kind='line');

> **Observation**: Usage and Fitness are strongly correlated. Customers who use the exercise equipment more are fitter, and vice versa.

In [None]:
# Relation between Age and Fitness
sns.set_theme() # set to default theme
plt.figure(figsize=(10,5))
sns.swarmplot(data=data, x='Fitness', y='Age', hue='Product');

In [None]:
# Relation between Age and Miles run based on Gender, Product and Fitness
sns.relplot(data=data, x='Miles', y='Age', col='Gender', hue='Product', size='Fitness', sizes=(10, 150));

> **Observations**: (swarmplot:Fitness/Age and relplot:Miles/Age)
> * There is no strong correlation between Age (18-50) and either Fitness or Miles run
> * Most customers belong to Fitness level 3 (swarm plot)
> * Customers with fitness level 5 mostly prefer TM798 and run more miles; most customers in this category are Male

In [None]:
# Linear regression plot of Fitness against Miles, split by Gender and MaritalStatus
sns.lmplot(data=data, x='Miles', y='Fitness', col='MaritalStatus', hue='Gender', y_jitter=0.5);

> **Observation**: As expected, Miles and Fitness are strongly correlated regardless of gender and marital status.

In [None]:
sns.set_theme()
# Joint plot of Income and Miles to show bivariate and univariate analysis
sns.jointplot(data=data, x='Income', y='Miles', hue='Product');

> **Observations**
> * Most customers with the TM195 and TM498 models fall in the 25k to 75k income range and run up to 150 miles
> * TM798 customers come from a higher income group and run more miles, although their spread is wider than for the other two models.

We can also use a pairplot to analyze more than two numerical variables at a time, as shown below. Since the (Income, Education) and (Usage, Miles, Fitness) groups are correlated, we can reduce the pairplot to Age, Income and Usage.

In [None]:
sns.pairplot(data=data[['Age','Income','Usage','Product']], hue='Product', diag_kind='kde', height=3, aspect=1.2);

> **Observations** 
> * Younger customers with higher (Usage, Fitness, Miles) and (Income, Education) prefer TM798
> * Customers of TM195 and TM498 overlap, although younger customers in this group with higher income prefer TM498

## <a name='link4'>**Conclusion and Recommendations**</a>
Concluding remarks with key observations and recommendations

- <a href = #link5>Customers Demographic Observations</a>
- <a href = #link6>Product Observations</a>
- <a href = #link7>Customers Buying Preference Observations</a>
- <a href = #link8>Recommendations</a>


### <a name='link5'>**Customers Demographic Observations**</a>
* Customers are between 18 and 50 years of age, with an average age of around 29.
* Most of the customers are young to middle-aged (20 to 40).
* Most customers have 16 or more years of education.
* Most customers want to use these cardio fitness products for 3 to 4 hours per week and run about 50 to 150 miles.
* Most customers consider themselves medium fit (3), although many customers belong to fitness levels 4 and 5.
* Around 58% of customers are Male and 42% are Female.
* Around 60% of customers are Partnered and 40% are Single.
* Customers who use these products more are fitter and run more miles; the data shows a strong correlation.
* There is no strong correlation between fitness and customers' income, age or education.

### <a name='link6'>**Product Observations**</a>
* TM195 has the largest market share at about 44%, followed by TM498 and TM798
* About 78% of the market share comes from TM195 and TM498 combined
* TM798 may be the most expensive product, followed by TM498 (mid-range) and TM195
* TM798 is preferred by customers with higher income and education, or with fitness level 4/5

### <a name='link7'>**Customers Buying Preference Observations**</a>

We can also gain insight into customers' buying preferences based on certain demographics. For this we do the additional analysis below:

In [None]:
# Group by Gender and MaritalStatus, ordered by number of observations (descending)
df_age_group = data.groupby(['Gender','MaritalStatus'], as_index=False).agg({'Usage': 'mean', 'Miles': 'mean', 'Product': 'count'}).sort_values('Product', ascending=False)
df_age_group['BuyingPercent']=((df_age_group['Product'] / df_age_group['Product'].sum())*100).round(2).astype(str) + '%'
df_age_group

In [None]:
# Seeing visually the number of products based on Marital Status and Gender
g = sns.FacetGrid(data, col="Product", hue="Fitness", row='Gender', height=4, aspect=1)
g.map(sns.countplot, "MaritalStatus", order=['Partnered', 'Single']);
g.add_legend();

In [None]:
# Group by products based on Education, Income and Fitness
data.groupby(['Product'])[['Education', 'Income', 'Fitness']].agg(['median', 'max', 'min', 'count'])

In [None]:
# Count of products based on Income groups, Marital Status and Gender
g = sns.FacetGrid(data, col="Gender", row='MaritalStatus', hue="Product", height=4, aspect=1.25)
g.map(sns.countplot, "IncomeGroup", order=['<40k','40k-80k','>80k']);
g.add_legend();

> **Observations**
> * Partnered Male and Female customers are the company's go-to customers, followed by Single Male and Single Female customers
> * Partnered Female customers have mostly bought the TM195 model.
> * Customers with higher income and education mostly bought the TM798 model.
> * Customers with Fitness level 5 have mostly bought the TM798 model.
> * When comparing TM195 and TM498, customers with higher income normally preferred TM498
> * Lower-income customers prefer to settle for the TM195 model.
> * Female customers have shown little preference for the TM798 model, especially over the age of 40.
> * Customers with Fitness level 3 or lower prefer to buy the TM195 model, but those with higher income may go for TM498.


### <a name='link8'>**Recommendations**</a>
* Since most customers are young to middle-aged, the company needs to make an effort to reach and understand the needs of older customers (40 and older) to increase sales and revenue.
* Some customers with Fitness level 5 have bought TM195. Since TM798 is the preferred model for higher-fitness customers, the company can offer an upgrade plan to these customers.
* Similarly, to increase revenue, the company can provide an upgrade plan for customers with the TM498 model to move to TM798.
* A survey can be run to understand the needs of Single Female customers and increase their market share.
* A survey can be conducted to learn why Female customers are not choosing TM798. If required, the company can launch a new product targeting higher-income, more highly educated Female customers.
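
The upgrade recommendation for high-fitness customers translates into a simple filter. This is a minimal sketch on a made-up frame named `toy` (hypothetical rows); the selection rule (Fitness level 5 and not already on TM798) is one possible reading of the recommendation, not a definitive targeting rule:

```python
import pandas as pd

# Hypothetical customers, made up for illustration
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM798", "TM195"],
    "Fitness": [5, 3, 3, 5, 4],
    "Income": [52000, 38000, 51000, 90000, 46000],
})

# Fitness-5 customers who do not yet own the TM798 model
upgrade_candidates = toy[(toy["Fitness"] == 5) & (toy["Product"] != "TM798")]

print(upgrade_candidates)
```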