In [1]:
# EDA on Cardio Fitness Data

## Introduction

This data was collected on individuals who purchased a treadmill at a CardioGoodFitness retail store during the prior three months. [Dataset](https://www.kaggle.com/saurav9786/cardiogoodfitness). <br>
cardiogoodfitness.csv: The CSV contains data on customers who purchased different treadmill models from CardioGood Fitness:
* Product - model number of the treadmill
* Age - age of the customer, in years
* Gender - gender of the customer
* Education - education of the customer, in years
* Marital Status - marital status of the customer
* Usage - average number of times per week the customer wants to use the treadmill
* Fitness - self-rated fitness score of the customer (5 = very fit, 1 = very unfit)
* Income - income of the customer
* Miles - miles the customer expects to run

## Objective

Identify the profile of the typical customer for each treadmill product offered by CardioGood Fitness.

## Importing required packages

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Loading data into Dataframe

In [3]:
df = pd.read_csv('/kaggle/input/cardiogoodfitness/CardioGoodFitness.csv')
df_copy = df.copy(deep=True)
df.head()

In [4]:
print(df.shape)

Total records in the dataset = 180<br> Columns in the dataset = 9

## Data Pre-processing

In [5]:
## Checking for null values
df.isna().sum()

In [6]:
# Checking for duplicate rows
df.duplicated().sum()

No null or duplicate values were found in the dataset.
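The two checks above can be wrapped into one small helper, which is handy when the same checks are repeated after later transformations. This is only a sketch: `quality_report` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the treadmill data.
toy = pd.DataFrame({
    "Product": ["TM195", "TM498", "TM798", "TM195"],
    "Age": [18, 25, 40, 18],
    "Income": [29562, 50028, 90886, 29562],
})

def quality_report(frame):
    """Return counts of missing cells and fully duplicated rows."""
    return {
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

report = quality_report(toy)
print(report)
```

On the real data, `quality_report(df)` should report zero for both counts if the observation above holds.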

In [7]:
## Descriptive analysis
df.describe().T

In [8]:
df.info()

In [9]:
features = df.columns.values
for f in features :
    print(f,': ',df[f].unique())
    print()

* There are 3 different products in this dataset ('TM195', 'TM498', 'TM798').
* Customer age ranges from 18 to 50.
* Education ranges from 12 to 21 (years).
* Buyers include both Single and Partnered customers.
* Usage ranges from 2 to 7 (times/week).
* Fitness level ranges from 1 to 5.

We will also change the datatype of Gender, MaritalStatus and Product from object to category. <br>***Refer:***  [Advantages of using Categorical dtype](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)

In [10]:
tmp_features = ['Gender', 'MaritalStatus', 'Product']
for f in tmp_features:
    df[f] = df[f].astype("category")
df.info()

One point to highlight: <br>
**Compare the *memory usage* reported by `df.info()` here with the earlier output; it is lower after the conversion.** ;)
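To see the effect in isolation, the sketch below measures a single column stored both ways. The column is hypothetical (three product codes repeated many times), so the byte counts will not match the ones `df.info()` prints for the real data.

```python
import pandas as pd

# Hypothetical column: three product codes repeated many times.
products = pd.Series(["TM195", "TM498", "TM798"] * 1000, name="Product")

# deep=True counts the Python string objects, not just the pointers to them.
as_object = products.astype(object).memory_usage(deep=True)
as_category = products.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

A category column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of repeated values. Note that `df.info()` shows only a shallow estimate for object columns unless it is called with `memory_usage="deep"`.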

## EDA

### Univariate Analysis

In [11]:
features

**Categorical features**

In [12]:
def plot_uni_cat(d):
    f,ax = plt.subplots(nrows=1,ncols=2,figsize=(8,5))
    f.suptitle(d.name+' Wise Sale',fontsize=15)
    sns.countplot(x=d,ax=ax[0])  # pass the Series as x; recent seaborn versions require keyword arguments here
    d.value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1])
    plt.tight_layout()

#### Product

In [13]:
print(df.Product.value_counts())

In [14]:
plot_uni_cat(df['Product'])

Model TM195 was sold the most.

In [15]:
print(df.Gender.value_counts())
plot_uni_cat(df['Gender'])

There are more male buyers than female buyers.

In [16]:
print(df.MaritalStatus.value_counts())
plot_uni_cat(df['MaritalStatus'])

Couples are buying more treadmills than singles. <br> Probably a side effect of being in a relationship... Just joking ;)

**Numerical features**

In [17]:
def plot_uni(d):
    f,ax = plt.subplots(nrows=1,ncols=2,figsize=(10,5))
    sns.histplot(d, kde=True, ax=ax[0])
    # Label each reference line so the legend maps to the lines, not to the histogram artists
    ax[0].axvline(d.mean(), color='y', linestyle='--', linewidth=2, label='Mean')
    ax[0].axvline(d.median(), color='r', linestyle='dashed', linewidth=2, label='Median')
    ax[0].axvline(d.mode()[0], color='g', linestyle='solid', linewidth=2, label='Mode')
    ax[0].legend()
    
    sns.boxplot(x=d, showmeans=True, ax=ax[1])
    plt.tight_layout()

In [18]:
num_cols = df.select_dtypes('int64').columns.values
num_cols

In [19]:
for f in num_cols:
    plot_uni(df[f])

**Age**
<br> 
- Age is right-skewed.
- Very few customers buying a treadmill are older than 40 or younger than 20.
<br>

**Education**
<br>
- Most customers have 16 years of Education.
- There are a few outliers (higher end).
<br>

**Usage**
- Most users plan to use the treadmill 3-4 times/week.
- There are a few outliers (higher end).
<br>

**Fitness**
- Most customers have a fitness rating of 3-3.5 (moderately fit).
- Very few treadmill customers have a low score, i.e. 1. That's great news ;).
<br>

**Income**
- Income is right-skewed.
- Income may have outliers (higher end), as very few people earn >80k.
- Most customers have an income below 70k.
<br>

**Miles**
- Miles is right-skewed.
- Customers expect to run about 80 miles per week on average.
- There are some outliers, where customers expect to run more than 200 miles per week.
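The skew and outlier remarks above are read off the plots. A numeric cross-check could look like the sketch below, which uses the same 1.5 × IQR rule that the boxplot whiskers follow by default. The helper `skew_and_outliers` and the sample values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def skew_and_outliers(series):
    """Return sample skewness and the number of points outside the 1.5*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return series.skew(), n_outliers

# Hypothetical right-skewed values, not taken from the dataset.
miles = pd.Series([40, 60, 75, 85, 85, 95, 100, 110, 120, 300], name="Miles")
skewness, n_out = skew_and_outliers(miles)
print(f"skew={skewness:.2f}, outliers={n_out}")
```

A positive skewness corresponds to the long right tail described above; applying the helper to each column in `num_cols` would give one line per feature.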

### Bivariate analysis

In [20]:
df.groupby(by='Product')['Age'].mean() ## Average age of buyers for each product model

In [21]:
df.groupby('Product')['Income'].mean() ## Average income of buyers for each model

In [22]:
print(df[['Product','Gender']].value_counts().sort_index()) ## models bought by different Genders
sns.histplot(x='Product',data=df, hue='Gender', multiple="dodge", shrink=.8)

In [23]:
print(df[['Product','MaritalStatus']].value_counts().sort_index()) ## models bought by single vs couples
sns.histplot(x='Product',data=df, hue='MaritalStatus', multiple="dodge", shrink=.8)

In [24]:
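# numeric_only=True (pandas >= 1.5) skips the category columns, which have no Pearson correlation
sns.heatmap(data=df.corr(numeric_only=True),cmap="YlGnBu", annot=True ,linewidths=0.2, linecolor='white')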

* Age, Education, Usage, Fitness and Miles all show a notable correlation with Income.
* Usage and Fitness are highly correlated with Miles.
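Reading the strongest pairs off a heatmap by eye is error-prone, so one option is to rank them programmatically. The sketch below is an illustration only: `top_correlations` and the toy frame are hypothetical, and the values are not from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=3):
    """Return the n strongest pairwise Pearson correlations among numeric columns."""
    corr = frame.corr(numeric_only=True)  # requires pandas >= 1.5
    # Mask everything except the upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy values, not taken from the dataset.
toy = pd.DataFrame({
    "Usage": [2, 3, 3, 4, 5, 6],
    "Fitness": [2, 3, 3, 3, 4, 5],
    "Miles": [40, 75, 85, 100, 150, 200],
    "Gender": ["F", "M", "F", "M", "M", "F"],
})
top = top_correlations(toy)
print(top)
```

Sorting by absolute value keeps strong negative correlations in the ranking as well.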

### Multivariate Analysis

In [25]:
sns.catplot(x='Usage', y='Income', col='Gender',hue='Product' ,kind="bar", data=df) 

* Customers with lower incomes (<60K) prefer models TM195 & TM498 and expect to use the treadmill 2-5 times/week.
* Higher-earning customers mostly bought TM798 and expect to use the treadmill 4-6 times/week.

In [26]:
sns.catplot(x='Gender',y='Income', hue='Product', col='MaritalStatus', data=df,kind='bar')

In [27]:
pd.crosstab(index=df['Product'], columns=[df['MaritalStatus'],df['Gender']] )  

* Partnered female customers bought the TM195 model more than partnered male customers.
* Partnered male customers bought the TM498 & TM798 models more than single male customers.
* Single female customers bought the TM498 model more than single male customers.
* Single male customers bought the TM195 & TM798 models more than single female customers.
* The majority of treadmill buyers are men.
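Raw counts can mislead when the groups being compared differ in size, so it may help to look at row-wise shares as well. The sketch below uses the `normalize` option of `pd.crosstab` on a hypothetical toy frame; the numbers are not from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM195", "TM498", "TM498", "TM798"],
    "MaritalStatus": ["Partnered", "Partnered", "Single", "Single", "Partnered", "Partnered"],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Male"],
})

# normalize="index" turns each row into shares that sum to 1 per product.
shares = pd.crosstab(
    index=toy["Product"],
    columns=[toy["MaritalStatus"], toy["Gender"]],
    normalize="index",
)
print(shares.round(2))
```

Using `normalize="columns"` instead would answer the reverse question: which product each marital status and gender group tends to choose.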

## Conclusion

### Final Observation

* TM195 is the most purchased model (44.4%), followed by TM498 (33.3%). TM798 is the least sold model (22.2%).
* There are more Male customers (57.8%) than Female customers (42.2%).
* Average usage of males is higher than average usage of females.
* Customers buying treadmills are young; the average customer age is 28.
* Most customers earn less than 70K and prefer the TM195 & TM498 models.
* 59.4% of the customers who purchased a treadmill are partnered.
* Customers' average education is 16 years.
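Shares like the ones above can be computed directly with `value_counts(normalize=True)`. In the sketch below, the counts 80, 60 and 40 are an assumption reconstructed from the reported percentages of 180 records rather than read from the file.

```python
import pandas as pd

# Counts implied by the reported shares of 180 records (assumed: 80 + 60 + 40).
product = pd.Series(["TM195"] * 80 + ["TM498"] * 60 + ["TM798"] * 40, name="Product")

# Convert relative frequencies to percentages with one decimal place.
share = (product.value_counts(normalize=True) * 100).round(1)
print(share)
```

The same one-liner applied to `df['Gender']` or `df['MaritalStatus']` would give the other percentages quoted in this list.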

### Customer Profiles

#### For model TM195

* Customers who bought this treadmill have incomes below 60K, with an average of 55K.
* This model is equally popular with male and female customers: it has the same number of each.
* The average age of customers who purchase TM195 is 28.5.
* This model is popular among customers with bachelor's-level education; their average education is 15 years.
* Customers' self-rated fitness level is average.
* Customers expect to use this treadmill 3-4 times a week.
* **It is the most popular model (across genders), *with 44.4% of sales*, likely because of its appealing price and affordability.**
* **Customers who bought this treadmill want at least an average fitness level and were probably looking for a basic, attractively priced treadmill that does the job.**

#### For model TM498

* This is the second most sold model, with **33.3% of sales**.
* Customers with lower incomes purchase the TM195 and TM498 models, possibly because of their lower cost.
* The average age of customers who purchase TM498 is 29.
* This model is popular among customers with bachelor's-level education; their average education is 16 years.
* **Customers expect to use the TM498 less frequently but to run more miles per week on it.**
* **This model is more popular with single female customers than with single male customers, possibly because of differences in features or color scheme.**

#### For model TM798

* **This is the least sold product (22.2% of sales) in the company's treadmill lineup, possibly because its hefty price makes it the company's premium product.**
* **This model is popular with high-income customers; their average income is 75K.**
* The average age of customers who purchase TM798 is 29.
* **This model is popular among customers with higher education; their average education is 17 years.**
* **The treadmill may have some advanced features, as high-income customers are ready to spend money on this model.**
* Customers expect to use this model 4-5 days a week, with moderate mileage averaging 166 miles per week.
* **Male customers who are more serious about fitness, or professionals, buy this model** (self-rated fitness 3-5).
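The per-model profiles above combine several group-wise averages. One way to gather them in a single table is a named aggregation with `groupby().agg()`. This is a sketch under assumptions: the toy frame and the output column names (`mean_age`, `mean_income`, `median_usage`, `top_gender`) are hypothetical, and the values are not from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Age": [24, 30, 26, 32, 27, 31],
    "Income": [40000, 50000, 45000, 55000, 70000, 90000],
    "Usage": [3, 3, 3, 4, 4, 5],
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Male"],
})

# One row per product; each output column is (source column, aggregation).
profile = toy.groupby("Product").agg(
    mean_age=("Age", "mean"),
    mean_income=("Income", "mean"),
    median_usage=("Usage", "median"),
    top_gender=("Gender", lambda s: s.mode().iloc[0]),
)
print(profile)
```

When gender counts are tied within a product, `mode()` returns the values in sorted order, so `top_gender` picks the first one alphabetically; a tie is worth reporting separately rather than reading as a preference.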