# Exploratory Data Analysis

This notebook performs exploratory data analysis on the customer churn dataset. It is not a deep exploration of the data, but rather a general overview of the dataset and its characteristics.

## 1. About the dataset

Customer churn refers to the phenomenon where customers discontinue their relationship or subscription with a company or service provider. It represents the rate at which customers stop using a company's products or services within a specific period. Churn is an important metric for businesses as it directly impacts revenue, growth, and customer retention.

In the context of the Churn dataset, the churn label indicates whether a customer has churned or not. A churned customer is one who has decided to discontinue their subscription or usage of the company's services. On the other hand, a non-churned customer is one who continues to remain engaged and retains their relationship with the company.

Understanding customer churn is crucial for businesses to identify patterns, factors, and indicators that contribute to customer attrition. By analyzing churn behavior and its associated features, companies can develop strategies to retain existing customers, improve customer satisfaction, and reduce customer turnover. Predictive modeling techniques can also be applied to forecast potential churn, so that companies can act early to retain at-risk customers.

## 2. A first look

In [1]:
#---- Libs 

# Data analysis

import pandas as pd 
import numpy as np

# Data visualization

import plotly.express as px

In [2]:
#--- Load data

train = pd.read_csv('../data/raw/customer_churn_dataset-training-master.csv')

train.head()

Unnamed: 0,CustomerID,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
0,2.0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.0,17.0,1.0
1,3.0,65.0,Female,49.0,1.0,10.0,8.0,Basic,Monthly,557.0,6.0,1.0
2,4.0,55.0,Female,14.0,4.0,6.0,18.0,Basic,Quarterly,185.0,3.0,1.0
3,5.0,58.0,Male,38.0,21.0,7.0,7.0,Standard,Monthly,396.0,29.0,1.0
4,6.0,23.0,Male,32.0,20.0,5.0,8.0,Basic,Monthly,617.0,20.0,1.0


In [3]:
#--- Info about data

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440833 entries, 0 to 440832
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   CustomerID         440832 non-null  float64
 1   Age                440832 non-null  float64
 2   Gender             440832 non-null  object 
 3   Tenure             440832 non-null  float64
 4   Usage Frequency    440832 non-null  float64
 5   Support Calls      440832 non-null  float64
 6   Payment Delay      440832 non-null  float64
 7   Subscription Type  440832 non-null  object 
 8   Contract Length    440832 non-null  object 
 9   Total Spend        440832 non-null  float64
 10  Last Interaction   440832 non-null  float64
 11  Churn              440832 non-null  float64
dtypes: float64(9), object(3)
memory usage: 40.4+ MB


A first look shows:

- 10 feature variables
- 1 target variable (`Churn`)
- 1 ID variable (`CustomerID`)
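
The split into ID, target, and feature columns can be made explicit in code. The following is a minimal sketch on a toy frame that reuses the column names and the first three rows shown above; the variable names (`sample`, `id_col`, `target_col`, `feature_cols`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame: column names from train.info(), values from the first rows of train.head().
sample = pd.DataFrame({
    "CustomerID": [2.0, 3.0, 4.0],
    "Age": [30.0, 65.0, 55.0],
    "Gender": ["Female", "Female", "Female"],
    "Tenure": [39.0, 49.0, 14.0],
    "Usage Frequency": [14.0, 1.0, 4.0],
    "Support Calls": [5.0, 10.0, 6.0],
    "Payment Delay": [18.0, 8.0, 18.0],
    "Subscription Type": ["Standard", "Basic", "Basic"],
    "Contract Length": ["Annual", "Monthly", "Quarterly"],
    "Total Spend": [932.0, 557.0, 185.0],
    "Last Interaction": [17.0, 6.0, 3.0],
    "Churn": [1.0, 1.0, 1.0],
})

# Hypothetical helper names, chosen for this sketch only.
id_col = "CustomerID"
target_col = "Churn"
feature_cols = [c for c in sample.columns if c not in (id_col, target_col)]

# Separate numeric features from categorical (non-numeric) ones.
numeric_features = sample[feature_cols].select_dtypes(include="number").columns.tolist()
categorical_features = sample[feature_cols].select_dtypes(exclude="number").columns.tolist()

print(len(feature_cols), numeric_features, categorical_features)
```

Applied to `train`, the same selection should yield the 10 feature columns listed above, assuming the column names match `train.info()`.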

## 3. Clean data

## 3. Univariate analysis

### 3.1. CustomerID

In [10]:
#---- Quantity of unique customers

train['CustomerID'].nunique()

440832

In [11]:
#---- Quantity of customers

train['CustomerID'].shape[0]

440833

In [None]:
train.query('CustomerID.isnull()')

Unnamed: 0,CustomerID,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
199295,,,,,,,,,,,,


One row is entirely null, so we remove it.

In [19]:
train = train.dropna()

train.query('CustomerID.isnull()')

Unnamed: 0,CustomerID,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
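
Note that `train.dropna()` with default arguments drops every row that has at least one missing value, not only rows that are entirely empty. Since `train.info()` reports the same non-null count (440832) for every column, both behaviours should give the same result on this dataset. The sketch below shows the difference on a made-up toy frame; `toy` and the derived variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is entirely missing, row 2 is missing only Age.
toy = pd.DataFrame({
    "CustomerID": [2.0, np.nan, 4.0],
    "Age": [30.0, np.nan, np.nan],
    "Churn": [1.0, np.nan, 0.0],
})

# Boolean mask of rows where every column is missing.
fully_empty = toy.isna().all(axis=1)

# how="all" removes only fully empty rows; the default (how="any")
# also removes the row that is missing a single value.
only_empty_rows_removed = toy.dropna(how="all")
any_missing_removed = toy.dropna()

print(fully_empty.sum(), len(only_empty_rows_removed), len(any_missing_removed))
```

If partially missing rows ever appear in a later version of the data, `how="all"` would keep them for explicit handling instead of discarding them silently.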


### 3.2. Age

In [21]:
train['Age'].describe()

count    440832.000000
mean         39.373153
std          12.442369
min          18.000000
25%          29.000000
50%          39.000000
75%          48.000000
max          65.000000
Name: Age, dtype: float64

**Highlights:**

- Min age: 18
- Average age: 39
- Max age: 65
- 50% of customers are 39 years old or younger

In [33]:
#--- Age distribution

fig = px.histogram(
    data_frame=train,
    x='Age',
    nbins=40,
    template='plotly_white',
    color_discrete_sequence=["#1f77b4"],  # Soft blue color
    opacity=0.8,  # Slight transparency
    text_auto=True,  # Show the count on each bar
)

fig.update_layout(
    title="Age Distribution of Customers",
    xaxis_title="Age",
    yaxis_title="Number of users",
    bargap=0.01,  # Reduce space between bars
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
)

**Highlights:**

- There are 3 main age groups: 20-29, 30-39 and 40-49
- The most frequent age group is 40-49
- From age 52 onward, the share of users is lower than in the other age groups
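
The age-group reading of the histogram can be checked numerically by binning ages explicitly. This is a minimal sketch using `pd.cut` on a small made-up series; the bin edges and labels are assumptions chosen to match the groups named above, and on the real data `ages` would be `train['Age']`.

```python
import pandas as pd

# Made-up ages within the observed 18-65 range (not taken from the dataset).
ages = pd.Series([18, 23, 30, 39, 41, 45, 48, 55, 58, 65], name="Age")

# Left-closed bins: [18, 20), [20, 30), [30, 40), [40, 50), [50, 66).
bins = [18, 20, 30, 40, 50, 66]
labels = ["18-19", "20-29", "30-39", "40-49", "50-65"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Share of customers per age group, in bin order.
group_share = age_group.value_counts(normalize=True).sort_index()
print(group_share)
```

Using `right=False` keeps each decade together (for example, 40 falls into 40-49 rather than 30-39).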

### 3.3. Gender

In [44]:
#--- Gender distribution

# With pandas >= 2.0, reset_index() yields the columns 'Gender' and 'count'
df_gender_distribution = train['Gender'].value_counts().reset_index()

fig = px.bar(
    data_frame=df_gender_distribution,
    x='Gender',
    y='count',
    template='plotly_white',
    color_discrete_sequence=["#1f77b4"],  # Soft blue color
    opacity=0.8,  # Slight transparency
    text_auto=True,  # Show the count on each bar
)

fig.update_layout(
    title="Gender Distribution of Customers",
    xaxis_title="",
    yaxis_title="Number of users",
    bargap=0.1,  # Reduce space between bars
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False, showticklabels=False),
)

fig

### 3.4. Tenure: time as a customer (contract)

In [45]:
#--- Tenure distribution

fig = px.histogram(
    data_frame=train,
    x='Tenure',
    nbins=40,
    template='plotly_white',
    color_discrete_sequence=["#1f77b4"],  # Soft blue color
    opacity=0.8,  # Slight transparency
    text_auto=True,  # Show the count on each bar
)

fig.update_layout(
    title="Tenure Distribution of Customers",
    xaxis_title="Tenure",
    yaxis_title="Number of users",
    bargap=0.01,  # Reduce space between bars
    xaxis=dict(showgrid=True),
    yaxis=dict(showgrid=True),
)

**Highlights:** Overall, users are spread roughly evenly across tenure values.
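
One way to put a number on "roughly even" is to compare the share of each tenure value with the share a perfectly uniform distribution would give. The sketch below uses a made-up series; on the real data `tenure` would be `train['Tenure']`, and what counts as a small enough deviation is a judgement call rather than something stated in this notebook.

```python
import pandas as pd

# Made-up tenure values, perfectly balanced on purpose.
tenure = pd.Series([1, 1, 2, 2, 3, 3, 4, 4], name="Tenure")

# Observed share of customers per tenure value.
share = tenure.value_counts(normalize=True).sort_index()

# Share each value would have under a perfectly uniform distribution.
expected = 1 / share.size

# Largest absolute gap between observed and uniform share.
max_deviation = (share - expected).abs().max()
print(max_deviation)
```

A value of `max_deviation` close to zero supports the visual impression of a flat histogram; larger values point to tenure ranges that are over- or under-represented.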

In [6]:
train.head()

Unnamed: 0,CustomerID,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
0,2.0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.0,17.0,1.0
1,3.0,65.0,Female,49.0,1.0,10.0,8.0,Basic,Monthly,557.0,6.0,1.0
2,4.0,55.0,Female,14.0,4.0,6.0,18.0,Basic,Quarterly,185.0,3.0,1.0
3,5.0,58.0,Male,38.0,21.0,7.0,7.0,Standard,Monthly,396.0,29.0,1.0
4,6.0,23.0,Male,32.0,20.0,5.0,8.0,Basic,Monthly,617.0,20.0,1.0
