# Exploratory Data Analysis

This notebook explores a dataset extracted from [Telco Customer Churn on Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn). Each row represents a customer, and each column holds one customer attribute, as described in the metadata table below. The raw dataset consists of 7,043 rows (customers) and 21 columns (features). There are no null or blank values in the data.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('dataset/telco_customer_churn.csv')

df

| Column | Description |
| --- | --- |
| customerID | Customer identification |
| gender | Customer gender (Male, Female) |
| SeniorCitizen | Whether the customer is a senior citizen or not (1, 0) |
| Partner | Whether the customer has a partner or not (Yes, No) |
| Dependents | Whether the customer has dependents or not (Yes, No) |
| tenure | Number of months the customer has stayed with the company |
| PhoneService | Whether the customer has phone service or not (Yes, No) |
| MultipleLines | Whether the customer has multiple lines or not (Yes, No, No phone service) |
| InternetService | Customer's internet service provider (DSL, Fiber optic, No) |
| OnlineSecurity | Whether the customer has online security or not (Yes, No, No internet service) |
| OnlineBackup | Whether the customer has online backup or not (Yes, No, No internet service) |
| DeviceProtection | Whether the customer has device protection or not (Yes, No, No internet service) |
| TechSupport | Whether the customer has tech support or not (Yes, No, No internet service) |
| StreamingTV | Whether the customer has streaming TV or not (Yes, No, No internet service) |
| StreamingMovies | Whether the customer has streaming movies or not (Yes, No, No internet service) |
| Contract | Customer's contract (Monthly, Annual, Biennial) |
| PaperlessBilling | Whether the customer has paperless billing or not (Yes, No) |
| PaymentMethod | Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)) |
| MonthlyCharges | Customer's monthly charges |
| TotalCharges | Customer's total charges |
| Churn | Whether the customer has churned or not (Yes, No) |
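
The statement that the data has no null or blank values can be checked directly. The sketch below is one way to do it, assuming that blanks may appear as empty or whitespace-only strings, which `isnull()` alone would miss. The helper name `count_missing` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd


def count_missing(frame):
    """Count NaN values and blank (empty or whitespace-only) strings per column."""
    nulls = frame.isnull().sum()
    # Cast every column to text so the same blank test works for any dtype.
    blanks = frame.apply(lambda col: col.astype(str).str.strip().eq('').sum())
    return pd.DataFrame({'nulls': nulls, 'blanks': blanks})


# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    'tenure': [1, 0, 24],
    'TotalCharges': ['29.85', ' ', '1840.75'],
    'Churn': ['No', 'No', None],
})
report = count_missing(toy)
print(report)
```

On the real data the call would be `count_missing(df)`. A nonzero `blanks` count in a numeric-looking column such as TotalCharges would also suggest that pandas loaded it as text rather than as a number.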

For this analysis, a new column, Service, was created to classify whether the customer subscribes to only one service (phone or internet) or to both.

In [4]:
df['Service'] = None
df.loc[(df['PhoneService'] != 'No') & (df['InternetService'] == 'No'), 'Service'] = "Just Phone"
df.loc[(df['PhoneService'] == 'No') & (df['InternetService'] != 'No'), 'Service'] = "Just Internet"
df.loc[(df['PhoneService'] != 'No') & (df['InternetService'] != 'No'), 'Service'] = "Both"
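
The three `.loc` assignments can also be expressed as a single vectorized rule. The sketch below uses `numpy.select` with the same conditions. `classify_service` is a hypothetical helper name, and the `'None'` label for customers with neither service is an assumption, since the notebook leaves such rows unset.

```python
import numpy as np
import pandas as pd


def classify_service(frame):
    """Label each row as 'Just Phone', 'Just Internet', 'Both', or 'None'."""
    has_phone = frame['PhoneService'] != 'No'
    has_internet = frame['InternetService'] != 'No'
    labels = np.select(
        [has_phone & ~has_internet, ~has_phone & has_internet, has_phone & has_internet],
        ['Just Phone', 'Just Internet', 'Both'],
        default='None',  # assumed label for rows with neither service
    )
    return pd.Series(labels, index=frame.index, name='Service')


# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    'PhoneService': ['Yes', 'No', 'Yes'],
    'InternetService': ['No', 'DSL', 'Fiber optic'],
})
service = classify_service(toy)
print(service.tolist())  # ['Just Phone', 'Just Internet', 'Both']
```

Running `df['Service'].value_counts(dropna=False)` afterwards is a quick way to confirm that every customer received a label.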

Three additional columns were created to group related internet add-ons. Each new column is Yes when at least one of its two source columns is Yes, and No otherwise:

| Source Columns | New Column Created |
| --- | --- |
| OnlineSecurity and OnlineBackup | OnlineService (Yes, No) |
| DeviceProtection and TechSupport | SecurityHelp (Yes, No) |
| StreamingTV and StreamingMovies | Streaming (Yes, No) |

In [5]:
df['OnlineService'] = df.apply(lambda row: 'Yes' if row['OnlineSecurity'] == 'Yes' or row['OnlineBackup'] == 'Yes' else 'No', axis=1)
df['SecurityHelp'] = df.apply(lambda row: 'Yes' if row['DeviceProtection'] == 'Yes' or row['TechSupport'] == 'Yes' else 'No', axis=1)
df['Streaming'] = df.apply(lambda row: 'Yes' if row['StreamingTV'] == 'Yes' or row['StreamingMovies'] == 'Yes' else 'No', axis=1)
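
The same "Yes if either column is Yes" rule can be written without a row-wise `apply`. The following is a minimal sketch, assuming the source columns hold the string values listed in the metadata table. The helper `any_yes` is a hypothetical name introduced for illustration.

```python
import pandas as pd


def any_yes(frame, columns):
    """Return 'Yes' where at least one of the given columns equals 'Yes', else 'No'."""
    flag = frame[columns].eq('Yes').any(axis=1)
    return flag.map({True: 'Yes', False: 'No'})


# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    'StreamingTV': ['Yes', 'No', 'No internet service'],
    'StreamingMovies': ['No', 'No', 'No internet service'],
})
streaming = any_yes(toy, ['StreamingTV', 'StreamingMovies'])
print(streaming.tolist())  # ['Yes', 'No', 'No']
```

Because the comparison runs on whole columns at once, this form also extends to groups of more than two columns by passing a longer list.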

To simplify visualization in this notebook, three helper functions were defined to draw the plots.

In [11]:
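def plot_histograma(name_col):
    # Bar chart of how many rows fall into each category of name_col.
    # value_counts() is used instead of plt.hist(bins=3), which leaves an
    # empty middle bin when the column has only two categories (e.g. Churn).
    counts = df[name_col].value_counts().sort_index()
    plt.figure(figsize=(10, 6))
    plt.bar(counts.index.astype(str), counts.values, edgecolor='black')
    plt.title(f'Distribution of {name_col}')
    plt.xlabel(name_col)
    plt.ylabel('Count')
    plt.show()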

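def plot_barras_2col(name_col):
    # Grouped bar chart: churned vs. retained customers per category of name_col.
    plt.figure(figsize=(10, 6))
    churn_counts = df[df['Churn'] == 'Yes'][name_col].value_counts()
    no_churn_counts = df[df['Churn'] == 'No'][name_col].value_counts()

    # value_counts() orders by frequency, so the two series can list the
    # categories in different orders. Align both to one shared, sorted order.
    categories = sorted(df[name_col].dropna().unique())
    churn_counts = churn_counts.reindex(categories, fill_value=0)
    no_churn_counts = no_churn_counts.reindex(categories, fill_value=0)

    bar_width = 0.35
    index = range(len(categories))

    plt.bar(index, churn_counts, bar_width, label='Churn')
    plt.bar([i + bar_width for i in index], no_churn_counts, bar_width, label='No Churn')

    plt.xlabel(name_col)
    plt.ylabel('Count')
    plt.title(f'Churn by {name_col}')
    plt.xticks([i + bar_width / 2 for i in index], categories)
    plt.legend()
    plt.show()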

def plot_barras_1col(name_col):
    # Bar chart of churned customers only, one bar per distinct value of name_col.
    plt.figure(figsize=(10, 6))
    churn_counts = df[df['Churn'] == 'Yes'][name_col].value_counts().sort_index()

    bar_width = 0.35
    index = range(len(churn_counts))

    plt.bar(index, churn_counts, bar_width, label='Churn')

    plt.xlabel(name_col)
    plt.ylabel('Count')
    plt.title(f'Churn by {name_col}')
    # Bars are centered on their index, so the tick labels go at the same positions.
    plt.xticks(index, churn_counts.index)
    plt.legend()
    plt.show()
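
A grouped bar chart is only correct if the churned and retained counts are listed in the same category order. One way to get that alignment directly is `pd.crosstab`, sketched here on toy data. The helper `churn_table` and its `rate` column are illustrative additions, not part of the notebook.

```python
import pandas as pd


def churn_table(frame, name_col):
    """Counts of retained/churned customers per category, plus the churn rate."""
    table = pd.crosstab(frame[name_col], frame['Churn'])
    # Guarantee both columns exist even if one outcome is absent from the data.
    table = table.reindex(columns=['No', 'Yes'], fill_value=0)
    table['rate'] = table['Yes'] / (table['Yes'] + table['No'])
    return table


# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    'Contract': ['Monthly', 'Monthly', 'Annual', 'Annual', 'Monthly'],
    'Churn': ['Yes', 'No', 'No', 'No', 'Yes'],
})
table = churn_table(toy, 'Contract')
print(table)
```

Each row of the table is one category, so the counts for "Yes" and "No" cannot drift out of step. The rate column may also be easier to compare across categories of very different sizes than raw counts.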

## Analysis

In [None]:
plot_histograma('Churn')
plot_barras_2col('Service')
plot_barras_2col('gender')
plot_barras_2col('SeniorCitizen')
plot_barras_2col('Partner')
plot_barras_2col('Dependents')
plot_barras_2col('PhoneService')
plot_barras_2col('InternetService')
plot_barras_2col('OnlineService')
plot_barras_2col('SecurityHelp')
plot_barras_2col('Streaming')
plot_barras_1col('MonthlyCharges')
plot_barras_1col('tenure')
plot_barras_2col('Contract')
plot_barras_2col('PaperlessBilling')
plot_barras_2col('PaymentMethod')
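
MonthlyCharges and tenure take many distinct values, so a chart with one bar per value can be hard to read. A possible refinement, not used in the notebook, is to bin the column first with `pd.cut` and count churned customers per bin. The helper name `churn_by_bin` and the bin count below are assumptions.

```python
import pandas as pd


def churn_by_bin(frame, name_col, bins):
    """Count churned customers in equal-width bins of a numeric column."""
    churned = frame.loc[frame['Churn'] == 'Yes', name_col]
    binned = pd.cut(churned, bins=bins)
    # sort_index keeps the bins in ascending order instead of by frequency.
    return binned.value_counts().sort_index()


# Toy data standing in for the real file (hypothetical values).
toy = pd.DataFrame({
    'tenure': [1, 2, 3, 30, 60, 70],
    'Churn': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
})
counts = churn_by_bin(toy, 'tenure', bins=3)
print(counts)
```

The resulting series has one entry per interval and can be passed to `plt.bar` after converting its index to strings.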