## Exploratory Data Analysis
1. For the user to understand the data
2. Can include plots, charts and summaries
3. You can introduce "bias" for exploration, e.g. by grouping things
4. Strings/categorical data are okay
5. Messy Data is acceptable

In [None]:
import pandas as pd

### Normal data checking

In [None]:
data = pd.read_csv("Telco-Customer-Churn.csv")
data.head()

In [None]:
print(data.shape)

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data.isnull().sum()

    The dataset has no nulls
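
One thing `isnull()` cannot see is a value that is present but empty, such as a string holding only a space. Below is a minimal sketch of how to count those, using a made-up toy frame rather than the real file. Whether the Telco data has such entries is something to check, not something assumed here.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],  # note the blank string
})

# isnull() only counts real NaN/None values, so the blank string is missed.
null_total = int(toy.isnull().sum().sum())
print(f"nulls reported by isnull(): {null_total}")

# Count entries that are empty once surrounding whitespace is stripped.
blank_total = int(toy["TotalCharges"].str.strip().eq("").sum())
print(f"blank strings in TotalCharges: {blank_total}")
```

Running the same `str.strip().eq("")` check on each text column of `data` would show whether "no nulls" really means "no missing values".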

### EDA on numeric data

#### Checking for correlations in my data that a model can learn (numeric data)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

numeric_df = data.select_dtypes(include='number')
numeric_df.columns

#### From the plotted correlation matrix
1. tenure and MonthlyCharges are the most strongly correlated pair

In [None]:
corr = numeric_df.corr()
plt.figure(figsize=(10,6))
sns.heatmap(corr,annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
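
Reading the strongest pair off a heatmap by eye gets harder as the number of columns grows. Here is a small sketch that ranks the pairs instead. The toy values are invented, so the ranking it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up numbers that only mimic the three numeric columns.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 1, 0, 1, 0],
    "tenure": [1, 34, 2, 45, 8, 22],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10],
})
toy_corr = toy.corr()

# Keep only the upper triangle; k=1 also drops the diagonal of 1.0s,
# so each pair of columns appears exactly once.
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative correlations count too.
ranked = pairs.abs().sort_values(ascending=False)
print(ranked)
```

Swapping `toy_corr` for the `corr` computed above would apply the same ranking to the real data.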

#### From the histograms:
1. Distribution of senior Citizens - there are fewer senior citizens than non-senior citizens.
2. Monthly charges of around 20 occur most often

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

for col in data.select_dtypes(include='number').columns:
    sns.histplot(data[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

#### From the box plots:
1. There are no outliers or large skewness in the plots, except for the binary plot for senior citizens

In [None]:
for col in data.select_dtypes(include='number').columns:
    sns.boxplot(x=data[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

#### Measuring skewness directly instead of depending on histograms and box plots that might not be clear 

#### From it:
1. SeniorCitizen is strongly right-skewed (expected for a binary column where most values are 0)
2. tenure is roughly symmetric, with skewness close to zero
3. MonthlyCharges is slightly left-skewed

From these observations I can choose to normalize the data and save it for model training, or leave it as it is.

In [None]:
data.select_dtypes(include='number').skew()
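
A caveat on "normalizing" skewed data, sketched on a made-up series rather than the Telco columns: min-max scaling only rescales the values, so the skewness stays the same. A non-linear transform such as `np.log1p` is one common way to actually reduce right skew. Which of these fits this dataset, if any, is a judgment call.

```python
import numpy as np
import pandas as pd

# Invented right-skewed values: mostly small, one very large.
x = pd.Series([1, 2, 2, 3, 3, 4, 50], name="toy_right_skewed")

# Min-max scaling maps values into [0, 1] but keeps the shape.
min_max = (x - x.min()) / (x.max() - x.min())

# log1p compresses large values more than small ones.
logged = np.log1p(x)

print(f"original skew: {x.skew():.3f}")
print(f"min-max skew:  {min_max.skew():.3f}")
print(f"log1p skew:    {logged.skew():.3f}")
```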

### EDA on Categorical data

In [None]:
cat_cols = data.select_dtypes('object').columns
print(cat_cols)

In [None]:
cat_cols.shape

3 (numeric cols) + 18 (categorical cols) = all the 21 columns we initially had

But why are some columns I clearly know should be numeric, such as TotalCharges and Churn, being identified as categorical?

    I can choose to change them into numeric data types and save another clean dataset. I can then perform numeric EDA on the resulting dataset and maybe normalize skewed data columns.
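
A minimal sketch of that conversion, on a toy frame with invented values. It assumes the numeric column is stored as text and that Churn holds only "Yes" and "No". `errors="coerce"` turns anything unparseable into NaN instead of raising, so the bad entries can be counted afterwards.

```python
import pandas as pd

# Toy stand-in; the real column contents should be inspected first.
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "Yes", "No"],
})

clean = toy.copy()

# Parse text to floats; values that cannot be parsed become NaN.
clean["TotalCharges"] = pd.to_numeric(clean["TotalCharges"], errors="coerce")

# Encode the two-valued target as 0/1 (assumes only "Yes"/"No" occur).
clean["Churn"] = clean["Churn"].map({"No": 0, "Yes": 1})

print(clean.dtypes)
print(clean.isnull().sum())
```

Any NaN that appears after the conversion points at entries that were not valid numbers in the original column.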

#### Check number of unique values per column

In [None]:
for col in cat_cols:
    print(f"\n column: {col}")
    print(data[col].nunique())

#### Check for missing and inconsistent categories

In [None]:
# I know I've done this in a cell above, but I saw this loop on the web and had to use it somewhere. So here it is:

for col in cat_cols:
    print(f"{col} has {data[col].isnull().sum()} missing values")
    print(f"Unique categories: {data[col].unique()}")
    

#### Plot Category distributions

>N.B.: The customer ID plot will look messy because every entry is unique, so each one gets its own tick on the countplot axis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

for col in cat_cols:
    plt.figure(figsize=(8,4))
    sns.countplot(data = data, x = col, order=data[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
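
One way to avoid the messy ID plot is to skip columns with too many distinct values before plotting. This sketch uses a toy frame, and the cut-off `max_categories` is a hypothetical threshold picked for illustration, not a recommended value.

```python
import pandas as pd

# Invented rows; only the column names resemble the real data.
toy = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C", "0004-D"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "Churn": ["No", "Yes", "No", "No"],
})

max_categories = 3  # hypothetical cut-off, tune it for the real data

# Keep only columns whose number of distinct values is small enough to plot.
plot_cols = [col for col in toy.columns if toy[col].nunique() <= max_categories]
print(plot_cols)
```

Looping over `plot_cols` instead of every categorical column leaves the ID column out of the countplots.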

#### See if columns have a relationship with Target variable

>I know this is kind of biased, but I still find it worth checking, as long as you don't expect your model to see the columns the same way. It will form its own relationships.

In [None]:
for col in cat_cols:
    plt.figure(figsize=(8,4))
    sns.countplot(data=data, x= col, hue='Churn')
    plt.title(f'{col} vs Churn')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
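
Raw counts can mislead when one category is much larger than another, so it can help to look at the share of churners within each category as well. This is a small sketch with invented rows; `normalize="index"` makes every row of the table sum to 1.

```python
import pandas as pd

# Made-up rows, not taken from the Telco file.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "One year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Share of each Churn value within every Contract category.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```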

#### See How a numeric column like MonthlyCharges varies across categories like if someone has dependants

In [None]:
print(data.groupby('Dependents')['MonthlyCharges'].mean())

> The output above means customers without dependants pay an average of about 67 shillings, while those with dependants pay about 59. So people without dependants tend to pay more. My guess is that people with dependants choose cheaper plans and are more loyal. Try to explain it.

> Hint: You can choose to dig deeper and count how many customers are in each group and check if there is a big variance.
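
The hint above can be sketched with a single `agg` call that returns the group size, mean and standard deviation side by side. The numbers below are invented for illustration only.

```python
import pandas as pd

# Toy values chosen so the output is easy to verify by hand.
toy = pd.DataFrame({
    "Dependents": ["No", "No", "No", "Yes", "Yes"],
    "MonthlyCharges": [70.0, 90.0, 50.0, 60.0, 40.0],
})

# count shows how many customers back each mean; std shows the spread.
summary = toy.groupby("Dependents")["MonthlyCharges"].agg(["count", "mean", "std"])
print(summary)
```

A small group or a large spread would be a reason to treat the difference in means with caution.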

## Remarks:
EDA is very broad and, with the right commands, can dig into intricate details. There is no end to what you can look for in a dataset, so you can spend a whole week chasing patterns and causes you imagined (I learned this the hard way). Don't be like me: set a goal up front, then go talk to the data to achieve it.