# Keeping Customers: Fighting Churn with Pandas

In this assignment, you'll play the role of lead analyst for a credit card provider. You’ve been given a CSV file containing information on both customers who have churned and customers who still use the credit card. You have been tasked with determining whether churned and existing customers differ significantly on the following factors:

- Age
- Number of dependents 
- Total revolving balance, i.e., the balance that still needs to be paid
- Income 

The first three metrics are numeric, so you will apply the same strategy to each of them. Customer income is categorical, so you will apply a slightly different strategy to that problem.

By the end of this notebook, you will have determined which of the above factors may help identify a customer who is likely to churn. This will help your organization better identify and retain flight risks in the future.

---
### Getting Started
To get started, download the following files:
- `Unit 20 - Business - Unsolved.ipynb` (_this notebook_)
- `CreditCardChurn.csv`

Place these together into a dedicated directory on your hard drive. We recommend creating a folder in your `Documents` directory for this week of class, as follows:

```
Documents/
  Python/
    Week20/
      Unit 20 - Business - Unsolved.ipynb
      CreditCardChurn.csv
```

Then, start Jupyter Notebook and open `Unit 20 - Business - Unsolved.ipynb` in your browser. Make sure the `CreditCardChurn.csv` file lives in the same directory.

---

### Problem Structure
Each problem will be accompanied by:
- **Instructions**
  - Each problem features a markdown cell explaining the problem.
- **Unfinished Code Cells**
  - Each problem has unfinished code cells, where you will write code to solve the problem.
  - Cells will contain either starter code for you to finish, or a comment explaining what your code should do.
- **Expected Output**
  - Many unfinished code cells will have output below them. You will be expected to write code that produces the same output.
  - Some unfinished code cells do _not_ have output below them. This is simply because not all code will generate output. Your solutions for these cells should _not_ print anything.
  
---
  
### Deliverables
To receive credit for this assignment, you must submit the following files:
- Your completed Jupyter Notebook

Your completed Jupyter Notebook will be this file, but with all of the problems solved. This is the only file you will need to submit. When you're done with the assignment, run all cells to verify that your code executes as expected. Then, save and submit this notebook.

Good luck!

----

## Part 1: Loading & Cleaning Data
In Part 1, you will perform the following steps on the data in `CreditCardChurn.csv`:
- Load the CSV into a DataFrame and print the first five rows
- Add a new column called `Churned`

### Problem 1: Loading Data
You will load the data in `CreditCardChurn.csv`, and inspect its columns using `head`. 

You have been provided a `filename` variable, which contains the path to `CreditCardChurn.csv`. Use it to complete the steps below:
- Load `filename` into a DataFrame called `churn`
- Print the first 5 rows of `churn`

---

Your code should print the following:

```
ClientID	AttritionFlag	CustomerAge	Gender	DependentCount	EducationLevel	IncomeCategory	TotalRevolvingBal
0	768805383	Existing Customer	45	M	3	High School	$60K - $80K	777
1	818770008	Existing Customer	49	F	5	Graduate	Less than $40K	864
2	713982108	Existing Customer	51	M	3	Graduate	$80K - $120K	0
3	769911858	Existing Customer	40	F	4	High School	Less than $40K	2517
4	709106358	Existing Customer	40	M	3	Uneducated	$60K - $80K	0
```


In [30]:
# Provided Code -- Do NOT Edit!
import pandas as pd
filename = 'CreditCardChurn.csv'

In [31]:
# Load `filename` into a DataFrame called `churn`
churn = pd.read_csv(filename)

In [32]:
# Print first 5 rows of `churn`
churn.head()

Unnamed: 0,ClientID,AttritionFlag,CustomerAge,Gender,DependentCount,EducationLevel,IncomeCategory,TotalRevolvingBal
0,768805383,Existing Customer,45,M,3,High School,$60K - $80K,777
1,818770008,Existing Customer,49,F,5,Graduate,Less than $40K,864
2,713982108,Existing Customer,51,M,3,Graduate,$80K - $120K,0
3,769911858,Existing Customer,40,F,4,High School,Less than $40K,2517
4,709106358,Existing Customer,40,M,3,Uneducated,$60K - $80K,0
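
As a side note on how `read_csv` behaves, here is a minimal sketch that loads a tiny CSV from an in-memory string instead of a file on disk. The two rows and the variable names `csv_text` and `toy` are hypothetical and are not taken from `CreditCardChurn.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV text standing in for a file on disk
csv_text = """ClientID,AttritionFlag,CustomerAge
1001,Existing Customer,45
1002,Attrited Customer,52
"""

# `read_csv` accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# `head` returns the first five rows by default (here, all two rows)
print(toy.head())
print(toy.shape)
```

The first line of the text is treated as the header row, so `toy` ends up with three named columns.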


### Problem 2: Adding a `Churned` Column
Note that the `AttritionFlag` Series contains one of two values: either `Existing Customer`, indicating that the customer is still a subscriber, or `Attrited Customer`, indicating that they've canceled.

In this problem, you'll add a new column, called `Churned`, which will be `True` if `AttritionFlag` is `Attrited Customer`, and `False` otherwise. Follow the steps below:
- Create a new column, called `Churned`, which is `True` for rows where `AttritionFlag` is equal to `Attrited Customer`, and `False` otherwise
- Count the values of the `Churned` column

---

Your code should print the following:

```
False    8500
True     1627
Name: Churned, dtype: int64
```

In [33]:
# Create column called `Churned` 
churn['Churned'] = churn.AttritionFlag == 'Attrited Customer'

In [34]:
# Count values in `Churned` column
churn.Churned.value_counts()

False    8500
True     1627
Name: Churned, dtype: int64
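
To illustrate how the comparison above produces a boolean column, the sketch below applies the same pattern to a tiny, made-up DataFrame (the rows are hypothetical and not taken from `CreditCardChurn.csv`). Passing `normalize=True` to `value_counts` is an optional extra that reports proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy data; not the real contents of CreditCardChurn.csv
toy = pd.DataFrame({
    'AttritionFlag': ['Existing Customer', 'Attrited Customer',
                      'Existing Customer', 'Existing Customer'],
})

# Comparing a Series to a string yields a boolean Series of the same length
toy['Churned'] = toy['AttritionFlag'] == 'Attrited Customer'

# Counts of each value, then the same information as proportions
churn_counts = toy['Churned'].value_counts()
churn_share = toy['Churned'].value_counts(normalize=True)
print(churn_counts)
print(churn_share)
```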

## Part 2: Differences in Numeric Variables
In Part 2, you will see if there are significant differences in the following columns between the two groups:
- Customer Age
- Number of Dependents
- Total Revolving Balance

### Problem 1: Calculating the Average Age Difference Between `churned` vs `not_churned`
You will write code that computes the average age of churned customers, and the average age of unchurned customers, and then prints the _difference_ in these two averages.

Follow the steps below to solve this problem:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers 
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to _unchurned_ customers
- Compute the average `CustomerAge` for each group
- Print the difference between these averages

---

Your code should print the following:

```
AVERAGE AGE DIFFERENCE:  0.3973783578582015
```

In [35]:
# Create `churned_customers` and `unchurned_customers` variables
churned_customers = churn[churn.Churned]
unchurned_customers = churn[~churn.Churned]

In [36]:
# Compute average age of churned and unchurned customers
average_age_of_churned_customers = churned_customers['CustomerAge'].mean()
average_age_of_unchurned_customers = unchurned_customers['CustomerAge'].mean()

# Compute `difference_in_average_age`
difference_in_average_age = average_age_of_churned_customers - average_age_of_unchurned_customers

In [37]:
# Print `difference_in_average_age`
print("AVERAGE AGE DIFFERENCE: ", difference_in_average_age)

AVERAGE AGE DIFFERENCE:  0.3973783578582015
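
The same comparison can also be sketched with `groupby`, which computes both group means in one step. The example below uses a small hypothetical DataFrame whose values are invented for illustration, so the printed number will not match the expected output above.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    'AttritionFlag': ['Attrited Customer', 'Attrited Customer',
                      'Existing Customer', 'Existing Customer'],
    'CustomerAge': [50, 44, 45, 47],
})

# Mean age per group, indexed by the values of `AttritionFlag`
mean_age_by_group = toy.groupby('AttritionFlag')['CustomerAge'].mean()

# Churned mean minus unchurned mean, matching the order used in this notebook
toy_difference = (mean_age_by_group.loc['Attrited Customer']
                  - mean_age_by_group.loc['Existing Customer'])
print("AVERAGE AGE DIFFERENCE: ", toy_difference)
```

Filtering into two DataFrames, as the assignment asks, gives the same result; `groupby` simply avoids repeating the filter for each column.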


After finding the difference in average age, answer the following questions:
- What are the minimum and maximum values in `churn['CustomerAge']`? Store the minimum value in a variable called `min_age` and maximum value in a variable called `max_age`.
- What is the difference between these values?
- Is `CustomerAge` predictive of churn?

In [38]:
min_age = churn['CustomerAge'].min()
min_age

26

In [39]:
max_age = churn['CustomerAge'].max()
max_age

73

In [40]:
age_range = max_age - min_age
age_range

47

- `CustomerAge` ranges from 26 to 73, for a maximum difference of 47 years.
- However, the difference in average age between customers who churn and those who don't is only about 0.40 years -- dramatically less than 47.
- Thus, age doesn't seem to be a predictive factor, because the difference in average age between the two groups is minuscule.

### Problem 2: Calculating Difference in Average Number of Dependents
Next, you will write code that computes the average number of dependents of churned customers, and the average number of dependents of unchurned customers, and then prints the _difference_ in these two averages.

Follow the steps below to solve this problem:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers 
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to _unchurned_ customers
- Compute the average `DependentCount` for each group
- Print the difference between these averages

---

Your code should print the following:

```
AVERAGE DIFFERENCE IN DEPENDENT COUNT:  0.06716967352398884
```

---

In [41]:
# Create `churned_customers` and `unchurned_customers` variables
churned_customers = churn[churn.Churned]
unchurned_customers = churn[~churn.Churned]

In [42]:
# Compute average `DependentCount` for each group
average_count_of_churned_customers = churned_customers['DependentCount'].mean()
average_count_of_unchurned_customers = unchurned_customers['DependentCount'].mean()

# Compute `difference_in_average_dependent_count`
difference_in_average_dependent_count = average_count_of_churned_customers - average_count_of_unchurned_customers

In [43]:
# Print `difference_in_average_dependent_count`
print("AVERAGE DIFFERENCE IN DEPENDENT COUNT: ", difference_in_average_dependent_count)

AVERAGE DIFFERENCE IN DEPENDENT COUNT:  0.06716967352398884


After finding the difference in average number of dependents, answer the following questions:
- What are the minimum and maximum values in `churn['DependentCount']`? Store the minimum value in a variable called `min_dependent_count` and maximum value in a variable called `max_dependent_count`.
- What is the difference between these values? Store the result in a variable called `dependent_count_range`
- Is `DependentCount` predictive of churn?

In [44]:
# Compute and print `min_dependent_count`
min_dependent_count = churn['DependentCount'].min()
min_dependent_count

0

In [45]:
# Compute and print `max_dependent_count`
max_dependent_count = churn['DependentCount'].max()
max_dependent_count

5

In [46]:
# Compute and print difference between max and min dependent count
dependent_count_range = max_dependent_count - min_dependent_count
dependent_count_range

5

- `DependentCount` ranges from 0 to 5, for a maximum difference of 5 dependents.
- However, the difference in average dependent count between customers who churn and those who don't is only 0.067 -- considerably less than 5.
- Thus, number of dependents doesn't seem to be a predictive factor, because the difference in average dependent count between the two groups is minuscule.

### Problem 3: Calculating Average Difference in Total Revolving Balance
Next, you will write code that computes the average total revolving balance of churned customers, and the average total revolving balance of unchurned customers, and then prints the _difference_ in these two averages.

Follow the steps below to solve this problem:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers 
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to _unchurned_ customers
- Compute the average `TotalRevolvingBal` for each group
- Print the difference between these averages

---

Your code should print the following:

```
AVERAGE DIFFERENCE IN TOTAL REVOLVING BALANCE:  -583.7811305542499
```

---

In [47]:
# Create `churned_customers` and `unchurned_customers` variables
churned_customers = churn[churn.Churned]
unchurned_customers = churn[~churn.Churned]

In [48]:
# Compute average `TotalRevolvingBal` for each group
average_balance_of_churned_customers = churned_customers['TotalRevolvingBal'].mean()
average_balance_of_unchurned_customers = unchurned_customers['TotalRevolvingBal'].mean()

# Compute `difference_in_average_revolving_balance`
difference_in_average_revolving_balance = average_balance_of_churned_customers - average_balance_of_unchurned_customers

In [49]:
# Print `difference_in_average_revolving_balance`
print("AVERAGE DIFFERENCE IN TOTAL REVOLVING BALANCE: ", difference_in_average_revolving_balance)

AVERAGE DIFFERENCE IN TOTAL REVOLVING BALANCE:  -583.7811305542499


After finding the difference in total revolving balance, answer the following questions:
- What are the maximum and minimum values in `churn['TotalRevolvingBal']`? Store the minimum value in a variable called `min_balance` and maximum value in a variable called `max_balance`.
- What is the difference between these values? Store the result in a variable called `balance_range`
- Is `TotalRevolvingBal` predictive of churn?

In [50]:
# Compute and print minimum revolving balance
min_balance = churn['TotalRevolvingBal'].min()
min_balance

0

In [51]:
# Compute and print maximum revolving balance
max_balance = churn['TotalRevolvingBal'].max()
max_balance

2517

In [52]:
# Compute difference between max and min revolving balances
balance_range = max_balance - min_balance
balance_range

2517

- The minimum revolving balance is \\$0, and the maximum is \\$2517, for a total range of \\$2517.
- The difference in average revolving balance between churned and unchurned groups is `-583.7811305542499`. This indicates that, on average, churned customers carry a revolving balance about \\$584 _lower_ than unchurned customers.
- \\$584 is a substantial portion of \\$2517 -- about 23%. Thus, the difference appears meaningful, and we can say that `TotalRevolvingBal` is likely predictive of churn.
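
To put the three comparisons in Part 2 on the same footing, the arithmetic behind the "portion of the range" reasoning can be sketched as below. The numbers are copied from the outputs above; the helper name `share_of_range` is hypothetical, and dividing by the range is only a rough heuristic, not a statistical significance test.

```python
def share_of_range(difference, minimum, maximum):
    """Return |difference| as a fraction of the observed range."""
    return abs(difference) / (maximum - minimum)

# (difference in means, min, max) copied from the outputs in Part 2
results = {
    'CustomerAge': (0.3973783578582015, 26, 73),
    'DependentCount': (0.06716967352398884, 0, 5),
    'TotalRevolvingBal': (-583.7811305542499, 0, 2517),
}

shares = {name: share_of_range(*values) for name, values in results.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%} of the range")
```

Run as written, this reports roughly 0.85%, 1.34%, and 23.19% for the three columns, which is consistent with the conclusion that only the revolving balance difference is large relative to its range.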

## Part 3: Studying Income Categories
In Part 3, you will continue looking for differences between churned and unchurned customers. This time, you will see whether low-income or high-income customers are more likely to churn, by studying the `IncomeCategory` column.

### Problem 1: Comparing "Low-Income" Churned vs Unchurned Customers
Next, you will determine whether there is a difference in the proportion of churned vs unchurned customers who qualify as "low-income" -- i.e., those who make less than $40K per year.

You have been provided with a variable, called `low_income`, containing the value `'Less than $40K'`. Use it to complete the steps below:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to unchurned customers
- Create the following two boolean Series:
  - `low_income_churned`: `True` for each row of `churned_customers` whose `IncomeCategory` value equals `low_income`
  - `low_income_unchurned`: `True` for each row of `unchurned_customers` whose `IncomeCategory` value equals `low_income`
- Compute the mean of each of these Series (i.e., the share of low-income customers in each group), then print the difference between these means.

---

Your code should print the following:

```
0.029211251310604147
```

In [53]:
# Provided Code -- Do NOT Edit!
low_income = 'Less than $40K'

In [54]:
# Create `churned_customers` and `unchurned_customers` DataFrames
churned_customers = churn[churn.Churned]
unchurned_customers = churn[~churn.Churned]

In [55]:
# Filter for low-income, churned customers
low_income_churned = churned_customers['IncomeCategory'] == low_income

# Filter for low-income, unchurned customers
low_income_unchurned = unchurned_customers['IncomeCategory'] == low_income

# Compute difference in means between `low_income_churned` and `low_income_unchurned` 
low_income_churned.mean() - low_income_unchurned.mean()

0.029211251310604147
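
The calculation above relies on the fact that the mean of a boolean Series is the fraction of `True` values, since `True` counts as 1 and `False` as 0. A minimal sketch with hypothetical values (the five income labels below are invented for illustration):

```python
import pandas as pd

# Hypothetical income categories for five customers
toy_income = pd.Series(['Less than $40K', '$120K +', 'Less than $40K',
                        '$60K - $80K', '$40K - $60K'])

# Boolean Series: True where the category matches
is_low_income = toy_income == 'Less than $40K'

# Two of the five values are True, so the mean is 2 / 5
low_income_share = is_low_income.mean()
print(low_income_share)
```

This is why subtracting the two means in the cell above yields a difference in proportions rather than a difference in counts.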

### Problem 2: Comparing "High-Income" Churned vs Unchurned Customers
Next, you will determine whether there is a difference in the proportion of churned vs unchurned customers who qualify as "high-income" -- i.e., those who make $120K or more per year.

You have been provided with a variable, called `high_income`, containing the value `'$120K +'`. Use it to complete the steps below:
- Create a variable, called `churned_customers`, that contains only the rows in your DataFrame corresponding to churned customers
- Create a variable, called `unchurned_customers`, that contains only the rows in your DataFrame corresponding to unchurned customers
- Create the following two boolean Series:
  - `high_income_churned`: `True` for each row of `churned_customers` whose `IncomeCategory` value equals `high_income`
  - `high_income_unchurned`: `True` for each row of `unchurned_customers` whose `IncomeCategory` value equals `high_income`
- Compute the mean of each of these Series (i.e., the share of high-income customers in each group), then print the difference between these means.


In [56]:
# Provided Code -- Do NOT Edit!
high_income = '$120K +'

In [57]:
# Create `churned_customers` and `unchurned_customers` DataFrames
churned_customers = churn[churn.Churned]
unchurned_customers = churn[~churn.Churned]

In [58]:
# Filter for high-income, churned customers
high_income_churned = churned_customers['IncomeCategory'] == high_income

# Filter for high-income, unchurned customers
high_income_unchurned = unchurned_customers['IncomeCategory'] == high_income

# Compute difference in means between `high_income_churned` and `high_income_unchurned` 
high_income_churned.mean() - high_income_unchurned.mean()

0.006737264543186669
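
Rather than checking one income category at a time, `pd.crosstab` can tabulate every category at once. The sketch below uses a hypothetical toy DataFrame with invented values; with `normalize='columns'`, each column sums to 1, so each cell is the share of that churn group falling in the given income category.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    'IncomeCategory': ['Less than $40K', '$120K +', 'Less than $40K',
                       '$120K +', 'Less than $40K', 'Less than $40K'],
    'AttritionFlag': ['Attrited Customer', 'Attrited Customer',
                      'Existing Customer', 'Existing Customer',
                      'Existing Customer', 'Existing Customer'],
})

# Rows: income categories; columns: churn groups; cells: within-group shares
income_shares = pd.crosstab(toy['IncomeCategory'], toy['AttritionFlag'],
                            normalize='columns')
print(income_shares)

# Difference in the low-income share between the two groups
low_income_gap = (income_shares.loc['Less than $40K', 'Attrited Customer']
                  - income_shares.loc['Less than $40K', 'Existing Customer'])
print(low_income_gap)
```

In this toy data, half of the attrited customers and three quarters of the existing customers are low-income, so the gap is negative; the real data set gives the small positive differences shown in the cells above.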

## Wrapping Up


Congratulations -- you've finished digging into the customer churn data set, and have generated significant insights for your organization! Write a paragraph that summarizes your findings and explains which factor(s) you analyzed are predictive of customer churn. 



### Summary of Findings

The analysis shows that neither age, gender nor number of dependents is a predictive factor.
- The difference in average age between existing and attrited customers is only about `0.40` years, whereas ages range from 26 to 73. This is rather insignificant, as the difference amounts to only about 0.85% of the 47-year age range -- virtually no difference at all.
- It was shown that 57% of customers who churn were female, whereas 52% of customers who do _not_ churn were female. Again, this 5-percentage-point difference between categories is minimal.
- Number of dependents can range from 0 to 5, but the difference in average number of dependents between categories was only 0.067. There is no significant difference between existing versus attrited customers with regard to this feature.
- Income category shows only small differences as well: the share of low-income customers is about 2.9 percentage points higher among churned customers, and the share of high-income customers is about 0.7 percentage points higher.


The analysis shows that total revolving balance _is_ important for predicting churn. The maximum total revolving balance in the data set is \\$2517, while the difference in average total revolving balance between existing and attrited customers is roughly \\$583.78 -- about 23% of the maximum balance!

This suggests that the organization should pay special attention to customers with a low total revolving balance: churned customers carried a lower balance on average, and this factor does seem to be predictive of customer churn.