# Day 5 Lab, IS 4487

This lab (like the last one) is designed to prepare you to complete the project assignment for today. We will go through code (using MegaTelCo as an example) that you will be able to adapt for the AdviseInvest project. Here is what you need to be able to do for the project assignment:

1. Create a plot showing the relationship between a numeric (or count) and a categorical variable.
2. Create a plot showing the relationship between two categorical variables.


## Load Libraries


In [None]:
import pandas as pd


## Import Data


In [None]:
mtc = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/dd870389117d5b24eee7417d5378d80496555130/Labs/DataSets/megatelco_leave_survey.csv")

In [None]:
mtc.head()

In [None]:
mtc.info()

# Prepare Data

1. Perform the cleaning from the previous lab:
   1. Remove negative values of `income` and `house`
   2. Remove absurdly large value of `handset_price`
   3. Remove NAs
   4. Make character variables into categorical variables, including `college`, which we will use to demo the plots. (`college` is coded `one`/`zero`, which is weird, but we'll leave it as is.)

For simplicity, I have added the code that you wrote for the previous lab in the code chunks below.

In [None]:
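# filter rows: keep positive house and income values and drop the handset_price outlier
# .copy() makes an independent data frame, so later column assignments do not trigger a SettingWithCopyWarning
mtc_clean = mtc[(mtc['house'] > 0) & (mtc['income'] > 0) & (mtc['handset_price'] < 1000)].copy()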


In [None]:
# remove NAs
mtc_clean = mtc_clean.dropna()

In [None]:
# Convert string to categorical variables (including college)
mtc_clean['reported_satisfaction'] = mtc_clean['reported_satisfaction'].astype('category')
mtc_clean['reported_usage_level'] = mtc_clean['reported_usage_level'].astype('category')
mtc_clean['considering_change_of_plan'] = mtc_clean['considering_change_of_plan'].astype('category')
mtc_clean['college'] = mtc_clean['college'].astype('category')
mtc_clean['leave'] = mtc_clean['leave'].astype('category')


In [None]:
# check that it worked
mtc_clean.isna().sum()

Note that there are now no NAs; 6 rows have been removed.

In the project you will be directed to change a 0/1 variable into a categorical variable (with labels). This change helps make plots more legible. Here is how to do that with pandas, using `college` as an example. The only difference is that `college`, weirdly, is coded not as 0/1 but as the words "zero" and "one."

We will use the pandas `replace()` method to make the change. The syntax is: `Series.replace(to_replace, value)`, where "Series" is the data frame column.

This creates a string variable. The second step is to turn that into a categorical variable.
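To make the two steps concrete, here is a minimal sketch on a made-up data frame. The column `owns_phone` and the labels "no" and "yes" are invented for illustration and are not part of the MegaTelCo data; you will need to adapt the idea to `college`.

```python
import pandas as pd

# Hypothetical toy column standing in for a 0/1 indicator
toy = pd.DataFrame({"owns_phone": [0, 1, 1, 0, 1]})

# Step 1: swap the codes for readable labels (the result holds strings)
toy["owns_phone"] = toy["owns_phone"].replace([0, 1], ["no", "yes"])

# Step 2: convert the string column to a categorical variable
toy["owns_phone"] = toy["owns_phone"].astype("category")

# Inspect the category labels
toy["owns_phone"].cat.categories
```
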

In [None]:
# Write your code here


In [None]:
# check that it worked:


# Plotting

## 1. Plot the relationship between a numeric and a categorical variable

What, for example, would be the appropriate plot type for showing the relationship between `leave`--our target variable--and `income`? In this case, `leave` is a categorical variable, while `income` is numeric.

- A histogram won't work because it shows the distribution (the frequencies of values) for just a single variable.
- A scatterplot? No.  This will show the relationship between two *numeric* variables.
- A line plot?  This is usually reserved for data that has a time dimension, which is displayed on the horizontal axis.  
- A barplot?  This *could* work.  A summary statistic--mean, median, count, max, min--would be shown on the y-axis, with the categories on the x-axis.

Make a barplot of average income with bars for `LEAVE` and `STAY`. Make sure to add a title.

Hint:  calculate a conditional mean first, then use that in the plot.
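As a rough illustration of the hint, the sketch below computes a conditional mean and plots it, using a made-up data frame. The columns `plan` and `monthly_bill` are hypothetical stand-ins, so treat this as a pattern to adapt rather than the answer for `leave` and `income`.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({
    "plan": ["basic", "basic", "premium", "premium", "premium"],
    "monthly_bill": [20.0, 30.0, 60.0, 70.0, 80.0],
})

# Conditional mean: the average bill within each plan
mean_bill = toy.groupby("plan")["monthly_bill"].mean()

# One bar per plan, with bar height equal to the group mean
ax = mean_bill.plot(kind="bar", title="Average monthly bill by plan")
ax.set_ylabel("Mean monthly bill")
```
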

In [None]:
# Your code goes here



## Choose a different summary statistic and add a title

- Update the plot so the bar heights show the median
- Add an appropriate title


In [None]:
# Write your code here


What doesn't work very well about this barplot? The information is limited; it does not show the *range* of values. The height of the bar is determined by the summary statistic we've chosen, but gives no information about the *distribution* of observations.

For that, we need a *boxplot*.

You might expect the pandas `boxplot()` method to take `x` and `y` arguments. Instead, it uses `by` (the grouping variable) and `column` (the numeric variable).
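Here is a small sketch of how `by` and `column` fit together, again on made-up data (`plan` and `monthly_bill` are hypothetical names, not MegaTelCo columns).

```python
import pandas as pd

# Hypothetical toy data: four customers on each plan
toy = pd.DataFrame({
    "plan": ["basic"] * 4 + ["premium"] * 4,
    "monthly_bill": [20.0, 25.0, 30.0, 35.0, 55.0, 60.0, 70.0, 80.0],
})

# `column` is the numeric variable (vertical axis);
# `by` is the categorical variable that defines one box per group
ax = toy.boxplot(column="monthly_bill", by="plan")
```
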



In [None]:
# Write your code here

This is fine for exploratory work, but it has a drawback: pandas adds an extraneous title at the top of the plot.

Do some research and re-create this plot using the Seaborn package. Make sure to add a title.



In [None]:
# Your code goes here

Now we can see from the box (which spans the middle 50% of the observations, with a line marking the median) that customers who stay tend to have lower incomes than customers who leave. In general, because box plots provide information about the *distribution* of the underlying data, they are often used to show the relationship between a categorical variable like `leave` and a numeric variable like `income`.
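If it helps to see the numbers behind a box, the sketch below computes the 25th, 50th and 75th percentiles per group on made-up data (`plan` and `monthly_bill` are hypothetical names). The box in a boxplot runs from the 25th to the 75th percentile, and the line inside it sits at the 50th.

```python
import pandas as pd

# Hypothetical toy data: four customers on each plan
toy = pd.DataFrame({
    "plan": ["basic"] * 4 + ["premium"] * 4,
    "monthly_bill": [20.0, 25.0, 30.0, 35.0, 55.0, 60.0, 70.0, 80.0],
})

# Quartiles within each group; unstack() puts the three percentiles in columns
quartiles = (
    toy.groupby("plan")["monthly_bill"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

quartiles
```
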

## 2. Plot the relationship between two categorical variables

This is tricky.  Will a boxplot work to show the relationship between, for example, `college` and `leave`?  No. For a boxplot, one of the variables needs to be numeric.



The best option in this case is a barplot, but some preparatory work is required.

1. Calculate counts of college goers at each level of `leave`. Use the `count()` function. One detail here is that the output is a grouped series, but for a grouped barplot the `plot()` function needs a dataframe as input. Therefore you'll need to include the `unstack()` function to return a dataframe.
2. The height of the bars will then represent those counts.

Input the following prompt into Gemini:  "explain what unstack does and why it is called that."
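The counting step described above might look like this on a made-up data frame. The columns `plan`, `renewed` and `customer_id` are hypothetical; with the lab data you would group by `college` and `leave` instead.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns plus an id column to count
toy = pd.DataFrame({
    "plan": ["basic", "basic", "basic", "premium", "premium", "premium"],
    "renewed": ["yes", "no", "yes", "yes", "yes", "no"],
    "customer_id": [1, 2, 3, 4, 5, 6],
})

# Count rows for every plan/renewed combination.
# The result is a Series with a two-level index (plan, renewed).
counts_long = toy.groupby(["plan", "renewed"])["customer_id"].count()

# Move the inner index level (renewed) into the columns,
# giving a data frame with one row per plan and one column per renewed value
counts_wide = counts_long.unstack()

counts_wide
```
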

In [None]:
# Your code goes here--calculate counts


Now, the barplot will use the above table of counts:



In [None]:
# Your code goes here

This plot compares churn rates between college-educated and non-college-educated customers, showing how leaving vs staying changes with education level. In other words, it shows the *impact* of education on churn.

In this case the impact is relatively negligible.

This interpretation will be clearer if we make the y-axis into a *proportion* rather than a *count*.

How would we do this?

We'll use a lambda function--or anonymous function--in conjunction with `apply()` from pandas.

A lambda function is created on the fly by the analyst:

`lambda arguments: expression`

- **lambda keyword**: Signals the start of a lambda function.

- **arguments**: Zero or more comma-separated arguments that the function takes.

- **expression**: A single expression that is evaluated and returned as the result of the function.

Here is an example:

In [None]:
square = lambda x: x * x

square(5)

Now create a lambda function to turn a count variable `x` into a proportion. Calculate the proportion for the following series, `example_series`:

In [None]:
example_series = pd.Series([20, 18, 5, 77, 100])

example_series


In [None]:
# Your code goes here

The next task is to change the `leave` variable in the count table into proportions. We use `apply()` to apply the lambda function created above to the table. The syntax is:

`.apply(lambda, axis = 1)`

`axis = 1` means that the function is applied to each row, that is, across the columns.
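To see what `axis = 1` does, here is a small sketch on a made-up table of counts (the row labels `basic`/`premium` and the columns `no`/`yes` are hypothetical). Each row is passed to the lambda in turn and divided by its own total, so every row of the result sums to 1.

```python
import pandas as pd

# Hypothetical table of counts: rows are groups, columns are outcomes
counts = pd.DataFrame(
    {"no": [10, 30], "yes": [30, 30]},
    index=["basic", "premium"],
)

# With axis=1 the lambda receives one row at a time;
# dividing by the row total turns the counts into proportions
props = counts.apply(lambda row: row / row.sum(), axis=1)

props
```
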

In [None]:
# your  code goes here



The next task is to use this proportion table to create a bar plot.

In [None]:
# Your code goes here--Make the plot



This plot should show that the split between leaving and staying is about 50% for both college and non-college customers. That is, the difference is negligible.

Fine-tune your plot

1. Add a title
2. Add an appropriate y-axis label.
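One possible way to add these finishing touches is to keep the Axes object that `plot()` returns and call its setter methods. The sketch below uses a made-up proportion table, and the title and label text are only placeholders.

```python
import pandas as pd

# Hypothetical proportion table: each row sums to 1
props = pd.DataFrame(
    {"no": [0.25, 0.5], "yes": [0.75, 0.5]},
    index=["basic", "premium"],
)

# plot() returns a matplotlib Axes object that can be customized afterwards
ax = props.plot(kind="bar")
ax.set_title("Renewal by plan")
ax.set_ylabel("Proportion of customers")
```
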

In [None]:
# Write your code here


## More practice with plots

Recreate the plots from the lecture:

1. A plot answering the question "is churn related to wealth?"
2. Display the distribution of house prices by churn status
3. Is churn related to phone usage?
4. Is churn related to satisfaction?

Make a brief comment on the meaning of the plot for understanding churn.

In [None]:
# Write your code here


In [None]:
# Write your code here


In [None]:
# Write your code here


In [None]:
# Write your code here


# Functions:

- `pd.read_csv()`: Reads a CSV file into a pandas DataFrame.
- `.info()`: Prints a concise summary of a DataFrame, including column names, non-null counts, and data types.
- `.dropna()`: Removes rows with missing values from a DataFrame.
- `.astype()`: Casts a pandas object to a specified dtype.
- `.groupby()`: Groups DataFrame using a mapper or by a Series of columns.
- `.mean()`: Returns the mean of the values for the requested axis.
- `.plot()`: Creates a plot of the data in a DataFrame or Series.
- `.median()`: Returns the median of the values for the requested axis.
- `.count()`: Counts non-null values in a Series or DataFrame.
- `.unstack()`: Pivots a level of the index labels.
- `.apply()`: Applies a function along an axis of the DataFrame.
- `lambda`: Creates an anonymous function.
- `sum()`: Returns the sum of a Series or DataFrame elements.