# Descriptive Analysis

Looking at raw data alone rarely yields much insight. We need ways to condense a dataset into a small set of numbers that summarize it in an easily digestible form, making it understandable both for you and for the people you work with. We call these numbers **descriptive statistics**.

## Objectives

- Use business context to guide exploratory analyses
- Pose clear business-relevant questions and answer them with data
- Identify limitations of data for solving business problems

In [None]:
# Imports!
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
# We'll start with Chipotle data
# Be sure to set `sep='\t'` since it's a TSV, not a CSV
chipotle = None

In [None]:
# Check it out


In [None]:
# Those item prices are gross - let's write a lambda function to clean that column!
# Capture the column in a new variable, item_prices
item_prices = None
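
As a hint for the cleaning step, here is a minimal sketch on toy data. It assumes the raw prices are strings with a leading `$` and stray whitespace; that format, and the names `raw_prices` and `clean_prices`, are illustrative assumptions, so check the actual column before reusing the idea.

```python
import pandas as pd

# Hypothetical raw prices: strings with a leading "$" and trailing whitespace
raw_prices = pd.Series(["$2.39 ", "$3.39 ", "$16.98 "])

# Strip whitespace and the dollar sign, then convert each value to a float
clean_prices = raw_prices.map(lambda price: float(price.strip().lstrip("$")))

print(clean_prices.dtype)
print(clean_prices.tolist())
```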

In [None]:
# Check it


# Different Statistical Measures

## Measures of Center

One natural place to begin is to ask about where the **middle** of the data is. In other words, what is the value that is closest to our other values? 

There are three common measures used to describe the "middle":

- **Mean**: The sum of values / number of values
- **Median**: The value with as many values above it as below it
    - If the dataset has an even number of values, the median is the mean of the two middle numbers.
- **Mode**: The most frequent value(s)
    - A dataset can have multiple modes if multiple values are tied for the most frequent.
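
To make the three definitions concrete, here is a small sketch on a made-up series (the values and the name `toy` are illustrative, not taken from the Chipotle data). It has an even number of values, two tied modes, and one large value that pulls the mean above the median.

```python
import pandas as pd

# Toy data: even number of values, two tied modes, one large value
toy = pd.Series([1, 2, 2, 3, 3, 10])

toy_mean = toy.mean()      # sum of values / number of values
toy_median = toy.median()  # mean of the two middle values, 2 and 3
toy_modes = toy.mode()     # every value tied for most frequent

print(f"Mean: {toy_mean}")
print(f"Median: {toy_median}")
print(f"Modes: {toy_modes.tolist()}")
```

Note that `.mode()` returns a Series rather than a single number, because ties are possible.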

Let's see what we have for our example:

In [None]:
print(f"Mean: {item_prices.mean()}")
print("*"*20)
print(f"Median: {item_prices.median()}")
print("*"*20)
print(f"Mode: {item_prices.mode()}")

**Discussion**: If somebody asked you "How expensive are items at Chipotle?", how would you answer?

- 


## Measures of Spread

Another natural question is about the **spread** of the data. In other words, how wide a range of values do you have? And how close or far are they from the "middle"?

### Min, Max, and Range

The minimum and maximum values of a dataset tell you the full extent of the values of your dataset. The range of the dataset is the difference between those two values.

In [None]:
print(f"Min: {item_prices.min()}")
print(f"Max: {item_prices.max()}")
print(f"Range: {item_prices.max() - item_prices.min()}")

### Percentiles and IQR

You can also calculate values at various **percentiles** to understand the spread. The Nth percentile is the value below which roughly N% of the data falls. The 25th and 75th percentiles are commonly used to describe spread, and the **interquartile range (IQR)** is the difference between these two values.

See [the docs](https://numpy.org/doc/stable/reference/generated/numpy.percentile.html) for more specifics about how percentiles are calculated, which is surprisingly tricky.
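
One source of the trickiness is that a percentile often falls between two data points. The sketch below uses a toy array (an illustrative assumption) to show that `np.percentile`, with its default settings, interpolates linearly between neighboring values, and reproduces the 25th percentile by hand.

```python
import numpy as np

values = np.array([1, 2, 3, 4])

q25 = np.percentile(values, 25)
q75 = np.percentile(values, 75)
iqr = q75 - q25

# Reproduce the 25th percentile by hand using linear interpolation:
# the target position lies between two indices of the sorted data
position = (len(values) - 1) * 0.25
lower_index = int(position)
fraction = position - lower_index
manual_q25 = values[lower_index] + fraction * (values[lower_index + 1] - values[lower_index])

print(f"25th: {q25}, 75th: {q75}, IQR: {iqr}")
print(f"Manual 25th: {manual_q25}")
```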

In [None]:
print(f"25th Percentile: {np.percentile(item_prices, 25)}")
print(f"75th Percentile: {np.percentile(item_prices, 75)}")
print(f"IQR: {np.percentile(item_prices, 75) - np.percentile(item_prices, 25)}")

### Standard Deviation

The **standard deviation** measures how far a typical data point lies from the mean. In its population form, it is defined as: $$\sqrt{\frac{\sum(x_i - \bar{x})^2}{n}}$$
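
Libraries differ in whether they divide by $n$ (population) or $n - 1$ (sample), controlled by the `ddof` argument. The following sketch uses a toy series (an illustrative assumption) to show that pandas defaults to `ddof=1` while NumPy defaults to `ddof=0`, so the two can disagree on the same data.

```python
import numpy as np
import pandas as pd

data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Population standard deviation: divide by n, as in the formula above
population_std = data.std(ddof=0)

# Sample standard deviation: divide by n - 1 (the pandas default)
sample_std = data.std()

# NumPy defaults to the population version (ddof=0)
numpy_std = np.std(data.to_numpy())

print(f"Population: {population_std}")
print(f"Sample: {sample_std}")
print(f"NumPy default: {numpy_std}")
```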

In [None]:
print(f"Standard Deviation: {item_prices.std()}")

In [None]:
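# np.std defaults to ddof=0 (divide by n); pandas .std() defaults to ddof=1 (divide by n - 1)
np.std(item_prices)  # pass ddof=1 to match the pandas result above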

# Visual Description

A picture is worth a thousand words - or numbers! Here we will show how to use histograms and box-and-whisker plots to describe your data.

## Histograms

One natural way of starting to understand a dataset is to construct a **histogram**, which is a bar chart showing how many values in the dataset fall into each of a series of ranges.

There will usually be many distinct values in your dataset, and you will need to decide how many **bins** to use in the histogram. The bins define the ranges of values captured in each bar in your chart. 
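
To see what the bin choice does numerically, this sketch counts a toy array (an illustrative assumption) with `np.histogram`, which returns the per-bin counts and the bin edges without drawing anything. Fewer bins hide detail, such as the gap between the bulk of the data and the largest value.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# np.histogram returns the count in each bin and the edges of the bins
counts_2, edges_2 = np.histogram(values, bins=2)
counts_4, edges_4 = np.histogram(values, bins=4)

print("2 bins:", counts_2, edges_2)
print("4 bins:", counts_4, edges_4)
```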

In [None]:
fig, ax = plt.subplots()
ax.hist(item_prices, bins=14)
plt.title('Counts, 14 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(item_prices, bins=10)
plt.title('Counts, 10 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(item_prices, bins=5)
plt.title('Counts, 5 Bins')

In [None]:
fig, ax = plt.subplots()
ax.hist(item_prices, bins=2)
plt.title('Counts, 2 Bins')

## Box and Whisker Plot

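A box-and-whisker plot can also be useful for visually summarizing your data by showing the median, the quartiles (the box spans the IQR), and how far the rest of the data extends, with outliers drawn as individual points.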
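
With matplotlib's default settings, the whiskers reach only to the most extreme points within 1.5 × IQR of the box, and anything beyond is plotted as an outlier. The sketch below computes those cutoffs by hand on a toy array; the data and the names `lower_fence` and `upper_fence` are illustrative assumptions.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles define the box; the median is the line inside it
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1: {q1}, median: {median}, Q3: {q3}, IQR: {iqr}")
print(f"Fences: ({lower_fence}, {upper_fence})")
print(f"Outliers: {outliers.tolist()}")
```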

In [None]:
fig, ax = plt.subplots()
ax.boxplot(item_prices)
plt.title('Item Prices')

# Addressing Business Questions

## Fast Food Data

In [None]:
ffood = pd.read_csv('data/Datafiniti_Fast_Food.csv')

In [None]:
ffood.head()

In [None]:
ffood.info()

### Question 1

How many different restaurant chains are represented in the data? Visualize the numbers for the restaurants with 50 or more instances.

In [None]:
# Your code here

### Question 2

Visualize the locations of restaurants in Buffalo, NY.

In [None]:
# Your code here

### Your Turn:

Work on questions 3-5 below in small groups:

### Question 3

In this dataset, how many Taco Bell restaurants are there in Alaska, and in which cities are they?

In [None]:
# Your code here

### Question 4

Convert the ZIP Codes to (five-digit) integers.

In [None]:
# Your code here

### Question 5

Which restaurant chain has the greatest representation in San Francisco, CA? (This city covers all the ZIP Codes between 94100 and 94188, inclusive.)

In [None]:
# Your code here

## Credit Card Data

In [None]:
credit = pd.read_csv('data/BankChurners.csv',
                     # Using a lambda function to ignore two unnecessary columns
                     usecols=lambda x: "Naive_Bayes" not in x)

In [None]:
credit.head()

In [None]:
credit.describe()

In [None]:
credit['Attrition_Flag'].value_counts()

We work for a credit card company and are worried about customers churning (becoming attrited).

### Your Turn: Second Exercise!

In breakout rooms, work on questions 1-3 below:

### Question 1

Get the means of the numerical columns for the existing and the attrited customers separately.

In [None]:
# Your code here

### Question 2

Visualize the distributions of total revolving balances for each group.

In [None]:
# Your code here

### Question 3

Make two bar charts counting the numbers in each income category for each group separately.

In [None]:
# Your code here