# Exploratory Data Analysis
## DS-3001: Machine Learning 1

Content adapted from Terence Johnson (UVA)

**Notebook Summary**: In this notebook, we will discuss the process of Exploratory Data Analysis (EDA). We will talk through how to quickly explore the features in your data via statistical descriptions and quick visualizations. We will prioritize quick visualizations rather than publication-ready figures.

## Exploratory Data Analysis

- **Exploratory Data Analysis (EDA):** A process performed on cleaned data to understand some basic summaries and visualizations. Helps us understand the basic properties of the data and whether they are clean enough to proceed.

- **Analyzing one or two variables at a time:** We are interested in understanding the distribution of a single variable or the relationship between two variables.

- **No unique way to do EDA:** This can be an exhausting process, and there is no uniquely correct way to do it: choices have consequences.

- **Focus on useful graphs rather than pretty graphs for now:** Since we are doing EDA and not Visualization, the emphasis is on *useful* graphs, not necessarily *pretty* ones.

- **Working with Pandas for now:** We will stay inside Pandas, and introduce PyPlot and Seaborn in the next lecture.

- **Using pre-trial data for visualization:** Last time, we cleaned the pretrial data, which works well for visualizations: There are many interesting numeric variables, categorical variables, and dummy variables. We'll use it again today.

#### Setting up the Environment

* First, we'll load in the necessary packages.
* Second, we'll mount our Google Drive to the notebook so that we can access the files for this class.
* Third, we'll change the working directory to the folder for this class.
* Fourth, we'll load in the pre-trial data from the Data Wrangling notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # Pandas automatically uses some pyplot functions, so we need it loaded
import os # For changing directory

In [None]:
# To mount your google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# path_to_DS_3001_folder = '/content/drive/MyDrive/DS-3001/01_Python_for_ML'

# Update the path to your folder for the class
# Where you stored the data from the previous notebook
path_to_DS_3001_folder = ''
os.chdir(path_to_DS_3001_folder)

In [None]:
# Read in the cleaned version of the pretrial data
# Point to where you stored the file
trial_df = pd.read_csv('./data/pretrial_data.csv', low_memory=False)
print(trial_df.head(),'\n')
print(trial_df.describe(),'\n')

*Note:* If mounting your drive does not work, you can find the cleaned version of the data set on Canvas. Upload it to the file system in Google Colab by hitting the folder icon to the left and then the upload button. You can upload the file and then load it in as a data frame from there.

## Exploring a Single Variable

We will start with methods for exploring the distribution of, or characterizing, a single variable. As we mentioned in the previous lecture (01_Data_Wrangling), variables can either be numeric or categorical. We will start by describing a single numeric variable.

### Histograms (Single Variable, Numeric)

- **Histogram:** A fundamental and powerful tool for visualizing numeric data of a single variable. We got an introduction to histograms in 01_Data_Wrangling, but here we will go more in-depth on how they work:
  
- **How Histograms Are Created:**
  - Take the minimum and maximum values that the variable takes, and divide the range into $B$ equally spaced *bins*. For example, if the min value is 0 and the max value is 100 and $B=4$, the bins are $\{ [0,25), [25,50), [50,75), [75,100] \}$.
  - For each bin, count the number of observations that fall into that bin.
  - Plot the result as a bar graph, where the count in the bin equals the height of the bar.

- At this stage, picking the number of `bins=B` is a matter of "taste." Note, the number of bins will shape how the graph looks and thus can change your interpretation, so choose and interpret carefully.

- You can quickly create a histogram directly from a Pandas dataframe using the following notation: `df[var].plot.hist(bins=B)`. If you don't include a value for $B$, Pandas will use a default value of 10.
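
The binning steps above can be sketched with NumPy on a small made-up sample (the values are illustrative and are not taken from the pretrial data). `np.histogram` returns the counts and the bin edges, which is the same kind of computation that `.plot.hist()` draws as bars.

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration (not the pretrial data)
x = pd.Series([0, 5, 20, 25, 40, 55, 60, 75, 90, 100])

# Split [min, max] into B equally spaced bins and count the observations in each
B = 4
counts, edges = np.histogram(x, bins=B)

print(edges)   # bin boundaries
print(counts)  # number of observations per bin
```

Changing `B` changes both the edges and the counts, which is why the choice of bins can change how the same data look.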

In [None]:
# Quickly clean the GiniIndex variable using what we learned from the Wrangling Data Set

In [None]:
# Create a histogram of the gini variable with 20 bins
# Set grid to false

#### The Math Behind Histograms
- Formally, a histogram is a function such that the *height* of bar $k$ is:
$$
h_k =  \sum_{i=1}^N \mathbb{I}\left\lbrace b_{k-1} \le x_i < b_k \right\rbrace, \quad k = 1, ..., B
$$
where $b_0 < b_1 < ... < b_B$ are the bin edges and $\mathbb{I}\{...\}$ equals 1 when the statement inside is true, and zero otherwise. As in the example above, the last bin also includes its right edge $b_B$ so that the maximum value is counted.
- When *normalized* by $N$, you can interpret $h_k / N$ as the probability that a random draw of $X$ falls into the $k$-th bin, between $b_{k-1}$ and $b_k$.


In [None]:
# Next look at the histogram for Bond
# Can you gather much information from this graph? Why or why not?

### Variables with "Long Tails"
- **The bond histogram above is not very useful:** All of the data are consolidated into a single bin. This view can be unhelpful and misleading. We need some methods to deal with visualizing distributions that look like this.

- **The scale leads to these strange graphs:** We often get graphs that are not meaningful to us because their values are **badly scaled**.

- **Computers dislike variables with bad scaling:** Stable calculations become challenging when comparing numbers of very different magnitudes.

- **Log scaling:** The traditional way to rescale such variables is to use the *(natural) logarithm* or `log()` function. This converts multiplication/division to addition/subtraction, levels to growth rates, and shrinks large values significantly.

- **Log Scaling has Drawbacks:** `np.log()` is only defined as a real number for strictly positive numbers, so it forces us to drop zeros and negative numbers from visualizations or analysis. This is highly undesirable.

- **Inverse hyperbolic sine:** An alternative method for scaling the magnitude of a variable. `np.arcsinh()` is defined for any number, positive or negative, and has almost the same interpretation as `np.log()` for large values, so we often use it instead.
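
As a minimal sketch of the difference between the two transforms, the example below uses made-up dollar amounts that include zeros (these are not values from the pretrial data, and the variable names are hypothetical). The log version has to drop the zeros first, while the arcsinh version keeps every row.

```python
import numpy as np
import pandas as pd

# Made-up amounts with zeros and a long right tail (not the pretrial data)
amount = pd.Series([0, 0, 100, 500, 2_500, 10_000, 250_000], dtype=float)

# log(0) is -inf, so the zeros must be dropped before taking logs
positive = amount[amount > 0]
amount_log = np.log(positive)

# arcsinh is defined everywhere and maps 0 to 0, so every row is kept
amount_arcsinh = np.arcsinh(amount)

print(len(amount_log), len(amount_arcsinh))
```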

#### Log transform of the `bond` Variable

In [None]:
# Let's create a new variable in the data frame
# for the log of the bond variable
# Try to plot the histograms. What happens?


Because the log is not defined for values that are 0 or below, we need to remove all values that are 0 or below from our plot.

In [None]:
# Isolate the bond value


#### Now we can try the inverse hyperbolic sine transform of the `bond` variable instead

In [None]:
# Calculate the arcsinh of the bond variable
# and make it a new column


# Plot the resulting histogram


#### Comparing the Math of the `log()` vs. `arcsinh()` functions

- **Similar Derivatives:** The derivative of natural log is $1/x$, while for inverse hyperbolic sine it is $1/\sqrt{1+x^2}$. For large $x$, these are so close as to render the difference negligible for our purposes.

- **Easy to transform between scaled and unscaled values for the data:** We won't go over it now, but it's typically easy to go back and forth between the transformed analysis and the original values in levels, so using the transformations is not a problem in analysis.

- **`arcsinh()` lies slightly above `log()`:** Since `arcsinh(0)=0` but `log()` tends to be negative infinity at zero and their derivatives are similar, the `arcsinh()` curve lies above the `log()` curve.
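
A quick numeric check of this claim, offered as a sketch: because $\text{arcsinh}(x) = \log(x + \sqrt{x^2+1})$, the gap between the two curves is positive for every $x > 0$ and settles toward $\log(2)$ as $x$ grows. The grid of `x` values below is arbitrary.

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1_000.0])

# arcsinh(x) = log(x + sqrt(x^2 + 1)), so the gap to log(x) is always positive
gap = np.arcsinh(x) - np.log(x)

print(gap)        # shrinks toward log(2) as x grows
print(np.log(2))
```

This is why, for large values, an arcsinh-transformed variable behaves like a log-transformed one shifted by a constant.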

In [None]:
# A visual comparison between the log and arcsinh functions

# Creating the grid points to compare on
x = np.arange(-15,15,.1)

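# Calculating the log and arcsinh outputs
# log is only defined for x > 0, so evaluate it on the positive grid points only
x_pos = x[x > 0]
y1 = np.log(x_pos)
y2 = np.arcsinh(x)

# Plotting the results
plt.plot(x_pos, y1, label='Natural Log')
plt.plot(x, y2, label='Inverse Hyperbolic Sine')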
plt.xlabel("X")
plt.ylabel("Y")
plt.grid()
plt.legend(loc='upper left')
plt.title('Natural Log and Arcsinh')
plt.show()

## Student Exercise: Histograms

0. First, we need to clean the `ImposedSentenceAllChargeInContactEvent` variable. We'll rename it to be `sentence`

1. Plot a histogram of the `sentence` variable. Is it badly scaled?

**Answer:**

In [None]:
# Code to clean the ImposedSentenceAllChargeInContactEvent

# Isolate the column
sentence = trial_df['ImposedSentenceAllChargeInContactEvent']

# Look at unique values
# print('Unique values of Sentence:', sentence.unique())

# Replace empty strings with nans
sentence = sentence.replace(' ', np.nan)

# Update the data type to be numeric
sentence = pd.to_numeric(sentence, errors = 'coerce')

# Create a new column for sentence with our updated data
trial_df['sentence'] = sentence

# Now look at trial_df sentence
trial_df['sentence']

In [None]:
# Plot the sentence histogram here


2. Use the `log()` and `arcsinh()` transformations on `sentence` and create histograms. Compare the outcomes.

In [None]:
# Plot the log() here


In [None]:
# Plot the arcsinh() here


3. What would happen if you threw away the zeros when analyzing sentencing? How might it bias or otherwise interfere with your analysis? Hint: For additional information on the `ImposedSentenceAllChargeInContactEvent` variable, look at the codebook.

**Answer:**

**SOLUTION:** We would overstate the typical sentence length. Dropping the zeros implicitly assumes that everyone was sentenced and that no one received a sentence of 0 months. A model trained to predict sentence length on the remaining data would tend to predict longer sentences than one trained with the zeros included.

### An Overview of Statistics (Numerical Descriptions of the Data)

- **A Sample Statistic:** A function of the data. We take a list of values and aggregate it into a summary number(s) that helps us better understand the phenomenon we're interested in.

- **The mean is a sample statistic**: An example sample statistic is the **mean** or **average**. To calculate the mean, we sum all of the values and divide by the total number of values. If we have $N$ observations and the values are $(x_1, x_2, ..., x_N)$, the average is
$$
\bar{x} = \dfrac{x_1+x_2+...+x_N}{N} = \dfrac{ \sum_{i=1}^N x_i }{N}
$$
The intuition is, "Imagine you drew a number out of the hat. They're all drawn with equal probability. What kind of number do you expect to get?"

- **Statistics considers what happens to a sample statistic as $N$ grows:** The field of statistics, roughly, studies the behavior of sample statistics as the sample size $N$ gets large. For example, what is the behavior of a sample statistic as we get lots of data? Does it approach the "true" value, if one exists?

- **EDA to Summarize:** In EDA, we're typically using statistics as a way to summarize the data and understand its features, and not necessarily imputing a "deeper meaning" to them.

#### Always, always, always, look at your data
![Lawyer Salaries](https://github.com/DS3001/EDA/blob/main/lawyerSalaries2018.jpg?raw=1)
- How *useful* is it to say, "The average yearly salary of a lawyer is about $100k?"
- Statistics can be incredibly misleading

### Statistics: Measures of Central Tendency
- These statistics correspond to values around which the data are concentrated:
    - **Mode**: The most frequently occurring value in the data. Calculated using the following code: `df[var].mode()`
    - **Median**: The value(s) at which half the population is above and half the population is below. Calculated using the following code: `df[var].median()`
    - **Mean**: The numeric average value of the data. Calculated using the following code: `df[var].mean()`.
$$
\bar{x} = \dfrac{x_1+x_2+...+x_N}{N} = \dfrac{ \sum_{i=1}^N x_i }{N}
$$
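
A minimal sketch of the three measures on a made-up, right-skewed sample (not the pretrial data). It illustrates how a single large value pulls the mean away from the median and mode.

```python
import pandas as pd

# Made-up right-skewed sample: most values are small, one is very large
x = pd.Series([1, 2, 2, 3, 4, 5, 100])

print(x.mode())    # most frequent value(s); a Series, because ties are possible
print(x.median())  # middle value after sorting
print(x.mean())    # arithmetic average, pulled upward by the large value
```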

In [None]:
# Look at the histogram for the GiniIndex
# Calculate the mean, median, and mode


In [None]:
# Look at the histogram for the bond_arcsinh variable
# Calculate the mean, median, and mode


### Statistics: Measures of Rank

- **Ordering data by magnitude:** Consider lining the data up by magnitude, from smallest to largest (ascending order).

- Two ways to look at the data in terms of rank:
  - **Percentiles:** The **p-th percentile** is the value below which $p\%$ of the population falls and above which $(100-p)\%$ of the population falls. Ex. If the value 10 is the 70th percentile in a data set, it means that 70% of data points are below 10.
  - **Quantiles:** If you use decimals instead of percents, like $.05$ for $5\%$ or $.50$ for $50\%$, the word **quantile** is typically used. Ex. If the value 10 is the 0.7 quantile, that means that 70% of the data points are below 10. Calculated in Pandas using: `df[var].quantile(p)`

- **Quantiles and the median are robust to outliers:** Moving extremely large or small values won't affect the median or significantly change the rankings.
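
To make the robustness point concrete, here is a small sketch on made-up numbers: replacing the maximum with an extreme value leaves the median and the lower quartile unchanged, while the mean moves a great deal.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
x_outlier = x.copy()
x_outlier.iloc[-1] = 10_000  # made-up extreme value replacing the maximum

print(x.median(), x_outlier.median())              # unchanged
print(x.quantile(0.25), x_outlier.quantile(0.25))  # unchanged
print(x.mean(), x_outlier.mean())                  # very different
```

Quantiles very close to the extreme end (for example 0.95 in a sample this small) can still move, because pandas interpolates between neighboring observations by default.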

#### Identifying the Quantiles

In [None]:
# Look at the quantiles of the GiniIndex variable
var = 'GiniIndex'


**Empirical Cumulative Distribution Function (ECDF):** The ECDF plots the variable of interest on the x-axis and the quantile on the y-axis.

You can get the quantiles for each value using the following code: `df[var].rank(method = 'average', pct = True)`. Then you plot your data on the x-axis and the quantiles you calculated using `rank` on the y-axis to create the ECDF.
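
The recipe above can be sketched on a short made-up series (the column names `x` and `quantile` are chosen here for illustration). Sorting by value before plotting makes the line trace out the ECDF from left to right.

```python
import pandas as pd

x = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

# Fractional rank of each observation: ties share the average of their ranks
q = x.rank(method='average', pct=True)

# Pair each value with its quantile and sort by value to trace out the ECDF
ecdf = pd.DataFrame({'x': x, 'quantile': q}).sort_values('x')
print(ecdf)

# In a notebook, one way to draw it: ecdf.plot(x='x', y='quantile')
```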

In [None]:
# Generate the ECDF for the GiniIndex variable


In [None]:
# Repeat the same steps of identifying the quantiles and
# ECDF for the 'bond' variable


###  Statistics: Measures of Dispersion
- Measures of dispersion describe how "spread out" the data are:
    - **Range:** The distance between the minimum and maximum values of the data.
    - **Interquartile Range:** The distance between the 0.25-quantile and 0.75-quantile, which includes the middle half of the data
    - **Variance:** The average squared distance from the mean,
$$
s^2 = \dfrac{(x_1-\bar{x})^2 + (x_2 - \bar{x})^2 + ... + (x_N - \bar{x})^2}{N-1} = \dfrac{\sum_{i=1}^N (x_i-\bar{x})^2 }{N-1}
$$
So take the value of each observation $i$, subtract off the mean $\bar{x}$, square that, sum the results over all observations, and then divide by $N-1$. If the data are all clustered around $\bar{x}$, this will be small, but if the data are very spread out, this will be larger.
    - **Standard Deviation:** The square root of the variance,
$$
s = \sqrt{s^2} = \sqrt{ \dfrac{\sum_{i=1}^N (x_i-\bar{x})^2 }{N-1} }
$$
- Why deal with the standard deviation instead of the variance? The standard deviation is in the same units as the original variable, while the variance is in squared units, since it is approximately an average of squared deviations. The two also have different statistical properties in small samples, and some models are more naturally parameterized in terms of the variance rather than the standard deviation.
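
The formulas above can be checked by hand on a few made-up numbers. One detail worth knowing when mixing libraries: pandas' `.var()` and `.std()` divide by $N-1$ by default, while NumPy's `np.var` and `np.std` divide by $N$ unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

# Sample variance written out: sum of squared deviations divided by N - 1
s2 = ((x - x.mean()) ** 2).sum() / (N - 1)
s = np.sqrt(s2)

print(s2, x.var())                    # pandas divides by N - 1 by default
print(s, x.std())
print(np.var(x.to_numpy()))           # NumPy divides by N by default
print(np.var(x.to_numpy(), ddof=1))   # ddof=1 matches the N - 1 formula
```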

#### Comparison of two different distributions with the same mean but different dispersion

To experiment on your own, try changing the values of `std1` and `std2` to see what happens as you increase the standard deviation. Also try changing `n` to see how the sample statistics change as the number of samples changes.

In [None]:
### Parameters to change
n = 10000 # Number of points to draw
mean = 0 # Mean for both distributions
std1 = 1 # Standard deviation for first distribution
std2 = 3 # Standard deviation for second distribution
nbins = 50
###

# Randomly draw points from normal distribution
x1 = np.random.normal(loc = mean, scale = std1, size = n)
x2 = np.random.normal(loc = mean, scale = std2, size = n)

# Print a comparison of the range, IQR, sample variance, and sample
# standard deviation
print(
f'''
Distribution 1:
\tRange: {max(x1)-min(x1)}
\tIQR: {np.quantile(x1,.75)-np.quantile(x1,.25)}
\tIQR End Points: [{np.quantile(x1,.25)},  {np.quantile(x1,.75)}]
\tSample Variance: {np.var(x1, ddof=1)}
\tSample STD: {np.std(x1, ddof=1)}
'''
)

print(
f'''
Distribution 2:
\tRange: {max(x2)-min(x2)}
\tIQR: {np.quantile(x2,.75)-np.quantile(x2,.25)}
\tIQR End Points: [{np.quantile(x2,.25)},  {np.quantile(x2,.75)}]
\tSample Variance: {np.var(x2, ddof=1)}
\tSample STD: {np.std(x2, ddof=1)}
'''
)

# Histogram of both distributions for comparison
plt.hist(x1, label = 'Distribution 1', color = 'firebrick', alpha = 0.2, bins = nbins)
plt.hist(x2, label = 'Distribution 2', color = 'dodgerblue', alpha = 0.2, bins = nbins)
plt.xlabel('Value of X')
plt.ylabel('Frequency')
plt.title(f'Comparison of Distributions with Different Dispersion')
plt.legend()

plt.show()

### Boxplots

- **Move towards visualizing dispersion:** The rank and dispersion information is useful to illustrate in a plot, since it can feel somewhat abstract when just looking at the numbers compared to a histogram.

- **Boxplot**: A graph to visualize the 5-number summary:
    - **Median:** The middle bar (green in the below plot) is the median/0.5-quantile/50th percentile

    - **IQR:** The "box" represents the interquartile range (IQR): The range of values containing everything from the 0.25-quantile to the 0.75-quantile

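    - **Whiskers:** The "whiskers" extend to the most extreme observations that lie within $\frac{3}{2} * \text{IQR}$ of the box, i.e., no lower than the first quartile minus $\frac{3}{2} * \text{IQR}$ and no higher than the third quartile plus $\frac{3}{2} * \text{IQR}$.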

    - **Outliers:** Values outside the whiskers are typically considered **outliers**. They are represented as points outside the whiskers.

- This plot is intended to illustrate the rank information in a useful way
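
The pieces of a boxplot can be computed by hand. The sketch below uses made-up data with one extreme value and the $1.5 \times \text{IQR}$ rule described above; the names `lower_fence` and `upper_fence` are chosen here for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # made-up data, one extreme value

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles; points outside are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (x < lower_fence) | (x > upper_fence)

# Whiskers stop at the most extreme observations still inside the fences
whisker_low = x[~is_outlier].min()
whisker_high = x[~is_outlier].max()

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(whisker_low, whisker_high, x[is_outlier].tolist())
```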

$\text{Boxplot of GiniIndex}$

In [None]:
# Create a box plot of the GiniIndex variable


In [None]:
# Create a boxplot for GiniIndex but this time in the horizontal orientation


In [None]:
# Create a box plot for the bond variable


In [None]:
# Look at the bond variable that we transformed using arcsinh


### Variable Descriptions

- Many stats packages report a **five-number+** summary:
  - Five-number summary:
    1. Minimum
    2. 25th percentile
    3. Median
    4. 75th percentile
    5. Maximum
  - The plus:
    6. Mean
    7. Standard deviation
    8. Count: How many non-missing observations are recorded

- In Pandas, you can get the five-number+ summary with `df[var].describe()`

- From `.describe()`, you can quickly compute almost all the statistics we've mentioned

- The `count` value is the number of non-missing entries
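
As a sketch of how the other statistics fall out of `.describe()`, the example below uses a made-up series with one missing value. The labels `'std'`, `'25%'`, `'75%'`, `'min'`, and `'max'` are the index labels pandas uses in the output.

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, None])  # one missing value

desc = x.describe()
print(desc)

variance = desc['std'] ** 2              # variance is the squared standard deviation
iqr = desc['75%'] - desc['25%']          # IQR from the reported quartiles
value_range = desc['max'] - desc['min']  # range from the reported extremes

print(variance, iqr, value_range)
```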

In [None]:
# Get the description of the age variable
# Compute the variance and IQR using the description


### What do you do with outliers?

- Maybe nothing: They're part of the data. Maybe you trim the outliers and drop them, or **winsorize** and replace them with a less extreme high or low cutoff value.

- The outliers will typically exert *leverage* on the analysis: extreme values will influence the outcomes of your estimates or algorithm (e.g. they disproportionately affect the variance)

- But if the outliers are "really part of the data," that leverage can be totally legitimate

- What you want to be certain of is that the outliers are actually representative of the population of interest --- some observations might have characteristics that make them uncharacteristic of the data you expect to see in the future, and your models will be less useful if they are trained on those data

- The field of *robust statistics* is generally concerned with estimating models when the presence of outliers is likely to interfere with the results
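
A minimal sketch of trimming versus winsorizing on made-up data. The 5th and 95th percentile cutoffs are an assumption made for illustration; the right cutoffs depend on the data and the question.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Cutoffs chosen here as the 5th and 95th percentiles (an illustrative choice)
low, high = x.quantile(0.05), x.quantile(0.95)

trimmed = x[(x >= low) & (x <= high)]       # drop values outside the cutoffs
winsorized = x.clip(lower=low, upper=high)  # cap values at the cutoffs instead

print(len(x), len(trimmed), len(winsorized))
print(x.mean(), trimmed.mean(), winsorized.mean())
```

Trimming shrinks the sample, while winsorizing keeps every row but changes the extreme values, so the two can lead to different summaries.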

## Student Exercise: Statistics and Boxplots

1. Generate a description of the `sentence` variable. What is the sample mean and sample standard deviation?

In [None]:
# Answer Here


2. What are the median and mode of the `sentence` variable? What is the variance? The IQR?

In [None]:
# Answer Here


3. Make a boxplot. Are there a lot of outliers? Explain.

In [None]:
# Answer Here


4. Create an `outlier` dummy for the bond variable that indicates an observation is more than $1.5 \times IQR$ below the first quartile or above the third quartile. (There are many ways to do this, some easier than others.) What proportion of the observations are outliers?

In [None]:
# Answer Here


### Scatter Plots (Two Variables, Numeric)

- **Scatter Plots to view relationship between two numeric variables:** Just like cross-tabs allow you to think about two variables at once, scatter plots provide a way of looking at the association between two variables in the data set

- **What's a scatterplot?** Pick two variables, $x$ and $y$. For each pair of values $(x_i, y_i)$ for observation $i$, you make a dot. Plot all the dots for all observations, $i=1,...,N$.

- **Goal of the scatter plot:** To uncover patterns of association between the two variables being plotted.

In [None]:
# Quickly clean the prior felonies and misdemeanors variables

# Set up prior felonies variable
prior_F = trial_df['PriorConvs_Fel']
prior_F = prior_F.replace(' ', np.nan)
prior_F = pd.to_numeric(prior_F, errors = 'coerce')
trial_df['prior_F'] = prior_F

# Set up prior misdemeanors variable
prior_M = trial_df['PriorConvs_Misd']
prior_M = prior_M.replace(' ', np.nan)
prior_M = pd.to_numeric(prior_M, errors = 'coerce')
trial_df['prior_M'] = prior_M

In [None]:
# Plot a scatter plot of the bond and age
# Plot a scatter plot of the GiniIndex and age
# Plot a scatter plot of the bond arcsinh and age
# Plot a scatter plot of prior misdemeanors (prior_M) and felonies (prior_F)


### Statistics: Measures of Association

- **Covariance:** Essentially, the common variance between two variables. It measures whether the two variables tend to move in the same direction as one another.
$$
\text{cov}(x,y) = \dfrac{(x_1-\bar{x})(y_1-\bar{y}) + (x_2-\bar{x})(y_2-\bar{y})+...+(x_N-\bar{x})(y_N-\bar{y}) }{N-1} = \dfrac{\sum_{i=1}^{N} (x_i-\bar{x})(y_i-\bar{y}) }{N-1}
$$
*Notice how $\text{cov}(x,x) = s^2$.*

- **Comparison between pairs of points:** Look at each pair $x_i$ and $y_i$. If they tend to both be above or below their averages, then there is positive covariance. If one tends to be above its average when the other is below, then there is negative covariance.

- **Correlation:** The covariance normalized by the product of the two standard deviations, $s_x$ and $s_y$:
$$
r_{x,y} = \dfrac{\text{cov}(x,y)}{ s_x s_y}
$$
This is helpful because it is between -1 and 1, with 1 being perfect positive correlation, -1 being perfect negative correlation, and 0 being no correlation at all. *Important Note:* Correlation will only capture linear relationships between variables, not non-linear.
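
Both formulas can be verified by hand on a tiny made-up data set and compared with the pandas results. The column names `x` and `y` are placeholders.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 1.0, 4.0, 3.0, 6.0]})
N = len(df)

# Deviations of each variable from its own mean
dx = df['x'] - df['x'].mean()
dy = df['y'] - df['y'].mean()

# Covariance: sum of products of deviations, divided by N - 1
cov_xy = (dx * dy).sum() / (N - 1)

# Correlation: covariance divided by the product of the standard deviations
r_xy = cov_xy / (df['x'].std() * df['y'].std())

print(cov_xy, df['x'].cov(df['y']))
print(r_xy, df['x'].corr(df['y']))
```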

#### Covariance and Correlation Matrices

- **Calculating covariance and correlation with Pandas:** The function `df.cov()` will compute all of the variances and covariances for everything in your dataframe, and `df.corr()` will compute the correlations.

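- **Outputs of `df.cov()` and `df.corr()`:** The output will be a matrix. For `df.cov()`, the variances of the variables will be on the diagonal and the covariances will be the off-diagonal terms. For `df.corr()`, the diagonal entries are all 1 and the correlations are the off-diagonal terms.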

- **Restricting number of outputs:** You probably want to use `df.loc[:,list]` to restrict attention to a set of variables in `list`, rather than compute all the possible covariances/correlations.
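
A sketch of the matrix versions on randomly generated data (the columns `a`, `b`, `c`, and `label` are made up for illustration). Restricting to a list of numeric columns with `.loc` keeps the output small and also sidesteps problems that non-numeric columns can cause.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size=200),
                   'b': rng.normal(size=200),
                   'label': ['u', 'v'] * 100})   # a non-numeric column
df['c'] = 2 * df['a'] + rng.normal(scale=0.5, size=200)  # built to track 'a'

cols = ['a', 'b', 'c']   # numeric columns of interest
cov_matrix = df.loc[:, cols].cov()
corr_matrix = df.loc[:, cols].corr()

print(cov_matrix)
print(corr_matrix)
```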

In [None]:
# Calculate the covariance and correlation matrices of age, bond, and GiniIndex
# Define the variables to consider
vars = ['age', 'bond', 'GiniIndex']

# Compute the covariance and correlations of the variables


## Student Exercise: Scatter Plots and Covariance

1. Plot a scatterplot of the `sentence` and `bond` variables. What do you see?

In [None]:
# Answer Here


2. Try a scatterplot of the inverse hyperbolic sine of `sentence` and `bond`.

In [None]:
# Answer Here


3. What are the covariance and correlation matrices between `sentence` and `bond`?

In [None]:
# Answer Here


### Grouping

- **Conditional Grouping:** We often want to **condition** or **group** our sample statistics and plots on specific categorical variables.
  - For example, the bond or sentence conditional on race or sex. This provides valuable context for what the numbers mean.

- **Differentiating between categorical cases:** A lot of our tools immediately become more powerful when we can quantitatively differentiate between different categorical cases.

- We'll demonstrate grouping using Kernel Density Plots, Pivot Tables, and Box Plots.

### Kernel Density Plots

- **Downside of histograms:** Plotting multiple variables on the same histogram at the same time often becomes a jumbled mess since they are filled in. Is the solution to stack them? Jitter them?

- **Enter Kernel Density Plots:** The alternative is to use a smoothed line to represent each variable; this is called a **kernel density plot**.

- **How Kernel Density Plots are Created:** The intuition of a kernel density plot is that each data point gets its own little bell curve (normal distribution), centered at its value. All the bell curves are averaged together.
  - So if data are bunched closely, their bell curves sum to a large value. If the data are sparse around some values, the sum is small
  - This smooths the jaggedness of histograms.

- **Plotting a density plot in Pandas:** You can plot a kernel density plot of your data using the following code: `df[var].plot.density()`
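
The "one bell curve per data point" intuition can be written out directly in NumPy. This is a hand-rolled sketch on made-up data with a fixed bandwidth `h` chosen for illustration; it is not necessarily how `.plot.density()` computes its curve, which chooses the bandwidth automatically.

```python
import numpy as np

x = np.array([1.0, 1.5, 2.0, 6.0])   # made-up data: three close points, one isolated
h = 0.5                               # bandwidth (assumed): width of each bell curve
grid = np.linspace(-2, 9, 1101)       # points at which to evaluate the density

# One Gaussian bump per data point, centered at that point
z = (grid[:, None] - x[None, :]) / h
bumps = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

area = density.sum() * (grid[1] - grid[0])   # crude numerical integral
print(area)                                  # close to 1
print(grid[density.argmax()])                # peak sits in the cluster of points
```

A larger `h` gives a smoother curve and a smaller `h` gives a spikier one, which plays the same role as the number of bins in a histogram.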

In [None]:
# Create a kernel density plot of the age variable


In [None]:
# Compare this to the histogram for age


#### Upside and Downside of Kernel Density Plots

- **Upside:** It's easy to visualize many density plots at once, grouped by a categorical variable, and the choice of "bins" isn't as arbitrary (there are a lot of good ways to pick the **bandwidth**)

- **Downside:** If the data have big spikes, the kernel density plot struggles to represent that faithfully, because it is trying to smooth everything out.

In [None]:
# Plot the kernel density plot for bond


In [None]:
# Compare with the histogram


In [None]:
# What about for the bond_arcsinh variable?


In [None]:
# Compare with the histogram as well


### Conditioning on a Categorical Variable: Pivot Tables and Grouped Density Plots

- To make very useful plots to compare the same variable for different groups, you have to do two steps:
    1. Make a **pivot table** of the values, using `df_wide = df.pivot(columns=group,values=var)` where `group` is the categorical variable to condition on, and `var` is the variable to plot. This is often called a "wide" dataframe because it explodes values by columns.
    2. Call the `.plot.density()` method on the `df_wide` dataframe you built in step 1

- The result is a kernel density plot, where each line corresponds to one of the values that the conditioning categorical variable takes
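
To see what the "wide" dataframe from step 1 looks like, here is a sketch on a tiny made-up data set (the columns `group` and `value` are placeholders). Each row keeps its original index, so it has a value in exactly one group column and `NaN` in the others.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b', 'a', 'b'],
    'value': [1.0, 10.0, 2.0, 12.0, 3.0, 11.0],
})

# One column per group; rows keep their original index
df_wide = df.pivot(columns='group', values='value')
print(df_wide)

# Step 2, in a notebook (needs SciPy installed): df_wide.plot.density()
```

Plotting methods skip the missing values column by column, so each group gets its own curve.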

In [None]:
# Look at the GiniIndex variable density plot
# grouped by the race variable


In [None]:
# Next look at the bond arcsinh variable
# grouped by the case_type


In [None]:
# Arcsinh transform the prior_F and prior_M variables
# Plot their density plots grouped by sex


### Grouped Boxplots
- Grouping for boxplots is even easier: `df.boxplot(column = var, by = group_by)`

In [None]:
# Create boxplots of the Gini Index grouped by sex and race
# Create boxplots looking at Bond and Arcsinh Bond grouped by race


### Grouped Descriptions

- We can group our calculations like `.describe()` in a similar way:
    1. Use `df.loc[:,[group,var]]` to get the subset of the dataframe you want to analyze.

    2. Then `.groupby(group).describe()` to apply `.describe()` to `var` for each `group`
    
- Like the grouped kernel densities, this can be a really useful way of adding context to numbers.
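
The two steps can be chained in one line. The sketch below uses a made-up data frame with placeholder column names (`group`, `value`, `other`); the result has one row per group and one block of summary columns for `value`.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b', 'a', 'b'],
    'value': [1.0, 10.0, 2.0, 12.0, 3.0, 11.0],
    'other': [0, 1, 0, 1, 0, 1],   # extra column that we leave out of the summary
})

# Step 1: keep only the grouping column and the variable of interest
# Step 2: describe the variable separately within each group
summary = df.loc[:, ['group', 'value']].groupby('group').describe()
print(summary)

# Individual statistics are addressed by (variable, statistic)
print(summary.loc['a', ('value', 'mean')])
```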

In [None]:
# Describe the bond variable, grouped by case type


In [None]:
# Do the same for bond by is_poor


In [None]:
# Look at the variables is_poor, case_type, and bond
# Group by is_poor and case_type


## Student Exercise: Grouping

- For the `sentence` and `sentence_arcsinh` variables, create grouped kernel density, boxplot, and descriptive statistics for a categorical variable in the data (e.g. `case_type`, `sex`, `race`)

In [None]:
# Answer Here for sentence


In [None]:
# Answer Here for sentence_arcsinh


## Conclusion
- These graphs are not very aesthetically pleasing: you probably wouldn't put them in a publication or on your web page.
- But they are very quick to make, and it all happens inside Pandas, instead of moving on to other packages.
- You'll find that if you want pretty plots, you need a complex Application Programming Interface (API), and the more complex the API, the more specialized the skills become.
- MatPlotLib and Seaborn (and ggplot2 in R) provide very nice plots, but ask you to train your mind to think more specifically on their terms.