### DS 3001: Foundations of Machine Learning
### Python for ML: Visualization

Content adapted from Terence Johnson (UVA)

**Notebook Summary:** In this notebook we will go over how to make visualizations that look nicer than those we made in the EDA notebook. The EDA notebook was about making quick plots to understand the variables without worrying about how the plots look. This lecture covers how to make the plots we've already seen (histograms, boxplots, density plots, scatterplots), and a few new ones, in Seaborn. Seaborn lets you make nicer-looking plots with fewer lines of code. This is only an introduction: you could spend an entire semester on how to make effective visualizations of your data. It should give you the tools to make some basic plots that convey information in a visually appealing way.

### Useful resources for plotting

* **Changing Color of Plots:** List of the [named colors](https://matplotlib.org/stable/gallery/color/named_colors.html) you can use in your plots.
* **Seaborn Documentation:** The seaborn documentation can be found [here](https://seaborn.pydata.org/).
* **Matplotlib Documentation:** The matplotlib documentation can be found [here](https://matplotlib.org/stable/index.html).

## Fast, Attractive Visualizations: Seaborn
- Quick and dirty Pandas plots can be very useful for cleaning, when you only want quick results from a small number of commands.
- But when you move on to more in-depth EDA, being able to dig deeper into relationships in the data is extremely valuable.
- `seaborn` is a plotting library built on top of Matplotlib that produces more polished plots by default. It fills a role similar to `ggplot2` in R.
- The import command is typically `import seaborn as sns`.
- Matplotlib is useful too and interfaces nicely with seaborn. It gives you fine-grained control of your plots but requires additional lines of code to make appealing graphs.

### Setting up your directory

In [None]:
# First, mount your google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Second, change your directory to the desired folder
import os

# Update the path to your folder for the class
# Where you stored the data from the previous notebook
# path_to_DS_3001_folder = '/content/drive/MyDrive/DS-3001/01_Python_for_ML'

path_to_DS_3001_folder = '' # Include your path to the folder here
os.chdir(path_to_DS_3001_folder)


In [None]:
import numpy as np  # Import NumPy
import pandas as pd  # Import Pandas
import seaborn as sns # Import Seaborn
import matplotlib.pyplot as plt # Import matplotlib

trial_df = pd.read_csv('data/pretrial_data.csv', low_memory = False) # Load the pretrial data

#### Code to clean some of the variables that we've seen before

I'm including the code to clean the variables of interest here just so that everyone has the same data set to work with for these visualizations.

In [None]:
# GiniIndex
Gini = trial_df['GiniIndex']
Gini.unique()
Gini = Gini.replace(' ', np.nan)
Gini = pd.to_numeric(Gini, errors = 'coerce')
trial_df['GiniIndex'] = Gini

# SENTENCE
# Isolate the column
sentence = trial_df['ImposedSentenceAllChargeInContactEvent']
sentence = sentence.replace(' ', np.nan)
sentence = pd.to_numeric(sentence, errors = 'coerce')
trial_df['sentence'] = sentence

# PRIOR_F
# Set up prior felonies variable
prior_F = trial_df['PriorConvs_Fel']
prior_F = prior_F.replace(' ', np.nan)
prior_F = pd.to_numeric(prior_F, errors = 'coerce')
trial_df['prior_F'] = prior_F

# PRIOR_M
# Set up prior misdemeanors variable
prior_M = trial_df['PriorConvs_Misd']
prior_M = prior_M.replace(' ', np.nan)
prior_M = pd.to_numeric(prior_M, errors = 'coerce')
trial_df['prior_M'] = prior_M
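
The same three cleaning steps repeat for each variable above. A minimal sketch of what they do, run on a toy column: the helper name `clean_numeric` and the toy values are hypothetical and only meant to show the pattern.

```python
import numpy as np
import pandas as pd

def clean_numeric(series):
    """Hypothetical helper: turn a messy text column into numeric values."""
    series = series.replace(' ', np.nan)  # Blank entries become missing values
    return pd.to_numeric(series, errors='coerce')  # Anything non-numeric becomes NaN

# A toy column mimicking the kind of mess found in raw data
raw = pd.Series(['3', ' ', '12.5', 'unknown'])
cleaned = clean_numeric(raw)
print(cleaned)
```

The blank and the unparseable string both end up as `NaN`, so the resulting column can be plotted and summarized as a numeric variable.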

In [None]:
# Transform some badly scaled variables:
trial_df['bond_arcsinh'] = np.arcsinh(trial_df['bond'])
trial_df['sentence_arcsinh'] = np.arcsinh(trial_df['sentence'])
trial_df['prior_F_arcsinh'] = np.arcsinh(trial_df['prior_F'])
trial_df['prior_M_arcsinh'] = np.arcsinh(trial_df['prior_M'])

columns = trial_df.columns.tolist() # A list of the available variables
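
To see why `arcsinh` helps with badly scaled variables, here is a short numeric check on made-up values. Since arcsinh(x) = ln(x + sqrt(x^2 + 1)), it is defined at zero (unlike the log) and behaves like ln(2x) for large x.

```python
import numpy as np

# Made-up values spanning several orders of magnitude
values = np.array([0.0, 1.0, 100.0, 10_000.0, 1_000_000.0])
transformed = np.arcsinh(values)

# Zero maps to zero instead of negative infinity
print(transformed[0])

# For large x, arcsinh(x) is close to ln(2x), so it compresses large values like a log
print(np.allclose(transformed[2:], np.log(2 * values[2:]), atol=1e-3))

# The transform matches its closed form ln(x + sqrt(x^2 + 1))
print(np.allclose(transformed, np.log(values + np.sqrt(values**2 + 1))))
```

This is why the transform suits variables like bond amounts or prior convictions, which contain many zeros alongside a few very large values.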

## Basic Plots That We Have Seen Before
- All the plots we made with Pandas can be done in Seaborn with almost the same syntax:
    - Histogram: `sns.histplot(df[varName])` or `sns.histplot(data=df,x=varName)`
    - Density: `sns.kdeplot(df[varName])` or `sns.kdeplot(data=df,x=varName)`
    - Boxplot: `sns.boxplot(df[varName])` or `sns.boxplot(data=df,x=varName)`
    - Scatterplot: `sns.scatterplot(x=df[varName1], y=df[varName2])` or `sns.scatterplot(data=df, x=varName1, y=varName2)` (recent seaborn versions require `x` and `y` to be passed as keywords)

##### Histogram (Single Variable, Numeric Data)

* The seaborn documentation for histograms is found [here.](https://seaborn.pydata.org/generated/seaborn.histplot.html)

In [None]:
var = 'GiniIndex'
sns.histplot(trial_df[var], bins = 20) # A histogram

# You can use matplotlib to adjust the title, labels, etc. of the plot,
# same as we did in the programming review
plt.title(f'Seaborn Histogram of {var}')
plt.xlabel('Gini Index')
plt.show()

#sns.histplot(data=trial_df, x=var) # Same thing

In [None]:
# Compare this with the histogram from the pandas implementation
# The seaborn plot looks nicer, but it shows the same information
trial_df[var].plot.hist(bins = 20)
plt.title(f'Pandas Histogram of {var}')
plt.xlabel('Gini Index')
plt.show()

#### Kernel Density Plot (Single Variable, Numeric Data)

* The seaborn documentation for kde plots is found [here.](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)

In [None]:
var = 'GiniIndex'
sns.kdeplot(trial_df[var], color = 'firebrick') # A kernel density plot
plt.title(f'Seaborn Kernel Density Plot of {var}')
plt.xlabel('Gini Index')
plt.show()
#sns.kdeplot(data=trial_df, x=var) # Same thing

In [None]:
# Comparison with Pandas
trial_df[var].plot.density(color = 'firebrick')
plt.title(f'Pandas Kernel Density Plot of {var}')
plt.xlabel('Gini Index')

# You can change the range of an axis using plt.xlim or plt.ylim,
# depending on whether you're working with the x or y axis
plt.xlim([0.34, 0.57])
plt.show()

#### Box Plot (Single Variable, Numeric Data)

* The seaborn documentation for boxplots is found [here.](https://seaborn.pydata.org/generated/seaborn.boxplot.html)

In [None]:
var = 'GiniIndex'

# A vertical boxplot
# sns.boxplot(data=trial_df, y=var) # Putting the variable on y makes the box vertical

# A Horizontal Boxplot
sns.boxplot(data=trial_df, x=var, color = 'lavender')
plt.title(f'Seaborn Boxplot for {var}')
plt.xlabel('Gini Index')
plt.show()

In [None]:
# Comparison with Pandas
trial_df.boxplot(column = 'GiniIndex', vert = False)
plt.xlabel('Gini Index Value')
plt.title(f'Pandas Boxplot for {var}')
plt.show()

#### Scatter Plot (Two Variables, Numeric Data)

* The seaborn documentation for scatterplots is found [here.](https://seaborn.pydata.org/generated/seaborn.scatterplot.html)

In [None]:
var1 = 'sentence_arcsinh'
var2 = 'bond_arcsinh'

# Creating a scatterplot of the transformed sentence and bond variables
sns.scatterplot(
    x=trial_df[var1],
    y=trial_df[var2]
)

plt.title('Seaborn Scatter Plot of Transformed Sentence and Bond Variables')
plt.xlabel('arcsinh Transformed Sentence')
plt.ylabel('arcsinh Transformed Bond')
plt.show()

#sns.scatterplot(data=trial_df, x=var1, y=var2) # Same thing

In [None]:
# Pandas Version
trial_df.plot.scatter(x = var1, y = var2)
plt.title('Pandas Scatter Plot of Transformed Sentence and Bond Variables')
plt.xlabel('arcsinh Transformed Sentence')
plt.ylabel('arcsinh Transformed Bond')
plt.show()

In [None]:
# Changing some additional parameters of the plot with Seaborn
var1 = 'sentence_arcsinh'
var2 = 'bond_arcsinh'

# Additional Parameters
# Play around with these values until you like how the plot looks
color = 'dodgerblue' # You can change the color of the plot, either with named colors, hexcodes, or rgba values
alpha = 0.2 # You can change the transparency of the points (0 means invisible, 1 means entirely filled in)
size = 10 # You can change the size of the points

# Creating a scatterplot of the transformed sentence and bond variables
sns.scatterplot(
    x=trial_df[var1],
    y=trial_df[var2],
    c = color,
    alpha = alpha,
    s = size
)

plt.title('Seaborn Scatter Plot of Transformed Sentence and Bond Variables')
plt.xlabel('arcsinh Transformed Sentence')
plt.ylabel('arcsinh Transformed Bond')
plt.show()
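
The comment on `color` above mentions named colors, hex codes, and RGBA values. As a quick sketch of how these relate, Matplotlib's `to_rgba` converts any of them to the same (red, green, blue, alpha) tuple:

```python
from matplotlib.colors import to_rgba

named = to_rgba('dodgerblue')                   # A named color
hexcode = to_rgba('#1E90FF')                    # The same color written as a hex code
with_alpha = to_rgba('dodgerblue', alpha=0.2)   # The same color at 20% opacity

print(named)
print(named == hexcode)
print(with_alpha)
```

Any of these forms can be passed wherever a plot accepts a color.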

## Grouping (Plotting Numeric Data subset by a Categorical Variable)
- Grouping is a powerful tool for statistics and kernel densities: perform the same analysis for each category of a categorical variable. This lets us see how the distribution of a variable changes across the categories.

- Grouping is easy in Seaborn, and it makes plots much more informative:
    - `hue` tells the plot to color points based on the categorical variable you pass as input
    - `style` tells the plot to shape the points based on the categorical variable you pass as input.

- Let's start with scatterplots

In [None]:
# Define the variables and categories we're interested in looking at
var1 = 'sentence_arcsinh' # A numeric variable
var2 = 'bond_arcsinh' # Another numeric
cat1 = 'case_type' # A categorical variable
cat2 = 'is_poor' # A categorical variable

# Starting with an initial plot
sns.scatterplot(data=trial_df, x=var1, y=var2) # Our initial plot
plt.title('Non-grouped scatter plot')
plt.xlabel('arcsinh Transformed Sentence')
plt.ylabel('arcsinh Transformed Bond')
plt.show()

In [None]:
# Now let's color the dots based on the case_type value that the point has
# This allows us to start to see some separation between the values depending on what their case type was
# What trend do you see?
sns.scatterplot(
    data=trial_df,
    x=var1,
    y=var2,
    hue=cat1 # Coloring dots by case_type
)
plt.title('Scatter Plot Grouped by Case Type')
plt.xlabel('arcsinh Transformed Sentence')
plt.ylabel('arcsinh Transformed Bond')
plt.show()

In [None]:
# Next we can group by both case type (color) and the is_poor (shape) variable
this_plot = sns.scatterplot(
    data=trial_df,
    x=var1,
    y=var2,
    hue=cat1, # Again coloring based on Case Type
    style=cat2 # Changing shape based on is_poor variable
)

# Adding our labels
plt.title('Scatter Plot Colored by Case Type and Shape by is_poor')
plt.xlabel('arcsinh Transformed Sentence')
plt.ylabel('arcsinh Transformed Bond')

# Moving the legend so it's outside of the plot
sns.move_legend(this_plot, "upper right", bbox_to_anchor=(1.3, 1)) # Moves the Legend
plt.show()

## Grouped Histograms and Kernel Densities (Looking at a Single Numeric Variable subset by a Categorical Variable)
- **Histograms:** Grouping with histograms shows the breakdown of the bar into the different categories.
- **KDE:** Grouping with the density shows how the frequencies of different groups' values compare.

#### Histograms

In [None]:
var1 = 'prior_M_arcsinh'
var2 = 'prior_F_arcsinh'
cat1 = 'case_type'
cat2 = 'is_poor'

# An initial histogram without any grouping
sns.histplot(data=trial_df, x=var1)
plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.title(f'Ungrouped Histogram of {var1}')
plt.show()

In [None]:
# Now we can break down the bars based on their case type
sns.histplot(
    data=trial_df,
    x=var1,
    hue = cat1 # Changing the color of bars based on case_type
)

plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.title(f'Grouped Histogram of {var1} by Case Type')
plt.show()

In [None]:
# Compare the grouped plot with what you get by isolating a single value of the categorical variable
# This shows that the grouped histogram is just overlaying the per-group histograms on one plot for us

# Isolating and plotting misdemeanors
idx = trial_df['case_type'] == 'M'
sns.histplot(data=trial_df[idx], x=var1, color = 'orange')
plt.title('Isolated Hist for When Case Type is Misdemeanor')
plt.show()

# Isolating and plotting felonies
idx = trial_df['case_type'] == 'F'
sns.histplot(data=trial_df[idx], x=var1, color = 'lightblue')
plt.title('Isolated Hist for When Case Type is Felony')
plt.show()

In [None]:
# We can change our grouped histogram to have the y-axis be a proportion
# rather than a count
sns.histplot(
    data=trial_df,
    x=var1,
    hue = cat1,
    stat='proportion' # Change the y-axis to proportion instead of count
)

plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.title(f'Grouped Histogram of {var1} by Case Type')
plt.show()

In [None]:
# We can look at another variable to group by
sns.histplot(
    data=trial_df,
    x=var1,
    hue = cat2 # Now changing color based on is_poor variable
)

plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.title(f'Grouped Histogram of {var1} by {cat2}')
plt.show()

#### Kernel Density Plot

In [None]:
# Same variables of interest
var1 = 'prior_M_arcsinh'
var2 = 'prior_F_arcsinh'
cat1 = 'case_type'
cat2 = 'is_poor'

# Looking at the ungrouped kernel density plot
# For prior misdemeanors
sns.kdeplot(
    data=trial_df,
    x=var1
)
plt.title(f'Ungrouped KDE Plot for {var1}')
plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.show()

In [None]:
# Creating a grouped KDE Plot
sns.kdeplot(
    data=trial_df,
    x=var1,
    hue=cat1 # Grouped by case-type, changing the color
)

plt.title(f'Grouped KDE Plot for {var1} by {cat1}')
plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.show()

In [None]:
# Grouping by the is_poor variable instead
sns.kdeplot(
    data=trial_df,
    x=var1,
    hue=cat2 # Now grouping by the is_poor variable
)

plt.title(f'Grouped KDE Plot for {var1} by {cat2}')
plt.xlabel('arcsinh Transformed Prior Misdemeanor')
plt.show()

## Violin Plot: Simultaneous Density/Boxplot
- Violin plots combine a boxplot and a density plot. A small boxplot runs down the middle, and a mirrored density plot on either side of it shows the distribution of the variable.

* Some people dislike these kinds of plots because they can be hard to interpret, so be cautious with your use of them.

* The seaborn documentation for violin plots is found [here.](https://seaborn.pydata.org/generated/seaborn.violinplot.html)

In [None]:
category = 'case_type'
values = 'GiniIndex'

# creating a violin plot
violin_plot = sns.violinplot(
    x=category, # The string for the categorical variable to group by
    y=values, # The variable to look at the distribution of
    data=trial_df, # The pandas data frame
    hue = category # having the colors be based on the categorical variable we're grouping by
)


# Labels
plt.title('Violin Plot of Gini Index by Case Type')
plt.xlabel('Case Type')
plt.ylabel('Gini Index')

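# Moving the legend to be outside of the plot
# (some seaborn versions skip the legend when hue matches x, so check that one exists first)
if violin_plot.get_legend() is not None:
    sns.move_legend(violin_plot, "upper right", bbox_to_anchor=(1.3, 1))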

plt.show()

In [None]:
category = 'case_type'
values = 'bond_arcsinh'

# This time looking at the arcsinh transformed bond variable
violin_plot = sns.violinplot(
    x=category, # The categorical variable to group by
    y=values, # The variable to look at the distribution of
    data=trial_df, # The pandas data frame
    hue = category # Having the colors be based on the categorical variable we're grouping by
)

# Labels
plt.title('Violin Plot of arcsinh Transformed Bond by Case Type')
plt.xlabel('Case Type')
plt.ylabel('arcsinh Bond')

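# Moving the legend to be outside of the plot
# (some seaborn versions skip the legend when hue matches x, so check that one exists first)
if violin_plot.get_legend() is not None:
    sns.move_legend(violin_plot, "upper right", bbox_to_anchor=(1.3, 1))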

plt.show()

## Jointplot: Scatterplot upgrade

- Similar to how the violin plot combines a density plot with a boxplot, the jointplot combines a histogram with a scatterplot.
- It does this by placing a scatter plot in the middle, with histograms along the top and right side of the graph. The top histogram shows the distribution of the x-axis variable and the right histogram shows the distribution of the y-axis variable.
- The goal of this plot is to see the distribution of the variables as well as how they covary together.

* The seaborn documentation for jointplots is found [here.](https://seaborn.pydata.org/generated/seaborn.jointplot.html)

In [None]:
var1 = 'bond_arcsinh'
var2 = 'GiniIndex'

# Creating a jointplot
sns.jointplot(
    x=var1, # The variable to put on the x-axis
    y=var2, # The variable to put on the y-axis
    data=trial_df # The data frame
)

# Labels
plt.xlabel('arcsinh Transformed Bond')
plt.ylabel('Gini Index')
plt.tight_layout()
plt.show()

In [None]:
# If you set the hue parameter so that the values are grouped by a categorical
# variable, you end up with density plots instead of histograms

sns.jointplot(
    x=var1,
    y=var2,
    data=trial_df,
    hue='case_type' # Adding the hue parameter turns the margins into grouped density plots
)

plt.xlabel('arcsinh Transformed Bond')
plt.ylabel('Gini Index')
plt.tight_layout()
plt.show()

## Visualizing Prevalence
- Graphs can be very misleading about where the data are actually distributed
- A **rugplot** places tick marks along the axes to show you where observations are
- A **hexbin** tiles the area with hexes, and shades the hex to match the number of observations in the hex -- these can very starkly illustrate where the data are, and aren't

* The seaborn documentation for rugplots is found [here.](https://seaborn.pydata.org/generated/seaborn.rugplot.html)
* The hexbin documentation is part of the jointplot documentation found [here.](https://seaborn.pydata.org/generated/seaborn.jointplot.html)
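
One way to see why a plain scatter plot can mislead is to count how many rows land on exactly the same point. The sketch below uses synthetic count data with hypothetical column names (`prior_a`, `prior_b`), not the pretrial data.

```python
import numpy as np
import pandas as pd

# Synthetic count data: small integers, so many rows share the same (x, y) pair
rng = np.random.default_rng(4)
toy_df = pd.DataFrame({
    'prior_a': rng.poisson(1.0, size=1000),
    'prior_b': rng.poisson(0.5, size=1000),
})

# Count how many rows share each (prior_a, prior_b) pair
pair_counts = toy_df.groupby(['prior_a', 'prior_b']).size().sort_values(ascending=False)

print(f'{len(toy_df)} rows collapse onto {len(pair_counts)} distinct points')
print(pair_counts.head())
```

A scatter plot draws one dot per distinct point no matter how many rows sit on it, which is the information that rugplots and hexbins bring back.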

In [None]:
var1 = 'prior_M'
var2 = 'prior_F'
cat1 = 'case_type'
cat2 = 'is_poor'

# First creating a scatter plot
sns.scatterplot(
    data=trial_df, # The data frame
    x=var1, # The variable on the x-axis
    y=var2, # The variable on the y-axis
    hue=cat1 # Grouping by case type
)

# Adding in the rugplot
sns.rugplot(
    data=trial_df, # The data frame
    x=var1, # The variable on the x-axis
    y=var2, # The variable on the y-axis
    height=.02 # The height of the bars for the rugplot
)

# Labels
plt.title('Rugplot of Prior Misdemeanors vs Felonies')
plt.xlabel('Prior Misdemeanors')
plt.ylabel('Prior Felonies')
plt.show()

In [None]:
# Creating the hexbin plot using the sns.jointplot function
# you need to change the 'kind' parameter to be 'hex' to make the hexbin plot
sns.jointplot(
    x='GiniIndex',
    y='age',
    data=trial_df,
    kind='hex' # Setting the kind to hex instead of the default of scatter for jointplot
) # More evenly distributed

# Labels
plt.xlabel('Gini Index')
plt.ylabel('Age')
plt.show()

In [None]:
# A hexbin graph of the prior Misdemeanors vs Felonies
# Notice how it is much more sparse
sns.jointplot(
    x='prior_M',
    y='prior_F',
    data=trial_df,
    kind='hex'
)

plt.xlabel('Prior Misdemeanors')
plt.ylabel('Prior Felonies')
plt.show()

In [None]:
# Looking at the transformed bond vs sentence variables
sns.jointplot(
    x='bond_arcsinh',
    y='sentence_arcsinh',
    data=trial_df,
    kind='hex'
)

plt.xlabel('arcsinh Bond Variable')
plt.ylabel('arcsinh Sentence Variable')
plt.show()