# Introduction to Data Visualization with Seaborn
Run the hidden code cell below to import the data used in this course.

In [None]:
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing the course datasets
country_data = pd.read_csv('datasets/countries-of-the-world.csv', decimal=",")
mpg = pd.read_csv('datasets/mpg.csv')
student_data = pd.read_csv('datasets/student-alcohol-consumption.csv', index_col=0)
survey = pd.read_csv('datasets/young-people-survey-responses.csv', index_col=0)

## Intro to Seaborn

Seaborn is a Python library for creating data visualizations. It's easy to use, works well with pandas data structures, and is built on top of Matplotlib. Import it with import seaborn as sns (the alias comes from Samuel Norman Seaborn, a character on The West Wing), along with import matplotlib.pyplot as plt

sns.scatterplot(x=, y=) for scatter plots

sns.countplot(x=) or sns.countplot(y=) for count plots, depending on how you want the bars oriented

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change this scatter plot to have percent literate on the y-axis
sns.scatterplot(x=gdp, y=percent_literate)

# Show plot
plt.show()

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create count plot with region on the y-axis
sns.countplot(y=region)

# Show plot
plt.show()

## Using pandas with Seaborn

pandas is a Python library for data analysis. Its most commonly used data structure is the DataFrame. Seaborn works best with DataFrames when the data is 'tidy' (each observation has its own row and each variable has its own column)

To use DataFrames with Seaborn, you have to specify the DataFrame name with the data= argument, with the column names in x= or y=

In [None]:
# Import Matplotlib, pandas, and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Create a DataFrame from csv file
df = pd.read_csv('csv_filepath')

# Create a count plot with "Spiders" on the x-axis
sns.countplot(x='Spiders', data=df)

# Display the plot
plt.show()
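
As a minimal sketch of the tidy-data pattern above, the cell below builds a small DataFrame in memory instead of reading a CSV. The `respondent` column and the values are made up for illustration; only the "Spiders" column name comes from the cell above.

```python
import pandas as pd
import seaborn as sns

# Hypothetical tidy data: one row per survey response, one column per variable
df = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5, 6],
    "Spiders": [1, 5, 3, 5, 1, 5],
})

# Column names go in x= or y=, and the DataFrame itself goes in data=
ax = sns.countplot(x="Spiders", data=df)

# The bar heights correspond to these per-category counts
counts = df["Spiders"].value_counts().sort_index()
print(counts)
```

Call plt.show() afterwards, as in the cells above, if the plot does not display on its own.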

## Adding a third variable with hue

The hue= parameter adds a third variable to a plot by coloring the points or bars by subgroup.

Seaborn has its own tips dataset that can be loaded with sns.load_dataset('tips'). Set hue= to the column you want as the third variable. hue_order= takes a list of values and orders the subgroups in the plot and legend accordingly. The palette= parameter lets you pick the colors by passing a dictionary that maps each subgroup value to a color, such as "red" or "black". Color names only work for the set of named colors that Matplotlib defines. You can also use Matplotlib's single-letter color abbreviations or HTML hex codes prefixed with a pound sign (pure green is "#00FF00")
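
The three parameters can be combined in one call. This is a rough sketch on a made-up stand-in for student_data (the `students` DataFrame and its values are hypothetical), mixing a color name and a hex code in the palette dictionary.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for student_data
students = pd.DataFrame({
    "absences": [0, 4, 10, 2, 6, 8],
    "G3": [15, 12, 8, 14, 11, 9],
    "location": ["Urban", "Rural", "Urban", "Rural", "Urban", "Rural"],
})

hue_order = ["Rural", "Urban"]
# Map each subgroup value to a color: a name and a hex code both work
palette_colors = {"Rural": "green", "Urban": "#0000FF"}

ax = sns.scatterplot(x="absences", y="G3", data=students,
                     hue="location", hue_order=hue_order,
                     palette=palette_colors)

# The legend entries follow hue_order rather than order of appearance
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```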

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change the legend order in the scatter plot
sns.scatterplot(x="absences", y="G3", 
                data=student_data, 
                hue="location", hue_order=(['Rural', 'Urban']))

# Show plot
plt.show()

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create a dictionary mapping subgroup values to colors
palette_colors = {'Rural': "green", 'Urban': "blue"}

# Create a count plot of school with location subgroups
sns.countplot(x='school', data=student_data, hue='location', palette=palette_colors)

# Display plot
plt.show()

## Introduction to relational plots and subplots

Relational plots show the relationship between two quantitative variables; they are typically scatter plots or line plots. Creating a separate plot per subgroup is called subplotting, and in Seaborn relplot() creates these subplots in a single figure. This course uses relplot() instead of scatterplot(). Use kind= to choose 'scatter' or 'line'. col= names the variable whose values define the subplots, arranged horizontally in columns; row= arranges them vertically in rows instead. You can use col= and row= at the same time to facet on two different variables in the same figure.

If you have a lot of subplots to display, like one per day of the week, use col_wrap= to specify how many subplots you want per row. col_order= and row_order= take a list that defines the order of the subplots
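
A small sketch of col_wrap= and col_order= together, assuming a made-up DataFrame with one subplot per day (the `day`, `hours`, and `score` columns are hypothetical):

```python
import pandas as pd
import seaborn as sns

days = ["Mon", "Tue", "Wed", "Thu"]

# Hypothetical data: three observations for each day
df = pd.DataFrame({
    "day": days * 3,
    "hours": [1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6],
    "score": [50, 55, 60, 65, 58, 62, 70, 72, 61, 66, 75, 80],
})

# Four subplots, wrapped so that each row of the figure holds two of them
g = sns.relplot(x="hours", y="score", data=df, kind="scatter",
                col="day", col_wrap=2, col_order=days)

n_subplots = len(g.axes.flat)
print(n_subplots)
```

With col_wrap=2 the four day subplots form a 2-by-2 layout instead of one long row.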

In [None]:
# Change this scatter plot to arrange the plots in rows instead of columns
sns.relplot(x="absences", y="G3", 
            data=student_data,
            kind="scatter", 
            row="study_time")

# Show plot
plt.show()

In [None]:
# Adjust further to add subplots based on family support
sns.relplot(x="G1", y="G3", 
            data=student_data,
            kind="scatter", 
            col="schoolsup",
            col_order=["yes", "no"],
            row='famsup', 
            row_order=['yes', 'no'])

# Show plot
plt.show()

## Customizing scatter plots

You can update point sizes, styles, and point transparencies using scatterplot() and relplot().

size= changes point size based on a variable, like 'size'. You can also use the hue= parameter with the 'size' variable to make it easier to read the plot

style= will change the point style based on the variable

alpha= changes point transparency and takes a value between 0 and 1: 0 is fully transparent and 1 is fully opaque
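
The three options can be used in the same call. A minimal sketch, assuming a tiny made-up table named `cars` in place of the mpg dataset:

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the mpg dataset
cars = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 180, 95],
    "mpg": [35, 30, 25, 18, 14, 28],
    "cylinders": [4, 4, 6, 8, 8, 4],
    "origin": ["japan", "europe", "usa", "usa", "usa", "japan"],
})

# size= and hue= share one variable, style= uses another, alpha= is a constant
ax = sns.scatterplot(x="horsepower", y="mpg", data=cars,
                     size="cylinders", hue="cylinders",
                     style="origin", alpha=0.5)

# The transparency is stored on the collection that holds the plotted points
alpha_used = ax.collections[0].get_alpha()
print(alpha_used)
```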

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg", 
            data=mpg,kind="scatter", 
            size="cylinders",
            hue='cylinders')

# Show plot
plt.show()

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot of acceleration vs. mpg
sns.relplot(kind='scatter', x="acceleration", y='mpg', data=mpg, style='origin', hue='origin')

# Show plot
plt.show()

## Introduction to line plots

In scatter plots, each point is an independent observation. In line plots, each point represents the same thing, typically tracked over time. Setting style= and hue= to a third variable draws a separate line for each subgroup. Setting markers=True adds a marker for each data point. If you don't want the line styles to vary by subgroup, set dashes=False

You can also use line plots when there are multiple observations per x value, for example when each row holds a level measured at one of several stations in the same area. If hour is the x-value, Seaborn aggregates the observations for each hour into a single summary measure, by default the mean. It also draws a shaded band around the line showing a 95% confidence interval for the mean, which is meaningful as long as the samples were taken randomly. Set ci='sd' to make the band show the standard deviation instead of the confidence interval, or ci=None to turn it off. In Seaborn 0.12 and later, ci= is deprecated in favor of errorbar= (for example errorbar='sd' or errorbar=None).

In [None]:
# Make the shaded area show the standard deviation
sns.relplot(x="model_year", y="mpg",
            data=mpg, kind="line", ci='sd')

# Show plot
plt.show()

In [None]:
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Add markers and make each line have the same style
sns.relplot(x="model_year", y="horsepower", 
            data=mpg, kind="line", 
            ci=None, style="origin", 
            hue="origin",
            markers=True,
            dashes=False)

# Show plot
plt.show()

## Count plots and bar plots

Count plots and bar plots are categorical plots: they involve a categorical variable. Count plots show the number of observations in each category. Just as relplot() handles relational plots, catplot() (categorical plot) handles these, with kind= set to 'count' or 'bar'. You can still use col= and row=. To change the order of the categories, make a list of the category values in the order you want and pass it to order=

Bar plots show the mean of a quantitative variable per category, with 95% confidence intervals for the mean drawn automatically. When the y-variable is True/False, the bars show the proportion of responses that are True
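
The numbers behind a bar plot can be checked with pandas. In this sketch the `students` DataFrame and its `passed` column are made up; the groupby lines show the per-category mean that each bar represents, and why a True/False column yields a proportion.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for student_data, with an extra True/False column
students = pd.DataFrame({
    "study_time": ["<2 hours", "<2 hours", "2 to 5 hours",
                   "2 to 5 hours", "<2 hours", "2 to 5 hours"],
    "G3": [8, 10, 12, 14, 9, 16],
    "passed": [False, True, True, True, False, True],
})
category_order = ["<2 hours", "2 to 5 hours"]

# Each bar is the mean of G3 within one study_time category
g = sns.catplot(x="study_time", y="G3", data=students,
                kind="bar", order=category_order)

# The same means computed directly
mean_grade = students.groupby("study_time")["G3"].mean()

# The mean of a True/False column is the share of True values
share_passed = students.groupby("study_time")["passed"].mean()
print(mean_grade)
print(share_passed)
```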

In [None]:
# Separate into column subplots based on age category
sns.catplot(y="Internet usage", data=survey_data,
            kind="count", col='Age Category')

# Show plot
plt.show()

In [None]:
# List of categories from lowest to highest
category_order = ["<2 hours", 
                  "2 to 5 hours", 
                  "5 to 10 hours", 
                  ">10 hours"]

# Turn off the confidence intervals
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",
            order=category_order, ci=None)

# Show plot
plt.show()

## Creating a box plot

Set kind='box' in catplot() to create a box plot. To omit the outliers, pass sym=''; sym can also change the appearance of the outliers instead of omitting them. By default the whiskers extend to 1.5 times the interquartile range (IQR); use whis= to change this, for example whis=2.0 for 2.0 times the IQR. You can also pass a list of lower and upper percentiles: whis=[5, 95] ends the whiskers at the 5th and 95th percentiles, and whis=[0, 100] uses the min and max values
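
The 1.5 times IQR rule can be worked through by hand. This sketch uses ten made-up values and NumPy's default percentile interpolation, which is an assumption about how the quartiles are computed, then draws the same values with the whiskers stretched to the min and max.

```python
import numpy as np
import seaborn as sns

# Hypothetical sample with one unusually large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond these limits are drawn as outliers with the default whis=1.5
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(q1, q3, iqr)
print(lower_limit, upper_limit)
print(outliers)

# With whis=[0, 100] the whiskers reach the min and max instead
ax = sns.boxplot(y=values, whis=[0, 100])
```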

In [None]:
# Create a box plot with subgroups and omit the outliers
sns.catplot(data=student_data, x='internet', y='G3', kind='box', hue='location', sym='')

# Show plot
plt.show()

## Point Plots

A point plot is a categorical plot that shows the mean of a quantitative variable for each category as a single point, with 95% confidence intervals. In a point plot it's easier to compare the heights of the subgroup points because they're stacked above each other, and it's easier to see differences in slope between categories. Use kind='point'. To remove the lines between points, set join=False.

To plot the median (and its confidence interval) instead of the mean, import the median function from NumPy and set estimator=median. To add caps to the confidence intervals, set capsize= to the desired cap width, like 0.2
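
The median is useful when a few extreme values pull the mean upward. A rough sketch with made-up absence counts (the `demo` DataFrame is hypothetical), comparing the two statistics alongside the plot:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: one student in the "no" group has many absences
demo = pd.DataFrame({
    "romantic": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "absences": [0, 2, 4, 30, 1, 3, 5, 7],
})

# Plot the median per category, with caps on the confidence intervals
g = sns.catplot(x="romantic", y="absences", data=demo,
                kind="point", estimator=np.median, capsize=0.2)

# The single large value moves the mean far more than the median
summary = demo.groupby("romantic")["absences"].agg(["mean", "median"])
print(summary)
```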

In [None]:
# Add caps to the confidence intervals and remove the lines joining the points
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2, join=False)

# Show plot
plt.show()

In [None]:
# Import median function from numpy
from numpy import median

# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None, estimator=median)

# Show plot
plt.show()

## Changing plot style and color

Changing the style can help improve readability or guide interpretation.

Figure 'style' includes background and axes. Preset options: "white", 'dark', 'whitegrid', 'darkgrid', 'ticks'. To set one of these styles as the global style for all your plots, use sns.set_style(). 

The "whitegrid" style helps the audience read specific values off the plotted points, rather than only making higher-level observations. The "ticks" style adds small tick marks to the x- and y-axes

You can change the palette using sns.set_palette(), which changes the color of the main elements of the plot. You can use preset palettes or create a custom palette. Diverging palettes are great when the visualization deals with a scale where the two ends are opposites and there is a neutral midpoint, like red/blue ("RdBu") and purple/green ("PRGn"); appending "_r", as in "RdBu_r", reverses the palette. Sequential palettes move a single color from light to dark and are great for continuous variables. You can make a custom palette by passing a list of color names as strings or a list of hex codes (with # in front)

To change the scale, use sns.set_context(). This changes the scale of the plot elements and labels. From smallest to largest, the options are "paper", "notebook", "talk", and "poster". The default is "notebook"
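
The effect of these settings can be inspected without drawing anything. This sketch reads back the font size of each context and then checks the style and palette after setting them; the variable names are only for illustration.

```python
import seaborn as sns

# Look up the base font size of each context without changing any settings
contexts = ["paper", "notebook", "talk", "poster"]
font_sizes = [sns.plotting_context(name)["font.size"] for name in contexts]
print(dict(zip(contexts, font_sizes)))

# Apply a style, a context, and a custom two-color palette globally
sns.set_style("whitegrid")
sns.set_context("talk")
sns.set_palette(["#39A7D0", "#36ADA4"])

# Read the active settings back
grid_on = sns.axes_style()["axes.grid"]
current_palette = list(sns.color_palette().as_hex())
print(grid_on)
print(current_palette)
```

These calls change global settings, so every plot created afterwards in the same session picks them up.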

In [None]:
# Set the style to "whitegrid" and the color palette to "RdBu"
sns.set_style("whitegrid")
sns.set_palette("RdBu")

# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes", 
                  "Often", "Always"]

sns.catplot(x="Parents Advice", 
            data=survey_data, 
            kind="count", 
            order=category_order)

# Show plot
plt.show()

In [None]:
# Change the context to "poster"
sns.set_context("poster")

# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
            data=survey_data, kind="bar")

# Show plot
plt.show()

In [None]:
# Set the style to "darkgrid"
sns.set_style('darkgrid')

# Set a custom color palette
sns.set_palette(['#39A7D0', "#36ADA4"])

# Create the box plot of age distribution by gender
sns.catplot(x="Gender", y="Age", 
            data=survey_data, kind="box")

# Show plot
plt.show()

## Adding titles and labels: Part 1

Seaborn's plot functions create 2 different types of objects: FacetGrids and AxesSubplots. To figure out which type of object you're working with, first assign the plot output to a variable (often g), then do type(g)

A FacetGrid consists of one or more AxesSubplots, which is how it supports subplots. relplot() and catplot() support creating subplots, and that means they create FacetGrid objects. scatterplot() and countplot() only make a single AxesSubplot

To assign a title to a FacetGrid, assign the plot to g, then use g.fig.suptitle('Title'). This sets a title for the figure as a whole. To adjust the height of the title, use y=. y=1.03 makes the title a little higher than default
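
A short sketch of checking the object type and titling a FacetGrid, using a made-up `cars` table in place of mpg. Note that recent Matplotlib versions may print the type of the second object as Axes rather than AxesSubplot.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the mpg dataset
cars = pd.DataFrame({"weight": [2000, 2500, 3000, 3500],
                     "horsepower": [70, 90, 120, 150]})

# relplot() returns a FacetGrid
g = sns.relplot(x="weight", y="horsepower", data=cars, kind="scatter")
print(type(g))

# Title the whole figure, raised slightly above the default height
title = g.fig.suptitle("Car Weight vs. Horsepower", y=1.03)

# scatterplot() returns a single Matplotlib axes object instead
fig, axes = plt.subplots()
ax = sns.scatterplot(x="weight", y="horsepower", data=cars, ax=axes)
print(type(ax))
```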

In [None]:
# Create scatter plot
g = sns.relplot(x="weight", 
                y="horsepower", 
                data=mpg,
                kind="scatter")

# Add a title "Car Weight vs. Horsepower"
g.fig.suptitle('Car Weight vs. Horsepower')

# Show plot
plt.show()

## Adding titles and labels: Part 2

To add a title to an AxesSubplot, use g.set_title('Title')

If the figure has subplots, use g.set_titles() to set the title of each AxesSubplot. To include the subplot's column value in the title, reference "col_name" in braces, like g.set_titles("This is {col_name}")

To add axis labels, assign the plot to a variable and call its set() method with the xlabel= and ylabel= parameters. To rotate the tick labels, call the Matplotlib function plt.xticks(rotation=). Both techniques work for AxesSubplot and FacetGrid objects.
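
These calls can be sketched on a made-up `cars` table (the data and label text are hypothetical). The first plot is a FacetGrid with one subplot per origin, and the second is a single axes object. plt.xticks() acts on the current axes, so here it is called right after the single-axes plot is drawn.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the mpg dataset
cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "usa", "japan"],
    "cylinders": [8, 6, 4, 4, 8, 6],
})

# FacetGrid: one title per subplot, built from each column value
g = sns.catplot(x="cylinders", data=cars, kind="count", col="origin")
g.set_titles("This is {col_name}")
g.set(xlabel="Number of cylinders", ylabel="Number of cars")
facet_titles = [facet.get_title() for facet in g.axes.flat]
print(facet_titles)

# Single axes: set_title() plus the same set() call for the labels
fig, axes = plt.subplots()
ax = sns.countplot(x="origin", data=cars, ax=axes)
ax.set_title("Cars per origin")
ax.set(xlabel="Country of origin", ylabel="Number of cars")

# Rotate the x tick labels of the current axes
plt.xticks(rotation=90)
```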

## Putting it all together

Import the packages

Decide what plot you want to create: a relational plot (the relationship between 2 quantitative variables) or a categorical plot (the distribution of a quantitative variable within categories)

Adding a 3rd variable: hue, or col/row in relplot() or catplot()



## Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- From `country_data`, create a scatter plot to look at the relationship between GDP and Literacy. Use color to segment the data points by region.
- Use `mpg` to create a line plot with `model_year` on the x-axis and `weight` on the y-axis. Create differentiating lines for each country of origin (`origin`). 
- Create a box plot from `student_data` to explore the relationship between the number of failures (`failures`) and the average final grade (`G3`).
- Create a bar plot from `survey` to compare how `Loneliness` differs across values for `Internet usage`. Format it to have two subplots for gender.
- Make sure to add titles and labels to your plots and adjust their format for readability!