## An Introduction to Graphically Displaying Data in Python

### Start by importing all the packages we will need, and setting any global settings.

If you haven't yet read PEP 8, now is a good time to do so. It does not discuss notebooks, but we will still try to follow its conventions as closely as possible.

https://www.python.org/dev/peps/pep-0008/

In [None]:
# NumPy = Numeric Computing
import numpy as np
# Matplotlib = classic Python plotting library
import matplotlib.pyplot as plt
# pandas = Python Data Analysis, home of the DataFrame
import pandas as pd
# Seaborn = Statistical plotting built on top of Matplotlib
import seaborn as sns

sns.set(style="ticks", color_codes=True)

# Tell Matplotlib to draw plots inline with the code outputs
%matplotlib inline

# If you have a Mac or other high-res display you can include this:
%config InlineBackend.figure_format = 'retina'

#### Generate random univariate data from a Normal distribution to use as an example

In [None]:
# We set the random seed so that we always get the same "random" result.
# If you want your results to vary each time, don't set a seed.
np.random.seed(0)

# mu = 100, mean of distribution
# sigma = 15, standard deviation of distribution
x = 100 + 15 * np.random.randn(437)
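
NumPy's documentation recommends the `Generator` interface over the legacy `np.random.seed` / `randn` functions for new code. Below is a minimal sketch of an equivalent draw, assuming you are free to switch. The name `x_new` is introduced here for illustration, and the values will differ from the legacy seed-0 stream, so any plots made from them will differ slightly too.

```python
import numpy as np

# Assumption: a local Generator replaces the global seed used in the cell above.
rng = np.random.default_rng(0)

mu, sigma, n_samples = 100, 15, 437
x_new = rng.normal(loc=mu, scale=sigma, size=n_samples)

# The sample statistics should land near mu and sigma, not exactly on them.
print(f"mean = {x_new.mean():.1f}, std = {x_new.std(ddof=1):.1f}")
```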

#### Create a Histogram of our random data

In [None]:
# In a histogram, bins will display as bars
num_bins = 20

fig, ax = plt.subplots()

# Create the histogram of the data
n, bins, patches = ax.hist(x, num_bins)
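
The three values returned by `ax.hist` are the bar heights (`n`), the bin edges (`bins`), and the drawn bar objects (`patches`). As a rough sketch, the counts and edges can be inspected without drawing anything by calling `np.histogram` on the same data. The names `counts` and `edges` are chosen here for illustration.

```python
import numpy as np

np.random.seed(0)
x = 100 + 15 * np.random.randn(437)

# np.histogram returns the counts and the bin edges without drawing a plot
counts, edges = np.histogram(x, bins=20)

print(len(counts))   # number of bars
print(len(edges))    # always one more edge than bars
print(counts.sum())  # every sample falls into exactly one bin
```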

#### Load the famous Iris dataset from Fisher (1936)

In [None]:
iris = sns.load_dataset("iris")

# The data is loaded to a Pandas DataFrame, so we can easily
# explore using common Pandas functions like head()
print(iris.head())

We can see that the variables are all quantitative, except species, which appears to be categorical. Let's find out all the species types in this data set:

In [None]:
print(f"Species types: {iris['species'].unique()}")
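
`unique()` lists the categories but not how many rows fall into each one. A small sketch of `value_counts()` on a made-up Series follows. The labels match the Iris species names, but the rows themselves are hypothetical and only there to show the call.

```python
import pandas as pd

# Hypothetical rows, not the real Iris data
species = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"],
    name="species",
)

# value_counts() tallies the rows per category, largest first
species_counts = species.value_counts()
print(species_counts)
```

On the real data, the same call would be `iris["species"].value_counts()`.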

As data scientists, we are usually not Subject Matter Experts (SMEs) in the area in which we are working, so take the time to find out about your data. What is a sepal, and how is it related to/different from a petal?

<img src="images/iris.png">

First, let's make a quick scatterplot between Sepal Length and Sepal Width. Scatterplots are a way to visualize the relationship between variables in __bivariate__ quantitative data. They simply make a plot of points, with each point representing one observation of the data.

In [None]:
sns.regplot(x=iris["sepal_length"], y=iris["sepal_width"], fit_reg=False)
plt.show()

Examine this plot. Does it show a clear relationship? 

#### Create a Pair Plot of the Iris data
A Pair Plot is a collection of scatterplots and histograms. This is a nice way to see the relationship between each pair of quantitative variables. Just be aware that the more quantitative variables you have, the bigger the grid will get. New data scientists very commonly produce a grid so large that no individual graph in it can be read.
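
To get a feel for how fast the grid grows, you can count the numeric columns before plotting. By default, `pairplot` uses every numeric column, so `k` variables produce a `k` by `k` grid. The sketch below uses a hypothetical toy frame. The column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame: three numeric columns and one categorical column
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "width": [3.5, 3.0, 3.3],
    "count": [1, 2, 3],
    "label": ["a", "b", "c"],
})

# Keep only the numeric columns, which are what a pair plot would draw
numeric_cols = df.select_dtypes(include="number").columns
n_vars = len(numeric_cols)

# One panel per (row variable, column variable) combination
print(f"{n_vars} numeric variables -> {n_vars * n_vars} panels")
```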

Any time you want more information on a plotting function (or any Python function!), start with the official documentation:

https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
g = sns.pairplot(iris)
plt.show()

Let's take a moment to describe what we can see from the data so far.


That is a good start, but we know from the head() output above that species is categorical, so let's color the plots by species and see if we discover anything new.

In [None]:
g = sns.pairplot(iris, hue="species")
plt.show()

We can see that some variables show a stronger separation between species categories than others, so if we want to try to predict species from the other variables, we could select just the variables with the greatest separation.
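
One crude way to put a number on "separation" is to compare group means for each variable. The sketch below uses synthetic data (not Iris) in which one column is built to separate two groups and the other is not. The names `toy`, `separating`, and `overlapping` are made up for illustration, and a gap in means is only a rough proxy because it ignores the spread within each group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50

# Synthetic data: "separating" has different group centers, "overlapping" does not
toy = pd.DataFrame({
    "group": ["a"] * n + ["b"] * n,
    "separating": np.concatenate([rng.normal(0, 1, n), rng.normal(5, 1, n)]),
    "overlapping": rng.normal(0, 1, 2 * n),
})

# Mean of each variable within each group
group_means = toy.groupby("group").mean()

# Absolute gap between the two group means, largest first
gap = (group_means.loc["b"] - group_means.loc["a"]).abs()
print(gap.sort_values(ascending=False))
```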

In [None]:
g = sns.pairplot(iris, height=3, diag_kind="hist",
                 vars=["sepal_length", "petal_length"], hue="species")

#### One more data set: Ames, Iowa Housing (De Cock, 2011)

In [None]:
# We'll load from a tab-separated text file using the built-in pandas function read_csv:
data = pd.read_csv('data/AmesHousing.txt', sep='\t')

# This data has so many columns, Pandas would normally only display the first few.
# We'll force it to show us everything and scroll horizontally to view.
pd.options.display.max_columns = 999

data.head(10)
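
Setting `pd.options.display.max_columns` changes the display for the rest of the session. If you only want the wide view for one output, `pd.option_context` restores the previous setting afterwards. This is a minimal sketch using a hypothetical 30-column frame called `wide`.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

before = pd.get_option("display.max_columns")

# Inside the block, None means "no column limit"
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide.head())

# Outside the block, the earlier setting is back in place
after = pd.get_option("display.max_columns")
```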

Now, let's get a quick visual summary with another Pair Plot:

In [None]:
g = sns.pairplot(data, height=3, vars=['Lot Area', 'Gr Liv Area', 'SalePrice'],
                 kind="reg")

Not bad, but it seems like there are some outliers that make it hard to see the main bulk of the data. Let's select only properties that aren't really big or really expensive based on what we see in our Pair Plot. Later we will learn more rigorous ways of determining cutoffs, but for now we'll just pick some arbitrary numbers so we don't have any major outliers.

In [None]:
selected_data = data[(data['Lot Area'] < 20000) & (data['SalePrice'] < 500000)]
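
The filter above combines two boolean masks with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `<`. The sketch below shows the same kind of filter on a hypothetical four-row frame, alongside the equivalent `query()` call. Only the column names are taken from the Ames data. The values are made up.

```python
import pandas as pd

# Hypothetical rows; only the column names match the Ames data
toy = pd.DataFrame({
    "Lot Area": [8000, 25000, 12000, 15000],
    "SalePrice": [150000, 300000, 600000, 210000],
})

# Boolean mask: True only where both conditions hold
mask = (toy["Lot Area"] < 20000) & (toy["SalePrice"] < 500000)
by_mask = toy[mask]

# Backticks let query() refer to column names that contain spaces
by_query = toy.query("`Lot Area` < 20000 and SalePrice < 500000")

print(by_mask.equals(by_query))
```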

In [None]:
g = sns.pairplot(selected_data, height=3,
                 vars=['Lot Area', 'Gr Liv Area', 'SalePrice'], kind="reg")