# Data Visualization

# Importing Data

Importing Datasets

We will use two freely available datasets: the Iris dataset and the Wine Reviews dataset.

Both can be loaded with the pandas read_csv function.

In [None]:
pip list

In [None]:
# pip install pandas


# What is a pandas dataframe?
A pandas dataframe is a 2-dimensional labeled data structure with columns of potentially different types.
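
As a minimal sketch of that definition, the cell below builds a tiny dataframe by hand. The column names and values are made up for illustration and are not taken from the Iris or Wine Reviews files.

```python
import pandas as pd

# Hypothetical toy data: three rows, three columns of different types
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],   # floating point values
    "n_measurements": [3, 5, 4],       # integers
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],  # strings
})

print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # every column keeps its own data type
print(toy["class"])  # selecting a column by its label returns a Series
```

Each column is addressed by its label, which is what the plotting calls later in this notebook rely on.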

In [None]:
import pandas as pd
iris = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
print(iris.head())

In [None]:
wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()

# Matplotlib

Matplotlib is the most popular Python plotting library. 
It is a low-level library with a MATLAB-like interface.
The library offers lots of freedom at the cost of having to write more code.


In [None]:
# pip install matplotlib

Matplotlib is particularly good for creating basic graphs
like line charts, bar charts, histograms and many more. 


In [None]:
import matplotlib.pyplot as plt

# Scatter Plot
To create a scatter plot in Matplotlib we can use the scatter method. 
We will also create a figure and an axis using plt.subplots 
so we can give our plot a title and labels.

In [None]:
# create a figure and axis
fig, ax = plt.subplots()

# scatter the sepal_length against the sepal_width
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')

We can give the graph more meaning by coloring in each data-point by its class. 
This can be done by creating a dictionary which maps from class to color and then 
scattering each point on its own using a for-loop and passing the respective color.

In [None]:
# create color dictionary
colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i],color=colors[iris['class'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
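
Scattering one point at a time works, but it gets slow for larger datasets and does not produce a legend. One possible alternative, sketched here on a small hand-made stand-in for the iris dataframe (the `sample` values are hypothetical), is to group the rows by class and call scatter once per class.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the iris dataframe (two rows per class)
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})
colors = {"Iris-setosa": "r", "Iris-versicolor": "g", "Iris-virginica": "b"}

fig, ax = plt.subplots()
# one scatter call per class instead of one call per data-point
for name, group in sample.groupby("class"):
    ax.scatter(group["sepal_length"], group["sepal_width"],
               color=colors[name], label=name)
ax.set_title("Iris Dataset (sample)")
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
# the label passed to each scatter call becomes a legend entry
ax.legend()
```

Because every class is drawn with its own call and label, the legend shows which color belongs to which class.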

# Line Chart
In Matplotlib we can create a line chart by calling the plot method. 
We can also plot multiple columns in one graph, by looping through 
the columns we want and plotting each column on the same axis.

In [None]:
# get columns to plot
columns = iris.columns.drop(['class'])
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    ax.plot(x_data, iris[column], label=column)
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()

# Histogram
In Matplotlib we can create a histogram using the hist method. 
If we pass it a numeric column like the points column from 
the wine-review dataset, it groups the values into bins (10 by default) and counts how many values fall into each bin.

In [None]:
# create figure and axis
fig, ax = plt.subplots()
# plot histogram
ax.hist(wine_reviews['points'])
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
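
To see what the hist method computes, it can help to look at its return values. The sketch below uses a short, made-up list of scores instead of the wine-review data; hist returns the count per bin and the bin edges alongside the drawn bars.

```python
import matplotlib.pyplot as plt

# Hypothetical review scores, not taken from the wine-review dataset
scores = [80, 82, 85, 85, 87, 88, 88, 88, 90, 91, 95, 100]

fig, ax = plt.subplots()
# bins=4 splits the range from the smallest to the largest score into 4 equal-width bins
counts, bin_edges, bars = ax.hist(scores, bins=4)

print(bin_edges)  # 4 bins are described by 5 edges
print(counts)     # number of scores that fall into each bin
```

Every score lands in exactly one bin, so the counts always add up to the number of values passed in.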

# Bar Chart
A bar chart can be created using the bar method. 

The bar method doesn’t calculate the frequency of each category
automatically, so we use the pandas value_counts method to do this.

A bar chart is useful for categorical data that doesn’t have a lot of
different categories (fewer than 30), because otherwise it can get quite messy.

In [None]:
# create a figure and axis 
fig, ax = plt.subplots() 
# count the occurrence of each class 
data = wine_reviews['points'].value_counts() 
# get x and y data 
points = data.index 
frequency = data.values 
# create bar chart 
ax.bar(points, frequency) 
# set title and labels 
ax.set_title('Wine Review Scores') 
ax.set_xlabel('Points') 
ax.set_ylabel('Frequency')

# Pandas Visualization

Pandas is an open-source, high-performance, easy-to-use library 
providing data structures, such as dataframes, and data analysis
tools like the visualization tools we will use here.

Pandas Visualization makes it really easy to create plots out of 
a pandas dataframe and series. 

It also has a higher level API than Matplotlib and therefore 
we need less code for the same results.

# Scatter Plot
To create a scatter plot in Pandas we can call <dataset>.plot.scatter()
and pass it two arguments, the name of the x-column as well as the name 
of the y-column. Optionally we can also pass it a title.
    
The image created will automatically set the x and y label to the column names.

In [None]:
iris.plot.scatter(x='sepal_length', y='sepal_width', title='Iris Dataset')

# Line Chart
To create a line-chart in Pandas we can call <dataframe>.plot.line(). 

In the prior example using Matplotlib we needed to loop through each column 
we wanted to plot.  

In Pandas we don’t need to do this because it automatically 
plots all available numeric columns (unless we select specific columns ourselves).
    
If we plot more than one column, Pandas automatically creates a legend.

In [None]:
iris.drop(['class'], axis=1).plot.line(title='Iris Dataset')

# Histogram

In Pandas, we can create a Histogram with the plot.hist method. 

There are no required arguments, but we can optionally pass 
some, such as the number of bins.

In [None]:
wine_reviews['points'].plot.hist()
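
As a small sketch of the optional bins argument, the cell below plots a made-up Series of scores with five bins. The data and the title are hypothetical. The axis is created explicitly and handed to pandas through the ax argument so the histogram does not end up on a previously used plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores, not taken from the wine-review dataset
scores = pd.Series([80, 82, 85, 85, 87, 88, 88, 88, 90, 91, 95, 100])

fig, ax = plt.subplots()
# bins sets the number of bars; ax tells pandas which axis to draw on
scores.plot.hist(bins=5, ax=ax, title="Hypothetical scores")
```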

It is easy to create multiple histograms.

The subplots argument specifies that we want a separate plot 
for each feature, and the layout argument specifies the number of rows and columns in the grid of plots.

In [None]:
iris.plot.hist(subplots=True, layout=(2,2), figsize=(10, 10), bins=20)

# Bar Chart

To plot a bar chart we can use the plot.bar() method, 
but before we can call it we need to prepare our data. 

For this we first count the occurrences using the value_counts() method 
and then sort the result by score, from smallest to largest, using the sort_index() method.

In [None]:
wine_reviews['points'].value_counts().sort_index().plot.bar()
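
The effect of the two method calls is easier to see on a handful of values. This sketch uses a made-up Series of scores rather than the wine-review data.

```python
import pandas as pd

# Hypothetical scores, not taken from the wine-review dataset
points = pd.Series([88, 90, 88, 87, 90, 88])

counts = points.value_counts()
print(counts)               # ordered by frequency, most common score first

by_score = counts.sort_index()
print(by_score)             # ordered by the score itself, smallest first
```

Without sort_index() the bars would be ordered by how often a score occurs, which makes the x-axis of the bar chart hard to read.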

It’s also really simple to make a horizontal bar-chart using the plot.barh() method.

In [None]:
wine_reviews['points'].value_counts().sort_index().plot.barh()

We can also plot data other than the number of occurrences.

In this example we group the data by country, take the mean of the wine prices, 
sort the means in descending order, and plot the 5 countries with the highest average wine price.

In [None]:
wine_reviews.groupby("country").price.mean().sort_values(ascending=False)[:5].plot.bar()
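
The same chain of calls, minus the plotting, can be followed step by step on a tiny made-up table. The country codes and prices below are hypothetical.

```python
import pandas as pd

# Hypothetical reviews; one price is missing on purpose
reviews = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "price": [10.0, 30.0, 50.0, None, 5.0],
})

# mean price per country (missing prices are skipped by mean)
mean_price = reviews.groupby("country")["price"].mean()
# highest average first, then keep the top 2
top = mean_price.sort_values(ascending=False).head(2)
print(top)
```

Calling `top.plot.bar()` on the result would then draw one bar per country, as in the cell above.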

# Seaborn

Seaborn is a Python data visualization library based on Matplotlib. 

It provides a high-level interface for creating attractive graphs.

Seaborn has a lot to offer. You can create graphs in one line that 
would take you dozens of lines in Matplotlib. 

Its default styles look good and it also has a nice interface 
for working with pandas dataframes.

In [None]:
import seaborn as sns

# Scatter plot
We can use the sns.scatterplot function to create a scatter plot. Just as in Pandas, we pass it the column names of the x and y data, but we also need to pass the dataframe itself as the data argument because we aren’t calling the function on the data directly as we did in Pandas.



In [None]:
sns.scatterplot(x='sepal_length', y='sepal_width', data=iris)

We can also highlight the points by class using the hue argument, which is a lot easier than in Matplotlib.

In [None]:
sns.scatterplot(x='sepal_length', y='sepal_width', hue='class', data=iris)

# Line chart
To create a line chart the sns.lineplot function can be used. The only required argument is the data, which in our case are the four numeric columns from the Iris dataset. A related function is sns.kdeplot, which does not plot the raw values but a smoothed estimate of each column's distribution, which can be easier to read when the data is noisy.

In [None]:
sns.lineplot(data=iris.drop(['class'], axis=1))

# Histogram
To create a histogram in Seaborn we use the sns.histplot function (sns.distplot, which older tutorials use, is deprecated). We pass it the column we want to plot and it calculates the occurrences itself. We can also pass it the number of bins and, with the kde argument, choose whether to draw a Gaussian kernel density estimate on top of the histogram.

In [None]:
sns.histplot(wine_reviews['points'], bins=20, kde=False)

In [None]:
sns.histplot(wine_reviews['points'], bins=10, kde=True)

# Bar chart
In Seaborn a bar chart of counts can be created using the sns.countplot function, passing it the dataframe and the name of the column to count as the x argument.


In [None]:
sns.countplot(x='points', data=wine_reviews)

# Other graphs
Now that you have a basic understanding of the Matplotlib, Pandas Visualization and Seaborn syntax, I want to show you a few other graph types that are useful for extracting insights.

For most of them, Seaborn is the go-to library because of its high-level interface that allows for the creation of beautiful graphs in just a few lines of code.

# Box plots
A Box Plot is a graphical method of displaying the five-number summary. 

The five-number summary is the minimum, first quartile, median, third quartile, and maximum.
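
The five numbers can be computed directly with the pandas quantile method. This is a minimal sketch on made-up prices; the index labels are added by hand for readability.

```python
import pandas as pd

# Hypothetical prices, not taken from the wine-review dataset
prices = pd.Series([12, 15, 20, 35, 60, 18, 25])

# quantiles 0, 0.25, 0.5, 0.75 and 1 give the five-number summary
summary = prices.quantile([0, 0.25, 0.5, 0.75, 1])
summary.index = ["minimum", "first quartile", "median",
                 "third quartile", "maximum"]
print(summary)
```

The box of a box plot spans the first to the third quartile, with a line at the median.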

We can create box plots using Seaborn's sns.boxplot function, passing it the data as well as the x and y column names.

Box plots, just like bar charts, are great for data with only a few categories but can get messy really quickly.

In [None]:
df = wine_reviews[(wine_reviews['points']>=95) & (wine_reviews['price']<1000)]
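# pass the column names as keyword arguments; recent Seaborn versions no longer accept them positionally
sns.boxplot(x='points', y='price', data=df)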

# Heatmap
A Heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors. Heatmaps are perfect for exploring the correlation of features in a dataset.

To get the correlation of the features inside a dataset we can call <dataset>.corr(), which is a Pandas dataframe method. This will give us the correlation matrix.
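
As a small sketch of what the correlation matrix contains, the cell below calls corr() on a made-up dataframe with three numeric columns. The names `toy` and `toy_corr` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data: b rises with a, c falls as a rises
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

toy_corr = toy.corr()  # pairwise Pearson correlation of the columns
print(toy_corr)
```

The result is a square, symmetric dataframe with one row and one column per feature. The diagonal is 1 because every column correlates perfectly with itself.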
    
We can now use either Matplotlib or Seaborn to create the heatmap.

In [None]:
import numpy as np

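# get correlation matrix of the numeric columns
# (the string column 'class' is dropped first; recent pandas versions raise an error otherwise)
corr = iris.drop(['class'], axis=1).corr()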
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)

# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

To add annotations to the heatmap we need to add two for loops:

In [None]:
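# get correlation matrix of the numeric columns
# (the string column 'class' is dropped first; recent pandas versions raise an error otherwise)
corr = iris.drop(['class'], axis=1).corr()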
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)

# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(corr.columns)):
    for j in range(len(corr.columns)):
        text = ax.text(j, i, np.around(corr.iloc[i, j], decimals=2),
                       ha="center", va="center", color="black")

Seaborn makes it way easier to create a heatmap and add annotations:

In [None]:
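# drop the string column 'class' so that only numeric columns are correlated
sns.heatmap(iris.drop(['class'], axis=1).corr(), annot=True)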

# Faceting
Faceting is the act of breaking data variables up across multiple subplots and combining those subplots into a single figure.

Faceting is really helpful if you want to quickly explore your dataset.



To use one kind of faceting in Seaborn we can use the FacetGrid. 

First of all, we need to define the FacetGrid and pass it our data as well as a row or column, which will be used to split the data. Then we need to call the map function on our FacetGrid object and define the plot type we want to use, as well as the column we want to graph.

In [None]:
g = sns.FacetGrid(iris, col='class')
g = g.map(sns.kdeplot, 'sepal_length')
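
To make the idea behind faceting concrete, here is a rough hand-written equivalent in plain Matplotlib: one subplot per class, all combined in a single figure. It is only a sketch on a made-up stand-in for the iris dataframe, it draws histograms instead of density curves, and it is not how FacetGrid is implemented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the iris dataframe (three rows per class)
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.0, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# split the data by class, then give each group its own subplot
groups = list(sample.groupby("class"))
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True, figsize=(9, 3))
for ax, (name, group) in zip(axes, groups):
    ax.hist(group["sepal_length"], bins=5)
    ax.set_title(f"class = {name}")
    ax.set_xlabel("sepal_length")
```

Sharing the x and y axes across the subplots keeps the panels directly comparable.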

# Pairplot
We can also create a Seaborn pairplot or a Pandas scatter_matrix, both of which plot a grid of pairwise relationships in a dataset.

As the plots produced by the cells below show, both techniques plot every pair of features against each other. The diagonal of the grid is filled with histograms and the other plots are scatter plots.

In [None]:
sns.pairplot(iris)

In [None]:
from pandas.plotting import scatter_matrix

fig, ax = plt.subplots(figsize=(12,12))
scatter_matrix(iris, alpha=1, ax=ax)

# Conclusion
Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed.

Python offers multiple great graphing libraries that come packed with lots of different features. 
We looked at Matplotlib, Pandas visualization and Seaborn.

In [None]:
import seaborn as sns
import pandas as pd

print('starting to read csv')
df = pd.read_csv('nba.csv')
print('finished reading csv')
print(df.head())

#  ??? sns.lmplot(x = 'Team', y = 'Draft Year', data = df)
sns.scatterplot(x = 'Team', y = 'Draft Year', data = df)

