# Data Visualization with Python

Taken from https://www.techchange.org/2015/05/19/data-visualization-analysis-international-development/

"One way to analyze data is through data visualizations. Data visualization turns numbers and letters into aesthetically pleasing visuals, making it easy to recognize patterns and find exceptions.

Benefits of visualizing data:
1. recognize patterns not seen by plain csv tables
2. more easily identify outliers and exceptions in the dataset
3. allows for analyzing data over time (calendar scrolls)"


There are many ways to visualize data. This tutorial is a reference for several common Python visualization techniques.


Techniques adapted from:
1. https://www.analyticsvidhya.com/blog/2015/05/data-visualization-python/
2. https://matplotlib.org/gallery.html


We are going to cover:
1. histograms
2. boxplots
3. scatter plots
4. heat maps


In [None]:
## Getting Started
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Histograms
A histogram plots how often values occur in a continuous data set that has been divided into classes, called bins.

When should you use one? 

1. When the data are numerical.
2. When you want to see the shape of the data’s distribution, especially when determining whether the output of a process is distributed approximately normally.
3. When you wish to communicate the distribution of data quickly and easily to others.
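
To make the idea of bins concrete, here is a minimal sketch that counts values per bin with `np.histogram`. The sample data, the seed, and the choice of five bins are assumptions made for illustration only.

```python
import numpy as np

# Illustrative data only: 1,000 draws from a normal distribution.
rng = np.random.default_rng(0)
values = rng.normal(loc=200, scale=25, size=1000)

# Count how many values fall into each of 5 equal-width bins.
counts, edges = np.histogram(values, bins=5)

# With density=True the bar heights are rescaled so that the total
# area (height * bin width) equals 1.
density, _ = np.histogram(values, bins=5, density=True)
area = np.sum(density * np.diff(edges))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.2f} to {right:7.2f}: {count}")
print(f"total area of the density histogram: {area:.3f}")
```

The same `bins` and `density` arguments are accepted by `Axes.hist`, so the numbers printed here correspond to the bar heights a plotted histogram would show.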


In [None]:
np.random.seed(0)

mu = 200
sigma = 25
x = np.random.normal(mu, sigma, size=100)

fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(8, 4))

# density=True replaces the older `normed` argument, which newer
# Matplotlib releases no longer accept.
ax0.hist(x, 20, density=True, histtype='stepfilled', facecolor='g', alpha=0.75)
ax0.set_title('stepfilled')

# Create a histogram by providing the bin edges (unequally spaced).
bins = [100, 150, 180, 195, 205, 220, 250, 300]
ax1.hist(x, bins, density=True, histtype='bar', rwidth=0.8)
ax1.set_title('unequal bins')

fig.tight_layout()
plt.show()

## Boxplots

1. Boxplots are useful for identifying outliers and for comparing distributions.
2. They are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution.
3. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers.
4. They allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, and mid-range.
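
The quantities a boxplot lets you estimate by eye can also be computed directly. The sketch below does so with NumPy on an illustrative sample. The sample and the variable names are assumptions, and the 1.5 × IQR fences follow the common convention that Matplotlib's `boxplot` uses by default for its whiskers.

```python
import numpy as np

# Illustrative sample only.
rng = np.random.default_rng(123)
sample = rng.normal(loc=0, scale=1, size=100)

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = np.percentile(sample, [25, 50, 75])

# L-estimators that can be read off a boxplot.
iqr = q3 - q1                                   # interquartile range
midhinge = (q1 + q3) / 2                        # average of the two quartiles
data_range = sample.max() - sample.min()        # range
mid_range = (sample.max() + sample.min()) / 2   # mid-range

# Points beyond 1.5 * IQR from the box are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Q1={q1:.2f} median={median:.2f} Q3={q3:.2f} IQR={iqr:.2f}")
print(f"midhinge={midhinge:.2f} range={data_range:.2f} mid-range={mid_range:.2f}")
print(f"number of outliers: {outliers.size}")
```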

In [None]:
# Random test data
np.random.seed(123)
all_data = [np.random.normal(0, std, 100) for std in range(1, 4)]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

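# rectangular box plot
bplot1 = axes[0].boxplot(all_data,
                         vert=True,   # vertical box alignment
                         patch_artist=True)   # fill with color

# notched box plot on the second axes, which would otherwise stay empty
bplot2 = axes[1].boxplot(all_data,
                         notch=True,  # notch shape
                         patch_artist=True)   # fill with color

# fill with colors
colors = ['pink', 'lightblue', 'lightgreen']
for bplot in (bplot1, bplot2):
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)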

# adding horizontal grid lines
for ax in axes:
    ax.yaxis.grid(True)
    ax.set_xticks([y+1 for y in range(len(all_data))])
    ax.set_xlabel('xlabel')
    ax.set_ylabel('ylabel')

# add x-tick labels (one label per data set)
plt.setp(axes, xticks=[y+1 for y in range(len(all_data))],
         xticklabels=['x1', 'x2', 'x3'])

plt.show()


## Scatter Plots

Scatter plots show how one variable changes with another. The relationship between two variables is called their correlation.

Useful when:
1.  you need to find potential relationships between values or outliers in data sets.
2.  you want to show data where each instance has at least two metrics, for example, average life expectancy and average gross domestic product per capita in different countries.

Advantages

A scatter plot can visualize the correlation of two or more measures at the same time. A third measure, encoded as marker size or color, is an efficient way to differentiate between values and makes it simpler to identify, for example, large countries, large customers, or large quantities.
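
The correlation a scatter plot suggests visually can also be quantified. The following sketch uses `np.corrcoef` to compute the Pearson correlation coefficient for one related and one unrelated pair of variables. The data are synthetic and the variable names are chosen for illustration only.

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(size=200)

# y_related depends on x plus a little noise; y_unrelated is independent of x.
y_related = 2.0 * x + 0.5 * rng.normal(size=200)
y_unrelated = rng.normal(size=200)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the correlation between the two inputs.
r_related = np.corrcoef(x, y_related)[0, 1]
r_unrelated = np.corrcoef(x, y_unrelated)[0, 1]

print(f"related pair:   r = {r_related:.2f}")
print(f"unrelated pair: r = {r_unrelated:.2f}")
```

A coefficient near 1 or -1 corresponds to points that cluster tightly around a line, while a coefficient near 0 corresponds to a shapeless cloud.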

In [None]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()

## Heat Maps

A heat map is a visualization that presents multiple rows of data in a way that makes immediate sense by assigning a different size and color to each cell, where each cell represents a row.

A color slider at the bottom or on the side of the heat map allows the end-user to easily spot the high and low outliers in the column represented by color.

Biggest benefit: heat maps use color to communicate relationships between data values that would be much harder to understand if presented numerically in a spreadsheet. When done properly, the conclusions of the visualization should be immediately clear to the reader.
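
As a rough sketch of the color scale described above, the example below attaches a color bar to a small heat map using `fig.colorbar` and then locates the highest and lowest cells numerically. The random data, the seed, and the label text are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data only: 4 rows by 2 columns of values in [0, 1).
rng = np.random.default_rng(1)
data = rng.random((4, 2))

fig, ax = plt.subplots()
# pcolormesh returns the colored mesh, which the color bar needs as input.
mesh = ax.pcolormesh(data, cmap="Reds", edgecolors="k")

# The color bar maps each shade back to a numeric value.
cbar = fig.colorbar(mesh, ax=ax)
cbar.set_label("value")

# The darkest and lightest cells correspond to the maximum and minimum.
hi_row, hi_col = np.unravel_index(np.argmax(data), data.shape)
lo_row, lo_col = np.unravel_index(np.argmin(data), data.shape)
print(f"highest cell: row {hi_row}, column {hi_col}")
print(f"lowest cell:  row {lo_row}, column {lo_col}")
```

In a notebook the figure is displayed when the cell finishes. When running this as a script, call `plt.show()` at the end.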


In [None]:
data = np.random.rand(4, 2)
rows = list('1234')   # row categories
columns = list('MF')  # column categories
fig, ax = plt.subplots()
# Draw the cells with a red color map and black edges
ax.pcolor(data, cmap=plt.cm.Reds, edgecolors='k')
# Place each tick at the center of its cell
ax.set_xticks(np.arange(0, 2) + 0.5)
ax.set_yticks(np.arange(0, 4) + 0.5)
# Position the tick marks on the bottom and left axes
ax.xaxis.tick_bottom()
ax.yaxis.tick_left()
# Label each tick with its category
ax.set_xticklabels(columns, minor=False, fontsize=20)
ax.set_yticklabels(rows, minor=False, fontsize=20)
plt.show()

# Explore

Need the docs? Just pass any function to `help()`, for example:


In [None]:
help(plt.hist)


# Line of Best Fit



In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
n = 50
x = np.random.randn(n)
y = x * np.random.randn(n)

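fig, ax = plt.subplots()
# Fit a degree-1 polynomial; np.polyfit returns the coefficients from
# highest power to lowest, so fit[0] is the slope and fit[1] the intercept.
fit = np.polyfit(x, y, deg=1)
# Draw the fitted line, then the raw points.
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.scatter(x, y)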



In [None]:
fig.show()