# Python Visualization



## Objective

1. Data Visualization with Python
2. Examine different types of charts

## Data Visualization

    - The representation of information in the form of a chart, diagram, picture, etc.
    
<center><img src="./img/dataviz.jpg"/></center>

## Common Python Packages

- Matplotlib: low level, provides lots of freedom
- Pandas Visualization: easy to use interface, built on Matplotlib
- Seaborn: high-level interface, great default styles
- ggplot: based on R’s ggplot2, uses Grammar of Graphics
- Plotly: can create interactive plots

## Matplotlib

- Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
- Documentation can be found here: https://matplotlib.org/
- Installed with `pip` or `conda`

## Matplotlib Example

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [1, 1, 2, 3, 5]
y2 = [0, 4, 2, 6, 8]
y3 = [1, 3, 5, 7, 9]

# Stack the three series into one 2D array (one row per series)
y = np.vstack([y1, y2, y3])

labels = ["Fibonacci", "Evens", "Odds"]

fig, ax = plt.subplots()
ax.stackplot(x, y, labels=labels)
ax.legend(loc='upper left')
plt.show()

## Pandas Visualization

- Visualization built on top of matplotlib
- Documentation can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
- Part of pandas, a larger library for working with data


## Pandas Example

In [None]:
import pandas as pd
import numpy as np
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

## Seaborn

- Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- Documentation can be found here: https://seaborn.pydata.org/
- Installed with `pip` or `conda`

## Seaborn Example

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")

# Load the example diamonds dataset
diamonds = sns.load_dataset("diamonds")

# Draw a scatter plot while assigning point colors and sizes to different
# variables in the dataset
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
sns.scatterplot(x="carat", y="price",
                hue="clarity", size="depth",
                palette="ch:r=-.2,d=.3_r",
                hue_order=clarity_ranking,
                sizes=(1, 8), linewidth=0,
                data=diamonds, ax=ax)

## Point/Scatter Chart

    - Displays data as points in an x/y coordinate system on a Cartesian plane


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Uncomment to fix the random state for reproducibility
#np.random.seed(19680801)


# Random number of points, at least 1
N = np.random.randint(1, 100)
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = (30 * np.random.rand(N))**2  # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()

## Point Plots

    - Point plots serve the same purpose as bar plots but in a different style.
    - Rather than a full bar, the value of the estimate is represented by
      a point at a certain height on the other axis.
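
To make the "estimate" concrete, here is a minimal sketch on a small made-up table (the `group` and `value` column names are hypothetical, not from any dataset used here). By default, Seaborn's `pointplot` uses the mean as its estimate, so the height of each point is the mean of the values in that category.

```python
import pandas as pd

# Hypothetical toy data: two categories with a 0/1 outcome
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b"],
    "value": [1, 0, 1, 0, 0, 1, 0],
})

# One estimate (here the mean) per category; a point plot draws
# each of these numbers as a single point instead of a full bar
estimates = toy.groupby("group")["value"].mean()
print(estimates)
```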

In [None]:
df = sns.load_dataset('titanic')
sns.pointplot(x = "sex", y="survived", hue="class", data=df)
df.head()

## Bar Charts

    - A bar graph shows comparisons among discrete categories.
    - One axis of the chart shows the specific categories being compared, 
      and the other axis represents a measured value
    - This is a common visualization used in data representation

## Matplotlib Bar Chart Example

In [None]:
import matplotlib.pyplot as plt 
plt.bar([1,3,5,7,9],[5,2,7,8,2], label="Example one")
plt.bar([2,4,6,8,10],[8,6,2,5,6], label="Example two", color='g')
plt.legend()
plt.xlabel('bar number')
plt.ylabel('bar height')
plt.title('Sample Bar Chart')
plt.show()

## Seaborn Bar Chart Example

In [None]:
import seaborn as sns
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
ax = sns.barplot(x="time", y="tip", data=tips,
                  order=["Dinner", "Lunch"])
ax

## Histograms

    - graphical representation of the distribution of numerical data
    - data is divided into bins and the height represents the frequency
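
The binning step can be sketched without drawing anything. The example below assumes a made-up sample of normally distributed values and an arbitrary choice of 8 bins; `np.histogram` returns the count for each bin together with the bin edges, which is the information a histogram then draws as bar heights.

```python
import numpy as np

# Hypothetical sample: 1,000 draws from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=1000)

# Split the value range into 8 equal-width bins and count the values in each
counts, edges = np.histogram(sample, bins=8)

# Each bin is described by its left edge, right edge, and frequency
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {count}")
```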

## Matplotlib Histogram Example

- Plot of the frequency distribution of a numeric array, made by splitting it into small equal-sized bins
- Used to estimate the distribution of the data


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df = pd.read_csv(r"./data/housing.csv")
data = df['median_house_value']

In [None]:

n, bins, patches = plt.hist(x=data, bins='auto', color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Home Value')
plt.ylabel('Frequency')
plt.title('My First Histogram Ever')

maxfreq = n.max()
plt.ylim(top=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)

## Seaborn Histogram Example

In [None]:
# histplot replaces distplot, which is deprecated in recent Seaborn releases
sns.histplot(df['median_house_value'], color='blue')


## Pie Chart

    - A pie chart is a classic way to show the composition of groups.
    - It is often not recommended because it can be misleading
    - When using a pie chart, it's highly recommended to explicitly write down the percentage or number for each slice

## Bad Pie Chart Example 

- Notice that we cannot tell how large each slice is because no values or percentages are shown
- Conclusions drawn from this chart will probably be wrong

In [None]:
# Import
import pandas as pd
import matplotlib.pyplot as plt
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size()

# Make the plot with pandas
df.plot(kind='pie', subplots=True, figsize=(8, 8))
plt.title("Pie Chart of Vehicle Class - Bad")
plt.ylabel("")
plt.show()

## Good Pie Chart Example

- Here everything is labeled and shown
- Data is easy to read and understand

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

# Prepare Data
df = df_raw.groupby('class').size().reset_index(name='counts')

# Draw Plot
fig, ax = plt.subplots(figsize=(12, 7), subplot_kw=dict(aspect="equal"), dpi= 80)

data = df['counts']
categories = df['class']
explode = [0,0,0,0,0,0.1,0]

def func(pct, allvals):
    # Convert the slice's percentage back to an absolute count
    absolute = int(round(pct / 100. * np.sum(allvals)))
    return "{:.1f}% ({:d})".format(pct, absolute)

wedges, texts, autotexts = ax.pie(data,
                                  autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"),
                                  colors=plt.cm.Dark2.colors,
                                  startangle=140,
                                  explode=explode)

# Decoration
ax.legend(wedges, categories, title="Vehicle Class", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=10, weight=700)
ax.set_title("Class of Vehicles: Pie Chart")
plt.show()

## Dealing with Point Overlaps

    - Data points often overlap at the same x/y coordinates
    - Many times there is a need to show all points
    - `Seaborn` can add jitter to a strip plot to handle this case!
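
The idea behind jitter can be sketched by hand: shift each point by a small random amount along one axis so that stacked points separate. This is only an illustration on made-up data with an arbitrary jitter width of 0.25; Seaborn's own implementation may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: integer x values, so many points share a position
x = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
y = np.array([5, 5, 5, 7, 7, 6, 6, 6, 6])

# Shift each x by a small random amount in [-0.25, 0.25)
jitter = 0.25
x_jittered = x + rng.uniform(-jitter, jitter, size=x.shape)

# Count distinct positions before and after adding jitter
distinct_before = len(set(zip(x.tolist(), y.tolist())))
distinct_after = len(set(zip(x_jittered.tolist(), y.tolist())))
print(distinct_before, distinct_after)
```

Plotting `x_jittered` against `y` with `plt.scatter` would then show every point, at the cost of slightly inexact x positions.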

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
# Recent Seaborn releases require x and y to be passed as keywords
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()

## Heatmap Plots

    - 2D displays of the values in a data matrix
    - Values are normally colored by intensity of value
    - Common applications: correlation matrix
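
As a sketch of the correlation-matrix use case, the example below builds a small made-up table (the column names `a`, `b`, and `c` are hypothetical) and computes its pairwise correlations with pandas. The result is a square data matrix of the kind a heatmap displays.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: "b" is built from "a", "c" is independent noise
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# Pairwise Pearson correlations: a square, symmetric matrix
# with 1.0 on the diagonal
corr = toy.corr()
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would color each cell by the strength of the correlation.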

## Heatmap with Seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Load the example flights dataset (long-form) and pivot it to wide-form
flights_long = sns.load_dataset("flights")
flights = flights_long.pivot(index="month", columns="year", values="passengers")

# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5, ax=ax)

## Heatmap with Pandas

    - Here is another example of a heatmap showing a divergent color scheme
    - We set `axis=None` to color each value relative to the whole dataset; with the default `axis=0`, each column is colored independently
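
The effect of the `axis` choice can be illustrated numerically. The sketch below uses a made-up table and plain min-max scaling as a stand-in for the color mapping, so it is an assumption about the general idea rather than the styler's exact internals. It contrasts scaling each column on its own (what `axis=0` does) with scaling against the whole table (`axis=None`).

```python
import pandas as pd

# Hypothetical table whose columns are on very different scales
toy = pd.DataFrame({"small": [1, 2, 3], "large": [100, 200, 300]})

# Column-wise scaling: each column uses its own minimum and maximum
per_column = (toy - toy.min()) / (toy.max() - toy.min())

# Whole-table scaling: one shared minimum and maximum
lo, hi = toy.min().min(), toy.max().max()
whole_table = (toy - lo) / (hi - lo)

print(per_column)
print(whole_table)
```

With column-wise scaling both columns span the full range from 0 to 1, while whole-table scaling keeps the `small` column near 0, which is what makes values comparable across the entire heatmap.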

In [None]:
# Reuse the pivoted flights table from the Seaborn heatmap example
df = flights
df.style.background_gradient(cmap='coolwarm', axis=None)

## Summary of Visualization

   - There are multiple ways to visualize data
   - Craft your visuals to meet your needs
   - **DO NOT** mislead individuals with hard to read graphics