
# Data Visualization

### Reading

- McKinney, Chapter 9
- Molin, “Visualizing Data with Pandas and Matplotlib”
- Molin, “Plotting with Seaborn and Customization Techniques”
- https://blog.growingdata.com.au/a-guided-introduction-to-exploratory-data-analysis-eda-using-python/
- https://seaborn.pydata.org/introduction.html
- https://www.datasciencecourse.org/notes/visualization/

### Watch 
- [Bullshit data graphics](https://callingbullshit.org/videos.html) 1.2 & chapter 6

### Practice
- https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python
- https://www.datacamp.com/courses/introduction-to-seaborn
- https://www.datacamp.com/community/tutorials/time-series-analysis-tutorial
- https://www.datacamp.com/community/tutorials/geospatial-data-python

### Learning Outcomes

- Common data plots: line, bar, scatter, histogram
- Plotting data with matplotlib
- Plotting with pandas
- Statistical graphics with seaborn
- Exploratory data analysis
- Interactive data visualization for web browsers
- Data visualization with Tableau & D3
- Visualization practices to avoid


### What is Data Visualization

- visualization for **data exploration** allows one to see structure & patterns of data for later analysis
- visualization for **presentation** of conclusions
- learning how to visualize data before applying more sophisticated methods is a key skill for effective data science

#### Python visualization tools:

- matplotlib
- seaborn - designed to simplify data exploration and production of informative statistical graphics.
- Bokeh & Plotly - Python libraries for interactive web graphics
- [Geoplot](http://geopandas.org/gallery/plotting_with_geoplot.html) - Python library for easy-to-use geospatial visualizations, designed to work with GeoPandas input.
- [ScatterText](https://github.com/JasonKessler/scattertext#overview) - Python tool for visualizing what words and phrases are more characteristic of a category than others.


#### Other Visualization Tools

- [D3.js](https://d3js.org/) - JavaScript library for interactive web graphics
- [Tableau](https://www.tableau.com/) - drag-and-drop data analytics tools
- [PowerBI](https://powerbi.microsoft.com/en-us/what-is-power-bi/) - Microsoft data analytics solution. Similar to Tableau.


### matplotlib

matplotlib is a Python plotting package designed for creating (mostly two-dimensional) publication-quality plots.

It supports interactive plotting from the IPython shell and within Jupyter notebooks.

matplotlib can export visualizations to common vector and raster graphics formats (PDF, SVG, PNG, JPG, etc.).

- plots & plot types
- title & labels
- legends
- subplots
- annotations
- saving plots

```python
import matplotlib.pyplot as plt
%matplotlib inline
```
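
The features listed above can be sketched in one short example. This is a minimal illustration on synthetic data, not a prescribed workflow; the variable names and the output file name (`matplotlib_demo.png`, written to the system temp directory) are assumptions made for the example.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, for illustration only
x = np.linspace(0, 2 * np.pi, 200)

# Subplots: two panels side by side in one figure
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot with a title, axis labels, and a legend
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax_line.set_title("Sine and cosine")
ax_line.set_xlabel("x (radians)")
ax_line.set_ylabel("value")
ax_line.legend(loc="lower left")

# Annotation: an arrow pointing at the maximum of sin(x)
ax_line.annotate(
    "max of sin(x)",
    xy=(np.pi / 2, 1.0),
    xytext=(3.0, 0.8),
    arrowprops={"arrowstyle": "->"},
)

# A second plot type: histogram of random samples
rng = np.random.default_rng(seed=0)
ax_hist.hist(rng.normal(size=1000), bins=30)
ax_hist.set_title("Histogram of 1,000 normal samples")

fig.tight_layout()

# Saving plots: the file extension selects the output format
out_path = os.path.join(tempfile.gettempdir(), "matplotlib_demo.png")
fig.savefig(out_path, dpi=150)
```

In a notebook with `%matplotlib inline`, the figure is displayed under the cell automatically; changing the extension in `out_path` to `.pdf` or `.svg` produces vector output instead.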

### Plotting with pandas

- Series plot
- DataFrame plot
- plot features
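
As a rough sketch of these three items, the example below uses the `.plot()` method that pandas attaches to Series and DataFrame objects. The data is a synthetic random walk and the column names `A`, `B`, `C` are placeholders chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
dates = pd.date_range("2020-01-01", periods=100, freq="D")

# Series plot: the index becomes the x-axis
walk = pd.Series(rng.normal(size=100).cumsum(), index=dates, name="random walk")
fig_series, ax_series = plt.subplots(figsize=(8, 3))
walk.plot(ax=ax_series, title="Series.plot()")

# DataFrame plot: one line per column, with an automatic legend
df = pd.DataFrame(
    rng.normal(size=(100, 3)).cumsum(axis=0),
    index=dates,
    columns=["A", "B", "C"],
)
ax_df = df.plot(title="DataFrame.plot()", figsize=(8, 4))

# Plot features: `kind` selects the plot type,
# `subplots=True` draws one panel per column
axes = df.plot(
    kind="hist", bins=20, alpha=0.6,
    subplots=True, layout=(1, 3), figsize=(10, 3),
)
```

Because pandas draws with matplotlib, the returned axes objects accept the usual matplotlib calls (`set_xlabel`, `annotate`, and so on) for further customization.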

### Seaborn plots

Seaborn is a Python data visualization library based on matplotlib. 

It's designed to simplify data exploration and production of informative statistical graphics.

Seaborn's dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

**concepts**
- confidence intervals
- crosstabs
- distributions
- continuous probability distribution
- linear regression
- box plots
- violin plots
- heat maps

### Exploratory Data Analysis 

EDA is a crucial initial step for understanding a dataset and preparing for statistical modeling.

EDA answers questions such as:
- what is the data quality?
- is the data predictive enough for modeling?
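
A first pass at the data quality question might look like the sketch below. The small DataFrame is made up for the example, and the age threshold of 110 is an arbitrary assumption; real checks depend on the dataset.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset with deliberate quality problems
df = pd.DataFrame({
    "age": [34, 45, np.nan, 29, 45, 120],
    "city": ["Seattle", "Tacoma", "Seattle", None, "Tacoma", "Seattle"],
    "income": [72000, 58000, 61000, 49000, 58000, 65000],
})

# Fraction of missing values per column
missing_share = df.isna().mean()

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())

# Simple range check for implausible values (threshold is an assumption)
implausible_age = df[df["age"] > 110]

print(missing_share)
print("duplicate rows:", n_duplicates)
print(implausible_age)
```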

EDA is the process of performing initial investigations to:
- Uncover underlying structure & patterns in the data
- Identify important variables
- Identify anomalies
- Test a hypothesis
- Check assumptions
- Set the stage for model development


#### Step 1 - Understanding the data

- how many observations?
- how many features (variables)?
- which are the dependent variables?
- what are the data types?
- are variables numeric or categorical?
- what are the primary statistics for each feature?
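
Most of these questions map onto a few pandas calls. The sketch below uses a synthetic housing-style dataset whose column names are invented for the example; which column is the dependent variable (here assumed to be `price`) comes from the problem statement, not from the data itself.

```python
import numpy as np
import pandas as pd

# Synthetic dataset (hypothetical columns, for illustration only)
rng = np.random.default_rng(seed=3)
df = pd.DataFrame({
    "sqft": rng.integers(500, 4000, size=50),
    "bedrooms": rng.integers(1, 6, size=50),
    "neighborhood": rng.choice(["north", "south", "east"], size=50),
    "price": rng.normal(500_000, 100_000, size=50).round(2),
})

# How many observations and features?
n_observations, n_features = df.shape
print(n_observations, "observations,", n_features, "features")

# What are the data types?
print(df.dtypes)

# Which variables are numeric and which are categorical?
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Primary statistics for each feature
print(df[numeric_cols].describe())      # count, mean, std, min, quartiles, max
print(df[categorical_cols].describe())  # count, unique, top, freq
```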



#### Step 2 - Univariate Analysis

Describe & understand distribution of a single variable. Detect outliers. Identify patterns.

**Numeric**

Measures of **center** (e.g. mean, median, mode)

Measures of **spread** describe how variable the data is (e.g. variance & standard deviation). Histograms are useful for showing the distribution.

**Outliers** are values that fall far outside the bulk of the distribution. What counts as an outlier is very dataset-specific. Boxplots are useful for identifying outliers.
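
The measures above can be computed directly with pandas. In this sketch the data is synthetic, with two extreme values added on purpose, and outliers are flagged with the 1.5 × IQR rule that boxplot whiskers use by default; that rule is one common convention, not the only reasonable choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic sample plus two deliberately extreme values
rng = np.random.default_rng(seed=4)
values = pd.Series(np.append(rng.normal(loc=100, scale=10, size=200), [180.0, 15.0]))

# Measures of center (mode is taken on rounded values, since raw floats rarely repeat)
center = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.round().mode().iloc[0],
}

# Measures of spread
spread = {"variance": values.var(), "std": values.std()}

# Outliers by the 1.5 * IQR rule
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Histogram for the distribution, boxplot for the outliers
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(values.to_numpy(), bins=30)
ax_hist.set_title("Histogram")
ax_box.boxplot(values.to_numpy())
ax_box.set_title("Boxplot")
```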

**Probability Density Function (PDF)** - for continuous variables, a function whose value at any given point is the relative likelihood that the random variable takes a value near that point.

When plotted, the x-axis represents the values of the variable and the y-axis represents probability density. The area under the curve over a range of values gives the probability that the variable falls within that range.

**Cumulative Distribution Function (CDF)** - the probability that a continuous random variable takes a value less than or equal to a given value.

https://towardsdatascience.com/probability-concepts-explained-probability-distributions-introduction-part-3-4a5db81858dc
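
Both functions can be estimated from a sample. The sketch below, on synthetic standard normal data, approximates the PDF with a density-normalized histogram and the CDF with the empirical CDF (the fraction of samples at or below each value); both are estimates of the underlying functions, not the functions themselves.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=5)
samples = rng.normal(loc=0.0, scale=1.0, size=5000)

fig, (ax_pdf, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))

# density=True rescales bar heights so the total bar area equals 1
heights, edges, _ = ax_pdf.hist(samples, bins=40, density=True, alpha=0.5)

# Theoretical standard normal PDF for comparison
grid = np.linspace(-4, 4, 200)
ax_pdf.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi))
ax_pdf.set_title("Estimated PDF")
ax_pdf.set_ylabel("probability density")

# Empirical CDF: fraction of samples <= each sorted value
sorted_samples = np.sort(samples)
ecdf = np.arange(1, len(sorted_samples) + 1) / len(sorted_samples)
ax_cdf.step(sorted_samples, ecdf, where="post")
ax_cdf.set_title("Empirical CDF")
ax_cdf.set_ylabel("P(X <= x)")

# Sanity check: the histogram's bar areas sum to 1
total_area = float(np.sum(heights * np.diff(edges)))
```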

**Categorical**

Range & frequency of values (counts & percents)
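
For a categorical variable, counts and percents come from `value_counts`. The `color` Series below is made up for the example.

```python
import pandas as pd

colors = pd.Series(
    ["red", "blue", "red", "green", "blue", "red", "red", "green"],
    name="color",
)

# Frequency of each value, as counts and as percentages
counts = colors.value_counts()
percents = colors.value_counts(normalize=True) * 100

summary = pd.DataFrame({"count": counts, "percent": percents.round(1)})
print(summary)

# Range of values: how many distinct categories are present
print("distinct values:", colors.nunique())
```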

#### Step 3 - Bi-Variate Analysis

Analysis that explores the relationship between two variables.

**Numeric**

A **scatterplot** visualizes the relationship between two numeric variables and helps identify its type and strength.

**Correlation matrix**

Square matrix with the same variables in rows & columns. Each cell holds a correlation coefficient ranging from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation). A **heatmap** plot can show correlation strength and direction with color.
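
A scatterplot and a correlation heatmap can be sketched with pandas and matplotlib alone. The columns below (`x`, `pos`, `neg`, `noise`) are synthetic and named for the relationship they were built to have; `DataFrame.corr()` computes Pearson coefficients by default.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=6)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame({
    "x": x,
    "pos": 2 * x + rng.normal(scale=0.5, size=n),  # strong positive relationship
    "neg": -x + rng.normal(scale=0.5, size=n),     # strong negative relationship
    "noise": rng.normal(size=n),                   # unrelated to x
})

# Correlation matrix (Pearson r, values in [-1, 1])
corr = df.corr()
print(corr.round(2))

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Scatterplot of one pair of variables
ax_scatter.scatter(df["x"], df["pos"], s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("pos")

# Heatmap of the full matrix; a diverging colormap separates sign from strength
image = ax_heat.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns)
ax_heat.set_yticks(range(len(corr.index)))
ax_heat.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax_heat, label="Pearson r")
```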

**Categorical**

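Can be visualized in a correlation matrix, after encoding textual values into numeric ones (**label encoding**). Because the assigned numbers impose an order on the categories, the resulting correlations are easiest to interpret for binary or ordinal variables.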

**Stacked column chart** can show the percentages that each category from one variable contributes to a total across categories of the second variable.
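
Both ideas can be sketched with pandas. The `plan` and `churned` columns below are hypothetical; label encoding is done here with pandas category codes, which is one of several ways to do it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: two categorical variables
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro", "basic", "basic"],
    "churned": ["yes", "no", "no", "no", "yes", "yes", "no", "no"],
})

# Label encoding: each category is mapped to an integer code
# (codes follow alphabetical order of the category labels)
encoded = df.apply(lambda col: col.astype("category").cat.codes)

# Share of each `churned` value within each plan (every row sums to 1)
shares = pd.crosstab(df["plan"], df["churned"], normalize="index")
print(shares)

# Stacked column chart of those shares
ax = shares.plot(kind="bar", stacked=True, figsize=(6, 4))
ax.set_ylabel("share of customers")
```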


#### Step 4 - Multivariate Analysis

Generally shows the statistical relationship among more than two variables.

**Contour plot** can show combinations with highest probability.
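
One way to draw such a contour plot is to estimate the joint density of two variables on a grid and pass it to `contourf`. The sketch below uses synthetic bivariate normal data and a 2D histogram as the density estimate; a kernel density estimate would be a smoother alternative.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=7)

# Two correlated variables drawn from a bivariate normal distribution
mean = [0.0, 0.0]
cov = [[1.0, 0.6], [0.6, 1.0]]
x, y = rng.multivariate_normal(mean, cov, size=5000).T

# Estimate the joint density on a 30 x 30 grid
density, x_edges, y_edges = np.histogram2d(x, y, bins=30, density=True)
x_centers = (x_edges[:-1] + x_edges[1:]) / 2
y_centers = (y_edges[:-1] + y_edges[1:]) / 2

fig, ax = plt.subplots(figsize=(5, 4))

# histogram2d returns density[x_bin, y_bin]; contourf expects Z[y, x], so transpose
contours = ax.contourf(x_centers, y_centers, density.T, levels=10)
fig.colorbar(contours, ax=ax, label="estimated density")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

The innermost contour marks the combinations of `x` and `y` where the estimated density is highest.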

**Principal Component Analysis (PCA)**

A set of observations of correlated variables is converted into a set of values of linearly uncorrelated variables (the principal components).

PCA can completely restructure the data, removing redundancies and ordering the newly obtained components by the amount of the original variance they express.

It can give insight into how the set of features collectively describes the analysis outcome (target).

It is a very common technique for speeding up a machine learning algorithm by reducing the dimensionality of the input features.
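
The mechanics of PCA can be sketched with NumPy's singular value decomposition. This is a bare-bones illustration on synthetic data, not a replacement for a library implementation; it centers the features but does not standardize them, which is usually advisable when features are on different scales.

```python
import numpy as np

rng = np.random.default_rng(seed=8)
n = 500

# Three features: the first two are driven by one latent factor (so they are
# correlated), the third is independent
latent = rng.normal(size=n)
X = np.column_stack([
    latent + rng.normal(scale=0.1, size=n),
    2 * latent + rng.normal(scale=0.1, size=n),
    rng.normal(size=n),
])

# 1. Center each feature
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Variance expressed by each component, ordered largest first
explained_variance = S**2 / (n - 1)
explained_ratio = explained_variance / explained_variance.sum()
print("explained variance ratio:", explained_ratio.round(3))

# 4. Project the data onto the components; the new columns are uncorrelated
scores = X_centered @ Vt.T

# 5. Dimensionality reduction: keep only the first k components
k = 2
X_reduced = scores[:, :k]
```

Here the first component captures most of the variance because two of the three features carry nearly the same information, which is the redundancy PCA removes.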
