### Exploratory Data Analysis -- Summary from Reading

1. Univariate versus multivariate
1. Non-graphical versus graphical

#### Univariate, Non-Graphical
* Categorical variables:  frequency counts
* Numeric variables:  sample statistics
  * Central tendency
  * Spread
  * Skew
  * Interquartile range
  
#### Univariate, Graphical
* Histogram
* Box plots 
* QN (quantile-normal) plot

#### Multivariate, Non-Graphical
* Categorical:  crosstab
* Numeric:  correlation and covariance

#### Multivariate, Graphical
* Side-by-side boxplot
* Scatter plot



In [None]:
# Load the same "user" data set from last time

In [None]:
import pandas as pd

In [None]:
users = pd.read_table('user.tbl', sep='|')

In [None]:
# What are the columns, what are their data types, get basic range / frequency information
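
A minimal sketch of a first look at columns, types, and basic range and frequency information. The small `users_demo` frame is made up for illustration and only stands in for the real `users` table; the column names are assumptions based on the attributes mentioned in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the users table loaded from user.tbl
users_demo = pd.DataFrame({
    'user_id':    [1, 2, 3],
    'age':        [24, 53, 23],
    'gender':     ['M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer'],
})

print(users_demo.dtypes)                  # column names and their data types

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = users_demo.describe()

# Text columns: count, number of unique values, most frequent value and its frequency
categorical_summary = users_demo[['gender', 'occupation']].describe()

print(numeric_summary)
print(categorical_summary)
```

On the real data, `users.info()` gives a similar overview of columns, types, and non-null counts in a single call.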

In [None]:
## Frequency counts for categorical variables -- age and occupations
## Sort occupation by name, and by frequency

In [None]:
##  Frequency table in the chapter includes both absolute and relative counts.
##  Goal is to get that table:  occupation, count, percentage  (formatted as a percentage!)

In [None]:
# Finally, make it look tabular by creating a data frame with
# the index, then counts, then percentage
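
One way the occupation frequency table could be built, as a sketch. The `occupation` values below are invented and stand in for `users.occupation`; the approach relies only on `value_counts` and ordinary DataFrame construction.

```python
import pandas as pd

# Hypothetical stand-in for users.occupation (the real data comes from user.tbl)
occupation = pd.Series(
    ['student', 'engineer', 'student', 'writer', 'student', 'engineer'],
    name='occupation',
)

counts = occupation.value_counts()                 # absolute counts, most frequent first
shares = occupation.value_counts(normalize=True)   # relative counts, same order

freq_table = pd.DataFrame({
    'count': counts,
    'percentage': shares.map('{:.1%}'.format),     # formatted as a percentage string
})

by_name = freq_table.sort_index()                  # same table, sorted by occupation name
print(freq_table)
```

Formatting the percentage as a string is convenient for display, but it means the column can no longer be used in arithmetic; keep `shares` around if the numbers are needed later.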

#### Sample Statistics for Quantitative Variables

```
Standard Error of Mean. A measure of how much the value of the mean may vary from sample to sample taken from the same distribution. It can be used to roughly compare the observed mean to a hypothesized value (that is, you can conclude the two values are different if the ratio of the difference to the standard error is less than -2 or greater than +2).
```
```
Skewness. A measure of the asymmetry of a distribution. The normal distribution is symmetric and has a skewness value of 0. A distribution with a significant positive skewness has a long right tail. A distribution with a significant negative skewness has a long left tail. As a guideline, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.
```

In [None]:
# Mean, median, standard deviation, standard error of the mean, skew
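
A short sketch of these statistics using pandas Series methods. The `age` values are hypothetical and stand in for `users.age`.

```python
import pandas as pd

# Hypothetical ages standing in for users.age
age = pd.Series([22, 25, 25, 31, 38, 47, 60])

summary = pd.Series({
    'mean': age.mean(),
    'median': age.median(),
    'std': age.std(),     # sample standard deviation (ddof=1)
    'sem': age.sem(),     # standard error of the mean: std / sqrt(n)
    'skew': age.skew(),   # a positive value suggests a longer right tail
})
print(summary)
```

Here the mean sits above the median and the skew is positive, which is the pattern the skewness definition above describes for a right-tailed distribution.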

For interquartile range on age, first we need quartiles 

In [None]:
# Or do it directly from the scipy.stats module
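
Both routes to the interquartile range can be sketched side by side: from the quartiles, and directly with `scipy.stats.iqr`. The `age` values are hypothetical and stand in for `users.age`.

```python
import pandas as pd
from scipy import stats

# Hypothetical ages standing in for users.age
age = pd.Series([22, 25, 25, 31, 38, 47, 60])

# Route 1: compute the first and third quartiles, then subtract
q1, q3 = age.quantile([0.25, 0.75])
iqr_manual = q3 - q1

# Route 2: let scipy compute the same quantity in one call
iqr_scipy = stats.iqr(age)

print(q1, q3, iqr_manual, iqr_scipy)
```

Both use linear interpolation between data points by default, so the two results should agree.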


#### Graphics for Univariate Attributes

Histograms, boxplots, and their friends

In [None]:
# Histogram on age

In [None]:
# Varying number of bins matters!
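
A sketch of how the bin count changes the picture. The ages are randomly generated for illustration (not the real `users.age`), and the bin counts 5 and 30 are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical ages; the seed only makes the example repeatable
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(18, 70, size=200), name='age')

# Same data, two bin counts, side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, bins in zip(axes, [5, 30]):
    age.plot(kind='hist', bins=bins, ax=ax, title=f'{bins} bins')
```

Few bins smooth over detail, while many bins can turn sampling noise into apparent structure, so it is worth trying several values before reading much into the shape.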

Boxplots and Friends


In [None]:
# Boxplot on age is easy

In [None]:
# Boxplot on a whole data frame

In [None]:
#  Adding some new libraries -- we will be looking at both 
#  seaborn and the plotting libraries more, later on

import seaborn as sns
import matplotlib.pyplot as plt

# Choose the style that fits your mood!

plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
# What does the strip plot tell us?

In [None]:
# Seaborn also has a boxplot

In [None]:
# What more does a violin plot tell us?

In [None]:
# New data set with more numerical attributes.
#    beer, spirit, wine is number of servings
#    liters is number of liters of pure alcohol consumed

drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
drinks = pd.read_csv('drinks.csv', header=0, names=drink_cols)

In [None]:
drinks.info()

In [None]:
drinks.head()

In [None]:
# Box plot on this data set is more interesting

In [None]:
# The QN plot on user age? Need yet another library!
import pylab 
import scipy.stats as stats
stats.probplot(users.age, dist="norm", plot=pylab)
pylab.show()

In [None]:
# What about on wine instead?

In [None]:
# How can you make sense of that odd result????
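
One possible reading of the odd result, assuming the wine column is strongly right-skewed with many values near zero (an assumption about `drinks.csv`, not something verified here). The sketch compares a roughly normal sample with a skewed one, both randomly generated, using the fit that `probplot` returns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=35, scale=10, size=500)   # roughly normal sample
skewed = rng.exponential(scale=50, size=500)         # many small values, long right tail

# Without a plot argument, probplot returns the plotted points and a least-squares fit.
# r is the correlation between the ordered sample and the theoretical normal quantiles.
(_, _), (_, _, r_symmetric) = stats.probplot(symmetric, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed, dist="norm")

print(f"r for symmetric sample: {r_symmetric:.3f}")
print(f"r for skewed sample:    {r_skewed:.3f}")
```

If wine behaves like the skewed sample, its points bend away from the reference line instead of hugging it, which is how a long right tail shows up on a QN plot.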

#### Multivariate Non-Graphical

Crosstabs -- 

In [None]:
# For example, relationship between occupation and gender

In [None]:
# Add margins to get the row and column totals

In [None]:
# Can you now get the percentage of females for each occupation?
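
The three crosstab steps can be sketched with `pd.crosstab`. The `users_demo` frame is invented for illustration and stands in for the real `users` table.

```python
import pandas as pd

# Hypothetical stand-in for the occupation and gender columns of users
users_demo = pd.DataFrame({
    'occupation': ['student', 'student', 'engineer', 'engineer', 'engineer', 'writer'],
    'gender':     ['F', 'M', 'M', 'M', 'F', 'F'],
})

# Plain counts of each occupation/gender combination
counts = pd.crosstab(users_demo.occupation, users_demo.gender)

# margins=True adds an 'All' row and column holding the totals
with_totals = pd.crosstab(users_demo.occupation, users_demo.gender, margins=True)

# normalize='index' divides each row by its total, giving shares within an occupation
row_share = pd.crosstab(users_demo.occupation, users_demo.gender, normalize='index')
pct_female = row_share['F'].map('{:.1%}'.format)

print(with_totals)
print(pct_female)
```

Normalizing by row answers "what share of each occupation is female"; `normalize='columns'` would instead answer "what share of each gender holds this occupation".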


In [None]:
# Correlation and covariance matrices are pretty easy.  Is everything as expected?
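
A minimal sketch of both matrices on a small made-up frame. The numbers are illustrative and are not taken from `drinks.csv`.

```python
import pandas as pd

# Hypothetical numeric columns standing in for the drinks data
drinks_demo = pd.DataFrame({
    'beer':   [0, 89, 25, 245, 217],
    'spirit': [0, 132, 0, 138, 57],
    'wine':   [0, 54, 14, 312, 45],
})

corr = drinks_demo.corr()   # Pearson correlations, unit-free, diagonal is 1
cov = drinks_demo.cov()     # covariances in the original units, diagonal is the variance

print(corr)
print(cov)
```

On the real data the frame also holds text columns, so it may be necessary to select the numeric columns first, for example `drinks[['beer', 'spirit', 'wine', 'liters']].corr()`.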

#### Scatter Plots



In [None]:
drinks.plot(kind='scatter', x='beer', y='wine');

In [None]:
# That plot gets a little muddled at the bottom, use transparency value to 
#  emphasize where there are overlaps

In [None]:
#  There are many ways to display a scatter plot of three or more variables.
#  This is one of the easier ones:  third dimension communicated by the color intensity
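
A sketch of both ideas on made-up data: transparency to reveal overlapping points, and color intensity to carry a third variable. The values, the `alpha` of 0.3, and the `'Blues'` colormap are illustrative choices, not settings from the original notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values standing in for the drinks data
drinks_demo = pd.DataFrame({
    'beer':   [0, 5, 5, 10, 89, 120, 245, 217, 300],
    'wine':   [0, 2, 2, 1, 54, 30, 312, 45, 200],
    'spirit': [0, 10, 3, 50, 132, 80, 138, 57, 150],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

# Left: alpha makes stacked points darker than isolated ones
drinks_demo.plot(kind='scatter', x='beer', y='wine', alpha=0.3, ax=ax1)

# Right: color each point by a third column
drinks_demo.plot(kind='scatter', x='beer', y='wine', c='spirit', colormap='Blues', ax=ax2)
```

Color intensity works best when the third variable is roughly continuous; for a categorical third variable, distinct hues are usually easier to read.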


##### Looking at Multiple Variables Pairwise

In [None]:
# Pandas calls it scatter matrix

In [None]:
# Seaborn calls it pair plot

#### Bar Charts

In [None]:
# Univariate, count by categorical variable  -- beer, wine, spirit by continent


In [None]:
# Bar chart is an effective way of showing this breakdown (kind=bar and kind=barh)


In [None]:
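# Calculate the mean alcohol amounts for each continent.
# numeric_only=True skips the text column (country), which newer pandas versions refuse to average.
drinks.groupby('continent').mean(numeric_only=True)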

In [None]:
# Plot that in a bar chart

In [None]:
# That was sorted by continent -- sort by beer instead


In [None]:
#  The liters column is out of scale, and it is not comparable to the others anyway, 
#  so we need to get rid of it!
#  Notice we make adjustments in the data frame, not by customizing the plot
#   Sort the continent x-axis by a particular column.
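
A sketch of adjusting the data before plotting: drop the columns that should not be charted, average by continent, then sort by beer. The `drinks_demo` frame and its continent codes are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the drinks data
drinks_demo = pd.DataFrame({
    'country':   ['A', 'B', 'C', 'D', 'E', 'F'],
    'beer':      [200, 300, 50, 30, 100, 120],
    'spirit':    [100, 150, 20, 40, 60, 80],
    'wine':      [150, 250, 5, 15, 30, 50],
    'liters':    [8.0, 11.0, 1.0, 1.5, 3.0, 4.0],
    'continent': ['EU', 'EU', 'AS', 'AS', 'SA', 'SA'],
})

means = (
    drinks_demo
    .drop(columns=['country', 'liters'])    # change the data frame, not the plot
    .groupby('continent')
    .mean()
    .sort_values('beer', ascending=False)   # x-axis order follows the row order
)
print(means)
```

Because the bar chart follows the row order of the frame, calling `means.plot(kind='bar')` afterwards would show the continents from highest to lowest mean beer servings.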



In [None]:
# Stacked bar plot (with the liters comparison removed)
# Does this communicate effectively?  When can you and can't you use it?
