# 3. Non-Graphical Univariate Analysis

### Univariate vs Multivariate Analyses
Univariate analysis examines one variable at a time. Multivariate analysis examines two or more variables together. We typically start with univariate analysis because it is simpler and lets us understand each column independently of the others.

### Graphical vs Non-graphical
Each exploratory analysis results in either **graphical** or **non-graphical** output. The two main types of columns are categorical and continuous, and they are analyzed differently. Categorical data is typically counted or used to form groups. Continuous data is often aggregated or used in some other numerical calculation.

The following tables may help you become aware of all the combinations of analysis during an EDA and which procedure to use for each variable combination.

| Univariate  | Graphical                                | Non-Graphical                                                    |
|-------------|------------------------------------------|------------------------------------------------------------------|
| Categorical | Bar chart of frequencies (count/percent) | `value_counts` (count/percent)                                   |
| Continuous  | Histogram/KDE, box/violin                | Central tendency (mean/median/mode), variance, std, skew, IQR    |

| Multivariate               | Graphical                                | Non-Graphical                       |
|----------------------------|------------------------------------------|-------------------------------------|
| Categorical vs Categorical | Heat map, mosaic plot                    | Cross tabulation (count/percent)    |
| Continuous vs Continuous   | All pairwise scatterplots, KDE, heat maps | All pairwise correlation/regression |
| Categorical vs Continuous  | All seaborn "categorical" plots          | Summary statistics for each group   |

## Begin with Univariate Analysis
After you have tidied the data and begun the data dictionary, a reasonable place to start is with univariate analysis.

#### Recreate the data dictionary
Let's recreate the data dictionary that we began earlier.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 120

diamonds = pd.read_csv('../data/diamonds.csv')
diamonds_dictionary = pd.read_csv('../data/diamonds_dictionary.csv', index_col='Column Name')
diamonds_dictionary['Data Type'] = diamonds.dtypes
diamonds_dictionary['Num Unique'] = diamonds.nunique()
c, o, n = 'continuous', 'ordinal', 'nominal'
d = {'carat':    c, 
     'clarity':  o, 
     'color':    o, 
     'cut':      o, 
     'depth':    c, 
     'price':    c, 
     'table':    c, 
     'x':        c, 
     'y':        c, 
     'z':        c}

type_label = pd.Series(d)
diamonds_dictionary['Data Type Info'] = type_label

new_order = ['cut', 'color', 'clarity','carat', 'price', 'x', 'y','z','depth', 'table']
diamonds = diamonds[new_order]
diamonds_dictionary['Missing Values'] = diamonds.isna().sum()
diamonds.head()

In [None]:
diamonds_dictionary

## Interview each column
For smaller datasets, I like to manually examine each variable. This way, I can learn its distribution, discover potential outliers and missing values, and simplify matters by concentrating on only one variable at a time.

## Non-graphical univariate analysis on continuous columns
Continuous columns are always numeric, so many aggregations are available, such as the min, median, mean, max, and standard deviation. The `describe` method returns all of these. By default, it summarizes only numeric columns. You may also pass it the specific percentiles of the distribution you would like to see.

In [None]:
diamonds.describe(percentiles=[.01, .1, .3, .5, .7, .9, .99]).round(1)
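
`describe` does not report every statistic listed for continuous columns in the table at the top of this chapter. The following is a minimal sketch of the remaining ones. It uses a small made-up Series so that it runs on its own. The values and the name `carat_sample` are assumptions for illustration and are not taken from the diamonds data.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
carat_sample = pd.Series([0.3, 0.3, 0.5, 0.7, 1.0, 1.2, 2.5])

# Interquartile range: distance between the 25th and 75th percentiles
q1, q3 = carat_sample.quantile([0.25, 0.75])
iqr = q3 - q1

summary = pd.Series({
    'mode': carat_sample.mode().iloc[0],  # mode() returns a Series; take the first value
    'variance': carat_sample.var(),
    'skew': carat_sample.skew(),
    'IQR': iqr,
})
print(summary.round(3))
```

On the real data, the same methods can be called directly on a column such as `diamonds['carat']`.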

## Non-graphical univariate analysis on the categorical variables
The frequency of occurrence of each value, by raw count and percentage, is usually the first (and many times the only) exploratory step taken when doing univariate categorical analysis. The **`value_counts`** Series method will be useful here.

In [None]:
diamonds['cut'].value_counts()

In [None]:
diamonds['color'].value_counts()

In [None]:
diamonds['clarity'].value_counts()

In [None]:
# use normalize=True to get proportions (multiply by 100 for percentages)
diamonds['cut'].value_counts(normalize=True).round(2)
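
It can be handy to see the counts and percentages side by side. Here is one possible way to build such a table, sketched on a small made-up Series (the values and the names `cut_sample` and `freq_table` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample values, for illustration only
cut_sample = pd.Series(['Ideal', 'Premium', 'Ideal', 'Good',
                        'Ideal', 'Premium', 'Fair', 'Ideal'])

counts = cut_sample.value_counts()
# normalize=True returns proportions; multiply by 100 for percentages
percents = cut_sample.value_counts(normalize=True).mul(100).round(1)

# Both Series share the same index, so they align row by row
freq_table = pd.DataFrame({'count': counts, 'percent': percents})
print(freq_table)
```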

### Changing `object` to `category`
Let's change the columns that are actually categorical to the pandas `category` data type. Changing a column to type **`category`** does several things.
* It saves memory by encoding each category as a numerical value. 
* Sorting is possible by the category order (if given). 
* The **`.cat`** accessor makes many more methods available.
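
The first and third points above can be checked with a short sketch. It uses a made-up column of repeated strings (the name `colors` and its values are assumptions for illustration). The exact byte counts depend on your pandas version, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column with a few repeated string values
colors = pd.Series(['E', 'I', 'J', 'H', 'F', 'G', 'D'] * 1000)
colors_cat = colors.astype('category')

# deep=True includes the memory used by the strings themselves
string_bytes = colors.memory_usage(deep=True)
category_bytes = colors_cat.memory_usage(deep=True)
print(string_bytes, category_bytes)

# The .cat accessor exposes the categories and their integer codes
print(colors_cat.cat.categories)
print(colors_cat.cat.codes.head())
```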

### Use `pd.Categorical`

Ordinal variables can be given their ordering through the **`categories`** parameter with **`ordered`** set equal to **`True`**.

In [None]:
order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], ordered=True, categories=order)

In [None]:
# notice that the data type is now category and the categories are ordered
diamonds['cut'].head()
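
Once a column is an ordered categorical, comparisons, `min`/`max`, and sorting follow the category order rather than alphabetical order. A minimal sketch on a made-up Series (the values and the name `cut_sample` are assumptions for illustration):

```python
import pandas as pd

order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
# Hypothetical sample values, for illustration only
cut_sample = pd.Series(pd.Categorical(
    ['Ideal', 'Good', 'Premium', 'Fair', 'Very Good'],
    ordered=True, categories=order))

# Comparisons use the category order: Premium and Ideal qualify
at_least_premium = cut_sample >= 'Premium'
print(at_least_premium.tolist())

# min and max are the lowest and highest categories present
print(cut_sample.min(), cut_sample.max())

# Sorting also follows the category order
sorted_cuts = cut_sample.sort_values().tolist()
print(sorted_cuts)
```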

### Convert color and clarity to category

In [None]:
order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
diamonds['color'] = pd.Categorical(diamonds['color'], ordered=True, categories=order)

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
diamonds['clarity'] = pd.Categorical(diamonds['clarity'], ordered=True, categories=order)

### Verify data types
Let's verify that we have converted these three columns to type category.

In [None]:
diamonds.dtypes

### Nominal categorical variables
There is no need to specify the `ordered` or `categories` parameters for nominal variables.
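
The diamonds data has no nominal columns, so here is a sketch with a made-up one (the `shape_sample` name and its values are assumptions for illustration). Calling `astype('category')` is enough. The categories are inferred from the data and left unordered.

```python
import pandas as pd

# Hypothetical nominal column; the diamonds data used here has none
shape_sample = pd.Series(['round', 'oval', 'round', 'pear', 'oval', 'round'])
shape_cat = shape_sample.astype('category')

print(shape_cat.dtype)
print(shape_cat.cat.ordered)              # False: no order was given
print(shape_cat.cat.categories.tolist())  # inferred from the unique values
```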

### Sort based on the category order
The **`value_counts`** method works the same as before the conversion by showing the frequencies in descending order.

In [None]:
diamonds['color'].value_counts()

Chaining the **`sort_index`** method sorts by the given categorical order and not the alphabetical order.

In [None]:
diamonds['color'].value_counts().sort_index()

In [None]:
# proportions, sorted by category order
diamonds['color'].value_counts(normalize=True).round(2).sort_index()

# Exercise
Complete these steps on your own dataset.