# Univariate Exploration of Data

When you start exploring data, begin with univariate exploration, i.e., looking at one variable at a time. This gives you a sense of how each variable is distributed before you explore relationships between variables.

If we find data with outliers or missing data, these may become focus areas for further cleaning or further inspection. These steps may be incorporated into the data wrangling process.

To explore our data we will use various plot types and techniques.
* Plot Types
    * Histograms
    * Bar Charts
    * Pie Charts & Donut Plots
* Techniques
    * Absolute vs Relative Frequency
    * Axis Limits
    * Scales and Transformations

NOTE: This notebook and the examples use `matplotlib` and `seaborn`. Kaggle also hosts a page for [univariate plotting with pandas](https://www.kaggle.com/code/residentmario/univariate-plotting-with-pandas/notebook).

### Tidy Data

A [tidy dataset](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) is a tabular (i.e., structured) dataset in which
* each variable is a column,
* each observation is a row, and
* each type of observational unit is a table.

Often you will need to tidy your data before exploring it, so you should become comfortable with *[data-wrangling](https://en.wikipedia.org/wiki/Data_wrangling)*. Two essential *data-wrangling* tasks are reshaping the data and transforming features by splitting or combining them. This [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) covers many of the functions useful for data-wrangling.

##### Counting Missing Data

Data will often have missing values (e.g., `None`, `np.nan`, or `pd.NaT`). Pandas provides [pandas.DataFrame.isna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) and its alias [pandas.DataFrame.isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html#pandas.DataFrame.isnull) to help identify missing values. Both return an object of the same shape as the calling dataframe, filled with boolean `True` and `False` values, where `True` marks a missing value.
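
As a minimal sketch of how the boolean result can be turned into counts, the example below sums the `True` values per column. The dataframe and its column names are made up for illustration and are not taken from the linked notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are invented for this sketch.
df = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.5, 165.0],
    "city": ["Oslo", "Lima", None, None],
    "joined": pd.to_datetime(["2021-01-05", None, "2021-03-09", "2021-04-01"]),
})

# isna() returns a boolean frame with the same shape as df.
is_missing = df.isna()

# True counts as 1 when summed, so this gives the missing count per column.
missing_counts = is_missing.sum()
print(missing_counts)
```

The resulting series can itself be plotted as a bar chart to compare how much data is missing in each column.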

See [counting-missing-data.ipynb](examples\Univariate%20Exploration\Bar%20Charts\counting-missing-data.ipynb) for an example.

### Bar Charts

A bar chart depicts the distribution of a categorical variable. Each level of the categorical variable is depicted with a bar whose height indicates the frequency of data points that take on that level.

##### Tips

* Sort nominal data in decreasing order of frequency (i.e., tallest bar on the left and the shortest on the right)
    * NOTE: This is not generally recommended for ordinal data as the inherent ordering of the levels will usually be a more important feature to convey
        * Example: Survey responses that range from "Strongly Disagree" to "Strongly Agree" are usually better depicted in level order rather than in order of response frequency.
* Horizontal bar charts can be useful when there are a lot of categories or the category names are long.
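
The tips above could be sketched as follows, assuming plain `matplotlib` and `pandas`. The data are invented for illustration; seaborn's `countplot` with its `order` argument is another way to get the same orderings.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical nominal data.
pets = pd.Series(["dog", "cat", "dog", "fish", "cat", "dog"])

# value_counts() sorts in decreasing order of frequency by default,
# so the tallest bar ends up on the left.
nominal_counts = pets.value_counts()

fig, (ax_nominal, ax_ordinal) = plt.subplots(1, 2, figsize=(10, 4))
ax_nominal.bar(nominal_counts.index.tolist(), nominal_counts.to_numpy())
ax_nominal.set_ylabel("count")

# Hypothetical ordinal data: keep the inherent level order instead.
levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
responses = pd.Series(["Agree", "Neutral", "Agree", "Strongly Agree", "Disagree"])
ordinal_counts = responses.value_counts().reindex(levels, fill_value=0)

# Long category names read more easily on a horizontal bar chart.
ax_ordinal.barh(ordinal_counts.index.tolist(), ordinal_counts.to_numpy())
ax_ordinal.set_xlabel("count")
fig.tight_layout()
```

Note that `barh` draws the first level at the bottom, so "Strongly Disagree" appears lowest on the axis in this sketch.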

See [bar-charts.ipynb](examples\Univariate%20Exploration\Bar%20Charts\bar-charts.ipynb) for examples of bar charts.

##### Absolute vs Relative Frequency

In certain cases, you might want to understand the distribution of the data, or compare levels, in terms of proportions of the whole. In this case, you will want to plot the data in terms of **relative frequency**, where the height of a bar indicates the proportion of data taking each level rather than the absolute count.
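
One way to compute relative frequencies, sketched with invented data, is `value_counts(normalize=True)`, which divides each count by the total so the proportions sum to 1.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical data.
pets = pd.Series(["dog", "cat", "dog", "fish", "cat", "dog", "dog", "cat"])

abs_counts = pets.value_counts()               # absolute counts
rel_freq = pets.value_counts(normalize=True)   # proportions of the whole

fig, ax = plt.subplots()
ax.bar(rel_freq.index.tolist(), rel_freq.to_numpy())
ax.set_ylabel("proportion")
```

The bars have the same relative heights as in the absolute-count version; only the y-axis changes from counts to proportions.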

See [absolute-vs-relative-frequency.ipynb](examples\Univariate%20Exploration\Bar%20Charts\absolute-vs-relative-frequency.ipynb) for examples of bar charts plotted with absolute and relative frequencies.

### Pie Charts

A pie chart is a univariate plot used to depict relative frequencies for levels of categorical variables.

##### Tips

* Ensure your interest is in *relative frequencies*.
    * Areas should represent parts of a whole rather than measurements on a second variable.
* Limit the number of slices plotted.
    * A pie chart works best with two or three slices whose areas are easily distinguishable.
    * If you have a lot of categories or categories with a small proportional representation, consider grouping them together so that fewer wedges are plotted (e.g., using an "Other" category).
* Plot the data systematically.
    * Start from the top of the circle while plotting each additional categorical level clockwise from most frequent to least frequent.
    * If you have three categories and are interested in the comparison of two of them, a common plotting method is to place the two categories of interest on either side of the 12 o'clock direction with the third category filling in the remaining space at the bottom.
* If the previous three guidelines cannot be met, then you should probably use a bar chart instead.
    * The heights are more easily perceived and interpreted than areas or angles.
    * A bar chart can be displayed more compactly than a pie chart.
    * A bar chart has greater flexibility for plotting variables with a lot of levels, e.g., plotting the bars horizontally.
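
The "start at the top and go clockwise" guideline could be sketched with `matplotlib` as below. The category counts are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts, sorted from most to least frequent.
counts = pd.Series({"B": 30, "A": 50, "C": 20}).sort_values(ascending=False)

fig, ax = plt.subplots()
ax.pie(
    counts.to_numpy(),
    labels=counts.index.tolist(),
    startangle=90,        # begin at the 12 o'clock position
    counterclock=False,   # add wedges clockwise
)
ax.axis("equal")          # keep the pie circular
```

Passing `wedgeprops={"width": 0.4}` to the same call removes the center of the circle and turns the pie into a donut plot.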

See [pie-chart.ipynb](examples\Univariate%20Exploration\Pie%20Charts\pie-chart.ipynb) for examples of pie charts and donut plots.

##### Further Reading

* Eager Eyes: [Understanding Pie Charts](https://eagereyes.org/pie-charts)
* Eager Eyes: [An Illustrated Tour of the Pie Chart Study Results - how accurately do people perceive different formulations of the pie chart?](https://eagereyes.org/blog/2016/an-illustrated-tour-of-the-pie-chart-study-results)
* Datawrapper: [What to Consider when Creating a Pie Chart](https://academy.datawrapper.de/article/127-what-to-consider-when-creating-a-pie-chart)


### Waffle Plot

A waffle plot is one more type of univariate plot that can be used for categorical data. It is sometimes known as a square pie chart.

Whereas a standard pie chart uses a circle to represent the whole, a waffle plot is plotted onto a square divided into a 10x10 grid. Each small square in the grid represents one percent of the data and is colored by category to indicate the total proportions. Compared to a pie chart, a waffle plot makes it much easier to assess relative frequencies precisely.

A related design appears in infographics, where each icon often represents some number of units (e.g., one person icon representing one million people). One downside of the waffle plot is that most visualization libraries, including Matplotlib and Seaborn, do not support it out of the box. The effort required to create a meaningful and useful waffle plot means that it is best employed carefully as a part of explanatory visualizations.
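
Because there is no built-in waffle function, one possible approach is to allocate the 100 squares yourself and draw the grid as an image. The sketch below is an assumption about how this could be done, not the method used in the linked notebook; `waffle_counts` is a hypothetical helper that uses largest-remainder rounding so the squares always total 100.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def waffle_counts(counts, n_squares=100):
    """Allocate n_squares across categories in proportion to counts."""
    exact = counts / counts.sum() * n_squares
    squares = np.floor(exact).astype(int)
    shortfall = int(n_squares - squares.sum())
    # Hand the leftover squares to the largest fractional remainders.
    leftovers = (exact - squares).sort_values(ascending=False)
    for level in leftovers.index[:shortfall]:
        squares.loc[level] += 1
    return squares

# Hypothetical category counts.
counts = pd.Series({"A": 5, "B": 3, "C": 1})
squares = waffle_counts(counts)

# Build a 10x10 grid of category codes (0 for "A", 1 for "B", ...).
grid = np.repeat(np.arange(len(squares)), squares.to_numpy()).reshape(10, 10)

fig, ax = plt.subplots()
ax.imshow(grid, cmap="tab10")
ax.set_xticks([])
ax.set_yticks([])
```

A legend mapping colors to category names would still need to be added by hand in this sketch.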

See [waffle-plots.ipynb](examples\Univariate%20Exploration\waffle-plots.ipynb) for an example.

### Histograms

A histogram is used to plot the distribution of a numeric variable. It is the quantitative version of the bar chart. However, rather than plotting one bar for each unique numeric value, values are grouped into contiguous bins, and one bar is plotted for each bin to depict its frequency.

##### Tips

* Use multiple bin sizes before deciding on a final value.
    * Too few or too many bins can hide the shape of the distribution.
* Normally bins include values on their left end and exclude values on their right end.
    * With `matplotlib` it may make more sense to use `np.arange` to define your range than bin counts as the bin divisions may not be whole numbers in the latter case.
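
As a sketch of the `np.arange` tip, the example below builds bin edges that fall on whole-number multiples of a chosen bin width. The data are synthetic and the bin width of 5 is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)  # synthetic data

bin_width = 5
# Start at the multiple of bin_width at or below the minimum, and extend
# far enough that the last edge is at or above the maximum.
start = np.floor(values.min() / bin_width) * bin_width
bins = np.arange(start, values.max() + bin_width, bin_width)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=bins)
```

Rerunning this with different values of `bin_width` is a quick way to try several bin sizes before settling on one.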

### Choosing a Plot for Discrete Data

If you want to plot a **discrete quantitative variable**, it is possible to select either a histogram or a bar chart to depict the data.

* **Discrete** means non-continuous. A discrete variable can only take values from a limited (countable) set within a given range.
* The **quantitative** term shows that it is the outcome of the measurement of a quantity.

The histogram is the most immediate choice since the data is numeric, but there is one particular consideration to make regarding the bin edges. Since the data points fall on a fixed set of values, a bin edge that coincides with one of those values makes it unclear which bar the value belongs to. Putting the bin edges between the values actually taken by the data reduces this ambiguity.
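
A minimal sketch of this idea, using invented data (the sum of two simulated die rolls), places the bin edges at half-integers so that each integer value sits in the middle of its own bin.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Hypothetical discrete data: sum of two six-sided die rolls (values 2..12).
rolls = rng.integers(1, 7, size=200) + rng.integers(1, 7, size=200)

# Edges at 1.5, 2.5, ..., 12.5 fall between the possible values.
bin_edges = np.arange(1.5, 12.5 + 1, 1)

fig, ax = plt.subplots()
# rwidth < 1 leaves gaps between bars, hinting that the data are discrete.
counts, _, _ = ax.hist(rolls, bins=bin_edges, rwidth=0.7)
ax.set_xticks(np.arange(2, 13))
```

With the gaps added by `rwidth`, the result looks much like a bar chart, which is the other reasonable choice for this kind of variable.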

See [discrete-variables.ipynb](examples\Univariate%20Exploration\Histograms\discrete-variables.ipynb) for an example.

### Descriptive Statistics, Outliers, and Axis Limits

When exploring data, it is often necessary to plot it, as plots can give you an intuition about the data beyond what basic descriptive statistics provide. You should note aspects of the data like the number of modes, the skew, and the presence of any outliers that require further investigation.

![distribution-shapes.png](attachment:distribution-shapes.png)

Prior to investigating outliers or making a determination as to how best to handle them, altering the limits or scale of a plot can be one way to take a closer look at the underlying patterns absent those points.
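
Changing the axis limits could look like the sketch below, which plots the same synthetic data twice: once in full and once zoomed in on the bulk of the distribution. The data and the limits of 0 to 40 are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Synthetic data with three artificial outliers appended.
data = np.concatenate([rng.normal(20, 5, size=300), [150, 180, 210]])

bins = np.arange(np.floor(data.min()), data.max() + 2, 2)

fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))
ax_full.hist(data, bins=bins)   # outliers stretch the x-axis
ax_zoom.hist(data, bins=bins)
ax_zoom.set_xlim(0, 40)         # only the view changes; no data are dropped
```

Because `set_xlim` only changes the visible range, the outliers remain in `data` and still need to be investigated separately.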

See [outliers-and-axis-limits.ipynb](examples\Univariate%20Exploration\Histograms\outliers-and-axis-limits.ipynb) for examples.

### Scales and Transformations

Certain data distributions are amenable to scale transformations. The most common example is data that follows an approximately [log-normal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution). Such data, in their natural units, can look highly skewed, e.g., lots of points with low values and a very long tail of points with large values. However, after a logarithmic transformation is applied, the data will follow an approximately normal distribution.

When we perform a logarithmic transformation, our data values must all be positive, as $\log$ values are undefined for zero and negative numbers. Additionally, the transform implies that additive steps on the $\log$ scale correspond to multiplicative changes on the natural/original scale. Conversely, large multiplicative differences on the original scale become small additive differences on the $\log$ scale. For example, with base 2, each doubling of the original value adds 1 on the $\log$ scale:

$$\begin{aligned}
\log_2(1) &= 0\\
\log_2(2) &= 1\\
\log_2(4) &= 2\\
\log_2(8) &= 3\\
&\;\,\vdots\\
\log_2(128) &= 7\\
&\;\,\vdots\\
\log_2(1{,}024) &= 10\\
&\;\,\vdots\\
\log_2(1{,}048{,}576) &= 20\\
&\;\,\vdots\\
\log_2(1{,}073{,}741{,}824) &= 30
\end{aligned}$$
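
A short sketch of a log-scaled histogram is shown below. It uses synthetic log-normal data, and the bin step of 0.1 in $\log_{10}$ units is an arbitrary choice. The bin edges are evenly spaced on the log scale so that the bars have equal visual width once the axis is log-scaled.

```python
import numpy as np
import matplotlib.pyplot as plt

# Doubling the input adds 1 on the log2 scale.
doubling_steps = np.log2([1, 2, 4, 8, 1024])

rng = np.random.default_rng(3)
incomes = rng.lognormal(mean=10, sigma=1, size=1000)  # synthetic, all positive

# Bin edges evenly spaced in log10 units, converted back to original units.
log_step = 0.1
log_edges = 10 ** np.arange(
    np.floor(np.log10(incomes.min())),
    np.log10(incomes.max()) + log_step,
    log_step,
)

fig, ax = plt.subplots()
ax.hist(incomes, bins=log_edges)
ax.set_xscale("log")   # axis ticks stay labeled in the original units
```

On the log-scaled axis, the long right tail is compressed and the histogram should look roughly bell-shaped.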

The type of transformation that you choose may be informed by the context for the data. For example, this [Wikipedia section](https://en.wikipedia.org/wiki/Log-normal_distribution#Occurrence_and_applications) provides a few examples of places where log-normal distributions have been observed.

If you want to use a transformation that is not available through `pyplot.xscale()` or `pyplot.yscale()`, then you will need to perform feature engineering and transform the data yourself. To ensure we can apply a transformation and easily return to the original units, we should always also implement the transformation's inverse function.
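
One way to pair a transformation with its inverse is a single helper with an `inverse` flag, sketched below for a square-root transform. The helper name `sqrt_trans`, the synthetic data, and the bin and tick spacing are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def sqrt_trans(x, inverse=False):
    """Square-root transform; with inverse=True, map back to original units."""
    x = np.asarray(x, dtype=float)
    return x ** 2 if inverse else np.sqrt(x)

rng = np.random.default_rng(4)
data = rng.exponential(scale=25, size=500)  # synthetic, non-negative

# Plot the transformed values with evenly spaced bins on the sqrt scale.
transformed = sqrt_trans(data)
bin_edges = np.arange(0, transformed.max() + 0.5, 0.5)

fig, ax = plt.subplots()
ax.hist(transformed, bins=bin_edges)

# Place ticks on the sqrt scale, but label them in the original units.
tick_locs = np.arange(0, transformed.max() + 1, 1)
labels = [f"{v:g}" for v in sqrt_trans(tick_locs, inverse=True)]
ax.set_xticks(tick_locs)
ax.set_xticklabels(labels)
```

The inverse function is what lets the tick labels read 0, 1, 4, 9, and so on in the original units even though the bars are positioned on the transformed scale.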

See [scales-transformations.ipynb](examples\Univariate%20Exploration\scales-transformations.ipynb) for examples.