# Basic graphics for a single column of data

# Document information

<table align="left">
    <tr>
        <th class="text-align:left">Title</th>
        <td class="text-align:left">Basic graphics for a single-column data file</td>
    </tr>
    <tr>
        <th class="text-align:left">Last modified</th>
        <td class="text-align:left">2020-10-12</td>
    </tr>
    <tr>
        <th class="text-align:left">Author</th>
        <td class="text-align:left">Gilles Pilon &lt;gillespilon13@gmail.com&gt;</td>
    </tr>
    <tr>
        <th class="text-align:left">Status</th>
        <td class="text-align:left">Active</td>
    </tr>
    <tr>
        <th class="text-align:left">Type</th>
        <td class="text-align:left">Jupyter notebook</td>
    </tr>
    <tr>
        <th class="text-align:left">Created</th>
        <td class="text-align:left">2018-06-22</td>
    </tr>
    <tr>
        <th class="text-align:left">File name</th>
        <td class="text-align:left">basic_graphics_single_column.ipynb</td>
    </tr>
    <tr>
        <th class="text-align:left">Other files required</th>
        <td class="text-align:left">basic_graphics_single_column_data.csv</td>
    </tr>
</table>

# In brevi

The purpose of this Jupyter notebook is to illustrate a basic notebook structure. The script reads a single-column data file and draws basic graphics. Each graph progresses from quick-and-dirty code to a pretty, presentation-ready version.

# Methodology

A box plot is drawn using pandas.Series.plot.box. A histogram is drawn using pandas.Series.plot.hist. A scatter plot is drawn using pandas.DataFrame.plot.scatter. A regression line is estimated for the scatter plot using statsmodels.formula.api.ols. A stem-and-leaf plot is drawn using stemgraphic.stem_graphic. A run chart is drawn using pandas.Series.plot.

# Data

Download the data file:

- [basic_graphics_single_column_data.csv](https://drive.google.com/open?id=1N-5611OldWLD2hQwTDPOc1B0O_xxVoYO)

# Import libraries and basic setup

In [2]:
from stemgraphic import stem_graphic as stg
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
# matplotlib.axes is needed for the type annotation of despine().
import matplotlib.axes as axes
import pandas as pd
import numpy as np


In [None]:
%matplotlib inline
colour1 = '#0077bb'
colour2 = '#33bbee'

In [None]:
def despine(ax: axes.Axes) -> None:
    """
    Remove the top and right spines of a graph.

    Parameters
    ----------
    ax : axes.Axes
        The axes whose top and right spines are hidden.

    Example
    -------
    >>> despine(ax)
    """
    for spine in 'right', 'top':
        ax.spines[spine].set_visible(False)

# Read the data file

The data file has one column labelled "y". It is saved in CSV format with UTF-8 encoding.

In [None]:
# Read the data file and save it to a dataframe called "df".
df = pd.read_csv('basic_graphics_single_column_data.csv')

In [None]:
# View the first five rows of data.
df.head(5)

In [None]:
# View the last five rows of data.
df.tail(5)
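
Before plotting, it can help to confirm that the file really contains a single numeric column labelled "y" with no missing values. The following is a minimal sketch of such a check. It uses a small in-memory stand-in for the CSV file, and the values are hypothetical, not taken from the data file. For a file on disk, `encoding='utf-8'` can be passed to `pd.read_csv` explicitly.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: one column labelled "y".
csv_text = "y\n32.1\n30.5\n31.8\n29.9\n"

# An in-memory text buffer is already decoded, so no encoding is needed here.
sample = pd.read_csv(io.StringIO(csv_text))

# Basic sanity checks: column names, data type, and missing values.
column_names = list(sample.columns)
n_missing = int(sample['y'].isna().sum())
print(column_names, sample['y'].dtype, sample.shape, n_missing)
```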

# Perform the analysis

## Box plot

In [None]:
# Create a box plot.
df['y'].plot.box()

In [None]:
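# With the default settings, the horizontal lines, from top to bottom, are:
# - Upper whisker: largest value within 1.5 x IQR above the third quartile
#   (the maximum when there are no outliers)
# - Third quartile
# - Median
# - First quartile
# - Lower whisker: smallest value within 1.5 x IQR below the first quartile
#   (the minimum when there are no outliers)
# Values beyond the whiskers are drawn as individual points.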
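
With matplotlib's default setting (`whis=1.5`), the whiskers end at the most extreme observations lying within 1.5 times the interquartile range of the quartiles, so they coincide with the minimum and maximum only when no point falls outside those fences. The sketch below computes the whisker ends by hand for a small hypothetical sample. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
lower_whisker = values[values >= lower_fence].min()
upper_whisker = values[values <= upper_fence].max()

# Observations outside the fences are drawn as individual points.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_whisker, upper_whisker, outliers)
```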

In [None]:
# Add the 95 % confidence interval of the median.
df['y'].plot.box(notch=True)

In [None]:
# Add the average.
df['y'].plot.box(notch=True, showmeans=True)

In [None]:
# Print the statistics of the box plot.
print('Maximum',
      df['y'].max(),
      sep=" = ")
print('Third quartile',
      df['y'].quantile(.75),
      sep=" = ")
print('Median',
      df['y'].median(),
      sep=" = ")
print('Average',
      df['y'].mean(),
      sep=" = ")
print('First quartile',
      df['y'].quantile(.25),
      sep=" = ")
print('Minimum',
      df['y'].min(),
      sep=" = ")
# Notch limits: median +/- 1.57 * IQR / sqrt(n).
iqr = df['y'].quantile(.75) - df['y'].quantile(.25)
notch_half_width = 1.57 * iqr / np.sqrt(df['y'].count())
print('Upper confidence value',
      df['y'].median() + notch_half_width,
      sep=" = ")
print('Lower confidence value',
      df['y'].median() - notch_half_width,
      sep=" = ")
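
The confidence values printed above follow the notch formula, median plus or minus 1.57 times the interquartile range divided by the square root of the sample size. As a minimal sketch, the same calculation can be wrapped in a small helper. The function name `notch_limits` and the sample data are hypothetical, and the helper assumes there are no missing values.

```python
import numpy as np


def notch_limits(values):
    """Return (lower, upper) notch limits: median -/+ 1.57 * IQR / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(values.size)
    return median - half_width, median + half_width


# Hypothetical data: the integers 1 to 16.
lower, upper = notch_limits(np.arange(1, 17))
print(lower, upper)
```

If the notches of two box plots do not overlap, this is usually read as evidence that the two medians differ.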

In [None]:
df['y'].max()

In [None]:
# Make a pretty graph.
title = 'Box plot of y'
subtitle = 'Pretty graph'
yaxislabel = 'Response y (units)'
ax = df['y'].plot.box(notch=True, showmeans=True)
ax.set_title(title + '\n' + subtitle, fontweight='bold', fontsize=13)
ax.set_ylabel(yaxislabel)
despine(ax)

## Histogram

In [None]:
# Create a histogram.
df['y'].plot.hist()

In [None]:
# Change the number of bins.
df['y'].plot.hist(bins=15)
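
The `bins` argument sets the number of equal-width intervals across the range of the data; the default is 10. To see the counts and bin edges behind a histogram without drawing it, `numpy.histogram` can be used. This is a sketch on hypothetical data, not the notebook's data file.

```python
import numpy as np

# Hypothetical data: the integers 0 to 29.
values = np.arange(30)

# Ten equal-width bins and fifteen bins over the same range.
counts_10, edges_10 = np.histogram(values, bins=10)
counts_15, edges_15 = np.histogram(values, bins=15)

# There is always one more edge than there are bins.
print(len(counts_10), len(edges_10))
print(len(counts_15), len(edges_15))
```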

In [None]:
# Make a pretty graph.
title = 'Histogram of y'
subtitle = 'Pretty graph'
yaxislabel = 'Count'
xaxislabel = 'Response y (units)'
ax = df['y'].plot.hist(edgecolor='white', linewidth=.5, color=colour1)
ax.set_title(title + '\n' + subtitle, fontweight='bold', fontsize=13)
ax.set_ylabel(yaxislabel)
ax.set_xlabel(xaxislabel)
for spine in 'right', 'top':
    ax.spines[spine].set_visible(False)

## Scatter plot

In [None]:
# Create a scatter plot, assuming data are in time order.
df['x'] = df.index
df.plot.scatter(x='x', y='y')

In [None]:
# Perform a linear regression.
lm = smf.ols(formula='y ~ x', data=df).fit()
lm.params

In [None]:
# Create a dataframe with the minimum and maximum values of 'x'.
x_min_max = pd.DataFrame({'x': [df['x'].min(), df['x'].max()]})
x_min_max

In [None]:
# Estimate 'y' for these 'x' values and store them in 'y_estimates'.
y_estimates = lm.predict(x_min_max)
y_estimates
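
A straight line is fully determined by two points, which is why predictions at the minimum and maximum of 'x' are enough to draw it. As an independent cross-check of the fitted intercept and slope, the same least-squares line can be obtained with `numpy.polyfit` instead of statsmodels. The sketch below uses hypothetical data that lie exactly on a known line.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2 + 0.5 x.
x = np.arange(10, dtype=float)
y = 2.0 + 0.5 * x

# polyfit returns coefficients from the highest degree down:
# slope first, then intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Predicted values at the two ends of the x range define the fitted line.
x_ends = np.array([x.min(), x.max()])
y_ends = intercept + slope * x_ends
print(intercept, slope, y_ends)
```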

In [None]:
# Plot the linear regression line. Clean it up.
title = 'Scatter plot of y'
subtitle = 'Pretty graph'
yaxislabel = 'Response y (units)'
xaxislabel = 'Order of sampling'
ax = plt.subplot(111)
df.plot.scatter(x='x', y='y', color=colour1, s=3, ax=ax)
ax.plot(x_min_max['x'], y_estimates, color=colour2)
ax.set_title(title + '\n' + subtitle, fontweight='bold', fontsize=13)
ax.set_ylabel(yaxislabel)
ax.set_xlabel(xaxislabel)
despine(ax)

## Stem-and-leaf plot

In [None]:
# Create a stem-and-leaf plot.
stg(df['y'])
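
To show what a stem-and-leaf display encodes, here is a plain-text sketch written without stemgraphic. It is not how stemgraphic works internally. The helper `stem_and_leaf` is hypothetical, handles non-negative integers only, and splits each value into a stem (all digits but the last) and a leaf (the last digit).

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return text rows of a stem-and-leaf display for non-negative integers."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    rows = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = ''.join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f'{stem:>2} | {leaves}')
    return rows


# Hypothetical data.
for row in stem_and_leaf([12, 15, 21, 23, 23, 38, 40]):
    print(row)
```

Each row lists one stem followed by the sorted leaves that share it, so the display shows the shape of the distribution while retaining the individual values.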

## Run chart

In [None]:
# Plot a run chart, using the index for x.
df['y'].plot()

In [None]:
# Add markers that are circles and of a certain size.
df['y'].plot(marker='o', markersize=5)

In [None]:
# Clean it up. Add a horizontal line for the average.
title = 'Run chart of y'
subtitle = 'Pretty graph'
yaxislabel = 'Response y (units)'
xaxislabel = 'Order of sampling'
ax = df['y'].plot(marker='o', markersize=5, color=colour1, legend=False)
ax.axhline(y=df['y'].mean(), color=colour2)
ax.set_title(title + '\n' + subtitle, fontweight='bold', fontsize=13)
ax.set_ylabel(yaxislabel)
ax.set_xlabel(xaxislabel)
despine(ax)

# References

- [matplotlib](https://matplotlib.org/api/pyplot_api.html)
- [numpy](https://docs.scipy.org/doc/numpy/reference/)
- [pandas API](https://pandas.pydata.org/pandas-docs/stable/api.html)
- [statsmodels](http://www.statsmodels.org/dev/example_formulas.html)
- [stemgraphic](http://stemgraphic.org/doc/index.html)

McGill, Robert, John W. Tukey, and Wayne A. Larsen. 1978. "Variations of Box Plots." *The American Statistician* 32 (1): 12-16. [https://www.jstor.org/stable/2683468](https://www.jstor.org/stable/2683468)