# What are data?

## Quick and dirty definitions

"Data" is Latin for "what is given" but this couldn't be further from the truth. Data is collected, in contrast to its etymological origins: It is something that is gathered, collected, synthesized, cleaned, and transformed and the most robust (data) science accounts for all of these aspects in the modeling processes.

For example, consider COVID-19 case count numbers. At many different periods of the pandemic, case count numbers were considered a proxy for prevalence in the population. But even thinking about the data collection process for a second or two, we realize that the case count number could, and often would, fluctuate as a function of the number of tests administered on any given day.

To quote [Nate Silver](https://twitter.com/NateSilver538/status/1241113755016138755?ref_src=twsrc%5Etfw),

> Washington State is a good example of the importance of accounting for the number of tests when reporting COVID-19 case counts. Remember I mentioned a couple of days ago how their number of cases in WA had begun to stabilize? Well, guess what happened… Today, they reported 189 positives, along with 175 yesterday, as compared with an average of 106 positives per day in the 7 days before that. So, not great on the surface… new cases increased by 70%! But you also have to look at the number of tests. Washington conducted 3,607 tests today and 2,976 yesterday. By comparison, they’d conducted an average of 1,670 tests in the 7 days before that. So they’ve increased testing capacity by 97% over their baseline. Meanwhile, detected cases have increased, but by “only” 70%. Looked at another way: Today, 5.2% of Washington’s tests came up with a positive result. Yesterday, 5.9% did. In the 7 days before that, 6.4% of them did. So, there *is* a bit of progress after all. Their number of new positives *as a share of new tests* is slightly declining. For the time being, 1) the large (perhaps very large) majority of coronavirus positives are undetected and 2) test capacity is ramping up at extremely fast rates, far faster than coronavirus itself would spread even under worst-case assumptions. So long as those two things hold, the rate of increase in the number of *detected* cases is primarily a function of the rate of increase in the number of *tests* and does not tell us that much about how fast the actual *infection* is spreading.

Silver then wrote [an article for FiveThirtyEight](https://fivethirtyeight.com/features/coronavirus-case-counts-are-meaningless/) called “Coronavirus Case Counts Are Meaningless” with a subtitle “Unless you know something about testing. And even then, it gets complicated.”

This is one of many examples where considering the data-generating process is key to real-world inference and robust decision-making.
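The arithmetic in Silver's thread can be checked in a few lines. The sketch below uses only the figures quoted above; the dictionary and variable names are our own, and small differences from the quoted percentages are to be expected because we work from the rounded averages.

```python
# Figures quoted above for Washington State
positives = {"today": 189, "yesterday": 175, "prior_7_day_avg": 106}
tests = {"today": 3607, "yesterday": 2976, "prior_7_day_avg": 1670}

# Share of tests that came back positive in each period
positivity = {period: positives[period] / tests[period] for period in positives}
for period, rate in positivity.items():
    print(f"{period}: {rate:.1%} of tests positive")

# Growth of the two most recent days relative to the prior 7-day average
recent_positives = (positives["today"] + positives["yesterday"]) / 2
recent_tests = (tests["today"] + tests["yesterday"]) / 2
case_growth = recent_positives / positives["prior_7_day_avg"] - 1
test_growth = recent_tests / tests["prior_7_day_avg"] - 1
print(f"detected cases grew by {case_growth:.0%}, tests by {test_growth:.0%}")
```

Raw positives rise while the share of positive tests falls, which is exactly the distinction the quote draws between detected cases and the underlying spread.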

Before statistical modeling, inference, and decision-making can occur, we need to have good tools for loading, exploring and analyzing our data. In this book, we'll be using the PyData stack to do so. There are many reasons for this, not least of all that Python is in many ways a Swiss Army knife for data science, in that you can do nearly anything with it. This doesn't mean that it will be the absolutely best tool for any particular job, but it will allow you to do pretty much anything you need to do, from loading data in many file formats to visualization, statistical inference to machine learning, scalable computing to model deployment and much more. In this chapter, we'll introduce you to the main tools in the PyData stack that we plan to use. We'll do so by exploring a dataset and introducing the tools, as they become necessary.

## The PyData stack

The PyData stack is a loosely connected and interoperable set of Python packages that began to be developed at the turn of the 21st century. Core developers in the ecosystem were and still are, for the most part, research scientists (_not_ software engineers) who decided to build the tools they needed to solve problems in physics, biology, finance, geology, and astronomy, among many other disciplines.

In his PyCon 2017 [keynote](https://www.youtube.com/watch?v=ZyjCqQEUa8o), "The Unexpected Effectiveness of Python in Science", Jake VanderPlas presented the following figure, which depicts many of the main players in the PyData ecosystem, including many that we will be using in this book:


In [None]:
#| echo: false
from IPython.display import Image
Image("../../img/pydata-jake.png")

In this chapter, as stated, we'll take a brief tour of the parts of the ecosystem we'll be using. In particular, we'll be making most use of:

* pandas for analysis and manipulation of tabular data;
* Matplotlib and Seaborn for data visualization;
* NumPy for numerical computing and simulation;
* SciPy for an assortment of statistical tools;
* IPython and the Jupyter ecosystem for interactive computing;
* PyMC for statistical modeling.

Let's now load the packages that we'll introduce in this chapter:

In [None]:
#| output: false
# Import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pymc as pm
from ipywidgets import interact
import arviz as az
import scipy.stats as st
%matplotlib inline
sns.set()

We'll introduce the packages in the order in which they often appear in a data science workflow, as tools for

* importing,
* visualization,
* numerical computing, and
* inference.

It's also worth noting that there are many Pythonic tools for data science that aren't in the PyData stack per se. For example, the PyData ecosystem doesn't aim to solve the natural language processing, deep learning, or machine learning deployment stories!

### Importing data with pandas

One of the most common forms we'll find data in is a CSV file, which stands for comma-separated values, and pandas is a useful tool for loading such files. In [the project's own words](https://pandas.pydata.org/),

> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. 

Arguably, one reason for the mass adoption of pandas and, as a result, the mass adoption of Python for data analysis, is that it allows us to import so many different types and forms of datasets with significant flexibility. Let's now use pandas to import the Gapminder dataset, which contains economic indicators for many countries since the 1950s.

In [None]:
# Import data
gm = pd.read_csv('../../datasets/gapminder.csv')
# Visualize first 5 rows of data
gm.head()

The above code loads the dataset into an object called a `DataFrame`, which is well-suited for analysis of tabular data. There's a handy `DataFrame` method called `.head()` that displays the first several rows of the `DataFrame` `gm`. We can immediately see that there are columns for country, continent, year, life expectancy, population, and GDP. 

To discover a bit more about `gm`, we can use the `.info()` method, which tells us what datatypes each column contains, how many non-null elements there are in each column, the memory usage, and more:

In [None]:
gm.info()

We can see that there are 1704 total rows and that each column has 1704 non-null entries, meaning that `gm` has no null entries. This is atypical! Very often we will encounter null entries or NAs; we do not in this case because the dataset has already been processed and cleaned. Also note that year and population are integers (int64s, to be precise), life expectancy and GDP are floats, and country and continent are objects (the most general type, often corresponding to strings, which is what we have in this case).

We can also use the `.describe()` method to find out more about the statistics of the numerical columns:

In [None]:
gm.describe()

We see that the minimum year is 1952 and the maximum is 2007, which is useful to know. We can also see the minimum and maximum for other columns, such as life expectancy and population, along with other statistics, such as the mean and standard deviation, but it isn't clear how useful these are, as they are computed over all countries and all years in the dataset. For example, the minimum and maximum life expectancies were 23.6 and 82.6, respectively, but it would make more sense to look at countries across the globe in a specific year, or at a time series for a specific country over the years.

To this end, we subset the `DataFrame` to include rows for 1952 alone, view the `.head()`, and then use the `.describe()` method:

In [None]:
gm_52 = gm[gm['year'] == 1952]
gm_52.head()
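The subsetting step relies on a boolean mask: comparing a column to a value yields a `True`/`False` Series, and indexing the `DataFrame` with that Series keeps only the `True` rows. Here is a minimal sketch on a small made-up table (the `toy` frame and its values are hypothetical, not taken from Gapminder):

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the Gapminder table
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 1957, 1952, 1957],
    "lifeExp": [40.0, 42.0, 60.0, 61.0],
})

# Step 1: the comparison produces one boolean per row
mask = toy["year"] == 1952
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows where it is True
toy_52 = toy[mask]

# The same selection written with DataFrame.query
same = toy.query("year == 1952")
print(toy_52)
```

Both forms return the same rows and keep the original index labels; which one to use is largely a matter of taste.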

In [None]:
gm_52.describe()

We can now read off indicator statistics of countries around the globe for 1952, such as

* the mean life expectancy across all countries was 49,
* the minimum population was 60,000, and
* the median GDP was 1968.

Note that we haven't yet specified the units of GDP! According to [the documentation underlying the dataset](https://github.com/jennybc/gapminder),

> Per-capita GDP (Gross domestic product) is given in units of international dollars, “a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time” – 2005, in this case.

There are a huge number of other common data analytic and data exploration patterns that pandas allows us to work with. We'll introduce them on an as-needed basis, but it is worth mentioning one of the most common (and often most useful!): `.groupby()` and the split-apply-combine pattern. Let's say that we want to figure out the mean values of some of our indicators over continents.

Then we can

- **split** our dataset of countries into groups, given the continent that they're on,
- **apply** the `.mean()` across each of these groups,
- **combine** the results back into a single `DataFrame`.

This has been a common pattern in data analytics for a long time, but it was only formally given the name "split-apply-combine" in 2011 in [a seminal paper by Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis"](https://www.jstatsoft.org/article/view/v040i01), which also introduced the R package allowing you to do so, which in turn inspired `pandas`! 

To do this, we use the pandas `groupby()` method, as follows:

In [None]:
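# numeric_only=True skips the non-numeric country column,
# which recent pandas versions would otherwise refuse to average
gm_52.groupby('continent').mean(numeric_only=True)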
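To make the three steps visible, here is a small sketch that performs split, apply, and combine by hand and compares the result with the `groupby` shortcut. The `toy` table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: a few life expectancies per continent
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Asia", "Asia", "Europe"],
    "lifeExp": [40.0, 50.0, 45.0, 55.0, 65.0],
})

# Split: one sub-DataFrame per continent
groups = {name: group for name, group in toy.groupby("continent")}

# Apply: compute the mean within each group
means = {name: group["lifeExp"].mean() for name, group in groups.items()}

# Combine: gather the per-group results into a single object
manual = pd.Series(means, name="lifeExp")

# The one-line equivalent
shortcut = toy.groupby("continent")["lifeExp"].mean()
print(shortcut)
```

The manual version is only there to show what happens conceptually; in practice the one-liner is both clearer and faster.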

### Visualizing data with matplotlib, pandas, and seaborn



We've already managed to do more exploratory data analysis (EDA) than we usually would without any data visualization. Normally, while performing EDA with such pandas methods as above, we would also be performing a variety of visualizations. In the spirit of not introducing all our tools at the same time, we've split apart some of this process, but know that we are often using many of them in concert. First, let's plot the histogram of life expectancy for `gm_52`:

In [None]:
gm_52.lifeExp.hist();

We can see that life expectancy ranges from 30-something to 70-something with more occurring in the 40s and 50s than in other decades. These are the types of things we want to start noticing in EDA: we're not making strong claims or robust inferences, but attempting to get a sense of what information is in our data.

Although we have seemingly used `pandas` to plot the above histogram, it is actually the visualization package `matplotlib` that is creating the visualization! pandas methods such as `.hist()` and `.plot()` provide convenient wrappers around `matplotlib` functions. We can also use `matplotlib` directly to plot the histogram:

In [None]:
plt.hist(gm_52.lifeExp);

Often, we will want to see the relationship between different variables and so will need to plot them on the same figure. Let's plot a scatter plot of population against life expectancy to see how we can achieve this with `pandas`:

In [None]:
gm_52.plot.scatter(x='lifeExp', y='pop', c='black', alpha=0.5);

One of the first things to catch our eye is that the populations range over several orders of magnitude. When this happens, it can be useful to use a logarithmic axis for the variable in question, so that the data is not all packed into a small region of the visualization:

In [None]:
gm_52.plot.scatter(x='lifeExp', y='pop', c='black');
plt.yscale('log');

By eye, we can perhaps see a negative correlation between log(population) and life expectancy, but we would need to do more work to figure out if this is actually the case or not.

We may also want to plot a numerical variable, such as life expectancy, when grouped by another, such as continent. We can do this using Seaborn, "a Python data visualization library based on matplotlib" that "provides a high-level interface for drawing attractive and informative statistical graphics."

In [None]:
sns.swarmplot(x='continent', y='lifeExp', data=gm_52);

In the above swarm plot, the y-axis shows life expectancy and the x-axis shows the continent. Within each continent, the points are also spread out horizontally so that they are all visible.

Finally, we may wish to plot as many numerical variables against one another as possible, in order to get a stronger sense of the dataset as a whole:

In [None]:
sns.pairplot(gm);

### Generating data with numpy

Just as both pandas and Seaborn use Matplotlib under the hood for plotting, pandas is built on top of [NumPy](https://numpy.org/), "[t]he fundamental package for scientific computing with Python." It is probably impossible to overstate the importance of NumPy, but let us merely state that it contains a vast array of functionality for numerical computing in Python, including array computing, "mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more."

We shall use NumPy for many purposes, including random sampling and simulating data, which will be a workhorse for us. Let's provide an example of generating samples from a uniform distribution and plot the resulting histogram. In the next chapter, we'll explain more about these terms and code but for now, let's give a sense:

In [None]:
# Draw 1,000 samples from uniform & plot results
rng = np.random.default_rng(42)
x = rng.random(1000)
plt.hist(x);

One of the cool aspects of NumPy's random sampling is that we're able to generate samples from many types of distributions, including the normal distribution, or bell curve:

In [None]:
samples = rng.normal(0, 1, size=10000)
plt.hist(samples);

One thing we'll be doing time and time again is matching our real-world data to distributions that we can simulate. To do this, we'll often want to plot the real-world data on the same figure as the simulated model. As it's difficult (at best) to see what's happening when plotting multiple histograms together, among other issues with histograms, which we'll get to, we'll introduce the empirical cumulative distribution function (ECDF), which allows us to plot all of our data and to compare several datasets, simulated or otherwise. We'll be working with ECDFs at length in Chapter 4, but are introducing them now in order to demonstrate the power of NumPy's random sampling.

**Definition:** In an ECDF, the x-axis is the range of possible values for the data and, for any given x-value, the corresponding y-value is the proportion of data points less than or equal to that x-value.

We now define a utility function that will allow us to compute ECDFs from our data.

In [None]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points
    n = len(data)

    # x-data for the ECDF
    x = np.sort(data)

    # y-data for the ECDF
    y = np.arange(1, n+1) / n

    return x, y
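As a quick check of the definition, the sketch below applies the same function to four made-up measurements (the `measurements` array is hypothetical) and compares the ECDF value at 2 with a direct count of the points less than or equal to 2.

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)            # sorted data values
    y = np.arange(1, n + 1) / n  # 1/n, 2/n, ..., 1
    return x, y

# Hypothetical measurements, including a tie at 2.0
measurements = np.array([3.0, 1.0, 2.0, 2.0])
x, y = ecdf(measurements)
print(x)
print(y)

# Proportion of points <= 2.0, counted directly
prop_le_2 = np.mean(measurements <= 2.0)

# With ties, the ECDF value at 2.0 is the largest y among the tied points
ecdf_at_2 = y[x <= 2.0].max()
print(prop_le_2, ecdf_at_2)
```

Note that tied values each get their own point on the plot; the topmost of those points is the one that matches the definition.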

Let's now plot the ECDF of the normally distributed data we generated above:

In [None]:
x_s, y_s = ecdf(samples)
_ = plt.plot(x_s, y_s, marker='.', linestyle='none')
_ = plt.ylabel('CDF')

Given a dataset that you think may be normally distributed, you can 

* compute its mean and standard deviation,
* use NumPy to randomly sample a normal distribution with the relevant mean and standard deviation, and
* plot the ECDFs of both to see if they match up.


In Chapter 5, we'll see why we may think certain data is normally distributed, such as the speed-of-light measurements from Albert Michelson's famous 1879 experiment! In the meantime, let's plot the ECDF of both the data and the samples generated from a normal distribution:

In [None]:
# Load data and give the velocity column a shorter name
df = pd.read_csv('../../datasets/michelson_speed_of_light.csv')
df = df.rename(columns={'velocity of light in air (km/s)': 'c'})


# Get speed of light measurement + mean & standard deviation
michelson_speed_of_light = df.c.values
mean = np.mean(michelson_speed_of_light)
std = np.std(michelson_speed_of_light, ddof=1)

# Generate normal samples w/ mean, std of data
# (using the seeded generator rng defined earlier, for reproducibility)
samples = rng.normal(mean, std, size=10000)

# Generate data ECDF & model CDF
x, y = ecdf(michelson_speed_of_light)
x_theor, y_theor = ecdf(samples)

# Plot data & model (E)CDFs
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('speed of light (km/s)')
_ = plt.ylabel('CDF')

### IPython and Jupyter

At first glance, the above are the key basic tools needed for data exploration and iteration. But we would be remiss not to state the fundamental importance of both IPython and Project Jupyter as foundational infrastructure for a great deal of this work. This book, for example, was written in Jupyter notebooks and JupyterLab using the IPython kernel. It was inspired by [workshops we have taught at SciPy, PyCon, and ODSC](https://github.com/ericmjl/bayesian-stats-modelling-tutorial) using these tools. Were you to try to reproduce some of the methods you learn in this book (which we hope you do!), the easiest approach would be to use the Jupyter and IPython ecosystem.

IPython, "a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language," was first released by Fernando Pérez in 2001. It provided a more useful interface for working in the REPL ("read-eval-print loop") style, which turned out to be incredibly practical for exploratory and iterative scientific research. Fast forward to today and we have both Jupyter notebooks (formerly IPython notebooks), which are web applications for building data- and model-centric narratives and documents (not dissimilar from research scientists' journals!), and JupyterLab, which is "the latest web-based interactive development environment for notebooks, code, and data."

## What is statistics?

There are many ways to slice the discipline of statistics. One way is into exploratory data analysis (EDA) and inference. The former is essentially about beginning to notice potential patterns in your data, the latter about drawing inferences, that is, about generalizing from the data you've seen to data that you haven't yet seen. We have already seen a variety of EDA techniques above and we will continue to develop a zoo of such techniques as this book goes on. For a deep dive into EDA, we recommend "Exploratory Data Analysis" by John Tukey, the modern parent of exploratory data analysis.

### EDA and the bootstrap

We've seen how computing summary statistics, such as the mean and standard deviation, is an important aspect of EDA. Once we've computed such a statistic, however, a key question arises: how certain can we be of our estimate? Let's make this concrete by looking at a dataset containing measurements of Galapagos finch beaks and computing the mean beak lengths for 2 species:

In [None]:
# Import data and extract beak lengths for each species
df_12 = pd.read_csv('../../datasets/finch_beaks_2012.csv')
df_fortis = df_12.loc[df_12['species'] == 'fortis']
df_scandens = df_12.loc[df_12['species'] == 'scandens']
blf = df_fortis['blength']
bls = df_scandens['blength']
np.mean(blf), np.mean(bls)

So what is our uncertainty around these estimates? In later chapters, we'll see how Bayesian modeling and inference can help us answer such questions. Here, we'll introduce a technique known as the _bootstrap_, which allows us to estimate the uncertainty without assuming a particular distribution for the data (that is, the bootstrap is _non-parametric_). The key insight behind the bootstrap is this: we have calculated the mean once already, and if we could sample the population again, we could calculate the mean again and get a slightly different result, owing to variation between samples. Doing this many times would give us a distribution of means, which would allow us to express uncertainty around our sample mean.

Since we usually can't sample the population again, the bootstrap instead resamples the original dataset with replacement, calculating a new mean each time. Let's do this now for the Fortis population of finches. We'll also plot a histogram of the resulting mean beak lengths:

In [None]:
n = 10**3
blf_means = np.zeros(n)

for i in range(n):
    samples = rng.choice(blf, len(blf))
    mean = np.mean(samples)
    blf_means[i] = mean
    
plt.hist(blf_means);

We can see immediately that the mean beak length ranges from around 10.3 to around 10.7. We can also compute percentiles to get a 95% confidence interval on our estimate:

In [None]:
# 95% CIs
np.percentile(blf_means, [2.5, 97.5])

We interpret this as saying that 95% of our bootstrapped means lay between 10.37 and 10.66. There's a lot more we can already do with this simulation, which we'll leave for future chapters. For example, if the histogram above looks like a bell curve to you, this is actually due to the Central Limit Theorem!

It's also worth mentioning that the bootstrap here is essentially a frequentist method of estimation. We do not necessarily advocate the use of such methods for inference but, as we're discovering, it can be an incredibly useful tool for EDA! This is because it allows us to get a sense of our data, while making minimal assumptions. This is actually also why we don't consider it the best tool for inference! As we'll see, we already know much about our data (such as how it was measured and collected), the details of which we'll want to include in any inferential project.
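The resampling loop and percentile step used in this section can be wrapped into a small reusable helper. This is a sketch under our own assumptions: the function name `bootstrap_ci` and its parameters are hypothetical, and the data are simulated stand-ins rather than the finch measurements.

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=2000, level=0.95, rng=None):
    """Percentile bootstrap interval for a statistic of a 1-D sample."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    # Each row of idx is one resample (with replacement) of the row positions
    idx = rng.integers(0, len(data), size=(n_boot, len(data)))
    # Apply the statistic to every resample
    replicates = np.apply_along_axis(stat, 1, data[idx])
    tail = 100 * (1 - level) / 2
    lo, hi = np.percentile(replicates, [tail, 100 - tail])
    return lo, hi, replicates

# Simulated stand-in for a column of beak lengths (not the real data)
rng = np.random.default_rng(42)
beaks = rng.normal(10.5, 0.7, size=120)

lo, hi, reps = bootstrap_ci(beaks, rng=rng)
print(f"sample mean: {beaks.mean():.2f}, 95% bootstrap interval: ({lo:.2f}, {hi:.2f})")
```

Because `stat` is a parameter, the same helper gives an interval for the median or standard deviation by passing `np.median` or `np.std`.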

### Inference

Inference is all about inferring unseen stuff from the stuff we **can** see, that is, from the data we have. To perform robust inference, we build statistical models and use them to make predictions.

As we have seen in Chapter 1, science is deeply interested in questions of the form:

- Given data collected in the world, what can we say about the data-generating process(es)? That is, what is $P(H|D)$, the probability of any hypothesis $H$, given the observed/collected data $D$?

We'll be using Bayesian inference and probabilistic programming, in the form of the package PyMC, to answer such questions. To round out this chapter and to give a sense of where we are going, let's look at a couple of examples. Neither the details of the statistical modeling nor the code matter for now; we will cover all of this in later chapters.

As before, the dataset we'll be looking at is from measurements of Galapagos finch beaks and we'll be able to ask (and answer!) questions such as "do different species from different islands have different beak lengths?"

#### What is the mean beak length of the Fortis population?

To answer this question, we first specify a statistical model (once again, all the details will be explained through the course of this book):

In [None]:
with pm.Model() as model:
    # Prior for mean & standard deviation
    μ_1 = pm.Normal('μ_1', mu=10, sigma=5)
    σ_1 = pm.LogNormal('σ_1', mu=0, sigma=10)
    # Gaussian Likelihood
    y_1 = pm.Normal('y_1', mu=μ_1, sigma=σ_1, observed=df_fortis['blength'])

Note that, in the above, in accordance with our protocol outlined in Chapter 1, we are completely specifying the model in terms of _probability distributions_. This includes specifying

- the form of the sampling distribution of the observed data, which is given by the **likelihood**, _and_
- the form that describes our _uncertainty_ in the unknown parameters, which is given by the **prior**.
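To see how these two ingredients combine, recall that Bayes' rule says the posterior is proportional to the likelihood times the prior. The sketch below evaluates that product on a grid of candidate means. It is a simplified illustration under our own assumptions, not what PyMC does internally: the data are simulated, the standard deviation is treated as known, and all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Simulated stand-in data (not the finch measurements)
rng = np.random.default_rng(0)
data = rng.normal(10.5, 0.7, size=50)
sigma = 0.7  # assumed known here, purely to keep the example one-dimensional

# Grid of candidate values for the mean
mu_grid = np.linspace(8, 13, 1001)
dx = mu_grid[1] - mu_grid[0]

# Prior: Normal(10, 5) on the mean, evaluated on the grid (log scale)
log_prior = stats.norm.logpdf(mu_grid, loc=10, scale=5)

# Likelihood: sum of Gaussian log-densities of the data for each candidate mean
log_lik = stats.norm.logpdf(data[:, None], loc=mu_grid[None, :], scale=sigma).sum(axis=0)

# Posterior is proportional to likelihood times prior; normalize to integrate to 1
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx

post_mean = (mu_grid * post).sum() * dx
print(f"sample mean: {data.mean():.3f}, posterior mean: {post_mean:.3f}")
```

With a prior this wide and 50 data points, the posterior mean sits very close to the sample mean. Grid evaluation stops being practical once a model has more than a few parameters, which is where sampling-based tools come in.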


The next step is to calculate the _posterior distribution_, which we do by hitting the "inference button":

In [None]:
#| output: false
# bust it out & sample
with model:
    samples = pm.sample(2000, return_inferencedata=True)

We can now plot the mean beak length. In fact, we plot the distribution of the mean beak length, capturing uncertainty around our estimate:

In [None]:
with model:
    az.plot_posterior(samples);

Under this model, we are 94% certain that the mean beak length is between 10.38 and 10.66 (94% is the default width of the highest-density interval that ArviZ reports).

### Bayesian hypothesis testing 

We can use a similar process to determine whether the mean beak length of the Galapagos finches differs between species. As before, we first define our model:

In [None]:
with pm.Model() as model:
    # Priors for means and standard deviations
    μ_1 = pm.Normal('μ_1', mu=10, sigma=5)
    σ_1 = pm.Uniform('σ_1', 0, 10)
    μ_2 = pm.Normal('μ_2', mu=10, sigma=5)
    σ_2 = pm.Uniform('σ_2', 0, 10)
    # Gaussian Likelihoods
    y_1 = pm.Normal('y_1', mu=μ_1, sigma=σ_1, observed=df_fortis['blength'])
    y_2 = pm.Normal('y_2', mu=μ_2, sigma=σ_2, observed=df_scandens['blength'])
    # Calculate the effect size and its uncertainty.
    diff_means = pm.Deterministic('diff_means', μ_1 - μ_2)
    # Pooled standard deviation: square root of the mean of the two variances
    pooled_sd = pm.Deterministic('pooled_sd', 
                                 np.sqrt((np.power(σ_1, 2) + 
                                          np.power(σ_2, 2)) / 2))
    effect_size = pm.Deterministic('effect_size', 
                                   diff_means / pooled_sd)

We then hit the "inference button":

In [None]:
#| output: false
# bust it out & sample
with model:
    samples = pm.sample(2000, return_inferencedata=True)

We can now plot the distributions of mean beak lengths. Most importantly, we can plot the distribution of the difference of means!

In [None]:
az.plot_posterior(samples, var_names=['μ_1', 'μ_2', 'diff_means', 'effect_size'], kind='hist');
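Beyond reading the plot, posterior draws can be summarized numerically, for instance by the fraction of draws in which the difference of means is below zero. The sketch below uses simulated draws as a stand-in: the numbers are invented for illustration and are not results from the finch data. With the real output, we assume the draws could be pulled out with something like `samples.posterior['diff_means'].values.ravel()`.

```python
import numpy as np

# Invented stand-in for posterior draws of diff_means (NOT real results)
rng = np.random.default_rng(1)
diff_draws = rng.normal(-1.0, 0.5, size=8000)

# Posterior probability that the difference is negative,
# estimated as the fraction of draws below zero
p_negative = np.mean(diff_draws < 0)

# Equal-tailed 94% interval from percentiles
# (a simpler cousin of the highest-density interval shown by ArviZ)
lo, hi = np.percentile(diff_draws, [3, 97])

print(f"P(diff_means < 0) is about {p_negative:.3f}")
print(f"94% equal-tailed interval: ({lo:.2f}, {hi:.2f})")
```

Summaries like these answer the question directly in terms of probability, rather than through a reject-or-accept decision.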

## Wrapping Up

In this chapter, we introduced the key tools in the PyData stack that we'll use in the rest of this book and that we hope you'll use in your own work. To recap, we introduced

* pandas for analysis and manipulation of tabular data;
* Matplotlib and Seaborn for data visualization;
* NumPy for numerical computing and simulation;
* SciPy for an assortment of statistical tools;
* IPython and the Jupyter ecosystem for interactive computing;
* PyMC for statistical modeling.

We did all of this looking at real-world datasets to give a feel for how these tools are used in practice. Now it's time to discover more about data, statistics, and probability using `NumPy`'s pseudorandom number generator. We'll do this in the next chapter.