# STAT 345: Nonparametric Statistics

## Lesson 00.1: Intro to the Course and Setting up Computing Environment

**Reading: Conover Chapter 1**

*Prof. John T. Whelan*

Tuesday 14 January 2025

## Preliminaries

### Administrata

-   Introductions!
-   [Syllabus](syllabus.pdf)
-   Instructor’s name (Whelan) rhymes with "wailin’". 
-   Text: Conover, *Practical Nonparametric Statistics*, 3rd edition.<br>(See syllabus for other resources.)
-   Course materials in MyCourses; see especially [timetable](timetable.html)<br>(but also "Content" and "Assignments")
-   Materials from previous section may be useful:<br>http://ccrg.rit.edu/~whelan/courses/2018_3fa_STAT_345/

### Course work:

- Read relevant sections of textbook before class
- Conover has many short exercises; answers to the odd-numbered ones are in the book, but they're more useful if you try them before looking!

- Problem Sets distributed & collected in MyCourses.
   - Part One: book problems
   - Part Two: computational exercise

- Prelim exams (think midterm, but there are two of them); format TBC.
- Cumulative final exam

### Grading

- 25% Problem Sets
- 20% First Prelim Exam
- 20% Second Prelim Exam
- 35% Final Exam

You'll get a separate grade on the "quality point" scale (e.g., 2.5--3.5
is the B range, including B- and B+) for each of these four components; the course grade is
the weighted average.

### COVID Considerations

https://pmc19.com/data/

<img src="pandemic010625.png" width="70%">

### COVID Considerations

https://covid.cdc.gov/ &emsp;
https://peoplescdc.org/ &emsp;
https://pmc19.com/data/<br>
https://www.cdc.gov/nwss/rv/COVID19-statetrend.html

- After the "autumn lull", we've begun the "winter wave"<br>
(plus school starting means new exposure opportunities)

- Lessons will be streamed over Zoom
    - can attend in person or remotely
    - no attendance requirement
- Homework submitted online
- Stay home if you're sick (attend remotely if up to it)
- Consider masking indoors (KN95 > surgical > cloth)

## Perspective on Nonparametric Methods

> "...there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know." -- Donald Rumsfeld, US Defense Secretary 1975-1977 & 2001-2006

- In intro stats (MATH-252, STAT-257 or STAT-205), you learned procedures to estimate quantities & test hypotheses, many based on the normal distribution.  They may have seemed arbitrary, but many are **optimal procedures**, known to outperform alternatives **if** underlying properties of the random data are known.

- E.g., if you have a sample drawn from a specified distribution with unknown parameters, you are quantifying the "known unknowns"

- If you don't know the underlying distribution (not even what family it belongs to), you're dealing with "unknown unknowns", and nonparametric methods can be more useful.

- Note that “nonparametric” is often a misnomer; might be estimating the median of a distribution, which can be considered a parameter. More general term is **robust** methods, which may not be most efficient in ideal cases, but still perform well when simplifying assumptions don't apply.
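As a small illustration of what "robust" means here (a sketch with made-up numbers, not data from the text), compare how the sample mean and the sample median respond when one observation is replaced by a wild outlier:

```python
import numpy as np

# Made-up sample, purely for illustration
clean_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Same sample with the largest value replaced by a gross outlier
dirty_i = np.array([1.0, 2.0, 3.0, 4.0, 500.0])

mean_clean, mean_dirty = np.mean(clean_i), np.mean(dirty_i)
median_clean, median_dirty = np.median(clean_i), np.median(dirty_i)
print('mean:   %g -> %g' % (mean_clean, mean_dirty))
print('median: %g -> %g' % (median_clean, median_dirty))
```

The mean is dragged far from the bulk of the data by the single bad point, while the median does not move.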

## Outline

1.  Review/Basics of Probability and Statistical Inference (Chapters One and Two)

2.  Binomial Tests (Chapter Three)

3.  Rank-Based Tests (Chapter Five)

4.  Kolmogorov-Smirnov Statistics (Chapter Six)

5.  Contingency Tables (Chapter Four)

## Computing Environment

We'll use computational tools, mostly Python & the SciPy `stats` package.  Main motivations:

- Avoid anachronism of looking up probabilities & percentiles in tables.  (Reminiscent of math books from the 1980s w/tables of trig functions, exponentials, etc.)  Don't need to look up or interpolate the 97.5th percentile of the Student-$t$ distribution w/12 degrees of freedom if we can just use

In [None]:
from scipy import stats
stats.t(df=12).ppf(0.975)

- Process realistic-sized data sets rather than copying down a dozen numbers from the textbook.

- Some properties of nonparametric methods are difficult or impossible to prove analytically, but we can explore them numerically using Monte Carlo simulations.

- Also gain a potentially useful skill (data analysis with Python) in the process.
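To give a flavor of the Monte Carlo idea mentioned above, here is a minimal sketch that checks the Student-$t$ percentile by simulation rather than by table lookup.  The seed and the number of simulated draws are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(345)   # seed chosen arbitrarily, for reproducibility
n_sims = 100_000                   # number of simulated draws (an arbitrary choice)

t_crit = stats.t(df=12).ppf(0.975)          # the 97.5th percentile discussed above
t_s = rng.standard_t(df=12, size=n_sims)    # simulated Student-t values, 12 d.o.f.
frac_below = np.mean(t_s < t_crit)          # Monte Carlo estimate of P(T < t_crit)
print(t_crit, frac_below)
```

The fraction of simulated values below the percentile should come out close to 0.975, up to the statistical error of the simulation.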

## Introduction to Python, NumPy and Jupyter

- Lessons & assignments for this course are Jupyter notebooks
- The notebook can run Python commands (other notebooks can use R or Julia; "Ju-Pyt-R").  Think: computational data analysis, not "coding".

- Can run on JupyterHub server via http://vmware.rit.edu/ or download `.ipynb` file & run yourself (`ipynb` stands for IPython Notebook, an older name for Jupyter)

- Divided into "cells" of different types; we'll mostly use "Markdown" & "Code" cells
- Shift-return either renders or executes a cell.

- If you want to run locally, you need to install Jupyter, Python and a few other packages.
  * User-friendly option: Anaconda: https://www.anaconda.com/products/distribution
  * More robust but still pretty easy: Mamba: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html

- My local installation used (in 2023):
<pre>~/mambaforge/bin/mamba create -n STAT345
conda activate STAT345
mamba install jupyter
mamba install scipy
pip install rise
mamba install jupyterlab
mamba install jupytext
pip install nbmerge
mamba install matplotlib</pre>

### Markdown Cells

- Markdown is a simple formatting language. (Its name is a pun on "markup")
- Designed so raw code is legible even if not rendered.
- Can use *italics*, **bold**, or `monospace`.
- Most significantly, can use features of LaTeX to format math.
- Easiest introduction to LaTeX is to double-click on a markdown cell and see the source

1. Math is written between dollar signs, like $x=5$.

2. If you use double dollar signs, you get things displayed on a line by themselves, like
$$
X = a + 3 b
$$

3. Subscripts and superscripts are written with `_` and `^`, respectively: $x_1=2.45$, $2^4=16$.

4. Braces `{` and `}` can be used to enclose expressions when needed: $x_{10}$ not $x_10$.

5. Greek letters and other special characters are written with a backslash: $x=\mu\pm \sigma$.

6. Fractions can be written with the `\frac` macro: $\frac{dy}{dx}=\sqrt{x}$

7. Sums, integrals and products have large operators defining them:
$$
\int_0^x e^t\,dt = e^x - 1 = \sum_{n=1}^{\infty} \frac{x^n}{n!}
$$

You can get a lot more documentation by doing a web search for something like `jupyter markdown`.

## Python Basics
<img alt="Webcomic: Python" src="python.png" height="100%">
https://xkcd.com/353/ "Python"

- We'll also use Code cells, which contain commands to be executed.  We'll use Python 3, but Jupyter can also handle other languages like R, Julia, and SageMath (you can change the language in the notebook if you have a strong preference)

- Python is an interpreted/scripting language, so you can type simple commands & run them one at a time without worrying about technical details like allocating memory, declaring variables, etc.

- In addition to the basic built-in Python commands, we'll use a few specific libraries extensively:
  * Numerical Python or NumPy for computations and sophisticated data types
  * Scientific Python or SciPy, especially the `scipy.stats` package
  * Matplotlib for plotting

- We'll go over some of the most basic/important tricks here; you can examine skipped-over sections in the notebook at your leisure.

We start with some boilerplate to display plots in the notebook (other possibilities are `%matplotlib notebook` or the deprecated `%pylab inline`)

In [None]:
%matplotlib inline

Also to import the libraries we'll use often:

In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

And finally some tweaks to make the figures a little more legible:

In [None]:
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

### Data Types

Don't have to declare variables, but they do still have distinct types.  E.g. numbers can be integers:

In [None]:
x = 3; x

Or "floating point":

In [None]:
y = 9.0; y

In [None]:
y + 1.5

In [None]:
y**(-6)

Note that the operator for raising to a power is `**`, not `^`.  In Python, `^` is the bitwise exclusive-or (XOR) operator, which does something very different that we don't need right now:

In [None]:
10**2

In [None]:
10^2
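For the curious, a short sketch of why `10^2` gives what it does: `^` compares the binary digits of the two integers and sets each output bit to 1 where exactly one of the inputs has a 1.

```python
# 10 is 1010 in binary and 2 is 0010
power = 10**2      # exponentiation: ten squared
xor = 10 ^ 2       # bitwise XOR: 1010 xor 0010 = 1000, i.e. 8
print(power, xor)
print(bin(10), bin(2), bin(xor))
```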

Floating point arithmetic can sometimes lead to roundoff errors, since it's not possible to express all real numbers exactly:

In [None]:
1+1+1-3

In [None]:
0.1+0.1+0.1-0.3
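One practical consequence is that testing floats for exact equality can give surprising answers.  A minimal sketch of the usual workaround, using the standard-library function `math.isclose` (NumPy offers the analogous `np.isclose` for arrays):

```python
import math

total = 0.1 + 0.1 + 0.1
exactly_equal = (total == 0.3)            # False because of roundoff
close_enough = math.isclose(total, 0.3)   # True: equal to within a relative tolerance
print(exactly_equal, close_enough)
```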

Integers and floats behave differently under division.  The `/` operator is for floating point division:

In [None]:
5. / 3.

Integer ("floor") division, which rounds the quotient down to an integer, is done with the `//` operator,
and the remainder is available with `%`:

In [None]:
5 // 3

In [None]:
5 % 3

As of Python 3, if you try to use `/` on integers, it automatically converts them to floating point first:

In [None]:
5 / 3
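One detail worth knowing about `//` and `%` shows up with negative numbers: `//` rounds toward minus infinity rather than toward zero, and `%` is defined to be consistent with that.  A brief sketch:

```python
q, r = -5 // 3, -5 % 3      # floor division rounds toward minus infinity
truncated = int(-5 / 3)     # int() truncates toward zero instead
print(q, r, truncated)
# Quotient and remainder always satisfy: dividend == divisor*q + r
print(3*q + r)
```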

Python also has a string data type, with its own set of operations:

In [None]:
x = 'foo'
y = 'bar'
x + y

### Lists in Python

Python has a standard list datatype which can contain different types of data:

In [None]:
stuff_i = [1,2.0,'three']; stuff_i

You can use indices to get elements of the list.  Like many computing languages, Python counts from zero:

In [None]:
stuff_i[0]

In [None]:
stuff_i[1]

In [None]:
stuff_i[2]

One useful construction is that using `-1` as an index gives the last element (as opposed to R, where `stuff_i[-1]` would be an array containing all the elements *except* `stuff_i[1]`):

In [None]:
stuff_i[-1]

You can also specify multiple indices to get a "slice" of the list:

In [None]:
stuff_i[0:2]

This says that `stuff_i[0]` is the first element in the slice, and `stuff_i[2]` is the first element *not* included in the slice.  Since starting with `stuff_i[0]` isn't actually a restriction, you can omit it:

In [None]:
stuff_i[:2]

Also note that, since there are three elements in the list, element `2` and element `-1` are the same thing:

In [None]:
stuff_i[:2]

In [None]:
stuff_i[:-1]

You can also get the length of a list like this:

In [None]:
len(stuff_i)
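A few more slicing patterns, sketched on a hypothetical list `letters_i` that is slightly longer so the results are easier to tell apart:

```python
letters_i = ['a', 'b', 'c', 'd', 'e']   # hypothetical example list
middle = letters_i[1:4]        # elements 1, 2 and 3; element 4 is excluded
last_two = letters_i[-2:]      # from the second-to-last element through the end
every_other = letters_i[::2]   # an optional third number in the slice is a step size
print(middle, last_two, every_other)
print(len(middle))             # length of a slice [a:b] is b - a
```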

You can also make a list of lists, which is useful for organizing things like samples of different sizes:

In [None]:
nestedstuff_i_j = [[1,2,3,4],[10,20,30]]
nestedstuff_i_j

In [None]:
nestedstuff_i_j[0]

In [None]:
nestedstuff_i_j[0][-1]

### Tuples

One of the important features of lists is that you can modify elements of them after they've been defined:

In [None]:
stuff_i[2] = np.pi; stuff_i

Closely related is a *tuple*; like a list, but "immutable", i.e., once you define it, you can't change its elements:

In [None]:
stufftup_i = (1,2.0,'three'); stufftup_i

In [None]:
stufftup_i[2] = np.pi

The previous cell generates an error because `stufftup_i` is a tuple, and you're not allowed to modify it once it's been defined.
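If you want to see the error without interrupting the notebook, one option is to catch it.  The sketch below also shows a common workaround, which is to build a new tuple rather than modify the old one (the replacement value 3.14 is just a placeholder):

```python
stufftup_i = (1, 2.0, 'three')
try:
    stufftup_i[2] = 3.14       # not allowed: tuples are immutable
    modified = True
except TypeError as err:
    modified = False
    print('Caught:', err)

# Build a new tuple from a slice of the old one plus a one-element tuple
newtup_i = stufftup_i[:2] + (3.14,)
print(newtup_i)
```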

### NumPy Arrays

NumPy introduces a generalization of the list (and list of lists) known as an array.  All of the data in an array has to be the same type, and if it's constructed from a list of lists, they must all be the same length.  Here is a one-index array, which is like a 3-element vector:

In [None]:
A_i = np.array([1,2,3]); A_i

And a two-dimensional array, which is like a $2\times 3$ matrix:

In [None]:
B_ij = np.array([[1,2,3],[10,20,30]])
B_ij
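As an aside on the same-type requirement: if the input list mixes integers and floats, NumPy converts everything to a common type.  A minimal sketch (the variable names are just for this illustration):

```python
import numpy as np

ints_i = np.array([1, 2, 3])       # all integers: an integer array
mixed_i = np.array([1, 2.0, 3])    # one float: every element becomes a float
print(ints_i.dtype, mixed_i.dtype)
print(mixed_i)
```

The `dtype` property reports the common type of the elements.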

Rather than the length of a multi-dimensional array, it's better to think of the "shape":

In [None]:
np.shape(A_i)

In [None]:
np.shape(B_ij)

The length is still defined, but it just says how many rows the array has:

In [None]:
len(B_ij)

In [None]:
B_ij

We can pull out elements of an array by specifying the indices:

In [None]:
B_ij[0,2]

We can also extract a given row:

In [None]:
B_ij[0,:]

or column:

In [None]:
B_ij[:,1]

Unlike in languages like Matlab that have a preference for one- and two-dimensional structures, NumPy arrays can have any number of indices:

In [None]:
C_kij = np.array([B_ij,2*B_ij]); C_kij

In [None]:
np.shape(C_kij)

In [None]:
C_kij[0,1,0]

In [None]:
C_kij[0]

In [None]:
B_ij

The shape tells you what the range of the indices for the array is, but you can also use `ndim` to get the number of indices if that's all you need:

In [None]:
np.ndim(A_i)

In [None]:
np.ndim(B_ij)

In [None]:
np.ndim(C_kij)

"Best practice" for names of array variables: It's easy to lose track of how many and what kind of "indices" an array has, so I encode this in the variable name.  `B_ij` has an index of type `i` and an index of type `j`.

We wrote `C_kij` because `C_kij[0]` is `B_ij` which means the last two indices of `C_kij` are of the same type as `B_ij`.

Note that an array can have more than one of the same kind of index.  For example, the outer product $M_{ij}=A_i A_j$ (or perhaps better written $M_{ii'}=A_i A_{i'}$) can be written

In [None]:
M_ii = np.outer(A_i,A_i); M_ii

If we for example sum over one index, we can write $D_{kj} = \sum_i C_{kij}$ as follows:

In [None]:
C_kij

In [None]:
D_kj = np.sum(C_kij,axis=1); D_kj

The `axis=1` means we sum over the middle index, i.e., the one that we're calling type `i`.
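To make the role of `axis` concrete, here is a self-contained sketch that rebuilds the same arrays and compares `np.sum(..., axis=1)` with the sum over the `i` index written out by hand (`D_check_kj` is a name introduced only for this check):

```python
import numpy as np

B_ij = np.array([[1, 2, 3], [10, 20, 30]])
C_kij = np.array([B_ij, 2*B_ij])          # shape (2, 2, 3)
D_kj = np.sum(C_kij, axis=1)              # sum over the middle (i) index

# The same sum written out explicitly over the two values of i
D_check_kj = C_kij[:, 0, :] + C_kij[:, 1, :]
print(np.shape(D_kj))
print(np.array_equal(D_kj, D_check_kj))
```

The summed-over index disappears from the shape, which is why the result has only `k` and `j` indices.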

You can take slices out of arrays as well:

In [None]:
C_kij

In [None]:
C_kij[0,:,:]

In [None]:
np.shape(C_kij[0,:,:])

In [None]:
C_kij[0,:,:-1]

In [None]:
np.shape(C_kij[0,:,:-1])

Note that lists and arrays don't always behave the same.  For lists (as for strings), the addition operator concatenates the two lists together:

In [None]:
stuff_i + stuff_i

But for arrays, it performs element-by-element addition; output has the same length as the inputs:

In [None]:
A_i + A_i

If you really do want to concatenate arrays together, there's a function to do this:

In [None]:
np.concatenate((A_i,A_i))
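The same list-versus-array distinction applies to multiplication by an integer, which is an easy source of bugs.  A short sketch with illustrative names:

```python
import numpy as np

numlist_i = [1, 2, 3]
numarr_i = np.array(numlist_i)
doubled_list = 2 * numlist_i    # repeats the list: twice as long
doubled_arr = 2 * numarr_i      # multiplies each element: same length
print(doubled_list)
print(doubled_arr)
```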

### Displaying data

We've seen that you can show the contents of a variable by just giving that variable as an input command:

In [None]:
A_i

There is also a `print` command which usually gives a somewhat simplified representation:

In [None]:
print(A_i)

There's no `printf()` command in Python, but there is analogous syntax using the `%` operator.  A construction like `string % tuple` acts like `sprintf(string, values)`, replacing the format specifiers in `string` with the values from `tuple`.  Easiest to see with a couple of examples:

In [None]:
print(('We can format variables as integers: %d\n'
      +'floating point: %f\n'+'scientific notation: %e\n'
      +'general: %g or %g\n'
      +'or string: %s')
      % (14,np.pi,123.456,2**10,2**40,'foobar'))

In [None]:
pinum = 22
pidenom = 7
'Approximate %f as %d/%d=%f' % (np.pi,pinum,pidenom,pinum/pidenom)
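Python 3.6 and later also offer "f-strings", which put the expression and its format specifier directly inside the string.  This is an alternative to the `%` syntax used in these notes, not a replacement for it; a minimal sketch of the same example:

```python
import math

pinum, pidenom = 22, 7
# {expression:format} is evaluated and formatted in place
msg = f'Approximate {math.pi:.4f} as {pinum}/{pidenom}={pinum/pidenom:.4f}'
print(msg)
```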

### Functions, objects and methods

To take a very brief look into the technicalities of the language, note that, unlike in R, `.` is not just another character in a variable or function name.  The construction `a.b` is accessing the "property" `b` from the "object" `a`.  So for instance, we could access the shape of an array as

In [None]:
B_ij.shape

This is equivalent to the previous construction

In [None]:
np.shape(B_ij)

Actually the function `np.shape()` is itself another sort of "object-oriented" construction.  Recall the line

    import numpy as np

in the boilerplate at the top of the notebook.  This defined `np` to be an object that gives us access to the whole NumPy library.  There are constants defined as properties of that object, such as

In [None]:
np.pi

and there are also "methods" like `np.shape()`.  If I write `a.B()` this is the method `B` operating on the object `a`, which is sort of like a function whose first and most important argument is `a`.  It is also possible to use methods with additional function arguments like `a.B(c)`.

To illustrate this, recall our 1-d array:

In [None]:
A_i

Can use the `prod()` method which takes the product of the elements of the array $\prod_i A_i=1\times 2\times 3$

In [None]:
A_i.prod()

The equivalent syntax calling the function from the NumPy library is

In [None]:
np.prod(A_i)
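Methods can take additional arguments in the same way as the corresponding functions.  As a sketch, here is the sum over the first index of the two-dimensional array from earlier, written both ways:

```python
import numpy as np

B_ij = np.array([[1, 2, 3], [10, 20, 30]])
colsum_j = B_ij.sum(axis=0)           # method form, with an extra argument
colsum_fn_j = np.sum(B_ij, axis=0)    # equivalent function form
print(colsum_j, colsum_fn_j)
```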

### Vectorization

Like most computing languages, Python lets you define loops to repeat the same or similar operations, but it's best to use other constructions to accomplish the same thing, since loops in Python are much slower than higher-level operations.  So if you want to set $a_i=i^2$ for $i=0,\ldots,n-1$, instead of writing something like

In [None]:
n = 10
a_i = np.empty(n)
for i in range(n):
    a_i[i] = i**2
a_i

you can take advantage of the fact that NumPy's arrays do element-by-element arithmetic to define something like

In [None]:
np.arange(n)**2

A more general construction that does the job of creating a list just like a loop might, but is faster, is called a *list comprehension*:

In [None]:
[i**2 for i in range(n)]

It basically evaluates the `i**2` for each of the `i` values in the `for` part of the statement, and makes a list out of the result.

Note that if you want to use the result as a NumPy array in later calculations, you need to turn the list into an array:

In [None]:
np.array([i**2 for i in range(n)])
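To check that the three approaches agree, and to get a rough feel for their relative speed, one could wrap each in a function and use the standard-library `timeit` module.  This is only a sketch: the function names and the choice of `n` are made up for the illustration, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

n = 10_000

def squares_loop():
    a_i = np.empty(n)
    for i in range(n):
        a_i[i] = i**2
    return a_i

def squares_vectorized():
    return np.arange(n)**2

def squares_comprehension():
    return np.array([i**2 for i in range(n)])

# All three should produce the same values
same = (np.array_equal(squares_loop(), squares_vectorized())
        and np.array_equal(squares_loop(), squares_comprehension()))
print('All three agree:', same)

# Rough timings (total seconds for 20 calls each)
for f in (squares_loop, squares_vectorized, squares_comprehension):
    print(f.__name__, timeit.timeit(f, number=20))
```

In a notebook, the `%timeit` magic command is a convenient alternative for timing a single expression.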

### Plotting

We'll use the `matplotlib` library for plotting.  Here is a basic example:

In [None]:
x_x = np.linspace(-5,5,1001)
y_x = x_x**3 - 10*x_x
plt.plot(x_x,y_x);
plt.xlabel(r'$x$'); plt.ylabel(r'$x^3-10x$');
plt.grid(); plt.xlim(-5,5);

## Homework

There is a short practice problem set, Problem Set 0, in MyCourses, due Thursday, with some Python practice.  It won't be graded, but you should submit to the folder to get access to the solutions.