# Scipy: A bunch of miscellaneous scientific functions

Scipy is an even bigger and more diverse library than Numpy.  But, it contains a lot of very niche, specialized tools, from a wide range of niches and specializations.  So, chances are, you might only actually need a few things from Scipy at any given time.

In this notebook, we're going to do a very brief overview of some of the important things Scipy has, and we're going to spend most of our time talking about *sparse arrays,* which are (in my opinion) the most interesting and unique contribution that the library has for data science.

Scipy exists to be a one-stop library for accessing a *lot* of pre-existing math libraries, usually written in C and Fortran, that were originally created to do fast math operations in a scientific, numeric, engineering, or mathematical context.  After all, why re-invent the wheel when you can just make Python talk to these existing tools?

Install Scipy from the `conda-forge` channel (this will give you the newest version, which has some cool features not yet in the main channel's version) with:

```bash
conda install -c conda-forge scipy
```

Note: installing Scipy with `pip` on Windows (and possibly Mac) *will* lead to pain and suffering.  Just use conda for this, unless you want to spend far too long figuring out how to set up Visual Studio compilers and dependencies to get Fortran, C++, and C code compiled.  Trust me, it's not fun to have to figure out.

# A rapid-fire, incomplete list of some of Scipy's contents

1. Signal processing.  Fourier transforms, discrete sine/cosine transforms, inverse transforms, and more.
2. Numeric optimization.  Optimize general numeric expressions, find roots of polynomials, etc.
3. Interpolation.  Given some numbers, "fill in" values between them.
4. Various spatial algorithms.  Approximate nearest neighbors, fast distance metrics, convex hulls, etc.
5. Statistics.  Basic statistical tests.
6. Sparse arrays.  Array formats specialized for data where most values are zero.

Regardless of what you're using Scipy for, you'll be storing your data in Numpy arrays.

We will only be exploring 5 and 6 in this notebook.
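For a quick taste of items 2 and 3, here is a minimal sketch using `optimize.brentq` and `interpolate.CubicSpline`.  The function `f` and the sample points are made up purely for illustration.

```python
import numpy as np
from scipy import interpolate, optimize

# Item 2: find the root of f(x) = x**2 - 2 between 0 and 2.
# (f is a made-up example function.)
def f(x):
    return x**2 - 2

root = optimize.brentq(f, 0, 2)
print(f"Root of x^2 - 2 on [0, 2]: {root:.6f}")

# Item 3: fit a cubic spline through a few known points, then
# "fill in" a value between them.
known_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
known_y = known_x**2
spline = interpolate.CubicSpline(known_x, known_y)
print(f"Interpolated value at x = 2.5: {float(spline(2.5)):.3f}")
```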


# `scipy.stats`: (almost) all the basic statistical tests you'd ever need.

The `scipy.stats` module contains a lot of classical statistical hypothesis tests.  T-tests, linear regressions, ANOVAs, etc., plus ways to sample from a lot of statistical distributions (normal, beta, gamma, Poisson, Zipf, and a bunch more).  If it's an everyday, bread-and-butter test, it's probably here.  But you won't find things like Generalized Linear Models, hierarchical models, or anything particularly cutting edge.  For those things, you're probably using a different library (like `statsmodels`, which we'll cover in a separate session), or you're using a different language like R.

Here's a quick tour of the `scipy.stats` module.

In [1]:
# A quick function to convert raw p-values to stars.
def significance(p):
    if p <= 0.001:
        return "***"
    elif p <= 0.01:
        return "**"
    elif p <= 0.05:
        return "*"
    else:
        return ""

from scipy import stats

# Generate some normally-distributed random data.
# stats.norm is the distribution/sampling object;
# .rvs() is the method for generating random numbers.
# This uses a mean of 0, standard deviation of 1,
# and generates 1,000 samples.
# 
# This is the general paradigm for random sampling
# with scipy: 
# stats.DISTRIBUTION.rvs(
#    distribution parameters,
#    size=number of samples to generate
# )
sample_1 = stats.norm.rvs(loc=0, scale=1, size=1000)

# 1,500 samples from a distribution with mean 0.5,
# standard deviation 0.5.
sample_2 = stats.norm.rvs(loc=0.5, scale=0.5, size=1500)

# Run a t-test.  This returns a kind of object called a Named Tuple.
# It behaves like a special kind of class, with attributes.
ttest = stats.ttest_ind(sample_1, sample_2)
print(ttest)
print(f"T-statistic: {ttest.statistic}")
print(f"P-value: {ttest.pvalue}{significance(ttest.pvalue)}")
print()

Ttest_indResult(statistic=-16.726994750938918, pvalue=1.2939642932458158e-59)
T-statistic: -16.726994750938918
P-value: 1.2939642932458158e-59***
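The two samples above were drawn with different standard deviations, and `ttest_ind` assumes equal variances by default.  As a hedged sketch (the seed of 42 is arbitrary, and the variable names are new here), this is how you might make the sampling repeatable and ask for Welch's t-test, which drops the equal-variance assumption:

```python
import numpy as np
from scipy import stats

# A seeded generator makes the "random" samples repeatable.
rng = np.random.default_rng(42)  # 42 is an arbitrary seed
sample_a = stats.norm.rvs(loc=0, scale=1, size=1000, random_state=rng)
sample_b = stats.norm.rvs(loc=0.5, scale=0.5, size=1500, random_state=rng)

# equal_var=False requests Welch's t-test, which does not assume
# that both samples share the same variance.
welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"T-statistic: {welch.statistic:.3f}")
print(f"P-value: {welch.pvalue:.3e}")
```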



In [2]:
# Or, we could run a Mann-Whitney U-test.
utest = stats.mannwhitneyu(sample_1, sample_2)
print(utest)
print(f"U-statistic: {utest.statistic}")
print(f"P-value: {utest.pvalue}{significance(utest.pvalue)}")

MannwhitneyuResult(statistic=486787.0, pvalue=4.0291318137427627e-50)
U-statistic: 486787.0
P-value: 4.0291318137427627e-50***


In [3]:
# Or a linear regression.
# Get some uniformly distributed random data between 0 and 2,
# and set the y value to be 3x + some Gaussian noise + 5.
x = stats.uniform.rvs(0, 2, size=20)
y = 3*x + stats.norm.rvs(0, 1, size=20) + 5

linreg = stats.linregress(x, y)
print(linreg)
print(f"y = {linreg.slope:.3f}x + {linreg.intercept:.3f}; p={linreg.pvalue:.3e}{significance(linreg.pvalue)}")

LinregressResult(slope=2.407798202576446, intercept=5.157866484883578, rvalue=0.7775896925059598, pvalue=5.459108591374976e-05, stderr=0.4589089790404404, intercept_stderr=0.44995882897285644)
y = 2.408x + 5.158; p=5.459e-05***


Other statistical tests look basically the same.  You pass in your different samples as Numpy arrays (or, really, as anything *array-like*, so a Python `list` would work too), and scipy computes the test statistics and p-values for you.  Not a lot of fuss, but also, not a lot of bells and whistles.  These are pretty barebones functions, but that can be a good thing: they're very easy to integrate with most programs, especially if you're already using Numpy arrays or anything built on top of them (like Pandas, which we'll see next session).
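To illustrate the "array-like" point, here is a small sketch that runs a one-way ANOVA with `stats.f_oneway` on plain Python lists.  The three groups are hypothetical numbers invented for this example.

```python
from scipy import stats

# Hypothetical measurements for three groups, as plain Python lists.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
group_c = [10, 11, 12, 13, 14]

# Same pattern as the other tests: pass in the samples, get back
# a result with .statistic and .pvalue attributes.
anova = stats.f_oneway(group_a, group_b, group_c)
print(f"F-statistic: {anova.statistic:.3f}")
print(f"P-value: {anova.pvalue:.3e}")
```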

There are libraries that offer a richer, but more complex, interface for doing statistical work.  Some worth knowing about:
- `statsmodels`, which offers an R-like formula interface, and much richer sets of models.  (e.g. GLMs).
- `pymc3`, which does Monte Carlo sampling and Bayesian modeling.
- `scikit-learn`, which uses some variations on statistical models to do machine learning, prediction, and classification tasks.  We will be covering `scikit-learn` later in the year.
- `rpy2`, which lets you use R code/models from within Python.

Honestly, if you really do need sophisticated and cutting-edge statistical models, and the simpler/more widespread tests available in these libraries just won't cut it, you're going to be better served by using a dedicated statistics language like R or [Stan](https://mc-stan.org/).

# Sparse arrays

The other big, important part of Scipy is its support for *sparse arrays.*  Sparse arrays are a data format that's specifically designed to store, surprise surprise, *sparse data:* data where most of your values are 0 (or, more generally, where most of your values are *the same value,* but usually this ends up being 0).

Let's start with a real-world example of sparse data.  Let's say you want to list every building in Texas, and whether or not that building is owned by UT Arlington.  This data is extremely sparse; UTA owns a decent number of buildings, but compared to how many there are in Texas, it's a vanishingly small number.  So, most buildings in Texas are not owned by UTA.  If we decided to record every building in Texas anyways, we would end up with a huge list, with millions of items on it, but only a few of those would be listed as "owned by UTA."  Obviously, this is a silly way to store this data.  A much better approach is to only write down the buildings UTA owns, and leave everything else off the list.  That way, you save a huge amount of storage (whether that's on paper or digital).

Sparse arrays exploit this same trick.  If most of your data is zero, why bother storing anything but the non-zero elements?  With one-dimensional data, this is a bit silly, but it comes up a lot with 2-dimensional data.  Consider the following example of a basic natural language processing (NLP) analysis:

Start with a collection of texts, each of which is labeled for something you care about--let's say they're labeled for the presence of some emotional state you're researching.  You want to know what words are highly indicative of this emotional state.  So, you use the following steps to conduct your analysis:

1. You choose to make one feature (independent variable) for each word that appears anywhere in your collection of texts.
2. You construct a matrix (or matrix-like data structure) where each row corresponds to a document, each column corresponds to a word, and the values are "how many times does this word appear in this document?"  You'll get a big matrix that looks something like this:

    | Text | the | cat | dog | sat | barked | on | at | mailman |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | "The cat sat on the mailman" | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
    | "The dog barked at the mailman" | 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |

3. You then run a regression of some sort on this data, and presumably go publish the results somewhere.

(This is known as a *bag of words* analysis in NLP.)  If you think about the matrix you're generating in step 2, you might start to see some issues.  A few words--like *the, to, a, for, in*, etc.--appear in nearly every single piece of English text that's more than a few words long.  But most words, like *barked*, only appear in a small number of texts.  So you'll have a lot of words that only appear in a handful of documents.  Furthermore, there are millions of possible words that any single document could use, and in any decent collection of texts, you'll get a few tens or hundreds of thousands of unique words.  But if your texts are all fairly short--say, 2,500 words long--then it's mathematically impossible for them to use most of the words you're keeping track of!  A 2,500-word essay can only use, at most, 2,500 unique words.  If you're tracking 50,000 words, then you'll have at least 47,500 words that you're keeping track of, but which don't show up in the essay.

Long story short: you'll have a *lot* of zeros.  It's not unrealistic to expect that 99.99%--or more--of the values you measure will be zero.
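As a minimal sketch of step 2, the snippet below builds the count matrix for the two example sentences from the table, using plain Numpy.  The whitespace tokenization and the hard-coded vocabulary are simplifying assumptions; real text needs more careful handling.

```python
import numpy as np

texts = [
    "The cat sat on the mailman",
    "The dog barked at the mailman",
]
# Column order copied from the example table.
vocabulary = ["the", "cat", "dog", "sat", "barked", "on", "at", "mailman"]
column_of = {word: j for j, word in enumerate(vocabulary)}

# One row per document, one column per word.
counts = np.zeros((len(texts), len(vocabulary)), dtype=np.int64)
for i, text in enumerate(texts):
    # Naive tokenization: lowercase, then split on whitespace.
    for word in text.lower().split():
        counts[i, column_of[word]] += 1

print(counts)
zero_fraction = 1 - np.count_nonzero(counts) / counts.size
print(f"Fraction of zeros: {zero_fraction:.1%}")
```

Even with two tiny documents and eight words, more than a third of the cells are zero.  That share grows quickly as the vocabulary grows.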

Here's where that becomes a problem: *storing a zero still requires storing something.*  Back to the "what buildings does UTA own" example, the naive approach to constructing the matrix in step 2 is the equivalent of writing down every building in Texas, and a simple "yes/no" for whether or not it's owned by UTA.  It's hugely wasteful.

*Sparse arrays* address this problem in the same basic way as we did in the UTA example.  They use some tricks to not even bother storing entries that are 0.  There are a lot of different ways to write the code to do this, each one optimized for different use cases, but conceptually they all boil down to this: why not just store a list of coordinates and values, but only for non-zero entries?  So, rather than a huge matrix of mostly zeros, where every zero has to be stored separately, why not just store something like:

```
my_array = [
    {"row": 0, "column": 0, "value": 1},
    {"row": 0, "column": 1000, "value": 6},
    {"row": 10000, "column": 56, "value": 12},
    ...
]
```

This is exactly what sparse arrays do.  By doing this, they can save *huge* amounts of memory.  In my own experience, I've seen cases where converting to a sparse array format made something go from nearly 100 *gigabytes* down to about 200 megabytes.  That's 1/500th the original space requirements--there were a lot of zeros in that data.
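Scipy's coordinate ("COO") format is the closest match to the listing above.  Here is a hedged sketch that stores just the three entries shown there.  The shape is an assumption (the smallest one that fits those coordinates), and it uses `coo_matrix` because that class is available in older Scipy versions too.

```python
import numpy as np
from scipy import sparse

# The three (row, column, value) entries from the listing.
rows = np.array([0, 0, 10000])
cols = np.array([0, 1000, 56])
values = np.array([1, 6, 12])

# The shape is an assumption: just large enough to hold every coordinate.
my_array = sparse.coo_matrix((values, (rows, cols)), shape=(10_001, 1_001))

print(f"Stored entries: {my_array.nnz}")
print(f"Total cells: {my_array.shape[0] * my_array.shape[1]:,}")

# COO is handy for construction; convert to CSR to look up single cells.
as_csr = my_array.tocsr()
print(f"Value at (0, 1000): {as_csr[0, 1000]}")
```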

Scipy implements sparse arrays in the `scipy.sparse` module.  Fortunately, they can be used almost anywhere Numpy arrays can be used!

NOTE: you probably want to install Scipy from the conda-forge channel (see top of this notebook).  That gives access to the sparse `*_array` classes, which have the newer, more Numpy-compatible interface.  Older versions of scipy (like the one available in the main conda channel) only have the older `*_matrix` classes, which are mostly, but not entirely, Numpy-compatible.

In [4]:
import numpy as np
from scipy import sparse

# Create a 2,500 x 2,500 array of random numbers, and set ~99% of them to 0.
rng = np.random.default_rng()
x = rng.random(size=(2_500, 2_500))
x[x <= 0.99] = 0

# How much memory does X take up?
# np.array.size -> number of items stored in the array
# np.array.dtype -> returns the dtype of data stored in the array
# dtype.itemsize -> how much space, in bytes, a value of this dtype takes up
print(f"This array takes up {x.size * x.dtype.itemsize:,} bytes of memory.")

This array takes up 50,000,000 bytes of memory.


In [5]:
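# Convert it to a sparse array.  When in doubt, CSR (csr_matrix here, or the
# newer csr_array) is generally a solid bet.
# Note: for a sparse array, .size counts only the stored (non-zero) values,
# so this figure leaves out the index arrays that the format also keeps.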
x = sparse.csr_matrix(x)
print(f"This array takes up {x.size * x.dtype.itemsize:,} bytes of memory.")

This array takes up 500,336 bytes of memory.
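The figure above counts only the stored values.  A CSR matrix also keeps two index arrays, so a fuller estimate adds those in.  This is a sketch on a smaller array with an arbitrary seed; the exact byte counts depend on the index dtype Scipy picks.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
dense = rng.random(size=(1_000, 1_000))
dense[dense <= 0.99] = 0

csr = sparse.csr_matrix(dense)

# A CSR matrix keeps three Numpy arrays: the stored values, the column
# index of each stored value, and one pointer per row (plus one).
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"Dense:  {dense.nbytes:,} bytes")
print(f"Sparse: {sparse_bytes:,} bytes (values + indices + row pointers)")
```

The index arrays add noticeable overhead, but at roughly 1% density the sparse version is still far smaller than the dense one.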


Note: don't do anything to a sparse array that adds to, or subtracts from, *all* of its values.  That turns every zero into a non-zero value that has to be stored, and the result is actually *worse* than storing the data in a dense format, because the sparse formats carry some overhead (which is normally far outweighed by the space savings).  A sparse array will also be slower than a dense array for all sorts of math operations if a good share of its items are non-zero.

Fortunately, the sparse array formats in Scipy just don't let you do this in most cases.

In [6]:
x = x + 1
print(f"This array takes up {x.size * x.dtype.itemsize:,} bytes of memory.")

NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported
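Operations that leave zeros as zeros are still allowed.  A minimal sketch on a tiny made-up matrix: scaling keeps the result sparse, and if you really only want to shift the *stored* values, you can work on the `.data` attribute directly.

```python
import numpy as np
from scipy import sparse

dense = np.array([
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
])
m = sparse.csr_matrix(dense)

# Scaling keeps every zero at zero, so the result stays sparse.
doubled = m * 2

# To shift only the values that are actually stored, work on .data directly.
shifted = m.copy()
shifted.data += 1

print(doubled.toarray())
print(shifted.toarray())
```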

If needed, you can convert back to a dense/Numpy array using the `.toarray()` method.

In [7]:
x = x.toarray()
print(x)
print(type(x))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
<class 'numpy.ndarray'>


Sparse arrays, since they don't even store zero values, can actually be *way* faster than an identical dense array when it comes to some math operations, e.g. matrix multiplication.  However, you usually need a decently large matrix to see these benefits.

In [8]:
x_sparse = sparse.csr_matrix(x)
x_dense = x_sparse.toarray()

from timeit import timeit
print("Matrix multiplication, sparse matrix:", timeit("x_sparse @ x_sparse", globals=globals(), number=1))
print("Matrix multiplication, dense matrix:", timeit("x_dense @ x_dense", globals=globals(), number=1))

Matrix multiplication, sparse matrix: 0.012221199999999044
Matrix multiplication, dense matrix: 10.268713399999982


As a little aside, sparse arrays in scipy are built on Numpy arrays, just using some clever tricks.  (e.g., using one array to store the row index, one for the column, and one for the values).  So, they support all of the same dtypes as numpy: `np.int64`, `np.float64`, `np.complex128`, etc.
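You can peek at those underlying arrays yourself.  This sketch uses the coordinate format (`coo_matrix`) on a tiny made-up matrix, since its `row`, `col`, and `data` attributes map directly onto the description above.

```python
import numpy as np
from scipy import sparse

dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 0],
], dtype=np.int64)
coo = sparse.coo_matrix(dense)

# Each attribute below is an ordinary Numpy array.
print("rows:   ", coo.row)
print("columns:", coo.col)
print("values: ", coo.data, coo.data.dtype)

# Changing dtype works the same way as in Numpy.
as_float32 = coo.astype(np.float32)
print(as_float32.dtype)
```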

## When should you use sparse arrays?

You might be wondering when, exactly, it becomes more advantageous to use sparse arrays rather than dense ones, and unfortunately, that cutoff isn't possible to reason about in general.  Usually, you'll need two criteria to be true:

1. Your data is very sparse.  The large majority of your values are 0.
2. Your data is reasonably large.

There are, though, no real hard-and-fast rules, or even rules of thumb, on these two criteria.  *But,* most data tend to fall pretty obviously at one extreme or another.  Either your data is obviously pretty dense, or it's obviously extremely sparse.  Either your data is pretty small, or it's obviously pretty large.  Usually, it's very clear when you need to use a sparse array format, because the kinds of problems that sparse arrays solve tend to be *really* obvious when you run into them.
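If you'd like numbers to look at anyway, here is a rough sketch of a helper that reports how sparse a dense array is.  `describe_sparsity` is a hypothetical function written for this notebook, not part of Scipy, and its CSR size estimate assumes 4-byte index arrays.

```python
import numpy as np

def describe_sparsity(x):
    """Summarize how sparse a dense 2-D Numpy array is.

    The CSR estimate assumes 4-byte (int32) index arrays, which is an
    assumption made for this sketch, not something Scipy guarantees.
    """
    nonzero = np.count_nonzero(x)
    n_rows = x.shape[0]
    return {
        "density": nonzero / x.size,
        "dense_bytes": x.nbytes,
        "estimated_csr_bytes": nonzero * (x.dtype.itemsize + 4) + (n_rows + 1) * 4,
    }

# A 100 x 100 array with a single non-zero entry.
example = np.zeros((100, 100))
example[0, 0] = 1.0
print(describe_sparsity(example))
```

This only measures; it doesn't decide for you.  It's a quick way to check which extreme your data falls at.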

In my experience, sparse arrays are an absolutely indispensable--if niche--tool to have in your belt.  They can sometimes be the one tool that turns a problem from "impossible to run on my computer" to "I can run it, scale up my data, and it still only takes a minute or two."