*This Jupyter notebook is part of Arizona State University's course CAS 523 (Methods for Complex Systems Science: Statistics and Dimensionality Reduction) and was written by Bryan Daniels. It was last updated September 15, 2022.*

*This assignment uses publicly available data from the following publication: Sosna MMG, Twomey CR, Bak-Coleman J, Poel W, Daniels BC, Romanczuk P, Couzin ID. 2019. [Individual and collective encoding of risk in animal groups](https://www.pnas.org/doi/10.1073/pnas.1905585116). _Proceedings of the National Academy of Sciences._ 116(41): 20556-20561.  The full dataset is available here: https://doi.org/10.5061/dryad.sn02v6x5x.*

# Using regression to find predictors of decisions

In this exercise, we will practice using statistical regression to make predictions about a complex system.  We will also use regularization to produce a regression that is easier to interpret.

The data we will work with come from a collaboration with a group of biologists interested in the collective movement of animals.  Here, they observed a school of fish (golden shiners) in shallow water.  The lab setup was isolated so that there were minimal visual or auditory cues coming from the external environment.  In this environment, the schools typically had relatively boring schooling behavior, but every once in a while one fish would suddenly become startled.  Its sudden movement would sometimes induce other fish to startle as well, and these startling cascades could spread through the school.

Our question here is: What cues from startled neighbors are individual fish using to decide whether to startle themselves?  

For this exercise, we will focus on using data to determine predictors of startling behavior.  In our research, we used such an analysis to build a quantitative model of startling cascades.  

## Get set up and load the data

First load our usual set of standard packages:

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.rcParams.update({'font.size': 18}) # increases font size on plots
from pathlib import Path # to handle file paths across all operating systems

Now load the dataset we will use:

In [None]:
data = pd.read_csv(Path('data/SosnaEtAl2022/first_responders_srk1.csv'))

Take a look at what the data include:

In [None]:
data

These data are focused on so-called "first responders".  The basic format is that each startle cascade has a unique identifier given in `Event_raw`.  For each startle cascade, initiated by some individual fish in the school, these data contain information about which fish (if any) was the *first to respond* by startling itself.  This first responder is identified by a row of the data where `Response` has a value of 1.  There are also rows corresponding to the other fish that did *not* respond first to the initial startle, identified by rows of the data where `Response` has a value of 0.

The other columns we will focus on here are potential predictors of being a first responder: the column `Dist_metric` and all columns to the right of that.  The exact meaning of each is not so important for this exercise, but in case you are interested, here are the descriptions from the `README.md` file:

* `Dist_metric`: metric distance between initiator and fish
* `Log_dist_metric`: the log (base 10) of metric distance
* `Dist_topological`: the topological distance, i.e. ranked metric distance. If the initiator was the closest fish to the focal individual, topological distance would be 1.
* `Ang_area`: the angular area subtended by the initiator on the focal fish
* `Rank_area`: the rank of the angular area subtended by the initiator on the focal fish. If the initiator takes up the most area in the focal's vision, rank area is 1.
* `Loom`: the change in angular area of the initiator on the focal fish from the frame prior to the startle through 10 frames after (FPS = 120)
* `Log_loom`: the log (base 10) of loom
* `Ang_pos`: the angle (in radians) between the head of the initiator and the head of the focal. This measures whether the initiator was to the left of the focal, in front of it, behind it, etc.
* `Heading`: the absolute value of the angle of the initiator relative to the focal individual. 0 indicates facing the same direction, pi indicates facing opposite directions. Wraps at 2*pi.

(A video of one of the startle cascades is available [here](https://www.pnas.org/doi/suppl/10.1073/pnas.1420068112/suppl_file/pnas.1420068112.sm01.mp4) if you want to visualize what's going on.  The initiator is red and the fish that can see the initiator are blue.  See if you can spot the first responder.)

We have a good amount of data here:

In [None]:
print("In this dataset, we have {} examples of fish behavior (startling or not startling) "
      "from {} startle cascades.".format(len(data),data['Event_raw'].nunique()))

Let's split our data explicitly into the thing we are trying to predict (the `Response` column) and the potential predictors:

In [None]:
responseData = data['Response']
predictorData = data.loc[:,'Dist_metric':]

Finally, in our exercise we will test how well our regression can do at predicting which fish will respond first.  To do this, we will fit our regression parameters to about half the data (the "training set") and see how well the regression can predict which fish are the first responders in the other data that we haven't used in the fit (the "test set").  

Here we split both the response and predictor data into training and test sets:

In [None]:
numTrain = 7200 # the number of data rows we will use as training data

# training set = first numTrain rows
dataTrain = data.iloc[:numTrain]
responseDataTrain,predictorDataTrain = responseData.iloc[:numTrain],predictorData.iloc[:numTrain]
# test set = all remaining rows
dataTest = data.iloc[numTrain:] 
responseDataTest,predictorDataTest = responseData.iloc[numTrain:],predictorData.iloc[numTrain:]
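Because the split above is made by row number rather than by cascade, it can be worth checking whether any single cascade ends up partly in each set.  Here is a minimal sketch of such a check on a made-up table (the table `toy` and the split point `numTrainToy` are hypothetical and are not the fish data):

```python
import pandas as pd

# made-up event labels for illustration only (not the fish data)
toy = pd.DataFrame({'Event_raw': [1, 1, 2, 2, 2, 3, 3]})
numTrainToy = 4  # hypothetical split point for the toy table

# events that contribute rows to each side of the split
trainEvents = set(toy['Event_raw'].iloc[:numTrainToy])
testEvents = set(toy['Event_raw'].iloc[numTrainToy:])

# any event listed here has its rows divided between the two sets
print('Events appearing in both sets:', trainEvents & testEvents)
```

The same idea should carry over to the real data by swapping in `data` and `numTrain`.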

❓ **Why is it safer to test our method's ability to predict responses in the test data as opposed to the training data?  How is this connected to the idea of predictable patterns in the data versus unpredictable noise?**

✳️ **Answer:** 

## Perform logistic regression

Since we have continuous predictors and a binary output (startled versus not startled), logistic regression is a good candidate for fitting a simple model.  

Recall that logistic regression fits one coefficient per predictor variable; each coefficient multiplies its predictor, and the products are summed up, similar in a sense to PCA.  The difference from PCA is that we are "working in log space": this sum (plus a constant intercept) equals the logarithm of the ratio of the probabilities of startled versus not startled, also known as the log-odds.  (The "logistic" part of the name refers to the logistic function, which converts this sum back into a probability.)
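To make the "working in log space" idea concrete, here is a minimal sketch on synthetic data (not the fish data; the arrays `X` and `y` and the model `toy` are made up for illustration, and the fit uses `sklearn`'s default settings).  It checks that the weighted sum of predictors plus the intercept matches the log of the ratio of the predicted probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data for illustration only (not the fish data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
trueLogOdds = X[:, 0] - 2 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-trueLogOdds))).astype(int)

toy = LogisticRegression(max_iter=1000).fit(X, y)

# weighted sum of the predictors, plus the intercept
linearSum = X @ toy.coef_[0] + toy.intercept_[0]

# log of the ratio of predicted probabilities (class 1 over class 0)
probs = toy.predict_proba(X)
logOdds = np.log(probs[:, 1] / probs[:, 0])

# True if the two agree up to numerical precision
print(np.allclose(linearSum, logOdds, atol=1e-6))
```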

Logistic regression is performed by `sklearn`'s function `linear_model.LogisticRegression`:

In [None]:
from sklearn.linear_model import LogisticRegression

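We'll first run vanilla logistic regression with no regularization "penalty" (hence `penalty='none'`; in `sklearn` version 1.2 and later, use `penalty=None` instead, since the string form was deprecated and later removed).  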

(Side technical note: The algorithm for finding the best fit parameters is iterative, and I've used `max_iter=1000` here to allow the algorithm to converge to the best solution.  The default `max_iter=100` is quicker to run but does not allow enough iterations for full convergence.)

In [None]:
regression = LogisticRegression(penalty='none',max_iter=1000)
regressionResults = regression.fit(predictorDataTrain,responseDataTrain)
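If you are curious whether a fit ran out of iterations, one option is to look at the fitted model's `n_iter_` attribute and at any `ConvergenceWarning` that `sklearn` raises.  The following is only a sketch on synthetic data (the arrays and the model `toy` are made up, and it uses the default penalty rather than the settings above):

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# synthetic data for illustration only; predictors on very different
# scales tend to make the iterative fit converge more slowly
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = (rng.random(400) < 0.3).astype(int)

for maxIter in [5, 1000]:
    # collect any warnings raised during the fit
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        toy = LogisticRegression(max_iter=maxIter).fit(X, y)
    hitLimit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    print('max_iter = {}: used {} iterations, convergence warning = {}'.format(
        maxIter, toy.n_iter_[0], hitLimit))
```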

We can look at the resulting fit parameters for coefficients:

In [None]:
coeffs = pd.Series(regressionResults.coef_[0],index=predictorDataTrain.columns)
coeffs

Using the fitted regression, we can also ask for the probability of startling versus not startling for a particular set of predictor values.  For instance, the predictors for the first three training data points look like this:

In [None]:
predictorDataTrain.iloc[:3]

We can insert this into the method `regressionResults.predict_proba` to compute the regressed model's predictions for the probability of startling and not startling for each of the three cases:

In [None]:
regressionResults.predict_proba(predictorDataTrain.iloc[:3])

We see that the estimated probabilities of not startling (the first column) are much larger than those of startling (the second column) for these cases.
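The column order of `predict_proba` follows the fitted model's `classes_` attribute.  As a quick sketch on synthetic data (not the fish data; `X`, `y`, and `toy` are made up for illustration), we can confirm the order and that each row of probabilities sums to 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data for illustration only: mostly 0s with some 1s
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(size=200) > 1).astype(int)

toy = LogisticRegression(max_iter=1000).fit(X, y)
probs = toy.predict_proba(X[:3])

print(toy.classes_)       # column order of predict_proba: class 0, then class 1
print(probs.sum(axis=1))  # the two probabilities in each row sum to 1
```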

To get a feeling for how these probabilities depend on the values of the predictors, here's a plot for the probabilities as a function of a changing log distance:

In [None]:
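logDistList = np.linspace(-3,3)
# copy the first training row once per value, keeping the column names
predictorsList = pd.concat([predictorDataTrain.iloc[[0]]]*len(logDistList),ignore_index=True)
# shift only the log distance (second column); all other predictors stay fixed
predictorsList.iloc[:,1] = predictorsList.iloc[:,1] + logDistList
probs = regressionResults.predict_proba(predictorsList)
# plot against the shifted log distance values themselves
plt.plot(predictorsList.iloc[:,1],probs)
plt.legend(['No startle','Startle'])
plt.xlabel('Log distance')
plt.ylabel('Probability');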

❓ **Does this plot make intuitive sense based on how you would expect fish to behave?** 

✳️ **Answer:** 

## Try to predict first startlers

One way we can test our model is to see whether it can correctly predict which fish will be the first to respond to the initial startle in a trial, given the values of the predictors for all the fish.

I wrote the following function to extract from the data the first fish that actually responded in each trial (for trials that had a responder), as well as the regression's prediction for which fish is the most likely to respond first:

In [None]:
def trueAndPredictedFirstStartles(data,regressionResults):
    # for each event in data, record the true first startler and the regression's predicted first startler
    firstStartler,firstStartlerPredicted = [],[]
    for event in data['Event_raw'].unique():
        # restrict to data in a single cascade event
        dataEvent = data[data['Event_raw']==event]
        if len(dataEvent[dataEvent['Response']==1]) > 0:
            # if a responding fish startled, record it:
            # record startler as index of fish with response=1
            startler = dataEvent[dataEvent['Response']==1].index[0]

            # now use regression to predict the first startler
            dataEventPredictors = dataEvent.loc[:,'Dist_metric':]
            predictedProbs = regressionResults.predict_proba(dataEventPredictors)[:,1]
            startlerPredicted = dataEvent.index[np.argmax(predictedProbs)]

            firstStartler.append(startler)
            firstStartlerPredicted.append(startlerPredicted)
    return np.array(firstStartler),np.array(firstStartlerPredicted)
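The core of this function is a "pick the highest-probability fish within each event" step.  As an illustration only, here is the same idea on a tiny made-up table using `groupby` and `idxmax` (the table `toy` and its `prob` column are hypothetical stand-ins for the data and the output of `predict_proba`):

```python
import pandas as pd

# made-up table: two events, with made-up probabilities for each fish
toy = pd.DataFrame({
    'Event_raw': [1, 1, 1, 2, 2],
    'Response':  [0, 1, 0, 0, 1],
    'prob':      [0.1, 0.6, 0.3, 0.7, 0.2],
})

# index label of the true first responder in each event that has one
trueIdx = toy[toy['Response'] == 1].groupby('Event_raw')['Response'].idxmax()
# index label of the fish with the highest probability in each event
predIdx = toy.groupby('Event_raw')['prob'].idxmax()

print(pd.DataFrame({'true': trueIdx, 'predicted': predIdx}))
```

In this toy table the prediction matches the true responder in event 1 (row 1) but not in event 2 (row 3 is predicted, row 4 responded).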

Trying this on the training data:

In [None]:
firstStartlers,firstStartlersPredicted = trueAndPredictedFirstStartles(dataTrain,regressionResults)

In [None]:
firstStartlersPredicted

In [None]:
firstStartlers

We see that in many cases, the correct fish is predicted!  This is an indication that our regression is doing something right.  How well is it doing?  The following function calculates the accuracy:

In [None]:
def predictionAccuracy(trueStartlers,predictedStartlers):
    numCorrect = np.sum(trueStartlers==predictedStartlers)
    numTotal = len(trueStartlers)
    print('Proportion of trials with a correct prediction = {}'.format(numCorrect/numTotal))

In [None]:
predictionAccuracy(firstStartlers,firstStartlersPredicted)
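For context when reading this proportion, it can help to compare against guessing a fish at random within each event.  Here is a rough sketch of that baseline on made-up event labels, assuming every event has exactly one first responder among its rows (the table `toy` is hypothetical, not the fish data):

```python
import pandas as pd

# made-up event labels: event 1 has 4 candidate fish, event 2 has 2
toy = pd.DataFrame({'Event_raw': [1, 1, 1, 1, 2, 2]})

# a uniformly random guess within an event is correct with
# probability 1/(number of fish in that event)
groupSizes = toy.groupby('Event_raw').size()
chanceAccuracy = (1 / groupSizes).mean()

print('Chance accuracy = {}'.format(chanceAccuracy))
```

For the toy table this is (1/4 + 1/2)/2 = 0.375.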

Now let's compare how we do in predictions using the test data.

❓ **Use the functions `trueAndPredictedFirstStartles` and `predictionAccuracy` as above to compute the accuracy of our regression's predictions in the test data that we held out above.** *Hint: We called the test data `dataTest` and the training data `dataTrain`.  Do a sanity check: Do you expect the test accuracy to be larger or smaller than the training accuracy?*

In [None]:
# ✳️ **Answer:**

## Compare non-regularized to regularized logistic regression

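Now let's use a regularizer to "penalize" having many fit coefficients that are nonzero.  The following code performs the regression with a relatively strong L1 penalty (in `sklearn`, `C` is the inverse of the regularization strength, so a small value like `C=0.01` means a strong penalty):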

In [None]:
regressionReg = LogisticRegression(C=0.01,penalty='l1',max_iter=10000,solver='liblinear')
regressionResultsReg = regressionReg.fit(predictorDataTrain,responseDataTrain)

Looking again at the fit coefficients:

In [None]:
coeffsReg = pd.Series(regressionResultsReg.coef_[0],index=predictorDataTrain.columns)
coeffsReg
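To get a feel for what the `C` parameter does, here is a small sketch on synthetic data (not the fish data; `X`, `y`, and `toy` are made up, and only the first two synthetic predictors actually matter).  It counts how many coefficients remain nonzero for a few values of `C`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic data for illustration only
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 6))
# only the first two synthetic predictors influence the outcome
logOddsTrue = 2 * X[:, 0] - 1 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-logOddsTrue))).astype(int)

numNonzero = {}
for C in [10, 0.1, 0.01]:
    # smaller C means a stronger L1 penalty
    toy = LogisticRegression(C=C, penalty='l1', solver='liblinear').fit(X, y)
    numNonzero[C] = np.count_nonzero(toy.coef_[0])
    print('C = {}: {} nonzero coefficients out of {}'.format(C, numNonzero[C], X.shape[1]))
```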

❓ **How are the regularized coefficients different from the non-regularized version above?  What are the conceptual advantages of interpreting the regularized case?  How is this related to dimensionality reduction?**

✳️ **Answer:** 

❓ **Compute the accuracy of predictions on the test data using this regularized regression.  Do we suffer much in predictive power using the regularized coefficients?**

In [None]:
# ✳️ **Answer:**