# Statistical Analyses with `statsmodels`

**Learning Objectives**:
- Introduce the `statsmodels` package for statistical analysis.
- Calculate a linear regression.
- Perform a simple t-test.

****

`statsmodels` is a Python package for statistical analysis. It lets you fit and test many statistical models directly in Python, without switching to another language or tool. In this section, we introduce two basic statistical methods available through `statsmodels`: the t-test and linear regression.

In [1]:
# Install statsmodels if necessary
!pip install statsmodels



In [2]:
import statsmodels.api as sm
import pandas as pd
import numpy as np

In [4]:
# Load in data and drop null values
df = pd.read_csv('penguins.csv').dropna()

## Performing a t-test

A t-test checks whether the difference between the means of two groups is statistically significant.

Let's look at the difference between species of penguin. For example, for the Adelie and Chinstrap species, let's see if there's a significant difference in flipper length. 

We proceed as follows:

1. Subset to the appropriate rows and column using `df.loc[]`.
2. Run the `ttest_ind` function on the two series.

In [5]:
adelie = df.loc[df['species'] == 'Adelie', 'flipper_length_mm']
chinstrap = df.loc[df['species'] == 'Chinstrap', 'flipper_length_mm']

In [6]:
res = sm.stats.ttest_ind(adelie, chinstrap)
res

(-5.797900789295094, 2.413241410912911e-08, 212.0)

In [7]:
print('t-score:', res[0])
print('p-value:', res[1])
print('Degrees of Freedom:', res[2])

t-score: -5.797900789295094
p-value: 2.413241410912911e-08
Degrees of Freedom: 212.0


These and other statistical tests can be found in the [documentation](https://www.statsmodels.org/dev/api.html). 

## Performing Linear Regression

Regression is another useful part of the `statsmodels` package. We will work through an example with Ordinary Least Squares (OLS) regression, using `sm.OLS()`.

For the penguins data, let's predict body mass as a function of culmen length, culmen depth, and flipper length. 

This regression function takes two inputs: 
- An array $X$ with the input variables (one or more columns). In this case, it will be an array containing culmen length, culmen depth, and flipper length.
- An array $y$ with the output variable (single column). In this case, it will be body mass.

All variables must be numeric (so that they can be converted to a numpy array within the function). The arrays must also have the same number of samples.
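
Both requirements can be checked before fitting. The following is a minimal sketch on a small made-up frame; `toy`, `X_toy`, and `y_toy` are hypothetical names, and the values are for illustration only.

```python
import pandas as pd

# Small hypothetical frame; column names mirror the penguins data
toy = pd.DataFrame({
    'flipper_length_mm': [181.0, 186.0, 195.0, 193.0],
    'body_mass_g': [3750.0, 3800.0, 3250.0, 3450.0],
    'species': ['Adelie', 'Adelie', 'Adelie', 'Adelie'],
})

X_toy = toy[['flipper_length_mm']]
y_toy = toy['body_mass_g']

# Every input column should have a numeric dtype
all_numeric = all(
    pd.api.types.is_numeric_dtype(dtype) for dtype in X_toy.dtypes
)

# X and y should contain one row per sample
same_length = len(X_toy) == len(y_toy)

print('all numeric:', all_numeric)
print('same number of samples:', same_length)
```

A text column such as `species` would fail the numeric check, so it has to be left out of `X` or encoded as numbers first.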

In [8]:
# Set up X and y
X = df[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm']]
y = df['body_mass_g']

The model is set up using `sm.OLS(y, X)`, which specifies the data to use. Note the argument order: the output `y` comes first. The `.fit()` method fits the model and returns a results object, which we save in a variable. That object has a `.summary()` method that reports each coefficient along with overall statistical properties of the model.

In [9]:
results = sm.OLS(y, X).fit()
results

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fca3d1c6990>

In [10]:
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:            body_mass_g   R-squared (uncentered):                   0.988
Model:                            OLS   Adj. R-squared (uncentered):              0.988
Method:                 Least Squares   F-statistic:                              9442.
Date:                Sat, 30 Apr 2022   Prob (F-statistic):                   3.32e-320
Time:                        06:57:33   Log-Likelihood:                         -2522.1
No. Observations:                 334   AIC:                                      5050.
Df Residuals:                     331   BIC:                                      5062.
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                        coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------

## Challenge 1: More `statsmodels`

Let's practice with some more `statsmodels` functions.

Choose one of the following options (or both!):

1. In the penguins dataset, conduct pairwise t-tests for body mass between all three species. Essentially, this means a t-test for Adelie vs Chinstrap, Adelie vs Gentoo, and Chinstrap vs Gentoo. Did you use a loop for this? Why or why not?
2. Set up a new linear regression. In this case, normalize each of the columns by subtracting the mean of the column and dividing by the standard deviation. Check your normalization (the mean should be 0 and the standard deviation should be 1 for each of the columns), and re-run the linear regression. What does the model say now?

Make a note of any barriers you run into, and remember the general steps of coding!

In [None]:
# YOUR CODE HERE
