# Statsmodels
**Learning Objectives**:
- Introduce the statsmodels package for statistical analysis
- Calculate a linear regression
- Perform a simple t-test


## Introduction to Statsmodels

Statsmodels is a Python package for statistical analysis. It lets you fit statistical models and run statistical tests directly in Python, without switching to another language or software package. In this section we will introduce two basic statistical methods available through statsmodels: the t-test and linear regression.


In [41]:
import statsmodels.api as sm
import pandas as pd
import numpy as np

In [42]:
# Load the data and drop rows with null values
df = pd.read_csv('penguins_size.csv').dropna()


## T-test

A two-sample t-test checks whether the means of two independent groups differ. Here we compare flipper length between Adelie and Chinstrap penguins. `sm.stats.ttest_ind(x1, x2)` takes the two samples and returns three values: the t-statistic, the p-value, and the degrees of freedom.

In [43]:
adelie = df.loc[df['species']=='Adelie','flipper_length_mm']
chinstrap = df.loc[df['species']=='Chinstrap','flipper_length_mm']



res = sm.stats.ttest_ind(adelie,chinstrap)
print('t-value:',res[0])
print('p-value:',res[1])
print('DoF:',res[2])

t-value: -5.797900789295094
p-value: 2.413241410912911e-08
DoF: 212.0


These and other statistical tests can be found in the [documentation](https://www.statsmodels.org/dev/api.html). 

## Linear Regression

Regression is another useful part of the statsmodels package. We will work through an example with OLS (Ordinary Least Squares) regression, using `sm.OLS()`. For the penguins data, let's predict body mass as a function of culmen length/depth and flipper length. 

This regression function takes two inputs: 
- An array y with the output variable (single column)
- An array X with the input variables (one or more columns)

All variables must be numeric (so that they can be converted to a numpy array within the function). X and y must also have the same number of rows, one per observation.
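
Both requirements can be checked before fitting. The following is a minimal sketch on a hypothetical three-row table (`toy`, `X_toy`, and `y_toy` are illustrative names, and the values are made up); the same checks could be applied to the penguins `X` and `y`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy table standing in for the penguins data
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'flipper_length_mm': [181.0, 217.0, 195.0],
    'body_mass_g': [3750.0, 5200.0, 3800.0],
})

y_toy = toy['body_mass_g']
X_toy = toy[['flipper_length_mm']]

# Every input column should have a numeric dtype
all_numeric = all(is_numeric_dtype(X_toy[col]) for col in X_toy.columns)

# X and y need one row per observation, so the row counts must match
same_rows = X_toy.shape[0] == y_toy.shape[0]

print('all numeric:', all_numeric)
print('same number of rows:', same_rows)

# A text column such as species would fail the numeric check
print('species is numeric:', is_numeric_dtype(toy['species']))
```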


In [44]:
# Set up X and y
y = df['body_mass_g']
X = df.loc[:,['culmen_length_mm','culmen_depth_mm','flipper_length_mm']]



`sm.OLS(y, X)` sets up the model by specifying which data to use. Calling `.fit()` estimates the model and returns a results object, which we save in a variable. The results object has a `.summary()` method that reports each coefficient along with overall statistical properties of the model.

In [45]:


results=sm.OLS(y,X).fit()
print(results.summary())

## Challenge 1: further statsmodels

Let's practice with some more statsmodels functions.

Choose one of the following options (or both!): 

1. In the penguins dataset, conduct pairwise t-tests for body mass between all three species. (Essentially, this means a t-test for Adelie vs Chinstrap, Adelie vs Gentoo, and Chinstrap vs Gentoo). Did you use a loop for this? Why or why not?

2. Set up a new linear regression. This time, normalize each column by subtracting the column's mean and dividing by its standard deviation. Check your normalization (the mean should be 0 and the standard deviation 1 for each column), then re-run the linear regression. What does the model say now?


Make notes of what barriers you run into, and remember the general steps of coding!

In [46]:
## your model here