# Statistics in Python  
We've already seen how libraries like `numpy` and `pandas` provide functions for finding the `mean`, `median`, etc. But what if you want to do something more, such as running some statistical tests? It would be great if the Python community had already written libraries that implement those functions, so that people like you and me don't have to, and can rest assured that the functions we're using have been tried and tested by many people before us. Luckily, they have!  


`scipy` and `statsmodels` are powerful Python libraries used for scientific and statistical computations. In this tutorial, I'll introduce you to the basic functionality of both libraries.  

In [3]:
!pip install scipy statsmodels #check we have them!

Defaulting to user installation because normal site-packages is not writeable

## SciPy

In [3]:
import scipy
import scipy.stats as ss #we can import the module we want! SciPy has many modules!

We are going to perform a t-test using `scipy`!  


A t-test is a statistical test used to compare the means of two groups and determine if they are significantly different from each other. It's a fundamental tool in statistics, often used to assess whether an observed difference between two groups is likely due to a real effect or if it could have occurred by random chance. We will cover this and more in Week 9!

In [4]:
# run a t-test

data1 = [3, 4, 5, 6, 7]
data2 = [7, 8, 9, 10, 11]

t_stat, p_value = ss.ttest_ind(data1, data2)
print("T-statistic:", t_stat)
print("P-value:", p_value)

T-statistic: -4.0
P-value: 0.003949772803445322
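
To see where these numbers come from, here is a minimal sketch that recomputes the same statistic by hand. It assumes the default behaviour of `ss.ttest_ind` (equal variances in both groups, two-sided test); the variable names `pooled_var`, `std_err`, `t_manual` and `p_manual` are just illustrative.

```python
import numpy as np
import scipy.stats as ss

data1 = np.array([3, 4, 5, 6, 7])
data2 = np.array([7, 8, 9, 10, 11])
n1, n2 = len(data1), len(data2)

# Pooled sample variance (ddof=1 gives the unbiased sample variance)
pooled_var = ((n1 - 1) * data1.var(ddof=1) + (n2 - 1) * data2.var(ddof=1)) / (n1 + n2 - 2)

# Standard error of the difference between the two means
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t-statistic: difference in means divided by its standard error
t_manual = (data1.mean() - data2.mean()) / std_err

# Two-sided p-value from the t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * ss.t.sf(abs(t_manual), df=n1 + n2 - 2)

print("Manual T-statistic:", t_manual)
print("Manual P-value:", p_manual)
```

Both values should agree with the output of `ss.ttest_ind` above, which is a handy sanity check that we understand what the function is doing.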


In [6]:
#help(ss.ttest_ind)
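
Uncommenting the `help` call lists the optional arguments. One worth knowing is `equal_var`: setting it to `False` runs Welch's t-test, which does not assume the two groups share the same variance. The sketch below reuses the data from above; the significance threshold `alpha = 0.05` is an assumption chosen for illustration, not something fixed by `scipy`.

```python
import scipy.stats as ss

data1 = [3, 4, 5, 6, 7]
data2 = [7, 8, 9, 10, 11]

# Welch's t-test: does not assume equal variances in the two groups
welch = ss.ttest_ind(data1, data2, equal_var=False)

# The result can be unpacked or accessed by attribute
print("T-statistic:", welch.statistic)
print("P-value:", welch.pvalue)

# Compare against a significance level (0.05 is a common convention, assumed here)
alpha = 0.05
significant = welch.pvalue < alpha
print("Significant at alpha = 0.05?", significant)
```

Here the two groups have identical spreads and sizes, so Welch's version gives the same t-statistic as the standard test. With unequal spreads the two can differ.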

## Statsmodels

In [7]:
import statsmodels.api as sm

We can run a simple linear regression using `statsmodels`! (We can also do a multiple linear regression!)

In [8]:
# Sample data
x = [1, 2, 3, 4, 5, 13, 4, 6, 10, 12, 13, 13, 15, 1, 1, 22, 4, 15, 16, 20, 3]
y = [2, 3, 5, 4, 6, 14, 7, 8, 11, 13, 12, 6, 7, 7, 7, 12, 4, 21, 18, 2, 3]

# Add a constant for the intercept term
x = sm.add_constant(x)

model = sm.OLS(y, x).fit() # fit a linear regression by Ordinary Least Squares (OLS)
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.299
Model:                            OLS   Adj. R-squared:                  0.263
Method:                 Least Squares   F-statistic:                     8.123
Date:                Thu, 12 Oct 2023   Prob (F-statistic):             0.0102
Time:                        00:33:37   Log-Likelihood:                -60.255
No. Observations:                  21   AIC:                             124.5
Df Residuals:                      19   BIC:                             126.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.4372      1.641      2.705      0.0

We can also use `statsmodels` to run a logistic regression.

In [9]:
# Sample data
x = [1, 2, 3, 4, 5, 4, 5, 13, 4, 6, 10, 12, 13, 12, 6, 7, 7, 7, 12, 4, 21]
y = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0]

# Add a constant for the intercept term
x = sm.add_constant(x)

logit_model = sm.Logit(y, x).fit() # function for Logistic Regression
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.651861
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                   21
Model:                          Logit   Df Residuals:                       19
Method:                           MLE   Df Model:                            1
Date:                Thu, 12 Oct 2023   Pseudo R-squ.:                 0.05802
Time:                        00:35:52   Log-Likelihood:                -13.689
converged:                       True   LL-Null:                       -14.532
Covariance Type:            nonrobust   LLR p-value:                    0.1941
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0532      0.896      1.175      0.240      -0.704       2.810
x1            -0.1284      0.