# SciPy Tutorial

### By: Jun Song, Zimei Xu, Qin Che, Peiyu Li 

## Motivation 

SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.

It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. With SciPy an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.

SciPy on Python makes a powerful programming language available for use in developing sophisticated programs and specialized applications. 


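## Installation Instruction

SciPy can be installed with pip install scipy or conda install scipy. Once it is installed, load the statistics module by typing either of the following in an IPython notebook: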

In [62]:
import scipy.stats as stats

In [63]:
from scipy import stats
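
Before going further, it can help to confirm that SciPy and NumPy are importable in the current environment. The following is only a small sketch of such a check; it reads the version strings and assumes nothing beyond a standard installation.

```python
# Confirm that SciPy and its NumPy dependency can be imported
import numpy as np
import scipy
from scipy import stats

# The version strings are useful when comparing output with other tutorials,
# because the exact repr of some results differs between SciPy releases.
print("NumPy version:", np.__version__)
print("SciPy version:", scipy.__version__)
```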

## Platform Restriction

None. SciPy runs on all major platforms, including Windows, macOS, and Linux.

## Dependent Libraries 

SciPy depends on NumPy. pandas is not required by SciPy itself, but it is convenient when the data is stored in a DataFrame.

## Example 1

scipy.stats.pearsonr

scipy.stats.pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, the p-value calculation assumes that each dataset is normally distributed; the data do not need to be zero-mean. Like other correlation coefficients, this one varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
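
To make the meaning of the sign concrete, here is a minimal sketch with three small hand-picked datasets (the numbers are illustrative and not taken from the original notebook). The last case shows that a coefficient of 0 only rules out a linear relationship: y is fully determined by x, yet the correlation vanishes because the data are symmetric around zero.

```python
import numpy as np
from scipy import stats

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Exact increasing linear relationship: r is +1 (up to rounding)
r_pos, _ = stats.pearsonr(x, 3.0 * x + 1.0)

# Exact decreasing linear relationship: r is -1 (up to rounding)
r_neg, _ = stats.pearsonr(x, -0.5 * x + 4.0)

# y depends on x, but not linearly: r is 0 (up to rounding) for this data
r_zero, _ = stats.pearsonr(x, x ** 2)

print(r_pos, r_neg, r_zero)
```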



Parameters:

- x : (N,) array_like. Input.
- y : (N,) array_like. Input.

Returns:

- r : float. Pearson’s correlation coefficient.
- p-value : float. 2-tailed p-value.


In [64]:
import numpy as np
import pandas as pd
from scipy import stats

In [65]:
x=np.arange(10)
x
y=np.arange(11,21)
y

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [66]:
stats.pearsonr(x,y)

(1.0, 0.0)

# Minimal working example

## 1. Type Error

stats.pearsonr() requires two arguments; calling it with only one raises a TypeError.

In [15]:
a = np.array([1, 2, 3])
b = np.array([0, 4, 1])
x = stats.pearsonr(a)
x

TypeError: pearsonr() missing 1 required positional argument: 'y'

In [16]:
a = np.array([1, 2, 3])
b = np.array([0, 4, 1])
x = stats.pearsonr(a,b)
x

(0.24019223070763071, 0.84557904168873266)

## 2. Value Error 

Both arguments must have the same shape.

In [24]:
a = np.array([[1, 2, 3],[4, 5, 6]])
b = np.array([0, 4, 1])
x = stats.pearsonr(a,b)
x

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [25]:
a.shape

(2, 3)

In [26]:
b.shape

(3,)

The shapes of the two arguments differ, so Python raises a ValueError.
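
One way to avoid this kind of error is to check the shapes before calling stats.pearsonr(). The sketch below uses a hypothetical helper, checked_pearsonr, which is not part of SciPy; it rejects mismatched inputs with a clearer message and is then applied to each row of the 2-D array separately.

```python
import numpy as np
from scipy import stats

def checked_pearsonr(x, y):
    """Hypothetical wrapper (not a SciPy function): validate shapes first."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError(
            f"expected two 1-D arrays of equal length, got {x.shape} and {y.shape}"
        )
    return stats.pearsonr(x, y)

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([0, 4, 1])

# Passing the whole 2-D array is rejected by the shape check
try:
    checked_pearsonr(a, b)
    shape_error = None
except ValueError as exc:
    shape_error = str(exc)
print(shape_error)

# Each row of a has shape (3,), matching b, so row-wise calls succeed
row_results = []
for row in a:
    r, p = checked_pearsonr(row, b)
    row_results.append((r, p))
print(row_results)
```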

In [27]:
a = np.array([1, 2, 3])
b = np.array([0, 4, 1])
x = stats.pearsonr(a,b)
x

(0.24019223070763071, 0.84557904168873266)

### Alternative 1:

In [67]:
df=pd.DataFrame({'x':np.arange(10),'y':np.arange(11,21)})

In [68]:
df.corr()

|   | x   | y   |
|---|-----|-----|
| x | 1.0 | 1.0 |
| y | 1.0 | 1.0 |


The correlation matrix above gives only the coefficient, not a p-value. Neither pandas nor NumPy provides distribution functions, so they cannot compute the p-value directly. If needed, it can be derived by hand from the t distribution, but the process is cumbersome.
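
For reference, the manual route can be sketched as follows. Under the usual normality assumption, the statistic t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom when the true correlation is zero. The sample data here are illustrative, and the tail probability still comes from scipy.stats.t, which is exactly the piece that NumPy and pandas do not supply.

```python
import numpy as np
from scipy import stats

# Illustrative data, not from the original notebook
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

# Correlation coefficient from NumPy alone
r = np.corrcoef(x, y)[0, 1]

# t statistic with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))

# Two-sided p-value: twice the upper-tail probability of |t|
p_manual = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)

# Compare with the value reported by pearsonr
r_scipy, p_scipy = stats.pearsonr(x, y)
print(r, p_manual)
print(r_scipy, p_scipy)
```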

## Example 2

scipy.stats.linregress

scipy.stats.linregress(x, y=None)
Calculate a linear least-squares regression for two sets of measurements.



Parameters:

- x, y : array_like. Two sets of measurements. Both arrays should have the same length. If only x is given (and y=None), then it must be a two-dimensional array where one dimension has length 2. The two sets of measurements are then found by splitting the array along the length-2 dimension.

Returns:

- slope : float. Slope of the regression line.
- intercept : float. Intercept of the regression line.
- rvalue : float. Correlation coefficient.
- pvalue : float. Two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero.
- stderr : float. Standard error of the estimated gradient.

In [69]:
from scipy import stats
np.random.seed(423)
x = np.random.random(20)
y = np.random.random(20)

In [70]:
result = stats.linregress(x,y)

In [72]:
stats.linregress(x,y)

LinregressResult(slope=-0.18098477148618336, intercept=0.62454639529821299, rvalue=-0.17267819204022053, pvalue=0.46660579940866975, stderr=0.24332960374940393)

In [73]:
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

In [74]:
slope

-0.18098477148618336
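
As a rough sketch of how the returned values fit together, the slope and intercept can be used to compute fitted values, and the squared rvalue can be checked against the coefficient of determination computed from the residuals. The variable names y_hat, ss_res and ss_tot are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Same seed and data-generation steps as in the cells above
np.random.seed(423)
x = np.random.random(20)
y = np.random.random(20)

result = stats.linregress(x, y)

# Fitted values on the regression line
y_hat = result.intercept + result.slope * x

# R^2 from the residuals: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# For simple linear regression with an intercept, this equals rvalue squared
print(r_squared, result.rvalue ** 2)
```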

### Alternative 1

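Another Python package, statsmodels, can also fit a linear regression. Note that sm.OLS takes the dependent variable as its first argument and does not add an intercept unless the regressors are wrapped in sm.add_constant. The call below passes x first and adds no constant, so its single coefficient is not comparable to the linregress slope above.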

In [75]:
import statsmodels.api as sm

In [81]:
model=sm.OLS(x,y)

In [82]:
results=model.fit()

In [83]:
print(results.summary())

                            GLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.566
Model:                            GLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     24.76
Date:                Thu, 16 Feb 2017   Prob (F-statistic):           8.39e-05
Time:                        19:00:44   Log-Likelihood:                -10.699
No. Observations:                  20   AIC:                             23.40
Df Residuals:                      19   BIC:                             24.39
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.7911      0.159      4.976      0.0

### Alternative 2 

In [80]:
np.polynomial.polynomial.polyfit(x,y,1)

array([ 0.6245464 , -0.18098477])
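
Note the order of the returned coefficients: np.polynomial.polynomial.polyfit lists them from the lowest degree upward, so the first entry is the intercept and the second is the slope. A short sketch that checks this against linregress on the same seeded data:

```python
import numpy as np
from scipy import stats

# Same seed and data-generation steps as in the linregress example
np.random.seed(423)
x = np.random.random(20)
y = np.random.random(20)

# Degree-1 fit: coefficients come back as [constant term, linear term]
intercept_np, slope_np = np.polynomial.polynomial.polyfit(x, y, 1)

result = stats.linregress(x, y)

print(intercept_np, result.intercept)
print(slope_np, result.slope)
```

Unlike linregress, polyfit returns only the coefficients, with no correlation coefficient, p-value, or standard error.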

## References

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html

https://docs.scipy.org/doc/scipy-0.18.1/reference/tutorial/index.html