# Workshop 6 optional notebook: statistics with Pandas


Pandas -- a "Python Data Analysis Library" -- is a powerful tool for analyzing datasets. It has built-in facilities for reading and writing multiple file formats, integration with statistical tools, and plotting. For those familiar with ROOT, the particle/nuclear physics analysis tool, Pandas is a formidable Python-specific alternative.

In [None]:
# Standard preamble
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 


We will use the dataset you have already analyzed in Homework 4: `asymdata.txt`

The core `pandas` data type is `DataFrame` -- a 2-dimensional labeled data structure with columns of potentially different types. The most familiar equivalent is a spreadsheet or an SQL table. In this example, we create a dataframe by reading the text file, naming its four columns explicitly.

In [None]:
# Columns are separated by whitespace; the first line of the file is skipped
df = pd.read_csv('asymdata.txt', sep=r"\s+", names=("Counter","Araw","deltaX","deltaY"), skiprows=1)
print(df)
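
To see what the `read_csv` arguments do without the homework file at hand, here is a minimal sketch that parses a small in-memory table. The numbers and the header line are invented for illustration and are not taken from `asymdata.txt`; the sketch assumes the file begins with a single header line, which is why one row is skipped.

```python
import io
import pandas as pd

# Hypothetical stand-in for asymdata.txt: one header line, then
# whitespace-separated columns (all values are invented).
text = """# counter Araw deltaX deltaY
1   12.5  -0.3   1.2
2  -40.1   0.7  -2.4
3    3.8   0.1   0.5
"""

demo = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",                                      # any run of whitespace is a separator
    names=["Counter", "Araw", "deltaX", "deltaY"],   # column labels
    skiprows=1,                                      # skip the header line
)
print(demo)
print(demo.dtypes)  # integer counter, floating-point measurements
```

Passing `names` tells pandas that the file itself carries no usable column labels, so every remaining line is treated as data.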

Producing a histogram is pretty straightforward:

In [None]:
plt.figure()
df['Araw'].plot.hist(bins=50)

Computing basic statistical quantities is just as easy. For example, the covariance matrix:

In [None]:
print(df.cov())
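
As a hedged illustration of the basic quantities, the sketch below computes per-column means and standard deviations and checks that the diagonal of the covariance matrix holds the variances. It uses a synthetic two-column dataframe (the names `x`, `y`, and `demo` are made up here), not the homework data.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the homework dataset): y is partly correlated with x.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1000)
y = 0.5 * x + rng.normal(0.0, 1.0, size=1000)
demo = pd.DataFrame({"x": x, "y": y})

print(demo.mean())      # per-column sample mean
print(demo.std())       # per-column sample standard deviation (ddof=1)
print(demo.describe())  # count, mean, std, min, quartiles, max

# The diagonal of the covariance matrix holds the variances (std squared).
cov = demo.cov()
print(np.allclose(np.diag(cov), demo.var()))
```

The same methods work on a single column, for example `demo['x'].mean()`.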

So is the correlation matrix:

In [None]:
print(df.corr())
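
The correlation matrix is the covariance matrix normalized by the standard deviations, $\rho_{ij} = \mathrm{cov}_{ij} / (\sigma_i \sigma_j)$. A minimal sketch of that relation, again on synthetic data with invented column names:

```python
import numpy as np
import pandas as pd

# Synthetic, anti-correlated columns (not the homework dataset).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = -0.8 * x + 0.6 * rng.normal(size=500)
demo = pd.DataFrame({"x": x, "y": y})

cov = demo.cov().to_numpy()
sigma = np.sqrt(np.diag(cov))                 # standard deviations
corr_by_hand = cov / np.outer(sigma, sigma)   # rho_ij = cov_ij / (sigma_i * sigma_j)

print(corr_by_hand)
print(np.allclose(corr_by_hand, demo.corr().to_numpy()))
```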

It is even more instructive to plot it. There's a nice package for statistics visualization called seaborn: https://seaborn.pydata.org/

In [None]:
import seaborn as sns
sns.heatmap(df.corr(), annot=True,cmap='bwr',vmin=-1,vmax=1)

#### Exercise

Plot histograms of `deltaX` and `deltaY` and compute their means and standard deviations.

It is actually pretty easy to plot all columns:

In [None]:
df.hist(bins=50)

Or even better, plot the correlations as scatter plots:

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(df[['Araw','deltaX','deltaY']], alpha=0.2, figsize=(6, 6), diagonal='kde');

Fitting to a Gaussian distribution is straightforward: pass a column (which behaves like a numpy array) to `norm.fit`

In [None]:
from scipy.stats import norm

par = norm.fit(df['Araw']) # distribution fitting
print('Gaussian parameters: mu = {0:5.3f}, sigma = {1:4.1f}'.format(par[0], par[1]))

# plot the histogram of the data; overlay fit 
df['Araw'].hist(bins=40,density=True)

# fitted distribution
x = np.linspace(df['Araw'].min(),df['Araw'].max(),100)
pdf_fitted = norm.pdf(x,loc=par[0],scale=par[1])
plt.plot(x,pdf_fitted,'r-')
plt.xlabel('Araw')
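
For a Gaussian, the maximum-likelihood fit performed by `norm.fit` reduces to the sample mean and the population-style standard deviation (`ddof=0`). The following sketch checks this on a synthetic sample; the true parameters 5.0 and 3.0 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Synthetic Gaussian sample (not the homework dataset).
rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=3.0, size=2000)

mu, sigma = norm.fit(sample)   # returns (loc, scale)
print(mu, sigma)

# The fit reproduces the sample mean and the ddof=0 standard deviation.
print(np.isclose(mu, sample.mean()), np.isclose(sigma, sample.std(ddof=0)))
```

Note that `pandas` uses `ddof=1` by default in `Series.std()`, so its result differs slightly from the fitted `sigma` for small samples.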

#### Exercise

Fit the `deltaX` and `deltaY` variables to Gaussian distributions.

Example of a linear regression of one variable (`Araw`) against two others (`deltaX` and `deltaY`):

In [None]:
import statsmodels.api as sm

linearRegress = sm.GLM.from_formula('Araw ~ deltaX+deltaY',data=df).fit()
linearRegress.summary()
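
With its default settings (Gaussian family, identity link), a GLM fit yields the same coefficients as ordinary least squares. As a rough sketch of what such a fit computes, the example below solves the least-squares problem directly with numpy. The data and the "true" coefficients (2.0, 1.5, -0.7) are invented for illustration.

```python
import numpy as np

# Synthetic data: a = 2.0 + 1.5*dx - 0.7*dy + noise (invented model).
rng = np.random.default_rng(3)
n = 5000
dx = rng.normal(size=n)
dy = rng.normal(size=n)
a = 2.0 + 1.5 * dx - 0.7 * dy + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones(n), dx, dy])

# Least-squares solution of X @ coef ~= a
coef, *_ = np.linalg.lstsq(X, a, rcond=None)
print(coef)  # intercept, slope vs dx, slope vs dy
```

The formula `'Araw ~ deltaX+deltaY'` builds an analogous design matrix, with the intercept column added automatically.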

Bin the data into a profile histogram, then overlay the linear regression model:

In [None]:
sns.regplot(x=df['deltaY'], y=df['Araw'], x_bins=50, fit_reg=False)

In [None]:
ax = sns.regplot(x=df['deltaY'], y=df['Araw'],x_bins=50)
ax.set_xlim(-75,75)
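
A profile histogram can also be built by hand: bin the x variable, then compute the mean of y and its uncertainty in each bin. The sketch below does this with `pd.cut` and `groupby` on synthetic data, using equal-width bins. This is an illustration of the idea, and seaborn's own binning for `x_bins` may differ in detail.

```python
import numpy as np
import pandas as pd

# Synthetic data with a known slope of 0.2 (not the homework dataset).
rng = np.random.default_rng(4)
x = rng.uniform(-50, 50, size=5000)
y = 0.2 * x + rng.normal(scale=5.0, size=5000)
demo = pd.DataFrame({"x": x, "y": y})

# Assign each row to one of 10 equal-width bins in x.
bins = pd.cut(demo["x"], bins=10)

# Per-bin mean, spread, and number of entries of y.
profile = demo.groupby(bins, observed=True)["y"].agg(["mean", "std", "count"])
profile["err"] = profile["std"] / np.sqrt(profile["count"])        # error on the mean
profile["x_center"] = [interval.mid for interval in profile.index]  # bin centers
print(profile)
```

Plotting `profile["mean"]` against `profile["x_center"]` with `profile["err"]` as error bars gives the profile histogram.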

#### Exercise: plot the linear regression of `Araw` vs `deltaX`