Jmpy is an analysis library created to simplify common plotting and modeling tasks. Simplicity, a common plotting signature, and ease of use are preferred over flexibility.
The goal is for each function to produce a mini-report that makes the data easier to visualize.
Currently, there are three modules: plotting, modeling, and bayes.
Plotting is used to make pretty graphs, and modeling is used to analyze data with visual output.
Bayes is currently in development to support easy Bayesian analysis and plotting, and it relies on significant assumptions about the underlying data.
The library relies heavily on statsmodels, patsy, pymc, and pandas.
Overview
Plotting
%load_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('bmh')
import jmpy.plotting as jp
import jmpy.modeling as jm
import warnings
warnings.filterwarnings('ignore')
Create some artificial data for our analysis:
nsamples = 250
xc = np.linspace(0, 100, nsamples)
xc2 = xc**2
xd = np.random.choice([1, 3, 5, 7], nsamples)
xe = np.random.choice([10, 30, 50], nsamples)
xf = np.random.choice([.1, .4], nsamples)
xg = np.random.normal(size=nsamples)*15
X = np.column_stack((xc, xc2, xd, xe))  # design matrix
beta = np.array([1, .01, 17, .001])     # true coefficients
e = np.random.normal(size=nsamples)*10  # noise
ytrue = np.dot(X, beta)
y = ytrue + e
data = {}
data['xc'] = xc
data['xc2'] = xc2
data['xd'] = xd
data['xe'] = xe
data['xf'] = xf
data['xg'] = xg
data['y'] = y
data['ytrue'] = ytrue
df = pd.DataFrame.from_dict(data)
df.head()
| | xc | xc2 | xd | xe | xf | xg | y | ytrue |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 3 | 50 | 0.4 | -9.081563 | 69.428600 | 51.050000 |
| 1 | 0.401606 | 0.161288 | 3 | 50 | 0.1 | 5.957173 | 66.168140 | 51.453219 |
| 2 | 0.803213 | 0.645151 | 7 | 10 | 0.1 | -1.048492 | 110.819879 | 119.819664 |
| 3 | 1.204819 | 1.451589 | 7 | 50 | 0.4 | -15.881455 | 111.754919 | 120.269335 |
| 4 | 1.606426 | 2.580604 | 3 | 50 | 0.4 | -19.796739 | 57.810824 | 52.682232 |
Let's start to visualize how our artificial data looks. First, plot a histogram of the results:
jp.histogram('y', df)
If you want to look at the data color-coded by a categorical variable, specify the legend parameter:
jp.histogram('y', df, legend='xd', bins=50, cumprob=True)
You can also create cumprob plots:
jp.cumprob('y', df, legend='xd', marker='D')
Let's look at the data with a boxplot to see if there is any difference between the groups defined by xd:
fig = jp.boxplot(x='xd', y='y', data=df, legend='xd', cumprob=True)
fig
Violin plots can also be created with the boxplot function:
fig = jp.boxplot(x='xd', y='y', data=df, legend='xd', cumprob=True, violin=True)
fig
You can also create a scatter plot with a fit.
jp.scatter(x='xc', y='y', data=df, legend='xd', fit='linear', marker='^')
We can generate the same graph using the arrays directly, without creating a pandas DataFrame. You can fit the data by specifying a fit parameter; currently linear, quadratic, smooth, and interpolate are supported (examples of each follow).
jp.scatter(xc, y=y, legend=xd, fit='linear', marker='^')
jp.scatter('xc2', 'y', df, legend='xd', fit='quadratic', marker='o')
# fitparams are passed into the fitting functions. Smoothing uses scipy's UnivariateSpline.
jp.scatter(x='xc2', y='y', data=df, legend='xd', fit='smooth', fitparams={'s': 1e6})
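An interpolated fit can be requested the same way; a minimal sketch assuming the same call signature as the other fit options:
jp.scatter(x='xc', y='y', data=df, legend='xd', fit='interpolate')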
Contour plots can be created as well:
cdf = pd.DataFrame()
x = np.random.triangular(-1, 0, 1, 400)
y = np.random.triangular(-1, 0, 1, 400)
z = np.sqrt(x**2 + 2*y**2)
cdf['x'] = x
cdf['y'] = y
cdf['z'] = z
cdf['Gx'] = np.random.choice(['X1', 'X2'], size=400)
cdf['Gy'] = np.random.choice(['Y1', 'Y2', 'Y3'], size=400)
cdf['Gz'] = np.random.choice(['Z1', 'Z2'], size=400)
jp.contour(x='x', y='y', z='z', data=cdf, cmap='YlGnBu', figsize=(7,5), marker='.')
JMPY can create grids of charts with the jp.grid function. You can specify rows, columns, or both, and each individual chart is created from the data filtered for that grid cell.
jp.grid(cols='Gz', data=cdf,
chart=jp.contour, args={'x':'x', 'y':'y', 'z':'z', 'marker':'.', 'cmap':'YlGnBu',
'axislabels':False, 'axisticks':False},
figsize=(8, 4),
)
jp.grid(rows='Gx', cols='Gy', data=cdf,
chart=jp.scatter, args={'x':'x', 'y':'y', 'marker':'D', 'fit': 'linear'},
figsize=(12, 7),
legend='Gz')
jp.grid(cols='Gx', data=cdf,
chart=jp.boxplot, args={'x':'Gz', 'y':'y', 'marker':'D'},
figsize=(12, 7),
legend='Gz')
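A rows-only grid should work the same way; a small sketch assuming the same signature as the examples above:
jp.grid(rows='Gy', data=cdf,
        chart=jp.scatter, args={'x': 'x', 'y': 'y', 'marker': '.'},
        figsize=(8, 6))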
Now that we have visualized some of the data, let's do some modeling. jmpy currently supports two types of linear modeling, ordinary least squares (OLS) and robust linear models (RLM), both built on statsmodels. Let's start with OLS.
All models are specified using patsy text formulas, which are very similar to R's formula syntax.
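For reference, a patsy formula expands into response and design matrices; here is a minimal sketch using patsy directly (not jmpy's API) to show what a formula produces:
import patsy
# expand the formula into response and design matrices
# (intercept and dummy coding of C(xd) are added automatically)
y_dm, X_dm = patsy.dmatrices('y ~ xc + xc2 + C(xd)', data=df, return_type='dataframe')
X_dm.head()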
By default, only 80% of the data is used to fit the model; the remaining 20% is plotted alongside the fit to validate the model. This can be changed via the sample_rate parameter.
model = 'y ~ xc + xc2 + xg'
jm.fit(model, data=df, sample_rate=.8, model_type='ols');
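Under the hood jm.fit builds on statsmodels; below is a rough sketch of the equivalent direct calls (the exact internals of jmpy are an assumption here):
import statsmodels.formula.api as smf
# fit the same formula with plain statsmodels: OLS and a robust linear model
ols_res = smf.ols(model, data=df).fit()
rlm_res = smf.rlm(model, data=df).fit()
ols_res.params, rlm_res.params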
Now, what if we had some outliers in our dataset? Let's create a few:
dfo = df.copy()
p = np.random.uniform(0, 1, size=dfo.y.shape)
err = np.random.normal(size=dfo.y.shape)
# inflate roughly 10% of the responses with a large multiplicative error
dfo['yout'] = dfo.y + np.greater_equal(p, 0.9) * (dfo.y * err)
Our coefficient estimates will be skewed due to the outliers.
model = 'yout ~ xc + xc2 + C(xd)'
jm.fit(model, data=dfo, sample_rate=.8, model_type='ols');
Employing the robust linear model, we can minimize the influence of the outliers and get better coefficient estimates.
model = 'yout ~ xc + xc2 + C(xd)'
jm.fit(model, data=dfo, sample_rate=.8, model_type='rlm');
Our parameter estimates from the robust linear model are much closer to the true values than the OLS estimates.
Next, we will test how robust the parameter estimates are by refitting the model many times on random subsets of the data and looking at the resulting distributions of the estimates.
model = 'y ~ xc + xc2 + C(xd)'
jm.check_estimates(model, data=df, sample_rate=.8, model_type='ols', iterations=275);
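check_estimates is essentially a resampling study; the sketch below illustrates the idea with pandas and statsmodels directly (jmpy's actual implementation may differ):
import statsmodels.formula.api as smf
# refit the model on random 80% subsets and collect the coefficient estimates
params = [smf.ols(model, data=df.sample(frac=0.8)).fit().params for _ in range(275)]
pd.DataFrame(params).hist(figsize=(10, 6));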