Pandas is a python package that simplifies working with tabular or relational data. Because columns and rows of data in a pandas DataFrame are naturally array-like, using pandas with sci-analysis is the preferred way to use sci-analysis.
Let's create a pandas DataFrame to use for analysis:
%matplotlib inline
import numpy as np
import scipy.stats as st
from sci_analysis import analyze
import pandas as pd
np.random.seed(987654321)
df = pd.DataFrame(
{
'ID' : np.random.randint(10000, 50000, size=60).astype(str),
'One' : st.norm.rvs(0.0, 1, size=60),
'Two' : st.norm.rvs(0.0, 3, size=60),
'Three' : st.weibull_max.rvs(1.2, size=60),
'Four' : st.norm.rvs(0.0, 1, size=60),
'Month' : ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] * 5,
'Condition' : ['Group A', 'Group B', 'Group C', 'Group D'] * 15
}
)
df
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
ID | One | Two | Three | Four | Month | Condition | |
---|---|---|---|---|---|---|---|
0 | 33815 | -1.199973 | -0.051015 | -0.556609 | -1.145177 | Jan | Group A |
1 | 49378 | -0.142682 | 3.522920 | -1.424446 | -0.880138 | Feb | Group B |
2 | 21015 | -1.746777 | -7.415294 | -1.804494 | -0.487270 | Mar | Group C |
3 | 15552 | -0.437626 | 0.805884 | -1.235840 | 0.416363 | Apr | Group D |
4 | 38833 | -1.205166 | -0.105672 | -1.683723 | -0.151296 | May | Group A |
5 | 13561 | -0.610066 | -1.842630 | -1.280547 | 0.645674 | Jun | Group B |
6 | 36967 | -0.203453 | 2.323542 | -1.326379 | 1.014516 | Jul | Group C |
7 | 43379 | 0.085310 | -2.053241 | -0.503970 | -1.349427 | Aug | Group D |
8 | 36113 | 1.853726 | -5.176661 | -1.935414 | 0.513536 | Sep | Group A |
9 | 18422 | -0.614827 | 1.266392 | -0.292610 | -2.234853 | Oct | Group B |
10 | 24986 | 0.091151 | -5.721601 | -0.330216 | 1.269432 | Nov | Group C |
11 | 27743 | 0.367027 | 1.929861 | -0.388752 | -0.807231 | Dec | Group D |
12 | 29620 | 0.337290 | 3.160379 | -0.139480 | 0.917287 | Jan | Group A |
13 | 16982 | -2.403575 | 1.846666 | -0.689486 | -0.406160 | Feb | Group B |
14 | 49184 | -1.008465 | -0.148862 | -1.620799 | 0.707101 | Mar | Group C |
15 | 35850 | -0.352184 | 5.654610 | -0.966809 | 0.927546 | Apr | Group D |
16 | 47386 | 0.598030 | -2.689915 | -1.253860 | 0.570852 | May | Group A |
17 | 19963 | 0.573027 | 1.372438 | -0.690340 | -0.798698 | Jun | Group B |
18 | 25412 | 0.215652 | 4.065310 | -2.168703 | -1.567035 | Jul | Group C |
19 | 38630 | -0.436220 | -2.193357 | -0.821331 | 0.071891 | Aug | Group D |
20 | 29330 | 0.209028 | -0.720595 | -1.019263 | 0.798486 | Sep | Group A |
21 | 15612 | 0.406017 | 0.221715 | -2.252922 | -0.006731 | Oct | Group B |
22 | 42431 | -0.112333 | 3.377393 | -1.023559 | 0.813721 | Nov | Group C |
23 | 38265 | 0.341139 | 2.775356 | -0.434224 | 3.408679 | Dec | Group D |
24 | 38936 | -1.435777 | -3.183457 | -2.417681 | 0.994073 | Jan | Group A |
25 | 22402 | 0.719249 | -2.281663 | -0.419681 | 0.025585 | Feb | Group B |
26 | 10490 | 1.022446 | -0.884773 | -0.962212 | 0.532781 | Mar | Group C |
27 | 15452 | 1.463633 | -1.052140 | -0.316955 | 0.135338 | Apr | Group D |
28 | 20401 | 1.375250 | -3.150916 | -0.842319 | 1.060090 | May | Group A |
29 | 24927 | -1.885277 | -1.824083 | -0.368665 | -0.636261 | Jun | Group B |
30 | 27535 | 0.379140 | 4.224249 | -1.415238 | 1.940782 | Jul | Group C |
31 | 28809 | 0.427183 | 9.419918 | -0.730669 | -1.345587 | Aug | Group D |
32 | 42130 | 1.671549 | 3.501617 | -2.043236 | 1.939640 | Sep | Group A |
33 | 43074 | 0.933478 | -0.262629 | -0.523070 | -0.551311 | Oct | Group B |
34 | 37342 | -0.986482 | -1.544095 | -1.795220 | -0.523349 | Nov | Group C |
35 | 16932 | -0.121223 | 4.431819 | -0.554931 | -1.387899 | Dec | Group D |
36 | 46045 | -0.121785 | 3.704645 | -0.555530 | -0.610390 | Jan | Group A |
37 | 13717 | -1.303516 | -3.525625 | -0.268351 | 0.220075 | Feb | Group B |
38 | 26979 | 0.167418 | 1.754883 | -0.570240 | -0.258519 | Mar | Group C |
39 | 32178 | 0.379428 | -2.980751 | -1.046118 | -1.266934 | Apr | Group D |
40 | 22586 | 1.528444 | 3.847906 | -0.564254 | 0.168995 | May | Group A |
41 | 30676 | 0.579101 | 1.481665 | -2.000617 | -1.136057 | Jun | Group B |
42 | 12145 | -0.512085 | -5.800311 | -0.703269 | 1.309943 | Jul | Group C |
43 | 33851 | 0.329299 | -4.168787 | -1.832158 | 2.626034 | Aug | Group D |
44 | 20004 | 1.761672 | 6.218044 | -0.903584 | 0.052300 | Sep | Group A |
45 | 48903 | -0.391426 | 3.600225 | -0.390644 | -1.282138 | Oct | Group B |
46 | 11760 | -0.153558 | 0.022388 | -0.882584 | 0.461477 | Nov | Group C |
47 | 21061 | -0.765411 | -1.856080 | -0.967070 | -0.169594 | Dec | Group D |
48 | 18232 | -1.983058 | 3.544743 | -1.246127 | 1.408816 | Jan | Group A |
49 | 29927 | -0.017905 | -6.803385 | -0.043450 | -0.192027 | Feb | Group B |
50 | 30740 | -0.772090 | 0.826426 | -0.875306 | 0.074382 | Mar | Group C |
51 | 41741 | -1.017919 | 4.670395 | -0.080428 | 0.408054 | Apr | Group D |
52 | 45287 | 1.721927 | 7.581574 | -0.395787 | 0.114241 | May | Group A |
53 | 39581 | -0.860012 | 3.638375 | -0.530987 | 0.019394 | Jun | Group B |
54 | 19179 | 0.441536 | 3.921498 | -0.001505 | 0.373191 | Jul | Group C |
55 | 17116 | 0.604572 | 2.716440 | -0.580509 | -0.157461 | Aug | Group D |
56 | 34913 | 1.635415 | 2.587376 | -0.463056 | -0.189674 | Sep | Group A |
57 | 13794 | 0.623878 | -5.834247 | -1.710010 | -0.232304 | Oct | Group B |
58 | 28453 | -0.349846 | -0.703319 | -1.094846 | -0.238145 | Nov | Group C |
59 | 25158 | 0.097128 | -3.303646 | -0.508852 | 0.112469 | Dec | Group D |
This creates a table (pandas DataFrame object) with 6 columns and an index which is the row id. The following command can be used to analyze the distribution of the column titled One:
analyze(
df['One'],
name='Column One',
title='Distribution from pandas'
)
Statistics
----------
n = 60
Mean = -0.0035
Std Dev = 0.9733
Std Error = 0.1257
Skewness = -0.1472
Kurtosis = -0.2412
Maximum = 1.8537
75% = 0.5745
50% = 0.0882
25% = -0.6113
Minimum = -2.4036
IQR = 1.1858
Range = 4.2573
Shapiro-Wilk test for normality
-------------------------------
alpha = 0.0500
W value = 0.9804
p value = 0.4460
H0: Data is normally distributed
Anywhere you use a python list or numpy Array in sci-analysis, you can use a column or row of a pandas DataFrame (known in pandas terms as a Series). This is because a pandas Series has much of the same behavior as a numpy Array, causing sci-analysis to handle a pandas Series as if it were a numpy Array.
By passing two array-like arguments to the analyze()
function, the correlation can be determined between the two array-like arguments. The following command can be used to analyze the correlation between columns One and Three:
analyze(
df['One'],
df['Three'],
xname='Column One',
yname='Column Three',
title='Bivariate Analysis between Column One and Column Three'
)
Linear Regression
-----------------
n = 60
Slope = 0.0281
Intercept = -0.9407
r = 0.0449
r^2 = 0.0020
Std Err = 0.0820
p value = 0.7332
Spearman Correlation Coefficient
--------------------------------
alpha = 0.0500
r value = 0.0316
p value = 0.8105
H0: There is no significant relationship between predictor and response
Since there isn't a correlation between columns One and Three, it might be useful to see where most of the data is concentrated. This can be done by adding the argument contours=True
and turning off the best fit line with fit=False
. For example:
analyze(
df['One'],
df['Three'],
xname='Column One',
yname='Column Three',
contours=True,
fit=False,
title='Bivariate Analysis between Column One and Column Three'
)
Linear Regression
-----------------
n = 60
Slope = 0.0281
Intercept = -0.9407
r = 0.0449
r^2 = 0.0020
Std Err = 0.0820
p value = 0.7332
Spearman Correlation Coefficient
--------------------------------
alpha = 0.0500
r value = 0.0316
p value = 0.8105
H0: There is no significant relationship between predictor and response
With a few point below -2.0, it might be useful to know which data point they are. This can be done by passing the ID column to the labels
argument and then selecting which labels to highlight with the highlight
argument:
analyze(
df['One'],
df['Three'],
labels=df['ID'],
highlight=df[df['Three'] < -2.0]['ID'],
fit=False,
xname='Column One',
yname='Column Three',
title='Bivariate Analysis between Column One and Column Three'
)
Linear Regression
-----------------
n = 60
Slope = 0.0281
Intercept = -0.9407
r = 0.0449
r^2 = 0.0020
Std Err = 0.0820
p value = 0.7332
Spearman Correlation Coefficient
--------------------------------
alpha = 0.0500
r value = 0.0316
p value = 0.8105
H0: There is no significant relationship between predictor and response
To check whether an individual Condition correlates between Column One and Column Three, the same analysis can be done, but this time by passing the Condition column to the groups argument. For example:
analyze(
df['One'],
df['Three'],
xname='Column One',
yname='Column Three',
groups=df['Condition'],
title='Bivariate Analysis between Column One and Column Three'
)
Linear Regression
-----------------
n Slope Intercept r^2 Std Err p value Group
--------------------------------------------------------------------------------------------------
15 0.1113 -1.1181 0.0487 0.1364 0.4293 Group A
15 -0.2586 -0.9348 0.1392 0.1784 0.1708 Group B
15 0.3688 -1.0182 0.1869 0.2134 0.1076 Group C
15 0.0611 -0.7352 0.0075 0.1952 0.7591 Group D
Spearman Correlation Coefficient
--------------------------------
n r value p value Group
--------------------------------------------------------
15 0.1357 0.6296 Group A
15 -0.3643 0.1819 Group B
15 0.3714 0.1728 Group C
15 0.1786 0.5243 Group D
The borders of the graph have boxplots for all the data points on the x-axis and y-axis, regardless of which group they belong to. The borders can be removed by adding the argument boxplot_borders=False
.
According to the Spearman Correlation, there is no significant correlation among the groups. Group B is the only group with a negative slope, but it can be difficult to see the data points for Group B with so many colors on the graph. The Group B data points can be highlighted by using the argument highlight=['Group B']
. In fact, any number of groups can be highlighted by passing a list of the group names using the highlight
argument.
analyze(
df['One'],
df['Three'],
xname='Column One',
yname='Column Three',
groups=df['Condition'],
boxplot_borders=False,
highlight=['Group B'],
title='Bivariate Analysis between Column One and Column Three'
)
Linear Regression
-----------------
n Slope Intercept r^2 Std Err p value Group
--------------------------------------------------------------------------------------------------
15 0.1113 -1.1181 0.0487 0.1364 0.4293 Group A
15 -0.2586 -0.9348 0.1392 0.1784 0.1708 Group B
15 0.3688 -1.0182 0.1869 0.2134 0.1076 Group C
15 0.0611 -0.7352 0.0075 0.1952 0.7591 Group D
Spearman Correlation Coefficient
--------------------------------
n r value p value Group
--------------------------------------------------------
15 0.1357 0.6296 Group A
15 -0.3643 0.1819 Group B
15 0.3714 0.1728 Group C
15 0.1786 0.5243 Group D
Performing a location test on data in a pandas DataFrame requires some explanation. A location test can be performed with stacked or unstacked data. One method will be easier than the other depending on how the data to be analyzed is stored. In the example DataFrame used so far, to perform a location test between the groups in the Condition column, the stacked method will be easier to use.
Let's start with an example. The following code will perform a location test using each of the four values in the Condition column:
analyze(
df['Two'],
groups=df['Condition'],
categories='Condition',
name='Column Two',
title='Oneway from pandas'
)
Overall Statistics
------------------
Number of Groups = 4
Total = 60
Grand Mean = 0.4456
Pooled Std Dev = 3.6841
Grand Median = 0.5138
Group Statistics
----------------
n Mean Std Dev Min Median Max Group
--------------------------------------------------------------------------------------------------
15 1.2712 3.7471 -5.1767 2.5874 7.5816 Group A
15 -0.3616 3.2792 -6.8034 0.2217 3.6384 Group B
15 -0.1135 3.7338 -7.4153 0.0224 4.2242 Group C
15 0.9864 3.9441 -4.1688 0.8059 9.4199 Group D
Bartlett Test
-------------
alpha = 0.0500
T value = 0.4868
p value = 0.9218
H0: Variances are equal
Oneway ANOVA
------------
alpha = 0.0500
f value = 0.7140
p value = 0.5477
H0: Group means are matched
From the graph, there are four groups: Group A, Group B, Group C and Group D in Column Two. The analysis shows that the variances are equal and there is no significant difference in the means. Noting the tests that are being performed, the Bartlett test is being used to check for equal variance because all four groups are normally distributed, and the Oneway ANOVA is being used to test if all means are equal because all four groups are normally distributed and the variances are equal. However, if not all the groups are normally distributed, the Levene Test will be used to check for equal variance instead of the Bartlett Test. Also, if the groups are not normally distributed or the variances are not equal, the Kruskal-Wallis test will be used instead of the Oneway ANOVA.
If instead the four columns One, Two, Three and Four are to be analyzed, the easier way to perform the analysis is with the unstacked method. The following code will perform a location test of the four columns:
analyze(
[df['One'], df['Two'], df['Three'], df['Four']],
groups=['One', 'Two', 'Three', 'Four'],
categories='Columns',
title='Unstacked Oneway'
)
Overall Statistics
------------------
Number of Groups = 4
Total = 240
Grand Mean = -0.0995
Pooled Std Dev = 1.9859
Grand Median = 0.0752
Group Statistics
----------------
n Mean Std Dev Min Median Max Group
--------------------------------------------------------------------------------------------------
60 0.1007 1.0294 -2.2349 0.0621 3.4087 Four
60 -0.0035 0.9815 -2.4036 0.0882 1.8537 One
60 -0.9408 0.6133 -2.4177 -0.8318 -0.0015 Three
60 0.4456 3.6572 -7.4153 0.5138 9.4199 Two
Levene Test
-----------
alpha = 0.0500
W value = 64.7684
p value = 0.0000
HA: Variances are not equal
Kruskal-Wallis
--------------
alpha = 0.0500
h value = 33.8441
p value = 0.0000
HA: Group means are not matched
To perform a location test using the unstacked method, the columns to be analyzed are passed in a list or tuple, and the groups argument needs to be a list or tuple of the group names. One thing to note is that the groups argument was used to explicitly define the group names. This will only work if the group names and order are known in advance. If they are unknown, a dictionary comprehension can be used instead of a list comprehension to to get the group names along with the data:
analyze(
{'One': df['One'], 'Two': df['Two'], 'Three': df['Three'], 'Four': df['Four']},
categories='Columns',
title='Unstacked Oneway Using a Dictionary'
)
Overall Statistics
------------------
Number of Groups = 4
Total = 240
Grand Mean = -0.0995
Pooled Std Dev = 1.9859
Grand Median = 0.0752
Group Statistics
----------------
n Mean Std Dev Min Median Max Group
--------------------------------------------------------------------------------------------------
60 0.1007 1.0294 -2.2349 0.0621 3.4087 Four
60 -0.0035 0.9815 -2.4036 0.0882 1.8537 One
60 -0.9408 0.6133 -2.4177 -0.8318 -0.0015 Three
60 0.4456 3.6572 -7.4153 0.5138 9.4199 Two
Levene Test
-----------
alpha = 0.0500
W value = 64.7684
p value = 0.0000
HA: Variances are not equal
Kruskal-Wallis
--------------
alpha = 0.0500
h value = 33.8441
p value = 0.0000
HA: Group means are not matched
The output will be identical to the previous example. The analysis also shows that the variances are not equal, and the means are not matched. Also, because the data in column Three is not normally distributed, the Levene Test is used to test for equal variance instead of the Bartlett Test, and the Kruskal-Wallis Test is used instead of the Oneway ANOVA.
With pandas, it's possible to perform advanced aggregation and filtering functions using the GroupBy object's apply()
method. Since the sample sizes were small for each month in the above examples, it might be helpful to group the data by annual quarters instead. First, let's create a function that adds a column called Quarter to the DataFrame where the value is either Q1, Q2, Q3 or Q4 depending on the month.
def set_quarter(data):
month = data['Month']
if month.all() in ('Jan', 'Feb', 'Mar'):
quarter = 'Q1'
elif month.all() in ('Apr', 'May', 'Jun'):
quarter = 'Q2'
elif month.all() in ('Jul', 'Aug', 'Sep'):
quarter = 'Q3'
elif month.all() in ('Oct', 'Nov', 'Dec'):
quarter = 'Q4'
else:
quarter = 'Unknown'
data.loc[:, 'Quarter'] = quarter
return data
This function will take a GroupBy object called data, where data's DataFrame object was grouped by month, and set the variable quarter based off the month. Then, a new column called Quarter is added to data where the value of each row is equal to quarter. Finally, the resulting DataFrame object is returned.
Using the new function is simple. The same techniques from previous examples are used, but this time, a new DataFrame object called df2 is created by first grouping by the Month column then calling the apply()
method which will run the set_quarter()
function.
quarters = ('Q1', 'Q2', 'Q3', 'Q4')
df2 = df.groupby(df['Month']).apply(set_quarter)
data = {quarter: data['Two'] for quarter, data in df2.groupby(df2['Quarter'])}
analyze(
[data[quarter] for quarter in quarters],
groups=quarters,
categories='Quarters',
name='Column Two',
title='Oneway of Annual Quarters'
)
Overall Statistics
------------------
Number of Groups = 4
Total = 60
Grand Mean = 0.4456
Pooled Std Dev = 3.6815
Grand Median = 0.4141
Group Statistics
----------------
n Mean Std Dev Min Median Max Group
--------------------------------------------------------------------------------------------------
15 -0.3956 3.6190 -7.4153 -0.0510 3.7046 Q1
15 1.0271 3.4028 -3.1509 0.8059 7.5816 Q2
15 1.2577 4.4120 -5.8003 2.5874 9.4199 Q3
15 -0.1067 3.1736 -5.8342 0.0224 4.4318 Q4
Bartlett Test
-------------
alpha = 0.0500
T value = 1.7209
p value = 0.6323
H0: Variances are equal
Oneway ANOVA
------------
alpha = 0.0500
f value = 0.7416
p value = 0.5318
H0: Group means are matched