<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reading-in-some-example-data" data-toc-modified-id="Reading-in-some-example-data-1">Reading in some example data</a></span></li><li><span><a href="#Simple-descriptive-statistics-of-numerical-variables" data-toc-modified-id="Simple-descriptive-statistics-of-numerical-variables-2">Simple descriptive statistics of numerical variables</a></span></li><li><span><a href="#Descriptive-statistics-of-categorical-variables" data-toc-modified-id="Descriptive-statistics-of-categorical-variables-3">Descriptive statistics of categorical variables</a></span></li><li><span><a href="#Exercises" data-toc-modified-id="Exercises-4">Exercises</a></span></li><li><span><a href="#Data-aggregation" data-toc-modified-id="Data-aggregation-5">Data aggregation</a></span><ul class="toc-item"><li><span><a href="#Grouping-data-according-to-a-single-variable" data-toc-modified-id="Grouping-data-according-to-a-single-variable-5.1">Grouping data according to a single variable</a></span></li><li><span><a href="#Grouping-according-to-multiple-variables" data-toc-modified-id="Grouping-according-to-multiple-variables-5.2">Grouping according to multiple variables</a></span></li><li><span><a href="#Resetting-the-indices" data-toc-modified-id="Resetting-the-indices-5.3">Resetting the indices</a></span></li></ul></li><li><span><a href="#When-not-to-reset-the-indices" data-toc-modified-id="When-not-to-reset-the-indices-6">When not to reset the indices</a></span><ul class="toc-item"><li><span><a href="#Using-the-stats-module" data-toc-modified-id="Using-the-stats-module-6.1">Using the <code>stats</code> module</a></span></li></ul></li><li><span><a href="#Reorganizing-a-dataframe" data-toc-modified-id="Reorganizing-a-dataframe-7">Reorganizing a dataframe</a></span></li><li><span><a href="#Exercise-1" data-toc-modified-id="Exercise-1-8">Exercise 1</a></span></li><li><span><a href="#Exercise-2" data-toc-modified-id="Exercise-2-9">Exercise 
2</a></span></li><li><span><a href="#Exercises-3" data-toc-modified-id="Exercises-3-10">Exercises 3</a></span></li></ul></div>

# Pandas: statistics

## Reading in some example data

The fossum data consists of nine morphometric measurements on each of 43 female mountain brushtail possums, trapped at seven sites from Southern Victoria to central Queensland:

https://vincentarelbundock.github.io/Rdatasets/doc/DAAG/fossum.html

In [1]:
import pandas
link = 'http://tinyurl.com/y8ummdf9'
data = pandas.read_csv(link,index_col=0)
data.head()

Unnamed: 0,case,site,Pop,sex,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
C5,2,1,Vic,f,6,92.5,57.6,91.5,36.5,72.5,51.2,16.0,28.5,33.0
C10,3,1,Vic,f,6,94.0,60.0,95.5,39.0,75.4,51.9,15.5,30.0,34.0
C15,4,1,Vic,f,6,93.2,57.1,92.0,38.0,76.1,52.2,15.2,28.0,34.0
C23,5,1,Vic,f,2,91.5,56.3,85.5,36.0,71.0,53.2,15.1,28.5,33.0
C24,6,1,Vic,f,1,93.1,54.8,90.5,35.5,73.2,53.6,14.2,30.0,32.0


## Simple descriptive statistics of numerical variables

The ```describe()``` function returns descriptive statistics for all *numerical* variables in a dataframe.

In [3]:
print(data.describe())

             case       site        age    hdlngth     skullw   totlngth  \
count   43.000000  43.000000  43.000000  43.000000  43.000000  43.000000   
mean    43.418605   2.976744   3.976744  92.148837  56.588372  87.906977   
std     30.264888   2.219914   1.945549   2.574913   2.568788   4.182241   
min      2.000000   1.000000   1.000000  84.700000  51.500000  75.000000   
25%     18.000000   1.000000   3.000000  90.750000  55.200000  85.250000   
50%     40.000000   2.000000   4.000000  92.500000  56.400000  88.500000   
75%     64.500000   5.000000   5.000000  93.800000  57.650000  90.500000   
max    104.000000   7.000000   9.000000  96.900000  67.700000  96.500000   

           taill   footlgth   earconch        eye      chest      belly  
count  43.000000  42.000000  43.000000  43.000000  43.000000  43.000000  
mean   37.104651  69.111905  48.576744  14.811628  27.337209  32.883721  
std     1.830815   4.911321   4.274444   1.030074   1.841069   2.929402  
min    32.000000  6

Isolating a single column and getting its descriptives works as well.

In [4]:
a = data.age
a.describe()

count    43.000000
mean      3.976744
std       1.945549
min       1.000000
25%       3.000000
50%       4.000000
75%       5.000000
max       9.000000
Name: age, dtype: float64

You can also get specific descriptives.

In [4]:
a.max()

9

In [5]:
a.min()

1

In [6]:
a.mean()

3.9767441860465116

In [7]:
a.std()

1.9455489137665618

In [8]:
a.count()

43

**Table of descriptive functions**

This is a table of some of the descriptives that are available:

+ count.......Number of non-null observations
+ sum.........Sum of values
+ mean........Mean of values
+ mad.........Mean absolute deviation (removed in pandas 2.0)
+ median......Arithmetic median of values
+ min.........Minimum
+ max.........Maximum
+ mode........Mode
+ prod........Product of values
+ std.........Bessel-corrected sample standard deviation
+ var.........Unbiased variance
+ sem.........Standard error of the mean
+ skew........Sample skewness (3rd moment)
+ kurt........Sample kurtosis (4th moment)
+ quantile....Sample quantile (value at %)
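
A few of the functions in this list can be tried out on a small series. The sketch below uses made-up values (not the possum data) and shows that `var()` divides by n - 1 unless `ddof=0` is passed, and that `agg()` accepts a list of function names.

```python
import pandas as pd

# Small made-up sample (hypothetical values, not taken from the possum data)
ages = pd.Series([1, 2, 2, 3, 4, 4, 4, 6, 9], name="age")

median_age = ages.median()            # middle value of the sorted sample
upper_quartile = ages.quantile(0.75)  # value at the 75% point of the sample
modes = ages.mode()                   # a Series, because ties are possible
sample_var = ages.var()               # divides by n - 1 (ddof=1 is the default)
population_var = ages.var(ddof=0)     # divides by n

# Several statistics in one call; the result is a Series indexed by name
summary = ages.agg(["count", "mean", "std", "sem"])
print(summary)
```

Note that `mode()` always returns a series, even when there is only one most frequent value.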

## Descriptive statistics of categorical variables

In [9]:
p = data['Pop']
p.value_counts()

Vic      24
other    19
Name: Pop, dtype: int64

In [10]:
p = data['sex']
p.value_counts()

f    43
Name: sex, dtype: int64

In [11]:
p.mode()

0    f
dtype: object

In [12]:
p = data['site']
p.value_counts()

1    19
5     6
2     5
7     4
6     4
3     3
4     2
Name: site, dtype: int64
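
Counts are often easier to compare as proportions. A minimal sketch, using made-up site labels rather than the possum data, of `value_counts(normalize=True)` and `nunique()`:

```python
import pandas as pd

# Hypothetical site labels, made up for illustration
sites = pd.Series([1, 1, 1, 2, 5, 5, 7, 1], name="site")

counts = sites.value_counts()                # absolute frequencies, most frequent first
shares = sites.value_counts(normalize=True)  # relative frequencies that sum to 1
n_distinct = sites.nunique()                 # number of distinct values

print(counts)
print(shares)
```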

## Exercises

+ What is the total length of the longest recorded possum?
+ What is the standard deviation of possum total length?
+ What is the largest relative tail length (tail length relative to total length)?

## Data aggregation

### Grouping data according to a single variable

The ```groupby()``` function allows
+ grouping data according to variable values
+ calculating statistics on each subgroup

In [13]:
#Step 1: create the groups
grp = data.groupby('Pop')
grp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f69bb7cf4d0>

In [13]:
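# Step 2: get the statistics
# numeric_only=True skips text columns such as sex (needed in pandas 2.0 and later)
grp.mean(numeric_only=True)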

Pop,case,site,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
Vic,20.333333,1.208333,4.041667,92.4125,56.645833,88.333333,36.333333,73.021739,51.758333,14.695833,27.75,32.541667
other,72.578947,5.210526,3.894737,91.815789,56.515789,87.368421,38.078947,64.378947,44.557895,14.957895,26.815789,33.315789


In [14]:
# Getting statistics of a single variable
grp.hdlngth.max()

Pop
Vic      95.9
other    96.9
Name: hdlngth, dtype: float64
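
When more than one statistic per group is needed, `agg()` can compute them in a single call. The following is a minimal sketch on a made-up dataframe; the column names mirror the possum data but the values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
df = pd.DataFrame({
    "Pop": ["Vic", "Vic", "Vic", "other", "other"],
    "hdlngth": [92.5, 94.0, 93.2, 91.0, 96.9],
    "taill": [36.5, 39.0, 38.0, 37.0, 41.0],
})

grp = df.groupby("Pop")

# One column, several statistics: rows are groups, columns are statistics
head_stats = grp["hdlngth"].agg(["mean", "max", "count"])

# A different statistic for each column
mixed = grp.agg({"hdlngth": "max", "taill": "mean"})

print(head_stats)
print(mixed)
```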

### Grouping according to multiple variables

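You can also group by more than one variable by passing a list of column names to ```groupby()```. The result has one row for every combination of grouping values that occurs in the data.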

In [15]:
data['young'] = data.age < 3
grp = data.groupby(['Pop','young'])
table = grp.mean(numeric_only=True)  # skip text columns such as sex
table

Pop,young,case,site,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
Vic,False,17.75,1.125,5.25,93.25,56.875,90.4375,37.15625,73.953333,51.6125,15.0,28.0625,33.53125
Vic,True,25.5,1.375,1.625,90.7375,56.1875,84.125,34.6875,71.275,52.05,14.0875,27.125,30.5625
other,False,72.411765,5.176471,4.117647,91.976471,56.358824,87.323529,38.029412,64.317647,44.476471,14.929412,26.941176,33.617647
other,True,74.0,5.5,2.0,90.45,57.85,87.75,38.5,64.9,45.25,15.2,25.75,30.75


In [16]:
grp.taill.max()

Pop    young
Vic    False    39.5
       True     36.5
other  False    41.0
       True     39.0
Name: taill, dtype: float64
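
It can be useful to check how many rows fall into each combination before interpreting group statistics. The sketch below uses made-up values and shows that `size()` reports only the combinations that actually occur: there is no young possum in the hypothetical `other` population, so that combination is absent from the result.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
df = pd.DataFrame({
    "Pop": ["Vic", "Vic", "Vic", "other", "other"],
    "age": [6, 2, 1, 4, 5],
    "taill": [36.5, 36.0, 35.5, 38.0, 41.0],
})
df["young"] = df["age"] < 3

grp = df.groupby(["Pop", "young"])
sizes = grp.size()             # number of rows per (Pop, young) combination
tail_max = grp["taill"].max()  # longest tail per combination

print(sizes)
print(tail_max)
```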

### Resetting the indices

Calculating statistics by group returns a dataframe (or a series, when a single column is selected) with the grouping variables as its index. This makes the result harder to use in subsequent operations.

In [17]:
b = grp.taill.max()
b.index

MultiIndex([(  'Vic', False),
            (  'Vic',  True),
            ('other', False),
            ('other',  True)],
           names=['Pop', 'young'])

The ```reset_index()``` function converts the result to a 'clean' dataframe with a single numeric index. The former index levels become ordinary variables (columns).

In [26]:
print(table.index)

MultiIndex([(  'Vic', False),
            (  'Vic',  True),
            ('other', False),
            ('other',  True)],
           names=['Pop', 'young'])


In [27]:
table

Pop,young,case,site,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
Vic,False,17.75,1.125,5.25,93.25,56.875,90.4375,37.15625,73.953333,51.6125,15.0,28.0625,33.53125
Vic,True,25.5,1.375,1.625,90.7375,56.1875,84.125,34.6875,71.275,52.05,14.0875,27.125,30.5625
other,False,72.411765,5.176471,4.117647,91.976471,56.358824,87.323529,38.029412,64.317647,44.476471,14.929412,26.941176,33.617647
other,True,74.0,5.5,2.0,90.45,57.85,87.75,38.5,64.9,45.25,15.2,25.75,30.75


In [28]:
new = table.reset_index()
new

Unnamed: 0,Pop,young,case,site,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
0,Vic,False,17.75,1.125,5.25,93.25,56.875,90.4375,37.15625,73.953333,51.6125,15.0,28.0625,33.53125
1,Vic,True,25.5,1.375,1.625,90.7375,56.1875,84.125,34.6875,71.275,52.05,14.0875,27.125,30.5625
2,other,False,72.411765,5.176471,4.117647,91.976471,56.358824,87.323529,38.029412,64.317647,44.476471,14.929412,26.941176,33.617647
3,other,True,74.0,5.5,2.0,90.45,57.85,87.75,38.5,64.9,45.25,15.2,25.75,30.75


In [29]:
print(new.columns)

Index(['Pop', 'young', 'case', 'site', 'age', 'hdlngth', 'skullw', 'totlngth',
       'taill', 'footlgth', 'earconch', 'eye', 'chest', 'belly'],
      dtype='object')
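
As an alternative to calling `reset_index()` afterwards, `groupby()` accepts `as_index=False`, which keeps the grouping variables as ordinary columns from the start. A minimal sketch on made-up values, comparing both routes:

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
df = pd.DataFrame({
    "Pop": ["Vic", "Vic", "other", "other"],
    "young": [False, True, False, False],
    "taill": [37.0, 35.0, 38.0, 40.0],
})

# Route 1: aggregate, then move the index levels back into columns
via_reset = df.groupby(["Pop", "young"])["taill"].mean().reset_index()

# Route 2: ask groupby not to build a group index in the first place
via_flag = df.groupby(["Pop", "young"], as_index=False)["taill"].mean()

print(via_reset)
print(via_flag)
```

Both routes should give the same flat table; which one reads better is largely a matter of taste.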


## When not to reset the indices

Resetting the indices converts the grouping variables from indices to plain variables. This means that the DataFrame "forgets" its grouping variables. If you want to select a specific variable from the DataFrame, it is often more useful to do so before resetting the indices.

In [39]:
data['young'] = data.age < 3
grp = data.groupby(['Pop','young'])
table = grp.mean(numeric_only=True)  # skip text columns such as sex
table

Pop,young,case,site,age,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly
Vic,False,17.75,1.125,5.25,93.25,56.875,90.4375,37.15625,73.953333,51.6125,15.0,28.0625,33.53125
Vic,True,25.5,1.375,1.625,90.7375,56.1875,84.125,34.6875,71.275,52.05,14.0875,27.125,30.5625
other,False,72.411765,5.176471,4.117647,91.976471,56.358824,87.323529,38.029412,64.317647,44.476471,14.929412,26.941176,33.617647
other,True,74.0,5.5,2.0,90.45,57.85,87.75,38.5,64.9,45.25,15.2,25.75,30.75


In [41]:
tail_data = table['taill']
tail_data

Pop    young
Vic    False    37.156250
       True     34.687500
other  False    38.029412
       True     38.500000
Name: taill, dtype: float64

In [43]:
tail_data.reset_index()

Unnamed: 0,Pop,young,taill
0,Vic,False,37.15625
1,Vic,True,34.6875
2,other,False,38.029412
3,other,True,38.5
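
Keeping the group index also makes it easy to pick out a whole group by its label. The sketch below uses made-up values and contrasts label-based selection on the grouped result with the column comparison needed after `reset_index()`.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration
df = pd.DataFrame({
    "Pop": ["Vic", "Vic", "other", "other"],
    "young": [False, True, False, True],
    "taill": [37.0, 35.0, 38.0, 39.0],
})
tail = df.groupby(["Pop", "young"])["taill"].mean()

# With the group index in place, a label picks out a whole group
vic = tail.loc["Vic"]                  # Series indexed by young
other = tail.xs("other", level="Pop")  # same idea, naming the level explicitly

# After reset_index() the same selection needs a column comparison
flat = tail.reset_index()
vic_flat = flat[flat["Pop"] == "Vic"]

print(vic)
print(vic_flat)
```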


### Using the ```stats``` module



The ```stats``` module contains a function ```group()``` that simplifies getting group statistics from a dataframe. You can use this module if you like. It is available from the course Github page: [https://github.com/dvanderelst-python-class/python-class/tree/spring2021/class_code](https://github.com/dvanderelst-python-class/python-class/tree/spring2021/class_code).

In [18]:
# These lines are needed on my computer to be able to import the stats module. 
# You should place the stats.py file in your working directory.
import sys
sys.path.append('/home/dieter/Dropbox/Python-Class/class_code')
# end of code specific to my computer

import stats
result = stats.group(data, ['sex', 'Age'], 'mean')
result.head()

Unnamed: 0,sex,age,case,site,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly,young
0,f,1,20.666667,1.333333,90.833333,58.0,85.0,35.333333,71.7,53.4,13.966667,28.0,29.333333,True
1,f,2,41.428571,2.571429,90.614286,55.885714,84.785714,35.5,69.271429,49.528571,14.457143,26.357143,31.142857,True
2,f,3,64.272727,4.363636,92.409091,56.418182,88.136364,37.772727,67.218182,46.318182,14.863636,26.681818,33.0,False
3,f,4,48.0,3.166667,91.9,55.75,88.333333,38.083333,68.466667,48.633333,15.066667,27.666667,35.0,False
4,f,5,39.666667,2.5,93.066667,56.9,87.833333,36.5,68.46,46.333333,15.45,28.0,34.75,False


In [32]:
result = stats.group(data, ['sex', 'Age'], 'std')
result.head()

Unnamed: 0,sex,age,case,site,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly,young
0,f,1,16.802778,0.57735,5.37153,8.560958,8.674676,1.258306,2.598076,0.2,0.873689,2.645751,3.785939,0.0
1,f,2,26.462912,2.225395,1.04949,1.493797,2.530763,2.43242,3.206095,3.305911,0.886674,2.230738,1.749149,0.0
2,f,3,27.291357,2.248232,2.568445,2.002907,3.918488,1.633457,5.676939,3.909685,1.246814,1.692228,2.924038,0.0
3,f,4,26.750701,2.483277,3.736844,2.520913,4.966555,1.800463,6.289568,4.444172,0.42269,1.861899,2.366432,0.0
4,f,5,20.353542,1.516575,2.67108,1.532319,3.356586,1.264911,4.087542,5.358607,1.139737,0.632456,3.41687,0.0


In [34]:
result = stats.group(data, ['gender', 'Age'], 'std')
result.head()



Unnamed: 0,age,case,site,hdlngth,skullw,totlngth,taill,footlgth,earconch,eye,chest,belly,young
0,1,16.802778,0.57735,5.37153,8.560958,8.674676,1.258306,2.598076,0.2,0.873689,2.645751,3.785939,0.0
1,2,26.462912,2.225395,1.04949,1.493797,2.530763,2.43242,3.206095,3.305911,0.886674,2.230738,1.749149,0.0
2,3,27.291357,2.248232,2.568445,2.002907,3.918488,1.633457,5.676939,3.909685,1.246814,1.692228,2.924038,0.0
3,4,26.750701,2.483277,3.736844,2.520913,4.966555,1.800463,6.289568,4.444172,0.42269,1.861899,2.366432,0.0
4,5,20.353542,1.516575,2.67108,1.532319,3.356586,1.264911,4.087542,5.358607,1.139737,0.632456,3.41687,0.0
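
The course's `stats.py` is not reproduced here. For readers without access to it, a helper with a similar call shape could be sketched with plain pandas as below. The name `group_stats` and its behavior are assumptions made for illustration; this is not the implementation of `stats.group()` and it may treat column names and non-numeric columns differently.

```python
import pandas as pd

def group_stats(frame, keys, statistic):
    """Hypothetical helper (not the course's stats.group): group `frame`
    by `keys`, apply one named statistic to the numeric columns, and
    return a flat dataframe with the keys as ordinary columns."""
    numeric = frame.select_dtypes(include="number").columns
    value_cols = [c for c in numeric if c not in keys]
    grouped = frame.groupby(keys)[value_cols]
    return grouped.agg(statistic).reset_index()

# Made-up data for illustration
df = pd.DataFrame({
    "sex": ["f", "f", "f"],
    "age": [1, 1, 2],
    "taill": [35.0, 36.0, 38.0],
})

result = group_stats(df, ["sex", "age"], "mean")
print(result)
```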


## Reorganizing a dataframe

In [35]:
c = new.pivot(index='young',columns='Pop', values='footlgth')
c

Pop,Vic,other
young,Unnamed: 1_level_1,Unnamed: 2_level_1
False,73.953333,64.317647
True,71.275,64.9
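
The `pivot()` call above turns a flat table into a wide one: the values of one column become the row labels, the values of another become the column labels. A minimal sketch on made-up values, which also shows that `unstack()` produces the same layout when starting from a result that still has its group index:

```python
import pandas as pd

# Flat table of group means (hypothetical values)
flat = pd.DataFrame({
    "Pop": ["Vic", "Vic", "other", "other"],
    "young": [False, True, False, True],
    "footlgth": [74.0, 71.0, 64.0, 65.0],
})

# Rows: young, columns: Pop, cells: footlgth
wide = flat.pivot(index="young", columns="Pop", values="footlgth")

# Same layout, starting from a Series with a (Pop, young) index
indexed = flat.set_index(["Pop", "young"])["footlgth"]
wide_from_index = indexed.unstack("Pop")

print(wide)
print(wide_from_index)
```

Note that `pivot()` requires each combination of row and column label to occur at most once; otherwise it raises an error.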


# Exercises

Use the following data for this quiz: `pizzasize.csv` (in the Data folder). The data give the diameters of 250 pizzas, 125 each from two pizza chains, for a variety of crust types and toppings.

## Exercise 1

Write code that creates a dataframe (table) that lists for each `store` and each `CrustDescription`, the **average pizza diameter**.

## Exercise 2

Write code that creates a dataframe (table) that lists for each `store` and each `CrustDescription`, the **maximum pizza diameter**.

## Exercises 3

Write code that creates a dataframe (table) that lists for each `store` and each `CrustDescription`, the **average pizza area**. Reorganize the table to have  `CrustDescription` as rows and `store` as columns.