<img src="./images/shouke_logo.png"
     style="float: right; padding-bottom: 100px;"
     width="100"/>
<br>
<br>

<table style="float:center;">
    <tr>
        <td>
            <img src='./images/python-logo.png' width=130>
        </td>
        <td>
            <img src='./images/pandas-logo.png' width=150>
        </td>
    </tr>
</table>

<h1 style='text-align: center;'>Calculating Summary Statistics of DataFrame</h1>
<h3 style='text-align: center;'>Shouke Wei, Ph.D. Professor</h3>
<h4 style='text-align: center;'>Email: shouke.wei@gmail.com</h4>

## Objective
- Learn how to calculate summary statistics of a DataFrame, both for whole columns and aggregated by category

In [30]:
# Load the required packages
import pandas as pd

# Read the data
df = pd.read_csv('./data/gdp_china_renamed.csv')

# Display the first 5 rows
df.head()

| |prov|gdpr|year|gdp|pop|finv|trade|fexpen|uinc|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|Guangdong|First|2000|1.074125|8.65|0.314513|1.408147|0.108032|0.976157|
|1|Guangdong|First|2001|1.203925|8.733|0.348443|14.609701|0.132133|1.041519|
|2|Guangdong|First|2002|1.350242|8.842|0.385078|1.830169|0.152108|1.11372|
|3|Guangdong|First|2003|1.584464|8.963|0.48132|2.346735|0.169563|1.238043|
|4|Guangdong|First|2004|1.886462|NaN|0.587002|2.955899|0.185295|1.362765|


## 1. Summary statistics vs. Aggregation statistics
- **Summary statistics**: single numbers that describe a dataset, including measures of location or central tendency (mean, median, mode), extremes (minimum, maximum), and spread (range, standard deviation).
- **Aggregation statistics**: summary statistics computed after splitting the data into subsets, giving one result for each subset.
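
The difference between the two ideas can be sketched on a small example. The DataFrame `toy` and its values are hypothetical and serve only as an illustration; they are not part of the GDP dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only
toy = pd.DataFrame({
    'prov': ['A', 'A', 'B', 'B'],
    'gdp': [1.0, 3.0, 10.0, 20.0],
})

# Summary statistic: one number describing the whole column
overall_mean = toy['gdp'].mean()                    # 8.5

# Aggregation by category: one number per subset (here, per province)
mean_by_prov = toy.groupby('prov')['gdp'].mean()    # A: 2.0, B: 15.0

print(overall_mean)
print(mean_by_prov)
```

The overall mean hides the gap between the two groups, which is why the grouped version is often more informative.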

<p style="text-align:center;">Table 1: Built-in Pandas aggregations</p>

|Aggregation       |Description|
|:-----------------|:--------------------|
|count()           |Number of non-null items|
|first(), last()   |First and last item|
|mean(), median()  |Mean and median|
|min(), max()      |Minimum and maximum|
|std(), var()      |Standard deviation and variance|
|mad()             |Mean absolute deviation (deprecated in pandas 1.5, removed in 2.0)|
|prod()            |Product of all items|
|sum()             |Sum of all items|
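
As a quick check of several entries in Table 1, the sketch below applies them to a small Series. The Series `s` and its values are hypothetical. Because `mad()` is not available in pandas 2.0 and later, the mean absolute deviation is computed by hand here.

```python
import pandas as pd

# Hypothetical values, chosen so the results are easy to verify by hand
s = pd.Series([2.0, 4.0, 4.0, 10.0])

n = s.count()         # 4 non-null items
total = s.sum()       # 20.0
product = s.prod()    # 320.0
variance = s.var()    # sample variance (ddof=1): 36 / 3 = 12.0
spread = s.std()      # square root of the sample variance

# Mean absolute deviation around the mean, computed manually
mean_abs_dev = (s - s.mean()).abs().mean()   # 2.5

print(n, total, product, variance, spread, mean_abs_dev)
```

Note that `std()` and `var()` use the sample formulas (`ddof=1`) by default.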

## 2. Aggregating statistics
#### (1) One column

In [32]:
df['pop'].mean().round(3)

8.321
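
The `pop` column contains missing values, and pandas skips them by default when computing a statistic. The following minimal sketch, using a hypothetical Series rather than the real data, shows the default behaviour and the `skipna=False` alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
pop = pd.Series([8.0, 9.0, np.nan, 10.0])

# Default: NaN is ignored, so the mean is (8 + 9 + 10) / 3
mean_skip = pop.mean()                  # 9.0

# skipna=False: any NaN makes the result NaN
mean_strict = pop.mean(skipna=False)    # nan

print(mean_skip, mean_strict)
```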

#### (2) Multiple columns

In [33]:
df[['gdp','pop']].median()

gdp    3.093328
pop    9.194000
dtype: float64
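
Selecting several columns with double brackets returns a DataFrame, and a statistic computed on it is a Series indexed by column name. A small sketch on hypothetical data follows; it also shows `numeric_only=True`, which is one way to skip text columns when the statistic is applied to the whole DataFrame.

```python
import pandas as pd

# Hypothetical toy data with one text column
toy = pd.DataFrame({
    'gdp': [1.0, 2.0, 6.0],
    'pop': [8.0, 9.0, 10.0],
    'prov': ['A', 'B', 'C'],
})

# Result is a Series whose index holds the column names
medians = toy[['gdp', 'pop']].median()
gdp_median = medians['gdp']                   # 2.0

# Apply the statistic to all numeric columns, skipping 'prov'
numeric_means = toy.mean(numeric_only=True)   # gdp: 3.0, pop: 9.0

print(medians)
print(numeric_means)
```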

## 3. describe() method
#### (1) Whole DataFrame

In [35]:
df.describe().T

| |count|mean|std|min|25%|50%|75%|max|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|year|95.0|2009.0|5.506283|2000.0|2004.0|2009.0|2014.0|2018.0|
|gdp|95.0|3.456443|2.365092|0.505299|1.425301|3.093328|4.956176|9.727777|
|pop|93.0|8.321032|1.802813|4.68|7.588|9.194|9.488|10.999|
|finv|95.0|1.914406|1.558411|0.137774|0.588736|1.418528|3.05626|5.52135|
|trade|95.0|2.122045|2.300531|0.000523|0.412089|1.523541|3.054562|14.609701|
|fexpen|95.0|0.448604|0.366626|0.04313|0.138071|0.326767|0.700097|1.572926|
|uinc|95.0|2.204282|1.22531|0.476626|1.159632|1.994583|3.05987|5.55743|


#### (2) Multiple columns 

In [36]:
df[['gdp','pop']].describe()

| |gdp|pop|
|:--|:--|:--|
|count|95.0|93.0|
|mean|3.456443|8.321032|
|std|2.365092|1.802813|
|min|0.505299|4.68|
|25%|1.425301|7.588|
|50%|3.093328|9.194|
|75%|4.956176|9.488|
|max|9.727777|10.999|
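
By default, `describe()` reports the 25%, 50% and 75% quantiles and covers only numeric columns. The sketch below, on hypothetical data, shows two common variations: custom percentiles, and summaries that include a text column.

```python
import pandas as pd

# Hypothetical toy data with one text and one numeric column
toy = pd.DataFrame({
    'prov': ['A', 'A', 'B', 'B', 'B'],
    'gdp': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Custom percentiles; the median (50%) is always included
custom = toy['gdp'].describe(percentiles=[0.1, 0.9])

# A text column is summarized by count, unique, top and freq
text_summary = toy['prov'].describe()

# include='all' puts numeric and text columns in one table
all_summary = toy.describe(include='all')

print(custom)
print(text_summary)
print(all_summary)
```

In the combined table, statistics that do not apply to a column (for example `mean` for `prov`) are shown as NaN.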


## 4. agg() method
- Compute specific combinations of aggregating statistics for given columns; a statistic that is not requested for a column appears as NaN in the result

In [37]:
df.agg(
    {
        'gdp':['min','max','median','skew'],
        'pop':['min','max','median','mean']
        
    }
)

| |gdp|pop|
|:--|:--|:--|
|min|0.505299|4.68|
|max|9.727777|10.999|
|median|3.093328|9.194|
|skew|0.767165|NaN|
|mean|NaN|8.321032|
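
The gaps in the result above arise because each column was given a different list of statistics. A minimal sketch on hypothetical data contrasts the dictionary form with the list form, which applies the same statistics to every selected column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'gdp': [1.0, 2.0, 6.0],
    'pop': [8.0, 9.0, 10.0],
})

# Dictionary form: different statistics per column, unmatched cells become NaN
mixed = toy.agg({'gdp': ['min', 'max'], 'pop': ['max', 'mean']})

# List form: the same statistics for every column, so there are no gaps
uniform = toy[['gdp', 'pop']].agg(['min', 'max', 'mean'])

print(mixed)
print(uniform)
```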


## 5. Aggregating statistics grouped by category

#### (1)  Average GDP for each of the 5 provinces

In [38]:
df.groupby(['prov'])[['gdp']].mean()

|prov|gdp|
|:--|:--|
|Guangdong|4.50221|
|Henan|2.233366|
|Jiangsu|4.122462|
|Shandong|3.76376|
|Zhejiang|2.660416|


#### (2) Average of each column for each of the 5 provinces

In [39]:
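# numeric_only=True skips text columns such as gdpr;
# pandas 2.0+ raises a TypeError without it
df.groupby(['prov']).mean(numeric_only=True)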

|prov|year|gdp|pop|finv|trade|fexpen|uinc|
|:--|:--|:--|:--|:--|:--|:--|:--|
|Guangdong|2009.0|4.50221|9.853611|1.623354|5.371958|0.619538|2.355681|
|Henan|2009.0|2.233366|9.468421|1.737832|0.185544|0.361909|1.5931|
|Jiangsu|2009.0|4.122462|7.723889|2.35207|2.509187|0.496992|2.30079|
|Shandong|2009.0|3.76376|9.428368|2.427043|1.075928|0.427598|1.983445|
|Zhejiang|2009.0|2.660416|5.180105|1.43173|1.467606|0.336983|2.788392|
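
Grouping can also be combined with `agg()` to get several statistics per group in one call. The sketch below uses hypothetical data and pandas named aggregation; the output column names `gdp_mean`, `gdp_max` and `pop_min` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data with a text column (gdpr) next to the numeric ones
toy = pd.DataFrame({
    'prov': ['A', 'A', 'B', 'B'],
    'gdpr': ['First', 'First', 'Second', 'Second'],
    'gdp': [1.0, 3.0, 10.0, 20.0],
    'pop': [8.0, 9.0, 5.0, 7.0],
})

# Group means of the numeric columns only; the text column gdpr is skipped
means = toy.groupby('prov').mean(numeric_only=True)

# Named aggregation: output_name=(input_column, statistic)
stats = toy.groupby('prov').agg(
    gdp_mean=('gdp', 'mean'),
    gdp_max=('gdp', 'max'),
    pop_min=('pop', 'min'),
)

print(means)
print(stats)
```

Named aggregation keeps the result flat, with one clearly labelled column per statistic.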


## 6. Count the number of items in a column by category

In [41]:
df.groupby(['prov'])[['gdp']].count()

|prov|gdp|
|:--|:--|
|Guangdong|19|
|Henan|19|
|Jiangsu|19|
|Shandong|19|
|Zhejiang|19|
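
Because `count()` counts only non-null values, its result can differ from the number of rows when a column has missing entries, as `pop` does in this dataset. A minimal sketch on hypothetical data compares it with `size()`, which counts rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group A
toy = pd.DataFrame({
    'prov': ['A', 'A', 'B', 'B', 'B'],
    'pop': [8.0, np.nan, 5.0, 6.0, 7.0],
})

# count(): non-null values per group
non_null = toy.groupby('prov')['pop'].count()   # A: 1, B: 3

# size(): rows per group, missing values included
rows = toy.groupby('prov').size()               # A: 2, B: 3

print(non_null)
print(rows)
```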
