# Class 5 Descriptive Statistics

## 1. Summary Statistics

pandas objects come with a set of common mathematical and statistical methods.
Most of these are **reductions** or **summary statistics**: methods
that extract **a single value (like the sum or mean)** from a Series, or **a Series of values**
from the rows or columns of a DataFrame. Consider a small DataFrame:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame([[1.4, 4.1], [7.1, -4.5],
                   [2.5, 3.2], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

    one  two
a  1.40  4.1
b  7.10 -4.5
c  2.50  3.2
d  0.75 -1.3


Calling DataFrame’s sum method returns a Series containing **column sums**:

In [3]:
df.sum()

one    11.75
two     1.50
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

In [4]:
df.sum(axis='columns')

a    5.50
b    2.60
c    5.70
d   -0.55
dtype: float64
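
To make the axis argument concrete, here is a minimal sketch that rebuilds the same DataFrame and compares both calls with NumPy sums over the underlying array. The names `col_sums`, `row_sums`, and `arr` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, 4.1], [7.1, -4.5],
                   [2.5, 3.2], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Default (axis=0 / axis='index'): collapse the rows, one result per column
col_sums = df.sum()

# axis=1 / axis='columns': collapse the columns, one result per row
row_sums = df.sum(axis='columns')

# The same reductions applied to the raw NumPy array
arr = df.to_numpy()
print(np.allclose(col_sums.to_numpy(), arr.sum(axis=0)))
print(np.allclose(row_sums.to_numpy(), arr.sum(axis=1)))
```

The result of the default call is labeled by column name, and the result of the `axis='columns'` call is labeled by row index.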

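Some methods, like idxmin and idxmax, return indirect statistics, such as the index label at which the minimum or maximum value occurs:

In [5]: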
df.idxmax()

one    b
two    a
dtype: object
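
The following sketch, using the same DataFrame, contrasts idxmax with max: one returns the label where each column peaks, the other returns the peak value. The variable names `labels` and `values` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1.4, 4.1], [7.1, -4.5],
                   [2.5, 3.2], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# idxmax reports where each column peaks (an index label) ...
labels = df.idxmax()

# ... while max reports the peak value itself
values = df.max()

# Looking up each label in its column recovers the corresponding maximum
for col in df.columns:
    print(col, labels[col], df.loc[labels[col], col] == values[col])

# idxmin is the counterpart for minima
lowest = df.idxmin()
print(lowest)
```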

Other methods are accumulations:

In [6]:
df.cumsum()

     one  two
a   1.40  4.1
b   8.50 -0.4
c  11.00  2.8
d  11.75  1.5
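
An accumulation keeps the shape of its input, and its last entry along the chosen axis equals the matching reduction. The sketch below checks this on the same DataFrame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, 4.1], [7.1, -4.5],
                   [2.5, 3.2], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Running total down each column (the default axis)
running = df.cumsum()

# Unlike a reduction, the result has the same shape as the input
print(running.shape == df.shape)

# The final row of the running total equals the column sums
print(np.allclose(running.iloc[-1].to_numpy(), df.sum().to_numpy()))

# Accumulating across the columns instead: the last column equals the row sums
running_across = df.cumsum(axis='columns')
print(np.allclose(running_across['two'].to_numpy(),
                  df.sum(axis='columns').to_numpy()))
```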


Another type of method is neither a reduction nor an accumulation. describe is one
such example, producing multiple summary statistics in one shot:

In [7]:
df.describe()

            one       two
count  4.000000  4.000000
mean   2.937500  0.375000
std    2.867454  4.017773
min    0.750000 -4.500000
25%    1.237500 -2.100000
50%    1.950000  0.950000
75%    3.650000  3.425000
max    7.100000  4.100000
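
Each row of the describe output corresponds to an individual method. As a rough sketch, the table can be rebuilt by hand from count, mean, std, min, quantile, and max; the name `manual` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, 4.1], [7.1, -4.5],
                   [2.5, 3.2], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

summary = df.describe()

# Build the same table one statistic at a time, then transpose so that
# the statistics become the rows
manual = pd.DataFrame({
    'count': df.count(),
    'mean': df.mean(),
    'std': df.std(),            # sample standard deviation (ddof=1)
    'min': df.min(),
    '25%': df.quantile(0.25),
    '50%': df.quantile(0.50),   # the median
    '75%': df.quantile(0.75),
    'max': df.max(),
}).T

print(np.allclose(summary.to_numpy(), manual.to_numpy()))
```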


On non-numeric data, describe produces alternative summary statistics:

In [8]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
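
The four statistics reported for non-numeric data can also be derived separately. A minimal sketch, with `counts` and `manual` as illustrative names:

```python
import pandas as pd

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

# Frequency of each distinct value, most frequent first
counts = obj.value_counts()

manual = pd.Series({
    'count': obj.count(),     # number of non-NA entries
    'unique': obj.nunique(),  # number of distinct values
    'top': counts.idxmax(),   # the most frequent value
    'freq': counts.max(),     # how often the most frequent value occurs
})
print(manual)
```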

See the table below for a full list of summary statistics and related methods. <br>
![5-8.PNG](attachment:5-8.PNG)

## 2. Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from **pairs of
arguments**. As an example, consider a COVID-19 dataset:

In [9]:
data = pd.read_csv("covid-data.csv") # Make sure the data file is in the same folder as this notebook file

The **corr method of Series** computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:

In [10]:
data['new_cases'].corr(data['new_deaths'])

0.8792842487853555

In [11]:
data['new_cases'].cov(data['new_deaths'])

4164807.5583742927
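
The link between the two numbers is that the Pearson correlation is the covariance rescaled by both standard deviations. The sketch below checks this on two short made-up Series (not the COVID data), each with a missing value, to show that only rows present in both take part.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; each Series has one missing entry
x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 3.0, np.nan, 6.0])

# Only rows where both values are present contribute
both = x.notna() & y.notna()
xs, ys = x[both], y[both]

r = x.corr(y)   # Pearson correlation by default
c = x.cov(y)    # sample covariance

# Correlation = covariance / (std of x * std of y), on the shared rows
manual_r = c / (xs.std() * ys.std())
print(np.isclose(r, manual_r))
```

Because of this rescaling, the correlation is unit-free and lies between -1 and 1, while the covariance depends on the scale of the data.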

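Calling corr or cov on the DataFrame itself returns a full correlation or
covariance matrix as a DataFrame. (In pandas 2.0 and later, pass numeric_only=True
if the DataFrame contains non-numeric columns; otherwise these methods raise an error.)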

In [12]:
data.corr()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
total_cases,1.0,0.946965,0.957017,0.975316,0.766002,0.794178,0.092237,0.048949,0.064105,0.095789,...,0.006848,-0.027669,-0.027849,0.016497,-0.011522,-0.011347,0.025728,-0.015844,0.004361,0.089035
new_cases,0.946965,1.0,0.993541,0.960778,0.879284,0.891452,0.067651,0.065895,0.074847,0.073363,...,-0.002461,-0.026301,-0.028411,0.019356,-0.021613,-0.017706,0.031324,-0.024534,-0.000875,0.066944
new_cases_smoothed,0.957017,0.993541,1.0,0.968543,0.869593,0.895762,0.069181,0.058895,0.07636,0.074528,...,-0.002371,-0.02606,-0.028775,0.019562,-0.021639,-0.018089,0.030698,-0.024747,-0.00117,0.068194
total_deaths,0.975316,0.960778,0.968543,1.0,0.834291,0.865553,0.081902,0.04283,0.055768,0.131821,...,0.014802,-0.03674,-0.047532,0.009378,0.001985,-0.017034,0.030595,-0.015369,0.018351,0.144937
new_deaths,0.766002,0.879284,0.869593,0.834291,1.0,0.971176,0.047781,0.049946,0.061414,0.078264,...,0.002596,-0.036723,-0.044243,0.015728,-0.013555,-0.021833,0.038222,-0.024557,0.009574,0.100536
new_deaths_smoothed,0.794178,0.891452,0.895762,0.865553,0.971176,1.0,0.050173,0.047999,0.063383,0.083552,...,0.00286,-0.037298,-0.04591,0.016068,-0.01355,-0.022747,0.038283,-0.025319,0.009758,0.109589
total_cases_per_million,0.092237,0.067651,0.069181,0.081902,0.047781,0.050173,1.0,0.471525,0.622321,0.571364,...,0.355468,-0.210157,-0.184531,0.113537,0.066771,-0.049526,0.249554,0.002561,0.243314,0.237426
new_cases_per_million,0.048949,0.065895,0.058895,0.04283,0.049946,0.047999,0.471525,1.0,0.781821,0.217994,...,0.200799,-0.169122,-0.135868,0.116905,0.049209,-0.031712,0.223616,-0.011216,0.145564,0.174315
new_cases_smoothed_per_million,0.064105,0.074847,0.07636,0.055768,0.061414,0.063383,0.622321,0.781821,1.0,0.287043,...,0.233292,-0.198197,-0.158725,0.137614,0.050411,-0.038963,0.263592,-0.014895,0.183371,0.199635
total_deaths_per_million,0.095789,0.073363,0.074528,0.131821,0.078264,0.083552,0.571364,0.217994,0.287043,1.0,...,0.251582,-0.207008,-0.275025,-0.082512,0.281677,-0.100312,0.209621,0.042399,0.286392,0.28985


In [13]:
data.cov()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
total_cases,1039807000000.0,12499870000.0,12627210000.0,37118650000.0,287098500.0,294124700.0,372324600.0,3689204.0,3780009.0,13953900.0,...,152321800.0,-707453.7,-3530063.0,72690.57,-147261.6,-186804.3,1233721.0,-45457.97,33124.88,4605.756591
new_cases,12499870000.0,166815500.0,165688200.0,464180500.0,4164808.0,4172819.0,3451060.0,62762.95,55772.89,135059.0,...,-691633.8,-8490.095,-45505.29,1077.739,-3489.023,-3681.801,19059.21,-889.3124,-83.97104,48.692047
new_cases_smoothed,12627210000.0,165688200.0,161478700.0,467720600.0,4117562.0,4060114.0,3518629.0,55719.29,55142.31,136896.5,...,-655937.0,-8242.789,-45392.13,1069.015,-3434.578,-3700.339,18367.58,-884.7988,-110.2452,48.172788
total_deaths,37118650000.0,464180500.0,467720600.0,1392968000.0,11444860.0,11732410.0,12100490.0,118149.5,120355.5,702838.9,...,12049960.0,-34330.18,-220494.0,1512.247,928.1123,-10257.88,53725.03,-1613.45,5101.531,259.060423
new_deaths,287098500.0,4164808.0,4117562.0,11444860.0,134491.5,129076.9,69208.43,1350.77,1299.381,4091.063,...,20714.32,-335.8402,-2011.843,24.86355,-62.09544,-128.5084,651.0261,-25.26857,26.09332,2.080748
new_deaths_smoothed,294124700.0,4172819.0,4060114.0,11732410.0,129076.9,127226.0,71627.56,1274.628,1284.753,4307.777,...,22205.44,-330.3918,-2032.478,24.6456,-60.33234,-130.4653,636.2695,-25.40154,25.80949,2.087941
total_cases_per_million,372324600.0,3451060.0,3518629.0,12100490.0,69208.43,71627.56,15577980.0,137148.4,141246.0,321209.8,...,28898990.0,-12380.44,-81875.96,1861.54,2894.16,-2712.883,17977.01,25.88652,7119.531,138.352607
new_cases_per_million,3689204.0,62762.95,55719.29,118149.5,1350.77,1274.628,137148.4,5430.755,3300.764,2288.205,...,262994.4,-169.8315,-941.4246,31.87529,31.81975,-26.04991,290.8543,-1.650802,79.41037,1.545658
new_cases_smoothed_per_million,3780009.0,55772.89,55142.31,120355.5,1299.381,1284.753,141246.0,3300.764,3224.726,2352.856,...,261556.2,-164.4264,-929.2856,31.38495,27.98568,-27.39029,284.1241,-1.882775,76.90583,1.500807
total_deaths_per_million,13953900.0,135059.0,136896.5,702838.9,4091.063,4307.777,321209.8,2288.205,2352.856,20288.07,...,739537.2,-544.8876,-4016.386,-49.33977,403.3238,-178.8235,422.7039,15.81432,300.5731,5.604491
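
The structure of these matrices is easier to see on a small example. The sketch below uses a made-up DataFrame named `toy`; its column names only mimic the COVID data and its numbers are invented. It assumes pandas 1.5 or later, where corr and cov accept numeric_only.

```python
import numpy as np
import pandas as pd

# Invented numbers; 'location' stands in for a non-numeric column
toy = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D', 'E'],
    'new_cases': [10.0, 20.0, 15.0, 40.0, 30.0],
    'new_deaths': [1.0, 2.0, 2.0, 5.0, 3.0],
    'new_tests': [100.0, 90.0, 120.0, 80.0, 110.0],
})

# numeric_only=True skips the string column explicitly
corr = toy.corr(numeric_only=True)
cov = toy.cov(numeric_only=True)

# Each off-diagonal entry matches the pairwise Series result
pair = toy['new_cases'].corr(toy['new_deaths'])
print(np.isclose(corr.loc['new_cases', 'new_deaths'], pair))

# The correlation matrix is symmetric with ones on the diagonal;
# the diagonal of the covariance matrix holds each column's variance
print(np.allclose(np.diag(corr.to_numpy()), 1.0))
print(np.isclose(cov.loc['new_cases', 'new_cases'], toy['new_cases'].var()))
```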


Using **DataFrame’s corrwith method**, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

In [14]:
data.corrwith(data.new_deaths)

total_cases                        0.766002
new_cases                          0.879284
new_cases_smoothed                 0.869593
total_deaths                       0.834291
new_deaths                         1.000000
new_deaths_smoothed                0.971176
total_cases_per_million            0.047781
new_cases_per_million              0.049946
new_cases_smoothed_per_million     0.061414
total_deaths_per_million           0.078264
new_deaths_per_million             0.124897
new_deaths_smoothed_per_million    0.124387
new_tests                          0.610419
total_tests                        0.485051
total_tests_per_thousand          -0.000237
new_tests_per_thousand             0.029181
new_tests_smoothed                 0.570033
new_tests_smoothed_per_thousand    0.017439
tests_per_case                    -0.040316
positive_rate                      0.225269
stringency_index                   0.117093
population                         0.756338
population_density              
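
As a small sketch of the same idea on made-up data, corrwith against one column gives the same numbers as the matching column of the full correlation matrix. The DataFrame `toy` and its values are invented; selecting numeric columns first with select_dtypes is one way to keep string columns out of the calculation.

```python
import numpy as np
import pandas as pd

# Invented numbers; 'location' stands in for a non-numeric column
toy = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D', 'E'],
    'new_cases': [10.0, 20.0, 15.0, 40.0, 30.0],
    'new_deaths': [1.0, 2.0, 2.0, 5.0, 3.0],
    'new_tests': [100.0, 90.0, 120.0, 80.0, 110.0],
})

# Keep only the numeric columns
numeric = toy.select_dtypes(include='number')

# One correlation per column, each computed against new_deaths
by_col = numeric.corrwith(numeric['new_deaths'])
print(by_col)

# Same values as the new_deaths column of the full correlation matrix
print(np.allclose(by_col.to_numpy(), numeric.corr()['new_deaths'].to_numpy()))
```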

## 3. Unique Values, Value Counts, and Membership

Another class of related methods extracts information about the values contained in a
one-dimensional Series. To illustrate these, consider this example:

In [15]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

The first function is unique, which gives you an array of the unique values in a Series:

In [16]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

The unique values are returned in order of first appearance, not in sorted order; the array can be sorted in place afterwards if needed (uniques.sort()). Relatedly, value_counts computes a Series
containing value frequencies:

In [17]:
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64
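
A few properties of the value_counts result can be checked directly. This is a minimal sketch on the same Series; `counts` and `by_label` are illustrative names.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()

# The index holds the distinct values, the data holds their frequencies
print(set(counts.index) == set(obj.unique()))

# The frequencies add up to the length of the Series (there are no NAs here)
print(counts.sum() == len(obj))

# To order by label rather than by frequency, sort the index
by_label = counts.sort_index()
print(by_label)
```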

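The Series is sorted by count in **descending order** as a convenience. value_counts is
also available as a top-level pandas function that can be used with any array or
sequence (note that pd.value_counts is deprecated as of pandas 2.1 in favor of calling value_counts on a Series):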

In [18]:
pd.value_counts(obj.values, sort=False)

d    1
b    2
a    3
c    3
dtype: int64
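
For a plain list or array, an alternative is to wrap the data in a Series first and call the method on that. A short sketch, with `values` as an illustrative plain Python list:

```python
import pandas as pd

# A plain Python list rather than a pandas object
values = ['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']

# Wrap the sequence in a Series, then count
counts = pd.Series(values).value_counts(sort=False)

# sort=False skips the sort by frequency, so the row order should not be
# relied on here; the totals per value are what matter
print(counts.to_dict())
```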

isin performs a membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame:

In [19]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [20]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [21]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
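
The same pattern applies to a column in a DataFrame. The sketch below uses a made-up DataFrame named `frame` with a hypothetical `key` column; negating the mask with `~` selects the complement.

```python
import pandas as pd

# Invented data: the same letters as above plus a numeric column
frame = pd.DataFrame({
    'key': ['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'],
    'value': range(9),
})

# Boolean mask: True where key is one of the listed values
mask = frame['key'].isin(['b', 'c'])

# Rows whose key is 'b' or 'c'
subset = frame[mask]

# Rows whose key is anything else
rest = frame[~mask]

print(subset)
print(rest)
```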