## CMPINF 2100 Week 05
### Summarize DataFrames for EDA

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Read in data

Read in the JOINED data set we created previously!

In [2]:
df = pd.read_csv("joined_data.csv")

In [3]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


## Summarize Columns

We learned from the previous recording that summarizing cols in a DF works just like applying summary methods to Pandas Series!

In [4]:
df.B.mean()

5.5

In [5]:
df.F.mean()

17.857142857142858

But how does Pandas handle missing values?

The MISSINGs are DROPPED or SKIPPED or REMOVED before the summary fx is applied!

In [6]:
df.B.mean(skipna=False)

nan

With skipna=False, MISSING values prevent the summary statistic from being calculated!!! A column with NO missings is unaffected:

In [7]:
df.F.mean(skipna=False)

17.857142857142858

MANY summary fxs/methods DROP or SKIP MISSING values!
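
A minimal sketch of the skipna behavior, assuming a hypothetical toy Series (`s` is made up for illustration and is not part of the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one MISSING entry.
s = pd.Series([1.0, 2.0, np.nan, 4.0])

# Default: the NaN is dropped first, so this is (1 + 2 + 4) / 3.
mean_skip = s.mean()

# skipna=False: the NaN is kept, so the result is NaN.
mean_keep = s.mean(skipna=False)

print(mean_skip)
print(mean_keep)
```

The same `skipna` argument is accepted by the other summary methods used in this section, such as `.std()`, `.var()`, `.min()`, and `.max()`.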

In [8]:
df.B.std()

3.605551275463989

In [9]:
df.B.var()

13.0

In [10]:
df.B.max()

11.0

In [11]:
df.B.min()

0.0

What happens if we apply the .mean() method to the ENTIRE DF?

In [12]:
df.mean()

TypeError: can only concatenate str (not "int") to str

In [13]:
df.mean(numeric_only=True)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64
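
As a small sketch of what `numeric_only=True` does, the hypothetical toy DataFrame below (the names `toy`, `name`, `x`, and `y` are assumptions for illustration) mixes a string column with two numeric columns. Only the numeric columns appear in the result:

```python
import pandas as pd

# Hypothetical toy DataFrame: one string column, two numeric columns.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [10, 20, 30],
})

# The string column is left out of the summary entirely.
col_means = toy.mean(numeric_only=True)

print(col_means)
```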

In [14]:
df.C.mean()

-650.0

In [15]:
df.var(numeric_only=True)

B        13.00000
C    130000.00000
F        79.67033
G     17500.00000
dtype: float64

In [16]:
df.std(numeric_only=True)

B      3.605551
C    360.555128
F      8.925824
G    132.287566
dtype: float64

In [17]:
df.min(numeric_only=True)

B       0.0
C   -1200.0
F      10.0
G     100.0
dtype: float64

In [18]:
df.max(numeric_only=True)

B     11.0
C   -100.0
F     40.0
G    400.0
dtype: float64

We can also calc the SEM (standard error of the mean) for all num columns!!

In [19]:
df.sem(numeric_only=True)

B      1.040833
C    104.083300
F      2.385527
G     44.095855
dtype: float64

This comes from a simple formula: the standard deviation divided by the square root of the sample size!

In [20]:
df.F

0     10
1     20
2     10
3     20
4     10
5     20
6     10
7     20
8     10
9     20
10    10
11    20
12    30
13    40
Name: F, dtype: int64

In [21]:
df.F.std()

8.92582375303981

In [22]:
df.F.size

14

In [23]:
np.sqrt(df.F.size)

3.7416573867739413

The SEM for the F column is:

In [24]:
df.F.std()/np.sqrt(df.F.size)

2.385526741328836

In [25]:
df.F.sem()

2.385526741328836

The SEM method DROPS missings!! The SEM CANNOT be calculated if MISSINGs are kept!

In [26]:
df.C

0     -100.0
1     -200.0
2     -300.0
3     -400.0
4     -500.0
5     -600.0
6     -700.0
7     -800.0
8     -900.0
9    -1000.0
10   -1100.0
11   -1200.0
12       NaN
13       NaN
Name: C, dtype: float64

In [27]:
df.C.size

14

In [28]:
df.C.count()

12

The .size attribute is the number of elements in the column (Series), while the .count() method returns the number of NON-missing entries!!

In [29]:
df.C.sem()

104.08329997330664

In [30]:
df.C.sem(skipna=False)

nan

In [31]:
df.C.std()/np.sqrt(df.C.count())

104.08329997330664

In [34]:
df.C.std()/np.sqrt(df.C.size)

96.36241116594316
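
The last two results differ because `.size` includes the MISSING entries, so it divides by too large a sample size. A minimal sketch of the comparison, assuming a hypothetical toy Series and a hypothetical helper `manual_sem()` (neither is part of the course data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series: 4 observed values and 1 MISSING.
s = pd.Series([2.0, 4.0, 6.0, 8.0, np.nan])

def manual_sem(x):
    """assume x is a numeric Pandas Series;
    sample standard deviation over sqrt of the NON-missing count
    """
    return x.std() / np.sqrt(x.count())

sem_builtin = s.sem()                  # drops the NaN by default
sem_count = manual_sem(s)              # divides by sqrt(4)
sem_size = s.std() / np.sqrt(s.size)   # divides by sqrt(5), too small

print(sem_builtin, sem_count, sem_size)
```

Only the `.count()` version agrees with `.sem()`; the `.size` version understates the standard error whenever the column has missings.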

All of the previous methods have been used to summarize cols!!

However, like NumPy, the methods do have the axis argument, so we can apply them to individual ROWS and summarize ACROSS COLS rather than DOWN COLS.

For example, we can calc the ROW sample size!

In [35]:
df.mean(axis=1)

TypeError: can only concatenate str (not "float") to str
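
The row-wise call fails for the same reason as the column-wise one: the string columns cannot be averaged. A minimal sketch of two row-wise summaries that do work, assuming a hypothetical toy DataFrame (`toy` and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with a string column and some MISSINGs.
toy = pd.DataFrame({
    "label": ["a", "b", "c"],
    "x": [1.0, 2.0, np.nan],
    "y": [3.0, np.nan, np.nan],
})

# Row means ACROSS the numeric columns only; missings are skipped.
row_means = toy.mean(axis=1, numeric_only=True)

# Row sample size: number of NON-missing entries in each row,
# counted over ALL columns (including the string column).
row_counts = toy.count(axis=1)

print(row_means)
print(row_counts)
```

A row whose numeric entries are all missing, like the last one here, has a row mean of NaN.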

## Custom summary functions
We can define our own fxs and APPLY them to the DF.

To demonstrate, let's define our own AVG!

In [41]:
def my_avg(x):
    """assume x is a Pandas Series and
        assume x is a numeric type
    """
    return x.sum()/x.count()

In [42]:
df.mean(numeric_only=True)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64

In [43]:
my_avg(df.B)

5.5

In [44]:
my_avg(df.C)

-650.0

In [45]:
my_avg(df.F)

17.857142857142858

In [46]:
my_avg(df.G)

233.33333333333334

But to APPLY a custom fx to the entire DF, we need to use the .apply() method. So a METHOD is used to APPLY the fx to each of the cols!

In [47]:
df.apply(my_avg)

TypeError: can only concatenate str (not "int") to str

We should only apply my_avg() to NUMERIC cols!!

A simple way to do that is to SELECT all NUMS in the DF.

Pandas has a helper method .select_dtypes() that allows you to easily select all cols of a particular data type.

In [48]:
df.select_dtypes("number").apply(my_avg)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64
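
The same select-then-apply pattern works for any custom summary. As a sketch, the hypothetical function `my_range()` below (not from the recording) is applied to the numeric columns of a made-up toy DataFrame:

```python
import numpy as np
import pandas as pd

def my_range(x):
    """assume x is a Pandas Series and
        assume x is a numeric type
    """
    # .max() and .min() skip missings by default
    return x.max() - x.min()

# Hypothetical toy DataFrame with one string column.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1.0, 5.0, np.nan],
    "y": [10, 40, 20],
})

# Keep only the numeric columns, then apply the fx to each column.
numeric_part = toy.select_dtypes("number")
ranges = numeric_part.apply(my_range)

print(ranges)
```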

## Methods for MISSING values

There are specialized, predefined methods dedicated to IDENTIFYING missing entries in a DF.

In [49]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


The .isnull() method converts a Series into a BOOLEAN Series!! A True value means that the entry is MISSING, while a False corresponds to the value being present.

In [51]:
df.B.isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13     True
Name: B, dtype: bool

But applying the .isnull() to the ENTIRE DF returns a DF of BOOLEANS!

In [53]:
df.isnull()

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False


In [54]:
df.isna()

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False
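
The two outputs match because `.isnull()` is an alias of `.isna()` in Pandas. A quick check on a hypothetical toy DataFrame (`toy` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with missings in two columns.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": ["a", None, "c"],
})

# Both methods return the same Boolean DataFrame.
same_result = toy.isna().equals(toy.isnull())

print(same_result)
```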


But since the result is a DF, we can apply any appropriate summary method to the Boolean DF.

Such as SUMMING to get the total number of MISSINGs per column!

In [55]:
df.isna().sum()

A    2
B    2
C    2
D    2
E    2
F    0
G    5
H    0
dtype: int64

But if you want the PROPORTION missing, apply the .mean() method instead of .sum(). Each True counts as 1 and each False as 0, so the average is the fraction of MISSING entries.

In [57]:
df.isna().mean()

A    0.142857
B    0.142857
C    0.142857
D    0.142857
E    0.142857
F    0.000000
G    0.357143
H    0.000000
dtype: float64
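
The count and the proportion can also be placed side by side. A minimal sketch, assuming a hypothetical toy DataFrame and hypothetical column names `n_missing` and `prop_missing`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame: x has 2 missings, y has 1, z has none.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": ["a", "b", None, "d"],
    "z": [1, 2, 3, 4],
})

is_missing = toy.isna()

# One row per original column: the count and the proportion MISSING.
missing_summary = pd.DataFrame({
    "n_missing": is_missing.sum(),
    "prop_missing": is_missing.mean(),
})

print(missing_summary)
```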

The above is the summary per column...but we can also summarize per row!

In [59]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


In [61]:
df.isna().sum(axis=1)

0     0
1     0
2     0
3     0
4     0
5     0
6     1
7     1
8     1
9     0
10    0
11    0
12    6
13    6
dtype: int64

This is a simple way to identify ROWS with 0 missings!! In other words, ROWS that are complete!

In [62]:
df.loc[df.isna().sum(axis=1)==0, :]

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB
10,k,10.0,-1100.0,Nov,dd,10,400.0,AAA
11,l,11.0,-1200.0,Dec,dd,20,400.0,BBB


This idea of removing ALL missings, or any row that has at least 1 missing value, is known as creating the **COMPLETE CASES**.

This is what happens behind the scenes in a LOT of MODELING FUNCTIONS!!!

In [63]:
df.shape

(14, 8)

In [65]:
df.loc[df.isna().sum(axis=1) == 0, :].shape

(9, 8)

There is a streamlined operation for creating COMPLETE CASES!!

In [66]:
df.dropna()

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB
10,k,10.0,-1100.0,Nov,dd,10,400.0,AAA
11,l,11.0,-1200.0,Dec,dd,20,400.0,BBB
