## CMPINF 2100 Week 05
### Essential Exploratory Data Analysis (EDA) Pandas Methods
We have used many different Pandas attributes and methods over the last two weeks. Let's review the most essential ones that you will use when you BEGIN exploring data.

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Read in data

Let's continue to work with the JOINED data set that we created previously.

In [2]:
df = pd.read_csv("joined_data.csv")

In [3]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


But we CANNOT look at a data set that has thousands, hundreds of thousands, or even millions of rows!

We cannot look at a data set that has dozens to hundreds of cols!

What are the basic actions that we should perform for ANY data analysis task?

## Exploratory Data Analysis (EDA)

In [4]:
df.shape

(14, 8)

In [5]:
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'], dtype='object')

In [6]:
df.dtypes

A     object
B    float64
C    float64
D     object
E     object
F      int64
G    float64
H     object
dtype: object
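The three attributes above can be collected in one quick first look. The following is a minimal sketch, assuming a small hypothetical DataFrame named `demo` (it is NOT the joined data set) so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
demo = pd.DataFrame({
    "x": [1.0, 2.0, np.nan],
    "y": ["a", "b", "b"],
    "z": [10, 20, 30],
})

n_rows, n_cols = demo.shape          # .shape is a (rows, columns) tuple
col_names = demo.columns.to_list()   # column names as a plain Python list
col_types = demo.dtypes              # one data type per column, as a Series

print(f"{n_rows} rows, {n_cols} columns")
print(col_names)
print(col_types)
```

Unpacking `.shape` into two names is handy when you later want to compare the number of rows against counts of missing values.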

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       12 non-null     object 
 1   B       12 non-null     float64
 2   C       12 non-null     float64
 3   D       12 non-null     object 
 4   E       12 non-null     object 
 5   F       14 non-null     int64  
 6   G       9 non-null      float64
 7   H       14 non-null     object 
dtypes: float64(3), int64(1), object(4)
memory usage: 1.0+ KB


In [8]:
df.dtypes.value_counts()

object     4
float64    3
int64      1
Name: count, dtype: int64

In [9]:
df.isna()

Unnamed: 0,A,B,C,D,E,F,G,H
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,True,False
7,False,False,False,False,False,False,True,False
8,False,False,False,False,False,False,True,False
9,False,False,False,False,False,False,False,False


In [10]:
df.isna().sum(axis=0)

A    2
B    2
C    2
D    2
E    2
F    0
G    5
H    0
dtype: int64

In [11]:
df.isna().sum(axis=0).loc[df.isna().sum() == 0]

F    0
H    0
dtype: int64

In [12]:
df.isna().sum(axis=0).loc[df.isna().sum() > 0]

A    2
B    2
C    2
D    2
E    2
G    5
dtype: int64
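The two filters above compute `df.isna().sum()` twice in the same line. One possible variation, sketched here on a hypothetical toy DataFrame named `demo`, stores the counts once and reuses them. It also uses `.mean()` on the boolean result to get the FRACTION of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
demo = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": ["a", "b", None, "d"],
    "z": [10, 20, 30, 40],
})

na_counts = demo.isna().sum()      # number of missing values per column
na_fraction = demo.isna().mean()   # fraction of missing values per column

# Reuse the stored counts for both filters
complete_cols = na_counts.loc[na_counts == 0]
incomplete_cols = na_counts.loc[na_counts > 0].sort_values(ascending=False)

print(complete_cols)
print(incomplete_cols)
print(na_fraction)
```

Sorting the incomplete columns puts the columns with the most missing values first, which helps when a data set has many columns.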

In [13]:
df.nunique()

A    12
B    12
C    12
D    12
E     4
F     4
G     3
H     4
dtype: int64

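By default, the .nunique() method does NOT treat MISSINGS as a VALUE. It drops NA values before counting.

If you set dropna=False then the MISSING is counted as a VALUE. The cell below passes the default, dropna=True, explicitly, so the counts match the previous cell.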

In [14]:
df.nunique(dropna=True)

A    12
B    12
C    12
D    12
E     4
F     4
G     3
H     4
dtype: int64
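To see the effect of the `dropna` argument on a small scale, here is a minimal sketch that uses a hypothetical Series named `s` with two distinct strings and two missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series: 2 distinct strings plus 2 missing values
s = pd.Series(["aa", "bb", "bb", np.nan, np.nan])

n_default = s.nunique()               # default dropna=True: NaN is ignored
n_with_na = s.nunique(dropna=False)   # NaN is counted as ONE extra value

print(n_default, n_with_na)
```

No matter how many rows are missing, `dropna=False` adds at most 1 to the count.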

I think it is useful to check for cols with only 1 or 2 unique values: a col with 1 unique value is constant and a col with 2 unique values is binary.
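One way to find such columns is to filter the result of `.nunique()`. This is a sketch on a hypothetical toy DataFrame named `demo`; the column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
demo = pd.DataFrame({
    "constant": [7, 7, 7, 7],
    "binary": ["yes", "no", "yes", "no"],
    "many": [1, 2, 3, 4],
})

n_unique = demo.nunique()

# Columns with a single unique value carry no information
constant_cols = n_unique.loc[n_unique == 1].index.to_list()
# Columns with exactly two unique values behave like yes/no flags
binary_cols = n_unique.loc[n_unique == 2].index.to_list()

print(constant_cols)
print(binary_cols)
```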

Lastly, it is always important to begin summarizing the cols.

In [15]:
df.mean(numeric_only=True)

B      5.500000
C   -650.000000
F     17.857143
G    233.333333
dtype: float64

In [17]:
df.describe()

Unnamed: 0,B,C,F,G
count,12.0,12.0,14.0,9.0
mean,5.5,-650.0,17.857143,233.333333
std,3.605551,360.555128,8.925824,132.287566
min,0.0,-1200.0,10.0,100.0
25%,2.75,-925.0,10.0,100.0
50%,5.5,-650.0,20.0,200.0
75%,8.25,-375.0,20.0,400.0
max,11.0,-100.0,40.0,400.0
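Notice that the count row differs between columns. It reports the number of NON-MISSING values, not the number of rows. A minimal sketch on a hypothetical toy DataFrame named `demo` checks this against `.isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column x has one missing value
demo = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "z": [10, 20, 30, 40],
})

summary = demo.describe()

# Non-missing count per column = number of rows - number of missing values
non_missing = len(demo) - demo.isna().sum()

print(summary.loc["count"])
print(non_missing)
```

The other summary statistics, such as the mean, are also computed from the non-missing values only.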


In [18]:
df.describe(include="object")

Unnamed: 0,A,D,E,H
count,12,12,12,14
unique,12,12,4,4
top,a,Jan,aa,AAA
freq,1,1,3,6
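For string columns, the top and freq rows correspond to the most common value and its count. The sketch below, which assumes a hypothetical Series named `s`, compares `.describe()` with `.value_counts()`.

```python
import pandas as pd

# Hypothetical toy Series with one clear most-common value and one missing
s = pd.Series(["a", "b", "b", "b", None])

desc = s.describe()       # count, unique, top, freq for a string Series
counts = s.value_counts() # sorted from most to least frequent

print(desc)
print(counts)
```

When several values are tied for most frequent, as in columns A and D above, only one of them is reported as top, so it is worth looking at the full counts too.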


In [19]:
df.describe(include="all")

Unnamed: 0,A,B,C,D,E,F,G,H
count,12,12.0,12.0,12,12,14.0,9.0,14
unique,12,,,12,4,,,4
top,a,,,Jan,aa,,,AAA
freq,1,,,1,3,,,6
mean,,5.5,-650.0,,,17.857143,233.333333,
std,,3.605551,360.555128,,,8.925824,132.287566,
min,,0.0,-1200.0,,,10.0,100.0,
25%,,2.75,-925.0,,,10.0,100.0,
50%,,5.5,-650.0,,,20.0,200.0,
75%,,8.25,-375.0,,,20.0,400.0,
max,,11.0,-100.0,,,40.0,400.0,


The next step is to begin counting the categorical/string cols!

In [20]:
df.A.value_counts()

A
a    1
b    1
c    1
d    1
e    1
f    1
g    1
h    1
i    1
j    1
k    1
l    1
Name: count, dtype: int64

In [21]:
df.A.value_counts(dropna=False)

A
NaN    2
a      1
b      1
c      1
d      1
e      1
f      1
g      1
h      1
i      1
j      1
k      1
l      1
Name: count, dtype: int64

In [22]:
df.D.value_counts(dropna=False)

D
NaN    2
Jan    1
Feb    1
Mar    1
Apr    1
May    1
Jun    1
Jul    1
Aug    1
Sep    1
Oct    1
Nov    1
Dec    1
Name: count, dtype: int64

In [25]:
df.E.value_counts(dropna=False, normalize=True)

E
aa     0.214286
bb     0.214286
cc     0.214286
dd     0.214286
NaN    0.142857
Name: proportion, dtype: float64

In [24]:
df.F.value_counts(dropna=False)

F
10    6
20    6
30    1
40    1
Name: count, dtype: int64

In [26]:
df.F.value_counts(dropna=False, normalize=True)

F
10    0.428571
20    0.428571
30    0.071429
40    0.071429
Name: proportion, dtype: float64
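The proportions depend on whether the MISSINGS are counted. With `dropna=False` the denominator is the number of rows, which is why NaN shows up as 2/14 = 0.142857 for column E above. The sketch below uses a hypothetical Series named `s` to compare both choices.

```python
import pandas as pd

# Hypothetical toy Series: 8 rows, 2 of them missing
s = pd.Series(["aa", "bb", "aa", None, "aa", "cc", "bb", None])

counts = s.value_counts(dropna=False)

# Denominator is all 8 rows, missing included
props_all = s.value_counts(dropna=False, normalize=True)

# Denominator is the 6 non-missing rows only
props_seen = s.value_counts(normalize=True)

print(counts)
print(props_all)
print(props_seen)
```

Which denominator is appropriate depends on the question, so it helps to state which one was used when reporting proportions.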



## Realistic example

Let's follow the same steps to get the same type of BASIC information on a real data set!

In [36]:
import seaborn as sns

In [37]:
titanic = sns.load_dataset("titanic")

In [38]:
titanic.shape

(891, 15)

In [39]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [40]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [41]:
titanic.dtypes.value_counts()

object      5
int64       4
float64     2
bool        2
category    1
category    1
Name: count, dtype: int64
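The category data type appears TWICE in the counts above. A likely explanation is that each categorical column carries its own set of categories, so the two data types are not considered equal. The sketch below illustrates this on a hypothetical toy DataFrame named `demo` and shows one way to count by data type NAME instead.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns with different categories
demo = pd.DataFrame({
    "size": pd.Categorical(["S", "M", "L"]),
    "deck": pd.Categorical(["A", "B", "A"]),
})

# The two categorical dtypes differ because their categories differ
same_dtype = demo["size"].dtype == demo["deck"].dtype

# Converting each dtype to its name groups both columns under "category"
type_counts = demo.dtypes.map(str).value_counts()

# select_dtypes still finds both columns
cat_cols = demo.select_dtypes(include="category").columns.to_list()

print(same_dtype)
print(type_counts)
print(cat_cols)
```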

In [42]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [43]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [44]:
titanic.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [45]:
titanic.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [46]:
titanic.isna().sum().loc[titanic.isna().sum() == 0]

survived      0
pclass        0
sex           0
sibsp         0
parch         0
fare          0
class         0
who           0
adult_male    0
alive         0
alone         0
dtype: int64

In [47]:
titanic.isna().sum().loc[titanic.isna().sum() > 0]

age            177
embarked         2
deck           688
embark_town      2
dtype: int64

In [48]:
titanic.nunique()

survived         2
pclass           3
sex              2
age             88
sibsp            7
parch            7
fare           248
embarked         3
class            3
who              3
adult_male       2
deck             7
embark_town      3
alive            2
alone            2
dtype: int64

In [50]:
titanic.nunique(dropna=False)

survived         2
pclass           3
sex              2
age             89
sibsp            7
parch            7
fare           248
embarked         4
class            3
who              3
adult_male       2
deck             8
embark_town      4
alive            2
alone            2
dtype: int64
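Comparing the two results shows that age, embarked, deck, and embark_town each gain exactly 1, and these are the columns with MISSINGS. The difference can be computed directly, as in this sketch on a hypothetical toy DataFrame named `demo`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
demo = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 26.0],
    "town": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 7.92, 7.25],
})

# 1 where a column has at least one missing value, 0 otherwise
extra = demo.nunique(dropna=False) - demo.nunique()
cols_with_na = extra.loc[extra == 1].index.to_list()

print(extra)
print(cols_with_na)
```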

In [51]:
titanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [52]:
titanic.survived.value_counts()

survived
0    549
1    342
Name: count, dtype: int64

In [54]:
titanic.survived.value_counts(dropna=False, normalize=True)

survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64

In [55]:
titanic.describe(include="object")

Unnamed: 0,sex,embarked,who,embark_town,alive
count,891,889,891,889,891
unique,2,3,3,3,2
top,male,S,man,Southampton,no
freq,577,644,537,644,549


In [56]:
titanic.alive.value_counts()

alive
no     549
yes    342
Name: count, dtype: int64

In [57]:
titanic.alive.value_counts(dropna=False, normalize=True)

alive
no     0.616162
yes    0.383838
Name: proportion, dtype: float64

In [58]:
titanic.describe(include="category")

Unnamed: 0,class,deck
count,891,203
unique,3,7
top,Third,C
freq,491,59


In [59]:
titanic.describe(include="boolean")

Unnamed: 0,adult_male,alone
count,891,891
unique,2,2
top,True,True
freq,537,537


In [60]:
titanic.pclass.value_counts()

pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [61]:
titanic.pclass.value_counts(dropna=False, normalize=True)

pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64

In [62]:
titanic['class'].value_counts()

class
Third     491
First     216
Second    184
Name: count, dtype: int64

In [63]:
titanic['class'].value_counts(dropna=False, normalize=True)

class
Third     0.551066
First     0.242424
Second    0.206510
Name: proportion, dtype: float64

In [64]:
titanic['deck'].value_counts()

deck
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: count, dtype: int64

In [66]:
titanic['deck'].value_counts(dropna=False)

deck
NaN    688
C       59
B       47
D       33
E       32
A       15
F       13
G        4
Name: count, dtype: int64

In [67]:
titanic['deck'].value_counts(dropna=False, normalize=True)

deck
NaN    0.772166
C      0.066218
B      0.052750
D      0.037037
E      0.035915
A      0.016835
F      0.014590
G      0.004489
Name: proportion, dtype: float64
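Typing `.value_counts()` for every column gets repetitive. One possible shortcut is a loop over the non-numeric columns with few unique values. This is only a sketch: the toy DataFrame `demo` and the threshold `max_levels` are hypothetical choices, not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with string, categorical, boolean, and numeric columns
demo = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "deck": pd.Categorical(["C", None, "B", None]),
    "alone": [True, False, True, True],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

max_levels = 10  # hypothetical cutoff for "few" unique values

# Everything that is not a number: strings, categories, booleans
non_numeric_cols = demo.select_dtypes(exclude="number").columns

# One proportion table per column, with the MISSINGS included
tables = {
    col: demo[col].value_counts(dropna=False, normalize=True)
    for col in non_numeric_cols
    if demo[col].nunique() <= max_levels
}

for col, table in tables.items():
    print(table)
    print()
```

Storing the tables in a dictionary keeps them available for later inspection instead of only printing them.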