## CMPINF 2100 Week 05
### Begin Exploring by Summarizing Pandas Series

We will explore data before training predictive models. This process is known as Exploratory Data Analysis (EDA).

An important aspect of EDA is knowing how to calculate SUMMARY STATISTICS. This notebook demonstrates how to summarize a Pandas Series.

## Import Modules

In [1]:
import numpy as np
import pandas as pd

## Review of NumPy Summary Methods
Let's create a list of integers and then convert that list to a 1D NumPy array.

In [2]:
my_list = [10, 20, 30, 40, 50, 60, 70, 80]

Convert to an array.

In [3]:
my_array = np.array(my_list)

In [4]:
my_array.mean()

45.0

In [5]:
my_array.var(ddof=1)

600.0

In [6]:
my_array.std(ddof=1)

24.49489742783178

In [7]:
my_array.min()

10

In [8]:
my_array.max()

80

## Pandas Series - Summary Methods
Convert the list into a Pandas Series.

In [10]:
my_series = pd.Series(my_list)

In [11]:
my_series

0    10
1    20
2    30
3    40
4    50
5    60
6    70
7    80
dtype: int64

Most of the Pandas Series summary methods work very similarly to their NumPy counterparts!!

In [12]:
my_series.mean()

45.0

In [13]:
my_series.min()

10

In [14]:
my_series.max()

80

But look closely at the VARIANCE!!

In [16]:
my_series.var()

600.0

In [17]:
my_array.var()

525.0

In [18]:
my_array.var(ddof=1)

600.0

Look closely at the standard deviation!!!

In [19]:
my_series.std()

24.49489742783178

In [20]:
my_array.std()

22.9128784747792

In [21]:
my_array.std(ddof=1)

24.49489742783178

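Pandas uses `ddof=1` BY DEFAULT when the variance or standard deviation is calculated, whereas NumPy defaults to `ddof=0`!!

With `ddof=1`, Pandas divides by n - 1 and returns the UNBIASED estimate of the variance; the standard deviation is the square root of that estimate!!!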
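The difference between the two results comes down to the divisor. A minimal sketch, reusing the same eight values, computes both versions by hand; the names `values`, `n`, `sum_sq`, `pop_var`, and `sample_var` are illustrative and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30, 40, 50, 60, 70, 80])
n = values.size

# Sum of squared deviations from the mean
sum_sq = ((values - values.mean()) ** 2).sum()

pop_var = sum_sq / n             # divisor n, the NumPy default (ddof=0)
sample_var = sum_sq / (n - 1)    # divisor n - 1, the Pandas default (ddof=1)

print(pop_var, values.var())                  # 525.0 525.0
print(sample_var, pd.Series(values).var())    # 600.0 600.0
```

Passing `ddof=1` to the NumPy method, or `ddof=0` to the Pandas method, makes the two libraries agree.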

### Unique values 

We can get the number of unique values for a Pandas Series!!

In [22]:
my_series.nunique()

8

In [23]:
my_series

0    10
1    20
2    30
3    40
4    50
5    60
6    70
7    80
dtype: int64

Knowing the number of unique values is especially important for CATEGORICAL or STRING variables!!

In [26]:
my_series_b = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'])

In [27]:
my_series_b

0    A
1    A
2    A
3    B
4    B
5    B
6    C
7    D
8    D
dtype: object

In [28]:
my_series_b.nunique()

4

In [29]:
my_series_b.size

9

In [30]:
my_series_b.shape

(9,)

The number of unique values does NOT need to equal the number of elements or SIZE!!

My favorite Pandas method, `.value_counts()`, focuses on dealing with unique values!!!

Oftentimes we want to COUNT the number of times each unique value occurs!

In [32]:
my_series_b.value_counts()

A    3
B    3
D    2
C    1
Name: count, dtype: int64

The COUNTS give us more information than just the unique values.

In [34]:
my_series_b.value_counts().index

Index(['A', 'B', 'D', 'C'], dtype='object')
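Counts are often easier to compare as proportions. As a small sketch, `.value_counts()` accepts `normalize=True` to return relative frequencies instead of raw counts; the variable names `letters`, `counts`, and `proportions` are illustrative and not part of the original notebook.

```python
import pandas as pd

letters = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'])

# Raw counts per unique value
counts = letters.value_counts()

# Relative frequencies: each count divided by the number of non-missing elements
proportions = letters.value_counts(normalize=True)

print(counts.sum())          # total number of non-missing elements
print(proportions.round(3))  # the proportions sum to 1
```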

If you want the unique values, then you can use the `.unique()` method.

In [35]:
my_series_b.unique()

array(['A', 'B', 'C', 'D'], dtype=object)
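Notice that the two approaches return the same values in a different order: `.unique()` keeps the order of first appearance, while `.value_counts()` sorts by descending count. The sketch below makes the comparison explicit; `letters`, `by_appearance`, and `by_frequency` are illustrative names.

```python
import pandas as pd

letters = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'])

by_appearance = list(letters.unique())              # order of first appearance
by_frequency = list(letters.value_counts().index)   # most frequent first

print(by_appearance)
print(by_frequency)

# Same set of values, regardless of ordering
print(set(by_appearance) == set(by_frequency))
```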

## Summarize Individual columns within DataFrames

This is to reinforce the fact that COLUMNS are really Pandas Series within a DataFrame.

Let's read in the JOINED data set we created previously.

In [36]:
df = pd.read_csv("joined_data.csv")

In [37]:
df

Unnamed: 0,A,B,C,D,E,F,G,H
0,a,0.0,-100.0,Jan,aa,10,100.0,AAA
1,b,1.0,-200.0,Feb,aa,20,100.0,BBB
2,c,2.0,-300.0,Mar,aa,10,100.0,AAA
3,d,3.0,-400.0,Apr,bb,20,200.0,BBB
4,e,4.0,-500.0,May,bb,10,200.0,AAA
5,f,5.0,-600.0,Jun,bb,20,200.0,BBB
6,g,6.0,-700.0,Jul,cc,10,,AAA
7,h,7.0,-800.0,Aug,cc,20,,BBB
8,i,8.0,-900.0,Sep,cc,10,,AAA
9,j,9.0,-1000.0,Oct,dd,20,400.0,BBB


Access any column and apply summary methods just as if it were a regular Pandas Series in the environment.

In [38]:
df['A']

0       a
1       b
2       c
3       d
4       e
5       f
6       g
7       h
8       i
9       j
10      k
11      l
12    NaN
13    NaN
Name: A, dtype: object

In [39]:
df['A'].unique()

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', nan],
      dtype=object)

In [42]:
df.A.unique()

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', nan],
      dtype=object)

In [43]:
df.A.nunique()

12
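The array returned by `.unique()` above has 13 entries, yet `.nunique()` reports 12. The gap is the missing value: `.unique()` includes `nan`, while `.nunique()` excludes missing values by default. A minimal sketch on a small stand-in Series (the name `col` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

col = pd.Series(['a', 'b', 'c', np.nan, np.nan])

print(len(col.unique()))           # 4: 'a', 'b', 'c', and nan
print(col.nunique())               # 3: missing values are excluded
print(col.nunique(dropna=False))   # 4: missing counted as one extra value
print(col.isna().sum())            # 2: number of missing entries
```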

We can apply summary methods like `.mean()` and `.std()` to any numeric column!

In [44]:
df.F.mean()

17.857142857142858

In [45]:
df.F.std()

8.92582375303981

In [46]:
df['F'].mean()

17.857142857142858

In [47]:
df['F'].std()

8.92582375303981

We can also calculate the STANDARD ERROR of the mean (SEM)!!!

In [48]:
df.F.std()/np.sqrt(df.F.size)

2.385526741328836

In [49]:
df.F.sem()

2.385526741328836
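The manual calculation and `.sem()` agree here because column `F` has no missing values. When a column does contain `NaN`, `.size` still counts those entries, whereas `.std()` and `.sem()` skip them, so `.count()` is the safer divisor. A sketch on a made-up Series (`x`, `sem_size`, and `sem_count` are illustrative names):

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, 20.0, np.nan, 20.0, 10.0])

# .size counts every element, including the NaN
sem_size = x.std() / np.sqrt(x.size)

# .count() counts only the non-missing elements
sem_count = x.std() / np.sqrt(x.count())

print(sem_size)
print(sem_count)
print(x.sem())   # matches sem_count, not sem_size
```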