# CPSC380: 3_Pandas_4_Statistics

In this notebook, you will learn how to use the following pandas features:
 - Sorting
 - Descriptive statistics
 - Other useful methods
 
Read more:
 - Python Data Analysis textbook (chapter 5)
 - [Pandas website](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

In [22]:
import pandas as pd
import numpy as np

## 1. Sorting

### 1.1 Sort by index / column

In [23]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj,'\n')
print(obj.sort_index(),'\n')

d    0
a    1
b    2
c    3
dtype: int64 

a    1
b    2
c    3
d    0
dtype: int64 



In [24]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

print(frame, "\n") 
print(frame.sort_index(), "\n")                    # default: sort the rows by row index label
print(frame.sort_index(axis='columns'))            # axis=1 or axis='columns': sort the columns by column label

       d  a  b  c
three  0  1  2  3
one    4  5  6  7 

       d  a  b  c
one    4  5  6  7
three  0  1  2  3 

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [25]:
frame.sort_index(axis=1, ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


### 1.2 Sort by value

In [26]:
obj = pd.Series([4, 7, -3, 2])
print(obj,'\n')
print(obj.sort_values(),'\n')

0    4
1    7
2   -3
3    2
dtype: int64 

2   -3
3    2
0    4
1    7
dtype: int64 



In [27]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
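
The output above shows that missing values are placed at the end by default. As a minimal sketch (reusing the same sample values), the `na_position` and `ascending` parameters of `sort_values` control where `NaN` lands and the direction of the sort:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Default behaviour: missing values are placed at the end of the result.
nan_last = obj.sort_values()

# na_position='first' moves the missing values to the front instead.
nan_first = obj.sort_values(na_position='first')

# ascending=False reverses the order of the non-missing values only;
# NaN still goes last unless na_position says otherwise.
descending = obj.sort_values(ascending=False)

print(nan_first, '\n')
print(descending)
```

Note that `sort_values` returns a new Series and leaves `obj` unchanged.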

In [28]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame,'\n')
print(frame.sort_values(by='b'),'\n') # sort by column name   

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1 

   b  a
2 -3  0
3  2  1
0  4  0
1  7  1 



In [29]:
# sort by multiple columns: 'a' first, then 'b' to break ties
print(frame,'\n')
print(frame.sort_values(by=['a', 'b']),'\n')    

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1 

   b  a
2 -3  0
0  4  0
3  2  1
1  7  1 
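
When sorting by several columns, each column can have its own direction. The sketch below reuses the same `frame` and passes a list to `ascending`, one entry per column in `by`:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then break ties by 'b' descending.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```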



## 2. Descriptive Statistics

In [30]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


### 2.1 DataFrame/Series Sum

In [31]:
print(df,'\n')
print(df.sum(),'\n')                          # sum down the rows: one total per column, skipping NaN (unlike numpy.sum on an array, which adds up every element by default)
print(df.sum(axis='columns'),'\n')            # sum across the columns: one total per row, skipping NaN

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    9.25
two   -5.80
dtype: float64 

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64 
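
In the output above, row `c` contains only `NaN` yet sums to `0.00`, because skipped values contribute nothing. As a minimal sketch using the same `df`, the `min_count` and `skipna` parameters of `sum` change how missing values are treated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NaN is skipped, so the all-NaN row 'c' sums to 0.0.
row_sums = df.sum(axis='columns')

# min_count=1 requires at least one non-missing value; row 'c' becomes NaN.
row_sums_strict = df.sum(axis='columns', min_count=1)

# skipna=False propagates NaN: any row containing a NaN sums to NaN.
row_sums_no_skip = df.sum(axis='columns', skipna=False)

print(row_sums_strict, '\n')
print(row_sums_no_skip)
```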



### 2.2 DataFrame/Series Mean

In [32]:
print(df,'\n')
print(df.mean(),'\n')                           # mean of each column (down the rows), skipping NaN
print(df.mean(axis='columns'),'\n')             # mean of each row (across the columns), skipping NaN
print(df.mean(axis=1),'\n')                     # axis=1 is the same as axis='columns'
print(df.mean(axis='columns', skipna=False))    # mean of each row; any NaN in a row makes its result NaN


    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    3.083333
two   -2.900000
dtype: float64 

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64 

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64 

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64


### 2.3 DataFrame/Series Max/Min

In [33]:
print(df,'\n')
print(df.max(),'\n')                           # maximum of each column (down the rows), skipping NaN
print(df.max(axis='columns'),'\n')             # maximum of each row (across the columns), skipping NaN
print(df.max(axis='columns', skipna=False))    # maximum of each row; any NaN in a row makes its result NaN

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    7.1
two   -1.3
dtype: float64 

a    1.40
b    7.10
c     NaN
d    0.75
dtype: float64 

a     NaN
b    7.10
c     NaN
d    0.75
dtype: float64


### 2.4 DataFrame/Series Idxmax/Idxmin

In [34]:
# idxmax / idxmin return the index label at which the maximum / minimum value occurs
print (df,'\n') 
print(df.idxmax(),'\n') 
print(df.idxmax(axis='columns'),'\n') 

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

one    b
two    d
dtype: object 

a    one
b    one
c    NaN
d    one
dtype: object 
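
The cell above only shows `idxmax`. As a small sketch with the same `df`, `idxmin` works the same way, and the returned labels can be passed to `loc` to recover the values themselves:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

max_labels = df.idxmax()   # row label of the largest value in each column
min_labels = df.idxmin()   # row label of the smallest value in each column

# Look up the value stored at each reported label.
for col in df.columns:
    print(col,
          'max at', max_labels[col], '=', df.loc[max_labels[col], col],
          '| min at', min_labels[col], '=', df.loc[min_labels[col], col])
```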



### 2.5 DataFrame/Series Cumsum

In [35]:
#Cumulative sum of values
print(df,'\n') 
print(df.cumsum(),'\n') 
print(df.cumsum(axis='columns'),'\n') 

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8 

    one   two
a  1.40   NaN
b  7.10  2.60
c   NaN   NaN
d  0.75 -0.55 



### 2.6 DataFrame/Series Describe()

Generate descriptive statistics, including those that summarize the **central tendency**, **dispersion and shape** of a dataset’s distribution, **excluding NaN values**.

In [36]:
print(df,'\n') 

# Compute set of summary statistics for Series or each DataFrame column
df.describe()

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 



            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


In [37]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object
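
For numeric data, each row of the `describe()` result corresponds to a standalone method, and the reported percentiles can be chosen with the `percentiles` parameter. The following sketch, using the same numeric `df` as above, illustrates both points:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Request the 10th and 90th percentiles instead of the default 25% and 75%.
summary = df.describe(percentiles=[0.1, 0.9])
print(summary, '\n')

# The 'count' and 'std' rows match the standalone methods.
print(df.count(), '\n')   # number of non-missing values per column
print(df.std())           # sample standard deviation (ddof=1) per column
```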

### 2.7 DataFrame/Series Info()

This method prints a summary of a DataFrame, including the **index dtype**, the **columns** and their dtypes, the number of **non-null** values per column, and the **memory usage**.

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   one     3 non-null      float64
 1   two     2 non-null      float64
dtypes: float64(2)
memory usage: 96.0+ bytes



## 3. Other useful methods

### 3.1 DataFrame.head(n=5): 

This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

In [39]:
print (df,'\n') 
print (df.head(2),'\n')

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

   one  two
a  1.4  NaN
b  7.1 -4.5 



### 3.2 DataFrame.tail(n=5):

This function returns the last n rows of the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

In [40]:
print (df,'\n') 
print (df.tail(2),'\n')

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3 

    one  two
c   NaN  NaN
d  0.75 -1.3 
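
As a small sketch of the "verify after sorting" use case, `head` and `tail` can be chained after `sort_values` on the same `df` to inspect the largest values and the rows that ended up last:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Sort by column 'one', largest first; NaN is placed last by default.
by_one = df.sort_values(by='one', ascending=False)

top_two = by_one.head(2)    # the two rows with the largest 'one' values
last_row = by_one.tail(1)   # the row that sorted last (here, the NaN row)

print(top_two, '\n')
print(last_row)
```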



### 3.3 Series.unique()

Unique values are returned in order of appearance. The method is hash table-based, so it does NOT sort the result.

In [41]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)
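
Because `unique()` does not sort, a sorted result needs an explicit extra step. A minimal sketch with the same Series, using Python's built-in `sorted` and the related `nunique()` method for the count:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = list(obj.unique())       # order of first appearance
sorted_uniques = sorted(uniques)   # explicit sort when order matters
n_unique = obj.nunique()           # number of distinct values

print(uniques)
print(sorted_uniques)
print(n_unique)
```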

### 3.4 Series.value_counts()

The resulting object is sorted by count in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

In [42]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64
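
Two parameters of `value_counts` are often useful alongside the default call: `normalize=True` returns relative frequencies, and `dropna=False` keeps missing values in the tally. The sketch below uses the same `obj`, plus a second Series `obj_na` introduced here only to illustrate missing values:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()                 # absolute counts
shares = obj.value_counts(normalize=True)   # fractions that sum to 1

# obj_na is a hypothetical example Series containing two missing values.
obj_na = pd.Series(['c', 'a', 'a', None, 'b', None])
without_na = obj_na.value_counts()              # missing values excluded
with_na = obj_na.value_counts(dropna=False)     # missing values counted too

print(shares, '\n')
print(with_na)
```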