<a href='http://www.scienceacademy.ca'> <img style="float: left;height:70px" src="Log_SA.jpeg"></a>

# Useful methods and operations

pandas offers many methods for exploring your data and computing basic statistics on it. We have already covered some of them, e.g. <code>head(), isnull(), dropna(), fillna()</code>.

In this lecture, we will explore some more general-purpose operations and revise what we have learned in the previous lectures. <br>
Let's create a dataframe to get hands-on experience with these operations.<br>
I will repeat some values and also include NaN in our dataframe.

In [1]:
import numpy as np
import pandas as pd
data_dic = {'col_1':[1,2,3,4,5],
           'col_2':[111,222,333,111,555],
           'col_3':['alpha','bravo','charlie',np.nan,np.nan],
           }
df = pd.DataFrame(data_dic,index=[1,2,3,4,5])
df

   col_1  col_2    col_3
1      1    111    alpha
2      2    222    bravo
3      3    333  charlie
4      4    111      NaN
5      5    555      NaN


Let's start with what we know.

### <code>info()</code>
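Provides a concise summary of a DataFrame: the index range, the column names, the non-null count and dtype of each column, and the memory usage. We will use this method very often in the course.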

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1 to 5
Data columns (total 3 columns):
col_1    5 non-null int64
col_2    5 non-null int64
col_3    3 non-null object
dtypes: int64(2), object(1)
memory usage: 160.0+ bytes


### <code>head(n)</code>
Returns the first n rows; the default is 5. This is very useful for getting an overview of our data. We will use this very often in the course.

In [3]:
df.head(2)

   col_1  col_2  col_3
1      1    111  alpha
2      2    222  bravo


### <code>isnull()</code>
Returns a boolean object of the same size, indicating which values are null.

In [4]:
df.isnull()

   col_1  col_2  col_3
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False   True
5  False  False   True
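
A boolean table like the one above is often summarised rather than read cell by cell. The following is a minimal sketch, rebuilding the lecture's `df` so it runs on its own; the variable names `missing_per_column` and `total_missing` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as the lecture's df
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': [111, 222, 333, 111, 555],
                   'col_3': ['alpha', 'bravo', 'charlie', np.nan, np.nan]},
                  index=[1, 2, 3, 4, 5])

# True counts as 1 and False as 0, so sum() counts the missing values
missing_per_column = df.isnull().sum()   # one count per column
total_missing = df.isnull().sum().sum()  # a single number for the whole frame

print(missing_per_column)
print(total_missing)
```

Only col_3 contains missing values here, so it is the only column with a non-zero count.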


### <code>dropna()</code>
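Drops the rows or columns that contain missing values.
* <code>axis = 0/rows, 1/columns</code> -- 0 is the default
* <code>inplace = False</code> by default; set it to <code>True</code> to change the dataframe itself instead of returning a new one

*Using the print function to compare the output for both axes*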

In [5]:
print(df.dropna(axis = 0))
print(df.dropna(axis = 1))

   col_1  col_2    col_3
1      1    111    alpha
2      2    222    bravo
3      3    333  charlie
   col_1  col_2
1      1    111
2      2    222
3      3    333
4      4    111
5      5    555
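
To make the effect of `inplace` concrete, here is a small sketch under the assumption that we work on a copy (the name `df_copy` is illustrative) so the lecture's `df` stays untouched.

```python
import numpy as np
import pandas as pd

# Same toy data as the lecture's df
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': [111, 222, 333, 111, 555],
                   'col_3': ['alpha', 'bravo', 'charlie', np.nan, np.nan]},
                  index=[1, 2, 3, 4, 5])

df_copy = df.copy()  # work on a copy so df is not modified

# With inplace=True the call modifies df_copy and returns None
returned = df_copy.dropna(axis=0, inplace=True)

print(returned)       # None
print(df_copy.shape)  # rows 4 and 5 were dropped from the copy
print(df.shape)       # the original df still has all 5 rows
```

Because an in-place call returns `None`, writing `df = df.dropna(inplace=True)` would replace the dataframe with `None`; use either the assignment or `inplace=True`, not both.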


### <code>fillna()</code>
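Fills NA/NaN values with a given value or using a fill method. <br>
* <code>value = None</code> by default 
* <code>method = None</code> by default ('backfill', 'ffill' etc.); recent pandas releases (2.1 and later) deprecate this argument in favour of the <code>ffill()</code> and <code>bfill()</code> methods
* <code>axis = 0/row or index, 1/columns</code>
* <code>inplace = False</code> by default. If <code>True</code>, the fill happens in place and the data will be modified.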

In [6]:
# df.fillna() # ValueError: must specify a fill method or value
print(df.fillna(value = 'XYZ'))
print(df.fillna(method = 'ffill'))

   col_1  col_2    col_3
1      1    111    alpha
2      2    222    bravo
3      3    333  charlie
4      4    111      XYZ
5      5    555      XYZ
   col_1  col_2    col_3
1      1    111    alpha
2      2    222    bravo
3      3    333  charlie
4      4    111  charlie
5      5    555  charlie
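
Since the `method` argument of `fillna()` is deprecated in recent pandas releases, the same forward fill can be written with the dedicated `ffill()` and `bfill()` methods. This is a sketch on the lecture's data; `df_forward` and `df_backward` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same toy data as the lecture's df
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': [111, 222, 333, 111, 555],
                   'col_3': ['alpha', 'bravo', 'charlie', np.nan, np.nan]},
                  index=[1, 2, 3, 4, 5])

# Forward fill: each NaN takes the last valid value above it
df_forward = df.ffill()

# Backward fill: each NaN takes the next valid value below it.
# Rows 4 and 5 have no valid value below them, so they remain NaN.
df_backward = df.bfill()

print(df_forward)
print(df_backward)
```

Backward fill leaves the trailing NaN values in place, which is a reminder that a fill method can only copy values that actually exist in the chosen direction.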


### <code>unique()</code>
Finds and returns all the unique values.<br>
Let's see how it works on all the columns in our dataframe.

In [7]:
print(df['col_1'].unique())
print(df['col_2'].unique())
print(df['col_3'].unique())
# 111 and NaN are repeated values; unique() returns each of them only once.

[1 2 3 4 5]
[111 222 333 555]
['alpha' 'bravo' 'charlie' nan]


### <code>nunique()</code>
Returns the number of unique values.<br>
&#9758; Notice the difference: unlike <code>unique()</code>, <code>nunique()</code> does not count NaN by default, so it returns "3" for col_3.

In [8]:
print(df['col_1'].nunique())
print(df['col_2'].nunique())
print(df['col_3'].nunique())

5
4
3
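
If the missing value should be counted as well, `nunique()` accepts a `dropna` argument. A minimal sketch on the lecture's col_3 (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same values as col_3 in the lecture's df
col_3 = pd.Series(['alpha', 'bravo', 'charlie', np.nan, np.nan],
                  index=[1, 2, 3, 4, 5], name='col_3')

without_nan = col_3.nunique()           # default dropna=True ignores NaN
with_nan = col_3.nunique(dropna=False)  # NaN is counted as one extra value
from_unique = len(col_3.unique())       # unique() always includes NaN

print(without_nan, with_nan, from_unique)
```

With `dropna=False` the count agrees with the length of the array returned by `unique()`.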


### <code>value_counts()</code>
When we want a table of all the values along with the number of times each one appears in our data, <code>value_counts()</code> does the work!<br>
&#9758; NaN is not counted by default, so missing values do not appear in the output.

In [9]:
print(df['col_1'].value_counts())
print(df['col_2'].value_counts())
print(df['col_3'].value_counts())

5    1
4    1
3    1
2    1
1    1
Name: col_1, dtype: int64
111    2
222    1
333    1
555    1
Name: col_2, dtype: int64
charlie    1
bravo      1
alpha      1
Name: col_3, dtype: int64
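
To include the missing values in the table, `value_counts()` also takes `dropna=False`. The sketch below assumes the same col_3 values as the lecture; `counts_default` and `counts_all` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same values as col_3 in the lecture's df
col_3 = pd.Series(['alpha', 'bravo', 'charlie', np.nan, np.nan],
                  index=[1, 2, 3, 4, 5], name='col_3')

counts_default = col_3.value_counts()          # NaN is left out
counts_all = col_3.value_counts(dropna=False)  # NaN gets its own row

print(counts_default)
print(counts_all)
```

The counts in `counts_all` add up to the number of rows, while `counts_default` only accounts for the non-missing ones.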


&#9989; <code>**unique(), nunique(), value_counts()**</code> are three very useful and frequently used methods for working with unique values in the data.

### <code>sort_values()</code>
Sorts the dataframe by the values of the given column(s). By default:<br>
* <code>ascending=True</code>
* <code>inplace=False</code> 

In [10]:
df.sort_values(by='col_2')

   col_1  col_2    col_3
1      1    111    alpha
4      4    111      NaN
2      2    222    bravo
3      3    333  charlie
5      5    555      NaN
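
The `ascending` default can be overridden, and `by` also accepts a list of columns. A short sketch on the lecture's data (the names `descending` and `multi` are illustrative):

```python
import numpy as np
import pandas as pd

# Same toy data as the lecture's df
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': [111, 222, 333, 111, 555],
                   'col_3': ['alpha', 'bravo', 'charlie', np.nan, np.nan]},
                  index=[1, 2, 3, 4, 5])

# Largest col_2 first
descending = df.sort_values(by='col_2', ascending=False)

# Sort by col_2 ascending, and break ties with col_1 descending
multi = df.sort_values(by=['col_2', 'col_1'], ascending=[True, False])

print(descending)
print(multi)
```

In `multi`, the two rows with col_2 equal to 111 are ordered by col_1 from high to low, so index 4 comes before index 1.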


## Data Selection
Let's talk about ***Selecting Data*** once again. We have already learned to grab data in our previous lectures. <br>
* We can grab a column by its name, do conditional selection, and much more. <br>
* We can use loc and iloc to select rows as well.<br>

Let's revise conditional selection, which also includes data selection based on the column name. <br>

Let's do the following steps:<br>

    * df['col_1'] > 2 : returns a boolean series that is True where the condition holds
    * df['col_2'] == 111 : returns a boolean series that is True where the condition holds
    * Let's combine these two conditions with &, putting each condition in ().
    * Wrap them in df[] and see what it returns!

Our one-line condition is <code>**(df['col_1'] > 2) & (df['col_2'] == 111)**</code>

In [11]:
df['col_1'] > 2 # boolean series

1    False
2    False
3     True
4     True
5     True
Name: col_1, dtype: bool

In [12]:
df['col_2'] == 111 # boolean series

1     True
2    False
3    False
4     True
5    False
Name: col_2, dtype: bool

In [13]:
"""We can say, this is a boolean mask on said condition to provide 
to the dataframe, df, for filtering out the results."""
bool_ser = (df['col_1'] > 2) & (df['col_2'] == 111)
bool_ser

1    False
2    False
3    False
4     True
5    False
dtype: bool

In [14]:
result = df[bool_ser]
result
# Equivalent one-liner: df[(df['col_1'] > 2) & (df['col_2'] == 111)]
# In the output below, we get the data that satisfies both conditions!

   col_1  col_2  col_3
4      4    111    NaN
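
The same boolean mask also works with `loc`, and conditions can be combined with `|` (or) and negated with `~` (not). The sketch below is one possible illustration on the lecture's data; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as the lecture's df
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': [111, 222, 333, 111, 555],
                   'col_3': ['alpha', 'bravo', 'charlie', np.nan, np.nan]},
                  index=[1, 2, 3, 4, 5])

mask = (df['col_1'] > 2) & (df['col_2'] == 111)

# loc: rows chosen by the mask, columns chosen by name
subset = df.loc[mask, ['col_1', 'col_2']]

# | keeps rows where at least one condition is True
either = df[(df['col_1'] > 4) | (df['col_2'] == 111)]

# ~ inverts the mask: every row that did NOT match
negated = df[~mask]

# iloc selects by position: position 0 is the row labelled 1
first_row = df.iloc[0]

print(subset)
print(either)
print(negated)
print(first_row)
```

Each condition needs its own parentheses because `&`, `|` and `~` bind more tightly than comparison operators in Python.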


### <code>apply()</code>
This is one of the most powerful pandas features. Using the <code>**apply()**</code> method, we can **broadcast** our **customized functions** over our data.<br>
Let's see how to calculate the square of each value in col_1.

In [15]:
# Our customized function to calculate the squares 
def square(value):
    return value**2

* Let's broadcast our customized function <code>"square"</code> using the <code>"apply"</code> method to calculate the squares of col_1 in our DataFrame, df.

In [16]:
df['col_1'].apply(square)

1     1
2     4
3     9
4    16
5    25
Name: col_1, dtype: int64

* The same operation can be conveniently carried out using a <code>**lambda**</code> expression!

In [17]:
df['col_1'].apply(lambda value: value**2)

1     1
2     4
3     9
4    16
5    25
Name: col_1, dtype: int64

In [18]:
# Yes, we can use built-in functions with apply as well 
# Finding the length of the strings in the column (first three rows only)
df['col_3'][0:3].apply(len)

1    5
2    5
3    7
Name: col_3, dtype: int64

&#9758; We are avoiding the <code>NaN</code> values in col_3, because calling <code>len</code> on them raises:<br>
<code>TypeError: object of type 'float' has no len()</code>

In [19]:
# Let's confirm the type of NaN
type(np.nan)

float
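
Slicing the first three rows works here only because we know where the NaN values sit. A few more general ways to avoid the `TypeError` are sketched below; this is one possible approach rather than the only one, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as col_3 in the lecture's df
col_3 = pd.Series(['alpha', 'bravo', 'charlie', np.nan, np.nan],
                  index=[1, 2, 3, 4, 5], name='col_3')

# Option 1: drop the missing values before applying len
lengths = col_3.dropna().apply(len)

# Option 2: the vectorized string accessor keeps NaN as NaN
str_lengths = col_3.str.len()

# Option 3: guard inside the function and pick a fallback (0 here)
safe_lengths = col_3.apply(lambda v: len(v) if isinstance(v, str) else 0)

print(lengths)
print(str_lengths)
print(safe_lengths)
```

Option 1 shortens the result to three rows, while options 2 and 3 keep all five rows; which one fits depends on whether the missing rows should stay visible.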

## Good to know

In [20]:
# Getting index names
df.index

Int64Index([1, 2, 3, 4, 5], dtype='int64')

In [21]:
# Getting column names
df.columns

Index(['col_1', 'col_2', 'col_3'], dtype='object')

In [22]:
# Deleting row (axis=0) or column (axis=1) 
print(df.drop('col_1',axis=1))
print(df) # inplace = True for permanent change

   col_2    col_3
1    111    alpha
2    222    bravo
3    333  charlie
4    111      NaN
5    555      NaN
   col_1  col_2    col_3
1      1    111    alpha
2      2    222    bravo
3      3    333  charlie
4      4    111      NaN
5      5    555      NaN
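
The cell above drops a column; dropping rows works the same way with index labels. A minimal sketch on the lecture's data (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Same toy data as the lecture's df
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5],
                   'col_2': [111, 222, 333, 111, 555],
                   'col_3': ['alpha', 'bravo', 'charlie', np.nan, np.nan]},
                  index=[1, 2, 3, 4, 5])

# Drop a single row by its index label (axis=0 is the default)
without_row_4 = df.drop(4, axis=0)

# Drop several rows at once by passing a list of labels
without_rows = df.drop([4, 5])

# The columns keyword is an alternative to axis=1
fewer_columns = df.drop(columns=['col_1', 'col_3'])

print(without_row_4)
print(without_rows)
print(fewer_columns)
```

As before, none of these calls changes `df` itself unless `inplace=True` is passed.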


In [23]:
# deleting col_1 permanently
newdf= df.copy() # creating a copy, may need to use df at later stage
del newdf['col_1']
newdf

   col_2    col_3
1    111    alpha
2    222    bravo
3    333  charlie
4    111      NaN
5    555      NaN


In [24]:
df.index

Int64Index([1, 2, 3, 4, 5], dtype='int64')

### <code>pivot_table()</code>
<code>shift + tab</code> to read the documentation.<br> 
Create a spreadsheet-style pivot table as a DataFrame. The levels in the
pivot table will be stored in MultiIndex objects (hierarchical indexes) on
the index and columns of the result DataFrame.<br>

<code>**pivot_table**</code> takes three main arguments:<br>
* <code>**values**</code> default is None
* <code>**index**</code> default is None
* <code>**columns**</code> default is None

Let's create a pivot table from our dataframe df. <br>
* We want our data points to be col_2, so, **values = 'col_2'**<br>
* We want our index to be col_1, so, **index = 'col_1'**<br>
* Finally, we want our columns to be defined by col_3, so, **columns = ['col_3']** 

&#9758; If you are an excel user, you may be familiar with pivot_table. If not, don't worry about this at this stage, we will discuss it in the coming sections of the course.

In [25]:
df

   col_1  col_2    col_3
1      1    111    alpha
2      2    222    bravo
3      3    333  charlie
4      4    111      NaN
5      5    555      NaN


In [26]:
df.pivot_table(values = 'col_2',  index='col_1', columns=['col_3'])

col_3  alpha  bravo  charlie
col_1
1      111.0    NaN      NaN
2        NaN  222.0      NaN
3        NaN    NaN    333.0


NaN appears wherever there is no data for an index/column combination.<br>
A NaN in col_3 cannot be used as a column name in the pivot table, so the rows with index 4 and 5 are skipped.

**Let's have a look at another example of pivot_table**<br>

In [27]:
# Creating DataFrame 
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}

foobar = pd.DataFrame(data)

In [28]:
# Our dataframe looks like
foobar

     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1


Let's create a pivot table from our dataframe foobar. <br>
* We want our data points to be D, so, **values = 'D'**<br>
* We want our index to be A,B in multilevel index, so, **index = ['A','B']**<br>
* Finally, we want our columns to be defined by C, so, **columns = ['C']**

In [29]:
foobar.pivot_table(values='D',index=['A', 'B'],columns=['C'])

C          x    y
A   B
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN
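
In the table above every (A, B, C) combination occurs at most once, so each cell holds a single value of D. When several rows fall into the same cell, `pivot_table` aggregates them, using the mean by default. The sketch below uses only A as the index to force such collisions; `aggfunc` and `fill_value` are real `pivot_table` parameters that the lecture has not introduced, and the variable names are illustrative.

```python
import pandas as pd

# Same data as the lecture's foobar dataframe
foobar = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                       'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                       'C': ['x', 'y', 'x', 'y', 'x', 'y'],
                       'D': [1, 3, 2, 5, 4, 1]})

# Default aggregation is the mean of all D values that land in a cell
mean_table = foobar.pivot_table(values='D', index='A', columns='C')

# aggfunc changes the aggregation; fill_value replaces empty cells
sum_table = foobar.pivot_table(values='D', index='A', columns='C',
                               aggfunc='sum', fill_value=0)

print(mean_table)
print(sum_table)
```

For example, the two foo rows with C equal to x have D values 1 and 2, which gives 1.5 in the mean table and 3 in the sum table.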


# Great Job!
This was a little long, but you did it! Let's have a quick overview and move on to use the skills we have learned in the coming exercises!<br>
This covers all of the pandas that we set out to learn. <br>
&#9989; Keep practicing to brush-up and add new skills.