---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Pandas</h1>

<a href="https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-2-Basics-of-Python-Programming/Lec-2.02-Anaconda-and-Jupyter-Notebook/02-markdown-example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## _08-Sorting Dataframes.ipynb_

## Learning agenda of this notebook

1. Sorting dataframes using sort_values()
    - Creating a simple Dataframe
    - Understanding sort_values()
    - Sorting by Single Column
    - Sorting by Multiple Columns
    - Reset the Index
    - Handle NaN Values
2. Sorting dataframes using sort_index()
    - Sorting by Column Label
    - Sorting by Index
 

## 1. Sorting dataframes using `df.sort_values()`

>A pandas DataFrame has two useful sorting methods: **`df.sort_values()`** sorts by the values of one or more columns, and **`df.sort_index()`** sorts by the index. Each of these methods comes with numerous options, such as sorting in a specific order (ascending or descending), sorting in place, positioning missing values, and choosing a specific sorting algorithm.
- The `df.sort_values()` method sorts by the values along either axis. It returns a DataFrame with sorted values, or `None` if `inplace=True`. Its signature (without the optional `key` argument) is:
```
df.sort_values(by,axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last',ignore_index=False)
```
Where,
-  `by`: Name or list of names to sort by.
-  `axis`: If `axis` is 0 or 'index' then 'by' may contain index levels and/or column labels. If `axis` is 1 or 'columns' then 'by' may contain column levels and/or index labels.
- `ascending`: bool or list of bools. If True, sort in ascending order; if False, in descending order. If a list is given, it must match the length of `by`.
- `inplace`: If True, perform the operation in place and return `None`.
- `kind`: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'. This option is only applied when sorting on a single column or label.
- `na_position`: {'first', 'last'}, default 'last'. 'first' puts NaNs at the beginning; 'last' puts them at the end.
- `ignore_index`: If True, the resulting axis will be labeled 0, 1, …, n - 1. Default False
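
To make a few of these parameters concrete, here is a minimal sketch. The `demo` DataFrame and its column names are hypothetical and are used only for illustration; they are not part of the lecture data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters above
demo = pd.DataFrame({'item': ['pen', 'ink', 'pad'], 'price': [30, 10, 20]})

# `by` accepts a single label or a list of labels; these two calls are equivalent
by_str = demo.sort_values(by='price')
by_list = demo.sort_values(by=['price'])
print(by_str.equals(by_list))        # True

# The default call returns a new DataFrame and keeps the original row labels
print(by_str.index.tolist())         # [1, 2, 0]

# inplace=True modifies demo itself and returns None
result = demo.sort_values(by='price', ascending=False, inplace=True)
print(result)                        # None
print(demo['item'].tolist())         # ['pen', 'pad', 'ink']
```

Because an in-place sort returns `None`, writing `demo = demo.sort_values(..., inplace=True)` would replace the DataFrame with `None`, so use either assignment or `inplace=True`, not both.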

### a. Creating a Simple Dataframe

In [2]:
# Let us create a simple data frame
import pandas as pd
df = pd.DataFrame({
    'roll_no': [ 102, 101, 104, 103, 105],
    'name' : ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'gender' : ['M', 'F', 'M', 'M', 'F'],
    'grade'  : ['A', 'A', 'B', 'B', 'A'],
    'marks'  : [ 21,  23,  12,  14,  20],
    'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']
})
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi
4,105,Farzana,F,A,20,Peshawer


### b. Sorting by Single Column

In [3]:
# Let us sort the data by the marks column
# By default the sorting is done in ascending order and is not in place
sorted_df = df.sort_values(by=['marks'])
sorted_df


Unnamed: 0,roll_no,name,gender,grade,marks,city
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi
4,105,Farzana,F,A,20,Peshawer
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer


In [4]:
# Note: since the sort was not in place, the original dataframe is still unsorted
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi
4,105,Farzana,F,A,20,Peshawer


In [5]:
# Let us do an in-place sort and in descending order
df.sort_values(by=['marks'], ascending=False, inplace=True)
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
1,101,Saima,F,A,23,Peshawer
0,102,Kamal,M,A,21,Lahore
4,105,Farzana,F,A,20,Peshawer
3,103,Shaikh,M,B,14,Karachi
2,104,Jamal,M,B,12,Lahore


### c. Sorting by Multiple Columns

In [6]:
# Let us sort again in ascending order by grade
d1 = df.sort_values(by=['grade'])
d1

Unnamed: 0,roll_no,name,gender,grade,marks,city
1,101,Saima,F,A,23,Peshawer
0,102,Kamal,M,A,21,Lahore
4,105,Farzana,F,A,20,Peshawer
3,103,Shaikh,M,B,14,Karachi
2,104,Jamal,M,B,12,Lahore


- Note that in the above output the data is sorted by the grade column only. Sorting by grade alone does not control the order of the rows within each grade; here that order is simply left over from the earlier in-place sort on marks.
- We want to sort the data based on both grades and marks.

In [7]:
# sort the dataframe
d2 = df.sort_values(by=['grade','marks'])
d2

Unnamed: 0,roll_no,name,gender,grade,marks,city
4,105,Farzana,F,A,20,Peshawer
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi


- Note that the data is first sorted by grade, and then within grade it is sorted by marks
- Let us now sort by grades in ascending order and marks in descending order.


In [8]:
d3 = df.sort_values(by=['grade','marks'], ascending=[True,False])
d3

Unnamed: 0,roll_no,name,gender,grade,marks,city
1,101,Saima,F,A,23,Peshawer
0,102,Kamal,M,A,21,Lahore
4,105,Farzana,F,A,20,Peshawer
3,103,Shaikh,M,B,14,Karachi
2,104,Jamal,M,B,12,Lahore


- When sorting by multiple columns, `df.sort_values()` sorts by the first column in the list first, and uses the second column only to order the rows that tie on the first.
- Let us understand this by switching the order of column names in the list.

In [9]:
# Changed the order of the columns: marks is now the primary sort key
# and grade is used only to order rows that have equal marks
d4 = df.sort_values(by=['marks','grade'])
d4

Unnamed: 0,roll_no,name,gender,grade,marks,city
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi
4,105,Farzana,F,A,20,Peshawer
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
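
In the data above no two students share the same marks, so the second key never comes into play. The following sketch uses a small hypothetical DataFrame with tied marks (the names and values are invented for illustration) to show how the order of the column names in `by` changes the result.

```python
import pandas as pd

# Hypothetical data with tied marks, so the second sort key has a visible effect
ties = pd.DataFrame({
    'name':  ['Ali', 'Bina', 'Chand', 'Dua'],
    'grade': ['B', 'A', 'A', 'B'],
    'marks': [15, 15, 20, 20],
})

# grade is the primary key; marks only orders the rows inside each grade
grade_first = ties.sort_values(by=['grade', 'marks'])
print(grade_first['name'].tolist())   # ['Bina', 'Chand', 'Ali', 'Dua']

# marks is the primary key; grade only breaks ties between equal marks
marks_first = ties.sort_values(by=['marks', 'grade'])
print(marks_first['name'].tolist())   # ['Bina', 'Ali', 'Chand', 'Dua']
```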


### d. Reset the Index (if you want)
- After you sort your dataset, you can observe that the index labels travel with their rows, so the index is no longer in order. If we want a fresh 0, 1, …, n - 1 index, we use the `reset_index()` method.


In [10]:
d4.reset_index()

Unnamed: 0,index,roll_no,name,gender,grade,marks,city
0,2,104,Jamal,M,B,12,Lahore
1,3,103,Shaikh,M,B,14,Karachi
2,4,105,Farzana,F,A,20,Peshawer
3,0,102,Kamal,M,A,21,Lahore
4,1,101,Saima,F,A,23,Peshawer


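- Observe that it has created another column named 'index', which holds the previous index. Also note that `d4` itself is unchanged, because `reset_index()` returns a new dataframe by default.
- If you want to discard the old index instead, pass `drop=True`, and pass `inplace=True` to modify `d4` itself.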

In [11]:
d4.reset_index(inplace=True, drop=True)
d4

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,104,Jamal,M,B,12,Lahore
1,103,Shaikh,M,B,14,Karachi
2,105,Farzana,F,A,20,Peshawer
3,102,Kamal,M,A,21,Lahore
4,101,Saima,F,A,23,Peshawer
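
As an alternative to calling `reset_index(drop=True)` afterwards, the `ignore_index=True` argument listed in the signature of `sort_values()` relabels the rows during the sort itself. A minimal sketch, assuming a reduced version of the dataframe above with only the `roll_no` and `marks` columns:

```python
import pandas as pd

# Reduced copy of the lecture data (only two columns, for brevity)
students = pd.DataFrame({
    'roll_no': [102, 101, 104, 103, 105],
    'marks':   [21, 23, 12, 14, 20],
})

# Two steps: sort, then throw away the shuffled index
two_step = students.sort_values(by='marks').reset_index(drop=True)

# One step: let sort_values() relabel the rows 0, 1, ..., n-1
one_step = students.sort_values(by='marks', ignore_index=True)

print(one_step.equals(two_step))       # True
print(one_step['roll_no'].tolist())    # [104, 103, 105, 102, 101]
print(one_step.index.tolist())         # [0, 1, 2, 3, 4]
```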


### e. Handle NaN Values

In [33]:
import numpy as np
import pandas as pd
# Note: despite its name, this is a plain Python list of lists (not a NumPy array)
a2DNumPyArray =[
           ['MS01', 'Rauf',    52, 'Lahore',    'MORNING',   'group C', 'Male',   78.3, 84.4, 5000],
           ['MS02', 'Arif',    51, 'Islamabad', 'AFT',       'group A', 'Male',   70.5, np.nan, 6000],
           ['MS03', 'Shaista', 35, 'Karachi',   'AFTERNOON', 'group B', 'Female', 64.9, 75.1, 8500],
           ['MS04', 'Hadeed',  20, 'Lahore',    'MOR',       'group A', 'Male',   np.nan, 84.3, 4000],
           ['MS05', 'Zara',    40, 'Peshawer',  'AFT',       'group D', 'Female', 65.9, 72.8, 3500],
           ['MS06', 'Mohid',   16, 'Lahore',    'MORNING' ,  'group C', 'Female', 69.3, 78.6, np.nan],
           ['MS07', 'Zobia',   40, 'Sialkot',   'AFT',       'group B', 'Female', 90.2, np.nan, 4000],
           ['MS08', 'Idrees',  51, 'Multan',    'MORNING',   'group D', 'Male',   84.1, 76.0, 8000],
           ['MS09', 'Jamil',   53, 'Karachi',   'AFT',       'group C', 'Male',   90.5, 81.3, np.nan],
           ['MS10', 'Shahid',  38, 'Lahore',   'AFTERNOON', 'group D', 'Male',   90.5, 81.3, 3800],
           ['MS11', 'Khurram', 35, 'Islamabad',   'MOR',       'group B', 'Male',   90.5, 81.3, 6000],
           ['MS12', 'Maaz',    25, 'Karachi',   'AFTERNOON', 'group C', 'Male',   90.5, 81.3, np.nan],
           ['MS13', 'Mujahid', 18, 'Lahore',    'MORNING',   'group D', 'Male',   np.nan, 76.5, 7000],
           ['MS14', 'Sara',  28, 'Multan',    'AFTERNOON',   'group A', 'Female',   84.1, 76.0, 8000],
           ['MS15', 'Fatima',   33, 'Sialkot',   'AFT',       'group C', 'Female',   90.5, 81.3, 3500],
           ['MS16', 'Kakamanna',  42, 'Multan',   'AFTERNOON', 'group A', 'Male',   90.5, 81.3, 3800],

      ]

list1 = ['rollno', 'name', 'age', 'address', 'session', 'group', 'gender', 'subj1', 'subj2', 'scholarship']
df = pd.DataFrame(data=a2DNumPyArray, columns=list1)
df

Unnamed: 0,rollno,name,age,address,session,group,gender,subj1,subj2,scholarship
0,MS01,Rauf,52,Lahore,MORNING,group C,Male,78.3,84.4,5000.0
1,MS02,Arif,51,Islamabad,AFT,group A,Male,70.5,,6000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
3,MS04,Hadeed,20,Lahore,MOR,group A,Male,,84.3,4000.0
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
6,MS07,Zobia,40,Sialkot,AFT,group B,Female,90.2,,4000.0
7,MS08,Idrees,51,Multan,MORNING,group D,Male,84.1,76.0,8000.0
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,
9,MS10,Shahid,38,Lahore,AFTERNOON,group D,Male,90.5,81.3,3800.0
10,MS11,Khurram,35,Islamabad,MOR,group B,Male,90.5,81.3,6000.0
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
12,MS13,Mujahid,18,Lahore,MORNING,group D,Male,,76.5,7000.0
13,MS14,Sara,28,Multan,AFTERNOON,group A,Female,84.1,76.0,8000.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
15,MS16,Kakamanna,42,Multan,AFTERNOON,group A,Male,90.5,81.3,3800.0


In [35]:
# Missing values (NaN) are listed at the end by default when using sort_values(),
# regardless of the sorting order (ascending or descending)
d1 = df.sort_values(by=['scholarship'])
d1.tail()

Unnamed: 0,rollno,name,age,address,session,group,gender,subj1,subj2,scholarship
13,MS14,Sara,28,Multan,AFTERNOON,group A,Female,84.1,76.0,8000.0
2,MS03,Shaista,35,Karachi,AFTERNOON,group B,Female,64.9,75.1,8500.0
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,


In [36]:
# With the argument na_position='first', the rows with NaN are listed at the top instead.
d2 = df.sort_values(by=['scholarship'], na_position='first')
d2.head()

Unnamed: 0,rollno,name,age,address,session,group,gender,subj1,subj2,scholarship
5,MS06,Mohid,16,Lahore,MORNING,group C,Female,69.3,78.6,
8,MS09,Jamil,53,Karachi,AFT,group C,Male,90.5,81.3,
11,MS12,Maaz,25,Karachi,AFTERNOON,group C,Male,90.5,81.3,
4,MS05,Zara,40,Peshawer,AFT,group D,Female,65.9,72.8,3500.0
14,MS15,Fatima,33,Sialkot,AFT,group C,Female,90.5,81.3,3500.0
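
The claim that NaN placement is independent of the sorting order can be checked with a small sketch. The `fees` DataFrame below is hypothetical and exists only to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing scholarship value
fees = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'scholarship': [5000, np.nan, 3500, 8000],
})

# Descending sort: the NaN row (B) still comes last by default
desc = fees.sort_values(by='scholarship', ascending=False)
print(desc['name'].tolist())          # ['D', 'A', 'C', 'B']

# Descending sort with na_position='first': the NaN row moves to the top
desc_first = fees.sort_values(by='scholarship', ascending=False, na_position='first')
print(desc_first['name'].tolist())    # ['B', 'D', 'A', 'C']
```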


## 2. Sorting dataframes using `df.sort_index()`
> As we have seen, `df.sort_values()` by default (`axis=0`) reorders the rows based on the values in one or more columns. If we set its `axis` argument to 1, it instead reorders the columns based on the values in one or more rows. However, this may cause problems when a row mixes numbers and strings, because such values cannot be compared with each other.

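- So, to reorder the columns of a dataframe, we normally use the **`df.sort_index()`** method with `axis=1`, which sorts by the column labels rather than by the values. With the default `axis=0`, it sorts the rows by their index labels. Its signature, abridged to the most commonly used parameters, is: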
```
df.sort_index(axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last',ignore_index=False)
```
Where,
-  `axis`: The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns. (default is 0)
- `ascending`: If True then ascending and If False then descending
- `inplace`:  If True, perform operation in-place.
- `kind`: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'. This option is only applied when sorting on a single column or label.
- `na_position`: {'first', 'last'}, default 'last'. 'first' puts NaNs at the beginning; 'last' puts them at the end.
- `ignore_index`: If True, the resulting axis will be labeled 0, 1, …, n - 1. Default False
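
The difference between sorting columns by values and sorting columns by labels can be sketched as follows. The `nums` and `mixed` DataFrames are hypothetical examples, not part of the lecture data.

```python
import pandas as pd

# Hypothetical all-numeric data with row labels r0 and r1
nums = pd.DataFrame({'b': [3, 1], 'a': [1, 2], 'c': [2, 3]}, index=['r0', 'r1'])

# sort_values with axis=1: reorder the columns by the VALUES found in row 'r0'
by_row = nums.sort_values(by='r0', axis=1)
print(by_row.columns.tolist())     # ['a', 'c', 'b']

# sort_index with axis=1: reorder the columns by their LABELS
by_label = nums.sort_index(axis=1)
print(by_label.columns.tolist())   # ['a', 'b', 'c']

# A row that mixes a string and a number cannot be sorted by value
mixed = pd.DataFrame({'name': ['Kamal'], 'marks': [21]})
raised = False
try:
    mixed.sort_values(by=0, axis=1)
except TypeError as err:
    raised = True
    print('Cannot compare str and int:', err)
```

Sorting by labels never compares the cell values, which is why `sort_index(axis=1)` is the safer choice for dataframes with mixed column types.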

In [2]:
# Let us create a simple data frame
import pandas as pd
df = pd.DataFrame({
    'roll_no': [ 102, 101, 104, 103, 105],
    'name' : ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'gender' : ['M', 'F', 'M', 'M', 'F'],
    'grade'  : ['A', 'A', 'B', 'B', 'A'],
    'marks'  : [ 21,  23,  12,  14,  20],
    'city' : ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer']
})
df

Unnamed: 0,roll_no,name,gender,grade,marks,city
0,102,Kamal,M,A,21,Lahore
1,101,Saima,F,A,23,Peshawer
2,104,Jamal,M,B,12,Lahore
3,103,Shaikh,M,B,14,Karachi
4,105,Farzana,F,A,20,Peshawer


### a. Sort by Column Labels
- Passing `axis=1` sorts the columns by their labels. With the default, `axis=0`, the rows are sorted by their index labels instead.

In [3]:
df1 = df.sort_index(axis=1)
df1

Unnamed: 0,city,gender,grade,marks,name,roll_no
0,Lahore,M,A,21,Kamal,102
1,Peshawer,F,A,23,Saima,101
2,Lahore,M,B,12,Jamal,104
3,Karachi,M,B,14,Shaikh,103
4,Peshawer,F,A,20,Farzana,105


In [4]:
# Let us sort the column labels in descending order (not in place; the result is assigned to df1)
df1 = df.sort_index(axis=1, ascending=False)
df1

Unnamed: 0,roll_no,name,marks,grade,gender,city
0,102,Kamal,21,A,M,Lahore
1,101,Saima,23,A,F,Peshawer
2,104,Jamal,12,B,M,Lahore
3,103,Shaikh,14,B,M,Karachi
4,105,Farzana,20,A,F,Peshawer


### b. Sort by Index
- This is a three-step process:
    - Make the specific column as index
    - Call sort_index() with axis=0
    - Reset Index

**Let us sort by grades**

In [5]:
# Let us set the grade column as the index
df1 = df.set_index(["grade"])
df1

grade,roll_no,name,gender,marks,city
A,102,Kamal,M,21,Lahore
A,101,Saima,F,23,Peshawer
B,104,Jamal,M,12,Lahore
B,103,Shaikh,M,14,Karachi
A,105,Farzana,F,20,Peshawer


In [57]:
# Sort the dataframe by its index (the grade labels)
df2 = df1.sort_index(axis=0)
df2

grade,roll_no,name,gender,marks,city
A,102,Kamal,M,21,Lahore
A,101,Saima,F,23,Peshawer
A,105,Farzana,F,20,Peshawer
B,104,Jamal,M,12,Lahore
B,103,Shaikh,M,14,Karachi


In [58]:
# After sorting, you can reset the index if you want; drop=False keeps grade as a regular column
df2.reset_index(inplace=True, drop=False)
df2

Unnamed: 0,grade,roll_no,name,gender,marks,city
0,A,102,Kamal,M,21,Lahore
1,A,101,Saima,F,23,Peshawer
2,A,105,Farzana,F,20,Peshawer
3,B,104,Jamal,M,12,Lahore
4,B,103,Shaikh,M,14,Karachi
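
The three steps above can also be chained into a single expression. The sketch below rebuilds the same dataframe and, as an assumption that goes beyond the cells above, passes `kind='stable'` so that rows with the same grade are guaranteed to keep their original relative order. It then compares the result with a direct `sort_values()` call.

```python
import pandas as pd

# Same data as in the cells above
df = pd.DataFrame({
    'roll_no': [102, 101, 104, 103, 105],
    'name':    ['Kamal', 'Saima', 'Jamal', 'Shaikh', 'Farzana'],
    'gender':  ['M', 'F', 'M', 'M', 'F'],
    'grade':   ['A', 'A', 'B', 'B', 'A'],
    'marks':   [21, 23, 12, 14, 20],
    'city':    ['Lahore', 'Peshawer', 'Lahore', 'Karachi', 'Peshawer'],
})

# Step 1: make grade the index; Step 2: sort by the index; Step 3: reset the index
chained = df.set_index('grade').sort_index(axis=0, kind='stable').reset_index()
print(chained['roll_no'].tolist())            # [102, 101, 105, 104, 103]

# The same row order can be obtained directly with sort_values()
direct = df.sort_values(by='grade', kind='stable', ignore_index=True)

# Only the column order differs (grade comes first after reset_index)
print(chained[direct.columns].equals(direct))  # True
```

Sorting through the index is mainly useful when the column is going to stay as the index anyway; for a one-off sort on a regular column, `sort_values()` is the shorter route.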
