## Inserting, Deleting and Sorting Data

In this topic, we review **basic** ways to insert or remove complete rows or
columns. **Concatenation** and **merging**, which are more flexible ways to
add or drop data, are covered later in another notebook.

Topics Reviewed

- Inserting columns (using accessors or `.insert()`)
- Deleting using `del`, `pop()`, or `drop()`
- Sorting using `.sort_index()` and `.sort_values()`

**NOTE:** Versions before pandas 2.0 could use `.append()` to insert rows. That method was deprecated in pandas 1.4 and removed in 2.0, so we will always use `concat()` instead. `concat()` will be explored in another notebook.
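As a brief preview of the replacement for `.append()`, the sketch below adds one row with `pd.concat()`. The frame `people` and the row values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
people = pd.DataFrame(
    {"name": ["Xavier", "Ann"], "age": [41, 28]},
    index=[100, 101],
)

# The new row is wrapped in a one-row DataFrame with its own index label
new_row = pd.DataFrame({"name": ["Jana"], "age": [33]}, index=[102])

# concat returns a new DataFrame; `people` itself is left unchanged
extended = pd.concat([people, new_row])
print(extended)
```

Because `concat()` returns a new object, the result has to be assigned to a name to be kept.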

In [3]:
import pandas as pd
import numpy as np

np.random.seed(0)

In [16]:
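## Handy function: render several DataFrames next to each other
from IPython.display import display_html

def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html()
    # Make each table inline so the browser lays them out in a row;
    # matching '<table' leaves closing tags and cell contents untouched
    display_html(html_str.replace('<table', '<table style="display:inline"'), raw=True)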

## Inserting columns

There are two basic ways to insert columns.

1. Insert columns using accessors: `df["column_name"] = <new_data>`. Note that `<new_data>` can be a scalar or a list/array whose length equals the number of rows.
2. Insert columns using `.insert(loc, column, value)` when the position of the new column matters.
    - `loc` is the integer position (0-based) of the new column
    - `column` is the name of the new column
    - `value` is the data (scalar or list/array) for the new column.
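The two constraints that most often cause errors here are the length of `<new_data>` and duplicate column names. The sketch below illustrates both on a hypothetical three-row frame called `scores`; the column names are made up for the example.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
scores = pd.DataFrame(
    {"name": ["Ann", "Yi", "Nori"], "py-score": [79.0, 80.0, 84.0]}
)

# A scalar is broadcast to every row
scores["passed"] = True

# A list must have exactly one value per row; otherwise pandas raises ValueError
try:
    scores["js-score"] = [70, 65]  # only 2 values for 3 rows
except ValueError as err:
    print("Length mismatch:", err)

# insert() places the column at position `loc` and modifies the frame in place
scores.insert(loc=1, column="city", value=["Toronto", "Shanghai", "Osaka"])

# Inserting an existing column name raises ValueError
# unless allow_duplicates=True is passed
try:
    scores.insert(loc=0, column="city", value="unknown")
except ValueError as err:
    print("Duplicate column:", err)

print(list(scores.columns))
```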


In [4]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [100, 101, 102, 103, 104, 105, 106]
df = pd.DataFrame(data=data, index=row_labels)

df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [5]:
# 1. Inserting new col using accessors
# NOTE: we are inserting using an array
df["js-score"] = np.random.randint(100, size = 7)
df

Unnamed: 0,name,city,age,py-score,js-score
100,Xavier,Mexico City,41,88.0,44
101,Ann,Toronto,28,79.0,47
102,Jana,Prague,33,81.0,64
103,Yi,Shanghai,34,80.0,67
104,Robin,Manchester,38,68.0,67
105,Amal,Cairo,31,61.0,9
106,Nori,Osaka,37,84.0,83


In [6]:
# 1. Inserting new col using accessors
# NOTE: inserting using a scalar will broadcast the value
df["java-score"] = 50
df

Unnamed: 0,name,city,age,py-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,44,50
101,Ann,Toronto,28,79.0,47,50
102,Jana,Prague,33,81.0,64,50
103,Yi,Shanghai,34,80.0,67,50
104,Robin,Manchester,38,68.0,67,50
105,Amal,Cairo,31,61.0,9,50
106,Nori,Osaka,37,84.0,83,50


In [7]:
# 2. Inserting using function .insert()
# NOTE: this function will insert the column in place
df.insert(loc=4,
          column="django-score",
          value=[86.0, 81.0, 78.0, 88.0, 74.0, 70.0, 81.0])
df

Unnamed: 0,name,city,age,py-score,django-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,86.0,44,50
101,Ann,Toronto,28,79.0,81.0,47,50
102,Jana,Prague,33,81.0,78.0,64,50
103,Yi,Shanghai,34,80.0,88.0,67,50
104,Robin,Manchester,38,68.0,74.0,67,50
105,Amal,Cairo,31,61.0,70.0,9,50
106,Nori,Osaka,37,84.0,81.0,83,50


## Deleting rows or cols

There are several ways to delete rows or columns:

1. Remove a row using `drop(labels)`
2. Remove a col using `drop(labels, axis = 1)`
3. Remove a col using `del df[<column_label>]`
4. Remove and retrieve a col using `pop(<column_label>)`

**NOTE:** To remove values in place with `drop()` use `inplace = True`.

**NOTE:** Differences between `del` and `drop()`.

- `drop` operates on both columns and rows; `del` operates on column only.
- `drop` can operate on multiple items at a time; `del` operates only on one at a time.
- `drop` can operate in-place or return a copy; `del` is an in-place operation only.
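The differences listed above can be checked on a small example. This is a minimal sketch on a hypothetical frame named `frame`; it also uses the `index=`, `columns=`, and `errors=` keywords of `drop()`, which are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical frame for illustration
frame = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# drop() returns a new frame by default and can remove
# several rows and columns in a single call
trimmed = frame.drop(index=["y"], columns=["b", "c"])

# errors="ignore" skips labels that do not exist instead of raising KeyError
same = frame.drop(columns=["missing"], errors="ignore")

# del removes exactly one column and always modifies the frame in place
del frame["c"]

# pop() also works in place, and hands back the removed column as a Series
b_values = frame.pop("b")

# With inplace=True, drop() modifies the frame and returns None
frame.drop(index=["z"], inplace=True)

print(trimmed)
print(frame)
```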

In [8]:
df

Unnamed: 0,name,city,age,py-score,django-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,86.0,44,50
101,Ann,Toronto,28,79.0,81.0,47,50
102,Jana,Prague,33,81.0,78.0,64,50
103,Yi,Shanghai,34,80.0,88.0,67,50
104,Robin,Manchester,38,68.0,74.0,67,50
105,Amal,Cairo,31,61.0,70.0,9,50
106,Nori,Osaka,37,84.0,81.0,83,50


In [9]:
# 1. Removing rows with index labels 103 and 106
df = df.drop(labels=[103, 106])
df

Unnamed: 0,name,city,age,py-score,django-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,86.0,44,50
101,Ann,Toronto,28,79.0,81.0,47,50
102,Jana,Prague,33,81.0,78.0,64,50
104,Robin,Manchester,38,68.0,74.0,67,50
105,Amal,Cairo,31,61.0,70.0,9,50


In [10]:
# 2. Removing columns with drop() + axis = 1
df = df.drop(labels=["java-score", "py-score"], axis = 1)
df

Unnamed: 0,name,city,age,django-score,js-score
100,Xavier,Mexico City,41,86.0,44
101,Ann,Toronto,28,81.0,47
102,Jana,Prague,33,78.0,64
104,Robin,Manchester,38,74.0,67
105,Amal,Cairo,31,70.0,9


In [11]:
# 3. Removing a column using del
del df['js-score']
df

Unnamed: 0,name,city,age,django-score
100,Xavier,Mexico City,41,86.0
101,Ann,Toronto,28,81.0
102,Jana,Prague,33,78.0
104,Robin,Manchester,38,74.0
105,Amal,Cairo,31,70.0


In [12]:
# 4. Remove and retrieve column using pop()
column = df.pop("age")
print(column)
df

100    41
101    28
102    33
104    38
105    31
Name: age, dtype: int64


Unnamed: 0,name,city,django-score
100,Xavier,Mexico City,86.0
101,Ann,Toronto,81.0
102,Jana,Prague,78.0
104,Robin,Manchester,74.0
105,Amal,Cairo,70.0


## Sorting

You can sort a pandas DataFrame **by index** with

`.sort_index(axis, level, ascending)`

- `axis` sets the axis to sort along (default 0, the row index).
- `level` sets the level (or list of levels) to sort by. It applies to a MultiIndex.
- `ascending`: if True, sort in ascending order. It can be a list with one value per level.
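To illustrate passing a list to `ascending`, the sketch below sorts a small MultiIndex with a different direction per level. The frame `small` and the level names `group` and `code` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical MultiIndex frame for illustration
idx = pd.MultiIndex.from_tuples(
    [("B", 10), ("A", 20), ("B", 30), ("A", 10)],
    names=["group", "code"],
)
small = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

# Level 0 ascending (A before B), level 1 descending within each group
ordered = small.sort_index(level=[0, 1], ascending=[True, False])
print(ordered)
```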

You can also sort a pandas DataFrame **by values** with 

`.sort_values(by, axis, ascending, inplace)`

- `by` is the label (or list of labels) to sort by. It refers to a column when `axis = 0` or to a row when `axis = 1`.
- `axis` sets the axis to be sorted (default 0).
- `ascending`: if True, sort in ascending order. It can be a list with one value per label in `by`.
- `inplace`: if True, perform the operation in place.


**NOTE:** You can also sort by index levels and values at the same time with `.sort_values()`, but that is not covered in detail in this tutorial.
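For reference only, here is a minimal sketch of what the note refers to. It assumes the index has a name (here the hypothetical name `region`), since `by` refers to index levels by name.

```python
import pandas as pd

# Hypothetical frame with a named index
sales = pd.DataFrame(
    {"units": [5, 3, 5, 1]},
    index=pd.Index(["north", "south", "east", "west"], name="region"),
)

# `by` mixes a column label ("units") and an index level name ("region"):
# sort by units in descending order, then break ties alphabetically by region
result = sales.sort_values(by=["units", "region"], ascending=[False, True])
print(result)
```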

In [13]:
index = pd.MultiIndex.from_tuples([("B", 10), 
                                   ("B", 30),  
                                   ("B", 20), 
                                   ("A", 40), 
                                   ("A", 50),
                                   ("A", 10),
                                   ("A", 20)])

columns = pd.MultiIndex.from_product([
    ["foo", "bar"],
    [100,300,200]
])

df = pd.DataFrame(
    data= np.random.randint(15, size = (7, 6)), 
    index= index,
    columns=columns
    )

df

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9


In [18]:
# 1. sort by index in level 1
result = df.sort_index(
    level = 1,
    ascending=False
)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
A,50,3,5,14,0,2,3
A,40,13,8,9,4,3,0
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,20,14,7,0,1,9,9
B,10,5,2,4,7,6,8
A,10,8,1,3,13,3,3


In [19]:
# 2. sort by index levels 0 and 1 (the default when level is not specified)
# NOTE: when sorting by two (or more) levels, rows are ordered by the first level,
# and ties are broken by the second one.
result = df.sort_index(
    level = [0, 1],
    ascending=False
)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
B,10,5,2,4,7,6,8
A,50,3,5,14,0,2,3
A,40,13,8,9,4,3,0
A,20,14,7,0,1,9,9
A,10,8,1,3,13,3,3


In [21]:
# 3. sort the column labels by level 1 (axis=1)
result = df.sort_index(
    level = 1,
    ascending=False,
    axis=1
)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,bar,foo,bar,foo,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,300,300,200,200,100,100
B,10,2,6,4,8,5,7
B,30,12,6,10,7,8,1
B,20,14,5,8,9,7,1
A,40,8,3,9,0,13,4
A,50,5,2,14,3,3,0
A,10,1,3,3,3,8,13
A,20,7,9,0,9,14,1


In [22]:
#4. sort by column values
result = df.sort_values(
    by = ("foo", 100),
    ascending = False
)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
A,20,14,7,0,1,9,9
A,40,13,8,9,4,3,0
B,30,8,12,10,1,6,7
A,10,8,1,3,13,3,3
B,20,7,14,8,1,5,9
B,10,5,2,4,7,6,8
A,50,3,5,14,0,2,3


In [23]:
#5. sort by multiple column values
# NOTE: first sort by ("foo", 100); ties are then broken by ("foo", 300)
result = df.sort_values(
    by = [("foo", 100),("foo", 300)],
    ascending = [True, True]
)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
A,50,3,5,14,0,2,3
B,10,5,2,4,7,6,8
B,20,7,14,8,1,5,9
A,10,8,1,3,13,3,3
B,30,8,12,10,1,6,7
A,40,13,8,9,4,3,0
A,20,14,7,0,1,9,9


In [24]:
# 6. sort by row values
# NOTE: reorder the columns based on the values in row ("B", 10)
result = df.sort_values(
    by = [("B", 10)],
    ascending = True,
    axis=1
)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,100,300,200,100,300,200
B,10,5,2,4,7,6,8
B,30,8,12,10,1,6,7
B,20,7,14,8,1,5,9
A,40,13,8,9,4,3,0
A,50,3,5,14,0,2,3
A,10,8,1,3,13,3,3
A,20,14,7,0,1,9,9

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,foo,foo,bar,bar,bar
Unnamed: 0_level_1,Unnamed: 1_level_1,300,200,100,300,100,200
B,10,2,4,5,6,7,8
B,30,12,10,8,6,1,7
B,20,14,8,7,5,1,9
A,40,8,9,13,3,4,0
A,50,5,14,3,2,0,3
A,10,1,3,8,3,13,3
A,20,7,0,14,9,1,9
