## Inserting, Deleting and Sorting Data

In this topic, we review **basic** ways to insert or remove complete rows or
columns. **Concatenation** and **merging**, which are more flexible ways to
add or combine data, are covered later in another notebook.

Topics Reviewed

- Inserting columns (using accessors or `.insert()`)
- Deleting using `del`, `pop()`, or `drop()`
- Sorting using `.sort_values()`

**NOTE:** Before pandas 2.0, rows could be inserted with `.append()`. That method was deprecated in pandas 1.4 and removed in 2.0, so we will always use `concat()` instead. `concat()` is explored in another notebook.
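As a brief preview of the replacement for `.append()`, here is a minimal sketch of adding one row with `pd.concat()`. The frame `people` and the row `new_row` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration
people = pd.DataFrame({"name": ["Xavier", "Ann"], "age": [41, 28]}, index=[100, 101])

# The new row is wrapped in its own one-row DataFrame
new_row = pd.DataFrame({"name": ["Jana"], "age": [33]}, index=[102])

# concat() returns a new DataFrame; the inputs are left unchanged
people_extended = pd.concat([people, new_row])
print(people_extended)
```

Because `concat()` returns a new object, the result has to be assigned to a name to be kept.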

In [2]:
import pandas as pd
import numpy as np

np.random.seed(0)

## Inserting columns

There are two basic ways to insert columns.

1. Insert columns using accessors: `df["column_name"] = <new_data>`. Note that `<new_data>` can be a scalar or a list/array with one value per row.
2. Insert columns using `.insert(loc, column, value)`, when the location of the new column is important.
    - `loc` is the zero-based integer position of the new column
    - `column` is the name of the new column
    - `value` holds the values (scalar or list/array) that fill the new column.
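The two rules above can be sketched on a small, hypothetical frame called `scores`: a list must match the number of rows, and `loc` is counted from 0. The column names and values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
scores = pd.DataFrame({"name": ["Xavier", "Ann", "Jana"], "py-score": [88.0, 79.0, 81.0]})

# A list must have one value per row; a shorter list raises ValueError
length_error = None
try:
    scores["js-score"] = [70, 80]
except ValueError as err:
    length_error = str(err)
    print("Rejected:", length_error)

# loc=0 places the new column first; existing columns shift to the right
scores.insert(loc=0, column="id", value=[1, 2, 3])
print(list(scores.columns))
```

The failed assignment leaves `scores` unchanged, so only the `id` column is added.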


In [3]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [100, 101, 102, 103, 104, 105, 106]
df = pd.DataFrame(data=data, index=row_labels)

df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [4]:
# 1. Inserting new col using accessors
# NOTE: we are inserting using an array
df["js-score"] = np.random.randint(100, size = 7)
df

Unnamed: 0,name,city,age,py-score,js-score
100,Xavier,Mexico City,41,88.0,44
101,Ann,Toronto,28,79.0,47
102,Jana,Prague,33,81.0,64
103,Yi,Shanghai,34,80.0,67
104,Robin,Manchester,38,68.0,67
105,Amal,Cairo,31,61.0,9
106,Nori,Osaka,37,84.0,83


In [5]:
# 1. Inserting new col using accessors
# NOTE: inserting using a scalar will broadcast the value
df["java-score"] = 50
df

Unnamed: 0,name,city,age,py-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,44,50
101,Ann,Toronto,28,79.0,47,50
102,Jana,Prague,33,81.0,64,50
103,Yi,Shanghai,34,80.0,67,50
104,Robin,Manchester,38,68.0,67,50
105,Amal,Cairo,31,61.0,9,50
106,Nori,Osaka,37,84.0,83,50


In [6]:
# 2. Inserting using function .insert()
# NOTE: this function will insert the column in place
df.insert(loc=4,
          column="django-score",
          value=[86.0, 81.0, 78.0, 88.0, 74.0, 70.0, 81.0])
df

Unnamed: 0,name,city,age,py-score,django-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,86.0,44,50
101,Ann,Toronto,28,79.0,81.0,47,50
102,Jana,Prague,33,81.0,78.0,64,50
103,Yi,Shanghai,34,80.0,88.0,67,50
104,Robin,Manchester,38,68.0,74.0,67,50
105,Amal,Cairo,31,61.0,70.0,9,50
106,Nori,Osaka,37,84.0,81.0,83,50


## Deleting rows or cols

There are several ways to delete rows or columns:

1. Remove a row using `drop(labels)`
2. Remove a col using `drop(labels, axis = 1)`
3. Remove a col using `del df[<column_label>]`
4. Remove and retrieve a col using `pop(<column_label>)`

**NOTE:** By default `drop()` returns a new DataFrame and leaves the original untouched. To remove values in place, use `inplace = True`.

**NOTE:** Differences between `del` and `drop()`.

- `drop` operates on both columns and rows; `del` operates on columns only.
- `drop` can operate on multiple items at a time; `del` operates only on one at a time.
- `drop` can operate in-place or return a copy; `del` is an in-place operation only.
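The differences listed above can be checked with a short sketch on a hypothetical frame `grades`. It uses the `columns=` and `index=` keywords of `drop()`, which are equivalent to passing `labels` with `axis = 1` and `axis = 0` respectively.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
grades = pd.DataFrame(
    {"name": ["Xavier", "Ann"], "age": [41, 28], "py-score": [88.0, 79.0]},
    index=[100, 101],
)

# drop() returns a new DataFrame by default and can remove several columns at once
trimmed = grades.drop(columns=["age", "py-score"])
print(list(trimmed.columns))
print(list(grades.columns))  # grades still has all three columns

# drop() also works on rows, selected by index label
without_first = grades.drop(index=100)

# del removes exactly one column and modifies grades directly
del grades["age"]
print(list(grades.columns))
```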

In [7]:
df

Unnamed: 0,name,city,age,py-score,django-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,86.0,44,50
101,Ann,Toronto,28,79.0,81.0,47,50
102,Jana,Prague,33,81.0,78.0,64,50
103,Yi,Shanghai,34,80.0,88.0,67,50
104,Robin,Manchester,38,68.0,74.0,67,50
105,Amal,Cairo,31,61.0,70.0,9,50
106,Nori,Osaka,37,84.0,81.0,83,50


In [8]:
# 1. Removing rows with index labels 103 and 106
df = df.drop(labels=[103, 106])
df

Unnamed: 0,name,city,age,py-score,django-score,js-score,java-score
100,Xavier,Mexico City,41,88.0,86.0,44,50
101,Ann,Toronto,28,79.0,81.0,47,50
102,Jana,Prague,33,81.0,78.0,64,50
104,Robin,Manchester,38,68.0,74.0,67,50
105,Amal,Cairo,31,61.0,70.0,9,50


In [9]:
# 2. Removing columns with drop() + axis = 1
df = df.drop(labels=["java-score", "py-score"], axis = 1)
df

Unnamed: 0,name,city,age,django-score,js-score
100,Xavier,Mexico City,41,86.0,44
101,Ann,Toronto,28,81.0,47
102,Jana,Prague,33,78.0,64
104,Robin,Manchester,38,74.0,67
105,Amal,Cairo,31,70.0,9


In [10]:
# 3. Removing a column using del
del df['js-score']
df

Unnamed: 0,name,city,age,django-score
100,Xavier,Mexico City,41,86.0
101,Ann,Toronto,28,81.0
102,Jana,Prague,33,78.0
104,Robin,Manchester,38,74.0
105,Amal,Cairo,31,70.0


In [11]:
# 4. Remove and retrieve column using pop()
column = df.pop("age")
print(column)
df

100    41
101    28
102    33
104    38
105    31
Name: age, dtype: int64


Unnamed: 0,name,city,django-score
100,Xavier,Mexico City,86.0
101,Ann,Toronto,81.0
102,Jana,Prague,78.0
104,Robin,Manchester,74.0
105,Amal,Cairo,70.0


## Sorting
You can sort a pandas DataFrame with 

`.sort_values(by, axis, ascending, inplace)`

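- `by` is the label (or list of labels) to sort by. It is a column label when `axis = 0` and a row label when `axis = 1`.
- `axis` sets the axis to be sorted: `0` (the default) reorders rows, `1` reorders columns.
- `ascending`: if True (the default), sort in ascending order. A list of booleans sets the direction for each label in `by`.
- `inplace`: if True, perform the operation in place instead of returning a new DataFrame.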
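The less common `axis = 1` case and the effect of `inplace` can be sketched as follows. The all-numeric frame `scores` is hypothetical; sorting along `axis = 1` compares values within a row, so the columns should hold comparable types.

```python
import pandas as pd

# Hypothetical all-numeric frame, used only for illustration
scores = pd.DataFrame(
    {"py-score": [88.0, 79.0], "js-score": [71.0, 95.0], "java-score": [80.0, 60.0]},
    index=[100, 101],
)

# by=100 is a row label here; axis=1 reorders the columns by the values in that row
by_row = scores.sort_values(by=100, axis=1, ascending=False)
print(list(by_row.columns))

# With inplace=True the frame itself is reordered and the call returns None
result = scores.sort_values(by="py-score", inplace=True)
print(result)
print(list(scores.index))
```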


In [12]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [100, 101, 102, 103, 104, 105, 106]
df = pd.DataFrame(data=data, index=row_labels)

df

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
101,Ann,Toronto,28,79.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
104,Robin,Manchester,38,68.0
105,Amal,Cairo,31,61.0
106,Nori,Osaka,37,84.0


In [14]:
# 1. Sort in descending order by one column
df.sort_values(
    by = "age",
    ascending = False
)

Unnamed: 0,name,city,age,py-score
100,Xavier,Mexico City,41,88.0
104,Robin,Manchester,38,68.0
106,Nori,Osaka,37,84.0
103,Yi,Shanghai,34,80.0
102,Jana,Prague,33,81.0
105,Amal,Cairo,31,61.0
101,Ann,Toronto,28,79.0


In [15]:
# 2. Sort in ascending order by multiple columns
# NOTE: rows are sorted by "age" first; "py-score" only breaks ties between equal ages
df.sort_values(
    by = ["age","py-score"],
    ascending = [True, True]
)

Unnamed: 0,name,city,age,py-score
101,Ann,Toronto,28,79.0
105,Amal,Cairo,31,61.0
102,Jana,Prague,33,81.0
103,Yi,Shanghai,34,80.0
106,Nori,Osaka,37,84.0
104,Robin,Manchester,38,68.0
100,Xavier,Mexico City,41,88.0
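All ages in the data above are distinct, so the second key never comes into play. A minimal sketch with a hypothetical frame `ties`, which contains repeated ages, shows the tie-breaking and a different direction for each key:

```python
import pandas as pd

# Hypothetical frame with repeated ages, used only for illustration
ties = pd.DataFrame(
    {
        "name": ["Ann", "Amal", "Jana", "Yi"],
        "age": [28, 31, 28, 31],
        "py-score": [79.0, 61.0, 81.0, 80.0],
    },
    index=[101, 105, 102, 103],
)

# Ascending by age; within each age, descending by py-score
ordered = ties.sort_values(by=["age", "py-score"], ascending=[True, False])
print(ordered)
```

Within each age group, the row with the higher `py-score` now comes first.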
