In [0]:
import numpy as np
import pandas as pd

# Creating Derived Columns

To analyze a dataset, we almost always need to perform calculations that transform the raw data into usable information. In traditional programming we would write loops, and in Excel we would copy and paste a formula into every cell of a column. Pandas instead expects calculations to be performed over an entire column at once, which keeps the code for calculations very succinct.

# Load CSV Data

In [0]:
# Load the CSV data 
# Additional parameters indicate that the first column should be considered the 
# index, and dates should be parsed into python's datetime format
air_quality = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/air_quality_no2.csv", 
                          index_col=0, parse_dates=True)
air_quality.head()

datetime,station_antwerp,station_paris,station_london
2019-05-07 02:00:00,,,23.0
2019-05-07 03:00:00,50.5,25.0,19.0
2019-05-07 04:00:00,45.0,27.7,19.0
2019-05-07 05:00:00,,50.4,16.0
2019-05-07 06:00:00,,61.9,


This air quality dataset contains NO<sub>2</sub> concentrations measured at three different stations, located in Antwerp, Paris, and London.



# Create a new column from an existing column

![Derived Column](https://drive.google.com/uc?id=13GSyNbzWjnvK7uysYaECsi1r2OkpeKF_)

If we want to express **NO<sub>2</sub>** concentration in units of **mg/m<sup>3</sup>**, the conversion factor is 1.882 (assuming a temperature of 25 °C and a pressure of 1013 hPa).

Let's create a new column to display the London station values in **mg/m<sup>3</sup>**.

In [0]:
# Create a new column named "london_mg_per_cubic_meter"
air_quality["london_mg_per_cubic_meter"] = air_quality["station_london"] * 1.882

air_quality.head()

datetime,station_antwerp,station_paris,station_london,london_mg_per_cubic_meter
2019-05-07 02:00:00,,,23.0,43.286
2019-05-07 03:00:00,50.5,25.0,19.0,35.758
2019-05-07 04:00:00,45.0,27.7,19.0,35.758
2019-05-07 05:00:00,,50.4,16.0,30.112
2019-05-07 06:00:00,,61.9,,


On the left side of the statement, we create a new column named **london_mg_per_cubic_meter**.

On the right, we give it a value by multiplying the **station_london** column by **1.882**.

Pandas performs the calculation element-wise: every value in the column is multiplied by **1.882** in a single operation. We don't need a loop to iterate over the rows; the one-line expression above performs every calculation for us.
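
To make the "no loop needed" point concrete, here is a minimal sketch that compares the column-wide multiplication with an explicit loop. The values are the first few London readings from the table above, placed in a standalone Series purely for illustration.

```python
import pandas as pd

# Sample values copied from the station_london column shown above
london = pd.Series([23.0, 19.0, 19.0, 16.0], name="station_london")

# Vectorized: a single expression multiplies every element
vectorized = london * 1.882

# The same calculation written as an explicit loop, for comparison only
looped = pd.Series([value * 1.882 for value in london], name="station_london")

# Both approaches produce the same numbers
print((vectorized - looped).abs().max())
```

The loop version gives the same result but is longer and, on large columns, typically much slower than the vectorized expression.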

# Create a new column from multiple columns

![Derived Column](https://drive.google.com/uc?id=1o-4uXXMGIVOFhl7sC37QKRcn5rZoaIIm)

Now we would like to compare the air quality in Paris and Antwerp by computing the ratio of their NO<sub>2</sub> values.

In [0]:
# Create a new column named "ratio_paris_antwerp"
air_quality["ratio_paris_antwerp"] = air_quality["station_paris"] / air_quality["station_antwerp"]

air_quality.head()

datetime,station_antwerp,station_paris,station_london,london_mg_per_cubic_meter,ratio_paris_antwerp
2019-05-07 02:00:00,,,23.0,43.286,
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556
2019-05-07 05:00:00,,50.4,16.0,30.112,
2019-05-07 06:00:00,,61.9,,,


On the left side of the statement, we create a new column named **ratio_paris_antwerp**.

On the right, we give it a value by dividing the **station_paris** column by the **station_antwerp** column.

Again, the calculation is done element-wise, row by row.
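
The output above contains empty cells wherever one of the two stations has no reading. The sketch below reproduces that behavior on a small standalone DataFrame built from the first four rows shown earlier: when either operand is missing (`NaN`), the result for that row is also `NaN`.

```python
import numpy as np
import pandas as pd

# First four rows of the two station columns, copied from the output above
df = pd.DataFrame({
    "station_antwerp": [np.nan, 50.5, 45.0, np.nan],
    "station_paris": [np.nan, 25.0, 27.7, 50.4],
})

# Element-wise division: each row is divided independently of the others
ratio = df["station_paris"] / df["station_antwerp"]
print(ratio)

# Rows with a missing operand yield NaN rather than raising an error
print(ratio.isna().sum())  # 2
```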

-----------------------
NOTE:

Other mathematical operators (+, -, \*, /) and comparison operators (<, >, ==, …) also work element-wise. We use the comparison operators to select specific rows in the **Data Selection** notebook.

-----------------------
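
As a small illustration of the note above, a comparison applied to a column returns a boolean Series with one `True`/`False` per row. The threshold of 30 is an arbitrary value chosen for this sketch; the readings are the Paris values shown earlier.

```python
import numpy as np
import pandas as pd

# Paris readings copied from the output above (first value is missing)
paris = pd.Series([np.nan, 25.0, 27.7, 50.4, 61.9], name="station_paris")

# Each element is compared with the threshold individually
above_30 = paris > 30
print(above_30.tolist())  # [False, False, False, True, True]

# Equality uses ==, not a single =
equal_25 = paris == 25.0
print(equal_25.tolist())  # [False, True, False, False, False]
```

Note that a comparison involving a missing value evaluates to `False`.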

# Column Creation and Deletion

Assigning a single value, such as a number or a string, fills the entire column with that value.


In [0]:
# Insert a scalar value
air_quality['scalar'] = 333

air_quality.head()

datetime,station_antwerp,station_paris,station_london,london_mg_per_cubic_meter,ratio_paris_antwerp,scalar
2019-05-07 02:00:00,,,23.0,43.286,,333
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505,333
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556,333
2019-05-07 05:00:00,,50.4,16.0,30.112,,333
2019-05-07 06:00:00,,61.9,,,,333


In [0]:
# Insert a text value
air_quality['text'] = "Pandas is awesome"

air_quality.head()

datetime,station_antwerp,station_paris,station_london,london_mg_per_cubic_meter,ratio_paris_antwerp,scalar,text
2019-05-07 02:00:00,,,23.0,43.286,,333,Pandas is awesome
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505,333,Pandas is awesome
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556,333,Pandas is awesome
2019-05-07 05:00:00,,50.4,16.0,30.112,,333,Pandas is awesome
2019-05-07 06:00:00,,61.9,,,,333,Pandas is awesome


In [0]:
# Delete the new "text" column
del air_quality['text']

air_quality.head()

datetime,station_antwerp,station_paris,station_london,london_mg_per_cubic_meter,ratio_paris_antwerp,scalar
2019-05-07 02:00:00,,,23.0,43.286,,333
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505,333
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556,333
2019-05-07 05:00:00,,50.4,16.0,30.112,,333
2019-05-07 06:00:00,,61.9,,,,333
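
The `del` statement removes the column in place. As an alternative sketch, `DataFrame.drop(columns=...)` returns a new DataFrame and leaves the original untouched, which can be convenient when the original is still needed. The two-row DataFrame below is a stand-in for illustration, not the full dataset.

```python
import pandas as pd

# Small stand-in DataFrame with a column we want to remove
df = pd.DataFrame({"station_london": [23.0, 19.0], "scalar": [333, 333]})

# drop() returns a copy without the listed columns
trimmed = df.drop(columns=["scalar"])

print(list(trimmed.columns))  # ['station_london']
print(list(df.columns))       # ['station_london', 'scalar']
```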


# Rename columns

Columns can be renamed by providing a dictionary where keys are the current names, and values are the new names.

In [0]:
air_quality = air_quality.rename(columns={"station_antwerp": "Antwerp",
                                        "station_paris": "Paris",
                                        "station_london": "London"})
air_quality.head()

datetime,Antwerp,Paris,London,london_mg_per_cubic_meter,ratio_paris_antwerp,scalar
2019-05-07 02:00:00,,,23.0,43.286,,333
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505,333
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556,333
2019-05-07 05:00:00,,50.4,16.0,30.112,,333
2019-05-07 06:00:00,,61.9,,,,333


Row labels can be renamed the same way, by passing a dictionary to the **index** parameter:

`air_quality = air_quality.rename(index={'old_name': 'new_name', ...})`
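
A minimal runnable sketch of renaming a row label follows. The DataFrame and its labels (`row_a`, `row_b`, `first`) are hypothetical, chosen only for illustration; `DataFrame.rename` accepts the row-label mapping through its `index` keyword.

```python
import pandas as pd

# Hypothetical DataFrame with simple string row labels
df = pd.DataFrame({"station_paris": [25.0, 27.7]}, index=["row_a", "row_b"])

# Map old row labels to new ones; labels not in the dictionary are kept
renamed = df.rename(index={"row_a": "first"})

print(list(renamed.index))  # ['first', 'row_b']
```

Like the column version, `rename` returns a new DataFrame, so the result has to be assigned to a variable to be kept.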

# Summary

- Create a new column by assigning the result of a calculation to the DataFrame, with the new column name between the `[]`.
- Operations are element-wise, so there is no need to loop over rows.
- Use `rename` with a dictionary to rename row labels or column names.