# Adding and dropping columns from DataFrames

This document demonstrates how to add and remove columns from a pandas DataFrame.

In [22]:
# import pandas
import pandas as pd
# load the gapminder dataset
gapminder = pd.read_csv('data/gapminder.csv')
# take a look at the head of gapminder
gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


### Creating a new column in a DataFrame

In [23]:
# compute the product of the pop and gdpPercap columns
gapminder['pop'] * gapminder['gdpPercap']

0       6.567086e+09
1       7.585449e+09
2       8.758856e+09
3       9.648014e+09
4       9.678553e+09
            ...     
1699    6.508241e+09
1700    7.422612e+09
1701    9.037851e+09
1702    8.015111e+09
1703    5.782658e+09
Length: 1704, dtype: float64

Let's add a new column `gdp` which is the product of the `pop` and `gdpPercap` columns:

In [24]:
# add a new column to gapminder corresponding to the product of the values in the 'pop' and 'gdpPercap' columns
gapminder['gdp'] = gapminder['pop'] * gapminder['gdpPercap']

Notice that the `gapminder` DataFrame object now has a new `gdp` column in the final column position:

In [25]:
# Has the original gapminder object changed?
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,gdp
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,6.567086e+09
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,7.585449e+09
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,8.758856e+09
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,9.648014e+09
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,9.678553e+09
...,...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306,6.508241e+09
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786,7.422612e+09
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960,9.037851e+09
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,8.015111e+09
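As a side note, pandas also provides the `.assign()` method, which returns a new DataFrame with the extra column rather than changing the existing one. The sketch below is only illustrative: it uses a tiny hand-built DataFrame called `df` (a hypothetical stand-in for `gapminder`, built from the first two rows shown above) so that it runs without the CSV file.

```python
import pandas as pd

# a tiny stand-in for gapminder (hypothetical name `df`), using the first two rows shown above
df = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan'],
    'year': [1952, 1957],
    'pop': [8425333, 9240934],
    'gdpPercap': [779.445314, 820.853030],
})

# .assign() returns a new DataFrame with the new column appended at the end
df_with_gdp = df.assign(gdp=df['pop'] * df['gdpPercap'])

# the new object has the gdp column ...
print(list(df_with_gdp.columns))
# ... while the DataFrame it was created from is left as it was
print(list(df.columns))
```

Whether to use `df['gdp'] = ...` or `.assign()` mostly depends on whether you want to change the existing object or create a new one.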


### Removing a column from a DataFrame using `.drop()`

To remove a column from a DataFrame, you can use the pandas `.drop` method. 
The code below "drops" the `gdp` column that we just created from the `gapminder` DataFrame:

In [26]:
# remove gdp from gapminder using the .drop(columns=) method
gapminder.drop(columns='gdp')

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


However, notice that this did not actually remove the `gdp` column from the `gapminder` object:

In [27]:
# did the original gapminder data object change?
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,gdp
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,6.567086e+09
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,7.585449e+09
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,8.758856e+09
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,9.648014e+09
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,9.678553e+09
...,...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306,6.508241e+09
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786,7.422612e+09
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960,9.037851e+09
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,8.015111e+09


`gapminder.drop(columns='gdp')` returned a new DataFrame without the `gdp` column (which the notebook then displayed), but it didn't update the `gapminder` DataFrame itself.

To update `gapminder` so that it no longer has the `gdp` column, you need to overwrite the `gapminder` variable by assigning the output of `.drop()` back to it, as follows:

In [28]:
# overwrite gapminder with the output of the .drop(columns=) method
gapminder = gapminder.drop(columns='gdp')

The `gapminder` DataFrame no longer contains the `gdp` column:

In [29]:
# now look at the gapminder data object - has it changed?
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
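The `columns=` argument of `.drop()` also accepts a list, and the `errors='ignore'` option skips labels that are not present. The following is a minimal sketch on a small hand-built DataFrame (the names `df`, `slim`, and `same` are hypothetical, chosen only for this illustration):

```python
import pandas as pd

# small stand-in DataFrame (hypothetical), with a gdp column added as in the text
df = pd.DataFrame({
    'country': ['Afghanistan', 'Zimbabwe'],
    'pop': [8425333, 11926563],
    'gdpPercap': [779.445314, 672.038623],
})
df['gdp'] = df['pop'] * df['gdpPercap']

# pass a list to drop more than one column at once
slim = df.drop(columns=['gdp', 'gdpPercap'])

# errors='ignore' skips missing labels instead of raising a KeyError
same = df.drop(columns=['not_a_column'], errors='ignore')

print(list(slim.columns))
print(list(same.columns))
# df itself still has all four columns, because .drop() returned new objects
print(list(df.columns))
```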


### Creating a copy of a DataFrame object

Suppose that you want to keep an unmodified copy of the original `gapminder` DataFrame object in your environment, and create a different version, called `gapminder_new`, that you can modify as much as you like. 

You might try to create a new variable `gapminder_new` that contains the original `gapminder` DataFrame as follows:

In [30]:
# define gapminder_new and set it equal to gapminder
gapminder_new = gapminder

Indeed, `gapminder_new` contains the same DataFrame object as `gapminder`:

In [31]:
# take a look at gapminder_new
gapminder_new

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


Let's add a `GDP` column to this new `gapminder_new` DataFrame object:

In [32]:
# Define a new column in gapminder_new called 'GDP' that is equal to the product of the 'pop' and 'gdpPercap' columns
gapminder_new['GDP'] = gapminder_new['pop'] * gapminder_new['gdpPercap']
# take a look at gapminder_new
gapminder_new

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,GDP
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,6.567086e+09
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,7.585449e+09
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,8.758856e+09
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,9.648014e+09
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,9.678553e+09
...,...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306,6.508241e+09
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786,7.422612e+09
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960,9.037851e+09
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,8.015111e+09


However, notice that this *also* added a `GDP` column to the original `gapminder` DataFrame, even though the code in the previous cell never referred to `gapminder` by name:

In [33]:
# take a look at the original gapminder object -- has it changed?
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,GDP
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,6.567086e+09
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,7.585449e+09
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,8.758856e+09
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,9.648014e+09
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,9.678553e+09
...,...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306,6.508241e+09
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786,7.422612e+09
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960,9.037851e+09
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,8.015111e+09


What's going on here?

Before answering, let's revert the `gapminder` DataFrame object to the original dataset by re-loading the CSV file:

In [34]:
# read in gapminder again to revert to the original dataset
gapminder = pd.read_csv('data/gapminder.csv')
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


### The `.copy()` method

The problem is that writing `gapminder_new = gapminder` does not create a new, independent version of `gapminder` and save it as `gapminder_new`. Instead, it creates a new "pointer" to the `gapminder` DataFrame: `gapminder_new` acts as an "alias" for the original `gapminder` DataFrame. Think of it as though you can now access the same DataFrame object using two variable names: `gapminder` and `gapminder_new`. The result is that modifying one will also modify the other.
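One way to see this aliasing directly is Python's `is` operator, which checks whether two names refer to the same object. A minimal sketch, using a small hand-built DataFrame (the names `df` and `alias` are hypothetical):

```python
import pandas as pd

# small stand-in DataFrame (hypothetical)
df = pd.DataFrame({
    'pop': [8425333, 9240934],
    'gdpPercap': [779.445314, 820.853030],
})

# plain assignment only creates a second name for the same object
alias = df
print(alias is df)  # True

# adding a column through one name is visible through the other
alias['GDP'] = alias['pop'] * alias['gdpPercap']
print('GDP' in df.columns)  # True
```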

To create an independent copy of a DataFrame, so that modifications to the new DataFrame are not reflected in the original one, use the pandas `.copy()` method.

The code below will create an *independent* copy of the `gapminder` DataFrame, and will save it in a new variable called `gapminder_new`:

In [35]:
# define gapminder_new this time as a copy of gapminder
gapminder_new = gapminder.copy()

Now let's add a new column to `gapminder_new` called `gdp_new`:

In [36]:
# add a column, gdp_new, to gapminder_new that is equal to the product of the 'pop' and 'gdpPercap' columns
gapminder_new['gdp_new'] = gapminder_new['pop'] * gapminder_new['gdpPercap']
# take a look at gapminder_new
gapminder_new

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,gdp_new
0,Afghanistan,Asia,1952,28.801,8425333,779.445314,6.567086e+09
1,Afghanistan,Asia,1957,30.332,9240934,820.853030,7.585449e+09
2,Afghanistan,Asia,1962,31.997,10267083,853.100710,8.758856e+09
3,Afghanistan,Asia,1967,34.020,11537966,836.197138,9.648014e+09
4,Afghanistan,Asia,1972,36.088,13079460,739.981106,9.678553e+09
...,...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306,6.508241e+09
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786,7.422612e+09
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960,9.037851e+09
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623,8.015111e+09


This time, the new column was not also created in the `gapminder` DataFrame. The two are now independent objects that can be modified separately.

In [37]:
# check whether the original gapminder object has changed
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
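The independence of a copy can also be checked programmatically rather than by eye. The sketch below uses a small hand-built DataFrame (hypothetical names `df` and `df_copy`) and relies on the fact that `.copy()` makes a deep copy by default:

```python
import pandas as pd

# small stand-in DataFrame (hypothetical)
df = pd.DataFrame({
    'pop': [8425333, 9240934],
    'gdpPercap': [779.445314, 820.853030],
})

# .copy() creates a separate object holding the same values
df_copy = df.copy()
print(df_copy is df)       # False: two distinct objects
print(df_copy.equals(df))  # True: identical contents (so far)

# modify the copy: add a column and overwrite one value
df_copy['gdp_new'] = df_copy['pop'] * df_copy['gdpPercap']
df_copy.loc[0, 'pop'] = 0

# the original keeps its columns and its values
print(list(df.columns))
print(df.loc[0, 'pop'])
```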


### Exercise 

Create a version of `gapminder` called `gapminder_gdp` that contains three columns: `country`, `year`, and `gdp` (the GDP for each country-year in millions). Make sure that the original `gapminder` DataFrame is not modified.

In [38]:
# create a copy of gapminder
gapminder_gdp = gapminder.copy()
# add a column called gdp
gapminder_gdp['gdp'] = gapminder_gdp['pop'] * gapminder_gdp['gdpPercap'] / 1e6
# subset to just the columns of interest
gapminder_gdp = gapminder_gdp[['country', 'year', 'gdp']]

In [39]:
gapminder_gdp

Unnamed: 0,country,year,gdp
0,Afghanistan,1952,6567.086330
1,Afghanistan,1957,7585.448670
2,Afghanistan,1962,8758.855797
3,Afghanistan,1967,9648.014150
4,Afghanistan,1972,9678.553274
...,...,...,...
1699,Zimbabwe,1987,6508.240905
1700,Zimbabwe,1992,7422.611852
1701,Zimbabwe,1997,9037.850590
1702,Zimbabwe,2002,8015.110972


In [40]:
# show that the original gapminder object remains unmodified
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
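The same exercise could also be written as a single chained expression. This is only one possible alternative, sketched on a two-row stand-in called `gapminder_small` (a hypothetical name, with values taken from the output above); because `.assign()` returns a new DataFrame, no explicit `.copy()` is needed here.

```python
import pandas as pd

# two-row stand-in for gapminder (hypothetical name), values from the tables above
gapminder_small = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan'],
    'continent': ['Asia', 'Asia'],
    'year': [1952, 1957],
    'lifeExp': [28.801, 30.332],
    'pop': [8425333, 9240934],
    'gdpPercap': [779.445314, 820.853030],
})

gapminder_gdp = (
    gapminder_small
    # add gdp in millions; the lambda receives the DataFrame being assigned to
    .assign(gdp=lambda d: d['pop'] * d['gdpPercap'] / 1e6)
    # keep only the three columns of interest
    .loc[:, ['country', 'year', 'gdp']]
)

print(gapminder_gdp)
# the stand-in DataFrame still has its original six columns
print(list(gapminder_small.columns))
```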


### Modifying existing columns of a DataFrame

The `df['col'] = ...` syntax can be used not only to add new columns, but also to modify existing columns.

Suppose that we want to replace the `lifeExp` column with a "rounded" version of the original column. We could use the `round()` function from the numpy library to compute a rounded version of this column (note that the numpy library must be imported before it is used, e.g., in the first cell of this notebook):

In [41]:
# import numpy
import numpy as np
# apply np.round() to the 'lifeExp' column of gapminder
np.round(gapminder['lifeExp'])

0       29.0
1       30.0
2       32.0
3       34.0
4       36.0
        ... 
1699    62.0
1700    60.0
1701    47.0
1702    40.0
1703    43.0
Name: lifeExp, Length: 1704, dtype: float64

We can then update the existing `lifeExp` column with this rounded version as follows:

In [42]:
# update the lifeExp column of gapminder with the rounded version
gapminder['lifeExp'] = np.round(gapminder['lifeExp'])
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,29.0,8425333,779.445314
1,Afghanistan,Asia,1957,30.0,9240934,820.853030
2,Afghanistan,Asia,1962,32.0,10267083,853.100710
3,Afghanistan,Asia,1967,34.0,11537966,836.197138
4,Afghanistan,Asia,1972,36.0,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.0,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.0,10704340,693.420786
1701,Zimbabwe,Africa,1997,47.0,11404948,792.449960
1702,Zimbabwe,Africa,2002,40.0,11926563,672.038623
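Two details about rounding may be useful here. pandas Series have their own `.round()` method that takes a number of decimal places, and `np.round()` rounds exact halves to the nearest even number rather than always up. A minimal sketch, using a short hand-built Series (the name `life_exp` is hypothetical, with values from the first rows of the `lifeExp` column above):

```python
import numpy as np
import pandas as pd

# short stand-in for the lifeExp column (hypothetical name)
life_exp = pd.Series([28.801, 30.332, 31.997], name='lifeExp')

# round to one decimal place with the Series method
one_decimal = life_exp.round(1)
print(one_decimal.tolist())

# round to whole numbers, then convert from float to integer dtype
whole = np.round(life_exp).astype(int)
print(whole.tolist())  # [29, 30, 32]

# exact halves are rounded to the nearest even value
halves = np.round([0.5, 1.5, 2.5])
print(halves.tolist())  # [0.0, 2.0, 2.0]
```

Note that rounding alone keeps the `float64` dtype, which is why the rounded column above displays values such as `29.0`; the `.astype(int)` step is only needed if integers are wanted.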
