# Extracting columns from a DataFrame as a Pandas Series object

In the video I was working in the old file, but it was getting a bit long, so here is a shorter one.

Let's start by importing the libraries that we need:

In [14]:
# import the pandas and numpy libraries
import pandas as pd
import numpy as np

Let's load the gapminder dataset:

In [2]:
# read the gapminder data and save it as a DataFrame object called 'gapminder'
gapminder = pd.read_csv('data/gapminder.csv')

To view the dataset we have just loaded, we can type the name of the variable that we saved it in:

In [3]:
# take a look at gapminder
gapminder

| Unnamed: 0 | country | continent | year | lifeExp | pop | gdpPercap |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
| ... | ... | ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 |
| 1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 |
| 1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 |


## Extracting columns from a DataFrame

There are several ways to extract a column from a DataFrame. 

### Method 1: Using square brackets

The first involves writing the name of the DataFrame object followed by square brackets, inside which you provide the name of the column you want to extract as a string:

In [4]:
# use the `df['colname']` syntax to extract the 'year' column
gapminder['year']

0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
1699    1987
1700    1992
1701    1997
1702    2002
1703    2007
Name: year, Length: 1704, dtype: int64

### Method 2: Using the column attribute with `.`

Another way to do the same thing is to use the `.` syntax to access the column as an attribute of the DataFrame object:

In [5]:
# use the df.colname syntax to extract the 'year' column
gapminder.year

0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
1699    1987
1700    1992
1701    1997
1702    2002
1703    2007
Name: year, Length: 1704, dtype: int64
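
One caveat with the `.` syntax is worth a quick sketch: it only reaches a column when the column name is a valid Python name that does not clash with an existing DataFrame attribute or method. The gapminder column `pop` is such a clash, because DataFrames already have a `pop()` method. The small DataFrame `df` below is a stand-in built from the first two rows shown above, not the full dataset:

```python
import pandas as pd

# Small stand-in for gapminder (first two Afghanistan rows shown above)
df = pd.DataFrame({"year": [1952, 1957], "pop": [8425333, 9240934]})

year_attr = df.year   # a Series: 'year' does not clash with anything
pop_attr = df.pop     # NOT the column: this is the DataFrame.pop method
pop_col = df["pop"]   # square brackets always return the column

print(type(year_attr).__name__)  # Series
print(callable(pop_attr))        # True
print(pop_col.tolist())          # [8425333, 9240934]
```

For this reason the square-bracket syntax is the safer default, and it is the one used for `pop` in the rest of this notebook.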

## The Pandas Series object

Above, we are extracting the `year` column from the gapminder DataFrame, but we are just displaying this column rather than saving it. 

Below, we save the `year` column from gapminder as a variable called `year`:

In [6]:
# define a variable called 'year' and assign it the 'year' column of gapminder
year = gapminder['year']

And we can display the `year` object by writing its name:

In [7]:
# take a look at the year variable
year

0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
1699    1987
1700    1992
1701    1997
1702    2002
1703    2007
Name: year, Length: 1704, dtype: int64

The question is: what is the type of this `year` object? The answer is a **Pandas Series** object:

In [8]:
# check the type of year using type()
type(year)

pandas.core.series.Series

Series objects are like 1-column DataFrame objects, but they don't have a `columns` attribute like a DataFrame does:

In [9]:
# try to extract the columns attribute from the year object
year.columns

AttributeError: 'Series' object has no attribute 'columns'
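
To make the difference concrete, here is a minimal sketch comparing a Series with a genuine 1-column DataFrame. It assumes a tiny stand-in DataFrame `df` rather than the full gapminder data. Passing a single column name in square brackets gives a Series, while passing a list containing one name gives a DataFrame. Instead of `columns`, a Series carries a single `name`:

```python
import pandas as pd

# Tiny stand-in DataFrame with one column
df = pd.DataFrame({"year": [1952, 1957, 1962]})

as_series = df["year"]    # single name -> Series
as_frame = df[["year"]]   # list of names -> 1-column DataFrame

print(type(as_series).__name__)         # Series
print(type(as_frame).__name__)          # DataFrame
print(as_series.name)                   # year
print(list(as_frame.columns))           # ['year']
print(as_series.shape, as_frame.shape)  # (3,) (3, 1)
```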

### The Series index

Series objects do, however, have an `index` (row name) attribute, which is inherited from the DataFrame from which the Series came:

In [10]:
# extract the index attribute from the year object
year.index

RangeIndex(start=0, stop=1704, step=1)
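
The inheritance is easier to see when the row labels are not just 0, 1, 2, and so on. The sketch below uses a small stand-in DataFrame with hypothetical row labels `"a"`, `"b"`, `"c"`, chosen only to make the index visible:

```python
import pandas as pd

# Hypothetical row labels, used only to make the index easy to spot
df = pd.DataFrame({"year": [1952, 1957, 1962]}, index=["a", "b", "c"])

year = df["year"]

print(list(year.index))             # ['a', 'b', 'c']
print(year.index.equals(df.index))  # True
print(year["b"])                    # 1957
```

The extracted Series keeps the same row labels as its parent DataFrame, so each value can still be looked up by its label.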

### The vectorized nature of Series objects

The nice thing about Pandas Series objects is that they are **vectorized**. 

This means that when you apply simple mathematical operations to them, the operation will be applied to *every* entry in the Series. For example, if we add `5` to the `year` Series object, `5` will be added to *every* value in the `year` Series object:

In [11]:
# Try to add 5 to year
year + 5

0       1957
1       1962
2       1967
3       1972
4       1977
        ... 
1699    1992
1700    1997
1701    2002
1702    2007
1703    2012
Name: year, Length: 1704, dtype: int64

Similarly, if we raise `lifeExp` to the power of 2, this computation will be applied to every single value in the Series object:

In [12]:
# Try to raise the lifeExp column of gapminder to the power of 2
gapminder['lifeExp'] ** 2

0        829.497601
1        920.030224
2       1023.808009
3       1157.360400
4       1302.343744
           ...     
1699    3887.647201
1700    3645.382129
1701    2191.082481
1702    1599.120121
1703    1891.119169
Name: lifeExp, Length: 1704, dtype: float64

You can also apply the numpy mathematical functions in a vectorized way to Series objects:

In [15]:
# Try to compute the logarithm of the gdpPercap column of gapminder
np.log(gapminder['gdpPercap'])

0       6.658583
1       6.710344
2       6.748878
3       6.728864
4       6.606625
          ...   
1699    6.559838
1700    6.541637
1701    6.675129
1702    6.510316
1703    6.152114
Name: gdpPercap, Length: 1704, dtype: float64

It is important to note that this behaviour, while it seems natural, is not shared by other Python object types, such as lists (more on lists in a future video).
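
As a brief illustration of that difference, the sketch below applies the same kinds of operations to a plain Python list and to a Series. The three values are taken from the `year` column shown above:

```python
import pandas as pd

years_list = [1952, 1957, 1962]
years_series = pd.Series(years_list)

# Adding a number to a list is an error rather than element-wise addition
try:
    years_list + 5
except TypeError as err:
    print("list + 5 fails:", err)

# For lists, + concatenates and * repeats
print(years_list + [5])  # [1952, 1957, 1962, 5]
print(years_list * 2)    # [1952, 1957, 1962, 1952, 1957, 1962]

# With a list you need an explicit loop (here, a list comprehension)
print([y + 5 for y in years_list])  # [1957, 1962, 1967]

# With a Series the operation is applied to every entry
print((years_series + 5).tolist())  # [1957, 1962, 1967]
```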

You can also ask boolean/logical questions of each value in a Series object simultaneously. For example, below, we ask which entries in the `year` column are equal to 2007:

In [16]:
# ask which entries in the year column of gapminder equal 2007
gapminder['year'] == 2007

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703     True
Name: year, Length: 1704, dtype: bool

And here we ask which entries in the `lifeExp` column are at least 60:

In [17]:
# ask which entries in the lifeExp column are greater than or equal to 60
gapminder['lifeExp'] >= 60

0       False
1       False
2       False
3       False
4       False
        ...  
1699     True
1700     True
1701    False
1702    False
1703    False
Name: lifeExp, Length: 1704, dtype: bool

These boolean Series will be very helpful when it comes to filtering our DataFrames, as you will see later.
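
As a small preview, here is a sketch of how a boolean Series can be used. It assumes a three-row stand-in DataFrame `df` whose values are copied from rows shown earlier, and the variable name `mask` is just an illustrative choice. Summing a boolean Series counts the `True` entries, and passing it inside square brackets keeps only the rows where it is `True`:

```python
import pandas as pd

# Three rows copied from the gapminder output shown earlier
df = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe", "Zimbabwe"],
    "year": [1952, 1987, 1992],
    "lifeExp": [28.801, 62.351, 60.377],
})

# A boolean Series: one True/False per row
mask = df["lifeExp"] >= 60
print(mask.tolist())  # [False, True, True]

# True counts as 1, so the sum is the number of matching rows
print(mask.sum())     # 2

# Keep only the rows where the mask is True
filtered = df[mask]
print(filtered["year"].tolist())  # [1987, 1992]
```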

### Exercise

1. Extract the `country` and `continent` columns from gapminder and create a Series object that contains the country and continent values separated by a comma, e.g., the first few entries should be "Afghanistan, Asia". As an added challenge, use the `drop_duplicates()` Pandas method to ensure that you only have unique values in your output.

In [18]:
# recall that adding two strings concatenates them
"a" + "b"

'ab'

In [24]:
# create a series that contains the country and continent for each row separated by a comma
country_continent = gapminder['country'] + ', ' + gapminder['continent']
country_continent.drop_duplicates()

0              Afghanistan, Asia
12               Albania, Europe
24               Algeria, Africa
36                Angola, Africa
48           Argentina, Americas
                  ...           
1644               Vietnam, Asia
1656    West Bank and Gaza, Asia
1668           Yemen, Rep., Asia
1680              Zambia, Africa
1692            Zimbabwe, Africa
Length: 142, dtype: object
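
Notice that the index of the output above jumps from 0 to 12 to 24. A minimal sketch of why, using a hypothetical three-entry Series `pairs`: `drop_duplicates()` returns a new Series that keeps the first occurrence of each value along with its original index label, and leaves the original Series unchanged. If you want the labels renumbered from 0, `reset_index(drop=True)` is one option:

```python
import pandas as pd

# Hypothetical short Series with one repeated value
pairs = pd.Series(["Afghanistan, Asia", "Afghanistan, Asia", "Albania, Europe"])

unique_pairs = pairs.drop_duplicates()

print(len(pairs))                # 3 (the original is unchanged)
print(unique_pairs.tolist())     # ['Afghanistan, Asia', 'Albania, Europe']
print(list(unique_pairs.index))  # [0, 2] (original labels are kept)

# Optionally renumber the index from 0
renumbered = unique_pairs.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1]
```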

2. Extract the `pop` and `gdpPercap` columns from gapminder and create a Series object that contains the total GDP for each row (that is, for each country in each year).

In [None]:
# create a series object containing the GDP by multiplying pop by gdpPercap
gapminder['pop'] * gapminder['gdpPercap']