# Pandas data frames

In this notebook, we will start learning about pandas DataFrames. Before you can import the pandas library into a notebook, it needs to be downloaded and installed. You can install it from a notebook cell by running `!pip install pandas`, or from a terminal (open one via "Terminal > New Terminal") by running the command `python3 -m pip install pandas`.



In [6]:
# install pandas
# !pip install pandas

Then you can import the pandas library into this notebook as follows:

In [7]:
# import the pandas library and alias as pd
import pandas as pd
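As a quick check that the installation and import worked, you can print the installed version. This is only a minimal sketch; the version number you see depends on your own environment.

```python
import pandas as pd

# pd is just a shorter alias for the pandas module;
# __version__ is a string holding the installed version number
print(pd.__version__)
```

If this cell raises a `ModuleNotFoundError`, pandas is not installed in the environment the notebook is using.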

## Loading a data file into a pandas DataFrame

To load a `.csv` data file into our notebook, we use the `read_csv()` function from the pandas library. If you are loading the file locally, make sure that you have saved `gapminder.csv` in a `data` subfolder that sits in the same folder as this notebook.

Let's load the gapminder dataset:

In [8]:
# read the csv file living in data/gapminder.csv into a pandas dataframe
# if local: 
# pd.read_csv('data/gapminder.csv')
# else load the file from the web:
pd.read_csv('https://raw.githubusercontent.com/UofUDELPHI/2024-02-08-python/main/content/complete/data/gapminder.csv')

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


This just prints out the gapminder dataset, but it doesn't save it. 

To save the dataset so that we can use it in our notebook, we need to assign the result of the `pd.read_csv()` function to a variable, which we will call `gapminder`:

In [9]:
# Save the above dataframe as a variable called gapminder
# if local file exists:
# gapminder = pd.read_csv('data/gapminder.csv')
# otherwise
gapminder = pd.read_csv('https://raw.githubusercontent.com/UofUDELPHI/2024-02-08-python/main/content/complete/data/gapminder.csv')
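If you want to experiment with `read_csv()` without a file or an internet connection, the sketch below reads a tiny CSV from an in-memory text buffer. The CSV text and the variable names `csv_text` and `sample` are made up for illustration and are not part of the gapminder data.

```python
import io
import pandas as pd

# A tiny, made-up CSV used only for illustration (not the real gapminder file)
csv_text = """country,year,pop
Atlantis,1952,1000
Atlantis,1957,1200
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# Because the result was assigned to a variable, it can be reused in later cells
print(sample)
```

The first line of the CSV text becomes the column names, and each following line becomes one row.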

To view the dataset we have just loaded, we can type the name of the variable that we saved it in:

In [10]:
# look at the dataframe object
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


We can then use the same `type()` function that we used in the previous notebook to ask what kind of object the `gapminder` variable is (the answer is a pandas DataFrame):

In [11]:
# check the type of the dataframe object
type(gapminder)

pandas.core.frame.DataFrame
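When you want to check the type inside code rather than read it off the screen, `isinstance()` is a common alternative to `type()`. The sketch below uses a small hypothetical DataFrame called `toy`, built from a dictionary, in place of `gapminder`.

```python
import pandas as pd

# Hypothetical two-row DataFrame built from a dictionary of columns
toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})

# type() reports the class of the object
print(type(toy))

# isinstance() answers a yes/no question: is this object a DataFrame?
is_frame = isinstance(toy, pd.DataFrame)
print(is_frame)
```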

## Extracting information/attributes from DataFrames

In this section, we will learn how to extract attributes from DataFrame objects and how to apply DataFrame-specific "methods" to DataFrames, both using the `.` syntax.

As a reminder, let's print out the `gapminder` DataFrame that we're working with:

In [12]:
# look at gapminder again
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


### The shape attribute

To extract an attribute from an object in Python, we use the `object.attribute` syntax. So if we want to extract the `shape` attribute from the `gapminder` DataFrame object, we can do so as follows:

In [13]:
# extract the shape attribute from gapminder
gapminder.shape

(1704, 6)

This `shape` attribute tells us the number of rows (1704) and the number of columns (6), and is helpful for learning about the size of our data objects.
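Because `shape` is a pair of numbers (rows first, then columns), it can be unpacked into two variables. The sketch below uses a small hypothetical DataFrame called `toy`; the names `n_rows` and `n_cols` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1952, 1957, 1962],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() on a DataFrame gives the number of rows, matching shape[0]
print(len(toy) == n_rows)
```

Note that `shape` is written without parentheses because it is an attribute, not a method.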

### The head() method

The `head()` function typically prints out the first few rows of a DataFrame. However, `head()` is not a regular function. If `head()` were a regular function, we would be able to apply it like this:

In [14]:
# try to look at the first few rows of the gapminder dataset using head() as a regular function
head(gapminder)

NameError: name 'head' is not defined

But this results in an error, because there is no standalone function called `head()`. Instead, `head()` is a special type of function that belongs to pandas objects such as DataFrames. Object-specific functions are called **methods**, and are applied using the `object.method()` syntax rather than the `method(object)` syntax above.

We can apply the `head()` method to the `gapminder` dataset as follows, which will print out the first 5 rows of the DataFrame:

In [15]:
# apply the head() method to gapminder
gapminder.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
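One way to see that `head()` really belongs to the DataFrame class is to look it up on the class itself and pass the object in explicitly. This is only an illustrative sketch using a hypothetical DataFrame called `toy`; in practice you would always write `toy.head()`.

```python
import pandas as pd

# Hypothetical single-column DataFrame with 7 rows
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967, 1972, 1977, 1982]})

# The usual object.method() form
via_method = toy.head()

# The same method, looked up on the DataFrame class, with the object passed explicitly
via_class = pd.DataFrame.head(toy)

# Both calls return the same first 5 rows
print(via_method.equals(via_class))
```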


### Arguments

You can provide additional arguments to the `head()` method inside the parentheses. For example, if you want to print 10 rows instead of 5, you can do so as follows:

In [16]:
# apply the head() method to gapminder with an argument of 10
gapminder.head(10)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.11336
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351


This argument has a name `n`, which you can explicitly specify:

In [17]:
# apply the head method to gapminder with a *named* argument of n=10
gapminder.head(n=10)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.11336
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351


But you don't need to specify the `n=` part of the argument because the `head()` method knows that the first argument is the number of rows to print.
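To confirm that the positional and named forms give the same result, you can compare the two outputs directly. The sketch below uses a hypothetical DataFrame called `toy` and the DataFrame `equals()` method, which checks whether two DataFrames hold the same values.

```python
import pandas as pd

# Hypothetical single-column DataFrame with 7 rows
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967, 1972, 1977, 1982]})

by_position = toy.head(3)    # argument matched by position
by_name = toy.head(n=3)      # argument matched by name
default = toy.head()         # no argument: n falls back to its default of 5

print(by_position.equals(by_name))
print(len(default))
```

Naming the argument is optional here, but it can make code easier to read when a method takes several arguments.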

### Exercise

1. The pandas DataFrame has an attribute called `dtypes` that will print out the *type* of each column. Extract the `dtypes` attribute from the `gapminder` DataFrame:

In [18]:
# extract the dtypes attribute from gapminder
gapminder.dtypes

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object

Note that in this output, pandas reports the type of the columns that contain strings (`country` and `continent`) as `object`.

2. The pandas DataFrame has a method called `select_dtypes()` that will extract just the columns of a certain type from the DataFrame. Use the `select_dtypes()` method to extract the numeric (float and integer) columns of gapminder by providing the argument `include='number'` inside the parentheses of `select_dtypes()`.

In [19]:
# use the select_dtypes() method to select only the columns of type number
gapminder.select_dtypes(include='number')

Unnamed: 0,year,lifeExp,pop,gdpPercap
0,1952,28.801,8425333,779.445314
1,1957,30.332,9240934,820.853030
2,1962,31.997,10267083,853.100710
3,1967,34.020,11537966,836.197138
4,1972,36.088,13079460,739.981106
...,...,...,...,...
1699,1987,62.351,9216418,706.157306
1700,1992,60.377,10704340,693.420786
1701,1997,46.809,11404948,792.449960
1702,2002,39.989,11926563,672.038623


In [20]:
# to instead extract the string/object-type columns:
gapminder.select_dtypes(include='object')

Unnamed: 0,country,continent
0,Afghanistan,Asia
1,Afghanistan,Asia
2,Afghanistan,Asia
3,Afghanistan,Asia
4,Afghanistan,Asia
...,...,...
1699,Zimbabwe,Africa
1700,Zimbabwe,Africa
1701,Zimbabwe,Africa
1702,Zimbabwe,Africa
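The two exercises can be tried on a smaller scale with the sketch below, which uses a hypothetical three-column DataFrame called `toy`. It also uses the `exclude` argument of `select_dtypes()`, which is not covered above and keeps every column that is *not* of the given type.

```python
import pandas as pd

# Hypothetical DataFrame with one string column and two numeric columns
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1957],
    "lifeExp": [28.801, 30.332],
})

# dtypes is itself a pandas object indexed by column name,
# so the type of a single column can be looked up by name
col_types = toy.dtypes
print(col_types["year"])

# Keep only the numeric (integer and float) columns
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))

# Keep everything except the numeric columns
non_numeric = toy.select_dtypes(exclude="number")
print(list(non_numeric.columns))
```

Using `exclude="number"` rather than `include="object"` avoids depending on how a particular pandas version labels string columns.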
