# Pandas data frames

In this notebook, we will start learning about pandas DataFrames. Before you can import the pandas library into a notebook, you first need to download and install it. You can do this by running the command `pip install pandas` in a terminal (or `%pip install pandas` in a notebook cell).



In [None]:
# install the pandas library


Then you can import the pandas (and numpy) libraries into this notebook as follows:

In [None]:
# import pandas and numpy
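
By convention, pandas is usually imported under the alias `pd` (the alias used in the `pd.read_csv()` calls later in this notebook) and numpy under the alias `np`. A minimal sketch of such an import cell, with a quick version check added purely for illustration:

```python
# import pandas and numpy under their conventional aliases
import pandas as pd
import numpy as np

# print the installed versions to confirm that both imports worked
print(pd.__version__)
print(np.__version__)
```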


To load a .csv data file into our notebook, we use the `read_csv()` function from the pandas library. Make sure that you have saved the `gapminder.csv` file in a `data` subfolder located in the same folder as this notebook.

Let's load the gapminder dataset:

In [None]:
# use pd.read_csv to import the gapminder dataset


This just prints out the gapminder dataset, but it doesn't save it. 

To save the dataset so that we can use it in our notebook, we can assign the result of the `pd.read_csv()` function to a variable called `gapminder`:

In [None]:
# use pd.read_csv to import the gapminder dataset and save it as gapminder
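
As a rough sketch of this load-and-assign pattern, the cell below reads a small, made-up CSV from an in-memory string so that it runs without the data file. The `toy_csv` text and the `toy` variable are hypothetical stand-ins, not the real gapminder data; with the real file you would presumably pass a path such as `'data/gapminder.csv'` to `pd.read_csv()` instead.

```python
import io
import pandas as pd

# a tiny made-up CSV (illustrative values only, not the real gapminder data)
toy_csv = """country,year,pop
A,2000,100
B,2000,250
A,2005,120
"""

# read_csv accepts a file path or a file-like object; assigning the result
# to a variable keeps the DataFrame available for later cells
toy = pd.read_csv(io.StringIO(toy_csv))

print(type(toy))
print(toy)
```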


To view the dataset we have just loaded, we can type the name of the variable that we saved it in:

In [None]:
# view gapminder


We can then use the same `type()` function that we used in the previous notebook to ask what kind of object the `gapminder` variable is (the answer is a pandas DataFrame):

In [None]:
# check the type of gapminder using type()


## Extracting information/attributes from DataFrames

In this section, we will learn how to extract attributes from DataFrame objects and how to apply DataFrame-specific "methods" to DataFrames, both using the `.` syntax.

To extract an attribute from an object in Python, we use the `object.attribute` syntax. So if we want to extract the `shape` attribute from the `gapminder` DataFrame object, we can do so as follows:

In [None]:
# extract the shape attribute from gapminder


This `shape` attribute tells us the number of rows (1704) and the number of columns (6), and is helpful for learning about the size of our data objects.
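
To make the `object.attribute` syntax concrete, here is a minimal sketch on a small, made-up DataFrame (the `toy` variable and its values are hypothetical, used only so that the cell runs on its own):

```python
import pandas as pd

# a small made-up DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2000, 2000],
})

# shape is an attribute, so there are no parentheses after it
print(toy.shape)  # (3, 2)

# the result is a (rows, columns) tuple that can be unpacked
n_rows, n_cols = toy.shape
```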

The `head()` function typically prints out the first few rows of a DataFrame. However, `head()` is not a regular function. If `head()` were a regular function, we would be able to apply it like this:

In [None]:
# look at the first 6 rows of gapminder
# try: head(gapminder)


But this results in an error (a `NameError`, because Python does not know of any standalone function called `head`). Instead, `head()` is a special type of function that belongs to pandas DataFrame objects. Object-specific functions are called **methods**, and are applied using the `object.method()` syntax rather than the `function(object)` syntax above.
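
The difference between the two calling styles can be sketched on a small, made-up DataFrame (`toy` is a hypothetical stand-in). The sketch assumes that no function named `head` has been defined or imported elsewhere in the session:

```python
import pandas as pd

# a made-up DataFrame with 7 rows
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7]})

# calling head() as a regular function fails because no such function exists
try:
    head(toy)
except NameError as err:
    print("NameError:", err)

# calling head() as a method of the DataFrame works
first_rows = toy.head()
print(first_rows)
```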

We can apply the `head()` method to the `gapminder` dataset as follows, which will print out the first 5 rows of the DataFrame:

In [None]:
# use the head() method on gapminder


You can provide additional arguments to the `head()` method inside the parentheses. For example, if you want to print 10 rows instead of 5, you can do so as follows:

In [None]:
# look at the first 10 rows using head


This argument has a name `n`, which you can explicitly specify:

In [None]:
# look at the first 10 rows using head with the named argument 'n'


But you don't need to specify the `n=` part of the argument because the `head()` method knows that the first argument is the number of rows to print.
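
A short sketch on a made-up 20-row DataFrame (the `toy` variable is hypothetical) illustrates that the positional and named forms return the same rows:

```python
import pandas as pd

# a made-up DataFrame with 20 rows
toy = pd.DataFrame({"x": range(20)})

by_position = toy.head(10)  # the first argument is interpreted as n
by_name = toy.head(n=10)    # the same request with the argument named

# both calls return the same 10 rows
print(by_position.equals(by_name))  # True
```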

### Exercise

1. The pandas DataFrame has an attribute called `dtypes` that records the *type* of each column. Extract the `dtypes` attribute from the `gapminder` DataFrame:

In [None]:
# extract the dtypes attribute from gapminder


Note that pandas has traditionally reported the type of "string" columns as `object` (newer versions of pandas may show a dedicated string type instead).

2. The pandas DataFrame has a "method" called `select_dtypes()` that will extract just the columns of a certain type from the DataFrame. Use the `select_dtypes()` method to extract the numeric (float and integer) columns of `gapminder` by providing the argument `include='number'` inside the parentheses of `select_dtypes()`.

In [None]:
# use the select_dtypes() DataFrame method to extract the number type columns only


In [None]:
# instead extract the string/object-type columns:
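
If you get stuck, the following sketch shows the same ideas on a small, made-up DataFrame rather than on `gapminder` (the `toy` variable, its column names, and its values are all hypothetical). It also uses the `exclude` argument of `select_dtypes()`, which is one possible way to keep only the non-numeric columns:

```python
import pandas as pd

# a made-up DataFrame with one text column and two numeric columns
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2000, 2005, 2010],
    "value": [70.1, 71.3, 72.8],
})

# dtypes is an attribute: one type per column
print(toy.dtypes)

# keep only the numeric (integer and float) columns
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))

# keep everything that is not numeric
non_numeric = toy.select_dtypes(exclude="number")
print(list(non_numeric.columns))
```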


## DataFrame indexing

In this section we will learn about the column and row indexes of the DataFrame. These are essentially the column and row names of the DataFrame.

We will keep working with the gapminder DataFrame that is printed below:

In [None]:
# print out gapminder again


### The column index

To extract the column index, which corresponds to the column names of the DataFrame, we need to extract the `columns` attribute of the DataFrame object.

In [None]:
# extract the column index via the columns attribute


Notice that the output of the cell above is an "Index" object. If we want to extract just the column names themselves from the index object, we can use the `list()` function to convert the index object to a simpler type of object called a "list" (which is just a collection of values):

In [None]:
# convert the columns attribute to a list using the list() function
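
As a small sketch of the difference between the Index object and the plain list, using a made-up DataFrame (`toy` and its columns are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "year": [2000, 2005]})

# the columns attribute is a pandas Index object
col_index = toy.columns
print(col_index)

# list() converts it to a plain Python list of column names
col_names = list(toy.columns)
print(col_names)  # ['country', 'year']
```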


### The row index

The row index can be extracted using the `index` attribute (there is no `rows` attribute):

In [None]:
# extract the row index via the index attribute


This time, the output is a `RangeIndex` object, which corresponds to a sequence of integer values with a start value and a stop value with a step size. Since the start is 0, the stop is 1704 (note that the stop is *not* inclusive) and the step size is 1, this RangeIndex corresponds to the integer values 0, 1, 2, 3, ..., 1703. 

To extract the actual integer values from the RangeIndex object, we can convert it to a list using the `list()` function:

In [None]:
# convert the row index to a list using the list() function
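
The same behavior can be seen at a smaller scale. In this sketch, a made-up DataFrame with 3 rows (the `toy` variable is hypothetical) gets a default RangeIndex that starts at 0 and stops at 3, which corresponds to the values 0, 1, 2:

```python
import pandas as pd

# a made-up DataFrame with 3 rows and the default row index
toy = pd.DataFrame({"x": [10, 20, 30]})

row_index = toy.index
print(row_index)  # RangeIndex(start=0, stop=3, step=1)

# the stop value (3) is not included in the resulting list
print(list(row_index))  # [0, 1, 2]
```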


You can change the index using the `set_index()` method and providing, for example, a column name as a string.

Let's set the `country` column to be the row index:

In [None]:
# use the set_index() method to set the country column as the row index


Notice that the row index name is on its own line and the original integer row index has disappeared. 

However, notice also that this did not actually modify the `gapminder` object itself. When we print it out below, notice that it is unchanged:

In [None]:
# print out gapminder


If we want to create a version of the `gapminder` DataFrame with the `country` column as the index, we need to save the result as a new variable. (You could instead overwrite the `gapminder` variable with this new version, but this is not recommended because we want to keep an unmodified version of the original dataset accessible in our environment.)

Below, we create a *new* DataFrame corresponding to the version of `gapminder` with the `'country'` column as the row index:

In [None]:
# define gapminder_country with the country column as the index
# use set_index() method
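
The pattern can be sketched on a small, made-up DataFrame (the names `toy` and `toy_country` and their values are hypothetical). By default, `set_index()` returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "pop": [100, 250, 120],
})

# set_index() returns a new DataFrame, which we save under a new name
toy_country = toy.set_index("country")

# the original still has country as a regular column
print(list(toy.columns))          # ['country', 'pop']

# in the new version, country has moved from the columns to the row index
print(list(toy_country.columns))  # ['pop']
print(toy_country.index.name)     # country
```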


Notice that the original gapminder dataset is unchanged:

In [None]:
# look at gapminder


But the `gapminder_country` DataFrame has the `country` column as its index.

In [None]:
# look at gapminder_country
