#### Week 2 - Data Analysis with Pandas

# Lesson 2: Selecting Columns and Rows 🏛🚣

## Objectives:

Columns:
**_Selecting_** and **_Working with_** columns from a dataframe


Rows: 
**_Understanding the Difference_** between **_loc and iloc_** methods, and **_Selecting_** rows from a dataframe


> ## Warmup
>
> 1. Open a new Jupyter notebook
> 2. Import pandas
> 3. Read the large countries dataset from the data folder

In [31]:
import pandas as pd

In [68]:
countries = pd.read_csv('./data/large_countries_2015.csv')
countries 

Unnamed: 0,country,population,fertility,continent
0,Bangladesh,160995600.0,2.12,Asia
1,Brazil,207847500.0,1.78,South America
2,China,1376049000.0,1.57,Asia
3,India,1311051000.0,2.43,Asia
4,Indonesia,257563800.0,2.28,Asia
5,Japan,126573500.0,1.45,Asia
6,Mexico,127017200.0,2.13,North America
7,Nigeria,182202000.0,5.89,Africa
8,Pakistan,188924900.0,3.04,Asia
9,Philippines,100699400.0,2.98,Asia


In [69]:
# look at the dataframe

df = countries
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     12 non-null     object 
 1   population  12 non-null     float64
 2   fertility   12 non-null     float64
 3   continent   12 non-null     object 
dtypes: float64(2), object(2)
memory usage: 516.0+ bytes


## 1. Columns

### 1.1 Renaming the columns

In [34]:
# change the column names



In [35]:
# look at the dataframe
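
One possible way to fill in the two cells above, as a minimal sketch. It assumes a small hand-typed subset of the table from the warmup instead of the CSV file, and the new names (`fertility_rate`, the capitalised headers) are made up for illustration.

```python
import pandas as pd

# Hand-typed subset of the warmup table (assumption: not read from the CSV).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
    "continent": ["Asia", "South America", "Asia"],
})

# rename() returns a new dataframe; only the columns listed in the dict change.
renamed = df.rename(columns={"fertility": "fertility_rate"})

# Assigning to .columns replaces every name at once (the list length must match).
capitalised = df.copy()
capitalised.columns = ["Country", "Population", "Fertility", "Continent"]

print(renamed.columns.tolist())
print(capitalised.columns.tolist())
```

`rename` is handy when only one or two names change; assigning to `.columns` is shorter when all of them do.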



### 1.2 Selecting a single column

In [36]:
# using single square brackets, we select a single column as a series


In [37]:
# Note that each column is a pandas Series



In [38]:
# using double square brackets, we select a single column as a dataframe



In [39]:
# type
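
The difference between single and double brackets can be sketched like this, again assuming a hand-typed subset of the warmup table rather than the CSV file:

```python
import pandas as pd

# Hand-typed subset of the warmup table (assumption: not read from the CSV).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
})

as_series = df["country"]    # single brackets: a pandas Series
as_frame = df[["country"]]   # double brackets: a dataframe with one column

print(type(as_series))
print(type(as_frame))
print(as_frame.shape)
```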



### 1.3 Selecting Multiple Columns

In [40]:
# when selecting multiple columns, we HAVE to use double square brackets
# and we get a dataframe back
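
A minimal sketch of this cell, assuming a hand-typed subset of the warmup table. The inner pair of brackets is simply a Python list of column names, so the list can also be stored in a variable first (`wanted` is a made-up name).

```python
import pandas as pd

# Hand-typed subset of the warmup table (assumption: not read from the CSV).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
    "continent": ["Asia", "South America", "Asia"],
})

# Outer brackets index the dataframe, inner brackets build the list of names.
subset = df[["country", "population"]]

# The same selection with the list stored in a variable first.
wanted = ["country", "population"]
same_subset = df[wanted]

print(subset)
```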



### 1.4 Changing column types

In [41]:
# Notice that a dataframe can hold a different data type in each column, which is why the attribute is the plural dtypes

countries.dtypes

country        object
population    float64
fertility     float64
continent      object
dtype: object

In [42]:
# we sometimes need to change the data type of a column; this is usually part of data cleaning



In [43]:
# Let's check the datatypes again
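
As a sketch of the two cells above, the population column could be converted from float to integer with `astype`. This assumes a hand-typed subset of the warmup table and that the column has no missing values (a float column containing NaN cannot be cast to `int64`).

```python
import pandas as pd

# Hand-typed subset of the warmup table (assumption: not read from the CSV).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
})

# astype() returns a converted copy, so we assign it back to the column.
df["population"] = df["population"].astype("int64")

# Check the data types again: population is now an integer column.
print(df.dtypes)
```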



###  1.5 Creating columns

How to round numbers in Python:
`round(<number_to_round>, <number_of_decimal_places>)`
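
A new column is created by assigning to a column name that does not exist yet. The sketch below assumes a hand-typed subset of the warmup table; the column name `population_millions` is made up for illustration. For a whole column, the Series method `.round()` does the same job as the built-in `round()` does for a single number.

```python
import pandas as pd

# Hand-typed subset of the warmup table (assumption: not read from the CSV).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
})

# Built-in round() works on a single number.
one_value = round(1.78, 1)

# Series.round() rounds every value in the column.
# Assigning to a new name creates the column.
df["population_millions"] = (df["population"] / 1_000_000).round(1)

print(one_value)
print(df)
```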

### 1.6 Dropping columns

In [44]:
# dropping a single column



In [45]:
# note that the dataframe did not change!



Uh-oh! Dropping returned a (changed) copy of the dataframe, but didn't change the original!

To make the changes stick, you can:
* assign the result to another dataframe
* use the `inplace=True` parameter

In [46]:
# assign the result to another dataframe



In [47]:
# to drop from the dataframe we use the inplace=True parameter
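
Both options can be sketched as follows, assuming a hand-typed subset of the warmup table; which columns get dropped is an arbitrary choice for the example.

```python
import pandas as pd

# Hand-typed subset of the warmup table (assumption: not read from the CSV).
df = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
    "continent": ["Asia", "South America", "Asia"],
})

# Option 1: drop() returns a changed copy; the original df keeps the column.
without_continent = df.drop(columns="continent")
print(df.columns.tolist())

# Option 2: inplace=True changes df itself and returns None.
result = df.drop(columns="fertility", inplace=True)
print(df.columns.tolist())
```

Assigning the result to a variable keeps the original dataframe available, which makes it easier to re-run cells in a notebook.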



## 2. Selecting rows (and columns): `.loc[]` and `.iloc[]` methods

A brief slicing recap:

In [48]:
a = [1,2,3,4,5,6]

In [49]:
a

[1, 2, 3, 4, 5, 6]

Reminder: slicing syntax is `a[start:end:step]`
* If not specified, `start` is the beginning of the list.
* If not specified, `end` is the end of the list.
* You can use minus to count from the back, e.g. the second element from the back is `a[-2]`
* If not specified, `step` is 1.

In [50]:
# select first 4 numbers

a[0:4]

[1, 2, 3, 4]

In [51]:
# what does this do?
a[::3]


[1, 4]
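
A few more slices on the same list, as a quick sketch of the rules in the reminder (negative indices and an explicit step):

```python
a = [1, 2, 3, 4, 5, 6]

second_from_back = a[-2]   # negative index counts from the back
last_three = a[-3:]        # from the third-to-last element to the end
every_second = a[1:5:2]    # start at index 1, stop before index 5, step 2
backwards = a[::-1]        # a negative step walks the list in reverse

print(second_from_back, last_three, every_second, backwards)
```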

### 2.1. Selecting based on a single value (select a single row/column)

In [52]:
countries

Unnamed: 0,country,population,fertility,continent
0,Bangladesh,160995600.0,2.12,Asia
1,Brazil,207847500.0,1.78,South America
2,China,1376049000.0,1.57,Asia
3,India,1311051000.0,2.43,Asia
4,Indonesia,257563800.0,2.28,Asia
5,Japan,126573500.0,1.45,Asia
6,Mexico,127017200.0,2.13,North America
7,Nigeria,182202000.0,5.89,Africa
8,Pakistan,188924900.0,3.04,Asia
9,Philippines,100699400.0,2.98,Asia


### loc[ ]

In [53]:
# Example 1: Select the row that contains Brazil's data

countries.loc[1,:] # show all info about one of the elements

country              Brazil
population      207847528.0
fertility              1.78
continent     South America
Name: 1, dtype: object

In [55]:
# Example 2: Select the rows that contain Brazil and Bangladesh

countries.loc[0:1, :]  # Bangladesh has label 0 and Brazil has label 1

**_How does loc[ ] work?_**

- Notice that a `loc` slice such as `0:1` includes both endpoints, so it returns rows 0 and 1
- `loc` uses the row labels. Let's make some changes to understand this better

In [None]:
# changing the index column of the dataframe
countries.set_index('country', inplace=True) 

# It is also possible to set a column as the index while reading the data
# countries = pd.read_csv('./data/large_countries_2015.csv', index_col=0)


**_Let's do the example 1 and 2 again_**

In [None]:
# Example 1: Select the row that contains Brazil's data

# this raises a KeyError now, because the index labels are country names; we have to pass one of the new labels instead
countries.loc[1, :] 


In [None]:
# Example 2: Select the rows that contain Brazil and Bangladesh
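
With the country names as index, Examples 1 and 2 could look like the sketch below. It assumes a hand-typed subset of the table and sets the index with `set_index`, as in the cell above.

```python
import pandas as pd

# Hand-typed subset of the table (assumption: not read from the CSV).
countries = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
    "continent": ["Asia", "South America", "Asia"],
}).set_index("country")

# Example 1: a single label returns one row as a Series.
brazil = countries.loc["Brazil", :]

# Example 2: a list of labels returns the rows in the order requested.
both = countries.loc[["Brazil", "Bangladesh"], :]

# A label slice includes both endpoints.
span = countries.loc["Bangladesh":"Brazil", :]

print(brazil)
print(both)
print(span)
```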



### iloc[ ]

In [64]:
countries

Unnamed: 0,country,population,fertility,continent
0,Bangladesh,160995600.0,2.12,Asia
1,Brazil,207847500.0,1.78,South America
2,China,1376049000.0,1.57,Asia
3,India,1311051000.0,2.43,Asia
4,Indonesia,257563800.0,2.28,Asia
5,Japan,126573500.0,1.45,Asia
6,Mexico,127017200.0,2.13,North America
7,Nigeria,182202000.0,5.89,Africa
8,Pakistan,188924900.0,3.04,Asia
9,Philippines,100699400.0,2.98,Asia


In [66]:
# Example 1: Select the row that contains Brazil's data

countries.iloc[1, :]  # position 1 is the second row (Brazil)

In [63]:
# Example 2: Select the rows that contain Brazil and Bangladesh
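
A possible version of this cell, assuming a hand-typed subset of the table with the default integer index. The last line is added for contrast: with the same numbers, a `loc` slice returns one row more than an `iloc` slice.

```python
import pandas as pd

# Hand-typed subset of the table (assumption: not read from the CSV).
countries = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
})

# Positions 0 and 1; the end position 2 is excluded.
first_two = countries.iloc[0:2, :]

# A list of positions returns the rows in the order requested.
picked = countries.iloc[[1, 0], :]

# For contrast: loc treats 0:2 as labels and includes the end label.
by_label = countries.loc[0:2, :]

print(first_two)
print(picked)
print(len(first_two), len(by_label))
```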



- Notice that an `iloc` slice does not include the end position: `[0:2]` returns positions 0 and 1
- `iloc` uses integer positions, not labels. That's why we cannot use country names with `iloc`

In [None]:
# that's why this will give an error
countries.iloc['Bangladesh':'Brazil', :]

In [None]:
# to understand it better, let's create a column with the row count and add it to our dataframe
import numpy as np
index = np.arange(0, countries.shape[0])

index # we will use this array again in the later steps, please keep that in mind

In [None]:
#let's add the column with the row count

countries['row'] = index

countries

# now we can see the position number of each row better

#### Example 3: What is Brazil's fertility rate?

In [None]:
# using loc



In [None]:
# using iloc
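
Example 3 could be answered both ways as sketched below. This assumes a hand-typed subset of the table with `country` as the index, so that `fertility` sits at column position 1.

```python
import pandas as pd

# Hand-typed subset of the table (assumption: not read from the CSV).
countries = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China"],
    "population": [160995600.0, 207847500.0, 1376049000.0],
    "fertility": [2.12, 1.78, 1.57],
    "continent": ["Asia", "South America", "Asia"],
}).set_index("country")

# loc: row label and column label.
with_loc = countries.loc["Brazil", "fertility"]

# iloc: row position 1 (Brazil) and column position 1 (fertility).
with_iloc = countries.iloc[1, 1]

print(with_loc, with_iloc)
```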



#### Example 4: Select the rows for **Brazil**, **Mexico** and **Japan** 

In [56]:
countries

Unnamed: 0,country,population,fertility,continent
0,Bangladesh,160995600.0,2.12,Asia
1,Brazil,207847500.0,1.78,South America
2,China,1376049000.0,1.57,Asia
3,India,1311051000.0,2.43,Asia
4,Indonesia,257563800.0,2.28,Asia
5,Japan,126573500.0,1.45,Asia
6,Mexico,127017200.0,2.13,North America
7,Nigeria,182202000.0,5.89,Africa
8,Pakistan,188924900.0,3.04,Asia
9,Philippines,100699400.0,2.98,Asia


In [67]:
# using loc

# assumes 'country' is the index, as set with set_index above
countries.loc[['Brazil', 'Mexico', 'Japan'], :]

In [None]:
# using iloc, we have to locate the rows and columns we need by number/index
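
A sketch of the `iloc` version, assuming the first seven rows of the table typed in by hand with `country` as the index. Brazil, Mexico and Japan then sit at positions 1, 6 and 5. Counting rows by eye gets tedious, so the example also looks the positions up with `Index.get_indexer`.

```python
import pandas as pd

# First seven rows of the table, typed by hand (assumption: not read from the CSV).
countries = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "India",
                "Indonesia", "Japan", "Mexico"],
    "population": [160995600.0, 207847500.0, 1376049000.0, 1311051000.0,
                   257563800.0, 126573500.0, 127017200.0],
    "fertility": [2.12, 1.78, 1.57, 2.43, 2.28, 1.45, 2.13],
    "continent": ["Asia", "South America", "Asia", "Asia",
                  "Asia", "Asia", "North America"],
}).set_index("country")

# Row positions typed by hand: Brazil=1, Mexico=6, Japan=5.
by_position = countries.iloc[[1, 6, 5], :]

# The same positions looked up from the labels.
positions = countries.index.get_indexer(["Brazil", "Mexico", "Japan"])
looked_up = countries.iloc[positions, :]

print(positions)
print(by_position)
```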




#### Example 5: Select **population** and **continent** for Brazil, Mexico, and Japan

In [None]:
# we can also choose one or more columns to show
#loc



In [None]:
#iloc
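
Example 5 could be solved both ways as sketched below, under the same assumptions as before: the first seven rows of the table typed in by hand, with `country` as the index. With that layout, `population` is at column position 0 and `continent` at position 2.

```python
import pandas as pd

# First seven rows of the table, typed by hand (assumption: not read from the CSV).
countries = pd.DataFrame({
    "country": ["Bangladesh", "Brazil", "China", "India",
                "Indonesia", "Japan", "Mexico"],
    "population": [160995600.0, 207847500.0, 1376049000.0, 1311051000.0,
                   257563800.0, 126573500.0, 127017200.0],
    "fertility": [2.12, 1.78, 1.57, 2.43, 2.28, 1.45, 2.13],
    "continent": ["Asia", "South America", "Asia", "Asia",
                  "Asia", "Asia", "North America"],
}).set_index("country")

# loc: lists of row labels and column labels.
with_loc = countries.loc[["Brazil", "Mexico", "Japan"], ["population", "continent"]]

# iloc: lists of row positions and column positions.
with_iloc = countries.iloc[[1, 6, 5], [0, 2]]

print(with_loc)
print(with_iloc)
```

Both selections return the same dataframe; `loc` tends to be easier to read, while `iloc` keeps working when the labels are unknown or not unique.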



- The **.loc** method is **label-based**: 
    - You have to specify `rows` and `columns` based on their row and column **labels**. 
    - It has the following syntax: `df.loc[row_label, column_label]`


- The **.iloc** is integer **position-based**:
    - You have to specify `rows` and `columns` by their **integer position values** (0-based integer position). 
    - It has the following syntax: `df.iloc[row_position, column_position]`