# Pandas 1

---

## Content

- Introduction to Pandas
- DataFrame & Series
- Creating DataFrame from Scratch (Post-read)
- Basic ops on a DataFrame
- Basic ops on Columns
    - Accessing column(s)
    - Check for unique values
    - Rename column
    - Deleting column(s)
    - Creating new column(s)
- Basic ops on Rows
    - Implicit/Explicit index
    - Indexing in Series
    - Slicing in Series
        - loc/iloc
    - Indexing/Slicing in DataFrame


---

## **Introduction to Pandas**

### Pandas Installation

In [None]:
# !pip install pandas

### Importing Pandas

- You should be able to import Pandas after installing it.
- We'll import `pandas` using its **alias name `pd`**.

In [3]:
import pandas as pd
import numpy as np

### Why use Pandas?

- A major **limitation of Numpy** is that an array holds a single datatype: every element must be of the same type.
- Most real-world datasets contain a mix of different datatypes.
  - **names of a place would be string**
  - **population of a place would be int**
  
It is difficult to work with data having **heterogeneous values** using Numpy.

On the other hand, Pandas can work with numbers and strings together.
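A minimal sketch of the difference. The place names and populations below are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
mixed = np.array(["Springfield", 30000])

# A Numpy array has one dtype, so the integer is silently converted to a string
print(mixed.dtype.kind)   # 'U' (unicode string)
print(repr(mixed[1]))

# A DataFrame keeps a separate dtype for each column
places = pd.DataFrame({"name": ["Springfield", "Shelbyville"],
                       "population": [30000, 25000]})
print(places.dtypes)
```

The `population` column stays numeric, so arithmetic on it still works, while `name` holds text.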


### Problem Statement

- Imagine that you are a Data Scientist with McKinsey.
- McKinsey wants to understand the relation between GDP per capita and life expectancy for their clients.
- The company has obtained data from various surveys conducted in different countries over several years.
- The acquired data includes information on
  - Country
  - Population Size
  - Life Expectancy
  - GDP per Capita
- We have to analyse the data and draw inferences that are meaningful to the company.

### Loading the dataset

Dataset: https://drive.google.com/file/d/1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_/view?usp=sharing

In [1]:
!wget "https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_" -O mckinsey.csv

--2024-09-08 21:56:48--  https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_
Resolving drive.google.com (drive.google.com)... 142.251.107.113, 142.251.107.102, 142.251.107.139, ...
Connecting to drive.google.com (drive.google.com)|142.251.107.113|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_&export=download [following]
--2024-09-08 21:56:48--  https://drive.usercontent.google.com/download?id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.139.132, 2607:f8b0:400c:c05::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.139.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83785 (82K) [application/octet-stream]
Saving to: ‘mckinsey.csv’


2024-09-08 21:56:51 (61.7 MB/s) - ‘mckinsey.csv’ saved [83785/83

**Now how should we read this dataset?**

Pandas makes it very easy to work with these kinds of files.

In [4]:
df = pd.read_csv('mckinsey.csv') # storing the data in df
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
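If the download is not available, the same call can be tried on a small in-memory CSV. This is only a sketch: the three rows are copied from the output above, and `io.StringIO` stands in for the file on disk (the layout of the real file is assumed, not verified).

```python
import io
import pandas as pd

# First three data rows from the output above, written as CSV text
csv_text = """country,year,population,continent,life_exp,gdp_cap
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,30.332,820.853030
Afghanistan,1962,10267083,Asia,31.997,853.100710
"""

# read_csv accepts a path or any file-like object
mini = pd.read_csv(io.StringIO(csv_text))
print(mini.shape)            # (3, 6)
print(list(mini.columns))
```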


---

### DataFrame and Series

**What can we observe from the above dataset?**

We can see that it has:
- 6 columns
- 1704 rows

**What do you think is the datatype of `df` ?**

In [5]:
type(df)

It is a **Pandas DataFrame**

#### What is a Pandas DataFrame?

- A DataFrame is a **table-like (structured)** representation of data in Pandas.
- It can be considered the **counterpart of a 2D matrix** in Numpy.

<img src="https://drive.google.com/uc?id=1urINAXwrx9Fg5cgm5yxtUKqcYV7_qViZ">

**How can we access a column, say `country` of the dataframe?**

In [None]:
df["country"]

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: country, Length: 1704, dtype: object

As you can see, we get all the values present in the **country** column.

**What is the data-type of a column?**

In [None]:
type(df["country"])

It is a **Pandas Series**

#### What is a Pandas Series?
- A **Series** in Pandas is what a **Vector** is in Numpy.

**What exactly does that mean?**
- It means that a Series is a **single column of data**.
- Multiple Series are stacked together to form a DataFrame.
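As a rough illustration of "stacking", the sketch below builds two small Series by hand (values taken from the dataset's 1952 rows for Afghanistan and Albania) and places them side by side with `pd.concat`. The variable names are chosen only for this example.

```python
import pandas as pd

country = pd.Series(["Afghanistan", "Albania"], name="country")
population = pd.Series([8425333, 1282697], name="population")

# axis=1 places the Series side by side as columns
stacked = pd.concat([country, population], axis=1)

print(stacked.shape)                          # (2, 2)
print(type(stacked["country"]).__name__)      # Series
print(type(stacked[["country"]]).__name__)    # DataFrame
```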

<img src="https://drive.google.com/uc?id=1y1DPzrethqr7DDomwxbDtt3ti7vTU7s8">  

In [None]:
df[['country']]
print(type(df['country'])) # <class 'pandas.core.series.Series'>
type(df[['country']]) #pandas.core.frame.DataFrame


# accessing multiple columns
df[['country', 'population']]

<class 'pandas.core.series.Series'>


Unnamed: 0,country,population
0,Afghanistan,8425333
1,Afghanistan,9240934
2,Afghanistan,10267083
3,Afghanistan,11537966
4,Afghanistan,13079460
...,...,...
1699,Zimbabwe,9216418
1700,Zimbabwe,10704340
1701,Zimbabwe,11404948
1702,Zimbabwe,11926563


Now we have understood what Series and DataFrame are.

**How can we find the datatype, name, total entries in each column?**


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     1704 non-null   object 
 1   year        1704 non-null   int64  
 2   population  1704 non-null   int64  
 3   continent   1704 non-null   object 
 4   life_exp    1704 non-null   float64
 5   gdp_cap     1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB


`df.info()` gives a list of columns with:

- **Name** of columns
- **How many non-null values** each column has (null values are blank/missing cells).
- **Type of values** in each column - int, float, etc.

In this output, the text columns (`country`, `continent`) show **Dtype** `object`, while the numeric columns show `int64` or `float64`.
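The non-null counts can also be obtained directly. The sketch below uses a toy dataframe with one missing value (made up for illustration) to show `count()` and `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy data with one missing life_exp value (made up for illustration)
toy = pd.DataFrame({"country": ["A", "B", "C"],
                    "life_exp": [28.801, np.nan, 31.997]})

non_null = toy.count()       # non-null values per column
missing = toy.isna().sum()   # missing values per column
print(non_null)
print(missing)
```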

**What if we want to see the first few rows in the dataset?**

In [6]:
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


**`df.head()` prints the top 5 rows by default.**

We can also pass in the number of rows that we want to see.


In [7]:
df.head(10)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
5,Afghanistan,1977,14880372,Asia,38.438,786.11336
6,Afghanistan,1982,12881816,Asia,39.854,978.011439
7,Afghanistan,1987,13867957,Asia,40.822,852.395945
8,Afghanistan,1992,16317921,Asia,41.674,649.341395
9,Afghanistan,1997,22227415,Asia,41.763,635.341351


Similarly, we can use **`df.tail()` if we wish to see the last few rows**.

In [8]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


**How can we find the shape of a dataframe?**

In [None]:
df.shape

(1704, 6)

Similar to Numpy, it gives the **no. of rows and columns**.
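A quick sketch on a toy dataframe (12 rows, 1 column, made up for this example) that ties `head()`, `tail()` and `shape` together:

```python
import pandas as pd

toy = pd.DataFrame({"x": range(12)})   # 12 rows, 1 column

n_rows, n_cols = toy.shape             # shape is a (rows, columns) tuple
print(n_rows, n_cols)                  # 12 1
print(len(toy))                        # 12, len() counts rows
print(toy.head(3)["x"].tolist())       # [0, 1, 2]
print(toy.tail(3)["x"].tolist())       # [9, 10, 11]
```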


### Post-read

- [DataFrame from Scratch](https://colab.research.google.com/drive/1x3ct95RtIIQTJeGbyuuYaMociVp90ww6?usp=sharing)

---

## **Basic operations on Columns**


**What operations can we do using columns?**

- Add a column
- Delete a column
- Rename a column

We can see that our dataset has 6 columns.

**How can we get the names of all these cols?**

We can do it in two ways:
1. `df.columns`
2. `df.keys()`

In [None]:
df.columns  # using attribute `columns` of dataframe

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

In [None]:
df.keys()  # using method `keys()` of dataframe

Index(['country', 'year', 'population', 'continent', 'life_exp', 'gdp_cap'], dtype='object')

**Note:**
- Here, `Index` is the Pandas class used to store the **labels** of an axis (row labels or column names) of a series/dataframe.
- It is an immutable sequence used for indexing.
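To see what "immutable" means in practice, here is a small sketch on a toy dataframe (column names chosen only for this example). Changing one element of the `Index` fails, while replacing the whole `Index` works.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "b": [2]})

try:
    toy.columns[0] = "z"        # an Index cannot be modified element by element
    changed = True
except TypeError:
    changed = False
print(changed)                  # False

toy.columns = ["z", "b"]        # replacing the whole Index is allowed
print(list(toy.columns))        # ['z', 'b']
```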

**How can we access these columns?**

In [None]:
df['country'].head()  # accessing a single column

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [None]:
df[['country', 'life_exp']].head() # accessing multiple columns

Unnamed: 0,country,life_exp
0,Afghanistan,28.801
1,Afghanistan,30.332
2,Afghanistan,31.997
3,Afghanistan,34.02
4,Afghanistan,36.088


**And what if we pass a single column name?**

In [None]:
df[['country']].head()

Unnamed: 0,country
0,Afghanistan
1,Afghanistan
2,Afghanistan
3,Afghanistan
4,Afghanistan


**Note:**
- Notice how this output type is different from our earlier output using `df['country']`
- `['country']` gives a Series while `[['country']]` gives a DataFrame.

**How can we find the countries that have been surveyed?**

We can find the unique values in the `country` column.

In [None]:
df['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong, China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea, Dem. Rep.',
       'Korea, Rep.', 'Kuwait', 'Leba

In [None]:
df['country'].nunique() # 142 - total unique countries

142

**What if you also want to count the occurrences of each country in the dataframe?**

In [None]:
df['country'].value_counts()

Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64

**Note:** `value_counts()` shows the output in **decreasing order of frequency**.
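In our dataset every country appears 12 times, so the ordering is hard to see. A sketch with made-up, uneven counts makes the three methods easier to compare:

```python
import pandas as pd

# Made-up survey rows, for illustration only
countries = pd.Series(["India", "Chad", "India", "Peru", "Chad", "India"])

print(list(countries.unique()))    # distinct values, in order of first appearance
print(countries.nunique())         # 3

counts = countries.value_counts()  # most frequent value first
print(counts.index.tolist())       # ['India', 'Chad', 'Peru']
print(counts.tolist())             # [3, 2, 1]
```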

**What if we want to change the name of a column?**

We can rename the column by
- passing the dictionary with `old_name:new_name` pair
- specifying `axis=1`

In [None]:
df.rename({"population": "Population", "country":"Country" }, axis = 1) # not inplace

Unnamed: 0,Country,year,Population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


Alternatively, we can also rename the column
- without specifying `axis`
- by using the `columns` parameter



In [9]:
df.rename(columns={"country":"Country"})

Unnamed: 0,Country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


If we try and check the original dataframe `df` -

In [None]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


We can clearly see that the column names are still the same and have not changed.

The changes don't happen in the original dataframe unless we set the `inplace` parameter to `True`.

In [None]:
df.rename({"country": "Country"}, axis = 1, inplace = True)
df

Unnamed: 0,Country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


**Note**
- `.rename` has default value of axis=0
- If two columns have the **same name**, then `df['column']` will display both columns.
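Both notes can be checked on a toy dataframe. This is a sketch with made-up values; `dup` is a hypothetical frame built only to show duplicate column names.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "population": [1, 2]})

renamed = toy.rename(columns={"country": "Country"})
print(list(toy.columns))          # ['country', 'population'], original unchanged
print(list(renamed.columns))      # ['Country', 'population']

# With axis left at its default (0), the mapping is applied to row labels
rows_renamed = toy.rename({0: "first"})
print(list(rows_renamed.index))   # ['first', 1]

# Two columns sharing a name: selecting that name returns a DataFrame
dup = pd.DataFrame([[1, 2]], columns=["x", "x"])
print(type(dup["x"]).__name__)    # DataFrame
```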

There's another way of accessing the column values.

In [None]:
df.Country

0       Afghanistan
1       Afghanistan
2       Afghanistan
3       Afghanistan
4       Afghanistan
           ...     
1699       Zimbabwe
1700       Zimbabwe
1701       Zimbabwe
1702       Zimbabwe
1703       Zimbabwe
Name: Country, Length: 1704, dtype: object

This, however, doesn't work every time.

**What do you think could be the problem here?**

- If the column names are **not valid Python identifiers**
  - Starting with **number**: e.g. `2nd`
  - Contains a **whitespace**: e.g. `Roll Number`
- If the column names conflict with **methods of the DataFrame**
  - e.g. `shape`
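A small sketch of both problems, using a toy dataframe whose column names are chosen to trigger them:

```python
import pandas as pd

toy = pd.DataFrame({"Roll Number": [1, 2],
                    "shape": ["round", "square"]})

# toy.Roll Number would be a SyntaxError, so bracket access is the only option
print(toy["Roll Number"].tolist())   # [1, 2]

# toy.shape is the DataFrame attribute, not the column
print(toy.shape)                     # (2, 2)
print(toy["shape"].tolist())         # ['round', 'square']
```

Bracket access works for every column name, which is why it is the safer default.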

We already know the continents in which each country lies.

So we probably don't need this column.

**How can we delete columns from a dataframe?**


In [None]:
df.drop('continent', axis=1)

Unnamed: 0,Country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623


Here we passed two arguments to `drop()`:
- the label to drop (here, a column name)
- `axis`

By default, the value of `axis` is 0, which refers to rows; `axis=1` refers to columns.

An alternative to the above approach is using the "columns" parameter as we did in `rename()`.

In [None]:
df.drop(columns=['continent'])

Unnamed: 0,Country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623


As you can see, the column `continent` is dropped.

**Has the column been permanently deleted?**

In [None]:
df.head()

Unnamed: 0,Country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


No, the column `continent` is still there in the original dataframe.

**Do you see what's happening here?**

`drop()` returned a **new dataframe** without the column `continent`; the original `df` was left unchanged.

**How can we permanently drop the column?**

- We can either **re-assign** it `df = df.drop('continent', axis=1)`   
- Or we can **set the parameter `inplace=True`**
  - By default, `inplace=False`.

In [None]:
df.drop('continent', axis=1, inplace=True)
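Note that running a drop cell a second time raises a `KeyError`, because the column is already gone. The sketch below, on a toy dataframe with made-up values, shows the re-assignment style and the `errors="ignore"` option of `drop()` that makes the call safe to repeat.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"],
                    "continent": ["Asia", "Africa"],
                    "year": [1952, 1957]})

without = toy.drop(columns=["continent"])    # returns a new DataFrame
print("continent" in toy.columns)            # True, toy is unchanged

toy = toy.drop(columns=["continent"])        # re-assign to keep the result
print(list(toy.columns))                     # ['country', 'year']

# Dropping a missing column raises KeyError unless errors="ignore" is passed
toy = toy.drop(columns=["continent"], errors="ignore")
print(list(toy.columns))                     # ['country', 'year']
```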

**What if we want to create a new column?**

- We can either use values from **existing columns**.
- Or we can create our own values.

**How to create a column using values from an existing column?**

In [10]:
df["year+7"] = df["year"] + 7
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964
2,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969
3,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974
4,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979


As we see, a new column `year+7` is created from the column `year`.

We can also use values from two columns to form a new column.

**Which two columns can we use to create a new column `gdp`?**

In [11]:
df['gdp'] = df['gdp_cap'] * df['population']
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
1,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
2,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
3,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0
4,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9678553000.0


As you can see
- An additional column has been created.
- Values in this column are **product of respective values in `gdp_cap` and `population` columns**.

**What other operations can we use?**

- Addition
- Subtraction
- Division
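These operations all work element-wise, row by row. A short sketch using two rows from the dataset output above; the new column names `pop_millions` and `pop_change` are made up for this example.

```python
import pandas as pd

# Two rows taken from the dataset output above
toy = pd.DataFrame({"population": [8425333, 9240934],
                    "gdp_cap": [779.445314, 820.853030]})

toy["gdp"] = toy["gdp_cap"] * toy["population"]                    # multiplication
toy["pop_millions"] = toy["population"] / 1_000_000                # division by a scalar
toy["pop_change"] = toy["population"] - toy["population"].iloc[0]  # subtraction
print(toy)
```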

**How can we create a new column from our own values?**

- We can either **create a list**.
- Or we can **create a Pandas Series** from a list/numpy array for our new column.
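The two options behave differently when the dataframe's index is not `0, 1, 2, ...`. A sketch on a toy dataframe (labels 10, 20, 30 chosen only for illustration): a list is assigned by position, while a Series is matched on index labels.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C"]}, index=[10, 20, 30])

toy["from_list"] = [1, 2, 3]                  # a list is assigned by position

# A Series is aligned on index labels; labels 0, 1, 2 match nothing here
toy["unaligned"] = pd.Series([1, 2, 3])
toy["aligned"] = pd.Series([1, 2, 3], index=toy.index)
print(toy)

# A list of the wrong length is rejected
try:
    toy["too_short"] = [1, 2]
    rejected = False
except ValueError:
    rejected = True
print(rejected)                               # True
```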

In [None]:
df["Own"] = list(range(len(df)))  # the list length must equal the number of rows
df

Unnamed: 0,Country,year,population,life_exp,gdp_cap,year+7,gdp,Own
0,Afghanistan,1952,8425333,28.801,779.445314,1959,6.567086e+09,0
1,Afghanistan,1957,9240934,30.332,820.853030,1964,7.585449e+09,1
2,Afghanistan,1962,10267083,31.997,853.100710,1969,8.758856e+09,2
3,Afghanistan,1967,11537966,34.020,836.197138,1974,9.648014e+09,3
4,Afghanistan,1972,13079460,36.088,739.981106,1979,9.678553e+09,4
...,...,...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306,1994,6.508241e+09,1699
1700,Zimbabwe,1992,10704340,60.377,693.420786,1999,7.422612e+09,1700
1701,Zimbabwe,1997,11404948,46.809,792.449960,2004,9.037851e+09,1701
1702,Zimbabwe,2002,11926563,39.989,672.038623,2009,8.015111e+09,1702


Before we move to ops on rows, let's drop the newly created columns.

In [None]:
df.drop(columns=["Own", "gdp", "year+7"], inplace=True)  # `columns=` already implies axis=1
df

Unnamed: 0,Country,year,population,life_exp,gdp_cap
0,Afghanistan,1952,8425333,28.801,779.445314
1,Afghanistan,1957,9240934,30.332,820.853030
2,Afghanistan,1962,10267083,31.997,853.100710
3,Afghanistan,1967,11537966,34.020,836.197138
4,Afghanistan,1972,13079460,36.088,739.981106
...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,62.351,706.157306
1700,Zimbabwe,1992,10704340,60.377,693.420786
1701,Zimbabwe,1997,11404948,46.809,792.449960
1702,Zimbabwe,2002,11926563,39.989,672.038623


---

## **Basic operations on Rows**




In [12]:
# Accessing dataframe rows

df.index.values

array([   0,    1,    2, ..., 1701, 1702, 1703])

**Just like columns, do rows also have labels? Yes.**

- **Can we change row labels (like we did for columns)?**
- **What if we want to start indexing from 1 (instead of 0)?**

In [13]:
df.index = list(range(1, df.shape[0]+1)) # create a list of indices of same length
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
1,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6.567086e+09
2,Afghanistan,1957,9240934,Asia,30.332,820.853030,1964,7.585449e+09
3,Afghanistan,1962,10267083,Asia,31.997,853.100710,1969,8.758856e+09
4,Afghanistan,1967,11537966,Asia,34.020,836.197138,1974,9.648014e+09
5,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9.678553e+09
...,...,...,...,...,...,...,...,...
1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306,1994,6.508241e+09
1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786,1999,7.422612e+09
1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960,2004,9.037851e+09
1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623,2009,8.015111e+09


As you can see, the indexing now starts from 1 instead of 0.

### Explicit & Implicit Indices

**What are these row labels/indices exactly?**
  
- They can be called identifiers of a particular row.
- Specifically known as **explicit indices**.

**Additionally, can a series/dataframe also use Python style indexing? Yes.**

- The Python style indices are known as **implicit indices**.

**How can we access explicit index of a particular row?**
- using `df.index[]`
- Takes the **implicit index** of a row and gives its **explicit index**.

In [14]:
df.index[1] # implicit index 1 gave explicit index 2

2

**But why not use just implicit indexing?**

Explicit indices can be changed to any value of any datatype.
- e.g. explicit index of 1st row can be changed to `first`
- Or something like a floating point value, say `1.0`

In [15]:
df.index = np.arange(1, df.shape[0]+1, dtype='float')
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
1.0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6.567086e+09
2.0,Afghanistan,1957,9240934,Asia,30.332,820.853030,1964,7.585449e+09
3.0,Afghanistan,1962,10267083,Asia,31.997,853.100710,1969,8.758856e+09
4.0,Afghanistan,1967,11537966,Asia,34.020,836.197138,1974,9.648014e+09
5.0,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9.678553e+09
...,...,...,...,...,...,...,...,...
1700.0,Zimbabwe,1987,9216418,Africa,62.351,706.157306,1994,6.508241e+09
1701.0,Zimbabwe,1992,10704340,Africa,60.377,693.420786,1999,7.422612e+09
1702.0,Zimbabwe,1997,11404948,Africa,46.809,792.449960,2004,9.037851e+09
1703.0,Zimbabwe,2002,11926563,Africa,39.989,672.038623,2009,8.015111e+09


As we can see, the indices are now floating point values.

Now to understand string indices, let's take a small subset of our original dataframe.

In [16]:
sample = df.head()
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
1.0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
2.0,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
3.0,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
4.0,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0
5.0,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9678553000.0


**What if we want to use string indices?**

In [22]:
sample.index = ['a', 'b', 'c', 'd', 'e']
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
a,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
b,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
c,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
d,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0
e,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9678553000.0


This shows us that we can use almost anything as our explicit index.
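The mapping between the two kinds of index can be looked up in both directions. A minimal sketch on a toy dataframe with made-up string labels:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C"]})
toy.index = ["first", "second", "third"]   # explicit (label) index

print(toy.index[1])                  # 'second': implicit position -> explicit label
print(toy.index.get_loc("second"))   # 1: explicit label -> implicit position
```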

Now, let's reset our indices back to integers.

In [23]:
df.index = np.arange(1, df.shape[0]+1, dtype='int')

**What if we want to access any particular row (say first row)?**

Let's first see for one column.

Later, we can generalise the same for the entire dataframe.

In [63]:
df.rename({ "country": "Country"}, axis=1, inplace=True)
ser = df["Country"]
ser.head(20)

Unnamed: 0,Country
1,Afghanistan
2,Afghanistan
3,Afghanistan
4,Afghanistan
5,Afghanistan
6,Afghanistan
7,Afghanistan
8,Afghanistan
9,Afghanistan
10,Afghanistan


We can simply use its indices much like we do in a Numpy array.

**So, how will we then access the 13th element?**

In [38]:
ser[13] # Indexing in series uses explicit indices

'Albania'

**What about accessing a subset of rows (say 6th to 15th)?**

In [39]:
ser[5:15] # slicing in series uses implicit indices

Unnamed: 0,country
6,Afghanistan
7,Afghanistan
8,Afghanistan
9,Afghanistan
10,Afghanistan
11,Afghanistan
12,Afghanistan
13,Albania
14,Albania
15,Albania


This is known as `Slicing`.

Notice something different though?

- **Indexing in Series** used **explicit indices**
- **Slicing** however used **implicit indices**

Let's try the same for the dataframe.

**How can we access a row in a dataframe?**

In [52]:
df.index
# df[1]  # raises KeyError: df[x] looks for a column named x, not a row

Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
       ...
       1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703, 1704],
      dtype='int64', length=1704)

Notice that this syntax is exactly the same as the one we used for accessing a column.

- `df[x]` looks for column with name `x`

**How can we access a slice of rows in the dataframe?**

In [53]:
df[5:15]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
6,Afghanistan,1977,14880372,Asia,38.438,786.11336,1984,11697660000.0
7,Afghanistan,1982,12881816,Asia,39.854,978.011439,1989,12598560000.0
8,Afghanistan,1987,13867957,Asia,40.822,852.395945,1994,11820990000.0
9,Afghanistan,1992,16317921,Asia,41.674,649.341395,1999,10595900000.0
10,Afghanistan,1997,22227415,Asia,41.763,635.341351,2004,14122000000.0
11,Afghanistan,2002,25268405,Asia,42.129,726.734055,2009,18363410000.0
12,Afghanistan,2007,31889923,Asia,43.828,974.580338,2014,31079290000.0
13,Albania,1952,1282697,Europe,55.23,1601.056136,1959,2053670000.0
14,Albania,1957,1476505,Europe,59.28,1942.284244,1964,2867792000.0
15,Albania,1962,1728137,Europe,64.82,2312.888958,1969,3996989000.0


Woah, so the slicing works.

`df[x]` looks for a column, but `df[a:b]` slices rows, and here it did so by implicit position. This can be a cause for confusion.

To avoid this, Pandas provides special indexers, `loc` and `iloc`.

## **loc and iloc**

### **1. loc**

- Allows indexing and slicing that always references the explicit index.

In [56]:
df.loc[1]

Unnamed: 0,1
country,Afghanistan
year,1952
population,8425333
continent,Asia
life_exp,28.801
gdp_cap,779.445314
year+7,1959
gdp,6567086329.952229


In [55]:
df.loc[1:3]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
1,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
2,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
3,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0


Did you notice something strange here?

- The **range is inclusive** of **end point** for `loc`.
- **Row with label 3** is **included** in the result.


### **2. iloc**

- Allows indexing and slicing that always references the implicit index.

In [None]:
df.iloc[1]

Country       Afghanistan
year                 1957
population        9240934
life_exp           30.332
gdp_cap         820.85303
Name: 2, dtype: object

**Will `iloc` also consider the range inclusive?**

In [57]:
df.iloc[0:2]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
1,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
2,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0


No, because **`iloc` works with implicit Python-style indices**.





**Which one should we use?**
- Generally, explicit indexing is considered to be better than implicit indexing.
- But it is recommended to always use `loc` or `iloc` explicitly (rather than plain `[]`) to avoid any confusion.
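A side-by-side sketch on a toy Series whose labels start at 1, like our dataframe's index, so that label and position differ:

```python
import pandas as pd

ser = pd.Series(["a", "b", "c", "d", "e"], index=[1, 2, 3, 4, 5])

print(ser.loc[2])               # 'b': the row labelled 2
print(ser.iloc[2])              # 'c': the row at position 2
print(ser.loc[1:3].tolist())    # ['a', 'b', 'c']: labels 1 to 3, end included
print(ser.iloc[1:3].tolist())   # ['b', 'c']: positions 1 and 2, end excluded
```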

**What if we want to access multiple non-consecutive rows at same time?**

In [58]:
df.iloc[[1, 10, 100]]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap,year+7,gdp
2,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
11,Afghanistan,2002,25268405,Asia,42.129,726.734055,2009,18363410000.0
101,Bangladesh,1972,70759295,Asia,45.252,630.233627,1979,44594890000.0


We can just **pack the indices in `[]`** and pass them to `iloc` (as positions) or `loc` (as labels).
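A sketch of the same selection done both ways on a toy dataframe (made-up values, labels starting at 1):

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C", "D"],
                    "year": [1952, 1957, 1962, 1967]},
                   index=[1, 2, 3, 4])

by_position = toy.iloc[[0, 3]]        # first and last rows by position
by_label = toy.loc[[1, 4]]            # the same rows by label
print(by_position.equals(by_label))   # True

# Row and column lists can be combined
print(toy.loc[[1, 4], ["country"]])
```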

**What about negative index? Which would work between `iloc` and `loc`?**

In [70]:
df.iloc[-1]

# Works and gives last row in dataframe

# select 30th to 40th rows and last 3 cols
df.iloc[29:40,-3:]


Unnamed: 0,gdp_cap,year+7,gdp
30,4910.416756,1984,84227420000.0
31,5745.160213,1989,115097100000.0
32,5681.358539,1994,132119700000.0
33,5023.216647,1999,132102400000.0
34,4797.295051,2004,139467000000.0
35,5288.040382,2009,165447700000.0
36,6223.367465,2014,207444900000.0
37,3520.610273,1959,14899560000.0
38,3827.940465,1964,17460620000.0
39,4269.276742,1969,20603590000.0


In [60]:
df.loc[-1]

# Does not work

KeyError: -1

**So, why did `iloc[-1]` work, but `loc[-1]` didn't?**

- Because **`iloc` works with positional indices, while `loc` with assigned labels**.
- `[-1]` here points to the **row at last position** in `iloc`.
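The sketch below reproduces this on a toy dataframe, and also shows that `loc[-1]` does work when a row really is labelled `-1`. The `relabelled` frame is hypothetical and exists only for this example.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C"]}, index=[1, 2, 3])

print(toy.iloc[-1]["country"])        # 'C': last position

try:
    toy.loc[-1]
    found = True
except KeyError:
    found = False
print(found)                          # False: no row is labelled -1

# If a row is labelled -1, loc[-1] finds it
relabelled = toy.copy()
relabelled.index = [-1, 0, 1]
print(relabelled.loc[-1, "country"])  # 'A'
```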


**Can we use one of the columns as row index?**

In [65]:
temp = df.set_index("Country")
temp

Unnamed: 0_level_0,year,population,continent,life_exp,gdp_cap,year+7,gdp
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6.567086e+09
Afghanistan,1957,9240934,Asia,30.332,820.853030,1964,7.585449e+09
Afghanistan,1962,10267083,Asia,31.997,853.100710,1969,8.758856e+09
Afghanistan,1967,11537966,Asia,34.020,836.197138,1974,9.648014e+09
Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9.678553e+09
...,...,...,...,...,...,...,...
Zimbabwe,1987,9216418,Africa,62.351,706.157306,1994,6.508241e+09
Zimbabwe,1992,10704340,Africa,60.377,693.420786,1999,7.422612e+09
Zimbabwe,1997,11404948,Africa,46.809,792.449960,2004,9.037851e+09
Zimbabwe,2002,11926563,Africa,39.989,672.038623,2009,8.015111e+09


**Note:**
`set_index()` removes the column being used as the new index by default (`drop=True`). Pass `drop=False` to also keep it as a regular column.

**Now what would the row corresponding to index `Afghanistan` give?**

In [66]:
temp.loc['Afghanistan']

Unnamed: 0_level_0,year,population,continent,life_exp,gdp_cap,year+7,gdp
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0
Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9678553000.0
Afghanistan,1977,14880372,Asia,38.438,786.11336,1984,11697660000.0
Afghanistan,1982,12881816,Asia,39.854,978.011439,1989,12598560000.0
Afghanistan,1987,13867957,Asia,40.822,852.395945,1994,11820990000.0
Afghanistan,1992,16317921,Asia,41.674,649.341395,1999,10595900000.0
Afghanistan,1997,22227415,Asia,41.763,635.341351,2004,14122000000.0


As you can see, we got all the rows having index `Afghanistan`.

Generally, it is advisable to keep unique indices. But it also depends on the use-case.
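Whether an index has repeated labels can be checked with `index.is_unique`. A sketch on three rows in the style of our dataset (only the columns needed for the example):

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["Afghanistan", "Afghanistan", "Albania"],
                    "year": [1952, 1957, 1952]})
temp = toy.set_index("Country")

print("Country" in temp.columns)       # False: the column became the index
print(temp.index.is_unique)            # False: 'Afghanistan' appears twice

matches = temp.loc["Afghanistan"]      # a repeated label returns every matching row
print(matches["year"].tolist())        # [1952, 1957]
```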

**How can we reset our indices back to integers?**

In [67]:
df.reset_index()

Unnamed: 0,index,Country,year,population,continent,life_exp,gdp_cap,year+7,gdp
0,1,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6.567086e+09
1,2,Afghanistan,1957,9240934,Asia,30.332,820.853030,1964,7.585449e+09
2,3,Afghanistan,1962,10267083,Asia,31.997,853.100710,1969,8.758856e+09
3,4,Afghanistan,1967,11537966,Asia,34.020,836.197138,1974,9.648014e+09
4,5,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9.678553e+09
...,...,...,...,...,...,...,...,...,...
1699,1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306,1994,6.508241e+09
1700,1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786,1999,7.422612e+09
1701,1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960,2004,9.037851e+09
1702,1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623,2009,8.015111e+09


Notice that it's creating a new column `index`.

**How can we reset our index without creating this new column?**

In [69]:
df.reset_index(drop=True) # by using drop=True we can prevent creation of a new column

Unnamed: 0,Country,year,population,continent,life_exp,gdp_cap,year+7,gdp
0,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6.567086e+09
1,Afghanistan,1957,9240934,Asia,30.332,820.853030,1964,7.585449e+09
2,Afghanistan,1962,10267083,Asia,31.997,853.100710,1969,8.758856e+09
3,Afghanistan,1967,11537966,Asia,34.020,836.197138,1974,9.648014e+09
4,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9.678553e+09
...,...,...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306,1994,6.508241e+09
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786,1999,7.422612e+09
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960,2004,9.037851e+09
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623,2009,8.015111e+09


Great!

Now let's do this in place.

In [None]:
df.reset_index(drop=True, inplace=True)
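The effect of `drop=True` is easier to see on a toy dataframe (made up for this example) with labels starting at 1:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C"]}, index=[1, 2, 3])

kept = toy.reset_index()             # old labels move into a column named 'index'
print(list(kept.columns))            # ['index', 'country']

fresh = toy.reset_index(drop=True)   # old labels are discarded
print(list(fresh.columns))           # ['country']
print(fresh.index.tolist())          # [0, 1, 2]
```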

In [71]:
x = pd.Series(['a', 'b', 'c'], index=[1,2,2])
print(x[2]) # print both b and c

2    b
2    c
dtype: object


In [76]:
# drop the rows labelled 10 to 15 (np.arange excludes the end point 16)

df.drop(np.arange(10,16), axis=0, inplace=True)
df.head(15)

Unnamed: 0,Country,year,population,continent,life_exp,gdp_cap,year+7,gdp
1,Afghanistan,1952,8425333,Asia,28.801,779.445314,1959,6567086000.0
2,Afghanistan,1957,9240934,Asia,30.332,820.85303,1964,7585449000.0
3,Afghanistan,1962,10267083,Asia,31.997,853.10071,1969,8758856000.0
4,Afghanistan,1967,11537966,Asia,34.02,836.197138,1974,9648014000.0
5,Afghanistan,1972,13079460,Asia,36.088,739.981106,1979,9678553000.0
6,Afghanistan,1977,14880372,Asia,38.438,786.11336,1984,11697660000.0
7,Afghanistan,1982,12881816,Asia,39.854,978.011439,1989,12598560000.0
8,Afghanistan,1987,13867957,Asia,40.822,852.395945,1994,11820990000.0
9,Afghanistan,1992,16317921,Asia,41.674,649.341395,1999,10595900000.0
16,Albania,1967,1984060,Europe,66.22,2760.196931,1974,5476396000.0
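`drop()` works with labels (explicit indices), not positions. A sketch on a toy dataframe showing both label-based dropping and one possible way to drop by position, by first looking up the labels through `df.index`:

```python
import pandas as pd

toy = pd.DataFrame({"country": list("ABCDEF")}, index=[1, 2, 3, 4, 5, 6])

by_label = toy.drop([2, 3])                 # drop() takes labels; axis=0 is the default
print(by_label.index.tolist())              # [1, 4, 5, 6]

# To drop by position, look up the labels at those positions first
by_position = toy.drop(toy.index[[0, 1]])
print(by_position.index.tolist())           # [3, 4, 5, 6]
```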


---