## Content

 - **Basic ops on rows**
    - Implicit/explicit index
    - df.index
    - Indexing in series
    - Slicing in series
    - loc/iloc      
    - Indexing/Slicing in dataframe
    - Adding a row
    - Deleting a row
    - Check for duplicates



## Working with Rows




In [1]:
import pandas as pd
import numpy as np

In [None]:
# !wget "https://drive.google.com/uc?export=download&id=1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_" -O mckinsey.csv

In [2]:
df=pd.read_csv('../mckinsey.csv')
df.head()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.85303
2,Afghanistan,1962,10267083,Asia,31.997,853.10071
3,Afghanistan,1967,11537966,Asia,34.02,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106


#### Just like columns, do rows also have labels?

**YES**

Notice the indexes shown in bold against each row

Let's see how we can access these indexes

In [3]:
df.index

RangeIndex(start=0, stop=1704, step=1)

In [4]:
df.index.values

array([   0,    1,    2, ..., 1701, 1702, 1703])

#### Can we change row labels (like we did for columns)?

What if we want to start indexing from 1 (instead of 0)?

In [5]:
df.index = list(range(1, df.shape[0]+1)) # create a list of indexes of same length
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.853030
3,Afghanistan,1962,10267083,Asia,31.997,853.100710
4,Afghanistan,1967,11537966,Asia,34.020,836.197138
5,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623


As you can see the indexing is now starting from 1 instead of 0.


### Explicit and Implicit Indices


#### What are these row labels/indices exactly ?
  
- They can be called identifiers of a particular row
  
- Specifically known as **explicit indices**

#### Additionally, can series/dataframes also use Python-style indexing?

**YES**

The Python-style indices (positions 0, 1, 2, ...) are known as **implicit indices**


#### How can we access the explicit index of a particular row?
  - Using `df.index[]`
  - It takes the **implicit index** of a row and gives its explicit index


In [6]:
df.index[1] #Implicit index 1 gave explicit index 2

2
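The mapping between the two kinds of index can be sketched on a tiny frame. This is only an illustration: `toy` and its values are made up and stand in for the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({"country": ["A", "B", "C"], "year": [1952, 1957, 1962]})
toy.index = [10, 20, 30]                       # explicit labels

label_at_position_1 = toy.index[1]             # implicit position 1 -> explicit label 20
position_of_label_30 = toy.index.get_loc(30)   # explicit label 30 -> implicit position 2
```

`Index.get_loc` goes the other way round from `df.index[]`: it takes a label and returns its position (a single integer here because the labels are unique).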

#### But why not use just implicit indexing ?

Explicit indices can be changed to any value of any datatype
  - Eg: Explicit Index of 1st row can be changed to `First`
  - Or, something like a floating point value, say `1.0`



In [7]:
df.index = np.arange(1, df.shape[0]+1, dtype='float')
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1.0,Afghanistan,1952,8425333,Asia,28.801,779.445314
2.0,Afghanistan,1957,9240934,Asia,30.332,820.853030
3.0,Afghanistan,1962,10267083,Asia,31.997,853.100710
4.0,Afghanistan,1967,11537966,Asia,34.020,836.197138
5.0,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1700.0,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1701.0,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1702.0,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1703.0,Zimbabwe,2002,11926563,Africa,39.989,672.038623


As we can see, the indices are floating point values now

Now to understand string indices, let's take a small subset of our original dataframe


In [8]:
sample = df.head()
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1.0,Afghanistan,1952,8425333,Asia,28.801,779.445314
2.0,Afghanistan,1957,9240934,Asia,30.332,820.85303
3.0,Afghanistan,1962,10267083,Asia,31.997,853.10071
4.0,Afghanistan,1967,11537966,Asia,34.02,836.197138
5.0,Afghanistan,1972,13079460,Asia,36.088,739.981106


#### Now what if we want to use string indices?

In [9]:
sample.index = ['a', 'b', 'c', 'd', 'e']
sample

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
a,Afghanistan,1952,8425333,Asia,28.801,779.445314
b,Afghanistan,1957,9240934,Asia,30.332,820.85303
c,Afghanistan,1962,10267083,Asia,31.997,853.10071
d,Afghanistan,1967,11537966,Asia,34.02,836.197138
e,Afghanistan,1972,13079460,Asia,36.088,739.981106


This shows us we can use almost anything as our explicit index

Now let's reset our indices back to integers

In [10]:
df.index = np.arange(1, df.shape[0]+1, dtype='int')

#### What if we want to access any particular row (say first row)?

Let's first see for one column

Later, we can generalise the same for the entire dataframe

In [11]:
ser = df["country"]
ser.head(20)

1     Afghanistan
2     Afghanistan
3     Afghanistan
4     Afghanistan
5     Afghanistan
6     Afghanistan
7     Afghanistan
8     Afghanistan
9     Afghanistan
10    Afghanistan
11    Afghanistan
12    Afghanistan
13        Albania
14        Albania
15        Albania
16        Albania
17        Albania
18        Albania
19        Albania
20        Albania
Name: country, dtype: object

We can simply use its indices much like we do in a numpy array

So, how will we then access the thirteenth element (or say thirteenth row)?

In [12]:
ser[12]

'Afghanistan'

#### And what about accessing a subset of rows (say 6th:15th) ?

In [13]:
ser[5:15]

6     Afghanistan
7     Afghanistan
8     Afghanistan
9     Afghanistan
10    Afghanistan
11    Afghanistan
12    Afghanistan
13        Albania
14        Albania
15        Albania
Name: country, dtype: object

This is known as slicing

#### Notice something different though?

- **Indexing in Series** used **explicit indices**
- **Slicing** however used **implicit indices**
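The same contrast can be reproduced on a small Series. This is a sketch that assumes an integer index starting at 1, as in the lecture; `ser_toy` and its values are made up.

```python
import pandas as pd

# Hypothetical Series with explicit integer labels 1..5
ser_toy = pd.Series(["a", "b", "c", "d", "e"], index=[1, 2, 3, 4, 5])

by_label = ser_toy[2]     # integer key on an integer index is looked up as a label
by_slice = ser_toy[1:3]   # integer slice is positional: positions 1 and 2

print(by_label)              # element with label 2
print(list(by_slice.index))  # labels of the rows at positions 1 and 2
```

Label `2` sits at position 1, so `ser_toy[2]` and `ser_toy[1:3]` start from the same element even though the numbers differ.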

Let's try the same for the dataframe now

#### So how can we access a row in a dataframe?

In [14]:
df[0]

KeyError: 0

Notice that this syntax is exactly the same as the one we used to access a column

===> `df[x]` looks for column with name `x`

#### How can we access a slice of rows in the dataframe?

In [15]:
df[5:15]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
6,Afghanistan,1977,14880372,Asia,38.438,786.11336
7,Afghanistan,1982,12881816,Asia,39.854,978.011439
8,Afghanistan,1987,13867957,Asia,40.822,852.395945
9,Afghanistan,1992,16317921,Asia,41.674,649.341395
10,Afghanistan,1997,22227415,Asia,41.763,635.341351
11,Afghanistan,2002,25268405,Asia,42.129,726.734055
12,Afghanistan,2007,31889923,Asia,43.828,974.580338
13,Albania,1952,1282697,Europe,55.23,1601.056136
14,Albania,1957,1476505,Europe,59.28,1942.284244
15,Albania,1962,1728137,Europe,64.82,2312.888958


Woah, so the slicing works

===> Indexing with `df[x]` looks only for an explicit column label, never for a row \
===> Slicing with `df[a:b]`, however, selects rows by implicit (positional) indices

This can be a cause for confusion

To avoid this, pandas provides special indexers, `loc` and `iloc`

Let's look at them one by one

### loc and iloc

#### **1. loc**

Allows indexing and slicing that always references the explicit index

In [16]:
df.loc[1]

country       Afghanistan
year                 1952
population        8425333
continent            Asia
life_exp           28.801
gdp_cap        779.445314
Name: 1, dtype: object

In [17]:
df.loc[1:3]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
3,Afghanistan,1962,10267083,Asia,31.997,853.10071


#### Did you notice something strange here?

- The **range is inclusive** of **end point** for `loc`

- **Row with Label 3** is **included** in the result


#### **2. iloc**

Allows indexing and slicing that always references the implicit Python-style index

In [18]:
df.iloc[1]

country       Afghanistan
year                 1957
population        9240934
continent            Asia
life_exp           30.332
gdp_cap         820.85303
Name: 2, dtype: object

#### Now will `iloc` also consider the range inclusive?

In [19]:
df.iloc[0:2]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
2,Afghanistan,1957,9240934,Asia,30.332,820.85303


**NO**

Because **`iloc` works with implicit Python-style indices**
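A minimal sketch of the end-point difference, assuming a made-up frame `toy` whose explicit labels start at 1:

```python
import pandas as pd

# Hypothetical toy frame with labels 1..4 (positions 0..3)
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967]}, index=[1, 2, 3, 4])

label_slice = toy.loc[1:3]       # labels 1, 2, 3: the end label is included
position_slice = toy.iloc[1:3]   # positions 1, 2: the end position is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```

The two slices use the same numbers but return a different number of rows, which is why it helps to be explicit about which indexer is meant.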





#### It is important to know about these conceptual differences

Not just between `loc` and `iloc`, but in general while working in DS and ML

#### Which one should we use?
  - Generally explicit indexing is considered to be better than implicit
  - But it is recommended to always state your intent with `loc` or `iloc`, rather than rely on plain `[]`, to avoid any confusion

#### What if we want to access multiple non-consecutive rows at the same time?

For example: rows 1, 10, 100


In [20]:
df.iloc[[1, 10, 100]]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
2,Afghanistan,1957,9240934,Asia,30.332,820.85303
11,Afghanistan,2002,25268405,Asia,42.129,726.734055
101,Bangladesh,1972,70759295,Asia,45.252,630.233627


As we can see, we can just **pack the indices in `[]`** and pass it to `loc` or `iloc`

In [21]:
df.loc[[1, 10, 100]]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1,Afghanistan,1952,8425333,Asia,28.801,779.445314
10,Afghanistan,1997,22227415,Asia,41.763,635.341351
100,Bangladesh,1967,62821884,Asia,43.453,721.186086


#### What about negative index?

#### Which would work between `iloc` and `loc`?

In [22]:
df.iloc[-1]

# Works and gives last row in dataframe

country         Zimbabwe
year                2007
population      12311143
continent         Africa
life_exp          43.487
gdp_cap       469.709298
Name: 1704, dtype: object

In [24]:
# df.loc[-1]

# # Does NOT work

#### So, why did `iloc[-1]` work, but `loc[-1]` didn't?

- Because **`iloc` works with positional indices, while `loc` with assigned labels**
- [-1] here points to the **row at last position** in iloc
- `loc[-1]` looks for a row labelled `-1`; no such label exists, so it raises a `KeyError`
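The behaviour can be checked on a small frame. This is a sketch with a made-up frame `toy`; the `try`/`except` is only there so the cell runs to the end.

```python
import pandas as pd

# Hypothetical toy frame with labels 1..3
toy = pd.DataFrame({"year": [1952, 1957, 1962]}, index=[1, 2, 3])

last_row = toy.iloc[-1]      # last position, whatever its label is

try:
    toy.loc[-1]              # searches for the label -1, which is not in the index
    loc_failed = False
except KeyError:
    loc_failed = True

print(last_row["year"], loc_failed)
```

If the index did contain the label `-1`, `loc[-1]` would return that row, not the last one.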


#### Can we use one of the columns as row index?

In [25]:
temp = df.set_index("country")
temp

Unnamed: 0_level_0,year,population,continent,life_exp,gdp_cap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,30.332,820.853030
Afghanistan,1962,10267083,Asia,31.997,853.100710
Afghanistan,1967,11537966,Asia,34.020,836.197138
Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...
Zimbabwe,1987,9216418,Africa,62.351,706.157306
Zimbabwe,1992,10704340,Africa,60.377,693.420786
Zimbabwe,1997,11404948,Africa,46.809,792.449960
Zimbabwe,2002,11926563,Africa,39.989,672.038623


#### Now what would the row corresponding to index `Afghanistan` give?

In [26]:
temp.loc['Afghanistan']

Unnamed: 0_level_0,year,population,continent,life_exp,gdp_cap
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,1952,8425333,Asia,28.801,779.445314
Afghanistan,1957,9240934,Asia,30.332,820.85303
Afghanistan,1962,10267083,Asia,31.997,853.10071
Afghanistan,1967,11537966,Asia,34.02,836.197138
Afghanistan,1972,13079460,Asia,36.088,739.981106
Afghanistan,1977,14880372,Asia,38.438,786.11336
Afghanistan,1982,12881816,Asia,39.854,978.011439
Afghanistan,1987,13867957,Asia,40.822,852.395945
Afghanistan,1992,16317921,Asia,41.674,649.341395
Afghanistan,1997,22227415,Asia,41.763,635.341351


As you can see, we got all the rows having index `Afghanistan`

Generally it is advisable to keep unique indices, but it is also use-case dependent
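A small sketch of what a non-unique index looks like in practice, using a made-up frame `toy`. `Index.is_unique` is a quick way to check whether labels repeat.

```python
import pandas as pd

# Hypothetical toy frame (values invented for illustration)
toy = pd.DataFrame({"country": ["A", "A", "B"], "year": [1952, 1957, 1952]})
by_country = toy.set_index("country")

rows_a = by_country.loc["A"]             # the label "A" repeats, so a DataFrame comes back
is_unique = by_country.index.is_unique   # False, because "A" appears twice

print(len(rows_a), is_unique)
```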



### Now how can we reset our indices back to integers?


In [27]:
df.reset_index()

Unnamed: 0,index,country,year,population,continent,life_exp,gdp_cap
0,1,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,2,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,3,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,4,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,5,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...,...
1699,1700,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,1701,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,1702,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,1703,Zimbabwe,2002,11926563,Africa,39.989,672.038623


Notice it's creating a new column `index`

#### How can we reset our index without creating this new column?

In [28]:
df.reset_index(drop=True) # By using drop=True we can prevent creation of a new column

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623


Great, now let's do this in place

In [29]:
df.reset_index(drop=True, inplace=True)

In [30]:
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
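The three variants of `reset_index` can be compared side by side. This is a sketch on a made-up frame `toy`, not the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy frame with labels 1..3
toy = pd.DataFrame({"year": [1952, 1957, 1962]}, index=[1, 2, 3])

kept = toy.reset_index()               # old labels move into a new "index" column
dropped = toy.reset_index(drop=True)   # old labels are discarded

print(list(kept.columns))     # includes the extra "index" column
print(list(dropped.index))    # fresh labels 0, 1, 2
print(list(toy.index))        # original is unchanged without inplace=True
```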


### Now how can we add a row to our dataframe?

There are multiple ways to do this:

- `append()`
- `loc/iloc`

#### How can we add a row using the **append()** method?



In [31]:
new_row = {'country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}
df.append(new_row)


TypeError: Can only append a dict if ignore_index=True

Why are we getting an error here?

It says the `ignore_index` parameter needs to be set to `True`

In [32]:
new_row = {'country': 'India', 'year': 2000,'life_exp':37.08,'population':13500000,'gdp_cap':900.23}
df = df.append(new_row, ignore_index=True)
df


Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1967,11537966,Asia,34.020,836.197138
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
...,...,...,...,...,...,...
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


Perfect! So now our row is added at the bottom of the dataframe


**But please note that:**

- `append()` doesn't mutate the dataframe.

- It returns a new DataFrame with the row appended, which is why we assign the result back to `df`.
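`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells above only run on older versions. A minimal sketch of the same step with `pd.concat`, using a made-up frame `toy` and a shortened `new_row`:

```python
import pandas as pd

# Hypothetical toy frame and row (values invented for illustration)
toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})
new_row = {"country": "India", "year": 2000}

# Wrap the dict in a one-row DataFrame, concatenate, and renumber the index
extended = pd.concat([toy, pd.DataFrame([new_row])], ignore_index=True)

print(len(toy), len(extended))
```

Like `append()`, `pd.concat` returns a new DataFrame and leaves `toy` untouched.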

Another method would be by **using loc:**

We will need to provide the index label under which the new row will be added

#### What do you think this label would be?



In [35]:
# df.loc[len(df.index)] = ['India', 2000, 13500000, np.nan, 37.08, 900.23]  # one value per column (continent left as NaN); len(df.index) is the next free label

In [None]:
df

The new row was added but the data has been duplicated

#### What can you infer from the last two duplicate rows?

A DataFrame allows us to store duplicate rows in the data

#### Now, can we also **use iloc**?

Adding a row at a specific index position will replace the existing row at that position.

In [37]:
# df.iloc[len(df.index)-1] = ['India', 2000, 13500000, np.nan, 37.08, 900.23]  # one value per column; overwrites the last row
# df

#### What if we try to add the row with a new index?

In [39]:
# df.iloc[len(df.index)] = ['India', 2000, 13500000, np.nan, 37.08, 900.23]  # position does not exist yet

#### Why are we getting an error?

To use `iloc` to add a row, the dataframe must already have a row at that position.

If a row is not available, you'll see an `IndexError`


**Please note:**

* With `loc[]`, a row with that label does not need to exist already; assigning to a new label adds the row.
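The difference between the two indexers when assigning rows can be sketched as follows. The frame `toy` and its values are made up; each list supplies one value per column.

```python
import pandas as pd

# Hypothetical two-column toy frame
toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})

toy.loc[len(toy)] = ["India", 2000]    # label 2 does not exist yet, so a row is added
toy.iloc[0] = ["C", 1962]              # position 0 exists, so that row is overwritten

try:
    toy.iloc[len(toy)] = ["D", 1967]   # position 3 does not exist
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(len(toy), iloc_failed)
```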



### Now what if we want to delete a row?

Use `df.drop()`

If you remember, we specified `axis=1` for columns

We can modify this for rows
- We can use `axis=0` for rows

#### Does the `drop()` method use positional indices or labels?

#### What do you think, looking at the code for deleting a column?

- We had to specify the column title

- So **`drop()` uses labels**, NOT positional indices

In [40]:
# Let's drop row with label 3
df = df.drop(3, axis=0)
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
4,Afghanistan,1972,13079460,Asia,36.088,739.981106
5,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


Now we see that **row with label 3 is deleted**

We now have **rows with labels 0, 1, 2, 4, 5, ...**

#### Now `df.loc[4]` and `df.iloc[4]` will give different rows

In [41]:
df.loc[4] # The 4th row is printed

country       Afghanistan
year                 1972
population       13079460
continent            Asia
life_exp           36.088
gdp_cap        739.981106
Name: 4, dtype: object

In [42]:
df.iloc[4] # The 5th row is printed

country       Afghanistan
year                 1977
population       14880372
continent            Asia
life_exp           38.438
gdp_cap         786.11336
Name: 5, dtype: object
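The same effect on a small frame, plus one way to drop by position when only the position is known. This is a sketch with a made-up frame `toy`; translating a position into a label via `df.index[...]` is one option among several.

```python
import pandas as pd

# Hypothetical toy frame with default labels 0..5
toy = pd.DataFrame({"year": [1952, 1957, 1962, 1967, 1972, 1977]})

after_drop = toy.drop(3, axis=0)           # removes the row labelled 3; toy is unchanged

by_label = after_drop.loc[4, "year"]       # row with label 4
by_position = after_drop.iloc[4]["year"]   # row at position 4, which now carries label 5

# drop() needs labels, so a position is translated to its label first
without_first = after_drop.drop(after_drop.index[0])

print(by_label, by_position, list(without_first.index))
```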

#### And how can we drop multiple rows?

In [43]:
df.drop([1, 2, 4], axis=0) # drops rows with labels 1, 2, 4

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
5,Afghanistan,1977,14880372,Asia,38.438,786.113360
6,Afghanistan,1982,12881816,Asia,39.854,978.011439
7,Afghanistan,1987,13867957,Asia,40.822,852.395945
8,Afghanistan,1992,16317921,Asia,41.674,649.341395
...,...,...,...,...,...,...
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


Let's reset our indices now

In [47]:
df.reset_index(drop=True,inplace=True) # Since we removed a row earlier, we reset our indices
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1972,13079460,Asia,36.088,739.981106
4,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1699,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1700,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1701,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1702,Zimbabwe,2007,12311143,Africa,43.487,469.709298


Now if you remember, the last two rows were duplicates.

### How can we deal with these duplicate rows?

Let's create some more duplicate rows to understand this



In [56]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1700,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1701,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1702,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1703,India,2000,13500000,,37.08,900.23


In [58]:

df.loc[len(df.index)] = ['India',2000,13500000,37.08,900.23, 10]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00, 20]
df.loc[len(df.index)] = ['Sri Lanka',2022 ,130000000,80.00,500.00, 20]
df.loc[len(df.index)] = ['India',2000 ,13500000,80.00,900.23, 2]

print(df)


          country  year  population continent  life_exp     gdp_cap
0     Afghanistan  1952     8425333      Asia    28.801  779.445314
1     Afghanistan  1957     9240934      Asia    30.332  820.853030
2     Afghanistan  1962    10267083      Asia    31.997  853.100710
3     Afghanistan  1972    13079460      Asia    36.088  739.981106
4     Afghanistan  1977    14880372      Asia    38.438  786.113360
...           ...   ...         ...       ...       ...         ...
1703        India  2000    13500000       NaN    37.080  900.230000
1704        India  2000    13500000     37.08   900.230   10.000000
1705    Sri Lanka  2022   130000000      80.0   500.000   20.000000
1706    Sri Lanka  2022   130000000      80.0   500.000   20.000000
1707        India  2000    13500000      80.0   900.230    2.000000

[1708 rows x 6 columns]


#### Now how can we check for duplicate rows?

Use `duplicated()` method on the DataFrame


In [59]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1703    False
1704    False
1705    False
1706     True
1707    False
Length: 1708, dtype: bool


It outputs True if an entire row is identical to a previous row.

However, it is not practical to scan a long list of True and False values

We can use the pandas `loc` indexer to extract those duplicate rows

In [60]:
# Extract duplicate rows
df.loc[df.duplicated()]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1706,Sri Lanka,2022,130000000,80.0,500.0,20.0


The boolean mask `df.duplicated()` selects the duplicate rows

No column selector is given, so all columns are displayed (the same as writing `df.loc[df.duplicated(), :]`)
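A minimal sketch of the mask on a made-up frame `toy`, including a quick count of the duplicates:

```python
import pandas as pd

# Hypothetical toy frame whose last row repeats the second one
toy = pd.DataFrame(
    {"country": ["India", "Sri Lanka", "Sri Lanka"], "year": [2000, 2022, 2022]}
)

mask = toy.duplicated()            # True only for a row identical to an earlier row
duplicates = toy.loc[mask]         # same result as toy.loc[mask, :]
n_duplicates = int(mask.sum())     # True counts as 1

print(list(mask), n_duplicates)
```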

#### Now how can we remove these **duplicate rows** ?

We can use `drop_duplicates()` of Pandas for this



In [61]:
df.drop_duplicates()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1972,13079460,Asia,36.088,739.981106
4,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1702,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1703,India,2000,13500000,,37.080,900.230000
1704,India,2000,13500000,37.08,900.230,10.000000
1705,Sri Lanka,2022,130000000,80.0,500.000,20.000000


#### But how can we decide which of the duplicate rows to keep?

Here we can use the argument **keep**:

It controls which occurrence of a duplicated row is kept.

It accepts only three distinct values
- `first`
- `last`
- `False`

The default is `'first'`.

If `first`, the first occurrence is kept and the later identical rows are dropped as duplicates.

In [52]:
df.drop_duplicates(keep='first')

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1972,13079460,Asia,36.088,739.981106
4,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1699,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1700,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1701,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1702,Zimbabwe,2007,12311143,Africa,43.487,469.709298


If `last`, the last occurrence is kept and the earlier identical rows are dropped as duplicates.

In [53]:
df.drop_duplicates(keep='last')

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1972,13079460,Asia,36.088,739.981106
4,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1699,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1700,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1701,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1702,Zimbabwe,2007,12311143,Africa,43.487,469.709298


If `False`, every row that has a duplicate is dropped, so no copy of it is kept.

In [54]:
df.drop_duplicates(keep=False)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
1,Afghanistan,1957,9240934,Asia,30.332,820.853030
2,Afghanistan,1962,10267083,Asia,31.997,853.100710
3,Afghanistan,1972,13079460,Asia,36.088,739.981106
4,Afghanistan,1977,14880372,Asia,38.438,786.113360
...,...,...,...,...,...,...
1699,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1700,Zimbabwe,1997,11404948,Africa,46.809,792.449960
1701,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1702,Zimbabwe,2007,12311143,Africa,43.487,469.709298


#### What if you want to look for duplicates in only a few columns?

We can use the argument `subset` to list the columns that should be compared.

In [55]:
df.drop_duplicates(subset=['country'],keep='first')

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
11,Albania,1952,1282697,Europe,55.230,1601.056136
23,Algeria,1952,9279525,Africa,43.077,2449.008185
35,Angola,1952,4232095,Africa,30.015,3520.610273
47,Argentina,1952,17876956,Americas,62.485,5911.315053
...,...,...,...,...,...,...
1643,Vietnam,1952,26246839,Asia,40.412,605.066492
1655,West Bank and Gaza,1952,1030585,Asia,43.160,1515.592329
1667,"Yemen, Rep.",1952,4963829,Asia,32.548,781.717576
1679,Zambia,1952,2672000,Africa,42.038,1147.388831


That's it for today.

Next we will see how to work with both rows and columns together

---------------------
### Q&A Session
