In [1]:
import pandas as pd

In [2]:
# Let's look at the column names
df = pd.read_csv("imdb_top_1000.csv")
df.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

In [3]:
# Now, I am going to keep only a few columns
cols = ['Series_Title', 'IMDB_Rating', 'No_of_Votes']

In [4]:
# Reading the dataset again with those columns
movies_df = pd.read_csv('imdb_top_1000.csv', usecols=cols)
movies_df

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
0,The Shawshank Redemption,9.3,2343110
1,The Godfather,9.2,1620367
2,The Dark Knight,9.0,2303232
3,The Godfather: Part II,9.0,1129952
4,12 Angry Men,9.0,689845
...,...,...,...
995,Breakfast at Tiffany's,7.6,166544
996,Giant,7.6,34075
997,From Here to Eternity,7.6,43374
998,Lifeboat,7.6,26471


### .nlargest()
This method is equivalent to `df.sort_values(columns, ascending=False).head(n)`, but more performant.
In other words, `nlargest()` and `nsmallest()` give us the top or bottom `n` rows of a DataFrame, already sorted by the chosen columns. *The `nlargest()` method is designed to work on columns that contain numeric data.*
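To make the equivalence concrete, here is a minimal sketch on a made-up DataFrame (`toy_df`, its `title` and `votes` columns, and the values are all hypothetical, not taken from `imdb_top_1000.csv`). With no tied values, both approaches should return the same rows in the same order.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "votes": [50, 400, 120, 900, 300],
})

# Top 3 rows by votes, using nlargest()
top3_nlargest = toy_df.nlargest(n=3, columns="votes")

# The same result via a full sort followed by head()
top3_sorted = toy_df.sort_values("votes", ascending=False).head(3)

print(top3_nlargest)
print(top3_nlargest.equals(top3_sorted))
```

Note that the original index labels are kept in both results, just like in the notebook outputs.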

#### So, first let's try to sort the DataFrame using `sort_values`.
- If we call `.sort_values()` on a Series, we don't need to provide any column name.

#### But when calling `.sort_values()` on a DataFrame, we need to pass the columns to sort by as well
```
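syntax:

DataFrame.sort_values(
    by, # The column to sort by. This can be a single column name or a list of column names.
    axis=0, # Sort rows (axis=0, the default) or columns (axis=1).
    ascending=True, # Sort ascending (True) or descending (False).
    inplace=False, # If True, modify the DataFrame in place and return None.
    kind='quicksort', # Sorting algorithm to use.
    na_position='last', # Where NaN values are placed: 'first' or 'last'.
    ignore_index=False # If True, the result gets a new 0..n-1 index; otherwise the original index is retained.
)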
```
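As a rough illustration of the Series versus DataFrame difference described above, the sketch below sorts a made-up DataFrame both ways (`toy_df` and its `title` and `rating` columns are hypothetical, chosen only for this example).

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [8.1, 9.0, 7.6, 8.8],
})

# On a Series there is only one set of values, so no column name is needed
sorted_ratings = toy_df["rating"].sort_values(ascending=False)

# On a DataFrame, `by` tells pandas which column(s) to sort on;
# ignore_index=True replaces the original index with 0..n-1
sorted_df = toy_df.sort_values(by="rating", ascending=False, ignore_index=True)

print(sorted_ratings)
print(sorted_df)
```

The Series result keeps the original index labels, while the DataFrame result is renumbered because of `ignore_index=True`.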

In [6]:
# nlargest() returns the n rows with the largest values.
# n: number of rows we want to get back - just like .head(n)
# columns: the column(s) whose values decide the ordering
movies_df.nlargest(n=10, columns=['No_of_Votes'])

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
0,The Shawshank Redemption,9.3,2343110
2,The Dark Knight,9.0,2303232
8,Inception,8.8,2067042
9,Fight Club,8.8,1854740
6,Pulp Fiction,8.9,1826188
11,Forrest Gump,8.8,1809221
14,The Matrix,8.7,1676426
10,The Lord of the Rings: The Fellowship of the Ring,8.8,1661481
5,The Lord of the Rings: The Return of the King,8.9,1642758
1,The Godfather,9.2,1620367


In [8]:
# We can even sort by two different columns
movies_df.nlargest(n=10, columns=['IMDB_Rating', 'No_of_Votes'])
# Sorting by multiple columns works like this:
# I passed two columns - 'IMDB_Rating', 'No_of_Votes',
# so the rows are sorted by IMDB_Rating first; if two or more rows
# have the same IMDB_Rating, those rows are sorted by No_of_Votes.
# In general, the first column in the list decides the order,
# and ties are broken by the next column in the list.

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
0,The Shawshank Redemption,9.3,2343110
1,The Godfather,9.2,1620367
2,The Dark Knight,9.0,2303232
3,The Godfather: Part II,9.0,1129952
4,12 Angry Men,9.0,689845
6,Pulp Fiction,8.9,1826188
5,The Lord of the Rings: The Return of the King,8.9,1642758
7,Schindler's List,8.9,1213505
8,Inception,8.8,2067042
9,Fight Club,8.8,1854740
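The tie-breaking rule is easier to see on a small table. The following is a sketch on made-up data (`toy_df` and its `title`, `rating`, and `votes` columns are hypothetical): rows with the same `rating` should come out ordered by `votes`.

```python
import pandas as pd

# Hypothetical toy data with tied ratings, used only for illustration
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "rating": [9.0, 8.9, 9.0, 8.9, 9.3],
    "votes": [100, 700, 500, 300, 200],
})

# Order by rating first; ties on rating are broken by votes
top4 = toy_df.nlargest(n=4, columns=["rating", "votes"])

# The same selection expressed with sort_values() and head()
top4_sorted = toy_df.sort_values(["rating", "votes"], ascending=False).head(4)

print(top4)
```

Here `C` is placed before `A` because both have a rating of 9.0 and `C` has more votes; for the last slot, `B` wins over `D` for the same reason.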


In [10]:
# nsmallest() returns the n rows with the smallest values.
movies_df.nsmallest(n=10, columns=['No_of_Votes'])

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
264,Ba wang bie ji,8.1,25088
721,God's Own Country,7.7,25198
694,La planète sauvage,7.8,25229
718,Scarface: The Shame of the Nation,7.8,25312
570,Raazi,7.8,25344
785,The Magdalene Sisters,7.7,25938
989,The Long Goodbye,7.6,26337
169,Dom za vesanje,8.2,26402
814,Do lok tin si,7.7,26429
863,Cape Fear,7.7,26457
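`nsmallest()` is also available on a single column (a Series), where no `columns` argument is needed. A minimal sketch, assuming a made-up DataFrame (`toy_df`, `title`, and `votes` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "votes": [50, 400, 120, 900, 300],
})

# DataFrame version: returns whole rows, so `columns` is required
bottom2_df = toy_df.nsmallest(n=2, columns="votes")

# Series version: returns only the values of that column
bottom2_series = toy_df["votes"].nsmallest(2)

print(bottom2_df)
print(bottom2_series)
```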


### The next argument of this function is `keep`
`keep` decides which rows to return when several rows are tied on the value at the cutoff:
- `keep='first'` keeps the first occurrences (default)
- `keep='last'` keeps the last occurrences
- `keep='all'` keeps all tied rows, even if that returns more than `n` rows
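A small sketch may help to compare the three `keep` options side by side. The data is made up (`toy_df`, `title`, and `rating` are hypothetical): three rows are tied at 8.8, and we ask for only `n=2` rows.

```python
import pandas as pd

# Hypothetical toy data: B, C and D are tied at 8.8
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [9.0, 8.8, 8.8, 8.8],
})

# Only one slot is left after A, so the tie at 8.8 must be resolved
first = toy_df.nlargest(n=2, columns="rating", keep="first")
last = toy_df.nlargest(n=2, columns="rating", keep="last")
everything = toy_df.nlargest(n=2, columns="rating", keep="all")

print(first["title"].tolist())       # tied slot goes to the first tied row
print(last["title"].tolist())        # tied slot goes to the last tied row
print(everything["title"].tolist())  # every tied row is returned
```

With `keep='all'` the result has four rows even though `n=2`, because all rows tied at the cutoff value are included.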

In [13]:
movies_df.nlargest(n=10, columns=['IMDB_Rating'], keep='first')

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
0,The Shawshank Redemption,9.3,2343110
1,The Godfather,9.2,1620367
2,The Dark Knight,9.0,2303232
3,The Godfather: Part II,9.0,1129952
4,12 Angry Men,9.0,689845
5,The Lord of the Rings: The Return of the King,8.9,1642758
6,Pulp Fiction,8.9,1826188
7,Schindler's List,8.9,1213505
8,Inception,8.8,2067042
9,Fight Club,8.8,1854740


In [14]:
movies_df.nlargest(n=10, columns=['IMDB_Rating'], keep='last')

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
0,The Shawshank Redemption,9.3,2343110
1,The Godfather,9.2,1620367
4,12 Angry Men,9.0,689845
3,The Godfather: Part II,9.0,1129952
2,The Dark Knight,9.0,2303232
7,Schindler's List,8.9,1213505
6,Pulp Fiction,8.9,1826188
5,The Lord of the Rings: The Return of the King,8.9,1642758
12,"Il buono, il brutto, il cattivo",8.8,688390
11,Forrest Gump,8.8,1809221


In [15]:
# Here I've set keep to 'all',
# so when it finds duplicates at the cutoff it lists all of them in the result.
# It doesn't matter that we passed n=10: as the output below shows, we get more than 10 rows,
# because it has to display all the duplicates - 8.8 is the cutoff value in the sorted IMDB_Rating column,
# so it shows every row with an IMDB_Rating of 8.8.
movies_df.nlargest(n=10, columns=['IMDB_Rating'], keep='all')

Unnamed: 0,Series_Title,IMDB_Rating,No_of_Votes
0,The Shawshank Redemption,9.3,2343110
1,The Godfather,9.2,1620367
2,The Dark Knight,9.0,2303232
3,The Godfather: Part II,9.0,1129952
4,12 Angry Men,9.0,689845
5,The Lord of the Rings: The Return of the King,8.9,1642758
6,Pulp Fiction,8.9,1826188
7,Schindler's List,8.9,1213505
8,Inception,8.8,2067042
9,Fight Club,8.8,1854740
