# Filter data

Filtering is a core part of data analysis. It selects the subset of a dataframe that satisfies one or more conditions.

In [1]:
import pandas as pd
import numpy as np

from pathlib import Path 
path = Path().absolute().parent.parent.parent.parent / 'resources' / 'data' / 'bestsellers with categories.csv'

In [2]:
data = pd.read_csv(path)
data

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction
...,...,...,...,...,...,...,...
545,Wrecking Ball (Diary of a Wimpy Kid Book 14),Jeff Kinney,4.9,9413,8,2019,Fiction
546,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2016,Non Fiction
547,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2017,Non Fiction
548,You Are a Badass: How to Stop Doubting Your Gr...,Jen Sincero,4.7,14331,8,2018,Non Fiction


A dataframe is similar to a 2D numpy array. The main difference is that each column of a dataframe can have its own type, whereas all elements of a numpy array share a single type.

Because of this similarity, the boolean-mask filtering we used with numpy arrays also works on dataframes.
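As a minimal sketch of that parallel, the cell below applies the same comparison to a numpy array and to a dataframe column. The ratings and the `books` frame are made-up values for illustration, not rows from the bestsellers file.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, not taken from the bestsellers file
ratings = np.array([4.7, 4.6, 4.9, 4.8])
array_mask = ratings > 4.7            # boolean array, one entry per element
high_ratings = ratings[array_mask]    # keeps only the elements where the mask is True

# The same idea on a dataframe: the mask is a boolean Series aligned on the index
books = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'User Rating': ratings,
})
frame_mask = books['User Rating'] > 4.7
high_rated_books = books[frame_mask]  # keeps only the rows where the mask is True
```

Note that the rows kept by the mask retain their original index labels (here 2 and 3).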

In [3]:
filtered_data = data[data['User Rating'] > 4.7]  # keep rows where the user rating is greater than 4.7
filtered_data = filtered_data[filtered_data['Genre'] == 'Fiction']  # build the genre mask from the already filtered frame
filtered_data  # fiction books with a rating greater than 4.7


Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
40,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,4.9,14344,5,2017,Fiction
41,"Brown Bear, Brown Bear, What Do You See?",Bill Martin Jr.,4.9,14344,5,2019,Fiction
42,"Cabin Fever (Diary of a Wimpy Kid, Book 6)",Jeff Kinney,4.8,4505,0,2011,Fiction
63,Dear Zoo: A Lift-the-Flap Book,Rod Campbell,4.8,10922,5,2015,Fiction
64,Dear Zoo: A Lift-the-Flap Book,Rod Campbell,4.8,10922,5,2016,Fiction
...,...,...,...,...,...,...,...
541,Wonder,R. J. Palacio,4.8,21625,9,2014,Fiction
542,Wonder,R. J. Palacio,4.8,21625,9,2015,Fiction
543,Wonder,R. J. Palacio,4.8,21625,9,2016,Fiction
544,Wonder,R. J. Palacio,4.8,21625,9,2017,Fiction
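The two filtering steps can also be written as a single mask. The sketch below assumes a small hypothetical `books` frame with the same column names as the bestsellers data; `DataFrame.query` is shown as an alternative spelling of the same selection.

```python
import pandas as pd

# Hypothetical rows shaped like the bestsellers data
books = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'User Rating': [4.9, 4.8, 4.6, 4.9],
    'Genre': ['Fiction', 'Non Fiction', 'Fiction', 'Fiction'],
})

# Each comparison needs its own parentheses because & binds tighter than > and ==
mask = (books['User Rating'] > 4.7) & (books['Genre'] == 'Fiction')
top_fiction = books[mask]

# Same selection with DataFrame.query; backticks quote a column name that contains a space
top_fiction_query = books.query("`User Rating` > 4.7 and Genre == 'Fiction'")
```

Building both conditions from the same frame avoids the index mismatch that occurs when a mask computed on the full frame is applied to an already filtered one.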


### Loc & Iloc

`loc` and `iloc` **select rows and columns by label and by integer position, respectively**. Both return a subset of the data.
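The difference between label and position only becomes visible when the index is not the default 0, 1, 2, and so on, which is exactly what happens after filtering. A minimal sketch, using a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical data; the default index labels are 0, 1, 2, 3
books = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'User Rating': [4.6, 4.9, 4.5, 4.8],
})

# Filtering keeps the original labels, so the result is labelled 1 and 3
top = books[books['User Rating'] > 4.7]

by_label = top.loc[1]       # the row whose index label is 1, i.e. book 'B'
by_position = top.iloc[1]   # the second row of the result, i.e. book 'D'
```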

##### Loc
`loc` is a label-based indexer. It takes up to two selectors separated by a comma: the first selects rows and the second selects columns. Each selector can be:

- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index)
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f'


In [4]:
data.loc[0:5]  # Select the rows labelled 0 through 5 (label-based, both endpoints included)

Unnamed: 0,Name,Author,User Rating,Reviews,Price,Year,Genre
0,10-Day Green Smoothie Cleanse,JJ Smith,4.7,17350,8,2016,Non Fiction
1,11/22/63: A Novel,Stephen King,4.6,2052,22,2011,Fiction
2,12 Rules for Life: An Antidote to Chaos,Jordan B. Peterson,4.7,18979,15,2018,Non Fiction
3,1984 (Signet Classics),George Orwell,4.7,21424,6,2017,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",National Geographic Kids,4.8,7665,12,2019,Non Fiction
5,A Dance with Dragons (A Song of Ice and Fire),George R. R. Martin,4.4,12643,11,2011,Fiction


In [5]:
data.loc[0:5, ['Name', 'User Rating', 'Reviews']]  # Select the rows labelled 0 through 5 and only the Name, User Rating and Reviews columns

Unnamed: 0,Name,User Rating,Reviews
0,10-Day Green Smoothie Cleanse,4.7,17350
1,11/22/63: A Novel,4.6,2052
2,12 Rules for Life: An Antidote to Chaos,4.7,18979
3,1984 (Signet Classics),4.7,21424
4,"5,000 Awesome Facts (About Everything!) (Natio...",4.8,7665
5,A Dance with Dragons (A Song of Ice and Fire),4.4,12643
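The three selector forms listed for `loc` are easier to tell apart from positions when the index holds strings. The sketch below assumes a hypothetical frame indexed by the labels 'a' to 'd'; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame indexed by string labels instead of the default 0..n-1
books = pd.DataFrame(
    {'User Rating': [4.7, 4.6, 4.9, 4.8], 'Price': [8, 22, 5, 9]},
    index=['a', 'b', 'c', 'd'],
)

single = books.loc['b']                     # single label: one row, returned as a Series
several = books.loc[['a', 'c']]             # list of labels: a DataFrame with those rows
sliced = books.loc['a':'c']                 # label slice: both endpoints are included
cell_block = books.loc['b':'d', ['Price']]  # rows by label slice, columns by label list
```

Unlike ordinary Python slices, a label slice such as `'a':'c'` includes its end label, which is why `data.loc[0:5]` returns six rows.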



##### Iloc
The `iloc` indexer of a pandas DataFrame selects by integer position. Like `loc`, it takes a row selector and an optional column selector, each of which can be:

- An integer e.g. 5
- A list or array of integers e.g. [4, 3, 0]
- A slice object with ints e.g. 1:7 (the end position is excluded, as in regular Python slicing)

In [6]:
data.iloc[:, 0::2]  # All rows, every second column starting at position 0 (position-based)

Unnamed: 0,Name,User Rating,Price,Genre
0,10-Day Green Smoothie Cleanse,4.7,8,Non Fiction
1,11/22/63: A Novel,4.6,22,Fiction
2,12 Rules for Life: An Antidote to Chaos,4.7,15,Non Fiction
3,1984 (Signet Classics),4.7,6,Fiction
4,"5,000 Awesome Facts (About Everything!) (Natio...",4.8,12,Non Fiction
...,...,...,...,...
545,Wrecking Ball (Diary of a Wimpy Kid Book 14),4.9,8,Fiction
546,You Are a Badass: How to Stop Doubting Your Gr...,4.7,8,Non Fiction
547,You Are a Badass: How to Stop Doubting Your Gr...,4.7,8,Non Fiction
548,You Are a Badass: How to Stop Doubting Your Gr...,4.7,8,Non Fiction
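The selector forms listed for `iloc` can be sketched on a small hypothetical frame. The string index is an assumption made here to show that `iloc` ignores labels entirely.

```python
import pandas as pd

# Hypothetical frame; the string index makes it clear that iloc ignores labels
books = pd.DataFrame(
    {
        'Name': ['A', 'B', 'C', 'D'],
        'User Rating': [4.7, 4.6, 4.9, 4.8],
        'Price': [8, 22, 5, 9],
    },
    index=['a', 'b', 'c', 'd'],
)

first_row = books.iloc[0]      # single integer: the first row, returned as a Series
picked = books.iloc[[3, 0]]    # list of integers: rows in the requested order
sliced = books.iloc[1:3]       # integer slice: positions 1 and 2, end excluded
corner = books.iloc[:2, 0::2]  # first two rows, every second column
```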
