# Subsetting in Pandas

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

## Reading and loading data

In [1]:
# import the pandas library
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

print(pd.__version__)

2.1.1


In [2]:
# Read the dataset
data = pd.read_csv('datasets/big_mart_sales.csv')
print(data)

     Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0              FDA15        9.300          Low Fat         0.016047   
1              DRC01        5.920          Regular         0.019278   
2              FDN15       17.500          Low Fat         0.016760   
3              FDX07       19.200          Regular         0.000000   
4              NCD19        8.930          Low Fat         0.000000   
...              ...          ...              ...              ...   
8518           FDF22        6.865          Low Fat         0.056783   
8519           FDS36        8.380          Regular         0.046982   
8520           NCJ29       10.600          Low Fat         0.035186   
8521           FDN46        7.210          Regular         0.145221   
8522           DRG01       14.800          Low Fat         0.044878   

                  Item_Type  Item_MRP Outlet_Identifier  \
0                     Dairy  249.8092            OUT049   
1               Soft Drinks  

## Subsetting values by position

### How to view the top and bottom rows of the data?

- Use the **`head`** method to view the top n rows (5 by default).
- Use the **`tail`** method to view the bottom n rows (5 by default).

In [3]:
# view top 5 rows
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
# view bottom 5 rows
data.tail()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8518,FDF22,6.865,Low Fat,0.056783,Snack Foods,214.5218,OUT013,1987,High,Tier 3,Supermarket Type1,2778.3834
8519,FDS36,8.38,Regular,0.046982,Baking Goods,108.157,OUT045,2002,,Tier 2,Supermarket Type1,549.285
8520,NCJ29,10.6,Low Fat,0.035186,Health and Hygiene,85.1224,OUT035,2004,Small,Tier 2,Supermarket Type1,1193.1136
8521,FDN46,7.21,Regular,0.145221,Snack Foods,103.1332,OUT018,2009,Medium,Tier 3,Supermarket Type2,1845.5976
8522,DRG01,14.8,Low Fat,0.044878,Soft Drinks,75.467,OUT046,1997,Small,Tier 1,Supermarket Type1,765.67
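Both methods accept the number of rows as an argument. The sketch below uses a small hypothetical frame named `sample` (values copied from the rows shown above) instead of the full CSV, so it can be run on its own.

```python
import pandas as pd

# Hypothetical mini-frame; the values are copied from rows shown above.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07", "NCD19", "FDP36"],
    "Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93, 10.395],
})

top3 = sample.head(3)            # first three rows
bottom2 = sample.tail(2)         # last two rows
all_but_last = sample.head(-1)   # a negative n drops that many rows from the end

print(top3)
print(bottom2)
print(all_but_last)
```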


### How to select rows in a particular range?

In [6]:
# select the rows at positions 10 to 14 (the end of the slice, 15, is excluded)
data[10:15]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
10,FDY07,11.8,Low Fat,0.0,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266
11,FDA03,18.5,Regular,0.045464,Dairy,144.1102,OUT046,1997,Small,Tier 1,Supermarket Type1,2187.153
12,FDX32,15.1,Regular,0.100014,Fruits and Vegetables,145.4786,OUT049,1999,Medium,Tier 1,Supermarket Type1,1589.2646
13,FDS46,17.6,Regular,0.047257,Snack Foods,119.6782,OUT046,1997,Small,Tier 1,Supermarket Type1,2145.2076
14,FDF32,16.35,Low Fat,0.068024,Fruits and Vegetables,196.4426,OUT013,1987,High,Tier 3,Supermarket Type1,1977.426
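Slicing a dataframe with `[start:stop]` works on row positions and leaves out the `stop` position, just like slicing a Python list. A minimal sketch on a hypothetical frame named `sample`:

```python
import pandas as pd

# Hypothetical mini-frame used only for illustration.
sample = pd.DataFrame({"Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93, 10.395]})

rows_2_to_4 = sample[2:5]      # positions 2, 3 and 4; position 5 is excluded
same_rows = sample.iloc[2:5]   # the explicit positional form of the same slice

print(rows_2_to_4)
print(rows_2_to_4.equals(same_rows))
```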


### How to select the rows by position?

When we are using `iloc`, we need to specify the rows and columns by their position.

In [7]:
# select specific rows by integer position; the rows come back in the order listed
data.iloc[[1, 5, 2, 4, 6, 14]]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
5,FDP36,10.395,Regular,0.0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
6,FDO10,13.65,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
14,FDF32,16.35,Low Fat,0.068024,Fruits and Vegetables,196.4426,OUT013,1987,High,Tier 3,Supermarket Type1,1977.426


### How to select the specific rows and columns from the data using their position?

`iloc` is an indexer, so it is used with square brackets rather than called like a function. Pass the row positions as the first list and the column positions as the second list; the result follows the order you give.

In [8]:
# rows at positions 1, 4, 5 and 2, and the columns at positions 1, 3 and 5 (zero-based)
data.iloc[[1,4,5,2],[1,3,5]]

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP
1,5.92,0.019278,48.2692
4,8.93,0.0,53.8614
5,10.395,0.0,51.4008
2,17.5,0.01676,141.618
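Lists, slices and negative positions can all be used with `iloc`. The following sketch assumes a small hypothetical frame named `sample` built from the first rows of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame; the values are copied from rows shown above.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07", "NCD19"],
    "Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular", "Low Fat"],
    "Item_Visibility": [0.016047, 0.019278, 0.016760, 0.0, 0.0],
})

picked = sample.iloc[[1, 4, 2], [1, 3]]   # rows in the given order, columns at positions 1 and 3
block = sample.iloc[0:3, 0:2]             # first three rows, first two columns
last_row = sample.iloc[-1]                # negative positions count from the end

print(picked)
print(block)
print(last_row)
```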


## Subsetting values by label

In [9]:
# set the Item_Identifier as the index of the dataframe.
data.set_index('Item_Identifier',inplace=True, drop=True)
data.head()

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
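If you would rather not modify the dataframe in place, `set_index` can also return a new dataframe, and `reset_index` moves the index back into a regular column. A minimal sketch on a hypothetical frame named `sample`:

```python
import pandas as pd

# Hypothetical mini-frame; the values are copied from rows shown above.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15"],
    "Item_Weight": [9.3, 5.92, 17.5],
})

indexed = sample.set_index("Item_Identifier")   # returns a new frame, sample is unchanged
restored = indexed.reset_index()                # index becomes a column again

print(indexed)
print(restored)
```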


### How to select rows using the label of the index?

We can use `loc` to select rows by their index label. If a label appears more than once in the index, `loc` returns every matching row.

In [10]:
# Select rows with index value 'FDA15'
data.loc['FDA15'].head()

Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
FDA15,9.3,Low Fat,0.016055,Dairy,250.2092,OUT045,2002,,Tier 2,Supermarket Type1,5976.2208
FDA15,9.3,Low Fat,0.016019,Dairy,248.5092,OUT035,2004,Small,Tier 2,Supermarket Type1,6474.2392
FDA15,9.3,Low Fat,0.016088,Dairy,249.6092,OUT018,2009,Medium,Tier 3,Supermarket Type2,5976.2208
FDA15,9.3,Low Fat,0.026818,Dairy,248.9092,OUT010,1998,,Tier 3,Grocery Store,498.0184


In [16]:
# Select rows with index value 'FDA15' or 'FDA03', keeping only three columns
data.loc[['FDA15', 'FDA03'], ['Item_Weight', 'Item_Type', 'Outlet_Size']]

Item_Identifier,Item_Weight,Item_Type,Outlet_Size
FDA15,9.3,Dairy,Medium
FDA15,9.3,Dairy,
FDA15,9.3,Dairy,Small
FDA15,9.3,Dairy,Medium
FDA15,9.3,Dairy,
FDA15,9.3,Dairy,High
FDA15,,Dairy,Medium
FDA15,9.3,Dairy,
FDA03,18.5,Dairy,Small
FDA03,,Dairy,Medium
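The shape of what `loc` returns depends on how the label is passed and on whether it is repeated in the index. The sketch below illustrates this on a hypothetical frame named `sample` with a duplicated `FDA15` label.

```python
import pandas as pd

# Hypothetical mini-frame with a repeated index label.
sample = pd.DataFrame(
    {
        "Item_Weight": [9.3, 5.92, 9.3, 18.5],
        "Outlet_Size": ["Medium", "Medium", "Small", "Small"],
    },
    index=pd.Index(["FDA15", "DRC01", "FDA15", "FDA03"], name="Item_Identifier"),
)

repeated = sample.loc["FDA15"]            # label occurs twice: a DataFrame with two rows
single = sample.loc["DRC01"]              # label occurs once: a Series for that row
single_as_frame = sample.loc[["DRC01"]]   # a list of labels always gives a DataFrame
subset = sample.loc[["FDA15", "FDA03"], ["Item_Weight"]]  # rows and columns by label

print(repeated)
print(single)
print(subset)
```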


### Difference between loc and iloc

In [19]:
# Read the dataset
data = pd.read_csv('datasets/big_mart_sales.csv')
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [27]:
# Using loc for subsetting the data
data.loc[0:2]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27


In [28]:
# Using iloc for subsetting the data
data.iloc[0:2]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


- When we slice the dataframe with **`loc[0:2]`**, the numbers are treated as index labels. The slice starts at the row with **label 0** and runs up to and including the row with **label 2**, so we get three rows.
- When we slice the dataframe with **`iloc[0:2]`**, the numbers are treated as positions. The slice starts at **position 0** and stops at **end point - 1**, which is position 1, so we get two rows.
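The inclusive and exclusive end points can be checked on a small example. The frames `sample` and `lettered` below are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical mini-frame with the default 0..n-1 index.
sample = pd.DataFrame({"Item_Weight": [9.3, 5.92, 17.5, 19.2]})

by_label = sample.loc[0:2]       # labels 0, 1 and 2: the end label is included
by_position = sample.iloc[0:2]   # positions 0 and 1: the end position is excluded

# The same rule holds for non-numeric labels.
lettered = sample.set_axis(["a", "b", "c", "d"], axis=0)
a_to_c = lettered.loc["a":"c"]   # rows a, b and c

print(len(by_label), len(by_position), len(a_to_c))
```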

In [30]:
# Sort the data by Item_Weight in descending order
data = data.sort_values('Item_Weight', ascending = False)
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
4257,FDR07,21.35,Low Fat,0.130127,Fruits and Vegetables,96.2094,OUT010,1998,,Tier 3,Grocery Store,190.4188
4468,FDC02,21.35,Low Fat,0.068822,Canned,258.3278,OUT046,1997,Small,Tier 1,Supermarket Type1,7028.8506
2368,FDC02,21.35,Low Fat,0.068809,Canned,258.5278,OUT035,2004,Small,Tier 2,Supermarket Type1,5206.556
2802,FDC02,21.35,Low Fat,0.068765,Canned,260.4278,OUT013,1987,High,Tier 3,Supermarket Type1,3644.5892
43,FDC02,21.35,Low Fat,0.069103,Canned,259.9278,OUT018,2009,Medium,Tier 3,Supermarket Type2,6768.5228


In [34]:
# Using loc after sorting: label 0 now comes after label 2, so the label slice 0:2 is empty
data.loc[0:2]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales


In [35]:
# Using iloc after sorting: positions ignore the labels, so we still get the first two rows
data.iloc[0:2]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
4257,FDR07,21.35,Low Fat,0.130127,Fruits and Vegetables,96.2094,OUT010,1998,,Tier 3,Grocery Store,190.4188
4468,FDC02,21.35,Low Fat,0.068822,Canned,258.3278,OUT046,1997,Small,Tier 1,Supermarket Type1,7028.8506
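The empty `loc` result comes from the order of the labels after sorting. A label slice runs from where the start label sits to where the end label sits, and if the start label comes later in the frame, nothing lies in between. The sketch below reproduces this on a hypothetical four-row frame named `sample` and shows that calling `reset_index(drop=True)` gives the sorted rows fresh labels again.

```python
import pandas as pd

# Hypothetical mini-frame; the values are copied from rows shown above.
sample = pd.DataFrame({"Item_Weight": [9.3, 5.92, 17.5, 19.2]})

# After sorting, the labels appear in the order 3, 2, 0, 1.
ordered = sample.sort_values("Item_Weight", ascending=False)

by_label = ordered.loc[0:2]       # label 0 sits after label 2, so the slice is empty
by_position = ordered.iloc[0:2]   # the two heaviest rows, whatever their labels

# Renumbering the index makes labels and positions line up again.
renumbered = ordered.reset_index(drop=True)
by_label_again = renumbered.loc[0:2]

print(len(by_label), list(by_position.index), len(by_label_again))
```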


## Subsetting values by value

### How to select rows based on condition?

In [43]:
# Read the dataset
data = pd.read_csv('datasets/big_mart_sales.csv')
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [44]:
# filter rows with condition
data.loc[data['Item_Weight'] > 21].head()
# data[data['Item_Weight'] > 21].head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
43,FDC02,21.35,Low Fat,0.069103,Canned,259.9278,OUT018,2009,Medium,Tier 3,Supermarket Type2,6768.5228
148,FDA45,21.25,Low Fat,0.15535,Snack Foods,178.237,OUT035,2004,Small,Tier 2,Supermarket Type1,529.311
397,FDG35,21.2,Regular,0.007041,Starchy Foods,173.5738,OUT046,1997,Small,Tier 1,Supermarket Type1,2954.1546
483,FDC02,21.35,Low Fat,0.115195,Canned,258.3278,OUT010,1998,,Tier 3,Grocery Store,520.6556
934,FDQ21,21.25,Low Fat,0.019502,Snack Foods,120.8756,OUT018,2009,Medium,Tier 3,Supermarket Type2,3150.5656
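A condition such as `data['Item_Weight'] > 21` produces a boolean Series with one True or False per row, and that mask is what selects the rows. The sketch below uses a hypothetical frame named `sample`; note that a missing weight compares as False, so that row is left out.

```python
import pandas as pd

# Hypothetical mini-frame; the last weight is missing on purpose.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDC02", "FDA45", "FDX07"],
    "Item_Weight": [9.3, 21.35, 21.25, float("nan")],
})

mask = sample["Item_Weight"] > 21   # boolean Series, one value per row
heavy = sample.loc[mask]            # keeps only the rows where the mask is True
n_heavy = int(mask.sum())           # True counts as 1, so the sum is the number of matches

print(mask.tolist())
print(heavy)
print(n_heavy)
```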


### How to select rows based on multiple conditions?

When passing multiple conditions, make sure that you wrap each condition in parentheses `()`.

In [45]:
# filter rows with multiple conditions (combine them with & for and, | for or, ~ for not)
data.loc[(data['Item_Weight'] > 21) & (data['Item_Type'] == 'Canned')].head()
# data[(data['Item_Weight'] > 21) & (data['Item_Type'] == 'Canned')].head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
43,FDC02,21.35,Low Fat,0.069103,Canned,259.9278,OUT018,2009,Medium,Tier 3,Supermarket Type2,6768.5228
483,FDC02,21.35,Low Fat,0.115195,Canned,258.3278,OUT010,1998,,Tier 3,Grocery Store,520.6556
2368,FDC02,21.35,Low Fat,0.068809,Canned,258.5278,OUT035,2004,Small,Tier 2,Supermarket Type1,5206.556
2802,FDC02,21.35,Low Fat,0.068765,Canned,260.4278,OUT013,1987,High,Tier 3,Supermarket Type1,3644.5892
4468,FDC02,21.35,Low Fat,0.068822,Canned,258.3278,OUT046,1997,Small,Tier 1,Supermarket Type1,7028.8506
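The parentheses matter because `&` and `|` bind more tightly than `>` or `==` in Python, so without them the expression is grouped in a way you did not intend. Naming each mask first is one way to keep the conditions readable. The frame `sample` below is a hypothetical stand-in for the dataset.

```python
import pandas as pd

# Hypothetical mini-frame used only for illustration.
sample = pd.DataFrame({
    "Item_Weight": [21.35, 21.25, 9.3, 21.35],
    "Item_Type": ["Canned", "Snack Foods", "Canned", "Canned"],
})

heavy = sample["Item_Weight"] > 21
canned = sample["Item_Type"] == "Canned"

both = sample.loc[heavy & canned]               # heavy AND canned
either = sample.loc[heavy | canned]             # heavy OR canned
heavy_not_canned = sample.loc[heavy & ~canned]  # heavy AND NOT canned

print(both)
print(either)
print(heavy_not_canned)
```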


### How to filter for a list of values?

If you want to keep the rows where a column matches any value in a list, use the **`isin`** method instead of writing multiple conditions.

In [46]:
# filter for a list of values
# data.loc[data['Outlet_Establishment_Year'].isin([1987, 1988])]
data[data['Outlet_Establishment_Year'].isin([1987, 1988])].head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
6,FDO10,13.65,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
14,FDF32,16.35,Low Fat,0.068024,Fruits and Vegetables,196.4426,OUT013,1987,High,Tier 3,Supermarket Type1,1977.426
20,FDN22,18.85,Regular,0.13819,Snack Foods,250.8724,OUT013,1987,High,Tier 3,Supermarket Type1,3775.086
27,DRJ59,11.65,low fat,0.019356,Hard Drinks,39.1164,OUT013,1987,High,Tier 3,Supermarket Type1,308.9312
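The mask returned by `isin` can be inverted with `~` to keep the rows that are not in the list. A minimal sketch on a hypothetical frame named `sample`:

```python
import pandas as pd

# Hypothetical mini-frame; the values are copied from rows shown above.
sample = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT013", "OUT010"],
    "Outlet_Establishment_Year": [1999, 2009, 1987, 1998],
})

years = [1987, 1998]
in_years = sample[sample["Outlet_Establishment_Year"].isin(years)]
not_in_years = sample[~sample["Outlet_Establishment_Year"].isin(years)]

print(in_years)
print(not_in_years)
```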


### How to select specific columns?

You just need to pass a list of columns that you need.

In [47]:
# list of columns
cols = ['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility']

# dataframe with specific columns
data[cols].head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility
0,FDA15,9.3,Low Fat,0.016047
1,DRC01,5.92,Regular,0.019278
2,FDN15,17.5,Low Fat,0.01676
3,FDX07,19.2,Regular,0.0
4,NCD19,8.93,Low Fat,0.0


In [48]:
# filter the data with conditions and specific columns
data[(data['Item_Identifier'] == 'FDA15') & (data['Item_Weight'] > 5)][cols]
# data.loc[(data['Item_Identifier'] == 'FDA15') & (data['Item_Weight'] > 5), cols]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility
0,FDA15,9.3,Low Fat,0.016047
831,FDA15,9.3,Low Fat,0.016055
2599,FDA15,9.3,Low Fat,0.016019
2643,FDA15,9.3,Low Fat,0.016088
4874,FDA15,9.3,Low Fat,0.026818
5413,FDA15,9.3,Low Fat,0.016009
7543,FDA15,9.3,LF,0.016113
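The two forms in the cell above return the same rows and columns. The `loc` form does the row and column selection in a single step, which is generally the safer choice if you later want to assign values to the selection, because chained indexing may operate on a copy. The sketch below compares the two on a hypothetical frame named `sample`.

```python
import pandas as pd

# Hypothetical mini-frame; the values are copied from rows shown above.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Item_Weight": [9.3, 5.92, 9.3],
    "Item_MRP": [249.8092, 48.2692, 250.2092],
})

cols = ["Item_Identifier", "Item_Weight"]
mask = (sample["Item_Identifier"] == "FDA15") & (sample["Item_Weight"] > 5)

chained = sample[mask][cols]          # filter rows first, then pick columns
single_step = sample.loc[mask, cols]  # rows and columns in one indexing operation

print(single_step)
print(chained.equals(single_step))
```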


### How to select columns with specific data types?

Let's first see the data type of each column using the `dtypes` attribute.

In [49]:
# check the data types of the columns
print(data.dtypes)

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object


In [50]:
# Select the columns where data type = object
data.select_dtypes('object').head()

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1
3,FDX07,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store
4,NCD19,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1


In [52]:
# Select the columns where data type = object and int
data.select_dtypes(['object', 'int64']).head()

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,Low Fat,Dairy,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,DRC01,Regular,Soft Drinks,OUT018,2009,Medium,Tier 3,Supermarket Type2
2,FDN15,Low Fat,Meat,OUT049,1999,Medium,Tier 1,Supermarket Type1
3,FDX07,Regular,Fruits and Vegetables,OUT010,1998,,Tier 3,Grocery Store
4,NCD19,Low Fat,Household,OUT013,1987,High,Tier 3,Supermarket Type1
