# Subsetting: Value Based

## TABLE OF CONTENTS

- How to select rows based on a condition?
- How to select rows based on multiple conditions?
- How to filter for a list of values?
- How to select specific columns from a dataframe?
- How to select rows based on a condition and view only specific columns?
- How to select the columns with specific data types?


### READ THE DATA

- We are going to use the Big Mart sales data, which is stored in the folder named datasets.

In [1]:
# importing the required libraries
import pandas as pd

In [2]:
#read the data
data = pd.read_csv('datasets/big_mart_sales.csv')

In [3]:
# view the top rows of the data
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## HOW TO SELECT ROWS BASED ON CONDITION?
- Select all rows where the value of Outlet_Establishment_Year is 1987

In [14]:
# Filter rows with condition
#data.loc[data.Outlet_Establishment_Year==1987]
#data.loc[data['Outlet_Establishment_Year']==1987]
data[data.Outlet_Establishment_Year==1987]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
6,FDO10,13.650,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
14,FDF32,16.350,Low Fat,0.068024,Fruits and Vegetables,196.4426,OUT013,1987,High,Tier 3,Supermarket Type1,1977.4260
20,FDN22,18.850,Regular,0.138190,Snack Foods,250.8724,OUT013,1987,High,Tier 3,Supermarket Type1,3775.0860
27,DRJ59,11.650,low fat,0.019356,Hard Drinks,39.1164,OUT013,1987,High,Tier 3,Supermarket Type1,308.9312
...,...,...,...,...,...,...,...,...,...,...,...,...
8462,FDQ31,5.785,Regular,0.053802,Fruits and Vegetables,85.9856,OUT013,1987,High,Tier 3,Supermarket Type1,1494.0552
8466,FDJ32,10.695,Low Fat,0.057744,Fruits and Vegetables,61.2536,OUT013,1987,High,Tier 3,Supermarket Type1,673.7896
8484,DRJ49,6.865,Low Fat,0.000000,Soft Drinks,129.9652,OUT013,1987,High,Tier 3,Supermarket Type1,2324.9736
8512,FDR26,20.700,Low Fat,0.042801,Dairy,178.3028,OUT013,1987,High,Tier 3,Supermarket Type1,2479.4392
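
To see why this works, it helps to look at the condition on its own. The sketch below uses a tiny hand-made dataframe (the name `toy` and its three rows are assumptions for illustration, not the real file) and shows that the comparison produces a boolean Series, which `[]` or `.loc[]` then uses to keep only the rows marked True.

```python
import pandas as pd

# Toy stand-in for the big mart data (three illustrative rows only)
toy = pd.DataFrame({
    'Item_Identifier': ['NCD19', 'DRC01', 'FDO10'],
    'Outlet_Establishment_Year': [1987, 2009, 1987],
})

# The comparison returns a boolean Series, one entry per row
mask = toy['Outlet_Establishment_Year'] == 1987
print(mask.tolist())            # [True, False, True]

# Passing the mask to [] (or .loc[]) keeps only the rows marked True
subset = toy[mask]
print(subset.index.tolist())    # [0, 2]
```

Note that the original row labels (0 and 2) are kept in the result, which is why the index in the output above is not consecutive.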


## HOW TO SELECT ROWS BASED ON MULTIPLE CONDITIONS?
- When passing multiple conditions, make sure that you wrap each condition in parentheses ().
- Select all rows where the value of Outlet_Establishment_Year is 2009 and the value of Outlet_Size is Medium.

In [18]:
# filter data based on multiple conditions

# and &
# or |
# not ~

#data.loc[(data.Outlet_Establishment_Year == 2009) & (data.Outlet_Size == 'Medium')]

data[(data.Outlet_Establishment_Year == 2009) & (data.Outlet_Size == 'Medium')]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
1,DRC01,5.920,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
5,FDP36,10.395,Regular,0.000000,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
16,NCB42,11.800,Low Fat,0.008596,Health and Hygiene,115.3492,OUT018,2009,Medium,Tier 3,Supermarket Type2,1621.8888
31,NCS17,18.600,Low Fat,0.080829,Health and Hygiene,96.4436,OUT018,2009,Medium,Tier 3,Supermarket Type2,2741.7644
32,FDP33,18.700,Low Fat,0.000000,Snack Foods,256.6672,OUT018,2009,Medium,Tier 3,Supermarket Type2,3068.0064
...,...,...,...,...,...,...,...,...,...,...,...,...
8506,DRF37,17.250,Low Fat,0.084676,Soft Drinks,263.1910,OUT018,2009,Medium,Tier 3,Supermarket Type2,3944.8650
8511,FDF05,17.500,Low Fat,0.026980,Frozen Foods,262.5910,OUT018,2009,Medium,Tier 3,Supermarket Type2,4207.8560
8515,FDH24,20.700,Low Fat,0.021518,Baking Goods,157.5288,OUT018,2009,Medium,Tier 3,Supermarket Type2,1571.2880
8516,NCJ19,18.600,Low Fat,0.118661,Others,58.7588,OUT018,2009,Medium,Tier 3,Supermarket Type2,858.8820
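
The three element-wise operators can be tried side by side on a small example. This is a minimal sketch on a hand-made dataframe (`toy` and its values are assumptions for illustration). The parentheses matter because `&` and `|` bind more tightly than `==` in Python, so an unparenthesised expression is grouped differently from what was intended.

```python
import pandas as pd

# Toy stand-in for the big mart data (illustrative values only)
toy = pd.DataFrame({
    'Outlet_Establishment_Year': [1987, 2009, 1998, 2009],
    'Outlet_Size': ['High', 'Medium', None, 'Small'],
})

# Build each condition separately; each one is a boolean Series
year_2009 = toy['Outlet_Establishment_Year'] == 2009
medium = toy['Outlet_Size'] == 'Medium'

both = toy[year_2009 & medium]      # and: both conditions hold
either = toy[year_2009 | medium]    # or: at least one condition holds
not_2009 = toy[~year_2009]          # not: invert the condition

print(both.index.tolist())          # [1]
print(either.index.tolist())        # [1, 3]
print(not_2009.index.tolist())      # [0, 2]
```

Storing each condition in a named variable, as done here, is one way to keep long filters readable.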


## HOW TO FILTER FOR A LIST OF VALUES?

- If you want to filter for a list of values from a column, use the isin method instead of writing multiple conditions.


In [17]:
# get rows for the three years 1987, 1988 and 1999 using multiple conditions

data[(data.Outlet_Establishment_Year == 1987) | (data.Outlet_Establishment_Year == 1988) | (data.Outlet_Establishment_Year == 1999)]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
6,FDO10,13.650,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
10,FDY07,11.800,Low Fat,0.000000,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266
...,...,...,...,...,...,...,...,...,...,...,...,...
8475,NCS17,18.600,Low Fat,0.080627,Health and Hygiene,92.5436,OUT049,1999,Medium,Tier 1,Supermarket Type1,378.1744
8479,FDL10,8.395,Low Fat,0.039554,Snack Foods,99.1042,OUT049,1999,Medium,Tier 1,Supermarket Type1,2579.3092
8484,DRJ49,6.865,Low Fat,0.000000,Soft Drinks,129.9652,OUT013,1987,High,Tier 3,Supermarket Type1,2324.9736
8512,FDR26,20.700,Low Fat,0.042801,Dairy,178.3028,OUT013,1987,High,Tier 3,Supermarket Type1,2479.4392


In [20]:
# filter for a list of values

#data.loc[data.Outlet_Establishment_Year.isin([1987, 1988, 1999])]

data[data.Outlet_Establishment_Year.isin([1987, 1988, 1999])]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.300,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.1380
2,FDN15,17.500,Low Fat,0.016760,Meat,141.6180,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.2700
4,NCD19,8.930,Low Fat,0.000000,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
6,FDO10,13.650,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
10,FDY07,11.800,Low Fat,0.000000,Fruits and Vegetables,45.5402,OUT049,1999,Medium,Tier 1,Supermarket Type1,1516.0266
...,...,...,...,...,...,...,...,...,...,...,...,...
8475,NCS17,18.600,Low Fat,0.080627,Health and Hygiene,92.5436,OUT049,1999,Medium,Tier 1,Supermarket Type1,378.1744
8479,FDL10,8.395,Low Fat,0.039554,Snack Foods,99.1042,OUT049,1999,Medium,Tier 1,Supermarket Type1,2579.3092
8484,DRJ49,6.865,Low Fat,0.000000,Soft Drinks,129.9652,OUT013,1987,High,Tier 3,Supermarket Type1,2324.9736
8512,FDR26,20.700,Low Fat,0.042801,Dairy,178.3028,OUT013,1987,High,Tier 3,Supermarket Type1,2479.4392
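
As a quick check that the two approaches agree, the sketch below compares the chained `|` conditions with `isin` on a small hand-made dataframe (`toy` and the variable names are assumptions for illustration). It also shows that `~` in front of `isin` selects the rows that are not in the list.

```python
import pandas as pd

# Toy stand-in for the big mart data (four illustrative rows)
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDX07', 'NCD19'],
    'Outlet_Establishment_Year': [1999, 2009, 1998, 1987],
})

years = [1987, 1988, 1999]
year_col = toy['Outlet_Establishment_Year']

# isin builds the same boolean mask as the chained | conditions
in_years = year_col.isin(years)
chained = (year_col == 1987) | (year_col == 1988) | (year_col == 1999)
print((in_years == chained).all())   # True

selected = toy[in_years]     # rows whose year is in the list
excluded = toy[~in_years]    # rows whose year is NOT in the list

print(selected.index.tolist())   # [0, 3]
print(excluded.index.tolist())   # [1, 2]
```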


## HOW TO SELECT SPECIFIC COLUMNS FROM A DATAFRAME?
- Pass a list of the column names that you need inside the square brackets.

In [23]:
# list of columns
select_columns = ['Item_Identifier', 'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size']

#dataframe with specific columns
data[select_columns]

Unnamed: 0,Item_Identifier,Item_MRP,Outlet_Establishment_Year,Outlet_Size
0,FDA15,249.8092,1999,Medium
1,DRC01,48.2692,2009,Medium
2,FDN15,141.6180,1999,Medium
3,FDX07,182.0950,1998,
4,NCD19,53.8614,1987,High
...,...,...,...,...
8518,FDF22,214.5218,1987,High
8519,FDS36,108.1570,2002,
8520,NCJ29,85.1224,2004,Small
8521,FDN46,103.1332,2009,Medium
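
One detail worth knowing is the difference between passing a single column name and passing a list. The sketch below, on a hand-made dataframe (`toy` is an assumption for illustration), shows that a bare name returns a Series while a list, even a list with one name, returns a DataFrame.

```python
import pandas as pd

# Toy stand-in for the big mart data (two illustrative rows)
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01'],
    'Item_MRP': [249.8092, 48.2692],
    'Outlet_Size': ['Medium', 'Medium'],
})

as_series = toy['Item_MRP']                       # single name -> Series
as_frame = toy[['Item_MRP']]                      # list with one name -> DataFrame
two_cols = toy[['Item_Identifier', 'Item_MRP']]   # list of names -> DataFrame

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(two_cols.shape)             # (2, 2)
```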


## HOW TO SELECT ROWS BASED ON A CONDITION AND VIEW ONLY THE SPECIFIC COLUMNS?

**Using Square Brackets**

In [24]:
# list of specific columns
select_columns = ['Item_Identifier', 'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size']

#filter the data
data[(data.Outlet_Establishment_Year==1987) & (data.Outlet_Size == 'High')][select_columns]

Unnamed: 0,Item_Identifier,Item_MRP,Outlet_Establishment_Year,Outlet_Size
4,NCD19,53.8614,1987,High
6,FDO10,57.6588,1987,High
14,FDF32,196.4426,1987,High
20,FDN22,250.8724,1987,High
27,DRJ59,39.1164,1987,High
...,...,...,...,...
8462,FDQ31,85.9856,1987,High
8466,FDJ32,61.2536,1987,High
8484,DRJ49,129.9652,1987,High
8512,FDR26,178.3028,1987,High


**Using loc**

- With loc, the row condition and the list of columns go inside the same square brackets, separated by a comma.

In [25]:
data.loc[(data.Outlet_Establishment_Year == 1987) & (data.Outlet_Size == 'High'), select_columns]

Unnamed: 0,Item_Identifier,Item_MRP,Outlet_Establishment_Year,Outlet_Size
4,NCD19,53.8614,1987,High
6,FDO10,57.6588,1987,High
14,FDF32,196.4426,1987,High
20,FDN22,250.8724,1987,High
27,DRJ59,39.1164,1987,High
...,...,...,...,...
8462,FDQ31,85.9856,1987,High
8466,FDJ32,61.2536,1987,High
8484,DRJ49,129.9652,1987,High
8512,FDR26,178.3028,1987,High


## HOW TO SELECT THE COLUMNS WITH SPECIFIC DATA TYPES?

- **Let's first check the data type of each column using the dtypes attribute.**

In [26]:
# check the data types of the columns
data.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

## Select the columns with object data type (categorical variables) only

In [27]:
# select the columns where data type = object

data.select_dtypes('object')

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1
3,FDX07,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store
4,NCD19,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1
...,...,...,...,...,...,...,...
8518,FDF22,Low Fat,Snack Foods,OUT013,High,Tier 3,Supermarket Type1
8519,FDS36,Regular,Baking Goods,OUT045,,Tier 2,Supermarket Type1
8520,NCJ29,Low Fat,Health and Hygiene,OUT035,Small,Tier 2,Supermarket Type1
8521,FDN46,Regular,Snack Foods,OUT018,Medium,Tier 3,Supermarket Type2


## Select the columns with int64 data type

In [29]:
# select the columns where data type = int64
# for the float64 columns, use data.select_dtypes('float64') instead
data.select_dtypes('int64')

Unnamed: 0,Outlet_Establishment_Year
0,1999
1,2009
2,1999
3,1998
4,1987
...,...
8518,1987
8519,2002
8520,2004
8521,2009
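
`select_dtypes` also accepts the `include` and `exclude` keyword arguments, and the generic string `'number'` matches every numeric type at once. The sketch below uses a small hand-made dataframe (`toy` is an assumption for illustration) to pick all numeric columns in one call and then everything that is not numeric.

```python
import pandas as pd

# Toy stand-in for the big mart data (two illustrative rows)
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01'],
    'Item_Weight': [9.3, 5.92],
    'Outlet_Establishment_Year': [1999, 2009],
})

# 'number' covers both the float and the integer column
numeric = toy.select_dtypes(include='number')

# exclude keeps every column that is NOT of the given type
non_numeric = toy.select_dtypes(exclude='number')

print(numeric.columns.tolist())       # ['Item_Weight', 'Outlet_Establishment_Year']
print(non_numeric.columns.tolist())   # ['Item_Identifier']
```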


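## df[] vs df.loc[]: When To Use Which?

- df.loc[] takes the row condition and the columns in a single call, df.loc[rows, columns], instead of two sets of square brackets.
- Both have similar performance in terms of execution time.
- df.loc[] also works with label-based subsetting.
- Filtering with df[] and then selecting columns with a second [] is chained indexing. Assigning values through a chained selection may not change the original dataframe, so as a good practice it is recommended to use df.loc[].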
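
The comparison can be sketched on a small hand-made dataframe (`toy`, `cond`, `cols`, and the replacement value `'Large'` are all assumptions for illustration). For reading, both forms return the same rows and columns. For writing, a single `.loc[]` call with the condition and the column name updates the original dataframe directly.

```python
import pandas as pd

# Toy stand-in for the big mart data (three illustrative rows)
toy = pd.DataFrame({
    'Item_Identifier': ['NCD19', 'DRC01', 'FDO10'],
    'Outlet_Establishment_Year': [1987, 2009, 1987],
    'Outlet_Size': ['High', 'Medium', 'High'],
})

cond = (toy['Outlet_Establishment_Year'] == 1987) & (toy['Outlet_Size'] == 'High')
cols = ['Item_Identifier', 'Outlet_Size']

# Reading: two sets of brackets and a single .loc call give the same result
via_brackets = toy[cond][cols]
via_loc = toy.loc[cond, cols]
print(via_brackets.equals(via_loc))   # True

# Writing: one .loc call selects the rows and the column, then assigns
toy.loc[cond, 'Outlet_Size'] = 'Large'   # 'Large' is a made-up value
print(toy['Outlet_Size'].tolist())       # ['Large', 'Medium', 'Large']
```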
