# Video: Querying Data Frames in Pandas

This video shows how to query pandas data frames for rows of interest using declarative queries.

In [None]:
import pandas as pd
import sys

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/abalone.tsv", sep="\t")
abalone

Unnamed: 0,Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
4,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


Script:
* The pandas data frame query method is a convenient way to filter data.
* The query method takes in a string with a Python-like expression, and returns rows where that expression is true.
* This is very similar to a WHERE clause in a database query, but using Python instead of SQL syntax.
* Let's look at an example.

In [None]:
abalone.query("Sex in ('M', 'F')")

Unnamed: 0,Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
6,F,0.530,0.415,0.150,0.7775,0.2370,0.1415,0.3300,20
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10
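
For comparison, here is a minimal sketch of the same filter written both as a query string and as an ordinary Boolean mask. The `toy` frame is a hypothetical stand-in built by hand from the first five rows shown above, so the block runs without downloading the data.

```python
import pandas as pd

# Hypothetical miniature frame: the first five abalone rows shown above, retyped by hand.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Rings": [15, 7, 9, 10, 7],
})

# Declarative form: the filter is a string expression, much like a WHERE clause.
by_query = toy.query("Sex in ('M', 'F')")

# Boolean-mask form: build a True/False Series and index with it.
by_mask = toy[toy["Sex"].isin(["M", "F"])]

# Both approaches select the same rows.
print(by_query.equals(by_mask))
```

The query string tends to read more cleanly once several conditions are combined, while the mask form keeps everything in plain Python.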


Script:
* That expression is pretty short; M and F are not very long.
* If you are searching for longer values or lots of values, or you are comparing to a dynamically created list, you can write the query referring to Python variables.

In [None]:
target_sexes = ['M', 'F']
abalone.query("Sex in @target_sexes")

Unnamed: 0,Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
1,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
6,F,0.530,0.415,0.150,0.7775,0.2370,0.1415,0.3300,20
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10
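
The "dynamically created list" case can be sketched as follows. This is an illustration only: the `toy` frame is a hand-typed copy of the first five rows above, and the rule for building `target_sexes` (everything except the infant code `'I'`) is an assumption made for the example.

```python
import pandas as pd

# Hypothetical miniature frame: the first five abalone rows shown above, retyped by hand.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Rings": [15, 7, 9, 10, 7],
})

# Build the list of values at run time instead of hard-coding it in the query string.
target_sexes = [sex for sex in toy["Sex"].unique() if sex != "I"]

# The @ prefix tells query to look up a Python variable rather than a column.
adults = toy.query("Sex in @target_sexes")
print(adults)
```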


Script:
* This is a trivially short example, but you can see how that would be easier to read with a long list.
* You can write much longer query expressions, and most kinds of Python expressions are available.

In [None]:
abalone.query("Sex in @target_sexes and (Whole_weight > 0.5 or Height > 0.2)")

Unnamed: 0,Sex,Length,Diameter,Height,Whole_weight,Shucked_weight,Viscera_weight,Shell_weight,Rings
0,M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.1500,15
2,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
3,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
6,F,0.530,0.415,0.150,0.7775,0.2370,0.1415,0.3300,20
7,F,0.545,0.425,0.125,0.7680,0.2940,0.1495,0.2600,16
...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4173,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4174,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4175,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10
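
To make the grouping in that expression explicit, here is a sketch of the same compound filter written as a Boolean mask. The `toy` frame is again a hand-typed copy of the first five rows of the data, used here as a stand-in for the full data set.

```python
import pandas as pd

# Hypothetical miniature frame: the first five abalone rows shown earlier, retyped by hand.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Height": [0.095, 0.090, 0.135, 0.125, 0.080],
    "Whole_weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
})
target_sexes = ["M", "F"]

# Query form: and/or with parentheses, as in the cell above.
by_query = toy.query("Sex in @target_sexes and (Whole_weight > 0.5 or Height > 0.2)")

# Mask form: & and | need parentheses around every comparison.
by_mask = toy[
    toy["Sex"].isin(target_sexes)
    & ((toy["Whole_weight"] > 0.5) | (toy["Height"] > 0.2))
]

print(by_query.equals(by_mask))
```

In the mask form, forgetting the parentheses around a comparison changes the meaning because `&` and `|` bind more tightly than `>`; the query string avoids that trap.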


Script:
* Roughly speaking, the query expression needs to evaluate to a Boolean value for each row.
* If the query is just a number, or even a column name by itself, pandas interprets the result as an index label to look up rather than as a filter.

In [None]:
abalone.query("4174")

Sex                    M
Length               0.6
Diameter           0.475
Height             0.205
Whole_weight       1.176
Shucked_weight    0.5255
Viscera_weight    0.2875
Shell_weight       0.308
Rings                  9
Name: 4174, dtype: object
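
When the goal really is to fetch a row by its label, the `.loc` accessor states that intent directly and does not go through the expression parser. A minimal sketch, assuming a hand-typed `toy` frame with the default integer index:

```python
import pandas as pd

# Hypothetical miniature frame: the first five abalone rows shown earlier, retyped by hand.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Rings": [15, 7, 9, 10, 7],
})

# Explicit label lookup: returns the row labeled 3 as a Series.
row = toy.loc[3]
print(row["Sex"], row["Rings"])
```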

Script:
* You should be careful with such usage.
* There is a lot of potential for ambiguity if you plug in a label that also looks like an expression.
* The next video will say more about security risks from plugging in outside values like that.
* Returning to typical usage, the query method returns an ordinary data frame, so we can do the usual computations and analysis on the result.

In [None]:
abalone.query("Sex in @target_sexes and (Whole_weight > 0.5 or Height > 0.2)")["Rings"].mean()

11.222448149654332

Script:
* This query selected adult abalone (the other choice was infant), and then filtered to heavier and taller abalone.
* As a non-biologist, I would speculate that these were older abalone overall.
* And the data set description told me the original scientists were trying to predict the number of shell rings as a proxy for age.
* Let's compare to the overall data set.

In [None]:
abalone["Rings"].mean()

9.933684462532918

Script:
* And my amateur biology guess was confirmed.
* There is a lot you can do with querying data frames since you can write almost any expression to describe which rows you want.
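
As one illustration of that flexibility, a query can do arithmetic between columns before comparing. The sketch below uses a hand-typed copy of the first five rows of the data; the 0.43 threshold is arbitrary and chosen only for the example.

```python
import pandas as pd

# Hypothetical miniature frame: the first five abalone rows shown earlier, retyped by hand.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "I"],
    "Whole_weight": [0.5140, 0.2255, 0.6770, 0.5160, 0.2050],
    "Shucked_weight": [0.2245, 0.0995, 0.2565, 0.2155, 0.0895],
})

# Keep rows where the shucked meat is more than 43% of the whole weight.
meaty = toy.query("Shucked_weight / Whole_weight > 0.43")
print(meaty)
```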