# Specifying Intent in Lux

Lux provides a flexible language for communicating your analysis intent to the system, so that Lux can provide better and more relevant recommendations to you. In this tutorial, we will see different ways of specifying the intent, including the attributes and values that you are interested or not interested in, enumeration specifiers, as well as any constraints on the visualization encoding.

The primary way to set the current intent associated with a dataframe is by setting the `intent` property of the dataframe and providing a list of specifications as input. We will first describe how intent can be specified through convenient shorthand descriptions as string inputs, then we will describe advanced usage via the `lux.Clause` object.


## Basic descriptions

We continue with our college dataset example from earlier:

In [14]:
import pandas as pd
import lux

In [18]:
# Collecting basic usage statistics for Lux (For more information, see: https://tinyurl.com/logging-consent)
lux.logger = True # Remove this line if you do not want your interactions recorded

In [19]:
df = pd.read_csv("../data/college.csv")
lux.config.default_display = "lux"


### Specifying attributes of interest

You can indicate that you are interested in an attribute, let's say `AverageCost`.

In [20]:
df.intent = ['AverageCost']
df



Unnamed: 0,Name,PredominantDegree,HighestDegree,FundingModel,Region,Geography,AdmissionRate,ACTMedian,SATAverage,AverageCost,Expenditure,AverageFacultySalary,MedianDebt,AverageAgeofEntry,MedianFamilyIncome,MedianEarnings
0,Alabama A & M University,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.8989,17,823,18888,7459,7079,19500.0,20.629999,29039.0,27000
1,University of Alabama at Birmingham,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.8673,25,1146,19990,17208,10170,16250.0,22.670000,34909.0,37200
2,University of Alabama in Huntsville,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.8062,26,1180,20306,9352,9341,16500.0,23.190001,39766.0,41500
3,Alabama State University,Bachelor's,Graduate,Public,Southeast,Mid-size City,0.5125,17,830,17400,7393,6557,15854.5,20.889999,24029.5,22400
4,The University of Alabama,Bachelor's,Graduate,Public,Southeast,Small City,0.5655,26,1171,26717,9817,9605,17750.0,20.770000,58976.0,39200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1289,University of Connecticut-Avery Point,Bachelor's,Graduate,Public,New England,Mid-size Suburb,0.5940,24,1020,12946,11730,14803,18983.0,20.120001,86510.0,49700
1290,University of Connecticut-Stamford,Bachelor's,Graduate,Public,New England,Mid-size City,0.4107,21,1017,13028,4958,14803,18983.0,20.120001,86510.0,49700
1291,California State University-Channel Islands,Bachelor's,Graduate,Public,Far West,Mid-size Suburb,0.6443,20,954,22570,12026,8434,12500.0,24.850000,32103.0,35800
1292,DigiPen Institute of Technology,Bachelor's,Graduate,Private For-Profit,Far West,Small City,0.6635,28,1225,37848,5998,7659,19000.0,21.209999,68233.0,72800


You might be interested in multiple attributes. For instance, you might want to look at both `AverageCost` and `FundingModel`. When multiple clauses are specified, Lux applies all the clauses in the intent and searches for visualizations that are relevant to `AverageCost` **and** `FundingModel`.


In [21]:
df.intent = ['AverageCost','FundingModel']
df



Let's say that in addition to `AverageCost`, you are interested in looking at a list of attributes that are related to different financial measures, such as `Expenditure` or `MedianDebt`, and how they break down with respect to `FundingModel`.

You can specify a list of desired attributes separated by the `|` symbol, which indicates an `OR` relationship between the attributes in the list. If multiple clauses are specified, Lux automatically creates combinations of the specified attributes.
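To make the combination behavior concrete, here is a minimal sketch in plain Python that does not use Lux at all. The helper names `expand_clause` and `enumerate_combinations` are hypothetical and only illustrate the idea of a cross product over the alternatives in each clause; Lux's own enumeration may differ in its details.

```python
from itertools import product

def expand_clause(clause):
    """Split a '|'-separated shorthand string into its alternatives.

    A list of attribute names is accepted as-is (hypothetical helper, not a Lux API).
    """
    if isinstance(clause, str):
        return clause.split("|")
    return list(clause)

def enumerate_combinations(intent):
    """Return one tuple of attributes per candidate visualization."""
    return list(product(*(expand_clause(c) for c in intent)))

intent = ["AverageCost|Expenditure|MedianDebt|MedianEarnings", "FundingModel"]
combinations = enumerate_combinations(intent)
for combo in combinations:
    print(combo)
```

Under this reading, four financial measures combined with one `FundingModel` clause yield four attribute pairs, one per candidate visualization.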

In [22]:
possible_attributes = "AverageCost|Expenditure|MedianDebt|MedianEarnings"
df.intent = [possible_attributes,"FundingModel"]
df



Alternatively, you could also provide the specification as a list: 

In [23]:
possible_attributes = ['AverageCost','Expenditure','MedianDebt','MedianEarnings']
df.intent = [possible_attributes,"FundingModel"]
df



### Specifying values of interest

In Lux, you can also specify particular values corresponding to subsets of the data that you might be interested in. For example, you may be interested in only colleges located in New England. 


In [24]:
df.intent = ["Region=New England"]
df



You can also specify multiple values of interest using the same `|` notation that we saw earlier. For example, you can compare the median debt of students from colleges in New England, Southeast, and Far West.
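As a rough illustration of how such a shorthand string can be read, the sketch below splits a description into an attribute, a filter operator, and a list of values. The function `parse_value_clause` is a hypothetical helper written for this tutorial, not Lux's parser, and it assumes that `=` is the only operator in play.

```python
def parse_value_clause(description):
    """Split 'Attribute=v1|v2' into (attribute, filter_op, values).

    Hypothetical helper: only the '=' operator is handled here.
    """
    attribute, sep, raw_values = description.partition("=")
    if not sep:
        # No '=' present: treat the whole string as a plain attribute clause.
        return attribute, None, []
    return attribute, "=", raw_values.split("|")

parsed = parse_value_clause("Region=New England|Southeast|Far West")
print(parsed)
```

Each of the three values then corresponds to one filtered view of `MedianDebt`.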

In [25]:
df.intent = ["MedianDebt","Region=New England|Southeast|Far West"]
df



Note that since three different visualizations are generated based on the intent, we only display these possible visualizations, rather than the recommendations.

In [26]:
df.clear_intent()


#### Note: Applying Filters vs. Expressing Filter Intent

You might be wondering what the difference is between specifying values of interest through the intent in Lux and applying a filter directly on the dataframe through Pandas. When you apply the filter directly via Pandas, Lux is not aware of the filter values passed to Pandas, so these values of interest will not be reflected in the recommendations.

In [27]:
df[df["Region"]=="New England"]

Unnamed: 0,Name,PredominantDegree,HighestDegree,FundingModel,Region,Geography,AdmissionRate,ACTMedian,SATAverage,AverageCost,Expenditure,AverageFacultySalary,MedianDebt,AverageAgeofEntry,MedianFamilyIncome,MedianEarnings
137,University of Bridgeport,Bachelor's,Graduate,Private,New England,Mid-size City,0.6385,20,923,45640,7757,7160,17500.0,24.610001,36730.0,37600
138,Central Connecticut State University,Bachelor's,Graduate,Public,New England,Large Suburb,0.6413,22,1006,20372,8990,8513,16852.5,21.230000,66799.5,41200
139,University of Connecticut,Bachelor's,Graduate,Public,New England,Large Suburb,0.5366,28,1233,25786,20908,11500,18983.0,20.120001,86510.0,49700
140,Eastern Connecticut State University,Bachelor's,Graduate,Public,New England,Fringe Town,0.6453,22,1017,22633,8554,8091,17500.0,21.110001,76549.0,39500
141,University of Hartford,Bachelor's,Graduate,Private,New England,Mid-size City,0.6173,23,1053,47368,12566,8178,19944.0,20.150000,85192.0,39400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1171,Vermont Technical College,Associate,Bachelor's,Public,New England,Remote Rural,0.8257,19,933,22937,14564,6170,13000.0,24.809999,48966.0,40000
1172,University of Vermont,Bachelor's,Graduate,Public,New England,Small City,0.7755,27,1190,27788,16066,9562,18878.0,20.240000,77614.0,39400
1288,University of Connecticut-Tri-Campus,Bachelor's,Graduate,Public,New England,Mid-size Suburb,0.4207,22,1015,13099,3882,12909,18983.0,20.120001,86510.0,49700
1289,University of Connecticut-Avery Point,Bachelor's,Graduate,Public,New England,Mid-size Suburb,0.5940,24,1020,12946,11730,14803,18983.0,20.120001,86510.0,49700


Specifying the value through the intent tells Lux that you are interested in colleges in New England. In the resulting Filter action, we see that Lux suggests visualizations for other `Region`s as recommendations.

In [None]:
df.intent = ["Region=New England"]
df

So while both approaches apply the filter to the specified visualization, the subtle difference between *applying* a filter and *indicating* a filter intent leads to different sets of resulting recommendations. In general, we encourage using Pandas for filtering if you are certain about applying the filter (e.g., a cleaning operation deleting a specific data subset), and specifying the intent through Lux if you might want to experiment and change aspects related to the filter in your analysis.
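The distinction can be sketched in plain Python, without Lux or Pandas. The three rows below are taken from the college table shown earlier; the dictionary named `intent` is only an illustrative stand-in for how a filter intent could be kept as metadata, not Lux's internal representation.

```python
rows = [
    {"Name": "Alabama State University", "Region": "Southeast", "MedianDebt": 15854.5},
    {"Name": "University of Connecticut-Stamford", "Region": "New England", "MedianDebt": 18983.0},
    {"Name": "DigiPen Institute of Technology", "Region": "Far West", "MedianDebt": 19000.0},
]

# Applying a filter: rows from the other regions are gone from the result,
# so nothing downstream can suggest them.
filtered = [r for r in rows if r["Region"] == "New England"]
regions_after_filter = sorted({r["Region"] for r in filtered})

# Expressing a filter intent: the data stays whole and the filter is kept
# as metadata, so the remaining values can still be offered as alternatives.
intent = {"attribute": "Region", "filter_op": "=", "value": "New England"}
alternatives = sorted({r[intent["attribute"]] for r in rows} - {intent["value"]})

print(regions_after_filter)
print(alternatives)
```

With the filter applied, only `New England` survives; with the intent recorded separately, `Far West` and `Southeast` remain available as candidates for a Filter-style recommendation.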

### Advanced intent specification through `lux.Clause`

The basic string-based description provides a convenient way of specifying the intent. However, not all specifications can be expressed through string-based descriptions; more complex specifications can be expressed through the `lux.Clause` object. The two modes of specification are essentially equivalent, with the Parser parsing the `description` field in the `lux.Clause` object.

#### Specifying attributes or values of interest

To see an example of how `lux.Clause` is used, we rewrite our earlier example of expressing interest in `AverageCost` as:

In [28]:
df.intent = [lux.Clause(attribute='AverageCost')]


Similarly, we can use `lux.Clause` to specify values of interest:

In [29]:
df.intent = ['MedianDebt',
                lux.Clause(attribute='Region',filter_op='=', value=['New England','Southeast','Far West'])]


Both the `attribute` and `value` fields can take either a single string or a list of strings to specify items of interest. This example also demonstrates how we can intermix the `lux.Clause` specification with the basic string-based specification for convenience.
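One way to picture this flexibility is a small container that accepts either form and normalizes it on access. The `ClauseSketch` class and the `normalize` function below are hypothetical stand-ins written in plain Python; they are not the `lux.Clause` implementation and only mirror the field names used in this tutorial.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

StrOrList = Union[str, List[str]]

@dataclass
class ClauseSketch:
    """Hypothetical stand-in for a clause; not the lux.Clause class."""
    attribute: StrOrList
    filter_op: Optional[str] = None
    value: StrOrList = ""

    def attributes(self) -> List[str]:
        # A single string becomes a one-element list.
        return [self.attribute] if isinstance(self.attribute, str) else list(self.attribute)

    def values(self) -> List[str]:
        # An empty string means that no value was specified.
        if self.value == "":
            return []
        return [self.value] if isinstance(self.value, str) else list(self.value)

def normalize(intent):
    """Wrap bare strings so that every entry is a ClauseSketch."""
    return [c if isinstance(c, ClauseSketch) else ClauseSketch(attribute=c) for c in intent]

mixed = normalize([
    "MedianDebt",
    ClauseSketch(attribute="Region", filter_op="=",
                 value=["New England", "Southeast", "Far West"]),
])
print([c.attributes() for c in mixed])
print(mixed[1].values())
```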

#### Adding constraints to override auto-inferred details

So far, we have seen examples of how Lux takes in a loosely specified intent and automatically fills in many of the details that are required to generate the intended visualizations. There are situations where the user may want to override these auto-inferred values. For example, you might be interested in fixing an attribute to show up on a particular axis, ensuring that an aggregated attribute is summed up instead of averaged by default, or picking a specific bin size for a histogram. Additional properties specified on `lux.Clause` act as constraints on the specified intent.

<ins>Fixing attributes to specific axis channels</ins>

As we saw earlier, when we set `AverageCost` as the intent, Lux generates a histogram with `AverageCost` on the x-axis.
Let's say that we instead want to place `AverageCost` on the y-axis, even though this is unconventional. We would specify this as an additional property that constrains the intent clause.

In [None]:
df.intent = [lux.Clause(attribute='AverageCost', channel='y')]
df

<ins>Changing aggregation function applied</ins>

We can also set constraints on the type of aggregation that is used. For example, `mean` is the default aggregation function for quantitative attributes.

In [32]:
df.intent = ["HighestDegree","AverageCost"]
df



We can override the aggregation function to be `sum` instead. 


In [33]:
df.intent = ["HighestDegree",lux.Clause("AverageCost",aggregation="sum")]
df


The possible aggregation values are the same as the ones supported by the Pandas [agg](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function: either a string shorthand (e.g., "sum", "count", "min", "max", "median") or a numpy aggregation function.


For example, we can change the aggregation function to the peak-to-peak range, i.e., the maximum minus the minimum ([np.ptp](https://numpy.org/doc/stable/reference/generated/numpy.ptp.html)), by passing in the numpy function.
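To see what these aggregation choices compute per group, here is a small sketch in plain Python that uses neither Lux nor Pandas. The `aggregate` helper and the `AGGREGATIONS` table are hypothetical; `peak_to_peak` is a hand-written equivalent of the maximum-minus-minimum range, and the four records are `HighestDegree` and `AverageCost` values from rows of the college table shown earlier.

```python
from collections import defaultdict
from statistics import mean

def peak_to_peak(values):
    """Range of the values: maximum minus minimum."""
    return max(values) - min(values)

# String shorthands mapped to functions (illustrative subset only).
AGGREGATIONS = {"mean": mean, "sum": sum, "ptp": peak_to_peak}

def aggregate(records, how):
    """Group (key, value) pairs by key and aggregate each group.

    `how` is either a string shorthand or a callable, mirroring the two
    forms described above.
    """
    func = AGGREGATIONS[how] if isinstance(how, str) else how
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: func(vals) for key, vals in groups.items()}

records = [
    ("Graduate", 18888),
    ("Graduate", 19990),
    ("Graduate", 20306),
    ("Bachelor's", 22937),
]
by_mean = aggregate(records, "mean")
by_sum = aggregate(records, "sum")
by_range = aggregate(records, peak_to_peak)
print(by_mean, by_sum, by_range)
```

On this tiny sample, the `Graduate` group has a mean of 19728, a sum of 59184, and a peak-to-peak range of 1418, which shows how much the choice of aggregation changes the bar heights.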

In [30]:
import numpy as np
df.intent = ["HighestDegree",lux.Clause("AverageCost",aggregation=np.ptp)]
df


### Specifying wildcards

Let's say that you are interested in *any* attribute with respect to `AverageCost`. Lux supports *wildcards* (based on [CompassQL](https://idl.cs.washington.edu/papers/compassql/)), which specify the enumeration of any possible attributes or values that satisfy the provided constraints.

In [None]:
df.intent = ['AverageCost',lux.Clause('?')]
df

The space of enumeration can be narrowed based on constraints. For example, you might only be interested in looking at scatterplots of `AverageCost` with respect to quantitative attributes. This narrows the 15 visualizations that we had earlier to only 9 visualizations now, involving only quantitative attributes.
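The effect of the constraint can be sketched in plain Python as a filtered enumeration over the columns. The `expand_wildcard` helper is hypothetical, and the data types in `COLUMN_TYPES` are assumptions read off the column names and sample values in the table above; the types that Lux actually infers may differ.

```python
# Assumed data types for the college columns (not taken from Lux).
COLUMN_TYPES = {
    "Name": "nominal",
    "PredominantDegree": "nominal",
    "HighestDegree": "nominal",
    "FundingModel": "nominal",
    "Region": "nominal",
    "Geography": "nominal",
    "AdmissionRate": "quantitative",
    "ACTMedian": "quantitative",
    "SATAverage": "quantitative",
    "AverageCost": "quantitative",
    "Expenditure": "quantitative",
    "AverageFacultySalary": "quantitative",
    "MedianDebt": "quantitative",
    "AverageAgeofEntry": "quantitative",
    "MedianFamilyIncome": "quantitative",
    "MedianEarnings": "quantitative",
}

def expand_wildcard(fixed, column_types, data_type=None):
    """Pair `fixed` with every other column, optionally restricted by type."""
    return [
        (fixed, col)
        for col, kind in column_types.items()
        if col != fixed and (data_type is None or kind == data_type)
    ]

all_pairs = expand_wildcard("AverageCost", COLUMN_TYPES)
quantitative_pairs = expand_wildcard("AverageCost", COLUMN_TYPES, data_type="quantitative")
print(len(all_pairs), len(quantitative_pairs))
```

Under these assumed types, the unconstrained wildcard pairs `AverageCost` with 15 other columns and the quantitative constraint leaves 9, in line with the counts quoted above.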

In [None]:
df.intent = ['AverageCost',lux.Clause('?',data_type='quantitative')]
df

The enumeration specifier can also be placed on the value field. For example, you might be interested in looking at how the distribution of `AverageCost` varies for all possible values of `Geography`.


In [None]:
df.intent = ['AverageCost','Geography=?']

Or, equivalently, using `lux.Clause`:

In [None]:
df.intent = ['AverageCost',lux.Clause(attribute='Geography',filter_op='=',value='?')]
df