Specify treatment of categorical value #78

rolyp · 2020-09-22T16:44:32Z

See notebook. I will add a summary for this.

So far, we have the following use cases:

- handling anomalous values misclassified as categorical values (e.g., 'Error' in ['A', 'B', 'Error']),
  - find a real-world example for this problem
- merging categorical values when we have string variability issues (e.g., MySQL and mysql labeled as two separate categorical values),
  - add the DBMS of the USP05 dataset to notebooks
- misclassifying categorical values as anomalies due to unsupported characters (e.g., Central_Hawkes_Bay and Central_Hawkes_Bay(coastal), where '(' and ')' are not supported by the PFSM for the string type).
  - add the Locality column of the Eucalyptus dataset to notebooks
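To make the second use case concrete, here is a minimal sketch of merging values that differ only by letter case. It is not part of ptype; `merge_case_variants` is a hypothetical helper, and picking the most frequent spelling as the canonical one is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def merge_case_variants(values):
    """Replace each value with the most frequent spelling among its case variants."""
    # Group spellings that are equal once case is ignored.
    groups = defaultdict(Counter)
    for v in values:
        groups[v.casefold()][v] += 1
    # Pick the most frequent spelling in each group as the canonical one.
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [canonical[v.casefold()] for v in values]


column = ["MySQL", "mysql", "MySQL", "Oracle", "oracle", "Oracle"]
merged = merge_case_variants(column)
print(sorted(set(merged)))
```

Other kinds of string variability (whitespace, punctuation, abbreviations) would need a different normalisation key than `casefold()`.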

Taha writes:

Assume that we have three unique values such as ['A', 'B', 'error'] in a data column. ptype would predict its column type as string and its canonical type as categorical (assume there is a high number of rows so that our additional features such as uniqueness ratio work). Therefore, our model would treat these three values as categorical values (with probability of 1.0 for each unique value that comes from the posterior row type distribution). But let’s say the user tells us that ‘error’ is not a categorical value.
My question is how we should design the corresponding interaction. I think that one reasonable approach is to overwrite the posterior row type distribution for ‘error’ such that it is labelled as ‘non-type’ rather than ‘type’. We can then store this constraint and overwrite the posterior row type distribution when the inference is re-run.

One way of thinking about it might be to imagine what the use case would be like if the user already knew in advance that the column contained error as a value and that they didn’t want it to be treated as a categorical value. I.e., how would they instruct ptype to analyse the column while treating error as anomalous or missing?

If we can answer this question, then I think the same method could probably apply in the situation where they run ptype first, discover that it has “misclassified” error as a categorical value, and then want to correct it.

(“Overwriting” the output of ptype and expecting ptype to somehow be able to respond to that is probably best avoided as an interaction model. ptype should have a well-defined behaviour for a given input, so the question is how to generalise the input to ptype to allow the user to specify constraints such as “treat error in a certain way”.)

Taha responded:

Perhaps we can directly overwrite the probabilities assigned by the PFSMs. For example, we can let the anomaly PFSM be the only PFSM that assigns a non-zero probability to “error”. This probably wouldn’t interfere with the column type inference, as “error” would get the same “weight” under each column type.

So, the column type and canonical type would still be string and categorical respectively. Also, “error” would be labeled as an anomaly using the row type distribution.
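A rough sketch of this idea, under stated assumptions: the row type posterior for a value is taken to be proportional to a prior weight times the probability each PFSM assigns to the value. The names `row_posterior` and `force_anomaly` are hypothetical, and the numbers are made up for illustration; none of this is ptype's actual internals.

```python
ROW_TYPES = ("type", "missing", "anomaly")


def row_posterior(likelihoods, prior):
    """Normalise prior * likelihood over the three row types."""
    unnorm = {r: prior[r] * likelihoods[r] for r in ROW_TYPES}
    total = sum(unnorm.values())
    return {r: p / total for r, p in unnorm.items()}


def force_anomaly(likelihoods):
    """Zero out every probability except the one from the anomaly machine."""
    return {r: (p if r == "anomaly" else 0.0) for r, p in likelihoods.items()}


# Made-up numbers for the value "error" in a string column.
prior = {"type": 0.98, "missing": 0.01, "anomaly": 0.01}
likelihoods = {"type": 1e-3, "missing": 0.0, "anomaly": 1e-9}

before = row_posterior(likelihoods, prior)
after = row_posterior(force_anomaly(likelihoods), prior)
print(before["type"], after["anomaly"])
```

With the override in place, all of the posterior mass for "error" ends up on the anomaly row type, whatever the original probabilities were, as long as the anomaly machine assigns it a non-zero probability.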

tahaceritli · 2020-09-23T09:46:27Z

I notice that we were considering letting users interact through changing schemas (see #62 for details):

schema = ptype.fit_schema(df)
schema
{
'col_1': ('Int64',),
'col_2': ('Categorical', 'A', 'B', 'ERR'),
'col_3': ('String',),
'col_4': ('Float',),
}

schema['col_2'] = ('Categorical', 'A', 'B')
typed_df = ptype.transform_schema(df, schema)

Our current schema looks as follows:
schema['col_2'] = {
...
"normal_values": ...,
"missing_values": ...,
"anomalies": ...,
"categorical_values":...
...
}

So, 'schema['col_2'] = ('Categorical', 'A', 'B')' would correspond to schema['col_2']['categorical_values'] = ['A', 'B']. But we would also need to update schema['col_2']['normal_values'], schema['col_2']['missing_values'] and schema['col_2']['anomalies'] so that they stay consistent. In that case, users would need to specify which values are normal, which are missing and which are anomalies. We can then overwrite the probabilities assigned by the PFSMs, if we want.
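One way to keep the four fields consistent is to derive all of them from a single user specification rather than editing them one by one. The sketch below assumes that, for a categorical column, the normal values coincide with the categorical values, and that any value that is neither categorical nor missing is an anomaly. `build_column_entry` is a hypothetical helper, not an existing ptype function.

```python
def build_column_entry(unique_values, categorical_values, missing_values=()):
    """Partition a column's unique values into consistent schema fields."""
    categorical_set = set(categorical_values)
    missing_set = set(missing_values)
    overlap = categorical_set & missing_set
    if overlap:
        raise ValueError(f"values marked both categorical and missing: {sorted(overlap)}")

    categorical = [v for v in unique_values if v in categorical_set]
    missing = [v for v in unique_values if v in missing_set]
    # Assumption: everything the user did not account for is anomalous.
    anomalies = [v for v in unique_values
                 if v not in categorical_set and v not in missing_set]
    return {
        "normal_values": categorical,
        "missing_values": missing,
        "anomalies": anomalies,
        "categorical_values": categorical,
    }


entry = build_column_entry(["A", "B", "ERR", "NA"], ["A", "B"], missing_values=["NA"])
print(entry)
```

Here dropping 'ERR' from the categorical values is enough to move it into the anomalies, so the user never has to edit more than one list.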

rolyp · 2020-09-23T10:05:37Z

Let’s discuss this. I think we should be careful with thinking of Ptype as “bidirectional”, i.e. a computation whose output we can meaningfully manipulate and then “repair” so that all components of the output are consistent. Instead, we need to make it possible to configure how Ptype runs so that its inference produces the desired output.

The workflow would be similar: the user runs Ptype once to find out its first guess, then they run Ptype again with different settings (perhaps passing it a schema derived from a previous run as input, for example) until they achieve the desired result. The key point is that the computation is always “feed-forward”: the user iterates until they get the inference they want, modifying the inputs to Ptype each time, but at no point do we attempt any complex/error-prone “schema repair”.
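The feed-forward workflow can be illustrated with a toy stand-in for inference. `infer_row_labels` below is hypothetical and far simpler than ptype; the point is only that it is a pure function of the column plus user-supplied constraints, so correcting a result means re-running with different inputs rather than patching the output.

```python
def infer_row_labels(values, anomalies=frozenset(), missing=frozenset()):
    """Toy inference: label each unique value, honouring user constraints."""
    labels = {}
    for v in dict.fromkeys(values):  # unique values, in first-seen order
        if v in anomalies:
            labels[v] = "anomaly"
        elif v in missing:
            labels[v] = "missing"
        else:
            labels[v] = "categorical"
    return labels


column = ["A", "B", "error", "A", "B"]

# First run: no constraints, so "error" is treated as a categorical value.
first_guess = infer_row_labels(column)

# Second run: same data, with the constraint passed as an input.
second_run = infer_row_labels(column, anomalies=frozenset({"error"}))
print(first_guess["error"], second_run["error"])
```

Because the output depends only on the inputs, the same constraints always reproduce the same result, and there is no stored output that could drift out of sync.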

rolyp added task:core-api task:use-cases type:feature and removed task:core-api labels Sep 22, 2020

rolyp changed the title ~~Customise default treatment of categorical value~~ Specify treatment of categorical value Sep 23, 2020

rolyp removed the type:feature label Sep 23, 2020
