
Specify treatment of categorical value #78

Open · 6 tasks
rolyp opened this issue Sep 22, 2020 · 2 comments

Comments

rolyp (Collaborator) commented Sep 22, 2020

See the notebook. I will add a summary for this.

We have the following use cases so far:

  • handling anomalous values misclassified as categorical values (e.g., 'Error' in ['A', 'B', 'Error'])
    • find a real-world example for this problem
  • merging categorical values when we have string-variability issues (e.g., MySQL and mysql labelled as two separate categorical values)
    • add the DBMS column of the USP05 dataset to notebooks
  • misclassifying categorical values as anomalies due to unsupported characters (e.g., Central_Hawkes_Bay and Central_Hawkes_Bay(coastal), where '(' and ')' are not supported by the PFSM for the string type)
    • add the Locality column of the Eucalyptus dataset to notebooks

Taha writes:

Assume that we have three unique values such as ['A', 'B', 'error'] in a data column. ptype would predict its column type as string and its canonical type as categorical (assume there is a high number of rows, so that our additional features such as the uniqueness ratio work). Therefore, our model would treat these three values as categorical values (each unique value receiving a probability of 1.0 from the posterior row type distribution). But let’s say the user tells us that ‘error’ is not a categorical value.
My question is how we should design the corresponding interaction. I think one reasonable approach is to overwrite the posterior row type distribution for ‘error’ so that it is labelled as ‘non-type’ rather than ‘type’. We can then store this constraint and overwrite the posterior row type distribution whenever inference is re-run.
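
A minimal sketch of what storing and re-applying such a constraint could look like (assuming a hypothetical posterior table keyed by column and value; none of the names below are part of ptype’s actual API):

# Hypothetical sketch only: `posteriors` maps column -> value -> row-type
# distribution; the constraint store and function names are illustrative.

constraints = {}  # (column, value) -> forced row-type label

def mark_as_non_type(column, value):
    """Record the user's instruction that `value` is not a valid type member."""
    constraints[(column, value)] = "non-type"

def apply_constraints(posteriors):
    """Overwrite the posterior row type distribution for each constrained value.

    Intended to be called again after each re-run of inference, so the
    stored constraints persist across runs."""
    for (column, value), label in constraints.items():
        posteriors[column][value] = {label: 1.0}
    return posteriors

mark_as_non_type("col_2", "error")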

One way of thinking about it might be to imagine what the use case would look like if the user already knew in advance that the column contained error as a value and didn’t want it to be treated as a categorical value. I.e., how would they instruct ptype to analyse the column while treating error as anomalous or missing?

If we can answer this question, then I think the same method could probably apply in the situation where they run ptype first, discover that it has “misclassified” error as a categorical value, and then want to correct it.

(“Overwriting” the output of ptype and expecting ptype to somehow be able to respond to that is probably best avoided as an interaction model. ptype should have a well-defined behaviour for a given input, so the question is how to generalise the input to ptype to allow the user to specify constraints such as “treat error in a certain way”.)
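
For example (purely illustrative; the keyword argument below is hypothetical, not part of ptype’s API), the constraint could travel with the input rather than being patched onto the output:

# Hypothetical: declare up front that 'error' should be treated as anomalous,
# so inference runs feed-forward with the constraint as part of its input.
schema = ptype.fit_schema(df, anomalous_values={"col_2": ["error"]})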

Taha responded:

Perhaps we can directly overwrite the probabilities assigned by the PFSMs. For example, we can let the anomaly PFSM be the only PFSM that assigns a non-zero probability to “error”. This (probably) wouldn’t interfere with column type inference, as the value would get the same “weight” under each column type.

So the column type and canonical type would still be string and categorical, respectively. Also, “error” would be labelled as an anomaly using the posterior row type distribution.
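
A sketch of that override, assuming each PFSM’s per-value probability can be wrapped before it feeds into inference (the names below are illustrative, not ptype internals):

# Hypothetical: route all probability mass for an overridden value to the
# anomaly PFSM, so every column-type machine sees the same "weight" for it.
overrides = {"error": "anomaly"}  # value -> machine that should claim it

def machine_probability(machine_name, value, base_prob):
    """Wrap a PFSM's probability for `value` with the user's overrides."""
    if value in overrides:
        return 1.0 if machine_name == overrides[value] else 0.0
    return base_prob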

rolyp changed the title from “Customise default treatment of categorical value” to “Specify treatment of categorical value” on Sep 23, 2020
tahaceritli (Collaborator) commented:

I notice that we were considering letting users interact through changing schemas (see #62 for details):

schema = ptype.fit_schema(df)
schema
{
    'col_1': ('Int64',),
    'col_2': ('Categorical', 'A', 'B', 'ERR'),
    'col_3': ('String',),
    'col_4': ('Float',),
}

schema['col_2'] = ('Categorical', 'A', 'B')
typed_df = ptype.transform_schema(df, schema)

Our current schema looks as follows:

schema['col_2'] = {
    ...
    "normal_values": ...,
    "missing_values": ...,
    "anomalies": ...,
    "categorical_values": ...,
    ...
}

So, schema['col_2'] = ('Categorical', 'A', 'B') would correspond to schema['col_2']['categorical_values'] = ['A', 'B']. But we would also need to update schema['col_2']['normal_values'], schema['col_2']['missing_values'] and schema['col_2']['anomalies']. In that case, users would need to specify which values are valid, which are missing, and which are anomalies. We can then overwrite the probabilities assigned by the PFSMs, if we want.
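
A helper along these lines (illustrative only, assuming the schema-dict layout above with list-valued fields) could keep those fields consistent when a value is reclassified:

# Hypothetical helper: move a value out of the categorical/normal lists and
# into the anomalies list, keeping the per-column fields in sync.

def reclassify_as_anomaly(schema, column, value):
    col = schema[column]
    col["categorical_values"] = [v for v in col["categorical_values"] if v != value]
    col["normal_values"] = [v for v in col["normal_values"] if v != value]
    if value not in col["anomalies"]:
        col["anomalies"].append(value)
    return schema

schema = reclassify_as_anomaly(schema, "col_2", "error")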

rolyp (Collaborator, Author) commented Sep 23, 2020

Let’s discuss this. I think we should be careful about thinking of ptype as “bidirectional”, i.e. a computation whose output we can meaningfully manipulate and then “repair” so that all components of the output are consistent. Instead, we need to make it possible to configure how ptype runs so that its inference produces the desired output.

The workflow would be similar: the user runs ptype once to see its first guess, then runs it again with different settings (for example, passing in a schema derived from the previous run) until they achieve the desired result. The key point is that the computation is always “feed-forward”: the user iterates until they get the inference they want, modifying the inputs to ptype each time, but at no point do we attempt any complex, error-prone “schema repair”.
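
Concretely, the loop might look like this (fit_schema and transform_schema are quoted from #62 above; re-running inference with the adjusted schema as an input, shown commented out, is the hypothetical part):

# Feed-forward iteration: inspect ptype's first guess, adjust the *input*,
# and re-run; no attempt to repair the previous output in place.
schema = ptype.fit_schema(df)                    # first guess
schema["col_2"] = ("Categorical", "A", "B")      # user adjusts the input schema
# schema = ptype.fit_schema(df, schema=schema)   # hypothetical: re-infer under constraints
typed_df = ptype.transform_schema(df, schema)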
