Reporting ptype outputs via schemas #62

Closed
16 tasks done
tahaceritli opened this issue Sep 10, 2020 · 4 comments

@tahaceritli
Collaborator

tahaceritli commented Sep 10, 2020

Some issues to address:

  • redundancy between Column and components of a Schema
  • there seem to be 3 versions of as_normal: one in test_ptype and two in Ptype
  • show_results_df no longer used (have it return a dataframe with a single row?)
  • should transform_schema take a df argument?
  • value of as_normal is discarded in transform_schema
  • move features to Column
    • drop dataframe index on all_posteriors
    • replace assignment by initialisation in Ptype.column
  • move arff_type inference to Column constructor

Done/dropped:

  • show summaries of ptype outputs using “schemas”
  • show the list of symbols used to encode missing values and anomalies
  • show the ratios of missing values and anomalies
  • value of as_normal is discarded in get_final_df – method no longer needed
  • signature and usage of update_dtypes suggests a pure function, but mutates its argument
  • disable strip-notebook-output in .gitattributes
    • roll back to seaborn==0.9.0 so it doesn’t issue a warning about distplot
  • converted p_t from list to dictionary

@GjjvdBurg proposes the following design which could be further modified if needed:

schema = ptype.fit_schema(df)
schema  # evaluates to:
{
    'col_1': ('Int64',),
    'col_2': ('Categorical', 'A', 'B', 'ERR'),
    'col_3': ('String',),
    'col_4': ('Float',),
}

# drop 'ERR' from the categories of col_2, then apply the edited schema
schema['col_2'] = ('Categorical', 'A', 'B')
typed_df = ptype.transform_schema(df, schema)
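
To make the proposed round trip concrete, here is a minimal, self-contained sketch of the "schema as an explicit input" design. It is not ptype's implementation: `fit_schema` and `transform_schema` below are hypothetical free functions, the `max_categories` parameter is invented for illustration, and pandas' own numeric parsing stands in for ptype's inference. It also assumes that a value removed from the category list should become missing in the output, which the proposal does not spell out.

```python
import pandas as pd


def fit_schema(df, max_categories=5):
    """Hypothetical stand-in for ptype inference: one type tuple per column."""
    schema = {}
    for name in df.columns:
        col = df[name].astype(str)
        numeric = pd.to_numeric(col, errors="coerce")
        if numeric.notna().all():
            # Every value parsed as a number; whole numbers become Int64.
            is_int = (numeric % 1 == 0).all()
            schema[name] = ("Int64",) if is_int else ("Float",)
        elif col.nunique() <= max_categories:
            schema[name] = ("Categorical", *sorted(col.unique()))
        else:
            schema[name] = ("String",)
    return schema


def transform_schema(df, schema):
    """Return a typed copy of df; the input frame is not modified."""
    out = df.copy()
    for name, (kind, *categories) in schema.items():
        if kind == "Int64":
            out[name] = pd.to_numeric(out[name], errors="coerce").astype("Int64")
        elif kind == "Float":
            out[name] = pd.to_numeric(out[name], errors="coerce").astype("float64")
        elif kind == "Categorical":
            # Assumption: values outside the listed categories become missing.
            out[name] = pd.Categorical(out[name], categories=categories)
        else:
            out[name] = out[name].astype("string")
    return out


df = pd.DataFrame({
    "col_1": ["1", "2", "3", "4"],
    "col_2": ["A", "B", "ERR", "A"],
    "col_3": ["w", "x", "y", "z"],
    "col_4": ["1.5", "2.0", "3.25", "4.0"],
})

inferred = fit_schema(df, max_categories=3)
schema = dict(inferred)
schema["col_2"] = ("Categorical", "A", "B")  # the user removes 'ERR'
typed_df = transform_schema(df, schema)
```

Because the schema is an ordinary dictionary, the user can inspect and edit it between the two calls without any setters or getters.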

@tahaceritli changed the title from "Producing schemas" to "Reporting ptype outputs" Sep 10, 2020
@tahaceritli
Collaborator Author

I have made some progress on this. Please see #64.

@tahaceritli changed the title from "Reporting ptype outputs" to "Reporting ptype outputs via schemas" Sep 14, 2020
@rolyp
Collaborator

rolyp commented Sep 14, 2020

@tahaceritli If you add new issues to the Project (drop-down menu on the right) they’ll appear in the kanban board!

@tahaceritli
Collaborator Author

tahaceritli commented Sep 14, 2020

Thanks for pointing these out.

  1. Now that we are switching to fit_schema, transform_schema and fit_transform_schema, I don't think we will use get_final_df anymore.
  2. Schema is just a dictionary that returns some of the properties of Column. I'm open to alternative solutions.
  3. Yes. I had copied the one in test_ptype to the Ptype class so that it would be easier to reach in the notebook. The other version of as_normal in Ptype just takes different parameters - it takes the schema and obtains the normal values according to the schema. I will now remove the other one in Ptype.
  4. Now that we are presenting information in schemas, I don't think we need to put the data type inside the header. That's why it's not used in the notebooks for now, though we may need it in the future.
  5. If we follow the sklearn convention (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), it should take only df as input. But then the interaction with the user would have to go through setters and getters, and we would need to store the schema internally. If we want to implement the interactions as Gerrit suggested earlier, we would also need to treat the schema as an input. I think this would also be easier for the user, but perhaps that's just me. I'm also happy to follow the standard sklearn convention.
    Btw I will soon add fit_transform_schema which takes only a df. This will infer the corresponding schema using fit_schema and then create a new data frame with the changes.
  6. Thanks. That's true. In fact, the data frame was updated because of pd.to_numeric(df[col_name], errors="coerce").astype(new_dtype). It should be okay now, although there may be another way of doing this.

Let me know what you think. I will try to sort out item 6 and push my changes in the meantime.
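
For item 6, one possible way to keep the conversion free of side effects is sketched below. The cast expression itself returns a new Series; the caller's frame only changes when that result is assigned back into it. The helper name `cast_column` is hypothetical and is not part of ptype; it relies on `DataFrame.assign`, which returns a new frame.

```python
import pandas as pd


def cast_column(df, col_name, new_dtype):
    """Hypothetical helper: return a new frame with one column cast."""
    converted = pd.to_numeric(df[col_name], errors="coerce").astype(new_dtype)
    # assign() builds a new DataFrame, so the caller's frame is left untouched.
    return df.assign(**{col_name: converted})


raw = pd.DataFrame({"age": ["31", "45", "n/a"]})
typed = cast_column(raw, "age", "float64")
```

Here `raw` still holds the original strings after the call, while `typed` holds the numeric column with the unparseable entry marked as missing.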

@GjjvdBurg
Collaborator

Minor comment regarding no. 5: This reminds me of an earlier discussion (original message in #11), where the point was that this fits the scikit-learn "transformer" idea: you fit the schema with, say, ptype.fit(df), then cast the types using new_df = ptype.transform(df). This can be combined in new_df = ptype.fit_transform(df), and you could then cast a second dataset using the same schema with ptype.transform(df2).
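
The transformer pattern described here could look roughly like the sketch below, in which the schema is stored on the object rather than passed around. The class name `SchemaTransformer`, the `schema_` attribute, and the numeric-or-string inference are all assumptions made for illustration; ptype's actual inference is not reproduced, and the class does not depend on scikit-learn.

```python
import pandas as pd


class SchemaTransformer:
    """Hypothetical fit/transform wrapper; not ptype's actual class."""

    def fit(self, df):
        # Stand-in inference: "Float" if every value parses as a number.
        self.schema_ = {}
        for name in df.columns:
            parsed = pd.to_numeric(df[name], errors="coerce")
            self.schema_[name] = "Float" if parsed.notna().all() else "String"
        return self

    def transform(self, df):
        # Apply the stored schema to any frame with the same columns.
        out = df.copy()
        for name, kind in self.schema_.items():
            if kind == "Float":
                out[name] = pd.to_numeric(out[name], errors="coerce").astype("float64")
            else:
                out[name] = out[name].astype("string")
        return out

    def fit_transform(self, df):
        return self.fit(df).transform(df)


df = pd.DataFrame({"price": ["1.5", "2"], "city": ["Leeds", "York"]})
df2 = pd.DataFrame({"price": ["3", "oops"], "city": ["Bath", "Hull"]})

model = SchemaTransformer()
new_df = model.fit_transform(df)   # infer the schema and cast in one step
new_df2 = model.transform(df2)     # reuse the same schema on a second dataset
```

The trade-off against the dictionary-based design is that editing the inferred schema now means reaching into the fitted object (here, `model.schema_`) instead of changing a value that the user already holds.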
