Add select_ltypes to DataTables to allow user to select Data Columns #81

Closed
gsheni opened this issue Sep 9, 2020 · 2 comments · Fixed by #96

@gsheni
Contributor

gsheni commented Sep 9, 2020

  • Similar to the pandas select_dtypes function, DataTable should have a function, select_logical_types or select_ltypes, that lets users select columns by Logical Type.
  • Users can pass in a single Logical Type or a list of them. Each entry can be a string (the class name or the snake_case type string) or a Logical Type object.
  • In the future, we can look into adding an exclude option to select_ltypes.
import pandas as pd

df = pd.read_csv(...)
dt = DataTable(df, name='data')

# support a single type string
dt.select_ltypes('bool')
dt.select_ltypes('zip_code')

# support a list of type strings
dt.select_ltypes(['categorical', 'natural_language'])
# support a list of class names as strings
dt.select_ltypes(['Categorical', 'NaturalLanguage'])

from data_table.logical import Categorical, NaturalLanguage
# support a single Logical Type object
dt.select_ltypes(Categorical)
# support a list of Logical Type objects
dt.select_ltypes([Categorical, NaturalLanguage])
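
For comparison, here is roughly how the pandas select_dtypes method mentioned above behaves. The dataframe and its column names are made up for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# Made-up example data
df = pd.DataFrame({
    "is_active": [True, False],
    "age": [31, 45],
    "city": ["Boston", "Austin"],
})

# A single dtype selects only the matching columns
bools = df.select_dtypes(include="bool")

# A list of dtypes selects columns matching any of them
numeric_or_bool = df.select_dtypes(include=["bool", "number"])

# exclude drops the matching columns and keeps the rest
non_bool = df.select_dtypes(exclude=["bool"])

print(list(bools.columns))
print(list(numeric_or_bool.columns))
print(list(non_bool.columns))
```

The proposed select_ltypes would play the same role, but would filter on a column's Logical Type rather than its pandas dtype.
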
@tamargrey
Contributor

Some implementation questions:

  • Does the selection always happen on the original dataframe, or on the current state of the DataTable?
    • My guess is that we always apply it to the DataTable's columns, so any changes already made to the logical or semantic types carry over.
  • Can input lists be of mixed types, so something like ['boolean', Categorical]?
  • Do we want logical type strings to be case-insensitive?
    • Just thinking that we do this with primitives in DFS.
  • Should the underlying dataframe also change when we select certain DataTable columns?
    • I would expect the underlying df to never change, but I haven't seen that explicitly stated anywhere.
  • Do we want any warnings if no columns match any of the specified ltypes (empty DataTable) or if all of them match (no change from the original)?
  • Should this remove the index and time_index columns if their ltypes aren't included?

@gsheni
Contributor Author

gsheni commented Sep 14, 2020

@tamargrey

  • The selection happens on the current state of the DataTable. We would always apply it to the DataColumns on the DataTable.
  • Yes, the input lists can be of mixed types.
  • Yes, let's be case-insensitive (upper/lowercase). If we are doing that with primitives in DFS, let's do that here too.
  • The underlying dataframe should not change. This is a helper function that returns a DataTable based on the Logical Types passed in.
  • If no columns match the specified logical types, let's return an empty DataTable with an empty dataframe.
    • If all columns match the specified logical types, return the full DataTable with all columns.
  • Yes, for now, we can remove the index and time_index if their logical types are not included. We can revisit this behavior in the future, but for now let's be specific.
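
The answers above can be sketched roughly as follows. This is a toy stand-in under stated assumptions, not the project's implementation: the LogicalType base class, its type_string helper, KNOWN_LTYPES, resolve_ltype, and the DataTable constructor arguments are all hypothetical names invented for illustration.

```python
import re

import pandas as pd


class LogicalType:
    """Hypothetical stand-in base class for logical types."""

    @classmethod
    def type_string(cls):
        # CamelCase class name -> snake_case (NaturalLanguage -> natural_language)
        return re.sub(r"(?<!^)(?=[A-Z])", "_", cls.__name__).lower()


class Boolean(LogicalType):
    pass


class Categorical(LogicalType):
    pass


class NaturalLanguage(LogicalType):
    pass


class WholeNumber(LogicalType):
    pass


# Hypothetical registry used to look up types from strings
KNOWN_LTYPES = [Boolean, Categorical, NaturalLanguage, WholeNumber]


def resolve_ltype(spec):
    """Map a class, class-name string, or type string to a logical type class."""
    if isinstance(spec, type) and issubclass(spec, LogicalType):
        return spec
    if isinstance(spec, str):
        key = spec.lower()  # case-insensitive matching
        for ltype in KNOWN_LTYPES:
            if key in (ltype.__name__.lower(), ltype.type_string()):
                return ltype
    raise ValueError(f"Unknown logical type: {spec!r}")


class DataTable:
    """Toy table: a dataframe plus a column -> logical type mapping."""

    def __init__(self, df, logical_types, index=None):
        self.df = df
        self.logical_types = dict(logical_types)
        self.index = index

    def select_ltypes(self, include):
        # Accept a single entry or a (possibly mixed) list of entries
        if not isinstance(include, (list, tuple, set)):
            include = [include]
        wanted = {resolve_ltype(spec) for spec in include}
        cols = [c for c, lt in self.logical_types.items() if lt in wanted]
        # The index is kept only if its own logical type was selected
        new_index = self.index if self.index in cols else None
        # df[cols] builds a new frame, so self.df itself is left untouched
        return DataTable(
            self.df[cols], {c: self.logical_types[c] for c in cols}, index=new_index
        )


df = pd.DataFrame(
    {"id": [1, 2], "flag": [True, False], "color": ["r", "g"], "note": ["a b", "c d"]}
)
dt = DataTable(
    df,
    {"id": WholeNumber, "flag": Boolean, "color": Categorical, "note": NaturalLanguage},
    index="id",
)

mixed = dt.select_ltypes(["BOOLEAN", Categorical])  # mixed list, case-insensitive
print(list(mixed.df.columns))  # ['flag', 'color']
print(mixed.index)  # None, because WholeNumber was not selected
```

Matching strings against both the class name and the snake_case type string, after lowercasing, is one simple way to get the case-insensitive behavior agreed on above. Unknown strings raise an error here, which is a design choice the thread does not settle.
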
