Add a document specifying scope of the package #36

Closed
ankur-srivastava1 opened this issue May 17, 2020 · 2 comments

@ankur-srivastava1
Collaborator

Identifying the scope of any package is important. We need to decide what types of data this package should accept:

- Signal data
- Alarm data
- Work order data
- Geospatial data
- Computer vision data
- LIDAR data
- Any others?

Once the data types are identified, we can define acceptable schemas and functionality for each data type.

@dvdjlaw
Member

dvdjlaw commented May 17, 2020

Thinking out loud...

Scope prioritization for this package should follow the 80/20 rule. I think the obvious main target should be tabular data used for classical ML problems, i.e. classification/regression. However, there are a few main considerations:

  1. EDA for classical ML problems is the most common use case and the least likely to stand out against other tools. Coverage in this area should focus on providing a polished, "one-stop shop" that can handle the most common analyses.
     • We should also be opinionated about analyses that are both effective and technically sound, e.g. we shouldn't spend effort building/supporting pie charts and word clouds just because they are used frequently.
  2. Other types of data that can be used as predictive features in classical ML should be strongly considered. For example, locality (e.g. country, state, city, zip) is very commonly included as a predictive feature, while other geospatial-specific analyses (e.g. flight maps) are much less commonly applicable and may or may not be easily represented in tabular format.
     • In the geospatial domain, we benefit from geopandas exposing geospatial data through a pandas-like API, which lowers the hurdle of supporting certain types of geospatial data (see the first sketch after this list).
  3. In scoping out data sources we should be careful not to conflate the source of the data with the analysis approach. For example, signal data and alarm data can be framed as generic time series analysis or even classic ML/tabular, and computer vision data can be framed as a correlation heatmap on the raw pixels (second sketch below).
  4. However, we should also keep an eye out for opportunities to cover more niche data types where there is a need that is not already met by other (open source) tools. We should brainstorm further about what opportunities lie here.
  5. We should be judicious about how/where we provide "automated data preparation" / transformation for specific data types, as this package is not intended to eliminate the need for data preparation. I can think of two situations where we should lean towards providing this preparation as part of the package:
     • Strong standardization of data formats in the industry/domain, as in geospatial.
     • Transformations that are considered part of the analysis itself: a simplistic example is the binning/counting for histograms (third sketch below).
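
To make the geopandas point in item 2 concrete, here is a minimal sketch (not part of this package; the city names and values are invented for illustration) showing that a GeoDataFrame behaves like a pandas DataFrame, so ordinary tabular EDA carries over:

```python
import geopandas as gpd
import pandas as pd

# Plain tabular data with coordinates (values are made up for this sketch).
df = pd.DataFrame({
    "city": ["Chicago", "Denver"],
    "population": [2_693_976, 715_522],
    "lon": [-87.63, -104.99],
    "lat": [41.88, 39.74],
})

# Wrapping it in a GeoDataFrame adds a geometry column but keeps the
# pandas-style interface, so tabular EDA still applies to the other columns.
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["lon"], df["lat"]))

print(gdf["population"].describe())   # ordinary pandas operation
print(gdf.geometry.total_bounds)      # geometry-aware operation when needed
```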
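
A second sketch for item 3 ("don't conflate the data source with the analysis approach"): computer vision data flattened into a tabular matrix, at which point a generic correlation heatmap applies. The array shapes here are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 8, 8))           # 100 tiny 8x8 grayscale images
pixels = images.reshape(len(images), -1)   # tabular view: rows=images, cols=pixels

# Pixel-to-pixel correlation matrix, ready to be rendered by whatever generic
# correlation heatmap the package already provides for tabular data.
corr = np.corrcoef(pixels, rowvar=False)
print(corr.shape)  # (64, 64)
```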
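
And a third sketch for the last sub-point in item 5: histogram binning/counting is a transformation that is inseparable from the analysis itself, so it is the kind of preparation the package should own rather than push onto the user.

```python
import numpy as np

values = np.random.default_rng(0).normal(size=1_000)

# The histogram cannot exist without the binning/counting step, so the
# package performs it as part of producing the plot.
counts, bin_edges = np.histogram(values, bins=20)
print(counts.sum(), len(bin_edges))  # 1000, 21
```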

@terrytangyuan terrytangyuan added this to To do in Open Source May 18, 2020
@brianray
Member

My suggested next steps:

  • Convert this into a proposal
  • Provide some simplistic data examples

@dvdjlaw dvdjlaw self-assigned this Jul 9, 2020
@dvdjlaw dvdjlaw moved this from To do to In progress in Open Source Jul 9, 2020
Open Source automation moved this from In progress to Done Jul 20, 2020