Add a document specifying scope of the package #36

Closed
ankur-srivastava1 opened this issue May 17, 2020 · 2 comments

@ankur-srivastava1
Collaborator

Identifying the scope of any package is important. We need to decide what types of data this package should accept:

- Signal data
- Alarm data
- Work order data
- Geospatial data
- Computer vision data
- LIDAR data
- Any others?

Once the data types are identified, we can define acceptable schemas and functionality for each data type.

@dvdjlaw
Member

dvdjlaw commented May 17, 2020

Thinking out loud...

Scope prioritization for this package should follow the 80/20 rule. I think the obvious main target should be tabular data used for classical ML problems, i.e. classification/regression. However, there are a few main considerations:

  1. EDA for classical ML problems is the most common use case and the least likely to stand out against other tools. Coverage in this area should focus on providing a polished, "one-stop shop" that can handle the most common analyses.
     • We should also be opinionated about analyses that are both effective and technically sound, e.g. we shouldn't spend effort building/supporting pie charts and word clouds just because they are used frequently.
  2. Other types of data that can be used as predictive features in classical ML should be strongly considered. For example, locality (e.g. country, state, city, zip) is very commonly included as a predictive feature, while other geospatial-specific analyses (e.g. flight maps) are much less commonly applicable and may or may not be easily represented in tabular format.
     • In the geospatial domain, we benefit from geopandas exposing geospatial data through a pandas-like API, which lowers the hurdle of supporting certain types of geospatial data (see the first sketch after this list).
  3. In scoping out data sources we should be careful not to conflate the source of the data with the analysis approach. For example, signal data and alarm data can be framed as generic time series analysis or even classic ML/tabular, and computer vision data can be framed as a correlation heatmap on the raw pixels (second sketch below).
  4. However, we should also keep an eye out for opportunities to cover more niche data types where there is a need that is not already met by other (open source) tools. We should brainstorm further about what opportunities lie here.
  5. We should be judicious about how/where we provide "automated data preparation" / transformation for specific data types, as this package is not intended to eliminate the need for data preparation. I can think of two situations where we should lean towards providing this preparation as part of the package:
     • Strong standardization of data formats in the industry/domain, as in geospatial.
     • Transformations that are considered part of the analysis itself: a simplistic example is the binning/counting for histograms (third sketch below).
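
To make the geopandas point in item 2 concrete, here is a minimal sketch (not part of this package; the city names and values are invented for illustration) showing that a GeoDataFrame behaves like a pandas DataFrame, so ordinary tabular EDA carries over:

```python
import geopandas as gpd
import pandas as pd

# Plain tabular data with coordinates (values are made up for this sketch).
df = pd.DataFrame({
    "city": ["Chicago", "Denver"],
    "population": [2_693_976, 715_522],
    "lon": [-87.63, -104.99],
    "lat": [41.88, 39.74],
})

# Wrapping it in a GeoDataFrame adds a geometry column but keeps the
# pandas-style interface, so tabular EDA still applies to the other columns.
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["lon"], df["lat"]))

print(gdf["population"].describe())   # ordinary pandas operation
print(gdf.geometry.total_bounds)      # geometry-aware operation when needed
```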
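
A second sketch for item 3 ("don't conflate the data source with the analysis approach"): computer vision data flattened into a tabular matrix, at which point a generic correlation heatmap applies. The array shapes here are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 8, 8))           # 100 tiny 8x8 grayscale images
pixels = images.reshape(len(images), -1)   # tabular view: rows=images, cols=pixels

# Pixel-to-pixel correlation matrix, ready to be rendered by whatever generic
# correlation heatmap the package already provides for tabular data.
corr = np.corrcoef(pixels, rowvar=False)
print(corr.shape)  # (64, 64)
```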
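
And a third sketch for the last sub-point in item 5: histogram binning/counting is a transformation that is inseparable from the analysis itself, so it is the kind of preparation the package should own rather than push onto the user.

```python
import numpy as np

values = np.random.default_rng(0).normal(size=1_000)

# The histogram cannot exist without the binning/counting step, so the
# package performs it as part of producing the plot.
counts, bin_edges = np.histogram(values, bins=20)
print(counts.sum(), len(bin_edges))  # 1000, 21
```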

@terrytangyuan terrytangyuan added this to To do in Open Source May 18, 2020
@brianray
Member

My suggested next steps:

  • Convert this into a proposal
  • Provide some simplistic data examples

@dvdjlaw dvdjlaw self-assigned this Jul 9, 2020
@dvdjlaw dvdjlaw moved this from To do to In progress in Open Source Jul 9, 2020
Open Source automation moved this from In progress to Done Jul 20, 2020