# What is Data Science?

**Learning Objective:** Understand different ways that Data Science can be defined.

## Data Science as a *skill set*

Perhaps the most common way to define Data Science is to enumerate the skills and knowledge areas it draws on. The best-known treatment of this approach is [Drew Conway's Data Science Venn diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram), which places Data Science at the intersection of hacking skills, math and statistics knowledge, and substantive expertise:

![Data Science Venn Diagram](images/data_science_vd.png)

## Data Science as a *process*

An operational definition of Data Science answers the question, "What do Data Scientists do?" Over the last few years, the community of Data Scientists has been building a consensus answer to this question. The different activities involved in Data Science are linked together to form the Data Science **process** or **workflow**. By comparing a few individuals' descriptions of the Data Science process, we can see a clear picture emerging.

* [A Taxonomy of Data Science](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/), Hilary Mason and Chris Wiggins (2010):
  - Obtain
  - Scrub
  - Explore
  - Model
  - Interpret
* [The Data Science Process](http://columbiadatascience.com/2012/09/24/reflections-after-jakes-lecture/), Rachel Schutt (2012):
  - Observation and collection
  - Processing
  - Exploratory data analysis
  - Modeling: Stats, ML
  - Build data product
  - Communicate
  - Make decisions
* [Introduction to Data Science 2.0](http://columbiadatascience.com/2013/09/16/introduction-to-data-science-version-2-0/), Rachel Schutt (2013):
  - Gather and observe
  - Process
  - Modeling: Stats, ML
  - Summarize, communicate, build
  - Decide, interact
* [Data Science Workflow: Overview and Challenges](http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext), Philip Guo (2014):
  - Preparation
  - Analysis
  - Reflection
  - Dissemination
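
To make the shared shape of these workflows concrete, here is a minimal sketch of the five steps from the first list above (Obtain, Scrub, Explore, Model, Interpret) using only the Python standard library. The data set (hours studied versus exam score), the function names, and the choice of a least-squares line as the model are all assumptions made for illustration; none of them come from the linked articles.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data: hours studied vs. exam score, with one incomplete row.
RAW_CSV = """hours,score
1,52
2,55
3,61
,70
4,64
5,70
"""


def obtain(text):
    """Obtain: read raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop rows with missing values and convert strings to numbers."""
    clean = []
    for row in rows:
        if row["hours"] and row["score"]:
            clean.append((float(row["hours"]), float(row["score"])))
    return clean


def explore(pairs):
    """Explore: compute simple summary statistics."""
    xs, ys = zip(*pairs)
    return {"n": len(pairs), "mean_hours": mean(xs), "mean_score": mean(ys)}


def model(pairs):
    """Model: fit a least-squares line, score = intercept + slope * hours."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def interpret(slope, intercept):
    """Interpret: turn the fitted numbers into a plain-language statement."""
    return (f"Each extra hour is associated with about {slope:.1f} more points "
            f"(baseline about {intercept:.1f}).")


rows = obtain(RAW_CSV)
pairs = scrub(rows)
summary = explore(pairs)
slope, intercept = model(pairs)
message = interpret(slope, intercept)
print(summary)
print(message)
```

Each function corresponds to one stage, so the same skeleton could be relabeled to match any of the other workflows listed above. In practice the stages are rarely this linear: exploration often sends you back to scrubbing, and interpretation often raises new questions that require more data.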

## Data Science as a *set of questions*

Both the *skills* and *process* approaches to defining Data Science have limitations. Another approach, which I advocate, is to enumerate the underlying questions that the field pursues. Here are some possibilities:

* How/where do we get data?
* What is the raw format of the data?
* How much data is there, and how often does new data arrive?
* What variables/fields are present in the data and what are their types?
* What relevant variables/fields are not present in the data?
* What relationships are present in the data and how are they expressed?
* Is the data observational or collected in a controlled manner?
* What practical questions can we, or would we like to, answer with the data?
* How is the data stored after collection and how does that relate to the
  practical questions we are interested in answering?
* What in-memory data structures are appropriate for answering those practical
  questions efficiently?
* What can we predict with the data?
* What can we understand with the data?
* What hypotheses can be supported or rejected with the data?
* What statistical or machine learning methods are needed to answer these questions?
* What user interfaces are required for humans to work with the data efficiently and
  productively?
* How can the data be visualized effectively?
* How can code, data and visualizations be embedded into narratives used to
  communicate results?
* What software is needed to support the activities around these questions?
* What computational infrastructure is needed?
* How can organizations leverage data to meet their goals?
* What organizational structures are needed to best take advantage of data?
* What are the economic benefits of pursuing these questions?
* What are the social benefits of pursuing these questions?
* Where do these questions, and the activities in pursuit of them, intersect with important ethical issues?
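
Some of the earlier questions in this list can be answered directly with code. As a rough sketch, the snippet below tackles "What variables/fields are present in the data and what are their types?" for a small CSV sample, and also counts missing values per field. The sample data, the `infer_type` and `describe` helpers, and the simple int/float/str type ladder are assumptions chosen for illustration; real data sets usually need richer type handling (dates, categories, and so on).

```python
import csv
import io

# Hypothetical sample data with one missing value.
SAMPLE = """id,name,height_cm,joined
1,Ada,170.5,2015-01-03
2,Grace,,2016-07-19
3,Alan,183,2014-11-30
"""


def infer_type(values):
    """Return the narrowest of 'int', 'float', 'str' that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return label
        except ValueError:
            continue
    return "str"


def describe(text):
    """Report the inferred type and number of missing values for each field."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for field in rows[0].keys():
        column = [row[field] for row in rows]
        report[field] = {"type": infer_type(column), "missing": column.count("")}
    return report


report = describe(SAMPLE)
for field, info in report.items():
    print(field, info)
```

Note that the `joined` column is reported as a string even though it holds dates. Deciding what a field really represents is part of answering the question, and it usually takes domain knowledge as well as code.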

## Data Science as *Science*

If we take the name "Data Science" seriously, then we have to assume that it is somehow related to science. Here is my own take:

> Data Science involves the application of scientific methods and approaches to data sets that *may* lie outside the traditional fields of science (Physics, Chemistry, Biology, etc.).

In other words, Data Science involves a broad application of the scientific method.

## Resources

* [Scientific Method](https://en.wikipedia.org/wiki/Scientific_method), Wikipedia (2016).
* [50 Years of Data Science](http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf), David Donoho (2015).
* [Data Science Survey](https://www.oreilly.com/ideas/2015-data-science-salary-survey), O'Reilly Media (2015).
* [The Emerging Role of Data Scientists on Software Development Teams](http://research.microsoft.com/apps/pubs/default.aspx?id=242286), Microsoft Research (2015).