# 1.1 What is \[research\] data science?

There is an incredible abundance of webpages, YouTube videos, newspaper articles
and books defining what data science is. Instead of providing yet another
clear-cut single definition of such a complex topic, we have opted to simply
discuss the main components of data science. In particular, we present our
experience of what a \[research\] data scientist does. 


In this submodule we will offer an overview of:
- what we mean when we say "data" in data science
- the different types of tasks that a data scientist covers
- what role data science plays in research


## Data

- Data is central to data science
- Data is not a natural resource

![data_oil](../../figures/m1/data_oil.jpeg)

 

We start from the most essential element of this profession, data. The
availability of large-scale datasets for training face recognition algorithms or
language models is not something that we should take for granted, and it is
absolutely not something that has always been in place. Creating, refining and
making data consumable takes a substantial effort. 


### Scarcity & Abundance

- The availability of data is a fairly new phenomenon
- Acquiring data remains very complex

![digitisation](../../figures/m1/digitisation.jpeg)
[Image link](https://www.bl.uk/help/initiate-a-new-digitisation-project-with-the-british-library)


### Representativeness

- Often, the data we have available is just a sample and not the complete story
- The question we should ask ourselves is:
  - "What can these data points tell us about the wider phenomenon that we're
    really interested in?"

![twitter](../../figures/m1/twitter.jpeg)
[Image link](https://www.businessinsider.com/library-of-congress-twitter-wont-archive-every-public-tweet-anymore-2017-12?r=US&IR=T)

In an article by [Anna Rogers](https://aclanthology.org/2021.acl-long.170.pdf),
the author considers the following argument:
> “the size of the data is so large that, in fact, our training sets are not a sample at all, they are the entire data universe”.

Rogers replies that this argument would stand only if the “data universe” that we use to train, for instance, a speech recognition system were the same as “the totality of human speech/writing”. It is not, and hopefully never will be, because collecting all speech is problematic for ethical, legal and practical reasons. Anything less than that is a sample. Given existing social structures, no matter how big that sample is, it is not representative, due to (at least) unequal access to technology, unequal ability to defend one’s privacy and copyright, and limited access to the huge volumes of speech produced on “walled garden” platforms like Facebook.
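
The point that size does not buy representativeness can be illustrated with a small simulation. This is a minimal sketch and is not taken from Rogers' article: the two groups, their proportions and their scores are invented assumptions, chosen only to show that a very large sample drawn from one part of a population can estimate a population-wide quantity worse than a small random sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30% of people can easily be reached by our data
# collection ("connected"), 70% cannot. The quantity we care about (an
# arbitrary score) differs between the two groups.
population = (
    [("connected", random.gauss(60, 10)) for _ in range(30_000)]
    + [("unconnected", random.gauss(40, 10)) for _ in range(70_000)]
)
true_mean = statistics.fmean(score for _, score in population)

# A small, uniformly random sample of the whole population.
random_sample = random.sample(population, 500)
random_estimate = statistics.fmean(score for _, score in random_sample)

# A much larger sample that only ever reaches the "connected" group.
connected = [row for row in population if row[0] == "connected"]
biased_sample = random.sample(connected, 20_000)
biased_estimate = statistics.fmean(score for _, score in biased_sample)

random_error = abs(random_estimate - true_mean)
biased_error = abs(biased_estimate - true_mean)

print(f"true mean:                  {true_mean:.1f}")
print(f"random sample (n=500):      {random_estimate:.1f}")
print(f"biased sample (n=20,000):   {biased_estimate:.1f}")
```

Under these assumptions the large sample is forty times bigger but misses the true mean by a wide margin, because who ends up in the data matters more than how many records there are.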

 

### Creation

- Data is <ins>always</ins> the product of human decisions and actions
- It is the outcome of an enormous amount of labour, resources, infrastructure, and time

![imagenet](../../figures/m1/imagenet.jpeg)
[Image link](https://excavating.ai/)

 
Data is not a natural resource, but the product of human decisions. Whether we are conscious of it or not, there is always a
[human-in-the-loop](https://en.wikipedia.org/wiki/Human-in-the-loop). Data
creation can involve, for example:
- collecting information or tracking historical information
- organising information into specific categories
- measuring and storing information as data on digital infrastructure

When we find a data
collection enriched with metadata or specific labels, we always need
to remember that someone has either provided those labels directly, or they have
been assigned automatically by a tool trained on other manual labels.
For example, the famous ImageNet dataset, a central component in the development of many well-known image recognition pipelines, relies on two pillars:
- a taxonomy developed since 1985 as part of the lexical database WordNet, which provides a top-down hierarchical structure of concepts ("chair" is under artifact->furnishing->furniture->seat->chair)
- an enormous, cheap workforce provided through Amazon Mechanical Turk.

ImageNet is not an abstract resource, but the result of a gigantic effort and the specific representation of the world held by *1)* the people who designed WordNet, *2)* the researchers who decided which WordNet categories to include in ImageNet and which to leave out, and *3)* the many, many annotators who selected which images to associate with concepts like "brother", "boss", "debtor", "drug-addict" and "call girl", all included in both WordNet and ImageNet (at least until 2019).

 
## Data Science

We use the term to refer to a varied ensemble of practices, methodologies and tools that may be used to learn from or about data.
    
![The data science hierarchy of needs](../../figures/m1/pyramid_of_needs.png)

 

Having established that the necessary premise of data science is its
relationship with data, the other fundamental component is constituted by a
broad ensemble of practices, methodologies and tools that, combined, can lead to
obtaining "new insights" from or about a given dataset. 

If we consider Monica Rogati's representation of the "Data science hierarchy of needs", we can see that, whilst developing new machine learning models sits at the top of the pyramid, metaphorically becoming the most visible component of a data scientist's work, it actually relies on many other steps. We briefly introduce these here and discuss them in detail in modules 2, 3 and 4.

 

### Collecting and Storing

- Data collection and storage are (often) part of our job
- If data needs to be generated in the first place, this should be an entirely different project!

![newspapers](../../figures/m1/newspapers.jpeg) [Image link](https://blogs.bl.uk/thenewsroom/2019/07/moving-from-a-newspaper-collection-to-a-news-collection.html)


 

Part of the work of a data scientist is knowing the challenges and hurdles
involved in data collection and storage.


While, depending on the size of the team, such practices might be taken care of by software engineers or data curators, it is essential that we know who owns the data, which restrictions apply, how a resource should be stored for long-term preservation and made available to collaborators, and how complex it would be to set up a secure environment such as a Data Safe Haven. In small teams, we as data scientists often take on these responsibilities directly. Data collection can also imply data generation, in settings where the research team aims to produce a new dataset instead of acquiring one that is already available. This requires an even larger set of skills, including selecting a representative sample, preparing annotation guidelines, hiring annotators and monitoring their work, measuring their agreement and finding ways of improving their performance. Our course does not cover these aspects because, in many settings, if data needs to be generated in the first place (through a data collection and annotation task, for instance), that becomes an independent project in its own right.
 

### Processing and Exploring

- Data cleaning implies actions such as: removing, normalising, ignoring, structuring, correcting
- Data exploration allows you to know the collection under study

![cleaning](../../figures/m1/cleaning.jpeg) [Image link](https://www.nytimes.com/2021/10/29/technology/apple-polishing-cloth.html)

 

Another famous expression in the community is "data cleaning", and many
practitioners would say that a large portion of their time is spent processing, wrangling,
cleaning and preparing raw data for use. Cleaning data means that we are imposing our control over a
given collection and explicitly (or, more often, implicitly) shaping it according to
our own definition of "clean". Note also that, even though this task generally takes up
the largest part of a data scientist's work, it is often undervalued.
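
To make the idea that "clean" is a choice more concrete, here is a minimal sketch in Python. The records, the markers treated as missing and the spelling corrections are all hypothetical; each one encodes a decision that a different analyst might reasonably make differently.

```python
import unicodedata
from collections import Counter

# Hypothetical raw values for a "city" field, as they might arrive from
# several sources with different conventions.
raw_records = ["  London ", "london", "LONDON", "Londres", "", "N/A",
               "Manchester", "manchester "]

# Assumption: these values mean "no information" and the record is dropped.
MISSING_MARKERS = {"", "n/a"}
# Assumption: these variants refer to the same place and should be merged.
CORRECTIONS = {"londres": "london"}


def clean(value):
    """Return a normalised value, or None if we decide to remove the record."""
    # Normalising: unify Unicode forms, surrounding whitespace and case.
    normalised = unicodedata.normalize("NFKC", value).strip().casefold()
    # Removing: discard values we have chosen to treat as missing.
    if normalised in MISSING_MARKERS:
        return None
    # Correcting: map known variants onto a single preferred form.
    return CORRECTIONS.get(normalised, normalised)


cleaned = [c for c in map(clean, raw_records) if c is not None]
counts = Counter(cleaned)

print(f"{len(raw_records)} raw records -> {len(cleaned)} cleaned records")
print(counts)
```

Every constant in this sketch is a judgement: merging "Londres" into "london" or discarding "N/A" changes what the resulting dataset can and cannot tell us, which is why such choices are worth documenting.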

For many disciplines the availability of large datasets is absolutely unprecedented. While we will focus later on how this is changing science as a whole, for the moment it is important to understand that defining new research questions or business goals is very complex, because without data exploration you often don't know what is and is not contained in the data, how it could be used and which new insights you could derive from it. To take an example directly from our work at the Turing, [Project Odysseus](https://www.turing.ac.uk/research/research-projects/project-odysseus-understanding-london-busyness-and-exiting-lockdown) relies on information on the level of activity in London during the pandemic. This is derived from data collected from JamCam cameras, traffic intersection monitors and aggregate GPS activity, which were initially adopted by another Turing project, the [London Air Quality](https://www.turing.ac.uk/research/research-projects/london-air-quality) project. Knowing your data collection, the way it has been created and the additional information it might contain is an essential skill for a data scientist. For this reason, in Module 3 we will focus on data exploration techniques that help data scientists get a better understanding of the collection, allow collaborators to move from a general intuition to a specific question and make room for serendipitous discoveries.
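
As a small taste of what "knowing your collection" can mean in practice, the sketch below profiles a tiny table before any question is asked of it. The extract, its column names and its values are invented for illustration and are not the data used by Project Odysseus; the checks shown (row counts, missing values, coverage per source and per date) are only a few of many possible first steps.

```python
import csv
import io
from collections import Counter

# Hypothetical extract of daily counts from a handful of cameras.
raw_csv = """camera_id,date,vehicle_count
A1,2020-04-01,120
A1,2020-04-02,
B7,2020-04-01,95
B7,2020-04-02,101
C3,2020-04-01,
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# How much data is there?
n_rows = len(rows)

# Which columns have empty cells, and how many?
missing = Counter(
    column for row in rows for column, value in row.items() if value == ""
)

# Is every source equally well covered?
rows_per_camera = Counter(row["camera_id"] for row in rows)

# Which period does the collection actually span?
dates = sorted({row["date"] for row in rows})

print(f"rows: {n_rows}")
print(f"missing values per column: {dict(missing)}")
print(f"rows per camera: {dict(rows_per_camera)}")
print(f"date range: {dates[0]} to {dates[-1]}")
```

Even on five rows, this kind of profile raises questions worth taking back to collaborators, such as why one camera only appears on a single day and why some counts are empty.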

 

### Modelling

- Often presented as the core activity of a data scientist
- We build models with a specific goal in mind
- And (properly!) assess which solution is the best, in a given setting


![free_lunch](../../figures/m1/free_lunch.jpg) [Image link](https://fineartamerica.com/featured/theres-no-such-thing-as-a-free-lunch-dana-fradon.html)

 

Modelling is generally considered the core activity of a data scientist. While we have already stressed that modelling is just one of the steps in our work, it is also important to highlight from the beginning two aspects of modelling that are present in every data science activity. First, we build models with a specific goal in mind. In fact, we devote a large part of our project scoping activities (which we will see in Lesson 1.2) to defining a specific question, a corresponding data science task and a measure of success. So the modelling that we do is always clearly focused on addressing a specific, well-defined need.

Second, and highly related, modelling is about comparing solutions to determine what works "best" in a given setting. For instance, if the task is segmenting cells given a microscope image, we would first implement and test a series of established approaches for the task, and then we would assess whether, for instance, recent advancements in the field of computer vision would offer improvements in this specific setting. As we will remark later, the job of a data scientist is usually not to improve over a given state-of-the-art method (this might be the job of a researcher in machine learning for computer vision for instance), but to implement and assess the current “best” approach for a given task. In certain situations, this might lead to an improvement over the state-of-the-art, or it might just reconfirm that a very well known and established baseline remains the most reliable solution for a problem.
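
The habit of comparing a candidate against an established baseline, using a measure of success agreed in advance, can be sketched in a few lines. Everything here is an illustrative assumption: the synthetic task, the choice of accuracy as the measure, the majority-class baseline and the single-threshold "model" merely stand in for whatever approaches are established for a real task.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the sketch is repeatable


def make_example():
    """Hypothetical task: predict a binary label from one numeric feature."""
    label = random.random() < 0.3
    feature = random.gauss(2.0 if label else 0.0, 1.0)
    return feature, label


data = [make_example() for _ in range(2000)]
train, test = data[:1500], data[1500:]  # the test set is held out throughout


def accuracy(predict, examples):
    """Measure of success: share of examples predicted correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


# Baseline: always predict the most common label seen in training.
majority_label = Counter(y for _, y in train).most_common(1)[0][0]


def baseline(x):
    return majority_label


# Candidate: a threshold on the feature, selected on the training set only.
candidate_thresholds = [t / 10 for t in range(-20, 41)]
best_threshold = max(
    candidate_thresholds,
    key=lambda t: accuracy(lambda x: x > t, train),
)


def threshold_model(x):
    return x > best_threshold


baseline_accuracy = accuracy(baseline, test)
model_accuracy = accuracy(threshold_model, test)

print(f"baseline accuracy on held-out data:  {baseline_accuracy:.2f}")
print(f"threshold accuracy on held-out data: {model_accuracy:.2f}")
```

The baseline number is what makes the second number interpretable: a score only counts as an improvement relative to the simplest reasonable alternative, evaluated on the same held-out data.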

 





### A Fourth Paradigm?

From "Beyond the Data Deluge" (Bell et al., 2009, Science):

- Experimental and theoretical science as the basic research paradigms for understanding nature
- Recently, computer simulations have become an essential third paradigm
- Now a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science

![fourth_para](../../figures/m1/fourth_para.jpg) 


As Bell, Hey and Szalay (2009) said in a famous article in Science, for a long
time "scientists have recognized experimental and theoretical science as the
basic research paradigms for understanding nature. In recent decades, computer
simulations have become an essential third paradigm [...]" and now "a fourth
paradigm is emerging, consisting of the techniques and technologies needed to
perform data-intensive science. Today, some areas of science are facing hundred-
to thousandfold increases in data volumes from satellites, telescopes, high
throughput instruments, sensor networks, accelerators, and supercomputers,
compared to the volumes generated only a decade ago. [...] Other research fields
also face major data management challenges. In almost every laboratory, “born
digital” data proliferate in files, spreadsheets, or databases stored on hard
drives, digital notebooks, Web sites, blogs, and wikis. The management,
curation, and archiving of these digital data are becoming increasingly
burdensome for research scientists." 

 


### Data-Driven Science?

From "The End of Theory" (Anderson, 2008, Wired):

- Are we at the "end of theory" and the advent of "data-driven" science?
- Is it true that "Correlation is enough" and that "we can analyze the data without hypotheses about what it might show"?
- There is no need for a priori theory, models or hypotheses

![theory](../../figures/m1/theory.jpg) 
[Image link](https://www.wired.com/2008/06/pb-theory/)

 

Within just a few years the discussion around the fourth paradigm moved on to an even more contentious topic: the "end of theory" and the advent of "data-driven" science. This discussion was started by a provocative article by Chris Anderson in Wired (2008), containing statements such as "Petabytes allow us to say: ‘Correlation is enough.’ We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all." As Kitchin (2014) has highlighted, there is "a powerful and attractive set of ideas at work in the empiricist epistemology that runs counter to the deductive approach that is hegemonic within modern science:

- Big Data can capture a whole domain and provide full resolution;
- there is no need for a priori theory, models, or hypotheses;
- through the application of agnostic data analytics the data can speak for themselves free of human bias or framing, and any patterns and relationships within Big Data are inherently meaningful and truthful;
- meaning transcends context or domain-specific knowledge, thus can be interpreted by anyone who can decode a statistic or data visualization."

 

### Data-Driven Science with a Critical Mindset

- Our perception of data science in society and research has drastically changed
- The core of our course will be on how to approach data, methods, and questions in a critical way


![atlas_ai](../../figures/m1/atlas_ai.png) 
[Image link](https://www.cambridge.org/core/journals/mrs-bulletin/article/abs/nomad-the-fair-concept-for-big-datadriven-materials-science/1EEF321F62D41997CA16AD367B74C4B0)

 

Reading such statements now might leave us speechless, especially after several years of discussion about the limitations of computational methods, the biases embedded in trained models, the fact that data don't speak for themselves, the need for experts to define the scope of a study and interpret its results, and the impact that neglecting or undervaluing all these things has on science and society as a whole. In our course we will spend a lot of time on how to approach data, methods and research questions in a highly critical way, to ensure (as best we can) that the new findings we arrive at are reliable and reproducible.

 



## Research data scientist

- We are often a central element in research projects (connecting data providers, domain experts, final users)
- We will be in a position to ask “why” people want to use data science approaches
- We contribute to shaping research directions and guaranteeing reproducibility

![reg](../../figures/m1/reg.png) 


As we have highlighted in this introduction to the first module, the role of a data scientist in research brings with it many responsibilities. We will often be the central element in the projects we are involved with, connecting data providers, domain experts and final users, and it will be our duty to understand the challenges involved in each step in order to facilitate cross-disciplinary communication. Even more importantly, this puts us in a position to ask "why" people want to use data science methods to address a particular research question, always ensuring that people are aware of both the limitations of such methodological frameworks and the risk that they might further emphasise established social hierarchies in very subtle ways.

If we consider the large-scale [Living with Machines](https://www.turing.ac.uk/research/research-projects/living-machines) project, a five-year, data-driven study of the Industrial Revolution that has over twenty members and currently has five REG members working on it, we can see in how many respects research data scientists (RDS) have now become essential. RDS are, for instance, heavily involved in data acquisition, classification (based on the level of sensitivity) and storage. They coordinate the use of a secure environment (a [Data Safe Haven](https://www.turing.ac.uk/research/research-projects/data-safe-havens-cloud)) to work on copyright-protected collections and ensure the secure egress of all outputs from this environment. They are also responsible for organising acquired materials in databases and for passing on the necessary skills to other researchers on the project so that they can access such resources easily. As the project employs many types of data sources (digitised maps, newspaper articles and tables of census records), RDS develop tools to facilitate their adoption (for instance [historical language models](https://github.com/Living-with-machines/histLM), methods for dealing with [fuzzy string matching](https://github.com/Living-with-machines/DeezyMatch) or for [sampling resources](https://github.com/Living-with-machines/PressPicker_public)), and they contribute to research papers based on data science methods (see for instance [Living Machines: A study of atypical animacy](https://arxiv.org/pdf/2005.11140.pdf) or [Maps of a Nation?](https://web.archive.org/web/20210423221450id_/https://watermark.silverchair.com/vcab009.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAArcwggKzBgkqhkiG9w0BBwagggKkMIICoAIBADCCApkGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM1AlAgUtq5ba0pasxAgEQgIICal_obCGyNK5g70TgQwLWGBcI748EBTsJcVj0MKBXtN2ktjgLdQxm8MkLSEMf65JLQZZifGp_UZfGzn6K6ZeHQ04ODsAElr2wIqj0cA0dFHsAV9pC1hbbWuvEv5QwzrAGRDAg4nh2ALPok39NLd4CKzDv0zjeKLUQOMyoxxUPaIXA4KDkL1aq1EXbcjDdSIB65L9e8K5F0bfpPToIMdck_QztBtCm05JVYJMIWAOWDrshmAbnDVoz4STt99fCj9mML860iRGvlcOExguDKTLtFVxxMuKHRY6tduTApzTiAsGhgFTKE1dZ47pzZVV_giTA9od6U9BQbhpyJ1spDoKHf4RGjecgO7hbi3DdrN73BzOEmBsnv9uUhpKGOYBtd4PMIgXh-SNFpMtWQ2SdoOLxKSmM-lN8LdnyMILUqxneyzMMd-oKGOLJodHP71Xy8Y97N3uqnLtZZ_7Fcb87BnLRNQv5So3udV0UzJWxOxyJwB3iym2mmH1XKwrv2xLMY7hKMNBCp87qNZcUVHZitcjNh40HgPualpIjU_sibvokmEWCiBo1gTZmKf5kr2f6JtO3W9chQaKTJpqVf-LCCq2ABiURFDewU11SGg-81jot7lLHH22QSetHc8UlyRZvUo7bkxxGnsjNnRkL8JNUJyl2nnGYUAV2TWOwU0k_tXIFCNEWGlgwr9RCqat0DtjiQqPgOeL4l72gFyWDQ4VPfFmZELBgoxzoe8HrCJRbsuicqbqV4jn6C4ui55T17hKnJtkm68YaT-3S8ni4tNX3gGZyBnCps6E3Je9oyPOwU8ArXRs42G1kzlHzW2IsFg)). In such work, RDS not only build benchmarks and assess the performance of the methods employed, but above all contribute to shaping the research goal from the very beginning and make sure that results are reproducible and limitations are clearly discussed in the paper.

Finally, they also take on many project management tasks, for instance planning and leading specific subprojects, scheduling milestones and deliverables, and managing stakeholders' expectations. While Living with Machines remains almost a unicorn at the Turing for its size and its interdisciplinary dynamics, it offers a clear overview of the many skills, duties and responsibilities that are part of our job.

To offer you a broad overview of such duties across Research Data Science projects, in module 1.2 we will start by examining the importance of the project lifecycle and how being proactive in shaping it around each specific research project helps ensure that we can deliver at least the minimum valuable outcome initially agreed by all parties involved. Because data science is so present in public conversations in and out of academia, it is also our role to counter the many myths and toxic narratives that are common regarding the field, and to make our collaborators aware of the many risks and challenges that data science poses. For this reason, module 1.3 will address many aspects of the current debate about Equality, Diversity, and Inclusion (EDI) in data science. Finally, as we have already remarked, being a research data scientist implies working in highly interdisciplinary, collaborative environments, so the last submodule, 1.4, will conclude by discussing best practices for collaborative coding.



## References

Rogers, A. (2021). Changing the World by Changing the Data. arXiv preprint arXiv:2105.13947.

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired magazine, 16(7), 16-07.

Bell, G., Hey, T., & Szalay, A. (2009). Beyond the data deluge. Science, 323(5919), 1297-1298.

Crawford, K. (2021). The Atlas of AI. Yale University Press.

D'Ignazio, C., & Klein, L. F. (2020). Data Feminism. MIT Press.

Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1(1).

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(1), 1-23.
