# Lesson 1.2: Research Data Science project lifecycle

Data science projects in research environments comprise multiple stages and require careful planning that accounts for the specific complexity of each step.

In this lesson we:
- First briefly go through all the stages of a typical research data science project and give examples.
- Focus on the problem formulation and scoping stages, in preparation for the hands-on activity.
- Finally highlight common hurdles and challenges which might impact a project timeline.


### The project lifecycle

There are different types of research data science projects. Their content and structure can differ depending on:
- Aims of the project (e.g. a short project that aims to enhance functionality of an existing system or a large exploratory research project that aims to uncover new knowledge).
- Number of people involved, their expertise and expectations (data scientists, researchers, software/data engineers, data providers, domain experts, users, project managers)
- Methods used (e.g. are we using predictive models, optimisation algorithms or visualisation tools?)
- Type of data (e.g. offline or online, big or small, structured or unstructured)
- Setting (academic, corporate, government)

Nevertheless, there are some high-level steps/stages that we encounter in most projects. A typical project lifecycle is shown in the figure below, assuming that the core methodology used involves building a model (this is the most common scenario but not the only one).

It takes years of experience to understand the ins and outs and the challenges of each stage and to become familiar with the end-to-end process, but having an overview of the lifecycle helps prepare for this.



![ds_lifecycle](../../figures/m1/ds_lifecycle.png)

There are three main stages in the figure:
- Design: Involves formulating the problem, planning the project, collecting the data and doing preliminary analysis.
- Develop: Comprises the core work of developing and implementing the right methodology (usually some kind of model) to solve the problem and testing/reporting how well it works, in comparison for instance with a series of predefined baselines.
- Deploy: Contains the steps necessary to move from a prototype implementation to a production-level system, as well as to allow adoption by the users and regular performance monitoring.



These stages initially seem sequential. But it is almost always the case that the process is iterative:
- Many projects go through the lifecycle multiple times, iteratively refining the model or output.
- New findings during one stage can lead to revisiting a previous stage. For example, while developing and validating the model, data scientists might realise that the available data are either not enough or not completely representative of the studied domain to train an adequate model and might decide to go back and collect/procure more data. 
- The same iterative process is followed within each of the three stages, with additional overlaps, e.g. problem formulation and data procurement sometimes happen in parallel.

The data scientists in the project need to have an active role in all of the stages above and have to be proactive in keeping all other people involved in the discussion and up-to-date with the process. Everyone's input is necessary when considering different options during an iterative process (for instance, data providers and domain experts will be able to raise challenges and sketch timelines in case the researchers are considering collecting more data).

## Our Focus: Problem formulation, scoping and planning
Here we will focus mostly on the first two stages of the lifecycle. We will delve into most of the remaining stages in Modules 2-4 of our Research Data Science Course.

Research data science is used to address problems, take decisions and answer questions in a data-driven way. A research data science project begins with an initial question or problem. This can originate from:
- A research or business leader who wants to address a particular aspect.
- A domain expert who observes some issue in their day-to-day work and questions how to address it.
- An informal or formal discussion between various members of a team or unit that leads to an idea.
- An external client with a specific request.
- The availability of a new (often large-scale) dataset in a specific research domain.

More often than not, the initial question at this early stage is vague and high-level. Data science teams have processes in place to move from this initial vagueness to a well-defined research question and project plan.

One important first step is to include in this process all the different teams and stakeholders that are involved in the project. These might involve:
- Data scientists
- Researchers and/or business experts
- Data engineers, software developers and DevOps teams
- Project managers
- Data providers
- Potential or current users
- Communities that might be involved in the project or affected by it
- The public

During the problem formulation and scoping phase we aim to answer a number of questions by having detailed and careful conversations with all of the stakeholders. Here, we list some important questions that come in handy when talking to stakeholders and trying to scope a project.

Discussing these questions and giving clear answers is likely to require meetings with all stakeholders in the project and an open dialogue across communities, areas of expertise and often academic cultures. It therefore requires building a welcoming and respectful environment, as well as careful documentation. Negotiation between the various teams is necessary in order to narrow down the scope and agree on what could and what needs to be delivered and how. At the end of the process, we should have agreed and documented the answers to all the questions.

**Question: What is the broad challenge we are trying to solve?**

- We want to understand the status quo (e.g. any existing solution or research or lack thereof) and what is missing or problematic, in other words the motivating problem behind the project. For example, the Turing's collaboration with NATS ([Project Bluebird](https://www.turing.ac.uk/research/research-projects/project-bluebird-ai-system-air-traffic-control)) aims to solve the following problem:


> Air traffic control (ATC) is a remarkably complex task. In the UK alone, air traffic controllers handle as many as 8,000 planes per day, issuing instructions to keep aircraft safely separated. Although the aviation industry has been hit by the pandemic, European air traffic is forecast to return to pre-pandemic levels within five years. In the long term, operations will be constrained by human performance amid rising volumes of air traffic, only to accelerate further with the introduction of unmanned aircraft systems, so next-generation ATC systems are needed to choreograph plane movements as efficiently as possible, keeping our skies safe while reducing fuel burn.

- Another Turing project called [QUIPP](https://www.turing.ac.uk/research/research-projects/quipp-quantifying-utility-and-preserving-privacy-synthetic-data-sets) is trying to solve the following problem:

> Sensitive datasets are often too inaccessible to make the most effective use of them (for example healthcare or census micro-data). Synthetic data - artificially generated data used to replicate the statistical components of real-world data but without any identifiable information - offers an alternative. However, synthetic data is poorly understood in terms of how well it preserves the privacy of individuals on which the synthesis is based, and also of its utility (i.e. how representative of the underlying population the data are).

- A third very different project, [Living with Machines](https://www.turing.ac.uk/research/research-projects/living-machines), is trying to solve the following problem:

> It is widely recognised that Britain was the birthplace of the world’s first industrial revolution, yet there is still much to learn about the human, social, and cultural consequences of this historical moment. Focussing on the long nineteenth century (c.1780-1918), the Living with Machines project aims to harness the combined power of massive digitised historical collections and computational analytical tools to examine the ways in which technology altered the very  fabric of human existence on a hitherto unprecedented scale.

**Question: What is the specific research question? How does it translate to a data science problem?**
    
Starting from a relatively vague and broad challenge which sets the context, our role is to help the researchers and domain experts refine the scope of our collaboration towards a specific, clearly defined question, which we will then try to translate (as much as possible) into a well-defined data science (computational) problem that can be solved by using an algorithm. We also need to guarantee that this translation would still deliver what the researchers expect, by bridging the knowledge gap through specific examples.

If we consider the case studies presented above, asking questions about the specific work setting is a good starting point for moving from a broad challenge to a specific task. For instance, in the NATS project, asking how people generally measure efficiency could help to pin down a well-defined study. Other questions which help clarify the scope of a project focus on data availability. For instance, for Living with Machines: which collections are readily available and how representative are they?


Knowledge gap:
- Knowledge gaps between data scientists and collaborators/customers/researchers are common, encountered particularly in projects where data scientists collaborate with domain experts and researchers.
- Domain experts have a lot of knowledge about their area but: 
  - Often do not have a good grasp of data science methods and computing, e.g. they might not know what machine learning or other methods can and cannot do or, on the opposite, they might only know the most popular methods of the moment (e.g., deep learning toolkits).
  - Might not be able to translate what they want to a list of technical requirements and a plan to deliver the output.
  - Might not understand how much time the technical work is likely to take.
- Research data scientists have an understanding of data and analytics but:
  - Often are strangers to the domain area, without a good understanding of how the field operates, how data are generated (and their related complexities), what is known from previous interdisciplinary studies in the area, etc. Narratives that present data scientists as rock stars or geniuses, able to move across academia because everything can be turned into a data science task in the same way, only exacerbate this problem by giving a false sense of omnipotence. We should always act against this type of narrative, and it is part of our job to recognise the differences and adapt to various fields of study and communities.
  - Might rush the scoping process to get on with the work, leading to gaps in understanding that can be costly later.


Mitigation:
- It is part of the job of the research data scientist to work closely with collaborators to co-define requirements and map out a technical solution and project plan. Data scientists should invest time in understanding the domain area. We highly recommend, whenever possible, spending time in the environment where data are generated, previous research is discussed and domain experts, users or scientists spend their day.
- Data scientists should help collaborators understand what is possible when using data-driven methods and learning algorithms and especially what is not. They should spend time describing in lay terms how a technical solution will work, what data are needed and what issues may arise. They should ask a lot of probing questions to shed light on dark corners of the project scope and detect blockers and inconsistencies early on.



**Question: Is data available?**

- The dilemma of whether to first develop a research question and then find a dataset, or first select a known dataset and then develop the question, is well-known. The default way to do science is the former, but at the same time data science as a field is driven by the availability in digital format of an unprecedented amount of data. Therefore in reality it is common to follow a hybrid approach: define a broad research question or select a wider area of research, then select datasets that provide rich information for this area and finally examine the datasets to understand how to best refine your question in order to be impactful, interesting but also answerable. The opposite approach of first choosing a dataset and then trying to find research questions that arise from looking at it is usually not advisable, as it often leads to biased hypotheses. It might also run counter to the ideas of data minimisation, i.e. collecting personal data only for defined purposes, rather than just to store them and decide how to use them later.
- Refining the research question should lead to a better understanding of what data are needed to solve the problem and whether existing data sources adequately represent the variables that have been identified as important. There are cases where the question cannot be answered with available data. This might lead to further work and discussion in order to refine the research question given the constraints of the data we have available. Or it might lead to a decision to collect new data (which could become a different research project altogether).
- When a dataset has been selected, it is always essential to assess the copyright, data sharing agreement, sensitivity of the information contained and data collection policy of the selected resource.
- When choosing a dataset, an additional concern is to what extent this dataset has been tested and how much code and insights about it already exist out there. The easiest datasets to work with are the ones where (1) there is local or community experience using the dataset that can be put to use, and/or (2) the dataset is relatively easy to access, learn, and use. 
- We will come back to the topic of choosing datasets. Module 2 will help you understand what are good places to look for datasets and what dataset characteristics are beneficial for your research projects.

Along with finding the right data, the following factors should be taken into account and their impact on the present and future understood:
- What is the quality and quantity of the data?
- Will these change in the future? Will the data availability stay the same?


**Question: What are the stakeholders' expectations?**

- Defining the research question involves conversations and negotiations with various stakeholders, including domain experts and potential users of the product. There are a number of challenges in these discussions. Domain experts might be sceptical about the project. This can be due to a lack of understanding of data science techniques, real or imaginary anxiety over job security and other types of impact on their work, long experience with a specific way of working, lack of evidence that data-driven approaches work in their field, or a perception of data scientists as strangers in the area and/or elitist.

- Many of the above concerns might be justified and it is part of the role of a research data scientist to critically question the motivations and impact of the projects they work on and understand in what context and for what purpose their code will be used. In some cases, scepticism might be unjustified and projects should try to secure buy-in from everyone involved by carefully explaining what will happen and who will be impacted. In other cases, the conversation might highlight significant risks and possibly result in a decision to not go ahead with the project.

- It is also important to moderate the opposite, perhaps over-optimistic, reactions, for instance researchers thinking that data science methods would completely solve problems or automate processes and would automatically lead to top scientific publications.

**Question: What does the output product look like and how is it going to be used?**
- This question often helps expose differences in understanding about the direction of the project between stakeholders. It is often the case that stakeholders have different ideas in mind about the final product or might not have a very well-shaped idea.
- Data scientists can use their experience from past projects and their knowledge of the limits of their tools to define an output product that fulfils requirements but also is realisable.
- It is usually good practice to define a **Minimum Viable Product (MVP)** which is a version of the output product with just enough features to be usable by early customers who can then provide feedback for future product development.

![mvp](../../figures/m1/mvp.jpeg)


**Question: What is the state of the art (either in literature or within the organisation)? Is the goal of this project to go beyond this? Is there any documentation of legacy systems?**

- These questions are important to understand what is already out there and what needs to be added. However, they rely on the researchers and domain experts already being aware of previous interdisciplinary literature in their field. If they refer to previous tools, libraries and papers, it is important that we take our time to understand them and to explain whether they would be feasible in this context and why.
- In case there are already solutions (e.g. open source software libraries) that offer most of the features or functionality required, data scientists and researchers should do the necessary research to understand the state of the art and define if their project will be incremental or create something from scratch. 

> For example, the recent [Adaptive Multilevel MCMC Sampling](https://www.turing.ac.uk/research/research-projects/adaptive-multilevel-mcmc-sampling) project aimed to create a Python library that implements the adaptive MLDA MCMC sampler (a new algorithm that researchers from the University of Exeter developed). The initial thinking was to create a library from scratch but after some research the team decided that extending an existing Probabilistic Programming package (for example [PyMC3](https://docs.pymc.io/), [Stan](https://mc-stan.org/) or [Pyro](https://pyro.ai/)) would have a lot of benefits: No need to duplicate a lot of functionality that is mature in those packages, better visibility, a wider community to support the code development.

- It is also important for data scientists to have an understanding of legacy functionality of the systems currently in place.

- Finally, if the goal is to build a system that would improve over a given state-of-the-art, it is important to make clear to the researchers that this is very challenging and that improvement needs to be properly assessed, for instance by performing significance tests. Rephrasing the task as, for instance, assessing whether a specific new method could improve performance over a state-of-the-art would be a better contained study.
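One way to make "properly assessed" concrete is a paired permutation test on per-example scores. The sketch below is only an illustration under stated assumptions: the function name `paired_permutation_test`, the toy correctness vectors and the number of resamples are all invented here, and other tests (for example a bootstrap or McNemar's test) may suit a given project better.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on per-example scores.

    Returns an estimated p-value for the null hypothesis that systems A and B
    perform equally well, so the sign of each per-example difference is arbitrary.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same test examples")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, swapping A and B on any example
        # (flipping the sign of its difference) is equally likely.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing avoids reporting an impossible p-value of exactly 0.
    return (at_least_as_extreme + 1) / (n_resamples + 1)


# Invented per-example correctness (1 = correct, 0 = wrong) on a shared test set.
state_of_the_art = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
new_method = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1]

p_value = paired_permutation_test(new_method, state_of_the_art)
print(f"estimated p-value: {p_value:.3f}")
```

In this toy data the two systems disagree on only four examples, so even though the new method wins all four, the test cannot reach conventional significance levels. This is one reason to discuss the size of the test set during scoping.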


**Question: What is in-scope and out-of-scope?**

It is difficult to scope a real-world project and even more so a research project whose aims often involve discovering new knowledge. One of the most common problems when formulating a research question or problem is that the scope is not defined, documented and controlled well, and this causes problems and confusion later on. We often refer to this as scope creep: changes to, or continuous or uncontrolled growth in, a project’s scope at any point after the project begins (e.g. adding features without addressing effects on time, cost or resources, or without customer approval).
   
   
Example:
> An industrial data science project aims to build an automated ML system to predict customer conversion. The system is designed and tested but during deployment the commercial team requests a new feature that would allow them to manually override some of the predictions.
> ...
   
   
Mitigation:
- During problem formulation, the scope needs to be documented clearly and agreed by all parties.
- It is advisable to keep scope limited and small, especially for small projects.
- Document what is out of scope. This helps bring a lot of issues to the surface early on and removes uncertainty.
- Again, having defined an MVP would be really helpful at this stage.
- A measure of success needs to be agreed.
- The project manager and technical lead should control the scope and not allow it to deviate.

![scope_creep](../../figures/m1/scope_creep_meaning.png)

**Question: What is the expected impact?**

It is important to understand the impact of the proposed project when it is complete. Impact should be understood broadly, from increased knowledge and understanding of the way we see and operate in the world, to increased efficiency resulting from the uptake of new tools and technologies. What will be different if this project is successful? 

Additionally, this question should involve a conversation on how likely it is that the impact will be realised. By combining impact and the likelihood of achieving it, we can create a 2D space and place each project in this space. This can be useful for prioritising projects. It might be desirable for some teams to work on high-impact, high-likelihood-of-success projects or to have a mix of low- and high-likelihood ones.
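As a rough illustration of placing projects in this 2D space, the sketch below scores a few ideas and sorts them. Everything in it is an assumption made for the example: the `ProjectIdea` class, the 0 to 1 scales, the 0.5 threshold, the product of impact and likelihood as a ranking score, and the portfolio itself. In practice the scores would come out of the stakeholder conversations described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectIdea:
    name: str
    impact: float      # 0 (negligible) to 1 (transformative), agreed with stakeholders
    likelihood: float  # 0 (very unlikely) to 1 (near certain) of realising that impact


def quadrant(idea: ProjectIdea, threshold: float = 0.5) -> str:
    """Name the quadrant of the impact/likelihood plane an idea falls into."""
    impact = "high-impact" if idea.impact >= threshold else "low-impact"
    likelihood = "high-likelihood" if idea.likelihood >= threshold else "low-likelihood"
    return f"{impact}, {likelihood}"


def prioritise(ideas):
    """Sort ideas by impact multiplied by likelihood, highest first."""
    return sorted(ideas, key=lambda idea: idea.impact * idea.likelihood, reverse=True)


# Hypothetical portfolio; the scores are invented for illustration only.
portfolio = [
    ProjectIdea("Extend existing library", impact=0.4, likelihood=0.9),
    ProjectIdea("Exploratory new method", impact=0.9, likelihood=0.3),
    ProjectIdea("Dashboard for domain experts", impact=0.7, likelihood=0.8),
]

for idea in prioritise(portfolio):
    print(f"{idea.name}: {quadrant(idea)}")
```

A single ranking score is only one possible choice. A team that wants a mix of safe and risky projects might instead pick ideas from different quadrants.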

**Question: What metrics do we use to measure the success of the project?**

We should always aim to define a success metric. Where possible, this metric should be quantitative and allow us to track how successive iterations and improvements change it. Success metrics can include (the list is not exhaustive):
- Whether the output of the project allows us to do something that was not possible before.
- Whether a piece of software is delivered
- Approval from users and/or the community
- A certain level in predictive performance of a model or some other quantitative measure.
- ...

Along with a success metric, it is important to also define some baseline models/systems. These could be the currently used solution or some simple (naive or not) model. These can serve to measure how well the product we are building performs compared to the baselines.
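To make the idea of a naive baseline concrete, here is a minimal sketch that compares a majority-class predictor with a stand-in for the model under development. The data, the `candidate_model` keyword rule and the choice of accuracy as the metric are all invented for illustration. A real project would use its own agreed metric and held-out data.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _example: most_common_label


def accuracy(predict, examples, labels):
    """Fraction of examples for which predict(example) equals the true label."""
    correct = sum(predict(x) == y for x, y in zip(examples, labels))
    return correct / len(labels)


# Toy, invented data: does a snippet mention a city? (1 = yes, 0 = no)
train_labels = [0, 0, 1, 0, 1, 0]
test_examples = ["London fog", "no place here", "trains to Leeds", "weather report"]
test_labels = [1, 0, 1, 0]


def candidate_model(text):
    """Hypothetical stand-in for the model under development: a tiny keyword rule."""
    return int(any(city in text for city in ("London", "Leeds")))


baseline = majority_baseline(train_labels)
baseline_acc = accuracy(baseline, test_examples, test_labels)
candidate_acc = accuracy(candidate_model, test_examples, test_labels)
print(f"baseline accuracy:  {baseline_acc:.2f}")
print(f"candidate accuracy: {candidate_acc:.2f}")
```

Reporting both numbers side by side makes it clear how much of the candidate's performance goes beyond what a trivial predictor already achieves.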


Example:
> The researchers in Living with Machines need a new tool for identifying mentions of cities in a collection of newspapers. As an MVP we would test a commonly used tool (a named entity recogniser available in the Python library spaCy), assess its performance on a test set developed with the researchers and examine the errors and limitations. Next, we would extend this tool by fine-tuning the pre-trained model on in-domain data and measure whether this leads to a significant improvement in performance.

**Question: What computational resources are available?**

We need to make sure that computational requirements are covered. This typically involves requesting access to cloud resources, databases, HPC infrastructure or secure environments, and paying for compute time. It might also involve paying for licensed software.

**Before defining a timeline, we should revisit the following common issues**

- Data extraction and procurement:
  - Data not easily accessible (e.g. due to sensitivity): In fields like healthcare or finance, it is often the case that individual-level data have barriers to access. There are typically complex governance processes to gain access and even then access might be granted only within secure environments (e.g. a Data Safe Haven). It is recommended to apply for access early in the project or even long before the project starts in these cases. It might also be appropriate to explore other options like anonymised or synthetic versions of the sensitive data.
  - No documentation: This is a typical issue with many datasets that have been created without properly documenting the contents. In those cases it is advisable to get in touch with data curators/owners, but also with domain experts who might be able to shed light on the data.
  - Data quality is low: This is the case where there are a lot of missing or erroneous data due to issues in data generation or collection. These might be related to the physical process that generates the data. There are various options to deal with this problem. In some cases, we might be able to use data science methods that handle missingness, etc (see Module 2). In other cases, domain experts might help clean the data by working with the data scientists. It is also an option to avoid using a data source if we think it is too problematic or even drop the project entirely.
  
- Modelling and reporting:
  - Black boxes: A common issue is when a data science product is built without any care for explainability and reporting. This is the case of an algorithm that just produces predictions but without providing any information on how these are produced, what factors are considered, how performance changes over time and how individual users are affected. It is important to agree ways to achieve those things in the scoping stage and to consider these challenges if we plan to reproduce the results of a tool presented by other researchers.
  - No baseline: If we do not set a baseline, we might not be able to tell if our models work well or not. It is always disappointing to spend months of work on a sophisticated model only to realise that a naive predictor does an equally good job. Automated ML tools (see [auto-sklearn](https://github.com/automl/auto-sklearn) and [TPOT](http://epistasislab.github.io/tpot/)) can help in this case, as they are capable of generating a lot of baseline models and do some basic data science work (e.g. feature engineering) without much user effort.

- Deployment:
  - Initial deployment is tricky: A lot of issues typically pop up in early deployment. It is also the critical period for user adoption if users exist. Data scientists should be prepared for iterative improvements and communicate expectations clearly to the users before this stage. The timing of the deployment is important; avoid overlaps with busy periods for users (e.g. cases where a business team needs to deliver another big project while testing the deployed product).
  - Maintenance responsibility: Maintenance of a piece of software is a usually underestimated part of the project lifecycle. A lot of open source software depends on support from volunteers, which might not be sustainable. Research Engineering teams typically deliver products but then their time in the project expires. This might lead to lack of maintenance. It is advisable to try to engage researchers and other stakeholders in the software development process so that they have the knowledge to maintain the project once developer time expires.
  > For example, in research projects it is good practice to involve research assistants or PhD students in the process while supporting them to develop the necessary software development skills, familiarity with version control, etc.
  

**Ethical Questions:** 

Now that the project has been clearly defined and all its aspects have been agreed, it is the moment to consider the following points.

- Are there any ethical and legal concerns related to the activities of the project or the use of the data (e.g. individuals' data privacy, socioeconomic inequalities that may be exacerbated)?
- What should be done to gain social licence and public trust (e.g. a wider impact assessment and bias assessment to understand impact on communities and users)?
- Are there negative impacts on other parts of the organisation?
- Are there any conflicts of interest?

As this is a really complex topic, we will expand it in Lesson 1.3 (EDI and ethics) and Module 2 (legal issues).

**When all questions have been addressed, we can come up with a project plan that includes a timeline, milestones, risks, required human effort and defined roles for participants (including skills necessary and work required from each)**

The overall scoping and planning stage should lead to a reasonably accurate idea about the above aspects of the project. Specifically:
- We should define the start and end dates of the project (sometimes this might involve defining earliest/latest dates to allow for some flexibility when managing a team of data scientists that work on multiple projects).
- What is the amount of effort required by each team (i.e. how many people for how much time).
- What are the necessary skills (e.g. some projects might require specialist skills in Bayesian statistics or visualisation dashboards).
- We should define a set of intermediate milestones with as much accuracy as possible. For long-term projects these might be shifted during execution but for shorter ones it is important to capture them realistically.
- Are there any risks we should be aware of? What is the level of risk and what mitigation measures can we take?

