# Hands-on session part 1: Scoping & set-up

In this hands-on session we will cover:

- Project scoping: what it is, what it involves and why it is important.
- Introduction to Scivision
- Work in groups:

## Project Scoping

An RSE / RDS project begins with an initial question or problem. This can originate from:
- A research or business leader who wants to address a particular aspect.
- A domain expert who observes some issue in their day-to-day work and questions how to address it.
- An informal or formal discussion between various members of a team or unit that leads to an idea.
- An external client with a specific request.
- The availability of a new (often large-scale) dataset in a specific research domain.
  
These initial questions are often vague, and the aim of the scoping process is to move from them to a well-defined project scope. This requires engagement with the stakeholders:
- During the problem formulation and scoping phase, we aim to answer a number of questions through detailed and careful conversations with all of the stakeholders.
- This means having an open dialogue across communities, areas of expertise and often very different academic cultures.
- It requires building a welcoming and respectful environment, as well as keeping careful documentation of all discussions and decisions.
- Negotiations are necessary to narrow down the scope: agree on what needs to be delivered and how.
- The outcome is a set of agreed and documented answers to the questions (project board).

## What questions should teams ask themselves?

Here, we list some important questions that come in handy when talking to stakeholders and trying to scope a project. This list is by no means exhaustive, but it should help you in the scoping part of the hands-on session. This topic is covered in more detail in our **Research Data Science Course** [link].

### Question 1: What's the problem we are trying to solve?

We want to understand the status quo (e.g. any existing solution or research or lack thereof) and what is missing or problematic, in other words the motivating problem behind the project. 



### Question 2: What is the specific research question? How does it translate to a data science / software engineering problem?
    
Starting from a possibly vague and broad challenge which sets the context, we want to help the researchers and domain experts clarify the scope of the project by:
- Defining a specific research question
- Translating the question into a well-defined data science or software engineering problem.

In projects where RSEs / RDSs collaborate with domain experts and researchers, there are often **knowledge gaps**:
- Domain experts have a lot of knowledge about their area but: 
  - Often do not have a good grasp of data science methods and computing.
  - Might not be able to translate what they want to a list of technical requirements and a plan to deliver the output.
  - Might not understand how much time the technical work is likely to take.
- Research data scientists have an understanding of data and analytics and software engineering but:
  - Are often strangers to the domain area, without a good understanding of how the field operates, how data are generated (and the related complexities), what previous interdisciplinary studies exist in the area, etc.
  - Might rush the scoping process to get on with the work, leading to gaps in understanding that can be costly later.

### Question 3: In case of data science projects: Is data available and appropriate?

- **Data or questions first?** 
The dilemma of whether to first develop a research question and then find a dataset, or first select a known dataset and then develop the question, is well known. The default way to do science is the former, but at the same time data science as a field is driven by the unprecedented amount of data now available in digital format.

- **Can I legally use the data?** 
When a dataset has been selected, it is always essential to assess the copyright, data sharing agreement, sensitivity of the information contained and data collection policy of the selected resource. 
 

- **Is the data easily accessible?**
    

- **Is the dataset well-understood and tested?**


- **Is data quality and quantity appropriate?**
Along with finding the right data, the following factors should be taken into account and their impact on the present and future understood:
   - What is the quality and quantity of the data?
   - Will these change in the future? Will the data availability stay the same?
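
A first look at quality and quantity can be automated before any modelling starts. The following is a minimal sketch, assuming the data arrive as a CSV file; the function name `audit_rows`, the list of tokens treated as missing and the toy columns (`site`, `date`, `species_count`) are all hypothetical and only serve as illustration.

```python
import csv
import io


def audit_rows(rows, missing_tokens=("", "NA", "NaN", "null")):
    """Return the row count and the fraction of missing values per column.

    `rows` is an iterable of dicts, e.g. a csv.DictReader.
    `missing_tokens` is an assumption about how gaps are encoded.
    """
    rows = list(rows)
    n_rows = len(rows)
    columns = list(rows[0].keys()) if rows else []
    missing = {}
    for col in columns:
        # Treat None (short rows) and any listed token as missing.
        n_missing = sum(
            1 for r in rows if (r[col] or "").strip() in missing_tokens
        )
        missing[col] = n_missing / n_rows
    return {"n_rows": n_rows, "missing_fraction": missing}


# Toy data standing in for a real dataset (hypothetical columns).
sample_csv = """site,date,species_count
A,2021-01-01,12
B,2021-01-02,
C,,7
D,2021-01-04,NA
"""

report = audit_rows(csv.DictReader(io.StringIO(sample_csv)))
print(report["n_rows"])
for column, fraction in report["missing_fraction"].items():
    print(column, fraction)
```

Sharing a summary like this with stakeholders early can help agree on whether the data are sufficient, and re-running it on later data deliveries shows whether quality or quantity has changed.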

### Question 4: What are the stakeholders' expectations?

- **Stakeholder scepticism is common and has many reasons**
 
- **Data scientists should think critically about the projects they are involved in and discuss concerns with stakeholders**    

- **Data scientists should help moderate extreme expectations**
   


### Question 5: What does the output product look like and how is it going to be used?

This question often helps expose differences in understanding about the direction of the project between stakeholders. It is often the case that stakeholders have different ideas in mind about the final product or might not have a very well-shaped idea.
- RSEs / RDSs can use their experience from past projects and their knowledge of the limits of their tools to define an output product that fulfils the requirements and is also realisable.
- It is usually good practice to define a **Minimum Viable Product (MVP)**, which is a version of the output product with just enough features to be usable by early customers who can then provide feedback for future product development.

### Question 6: What is the state of the art?

This question is important to understand what is already out there and what needs to be added. 
Important related questions include: 
- **Is there a system or method in place? Is there any documentation?**
Researchers should do the necessary research to decide if their project will be incremental or create something from scratch.

- **Is the goal of this project to go beyond the state of the art?** 
  If the goal is to build a system that would improve over a given state-of-the-art, it is important to understand how difficult and realistic this would be and communicate it to stakeholders. 


### Question 7: What is in-scope and out-of-scope?

One of the most common problems when formulating a research question or problem is that the scope is not well defined, documented and controlled, which causes problems and confusion later on.

- During problem formulation, the scope needs to be documented clearly and agreed by all parties.
- It is advisable to keep scope limited and small, especially for small projects.
- Document what is out of scope. This helps bring a lot of issues to the surface early on and removes uncertainty.
- Again, having defined an MVP would be really helpful at this stage.
- A measure of success needs to be agreed.


### Question 9: How do we measure the success of the project?

- We should always aim to define a success metric. In many cases this should be quantitative and allow us to track how successive iterations change it, but there are exceptions.
- We should define baseline models/systems. 
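
As an illustration of the two points above, here is a minimal sketch that compares a candidate system against a trivial baseline on one agreed metric. It assumes a classification task with accuracy as the success metric; the labels, the predictions and the helper names (`accuracy`, `majority_baseline`) are hypothetical, and a real project would choose the metric and baseline together with the stakeholders.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n_predictions):
    """Baseline that always predicts the most frequent training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions


# Hypothetical labels for a small image classification task.
y_train = ["cat", "cat", "dog", "cat", "dog", "cat"]
y_test = ["cat", "dog", "cat", "cat", "dog"]

# Hypothetical output of the candidate model on the test set.
candidate_pred = ["cat", "dog", "cat", "dog", "dog"]

baseline_pred = majority_baseline(y_train, len(y_test))
baseline_acc = accuracy(y_test, baseline_pred)
candidate_acc = accuracy(y_test, candidate_pred)

print(f"baseline accuracy:  {baseline_acc:.2f}")
print(f"candidate accuracy: {candidate_acc:.2f}")
print(f"improvement:        {candidate_acc - baseline_acc:.2f}")
```

Recording the same metric for every iteration makes it possible to track whether successive changes actually improve on the baseline.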

### Question 11: What computational resources are available?

We need to make sure that computational requirements are covered. This typically involves requesting access to cloud resources, databases, HPC infrastructure or secure environments, and paying for compute time.

### Question 13: What are the timelines and milestones of the project?

We should come up with a project plan that includes a timeline, milestones, risks, required human effort and defined roles for participants (including skills necessary and work required from each). These should be documented with agreement from all stakeholders.  

### Question 14: Ethical considerations

This is a theme that should run throughout the scoping process. We should always try to think about our project through a lens that includes considerations about diversity, equity, inclusivity and wider ethics. This is a really complex topic, but these are some first pointers:
- Are there any ethical and legal concerns related to the activities of the project or the use of the data (e.g. individuals' data privacy, socioeconomic inequalities that may be exacerbated)?
- What should be done to gain social licence and public trust (e.g. a wider impact assessment and bias assessment to understand the impact on communities and users)?
- Are there negative impacts on society or other parts of any of the involved organisations?
- Are there any conflicts of interest?