# Hands-on (Guidance for Helpers)

In this exercise you act as a PI who has a proposal (see below) for a collaboration with REG (the participants). The goal of the exercise is to teach them how to scope a project, from an initial idea to a defined RDS task. We have listed below a few things to check and a specific answer to each of the scoping questions.

## Phase 1: Orientation

During orientation, make sure they:

1. Set up a GitHub repo for each group
2. Ensure that all participants have access to it
3. Prepare a scoping project board, with an issue for each of the scoping questions (they have a list)
4. Go through all the received materials
5. Explore the dataset, documentation and [the original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3678208/) we used as an inspiration
6. Start filling in answers to the scoping questions based on the initial exploration


Regarding the initial contact from the PI in particular, they should notice the following:

1. The URL in the PI email does not point to the correct website. If they ask, you can point them to the correct one (see below under data availability). If they start googling without checking with you, you can correct them and say that, when in doubt, they should reach out to the PI.

2. Additionally, the dataset that is freely available online covers only the 2007 and 2011 waves (not the 2016 one linked in the email). They should notice this as well, or you can make them aware of it.

3. The deadline is very short - they should notice this and investigate with you if there is flexibility.

4. The methodology is incredibly vague. They should notice this and check with you if the PI wants to rely on REG completely for methods. If so, the goal of the analysis needs to be very clear from the beginning.


## Phase 2: Technical Questions

The second part is an open dialogue between you (the PI) and the participants. They want to reach an agreement with you regarding the following questions. We have filled in the answer they should obtain from you at the end, but you can be more vague initially, to see how they approach the scoping phase. Note also that the questions are not necessarily sequential and the conversation could jump between them.

### The goal

**1. What is the broad challenge we are trying to solve?**

The PI wants to find an explanation for social inequalities in health across all of Europe.

**2. What is the specific research question? How does it translate to a data science problem?**

This is absolutely not clear from the initial email, so they need to discuss it with you during the meeting (knowing the dataset is essential here). The PI should direct people towards the definition of a specific task focused on:
- predicting a specific variable (the self-reported health)
- using other variables as features

The specific RQ is defined [here](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498).

### The data

**3. Is data available?**

The PI refers to the "European Quality of Life Survey", linking to a Cornell library website. However, the dataset linked there is the most recent version of a series of datasets created by [this project](https://www.eurofound.europa.eu/surveys/european-quality-of-life-surveys). Data for 2007 and 2011 is available in open access (CC 4.0) through the UK Data Service [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data); for the most recent version you would need to register with [UKDS](https://www.eurofound.europa.eu/surveys/about-eurofound-surveys/data-availability).

Additional issues that should emerge:
- **Data procurement and missingness**: Only the `European Quality of Life Time Series, 2007 and 2011: Open Access` is [readily available](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data). However, many variables are missing in Wave 2, so it would be better to stick with Wave 3. This will be discussed further in Module 2, but it is good if someone already spots it, and also if they notice differences between countries.
- **Documentation:** They should find documentation and understand who carried out the surveys and issues across countries.
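
One way participants might check the missingness point is to compute, for each wave, the share of missing values per variable. The sketch below is a minimal illustration in plain Python. The records, the column names (`wave`, `health`, `income`) and the use of `None` for a missing answer are hypothetical stand-ins, not the actual coding of the UKDS file, where missing answers may instead be stored as special codes described in the documentation.

```python
from collections import defaultdict

def missing_share_by_wave(rows, wave_key="wave"):
    """Return {wave: {variable: fraction of rows where the value is None}}."""
    totals = defaultdict(int)
    missing = defaultdict(lambda: defaultdict(int))
    for row in rows:
        wave = row[wave_key]
        totals[wave] += 1
        for name, value in row.items():
            if name == wave_key:
                continue
            # Adding 0 keeps variables with no missing values in the result.
            missing[wave][name] += 1 if value is None else 0
    return {
        wave: {name: count / totals[wave] for name, count in counts.items()}
        for wave, counts in missing.items()
    }

# Toy records, made up for illustration only.
toy_rows = [
    {"wave": 2, "health": 1, "income": None},
    {"wave": 2, "health": 2, "income": None},
    {"wave": 3, "health": 3, "income": 4},
    {"wave": 3, "health": None, "income": 2},
]

shares = missing_share_by_wave(toy_rows)
print(shares)
```

A variable whose share is close to 1 in one wave and close to 0 in the other is a candidate for the kind of gap that motivates sticking with Wave 3.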

### The expectations

**4. What are the stakeholders' expectations?**

In the initial email, the expectations are both very vague and problematic. There is a bit of skepticism around "big-data" methods alongside the desire to do something ground-breaking. Participants should try to define more clearly what the PI would really like to achieve.

What the PI wants is a study that investigates whether there are variables that explain the association between socio-economic status and poor health across Europe.

**5. What is in-scope and out-of-scope?**

- In scope: see [the defined RQ](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498)
- Out of scope: studies beyond Europe, engaging with the public / government. No more datasets.

### Success

**6. What metrics do we use to measure the success of the project?**

Define a clear MVP ([see here](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498)).

**7. What is the expected impact?**

Questions they should ask:
- Are we really expecting to influence governments?
- Does the PI have previous experience of engaging with political actors?

Answer: No, the PI was being overly optimistic. The expected impact is a wide analysis of the dataset that offers a better understanding of the phenomenon within the academic environment, not beyond. The grant will not include activities (workshops, etc.) for engaging with the public or government.

## Phase 3: Ethical Points (30 mins in groups, then 30 mins together)

As a final discussion we would like to focus on the goal of studying **health** while having only **self-reported health** information in the dataset. Participants should discuss together the issues and challenges that we might encounter (by also exploring the dataset, the paper and the documentation). We will first discuss in sub-groups and then hold a final open discussion with all the participants.

### Aspects that we expect to emerge

- People don't report negative health.
- One person per household
- Differences across countries (but how do we quantify them?)
- We have no information about the person asking questions
- How "Fair" was interpreted

### Additional points that could emerge

These are some of the additional points that could emerge. 

**1. Representation bias**

How representative is the dataset?

Who are the people interviewed (do you know if they are foreigners?)

Who are the people who asked the questions?

**2. Label bias**

Issues with binarising health

How can you minimise this?
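
To make the binarisation issue concrete, the following sketch shows how the share of respondents labelled as being in poor health depends on where the cut-off is placed. It assumes a five-point scale coded 1 (very good) to 5 (very bad); that coding, the function name and the toy answers are assumptions for illustration, so the survey documentation should be checked for the actual coding.

```python
def poor_health_share(responses, threshold):
    """Share of responses at or above `threshold` (labelled as poor health)."""
    labels = [1 if r >= threshold else 0 for r in responses]
    return sum(labels) / len(labels)

# Toy answers on an assumed 1-5 scale (1 = very good, 3 = fair, 5 = very bad).
toy_responses = [1, 2, 2, 3, 3, 3, 4, 5]

# Counting "fair" as poor health versus counting only "bad" and "very bad".
share_fair_as_poor = poor_health_share(toy_responses, threshold=3)
share_fair_as_good = poor_health_share(toy_responses, threshold=4)
print(share_fair_as_poor, share_fair_as_good)
```

With these made-up answers the label prevalence changes substantially between the two cut-offs, which is why the treatment of "Fair" is worth agreeing on explicitly, or avoiding by keeping the ordinal scale.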

**3. Missing data bias**

What happens to the study if we only use Wave 3? Why are there differences between Waves 2 and 3?

**4. Measurement bias**

A differential reporting of self-reported health across countries may be suspected due to cultural differences. 
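
A first descriptive step participants could take here is to compare the distribution of answers per country. The sketch below is a minimal illustration in plain Python; the country labels, the answer categories and the records are hypothetical and do not come from the dataset.

```python
from collections import Counter, defaultdict

def response_distribution_by_country(rows):
    """Return {country: {answer: share of that country's respondents}}."""
    counts = defaultdict(Counter)
    for country, answer in rows:
        counts[country][answer] += 1
    return {
        country: {answer: n / sum(c.values()) for answer, n in sorted(c.items())}
        for country, c in counts.items()
    }

# Toy (country, answer) pairs, made up for illustration only.
toy_rows = [
    ("A", "good"), ("A", "fair"), ("A", "fair"), ("A", "bad"),
    ("B", "good"), ("B", "good"), ("B", "good"), ("B", "fair"),
]

distribution = response_distribution_by_country(toy_rows)
print(distribution)
```

Differences in these distributions cannot, on their own, separate true differences in health from differences in how people report it, which is the point participants should reach in the discussion.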

**5. Chronological bias**

Can we easily align the two Waves?


