# Module 1 hands-on session (Guidance for Helpers)

```{warning}
If you are a course attendee, we discourage you from looking at this material as it reveals some of the things that should naturally come out in conversations during the hands-on session.
```

```{warning}
Helpers should be aware that this is the first time we run this activity and thus there is an experimental element in it. Also, we are not experts in the field of healthcare and social inequalities and thus the research question and the way it should be addressed are new to us. This document gives a framework but will not prepare you entirely for what is going to happen. Please use your judgement and experience from real project situations. We would appreciate your feedback so that we can keep improving the content in the future.
```

In this exercise, course attendees are split into groups to simulate small data science teams. You will act as a PI who seeks funding for a new project proposal (see below) and wants to collaborate with the data science team. The goal of the exercise is to give them exposure to the following aspects of RDS scoping work:
- How to scope a project from an initial vague idea to a defined RDS task with clear goals, measures of success, out-of-scope activities, roles of team members, etc.
- How to recognise EDI issues in the project scope and how to challenge them.
- How to document the scoping process and outcome in GitHub. 

We have listed below a few things to check and a potential answer to each of the scoping questions. The PI should be aware of these but feel free to improvise or base your approach on your own experiences.

## Phase 1: Setup, initial contact and discussion


#### Schedule
20 mins setup (in groups)

35 mins collaborative activity (exploration of materials and discussion, in groups)

***

In this phase, the PI does not need to act strictly as a PI. You should mainly help them set up and access all materials, plus solve any immediate issues and confusion.

During the setup, make sure they:

1. Set up a GitHub repo
2. Give all participants access to it
3. Prepare a scoping project board


In the rest of Phase 1, make sure they:

4. Can access the dataset. The PI refers to the "European Quality of Life Survey" and links to a Cornell library website. However, the Cornell page only has a PDF report from the 2016 version of the dataset (which is the most recent version of a series of datasets created by [this project](https://www.eurofound.europa.eu/surveys/european-quality-of-life-surveys)). The only version of the dataset that is openly available without any restrictions is the 2007-2011 version (open access - CC 4.0), which can be found on the UK Data Service website [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data). **Attendees should use the 2007-2011 open version**. Allow attendees to search for an appropriate version online, but if you see them struggle or start downloading and using datasets they should not, help them. For more recent versions you would have to register with [UKDS](https://www.eurofound.europa.eu/surveys/about-eurofound-surveys/data-availability), and more restrictive terms of use apply.
5. Go through all the received materials: the dataset, the [dataset documentation on the UKDS website](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/documentation) and [the original paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3678208/) we used for research question inspiration (note the paper PDF is not accessible without a subscription). The dataset's [User Guide](http://doc.ukdataservice.ac.uk/doc/7724/mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf) is particularly useful as an introduction to the dataset, as it contains an overview of the variables and how the dataset was created (useful for the scoping and EDI discussions later).
6. Create issues in the project board to address all the scoping questions listed below (and maybe others they come up with).
7. Maybe start answering the questions based on the initial exploration if there is time.

Note that it might be useful for the attendees to split their team into groups and parallelise the work.

Regarding in particular the initial email from the PI, they should notice the following:

1. The URL in the PI email does not point to the correct website. If they ask, you can point them to the correct one (see details above) - if they start googling without checking with you, you can correct them and say that when in doubt they should reach out to the PI.

2. Additionally, the dataset that is freely available online is only the 2007-2011 period (not the 2016 linked in the email). They should notice this as well, or you can make them aware of it.

3. The deadline is very short - they should notice this and note it for discussion with the PI.

4. The methodology is incredibly vague. They should notice this and check with the PI (now or in Phase 2) to understand their plan. Which methods do they want to use? See more below about the PI's responses. Also, if the PI wants to rely completely on REG for developing the methodology, this should be noted, timelines should be adjusted and the right people should be allocated.

## Phase 2: Technical Questions

#### Schedule

40 mins collaborative activity (scoping, in groups)

20 mins presentation (all together)

***

The second part is an open dialogue between you (the PI) and the participants. They want to reach an agreement with you regarding the following questions. We have filled in the answer they should obtain from you at the end, but you can be more vague initially, to see how they approach the scoping phase. Note also that questions do not need to be answered in this order and conversation could jump across them.

Attendees should document your discussions in the GitHub project board, trying to answer the questions as clearly as possible. They could split the team into sub-groups if they want. After documenting the conversations and answers, one representative of the team will be asked to present the main conclusions in the common room.

### The goal

**1. What is the broad challenge we are trying to solve?**

The PI wants to find an explanation to inequalities in health across different countries in Europe. We know these exist but we are not sure what drives them.

**2. What is the specific research question? How does it translate to a data science problem?**

This is absolutely not clear from the initial email; they need to discuss it with you during the meeting (knowing the dataset is essential here).

Keep the following things roughly in mind:
- Some discussion on what factors could be driving health inequalities is needed (social, material, occupational, psychological). This should first be spontaneous and then informed by the dataset too (look at variables and documentation). 
- The definition of health should be discussed - what is it, what can we measure, what do we have in the data? They should use self-reported health (SRH) collected in Waves 2 and 3 of the EQLTS survey, while being aware that these waves offer only a partial representation of European populations and that SRH is in itself a highly subjective indicator that is difficult to compare across countries.
- The discussion should involve the question of what connection/association means. Does it mean causality or some weaker relationship? The project should not pursue discovering causality as this is a very complex issue with different theories about how social factors connect with health, e.g. look at Figure 4 of [this study](https://www.annualreviews.org/doi/full/10.1146/annurev-publhealth-031210-101218) for different theoretical models. The PI should discuss this with the team and direct people towards explaining differences in health by building a model that predicts SRH (available in the dataset) from several other variables. The goal of modeling here will be to explain as much of the variation in the outcome variable as possible by finding variables to include in the model, rather than strictly optimising predictive accuracy. As a first-stage data science task (before modeling), they could also do exploratory data analysis and visualisations to capture some of the relationships between variables.
- The PI seems from the email to not be an expert in ML. They mention deep learning but they are not sure if this is the right answer. There should be a conversation about why they think deep learning is a good idea for this problem. It might not be as it makes it harder to explain contributions of variables. 

The research question should be something along these lines: 
> Which material, occupational, and psychosocial factors explain/contribute to self-reported health (SRH) across different European countries and how much? 

The data science task should be to build a model that predicts SRH from variables in the dataset. Type of model to be defined but it has to be explainable in some way.

See [here](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498) for historic discussion on research question.

### The data

**3. Is data available, legally accessible and appropriate?**

- See Phase 1 for info on the dataset. The dataset discussion overlaps a lot with the previous question (e.g. variables to use, definition of health).
- The teams should check what license applies to the data and confirm that they can use them freely and redistribute in their repo.
- **Data procurement and missingness**: Only the `European Quality of Life Time Series, 2007 and 2011: Open Access` is [readily available](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data). However, there are a lot of variables missing in Wave 2. So it would be better to stick with Wave 3. This will be discussed further in Module 2, but good if someone already spots it and also if they notice differences between countries. The missingness means we might want to drop the time dimension from the analysis, unless we handle it in some way - could be done in following modules.
- **Documentation:** They should find documentation and understand who carried out the surveys and issues across countries.
- The data is appropriate for the defined task but has certain limitations (e.g. self-reported health only, missingness, various factors that could contribute to health outcomes might be missing from the survey, and it is hard to fully understand the contributions of factors by only observing data like this).
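
The Wave 2 versus Wave 3 missingness mentioned above can be checked with a quick per-wave tally. The sketch below is illustrative only: the column names (`wave`, `srh`, `income`) and the toy rows are hypothetical stand-ins, not the real variable names in the dataset, and missing values are assumed to be encoded as `None`.

```python
from collections import defaultdict


def missing_share_by_wave(rows, wave_key="wave"):
    """Return {wave: {variable: share of rows where the variable is None}}."""
    missing = defaultdict(lambda: defaultdict(int))  # missing counts per wave and variable
    totals = defaultdict(int)                        # number of rows per wave
    for row in rows:
        wave = row[wave_key]
        totals[wave] += 1
        for name, value in row.items():
            if name == wave_key:
                continue
            # Adding a bool counts 1 for a missing value and 0 otherwise.
            missing[wave][name] += value is None
    return {
        wave: {name: missing[wave][name] / totals[wave] for name in missing[wave]}
        for wave in totals
    }


# Toy rows with hypothetical column names (not real survey data).
rows = [
    {"wave": 2, "srh": 3, "income": None},
    {"wave": 2, "srh": None, "income": None},
    {"wave": 3, "srh": 2, "income": 1500},
    {"wave": 3, "srh": 4, "income": None},
]

shares = missing_share_by_wave(rows)
for wave in sorted(shares):
    print(wave, shares[wave])
```

A table like this makes it easy to see which variables are effectively unusable in one wave, which is the evidence a team would need before deciding to drop the time dimension.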

### The expectations

**4. What are the stakeholders' expectations?**

From the initial email, the expectations are at the same time very vague and problematic. There is both a bit of skepticism around "big-data" methods and a desire to do something ground-breaking. Participants should try to define more clearly what the PI really would like to achieve.

One way to run this is a PI that is initially ambitious and wants to produce a model that explains all health inequalities and can then impact government decisions. This can then be moderated based on the real limitations of the dataset as discussed above. Government policy is also very difficult to influence and there should be a plan to do that, which the PI does not have.

Eventually, the PI's expectations could just end up as follows: A study that tries to see if there are variables explaining the association between socio-economic status and poor health across Europe, as well as some way to communicate the findings (e.g. dashboard/visualisation).

**5. What is in-scope and out-of-scope?**

In scope:
- As described above

Out of scope: 
- Studies beyond Europe
- Engaging with the public / government
- Comparing 2007 and 2011 (unless missingness is handled)
- More datasets
- Causality

**6. What does the output of the project look like and how is it going to be used?**
A realistic output could be a publication of the analysis and a website to make it accessible to the public. This could be a visualisation allowing people to show relationships between different variables and health in different countries and across periods. A repository with reproducible code should also be created.


### Success

**7. What metrics do we use to measure the success of the project?**

For the model, metrics could involve typical classification metrics like confusion matrices, ROC curves and predictive accuracy, although some of them might not be that interesting given we do not prioritise prediction. We could use goodness-of-fit/model selection measures like deviance, Bayesian information criterion, etc. This could be left to be explored in M4.
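
To make the goodness-of-fit measures concrete, here is a minimal sketch assuming a binarised SRH outcome and a model that outputs fitted probabilities. For binary data the deviance is $-2 \log L$ and the Bayesian information criterion is $k \ln n - 2 \log L$, where $k$ is the number of fitted parameters and $n$ the number of observations. The outcome vector, the fitted probabilities and the parameter counts below are made up purely for illustration.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes given fitted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )


def deviance(y_true, p_pred):
    """Deviance for binary data: -2 * log-likelihood (the saturated model has log-likelihood 0)."""
    return -2.0 * log_likelihood(y_true, p_pred)


def bic(y_true, p_pred, n_params):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(len(y_true)) + deviance(y_true, p_pred)


# Toy outcome (1 = good health) and two hypothetical sets of fitted probabilities.
y = [1, 0, 1, 1, 0, 0]
p_intercept_only = [0.5] * 6                        # 1 parameter: the overall rate
p_with_factors = [0.8, 0.3, 0.7, 0.6, 0.4, 0.2]     # assumed to come from a 3-parameter model

for name, p, k in [("intercept only", p_intercept_only, 1),
                   ("with factors", p_with_factors, 3)]:
    print(name, round(deviance(y, p), 3), round(bic(y, p, k), 3))
```

Lower values are better for both measures. BIC additionally penalises the number of parameters, which suits a project whose goal is to find a small set of explanatory variables.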

For the project as a whole, success should be defined as a set of binary variables: Did we build a dashboard? Did we publish a paper?

**8. What is the expected impact? Is it realistic?**

Questions they should ask:
- Are we really expecting to influence governments? 
- Does the PI have previous experience regarding engaging with political actors? 

Answer: No, the PI was being overly optimistic. The expected impact is a wide analysis of the dataset to offer a better understanding of the phenomenon in the academic environment, not beyond. The grant will not include activities (workshops etc.) for engaging with the public or government. But the output can involve a web platform to give the public access to the results.

**9. Computational resources and timelines**
See above for discussion on funding timelines. The PI should negotiate the data science effort they fund if there is time, as well as try to predict what cloud resources will be needed (compute time for VMs to run the analyses, although the data scale might be manageable without them, and a server for the web app).


## Phase 3: Ethical Points (30 mins in groups, then 30 mins together)

#### Schedule

40 mins collaborative activity (EDI discussion, in groups)

20 mins presentation (all together)

***

As a final step in the scoping activity, we want to examine questions relating to equality, diversity and inclusion.

This should be a very open discussion with the PI using the taught content in 1.3 as a guide but also improvising and addressing things they consider important or controversial. The PI should be available throughout Phase 3.

Discussions should be documented in GitHub, similarly to Phase 2. They could split the team or work all together. After documenting the conversations and conclusions, one representative of the team will be asked to present them in the common room.


### Some aspects that could emerge or the PI could bring up

- In this project we focus on the goal of studying health outcomes. Do the data contain the right type of information to do that (explore the dataset, the paper and the documentation)? 
  - Answer: Only self-reported health is included. Also the following points could be made:
    - People might not report negative health. 
    - Only one person per household is interviewed.
    - Differences across countries in how health is self-reported (but how do we quantify it?)
    - We have no information about the person asking questions
    - How is "Fair" interpreted?
- To what extent can we understand the relationship between health and social factors with this dataset? What claims can we make?
  - Answer: We cannot understand causality and there might be many social variables missing. We can only claim to find connections and associations but not causes or drivers. And our understanding will be partial, the phenomenon is much more complex than what is captured by a supervised model. Point to existing theories about how the phenomenon works (given above). The claims of the PI should be clear and realistic from the onset of the project.
- Given the importance of understanding and explaining health outcomes, what does the PI plan to do to share the information publicly and achieve positive societal impact?
  - A web platform can be created to allow access to the analysis via easy-to-navigate visualisations. The code should be open and reproducible. Academic publications help but the findings should be communicated in other ways too, e.g. blog posts, articles in the press, maybe even a YouTube video.
- Is it important to make the work reproducible?
  - Very. Other people will be able to pick it up and improve it in the future, it increases trust from scientific community and public, etc.
- Strangers in the dataset
  - Teams should point out that they are not experts in the field and that their analysis could benefit from consultation with academics, social scientists, the people that did the surveys. This might not be possible always but should be pursued by the PI.
- Purpose of the project
  - What is the intended use of the output of the project? Do we foresee use by policy makers to make decisions? Are there any negative societal impacts we should be aware of and try to minimise?
- Bias issues that might arise:
  - Representation bias:
    - How representative is the dataset? 
    - Who are the people interviewed (do you know if they are foreigners?)
    - Are certain ages excluded?
    - Who are the people who asked the questions?
  - Label bias
    - Issues with binarising health
    - How can you minimise this?
  - Missing data bias
    - What happens to the study if we only use Wave 3? Why are there differences between Waves 2 and 3?
  - Measurement bias
    - A differential reporting of self-reported health across countries may be suspected due to cultural differences. 
  - Chronological bias
    - Can we easily align the two Waves?
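
The label bias point about binarising health can be made tangible with a small sketch. It assumes SRH is recorded on a five-point scale that includes a "Fair" category. The exact labels and the toy responses below are assumptions for illustration, not values taken from the dataset. The sketch shows how much the share of "poor health" moves depending on which side of the cut-off "Fair" falls.

```python
from collections import Counter


def share_poor_health(responses, poor_levels):
    """Share of responses that fall into the categories counted as poor health."""
    counts = Counter(responses)
    total = sum(counts.values())
    return sum(counts[level] for level in poor_levels) / total


# Toy responses on an assumed five-point scale (not real survey data).
responses = (
    ["Very good"] * 2
    + ["Good"] * 3
    + ["Fair"] * 3
    + ["Bad"] * 1
    + ["Very bad"] * 1
)

# Two plausible binarisations that differ only in how "Fair" is treated.
strict = share_poor_health(responses, {"Bad", "Very bad"})
broad = share_poor_health(responses, {"Fair", "Bad", "Very bad"})

print("Fair counted as good health:", strict)
print("Fair counted as poor health:", broad)
```

If "Fair" is interpreted differently across countries, the choice of cut-off interacts with the measurement bias above, so teams may prefer to keep the ordinal scale or to report results under both binarisations.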
