# Hands-on

## Phase 1: Orientation (1 hour)
Participants are divided into groups, and each group is paired with a helper who will act as the PI. Each group will receive a short research proposal from the PI (see below). They will need to go through the materials and conduct the initial scoping by following these steps:

1. Set up a GitHub repo for each group
2. Ensure that all participants have access to it
3. Prepare a scoping project board (decide on the flow that best captures the scoping process you aim to conduct)
4. Go through all the received materials (you can follow up with the PI if something is not clear from the beginning)
5. To get a better understanding of the project, it might be necessary to:

    a. explore the dataset 

    b. examine the dataset's documentation 

    c. do an initial general literature search. To simplify this step, you can just look at the paper we used as an inspiration for the hands-on activities.

    We suggest working in sub-groups to address these three points.
    
6. If you notice something specific in the dataset, make a pull request to the main branch of the repo with a script highlighting what you have found, so that others in the group are aware of it as well.
7. Prepare an issue dedicated to each of the scoping questions you want to discuss with the PI (see Module 1.2) and start filling them in based on your initial exploration and the information you received from the PI. You will have a chance to speak with the PI in the second phase.
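
As an illustration of the kind of script step 6 has in mind, here is a minimal sketch that counts the values of one variable, including missing ones. It uses only the Python standard library. The column names (`country`, `wave`, `self_reported_health`) and the sample rows are made up for illustration and are not taken from the real dataset; a real script would read the downloaded file instead of `SAMPLE_CSV`.

```python
import csv
import io
from collections import Counter

# Toy rows standing in for the real survey file.
# Column names and values are hypothetical, not the dataset's actual variables.
SAMPLE_CSV = """country,wave,self_reported_health
AT,2,Good
AT,3,Fair
BE,2,
BE,3,Very good
BE,3,Fair
"""


def load_rows(handle):
    """Read a CSV file handle into a list of dicts (one per respondent)."""
    return list(csv.DictReader(handle))


def value_counts(rows, column):
    """Count how often each value of `column` occurs; empty strings count as missing."""
    return Counter(row[column] or "<missing>" for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
health_counts = value_counts(rows, "self_reported_health")
for value, count in health_counts.most_common():
    print(f"{value}: {count}")
```

A short script like this is easy to review in a pull request, and the printed counts give the rest of the group something concrete to discuss.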


### The initial contact

Notes for REG: this should be a fairly vague first contact. The URL is deliberately incorrect, and participants should spot this.

>15th of November 2021

>Subject: Initial request for collaboration
>
>
> Dear Research Engineering Group,
I am reaching out for scoping a potential collaboration. Social inequalities in health have been described across a range of European countries. While it is well-known in the literature that the higher the social class, the lower the prevalence/incidence of health problems, no study has attempted to explain social inequalities in health for Europe as a whole. To address this, I am setting up a project proposal for a large-scale study using promptly available data (European Quality of Life Time Series, freely available [online](https://ecommons.cornell.edu/handle/1813/87445)) and deep learning techniques. I envision a 2-year project answering the call “Personal Stories, Healing Nations” employing 1 full-time Post-Doctoral researcher covering the social science parts of the study and (potentially) in collaboration with your team for the technological parts. We are hoping to submit by Dec 1st, so we would be keen to establish the costs for this digital component by Nov 28. 
>
> While I am fairly new to big-data (and, I have to admit, I have my reservations), I believe a well-designed project with these sources might be able to rewrite our understanding of social inequalities in health in Europe (and even beyond) during the last two decades, a period involving a series of major socio-economic and political events, ça va sans dire. Its impact will be relevant for the general public and could potentially even suggest actions to governments.
>
>Yours sincerely,

>Professor J. Doe


### Things we expect them to notice from the beginning

1. The URL does not point to the correct website. Additionally, only the 2007-2011 period is freely available online. Participants should come back to the PI with this information; if they only say "this is not the correct link", we should point them towards the availability issue.

2. The deadline is very short - is there flexibility?

3. The methodology is incredibly vague. Is the PI relying on us completely for methods? If so, the goal of the analysis needs to be very clear from the beginning.

## Phase 2: Technical Questions (1 hour)

After having explored the collection, you can start an iterative conversation with the PI to reach an answer to the following questions. This should consist of a series of discussions and further exploration of the collection.

### The goal

**1. What is the broad challenge we are trying to solve?**

The PI wants to find an explanation for social inequalities in health across all of Europe.

**2. What is the specific research question? How does it translate to a data science problem?**

This is absolutely not clear from the initial email; you need to discuss it with the PI during the meeting (knowing the dataset is essential here). The PI should direct participants towards defining a specific task focused on:
- predicting a specific variable (the self-reported health)
- using other variables as features

The specific RQ is defined [here](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498).
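
To make the translation into a data science problem concrete, the sketch below separates a target variable from the remaining features and computes a majority-class baseline that any later model should beat. This is only an illustration under assumed names: `self_reported_health`, `age` and `employment` are hypothetical column names, the rows are toy values, and the baseline is a common sanity check rather than a method prescribed by the PI.

```python
from collections import Counter

TARGET = "self_reported_health"  # hypothetical name for the variable to predict


def split_features_target(rows, target=TARGET):
    """Separate each record into a feature dict and its target value."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy records for illustration only, not real survey responses.
toy_rows = [
    {"age": 34, "employment": "employed", TARGET: "Good"},
    {"age": 61, "employment": "retired", TARGET: "Fair"},
    {"age": 47, "employment": "unemployed", TARGET: "Good"},
    {"age": 29, "employment": "employed", TARGET: "Good"},
]

X, y = split_features_target(toy_rows)
baseline = majority_baseline_accuracy(y)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Agreeing on the target and on a baseline early gives the conversation with the PI a concrete reference point when success metrics are discussed.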

### The data

**3. Is data available?**

The PI refers to the "European Quality of Life Time Series" and links to a Cornell library website. However, a further search shows that the dataset in question is the most recent version of a series of datasets created by [this project](https://www.eurofound.europa.eu/surveys/european-quality-of-life-surveys). Data for 2007 and 2011 is available in open access (CC 4.0) [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data); for the most recent version you would need to register with [UKDS](https://www.eurofound.europa.eu/surveys/about-eurofound-surveys/data-availability). Several publications refer to the dataset; are any libraries available for working with it?

Additional issues that should emerge:
- **Data procurement and missingness**: Only the `European Quality of Life Time Series, 2007 and 2011: Open Access` is [readily available](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data). However, many variables are missing in Wave 2, so it would be better to stick with Wave 3. This will be discussed further in Module 2, but it is good if someone already spots it, and also if they notice differences between countries.
- **Documentation:** Find the documentation and understand who carried out the surveys and what issues arise across countries.
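
One possible way to check the missingness point is to compute, for each wave, the share of respondents with no value for each variable. The sketch below assumes a `wave` column and treats `None` and empty strings as missing; both are assumptions, and the real files may encode missing values differently (for example with special codes), so check the documentation first. The rows are toy values for illustration.

```python
from collections import defaultdict


def missing_share_by_wave(rows, wave_column="wave"):
    """Return {wave: {variable: share of rows where the value is missing}}."""
    totals = defaultdict(int)
    missing = defaultdict(lambda: defaultdict(int))
    for row in rows:
        wave = row[wave_column]
        totals[wave] += 1
        for variable, value in row.items():
            if variable != wave_column:
                # None and "" are treated as missing; adding 0 still registers the variable.
                missing[wave][variable] += int(value is None or value == "")
    return {
        wave: {variable: count / totals[wave] for variable, count in counts.items()}
        for wave, counts in missing.items()
    }


# Toy rows for illustration only; variable names are hypothetical.
toy_rows = [
    {"wave": 2, "self_reported_health": "Good", "income": None},
    {"wave": 2, "self_reported_health": "Fair", "income": None},
    {"wave": 3, "self_reported_health": "Good", "income": "low"},
    {"wave": 3, "self_reported_health": "", "income": "high"},
]

shares = missing_share_by_wave(toy_rows)
for wave, per_variable in sorted(shares.items()):
    print(wave, per_variable)
```

Adding the country as a second grouping key would extend the same idea to spotting differences between countries.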

### The expectations

**4. What are the stakeholders' expectations?**

From the initial email, the expectations are at the same time very vague and problematic. There is both a bit of skepticism around "big-data" methods and the desire to do something ground-breaking. Participants should try to define more clearly what the PI would really like to achieve. What the PI wants is a study that tries to see whether there are variables that explain the association between socio-economic status and poor health across Europe.

**5. What is in-scope and out-of-scope?**

In scope: see [the defined RQ](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498).

Out of scope: studies beyond Europe, engaging with the public or government, and any additional datasets.

### Success

**7. What metrics do we use to measure the success of the project?**

Define a clear MVP ([see here](https://github.com/alan-turing-institute/rds-course/issues/15#issuecomment-905345498)).

**8. What is the expected impact?**

Are we really expecting to influence governments? Does the PI have previous experience of engaging with political actors? No, the PI was being overly optimistic. The expected impact is a wide analysis of the dataset that offers a better understanding of the phenomenon within the academic environment, not beyond. The grant will not include activities (workshops, etc.) for engaging with the public or government.




## Phase 3: Ethical Points (30 mins in groups, then 30 mins together)

As a final discussion, we would like to focus on the goal of studying **health** while the dataset contains **self-reported health** information. Discuss the issues and challenges that we might encounter (also by exploring the dataset, the paper and the documentation). We will first discuss in sub-groups and then hold a final open discussion with all the participants.

### Aspects that we expect to emerge

- People don't report negative health. 
- One person per household
- Difference across countries (but how do we quantify it?)
- We have no information about the person asking questions
- How "Fair" was interpreted

### Additional points that could emerge

These are some of the additional points that could emerge while we discuss ethics.

**1. Representation bias**

How representative is the dataset?

Who are the people interviewed? (Do you know if they are foreigners?)

Who are the people who asked the questions?

**2. Label bias**

Issues with binarising health

How can you minimise this?
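
To see why binarisation matters, the sketch below shows how the share of respondents labelled as being in poor health changes depending on which side of the cut-off "Fair" falls. The five-point scale and the toy responses are assumptions made for illustration; check the questionnaire for the actual answer categories.

```python
# Assumed answer scale, ordered from best to worst; verify against the questionnaire.
SCALE = ["Very good", "Good", "Fair", "Bad", "Very bad"]


def poor_health_share(responses, first_poor_category):
    """Share of responses that are `first_poor_category` or worse on SCALE."""
    cutoff = SCALE.index(first_poor_category)
    flags = [SCALE.index(response) >= cutoff for response in responses]
    return sum(flags) / len(flags)


# Toy responses for illustration only.
toy_responses = [
    "Very good", "Good", "Good", "Fair", "Fair", "Fair", "Bad", "Very bad",
]

share_fair_counts_as_poor = poor_health_share(toy_responses, "Fair")
share_fair_counts_as_good = poor_health_share(toy_responses, "Bad")
print(share_fair_counts_as_poor, share_fair_counts_as_good)
```

In this toy example, moving "Fair" from one side of the cut-off to the other changes the share labelled as poor health from 0.625 to 0.25, which is why the choice of threshold should be reported and, ideally, tested for sensitivity.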

**3. Missing data bias**

What happens to the study if we only use Wave 3? Why are there differences between Waves 2 and 3?

**4. Measurement bias**

A differential reporting of self-reported health across countries may be suspected due to cultural differences. 

**5. Chronological bias**

Can we easily align the two Waves?


