# Hands-on

## Phase 1: Orientation
Participants are divided into groups. Each group will receive a short research proposal from a PI, together with a dataset. They will need to go through the materials and conduct the initial scoping by following these steps:

1. Set up a GitHub repo for each group
2. Ensure that all participants have access to it
3. Prepare a scoping project board
4. Go through all the received materials and discuss together if things are not clear.
5. To get a better understanding of the project, it might be necessary to explore the dataset. To do so, open a dedicated branch, load the dataset and explore it. 
6. Prepare an issue dedicated to each of the scoping questions you want to address to the PI (see Module 1.2). In particular:
    a. Do you have any initial concerns about elements of the project’s feasibility?
    b. What else do you need to know about the project?
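
For the exploration in step 5, a first look at the data can be as simple as counting rows and the values taken by each column. The sketch below is only illustrative: it assumes a tab-delimited export, and the column names (`country`, `wave`, `health`) and the in-memory sample are hypothetical stand-ins for the real file.

```python
import csv
import io
from collections import Counter


def summarise(rows):
    """Return the row count and, for each column, a Counter of its values."""
    counts = {}
    n_rows = 0
    for row in rows:
        n_rows += 1
        for column, value in row.items():
            counts.setdefault(column, Counter())[value] += 1
    return n_rows, counts


# Hypothetical in-memory sample; for the real data, replace it with an
# open file handle, e.g. open(<path to the downloaded file>, newline="").
sample = io.StringIO(
    "country\twave\thealth\n"
    "A\t2\tgood\n"
    "A\t3\tfair\n"
    "B\t3\t\n"
)
reader = csv.DictReader(sample, delimiter="\t")
n_rows, counts = summarise(reader)

print("rows:", n_rows)
for column, counter in counts.items():
    # Show the three most frequent values per column.
    print(column, counter.most_common(3))
```

Empty strings show up as their own value in the counts, which is often the first hint of missing data worth raising with the PI.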


### The initial contact

Notes for REG: this should be a fairly vague first contact. The URL is not the correct one and should be spotted by students.

1st of November 2021
Subject: Initial request for collaboration

Dear Research Engineering Group,
I am reaching out to scope a potential collaboration. Social inequalities in health have been described across a range of European countries. While it is well-known in the literature that the higher the social class, the lower the prevalence and/or incidence of health problems, no study has attempted to explain social inequalities in health for Europe as a whole. To address this, I am setting up a project proposal for a large-scale study using newly available data (European Quality of Life Time Series, freely available [online](https://ecommons.cornell.edu/handle/1813/87445)) and machine learning techniques. I envision a 2-year project answering the call “Personal Stories, Healing Nations” employing 1 full-time Post-Doctoral researcher at University of Eastfolk covering the social science parts of the study and (potentially) in collaboration with your team for the technological parts. We are hoping to submit by Dec 1st, so we would be keen to establish the costs for this digital component by Nov 28.

While I am fairly new to big-data (and, I have to admit, I have my reservations), I believe a well-designed project with these sources will be able to rewrite our understanding of social inequalities in health in Europe (and even beyond) and its impact will be relevant for the general public and could potentially even suggest actions to governments.

Yours sincerely,
Professor 


### Questions (and answers)

1. What is the broad challenge we are trying to solve?
They want to explain social inequalities in health for Europe as a whole through the use of a "new" dataset and "machine learning".

2. What is the specific research question? How does it translate to a data science problem?
This is not at all clear from the initial email, so you need to discuss it with the PI during the meeting (knowing the dataset is essential here). The PI should direct people towards defining a task focused on:
- predicting a specific variable (the self-reported health)
- binarising the variable to make the analysis simpler (the variable was dichotomised as “good” health (good and fair) versus “poor” health)
- using other variables as features
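
The dichotomisation could be sketched as follows. This is a minimal illustration, not the PI's actual procedure: the response labels are assumptions (only "good" and "fair" come from the description above, "bad" is hypothetical), and any non-missing label outside `good_labels` is treated as "poor", so the set must be checked against the dataset's codebook.

```python
def binarise_health(response, good_labels=frozenset({"good", "fair"})):
    """Map a self-reported health response to 1 ("good") or 0 ("poor").

    Missing responses (None or blank strings) are returned as None so
    that they are not silently counted as "poor".
    """
    if response is None or response.strip() == "":
        return None
    return 1 if response.strip().lower() in good_labels else 0


# Hypothetical responses, including two missing values.
responses = ["Good", "fair", "bad", "", None]
target = [binarise_health(r) for r in responses]
print(target)
```

Keeping missing values separate from the "poor" class matters here, because missingness in the survey differs between waves and countries.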

3. Is data available?
The PI refers to "European Quality of Life Survey", linking to a Cornell library website. However, a little more searching shows that the dataset in question is the most recent version of a series of datasets created by [this project](https://www.eurofound.europa.eu/surveys/european-quality-of-life-surveys). Data for 2007 and 2011 is available in open access (CC 4.0) [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data); for the most recent version you would need to register with [UKDS](https://www.eurofound.europa.eu/surveys/about-eurofound-surveys/data-availability). There are publications referring to the dataset; are any libraries available for working with it?

4. What are the stakeholders' expectations?
From the initial email, the expectations are both very vague and problematic. There is a degree of skepticism about "big-data" methods alongside the desire to do something ground-breaking.

5. What does the output product look like and how is it going to be used?
The discussion with the PI should reach the point where an MVP is agreed, in the form of an initial study in which some variables are examined. We could start, for instance, by focusing on a single country only.

6. What is the state of the art (either in literature or within the organisation)? Is the goal of this project to go beyond this? Is there any documentation of legacy systems?

7. What is in-scope and out-of-scope?
Out of scope: studies beyond Europe, engaging with the public, and any additional datasets.

8. What is the expected impact?
Is the claim that the project "could potentially even suggest actions to governments" a realistic expectation? Does the PI have previous experience of turning research outcomes into political action? No: the PI was being overly optimistic here. The expected impact of the project is a wide analysis of the dataset that improves understanding within academia, not beyond. The grant will not include activities (workshops etc.) for engaging with the public or government.

9. What metrics do we use to measure the success of the project?
Define what we are expecting and what are reasonable baselines.
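
One reasonable baseline for a binary target is the accuracy of always predicting the most common class; any model should at least beat it. The sketch below assumes accuracy as the metric and uses a hypothetical list of labels (1 for "good" health, 0 for "poor", `None` for missing); the actual metric should be agreed with the PI.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class.

    Missing labels (None) are dropped before counting.
    """
    observed = [y for y in labels if y is not None]
    if not observed:
        raise ValueError("no non-missing labels to compute a baseline from")
    _, majority_count = Counter(observed).most_common(1)[0]
    return majority_count / len(observed)


# Hypothetical labels: four "good", two "poor", one missing.
y = [1, 1, 1, 0, None, 1, 0]
baseline = majority_baseline_accuracy(y)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

If the classes turn out to be heavily imbalanced, this baseline will already be high, which is a signal to consider metrics other than plain accuracy.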

10. What computational resources are available?
The project seems to rely completely on us, so we should make an estimate of the resources needed.

Additional issues that should emerge:
- **Data procurement and missingness**: Only the `European Quality of Life Time Series, 2007 and 2011: Open Access` is [readily available](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data). However, many variables are missing in Wave 2, so it would be better to stick with Wave 3. This will be discussed further in Module 2, but it is good if someone spots it already, and also if they notice differences between countries.
- **Documentation:** Find documentation and understand who carried out the surveys and issues across countries.
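
To spot the missingness issue, participants could compute the share of missing values per variable within each wave. The sketch below is illustrative only: the rows, the column names and the convention that missing values appear as `None` or an empty string are all assumptions, and the real dataset may encode missingness differently.

```python
from collections import defaultdict


def missing_share_by_group(rows, group_key, missing=(None, "")):
    """For each group, return the share of missing values per column."""
    totals = defaultdict(int)
    missing_counts = defaultdict(lambda: defaultdict(int))
    columns = set()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        for column, value in row.items():
            if column == group_key:
                continue
            columns.add(column)
            if value in missing:
                missing_counts[group][column] += 1
    return {
        group: {c: missing_counts[group][c] / totals[group] for c in sorted(columns)}
        for group in totals
    }


# Hypothetical rows standing in for the survey data.
rows = [
    {"wave": "2", "country": "A", "health": "good", "income": ""},
    {"wave": "2", "country": "B", "health": "", "income": ""},
    {"wave": "3", "country": "A", "health": "fair", "income": "low"},
    {"wave": "3", "country": "B", "health": "bad", "income": ""},
]
shares = missing_share_by_group(rows, "wave")
for wave, per_column in shares.items():
    print(wave, per_column)
```

Passing `"country"` as `group_key` gives the same view per country, which helps surface the differences between countries mentioned above.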

