![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fpresentations&branch=master&subPath=health-data-privacy.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/> </a>

# Data Privacy

As our world becomes increasingly digitized and interconnected, our reliance on data has also increased. The types of devices that are connected to the internet get [stranger and stranger every day](https://www.metrikus.io/blog/10-weirdest-iot-enabled-devices-of-all-time). This is a fairly recent trend, made possible by advances in computing, the internet, and data-storage technology. It's becoming such an integrated part of our lives that it's sometimes easy to forget just how much information is collected about us, and who has access to it. This is especially true for health data, which is typically quite closely guarded and can have very negative effects if it falls into the wrong hands.

In this notebook and activity, we'll look at ways that data impacts your life, from its collection to its applications. We'll have a special focus on the role of health data, with some information about what you can do to make sure that your data isn't used in a way that negatively impacts you. Hopefully, you'll leave with an appreciation of how seemingly irrelevant data can be used to paint a picture of who you are, and how that information can be used for both bad and good.

# Positive Use of Data
### Historical Health Data
A classic example of the role of health data is in [determining the source of a cholera outbreak in London in 1854](https://www.rcseng.ac.uk/library-and-publications/library/blog/mapping-disease-john-snow-and-cholera/). Cholera is an [incredibly nasty bacterial disease](https://www.mayoclinic.org/diseases-conditions/cholera/symptoms-causes/syc-20355287) that affects the digestive tract, and though it's rare today in developed nations, it still results in the [deaths of tens of thousands of people](https://www.who.int/news-room/fact-sheets/detail/cholera) in developing countries across the world *each year*. 

At the time, many people were moving to London, and the sewage system wasn't able to handle the removal of all the waste, especially in one area of the city. This resulted in the sewage contaminating the drinking water supply, which is now known as the primary route of cholera infection in humans. This led to an outbreak of cholera, and at its peak over [600 people were dying each week from the disease](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7150208/).

A young physician named [John Snow](https://en.wikipedia.org/wiki/John_Snow) rejected the leading theory at the time that cholera was an airborne disease, and insisted that it spread through water. To illustrate his point, he recorded the locations of the homes of those who had died from cholera, and added that data to a map of the area. The map also had the locations of the water pumps that provided the drinking water for the nearby homes:

#### John Snow's original map overlaid with cholera deaths (bubble size is number of deaths; blue taps indicate location of water pumps)

![John Snow's original map overlaid with cholera deaths (bubble size is number of deaths)](https://blog.rtwilson.com/wp-content/uploads/2012/01/SnowMap_Points.png)
<p>
<b>https://blog.rtwilson.com/wp-content/uploads/2012/01/SnowMap_Points.png</b>
</p>

It became obvious which pump was contaminated and was causing the epidemic once he compared the spread of the cases with the location of the pumps. He convinced the local government to disable the contaminated pump and the outbreak was contained. This is considered to be a major event in both the birth of the field of [epidemiology](https://en.wikipedia.org/wiki/Epidemiology), and the use of data in public health.

John Snow's work laid the foundation for the use of health data, and is a great example of how personal data (in this case, home address and infection status) can be used for good. Unfortunately, data hasn't always been used for the benefit of the population it's collected from. Let's take a look at an example of data that was intended to be used for good but had unintended consequences.

# Unintentional Misuse of Data
### Smart Devices and GPS

[Strava](https://en.wikipedia.org/wiki/Strava) is an internet service that allows users to track their physical activity and compare with their friends and others, both locally and across the world. Users can upload their activities to the service and compare their times or progress towards certain goals. To enable that, Strava collects the GPS data from the users' smartwatches and other devices that they use to record their activity, and uses that data to calculate times, distances, and speeds. The social media aspect of the service encourages people to exercise, which is a fairly noble goal.

Another function of Strava is to help users find new routes near them that are popular with other users. To help users explore their local routes, Strava generates a heatmap that shows the most popular locations for activities, shown as the brightest locations below:

![](https://1n4rcn88bk4ziht713dla5ub-wpengine.netdna-ssl.com/wp-content/uploads/2017/10/Global-Heatmap.png)
<p>
<b>https://blog.strava.com/zi/press/strava-community-creates-ultimate-map-of-athlete-playgrounds/</b>
</p>

In November 2017, Strava released their heatmap as an [interactive tool](http://labs.strava.com/heatmap) that allowed anyone to explore the map and the popular locations for physical activity. A few months later, one user discovered that there were many "hot spots" showing up in otherwise completely remote regions. What this user had discovered was the [existence of (previously) secret military bases](https://techcrunch.com/2018/01/28/strava-exposes-military-bases/) in regions such as Afghanistan, Syria, and Somalia. Looking deeper, it was also possible to track the movements of troops, as some of them had uploaded recordings of their training exercises.

![](https://ichef.bbci.co.uk/news/976/cpsprodpb/112EA/production/_99787307_bagram_airbase.jpg)
<p>
    <b>https://ichef.bbci.co.uk/news/976/cpsprodpb/112EA/production/_99787307_bagram_airbase.jpg</b>
</p>

As you can imagine, the militaries of the countries whose bases were exposed were not pleased. They certainly had never expected the locations of their operations to be revealed in this way, and had therefore not put in any protections to stop this from happening. Likewise, the service members who uploaded the data were just trying to track their workouts, and had no idea how this data would be used. It's also difficult to blame Strava as they had no idea that the bases existed; all it took was some further probing into the data to uncover valuable strategic information. Strava has since added privacy tools where users can obscure the starts and ends of their activities (to hide the location of their home or work), and default to 'opt-in' for potentially privacy-invasive features.
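
To picture what "obscuring the starts and ends" of an activity could mean, here is a minimal sketch that drops every GPS point within a fixed distance of the first and last recorded points. This is only an illustration of the idea, not Strava's actual implementation: the function names, the 200 m radius, and the coordinates are all made up for this example.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371000 * math.asin(math.sqrt(a))

def hide_endpoints(track, radius_m=200):
    """Hypothetical helper: remove points within radius_m of the track's start or end."""
    if not track:
        return []
    start, end = track[0], track[-1]
    return [
        p for p in track
        if haversine_m(p, start) > radius_m and haversine_m(p, end) > radius_m
    ]

# Invented track: ten points heading north, roughly 111 m apart
track = [(51.0 + 0.001 * i, -114.0) for i in range(10)]
trimmed = hide_endpoints(track)
print(len(track), "points recorded,", len(trimmed), "points shared")
```

Even with a rule like this, the remaining points still reveal the general area, which is why the heatmap problem could not be solved by trimming alone.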

Although all parties involved in the collection, analysis, and release of the data had no intention of causing any harm, the damage was still done and the heatmap became a serious security concern. As the demand for data increases, the world will continue to struggle to keep up with the potential negative side effects of carelessly releasing personal information.

Unfortunately, as we learn more about how seemingly harmless data can be used to discover previously secret information, a growing number of organizations are deliberately using data for their own benefit.

# Negative Use of Data
### Genetic Testing and Health Insurance Discrimination

[Carpal Tunnel Syndrome (CTS)](https://www.hopkinsmedicine.org/health/conditions-and-diseases/carpal-tunnel-syndrome) is a chronic medical condition in the hands and wrists that occurs primarily in workers who perform repetitive manual tasks with their hands as part of their job, such as typing or using some tools. Once the condition develops, it usually requires rest to allow the wrists to recover, sometimes even requiring surgery. For medical conditions acquired while working, most jurisdictions ensure that workers are still getting paid while undergoing treatment, which is a large expense for employers.

In 2000, a [railroad worker named Gary Avary developed CTS](https://www.latimes.com/archives/la-xpm-2002-may-09-fi-nucarpal9-story.html) while working with high-powered tools. He reported his symptoms to the company and had an appointment scheduled with the company's doctor. During the assessment, the doctor also took blood samples from Gary, telling him they were for routine bloodwork. Unknown to Gary, the company was actually performing genetic testing on his blood to look for evidence that Gary had some genes that made him more likely to develop CTS regardless of his workplace activities.

![](https://lsminsurance.ca/images/2016/09/genetic-testing-insurance.jpg)
<p>
    <b> https://lsminsurance.ca/images/2016/09/genetic-testing-insurance.jpg </b>
</p>

Gary's company then tried to use the tests to prove that his CTS was a ["pre-existing condition"](https://en.wikipedia.org/wiki/Pre-existing_condition), meaning that it existed before he began employment at the company. If the company could prove that Gary's CTS was a pre-existing condition, they would not be responsible for paying for his treatment or for his lost wages. Genetic testing was still a very new (and expensive) technology in 2000, but the cost of the testing was far lower than the cost of paying for Gary's treatment.

Gary's wife, a nurse, was suspicious of the need to collect bloodwork for this injury, and discovered that it had been sent for genetic testing. Pretty quickly, dozens of other employees of the railroad company came forward with similar stories, and they ended up [suing the company for health discrimination](https://www.eeoc.gov/newsroom/eeoc-and-bnsf-settle-genetic-testing-case-under-americans-disabilities-act-0). Ironically, the research that the railroad company had used to justify the link between the genes in question and CTS was completely unfounded, [according to the scientist who made the discovery](https://www.wired.com/2001/04/genetic-testing-case-settled/).

The resulting lawsuit helped set precedent for the role of genetic testing in discriminatory insurance practices, leading to the [passing of law in the USA that forbids such discrimination](https://www.genome.gov/genetics-glossary/Genetic-Information-Nondiscrimination-Act). Canada passed comparable legislation, the Genetic Non-Discrimination Act, in 2017.

Though the actions of the railroad company in this story would now be considered illegal (and were always unethical), it shows how health information can be used in negative, even discriminatory ways, sometimes without the knowledge of the person the data was collected from. Seemingly innocent pieces of data can be combined to arrive at very important and severe consequences.

In the next exercise, we'll walk through an activity where you'll get first-hand experience comparing datasets to see how they can be combined to uncover information that should be hidden. After the activity, we'll also talk about best practices to ensure that data remains private.

# Activity Time

To explore the interconnectivity of datasets, we'll go through a scenario where you'll have the chance to explore several related datasets, and see how they can be used together to find even more information. Using multiple sources of data to arrive at a conclusion is an important skill in data science.

Throughout the activity, keep in mind how seemingly harmless the individual datasets are on their own, and how combining them quickly allows you to single out the individuals that the data was collected on.
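
As a rough illustration of how two harmless-looking tables can single someone out, the sketch below joins two small invented datasets on the columns they share. None of this is the activity's data: the table names, column names, and values are all hypothetical.

```python
import pandas as pd

# Hypothetical table 1: a club membership list (names, but no addresses)
members = pd.DataFrame({
    'name': ['Alex', 'Blair', 'Casey'],
    'postal_code': ['T2A', 'T3B', 'T2A'],
    'dog_breed': ['Beagle', 'Poodle', 'Husky'],
})

# Hypothetical table 2: pet licence records (addresses, but no names)
licences = pd.DataFrame({
    'postal_code': ['T2A', 'T3B', 'T2A'],
    'dog_breed': ['Husky', 'Poodle', 'Beagle'],
    'street': ['12 Elm St', '9 Oak Ave', '40 Pine Rd'],
})

# Joining on the shared columns links each name to a street address
linked = members.merge(licences, on=['postal_code', 'dog_breed'], how='inner')
print(linked)
```

Neither table contains a name next to an address, yet the merged result does. The same effect can be achieved by hand, by filtering one table with values you found in another.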

Alongside the second half of this notebook, follow the link below to access a Google Form to help with the activity:

https://forms.gle/684QTAZGN3SjzmRFA

# Data Privacy Locked Room

## Introduction

You're spending time with your friend one day, killing time waiting for your ride to arrive. Your friend is on their phone, mindlessly scrolling through their social media, when you happen to see a photo pop up in their feed:

![](img/1.png)

The photo catches your eye! You're not familiar with the person who posted the picture (and your friend says it's just an acquaintance of theirs), but you do recognize an old friend of yours in the dark red plaid shirt on the left, Sarah. It's been a long time since you've seen her, and you'd love to get back in contact with her. Unfortunately, she seems to have removed herself from all forms of social media, and the only information you know about her is:

- She's since married and changed her last name
- Her husband is the man in the blue plaid with the guitar
- Both Sarah and her husband are under 30 years old
- The dog on the left in the photo is their dog
- The location the photo was tagged in is a popular campground near your city

**Using your data science skills, along with the information you just acquired and the resources you have access to, can you connect the data to find a way to get in contact with your old friend?**

## Tips and Tricks

We'll be using the Python library *pandas* to handle the datasets, so here are some commands you might need to use:

In [None]:
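import pandas as pd
# Load the CSV file into a dataframe
sample = pd.read_csv('data/sample.csv')
# Putting the dataframe's name on the last line of a cell displays it
sample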

You can filter dataframe rows by selecting a column and returning only the rows whose values meet a condition. In the code below, we're looking at column `C` and returning only rows where the value in column `C` is equal to `'Cat'`:

In [None]:
sample[sample['C']=='Cat']

You can also use numeric comparisons like 'greater than' or 'less than' if the column contains only numbers. Here we're looking in column `A` for values greater than 2:

In [None]:
sample[sample['A']>2]

If the dataframe is too large to fit all the contents on screen, but you're curious what values exist in each column, you can use the following command to return all unique values in the column:

In [None]:
list(sample['C'].unique())

You can also filter by multiple conditions by surrounding each condition with round brackets and joining them with a `&`:

In [None]:
sample[(sample['A']<=4) & (sample['B']==True)]
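
If you'd like to experiment with these filters without the CSV file, the sketch below builds a small stand-in dataframe. The column names `A`, `B`, and `C` mirror the examples above, but the values are invented and need not match `data/sample.csv`. It also shows `|` (or) and `~` (not), which follow the same round-bracket rule as `&`.

```python
import pandas as pd

# Stand-in for data/sample.csv; the values are invented for illustration
demo = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [True, False, True, True, False],
    'C': ['Cat', 'Dog', 'Cat', 'Bird', 'Dog'],
})

# AND: rows where A is at most 4 and B is True
both = demo[(demo['A'] <= 4) & (demo['B'] == True)]

# OR: rows where C is 'Cat' or A is greater than 4
either = demo[(demo['C'] == 'Cat') | (demo['A'] > 4)]

# NOT: rows where C is anything except 'Dog'
not_dog = demo[~(demo['C'] == 'Dog')]

print(both, either, not_dog, sep='\n\n')
```

The brackets matter because `&` and `|` bind more tightly than comparisons like `<=` and `==` in Python, so leaving them out usually produces an error rather than the filter you intended.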

## Let's Get Started

The first dataset you come across is a spreadsheet of contact info from a local dog group that you think Sarah and her husband belong to:

### Dataset #1 - Pet Group Contact List

In [None]:
petDf = pd.read_csv('data/PetGroupContactList.csv')
petDf



### Dataset #2 - ?

In [None]:
# Paste code snippet from Google Form below






### Dataset #3 - ?

In [None]:
# Paste code snippet from Google Form below






[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)