# Solution

This notebook demonstrates one way to solve the challenge using Python.

## Packages

Here, we are going to use pandas to inspect the data, clean the data and calculate our result.

We will also make use of numpy.

Different security tiers of Data Safe Haven support installing packages in different ways:

- Tier ≤ 2: Able to install any package from PyPI and CRAN
- Tier 3: Able to install a configurable, curated list of packages from PyPI and CRAN

In [None]:
%pip install pandas numpy

Now we can import pandas and numpy.

In [None]:
import pandas as pd
import numpy as np

## Find the data

In a Data Safe Haven environment, the input data is located at `/data`.

In [None]:
!ls /data

The `/data` directory and the files within it are read-only.

This preserves the integrity of input data.

We can test this by trying to create a new file in `/data` and to append data to `/data/data.csv`.
Because the directory is read-only, both commands are expected to fail.

In [None]:
!touch /data/new_file.txt
!echo "hello" >> /data/data.csv
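
The same check can be made from Python instead of the shell. The following is a minimal sketch, not part of the original solution: `can_write` is a hypothetical helper that tries to create a temporary file in a directory and reports whether that succeeded.

```python
import tempfile


def can_write(directory):
    """Return True if a temporary file can be created in `directory`.

    Hypothetical helper for illustration; not part of Data Safe Haven.
    """
    try:
        # The temporary file is removed automatically when the block exits
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers read-only file systems, missing directories and permission errors
        return False


# The system temporary directory is used here so the sketch runs anywhere
print(can_write(tempfile.gettempdir()))
```

In a Data Safe Haven environment, `can_write("/data")` would be expected to return `False`.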

## Inspect the data

We will read the data from the CSV file and store it in a pandas DataFrame.

In [None]:
!cp /data/data.csv ./

In [None]:
df = pd.read_csv("./data.csv")
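
Copying the file first is one option. Since reading needs only read permission, `pd.read_csv` can also read a read-only file in place. The sketch below illustrates this with a temporary file standing in for `/data/data.csv`; the file contents and the `name` column are invented for illustration.

```python
import os
import stat
import tempfile

import pandas as pd

# Stand-in for /data/data.csv: a small CSV with invented values
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
with open(csv_path, "w") as f:
    f.write("name,height\nA,170.5\nB,165.2\n")

# Make the file read-only for its owner, mimicking the input data
os.chmod(csv_path, stat.S_IRUSR)

# Reading works without any copy because no write access is needed
df_direct = pd.read_csv(csv_path)
print(df_direct.shape)
```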

We want to calculate the **mean height**.

Let's look at the columns to find which contains the height data.

In [None]:
df.columns

We could go ahead and calculate the mean immediately:

```python
df["height"].mean()
```

However, we should look at the data first to check that it makes sense.
The pandas `describe` method can help us.

In [None]:
height = df["height"]
height.describe()

The mean looks reasonable.
168.8 would make sense if the height values are in cm.

But look at the minimum and maximum values.
Those don't seem reasonable at all.

Let's look closer at the smallest values.

In [None]:
height_sorted = height.sort_values()
height_sorted.head(10)

And the largest values.

In [None]:
height_sorted.tail(10)
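
As an alternative to sorting the whole column, pandas provides `Series.nsmallest` and `Series.nlargest`, which return only the extreme values. A minimal sketch on synthetic heights (invented for illustration, not taken from the challenge data):

```python
import pandas as pd

# Synthetic heights for illustration only; not the challenge data
height_example = pd.Series([171.2, 0.4, 165.0, 1500.0, 180.3, 3.1, 158.9, 2200.0])

# Three smallest values, in ascending order
smallest = height_example.nsmallest(3)

# Three largest values, in descending order
largest = height_example.nlargest(3)

print(smallest.tolist())
print(largest.tolist())
```

Both methods keep the original index, so the rows holding the suspicious values can be looked up afterwards.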

## Cleaning the data

It would be more reasonable to remove these outliers before calculating the mean.

Here, we will use information about how those outliers were generated to remove them.

The anomalously large heights are always greater than or equal to 1000.
The anomalously small heights are always less than 10.

In [None]:
# Keep heights that are at least 10 and strictly below 1000
height_filtered = height[(height >= 10) & (height < 1000)]
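
The boundary values deserve care: by the rules above, a height of exactly 1000 is an outlier, while a height of exactly 10 is valid. The sketch below checks this on synthetic values (invented for illustration) that sit on and around both thresholds; the names `height_example`, `is_valid` and `kept` are used only in this example.

```python
import pandas as pd

# Synthetic heights for illustration only; 10.0 and 1000.0 sit exactly on the thresholds
height_example = pd.Series([9.9, 10.0, 168.5, 999.9, 1000.0, 1500.0])

# A value is valid if it is at least 10 and strictly below 1000
is_valid = (height_example >= 10) & (height_example < 1000)
kept = height_example[is_valid]

print(kept.tolist())
print(f"removed {(~is_valid).sum()} of {len(height_example)} values")
```

Counting the removed values is a cheap sanity check that the filter discards only a small fraction of the data.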

## Calculating the mean

Now we can calculate the mean of our cleaned height data.

In [None]:
mean = np.around(height_filtered.mean(), 1)
mean

## Prepare your result for extraction

In a Data Safe Haven environment, data in `/output` can be considered for removal from the environment.

You can read, write and delete files in the `/output` directory.

In [None]:
!ls /output
!echo "hello" >> /output/a.txt
!cat /output/a.txt

If we write our output to a file in `/output`, an admin will be able to create a link to download that file once the review process is complete.

In [None]:
with open("/output/mean_height.txt", "w") as f:
    f.write(f'{mean}\n')

In [None]:
!cat /output/mean_height.txt
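
Before requesting a review, it can be worth confirming that the file parses back to the intended number. The sketch below is one way to do this, assuming a temporary directory as a stand-in for `/output` and a placeholder value in place of the real result.

```python
import os
import tempfile

# Stand-in for /output so the sketch runs anywhere
output_dir = tempfile.mkdtemp()
result_path = os.path.join(output_dir, "mean_height.txt")

# Placeholder value for illustration; not the answer to the challenge
mean_example = 170.0

with open(result_path, "w") as f:
    f.write(f"{mean_example}\n")

# Read the file back and parse it as a number
with open(result_path) as f:
    written = float(f.read().strip())

print(written == mean_example)
```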