# Exercise I

## Exploratory Data Analysis (EDA)

In our first exercise, we will explore a public dataset of Coronavirus PCR tests based on a fascinating [blog post](https://blog.mafatchallenge.com/2020/04/30/covid-19-testing-with-ml-part-2/) published as part of the [MAFAT Challenge](https://mafatchallenge.mod.gov.il/).

The purpose of this exercise is to demonstrate the importance of inspecting and understanding your data before trying to do anything fancy. As elementary as it may sound, skipping this preliminary step is a common pitfall in data analysis pipelines. Nurturing an instinct for visualizing and organizing your raw data will pay off tremendously, both when developing your models and when evaluating their robustness.

We'll start things off by summarizing some key information about the dataset:

* The dataset is shared as [a GitHub repository](https://github.com/mdcollab/covidclinicaldata) containing a directory of CSV files.
* It is collected by two US-based companies providing services in the health industry.
* The number of observations is still relatively small, but more are added weekly and the quality of the data is high (there are many features to work with and it is relatively organized and "clean").
* Chest X-rays are also available for some of the patients, but they were not included in this analysis.

So far so good. Let's get to work.

```{note}
This notebook is somewhat verbose and includes a lot of non-crucial code. The purpose of including it is to provide a more "complete" example for a realistic EDA workflow and help students who aren't already adequately familiar with Python and Pandas to become more comfortable with the endless possibilities these tools offer. If you find it hard to follow the code, this might be a good time to invest in leveling up your Python skills so that in the following exercises you can focus exclusively on *what* the code is doing rather than *how* it is doing it.
```

### Data Retrieval

First, we will create a function to retrieve the dataset from the public repository:

In [101]:
import math
import numpy as np
import pandas as pd

from typing import Any

# Set some constants required for data retrieval.
# No trailing slash here: CSV_URL_PATTERN below adds the "/" separator itself.
DATA_DIRECTORY = "https://raw.githubusercontent.com/mdcollab/covidclinicaldata/master/data"
CSV_FILE_PATTERN = "{week_id}_carbonhealth_and_braidhealth.csv"
CSV_URL_PATTERN = f"{DATA_DIRECTORY}/{CSV_FILE_PATTERN}"
WEEK_IDS = (
    "04-07",
    "04-14",
    "04-21",
    "04-28",
    "05-05",
    "05-12",
    "05-19",
    "05-26",
    "06-02",
    "06-09",
    "06-16",
)
REPLACE_DICT = {"covid19_test_results": {"Positive": True, "Negative": False}}

def remove_x_ray_columns(data: pd.DataFrame) -> pd.DataFrame:
    """
    Removes radiology information columns from the dataset.
    
    Parameters
    ----------
    data : pd.DataFrame
        Input dataset
        
    Returns
    -------
    pd.DataFrame
        Dataset without X-ray data
    """
    
    xray_columns = [
        column_name for column_name in data.columns
        if column_name.startswith('cxr_')
    ]
    return data.drop(xray_columns, axis=1)

def read_data() -> pd.DataFrame:
    """
    Returns the public COVID-19 PCR test dataset from the *covidclinicaldata*
    repository.
    
    Returns
    -------
    pd.DataFrame
        COVID-19 PCR test dataset
    """

    urls = [CSV_URL_PATTERN.format(week_id=week_id) for week_id in WEEK_IDS]
    dataframes = [
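        # `on_bad_lines="skip"` (pandas >= 1.3) replaces the deprecated
        # `error_bad_lines=False` and silently drops malformed rows.
        pd.read_csv(url, parse_dates=True, on_bad_lines="skip")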
        for url in urls
    ]
    data = pd.concat(dataframes, ignore_index=True)
    data.replace(REPLACE_DICT, inplace=True)
    data = remove_x_ray_columns(data)
    return data
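
The URL construction above relies on a two-stage template, which can be easy to miss. The following is a minimal sketch of that idea only; `BASE_URL`, `FILE_PATTERN`, `URL_PATTERN` and `build_urls` are hypothetical names introduced for illustration, and the base URL is written without a trailing slash so that joining with `/` yields exactly one separator. No network access is involved.

```python
# Hypothetical stand-ins mirroring the constants used for data retrieval.
BASE_URL = "https://raw.githubusercontent.com/mdcollab/covidclinicaldata/master/data"
FILE_PATTERN = "{week_id}_carbonhealth_and_braidhealth.csv"

# The f-string substitutes the two constants immediately. The "{week_id}" text
# carried inside FILE_PATTERN is not re-evaluated, so it survives as a
# placeholder for a later call to str.format().
URL_PATTERN = f"{BASE_URL}/{FILE_PATTERN}"


def build_urls(week_ids):
    """Expand the URL template once per weekly batch identifier."""
    return [URL_PATTERN.format(week_id=week_id) for week_id in week_ids]


example_urls = build_urls(("04-07", "04-14"))
for url in example_urls:
    print(url)
```

Printing the URLs before passing them to `pd.read_csv()` is a cheap way to catch a malformed path early.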

We will also prepare a function to make the dataframe a little easier on the eyes:

```{tip}
For more information about styling Pandas dataframes, see [the documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html).
```

In [107]:
TARGET_COLUMN_NAME = "covid19_test_results"


def highlight_nan(value: Any) -> str:
    """
    Highlight NaN values in grey.

    Parameters
    ----------
    value : Any
        Cell value

    Returns
    -------
    str
        Cell background color definition
    """

    try:
        value = float(value)
    except (TypeError, ValueError):
        # Values that cannot be converted to a number (e.g. strings or None)
        # are never highlighted.
        color = "white"
    else:
        color = "grey" if math.isnan(value) else "white"
    return f"background-color: {color}"


def highlight_positives(test_result: bool) -> str:
    """
    Highlight positive values in red.

    Parameters
    ----------
    test_result : bool
        Observed test result

    Returns
    -------
    str
        Cell background color definition
    """

    color = "red" if test_result else "white"
    return f"background-color: {color}"


def get_table_styles(
    header_font_size: int = 12, cell_font_size: int = 11
) -> list:
    """
    Creates a table styles definition to be used by the
    `set_table_styles()` method.

    References
    ----------
    * Pandas' table styles documentation:
      https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Table-styles

    Parameters
    ----------
    header_font_size : int
        Header text font size in pixels (px)
    cell_font_size : int
        Cell text font size in pixels (px)

    Returns
    -------
    list
        Table styles definition
    """

    heading_properties = [("font-size", f"{header_font_size}px")]
    cell_properties = [("font-size", f"{cell_font_size}px")]
    return [
        dict(selector="th", props=heading_properties),
        dict(selector="td", props=cell_properties),
    ]


def pretty_print(
    data: pd.DataFrame,
    header_font_size: int = 12,
    cell_font_size: int = 11,
    text_align: str = "center",
) -> pd.DataFrame.style:
    """
    Returns a styled representation of the dataframe.

    Parameters
    ----------
    data : pd.DataFrame
        Input data
    header_font_size : int
        Header text font size in pixels (px)
    cell_font_size : int
        Cell text font size in pixels (px)
    text_align : str
        Any of the CSS text-align property options, defaults to "center"

    Returns
    -------
    pd.DataFrame.style
        Styled dataframe representation
    """

    table_styles = get_table_styles(
        header_font_size=header_font_size, cell_font_size=cell_font_size
    )
    return (
        data.style.set_table_styles(table_styles)
        .applymap(highlight_nan)
        .applymap(highlight_positives, subset=[TARGET_COLUMN_NAME])
        .set_properties(**{"text-align": text_align})
    )
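
Cell-wise styling functions are called once per cell and are expected to return a CSS string, so they can be tried on plain Python values before being attached to a styled dataframe. The sketch below illustrates this under that assumption; `is_missing` and `cell_style` are hypothetical helpers written for this example and are not part of Pandas.

```python
import math
from typing import Any


def is_missing(value: Any) -> bool:
    """Return True only for values that convert to a floating point NaN."""
    try:
        return math.isnan(float(value))
    except (TypeError, ValueError):
        # Non-numeric strings raise ValueError, None raises TypeError.
        return False


def cell_style(value: Any) -> str:
    """Return the CSS background definition for a single cell."""
    color = "grey" if is_missing(value) else "white"
    return f"background-color: {color}"


# A mix of value types similar to what a single dataframe row may hold.
samples = [float("nan"), 36.95, "Nasal", None, True]
styles = [cell_style(value) for value in samples]
for value, style in zip(samples, styles):
    print(repr(value), "->", style)
```

Only the first sample is reported as missing; the string, `None` and the boolean all fall back to a white background.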


Now we can read the data:

In [94]:
data = read_data()

And inspect it:

In [105]:
def get_scattered_chunks(data: pd.DataFrame,
                         n_chunks: int = 5,
                         chunk_size: int = 3) -> pd.DataFrame:
    """
    Returns a subsample of equally scattered chunks of rows.
    
    Parameters
    ----------
    data : pd.DataFrame
        Input dataset
    n_chunks : int
        Number of chunks to collect for the subsample
    chunk_size : int
        Number of rows to include in each chunk
    
    Returns
    -------
    pd.DataFrame
        Subsample data
    """
    
    endpoint = len(data) - chunk_size
    sample_indices = np.linspace(0, endpoint, n_chunks, dtype=int)
    sample_indices = [
        index for i in sample_indices for index in range(i, i + chunk_size)
    ]
    return data.iloc[sample_indices, :]
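
To see which rows such a sampling scheme picks, the index arithmetic can be reproduced on its own. This is a small sketch using a hypothetical table of 20 rows; `n_rows`, `starts` and `rows` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical sizes: a 20-row table sampled as 5 chunks of 3 rows each.
n_rows, n_chunks, chunk_size = 20, 5, 3

# The last chunk must start early enough to fit entirely inside the table.
endpoint = n_rows - chunk_size

# Evenly spaced chunk start positions, truncated to whole row numbers.
starts = np.linspace(0, endpoint, n_chunks, dtype=int)

# Expand every start position into `chunk_size` consecutive row numbers.
rows = [int(i) for start in starts for i in range(start, start + chunk_size)]

print(starts.tolist())  # [0, 4, 8, 12, 17]
print(rows)
```

Note that when `n_chunks * chunk_size` exceeds the number of rows, neighbouring chunks overlap and some rows appear more than once in the subsample.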

In [106]:
pretty_print(get_scattered_chunks(data))

Unnamed: 0,batch_date,test_name,swab_type,covid19_test_results,age,high_risk_exposure_occupation,high_risk_interactions,diabetes,chd,htn,cancer,asthma,copd,autoimmune_dis,smoker,temperature,pulse,sys,dia,rr,sats,rapid_flu_results,rapid_strep_results,ctab,labored_respiration,rhonchi,wheezes,days_since_symptom_onset,cough,cough_severity,fever,sob,sob_severity,diarrhea,fatigue,headache,loss_of_smell,loss_of_taste,runny_nose,muscle_sore,sore_throat,er_referral
0,2020-04-07,SARS COV 2 RNA RTPCR,Nasopharyngeal,False,58,True,,False,False,False,False,False,False,False,False,36.95,81.0,126.0,82.0,18.0,97.0,,,False,False,False,False,28.0,True,Severe,,False,,False,False,False,False,False,False,False,False,False
1,2020-04-07,"SARS-CoV-2, NAA",Oropharyngeal,False,35,False,,False,False,False,False,False,False,False,False,36.75,77.0,131.0,86.0,16.0,98.0,,,False,False,False,False,,True,Mild,False,False,,False,False,False,False,False,False,False,False,False
2,2020-04-07,SARS CoV w/CoV 2 RNA,Oropharyngeal,False,12,,,False,False,False,False,False,False,False,False,36.95,74.0,122.0,73.0,17.0,98.0,,,,,,,,False,,,,,,,,,,,,,False
2791,2020-04-28,Rapid COVID-19 Test,Nasopharyngeal,False,56,False,False,False,True,True,False,False,False,False,False,,,,,,,,,,,,,,False,,False,False,,False,False,False,False,False,False,False,False,False
2792,2020-04-28,Rapid COVID-19 Test,Nasal,False,73,False,True,False,False,True,False,False,False,True,False,36.75,86.0,124.0,80.0,16.0,98.0,,,,False,,,5.0,True,Mild,False,False,,False,False,True,False,False,True,True,False,False
2793,2020-04-28,Rapid COVID-19 Test,Nasal,False,25,True,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,False,,False,False,,False,False,False,False,False,False,False,False,False
5583,2020-05-19,"SARS-CoV-2, NAA",Nasal,False,62,True,False,False,False,False,False,False,False,False,False,37.0,100.0,110.0,76.0,16.0,98.0,,,False,False,False,False,,False,,,False,,True,False,True,False,False,True,True,False,False
5584,2020-05-19,SARS COV2 NAAT,Nasopharyngeal,False,33,False,,False,False,False,False,False,False,False,False,36.95,79.0,123.0,79.0,12.0,99.0,,,,False,,,,False,,,False,,False,False,False,False,False,False,False,False,False
5585,2020-05-19,SARS COV2 NAAT,Nasopharyngeal,True,20,True,True,False,False,False,False,False,False,False,False,36.9,60.0,114.0,75.0,12.0,97.0,,,False,,,,3.0,True,Mild,True,False,,True,False,True,True,True,False,False,False,False
8374,2020-06-09,SARS COV2 NAAT,Nasopharyngeal,False,28,False,False,False,False,False,False,True,False,False,False,36.75,100.0,119.0,85.0,16.0,99.0,,,False,False,False,False,,False,,False,False,,False,False,False,False,False,False,False,False,False


In [97]:
from myst_nb import glue

glue("n_observations", len(data), display=False)
glue("n_columns", len(data.columns), display=False)
glue("target_column_name", TARGET_COLUMN_NAME, display=False)

Some things that we can already learn about our dataset from this table are:

* It contains a total of {glue:}`n_observations` observations.
* There are {glue:}`n_columns` columns with mixed data types (numeric and categorical).
* Missing values certainly exist (we can easily spot `NaN` entries).
* The subsample raises a strong suspicion that the dataset is imbalanced, i.e. when examining our target variable ({glue:}`target_column_name`), it seems there are far more negative observations than positive ones.
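
Impressions drawn from a handful of rows are worth checking with numbers. The sketch below shows one possible way to quantify class balance and missing values; it runs on a tiny made-up dataframe (`toy`) that only borrows a few column names from the dataset, so the printed figures say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up toy values for illustration only; this is not the real dataset.
toy = pd.DataFrame(
    {
        "covid19_test_results": [False, False, False, True, False],
        "temperature": [36.95, np.nan, 36.75, 36.9, np.nan],
        "cough": [True, False, None, True, False],
    }
)

# Share of each class in the target column (fractions summing to 1).
class_share = toy["covid19_test_results"].value_counts(normalize=True)

# Fraction of missing entries per column, most incomplete column first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(class_share)
print(missing_share)
```

Applied to the full dataframe, the same two calls would replace the visual impression with exact proportions.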