# Data Analysis
This module will introduce and examine the stage of 'exploratory data analysis'. 

Rather than focusing on the statistical or technical methods employed in modern data science, we will approach this stage from a bias-aware perspective. We will, however, make use of Jupyter Notebooks (a popular tool in data science) to aid our exploratory data analysis[^jupyter] by visualising some data. You do not need to be familiar with either Python or Jupyter Notebooks if you just want to gain an understanding of how social, cognitive, and statistical biases interact and affect downstream stages in the research and innovation lifecycle. The code is presented for those who wish to get more "hands-on".

## What is Exploratory Data Analysis?

Exploratory data analysis (EDA) is a crucial stage in the project lifecycle. It is where a number of techniques are used to gain a better understanding of the dataset and of any relationships that exist between the relevant variables. Among other things, this could mean:

- Describing the dataset and important variables
- Identifying missing data and outliers, and deciding how to handle them
- Cleaning the dataset
- Provisional analysis of any relationships between variables
- Uncovering possible limitations of the dataset (e.g. class imbalances) that could affect the project

We will cover each of these sub-stages of EDA briefly. To reiterate, though, our primary focus in this section is on the risks and challenges that stem from a variety of biases, which can cause cascading issues for downstream tasks (e.g. model training).

## COVID-19 Hospital Data

For the purpose of this section, we have created a synthetic dataset that contains X records for fictional patients who were triaged at (and possibly admitted to) a single hospital for treatment of COVID-19.

The dataset has been designed with this pedagogical task in mind. Therefore, although we relied upon plausible assumptions when developing our generative model, the data are not intended to be fully representative of actual patients. Our methodology for generating this dataset can be found here.

### Importing Data

First of all, we need to import our data and the software packages that we will use to describe, analyse, and visualise the data. The following lines of code achieve this and also load our data into a DataFrame using the pandas package:

In [None]:
# The following lines import necessary packages that help us with the data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# This line imports data from a csv file into a DataFrame (df)

df = pd.read_csv('covid_patients_syn_data.csv')

### Describing the Data


In [None]:
# Only the last expression in a notebook cell is displayed automatically, so each result is printed explicitly.

# This line returns the number of rows and the number of columns in the dataset.
print(df.shape)

# This line returns the first 5 rows of the dataset, which can be useful if you want to see a small sample of values for each variable.
print(df.head())

# This line returns the names of all of the columns in your dataset. This is helpful if you want to quickly see which variables you will have access to during your analysis.
print(df.columns)

# This line returns the number of unique values for each of the variables. For example, in the ethnicity column there are X different values.
print(df.nunique(axis=0))

# This line uses the pandas 'describe()' method to provide a summary of the dataset (formatted for readability). This includes the count, mean, standard deviation, min, and max for numeric variables.
print(df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f'))))
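
The summary above covers numeric variables. For categorical variables, a quick look at category proportions can help surface class imbalances of the kind mentioned earlier. The following is a minimal sketch on a small made-up DataFrame; the column names `group` and `admitted`, and all of the values, are hypothetical and are not taken from the synthetic dataset.

```python
import pandas as pd

# Toy data for illustration only (hypothetical columns and values)
toy = pd.DataFrame({
    "group": ["A", "A", "A", "A", "A", "A", "A", "A", "B", "B"],
    "admitted": [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})

# Proportion of records that fall into each category
group_share = toy["group"].value_counts(normalize=True)
print(group_share)

# Outcome rate within each category, alongside the number of records it rests on
by_group = toy.groupby("group")["admitted"].agg(["mean", "count"])
print(by_group)
```

In this toy example group B makes up only a fifth of the records, so any rate estimated for it rests on just two rows. Reporting the count next to the rate makes that limitation visible.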

### Cleaning and Querying the Data


In [None]:
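# It may be useful to use this section as a way of addressing missingness.
df_cleaned = df.copy()  # Assumption: placeholder so that later cells can reference df_cleaned; replace with real cleaning steps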
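
One way to approach missingness, sketched here on a small made-up DataFrame rather than on the synthetic dataset, is to count missing values per column and then check whether they are spread evenly across groups. The column names `group` and `oxygen_sat` and all values are hypothetical, and the two handling options shown (dropping rows and filling with the median) are illustrative choices rather than a recommended procedure.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (hypothetical columns and values)
toy = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "oxygen_sat": [96.0, 94.0, np.nan, 97.0, np.nan, np.nan, 92.0, np.nan],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of missing values within each group
missing_by_group = toy["oxygen_sat"].isna().groupby(toy["group"]).mean()
print(missing_by_group)

# Option 1: drop every row that has a missing value
toy_dropped = toy.dropna()
print(toy_dropped["group"].value_counts())

# Option 2: fill missing values with the column median
toy_filled = toy.fillna({"oxygen_sat": toy["oxygen_sat"].median()})
print(toy_filled)
```

Neither option is neutral in this toy example. Dropping rows removes three of the four group B records, which changes the composition of the dataset, while filling with the median assigns most of group B a value estimated largely from group A.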

### Analysing the Data


In [None]:
# The following lines compute a correlation matrix for the numeric columns of the cleaned DataFrame and plot it as a heatmap using the seaborn package
corr = df_cleaned.corr(numeric_only=True)  # numeric_only requires pandas 1.5 or later
# Plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, cmap=sns.diverging_palette(220, 20, as_cmap=True))
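
Correlation coefficients can be sensitive to outliers, which is one reason why outlier handling and provisional analysis are worth considering together. The sketch below uses made-up values (the columns `x` and `y` are hypothetical) to show how a single extreme record can change a Pearson correlation.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2, 1, 2, 1, 2, 1, 2, 1],
})

# Pearson correlation without any extreme record
r_without = toy["x"].corr(toy["y"])

# Add a single extreme record and recompute
extreme = pd.DataFrame({"x": [100], "y": [100]})
toy_with = pd.concat([toy, extreme], ignore_index=True)
r_with = toy_with["x"].corr(toy_with["y"])

print(round(r_without, 2), round(r_with, 2))
```

Here a weak negative correlation becomes a strong positive one once the extreme record is included, so a heatmap alone may say more about one unusual record than about the relationship between the variables.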

### Visualising the Data

In [None]:
# The following line of code allows us to visualise the distribution of variable X as a simple histogram
df_cleaned['var_x'].plot(kind='hist', bins=100, figsize=(12,6), facecolor='grey',edgecolor='black')

# The following line of code allows us to visualise the distribution of variable X as a boxplot
df_cleaned.boxplot('var_x')

# The following line of code allows us to visualise the relationship between X and Y as a scatterplot
df_cleaned.plot(kind='scatter', x='var_x', y='var_y')

# The seaborn package also comes with a useful function to quickly create scatterplots for all pairs of numeric variables in the dataset
sns.pairplot(df_cleaned)
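
By matplotlib's default settings, a boxplot draws points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual markers. The following minimal sketch applies the same rule numerically to made-up values, which can be handy when you want a list of flagged records rather than a picture. Whether a flagged value is an error or a genuine but unusual observation is a separate judgement.

```python
import pandas as pd

# Toy data for illustration only (hypothetical values)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences
flagged = values[(values < lower_fence) | (values > upper_fence)]
print(flagged.tolist())
```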