## Questions and Findings

*A space to document open questions about the dataset and findings that emerge during the analysis.*

#### *FOR EXAMPLE:*

Questions:
- What is the difference between variable 4 and 8? (Given that they have similar names)


Findings:
- There is a strong skew between categories for variable 1, with 72% spent in state 1 and only 28% across states 2, 3 and 4.
- Some variables have a high proportion of NaNs, in particular variable 9 (36%) and variable 12 (31%).
- Subject 1 is entirely missing data for variable 7.
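
Figures like the ones in the example findings can usually be derived with a couple of pandas calls. The sketch below is illustrative only: `example_df` and its values are made up for demonstration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
example_df = pd.DataFrame({
    "variable 1": ["state 1"] * 6 + ["state 2", "state 3", "state 4", "state 2"],
    "variable 9": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, np.nan, 8.0, 9.0, np.nan],
})

# Share of rows in each category, as a percentage
category_share = example_df["variable 1"].value_counts(normalize=True) * 100

# Percentage of missing values per column, highest first
pct_missing = (example_df.isna().mean() * 100).sort_values(ascending=False)

print(category_share.round(1))
print(pct_missing.round(1))
```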


## Module imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os, sys

# Add the src folder (next to the notebook's parent directory) to the module import path
sys.path.insert(0, os.path.join(os.path.dirname(os.getcwd()), 'src'))


## Load data 

In [None]:
df1 = pd.read_csv("../data/<FILE NAME>", index_col=0)
df2 = pd.read_csv("../data/<FILE NAME 2>", index_col=0)

In [None]:
df1.head()

In [None]:
df2.head()

## Describe data

In [None]:
# Print summary statistics for each column in turn
for column in df1.columns:
    print(df1[column].describe(), "\n")

In [None]:
def quick_analysis(df):
    print(f"Rows: {df.shape[0]}")
    print(f"\nColumns: {df.shape[1]}")
    print(f"\nColumn Names: {df.columns}")
    print(f"\nData Types: {df.dtypes}")

quick_analysis(df1)

In [None]:
def rank_missing_data(df):
    # Percentage of null values per column, highest first
    pct_null = (df.isnull().mean() * 100).sort_values(ascending=False)
    print("Variables with most null values:")
    for col, pct in pct_null.head(10).items():
        print(f"\t- Column {col}: {pct:.1f}% null")

rank_missing_data(df1)

# Number of null values per column
df1.isna().sum()

In [None]:
# Print number of data points per subject
# NOTE: assumes the subject column is named SUBJECT; adjust if it is SUBJECT_NAME
for subject in df1["SUBJECT"].unique():
    print(f"{subject} data points: {len(df1.loc[df1['SUBJECT'] == subject])}")

## Clean data

In [None]:
# Remove empty or near-empty columns (fewer than 3 non-null values)
df1 = df1.dropna(axis=1, thresh=3)

In [None]:
# Remove rows with a duplicated index label (keeps the first occurrence)
df1 = df1[~df1.index.duplicated(keep='first')]
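
The cell above deduplicates on the index label only, which is not the same as dropping rows whose values repeat. A minimal sketch of the difference, using a hypothetical toy frame (`toy` is not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame: index label "a" appears twice,
# and rows "b" and "c" hold identical values
toy = pd.DataFrame(
    {"x": [1, 2, 3, 3], "y": [10, 20, 30, 30]},
    index=["a", "a", "b", "c"],
)

# Drops rows whose index label has already been seen
unique_index = toy[~toy.index.duplicated(keep="first")]

# Drops rows whose column values repeat an earlier row, regardless of index
unique_values = toy.drop_duplicates(keep="first")

print(unique_index)
print(unique_values)
```

Which variant is appropriate depends on whether the index is meant to be a unique identifier in the dataset at hand.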

In [None]:
# Convert empty or whitespace-only strings into NaNs
df1 = df1.replace(r'^\s*$', np.nan, regex=True)

In [None]:
# Convert a variable stored as non-numerical values (e.g. strings) into floats
df1['variable 1'] = df1['variable 1'].astype(float)
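
`astype(float)` raises an error if any entry cannot be parsed as a number. One possible alternative, sketched below on a hypothetical series (`raw` is made up for illustration), is `pd.to_numeric` with `errors="coerce"`, which replaces unparseable entries with NaN so they can be inspected afterwards.

```python
import pandas as pd

# Hypothetical column containing a value that cannot be parsed as a number
raw = pd.Series(["1.5", "2", "n/a", "4.25"], name="variable 1")

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Original entries that were lost in the conversion
failed = raw[numeric.isna() & raw.notna()]

print(numeric)
print(failed)
```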

In [None]:
# Convert a two-category string variable to binary (M -> 1, B -> 0)
df1['diagnosis'] = df1['diagnosis'].map({'M':1,'B':0})

In [None]:
# Merge datasets
combined_df = pd.merge(df1, df2, left_index=True, right_index=True)
combined_df
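
By default `pd.merge` performs an inner join, so rows whose index appears in only one frame are dropped without warning. A minimal sketch of how this could be checked, assuming two hypothetical frames (`left` and `right` are stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical frames sharing only some index labels
left = pd.DataFrame({"a": [1, 2, 3]}, index=["s1", "s2", "s3"])
right = pd.DataFrame({"b": [10, 30, 40]}, index=["s1", "s3", "s4"])

# Outer merge keeps every row; the "_merge" column records its origin
check = pd.merge(left, right, left_index=True, right_index=True,
                 how="outer", indicator=True)

# Rows that the default inner merge would drop
unmatched = check[check["_merge"] != "both"]

print(check["_merge"].value_counts())
print(unmatched)
```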

## Visualise data

In [None]:
df1.hist(figsize=(16, 12))
plt.show()

In [None]:
# TODO: add more visualisation plots
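
As one possible starting point for further plots, the sketch below draws box plots and a correlation heatmap for the numeric columns. `plot_overview` and the random `demo` frame are hypothetical helpers introduced here for illustration; they are not part of the template.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_overview(df):
    """Draw box plots and a correlation heatmap for the numeric columns of df."""
    numeric = df.select_dtypes(include="number")
    fig, (ax_box, ax_corr) = plt.subplots(1, 2, figsize=(16, 6))

    # Box plots highlight spread and outliers per variable
    numeric.boxplot(ax=ax_box)
    ax_box.set_title("Distribution per variable")

    # Pairwise Pearson correlations shown as a colour grid
    corr = numeric.corr()
    image = ax_corr.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax_corr.set_xticks(range(len(corr.columns)))
    ax_corr.set_xticklabels(corr.columns, rotation=90)
    ax_corr.set_yticks(range(len(corr.columns)))
    ax_corr.set_yticklabels(corr.columns)
    ax_corr.set_title("Correlation matrix")
    fig.colorbar(image, ax=ax_corr)
    return fig

# Hypothetical random data standing in for df1
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["v1", "v2", "v3"])
fig = plot_overview(demo)
```

In the notebook, this could be called as `plot_overview(df1)` followed by `plt.show()`.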

## More targeted analysis

*A space for bespoke code, to answer questions specific to our dataset.*