# Exploratory Data Analysis

## Load the Data

In [1]:
# Import necessary libraries
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from helpers import *

%load_ext autoreload
%autoreload 2



In [2]:
# Load the Data
data_path = "data/dataset/"
x_train, x_test, y_train, train_ids, test_ids = load_csv_data(data_path, sub_sample=False)

## Data Overview

In [3]:
# View shape of the dataset
x_train.shape, x_test.shape, y_train.shape, train_ids.shape, test_ids.shape

((328135, 321), (109379, 321), (328135,), (328135,), (109379,))

- There are 321 features.
- `x_train` contains the training data and it has 328 135 data entries (before cleaning).
- `x_test` contains the test data and it has 109 379 data entries (before cleaning).
- The `y_train` vector holds the true values of the output (the variable we wish to predict). The output describes whether a person is diagnosed with MICHD or not. It is binary: -1 means no MICHD and +1 means MICHD. There are 328 135 data points (before cleaning).
- `train_ids` and `test_ids` are NumPy arrays holding the ids of the data entries of the train data and test data, respectively. Their lengths therefore equal the number of data entries in the train and test data.
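
Since the output is binary, a quick look at the class balance is a natural next check. The snippet below is a minimal sketch on made-up labels (`y_toy` is hypothetical and only stands in for `y_train`); it counts how often each label occurs and prints its share.

```python
import numpy as np

# Toy labels standing in for y_train (hypothetical values, not the real data)
y_toy = np.array([-1, -1, 1, -1, 1, -1, -1, -1])

# Count the occurrences of each distinct label
labels, counts = np.unique(y_toy, return_counts=True)
shares = counts / counts.sum()

for label, count, share in zip(labels, counts, shares):
    print(f"label {int(label):+d}: {int(count)} entries ({share:.1%})")
```

Running the same lines on `y_train` would show how unbalanced the two classes are, which matters when choosing an evaluation metric later.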

In [4]:
# View the few first and last rows of the dataset
print(x_train[:5])  # First 5 elements

[[5.3000000e+01 1.1000000e+01 1.1162015e+07 ...           nan
            nan 2.0000000e+00]
 [3.3000000e+01 1.2000000e+01 1.2152015e+07 ...           nan
            nan           nan]
 [2.0000000e+01 1.0000000e+01 1.0202015e+07 ... 1.0000000e+00
  2.0000000e+00 2.0000000e+00]
 [4.2000000e+01 6.0000000e+00 6.1820150e+06 ... 2.0000000e+00
  2.0000000e+00 2.0000000e+00]
 [2.4000000e+01 1.1000000e+01 1.1062015e+07 ... 9.0000000e+00
  9.0000000e+00 2.0000000e+00]]


In [5]:
print(x_train[-5:])  # Last 5 elements

[[4.9000000e+01 7.0000000e+00 1.1232015e+07 ...           nan
            nan 2.0000000e+00]
 [5.1000000e+01 5.0000000e+00 6.0820150e+06 ...           nan
            nan 1.0000000e+00]
 [3.9000000e+01 1.0000000e+01 1.0202015e+07 ... 2.0000000e+00
  2.0000000e+00 2.0000000e+00]
 [3.3000000e+01 1.2000000e+01 1.2302015e+07 ...           nan
            nan 2.0000000e+00]
 [3.2000000e+01 9.0000000e+00 9.1220150e+06 ...           nan
            nan 2.0000000e+00]]


In [6]:
# Getting rid of useless features

In [7]:
# Note which features are categorical/continuous

## Getting Rid of Useless Features

To figure out which features are unimportant, we take a look at the column names of the dataset.

After taking a closer look at the features, we remove the columns listed in the cells below. Each column was removed for one of the following reasons:
- The column was not relevant to the goal of our project (e.g. State, Income, etc.).
- The column was one of several questions on the same subject that were later regrouped into a single summary feature. For example, participants were asked many questions about cholesterol, and one final feature summarizes the answers. We keep only this final feature.
- The column had too many null values to be useful.

In [8]:
x_train_new = x_train.copy()
columns_to_remove = range(50)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [9]:
columns_to_remove = range(1, 14)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [10]:
columns_to_remove = range(2, 44)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [11]:
columns_to_remove = range(13, 37)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [12]:
columns_to_remove = range(16, 37)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [13]:
columns_to_remove = range(42, 56)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [14]:
columns_to_remove = range(52, 66)  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [15]:
columns_to_remove = [78, 79, 80]  # Indices of columns to remove
x_train_new = np.delete(x_train_new, columns_to_remove, axis=1)

In [25]:
x_train_new = np.delete(x_train_new, 72, axis=1) # remove height in inches to only keep height in meters
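
Each `np.delete` call above uses indices relative to the array as it stands after the previous deletions, not relative to the original 321 columns. The following is a minimal sketch, on a toy matrix, of one way to keep track of which original columns survive; `x_toy` and `original_idx` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Toy matrix with 10 columns; column j holds the value j in every row
x_toy = np.tile(np.arange(10.0), (3, 1))

# original_idx[i] is the original position of the column currently at index i
original_idx = np.arange(x_toy.shape[1])

# Two successive deletions, each expressed in the *current* indexing
for cols in (range(0, 3), range(1, 4)):
    cols = list(cols)
    x_toy = np.delete(x_toy, cols, axis=1)
    original_idx = np.delete(original_idx, cols)

print(original_idx)  # original indices of the surviving columns
```

The second deletion removes current indices 1 to 3, which are original columns 4 to 6, so carrying `original_idx` along makes it possible to look the survivors up in the dataset's column names.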

In [27]:
# Get rid of columns that have too many null values

x_train_clean = x_train_new.copy()
max_nan_threshold = 50000  # Maximum number of NaN values allowed in a column
nan_counts = np.isnan(x_train_new).sum(axis=0)  # Count the NaN values in each column
columns_to_keep = nan_counts <= max_nan_threshold  # Keep columns whose NaN count does not exceed the threshold
x_train_clean = x_train_clean[:, columns_to_keep]  # Drop columns with too many NaN values

In [28]:
x_train_clean.shape

(328135, 70)

This leaves us with 70 features. Let's take a closer look at the features left.

In [29]:
np.isnan(x_train_clean).sum(axis=0)

array([    0,     0,     0,     0,     0, 43801,     0,     0,     0,
        1883,     0,     0,     0,     0,     0,     0,  5438,     0,
           0,     0,     0, 11368, 23006, 27073, 27073,     0,     0,
           0,     0,     0,     0,     0,     0, 28366, 26927, 29382,
       27893, 28958, 30593,     0,     0,     0,     0, 32115, 37605,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
       32496,     0,     0,     0,     0,     0,     0,     0,     0,
           0,  1883,  1883,  1883,     0,     0, 32080])
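
These counts can also be read as a share of rows, which is easier to compare across datasets of different sizes. For reference, the absolute threshold of 50 000 corresponds to roughly 15% of the 328 135 rows. The sketch below applies the same kind of filter with a fractional threshold on a toy array; `x_toy` and `max_nan_fraction` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy data: 4 rows, 3 columns (hypothetical values)
x_toy = np.array([
    [1.0, np.nan, 3.0],
    [2.0, np.nan, np.nan],
    [3.0, np.nan, 1.0],
    [4.0, 5.0, 2.0],
])

# Share of NaN values in each column (mean of a boolean mask)
nan_fraction = np.isnan(x_toy).mean(axis=0)

max_nan_fraction = 0.5  # hypothetical threshold for this toy example
keep = nan_fraction <= max_nan_fraction
x_kept = x_toy[:, keep]

print(nan_fraction)
print(x_kept.shape)
```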

Below, we specify the index of the column, its name, the number of NaNs it contains, and the type of feature. If not specified, there are no NaNs.
- 2: health coverage. Type of health coverage, if it exists. **(Not relevant to this problem, can be removed)**
- 4: Cholesterol Check. How long it has been since last cholesterol check. **(Not relevant to this problem, can be removed)**
- 7,8: Asthma. Adults who have been told they formerly or currently have asthma. **(Not relevant because the information is summarized in another variable)**
- 11,12,13,14,15: Race **(Not relevant because the information is summarized in another variable)**
- 28,29,30,31: drinking categories. **(Not relevant because the information is summarized in another variable)**
- 69: Tested for HIV. Measures whether the participant has been tested for HIV, but doesn't give information on whether they have been diagnosed. **(Not relevant to this problem, can be removed)**

In [30]:
x_train_final = x_train_clean.copy()
columns_to_remove = [2, 4, 7, 8, 11, 12, 13, 14, 15, 28, 29, 30, 31, 69] # Indices of columns to remove
x_train_final = np.delete(x_train_final, columns_to_remove, axis=1)

x_train_final.shape

(328135, 56)
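
Because the deletion shifts column positions, it can help to map each index in `x_train_final` back to its index in `x_train_clean`, so that the NaN counts printed earlier can be matched to the right feature. The snippet below is a small sketch of that mapping; `n_clean_columns` and `kept` are names introduced here for illustration.

```python
import numpy as np

n_clean_columns = 70  # number of columns in x_train_clean
columns_to_remove = [2, 4, 7, 8, 11, 12, 13, 14, 15, 28, 29, 30, 31, 69]

# kept[i] is the index in x_train_clean of column i in x_train_final
kept = np.delete(np.arange(n_clean_columns), columns_to_remove)

print(len(kept))
print(kept[:8])
```

For example, `kept[3]` is 5, so column 3 of `x_train_final` is column 5 of `x_train_clean`, the one with 43 801 NaNs.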

Below, we specify the index of the column, its name, and we describe what it measures.
- 0: sex. Categorical: 1 for male, 2 for female.
- 1: health status. Categorical: 1 for good or better health, 2 for fair or poor health. 
- 2: Blood pressure levels. Categorical: 1 for low blood pressure, 2 for high blood pressure.
- 3: level of cholesterol, 43 801 NaNs. Categorical: 1 for low cholesterol, 2 for high cholesterol.
- 4: CHD or MI. Categorical: 1 for reported having MI or CHD, 2 if not.
- 5: asthma status, 1883 NaNs. Categorical: 1 current asthma, 2 former asthma, 3 no asthma.
- 6: Arthritis. Categorical: 1 if diagnosed with arthritis, 2 if not.
- 7: race groups, 5438 NaNs. Categorical: 1 White - Non-Hispanic, 2 Black - Non-Hispanic, 3 Hispanic, 4 Other race only - Non-hispanic, 5 Multiracial, Non-Hispanic.
- 8,9,10,11: Age categories. Different categorizations of Age. We keep them all to choose later.
- 12: height in meters, 11 368 NaNs.
- 13,14,15: weight in kg, BMI and 4 categories of BMI. They all have around 25 000 NaNs. We keep them all to choose later.
- 16: BMI categories. Kept together with columns 13 to 15 so that we can choose later.
- 17,18: smoke categories. We keep them both to choose later.
- 19: Heavy drinkers. Categorical: 1 if not heavy drinker, 2 if heavy drinker.
- 20 -> 25: fruit and vegetable consumption information, all have around 30 000 NaNs.
- 26,27,28,29: fruit and vegetable consumption information, no NaNs.
- 30, 31: fruit and vegetable consumption information, both have more than 30 000 NaNs.
- 32 -> 40: Exercise information.
- 41: Exercise information, 32 496 NaNs.
- 42 -> 50: Exercise information.
- 51, 52, 53: Exercise information, each has 1883 NaNs.
- 54, 55: Exercise information.

## Missing Values

In [None]:
# Categorical vs. continuous features
# Set a threshold on the number of unique values: columns with fewer than ... unique values are treated as categorical, the rest as continuous
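
The idea noted in the cell above can be sketched as follows. This is only an illustration under assumptions: the helper `split_by_cardinality`, the `max_unique` parameter and its value are hypothetical, and the right threshold for the real data still has to be chosen. NaNs are excluded before counting distinct values.

```python
import numpy as np

def split_by_cardinality(x, max_unique=10):
    """Split column indices by the number of distinct non-NaN values.

    Columns with at most `max_unique` distinct values are labelled
    categorical, all others continuous.
    """
    categorical, continuous = [], []
    for j in range(x.shape[1]):
        col = x[:, j]
        n_unique = np.unique(col[~np.isnan(col)]).size
        if n_unique <= max_unique:
            categorical.append(j)
        else:
            continuous.append(j)
    return categorical, continuous

# Toy data: a two-level code and a height-like measurement (hypothetical values)
x_toy = np.column_stack([
    np.array([1.0, 2.0, 1.0, 2.0, np.nan, 1.0]),
    np.array([1.62, 1.75, 1.80, 1.68, 1.91, np.nan]),
])

categorical_idx, continuous_idx = split_by_cardinality(x_toy, max_unique=3)
print(categorical_idx, continuous_idx)
```

A count-based rule like this is only a heuristic: a continuous feature recorded with few distinct values would be labelled categorical, so the result is worth checking against the feature descriptions above.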