# Data Cleaning Notebook

## Notebook Summary
In this Notebook:
- Remove all columns related to ADHD
- Handle all NaN values present in the dataset
- Trim columns that I won't be using
- Perform a train test split on the cleaned data

In [66]:
# Relevant imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer

In [67]:
# Loading data into pandas dataframe
nsch = pd.read_sas('../Data/nsch_2020_topical_SAS/nsch_2020_topical.sas7bdat')

# Visually checking successful loading of dataframe
nsch.head()



| | FIPSST | STRATUM | HHID | FORMTYPE | TOTKIDS_R | TENURE | HHLANGUAGE | SC_AGE_YEARS | SC_SEX | K2Q35A_1_YEARS | ... | BIRTH_YR_F | BMICLASS | HHCOUNT_IF | FPL_I1 | FPL_I2 | FPL_I3 | FPL_I4 | FPL_I5 | FPL_I6 | FWC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b'17' | b'1' | b'20000003' | b'T1' | 2.0 | 1.0 | 1.0 | 3.0 | 1.0 | NaN | ... | 0.0 | NaN | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 3296.080092 |
| 1 | b'29' | b'2A' | b'20000004' | b'T3' | 1.0 | 1.0 | 1.0 | 14.0 | 2.0 | NaN | ... | 0.0 | 2.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 2888.54533 |
| 2 | b'47' | b'1' | b'20000005' | b'T1' | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | NaN | ... | 0.0 | NaN | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 1016.68273 |
| 3 | b'28' | b'1' | b'20000014' | b'T3' | 2.0 | 1.0 | 1.0 | 15.0 | 2.0 | NaN | ... | 0.0 | 2.0 | 0.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 1042.091065 |
| 4 | b'55' | b'1' | b'20000015' | b'T3' | 2.0 | 2.0 | 1.0 | 16.0 | 2.0 | NaN | ... | 0.0 | 3.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 402.372392 |


## Removing Columns

### Columns Related to ADHD
Along with our target column, 'K2Q31A', several other columns relate directly to ADHD. Because they are so closely tied to the diagnosis, they need to be removed so the model can't lean on them when making predictions. Column descriptions can be found in the [EDA notebook](https://github.com/austint1121/Undiagnosed-ADHD-Identification/blob/main/Notebooks/EDA.ipynb).

In [68]:
# Creating list of columns to be dropped
related_ADHD = [
    'K2Q31A',
    'K2Q31B',
    'K2Q31C',
    'K2Q31D',
    'K4Q23',
    'SC_K2Q10',
    'SC_K2Q11',
    'SC_K2Q12',
    'ADDTREAT',
    'SC_CSHCN',
    'SC_K2Q22',
    'K4Q22_R',
    'K6Q15',
    'SC_K2Q20',
    'K4Q36',
    'TOTNONSHCN',
    'K4Q28X04',
]
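
A long hand-written list like this is easy to get wrong (a repeated or misspelled name, for example). Below is a minimal sketch of a sanity check that could be run before dropping. The helper `check_drop_list` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_drop_list(columns_to_drop, frame):
    """Return (duplicates, missing) for a list of column names.

    Hypothetical helper: `duplicates` are names listed more than once,
    `missing` are names that do not exist in `frame`.
    """
    seen = set()
    duplicates = []
    for name in columns_to_drop:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    missing = sorted(name for name in seen if name not in frame.columns)
    return duplicates, missing


# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({'K2Q31A': [1.0, 2.0], 'K4Q23': [2.0, 2.0], 'SC_SEX': [1.0, 2.0]})

duplicates, missing = check_drop_list(
    ['K2Q31A', 'K4Q23', 'K2Q31A', 'NOT_A_COLUMN'], toy
)
print('Listed more than once:', duplicates)
print('Not in the dataframe:', missing)
```

Running the same check against `related_ADHD` and `nsch` would flag any name that `drop` would otherwise silently repeat or fail on.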

In [69]:
# Dropping rows with NaN values in target column
dropped_adhd = nsch.dropna(subset=['K2Q31A'])

# Saving Target column
target = dropped_adhd['K2Q31A']

# Creating new dataframe without columns from above
dropped_adhd = dropped_adhd.drop(columns=related_ADHD)

# Confirming expected results
dropped_adhd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42589 entries, 0 to 42776
Columns: 426 entries, FIPSST to FWC
dtypes: float64(422), object(4)
memory usage: 138.7+ MB


### Object Columns

In [70]:
# Checking the column types
dropped_adhd.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42589 entries, 0 to 42776
Columns: 426 entries, FIPSST to FWC
dtypes: float64(422), object(4)
memory usage: 138.7+ MB


In [71]:
# We have 422 float64 and 4 object columns. Let's investigate those 4 objects
dropped_adhd.select_dtypes('object')

| | FIPSST | STRATUM | HHID | FORMTYPE |
|---|---|---|---|---|
| 0 | b'17' | b'1' | b'20000003' | b'T1' |
| 1 | b'29' | b'2A' | b'20000004' | b'T3' |
| 2 | b'47' | b'1' | b'20000005' | b'T1' |
| 3 | b'28' | b'1' | b'20000014' | b'T3' |
| 4 | b'55' | b'1' | b'20000015' | b'T3' |
| ... | ... | ... | ... | ... |
| 42772 | b'26' | b'1' | b'20239975' | b'T3' |
| 42773 | b'54' | b'1' | b'20239979' | b'T2' |
| 42774 | b'54' | b'1' | b'20239980' | b'T3' |
| 42775 | b'15' | b'1' | b'20239994' | b'T2' |


- **FIPSST** - State FIPS code
- **STRATUM** - Sampling stratum
- **HHID** - Unique household ID
- **FORMTYPE** - A proxy for age; kids are given a form based on age range (T1: 0-5, T2: 6-11, T3: 12-17)

All of these columns can be dropped: they should not have an effect on whether a child has ADHD, or they are a proxy for a variable that is already present.

In [72]:
# Dropping object columns
dropped_final = dropped_adhd.drop(columns=['FIPSST', 'STRATUM', 'HHID', 'FORMTYPE'])
# Confirming expected column count
dropped_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42589 entries, 0 to 42776
Columns: 422 entries, TOTKIDS_R to FWC
dtypes: float64(422)
memory usage: 137.4 MB
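
Since the four columns being dropped are exactly the object-typed ones, the same result could also be reached by filtering on dtype instead of naming each column. A small sketch on a made-up frame (the values below are illustrative only):

```python
import pandas as pd

# Toy frame: two byte-string (object) columns and two float columns
toy = pd.DataFrame({
    'FIPSST': [b'17', b'29'],
    'HHID': [b'20000003', b'20000004'],
    'TOTKIDS_R': [2.0, 1.0],
    'SC_SEX': [1.0, 2.0],
})

# Keep everything that is not object-typed
numeric_only = toy.select_dtypes(exclude='object')
print(list(numeric_only.columns))
```

Naming the columns explicitly, as done above, is the safer choice when an object column might need to be kept; the dtype filter is only a shortcut for the case where all of them go.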


## Train Test Split
The train test split needs to happen before any transformations are fit, to prevent data leakage.

In [74]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(dropped_final, target, random_state=15, stratify=target)

# Split test into a testing and final holdout/validation set
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, random_state=15, stratify=y_test)

# Printing total rows in each set
print(f'Training set is {len(X_train)} entries')
print(f'Testing set is {len(X_test)} entries')
print(f'Validation set is {len(X_val)} entries')

Training set is 31941 entries
Testing set is 7986 entries
Validation set is 2662 entries
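
Both calls above rely on the default `test_size` of 0.25, so the second split takes a quarter of a quarter. A sketch on synthetic data that spells the sizes out, which makes the resulting 75% / 18.75% / 6.25% proportions easier to see (the data and the `_rest` names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1600 rows, roughly 10% in the positive class
rng = np.random.default_rng(15)
X = pd.DataFrame({'feature': rng.normal(size=1600)})
y = pd.Series(np.where(np.arange(1600) % 10 == 0, 1.0, 2.0))

# First split: 75% train, 25% held back
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.25, random_state=15, stratify=y
)

# Second split: the held-back 25% becomes 75% test, 25% validation
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=15, stratify=y_rest
)

sizes = {'train': len(X_train), 'test': len(X_test), 'val': len(X_val)}
shares = {name: n / len(X) for name, n in sizes.items()}
print(sizes)
print(shares)
```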


In [83]:
# Printing the amount of kids diagnosed with ADHD in each split
print(f'There are {y_train.value_counts().values[1]} kids with ADHD in the training set')
print(f'There are {y_test.value_counts().values[1]} kids with ADHD in the testing set')
print(f'There are {y_val.value_counts().values[1]} kids with ADHD in the validation set')

There are 3229 kids with ADHD in the training set
There are 808 kids with ADHD in the testing set
There are 269 kids with ADHD in the validation set
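
`value_counts()` sorts by frequency, so `.values[1]` returns the count of the second most common answer. That equals the ADHD count only while ADHD is the minority class. A sketch of a more explicit count follows; it assumes that `1.0` is the code for "Yes" in 'K2Q31A', which should be checked against the codebook, and `count_positive` is a hypothetical helper.

```python
import pandas as pd

# Assumption: 1.0 encodes "Yes" for the target column
ADHD_YES = 1.0


def count_positive(labels, positive=ADHD_YES):
    """Count rows whose label equals the positive code."""
    return int((labels == positive).sum())


# Toy labels: two "Yes" answers out of six
y_toy = pd.Series([2.0, 2.0, 1.0, 2.0, 1.0, 2.0])

n_pos = count_positive(y_toy)
share = n_pos / len(y_toy)
print(f'There are {n_pos} positive rows ({share:.1%} of the set)')
```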



## Handling Missing Values
There are multiple strategies for handling missing values. Normally I would love to use scikit-learn's experimental [Iterative Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html). However, in this survey some questions are sub-questions of others, and can be left blank as a result of the answer to the parent question.
<br>
An example of this is the "ADDTREAT" column we dropped earlier. This column is left blank if "K2Q31A" (our target) is answered "No". I'm unsure whether the iterative imputer will pick up on this, so let's create two pipelines, one using a simple imputer and one using an iterative imputer.

In [12]:
# First, let's recall how many NaN values are present in the data.
dropped_final.isna().sum()

TOTKIDS_R         0
TENURE            0
HHLANGUAGE      154
SC_AGE_YEARS      0
SC_SEX            0
               ... 
FPL_I3            0
FPL_I4            0
FPL_I5            0
FPL_I6            0
FWC               0
Length: 422, dtype: int64
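
With 422 columns the truncated output hides most of the picture. A sketch of a helper that lists only the columns with missing values, sorted from most to least affected, is shown below. `missing_summary` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of NaNs for columns that have any."""
    counts = frame.isna().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        'n_missing': counts,
        'pct_missing': (counts / len(frame) * 100).round(2),
    })


# Toy frame with made-up values
toy = pd.DataFrame({
    'TENURE': [1.0, 2.0, 1.0, 1.0],
    'HHLANGUAGE': [1.0, np.nan, 1.0, 1.0],
    'BMICLASS': [np.nan, 2.0, np.nan, 3.0],
})

summary = missing_summary(toy)
print(summary)
```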

In [29]:
# Simple Imputer
SI_imputer = SimpleImputer(strategy='constant', fill_value=999)

# Iterative Imputer
Iter_imputer = IterativeImputer(max_iter=15)
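
One way the two imputers could be wrapped into pipelines and fit on the training split only is sketched below. The toy arrays, the step name `'impute'`, and the `random_state` are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.pipeline import Pipeline

# Toy train/test arrays with some values knocked out of column 1
rng = np.random.default_rng(15)
X_train_toy = rng.normal(size=(40, 3))
X_test_toy = rng.normal(size=(10, 3))
X_train_toy[::5, 1] = np.nan
X_test_toy[::3, 1] = np.nan

simple_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value=999)),
])
iter_pipe = Pipeline([
    ('impute', IterativeImputer(max_iter=15, random_state=15)),
])

# Fit on the training split, then reuse the fitted imputer on the test split
simple_train = simple_pipe.fit_transform(X_train_toy)
simple_test = simple_pipe.transform(X_test_toy)
iter_train = iter_pipe.fit_transform(X_train_toy)
iter_test = iter_pipe.transform(X_test_toy)

print('NaNs left (simple):', int(np.isnan(simple_test).sum()))
print('NaNs left (iterative):', int(np.isnan(iter_test).sum()))
```

Fitting only on the training rows matters for the iterative imputer, which learns a regression model per column; a constant fill learns nothing, so it is unaffected either way.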


In [32]:
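# Constant fill (999) for every NaN; SimpleImputer ignores the `target` argument
transformed_final = SI_imputer.fit_transform(dropped_final, target)
# Rebuilding a dataframe with the original column names
transformed_final_df = pd.DataFrame(transformed_final, columns=dropped_final.columns)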

| | TOTKIDS_R | TENURE | HHLANGUAGE | SC_AGE_YEARS | SC_SEX | K2Q35A_1_YEARS | BIRTH_MO | BIRTH_YR | MOMAGE | K6Q41R_STILL | ... | BIRTH_YR_F | BMICLASS | HHCOUNT_IF | FPL_I1 | FPL_I2 | FPL_I3 | FPL_I4 | FPL_I5 | FPL_I6 | FWC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 1.0 | 3.0 | 1.0 | 999.0 | 8.0 | 2017.0 | 26.0 | 2.0 | ... | 0.0 | 999.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 3296.080092 |
| 1 | 1.0 | 1.0 | 1.0 | 14.0 | 2.0 | 999.0 | 2.0 | 2006.0 | 31.0 | 999.0 | ... | 0.0 | 2.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 2888.545330 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 999.0 | 10.0 | 2018.0 | 28.0 | 2.0 | ... | 0.0 | 999.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 1016.682730 |
| 3 | 2.0 | 1.0 | 1.0 | 15.0 | 2.0 | 999.0 | 10.0 | 2004.0 | 29.0 | 999.0 | ... | 0.0 | 2.0 | 0.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 143.0 | 1042.091065 |
| 4 | 2.0 | 2.0 | 1.0 | 16.0 | 2.0 | 999.0 | 8.0 | 2004.0 | 24.0 | 999.0 | ... | 0.0 | 3.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 400.0 | 402.372392 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 42772 | 4.0 | 1.0 | 1.0 | 13.0 | 1.0 | 7.0 | 9.0 | 2006.0 | 21.0 | 999.0 | ... | 0.0 | 2.0 | 0.0 | 187.0 | 187.0 | 187.0 | 187.0 | 187.0 | 187.0 | 6106.727992 |
| 42773 | 1.0 | 1.0 | 1.0 | 7.0 | 1.0 | 999.0 | 9.0 | 2012.0 | 38.0 | 999.0 | ... | 0.0 | 999.0 | 0.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 219.975635 |
| 42774 | 1.0 | 1.0 | 1.0 | 14.0 | 2.0 | 5.0 | 1.0 | 2006.0 | 39.0 | 999.0 | ... | 0.0 | 4.0 | 0.0 | 314.0 | 400.0 | 126.0 | 331.0 | 110.0 | 133.0 | 217.557552 |
| 42775 | 1.0 | 1.0 | 1.0 | 10.0 | 2.0 | 999.0 | 7.0 | 2010.0 | 30.0 | 999.0 | ... | 0.0 | 2.0 | 0.0 | 400.0 | 400.0 | 400.0 | 400.0 | 136.0 | 400.0 | 110.289068 |
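
The imputer returns a plain NumPy array, so a DataFrame rebuilt from it gets a fresh 0..n-1 index unless one is passed in. If rows were dropped beforehand, that new index no longer lines up with a target Series that kept the original labels. A sketch of carrying the index over, on a toy frame with gaps in its index (all values made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame whose index has gaps, as happens after dropping rows
toy = pd.DataFrame(
    {'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 6.0]},
    index=[0, 2, 5],
)

imputer = SimpleImputer(strategy='constant', fill_value=999)

# Passing index=toy.index keeps the original row labels
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(filled)
```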
