# Data Cleaning Notebook

## Notebook Summary
In this Notebook:
- Remove all columns related to ADHD
- Handle all NaN values present in the dataset
- Trim columns that I won't be using
- Perform a train test split on the cleaned data

In [1]:
# Relevant imports
import pandas as pd


In [2]:
# Loading data into a pandas dataframe
nsch = pd.read_sas('../Data/nsch_2020_topical_SAS/nsch_2020_topical.sas7bdat')

# Visually checking successful loading of dataframe
nsch.head()


Unnamed: 0,FIPSST,STRATUM,HHID,FORMTYPE,TOTKIDS_R,TENURE,HHLANGUAGE,SC_AGE_YEARS,SC_SEX,K2Q35A_1_YEARS,...,BIRTH_YR_F,BMICLASS,HHCOUNT_IF,FPL_I1,FPL_I2,FPL_I3,FPL_I4,FPL_I5,FPL_I6,FWC
0,b'17',b'1',b'20000003',b'T1',2.0,1.0,1.0,3.0,1.0,,...,0.0,,0.0,400.0,400.0,400.0,400.0,400.0,400.0,3296.080092
1,b'29',b'2A',b'20000004',b'T3',1.0,1.0,1.0,14.0,2.0,,...,0.0,2.0,0.0,400.0,400.0,400.0,400.0,400.0,400.0,2888.54533
2,b'47',b'1',b'20000005',b'T1',1.0,1.0,1.0,1.0,2.0,,...,0.0,,0.0,400.0,400.0,400.0,400.0,400.0,400.0,1016.68273
3,b'28',b'1',b'20000014',b'T3',2.0,1.0,1.0,15.0,2.0,,...,0.0,2.0,0.0,143.0,143.0,143.0,143.0,143.0,143.0,1042.091065
4,b'55',b'1',b'20000015',b'T3',2.0,2.0,1.0,16.0,2.0,,...,0.0,3.0,0.0,400.0,400.0,400.0,400.0,400.0,400.0,402.372392


## Removing Columns

### Columns Related to ADHD
Along with our target column, 'K2Q31A', there are multiple other columns that relate to ADHD. Because these columns are so closely tied to an ADHD diagnosis, they would leak the answer to the model, so they need to be removed before it makes predictions. Column descriptions can be found in the [EDA notebook](https://github.com/austint1121/Undiagnosed-ADHD-Identification/blob/main/Notebooks/EDA.ipynb).

In [4]:
# Creating list of columns to be dropped
related_ADHD = [
    'K2Q31A',
    'K2Q31B',
    'K2Q31C',
    'K2Q31D',
    'K4Q23',
    'SC_K2Q10',
    'SC_K2Q11',
    'SC_K2Q12',
    'ADDTREAT',
    'SC_CSHCN',
    'SC_K2Q22',
    'K4Q22_R',
    'K6Q15',
    'SC_K2Q20',
    'K4Q36',
    'TOTNONSHCN',
    'K4Q28X04',
]

In [18]:
# Creating new dataframe without columns from above
dropped_adhd = nsch.drop(columns=related_ADHD)

# Confirming expected results
dropped_adhd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42777 entries, 0 to 42776
Columns: 426 entries, FIPSST to FWC
dtypes: float64(422), object(4)
memory usage: 139.0+ MB
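`DataFrame.drop` raises a `KeyError` when a listed column is missing, so a slightly more defensive variant of the step above could look like the sketch below. The helper name `drop_existing` and the toy dataframe are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd


def drop_existing(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop the listed columns, ignoring duplicates and reporting missing names."""
    unique = list(dict.fromkeys(columns))  # de-duplicate while keeping order
    missing = [c for c in unique if c not in df.columns]
    if missing:
        print(f"Not found in dataframe: {missing}")
    present = [c for c in unique if c in df.columns]
    return df.drop(columns=present)


# Toy stand-in for the survey data (values are made up)
toy = pd.DataFrame(
    {"K2Q31A": [1.0, 2.0], "ADDTREAT": [1.0, None], "SC_SEX": [1.0, 2.0]}
)
result = drop_existing(toy, ["K2Q31A", "ADDTREAT", "K2Q31A", "NOT_A_COLUMN"])
print(result.columns.tolist())
```

Reporting the missing names instead of failing makes typos in the column list visible without stopping the notebook.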


### Object Columns

In [6]:
# Checking the column types
dropped_adhd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42777 entries, 0 to 42776
Columns: 426 entries, FIPSST to FWC
dtypes: float64(422), object(4)
memory usage: 139.0+ MB


In [11]:
# We have 422 float64 and 4 object types. Let's investigate those 4 objects
dropped_adhd.select_dtypes('object')

Unnamed: 0,FIPSST,STRATUM,HHID,FORMTYPE
0,b'17',b'1',b'20000003',b'T1'
1,b'29',b'2A',b'20000004',b'T3'
2,b'47',b'1',b'20000005',b'T1'
3,b'28',b'1',b'20000014',b'T3'
4,b'55',b'1',b'20000015',b'T3'
...,...,...,...,...
42772,b'26',b'1',b'20239975',b'T3'
42773,b'54',b'1',b'20239979',b'T2'
42774,b'54',b'1',b'20239980',b'T3'
42775,b'15',b'1',b'20239994',b'T2'


- **FIPSST** - State FIPS code
- **STRATUM** - Sampling stratum
- **HHID** - Unique household ID
- **FORMTYPE** - A proxy for age; kids are given a form based on age ranges (T1: 0-5, T2: 6-11, T3: 12-17)

All of these columns can be dropped, as they either should not have an effect on whether a child has ADHD or are a proxy for an already present variable.
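One way to sanity-check the claim that FORMTYPE is only a proxy for age is to compare it against SC_AGE_YEARS. The sketch below uses a made-up toy dataframe rather than the real survey, and it assumes the text columns arrive as byte strings, as in the output above.

```python
import pandas as pd

# Toy rows standing in for the survey (values are made up); the byte strings
# mimic how the object columns appear in the dataframe above.
toy = pd.DataFrame(
    {
        "FORMTYPE": [b"T1", b"T1", b"T2", b"T2", b"T3", b"T3"],
        "SC_AGE_YEARS": [0.0, 5.0, 6.0, 11.0, 12.0, 17.0],
    }
)

# Decode the byte strings so the form labels are ordinary text
toy["FORMTYPE"] = toy["FORMTYPE"].str.decode("utf-8")

# Age range actually observed for each form type
age_ranges = toy.groupby("FORMTYPE")["SC_AGE_YEARS"].agg(["min", "max"])
print(age_ranges)
```

If the observed ranges do not overlap, the form type carries no information beyond the age column.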

In [17]:
# Dropping object columns
dropped_final = dropped_adhd.drop(columns=['FIPSST', 'STRATUM', 'HHID', 'FORMTYPE'])
# Confirming expected column count
dropped_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42777 entries, 0 to 42776
Columns: 422 entries, TOTKIDS_R to FWC
dtypes: float64(422)
memory usage: 137.7 MB


## Handling Missing Values
There are multiple strategies for handling missing values. Normally I would love to use scikit-learn's experimental [Iterative Imputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html). However, in this survey some questions are sub-questions of others and can be left blank as a result of the answer to the parent question.
<br>
An example of this is the "ADDTREAT" column we dropped earlier. This column is left blank if "K2Q31A" (our target) is answered "No". An iterative imputer would attempt to fill in this column, and while there is a chance that the imputer would figure out the pattern, verifying that would take more time and effort than I currently have.
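The skip pattern described above can be separated from other missing values with a simple mask. This is a minimal sketch on made-up toy data, and it assumes the answers are coded 1.0 for "Yes" and 2.0 for "No"; check that assumption against the survey codebook before using it on the real dataframe.

```python
import numpy as np
import pandas as pd

# Toy data (made up). Assumed coding: 1.0 = "Yes", 2.0 = "No".
toy = pd.DataFrame(
    {
        "K2Q31A": [1.0, 2.0, 2.0, 1.0, np.nan],
        "ADDTREAT": [1.0, np.nan, np.nan, np.nan, np.nan],
    }
)

parent_no = toy["K2Q31A"] == 2.0
child_missing = toy["ADDTREAT"].isna()

# Missing because the sub-question was skipped by design
structural = int((parent_no & child_missing).sum())
# Missing even though the parent answer did not rule the question out
unexplained = int((~parent_no & child_missing).sum())

print(f"structural: {structural}, unexplained: {unexplained}")
```

Only the unexplained count reflects values that an imputer could reasonably be asked to fill; the structural ones are blank by design.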