# FSM: First Simple Model 

# Overview

# Business Problem

# Data Understanding

Due to the size of the dataset, I cannot push it directly to my online repository. The **dataset can be accessed here**:

- [2022 BRFSS Survey Data and Documentation](https://www.cdc.gov/brfss/annual_data/annual_2022.html)

The **codebook** can be accessed here:

- [LLCP 2022: Codebook Report](/Users/emmascotson/Documents/capstone_flatiron/data/Codebook_Report.pdf)

# Data Preparation

# Modeling

# Evaluation

In [4]:
import pandas as pd

In [5]:
import os
print(os.getcwd())
# List files in the current directory
print(os.listdir('.'))

/Users/emmascotson/Documents/capstone_flatiron/notebooks
['Capstone.ipynb', '.ipynb_checkpoints']


In [9]:
# Specify the full path to the XPT file
file_path = '/Users/emmascotson/Documents/capstone_flatiron/data/diabetes.xpt'

# Attempt to read the XPT file
try:
    data = pd.read_sas(file_path, format='xport')
    print(data.head())
except FileNotFoundError as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

   _STATE  FMONTH        IDATE IMONTH   IDAY    IYEAR  DISPCODE  \
0     1.0     1.0  b'02032022'  b'02'  b'03'  b'2022'    1100.0   
1     1.0     1.0  b'02042022'  b'02'  b'04'  b'2022'    1100.0   
2     1.0     1.0  b'02022022'  b'02'  b'02'  b'2022'    1100.0   
3     1.0     1.0  b'02032022'  b'02'  b'03'  b'2022'    1100.0   
4     1.0     1.0  b'02022022'  b'02'  b'02'  b'2022'    1100.0   

           SEQNO          _PSU  CTELENM1  ...  _SMOKGRP  _LCSREC  DRNKANY6  \
0  b'2022000001'  2.022000e+09       1.0  ...       4.0      NaN       2.0   
1  b'2022000002'  2.022000e+09       1.0  ...       4.0      NaN       2.0   
2  b'2022000003'  2.022000e+09       1.0  ...       4.0      NaN       2.0   
3  b'2022000004'  2.022000e+09       1.0  ...       3.0      2.0       2.0   
4  b'2022000005'  2.022000e+09       1.0  ...       4.0      NaN       1.0   

       DROCDY4_  _RFBING6      _DRNKWK2  _RFDRHV8  _FLSHOT7  _PNEUMO3  \
0  5.397605e-79       1.0  5.397605e-79       1.0      

In [10]:
data.head()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_SMOKGRP,_LCSREC,DRNKANY6,DROCDY4_,_RFBING6,_DRNKWK2,_RFDRHV8,_FLSHOT7,_PNEUMO3,_AIDTST4
0,1.0,1.0,b'02032022',b'02',b'03',b'2022',1100.0,b'2022000001',2022000000.0,1.0,...,4.0,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.0,2.0,2.0
1,1.0,1.0,b'02042022',b'02',b'04',b'2022',1100.0,b'2022000002',2022000000.0,1.0,...,4.0,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,2.0,2.0,2.0
2,1.0,1.0,b'02022022',b'02',b'02',b'2022',1100.0,b'2022000003',2022000000.0,1.0,...,4.0,,2.0,5.397605e-79,1.0,5.397605e-79,1.0,,,2.0
3,1.0,1.0,b'02032022',b'02',b'03',b'2022',1100.0,b'2022000004',2022000000.0,1.0,...,3.0,2.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,9.0,9.0,2.0
4,1.0,1.0,b'02022022',b'02',b'02',b'2022',1100.0,b'2022000005',2022000000.0,1.0,...,4.0,,1.0,10.0,1.0,140.0,1.0,,,2.0


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Columns: 328 entries, _STATE to _AIDTST4
dtypes: float64(323), object(5)
memory usage: 1.1+ GB
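
The `b'...'` values in the previews above are Python `bytes` objects, which is how the five `object` columns came out of the XPT file. Below is a minimal sketch of one way to decode them into plain strings. The helper name `decode_bytes_columns` and the toy frame `raw` are made up for illustration, and UTF-8 is an assumed encoding, not something the BRFSS documentation is quoted on here.

```python
import pandas as pd

# Toy stand-in for the raw BRFSS frame (the real one comes from pd.read_sas).
raw = pd.DataFrame({
    "IDATE": [b"02032022", b"02042022"],
    "IMONTH": [b"02", b"02"],
    "_STATE": [1.0, 1.0],
})

def decode_bytes_columns(df: pd.DataFrame, encoding: str = "utf-8") -> pd.DataFrame:
    """Return a copy in which bytes values in object columns are decoded to str.

    Hypothetical helper; the encoding default is an assumption.
    """
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(
            lambda v: v.decode(encoding) if isinstance(v, bytes) else v
        )
    return out

decoded = decode_bytes_columns(raw)
print(decoded.head())
```

`pd.read_sas` also accepts an `encoding` argument, so passing it at load time may make a separate decoding step unnecessary.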


# Filtering for Genetic/Involuntary 

I'll keep a few of the behavioral categories that have clear, severe impacts on physical health, such as smoking and drinking habits, so I can better compare whether someone's symptoms might stem from disease-prone genetics inherited from their parents or from smoking a pack of cigarettes a day for the past 40 years.

Since I'll be adding many other behavioral columns later for the larger model, I'll include them sparingly in this preliminary model.

In [19]:
columns_to_keep = ['CADULT1', 'CELLSEX1', 'CSTATE1', 'SEXVAR', 'GENHLTH', 'PHYSHLTH', 'PRIMINSR', 
                   'PERSDOC3', 'MEDCOST1', 'RMVTETH4', 'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'ASTHMA3',
                   'ASTHNOW', 'CHCSCNC1', 'CHCOCNC1', 'PREGNANT', 'WEIGHT2', 'HEIGHT3', 'DEAF',
                   'BLIND', 'DECIDE', 'DIFFWALK', 'CERVSCRN', 'SMOKE100', 'SMOKDAY2', 'LCSCTSC1',
                   '_STATE', 'LCSSCNCR', 'PDIABTS1', 'PREDIAB2', 'DIABETE4', 'DIABTYPE', 'INSULIN1',
                   'CHKHEMO3', 'EYEEXAM1', 'DIABEYE1', 'FEETSORE', 'TOLDCFS', 'HAVECFS', 'COPDCOGH',
                   'COPDFLEM', 'COPDBTST', 'CNCRDIFF', 'CNCRAGE', 'CNCRTYP2', 'CSRVDOC1', 'CSRVSUM',
                   'CSRVRTRN', 'PSATEST1', 'CIMEMLOS', 'CDDISCUS', 'CAREGIV1', 'CRGVREL4', 'CRGVPRB3',
                   'CRGVALZD', 'LSATISFY', 'ASBIRDUC', 'BIRTHSEX', 'TRNSGNDR', 'RRCLASS3', 'RRHCARE4',
                   'RRPHYSM2', '_METSTAT', '_URBSTAT', 'MSCODE', '_IMPRACE', '_CHISPNC', '_RFHLTH',
                   '_PHYS14D', '_HLTHPLN', '_HCVU652', '_MICHD', '_LTASTH1', '_CASTHM1', '_ASTHMS1',
                   '_DRDXAR2', '_MRACE2', '_HISPANC', '_RACE1', '_RACEG22', '_RACEGR4', '_SEX', '_AGEG5YR',
                   '_AGE65YR', '_AGE80', '_AGE_G', 'HTIN4', 'WTKG3', '_BMI5', '_BMI5CAT', '_RFBMI5', 
                   '_INCOMG1', '_SMOKER3', '_YRSSMOK', '_SMOKGRP', '_LCSREC', '_RFBING6', '_RFDRHV8']

In [20]:
# Creating new filtered dataframe for First Simple Model

fsm = data[columns_to_keep].copy()  # copy so later edits to fsm don't trigger SettingWithCopyWarning

In [21]:
fsm.head()

Unnamed: 0,CADULT1,CELLSEX1,CSTATE1,SEXVAR,GENHLTH,PHYSHLTH,PRIMINSR,PERSDOC3,MEDCOST1,RMVTETH4,...,_BMI5,_BMI5CAT,_RFBMI5,_INCOMG1,_SMOKER3,_YRSSMOK,_SMOKGRP,_LCSREC,_RFBING6,_RFDRHV8
0,,,,2.0,2.0,88.0,99.0,1.0,2.0,,...,,,9.0,9.0,4.0,,4.0,,1.0,1.0
1,,,,2.0,1.0,88.0,3.0,2.0,2.0,,...,2657.0,3.0,2.0,3.0,4.0,,4.0,,1.0,1.0
2,,,,2.0,2.0,2.0,1.0,1.0,2.0,,...,2561.0,3.0,2.0,6.0,4.0,,4.0,,1.0,1.0
3,,,,2.0,1.0,88.0,99.0,1.0,2.0,,...,2330.0,2.0,1.0,9.0,2.0,56.0,3.0,2.0,1.0,1.0
4,,,,2.0,4.0,2.0,7.0,2.0,2.0,,...,2177.0,2.0,1.0,3.0,4.0,,4.0,,1.0,1.0


In [22]:
fsm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 100 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   CADULT1   349080 non-null  float64
 1   CELLSEX1  349079 non-null  float64
 2   CSTATE1   349072 non-null  float64
 3   SEXVAR    445132 non-null  float64
 4   GENHLTH   445129 non-null  float64
 5   PHYSHLTH  445127 non-null  float64
 6   PRIMINSR  445128 non-null  float64
 7   PERSDOC3  445130 non-null  float64
 8   MEDCOST1  445128 non-null  float64
 9   RMVTETH4  443769 non-null  float64
 10  CVDINFR4  445128 non-null  float64
 11  CVDCRHD4  445130 non-null  float64
 12  CVDSTRK3  445130 non-null  float64
 13  ASTHMA3   445130 non-null  float64
 14  ASTHNOW   66694 non-null   float64
 15  CHCSCNC1  445130 non-null  float64
 16  CHCOCNC1  445129 non-null  float64
 17  PREGNANT  79018 non-null   float64
 18  WEIGHT2   429231 non-null  float64
 19  HEIGHT3   428077 non-null  float64
 20  DEA

# Nulls

Huh! It seems like some columns are missing values entirely. Let's re-print the columns in ascending order of non-null counts so we can check them against the Codebook and confirm whether the nulls are accurate.

In [23]:
# Get non-null counts for each column
non_null_counts = fsm.notnull().sum()

# Sort columns by non-null counts in ascending order
sorted_columns = non_null_counts.sort_values().index

# Reorder DataFrame columns based on sorted order
fsm_sorted = fsm[sorted_columns]

# Print info of the sorted DataFrame (info() prints directly and returns None)
fsm_sorted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 100 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   HAVECFS   0 non-null       float64
 1   TOLDCFS   0 non-null       float64
 2   PSATEST1  7180 non-null    float64
 3   COPDBTST  7232 non-null    float64
 4   COPDFLEM  7262 non-null    float64
 5   COPDCOGH  7269 non-null    float64
 6   CDDISCUS  7400 non-null    float64
 7   CSRVRTRN  9771 non-null    float64
 8   CSRVSUM   9783 non-null    float64
 9   CSRVDOC1  9807 non-null    float64
 10  FEETSORE  12600 non-null   float64
 11  DIABEYE1  12600 non-null   float64
 12  EYEEXAM1  12600 non-null   float64
 13  CHKHEMO3  12600 non-null   float64
 14  INSULIN1  12600 non-null   float64
 15  DIABTYPE  12600 non-null   float64
 16  CRGVALZD  17300 non-null   float64
 17  CRGVPRB3  19472 non-null   float64
 18  CRGVREL4  19634 non-null   float64
 19  CNCRTYP2  22544 non-null   float64
 20  CNC

##### HAVECFS and TOLDCFS

The codebook confirms that the data is completely missing for all rows in these columns, which indicate a person's status with regards to having or being told they have **Chronic Fatigue Syndrome (CFS)** or **Myalgic Encephalomyelitis**. 
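
Since these two columns carry no information, one option is to drop every all-null column before modeling. Here is a minimal sketch on a toy frame. `toy`, `empty_cols`, and `toy_trimmed` are illustrative names, and the same pattern would apply to `fsm`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for fsm: two columns that are entirely null, one that is not.
toy = pd.DataFrame({
    "HAVECFS": [np.nan, np.nan, np.nan],
    "TOLDCFS": [np.nan, np.nan, np.nan],
    "DIABETE4": [3.0, 1.0, 3.0],
})

# Columns where every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop them; equivalent to toy.dropna(axis=1, how="all")
toy_trimmed = toy.drop(columns=empty_cols)

print(empty_cols)
print(toy_trimmed.columns.tolist())
```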

##### PSATEST1

Survey Question: *Have you ever had a P.S.A. test?*

*PSA test*: measures the amount of PSA in a man's blood to help screen for and monitor prostate cancer.

According to Cleveland Clinic, PSA tests are very common, which leaves us to wonder why so many of these values are null. Respondent sex alone cannot explain it: only 7,180 rows are non-null, certainly far fewer than the number of men surveyed.
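
One way to check this directly is to compare the number of male respondents with the number of them who have a `PSATEST1` answer. The sketch below uses a toy frame, and it assumes that `_SEX == 1` codes male respondents, which should be verified against the codebook before relying on it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for fsm with only the two columns needed for the check.
toy = pd.DataFrame({
    "_SEX": [1.0, 1.0, 2.0, 1.0, 2.0],
    "PSATEST1": [1.0, np.nan, np.nan, 2.0, np.nan],
})

MALE_CODE = 1.0  # assumption: verify this code in the codebook

is_male = toy["_SEX"] == MALE_CODE
n_male = int(is_male.sum())
n_answered = int(toy.loc[is_male, "PSATEST1"].notna().sum())
share_answered = n_answered / n_male

print(f"{n_answered} of {n_male} male respondents have a PSATEST1 value "
      f"({share_answered:.1%})")
```

A low share on the real data would suggest that the question was only asked of a subset of respondents, rather than that the values were lost.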


##### COPDCOGH AND COPDFLEM

Survey Question: *During the past three months, did you have a cough and/or cough up phlegm or mucus on most days?*

These columns might be more of the **hindsight bias** symptoms I was referring to earlier, rather than prior warning signs. It's not a big deal if we have a lot of nulls for these, since we have other genetic and physical factors that will be far more important to our model.

##### CDDISCUS

Survey Question: Have you or anyone else discussed your confusion or memory loss with a health care professional?

Again, not one of our most important features, so no need to stress that there are a lot of Nulls. 

An interesting takeaway from the Codebook Report with regards to this feature: the **non**-missing values seem to be fairly balanced between our two main classes, with 3,432 'Yes' and 3,862 'No'.
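
As a quick check on how balanced those two classes are, the shares can be computed from the two codebook counts quoted above. This is plain arithmetic on those numbers and does not touch the dataset.

```python
# Codebook counts for CDDISCUS quoted in the text above
yes_count = 3432
no_count = 3862

total = yes_count + no_count
share_yes = yes_count / total
share_no = no_count / total

print(f"Yes: {share_yes:.1%}, No: {share_no:.1%}")
```

The split works out to roughly 47% 'Yes' to 53% 'No' among respondents who gave one of the two answers.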

##### CSRVRTRN, CSRVSUM

Survey Question CSRVRTRN: Have you ever received instructions from a doctor, nurse, or other health professional about where you should return or who you should see for routine cancer check-ups after completing treatment for cancer?

Survey Question CSRVSUM: Did any doctor, nurse, or other health professional ever give you a written summary of all the cancer treatments that you received?

These are sort of behavioral questions, as they bleed into the question of whether or not people choose to incorporate doctor's visits into their lifestyle routines. I thought they might be helpful in isolating whether people had **reasons** to go to the doctor (due to underlying genetic conditions), but there are other features we have that are more suitable for this anyway. 

##### CSRVDOC1 

Survey Question: What type of doctor provides the majority of your health care? Is it a….

This could still be useful, even with nulls, for identifying any specialized, possibly genetic, reasons a person requires medical care from a particular kind of doctor. But again, it's not especially important compared to other features, so it's okay that we have mostly nulls.

## Data Exploration

The rest of our columns have over 10,000 non-null values. To save time, I'll keep moving for now and examine these as I go if need be.

# Target Variable: DIABETE4

**Target Survey Question**: (Ever told) (you had) diabetes? (If 'Yes' and respondent is female, ask 'Was this only when you were pregnant?'. If respondent says pre-diabetes or borderline diabetes, use response code 4.)

**Target Answer Values**:

- 1: Yes
- 2: Yes, but female told only during pregnancy
- 3: No
- 4: No, pre-diabetes or borderline diabetes
- 7: Don’t know/Not Sure
- 9: Refused
- BLANK: Not asked or Missing

445129 non-null values, 3 missing values

Phew! Our target variable has barely any missing values. And looking at the *frequency values* in the Codebook, there are only 1,084 'Don't know/Not Sure' and 'Refused' values combined. The more rows with a usable target value, the better.

I'll print the frequencies of the other values shortly. 
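
Because codes 7 and 9 are non-answers rather than real classes, one possible treatment is to recode them as missing before counting usable target rows. The sketch below uses a toy column. The names `NON_ANSWERS` and `target` are illustrative, and treating these codes as missing is a modeling choice, not something the codebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for fsm["DIABETE4"], including non-answers and one blank.
toy = pd.DataFrame({"DIABETE4": [3.0, 1.0, 7.0, 9.0, 4.0, np.nan, 2.0, 3.0]})

NON_ANSWERS = [7.0, 9.0]  # Don't know/Not Sure, Refused

# mask() sets values to NaN wherever the condition is True
target = toy["DIABETE4"].mask(toy["DIABETE4"].isin(NON_ANSWERS))

n_unusable = int(target.isna().sum())
counts = target.value_counts()

print(f"Unusable target rows: {n_unusable}")
print(counts)
```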

### Columns to Contextualize Target: PDIABTS1, PREDIAB2, DIABTYPE

##### PDIABTS1

Survey Question: When was the last time you had a blood test for high blood sugar or diabetes by a doctor, nurse, or other health professional?

140,248 non-null, 304,884 missing values

##### PREDIAB2

Survey Question: Has a doctor or other health professional ever told you that you had prediabetes or borderline diabetes? (If “Yes” and respondent is female, ask: “Was this only when you were pregnan

140,222 non-null, 304,910 missing values

##### DIABTYPE

Survey Question: According to your doctor or other health professional, what type of diabetes do you have?

Answer Values: *Type 1, Type 2, Don't know/Not Sure, Refused, Not asked or Missing*

This column could be a good way to expand the target variable with more specificity as I move on to larger, more complex models after this FSM. If I have time, I can start to develop a chained model that not only predicts whether someone develops diabetes, but also predicts which type they will progress to. Otherwise, I can use it to contextualize my findings.

12,600 non-null, 432,532 missing values

In [26]:
fsm['DIABETE4'].value_counts(normalize=True)

DIABETE4
3.0    0.828349
1.0    0.137394
4.0    0.023205
2.0    0.008618
7.0    0.001714
9.0    0.000721
Name: proportion, dtype: float64

#### Imbalanced Target

82.83% of the target variable is 'No', and only 13.74% of the target variable is 'Yes'. 

I'll have to keep this in mind when building the model!
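
One common way to account for this kind of imbalance is to weight each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. This is the heuristic that scikit-learn documents for `class_weight='balanced'`. The sketch below computes those weights on a toy column with an 80/20 split. It illustrates the idea only and is not a decision about how the final model will handle imbalance.

```python
import pandas as pd

# Toy stand-in for the target: 8 'No' (3.0) and 2 'Yes' (1.0) responses.
toy = pd.DataFrame({"DIABETE4": [3.0] * 8 + [1.0] * 2})

counts = toy["DIABETE4"].value_counts()
n_samples = len(toy)
n_classes = len(counts)

# Balanced weights: rarer classes receive proportionally larger weights
weights = {
    float(cls): n_samples / (n_classes * int(n))
    for cls, n in counts.items()
}

print(weights)
```

With these weights, each class contributes the same total weight to a weighted loss, so the minority 'Yes' class is not drowned out by the majority.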