# Diabetes Prediction using demographic data and body measurements

# About our data

The [National Health and Nutrition Examination Survey (NHANES)](https://www.cdc.gov/Nchs/Nhanes/about_nhanes.htm) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

The NHANES program began in the early 1960s and has been conducted as a series of surveys focusing on different population groups or health topics. In 1999, the survey became a continuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey examines a nationally representative sample of about 5,000 persons each year. These persons are located in counties across the country, 15 of which are visited each year.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

To date, [thousands of research findings have been published using the NHANES data.](https://www.ncbi.nlm.nih.gov/pubmed?orig_db=PubMed&term=NHANES&cmd=search)

# Content

The 2013-2014 NHANES datasets include the following components:

**1. Demographics dataset:**

A complete variable dictionary can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Demographics&CycleBeginYear=2013)

**2. Examinations dataset, which contains:**

Blood pressure

Body measures

Muscle strength - grip test

Oral health - dentition

Taste & smell

A complete variable dictionary can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Examination&CycleBeginYear=2013)

**3. Dietary data - total nutrient intake, first day:**

A complete variable dictionary can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Dietary&CycleBeginYear=2013)

**4. Laboratory dataset, which includes:**

Albumin & Creatinine - Urine

Apolipoprotein B

Blood Lead, Cadmium, Total Mercury, Selenium, and Manganese

Blood mercury: inorganic, ethyl and methyl

Cholesterol - HDL

Cholesterol - LDL & Triglycerides

Cholesterol - Total

Complete Blood Count with 5-part Differential - Whole Blood

Copper, Selenium & Zinc - Serum

Fasting Questionnaire

Fluoride - Plasma

Fluoride - Water

Glycohemoglobin

Hepatitis A

Hepatitis B Surface Antibody

Hepatitis B: core antibody, surface antigen, and Hepatitis D antibody

Hepatitis C RNA (HCV-RNA) and Hepatitis C Genotype

Hepatitis E: IgG & IgM Antibodies

Herpes Simplex Virus Type-1 & Type-2

HIV Antibody Test

Human Papillomavirus (HPV) - Oral Rinse

Human Papillomavirus (HPV) DNA - Vaginal Swab: Roche Cobas & Roche Linear Array

Human Papillomavirus (HPV) DNA Results from Penile Swab Samples: Roche Linear Array

Insulin

Iodine - Urine

Perchlorate, Nitrate & Thiocyanate - Urine

Perfluoroalkyl and Polyfluoroalkyl Substances (formerly Polyfluoroalkyl Chemicals - PFC)

Personal Care and Consumer Product Chemicals and Metabolites

Phthalates and Plasticizers Metabolites - Urine

Plasma Fasting Glucose

Polycyclic Aromatic Hydrocarbons (PAH) - Urine

Standard Biochemistry Profile

Tissue Transglutaminase Assay (IgA-TTG) & IgA Endomyseal Antibody Assay (IgA EMA)

Trichomonas - Urine

Two-hour Oral Glucose Tolerance Test

Urinary Chlamydia

Urinary Mercury

Urinary Speciated Arsenics

Urinary Total Arsenic

Urine Flow Rate

Urine Metals

Urine Pregnancy Test

Vitamin B12

A complete variable dictionary can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Laboratory&CycleBeginYear=2013)

**5. Questionnaire dataset, which includes information on:**

Acculturation

Alcohol Use

Blood Pressure & Cholesterol

Cardiovascular Health

Consumer Behavior

Current Health Status

Dermatology

Diabetes

Diet Behavior & Nutrition

Disability

Drug Use

Early Childhood

Food Security

Health Insurance

Hepatitis

Hospital Utilization & Access to Care

Housing Characteristics

Immunization

Income

Medical Conditions

Mental Health - Depression Screener

Occupation

Oral Health

Osteoporosis

Pesticide Use

Physical Activity

Physical Functioning

Preventive Aspirin Use

Reproductive Health

Sexual Behavior

Sleep Disorders

Smoking - Cigarette Use

Smoking - Household Smokers

Smoking - Recent Tobacco Use

Smoking - Secondhand Smoke Exposure

Taste & Smell

Weight History

Weight History - Youth

A complete variable dictionary can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Questionnaire&CycleBeginYear=2013)


In [1]:
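import pandas as pd

# Load the five NHANES 2013-2014 component files.
df1 = pd.read_csv("labs.csv")           # laboratory results
df2 = pd.read_csv("examination.csv")    # physical examination measurements
df3 = pd.read_csv("demographic.csv")    # demographics
df4 = pd.read_csv("diet.csv")           # dietary intake, first day
df5 = pd.read_csv("questionnaire.csv")  # questionnaire responses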

In [2]:
df1.columns.values

array(['SEQN', 'URXUMA', 'URXUMS', 'URXUCR.x', 'URXCRS', 'URDACT',
       'WTSAF2YR.x', 'LBXAPB', 'LBDAPBSI', 'LBXSAL', 'LBDSALSI',
       'LBXSAPSI', 'LBXSASSI', 'LBXSATSI', 'LBXSBU', 'LBDSBUSI',
       'LBXSC3SI', 'LBXSCA', 'LBDSCASI', 'LBXSCH', 'LBDSCHSI', 'LBXSCK',
       'LBXSCLSI', 'LBXSCR', 'LBDSCRSI', 'LBXSGB', 'LBDSGBSI', 'LBXSGL',
       'LBDSGLSI', 'LBXSGTSI', 'LBXSIR', 'LBDSIRSI', 'LBXSKSI',
       'LBXSLDSI', 'LBXSNASI', 'LBXSOSSI', 'LBXSPH', 'LBDSPHSI', 'LBXSTB',
       'LBDSTBSI', 'LBXSTP', 'LBDSTPSI', 'LBXSTR', 'LBDSTRSI', 'LBXSUA',
       'LBDSUASI', 'LBXWBCSI', 'LBXLYPCT', 'LBXMOPCT', 'LBXNEPCT',
       'LBXEOPCT', 'LBXBAPCT', 'LBDLYMNO', 'LBDMONO', 'LBDNENO',
       'LBDEONO', 'LBDBANO', 'LBXRBCSI', 'LBXHGB', 'LBXHCT', 'LBXMCVSI',
       'LBXMCHSI', 'LBXMC', 'LBXRDW', 'LBXPLTSI', 'LBXMPSI', 'URXUCL',
       'WTSA2YR.x', 'LBXSCU', 'LBDSCUSI', 'LBXSSE', 'LBDSSESI', 'LBXSZN',
       'LBDSZNSI', 'URXUCR.y', 'WTSB2YR.x', 'URXBP3', 'URDBP3LC',
       'URXBPH', 'URDBPHLC', 

In [3]:
df2.columns.values

array(['SEQN', 'PEASCST1', 'PEASCTM1', 'PEASCCT1', 'BPXCHR', 'BPAARM',
       'BPACSZ', 'BPXPLS', 'BPXPULS', 'BPXPTY', 'BPXML1', 'BPXSY1',
       'BPXDI1', 'BPAEN1', 'BPXSY2', 'BPXDI2', 'BPAEN2', 'BPXSY3',
       'BPXDI3', 'BPAEN3', 'BPXSY4', 'BPXDI4', 'BPAEN4', 'BMDSTATS',
       'BMXWT', 'BMIWT', 'BMXRECUM', 'BMIRECUM', 'BMXHEAD', 'BMIHEAD',
       'BMXHT', 'BMIHT', 'BMXBMI', 'BMDBMIC', 'BMXLEG', 'BMILEG',
       'BMXARML', 'BMIARML', 'BMXARMC', 'BMIARMC', 'BMXWAIST', 'BMIWAIST',
       'BMXSAD1', 'BMXSAD2', 'BMXSAD3', 'BMXSAD4', 'BMDAVSAD', 'BMDSADCM',
       'MGDEXSTS', 'MGD050', 'MGD060', 'MGQ070', 'MGQ080', 'MGQ090',
       'MGQ100', 'MGQ110', 'MGQ120', 'MGD130', 'MGQ90DG', 'MGDSEAT',
       'MGAPHAND', 'MGATHAND', 'MGXH1T1', 'MGXH1T1E', 'MGXH2T1',
       'MGXH2T1E', 'MGXH1T2', 'MGXH1T2E', 'MGXH2T2', 'MGXH2T2E',
       'MGXH1T3', 'MGXH1T3E', 'MGXH2T3', 'MGXH2T3E', 'MGDCGSZ',
       'OHDEXSTS', 'OHDDESTS', 'OHXIMP', 'OHX01TC', 'OHX02TC', 'OHX03TC',
       'OHX04TC', 'OHX05TC', 'OH

In [4]:
df3.columns.values

array(['SEQN', 'SDDSRVYR', 'RIDSTATR', 'RIAGENDR', 'RIDAGEYR', 'RIDAGEMN',
       'RIDRETH1', 'RIDRETH3', 'RIDEXMON', 'RIDEXAGM', 'DMQMILIZ',
       'DMQADFC', 'DMDBORN4', 'DMDCITZN', 'DMDYRSUS', 'DMDEDUC3',
       'DMDEDUC2', 'DMDMARTL', 'RIDEXPRG', 'SIALANG', 'SIAPROXY',
       'SIAINTRP', 'FIALANG', 'FIAPROXY', 'FIAINTRP', 'MIALANG',
       'MIAPROXY', 'MIAINTRP', 'AIALANGA', 'DMDHHSIZ', 'DMDFMSIZ',
       'DMDHHSZA', 'DMDHHSZB', 'DMDHHSZE', 'DMDHRGND', 'DMDHRAGE',
       'DMDHRBR4', 'DMDHREDU', 'DMDHRMAR', 'DMDHSEDU', 'WTINT2YR',
       'WTMEC2YR', 'SDMVPSU', 'SDMVSTRA', 'INDHHIN2', 'INDFMIN2',
       'INDFMPIR'], dtype=object)

In [5]:
df4.columns.values

array(['SEQN', 'WTDRD1', 'WTDR2D', 'DR1DRSTZ', 'DR1EXMER', 'DRABF',
       'DRDINT', 'DR1DBIH', 'DR1DAY', 'DR1LANG', 'DR1MNRSP', 'DR1HELPD',
       'DBQ095Z', 'DBD100', 'DRQSPREP', 'DR1STY', 'DR1SKY', 'DRQSDIET',
       'DRQSDT1', 'DRQSDT2', 'DRQSDT3', 'DRQSDT4', 'DRQSDT5', 'DRQSDT6',
       'DRQSDT7', 'DRQSDT8', 'DRQSDT9', 'DRQSDT10', 'DRQSDT11',
       'DRQSDT12', 'DRQSDT91', 'DR1TNUMF', 'DR1TKCAL', 'DR1TPROT',
       'DR1TCARB', 'DR1TSUGR', 'DR1TFIBE', 'DR1TTFAT', 'DR1TSFAT',
       'DR1TMFAT', 'DR1TPFAT', 'DR1TCHOL', 'DR1TATOC', 'DR1TATOA',
       'DR1TRET', 'DR1TVARA', 'DR1TACAR', 'DR1TBCAR', 'DR1TCRYP',
       'DR1TLYCO', 'DR1TLZ', 'DR1TVB1', 'DR1TVB2', 'DR1TNIAC', 'DR1TVB6',
       'DR1TFOLA', 'DR1TFA', 'DR1TFF', 'DR1TFDFE', 'DR1TCHL', 'DR1TVB12',
       'DR1TB12A', 'DR1TVC', 'DR1TVD', 'DR1TVK', 'DR1TCALC', 'DR1TPHOS',
       'DR1TMAGN', 'DR1TIRON', 'DR1TZINC', 'DR1TCOPP', 'DR1TSODI',
       'DR1TPOTA', 'DR1TSELE', 'DR1TCAFF', 'DR1TTHEO', 'DR1TALCO',
       'DR1TMOIS', 'DR1TS040

In [6]:
df5.columns.values

array(['SEQN', 'ACD011A', 'ACD011B', 'ACD011C', 'ACD040', 'ACD110',
       'ALQ101', 'ALQ110', 'ALQ120Q', 'ALQ120U', 'ALQ130', 'ALQ141Q',
       'ALQ141U', 'ALQ151', 'ALQ160', 'BPQ020', 'BPQ030', 'BPD035',
       'BPQ040A', 'BPQ050A', 'BPQ056', 'BPD058', 'BPQ059', 'BPQ080',
       'BPQ060', 'BPQ070', 'BPQ090D', 'BPQ100D', 'CBD070', 'CBD090',
       'CBD110', 'CBD120', 'CBD130', 'HSD010', 'HSQ500', 'HSQ510',
       'HSQ520', 'HSQ571', 'HSQ580', 'HSQ590', 'HSAQUEX', 'CSQ010',
       'CSQ020', 'CSQ030', 'CSQ040', 'CSQ060', 'CSQ070', 'CSQ080',
       'CSQ090A', 'CSQ090B', 'CSQ090C', 'CSQ090D', 'CSQ100', 'CSQ110',
       'CSQ120A', 'CSQ120B', 'CSQ120C', 'CSQ120D', 'CSQ120E', 'CSQ120F',
       'CSQ120G', 'CSQ120H', 'CSQ140', 'CSQ160', 'CSQ170', 'CSQ180',
       'CSQ190', 'CSQ200', 'CSQ202', 'CSQ204', 'CSQ210', 'CSQ220',
       'CSQ240', 'CSQ250', 'CSQ260', 'AUQ136', 'AUQ138', 'CDQ001',
       'CDQ002', 'CDQ003', 'CDQ004', 'CDQ005', 'CDQ006', 'CDQ009A',
       'CDQ009B', 'CDQ009C', 'CDQ009D',
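
The column listings above are long and truncated in the output, so it is hard to see by eye which names occur in more than one frame. A minimal sketch of a programmatic check follows. `shared_columns` is a hypothetical helper, not part of the original notebook, and the toy frames hold made-up values in place of the real CSV files.

```python
from collections import Counter

import pandas as pd


def shared_columns(frames):
    """Return the sorted column names that appear in more than one frame.

    `frames` maps a label to a DataFrame.
    """
    counts = Counter(col for frame in frames.values() for col in frame.columns)
    return sorted(col for col, n in counts.items() if n > 1)


# Toy stand-ins for the real frames (made-up values).
toy_frames = {
    "labs": pd.DataFrame({"SEQN": [1, 2], "LBXSAL": [4.1, 4.7]}),
    "exam": pd.DataFrame({"SEQN": [1, 2], "BMXWT": [80.0, 90.0]}),
    "demo": pd.DataFrame({"SEQN": [1, 2], "RIDAGEYR": [69, 54]}),
}

overlap = shared_columns(toy_frames)
print(overlap)
```

With the real data, the same call could be made on `{"labs": df1, "exam": df2, "demo": df3, "diet": df4, "quest": df5}`.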

In [7]:
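# Drop the respondent ID (SEQN) from every frame except df1,
# so that the combined frame carries the column only once.
df2.drop(['SEQN'], axis=1, inplace=True)
df3.drop(['SEQN'], axis=1, inplace=True)
df4.drop(['SEQN'], axis=1, inplace=True)
df5.drop(['SEQN'], axis=1, inplace=True)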

In [8]:
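# Combine the five frames column-wise, keeping only row labels present in all of them.
# Note: axis=1 concatenation aligns rows by index (here, row position), not by SEQN,
# so this assumes every file lists the same respondents in the same order.
df = pd.concat([df1, df2, df3, df4, df5], axis=1, join='inner')
df.describe()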

| | SEQN | URXUMA | URXUMS | URXUCR.x | URXCRS | URDACT | WTSAF2YR.x | LBXAPB | LBDAPBSI | LBXSAL | ... | WHD080U | WHD080L | WHD110 | WHD120 | WHD130 | WHD140 | WHQ150 | WHQ030M | WHQ500 | WHQ520 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9813.0 | 8052.0 | 8052.0 | 8052.0 | 8052.0 | 8052.0 | 3329.0 | 3145.0 | 3145.0 | 6553.0 | ... | 14.0 | 28.0 | 4036.0 | 4842.0 | 2667.0 | 5879.0 | 5800.0 | 1424.0 | 1424.0 | 1424.0 |
| mean | 78644.559971 | 41.218854 | 41.218854 | 121.072529 | 10702.811525 | 41.905695 | 78917.195254 | 85.898569 | 0.858986 | 4.282085 | ... | 35.0 | 40.0 | 413.440287 | 567.920074 | 373.831646 | 315.447355 | 574.222069 | 2.586376 | 2.295646 | 1.747893 |
| std | 2938.592266 | 238.910226 | 238.910226 | 78.574882 | 6946.019595 | 276.261093 | 71088.020067 | 25.595258 | 0.255953 | 0.343649 | ... | 0.0 | 0.0 | 1511.368399 | 1975.492188 | 1716.83115 | 1075.040013 | 7288.930842 | 0.782529 | 1.210905 | 0.7076 |
| min | 73557.0 | 0.21 | 0.21 | 5.0 | 442.0 | 0.21 | 0.0 | 20.0 | 0.2 | 2.4 | ... | 35.0 | 40.0 | 75.0 | 55.0 | 50.0 | 85.0 | 10.0 | 1.0 | 1.0 | 1.0 |
| 25% | 76092.0 | 4.5 | 4.5 | 60.0 | 5304.0 | 5.02 | 33217.405018 | 68.0 | 0.68 | 4.1 | ... | 35.0 | 40.0 | 140.0 | 125.0 | 63.0 | 155.0 | 25.0 | 3.0 | 1.0 | 1.0 |
| 50% | 78643.0 | 8.4 | 8.4 | 106.0 | 9370.4 | 7.78 | 56397.702304 | 84.0 | 0.84 | 4.3 | ... | 35.0 | 40.0 | 165.0 | 150.0 | 66.0 | 185.0 | 38.0 | 3.0 | 2.0 | 2.0 |
| 75% | 81191.0 | 17.625 | 17.625 | 163.0 | 14409.2 | 15.295 | 99356.561999 | 101.0 | 1.01 | 4.5 | ... | 35.0 | 40.0 | 198.0 | 180.0 | 70.0 | 225.0 | 53.0 | 3.0 | 3.0 | 2.0 |
| max | 83731.0 | 9600.0 | 9600.0 | 659.0 | 58255.6 | 9000.0 | 395978.465792 | 234.0 | 2.34 | 5.6 | ... | 35.0 | 40.0 | 9999.0 | 9999.0 | 9999.0 | 9999.0 | 99999.0 | 9.0 | 9.0 | 9.0 |
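
The concatenation above joins the frames column-wise by row index, so rows line up by position rather than by respondent. If the files do not list respondents in the same order, or differ in row count, a key-based merge on `SEQN` is one alternative. The sketch below is illustrative only: `merge_on_seqn` is a hypothetical helper, the toy frames hold made-up values, and it assumes `SEQN` is kept in every frame (the notebook drops it from `df2` through `df5`) and is unique within each frame.

```python
from functools import reduce

import pandas as pd


def merge_on_seqn(frames, key="SEQN", how="inner"):
    """Merge a list of DataFrames pairwise on a shared respondent key."""
    for i, frame in enumerate(frames):
        # A duplicated key would silently multiply rows during the merge.
        if frame[key].duplicated().any():
            raise ValueError(f"frame {i} has duplicate {key} values")
    return reduce(lambda left, right: left.merge(right, on=key, how=how), frames)


# Toy frames (made-up values) with different row orders and row counts.
labs = pd.DataFrame({"SEQN": [1, 2, 3], "LBXSAL": [4.1, 4.7, 3.7]})
exam = pd.DataFrame({"SEQN": [3, 1, 2, 4], "BMXWT": [70.0, 80.0, 90.0, 60.0]})

merged = merge_on_seqn([labs, exam])
print(merged)
```

In the toy data, respondent 4 appears only in `exam` and is dropped by the inner merge, while the remaining rows are matched by `SEQN` regardless of their order in each frame.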


In [9]:
df.head()

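| | SEQN | URXUMA | URXUMS | URXUCR.x | URXCRS | URDACT | WTSAF2YR.x | LBXAPB | LBDAPBSI | LBXSAL | ... | WHD080U | WHD080L | WHD110 | WHD120 | WHD130 | WHD140 | WHQ150 | WHQ030M | WHQ500 | WHQ520 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73557 | 4.3 | 4.3 | 39.0 | 3447.6 | 11.03 | NaN | NaN | NaN | 4.1 | ... | NaN | 40.0 | 270.0 | 200.0 | 69.0 | 270.0 | 62.0 | NaN | NaN | NaN |
| 1 | 73558 | 153.0 | 153.0 | 50.0 | 4420.0 | 306.0 | NaN | NaN | NaN | 4.7 | ... | NaN | NaN | 240.0 | 250.0 | 72.0 | 250.0 | 25.0 | NaN | NaN | NaN |
| 2 | 73559 | 11.9 | 11.9 | 113.0 | 9989.2 | 10.53 | 142196.890197 | 57.0 | 0.57 | 3.7 | ... | NaN | NaN | 180.0 | 190.0 | 70.0 | 228.0 | 35.0 | NaN | NaN | NaN |
| 3 | 73560 | 16.0 | 16.0 | 76.0 | 6718.4 | 21.05 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | 3.0 | 3.0 |
| 4 | 73561 | 255.0 | 255.0 | 147.0 | 12994.8 | 173.47 | 142266.006548 | 92.0 | 0.92 | 4.3 | ... | NaN | NaN | 150.0 | 135.0 | 67.0 | 170.0 | 60.0 | NaN | NaN | NaN |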
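
Many cells in the first rows are missing, and the summary statistics show maxima of 9999 for `WHD110` through `WHD140` and 99999 for `WHQ150`. Those maxima look like placeholder codes rather than real measurements, but that is an assumption here; the variable dictionaries linked above give the actual codes for each variable. The sketch below shows one way to mask such codes and then measure missingness per column. `mask_codes` and `missing_fraction` are hypothetical helpers, and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd


def mask_codes(frame, codes_by_column):
    """Return a copy in which the listed placeholder codes are replaced by NaN."""
    out = frame.copy()
    for column, codes in codes_by_column.items():
        # Keep values that are not placeholder codes; everything else becomes NaN.
        out[column] = out[column].where(~out[column].isin(codes))
    return out


def missing_fraction(frame):
    """Return the fraction of missing values per column, highest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy frame (made-up values); the codes below are assumed, not taken from the codebook.
toy = pd.DataFrame({
    "WHD110": [150.0, 9999.0, np.nan, 180.0],
    "WHQ150": [25.0, 99999.0, 35.0, 60.0],
})

cleaned = mask_codes(toy, {"WHD110": [9999], "WHQ150": [99999]})
fractions = missing_fraction(cleaned)
print(fractions)
```

Columns with a high missing fraction could then be reviewed before any modelling step.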
