# Auto Data Type
Auto data type combines heuristics and machine learning to more accurately guess the data types of structured data.

In [1]:
from data_describe.type.autotype import select_dtypes, guess_dtypes



Reading the file with `dtype=str` keeps every column as a string, so pandas' own type inference does not convert any values before the data types are guessed.

In [2]:
import pandas as pd
df = pd.read_csv("../data/er_data.csv", dtype=str)
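To see the effect of `dtype=str` in isolation, here is a minimal sketch that reads a small in-memory CSV both ways. The CSV content and the column names (`zip_code`, `flag`) are made up for illustration and are not part of `er_data.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content, used only to illustrate pandas' type inference.
csv_text = "id,zip_code,flag\n1,02134,True\n2,10001,False\n"

# Default behaviour: pandas infers a type for each column.
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str: every value is kept exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["zip_code"].tolist())  # leading zero is lost: [2134, 10001]
print(as_text["zip_code"].tolist())   # preserved: ['02134', '10001']
print(inferred.dtypes)
print(as_text.dtypes)
```

Keeping the raw strings leaves the decision about each column's type to the guessing step rather than to the CSV parser.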

In [3]:
df.head()

Unnamed: 0,readmitted,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,...,glipizide.metformin,glimepiride.pioglitazone,metformin.rosiglitazone,metformin.pioglitazone,change,diabetesMed,Unnamed: 47,diag_1_desc,diag_2_desc,diag_3_desc
0,False,Female,[50-60),?,Elective,Discharged to home,Physician Referral,1,CP,Surgery-Neuro,...,No,No,No,No,No,No,,Spinal stenosis in cervical region,Spinal stenosis in cervical region,"Effusion of joint, site unspecified"
1,False,Female,[20-30),[50-75),Urgent,Discharged to home,Physician Referral,2,UN,?,...,No,No,No,No,No,No,,"First-degree perineal laceration, unspecified ...","Diabetes mellitus of mother, complicating preg...",Sideroblastic anemia
2,True,Male,[80-90),?,Not Available,Discharged/transferred to home with home healt...,,7,MC,Family/GeneralPractice,...,No,No,No,No,No,Yes,,Pneumococcal pneumonia [Streptococcus pneumoni...,"Congestive heart failure, unspecified",Hyperosmolality and/or hypernatremia
3,False,Female,[50-60),?,Emergency,Discharged to home,Transfer from another health care facility,4,UN,?,...,No,No,No,No,No,Yes,,Cellulitis and abscess of face,Streptococcus infection in conditions classifi...,Diabetes mellitus without mention of complicat...
4,False,Female,[50-60),?,Emergency,Discharged to home,Emergency Room,5,?,Psychiatry,...,No,No,No,No,Ch,Yes,,"Bipolar I disorder, single manic episode, unsp...",Diabetes mellitus without mention of complicat...,Depressive type psychosis


In [4]:
df['glyburide.metformin'].value_counts()

No        9944
Steady      53
Up           2
Down         1
Name: glyburide.metformin, dtype: int64

## Guessing Data Types
`guess_dtypes` returns its guesses as a dictionary that maps each column name to the guessed data type.

In [5]:
guess_dtypes(df)

{'readmitted': 'Boolean',
 'gender': 'Category',
 'age': 'Category',
 'weight': 'Category',
 'admission_type_id': 'String',
 'discharge_disposition_id': 'String',
 'admission_source_id': 'String',
 'time_in_hospital': 'Integer',
 'payer_code': 'Category',
 'medical_specialty': 'Category',
 'num_lab_procedures': 'Integer',
 'num_procedures': 'Integer',
 'num_medications': 'Integer',
 'number_outpatient': 'Integer',
 'number_emergency': 'Boolean',
 'number_inpatient': 'Integer',
 'diag_1': 'Category',
 'diag_2': 'Category',
 'diag_3': 'Category',
 'number_diagnoses': 'Integer',
 'max_glu_serum': 'Category',
 'A1Cresult': 'Category',
 'metformin': 'Category',
 'repaglinide': 'Boolean',
 'nateglinide': 'Category',
 'chlorpropamide': 'Boolean',
 'glimepiride': 'Category',
 'acetohexamide': 'Boolean',
 'glipizide': 'Category',
 'glyburide': 'Category',
 'tolbutamide': 'Boolean',
 'pioglitazone': 'Category',
 'rosiglitazone': 'Category',
 'acarbose': 'Category',
 'miglitol': 'Boolean',
 'trog

### Customizing the heuristics
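`guess_dtypes` can be customized, for example to sample more of the data. With the larger sample used below, several columns that were guessed as `Boolean` (such as `repaglinide` and `miglitol`) are now guessed as `Category`, and `number_emergency` changes from `Boolean` to `Integer`.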
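One plausible reason a larger sample can turn a `Boolean` guess into `Category` is that rare values are more likely to appear in it. The sketch below illustrates this with a toy rule; `toy_guess` is a hypothetical helper and is not how `data_describe` actually decides, since the library combines heuristics with machine learning. The counts are copied from the `glyburide.metformin` output above.

```python
import pandas as pd

# Value counts copied from the `glyburide.metformin` output above.
counts = {"No": 9944, "Steady": 53, "Up": 2, "Down": 1}
column = pd.Series([value for value, n in counts.items() for _ in range(n)])


def toy_guess(values: pd.Series) -> str:
    """Hypothetical rule: two or fewer distinct values look Boolean."""
    return "Boolean" if values.nunique() <= 2 else "Category"


# Compare a small and a large random sample of the same column.
for frac in (0.01, 0.8):
    sample = column.sample(frac=frac, random_state=0)
    print(frac, sample.nunique(), toy_guess(sample))

# The full column contains four distinct values.
print("full column:", toy_guess(column))
```

A small sample can easily miss the `Up` and `Down` rows, which together make up only 3 of the 10,000 values.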

In [6]:
guess_dtypes(df, 
             strict=True,  # If False, violations of the type validation rules are allowed; this can be useful when some data errors are expected
             sample_size=0.8  # A larger sample may give more accurate guesses, at the cost of longer processing time
             )

{'readmitted': 'Boolean',
 'gender': 'Category',
 'age': 'Category',
 'weight': 'Category',
 'admission_type_id': 'String',
 'discharge_disposition_id': 'String',
 'admission_source_id': 'String',
 'time_in_hospital': 'Integer',
 'payer_code': 'Category',
 'medical_specialty': 'Category',
 'num_lab_procedures': 'Integer',
 'num_procedures': 'Integer',
 'num_medications': 'Integer',
 'number_outpatient': 'Integer',
 'number_emergency': 'Integer',
 'number_inpatient': 'Integer',
 'diag_1': 'Category',
 'diag_2': 'Category',
 'diag_3': 'Category',
 'number_diagnoses': 'Integer',
 'max_glu_serum': 'Category',
 'A1Cresult': 'Category',
 'metformin': 'Category',
 'repaglinide': 'Category',
 'nateglinide': 'Category',
 'chlorpropamide': 'Category',
 'glimepiride': 'Category',
 'acetohexamide': 'Boolean',
 'glipizide': 'Category',
 'glyburide': 'Category',
 'tolbutamide': 'Category',
 'pioglitazone': 'Category',
 'rosiglitazone': 'Category',
 'acarbose': 'Category',
 'miglitol': 'Category',
 '
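Because both results are plain dictionaries, the columns whose guess changed can be listed with a small comparison. The sketch below uses excerpts copied by hand from the two outputs above rather than calling `guess_dtypes`, so it runs without the dataset; the variable names are illustrative.

```python
# Excerpts copied from the two outputs above (not the full dictionaries).
default_guess = {
    "number_emergency": "Boolean",
    "repaglinide": "Boolean",
    "chlorpropamide": "Boolean",
    "acetohexamide": "Boolean",
    "tolbutamide": "Boolean",
    "miglitol": "Boolean",
    "metformin": "Category",
}
larger_sample_guess = {
    "number_emergency": "Integer",
    "repaglinide": "Category",
    "chlorpropamide": "Category",
    "acetohexamide": "Boolean",
    "tolbutamide": "Category",
    "miglitol": "Category",
    "metformin": "Category",
}

# Map each changed column to its (old guess, new guess) pair.
changed = {
    col: (default_guess.get(col), new)
    for col, new in larger_sample_guess.items()
    if default_guess.get(col) != new
}

for col, (old, new) in changed.items():
    print(f"{col}: {old} -> {new}")
```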

## Selecting by data type
The `select_dtypes` function selects the columns of a DataFrame whose guessed data type is in a given list.

In [7]:
select_dtypes(df, ['String']).head()


A column with None-type is present in the data.



Unnamed: 0,admission_type_id,discharge_disposition_id,admission_source_id,diag_1_desc,diag_2_desc,diag_3_desc
0,Elective,Discharged to home,Physician Referral,Spinal stenosis in cervical region,Spinal stenosis in cervical region,"Effusion of joint, site unspecified"
1,Urgent,Discharged to home,Physician Referral,"First-degree perineal laceration, unspecified ...","Diabetes mellitus of mother, complicating preg...",Sideroblastic anemia
2,Not Available,Discharged/transferred to home with home healt...,,Pneumococcal pneumonia [Streptococcus pneumoni...,"Congestive heart failure, unspecified",Hyperosmolality and/or hypernatremia
3,Emergency,Discharged to home,Transfer from another health care facility,Cellulitis and abscess of face,Streptococcus infection in conditions classifi...,Diabetes mellitus without mention of complicat...
4,Emergency,Discharged to home,Emergency Room,"Bipolar I disorder, single manic episode, unsp...",Diabetes mellitus without mention of complicat...,Depressive type psychosis


## Correcting data types
Guesses can be corrected by passing a dictionary of overrides to `select_dtypes` through the `dtypes` argument. Columns omitted from the dictionary keep the automatically guessed data type.

In [8]:
select_dtypes(df, ['Category'], dtypes={'admission_type_id': 'Category'}).head()


A column with None-type is present in the data.



Unnamed: 0,gender,age,weight,admission_type_id,payer_code,medical_specialty,diag_1,diag_2,diag_3,max_glu_serum,...,nateglinide,glimepiride,glipizide,glyburide,pioglitazone,rosiglitazone,acarbose,insulin,glyburide.metformin,change
0,Female,[50-60),?,Elective,CP,Surgery-Neuro,723,723.0,719,,...,No,No,No,No,No,No,No,No,No,No
1,Female,[20-30),[50-75),Urgent,UN,?,664,648.0,285,,...,No,No,No,No,No,No,No,No,No,No
2,Male,[80-90),?,Not Available,MC,Family/GeneralPractice,481,428.0,276,>200,...,No,No,No,No,No,No,No,Steady,No,No
3,Female,[50-60),?,Emergency,UN,?,682,41.0,250,,...,No,No,No,No,No,No,No,Steady,No,No
4,Female,[50-60),?,Emergency,?,Psychiatry,296,250.01,298,,...,No,No,Steady,No,No,No,No,Steady,No,Ch
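The override behaviour described in this section can be pictured as a dictionary merge followed by a column filter. The sketch below is only an illustration under that assumption: `toy_select` is a hypothetical helper, not the implementation of `select_dtypes`, and the small DataFrame holds a few values copied from the table above.

```python
import pandas as pd

# A few values copied from the table above.
df = pd.DataFrame({
    "gender": ["Female", "Male"],
    "admission_type_id": ["Elective", "Not Available"],
    "time_in_hospital": ["1", "7"],
})

# Guessed types for these columns, as shown in the guess_dtypes output.
guesses = {
    "gender": "Category",
    "admission_type_id": "String",
    "time_in_hospital": "Integer",
}
overrides = {"admission_type_id": "Category"}


def toy_select(frame, wanted, guesses, overrides=None):
    """Hypothetical helper: keep columns whose final type is in `wanted`."""
    final = {**guesses, **(overrides or {})}  # overrides win over guesses
    cols = [col for col in frame.columns if final.get(col) in wanted]
    return frame[cols]


print(toy_select(df, ["Category"], guesses).columns.tolist())
# ['gender']
print(toy_select(df, ["Category"], guesses, overrides).columns.tolist())
# ['gender', 'admission_type_id']
```

Merging the overrides last means only the listed columns change, which matches the behaviour described above for omitted columns.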
