# Pandas

## What is Pandas?

Pandas is a Python library for manipulating and analyzing data. If you have ever worked with tabular data in Python, chances are that you have encountered this package. In this module, we will highlight important Pandas features while manipulating and analyzing the MIMIC-III dataset, which we will be using in this course. These features include reading and writing data, slicing and subsetting, merging and joining, and filling in missing data.

The official pandas documentation can be found [here](http://pandas.pydata.org/pandas-docs/stable/) and is a great resource if you encounter something that you do not understand or would like to look up.


In [2]:
import pandas as pd
import numpy as np

## Pandas Series and DataFrame

Pandas has two main data structures: the Series and the DataFrame. A Series is a one-dimensional array of values with an index that labels each value, while a DataFrame is a two-dimensional table whose columns are Series sharing a common index. Let's take a look at some examples:

In [14]:
series = pd.Series(data = range(10, 20))
series

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

In [15]:
series.index

RangeIndex(start=0, stop=10, step=1)

In [18]:
series.index = range(30, 40)
series

30    10
31    11
32    12
33    13
34    14
35    15
36    16
37    17
38    18
39    19
dtype: int64

In [20]:
series.index = ['a', 'b', 'c', 'd', 'e']*2
series

a    10
b    11
c    12
d    13
e    14
a    15
b    16
c    17
d    18
e    19
dtype: int64

In [21]:
series.is_unique

True

In [22]:
series.is_monotonic_increasing  # replaces is_monotonic, which was removed in pandas 2.0

True
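
Note that `is_unique` and the monotonicity check above describe the values of the Series, not its index, which is why `is_unique` returns `True` even though the index labels repeat. The following is a minimal sketch of that distinction on a shorter Series (the name `s` and its contents are illustrative only). It uses `is_monotonic_increasing` because recent pandas versions no longer provide `is_monotonic`.

```python
import pandas as pd

# Values are unique and increasing, but the index labels repeat.
s = pd.Series(data=range(10, 14), index=['a', 'b', 'a', 'b'])

values_unique = s.is_unique                    # checks the values
index_unique = s.index.is_unique               # checks the index labels
values_increasing = s.is_monotonic_increasing  # checks the values
index_increasing = s.index.is_monotonic_increasing

print(values_unique, index_unique)
print(values_increasing, index_increasing)

# Label-based lookup on a repeated label returns every matching row.
rows_for_a = s.loc['a']
print(rows_for_a)
```

Checking `series.index.is_unique` before relying on label-based lookups is a cheap way to avoid getting several rows back where one was expected.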

In [26]:
daterange_index = pd.date_range('01/01/2016', periods = 10, freq = "D")
daterange_index

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
               '2016-01-09', '2016-01-10'],
              dtype='datetime64[ns]', freq='D')

In [28]:
series.index = daterange_index
series

2016-01-01    10
2016-01-02    11
2016-01-03    12
2016-01-04    13
2016-01-05    14
2016-01-06    15
2016-01-07    16
2016-01-08    17
2016-01-09    18
2016-01-10    19
Freq: D, dtype: int64
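
One benefit of a `DatetimeIndex` is that rows can be selected with date strings. The sketch below rebuilds an equivalent Series under the illustrative name `ts` so that it runs on its own, then selects a single day and a range of days.

```python
import pandas as pd

dates = pd.date_range('2016-01-01', periods=10, freq='D')
ts = pd.Series(data=range(10, 20), index=dates)

# Select a single day by its date string.
one_day = ts.loc['2016-01-03']

# Slice a range of days; with labels, both endpoints are included.
three_days = ts.loc['2016-01-03':'2016-01-05']

print(one_day)
print(three_days)
```

Label-based slices include the end label, unlike positional slices such as `ts.iloc[2:5]`, which stop before position 5.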

For a full list of pandas.Series methods and attributes, see the documentation here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

In [None]:
df = pd.DataFrame()
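
The cell above creates an empty DataFrame. As a minimal sketch of how a populated DataFrame relates to Series, the example below builds one from a dictionary of columns. The column names and values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Each dictionary key becomes a column; all columns share one index.
toy_df = pd.DataFrame({
    'patient': ['p1', 'p2', 'p3'],
    'age': [34, 58, 71],
    'days_in_hospital': [2, 5, 3],
})

print(toy_df.shape)  # (number of rows, number of columns)

# Selecting a single column returns a Series.
age_column = toy_df['age']
print(type(age_column))

# Selecting a list of columns returns a DataFrame.
subset = toy_df[['patient', 'age']]
print(subset)
```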

In [3]:
diabetes_df = pd.read_csv('./data/dataset_diabetes/diabetic_data.csv')
ids_mapping = pd.read_csv('./data/dataset_diabetes/IDs_mapping.csv')

## Exploring the data

In [4]:
diabetes_df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


In [5]:
ids_mapping.head()

| | admission_type_id | description |
|---|---|---|
| 0 | 1 | Emergency |
| 1 | 2 | Urgent |
| 2 | 3 | Elective |
| 3 | 4 | Newborn |
| 4 | 5 | Not Available |
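
The integer codes in `admission_type_id` can be translated into readable labels by merging the two tables. The following is a sketch on small stand-in frames rather than the real files: `encounters`, `mapping`, and `labelled` are hypothetical names, the encounter IDs are made up, and only the `admission_type_id` and `description` columns are taken from the output above.

```python
import pandas as pd

# Stand-in for the encounter table: one row per hospital encounter.
encounters = pd.DataFrame({
    'encounter_id': [101, 102, 103],
    'admission_type_id': [1, 3, 1],
})

# Stand-in for the mapping table: one row per admission type code.
mapping = pd.DataFrame({
    'admission_type_id': [1, 2, 3],
    'description': ['Emergency', 'Urgent', 'Elective'],
})

# A left merge keeps every encounter, even if its code has no match.
labelled = encounters.merge(mapping, on='admission_type_id', how='left')
print(labelled)
```

Before merging the real frames, it is worth checking that `admission_type_id` has the same dtype in both, since mismatched key types can prevent rows from matching.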


In [6]:
diabetes_df.describe()

| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 |
| mean | 165201600.0 | 54330400.0 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.33973 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 102640300.0 | 38696360.0 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.9336 |
| min | 12522.0 | 135.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 84961190.0 | 23413220.0 | 1.0 | 1.0 | 1.0 | 2.0 | 31.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 50% | 152389000.0 | 45505140.0 | 1.0 | 1.0 | 7.0 | 4.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| 75% | 230270900.0 | 87545950.0 | 3.0 | 4.0 | 7.0 | 6.0 | 57.0 | 2.0 | 20.0 | 0.0 | 0.0 | 1.0 | 9.0 |
| max | 443867200.0 | 189502600.0 | 8.0 | 28.0 | 25.0 | 14.0 | 132.0 | 6.0 | 81.0 | 42.0 | 76.0 | 21.0 | 16.0 |


In [7]:
diabetes_df.dtypes

encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride         

In [22]:
categorical_vars = [x for x in diabetes_df.columns if diabetes_df[x].dtype == np.dtype('O')]
categorical_vars

['race',
 'gender',
 'age',
 'weight',
 'payer_code',
 'medical_specialty',
 'diag_1',
 'diag_2',
 'diag_3',
 'max_glu_serum',
 'A1Cresult',
 'metformin',
 'repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'acetohexamide',
 'glipizide',
 'glyburide',
 'tolbutamide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'miglitol',
 'troglitazone',
 'tolazamide',
 'examide',
 'citoglipton',
 'insulin',
 'glyburide-metformin',
 'glipizide-metformin',
 'glimepiride-pioglitazone',
 'metformin-rosiglitazone',
 'metformin-pioglitazone',
 'change',
 'diabetesMed',
 'readmitted']
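
As an alternative to comparing each column's dtype by hand, `DataFrame.select_dtypes` can filter columns by type. The sketch below selects the non-numeric columns of a small made-up frame (`toy_df` and its rows are illustrative). On the real data this should give the same list as the comprehension above, assuming every non-numeric column holds text.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'race': ['Caucasian', 'Asian'],
    'time_in_hospital': [1, 3],
    'readmitted': ['NO', '>30'],
})

# Drop the numeric columns and keep the names of whatever remains.
non_numeric_columns = toy_df.select_dtypes(exclude=['number']).columns.tolist()
print(non_numeric_columns)
```

Excluding numeric types, rather than including the object dtype, keeps the selection independent of how a given pandas version stores strings.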

In [27]:
value_count_dict = {x:diabetes_df[x].value_counts() for x in categorical_vars}
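
`value_counts` takes a few options that are handy when exploring categorical columns. The sketch below applies them to a small made-up Series (`toy` is an illustrative name): `normalize=True` returns proportions, and `dropna=False` adds a row for missing values, which are otherwise left out.

```python
import pandas as pd

toy = pd.Series(['NO', '>30', 'NO', None, 'NO', '>30'])

# Default: counts per value, most frequent first, missing values dropped.
counts = toy.value_counts()

# normalize=True returns proportions of the non-missing values.
shares = toy.value_counts(normalize=True)

# dropna=False also counts the missing entries.
counts_with_missing = toy.value_counts(dropna=False)

print(counts)
print(shares)
print(counts_with_missing)
```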

In [31]:
for counts in value_count_dict.values():
    print(counts)

Caucasian          76099
AfricanAmerican    19210
?                   2273
Hispanic            2037
Other               1506
Asian                641
Name: race, dtype: int64
Female             54708
Male               47055
Unknown/Invalid        3
Name: gender, dtype: int64
[70-80)     26068
[60-70)     22483
[50-60)     17256
[80-90)     17197
[40-50)      9685
[30-40)      3775
[90-100)     2793
[20-30)      1657
[10-20)       691
[0-10)        161
Name: age, dtype: int64
?            98569
[75-100)      1336
[50-75)        897
[100-125)      625
[125-150)      145
[25-50)         97
[0-25)          48
[150-175)       35
[175-200)       11
>200             3
Name: weight, dtype: int64
?     40256
MC    32439
HM     6274
SP     5007
BC     4655
MD     3532
CP     2533
UN     2448
CM     1937
OG     1033
PO      592
DM      549
CH      146
WC      135
OT       95
MP       79
SI       55
FR        1
Name: payer_code, dtype: int64
?                                    49949
InternalMedi

### Replace all '?' values with NaN

In [35]:
diabetes_df.replace(to_replace='?', value = np.nan, inplace=True)
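
The same result can be obtained when the file is loaded, by telling `read_csv` which strings mark missing data. The following is a sketch on a tiny in-memory CSV rather than the real file: the rows are made up, and `raw_csv`, `toy_df`, and `missing_per_column` are illustrative names.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for the CSV file; the rows are made up.
raw_csv = io.StringIO(
    "race,weight,time_in_hospital\n"
    "Caucasian,?,1\n"
    "?,[75-100),3\n"
)

# na_values tells the parser which strings to treat as missing.
toy_df = pd.read_csv(raw_csv, na_values='?')

# Count the missing entries in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

Whichever approach is used, `isna().sum()` gives a quick per-column view of how much data is missing.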

In [36]:
value_count_dict = {x:diabetes_df[x].value_counts() for x in categorical_vars}

In [37]:
value_count_dict

{'race': Caucasian          76099
 AfricanAmerican    19210
 Hispanic            2037
 Other               1506
 Asian                641
 Name: race, dtype: int64, 'gender': Female             54708
 Male               47055
 Unknown/Invalid        3
 Name: gender, dtype: int64, 'age': [70-80)     26068
 [60-70)     22483
 [50-60)     17256
 [80-90)     17197
 [40-50)      9685
 [30-40)      3775
 [90-100)     2793
 [20-30)      1657
 [10-20)       691
 [0-10)        161
 Name: age, dtype: int64, 'weight': [75-100)     1336
 [50-75)       897
 [100-125)     625
 [125-150)     145
 [25-50)        97
 [0-25)         48
 [150-175)      35
 [175-200)      11
 >200            3
 Name: weight, dtype: int64, 'payer_code': MC    32439
 HM     6274
 SP     5007
 BC     4655
 MD     3532
 CP     2533
 UN     2448
 CM     1937
 OG     1033
 PO      592
 DM      549
 CH      146
 WC      135
 OT       95
 MP       79
 SI       55
 FR        1
 Name: payer_code, dtype: int64, 'medical_specialty': 