In [1]:
import numpy as np
import pandas as pd

# Data Inspection

In [2]:
medical_raw = pd.read_csv('medical_raw_data.csv')
medical_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 53 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          10000 non-null  int64  
 1   CaseOrder           10000 non-null  int64  
 2   Customer_id         10000 non-null  object 
 3   Interaction         10000 non-null  object 
 4   UID                 10000 non-null  object 
 5   City                10000 non-null  object 
 6   State               10000 non-null  object 
 7   County              10000 non-null  object 
 8   Zip                 10000 non-null  int64  
 9   Lat                 10000 non-null  float64
 10  Lng                 10000 non-null  float64
 11  Population          10000 non-null  int64  
 12  Area                10000 non-null  object 
 13  Timezone            10000 non-null  object 
 14  Job                 10000 non-null  object 
 15  Children            7412 non-null   float64
 16  Age  

## Null Values

Let's check which columns contain null values:

In [3]:
medical_raw.isnull().any()

Unnamed: 0            False
CaseOrder             False
Customer_id           False
Interaction           False
UID                   False
City                  False
State                 False
County                False
Zip                   False
Lat                   False
Lng                   False
Population            False
Area                  False
Timezone              False
Job                   False
Children               True
Age                    True
Education             False
Employment            False
Income                 True
Marital               False
Gender                False
ReAdmis               False
VitD_levels           False
Doc_visits            False
Full_meals_eaten      False
VitD_supp             False
Soft_drink             True
Initial_admin         False
HighBlood             False
Stroke                False
Complication_risk     False
Overweight             True
Arthritis             False
Diabetes              False
Hyperlipidemia      

Upon inspection we find that the following columns contain null values:
    
- Children
- Age
- Income
- Soft_drink
- Overweight
- Anxiety
- Initial_days
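
The same list can be pulled out programmatically instead of being read off the boolean output. This is a minimal sketch on a small hypothetical frame (`toy`, with made-up values) standing in for `medical_raw`; the names `null_counts` and `columns_with_nulls` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for medical_raw (values are made up).
toy = pd.DataFrame({
    "Children": [1.0, np.nan, 3.0],
    "Age": [40.0, 52.0, np.nan],
    "Gender": ["Male", "Female", "Female"],
})

# Count nulls per column, then keep only the columns with at least one null.
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Applied to the real data, the same two lines would give the null count for each affected column in a single step.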

### Children
For the Children column, we will use pandas ``fillna()`` with ``method='bfill'`` (backward fill), which replaces each missing value with the next non-null value further down the column.

In [4]:
medical_raw['Children'].fillna(method='bfill', inplace=True)

Check if there are any null values remaining:

In [5]:
medical_raw['Children'].isna().sum()

0
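
To make the behaviour of backward fill concrete, here is a small sketch on a hypothetical Series (the values are made up and not taken from the medical data). It uses `Series.bfill()`, which performs the same backward fill as `fillna(method='bfill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the medical data.
children = pd.Series([np.nan, 2.0, np.nan, np.nan, 0.0, np.nan])

# Backward fill: each NaN takes the value of the next non-null row below it.
filled = children.bfill()
print(filled.tolist())

# A NaN in the last row has no later value to copy, so it stays null.
remaining = int(filled.isna().sum())
print(remaining)
```

One caveat of this approach is visible in the sketch: a null in the final row is left untouched, so the follow-up `isna().sum()` check is worth keeping.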

### Age

In [25]:
medical_raw['Age'].isna().sum()


53

For Age we will replace the null values with the mean of the given values, rounded to the nearest whole number.

In [7]:
medical_raw['Age'].fillna(round(medical_raw['Age'].mean()), inplace=True)

Check if there are any null values remaining:

In [8]:
medical_raw['Age'].isna().sum()

1

### Income

In [9]:
medical_raw['Income'].isna().sum()

2464

For Income we will replace the null values with the mean of the given values.

In [10]:
income_mean = medical_raw['Income'].mean()
income_mean

40484.43826831216

In [11]:
medical_raw['Income'].fillna(medical_raw['Income'].mean(), inplace=True)

In [12]:
medical_raw['Income'].isna().sum()

0
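
As a quick illustration of mean imputation, the sketch below uses a hypothetical Series with made-up numbers. It assigns the result of `fillna()` back to a variable rather than using `inplace=True` on a column selection; that is a stylistic assumption on my part, chosen because newer pandas versions may warn about the in-place form.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, not taken from the medical data.
income = pd.Series([30000.0, np.nan, 50000.0, np.nan])

# mean() skips NaN by default, so this is the mean of the observed values only.
income_mean = income.mean()

# Replace the nulls with that mean and keep the result.
income_filled = income.fillna(income_mean)
print(income_filled.tolist())
```

Because every null receives the same value, the column mean is unchanged by the fill, while the spread of the column shrinks.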

### Soft_drink

In [13]:
medical_raw['Soft_drink'].isna().sum()

2467

Since Soft_drink is categorical, with values of either "Yes" or "No", we'll replace the nulls with the mode.

In [14]:
medical_raw['Soft_drink'].fillna(medical_raw['Soft_drink'].mode()[0], inplace=True)
medical_raw['Soft_drink'].isnull().sum()

0
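
The mode fill can be sketched on a hypothetical Series as follows (made-up values). `mode()` returns a Series because there can be ties, which is why the first entry is selected with `[0]`.

```python
import pandas as pd

# Hypothetical responses, not taken from the medical data.
soft_drink = pd.Series(["No", "Yes", None, "No", None])

# mode() ignores nulls by default and returns the most frequent value(s).
most_common = soft_drink.mode()[0]

# Fill every null with the most frequent category.
soft_drink_filled = soft_drink.fillna(most_common)
print(soft_drink_filled.tolist())
```

Note that filling with the mode increases the share of the majority category, which matters more the larger the fraction of nulls is.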

### Overweight

In [15]:
medical_raw['Overweight'].isna().sum()

982

Overweight is also categorical, though it uses "0" and "1" instead of "Yes" and "No" (we'll fix these inconsistencies in the categorical values later). We'll use the same method we used for Soft_drink.

In [16]:
medical_raw['Overweight'].fillna(medical_raw['Overweight'].mode()[0], inplace=True)
medical_raw['Overweight'].isna().sum()

0

### Anxiety
Anxiety is another categorical column, so we use the mode again.

In [17]:
medical_raw['Anxiety'].isna().sum()

984

In [18]:
medical_raw['Anxiety'].fillna(medical_raw['Anxiety'].mode()[0], inplace=True)
medical_raw['Anxiety'].isna().sum()

0

### Initial_days
Initial_days is numerical, so we replace the nulls with the mean of the given values, as we did for Income.

In [19]:
medical_raw['Initial_days'].isna().sum()

1056

In [20]:
medical_raw['Initial_days'].fillna(medical_raw['Initial_days'].mean(), inplace=True)
medical_raw['Initial_days'].isna().sum()

0

In [21]:
medical_raw.isnull().any()

Unnamed: 0            False
CaseOrder             False
Customer_id           False
Interaction           False
UID                   False
City                  False
State                 False
County                False
Zip                   False
Lat                   False
Lng                   False
Population            False
Area                  False
Timezone              False
Job                   False
Children              False
Age                    True
Education             False
Employment            False
Income                False
Marital               False
Gender                False
ReAdmis               False
VitD_levels           False
Doc_visits            False
Full_meals_eaten      False
VitD_supp             False
Soft_drink            False
Initial_admin         False
HighBlood             False
Stroke                False
Complication_risk     False
Overweight            False
Arthritis             False
Diabetes              False
Hyperlipidemia      