# Description

We have all been in the situation where we visit a doctor in an emergency and find that the consultation fee is too high. As data scientists, we can do better. What if you had data recording important details about doctors and could build a model to predict a doctor's consulting fee? This use case lets you do exactly that.

Size of training set: 5961 records

Size of test set: 1987 records


FEATURES:

Qualification: Qualification and degrees held by the doctor

Experience: Experience of the doctor in number of years

Rating: Rating given by patients

Profile: Type of the doctor

Miscellaneous_Info: Extra information about the doctor

Fees: Fees charged by the doctor (Target Variable)

Place: Area and the city where the doctor is located.


Using the given features we have to build a model which can predict Doctor's Consultation Fees.

As Fees (the target variable) is continuous in nature, we will follow a regression approach.

Let's begin!


In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Importing training and test dataset

In [None]:
# Importing Data Set into Dataframe
doctor_train=pd.read_excel('Doctor_Train.xlsx')
doctor_test=pd.read_excel('Doctor_Test.xlsx')

In [None]:
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

# Overview

In [None]:
# Checking Training Dataset

doctor_train.head()

In [None]:
doctor_train.tail()

In [None]:
# Checking Dataset Info

doctor_train.info()

In [None]:
# Checking Shape

doctor_train.shape

We have 5961 rows and 7 columns

In [None]:
# Checking Datatypes

doctor_train.dtypes

Apart from Fees, all columns are of object type.

In [None]:
# Checking Test Dataset

doctor_test.head()

In [None]:
doctor_test.tail()

In [None]:
doctor_test.info()

In [None]:
doctor_test.shape

In [None]:
doctor_test.dtypes

In [None]:
# Checking Null Values in Training Data

doctor_train.isnull().sum()

In [None]:
# Checking Null Values in Test Data
doctor_test.isnull().sum()

We have null values in the Rating, Place and Miscellaneous_Info columns in both datasets. We will fill these null values while treating the columns.

# Data Exploration and Cleaning

As we see, columns such as Qualification and Place contain multiple pieces of information. Also, the Experience and Rating columns hold string values, which is why they are of object type. We will extract the values from these columns one by one.

In [None]:
# Working on Training dataset

In [None]:
# Starting with Experience Column

In [None]:
# Checking unique values and their counts

doctor_train['Experience'].value_counts()

In [None]:
doctor_train['Experience'].nunique()

We will remove the string part of the values and keep only the integers. We will also convert the datatype to int32.

In [None]:
# Splitting string and assigning 0th index value back to column and converting to int32

string1=doctor_train['Experience'].str.split()
doctor_train['Experience']=string1.str[0]
doctor_train['Experience']=doctor_train['Experience'].astype('int32')
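As an alternative sketch of the same step, the leading number can be pulled out with a regular expression instead of splitting on whitespace. The toy values below are illustrative and only assume the "<n> years experience" shape; they are not rows from the dataset.

```python
import pandas as pd

# Toy values in the assumed "<n> years experience" shape (illustrative only)
experience = pd.Series(["24 years experience", "9 years experience", "1 years experience"])

# Extract the first run of digits and convert it to int32
years = experience.str.extract(r"(\d+)", expand=False).astype("int32")
print(years.tolist())
```

This variant does not depend on the number being the first whitespace-separated token, which may help if some entries are formatted differently.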

In [None]:
# Next we correct rating column

In [None]:
# Splitting string and assigning 0th index value back to column.

string3=doctor_train['Rating'].str.split('%')
doctor_train['Rating']=string3.str[0]

In [None]:
# Filling NaN values with 0

doctor_train['Rating']=doctor_train['Rating'].fillna(0)

In [None]:
# Converting to int32

doctor_train['Rating']=doctor_train['Rating'].astype('int32')
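The three Rating steps above can also be sketched as one chain. This is a minimal illustration on made-up values, assuming ratings look like "98%" and that missing ratings should become 0, as in this notebook.

```python
import pandas as pd

# Toy ratings with one missing value (illustrative, not from the dataset)
rating = pd.Series(["100%", None, "98%"])

# Strip the percent sign, convert to numbers, fill gaps with 0, then cast to int32
rating_clean = (
    pd.to_numeric(rating.str.rstrip("%"), errors="coerce")
    .fillna(0)
    .astype("int32")
)
print(rating_clean.tolist())
```

Note that filling with 0 makes "no rating" look like the worst possible rating; that is the choice made in this notebook, not a general recommendation.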

In [None]:
# Place Column

In [None]:
# We split the string and store value for area and city in different columns

string4=doctor_train['Place'].str.split(',')
# Values are kept exactly as split, so City entries retain their leading space (e.g. ' Delhi')
doctor_train['Area']=string4.str[0]
doctor_train['City']=string4.str[1]
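A small sketch of how whitespace behaves after the split: `Series.replace(' ', '')` only replaces whole values equal to a single space, so it does not trim the space that follows the comma, whereas `.str.strip()` does. The place names below are toy values. This notebook keeps the leading space (later cells refer to `' Sector 5'` and `' Delhi'`), so the stripped variant is shown only for comparison.

```python
import pandas as pd

# Toy "Area, City" strings with one missing value (illustrative only)
place = pd.Series(["Kakkanad, Ernakulam", "Whitefield, Bangalore", None])
parts = place.str.split(",")

city_raw = parts.str[1]                     # keeps the space after the comma
city_replaced = city_raw.replace(" ", "")   # whole-value replace: nothing changes here
city_clean = city_raw.str.strip()           # trims surrounding whitespace

print(repr(city_replaced.iloc[0]), repr(city_clean.iloc[0]))
```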

In [None]:
# Qualification Column

In [None]:
# We follow the same procedure of splitting and assigning values to separate columns.

# We store the number of qualifications per doctor in a variable

train_split = doctor_train.Qualification.apply(lambda x: len(x.split(',')))

In [None]:
# We define a function to create separate columns according to the length given by train_split variable

def qualif_column(data, col, col_num):
    return data[col].str.split(',').str[col_num]

In [None]:
# Now we pass Qualification column in the function 

# for training set
for i in range(0,train_split.max()):
    qualif = "Qualification_"+ str(i+1)
    doctor_train[qualif] = qualif_column(doctor_train,'Qualification', i)
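For comparison, the same wide layout can be sketched with `str.split(..., expand=True)`, which creates one column per position in a single call. The qualification strings are toy values, and the stripping step is an addition that the notebook's own loop does not perform.

```python
import pandas as pd

# Toy qualification strings (illustrative only)
qual = pd.Series(["BHMS, MD - Homeopathy", "MBBS", "BDS, MDS, Fellowship"])

# One column per comma-separated position; shorter rows are padded with missing values
wide = qual.str.split(",", expand=True)
wide = wide.apply(lambda col: col.str.strip())  # extra step: trim spaces around each entry
wide.columns = [f"Qualification_{i + 1}" for i in range(wide.shape[1])]
print(wide)
```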

In [None]:
# Lets check our training dataset

doctor_train

The dataset looks good. The maximum number of qualifications for a doctor is 10.

We perform the same tasks for the test dataset.

Working on Test Dataset

In [None]:
# Starting with Experience column

In [None]:
doctor_test['Experience'].value_counts()

In [None]:
# Splitting string and assigning 0th index value back to column and converting to int32

string11=doctor_test['Experience'].str.split()
doctor_test['Experience']=string11.str[0]
doctor_test['Experience']=doctor_test['Experience'].astype('int32')

In [None]:
# For Rating column

In [None]:
# Splitting string and assigning 0th index value back to column.

string33=doctor_test['Rating'].str.split('%')
doctor_test['Rating']=string33.str[0]

# Filling NaN values with 0

doctor_test['Rating']=doctor_test['Rating'].fillna(0)


# Converting to int32

doctor_test['Rating']=doctor_test['Rating'].astype('int32')

In [None]:
# For Place column

In [None]:
# We split the string and store value for area and city in different columns

string44=doctor_test['Place'].str.split(',')
# Use the test split (string44), not the training split, to fill the test columns
doctor_test['Area']=string44.str[0]
doctor_test['City']=string44.str[1]

In [None]:
# For Qualification

In [None]:
# We follow the same procedure of splitting and assigning values to separate columns.

# We store the number of qualifications per doctor in a variable

test_split = doctor_test.Qualification.apply(lambda x: len(x.split(',')))

In [None]:
# We define a function to create separate columns according to the length given by test_split variable

def qualif_column(data, col, col_num):
    return data[col].str.split(',').str[col_num]

In [None]:
# Now we pass Qualification column in the function 

# for test set
for i in range(0,test_split.max()):
    qualif = "Qualification_"+ str(i+1)
    doctor_test[qualif] = qualif_column(doctor_test,'Qualification', i)

In [None]:
# Checking dataset

doctor_test

In the test dataset, a doctor has at most 17 qualifications.

# Data Analysis

Let's check our features and their impact on Fees.

In [None]:
fig,ax=plt.subplots(figsize=(8,6))
sns.countplot(x='Profile',data=doctor_train,ax=ax)
plt.title('Profile Count')
plt.xticks(rotation=35)
plt.show()

The record count for the Dentist profile is the highest. This could possibly mean that people visit a doctor for dental issues more often than for other problems.

In [None]:
fig,ax=plt.subplots(figsize=(8,6))
sns.countplot(x='City',data=doctor_train,ax=ax)
plt.title('Cities')
plt.xticks(rotation=35)
plt.show()

We see more records from Bangalore, Mumbai and Delhi. We also observe an unusual entry named Sector 5.

Let's check the relationship with the target variable.

In [None]:
# catplot is a figure-level function and creates its own figure, so no ax is passed
sns.catplot(x='Profile',y='Fees',kind='bar',data=doctor_train)
plt.title('Impact of Profile on Fees')
plt.xticks(rotation=90)
plt.show()

ENT Specialists and Dermatologists charge the highest fees, followed by General and Homeopathy doctors. Ayurveda doctors charge the least.

This could possibly mean that Ayurveda is less favoured than Allopathy.

In [None]:
# catplot is a figure-level function and creates its own figure, so no ax is passed
sns.catplot(x='City',y='Fees',kind='bar',data=doctor_train)
plt.title('City wise tally of Fees')
plt.xticks(rotation=90)
plt.show()

We observe a high fee trend in Tier-I cities, with Delhi at the top.

# Observing relation of numerical columns with Fees

In [None]:
#Experience vs Fees
plt.title('Experience vs Fees')
plt.xlabel('Experience')
plt.ylabel('Fees')
sns.scatterplot(x=doctor_train['Experience'],y=doctor_train['Fees'])
plt.show()

In [None]:
plt.figure(figsize=(15,12))
sns.barplot(x='Experience',y='Fees',data=doctor_train)
plt.xticks(rotation=90)
plt.show()

Experience does not show a clear positive relation with Fees. We cannot say that more or less experience has a consistent effect on the fee.

In [None]:
# Checking Rating

plt.figure(figsize=(15,12))
sns.barplot(x='Rating',y='Fees',data=doctor_train)
plt.xticks(rotation=90)
plt.show()

In [None]:
#Rating vs Fees
plt.title('Rating vs Fees')
plt.xlabel('Rating')
plt.ylabel('Fees')
sns.scatterplot(x=doctor_train['Rating'],y=doctor_train['Fees'])
plt.show()

There is a slight negative correlation with Fees: higher fees tend to come with lower ratings. This could be interpreted as doctors who charge less receiving higher ratings.
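The visual impressions from the plots can be backed by a correlation table. The sketch below uses a synthetic toy frame, built so that the signs are obvious, purely to show the call; on the real data one would run the same line on `doctor_train[['Experience','Rating','Fees']]`.

```python
import pandas as pd

# Synthetic toy data (not from the dataset), constructed so the expected signs are clear
toy = pd.DataFrame({
    "Experience": [5, 10, 15, 20, 25],
    "Rating": [100, 95, 90, 85, 80],
    "Fees": [100, 200, 300, 400, 500],
})

# Pearson correlation of each numeric feature with the target
corr_with_fees = toy.corr(method="pearson")["Fees"].drop("Fees")
print(corr_with_fees)
```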

In [None]:
# We check for null values again and also for any unusual entries in the test data

In [None]:
doctor_train.isnull().sum()

In [None]:
doctor_test.isnull().sum()

In [None]:
# We will replace null values for Area and city with 'Unknown' and replace unusual entry in Training dataset.

doctor_train['Area']=doctor_train['Area'].fillna('Unknown')
doctor_train['City']=doctor_train['City'].fillna('Unknown')

In [None]:
doctor_train['Area']=doctor_train['Area'].replace('e','Unknown')
doctor_train['City']=doctor_train['City'].replace(' Sector 5',' Delhi')

In [None]:
# For Test Dataset

In [None]:
doctor_test['Area']=doctor_test['Area'].fillna('Unknown')
doctor_test['City']=doctor_test['City'].fillna('Unknown')

In [None]:
#Checking Unique values in both dataset

doctor_train['Area'].value_counts()

In [None]:
doctor_train['City'].value_counts()

In [None]:
#Test Data

In [None]:
doctor_test['City'].value_counts()

In [None]:
# All entries look good; let's check null values again

In [None]:
doctor_train.isnull().sum()

In [None]:
doctor_test.isnull().sum()

# Encoding

We will encode qualification columns with category codes.

As the test data has more qualification values, we will create a dictionary of unique values with codes and replace the values in both datasets.

In [None]:
# Creating copies of the test (data) and training (data1) datasets

data=doctor_test.copy()
data1=doctor_train.copy()

In [None]:
# Splitting strings in Qualification column

data['Qualification']=data['Qualification'].str.split(',')
data1['Qualification']=data1['Qualification'].str.split(',')

In [None]:
# Fetching values in a separate list

# data1 is the copy of the training set
data_train=[]
for i in range(data1.shape[0]):
    for j in data1['Qualification'][i]:
        data_train.append(j)

In [None]:
# data is the copy of the test set
data_test=[]
for i in range(data.shape[0]):
    for j in data['Qualification'][i]:
        data_test.append(j)

In [None]:
data_final=data_train+data_test

In [None]:
# keeping unique values only

data1_unique=list(set(data_final))

In [None]:
data1_unique

In [None]:
# Creating a dataframe with unique values

df=pd.DataFrame(data1_unique,columns=['Qualification'])

In [None]:
# Creating a column with category codes

df['Code']=df.Qualification.astype('category').cat.codes

In [None]:
# Checking Dataframe

df

In [None]:
# Let's create a dictionary with qualifications and codes

qualif_dict=dict(zip(df.Qualification,df.Code))

In [None]:
qualif_dict
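The idea behind the shared dictionary can be sketched compactly: collect every qualification token from both datasets, sort the unique values, and number them. The strings below are toy values, and the stripping step is an assumption of this sketch (the notebook's own dictionary keeps tokens exactly as split).

```python
import pandas as pd

# Toy qualification strings for a "train" and a "test" set (illustrative only)
train_q = pd.Series(["BDS, MDS", "MBBS"])
test_q = pd.Series(["MBBS, MD", "BAMS"])

# One token per row across both sets, with surrounding spaces removed
tokens = pd.concat([train_q, test_q]).str.split(",").explode().str.strip()

# Sorted unique tokens get consecutive integer codes
vocab = sorted(tokens.unique())
code_of = {name: code for code, name in enumerate(vocab)}
print(code_of)
```

Building the vocabulary from both sets guarantees that every test token has a code, at the cost of peeking at the test data during preprocessing.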

In [None]:
# Replacing values of qualifications with codes in Training and test dataset

# For Training

for k in doctor_train.iloc[:,9:]:
    doctor_train.replace({k:qualif_dict},inplace=True)

In [None]:
# For Test Dataset

for z in doctor_test.iloc[:,8:]:
    doctor_test.replace({z:qualif_dict},inplace=True)

In [None]:
doctor_train

In [None]:
doctor_test

In [None]:
# We will encode city and profile using One Hot Encoder in both Train and Test Set

In [None]:
# For Training Data

doctor_train=pd.get_dummies(doctor_train,columns=['Profile','City'],prefix=['Profile','City'])

In [None]:
# For Test data
doctor_test=pd.get_dummies(doctor_test,columns=['Profile','City'],prefix=['Profile','City'])
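When train and test are one-hot encoded separately, their dummy columns can differ if a category appears in only one of them. A hedged sketch of one way to align them, using toy city names: reindex the test frame on the training columns and fill the gaps with 0.

```python
import pandas as pd

# Toy frames where each set has a city the other lacks (illustrative only)
train = pd.DataFrame({"City": ["Delhi", "Mumbai", "Delhi"]})
test = pd.DataFrame({"City": ["Mumbai", "Chennai"]})

train_d = pd.get_dummies(train, columns=["City"], prefix=["City"])
test_d = pd.get_dummies(test, columns=["City"], prefix=["City"])

# Keep exactly the training columns: unseen test cities are dropped, missing ones become 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```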

In [None]:
doctor_train.head()

In [None]:
doctor_test.head()

In [None]:
# Encoding Area column with category codes

In [None]:
df2=doctor_train['Area'].unique()
df2=pd.DataFrame(df2,columns=['Area'])
df2['Area_Code']=df2['Area'].astype('category').cat.codes

In [None]:
# Creating Area and Area Code dictionary

area_dict=dict(zip(df2.Area,df2.Area_Code))

In [None]:
# Replacing Area values with Codes in Train Dataset

doctor_train.replace({'Area':area_dict},inplace=True)

In [None]:
doctor_train

In [None]:
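# Replacing Area values with Codes in Test Dataset
# Areas that do not occur in the training data have no code, so they are mapped to -1

doctor_test['Area']=doctor_test['Area'].map(area_dict).fillna(-1).astype('int64')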
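Because the area codes are learned from the training data only, a test row may carry an area that has no code. A minimal sketch of handling that case with `map`, using a hypothetical two-entry dictionary and toy area names:

```python
import pandas as pd

# Hypothetical code dictionary learned from training areas (illustrative only)
area_codes = {"Kakkanad": 0, "Whitefield": 1}

# Toy test areas, one of which was never seen in training
test_area = pd.Series(["Whitefield", "Some New Area", "Kakkanad"])

# map returns NaN for unknown keys; those are replaced by the sentinel -1
encoded = test_area.map(area_codes).fillna(-1).astype("int64")
print(encoded.tolist())
```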

In [None]:
doctor_test

Replacing NaN with -1 to mark missing qualifications

In [None]:
# For Training

for x in doctor_train.iloc[:,7:17]:
    doctor_train[x]=doctor_train[x].fillna(-1)

In [None]:
# For Test

for x in doctor_test.iloc[:,6:23]:
    doctor_test[x]=doctor_test[x].fillna(-1)

# Creating Input and Target Variables from Training Dataset

In [None]:
x=doctor_train.drop(['Qualification','Place','Miscellaneous_Info','Fees'],axis=1)

In [None]:
y=doctor_train['Fees']

In [None]:
# Dropping columns from the test dataset as well
# A few rows have qualifications extending to the 17th column, which the training data lacks, so we drop these columns.
doctor_test.drop(['Qualification','Place','Miscellaneous_Info','Qualification_11','Qualification_12','Qualification_13','Qualification_14','Qualification_15','Qualification_16','Qualification_17'],axis=1,inplace=True)

In [None]:
doctor_test.head()

In [None]:
doctor_test.shape

In [None]:
x.head()

In [None]:
y.head()

In [None]:
x.shape

In [None]:
y.shape

# Checking Skewness

In [None]:
plt.figure(figsize=(15,12))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(x['Experience'],kde=True)
plt.xticks(rotation=90)
plt.show()

In [None]:
x.skew()

We will only treat Experience column for skewness

In [None]:
from sklearn.preprocessing import PowerTransformer
# Fit the transformer on the training Experience values only
pt=PowerTransformer()
x['Experience']=pt.fit_transform(x[['Experience']]).ravel()

In [None]:
#For Test dataset

doctor_test.skew()

In [None]:
# Apply the transformer fitted on the training data so both sets share the same scale
doctor_test['Experience']=pt.transform(doctor_test[['Experience']]).ravel()

In [None]:
doctor_test


# Creating Train and Test Split

In [None]:
# Importing Regression Algorithms & Metrics
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso,Ridge,ElasticNet
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [None]:
maxAccu=0
maxRs=0
for i in range(1,500):
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.33,random_state=i)
    lr=LinearRegression()
    lr.fit(x_train,y_train)
    lr_pred=lr.predict(x_test)
    acc=r2_score(y_test,lr_pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRs=i
print('Best R2 Score is : ', maxAccu, ' when Random state is : ',maxRs)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.33,random_state=267)

In [None]:
# Defining Model List
model_list=[LinearRegression(),SVR(),DecisionTreeRegressor(),KNeighborsRegressor(),RandomForestRegressor(),GradientBoostingRegressor(),AdaBoostRegressor(),Lasso(),Ridge(),ElasticNet(),XGBRegressor()]

In [None]:
# Creating a for loop to print training and test R2 scores (as percentages)
for m in model_list:
    model=m
    model.fit(x_train,y_train)
    model_pred_train=model.predict(x_train)
    model_pred=model.predict(x_test)
    print('Training R2 score for the model ',m,' is: ',r2_score(y_train,model_pred_train)*100)
    print('Testing R2 score for the model ',m,' is: ',r2_score(y_test,model_pred)*100)
    print('\n')
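The comparison loop can also collect its scores in a table, which makes the models easier to rank. The sketch below runs on synthetic data from `make_regression` with only two models, so the numbers say nothing about the doctor fee data; the column names `train_r2` and `test_r2` are choices of this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for x and y
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

rows = []
for model in [LinearRegression(), Ridge()]:
    model.fit(X_tr, y_tr)
    rows.append({
        "model": type(model).__name__,
        "train_r2": r2_score(y_tr, model.predict(X_tr)),
        "test_r2": r2_score(y_te, model.predict(X_te)),
    })

# Best test score first; a large train/test gap hints at overfitting
scores = pd.DataFrame(rows).sort_values("test_r2", ascending=False)
print(scores)
```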

# Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# Linear Regression
lr=LinearRegression()
lr.fit(x_train,y_train)
lr_pred=lr.predict(x_test)
testing_accu=r2_score(y_test,lr_pred)*100
for k in range(2,10):
    cv_score=cross_val_score(lr,x,y,cv=k)
    cv_mean=cv_score.mean()*100
    print(f'At crossfold {k} the CV score is {cv_mean} and the R2 score for testing is {testing_accu} ')
    print('\n')

In [None]:
# Cross validating RandomForestRegressor
rfr=RandomForestRegressor()
rfr.fit(x_train,y_train)
rfr_pred=rfr.predict(x_test)
testing_accu=r2_score(y_test,rfr_pred)*100
for k in range(2,10):
    cv_score=cross_val_score(rfr,x,y,cv=k)
    cv_mean=cv_score.mean()*100
    print(f'At crossfold {k} the CV score is {cv_mean} and the R2 score for testing is {testing_accu} ')
    print('\n')

In [None]:
# Cross validating GradientBoostingRegressor

gbr = GradientBoostingRegressor()
gbr.fit(x_train, y_train)
gbr_pred=gbr.predict(x_test)
testing_accu=r2_score(y_test,gbr_pred)*100
for k in range(2,10):
    cv_score=cross_val_score(gbr,x,y,cv=k)
    cv_mean=cv_score.mean()*100
    print(f'At crossfold {k} the CV score is {cv_mean} and the R2 score for testing is {testing_accu} ')
    print('\n')

In [None]:
# Cross validating Lasso
ls=Lasso()
ls.fit(x_train,y_train)
ls_pred=ls.predict(x_test)
testing_accu=r2_score(y_test,ls_pred)*100
for k in range(2,10):
    cv_score=cross_val_score(ls,x,y,cv=k)
    cv_mean=cv_score.mean()*100
    print(f'At crossfold {k} the CV score is {cv_mean} and the R2 score for testing is {testing_accu} ')
    print('\n')

In [None]:
# Cross validating Ridge
rd=Ridge()
rd.fit(x_train,y_train)
rd_pred=rd.predict(x_test)
testing_accu=r2_score(y_test,rd_pred)*100
for k in range(2,10):
    cv_score=cross_val_score(rd,x,y,cv=k)
    cv_mean=cv_score.mean()*100
    print(f'At crossfold {k} the CV score is {cv_mean} and the R2 score for testing is {testing_accu} ')
    print('\n')
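Passing an integer as `cv` splits the rows in their stored order. If the rows might be ordered in some meaningful way, an explicit shuffled `KFold` is one option. This is a sketch on synthetic data; the choice of `Ridge`, 5 folds and the seed are assumptions of the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for x and y
X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)

# Shuffled folds with a fixed seed so the split is reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean shows how much the score moves between folds.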

# Hyper Tuning

In [None]:
# Importing Gridsearch CV

from sklearn.model_selection import GridSearchCV

In [None]:
# defining parameters


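# Parameter names follow recent scikit-learn releases: 'mse'/'mae' became 'squared_error'/'absolute_error', and 'auto' corresponds to 1.0
gr_param={'n_estimators':list(range(50,400,50)),'max_depth' : np.arange(2,8),'criterion':['squared_error','absolute_error'],'max_features':[1.0, 'sqrt', 'log2']}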
gcv= GridSearchCV(estimator=rfr,param_grid=gr_param,cv=5)

In [None]:
# Getting Best Parameters

gcv.fit(x_train,y_train)
gcv.best_params_
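After fitting, `GridSearchCV` exposes the refitted winner as `best_estimator_`, so the best parameters do not have to be copied by hand. A minimal sketch on synthetic data with a deliberately tiny grid (the grid values here are illustrative, not the ones used above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for x_train and y_train
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

# Small illustrative grid so the search finishes quickly
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, scoring="r2", cv=3)
search.fit(X, y)

# The best model is already refitted on all of X, y (refit=True by default)
best_model = search.best_estimator_
print(search.best_params_)
```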

In [None]:
rfr1=RandomForestRegressor(criterion='squared_error',max_depth=7,max_features='sqrt',n_estimators=300)
rfr1.fit(x_train,y_train)
rfr1_pred=rfr1.predict(x_test)
testing_accu=r2_score(y_test,rfr1_pred)*100
for k in range(2,10):
    cv_score=cross_val_score(rfr1,x,y,cv=k)
    cv_mean=cv_score.mean()*100
    print(f'At crossfold {k} the CV score is {cv_mean} and the R2 score for testing is {testing_accu} ')
    print('\n')

In [None]:
# defining parameters


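# Parameter names follow recent scikit-learn releases; for gradient boosting, 'auto' corresponds to None
gbr_param={'n_estimators':list(range(50,400,50)),'max_depth' : np.arange(2,10),'learning_rate' :[0.01,0.1,0.2,0.3],'criterion':['friedman_mse', 'squared_error'], 'max_features':[None, 'sqrt', 'log2']}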
gcv_gbr= GridSearchCV(estimator=gbr,param_grid=gbr_param,scoring='r2',cv=5)

In [None]:
# Training final model with Random Forest Regressor

rfr1=RandomForestRegressor(criterion='squared_error',max_depth=7,max_features='sqrt',n_estimators=300)
rfr1.fit(x_train,y_train)
rfr1_pred=rfr1.predict(x_test)
testing_accu=r2_score(y_test,rfr1_pred)*100

cv_score=cross_val_score(rfr1,x,y,cv=5)
cv_mean=cv_score.mean()*100
print(f'The CV score of the model is {cv_mean} and the R2 Score for testing is {testing_accu} ')
print('\n')

print('Mean Squared Error of the model is : ',mean_squared_error(y_test,rfr1_pred))
print('Mean Absolute Error of the model is : ',mean_absolute_error(y_test,rfr1_pred))
print('Root Mean Squared Error of the model is : ',np.sqrt(mean_squared_error(y_test,rfr1_pred)),'\n')

In [None]:
# Training final model with Gradient Boosting Regressor

gbr1=GradientBoostingRegressor()
gbr1.fit(x_train,y_train)
gbr1_pred=gbr1.predict(x_test)
testing_accu=r2_score(y_test,gbr1_pred)*100

cv_score=cross_val_score(gbr1,x,y,cv=5)
cv_mean=cv_score.mean()*100
print(f'The CV score of the model is {cv_mean} and the R2 Score for testing is {testing_accu} ')
print('\n')

print('Mean Squared Error of the model is : ',mean_squared_error(y_test,gbr1_pred))
print('Mean Absolute Error of the model is : ',mean_absolute_error(y_test,gbr1_pred))
print('Root Mean Squared Error of the model is : ',np.sqrt(mean_squared_error(y_test,gbr1_pred)),'\n')
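To make the reported metrics concrete, here is a hand computation of MAE, MSE, RMSE and R2 on three made-up fee values. The numbers are illustrative only and the formulas are the standard definitions.

```python
import numpy as np

# Made-up true and predicted fees (illustrative only)
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

residuals = y_true - y_pred                       # [-10, 10, -30]
mae = np.mean(np.abs(residuals))                  # (10 + 10 + 30) / 3
mse = np.mean(residuals ** 2)                     # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                               # same unit as Fees
ss_res = np.sum(residuals ** 2)                   # 1100
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20000
r2 = 1 - ss_res / ss_tot                          # 1 - 1100 / 20000

print(mae, mse, rmse, r2)
```

RMSE and MAE are in the same unit as Fees, which makes them easier to interpret than MSE.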

We select the Gradient Boosting Regressor as our final model, as it gives better results.

# Saving the Model

In [None]:
import pickle
filename='doctor_con.pkl'
# The with block closes the file once the model is written
with open(filename,'wb') as f:
    pickle.dump(gbr1,f)

# Loading Model

In [None]:
with open('doctor_con.pkl','rb') as f:
    load_model=pickle.load(f)

In [None]:
load_model
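A quick way to gain confidence in a saved model is a round-trip check: dump it, load it back, and confirm that both objects give the same predictions. The sketch uses a tiny `LinearRegression` on made-up numbers and a temporary file named `toy_model.pkl`, both of which are assumptions of this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset and model standing in for the final regressor
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Write to a temporary location, then read the model back
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.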

In [None]:
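# Align test columns with the training features; dummy columns missing from the test set are filled with 0
doctor_test=doctor_test.reindex(columns=x.columns,fill_value=0)
prediction=load_model.predict(doctor_test)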

In [None]:
prediction