# Mini project: Medical Appointment No Shows

Skills tested:
- Using Pandas to access and explore the dataset.
- Using Pandas to clean columns and choose features.
- Using Scikit-Learn to preprocess the data before training.
- Using the decision tree classifier and random forest classifier in classifying and testing the data.

Description

Medical appointments are time commitments doctors make with their patients. However, some people do not show up (for different reasons), which causes lost time and money for the doctor. It is time for you to build models that predict whether the next appointment is a show or no show!
License: the dataset is licensed under CC BY-NC-SA 4.0 and is publicly available online.
Submission of your project on GitHub is optional. If you choose to manage your project using GitHub, find guidelines for using GitHub here. Write your code in a Jupyter Notebook; it will be uploaded to GitHub when you perform a Git push operation.

Expected output

By the end of this mini project, you will need to deliver within your code:
- Multiple accuracy measures, one for each split criterion used to train your decision tree classifiers.
- Multiple accuracy measures, one for each number of estimators used in your random forest classifiers.
- One printed confusion matrix for the best model.

You are expected to write around 35 lines of code to complete this project.




## Download the dataset

## Read the dataset

In [96]:
import os

# Move to the project directory so the relative path dat/no-shows.csv resolves
os.chdir('c:\\Users\\cbeer\\Desktop\\data-science-learning\\python-for-machine-learning')



In [97]:
import pandas as pd

dat = pd.read_csv("dat/no-shows.csv")

dat # 110527 rows × 14 columns


Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,2.987250e+13,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,5.589978e+14,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4.262962e+12,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,8.679512e+11,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8.841186e+12,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,2.572134e+12,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,3.596266e+12,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,1.557663e+13,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,9.213493e+13,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


In [98]:
len(dat) == len(dat.dropna()) # There are no missing values

True
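
Besides missing values, it can help to look at how the target classes are distributed before modelling. The following is a minimal sketch on a tiny made-up DataFrame (`toy` and its values are hypothetical stand-ins for `dat`, not rows from the real file):

```python
import pandas as pd

# Hypothetical stand-in for `dat`; the real data comes from dat/no-shows.csv.
toy = pd.DataFrame({
    "Age": [62, 56, 8, 21, 38],
    "No-show": ["No", "No", "No", "Yes", "No"],
})

# Count missing values per column (all zeros means nothing to impute).
missing_per_column = toy.isna().sum()

# Share of each target class; a heavy imbalance makes raw accuracy hard to interpret.
class_share = toy["No-show"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

Running the same two calls on `dat` would show whether one class dominates the target.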

## Feature extraction

In [99]:
## Select the feature columns used for training

features = ["Gender", "Age", "Scholarship", "Hipertension", "Diabetes", "Alcoholism",
            "Handcap", "SMS_received"]

x = dat[features]


## Preprocessing

In [100]:
# We need to scale x and encode both Gender and the target

from sklearn.preprocessing import LabelEncoder

## Encode target: LabelEncoder sorts the labels, so "No" becomes 0 and "Yes" becomes 1

y = dat['No-show'].values

y = LabelEncoder().fit_transform(y)

## Encode gender (assign returns a new DataFrame, so no slice of dat is modified in place)

x = x.assign(Gender=LabelEncoder().fit_transform(x['Gender']))

## Scale x

from sklearn.preprocessing import StandardScaler

scaled_x = StandardScaler().fit_transform(x)

scaled_x


array([[-0.73383659,  1.07793239, -0.33011206, ..., -0.1770676 ,
        -0.13772244, -0.68761155],
       [ 1.36270119,  0.81830565, -0.33011206, ..., -0.1770676 ,
        -0.13772244, -0.68761155],
       [-0.73383659,  1.07793239, -0.33011206, ..., -0.1770676 ,
        -0.13772244, -0.68761155],
       ...,
       [-0.73383659, -0.69618366, -0.33011206, ..., -0.1770676 ,
        -0.13772244,  1.45430948],
       [-0.73383659,  0.03942544, -0.33011206, ..., -0.1770676 ,
        -0.13772244,  1.45430948],
       [-0.73383659,  0.73176341, -0.33011206, ..., -0.1770676 ,
        -0.13772244,  1.45430948]])
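
To see which integer each label receives, the fitted encoder can be kept around and inspected. This is a small sketch on a hand-written list of labels (the list and the name `label_to_code` are illustrative, not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Keep a reference to the encoder so its learned classes can be inspected.
encoder = LabelEncoder()
encoded = encoder.fit_transform(["No", "Yes", "No", "No"])

# classes_ holds the sorted labels; a label's position is its integer code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}

print(label_to_code)
print(encoded)
```

With this mapping, class 1 in `y` stands for a missed appointment and class 0 for a patient who showed up.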

## Splitting the data

In [101]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows, keeping 80% for training

x_train, x_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.2, random_state=0)

# Split the held-out 20% in half: 10% test, 10% validation

x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=0)
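
The notebook scales the full dataset before splitting, so the scaler's mean and variance also reflect the held-out rows. One possible alternative, sketched here on synthetic data, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The arrays `X_demo` and `y_demo` are randomly generated and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Split first, so the held-out rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The pipeline fits StandardScaler on X_tr only, then reuses it at predict time.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_tr, y_tr)

demo_accuracy = model.score(X_te, y_te)
print(demo_accuracy)
```

For tree-based models the scaling step rarely changes the predictions, since splits depend only on the ordering of feature values; the pattern matters more for models that are sensitive to feature scale.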

## Training tree-based classifiers


In [104]:
from sklearn.tree import DecisionTreeClassifier

classifier_1 = DecisionTreeClassifier(criterion='entropy', random_state=0)

classifier_2 = DecisionTreeClassifier(criterion='gini', random_state=0)

classifier_1.fit(x_train, y_train)

y_pred_1 = classifier_1.predict(x_test)

classifier_2.fit(x_train, y_train)

y_pred_2 = classifier_2.predict(x_test)

from sklearn.metrics import accuracy_score

print("entropy score is: ", str(accuracy_score(y_test, y_pred_1)), "\n", "gini score is ",
str(accuracy_score(y_test, y_pred_2))) 

from sklearn.metrics import confusion_matrix

matrix_1 = confusion_matrix(y_test, y_pred_1)

matrix_2 = confusion_matrix(y_test, y_pred_2)

print("\n")

print("the confusion matrix for entropy is:", "\n", matrix_1)

print("\n")

print("the confusion matrix for gini is:", "\n", matrix_2)


# Entropy scores slightly higher than gini (0.7957 vs 0.7953), but the difference is negligible

entropy score is:  0.7957115715190446 
 gini score is  0.7952592056455261


the confusion matrix for entropy is: 
 [[8760   73]
 [2185   35]]


the confusion matrix for gini is: 
 [[8756   77]
 [2186   34]]
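
Both confusion matrices show that the trees predict class 0 (the patient showed up) for almost every appointment. The sketch below recomputes the accuracy from the entropy matrix printed above and derives two further figures from the same four counts: the accuracy of always predicting class 0, and the recall on class 1 (actual no-shows). The variable names are illustrative.

```python
import numpy as np

# Entropy confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
matrix = np.array([[8760, 73],
                   [2185, 35]])

total = matrix.sum()

# Correct predictions sit on the diagonal.
accuracy = np.trace(matrix) / total

# Accuracy of a model that always predicts class 0.
majority_baseline = matrix[0].sum() / total

# Fraction of actual no-shows (class 1) that the tree identified.
no_show_recall = matrix[1, 1] / matrix[1].sum()

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"no-show recall:    {no_show_recall:.4f}")
```

On these counts the tree's accuracy (about 0.796) is slightly below the 0.799 obtained by always predicting class 0, and fewer than 2% of actual no-shows are caught, so accuracy alone says little about how well the model finds no-shows.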


## Random forest



In [None]:
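from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

score = []  # test accuracy of each forest

n = []  # number of estimators used by each forest

# Training 1,000 forests is slow; a coarser grid such as range(10, 201, 10) runs much faster

for i in range(1, 1001):

    random_forest = RandomForestClassifier(n_estimators=i, random_state=0)

    random_forest.fit(x_train, y_train)

    y_pred = random_forest.predict(x_test)

    score.append(accuracy_score(y_test, y_pred))

    n.append(i)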
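
The expected output also asks for one confusion matrix for the best model. One way to get there, sketched here on synthetic data, is to pick the number of estimators on the validation split and report the confusion matrix on the test split. The data from `make_classification`, the three candidate sizes, and all variable names are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the notebook would use scaled_x and y.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# Same 80/10/10 scheme as the notebook: train, then test and validation halves.
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_te, X_va, y_te, y_va = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Score each candidate forest size on the validation split.
val_scores = {}
for n_estimators in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    forest.fit(X_tr, y_tr)
    val_scores[n_estimators] = accuracy_score(y_va, forest.predict(X_va))

# Refit the best size and evaluate it once on the untouched test split.
best_n = max(val_scores, key=val_scores.get)
best_forest = RandomForestClassifier(n_estimators=best_n, random_state=0).fit(X_tr, y_tr)
best_matrix = confusion_matrix(y_te, best_forest.predict(X_te), labels=[0, 1])

print(val_scores)
print(best_matrix)
```

Choosing on the validation split and reporting on the test split keeps the final confusion matrix from being biased by the selection step.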



