# D210 - Reporting and Representation

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell 

pd.set_option('display.max_columns', None)
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('medical_clean.csv', index_col='Customer_id')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
# Take an explicit copy so the in-place rename below does not act on a view of df
survey_df = df[['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8']].copy()

In [None]:
survey_df.rename(columns={'Item1':'Timely_Admission', 'Item2':'Timely_Treatment', 'Item3':'Timely_Visits', 'Item4':'Reliability',
                          'Item5':'Options', 'Item6':'HoursofTreatment', 'Item7':'Courteous_Staff', 'Item8':'Active_Listening_Doctor'}, inplace=True)

In [None]:
df.drop(['Item1','Item2','Item3','Item4','Item5','Item6','Item7','Item8'], axis=1, inplace=True)

In [None]:
survey_df.head()

# § Data Cleaning  

### ▶ Detection and Treatment of Null Values

In [None]:
df.isnull().sum()

### ▶ Detection and Treatment of Duplicated Values

In [None]:
df.duplicated().value_counts()

# § Readmission Prediction using Random Forest

In [None]:
# Creating X and y data
X = df[['Area','Income','Marital','Gender','VitD_levels','Doc_visits',
       'Full_meals_eaten','vitD_supp','Soft_drink','Initial_admin',
       'HighBlood','Stroke','Complication_risk','Overweight','Arthritis',
       'Diabetes','Hyperlipidemia','BackPain','Anxiety','Allergic_rhinitis',
       'Reflux_esophagitis','Asthma','Services','Initial_days','TotalCharge',
        'Additional_charges']]
# Keep the target 1-D; scikit-learn classifiers expect y with shape (n_samples,)
y = df['ReAdmis'].values

In [None]:
X.head()

In [None]:
X = pd.get_dummies(data=X, columns=['Area','Marital','Gender','Soft_drink',
                                    'Initial_admin','HighBlood','Stroke','Overweight','Arthritis',
                                    'Diabetes','Hyperlipidemia','BackPain','Anxiety','Allergic_rhinitis',
                                    'Reflux_esophagitis','Asthma','Services'], drop_first=True)


In [None]:
X.head()

In [None]:
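# Encode the column "Complication_risk" as numeric codes.
# Note: OrdinalEncoder sorts categories alphabetically unless `categories` is passed.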
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X['Complication_risk'] = enc.fit_transform(X[['Complication_risk']])
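
With its default settings, `OrdinalEncoder` assigns codes in sorted (alphabetical) order rather than in any domain order. The sketch below illustrates the difference on a toy frame. It assumes the risk levels are `Low`, `Medium`, and `High`; the `toy`, `default_enc`, and `ordered_enc` names are hypothetical and not part of the analysis above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data; the three level names are an assumption about the column's values
toy = pd.DataFrame({'Complication_risk': ['Low', 'Medium', 'High', 'Low']})

# Default behaviour: categories are sorted, so High=0, Low=1, Medium=2
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(toy[['Complication_risk']]).ravel()

# Explicit order: Low=0, Medium=1, High=2
ordered_enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered_codes = ordered_enc.fit_transform(toy[['Complication_risk']]).ravel()

print(default_codes.tolist())  # [1.0, 2.0, 0.0, 1.0]
print(ordered_codes.tolist())  # [0.0, 1.0, 2.0, 0.0]
```

Tree-based models split on thresholds, so they can often cope with either coding, but the explicit order keeps the codes interpretable.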

In [None]:
X.head()

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

In [None]:
X.head()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=53)

pipe = Pipeline([('model', RandomForestClassifier(random_state=53))])

pipe.get_params()


In [None]:
mod = GridSearchCV(estimator=pipe,
                   param_grid={'model__max_depth': [1,2,3,4,5,6,7,8,9,10]},
                   cv=5,
                   n_jobs=-1)

In [None]:
mod.fit(X_train, y_train)

In [None]:
print(f'Best parameters for the Random Forest: {mod.best_params_}')
print(f'Best mean cross-validated accuracy: {mod.best_score_:.4f}')
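
Because `X` is scaled before `train_test_split`, the scaler's statistics are computed from all rows, including those later held out. One alternative, sketched here on synthetic data, is to place the scaler inside the `Pipeline` so that `GridSearchCV` refits it on each training fold. The `make_classification` data, the `demo_*` names, the smaller grid, and `n_estimators=50` are illustrative assumptions chosen to keep the example fast; they are not part of this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the encoded feature matrix (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=53)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=53)

# The scaler is a pipeline step, so it is fit only on the training part of each CV fold
demo_pipe = Pipeline([
    ('scaler', RobustScaler()),
    ('model', RandomForestClassifier(n_estimators=50, random_state=53)),
])

demo_search = GridSearchCV(estimator=demo_pipe,
                           param_grid={'model__max_depth': [2, 4, 6]},
                           cv=5)
demo_search.fit(X_tr, y_tr)

demo_test_accuracy = demo_search.score(X_te, y_te)
print(demo_search.best_params_)
print(demo_test_accuracy)
```

Random forests are largely insensitive to feature scaling, so the practical effect here is probably small; the pattern matters more when the pipeline is reused with scale-sensitive models.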


In [None]:
y_pred = mod.predict(X_test)

In [None]:
# Create confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show();

The confusion matrix splits the 3,000 test predictions into four groups, with "Yes" (readmitted) as the positive class:  
1) **1898** - True negatives: the model predicted the negative class (the patient is not readmitted to the hospital), and the actual outcome was also **negative**.
2) **1042** - True positives: the model predicted the positive class (the patient is readmitted to the hospital), and the actual outcome was also **positive**.
3) **32** - False positives: the model predicted the positive class, but the actual outcome was negative.
4) **28** - False negatives: the model predicted the negative class, but the actual outcome was positive.

These assignments agree with the class supports in the classification report below: 1898 + 32 = 1,930 actual "No" cases and 28 + 1042 = 1,070 actual "Yes" cases.
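
As a quick arithmetic check, the four cells can be unpacked and compared against the class totals. This is a minimal sketch using only the counts read off the heatmap. The placement of 32 and 28 is inferred from the support figures in the classification report (1,930 "No" and 1,070 "Yes"), and `cm_counts` is a hypothetical hand-entered copy of `cm`.

```python
# Rows are actual classes, columns are predicted classes, both ordered ['No', 'Yes']
cm_counts = [[1898, 32],
             [28, 1042]]

# Unpack in the order: true negatives, false positives, false negatives, true positives
(tn, fp), (fn, tp) = cm_counts

actual_no = tn + fp       # all patients who were not readmitted
actual_yes = fn + tp      # all patients who were readmitted
total = actual_no + actual_yes
accuracy = (tn + tp) / total

print(actual_no, actual_yes, total)  # 1930 1070 3000
print(round(accuracy, 2))            # 0.98
```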

In [None]:
# Print classification report for additional performance metrics

print(classification_report(y_test, y_pred))

The classification report breaks down the model's performance further. Some key takeaways are:

1. **Precision**:
   - For the "No" class: Precision is 0.99, which means that when the model predicts "No" (negative class), it is correct 99% of the time.
   - For the "Yes" class: Precision is 0.97, indicating that when the model predicts "Yes" (positive class), it is correct 97% of the time.  
2. **Recall (Sensitivity)**:
   - For the "No" class: Recall is 0.98, meaning that the model correctly identifies 98% of the actual "No" cases.
   - For the "Yes" class: Recall is 0.97, indicating that the model captures 97% of the actual "Yes" cases.
3. **F1-Score**:
   - For the "No" class, the F1-score is 0.98. The F1-score is the harmonic mean of precision and recall, so it is high only when both are high.
   - For the "Yes" class, the F1-score is 0.97, reflecting the balance between precision and recall for the "Yes" class.
4. **Support**:
   - The "support" column shows the number of instances in each class in the test dataset.
     - For the "No" class, there are 1,930 instances.
     - For the "Yes" class, there are 1,070 instances.
5. **Accuracy**:
   - The overall accuracy of the model is 0.98, or 98%. This indicates that 98% of the predictions (both "Yes" and "No" combined) are correct.
6. **Macro Avg**:
   - The "macro avg" row shows the unweighted mean of precision, recall, and F1-score across the two classes, so each class counts equally regardless of size. In this case, each average is 0.98.
7. **Weighted Avg**:
   - The "weighted avg" row averages precision, recall, and F1-score with each class weighted by its support, so the larger "No" class contributes more. In this case, the weighted average is 0.98.

In summary, the model appears to perform very well, with high precision, recall, and F1-scores for both the "Yes" and "No" classes. The high accuracy of 98% suggests that the model is effective at correctly classifying instances.
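
The figures in the classification report can be reproduced by hand from the four confusion-matrix counts. The sketch below does this in plain Python; the helper name `prf` and the variable names are hypothetical, and the counts are those implied by the supports (1,930 "No" and 1,070 "Yes").

```python
# Counts from the confusion matrix, with "Yes" as the positive class
tn, fp, fn, tp = 1898, 32, 28, 1042

def prf(correct, wrongly_predicted, missed):
    """Return (precision, recall, F1) for one class from its three counts."""
    precision = correct / (correct + wrongly_predicted)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For "No", a false negative is a case wrongly predicted as "No"
no_p, no_r, no_f1 = prf(tn, fn, fp)
yes_p, yes_r, yes_f1 = prf(tp, fp, fn)

support_no, support_yes = tn + fp, fn + tp

# Macro average: plain mean over classes; weighted average: mean weighted by support
macro_f1 = (no_f1 + yes_f1) / 2
weighted_f1 = (no_f1 * support_no + yes_f1 * support_yes) / (support_no + support_yes)

print(round(no_p, 2), round(no_r, 2), round(no_f1, 2))     # 0.99 0.98 0.98
print(round(yes_p, 2), round(yes_r, 2), round(yes_f1, 2))  # 0.97 0.97 0.97
print(round(macro_f1, 2), round(weighted_f1, 2))           # 0.98 0.98
```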