# CASE STUDY : LOAN PREDICTION ANALYTICS

# Dataset Information

   
Variable | Description
----------|--------------
Loan_ID | Unique Loan ID
Gender | Male/ Female
Married | Applicant married (Y/N)
Dependents | Number of dependents
Education | Applicant Education (Graduate/ Not Graduate)
Self_Employed | Self employed (Y/N)
ApplicantIncome | Applicant income
CoapplicantIncome | Coapplicant income
LoanAmount | Loan amount in thousands
Loan_Amount_Term | Term of loan in months
Credit_History | credit history meets guidelines
Property_Area | Urban/ Semi Urban/ Rural
Loan_Status | Loan approved (Y/N)

# Import modules: Load Essential Python Libraries

# Data Engineering Part 

Important libraries for data engineering tasks:

1- NumPy (Numerical Python)

2- Pandas

3- Matplotlib

4- Seaborn

# What is NumPy:
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

https://numpy.org/
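
As a minimal sketch of the description above, the snippet below builds a small 2-dimensional array and applies array-wide operations to it. The values are made up for illustration and are not taken from the loan dataset.

```python
import numpy as np

# A 2 x 3 array (matrix); the numbers are illustrative only
incomes = np.array([[5849, 4583, 3000],
                    [2583, 6000, 5417]])

print(incomes.shape)          # dimensions of the array: (rows, columns)
print(incomes.mean(axis=0))   # column-wise mean
print(np.log(incomes + 1))    # element-wise function applied to every entry
```

The same element-wise `np.log` call is what the log transformation cells later in this notebook rely on.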

# What is Pandas:
pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

https://pandas.pydata.org/docs/index.html

https://www.w3schools.com/python/pandas/pandas_dataframes.asp

https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm

# What is Matplotlib:

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. https://matplotlib.org/

# What is Seaborn:
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures. Seaborn helps you explore and understand your data. https://seaborn.pydata.org/

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

In [2]:
#https://research.google.com/colaboratory/
#from google.colab import drive  
#drive.mount('/content/drive')

## Loading the dataset

# How to Create a DataFrame with Pandas
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
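
The "dict of Series" view can be sketched as follows. This is a toy example with illustrative values; `toy_df`, `gender` and `income` are names chosen here and are not part of the loan dataset code.

```python
import pandas as pd

# Illustrative values only; column names follow the dataset description above
gender = pd.Series(["Male", "Male", "Female"])
income = pd.Series([5849, 4583, 3000])

# A DataFrame built from a dict of Series: each key becomes a column
toy_df = pd.DataFrame({"Gender": gender, "ApplicantIncome": income})

print(toy_df)
print(toy_df.dtypes)   # columns can have different types
```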

https://www.w3schools.com/python/pandas/pandas_dataframes.asp

Load comma-separated values (CSV) data with the pd.read_csv method.

https://sites.google.com/a/umt.edu.pk/datascience/data-engineering-pundas

At the link above you will find a code file with various examples of creating a DataFrame. Download the file 02- PANDAS DATAFRAMES.IPYNB.

# Youtube Videos 
Tutorial 5- Pandas, Data Frame and Data Series Part-1

https://www.youtube.com/watch?v=QUClKFFn1Vk&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe&index=9

Tutorial 6- Pandas,Reading CSV files With Various Parameters-

https://www.youtube.com/watch?v=tW1BWtQRZ2M&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe&index=9

In [4]:
df = pd.read_csv("Loan Prediction Dataset.csv")
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


# Exploratory Data Analysis (EDA)
- Need to get a better understanding of the given data.
- Questions like:
- How much data do we have?
- Are there any missing values?
- What is the data type of each column?
- What is the distribution of data in each column?
- Do we see any outliers?

In [None]:
#summary 
df.describe().T

In [None]:
print("Number of rows: ", df.shape[0])
counts = df.describe().iloc[0]
display(pd.DataFrame(counts.tolist(), columns=["Count of values"], index=counts.index.values).transpose())

In [None]:
df.dtypes

In [None]:
df.info()

# Data Wrangling 

# Handling Missing Values & Outlier Handling

To check for missing values in a pandas DataFrame, we use the functions isnull() and notnull(). Both functions help in checking whether a value is NaN or not. They can also be used on a pandas Series to find null values in the series.
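
A minimal sketch of both functions on a toy Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy Series with one missing entry
loan_amount = pd.Series([128.0, np.nan, 66.0])

print(loan_amount.isnull())         # True where the value is NaN
print(loan_amount.notnull())        # the logical inverse of isnull()
print(loan_amount.isnull().sum())   # number of missing values
```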


# Useful blogs 

https://towardsdatascience.com/8-methods-for-handling-missing-values-with-python-pandas-842544cdf891

https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/

https://www.tutorialspoint.com/python_pandas/python_pandas_missing_data.htm

https://sites.google.com/a/umt.edu.pk/datascience/data-engineering-pundas 

Download this file for practice 

# Youtube Video 

https://www.youtube.com/watch?v=uDr67HBIPz8


https://www.youtube.com/watch?v=fCMrO_VzeL8

https://www.youtube.com/watch?v=EaGbS7eWSs0

In [None]:
df.isnull().any()

In [None]:
# Heatmap to check missing values 

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
# Number of Missing Values

In [None]:
# find the null values
df.isnull().sum()

In [None]:
# fill the missing values for numerical terms - mean
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())

In [None]:
# fill the missing values for categorical terms - mode
df['Gender'] = df["Gender"].fillna(df['Gender'].mode()[0])
df['Married'] = df["Married"].fillna(df['Married'].mode()[0])
df['Dependents'] = df["Dependents"].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df["Self_Employed"].fillna(df['Self_Employed'].mode()[0])

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

# What is Pandas Groupby

Pandas groupby is used to group the data by category and apply a function to each group.

It also helps to aggregate data efficiently.

The pandas DataFrame.groupby() function splits the data into groups based on some criteria.

pandas objects can be split on any of their axes.

What does a groupby do?
Group by is one of the most frequently used SQL clauses. It allows you to collapse a field into its distinct values.
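
The split, apply, combine idea can be sketched on a toy frame. The column names mirror the dataset, but the rows below are invented for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Urban", "Rural"],
    "LoanAmount": [100.0, 120.0, 140.0, 80.0],
})

# Split rows by Property_Area, apply mean to LoanAmount,
# and combine the results into one row per distinct area
mean_by_area = toy_df.groupby("Property_Area", as_index=False)["LoanAmount"].mean()
print(mean_by_area)
```

Grouping by a categorical column such as Property_Area collapses it into its distinct values, which is the same behaviour as GROUP BY in SQL.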

# Useful links

https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm

https://sites.google.com/a/umt.edu.pk/datascience/data-engineering-pundas

From this page, download the file 04- PANDAS GROUPBY.IPYNB.

Also download the file PANDASDS.IPYNB together with the Data.rar folder.

# Youtube Video

Python Pandas Tutorial 7. Group By (Split Apply Combine)

https://www.youtube.com/watch?v=Wb2Tp35dZ-I

In [None]:
# group by command to find the mean Loan_Amount_Term for each (LoanAmount, ApplicantIncome) pair
df_Loan_Loan_Amount_Term = df.groupby(["LoanAmount", "ApplicantIncome"], as_index=False).Loan_Amount_Term.mean()

print(df_Loan_Loan_Amount_Term)


# Data visualization 
Data visualization is defined as a graphical representation that contains the information and the data. By using visual elements like charts, graphs, and maps, data visualization techniques provide an accessible way to see and understand trends, outliers, and patterns in data

# Useful links for customizing Matplotlib plots through rcParams
https://matplotlib.org/3.5.0/tutorials/introductory/customizing.html

https://www.programcreek.com/python/example/102312/matplotlib.pyplot.rcParams

https://www.tutorialexample.com/understand-matplotlib-rcparams-a-beginner-guide-matplotlib-tutorial/

https://sites.google.com/a/umt.edu.pk/datascience/4-data-visualization-matplotlib-and-seaboan

# Youtube Video

Tutorial 8- Matplotlib (Simple Visualization Library)

https://www.youtube.com/watch?v=czQO1_GEEos&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe&index=11

Tutorial 9- Seaborn Tutorial- Distplot, Joinplot, Pairplot Part 1

https://www.youtube.com/watch?v=UsglokDLa2o&list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe&index=12


In [None]:
# Univariate Analysis:
# categorical attributes visualization
fig,ax = plt.subplots(2,4,figsize=(16,10))
sns.countplot(x='Loan_Status', data = df, ax=ax[0][0])
sns.countplot(x='Gender', data = df, ax=ax[0][1])
sns.countplot(x='Married', data = df, ax=ax[0][2])
sns.countplot(x='Education', data = df, ax=ax[0][3])
sns.countplot(x='Self_Employed', data = df, ax=ax[1][0])
sns.countplot(x='Property_Area', data = df, ax=ax[1][1])
sns.countplot(x='Credit_History', data = df, ax=ax[1][2])
sns.countplot(x='Dependents', data = df, ax=ax[1][3])

# Checking for data imbalance

The class imbalance problem typically occurs when there are many more instances of some classes than others. In such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones.

# Useful Blogs
https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/

https://machinelearningmastery.com/what-is-imbalanced-classification/

In [None]:
sns.countplot(x='Loan_Status', data=df)
# share of each class as a percentage (Loan_Status still holds 'Y'/'N' at this point)
class_share = df['Loan_Status'].value_counts(normalize=True) * 100
print('The percentage of Y class : %.2f' % class_share['Y'])
print('The percentage of N class : %.2f' % class_share['N'])


In [None]:
# total income
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df.head()

In [None]:
# Bivariate Analysis
sns.boxplot(x='Loan_Status', y='Total_Income', data=df)

In [None]:
sns.boxplot(x='Total_Income', y='Gender', data=df)

In [None]:
#The mean value of Loan Amount applied by males (0) is slightly higher than Females(1).

In [None]:
Credit_History = pd.crosstab(df['Credit_History'], df['Loan_Status'])
Credit_History.div(Credit_History.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(6,4))
plt.legend(loc = 'best')

In [None]:
Property_Area = pd.crosstab(df['Property_Area'], df['Loan_Status'])
Property_Area.div(Property_Area.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(6,4))
plt.legend(bbox_to_anchor=(1.05,1.0),loc='best')

# Probability and Statistics:
The goal of these questions is to test your ability to answer probability and stat questions with code.
Use whatever libraries you are comfortable with.
Code clarity and cleanliness are also highly valuable.

https://sites.google.com/a/umt.edu.pk/datascience/5--statistics-and-probability

https://www.kaggle.com/code/carlolepelaars/statistics-tutorial

# Youtube playlist

https://www.youtube.com/watch?v=dmHcFQQPGEE&list=PLVgEzPHodXi1wT9OK8B_W6Hs8Xc-gaG6N&index=1

# Understanding distribution plots with Matplotlib and Seaborn



# Probability distributions on Youtube

1- Python for Data Analysis: Probability Distributions

https://www.youtube.com/watch?v=uial-2girHQ

2- Probability - Simulation to See Probability in Python

https://www.youtube.com/watch?v=4YPt3BHuJEE

3- Statistics Using Python

https://www.youtube.com/watch?v=mQ-3KwrBIN0

4- Tutorial 25- Probability Density function and CDF- EDA-Data Science

https://www.youtube.com/watch?v=PYIjkw0HN1Q&t=2s

In [None]:
# numerical attributes visualization
sns.distplot(df["ApplicantIncome"])

In [None]:
sns.distplot(df["CoapplicantIncome"])

In [None]:
sns.distplot(df["LoanAmount"])

In [None]:
sns.distplot(df['Loan_Amount_Term'])

In [None]:
sns.distplot(df['Credit_History'])

## Log Transformation

In [None]:
# apply log transformation to the attribute
df['ApplicantIncomeLog'] = np.log(df['ApplicantIncome']+1)
sns.distplot(df["ApplicantIncomeLog"])

In [None]:
df['CoapplicantIncomeLog'] = np.log(df['CoapplicantIncome']+1)
sns.distplot(df["CoapplicantIncomeLog"])

In [None]:
df['LoanAmountLog'] = np.log(df['LoanAmount']+1)
sns.distplot(df["LoanAmountLog"])

In [None]:
df['Loan_Amount_Term_Log'] = np.log(df['Loan_Amount_Term']+1)
sns.distplot(df["Loan_Amount_Term_Log"])

In [None]:
df.head()

In [None]:
df['Total_Income_Log'] = np.log(df['Total_Income']+1)
sns.distplot(df["Total_Income_Log"])

In [None]:
# What's the probability of a customer having 1 credit History 

In [None]:
# Here we get the count of rows having Credit_History == 1; df.Credit_History == 1 is the condition to filter rows
# shape[0] returns the count of rows.
a = df[df.Credit_History == 1].shape[0]
total = df.shape[0] # getting total rows
p_a = a/total # probability formula --> p(a) = event/total
print(f"Probability of customer having 1 Credit_History is {round(p_a,4)}")

## Correlation Matrix

In [None]:
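# correlation is only defined for numeric columns, so select them explicitly
corr = df.select_dtypes(include='number').corr()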
plt.figure(figsize=(15,10))
sns.heatmap(corr, annot = True, cmap="BuPu")

In [None]:
df.select_dtypes(include='number').corr()

In [None]:
df.head()

# Data Preprocessing & Feature Engineering
- Feature space is almost unchanged
- Remove irrelevant columns
- Convert strings to boolean values

In [None]:
# drop unnecessary columns
cols = ['ApplicantIncome', 'CoapplicantIncome', "LoanAmount", "Loan_Amount_Term", "Total_Income", 'Loan_ID', 'CoapplicantIncomeLog']
df = df.drop(columns=cols, axis=1)
df.head()

# Feature Engineering 

# Data Preprocessing with Label Encoder
What is a label encoder? Label encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering. Let's see how to implement label encoding in Python using the scikit-learn library and also understand the challenges with label encoding.
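
A minimal sketch of the alphabetical mapping, using scikit-learn's `LabelEncoder` on a toy list. The category strings follow the Property_Area description at the top of this notebook and are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

areas = ["Urban", "Rural", "Semi Urban", "Urban"]   # illustrative values

le = LabelEncoder()
codes = le.fit_transform(areas)

print(list(le.classes_))                   # distinct labels, sorted alphabetically
print(list(codes))                         # integer code assigned to each input value
print(list(le.inverse_transform(codes)))   # map the codes back to the labels
```

One challenge with this encoding is that the integers suggest an order (here Rural < Semi Urban < Urban) that the categories may not really have.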

# Useful links
https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/

#sklearn.preprocessing.LabelEncoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?highlight=labelencoder


# Youtube video
Label encoder example in python code

https://www.youtube.com/watch?v=UtgrhBr3kTw

https://www.youtube.com/watch?v=OTPz5plKb40&t=281s


# Pre processing with Standard Scaler
In machine learning, StandardScaler rescales each feature so that the mean of the observed values is 0 and the standard deviation is 1.
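
A short sketch of that rescaling on a toy matrix. `X_toy` is a made-up array, not the loan data; each value becomes z = (x - mean) / std of its column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative numbers)
X_toy = np.array([[1000.0, 10.0],
                  [2000.0, 20.0],
                  [3000.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

print(X_scaled.mean(axis=0))   # approximately 0 for every column
print(X_scaled.std(axis=0))    # approximately 1 for every column
```

In a modelling workflow the scaler is usually fitted on the training split only and then applied to the test split with `scaler.transform`.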

# Useful links
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standard%20scaler#sklearn.preprocessing.StandardScaler

https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/

https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/

# Youtube Video
1- Standardization Vs Normalization- Feature Scaling

https://www.youtube.com/watch?v=mnKm3YP56PY&t=52s

2- Using Standard Scaler to scale features

https://www.youtube.com/watch?v=CYd_u_4P_lQ

3- Data Preprocessing 01: StandardScaler Machine Learning

https://www.youtube.com/watch?v=ZddUwo4R5ug


In [None]:
from sklearn.preprocessing import LabelEncoder
cols = ['Gender',"Married","Education",'Self_Employed',"Property_Area","Loan_Status","Dependents"]
le = LabelEncoder()
for col in cols:
    df[col] = le.fit_transform(df[col])

In [None]:
df.head()

# Modeling

## Train-Test Split

In [None]:
# specify input and output attributes
X = df.drop(columns=['Loan_Status'], axis=1)
y = df['Loan_Status']

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Model Training

# Cross Validation
Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.
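
In k-fold cross-validation this two-segment split is repeated k times so that every sample is used for validation exactly once. A minimal sketch with scikit-learn's `KFold` on a toy array of 5 samples (illustrative data only):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(5, 2)   # 5 samples, 2 features

kf = KFold(n_splits=5)
folds = list(kf.split(X_toy))

# Each fold holds out a different part of the data for validation
for train_idx, test_idx in folds:
    print("train:", train_idx, "validate:", test_idx)
```

With 5 folds and 5 samples, each round trains on 4 samples and validates on the remaining 1.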

# Useful Link
https://www.analyticsvidhya.com/blog/2021/05/importance-of-cross-validation-are-evaluation-metrics-enough/

https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85

#sklearn.model_selection.cross_val_score

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross_val_score

# Youtube Video
1- Machine Learning Tutorial Python 12 - K Fold Cross Validation

https://www.youtube.com/watch?v=gJo0uNL-5Qw

2- What is Cross Validation and its types

https://www.youtube.com/watch?v=7062skdX05Y&t=759s

3- Machine Learning Fundamentals: Cross Validation

https://www.youtube.com/watch?v=fSytzGwwBVw

4- All Type Of Cross Validation With Python All In 1 Video

https://www.youtube.com/watch?v=3fzYdnuvEfk

5- Cross Validation in Scikit Learn

https://www.youtube.com/watch?v=L_dQrZZjGDg

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, confusion_matrix,  roc_curve, precision_recall_curve, accuracy_score, roc_auc_score

from sklearn.metrics import roc_curve,auc
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict

In [None]:
# classify function
from sklearn.model_selection import cross_val_score
def classify(model, x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
    model.fit(x_train, y_train)
    print("Accuracy is", model.score(x_test, y_test)*100)
    # cross validation - it is used for better validation of model
    # eg: cv-5, train-4, test-1
    score = cross_val_score(model, x, y, cv=5)
    print("Cross validation is",np.mean(score)*100)

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
classify(model, X, y)

In [None]:
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
model = RandomForestClassifier()
classify(model, X, y)

In [None]:
model = ExtraTreesClassifier()
classify(model, X, y)

In [None]:
# classify function
from sklearn.model_selection import cross_val_score
def classify(modelhy, x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
    modelhy.fit(x_train, y_train)
    print("Accuracy is", modelhy.score(x_test, y_test)*100)
    # cross validation - it is used for better validation of model
    # eg: cv-5, train-4, test-1
    score = cross_val_score(modelhy, x, y, cv=5)
    print("Cross validation is",np.mean(score)*100)

## Hyperparameter tuning

# Tuning Random Forest
n_estimators: The number of trees in the forest.

max_depth: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples

min_samples_leaf: The minimum number of samples required to be at a leaf node.

max_features: The number of features to consider when looking for the best split. If int, then consider max_features features at each split. If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split. If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. The older “auto” option behaved like “sqrt” for classifiers; it was deprecated in scikit-learn 1.1 and removed in 1.3.

max_samples: If bootstrap is True, the number of samples to draw from X to train each base estimator.
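
One possible way to search over such parameters is scikit-learn's `GridSearchCV`. The sketch below is an assumption about how tuning could be done, not the procedure used in this notebook: it runs on synthetic data from `make_classification`, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan features (assumption, for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

# Candidate values for a few of the parameters described above
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, 8],
    "min_samples_leaf": [1, 5],
}

# Every combination is scored with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)   # best combination found on this toy data
print(search.best_score_)    # its mean cross-validated accuracy
```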

# Choose the type of classifier. 

In [None]:
modelhy = RandomForestClassifier(n_estimators=110, min_samples_split=15, max_depth=8, max_features=1)
classify(modelhy, X, y)

## Confusion Matrix

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. It gives us insight not only into the errors being made by a classifier but more importantly the types of errors that are being made.

# How do you evaluate your model?

### - Different situations or business problems call for different metrics

### - The most basic summary is given by the Confusion Matrix
![Figure 1-1](cm.png "Figure 1-1")

### - Type I & II Errors (False Positives / False Negatives)

### - Then you can consider:
>### - Accuracy
>### - Precision
>### - Recall
>### - Lift
>### - Support
>### - Confidence etc.
![Figure 1-2](metrics.png "Figure 1-2")

 ### - Many, many more metrics have been proposed:
![Figure 1-3](many.png "Figure 1-3")
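
To make the link between the confusion matrix and the basic metrics concrete, here is a small sketch on hand-made labels. `y_true` and `y_hat` are invented for illustration and have nothing to do with the loan model.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (illustrative)
y_hat  = [1, 0, 0, 1, 1, 1, 0, 0]   # predicted classes (illustrative)

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# The same metrics computed from the counts and by scikit-learn
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # penalises false positives (Type I errors)
recall = tp / (tp + fn)      # penalises false negatives (Type II errors)

print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```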

In [None]:
modelrm = RandomForestClassifier()
modelrm.fit(x_train, y_train)

In [None]:
y_pred=modelrm.predict(x_test)
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",modelrm.score(x_train,y_train)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))

In [None]:
y_pred = modelrm.predict(x_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
# Test on other models ??

# ROC CURVE 
ROC curves are frequently used to show in a graphical way the connection/trade-off between clinical sensitivity and specificity for every possible cut-off for a test or a combination of tests. In addition the area under the ROC curve gives an idea about the benefit of using the test(s) in question.
AUC - ROC curve is a performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes.
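
A minimal sketch of how the curve and the area under it are computed, on four hand-made scores (illustrative values, not model output):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]             # actual classes
scores = [0.1, 0.4, 0.35, 0.8]    # predicted probability of class 1

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc_value = roc_auc_score(y_true, scores)

print(fpr)
print(tpr)
print(auc_value)
```

Each threshold turns the scores into class predictions, and the resulting false positive and true positive rates form one point on the curve. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.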


# How to plot multiple ROC curves in a single plot


https://www.statology.org/plot-multiple-roc-curves-python/

https://www.imranabdullah.com/2019-06-01/Drawing-multiple-ROC-Curves-in-a-single-plot

# Youtube Video Links 


Performance Metrics(ROC,AUC Curve) For Classification Problem In Machine Learning 
https://www.youtube.com/watch?v=A_ZKMsZ3f3o

How to Plot an ROC Curve in Python

https://www.youtube.com/watch?v=uVJXPPrWRJ0


https://www.youtube.com/watch?v=TEkvKx2tQHU


In [None]:
from sklearn.metrics import roc_curve

In [None]:
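#Get predictions of Random Forest and Logistic Regression models in the form of probability values
# `model` was last bound to ExtraTreesClassifier above, so fit a Logistic Regression explicitly
modellg = LogisticRegression().fit(x_train, y_train)
y_lg_prob = modellg.predict_proba(x_test)[:,1]
y_rfc_prob =  modelrm.predict_proba(x_test)[:,1]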

In [None]:
#For Logistic Regression
fpr, tpr, _ = metrics.roc_curve(y_test,y_lg_prob)
auc = metrics.roc_auc_score(y_test, y_lg_prob)

#For Random Forest
fpr1, tpr1, _1 = metrics.roc_curve(y_test,y_rfc_prob)
auc1 = metrics.roc_auc_score(y_test, y_rfc_prob)

#create ROC curve
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',label='Random', alpha=.8)
plt.plot(fpr,tpr,label="LR AUC = "+str(round(auc,3)))
plt.plot(fpr1,tpr1,label="RFC AUC = "+str(round(auc1,3)))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()

In [None]:
# Getting feature importances for Random Forest model
(pd.Series(modelrm.feature_importances_, index=x_train.columns).plot(kind='barh'))

In [None]:
#understanding which variable(s) have the largest impact on the outcome.
featimp = pd.Series(modelrm.feature_importances_, index=x_train.columns).sort_values(ascending=False) 
print(featimp)

# Thanks 
# Contact 

https://www.linkedin.com/in/mazhar-javed-42587046/

mazharjaved2001@yahoo.com



# WhatsAPP +923334461420