<a href="https://colab.research.google.com/github/anithathavamani/anitha-/blob/main/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will use logistic regression to model the "Desired placement" data set. The model will predict which people are likely to get a placement.




## Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt       # matplotlib.pyplot plots data
%matplotlib inline
import seaborn as sns

## Load and review data

In [None]:
pdata = pd.read_csv("/content/archive (3).zip")


In [None]:
pdata.shape # Check number of columns and rows in data frame

In [None]:
pdata.head() # To check first 5 rows of data set

In [None]:
pdata.isnull().values.any() # True if the data set contains any null values

In [None]:
columns = list(pdata)[0:-1] # All columns except the last one (assumes the outcome column is last)
pdata[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2));
# Histograms of every column except the last

## Identify Correlation in data

In [None]:
pdata.corr() # It will show correlation matrix

In [None]:
# Plot the correlation matrix as a colour-coded grid, which is easier to read than the raw table
def plot_corr(df, size=5):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))  # honour the size argument instead of a fixed 5x5
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

In [None]:
plot_corr(pdata)

In the plot above, yellow represents the highest correlation and blue represents the lowest.
We can see that no variable is strongly correlated with any other variable.
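
Reading correlations off a colour grid is approximate, so it can help to list the strongest pairs numerically. The following is a minimal sketch, not part of the original notebook: `top_correlations` and the `toy` frame are hypothetical names, and with the notebook's data you would pass `pdata` instead.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Hypothetical toy data, used only to show the output shape
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exactly 2 * a, so perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})
top = top_correlations(toy, n=3)
print(top)
```

If the largest values in this list are small, that supports the reading of the plot that no pair of variables is strongly correlated.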

In [None]:
sns.pairplot(pdata,diag_kind='kde')

## Calculate the True/False ratio of the placement outcome variable

In [None]:
n_true = len(pdata.loc[pdata['placement'] == True])
n_false = len(pdata.loc[pdata['placement'] == False])
print("Number of true cases: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
print("Number of false cases: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))

So 34.90% of the people in the current data set have a placement, and the remaining 65.10% do not.

This is a reasonable distribution of True/False placement cases in the data.
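
The same ratio can be obtained in one step with `value_counts(normalize=True)`. This is a minimal sketch on a hypothetical stand-in series (`placement_toy`, coded 1 = placed, 0 = not placed); with the notebook's data you would use `pdata['placement']`.

```python
import pandas as pd

# Hypothetical stand-in for pdata['placement']: 3 placed, 7 not placed
placement_toy = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
ratio = placement_toy.value_counts(normalize=True) * 100
ratio_by_class = ratio.to_dict()
print(ratio.round(2))
```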

## Splitting the data
We will use 70% of data for training and 30% for testing.

In [None]:
from sklearn.model_selection import train_test_split

X = pdata.drop('placement',axis=1)     # Predictor feature columns (8 X m)
Y = pdata['placement']   # Predicted class (1=True, 0=False) (1 X m)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)  # 30% held out for testing
# 1 is just any random seed number

x_train.head()

Let's check the split of the data.

In [None]:
print("{0:0.5f}% data is in training set".format((len(x_train)/len(pdata.index)) * 100))
print("{0:0.8f}% data is in test set".format((len(x_test)/len(pdata.index)) * 100))
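
Because the two classes are not equally frequent, one option is to ask `train_test_split` to preserve the class ratio in both parts through its `stratify` argument. The sketch below is an illustration on hypothetical toy arrays (`X_toy`, `y_toy`), not the notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows, 35% positive class
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([1] * 70 + [0] * 130)

# stratify=y_toy keeps the positive share the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

print("train share:", len(X_tr) / len(X_toy), "test share:", len(X_te) / len(X_toy))
print("positive rate in train:", y_tr.mean(), "in test:", y_te.mean())
```

Without stratification the split is purely random, so the positive rate in a small test set can drift away from the rate in the full data.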

# Data Preparation

### Check hidden missing values

We checked for missing values earlier and found none. However, there can be many entries with a value of 0 that actually stand for missing data, and we need to handle those as well.

In [None]:
x_train.head()

In [None]:
x_train.info()

We can see lots of 0 entries above.
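
Rather than scanning the first rows by eye, the zeros can be counted per column. A minimal sketch on a hypothetical frame (`toy` with made-up column names); with the notebook's data the same expression would be applied to `x_train`.

```python
import pandas as pd

# Hypothetical frame in which 0 stands in for a missing measurement
toy = pd.DataFrame({
    "feature_a": [0, 85, 0, 120],
    "feature_b": [22.5, 0.0, 30.1, 28.0],
})

# Comparing to 0 gives a boolean frame; summing counts the True entries per column
zero_counts = (toy == 0).sum()
print(zero_counts)
```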

### Replace 0s with the column mean

In [None]:
#from sklearn.preprocessing import Imputer
#my_imputer = Imputer()
#data_with_imputed_values = my_imputer.fit_transform(original_data)

from sklearn.impute import SimpleImputer
rep_0 = SimpleImputer(missing_values=0, strategy="mean")
cols=x_train.columns
x_train = pd.DataFrame(rep_0.fit_transform(x_train))
x_test = pd.DataFrame(rep_0.transform(x_test))  # reuse the training means; refitting on the test set would leak test information

x_train.columns = cols
x_test.columns = cols

x_train.head()
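
A common convention is to learn the imputation means from the training set only and then reuse them on the test set. The sketch below illustrates this on hypothetical toy arrays (`train`, `test`); the values are made up so the result can be checked by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data in which 0 marks a missing value
train = np.array([[1.0, 0.0],
                  [3.0, 4.0],
                  [0.0, 8.0]])
test = np.array([[0.0, 0.0],
                 [5.0, 2.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the column means of the non-zero training values
test_filled = imputer.transform(test)        # applies those same means; nothing is learned from test

print(imputer.statistics_)  # per-column means learned from the training data
print(test_filled)
```

Here the training means are 2 (from 1 and 3) and 6 (from 4 and 8), so the all-zero test row is filled with those two values.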

# Logistic Regression

In [None]:
from sklearn import metrics

from sklearn.linear_model import LogisticRegression

# Fit the model on train
model = LogisticRegression()
model.fit(x_train, y_train)
#predict on test
y_predict = model.predict(x_test)


coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
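
To see what the printed coefficients and intercept mean, recall that a binary logistic regression turns the linear score `X @ coef + intercept` into a probability with the sigmoid function. The sketch below checks this against `predict_proba` on hypothetical synthetic data (`X_toy`, `y_toy`); it is an illustration, not the notebook's fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: two features, label depends on a linear rule
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score from the fitted coefficients and intercept
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-z))          # sigmoid of the score
library = clf.predict_proba(X_toy)[:, 1]   # probability of class 1 from scikit-learn

print(np.allclose(manual, library))
```

A positive coefficient therefore raises the predicted probability of the positive class as the feature grows, and a negative one lowers it.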

In [None]:
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])

df_cm = pd.DataFrame(cm, index=["1", "0"], columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(7, 5))
sns.heatmap(df_cm, annot=True, fmt="d")  # fmt="d" prints the counts as plain integers

The confusion matrix reads as follows:

True Positives (TP): we correctly predicted that they have a placement: 48

True Negatives (TN): we correctly predicted that they don't have a placement: 132

False Positives (FP): we incorrectly predicted that they have a placement (a "Type I error"): 14

False Negatives (FN): we incorrectly predicted that they don't have a placement (a "Type II error"): 37
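
The four counts can be combined into the usual summary metrics. The sketch below uses only the counts reported above and the standard definitions of accuracy, precision, recall, and F1; the variable names are illustrative.

```python
# Counts taken from the confusion matrix reported above
tp, tn, fp, fn = 48, 132, 14, 37

total = tp + tn + fp + fn
accuracy = (tp + tn) / total    # share of all predictions that were correct
precision = tp / (tp + fp)      # of the predicted placements, how many were real
recall = tp / (tp + fn)         # of the real placements, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recall comes out noticeably lower than precision here because the false negatives (37) outnumber the false positives (14).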