# 1. Importing Libraries and Preparing the Data
First, the required libraries are imported: numpy, pandas, seaborn, matplotlib, and sklearn. They provide the functions used for data manipulation, visualization, and machine learning.

Then the dataset is loaded with the pd.read_csv function from pandas and stored in the data variable.

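Next, a new column new_As is added to the dataset using the loc indexer from pandas. Rows with As <= 10 are labelled Unsafe and rows with As > 10 are labelled Safe. The pd.get_dummies function is then called with drop_first=True to convert new_As into a binary column. Because the alphabetically first category (Safe) is dropped, new_As is 1 for rows labelled Unsafe (As <= 10) and 0 for rows labelled Safe (As > 10). This agrees with the output below, where As = 499.3 maps to 0 and As = 8.9 maps to 1.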
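To check which label ends up as 1, the encoding can be reproduced on a handful of As values. This is a minimal sketch rather than the notebook's code: the toy frame is hypothetical (its values are copied from the first rows of the dataset), the labels are built with np.where for brevity, and dtype=int is an added argument so the result is 0/1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the As values are copied from the first five rows
toy = pd.DataFrame({"As": [499.3, 8.9, 6.3, 16.1, 5.8]})

# Same labelling rule as the notebook: As <= 10 -> 'Unsafe', otherwise 'Safe'
toy["label"] = np.where(toy["As"] <= 10, "Unsafe", "Safe")

# Categories are sorted alphabetically and drop_first removes 'Safe',
# so the single remaining column is the indicator for 'Unsafe'
dummies = pd.get_dummies(toy["label"], drop_first=True, dtype=int)
toy["new_As"] = dummies["Unsafe"]

print(list(dummies.columns))
print(toy["new_As"].tolist())
```

The printed values can be compared with the new_As column in the output of this section.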

Finally, the dataset is separated into independent and dependent variables. X contains every column except new_As, As, and Type_of_well; y contains only the new_As column.

The head() function then displays the first five rows of the dataset.

Overall, the code performs the preprocessing: it creates the target column, drops the columns that are not used as features, and separates the features from the target.

In [240]:
#importing library file
import numpy as np
import pandas as pd 
from pandas import Series,DataFrame
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn import preprocessing 
# Load the dataset
data= pd.read_csv('the.csv')

# Label each row from its As value, then encode the label as 0/1
data.loc[data['As'] <= 10, 'new_As'] = 'Unsafe' 
data.loc[data['As'] > 10, 'new_As'] = 'Safe' 
data['new_As']=pd.get_dummies(data.new_As, drop_first=True)
# Separate the target (y) from the features (X)
y = data['new_As']
X = data.drop(['new_As', 'As','Type_of_well'], axis=1)
# Show the first five rows of the dataset
data.head()


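| | Type_of_well | Well_Depth | Lattitude | Longitude | pH | Eh | Cond | Temp | DO | DOC | ... | Sr | Zn | Cd | Cr | Cu | Mo | Ni | Pb | As | new_As |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shallow | 24.2 | 23.51744 | 90.86417 | 6.72 | -96.0 | 1416 | 29.6 | 0.3 | 14.03 | ... | 0.25 | 49.18 | 0.0 | 13.6364 | 3.727 | 0.0 | 3.4724 | 0.0 | 499.3 | 0 |
| 1 | Deep | 181.8 | 23.54386 | 90.83903 | 6.74 | -33.0 | 465 | 31.4 | 1.2 | 6.51 | ... | 0.16 | 79.11 | 0.0 | 0.0 | 4.2555 | 0.0 | 2.7421 | 0.0 | 8.9 | 1 |
| 2 | Deep | 242.4 | 23.52825 | 90.82956 | 6.73 | -19.0 | 496 | 31.1 | 1.5 | 6.23 | ... | 0.22 | 51.76 | 0.0 | 0.0 | 3.3989 | 0.0 | 0.0 | 0.0 | 6.3 | 1 |
| 3 | Deep | 303.0 | 23.52561 | 90.84586 | 6.72 | -25.0 | 761 | 32.1 | 5.3 | 5.85 | ... | 0.38 | 43.75 | 0.0 | 1.2759 | 154.386 | 0.0 | 0.0 | 4.3982 | 16.1 | 0 |
| 4 | Shallow | 84.8 | 23.54328 | 90.78681 | 6.72 | 17.0 | 1261 | 31.0 | 2.4 | 16.93 | ... | 0.68 | 69.22 | 0.0 | 1.7589 | 3.1335 | 0.0 | 6.01 | 0.0 | 5.8 | 1 |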


# 2. Train-Test-Split 
To split the data into training and testing sets, we use the train_test_split function from the sklearn.model_selection module.

The call passes the following arguments:

- X: the feature matrix
- y: the target variable
- random_state: a fixed seed, so that the same split is produced on every run
- test_size: the proportion of the data held out for testing. Here, 23% of the data is used for testing and the remaining 77% for training.


The function returns four sets of data: X_train, X_test, y_train, and y_test. X_train and y_train are used to train the model, while X_test and y_test are used to evaluate its performance.
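The effect of test_size and random_state can be seen on a small synthetic array. This is only a sketch: X_demo and y_demo are hypothetical stand-ins for the notebook's X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, alternating binary labels
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100) % 2

# Two calls with the same seed and the same test_size as the notebook
split_a = train_test_split(X_demo, y_demo, random_state=42, test_size=0.23)
split_b = train_test_split(X_demo, y_demo, random_state=42, test_size=0.23)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same random_state produces the same partition on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Changing random_state changes which rows land in the test set, so accuracy figures from different seeds are not directly comparable.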

In [241]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, random_state=42,test_size=.23)

# 3. Logistic Regression Model

The LogisticRegression class below implements logistic regression from scratch with NumPy. Its `__init__` method stores two hyperparameters, the learning rate lr and the number of iterations n_iters. The weights and bias are set to None until the model is fitted.

The sigmoid method computes the sigmoid activation function, which maps any real number to a value between 0 and 1.

<img src="https://dzone.com/storage/temp/7151312-screen-shot-2017-11-07-at-30741-pm.png" />

The fit method takes the training data X and the corresponding labels y, initializes the weights and bias to zeros, and updates them iteratively by gradient descent. The updates dw and db are the gradients of the mean binary cross-entropy loss with respect to the weights and the bias; the loss value itself is not computed inside the loop.
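Written out, the quantities used by fit are as follows. The symbols are introduced here for illustration and do not appear in the notebook: m stands for n_samples, w and b for the weights and bias, \hat{y} for predictions, and \alpha for lr.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y}_i = \sigma(w^\top x_i + b)

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

\frac{\partial J}{\partial w} = \frac{1}{m} X^\top (\hat{y} - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)

w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```

The two gradient expressions correspond term by term to dw and db in the code, and the last line corresponds to the weight and bias updates.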

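The predict method takes a set of data X and returns the predicted labels computed from the fitted weights and bias. Here, linear_pred is the linear prediction of the model, y_pred is the predicted probability of class 1, and class_pred is the predicted class obtained by thresholding y_pred at 0.5. The get_coeff method returns the learned weights and bias. The predict_with_confidence method returns the same labels together with a confidence score, |y_pred - 0.5| * 2, which is 0 at the decision threshold and 1 when the predicted probability is exactly 0 or 1.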
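With unscaled features, the linear prediction can be large in magnitude, and np.exp(-z) may then overflow for strongly negative z. The following is a sketch of a numerically stable variant, not the notebook's implementation; stable_sigmoid is a hypothetical name, and the function expects an array input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

Both branches are algebraically the same function as 1 / (1 + np.exp(-z)); only the order of operations differs.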

In [242]:
class LogisticRegression():
    def __init__(self, lr = 0.001, n_iters = 1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None
        
    def sigmoid(self,z):
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.n_iters):
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self.sigmoid(linear_pred)
            
            dw = (1/n_samples) * np.dot(X.T, (predictions - y))
            db = (1/n_samples) * np.sum(predictions-y)
            
            self.weights = self.weights - self.lr*dw
            self.bias = self.bias - self.lr*db
    
    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        class_pred = [0 if y<0.5 else 1 for y in y_pred]
        return class_pred
    
    def get_coeff(self):
        return self.weights, self.bias
    
    def predict_with_confidence(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        class_pred = [0 if y<0.5 else 1 for y in y_pred]
        confidence = np.abs(y_pred - 0.5) * 2
        return class_pred, confidence
    
    

In [243]:
logisticRegression = LogisticRegression()

In [244]:
logisticRegression.fit(X_train, y_train)

y_pred = logisticRegression.predict(X_test)





# 4. Finding Accuracy

The code below defines a Python function named accuracy that takes two arguments, y_pred and y_test. It returns the fraction of predicted values in y_pred that match the actual values in y_test.

The mathematical formula for calculating accuracy is:


<img src="https://miro.medium.com/max/2306/1*TJYxisH4zWla4oXYJxVJ-g.png" />


The function counts the correct predictions by summing the element-wise comparison y_pred == y_test, divides that count by the total number of predictions, len(y_test), and returns the result. The output is a floating-point value between 0 and 1, where 1 means every prediction is correct and 0 means every prediction is wrong.
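A small worked example makes the calculation concrete. This sketch differs from the notebook's function in one respect, which is an addition here: both inputs are converted with np.asarray, because comparing two plain Python lists with == yields a single boolean rather than an element-wise result. The labels are hypothetical.

```python
import numpy as np

def accuracy(y_pred, y_true):
    # Convert to arrays so that == is element-wise even for plain lists
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return np.sum(y_pred == y_true) / len(y_true)

# Hypothetical labels: three of the four predictions match
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)
```

Here three matches out of four predictions give 3 / 4.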

In [245]:
def accuracy(y_pred, y_test):
    return np.sum(y_pred==y_test)/len(y_test)

In [246]:
print(accuracy(y_pred, y_test)*100)

97.2972972972973


In [247]:
logisticRegression.get_coeff()[0]

array([ 6.53228265e-02,  1.74462432e-01,  6.87455353e-01,  4.94566806e-02,
        9.37765862e-01, -4.54095153e-02,  1.70387496e-01, -7.32131485e-02,
       -9.76416216e-02, -1.75381964e-02,  2.29523350e-01, -3.21477061e-02,
        2.89973593e-01, -2.63548628e-02, -1.22976083e-01, -2.64547989e-01,
       -2.62763916e-03, -2.60255805e-03,  3.41056034e-02, -2.35593913e-02,
       -8.37536020e-02,  3.73322133e-01,  3.55908418e-03,  7.19982140e-02,
       -8.47419125e-02,  9.40018528e-02,  2.35791885e-01,  5.04080885e-01,
        1.29998340e-03,  3.84141764e-02, -6.10596264e-05, -1.40039906e-01,
       -7.16446045e-01, -1.68751897e-02,  5.17531345e-03, -8.58399266e-02])

In [248]:
def get_feature_importance(clf, feature_names):
    feature_importance = (
        pd.DataFrame(
            {
                'variable': feature_names,
                'coefficient': clf.get_coeff()[0]
            }
        )
    ).round(decimals=2)
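    # NOTE: sort_values() and style.bar() return new objects. Their results are
    # not assigned here, so the frame is returned in the original feature order.
    feature_importance.sort_values('coefficient', ascending=False)
    feature_importance.style.bar(color=["red", "green"], align='zero')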
    return feature_importance
        
    

In [250]:
get_feature_importance(logisticRegression, X_test.columns)

| | variable | coefficient |
|---|---|---|
| 0 | Well_Depth | 0.07 |
| 1 | Lattitude | 0.17 |
| 2 | Longitude | 0.69 |
| 3 | pH | 0.05 |
| 4 | Eh | 0.94 |
| 5 | Cond | -0.05 |
| 6 | Temp | 0.17 |
| 7 | DO | -0.07 |
| 8 | DOC | -0.1 |
| 9 | HCO3 | -0.02 |
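The table above is in the original column order because the result of sort_values is never assigned. A sketch of a ranked version follows; rank_coefficients is a hypothetical helper, and ordering by absolute value is an assumption about what "importance" should mean here. The coefficients are the first five values of the array printed earlier.

```python
import pandas as pd

def rank_coefficients(feature_names, coefficients):
    table = pd.DataFrame(
        {"variable": list(feature_names), "coefficient": list(coefficients)}
    ).round(decimals=2)
    # Order rows by coefficient magnitude and keep the reordered frame
    order = table["coefficient"].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)

ranked = rank_coefficients(
    ["Well_Depth", "Lattitude", "Longitude", "pH", "Eh"],
    [6.53228265e-02, 1.74462432e-01, 6.87455353e-01, 4.94566806e-02, 9.37765862e-01],
)
print(ranked)
```

Coefficient magnitudes are only comparable across features that are on similar scales, so a ranking like this is best read as a rough guide unless the inputs are standardized first.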


In [251]:
# Use only Eh as the feature; it has the largest coefficient in the array above
X = data['Eh'].values.reshape(-1,1)

# Split data into training and test sets (same seed and test size as before)
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42,test_size=.23)


In [252]:
logisticRegression.fit(X_train, y_train)

In [253]:
y_pred = logisticRegression.predict(X_test)
print(accuracy(y_pred, y_test)*100)



83.78378378378379


In [254]:
from sklearn.metrics import confusion_matrix
# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[18  2]
 [ 4 13]]
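The matrix can be read with scikit-learn's convention that rows are true labels and columns are predicted labels. The sketch below hard-codes the printed values and derives a few summary metrics from them; precision and recall are additions for illustration and are not computed in the notebook.

```python
import numpy as np

# Values copied from the printed confusion matrix
cm_values = np.array([[18, 2],
                      [4, 13]])

# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm_values.ravel()

acc = (tp + tn) / cm_values.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted 1s that are truly 1
recall = tp / (tp + fn)             # share of true 1s that are predicted as 1

print(round(acc * 100, 2), round(precision, 3), round(recall, 3))
```

The accuracy derived from the matrix, 31 correct out of 37, matches the 83.78% printed by the accuracy function for the single-feature model.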


In [255]:
import joblib

In [256]:
joblib.dump(logisticRegression, "LogisticRegression.joblib")

['LogisticRegression.joblib']

In [257]:
lrModel = joblib.load("LogisticRegression.joblib")
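joblib.dump pickles the object, and a pickle stores a reference to the class rather than its code, so loading LogisticRegression.joblib in a fresh session requires the LogisticRegression class to be defined or importable first. One alternative, sketched here as an assumption rather than the notebook's approach, is to store only the learned parameters. The weights, bias, and file name below are hypothetical placeholders.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical parameters standing in for logisticRegression.get_coeff()
weights = np.array([0.25])
bias = -0.1

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logreg_params.joblib")
    # A dict of plain arrays and floats loads without any custom class
    joblib.dump({"weights": weights, "bias": bias}, path)
    restored = joblib.load(path)

print(restored["weights"], restored["bias"])
```

The restored values can then be assigned to the weights and bias attributes of a new model instance.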

In [264]:
# Eh value of the first row (index 0) in the dataset preview
input_data = np.array([-96.0])

In [265]:
input_data = input_data.reshape(1, -1)

In [266]:
lrModel.predict_with_confidence(input_data)

([0], array([0.9709866]))
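Since the confidence is defined as |y_pred - 0.5| * 2, the underlying probability can be recovered from the returned pair. The helper below is hypothetical and only inverts that formula.

```python
def probability_from_confidence(class_pred, confidence):
    """Invert confidence = |p - 0.5| * 2 given the predicted class."""
    half_width = confidence / 2
    # Class 1 means p >= 0.5, class 0 means p < 0.5
    return 0.5 + half_width if class_pred == 1 else 0.5 - half_width

# Values taken from the output above: class 0 with confidence 0.9709866
p_class_1 = probability_from_confidence(0, 0.9709866)
print(p_class_1)
```

For this input the model assigns a probability of roughly 0.0145 to class 1, which is why the prediction is class 0 with a high confidence score.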