<b>2) Build a two-class logistic regression model from scratch. You will need to work on the following: <br>
	a. Implement the sigmoid function from scratch and call it sigmoid_f <br>
	b. Implement the hypothesis function from scratch and call it classifier_f<br> 
	c. Implement the entropy function as your cost function and call it binary_loss_f<br> 
	d. Implement gradient descent for logistic regression and call it gradient_f<br> 
	e. Combining the functionalities of what you have coded above, create an optimizer function and call it optimizer_f. <br>
    <br>
	Note: You should find out the input and output to the functions above by reviewing the class notes and the textbook; in other words, this will be part of the challenge! If needed, use 265 as your random seed. </b>


	
<b>Let’s test your code on a dataset. Load the Breast Cancer Wisconsin Dataset provided by sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer 
Now, do the following: <br>
	a. Set the target column as your Y variable. <br>
	b. Set all other numeric variables (excluding index) as your X matrix. <br>
	c. Apply 0-1 normalization on both the X matrix and Y vector. <br>
	d. Run logistic regression by using the code you have written (no need to do train/test split). Set the maximum number of iterations to 10,000. <br>
	e. Report the final equation you have obtained for logistic regression. <br>
	f. Also indicate which coefficients are positively associated and which coefficients are negatively associated with the target variable. Rank them from positive to negative. Interpret the results. </b>



In [2]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.datasets import load_breast_cancer

### Sigmoid function

In [3]:
def sigmoid_f(z):
    return 1/(1+np.exp(-(z)))
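
For very negative inputs, `np.exp(-z)` can overflow and emit a runtime warning. The sketch below shows one common workaround. It is not part of the original solution: `sigmoid_stable` is a hypothetical name chosen for illustration, and it assumes array input.

```python
import numpy as np

def sigmoid_stable(z):
    """Sigmoid that keeps every exponent non-positive (illustrative helper)."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use exp(z) / (1 + exp(z)) so the exponent stays negative
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

result = sigmoid_stable(np.array([-1000.0, 0.0, 1000.0]))
print(result)
```

Both branches compute the same function as `sigmoid_f`; only the arithmetic path differs.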

### Hypothesis Function

In [4]:
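def hypothesis_f(X, weights):
    # Linear score z = X.w, passed through the sigmoid to give P(y = 1 | x)
    z = np.dot(X, weights)
    return sigmoid_f(z)

# The assignment asks for the name classifier_f; keep both names available
classifier_f = hypothesis_f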

### Entropy/Loss Function

In [5]:
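def binary_loss_f(X, y, weights, eps=1e-12):
    # Predicted probabilities, clipped away from 0 and 1 so np.log stays finite
    y1 = np.clip(hypothesis_f(X, weights), eps, 1 - eps)
    # Mean binary cross-entropy over the m = len(X) samples
    return -(1/len(X)) * np.sum(y*np.log(y1) + (1-y)*np.log(1-y1))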
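
Written out, the cost computed above is the mean binary cross-entropy. The notation here ($m$ samples, weight vector $w$, hypothesis $h_w$) is assumed for illustration and may differ from the class notes:

$$
J(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y^{(i)}\log h_w\big(x^{(i)}\big) + \big(1-y^{(i)}\big)\log\big(1-h_w\big(x^{(i)}\big)\big)\Big],
\qquad
h_w(x) = \sigma\big(w^\top x\big) = \frac{1}{1+e^{-w^\top x}}
$$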

### Gradient Descent

In [10]:
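def gradient_f(X, y, weights, alpha, epochs, verbose=False):
    """Batch gradient descent; returns the cost history and the fitted weights."""
    m = len(X)
    X_arr = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float).copy()
    cost_list = [binary_loss_f(X_arr, y, weights)]
    for epoch in range(epochs):
        h = hypothesis_f(X_arr, weights)
        # Simultaneous update of every weight: gradient = X^T (h - y) / m
        weights -= (alpha / m) * X_arr.T.dot(h - y)
        cost = binary_loss_f(X_arr, y, weights)
        # Optional progress report every 100 epochs
        if verbose and epoch % 100 == 0:
            print("Epochs:", epoch, "   Cost:", cost)
        cost_list.append(cost)
    return cost_list, weights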
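
A quick way to gain confidence in the update rule is to compare the analytic gradient, $X^\top(h - y)/m$, against central finite differences. The sketch below does this on small synthetic data. All function names are hypothetical stand-ins rather than the functions defined above, and the seed 265 is simply the one suggested in the assignment text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w):
    # Mean binary cross-entropy
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(X, y, w):
    # Same expression as the gradient descent step: X^T (h - y) / m
    return X.T @ (sigmoid(X @ w) - y) / len(X)

def numeric_grad(X, y, w, eps=1e-6):
    # Central differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        g[j] = (loss(X, y, w + step) - loss(X, y, w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(265)          # seed from the assignment text
X = rng.random((50, 3))                   # synthetic features, not the real dataset
y = (rng.random(50) > 0.5).astype(float)  # synthetic 0/1 labels
w = np.full(3, 0.5)

max_gap = np.max(np.abs(analytic_grad(X, y, w) - numeric_grad(X, y, w)))
print("largest difference:", max_gap)
```

If the two gradients agree to several decimal places, the descent direction used in the update is likely correct.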

### Optimizer

In [22]:
def optimizer_f(X, y, alpha=0.1, epochs=10000):
    # Start every weight at 0.5; this model has no separate intercept term
    weights = [0.5] * len(X.columns)
    cost_list, weights = gradient_f(X, y, weights, alpha, epochs)
    # Threshold the predicted probabilities at 0.5 to get class labels
    preds = (hypothesis_f(X, weights) > 0.5).astype(int)
    # Fraction of training samples classified correctly
    acc = np.mean(preds == np.asarray(y))
    return weights, acc, cost_list

In [23]:
data = load_breast_cancer()
# Build the feature matrix X as a DataFrame with named columns
X = pd.DataFrame(data.data, columns=data.feature_names)
# Target vector y (already coded as 0/1, so 0-1 normalization leaves it unchanged)
y = data.target
# Show the first five rows of X
X.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [24]:
def NormalizeData(data):
    # Min-max scale each column to [0, 1]; axis=0 makes the per-column reduction explicit
    return (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
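
To make the intent of 0-1 normalization concrete, here is a small sketch on a toy array in which every column is rescaled independently. `min_max_columns` is a hypothetical helper, not the function used in this notebook, and it assumes no column is constant (a constant column would divide by zero).

```python
import numpy as np

def min_max_columns(A):
    """Scale each column of a 2-D array to the range [0, 1]."""
    A = np.asarray(A, dtype=float)
    col_min = A.min(axis=0)   # one minimum per column
    col_max = A.max(axis=0)   # one maximum per column
    return (A - col_min) / (col_max - col_min)

toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])
scaled = min_max_columns(toy)
print(scaled)
```

After scaling, both columns run from 0 to 1 even though their original ranges differ by two orders of magnitude, which keeps any single feature from dominating the gradient steps.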

In [25]:
X_normalised = NormalizeData(X)
weights, acc,cost_list= optimizer_f(X_normalised,y)
for i in range(10000):
    if(i % 100==0):
        print("epochs : ",i,"   Cost: ",cost_list[i])

epochs :  0    Cost:  1.8954066800788267
epochs :  100    Cost:  0.5921714811816127
epochs :  200    Cost:  0.5244633852568734
epochs :  300    Cost:  0.480087653865511
epochs :  400    Cost:  0.448537786496904
epochs :  500    Cost:  0.42473868909696855
epochs :  600    Cost:  0.4059827458774401
epochs :  700    Cost:  0.39070209142995926
epochs :  800    Cost:  0.37792638237947246
epochs :  900    Cost:  0.3670220551735117
epochs :  1000    Cost:  0.35755705436935076
epochs :  1100    Cost:  0.34922600315452357
epochs :  1200    Cost:  0.3418065166965753
epochs :  1300    Cost:  0.33513250870376815
epochs :  1400    Cost:  0.3290772354206438
epochs :  1500    Cost:  0.3235421590108058
epochs :  1600    Cost:  0.31844941784444886
epochs :  1700    Cost:  0.3137366046746127
epochs :  1800    Cost:  0.3093530634701741
epochs :  1900    Cost:  0.3052572106777179
epochs :  2000    Cost:  0.30141456297959435
epochs :  2100    Cost:  0.2977962620391774
epochs :  2200    Cost:  0.29437795514

In [40]:
answer = pd.DataFrame({'Variables':X.columns,'Weights':weights})

### Positively associated coefficients:

In [41]:
answer[answer['Weights']>0]

| | Variables | Weights |
|---|---|---|
| 0 | mean radius | 3.308499 |
| 1 | mean texture | 0.246262 |
| 2 | mean perimeter | 2.634417 |
| 4 | mean smoothness | 4.034146 |
| 8 | mean symmetry | 2.999738 |
| 9 | mean fractal dimension | 5.00352 |
| 11 | texture error | 1.417101 |
| 14 | smoothness error | 2.052791 |
| 15 | compactness error | 1.734586 |
| 16 | concavity error | 1.707584 |


### Negatively associated coefficients:


In [42]:
answer[answer['Weights']<0]

| | Variables | Weights |
|---|---|---|
| 3 | mean area | -0.169952 |
| 5 | mean compactness | -1.717932 |
| 6 | mean concavity | -5.849737 |
| 7 | mean concave points | -7.214918 |
| 10 | radius error | -3.364835 |
| 12 | perimeter error | -2.585674 |
| 13 | area error | -2.383512 |
| 20 | worst radius | -0.733358 |
| 21 | worst texture | -1.284407 |
| 22 | worst perimeter | -0.823451 |


### Coefficients ranked from positive to negative

In [43]:
answer.sort_values('Weights',ascending=False)

| | Variables | Weights |
|---|---|---|
| 9 | mean fractal dimension | 5.00352 |
| 4 | mean smoothness | 4.034146 |
| 0 | mean radius | 3.308499 |
| 8 | mean symmetry | 2.999738 |
| 2 | mean perimeter | 2.634417 |
| 17 | concave points error | 2.104155 |
| 14 | smoothness error | 2.052791 |
| 15 | compactness error | 1.734586 |
| 16 | concavity error | 1.707584 |
| 19 | fractal dimension error | 1.583333 |


In [44]:
print("Accuracy:", acc)

Accuracy: 0.929701230228471


### Logistic Equation:

In [45]:
# print log reg equation
eq = ""
for i, col in enumerate(data.feature_names):
    if i == 0:
        eq += f"{weights[i]}*{col}"
        eq += " "
    else:
        eq += f"+({weights[i]})*{col}"
        eq += " "
print(eq)

3.308499067392783*mean radius +(0.24626218721647752)*mean texture +(2.63441663023807)*mean perimeter +(-0.16995245337805048)*mean area +(4.0341464362786095)*mean smoothness +(-1.7179319342783268)*mean compactness +(-5.8497369102157855)*mean concavity +(-7.214917877156743)*mean concave points +(2.99973799851675)*mean symmetry +(5.0035203986430465)*mean fractal dimension +(-3.364834954785204)*radius error +(1.4171013966296315)*texture error +(-2.5856743685245025)*perimeter error +(-2.3835116144425386)*area error +(2.052791022834254)*smoothness error +(1.734586320251773)*compactness error +(1.7075835046783163)*concavity error +(2.1041552870735463)*concave points error +(1.43633023521943)*symmetry error +(1.5833325020484903)*fractal dimension error +(-0.7333583326516334)*worst radius +(-1.2844073712191497)*worst texture +(-0.8234506644551598)*worst perimeter +(-2.5171945931199056)*worst area +(0.7161901793823963)*worst smoothness +(-1.4451161957666687)*worst compactness +(-2.11708792269881
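
The expression printed above is only the linear part of the model. Assuming the notation $w_j$ for the fitted weights and $\tilde{x}_j$ for the min-max normalized features (symbols introduced here for illustration), the full logistic regression equation passes that sum through the sigmoid. There is no intercept term because the code does not fit one:

$$
\hat{p}(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
z = \sum_{j=1}^{30} w_j\,\tilde{x}_j,
\qquad
\tilde{x}_j = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}
$$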

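### Interpretation:
<b>In sklearn's encoding of this dataset the target is 0 for malignant and 1 for benign, so a positive coefficient raises the predicted probability of a benign tumour and a negative coefficient lowers it.</b><br>
<b>mean fractal dimension, mean smoothness, mean radius, mean symmetry and mean perimeter have the largest positive coefficients, i.e. they are the features most positively associated with the target variable.</b><br>
<b>perimeter error, radius error, worst concave points, mean concavity and mean concave points are the features most negatively associated with the target variable.</b><br>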
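
One hedged way to read an individual coefficient: because each feature is min-max scaled, moving a feature from its minimum to its maximum (0 to 1) while holding the others fixed multiplies the odds of target = 1 by $e^{w}$. The sketch below applies this to two weights copied from the tables above. The dictionary and variable names are illustrative only.

```python
import math

# Two fitted weights copied from the coefficient tables in this notebook
weights = {
    "mean fractal dimension": 5.00352,
    "mean concave points": -7.214918,
}

# Odds multiplier for a full-range (0 -> 1) change in the normalized feature
odds_factor = {name: math.exp(w) for name, w in weights.items()}

for name, factor in odds_factor.items():
    print(f"{name}: odds of target = 1 multiplied by {factor:.4g}")
```

A factor above 1 means the feature pushes predictions toward class 1, and a factor below 1 pushes them toward class 0. This reading ignores correlations between features, which are strong in this dataset, so individual factors should be interpreted cautiously.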
