# Machine Learning Basics
In this module, you will implement a simple linear regressor and a logistic regressor from scratch. All tasks in this module use the Salary Data.

**Pipeline:**
* Acquiring the data - done
* Handling files and formats - done
* Data Analysis - done
* Prediction
* Analysing results

## Imports
You may require NumPy, pandas, matplotlib and scikit-learn for this module. Do not, however, use the inbuilt Linear and Logistic Regressors from scikit-learn.

In [111]:
import sklearn as skl
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
# sklearn.linear_model is deliberately not imported: the regressors below are hand-written.


## Dataset
You can load the dataset and perform any dataset related operations here. Split the data into training and testing sets. Do this separately for the regression and classification problems.

In [112]:
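# Load the salary data (the path is specific to the author's machine).
df = pd.read_csv("C:\\Users\\Abi\\Desktop\\MLBasics\\Data\\SalaryData.csv")
sal_arr = df[['YearsExperience', 'Salary']].to_numpy()
size = sal_arr.shape[0]
# First 80% of the rows for training, remaining 20% for testing (file order, no shuffling).
split_at = int(size * 0.8)
training_set = sal_arr[:split_at]
testing_set = sal_arr[split_at:]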
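
The split above keeps the rows in file order, so the last 20% of the file becomes the test set. In the Task 1a output the held-out rows are indices 24 to 29 with 8.7 to 10.5 years of experience, which suggests the file is ordered by experience and the model is tested only on values beyond its training range. A shuffled split is one alternative. The sketch below is a minimal illustration on a toy array; `shuffled_split`, `train_fraction` and `seed` are hypothetical names, not part of the original notebook.

```python
import numpy as np

def shuffled_split(data, train_fraction=0.8, seed=0):
    """Shuffle the rows of a 2-D array and return (train, test)."""
    rng = np.random.default_rng(seed)         # seeded so the split is reproducible
    order = rng.permutation(data.shape[0])    # random row order
    cut = int(data.shape[0] * train_fraction)
    return data[order[:cut]], data[order[cut:]]

# Toy stand-in for sal_arr: 10 rows of (experience, salary).
toy = np.column_stack([np.arange(10.0), 1000.0 * np.arange(10.0)])
train, test = shuffled_split(toy)
print(train.shape, test.shape)
```

Every row lands in exactly one of the two sets, and fixing the seed makes the evaluation repeatable between runs.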



## Task 1a - Linear Regressor
Code your own Linear Regressor here, and fit it to your training data. You will be predicting salary based on years of experience.
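
One common way to fit the line is gradient descent on the mean squared error, which is what the cell that follows does. As a reference for its update rule, with $\hat{y}_i = b_0 + b_1 x_i$ and the learning rate written as $\alpha$ (`LR` in the code):

$$
\begin{aligned}
J(b_0, b_1) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,\\
\frac{\partial J}{\partial b_0} &= -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr),\\
\frac{\partial J}{\partial b_1} &= -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)\,x_i,\\
b_j &\leftarrow b_j - \alpha\,\frac{\partial J}{\partial b_j}, \qquad j \in \{0, 1\}.
\end{aligned}
$$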

In [113]:
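# Split the training array into feature (x) and target (y) columns, each of shape (n, 1).
x, y = np.hsplit(training_set, 2)
n = len(x)
# Same column split for the held-out rows.
X_exp, Y_sal = np.hsplit(testing_set, 2)

LR = 0.01            # learning rate
b0, b1 = 0.0, 1.0    # initial intercept and slope
for i in range(2000):
    y_pred = b0 + b1 * x
    # Partial derivatives of the mean squared error with respect to b0 and b1.
    part_wrt_b0 = -2 * np.sum(y - y_pred) / n
    part_wrt_b1 = -2 * np.sum((y - y_pred) * x) / n
    # Gradient-descent step.
    b0 = b0 - LR * part_wrt_b0
    b1 = b1 - LR * part_wrt_b1

# Predict on the held-out rows. Despite its name, train_len is the number of test rows.
train_len = len(X_exp)
Y = b0 + b1 * X_exp
# .copy() avoids pandas' SettingWithCopyWarning when the prediction column is added.
df_final_linear = df[['YearsExperience', 'Salary']].tail(train_len).copy()
df_final_linear['Salary(predicted)'] = Y.ravel()
df_final_linear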





| | YearsExperience | Salary | Salary(predicted) |
|---|---|---|---|
| 24 | 8.7 | 109431 | 111134.48028 |
| 25 | 9.0 | 105582 | 114151.605027 |
| 26 | 9.5 | 116969 | 119180.146274 |
| 27 | 9.6 | 112635 | 120185.854523 |
| 28 | 10.3 | 122391 | 127225.812268 |
| 29 | 10.5 | 121872 | 129237.228766 |
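
One way to sanity-check a gradient-descent fit like the one above is to compare it with the closed-form least-squares line. The sketch below does this on synthetic data (an assumption for illustration, not the salary file). `fit_gd` is a hypothetical helper that mirrors the loop above with more iterations, and `np.polyfit` comes from NumPy, not scikit-learn.

```python
import numpy as np

def fit_gd(x, y, lr=0.01, steps=20000):
    """Fit y ~ b0 + b1*x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 1.0
    n = len(x)
    for _ in range(steps):
        resid = y - (b0 + b1 * x)
        # Step against the MSE gradient: dJ/db0 = -2*sum(resid)/n, dJ/db1 = -2*sum(resid*x)/n.
        b0 += lr * 2 * np.sum(resid) / n
        b1 += lr * 2 * np.sum(resid * x) / n
    return b0, b1

# Synthetic data: a noisy line with intercept 2 and slope 3.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 30)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=x.size)

b0, b1 = fit_gd(x, y)
slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
print("gradient descent:", b0, b1)
print("closed form:     ", intercept, slope)
```

If the two pairs of coefficients disagree noticeably, the loop has probably not converged yet, and more iterations or a rescaled feature may help.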


## Task 1b - Logistic Regression
Code your own Logistic Regressor here, and fit it to your training data. You will first have to create a column, 'Salary<60000', which contains '1' if salary is less than 60000 and '0' otherwise. This is your target variable, which you will aim to predict based on years of experience.
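
For reference, assuming the model is trained on the mean log-loss (cross-entropy), the gradients take a simple form. With $p_i = \sigma(b + w x_i)$:

$$
\begin{aligned}
L(w, b) &= -\frac{1}{n}\sum_{i=1}^{n}\bigl[\,y_i \ln p_i + (1 - y_i)\ln(1 - p_i)\,\bigr],\\
\frac{\partial L}{\partial w} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(p_i - y_i\bigr)\,x_i,\\
\frac{\partial L}{\partial b} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(p_i - y_i\bigr).
\end{aligned}
$$

The cell that follows multiplies these sums by $2/n$ rather than $1/n$. The direction of each step is the same, and the extra factor of 2 only doubles the effective learning rate.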

In [136]:
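def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1 / (1 + np.exp(-x))

# Work on a copy so the target column is not added to df itself.
df_logistic = df.copy()
# Target: 1 if the salary is below 60000, else 0.
df_logistic['Salary<60000'] = (df_logistic['Salary'] < 60000).astype(int)
# Shuffle the rows (unseeded, so the split differs between runs).
df_logistic = df_logistic.sample(frac=1)

dataset_len = df_logistic.shape[0]
trainset_len = int(0.8 * dataset_len)
testset_len = dataset_len - trainset_len   # remaining rows, so none are dropped

x_train = np.array(df_logistic['YearsExperience'].head(trainset_len))
y_train = np.array(df_logistic['Salary<60000'].head(trainset_len))
x_test = np.array(df_logistic['YearsExperience'].tail(testset_len))
y_test = np.array(df_logistic['Salary<60000'].tail(testset_len))

w, b = 1.0, 0.0   # initial weight and bias
LR = 0.1          # learning rate
for i in range(1800):
    y_pred = sigmoid(b + w * x_train)
    # Mean log-loss gradient times 2; the factor only rescales the step size.
    derivative_w = (2 / trainset_len) * np.sum((y_pred - y_train) * x_train)
    derivative_b = (2 / trainset_len) * np.sum(y_pred - y_train)
    w = w - LR * derivative_w
    b = b - LR * derivative_b

# Predicted probabilities on the test rows, thresholded at 0.5.
y_pred = sigmoid(b + w * x_test)
y_final = np.where(y_pred > 0.5, 1, 0)

# .copy() avoids pandas' SettingWithCopyWarning when the prediction column is added.
df_final = df_logistic.tail(testset_len).copy()
df_final['Salary<60000(predicted)'] = y_final
df_final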







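| | Unnamed: 0 | YearsExperience | Salary | Salary<60000 | Salary<60000(predicted) |
|---|---|---|---|---|---|
| 23 | 23 | 8.2 | 113812 | 0 | 0 |
| 7 | 7 | 3.2 | 54445 | 1 | 1 |
| 29 | 29 | 10.5 | 121872 | 0 | 0 |
| 21 | 21 | 7.1 | 98273 | 0 | 0 |
| 8 | 8 | 3.2 | 64445 | 0 | 1 |
| 3 | 3 | 2.0 | 43525 | 1 | 1 |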
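
A confusion matrix shows where a classifier's errors fall, which a single accuracy figure hides. The sketch below tallies the six test rows displayed above. The arrays are typed in by hand from that table, and the variable names are illustrative only.

```python
import numpy as np

# Target and prediction columns copied from the six rows shown above.
y_true = np.array([0, 1, 0, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # salary < 60000, predicted < 60000
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # salary >= 60000, predicted >= 60000
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # predicted < 60000 but salary is not
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # missed a salary < 60000

accuracy = (tp + tn) / y_true.size
print(tp, tn, fp, fn)  # 2 3 1 0
```

The single error is a false positive: the row with 3.2 years of experience and a salary of 64445.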


## Task 2 - Results
Analyse the quality of the ML models you built using metrics such as R2, MAE and RMSE for the Linear Regressor, and Accuracy for the Logistic Regressor. Evaluate their performance on the testing set.

In [143]:

# Actual and predicted salaries for the held-out rows.
y_true = df_final_linear['Salary']
y_hat = df_final_linear['Salary(predicted)']
n_test = len(df_final_linear)

# R2 = 1 - SS_res / SS_tot
ss_res = ((y_true - y_hat) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
print(f"R2:        {1 - ss_res / ss_tot:.4f}")

# MAE
print(f"MAE:       {(y_true - y_hat).abs().sum() / n_test:.4f}")

# RMSE
print(f"RMSE:      {(ss_res / n_test) ** 0.5:.4f}")

# Accuracy for logistic regression: percentage of test rows predicted correctly.
temp1 = np.array(df_final['Salary<60000'])
temp2 = np.array(df_final['Salary<60000(predicted)'])
print(f"accuracy:  {np.mean(temp1 == temp2) * 100:.2f}")


R2:        0.0648
MAE:       5372.5212
RMSE:      5998.1463
accuracy:  83.33
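
The metrics in this task can also be written as small reusable NumPy helpers. The sketch below uses the standard definitions, with R2 taken as 1 - SS_res / SS_tot. The function names and the toy arrays are hypothetical and are not taken from the salary data.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def accuracy(labels, predicted):
    """Fraction of labels predicted correctly."""
    return np.mean(labels == predicted)

# Toy regression example: residuals are 1, 0, -1, 0.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])
print(r2(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))

# Toy classification example: three of four labels match.
print(accuracy(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))
```

Under this definition, predicting the mean of `y_true` for every row gives an R2 of 0, so a value near 0 means the model does little better than that baseline on the rows evaluated.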
