## Implementing Gradient Descent
#### for Logistic Regression

In [222]:
import pandas as pd
import numpy as np

We are going to write our own Class. We have already been working with instances of certain classes (NumPy arrays, pandas DataFrames); now classes come to the foreground.

An **instance** of a Class is created by calling a function, with the same name as the Class, that initializes the instance.

A Class can have:
* attributes - variables which get assigned a value for each instance of the Class; 
* methods - functions that can "see" and use attributes and other methods of the Class instance.
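As a minimal illustration of these terms, the sketch below defines a toy class. The names `Counter`, `count`, and `increment` are made up for this example and are not used elsewhere in the notebook.

```python
class Counter:
  """Toy class: keeps a running count."""

  def __init__(self, start=0):
    # __init__ runs when Counter(...) is called; it sets the attributes
    self.count = start

  def increment(self, step=1):
    # a method can read and modify the instance's attributes through self
    self.count += step
    return self.count


c = Counter(start=5)   # calling the class name creates an instance
c.increment()          # method call; count goes from 5 to 6
c.increment(step=4)    # count goes from 6 to 10
print(c.count)         # attribute access
```

Each instance carries its own copy of `count`, so a second `Counter()` would start again from its own `start` value.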

In [209]:
# A NumPy array is an instance of the Class numpy.ndarray.
my_array = np.array([[1],[2],[312]]) # np.array creates and initializes an array instance
my_array.shape # shape is an attribute

(3, 1)

In [210]:
# printing output of 3 methods for the Class instance
print( my_array.astype('float') )  
print( my_array.min(axis=0) )
print( my_array.flatten() )

[[  1.]
 [  2.]
 [312.]]
[1]
[  1   2 312]


In [223]:
auto = pd.read_csv('../../DataSets/Auto.csv')
auto.head() # A method called on auto (instance of pandas.DataFrame Class)

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


In [224]:
auto.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower        int64
weight            int64
acceleration    float64
year              int64
origin            int64
name             object
dtype: object

We are going to use data from the other columns (not `mpg`) to predict whether a car has "high" or "low" mpg, where "high" means above the median mpg.

In [226]:
# make a 0/1 column: 1 if mpg is larger than the median mpg, else 0
auto['mpg01'] = (auto['mpg'] > auto['mpg'].median()).astype('int')
auto.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name | mpg01 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 0 |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 0 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 0 |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 0 |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino | 0 |


In [227]:
data_init = auto[['cylinders','displacement','horsepower','weight','acceleration','year']]
labels = auto['mpg01']

In [228]:
data = pd.DataFrame({}) # empty dataframe

In [229]:
# Min-max scale each column to the range [0, 1] (so that max minus min is 1 in every column)
for c in data_init.columns:
  data[c] = (data_init[c] - data_init[c].min())/ (data_init[c].max() - data_init[c].min())
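The loop above rescales one column at a time with x' = (x - min) / (max - min). As a sketch, the same operation can be written without a loop, because pandas aligns the per-column minima and maxima with the column names. The small frame `toy` below is made-up data, not part of the Auto set.

```python
import pandas as pd

# made-up data, used only to illustrate the scaling
toy = pd.DataFrame({'a': [1.0, 2.0, 5.0], 'b': [10.0, 30.0, 20.0]})

# toy.min() and toy.max() are Series indexed by column name,
# so the arithmetic is applied column by column across all rows
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
```

Note that a constant column would make the denominator zero, so this form assumes every column takes at least two distinct values.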

In [230]:
from sklearn.linear_model import LogisticRegression

In [231]:
clf=LogisticRegression(tol=0.001)
clf.fit(data, labels)

In [232]:
clf.coef_

array([[-2.67428777, -2.62306548, -1.8171092 , -3.48172319,  0.12956741,
         2.44902549]])

In [233]:
clf.intercept_

array([2.26612189])

In [234]:
clf.score(data, labels)

0.9209183673469388

In [None]:
np.sum(clf.predict(data) == labels)/len(labels)

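scikit-learn's `LogisticRegression` class does not use plain gradient descent: its default solver in recent versions is L-BFGS, a quasi-Newton method. We will make a Class that does use gradient descent.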
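For reference, the quantities involved can be written out as follows. This assumes the usual mean log-loss over $n$ examples, with each $x_i$ extended by a trailing 1 so that the last coefficient acts as the intercept; the symbols $w$, $p_i$, and $\eta$ (the learning rate) are notation chosen here.

$$
p_i = \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}}, \qquad
L(w) = -\frac{1}{n}\sum_{i=1}^{n}\bigl[\,y_i \log p_i + (1-y_i)\log(1-p_i)\,\bigr]
$$

$$
\nabla L(w) = \frac{1}{n}\sum_{i=1}^{n}\bigl[-y_i(1-p_i) + (1-y_i)\,p_i\bigr]\,x_i
= \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_i, \qquad
w \leftarrow w - \eta\,\nabla L(w)
$$

The first form of the gradient uses $\sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr)$; the second follows by expanding the bracket, since $-y_i + y_i p_i + p_i - y_i p_i = p_i - y_i$.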

In [None]:
class LogisticModel:
  """
  represent a Logistic Regression model, that fits data using batch gradient descent on the log-loss function
  attributes:
    - 
    - 
  methods: 
    -
    -
  """
  def __init__(self, tolerance=0.001, iteration_cap=1e4):
    self.coefs = None  # set by fit(): d feature weights followed by the intercept
    # stopping threshold and maximum number of iterations
    self.tol = tolerance
    self.max_iters = int(iteration_cap)  # cast so that 1e4 can be used in range()
  def _sigma(self, z):
    return 1/(1+np.exp(-z))
  def grad(self, x,y):
    n = x.shape[0]
    X = np.hstack((x,np.ones(n).reshape(-1,1)))
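    # below computes the per-example gradient of the log-loss (not the loss itself);
    # row i simplifies to (p_i - y_i) * X[i], where p_i is the predicted probability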
    per_exm = ( -y*(X.T)*(1 - self.predict(X[:,:-1])) + (1-y)*(X.T)*self.predict(X[:,:-1]) ).T
    return np.sum(per_exm, axis=0)/n
  def predict(self, x):
    return self._sigma(x@self.coefs[:-1] + self.coefs[-1])
  def fit(self, x, y, lr=0.1, return_iter=True):
    n, d = x.shape
    t = 0
    self.coefs = np.zeros(d+1, dtype='float')
    # put code here to iteratively update self.coefs with gradient descent steps,
    # stopping once the change in parameters falls below self.tol or t reaches self.max_iters
    if return_iter:
      print(f'Last iteration: {t}.')
    return None  # technically don't need this line
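One possible shape for the update loop is sketched below as standalone functions rather than as the finished `fit` method. It assumes the stopping rule is "the Euclidean norm of the parameter change falls below the tolerance, or the iteration cap is reached"; other rules are equally reasonable. The names `sigma`, `log_loss_grad`, and `gd_logistic`, and the synthetic data, are hypothetical and introduced only for illustration.

```python
import numpy as np

def sigma(z):
  return 1 / (1 + np.exp(-z))

def log_loss_grad(w, X, y):
  # gradient of the mean log-loss; X already has a trailing column of ones
  return X.T @ (sigma(X @ w) - y) / X.shape[0]

def gd_logistic(x, y, lr=0.1, tol=1e-3, max_iters=10_000):
  n, d = x.shape
  X = np.hstack((x, np.ones((n, 1))))  # append the intercept column
  w = np.zeros(d + 1)
  for t in range(1, max_iters + 1):
    step = lr * log_loss_grad(w, X, y)
    w = w - step
    # assumed stopping rule: size of the parameter change is below tol
    if np.linalg.norm(step) < tol:
      break
  return w, t

# synthetic, made-up data: label is 1 when the single feature exceeds 0.5
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, size=(200, 1))
y_demo = (x_demo[:, 0] > 0.5).astype(int)

w_demo, n_steps = gd_logistic(x_demo, y_demo, lr=1.0)
prob_demo = sigma(x_demo @ w_demo[:-1] + w_demo[-1])
acc = np.mean((prob_demo > 0.5) == y_demo)
print(n_steps, w_demo, acc)
```

Because the synthetic labels are perfectly separable, the coefficients keep growing slowly and the loop may stop at the iteration cap rather than at the tolerance; on overlapping classes the tolerance is typically reached first.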

The cells below should run after you have filled in the code above.

In [None]:
my_model = LogisticModel()
my_model.fit(data.to_numpy(), labels.to_numpy(), lr=1.1)

In [None]:
my_model.coefs

In [None]:
y_prob = my_model.predict(data.to_numpy())
y_pred = (y_prob > 0.5).astype('int')

In [None]:
np.sum(y_pred == labels)/len(labels)