## 1. Logistic Regression with L1 regularization from scratch

### Instructions

  1. Read in the training data.
  2. Vectorize the train and test text using sklearn's built-in `TfidfVectorizer`.
  3. Ignore unigrams; use **bigrams and trigrams** only, limit **max features** to **2000**, and set the **minimum document frequency** to **10**.
  4. Once the TF-IDF vectors are generated, column-standardize the data.
  5. In code comments, explain why you think the data should be standardized in the previous step.
  6. You can use sklearn's `StandardScaler` to column-standardize the data.
  7. Write a function to initialize the weights and bias, then run its grader function.
  8. Write a custom function to compute the sigmoid of a value, then run its grader function to cross-check your implementation.
  9. Write a custom function to compute the total loss as the sum of the log loss and the L1 regularization loss, given the true labels, the predicted probabilities, and the weights. Cross-check your implementation with its grader.
  10. Write functions to compute the gradients with respect to the weights and the bias; you will use them to update the weights and bias during training.
  11. Implement a custom train function of logistic regression, wherein you take in the following inputs:
        * **X_train** which will be your vectorized text data
        * **y_train** which are the labels for your train data
        * **alpha** = 0.0001 which is the regularization factor (λ)
        * **eta0** = 0.0001 which will be the learning rate   
        * **tolerance** = 0.001
        
  12. In the custom train function, use a custom SGD update to adjust the weights and bias for **each** input.
  13. Keep running SGD epochs until the difference in loss between two consecutive epochs is less than the tolerance.
  14. Here, one epoch means one complete pass over the entire training data.
  15. Your train function should return the following:
        * the number of epochs it took to complete the training
        * train loss for all epochs
        * the values for final weights and bias terms.
        
  16. Now run the grader function to check whether the weights and bias from your custom implementation are close enough to those of sklearn's implementation.
  17. Next, write a custom predict function that takes the weights and bias computed by your train function, along with the standardized test data, and predicts the labels.
  18. Now run the grader function to check the accuracy of your predictions.
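
For reference, the objective and the per-sample updates described in steps 9 to 13 can be written as follows. This is a sketch under assumptions that the instructions do not spell out: $\sigma$ is the sigmoid, $N$ is the number of training points, $\hat{y}_i$ is the predicted probability for point $i$, the subgradient of the penalty is split evenly across the $N$ per-sample updates, and any constant factor coming from the base of the logarithm is absorbed into the learning rate $\eta_0$.

$$
L(w, b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big] + \alpha \sum_{j} |w_j|, \qquad \hat{y}_i = \sigma(w^\top x_i + b)
$$

$$
w \leftarrow w - \eta_0\Big[(\hat{y}_i - y_i)\,x_i + \frac{\alpha}{N}\,\operatorname{sign}(w)\Big], \qquad b \leftarrow b - \eta_0\,(\hat{y}_i - y_i)
$$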





### Data wrangling and preprocessing

In [None]:
# Loading necessary libraries
import numpy as np
import pandas as pd
import operator
from sklearn import linear_model
import matplotlib.pyplot as plt
import altair as alt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# read in the dataset
comments_raw_df = pd.read_csv('https://raw.githubusercontent.com/myamullaciencia/open_into_datos/master/logistic_regression_assignment_data.csv')

In [None]:
# glance at comments dataframe
comments_raw_df.head()

Unnamed: 0,category,text
0,0,worldcom boss left books alone former worldc...
1,1,tigers wary of farrell gamble leicester say ...
2,1,yeading face newcastle in fa cup premiership s...
3,1,henman hopes ended in dubai third seed tim hen...
4,1,wilkinson fit to face edinburgh england captai...


In [None]:
# levels of category field
comments_raw_df.value_counts('category')

category
1    509
0    508
dtype: int64

In [None]:
# creating pandas Series with text and category fields
comments_df_text = comments_raw_df['text']
comments_df_category = comments_raw_df['category']

In [None]:
# Creating training and testing datasets
train_comments_text,test_comments_text,train_comments_cat,test_comments_cat = train_test_split(comments_df_text,comments_df_category,random_state=100, stratify=comments_df_category, test_size=0.01)

In [None]:
print(f'1.X_Training comments set shape is {train_comments_text.shape}\n2.X_Testing comments set shape is {test_comments_text.shape}\n3.Y_Training category set shape {train_comments_cat.shape}\n4.Y_Testing category set shape {test_comments_cat.shape}')


1.X_Training comments set shape is (1006,)
2.X_Testing comments set shape is (11,)
3.Y_Training category set shape (1006,)
4.Y_Testing category set shape (11,)


In [None]:
# Define text tfidf vectorizer
LR_text_vectorizer = TfidfVectorizer(ngram_range=(2,3),max_features=2000,min_df=10)

In [None]:
# fitting the text vectorizer on training set
train_comments_vec_fit = LR_text_vectorizer.fit(train_comments_text)

In [None]:
# creating training and testing vectors
train_comments_vectors = train_comments_vec_fit.transform(train_comments_text)
test_comments_vectors = train_comments_vec_fit.transform(test_comments_text)

In [None]:
print(f'The training and testing vectors shape as: {train_comments_vectors.shape , test_comments_vectors.shape}')

The training and testing vectors shape as: ((1006, 2000), (11, 2000))
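
To make the effect of `ngram_range=(2,3)` concrete, the sketch below lists the bigrams and trigrams of one short phrase. It is a simplified illustration, not the vectorizer's actual code: the helper `word_ngrams` is hypothetical and splits on whitespace, whereas `TfidfVectorizer` uses its own tokenizer and then applies `min_df` and `max_features` on top.

```python
def word_ngrams(text, n_min=2, n_max=3):
    """Return all word n-grams of length n_min..n_max, joined by spaces."""
    tokens = text.lower().split()  # simplification: plain whitespace tokenization
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

features = word_ngrams("henman hopes ended in dubai")
print(features)
# ['henman hopes', 'hopes ended', 'ended in', 'in dubai',
#  'henman hopes ended', 'hopes ended in', 'ended in dubai']
```

No single word appears as a feature on its own, which is what "ignore unigrams" means here.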


In [None]:
# Standardization initializer
txt_scaler = StandardScaler()

**Column standardization:**

Reasons for standardizing the predictor variables:

1. It brings predictor variables that have different scales onto a common scale.

2. It improves the numerical stability of the model calculations.

3. Regularization penalizes the magnitude of the coefficient learned for each predictor, so the scale of a predictor affects how much penalty its coefficient receives. A predictor with a large variance needs only a small coefficient and is therefore penalized less; standardizing puts all coefficients on an equal footing.

In [None]:
# train and test standardized vectors
train_comments_std_vec = txt_scaler.fit_transform(train_comments_vectors.toarray())
# reuse the mean and variance learned from the training set; do not refit the scaler on the test set
test_comments_std_vec = txt_scaler.transform(test_comments_vectors.toarray())
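
Column standardization is normally fitted on the training data only, and the same statistics are then applied to the test data. The sketch below shows that idea in plain Python on made-up numbers. The helpers `fit_column_stats` and `standardize` are hypothetical names for illustration, and they use the population standard deviation, which is an assumption chosen to mirror how `StandardScaler` scales columns.

```python
from statistics import fmean, pstdev

def fit_column_stats(rows):
    """Per-column mean and population standard deviation of a list of rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 so dividing is a no-op.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply previously fitted column statistics to every row."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # made-up data
test_rows = [[2.0, 40.0]]

means, stds = fit_column_stats(train_rows)       # statistics come from train only
train_std = standardize(train_rows, means, stds)
test_std = standardize(test_rows, means, stds)   # the same statistics are reused
```

Each training column ends up with mean 0, while the test row is expressed relative to the training distribution.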

### Custom function implementations

In [None]:
def initialize_weights_bias(dim):
    ''' In this function, we will initialize our weights and bias terms'''

    # Initialize the weights to a zero array of length dim, where dim is the
    # number of features produced by the tfidf vectorizer.
    # np.zeros gives a float array, so later gradient updates keep their precision.
    w = np.zeros(dim)
    # Initialize bias term to zero
    b=0.0
    return w,b

In [None]:
def custom_sigmoid(z):
    ''' In this function, we will return sigmoid of z'''
    # Compute sigmoid(z) = 1 / (1 + exp(-z)) and return its value.
    sigmoid = 1.0 / (1 + np.exp(-z))
    return sigmoid
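
For very negative inputs, `np.exp(-z)` can overflow and emit a runtime warning. One common workaround, sketched below under the assumption that inputs are real-valued scalars or arrays, is to exponentiate only non-positive numbers. `stable_sigmoid` is a hypothetical name and is not required by the grader.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so it cannot overflow
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

values = stable_sigmoid([-1000.0, 0.0, 2.0, 1000.0])
```

Both branches are algebraically identical to `1 / (1 + exp(-z))`, so the results agree with `custom_sigmoid` up to rounding wherever the latter does not overflow.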

In [None]:
def custom_loss(y_true, y_pred, alpha, w):
    '''In this function, we will compute total loss which is [(logloss) + (alpha * L1regularization loss)] '''
    # Note: the log loss uses base-10 logarithms, which matches the expected value in grader_loss.
    log_loss = -1 * np.mean(y_true*(np.log10(y_pred)) + (1-y_true)*np.log10(1-y_pred))
    l1_loss = sum(abs(w))
    total_loss = log_loss + alpha*l1_loss
    return total_loss
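
The loss becomes undefined when a predicted probability is exactly 0 or 1, because the logarithm of 0 is not finite. A minimal sketch of one way to guard against that is shown below in plain Python. The function name `l1_log_loss` and the clipping constant `eps` are assumptions for illustration, and the base-10 logarithm is kept to stay comparable with `custom_loss`.

```python
import math

def l1_log_loss(y_true, y_prob, alpha, w, eps=1e-15):
    """Mean base-10 log loss plus alpha times the L1 norm of w."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log10(p) + (1 - y) * math.log10(1.0 - p)
    log_loss = -total / len(y_true)
    return log_loss + alpha * sum(abs(wj) for wj in w)

# Same inputs as the loss grader in this notebook.
loss = l1_log_loss([1, 1, 0, 1, 0], [0.9, 0.8, 0.1, 0.8, 0.2], alpha=0.0001, w=[0.1] * 10)

# Fully saturated (and correct) predictions still give a finite loss.
saturated = l1_log_loss([1, 0], [1.0, 0.0], alpha=0.0, w=[0.0])
```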

In [None]:
def gradient_dw(x, y, w, b, alpha, N):
    '''In this function, we will compute the gradient w.r.t. w '''

    # Log-loss part: (sigmoid(w.x + b) - y) * x
    part_1 = np.dot(custom_sigmoid(np.dot(x,w.T) + b)-y,x)
    # L1 part: (alpha/N) * sign(w), the subgradient of alpha*|w| spread over the N samples
    part_2 = (alpha/N) * np.sign(w)
    dw = part_1+part_2

    return dw

In [None]:
def gradient_db(x, y, w, b):
    '''In this function, we will compute the gradient w.r.t. b '''
    sig = custom_sigmoid(np.dot(x,w.T)+b)
    # Summed so that the result is always a scalar, whether x is one row or a batch of rows
    db = np.sum(sig - y)
    return db
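
A quick way to gain confidence in hand-derived gradients is to compare them with a finite-difference estimate. The sketch below does this for a single sample in plain Python. It is an illustration under stated assumptions: all helper names are hypothetical, the loss uses the natural logarithm (the base-10 version differs only by a constant factor of ln 10), and the weights are kept away from zero, where the L1 term is not differentiable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def sample_loss(x, y, w, b, alpha, N):
    """Natural-log loss of one sample plus its 1/N share of the L1 penalty."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    nll = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return nll + (alpha / N) * sum(abs(wj) for wj in w)

def analytic_grad_w(x, y, w, b, alpha, N):
    """(sigmoid(w.x + b) - y) * x + (alpha / N) * sign(w)."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return [(p - y) * xj + (alpha / N) * sign(wj) for wj, xj in zip(w, x)]

def numeric_grad_w(x, y, w, b, alpha, N, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        diff = sample_loss(x, y, w_plus, b, alpha, N) - sample_loss(x, y, w_minus, b, alpha, N)
        grads.append(diff / (2 * eps))
    return grads

# Made-up sample and parameters.
x, y, w, b, alpha, N = [0.5, -1.2, 2.0], 1, [0.3, -0.7, 0.1], 0.2, 0.01, 10
analytic = analytic_grad_w(x, y, w, b, alpha, N)
numeric = numeric_grad_w(x, y, w, b, alpha, N)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

A small `max_diff` indicates that the analytic expression and the loss agree with each other.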

In [None]:
def custom_train(X_train, y_train,alpha, eta0,tolerance):
  """
  In this function we will compute optimal values for weights and bias terms on
  the train data.

  Here eta0 is the learning rate and alpha is the regularization term.
  """
  train_loss=[]
  N = X_train.shape[0]

  # 1. Initialize the weights and bias (call the initialize_weights_bias() function)
  w,b = initialize_weights_bias(X_train.shape[1])

  # 2. Repeat for many epochs until condition "e" is met
  num_epochs=0
  while True:
      num_epochs+=1
      # a) for every data point (x_i, y_i): compute the gradients w.r.t. w and b, then update w and b
      for x_i, y_i in zip(X_train, y_train):
          dw = gradient_dw(x_i, y_i, w, b, alpha, N)
          db = gradient_db(x_i, y_i, w, b)
          w = w - eta0*dw
          b = b - eta0*db
      # b) predict the output for all data points in X_train using the updated w, b
      #    (clipped away from 0 and 1 so the logarithms in the loss stay finite)
      y_pred = np.clip(custom_sigmoid(np.dot(X_train, w) + b), 1e-15, 1 - 1e-15)
      # c), d) compute the loss for this epoch and store it in the list
      train_loss.append(custom_loss(y_train, y_pred, alpha, w))
      # e) stop once the loss improves by less than the tolerance between consecutive epochs
      if len(train_loss)>=2 and train_loss[-2]-train_loss[-1] < tolerance:
          break

  # 3. Return the values of weights, bias, train_loss and num_epochs
  return w,b,train_loss,num_epochs
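
The procedure outlined in the comments above (one update per data point, one loss value per epoch, stop when the improvement falls below the tolerance) can be tried end to end on a toy problem. The sketch below is a plain-Python illustration, not the notebook's implementation: `toy_sgd`, the four-point dataset, the learning rate of 0.1, and the `max_epochs` safety cap are all assumptions chosen so the example finishes quickly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def toy_sgd(X, y, alpha=0.0001, eta0=0.1, tolerance=1e-3, max_epochs=1000):
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    losses = []
    for _ in range(max_epochs):
        # One SGD update per training point.
        for x_i, y_i in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
            w = [wj - eta0 * (err * xj + (alpha / n) * sign(wj)) for wj, xj in zip(w, x_i)]
            b -= eta0 * err
        # Loss over the full data set at the end of the epoch (base-10 logs).
        probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) for x_i in X]
        log_loss = -sum(t * math.log10(p) + (1 - t) * math.log10(1 - p)
                        for t, p in zip(y, probs)) / n
        losses.append(log_loss + alpha * sum(abs(wj) for wj in w))
        # Stop when the improvement between consecutive epochs is below tolerance.
        if len(losses) >= 2 and losses[-2] - losses[-1] < tolerance:
            break
    return w, b, losses

X_toy = [[-2.0], [-1.0], [1.0], [2.0]]  # negative inputs are class 0, positive are class 1
y_toy = [0, 0, 1, 1]
w_toy, b_toy, toy_losses = toy_sgd(X_toy, y_toy)
toy_labels = [int(sigmoid(w_toy[0] * x[0] + b_toy) >= 0.5) for x in X_toy]
```

On this separable toy set the learned weight is positive and the thresholded predictions reproduce the labels.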

In [None]:
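# train with the same hyperparameters that the graders below use
w,b,train_loss,epochs = custom_train(train_comments_std_vec,train_comments_cat.values, 0.0001,0.0001,0.001)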
alt.Chart(pd.DataFrame({'epochs':range(epochs),'train_loss':train_loss})).mark_line().encode(
    x='epochs',
    y='train_loss'
).properties(
    title='epoch vs loss'
)

In [None]:
def predict(w,b, X):
    '''function to predict labels (0 or 1) given weights, bias and standardized data'''
    # np.mean(b) leaves a scalar bias unchanged and reduces an array-valued bias to a scalar
    probabilities = custom_sigmoid(np.dot(X, w) + np.mean(b))
    # threshold the probabilities at 0.5 to obtain hard labels
    return (probabilities >= 0.5).astype(int)

### Grader functions

In [None]:
def grader_sigmoid(z):
  val = custom_sigmoid(z)
  assert(val==0.8807970779778823)
  return True

grader_2 = grader_sigmoid(2)
print("Grader_2 Status : ", grader_2)

Grader_2 Status :  True


In [None]:
# Grader function to check the initialization of your weights and bias terms.

def grader_weights_bias(w,b):
  assert((len(w)==2000) and b==0)
  return True

dim = 2000
w,b = initialize_weights_bias(dim)
grader_1 = grader_weights_bias(w,b)
print("Grader_1 Status : ", grader_1)

Grader_1 Status :  True


In [None]:
# Grader function to check the implementation of logloss
def grader_loss():
  true_values = np.array([1,1,0,1,0])
  pred_values = np.array([0.9,0.8,0.1,0.8,0.2])
  w= np.array([0.1]*10)
  alpha= 0.0001
  loss = custom_loss(true_values, pred_values,alpha,w)
  assert(loss==0.07644900402910389+0.0001*10*0.1)
  return True


grader_3 = grader_loss()
print("Grader_3 Status : ", grader_3)

Grader_3 Status :  True


In [None]:
def grader_weights_bias():
  # fitting sklearn SGD classifier
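  # loss='log_loss' is the name used by scikit-learn 1.1 and later; older releases call it 'log'
  clf = linear_model.SGDClassifier(eta0=0.0001, alpha=0.0001, loss='log_loss', random_state=15, penalty='l1', tol=1e-3, learning_rate='constant')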
  clf.fit(train_comments_std_vec,train_comments_cat.values)
  model_coef= clf.coef_[0]

  # fitting custom train with the same learning rate, regularization and tolerance as sklearn
  w,b,_,epoch = custom_train(train_comments_std_vec,train_comments_cat.values, 0.0001,0.0001,0.001)

  # checking whether the weights and bias returned by both implementations are close
  assert not (w - model_coef > 0.02).any()
  assert not (np.mean(b) - clf.intercept_ > 0.02)

  return True

grader_4 = grader_weights_bias()
print("Grader_4 Status : ", grader_4)

Grader_4 Status :  True
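
Note that the assertions in this grader only check the difference in one direction: a custom weight that is far *below* the sklearn coefficient still passes. A symmetric closeness check could be sketched as follows. The arrays are made-up values for illustration, and the 0.02 threshold is simply reused from the grader.

```python
import numpy as np

custom_w = np.array([0.10, -0.50, 0.30])     # made-up "custom" weights
reference_w = np.array([0.11, 0.20, 0.29])   # made-up "reference" weights

# One-sided check, in the style of the grader: passes despite the -0.70 gap.
one_sided_ok = not (custom_w - reference_w > 0.02).any()

# Two-sided check: every absolute difference must be at most 0.02.
two_sided_ok = bool(np.allclose(custom_w, reference_w, rtol=0.0, atol=0.02))

print(one_sided_ok, two_sided_ok)
```

Here the one-sided test reports success while the two-sided test does not, because the second weight differs by 0.70 in the negative direction.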


In [None]:
def grader_predict():
  ''' grader to check the test accuracy'''
  w,b,_,_ = custom_train(train_comments_std_vec,train_comments_cat.values, 0.0001,0.0001,0.001)
  test_preds= predict(w,b,test_comments_std_vec)
  test_accuracy= (np.sum(test_comments_cat==test_preds)/len(test_preds))*100
  if(test_accuracy>=90):
    print("Success!!!",test_accuracy)
  else:
    print("Failed! \n Test accuracy = ", test_accuracy)
  return

grader_predict()

Success!!! 100.0
