# Logistic Regression Evaluative

- In this evaluative, you will be implementing a logistic regression model from scratch.
- You will train your model on `train.csv` and evaluate its performance on `test.csv`.
- Use **ONLY** a logistic regression model.
- You will be evaluated on the basis of the accuracy score on the test dataset.
- **DO NOT** change the notebook name. The notebook name should be `eval.ipynb`.

Guidelines to be followed:
- You may refer to previous labs as you wish. Using pre-implemented setups is not allowed and will receive a score of 0.
- Submit your results in the format shown in the `sample_submission.csv` file.

In [1]:
#These are the only imports allowed.
import os
import numpy as np
from matplotlib import pyplot
import pandas as pd
%matplotlib inline

## Data Preprocessing

In [2]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt,target
0,Dog,2,Mixed Breed,Male,Black,Brown,Medium,Medium,Yes,No,Healthy,0,3,1
1,Dog,3,Jack Russell Terrier,Female,Brown,White,Medium,Short,Yes,No,Healthy,500,1,1
2,Cat,3,Domestic Short Hair,Female,Gray,White,Small,Medium,No,No,Healthy,0,1,1
3,Dog,2,Mixed Breed,Female,Black,Brown,Medium,Medium,Yes,No,Healthy,0,7,1
4,Dog,12,Poodle,Male,Brown,Cream,Medium,Medium,Yes,Yes,Healthy,0,8,1


In [3]:
df.describe()

Unnamed: 0,Age,Fee,PhotoAmt,target
count,10383.0,10383.0,10383.0,10383.0
mean,11.770105,23.912646,3.612058,0.734277
std,19.487016,80.72063,3.175399,0.441739
min,0.0,0.0,0.0,0.0
25%,2.0,0.0,2.0,0.0
50%,4.0,0.0,3.0,1.0
75%,12.0,0.0,5.0,1.0
max,255.0,2000.0,30.0,1.0


In [4]:
all_features = list(df.columns)
# Strictly speaking, target is not an ordinal feature, but for implementation purposes we include it here.
ordinal_features = ['Age', 'Fee', 'PhotoAmt', 'target']
categorical_features = [feature for feature in all_features if feature not in ordinal_features]
print('Categorical Features:', categorical_features)

Categorical Features: ['Type', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health']


In [5]:
df = pd.get_dummies(df, columns=categorical_features).astype(float)
df.head()

Unnamed: 0,Age,Fee,PhotoAmt,target,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,2.0,0.0,3.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,3.0,500.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
2,3.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,2.0,0.0,7.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
4,12.0,0.0,8.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0


In [6]:
df_x = df.drop('target', axis=1)
df_y = df['target']

df_min = df_x.min()
df_max = df_x.max()

df_x = (df_x - df_min) / (df_max - df_min)
df_x.describe()

Unnamed: 0,Age,Fee,PhotoAmt,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,Breed1_American Curl,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
count,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,...,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0,10383.0
mean,0.046157,0.011956,0.120402,0.425407,0.574593,0.000385,0.001734,0.000193,9.6e-05,0.001252,...,0.576423,0.430126,0.132139,0.437735,0.650775,0.115959,0.233266,0.961861,0.035443,0.002697
std,0.07642,0.04036,0.105847,0.494428,0.494428,0.019625,0.041602,0.013878,0.009814,0.035364,...,0.494149,0.495117,0.338658,0.496132,0.476748,0.320191,0.42293,0.191542,0.184904,0.051862
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.007843,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,0.015686,0.0,0.1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
75%,0.047059,0.0,0.166667,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Conversion of data to numpy arrays

In [7]:
x = df_x.values
y = df_y.values

print('x shape:', x.shape)
print('y shape:', y.shape)

x shape: (10383, 199)
y shape: (10383,)


## Learning Weights

Randomly initialize the weights using samples from a standard normal distribution

In [8]:
w = np.random.normal(0, 1, x.shape[-1])
b = np.random.normal()
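
Because the weights are drawn at random, the loss and accuracy values change from run to run. If you want repeatable results, one option is to draw from a seeded generator. The following is a minimal sketch, not part of the original notebook; `SEED`, `w_init`, and `b_init` are hypothetical names, and 199 is the feature count printed above.

```python
import numpy as np

# Hypothetical seed value; any fixed integer gives repeatable draws.
SEED = 0
rng = np.random.default_rng(SEED)

n_features = 199  # matches x.shape[-1] printed above
w_init = rng.normal(0.0, 1.0, n_features)  # one weight per feature
b_init = rng.normal()                      # scalar bias

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(SEED)
w_again = rng_again.normal(0.0, 1.0, n_features)
```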

Make a prediction using the random weights

In [9]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
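
For inputs with a large negative value, `np.exp(-x)` overflows to `inf` and NumPy emits a `RuntimeWarning`, even though the final result is still 0. A common way to avoid this is to evaluate the sigmoid in two algebraically equivalent forms depending on the sign of the input. The sketch below is one such variant; `stable_sigmoid` is a hypothetical name, not a function used elsewhere in this notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in np.exp for large-magnitude inputs."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so this form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) < 1; use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
probs = stable_sigmoid(z)
```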

In [10]:
y_pred = sigmoid(x @ w + b) > 0.5
y_pred = y_pred.astype(int)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

Define the Loss Function

In [11]:
def binary_cross_entropy(y_pred, y_true):
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
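
If a prediction is exactly 0 or 1, `np.log` returns `-inf` and the loss becomes `inf` or `nan`. One common safeguard is to clip the predictions slightly away from the boundaries before taking the logarithm. The sketch below assumes a small hypothetical tolerance `eps`; the function name is also an assumption and not part of the original notebook.

```python
import numpy as np

def binary_cross_entropy_clipped(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy with predictions clipped away from 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true_demo = np.array([1.0, 0.0, 1.0, 0.0])
y_saturated = np.array([1.0, 0.0, 0.0, 1.0])  # includes confidently wrong predictions
loss_saturated = binary_cross_entropy_clipped(y_saturated, y_true_demo)
```

For predictions that are already strictly between `eps` and `1 - eps`, clipping leaves them unchanged, so the loss matches the unclipped version.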

Define the gradient of the loss function with respect to the weights

In [12]:
def grad(y_pred, y_true, x):
    return np.mean((y_pred - y_true) * x.T, axis=1), np.mean(y_pred - y_true)
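
For a sigmoid output with the binary cross-entropy loss, the gradient with respect to the weights simplifies to the average of `(y_pred - y_true)` times each feature column, and the bias gradient is the average residual. The same quantity can be written as a matrix product, which avoids building the intermediate `(n_features, n_samples)` array. The sketch below compares the two forms on small random data; `grad_matmul` and the `*_demo` arrays are hypothetical names used only for illustration.

```python
import numpy as np

def grad_matmul(y_pred, y_true, x):
    """Gradient of the mean BCE w.r.t. weights and bias, via a matrix product."""
    residual = y_pred - y_true          # shape (n_samples,)
    dw = x.T @ residual / len(y_true)   # shape (n_features,)
    db = residual.mean()
    return dw, db

rng = np.random.default_rng(0)
x_demo = rng.random((6, 3))
y_demo = rng.integers(0, 2, size=6).astype(float)
p_demo = rng.random(6)

dw_a, db_a = grad_matmul(p_demo, y_demo, x_demo)
# The broadcasting formulation used in the cell above, for comparison.
dw_b = np.mean((p_demo - y_demo) * x_demo.T, axis=1)
```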

Write the Gradient Descent Algorithm and train the model

In [13]:
NUM_EPOCHS = 1_000
LEARNING_RATE = 0.1

for epoch in range(NUM_EPOCHS):
    y_pred = sigmoid(x @ w + b)
    loss = binary_cross_entropy(y_pred, y)
    dw, db = grad(y_pred, y, x)
    w -= LEARNING_RATE * dw
    b -= LEARNING_RATE * db
    if epoch % 100 == 0:
        print(f'Epoch {epoch}: {loss}')

Epoch 0: 3.52989346624372
Epoch 100: 0.669642485209944
Epoch 200: 0.6311847932776785
Epoch 300: 0.6105644277766847
Epoch 400: 0.5975023469849018
Epoch 500: 0.588370382653848
Epoch 600: 0.5816099384058923
Epoch 700: 0.5764026650915777
Epoch 800: 0.5722616440587686
Epoch 900: 0.5688775442908208


Print the Training Accuracy

In [14]:
y_pred_final = sigmoid(x @ w + b) > 0.5
y_pred_final = y_pred_final.astype(int)

accuracy = np.mean(y_pred_final == y)
print(f'Training Accuracy: {accuracy}')

Training Accuracy: 0.7277280169507849
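
It helps to compare this number against a trivial baseline. The `describe()` output earlier reports a mean `target` of about 0.734, so always predicting 1 would score roughly 0.734 on the training set, slightly above the accuracy printed above. The sketch below computes such a majority-class baseline on toy labels; the function name and `y_demo` are hypothetical.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class in y (labels 0/1)."""
    positive_rate = np.mean(y)
    return max(positive_rate, 1.0 - positive_rate)

y_demo = np.array([1, 1, 1, 0])  # toy labels, not the real data
baseline = majority_baseline_accuracy(y_demo)
```

If the trained model does not beat this baseline, training for more epochs or adjusting the learning rate may be worth trying.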


## Predictions on Test Set

Preprocess the test set in the same way as the training set.

**NOTE:** The number of features in the test set may differ from the number of features in the training set, because some categorical features may have a different number of categories in the two sets. This is a common problem in data preprocessing, and you will have to find a way around it. (If you did the previous labs, you might have encountered the same problem.)

**NOTE:** Depending on your method of preprocessing, you might not encounter this problem.
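
One possible way around the mismatch is to reindex the encoded test frame against the training columns. The sketch below uses small toy frames rather than the real data; `train_raw`, `test_raw`, and the `Hairless` category are made up for illustration. It relies on `DataFrame.reindex` with `columns=` and `fill_value=`.

```python
import pandas as pd

# Toy frames standing in for the raw training and test data.
train_raw = pd.DataFrame({"Age": [2, 3, 12],
                          "Type": ["Dog", "Cat", "Dog"],
                          "FurLength": ["Short", "Medium", "Long"]})
test_raw = pd.DataFrame({"Age": [5, 1],
                         "Type": ["Cat", "Cat"],
                         "FurLength": ["Short", "Hairless"]})

cat_cols = ["Type", "FurLength"]
train_enc = pd.get_dummies(train_raw, columns=cat_cols).astype(float)
test_enc = pd.get_dummies(test_raw, columns=cat_cols).astype(float)

# Keep exactly the training columns, in training order: columns missing from
# the test set are created and filled with 0, and test-only columns are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)
```

Because all columns are created in a single call, this approach also avoids inserting columns one at a time.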

In [15]:
df_test = pd.read_csv('test.csv')
id_columns = df_test['ID']
df_test = df_test.drop('ID', axis=1)
df_test.head()

Unnamed: 0,Type,Age,Breed1,Gender,Color1,Color2,MaturitySize,FurLength,Vaccinated,Sterilized,Health,Fee,PhotoAmt
0,Dog,60,Mixed Breed,Female,Black,Brown,Large,Short,Yes,Yes,Healthy,0,5
1,Cat,2,Domestic Short Hair,Female,Black,Gray,Small,Short,No,No,Healthy,0,3
2,Dog,1,Mixed Breed,Female,Black,Brown,Medium,Short,No,No,Healthy,0,5
3,Cat,2,Domestic Medium Hair,Female,Black,White,Small,Medium,No,No,Healthy,0,3
4,Cat,12,Domestic Medium Hair,Female,Black,White,Medium,Medium,Yes,Yes,Healthy,150,3


In [16]:
df_test = pd.get_dummies(df_test, columns=categorical_features).astype(float)
df_test.head()

Unnamed: 0,Age,Fee,PhotoAmt,Type_Cat,Type_Dog,Breed1_Abyssinian,Breed1_American Shorthair,Breed1_Australian Kelpie,Breed1_Australian Terrier,Breed1_Beagle,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,60.0,0.0,5.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1,2.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,5.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,2.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,12.0,150.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0


In [17]:
# The one-hot encoded test data has 110 features, while the training data has 199.
# Some categorical values in the training data do not appear in the test data,
# so one-hot encoding the test data produces fewer columns.
# We fix this by adding the missing features to the test data as all-zero columns.

missing_features = set(df_x.columns) - set(df_test.columns)
for feature in missing_features:
    df_test[feature] = 0
df_test = df_test[df_x.columns] # Reorder columns to match the training data; this also drops any test-only columns

# Adding columns one at a time can raise a warning, but for our purposes it's fine. You are encouraged to look up the warning and understand why it happens (and how to fix it).
df_test.head()



Unnamed: 0,Age,Fee,PhotoAmt,Type_Cat,Type_Dog,Breed1_0,Breed1_Abyssinian,Breed1_Akita,Breed1_American Bulldog,Breed1_American Curl,...,FurLength_Short,Vaccinated_No,Vaccinated_Not Sure,Vaccinated_Yes,Sterilized_No,Sterilized_Not Sure,Sterilized_Yes,Health_Healthy,Health_Minor Injury,Health_Serious Injury
0,60.0,0.0,5.0,0.0,1.0,0,0.0,0,0,0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
1,2.0,0.0,3.0,1.0,0.0,0,0.0,0,0,0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,5.0,0.0,1.0,0,0.0,0,0,0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,2.0,0.0,3.0,1.0,0.0,0,0.0,0,0,0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,12.0,150.0,3.0,1.0,0.0,0,0.0,0,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0


Convert the test set to numpy array(s)

In [18]:
x_test = df_test.values
print('x_test shape:', x_test.shape)

x_test shape: (1154, 199)
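
Note that cell 6 min-max scales the training features with `df_min` and `df_max`, whereas the test cells above pass unscaled values such as `Age = 60.0` to the model. If the intent is to preprocess both sets identically, the training statistics would need to be applied to the test set as well. The sketch below shows the idea on toy data; `min_max_scale` and the `*_demo` frames are hypothetical, and the zero-range guard is an added assumption for constant columns.

```python
import pandas as pd

def min_max_scale(df, col_min, col_max):
    """Scale df with the given per-column min/max; constant columns map to 0."""
    col_range = (col_max - col_min).replace(0, 1)  # avoid division by zero
    return (df - col_min) / col_range

train_demo = pd.DataFrame({"Age": [0.0, 10.0, 20.0], "Fee": [0.0, 0.0, 0.0]})
test_demo = pd.DataFrame({"Age": [5.0, 40.0], "Fee": [0.0, 0.0]})

# Statistics come from the training data only and are reused for the test data.
train_min, train_max = train_demo.min(), train_demo.max()
train_scaled = min_max_scale(train_demo, train_min, train_max)
test_scaled = min_max_scale(test_demo, train_min, train_max)
```

Test values can fall outside the range [0, 1] when they exceed the training minimum or maximum; that is expected with this scheme.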


Use the learned weights to make predictions on the test set

In [19]:
y_pred = sigmoid(x_test @ w + b) > 0.5
y_pred = y_pred.astype(int)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

Save these predictions as a csv file called `submission.csv` in the format given in `sample_submission.csv`

In [20]:
df = pd.DataFrame({'ID': id_columns, 'target': y_pred})
df.head()

Unnamed: 0,ID,target
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1


In [21]:
df.to_csv('submission.csv', index=False)

## Submission Cells

We will now zip and prepare the notebook and csv for submission.

Preliminary checks to ensure `submission.csv` is in the correct format.

In [22]:
df_temp = pd.read_csv('submission.csv')
test_temp = pd.read_csv('test.csv')
assert len(df_temp.columns) == 2, "Number of columns in the submission file is not correct, check the submission format"
assert list(df_temp.columns) == ['ID', 'target'] , "Column names are not correct, check the submission format"
assert df_temp['target'].isin([0, 1]).all(), "The prediction should be 0 or 1 only"
assert len(df_temp) == len(test_temp), "Number of rows in the submission file is not correct"

Prepare the submission zip.

Note: Ensure that your notebook has been saved up to this point with the name `eval.ipynb`.

In [23]:
import shutil
import os

if not os.path.exists('temp'):
    os.makedirs('temp')

if os.path.exists('submission.csv'):
    shutil.copy('submission.csv','temp/submission.csv')

if os.path.exists('eval.ipynb'):
    shutil.copy('eval.ipynb',os.path.join('temp','eval.ipynb'))

shutil.make_archive('submission', 'zip', 'temp')
shutil.rmtree('temp')

Submit the `submission.zip` file to Kaggle.