<a href="https://colab.research.google.com/github/geekdyout/yadu_gdsc_aiml_submission/blob/main/Yadu_Krishnan_LevelOne_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GDSC - Artificial Intelligence & Machine Learning


---

### **I. Introduction to AI / ML**
Machine learning aims to teach a machine to perform a specific task and produce accurate results by identifying patterns in data. Curious how ML does this? There is a lot of math and logic behind what happens inside a machine learning model.

### **II. Supervised & Unsupervised Machine Learning**
Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data and, on the basis of that data, predict the output. Labelled data means that some input data is already tagged with the correct output.

Supervised learning is the process of providing input data together with the correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabelled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention.
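
To make the contrast concrete, here is a minimal sketch on toy data (all values are made up for illustration). The supervised part learns a line from labelled pairs with a least-squares fit, and the unsupervised part groups unlabelled points with a few iterations of a simple 1-D k-means. Neither is part of the assignment itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised: inputs x come with labels y, and we learn a mapping x -> y.
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=x.shape)  # labels (toy data)
A = np.column_stack([x, np.ones_like(x)])                 # columns: x and a constant
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Unsupervised: only inputs, no labels. Group 1-D points into two clusters
# with a few iterations of k-means.
points = np.concatenate([rng.normal(0.0, 0.5, 100), rng.normal(5.0, 0.5, 100)])
centers = np.array([points.min(), points.max()])  # simple initial guess
for _ in range(10):
    # Assign each point to its nearest center, then move each center to the mean.
    labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([points[labels == k].mean() for k in range(2)])

print("slope, intercept:", slope, intercept)
print("cluster centers:", centers)
```

The fitted slope and intercept should land close to the values used to generate the labels, while the cluster centers are found without any labels at all.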

### **III. Level One**
For level one, we expect you to research on your own and complete this assignment. We will not explain the concepts directly; you are expected to learn them on your own. However, if you have any doubts, you can contact Dhruv Shah or Advik Raj Basani. This assignment deals ONLY with Supervised Machine Learning, and the math involved has already been covered in either first year or 12th grade. The math concepts we expect you to know before starting this assignment are:

*   Multiplication of Matrices
*   Dot Product & Cross Product
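
As a quick refresher, the operations listed above can be sketched in NumPy as follows. The matrices and vectors are arbitrary examples.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication: rows of A times columns of B.
product = A @ B          # same as np.matmul(A, B)

u = np.array([1, 0, 0])
v = np.array([0, 1, 0])

dot = np.dot(u, v)       # scalar: sum of element-wise products
cross = np.cross(u, v)   # vector perpendicular to both u and v

print(product)           # [[19 22], [43 50]]
print(dot, cross)        # 0 [0 0 1]
```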

You will also need to know the very basics of Python. This is not very difficult, and you can familiarise yourself with how Python works at this link: https://www.w3schools.com/python/

### **IV. How to Start**
First off, go to File -> Save a copy in Drive, share the copied file on YOUR Drive (with Editing permissions), and fill out this form with the share link of the copy.

https://forms.gle/D6qigSNLEaPf3kTB6

And... you're done! You can now get started and learn how ML works.



# Question 1: Linear Regression using Gradient Descent

Hey! Welcome to your first assignment question. Let's introduce you to the question.


---

You are given a dataset of house prices in an area. The dataset contains multiple features for each house. You are expected to create a Linear Regression ML model (**ONLY USING NUMPY**, a Python library) to predict prices. The Drive folder linked below contains the training dataset and the test dataset. Your assignment is to fill in the blanks in the code and maximize accuracy / minimize loss by using as many features as you can.

Training_Data, Testing_Data & Information_about_Features can be found in this drive link: https://drive.google.com/drive/folders/1jTnYiFaUn0czGEmOS637SWgwaNdSDp07?usp=sharing

Resources to study:
* https://www.geeksforgeeks.org/ml-linear-regression/
* https://www.javatpoint.com/cost-function-in-machine-learning
* https://www.javatpoint.com/gradient-descent-in-machine-learning
* https://www.scaler.com/topics/np-vectorize/
* https://www.simplilearn.com/what-is-multiple-linear-regression-in-machine-learning-article
* https://www.javatpoint.com/feature-engineering-for-machine-learning
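
For reference, one common way to write linear regression with gradient descent is sketched below. The notation is an assumption for this summary: $X$ is the $m \times n$ feature matrix, $w$ the $n \times 1$ weight vector, $b$ the bias, and $\alpha$ the learning rate.

```latex
\hat{y} = Xw + b

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2

\frac{\partial J}{\partial w} = -\frac{1}{m} X^{\top} (y - \hat{y}),
\qquad
\frac{\partial J}{\partial b} = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)

w \leftarrow w - \alpha \frac{\partial J}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
```

If the cost is defined as the plain mean squared error (without the factor $\tfrac{1}{2}$), both gradients pick up a factor of 2, which has the same effect as doubling the learning rate.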


In [1]:
import numpy as np
import pandas as pd


In [2]:

def ConvertToInputOutput(dataframe):
  Y = dataframe[['SalePrice']]
  chosen_feats = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
  X = dataframe[chosen_feats]
  return X,Y


# Normalization Function
def Normalize(X):
  return (X - X.mean())/(X.std()+0.001)

# Randomize a dataset
def randomize_dataset(dataset):
    dataset = dataset.sample(frac=1)
    return dataset

# Split Training and Test Data (n is the fraction of rows held out for testing)
def obtain_training_test_data(X, Y, n):
    num = round(X.shape[0] * (1 - n))

    X_train = X.iloc[0:num, :]
    X_test = X.iloc[(num):, :]

    Y_train = Y.iloc[0:num]
    Y_test = Y.iloc[(num):]

    return X_train, X_test, Y_train, Y_test

In [5]:
#load the dataset
train = pd.read_csv("/train.csv")
print(train.isnull().sum()[train.isnull().sum() > 0])

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64


In [6]:
train = randomize_dataset(train)
X1,Y1 = ConvertToInputOutput(train)

X_train, X_test, Y_train, Y_test = obtain_training_test_data(X1, Y1, 0.3)

#Normalisation
X_train = Normalize(X_train)
X_test = Normalize(X_test)
Y_train = Normalize(Y_train)
Y_test = Normalize(Y_test)
X_train = X_train.to_numpy()
Y_train = Y_train.to_numpy()
X_test = X_test.to_numpy()
Y_test = Y_test.to_numpy()
print("[PREPROCESSING] Completed")

[PREPROCESSING] Completed
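
The cell above normalizes the test split with its own mean and standard deviation. A common alternative, sketched here as an assumption rather than as the assignment's required method, is to compute the statistics on the training split only and reuse them for the test split, so both splits share one scale. The helper names `fit_scaler` and `apply_scaler` and the toy values are hypothetical.

```python
import numpy as np

def fit_scaler(X_train, eps=0.001):
    """Return the column-wise mean and (std + eps) of the training split."""
    mean = X_train.mean(axis=0)
    scale = X_train.std(axis=0, ddof=1) + eps  # ddof=1 matches pandas' default .std()
    return mean, scale

def apply_scaler(X, mean, scale):
    """Standardize X with statistics computed elsewhere (e.g. on training data)."""
    return (X - mean) / scale

# Toy values standing in for a single feature column (hypothetical).
X_train_toy = np.array([[1000.0], [1500.0], [2000.0]])
X_test_toy = np.array([[1500.0], [2500.0]])

mean, scale = fit_scaler(X_train_toy)
train_scaled = apply_scaler(X_train_toy, mean, scale)
test_scaled = apply_scaler(X_test_toy, mean, scale)  # reuses the training statistics
print(test_scaled)
```

With this approach a test value equal to the training mean maps to 0 regardless of what else is in the test split.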


In [7]:
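class LinearRegression():
  def __init__(self):
    self.loss = []  # MSE recorded at every iteration

  def fit(self, X, Y, learningRate=0.1, numIterations=10):
    m, n = X.shape  # m samples, n features
    # Statistics of Y as passed in. If Y is already normalized, these are
    # close to 1 and 0, so predict() will not recover the original price scale.
    self.outputstd = Y.std()
    self.outputmean = Y.mean()
    self.weights = np.ones((n, 1))
    self.bias = 0

    for i in range(numIterations):
      predY = np.dot(X, self.weights) + self.bias
      error = Y - predY
      mse = np.square(error).mean()

      self.loss.append(mse)

      # Gradients of the (half) mean squared error w.r.t. weights and bias
      dw = -(1/m) * np.dot(X.T, error)
      db = -(1/m) * np.nansum(error, keepdims=True)
      self.weights = self.weights - learningRate * dw
      self.bias = self.bias - learningRate * db
      if i % 10 == 0:
        print("Iteration ", i + 1, "/", numIterations, "MSE: ", mse)

    print(self.loss)

  def predict(self, X):
    # Rescale predictions using the target statistics stored in fit()
    return (np.dot(X, self.weights) + self.bias) * (self.outputstd + 0.001) + self.outputmean

  def valid(self, X, Y):
    # Print the MSE on a held-out set, in the same (normalized) units as Y
    predY = np.dot(X, self.weights) + self.bias
    error = Y - predY
    mse = np.square(error).mean()
    print(mse)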
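
One way to sanity-check a gradient descent implementation like the class above is to compare it with the closed-form least-squares solution on synthetic data. The sketch below repeats the same update rule in a standalone loop; the data, learning rate, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
true_w = np.array([[2.0], [-1.0], [0.5]])
Y = X @ true_w + 0.3 + rng.normal(scale=0.01, size=(m, 1))

# Gradient descent with the same update rule as the class above.
w = np.ones((n, 1))
b = 0.0
lr = 0.1
for _ in range(1000):
    error = Y - (X @ w + b)
    w -= lr * (-(1 / m) * X.T @ error)
    b -= lr * (-(1 / m) * error.sum())

# Closed-form least squares on [X, 1] for comparison.
A = np.hstack([X, np.ones((m, 1))])
solution, *_ = np.linalg.lstsq(A, Y, rcond=None)

print("gradient descent:", w.ravel(), b)
print("least squares:   ", solution.ravel())
```

If the two results disagree noticeably on well-conditioned data like this, the gradient or the update step is the first place to look.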

In [8]:
linear = LinearRegression()
linear.fit(X_train, Y_train, learningRate = 0.01, numIterations = 1000)

Iteration  1 / 1000 MSE:  13.786066745768053
Iteration  11 / 1000 MSE:  6.9710984078374505
Iteration  21 / 1000 MSE:  3.5877659942949105
Iteration  31 / 1000 MSE:  1.9069202567561536
Iteration  41 / 1000 MSE:  1.07082487466749
Iteration  51 / 1000 MSE:  0.6539898907025571
Iteration  61 / 1000 MSE:  0.44533396511442125
Iteration  71 / 1000 MSE:  0.3401310987929162
Iteration  81 / 1000 MSE:  0.2864131516842576
Iteration  91 / 1000 MSE:  0.2583838026208579
Iteration  101 / 1000 MSE:  0.243230886223653
Iteration  111 / 1000 MSE:  0.23458469374103674
Iteration  121 / 1000 MSE:  0.2292728726583296
Iteration  131 / 1000 MSE:  0.2257105539732573
Iteration  141 / 1000 MSE:  0.2231016222661854
Iteration  151 / 1000 MSE:  0.2210423584069431
Iteration  161 / 1000 MSE:  0.21932469139438973
Iteration  171 / 1000 MSE:  0.21783848059006308
Iteration  181 / 1000 MSE:  0.2165229361556625
Iteration  191 / 1000 MSE:  0.2153424276384688
Iteration  201 / 1000 MSE:  0.21427440208518306
Iteration  211 / 1000 

In [9]:
linear.valid(X_test, Y_test)

0.31096374791510545
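
Because the targets were standardized to roughly unit variance, a validation MSE of about 0.31 corresponds to a coefficient of determination (R²) of roughly 0.69. The helper below is a small sketch for computing R² directly; the function name `r2_score` and the sample values are assumptions for illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
r2 = r2_score(y_true, y_pred)
print(r2)
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean.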


# Question 2: Logistic Regression
Welcome to your second question! Let us introduce you to the question.


---
You are provided with a dataset that is used to predict whether a patient has Breast Cancer. Your mission, should you choose to accept it, is to use the dataset to create a Logistic Regression model that predicts whether a patient's tumour is benign or malignant. Try to obtain the highest accuracy.

Dataset: https://drive.google.com/drive/folders/1jTnYiFaUn0czGEmOS637SWgwaNdSDp07?usp=sharing

Resources to study:
* https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
* https://www.scaler.com/topics/matplotlib/matplotlib-heatmap/
* https://www.geeksforgeeks.org/understanding-logistic-regression/
* https://www.analyticsvidhya.com/blog/2021/10/building-an-end-to-end-logistic-regression-model/
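
For orientation, the standard logistic regression equations are summarized below. The shape convention is an assumption: $X$ is $n \times m$ with one sample per column, $w$ is $n \times 1$, $Y$ and $H$ are $1 \times m$, and $b$ is an optional scalar bias.

```latex
z = w^{\top} X + b,
\qquad
H = \sigma(z) = \frac{1}{1 + e^{-z}}

J(w, b) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \right]

\frac{\partial J}{\partial w} = \frac{1}{m} X (H - Y)^{\top},
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (h_i - y_i)
```

Here $J$ is the binary cross-entropy loss, and $h_i$ is the predicted probability that sample $i$ belongs to class 1.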




In [None]:
# Define the sigmoid function
def sigmoid(z):
  return _______________________

# Defining X and Y
def GetInputOutputLR(dataset):
  X = dataset.iloc[:, 0:-1]
  Y = dataset.iloc[:, -1:]

  return X, Y

def obtain_training_test_data(X, Y, n):
  # Your function! Learn to use pandas to split training and test data!

In [None]:
# Fill in the blanks! You should be comfortable with numpy & pandas now.
dataset = pd.read_csv(_________________________)
dataset.info()

In [None]:
X, Y = GetInputOutputLR(dataset)

# This dataset has 2, 4 as Benign & Malignant, so we convert it to 0 and 1.
Y = Y.replace({_____________________})
X_train, X_test, Y_train, Y_test = obtain_training_test_data(X, Y, 0.3);

X_train.shape, Y_train.shape

In [None]:
# Heatmaps - What library uses heatmaps in Python?
import _____________ as sns
heatmap = sns.heatmap(dataset.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap');

# Learn what a correlation heatmap is, we might ask you about this later on.
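
A correlation heatmap is a colour-coded view of a correlation matrix. As a small sketch of what that matrix contains, the toy example below uses made-up variables: `b` is built from `a`, while `c` is independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)   # strongly tied to a
c = rng.normal(size=500)                         # unrelated to a

# Rows are variables, columns are observations.
corr = np.corrcoef(np.vstack([a, b, c]))
print(np.round(corr, 2))
```

Entries close to 1 or -1 indicate a strong linear relationship, entries near 0 indicate little linear relationship, and the diagonal is always 1.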

In [None]:
# Normalisation - as in Question 1, this cell is entirely yours. Add as many features as you like, and normalize and preprocess them to try to obtain maximum accuracy.
# Pre-processing

In [None]:
# Refer to Andrew Ng's lectures on Logistic Regression to understand what's actually happening in terms of matrices and weights. You can include a bias here if you'd like, but we haven't included it
# in this case. More points if you figure out how a bias's dimensions would work and actually implement it here.
class LogisticRegression():
  def __init__(self, learning_rate = 0.1, max_iterations=100):
    self.learning_rate = learning_rate
    self.max_iterations = max_iterations
    self.loss = []
    self.w = []

  def fit(self, X, Y):
    self.w = np.zeros((_____________________), dtype=np.float32)

    for iteration in range(_______________________):
      dw, cost = self.gradient_cost_eval(self.w, X, Y)
      self.w -= (self.learning_rate * dw)
      self.loss.append(cost)

      if iteration % 10 == 0:
        print("Iteration ", iteration, "/", self.max_iterations, ", Loss: ", cost)

# Do learn what (z), H(z) and such generic terms stand for, they are often used in the ML community.
  def predict(self, X):
    w = self.w
    H = _________________

# What threshold value should you have to say that the prediction is 1?
    Y_pred = np.zeros(______________________)
    for i in range(H.shape[1]):
      if H[0, i] >= _________:
        Y_pred[0,i] = 1
      else:
        Y_pred[0,i] = 0

    return Y_pred

  def test(self, X, Y):
    Y_pred = self.predict(X)
    print("Accuracy: ", ____________________________)

  def hypo(self, w, X):
    return sigmoid(__________________)

# Use the binary crossentropy loss function here.
  def cost(self, H, Y, num_samples):
    return _______________________________________

  def gradient_cost_eval(self, w, X, Y):
    H = self.hypo(w, X)
    cost = self.cost(H, Y, len(Y))

    temp = (H - Y)
    dw = np.dot(___________) / ______________

    return dw, cost
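
On the question of bias dimensions, one possible arrangement is sketched below on synthetic data. It is a standalone illustration rather than the expected solution: it assumes `X` has shape (features, samples) and `Y` has shape (1, samples), and the function name `fit_logistic` is hypothetical. The bias is kept as a single scalar that broadcasting adds to every sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, Y, learning_rate=0.1, max_iterations=500):
    """X: (n_features, m_samples), Y: (1, m_samples) with values 0 or 1."""
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0  # a single scalar; broadcasting adds it to every sample
    for _ in range(max_iterations):
        H = sigmoid(w.T @ X + b)          # shape (1, m)
        dw = (X @ (H - Y).T) / m          # shape (n, 1), same as w
        db = np.sum(H - Y) / m            # scalar, same as b
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Synthetic, linearly separable data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0:1, :] + X[1:2, :] + 1.0 > 0).astype(float)

w, b = fit_logistic(X, Y)
Y_pred = (sigmoid(w.T @ X + b) >= 0.5).astype(float)
accuracy = np.mean(Y_pred == Y)
print("accuracy:", accuracy, "bias:", b)
```

Because the gradient of the loss with respect to the bias is just the mean of `H - Y`, the bias needs no matrix product at all, which is why a scalar is enough here.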


In [None]:
# How do you define the Logistic Regression model now?
LR = _________________________________
LR.fit(________________)

In [None]:
LR.test(______________)

# Submissions

---

You can submit your final solutions using this link:
https://forms.gle/42UcG7dFttEStHyY8

Thank you!