<a href="https://colab.research.google.com/github/Uzmamushtaque/CSCI-4967-Projects-in-ML-AI/blob/main/Lecture_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 2

## Today's Lecture

1. Data manipulation and Pre-processing (TensorFlow, PyTorch, NumPy)
2. Broadcasting
3. Python NumPy and pandas
4. Logistic Regression using vectorization
5. Datasets
6. Projects in ML/AI
7. Gradient Descent (Optimization Algorithms)
8. About Homework 1
9. Key terms in ML Projects

# Data manipulation

Generally, there are two important things we need to do with data:
(i) acquire them; and (ii) process them once they are inside the computer.

Once you acquire data, there are several data pre-processing, data visualization and feature engineering steps that need to be performed in order to get the data in the right format. Some of the most popular libraries for data manipulation are pandas and numpy.

Here is an example notebook with basic pre-processing and manipulation steps: [Link](https://github.com/Uzmamushtaque/Projects-in-Machine-Learning-and-AI/blob/main/TitanicExample.ipynb)

[TensorFlow](https://www.tensorflow.org/) is an open-source end-to-end machine learning library for preprocessing data, modelling data and serving models (getting them into the hands of others).

[PyTorch](https://pytorch.org/)

## Introduction to Tensors

If you've ever used [NumPy](https://numpy.org/), tensors are kind of like NumPy arrays.

You can think of a tensor as a multi-dimensional numerical representation (also referred to as n-dimensional, where n can be any number) of something, where that something can be almost anything you can imagine:

1. It could be numbers themselves (using tensors to represent the price of houses).
2. It could be an image (using tensors to represent the pixels of an image).
3. It could be text (using tensors to represent words).

Or it could be some other form of information (or data) you want to represent with numbers.

The main difference between tensors and NumPy arrays (also an n-dimensional array of numbers) is that tensors can be used on GPUs (graphics processing units) and TPUs (tensor processing units).

The benefit of being able to run on GPUs and TPUs is faster computation. This means that if we want to find patterns in the numerical representations of our data, we can generally find them faster using GPUs and TPUs.

Let us get started with Tensors.
The first thing we'll do is import TensorFlow under the common alias tf.

In [2]:
# Import TensorFlow
import tensorflow as tf

print(tf.__version__) # find the version number (should be 2.x+)

2.15.0


In [2]:
# simple pytorch tensor
import torch
x = torch.tensor(3.5)
print(x)

tensor(3.5000)


In [3]:
# simple pytorch tensor
x = torch.tensor(3.5)
print("x:", x)

# simple arithmetic with tensors
y = x + 3
print("y = x+3:", y)

x: tensor(3.5000)
y = x+3: tensor(6.5000)


A tensor represents a (possibly multi-dimensional) array of numerical values. With one axis, a tensor corresponds (in math) to a vector. With two axes, a tensor corresponds to a matrix. Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. Tensors give us a generic way of describing n-dimensional arrays with an arbitrary number of axes. Vectors, for example, are first-order tensors, and matrices are second-order tensors. Let us create one tensor and then update its shape:

In [4]:
x = tf.range(12)
x

<tf.Tensor: shape=(12,), dtype=int32, numpy=array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int32)>

We can access a tensor’s shape (the length along each axis) by inspecting its shape property.

In [None]:
x.shape

TensorShape([12])

If we just want to know the total number of elements in a tensor, i.e., the product of all of the shape elements, we can inspect its size.

In [5]:
tf.size(x)

<tf.Tensor: shape=(), dtype=int32, numpy=12>

To change the shape of a tensor without altering either the number of elements or their values, we can invoke the reshape function.

In [9]:
X = tf.reshape(x, (3, 4))
X

<tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]], dtype=int32)>

Reshaping by manually specifying every dimension is unnecessary. If our target shape is a matrix with shape (height, width), then after we know the width, the height is given implicitly. Try calling tf.reshape(x, (-1, 4)) or tf.reshape(x, (3, -1)) for x above. Why do you think you get the result you are getting?
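
As a hint for the exercise, here is a minimal sketch of how a `-1` dimension is inferred. It uses NumPy only so that it runs without TensorFlow installed; `tf.reshape` accepts `-1` in the same way. The variable names `a` and `b` are illustrative.

```python
import numpy as np

x = np.arange(12)            # 12 elements, shape (12,)

# -1 asks the library to infer that axis from the total number of elements.
a = x.reshape(-1, 4)         # 12 / 4 = 3 rows    -> shape (3, 4)
b = x.reshape(3, -1)         # 12 / 3 = 4 columns -> shape (3, 4)

print(a.shape, b.shape)      # (3, 4) (3, 4)
print(np.array_equal(a, b))  # True: same elements in the same order
```

Only one dimension can be left as `-1`, because a single unknown is all that the total element count can determine.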

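Typically, we will want our matrices initialized either with zeros, ones, some other constants, or numbers randomly sampled from a specific distribution. We can create a tensor with all elements set to 0 and a shape of (2, 3, 4) using tf.zeros; tf.ones works the same way for a tensor of ones. Since a notebook cell displays only the value of its last expression, the output below shows the tf.ones((3, 3, 4)) result: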

In [10]:
tf.zeros((2, 3, 4))
tf.ones((3,3,4))

<tf.Tensor: shape=(3, 3, 4), dtype=float32, numpy=
array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]], dtype=float32)>

The following snippet creates a tensor with shape (3, 4). Each of its elements is randomly sampled from a standard Gaussian (normal) distribution with a mean of 0 and a standard deviation of 1.

In [11]:
tf.random.normal(shape=[3, 4])

<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[ 5.9187937e-01, -8.8718188e-01, -3.9479160e-04,  1.2093159e+00],
       [ 5.9993702e-01,  2.8964812e-01,  8.4894878e-01,  1.8067051e+00],
       [-7.7732205e-01,  2.9494369e-01, -7.7629071e-01, -7.0611900e-01]],
      dtype=float32)>

In [12]:
# Specify exact values for a tensor with a (nested) Python list
tf.constant([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])

<tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[2, 1, 4, 3],
       [1, 2, 3, 4],
       [4, 3, 2, 1]], dtype=int32)>

# Operations

The common arithmetic operators (+, -, *, /, and **) are applied elementwise to tensors of the same shape.

In [None]:
x = tf.constant([1.0, 2, 4, 8])
y = tf.constant([2.0, 2, 2,2])
x + y, x - y, x * y, x / y, x**y  # The ** operator is exponentiation


(<tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 3.,  4.,  6., 10.], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([-1.,  0.,  2.,  6.], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 2.,  4.,  8., 16.], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([0.5, 1. , 2. , 4. ], dtype=float32)>,
 <tf.Tensor: shape=(4,), dtype=float32, numpy=array([ 1.,  4., 16., 64.], dtype=float32)>)

In [None]:
tf.exp(x)

<tf.Tensor: shape=(4,), dtype=float32, numpy=
array([2.7182817e+00, 7.3890562e+00, 5.4598148e+01, 2.9809580e+03],
      dtype=float32)>

We can also concatenate multiple tensors together, stacking them end-to-end to form a larger tensor.

In [15]:
X = tf.reshape(tf.range(12, dtype=tf.float32), (3, 4))
Y = tf.constant([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
tf.concat([X, Y], axis=0), tf.concat([X, Y], axis=1)

(<tf.Tensor: shape=(6, 4), dtype=float32, numpy=
 array([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 2.,  1.,  4.,  3.],
        [ 1.,  2.,  3.,  4.],
        [ 4.,  3.,  2.,  1.]], dtype=float32)>,
 <tf.Tensor: shape=(3, 8), dtype=float32, numpy=
 array([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
        [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
        [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]], dtype=float32)>)

Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example. For each position, if X and Y are equal at that position, the corresponding entry in the new tensor is True, meaning that the logical statement X == Y holds at that position; otherwise that entry is False. If 0/1 values are needed, the boolean tensor can be converted with tf.cast(X == Y, tf.int32).

In [16]:
X == Y

<tf.Tensor: shape=(3, 4), dtype=bool, numpy=
array([[False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False]])>

# Broadcasting Mechanism

Under certain conditions, when shapes differ, we can still perform elementwise operations by invoking the broadcasting mechanism. This mechanism works in the following way: First, expand one or both arrays by copying elements appropriately so that after this transformation, the two tensors have the same shape. Second, carry out the elementwise operations on the resulting arrays.

In [17]:
a = tf.reshape(tf.range(3), (3, 1))
b = tf.reshape(tf.range(2), (1, 2))
a, b

(<tf.Tensor: shape=(3, 1), dtype=int32, numpy=
 array([[0],
        [1],
        [2]], dtype=int32)>,
 <tf.Tensor: shape=(1, 2), dtype=int32, numpy=array([[0, 1]], dtype=int32)>)

Since a and b are 3×1 and 1×2 matrices respectively, their shapes do not match up if we want to add them. We broadcast the entries of both matrices into a larger 3×2 matrix as follows:

In [None]:
a + b

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[0, 1],
       [1, 2],
       [2, 3]], dtype=int32)>

[Source for this excerpt](https://numpy.org/doc/stable/user/basics.broadcasting.html)


When operating on two arrays, NumPy compares their shapes element-wise, starting with the trailing (rightmost) dimension and working its way left. Two dimensions are compatible when

1. they are equal, or
2. one of them is 1

If these conditions are not met, a ValueError: operands could not be broadcast together exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the size that is not 1 along each axis of the inputs.
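
The rules above can be checked directly in NumPy. The following is a small sketch with made-up shapes; the names `m`, `v`, and `raised` are illustrative only.

```python
import numpy as np

a = np.arange(3).reshape(3, 1)   # shape (3, 1)
b = np.arange(2).reshape(1, 2)   # shape (1, 2)

# Axis 0: 3 vs 1 -> 3; axis 1: 1 vs 2 -> 2, so the result has shape (3, 2).
print((a + b).shape)             # (3, 2)

# Shapes are aligned from the right: (4, 3) with (3,) behaves like (4, 3) with (1, 3).
m = np.ones((4, 3))
v = np.arange(3)
print((m + v).shape)             # (4, 3)

# Incompatible shapes: 3 vs 4 on the last axis, and neither is 1.
try:
    np.ones(3) + np.ones(4)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```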

# Data Pre-processing

To apply deep learning to real-world problems, we often begin by preprocessing raw data rather than starting from nicely prepared data in tensor format. Among the popular data analysis tools in Python, the pandas package is commonly used.

[Pandas documentation](https://pandas.pydata.org/)

In [23]:
## Reading the dataset
# Assumes Google Drive is already mounted at /content/drive (see the drive.mount cell below).

# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv('/content/drive/MyDrive/train.csv')
data.head()

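| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |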


In [19]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Handling Missing Values: NaN values are unknown or missing values. To handle missing data, typical methods include imputation and deletion: imputation replaces missing values with substituted ones, while deletion drops the rows or columns that contain missing values.

In [24]:
data.isna()

| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |
| 1456 | False | False | False | False | False | False | True | False | False | False | ... | False | True | False | True | False | False | False | False | False | False |
| 1457 | False | False | False | False | False | False | True | False | False | False | ... | False | True | False | False | False | False | False | False | False | False |
| 1458 | False | False | False | False | False | False | True | False | False | False | ... | False | True | True | True | False | False | False | False | False | False |


Different datatypes require different ways of dealing with missing values.

[Handling Missing Values in pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
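
To make imputation and deletion concrete, here is a minimal sketch on a hypothetical toy DataFrame. The column names only mimic the housing data above, and filling with the mean or with a "Missing" label is one possible choice among many, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # numeric column
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],   # categorical column
})

# Count missing values per column.
print(df.isna().sum())

# Deletion: drop every row that contains at least one missing value.
dropped = df.dropna()

# Imputation: numeric -> column mean, categorical -> an explicit "Missing" label.
imputed = df.copy()
imputed["LotFrontage"] = imputed["LotFrontage"].fillna(imputed["LotFrontage"].mean())
imputed["Alley"] = imputed["Alley"].fillna("Missing")

print(dropped)
print(imputed)
```

Deletion keeps only the single fully observed row here, which shows how quickly it can discard data when several columns have gaps.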

# Vectors

A scalar is represented by a tensor with just one element. Vectors in ML problems represent examples from the dataset. In math notation, we will usually denote vectors as bold-faced, lower-cased letters (e.g., $\textbf{x}$, $\textbf{y}$, and $\textbf{z}$).

We can refer to any element of a vector by using a subscript. For example, we can refer to the $i$th element of $\textbf{x}$ by $x_i$. Note that the element $x_i$ is a scalar, so we do not bold-face the font when referring to it.

In [3]:
x = tf.range(4)
x
print(x[3])

tf.Tensor(3, shape=(), dtype=int32)


With the advent of deep learning, we usually work with extremely large datasets. Therefore, it's important that we write efficient code. One such technique is vectorization.

In [None]:
import numpy as np
a=np.array([1,2,3,4,5])
a

array([1, 2, 3, 4, 5])

In [None]:
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vectorized version
start = time.time()
c = np.dot(a, b)
end = time.time()
# time.time() returns seconds, so the elapsed time is reported in seconds
print('Vectorized version '+str(end-start)+' s')
print(c)

Vectorized version 0.0050313472747802734 s
249953.43446398724


In [None]:
# Non-vectorized version
c = 0
start = time.time()
for i in range(1000000):
  c += a[i] * b[i]
end = time.time()
print('Non vectorized version '+str(end-start)+' s')
print(c)

Non vectorized version 0.6930620670318604 s
249953.43446398407


How many times longer does the non-vectorized version take? Try checking the time for a nested loop.

You have probably heard of GPUs (graphics processing units) and CPUs (central processing units). Both have parallelization instructions, sometimes called SIMD instructions, which stands for single instruction, multiple data. In practice, this means that when you use built-in NumPy functions (such as np.dot) or the corresponding TensorFlow functions instead of writing an explicit for loop, the library can take much better advantage of this parallelism and run your computations much faster. This holds for computations on both CPUs and GPUs.

More information on this:
[Difference between GPU and CPU](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/)

# Vectorizing Logistic Regression

We know that in logistic regression we are calculating the predicted value for each example using the following function:

$\hat{y} = \sigma(\textbf{w}^T\textbf{x} + b)$

where $\sigma(a) = \frac{1}{1+e^{-a}}$

For a given example $i$, the loss function for a single instance is given by:

$l^{(i)}(y^{(i)},\hat{y}^{(i)}) = -(y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)}))$

Cost function for the entire data:

$L(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(y^{(i)},\hat{y}^{(i)})$

The computation here requires calculating $\hat{y}$ for every instance. Let us write $z = \textbf{w}^T\textbf{x} + b$ for the linear term inside the sigmoid, so that $\hat{y} = \sigma(z)$. This $z$ needs to be computed for every instance. Instead of using an explicit for loop, we can stack the $n$ examples as the columns of a matrix $X$ and multiply the transposed weight vector by $X$. The bias term (if it exists) is added to every entry via broadcasting. The resulting vector $Z$ will be:

$Z=[z^{(1)},z^{(2)},\dots,z^{(n)}]$

Applying $\sigma$ elementwise to $Z$ then gives the vector of predictions.

This step can be completed in 1 line of code.
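
As a rough illustration of that one line, the sketch below compares an explicit loop with the vectorized product. It assumes the examples are stored as the columns of `X` (shape `(d, n)`); the sizes, the random data, and the names `linear_loop` and `linear_vec` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5                       # d features, n examples (illustrative sizes)
X = rng.normal(size=(d, n))       # each column is one example
w = rng.normal(size=(d, 1))       # weight column vector
b = 0.5                           # scalar bias

# Explicit loop: one dot product per example.
linear_loop = np.array([w[:, 0] @ X[:, i] + b for i in range(n)])

# Vectorized: a single matrix product; the scalar b is broadcast to all n entries.
linear_vec = w.T @ X + b          # shape (1, n)

print(np.allclose(linear_loop, linear_vec[0]))  # True
```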

**Steps in implementing gradient descent**

- You get input $X$
- You compute $A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(n-1)}, a^{(n)})$
- You calculate the cost function: $L = -\frac{1}{n}\sum_{i=1}^{n}(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}))$


Here are the two formulas you will be using (Try finding the derivative of the cost function with respect to the parameters):


$$ \frac{\partial L}{\partial w} = \frac{1}{n}X(A-Y)^T$$
$$ \frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^n (a^{(i)}-y^{(i)})$$




In [None]:
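# Assuming you have a custom sigmoid function, X of shape (d, n) and Y of shape (1, n)
# n = X.shape[1]                       # number of examples
# A = sigmoid(np.dot(w.T, X) + b)      # activations, shape (1, n)
# cost = -1/n * np.sum(Y * np.log(A) + (1-Y) * np.log(1-A))
# dw = np.dot(X, (A-Y).T) / n
# db = np.sum(A-Y) / n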
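
A runnable version of these lines might look like the sketch below. The function names `sigmoid` and `cost_and_gradients`, the shapes, and the tiny dataset are assumptions made for illustration. The finite-difference check at the end is an extra sanity test for the $\frac{\partial L}{\partial b}$ formula, not part of the algorithm itself.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(w, b, X, Y):
    """Return the cross-entropy cost L and the gradients dL/dw, dL/db.

    Assumed shapes: w (d, 1), b scalar, X (d, n), Y (1, n).
    """
    n = X.shape[1]
    A = sigmoid(w.T @ X + b)                                   # (1, n)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / n
    dw = X @ (A - Y).T / n                                     # (d, 1)
    db = np.sum(A - Y) / n                                     # scalar
    return cost, dw, db

# Tiny hypothetical dataset: 2 features, 3 examples.
w = np.array([[0.5], [-0.3]])
b = 0.1
X = np.array([[1.0, 2.0, -1.0],
              [3.0, 0.5, -2.0]])
Y = np.array([[1.0, 0.0, 1.0]])

cost, dw, db = cost_and_gradients(w, b, X, Y)

# Finite-difference check of dL/db: (L(b + eps) - L(b - eps)) / (2 * eps).
eps = 1e-6
numeric_db = (cost_and_gradients(w, b + eps, X, Y)[0]
              - cost_and_gradients(w, b - eps, X, Y)[0]) / (2 * eps)
print(cost, db, numeric_db)
```

If the analytic and numeric values of `db` agree to several decimal places, the gradient formula is very likely implemented correctly.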

**The Update Step**

Once you have initialized your parameters and computed the cost function and its gradient, the next step is to update the parameters using gradient descent.

Write down the optimization function. The goal is to learn $w$ and $b$ by minimizing the cost function $L$. For a parameter $w$, the update rule is $w = w - \eta \, dw$, where $\eta$ is the learning rate and $dw = \frac{\partial L}{\partial w}$; $b$ is updated in the same way using $db$.

In [None]:
#You basically need to write down two steps and iterate through them for the entire dataset:
# 1) Calculate the cost and the gradient for the current parameters.
# 2) Update the parameters using gradient descent rule for w and b.
#w = w - learning_rate*dw
#b= b - learning_rate*db
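
Put together, the two steps could be sketched as follows. The synthetic data, the learning rate of 0.1, and the 500 iterations are arbitrary choices for this illustration, not values prescribed by the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical synthetic data: 2 features, 200 examples, labels from a known linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0:1] + X[1:2] > 0).astype(float)        # shape (1, 200)

n = X.shape[1]
w = np.zeros((2, 1))
b = 0.0
learning_rate = 0.1                             # eta

costs = []
for step in range(500):
    # 1) Cost and gradients for the current parameters.
    A = sigmoid(w.T @ X + b)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / n
    dw = X @ (A - Y).T / n
    db = np.sum(A - Y) / n
    # 2) Gradient descent update.
    w = w - learning_rate * dw
    b = b - learning_rate * db
    costs.append(cost)

# With w = 0 and b = 0 every activation is 0.5, so the first cost equals log(2).
print(costs[0], costs[-1])
```

Recording the cost at every step makes it easy to plot the learning curve and to spot a learning rate that is too large (the cost would oscillate or grow).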

**Predict**

The previous function/code will output the learned w and b. We can use w and b to predict the labels for a dataset X. There are two steps to computing predictions:

1. Calculate $\hat{Y} = A = \sigma(w^T X + b)$

2. Convert the entries of $A$ into 0 (if activation <= 0.5) or 1 (if activation > 0.5) and store the predictions in a vector.
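
The two prediction steps can be sketched in a few lines. The function name `predict` and the parameter values below are hypothetical and chosen only so that the result is easy to verify by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    """Return 0/1 predictions of shape (1, n) for X of shape (d, n)."""
    A = sigmoid(w.T @ X + b)          # step 1: activations in (0, 1)
    return (A > 0.5).astype(int)      # step 2: 1 if activation > 0.5, else 0

# Hypothetical parameters and inputs, for illustration only.
w = np.array([[1.0], [-1.0]])
b = 0.0
X = np.array([[2.0, 0.0, -1.0],
              [0.0, 0.0,  1.0]])
print(predict(w, b, X))               # [[1 0 0]]
```

The middle example has a linear term of exactly 0, so its activation is 0.5 and it is assigned to class 0 under the rule above.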

# Activation Functions

The choice of activation functions is critical in neural network (NN) design. The logistic regression model we saw above is very similar to the perceptron, which is a basic building block of any NN model.

In problems where a binary class label needs to be predicted, the sign function can be a choice. For problems where the target variable to be predicted is real-valued, it makes sense to use the identity activation function. When predicting the probability of a binary class, it makes sense to use the sigmoid function, as it restricts the output to values between 0 and 1.
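
For a quick side-by-side view, the three activations mentioned above can be applied to the same inputs. This is a minimal NumPy sketch; the input values in `z` are made up.

```python
import numpy as np

z = np.array([-2.0, 0.0, 3.0])             # hypothetical pre-activation values

sign_out = np.sign(z)                      # values in {-1, 0, +1}
identity_out = z                           # unchanged, for real-valued targets
sigmoid_out = 1.0 / (1.0 + np.exp(-z))     # values strictly between 0 and 1

print(sign_out)       # [-1.  0.  1.]
print(identity_out)   # [-2.  0.  3.]
print(sigmoid_out)
```

Note that `np.sign` returns 0 for an input of exactly 0, so a binary classifier built on it needs a convention for that case.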

The importance of non-linear activation functions will become clear when we move to multi-layered architecture.

[More about activation functions](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)

# Readings for Today

[Paper 1](https://arxiv.org/pdf/1609.04747.pdf)

### Summary
The paper covers traditional gradient descent and explores more advanced algorithms designed to improve convergence speed and performance. Some key points include:

**Gradient Descent Basics:** The paper starts by explaining the fundamental concept of gradient descent, where the goal is to minimize a cost function by iteratively adjusting model parameters in the direction of steepest descent.

**Variants of Gradient Descent:**

- Batch Gradient Descent: Computes the gradient of the cost function over the entire dataset.
- Stochastic Gradient Descent (SGD): Updates parameters for each training example, introducing randomness.
- Mini-batch Gradient Descent: A compromise between batch and stochastic, where updates are made using a small random subset of the data.

**Challenges with Vanilla Gradient Descent:**

- Learning Rate: The choice of the learning rate can significantly impact the convergence of the algorithm.
- Saddle Points: Vanilla gradient descent may struggle with convergence in the presence of saddle points.

**Advanced Optimization Algorithms:**

- Momentum: Introduces a moving average of past gradients to accelerate convergence.
- Adagrad, Adadelta, RMSprop: Adaptive learning rate methods that adjust the learning rates for each parameter individually.
- Adam: A popular optimization algorithm that combines ideas from momentum and adaptive learning rates.

**Issues and Considerations:** The paper discusses potential issues with optimization algorithms, such as choosing hyperparameters and dealing with non-convex optimization challenges.

**Practical Recommendations:** Provides practical advice on selecting optimization algorithms based on the characteristics of the optimization problem.

The paper serves as a valuable resource for understanding the landscape of optimization algorithms, their strengths, and considerations for practical implementation in training machine learning models.
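
To make the difference between vanilla gradient descent and momentum more tangible, here is a small sketch on a toy quadratic cost. It uses the commonly cited momentum formulation ($v \leftarrow \gamma v + \eta \nabla J(\theta)$, then $\theta \leftarrow \theta - v$); the cost function, the starting point, and the hyperparameters `eta` and `gamma` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy cost J(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return np.array([1.0, 25.0]) * theta

def cost(theta):
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

eta, gamma, steps = 0.03, 0.9, 100     # illustrative hyperparameters
start = np.array([5.0, 1.0])

# Vanilla gradient descent: theta <- theta - eta * grad
theta_gd = start.copy()
for _ in range(steps):
    theta_gd = theta_gd - eta * grad(theta_gd)

# Momentum: v <- gamma * v + eta * grad ; theta <- theta - v
theta_mom = start.copy()
v = np.zeros_like(start)
for _ in range(steps):
    v = gamma * v + eta * grad(theta_mom)
    theta_mom = theta_mom - v

print(cost(start), cost(theta_gd), cost(theta_mom))
```

Changing `eta` and `gamma` and comparing the printed costs is a simple way to see how sensitive each method is to its hyperparameters.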

[Paper 2](https://proceedings.neurips.cc/paper/2020/file/d3f5d4de09ea19461dab00590df91e4f-Paper.pdf)



# Extra Resources

[Article](https://iamtrask.github.io/2015/07/27/python-network-part2/)