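Let's start with the linear regression cost function: $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{i} - y^{i})^{2}$, where $m$ is the number of training examples and $\hat{y}^{i}$ is the prediction for the $i$-th example.

Its partial derivative with respect to each parameter $\theta_{j}$ is:

$\frac{\partial}{\partial \theta_{j}} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{i} - y^{i})\cdot x_{j}^{i}$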
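For readers who want the intermediate step, here is a short derivation sketch. It assumes the usual linear hypothesis $\hat{y}^{i} = \sum_{k} \theta_{k} x_{k}^{i}$ (not written out in the text, but consistent with `np.dot(X, theta)` in the code), so that $\frac{\partial \hat{y}^{i}}{\partial \theta_{j}} = x_{j}^{i}$:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} J(\theta)
&= \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_{j}} (\hat{y}^{i} - y^{i})^{2} \\
&= \frac{1}{2m} \sum_{i=1}^{m} 2\,(\hat{y}^{i} - y^{i}) \cdot \frac{\partial \hat{y}^{i}}{\partial \theta_{j}} \\
&= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{i} - y^{i}) \cdot x_{j}^{i}
\end{aligned}
$$

The factor of $\frac{1}{2}$ in the cost function is there precisely so that it cancels the 2 produced by the chain rule.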

For vanilla (batch) gradient descent, every parameter is updated with learning rate $\alpha$:
    $\theta_{j} := \theta_{j} - \alpha \cdot \frac{\partial}{\partial \theta_{j}} J(\theta) = \theta_{j} - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{i} - y^{i})\cdot x_{j}^{i}$
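The per-parameter rule above can also be written for all $\theta_{j}$ at once. As a sketch, assume $X$ is the $m \times n$ matrix whose $i$-th row holds the features $x^{i}$, $y$ is the vector of the $m$ targets, and $\hat{y} = X\theta$ (these matrix names are an assumption chosen to match the code, not notation defined in the text). Stacking the $n$ partial derivatives then gives:

$$
\nabla_{\theta} J(\theta) = \frac{1}{m} X^{T} (X\theta - y),
\qquad
\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - y)
$$

Row $j$ of $X^{T}(X\theta - y)$ is exactly $\sum_{i=1}^{m} (\hat{y}^{i} - y^{i}) \cdot x_{j}^{i}$, so this is the same update, just written without the index $j$.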

In [3]:
import numpy as np

In [2]:
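def gradientDescent(X, y, theta, alpha, num_iters):
    """Run batch gradient descent for num_iters steps and return the learned theta."""

    # Number of training examples (y.size equals np.prod(y.shape)).
    m = y.size

    for _ in range(num_iters):
        # Predictions for all m examples under the current theta.
        y_hat = np.dot(X, theta)
        # Step against the gradient, averaged over all m examples.
        theta = theta - alpha * (1.0 / m) * np.dot(X.T, y_hat - y)

    return theta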

Notice the `np.dot(X.T, y_hat - y)` above? It is a vectorized version of looping over (summing across) all $m$ training samples. Every single update of `theta` therefore touches the entire training set, which becomes slow when $m$ is large.
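To make that equivalence concrete, here is a minimal sketch that computes the gradient twice: once with an explicit loop over the samples, and once with the vectorized expression. The helper names `gradient_loop` and `gradient_vectorized` and the random data are made up for illustration; they are not part of the notebook above.

```python
import numpy as np

def gradient_loop(X, y, theta):
    """Gradient of J(theta), accumulated one training sample at a time."""
    m, n = X.shape
    grad = np.zeros(n)
    for i in range(m):
        error_i = np.dot(X[i], theta) - y[i]  # y_hat^i - y^i for sample i
        grad += error_i * X[i]                # contributes error_i * x_j^i for every j
    return grad / m

def gradient_vectorized(X, y, theta):
    """Same gradient, with the sum over samples done by one matrix product."""
    m = y.size
    return np.dot(X.T, np.dot(X, theta) - y) / m

# Synthetic data, purely illustrative: 50 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
theta = rng.normal(size=3)

# The two gradients should agree up to floating-point rounding.
grads_match = np.allclose(gradient_loop(X, y, theta),
                          gradient_vectorized(X, y, theta))
print(grads_match)
```

Both versions do work proportional to $m \cdot n$ per gradient evaluation. Vectorizing makes each evaluation faster in practice, but it does not change the fact that every update has to process all $m$ samples.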