# Section 2: Programming Questions (Total 80 marks)
- This section contains 3 questions.
- You are provided with the Python 3.9 Standard Library and the NumPy library.
- Python 3.9 Documentation: https://docs.python.org/3.9/
- NumPy Documentation: https://numpy.org/

In [None]:
import numpy as np

***
## Question 1: Linear Regression Using Gradient Descent (15 marks)
***
Linear Regression can also be performed using a technique called gradient descent, where the coefficients (or weights) of the model are iteratively adjusted to minimise a cost function (usually mean squared error). This method is particularly useful when the number of features is too large for analytical solutions like the normal equation, or when the matrix $X^TX$ used by the normal equation is not invertible. <br><br>

The gradient descent algorithm updates the weights by moving in the direction of the negative gradient of the cost function with respect to the weights. The updates occur iteratively until the algorithm converges to a minimum of the cost function. <br><br>

The update rule for each weight is given by: <br>

$$θ_j := θ_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_θ(x^i)-y^i)x_j^i$$

**Explanation of Terms**
1. $α$ is the learning rate.
2. $m$ is the number of training examples.
3. $h_θ(x^i)$ is the hypothesis function evaluated on the $i_{th}$ training example, i.e. the model's prediction for that example.
4. $x^i$ is the feature vector of the $i_{th}$ training example.
5. $y^i$ is the actual target value for the $i_{th}$ training example.
6. $x_j^i$ is the value of feature $j$ for the $i_{th}$ training example.<br><br>

**Key Points**
- **Learning Rate**: The choice of learning rate is crucial for the convergence and performance of gradient descent.
  - A small learning rate may lead to slow convergence.
  - A large learning rate may cause overshooting and divergence.
- **Number of Iterations**: The number of iterations determines how long the algorithm runs before it converges or stops.<br><br>
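
The effect of the learning rate can be seen on a toy problem. The sketch below is only an illustration, not part of the task: it runs gradient descent on the one-dimensional function $f(θ) = θ^2$ (an assumption chosen for simplicity, with gradient $2θ$), and the function name `minimise_quadratic` is hypothetical.

```python
def minimise_quadratic(alpha, steps=20, theta=1.0):
    """Run gradient descent on f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta


small = minimise_quadratic(alpha=0.01)  # moves towards 0, but slowly
moderate = minimise_quadratic(alpha=0.1)  # moves towards 0 much faster
large = minimise_quadratic(alpha=1.1)  # overshoots every step and grows in magnitude

print(small, moderate, large)
```

After the same number of steps, the moderate rate ends closest to the minimum at 0, the small rate is still far from it, and the large rate has moved further away than where it started.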

**Practical Implementation**<br>
Implementing gradient descent involves initialising the weights, computing the gradient of the cost function, and iteratively updating the weights according to the update rule.
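
The three steps above can be sketched in vectorised form. This is a minimal illustration under stated assumptions rather than a model answer: the weights are initialised to zero (the text does not fix a starting point), the function name `gradient_descent_sketch` is hypothetical, and the toy data is made up for the example.

```python
import numpy as np


def gradient_descent_sketch(X, y, alpha, iterations):
    """Batch gradient descent for linear regression (illustrative sketch)."""
    m, n = X.shape
    theta = np.zeros(n)  # assumption: start from all-zero weights
    for _ in range(iterations):
        predictions = X @ theta  # h_theta(x^i) for every training example
        errors = predictions - y  # h_theta(x^i) - y^i
        gradient = (X.T @ errors) / m  # (1/m) * sum of errors times x_j^i, for each j
        theta = theta - alpha * gradient  # simultaneous update of all weights
    return np.round(theta, 4)


# Toy data (made up): first column is the intercept term, and y = 2 * x exactly.
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
theta_toy = gradient_descent_sketch(X_toy, y_toy, alpha=0.1, iterations=5000)
print(theta_toy)  # intercept close to 0, slope close to 2
```

Computing the gradient as `X.T @ errors` updates every weight at once, which matches the per-weight update rule applied simultaneously for all $j$.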

## Task #1
- Write a function that takes:
  - X: a NumPy array of features, including a column of values for the intercept;
  - y: a NumPy array of target values;
  - alpha: the learning rate;
  - iterations: the number of iterations.
- The function should return the coefficients of the linear regression model as a NumPy array.
- Round your answer to four decimal places. -0.0 is a valid result for rounding a very small number.

In [None]:
# DO NOT MAKE CHANGES TO THIS CELL

X = np.array([[2, 2], [2, 4], [2, 6]])
y = np.array([2, 4, 6])
alpha = 0.01
iterations = 1000

In [None]:
def linear_regression_gradient_descent(X: np.ndarray, y: np.ndarray, alpha: float, iterations: int) -> np.ndarray:
    # your code here
    return

In [None]:
# DO NOT MAKE CHANGES TO THIS CELL

linear_regression_gradient_descent(X=X, y=y, alpha=alpha, iterations=iterations)

***
## Question 2: K-Means Clustering (25 marks)
***
1. **Initialisation**<br>
  Use the provided `initial_centroids` as your starting point.<br>

2. **Assignment Step**<br>
  For each point in your dataset:
    - Calculate its distance to each centroid.
    - Assign the point to the cluster of the nearest centroid.<br>
      *Hint: Consider creating a helper function to calculate the Euclidean distance between two points.*

3. **Update Step**<br>
  For each cluster:
    - Calculate the mean of all points assigned to the cluster.
    - Update the centroid to this new mean position.<br>
      *Hint: Be careful with potential empty clusters. Decide how you will handle them (e.g., keep the previous centroid).*

4. **Iteration**<br>
  Repeat Steps 2 and 3 until either:
    - The centroids no longer change significantly, OR
    - `max_iterations` is reached.<br>
      *Hint: You might want to keep track of the previous centroids to check for significant changes.*

5. **Result**<br>
  Return the list of final centroids, ensuring each coordinate is rounded to four decimal places.
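
The assignment and update steps above can be sketched as follows. This is an illustration under assumptions, not a model answer: the names `distance` and `k_means_sketch` and the `tolerance` parameter are hypothetical, an empty cluster keeps its previous centroid (one of several possible choices), "no significant change" is interpreted as every centroid moving less than `tolerance`, and the toy points are made up.

```python
import math


def distance(a, b):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_means_sketch(points, centroids, max_iterations, tolerance=1e-6):
    centroids = [tuple(float(v) for v in c) for c in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(point, centroids[i]))
            clusters[nearest].append(point)

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # assumption: empty cluster keeps its centroid

        # Stop early once no centroid has moved by more than the tolerance.
        shift = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tolerance:
            break
    return [tuple(round(v, 4) for v in c) for c in centroids]


# Toy data (made up): two well-separated groups of two points each.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_result = k_means_sketch(toy_points, [(1, 1), (9, 1)], max_iterations=10)
print(toy_result)  # [(0.0, 1.0), (10.0, 1.0)]
```

In this sketch the number of clusters is taken from the length of the initial centroid list, so no separate *k* argument is needed.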

## Task #2
- Write a Python function that implements the K-Means clustering algorithm.
  - This function should take specific inputs and produce a list of final centroids.
  - K-Means clustering is a method used to partition *n* points into *k* clusters.
  - The goal is to group similar points together and represent each group by its centroid.

- Function Inputs:
  - `points`: a list of points, where each point is a tuple of coordinates (e.g., (x, y) for 2D points)
  - `k`: an integer representing the number of clusters to form
  - `initial_centroids`: list of initial centroid points, each a tuple of coordinates
  - `max_iterations`: an integer representing the maximum number of iterations to perform

- Function Output:
  - A list of the final centroids of the clusters, where each coordinate is rounded to four decimal places.

In [None]:
# DO NOT MAKE CHANGES TO THIS CELL

points = [(2, 4), (2, 8), (2, 10), (10, 2), (10, 4), (10, 0)]
k = 2
initial_centroids = [(2, 2), (10, 1)]
max_iterations = 10

In [None]:
def euclidean_distance(a, b):
    # your code here
    return

def k_means_clustering(points, k, initial_centroids, max_iterations):
    # your code here
    return

In [None]:
# DO NOT MAKE CHANGES TO THIS CELL

k_means_clustering(points=points, k=k, initial_centroids=initial_centroids, max_iterations=max_iterations)

***
## Question 3: Retrieval Augmented Generation (40 marks)
***
### Background

In a classroom filled with curious students, Ms. Sally stood at the front with a confident smile.

She had promised her class an engaging lesson on a cutting-edge technology: Artificial Intelligence.

"Alright, class," she began, holding a few printed notes in her hand. "Let's build a smart assistant that can answer your questions using only a set of documents.

"Let’s say the question is: ‘How does machine learning work?’

"The smart assistant has to extract the content from the given set of documents and answer the question.

"This is done with a two-step process: retrieve and generate."

The set of documents can be found below:

1. Artificial intelligence is transforming industries.
2. Machine learning is a subset of artificial intelligence.
3. Deep learning allows machines to solve complex problems.
4. Data science involves statistics and machine learning.
5. Neural networks are inspired by the human brain.

"Which one of these do you think would help answer the question?" she asked. 

A few hands shot up, and the students gave their answers.

"That’s right! The second and fourth ones mention machine learning directly," she confirmed.

"Now the smart assistant has to take what it found and turn it into a formulated response: ‘Machine learning is a subset of artificial intelligence that enables systems to learn from data. It involves techniques like statistics and algorithms, often used in data science.’

"See how it combines ideas from the documents to explain the concept?"

## Task #3

The goal is to retrieve the most relevant information from these documents and generate an accurate, concise, and meaningful response to the query.<br>

The solution will be a simple Retrieval-Augmented Generation solution that includes, but is not limited to, the following:<br>
i. Feature extraction, such as a vectorisation technique, to represent the documents as numerical vectors that capture their semantic meaning and relevance to the question.<br>
  - A common approach is **TF-IDF** (Term Frequency-Inverse Document Frequency):<br>
    - **Term Frequency (TF)**: How often a word appears in a document
    - **Inverse Document Frequency (IDF)**: How rare or common a word is across all documents

ii. Once you have numerical vectors for the documents and query, you can measure how similar they are using similarity scoring such as cosine similarity.<br>
  - Cosine similarity measures the angle between two vectors
  - For non-negative vectors such as TF-IDF vectors, values range from 0 (no terms in common) to 1 (pointing in exactly the same direction)
  - The higher the score, the more relevant a document is to the query
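
The two ideas above can be combined in a short standard-library sketch. It is an illustration under assumptions, not a model answer: the helper names (`tokenize`, `tf_idf_vectors`, `cosine_similarity`) are hypothetical, the IDF uses a smoothed variant (other definitions exist), the query is vectorised together with the texts so that they share one vocabulary, and the toy texts are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(texts):
    """Return one TF-IDF vector per text, over a shared sorted vocabulary."""
    tokenized = [tokenize(t) for t in texts]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n = len(texts)
    # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
    idf = {
        word: math.log((1 + n) / (1 + sum(word in tokens for tokens in tokenized))) + 1
        for word in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against an empty text
        vectors.append([counts[word] / total * idf[word] for word in vocabulary])
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data (made up): the last vector belongs to the query.
toy_texts = ["cats chase mice", "dogs chase cats", "birds sing songs"]
toy_query = "cats and mice"
*doc_vectors, query_vector = tf_idf_vectors(toy_texts + [toy_query])
scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(len(toy_texts)), key=lambda i: scores[i], reverse=True)
print([toy_texts[i] for i in ranking])
```

The first toy text shares two words with the query, the second shares one, and the third shares none, so the ranking follows that order and the third text scores exactly 0.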

### Expected output

Query: How does machine learning work?

Generated Response:
Machine learning is a subset of artificial intelligence that enables systems to learn from data. It involves techniques like statistics and algorithms, often used in data science.
  


In [None]:
# DO NOT MAKE CHANGES TO THIS CELL

# Set of documents
documents = [
    "Artificial intelligence is transforming industries.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning allows machines to solve complex problems.",
    "Data science involves statistics and machine learning.",
    "Neural networks are inspired by the human brain."
]

# Query
query = "How does machine learning work?"

In [None]:
# your code here

In [None]:
# DO NOT MAKE CHANGES TO THIS CELL

print("Query:", query)
print("\nRetrieved Documents:")
for doc in top_documents:
    print("-", doc)

print("\nGenerated Response:")
print(response)

In [None]:
# Share your thought process on how you arrived at your solution