**<p style='text-align: right;'>Ver. 2.5.1</p>**

# Introductory Applied Machine Learning (IAML) Coursework - Semester 2, 2023-24

### Authors: Fengxiang He, Waylon Li, Hiroshi Shimodaira and Rohan Gorantla

## Important Instructions

#### It is important that you follow the instructions below carefully for things to work properly.

You need to set up and activate your environment as you would for your labs; see the Learn section on Labs.  **You will need to use Noteable to create the files you will submit (the Jupyter (IPython) Notebook and the PDF)**.  Do **NOT** create the PDF in some other way, as we will not be able to mark it.  If you want to develop your answers in your own environment, make sure you are using the same packages we are using by running the imports cell below.

Read the instructions in this notebook carefully, especially where asked to name variables with a specific name. Wherever you are required to produce code you should use code cells, otherwise you should use markdown cells to report results and explain answers. In most cases we indicate the nature of answer we are expecting (code/text), and also provide the required code/markdown cell. **If you are not familiar with markdown, here's a tutorial: [click here](https://www.markdowntutorial.com/).**

- We will use the IAML Learn page for any announcements, updates, and FAQs on this assignment. Please ***visit the page frequently*** to find the latest information/changes.
- Data files that you will be using are included in the coursework zip file that you have downloaded from the Learn assignment page for this coursework.
- There is a helper file 'iaml24cw_helpers.py' in the zip file, which you should upload to your environment.
- Some of the topics in this coursework are covered in weeks 7 and 8 of the course. Focus first on questions or topics that you have covered already, and come back to the other questions as the lectures progress.
- Keep your answers brief and concise.
- Make sure to show all your code/working.
- All the figures you present should have axis labels, titles, and grid lines unless specified explicitly. If you think grid lines spoil readability, you can adjust the line width and/or line style. Figures should not be too small to read.
- Write readable code. While we do not expect you to follow PEP8 to the letter, the code should be adequately understandable, with plots/visualisations correctly labelled. Do use inline comments when doing something non-standard.
- When asked to present numerical values, make sure to represent real numbers in the appropriate precision corresponding to your answer.
- When you use libraries specified in this coursework, you should use the default parameters unless specified explicitly.
- The criteria on which you will be judged include the quality of the textual answers and/or any plots asked for. For higher marks, when asked you need to give good and concise discussions based on experiments and theories using your own words.

- You will see <html>\\pagebreak</html> at the start of each subquestion.  ***Do not remove these. If you do, we will not be able to mark your coursework.***

#### Good Scholarly Practice
Please remember the University requirement regarding all assessed work for credit. Details about this can be found at:
http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct

Specifically, this assignment should be your own individual work. We will employ tools for detecting misconduct.

Moreover, please note that Piazza is NOT a forum for discussing the solutions of the assignment. You may ask private questions. You can use the office hours to ask questions.

### SUBMISSION Mechanics
This assignment will account for 30% of your final mark. We ask you to submit answers to all questions.

You will submit (1) a PDF of your Notebook and (2) the Notebook itself via Gradescope.  Your grade will be based on the PDF; we will only use the Notebook if we need to see details.  **You must use the following procedure to create the materials to submit**.

1. Make sure your Notebook, the helper file, and the datasets are in Noteable and will run.  If you developed your answers in Noteable, this is already done.

2. Select **Kernel->Restart & Run All** to create a clean copy of your submission; this will run the cells in order from top to bottom.  This may take a while (a few hours) to complete. Ensure that all the output and plots have completed before you proceed.

3. Select **File->Download as->PDF via LaTeX (.pdf)** and wait for the PDF to be created and downloaded.

4. Select **File->Download as->Notebook (.ipynb)**

5. You now should have in your download folder the pdf and the notebook.  Rename them sNNNNNNN.pdf and sNNNNNNN.ipynb, where sNNNNNNN is your matriculation number (student number).

**Details on submission instructions will be announced and documented on Learn before the deadline**.

The submission deadline for this assignment is **Monday 1st April 2024 at 4pm UK time (UTC)**.  Don't leave it to the last minute!


#### IMPORTS
Execute the cell below to import all packages you will be using for this assignment.  If you are not using Noteable, make sure the Python and package version numbers in your environment match those used on Noteable; you can check them by running the following cell. The Python version does not need to be exactly the same, but it should be $3.9.p$, where $p \ge 12$.

In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import matplotlib.patches as mpatches
import copy

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from pandas.plotting import parallel_coordinates


#from iaml24cw_helpers import *
#print_versions();

# You may add other libraries here or in your other cells as needed.



\pagebreak

# Question 1: Experiments with a stock price data set

#### 65 marks out of 110 for this coursework

The stock price data set we use in this coursework contains the stock prices of a company for the period between 2000 and 2024. It consists of four historical prices ('Open', 'High', 'Low', 'Close', which denote the opening, highest, lowest, and closing prices on the trading day, respectively) and the trading volume. For the convenience of the coursework, we have added some features to the data set: four [technical indicators](https://python.stockindicators.dev/indicators/) (RSI, SMA, MACD, ADX), 'Tomorrow', and 'Target'. 'Tomorrow' holds the closing price of the next trading day, which we will use for price prediction. 'Target' is a binary indicator (label) that takes 1 if 'Tomorrow' is higher than 'Close' and 0 otherwise, which we will use for predicting the direction of movement.

***Loading data***
Make sure that you have the data set files "dset_q1a.csv", "dset_q1a_extend.csv", and "dset_q1b.csv" in your environment. We will use the first file in all of the following subquestions except the last two, 1.7 and 1.8. Run the following cell to load the first file.

In [2]:
# Load the data set "dset_q1a.csv"
df = pd.read_csv("dset_q1a.csv", index_col='Date', parse_dates=True)
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
df.head()

| Date | Open | High | Low | Close | Volume | SMA | RSI | MACD | ADX | Tomorrow | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-01-03 | 58.6875 | 59.3125 | 56.0 | 58.28125 | 53228400 | 54.215625 | 71.898153 | 3.292083 | 38.756214 | 56.3125 | 0 |
| 2000-01-04 | 56.78125 | 58.5625 | 56.125 | 56.3125 | 54119000 | 54.645313 | 60.689975 | 2.986482 | 37.855621 | 56.90625 | 1 |
| 2000-01-05 | 55.5625 | 58.1875 | 54.6875 | 56.90625 | 64059600 | 55.165625 | 62.58436 | 2.760381 | 35.986452 | 55.0 | 0 |
| 2000-01-06 | 56.09375 | 56.9375 | 54.1875 | 55.0 | 54976600 | 55.621875 | 53.645906 | 2.399714 | 33.922343 | 55.71875 | 1 |
| 2000-01-07 | 54.3125 | 56.125 | 53.65625 | 55.71875 | 62013600 | 56.089062 | 56.186787 | 2.147129 | 31.661532 | 56.125 | 1 |


# ========== Question 1.1 --- [5 marks] ==========
###  Describe the main properties of the data:
1. [Code] Display the shape of the data
2. [Code] Display the range of the dataframe index
3. [Code] What data are presented and what types of data are they? Display the information using **pandas.DataFrame.info**.
4. [Code] Display the highest price, the lowest price, and the mean of the closing price ('Close') for each year in the data. (Hint: the highest price for each year is obtained from the price 'High'.)
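
Item 4 asks for per-year summaries of a date-indexed frame. A minimal sketch of one way to do this with a pandas group-by on the index year is shown below; the frame `toy` and its values are made up for illustration and are not the coursework data.

```python
import pandas as pd

# Toy price data (hypothetical values, not the coursework data set)
toy = pd.DataFrame(
    {
        "High": [10.0, 12.0, 11.0, 15.0],
        "Low": [8.0, 9.0, 10.0, 13.0],
        "Close": [9.0, 11.0, 10.5, 14.0],
    },
    index=pd.to_datetime(["2000-01-03", "2000-06-01", "2001-01-02", "2001-06-01"]),
)
toy.index.name = "Date"

# Group rows by the year of the DatetimeIndex and aggregate each column
yearly = toy.groupby(toy.index.year).agg(
    highest=("High", "max"),      # highest price of the year comes from 'High'
    lowest=("Low", "min"),        # lowest price of the year comes from 'Low'
    mean_close=("Close", "mean"), # mean of the closing price
)
print(yearly)
```

The named-aggregation form keeps the output column names explicit, which can make the resulting table easier to read.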

\pagebreak
## Your answers for Question 1.1

In [3]:
#(1) Your code goes here

In [4]:
#(2) Your code goes here

In [5]:
#(3) Your code goes here

In [6]:
#(4) Your code goes here

\pagebreak

# ========== Question 1.2 --- [8 marks] ==========
Perform an *exploratory data analysis* on the dataset by studying the following:
1. [Code and text] Plot the stock market closing price ('Close') and analyse the plot by identifying key trends and volatility patterns
2. [Code] For the period from the beginning of year 2021 until the end of 2022, plot the volumes ('Volume'), where you show months on the x-axis and indicate the positions of the highest and lowest values for the period.
3. [Code and text] Plot a pairplot for the dataset features using the seaborn **pairplot**. Examine and describe specific correlations and distributions among the features. Highlight any strong or weak correlations observed between pairs of features and discuss potential implications for further analysis.
4. [Code] Plot the correlation matrix for the dataset features.
5. [Text] Based on the results you obtained in 3 and 4 above, comment on the relationships among the features present in the dataset.
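
Item 2 restricts a series to a date range and locates its extremes. The sketch below shows one possible approach using partial-string indexing on a `DatetimeIndex` together with `idxmax`/`idxmin`; the series `vol` is synthetic and only stands in for a 'Volume' column.

```python
import numpy as np
import pandas as pd

# Synthetic, steadily increasing "volume" on business days (illustration only)
idx = pd.date_range("2020-12-28", "2023-01-06", freq="B")
vol = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Volume")

# Partial-string slicing keeps everything from 2021-01-01 to 2022-12-31 inclusive
period = vol.loc["2021":"2022"]

# Timestamps at which the maximum and minimum occur within the period
t_max = period.idxmax()
t_min = period.idxmin()
print(t_min, period[t_min])
print(t_max, period[t_max])
```

The two timestamps can then be passed to a plotting call (for example as scatter markers or annotations) to indicate the positions of the extremes.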

\pagebreak
## Your answers for Question 1.2

In [7]:
#(1) Your code and text goes here

In [8]:
#(2) Your code goes here

In [9]:
#(3) Your code and text goes here

In [10]:
#(4) Your code goes here

#(5) Your text goes here

\pagebreak

# ========== Question 1.3 --- [9 marks] ==========

Here we apply linear regression to predict 'Tomorrow' from 'MACD'.
For this question, you should use the sklearn implementation of Linear Regression. Use the first 80% of the data for training and the remaining 20% for testing ***without shuffling***.
1. [Code] Fit a linear regression model to the training data so that we can predict 'Tomorrow' from 'MACD'. Report the estimated model parameters $w$ and the coefficient of determination $R^2$.
2. [Text] Interpret the coefficient ($w$) for 'MACD' in the linear regression model. Discuss how changes in 'MACD' are expected to influence the prediction of 'Tomorrow'.
3. [Code] Report the root mean-square error (RMSE) for the training set and test set, respectively.
4. [Code] Plot predicted values versus actual values for the test set, where the x-axis corresponds to actual values and the y-axis to predicted values. Draw a line of $y=x$ on the plot.
5. [Code] Plot 'Tomorrow' versus 'MACD' for the training set and display the regression line on the same graph. The x-axis corresponds to 'MACD' and the y-axis to 'Tomorrow'.
6. [Text] Examining the results (e.g. $R^2$ and RMSE), discuss the model's reliability for financial forecasting.
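
The unshuffled 80/20 split and the RMSE computation described above can be sketched as follows. This is an illustration on synthetic data (a noisy line with slope 2 and intercept 1, both assumed here), not the coursework solution; the helper `rmse` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100).reshape(-1, 1)  # one synthetic feature, in order
y = 2.0 * x.ravel() + 1.0 + rng.normal(scale=0.5, size=100)

# shuffle=False preserves the order: the first 80% is train, the last 20% is test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=False
)

model = LinearRegression().fit(x_train, y_train)


def rmse(y_true, y_pred):
    """Root mean-square error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


r2_train = model.score(x_train, y_train)  # coefficient of determination
rmse_train = rmse(y_train, model.predict(x_train))
rmse_test = rmse(y_test, model.predict(x_test))
print(model.coef_, model.intercept_, r2_train, rmse_train, rmse_test)
```

For time-ordered data, keeping the order means the test set lies entirely after the training set, so the model is evaluated on values it could not have seen.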

\pagebreak
## Your answers for Question 1.3

In [11]:
#(1) Your code goes here

#(2) Your text goes here

In [12]:
#(3) Your code goes here

In [13]:
#(4) Your code goes here

In [14]:
#(5) Your code goes here

#(6) Your text goes here

\pagebreak

# ========== Question 1.4 --- [5 marks] ==========

1. [Code] Instead of using `sklearn` for linear regression, implement an **analytical solution** for linear regression to predict the 'Tomorrow' variable using the 'MACD' feature. Explicitly calculate the regression coefficients without relying on external optimization libraries. Run your code and show the coefficients, using the same training data as Question 1.3.
2. [Text] One of the common metrics used for evaluating the performance of regression models is Mean Squared Error (MSE). Write out the expression for MSE, list one of its limitations, and explain how it can be addressed with alternative metrics.
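
As a point of reference, a closed-form least-squares fit can be sketched with the normal equations $w = (X^\top X)^{-1} X^\top y$, where $X$ has a leading column of ones for the intercept. The data below are made up and noiseless (slope 3, intercept 2), so the recovered coefficients can be checked by eye; this is one possible formulation, not necessarily the expected answer.

```python
import numpy as np

# Made-up, noiseless data: y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix with a column of ones so that w[0] is the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) w = X^T y rather than forming the inverse explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)

# Mean squared error of the fit on the same data
mse = float(np.mean((y - X @ w) ** 2))
print(w, mse)
```

Solving the linear system directly is generally preferred over computing the matrix inverse, since it is numerically more stable.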

\pagebreak
## Your answers for Question 1.4

In [15]:
#(1) Your code goes here

#(2) Your text goes here

\pagebreak

# ========== Question 1.5 --- [6 marks] ==========
#### Multiple linear regression and polynomial regression

Here we consider multiple linear regression that employs four variables (RSI, SMA, MACD, ADX) to predict 'Tomorrow'. We use the same training data and test data as Question 1.3.
1. [Code] Train the multiple linear regression model on the training set and show the model parameters and the coefficient of determination $R^2$. Also show the RMSE for the training set and test set respectively.
2. [Code] We now extend the model to the polynomial regression model, in which we use all polynomial combinations of the variables up to the specified degree $p$. Using $p=2$, run an experiment in the same manner as 1 above and report the model parameters and $R^2$. Also report the RMSE for the training and test sets respectively. You should use the sklearn implementation of Linear Regression and Polynomial Features.
3. [Text] Analyse and compare the performance of the multiple linear regression model and the polynomial regression model (with $p=2$) against the results from Question 1.3. Focus your discussion on the differences in $R^2$ and RMSE values across the models, and what these differences indicate about the models' ability to predict 'Tomorrow' from the given variables. Consider discussing model complexity, overfitting, and predictive accuracy.
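
To make "all polynomial combinations up to degree $p$" concrete, the sketch below expands four synthetic input variables with `PolynomialFeatures(degree=2)` and counts the resulting terms. The data and the quadratic target are assumptions chosen only to show the difference between a linear and a degree-2 fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # four synthetic input variables
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2  # purely quadratic target (illustration)

poly = PolynomialFeatures(degree=2)   # include_bias=True by default
X_poly = poly.fit_transform(X)

# 1 bias + 4 linear + 4 squared + 6 pairwise cross terms = 15 columns
n_terms = X_poly.shape[1]

linear_r2 = LinearRegression().fit(X, y).score(X, y)
poly_r2 = LinearRegression().fit(X_poly, y).score(X_poly, y)
print(n_terms, linear_r2, poly_r2)
```

The number of parameters grows quickly with the degree and the number of inputs, which is one reason to compare training and test errors when judging a more complex model.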

\pagebreak
## Your answers for Question 1.5

In [16]:
#(1) Your code goes here


In [17]:
#(2) Your code goes here

#(3) Your text goes here

\pagebreak

# ========== Question 1.6 --- [12 marks] ==========
#### Classification

We now consider the prediction of stock price movement as a binary classification problem - class 1 for upward movement and class 0 otherwise. We use the four technical indicators, 'RSI', 'SMA', 'MACD', 'ADX', as input features to a classifier to predict 'Target'.

1. [Code] Using 15-fold cross validation with ***no shuffling*** on ***the whole data***, train four classifiers, Logistic Regression, SVM, Decision Trees, and Random Forests. Display, in a single graph, the validation accuracy with boxplot for each model. For each model, you also report the mean accuracy and mean F-score for the training set and validation set, respectively.
(NB: You should obtain the accuracy and F-score for each trial of k-fold cross validation, which will be used for plotting a boxplot. A mean value/score denotes the average value over the $k$ trials, where $k=15$).
<br> ***Note***: you should use sklearn's KFold, SVC, DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression. Use `random_state=0` for all models.
2. [Code] Further to the above, for each model, display the confusion matrix for the validation sets, where rows correspond to true class labels and columns to predicted ones, and each element of the matrix shows the number of corresponding instances.
3. [Text] Comment on which model is best with respect to false positives and false negatives.
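
The per-fold bookkeeping described in the note above can be sketched as follows for a single classifier on synthetic data. Summing the per-fold confusion matrices into one matrix is an assumption about how to combine the validation sets, and the data-generating rule is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))  # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

kf = KFold(n_splits=15)  # shuffle=False by default, so folds are contiguous
val_acc, val_f1 = [], []
cm_total = np.zeros((2, 2), dtype=int)  # rows: true class, columns: predicted

for train_idx, val_idx in kf.split(X):
    clf = LogisticRegression(random_state=0).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[val_idx])
    val_acc.append(accuracy_score(y[val_idx], pred))
    val_f1.append(f1_score(y[val_idx], pred, zero_division=0))
    # labels=[0, 1] fixes the matrix shape even if a fold lacks one class
    cm_total += confusion_matrix(y[val_idx], pred, labels=[0, 1])

print(np.mean(val_acc), np.mean(val_f1))
print(cm_total)
```

The list `val_acc` holds one value per fold, which is the form a boxplot expects; repeating the loop per model gives one box per classifier.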


\pagebreak
## Your answers for Question 1.6

In [18]:
#(1) Your code goes here

In [19]:
#(2) Your code goes here

#(3) Your text goes here

\pagebreak

# ========== Question 1.7 --- [5 marks] ==========


So far we have considered only four technical features, and found that movement classification with the four classifiers is challenging.
This time we use another data set file ("dset_q1a_extend.csv"), which is an extended version of the original one and contains 16 technical indicators. Load the dataset in the following manner:
>   df1b = pd.read_csv("dset_q1a_extend.csv", index_col="Date", parse_dates=True)

However, we only use **5% of data for training** and the remaining 95% for testing.

1. [Code] Split the data into two subsets with `random_state=0` with **no shuffling** - the first 5% of data should be used for training, and the remaining 95% for testing. Standardize the features using `StandardScaler` and train three logistic regression models with the following settings:
    - Without regularization
    - With L1 regularization and `liblinear` solver
    - With L2 regularization and `liblinear` solver 

   Report the accuracy and F1 score for these three models as well as the weights of the three models.
2. [Text] Discuss the implications of using L1 regularization versus L2 regularization in logistic regression models. Consider scenarios where one might be preferred over the other, and how the choice of regularization parameter ($\lambda$ or `C` in scikit-learn) affects model complexity and feature selection.
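
The qualitative difference between the two penalties can be seen on a small synthetic problem. In the sketch below only the first two of ten features carry signal, and `C=0.05` is an arbitrary, fairly strong regularisation setting chosen for illustration; none of this reflects the coursework data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # ten synthetic features
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only features 0 and 1 are informative

X_std = StandardScaler().fit_transform(X)

# Same solver and C, different penalty (smaller C means stronger regularisation)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, random_state=0)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05, random_state=0)
l1.fit(X_std, y)
l2.fit(X_std, y)

# Count weights that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print(np.round(l1.coef_, 3), n_zero_l1)
print(np.round(l2.coef_, 3), n_zero_l2)
```

L1 tends to drive the weights of uninformative features to exactly zero, whereas L2 shrinks weights towards zero without typically eliminating them.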


\pagebreak
## Your answers for Question 1.7

In [20]:
#(1) Your code goes here
    

#(2) Your text goes here

\pagebreak

# ========== Question 1.8 --- [15 marks] ==========

This is a mini-project where the goal is to predict fraudulent transactions. You will focus on understanding and selecting appropriate evaluation metrics for imbalanced classification tasks. 
- You will be working on "dset_q1b.csv" with transaction features and a binary target variable `is_fraud` indicating fraud. All the features have been preprocessed to numerical values. Other necessary preprocessing steps might be needed such as feature scaling.
- Import the packages you need and load the data set in the following manner: `df1b = pd.read_csv("dset_q1b.csv")`
- All the `random states` of `SMOTE`, `train_test_split`, `DecisionTreeClassifier`, etc. should be set to 0.


1. [Code and Text] Shuffle and split the data into two subsets with `random_state=0`. 80% of data should be used for training and validation, and the remaining 20% for testing. **Please also make sure the proportion of the positive class in the training and test sets is the same as in the original dataset.** Plot two pie charts to visualize the class distribution of the target variable for the training and test sets, with labels and percentages to 3 decimal places.

2. [Text] Before implementing the models, discuss the challenges of evaluating classifiers on imbalanced datasets. Specifically, address why traditional metrics such as accuracy may not be appropriate in the context of fraud detection. Propose alternative metrics that could provide a more meaningful assessment of classifier performance in detecting fraudulent transactions.

3. [Code and Text] Apply feature scaling to the data. Subsequently, train the following models (three logistic regression models with `random_state=0`, plus Gaussian Naive Bayes and KNN):
    - Logistic regression without class weighting (`logit_none`)
    - Logistic regression with a class weight of 1:7 (majority class : minority class) (`logit_custome`)
    - Logistic regression with `class_weight` parameter set to `balanced` (`logit_balance`)
    - Gaussian Naive Bayes
    - KNN with `n_neighbors=5`, `n_jobs=-1` 

    After model training, plot the ROC curves in one graph for all three models. By only looking at the ROC curves, can you tell which model is the best? Why or why not?

4. [Code and Text] In addition to the ROC curves, calculate the selected metrics from question 2 for the models in question 3 on the test set. The results should be shown in a numerical table and necessary figures. With the results obtained, give a quantitative comparison of the performance for the three logistic models. Discuss the strengths and weaknesses of each classifier in the context of fraud detection, and justify which classifier you would choose for the task. There is no standard answer to this question, but you should provide a clear and well-justified argument.

5. [Code] Adjusting the threshold is a common technique to improve the performance of classifiers on imbalanced datasets. For the `logit_none` setting, identify a key metric that you would like to improve and find the optimal threshold that maximizes it before testing the model on the test set. Report the found threshold and the complete set of metrics from Question 2 on the test set with the new threshold.
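
The sketch below illustrates three ideas from this question on synthetic data: a stratified split, why accuracy can mislead on imbalanced classes, and choosing a decision threshold on a validation set. The 5% positive rate, the 25% validation split, the threshold grid, and the choice of F1 as the metric to maximise are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.05).astype(int)          # roughly 5% positives
X = rng.normal(size=(n, 3)) + 1.5 * y[:, None]  # positives are shifted

# stratify=... keeps the positive-class proportion the same in each subset
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=0
)

# A classifier that always predicts the majority class: high accuracy, no recall
baseline_pred = np.zeros_like(y_test)
baseline_acc = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred, zero_division=0)

# Pick the threshold that maximises F1 on the validation set only
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
val_scores = clf.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
val_f1 = [
    f1_score(y_val, (val_scores >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = float(thresholds[int(np.argmax(val_f1))])

# Apply the chosen threshold once to the held-out test set
test_pred = (clf.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
test_f1 = f1_score(y_test, test_pred, zero_division=0)
print(baseline_acc, baseline_recall, best_threshold, test_f1)
```

Tuning the threshold on a validation set, rather than on the test set, keeps the final test metrics an unbiased estimate of performance.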

\pagebreak
## Your answers for Question 1.8

In [21]:
# (1) Your code and text goes here

#(2) Your text goes here

In [22]:
#(3) Your code and text goes here

In [23]:
#(4) Your code and text goes here

In [24]:
#(5) Your code and text goes here

\pagebreak

# Question 2: Experiments with a census income dataset

#### 45 marks out of 110 for this coursework

The "Adult" dataset is a widely-used public dataset extracted by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)).

It is usually used for research on classification tasks: determining whether a person makes over $50K p.a. from demographic information, including age, work class, education, gender, race, capital gain and loss, and hours worked per week. Because of the nature of the demographic information, this dataset is also popular in research on algorithmic fairness in AI.

We have done some preparatory data cleansing, which removes all features involving missing values, leaving 11 features.

**Link:** https://archive.ics.uci.edu/dataset/2/adult

***Loading data:***
Make sure that you have the data set files "adult.data" and "adult.test" in your environment and run the following cell to load the data set.

In [25]:
# Load the data set and apply some preprocessing

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race',
           'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df_train = pd.read_csv("adult.data", names=columns)
df_train = df_train[['age', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'income']]
df_test = pd.read_csv("adult.test", names=columns).iloc[1:]
df_test = df_test[['age', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'income']]

# ========== Question 2.1 --- [5 marks] ==========

Visualise the data:

1. [Code] Some features are not integers; use sklearn's **LabelEncoder** to transform them into integers.
2. [Code] Use pandas's **parallel_coordinates** to visualise the features **'age'**, **'education-num'**, and **'race'** in the training set **df_train**; the data points in different classes should be coloured differently.
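
A minimal sketch of encoding non-numeric columns with `LabelEncoder` is given below. The frame `toy` and the dictionary `encoders` are hypothetical names; keeping one fitted encoder per column is a design choice that allows the same mapping to be reused or inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one categorical and one numeric column (made-up values)
toy = pd.DataFrame({"sex": [" Male", " Female", " Female"], "age": [39, 50, 38]})

encoders = {}  # one fitted encoder per encoded column
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        enc = LabelEncoder()
        # Classes are sorted, then mapped to 0, 1, ...
        toy[col] = enc.fit_transform(toy[col])
        encoders[col] = enc

print(toy)
print(encoders["sex"].classes_)
```

Note that the integer codes impose an arbitrary ordering on the categories, which is worth keeping in mind when interpreting distance-based or linear models.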

\pagebreak
## Your answers for Question 2.1

In [26]:
#(1) Your code goes here

In [27]:
#(2) Your code goes here

\pagebreak

# ========== Question 2.2 --- [10 marks] ==========

Apply K-means (with $k = 3$) to perform clustering on the Adult dataset.

1. [Code] Apply sklearn's **KMeans** specifying **n_clusters=3** and **random_state=0** to the dataset, while all other parameters are set as default. Note that the two parameters should be set explicitly.
2. [Code] Use matplotlib's **pyplot** to plot the cluster centres' **('age', 'fnlwgt')** features on a plane; data points from different classes should be in different colours.
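
The fitted attributes used for this kind of plot can be seen in the sketch below, which runs `KMeans` on three synthetic, well-separated blobs (the blob centres and spread are assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs around known centres (illustration only)
centres_true = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres_true])

# Only n_clusters and random_state are set; everything else is default
km = KMeans(n_clusters=3, random_state=0).fit(X)

centres = km.cluster_centers_  # shape (n_clusters, n_features)
labels = km.labels_            # cluster index assigned to each row of X
print(centres)
```

Selected columns of `cluster_centers_` can be passed to a scatter plot; since K-means uses Euclidean distance, features with a much larger numeric range tend to dominate the clustering.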

\pagebreak
## Your answers for Question 2.2

In [28]:
#(1) Your code goes here

In [29]:
#(2) Your code goes here

\pagebreak

# ========== Question 2.3 --- [5 marks] ==========

Use Principal Component Analysis (PCA) to perform dimension reduction on the dataset.

1. [Code] Use sklearn's **PCA** to perform PCA on the data **df**. Calculate and show the **variances** of all ten **principal components**.
2. [Code] Plot the **cumulative explained variance ratio** $r_i$ as a function of the number of principal components, $i$ ($1 \le i \le D$, where $D$ is the data dimension). $r_i$ is defined as follows,
> $$ r_i = \frac{\sum_{j=1}^i \lambda_j}{\sum_{j=1}^D \lambda_j}.$$
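
The definition of $r_i$ can be computed directly from the eigenvalues that sklearn exposes as `explained_variance_`. The sketch below uses synthetic five-dimensional data with assumed per-axis scales, and keeps all components so that the denominator sums over all $D$ eigenvalues.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with different spread along each of D = 5 axes
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA().fit(X)                 # n_components=None keeps all D components
eigvals = pca.explained_variance_  # lambda_1 >= lambda_2 >= ... >= lambda_D

# r_i = (sum of the first i eigenvalues) / (sum of all D eigenvalues)
r = np.cumsum(eigvals) / np.sum(eigvals)
print(eigvals)
print(r)
```

When all components are kept, the same values are obtained from `np.cumsum(pca.explained_variance_ratio_)`.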


\pagebreak
## Your answers for Question 2.3

In [30]:
#(1) Your code goes here

In [31]:
#(2) Your code goes here

\pagebreak

# ========== Question 2.4 --- [10 marks] ==========

We now would like to know how the training data **df_train** is distributed in a vector space. To visualise distributions, we reduce the dimensionality of the data to **2** using PCA, and then plot the dimensionality-reduced data on a two-dimensional "plane" spanned by the first two principal components. Note that each instance in the dataset is now displayed as a single point on the plane.

1. [Code] Plot the training data **df_train** on the two-dimensional PCA plane, where the data points from different classes are coloured differently. Adjust the marker size to reduce the overlapping area.
2. [Text] Discuss the separation of the classes. Explain your findings.


\pagebreak
## Your answers for Question 2.4

In [32]:
#(1) Your code goes here

#(2) ***Your text goes here***

\pagebreak

# ========== Question 2.5 --- [15 marks] ==========

We now apply classification to the dataset. Make sure that you use **df_train** for training and **df_test** for testing.
1. [Code] Use sklearn's **LogisticRegression** (with **random_state=0**) to perform classification on the dataset. Calculate and report the classification accuracy and confusion matrix on both training set and test set. Use sklearn's **ConfusionMatrixDisplay** to display the confusion matrix. Note that you may ignore a warning message in the training.
2. [Code] Use sklearn's **SVC** (with **random_state=0**) to perform SVM on the dataset. Calculate and report the classification accuracy and confusion matrix on both training set and test set.
3. [Text] Based on the results obtained in 1 and 2, discuss your findings.

\pagebreak
## Your answers for Question 2.5

In [33]:
#(1) Your code goes here

In [34]:
#(2) Your code goes here

#(3) Your text goes here