# **HW1: Regression**
In *assignment 1*, you need to finish:

1.  Basic Part: Implement two regression models to predict the Systolic blood pressure (SBP) of a patient. You will need to implement **both Matrix Inversion and Gradient Descent**.


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement one regression model to predict the SBP of multiple patients in a different way than the basic part. You can choose **either** of the two methods for this part.

# **1. Basic Part (55%)**
In the first part, you need to implement the regression to predict SBP from the given DBP


## 1.1 Matrix Inversion Method (25%)


*   Save the prediction result in a csv file **hw1_basic_mi.csv**
*   Print your coefficient


### *Import Packages*

> Note: You **cannot** import any other package

In [323]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

### *Global attributes*
Define the global attributes

In [324]:
training_dataroot = 'hw1_basic_training.csv' # Training data file file named as 'hw1_basic_training.csv'
testing_dataroot = 'hw1_basic_testing.csv'   # Testing data file named as 'hw1_basic_training.csv'
output_dataroot = 'hw1_basic_mi.csv' # Output file will be named as 'hw1_basic.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 3 matrix and saved as numpy array
                      # The format of each row should be ['subject_id', 'charttime', 'sbp']

You can add your own global attributes here


In [325]:
training_dataset_1 = []     # The first 80% of training_datalist
validation_dataset_1 = []   # The last 20% of training_datalist

training_dataset_2 = []     # Randomly selected 80% of training_datalist
validation_dataset_2 = []   # The rest of training_datalist

training_dataset_3 = []     # Randomly selected 70% of training_datalist
validation_dataset_3 = []   # The rest of training_datalist

training_dataset_4 = []     # Randomly selected 60% of training_datalist
validation_dataset_4 = []   # The rest of training_datalist

training_data_num = 0

### *Load the Input File*
First, load the basic input file **hw1_basic_training.csv** and **hw1_basic_testing.csv**

Input data would be stored in *training_datalist* and *testing_datalist*

In [326]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
	global training_datalist, training_data_num
	
	training_datalist = np.array(list(csv.reader(csvfile)))
	training_datalist = np.delete(training_datalist, 0, 0)
	training_data_num = training_datalist.shape[0]

with open(testing_dataroot, newline='') as csvfile:
	global testing_datalist

	testing_datalist = np.array(list(csv.reader(csvfile)))
	testing_datalist = np.delete(testing_datalist, 0, 0)
	  
# print(training_datalist.shape[0], training_data_num)
# print(testing_datalist)

### *Implement the Regression Model*

> Note: It is recommended to use the functions we defined, you can also define your own functions


#### Step 1: Split Data
Split data in *training_datalist* into training dataset and validation dataset
* Validation dataset is used to validate your own model without the testing data



In [327]:
def SplitData():
    def RandomSplitData(propotion):
        # We have to deepcopy the training_datalist since we don't want to change the original one
        training_datalist_copy = training_datalist.copy()
        
        random_times = random.randint(1, 10)
        for i in range(random_times):
            # Since the copy is a numpy array, we have to use np.random.shuffle instead of random.shuffle
            # Or it will have some repeated elements after shuffling
            np.random.shuffle(training_datalist_copy)

        return training_datalist_copy[ : propotion], training_datalist_copy[propotion : ]

    # We have to minus 1 since the first row is the name of each column
    global training_data_num
    propotion_80 = int(training_data_num * 0.8 + 0.5)   # 0.5 is for rounding
    propotion_70 = int(training_data_num * 0.7 + 0.5)
    propotion_60 = int(training_data_num * 0.6 + 0.5)
    
    global training_datalist
    global training_dataset_1, validation_dataset_1
    global training_dataset_2, validation_dataset_2
    global training_dataset_3, validation_dataset_3 
    global training_dataset_4, validation_dataset_4

    training_dataset_1 = training_datalist[0 : propotion_80 ]
    validation_dataset_1 = training_datalist[propotion_80 : ]
    training_dataset_2, validation_dataset_2 = RandomSplitData(propotion_80)
    training_dataset_3, validation_dataset_3 = RandomSplitData(propotion_70)
    training_dataset_4, validation_dataset_4 = RandomSplitData(propotion_60)

# SplitData()
# print(len(training_dataset_1), len(validation_dataset_1))
# print(len(training_dataset_2), len(validation_dataset_2))
# print(len(training_dataset_3), len(validation_dataset_3))
# print(len(training_dataset_4), len(validation_dataset_4))
#print(validation_dataset_2)

#### Step 2: Preprocess Data
Handle the unreasonable data
> Hint: Outlier and missing data can be handled by removing the data or adding the values with the help of statistics  

In [328]:
def PreprocessData():
    '''
    # ref
    1. https://andy6804tw.github.io/2021/04/02/python-outliers-clean/ \n
    2. https://chat.openai.com/share/76e7e9a6-7e42-4cdc-a2bb-72af60082814

    By searching some technique, I decided to remove the outliers by using IQR
    Since I roughly check the training data and found that the outliers are not too much
    If just one of the number in the row is outlier, I will modify its value based on another number in the same row
    '''
    
    global training_datalist, training_data_num
    dbp_list = np.array([float(training_datalist[i][0]) for i in range(training_data_num)])
    sbp_list = np.array([float(training_datalist[i][1]) for i in range(training_data_num)])

    dbp_IQR = np.percentile(dbp_list, 75) - np.percentile(dbp_list, 25)
    sbp_IQR = np.percentile(sbp_list, 75) - np.percentile(sbp_list, 25)

    dbp_upper = np.percentile(dbp_list, 75) + 1.5 * dbp_IQR
    dbp_lower = np.percentile(dbp_list, 25) - 1.5 * dbp_IQR
    sbp_upper = np.percentile(sbp_list, 75) + 1.5 * sbp_IQR
    sbp_lower = np.percentile(sbp_list, 25) - 1.5 * sbp_IQR
    # print(dbp_upper, dbp_lower)
    # print(sbp_upper, sbp_lower)

    tem_list = list()
    for i in range(training_data_num):
        diff_with_mean_dbp = (dbp_list[i] - np.mean(dbp_list)) / dbp_IQR
        diff_with_mean_sbp = (sbp_list[i] - np.mean(sbp_list)) / sbp_IQR
        is_outlier_dbp = bool(dbp_list[i] > dbp_upper or dbp_list[i] < dbp_lower)
        is_outlier_sbp = bool(sbp_list[i] > sbp_upper or sbp_list[i] < sbp_lower)

        if is_outlier_dbp and is_outlier_sbp:
            continue
        elif is_outlier_dbp:
            dbp_list[i] = int(np.mean(dbp_list) + diff_with_mean_sbp * dbp_IQR + 0.5)
        elif is_outlier_sbp:
            sbp_list[i] = int(np.mean(sbp_list) + diff_with_mean_dbp * sbp_IQR + 0.5)

        tem_list.append([sbp_list[i], dbp_list[i]])

    training_datalist = np.array(tem_list)
    training_data_num = training_datalist.shape[0]

# PreprocessData()
# print(training_datalist.shape[0], training_data_num, end = '\n\n')
# print(training_datalist)

#### Step 3: Implement Regression
> use Matrix Inversion to finish this part




In [329]:
def MatrixInversion():
    global training_dataset_1
    global training_dataset_2
    global training_dataset_3
    global training_dataset_4

    phi_1 = np.array([[1, float(training_dataset_1[i][0])] for i in range(len(training_dataset_1))]); phi_1_T = phi_1.T
    phi_2 = np.array([[1, float(training_dataset_2[i][0])] for i in range(len(training_dataset_2))]); phi_2_T = phi_2.T
    phi_3 = np.array([[1, float(training_dataset_3[i][0])] for i in range(len(training_dataset_3))]); phi_3_T = phi_3.T
    phi_4 = np.array([[1, float(training_dataset_4[i][0])] for i in range(len(training_dataset_4))]); phi_4_T = phi_4.T
    y1 = np.array([float(training_dataset_1[i][1]) for i in range(len(training_dataset_1))])
    y2 = np.array([float(training_dataset_2[i][1]) for i in range(len(training_dataset_2))])
    y3 = np.array([float(training_dataset_3[i][1]) for i in range(len(training_dataset_3))])
    y4 = np.array([float(training_dataset_4[i][1]) for i in range(len(training_dataset_4))])

    w1 = np.dot( np.dot( np.linalg.inv(np.dot(phi_1_T, phi_1)), phi_1_T), y1)
    w2 = np.dot( np.dot( np.linalg.inv(np.dot(phi_2_T, phi_2)), phi_2_T), y2)
    w3 = np.dot( np.dot( np.linalg.inv(np.dot(phi_3_T, phi_3)), phi_3_T), y3)
    w4 = np.dot( np.dot( np.linalg.inv(np.dot(phi_4_T, phi_4)), phi_4_T), y4)

    return [(w1[0] + w1[0] + w2[0] + w3[0]) / 4, (w1[1] + w1[1] + w2[1] + w3[1]) / 4]

#### Step 4: Make Prediction
Make prediction of testing dataset and store the value in *output_datalist*
The final *output_datalist* should look something like this 
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

In [330]:
def MakePrediction(coefficients):
    global testing_datalist, output_datalist

    testing_dbp = np.array([ [1, int(testing_datalist[i][0])] for i in range(len(testing_datalist)) ])
    coe = np.array([[coefficients[0]], [coefficients[1]]])
    result = np.dot(testing_dbp, coe)
    print(result)


#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points would be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```





In [331]:
PreprocessData()
SplitData()

coefficients = MatrixInversion()
print(coefficients)

MakePrediction(coefficients)
print(output_datalist)

[-7.029492346312112, 0.6917490234284271]


ValueError: data type must provide an itemsize

### *Write the Output File*
Write the prediction to output csv
> Format: 'sbp'




In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
	writer = csv.writer(csvfile)
	for row in output_datalist:
		writer.writerow(row)

## 1.2 Gradient Descent Method (30%)


*   Save the prediction result in a csv file **hw1_basic_gd.csv**
*   Output your coefficient update in a csv file **hw1_basic_coefficient.csv**
*   Print your coefficient





### *Global attributes*

In [None]:
output_dataroot = 'hw1_basic_gd.csv' # Output file will be named as 'hw1_basic.csv'
coefficient_output_dataroot = 'hw1_basic_coefficient.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 3 matrix and saved as numpy array
                      # The format of each row should be ['subject_id', 'charttime', 'sbp']

coefficient_output = [] # Your coefficient update during gradient descent
                   # Should be a (number of iterations * number_of coefficient) matrix
                   # The format of each row should be ['w0', 'w1', ...., 'wn']

Your own global attributes

### *Implement the Regression Model*


#### Step 1: Split Data

In [None]:
def SplitData():
    pass

#### Step 2: Preprocess Data

In [None]:
def PreprocessData():
    pass

#### Step 3: Implement Regression
> use Gradient Descent to finish this part

In [None]:
def GradientDescent():
    pass

#### Step 4: Make Prediction

Make prediction of testing dataset and store the values in *output_datalist*
The final *output_datalist* should look something like this 
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

Remember to also store your coefficient update in *coefficient_output*
The final *coefficient_output* should look something like this
> [ [1, 0, 3, 5], ... , [0.1, 0.3, 0.2, 0.5] ] where each row contains the [w0, w1, ..., wn] of your coefficient





In [None]:
def MakePrediction():
    pass

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points would be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```



### *Write the Output File*

Write the prediction to output csv
> Format: 'sbp'

**Write the coefficient update to csv**
> Format: 'w0', 'w1', ..., 'wn'
>*   The number of columns is based on your number of coefficient
>*   The number of row is based on your number of iterations

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
	writer = csv.writer(csvfile)
	for row in output_datalist:
		writer.writerow(row)

with open(coefficient_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
	writer = csv.writer(csvfile)
	for row in coefficient_output:
		writer.writerow(row)

# **2. Advanced Part (40%)**
In the second part, you need to implement the regression in a different way than the basic part to help your predictions of multiple patients SBP.

You can choose **either** Matrix Inversion or Gradient Descent method.

The training data will be in **hw1_advanced_training.csv** and the testing data will be in **hw1_advanced_testing.csv**.

Output your prediction in **hw1_advanced.csv**

Notice:
> You cannot import any other package other than those given



### Input the training and testing dataset

In [None]:
training_dataroot = 'hw1_advanced_training.csv' # Training data file file named as 'hw1_basic_training.csv'
testing_dataroot = 'hw1_advanced_testing.csv'   # Testing data file named as 'hw1_basic_training.csv'
output_dataroot = 'hw1_advanced.csv' # Output file will be named as 'hw1_basic.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 220 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

### Your Implementation

### Output your Prediction

> your filename should be **hw1_advanced.csv**

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
	writer = csv.writer(csvfile)
	for row in output_datalist:
		writer.writerow(row)

# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulty you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)