# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases in a different way from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction results in a CSV file named **hw1_basic.csv**.


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [15]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [16]:
input_dataroot = 'hw1_basic_input.csv' # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file will be named as 'hw1_basic.csv'

input_datalist =  [] # Initial datalist; replaced by a pandas DataFrame when the input file is loaded below
output_datalist =  [] # Your prediction, should be 10 * 4 matrix and saved as numpy array
             # The format of each row should be ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [17]:
train = []
validation = []
test = []
weights = []

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**.

The input data is stored in *input_datalist*.

In [18]:
# Read input csv to datalist
input_datalist = pd.read_csv(input_dataroot)

## Implement the Regression Model

> Note: It is recommended to use the functions we defined, but you can also define your own functions.


### Step 1: Split Data
Split the data in *input_datalist* into a training dataset and a validation dataset.



In [19]:
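def SplitData():
  global train, validation, test
  # int(94*0.9) == 84: rows 50-83 are used for training (rows 0-49 are skipped)
  train = input_datalist[50:int(94*0.9)]
  # Rows 84-93 are held out for validation
  validation = input_datalist[int(94*0.9):94]
  # Rows from 94 onward are the ones to predict
  test = input_datalist[94:]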
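
The split above is positional, so the boundaries are easier to audit when they are named. The following is a minimal sketch, not part of the assignment template: `split_by_position` is a hypothetical helper, and the 104-row demo frame is a synthetic stand-in chosen so that the test set has the 10 rows the output format asks for.

```python
import pandas as pd

def split_by_position(df, start, labeled_rows, train_ratio):
    """Chronological split: [start, cut) train, [cut, labeled_rows) validation, rest test."""
    cut = int(labeled_rows * train_ratio)  # int(94 * 0.9) == 84
    train_part = df.iloc[start:cut]
    validation_part = df.iloc[cut:labeled_rows]
    test_part = df.iloc[labeled_rows:]
    return train_part, validation_part, test_part

# Synthetic stand-in for input_datalist (assumed 104 rows; the real file may differ)
split_demo = pd.DataFrame({"epiweek": range(104)})
train_demo, validation_demo, test_demo = split_by_position(split_demo, 50, 94, 0.9)
print(len(train_demo), len(validation_demo), len(test_demo))  # 34 10 10
```

Keeping the split chronological (no shuffling) means the validation rows always come after the training rows, which mirrors how the test rows follow both.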

### Step 2: Preprocess Data
Handle unreasonable values in the data.
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values derived from statistics (for example, the median or quantile bounds).

In [20]:
def HandleOutliers(dataset, kind):
  # IQR method
  iqr = dataset[kind].quantile(0.75) - dataset[kind].quantile(0.25)
  lb = dataset[kind].quantile(0.25) - iqr*1.5
  ub = dataset[kind].quantile(0.75) + iqr*1.5
  dataset.loc[dataset[kind] >= ub, kind] = ub
  dataset.loc[dataset[kind] <= lb, kind] = lb
  return dataset

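def PreprocessData(dataset):
  # Remove rows with missing data; .copy() makes the result independent of
  # input_datalist so the .loc assignments in HandleOutliers do not trigger
  # pandas' SettingWithCopyWarning
  dataset = dataset.dropna(how="any").copy()
  # Clip the outliers of every column except 'epiweek'
  for i in dataset.columns:
    if i != 'epiweek':
      dataset = HandleOutliers(dataset, i)
  return dataset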
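
The IQR rule in `HandleOutliers` clips values to the range from Q1 - 1.5·IQR to Q3 + 1.5·IQR. As a hedged illustration on made-up numbers, the sketch below expresses the same rule with `Series.clip` and also shows median imputation as one way of "adding the values with the help of statistics". The names `clip_iqr` and `fill_missing_with_median` are hypothetical and not part of the template.

```python
import pandas as pd

def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; returns a new Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def fill_missing_with_median(series):
    """Replace missing values with the median of the observed values."""
    return series.fillna(series.median())

# Made-up values: Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7
outlier_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
print(clip_iqr(outlier_demo).tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]

# Made-up values: the median of 10 and 30 is 20
missing_demo = pd.Series([10.0, None, 30.0])
print(fill_missing_with_median(missing_demo).tolist())  # [10.0, 20.0, 30.0]
```

Both helpers return new objects instead of assigning into a slice, which avoids the view-versus-copy ambiguity in pandas.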

### Step 3: Implement Regression
> Hint: You can use matrix inversion or gradient descent to finish this part




In [21]:
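def Regression(xdata, ydata, degree):
  # Polynomial least squares via the normal equations A w = B, where
  # A[i][j] = sum(x^(i+j)) and B[i] = sum(x^i * y) for i, j = 0..degree
  A = []
  B = []
  for i in range(degree+1):
    row = []
    for j in range(degree+1):
      row.append(np.sum([x**(i+j) for x in xdata]))
    A.append(row)
    B.append(np.sum(np.multiply([x**i for x in xdata], ydata)))
  # Solve by matrix inversion; weights[k] is the coefficient of x^k
  weights = np.matmul(np.linalg.inv(A), np.transpose(B))
  return weights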
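
The loops in `Regression` build the normal equations for a polynomial fit: with a design matrix X whose columns are x^0, x^1, ..., x^degree, the matrix A equals XᵀX and the vector B equals Xᵀy. As a cross-check, the sketch below fits the same model with `np.vander` and `np.linalg.lstsq`. This is a swapped-in solver, not the matrix inversion used above, and `fit_polynomial` is a hypothetical name; the two approaches should agree up to rounding on well-conditioned data.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns coefficients ordered x^0 .. x^degree."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    # Columns are x^0, x^1, ..., x^degree
    design = np.vander(x, degree + 1, increasing=True)
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients

# Made-up data that lies exactly on y = 1 + 2x + 3x^2
x_demo = np.arange(6)
y_demo = 1 + 2 * x_demo + 3 * x_demo**2
fitted = fit_polynomial(x_demo, y_demo, 2)
print(np.round(fitted, 6))  # close to 1, 2, 3 (lowest degree first)
```

Solving the least-squares problem directly avoids forming an explicit inverse, which tends to be more robust when high powers of x make the normal-equation matrix ill-conditioned.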

### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*.

In [22]:
def MAPE_calc(ydata, ypred):
  s = 0
  for i in range(len(ydata)):
    s += abs(ydata[i]-ypred[i])/ydata[i]
  return s*100/len(ydata)

def MakePrediction():
  global weights, train, validation, test
  prediction = []
  for i in range(len(weights)):
    temp = []
    for x in test.iloc[:,i+1:i+2].values:
      temp.append(int(np.sum([w*x**j for j, w in enumerate(weights[i])])))
    prediction.append(temp)
    #print(MAPE_calc(validation.iloc[:,i+4:i+5].values, temp))
    #print(temp)
  return prediction
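
`MakePrediction` evaluates each polynomial term by term, and `MAPE_calc` averages the absolute percentage errors. The sketch below is a vectorized version of both under the same convention as `Regression` (coefficients ordered from x^0 upward). The names `predict_polynomial` and `mape` are hypothetical, and the numbers are made up for illustration.

```python
import numpy as np

def predict_polynomial(weights_low_to_high, x):
    """Evaluate sum(w[k] * x**k) for every x."""
    x = np.asarray(x, dtype=float).ravel()
    # np.polyval expects the highest-degree coefficient first, so reverse the order
    return np.polyval(np.asarray(weights_low_to_high, dtype=float)[::-1], x)

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; undefined if y_true contains 0."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred) / y_true) * 100)

# 1 + 2x + 3x^2 evaluated at x = 0, 1, 2 gives 1, 6, 17
poly_values = predict_polynomial([1, 2, 3], [0, 1, 2])
print(poly_values)

# Errors of 10% and 10% average to about 10%
mape_value = mape([100, 200], [110, 180])
print(mape_value)
```

To score a model, the validation features and validation targets would be passed in, since MAPE needs the true case counts and the test rows are the ones being predicted.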

### Step 5: Train Model and Generate Result

> Note: **Remember to output the coefficients of the model here**; otherwise, 5 points will be deducted.
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be: 
```
3 2 1
```





In [23]:
SplitData()
train = PreprocessData(train)
validation = PreprocessData(validation)

CityA_train_xdata = train.iloc[:,1:2].values
CityA_train_ydata = train.iloc[:,4:5].values
CityB_train_xdata = train.iloc[:,2:3].values
CityB_train_ydata = train.iloc[:,5:6].values
CityC_train_xdata = train.iloc[:,3:4].values
CityC_train_ydata = train.iloc[:,6:7].values

weights.append(Regression(CityA_train_xdata, CityA_train_ydata, 4)) # CityA
weights.append(Regression(CityB_train_xdata, CityB_train_ydata, 3)) # CityB
weights.append(Regression(CityC_train_xdata, CityC_train_ydata, 2)) # CityC

output_datalist.append(test['epiweek'].values)
pred = MakePrediction()
for i in range(len(pred)):
  output_datalist.append(pred[i])
output_datalist = np.transpose(output_datalist)



In [24]:
# Output the coefficients of the models
for i in range(len(weights)):
  print("Coefficients of Regression Model for City", chr(i+65), ": ")
  for j in range(len(weights[i])-1, -1, -1):
    if j == 0:
      print(weights[i][j])
    else:
      print(weights[i][j], end=', ')
  print()

Coefficients of Regression Model for City A : 
0.007982119326698012, -0.7125281058251858, 22.512685418128967, -289.6496422290802, 1229.2750129699707

Coefficients of Regression Model for City B : 
-0.11077442114037694, 7.458440331276506, -166.8636347129941, 1262.573723912239

Coefficients of Regression Model for City C : 
0.22522304691960926, -14.877287703982574, 274.34064501895045
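
The instructions give `3 2 1` as the expected output for *3x^2 + 2x^1 + 1*, that is, coefficients separated by spaces with the highest degree first. The sketch below shows one way to produce that layout from weights stored lowest degree first. `format_coefficients` is a hypothetical helper, and whether the grader requires spaces or accepts commas is an assumption to confirm.

```python
def format_coefficients(weights_low_to_high):
    """Join coefficients with spaces, highest degree first."""
    # ':g' keeps at most 6 significant digits; use str(w) if full precision is needed
    return " ".join(f"{w:g}" for w in reversed(list(weights_low_to_high)))

# Weights for 1 + 2x + 3x^2, stored lowest degree first
coefficient_line = format_coefficients([1.0, 2.0, 3.0])
print(coefficient_line)  # 3 2 1
```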



## Write the Output File
Write the predictions to the output CSV file.
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [25]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)
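
For comparison, the same file layout can be produced with pandas. This is a minimal sketch on made-up rows: `predictions_to_csv` is a hypothetical helper, the epiweek values are invented, and `header=False` assumes the file should have no header row, as in the cell above.

```python
import io
import pandas as pd

def predictions_to_csv(epiweeks, city_a, city_b, city_c, target):
    """Write one row per epiweek in the order epiweek, CityA, CityB, CityC."""
    frame = pd.DataFrame({
        "epiweek": epiweeks,
        "CityA": city_a,
        "CityB": city_b,
        "CityC": city_c,
    })
    # index=False drops the row index; header=False omits the column names
    frame.to_csv(target, index=False, header=False)

# An in-memory buffer stands in for the real output path
csv_buffer = io.StringIO()
predictions_to_csv([202301, 202302], [5, 6], [7, 8], [9, 10], csv_buffer)
print(csv_buffer.getvalue())
```

Passing a file path such as `output_dataroot` instead of the buffer would write the rows to disk.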

# 2. Advanced Part (35%)
In the second part, you need to implement the regression in a different way from the basic part to improve your predictions of the number of dengue cases.

We provide two files, **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv**, that can help you in this part.

Please save the prediction results in a CSV file named **hw1_advanced.csv**.


In [26]:
from sklearn.linear_model import LinearRegression

# Use temperature and precipitation
basic_input_datalist = pd.read_csv(input_dataroot)
advanced_input_datalist = pd.read_csv("hw1_advanced_input1.csv")
advanced_input_datalist = advanced_input_datalist.drop("epiweek", axis=1)
advanced_output_datalist = []

# Combine two dataframes
datalist = pd.concat([basic_input_datalist, advanced_input_datalist], axis=1)

# Split and preprocess the data
advanced_train = datalist[50:int(94*0.9)]
advanced_validation = datalist[int(94*0.9):94]
advanced_test = datalist[94:]

advanced_train = PreprocessData(advanced_train)
advanced_validation = PreprocessData(advanced_validation)

# Data for CityA
CityA_train_xdata = advanced_train[['TemperatureA', 'PrecipitationA']].values
CityA_train_ydata = advanced_train['CityA'].values
CityA_validation_xdata = advanced_validation[['TemperatureA', 'PrecipitationA']].values
CityA_validation_ydata = advanced_validation['CityA'].values
CityA_test_xdata = advanced_test[['TemperatureA', 'PrecipitationA']].values

# Data for CityB
CityB_train_xdata = advanced_train[['TemperatureB', 'PrecipitationB']].values
CityB_train_ydata = advanced_train['CityB'].values
CityB_validation_xdata = advanced_validation[['TemperatureB', 'PrecipitationB']].values
CityB_validation_ydata = advanced_validation['CityB'].values
CityB_test_xdata = advanced_test[['TemperatureB', 'PrecipitationB']].values

# Data for CityC
CityC_train_xdata = advanced_train[['TemperatureC', 'PrecipitationC']].values
CityC_train_ydata = advanced_train['CityC'].values
CityC_validation_xdata = advanced_validation[['TemperatureC', 'PrecipitationC']].values
CityC_validation_ydata = advanced_validation['CityC'].values
CityC_test_xdata = advanced_test[['TemperatureC', 'PrecipitationC']].values

# Train the regression model and make prediction
advanced_prediction = []
advanced_prediction.append(LinearRegression().fit(CityA_train_xdata, CityA_train_ydata).predict(CityA_test_xdata))
advanced_prediction.append(LinearRegression().fit(CityB_train_xdata, CityB_train_ydata).predict(CityB_test_xdata))
advanced_prediction.append(LinearRegression().fit(CityC_train_xdata, CityC_train_ydata).predict(CityC_test_xdata))

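# Collect epiweek and predictions into advanced_output_datalist
advanced_output_datalist.append(advanced_test['epiweek'].values)
for i in range(len(advanced_prediction)):
  # Round to whole cases; float predictions would also turn epiweek into a float after the transpose
  advanced_output_datalist.append(np.rint(advanced_prediction[i]).astype(int))
advanced_output_datalist = np.transpose(advanced_output_datalist)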

# Write the predictions to the csv file
with open("hw1_advanced.csv", 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in advanced_output_datalist:
    writer.writerow(row)
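
The cell above prepares validation arrays for each city but does not score the fitted models on them. The sketch below shows how a two-feature linear model could be checked on held-out rows. It is NumPy-only so that it stays self-contained: `np.linalg.lstsq` minimizes the same ordinary least-squares objective as `LinearRegression`, but it is a stand-in, not the scikit-learn call used above. The function names and the noise-free synthetic temperature and precipitation data are assumptions for illustration only.

```python
import numpy as np

def fit_linear(features, targets):
    """Ordinary least squares with an intercept; returns [intercept, w_1, ..., w_n]."""
    features = np.asarray(features, dtype=float)
    design = np.column_stack([np.ones(len(features)), features])
    coefficients, *_ = np.linalg.lstsq(design, np.asarray(targets, dtype=float), rcond=None)
    return coefficients

def predict_linear(coefficients, features):
    """Apply the fitted intercept and weights to new feature rows."""
    features = np.asarray(features, dtype=float)
    return coefficients[0] + features @ coefficients[1:]

# Synthetic data: cases = 2 + 3 * temperature + 0.5 * precipitation (no noise)
rng = np.random.default_rng(0)
temperature = rng.uniform(20, 35, size=40)
precipitation = rng.uniform(0, 100, size=40)
x_all = np.column_stack([temperature, precipitation])
y_all = 2.0 + 3.0 * temperature + 0.5 * precipitation

# Fit on the first 30 rows, score on the last 10
linear_coefficients = fit_linear(x_all[:30], y_all[:30])
validation_pred = predict_linear(linear_coefficients, x_all[30:])
validation_mape = float(np.mean(np.abs(y_all[30:] - validation_pred) / y_all[30:]) * 100)
print(np.round(linear_coefficients, 3), validation_mape)
```

On real data the validation MAPE will not be near zero; comparing it across feature choices (for example, with and without precipitation) is one way to decide which model to use for the test rows.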

# Report *(5%)*

The report should be submitted as a PDF file named **hw1_report.pdf**.

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)