# CS 584 :: Data Mining :: George Mason University :: Spring 2024


# Homework 2: Linear Regression & Neural Networks

- **100 points [8% of your final grade]**
- **Due Sunday, March 10 by 11:59pm**

- *Goals of this homework:* (1) implement the linear regression model; (2) implement the multi-layer perceptron neural network; (3) tune the hyperparameters of the MLP model to produce the best classification results you can.

- *Submission instructions:* for this homework, you need to submit to two different platforms. First, you should submit your notebook file to Blackboard (look for the homework 2 assignment there). Please name your submission **FirstName_Lastname_hw2.ipynb**, so for example, my submission would be something like **Ziwei_Zhu_hw2.ipynb**. Your notebook should be **fully executed** so that we can see all outputs. Then, you need to submit an output file generated by this notebook (described later in this notebook) to the HW2 page on the http://miner2.vsnet.gmu.edu website.

## Part 1: Linear Regression (40 points)

Recent studies have found that novel mobile games can lead to increased physical activity. A notable example is Pokemon Go, a mobile game that uses augmented reality to combine the Pokemon world with the real world and requires players to physically move around. Specifically, researchers in the following study found that Pokemon Go leads to increased levels of physical activity for the most engaged players: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5174727/
![image.png](attachment:image.png)



In this part, our goal is to predict the combat points of each Pokemon in the 2017 Pokemon Go mobile game. Each Pokemon has its own unique attributes that can help predict its combat points. These include:

- Stamina
- Attack value
- Defense value
- Capture rate
- Flee rate
- Spawn chance
- Primary strength

The file pokemon_data.csv contains data on 146 Pokemon to be used in this homework. The rows of this file are the data samples (i.e., Pokemon samples), while the columns denote the name of the Pokemon (column 1), its attributes (columns 2-8), and the combat point outcome (column 9). You can ignore column 1 for the rest of this problem.

First, let's load the data by executing the following code.

**Note: you need to install the pandas library beforehand**

In [71]:
import numpy as np
import pandas as pd

data_frame = pd.read_csv('pokemon_data.csv')
data_frame.head()

Unnamed: 0,name,stamina,attack_value,defense_value,capture_rate,flee_rate,spawn_chance,primary_strength,combat_point
0,Bulbasaur,90,126,126,0.16,0.1,69.0,Grass,1079
1,Ivysaur,120,156,158,0.08,0.07,4.2,Grass,1643
2,Venusaur,160,198,200,0.04,0.05,1.7,Grass,2598
3,Charmander,78,128,108,0.16,0.1,25.3,Fire,962
4,Charmeleon,116,160,140,0.08,0.07,1.2,Fire,1568


Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of Python. By executing the following code, let's create one NumPy array containing the feature data without the name column and one array containing the combat point ground truth.

In [72]:
features = data_frame.values[:, 1:-1]
labels = data_frame.values[:, -1]
print('array of labels: shape ' + str(np.shape(labels)))
print('array of feature matrix: shape ' + str(np.shape(features)))

array of labels: shape (146,)
array of feature matrix: shape (146, 7)


Now, you may notice that we have a categorical feature 'primary_strength' in our data. Categorical features require special attention because they usually cannot be used as input to regression models as they are. One way to treat a categorical feature is to simply convert each of its values to a separate number. However, this can introduce non-existent ordinal relationships between the values, which might not be representative of the data (e.g., if we assign “1” to the value “green” and “2” to the value “red”, the regression algorithm will assume that “red” is greater than “green,” which is not necessarily the case). For this reason, we can use a “one hot encoding” to represent categorical features. With this encoding, we create a binary column for each category of the categorical feature, which takes a value of 1 if the sample belongs to that category, and 0 otherwise. For each categorical feature of the problem, count the number of different values and implement the one hot encoding. For the remainder of the problem, you will be working with the one hot encoding of the categorical features.


In the next cell, write your code to replace the categorical feature 'primary_strength' with **one-hot encoding** and generate the new version of the Numpy array 'features'.

**Hint: if you don't remember one hot encoding, review the slides of our first-week lecture.**

**Note: do not use sklearn to automatically generate one hot encoding.**
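The one hot encoding described above can be sketched with plain NumPy. This is only an illustrative sketch, not the required solution: the helper name `one_hot_encode` and the four sample values are assumptions made for the example.

```python
import numpy as np

def one_hot_encode(values):
    """Return (categories, matrix) with matrix[i, j] == 1 iff values[i] == categories[j]."""
    values = np.asarray(values)
    # np.unique gives the sorted distinct categories and, per sample, the index of its category.
    categories, inverse = np.unique(values, return_inverse=True)
    inverse = inverse.reshape(-1)
    encoded = np.zeros((values.shape[0], categories.shape[0]), dtype=int)
    # Set exactly one entry per row: the column of that sample's category.
    encoded[np.arange(values.shape[0]), inverse] = 1
    return categories, encoded

# Hypothetical sample values, used only to illustrate the encoding.
categories, encoded = one_hot_encode(["Grass", "Fire", "Grass", "Water"])
print(categories)
print(encoded)
```

Each row of the resulting matrix contains a single 1, so no artificial ordering between categories is introduced.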

In [73]:
# Write your code here
data_frame['primary_strength'].describe()

count       146
unique       15
top       Water
freq         28
Name: primary_strength, dtype: object

In [74]:
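# One-hot encode 'primary_strength' (dtype=int gives 0/1 columns instead of booleans)
encoder_ps = pd.get_dummies(data_frame['primary_strength'], dtype=int)
data_frame = data_frame.drop('primary_strength', axis=1)
# Move 'combat_point' to the end so it remains the last column after the join
com_po_data = data_frame['combat_point']
data_frame = data_frame.drop('combat_point', axis=1)
data_frame = data_frame.join(encoder_ps)
data_frame = data_frame.join(com_po_data)
# Constant column acting as the bias (intercept) term; stored as a number, not a string
data_frame.insert(1, 'bias_term', 0.001)

# Column 0 is the name and the last column is the label
features = data_frame.values[:, 1:-1]
labels = data_frame.values[:, -1]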

data_frame


Unnamed: 0,name,bias_term,stamina,attack_value,defense_value,capture_rate,flee_rate,spawn_chance,Bug,Dragon,...,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Water,combat_point
0,Bulbasaur,0.001,90,126,126,0.16,0.10,69.00,0,0,...,0,1,0,0,0,0,0,0,0,1079
1,Ivysaur,0.001,120,156,158,0.08,0.07,4.20,0,0,...,0,1,0,0,0,0,0,0,0,1643
2,Venusaur,0.001,160,198,200,0.04,0.05,1.70,0,0,...,0,1,0,0,0,0,0,0,0,2598
3,Charmander,0.001,78,128,108,0.16,0.10,25.30,0,0,...,0,0,0,0,0,0,0,0,0,962
4,Charmeleon,0.001,116,160,140,0.08,0.07,1.20,0,0,...,0,0,0,0,0,0,0,0,0,1568
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,Aerodactyl,0.001,160,182,162,0.16,0.09,1.80,0,0,...,0,0,0,0,0,0,0,1,0,2180
142,Snorlax,0.001,320,180,180,0.16,0.09,1.60,0,0,...,0,0,0,0,1,0,0,0,0,3135
143,Dratini,0.001,82,128,110,0.32,0.09,30.00,0,1,...,0,0,0,0,0,0,0,0,0,990
144,Dragonair,0.001,122,170,152,0.08,0.06,2.00,0,1,...,0,0,0,0,0,0,0,0,0,1760


Besides, you may also notice that other features have different scales. So, you need to standardize them: $({x-\mu})/{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation. Write your code below.

**Hint: details about feature standardization are also in the slides of our first-week lecture.**

**Note: You do not need to standardize the one-hot encodings.**
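As a minimal sketch of the standardization formula above, the following applies $(x-\mu)/\sigma$ column by column with NumPy. The helper name `standardize` is hypothetical, and the sketch assumes the population standard deviation (NumPy's default, `ddof=0`); pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, so results can differ slightly between the two.

```python
import numpy as np

def standardize(columns):
    """Standardize each column to zero mean and unit standard deviation."""
    columns = np.asarray(columns, dtype=np.float64)
    mu = columns.mean(axis=0)      # per-column mean
    sigma = columns.std(axis=0)    # per-column population standard deviation (ddof=0)
    return (columns - mu) / sigma, mu, sigma

# Stamina and attack values of the first three rows shown earlier, for illustration only.
numeric = np.array([[90, 126], [120, 156], [160, 198]])
scaled, mu, sigma = standardize(numeric)
print(scaled)
```

Returning `mu` and `sigma` makes it possible to apply the same transformation to held-out data later.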

In [75]:
# Write your code here
# Standardize only the numeric columns; the one-hot columns and the bias term stay unchanged
numeric_cols = ['stamina', 'attack_value', 'defense_value', 'capture_rate', 'flee_rate', 'spawn_chance']
data_frame_val = data_frame[numeric_cols]
data_frame[numeric_cols] = (data_frame_val - data_frame_val.mean()) / data_frame_val.std()

feature1 = data_frame.values[:, 1:-1]
labels1 = data_frame.values[:, -1]

Now, in the next cell, you need to implement your own **linear regression model** using the **Ordinary Least Squares (OLS)** solution **without regularization**. Here, you should adopt the **5-fold cross-validation** method. For each fold, compute and print out the **square root** of the residual sum of squares (RSS) error between the actual and predicted outcome variable. Also compute and print out the **average** square root of the RSS over all folds.

**Note: You should implement the algorithm by yourself. You are NOT allowed to use Machine Learning libraries like Sklearn.**

**Hint: Use numpy.linalg.pinv() for calculating the inverse of a matrix.**

**Hint: details about cross-validation are on pages 40-42 in the slides of the KNN lecture.**
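One way to organize the experiment described above is sketched below, under stated assumptions: the function names are hypothetical, the folds are contiguous and non-overlapping (built with `np.array_split`, so every sample is tested exactly once), and the data is synthetic rather than the Pokemon data.

```python
import numpy as np

def fit_ols(x_train, y_train):
    # Closed-form OLS solution: w = pinv(X^T X) X^T y
    return np.linalg.pinv(x_train.T @ x_train) @ x_train.T @ y_train

def sqrt_rss(w, x_test, y_test):
    # Square root of the residual sum of squares on the held-out fold
    residual = x_test @ w - y_test
    return float(np.sqrt(residual @ residual))

def cross_validate(x, y, n_folds=5):
    # Split the sample indices into n_folds disjoint, contiguous blocks
    folds = np.array_split(np.arange(x.shape[0]), n_folds)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(x.shape[0], dtype=bool)
        train_mask[test_idx] = False  # everything outside the test fold is training data
        w = fit_ols(x[train_mask], y[train_mask])
        scores.append(sqrt_rss(w, x[test_idx], y[test_idx]))
    return scores

# Synthetic, noise-free data (an assumption for illustration): a column of ones plus 3 features
rng = np.random.default_rng(0)
x = np.hstack([np.ones((146, 1)), rng.normal(size=(146, 3))])
y = x @ np.array([2.0, 1.0, -3.0, 0.5])

scores = cross_validate(x, y)
for i, score in enumerate(scores, start=1):
    print(f"fold {i}: {score:.6f}")
print(f"average: {np.mean(scores):.6f}")
```

Building the folds from a single index split avoids hand-written slice boundaries, which are an easy place to accidentally drop a sample or leak a test sample into the training set.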


In [76]:
# Write your code here
import math
from numpy import linalg as lag

def rss_function(x_train, y_train, x_test, y_test):
    # Closed-form OLS solution: w = pinv(X^T X) X^T y
    invers = lag.pinv(np.dot(x_train.T, x_train))
    invers_x = np.dot(invers, x_train.T)
    w_matrix = np.dot(invers_x, y_train)

    # Predict on the held-out fold and compute the residuals
    y_pred = np.dot(w_matrix, x_test.T)
    y_diff = y_pred - y_test

    # Residual sum of squares; print its square root
    rss_value = np.dot(y_diff.T, y_diff)
    print(math.sqrt(rss_value))


# fold 1
x_train = feature1[29:].astype(np.float64)
y_train = labels1[29:].astype(np.float64)

x_test = feature1[:29].astype(np.float64)
y_test = labels1[:29].astype(np.float64)
print("fold 1: ")
rss_function(x_train,y_train,x_test,y_test)

# fold 2
x_train = np.concatenate((feature1[:29].astype(np.float64),feature1[59:].astype(np.float64)),axis=0)
y_train = np.concatenate((labels1[:29].astype(np.float64),labels1[59:].astype(np.float64)),axis=0)

x_test = feature1[30:59].astype(np.float64)
y_test = labels1[30:59].astype(np.float64)
print("fold 2: ")
rss_function(x_train,y_train,x_test,y_test)

# fold 3
x_train = np.concatenate((feature1[:59].astype(np.float64),feature1[89:].astype(np.float64)),axis=0)
y_train = np.concatenate((labels1[:59].astype(np.float64),labels1[89:].astype(np.float64)),axis=0)

x_test = feature1[60:88].astype(np.float64)
y_test = labels1[60:88].astype(np.float64)
print("fold 3: ")
rss_function(x_train,y_train,x_test,y_test)

# fold 4
x_train = np.concatenate((feature1[:88].astype(np.float64),feature1[116:].astype(np.float64)),axis=0)
y_train = np.concatenate((labels1[:88].astype(np.float64),labels1[116:].astype(np.float64)),axis=0)

x_test = feature1[89:116].astype(np.float64)
y_test = labels1[89:116].astype(np.float64)
print("fold 4: ")
rss_function(x_train,y_train,x_test,y_test)

# fold 5
x_train = feature1[:116].astype(np.float64)
y_train = labels1[:116].astype(np.float64)

x_test = feature1[115:].astype(np.float64)
y_test = labels1[115:].astype(np.float64)
print("fold 5: ")
rss_function(x_train,y_train,x_test,y_test)



fold 1: 
1286.303954284591
fold 2: 
2399.4271712163472
fold 3: 
844.7269999965312
fold 4: 
3107.3545763482703
fold 5: 
3358.661350289822


To finish this part, please repeat the same experiment as in the previous step, but instead of plain linear regression, implement linear regression **with L2-norm regularization**. Experiment and report your results (average square root of RSS over 5-fold cross-validation) with different values of the regularization term $\lambda=\{1, 0.1, 0.01, 0.001, 0.0001\}$.

**Hint: details about the closed-form solution with regularization are on page 76 in the slides of our linear regression lecture.**
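A minimal sketch of L2-regularized linear regression follows. It assumes the standard ridge closed form $w = (X^\top X + \lambda I)^{-1} X^\top y$, which may be written differently in the lecture slides, and it regularizes every weight, including any bias column. The function name `fit_ridge` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(x_train, y_train, lam):
    # Ridge closed form: w = (X^T X + lam * I)^(-1) X^T y
    n_features = x_train.shape[1]
    gram = x_train.T @ x_train + lam * np.eye(n_features)
    return np.linalg.pinv(gram) @ x_train.T @ y_train

# Synthetic data, used only to show the effect of the regularization term
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

for lam in [1, 0.1, 0.01, 0.001, 0.0001]:
    w = fit_ridge(x, y, lam)
    print(f"lambda={lam}: ||w|| = {np.linalg.norm(w):.6f}")
```

Larger values of $\lambda$ shrink the weight vector toward zero; with $\lambda = 0$ the expression reduces to the unregularized OLS solution. The same 5-fold loop used for OLS can call such a function once per value of $\lambda$.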

In [77]:
# Write your code here
# implemented above

## Part 2: Neural Networks (40 points)

In this part, you are going to implement your multi-layer perceptron model using the PyTorch library. You will still use the same handwritten digit image dataset from HW1. So, in the next few cells, please run the provided code to load and process the data, and create dataset objects for further use by PyTorch.

**Note: you need to install PyTorch beforehand. Or, you can use Google Colab for this homework, which is recommended.**

In [78]:
# load data from file and split into training and validation sets
import numpy as np
data = np.loadtxt("train.txt", delimiter=',')
perm_idx = np.random.permutation(data.shape[0])
vali_num = int(data.shape[0] * 0.2)
vali_idx = perm_idx[:vali_num]
train_idx = perm_idx[vali_num:]
train_data = data[train_idx]
vali_data = data[vali_idx]
train_features = train_data[:, 1:].astype(np.float32)
train_labels = train_data[:, 0].astype(int)
vali_features = vali_data[:, 1:].astype(np.float32)
vali_labels = vali_data[:, 0].astype(int)

In [79]:
# define a Dataset class
import torch
from torch import nn
from torchvision import datasets, transforms
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
class MNISTDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx, :], self.labels[idx]

In [80]:
training_data = MNISTDataset(train_features, train_labels)
vali_data = MNISTDataset(vali_features, vali_labels)
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
vali_dataloader = DataLoader(vali_data, batch_size=batch_size)

for X, y in train_dataloader:
    print(f"Shape of X [N, F]: {X.shape} {X.dtype}")
    print(f"Shape of y: {y.shape} {y.dtype}")
    break

Shape of X [N, F]: torch.Size([64, 784]) torch.float32
Shape of y: torch.Size([64]) torch.int64


Now, you should have the train_dataloader and vali_dataloader. Next, you need to build and train your multi-layer perceptron model with PyTorch.

https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html gives a comprehensive example of how to achieve this. Please read this tutorial closely, and implement the model in the next few cells.

https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/c30c1dcf2bc20119bcda7e734ce0eb42/quickstart_tutorial.ipynb provides the interactive version, which you can run and edit.

**Note: in your implementation:**
- you will only have three layers [784 -> 512 -> 10], so you need to remove the [512 -> 512] layer in the tutorial.
- add 'weight_decay=1e-4' in torch.optim.SGD to add L2 regularization.
- train the model for 10 epochs instead of 5 epochs.
- keep all other hyper-parameters the same as used in the tutorial.
- **You are allowed to reuse the code in the tutorial for this homework**
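To illustrate why 'weight_decay' acts as L2 regularization, here is a small NumPy sketch of a single plain SGD step (no momentum). It assumes the usual formulation in which the decay term `weight_decay * weights` is added to the gradient before the update; the function name `sgd_step` and the numbers are illustrative only, not PyTorch's implementation.

```python
import numpy as np

def sgd_step(weights, grad, lr=1e-3, weight_decay=1e-4):
    # weight_decay * weights is the gradient of (weight_decay / 2) * ||weights||^2,
    # so adding it to the loss gradient penalizes large weights (L2 regularization).
    return weights - lr * (grad + weight_decay * weights)

# Illustrative values only
w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
w_new = sgd_step(w, g)
print(w_new)

# With a zero loss gradient, the update only shrinks the weights toward zero
w_shrunk = sgd_step(w, np.zeros_like(w))
print(w_shrunk)
```

The second call shows the decay in isolation: each weight is multiplied by `1 - lr * weight_decay`.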


**Note: print out the training process and the final accuracy on the validation set.**

**Note: you can use Colab for running the code with GPU for free (open a colab notebook, then Runtime->Change runtime type->Hardware accelerator->GPU)**

In [85]:
# Write your code
# Get cpu or gpu device for training

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

Using cpu device
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=10, bias=True)
  )
)


In [86]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

In [87]:

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

In [88]:
epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(vali_dataloader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 30.519804  [    0/48000]
loss: 0.820468  [ 6400/48000]
loss: 0.233496  [12800/48000]
loss: 0.460172  [19200/48000]
loss: 0.274668  [25600/48000]
loss: 0.115464  [32000/48000]
loss: 0.370955  [38400/48000]
loss: 0.053592  [44800/48000]
Test Error: 
 Accuracy: 94.2%, Avg loss: 0.230328 

Epoch 2
-------------------------------
loss: 0.134763  [    0/48000]
loss: 0.160597  [ 6400/48000]
loss: 0.100323  [12800/48000]
loss: 0.221694  [19200/48000]
loss: 0.027613  [25600/48000]
loss: 0.043076  [32000/48000]
loss: 0.176671  [38400/48000]
loss: 0.020381  [44800/48000]
Test Error: 
 Accuracy: 95.1%, Avg loss: 0.185548 

Epoch 3
-------------------------------
loss: 0.044283  [    0/48000]
loss: 0.088944  [ 6400/48000]
loss: 0.056728  [12800/48000]
loss: 0.123044  [19200/48000]
loss: 0.002733  [25600/48000]
loss: 0.028961  [32000/48000]
loss: 0.105163  [38400/48000]
loss: 0.013488  [44800/48000]
Test Error: 
 Accuracy: 95.6%, Avg loss: 0.169530 

Epo

In [89]:
torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

Saved PyTorch Model State to model.pth


In [90]:
model = NeuralNetwork()
model.load_state_dict(torch.load("model.pth"))

<All keys matched successfully>

## Part 3: Tune Hyperparameter [Need to submit to Miner2] (20 points)

In this part, you need to do your best to tune the hyperparameters of the MLP to build the best model, and submit the predictions for the testing data to the Miner2 system. First of all, let's load the testing data by executing the following code.

In [91]:
test_features = np.loadtxt("test.txt", delimiter=',')
print('array of testing feature matrix: shape ' + str(np.shape(test_features)))

array of testing feature matrix: shape (10000, 784)


Now, you should tune four hyperparameters:

- the number of layers and the dimension of each layer (explore as much as you can, but choose reasonable settings considering the computational resource you have)
- the activation function (choose from sigmoid, tanh, relu, leaky_relu)
- weight decay
- number of training epochs

Rules:

- Write your predictions for samples in the testing set into a file, in which each line has one integer indicating the prediction from your best model for the corresponding sample in the test.txt file. Please see the format.txt file in Miner2 as one submission example. Name the submission file hw2_Miner2.txt and submit it to Miner2 HW2 page.
- The public leaderboard shows results for only a randomly chosen 50% of the test instances. This is standard practice in data mining challenges to prevent gaming of the system. The private leaderboard, released after the deadline, evaluates all the entries in the test set.
- You are allowed 5 submissions in a 24 hour cycle.
- The final score and ranking will always be based on the last submission.
- Grading will be based only on model performance (the Accuracy metric), not on ranking. You'll get full credit as long as your score is a reasonable number.


**Hint: You can tune these hyperparameters by one randomly generated validation set, or you can also use the cross-validation method.**

**Note: you can use Colab for running the code with GPU for free**

**Hint: use the following two lines of code to generate the label predictions for test data:**
- raw_pred = model(torch.tensor(test_features).to(device).float())
- pred = np.argmax(raw_pred.to('cpu').detach().numpy(), axis=1)
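The search over the four hyperparameters listed above could be organized as a simple grid search, sketched here in a framework-agnostic way. Everything in this sketch is an assumption for illustration: the helper `grid_search`, the candidate values, and especially `dummy_evaluate`, a placeholder that stands in for "train a model with this setting and return its validation accuracy".

```python
from itertools import product

def grid_search(search_space, evaluate):
    """Try every combination in search_space and keep the best-scoring setting."""
    names = list(search_space)
    best_setting, best_score = None, float("-inf")
    for combo in product(*(search_space[name] for name in names)):
        setting = dict(zip(names, combo))
        score = evaluate(setting)  # e.g. validation accuracy of a model trained with this setting
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score

# Hypothetical candidate values for the four hyperparameters
search_space = {
    "hidden_dims": [(512,), (400,), (512, 256)],
    "activation": ["sigmoid", "tanh", "relu", "leaky_relu"],
    "weight_decay": [1e-4, 1e-3],
    "epochs": [10, 20],
}

def dummy_evaluate(setting):
    # Placeholder score so the sketch runs; replace with real training + validation accuracy.
    return -abs(setting["epochs"] - 20) - setting["weight_decay"]

best_setting, best_score = grid_search(search_space, dummy_evaluate)
print(best_setting, best_score)
```

Because the grid grows multiplicatively with each added option, it helps to keep the candidate lists short or to tune one hyperparameter at a time when compute is limited.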


In [92]:
# Write your code here
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28,400 ),
            nn.ReLU(),
            nn.Linear(400, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device)
print(model)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-1)


def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")


epochs = 20
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test(vali_dataloader, model, loss_fn)
print("Done!")

torch.save(model.state_dict(), "model.pth")
print("Saved PyTorch Model State to model.pth")

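# Reload the saved weights onto the same device as the input tensor
model = NeuralNetwork().to(device)
model.load_state_dict(torch.load("model.pth"))
model.eval()  # switch to inference mode before predicting

with torch.no_grad():
    raw_pred = model(torch.tensor(test_features).to(device).float())
pred = np.argmax(raw_pred.to('cpu').detach().numpy(), axis=1)
print(pred.shape)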

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=400, bias=True)
    (1): ReLU()
    (2): Linear(in_features=400, out_features=10, bias=True)
  )
)
Epoch 1
-------------------------------
loss: 26.416094  [    0/48000]
loss: 0.575880  [ 6400/48000]
loss: 0.426897  [12800/48000]
loss: 0.377060  [19200/48000]
loss: 0.442325  [25600/48000]
loss: 0.487567  [32000/48000]
loss: 0.440072  [38400/48000]
loss: 0.033856  [44800/48000]
Test Error: 
 Accuracy: 94.0%, Avg loss: 0.223578 

Epoch 2
-------------------------------
loss: 0.207247  [    0/48000]
loss: 0.210413  [ 6400/48000]
loss: 0.102590  [12800/48000]
loss: 0.253380  [19200/48000]
loss: 0.249664  [25600/48000]
loss: 0.280120  [32000/48000]
loss: 0.254665  [38400/48000]
loss: 0.021858  [44800/48000]
Test Error: 
 Accuracy: 95.2%, Avg loss: 0.166095 

Epoch 3
-------------------------------
loss: 0.105959  [    0/48000]
loss: 0.126334  [ 6400/4

In [94]:
# Write one integer prediction per line; the with-block closes the file automatically
with open('hw2_Miner2.txt', 'w') as outFile:
    for result in pred:
        outFile.write(str(int(result)) + '\n')

### Question: What is your final hyperparameter setting? How do you tune them? What choices have you tried?

#### Write your answer here
- the number of layers and the dimension of each layer: three layers [784 -> 400 -> 10]

    nn.Linear(28*28, 400),

    nn.ReLU(),

    nn.Linear(400, 10)

- the activation function: relu

- weight decay: 1e-1

- number of training epochs: 20

### Question: your username in Miner2 and the score & ranking of your submission in Miner2 (at the time of answering this question)

#### Write your answer here
user: sudo_08

score: 98%

rank: 23