<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/academics/programs/index.html)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 8 Assignment: Feature Engineering**

**Student Name: Your Name**

# Assignment Instructions

This assignment is similar to assignment 5, except that you must use feature engineering to solve it.  I provide you with a dataset that contains dimensions and the quality of items of specific shapes.  With the values of 'height', 'width', 'depth'. 'shape', and 'quality' you should try to predict the cost of these items.  You should be able to match very close to solution file, if you feature engineer correctly.  To get full credit your average cost should not be more than 50 off from the solution.  The autocorrector will let you know if you are in this range.

You can find all of the needed CSV files here:

* [Shapes - Training](https://data.heatonresearch.com/data/t81-558/datasets/shapes-train.csv)
* [Shapes - Submit](https://data.heatonresearch.com/data/t81-558/datasets/shapes-test.csv)

Use the training file to train your neural network and submit results for for the data contained in the test/submit file.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Mounted at /content/drive
Note: using Google CoLab


# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems. 

**It is unlikely that should need to modify this function.**

In [1]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io

# This function submits an assignment.  You can submit an assignment as much as you like, only the final
# submission counts.  The paramaters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.  
# .             The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    payload = []
    for item in data:
        if type(item) is PIL.Image.Image:
            buffered = BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG':base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif type(item) is pd.core.frame.DataFrame:
            payload.append({'CSV':base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
    r= requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={ 'payload': payload,'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code==200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# Assignment #8 MyCode

In [24]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
key = "QGOMi9jY948rtuqknQ9Wb20gQ7BaRlg369Q6fiSX" 
file='E:\\WUSTL\\2024 SPRING\\INFO.558 Applications of Deep Neural Networks\\jheaton\\projects\\t81_558_deep_learning\\assignments\\assignment_ZihanLuo_class8.ipynb'

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [28]:
import copy


class EarlyStopping:
    def __init__(self, patience=50, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

In [29]:
import time
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import tqdm
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset


device = 'cuda' if torch.cuda.is_available() else 'cpu'

df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/shapes-train.csv")


df_dummies = pd.get_dummies(df['shape']).astype(int)
df = pd.concat([df, df_dummies], axis=1)
result = df['cost']
x_columns = df.columns.drop(['shape', 'cost', 'id'])
x = df[x_columns].values
y = result.values  

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

x_train = torch.tensor(x_train, device=device, dtype=torch.float32)
y_train = torch.tensor(y_train, device=device, dtype=torch.float32)
x_test = torch.tensor(x_test, device=device, dtype=torch.float32)
y_test = torch.tensor(y_test, device=device, dtype=torch.float32)

BATCH_SIZE = 16
dataset_train = TensorDataset(x_train, y_train)
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)
dataset_test = TensorDataset(x_test, y_test)
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=True)

model = nn.Sequential(
    nn.Linear(x_train.shape[1], 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 25),
    nn.ReLU(),
    nn.Linear(25, 1)
).to(device)

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

es = EarlyStopping()
epoch = 0
done = False
while epoch < 1000 and not done:
    epoch += 1
    steps = list(enumerate(dataloader_train))
    pbar = tqdm.tqdm(steps)
    model.train()
    model = model.to(device)
    for i, (x_batch, y_batch) in pbar:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        y_batch_pred = model(x_batch).flatten()
        loss = loss_fn(y_batch_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i == len(steps) - 1:
            model.eval()
            with torch.no_grad():
                pred = model(x_test).flatten()
                vloss = loss_fn(pred, y_test)
            if es(model, vloss):
                done = True
            pbar.set_description(f"Epoch: {epoch}, tloss: {loss.item()}, vloss: {vloss.item():.4f}")

model.eval()
with torch.no_grad():
    pred = model(x_test)
    score = torch.sqrt(torch.nn.functional.mse_loss(pred.flatten(), y_test))
print(f"Final score (RMSE): {score.item()}")

Epoch: 1, tloss: 127443.2109375, vloss: 331873.3750: 100%|██████████| 469/469 [00:00<00:00, 830.25it/s]
Epoch: 2, tloss: 59230.0859375, vloss: 92630.5234: 100%|██████████| 469/469 [00:00<00:00, 878.38it/s]
Epoch: 3, tloss: 69751.96875, vloss: 56818.5195: 100%|██████████| 469/469 [00:00<00:00, 818.47it/s]
Epoch: 4, tloss: 30921.62890625, vloss: 39249.3672: 100%|██████████| 469/469 [00:00<00:00, 843.14it/s]
Epoch: 5, tloss: 9221.54296875, vloss: 47240.8633: 100%|██████████| 469/469 [00:00<00:00, 808.51it/s]
Epoch: 6, tloss: 2397.622314453125, vloss: 29798.8887: 100%|██████████| 469/469 [00:00<00:00, 879.23it/s]
Epoch: 7, tloss: 101267.9921875, vloss: 32273.7695: 100%|██████████| 469/469 [00:00<00:00, 858.70it/s]
Epoch: 8, tloss: 10350.314453125, vloss: 46365.8828: 100%|██████████| 469/469 [00:00<00:00, 922.31it/s]
Epoch: 9, tloss: 18298.7578125, vloss: 34019.9023: 100%|██████████| 469/469 [00:00<00:00, 884.15it/s]
Epoch: 10, tloss: 23067.48046875, vloss: 25502.2871: 100%|██████████| 469/

Epoch: 157, tloss: 2054.4560546875, vloss: 2979.4941: 100%|██████████| 469/469 [00:00<00:00, 881.40it/s]
Epoch: 158, tloss: 17235.041015625, vloss: 4302.1318: 100%|██████████| 469/469 [00:00<00:00, 912.63it/s]
Epoch: 159, tloss: 843.2201538085938, vloss: 1771.9208: 100%|██████████| 469/469 [00:00<00:00, 867.32it/s]
Epoch: 160, tloss: 1258.90380859375, vloss: 3333.9963: 100%|██████████| 469/469 [00:00<00:00, 754.93it/s]
Epoch: 161, tloss: 2825.046875, vloss: 1725.5721: 100%|██████████| 469/469 [00:00<00:00, 725.33it/s]
Epoch: 162, tloss: 7628.6640625, vloss: 8862.7959: 100%|██████████| 469/469 [00:00<00:00, 752.39it/s]
Epoch: 163, tloss: 3066.30078125, vloss: 7582.2622: 100%|██████████| 469/469 [00:00<00:00, 690.59it/s]
Epoch: 164, tloss: 5633.6669921875, vloss: 8890.6211: 100%|██████████| 469/469 [00:00<00:00, 668.91it/s]
Epoch: 165, tloss: 2585.5341796875, vloss: 7284.1846: 100%|██████████| 469/469 [00:00<00:00, 708.74it/s]
Epoch: 166, tloss: 1864.259521484375, vloss: 2375.1606: 100%|

Epoch: 311, tloss: 691.433837890625, vloss: 894.6638: 100%|██████████| 469/469 [00:00<00:00, 702.69it/s]
Epoch: 312, tloss: 703.0885009765625, vloss: 936.3942: 100%|██████████| 469/469 [00:00<00:00, 706.42it/s]
Epoch: 313, tloss: 1618.0970458984375, vloss: 2386.3770: 100%|██████████| 469/469 [00:00<00:00, 861.12it/s]
Epoch: 314, tloss: 3728.2978515625, vloss: 3643.2615: 100%|██████████| 469/469 [00:00<00:00, 958.82it/s]
Epoch: 315, tloss: 1896.397705078125, vloss: 1519.8297: 100%|██████████| 469/469 [00:00<00:00, 764.86it/s]
Epoch: 316, tloss: 1069.211669921875, vloss: 3399.3799: 100%|██████████| 469/469 [00:00<00:00, 746.41it/s]
Epoch: 317, tloss: 15160.18359375, vloss: 1144.9973: 100%|██████████| 469/469 [00:00<00:00, 728.87it/s]
Epoch: 318, tloss: 718.1124267578125, vloss: 1767.7284: 100%|██████████| 469/469 [00:00<00:00, 737.17it/s]
Epoch: 319, tloss: 1910.9775390625, vloss: 1154.1372: 100%|██████████| 469/469 [00:00<00:00, 724.01it/s]
Epoch: 320, tloss: 42396.15625, vloss: 1590.94

Final score (RMSE): 25.59932518005371





In [19]:
from sklearn import metrics
import scipy as sp
import numpy as np
import math
from sklearn import metrics

import torch
import torch.nn.functional as F
import pandas as pd

def perturbation_rank(device, model, x, y, names, regression):
    model.to(device)
    model.eval() # set the model to evaluation mode

    #x = torch.tensor(x).float().to(device)
    #y = torch.tensor(y).float().to(device)
    errors = []
    for i in range(x.shape[1]):
        hold = x[:, i].clone()
        x[:, i] = torch.randperm(x.shape[0]).to(device)  # shuffling
        
        with torch.no_grad():
            pred = model(x)

        if regression:
            loss_fn = torch.nn.MSELoss()
            error = loss_fn(y, pred).item()
        else:
            # pred should be probabilities; apply softmax if not done in model's forward method
            if len(pred.shape) == 2 and pred.shape[1] > 1:
                pred = F.softmax(pred, dim=1)
                loss_fn = torch.nn.CrossEntropyLoss()
                error = loss_fn(pred, y.long()).item()
            else:
                loss_fn = nn.MSELoss()
                error = loss_fn(y, pred).item()
            
            
        errors.append(error)
        x[:, i] = hold
        
    max_error = max(errors)
    importance = [e/max_error for e in errors]

    data = {'name':names, 'error':errors, 'importance':importance}
    result = pd.DataFrame(data, columns=['name', 'error', 'importance'])
    result.sort_values(by=['importance'], ascending=[0], inplace=True)
    result.reset_index(inplace=True, drop=True)
    return result

In [30]:
from IPython.display import display, HTML

names = list(df.columns) 
names.remove('id')
names.remove('shape')
names.remove("cost")  
print(names)

rank = perturbation_rank(device, model, x_test, y_test, names, True)
display(rank)

['height', 'width', 'depth', 'quality', 'box', 'cylinder', 'ellipsoid']


  return F.mse_loss(input, target, reduction=self.reduction)


Unnamed: 0,name,error,importance
0,height,33596270.0,1.0
1,depth,28490310.0,0.84802
2,width,11794540.0,0.351067
3,quality,2735138.0,0.081412
4,ellipsoid,1364313.0,0.040609
5,box,961808.4,0.028628
6,cylinder,960546.7,0.028591


In [31]:
df_submit = pd.read_csv("https://data.heatonresearch.com/data/t81-558/datasets/shapes-test.csv")
id_col = df_submit['id']
df_dummies = pd.get_dummies(df_submit['shape']).astype(int)
df_submit = pd.concat([df_submit, df_dummies], axis=1)
x_columns_submit = df_submit.columns.drop(['shape', 'id'])
x_submit = df_submit[x_columns_submit].values
x_pred_submit = torch.tensor(x_submit, device=device, dtype=torch.float32)

pred_submit = model(x_pred_submit)
df_submit = pd.DataFrame(pred_submit.cpu().detach().numpy(), columns=['cost'])
df_submit['id'] = id_col
print(df_submit)

             cost     id
0       16.262226  10001
1      343.610382  10002
2       16.262226  10003
3     1008.516418  10004
4      284.315887  10005
...           ...    ...
1995   825.185364  11996
1996   291.880096  11997
1997   269.080292  11998
1998    16.262226  11999
1999    16.262226  12000

[2000 rows x 2 columns]


In [32]:
submit(source_file=file,data=[df_submit],key=key,no=8)

Success: Submitted Assignment 8 for luozihan:
You have submitted this assignment 2 times. (this is fine)
Note: The mean difference 2.6676492159999725 for column 'cost' is acceptable and is less than the maximum allowed value of '50.0' for this assignment.
