<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/index.html)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 3 Assignment: Simple Classification Neural Network**

**Student Name: Your Name**

# Assignment Instructions

For this assignment, you will use the **crx** dataset. You can find the CSV file on my data site, at this location: [crx](https://data.heatonresearch.com/data/t81-558/crx.csv). Load and summarize the data set.  You will submit this summarized dataset to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/app_deep_learning/blob/main/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

The RAW datafile looks something like the following:

|a1|a2|s3|a4|a5|a6|a7|a8|a9|a10|a11|a12|a13|a14|a15|a16|
|--|--|--|--|--|--|--|--|--|---|---|---|---|---|---|---|
|b|30.83|0|u|g|w|v|1.25|t|t|1|f|g|202|0|+|
|a|58.67|4.46|u|g|q|h|3.04|t|t|6|f|g|43|560|+|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|

For this assignment you must complete the following.

* Write a classification neural network that predicts the probability of either "+" or "-" for the column **a16**.
* Use early stopping to know when to complete your training.
* For all columns that are categorical, you must convert them to dummy variables.
* Some columns have missing values; fill these missing values with the median of that column.
* This is a simple neural network using basic techniques, so do not worry too much about overall accuracy.
* Predict/submit for the entire dataset that I gave you, both training and validation. You should have the same number of rows as crx.csv (690 data rows and 1 header row).

Your submission will look something like the following:

|+|-|
|-|-|
|0.09405358|0.90594643|
|0.33253232|0.66746765|
|0.098494485|0.90150553|
|...|...|

Common errors that you may run into include:

* **ValueError: could not convert string to float: ...** - Value errors typically mean you've not converted all of the categoricals to dummy variables.

* **tloss nan:** - NaNs usually mean you've not filled all missing values.
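Both errors can be caught before training starts. The following is a minimal sketch, not part of the starter code: `preflight` is a hypothetical helper name, and the toy frame only stands in for the real data.

```python
import numpy as np
import pandas as pd


def preflight(frame):
    """Return (non_numeric_columns, columns_with_missing_values)."""
    non_numeric = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    with_missing = [col for col in frame.columns if frame[col].isnull().any()]
    return non_numeric, with_missing


# Toy stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "a1": ["b", "a", "a"],        # still categorical: can trigger the ValueError
    "a2": [30.83, np.nan, 24.5],  # missing value: can lead to a NaN loss
    "a15": [0, 560, 824],         # clean numeric column
})

non_numeric, with_missing = preflight(toy)
print("Not yet numeric:", non_numeric)
print("Contain missing values:", with_missing)
```

Both lists should be empty for the feature columns before they are converted to tensors.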



# Google CoLab Instructions

If you are using Google CoLab, it will be necessary to mount your GDrive so that you can send your notebook during the submit process. Running the following code will map your GDrive to ```/content/drive```.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

# Early stopping (see module 3.4)
import copy

class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
import torch

device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")

# Assignment Submit Function

You will submit the 10 programming assignments electronically.  The following submit function can be used to do this.  My server will perform a basic check of each assignment and let you know if it sees any basic problems.

**It is unlikely that you will need to modify this function.**

In [None]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    payload = []
    for item in data:
        if type(item) is PIL.Image.Image:
            buffered = io.BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG':base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif type(item) is pd.core.frame.DataFrame:
            payload.append({'CSV':base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
    r= requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={ 'payload': payload,'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code==200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))
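To make the `'CSV'` payload entry less opaque, here is a small sketch of the same encoding step applied to a made-up two-row frame, followed by the reverse decoding. The decoding half is only an illustration of what a receiver could do; how the server actually processes the payload is not shown in this notebook.

```python
import base64
import io

import pandas as pd

# Made-up probabilities in the same two-column layout as the submission.
frame = pd.DataFrame({"+": [0.25, 0.75], "-": [0.75, 0.25]})

# Same expression that submit() uses for DataFrame items.
encoded = base64.b64encode(frame.to_csv(index=False).encode("ascii")).decode("ascii")

# Reverse the steps: base64 text -> CSV text -> DataFrame.
csv_text = base64.b64decode(encoded).decode("ascii")
decoded = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(decoded)
```

Because `index=False` is passed to `to_csv`, the row index is not transmitted; only the header row and the data rows are.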

# Assignment #3 Sample Code

The following code provides a starting point for this assignment.

In [None]:
import os
import pandas as pd
from scipy.stats import zscore
import numpy as np
import torch
import tqdm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset


# This is your student key that I emailed to you at the beginning of the semester.
key = "Gx5en9cEVvaZnjut6vfLm1HG4ZO4PsI32sgldAXj"  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
file='/content/drive/My Drive/Colab Notebooks/c_assignment_yourname_class3.ipynb'  # Google CoLab
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\assignments\\assignment_yourname_class3.ipynb'  # Windows
# file='/Users/jheaton/projects/t81_558_deep_learning/assignments/assignment_yourname_class3.ipynb'  # Mac/Linux

# Begin assignment

df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/crx.csv",na_values=['?'])

# Your code goes here.

# Submit it
submit(source_file=file,data=[df_submit],key=key,no=3)

# Assignment #3 MyCode

In [1]:
import os
import pandas as pd
from scipy.stats import zscore
import numpy as np
import torch
import tqdm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset


In [2]:
key = "QGOMi9jY948rtuqknQ9Wb20gQ7BaRlg369Q6fiSX" 
file='E:\\WUSTL\\2024 SPRING\\INFO.558 Applications of Deep Neural Networks\\jheaton\\projects\\t81_558_deep_learning\\assignments\\assignment_ZihanLuo_class3.ipynb'

df = pd.read_csv("https://data.heatonresearch.com/data/t81-558/crx.csv",na_values=['?'])

In [44]:
import base64
import os
import numpy as np
import pandas as pd
import requests
import PIL
import PIL.Image
import io

# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - List of pandas dataframes or images.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 10.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    payload = []
    for item in data:
        if type(item) is PIL.Image.Image:
            buffered = io.BytesIO()
            item.save(buffered, format="PNG")
            payload.append({'PNG':base64.b64encode(buffered.getvalue()).decode('ascii')})
        elif type(item) is pd.core.frame.DataFrame:
            payload.append({'CSV':base64.b64encode(item.to_csv(index=False).encode('ascii')).decode("ascii")})
    r= requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={ 'payload': payload,'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code==200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

In [46]:
submit(source_file=file,data=[df_submit],key=key,no=3)

Success: Submitted Assignment 3 for luozihan:
You have submitted this assignment 2 times. (this is fine)
Note: The mean difference 0.052630252566956526 for column '+' is acceptable and is less than the maximum allowed value of '0.5' for this assignment.
Note: The mean difference 0.05263024155047813 for column '-' is acceptable and is less than the maximum allowed value of '0.5' for this assignment.


In [3]:
df.head(10)

| |a1|a2|s3|a4|a5|a6|a7|a8|a9|a10|a11|a12|a13|a14|a15|a16|
|--|--|--|--|--|--|--|--|--|--|---|---|---|---|---|---|---|
|0|b|30.83|0.0|u|g|w|v|1.25|t|t|1|f|g|202.0|0|+|
|1|a|58.67|4.46|u|g|q|h|3.04|t|t|6|f|g|43.0|560|+|
|2|a|24.5|0.5|u|g|q|h|1.5|t|f|0|f|g|280.0|824|+|
|3|b|27.83|1.54|u|g|w|v|3.75|t|t|5|t|g|100.0|3|+|
|4|b|20.17|5.625|u|g|w|v|1.71|t|f|0|f|s|120.0|0|+|
|5|b|32.08|4.0|u|g|m|v|2.5|t|f|0|t|g|360.0|0|+|
|6|b|33.17|1.04|u|g|r|h|6.5|t|f|0|t|g|164.0|31285|+|
|7|a|22.92|11.585|u|g|cc|v|0.04|t|f|0|f|g|80.0|1349|+|
|8|b|54.42|0.5|y|p|k|h|3.96|t|f|0|f|g|180.0|314|+|
|9|b|42.5|4.915|y|p|w|v|3.165|t|f|0|t|g|52.0|1442|+|


In [4]:
print(len(df))


690


## Instructions

### For all columns that are categorical, you must convert them to dummy variables.

In [5]:
dummies = pd.get_dummies(df["a1"], prefix="a1")
df = pd.concat([df, dummies], axis=1)

In [6]:
df.drop("a1", axis=1, inplace=True)

In [7]:
df.head()

| |a2|s3|a4|a5|a6|a7|a8|a9|a10|a11|a12|a13|a14|a15|a16|a1_a|a1_b|
|--|--|--|--|--|--|--|--|--|---|---|---|---|---|---|---|----|----|
|0|30.83|0.0|u|g|w|v|1.25|t|t|1|f|g|202.0|0|+|0|1|
|1|58.67|4.46|u|g|q|h|3.04|t|t|6|f|g|43.0|560|+|1|0|
|2|24.5|0.5|u|g|q|h|1.5|t|f|0|f|g|280.0|824|+|1|0|
|3|27.83|1.54|u|g|w|v|3.75|t|t|5|t|g|100.0|3|+|0|1|
|4|20.17|5.625|u|g|w|v|1.71|t|f|0|f|s|120.0|0|+|0|1|


In [8]:
dummies = pd.get_dummies(df["a4"], prefix="a4")
df = pd.concat([df, dummies], axis=1)
df.drop("a4", axis=1, inplace=True)

In [9]:
dummies = pd.get_dummies(df["a5"], prefix="a5")
df = pd.concat([df, dummies], axis=1)
df.drop("a5", axis=1, inplace=True)

In [10]:
dummies = pd.get_dummies(df["a6"], prefix="a6")
df = pd.concat([df, dummies], axis=1)
df.drop("a6", axis=1, inplace=True)

In [11]:
dummies = pd.get_dummies(df["a7"], prefix="a7")
df = pd.concat([df, dummies], axis=1)
df.drop("a7", axis=1, inplace=True)

In [12]:
dummies = pd.get_dummies(df["a9"], prefix="a9")
df = pd.concat([df, dummies], axis=1)
df.drop("a9", axis=1, inplace=True)

In [13]:
dummies = pd.get_dummies(df["a10"], prefix="a10")
df = pd.concat([df, dummies], axis=1)
df.drop("a10", axis=1, inplace=True)

In [14]:
dummies = pd.get_dummies(df["a12"], prefix="a12")
df = pd.concat([df, dummies], axis=1)
df.drop("a12", axis=1, inplace=True)

In [15]:
dummies = pd.get_dummies(df["a13"], prefix="a13")
df = pd.concat([df, dummies], axis=1)
df.drop("a13", axis=1, inplace=True)

In [16]:
df.head(10)

| |a2|s3|a8|a11|a14|a15|a16|a1_a|a1_b|a4_l|...|a7_z|a9_f|a9_t|a10_f|a10_t|a12_f|a12_t|a13_g|a13_p|a13_s|
|--|--|--|--|---|---|---|---|----|----|----|---|----|----|----|-----|-----|-----|-----|-----|-----|-----|
|0|30.83|0.0|1.25|1|202.0|0|+|0|1|0|...|0|0|1|0|1|1|0|1|0|0|
|1|58.67|4.46|3.04|6|43.0|560|+|1|0|0|...|0|0|1|0|1|1|0|1|0|0|
|2|24.5|0.5|1.5|0|280.0|824|+|1|0|0|...|0|0|1|1|0|1|0|1|0|0|
|3|27.83|1.54|3.75|5|100.0|3|+|0|1|0|...|0|0|1|0|1|0|1|1|0|0|
|4|20.17|5.625|1.71|0|120.0|0|+|0|1|0|...|0|0|1|1|0|1|0|0|0|1|
|5|32.08|4.0|2.5|0|360.0|0|+|0|1|0|...|0|0|1|1|0|0|1|1|0|0|
|6|33.17|1.04|6.5|0|164.0|31285|+|0|1|0|...|0|0|1|1|0|0|1|1|0|0|
|7|22.92|11.585|0.04|0|80.0|1349|+|1|0|0|...|0|0|1|1|0|1|0|1|0|0|
|8|54.42|0.5|3.96|0|180.0|314|+|0|1|0|...|0|0|1|1|0|1|0|1|0|0|
|9|42.5|4.915|3.165|0|52.0|1442|+|0|1|0|...|0|0|1|1|0|0|1|1|0|0|


In [17]:
print(len(df))


690
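The nine near-identical cells above (one per categorical column) can be collapsed into a single call. This is a sketch on a toy frame, not the code that produced the outputs in this notebook; the `columns=` and `dtype=` arguments of `pd.get_dummies` do the concat-and-drop work in one step, and `dtype=int` is used so the dummies appear as 0/1.

```python
import pandas as pd

# Toy frame with two categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "a1": ["b", "a", None],
    "a2": [30.83, 58.67, 24.50],
    "a4": ["u", "y", "u"],
    "a16": ["+", "+", "-"],
})

# For the real data this list would be the columns encoded above:
# a1, a4, a5, a6, a7, a9, a10, a12, a13.
categorical = ["a1", "a4"]

# Each listed column is replaced by its prefixed dummy columns.
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

print(list(encoded.columns))
print(encoded)
```

Note that a missing category (the `None` in `a1`) produces a row with 0 in every dummy column for that feature, so missing categorical values do not need a separate fill step with this encoding.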


### Some columns have missing values, fill these missing values with the median of that column.

In [18]:
print(f"a2 has na? {pd.isnull(df['a2']).values.any()}")

a2 has na? True


In [19]:
print(f"s3 has na? {pd.isnull(df['s3']).values.any()}")

s3 has na? False


In [20]:
print(f"a8 has na? {pd.isnull(df['a8']).values.any()}")

a8 has na? False


In [21]:
print(f"a11 has na? {pd.isnull(df['a11']).values.any()}")

a11 has na? False


In [22]:
print(f"a14 has na? {pd.isnull(df['a14']).values.any()}")

a14 has na? True


In [23]:
print(f"a15 has na? {pd.isnull(df['a15']).values.any()}")

a15 has na? False


In [24]:
# Of the numeric columns checked above, only a2 and a14 have missing values.

In [25]:
med = df["a2"].median()
df["a2"] = df["a2"].fillna(med)

In [26]:
print(f"a2 has na? {pd.isnull(df['a2']).values.any()}")

a2 has na? False


In [27]:
med_2= df["a14"].median()
df["a14"] = df["a14"].fillna(med_2)

In [28]:
print(f"a14 has na? {pd.isnull(df['a14']).values.any()}")

a14 has na? False
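The column-by-column checks and fills above can also be written for all numeric columns at once. A minimal sketch on a toy frame (the values are made up; only the column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a2": [30.83, np.nan, 24.50, 27.83],
    "a14": [202.0, 43.0, np.nan, 100.0],
    "a15": [0, 560, 824, 3],
})

# Count missing values per column in one call.
missing_before = toy.isnull().sum()
print(missing_before)

# Fill every numeric column with its own median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy)
```

`DataFrame.fillna` accepts a Series indexed by column name, so each column receives its own median rather than a single global value.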


In [29]:
x_columns = df.columns
print(list(x_columns))

['a2', 's3', 'a8', 'a11', 'a14', 'a15', 'a16', 'a1_a', 'a1_b', 'a4_l', 'a4_u', 'a4_y', 'a5_g', 'a5_gg', 'a5_p', 'a6_aa', 'a6_c', 'a6_cc', 'a6_d', 'a6_e', 'a6_ff', 'a6_i', 'a6_j', 'a6_k', 'a6_m', 'a6_q', 'a6_r', 'a6_w', 'a6_x', 'a7_bb', 'a7_dd', 'a7_ff', 'a7_h', 'a7_j', 'a7_n', 'a7_o', 'a7_v', 'a7_z', 'a9_f', 'a9_t', 'a10_f', 'a10_t', 'a12_f', 'a12_t', 'a13_g', 'a13_p', 'a13_s']


### Generate X and Y for a Classification Neural Network

In [30]:
import copy


class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False

In [31]:
import time

import numpy as np
import pandas as pd
import torch
import tqdm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from torch import nn
from torch.autograd import Variable
from torch.utils.data import DataLoader, TensorDataset


np.random.seed(42)
torch.manual_seed(42)


def load_data():
    x_columns = df.columns.drop("a16")
    x = df[x_columns].values
    # Encode the "+"/"-" labels as integer class indices.
    le = LabelEncoder()
    y = le.fit_transform(df["a16"])
    a16 = le.classes_

    # Split into validation and training sets
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.25, random_state=42
    )

    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)

    # Numpy to Torch Tensor
    x_train = torch.tensor(x_train, device=device, dtype=torch.float32)
    y_train = torch.tensor(y_train, device=device, dtype=torch.long)

    x_test = torch.tensor(x_test, device=device, dtype=torch.float32)
    y_test = torch.tensor(y_test, device=device, dtype=torch.long)

    return x_train, x_test, y_train, y_test, a16
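`load_data()` fits the scaler on the training rows only, which means any other rows passed to the model later should be scaled with those same training statistics. The sketch below illustrates the idea with plain NumPy standing in for `StandardScaler` (which also uses the population standard deviation); the data and the `scale` helper are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x_all = rng.normal(loc=5.0, scale=2.0, size=(8, 3))  # toy feature matrix
x_train = x_all[:6]                                   # toy "training split"

# Statistics come from the training rows only (np.std defaults to ddof=0).
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)


def scale(x):
    """Apply the training-set mean and standard deviation to any rows."""
    return (x - mean) / std


x_train_scaled = scale(x_train)
x_all_scaled = scale(x_all)  # same transformation, applied to every row

# Refitting on all rows gives different numbers for the same inputs.
x_all_refit = (x_all - x_all.mean(axis=0)) / x_all.std(axis=0)

print(np.abs(x_all_scaled - x_all_refit).max())
```

With scikit-learn, the equivalent is to keep the fitted `StandardScaler` object and call its `transform` method on new rows instead of calling `fit_transform` again.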


In [32]:
import torch

has_mps = torch.backends.mps.is_available()
device = "mps" if has_mps else "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [33]:
x_train, x_test, y_train, y_test, a16_1 = load_data()

In [34]:

BATCH_SIZE = 16

dataset_train = TensorDataset(x_train, y_train)
dataloader_train = DataLoader(
    dataset_train, batch_size=BATCH_SIZE, shuffle=True)

dataset_test = TensorDataset(x_test, y_test)
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=True)
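As a quick sanity check on these settings, the number of training steps per epoch can be worked out by hand. The arithmetic below assumes the validation share of the 25% split is rounded up to a whole row; under that assumption the result agrees with the `33/33` shown in the progress bars of the training log further down.

```python
import math

n_rows = 690       # rows in crx.csv
test_size = 0.25   # validation share passed to train_test_split
batch_size = 16    # BATCH_SIZE

n_val = math.ceil(test_size * n_rows)   # validation rows (assumed rounded up)
n_train = n_rows - n_val                # remaining training rows

# The DataLoader keeps the final, smaller batch by default.
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (batches_per_epoch - 1) * batch_size

print(n_val, n_train, batches_per_epoch, last_batch)
```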

In [35]:

model = nn.Sequential(
    nn.Linear(x_train.shape[1], 50),
    nn.ReLU(),
    nn.Linear(50, 25),
    nn.ReLU(),
    nn.Linear(25, len(a16_1)),
    nn.LogSoftmax(dim=1),
)

model = model.to(device)

loss_fn = nn.CrossEntropyLoss()  # cross entropy loss

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
es = EarlyStopping()

epoch = 0
done = False
while epoch < 1000 and not done:
    epoch += 1
    steps = list(enumerate(dataloader_train))
    pbar = tqdm.tqdm(steps)
    model.train()
    for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch.to(device))
        loss = loss_fn(y_batch_pred, y_batch.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), (i + 1) * len(x_batch)
        if i == len(steps) - 1:
            model.eval()
            with torch.no_grad():  # validation needs no gradient tracking
                pred = model(x_test)
                vloss = loss_fn(pred, y_test).item()
            if es(model, vloss):
                done = True
            pbar.set_description(
                f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, {es.status}"
            )
        else:
            pbar.set_description(f"Epoch: {epoch}, tloss {loss:}")

Epoch: 1, tloss: 0.1560591459274292, vloss: 0.397172, : 100%|██████████| 33/33 [00:00<00:00, 227.41it/s]
Epoch: 2, tloss: 0.6952951550483704, vloss: 0.450589, No improvement in the last 1 epochs: 100%|██████████| 33/33 [00:00<00:00, 572.27it/s]
Epoch: 3, tloss: 0.44539839029312134, vloss: 0.586766, No improvement in the last 2 epochs: 100%|██████████| 33/33 [00:00<00:00, 571.40it/s]
Epoch: 4, tloss: 0.22077450156211853, vloss: 0.542776, No improvement in the last 3 epochs: 100%|██████████| 33/33 [00:00<00:00, 523.56it/s]
Epoch: 5, tloss: 0.0043371026404201984, vloss: 0.652776, No improvement in the last 4 epochs: 100%|██████████| 33/33 [00:00<00:00, 526.48it/s]
Epoch: 6, tloss: 0.03307044506072998, vloss: 0.796580, Early stopping triggered after 5 epochs.: 100%|██████████| 33/33 [00:00<00:00, 536.92it/s]
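The log above can be replayed against the patience logic to see why training stopped at epoch 6. The function below is a stripped-down sketch of the comparison made in `EarlyStopping.__call__` (no model, no weight copying); `replay_early_stopping` is a hypothetical name used only for this illustration.

```python
def replay_early_stopping(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best_loss = None
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss is None or best_loss - loss >= min_delta:
            # First epoch, or an improvement of at least min_delta.
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses copied from the training log above.
logged_vloss = [0.397172, 0.450589, 0.586766, 0.542776, 0.652776, 0.796580]

stop_epoch, best_epoch = replay_early_stopping(logged_vloss)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

The validation loss never improved on its first value, so the counter reached the patience of 5 at epoch 6. Because `restore_best_weights=True`, the model left in memory after the loop carries the weights saved at epoch 1.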


In [36]:
print(len(df))


690


In [41]:
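# Note: a new scaler is fit on all 690 rows here; load_data() fit its scaler
# on the training split only and does not return it, so these scaled values
# differ slightly from the ones the model saw during training.
x = df.drop('a16', axis=1).values
y = df['a16'].values

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_tensor = torch.tensor(x_scaled, dtype=torch.float32).to(device)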


model.eval()  

with torch.no_grad():  
    log_predictions = model(x_tensor)
    predictions = torch.exp(log_predictions)  

predictions_np = predictions.cpu().numpy()

df_submit = pd.DataFrame(predictions_np, columns=a16_1) 
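The `torch.exp` call is needed because the model ends in `nn.LogSoftmax`, so its raw outputs are log-probabilities rather than probabilities. The sketch below reproduces that conversion in NumPy on made-up logits and checks two properties: each row of the resulting table sums to 1, and applying log-softmax a second time changes nothing. The second property is why feeding log-softmax outputs to `nn.CrossEntropyLoss`, which applies log-softmax itself, still gives the intended loss.

```python
import numpy as np
import pandas as pd


def log_softmax(z):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


# Made-up raw scores for three rows and two classes.
logits = np.array([[2.0, 0.5], [-1.0, 1.5], [0.0, 0.0]])

log_probs = log_softmax(logits)   # what a LogSoftmax output layer produces
probs = np.exp(log_probs)         # back to probabilities

toy_submit = pd.DataFrame(probs, columns=["+", "-"])
print(toy_submit)

print(np.allclose(toy_submit.sum(axis=1), 1.0))          # rows sum to 1
print(np.allclose(log_softmax(log_probs), log_probs))    # second pass is a no-op
```

The column order `["+", "-"]` matches `le.classes_`, since `LabelEncoder` sorts its classes and `"+"` sorts before `"-"`.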


In [42]:
df_submit.head(2000)

| |+|-|
|--|-|-|
|0|0.797516|0.202484|
|1|0.929376|0.070624|
|2|0.753851|0.246150|
|3|0.826342|0.173658|
|4|0.576493|0.423507|
|...|...|...|
|685|0.204347|0.795653|
|686|0.042353|0.957647|
|687|0.015911|0.984089|
|688|0.013368|0.986632|
