# Training and Evaluating a Model with PyTorch, FSSpec, and Remote CSV Data

This notebook demonstrates how to train a simple neural network in PyTorch on data read from remote CSV files over HTTPS using `fsspec` (here through the Pelican `osdf` protocol). The example builds data pipelines for the training and test datasets and evaluates the model's accuracy on the test set.

## Install Dependencies

```python
# pelicanfs is assumed to be the package that registers the 'osdf' protocol with fsspec
!pip install torch fsspec pandas numpy pelicanfs
```

## Import Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import fsspec
from torch.utils.data import Dataset, DataLoader
import numpy as np

## Define the Neural Network

Define a simple feedforward neural network for the example.

In [14]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 784 inputs: one per pixel of a flattened 28x28 image
        self.fc1 = nn.Linear(784, 50)
        # 10 outputs: one logit per Fashion-MNIST class
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
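
As a quick sanity check, the sketch below pushes a random batch through a network with the same two-layer layout and confirms that each sample yields one logit per class. The class name `TinyNet`, the batch size of 4, and the choice of 10 output classes (the number of Fashion-MNIST categories) are illustrative assumptions, not part of the notebook's pipeline.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical two-layer network mirroring the layout above."""
    def __init__(self, in_features=784, hidden=50, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()
dummy_batch = torch.randn(4, 784)   # 4 fake flattened 28x28 images
logits = net(dummy_batch)
print(logits.shape)                 # (batch size, number of classes)
```

If the second dimension does not match the number of distinct labels in the data, the loss function will reject some targets during training.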


## Define the Custom Dataset

Create a custom PyTorch `Dataset` that wraps a list of `(features, target)` pairs.

In [15]:
class CSVDataset(Dataset):
    def __init__(self, data):
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        return self.data[index]

## Define Functions to Read and Process Remote CSV Data

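The functions below use `fsspec` to read a remote CSV file into a pandas DataFrame and convert it into a list of `(features, target)` tensor pairs.

Note that this notebook does not use the fsspec handling functions built into `torchdata.datapipes`, because that package is being deprecated.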

In [16]:
def read_csv_from_url(file_url):
    # Create a filesystem object for the 'osdf' protocol (provided by PelicanFS)
    fs = fsspec.filesystem('osdf')
    # Open the remote file
    with fs.open(file_url, 'r') as f:
        # Read the file into a pandas DataFrame
        df = pd.read_csv(f, index_col=False)
    return df

def dataframe_to_dataset(df):
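    # Use a 'label' column as the target when one exists (a common Fashion-MNIST
    # CSV layout puts it first); otherwise assume the last column is the target.
    target_col = 'label' if 'label' in df.columns else df.columns[-1]
    features = df.drop(columns=[target_col]).values.astype(np.float32)
    targets = df[target_col].values.astype(np.int64)
    dataset = [(torch.tensor(feature), torch.tensor(target)) for feature, target in zip(features, targets)]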
    return dataset
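
To try the read-and-convert steps without network access, the sketch below swaps in fsspec's built-in in-memory filesystem for the remote one. The path `/demo/tiny.csv`, the column names, and the three-row table are made up for illustration, and the conversion assumes the last column is the target.

```python
import fsspec
import numpy as np
import pandas as pd
import torch

# Write a tiny CSV into fsspec's in-memory filesystem (stands in for a remote file)
fs = fsspec.filesystem('memory')
fs.pipe('/demo/tiny.csv', b"f1,f2,target\n0.0,1.0,0\n2.0,3.0,1\n4.0,5.0,0\n")

# Read it back with the same open()/read_csv pattern used for the remote file
with fs.open('/demo/tiny.csv', 'r') as f:
    demo_df = pd.read_csv(f, index_col=False)

# Split into float32 features and int64 targets (last column assumed to be the target)
demo_features = demo_df.iloc[:, :-1].values.astype(np.float32)
demo_targets = demo_df.iloc[:, -1].values.astype(np.int64)
demo_pairs = [(torch.tensor(x), torch.tensor(y)) for x, y in zip(demo_features, demo_targets)]

print(len(demo_pairs), demo_pairs[0][0].dtype, demo_pairs[0][1].dtype)
```

The `float32` and `int64` dtypes matter here: `nn.Linear` expects floating-point inputs, and `nn.CrossEntropyLoss` expects integer class indices as targets.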



## Prepare the Data

Fetch the data remotely from Pelican using `fsspec` with the `osdf` protocol. (Note that the `osdf` protocol is a specific version of PelicanFS with the discovery URL already set.)

In [17]:
# Define remote file URLs
train_csv_url = '/chtc/PUBLIC/hzhao292/fashion-mnist_train.csv'
test_csv_url = '/chtc/PUBLIC/hzhao292/fashion-mnist_test.csv'

# Read and convert data
train_df = read_csv_from_url(train_csv_url)
test_df = read_csv_from_url(test_csv_url)
train_data = dataframe_to_dataset(train_df)
test_data = dataframe_to_dataset(test_df)

# Create DataLoaders
train_dataset = CSVDataset(train_data)
test_dataset = CSVDataset(test_data)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
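
The sketch below shows what a `DataLoader` yields when it wraps a list of `(features, target)` pairs: the default collation stacks the pairs into one feature tensor and one target tensor per batch. The toy data (10 samples with 4 features each) and the name `PairDataset` are hypothetical.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Hypothetical dataset wrapping a list of (features, target) pairs."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        return self.pairs[index]

# 10 toy samples: 4 float features each, integer target in {0, 1, 2}
toy_pairs = [(torch.full((4,), float(i)), torch.tensor(i % 3)) for i in range(10)]
toy_loader = DataLoader(PairDataset(toy_pairs), batch_size=4, shuffle=False)

batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in toy_loader]
print(batch_shapes)  # three batches; the last holds the 2 leftover samples
```

Because the last batch can be smaller than `batch_size`, loss and accuracy totals are best weighted by the actual batch size, as the training and evaluation cells in this notebook do.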


## Train the Model

Train our example model using the data from Pelican.

In [18]:
# Instantiate model, loss function, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
epochs = 5
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * batch_X.size(0)
    
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}')


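If this cell fails with an error such as `IndexError: Target 8 is out of bounds.`, a target value lies outside the range of valid class indices (0 up to the number of model outputs minus 1). Check that the target column really holds the class label and that the output size of `fc2` covers every class.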
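
An `IndexError` like the one above can be reproduced in isolation. In this sketch, which assumes CPU tensors, the loss receives logits for 5 classes but a target of 8; the sizes are arbitrary and chosen only to trigger the failure.

```python
import torch
import torch.nn as nn

demo_criterion = nn.CrossEntropyLoss()
demo_logits = torch.zeros(1, 5)       # one sample, 5 classes: valid targets are 0..4
bad_target = torch.tensor([8])        # 8 is not a valid class index here

try:
    demo_criterion(demo_logits, bad_target)
    error_message = None
except IndexError as err:
    error_message = str(err)

print(error_message)
```

Keeping every target below the model's output size avoids the error.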

## Evaluate the Model

Evaluate the accuracy of the model on the test set.

In [ ]:
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch_X, batch_y in test_loader:
        outputs = model(batch_X)
        _, predicted = torch.max(outputs, 1)
        total += batch_y.size(0)
        correct += (predicted == batch_y).sum().item()

accuracy = correct / total
print(f'Accuracy on test data: {accuracy:.4f}')
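
To make the accuracy computation concrete, the sketch below applies the same `torch.max` pattern to three hand-written predictions. The logits and targets are invented for illustration.

```python
import torch

# Hand-written logits for 3 samples and 2 classes
toy_outputs = torch.tensor([[0.1, 0.9],
                            [0.8, 0.2],
                            [0.3, 0.7]])
toy_targets = torch.tensor([1, 0, 0])

_, toy_predicted = torch.max(toy_outputs, 1)   # index of the largest logit in each row
toy_correct = (toy_predicted == toy_targets).sum().item()
toy_accuracy = toy_correct / toy_targets.size(0)
print(toy_predicted.tolist(), toy_correct, f'{toy_accuracy:.4f}')
```

Two of the three predictions match their targets, so the accuracy is 2/3.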