# Demo: PBMC68k dataset

This demo shows how to train scClassifier2 on the PBMC68k dataset and evaluate the trained model on held-out test data.

The PBMC68k dataset can be downloaded from the 10x Genomics website.

## Train scClassifier2

The PBMC68k dataset is divided into two subsets: a training subset and a test subset. Run scClassifier2 on the training subset to obtain the classification model.

The command to run the training process is as follows.

``` shell
python scClassifier2.py --sup-data-file /DATA2/Project/MICE_evaluation/data/PBMC68k/pbmc68k_dataset_roc_train.mtx \
                        --sup-label-file /DATA2/Project/MICE_evaluation/data/PBMC68k/pbmc68k_dataset_label_text_roc_train.txt \
                        -lr 0.0001 \
                        -n 30 \
                        -bs 100 \
                        --aux-loss \
                        --cross-validation-fold 10 \
                        --cuda \
                        --runtime \
                        --jit \
                        --best-valid-model pbmc68k_model.pth                 
```
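
The demo does not say how the training and test subsets were produced. As a rough sketch only, one common approach is a stratified split that keeps each cell type in proportion in both subsets. The function name `stratified_split`, the 20% test fraction, and the toy labels below are illustrative assumptions, not part of scClassifier2.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each label proportionally."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for label in np.unique(labels):
        # Indices of all cells with this label, in random order.
        idx = np.flatnonzero(labels == label)
        rng.shuffle(idx)
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(train_idx), np.sort(test_idx)

# Toy labels standing in for the cell-type annotations (hypothetical).
toy_labels = ["B"] * 10 + ["T"] * 20 + ["NK"] * 5
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
```

The returned index arrays can then be used to select the matching rows of the expression matrix and the label file before writing each subset to disk.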

## Test scClassifier2

```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20, 10)

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

from scClassifier2 import scClassifier2
from utils.scdata_cached import SingleCellCached

import torch
from torch.utils.data import DataLoader
```

## Load the model and test dataset

```python
ModelPath = 'pbmc68k_model.pth'
DataPath = '/DATA2/Project/MICE_evaluation/data/PBMC68k/pbmc68k_dataset_roc_test.mtx'
LabelPath = '/DATA2/Project/MICE_evaluation/data/PBMC68k/pbmc68k_dataset_label_text_roc_test.txt'
```

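```python
# Load the trained model (the whole model object was saved, not just its weights).
# Note: on PyTorch 2.6 and later, torch.load defaults to weights_only=True,
# so loading a fully pickled model requires passing weights_only=False.
model = torch.load(ModelPath)
```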

```python
# load data
batch_size = 100

# Whether to log-transform the data.
# Set this to True if raw count data is provided.
log_trans = False

# Whether to put the data on the GPU.
use_cuda = True

# Whether to randomly permute the data.
shuffle = False

label_encoder = model.label2class
data_cached = SingleCellCached(DataPath, LabelPath, label_encoder, 'roc',
                               log_trans=log_trans, use_cuda=use_cuda)
data_loader = DataLoader(data_cached, batch_size=batch_size, shuffle=shuffle)
```


## Run prediction on the test data

```python
# predict
predictions, scores, actuals = [], [], []

# use the appropriate data loader
for (xs, ys) in data_loader:
    # use classification function to compute all predictions for each batch
    yhats, yscores = model.classifier_with_probability(xs)
    scores.append(yscores)

    _, yhats = torch.topk(yhats, 1)
    yhats = model.convert_to_label(yhats)
    predictions.append(yhats)

    _, ys = torch.topk(ys, 1)
    ys = model.convert_to_label(ys)
    actuals.append(ys)

predictions = np.concatenate(predictions, 0)
scores = torch.cat(scores, dim=0).cpu().detach().numpy()
actuals = np.concatenate(actuals, 0)
```
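
The prediction loop reduces each row of class outputs to a single label: `torch.topk(..., 1)` picks the index of the largest entry in each row, and `model.convert_to_label` maps that index back to a text label. The same idea can be sketched in plain NumPy. The class names and scores below are made up, and the lookup array only stands in for `convert_to_label`, whose internals are not shown in this demo.

```python
import numpy as np

# Hypothetical class names and per-cell class scores (rows: cells, columns: classes).
class_names = np.array(["B cell", "NK cell", "T cell"])
toy_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Index of the highest-scoring class in each row (top-1 selection).
top1 = toy_scores.argmax(axis=1)

# Map the class indices back to text labels.
toy_predictions = class_names[top1]
```

Applying the same top-1 step to `ys` recovers the true labels, which presumably works because the data loader yields them as one-hot vectors.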

Collect the classification scores in a data frame with one column per class label, then add the true and predicted labels.

```python
scores = pd.DataFrame(scores)
scores.columns = model.convert_to_label(torch.from_numpy(np.arange(model.output_size)))

scores['True'] = actuals
scores['Pred'] = predictions
```

```python
scores
```

Write the scores to a file for later ROC analysis.

```python
scores.to_csv('pbmc68k_scc2_pred_scores.txt', index=False)
```
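
As a hedged illustration of the ROC analysis this file is meant for, the sketch below computes a one-vs-rest AUC per class with scikit-learn's `roc_auc_score`. It uses a small made-up table with the same layout as the saved file (one score column per class plus `True` and `Pred`). The class names and values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy stand-in for the saved table; values are made up for illustration.
toy = pd.DataFrame({
    "B cell": [0.9, 0.2, 0.7, 0.6],
    "T cell": [0.1, 0.8, 0.3, 0.4],
    "True":   ["B cell", "T cell", "T cell", "B cell"],
    "Pred":   ["B cell", "T cell", "B cell", "B cell"],
})

# Every column other than 'True' and 'Pred' holds the scores of one class.
class_names = [c for c in toy.columns if c not in ("True", "Pred")]

auc_per_class = {}
for name in class_names:
    # One-vs-rest: cells of this class are positives, all other cells are negatives.
    is_class = (toy["True"] == name).astype(int)
    auc_per_class[name] = roc_auc_score(is_class, toy[name])
```

With the real output, the toy table would be replaced by `pd.read_csv('pbmc68k_scc2_pred_scores.txt')`. `sklearn.metrics.roc_curve` takes the same two arguments and returns the points needed to plot each curve.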

## Performance assessment

```python
accuracy_score(actuals, predictions)
```

```python
ConfusionMatrixDisplay.from_predictions(actuals, predictions, xticks_rotation='vertical')
```
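
Overall accuracy can hide poor performance on small classes, so a per-class summary is often useful alongside the confusion matrix. The following is a minimal sketch of per-class recall derived from scikit-learn's `confusion_matrix`. The toy labels stand in for `actuals` and `predictions` and are not results from this demo.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels standing in for `actuals` and `predictions` (hypothetical).
toy_actuals = ["B", "B", "T", "T", "T", "NK"]
toy_predictions = ["B", "T", "T", "T", "T", "NK"]

# Fix the label order so rows and columns of the matrix are easy to interpret.
labels = sorted(set(toy_actuals))
cm = confusion_matrix(toy_actuals, toy_predictions, labels=labels)

# Recall per class: correctly predicted cells divided by the true cells of that class.
per_class_recall = dict(zip(labels, cm.diagonal() / cm.sum(axis=1)))
```

Rows of `cm` correspond to true labels and columns to predicted labels, so dividing the diagonal by the row sums gives the fraction of each cell type that was recovered.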