# Getting started with astir

## 0. Load necessary libraries

In [13]:
!pip install -e ../../..
from astir.data_readers import from_csv_yaml
import pandas as pd


## 1. Load data

We start by reading expression data from a CSV file and marker gene information from a YAML file:

In [14]:
expression_mat_path = "../../../astir/tests/test-data/sce.csv"
yaml_marker_path = "../../../astir/tests/test-data/jackson-2020-markers.yml"

.. note:: 
    Expression data should already be cleaned and normalized, for example by log transformation and winsorization.
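
As a rough illustration of the preprocessing mentioned in the note, the sketch below log-transforms a table of raw intensities and then winsorizes each protein at its 99th percentile. The function name `preprocess`, the quantile cutoffs, and the toy data are assumptions made for this example; they are not part of astir, and the right transformation depends on your data.

```python
import numpy as np
import pandas as pd


def preprocess(raw: pd.DataFrame, lower_q: float = 0.0, upper_q: float = 0.99) -> pd.DataFrame:
    """Log-transform, then winsorize each feature (column) at the given quantiles."""
    logged = np.log1p(raw)  # log(1 + x), so zero intensities stay finite
    lower = logged.quantile(lower_q)  # per-column lower cutoff
    upper = logged.quantile(upper_q)  # per-column upper cutoff
    # Clamp values outside [lower, upper] to the cutoffs, column by column.
    return logged.clip(lower=lower, upper=upper, axis=1)


# Hypothetical raw intensities: rows are cells, columns are proteins.
raw = pd.DataFrame(
    {"EGFR": [0.0, 1.0, 2.0, 3.0, 500.0], "CD45": [0.5, 0.4, 0.6, 0.5, 0.7]},
    index=[f"cell_{i}" for i in range(5)],
)
cleaned = preprocess(raw)
```

The result can be written out with `cleaned.to_csv(...)` and used as the expression CSV in the steps that follow.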

We can view both the expression data and marker data:

In [15]:
!head -n 20 ../../../astir/tests/test-data/jackson-2020-markers.yml


cell_states:
  RTK_signalling:
    - Her2
    - EGFR
  proliferation:
    - Ki-67
    - phospho Histone
  mTOR_signalling:
    - phospho mTOR
    - phospho S6
  apoptosis:
    - cleaved PARP
    - Cleaved Caspase3

cell_types:
  Stromal:
    - Vimentin
    - Fibronectin
  B cells:


In [16]:
pd.read_csv(expression_mat_path, index_col=0)[['EGFR','E-Cadherin', 'CD45', 'Cytokeratin 5']].head()

| | EGFR | E-Cadherin | CD45 | Cytokeratin 5 |
|---|---|---|---|---|
| BaselTMA_SP41_186_X5Y4_3679 | 0.346787 | 0.938354 | 0.22773 | 0.095283 |
| BaselTMA_SP41_153_X7Y5_246 | 0.833752 | 1.364884 | 0.068526 | 0.124031 |
| BaselTMA_SP41_20_X12Y5_197 | 0.110006 | 0.177361 | 0.301222 | 0.05275 |
| BaselTMA_SP41_14_X1Y8_84 | 0.282666 | 1.122174 | 0.606941 | 0.093352 |
| BaselTMA_SP41_166_X15Y4_266 | 0.209066 | 0.402554 | 0.588273 | 0.064545 |
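
Before loading, it can be useful to confirm that every marker named in the YAML file has a matching column in the expression data. The sketch below does this with a plain dictionary that mirrors an abridged version of the YAML layout shown above. The helper `missing_markers` and the `features` list are hypothetical and only for illustration; this is an optional sanity check, not something the tutorial says astir requires.

```python
# Marker dictionary mirroring the YAML layout shown above (abridged).
markers = {
    "cell_states": {
        "RTK_signalling": ["Her2", "EGFR"],
        "proliferation": ["Ki-67", "phospho Histone"],
    },
    "cell_types": {
        "Stromal": ["Vimentin", "Fibronectin"],
    },
}

# Hypothetical feature names, e.g. the CSV header without the index column.
features = ["EGFR", "E-Cadherin", "CD45", "Cytokeratin 5", "Her2", "Vimentin"]


def missing_markers(markers: dict, features: list) -> dict:
    """Return, for each class, the markers that have no matching feature column."""
    available = set(features)
    missing = {}
    for section in ("cell_types", "cell_states"):
        for name, genes in markers.get(section, {}).items():
            absent = [g for g in genes if g not in available]
            if absent:
                missing[f"{section}/{name}"] = absent
    return missing


report = missing_markers(markers, features)
```

An empty `report` means every listed marker was found; otherwise the keys name the affected cell types or states.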


Then we can create an astir object using the `from_csv_yaml` function. For more data loading options, see the data loading tutorial.

In [17]:
ast = from_csv_yaml(expression_mat_path, marker_yaml=yaml_marker_path)
print(ast)

Astir object with 6 cell types, 4 cell states, and 100 cells.


## 2. Fitting cell types

To fit cell types, simply call

In [18]:
ast.fit_type()

---------- Astir Training 1/5 ----------
training astir: 100%|██████████| 10/10 [00:00<00:00, 52.58epochs/s]
Done!
---------- Astir Training 2/5 ----------
training astir: 100%|██████████| 10/10 [00:00<00:00, 57.28epochs/s]
Done!
---------- Astir Training 3/5 ----------
training astir: 100%|██████████| 10/10 [00:00<00:00, 66.08epochs/s]
Done!
---------- Astir Training 4/5 ----------
training astir: 100%|██████████| 10/10 [00:00<00:00, 63.83epochs/s]
Done!
---------- Astir Training 5/5 ----------
training astir: 100%|██████████| 10/10 [00:00<00:00, 66.13epochs/s]
Done!

.. note:: 
    **Controlling inference**
    Several options control inference in the `fit_type` function, including
    `max_epochs` (maximum number of epochs to train),
    `learning_rate` (Adam optimizer learning rate),
    `batch_size` (minibatch size),
    `delta_loss` (stops iteration once the change in loss falls below this value), and
    `n_inits` (number of restarts using random initializations).
    For full details, see the function documentation.
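
To make the roles of `max_epochs`, `delta_loss`, and `n_inits` concrete, here is a toy sketch of how such options typically interact. It is not astir's implementation: it minimizes a simple quadratic with plain gradient descent (no Adam, no minibatches), and it assumes that the restart with the lowest final loss is the one kept. The functions `train_once` and `fit` are hypothetical names used only here.

```python
import random


def train_once(seed, max_epochs=50, learning_rate=0.1, delta_loss=1e-3):
    """Toy gradient descent on f(x) = (x - 3)^2 from a random start."""
    rng = random.Random(seed)
    x = rng.uniform(-10.0, 10.0)  # random initialization
    prev_loss = (x - 3.0) ** 2
    loss = prev_loss
    epochs_run = 0
    for _ in range(max_epochs):  # never train longer than max_epochs
        x -= learning_rate * 2.0 * (x - 3.0)  # gradient step
        loss = (x - 3.0) ** 2
        epochs_run += 1
        if abs(prev_loss - loss) < delta_loss:  # loss barely changed: stop early
            break
        prev_loss = loss
    return loss, epochs_run


def fit(n_inits=5, **kwargs):
    """Run several random restarts and keep the one with the lowest final loss."""
    runs = [train_once(seed, **kwargs) for seed in range(n_inits)]
    return min(runs, key=lambda run: run[0])


best_loss, epochs_used = fit(n_inits=5, max_epochs=50, delta_loss=1e-3)
```

In this toy setting a larger `delta_loss` stops training sooner at the cost of a looser fit, while more restarts reduce the chance of keeping a poor initialization.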

We can then get cell type assignment probabilities by calling

In [None]:
assignments = ast.get_celltype_probabilities()
assignments

where each row corresponds to a cell and each column to a cell type, and each entry is the probability that the cell belongs to that cell type.

To fetch an array corresponding to the most likely cell type assignments, call

In [None]:
# TODO
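
One possible way to obtain the most likely cell type per cell is to take the column with the highest probability in each row. The sketch below assumes `assignments` is a pandas DataFrame with cells as rows and cell types as columns, as described above; it uses a hypothetical toy table in place of real output. The helper `most_likely_type`, the 0.5 threshold, and the "Unknown" label are illustrative choices, not astir's API.

```python
import pandas as pd

# Hypothetical probabilities: rows are cells, columns are cell types.
assignments = pd.DataFrame(
    {
        "Stromal": [0.90, 0.10, 0.40],
        "B cells": [0.05, 0.85, 0.35],
        "Other": [0.05, 0.05, 0.25],
    },
    index=["cell_1", "cell_2", "cell_3"],
)


def most_likely_type(probs: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Pick the highest-probability column per row; label low-confidence cells 'Unknown'."""
    labels = probs.idxmax(axis=1)  # column name of the row-wise maximum
    confident = probs.max(axis=1) >= threshold
    return labels.where(confident, "Unknown")


cell_types = most_likely_type(assignments)
```

Setting `threshold=0.0` gives a plain argmax with no "Unknown" label.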

## 3. Fitting cell state

As before, to fit cell states, call

In [25]:
ast.fit_state()

---------- Astir Training 1/5 ----------
training astir: 100%|██████████| 100/100 [00:00<00:00, 774.10epochs/s]
---------- Astir Training 2/5 ----------
training astir: 100%|██████████| 100/100 [00:00<00:00, 848.74epochs/s]
---------- Astir Training 3/5 ----------
training astir: 100%|██████████| 100/100 [00:00<00:00, 943.06epochs/s]
---------- Astir Training 4/5 ----------
training astir: 100%|██████████| 100/100 [00:00<00:00, 873.29epochs/s]
---------- Astir Training 5/5 ----------
training astir: 100%|██████████| 100/100 [00:00<00:00, 936.03epochs/s]

and cell state assignments can be inferred via

In [None]:
states = ast.get_cellstates()
states

## 4. Saving results

Both cell type and cell state information can easily be saved to disk via

In [None]:
ast.type_to_csv("cell-types.csv")
ast.state_to_csv("cell-states.csv")

In [None]:
!head -n 3 cell-types.csv

In [None]:
!head -n 3 cell-states.csv

where the first (unnamed) column always corresponds to the cell name/ID.
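
A saved file with this layout can be read back into pandas by treating the first column as the index. The sketch below uses an in-memory string with hypothetical contents so that it runs on its own; the cell IDs, column names, and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents with the layout described above:
# an unnamed first column of cell IDs, then one column per cell type.
csv_text = """,Stromal,B cells
cell_1,0.90,0.10
cell_2,0.20,0.80
"""

# index_col=0 turns the unnamed first column into the row index.
cell_types = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

With a real file, pass its path (for example `"cell-types.csv"`) in place of the `io.StringIO` object.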

## 5. Accessing internal functions and data

Data in `astir` objects is stored as an `SCDataSet`. The cell type dataset can be retrieved via

In [None]:
celltype_data = ast.get_type_dataset()
celltype_data

and similarly for cell state via `ast.get_state_dataset()`.

These datasets have several helper functions for retrieving information about their contents:

In [None]:
celltype_data.get_cells()[0:4] # cell names

In [None]:
celltype_data.get_classes() # cell type names

In [None]:
print(celltype_data.get_n_classes()) # number of cell types
print(celltype_data.get_n_features()) # number of features / proteins

In [None]:
celltype_data.get_exprs() # torch tensor of the expression data used