# Analysis of Sjögren’s Syndrome PBMCs

## Data source: GSE157278

### **Single-cell RNA Sequencing reveals the expansion of Cytotoxic CD4+ T lymphocytes and the heterogeneity of pathogenesis in primary Sjögren’s syndrome patients**

#### "By single cell RNA sequencing, our data revealed disease-specific immune cell subsets and provide some potential new targets of pSS, specific expansion of CD4+ CTLs may be involved in the pathogenesis of pSS, which might give a valuable insights for therapeutic interventions of pSS."

#### "We applied single cell RNA sequencing (scRNA-seq) to 57,288 peripheral blood mononuclear cells (PBMCs) from 5 patients with pSS and 5 healthy controls. The immune cell subsets and susceptibility genes involved in the pathogenesis of pSS were analyzed."

In [None]:
# Project root; no trailing slash, because the paths below add their own separator
PDIR = '/Users/aumchampaneri/VSCode Projects/complement-receptor-blockade'

## Convert raw data to AnnData

In [None]:
import os
import scanpy as sc

# Path to the folder containing your raw-data files
data_dir = f"{PDIR}/sjogrens-pbmc/raw-data"
print("Files in directory:", os.listdir(data_dir))

# Read 10x Genomics formatted data
adata = sc.read_10x_mtx(
    data_dir,             # Path to directory with matrix.mtx, barcodes.tsv, features.tsv
    var_names='gene_symbols',  # Use gene symbols for variable names
    cache=True,                 # Cache the result for faster future loading
)

Files in directory: ['cell_batch.tsv.gz', 'features.tsv.gz', 'barcodes.tsv.gz', 'matrix.mtx.gz']
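Before reading, it can help to confirm that the folder holds the three files named in the comment above. The sketch below is a minimal check, not part of scanpy: `missing_10x_files` and `REQUIRED_10X_FILES` are hypothetical names, and accepting either plain or gzip-compressed files is an assumption based on the directory listing.

```python
from pathlib import Path

# Hypothetical constant: the three files a 10x matrix folder is expected to hold
REQUIRED_10X_FILES = ("matrix.mtx", "barcodes.tsv", "features.tsv")


def missing_10x_files(data_dir):
    """Return the required file names found neither plain nor gzip-compressed."""
    folder = Path(data_dir)
    missing = []
    for name in REQUIRED_10X_FILES:
        plain = folder / name
        gzipped = folder / f"{name}.gz"
        if not plain.is_file() and not gzipped.is_file():
            missing.append(name)
    return missing
```

Calling `missing_10x_files(data_dir)` on the folder listed above should return an empty list; a non-empty result names the files to look for before calling `sc.read_10x_mtx`.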


## Data exploration

In [None]:
adata

AnnData object with n_obs × n_vars = 61405 × 33694
    var: 'gene_ids', 'feature_types'

In [None]:
adata.obs

AAACCTGAGACCTAGG-1
AAACCTGAGCCACCTG-1
AAACCTGAGTCATCCA-1
AAACCTGCAGATGAGC-1
AAACCTGCATCCGCGA-1
...
TTTGTCAGTCTAGTGT-10
TTTGTCAGTCTCTCTG-10
TTTGTCAGTGCACGAA-10
TTTGTCAGTGCAGGTA-10
TTTGTCAGTTACGCGC-10


In [None]:
adata.var

Unnamed: 0,gene_ids,feature_types
RP11-34P13.3,ENSG00000243485,Gene Expression
FAM138A,ENSG00000237613,Gene Expression
OR4F5,ENSG00000186092,Gene Expression
RP11-34P13.7,ENSG00000238009,Gene Expression
RP11-34P13.8,ENSG00000239945,Gene Expression
...,...,...
AC233755.2,ENSG00000277856,Gene Expression
AC233755.1,ENSG00000275063,Gene Expression
AC240274.1,ENSG00000271254,Gene Expression
AC213203.1,ENSG00000277475,Gene Expression


## Data modification

1. Merge 'cell_batch.tsv.gz' into .obs
2. Extract 'disease' and 'patient' from 'cell_batch'
3. Rearrange and rename .var columns

In [None]:
import pandas as pd

# Read the cell_batch file, using the first column (barcodes) as the index
cell_batch = pd.read_csv(f"{PDIR}/sjogrens-pbmc/raw-data/cell_batch.tsv.gz", sep='\t', header=0, index_col=0)

# Align and assign the batch/condition info to AnnData obs
adata.obs['cell_batch'] = adata.obs_names.map(cell_batch.iloc[:, 0])

# Preview the result
print(adata.obs[['cell_batch']].head())

                   cell_batch
AAACCTGAGACCTAGG-1       HC-1
AAACCTGAGCCACCTG-1       HC-1
AAACCTGAGTCATCCA-1       HC-1
AAACCTGCAGATGAGC-1       HC-1
AAACCTGCATCCGCGA-1       HC-1
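
The preview only shows the first five cells, so it does not reveal whether every barcode found a match. `Index.map` with a Series returns NaN for keys that are absent from the Series, so counting NaN values is one way to check coverage. The sketch below uses toy stand-ins (hypothetical barcodes and labels) rather than the real `adata`.

```python
import pandas as pd

# Toy stand-ins for adata.obs_names and the cell_batch table (hypothetical values)
obs_names = pd.Index(["AAAC-1", "AAAG-1", "TTTG-10"])
batch_table = pd.Series({"AAAC-1": "HC-1", "TTTG-10": "pSS-5"})

obs = pd.DataFrame(index=obs_names)
obs["cell_batch"] = obs_names.map(batch_table)

# Barcodes that are missing from the table come back as NaN
unmapped = obs.index[obs["cell_batch"].isna()]
print(f"{len(unmapped)} unmapped barcode(s): {list(unmapped)}")
```

On the real data the same check would be `adata.obs['cell_batch'].isna().sum()`, which should be 0 if the barcode file and the batch table describe the same cells.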


In [None]:
# Extract disease and patient id from the cell_batch column
disease_labels = adata.obs['cell_batch'].str.extract(r'^(pSS|HC)')[0]
patient_ids = adata.obs['cell_batch'].str.extract(r'-(\d+)$')[0]

# Assign disease column
adata.obs['disease'] = disease_labels

# Number each unique sample label (cell_batch) 1-10 in order of first appearance
batch_to_patient = {
    batch: number
    for number, batch in enumerate(adata.obs['cell_batch'].unique(), start=1)
}

# Map instead of merge: DataFrame.merge resets the index, which would drop the cell barcodes from obs
adata.obs['patient'] = adata.obs['cell_batch'].map(batch_to_patient)

# Preview the new columns
print(adata.obs[['cell_batch', 'disease', 'patient']].head())

                   cell_batch disease  patient
AAACCTGAGACCTAGG-1       HC-1      HC        1
AAACCTGAGCCACCTG-1       HC-1      HC        1
AAACCTGAGTCATCCA-1       HC-1      HC        1
AAACCTGCAGATGAGC-1       HC-1      HC        1
AAACCTGCATCCGCGA-1       HC-1      HC        1
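
The two regular expressions in the cell above silently yield NaN for any label that does not fit the expected shape. A stricter variant, sketched here with the standard library only, combines both patterns into one and raises on unexpected labels. `parse_batch_label` is a hypothetical helper and the labels are toy values, assuming every label looks like `HC-1` or `pSS-3`.

```python
import re
from collections import Counter

# The two patterns from the cell above, combined into one anchored expression
LABEL_PATTERN = re.compile(r"^(pSS|HC)-(\d+)$")


def parse_batch_label(label):
    """Hypothetical helper: split 'pSS-3' into ('pSS', 3); raise on anything else."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unexpected cell_batch label: {label!r}")
    return match.group(1), int(match.group(2))


# Toy labels (hypothetical), one per cell
labels = ["HC-1", "HC-1", "pSS-3", "pSS-3", "pSS-3"]
cells_per_sample = Counter(parse_batch_label(label) for label in labels)
print(cells_per_sample)
```

Counting cells per (disease, sample) pair this way gives a quick view of how evenly the samples contribute, which is worth knowing before any downstream comparison between pSS and HC.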


In [None]:
# Duplicate the .var index to a new column 'feature_names'
adata.var["feature_names"] = adata.var.index

# Rename gene_ids to ensembl_id
adata.var.rename(columns={"gene_ids": "ensembl_id"}, inplace=True)

# Replace the information in the index column with the ensembl_id values
adata.var.index = adata.var["ensembl_id"]

# Clear the index column name
adata.var.index.name = None
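
Because the Ensembl IDs now serve as the .var index, it is worth confirming that they are unique and non-missing. The sketch below repeats the index swap on a toy frame built from the first three genes shown earlier; the `ENSG` prefix check assumes a human reference, which is an assumption not stated in the notebook.

```python
import pandas as pd

# Toy stand-in for adata.var after the renaming step (three genes from the preview)
var = pd.DataFrame(
    {
        "ensembl_id": ["ENSG00000243485", "ENSG00000237613", "ENSG00000186092"],
        "feature_names": ["RP11-34P13.3", "FAM138A", "OR4F5"],
    }
)
var.index = var["ensembl_id"]
var.index.name = None

# Every gene should have a distinct, non-missing Ensembl ID as its index
assert var.index.is_unique
assert not var.index.isna().any()
assert var.index.str.startswith("ENSG").all()  # assumption: human gene IDs
```

The same three assertions can be run against `adata.var` directly; if any of them fails, the index swap needs a closer look before the object is saved.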

## TODO - Tokenization preparations

## Save processed AnnData object

In [None]:
import os
os.makedirs(f"{PDIR}/sjogrens-pbmc/input-data/", exist_ok=True)

In [None]:
# Save prepared AnnData object for Geneformer tokenization
adata.write_h5ad(f"{PDIR}/sjogrens-pbmc/input-data/sjogrens-pbmc_prepared.h5ad")