# CAR-T Sequence Detection in Single-Cell RNA Sequencing Data

This notebook provides a workflow for analyzing single-cell RNA sequencing (scRNA-seq) data to detect Chimeric Antigen Receptor (CAR) sequence insertions using the 10x Genomics Cell Ranger pipeline.

## Overview

CAR-T cell therapy involves genetically modifying T cells to express chimeric antigen receptors (CARs) that target specific antigens on cancer cells. Detecting these CAR sequences in scRNA-seq data is critical for understanding CAR-T cell behavior, persistence, and efficacy.

This workflow includes:
1. Setting up Cell Ranger and dependencies
2. Preparing a custom reference that includes the CAR sequence
3. Running Cell Ranger for alignment and analysis
4. Post-processing to identify and visualize cells containing CAR sequences
5. Advanced analysis of CAR-positive cells


## 1. Environment Setup

First, let's set up the necessary Python packages for our analysis.

In [None]:
# Install required packages (leidenalg and igraph are needed by sc.tl.leiden later on)
!pip install scanpy pandas numpy matplotlib seaborn anndata scikit-learn scipy leidenalg igraph

# Import libraries
import os
import subprocess
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set plotting defaults
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=100, facecolor='white')
%matplotlib inline

## 2. Cell Ranger Setup

Cell Ranger is a set of analysis pipelines from 10x Genomics that processes Chromium single-cell RNA-seq output to align reads, generate feature-barcode matrices, and perform secondary analysis.

### 2.1 Download and Install Cell Ranger

You'll need to download Cell Ranger from the 10x Genomics website: https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest

After downloading and unpacking the archive, point the notebook at the installation and confirm that it runs:

In [None]:
# Define the Cell Ranger path - replace with your actual path
CELLRANGER_PATH = "/path/to/cellranger-7.2.0"

# Check if Cell Ranger is properly installed
!{CELLRANGER_PATH}/cellranger --help

### 2.2 Set up Data Directories

Organize your directories for the analysis:

In [None]:
# Set up directories
BASE_DIR = Path("./car_t_analysis")
FASTQ_DIR = BASE_DIR / "fastq"
REFERENCE_DIR = BASE_DIR / "reference"
OUTPUT_DIR = BASE_DIR / "results"

# Create directories if they don't exist
for dir_path in [BASE_DIR, FASTQ_DIR, REFERENCE_DIR, OUTPUT_DIR]:
    dir_path.mkdir(exist_ok=True, parents=True)

print(f"Base directory: {BASE_DIR}")
print(f"FASTQ directory: {FASTQ_DIR}")
print(f"Reference directory: {REFERENCE_DIR}")
print(f"Output directory: {OUTPUT_DIR}")

## 3. Preparing a Custom Reference with CAR Sequence

To detect CAR transcripts, we build a custom reference that includes the CAR construct. The construct is appended to the genome FASTA as an extra contig, and a matching exon record is added to the GTF annotation so that reads aligning to it are counted as a gene.

In [None]:
# Define the path to the reference genome and GTF file
REF_GENOME = "path/to/reference/genome.fa"  # e.g., GRCh38 for human
REF_GTF = "path/to/reference/genes.gtf"     # Corresponding GTF annotation file

# Define the CAR sequence - replace with your specific CAR sequence
CAR_SEQUENCE = """
>CAR_CD19
ATGGCCTTACCAGTGACCGCCTTGCTCCTGCCGCTGGCCTTGCTGCTCCACGCCGCCAGGCCGGGATCCCAGGTGCAGCTGCAGGAG...
"""

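# Split the FASTA text into header and sequence, ignoring blank lines
car_lines = [line.strip() for line in CAR_SEQUENCE.strip().splitlines() if line.strip()]
car_name = car_lines[0].lstrip('>')
car_seq = ''.join(car_lines[1:])  # sequence only; the header must not count toward the length

# Create a CAR sequence FASTA file (no leading blank line, trailing newline included)
car_fasta_path = REFERENCE_DIR / "car_sequence.fa"
with open(car_fasta_path, 'w') as f:
    f.write(f">{car_name}\n{car_seq}\n")

# Create a CAR GTF annotation: one exon spanning the whole construct (1-based, inclusive)
car_gtf_path = REFERENCE_DIR / "car_genes.gtf"
car_gtf_content = (
    f'{car_name}\tunknown\texon\t1\t{len(car_seq)}\t.\t+\t.\t'
    f'gene_id "{car_name}"; transcript_id "{car_name}"; gene_name "{car_name}";\n'
)

with open(car_gtf_path, 'w') as f:
    f.write(car_gtf_content)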

print(f"CAR sequence saved to: {car_fasta_path}")
print(f"CAR GTF annotation saved to: {car_gtf_path}")
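
The GTF end coordinate must equal the length of the CAR sequence itself, not of the FASTA header. The following is a minimal sketch of that arithmetic on a toy record. The helper names `parse_single_fasta` and `make_exon_gtf_line` and the `CAR_TOY` sequence are hypothetical and only serve the illustration.

```python
def parse_single_fasta(text):
    """Return (record_name, sequence) from a one-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    name = lines[0][1:].split()[0]   # contig name = first token after '>'
    sequence = "".join(lines[1:])    # join wrapped sequence lines
    return name, sequence


def make_exon_gtf_line(contig, length, gene_id, source="unknown"):
    """Build one 9-column GTF exon record spanning the whole contig (1-based, inclusive)."""
    attributes = (
        f'gene_id "{gene_id}"; transcript_id "{gene_id}"; gene_name "{gene_id}";'
    )
    fields = [contig, source, "exon", "1", str(length), ".", "+", ".", attributes]
    return "\t".join(fields) + "\n"


# Toy 12-base sequence wrapped over two lines, used only to illustrate the arithmetic
toy_fasta = """
>CAR_TOY
ATGGCC
TTACCA
"""
name, seq = parse_single_fasta(toy_fasta)
gtf_line = make_exon_gtf_line(name, len(seq), gene_id=name)
print(name, len(seq))
print(gtf_line, end="")
```

The contig name in the first GTF column has to match the FASTA header exactly, which is why both are derived from the same parsed name here.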

### 3.1 Merge CAR Sequence with Reference Genome

Next, we'll merge the CAR sequence with the reference genome and GTF files.

In [None]:
# Combine the reference genome with CAR sequence
combined_fasta = REFERENCE_DIR / "combined_reference.fa"
combined_gtf = REFERENCE_DIR / "combined_reference.gtf"

# Cat commands to combine files
!cat {REF_GENOME} {car_fasta_path} > {combined_fasta}
!cat {REF_GTF} {car_gtf_path} > {combined_gtf}

print(f"Combined reference genome saved to: {combined_fasta}")
print(f"Combined GTF annotation saved to: {combined_gtf}")
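
One thing to watch with plain concatenation: if the first file does not end with a newline, its last line is fused with the first line of the next file. A possible Python alternative is sketched below. The function name `concat_files` and the toy file contents are assumptions for illustration, and the files are streamed in chunks so that a full genome FASTA is not loaded into memory.

```python
import tempfile
from pathlib import Path


def concat_files(inputs, output, chunk_size=1 << 20):
    """Concatenate files, making sure each one ends with a newline before the next starts."""
    with open(output, "wb") as out:
        for path in inputs:
            last_byte = b"\n"  # an empty input file adds nothing
            with open(path, "rb") as src:
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")  # avoid fusing the last line with the next header


# Toy demonstration: the first file deliberately lacks a trailing newline
with tempfile.TemporaryDirectory() as tmp:
    genome = Path(tmp) / "genome.fa"
    car = Path(tmp) / "car.fa"
    merged = Path(tmp) / "merged.fa"
    genome.write_text(">chrToy\nACGT")
    car.write_text(">CAR_TOY\nATGGCC\n")
    concat_files([genome, car], merged)
    merged_text = merged.read_text()

print(merged_text, end="")
```

In the notebook this could be called as `concat_files([REF_GENOME, car_fasta_path], combined_fasta)`, and likewise for the GTF files.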

### 3.2 Build Custom Reference with Cell Ranger

Now we'll use Cell Ranger's `mkref` command to create a custom reference package.

In [None]:
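# Build custom reference with Cell Ranger
# mkref writes its output to a folder named after --genome in the current working
# directory, so run it from REFERENCE_DIR and pass absolute input paths
custom_reference_dir = REFERENCE_DIR.resolve() / "car_t_reference"

!cd {REFERENCE_DIR.resolve()} && {CELLRANGER_PATH}/cellranger mkref \
    --genome=car_t_reference \
    --fasta={combined_fasta.resolve()} \
    --genes={combined_gtf.resolve()} \
    --nthreads=8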

print(f"Custom reference built at: {custom_reference_dir}")
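
As an alternative to the `!` shell escape, the same call can be assembled as an argument list and passed to `subprocess.run`, which avoids shell quoting problems when paths contain spaces. This is only a sketch: `build_mkref_command` is a hypothetical helper, it uses just the `mkref` options already shown in this notebook, and it builds the command without executing it.

```python
import shlex
from pathlib import Path


def build_mkref_command(cellranger_dir, genome_name, fasta, gtf, nthreads=8):
    """Return the cellranger mkref call as a list of arguments (nothing is executed)."""
    executable = str(Path(cellranger_dir) / "cellranger")
    return [
        executable,
        "mkref",
        f"--genome={genome_name}",
        f"--fasta={fasta}",
        f"--genes={gtf}",
        f"--nthreads={nthreads}",
    ]


cmd = build_mkref_command(
    "/path/to/cellranger-7.2.0",      # placeholder path, as in the notebook
    "car_t_reference",
    "combined_reference.fa",
    "combined_reference.gtf",
)
print(shlex.join(cmd))
```

To actually run it from the notebook, something like `subprocess.run(cmd, cwd=REFERENCE_DIR, check=True)` would raise an error if Cell Ranger exits with a non-zero status.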

## 4. Running Cell Ranger Count

Now we'll run Cell Ranger's `count` command to process the FASTQ files and align them to our custom reference.

In [None]:
# Define sample name and FASTQ path
SAMPLE_NAME = "car_t_sample"  # Replace with your sample name
FASTQ_PATH = str(FASTQ_DIR)   # Path to directory containing FASTQ files

# Run Cell Ranger count
!{CELLRANGER_PATH}/cellranger count \
    --id={SAMPLE_NAME} \
    --transcriptome={custom_reference_dir} \
    --fastqs={FASTQ_PATH} \
    --sample={SAMPLE_NAME} \
    --expect-cells=5000 \
    --localcores=8 \
    --localmem=64

# Copy output to results directory
!cp -r {SAMPLE_NAME} {OUTPUT_DIR}/

print(f"Cell Ranger count completed for sample: {SAMPLE_NAME}")
print(f"Results saved to: {OUTPUT_DIR}/{SAMPLE_NAME}")
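
Before loading the results, it can help to confirm that the run actually produced the filtered matrix. The sketch below checks for the three files that make up a Cell Ranger feature-barcode matrix folder. `missing_matrix_files` is a hypothetical helper, and the demonstration uses an empty temporary folder in place of a real run.

```python
import tempfile
from pathlib import Path

# Files that make up a Cell Ranger (v3 and later) feature-barcode matrix folder
EXPECTED_MATRIX_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_matrix_files(run_dir):
    """Return the expected matrix files that are absent from <run_dir>/outs/filtered_feature_bc_matrix."""
    matrix_dir = Path(run_dir) / "outs" / "filtered_feature_bc_matrix"
    return [name for name in EXPECTED_MATRIX_FILES if not (matrix_dir / name).is_file()]


# Toy demonstration: create a mock run folder that lacks matrix.mtx.gz
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "car_t_sample"
    matrix_dir = run_dir / "outs" / "filtered_feature_bc_matrix"
    matrix_dir.mkdir(parents=True)
    for name in ("barcodes.tsv.gz", "features.tsv.gz"):
        (matrix_dir / name).touch()
    missing = missing_matrix_files(run_dir)

print("Missing files:", missing)
```

In the notebook, `missing_matrix_files(OUTPUT_DIR / SAMPLE_NAME)` returning an empty list would indicate that the copy step succeeded.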

## 5. Analysis of Cell Ranger Results

Now we'll analyze the Cell Ranger results to identify cells with CAR sequences.

In [None]:
# Load the Cell Ranger output
results_dir = OUTPUT_DIR / SAMPLE_NAME / "outs" / "filtered_feature_bc_matrix"
adata = sc.read_10x_mtx(results_dir, var_names='gene_symbols', cache=True)

# Compute basic QC metrics (no cells or genes are filtered at this point)
sc.pp.calculate_qc_metrics(adata, inplace=True)
print(f"Loaded data with {adata.shape[0]} cells and {adata.shape[1]} genes")

# View the AnnData object
adata

### 5.1 Identify Cells with CAR Expression

In [None]:
# Check if CAR gene is in the dataset
if 'CAR_CD19' in adata.var_names:
    print("CAR_CD19 found in the dataset")
    
    # Identify cells expressing CAR
    car_counts = adata[:, 'CAR_CD19'].X.toarray().flatten()
    adata.obs['car_counts'] = car_counts
    adata.obs['car_positive'] = adata.obs['car_counts'] > 0
    
    # Count of CAR-positive cells
    car_positive_count = adata.obs['car_positive'].sum()
    car_positive_percent = (car_positive_count / adata.shape[0]) * 100
    
    print(f"CAR-positive cells: {car_positive_count} ({car_positive_percent:.2f}%)")
else:
    print("CAR_CD19 not found in the dataset. Check your reference and sequence names.")
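
The `> 0` cutoff treats a single UMI as evidence of CAR expression. Whether that is appropriate depends on the dataset, since ambient RNA can contribute stray counts. One way to gauge sensitivity to the cutoff is to tabulate the CAR-positive fraction at several minimum UMI thresholds. The sketch below does this on made-up counts. `car_positive_summary` and the toy values are illustrative assumptions, not part of the workflow.

```python
def car_positive_summary(counts, thresholds=(1, 2, 3)):
    """Map each minimum-UMI threshold to (number of positive cells, percent of all cells)."""
    counts = list(counts)
    n_cells = len(counts)
    summary = {}
    for threshold in thresholds:
        n_positive = sum(1 for c in counts if c >= threshold)
        percent = 100.0 * n_positive / n_cells if n_cells else 0.0
        summary[threshold] = (n_positive, percent)
    return summary


# Made-up CAR UMI counts for ten cells
toy_counts = [0, 0, 1, 0, 3, 2, 0, 1, 0, 5]
summary = car_positive_summary(toy_counts)
for threshold, (n_positive, percent) in summary.items():
    print(f"min UMI >= {threshold}: {n_positive} cells ({percent:.1f}%)")
```

With real data, `car_positive_summary(adata.obs['car_counts'])` gives the same table for the loaded sample. A large drop between thresholds 1 and 2 suggests that many calls rest on a single UMI.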

### 5.2 Visualize CAR Expression

In [None]:
# Process the data for visualization
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.pca(adata, n_comps=50, use_highly_variable=True)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

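# Plot UMAP with CAR expression; pass explicit axes so both panels land in one figure
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

if 'CAR_CD19' in adata.var_names:
    # Plot a categorical copy of the flag, since boolean obs columns are not
    # handled consistently across scanpy versions
    adata.obs['car_status'] = adata.obs['car_positive'].astype(str).astype('category')
    sc.pl.umap(adata, color='car_status', title='CAR-positive cells', ax=axes[0], show=False)
    sc.pl.umap(adata, color='CAR_CD19', title='CAR expression level', ax=axes[1], show=False)
else:
    for ax in axes:
        ax.set_title('CAR gene not found')

plt.tight_layout()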
plt.show()
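
A UMAP gives a visual impression of where CAR-positive cells sit. A per-cluster fraction makes the same comparison numeric. The following is a small sketch with made-up cluster labels. `car_fraction_by_cluster` is a hypothetical helper that works on any two equal-length sequences.

```python
from collections import Counter


def car_fraction_by_cluster(clusters, car_flags):
    """Return {cluster label: fraction of cells in that cluster flagged CAR-positive}."""
    totals = Counter()
    positives = Counter()
    for cluster, flag in zip(clusters, car_flags):
        totals[cluster] += 1
        if flag:
            positives[cluster] += 1
    return {cluster: positives[cluster] / totals[cluster] for cluster in sorted(totals)}


# Made-up Leiden labels and CAR flags for six cells
toy_clusters = ["0", "0", "1", "1", "1", "2"]
toy_flags = [True, False, True, True, False, False]
fractions = car_fraction_by_cluster(toy_clusters, toy_flags)
for cluster, fraction in fractions.items():
    print(f"cluster {cluster}: {fraction:.2f} CAR-positive")
```

On the notebook's data the inputs would be `adata.obs['leiden']` and `adata.obs['car_positive']`. The pandas expression `adata.obs.groupby('leiden')['car_positive'].mean()` computes the same quantity.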

### 5.3 Analyze Expression Patterns in CAR-Positive vs. CAR-Negative Cells

In [None]:
# Perform differential expression analysis between CAR+ and CAR- cells
if 'car_positive' in adata.obs.columns and adata.obs['car_positive'].nunique() == 2:
    # rank_genes_groups needs a categorical grouping column, so convert the boolean flag
    # into the string categories 'False' and 'True'
    adata.obs['car_status'] = adata.obs['car_positive'].astype(str).astype('category')
    sc.tl.rank_genes_groups(adata, groupby='car_status', method='wilcoxon')
    sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, figsize=(12, 5))
    
    # Extract results to a DataFrame
    result = adata.uns['rank_genes_groups']
    groups = result['names'].dtype.names
    
    # Get differential expression results for CAR-positive group
    if 'True' in groups:
        de_genes = pd.DataFrame({
            'names': result['names']['True'],
            'scores': result['scores']['True'],
            'pvals': result['pvals']['True'],
            'pvals_adj': result['pvals_adj']['True'],
            'logfoldchanges': result['logfoldchanges']['True']
        })
        
        print("Top differentially expressed genes in CAR-positive cells:")
        display(de_genes.head(10))
else:
    print("CAR expression data not available, or only one of the two groups is present.")

### 5.4 Cell Type Annotation and CAR Expression

In [None]:
# Define marker genes for T cell subsets
t_cell_markers = {
    'CD4+ T': ['CD4', 'IL7R'],
    'CD8+ T': ['CD8A', 'CD8B'],
    'Regulatory T': ['FOXP3', 'IL2RA'],
    'Naive T': ['CCR7', 'LEF1', 'TCF7'],
    'Memory T': ['IL7R', 'S100A4'],
    'Effector T': ['GZMA', 'GZMB', 'PRF1'],
    'Exhausted T': ['PDCD1', 'HAVCR2', 'LAG3', 'TIGIT']
}

# Check for marker genes in the dataset
for cell_type, markers in t_cell_markers.items():
    present_markers = [m for m in markers if m in adata.var_names]
    if present_markers:
        print(f"{cell_type}: {len(present_markers)}/{len(markers)} markers found")
    else:
        print(f"{cell_type}: No markers found")

# Score cells for T cell subsets
for cell_type, markers in t_cell_markers.items():
    markers_in_data = [m for m in markers if m in adata.var_names]
    if markers_in_data:
        sc.tl.score_genes(adata, markers_in_data, score_name=f'{cell_type}_score')

# Visualize T cell subset scores and CAR expression
score_cols = [k for k in adata.obs.columns if k.endswith('_score')]
if score_cols:
    sc.pl.umap(adata, color=score_cols, ncols=3, cmap='viridis')
    
    # Compare CAR expression across different T cell subsets
    if 'car_positive' in adata.obs.columns:
        for score in score_cols:
            plt.figure(figsize=(10, 6))
            sns.boxplot(x='car_positive', y=score, data=adata.obs)
            plt.title(f'{score} by CAR expression')
            plt.xlabel('CAR positive')
            plt.ylabel('Score')
            plt.show()
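
For intuition about the scores plotted above: `sc.tl.score_genes` reports the average expression of the marker set minus the average expression of a reference gene set. The sketch below shows a deliberately simplified version of that idea, using all non-marker genes as the reference instead of scanpy's expression-matched random sample. `simple_marker_scores` and the toy matrix are assumptions for illustration only.

```python
def simple_marker_scores(expression, gene_names, markers):
    """Per-cell score: mean expression of marker genes minus mean of all other genes.

    expression is a list of per-cell lists, with columns ordered as in gene_names.
    """
    marker_set = set(markers)
    marker_idx = [i for i, gene in enumerate(gene_names) if gene in marker_set]
    other_idx = [i for i, gene in enumerate(gene_names) if gene not in marker_set]
    if not marker_idx or not other_idx:
        raise ValueError("need at least one marker gene and one reference gene")
    scores = []
    for cell in expression:
        marker_mean = sum(cell[i] for i in marker_idx) / len(marker_idx)
        reference_mean = sum(cell[i] for i in other_idx) / len(other_idx)
        scores.append(marker_mean - reference_mean)
    return scores


# Toy log-normalized expression for two cells and four genes
genes = ["GZMB", "PRF1", "CCR7", "ACTB"]
toy_expression = [
    [2.0, 4.0, 0.0, 1.0],  # effector-like cell
    [0.0, 0.0, 3.0, 1.0],  # naive-like cell
]
scores = simple_marker_scores(toy_expression, genes, ["GZMB", "PRF1"])
print(scores)
```

A positive score means the markers are expressed above the reference level in that cell, and a negative score means they are below it. The values themselves are relative and not comparable across different marker sets.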

### 5.5 Save Results

In [None]:
# Save the analyzed data
results_file = OUTPUT_DIR / f"{SAMPLE_NAME}_analyzed.h5ad"
adata.write(results_file)

# Export CAR-positive cells to CSV
if 'car_positive' in adata.obs.columns:
    car_positive_cells = adata.obs[adata.obs['car_positive']].copy()
    car_positive_file = OUTPUT_DIR / f"{SAMPLE_NAME}_car_positive_cells.csv"
    car_positive_cells.to_csv(car_positive_file)
    print(f"CAR-positive cell information saved to: {car_positive_file}")

print(f"Analysis complete. Results saved to: {results_file}")

## 6. Summary and Next Steps

In this notebook, we have:
1. Set up a custom reference genome with the CAR sequence
2. Processed single-cell RNA-seq data using Cell Ranger
3. Identified cells expressing the CAR construct
4. Analyzed the transcriptional profiles of CAR-positive vs. CAR-negative cells
5. Examined T cell subset distributions in relation to CAR expression

### Next Steps:

1. **Deeper Phenotyping**: Further analyze the transcriptional profiles of CAR-T cells to understand their functional state (e.g., activation, exhaustion)
2. **Clonotype Analysis**: If TCR sequencing data is available, link CAR expression with specific T cell clonotypes
3. **Trajectory Analysis**: Perform pseudotime analysis to understand the developmental trajectories of CAR-T cells
4. **Integration with Other Samples**: Compare CAR-T transcriptional profiles across different patients or timepoints
5. **Advanced Visualizations**: Create publication-quality figures of CAR-T cell characteristics

### Conclusion

This workflow provides a foundation for detecting and analyzing CAR-expressing cells in single-cell RNA sequencing data. By adding the CAR sequence to the reference, we can identify cells with detectable CAR transcripts and analyze their transcriptional profiles. Because scRNA-seq captures only a fraction of each cell's transcripts, a cell without CAR reads is not necessarily untransduced.