## Processing ATAC-seq Data into Base-Level Data
### Goal  
- The `goal` of this procedure is to transform raw single-nucleus ATAC-seq (snATAC-seq) data from `multiome` libraries into `base-level` chromatin accessibility data for downstream analysis. 
- The purpose is to work with finer-resolution data (at the base level) rather than relying on peak calling, which aggregates accessible chromatin into regions.
- Base-level data offers a more `detailed view` of chromatin accessibility at individual bases, which can be critical for fine-mapping and multiomic analyses, such as fitting `mfsusie` models.

### Data
- The `ROSMAP` single-nucleus ATAC-seq (`snATAC-seq`) data come from the `10x Genomics` platform, specifically from the `multiome` libraries.
- **Samples**: The dataset contains `> 200 samples`.
- **Path**: `/mnt/mfs/ctcn/datasets/rosmap/snmultiome/dlpfc/batch1`    
    - fragment files: `/mnt/mfs/ctcn/datasets/rosmap/snmultiome/dlpfc/batch1/cellRanger/library_id/outs/atac_fragments.tsv.gz`   
    - sample data: `/mnt/mfs/ctcn/datasets/rosmap/snmultiome/dlpfc/batch1/demuxlet/scripts/SampleSheet.csv`   
    - library QCed Seurat object: `/mnt/mfs/ctcn/datasets/rosmap/snmultiome/dlpfc/batch1/atac_qc/analysis/step1_chromatin_assay_QC/snMultiome_chromatin_assay_libraryidtoid.rds`    
    - cell type annotated Seurat object: `/mnt/mfs/ctcn/datasets/rosmap/snmultiome/dlpfc/batch1/atac_qc/analysis/step3_postQC/'cell_type'/cell_type.rds`   

### Overview of the Process:
`Procedure`: fragment QC (per library) -> cell type annotation (per library) -> split by cell type and sample (6 cell types × 240 samples) -> calculate coverage (6 × 240 BedGraph files) -> merge all sample BedGraph files of the same cell type into one (6 BedGraph files) -> QC and build the base-by-sample count matrix for each cell type

`Input Data`: The starting data consists of `fragment` files generated from snATAC-seq. Each fragment file contains records of accessible chromatin fragments with the following information:    
- Chromosome
- Start and end positions of the fragment
- Barcode identifying the cell/nucleus
- Read count information

Data Structure:   
```bash
## Fragment file format:
chr start end barcode reads
chr1 10073 10210 AGTAGGATCCCGTTAC-1 1
chr1 10073 10309 TCGCGCACAGCAAATA-1 1
chr1 10079 10261 ACATCAATCCGCCAAA-1 2
chr1 10084 10204 TCATTGTTCCGCCTAT-1 1
chr1 10085 10303 ATTGGCTAGCGCCTAA-1 1
chr1 10085 10340 GAACTTATCCCTGATC-1 1
```
Each row is one chromatin fragment from one cell, identified by its barcode. The start and end coordinates delimit the sequenced fragment, whose ends mark Tn5 insertion sites in accessible chromatin.
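
The format above can be read with a few lines of Python. This is a minimal sketch, not the pipeline's own code: the `Fragment` type and `parse_fragments` function are hypothetical names, and the sketch assumes BED-style coordinates (0-based start, exclusive end) and whitespace- or tab-separated columns.

```python
from typing import Iterable, Iterator, NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int    # assumed 0-based, inclusive
    end: int      # assumed exclusive
    barcode: str  # cell/nucleus barcode
    reads: int    # read count reported for the fragment


def parse_fragments(lines: Iterable[str]) -> Iterator[Fragment]:
    """Parse fragment records, skipping blank lines and '#' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, barcode, reads = line.split()[:5]
        yield Fragment(chrom, int(start), int(end), barcode, int(reads))


example = """\
chr1 10073 10210 AGTAGGATCCCGTTAC-1 1
chr1 10073 10309 TCGCGCACAGCAAATA-1 1
chr1 10079 10261 ACATCAATCCGCCAAA-1 2
"""
fragments = list(parse_fragments(example.splitlines()))
for frag in fragments:
    print(frag.barcode, frag.end - frag.start)  # barcode and fragment length
```

For a real `atac_fragments.tsv.gz`, the same function could be fed from `gzip.open(path, "rt")`.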

`Purpose of Coverage`:
- Coverage measures how many fragments (or reads) overlap a specific base across all cells. It helps identify regions of high or low chromatin accessibility at single-base resolution.
- Tools such as `bedtools` are typically used to calculate coverage from fragment files.

For example, calculating coverage will result in:

```bash
| Base Position    | Sample1 |
|------------------|---------|
| chr1:100000      | 1       |
| chr1:100001      | 0       |
| chr1:100002      | 3       |
```
This output shows how many fragments overlap each base, giving a base-by-base accessibility profile.
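
To make the definition concrete, the sketch below counts, for each base, how many fragments overlap it. It is an illustration on toy intervals chosen to reproduce the three counts in the table above, not the method used in the pipeline (which relies on `bedtools`); `per_base_coverage` is a hypothetical helper and half-open intervals are assumed.

```python
from collections import Counter


def per_base_coverage(fragments):
    """Count fragments overlapping each base.

    fragments: iterable of (chrom, start, end), 0-based half-open (assumed).
    Returns a Counter keyed by (chrom, position); missing keys read as 0.
    """
    cov = Counter()
    for chrom, start, end in fragments:
        for pos in range(start, end):
            cov[(chrom, pos)] += 1
    return cov


# Toy fragments (made up for illustration)
toy = [
    ("chr1", 100000, 100001),
    ("chr1", 100002, 100005),
    ("chr1", 100002, 100003),
    ("chr1", 100002, 100004),
]
cov = per_base_coverage(toy)
for pos in (100000, 100001, 100002):
    print(f"chr1:{pos}", cov[("chr1", pos)])
```

Looping over every base is only practical for small regions; genome-wide coverage is better computed on intervals.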

### Procedure for Processing ATAC-seq Data to Base-Level Data

This procedure processes 61 libraries (~4 samples per library). It involves several steps to transform raw fragment data into base-level chromatin accessibility data for each cell type. The steps are as follows:

**1) QC Fragment Files**:   

- Start with raw fragment files, which contain chromatin fragments with information such as chromosome positions, start and end points, and associated barcodes.   
- Apply quality control (QC) by keeping only the fragments whose barcodes passed cell-level QC (e.g., filters on ncell, nfeature) in the Seurat object metadata. This ensures only high-quality cells are retained for further analysis.  
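
The barcode matching in this step amounts to a set-membership filter on the fourth column. The sketch below assumes the QC-passing barcodes have already been exported from the Seurat object into a plain set; `keep_qc_fragments` and the example barcode set are hypothetical.

```python
from typing import Iterable, Iterator, Set


def keep_qc_fragments(lines: Iterable[str], qc_barcodes: Set[str]) -> Iterator[str]:
    """Yield fragment lines whose barcode (4th column) is in qc_barcodes."""
    for line in lines:
        if line.startswith("#"):  # skip any comment lines at the top of the file
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[3] in qc_barcodes:
            yield line


raw = [
    "chr1\t10073\t10210\tAGTAGGATCCCGTTAC-1\t1",
    "chr1\t10073\t10309\tTCGCGCACAGCAAATA-1\t1",
    "chr1\t10079\t10261\tACATCAATCCGCCAAA-1\t2",
]
# Hypothetical set of barcodes that passed QC
qc_barcodes = {"AGTAGGATCCCGTTAC-1", "ACATCAATCCGCCAAA-1"}

kept = list(keep_qc_fragments(raw, qc_barcodes))
print(len(kept), "of", len(raw), "fragments kept")
```

Because the function is a generator, it can stream a full fragment file without loading it into memory.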

**2) Subset by Cell Type**:

- Use an annotation file (another Seurat object) to map barcodes to specific cell types.   
- Subset the fragments per cell type, ensuring each fragment is correctly assigned to its corresponding cell type based on the barcodes.  

**3) Subset by Sample**:   

- Further subset the cell-type-specific fragments by sample ID.   
- Each sample within a cell type is written to its own file, giving one fragment file per cell type and sample.
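
Steps 2 and 3 can be viewed as a single grouping operation keyed by (cell type, sample). The sketch below assumes a barcode-to-group lookup built from the annotated Seurat object and the demultiplexing results; the lookup contents here are invented for illustration, and `split_by_group` is a hypothetical helper. Since barcodes can repeat across libraries, such a lookup would be built per library.

```python
from collections import defaultdict


def split_by_group(fragments, barcode_to_group):
    """Group fragments by the (cell_type, sample_id) assigned to their barcode.

    Fragments whose barcode has no assignment are dropped.
    """
    groups = defaultdict(list)
    for frag in fragments:
        key = barcode_to_group.get(frag[3])  # frag[3] is the barcode column
        if key is not None:
            groups[key].append(frag)
    return dict(groups)


fragments = [
    ("chr1", 10073, 10210, "AGTAGGATCCCGTTAC-1", 1),
    ("chr1", 10073, 10309, "TCGCGCACAGCAAATA-1", 1),
    ("chr1", 10079, 10261, "ACATCAATCCGCCAAA-1", 2),
]
# Toy assignments, not taken from the real annotation
barcode_to_group = {
    "AGTAGGATCCCGTTAC-1": ("Microglia", "MAP26637867"),
    "ACATCAATCCGCCAAA-1": ("Microglia", "MAP50106992"),
}

groups = split_by_group(fragments, barcode_to_group)
for (cell_type, sample), frags in sorted(groups.items()):
    print(cell_type, sample, len(frags))
```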

**4) Run Coverage on Per-Sample Files**:   

- For each sample-specific fragment file, calculate base-level chromatin accessibility using `bedtools genomecov`.
- The `-bg` option writes BedGraph output, which is more memory-efficient than per-base output (1-2 min per file).
- This produces per-sample `BedGraph` files, from which the chromatin accessibility at each base position can be read.
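
For intuition about what the BedGraph output contains, the sketch below reproduces the idea in pure Python: a sweep over fragment start and end points yields intervals of constant, nonzero depth. It is an illustration of the concept on toy intervals, not a replacement for `bedtools genomecov -bg`; `fragments_to_bedgraph` is a hypothetical name and chromosomes are sorted lexicographically for simplicity.

```python
from collections import defaultdict


def fragments_to_bedgraph(fragments):
    """Return (chrom, start, end, depth) rows with depth > 0.

    fragments: iterable of (chrom, start, end), 0-based half-open (assumed).
    Adjacent intervals with equal depth are merged into one row.
    """
    events = defaultdict(lambda: defaultdict(int))
    for chrom, start, end in fragments:
        events[chrom][start] += 1  # depth rises where a fragment starts
        events[chrom][end] -= 1    # and falls where it ends

    rows = []
    for chrom in sorted(events):
        depth, prev = 0, None
        for pos in sorted(events[chrom]):
            if prev is not None and depth > 0:
                last = rows[-1] if rows else None
                if last and last[0] == chrom and last[2] == prev and last[3] == depth:
                    rows[-1] = (chrom, last[1], pos, depth)  # extend previous row
                else:
                    rows.append((chrom, prev, pos, depth))
            depth += events[chrom][pos]
            prev = pos
    return rows


toy = [("chr1", 10, 20), ("chr1", 15, 25), ("chr1", 25, 30)]
for row in fragments_to_bedgraph(toy):
    print(*row, sep="\t")
```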

**5) Merge BedGraph Files by Cell Type:**

- Merge the BedGraph files from different samples within the same cell type using `bedtools unionbedg`.
- This step combines data from multiple samples of the same cell type into a single base-by-sample matrix, ready for analysis.
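
The merge can be pictured as cutting the genome at every interval boundary seen in any sample and reporting each sample's value on every resulting piece. The sketch below shows that idea on toy tracks. It is a simplified illustration rather than the `bedtools unionbedg` implementation: `union_bedgraphs` is a hypothetical helper, it scans tracks naively, and it skips pieces where no sample has coverage.

```python
def union_bedgraphs(tracks):
    """Merge per-sample BedGraph tracks into rows with one value per sample.

    tracks: dict of sample -> list of (chrom, start, end, value),
            with non-overlapping intervals inside each sample.
    Returns (samples, rows), each row being (chrom, start, end, [values]).
    """
    samples = list(tracks)
    cuts = {}
    for rows in tracks.values():
        for chrom, start, end, _ in rows:
            cuts.setdefault(chrom, set()).update((start, end))

    def value_at(rows, chrom, pos):
        for c, s, e, v in rows:
            if c == chrom and s <= pos < e:
                return v
        return 0  # positions absent from a track count as zero coverage

    merged = []
    for chrom in sorted(cuts):
        points = sorted(cuts[chrom])
        for s, e in zip(points, points[1:]):
            values = [value_at(tracks[name], chrom, s) for name in samples]
            if any(values):
                merged.append((chrom, s, e, values))
    return samples, merged


# Toy tracks for two made-up samples
tracks = {
    "SampleA": [("chr1", 0, 10, 1)],
    "SampleB": [("chr1", 5, 15, 2)],
}
samples, merged = union_bedgraphs(tracks)
print("chrom", "start", "end", *samples, sep="\t")
for chrom, s, e, values in merged:
    print(chrom, s, e, *values, sep="\t")
```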

**6) QC and get Base-by-Sample Matrix:**

This pseudobulk-level matrix can undergo QC to filter out low-quality samples, ensuring reliable downstream analysis. The final matrix contains base-level chromatin accessibility for each sample within a given cell type.


```bash
## Final Output: Base-by-Sample Matrix

| Base Position    | Sample1 | Sample2 | Sample3 | ... |
|------------------|---------|---------|---------|-----|
| chr1:100000      | 10      | 5       | 8       | ... |
| chr1:100001      | 0       | 1       | 3       | ... |
| chr1:100002      | 12      | 0       | 15      | ... |
| chr1:100003      | 7       | 4       | 10      | ... |
```
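
The QC criteria for this step are not spelled out here. As one possible illustration only, the sketch below drops samples whose total coverage over the region falls below a cutoff. The `min_total` threshold and the `filter_samples` helper are assumptions made for the example, not the pipeline's actual rule; the values are those of the table above.

```python
def filter_samples(matrix, min_total):
    """Keep samples whose summed coverage is at least min_total.

    matrix: dict of sample -> list of per-base coverage values,
            all lists following the same base order.
    """
    return {name: values for name, values in matrix.items() if sum(values) >= min_total}


matrix = {
    "Sample1": [10, 0, 12, 7],
    "Sample2": [5, 1, 0, 4],
    "Sample3": [8, 3, 15, 10],
}
# Hypothetical cutoff chosen only to show the mechanics
passed = filter_samples(matrix, min_total=20)
print(sorted(passed))
```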

### BedGraph Data Format and Explanation
- The `BedGraph` format is more memory-efficient than a per-base table. Unlike `base-level` data, which stores coverage at every base, BedGraph condenses `consecutive` positions with the `same` coverage value into a single `interval`, reducing redundancy.
- This substantially reduces memory usage, making it better suited to storing and processing large genomic datasets. The format is particularly helpful during `QC`, and the interval data can be converted back into base-level data for the final analysis if needed.
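
Converting intervals back to base-level data is a simple expansion, sketched below under the assumption of 0-based, half-open intervals. `expand_to_bases` is a hypothetical helper, and the two input rows are the first two intervals of the Microglia example in the next subsection.

```python
def expand_to_bases(rows):
    """Expand BedGraph rows (chrom, start, end, value) to one record per base."""
    for chrom, start, end, value in rows:
        for pos in range(start, end):  # end is exclusive
            yield chrom, pos, value


intervals = [
    ("chr1", 10156, 10157, 1),
    ("chr1", 10157, 10169, 2),
]
bases = list(expand_to_bases(intervals))
print(len(intervals), "intervals expand to", len(bases), "base-level records")
```

The ratio between the two counts grows with the length of constant-coverage runs, which is where the memory saving comes from.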


#### BedGraph for one Microglia (Mic) sample
- tool: `bedtools genomecov -bg`
```bash
head /home/al4225/project/multiome/step5_bedgraph_coverage/Microglia/MAP26637867.bedgraph
chrom   start   end MAP26637867(coverage)
chr1    10156   10157   1
chr1    10157   10169   2
chr1    10169   10186   3
chr1    10186   10209   2
chr1    10209   10210   1
chr1    10267   10279   1
chr1    10309   10436   2
chr1    17339   17494   2
chr1    87248   87394   1
chr1    91121   91362   2
```

#### BedGraph after merging all Mic samples, subset to the PICALM region
- tool: `bedtools unionbedg`  
- After generating the BedGraph files for all samples, I used `bedtools unionbedg` to merge them. The resulting file contains the chromosome, start, end, and one coverage column per sample. Each row represents an interval over which every sample's coverage stays constant, so the file keeps the memory savings of BedGraph while retaining detailed coverage data.

- PICALM Region window:
    - Original PICALM region expanded by 1 Mb on both sides:
    - Start: chr11 84,668,727 (original: 85,668,727)
    - End: chr11 86,780,924 (original: 85,780,924)
    
```bash
zcat /home/al4225/project/multiome/step5_bedgraph_coverage/PICALM/PICALM_Microglia.bedgraph.gz |
head | cut -f 1-6
chrom   start             end    MAP26637867         MAP50106992      MAP61344957
chr11   84668727        84668728        1               0               1
chr11   84668728        84668729        1               0               1
chr11   84668729        84668735        1               0               1
chr11   84668735        84668736        1               0               1
chr11   84668736        84668741        1               0               1
chr11   84668741        84668747        1               0               1
chr11   84668747        84668750        1               0               1
chr11   84668750        84668751        1               0               1
chr11   84668751        84668752        1               0               1
```

- `chrom`: Chromosome name (e.g., chr11 represents chromosome 11).
- `start`: Start position, 0-based. It indicates the first base position in the interval.
- `end`: End position, not inclusive (half-open interval).
- `Sample columns` (e.g., MAP26637867, MAP50106992, MAP61344957): Each column holds the coverage value for one sample, i.e., the number of fragments covering that interval.   

For example, chr11 84668727 84668728 1 0 1 means that, for the region on chromosome 11 from 84668727 to 84668728, the samples MAP26637867 and MAP61344957 have a coverage of 1, while MAP50106992 has a coverage of 0.
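
The window arithmetic and the subsetting of merged rows can be sketched as follows. This is an illustrative sketch: `expand_window` and `rows_in_window` are hypothetical helpers, the first toy row is invented to show a row falling just outside the window, and half-open intervals are assumed.

```python
def expand_window(start, end, flank=1_000_000):
    """Extend a region by `flank` bases on each side, clamped at 0."""
    return max(0, start - flank), end + flank


def rows_in_window(rows, chrom, start, end):
    """Keep rows (chrom, start, end, ...) that overlap [start, end)."""
    return [r for r in rows if r[0] == chrom and r[1] < end and r[2] > start]


# Original PICALM coordinates as listed above
window = expand_window(85_668_727, 85_780_924)
print(window)

rows = [
    ("chr11", 84668720, 84668727, 0, 0, 1),  # invented row ending at the window start
    ("chr11", 84668727, 84668728, 1, 0, 1),
    ("chr11", 84668729, 84668735, 1, 0, 1),
]
inside = rows_in_window(rows, "chr11", *window)
print(len(inside), "rows overlap the window")
```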

### PICALM genotype data
- ROSMAP PLINK Genotype Data:
    - Subset covering 1,153 samples (rows) and 74,512 SNPs (columns) in the region.
    - Genotype samples still need to be matched to the samples in each phenotype dataset.
```bash
zcat /home/al4225/project/multiome/step5_bedgraph_coverage/PICALM_genotype/genotype/snuc_pseudo_bulk_mega.ENSG00000073921.PICALM_ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.tsv.gz | head -n 5| cut -f 1-4
chr11:84668727_C_T      chr11:84668730_T_C      chr11:84668839_C_T      chr11:84668840_G_A
MAP15387421     0       0       0
MAP26637867     0       0       0
MAP29629849     0       0       0
MAP33332646     0       0       0
```
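
Matching genotype samples to a phenotype matrix comes down to intersecting the two sets of sample IDs while keeping a consistent order. The sketch below uses the IDs visible in the examples above; `shared_samples` is a hypothetical helper, and real data would read the IDs from the genotype rows and the BedGraph header.

```python
def shared_samples(genotype_ids, phenotype_ids):
    """Return IDs present in both lists, in genotype order."""
    phenotype_set = set(phenotype_ids)
    return [sample for sample in genotype_ids if sample in phenotype_set]


genotype_ids = ["MAP15387421", "MAP26637867", "MAP29629849", "MAP33332646"]
phenotype_ids = ["MAP26637867", "MAP50106992", "MAP61344957"]

common = shared_samples(genotype_ids, phenotype_ids)
print(common)
```

Both tables would then be subset and reordered to `common` before any joint analysis.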