# Spoc data structures
This notebook explains the data structures that are available within spoc and shows how they relate to each other. At a high level, spoc provides data structures for all parts of the transformation pipeline from raw reads to aggregated pixels.

Most of these data structures (with the exception of the pixels class) are rarely used directly in everyday analysis tasks; they are mainly used inside analysis pipelines.

# Data frame schemas
Spoc data structures are wrappers around tabular data containers such as `pandas.DataFrame` or `dask.dataframe.DataFrame`. To ensure that the underlying data complies with the format that spoc expects, spoc implements dataframe validation using `pandera`. The underlying schemas reside in the `spoc.dataframe_models` module.
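
To make the idea of schema validation concrete, the following is a minimal stand-in written in plain Python rather than `pandera`. It is not spoc's schema: the function name `validate_fragments` and the dictionary `FRAGMENT_COLUMNS` are hypothetical, and the column types are simply read off the dtypes displayed for the fragment table later in this notebook.

```python
# Expected Python types for a subset of the fragment columns.
# Assumption: these mirror the dtypes shown in this notebook, not spoc's actual schema.
FRAGMENT_COLUMNS = {
    "chrom": str,
    "start": int,
    "end": int,
    "strand": bool,
    "read_name": str,
}


def validate_fragments(rows):
    """Return the rows unchanged if every row matches FRAGMENT_COLUMNS, else raise ValueError."""
    for index, row in enumerate(rows):
        for column, expected_type in FRAGMENT_COLUMNS.items():
            if column not in row:
                raise ValueError(f"row {index}: missing column '{column}'")
            # Exact type comparison, so that a bool is not accepted for an int column.
            if type(row[column]) is not expected_type:
                raise ValueError(
                    f"row {index}: column '{column}' is not of type {expected_type.__name__}"
                )
    return rows


rows = [{"chrom": "chr1", "start": 1, "end": 4, "strand": True, "read_name": "dummy"}]
validated = validate_fragments(rows)
```

A schema library such as `pandera` expresses the same kind of check declaratively and operates on whole dataframe columns instead of individual rows.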

# I/O
Reading and writing of spoc data structures is managed by the `spoc.io` package, specifically by the `FileManager` class. Examples of using the `FileManager` are given in the section for each data structure.

# Fragments
Fragments encapsulate a data structure that can hold a dynamic number of aligned fragments per sequencing unit. In a Pore-C experiment, the sequencing unit is the sequencing read, which holds multiple fragments. In theory, this structure can also be used for other experiment types that generate aligned fragments grouped together by an id, for example SPRITE.

Reading fragments using `FileManager`

In [1]:
from spoc.io import FileManager

In [2]:
fragments = FileManager().load_fragments("../tests/test_files/good_porec.parquet")

The `Fragments` class exposes the underlying dataframe through its `data` accessor

In [5]:
fragments.data.head()

Unnamed: 0,chrom,start,end,strand,read_name,read_start,read_end,read_length,mapping_quality,align_score,align_base_qscore,pass_filter
0,chr1,1,4,True,dummy,1,4,1,1,1,1,True
1,chr1,2,5,True,dummy,2,5,1,2,2,2,True
2,chr1,3,6,True,dummy,3,6,1,3,3,3,True


The `Fragments` class also supports reading the data as a dask dataframe

In [6]:
fragments = FileManager(use_dask=True).load_fragments("../tests/test_files/good_porec.parquet")

In [9]:
fragments.data

npartitions=1,chrom,start,end,strand,read_name,read_start,read_end,read_length,mapping_quality,align_score,align_base_qscore,pass_filter
,object,int64,int64,bool,object,int64,int64,int64,int64,int64,int64,bool
,...,...,...,...,...,...,...,...,...,...,...,...


## Annotating fragments
Fragments can carry metadata that is propagated through the analysis pipeline. `FragmentAnnotator` annotates fragments using a dictionary that maps compound fragment ids to meta information. These ids are concatenations of the read name, chromosome, start and end of the mapping.

In [10]:
fragments = FileManager().load_fragments("../tests/test_files/good_porec.parquet")

In [11]:
label_library = FileManager().load_label_library("../tests/test_files/ll1.pickle")

In [12]:
label_library

{'dummy_chr1_1_4': True, 'dummy_chr1_2_5': False}

In [13]:
from spoc.fragments import FragmentAnnotator

In [15]:
annotated_fragments = FragmentAnnotator(label_library).annotate_fragments(fragments)

In [16]:
annotated_fragments.data.head()

Unnamed: 0,chrom,start,end,strand,read_name,read_start,read_end,read_length,mapping_quality,align_score,align_base_qscore,pass_filter,meta_data
0,chr1,1,4,True,dummy,1,4,1,1,1,1,True,SisterB
1,chr1,2,5,True,dummy,2,5,1,2,2,2,True,SisterA
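
The compound ids used as keys in the label library can be reproduced by hand. The sketch below is a plain-Python illustration, not spoc's implementation: the helper names `compound_id` and `lookup_labels` are hypothetical, and the underscore separator is inferred from the keys of the label library shown above. It only looks up the raw label and does not try to reproduce how spoc converts labels into `meta_data` values.

```python
def compound_id(fragment):
    """Concatenate read name, chromosome, start and end into one id string."""
    return "_".join(
        str(fragment[key]) for key in ("read_name", "chrom", "start", "end")
    )


def lookup_labels(fragments, label_library):
    """Pair each fragment's compound id with its label (None if the id is not in the library)."""
    return [
        (compound_id(fragment), label_library.get(compound_id(fragment)))
        for fragment in fragments
    ]


# Toy data mirroring the first rows of the fragment table and the label library above.
fragments = [
    {"read_name": "dummy", "chrom": "chr1", "start": 1, "end": 4},
    {"read_name": "dummy", "chrom": "chr1", "start": 2, "end": 5},
    {"read_name": "dummy", "chrom": "chr1", "start": 3, "end": 6},
]
label_library = {"dummy_chr1_1_4": True, "dummy_chr1_2_5": False}

labels = lookup_labels(fragments, label_library)
```

In this toy data the third fragment has no entry in the label library, so its label is `None`.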


# Contacts
While the fragment representation retains flexibility, it is often not practical to have contacts of multiple orders and types in different rows of the same file. To this end, we employ the contact representation, where each row contains one contact of a defined order, for example a duplet or a triplet. The `Contacts` class is a wrapper around the data structure that holds this representation.
The `Contacts` class is a generic interface that can represent different orders.
Contacts are created from fragments by the `FragmentExpander` class, which can generate contacts of arbitrary order.

In [19]:
import pandas as pd
from spoc.fragments import FragmentExpander

In [23]:
fragments = FileManager().load_fragments("../tests/test_files/fragments_unlabelled.parquet")

In [25]:
fragments.data.head()

Unnamed: 0,chrom,start,end,strand,read_name,read_start,read_end,read_length,mapping_quality,align_score,align_base_qscore,pass_filter
0,chr1,1,4,True,dummy,1,4,1,1,1,1,True
1,chr1,2,5,True,dummy,2,5,1,2,2,2,True
2,chr1,3,6,True,dummy,3,6,1,3,3,3,True
3,chr1,4,7,True,dummy,4,7,1,4,4,4,True
4,chr1,5,8,True,dummy2,5,8,1,5,5,5,True


In [35]:
contacts = FragmentExpander(number_fragments=2).expand(fragments)

In [36]:
contacts.data.head()

Unnamed: 0,read_name,read_length,chrom_1,start_1,end_1,mapping_quality_1,align_score_1,align_base_qscore_1,chrom_2,start_2,end_2,mapping_quality_2,align_score_2,align_base_qscore_2
0,dummy,1,chr1,1,4,1,1,1,chr1,2,5,2,2,2
1,dummy,1,chr1,1,4,1,1,1,chr1,3,6,3,3,3
2,dummy,1,chr1,1,4,1,1,1,chr1,4,7,4,4,4
3,dummy,1,chr1,2,5,2,2,2,chr1,3,6,3,3,3
4,dummy,1,chr1,2,5,2,2,2,chr1,4,7,4,4,4
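
The rows above are consistent with enumerating all combinations of fragments that share a read name. The following sketch shows that idea in plain Python using `itertools.combinations`. It is an illustration under that assumption and not `FragmentExpander` itself: the function `expand_fragments` is hypothetical, and only a few of the columns are carried over.

```python
from itertools import combinations, groupby


def expand_fragments(fragments, number_fragments=2):
    """Return one contact per combination of `number_fragments` fragments of the same read."""
    # sorted() is stable, so fragments keep their original order within each read.
    by_read = sorted(fragments, key=lambda fragment: fragment["read_name"])
    contacts = []
    for read_name, group in groupby(by_read, key=lambda fragment: fragment["read_name"]):
        for combo in combinations(list(group), number_fragments):
            contact = {"read_name": read_name}
            # Suffix each fragment's columns with its position in the contact (_1, _2, ...).
            for position, fragment in enumerate(combo, start=1):
                for column in ("chrom", "start", "end"):
                    contact[f"{column}_{position}"] = fragment[column]
            contacts.append(contact)
    return contacts


# Toy data mirroring the five fragment rows shown above.
fragments = [
    {"read_name": "dummy", "chrom": "chr1", "start": 1, "end": 4},
    {"read_name": "dummy", "chrom": "chr1", "start": 2, "end": 5},
    {"read_name": "dummy", "chrom": "chr1", "start": 3, "end": 6},
    {"read_name": "dummy", "chrom": "chr1", "start": 4, "end": 7},
    {"read_name": "dummy2", "chrom": "chr1", "start": 5, "end": 8},
]

duplets = expand_fragments(fragments, number_fragments=2)
triplets = expand_fragments(fragments, number_fragments=3)
```

In this toy data the read `dummy` has four fragments and therefore yields six duplets and four triplets, while `dummy2` has a single fragment and yields no contacts.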


`FragmentExpander` also handles metadata that is associated with fragments

In [32]:
fragments_labelled = FileManager().load_fragments("../tests/test_files/fragments_labelled.parquet")

In [37]:
contacts_labelled = FragmentExpander(number_fragments=2).expand(fragments_labelled)

In [38]:
contacts_labelled.data.head()

Unnamed: 0,read_name,read_length,chrom_1,start_1,end_1,mapping_quality_1,align_score_1,align_base_qscore_1,meta_data_1,chrom_2,start_2,end_2,mapping_quality_2,align_score_2,align_base_qscore_2,meta_data_2
0,dummy,1,chr1,1,4,1,1,1,SisterA,chr1,2,5,2,2,2,SisterB
1,dummy,1,chr1,1,4,1,1,1,SisterA,chr1,3,6,3,3,3,SisterA
2,dummy,1,chr1,1,4,1,1,1,SisterA,chr1,4,7,4,4,4,SisterB
3,dummy,1,chr1,2,5,2,2,2,SisterB,chr1,3,6,3,3,3,SisterA
4,dummy,1,chr1,2,5,2,2,2,SisterB,chr1,4,7,4,4,4,SisterB


The `Contacts` class retains the information as to whether the expanded contacts contain metadata

In [40]:
contacts_labelled.contains_meta_data

True

## Symmetry
