In [1]:
import os
import sys

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import random
import pickle

In [2]:
sys.path.append("../src/")
from preprocessing import *

# Data preprocessing example (TGF$\beta$ repA)
This notebook walks through preprocessing sequencing data from affinity selection experiments run on an Illumina HiSeq/MiSeq machine. The same steps should apply when processing new data to train counterselection models for other targets.


Before running this notebook, follow the README in `counterselection/scripts/` to generate a folder like `counterselection/data/processing_example/`. In particular, `counterselection/data/processing_example/aa/` must contain the `.fa` files that hold the enriched reads from each round in CDR peptide sequence format. The table below maps each GRI number to the sample name used in this demo:

| GRI     | Name      |
|---------|-----------|
| GRI5684 | Mock_R1_a |
| GRI7049 | TGFb_R2_a |
| GRI7055 | TGFb_R3_a |

## 1. Count reads from `.fa` files using `make_read_txt()`
The following call generates `.txt` output from `counterselection/processing_example/aa/*_pept.fa`, listing each unique read and its count.

In [3]:
make_read_txt("../processing_example/")
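
For readers who want to see what "counting unique reads" amounts to, here is a minimal sketch. It is not the implementation of `make_read_txt()`, whose internals and output layout are not shown in this notebook. The helper `count_unique_reads` and the toy file `demo_pept.fa` are hypothetical names used only for illustration.

```python
from collections import Counter
from pathlib import Path
import tempfile


def count_unique_reads(fasta_path):
    """Count how often each sequence occurs in a FASTA file.

    Hypothetical helper: a sequence may span several lines, and every
    line starting with '>' begins a new record.
    """
    counts = Counter()
    seq_lines = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if seq_lines:
                    counts["".join(seq_lines)] += 1
                    seq_lines = []
            else:
                seq_lines.append(line)
    # Close the final record in the file.
    if seq_lines:
        counts["".join(seq_lines)] += 1
    return counts


# Toy demonstration on a temporary file (made-up peptide sequences).
with tempfile.TemporaryDirectory() as tmp:
    demo_fa = Path(tmp) / "demo_pept.fa"
    demo_fa.write_text(">r1\nCARDYW\n>r2\nCARDYW\n>r3\nCAKGFW\n")
    demo_counts = count_unique_reads(demo_fa)

print(demo_counts.most_common())  # [('CARDYW', 2), ('CAKGFW', 1)]
```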

## 2. Create `count_dict` object and pickle the file using `create_count_dict()`
The following call builds a Python dictionary for each GRI and pickles it, which makes downstream tasks easier (this step is not strictly necessary).

In [4]:
create_count_dict("../processing_example/")
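
The pickled dictionaries can be reloaded later with the standard `pickle` module. The sketch below shows the round trip on a toy dictionary. It assumes a count dict maps each peptide sequence to its read count, which is an assumption about what `create_count_dict()` stores rather than something this notebook confirms. The file name `toy_count_dict.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed structure: peptide sequence -> read count (toy values).
toy_count_dict = {"CARDYW": 2, "CAKGFW": 1}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy_count_dict.pkl"

    # Write the dictionary to disk in binary mode.
    with open(pkl_path, "wb") as handle:
        pickle.dump(toy_count_dict, handle)

    # Read it back; the restored object equals the original.
    with open(pkl_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == toy_count_dict)
```

Only unpickle files you generated yourself or otherwise trust, since `pickle.load` can execute arbitrary code.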

## 3. Make a round enrichment dataframe using `make_enrichment_df()`
This function takes as input the paths to the Round 1 (R1), Round 2 (R2), and Round 3 (R3) count_dict pickles. In this demo, R1 is the mock selection (`Mock_R1_a`).

In [5]:
df = make_enrichment_df(
    r1_count_dict_path='../processing_example/count_dicts/GRI5684_count_dict.pkl',
    r2_count_dict_path='../processing_example/count_dicts/GRI7049_count_dict.pkl',
    r3_count_dict_path='../processing_example/count_dicts/GRI7055_count_dict.pkl',
)

## 4. Make a training dataset using `make_class_set()`
The following call builds a classification dataset from the enrichment dataframe for use in model training.

In [6]:
tgfb_a = make_class_set(df)