# Uruguay Foreground data
Pablo sent new foreground data to the smarter repository. The data come from an Affymetrix chip, but the file format is completely new and seems to be an Affymetrix report. Inspect the data files and try to work out how to import this new format.

In [1]:
import csv
import collections
import pandas as pd

from pathlib import Path
from tqdm.notebook import tqdm

from src.features.smarterdb import global_connection, Dataset, VariantSheep
from src.features.utils import skip_comments, text_or_gzip_open, get_interim_dir
from src.features.affymetrix import read_affymetrixRow

In [2]:
def get_header(report):
    # sample names are sanitized by read_affymetrixRow, so read the
    # header line of the report file to get the original sample names
    with text_or_gzip_open(report) as handle:
        position, skipped = skip_comments(handle)

        # go back to the header line
        handle.seek(position)

        # now read the file as tab-separated values
        reader = csv.reader(handle, delimiter="\t")

        # return the header row
        return next(reader)
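
As a self-contained illustration of the same idea, the sketch below skips leading comment lines and returns the first tab-separated row as the header. It does not use `skip_comments` or `text_or_gzip_open`: the function `read_report_header` is hypothetical, comment lines are assumed to start with `#`, and the report content (including the `probeset_id` first column) is made up for the example.

```python
import csv
import io


def read_report_header(handle, comment_prefix="#"):
    """Return the first non-comment line of a tab-separated report as a list."""
    while True:
        # remember where the current line starts before reading it
        position = handle.tell()
        line = handle.readline()
        if not line:
            raise ValueError("no header found")
        if not line.startswith(comment_prefix):
            break

    # rewind to the start of the header line and let csv parse it
    handle.seek(position)
    return next(csv.reader(handle, delimiter="\t"))


# hypothetical report content, for illustration only
report = io.StringIO(
    "#%report-type=example\n"
    "#%chip=example\n"
    "probeset_id\tOP1010_20220323182.CEL_call_code\tOP1011_20220323183.CEL_call_code\n"
    "AX-0001\tAA\tAB\n"
)

header = read_report_header(report)
print(header)
```

Reading with `readline()` rather than iterating over the handle keeps `tell()` usable on real text files as well.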

In [3]:
_ = global_connection()

## Placa_Junio_recommended
Let's start from `Placa_Junio_recommended.zip` datafile:

In [4]:
placa_junio = Dataset.objects.get(file="Placa_Junio_recommended.zip")

There's only one file in the dataset, the data file itself:

In [5]:
path = placa_junio.working_dir / placa_junio.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check whether those probeset ids are in the database:

In [6]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': placa_junio.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 1 SNPs of 49636
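
Running one query per probeset is simple but slow for tens of thousands of ids. If the known probeset ids for the chip can be loaded into memory beforehand, the check reduces to a set lookup. The sketch below shows only that last step: `find_missing` is a hypothetical helper, `known_ids` stands in for whatever the database returns (how it is loaded is left out), and all ids are invented.

```python
def find_missing(probeset_ids, known_ids):
    """Return the probeset ids that are not among the known ones, in input order."""
    known = set(known_ids)
    return [pid for pid in probeset_ids if pid not in known]


# invented ids, for illustration only
probeset_ids = ["AX-0001", "AX-0002", "AX-0003"]
known_ids = ["AX-0001", "AX-0003", "AX-0004"]

missing_ids = find_missing(probeset_ids, known_ids)
print(f"Missing {len(missing_ids)} SNPs of {len(probeset_ids)}")
print(missing_ids)
```

Keeping the ids rather than a counter also shows which SNP is missing, not just how many.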


OK, from what I see, all the SNPs except one can be found in the database. What about sample names?

In [7]:
record = next(read_affymetrixRow(path))
record.n_samples

96

It seems that the data file has more samples than the metadata. Let's load the metadata file and check whether those samples can be found:

In [8]:
creole_metadata = Dataset.objects.get(file="20220809_105_Creole_Samples_INIA_Uruguay.zip")
with open(creole_metadata.working_dir / "20220809_105_Creole_Samples_INIA_Uruguay.xlsx", "rb") as handle:
    info = pd.read_excel(handle)
placa_junio_metadata = info[info["File"] == placa_junio.contents[0]]
print(f"Got {placa_junio_metadata.shape[0]} samples")
placa_junio_metadata

Got 11 samples


Unnamed: 0,N,Lab_ID,Breed,Sex,Site,File
94,1,20220323182,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
95,2,20220323183,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
96,3,20220323184,Creole,Male,INIA Las Brujas,Placa_Junio_recommended.txt
97,4,20220323185,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
98,5,20220323186,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
99,6,20220323187,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
100,7,20220323188,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
101,8,20220323189,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt
102,9,20220323190,Creole,Male,INIA Las Brujas,Placa_Junio_recommended.txt
103,10,20220323191,Creole,Male,INIA Las Brujas,Placa_Junio_recommended.txt


Well, I have fewer *ids* in the metadata than samples in the data file. Get the *lab ids*:

In [9]:
lab_ids = [str(id_) for id_ in placa_junio_metadata["Lab_ID"].tolist()]
lab_ids

['20220323182',
 '20220323183',
 '20220323184',
 '20220323185',
 '20220323186',
 '20220323187',
 '20220323188',
 '20220323189',
 '20220323190',
 '20220323191',
 '20220323192']

The sample names are different too. I need to match the *file id* with the *lab id*:

In [10]:
lab2file_ids = collections.defaultdict(lambda: None)

# read the header once instead of once per lab id
header = get_header(path)

for lab_id in lab_ids:
    for col in header:
        if lab_id in col:
            lab2file_ids[lab_id] = col
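
Because `lab2file_ids` is a `defaultdict` whose factory returns `None`, looking up a lab id that matched no column gives `None` instead of raising `KeyError`. A minimal sketch of that behaviour, using a shortened header (the `probeset_id` first column is an assumption) and one invented lab id that has no column:

```python
import collections

lab2file_ids = collections.defaultdict(lambda: None)

# shortened header; the first column name is an assumption
header = [
    "probeset_id",
    "OP1010_20220323182.CEL_call_code",
    "OP1011_20220323183.CEL_call_code",
]

# the last id is invented and has no column in the header
lab_ids = ["20220323182", "20220323183", "29990101001"]

for lab_id in lab_ids:
    for col in header:
        if lab_id in col:
            lab2file_ids[lab_id] = col

matched = lab2file_ids["20220323182"]
unmatched = lab2file_ids["29990101001"]

print(matched)
print(unmatched)
```

Note that such a lookup also stores the missing key in the dictionary with a `None` value.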

Next, I need to track this column in my datatable:

In [11]:
info["id_column"] = info["Lab_ID"].apply(lambda lab_id: lab2file_ids[str(lab_id)])

Focus only on *placa junio* data:

In [12]:
placa_junio_metadata = info[info["File"] == placa_junio.contents[0]]
placa_junio_metadata

Unnamed: 0,N,Lab_ID,Breed,Sex,Site,File,id_column
94,1,20220323182,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1010_20220323182.CEL_call_code
95,2,20220323183,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1011_20220323183.CEL_call_code
96,3,20220323184,Creole,Male,INIA Las Brujas,Placa_Junio_recommended.txt,OP1012_20220323184.CEL_call_code
97,4,20220323185,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1013_20220323185.CEL_call_code
98,5,20220323186,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1014_20220323186.CEL_call_code
99,6,20220323187,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1015_20220323187.CEL_call_code
100,7,20220323188,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1016_20220323188.CEL_call_code
101,8,20220323189,Creole,Female,INIA Las Brujas,Placa_Junio_recommended.txt,OP1017_20220323189.CEL_call_code
102,9,20220323190,Creole,Male,INIA Las Brujas,Placa_Junio_recommended.txt,OP1018_20220323190.CEL_call_code
103,10,20220323191,Creole,Male,INIA Las Brujas,Placa_Junio_recommended.txt,OP1019_20220323191.CEL_call_code


OK, I need to track this new column in the metadata file.

In [13]:
outfile = str(get_interim_dir() / "placa_junio_metadata.xlsx")
placa_junio_metadata.to_excel(outfile, index=False)

This fixed metadata file will be placed in the archive. Let's explore the other datasets. I also need to define a new class in the `src.features.plinkio` module to deal with this type of data file.
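
One possible shape for such a reader, offered only as a sketch: it is not the `src.features.plinkio` API, and the class name `ReportReader`, the `probeset_id` first column, the column names and the call values are all assumptions. It expects a handle already positioned on the header line, keeps only the requested sample columns and yields one dictionary of calls per probeset.

```python
import csv
import io


class ReportReader:
    """Hypothetical reader that keeps only selected sample columns."""

    def __init__(self, handle, id_columns):
        self.reader = csv.reader(handle, delimiter="\t")
        self.header = next(self.reader)
        # remember where each requested column sits in a row;
        # list.index raises ValueError if a column is not in the header
        self.indexes = {col: self.header.index(col) for col in id_columns}

    def __iter__(self):
        for row in self.reader:
            calls = {col: row[idx] for col, idx in self.indexes.items()}
            # first field is assumed to be the probeset id
            yield row[0], calls


# invented report content, for illustration only
data = io.StringIO(
    "probeset_id\tOP1_A.CEL_call_code\tOP2_B.CEL_call_code\tOP3_C.CEL_call_code\n"
    "AX-0001\tAA\tAB\tBB\n"
    "AX-0002\tAB\tAB\tNoCall\n"
)

reader = ReportReader(data, ["OP1_A.CEL_call_code", "OP3_C.CEL_call_code"])
records = list(reader)
print(records)
```

Selecting columns by name means the samples missing from the metadata are simply ignored while reading.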

## OP829-924 INIA Abril
Time for the `OP829-924_INIA_Abril.zip` file:

In [14]:
inia_abril = Dataset.objects.get(file="OP829-924_INIA_Abril.zip")

There's only one file in the dataset, the data file itself:

In [15]:
path = inia_abril.working_dir / inia_abril.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check whether those probeset ids are in the database:

In [16]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_abril.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 0 SNPs of 34327


OK, from what I see, all the SNPs can be found in the database. However, there are fewer SNPs this time than in the previous dataset. What about sample names?

In [17]:
record = next(read_affymetrixRow(path))
record.n_samples

96

It seems that the data file has more samples than the metadata. Check the metadata info:

In [18]:
inia_abril_metadata = info[info["File"] == inia_abril.contents[0]]
print(f"Got {inia_abril_metadata.shape[0]} samples")
inia_abril_metadata.head()

Got 53 samples


Unnamed: 0,N,Lab_ID,Breed,Sex,Site,File,id_column
41,1,20220323001,Creole,Female,INIA Las Brujas,OP829-924 INIA Abril.txt,
42,2,20220323002,Creole,Male,INIA Las Brujas,OP829-924 INIA Abril.txt,
43,3,20220323003,Creole,Female,INIA Las Brujas,OP829-924 INIA Abril.txt,
44,4,20220323004,Creole,Female,INIA Las Brujas,OP829-924 INIA Abril.txt,
45,5,20220323005,Creole,Male,INIA Las Brujas,OP829-924 INIA Abril.txt,


Well, I have fewer *ids* in the metadata than samples in the data file. Get the *lab ids*:

In [19]:
lab_ids = [str(id_) for id_ in inia_abril_metadata["Lab_ID"].tolist()]

The sample names are different too. I need to match the *file id* with the *lab id*:

In [20]:
# read the header once instead of once per lab id
header = get_header(path)

for lab_id in lab_ids:
    for col in header:
        if lab_id in col:
            lab2file_ids[lab_id] = col
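
Substring matching silently keeps the last matching column when more than one contains the lab id, and records nothing when none does. A possible sanity check, sketched here with an invented header and invented ids (`match_columns` is a hypothetical helper), collects every match per lab id and reports the unmatched and ambiguous ones:

```python
def match_columns(lab_ids, header):
    """Map each lab id to the list of header columns containing it."""
    return {
        lab_id: [col for col in header if lab_id in col]
        for lab_id in lab_ids
    }


# invented header and ids, for illustration only
header = [
    "probeset_id",
    "OP001_29990101001.CEL_call_code",
    "OP002_29990101002.CEL_call_code",
    "OP003_29990101002.CEL_call_code",
]
lab_ids = ["29990101001", "29990101002", "29990101003"]

matches = match_columns(lab_ids, header)
unmatched = [lab_id for lab_id, cols in matches.items() if not cols]
ambiguous = [lab_id for lab_id, cols in matches.items() if len(cols) > 1]

print("unmatched:", unmatched)
print("ambiguous:", ambiguous)
```

Both lists being empty would mean every lab id maps to exactly one column of the report.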

Next, I need to track this column in my datatable:

In [21]:
info["id_column"] = info["Lab_ID"].apply(lambda lab_id: lab2file_ids[str(lab_id)])

Focus only on *inia abril* data:

In [22]:
inia_abril_metadata = info[info["File"] == inia_abril.contents[0]]
inia_abril_metadata.head()

Unnamed: 0,N,Lab_ID,Breed,Sex,Site,File,id_column
41,1,20220323001,Creole,Female,INIA Las Brujas,OP829-924 INIA Abril.txt,OP829_20220323001.CEL_call_code
42,2,20220323002,Creole,Male,INIA Las Brujas,OP829-924 INIA Abril.txt,OP830_20220323002.CEL_call_code
43,3,20220323003,Creole,Female,INIA Las Brujas,OP829-924 INIA Abril.txt,OP831_20220323003.CEL_call_code
44,4,20220323004,Creole,Female,INIA Las Brujas,OP829-924 INIA Abril.txt,OP832_20220323004.CEL_call_code
45,5,20220323005,Creole,Male,INIA Las Brujas,OP829-924 INIA Abril.txt,OP833_20220323005.CEL_call_code


OK, I need to track this new column in the metadata file.

In [23]:
outfile = str(get_interim_dir() / "inia_abril_metadata.xlsx")
inia_abril_metadata.to_excel(outfile, index=False)

## Placas1_4_genotyping
Finally, the last dataset:

In [24]:
placas1_4 = Dataset.objects.get(file="Placas1_4_genotyping.zip")

There's only one file in the dataset, the data file itself:

In [25]:
path = placas1_4.working_dir / placas1_4.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check whether those probeset ids are in the database:

In [26]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': placas1_4.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 1 SNPs of 46359


OK, all SNPs except one are in the database. What about sample names?

In [27]:
record = next(read_affymetrixRow(path))
record.n_samples

380

It seems that the data file has more samples than the metadata. Check the metadata info:

In [28]:
placas1_4_metadata = info[info["File"] == placas1_4.contents[0]]
print(f"Got {placas1_4_metadata.shape[0]} samples")
placas1_4_metadata.head()

Got 41 samples


Unnamed: 0,N,Lab_ID,Breed,Sex,Site,File,id_column
0,1,20210824028,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,
1,2,20210824029,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,
2,3,20210824030,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,
3,4,20210824031,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,
4,5,20210824032,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,


Well, I have fewer *ids* in the metadata than samples in the data file. Get the *lab ids*:

In [29]:
lab_ids = [str(id_) for id_ in placas1_4_metadata["Lab_ID"].tolist()]

The sample names are different too. I need to match the *file id* with the *lab id*:

In [30]:
# read the header once instead of once per lab id
header = get_header(path)

for lab_id in lab_ids:
    for col in header:
        if lab_id in col:
            lab2file_ids[lab_id] = col

Next, I need to track this column in my datatable:

In [31]:
info["id_column"] = info["Lab_ID"].apply(lambda lab_id: lab2file_ids[str(lab_id)])

Focus only on *placas 1-4* data:

In [32]:
placas1_4_metadata = info[info["File"] == placas1_4.contents[0]]
placas1_4_metadata.head()

Unnamed: 0,N,Lab_ID,Breed,Sex,Site,File,id_column
0,1,20210824028,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,OP258_20210824028.CEL_call_code
1,2,20210824029,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,OP259_20210824029.CEL_call_code
2,3,20210824030,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,OP260_20210824030.CEL_call_code
3,4,20210824031,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,OP261_20210824031.CEL_call_code
4,5,20210824032,Creole,Female,INIA Las Brujas,Placas1_4_genotyping.txt,OP262_20210824032.CEL_call_code


OK, I need to track this new column in the metadata file.

In [33]:
outfile = str(get_interim_dir() / "placas1_4_metadata.xlsx")
placas1_4_metadata.to_excel(outfile, index=False)

This fixed metadata file will be placed in the archive.