# Uruguay Second Upload (Foreground)
Pablo sent new data that were created for other WPs.

* [Inia_junio_2021_Texel_46_20210409_SMARTER](#20210409)
* [OP635-818 genotyping_soloTexel_20211110_SMARTER](#20211110)
* [Placas1_4_genotyping_Corr_Tex_20210824_SMARTER](#20210824)
* [OP829-924 INIA Abril_20220301_Texel_SMARTER](#20220301)
* [OP925-969 1046-1085 1010-1020_20220323_texel_SMARTER](#20220323)
* [OP1586-1666 OP1087-1106 Placa6 Corr_Tex_Genotyping_20220810_SMARTER](#20220810)

In [1]:
import csv
import collections
import pandas as pd

from pathlib import Path
from tqdm.notebook import tqdm

from src.features.smarterdb import global_connection, Dataset, VariantSheep
from src.features.utils import skip_comments, text_or_gzip_open, get_interim_dir
from src.features.affymetrix import read_affymetrixRow
from src.features.plinkio import AffyReportIO

In [2]:
def get_header(report):
    # sample names are sanitized through read_affymetrixRow: so read the
    # first header of the report file to determine the original sample
    # names
    with text_or_gzip_open(report) as handle:
        position, skipped = skip_comments(handle)

        # go back to header section
        handle.seek(position)

        # now read csv file
        reader = csv.reader(handle, delimiter="\t")

        # get header
        return next(reader)
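
To make the idea behind `get_header` concrete, here is a minimal, self-contained sketch. It assumes that comment lines start with `#` and that the header is the first non-comment line; `read_header` and the example text are hypothetical, and the real `skip_comments` helper may behave differently.

```python
import csv
import io


def read_header(handle, comment_char="#"):
    """Return the first non-comment line of a tab-delimited file as a list."""
    while True:
        # remember where the current line starts
        position = handle.tell()
        line = handle.readline()

        if not line:
            raise ValueError("no header found")

        if not line.startswith(comment_char):
            break

    # go back to the beginning of the header line
    handle.seek(position)

    reader = csv.reader(handle, delimiter="\t")
    return next(reader)


# hypothetical report content, for illustration only
example = io.StringIO(
    "#%report=example\n"
    "#%samples=2\n"
    "probeset_id\tsampleA.CEL_call_code\tsampleB.CEL_call_code\n"
    "AX-0001\tAA\tAB\n"
)

header = read_header(example)
print(header)
```

Reading the raw header this way keeps the original column names, which is what the sample matching later in this notebook relies on.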

In [3]:
_ = global_connection()

<a id='20210409'></a>
## Inia_junio_2021_Texel_46_20210409_SMARTER
Let's start with `Inia_junio_2021_Texel_46_20210409_SMARTER.zip`

In [4]:
inia_20210409 = Dataset.objects.get(file="Inia_junio_2021_Texel_46_20210409_SMARTER.zip")

There's only one file in dataset, the file with the data:

In [5]:
path = inia_20210409.working_dir / inia_20210409.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check if those probeset ids are in database:

In [6]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_20210409.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 0 SNPs of 40204
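
The loop above runs one database query per probeset id. As a hedged alternative, if the probeset ids known for the chip were first collected into a Python set, the same check would reduce to a set difference. The sketch below uses hard-coded ids instead of a database query; `find_missing` and the example ids are hypothetical.

```python
def find_missing(probeset_ids, known_ids):
    """Return the sorted probeset ids that are not among known_ids."""
    return sorted(set(probeset_ids) - set(known_ids))


# hypothetical ids, for illustration only
in_file = ["AX-0001", "AX-0002", "AX-0003"]
in_database = {"AX-0001", "AX-0003", "AX-0004"}

missing_ids = find_missing(in_file, in_database)
print(f"Missing {len(missing_ids)} SNPs of {len(in_file)}")
```

Unlike the counter in the loop, this version also reports which ids are missing, and it counts duplicated ids only once.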


I have all the SNPs in my database. Check for sample names:

In [7]:
record = next(read_affymetrixRow(path))
print(f"{record.n_samples} reported in file")
print(f"dataset has {inia_20210409.n_of_individuals} samples")
samples = list(filter(lambda name: 'cel_call_code' in name, get_header(path)))
print(f"I could find only {len(samples)} samples in report file")
print(f"Missing {inia_20210409.n_of_individuals - len(samples)} samples in reportfile")

81 reported in file
dataset has 46 samples
I could find only 38 samples in report file
Missing 8 samples in reportfile


The report header declares `81` samples, the dataset is registered with `46` samples, and I could find only `38` sample columns in the file. Try to force reading the report with a custom number of samples:

In [8]:
report = AffyReportIO(report=path)
report.read_reportfile(n_samples=len(samples))
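
The number of samples passed to `read_reportfile` comes from counting the header columns that contain `cel_call_code`. A minimal sketch of that count, using a case-insensitive match, could look like the following; `sample_columns`, the example header and the expected count are made up for illustration.

```python
def sample_columns(header, marker="cel_call_code"):
    """Return the header columns that hold sample genotype calls."""
    return [name for name in header if marker in name.lower()]


# hypothetical header, for illustration only
example_header = [
    "probeset_id",
    "a1.CEL_call_code",
    "a2.cel_call_code",
    "chr_id",
]
expected_individuals = 3

found = sample_columns(example_header)
n_missing = expected_individuals - len(found)
print(f"Found {len(found)} sample columns, missing {n_missing}")
```

Lower-casing the column name before the comparison avoids losing columns only because of a different capitalization of the suffix.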

OK, load the metadata to understand which samples are missing:

In [9]:
metadata_dataset = Dataset.objects.get(file="INIA_other_WPs_metadata.zip")
metadata_dataset.contents

['20210409_Genexa.xlsx',
 '20210824_Genexa.xlsx',
 '20211110_Genexa.xlsx',
 '20220301_Genexa.xlsx',
 '20220323_Genexa.xlsx',
 '20220810_Genexa.xlsx']

In [10]:
metadata_path = metadata_dataset.working_dir / "20210409_Genexa.xlsx"
with open(metadata_path, "rb") as handle:
    inia_20210409_metadata = pd.read_excel(handle)
inia_20210409_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   N                46 non-null     int64 
 1   ID               46 non-null     int64 
 2   Breed            46 non-null     object
 3   Sex              46 non-null     object
 4   Stall            46 non-null     object
 5   GPS_Coordinates  46 non-null     object
 6   GPS_2            46 non-null     object
dtypes: int64(2), object(5)
memory usage: 2.6+ KB


In [11]:
ids = [str(id_) for id_ in inia_20210409_metadata["ID"].tolist()]

The sample names differ too: I need to match the *file id* with the *lab id*:

In [12]:
name2id = collections.defaultdict(lambda: None)

for id_ in ids:
    for col in samples:
        if id_ in col:
            name2id[id_] = col
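
The matching above relies on each lab id being a substring of exactly one column name. The following sketch applies the same idea and additionally raises an error when an id matches more than one column; `match_ids` and the example names are hypothetical.

```python
def match_ids(ids, columns):
    """Map each id to the single column name containing it, or to None."""
    matches = {}

    for id_ in ids:
        hits = [col for col in columns if id_ in col]

        if len(hits) > 1:
            raise ValueError(f"{id_} matches more than one column: {hits}")

        matches[id_] = hits[0] if hits else None

    return matches


# hypothetical ids and column names, for illustration only
example_ids = ["1001", "1002"]
example_columns = ["plate1_1002_B02.CEL_call_code"]

matches = match_ids(example_ids, example_columns)
print(matches)
```

With a plain substring test, a short id such as `100` would also match a column built from `1002`, so failing loudly on ambiguous matches is a cheap safeguard.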

Next, I need to track this column in my datatable:

In [13]:
inia_20210409_metadata["alias"] = inia_20210409_metadata["ID"].apply(lambda id_: name2id[str(id_)])

Now try to get rows with missing alias:

In [14]:
inia_20210409_metadata[inia_20210409_metadata["alias"].isnull()].drop(["GPS_2", "GPS_Coordinates"], axis=1)

Unnamed: 0,N,ID,Breed,Sex,Stall,alias
1,9,20210409009,Texel,Male,INIA Las Brujas,
2,10,20210409010,Texel,Male,INIA Las Brujas,
14,65,20210409065,Texel,Male,INIA Las Brujas,
22,73,20210409073,Texel,Male,INIA Las Brujas,
23,74,20210409074,Texel,Male,INIA Las Brujas,
30,81,20210409081,Texel,Female,INIA Las Brujas,
38,89,20210409089,Texel,Male,INIA Las Brujas,
39,90,20210409090,Texel,Female,INIA Las Brujas,


OK, some samples are missing. Split the coordinate column into latitude and longitude so that it can be imported into the database:

In [15]:
inia_20210409_metadata["latitude"] = inia_20210409_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[0].strip()))
inia_20210409_metadata["longitude"] = inia_20210409_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[1].strip()))
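
The two lambdas above split the same string twice. As a sketch, the parsing can be isolated in a small function, assuming that `GPS_Coordinates` always holds a `latitude, longitude` pair separated by a comma; `parse_coordinates` and the example value are hypothetical.

```python
def parse_coordinates(string):
    """Split a 'latitude, longitude' string into a tuple of two floats."""
    latitude, longitude = (float(part.strip()) for part in string.split(","))
    return latitude, longitude


# hypothetical coordinates, for illustration only
latitude, longitude = parse_coordinates("-34.67, -56.34")
print(latitude, longitude)
```

In contrast to the lambdas, this version raises a `ValueError` when the string does not contain exactly two comma-separated values, which makes malformed cells easier to spot.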

Ok, write them into a file:

In [16]:
inia_20210409_metadata.to_excel("20210409_Genexa_fix.xlsx", index=False)

<a id='20211110'></a>
## OP635-818 genotyping_soloTexel_20211110_SMARTER
Now it's time to process `OP635-818 genotyping_soloTexel_20211110_SMARTER.zip`

In [17]:
inia_20211110 = Dataset.objects.get(file="OP635-818 genotyping_soloTexel_20211110_SMARTER.zip")

There's only one file in dataset, the file with the data:

In [18]:
path = inia_20211110.working_dir / inia_20211110.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check if those probeset ids are in database:

In [19]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_20211110.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 5 SNPs of 56254


Well, I have almost all SNPs in the database. Check for sample names:

In [20]:
record = next(read_affymetrixRow(path))
print(f"{record.n_samples} reported in file")
print(f"dataset has {inia_20211110.n_of_individuals} samples")
samples = list(filter(lambda name: 'cel_call_code' in name.lower(), get_header(path)))
print(f"I could find only {len(samples)} samples in report file")
print(f"Missing {inia_20211110.n_of_individuals - len(samples)} samples in reportfile")

191 reported in file
dataset has 59 samples
I could find only 59 samples in report file
Missing 0 samples in reportfile


This time the report file contains all 59 samples defined in the dataset.

In [21]:
report = AffyReportIO(report=path)
report.read_reportfile(n_samples=len(samples))

OK, load the metadata and check whether every sample has a match:

In [22]:
metadata_path = metadata_dataset.working_dir / "20211110_Genexa.xlsx"
with open(metadata_path, "rb") as handle:
    inia_20211110_metadata = pd.read_excel(handle)
inia_20211110_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   N                59 non-null     int64 
 1   ID               59 non-null     int64 
 2   Breed            59 non-null     object
 3   Sex              59 non-null     object
 4   Stall            59 non-null     object
 5   GPS_Coordinates  59 non-null     object
 6   GPS_2            59 non-null     object
dtypes: int64(2), object(5)
memory usage: 3.4+ KB


In [23]:
ids = [str(id_) for id_ in inia_20211110_metadata["ID"].tolist()]

The sample names differ too: I need to match the *file id* with the *lab id*:

In [24]:
name2id = collections.defaultdict(lambda: None)

for id_ in ids:
    for col in samples:
        if id_ in col:
            name2id[id_] = col

Next, I need to track this column in my datatable:

In [25]:
inia_20211110_metadata["alias"] = inia_20211110_metadata["ID"].apply(lambda id_: name2id[str(id_)])

Now try to get rows with missing alias (if any):

In [26]:
inia_20211110_metadata[inia_20211110_metadata["alias"].isnull()].drop(["GPS_2", "GPS_Coordinates"], axis=1)

Unnamed: 0,N,ID,Breed,Sex,Stall,alias


Well, I can find an entry for every sample in the report file. Split the coordinate column into latitude and longitude so that it can be imported into the database:

In [27]:
inia_20211110_metadata["latitude"] = inia_20211110_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[0].strip()))
inia_20211110_metadata["longitude"] = inia_20211110_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[1].strip()))

Ok, write them into a file:

In [28]:
inia_20211110_metadata.to_excel("20211110_Genexa_fix.xlsx", index=False)

<a id='20210824'></a>
## Placas1_4_genotyping_Corr_Tex_20210824_SMARTER
Now it's time to process `Placas1_4_genotyping_Corr_Tex_20210824_SMARTER.zip`

In [29]:
inia_20210824 = Dataset.objects.get(file="Placas1_4_genotyping_Corr_Tex_20210824_SMARTER.zip")

There's only one file in dataset, the file with the data:

In [30]:
path = inia_20210824.working_dir / inia_20210824.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check if those probeset ids are in database:

In [31]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_20210824.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 1 SNPs of 46359


Well, I have almost all SNPs in the database. Check for sample names:

In [32]:
record = next(read_affymetrixRow(path))
print(f"{record.n_samples} reported in file")
print(f"dataset has {inia_20210824.n_of_individuals} samples")
samples = list(filter(lambda name: 'cel_call_code' in name.lower(), get_header(path)))
print(f"I could find only {len(samples)} samples in report file")
print(f"Missing {inia_20210824.n_of_individuals - len(samples)} samples in reportfile")

380 reported in file
dataset has 336 samples
I could find only 332 samples in report file
Missing 4 samples in reportfile


This time 4 samples of the dataset are missing from the report file.

In [33]:
report = AffyReportIO(report=path)
report.read_reportfile(n_samples=len(samples))

OK, load the metadata to understand which samples are missing:

In [34]:
metadata_path = metadata_dataset.working_dir / "20210824_Genexa.xlsx"
with open(metadata_path, "rb") as handle:
    inia_20210824_metadata = pd.read_excel(handle)
inia_20210824_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   N                336 non-null    int64 
 1   ID               336 non-null    int64 
 2   Breed            336 non-null    object
 3   Sex              336 non-null    object
 4   Stall            336 non-null    object
 5   GPS_Coordinates  336 non-null    object
 6   GPS_2            336 non-null    object
dtypes: int64(2), object(5)
memory usage: 18.5+ KB


In [35]:
ids = [str(id_) for id_ in inia_20210824_metadata["ID"].tolist()]

The sample names differ too: I need to match the *file id* with the *lab id*:

In [36]:
name2id = collections.defaultdict(lambda: None)

for id_ in ids:
    for col in samples:
        if id_ in col:
            name2id[id_] = col

Next, I need to track this column in my datatable:

In [37]:
inia_20210824_metadata["alias"] = inia_20210824_metadata["ID"].apply(lambda id_: name2id[str(id_)])

Now try to get rows with missing alias (if any):

In [38]:
inia_20210824_metadata[inia_20210824_metadata["alias"].isnull()].drop(["GPS_2", "GPS_Coordinates"], axis=1)

Unnamed: 0,N,ID,Breed,Sex,Stall,alias
155,204,20210824204,Texel,Female,CCT - Tupambaé,
168,217,20210824217,Texel,Female,CCT - Tupambaé,
220,269,20210824269,Texel,Female,CCT - Tupambaé,
224,273,20210824273,Texel,Male,CCT - Tupambaé,
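
As a quick sanity check, the number of metadata rows without an alias should equal the difference between the dataset size and the number of sample columns found in the report. The sketch below applies this to the figures printed above; `counts_agree` is a hypothetical helper.

```python
def counts_agree(n_individuals, n_found, unmatched_ids):
    """Check that the unmatched ids explain the samples missing in the report."""
    return n_individuals - n_found == len(unmatched_ids)


# figures taken from the outputs above
unmatched = [20210824204, 20210824217, 20210824269, 20210824273]
consistent = counts_agree(336, 332, unmatched)
print(consistent)
```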


These are the samples I miss. Split the coordinate column into latitude and longitude so that it can be imported into the database:

In [39]:
inia_20210824_metadata["latitude"] = inia_20210824_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[0].strip()))
inia_20210824_metadata["longitude"] = inia_20210824_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[1].strip()))

Ok, write them into a file:

In [40]:
inia_20210824_metadata.to_excel("20210824_Genexa_fix.xlsx", index=False)

<a id='20220301'></a>
## OP829-924 INIA Abril_20220301_Texel_SMARTER
Now it's time to process `OP829-924 INIA Abril_20220301_Texel_SMARTER.zip`

In [41]:
inia_20220301 = Dataset.objects.get(file="OP829-924 INIA Abril_20220301_Texel_SMARTER.zip")

There's only one file in dataset, the file with the data:

In [42]:
path = inia_20220301.working_dir / inia_20220301.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check if those probeset ids are in database:

In [43]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_20220301.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 0 SNPs of 34327


This time all SNPs are in the database. Check for sample names:

In [44]:
record = next(read_affymetrixRow(path))
print(f"{record.n_samples} reported in file")
print(f"dataset has {inia_20220301.n_of_individuals} samples")
samples = list(filter(lambda name: 'cel_call_code' in name.lower(), get_header(path)))
print(f"I could find only {len(samples)} samples in report file")
print(f"Missing {inia_20220301.n_of_individuals - len(samples)} samples in reportfile")

96 reported in file
dataset has 43 samples
I could find only 43 samples in report file
Missing 0 samples in reportfile


This time all samples of the dataset are in the report file.

In [45]:
report = AffyReportIO(report=path)
report.read_reportfile(n_samples=len(samples))

OK, load the metadata and check whether every sample has a match:

In [46]:
metadata_path = metadata_dataset.working_dir / "20220301_Genexa.xlsx"
with open(metadata_path, "rb") as handle:
    inia_20220301_metadata = pd.read_excel(handle)
inia_20220301_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   N                43 non-null     int64 
 1   ID               43 non-null     int64 
 2   Breed            43 non-null     object
 3   Sex              43 non-null     object
 4   Stall            43 non-null     object
 5   GPS_Coordinates  43 non-null     object
 6   GPS_2            43 non-null     object
dtypes: int64(2), object(5)
memory usage: 2.5+ KB


In [47]:
ids = [str(id_) for id_ in inia_20220301_metadata["ID"].tolist()]

The sample names differ too: I need to match the *file id* with the *lab id*:

In [48]:
name2id = collections.defaultdict(lambda: None)

for id_ in ids:
    for col in samples:
        if id_ in col:
            name2id[id_] = col

Next, I need to track this column in my datatable:

In [49]:
inia_20220301_metadata["alias"] = inia_20220301_metadata["ID"].apply(lambda id_: name2id[str(id_)])

Now try to get rows with missing alias (if any):

In [50]:
inia_20220301_metadata[inia_20220301_metadata["alias"].isnull()].drop(["GPS_2", "GPS_Coordinates"], axis=1)

Unnamed: 0,N,ID,Breed,Sex,Stall,alias


All samples have a match in the metadata. Split the coordinate column into latitude and longitude so that it can be imported into the database:

In [51]:
inia_20220301_metadata["latitude"] = inia_20220301_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[0].strip()))
inia_20220301_metadata["longitude"] = inia_20220301_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[1].strip()))

Ok, write them into a file:

In [52]:
inia_20220301_metadata.to_excel("20220301_Genexa_fix.xlsx", index=False)

<a id='20220323'></a>
## OP925-969 1046-1085 1010-1020_20220323_texel_SMARTER
Now it's time to process `OP925-969 1046-1085 1010-1020_20220323_texel_SMARTER.zip`

In [53]:
inia_20220323 = Dataset.objects.get(file="OP925-969 1046-1085 1010-1020_20220323_texel_SMARTER.zip")

There's only one file in dataset, the file with the data:

In [54]:
path = inia_20220323.working_dir / inia_20220323.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check if those probeset ids are in database:

In [55]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_20220323.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 5 SNPs of 56254


Well, I have almost all SNPs in the database. Check for sample names:

In [56]:
record = next(read_affymetrixRow(path))
print(f"{record.n_samples} reported in file")
print(f"dataset has {inia_20220323.n_of_individuals} samples")
samples = list(filter(lambda name: 'cel_call_code' in name.lower(), get_header(path)))
print(f"I could find only {len(samples)} samples in report file")
print(f"Missing {inia_20220323.n_of_individuals - len(samples)} samples in reportfile")

96 reported in file
dataset has 25 samples
I could find only 25 samples in report file
Missing 0 samples in reportfile


This time all samples of the dataset are in the report file.

In [57]:
report = AffyReportIO(report=path)
report.read_reportfile(n_samples=len(samples))

OK, load the metadata and check whether every sample has a match:

In [58]:
metadata_path = metadata_dataset.working_dir / "20220323_Genexa.xlsx"
with open(metadata_path, "rb") as handle:
    inia_20220323_metadata = pd.read_excel(handle)
inia_20220323_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   N                25 non-null     int64 
 1   ID               25 non-null     int64 
 2   Breed            25 non-null     object
 3   Sex              25 non-null     object
 4   Stall            25 non-null     object
 5   GPS_Coordinates  25 non-null     object
 6   GPS_2            25 non-null     object
dtypes: int64(2), object(5)
memory usage: 1.5+ KB


In [59]:
ids = [str(id_) for id_ in inia_20220323_metadata["ID"].tolist()]

The sample names differ too: I need to match the *file id* with the *lab id*:

In [60]:
name2id = collections.defaultdict(lambda: None)

for id_ in ids:
    for col in samples:
        if id_ in col:
            name2id[id_] = col

Next, I need to track this column in my datatable:

In [61]:
inia_20220323_metadata["alias"] = inia_20220323_metadata["ID"].apply(lambda id_: name2id[str(id_)])

Now try to get rows with missing alias (if any):

In [62]:
inia_20220323_metadata[inia_20220323_metadata["alias"].isnull()].drop(["GPS_2", "GPS_Coordinates"], axis=1)

Unnamed: 0,N,ID,Breed,Sex,Stall,alias


All samples have a match in the metadata. Split the coordinate column into latitude and longitude so that it can be imported into the database:

In [63]:
inia_20220323_metadata["latitude"] = inia_20220323_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[0].strip()))
inia_20220323_metadata["longitude"] = inia_20220323_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[1].strip()))

Ok, write them into a file:

In [64]:
inia_20220323_metadata.to_excel("20220323_Genexa_fix.xlsx", index=False)

<a id='20220810'></a>
## OP1586-1666 OP1087-1106 Placa6 Corr_Tex_Genotyping_20220810_SMARTER
Now it's time to process `OP1586-1666 OP1087-1106 Placa6 Corr_Tex_Genotyping_20220810_SMARTER.zip`

In [65]:
inia_20220810 = Dataset.objects.get(file="OP1586-1666 OP1087-1106 Placa6 Corr_Tex_Genotyping_20220810_SMARTER.zip")

There's only one file in dataset, the file with the data:

In [66]:
path = inia_20220810.working_dir / inia_20220810.contents[0]
probeset_ids = [record.probeset_id for record in read_affymetrixRow(path)]

Check if those probeset ids are in database:

In [67]:
missing = 0

for probeset_id in tqdm(probeset_ids, total=len(probeset_ids)):
    query = {
        "probesets__match": {
            'chip_name': inia_20220810.chip_name,
            'probeset_id': probeset_id
        }
    }
    
    if not VariantSheep.objects(**query):
        missing += 1
        
print(f"Missing {missing} SNPs of {len(probeset_ids)}")

Missing 148 SNPs of 56941


Well, I have almost all SNPs in the database (148 of 56941 are missing). Check for sample names:

In [68]:
record = next(read_affymetrixRow(path))
print(f"{record.n_samples} reported in file")
print(f"dataset has {inia_20220810.n_of_individuals} samples")
samples = list(filter(lambda name: 'cel_call_code' in name.lower(), get_header(path)))
print(f"I could find only {len(samples)} samples in report file")
print(f"Missing {inia_20220810.n_of_individuals - len(samples)} samples in reportfile")

91 reported in file
dataset has 35 samples
I could find only 35 samples in report file
Missing 0 samples in reportfile


This time all samples of the dataset are in the report file.

In [69]:
report = AffyReportIO(report=path)
report.read_reportfile(n_samples=len(samples))

Missing 'chr_id' column in reportfile!
Missing 'start' column in reportfile!
Missing 'allele_a' column in reportfile!
Missing 'allele_b' column in reportfile!
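
The warnings above indicate that this report lacks the annotation columns that the other report files provide. A check of this kind can be sketched from a header list as shown below; this is not the actual `AffyReportIO` implementation, and `missing_columns` and the example header are hypothetical.

```python
EXPECTED_COLUMNS = ("chr_id", "start", "allele_a", "allele_b")


def missing_columns(header, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the header."""
    present = {name.lower() for name in header}
    return [name for name in expected if name not in present]


# hypothetical header, for illustration only
example_header = ["probeset_id", "s1.CEL_call_code", "Chr_id", "Start"]

absent = missing_columns(example_header)
for name in absent:
    print(f"Missing '{name}' column in reportfile!")
```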


OK, load the metadata and check whether every sample has a match:

In [70]:
metadata_path = metadata_dataset.working_dir / "20220810_Genexa.xlsx"
with open(metadata_path, "rb") as handle:
    inia_20220810_metadata = pd.read_excel(handle)
inia_20220810_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   N                35 non-null     int64 
 1   ID               35 non-null     int64 
 2   Breed            35 non-null     object
 3   Sex              35 non-null     object
 4   Stall            35 non-null     object
 5   GPS_Coordinates  35 non-null     object
 6   GPS_2            35 non-null     object
dtypes: int64(2), object(5)
memory usage: 2.0+ KB


In [71]:
ids = [str(id_) for id_ in inia_20220810_metadata["ID"].tolist()]

The sample names differ too: I need to match the *file id* with the *lab id*:

In [72]:
name2id = collections.defaultdict(lambda: None)

for id_ in ids:
    for col in samples:
        if id_ in col:
            name2id[id_] = col

Next, I need to track this column in my datatable:

In [73]:
inia_20220810_metadata["alias"] = inia_20220810_metadata["ID"].apply(lambda id_: name2id[str(id_)])

Now try to get rows with missing alias (if any):

In [74]:
inia_20220810_metadata[inia_20220810_metadata["alias"].isnull()].drop(["GPS_2", "GPS_Coordinates"], axis=1)

Unnamed: 0,N,ID,Breed,Sex,Stall,alias


All samples have a match in the metadata. Split the coordinate column into latitude and longitude so that it can be imported into the database:

In [75]:
inia_20220810_metadata["latitude"] = inia_20220810_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[0].strip()))
inia_20220810_metadata["longitude"] = inia_20220810_metadata["GPS_Coordinates"].apply(lambda string: float(string.split(",")[1].strip()))

Ok, write them into a file:

In [76]:
inia_20220810_metadata.to_excel("20220810_Genexa_fix.xlsx", index=False)