# Introduction
For the downstream analysis later in the project we need the following for each sample:
1. The geolocation where the sample was taken
2. The source that the sample was isolated from
3. The date on which the sample was collected
4. A way to filter out lab strains

None of these have a dedicated column in any of the tables from the NCBI database. See this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6380228/), which describes how poor the metadata in NCBI is.
 
However, the SRA and Sample tables have a column called 'sample_attribute' where a submitter can add additional information about a sample. Because 'sample_attribute' does not require a specific format or specific information, its content varies greatly between samples. Some organizations that submit data to the NCBI (for example the RIVM) use a consistent format for this column, although that format differs per organization; others do not. This makes it very challenging to extract the information listed above for all samples. In this notebook we attempt to extract it anyway.

In [None]:
import pandas as pd
import re
import pyarrow.feather as feather
from collections import defaultdict

Functions written for this notebook are stored in wrangling_funcs.py. Please look there for documentation and tests.

In [None]:
import wrangling_funcs

# Reading in the data
---

R has a nice package called SRAdb that you can use to query the NCBI database. However, I prefer working in Python. So we are querying the data in R using SRAdb and then exporting it in feather format for use here. There might be a way to directly get a dump of the SRA database and query it without using SRAdb. I will look into this.

The default index of a dataframe is not useful to us. Instead, we use the run_accession as the index, since these values should be unique. This lets us keep track of the rows when we split the metadata into a separate dataframe.
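
That run_accession is unique is an assumption, and it is cheap to check before the column becomes the index. A minimal sketch on a made-up table (the accessions and the keep-first policy are illustrative assumptions, not part of this pipeline):

```python
import pandas as pd

# Toy stand-in for the SRA table; the accessions are invented for illustration.
toy = pd.DataFrame({
    "run_accession": ["SRR0000001", "SRR0000002", "SRR0000002"],
    "platform": ["ILLUMINA", "ILLUMINA", "PACBIO_SMRT"],
})

# Accessions that occur more than once would make the index ambiguous.
is_dup = toy["run_accession"].duplicated(keep=False)
duplicated = toy.loc[is_dup, "run_accession"].unique()
print(f"Duplicated accessions: {list(duplicated)}")

# One possible policy: keep the first occurrence, then index by accession.
toy = toy.drop_duplicates(subset="run_accession", keep="first")
toy = toy.set_index("run_accession")
assert toy.index.is_unique
```

If duplicates do show up in the real data, a later `join` on the index would silently multiply rows, so it may be worth failing early instead.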

In [None]:
# file_path = '../../results/SRA.feather'
# data = feather.read_feather(file_path)
data = feather.read_feather(snakemake.input[0])

metadata_df = pd.DataFrame(data)
metadata_df = metadata_df.convert_dtypes()
metadata_df.set_index('run_accession', inplace=True)

print(f'---Number of rows: {metadata_df.shape[0]}, Number of columns: {metadata_df.shape[1]}---')
metadata_df.head()

In [None]:
na_synonyms = {r'^\*$', r'^-$', r'^\.$', r'^[Nn]one$', r'^[Nn]an$', r'^[Uu]nknown$', r'(?i)^not[ _-]collected$', r'(?i)^not[ _-]provided', r'^\?$', r'^ $', r'(?i)^not[ _-]applicable$', r'^[Nn]a$', r'^[Nn]o$', r'^[Oo]ther$', r'^[Mm]is{1,3}ing$', r'^[Uu]nspecified$', r'^[Nn]ot[ ]available$', r'^[Nn]ot[ :]available[:] not collected$', '^[Nn]ot[ :]available[:] to be reported later$'}

metadata_df = metadata_df.replace(to_replace=na_synonyms, value=pd.NA, regex=True)

print(f'---Number of rows: {metadata_df.shape[0]}, Number of columns: {metadata_df.shape[1]}---')
metadata_df.head()
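
To make the effect of the replacement above easier to see, here is a small sketch on a toy dataframe. It uses only a handful of the patterns, passed as a list, and the column names and values are made up for illustration:

```python
import pandas as pd

# Toy data with a few placeholder values that really mean "missing".
toy = pd.DataFrame({
    "country": ["Netherlands", "not collected", "-"],
    "source": ["missing", "soil", "Unknown"],
})

# A subset of the NA synonyms, anchored with ^ and $ so that only cells
# consisting entirely of the placeholder are replaced.
patterns = [
    r"^-$",
    r"^[Uu]nknown$",
    r"(?i)^not[ _-]collected$",
    r"^[Mm]is{1,3}ing$",
]

cleaned = toy.replace(to_replace=patterns, value=pd.NA, regex=True)
print(cleaned)
```

The anchors matter: without `^` and `$`, a legitimate value that merely contains a placeholder word would be turned into a missing value as well.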

# Finding metadata in the sample_attribute
---

All the metadata we are interested in is contained in the 'sample_attribute' column. From what we could see, most entries in this column are separated by '||'. Each entry is then often split into a key and a value by ':'. We use this to build key-value pairs, which we then turn into a dataframe.
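
The splitting idea can be illustrated on a single made-up string. This is only a sketch: the example string is invented, `parse_sample_attribute` is a hypothetical helper, and the key normalisation is an assumed stand-in for `wrangling_funcs.clean_string`, whose actual behaviour is documented in wrangling_funcs.py.

```python
def parse_sample_attribute(raw):
    """Split a 'key: value || key: value' string into a dict.

    Returns the parsed pairs and the fragments without a ': ' separator.
    """
    pairs, rejected = {}, []
    for fragment in raw.split("||"):
        # Split on the first ': ' only, so values may contain ': ' themselves.
        key, sep, value = fragment.partition(": ")
        if not sep:
            rejected.append(fragment.strip())
            continue
        # Assumed normalisation: trim, lowercase, spaces to underscores.
        clean_key = key.strip().lower().replace(" ", "_")
        pairs[clean_key] = value.strip()
    return pairs, rejected


example = "geo_loc_name: Netherlands: Utrecht || collection date: 2015-03-02 || free text"
pairs, rejected = parse_sample_attribute(example)
print(pairs)
print(rejected)
```

Splitting on the first separator only is what keeps a value such as `Netherlands: Utrecht` intact.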

In [None]:
sample_attribute = metadata_df['sample_attribute']
faulty_lines = []
correct_lines = []

pattern = re.compile(r"^[mM]is{1,3}ing$|^[nN]ot.*|^[oO]ther$|^[uU]nspecified$|^\.$|^\*$|\?|^[Nn]a[nN]$|^[Nn]a$|^ $|^[Uu]nknown$|^[Nn]o$")

for line, identity in zip(sample_attribute, sample_attribute.index):
    line = line.split("||")
    line_items = defaultdict(list)
    # Record the accession up front so rows without any valid pair keep their index
    line_items['run_accession'] = identity
    for subitem in line:
        try:
            # Split on the first ': ' only; values may contain ': ' themselves
            key, value = subitem.split(': ', 1)
            strip_key = wrangling_funcs.clean_string(key)
            strip_value = value.strip()
            if pattern.match(strip_value):
                strip_value = pd.NA
            line_items[strip_key] = strip_value
        except ValueError:
            # No ': ' separator found: keep the line for later inspection
            faulty_lines.append((identity, line))
    correct_lines.append(line_items)

smpl_att_df = pd.DataFrame(correct_lines)
smpl_att_df = smpl_att_df.convert_dtypes()
smpl_att_df.set_index('run_accession', inplace=True)

print(f'---Number of rows: {smpl_att_df.shape[0]}, Number of columns: {smpl_att_df.shape[1]}---')
smpl_att_df

In [None]:
# Write each faulty line once (not the whole list on every iteration)
with open("results/removed_samples.txt", "w") as file:
    for line in faulty_lines:
        file.write(f"{line}\n")

### Searching for geographic data

There is no consistent column that contains the geolocation. To (hopefully) obtain the geolocation we use regex to find keywords in the column names of the dataframe. The matched columns are then combined into a single column while handling NaN values.

![NCBI geo location description](images/geo_location.png)
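
The actual helpers live in wrangling_funcs.py and are not reproduced here. The sketch below only illustrates the general idea of keyword matching on column names followed by taking the first non-missing value per row. `match_columns` and `coalesce_columns` are hypothetical names, and their behaviour is an assumption, not a copy of `find_columns` and `combine_columns`.

```python
import re

import pandas as pd


def match_columns(keywords, df, exclude):
    """Return columns whose name matches a keyword but no excluded word."""
    def hits(words, name):
        return any(re.search(w, name, flags=re.IGNORECASE) for w in words)

    return [c for c in df.columns if hits(keywords, c) and not hits(exclude, c)]


def coalesce_columns(df, columns, new_name):
    """Per row, take the first non-missing value across `columns`."""
    out = df.copy()
    out[new_name] = out[columns].apply(
        lambda row: next((v for v in row if pd.notna(v)), pd.NA), axis=1
    )
    return out


# Made-up data: the location sits in a different column for each sample.
toy = pd.DataFrame({
    "geo_loc_name": ["Netherlands: Utrecht", None, None],
    "country": [None, "Germany", None],
    "geographic_location_latitude": ["52.09", None, None],
    "strain": ["A1", "B2", "C3"],
})

geo_cols = match_columns(["geo", "country"], toy, ["latitude", "longitude"])
toy = coalesce_columns(toy, geo_cols, "inferred_location")
print(geo_cols)
print(toy["inferred_location"].tolist())
```

Taking the first non-missing value is one possible way to handle NaN; when several matched columns are filled for the same sample, the column order decides which value wins.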

In [None]:
geo_col_matches = wrangling_funcs.find_columns(['geo', 'geographic', 'country', 'continent'], smpl_att_df, ['longitude', 'latitude', 'depth'])
print(f'The following columns matched the keywords: {geo_col_matches}')

smpl_att_df = wrangling_funcs.combine_columns(smpl_att_df, list(geo_col_matches), 'inferred_location')
smpl_att_df.drop(geo_col_matches, inplace=True, axis=1)

smpl_att_df['inferred_continent'], smpl_att_df['inferred_country'], smpl_att_df['inferred_city'] = zip(*smpl_att_df['inferred_location'].map(wrangling_funcs.clean_geo))
smpl_att_df.drop('inferred_location', axis=1, inplace=True)

smpl_att_df = smpl_att_df.convert_dtypes()

print(f'---Number of rows: {smpl_att_df.shape[0]}, Number of columns: {smpl_att_df.shape[1]}---')
smpl_att_df.head()

### Searching for the sample collection date

There is no consistent column that contains the date. To (hopefully) obtain the date we use regex to find keywords in the column names of the dataframe. The matched columns are then combined in a single column while handling NaN values.

![ncbi collection date description](images/collection_date.png)

In [None]:
date_col_matches = wrangling_funcs.find_columns(['date', 'year', 'time'], smpl_att_df, ['update'])
print(f'The following columns matched the keywords: {date_col_matches}')
smpl_att_df = wrangling_funcs.combine_columns(smpl_att_df, list(date_col_matches), 'inferred_collection_year')
smpl_att_df.drop(date_col_matches, inplace=True, axis=1)


date = smpl_att_df['inferred_collection_year'].str.extract(r'^(\d{4})', expand=False) # Extract the year
smpl_att_df['inferred_collection_year'] = pd.to_numeric(date) # Convert the year to a numeric dtype (missing years stay missing)

smpl_att_df = smpl_att_df.convert_dtypes()
print(f'---Number of rows: {smpl_att_df.shape[0]}, Number of columns: {smpl_att_df.shape[1]}---')
smpl_att_df.head()
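
To show what the year extraction above keeps and what it discards, here is a small sketch on made-up date strings. The final cast to the nullable `Int64` dtype is an addition for this illustration:

```python
import pandas as pd

# Invented examples of the date formats one might encounter.
dates = pd.Series(["2015-03-02", "2009", "Mar-2012", "2017/2018", None])

# Only a four-digit year at the START of the string is captured.
year_text = dates.str.extract(r"^(\d{4})", expand=False)

# to_numeric yields floats when values are missing; Int64 keeps whole years.
years = pd.to_numeric(year_text).astype("Int64")
print(years.tolist())
```

Note that a value like `Mar-2012` yields no year because the pattern is anchored at the start, and that a range such as `2017/2018` is reduced to its first year.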

### Searching for sample isolation source
There is no consistent column that contains the isolation source. To (hopefully) obtain the isolation source we use regex to find keywords in the column names of the dataframe. The matched columns are then combined in a single column while handling NaN values.

![NCBI isolation source description](images/isolation_source.png)
![NCBI env package description](images/env_package.png)
![NCBI isolation name description](images/isolate_name.png)
![NCBI relative location description](images/rel_location.png)



In [None]:
isolate_matches = wrangling_funcs.find_columns(['sample', 'source', 'environment', 'env', 'site'], smpl_att_df, ['name', 'provider', 'comment'])
# isolate_matches = wrangling_funcs.find_columns(['source', 'isolate'], smpl_att_df, ['name', 'provider', 'comment', 'time', 'date', 'collected'])
print(isolate_matches)

smpl_att_df = wrangling_funcs.combine_columns(smpl_att_df, list(isolate_matches), "inferred_source")
smpl_att_df.drop(isolate_matches, inplace=True, axis=1)

smpl_att_df['inferred_source'] = smpl_att_df['inferred_source'].apply(wrangling_funcs.clean_source)

smpl_att_df = smpl_att_df.convert_dtypes()
print(f'---Number of rows: {smpl_att_df.shape[0]}, Number of columns: {smpl_att_df.shape[1]}---')
smpl_att_df.head()

### Remove irrelevant columns
We have a ton of columns and very few of them are actually useful to us. Let's remove all irrelevant columns.

In [None]:
# The inferred columns are always present
keep_cols = ['inferred_collection_year', 'inferred_source', 'inferred_continent', 'inferred_country', 'inferred_city']

# 'strain' is not guaranteed to be here
if 'strain' in smpl_att_df.columns:
    keep_cols.insert(0, 'strain')

# latitude and longitude are optional; only keep them when both are present
lat_lon = ['geographic_location_latitude', 'geographic_location_longitude']
if all(col in smpl_att_df.columns for col in lat_lon):
    keep_cols.extend(lat_lon)

smpl_att_df = smpl_att_df[keep_cols]


print(f'---Number of rows: {smpl_att_df.shape[0]}, Number of columns: {smpl_att_df.shape[1]}---')
smpl_att_df.head()

# Combine sample_attribute metadata with rest of the data
---
Now that we have extracted the metadata we wanted, we can join it back onto the original dataframe. We only keep rows that have values for the collection year, source and country, because we require these downstream.

In [None]:
combined_df = metadata_df.join(smpl_att_df)
cols = ['inferred_collection_year', 'inferred_source', 'inferred_country']
combined_df = combined_df.dropna(subset=cols)

combined_df = combined_df.convert_dtypes()
print(f'---Number of rows: {combined_df.shape[0]}, Number of columns: {combined_df.shape[1]}---')
combined_df.head()
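
The join-and-filter step above can be sketched on two tiny made-up dataframes that share a run_accession index. Tracking which accessions are dropped is an addition for this illustration, and all names and values are invented:

```python
import pandas as pd

idx_all = pd.Index(["run1", "run2", "run3"], name="run_accession")
idx_some = pd.Index(["run1", "run2"], name="run_accession")

left = pd.DataFrame({"platform": ["ILLUMINA", "ILLUMINA", "PACBIO_SMRT"]}, index=idx_all)
right = pd.DataFrame(
    {
        "inferred_collection_year": [2015, None],
        "inferred_country": ["Netherlands", "Germany"],
    },
    index=idx_some,
)

# DataFrame.join is a left join on the index by default:
# run3 has no extracted metadata and gets missing values.
joined = left.join(right)

required = ["inferred_collection_year", "inferred_country"]
kept = joined.dropna(subset=required)

# Accessions that were filtered out, e.g. for reporting.
dropped = joined.index.difference(kept.index)
print(list(kept.index), list(dropped))
```

Here `run2` is dropped because its year is missing and `run3` because it has no extracted metadata at all.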

### Throw away empty columns
We want to filter out columns that only contain NaN values so there is less clutter.

In [None]:
combined_df = combined_df.dropna(axis=1, how='all')

print(f'---Number of rows: {combined_df.shape[0]}, Number of columns: {combined_df.shape[1]}---')
combined_df.head()

## Make a clean TSV using a selection of columns for parsing with bash when downloading through SRAtools

In [None]:
clean_tsv = combined_df[['platform', 'scientific_name']].copy()
clean_tsv['scientific_name'] = clean_tsv['scientific_name'].str.replace(' ', '_')
clean_tsv.head()

## Write out files

In [None]:
clean_tsv.to_csv(snakemake.output[1], sep='\t')
combined_df.to_csv(snakemake.output[0], na_rep='NaN')