## Quality Assessment

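Using only the recent Nextclade references for Yamagata and Victoria, older strains favor their own lineage over the other only slightly. B/Yamagata/16/1988 scores 0.58 against the Yamagata reference versus 0.42 against Victoria, and B/Victoria/3/85 scores 0.418 against Victoria versus 0.404 against Yamagata: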
```
162433	CY018765.1 Influenza B virus (B/Yamagata/16/1988) segment 4, complete sequence	seg4_yam	0.5842822215369322	502
162433	CY018765.1 Influenza B virus (B/Yamagata/16/1988) segment 4, complete sequence	seg4_vic	0.41565415358681823	370
493	X13553.1 Influenza B virus HA gene for haemagglutinin precursor, strain B/B/Victoria/3/85	seg4_vic	0.41837826600792244	370
493	X13553.1 Influenza B virus HA gene for haemagglutinin precursor, strain B/B/Victoria/3/85	seg4_yam	0.4041802058080461	345
```
Additionally, the split between Yamagata and Victoria in segment 4 (HA) is not very clean, as can be seen from the auspice config. This leads me to believe that lineage assignment using `nextclade sort` is probably not accurate enough to rely on.
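To make "only slightly favor" concrete, one option is to compute, per sequence, the gap between the best and second-best score. The sketch below is a plain-Python illustration, not part of the original analysis: the function name `lineage_margins` and the `(seq_name, dataset, score)` tuple layout are assumptions made for the example, and the scores are copied from the output above.

```python
from collections import defaultdict


def lineage_margins(rows):
    """Map each sequence to (best_dataset, margin over the runner-up).

    `rows` is an iterable of (seq_name, dataset, score) tuples.
    The margin is None when a sequence has only one hit.
    """
    by_seq = defaultdict(list)
    for seq_name, dataset, score in rows:
        by_seq[seq_name].append((score, dataset))

    margins = {}
    for seq_name, hits in by_seq.items():
        hits.sort(reverse=True)  # highest score first
        best_score, best_dataset = hits[0]
        margin = best_score - hits[1][0] if len(hits) > 1 else None
        margins[seq_name] = (best_dataset, margin)
    return margins


# Scores taken from the sort output shown above.
rows = [
    ("CY018765.1", "seg4_yam", 0.5842822215369322),
    ("CY018765.1", "seg4_vic", 0.41565415358681823),
    ("X13553.1", "seg4_vic", 0.41837826600792244),
    ("X13553.1", "seg4_yam", 0.4041802058080461),
]
margins = lineage_margins(rows)
for seq_name, (dataset, margin) in margins.items():
    print(f"{seq_name}\t{dataset}\tmargin={margin:.3f}")
```

For the two strains above, the margins come out at roughly 0.17 and 0.01, which illustrates how little separates the two lineages for the Victoria reference strain.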

However, segment identification does appear to work as intended. Only segments 1 and 2 are sometimes switched in the data, and in the few instances where this happens (e.g. LC720259.1) it turns out to be a submission error. Interestingly, in NCBI terminology segment 1 of influenza B is PB1, whereas segment 1 of influenza A is PB2. I keep this assignment when creating my datasets.
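A check of this kind could be sketched as follows: pull the segment number out of the FASTA description and compare it with the segment encoded in the dataset name. This is an illustration only. The helper `segment_mismatch` is hypothetical, and it assumes that dataset names always start with `seg<N>` (as `seg4_yam` and `seg4_vic` do) and that descriptions annotate the segment as `segment N` or `segment: N`.

```python
import re

# Assumed annotation styles: "segment 4" or "segment: 4" (single digit, 1-8).
SEGMENT_IN_DESCRIPTION = re.compile(r"segment:?\s*(\d)\b", re.IGNORECASE)
# Assumed dataset naming: "seg4_yam", "seg4_vic", ...
SEGMENT_IN_DATASET = re.compile(r"^seg(\d)")


def segment_mismatch(description, dataset):
    """Return (annotated, assigned) if the two segment numbers disagree.

    Returns None when they agree, or when either number cannot be parsed.
    """
    annotated = SEGMENT_IN_DESCRIPTION.search(description)
    assigned = SEGMENT_IN_DATASET.match(dataset)
    if not annotated or not assigned:
        return None
    pair = (int(annotated.group(1)), int(assigned.group(1)))
    return pair if pair[0] != pair[1] else None


# Real description from the sort output: annotation and assignment agree.
agree = segment_mismatch(
    "CY018765.1 Influenza B virus (B/Yamagata/16/1988) segment 4, complete sequence",
    "seg4_yam",
)
# Made-up description and dataset name, for illustration only.
swapped = segment_mismatch("EXAMPLE.1 Influenza B virus segment 1, complete sequence", "seg2")
print(agree, swapped)
```

Descriptions without a numeric segment annotation (for example `segment: KC866605.1`) are skipped rather than flagged, so they would still need a manual look.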

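```python
import pandas as pd

# Load the nextclade sort results
file_path = 'sort_results.tsv'
df = pd.read_csv(file_path, sep='\t')
df = df.astype({'index': 'int'})

# Keep only rows where a dataset was actually matched
df = df.dropna(subset=['score'])

# seqName holds the full FASTA header; the accession is its first token
sorted_fasta_ids = {name.split(' ')[0] for name in df['seqName']}
```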

```python
from Bio import SeqIO

# Collect input sequences that nextclade sort did not assign to any dataset
missing_fasta_ids = []
number_under_22 = 0
total_missing_over = 0

for record in SeqIO.parse("ncbi_dataset/data/genomic.fna", "fasta"):
    if record.id not in sorted_fasta_ids:
        # Count very short unassigned sequences separately and skip them
        if len(record.seq) < 22:
            number_under_22 += 1
            continue
        missing_fasta_ids.append({"id": record.id, "description": record.description, "length": len(record.seq)})
        total_missing_over += 1

missing_fasta_ids = pd.DataFrame(missing_fasta_ids)
missing_fasta_ids.to_csv("missing_fasta_ids.tsv", sep='\t', index=False)
print(f"Number of sequences <22bp: {number_under_22}")
print(f"Total number of missing sequences >=22bp: {total_missing_over}")
```

```
Number of sequences <22bp: 1223
Total number of missing sequences >=22bp: 408
```


Of these, the majority are under 60 bp, leaving only the following sequences that might be of interest (accession, description, length in bp):

- OY757903.1	OY757903.1 Influenza B virus isolate DW-RAT-183 genome assembly, segment: 5	2247
- OZ035285.1	OZ035285.1 Influenza B virus isolate DW-RAT-352 genome assembly, chromosome: 1	2309
- OZ035383.1	OZ035383.1 Influenza B virus isolate DW-RAT-349 genome assembly, chromosome: 1	2309
- OZ035475.1	OZ035475.1 Influenza B virus isolate DW-RAT-356 genome assembly, segment: 1	2313
- OZ035476.1	OZ035476.1 Influenza B virus isolate DW-RAT-356 genome assembly, segment: 2	2001
- OZ035477.1	OZ035477.1 Influenza B virus isolate DW-RAT-356 genome assembly, segment: 3	1589
- OZ035478.1	OZ035478.1 Influenza B virus isolate DW-RAT-356 genome assembly, segment: 4	1625
- OZ035479.1	OZ035479.1 Influenza B virus isolate DW-RAT-356 genome assembly, segment: 5	1487
- OZ035480.1	OZ035480.1 Influenza B virus isolate DW-RAT-356 genome assembly, segment: 6	2337
- OZ211326.1	OZ211326.1 Influenza B virus (B/Brisbane/60/2008) isolate Respiratory swabs genome assembly, segment: KC866605.1	1844
- OZ211328.1	OZ211328.1 Influenza B virus (B/Brisbane/60/2008) isolate Respiratory swabs genome assembly, segment: KC866607.1	1189
- OZ211393.1	OZ211393.1 Influenza B virus (B/Brisbane/60/2008) isolate Respiratory swabs genome assembly, segment: KC866605.1	1844
- OZ211395.1	OZ211395.1 Influenza B virus (B/Brisbane/60/2008) isolate Respiratory swabs genome assembly, segment: KC866607.1	1189
- OY757748.1	OY757748.1 Influenza B virus isolate DW-RAT-127 genome assembly, segment: 5	1500
- OY757749.1	OY757749.1 Influenza B virus isolate DW-RAT-127 genome assembly, segment: 6	1761
- ON845594.1	ON845594.1 Influenza B virus (B/Greece/def2/2022) segment 4 hemagglutinin (HA) gene, complete cds	1701
- ON845595.1	ON845595.1 Influenza B virus (B/Argentina/def2/2022) segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds	982
- ON845596.1	ON845596.1 Influenza B virus (B/swine/Bangladesh/57BC/2022) segment 6 neuraminidase (NA) gene, complete cds	1410
- GU064893.1	GU064893.1 Influenza B virus (B/Shanghai/N37/2008) segment 6 neuraminidase (NA) gene, partial cds	251
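A shortlist like the one above can be derived by filtering the missing sequences on length. The following is a minimal sketch, assuming the `id`, `description` and `length` columns written to `missing_fasta_ids.tsv` and the 60 bp cutoff mentioned in the text. The function `long_missing` is a hypothetical helper, and the short `EXAMPLE_SHORT.1` row is made up so that the filter has something to drop.

```python
import csv
import io

MIN_LENGTH = 60  # cutoff in bp, taken from the text above


def long_missing(rows, min_length=MIN_LENGTH):
    """Keep rows whose 'length' field is at least `min_length`."""
    return [row for row in rows if int(row["length"]) >= min_length]


# In-memory stand-in for missing_fasta_ids.tsv; the first row is invented.
sample_tsv = (
    "id\tdescription\tlength\n"
    "EXAMPLE_SHORT.1\tEXAMPLE_SHORT.1 made-up short fragment\t45\n"
    "ON845594.1\tON845594.1 Influenza B virus (B/Greece/def2/2022) segment 4 hemagglutinin (HA) gene, complete cds\t1701\n"
    "GU064893.1\tGU064893.1 Influenza B virus (B/Shanghai/N37/2008) segment 6 neuraminidase (NA) gene, partial cds\t251\n"
)
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
shortlist = long_missing(rows)
for row in shortlist:
    print(f"- {row['id']}\t{row['description']}\t{row['length']}")
```

To run this against the real file, replace the `io.StringIO(sample_tsv)` stand-in with an opened `missing_fasta_ids.tsv`.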