# Convert multiple loci alignments into one single hdf5 file

This script converts multiple fasta files in a single folder into one hdf5 file that can be used for downstream analyses in [ipyrad](https://github.com/dereneaton/ipyrad) and other software such as [superBPP](https://github.com/eaton-lab/superbpp).

Each gene or locus must be saved in its own fasta file. A file does not need to include every sample. For example:

`gene1.fna`
```text
>sample_1
ACGGCAC
>sample_2
ACTGCAC
>sample_3
ACTGCAA
``` 
`gene2.fna`
```text
>sample_1
GTAAAGTA
>sample_2
GTAGGGTA
```

```python
import alignment2hdf5

alignment2hdf5.multiple_fastas_to_hdf5("./test/genes/*.FNA", output="./test/alignment.hdf5")
```
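
To illustrate how loci with missing samples can still fit into one matrix, here is a minimal sketch. It is not the `alignment2hdf5` implementation: it assumes that a sample absent from a locus is padded with `N`, and the helper names `read_fasta` and `concatenate_loci` are hypothetical.

```python
def read_fasta(text):
    """Parse fasta text into {sample_name: sequence}; wrapped sequences are joined."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sample name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def concatenate_loci(loci, missing="N"):
    """Join loci per sample, padding samples absent from a locus (assumed behavior)."""
    samples = sorted({name for locus in loci for name in locus})
    rows = {name: "" for name in samples}
    for locus in loci:
        # All sequences in one locus are assumed to be aligned (same length).
        length = len(next(iter(locus.values())))
        for name in samples:
            rows[name] += locus.get(name, missing * length)
    return rows


gene1 = read_fasta(">sample_1\nACGGCAC\n>sample_2\nACTGCAC\n>sample_3\nACTGCAA\n")
gene2 = read_fasta(">sample_1\nGTAAAGTA\n>sample_2\nGTAGGGTA\n")
matrix = concatenate_loci([gene1, gene2])
for name, seq in matrix.items():
    print(name, seq)
```

In this sketch `sample_3` is missing from `gene2.fna`, so its row ends with eight `N` characters, and every row has the same length of 15 sites.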

# Convert a single fasta file into a single hdf5 file

This script splits the sequences of a fasta file into multiple loci of similar length and converts them into an hdf5 file that can be used for downstream analyses in [ipyrad](https://github.com/dereneaton/ipyrad) and other software such as [superBPP](https://github.com/eaton-lab/superbpp).

Sequences can be written on a single line or wrapped across multiple lines, for example:

`simple.fa` 
```text
>sample_1 single line
ACGGCACGTAAAGTA
>sample_2 multiline
ACTGCACGTAG
GGTA
```

```python
import alignment2hdf5

# Convert fasta and split each sequence into 3 loci of similar size
alignment2hdf5.split_fasta_to_hdf5("./test/fasta.fa", number_loci=3, output="./test/fasta.hdf5")
```
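
As a rough illustration of what "loci of similar size" can mean, the sketch below divides a sequence length into `number_loci` consecutive chunks whose lengths differ by at most one site. The function `locus_boundaries` is hypothetical, and how `alignment2hdf5` itself distributes leftover sites is an assumption here.

```python
def locus_boundaries(seq_length, number_loci):
    """Return (start, end) pairs that split seq_length sites into number_loci chunks."""
    base, extra = divmod(seq_length, number_loci)
    bounds = []
    start = 0
    for i in range(number_loci):
        # The first `extra` loci receive one additional site each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


sequence = "ACGGCACGTAAAGTA"  # sample_1 from the example above, 15 sites
bounds = locus_boundaries(len(sequence), 3)
loci = [sequence[start:end] for start, end in bounds]
print(bounds)  # [(0, 5), (5, 10), (10, 15)]
print(loci)    # ['ACGGC', 'ACGTA', 'AAGTA']
```

Coordinates are 0-based and end-exclusive, which matches Python slicing.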

# Convert a single nexus file into a single hdf5 file

This script splits a nexus file into multiple loci using the information in the `charpartition` block and converts it into an hdf5 file that can be used for downstream analyses in [ipyrad](https://github.com/dereneaton/ipyrad) and other software such as [superBPP](https://github.com/eaton-lab/superbpp).

Nexus files can be sequential or interleaved, for example:

`simple.nex`
```text
#NEXUS
[This is an example of nexus file]

Begin data;
    Dimensions ntax=6 nchar=48;
    Format datatype=nucleotide gap=- missing=?;
    Matrix
a1    CTGATTTACATGTCAGATGTTTTTACTAGTTCCCAACAGTTTCTCATG
a2    CTGATTTACATGTCAGATGTTTTTACTAGTTCCCAACAGTTTCTCATG
b1    CTGATTTACATGTCAGATGTTTTTACTAGTTCCCAACAGTTTCTCATG
b2    CTGATTTACATGTCAGATGTTTTTACTAGTTCCCAACAGTTTCTCATG
c1    CTGATTTACATGTCAGATGTTTTTACTAGTTCCCAACAGTTTCTCATG
c2    CTGATTTACATGTCAGATGTTTTTACTAGTTCCCAACAGTTTCTCATG
    ;
End;

[charpartition block is required]
charpartition lociset =
1: 1-10,
2: 11-20,
3: 21-30,
4: 31-40,
5: 41-48;
end;
```

```python
import alignment2hdf5

# Convert nexus file
alignment2hdf5.split_fasta_to_hdf5("./test/nexus.nex", number_loci=3, output="./test/nexus.hdf5")
```
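
To show how the `charpartition` block maps to locus coordinates, here is a minimal sketch that extracts the ranges with a regular expression. It is not the parser used by `alignment2hdf5`: the function `parse_charpartition` is hypothetical, and it only handles simple `name: start-end` entries like those in the example.

```python
import re


def parse_charpartition(nexus_text):
    """Return [(locus_name, start, end)] with 0-based, end-exclusive coordinates."""
    block = re.search(
        r"charpartition\s+\w+\s*=(.*?);", nexus_text, re.IGNORECASE | re.DOTALL
    )
    if block is None:
        raise ValueError("no charpartition block found")
    loci = []
    for name, start, end in re.findall(r"(\w+)\s*:\s*(\d+)\s*-\s*(\d+)", block.group(1)):
        # Nexus ranges are 1-based and inclusive; convert them to Python slices.
        loci.append((name, int(start) - 1, int(end)))
    return loci


example = """
charpartition lociset =
1: 1-10,
2: 11-20,
3: 21-30,
4: 31-40,
5: 41-48;
end;
"""
partitions = parse_charpartition(example)
for name, start, end in partitions:
    print(name, start, end, end - start)
```

With the example block this yields five loci: four of 10 sites and a last one of 8 sites, covering all 48 characters of the matrix.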

# Verifying hdf5 structure

```python
import h5py
w = h5py.File("./test/alignment.hdf5", "r")
```

```python
w.keys()
```

```text
<KeysViewHDF5 ['phy', 'phymap', 'scaffold_lengths', 'scaffold_names']>
```

```python
w["phy"]
```

```text
<HDF5 dataset "phy": shape (5, 15), type "|u1">
```

```python
w["phy"][0]
```

```text
array([65, 67, 71, 71, 67, 65, 67, 71, 84, 65, 65, 65, 71, 84, 65],
      dtype=uint8)
```

```python
w["phymap"][:]
```

```text
array([[ 1,  0,  7,  0,  7],
       [ 2,  7, 15,  0, 15]])
```

```python
w["phymap"].attrs.keys()
```

```text
<KeysViewHDF5 ['columns', 'phynames', 'reference']>
```

```python
w["phymap"].attrs["reference"]
```

```text
'converted-with-alignment2hdf5'
```

```python
w.close()
```
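
The `phy` and `phymap` values printed in this section can be combined to recover each locus as text. The sketch below copies those values into plain Python lists so it runs without the hdf5 file. It assumes that the second and third `phymap` columns are the start and end of each locus in the concatenated matrix, which is consistent with the values shown but should be checked against `w["phymap"].attrs["columns"]`. The function `split_row_by_locus` is hypothetical.

```python
# Values copied from the outputs of w["phy"][0] and w["phymap"][:] in this section.
phy_row = [65, 67, 71, 71, 67, 65, 67, 71, 84, 65, 65, 65, 71, 84, 65]
phymap = [[1, 0, 7, 0, 7], [2, 7, 15, 0, 15]]


def split_row_by_locus(row, phymap):
    """Decode one sample row (ASCII codes) into {locus_id: sequence}."""
    loci = {}
    for entry in phymap:
        # Assumed layout: column 0 is the locus id, columns 1 and 2 are start and end.
        locus_id, start, end = entry[0], entry[1], entry[2]
        loci[locus_id] = bytes(row[start:end]).decode("ascii")
    return loci


decoded = split_row_by_locus(phy_row, phymap)
print(decoded)  # {1: 'ACGGCAC', 2: 'GTAAAGTA'}
```

The two decoded strings match the `sample_1` sequences in `gene1.fna` and `gene2.fna`, so the first row of `phy` holds the concatenated loci of that sample stored as ASCII codes.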