# Illustrating Fasta2Structure use anywhere WITH NO INSTALLATIONS

Fasta2Structure prepares data in the specific format required by the population genetics program STRUCTURE, which was developed by [Pritchard et al. 2000](https://pubmed.ncbi.nlm.nih.gov/10835412/). See more about that software [here](https://web.stanford.edu/group/pritchardlab/structure.html), with [the current version of the STRUCTURE software (v. 2.3.4) and documentation available here](https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/html/structure.html).  
Fasta2Structure is described in [Bessa-Silva 2024 'Fasta2Structure: a user-friendly tool for converting multiple aligned FASTA files to STRUCTURE format'](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05697-7).

This notebook demonstrates using a modified version of the Fasta2Structure script in Jupyter and on the command line. The latter means it will be useful almost anywhere, such as on remote machines or computer clusters. There is also the original Tkinter-based Python script by Adam Bessa, which you can run on your desktop if you prefer. To try that in a virtual desktop without needing to install anything on your own system, go [here](https://gist.github.com/fomightez/e65761a066f56cbbc4c9b5b882c87380) and find a step-by-step guide. It will not be as easy to use as the examples that I walk through in this Jupyter notebook file.

Importantly, what is here will demonstrate using Fasta2Structure, right in your favorite web browser, **without the need to install or do anything on your own system**.

-----

##### Absolutely need to try out the version currently in AdamBessa's Fasta2Structure repo, yet would rather not touch your system?

You can still try the original Tkinter-based software presently available at https://github.com/AdamBessa/Fasta2Structure without installing anything on your computer. You can go [here](https://gist.github.com/fomightez/e65761a066f56cbbc4c9b5b882c87380) and find a step-by-step guide to using a remote virtual desktop to test the Fasta2Structure script. However, you'll find it isn't as convenient as what is provided here.
    
----


#### Scenario: Use individual filepaths/filenames

This example is more detailed than the scenarios that follow; however, you can use these same approaches to explore the results of those additional scenarios.

In [1]:
%run improved_Fasta2Structure.py Example_data/Datasets/ITS.fas Example_data/Datasets/trnD-trnT.fas Example_data/Datasets/trnH-trnK.fas 

Converted files saved as: Structure.str


The script saves the result under a generic file name, `Structure.str`. Rename it to something descriptive before running the script again, because the next run may well save different data under the same name. (Do the same with `log.log`.)
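
If you would rather do the renaming from Python than with shell commands, the following is a minimal sketch. The function name `rename_outputs` and its parameters are hypothetical and not part of `improved_Fasta2Structure.py`; it only assumes the two generic output names mentioned above.

```python
from pathlib import Path


def rename_outputs(prefix, directory=".", names=("Structure.str", "log.log")):
    """Rename each generic output file to '<prefix>_<name>' and return the new paths.

    Hypothetical helper: files that are absent are skipped with a message.
    """
    renamed = []
    for name in names:
        source = Path(directory) / name
        if not source.exists():
            print(f"Skipping {source}: not found")
            continue
        target = source.with_name(f"{prefix}_{name}")
        if target.exists():
            # Refuse to clobber results from an earlier run
            raise FileExistsError(f"{target} already exists; choose another prefix")
        source.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_outputs("all_three_examples_called_individually")` would yield the same file names used in the rest of this scenario.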

In [2]:
!mv Structure.str all_three_examples_called_individually_Structure.str
!mv log.log all_three_examples_called_individually_log.log

In [3]:
ls 

all_three_examples_called_individually_log.log
all_three_examples_called_individually_Structure.str
binder/
Example_data/
Fasta2Structure.py
Fasta2Structure_Windows/
improved_Fasta2Structure.py
index.ipynb
README.md
tests/


Let's examine the top lines of the main result:

In [4]:
!head all_three_examples_called_individually_Structure.str

AgMRJ10_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  0  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  -9  2  3  0  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ10_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  -9  2  3  0  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ12_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  1  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  3  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ12_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  1  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  3  3  2  2  2  0  2  2  0  3  3  2

Oddly, that looks similar to what the original script author provides as an example result in [`Example_data/Results/Structure.str`](https://github.com/AdamBessa/Fasta2Structure/blob/b88da439a0dca47ef94ac1f1cd8ffe0f30703ce0/Example_data/Results/Structure.str), but there are two issues, which I describe in [an issue posted here](https://github.com/AdamBessa/Fasta2Structure/issues/4). I built in tests to make sure the `improved_Fasta2Structure.py` script gives the same result as when I run the GUI with a few input examples, including what results [from that situation when the GUI version is used](https://github.com/AdamBessa/Fasta2Structure/assets/4700990/4ea587b2-fdba-4755-9b68-309db41488c1) by me at this time. So for now, if the top of yours looks like the following, then it should be correct as far as I can vouch for at this time.

```text
AgMRJ10_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  0  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  -9  2  3  0  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ10_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  -9  2  3  0  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ12_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  1  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  3  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ12_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  1  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  3  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ14_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  0  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ14_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  0  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ17_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  1  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  3  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ17_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  1  3  0  1  2  2  0  2  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  3  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ19_1 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  1  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  0  3  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
AgMRJ19_2 3  1  3  0  0  2  2  1  3  2  3  3  3  2  2  3  0  1  2  2  0  1  3  2  1  2  0  2  1  1  1  3  3  1  0  2  0  1  1  2  0  1  2  2  2  0  2  2  0  3  3  2  3  2  1  3  1  1  2  1  1  1  0  2  0  1  2  2  3  2  2  2  1  0  2  3  2  1  0  0  3  1  3  3  -9  1  0  1  1 
```
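
To check a result like the one above without reading it by eye, a small parser can help. This is only a sketch under stated assumptions: each row is taken to be a label followed by whitespace-separated integers, and `-9` is assumed to mark missing data, which is the conventional default in STRUCTURE input files. The function name `summarize_structure_text` is hypothetical.

```python
def summarize_structure_text(text, missing=-9):
    """Summarize rows of the form '<label> <int> <int> ...'.

    Assumption: `missing` (-9 by default) is the missing-data code.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        rows.append((fields[0], [int(value) for value in fields[1:]]))
    return {
        "n_rows": len(rows),
        # More than one distinct width means the rows are ragged
        "column_counts": sorted({len(values) for _, values in rows}),
        "missing_per_row": [(label, values.count(missing)) for label, values in rows],
    }


# Tiny made-up sample, not taken from the example data
sample = "A_1 3  1  -9  0\nA_2 3  1  2  0\n"
print(summarize_structure_text(sample))
```

To apply it to a real file, pass in the file's text, for example `summarize_structure_text(open("all_three_examples_called_individually_Structure.str").read())`. A single value in `column_counts` indicates that every row has the same number of columns.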

And the log for that:

In [5]:
!cat all_three_examples_called_individually_log.log

root - INFO - 3 FASTA files selected.
root - INFO - Variable sites for Example_data/Datasets/ITS.fas: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 73, 78, 93, 96, 102, 106, 110, 122, 126, 131, 132, 141, 159, 178, 179, 180, 184, 200, 297, 391, 424, 447, 478, 492, 495, 496, 500, 501, 530, 574, 579, 612, 630, 633, 640, 647, 648, 649, 650, 651]
root - INFO - Variable sites for Example_data/Datasets/trnD-trnT.fas: [107, 292, 297, 366, 477, 592, 610, 627, 645, 721]
root - INFO - Variable sites for Example_data/Datasets/trnH-trnK.fas: [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 14, 15, 16, 18, 23, 28, 54, 61, 91, 122, 162, 268, 324, 507, 701, 777, 786]


The developer of the original script, `Fasta2Structure.py`, which uses a GUI to select the files, provided example results for these data. Let's compare those to your results.

In [6]:
!cat Example_data/Results/log.log

root - INFO - 3 FASTA files selected.
root - INFO - Variable sites for C:/Users/adam-/OneDrive/�rea de Trabalho/Artigo_BMC/Exemple-Data/Avicennia-ITS_Phase.fas: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 73, 78, 93, 96, 102, 106, 110, 122, 126, 131, 132, 141, 159, 178, 179, 180, 184, 200, 297, 391, 424, 447, 478, 492, 495, 496, 500, 501, 530, 574, 579, 612, 630, 633, 640, 647, 648, 649, 650, 651]
root - INFO - Variable sites for C:/Users/adam-/OneDrive/�rea de Trabalho/Artigo_BMC/Exemple-Data/Avicennia-trnD-trnT_ediphase.fas: [107, 292, 297, 366, 477, 592, 610, 627, 645, 721]
root - INFO - Variable sites for C:/Users/adam-/OneDrive/�rea de Trabalho/Artigo_BMC/Exemple-Data/Avicennia-trnH-trnK_editphase.fas: [0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 14, 15, 16, 18, 23, 28, 54, 61, 91, 122, 162, 268, 324, 507, 701, 777, 786]


That looks much like what we got for `all_three_examples_called_individually_log.log`; however, it would be best to confirm that the two are the same.

Running the next cell will compare the elements in each of the lists of 'Variable sites' to see if they match.

In [7]:
# Compare the numbers in the 'Variable sites' lists of the provided
# `Example_data/Results/log.log` vs. all_three_examples_called_individually_log.log
import re

# File paths
original_file = "Example_data/Results/log.log"
new_file = "all_three_examples_called_individually_log.log"


def extract_sites(text):
    """Return one set of site indices per 'Variable sites' entry in the log text."""
    pattern = r'Variable sites for .*?: \[(.*?)\]'
    # Skip empty items so that an empty list, `[]`, yields an empty set
    return [
        {int(item) for item in match.group(1).split(',') if item.strip()}
        for match in re.finditer(pattern, text, re.DOTALL)
    ]


def compare_sites(original, new):
    """Print, file by file, any sites present in one result but not the other."""
    if len(original) != len(new):
        # zip() below stops at the shorter list, so flag the mismatch explicitly
        print(f"Warning: {len(original)} original entries vs. {len(new)} new entries.")
    for i, (orig_set, new_set) in enumerate(zip(original, new)):
        print(f"Comparison for file {i+1}:")
        if orig_set == new_set:
            print("  No deviations found.")
        else:
            diff_orig = orig_set - new_set
            diff_new = new_set - orig_set
            if diff_orig:
                print(f"  Sites in original but not in new result: {sorted(diff_orig)}")
            if diff_new:
                print(f"  Sites in new result but not in original: {sorted(diff_new)}")
        print()


def read_file(filepath):
    """Read a text file, trying UTF-8 first and falling back to Latin-1."""
    encodings = ['utf-8', 'latin-1']
    for encoding in encodings:
        try:
            with open(filepath, 'r', encoding=encoding) as file:
                return file.read()
        except UnicodeDecodeError:
            continue  # try the next encoding
        except FileNotFoundError:
            print(f"Error: File not found - {filepath}")
            return ""
        except OSError:
            print(f"Error: Unable to read file - {filepath}")
            return ""
    print(f"Error: Unable to decode file with attempted encodings - {filepath}")
    return ""


# Read files
original_results = read_file(original_file)
new_results = read_file(new_file)
# Extract sites
original_sites = extract_sites(original_results)
new_sites = extract_sites(new_results)
# Compare sites
compare_sites(original_sites, new_sites)

Comparison for file 1:
  No deviations found.

Comparison for file 2:
  No deviations found.

Comparison for file 3:
  No deviations found.



For each it should say, `No deviations found.`
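
For context on what the 'Variable sites' lists in these logs represent, here is a toy illustration of finding variable columns in an alignment. It is a sketch of the general idea, not the code used by `Fasta2Structure.py` or `improved_Fasta2Structure.py`. It assumes 0-based column indices (the logged lists start at 0) and treats every distinct character, including a gap `-`, as a difference; how the actual scripts handle gaps or ambiguous bases is not covered here. The name `variable_sites` is hypothetical.

```python
def variable_sites(sequences):
    """Return 0-based column indices where the aligned sequences are not all identical."""
    lengths = {len(sequence) for sequence in sequences}
    if len(lengths) > 1:
        raise ValueError("sequences must all have the same aligned length")
    return [
        index
        for index, column in enumerate(zip(*sequences))  # zip(*...) walks column by column
        if len(set(column)) > 1
    ]


# Made-up three-sequence alignment, not taken from the example data
toy_alignment = ["ACGT-A", "ACGTTA", "ATGTTA"]
print(variable_sites(toy_alignment))  # prints [1, 4]
```

Columns 1 (`C`, `C`, `T`) and 4 (`-`, `T`, `T`) are the only ones where the toy sequences disagree.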

#### Scenario: Use an ipywidgets-based GUI to make submitting the command more convenient

The original software, `Fasta2Structure.py`, runs on your desktop and presents a simple GUI that lets you select files to feed the script, then does the conversion once you have selected those files.  
Needing the GUI to run on your desktop can be limiting, yet it makes the tool user-friendly because users don't have to write out each file path. Using ipywidgets, that convenience can be added on top of `improved_Fasta2Structure.py` if users prefer.

*Coming soon: an example of using ipywidgets to select the same three files as above and then use those in the command*  
like `%run improved_Fasta2Structure.py {selected_files}`
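
Until that example is available, here is a rough sketch of one way the `selected_files` variable could be produced. Everything named here (`FASTA_EXTENSIONS`, `list_fasta_files`, `make_file_picker`, `to_argument_string`) is hypothetical, and the extension list is an assumption rather than the set that `improved_Fasta2Structure.py` actually recognizes. It relies on ipywidgets' `SelectMultiple` widget and on IPython expanding `{selected_files}` inside the `%run` line.

```python
from pathlib import Path

# Assumed extensions; adjust to whatever the script actually accepts
FASTA_EXTENSIONS = (".fas", ".fasta", ".fa")


def list_fasta_files(directory, extensions=FASTA_EXTENSIONS):
    """Return sorted paths (as strings) of FASTA-like files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


def make_file_picker(candidates):
    """Build a multi-select list of `candidates`, all preselected (requires ipywidgets)."""
    import ipywidgets as widgets  # imported here so the other helpers work without it

    return widgets.SelectMultiple(
        options=candidates,
        value=tuple(candidates),
        rows=max(1, min(10, len(candidates))),
        description="FASTA:",
    )


def to_argument_string(paths):
    """Join the chosen paths into one string; paths containing spaces would need quoting."""
    return " ".join(paths)
```

In a notebook, one cell could end with `picker = make_file_picker(list_fasta_files("Example_data/Datasets")); picker` to display the list. After adjusting the selection, a later cell would set `selected_files = to_argument_string(picker.value)` and then run `%run improved_Fasta2Structure.py {selected_files}`.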

In [9]:
%run improved_Fasta2Structure.py {selected_files}

Converted file saved as: {selected_files}_Structure.str


Traceback (most recent call last):
  File "/home/jovyan/improved_Fasta2Structure.py", line 45, in process_fasta_file
    alignment = AlignIO.read(filepath, "fasta")
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/Bio/AlignIO/__init__.py", line 384, in read
    alignment = next(iterator)
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/Bio/AlignIO/__init__.py", line 323, in parse
    with as_handle(handle) as fp:
  File "/srv/conda/envs/notebook/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.10/site-packages/Bio/File.py", line 72, in as_handle
    with open(handleish, mode, **kwargs) as fp:
FileNotFoundError: [Errno 2] No such file or directory: '{selected_files}'


In [10]:
!mv Structure.str all_three_examples_selected_by_file_selector_Structure.str
!mv log.log all_three_examples_selected_by_file_selector_log.log

mv: cannot stat 'Structure.str': No such file or directory
mv: cannot stat 'log.log': No such file or directory


I would argue that this section provides a more user-friendly version of `Fasta2Structure.py` because it offers a GUI to help choose the files yet can be run pretty much anywhere.

#### Scenario: Use a directory

Point the `improved_Fasta2Structure.py` script at a directory and it will process the FASTA files it recognizes in that directory. (Note that the file extensions must match the expected ones.)

In [11]:
%run improved_Fasta2Structure.py Example_data/Datasets/

Converted files saved as: Structure.str


Note that this produces the same result as `%run improved_Fasta2Structure.py Example_data/Datasets/ITS.fas Example_data/Datasets/trnD-trnT.fas Example_data/Datasets/trnH-trnK.fas` because those are the files in that directory and the order in which they get processed is based on sorting. (If that order isn't what you want, you'd have to specify the file paths as arguments in the order you need.)

That run also saves its result under the generic file name `Structure.str`. Again, rename it and the log file to distinguish them before running the script another time.
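
Once both results have been renamed, a byte-for-byte comparison is one way to confirm that the directory run matches the run with individually listed files. This is a minimal sketch using the standard library's `filecmp`; the helper name `same_contents` is hypothetical.

```python
import filecmp
from pathlib import Path


def same_contents(first, second):
    """Return True if both files exist and are byte-for-byte identical."""
    missing = [str(path) for path in (first, second) if not Path(path).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing file(s): {', '.join(missing)}")
    # shallow=False compares the actual contents rather than only file metadata
    return filecmp.cmp(str(first), str(second), shallow=False)
```

With the file names used in this notebook, the check would be `same_contents("all_three_examples_called_individually_Structure.str", "directory_utilized_Structure.str")`.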

In [12]:
!mv Structure.str directory_utilized_Structure.str
!mv log.log directory_utilized_log.log

mv: cannot stat 'log.log': No such file or directory


In [13]:
ls

all_three_examples_called_individually_log.log
all_three_examples_called_individually_Structure.str
binder/
directory_utilized_Structure.str
Example_data/
Fasta2Structure.py
Fasta2Structure_Windows/
improved_Fasta2Structure.py
index.ipynb
README.md
{selected_files}_Structure.str
tests/


#### Troubleshooting

Note: if the log file, `log.log`, which should be produced when using `improved_Fasta2Structure.py`, stops being generated, first try restarting the kernel and sticking with `%run`; see [here](https://stackoverflow.com/a/48005958/8508004) for more about that. If that fails, you can try changing the `%run` to `!python`.
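
As a Python-side equivalent of switching to `!python`, the script can be launched in a separate process and the expected outputs checked afterwards. This is a sketch only: `run_and_check` is a hypothetical helper, and the expected file names are simply the two generic names used throughout this notebook.

```python
import subprocess
import sys
from pathlib import Path


def run_and_check(arguments, expected=("Structure.str", "log.log"), workdir="."):
    """Run `python <arguments...>` in a separate process and report missing outputs.

    Returns (return code, list of expected file names that were not found).
    """
    completed = subprocess.run(
        [sys.executable, *arguments],  # same interpreter that runs the notebook
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    missing = [name for name in expected if not (Path(workdir) / name).exists()]
    return completed.returncode, missing
```

For example, `run_and_check(["improved_Fasta2Structure.py", "Example_data/Datasets/"])` would return a pair such as `(0, [])` when the run succeeds and both files are present.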