You can read FASTA files using the ``read_fasta()`` function:

In [1]:
import aaanalysis as aa
file_path = "data/example_FASTA.fasta"
df_seq = aa.read_fasta(file_path)
aa.display_df(df_seq)

Unnamed: 0,entry,sequence
1,"SEMA4A,38.4",LAAQQSYWPHFVTVT...IILVASPLRALRARG
2,"SEMA4B,47.0",WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
3,"SEMA4C,86.6",EARAPLENLGLVWLA...LLLVLSLRRRLREEL
4,"SEMA4D,19.1",TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL
5,"SEMA4F,88.5",RDAPSRAHTVGAGLA...TLILIGRRQQRRRQR
6,"SEMA4G,49.9",GAQLAPDVRLLYVLA...ASSLLYVACLREGRR
7,"SEMA5A,28.1",EEKRCGEFNMFHMIA...LTLLVYTYCQRYQQQ
8,"SEMA5B,30.0",TDCAGFNLIHLVATG...LAVYLSCQHCQRQSQ
9,"SEMA6A,21.1",KGHDQLVPVTLLAIA...SGITVYCVCDHRRKD
10,"SEMA6B,80.0",VSVNLLVTSSVAAFV...WFVGLRERRELARRK
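
For readers unfamiliar with the format, the following minimal sketch shows what reading a FASTA file involves: each header line starts with ``>``, and the lines that follow it, up to the next header, form one sequence. It uses only the Python standard library. ``parse_fasta`` is a hypothetical helper written for illustration, not the implementation behind ``read_fasta()``, and the input is made-up toy data.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # A sequence may be wrapped over several lines; collect the pieces
            chunks.append(line)
    # Close the last record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up toy input (assumption), not taken from data/example_FASTA.fasta
example = ">SEQ1,0.5\nACDE\nFGHI\n>SEQ2,0.9\nKLMN\n"
records = parse_fasta(example)
print(records)
```

Each tuple corresponds to one row of the table above: the header becomes the identifier column and the joined lines become the sequence column.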


To rename the identifier and sequence columns, use the ``col_id`` and ``col_seq`` parameters:

In [2]:
df_seq = aa.read_fasta(file_path, col_id="ENTRY", col_seq="SEQUENCE")
aa.display_df(df_seq)

Unnamed: 0,ENTRY,SEQUENCE
1,"SEMA4A,38.4",LAAQQSYWPHFVTVT...IILVASPLRALRARG
2,"SEMA4B,47.0",WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL
3,"SEMA4C,86.6",EARAPLENLGLVWLA...LLLVLSLRRRLREEL
4,"SEMA4D,19.1",TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL
5,"SEMA4F,88.5",RDAPSRAHTVGAGLA...TLILIGRRQQRRRQR
6,"SEMA4G,49.9",GAQLAPDVRLLYVLA...ASSLLYVACLREGRR
7,"SEMA5A,28.1",EEKRCGEFNMFHMIA...LTLLVYTYCQRYQQQ
8,"SEMA5B,30.0",TDCAGFNLIHLVATG...LAVYLSCQHCQRQSQ
9,"SEMA6A,21.1",KGHDQLVPVTLLAIA...SGITVYCVCDHRRKD
10,"SEMA6B,80.0",VSVNLLVTSSVAAFV...WFVGLRERRELARRK


The ``col_id`` column should contain only the unique identifier. If the FASTA headers contain additional information, use the ``sep`` argument (default='|') to split it into additional columns, named ``info1`` to ``info(n)``:

In [3]:
df_seq = aa.read_fasta(file_path, sep=",")
aa.display_df(df_seq)

Unnamed: 0,entry,sequence,info1
1,SEMA4A,LAAQQSYWPHFVTVT...IILVASPLRALRARG,38.4
2,SEMA4B,WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL,47.0
3,SEMA4C,EARAPLENLGLVWLA...LLLVLSLRRRLREEL,86.6
4,SEMA4D,TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL,19.1
5,SEMA4F,RDAPSRAHTVGAGLA...TLILIGRRQQRRRQR,88.5
6,SEMA4G,GAQLAPDVRLLYVLA...ASSLLYVACLREGRR,49.9
7,SEMA5A,EEKRCGEFNMFHMIA...LTLLVYTYCQRYQQQ,28.1
8,SEMA5B,TDCAGFNLIHLVATG...LAVYLSCQHCQRQSQ,30.0
9,SEMA6A,KGHDQLVPVTLLAIA...SGITVYCVCDHRRKD,21.1
10,SEMA6B,VSVNLLVTSSVAAFV...WFVGLRERRELARRK,80.0
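
The effect of ``sep`` can be pictured with a small standalone sketch. ``split_header`` is a hypothetical helper, assumed here purely for illustration and not the library's code: it splits a header string on the separator, keeps the first field as the identifier, and numbers the remaining fields ``info1``, ``info2``, and so on.

```python
def split_header(header, sep="|", col_id="entry"):
    """Split a FASTA header into an identifier and numbered info fields."""
    parts = [part.strip() for part in header.split(sep)]
    # The first field is treated as the unique identifier
    row = {col_id: parts[0]}
    # Any remaining fields are named info1, info2, ...
    for i, value in enumerate(parts[1:], start=1):
        row[f"info{i}"] = value
    return row


# Header string as it appears in the tables of this example
row = split_header("SEMA4A,38.4", sep=",")
# With the default separator '|' the header contains nothing to split on
default_row = split_header("SEMA4A,38.4")
print(row)
print(default_row)
```

With ``sep=","`` the header is divided into ``entry`` and ``info1``. With the default separator the whole string stays in ``entry``, which matches the first table in this example.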


To rename the additional columns, pass a list of column names to ``cols_info``:

In [4]:
df_seq = aa.read_fasta(file_path, sep=",", cols_info=["prediction"])
aa.display_df(df_seq)

Unnamed: 0,entry,sequence,prediction
1,SEMA4A,LAAQQSYWPHFVTVT...IILVASPLRALRARG,38.4
2,SEMA4B,WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL,47.0
3,SEMA4C,EARAPLENLGLVWLA...LLLVLSLRRRLREEL,86.6
4,SEMA4D,TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL,19.1
5,SEMA4F,RDAPSRAHTVGAGLA...TLILIGRRQQRRRQR,88.5
6,SEMA4G,GAQLAPDVRLLYVLA...ASSLLYVACLREGRR,49.9
7,SEMA5A,EEKRCGEFNMFHMIA...LTLLVYTYCQRYQQQ,28.1
8,SEMA5B,TDCAGFNLIHLVATG...LAVYLSCQHCQRQSQ,30.0
9,SEMA6A,KGHDQLVPVTLLAIA...SGITVYCVCDHRRKD,21.1
10,SEMA6B,VSVNLLVTSSVAAFV...WFVGLRERRELARRK,80.0
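
As a rough mental model, naming the extra columns amounts to pairing each extra header field with one name from the list. The sketch below is a hypothetical standalone helper (``name_info_fields`` is not part of the library). Raising an error on a length mismatch is a choice made for this sketch; how ``read_fasta()`` handles that case is not shown here.

```python
def name_info_fields(values, cols_info=None):
    """Pair extra header fields with column names."""
    if cols_info is None:
        # Fall back to numbered default names: info1, info2, ...
        cols_info = [f"info{i}" for i in range(1, len(values) + 1)]
    if len(cols_info) != len(values):
        # Assumption of this sketch; the library may behave differently
        raise ValueError("need exactly one name per extra field")
    return dict(zip(cols_info, values))


named = name_info_fields(["38.4"], cols_info=["prediction"])
default = name_info_fields(["38.4"])
print(named)
print(default)
```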


FASTA headers can start with a database abbreviation (e.g., 'sp' for Swiss-Prot). To store this abbreviation in its own column, pass a column name to the ``col_db`` parameter:

In [5]:
file_path = "data/example_FASTA_db.fasta"
df_seq = aa.read_fasta(file_path, col_db="database", sep=",")
aa.display_df(df_seq)

Unnamed: 0,entry,sequence,database,info1
1,SEMA4A,LAAQQSYWPHFVTVT...IILVASPLRALRARG,sp,38.4
2,SEMA4B,WGADRSYWKEFLVMC...LFLLYRHRNSMKVFL,sp,47.0
3,SEMA4C,EARAPLENLGLVWLA...LLLVLSLRRRLREEL,sp,86.6
4,SEMA4D,TMYLKSSDNRLLMSL...FFYNCYKGYLPRQCL,sp,19.1
5,SEMA4F,RDAPSRAHTVGAGLA...TLILIGRRQQRRRQR,sp,88.5
6,SEMA4G,GAQLAPDVRLLYVLA...ASSLLYVACLREGRR,sp,49.9
7,SEMA5A,EEKRCGEFNMFHMIA...LTLLVYTYCQRYQQQ,sp,28.1
8,SEMA5B,TDCAGFNLIHLVATG...LAVYLSCQHCQRQSQ,sp,30.0
9,SEMA6A,KGHDQLVPVTLLAIA...SGITVYCVCDHRRKD,sp,21.1
10,SEMA6B,VSVNLLVTSSVAAFV...WFVGLRERRELARRK,sp,80.0
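
The idea behind ``col_db`` can be sketched as follows. The header layout ``sp,SEMA4A,38.4`` is an assumption, since the contents of ``data/example_FASTA_db.fasta`` are not shown here, and ``split_header_with_db`` is a hypothetical helper rather than the library's implementation. It treats the first field as the database abbreviation, the second as the identifier, and numbers the rest.

```python
def split_header_with_db(header, sep="|", col_id="entry", col_db="database"):
    """Split a header of the assumed form <db><sep><id><sep><info...>."""
    parts = [part.strip() for part in header.split(sep)]
    if len(parts) < 2:
        raise ValueError("expected a database prefix and an identifier")
    # First field: database abbreviation; second field: unique identifier
    row = {col_db: parts[0], col_id: parts[1]}
    # Remaining fields are named info1, info2, ...
    for i, value in enumerate(parts[2:], start=1):
        row[f"info{i}"] = value
    return row


# Assumed header layout; the real file may be formatted differently
row = split_header_with_db("sp,SEMA4A,38.4", sep=",")
print(row)
```

Without a dedicated database column, the abbreviation would occupy the identifier position and the real identifier would end up in an info column.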
