## **SUMMARY CODE**
***Author: Vy Phung***  
  - This file summarizes and explains the methods and code that I wrote and used to retrieve the raw mtDNA sequences from NCBI GenBank, clean these raw data to obtain the final Dataset 3 (which includes only complete genome sequences), and finally use the summary information of each sequence to create the Isolate Explanation table.

  - To keep the code easy to run and read, I saved it in the GitHub repository linked below and simply call the packages and functions each time I run a step. For the specific details of the code, see: https://github.com/duhongduc/Haplo.

Caveats:
- I first ran this code to retrieve the raw data in June 2023, so if you reuse and re-run it, more raw data may have been published since then. When I started writing this summary file (May 2024), I re-ran the code with the same search keywords to make sure it still worked. Comparing the number of sequences in the re-run Dataset 3 with that of the original Dataset 3 (4932 sequences) showed that more data had been published in the meantime. Therefore, in this summary file, as in our paper, I describe the same methods I used to obtain Dataset 3, but in the Dataset 3 section I added code that keeps only the original 4932 sequences used in our paper.
- I used Google Colab because it is linked to Google Drive, so I can read data from and save data to a Google Drive folder directly. Here is the Google Drive folder that I used in our paper:
https://drive.google.com/drive/folders/15OPBImGAG51vukHfADV3-zG8RKU8A9fe?usp=drive_link. </br>

Whenever the code below saves files to a folder, the folder location is the same as the location in this Google Drive.


**CONTENTS:**
1. Get Data from NCBI
* Setting up
* Entrez Direct
* Missing Data
2. Data wrangling
* Dataset 1
* Dataset 2
* Dataset 3
3. Tables
* Isolate Explanation Table
* Table 1
* Table 2
* Table 3 and its subtables

### **1. Get Data from NCBI**

#### **Setting up**

First, I mounted my Google Drive by running the code cell below
(press Shift + Enter to run any code cell).

In [None]:
# Mount my Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# After mounting, run this cell to move into my Google Drive directory
%cd /content/drive/MyDrive

/content/drive/MyDrive


In [None]:
# Then run this cell to clone the main Haplogroup directory from GitHub into my Google Drive folder, so that every time I run the code, I can call it from Google Drive
! git clone https://github.com/vy-phung/Haplogroup.git

Cloning into 'Haplogroup'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 88 (delta 3), reused 17 (delta 2), pack-reused 68
Receiving objects: 100% (88/88), 50.73 KiB | 2.21 MiB/s, done.
Resolving deltas: 100% (21/21), done.


#### **Entrez Direct**

- To retrieve mtDNA sequences from 11 countries in Southeast Asia, I used [Entrez Direct](https://www.ncbi.nlm.nih.gov/books/NBK179288/) to search with the common query pattern below: </br>
`Homo sapiens AND mitochondrion AND <Country Name>`

In [None]:
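# Download Entrez Direct
!sh -c "$(wget -q https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh -O -)"
# Each "!" command runs in its own subshell, so "!export" would not persist;
# set PATH from Python instead so that later cells can find the EDirect tools
import os
os.environ["PATH"] = os.path.expanduser("~/edirect") + os.pathsep + os.environ["PATH"]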


Entrez Direct has been successfully downloaded and installed.

In order to complete the configuration process, please execute the following:

  echo "export PATH=/root/edirect:\${PATH}" >> ${HOME}/.bashrc

or manually edit the PATH variable assignment in your .bashrc file.

Would you like to do that automatically now? [y/N]
y
OK, done.

To activate EDirect for this terminal session, please execute the following:

export PATH=${HOME}/edirect:${PATH}



Below is the bash script `./Haplogroup/finalCodes/bashScriptCodes/downloadDataNCBI.sh` that I wrote to download and save the data. Its `download` function takes two arguments:

- "DataListName": /content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt </br>
(The file "countries.txt" contains the names of the 11 countries: Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore, Thailand, Timor-Leste, Viet Nam)

- "NameOfSaveFolder": /content/drive/MyDrive/RetrieveData/OldCountryFasta </br>("OldCountryFasta" is the folder where I saved the downloaded data)

The data downloaded for each country were saved to one big fasta file named after that country.
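
The download script itself is not reproduced in this file. As a rough sketch of the idea, assuming the script runs one `esearch | efetch` pipeline per country (the same EDirect calls used later in this notebook), the query and the command for each country could be built as shown here. The helper names `build_query` and `build_command` are hypothetical and are not part of the Haplogroup repository.

```python
import shlex

def build_query(country):
    """Return the common search query for one country."""
    return f"Homo sapiens AND mitochondrion AND {country}"

def build_command(country, save_folder):
    """Return a shell pipeline that appends the FASTA records found for one
    country to <save_folder>/<country>.fasta (illustrative only)."""
    query = shlex.quote(build_query(country))
    # Quoting matters because "Viet Nam" contains a space
    out_file = shlex.quote(f"{save_folder}/{country}.fasta")
    return (f"esearch -db nucleotide -query {query} | "
            f"efetch -format fasta >> {out_file}")

# countries.txt is comma-separated; two names are enough for a demonstration
countries = "Brunei,Viet Nam".split(",")
for country in countries:
    print(build_command(country, "/content/drive/MyDrive/RetrieveData/OldCountryFasta"))
```

Printing the commands first, rather than running them, makes it easy to check the quoting of country names that contain spaces before anything is downloaded.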

In [None]:
# run this cell to execute the bash script that downloads the data with Entrez Direct
''' example code:
! source ./Haplogroup/finalCodes/bashScriptCodes/downloadDataNCBI.sh; download "DataListName" "NameOfSaveFolder"
'''
! source ./Haplogroup/finalCodes/bashScriptCodes/downloadDataNCBI.sh; download /content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt /content/drive/MyDrive/RetrieveData/OldCountryFasta

In [None]:
%%bash
# check the number of sequences in each country fasta file after downloading
DataList=/content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt
Field_Separator=$IFS
IFS=,
for val in `cat $DataList`
do echo $val; cat /content/drive/MyDrive/RetrieveData/OldCountryFasta/$val.fasta | grep ">" | wc -l; done
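
The same count can also be done in Python. This is a minimal sketch, not code from the repository; `count_fasta_records` is a hypothetical helper. Unlike `grep ">"`, which matches `>` anywhere in a line, it only counts lines that start with `>`, which is how FASTA headers begin.

```python
import os
import tempfile

def count_fasta_records(path):
    """Count FASTA records as the number of lines that start with '>'."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Tiny demonstration with a made-up two-record file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(">seq1 example\nACGT\n>seq2 example\nGGCC\n")
    demo_count = count_fasta_records(demo_path)

print(demo_count)  # 2
```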

#### **Missing data**

After retrieving the data with the common keywords above, I realized that some data were still missing and that I could get them by searching with more specific keywords. Below is how I retrieved more data for each specific country. I saved the missing data of each country directly into the existing big fasta file named after that country.

**Myanmar** </br>
New Data:
- 21219640.Inland post-glacial dispersal in East Asia revealed by mitochondrial haplogroup M9a'b: Myanmar, Vietnam (filter them): keywords for Myanmar: HM346895, HM346896
- 24467713.Summerer et al. (2014).txt: Myanmar: missing 327 small coding region JX288765-JX289091 among 371 files of this article (keywords: Large-scale mitochondrial DNA analysis in Southeast Asia reveals evolutionary effects of cultural isolation in the multi-ethnic population of Myanmar)
- 25826227.Li et al. (2015).txt: all 937 files for Myanmar but only 92 files exist (keywords: Ancient inland human dispersals from Myanmar into interior East Asia since the Late Pleistocene)


In [None]:
%%bash
# downloading missing data of Myanmar
for str in "HM346895" "HM346896" "Large-scale mitochondrial DNA analysis in Southeast Asia reveals evolutionary effects of cultural isolation in the multi-ethnic population of Myanmar" "Ancient inland human dispersals from Myanmar into interior East Asia since the Late Pleistocene"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Myanmar"
done

**Philippines** </br>
New data:
+ 21281460.Loo et al. (2011).txt: 46 files for the Philippines in total, but only 12 already existed (keyword: Genetic affinities between the Yami tribe people of Orchid Island and the Philippine Islanders of the Batanes archipelago and Philippines)
+ 21796613.Scholes et al. (2011).txt: 60 files for the Philippines, but only 9 already existed (keywords: Genetic diversity and evidence for population admixture in Batak Negritos from Palawan)
+ Philippines: '28535779.Carriers of mitochondrial DNA macrohaplogroup R colonized Eurasia.Larruga et al. (2017)': keywords: "Carriers of mitochondrial DNA macrohaplogroup R colonized Eurasia AND Philippines"

In [None]:
%%bash
# downloading missing data of Philippines
for str in "Genetic affinities between the Yami tribe people of Orchid Island and the Philippine Islanders of the Batanes archipelago and Philippines" "Genetic diversity and evidence for population admixture in Batak Negritos from Palawan" "Carriers of mitochondrial DNA macrohaplogroup R colonized Eurasia AND Philippines"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Philippines"
done

**Singapore** </br>
New data:
+ 36382667.Sui et al. (2023).txt: 7 files, of which 5 already existed, so there are 2 more, but it is unclear whether these 2 reference sequences are from Singapore (keywords: Death associated protein‑3 (DAP3) and DAP3 binding cell death enhancer‑1 (DELE1) in human colorectal cancer, and their impacts on clinical outcome and chemoresistance)
+ 37025097.Zhao et al. (2023).txt: 14 files, of which 11 already existed, but it is unclear whether the other 3 belong to Singapore (keywords: Effect of COP1 in Promoting the Tumorigenesis of Gastric Cancer by Down-Regulation of CDH18 via PI3K/AKT Signal Pathway)

In [None]:
%%bash
# downloading missing data of Singapore
for str in "Death associated protein‑3 (DAP3) and DAP3 binding cell death enhancer‑1 (DELE1) in human colorectal cancer, and their impacts on clinical outcome and chemoresistance" "Effect of COP1 in Promoting the Tumorigenesis of Gastric Cancer by Down-Regulation of CDH18 via PI3K/AKT Signal Pathway"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Singapore"
done

**Thailand** </br>
New data:
+ 11310578.Mitochondrial DNA polymorphisms in Thailand.Fucharoen et al. (2001)
+ 19148289.The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers.Jin et al. (2009): Thai (80, keyword: The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers and Thailand), Vietnam (84, keyword: Viet Nam)
+ 27837350.Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups and demic diffusion in the spread of Tai–Kadai languages. Kutanan et al. (2017): Thai, Lao (search on the table of isolate name): total 1234
+ 32304863.A Matrilineal Genetic Perspective of Hanging Coffin Custom in Southern China and Northern Thailand (its old name is Unpublished.The Population History and Cultural Dispersal Pattern of Hanging.Zhang et al. (2020))


In [None]:
# Special case
'''27837350.Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups and demic diffusion in the
spread of Tai–Kadai languages. Kutanan et al. (2017):I used the title of this paper to download the data, but the data of countries Thailand and Lao are mixed (search on the table of isolate name):
total 1234: Laos: LUA101-LUA149: LA1 + VIE101-VIE149: LA2; the others are Thai'''
! ${HOME}/edirect/esearch -db nucleotide -query "Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups" -sort "Date Released" |  ${HOME}/edirect/efetch -format fasta >> /content/drive/MyDrive/RetrieveData/OldCountryFasta/Laos_Thai.fasta

One problem with downloading the data from the paper "Complete mitochondrial genomes of Thai and Lao populations indicate an ancient origin of Austroasiatic groups" is that the downloaded mtDNA sequences from Thailand and Laos are mixed together. To separate the Thailand sequences from the Laos sequences so that I could save them into the correct country file, I ran the function `splitLaosThai` below.
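
To illustrate the splitting rule noted in the cell above (isolates LUA101-LUA149 and VIE101-VIE149 are from Laos, all others are Thai), here is a minimal sketch. It is not the actual `splitLaosThai` implementation: the helper names are hypothetical, and it assumes the FASTA header contains the text `isolate <name>`, which may not hold for every record.

```python
import re

# Laos isolates per the isolate table: LUA101-LUA149 and VIE101-VIE149
LAOS_PATTERN = re.compile(r"^(LUA|VIE)(\d+)$")

def isolate_from_header(header):
    """Extract the isolate name from a FASTA header, or None if absent.
    Assumes the header contains 'isolate <name>'."""
    match = re.search(r"isolate (\S+)", header)
    return match.group(1) if match else None

def country_of_isolate(isolate):
    """Return 'Laos' for LUA101-LUA149 / VIE101-VIE149, else 'Thailand'."""
    match = LAOS_PATTERN.match(isolate or "")
    if match and 101 <= int(match.group(2)) <= 149:
        return "Laos"
    return "Thailand"

# Made-up header used only to show the flow
header = ">XX000000.1 Homo sapiens isolate LUA101 mitochondrion, complete genome"
print(country_of_isolate(isolate_from_header(header)))  # Laos
```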

In [None]:
# Split the Laos and Thailand sequences from the paper and save them, adding only sequences that do not already exist
from Haplogroup.finalCodes.DataWrangling import splitLaosThai
splitLaosThai.splitLaosThai("/content/drive/MyDrive/RetrieveData/OldCountryFasta/Laos_Thai.fasta")

In [None]:
%%bash
# downloading missing data of Thailand
for str in "Mitochondrial DNA polymorphisms in Thailand" "The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers AND Thailand" "The Population History and Cultural Dispersal Pattern of Hanging"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Thailand"
done

**Laos** </br>
New Data:
- 21333001.Southeast Asian diversity: first insights into the complex mtDNA structure of Laos.Bodner et al. (2011): 214

In [None]:
# downloading missing data of Laos
!source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "Southeast Asian diversity: first insights into the complex mtDNA structure of Laos" "Laos"

**Timor-Leste** </br>
New data:
+ Genetic admixture history of Eastern Indonesia as revealed by Y-chromosome and mitochondrial DNA analysis.Mona et al. (2009): Despite no clear location/country on the sequences' names, this article mentions country Timor (330 files)
+ 25757516.Gomes et al. (2015).txt: this study has 324 files which are all from East Timor (Timor-Leste) (KJ655583-KJ655889: D-loop, KJ676774-KJ676790: complete genome) (keywords: Human settlement history between Sunda and Sahul: a focus on East Timor (Timor-Leste) and the Pleistocenic mtDNA diversity)

In [None]:
%%bash
# downloading missing data of Timor-Leste
for str in "Genetic admixture history of Eastern Indonesia as revealed by Y-chromosome and mitochondrial DNA analysis" "Human settlement history between Sunda and Sahul: a focus on East Timor (Timor-Leste) and the Pleistocenic mtDNA diversity"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Timor-Leste"
done

**Viet Nam** </br>
New data:
+ 21219640.Inland post-glacial dispersal in East Asia revealed by mitochondrial haplogroup M9a'b: Myanmar, Vietnam: keywords for VN: HM346881, HM346883, HM346885, HM346886, HM346889
+ .Direct Submission.VN.Phan et al. (2016): Vietnam (DQ834255, DQ834258)
+ 19148289.The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers.Jin et al. (2009): Vietnam (84, keyword: The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers and Viet Nam)
+ 20513740.Tracing the Austronesian footprint in Mainland Southeast Asia: a perspective from mitochondrial DNA.Peng et al. (2010): 335 (Cham+Kinh)
+ '.Direct Submission.Phan et al. (2016)': there are 10 files for VN when searching for the keyword “Phan (2016) AND Homo sapiens AND mitochondrion”, 2 of which already existed. These records have no title, so I still cannot find the article, and there is no explanation for the isolate; the isolate field only has “VN”

In [None]:
%%bash
# downloading missing data of Vietnam
for str in "HM346881" "HM346883" "HM346885" "HM346886" "HM346889" "DQ834255" "DQ834258" "The Peopling of Korea Revealed by Analyses of Mitochondrial DNA and Y-Chromosomal Markers AND Viet Nam" "Tracing the Austronesian footprint in Mainland Southeast Asia: a perspective from mitochondrial DNA" "Phan (2016) AND Homo sapiens AND mitochondrion"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Viet Nam"
done

**Malaysia** </br>
1. Search term "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malaysia":
- The missing data for Malaysia are 267 files, but all of them are control region sequences
2. Search term "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malay":
- Missing 4 complete genomes of Malaysia: </br>
2: 9_N21(Tor57), 10_M21c(Tor61): Aboriginal Malay (Semelai) (using keyword Malay) </br>
2: 7_N22(Tor55), 12_M22(Tor63): Aboriginal Malay (Temuan) (using keyword Malayu)
3. 16982817.Hill et al. (2006): 6 files, one of which (ORA131B) already existed while the others did not (keywords: Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Malaysia)
4. 22729749.Evolutionary history of continental southeast asians: 'early train' hypothesis based on genetic analysis of mitochondrial and autosomal DNA data.Jinam et al. (2012): Malay: 86 genomes: 23 Bidayuh (BD); 24 Jehai (JH); 21 Seletar (SL); 18 Temuan (TM)

In [None]:
%%bash
# downloading missing data of Malaysia
for str in "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malaysia" "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malay" "Single, rapid coastal settlement of Asia revealed by analysis of complete mitochondrial genomes AND Malayu" "Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Malaysia" "Evolutionary history of continental southeast asians: 'early train' hypothesis based on genetic analysis of mitochondrial and autosomal DNA data"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Malaysia"
done

**Indonesia** </br>
New Data:
- '21407194.Larger mitochondrial DNA than Y-chromosome differences between.Gunnarsdottir et al. (2011)'. Keyword: Larger mitochondrial DNA than Y-chromosome differences between matrilocal and patrilocal groups from Sumatra (72 files)
- 21407194.Gunnarsdottir et al. (2011): HM596654
- 16982817.Hill et al. (2006): 97 files and 4 of them existed (DQ981465-68), but the others did not (keyword: Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Indonesia)
- Unpublished.Ngili et al. (2009).txt: all 206 files from Indonesia

In [None]:
%%bash
# downloading missing data of Indonesia
for str in "Larger mitochondrial DNA than Y-chromosome differences between matrilocal and patrilocal groups from Sumatra" "HM596654" "Phylogeography and ethnogenesis of aboriginal Southeast Asians AND Indonesia" "Ngili (2009)"; do
source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "$str" "Indonesia"
done

**Cambodia** </br>
New Data:
- Keywords: "Analysis of mitochondrial genome diversity identifies new" (1248 seqs)


In [None]:
# downloading missing data of Cambodia
! source /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getMissingData.sh; getMissingData "Analysis of mitochondrial genome diversity identifies new" "Cambodia"

After collecting the missing data, I checked the number of sequences in each country fasta file in the OldCountryFasta folder again.

In [None]:
%%bash
# check the number of sequences in each country fasta file
DataList=/content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt
Field_Separator=$IFS
IFS=,
for val in `cat $DataList`
do echo $val; cat /content/drive/MyDrive/RetrieveData/OldCountryFasta/$val.fasta | grep ">" | wc -l; done

### **2. Data wrangling**

Caveat:
- As mentioned in the caveats above, the numbers of sequences in Dataset 1 and Dataset 2 below are the new numbers from when I re-ran the code. In the Dataset 3 section, I filter the result down to only the 4932 original sequences that I used in our paper.

#### **Dataset 1**
- After getting all the raw data of the 11 countries above, I checked whether there were any duplicated sequences within each country fasta file and also between the 11 countries, and made sure I kept only the unique sequences. For sequences whose accession number appeared more than once in the names, I removed the duplicates and then saved the unique accession numbers in a file called "UniqueAccNumForDataset1.txt".
- Dataset 1 still contains the reference, D-loop, non-Homo sapiens, etc. sequences.

  First, I checked how many raw sequences there were before removing the duplicated ones by running the bash script below.

In [None]:
%%bash
# create a list of all sequences in original data of 11 countries
DataList=/content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt
Field_Separator=$IFS
IFS=,
for country in `cat $DataList`
do cat /content/drive/MyDrive/RetrieveData/OldCountryFasta/"$country".fasta | grep ">" ; done > /content/drive/MyDrive/Haplogroup/finalCodes/others/allSeq.txt
# check number of all raw seqs
echo "Number of raw seqs of 11 countries: " `cat /content/drive/MyDrive/Haplogroup/finalCodes/others/allSeq.txt | wc -l`

Number of raw seqs of 11 countries:  9660


After finding that there were 9660 raw sequences, I ran the `checkDuplicated` function to count how many sequences have unique accession numbers and how many are duplicates.

In [None]:
# check for duplicated accession numbers among all sequences
from Haplogroup.finalCodes.DataWrangling.Dataset1 import checkDuplicated
# run function
uniq, dupl = checkDuplicated.checkDuplicated()
len(uniq), len(dupl)

(9605, 55)

In [None]:
# save the unique list of accession numbers for Dataset 1
from Haplogroup.finalCodes.DataWrangling import saveFile
saveFile.saveFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/UniqueAccNumForDataset1.txt", ",".join(uniq))

There were 9605 unique sequences after removing the duplicated ones. I then saved the unique accession numbers of these sequences at /content/drive/MyDrive/Haplogroup/finalCodes/others/UniqueAccNumForDataset1.txt, which was used in the next filtering step to get Dataset 2.

The bash script below takes the file "UniqueAccNumForDataset1.txt" as input to retrieve brief information about each sequence (the authors, title, PubMed ID, isolate name, organism). This information was used later for the Dataset 3 filtering and for making the tables.

After running the bash script, the information about the sequences of Dataset 1 was saved at /content/drive/MyDrive/Haplogroup/finalCodes/others/Dataset1_accnumInfo.txt
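
The deduplication idea can be sketched as follows. This is not the actual `checkDuplicated` code: the function name is hypothetical, and it assumes that the accession number is the first token of each FASTA header and that the version suffix (such as `.1`) is dropped. It keeps the first occurrence of each accession number as unique and counts every repeat as a duplicate, which is one reading consistent with the counts above (9605 + 55 = 9660).

```python
def split_unique_duplicated(headers):
    """Split accession numbers into first occurrences and repeats."""
    seen = set()
    uniq, dupl = [], []
    for header in headers:
        header = header.strip()
        if not header:
            continue  # skip blank lines
        # Assumed header layout: ">ACCESSION.VERSION description"
        accession = header.lstrip(">").split()[0].split(".")[0]
        if accession in seen:
            dupl.append(accession)
        else:
            seen.add(accession)
            uniq.append(accession)
    return uniq, dupl

# Made-up headers for illustration
demo_headers = [
    ">AB000001.1 Homo sapiens mitochondrion, complete genome",
    ">AB000002.1 Homo sapiens mitochondrion, complete genome",
    ">AB000001.1 Homo sapiens mitochondrion, complete genome",
]
demo_uniq, demo_dupl = split_unique_duplicated(demo_headers)
print(len(demo_uniq), len(demo_dupl))  # 2 1
```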

In [None]:
# get information of all the unique sequences of dataset 1
! bash /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/Dataset1AccnumInfo.sh

#### **Dataset 2**

- After obtaining Dataset 1, I ran the function `Dataset2` to remove the reference and non-Homo sapiens sequences.

- This function uses the file "UniqueAccNumForDataset1.txt" to read the information of the Dataset 1 sequences, removes the reference genome, non-Homo sapiens, and D-loop sequences, and finally saves the remaining ones in a new file at /content/drive/MyDrive/Haplogroup/finalCodes/others/UniqueAccNumForDataset2.txt. </br>
After running it, only 7082 sequences remained in Dataset 2.
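
As a rough illustration of this kind of filter, here is a sketch under several assumptions that are mine rather than taken from the `Dataset2` code: each sequence is represented as a dictionary with `accession`, `organism`, and `title` fields, a reference genome is recognized by the RefSeq prefix `NC_`, and a D-loop sequence is recognized by the text "D-loop" in its title. The real function may use different fields and rules.

```python
def keep_for_dataset2(record):
    """Return True if the record should stay in Dataset 2 (illustrative rules)."""
    if record["accession"].startswith("NC_"):   # assumed marker of a reference genome
        return False
    if record["organism"] != "Homo sapiens":    # drop non-human sequences
        return False
    if "d-loop" in record["title"].lower():     # assumed marker of a D-loop sequence
        return False
    return True

# Placeholder records; only NC_012920 is a real accession (the human mtDNA reference)
demo_records = [
    {"accession": "NC_012920", "organism": "Homo sapiens",
     "title": "Homo sapiens mitochondrion, complete genome"},
    {"accession": "ACC0001", "organism": "Homo sapiens",
     "title": "Homo sapiens isolate X1 mitochondrion, complete genome"},
    {"accession": "ACC0002", "organism": "Homo sapiens",
     "title": "Homo sapiens isolate X2 D-loop, partial sequence; mitochondrial"},
    {"accession": "ACC0003", "organism": "Macaca fascicularis",
     "title": "Macaca fascicularis mitochondrion, complete genome"},
]
demo_kept = [r["accession"] for r in demo_records if keep_for_dataset2(r)]
demo_removed = [r["accession"] for r in demo_records if not keep_for_dataset2(r)]
print(demo_kept)  # ['ACC0001']
```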





In [None]:
# calling function
from Haplogroup.finalCodes.DataWrangling.Dataset2 import Dataset2
UniqListDataset2, RemovedData2 = Dataset2.Dataset2()
len(UniqListDataset2)

7082

#### **Dataset 3**

**Setting up before the main purposes of the Dataset 3 section below**
- After getting Dataset 2, I ran the function `Dataset3` to remove the control region sequences from Dataset 2 and keep only the complete genomes, which are saved in a new file /content/drive/MyDrive/Haplogroup/finalCodes/others/UniqueAccNumForDataset3.txt.

- The sequences of Dataset 3 are the final sequences used in this study. The 5409 sequences below are the updated result from re-running the code, but for our study I only used the original Dataset 3 sequences (4932). Therefore, I ran an additional step to extract the same 4932 original sequences from the updated 5409 sequences.

In [None]:
# calling function
from Haplogroup.finalCodes.DataWrangling.Dataset3 import Dataset3
UniqListDataset3, RemovedData3 = Dataset3.Dataset3()
len(UniqListDataset3)

5409

Below is the code that reads the file "allSeqsDataset3_Ori.txt", which contains the names of the original 4932 sequences. It checks whether each of them exists among the updated 5409 sequences and saves the filtered Dataset 3 accession numbers at /content/drive/MyDrive/Haplogroup/finalCodes/others/UniqueAccNumForDataset3_4932.txt.

In [None]:
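# openFile is assumed to live next to saveFile in the DataWrangling package
from Haplogroup.finalCodes.DataWrangling import openFile, saveFile

others = "/content/drive/MyDrive/Haplogroup/finalCodes/others/"
# read the updated Dataset 3 accession numbers once instead of on every iteration
updatedDataset3 = openFile.openFile(others + "UniqueAccNumForDataset3.txt")
dataset3 = []
for string in openFile.openFile(others + "allSeqsDataset3_Ori.txt").split("\n"):
  if len(string)>0:
    # the accession number is the third dot-separated field of the sequence name
    accnum = string.split(".")[2]
    if accnum in updatedDataset3:
      dataset3.append(accnum)
saveFile.saveFile(others + "UniqueAccNumForDataset3_4932.txt", ','.join(dataset3))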
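
The membership test in the cell above checks whether the accession number occurs anywhere in the file text, which is a substring match. A stricter variant is sketched here, assuming that "UniqueAccNumForDataset3.txt" is comma-separated like the other accession lists in this notebook. The function name `filter_original` and the sample names are hypothetical.

```python
def filter_original(original_names, updated_text):
    """Keep accession numbers of original sequences that are present,
    as exact entries, in the comma-separated updated list."""
    updated = {acc.strip() for acc in updated_text.split(",") if acc.strip()}
    kept = []
    for name in original_names:
        if not name:
            continue  # skip empty lines
        accnum = name.split(".")[2]  # third dot-separated field of the name
        if accnum in updated:
            kept.append(accnum)
    return kept

# Made-up data: "AB12345" is a substring of "AB123456" but not an exact entry
demo_updated = "AB123456,CD000001"
demo_names = ["Laos.X1.AB12345.M7", "Thailand.T2.CD000001.B5a", ""]
print(filter_original(demo_names, demo_updated))  # ['CD000001']
```

With a plain substring test, the first made-up name would also have been kept, so the set-based check is the safer choice whenever one accession number can be a prefix of another.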

- The final 4932 sequences at this point are the total number of unique sequences across all 11 countries, and they are named after their unique accession numbers. When I looked at the duplicated sequences, I found that they were duplicated because, although I typed a different country name each time I downloaded the data, some sequences with the same accession number were still returned for 2 different countries. For example, "GQ119010" exists in Indonesia and also in the Philippines. Moreover, the following steps need the country information of each sequence, so I had to decide which country each such accession number should belong to.
- To resolve these cases, I checked whether there were any overlapping accession numbers between the 11 countries and saved them in the file /content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932.txt. After identifying the duplicated ones, I used ESummary from Entrez Direct to get more specific information about these sequences. I ran the bash script below to download this information and saved it in the file /content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932_moreInfo.txt.
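
The overlap check can be sketched as below. This is only an illustration, not the `assignCountry` implementation; the function name is hypothetical, and apart from "GQ119010" (the example mentioned above) the accession numbers are placeholders.

```python
from collections import defaultdict

def find_shared_accessions(acc_by_country):
    """Return {accession: [countries]} for accessions found in more than one country."""
    countries_of = defaultdict(list)
    for country, accessions in acc_by_country.items():
        for accession in set(accessions):  # ignore repeats within one country
            countries_of[accession].append(country)
    return {acc: sorted(countries)
            for acc, countries in countries_of.items()
            if len(countries) > 1}

demo_acc_by_country = {
    "Indonesia": ["GQ119010", "ACC0001"],
    "Philippines": ["GQ119010", "ACC0002"],
    "Laos": ["ACC0003"],
}
print(find_shared_accessions(demo_acc_by_country))
# {'GQ119010': ['Indonesia', 'Philippines']}
```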

In [None]:
# Check whether there are any duplicated sequences and what they are (the output here shows that there are 37 duplicated sequences among all 4932 sequences)
from Haplogroup.finalCodes.DataWrangling.Miscellaneous import assignCountry
list_acc_country, dup_acc_country = assignCountry.assignCountry()
len(list_acc_country), len(dup_acc_country)

(4932, 37)

In [None]:
# save the duplicated seqs to get more info about them
saveFile.saveFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932.txt",','.join(dup_acc_country))

In [None]:
%%bash
# bash script to get more information about the duplicates in order to decide which country each one belongs to
AccList=/content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932.txt
Field_Separator=$IFS
IFS=,

for val in `cat $AccList`
do echo $val >> /content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932_moreInfo.txt
${HOME}/edirect/esummary -db nuccore -id $val -format medline | egrep "country" >> /content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932_moreInfo.txt
done

After learning more about which specific country each sequence should be in, I ran the function `uniqAccCountry` to assign each sequence to its country. This step is necessary because I need the country information of the sequences to create the tables below, which contain their detailed information, and also to change their accession number names to the new names.

In [None]:
from Haplogroup.finalCodes.DataWrangling.Miscellaneous import uniqAccCountry
uniq_acc_country = uniqAccCountry.uniqAccCountry(list_acc_country,"/content/drive/MyDrive/Haplogroup/finalCodes/others/duplicatedSeqOf4932_moreInfo.txt",dup_acc_country)

In [None]:
from Haplogroup.finalCodes.DataWrangling import saveFile
saveFile.saveFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/uniq_acc_country.txt",uniq_acc_country)

**In this Dataset 3 section, the main purposes are:**

***1. SplitUp:*** Split all the sequences in each country's big fasta file downloaded from Entrez Direct into small individual files, each containing one sequence. Then I saved those small files in a subfolder of the Dataset3 folder named after the country of the big fasta file. (Format of folder: Dataset3/<"CountryName">/)

***2. CreateNewName:*** After creating a new file for each sequence in the folder, I renamed the sequences using the format "Country.Isolate.AccessionNumber.Haplogroup", and used this new name to replace both the old name inside the sequence fasta file and the name of the file.

***3. Merging:*** Merge the newly labeled sequences of each country into one big file for that country, saved at Dataset3/<"CountryName">/<"CountryName">_NewBigFile.fasta.

Before carrying out these steps, I had to create the Dataset3 folder and its subfolders, which are named after the 11 countries.
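
The three steps above can be sketched in a few lines of Python. This is a simplified illustration working on strings rather than files, not the `splitSeq` class from the repository; the function names, the isolate, the accession number, and the haplogroup in the example are all placeholders.

```python
def split_fasta(text):
    """Step 1: split multi-FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records

def make_new_name(country, isolate, accession, haplogroup):
    """Step 2: build a name in the format Country.Isolate.AccessionNumber.Haplogroup."""
    return f"{country}.{isolate}.{accession}.{haplogroup}"

def to_fasta(name, sequence, width=70):
    """Format one record; joining such records gives the merged file of step 3."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

demo_text = ">ACC0001.1 first record\nACGT\nACGT\n>ACC0002.1 second record\nGGCC\n"
demo_records = split_fasta(demo_text)
demo_name = make_new_name("Laos", "X1", "ACC0001", "M7")
print(to_fasta(demo_name, demo_records[0][1]), end="")
```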

In [None]:
%%bash
# create Dataset3 folder and its subset folders
mkdir /content/drive/MyDrive/RetrieveData/Dataset3/

DataList=/content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt
Field_Separator=$IFS
IFS=,
for val in `cat $DataList`
do mkdir /content/drive/MyDrive/RetrieveData/Dataset3/"$val"/
done

1. SplitUp

In [None]:
from Haplogroup.finalCodes.DataWrangling.Dataset3 import splitSeq
s = splitSeq.splitSeq()
s.splitFastaSeq(uniq_acc_country)
! ls /content/drive/MyDrive/RetrieveData/Dataset3/*/* > /content/drive/MyDrive/Haplogroup/finalCodes/others/nameAccSeqsDataset3.txt
saveFile.saveFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/nameAccSeqsDataset3.txt", ','.join(openFile.openFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/nameAccSeqsDataset3.txt").split("\n")[:-1]))

2. CreateNewName

Because I wanted to create a new name having the haplogroup, firstly I had to run Haplogrep and create a file saving the information of these haplogroups.

In [None]:
# Download haplogrep
!curl -sL haplogrep.now.sh | bash

In [None]:
!bash /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/haplogroup.sh

Create a new name

In [None]:
from Haplogroup.finalCodes.DataWrangling.Dataset3 import splitSeq
s = splitSeq.splitSeq()
listOfNewName = []
! ls /content/drive/MyDrive/RetrieveData/Dataset3/*/* > /content/drive/MyDrive/Haplogroup/finalCodes/others/nameAccSeqsDataset3.txt
for name in openFile.openFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/nameAccSeqsDataset3.txt").split("\n")[:-1]:
  newName = s.createNewName(name)
  listOfNewName.append(newName)

In [None]:
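%%bash
# after creating the new files with the new names, remove the old files named by accession number
# nameAccSeqsDataset3.txt may be newline- or comma-separated, so normalize to one path per line
DataList=/content/drive/MyDrive/Haplogroup/finalCodes/others/nameAccSeqsDataset3.txt
tr ',' '\n' < "$DataList" | while IFS= read -r val
do [ -n "$val" ] && rm -- "$val"
done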

3. Merging

In [None]:
! ls /content/drive/MyDrive/RetrieveData/Dataset3/*/* > /content/drive/MyDrive/Haplogroup/finalCodes/others/newNameSeqsDataset3_4932.txt
s = splitSeq.splitSeq()
AllNewSeqList = openFile.openFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/newNameSeqsDataset3_4932.txt").split("\n")[:-1]
for country in openFile.openFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt").split(","):
  s.mergeSeqsBasedOnCountry(AllNewSeqList,country)

Brunei done
Cambodia done
Indonesia done
Laos done
Malaysia done
Myanmar done
Philippines done
Singapore done
Thailand done
Timor-Leste done
Viet Nam done


In [None]:
%%bash
# check the number of files in each country folder
DataList=/content/drive/MyDrive/Haplogroup/finalCodes/others/countries.txt
Field_Separator=$IFS
IFS=,
for val in `cat $DataList`
do echo $val; ls /content/drive/MyDrive/RetrieveData/Dataset3/$val/ | wc -l; done

Brunei
9
Cambodia
399
Indonesia
657
Laos
60
Malaysia
152
Myanmar
152
Philippines
447
Singapore
2
Thailand
2340
Timor-Leste
18
Viet Nam
707
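
These per-folder counts add up to 4943 rather than 4932. One plausible reading, assuming each country folder also holds its merged `<CountryName>_NewBigFile.fasta` from the merging step, is that `ls` counts that extra file in each of the 11 folders. The arithmetic can be checked as follows.

```python
# File counts per country folder, copied from the output above
file_counts = {
    "Brunei": 9, "Cambodia": 399, "Indonesia": 657, "Laos": 60,
    "Malaysia": 152, "Myanmar": 152, "Philippines": 447, "Singapore": 2,
    "Thailand": 2340, "Timor-Leste": 18, "Viet Nam": 707,
}

total_files = sum(file_counts.values())
# Assumption: one merged *_NewBigFile.fasta per country folder
merged_files = len(file_counts)

print(total_files)                 # 4943
print(total_files - merged_files)  # 4932
```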


### **3. Tables**

In this section, I created the tables, in particular the Isolate Explanation table, which contains the main information about the 4932 sequences in this study. Before doing that, I had to do the setup below to get the brief information of the sequences.

**Setting up before creating the tables**
-  ***Getting year:*** From the original file "Dataset1_accnumInfo.txt", I extracted the accession numbers of Dataset 3 and put their information into "Dataset3_accnumInfo.txt". In addition to this file, I also wanted the publication year of the reference papers for the Isolate Explanation table, so I ran esummary from Entrez Direct again and saved the year information to a new file /content/drive/MyDrive/Haplogroup/finalCodes/others/YEAR_Data3.txt.
- ***Getting polymorphism:*** I also wanted the polymorphisms of each sequence, but the Haplogrep run above only gave the haplogroup name without polymorphisms, so I ran Haplogrep again with the extend parameter and saved these polymorphisms into a new file.

***Getting year***

Two kinds of year:
1. Year from published papers (PubDate)
2. Year at LOCUS (used if I cannot find the year of a published paper):
- For Direct Submission sequences, some submissions still have the title of a paper; I called this case "Unpublished".
- If the sequences are Direct Submissions without a paper title, I called this case "Direct Submission".
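
The rules above can be written down as a small decision function. This is a sketch of the logic only, not the `yearFor4932` code: the function names and arguments are hypothetical, the label "Published" is my own placeholder for sequences that have a PubMed ID, and treating a title equal to "Direct Submission" as "no paper title" is an assumption.

```python
def reference_label(pubmed_id, title):
    """Classify the reference of a sequence following the rules above."""
    if pubmed_id:
        return "Published"  # placeholder label for sequences with a published paper
    if title and title.strip().lower() != "direct submission":
        return "Unpublished"  # submitted directly but still carries a paper title
    return "Direct Submission"  # submitted directly without a paper title

def pick_year(pub_year, locus_year):
    """Prefer the PubDate year; fall back to the year at LOCUS."""
    return pub_year if pub_year is not None else locus_year

print(reference_label(None, "Some paper title"))  # Unpublished
print(pick_year(None, 2016))                      # 2016
```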

In [None]:
from Haplogroup.finalCodes.DataWrangling.Dataset3 import yearFor4932
y = yearFor4932.YearFor4932()
pubID, noID = y.createYearInfo4932()
# run bash script to get year from LOCUS and also PUBDATE
! bash /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/getYearInfo.sh
# create a year file after having year from LOCUS and PUBDATE
y.createYearFile(pubID)

done


***Getting polymorphism***

Caveat: In the folder "Viet Nam", the new names of the sequences contained the words "Viet Nam", with a space between "Viet" and "Nam". When Haplogrep reads the sequences, it splits the name at the space and only keeps the first word ("Viet"). To fix this problem, I created a new folder "Vietnam" and saved all the files from the "Viet Nam" folder into it again, changing "Viet Nam" to "Vietnam" in both the file names and the sequence names inside the files.
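
As a rough illustration of the renaming step described above (this is not the code from the repository), the sketch below copies FASTA files to a new folder while replacing "Viet Nam" with "Vietnam" in file names and header lines. The function names and the sequence name `Viet Nam.iso1.XX000000.M7` are made up for the example.

```python
import tempfile
from pathlib import Path

def fix_vietnam_name(text):
    """Replace 'Viet Nam' with 'Vietnam' in a file name or FASTA header."""
    return text.replace("Viet Nam", "Vietnam")

def copy_with_fixed_names(src_dir, dst_dir):
    """Copy every .fasta file from src_dir to dst_dir with fixed names and headers."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(src_dir).glob("*.fasta")):
        lines = fasta.read_text().splitlines()
        # Only header lines (starting with ">") carry the sequence name.
        fixed = [fix_vietnam_name(l) if l.startswith(">") else l for l in lines]
        (dst / fix_vietnam_name(fasta.name)).write_text("\n".join(fixed) + "\n")

# Demo on a throwaway folder containing one made-up sequence file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Viet Nam"
    src.mkdir()
    (src / "Viet Nam.iso1.XX000000.M7.fasta").write_text(
        ">Viet Nam.iso1.XX000000.M7\nACGT\n"
    )
    copy_with_fixed_names(src, Path(tmp) / "Vietnam")
    fixed_files = sorted(p.name for p in (Path(tmp) / "Vietnam").iterdir())
    fixed_header = (Path(tmp) / "Vietnam" / fixed_files[0]).read_text().splitlines()[0]
```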

In [None]:
! mkdir /content/drive/MyDrive/RetrieveData/Dataset3/Vietnam
! ls /content/drive/MyDrive/RetrieveData/Dataset3/Viet*/* >> /content/drive/MyDrive/Haplogroup/finalCodes/others/FixVietnam.txt

In [None]:
from Haplogroup.finalCodes.DataWrangling.Miscellaneous import fixSpaceBetweenVietnam
fixSpaceBetweenVietnam.fixSpaceBetweenVietnam()

done


In [None]:
# remove the old file newNameSeqsDataset3_4932.txt having the sequences including the words "Viet Nam"
! rm /content/drive/MyDrive/Haplogroup/finalCodes/others/newNameSeqsDataset3_4932.txt

# remove the old folder "Viet Nam" after having the new one "Vietnam"
! rm -r "/content/drive/MyDrive/RetrieveData/Dataset3/Viet Nam"

# get the name of all 4932 sequences for the polymorphism running below
! ls  /content/drive/MyDrive/RetrieveData/Dataset3/*/* >> /content/drive/MyDrive/Haplogroup/finalCodes/others/newNameSeqsDataset3_4932.txt

# run bash script to get polymorphism
! bash /content/drive/MyDrive/Haplogroup/finalCodes/bashScriptCodes/Get4932polymorphism.sh

#### **Isolate Explanation Table**

After collecting all 4932 sequences from the 11 countries in Southeast Asia, I created a final table "CompleteFullIsoTab" which contains the information of the sequences.
1. First, I made a draft table called IsolateExplanation, which includes the columns:
- ID: A running index from 0 to 4931.
- Reference: The combination of 4 columns: pubmedID, title, Author(s), and year.
- pubmedID: The PubMed ID of the paper, with two special cases: if the author(s) submitted the sequences directly without any paper title, pubmedID is left blank; if there was a paper title but I could not find the paper, I assumed it was unpublished and wrote "Unpublished".
- title: The title of the paper. For sequences whose paper I could not find and whose only title in the summary was "Direct Submission", I kept "Direct Submission".
- year: The year the paper was published. For Direct Submission or Unpublished cases, I used the year at LOCUS in the sequence summary on NCBI nuccore.
- Author(s): If there is more than one author, I wrote "et al" after the first author's name. If there is only one, I wrote that author's name.
- AccessionNumber: Accession number of the sequence.
- name: The name of the sequence, labelled in the format "Country.Isolate.AccessionNumber.Haplogroup".
- Country: The country where the samples of the sequences were obtained.
- Isolate: The isolate name written in the NCBI summary of the sequence.
- Explanation: The meaning of the isolate name (explained in more detail below).
- Location: The place where the sampled person lives or was born.
- Language: The language that the sampled person might speak.
- Population: The population that the sampled person might belong to.
- haplo: The specific haplogroup name, taken from either the paper of the sequence or the Haplogrep tool.
- haplogroup1: The first capital letter of the haplogroup in the haplo column.
- haplogroup2: The capital letter(s) and the next positive number (if the capital letter(s) are not directly followed by a number, I took the next single character, such as "+").
- haplogroup3: The uppercase letter(s), the next positive number or next single character, and then the next lowercase letter (if it exists), or the next number (1-100) if the preceding part is not a number but a single character such as "+".
- Polymorphism: Polymorphisms of the sequence obtained from the Haplogrep tool.
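
The three haplogroup levels can be sketched with a regular expression. This is one possible reading of the rules above, not the code used for the table: the function name `haplogroup_levels` is hypothetical, the "1-100" range check is not enforced, and unusual names may need extra handling.

```python
import re

# letters: leading capital letter(s); second: next number or next single character
_PATTERN = re.compile(r"^(?P<letters>[A-Z]+)(?P<second>\d+|\D)?(?P<rest>.*)$")

def haplogroup_levels(haplo):
    """Return (haplogroup1, haplogroup2, haplogroup3), or None if haplo has no leading capital."""
    m = _PATTERN.match(haplo)
    if m is None:
        return None
    letters = m.group("letters")
    second = m.group("second") or ""
    rest = m.group("rest")
    level1 = letters[0]
    level2 = letters + second
    if second.isdigit():
        # After a number, take the next lowercase letter if there is one.
        third = re.match(r"[a-z]", rest)
    elif second:
        # After a single character such as "+", take the next number.
        third = re.match(r"\d+", rest)
    else:
        third = None
    level3 = level2 + (third.group(0) if third else "")
    return level1, level2, level3
```

Under these assumptions, "M7b1a1" gives "M", "M7", and "M7b".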

I also created a file Haplogroup/finalCodes/Tables/translation.txt which includes specific information about the samples of the published sequences, such as their locations, ethnicities, or populations. After briefly reading the papers of those sequences, I assumed that the isolate names might be codes or labels that indicate the ethnicity or location of the samples. Therefore I created the column "Explanation" to record the meaning of these isolate names. For sequences where I could not find ethnicity or location information from the isolate names or the papers, I kept the isolate name in the Explanation column but changed it to a more readable form, for example "Thai755" was changed to "Thai".
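
A minimal sketch of this lookup with its fallback is shown below. It assumes the translations are available as a dictionary and that the readable form is obtained by dropping trailing digits; both the function name `explain_isolate` and the dictionary layout are assumptions for illustration, not the format of translation.txt.

```python
import re

def explain_isolate(isolate, translation):
    """Return the meaning of an isolate name, or a cleaned-up name if none is known."""
    if isolate in translation:
        return translation[isolate]
    # Fallback: strip trailing digits, e.g. "Thai755" becomes "Thai".
    return re.sub(r"\d+$", "", isolate)
```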

In [None]:
# create list of names of 4932 sequences to make the Original IsoTab
from Haplogroup.finalCodes.DataWrangling import openFile, saveFile
rawList = openFile.openFile("/content/drive/MyDrive/Haplogroup/finalCodes/others/newNameSeqsDataset3_4932.txt").split("\n")[:-1]
listOfNames = list(map(lambda x: x.split("/")[-1].split(".fasta")[0], rawList))

In [None]:
len(listOfNames) # the length of the list is 4932, which is also the number of sequences used to create the Isolate Explanation table below

4932

In [None]:
# create table IsolateExplanation.xlsx
from Haplogroup.finalCodes.Tables.IsoTab import explainIso
output = explainIso.explainIso(listOfNames)
output.to_excel('/content/drive/MyDrive/RetrieveData/tables/IsolateExplanation.xlsx')

After that, I created the CompleteFullIsoTab table from the IsolateExplanation table by making some changes:
- Present or ancient: I added this column between the columns "year" and "Author(s)". It records whether the paper studied ancient sequences (such as samples from archaeological sites) or modern sequences (samples from living people: buccal or blood samples). For directly submitted sequences and unpublished papers, I wrote "unknown", unless the NCBI nuccore summary mentioned the sample source (such as human tissue), in which case I assumed the samples might be modern.
- Ethnicity: The column "Population" was replaced by the "Ethnicity" column, which I placed between the "Explanation" and "Location" columns. This column shows the ethnicity that I assumed the sampled people belong to.
- Language family: I added this new column after the "Language" column. After determining the language people might speak based on their ethnicity, I used extra sources (links are in the file "/content/drive/MyDrive/RetrieveData/tables/sources.txt") to look up the language family of that language.

*Notice: Besides that, I kept manually looking for more specific and correct information, particularly for the columns Ethnicity, Location, Language and Language family. After carefully re-reading the papers and searching for more sources outside the papers, I found more information to replace the less specific entries.*



After having more specific and updated information in the table CompleteFullIsoTab, I used this table to create sub tables (more details at section **Table1**, **Table2**, **Table3** below).

#### **Table1**

- In this section, the sub tables were created to summarize only the unique information of "Explanation", "Ethnicity", "Location", "Language", and "Language family" from the big table CompleteFullIsoTab. The columns of table1 are: "ID", "Country", "References", "Explanation", "Ethnicity", "Location", "Language", "Language family", "Sample size", and one column per haplogroup name.
- Briefly, for each country, if 2 sequences had different explanations, or had the same explanation but differed in any of the 3 columns "Ethnicity", "Location", and "Language family", I treated them as different unique values and added a new row. For each unique row, I then counted how many sequences shared the same country, explanation, ethnicity, location, and language family, and put this count in the "Sample size" column. After the "Sample size" column come the columns named after each haplogroup. The numbers in those columns show the number of sequences with the same Country, Explanation, Ethnicity, Location, Language family and the same haplogroup name. Finally, for the References column, because more than 1 paper can share the same sample information, I separated those references with commas.

Using this format for all the tables in section Table1, I created:
- 11 tables for the 11 countries in SEA.
- A big table combining all 11 country tables, saved in 2 files: "SEA_haplogroups.csv" and "Changed_SEA_haplogroups.xlsx".
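
The grouping logic described above can be sketched in plain Python as follows. This is a simplified illustration, not the `createTable1` function: it assumes each sequence is a dictionary with the column names used in this section, and that the haplogroup columns are taken from `haplogroup1` (the `haplo_col` parameter is an assumption).

```python
from collections import Counter, defaultdict

KEYS = ["Country", "Explanation", "Ethnicity", "Location", "Language family"]

def summarize(rows, haplo_col="haplogroup1"):
    """Collapse per-sequence rows into one row per unique combination of KEYS."""
    groups = defaultdict(lambda: {"refs": set(), "haplos": Counter()})
    for row in rows:
        key = tuple(row[k] for k in KEYS)
        groups[key]["refs"].add(row["Reference"])
        groups[key]["haplos"][row[haplo_col]] += 1
    table = []
    for key, info in groups.items():
        out = dict(zip(KEYS, key))
        # Several papers can describe the same group: join them with commas.
        out["References"] = ", ".join(sorted(info["refs"]))
        out["Sample size"] = sum(info["haplos"].values())
        # One extra column per haplogroup seen in this group.
        out.update(info["haplos"])
        table.append(out)
    return table
```

The returned list of dictionaries can be passed to `pd.DataFrame(...)`; haplogroups that are absent from a group then appear as missing values.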

In [None]:
from Haplogroup.finalCodes.Tables.tables import table1
import pandas as pd
IsoTab = pd.read_excel("/content/drive/MyDrive/RetrieveData/tables/CompleteFullIsoTab.xlsx")
! mkdir /content/drive/MyDrive/RetrieveData/tables/table1
countries = "Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam"
for country in countries.split():
  data = table1.createTable1(country, IsoTab)
  data.to_csv('/content/drive/MyDrive/RetrieveData/tables/table1/'+country+'1.csv')
  print(country,'finish')
# merge all 11 country tables into one
frames = [pd.read_csv('/content/drive/MyDrive/RetrieveData/tables/table1/'+country+'1.csv', index_col="ID")
          for country in countries.split()]
df = pd.concat(frames, ignore_index=True, sort=False)
# haplogroups missing in a row are shown as '-' in the saved files
df = df.fillna(0)
df = df.replace(0,'-')
df.to_csv('/content/drive/MyDrive/RetrieveData/tables/table1/SEA_haplogroups.csv')
df.to_excel('/content/drive/MyDrive/RetrieveData/tables/table1/Changed_SEA_haplogroups.xlsx')

#### **Table2**

There are 2 main outputs of section Table2 after running the cell below:
1. Eleven tables showing the frequency of haplogroups in each of the eleven SEA countries
2. A big Haplofrequency table listing all haplogroups across the 11 countries and their frequencies among the total of 4932 sequences.

All the tables above have the same columns:
- Haplogroup (name of haplogroup)
- Number of Individuals (the number of times that haplogroup appears)
- Frequency (the percentage of that haplogroup among all haplogroup counts, within that country for the first output and across the 11 countries for the second output).
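
As a small illustration of these three columns (not the `table_2.Table2` function itself), the sketch below computes counts and percentages from a plain list of haplogroup names; the function name `haplogroup_frequency` is made up for this example.

```python
from collections import Counter

def haplogroup_frequency(haplogroups):
    """Return one row per haplogroup with its count and percentage of the total."""
    counts = Counter(haplogroups)
    total = sum(counts.values())
    return [
        {
            "Haplogroup": name,
            "Number of Individuals": n,
            "Frequency": 100 * n / total,
        }
        for name, n in counts.most_common()
    ]
```

With the input `["M", "M", "B", "F"]`, haplogroup M has 2 individuals and a frequency of 50.0.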

In [None]:
from Haplogroup.finalCodes.Tables.tables import table_2
! mkdir /content/drive/MyDrive/RetrieveData/tables/table2
! mkdir /content/drive/MyDrive/RetrieveData/tables/table2/countries
countries = "Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam"
# table 2 for each country
for country in countries.split():
  nameFile = "/content/drive/MyDrive/RetrieveData/tables/table1/"+country + "1.csv"
  data = table_2.Table2(nameFile)
  data.to_csv('/content/drive/MyDrive/RetrieveData/tables/table2/countries/'+country+'2.csv')
  print(country,'finish')
# table 2 for all SEA countries
df = table_2.Table2("/content/drive/MyDrive/RetrieveData/tables/SEA_haplogroups.csv")
df.to_csv('/content/drive/MyDrive/RetrieveData/tables/table2/Haplofrequency.csv')

Brunei finish
Cambodia finish
Indonesia finish
Laos finish
Malaysia finish
Myanmar finish
Philippines finish
Singapore finish
Thailand finish
Timor-Leste finish
Vietnam finish


#### **Table3**

In this Table3 section I created 4 main outputs:
1. table3a_CountryAndEthnicity:
> - Haplogroup column (names of the haplogroups)
> - Ethnicities of the 11 countries, with each country highlighted in a different color. </br>
> - In each country, there were: </br>
>> - A Total column (the count of that specific haplogroup / the total count of all haplogroups in that country). </br>
>> - Next to the Total column, one column for each ethnicity appearing in that country (the count of the specific haplogroup in that ethnicity / the total count of all haplogroups in that ethnicity).
2. table3b_Ethnicity:
> - Haplogroup column (names of the haplogroups)
> - A Total column (the count of that specific haplogroup / the total count of all haplogroups). </br>
> - Next to the Total column, one column for each ethnicity appearing in all 11 countries (the count of the specific haplogroup in that ethnicity / the total count of all haplogroups in that ethnicity).
3. table3c_LanguageFamily: </br>
The same column format as table3b_Ethnicity, but with the grouping variable changed from Ethnicity to Language Family.
4. table3d_CountryAndLanguageFamily: </br>
The same column format as table3a_CountryAndEthnicity, but with the grouping variable changed from Ethnicity to Language Family.
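
The "count / total" cells described above can be sketched as follows. This is an illustration only, not the `createTable3ad` or `createTable3bc` code: it assumes each sequence is a dictionary with a "Haplogroup" key and a grouping key such as "Ethnicity" or "Language family", and that each cell is written as a "count/total" string.

```python
from collections import Counter

def count_over_total(records, group_key):
    """Build {haplogroup: {"Total": "n/N", group: "n/N", ...}} from per-sequence records."""
    per_group = Counter((r[group_key], r["Haplogroup"]) for r in records)
    group_totals = Counter(r[group_key] for r in records)
    haplo_totals = Counter(r["Haplogroup"] for r in records)
    n_all = sum(group_totals.values())
    table = {}
    for haplo in sorted(haplo_totals):
        row = {"Total": f"{haplo_totals[haplo]}/{n_all}"}
        for group in sorted(group_totals):
            # Count of this haplogroup in the group over all sequences in the group.
            row[group] = f"{per_group[(group, haplo)]}/{group_totals[group]}"
        table[haplo] = row
    return table
```

Calling it with `group_key="Language family"` instead of `"Ethnicity"` gives the language-family variant of the same layout.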

In [None]:
! mkdir /content/drive/MyDrive/RetrieveData/tables/table3
! mkdir /content/drive/MyDrive/RetrieveData/tables/table3/countries

table3a_CountryAndEthnicity

In [None]:
# call function
from Haplogroup.finalCodes.Tables.tables.table3 import createTable3ad
import pandas as pd
# run function for Ethnicity
countries = 'Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
data = None  # combined table, filled country by country
groups = ['Country','Ethnicity']
for country in countries.split():
  df = createTable3ad.createTable3ad(country,groups,"/content/drive/MyDrive/RetrieveData/tables/SEA_haplogroups.csv",groups[-1])
  df.to_csv("/content/drive/MyDrive/RetrieveData/tables/table3/countries/"+country+"3.csv")
  if data is None:
    data = df
  else:
    # keep a single Haplogroup column and append this country's columns
    add = df.drop(['Haplogroup'],axis=1)
    data = pd.concat([data,add],axis=1)
  print(country,'finish')
data.to_excel("/content/drive/MyDrive/RetrieveData/tables/table3/table3a_CountryAndEthnicity.xlsx")

Brunei finish
Cambodia finish
Indonesia finish
Laos finish
Malaysia finish
Myanmar finish
Philippines finish
Singapore finish
Thailand finish
Timor-Leste finish
Vietnam finish


table3d_CountryAndLanguageFamily

In [None]:
# call function
from Haplogroup.finalCodes.Tables.tables.table3 import createTable3ad
import pandas as pd
# run function for Language Family
countries = 'Brunei Cambodia Indonesia Laos Malaysia Myanmar Philippines Singapore Thailand Timor-Leste Vietnam'
data = None  # combined table, filled country by country
groups = ['Country','Language family']
for country in countries.split():
  df = createTable3ad.createTable3ad(country,groups,'/content/drive/MyDrive/RetrieveData/tables/SEA_haplogroups.csv',groups[-1])
  if data is None:
    data = df
  else:
    # keep a single Haplogroup column and append this country's columns
    add = df.drop(['Haplogroup'],axis=1)
    data = pd.concat([data,add],axis=1)
  print(country,'finish')
data.to_excel('/content/drive/MyDrive/RetrieveData/tables/table3/table3d_CountryAndLanguageFamily.xlsx')

Brunei finish
Cambodia finish
Indonesia finish
Laos finish
Malaysia finish
Myanmar finish
Philippines finish
Singapore finish
Thailand finish
Timor-Leste finish
Vietnam finish


table3b_Ethnicity and table3c_LanguageFamily

In [None]:
# call function
from Haplogroup.finalCodes.Tables.tables.table3 import createTable3bc
# run function to get table3b_Ethnicity
groups = ['Ethnicity']
data = createTable3bc.createTable3bc(groups,'/content/drive/MyDrive/RetrieveData/tables/SEA_haplogroups.csv','Ethnicity')
data.to_excel('/content/drive/MyDrive/RetrieveData/tables/table3/table3b_Ethnicity.xlsx')

In [None]:
# run function to get table3c_LanguageFamily
from Haplogroup.finalCodes.Tables.tables.table3 import createTable3bc
groups = ['Language family']
data = createTable3bc.createTable3bc(groups,'/content/drive/MyDrive/RetrieveData/tables/SEA_haplogroups.csv','Language family')
data.to_excel('/content/drive/MyDrive/RetrieveData/tables/table3/table3c_LanguageFamily.xlsx')