After filtering the dataset with vcftools (pre-filtering gives a clean VCF), we need to convert the .vcf file to .hdf5.
Note: check that the ipyrad environment you are using has a Python version
between 3.5 and 3.9 (version 3.10 does not work well at the moment).

#conda create -n ipyrad_env python=3.9
#conda activate ipyrad_env
#conda install ipyrad -c conda-forge -c bioconda
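
The version constraint in the note above can be checked from inside the notebook before importing ipyrad. This is a minimal sketch; the helper name `python_version_ok` is hypothetical and the bounds are simply the ones stated in the note.

```python
import sys

# Supported range taken from the note above (3.5 up to and including 3.9).
MIN_VERSION = (3, 5)
MAX_VERSION = (3, 9)


def python_version_ok(version=None):
    """Return True if (major, minor) lies inside the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if not python_version_ok():
    # Only a warning: the environment may still work, but the note advises against it.
    print("Warning: Python %d.%d is outside the 3.5-3.9 range" % sys.version_info[:2])
```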


In [6]:
#the analysis tools are a subpackage of ipyrad
#import ipyrad as ip
#import ipyrad.analysis as ipa
#import pandas as pd

In [1]:
#Importing modules
import ipyrad
import ipyrad.analysis as ipa      ## ipyrad analysis toolkit


In [2]:
## Converting .vcf file to .hdf5 file (ipa.pca does the conversion itself when given a VCF, see the log below)
vcffile = "/dss/dsshome1/0A/ra93rey/final_datasets/Alpsmac_linked_95th.recode.vcf"
pca = ipa.pca(vcffile)


Converting vcf to HDF5 using default ld_block_size: 20000
Typical RADSeq data generated by ipyrad/stacks will ignore this value.
You can use the ld_block_size parameter of the PCA() constructor to change
this value.

Indexing VCF to HDF5 database file
VCF: 2189 SNPs; 181 scaffolds
[                    ]   0% 0:00:00 | converting VCF to HDF5 
This appears to be a denovo assembly, ld_block_size arg is being ignored.
[####################] 100% 0:00:06 | converting VCF to HDF5 
HDF5: 2189 SNPs; 181 linkage group
SNP database written to ./analysis-vcf2hdf5/Alpsmac_linked_95th.recode.snps.hdf5
Samples: 190
Sites before filtering: 2189
Filtered (indels): 0
Filtered (bi-allel): 52
Filtered (mincov): 0
Filtered (minmap): 0
Filtered (subsample invariant): 0
Filtered (minor allele frequency): 0
Filtered (combined): 52
Sites after filtering: 2137
Sites containing missing values: 2137 (100.00%)
Missing values in SNP matrix: 75881 (18.69%)
SNPs (total): 2137
SNPs (unlinked): 181
Imputation (null; 

In [None]:

##Converting clean vcf to hdf5

# init a conversion tool; ld_block_size is the size of the linkage blocks in base pairs
converter = ipa.vcf_to_hdf5(
    name="Alps_wocgn_7add",
    data="/dss/dsshome1/0A/ra93rey/ipyrad/cloud_PCA/alps_wocgn_7added.cleaned.vcf.gz",
    ld_block_size=20000,
)
# run the converter
converter.run()

Indexing VCF to HDF5 database file
VCF: 6094152 SNPs; 1 scaffolds
[####################] 100% 0:02:22 | converting VCF to HDF5
HDF5: 6094152 SNPs; 10845 linkage group
SNP database written to ./analysis-vcf2hdf5/Macaque_LD20K.snps.hdf5
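
To make the role of ld_block_size more concrete, here is a toy sketch of how SNPs on a scaffold could be grouped into fixed-width linkage blocks. It assumes blocks are simple windows of ld_block_size base pairs starting at position 0; this is an illustration of the idea, not ipyrad's actual implementation, and `assign_linkage_blocks` is a hypothetical helper.

```python
from collections import defaultdict


def assign_linkage_blocks(snps, ld_block_size=20000):
    """Group (scaffold, position) pairs into fixed-width blocks.

    Assumption: a block is identified by (scaffold, position // ld_block_size).
    """
    blocks = defaultdict(list)
    for scaffold, pos in snps:
        blocks[(scaffold, pos // ld_block_size)].append(pos)
    return dict(blocks)


# Made-up SNP positions, for illustration only.
toy_snps = [("chr1", 150), ("chr1", 19999), ("chr1", 20000), ("chr1", 45000), ("chr2", 10)]
blocks = assign_linkage_blocks(toy_snps)
for key in sorted(blocks):
    print(key, blocks[key])
```

Under this assumption, a larger ld_block_size gives fewer blocks and therefore fewer "unlinked" SNPs available to each PCA replicate.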


In [8]:
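# Installation of the required software. These are shell commands: run them in a terminal,
# not in a Python cell (run uncommented in a cell, they raise the SyntaxError shown below).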
# conda install ipyrad -c bioconda
# conda install scikit-learn -c bioconda
# conda install toyplot -c eaton-lab

SyntaxError: invalid syntax (937971825.py, line 2)

In [1]:
#the analysis tools are a subpackage of ipyrad
import ipyrad.analysis as ipa
import pandas as pd
import toyplot

In [2]:

## Build two dictionaries (species_dict and loc_dict) to use in the PCA
with open("/dss/dsshome1/0A/ra93rey/ipyrad/cloud_PCA/porous-alp-popdist-scheme-184indv.txt") as file:
    file.readline()  # read and discard the first line of the file (column headers)
    content = file.read()  # read the rest of the file and store it in `content`

species_dict = {}  # species name -> list of sample IDs
loc_dict = {}  # location name -> list of sample IDs

## Column indices: 0 is the sample ID, 1 the location, 7 the species.
## Split the content into lines, then split each line on tabs ("\t") to extract the fields.
for line in content.splitlines():
    line_list = line.split("\t")
    if len(line_list) == 0 or len(line_list[0]) == 0:  # skip empty lines
        continue
    # setdefault creates the list the first time a key is seen; then we append the sample ID
    species_dict.setdefault(line_list[7].strip(), []).append(line_list[0])
    loc_dict.setdefault(line_list[1].strip(), []).append(line_list[0])

## now we have the dictionaries!
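
To see what the two dictionaries look like, the same grouping can be tried on a small in-memory table. This is a sketch with made-up sample IDs, locations and species; only the column positions (0, 1 and 7) follow the cell above, and `build_group_dicts` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up tab-separated table with the ID in column 0, location in column 1, species in column 7.
toy_table = (
    "id\tlocation\tc2\tc3\tc4\tc5\tc6\tspecies\n"
    "S1\tsiteA\t.\t.\t.\t.\t.\tsp1\n"
    "S2\tsiteA\t.\t.\t.\t.\t.\tsp2\n"
    "\n"
    "S3\tsiteB\t.\t.\t.\t.\t.\tsp1\n"
)


def build_group_dicts(handle, id_col=0, loc_col=1, species_col=7):
    """Return (species_dict, loc_dict), each mapping a group name to a list of sample IDs."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    species, locs = defaultdict(list), defaultdict(list)
    for row in reader:
        if not row or not row[id_col]:  # skip empty lines
            continue
        species[row[species_col].strip()].append(row[id_col])
        locs[row[loc_col].strip()].append(row[id_col])
    return dict(species), dict(locs)


toy_species, toy_locs = build_group_dicts(io.StringIO(toy_table))
print(toy_species)
print(toy_locs)
```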

In [3]:
# path to the input data file (.snps.hdf5)
data = "/dss/dsshome1/0A/ra93rey/ipyrad/cloud_PCA/alps_wocgn_7added.snps.hdf5"

In [4]:
# create an imap dictionary (group name -> list of sample names)
# to group individuals into populations; here the groups are the species in species_dict
imap = species_dict

In [5]:
# create a minmap dictionary that requires 20% of the samples in each group to have data
# (20% coverage in each population). This keeps only the SNPs that have data for at least
# that proportion of samples in every group.
minmap = {i: 0.2 for i in imap}
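
The effect of minmap can be illustrated on a single SNP. The sketch below assumes minmap values are proportions of samples per group, as described in the comment above; it is not ipyrad's own filtering code, and the sample names, genotype calls and the helper `passes_minmap` are made up for the example.

```python
def passes_minmap(genotypes, names, imap, minmap):
    """Return True if every group reaches its minimum proportion of non-missing calls.

    genotypes: calls for one SNP, aligned with names; None marks missing data.
    """
    has_data = {name: call is not None for name, call in zip(names, genotypes)}
    for group, members in imap.items():
        covered = sum(has_data[m] for m in members)
        if covered / len(members) < minmap[group]:
            return False
    return True


toy_names = ["a1", "a2", "a3", "a4", "a5", "b1", "b2", "b3", "b4", "b5"]
toy_imap = {"A": toy_names[:5], "B": toy_names[5:]}
toy_minmap = {group: 0.2 for group in toy_imap}

# Group A has 2/5 calls and group B has 2/5 calls: kept.
snp_kept = [0, 1, None, None, None, None, 1, 0, None, None]
# Group A has 5/5 calls but group B has none: dropped.
snp_dropped = [0, 0, 0, 0, 0, None, None, None, None, None]

print(passes_minmap(snp_kept, toy_names, toy_imap, toy_minmap))
print(passes_minmap(snp_dropped, toy_names, toy_imap, toy_minmap))
```

The second SNP shows why minmap differs from mincov: half of all samples have data there, yet one group is entirely missing.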

In [6]:
# Init the pca object: load the data file and apply the filters set by the (optional) parameters
pca = ipa.pca(
    data=data,
    imap=imap,
    minmap=minmap,
    mincov=0.2,  # drop SNPs that have data in less than this proportion of all samples
    impute_method=3,  # kmeans imputation method with 3 clusters
)

Kmeans clustering: iter=0, K=3, mincov=0.9, minmap={'global': 0.2}
Samples: 184
Sites before filtering: 7238146
Filtered (indels): 143258
Filtered (bi-allel): 222874
Filtered (mincov): 7236776
Filtered (minmap): 6981275
Filtered (subsample invariant): 375268
Filtered (minor allele frequency): 0
Filtered (combined): 7238121
Sites after filtering: 1180
Sites containing missing values: 1180 (100.00%)
Missing values in SNP matrix: 13714 (6.32%)
SNPs (total): 1180
SNPs (unlinked): 39
Imputation: 'sampled'; (0, 1, 2) = 95.1%, 4.2%, 0.7%




{0: ['RP1230', 'RP1231', 'RP1232', 'RP1233', 'RP1234', 'RP1238', 'RP1240', 'RP1253', 'RP1254', 'RP1256', 'RP1257', 'RP1258', 'RP1259', 'RP1260', 'RP1261', 'RP1263', 'RP1264', 'RP1265', 'RP1266', 'RP1267', 'RP1268', 'RP1269', 'RP1270', 'RP1271', 'RP1272', 'RP1274', 'RP1275', 'RP1276', 'RP1277', 'RP1278', 'RP1279', 'RP1280', 'RP1281', 'RP1283', 'RP1284', 'RP1285', 'RP1287', 'RP1290', 'RP1291', 'RP1293', 'RP1294', 'RP1295', 'RP1297', 'RP1301', 'RP1302', 'RP1312', 'RP1379', 'RP1403', 'RP1404', 'RP1405', 'RP1406', 'RP1407', 'RP1417', 'RP1418', 'RP1419', 'RP1420', 'RP1421', 'RP1422', 'RP1423', 'RP1426', 'RP1427', 'RP1456', 'RP1458', 'RP1459', 'RP1461', 'RP1462', 'RP1479', 'RP1480', 'RP1481', 'RP1482', 'RP1484', 'RP1485', 'RP1487', 'RP1488', 'RP1489', 'RP1490', 'RP1491', 'RP1494', 'RP1495', 'RP1496', 'RP1497', 'RP1499', 'RP1508', 'RP1510', 'RP1511', 'RP1512', 'RP1530', 'RP1550', 'RP1551', 'RP1561', 'RP1562', 'RP1563', 'RP1564', 'RP1581', 'RP1582', 'RP1583', 'RP1584'], 1: ['RP1210', 'RP1211', 



Samples: 184
Sites before filtering: 7238146
Filtered (indels): 143258
Filtered (bi-allel): 222874
Filtered (mincov): 7217752
Filtered (minmap): 7133175
Filtered (subsample invariant): 375268
Filtered (minor allele frequency): 0
Filtered (combined): 7222276
Sites after filtering: 17025
Sites containing missing values: 17025 (100.00%)
Missing values in SNP matrix: 949313 (30.30%)
SNPs (total): 17025
SNPs (unlinked): 534
Imputation: 'sampled'; (0, 1, 2) = 94.4%, 4.7%, 0.9%




{0: ['RP1210', 'RP1211', 'RP1212', 'RP1213', 'RP1221', 'RP1239', 'RP1304', 'RP1305', 'RP1312', 'RP1313', 'RP1314', 'RP1315', 'RP1316', 'RP1317', 'RP1338', 'RP1339', 'RP1340', 'RP1341', 'RP1342', 'RP1343', 'RP1379', 'RP1380', 'RP1381', 'RP1382', 'RP1383', 'RP1384', 'RP1385', 'RP1386', 'RP1388', 'RP1434', 'RP1436', 'RP1437', 'RP1438', 'RP1439', 'RP1441', 'RP1442', 'RP1460', 'RP1463', 'RP1464', 'RP1465', 'RP1466', 'RP1467', 'RP1468', 'RP1470', 'RP1491', 'RP1517', 'RP1520', 'RP1521', 'RP1531', 'RP1532', 'RP1533', 'RP1534', 'RP1549', 'RP1575', 'RP1576', 'RP1577', 'RP1578', 'RP1579'], 1: ['RP1230', 'RP1231', 'RP1232', 'RP1233', 'RP1234', 'RP1238', 'RP1240', 'RP1253', 'RP1254', 'RP1255', 'RP1256', 'RP1257', 'RP1258', 'RP1259', 'RP1260', 'RP1261', 'RP1263', 'RP1264', 'RP1265', 'RP1266', 'RP1267', 'RP1268', 'RP1269', 'RP1270', 'RP1271', 'RP1272', 'RP1274', 'RP1275', 'RP1276', 'RP1277', 'RP1278', 'RP1279', 'RP1280', 'RP1281', 'RP1282', 'RP1283', 'RP1284', 'RP1285', 'RP1286', 'RP1287', 'RP1288', 



Samples: 184
Sites before filtering: 7238146
Filtered (indels): 143258
Filtered (bi-allel): 222874
Filtered (mincov): 6981275
Filtered (minmap): 7137761
Filtered (subsample invariant): 375268
Filtered (minor allele frequency): 0
Filtered (combined): 7150186
Sites after filtering: 89115
Sites containing missing values: 89115 (100.00%)
Missing values in SNP matrix: 9489079 (57.87%)
SNPs (total): 89115
SNPs (unlinked): 2829
Imputation: 'sampled'; (0, 1, 2) = 91.5%, 7.2%, 1.3%


In [7]:
# run the PCA analysis; each replicate subsamples a different random set of unlinked SNPs
pca.run(nreplicates=55,seed=12345)

Subsampling SNPs: 2829/89115
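
The "unlinked SNPs" subsampling can be pictured as drawing one SNP at random from each linkage block, with a fresh draw for every replicate. The sketch below works under that assumption on made-up block labels; it is not ipyrad's implementation, and `subsample_unlinked` is a hypothetical helper.

```python
import random


def subsample_unlinked(block_ids, rng):
    """Pick one SNP index at random from each linkage block."""
    by_block = {}
    for idx, block in enumerate(block_ids):
        by_block.setdefault(block, []).append(idx)
    return sorted(rng.choice(indices) for indices in by_block.values())


# Made-up data: 10 SNPs spread over 4 linkage blocks.
toy_block_ids = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
rng = random.Random(12345)  # a fixed seed makes the replicates reproducible

replicates = [subsample_unlinked(toy_block_ids, rng) for _ in range(3)]
for rep in replicates:
    print(rep)
```

Each replicate holds as many SNPs as there are blocks, which matches the log above where the subsample size (2829) equals the number of unlinked SNPs.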


In [8]:
# store the PC axes as a dataframe
df = pd.DataFrame(pca.pcaxes[0], index=pca.names)

In [9]:
# write the PC axes to a CSV file
df.to_csv("pca_184indv.csv")