# PCAs per region
How genetic data can provide the same information as geography, phylogeny and networks.

Create PCAs organized in three or four plots that represent:
1. All species in all regions
2. Only species in Mexican regions (and Central America)
3. Only species in northern South America
4. Possibly species in Central America together with the Jamaican species


### This was re-run on May 10, 2021 to be sure that we included 11 regions instead of 10

In [2]:
import ipyrad.analysis as ipa
import pandas as pd
import toyplot
import dbgdrive 

In [3]:
#load snp hdf5 file
SNPS = "/home/deren/Documents/Viburnum-Oreinotinus/assembly_hic_feb/full_dataset_outfiles/full_dataset.snps.hdf5"

In [4]:
#Create big imap with all species and all samples per species
##import and load database
import dbgdrive
fulldata = dbgdrive.get_database(sheet_name='sample-data', id_spreadsheet='1mnbZVtnI4AQDseWaawV2au4bNyFD_B9M0z8REMXbOcs', api_key='AIzaSyCbfzhhYZg6f3nDSFF9hbHQOMWx2td611o')

#these regions correspond to the early March 2021 version, not the late March or April versions
regions = dbgdrive.get_database(sheet_name='regions', id_spreadsheet='1mnbZVtnI4AQDseWaawV2au4bNyFD_B9M0z8REMXbOcs', api_key='AIzaSyCbfzhhYZg6f3nDSFF9hbHQOMWx2td611o')

In [5]:
##import collections to create an empty dict
from collections import defaultdict


#create empty dict
imap = defaultdict(list)

#go row by row
for index, row in fulldata.iterrows():
    #only check fulldataset specimens
    if row["full_dataset_withAyava"]:
        if row["Last_valid_name"] != "dentatum": 
            sp = row["Last_valid_name"]
            region = regions[regions["Species"] == sp]["Region"].values[0]
            imap[region].append(row["NameInAssembly"])
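
As a minimal sketch of what the loop above produces, the same grouping can be reproduced on toy records. The sample names, species and regions below are invented for illustration, and plain dicts stand in for the two database sheets; only the column names come from the notebook.

```python
from collections import defaultdict

# Hypothetical stand-in for the 'sample-data' sheet (all values are invented).
toy_samples = [
    {"NameInAssembly": "sp_a_001", "Last_valid_name": "sp_a", "full_dataset_withAyava": True},
    {"NameInAssembly": "sp_a_002", "Last_valid_name": "sp_a", "full_dataset_withAyava": True},
    {"NameInAssembly": "sp_b_001", "Last_valid_name": "sp_b", "full_dataset_withAyava": False},
    {"NameInAssembly": "sp_c_001", "Last_valid_name": "sp_c", "full_dataset_withAyava": True},
    {"NameInAssembly": "dentatum_001", "Last_valid_name": "dentatum", "full_dataset_withAyava": True},
]

# Hypothetical stand-in for the 'regions' sheet: species -> region.
toy_regions = {"sp_a": "mex_w", "sp_b": "mex_e", "sp_c": "peru"}

toy_imap = defaultdict(list)
for row in toy_samples:
    # keep only specimens flagged for the full dataset, and skip dentatum as in the notebook
    if row["full_dataset_withAyava"] and row["Last_valid_name"] != "dentatum":
        region = toy_regions[row["Last_valid_name"]]
        toy_imap[region].append(row["NameInAssembly"])

print(dict(toy_imap))
```

Each key of the resulting dict is a region and each value is the list of assembly names assigned to it, which is the shape `ipa.pca` receives as `imap` in the cells below.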

In [9]:
len(imap)

11

In [8]:
count = 0
for i in imap:
    count += len(imap[i])
print(count)

178


In [11]:
# require that 50% of samples have data in each group
minmap = {i: 0.5 for i in imap}
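
To make the 50% requirement concrete, here is a small sketch of the check it implies for a single site. The function `passes_minmap`, the toy genotypes and the group names are all hypothetical, and the exact comparison and rounding that ipyrad applies internally is an assumption here.

```python
# Genotype calls per sample at one site; None marks missing data (toy values).
site = {"s1": 0, "s2": None, "s3": 1, "s4": None, "s5": None, "s6": 2}

toy_imap = {"north": ["s1", "s2", "s3"], "south": ["s4", "s5", "s6"]}
toy_minmap = {group: 0.5 for group in toy_imap}


def passes_minmap(site, imap, minmap):
    """Return True if every group has at least the required share of samples with data."""
    for group, samples in imap.items():
        present = sum(site[s] is not None for s in samples)
        if present / len(samples) < minmap[group]:
            return False
    return True


# 'north' has 2 of 3 samples with data, but 'south' has only 1 of 3,
# so this site would be dropped under a 0.5 threshold.
print(passes_minmap(site, toy_imap, toy_minmap))
```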

## Define color dicts for coloring pca

In [18]:
#get colors from database
region_colors = {}
for row, region in enumerate(regions["RegionUnique"]):
    if region is not None:
        region_colors[region] = regions["ColorRegionUnique"][row]

## Test with sample imputation

In [12]:
# testing with sample imputation (based 100% on the imap)
pcaSample = ipa.pca(
    data=SNPS,
    imap=imap,
    minmap=minmap,
    mincov=0.75,
    impute_method="sample",
)

Samples: 178
Sites before filtering: 3419673
Filtered (indels): 0
Filtered (bi-allel): 193270
Filtered (mincov): 3357698
Filtered (minmap): 3225437
Filtered (subsample invariant): 957960
Filtered (minor allele frequency): 0
Filtered (combined): 3375216
Sites after filtering: 32182
Sites containing missing values: 32172 (99.97%)
Missing values in SNP matrix: 1209181 (21.11%)
SNPs (total): 32182
SNPs (unlinked): 2381
Imputation: 'sampled'; (0, 1, 2) = 94.0%, 4.0%, 2.0%


In [29]:
pcaSample.run()

Subsampling SNPs: 2381/32182


In [30]:
# store the PC axes as a dataframe
df = pd.DataFrame(pcaSample.pcaxes[0], index=pcaSample.names)

# write the PC axes to a CSV file
df.to_csv("analysis_pca/pcaSample_analysis.csv")

In [31]:
#create a list of colors following imap order
colors = [region_colors[k] for k in imap]

pcaSample.draw(colors=colors, cycle=10,
#               outfile="pcaSampleMay2021.svg"
              )

(<toyplot.canvas.Canvas at 0x7f34969bf050>,
 <toyplot.coordinates.Cartesian at 0x7f34969bf590>)

Sample imputation looks fine and retains the geographical structure quite accurately.
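
For readers unfamiliar with this option, the sketch below illustrates the general idea behind frequency-based imputation within groups: a missing genotype is drawn at random according to the allele frequency observed in the sample's own imap group. It is a simplified illustration with a hypothetical function name and toy data, not ipyrad's code.

```python
import random


def impute_by_group(genotypes, imap, seed=0):
    """Fill missing calls (None) at one site by sampling from each group's allele frequency.

    genotypes: dict of sample -> 0, 1 or 2 (count of derived alleles), or None if missing.
    """
    rng = random.Random(seed)
    imputed = dict(genotypes)
    for samples in imap.values():
        observed = [genotypes[s] for s in samples if genotypes[s] is not None]
        if not observed:
            continue  # no data to estimate a frequency from; leave as missing
        freq = sum(observed) / (2 * len(observed))  # derived allele frequency in the group
        for s in samples:
            if genotypes[s] is None:
                # draw two alleles independently at the group's frequency
                imputed[s] = sum(rng.random() < freq for _ in range(2))
    return imputed


# Toy site: group 'a' is fixed for the ancestral allele, group 'b' for the derived one.
toy_genotypes = {"a1": 0, "a2": 0, "a3": None, "b1": 2, "b2": 2, "b3": None}
toy_imap = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2", "b3"]}

result = impute_by_group(toy_genotypes, toy_imap)
print(result)
```

Because missing values are filled from the group a sample already belongs to, this kind of imputation can reinforce the imap groups, which is why it is worth checking the result against a run that does not use the imap.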

### Mexico and Central America

In [46]:
imapCentralAmerica = {k: imap[k] for k in imap.keys() & {"central", "mex_w", "mex_e", "11th_area", "jamaica", "costa"}}
minmapCentralAmerica = {i: 0.5 for i in imapCentralAmerica}
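
One detail of the subsetting above: `imap.keys() & {...}` returns a set, so the order of the groups in the new dict is not guaranteed to follow the order of `imap`. The colors are later matched to groups by iterating over the same dict, so the plots stay consistent, but an order-preserving variant may be convenient when comparing legends across plots. The sketch below uses invented group contents.

```python
toy_imap = {"mex_w": ["a1"], "peru": ["b1"], "mex_e": ["c1"], "bol": ["d1"]}
wanted = {"mex_e", "mex_w", "jamaica"}  # names absent from the imap are simply ignored

# set intersection, as in the cell above: key order is whatever the set yields
subset_by_set = {k: toy_imap[k] for k in toy_imap.keys() & wanted}

# filtering while iterating over the imap keeps the original group order
subset_ordered = {k: toy_imap[k] for k in toy_imap if k in wanted}

print(list(subset_ordered))
```

Both dicts hold the same groups and samples; only the iteration order may differ.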

In [47]:
pcaSampleCentral = ipa.pca(
    data=SNPS,
    imap=imapCentralAmerica,
    minmap=minmapCentralAmerica,
    mincov=0.75,
    impute_method="sample",
)

Samples: 116
Sites before filtering: 3419673
Filtered (indels): 0
Filtered (bi-allel): 129070
Filtered (mincov): 3392855
Filtered (minmap): 3153237
Filtered (subsample invariant): 1686874
Filtered (minor allele frequency): 0
Filtered (combined): 3314330
Sites after filtering: 12726
Sites containing missing values: 12677 (99.61%)
Missing values in SNP matrix: 302640 (20.50%)
SNPs (total): 12726
SNPs (unlinked): 2716
Imputation: 'sampled'; (0, 1, 2) = 92.3%, 5.1%, 2.6%


In [79]:
pcaSampleCentral.run()

Subsampling SNPs: 2716/12726


In [80]:
# store the PC axes as a dataframe
df = pd.DataFrame(pcaSampleCentral.pcaxes[0], index=pcaSampleCentral.names)

# write the PC axes to a CSV file
df.to_csv("analysis_pca/pcaSampleCentral_analysis.csv")

In [83]:
#create a list of colors following imap order
colors = [region_colors[k] for k in imapCentralAmerica]

pcaSampleCentral.draw(colors=colors,
#                       outfile="pcaSampleCentralMay2021.svg"
                     )

(<toyplot.canvas.Canvas at 0x7f4cf2ad90d0>,
 <toyplot.coordinates.Cartesian at 0x7f4cf2ad9150>)

### South America

In [51]:
imapSouthAmerica = {k: imap[k] for k in imap.keys() & {"col_w", "bol", "peru", "col_e", "ecu_n"}}
minmapSouthAmerica = {i: 0.5 for i in imapSouthAmerica}

In [52]:
pcaSampleSouth = ipa.pca(
    data=SNPS,
    imap=imapSouthAmerica,
    minmap=minmapSouthAmerica,
    mincov=0.75,
    impute_method="sample",
)

Samples: 62
Sites before filtering: 3419673
Filtered (indels): 0
Filtered (bi-allel): 72203
Filtered (mincov): 3083036
Filtered (minmap): 2670397
Filtered (subsample invariant): 2525925
Filtered (minor allele frequency): 0
Filtered (combined): 3013568
Sites after filtering: 85395
Sites containing missing values: 85347 (99.94%)
Missing values in SNP matrix: 941968 (17.79%)
SNPs (total): 85395
SNPs (unlinked): 12298
Imputation: 'sampled'; (0, 1, 2) = 80.8%, 12.7%, 6.5%


In [53]:
pcaSampleSouth.run()

Subsampling SNPs: 12298/85395


In [54]:
# store the PC axes as a dataframe
df = pd.DataFrame(pcaSampleSouth.pcaxes[0], index=pcaSampleSouth.names)

# write the PC axes to a CSV file
df.to_csv("analysis_pca/pcaSampleSouth_analysis.csv")

In [111]:
#create a list of colors following imap order
colors = [region_colors[k] for k in imapSouthAmerica]

pcaSampleSouth.draw(colors=colors, 
#                     outfile="pcaSampleSouthMay2021.svg"
                   )

(<toyplot.canvas.Canvas at 0x7f4cf28a5950>,
 <toyplot.coordinates.Cartesian at 0x7f4cf28a5910>)

# Experimental area

## Test with kmeans imputation

In [38]:
# testing kmeans imputation
pcaK = ipa.pca(
    data=SNPS,
    imap=imap,
    minmap=minmap,
    mincov=0.99,
    impute_method=11, #the number of areas of endemism is used here as K
)

Kmeans clustering: iter=0, K=11, mincov=0.9, minmap={'global': 0.99}
Samples: 178
Sites before filtering: 3419673
Filtered (indels): 0
Filtered (bi-allel): 193270
Filtered (mincov): 3417640
Filtered (minmap): 3419591
Filtered (subsample invariant): 957960
Filtered (minor allele frequency): 0
Filtered (combined): 3407348
Sites after filtering: 50
Sites containing missing values: 40 (80.00%)
Missing values in SNP matrix: 40 (0.45%)
SNPs (total): 50
SNPs (unlinked): 21
Imputation: 'sampled'; (0, 1, 2) = 90.0%, 2.5%, 7.5%
{0: ['__EJE_617', '__EJE_629', '__EJE_630', 'acutifolium_DRY3_MEX_006', 'acutifolium_MEX_005', 'acutifolium_MJD_011_tuton_111816', 'acutifolium_MJD_012_tuton_111816', 'acutifolium_MJD_60', 'acutifolium_PWS_3050', 'acutifolium_PWS_3059', 'alpinum_PWS_3924', 'anabaptista_PWS_2156', 'anabaptista_PWS_2160', 'anabaptista_PWS_2162', 'anabaptista_PWS_2164', 'anabaptista_PWS_2165', 'anabaptista_PWS_2173', 'ayavacense_PWS_3884', 'ayavacense_PWS_3889', 'ayavacense_PWS_3902', 'ayava

  100 * np.sum(imputed == 0) / imputed.size,
  100 * np.sum(imputed == 1) / imputed.size,
  100 * np.sum(imputed == 2) / imputed.size,
  kmeans_model.fit(pcadata)


Samples: 178
Sites before filtering: 3419673
Filtered (indels): 0
Filtered (bi-allel): 193270
Filtered (mincov): 3419591
Filtered (minmap): 3419591
Filtered (subsample invariant): 957960
Filtered (minor allele frequency): 0
Filtered (combined): 3407348
Sites after filtering: 50
Sites containing missing values: 40 (80.00%)
Missing values in SNP matrix: 40 (0.45%)
SNPs (total): 50
SNPs (unlinked): 21
Imputation: 'sampled'; (0, 1, 2) = 90.0%, 2.5%, 7.5%


In [39]:
# run and draw results for kmeans clustering
pcaK.run()

Subsampling SNPs: 21/50


In [None]:
# store the PC axes as a dataframe
df = pd.DataFrame(pcaK.pcaxes[0], index=pcaK.names)

# write the PC axes to a CSV file
df.to_csv("analysis_pca/pcaK_analysis.csv")

In [41]:
pcaK.draw(0, 2);

K-means imputation is rather messy with mincov 0.75.

## Imputation without interference of imap

In this test the imputation is based on the entire matrix and not on the subsets defined in the imap.

If this PCA is similar to those with sample imputation based on the imap, we can be sure that those results are not an artifact of our imputation method.
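
The comparison in this notebook is done by eye. As one hypothetical way to put a number on "similar", the sketch below correlates the scores of one PC axis between two runs over the samples they share. The function names and the PC1 scores are invented for illustration; in practice the scores could come from the CSV files written above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def axis_agreement(scores_a, scores_b):
    """Absolute correlation of one PC axis over the samples shared by two runs.

    The sign of a PC axis is arbitrary, so the absolute value is used.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    return abs(pearson([scores_a[s] for s in shared], [scores_b[s] for s in shared]))


# Invented PC1 scores, for illustration only.
pc1_with_imap = {"s1": -2.0, "s2": -1.0, "s3": 1.0, "s4": 2.0}
pc1_no_imap = {"s1": 4.1, "s2": 1.9, "s3": -2.0, "s4": -4.0, "extra": 0.3}

print(round(axis_agreement(pc1_with_imap, pc1_no_imap), 3))
```

A value close to 1 would indicate that the two runs order the shared samples along that axis in nearly the same way, even if the axis is flipped or scaled differently.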

In [45]:
# testing sample imputation without an imap
pcaNone = ipa.pca(
    data=SNPS,
    imap=None,
    minmap=None,
    mincov=0.5,
    impute_method="sample",
)

Samples: 274
Sites before filtering: 3419673
Filtered (indels): 0
Filtered (bi-allel): 278880
Filtered (mincov): 2529537
Filtered (minmap): 0
Filtered (subsample invariant): 6
Filtered (minor allele frequency): 0
Filtered (combined): 2628555
Sites after filtering: 796202
Sites containing missing values: 796193 (100.00%)
Missing values in SNP matrix: 88550963 (40.59%)
SNPs (total): 796202
SNPs (unlinked): 45280
Imputation: 'sampled'; (0, 1, 2) = 93.3%, 4.8%, 1.8%


In [46]:
pcaNone.run()

Subsampling SNPs: 45280/796202


In [47]:
##import collections to create an empty dict
from collections import defaultdict


#create empty dict
imap_test = defaultdict(list)

imap_test["excluded"] = []


#go row by row
for index, row in fulldata.iterrows():
    #only check fulldataset specimens
    if row["full_dataset_withAyava"]:
        if row["Last_valid_name"] != "dentatum": 
            sp = row["Last_valid_name"]
            region = regions[regions["Species"] == sp]["Region"].values[0]
            imap_test[region].append(row["NameInAssembly"])

In [49]:
import h5py
w = h5py.File(SNPS,'r')

In [50]:
names_in_h5 = w["snps"].attrs["names"]
w.close()

In [51]:
t = []
for i in imap_test.values():
    t.extend(i)

In [52]:
imap_test["central"].append("reference")


#names in the hdf5 file are stored as bytes, so decode each one before comparing
for i in names_in_h5:
    name = i.decode("utf-8")
    if name not in t:
        imap_test["excluded"].append(name)
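
The bookkeeping above can be sketched on toy data as follows. All names are invented. In this sketch the set of assigned names is built after `reference` is added to a group, so that it is not also placed in `excluded`; in the notebook the list `t` is built in an earlier cell, so whether `reference` ends up in both groups depends on whether that name is present in the file.

```python
# Names read from the HDF5 file are bytes; these toy values are invented.
toy_names_in_h5 = [b"sp_a_001", b"sp_b_001", b"reference", b"sp_x_009"]
toy_imap_test = {"central": ["sp_a_001"], "mex_w": ["sp_b_001"], "excluded": []}

toy_imap_test["central"].append("reference")

# collect every name already assigned to a group (after adding the reference)
assigned = {name for names in toy_imap_test.values() for name in names}

for raw in toy_names_in_h5:
    name = raw.decode("utf-8")
    if name not in assigned:
        toy_imap_test["excluded"].append(name)

print(toy_imap_test["excluded"])
```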

In [55]:
pcaNone.draw(imap=imap_test, cycle=10)

(<toyplot.canvas.Canvas at 0x7f6da8e7f610>,
 <toyplot.coordinates.Cartesian at 0x7f6da8e7fd50>)

This result shows that the structure closely matches the general PCA done with sample imputation. Because our hdf5 file contains some samples that were not included in this analysis, the excluded specimens are shown as purple boxes. However, groups like mex_w and mex_e are separated, and a large cluster of South American samples is still separated from them.

I am pretty sure that imputation is not the cause of the pattern, because the sample imputation PCA is very accurate despite the amount of missing data imputed (17-21%).
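
The 17-21% range can be checked directly from the numbers printed in the three sample-imputation reports, assuming the percentage is the number of missing values divided by the size of the filtered SNP matrix (samples × sites after filtering):

```python
# (samples, sites after filtering, missing values) copied from the reports above
runs = {
    "all regions": (178, 32182, 1209181),
    "Mexico and Central America": (116, 12726, 302640),
    "South America": (62, 85395, 941968),
}

missing_pct = {
    name: round(100 * missing / (nsamples * nsites), 2)
    for name, (nsamples, nsites, missing) in runs.items()
}
print(missing_pct)
```

These values agree with the percentages reported by `ipa.pca` for each run (21.11%, 20.50% and 17.79%).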