#Building Linkage Maps with LepMap

Before I begin with LepMap, I have to identify the haploid data that I want to use, specifically which runs of MSTMap to use to identify and compare linkage groups between families and to rename the loci. I looked over the results from my various MSTMap runs, paying attention to the size and number of the linkage groups as well as the number of markers and bins that were used, and chose the following run for each of the four families. These runs were not overmerged and had seemingly uninflated linkage group sizes, robust numbers of markers, a reasonable number of bins per linkage group and a fair number of samples. 

In [None]:
PBIRD13X108H_80GR 
PBIRD13X110H_65GR
PHOOD11x01H_4
PHOOD11x05H_5

###Identifying duplicated loci

I used the output from Ryan's pipeline for each of the families to identify loci that are duplicates. 
Garrett wrote a script called parseStats.pl that takes the .stats output from Ryan's pipeline and, for each locus, reports the most likely model and then classifies that model as duplicated or not (or whether we would be able to tell if it were duplicated). Singles are given a classification of 1 and duplicates a 2. 

In [None]:
#!/usr/bin/perl -w
use strict;

#Hash of single and duplicate marker models, key is model, value is 1 for single locus or 2 for duplicate.
my%dupModels=(	"AA_xx" => 1,
				"AB" => 1,
				"AA_AB" => 2,
				"AA_BB" => 2,
				"AB_AB" => 2,
				"AA_AB_AB" => 2,
				"AA_BB_AB" => 2,
				"AB_AB_AB" => 2,
				"AA_BC" => 2,
				"AB_AC" => 2,
				"AB_CD" => 2,
				"AA_BB_AC" => 2,
				"AA_AB_AC" => 2,
				"AA_BB_CC" => 2);


while(my$line=<>){
	chomp$line;
	my($tag,$mappable_loci,$epsilon,$bestModel,$bestLikelihood,$secondModel,$secondLikelihood,$segTest_x1,$segTest_x2)=split "\t", $line, 9;
	if($mappable_loci>=1){
		print "$tag\t$bestModel\t$dupModels{$bestModel}\n";
	}
}

perl parseStats.pl PBIRD13X108H_filtered.stats >PBIRD13X108H_parsed.txt
perl parseStats.pl PBIRD13X110H_filtered.stats >PBIRD13X110H_parsed.txt
perl parseStats.pl PHOOD11x01H_filtered.stats >PHOOD11x01H_parsed.txt
perl parseStats.pl PHOOD11x05H_filtered.stats >PHOOD11x05H_parsed.txt
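
For readers more comfortable with Python, the classification step in parseStats.pl can be sketched as follows. This is only an illustration, not Garrett's script: the names `classify_model` and `classify_stats_lines` are made up here, the column order is assumed to match the `split` in the Perl code above, the input lines are invented, and models missing from the table are reported as `None` rather than guessed.

```python
# Hypothetical Python sketch of the parseStats.pl logic (not the original script).
SINGLE_MODELS = {"AA_xx", "AB"}
DUPLICATE_MODELS = {
    "AA_AB", "AA_BB", "AB_AB", "AA_AB_AB", "AA_BB_AB", "AB_AB_AB",
    "AA_BC", "AB_AC", "AB_CD", "AA_BB_AC", "AA_AB_AC", "AA_BB_CC",
}

def classify_model(model):
    """Return 1 for a single-locus model, 2 for a duplicated one, None if unknown."""
    if model in SINGLE_MODELS:
        return 1
    if model in DUPLICATE_MODELS:
        return 2
    return None

def classify_stats_lines(lines):
    """Yield (tag, best_model, class) for every line with at least one mappable locus.

    Assumed tab-separated column order, taken from the Perl split above:
    tag, mappable_loci, epsilon, bestModel, bestLikelihood, ...
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, mappable_loci, best_model = fields[0], fields[1], fields[3]
        if float(mappable_loci) >= 1:
            yield tag, best_model, classify_model(best_model)

# Made-up input lines, for illustration only.
example = [
    "101\t1\t0.01\tAB\t-10.0\tAA_AB\t-20.0\t0.5\tNA",
    "102\t2\t0.01\tAB_AB\t-12.0\tAB\t-30.0\t0.4\t0.6",
    "103\t0\t0.01\tAA_xx\t-5.0\tAB\t-9.0\tNA\tNA",
]
result = list(classify_stats_lines(example))
print(result)
```

The third example line is dropped because it has no mappable loci, which mirrors the `$mappable_loci>=1` check in the Perl script.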

##Comparing duplicated loci across families

The next thing we need is a file that lists every locus genotyped in any family, along with whether it is duplicated or not. A locus that is duplicated in any family is considered duplicated, even if it is not duplicated in the other families. 

To build this file we need the output from MSTMap that has been split into one set of linkage groups (_split) and run through the MST_format.pl script (_format), plus the output from the runs of parseStats.pl on the four families. I put the parseStats results in Excel, one tab for each family. 

I took the list of loci from the split and formatted file for each family and put all the loci from the four families together in one column. I had already removed the R from the locus names at an earlier stage. I used 'remove duplicates' in Excel so that each locus is in the list only once, but every locus that is present in any family is included. Then I copied that column to another column and used 'text to columns', splitting the names on the _ to get the beginning of the locus name alone. For each family I did a vlookup with that locus prefix against the parseStats output to find whether the locus was a duplicate in that family. 

Finally, I made a single file with the list of loci as one column and, in a second column, whether each locus was duplicated or not, using the 1 or 2. If a locus had a 2 in any of the families, I gave it a 2 (duplicated) in the final list. Here is an example: 


In [None]:
22869_x1	2
12750_x1	2
14409_x1	2
18990_x1	2
20758_x1	2
30259_x1	2
28317_x1	2
32569_x1	2
45786_x1	2
45786_x2	2
87954_x1	2
11564_x1	2
15649_x1	2
564_x1	2
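
The spreadsheet steps above (pool the loci, remove duplicates, look up each family, keep a 2 if any family has a 2) amount to taking the highest classification per locus across families. A minimal sketch in Python, assuming each family's parseStats output has already been read into a dict keyed by locus prefix; the function name, variable names and data are all invented for illustration:

```python
def merge_duplication_calls(loci, per_family_calls):
    """Return {locus: 1 or 2}, using 2 if any family calls the locus duplicated.

    loci: locus names as they appear in the MSTMap files, e.g. '45786_x1'.
    per_family_calls: list of dicts, one per family, mapping the locus prefix
        (the part before the first '_') to 1 (single) or 2 (duplicate).
    Loci with no call in any family are left out rather than guessed.
    """
    merged = {}
    for locus in sorted(set(loci)):          # 'remove duplicates'
        prefix = locus.split("_")[0]         # 'text to columns' on '_'
        calls = [fam[prefix] for fam in per_family_calls if prefix in fam]
        if calls:
            merged[locus] = max(calls)       # a 2 in any family wins
    return merged

# Made-up calls for two families.
family_a = {"45786": 2, "564": 1, "777": 1}
family_b = {"45786": 1, "564": 2}
loci = ["45786_x1", "45786_x2", "564_x1", "777_x1", "564_x1"]
merged = merge_duplication_calls(loci, [family_a, family_b])
for locus, call in merged.items():
    print("{}\t{}".format(locus, call))
```

Taking the maximum works only because the two classes are coded 1 and 2; with other codes an explicit "any family says duplicated" test would be safer.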

##Compare the MSTMap results between families to find matching linkage groups

Garrett wrote a script that takes two MSTMap output files and the list of all the loci (with their classification as duplicated or not) and returns the linkage groups that match up to one another between families. Only loci classified as single copy (1) are used for matching. The script also reports how many loci the matched linkage groups have in common, that is, what each match is based on. 

In [None]:
#MST_MST_compare_all_dupFilter2.pl
#!/usr/bin/perl -w
use strict;

my$MSTfile1=$ARGV[0];
my$MSTfile2=$ARGV[1];
my$DuplicateFile=$ARGV[2];

my%LociDuplication;
open(DUPLICATION, "<$DuplicateFile")||die "cannot open $DuplicateFile:$!";
my$DupHeader=<DUPLICATION>;
chomp $DupHeader;
while(my$line=<DUPLICATION>){
	chomp $line;
	my($loci,$duplication)=split "\t", $line, 2;
	$LociDuplication{$loci}=$duplication;
}

my%MST1loci;
my$MST1currentLG;
my$MST1store_loci=0;
my%MST1LGlociNum;
my%MST1LGlength;
my$MST1lociNum=0;
my$MST1currentPosition;
open(MST1, "<$MSTfile1")||die "cannot open $MSTfile1:$!";
while(my $line=<MST1>){
	chomp $line;
	if($line=~/^group/){
		my($discard, $LG)=split " ", $line, 2;
		$MST1currentLG=$LG;
	}
	if($line=~/\;BEGINOFGROUP/){
		$MST1store_loci=1;
		next;
	}
	if($line=~/\;ENDOFGROUP/){
		$MST1LGlociNum{$MST1currentLG}=$MST1lociNum;
		$MST1LGlength{$MST1currentLG}=$MST1currentPosition;
		$MST1store_loci=0;
		$MST1currentLG="";
		$MST1currentPosition="";
		$MST1lociNum=0;
	}
	if($MST1store_loci==1){
		my($locus, $position)=split "\t", $line, 2;
		#$locus=~s/R//;
		if($LociDuplication{$locus}==1){
			$MST1loci{$locus}=$MST1currentLG;
			$MST1currentPosition=$position;
			$MST1lociNum++;
		}
	}
}
close MST1;

my%MST2loci;
my$MST2currentLG;
my$MST2store_loci=0;
my%MST2LGlociNum;
my%MST2LGlength;
my$MST2lociNum=0;
my$MST2currentPosition;
open(MST2, "<$MSTfile2")||die "cannot open $MSTfile2:$!";
while(my $line=<MST2>){
	chomp $line;
	if($line=~/^group/){
		my($discard, $LG)=split " ", $line, 2;
		$MST2currentLG=$LG;
	}
	if($line=~/\;BEGINOFGROUP/){
		$MST2store_loci=1;
		next;
	}
	if($line=~/\;ENDOFGROUP/){
		$MST2LGlociNum{$MST2currentLG}=$MST2lociNum;
		$MST2LGlength{$MST2currentLG}=$MST2currentPosition;
		$MST2store_loci=0;
		$MST2currentLG="";
		$MST2currentPosition="";
		$MST2lociNum=0;
	}
	if($MST2store_loci==1){
		my($locus, $position)=split "\t", $line, 2;
		#$locus=~s/R//;
		if($LociDuplication{$locus}==1){
			$MST2loci{$locus}=$MST2currentLG;
			$MST2currentPosition=$position;
			$MST2lociNum++;
			#print "$locus\t$MST2currentLG\n";
		}
	}
}
close MST2;

my%LGpairs;
foreach my$key (keys%MST1loci){
	my$Rkey=$key."R";
	if(((exists $MST1loci{$key})&&(exists $MST2loci{$key}))||((exists $MST1loci{$Rkey})&&(exists $MST2loci{$key}))){
		#print "$key\n";
		#print "$MST1loci{$key}\t$MST2loci{$key}\n";
		my$pair="$MST1loci{$key} $MST1LGlociNum{$MST1loci{$key}} $MST1LGlength{$MST1loci{$key}}\t$MST2loci{$key} $MST2LGlociNum{$MST2loci{$key}} $MST2LGlength{$MST2loci{$key}}";
		$LGpairs{$pair}++;
	}
}

foreach my$key (sort keys %LGpairs){
	print "$key\t$LGpairs{$key}\n";
}


I have four families and did three comparisons. 
 
- PBIRD13X110H_65GR to PBIRD13X108H_80GR
- PBIRD13X108H_80GR to PHOOD11x05H_5
- PHOOD11x05H_5 to PHOOD11x01H_4

In [None]:
perl MST_MST_compare_all_dupFilter2.pl PBIRD13X110H_65GR_splitNEW_out.txt PBIRD13X108H_80GR_splitNEW_out.txt Loci.txt >MST_compare_110H_108H.txt
perl MST_MST_compare_all_dupFilter2.pl PBIRD13X108H_80GR_splitNEW_out.txt PHOOD11X05H_5_splitNEW_out.txt Loci.txt >MST_compare_108H_05H.txt
perl MST_MST_compare_all_dupFilter2.pl PHOOD11X05H_5_splitNEW_out.txt PHOOD11X01H_4_splitNEW_out.txt Loci.txt >MST_compare_05H_01H.txt
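
The core of the comparison is a count of shared single-copy loci for every pair of linkage groups. A rough Python sketch of that counting step follows. It leaves out the MSTMap file parsing and the per-group marker counts and lengths that the Perl script also reports, and the function name and data are invented for illustration.

```python
from collections import Counter

def match_linkage_groups(lg_of_locus_1, lg_of_locus_2, duplication):
    """Count loci shared between each pair of linkage groups in two families.

    lg_of_locus_1, lg_of_locus_2: {locus: linkage group} for each family.
    duplication: {locus: 1 or 2}; only loci classified 1 (single copy) count.
    Returns a Counter keyed by (LG in family 1, LG in family 2).
    """
    pairs = Counter()
    for locus, lg1 in lg_of_locus_1.items():
        if duplication.get(locus) != 1:
            continue                      # skip duplicated or unclassified loci
        if locus in lg_of_locus_2:
            pairs[(lg1, lg_of_locus_2[locus])] += 1
    return pairs

# Made-up linkage group assignments for two families.
fam1 = {"a_x1": "lg0", "b_x1": "lg0", "c_x1": "lg1", "d_x1": "lg1"}
fam2 = {"a_x1": "lg5", "b_x1": "lg5", "c_x1": "lg2", "d_x1": "lg2"}
dup = {"a_x1": 1, "b_x1": 1, "c_x1": 1, "d_x1": 2}
pairs = match_linkage_groups(fam1, fam2, dup)
for (lg1, lg2), n in sorted(pairs.items()):
    print("{}\t{}\t{}".format(lg1, lg2, n))
```

A linkage group in one family that pairs with two different groups in the other family, each with many shared loci, is the kind of pattern that would suggest overmerging.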

#Making the input files for LepMap

All of the following code was written by Ryan Waples, taken and adapted from his ipython notebook http://nbviewer.ipython.org/github/rwaples/chum_populations/blob/master/multi-family%20mapping.ipynb

In [3]:
cd G:\  

G:\


In [4]:
cd Analysis\Mapping\AllHaps\LepMap

G:\Analysis\Mapping\AllHaps\LepMap


In [12]:
#rwaples/chum_populations/linkage_map/multi_family_mapping_functions.py
#@rwaples rwaples on Mar 6 consensus mapping with paralogs

from datetime import datetime
import os
import collections
import numpy as np
import itertools
import scipy.spatial.distance
import pandas as pd
import sys
import operator

#Functions
   
def get_ml_R_frac(R, NR):
    """Returns the maximum likelihood recombination fraction, given as: R / (R + NR)"""
    # cast to float so that integer counts are not floor-divided
    if isinstance(R, np.ndarray):
        R = R.astype(float)
    return(R / (R + NR))
    
def get_LOD(R, NR, R_frac):
    """Returns the LOD score for a set of the given values of [R, NR, R_frac].
    LOD scores calculated against the likelihood c =.5"""
    Z = np.log10(
                (np.power((1-R_frac), NR) * np.power(R_frac, R)) / np.power(.5, (R + NR))
            )
    return(Z)
    
def getR(pairs):
    """generator, yields the number of recombinant individuals for a given pair of loci"""
    for x, y in pairs:
        mult = x * y
        yield np.sum(mult == 2)
         
def getNR(pairs):
    """generator, yields the number of NON-recombinant individuals for a given pair of loci"""
    for x, y in pairs:
        mult = x * y
        yield (np.sum(mult == 1) + np.sum(mult == 4))
    
def import_MSTmap(filename):
    with open(filename) as INFILE:
        genotypes_at_locus = dict()
        for line in INFILE:
            if line.startswith('locus_name'): # start parsing
                individuals = line.strip().split()[1:]
                #print(individuals)
                for line in INFILE:
                    line_split = line.strip().split()
                    locus = line_split[0]
                    #print(locus)
                    genotypes = [1 if xx == 'a' else 2 if xx == 'b' else 0 for xx in line_split[1:]]
                    genotypes_at_locus[locus] = genotypes
                    #print(genotypes)
            else:
                pass
    return(individuals, genotypes_at_locus)
    
def remove_by_blacklist(blacklist, genotypes_at_locus):
    print("Starting length of genotypes: {}".format(len(genotypes_at_locus)))
    for locus in blacklist:        
        if locus in genotypes_at_locus.keys():
            del genotypes_at_locus[locus]
        elif locus + "_x1" in genotypes_at_locus.keys():
            del genotypes_at_locus[locus + "_x1"]
        elif locus + "_x2" in genotypes_at_locus.keys():
            del genotypes_at_locus[locus + "_x2"]
    print("Final length of genotypes: {}".format(len(genotypes_at_locus)))
    return(True)
    
def prep_data_pandas(individuals, genotypes_at_locus):
    # prepare pandas data.frame
    # columns are loci, rows are individuals
    my_pd_genos = pd.DataFrame.from_dict(genotypes_at_locus)
    # add an index
    my_pd_genos.index = individuals
    return(my_pd_genos)

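def prepare_matrix(*args):
    """Combine one or more genotype DataFrames into a single integer array.

    Returns (genotype array with loci as rows, list of locus names).
    """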
    for arg in args:
        if not isinstance(arg, pd.DataFrame):
            raise ValueError("need a pandas DataFrame")
        else:
            pass
    # combine all the data sets into one
    all_genos = pd.concat(objs = args, join = 'outer')
    # convert Nan to 0
    all_genos.fillna(value = 0, inplace = True)
    #list of loci
    loci = [str(xx) for xx in all_genos.transpose().index]
    # convert to integer numpy array
    all_genos_np = np.array(all_genos).astype(int)
    #all_genos_np = all_genos_np.astype(int)
    # transpose
    all_genos_np = np.transpose(all_genos_np)
    return(all_genos_np, loci)
    

# returns a redundant square matrix
def get_matrix(data_in_array):
    data_in_matrix = scipy.spatial.distance.squareform(data_in_array)
    np.fill_diagonal(data_in_matrix, np.nan)
    return(data_in_matrix)

def get_recombination_stats(geno_array):
    int_arr = geno_array
    num_loci = int_arr.shape[0]
    num_pairs =  int((num_loci * (num_loci-1))/2)
    
    print('Starting, num_pairs = {}'.format(num_pairs))
    print(str(datetime.now()))
    time_start = datetime.now()
    sys.stdout.flush()
    
    pairs = itertools.combinations(int_arr, 2)
    R = np.fromiter(getR(pairs), dtype = int, count = num_pairs)
    time_R = datetime.now()
    print('Finished R')
    print(str(datetime.now()))
    sys.stdout.flush()
    
    pairs = itertools.combinations(int_arr, 2)
    NR = np.fromiter(getNR(pairs), dtype = int, count = num_pairs)
    time_NR = datetime.now()
    print('Finished NR')
    print(str(datetime.now()))
    sys.stdout.flush()
    
    ml_R_frac = get_ml_R_frac(R = R, NR = NR)
    time_RF = datetime.now()
    print('Finished RF')
    print(str(datetime.now()))
    
    sys.stdout.flush()
    Z = get_LOD(R = R, NR = NR, R_frac = ml_R_frac)
    time_Z = datetime.now()
    print('Finished Z')
    print(str(datetime.now()))
    sys.stdout.flush()
    
    N = R + NR
    MST = np.e**-(2*(N/2. - R)**2/N)
    print('Finished MST')
    time_MST = datetime.now()
    print(str(datetime.now()))
    sys.stdout.flush()
    
    print("R took: {}".format(str(time_R - time_start)))
    print("NR took: {}".format(str(time_NR - time_R)))
    print("RF took: {}".format(str(time_RF - time_NR)))
    print("Z took: {}".format(str(time_Z - time_RF)))
    print("MST took: {}".format(str(time_MST - time_Z)))   
    
    Z_mat = get_matrix(Z)
    RF_mat = get_matrix(ml_R_frac)
    R_mat = get_matrix(R)
    NR_mat = get_matrix(NR)
    MST_mat = get_matrix(MST)
    
    Recombination_stats = collections.namedtuple('Recombination_stats', "R NR RF Z MST" )
    my_stats = Recombination_stats(R_mat, NR_mat, RF_mat, Z_mat, MST_mat)
    return(my_stats)    

def write_loci(loci, path):
    with open(os.path.join(path, 'loci.txt'), 'w') as OUTFILE:
        OUTFILE.write("\n".join(loci))
        
def write_rec_stats(stats, path):
    for stat in stats._fields:
        np.savetxt(X = getattr(stats, stat), fname = os.path.join(path, stat + ".tsv"), delimiter = "\t", fmt = '%1.4g')
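
As a small worked example of the quantities computed above: for a pair of loci with R recombinant and NR non-recombinant individuals, the recombination fraction is R / (R + NR), and the LOD score compares the likelihood at that fraction with the likelihood at 0.5. The sketch below uses only the standard library and made-up haploid genotypes (1 and 2, with 0 for missing), mirroring `getR`, `getNR`, `get_ml_R_frac` and `get_LOD`. The helper name `pair_stats` is not part of Ryan's code.

```python
import math

def pair_stats(geno_x, geno_y):
    """Return (R, NR, recombination fraction, LOD) for two loci.

    Genotypes are coded 1 or 2, with 0 for missing. The product of the two
    codes is 2 for a recombinant individual, 1 or 4 for a non-recombinant
    one, and 0 when either genotype is missing.
    """
    products = [x * y for x, y in zip(geno_x, geno_y)]
    R = sum(1 for p in products if p == 2)
    NR = sum(1 for p in products if p in (1, 4))
    r_frac = R / float(R + NR)
    # log10 of L(r_frac) / L(0.5); math.pow(0.0, 0) is 1.0, so R == 0 is fine
    lod = math.log10(
        math.pow(1 - r_frac, NR) * math.pow(r_frac, R) / math.pow(0.5, R + NR)
    )
    return R, NR, r_frac, lod

# Ten made-up individuals: one recombinant, one missing genotype.
locus_a = [1, 1, 2, 2, 1, 2, 1, 2, 1, 0]
locus_b = [1, 1, 2, 2, 1, 2, 1, 2, 2, 1]
R, NR, r_frac, lod = pair_stats(locus_a, locus_b)
print(R, NR, round(r_frac, 3), round(lod, 3))
```

Individuals with a missing genotype at either locus contribute to neither count, which is why the example ends up with nine informative individuals out of ten.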

In [15]:
stats_file_1 = ('PBIRD13X110H_filtered.stats')
stats_file_2 = ('PBIRD13X108H_filtered.stats')
stats_file_3 = ('PHOOD11X01H_filtered.stats')
stats_file_4 = ('PHOOD11X05H_filtered.stats')

paralogs_file = ('pink_paralogs4.txt')

In [None]:
paralogs = set()

for stats_file in [stats_file_1, stats_file_2, stats_file_3, stats_file_4]:
    with open(stats_file) as INFILE:
        for line in INFILE:
            if line.strip().split()[3] not in ["AA_xx", "AB"]:
                paralogs.add(line.strip().split()[0])
                
paralogs = sorted([int(xx) for xx in paralogs])

with open(paralogs_file, 'w') as OUTFILE:
    for xx in sorted(list(paralogs)):
        OUTFILE.write('{}\n'.format(xx))

In [23]:
linkage_map_file_1 = ('PBIRD13X110H_65GR_splitNEW.txt')
linkage_map_file_2 = ('PBIRD13X108H_80GR_splitNEW.txt')
linkage_map_file_3 = ('PHOOD11X01H_4_splitNEW.txt')
linkage_map_file_4 = ('PHOOD11X05H_5_splitNEW.txt')

In [24]:
individuals_110, genotypes_at_locus_110 = import_MSTmap(linkage_map_file_1)
individuals_108, genotypes_at_locus_108 = import_MSTmap(linkage_map_file_2)
individuals_01, genotypes_at_locus_01 = import_MSTmap(linkage_map_file_3)
individuals_05, genotypes_at_locus_05 = import_MSTmap(linkage_map_file_4)

my_pd_genos_110 = prep_data_pandas(individuals_110, genotypes_at_locus_110)
my_pd_genos_108 = prep_data_pandas(individuals_108, genotypes_at_locus_108)
my_pd_genos_01  = prep_data_pandas(individuals_01,  genotypes_at_locus_01)
my_pd_genos_05  = prep_data_pandas(individuals_05,  genotypes_at_locus_05)

fam_110 = prepare_matrix(my_pd_genos_110)
fam_108 = prepare_matrix(my_pd_genos_108)
fam_01 = prepare_matrix(my_pd_genos_01)
fam_05 = prepare_matrix(my_pd_genos_05)

In [25]:
def rename_loci_by_family(paralogs_file, fam_names, families):
    # check if each family listed in families is formatted as if returned from prepare matrix
    for afam in families:
        if not isinstance(afam, pd.core.frame.DataFrame):
            raise ValueError("families should be a pandas DataFrame")
        else:
            pass
    if len(fam_names) != len(families) :
        raise ValueError("names and families should have the same length")
    if not isinstance(fam_names, list ):
        raise ValueError("names should be a list")
    
    with open(paralogs_file) as INFILE: 
        paralogs = [yy.strip() for yy in INFILE.readlines()]
    # for each family, for each locus, if the locus is a paralog, append family-specific text to locus name
    # genotypes are unchanged
    #new_familes = list()
    for idx, afam in enumerate(families):
        old_locus_names = afam.columns.values.tolist()
        new_locus_names = []
        for xx in old_locus_names:
            base_name = xx[:-3]
            if base_name in paralogs:
                #print("{} is a paralog".format(base_name))
                new_name = "{}_{}_{}".format(base_name, fam_names[idx], xx[-2:])
            else: 
                new_name = base_name
            new_locus_names.append(new_name)
        afam.columns = new_locus_names
    return(families)

In [26]:
renamed_110, renamed_108, renamed_01, renamed_05  = rename_loci_by_family(paralogs_file = paralogs_file, fam_names = ['pink_110','pink_108', 'pink_01', 'pink_05'], families = [my_pd_genos_110, my_pd_genos_108, my_pd_genos_01, my_pd_genos_05])
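
To make the renaming rule concrete: a locus that appears in the paralog list keeps its copy suffix and gains the family name, while every other locus is reduced to its catalog number. The snippet below restates that rule on plain strings. `rename_locus` is a hypothetical helper and the paralog set is invented.

```python
def rename_locus(old_name, family_name, paralogs):
    """Apply the renaming rule used in rename_loci_by_family to one locus name.

    old_name is expected to end in a three-character copy suffix such as
    '_x1' or '_x2', as in the MSTMap input files.
    """
    base_name, copy = old_name[:-3], old_name[-2:]
    if base_name in paralogs:
        return "{}_{}_{}".format(base_name, family_name, copy)
    return base_name

paralogs = {"10003"}                      # made-up paralog list
new_names = [rename_locus(name, "pink_01", paralogs)
             for name in ["10003_x1", "10003_x2", "100077_x1"]]
print(new_names)
```

Because the family name is embedded only in paralog names, single-copy loci line up across families when the tables are merged, while each family's copy of a paralog stays in its own column.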

In [27]:
renamed_01.head


<bound method DataFrame.head of                   10003_pink_01_x1  100077  10007  100176  10017  \
PHOOD11X01H_0001                 1       2      1       2      1   
PHOOD11X01H_0002                 1       0      2       1      1   
PHOOD11X01H_0003                 2       2      2       1      1   
PHOOD11X01H_0004                 1       2      2       2      1   
PHOOD11X01H_0005                 1       1      2       2      1   
PHOOD11X01H_0006                 1       1      1       1      1   
PHOOD11X01H_0007                 2       1      1       1      1   
PHOOD11X01H_0008                 1       2      1       2      1   
PHOOD11X01H_0009                 2       2      2       2      2   
PHOOD11X01H_0010                 1       2      1       2      1   
PHOOD11X01H_0011                 1       2      2       1      1   
PHOOD11X01H_0012                 1       0      1       2      1   
PHOOD11X01H_0013                 1       1      2       1      1   
PHOOD11X01H_0014

In [None]:
##### this should be moved inside function
renamed_110t = renamed_110.transpose()
renamed_108t = renamed_108.transpose()
renamed_01t = renamed_01.transpose()
renamed_05t = renamed_05.transpose()

In [29]:
aa = pd.merge(left = renamed_110t, right = renamed_108t , how = 'outer', left_index =True, right_index =True)
bb = pd.merge(left = aa, right = renamed_01t, how = 'outer', left_index =True, right_index =True) 
cc = pd.merge(left = bb, right = renamed_05t, how = 'outer', left_index =True, right_index =True)

In [30]:
all_my_data, loci_all = prepare_matrix(cc.transpose())

In [31]:
def write_LEPmap(families, family_names, loci, genotypes, output_filename):
    with open(output_filename, 'w') as OUTFILE:
        header = "\t".join(["#family", 'name', 'sire', 'dam', 'sex', 'blank'] + loci) + "\n"
        OUTFILE.write(header)
        for fam_idx, fam in enumerate(families):
            fam_name = family_names[fam_idx]
            DAM_line = "\t".join([fam_name, fam_name + "_Dam", '0', '0', '2', '0'] + ['1 1' for xx in loci]) + "\n"
            SIRE_line = "\t".join([fam_name, fam_name + "_Sire", '0', '0', '1', '0'] + ['1 2' for xx in loci]) + "\n"
            OUTFILE.write(DAM_line)
            OUTFILE.write(SIRE_line)
            for ind in fam:
                ind_info = "\t".join([fam_name, ind, fam_name + "_Sire", fam_name + "_Dam", '0', '0'])
                ind_genotypes = genotypes.loc[ind]
                OUTFILE.write(ind_info + "\t" + "\t".join([str(xx) for xx in ind_genotypes]) + "\n")

In [32]:
fams = [individuals_110, individuals_108, individuals_01, individuals_05]
LEPmap_filename = ('all_loci3.lepmap')

In [33]:
my_genotypes = cc.transpose()
my_genotypes = my_genotypes.replace(to_replace = [np.nan, 0, 1, 2 ], value = ['0 0', '0 0', '1 1', '1 2'])

In [34]:
my_genotypes.shape

(334, 13790)

In [35]:
write_LEPmap(families = fams, family_names = ['pink_110','pink_108','pink_01', 'pink_05'], 
             loci = loci_all, genotypes = my_genotypes, output_filename = LEPmap_filename)

In [36]:
pd.read_csv('all_loci3.lepmap', sep = '\t').shape

(342, 13796)
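
The two shapes are consistent with each other: the file has one row per offspring plus a dam row and a sire row for each of the four families, and six descriptor columns ahead of the loci (the header line is consumed by `read_csv`). The arithmetic, written out:

```python
n_offspring, n_loci = 334, 13790      # my_genotypes.shape
n_families = 4
parents_per_family = 2                # one Dam row and one Sire row
descriptor_columns = 6                # family, name, sire, dam, sex, blank

expected_rows = n_offspring + n_families * parents_per_family
expected_columns = n_loci + descriptor_columns
print(expected_rows, expected_columns)
```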

##Run LepMap once to split chromosomes and to join singles. 

In [81]:
print "java -cp bin/ SeparateChromosomes data = all_loci2.lepmap lodLimit = 10 sizeLimit = 3 > all_loci2.inital_chromosomes 2> all_loci2.inital_chromosomes.log"
print "java -cp bin/ JoinSingles all_loci2.inital_chromosomes lodLimit = 8 data=all_loci2.lepmap > all_loci2.lod8_singles.chromosomes 2> all_loci2.lod8_singles.chromosomes.log"
print "java -cp bin/ JoinSingles all_loci2.lod8_singles.chromosomes lodLimit = 7 data= all_loci2.lepmap > all_loci2.lod7_singles.chromosomes 2> all_loci2.lod7_singles.chromosomes.log"
print "java -cp bin/ JoinSingles all_loci2.lod7_singles.chromosomes lodLimit = 6 data= all_loci2.lepmap > all_loci2.lod6_singles.chromosomes 2> all_loci2.lod6_singles.chromosomes.log"
print "java -cp bin/ JoinSingles all_loci2.lod6_singles.chromosomes lodLimit = 5 data= all_loci2.lepmap > all_loci2.lod5_singles.chromosomes 2> all_loci2.lod5_singles.chromosomes.log"
print "java -cp bin/ JoinSingles all_loci2.lod5_singles.chromosomes lodLimit = 4 data= all_loci2.lepmap > all_loci2.lod4_singles.chromosomes 2> all_loci2.lod4_singles.chromosomes.log"
print "java -cp bin/ JoinSingles all_loci2.lod4_singles.chromosomes lodLimit = 3.5 data= all_loci2.lepmap > all_loci2.lod3-5_singles.chromosomes 2> all_loci2.lod3-5_singles.chromosomes.log"

java -cp bin/ SeparateChromosomes data = all_loci2.lepmap lodLimit = 10 sizeLimit = 3 > all_loci2.inital_chromosomes 2> all_loci2.inital_chromosomes.log
java -cp bin/ JoinSingles all_loci2.inital_chromosomes lodLimit = 8 data=all_loci2.lepmap > all_loci2.lod8_singles.chromosomes 2> all_loci2.lod8_singles.chromosomes.log
java -cp bin/ JoinSingles all_loci2.lod8_singles.chromosomes lodLimit = 7 data= all_loci2.lepmap > all_loci2.lod7_singles.chromosomes 2> all_loci2.lod7_singles.chromosomes.log
java -cp bin/ JoinSingles all_loci2.lod7_singles.chromosomes lodLimit = 6 data= all_loci2.lepmap > all_loci2.lod6_singles.chromosomes 2> all_loci2.lod6_singles.chromosomes.log
java -cp bin/ JoinSingles all_loci2.lod6_singles.chromosomes lodLimit = 5 data= all_loci2.lepmap > all_loci2.lod5_singles.chromosomes 2> all_loci2.lod5_singles.chromosomes.log
java -cp bin/ JoinSingles all_loci2.lod5_singles.chromosomes lodLimit = 4 data= all_loci2.lepmap > all_loci2.lod4_singles.chromosomes 2> all_loci2.lod
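
The JoinSingles calls form a chain: each run reads the map written by the previous one and uses a lower LOD limit. If the commands need regenerating, a loop along these lines could build them. This is a sketch and not part of the original notebook: `lepmap_commands` is a made-up helper, the file-naming scheme is copied from the commands above, and the spacing around `=` is normalised.

```python
def lepmap_commands(prefix, separate_lod, size_limit, join_lods):
    """Build a SeparateChromosomes command followed by chained JoinSingles commands."""
    data = "{}.lepmap".format(prefix)
    current_map = "{}.inital_chromosomes".format(prefix)   # spelling as in the notebook
    commands = [
        "java -cp bin/ SeparateChromosomes data={} lodLimit={} sizeLimit={} > {} 2> {}.log".format(
            data, separate_lod, size_limit, current_map, current_map)
    ]
    for lod in join_lods:
        # 3.5 becomes 'lod3-5', matching the file names used above
        label = str(lod).replace(".", "-")
        next_map = "{}.lod{}_singles.chromosomes".format(prefix, label)
        commands.append(
            "java -cp bin/ JoinSingles {} lodLimit={} data={} > {} 2> {}.log".format(
                current_map, lod, data, next_map, next_map))
        current_map = next_map               # the next run starts from this map
    return commands

commands = lepmap_commands("all_loci2", 10, 3, [8, 7, 6, 5, 4, 3.5])
for command in commands:
    print(command)
```

Building the chain in a loop guarantees that every JoinSingles input is the file the previous step wrote, which is easy to get wrong when the commands are typed by hand.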

##Use the output of LepMap to figure out which duplicated loci that share a name across families are on the same linkage group. Those will be collapsed down to one name. 

In [37]:
def find_duplicate_names(paralogs_file, loci, LG_file, output_file):
    with open(paralogs_file) as INFILE: 
        paralogs = [yy.strip() for yy in INFILE.readlines()]
    #get list of the locus names
        #  given by loci
    #get list of LG assignments
    with open(LG_file) as INFILE:
        #skip first line
        next(INFILE)
        LG_assignments = [int(line.strip()) for line in INFILE]
    LG_of_locus = dict(zip(loci, LG_assignments))
    #print(paralogs)
    with open(output_file, 'w') as OUTFILE:
        OUTFILE.write("{}\t{}\t{}\t{}\n".format('catalog_name', 'family', 'copy', 'LG'))
        for locus in loci:
            catalog_name = locus.split("_")[0]
            if catalog_name in paralogs:
                family = locus.split("_")[2]
                copy = locus[-2:]
                OUTFILE.write("{}\t{}\t{}\t{}\n".format(catalog_name, family, copy, LG_of_locus[locus]))
            #list of all loci sharing base name
                sharing = [loc for loc in loci if locus.split("_")[0] == loc.split("_")[0]]
                sharing.remove(locus)
                agree_on_LG = sum([1 for loc in sharing if LG_of_locus[loc] == LG_of_locus[locus] ])
                disagree_on_LG = sum([1 for loc in sharing if LG_of_locus[loc] != LG_of_locus[locus] ])
                #print(locus, agree_on_LG, disagree_on_LG)

In [38]:
find_duplicate_names(paralogs_file = paralogs_file , loci = loci_all, LG_file = 'all_loci2.lod5_singles.chromosomes', output_file = 'LG_congruence3.tsv')

In [39]:
def collapse_names(LG_110_x1, LG_110_x2, LG_108_x1, LG_108_x2, LG_01_x1, LG_01_x2, LG_05_x1, LG_05_x2):
    # take them in order given here
    possible_results = ['A','B','C','D','E','F','G','H']
    #possible_results.reverse()
    result = []
    
    # check for segmental
    # if segmental do not try to resolve
    for xx in ((LG_110_x1, LG_110_x2), (LG_108_x1, LG_108_x2), (LG_01_x1, LG_01_x2), (LG_05_x1, LG_05_x2)):
        x1, x2 = xx
        if x1 == x2 and x1 != 0: # segmental
            # every copy keeps its own name; break so the for/else branch below is skipped
            result = list(possible_results)
            break

    else:
        # no segmental duplicate found: copies on the same LG share a name
        mapping = dict()
        for assign in (LG_110_x1, LG_110_x2, LG_108_x1, LG_108_x2, LG_01_x1, LG_01_x2, LG_05_x1, LG_05_x2):
            if assign in mapping:
                result.append(mapping[assign])
            else:
                if assign == 0:
                    result.append(possible_results.pop(0))
                else:
                    mapping[assign] = possible_results.pop(0)
                    result.append(mapping[assign])
    return(result)  

In [46]:
def parse_LG_congruence_line(line):
    catalog_name, family, copy, LG = line.strip().split("\t")
    return(catalog_name, family, copy, LG)

In [40]:
#examples
print collapse_names(1,2,2,1,2,1,2,1)
print collapse_names(0,1,2,0,2,0,2,1)
print collapse_names(1,0,2,2,1,0,2,0)
print collapse_names(0,0,1,0,2,0,0,0)
print collapse_names(0,1,2,0,0,2,2,2)

['A', 'B', 'B', 'A', 'B', 'A', 'B', 'A']
['A', 'B', 'C', 'D', 'C', 'E', 'C', 'B']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']


In [47]:
def write_rename_table(LG_congruence_file, out_file):
    famLG_of_locus = collections.defaultdict(dict)
    with open(out_file, 'w') as OUTFILE:
        OUTFILE.write("{}\t{}\t{}\n".format("old_name", "new_name", "LG"))
        with open(LG_congruence_file) as PARALOG_LGS:
            #skip first line
            next(PARALOG_LGS)
            for line in PARALOG_LGS:
                catalog_name, family, copy, LG = parse_LG_congruence_line(line)
                famLG_of_locus[catalog_name][family, copy] = int(LG)
            for cn, famLG in famLG_of_locus.items():
                LG_110_x1 = famLG.get(('110', 'x1'), 0)
                LG_110_x2 = famLG.get(('110', 'x2'), 0)
                LG_108_x1 = famLG.get(('108', 'x1'), 0)
                LG_108_x2 = famLG.get(('108', 'x2'), 0)
                LG_01_x1 = famLG.get(('01', 'x1'), 0)
                LG_01_x2 = famLG.get(('01', 'x2'), 0)
                LG_05_x1 = famLG.get(('05', 'x1'), 0)
                LG_05_x2 = famLG.get(('05', 'x2'), 0)
                #print cn, [LG_110_x1, LG_110_x2, LG_108_x1, LG_108_x2, LG_01_x1, LG_01_x2, LG_05_x1, LG_05_x2], collapse_names(LG_110_x1, LG_110_x2, LG_108_x1, LG_108_x2, LG_01_x1, LG_01_x2, LG_05_x1, LG_05_x2)
                fam_names = ["{}_pink_110_x1", "{}_pink_110_x2", "{}_pink_108_x1", "{}_pink_108_x2", "{}_pink_01_x1", "{}_pink_01_x2", "{}_pink_05_x1", "{}_pink_05_x2"]
                collaped_names = collapse_names(LG_110_x1, LG_110_x2, LG_108_x1, LG_108_x2, LG_01_x1, LG_01_x2, LG_05_x1, LG_05_x2)
                LGs = [LG_110_x1, LG_110_x2, LG_108_x1, LG_108_x2, LG_01_x1, LG_01_x2, LG_05_x1, LG_05_x2]
                family_specific_names = [xx.format(cn) for xx in fam_names]
                for cnt, fsn in enumerate(family_specific_names):
                    OUTFILE.write("{}\t{}\t{}\n".format(fsn, cn+"_{}".format(collaped_names[cnt]), LGs[cnt]))

In [48]:
write_rename_table(LG_congruence_file =  "LG_congruence3.tsv", out_file = "rename_table3.tsv")   
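
For the non-segmental case, the collapsing rule used by `collapse_names` can be read as: walk through the eight family/copy slots in order, give each linkage group a letter the first time it is seen, reuse that letter whenever the same linkage group turns up again, and give every unmapped slot (LG 0) a fresh letter of its own. A stripped-down sketch of just that rule, with a hypothetical function name, a made-up catalog number and no segmental check:

```python
def letters_for_assignments(assignments):
    """Map a sequence of LG assignments to letters; 0 means 'not mapped'.

    Slots sharing a non-zero LG share a letter; each 0 gets its own letter.
    Expects at most eight slots, one letter being available per slot.
    """
    letters = iter("ABCDEFGH")
    letter_of_lg = {}
    result = []
    for lg in assignments:
        if lg == 0:
            result.append(next(letters))
        else:
            if lg not in letter_of_lg:
                letter_of_lg[lg] = next(letters)
            result.append(letter_of_lg[lg])
    return result

catalog_name = "12345"                                  # made-up catalog number
letters = letters_for_assignments([0, 1, 2, 0, 2, 0, 2, 1])
print(["{}_{}".format(catalog_name, letter) for letter in letters])
```

In the rename table, slots that end up with the same letter get the same new name, so their genotype columns are merged into one locus when the families are combined.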

In [50]:
#get rename dict
rename_table = pd.read_table("rename_table3.tsv")
rename_dict = dict(zip(rename_table.old_name,rename_table.new_name))

In [51]:
renamed_110t = renamed_110.rename(columns=rename_dict).transpose()
renamed_108t = renamed_108.rename(columns=rename_dict).transpose()
renamed_01t = renamed_01.rename(columns=rename_dict).transpose()
renamed_05t = renamed_05.rename(columns=rename_dict).transpose()
aa = pd.merge(left = renamed_110t, right = renamed_108t, how = 'outer', left_index =True, right_index =True)
bb = pd.merge(left = aa, right = renamed_01t, how = 'outer', left_index =True, right_index =True)
cc = pd.merge(left = bb, right = renamed_05t, how = 'outer', left_index =True, right_index =True)

In [52]:
fams = [individuals_110, individuals_108, individuals_01, individuals_05]
LEPmap_filename = "collapsed_loci2.lepmap"
my_genotypes = cc.transpose()
my_genotypes = my_genotypes.replace(to_replace = [np.nan, 0, 1, 2 ], value = ['0 0', '0 0', '1 1', '1 2'])
write_LEPmap(families = fams, family_names = ["pink_110", "pink_108", "pink_01", "pink_05"], loci = my_genotypes.columns.values.tolist(),
    genotypes = my_genotypes, output_filename = LEPmap_filename)

##Run LepMap again with the newly renamed loci; loci with the same base name in multiple families that mapped to the same LG have been collapsed down to one name 

In [108]:
# form linkage groups
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 10 sizeLimit = 20 > collapsed_loci.initial.chromosomes 2> collapsed_loci.initial_chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod8_singles.chromosomes 2> collapsed_loci.lod8_singles.chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod7_singles.chromosomes 2> collapsed_loci.lod7_singles.chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod6_singles.chromosomes 2> collapsed_loci.lod6_singles.chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod5_singles.chromosomes 2> collapsed_loci.lod5_singles.chromosomes.log"

java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 10 sizeLimit = 20 > collapsed_loci.initial.chromosomes 2> collapsed_loci.initial_chromosomes.log
java -cp bin/ JoinSingles collapsed_loci.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod8_singles.chromosomes 2> collapsed_loci.lod8_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod7_singles.chromosomes 2> collapsed_loci.lod7_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod6_singles.chromosomes 2> collapsed_loci.lod6_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci.lod5_singles.chromosomes 2> collapsed_loci.lod5_singles.chromosom

###There are 25 linkage groups at the end of this step, which means that there is likely some overmerging. We expect 26 linkage groups, because a normal diploid pink salmon has 52 chromosomes, occasionally 53 (R. B. Phillips and A. R. Kapuscinski, 1988). 

In [2]:
# form linkage groups

print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 11 sizeLimit = 20 > collapsed_loci_11.initial.chromosomes 2> collapsed_loci_11.initial_chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci_11.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod8_singles.chromosomes 2> collapsed_loci_11.lod8_singles.chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci_11.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod7_singles.chromosomes 2> collapsed_loci_11.lod7_singles.chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci_11.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod6_singles.chromosomes 2> collapsed_loci_11.lod6_singles.chromosomes.log"
print "java -cp bin/ JoinSingles collapsed_loci_11.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod5_singles.chromosomes 2> collapsed_loci_11.lod5_singles.chromosomes.log"

java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 11 sizeLimit = 20 > collapsed_loci_11.initial.chromosomes 2> collapsed_loci_11.initial_chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_11.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod8_singles.chromosomes 2> collapsed_loci_11.lod8_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_11.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod7_singles.chromosomes 2> collapsed_loci_11.lod7_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_11.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod6_singles.chromosomes 2> collapsed_loci_11.lod6_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_11.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_11.lod5_singles.chromosomes 

##Testing the LOD scores to find 26 Linkage Groups

In [None]:
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 12 sizeLimit = 20 > collapsed_loci_12.initial.chromosomes 2> collapsed_loci_12.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 13 sizeLimit = 20 > collapsed_loci_13.initial.chromosomes 2> collapsed_loci_13.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 14 sizeLimit = 20 > collapsed_loci_14.initial.chromosomes 2> collapsed_loci_14.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 15 sizeLimit = 20 > collapsed_loci_15.initial.chromosomes 2> collapsed_loci_15.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 16 sizeLimit = 20 > collapsed_loci_16.initial.chromosomes 2> collapsed_loci_16.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 17 sizeLimit = 20 > collapsed_loci_17.initial.chromosomes 2> collapsed_loci_17.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 18 sizeLimit = 20 > collapsed_loci_18.initial.chromosomes 2> collapsed_loci_18.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 19 sizeLimit = 20 > collapsed_loci_19.initial.chromosomes 2> collapsed_loci_19.initial_chromosomes.log"
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 20 sizeLimit = 20 > collapsed_loci_20.initial.chromosomes 2> collapsed_loci_20.initial_chromosomes.log"
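
The nine print statements above differ only in the LOD limit, so they can be generated in a loop. The following is a minimal sketch, not part of the original run: the helper name `separate_chromosomes_command` is hypothetical, and the command template is copied from the cell above (including the spacing around `lodLimit` and `sizeLimit`).

```python
# Sketch: build one SeparateChromosomes command per candidate LOD limit.
# The function name is hypothetical; the template mirrors the commands above.
def separate_chromosomes_command(lod, data="collapsed_loci.lepmap", size_limit=20):
    prefix = "collapsed_loci_{}".format(lod)
    return (
        "java -cp bin/ SeparateChromosomes data={data} "
        "lodLimit = {lod} sizeLimit = {size} "
        "> {prefix}.initial.chromosomes 2> {prefix}.initial_chromosomes.log"
    ).format(data=data, lod=lod, size=size_limit, prefix=prefix)


# LOD limits 12 through 20, as tested in this section.
commands = [separate_chromosomes_command(lod) for lod in range(12, 21)]
for command in commands:
    print(command)
```

Keeping the template in one place makes it harder for a file name to drift between runs.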

In [None]:
LOD 10: 25 lgs 13027 markers in lgs 763 singles
LOD 11: 25 lgs 12328 markers in lgs 7769 singles
LOD 12: 25 lgs 12318 markers in lgs 779 singles
LOD 13: 25 lgs 12294 markers in lgs 803 singles
LOD 14: 25 lgs 12273 markers in lgs 824 singles
LOD 15: 25 lgs 12186 markers in lgs 911 singles
LOD 16: 26 lgs 12076 markers in lgs 1021 singles
LOD 17: 27 lgs 11833 markers in lgs 1264 singles
LOD 18: 27 lgs 11483 markers in lgs 1614 singles
LOD 19: 30 lgs 10871 markers in lgs 2226 singles
LOD 20: 33 lgs 9357 markers in lgs 3740 singles
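
A short script can pick out the LOD limit that gives the expected 26 linkage groups and sanity-check the counts. This is a sketch under one assumption: that the number of markers in linkage groups plus the number of singles should be the same in every row, since each run uses the same data file. The numbers are copied verbatim from the summary above. Rows whose total differs from the most common total are flagged for re-checking, since a mismatch may point to a transcription slip.

```python
import re
from collections import Counter

# Summary lines copied verbatim from the cell above.
summary = """\
LOD 10: 25 lgs 13027 markers in lgs 763 singles
LOD 11: 25 lgs 12328 markers in lgs 7769 singles
LOD 12: 25 lgs 12318 markers in lgs 779 singles
LOD 13: 25 lgs 12294 markers in lgs 803 singles
LOD 14: 25 lgs 12273 markers in lgs 824 singles
LOD 15: 25 lgs 12186 markers in lgs 911 singles
LOD 16: 26 lgs 12076 markers in lgs 1021 singles
LOD 17: 27 lgs 11833 markers in lgs 1264 singles
LOD 18: 27 lgs 11483 markers in lgs 1614 singles
LOD 19: 30 lgs 10871 markers in lgs 2226 singles
LOD 20: 33 lgs 9357 markers in lgs 3740 singles
"""

pattern = re.compile(r"LOD (\d+): (\d+) lgs (\d+) markers in lgs (\d+) singles")

rows = []
for line in summary.splitlines():
    match = pattern.match(line.strip())
    if match:
        lod, groups, in_groups, singles = map(int, match.groups())
        rows.append({"lod": lod, "groups": groups, "total": in_groups + singles})

# LOD limits that yield the expected number of linkage groups.
expected_groups = 26
candidates = [row["lod"] for row in rows if row["groups"] == expected_groups]

# Assumption: grouped markers + singles should be constant across runs.
most_common_total = Counter(row["total"] for row in rows).most_common(1)[0][0]
inconsistent = [row["lod"] for row in rows if row["total"] != most_common_total]

print("LOD limits giving", expected_groups, "groups:", candidates)
print("Most common marker total:", most_common_total)
print("Rows with a different total (worth re-checking):", inconsistent)
```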

In [9]:
print "java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 16 sizeLimit = 20 > collapsed_loci_16.initial.chromosomes 2> collapsed_loci_16.initial_chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_16.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod8_singles.chromosomes 2> collapsed_loci_16.lod8_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_16.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod7_singles.chromosomes 2> collapsed_loci_16.lod7_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_16.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod6_singles.chromosomes 2> collapsed_loci_16.lod6_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_16.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod5_singles.chromosomes 2> collapsed_loci_16.lod5_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_16.lod5_singles.chromosomes lodLimit = 4 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod4_singles.chromosomes 2> collapsed_loci_16.lod4_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_16.lod4_singles.chromosomes lodLimit = 3.5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod3-5_singles.chromosomes 2> collapsed_loci_16.lod3-5_singles.chromosomes.log"

java -cp bin/ SeparateChromosomes data=collapsed_loci.lepmap lodLimit = 16 sizeLimit = 20 > collapsed_loci_16.initial.chromosomes 2> collapsed_loci_16.initial_chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_16.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod8_singles.chromosomes 2> collapsed_loci_16.lod8_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_16.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod7_singles.chromosomes 2> collapsed_loci_16.lod7_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_16.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod6_singles.chromosomes 2> collapsed_loci_16.lod6_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_16.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_16.lod5_singles.chromosomes 
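
Each JoinSingles step takes the previous step's output as its input map, so the whole chain can be built from a starting prefix and a list of descending LOD limits. The following is a minimal sketch, not the code that was actually run. The helper names `lod_label` and `join_singles_chain` are hypothetical, and the command template and file naming (for example `lod3-5` for a limit of 3.5) are copied from the cell above.

```python
def lod_label(lod):
    # 3.5 -> "3-5", 8 -> "8", mirroring the file names used above.
    return str(lod).replace(".", "-")


def join_singles_chain(prefix, lod_limits, data="collapsed_loci.lepmap", lod_difference=3):
    """Return JoinSingles commands where each step reads the previous output."""
    commands = []
    current_map = "{}.initial.chromosomes".format(prefix)
    for lod in lod_limits:
        output = "{}.lod{}_singles.chromosomes".format(prefix, lod_label(lod))
        commands.append(
            "java -cp bin/ JoinSingles {inp} lodLimit = {lod} lodDifference={diff} "
            "data={data} > {out} 2> {out}.log".format(
                inp=current_map, lod=lod, diff=lod_difference, data=data, out=output
            )
        )
        # The next step starts from the map just written.
        current_map = output
    return commands


commands = join_singles_chain("collapsed_loci_16", [8, 7, 6, 5, 4, 3.5])
for command in commands:
    print(command)
```

Calling the same function with the prefix `collapsed_loci_17` would give the equivalent chain for a LOD 17 starting map.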

###Using this Perl script to figure out which linkage groups correspond between runs

In [None]:
#!/usr/bin/perl -w
#SepChrom_SepChrom_compare_all.pl

use strict;

my$MSTfile1=$ARGV[0];
my$MSTfile2=$ARGV[1];

my%MST1loci;
my$MST1currentLG;
my$MST1store_loci=0;
my%MST1LGlociNum;
#my%MST1LGlength;
my$MST1lociNum=0;
my$MST1currentPosition;
my$MST1locus=0;
open(MST1, "<$MSTfile1")||die "cannot open $MSTfile1:$!";
my$header=<MST1>;
while(my $line=<MST1>){
	chomp $line;
	$MST1locus++;
	$MST1loci{$MST1locus}=$line;
	$MST1LGlociNum{$line}++;
	# if($line=~/^group/){
		# my($discard, $LG)=split " ", $line, 2;
		# $MST1currentLG=$LG;
	# }
	# if($line=~/\;BEGINOFGROUP/){
		# $MST1store_loci=1;
		# next;
	# }
	# if($line=~/\;ENDOFGROUP/){
		# $MST1LGlociNum{$MST1currentLG}=$MST1lociNum;
		# $MST1LGlength{$MST1currentLG}=$MST1currentPosition;
		# $MST1store_loci=0;
		# $MST1currentLG="";
		# $MST1currentPosition="";
		# $MST1lociNum=0;
	# }
	# if($MST1store_loci==1){
		# my($locus, $position)=split "\t", $line, 2;
		# #$locus=~s/R//;
		# $MST1loci{$locus}=$MST1currentLG;
		# $MST1currentPosition=$position;
		# $MST1lociNum++;
	# }
}
close MST1;

my%MST2loci;
my$MST2currentLG;
my$MST2store_loci=0;
my%MST2LGlociNum;
#my%MST2LGlength;
my$MST2lociNum=0;
my$MST2currentPosition;
my$MST2locus=0;

open(MST2, "<$MSTfile2")||die "cannot open $MSTfile2:$!";
$header=<MST2>;
while(my $line=<MST2>){
	chomp $line;
	$MST2locus++;
	$MST2loci{$MST2locus}=$line;
	$MST2LGlociNum{$line}++;
	# if($line=~/^group/){
		# my($discard, $LG)=split " ", $line, 2;
		# $MST2currentLG=$LG;
	# }
	# if($line=~/\;BEGINOFGROUP/){
		# $MST2store_loci=1;
		# next;
	# }
	# if($line=~/\;ENDOFGROUP/){
		# $MST2LGlociNum{$MST2currentLG}=$MST2lociNum;
		# $MST2LGlength{$MST2currentLG}=$MST2currentPosition;
		# $MST2store_loci=0;
		# $MST2currentLG="";
		# $MST2currentPosition="";
		# $MST2lociNum=0;
	# }
	# if($MST2store_loci==1){
		# my($locus, $position)=split "\t", $line, 2;
		# #$locus=~s/R//;
		# $MST2loci{$locus}=$MST2currentLG;
		# $MST2currentPosition=$position;
		# $MST2lociNum++;
		# #print "$locus\t$MST2currentLG\n";
	# }
}
close MST2;

my%LGpairs;
foreach my$key (keys%MST1loci){
	#my$Rkey=$key."R";
	#if(((exists $MST1loci{$key})&&(exists $MST2loci{$key}))||((exists $MST1loci{$Rkey})&&(exists $MST2loci{$key}))){
	if((exists $MST1loci{$key})&&(exists $MST2loci{$key})){
		#print "$key\n";
		#print "$MST1loci{$key}\t$MST2loci{$key}\n";
		my$pair="$MST1loci{$key} $MST1LGlociNum{$MST1loci{$key}}\t$MST2loci{$key} $MST2LGlociNum{$MST2loci{$key}}";
		$LGpairs{$pair}++;
	}
}

foreach my$key (sort keys %LGpairs){
	print "$key\t$LGpairs{$key}\n";
}
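
For readers who do not use Perl, the same comparison can be sketched in Python. This is an illustrative re-implementation, not Garrett's script. Like the Perl version, it assumes each file has one header line followed by one linkage-group label per marker, with markers in the same order in both files. The function names and the toy input are hypothetical, and the row order may differ from the Perl output because the sort key is built differently.

```python
from collections import Counter


def read_assignments(lines):
    """Return the linkage-group label of each marker, skipping the header line."""
    return [line.strip() for line in lines[1:]]


def compare_assignments(first, second):
    """Count markers shared by each pair of linkage groups across two runs."""
    size_first = Counter(first)    # markers per linkage group in run 1
    size_second = Counter(second)  # markers per linkage group in run 2
    pairs = Counter(zip(first, second))  # zip stops at the shorter list
    return [
        (lg1, size_first[lg1], lg2, size_second[lg2], shared)
        for (lg1, lg2), shared in sorted(pairs.items())
    ]


# Toy input (made up for illustration): five markers in two runs.
run_a = ["#header", "1", "1", "2", "2", "3"]
run_b = ["#header", "1", "1", "1", "2", "3"]

table = compare_assignments(read_assignments(run_a), read_assignments(run_b))
for row in table:
    # Same layout as the Perl output: "LG size<TAB>LG size<TAB>shared markers"
    print("{} {}\t{} {}\t{}".format(*row))
```

In the toy data, group 2 of the first run is split between groups 1 and 2 of the second run. That is the kind of pattern to look for when checking for fragmentation or overmerging. A clean one-to-one match shows each group paired with exactly one group in the other run.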


##Ordering Markers

In [13]:
for xx in reversed(range(1,29,2)):
    print("java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome={} > chr_{}.map 2> chr_{}.map.log\n".format(xx, xx, xx))

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=27 > chr_27.map 2> chr_27.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=25 > chr_25.map 2> chr_25.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=23 > chr_23.map 2> chr_23.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=21 > chr_21.map 2> chr_21.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lep

In [14]:
for xx in reversed(range(2,28,2)):
    print("java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome={} > chr_{}.map 2> chr_{}.map.log\n".format(xx, xx, xx))

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=26 > chr_26.map 2> chr_26.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=24 > chr_24.map 2> chr_24.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=22 > chr_22.map 2> chr_22.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=20 > chr_20.map 2> chr_20.map.log

java -cp bin OrderMarkers map=collapsed_loci_16.lod8_singles.chromosomes data=collapsed_loci.lep

##Plotting the linkage map in R

I took the results of the run and compiled them into a single tab-delimited text file with four columns: the linkage group, the locus name, the position, and whether the locus is a paralog (listed in the pink_paralogs2.txt file that is output from an earlier step). This file is saved as MAP.txt, and the rows for paralogs only are saved as MAPdups.txt.
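
The paralog-only file can be derived from the full map file with a small filter. The following is a sketch under assumptions rather than the procedure actually used. The column names `LG`, `position` and `paralog`, and the values `yes`/`no`, are taken from the R code in this section. The column name `locus`, the function name `extract_paralogs`, and the example rows are made up for illustration.

```python
import csv
import io


def extract_paralogs(map_handle, dups_handle):
    """Copy the header and every row marked paralog == "yes"; return the row count."""
    reader = csv.DictReader(map_handle, delimiter="\t")
    writer = csv.DictWriter(
        dups_handle, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row["paralog"] == "yes":
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-in for MAP.txt; locus names and positions are invented.
example_map = io.StringIO(
    "LG\tlocus\tposition\tparalog\n"
    "1\tlocus_a\t0.0\tno\n"
    "1\tlocus_b\t1.5\tyes\n"
    "2\tlocus_c\t0.0\tyes\n"
)
example_dups = io.StringIO()
n_kept = extract_paralogs(example_map, example_dups)
print(example_dups.getvalue())
```

With real files, the two `StringIO` objects would be replaced by `open("MAP.txt", newline="")` and `open("MAPdups.txt", "w", newline="")`.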

In [None]:
library(ggplot2)
library(stringr)
library(plyr)


map_positions <- read.table("G:\\Analysis\\Mapping\\AllHaps\\LepMap\\MAP.txt", sep = "\t", header = TRUE)
duplicates <- read.table("G:\\Analysis\\Mapping\\AllHaps\\LepMap\\MAPdups.txt", sep = "\t", header = TRUE)

###this one works ok to show duplicates in the data
dup_color <-c(yes = "#000f2d", no = "#ffd9f2")
ggplot(data = map_positions) + geom_point(aes(x = LG, y = position, color = paralog), alpha = .5, size = 3) + scale_colour_manual(values = dup_color)+ theme_classic()


##plotting only the duplicates
ggplot(data = duplicates) + geom_point(aes(x = LG, y = position), alpha = .5, size = 3, color = "blue") 


##this jitters the duplicates so you can see them 
ggplot(data = duplicates, aes(x = LG, y = position)) + geom_point(position = position_jitter(w = 0.3, h = 0.3), alpha = .5, size = 3, color = "firebrick") 


The 13th linkage group looks different from the others: it has a large gap and is less compact. I used Garrett's SepChrom_SepChrom_compare_all.pl script from earlier to compare the results of the SeparateChromosomes runs and check whether there is an obvious fragmentation or overmerging issue tied to the different LOD scores at that stage. There were no overlaps involving the 13th linkage group; the groups all matched one to one across the different LOD score runs. Garrett recommends that I re-order the linkage group using the output of a more conservative initial SeparateChromosomes run, so I'll try a LOD score of 17. I already have the SeparateChromosomes output for LOD 17, but still need the JoinSingles results.

In [10]:
print "java -cp bin/ JoinSingles collapsed_loci_17.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod8_singles.chromosomes 2> collapsed_loci_17.lod8_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_17.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod7_singles.chromosomes 2> collapsed_loci_17.lod7_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_17.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod6_singles.chromosomes 2> collapsed_loci_17.lod6_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_17.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod5_singles.chromosomes 2> collapsed_loci_17.lod5_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_17.lod5_singles.chromosomes lodLimit = 4 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod4_singles.chromosomes 2> collapsed_loci_17.lod4_singles.chromosomes.log"

print "java -cp bin/ JoinSingles collapsed_loci_17.lod4_singles.chromosomes lodLimit = 3.5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod3-5_singles.chromosomes 2> collapsed_loci_17.lod3-5_singles.chromosomes.log"

java -cp bin/ JoinSingles collapsed_loci_17.initial.chromosomes lodLimit = 8 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod8_singles.chromosomes 2> collapsed_loci_17.lod8_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_17.lod8_singles.chromosomes lodLimit = 7 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod7_singles.chromosomes 2> collapsed_loci_17.lod7_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_17.lod7_singles.chromosomes lodLimit = 6 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod6_singles.chromosomes 2> collapsed_loci_17.lod6_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_17.lod6_singles.chromosomes lodLimit = 5 lodDifference=3 data=collapsed_loci.lepmap > collapsed_loci_17.lod5_singles.chromosomes 2> collapsed_loci_17.lod5_singles.chromosomes.log
java -cp bin/ JoinSingles collapsed_loci_17.lod5_singles.chromosomes lodLimit = 4 lodDifference=3 data=collapsed_loci.lepmap 

In [None]:
java -cp bin OrderMarkers map=collapsed_loci_17.lod8_singles.chromosomes data=collapsed_loci.lepmap alpha=1 maxDistance=30 useKosambi=1 initError=0.001 initRecombination=0.01 0.01 chromosome=13 > LOD17_chr_13.map 2> LOD17_chr_13.map.log

##Comparing MSTMap and LepMap results

In [None]:
Data: 
    LepMap output from the run with a Separate Chromosome LOD 17 and a Join Singles LOD 8 
    MSTMap output from: 
                    PBIRD13X108H_80GR 
                    PBIRD13X110H_65GR
                    PHOOD11x01H_4
                    PHOOD11x05H_5