Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coordinates from the events file #22

Open
pjm43 opened this issue Jan 20, 2022 · 4 comments
Open

coordinates from the events file #22

pjm43 opened this issue Jan 20, 2022 · 4 comments

Comments

@pjm43
Copy link

pjm43 commented Jan 20, 2022

Hi Esteban,

I'm interested in pulling the specific chromosomal coordinates identified in the events file after running run_and_plot_chromeister.sh. I believe they are reported globally across the genome (the genome has 21 chromosomes in the fasta) in this file, nor does the file specify the chromosome from which they are derived. What is the best way to covert global coordinate to local chromosomal coordinates for the events identified?

Thanks in advance,

Jeff

@pjm43
Copy link
Author

pjm43 commented Jan 20, 2022

Sorry, I should be more specific. The events.txt file has multiple descriptions of the syntenic blocks, I'm specifically interested in the rearranged block (i.e., inversions, transposition, etc.) and not the conserved blocks.
Thanks,
Jeff

@estebanpw
Copy link
Owner

Hello Jeff!

You are correct in your guess: the coordinates are always global (internally the fasta headers > sequence ... get removed). In case you want these to be local there are two approaches:

  • You could re-run everything with every chromosome in a separate file (this takes a bit more time but it is still fast).
  • You can convert them by using the following script that I just pushed into the repository (remember to update your repository). It is located in the bin folder and named global_to_local.sh. To use it, you have to run bin/global_to_local.sh comparison.csv comparison.events.txt > localEvents.txt. Important: the two provided arguments are the comparison.csv which is the auto-generated csv file (which contains coordinates) and the events file. The new local coordinates per chromosome will be in the localEvents.txt file. Disclaimer: I have not had much time to test it so be careful with it and check that it works (e.g. if you notice that coordinates are too big or negative, then that would be a sign that it does not work). Also, remember that the coordinates in the events file are a coarse-grained approximation!

Note: The script uses a naive approach which makes it slow if you the fasta files contain many sequences or if the events file contains many events. It is naive because it will check every event with all the lengths of the sequences until one that is bigger is found. This could be easily improved with a binary search, but it should work without further problem if you have e.g. 30 by 30 chromosomes.

Btw: I am no longer able to maintain this repository on a daily basis as I recently switched jobs. Still I will try to help when possible!

Hope this helps,
Esteban

@pjm43
Copy link
Author

pjm43 commented Jan 27, 2022

Esteban,

Thanks so much for you help - very much appreciated - I'll give you script a try now!

A couple more quick questions - is there a way to export the figure as a higher resolution file? I'm looking for something that I could import into illustrator to make manuscript quality figures. Lastly, I'm still trying to wrap my head around the level argument. It appears that it should be set to 4, except when using with large plant genome where you indicate that it could be set to 1. What exactly is this argument doing?

I know you've switched jobs and really appreciate you still helping with chromeister.

Jeff

@estebanpw
Copy link
Owner

estebanpw commented Jan 29, 2022

Hey there @pjm43 ,

Thanks so much for you help - very much appreciated

Thanks for using CHROMEISTER! :)

I'll give you script a try now!

Let me know if it works alright!

is there a way to export the figure as a higher resolution file?

I think this should be possible. In essence, what CHROMEISTER plots is basically a matrix of 0's and 1's after it has computed some stuff. This is the matrix called dotplot.mat.raw.txt (although the first two lines include the sequence lengths). I have tried something out that might work out for you. First, remove the two length lines from the file by doing:

tail -n +3 dotplot.mat.raw.txt > dotplot.nolen

Then copy this python code:

import matplotlib.pyplot as plt
import numpy as np
import matplotlib
matplotlib.use('Agg')


fig = plt.figure(figsize=(16, 16))
matrix = np.loadtxt('dotplot.nolen')
plt.imshow(matrix, cmap='Greys', interpolation='nearest')
plt.savefig("dotplot.svg")

Note: Remember to change the dotplot.nolen to the name of your comparison file.
Note 2: You may need to set up a virtual env and install python packages with pip.

Then you can run the above python script and it will save the matrix to an svg file.
If you are happy with the output, you can then add the axes names and titles (check out the matplotlib library for this).

This is the resulting svg for a normal comparison:

dotplot

(Note: you should be able to download the svg by right clicking on it, I am not sure how github will render it).

This is not a very clean solution but it might work for what you need.
IMPORTANT: saving dotplots from CHROMEISTER as vector graphics file was not planned originally because the png file should provide with enough detail - especially considering that CHROMEISTER is very, very heuristic. This means that if you make the dotplot of too much resolution then some boundary pixels might show signal when there is none in reality, i.e. CHROMEISTER will give you a good coarse-grain preview, but going into the fine-grain detail might produce some artifacts (remember that every pixel in the dotplot typically corresponds to around 100,000 base pairs in sequences of length 100 MBp!). Always check that the results are correct and that they make sense!

I'm still trying to wrap my head around the level argument. It appears that it should be set to 4, except when using with large plant genome where you indicate that it could be set to 1. What exactly is this argument doing?

I would not give too much though to this parameter, but in essence it controls how many nucleotides are skipped to calculate the hash function. This means that if you input z=4 (the default) only 1 in every 4 nucleotides will be used to compute the hash of a word. If you use z=1, all of them will be used, thus leading to less collisions. Therefore, a low z value is recommended when comparing very large sequences with many repeats (such as plants) to avoid "too many" collisions. But in my experience this affects the output by very little and only in some comparisons, so I think you are probably safe leaving to 4.

Hope it helps!
Esteban

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants