TypeError: can only concatenate str (not "int") to str #42

Open
FatihSarigol opened this issue Apr 14, 2021 · 8 comments

Comments

@FatihSarigol

Hello,
My test run (fithic/tests/run_tests-git.sh) finished successfully, but when I ran version 2.0.7 on my own files with this command:

python3 fithic.py -f fithic.fragmentMappability.gz -i fithic.interactionCounts.gz -o FitHicAmphioxus -t fithic.biases.gz -r 150000

I received this error:

Reading the contact counts file to generate bins...
Interactions file read. Time took 26.23392629623413
Traceback (most recent call last):
File "/home/user/sarigoel/Programs/FITHIC/fithic/fithic/fithic.py", line 1324, in <module>
main()
File "/home/user/sarigoel/Programs/FITHIC/fithic/fithic/fithic.py", line 323, in main
(binStats,noOfFrags, maxPossibleGenomicDist, possibleIntraInRangeCount, possibleInterAllCount, interChrProb, baselineIntraChrProb) = generate_FragPairs(observedInterAllCount, observedInterAllSum, binStats, fragsFile, resolution)
File "/home/user/sarigoel/Programs/FITHIC/fithic/fithic/fithic.py", line 600, in generate_FragPairs
print("ERROR - the chromosome " + ch + " has " + len(allFragsDic[ch]) + " valid fragments/bins and should be removed from the input fragment information !!! ")
TypeError: can only concatenate str (not "int") to str

Here is what my input files look like:

[sarigoel@myotis AMPHIOXUS]$ zcat fithic.biases.gz | head -n2
Sc7u5tJ_517 75000 1.970547623956338
Sc7u5tJ_517 225000 0.40157523166875075
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.fragmentMappability.gz | head -n2
Sc7u5tJ_517 0 75000 17395 1
Sc7u5tJ_517 150000 225000 2437 1
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.interactionCounts.gz | head -n2
Sc7u5tJ_517 75000 Sc7u5tJ_517 75000 1700
Sc7u5tJ_517 75000 Sc7u5tJ_517 225000 5

I used an old HiC-Pro (version 2.10.0) to generate my initial data and converted it with this script:

python3 HiCPro2FitHiC.py -i Sample1_150000.matrix -b Sample1_150000_abs.bed -s Sample1_150000_iced.matrix.biases -o . -r 150000

The input files had the following line counts:
3776446 Sample1_150000.matrix
3769 Sample1_150000_abs.bed
3769 Sample1_150000_iced.matrix.biases

Their first two lines were as follows:

==> Sample1_150000.matrix <==
1 1 1700
1 2 5

==> Sample1_150000_abs.bed <==
Sc7u5tJ_517 0 150000 1
Sc7u5tJ_517 150000 246623 2

==> Sample1_150000_iced.matrix.biases <==
1.917118534333063673e+00
3.906869898508548156e-01

The Sample1_150000_iced.matrix.biases file also had nan values, which I guess were converted to -1.

Following the conversion the files kept their original lengths:

[sarigoel@myotis AMPHIOXUS]$ zcat fithic.interactionCounts.gz | wc -l
3776446
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.fragmentMappability.gz | wc -l
3769
[sarigoel@myotis AMPHIOXUS]$ zcat fithic.biases.gz | wc -l
3769

As for the chromosome names, they all start with Sc7u5tJ_ followed by a scaffold number, and they contain no special character other than the underscore.

The log file had these lines:

###########
Interactions file read successfully
Observed, Intra-chr in range: pairs= 275495 totalCount= 6213510
Observed, Intra-chr all: pairs= 275495 totalCount= 6213510
Observed, Inter-chr all: pairs= 3500951 totalCount= 7397792
Range of observed genomic distances [0 35250000]

Making equal occupancy bins
Observed intra-chr read counts in range 6213510
Desired number of contacts per bin 62135.1,
Number of bins 100
Equal occupancy bins generated

Looping through all possible fragment pairs in-range_
############

Can you think of a reason that may have caused the error?
Thank you!

@ay-lab
Owner

ay-lab commented Apr 14, 2021

I believe the error comes from "ch" in the line below being an integer for at least one chr. Can you check all your contigs/chrs to make sure none of them are somehow integers? (I see you already say they are not, but please double-check.)
print("ERROR - the chromosome " + ch + " has " + len(allFragsDic[ch]) + " valid fragments/bins and should be removed from the input fragment information !!! ")
The other possibility is a problem related to a Python version difference: len(allFragsDic[ch]) is an integer (as it should be) and cannot be appended to the overall string.
Overall, I believe that if you filter out the chrs/contigs with no valid bins (all out of the bias value range), the code should run.

@FatihSarigol
Author

Thank you for your reply.
I checked again, this time using grep to search for any scaffold name that doesn't contain Sc7u5tJ_, and found none, so all of them do start with this prefix followed by a number.
I removed the bins shorter than my bin size of 150000 bases from the bed file (is that what you mean by filtering out contigs with no valid bins/out of the bias value range?), but then HiCPro2FitHiC.py gave a key error with that bed file. (I suppose I also need to remove them from the biases file? Or all of their interactions from the matrix as well?)
Or do you mean the ones with a nan value in the biases file?
Thanks!

@aryakaul
Collaborator

Can you confirm you're using Python3? If you are and are still getting this, then I would recommend attempting to remove the Sc7u5tJ_ from the files (sed 's/Sc7u5tJ_//g') and see if that resolves it.

@FatihSarigol
Author

Thank you for your reply again,
Yes, I am using Python 3.8, but it is on an HPCC: I use the Python installed on the cluster and installed the dependency packages locally myself. Having said that, I just ran the test script again in this session and it again ended with "All tests completed successfully. Fit-Hi-C is up and running!". I see that the test script calls Python via the python3 command, which is the same way I run it. (Python 2 also happens to be on my path, and the plain python command calls it because it sits in a default bin folder. So if some line inside the code calls plain python rather than python3, unlike the test script, it would end up running Python 2.)
I removed the Sc7u5tJ_ prefix with the command you suggested and the conversion went well, but FitHiC gave the same error on the new files, where the contig names are only numbers.
I tried adding 3 to the environment line of fithic.py, but that gave another error.
So if you believe that, even though the test script runs well, the error may be related to Python 2 being called somehow, I can try installing it via conda, I guess.
Thanks!
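
A note on the interpreter question: the wording of the error is itself a hint. `can only concatenate str (not "int") to str` is how Python 3 phrases this TypeError, whereas Python 2 reports `cannot concatenate 'str' and 'int' objects`. To confirm which interpreter actually executes a script, a minimal sketch like the following can be placed at the top of it. The helper name `interpreter_summary` is made up for illustration and is not part of FitHiC.

```python
import sys


def interpreter_summary():
    """Describe the interpreter that is executing this script."""
    # sys.version_info is a tuple-like object: (major, minor, micro, ...).
    version = ".".join(str(part) for part in sys.version_info[:3])
    # str.format is used instead of an f-string so the check itself
    # would still parse if Python 2 were the one running it.
    return "{} (Python {})".format(sys.executable, version)


print(interpreter_summary())
```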

@aryakaul
Collaborator

After looking at the code, I think the error you're getting is actually a bug in the way we output our error message.

Regardless, this is a check to make sure people don't see the #39 error. I'd go through the scaffolds in your fragments + bias file to make sure you have no scaffold which has no valid fragments. You can also throw a print(ch) right before this line to find out which scaffold is causing the issue.
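
For reference, the message fails because `len(...)` returns an int and `+` cannot join a str with an int. A minimal sketch of one way to build the same message safely is given here. The dictionary is a hypothetical stand-in for `allFragsDic` (its real structure and the exact condition around line 600 of fithic.py are assumptions), and `no_valid_fragments_message` is an illustrative name, not a FitHiC function.

```python
def no_valid_fragments_message(ch, frags):
    """Build the error text without mixing str and int operands."""
    # An f-string formats both the scaffold name and the integer count,
    # so it works whether ch is a str or an int.
    return (
        f"ERROR - the chromosome {ch} has {len(frags)} valid fragments/bins "
        "and should be removed from the input fragment information !!! "
    )


# Hypothetical stand-in for allFragsDic: scaffold name -> valid fragment midpoints.
all_frags = {"Sc7u5tJ_517": [75000, 225000], "Sc7u5tJ_1239": []}

for ch, frags in all_frags.items():
    if len(frags) == 0:
        print(no_valid_fragments_message(ch, frags))
```

Wrapping the count in `str(len(allFragsDic[ch]))` inside the original concatenation would work just as well. Either way the message prints and names the offending scaffold instead of raising a TypeError.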

@FatihSarigol
Author

Thank you for your help one more time!

Below are the bias values for the chr names it printed (it stopped after the last one) when I ran it with print(ch) added at line 600:

517 75000 1.970547623956338
517 225000 0.40157523166875075
1522 75000 0.09166041122148283
836 75000 0.06371769236176253
396 75000 0.10903455977486064
462 75000 0.18866132416648182
818 75000 0.5478669940197456
818 225000 0.0831076690152435
429 75000 0.9607972377588191
1131 75000 -1
1239 75000 -1

So is the problem having -1 as a bias value (those were converted from nan in the Sample1_150000_iced.matrix.biases file by HiCPro2FitHiC.py)? If that were the case, I would have expected it to stop at the previous scaffold, which also had -1. So I looked into the fragmentMappability file and saw that the last one differs by having zero values:

517 0 75000 17395 1
517 150000 225000 2437 1
1522 0 75000 503 1
836 0 75000 303 1
396 0 75000 685 1
462 0 75000 1154 1
818 0 75000 3455 1
818 150000 225000 475 1
429 0 75000 6951 1
1131 0 75000 3 1
1239 0 75000 0 0

I checked out #39, and it seems that I should actually remove any chromosome whose bias values are all less than 0.5? In my case I have quite a lot of those, probably because I mapped the Hi-C reads to the whole reference genome and ran HiC-Pro on small scaffolds as well, rather than only on actual chromosomes.

Anyway, at least for this specific error, I tried different things and the error finally went away when I removed the chromosomes (in reality, short scaffolds) with zero mappability from the fragmentMappability file and, at the same time, the corresponding lines from the biases file. This is not straightforward to do by simple pattern matching, since the biases file has no information about which scaffolds have zero mappability, and in my case the values in column 4 of fithic.fragmentMappability can also match the scaffold names to remove. If anybody runs into the same issue, I can happily share my solution. Or, since FitHiC apparently won't run in any similar case, maybe HiCPro2FitHiC.py could include a few lines to remove such cases from the two files directly? One last thing: it also worked when I kept a chromosome in which only the last window had zero mappability, so I didn't touch such last windows.

Thanks!
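
The clean-up described above can be sketched roughly as follows. This is only an illustration under two assumptions: that the last column of the fragments file is the mappability flag, and that the first column of both files is the scaffold name. The function names are made up, and the sample rows are taken from the listings in this thread.

```python
def scaffolds_without_mappable_windows(fragment_lines):
    """Return scaffold names for which every window has a zero last column."""
    has_mappable = {}
    for line in fragment_lines:
        fields = line.split()
        if not fields:
            continue
        name, flag = fields[0], fields[-1]
        # A scaffold counts as usable once any of its windows is flagged non-zero.
        has_mappable[name] = has_mappable.get(name, False) or flag != "0"
    return {name for name, ok in has_mappable.items() if not ok}


def drop_scaffolds(lines, to_drop):
    """Keep only lines whose first field is not a dropped scaffold."""
    # Comparing the whole first field avoids accidental matches against
    # numbers that appear in other columns.
    return [ln for ln in lines if ln.split() and ln.split()[0] not in to_drop]


fragments = [
    "517 0 75000 17395 1",
    "517 150000 225000 2437 1",
    "1131 0 75000 3 1",
    "1239 0 75000 0 0",
]
biases = [
    "517 75000 1.970547623956338",
    "517 225000 0.40157523166875075",
    "1131 75000 -1",
    "1239 75000 -1",
]

bad = scaffolds_without_mappable_windows(fragments)
clean_fragments = drop_scaffolds(fragments, bad)
clean_biases = drop_scaffolds(biases, bad)
print(sorted(bad))
```

With real data, the lines could be read from the gzipped files with `gzip.open(path, "rt")`. A scaffold that has at least one mappable window is kept whole, which matches the observation that a trailing window with zero mappability did not cause the error.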

@ay-lab
Owner

ay-lab commented Apr 20, 2021

"I checked out the #39 and there it seems like any chromosome below bias values all less than 0.5 I should remove actually? "
Yes, and when you remove them, please remove the whole chromosome and all fragments and interactions corresponding to it. Generally you can do this simply with "grep -v", but in your case, now that you have converted the chr names to numbers, you may want to use, for instance, awk '$1!=429 && $3!=429' on the interactions file to remove chr/scaffold 429 and all entries related to it. You can do a similar thing for the fragments file.
I agree that this could be done during conversion by HiCPro2FitHiC.py. However, one may want to use a bias value threshold range other than the default of 0.5 to 2.0. In that case, filtering at conversion time could remove chrs you still want, or keep ones you don't want, under the new thresholds.
Thanks
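
Removing every entry that involves a scaffold means an interaction row is kept only when neither column 1 nor column 3 names that scaffold. A minimal sketch of that filter follows, assuming the five-column interactions layout shown earlier in this thread. The function name and the three rows involving scaffold 429 are invented for illustration.

```python
def keep_interaction(line, to_drop):
    """Return True when neither end of the interaction is a dropped scaffold."""
    fields = line.split()
    # Columns 1 and 3 (indices 0 and 2) hold the two scaffold names.
    return fields[0] not in to_drop and fields[2] not in to_drop


# The first row comes from the thread; the others are hypothetical examples.
interactions = [
    "517 75000 517 225000 5",
    "429 75000 429 75000 12",
    "429 75000 517 75000 3",
    "517 75000 429 75000 2",
]

kept = [row for row in interactions if keep_interaction(row, {"429"})]
print(kept)
```

Comparing whole fields as strings also means that scaffold 429 is not confused with, say, 4290 or with a count of 429 in another column.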

@FatihSarigol
Author

FatihSarigol commented Apr 20, 2021

Thank you for your suggestions one more time!
I honestly forgot that I had removed the scaffold prefixes, so I took the hard way, but I guess it is still useful to keep for species whose chromosome names are only numbers. I identified the line numbers and deleted them from both the biases and mappability files in a single command using awk 'NR!~/^(11|429|557|667|889|1033|1455|2222|2245|3122|3762|)$/'. And yes, when I removed a scaffold I removed all of its occurrences, as they were all single-window scaffolds anyway.

As far as I understood from my attempts, it was not the bias threshold but a mappability value of zero for all occurrences of a scaffold that led to this error: scaffolds with a bias of -1 that had some mappability value, for example, did go through, as I kept them in and the program ran successfully.

Thanks
