BedTool object to pandas dataframe #111

Closed
radaniba opened this issue Aug 26, 2014 · 6 comments

Comments

@radaniba

Hello

I use pybedtools quite a lot, and each time I find myself writing a small function to convert a BedTool object (say, the result of a coverage calculation) to a pandas DataFrame so that I can use it for other purposes or pass it to other functions.

It would be great to have this as a built-in utility; I guess a lot of people would like to have it by default.

Cheers

Rad

@daler
Owner

daler commented Aug 26, 2014

Sure -- what does your current function look like?

@radaniba
Author

Well, it depends on the result, actually. Say we have a per-base result from genome_coverage; then it is going to be something like:

import pandas as pd

def coverage_to_df(bam_file):
    coverage_result = bam_file.genome_coverage(d=True)
    # Initialize the DataFrame that will hold the coverage result;
    # a DataFrame is easier to use for plotting and subset selection.
    df = pd.DataFrame(columns=['chrom', 'position', 'coverage'])
    row_id = 0
    for pos_cov in coverage_result:
        # str() gives the tab-delimited line for each record
        chrom, position, coverage = str(pos_cov).rstrip('\n').split('\t')
        df.loc[row_id] = [chrom, position, coverage]
        row_id = row_id + 1
    return df

But I imagine it depends on the context and on the BedTool object, which varies in its number of columns.

@radaniba
Author

I was thinking about something like

coverage_result = bam_file.genome_coverage(d=True).to_data_frame()

and the user is free to relabel the columns of the DataFrame as they want, or

coverage_result = bam_file.genome_coverage(d=True).to_data_frame(['chrom','position','coverage'])

with column names as arguments. That way the whole thing can be done in one line, and it will fit all BedTool objects.
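
A minimal sketch of that idea, written outside of pybedtools. The helper name rows_to_data_frame and the sample records are assumptions made for illustration; the input is any iterable of tab-delimited lines standing in for the records of a BedTool result. It collects the rows in a list and builds the DataFrame in one call, with the column names optional:

```python
import pandas as pd


def rows_to_data_frame(lines, names=None):
    """Build a DataFrame from tab-delimited lines (hypothetical helper).

    `lines` stands in for the records of a BedTool result; `names` is an
    optional list of column labels. Without it, pandas numbers the columns.
    """
    # Split every record once and collect the rows in a plain list.
    rows = [str(line).rstrip("\n").split("\t") for line in lines]
    # Constructing the frame in a single call avoids growing it row by row.
    return pd.DataFrame(rows, columns=names)


# Made-up per-base coverage records: chrom, position, coverage.
sample = ["chr1\t1\t0\n", "chr1\t2\t3\n", "chr1\t3\t5\n"]

df = rows_to_data_frame(sample, names=["chrom", "position", "coverage"])
# Values arrive as strings, so convert the numeric columns explicitly.
df[["position", "coverage"]] = df[["position", "coverage"]].astype(int)
print(df)
```

Because the number of columns is taken from the data, the same helper works for results of any width; only the names list changes.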

@daler
Owner

daler commented Aug 26, 2014

For anything but a BAM file, you could just call pandas.read_table on the underlying filename (fn attribute):

import pybedtools
import pandas
x = pybedtools.example_bedtool('a.bed')
df = pandas.read_table(x.fn, names=['chrom', 'start', 'stop', 'name', 'score', 'strand'])

What's the speed like on incrementally building a dataframe from BAM coverage like in your example? I suspect it would be faster to just read the file in -- pandas' parsers are pretty fast.
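
To illustrate reading the file in rather than appending rows one at a time, here is a small self-contained sketch. It writes a few made-up three-column coverage lines to a temporary file, which stands in for the file behind the fn attribute, and parses it in a single call. It uses pandas.read_csv with a tab separator, which handles this input the same way read_table does; the file contents and column names are assumptions.

```python
import os
import tempfile

import pandas as pd

# Made-up tab-delimited records standing in for a saved BedTool result.
content = "chr1\t1\t0\nchr1\t2\t3\nchr2\t1\t7\n"

# Write the records to a temporary file so there is a real path to parse.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(content)
    path = handle.name

try:
    # The file has no header row, so supply the column names explicitly.
    coverage = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["chrom", "position", "coverage"],
    )
finally:
    os.remove(path)

# The parser infers integer dtypes for the numeric columns.
print(coverage.dtypes)
```

Unlike the row-by-row loop, the parser infers numeric types on its own, so no string-to-integer conversion is needed afterwards.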

Making this built-in would be trivial:

def to_dataframe(self, *args, **kwargs):
    """
    create a pandas.DataFrame, passing args and kwargs to pandas.read_table
    """
    # Complain if BAM or if not a file
    if self._isbam:
        raise ValueError("BAM not supported for converting to DataFrame")
    if not isinstance(self.fn, str):  # basestring on Python 2
        raise ValueError("use .saveas() to make sure self.fn is a file")

    # Otherwise we're good:
    return pandas.read_table(self.fn, *args, **kwargs)

Would this work for you? I suppose a lookup table mapping file type and field count to default values for the names argument would help cut down on typing.
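
Such a lookup table might look like the sketch below. Everything in it is hypothetical: the DEFAULT_NAMES table, the default_names helper, and the particular labels (the BED labels follow the read_table example above; the GFF labels are one common convention) are illustrations, not pybedtools' actual settings.

```python
# Hypothetical column labels; not taken from pybedtools itself.
BED_FIELDS = ["chrom", "start", "stop", "name", "score", "strand"]
GFF_FIELDS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]

# Key the table by (file type, field count): BED files may carry 3 to 6
# of the leading BED columns, while GFF files always have 9 columns.
DEFAULT_NAMES = {
    ("bed", n): BED_FIELDS[:n] for n in range(3, len(BED_FIELDS) + 1)
}
DEFAULT_NAMES[("gff", len(GFF_FIELDS))] = GFF_FIELDS


def default_names(file_type, field_count):
    """Return default column names, or None when no default is known."""
    return DEFAULT_NAMES.get((file_type, field_count))


print(default_names("bed", 4))
print(default_names("gff", 9))
print(default_names("bed", 2))
```

Returning None for an unknown combination lets the caller fall back to the parser's own default column labels instead of failing.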

@radaniba
Author

Thanks @daler, yes, that function is enough and does what is needed.

daler closed this as completed in f8770e1 on Aug 27, 2014
@radaniba
Author

Awesome! Thanks @daler

daler added a commit that referenced this issue Feb 6, 2015
Conflicts:
	pybedtools/settings.py