# Matching Base Pairs from a list of research papers to their respective bin locations and filtering for the smallest p-value

Two datasets are provided. The first, in the file *paper_bin_pavalue.csv*, is a list of papers with the chromosome and base pair each one investigated and the p-value for the hypothesis tested. The second, in the file *chromosome_bins.csv*, is a list of bins with their corresponding chromosome, start base pair and end base pair.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Import the data containing the combinations of chromosomes and their bin starting and ending base pairs. 

The dtypes are specified explicitly so that pandas' type inference does not read the `bin` values as floats.

In [2]:
bins = pd.read_csv(
    'chromosome_bins.csv',
    dtype={
        "chromosome": object,
        "start": int,
        "end": int,
        "bin": object
    }
)
bins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 4 columns):
chromosome    240 non-null object
start         240 non-null int64
end           240 non-null int64
bin           240 non-null object
dtypes: int64(2), object(2)
memory usage: 7.6+ KB
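To illustrate why the explicit dtype matters, here is a minimal sketch on made-up data. It assumes that bin labels such as `1.1` and `1.10` can both occur, which is not confirmed by the output shown above. With type inference, both labels collapse to the same float.

```python
import io

import pandas as pd

# Hypothetical two-row extract; the label "1.10" is an assumption for illustration.
csv_text = (
    "chromosome,start,end,bin\n"
    "chr1,1,6902277,1.1\n"
    "chr1,6902278,11404933,1.10\n"
)

# Without a dtype, pandas parses the bin labels as floats, so 1.10 becomes 1.1.
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit dtype, the labels are kept as the original text.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"bin": object})

print(inferred["bin"].tolist())
print(explicit["bin"].tolist())
```

Keeping the labels as text also avoids float formatting surprises when the bins are written back to CSV.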


In [3]:
bins.head()

| | chromosome | start | end | bin |
|---|---|---|---|---|
| 0 | chr1 | 1 | 6902277 | 1.1 |
| 1 | chr1 | 6902278 | 11404933 | 1.2 |
| 2 | chr1 | 11404934 | 18107097 | 1.3 |
| 3 | chr1 | 18107098 | 30481074 | 1.4 |
| 4 | chr1 | 30481075 | 40606936 | 1.5 |


In [4]:
bins.isnull().sum()

chromosome    0
start         0
end           0
bin           0
dtype: int64

There are no `null` values. All rows contain data.

### Import the list of papers and their respective information.

In [5]:
papers = pd.read_csv(
    'paper_bin_pavalue.csv', 
    dtype={
        "author": object,
        "author_id": int,
        "chr_id": object,
        "bp": int,
        "snp": object,
        "pvalue": float,
        "bin": object
    }
)
papers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191375 entries, 0 to 191374
Data columns (total 7 columns):
author       191375 non-null object
author_id    191375 non-null int64
chr_id       191375 non-null object
bp           191375 non-null int64
snp          186639 non-null object
pvalue       190473 non-null float64
bin          0 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 10.2+ MB


In [6]:
papers.head()

| | author | author_id | chr_id | bp | snp | pvalue | bin |
|---|---|---|---|---|---|---|---|
| 0 | Bandres-Ciga | 1 | chr2 | 135539967 | rs6430538 | 0.02195 | |
| 1 | Bandres-Ciga | 1 | chr17 | 43994648 | rs17649553 | 0.02547 | |
| 2 | Bandres-Ciga | 1 | chr2 | 169129145 | rs 1955337 | 0.03442 | |
| 3 | Bandres-Ciga | 1 | chr11 | 133765367 | rs329648 | 0.03458 | |
| 4 | Bandres-Ciga | 1 | chr17 | 17715101 | rs11868035 | 0.0457 | |


In [7]:
papers.isnull().sum()

author            0
author_id         0
chr_id            0
bp                0
snp            4736
pvalue          902
bin          191375
dtype: int64

There are some null values for `snp` and `pvalue`. The `bin` column is intentionally empty.
The dataframe will be copied and the rows with missing `snp` or `pvalue` entries will be dropped.

In [8]:
original_papers = papers.copy()

In [9]:
papers.dropna(subset=['snp', 'pvalue'], inplace=True)

In [10]:
papers.isnull().sum()

author            0
author_id         0
chr_id            0
bp                0
snp               0
pvalue            0
bin          185832
dtype: int64

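There are now 185832 entries, down from 191375: a reduction of 5543 rows, or about 2.9%. This is fewer than the 4736 + 902 = 5638 null values counted earlier because 95 rows were missing both `snp` and `pvalue`.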
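The row count after `dropna` can be checked with a small sketch like the one below. The toy dataframe and its values are made up for illustration; the point is that a row missing several of the listed columns is dropped only once, so the number of dropped rows can be smaller than the sum of the per-column null counts.

```python
import pandas as pd

# Toy stand-in for `papers`; the values are hypothetical.
toy = pd.DataFrame({
    "snp": ["rs1", None, "rs3", None],
    "pvalue": [0.1, 0.2, None, None],
})

# Drop every row that lacks a value in at least one of the two columns.
cleaned = toy.dropna(subset=["snp", "pvalue"])

null_cells = toy[["snp", "pvalue"]].isnull().sum().sum()
dropped = len(toy) - len(cleaned)
share = dropped / len(toy)

print(f"{null_cells} null cells, {dropped} rows dropped ({share:.1%})")
```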

## Matching the two datasets

The bin is looked up in the `bins` dataframe by comparing each bin's start and end base pairs with the base pair of each paper. A bin matches when the chromosome entries are equal **and** the bin's *start base pair* is __less than or equal__ to the base pair of the paper **and** the bin's *end base pair* is __greater than or equal__ to the base pair of the paper.

In [11]:
%%timeit -r 1
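for idx, row in papers.iterrows():
    # Select the bin on the same chromosome whose [start, end] range contains bp
    bin_value = bins[
        bins.chromosome.eq(row.chr_id) &
        bins.start.le(row.bp) &
        bins.end.ge(row.bp)
    ].bin.values
    # Take the first match; leave the bin empty when no range contains bp
    papers.at[idx, 'bin'] = bin_value[0] if bin_value.size > 0 else None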

4min 34s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
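The row-by-row loop accounts for most of the runtime above. As a possible alternative (not what this notebook ran), the same interval match can be sketched with `pd.merge_asof`, assuming the bins on a chromosome do not overlap. The toy frames below are made up. With the real `papers` dataframe, the empty `bin` column would have to be dropped first to avoid a column name clash.

```python
import pandas as pd

# Toy stand-ins for `bins` and `papers`; the values are hypothetical.
bins = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr2"],
    "start": [1, 101, 1],
    "end": [100, 200, 150],
    "bin": ["1.1", "1.2", "2.1"],
})
papers = pd.DataFrame({
    "chr_id": ["chr2", "chr1", "chr1", "chr2"],
    "bp": [50, 150, 100, 999],
})

# merge_asof needs both frames sorted by their key columns.
# reset_index() keeps the original row labels in a column named "index".
left = papers.reset_index().sort_values("bp")
right = bins.sort_values("start")

# For each paper, pick the bin on the same chromosome with the largest start <= bp.
matched = pd.merge_asof(
    left,
    right,
    left_on="bp",
    right_on="start",
    left_by="chr_id",
    right_by="chromosome",
)

# Keep the candidate only if bp also lies at or before the end of that bin.
matched["bin"] = matched["bin"].where(matched["bp"] <= matched["end"])

# Write the result back, aligned on the original row labels.
papers["bin"] = matched.set_index("index")["bin"]
print(papers)
```

Because the assignment aligns on the original index, the row labels of `papers` are unchanged, which matters for the later lookups against the unfiltered copy.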


With the `bin` column of the `papers` dataframe now populated, it is easy to retrieve the row with the minimum `pvalue` for each **bin** of each **author**. The result is stored in a new dataframe named `combinations`.

In [12]:
combinations = papers.loc[papers.groupby(['author_id', 'bin']).pvalue.idxmin()]
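How the `groupby(...).idxmin()` selection works can be seen on a small made-up dataframe: `idxmin` returns, for each group, the index label of the row with the smallest `pvalue`, and `.loc` then pulls those complete rows. The values and index labels below are hypothetical.

```python
import pandas as pd

# Toy stand-in for the populated `papers` dataframe.
toy = pd.DataFrame(
    {
        "author_id": [1, 1, 1, 2],
        "bin": ["7.3", "7.3", "1.1", "7.3"],
        "pvalue": [0.4, 0.02, 0.3, 2e-12],
    },
    index=[10, 11, 12, 13],
)

# One index label per (author_id, bin) group: the row with the smallest p-value.
best_labels = toy.groupby(["author_id", "bin"]).pvalue.idxmin()

# Select the full rows; the original index labels are kept.
best = toy.loc[best_labels]
print(best)
```

Rows whose `bin` is missing are left out, because `groupby` skips null group keys by default.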

In [13]:
combinations.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 703 entries, 8 to 114735
Data columns (total 7 columns):
author       703 non-null object
author_id    703 non-null int64
chr_id       703 non-null object
bp           703 non-null int64
snp          703 non-null object
pvalue       703 non-null float64
bin          703 non-null object
dtypes: float64(1), int64(2), object(4)
memory usage: 43.9+ KB


In [14]:
combinations.sample(5)

| | author | author_id | chr_id | bp | snp | pvalue | bin |
|---|---|---|---|---|---|---|---|
| 41201 | Maraganore DM | 19 | chr3 | 110453390 | 2699976 | 0.002348 | 3.9 |
| 12 | Bandres-Ciga | 1 | chr7 | 23293746 | rs199347 | 0.3881 | 7.3 |
| 1267 | Spencer CC | 17 | chr5 | 31270987 | rs4457092 | 4.05e-06 | 5.3 |
| 798 | Pickrell JK | 12 | chr7 | 23084258 | rs10256359 | 2e-12 | 7.3 |
| 556 | Foo JN | 6 | chr21 | 19566451 | rs2824703 | 8.61e-05 | 21.1 |


The dataframe is not re-indexed, so each row keeps the index label it had in the initial dataframe. This makes lookups against the original data easier.
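A minimal sketch of such a lookup, on made-up data: because filtering and selection keep the index labels, the labels of the reduced dataframe can be passed straight to `.loc` on the untouched copy.

```python
import pandas as pd

# Toy stand-in for `original_papers`; the values are hypothetical.
original = pd.DataFrame({
    "snp": ["rs1", None, "rs3"],
    "pvalue": [0.5, 0.1, 0.01],
})

# Filtering keeps the original index labels (here 0 and 2).
filtered = original.dropna(subset=["snp"])

# Row with the smallest p-value, still carrying its original label.
best = filtered.loc[[filtered.pvalue.idxmin()]]

# Look the same rows up in the unfiltered copy via the shared labels.
source_rows = original.loc[best.index]
print(source_rows)
```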

## Storing the results

The matches, filtered by minimum p-value, are stored in a new file named *matches.csv* for further processing.

In [15]:
combinations.to_csv('matches.csv')
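When the file is read back, the saved index and the text type of `bin` need to be restored explicitly. The sketch below shows one way to do that, using an in-memory buffer and made-up rows instead of the real *matches.csv*; the bin label `21.10` is an assumption chosen to show why the dtype matters.

```python
import io

import pandas as pd

# Toy stand-in for `combinations`; the values are hypothetical.
combinations = pd.DataFrame(
    {
        "author_id": [1, 6],
        "pvalue": [0.3881, 8.61e-05],
        "bin": ["7.3", "21.10"],
    },
    index=[12, 556],
)

# to_csv writes the index as the first, unnamed column.
buffer = io.StringIO()
combinations.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the original row labels; dtype keeps bin labels as text.
restored = pd.read_csv(buffer, index_col=0, dtype={"bin": object})
print(restored)
```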