# Converting neuroblastoma to pandas

In this post I will convert the annotated copy number profiles stored in the [neuroblastoma](https://cran.r-project.org/web/packages/neuroblastoma/index.html) **R** package to pandas. The data contains *partially* annotated profiles.

My main work will be to load and understand the data, get a feel for all the entries, and finally create a custom dataset to use in my package `pydps`.

The content of the notebook is organized as follows:
1. <a href="#loading">Loading the data</a>

<a id = "loading"> </a>
## Loading the data

The first step in this process is to load the dataset saved from `R`. The **neuroblastoma** package contains two datasets:

1. **profiles** : saved in `neuroblastoma_profiles.csv`
2. **annotations** : saved in `neuroblastoma_annotations.csv`


In [2]:
import pandas as pd

profiles = pd.read_csv('./neuroblastoma_profiles.csv',index_col=0)
profiles.info()
profiles.head(8)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4616846 entries, 1 to 4782328
Data columns (total 4 columns):
profile.id    int64
chromosome    object
position      int64
logratio      float64
dtypes: float64(1), int64(2), object(1)
memory usage: 176.1+ MB


Unnamed: 0,profile.id,chromosome,position,logratio
1,8,1,809681,-0.017417
2,8,1,928433,0.066261
3,8,1,987423,0.067639
4,8,1,1083595,0.042644
5,8,1,1125548,0.005759
6,8,1,1199359,0.0
7,8,1,1392490,0.069015
8,8,1,1672814,0.204141


The dataset contains **four** columns:

* `profile.id`  : id of the patient
* `chromosome`  : chromosome label; there are 10 distinct values, which can be inspected with `unique()`
* `position`    : position of the measurement in the profile (x coordinate of the plot)
* `logratio`    : the log-ratio measurement, in which we want to detect the jump positions
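
To get a feel for these columns, a quick inspection along the following lines can help. This is only a sketch on a made-up miniature CSV (the rows and the `toy_*` names are hypothetical, not the real file). It assumes that reading `chromosome` as text is acceptable, which keeps labels such as `1` and `X` in one consistent type.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for neuroblastoma_profiles.csv
toy_csv = io.StringIO(
    ",profile.id,chromosome,position,logratio\n"
    "1,8,1,809681,-0.017417\n"
    "2,8,1,928433,0.066261\n"
    "3,8,X,1083595,0.042644\n"
    "4,9,Y,1125548,0.005759\n"
)

# Assumption: forcing the chromosome column to str avoids mixing ints and strings
toy_profiles = pd.read_csv(toy_csv, index_col=0, dtype={"chromosome": str})

# Distinct chromosome labels, in order of first appearance
chromosomes = toy_profiles["chromosome"].unique()
# Number of distinct patients
n_patients = toy_profiles["profile.id"].nunique()
# Number of measurements per (patient, chromosome) pair
points_per_pair = toy_profiles.groupby(["profile.id", "chromosome"]).size()

print(chromosomes)
print(n_patients)
print(points_per_pair)
```

The same three calls can be run on the real `profiles` frame to check the number of chromosomes and patients.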

In [4]:
annotations = pd.read_csv("./neuroblastoma_annotations.csv", index_col=0)
annotations.info()
annotations.head(6)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3418 entries, 1 to 3446
Data columns (total 5 columns):
profile.id    3418 non-null int64
chromosome    3418 non-null int64
min           3418 non-null int64
max           3418 non-null int64
annotation    3418 non-null object
dtypes: int64(4), object(1)
memory usage: 160.2+ KB


Unnamed: 0,profile.id,chromosome,min,max,annotation
1,213,11,53700000,135006516,normal
2,333,11,53700000,135006516,normal
3,393,11,53700000,135006516,normal
4,148,11,53700000,135006516,normal
5,402,11,53700000,135006516,normal
6,521,11,53700000,135006516,normal


The annotation dataset is a little different. It contains **five** columns:

* `profile.id` : same as in the profiles dataset, it identifies the patient
* `chromosome` : which chromosome is annotated
* `min` and `max` : I did not find an exact definition of these; they appear to be the start and end positions of the annotated region on the chromosome
* `annotation`  : label for the annotated chromosome: *normal* if it does not contain any jumps, *breakpoint* otherwise
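
If `min` and `max` do delimit the annotated region, which is an assumption here and not something the data confirms, the measurements covered by one annotation could be selected as in the sketch below. The rows are toy values made up for illustration.

```python
import pandas as pd

# Toy rows; the values are made up for illustration
toy_profiles = pd.DataFrame({
    "profile.id": [213, 213, 213],
    "chromosome": ["11", "11", "11"],
    "position": [1_000_000, 60_000_000, 120_000_000],
    "logratio": [0.01, -0.02, 0.03],
})
toy_annotations = pd.DataFrame({
    "profile.id": [213],
    "chromosome": ["11"],
    "min": [53_700_000],
    "max": [135_006_516],
    "annotation": ["normal"],
})

# Attach the annotation (and its min/max) to every matching measurement
joined = toy_profiles.merge(toy_annotations, on=["profile.id", "chromosome"])

# ASSUMPTION: min and max are the inclusive bounds of the annotated region
in_region = joined[joined["position"].between(joined["min"], joined["max"])]
print(in_region[["position", "logratio", "annotation"]])
```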

## Merging the data
<a id="merging"> </a>


First, we will **drop** the `min` and `max` columns, as they are irrelevant for our task. We will also encode the `annotation` column with **normal = 0** and **breakpoint = 1**.

In [5]:
# Drop the min and max columns
annotations.drop(['min', 'max'], axis=1, inplace=True)

# Encode the annotation: normal -> 0, anything else (breakpoint) -> 1
annotations.annotation = annotations.annotation.apply(lambda x: 0 if x == 'normal' else 1)
# Show the data
annotations.head()

Unnamed: 0,profile.id,chromosome,annotation
1,213,11,0
2,333,11,0
3,393,11,0
4,148,11,0
5,402,11,0
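
The same encoding can also be written without a Python-level `lambda`. The sketch below runs on toy labels (not the real data). It first counts the labels, which is a cheap way to confirm that only the expected values are present, and then encodes them with a vectorized comparison.

```python
import pandas as pd

# Toy labels, made up for illustration
toy = pd.DataFrame({"annotation": ["normal", "breakpoint", "normal"]})

# Check which labels actually occur before encoding them
label_counts = toy["annotation"].value_counts()
print(label_counts)

# Vectorized equivalent of: apply(lambda x: 0 if x == 'normal' else 1)
encoded = (toy["annotation"] != "normal").astype(int)
print(encoded.tolist())
```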


Now we will merge the two datasets, `annotations` and `profiles`, on the key columns `[profile.id, chromosome]`. One problem is that the chromosomes labelled `X` and `Y` are never annotated, so `chromosome` is an integer column in `annotations` but an `object` column in `profiles`. We will therefore convert the **chromosome** column in `annotations` to `object` before merging the two datasets.

In [18]:
annotations.chromosome = annotations.chromosome.astype("object")

In [20]:
merged = pd.merge(profiles, annotations, on = ['profile.id', 'chromosome'], how = 'left')
merged.head()

Unnamed: 0,profile.id,chromosome,position,logratio,annotation
0,8,1,809681,-0.017417,
1,8,1,928433,0.066261,
2,8,1,987423,0.067639,
3,8,1,1083595,0.042644,
4,8,1,1125548,0.005759,
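
One caveat: casting to `object` changes the dtype but not the values. If `profiles` stores the chromosome labels as strings, integer keys from `annotations` will not match them. The sketch below uses toy frames (hypothetical values) and assumes that casting both key columns to `str` is acceptable. It also uses the `indicator` option of `pd.merge` to count how many rows actually received an annotation.

```python
import pandas as pd

# Toy frames, made up for illustration
toy_profiles = pd.DataFrame({
    "profile.id": [8, 8, 213, 213],
    "chromosome": ["1", "X", "11", "11"],   # string labels
    "position": [809681, 928433, 54000000, 60000000],
    "logratio": [-0.017, 0.066, 0.01, -0.02],
})
toy_annotations = pd.DataFrame({
    "profile.id": [213],
    "chromosome": [11],                      # integer label
    "annotation": [0],
})

keys = ["profile.id", "chromosome"]

# Assumption: comparing chromosome labels as strings on both sides
toy_profiles["chromosome"] = toy_profiles["chromosome"].astype(str)
toy_annotations["chromosome"] = toy_annotations["chromosome"].astype(str)

# indicator=True adds a _merge column: "both" for matched rows, "left_only" otherwise
toy_merged = pd.merge(toy_profiles, toy_annotations, on=keys, how="left", indicator=True)
n_annotated = int((toy_merged["_merge"] == "both").sum())

print(toy_merged)
print(n_annotated)
```

On the real data, a count of zero matched rows would suggest that the key types still differ.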


## Saving the dataset with pickle

In [23]:
merged.to_pickle("neuroblastoma.bz2")

Checking the size of the saved dataset:

In [24]:
!ls -alh neuroblastoma.bz2

-rw-r--r-- 1 anass anass 20M Nov  7 16:43 neuroblastoma.bz2
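
As a sanity check, the saved file can be read back with `pd.read_pickle`. By default both `to_pickle` and `read_pickle` infer the compression from the file extension, so a `.bz2` name gives a bz2-compressed pickle. The sketch below shows the round trip on a toy frame written to a temporary directory. The file name `toy.bz2` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "profile.id": [8, 8],
    "chromosome": ["1", "X"],
    "logratio": [0.1, -0.2],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bz2")
    # compression="infer" is the default: the .bz2 extension selects bz2
    toy.to_pickle(path)
    # bz2 streams start with the bytes "BZh"
    with open(path, "rb") as fh:
        magic = fh.read(3)
    restored = pd.read_pickle(path)

print(magic)
print(restored.equals(toy))
```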
