# Sample Data: Alzheimer's disease 

**Datasets (`Dataset_EV1.xlsx`, `Dataset_EV2.xlsx`) downloaded from https://www.embopress.org/doi/full/10.15252/msb.20199356.**

### Citation:

```Bader, J., Geyer, P., Müller, J., Strauss, M., Koch, M., Leypoldt, F., et al. (2020). Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Molecular Systems Biology, 16(6). doi: 10.15252/msb.20199356```


In [8]:
import pandas as pd
import numpy as np
ev1_raw_df = pd.read_excel("Dataset_EV1.xlsx", skiprows=1)
ev2_raw_df = pd.read_excel("Dataset_EV2.xlsx")

In [9]:
# Name the two unlabeled leading columns
ev1_raw_df.rename(columns = {'Unnamed: 0': 'Genes', 'Unnamed: 1': 'Proteins'}, inplace=True)
# Keep only the text after "]" in each sample column header
ev1_raw_df.rename(columns = {k:k.split("]")[1].strip() for k in ev1_raw_df.columns[2:]}, inplace=True)
ev1_raw_df.drop("Genes", axis=1, inplace=True)
ev1_raw_df.set_index("Proteins", inplace=True)
# Transpose so that each row is a sample and each column a protein entry
ev1_df = ev1_raw_df.T.reset_index()
ev1_df.rename(columns = {'index': 'Samples'}, inplace=True)
# Prefix every metadata column name with "_"
ev2_raw_df.columns = ['_' + col for col in ev2_raw_df.columns]
ev2_raw_df.rename(columns={'_sample name': 'Samples'}, inplace=True)
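
The reshaping in the cell above can be tried on a small toy frame before running it on the full spreadsheet. This is a minimal sketch: the toy values and the `[1] sample_A` header pattern are assumptions made for illustration (the real headers are only known to contain a `]`), and the names `toy` and `toy_samples` are hypothetical.

```python
import pandas as pd

# Toy stand-in for Dataset_EV1.xlsx; values and header pattern are assumed.
toy = pd.DataFrame({
    "Unnamed: 0": ["GENE1", "GENE2"],
    "Unnamed: 1": ["P00001", "P00002;Q00002"],
    "[1] sample_A": [10.5, 11.2],
    "[2] sample_B": [9.8, "Filtered"],
})

toy = toy.rename(columns={"Unnamed: 0": "Genes", "Unnamed: 1": "Proteins"})
# Strip the bracketed prefix from each sample column header.
toy = toy.rename(columns={c: c.split("]")[1].strip() for c in toy.columns[2:]})

# One row per sample, one column per protein entry.
toy_samples = (
    toy.drop(columns="Genes")
       .set_index("Proteins")
       .T
       .reset_index()
       .rename(columns={"index": "Samples"})
)
toy_samples.columns.name = None  # drop the leftover "Proteins" axis label
print(toy_samples)
```

Chaining the steps without `inplace=True` leaves the raw frame untouched, so the cell can be re-run without reloading the Excel file.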

In [10]:
df = pd.merge(ev1_df, ev2_raw_df, on="Samples", how='left')
print(df.columns)
print(df.shape)
df.describe()

Index(['Samples', 'A0A024QZX5;A0A087X1N8;P35237', 'A0A024R0T9;K7ER74;P02655',
       'A0A024R3B9;E9PJL7;E9PNH7;E9PR44;E9PRA8;P02511',
       'A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8',
       'A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503', 'A0A075B6H7', 'A0A075B6H9',
       'A0A075B6I0', 'A0A075B6I1',
       ...
       '_age at CSF collection', '_gender', '_t-tau [ng/L]', '_p-tau [ng/L]',
       '_Abeta-42 [ng/L]', '_Abeta-40 [ng/L]', '_Abeta-42/Abeta-40 ratio',
       '_primary biochemical AD classification', '_clinical AD diagnosis',
       '_MMSE score'],
      dtype='object', length=1554)
(210, 1554)


|  | _age at CSF collection | _t-tau [ng/L] | _p-tau [ng/L] | _Abeta-42 [ng/L] | _Abeta-40 [ng/L] | _Abeta-42/Abeta-40 ratio | _MMSE score |
|---|---|---|---|---|---|---|---|
| count | 197.0 | 181.0 | 98.0 | 181.0 | 121.0 | 121.0 | 83.0 |
| mean | 67.725888 | 553.624309 | 72.44898 | 687.104972 | 10505.842975 | 0.078753 | 25.722892 |
| std | 12.122924 | 372.272096 | 40.868692 | 381.119236 | 5192.846673 | 0.046603 | 4.028294 |
| min | 20.0 | 78.0 | 16.0 | 154.0 | 2450.0 | 0.01591 | 12.0 |
| 25% | 63.0 | 275.0 | 36.75 | 417.0 | 6608.0 | 0.044879 | 23.5 |
| 50% | 70.0 | 441.0 | 73.5 | 593.0 | 9515.0 | 0.066624 | 27.0 |
| 75% | 74.0 | 802.0 | 93.75 | 892.0 | 12967.0 | 0.104904 | 29.0 |
| max | 88.0 | 2390.0 | 233.0 | 2206.0 | 26080.0 | 0.369508 | 30.0 |
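
The summary reports 197 non-null ages across 210 merged rows, so some rows carry no age value. That could mean the value is blank in the metadata sheet, or that a sample name found no match in the left merge. One way to tell the two apart is the `indicator` option of `pd.merge`. The sketch below uses small hypothetical frames (`proteins`, `clinical`) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for ev1_df and ev2_raw_df.
proteins = pd.DataFrame({"Samples": ["s1", "s2", "s3"],
                         "P00001": [1.0, 2.0, 3.0]})
clinical = pd.DataFrame({"Samples": ["s1", "s2"],
                         "_age at CSF collection": [71, 65]})

# indicator=True adds a "_merge" column recording where each row came from.
merged = pd.merge(proteins, clinical, on="Samples", how="left", indicator=True)

# Rows tagged "left_only" had no matching sample name in the metadata.
unmatched = merged.loc[merged["_merge"] == "left_only", "Samples"].tolist()
print(unmatched)
```

If the same check on the real frames returns an empty list, the missing values come from the metadata sheet itself rather than from the join.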


In [11]:
df.iloc[0,:][ev2_raw_df.columns]

Samples                                   20180618_QX0_JaBa_SA_LC12_5_CSF1_1_8-1xD1xS1fM...
_collection site                                                                     Sweden
_age at CSF collection                                                                   71
_gender                                                                                   f
_t-tau [ng/L]                                                                           703
_p-tau [ng/L]                                                                            85
_Abeta-42 [ng/L]                                                                        562
_Abeta-40 [ng/L]                                                                        NaN
_Abeta-42/Abeta-40 ratio                                                                NaN
_primary biochemical AD classification                                  biochemical control
_clinical AD diagnosis                                                          

In [14]:
# Prepare the data for export
df.set_index("Samples", inplace=True)
# Treat the 'Filtered' placeholder as missing
df.replace('Filtered', np.nan, inplace=True)

# Export
# df.to_csv("Alzheimer_data.csv", sep=";", index=False)
# index=False leaves the "Samples" index out of the sheet;
# the with-block saves and closes the workbook on exit
with pd.ExcelWriter('Alzheimer.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Data', index=False)
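
Replacing the `'Filtered'` placeholder with NaN does not by itself guarantee that the protein columns end up with a numeric dtype. A column that was read as text can stay `object`. A minimal sketch of an explicit conversion follows. It assumes that every column whose name does not start with `_` holds protein intensities, and the toy frame and the `protein_cols` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: one protein column containing the placeholder, one metadata column.
toy = pd.DataFrame({"P00001": [10.5, "Filtered", 9.1],
                    "_gender": ["f", "m", "f"]})
toy = toy.replace("Filtered", np.nan)

# Assumption: metadata columns carry the "_" prefix, everything else is protein data.
protein_cols = [c for c in toy.columns if not c.startswith("_")]

# Convert protein columns to floats; anything unparseable becomes NaN.
toy[protein_cols] = toy[protein_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```

Note that `errors="coerce"` silently turns any other non-numeric entry into NaN, so it is worth counting missing values before and after the conversion.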