## LongevityMap genes splitting

Source of original data: [https://genomics.senescence.info](https://genomics.senescence.info/longevity/)<br />
*(LongevityMap build 3, release date: 2017 June 24, number of genes: 884)*

In [1]:
import pandas as pd
import re
from pathlib import Path

In [2]:
debug_local = True  # False
local = Path("..").resolve()
data  = local / "data"
input = data / "input"
output = data / "output"

In [3]:
# read the original data into a DataFrame
inputname = Path(input / "longevity_genes.csv").resolve()

df = pd.read_csv(inputname)

print("Dimension:", df.shape)
df.head()

Dimension: (550, 7)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Unnamed: 6
0,1,non-significant,Dutch,HLA-B40,HLA-B,1859103,
1,2,non-significant,Dutch,HLA-DRB5,HLA-DRB5,1859103,
2,3,non-significant,Finnish,APOB,APOB,8018664,
3,4,significant,Finnish,APOC3,APOC3,8018664,
4,5,significant,Finnish,"rs7412,rs429358",APOE,8018664,


<br />

### Clean some data

In [4]:
# if a string in the column 'Variant(s)' ends with a comma, delete this last character (records containing NaN are ignored)
print("In the explored column, number of records with comma at the end:",
      len(df[df['Variant(s)'].str.endswith(',', na=False)]), end='')

df['Variant(s)'] = df['Variant(s)'].map(lambda st: st[:-1] if st.endswith(',') else st, na_action='ignore')

# check that we eliminated all cases with a comma at the end
assert len(df[df['Variant(s)'].str.endswith(',', na=False)]) == 0, "Something wrong with removing comma at the end of string"
print('  - cleaned')

In the explored column, number of records with comma at the end: 1  - cleaned
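
The same clean-up can also be written with the vectorized string accessor instead of a `lambda`. This is a minimal sketch on a hypothetical toy Series (the values below are illustrative, not taken from the real file). Note one difference from the cell above: `str.rstrip(',')` removes every trailing comma, not just the last one.

```python
import pandas as pd

# Hypothetical toy column standing in for df['Variant(s)']
variants = pd.Series(["rs7412,rs429358,", "APOB", None, "rs1042714"])

# Strip trailing commas; missing values stay missing
cleaned = variants.str.rstrip(",")

print(cleaned.tolist())
```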


In [5]:
# if a string in the column 'Variant(s)' contains two commas in a row, replace them with a single comma
print("In the explored column, number of records containing two commas in a row:",
      len(df[df['Variant(s)'].str.contains(',,', na=False)]), end='')

df['Variant(s)'] = df['Variant(s)'].str.replace(',,', ',')

# check that we eliminated all cases with two commas in a row
assert len(df[df['Variant(s)'].str.contains(',,', na=False)]) == 0, "Something wrong with removing two commas in a row"
print('  - cleaned')

In the explored column, number of records containing two commas in a row: 0  - cleaned
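
A literal replacement of `',,'` leaves a double comma behind when three or more commas appear in a row (`',,,'` becomes `',,'`). The data here contain no such case, but a regular expression would cover it. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy column; the real data contained no doubled commas
variants = pd.Series(["rs1,,rs2", "rs3,,,rs4", "APOB", None])

# Collapse any run of two or more commas into a single comma
collapsed = variants.str.replace(r",{2,}", ",", regex=True)

print(collapsed.tolist())
```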


In [6]:
# The last column of the DataFrame contains only NaN values.
# Make sure of this and delete the column as unnecessary.
assert df['Unnamed: 6'].isna().all(), "In the column 'Unnamed: 6' there is at least 1 record with a value other than NaN"
del df['Unnamed: 6']

print("Dimension:", df.shape)
df.head()

Dimension: (550, 6)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed
0,1,non-significant,Dutch,HLA-B40,HLA-B,1859103
1,2,non-significant,Dutch,HLA-DRB5,HLA-DRB5,1859103
2,3,non-significant,Finnish,APOB,APOB,8018664
3,4,significant,Finnish,APOC3,APOC3,8018664
4,5,significant,Finnish,"rs7412,rs429358",APOE,8018664


In [7]:
# read auxiliary data into a DataFrame
inputname2 = Path(output / "longevity_map_descriptions.csv").resolve()
df2 = pd.read_csv(inputname2)
df2.head()

Unnamed: 0,id,Study Design,Conclusions
0,1,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-D..."
1,2,The apolipoprotein B Xba I polymorphism was ex...,The frequencies of the Xba I alleles among the...
2,3,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-B..."
3,4,The Sst I polymorphism was examined in 179 Fin...,The S2 allele (Sst I restriction site present)...
4,5,The common polymorphism of apolipoprotein E (E...,The frequency of the E2 allele was higher and ...


In [8]:
df = pd.merge(df, df2, on="id", how="left")
df.head()

Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
0,1,non-significant,Dutch,HLA-B40,HLA-B,1859103,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-D..."
1,2,non-significant,Dutch,HLA-DRB5,HLA-DRB5,1859103,The apolipoprotein B Xba I polymorphism was ex...,The frequencies of the Xba I alleles among the...
2,3,non-significant,Finnish,APOB,APOB,8018664,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-B..."
3,4,significant,Finnish,APOC3,APOC3,8018664,The Sst I polymorphism was examined in 179 Fin...,The S2 allele (Sst I restriction site present)...
4,5,significant,Finnish,"rs7412,rs429358",APOE,8018664,The common polymorphism of apolipoprotein E (E...,The frequency of the E2 allele was higher and ...
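
A left join keeps every row of the left table, but it can silently duplicate rows if `id` is not unique on the right side, and it leaves NaN where no description matches. One way to check both is sketched below on hypothetical toy frames; `validate` and `indicator` are standard `pd.merge` parameters, while the data and the `unmatched` name are illustrative only.

```python
import pandas as pd

# Hypothetical toy frames standing in for df and df2
left = pd.DataFrame({"id": ["1", "2", "G3"],
                     "Gene(s)": ["HLA-B", "APOB", "ADIPOQ"]})
right = pd.DataFrame({"id": ["1", "2"],
                      "Conclusions": ["text a", "text b"]})

# validate raises MergeError if 'id' is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from
merged = pd.merge(left, right, on="id", how="left",
                  validate="one_to_one", indicator=True)

# Rows of the left table that found no description
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["id"].tolist())
```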


<br />

### Preliminary separation of original dataFrame into several dataFrames

In [9]:
# Separate records that do not have values in the column 'Variant(s)'
df_var_nan = df[(df['Variant(s)'].isna())]

print("Dimension:", df_var_nan.shape)
df_var_nan.head()

Dimension: (16, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
255,258,significant,American (Caucasian),,TP53,20824210,Alleles in candidate pathways (GH/IGF1 signali...,"Eleven SNPs (in GSR, KL, GHRHR, INS, GHSR, IGF..."
256,261,non-significant,American (Caucasian),,TP53,20824210,Alleles in candidate pathways (GH/IGF1 signali...,"Eleven SNPs (in GSR, KL, GHRHR, INS, GHSR, IGF..."
275,280,significant,European,,TP53,23286790,Genome-wide linkage scan in 2118 European nona...,Four regions showed linkage to longevity (14q1...
276,281,significant,European,,TP53,23286790,"The -11377C > G, -11391G > A, and -11426A > G ...",A very rare AA genotype of the -11391G > A pol...
277,282,significant,European,,TP53,23286790,Genome-wide association study in 801 centenari...,"rs2075650, in a TOMM40 intron but a proxy of S..."


In [10]:
# rows that have some info in the column 'Variant(s)' - this is an intermediate DataFrame for further splitting
df_var_not_nan = df[(df['Variant(s)'].notna())]

print("Dimension:", df_var_not_nan.shape)
df_var_not_nan.head()

Dimension: (534, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
0,1,non-significant,Dutch,HLA-B40,HLA-B,1859103,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-D..."
1,2,non-significant,Dutch,HLA-DRB5,HLA-DRB5,1859103,The apolipoprotein B Xba I polymorphism was ex...,The frequencies of the Xba I alleles among the...
2,3,non-significant,Finnish,APOB,APOB,8018664,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-B..."
3,4,significant,Finnish,APOC3,APOC3,8018664,The Sst I polymorphism was examined in 179 Fin...,The S2 allele (Sst I restriction site present)...
4,5,significant,Finnish,"rs7412,rs429358",APOE,8018664,The common polymorphism of apolipoprotein E (E...,The frequency of the E2 allele was higher and ...


In [11]:
# rows that have a string in the column 'Variant(s)',
# but this string does not contain the characters 'rs' followed by two digits
df_var_not_rs = df_var_not_nan[~df_var_not_nan['Variant(s)'].str.contains(r'rs\d{2}', flags=re.IGNORECASE)]

print("Dimension:", df_var_not_rs.shape)
df_var_not_rs.head()

Dimension: (229, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
0,1,non-significant,Dutch,HLA-B40,HLA-B,1859103,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-D..."
1,2,non-significant,Dutch,HLA-DRB5,HLA-DRB5,1859103,The apolipoprotein B Xba I polymorphism was ex...,The frequencies of the Xba I alleles among the...
2,3,non-significant,Finnish,APOB,APOB,8018664,964 inhabitants aged 85 years and over and 244...,"Without correcting for multiple testing, HLA-B..."
3,4,significant,Finnish,APOC3,APOC3,8018664,The Sst I polymorphism was examined in 179 Fin...,The S2 allele (Sst I restriction site present)...
6,8,significant,Finnish,APOB,APOB,8155090,The common polymorphism of apolipoprotein E (E...,The frequency of the E4 allele was lower in th...


In [12]:
# rows that have a string in the column 'Variant(s)' that contains the characters 'rs' followed by two digits
df_var_rs = df_var_not_nan[df_var_not_nan['Variant(s)'].str.contains(r'rs\d{2}', flags=re.IGNORECASE)]

print("Dimension:", df_var_rs.shape)
df_var_rs.tail()

Dimension: (305, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
540,G549,non-significant,Jordanian,"rs2241766,rs266729",ADIPOQ,20201642,"SNPs rs266729 (-11377G/C), rs2241766 (+45T/G),...",No significant differences were detected in th...
541,G550,non-significant,Italian,"rs6457931,rs1321312,rs4331968,rs9470367,rs6920...","PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
543,G552,non-significant,Danish,"rs2866164,Q95H",MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
546,556,significant,American (Caucasian),rs1042714,ADRB2,20399803,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,"rs7412,rs429358",APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...


In [13]:
# ensure that we did not lose anything while splitting the original DataFrame
assert len(df) == len(df_var_nan) + len(df_var_rs) + len(df_var_not_rs), "Something wrong with splitting"
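
The three-way split can also be expressed with two boolean masks computed once, which makes it easier to see that the parts are disjoint and cover every row. A minimal sketch on a hypothetical toy frame (the variable names here are illustrative):

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({"Variant(s)": ["rs7412,rs429358", "HLA-B40", None,
                                   "Rs1800795", "Q95H"]})

col = toy["Variant(s)"]
is_nan = col.isna()
# na=False treats missing values as "no match"
has_rs = col.str.contains(r"rs\d{2}", case=False, na=False)

part_nan = toy[is_nan]
part_rs = toy[has_rs]
part_other = toy[~is_nan & ~has_rs]

# The three parts together cover every row exactly once
assert len(part_nan) + len(part_rs) + len(part_other) == len(toy)
```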

<br />

### Splitting of records with several 'rs' into several records

In [14]:
# Number of variants in each record in dataFrame with rs-variants.
# It is calculated by the number of commas +1
nmb_repeats = (df_var_rs['Variant(s)'].str.count(',') + 1).tolist()

assert len(df_var_rs) == len(nmb_repeats), "Problem with calculation"
print('Length of list:', len(nmb_repeats))
nmb_repeats[-5:]

Length of list: 305


[2, 31, 2, 1, 2]

In [15]:
## alternative way:
# nmb_repeats_alt = [len(st.split(',')) for st in df_var_rs['Variant(s)']]

## make sure that the two ways give the same result
# assert len(nmb_repeats_alt) == len(nmb_repeats), "Problems with calculation"
# assert nmb_repeats_alt == nmb_repeats, "Problems with calculation (2)"
# nmb_repeats_alt[-5:]

In [16]:
# duplicate records in the DataFrame 'df_var_rs' that have several variants in the column 'Variant(s)'
# according to the number of these variants.
df_var_rs_extended = df_var_rs.loc[df_var_rs.index.repeat(nmb_repeats)]

print("Dimension:", df_var_rs_extended.shape)
df_var_rs_extended.tail()

Dimension: (3137, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
543,G552,non-significant,Danish,"rs2866164,Q95H",MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
543,G552,non-significant,Danish,"rs2866164,Q95H",MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
546,556,significant,American (Caucasian),rs1042714,ADRB2,20399803,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,"rs7412,rs429358",APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,"rs7412,rs429358",APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...


In [17]:
# Create a flat sequence of all variants in the column 'Variant(s)' of the DataFrame 'df_var_rs'.
# A generator is used so that no intermediate list is built.
gen_variants = (el for ls in df_var_rs['Variant(s)'] for el in ls.split(','))
gen_variants

<generator object <genexpr> at 0x7efec6572d60>

In [18]:
# replace the content of the column 'Variant(s)' with the individual variants
df_var_rs_extended['Variant(s)'] = list(gen_variants)

print("Dimension:", df_var_rs_extended.shape)
df_var_rs_extended.tail()

Dimension: (3137, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
543,G552,non-significant,Danish,rs2866164,MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
543,G552,non-significant,Danish,Q95H,MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
546,556,significant,American (Caucasian),rs1042714,ADRB2,20399803,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,rs7412,APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,rs429358,APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
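
The repeat-and-assign steps above could also be done in one pass with `str.split` followed by `DataFrame.explode` (available since pandas 0.25). This is an alternative sketch, not the approach used in this notebook; the toy rows below are modelled on the preview and are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame modelled on the tail of df_var_rs
toy = pd.DataFrame({
    "id": ["G552", "556", "559"],
    "Variant(s)": ["rs2866164,Q95H", "rs1042714", "rs7412,rs429358"],
    "Gene(s)": ["MTTP", "ADRB2", "APOE"],
})

# Turn each comma-separated string into a list, then emit one row per element;
# the other columns and the original index are repeated automatically
exploded = (
    toy.assign(**{"Variant(s)": toy["Variant(s)"].str.split(",")})
       .explode("Variant(s)")
)

print(exploded)
```

As in the cells above, the original index values are repeated, so `reset_index(drop=True)` may be useful if a unique index is needed afterwards.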


In [19]:
# in the column 'Variant(s)' of the DataFrame 'df_var_rs_extended', if the first two characters are 'Rs' then change them to 'rs'
df_var_rs_extended['Variant(s)'] = df_var_rs_extended['Variant(s)'].map(lambda st: 'rs' + st[2:] if st.startswith('Rs') else st)
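
The cell above handles only the spelling `Rs`. If other capitalizations such as `RS` could occur (an assumption; the notebook does not say whether they do), a case-insensitive regular expression would normalize them all. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy values; only 'Rs' is known to occur in the real data
variants = pd.Series(["Rs1800795", "rs7412", "RS429358", "Q95H"])

# Lower-case an 'rs' prefix in any capitalization, but only when a digit follows
normalized = variants.str.replace(r"^rs(?=\d)", "rs", case=False, regex=True)

print(normalized.tolist())
```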

<br />

### Combining dataFrames and saving result to disk

In [20]:
# combine all dataFrames in one
df_result_all = pd.concat([df_var_rs_extended, df_var_not_rs, df_var_nan])

print("Dimension:", df_result_all.shape)
df_result_all.iloc[len(df_var_rs_extended)-10:len(df_var_rs_extended)+5]   # look at the middle of the DataFrame, where the join occurred

Dimension: (3382, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
541,G550,non-significant,Italian,rs3176349,"PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
541,G550,non-significant,Italian,rs876581,"PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
541,G550,non-significant,Italian,rs6457938,"PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
541,G550,non-significant,Italian,rs6457940,"PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
541,G550,non-significant,Italian,rs2145047,"PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
543,G552,non-significant,Danish,rs2866164,MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
543,G552,non-significant,Danish,Q95H,MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
546,556,significant,American (Caucasian),rs1042714,ADRB2,20399803,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,rs7412,APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,rs429358,APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...


In [21]:
# save all results together
df_result_all.to_csv(output / 'longevity_genes_splitted_all.csv', index=False)

<br />

In [22]:
# save only records that contain 'rs' at the beginning
df_result_rs = df_var_rs_extended[df_var_rs_extended['Variant(s)'].str.startswith('rs')]

print("Dimension:", df_result_rs.shape)
df_result_rs.tail()

Dimension: (3129, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
541,G550,non-significant,Italian,rs2145047,"PANDAR,CDKN1A,RAB44",20126416,30 SNPs were identified by sequencing the prom...,"Rare alleles of two exon-derived SNPs, rs18012..."
543,G552,non-significant,Danish,rs2866164,MTTP,16015282,1651 participants in the Danish 1905 cohort st...,The risk haplotype had no significant survival...
546,556,significant,American (Caucasian),rs1042714,ADRB2,20399803,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,rs7412,APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...
549,559,significant,French,rs429358,APOE,8136829,Meta-analysis of GWAS in Caucasians from four ...,There were 273 single-nucleotide polymorphism ...


In [23]:
df_result_rs.to_csv(output / 'longevity_genes_splitted_rs.csv', index=False)

<br />

In [24]:
# save other records, that do not contain 'rs' at the beginning
df_result_not_rs = pd.concat([df_var_rs_extended[~df_var_rs_extended['Variant(s)'].str.startswith('rs')],
                              df_var_not_rs,
                              df_var_nan
                             ])

print("Dimension:", df_result_not_rs.shape)
df_result_not_rs

Dimension: (253, 8)


Unnamed: 0,id,Association,Population,Variant(s),Gene(s),PubMed,Study Design,Conclusions
403,G410,non-significant,Bulgarian,-819C/T,IL10,15050300,"-1082G/A, -819 C/T and -592 C/A SNPs were exam...","Genotype -1082G/A, -819 C/C, -592 C/C was posi..."
403,G410,non-significant,Bulgarian,-592C/A,IL10,15050300,"-1082G/A, -819 C/T and -592 C/A SNPs were exam...","Genotype -1082G/A, -819 C/C, -592 C/C was posi..."
407,G414,non-significant,German,Q/H 95,MTTP,15911777,"rs1800591, rs2866164, and Q/H 95 SNPs were exa...",No evidence for association was detected betwe...
415,G423,non-significant,Ashkenazi Jewish,APM1+2019,ADIPOQ,18511746,Adiponectin levels and variants in the adipone...,Frequencies were not significantly different b...
415,G423,non-significant,Ashkenazi Jewish,APM+2019,ADIPOQ,18511746,Adiponectin levels and variants in the adipone...,Frequencies were not significantly different b...
...,...,...,...,...,...,...,...,...
369,375,significant,Finnish,,TP53,12483296,Inherited mtDNA markers were analyzed between ...,Male centenarians emerged in northern Italy as...
373,379,significant,Italian,,TP53,10463944,Associations between variation in SNPs in the ...,Carriers of the minor allele of rs12778366 had...
381,387,significant,"European (Danish, Finnish, South Italian and G...",,TP53,24341918,Allele and genotype distributions of rs1333049...,The frequency of the GG genotype in centenaria...
547,557,non-significant,American (Caucasian),,TP53,20824210,After 16 SNPs over LMNA gene were genotyped in...,A meta-analysis combining results from 5 sets ...


In [25]:
df_result_not_rs.to_csv(output / 'longevity_genes_splitted_not_rs.csv', index=False)

<br />

In [26]:
# ensure that we did not lose anything
assert len(df_var_rs_extended) + len(df_var_not_rs) + len(df_var_nan) == len(df_result_all), "Something was lost"
assert len(df_result_all) == len(df_result_rs) + len(df_result_not_rs), "Something was lost (2)"
print('Done')

Done
