This notebook imputes brain CpG methylation levels into UKB genotype data, using weights from the Edinburgh Brain Bank (EBB).
- update the bgen variant IDs with the EBB variant IDs
- merge the genotype data from all chromosomes
- make imputed methylation scores
- Date: 03.02.2026


## Setup

In [1]:
import os
import glob
import pandas as pd
from pandas.core.common import flatten
import re
import numpy as np
import seaborn as sns

In [35]:
%%bash
# remove files left over from a previous download (-f: no error if absent)
rm -f plink*
rm -f toy*
rm -f prettify
rm -f LICENSE
wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20250819.zip 2>/dev/null
unzip plink_linux_x86_64_20250819.zip 2>/dev/null
./plink --version

Archive:  plink_linux_x86_64_20250819.zip
  inflating: plink                   
  inflating: LICENSE                 
  inflating: toy.ped                 
  inflating: toy.map                 
  inflating: prettify                
PLINK v1.9.0-b.7.11 64-bit (19 Aug 2025)


In [37]:
%%bash
# remove files left over from a previous download (-f: no error if absent)
rm -f intel-simplified-software-license.txt
rm -f plink2*
rm -f vcf_subset*
wget https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20260110.zip 2>/dev/null
unzip plink2_linux_x86_64_20260110.zip 2>/dev/null
./plink2 --version

Archive:  plink2_linux_x86_64_20260110.zip
  inflating: plink2                  
  inflating: vcf_subset              
  inflating: intel-simplified-software-license.txt  
PLINK v2.0.0-a.7LM 64-bit Intel (10 Jan 2026)


### load EBB weights 

In [30]:
%%bash
dx download -f vasilis/data/ebb/weights/EBB.BRAIN.METHYL.HERIT.tar.bz2 

In [None]:
%%bash
tar -xjf EBB.BRAIN.METHYL.HERIT.tar.bz2 

### load QC'ed UKB imputation data

In [3]:
%%bash
mkdir -p qc_imp/
dx download -f -o qc_imp/ vasilis/data/ebb/imp_bed/*

### load variant ids EBB <-> UKB tables

In [4]:
%%bash
mkdir -p extract/
dx download -f -o extract/ vasilis/data/ebb/extract/* 

In [7]:
%%bash
nqc1=$(cat extract/imp_c*.bothIDs.extract | wc -l)
nqc2=$(cat qc_imp/imp_wb_qc_c*.bim | wc -l)
nebb=$(awk '{print $1}' EBB.BRAIN.METHYL.HERIT/EBB.BRAIN.METHYL.HERIT.variants | sort -u | wc -l)

echo "$nqc1 out of $nebb ($((100*nqc1/nebb)) %) weights' variants in UKB imputed data (INFO > 0.8 & MAF > 0.01)"
echo "$nqc2 out of $nebb ($((100*nqc2/nebb)) %) weights' variants in QC'ed UKB imputed data (INFO > 0.8 & MAF > 0.01 & plink QC)"

469881 out of 473982 (99 %) weights' variants in UKB imputed data (INFO > 0.8 & MAF > 0.01)
445614 out of 473982 (94 %) weights' variants in QC'ed UKB imputed data (INFO > 0.8 & MAF > 0.01 & plink QC)
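
The percentages printed above come from integer shell arithmetic, which truncates the fraction. As a minimal sketch, the same calculation can be done in Python with one decimal place, using the counts from the output above. The function name `coverage_pct` is hypothetical and not part of the pipeline.

```python
def coverage_pct(n_found: int, n_total: int) -> float:
    """Return the percentage of weight variants found, rounded to one decimal."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return round(100 * n_found / n_total, 1)


# counts copied from the cell output above
n_ebb = 473982
pct_imputed = coverage_pct(469881, n_ebb)  # UKB imputed data
pct_qc = coverage_pct(445614, n_ebb)       # QC'ed UKB imputed data

print(f"imputed: {pct_imputed} %, after plink QC: {pct_qc} %")
```

The extra decimal makes it easier to see how many variants the plink QC step removes relative to the INFO and MAF filters alone.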


### Update variant ids 

In [25]:
!head extract/imp_c22.bothIDs.extract 

22:17054103:G:A rs4008588
22:17054720:T:C rs9605903
22:17060409:TTTTG:T 22:17060409_TTTTG_T
22:17090624:T:C rs9605978
22:17112342:C:T rs9604967
22:17118461:C:T rs141204045
22:17141339:C:T rs2381086
22:17154984:A:G rs9605028
22:17203103:A:G rs2845380
22:17214252:C:T rs2845346


In [26]:
%%bash
mkdir -p qc_imp/new_id/logs

for chr in {1..22}
do
    # remove duplicate IDs from the mapping table, then duplicate variants from the genotypes
    awk '!seen[$2]++' extract/imp_c${chr}.bothIDs.extract > extract/imp_c${chr}.bothIDs.extract.nodup
    ./plink2 --bfile qc_imp/imp_wb_qc_c${chr} --rm-dup force-first --make-bed --out qc_imp/new_id/temp.rmdup.c${chr} >/dev/null
    # rename variants in the .bim file: column 1 of the table holds the new ID, column 2 the old ID
    ./plink2 --bfile qc_imp/new_id/temp.rmdup.c${chr} --update-name extract/imp_c${chr}.bothIDs.extract.nodup 1 2 --make-bed --out qc_imp/new_id/imp_wb_qc_newid_c${chr} >/dev/null
    # clean
    mv qc_imp/new_id/*log qc_imp/new_id/logs
    rm  qc_imp/new_id/temp.rmdup.c*
    
    echo "finished chr ${chr}"
done 

finished chr 1
finished chr 2
finished chr 3
finished chr 4
finished chr 5
finished chr 6
finished chr 7
finished chr 8
finished chr 9
finished chr 10
finished chr 11
finished chr 12
finished chr 13
finished chr 14
finished chr 15
finished chr 16
finished chr 17
finished chr 18
finished chr 19
finished chr 20
finished chr 21
finished chr 22


### Merge genotype files

In [38]:
%%bash
rm -f list_beds.txt
for chr in {1..22}; do echo "qc_imp/new_id/imp_wb_qc_newid_c${chr}" >> list_beds.txt; done

./plink \
  --merge-list list_beds.txt \
  --make-bed --out qc_imp/new_id/imp_wb_qc_newid_all


PLINK v1.9.0-b.7.11 64-bit (19 Aug 2025)           cog-genomics.org/plink/1.9/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to qc_imp/new_id/imp_wb_qc_newid_all.log.
Options in effect:
  --make-bed
  --merge-list list_beds.txt
  --out qc_imp/new_id/imp_wb_qc_newid_all

31380 MB RAM detected; reserving 15690 MB for main workspace.


to length-80+ variant IDs; consider using a different naming scheme for long
indels and the like.


Performing 5-pass merge (407606 people, 107232/445496 variants per pass).
Pass 1 complete.                              
Pass 2 complete.                              
Pass 3 complete.                              
Pass 4 complete.                              
Merged fileset written to qc_imp/new_id/imp_wb_qc_newid_all-merge.bed +
qc_imp/new_id/imp_wb_qc_newid_all-merge.bim +
qc_imp/new_id/imp_wb_qc_newid_all-merge.fam .
445496 variants loaded from .bim file.
407606 people (187273 males, 220333 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 407606 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.992736.
445496 variants and 407606 people pass filters and QC.
Note: No phenotypes present.
--mak

In [40]:
# upload merged data
!dx upload qc_imp/new_id/imp_wb_qc_newid_all* --dest vasilis/data/ebb/qc_imp/new_id/

ID                                file-J610Y28JZB71g7JkJy5qb9JQ
Class                             file
Project                           project-GfvP6PQJZB72v2Vk348Bb2yg
Folder                            /vasilis/data/ebb/qc_imp/new_id
Name                              imp_wb_qc_newid_all.bed
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Tue Feb  3 14:43:21 2026
Created by                        vasilisraptis
 via the job                      job-J60vB90JZB73Q4PY9bzGg4Vj
Last modified                     Tue Feb  3 14:46:04 2026
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"
ID                                file-J610ZF0JZB7JJj5v17pK918f
Class                             file
Pro

### Make imputed methylation scores
- use the EBB.BRAIN.METHYL.HERIT/scores/input/cp*.score.in files
- use the merged plink files
- use the EBB.BRAIN.METHYL.HERIT.list : list of weights paths
- make a samples x scores matrix for all CpGs, to use with PrediXcan


#### with SAK (Swiss Army Knife)

In [11]:
%%bash
dx upload mwas_02_impute.sh --dest vasilis/SAK_scripts/
dx upload mwas_02.1_pred_mat.py --dest vasilis/SAK_scripts/

ID                                file-J61qgFjJZB7BQy0vjx64bFvg
Class                             file
Project                           project-GfvP6PQJZB72v2Vk348Bb2yg
Folder                            /vasilis/SAK_scripts
Name                              mwas_02_impute.sh
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Wed Feb  4 21:35:43 2026
Created by                        vasilisraptis
 via the job                      job-J61k1j8JZB7GYx5p6JKVyfK3
Last modified                     Wed Feb  4 21:35:44 2026
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"
ID                                file-J61qgG0JZB7FbpX304KPy6Bq
Class                             file
Project                      

In [None]:
%%bash

UKB_bfile='vasilis/data/ebb/qc_imp/new_id/imp_wb_qc_newid_all'
WGT="vasilis/data/ebb/weights/EBB.BRAIN.METHYL.HERIT.tar.bz2"
dest="vasilis/data/ebb/scores/"

for batch in {2..50}; do
    dx run swiss-army-knife \
        -iin="vasilis/SAK_scripts/mwas_02_impute.sh" \
        -iin="vasilis/SAK_scripts/mwas_02.1_pred_mat.py" \
        -iin="${UKB_bfile}.bed" \
        -iin="${UKB_bfile}.bim" \
        -iin="${UKB_bfile}.fam" \
        -iin="${WGT}" \
        -icmd="sh mwas_02_impute.sh ${batch}" \
        --tag="imp_${batch}" \
        --instance-type "mem1_ssd1_v2_x36" \
        --destination="${dest}" \
        --brief --yes --priority high
done


job-J61v4X8JZB719Fgyyz21kK5J
job-J61v4XQJZB7FF1b5pBY76z5g
job-J61v4XjJZB7PFz746g85JbQK
job-J61v4Y0JZB7FF1b5pBY76z5z
job-J61v4Y8JZB72gFg9P2P9J2Zq
job-J61v4YQJZB72gFg9P2P9J2Zz
job-J61v4Z0JZB72gFg9P2P9J2b3
job-J61v4Z8JZB776744gkvKJxQ9
job-J61v4ZQJZB776744gkvKJxQJ
job-J61v4ZjJZB7PFz746g85JqGY
job-J61v4b0JZB719Fgyyz21kK77
job-J61v4b8JZB776744gkvKJxQf
job-J61v4bQJZB72gFg9P2P9J2j4
job-J61v4f0JZB7PFz746g85JxFJ
job-J61v4f8JZB719Fgyyz21kK7G


#### without using SAK (NOT USED, ~6 h)

In [2]:
### IF STARTING FROM HERE:
# !dx download -f vasilis/data/ebb/weights/EBB.BRAIN.METHYL.HERIT.tar.bz2 
# !tar -xjf EBB.BRAIN.METHYL.HERIT.tar.bz2 >/dev/null
!dx download -f vasilis/data/ebb/qc_imp/new_id/imp_wb_qc_newid_all*



In [47]:
%%bash
#rm scores/scores/*
#rm scores/logs/*

In [None]:
%%bash

# extract weights
# tar -xjf EBB.BRAIN.METHYL.HERIT.tar.bz2 

# inputs
UKB_bfile=qc_imp/new_id/imp_wb_qc_newid_all
WGTLIST_FULL=EBB.BRAIN.METHYL.HERIT/EBB.BRAIN.METHYL.HERIT.list # list of paths to weights (hsq < 0.05)
WGTLIST=EBB.BRAIN.METHYL.HERIT/EBB.BRAIN.METHYL.HERIT.list
scoresdir=EBB.BRAIN.METHYL.HERIT/scores/input


# output
mkdir -p scores/scores
mkdir -p scores/logs
>errors.log # capture errors 
>times.log # print times

### BATCH SETUP STARTS HERE

# TMP=TMP
# mkdir -p $TMP
# split -n l/50 --numeric-suffixes=1 $WGTLIST_FULL $TMP/temp_WGTLIST_ # split weights list into chunks - RUN ONCE
# WGTLIST=$TMP/temp_WGTLIST_$chunk # to be passed by SAK

### BATCH SETUP ENDS HERE


### make imputed methylation scores for each CpG
### use the *score.in files to impute methylation in UKB
### see: https://github.com/gusevlab/fusion_twas/tree/master/utils

a=($(wc -l $WGTLIST))
n_cpgs=${a[0]} # total no. of cpgs 

echo "Analysis started at: $(date)"

# set timer
total_time=0
n_loops=0

for i in $(seq 1 "$n_cpgs"); do
  start=$(date +%s.%N)
  
  # pull path to weights
  wgtfile_i=$(awk -v i=$i 'NR==i' $WGTLIST)
  # pull cpg name
  cpg_name_i=$(basename $wgtfile_i .wgt.RDat)
  # make per-CpG score in UKB; note: use --extract, otherwise the allele frequency calculation takes forever
  ./plink2 \
    --bfile $UKB_bfile \
    --extract $scoresdir/$cpg_name_i.score.in \
    --score $scoresdir/$cpg_name_i.score.in 1 2 4 header-read \
    --out scores/scores/$cpg_name_i \
    --threads $(nproc)  >/dev/null 2>> errors.log
  # cleanup
  mv scores/scores/$cpg_name_i.log scores/logs
  
  # print progress
  pct=$(awk "BEGIN { printf \"%.2f\", 100 * $i / $n_cpgs }")

  end=$(date +%s.%N)
  time_i=$(awk "BEGIN { printf \"%.2f\", $end - $start }")

  n_loops=$(( n_loops + 1 ))
  total_time=$(awk "BEGIN { print $total_time + $time_i }")
  avg_time=$(awk "BEGIN { printf \"%.2f\", $total_time / $n_loops }")

  printf "finished CpG %s, %d out of %d (%s%%), in %.2f s | avg time per CpG: %.2f s\n" \
         "$cpg_name_i" "$i" "$n_cpgs" "$pct" "$time_i" "$avg_time"  >> times.log
done

echo "Analysis finished at: $(date)"


Analysis started at: Tue Feb  3 15:50:38 UTC 2026
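
The commented-out batch setup in the cell above splits the weights list into 50 chunks with `split -n l/50`. The sketch below shows the same idea in Python under a simplifying assumption: it balances chunks by number of lines, whereas `split -n l/N` balances them by size in bytes without breaking lines. The function name `split_into_chunks` and the example file names are hypothetical.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most 1."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for k in range(n_chunks):
        # the first `extra` chunks receive one additional item
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# 103 hypothetical weight file names, split into 50 batches
paths = [f"cg{n:04d}.wgt.RDat" for n in range(1, 104)]
chunks = split_into_chunks(paths, 50)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Contiguous chunks keep the original ordering of the list, so concatenating the per-batch results in batch order reproduces the order of the full weights list.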


In [None]:
# make .tar.bz2 file 
!tar -cvjSf scores.tar.bz2 scores

In [54]:
# upload scores
!dx upload scores.tar.bz2 --dest vasilis/data/ebb/scores/
!dx upload errors.log  --dest vasilis/data/ebb/scores/
!dx upload times.log  --dest vasilis/data/ebb/scores/


ID                                file-J6174QQJZB7446zY4bBfXVQX
Class                             file
Project                           project-GfvP6PQJZB72v2Vk348Bb2yg
Folder                            /vasilis/data/ebb/scores
Name                              scores.tar.bz2
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Tue Feb  3 22:10:10 2026
Created by                        vasilisraptis
 via the job                      job-J60vB90JZB73Q4PY9bzGg4Vj
Last modified                     Tue Feb  3 22:10:16 2026
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"
ID                                file-J6174X8JZB75XPf2BFF29B7b
Class                             file
Project            