
<div id="top"></div>

# Subtyping COVID-19 Therapeutic Research Findings


#### Yuanfang Guan

#### Michigan Medicine, University of Michigan, Ann Arbor


## Summary

The goal of this exercise is to study this literature provided by the Kaggle COVID-19 challenge organizing team, and to subtype the COVID-19 therapeutic research findings. Specifically, we carried out the following four parts of work:

**[Part A. Drugs that have been used in clinical trials for COVID-19](#PartA).** We identified and characterized the [drugs on clinical trials](#PartAdrug) by integrating the FDA drug database and PubChem repository. We hand-curated and summarized the reported effectiveness for each drug. We presented the mutual similarity of [chemical structures](#PartAchem) across the drugs used in clinical trials. 

We [categorized](#PartAcat) the drugs based on their molecular mechanisms, which may facilitate the discovery of related drugs of similar mechanisms and the creation of effective cocktail treatment: 

 **[Category 1](#PartAcat1).** RNA mutagens 
 
 **[Category 2](#PartAcat2).** Protease inhibitors  
 
 **[Category 3](#PartAcat3).** Virus-entry blockers 

 **[Category 4](#PartAcat4).** Virus-release blockers
 
 **[Category 5](#PartAcat5).** Monoclonal antibodies
 

**[Part B. Drugs that have been proposed by computational works](#PartB).** We identified the computational publications, categorized their approaches into the following categories and discussed their performance and applications in other disease domains, and potential limitations. 
 
**[Category 1](#PartBcat1).** Gene-gene network-based algorithms. 

**[Category 2](#PartBcat2).** Expression-based algorithms
 
**[Category 3](#PartBcat3).** Docking simulation or protein structure-based for 
                     
   [Category 3.a](#PartBcat3.a). Small molecules 
                    
   [Category 3.b](#PartBcat3.b). Monoclonal antibodies
 

**[Part C. Drugs that have been proposed by in vitro experiments of COVID-19 invading human cells](#PartC).** We characterized the chemical structures and analyzed the chemical similarity for this group.

For this list, other than literature mining, we carried out a [machine learning experiment](#PartCexp) to prioritize previously unexplored FDA-approved drugs for repurposing without ADMET evaluation. After hand-removing the contaminations, we identified the following top candidates for repurposing: OLUMIANT(Baricitinib) treating rheumatoid arthritis, BRIMONIDINE, treating glaucoma, EDURANT(rilpivirine) treating Human Immunodeficiency Virus-1 (HIV-1), MARPLAN Treating depression, Corlanor (ivabradine) reduces the spontaneous pacemaker activity of the cardiac sinus node. We listed the potential [contaminations/biases](#PartClim) in the this and relevant approaches.

**[Part D. Epitope study for vaccines](#PartD)** We categorized vaccine studies by their approaches and discussed the background and limitations:

**[Approach 1](#PartDcat1).** Homology-based with SARS-COV (2003 version of SARS), other coronavirus or Ebola.

**[Approach 2](#PartDcat2).** Immunoinformatics including docking/molecular dynamics/protein structures/antigenticity.

We hand-curated a list of [147 epitopes](#PartDepi) from the publications, grouped them by source virus protein and T-cell/B-cell targets. We consolidated all published epitopes by sequence overlap into [124 unique groups](#PartDuniq).

**Summary points and future recommended research topics for Phase 2. **
    
 **Conclusion 1.** There is not a single drug for which consistent positive response has been reported. 
 
 **Conclusion 2.** There are overlaps between the drugs in clinical trials, proposed by computational analysis and proposed by in vitro experiments. However, some of the overlaps, especially those with computational analysis may come from a circularity in the methods.
 
 **Conclusion 3.** Drug candidates proposed by computation and in vitro screening could be biased towards cancer-related targeted therapy and substantially contaminated by existing literature or sometimes anecdote. This bias/contamination may affect a significant number of computation-based drug-repurposing studies including our own work, and certainly not limited to COVID-19.
 
 **Future direction 1.** Disagreement in the reported drug response can root from differences in dosage, baseline biometrics and population groups. With more trial results coming out, the next step is to carry out meta-analysis to stratify out these variables.
 
 **Future direction 2.** Analyzing vaccine findings at this stage is premature and there is no clinical effectiveness study yet. It will be meaningful to make genome variation and vaccines (or maybe antibodies as well) into the same topic, therefore allowing connecting the genome variations to what fraction of the virus strains that a vaccine could cover. 
 
 **Future direction 3.** We suggest a topic on news (e.g., google news) retrieval for therapeutic development, as many (if not most) treatment response may not first appear in manuscripts. 

Finally, we would like to take this opportunity to make one comment: Literature tends be biased towards reporting positive results，known biology (e.g., cancer and immuno- drugs), and anecdotes, and we should take the results of this exercise and other documents critically. 
<div id="PartA"></div>




[Go Top](#top)



## Part A Subtyping drugs currently in clinical trial
##### A.1 Methods: 
We first counted how many times each FDA drug occured in the documents provided by Kaggle:

In [None]:
import gensim
import copy
import os
import sys
import glob

root_path = './COVID19/CORD19/'

FILE=open('./COVID19/EOBZIP 03_23_2020/fda.csv','r')

exist={}
for line in FILE:
    line=line.strip()
    line=line.lower()
    exist[line]=1
FILE.close()

all_drugs=exist.keys()

all_files=glob.glob(f'{root_path}/comm_use_subset/**/*.json', recursive=True)
print(len(all_files))

NEW=open(('result.comm'),'w')
file_i=1
for the_file in all_files:
    FILE=open(the_file,'r')
    for line in FILE:
        if ('text' in line):
            line_ori=copy.copy(line).strip()
            line=(line.strip()).lower()

            wordlist = gensim.utils.simple_preprocess(line)
#            print(wordlist)
            for the_drug in all_drugs:
                if the_drug in wordlist:
                    NEW.write(the_drug)
                    NEW.write('\t')
                    NEW.write(line_ori)
                    NEW.write('\n')
    file_i=file_i+1

all_files=glob.glob(f'{root_path}/noncomm_use_subset/**/*.json', recursive=True)
print(len(all_files))
NEW=open(('result.noncomm'),'w')
file_i=1
for the_file in all_files:
 
        FILE=open(the_file,'r')
        for line in FILE:
            if ('text' in line):
                line_ori=copy.copy(line).strip()
                line=(line.strip()).lower()
                wordlist = gensim.utils.simple_preprocess(line)
#            print(wordlist)
                for the_drug in all_drugs:
                    if the_drug in wordlist:
                        NEW.write(the_drug)
                        NEW.write('\t')
                        NEW.write(line_ori)
                        NEW.write('\n')

        file_i=file_i+1


all_files=glob.glob(f'{root_path}/costom_license/**/*.json', recursive=True)
print(len(all_files))
NEW=open(('result.pmc'),'w')
file_i=1
for the_file in all_files:
        FILE=open(the_file,'r')
        for line in FILE:
            if ('text' in line):
                line_ori=copy.copy(line).strip()
                line=(line.strip()).lower()

                wordlist = gensim.utils.simple_preprocess(line)
#            print(wordlist)
                for the_drug in all_drugs:
                    if the_drug in wordlist:
                        NEW.write(the_drug)
                        NEW.write('\t')
                        NEW.write(line_ori)
                        NEW.write('\n')
  
        file_i=file_i+1


all_files=glob.glob(f'{root_path}/biorxiv_medrxiv/**/*.json', recursive=True)
print(len(all_files))
NEW=open(('result'),'w')
file_i=1
for the_file in all_files:
        FILE=open(the_file,'r')
        for line in FILE:
            if ('text' in line):
                line_ori=copy.copy(line).strip()
                line=(line.strip()).lower()

                wordlist = gensim.utils.simple_preprocess(line)
#            print(wordlist)
                for the_drug in all_drugs:
                    if the_drug in wordlist:
                        NEW.write(the_drug)
                        NEW.write('\t')
                        NEW.write(line_ori)
                        NEW.write('\n')
        file_i=file_i+1



18672


Next, we counted the number of publications each drug appeared.

In [None]:
!cat ../COVID19/EOBZIP 03_23_2020/result.* |cut -f 1-2|sort|uniq|cut -f 1|sort|uniq -c|sort -g >sorted_alresult
!cat ../COVID19/EOBZIP 03_23_2020/result.* |grep -i -E -- 'COVID|coronavirus|SARS'|cut -f 1-2|sort|uniq|cut -f 1|sort|uniq -c|sort -g >sorted_alresult.coronavirus
!cat ../COVID19/EOBZIP 03_23_2020/result.* |grep -i -E -- 'COVID-19|COVID19|sars-cov-2'|cut -f 1-2|sort|uniq|cut -f 1|sort|uniq -c|sort -g >sorted_alresult.covid19
!tail -30 sorted_alresult
!tail -30 sorted_alresult.coronavirus
!tail -30 sorted_alresult.covid19
!!find ../COVID19/CORD19/*  -type f | xargs grep -i antibody |grep -E -- 'COVID-19|COVID19|sars-cov-2' >antibody_paragraphs.txt

[Go Top](#top)


<div id="PartAdrug"></div>

##### A.2 Results

###### A.2.1 The number of publications each drug appeared, top ones, >=100 times, are (full list in sorted_alresult):
*     103 hydrocortisone
*     106 ritonavir
*     111 prednisolone
*     113 dv
*     118 ciprofloxacin
*     119 cyclosporine
*     127 acyclovir
*     134 azithromycin
*     141 amoxicillin
*     155 doxycycline
*     159 dexamethasone
*     166 triad
*     177 chloramphenicol
*     177 kanamycin
*     238 isoflurane
*     248 gentamicin
*     370 bal
*     383 adenosine
*     436 insulin
*     480 ribavirin
*    1767 penicillin

[Go Top](#top)


###### A.2.2 the drugs that have been related to coronavirus in literature, and the top ones, >10 times, are (full list in sorted_alresult.coronavirus):
*      10 times: amoxicillin
*      10 times: fluorouracil
*      10 times: kanamycin
*      12 times: azithromycin
*      12 times: hydrocortisone
*      13 times: doxycycline
*      13 times: levofloxacin
*      14 dexamethasone
*      14 isoflurane
*      15 dv
*      15 kaletra
*      15 prednisolone
*      15 tamiflu
*      16 cyclosporine
*      16 gentamicin
*      18 tao
*      19 acyclovir
*      24 triad
*      25 insulin
*      35 remdesivir
*      41 adenosine
*      60 ritonavir
*      66 bal
*      86 penicillin
*     150 ribavirin
    
 [Go Top](#top)
 
###### A.2.3 The drugs specifically related to COVID-19 in literature (sorted_alresult.covid19)
*       1 acetaminophen
*       1 acyclovir
*       1 amoxicillin
*       1 antitussive
*       1 azithromycin
*       1 bal
*       1 ceftriaxone
*       1 chloramphenicol
*       1 digoxin
*       1 doxycycline
*       1 fluorouracil
*       1 ganciclovir
*       1 ibuprofen
*       1 iclusig
*       1 insulin
*       1 levofloxacin
*       1 penicillin
*       1 sulfasalazine
*       1 tigecycline
*       2 adenosine
*       2 triad
*       3 darunavir
*       4 tao
*       7 kaletra
*      12 ribavirin
*      17 remdesivir
*      22 ritonavir

[Go Top](#top)


<div id='PartAchem'></div>

 Now we analyze the chemical similarities of these drugs.




In [None]:

import numpy as np

exist={}
LIST=open('../COVID19/EOBZIP 03_23_2020/sorted_alresult.coronavirus','r')
for line in LIST:
    line=line.replace('\s\s+','\t')
    line=line.strip()
    table=line.split(' ')
    
    if (float(table[0])>4):
        exist[table[1]]=0
    
    
REF=open('../COVID19/EOBZIP 03_23_2020/combine_drug_name_id.csv','r')
DATA=open('../COVID19/EOBZIP 03_23_2020/combined_fp2_data.csv','r')

drug=[]
all_drug={}
for ref in REF:
    #if ('pos' in ref):
        ref=ref.strip()
        rrr=ref.split(',')
        
        if (rrr[1].lower() in exist):
            drug.append(rrr[1])
    
            data=DATA.readline()
            data=data.strip()
            data=data.split(',')
            kkk=0
            for i in data:
                data[kkk]=float(i)
                kkk+1
            all_drug[rrr[1]]=np.asarray(data).astype(np.float)
    
REF.close()
DATA.close()

connections1=[]
connections2=[]
for drug1 in drug:
    for drug2 in drug:
        if (drug1<drug2):
            cor=np.corrcoef(all_drug[drug1],all_drug[drug2])
            if (cor[0,1]>0.35):
                connections1.append(drug1)
                connections2.append(drug2)
                
import sys
import plotly.graph_objects as go
import networkx as nx

node_list=list(all_drug.keys())
G = nx.Graph()
for i in node_list:
    G.add_node(i)

i=0
for drug1 in connections1:
    drug2=connections2[i]
    G.add_edges_from([(drug1,drug2)])
    i=i+1



pos = nx.spring_layout(G, k=0.5, iterations=50)
for n, p in pos.items():
    G.nodes[n]['pos'] = p
    
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=1,color='#888'),
    hoverinfo='none',
    mode='lines')


for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
    
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='RdBu',
        reversescale=True,
        color=[],
        size=15,
        colorbar=dict(
            thickness=5,
       #     title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0)))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
   # node_info = adjacencies[0] +' # of connections: '+str(len(adjacencies[1]))
    node_info = adjacencies[0]
    node_trace['text']+=tuple([node_info])

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Similarity of chemical structures among the drugs that are related to coronavirus in literature',
                titlefont=dict(size=12),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=50,l=100,r=100,t=50),
                annotations=[ dict(
                   # text="No. of connections",
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper") ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
fig.show()


In [3]:

import numpy as np

exist={}
LIST=open('./COVID19/EOBZIP 03_23_2020/sorted_alresult.covid19','r')
for line in LIST:
    line=line.replace('\s\s+','\t')
    line=line.strip()
    table=line.split(' ')
    if (float(table[0])>0):
        exist[table[1]]=0
    
    
REF=open('../input/drugdata/combine_drug_name_id.csv','r')
DATA=open('../input/drugdata/combined_fp2_data.csv','r')

drug=[]
all_drug={}
for ref in REF:
    #if ('pos' in ref):
        ref=ref.strip()
        rrr=ref.split(',')
        
        if (rrr[1].lower() in exist):
            drug.append(rrr[1])
    
            data=DATA.readline()
            data=data.strip()
            data=data.split(',')
            kkk=0
            for i in data:
                data[kkk]=float(i)
                kkk+1
            all_drug[rrr[1]]=np.asarray(data).astype(np.float)
    
REF.close()
DATA.close()

connections1=[]
connections2=[]
for drug1 in drug:
    for drug2 in drug:
        if (drug1<drug2):
            cor=np.corrcoef(all_drug[drug1],all_drug[drug2])
            if (cor[0,1]>0.35):
                connections1.append(drug1)
                connections2.append(drug2)
                
import sys
import plotly.graph_objects as go
import networkx as nx

node_list=list(all_drug.keys())
G = nx.Graph()
for i in node_list:
    G.add_node(i)

i=0
for drug1 in connections1:
    drug2=connections2[i]
    G.add_edges_from([(drug1,drug2)])
    i=i+1



pos = nx.spring_layout(G, k=0.5, iterations=50)
for n, p in pos.items():
    G.nodes[n]['pos'] = p
    
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=1,color='#888'),
    hoverinfo='none',
    mode='lines')


for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
    
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='RdBu',
        reversescale=True,
        color=[],
        size=15,
        colorbar=dict(
            thickness=5,
       #     title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0)))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
   # node_info = adjacencies[0] +' # of connections: '+str(len(adjacencies[1]))
    node_info = adjacencies[0]
    node_trace['text']+=tuple([node_info])

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Similarity of chemical structures among the drugs that are related to COVID-19 in literature',
                titlefont=dict(size=12),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=50,l=100,r=100,t=50),
                annotations=[ dict(
                   # text="No. of connections",
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper") ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
fig.show()


[Go Top](#top)

<div id="PartAcat">
</div>


###### A.2.4 Literature summary

After hand-removing the irrelevant ones, the drugs can be roughly categorized by their effective mechanisms into:

| Group and Mechanism | Popular Drugs in Trials |
| --- | --- |
| RNA mutagens that stop the copying of the virus | Remdesivir, Favipiravir, Fluorouracil, Ribavirin, Acyclovir  |
| Protease inhibitors that block the multiplication of the virus | Ritonavir, Lopinavir, Kaletra, Darunavir |
| Stopping the entry of the virus into the host cell | Arbidol, Hydroxychloroquine, Chloroquine phosphate |
| Stopping the release of the virus from the host cell | Oseltamivir |
| Monoclonal antibodies targeting a virus protein/epitope | IL-6 monoclonal antibody, Spike (S) protein antibody |


<div id="PartAcat1"></div>
**A.2.4.1 RNA mutagens**

Viruses need to copy themselves in order to invade the host and transmit (like cancer cells), thus it makes sense that mutagens that block the copying can be used as drugs. 

   **Remdesivir**: It was studied in many publications related to coronavirus. It was suggested to be highly effective in the control of 2019-nCoV infection in vitro, while their cytotoxicity remains in control (0562f70516579d557cd1486000bb7aac5ccec2a1.json, 95cc4248c19a3cc9a54ebcfa09fc7c80518dac5d.json). It was also reported to significantly reduce lung viral load in mice and with successful clinical cases (0562f70516579d557cd1486000bb7aac5ccec2a1.json, 49ac69f362c27acbc6de0c5cbb640267e7a1e797.json). In clinical settings, it has been used as compassionate treatment. Other papers, e.g.,  3e9ae5329eecab16d7c39f1f6dc778cf4a53ee0d.json, suggest the effect is still to be verified.

   **Favipiravir**: It was suggested to be a good candidate (58be092086c74c58e9067121a6ba4836468e7ec3.json). It has been used in trials to treat SARS-CoV-2 infections, while the scores of favipiravir docking with the targets in some virtual screenings are relatively low (based on a computation study 95cc4248c19a3cc9a54ebcfa09fc7c80518dac5d.json)

   **Fluorouracil**: The RNA mutagen 5-fluorouracil (5-FU) treatment will also increase the U:C and A:G transitions. 

   **Ribavirin**: It was suggested to be useful for MERS (e5f19b6daf956e815c779228cc0cad1293d65bbb.json). It has been reported to reduce death rate in COVID-19 patients: f294f0df7468a8ac9e27776cc15fa20297a9f040.json. 

   **Acyclovir**: No statistical difference in treatment effect (baabfb35a321ea12028160e0d2c1552a2fda2dd5.json)

[Go Top](#top)

<div id="PartAcat2"></div>
**A.2.4.2 Protease inhibitors **

   **Ritonavir**: It was suggested to inhibit proteases and thus block multiplication of the virus. It was reported to deliver a substantial clinical benefit for COVID-19 patients (0562f70516579d557cd1486000bb7aac5ccec2a1.json, and its effectiveness is suggested by *computational* docking studies (9e94f9379fd74fcacc4f3a57e03cbe9035efee8e.json), while others clinical studies showed no effect at all or 'failed' treatment (24e17488d399c436305c819953beae2961214771.json, 8349823092836fe397a59e38615d1491423dbe70.json,8349823092836fe397a59e38615d1491423dbe70.json, ). Previously, it was shown to be beneficial for treating SARS and MERS (3afd5fba7dc182ddfa769c0d766134b525581005.json ).

   **Lopinavir**: Lopinavir is a protease inhibitor. It was reported with substantial benefit for treating COVID-10 patients (0562f70516579d557cd1486000bb7aac5ccec2a1.json). Most studies consider Lopinavir as a potential candidate. 

   **Kaletra**: It is the combination of Ritonavir and Lopinavar.

   **Darunavir**: The drug was suggested to be potentially beneficial by *computational* docking experiments (9e94f9379fd74fcacc4f3a57e03cbe9035efee8e.json), and in vivo studies (95cc4248c19a3cc9a54ebcfa09fc7c80518dac5d.json). 

[Go Top](#top)

<div id="PartAcat3"></div>
**A.2.4.3 By stopping the entry of the virus into the host cell**

   **Arbidol**: It inhibits membrane fusion between virus particles and plasma membranes, but it shows no statistical difference in treating COVID-19 patients (baabfb35a321ea12028160e0d2c1552a2fda2dd5.json) 
   
   **Hydroxychloroquine, Chloroquine phosphate**: Some studies also suggest that hydroxycholoroquine is working by blocking the entry of the virus, though the exact mechanism is unknown (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7102587/). Chloroquine effectively inhibited SARS-CoV-2 in vitro (58be092086c74c58e9067121a6ba4836468e7ec3.json). Chloroquine phosphate was reported to have apparent efficacy and acceptable safety against COVID-19 in a multicenter clinical trials (462cbb326ccd8587cae7a3538c8c6712d9013698.json, b70d27459fd8143edf76721da40cdbca399c9fb1.json).Chloroquine has been recently written into official recommendation for empirical therapy of COVID-19 for its adequate safety data in human (0562f70516579d557cd1486000bb7aac5ccec2a1.json)

<div id="PartAcat4"></div>
**A.2.4.4 By stopping the release of the virus from the host cell**

   **Oseltamivir**: Tamiflu, inhibitors of the neuraminidase enzyme, no statistical difference in treating COVID-19 (baabfb35a321ea12028160e0d2c1552a2fda2dd5.json) 

The other drugs in the list are irrelevant in this context of effectiveness. Some are related to test of toxicity

<div id="PartAcat5"></div>
**A.2.4.5 By generating monoclonal antibodies targeting certain proteins of the virus**

   **IL-6 monoclonal antibody**: the IL-6 monoclonal antibody-directed COVID-19 therapy has been used in clinical trial in China (No.ChiCTR2000029765) (7852aafdfb9e59e6af78a47af796325434f8922a.json, c8d206a4f9af0709b6e9ee90c4d854d482cb0784.json), and IL-6 level was suggested to serve as an indicator of poor prognosis, and was suggest to be used for these patients (c8437a45bfb84fb206fe03fd18d28858bae32651.json).
  
   **Spike (S) protein antibody**: It was suggested that monoclonal antibody against the S protein may 231 efficiently block the virus from entering the host (c8437a45bfb84fb206fe03fd18d28858bae32651.json). 
   
Note: some other drugs, though used to treat COVID-19, are not relevant to the discussion. For example, broad-spectrum antibiotics or fever reducers are often used in control arm. 

[Go Top](#top)

##### A.3. Limitations

The above analysis has the following limitations: 

  1. We used a rather earlier version of the literature set (because the searching step took quite a long time), and some popular drugs, e.g. hydroxychloroquine are only discussed but without clear clinical conclusion yet. 

  2. Literature could be substantially biased towards positive results and by computational methods (discussed below).

Groups of drugs in clinical trials by working mechanisms

In [4]:
import sys
import plotly.graph_objects as go
import networkx as nx

node_list=list(['Chloroquine phosphate','Spike (S) antibody','IL-6 antibody','Remdesivir','Favipiravir','Fluorouracil','Ribavirin','Acyclovir','Ritonavir','Lopinavir','Kaletra','Darunavir','Arbidol','Hydroxychloroquine','Oseltamivir'])
G = nx.Graph()
for i in node_list:
    G.add_node(i)

G.add_edges_from([('Spike (S) antibody','IL-6 antibody')])
G.add_edges_from([('Remdesivir','Favipiravir')])
G.add_edges_from([('Remdesivir','Fluorouracil')])
G.add_edges_from([('Remdesivir','Ribavirin')])
G.add_edges_from([('Remdesivir','Acyclovir')])
G.add_edges_from([('Fluorouracil','Favipiravir')])
G.add_edges_from([('Ribavirin','Favipiravir')])
G.add_edges_from([('Acyclovir','Favipiravir')])
G.add_edges_from([('Fluorouracil','Ribavirin')])
G.add_edges_from([('Fluorouracil','Acyclovir')])
G.add_edges_from([('Ribavirin','Acyclovir')])

G.add_edges_from([('Ritonavir','Lopinavir')])
G.add_edges_from([('Ritonavir','Kaletra')])
G.add_edges_from([('Ritonavir','Darunavir')])
G.add_edges_from([('Lopinavir','Kaletra')])
G.add_edges_from([('Lopinavir','Darunavir')])
G.add_edges_from([('Kaletra','Darunavir')])

G.add_edges_from([('Arbidol','Hydroxychloroquine')])
G.add_edges_from([('Chloroquine phosphate','Hydroxychloroquine')])
G.add_edges_from([('Chloroquine phosphate','Arbidol')])

pos = nx.spring_layout(G, k=0.5, iterations=50)
for n, p in pos.items():
    G.nodes[n]['pos'] = p
    
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=1,color='#888'),
    hoverinfo='none',
    mode='lines')


for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
    
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='RdBu',
        reversescale=True,
        color=[],
        size=15,
        colorbar=dict(
            thickness=5,
       #     title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0)))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
   # node_info = adjacencies[0] +' # of connections: '+str(len(adjacencies[1]))
    node_info = adjacencies[0]
    node_trace['text']+=tuple([node_info])

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Groups of drugs in clinical trials by working mechanisms',
                titlefont=dict(size=12),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=100,l=100,r=100,t=100),
                annotations=[ dict(
                   # text="No. of connections",
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper") ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
fig.show()

[Go Top](#top)


<div id="PartB"></div>

## Part B Subtyping computational approaches that are used to propose drug candidates

We then subtyped computational methods developed to repurposing drugs for COVID-19.

##### B.1 Methods

During reading the literature curated in Part A, we came across computational studies that focus on predicting drugs suitable for repurposing for COVID-19. These works tend to propose many drugs. 


##### B.2 Results

<div id='PartBcat1'></div>

##### B.2.1 Gene-gene network-based approaches

**Example**: https://www.nature.com/articles/s41421-020-0153-3 repurposed drugs by network approaches based on homology analysis to other viruses. The authors proposed 16 potential drugs: Irbesartan, Torernifene, Camphor, Equilin, Mesalazine, Mercaptopurine, Paroxetine, Sirolimus, Carvedilol, Colchicine, Dactinomycin, Melatonin, Quinacrine, Eplerenone, Emodin, Oxymetholone. 

**Background**: Network-based drug response has been intensively used in the cancer area and was shown to excel in several benchmarks. 

<div id='PartBcat2'></div>

##### B.2.2 Expression-based approaches

**Example**: https://arxiv.org/abs/2003.14333 repurposed drugs for treating lung injury in COVID-19 by 'could best reverse abnormal gene expression caused by (SARS)-CoV-2-induced inhibition of ACE2 in lung cells,' an effective drug treatment is one that reverts the aberrant gene expression back
to the normal levels'. The authors proposed the following drugs': geldanamycin, panobinostat, trichostatin A, narciclasine, COL-3 and CGP-60474.


<div id='PartBcat3'></div>

##### B.2.3 Docking or structural-based approaches

<div id='PartBcat3.a'></div>

**B.2.3.1 Small molecule prediction**

**Example 1**: https://www.biorxiv.org/content/10.1101/2020.03.03.972133v1.full 'a novel advanced deep Q-learning network with the fragment-based drug design (ADQN-FBDD) for generating potential lead compounds targeting SARS-CoV-2 3CLpro' Prioritized 48 candidates by docking (supplement Table S1).

**Example 2**: https://www.sciencedirect.com/science/article/pii/S2211383520302999  studied the proteins encoded by SARS-CoV-2 genes, compared them with proteins from other coronaviruses, predicted their structures, and built 19 structures that could be done by homology modeling, Library of ZINC drug database, natural products, 78 anti-viral drugs were screened against these targets plus human ACE2. Prioritized the hundreds of drugs, ranked by docking scores: e.g., Ribavirin, alganciclovir, β-Thymidine, Platycodin D, Chrysin,Neohesperidin, Lymecycline, Chlorhexidine, Alfuzosin, Betulonal, Valganciclovir, Chlorhexidine, Betulonal, Gnidicin.



<div id='PartBcat3.b'></div>

**B.2.3.2 Monoclonal antibody prediction**

**Example 1: docking-based proposal of antibodies** https://www.biorxiv.org/content/10.1101/2020.02.22.951178v1.full.pdf The neutralizing antibodies are proposed by computationally docking to the S protein of COVID-19 by docking simulation.

**Example 2: ACE2 pathway-based proposal of antibodies** https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7079879/ Potential therapeutic approaches include a SARS-CoV-2 spike protein-based vaccine; a transmembrane protease serine 2 (TMPRSS2) inhibitor to block the priming of the spike protein; blocking the surface ACE2 receptor by using anti-ACE2 antibody or peptides; and a soluble form of ACE2 which should slow viral entry into cells through competitively binding with SARS-CoV-2 and hence decrease viral spread as well as protecting the lung from injury through its unique enzymatic function. MasR-mitochondrial assembly receptor, AT1R-Ang II type 1 receptor.

**Background**: Docking has been used intensively in drug discovery in areas such as cancers. 

##### B.3 Limitations

* Computationally proposed drugs tend to be a lot in a single piece of article, sometimes, hundreds of drugs in a single study.
* Most of the works adopted methods from other pharmacogenomics field that were previously developed for cancers. 
* We are not aware these approaches have generated hypotheses that are used in real-world clinical trials even in popular fields, e.g. cancer, Alzheimer's. Thus, use them with cautions.



<div id="PartC"></div>

[Go Top](#top)


## Part C. Drugs proposed by in vitro experiments

#### C.1 Methods

###### C.1.1 Data curation

Other than the drugs used in clinical trials and computational methods, we found an interesting study that carried out genome-wide in vivo binding screening of the virus proteins and human proteins, and proposed 37 drugs that directly target these proteins in the supplementary table 6 of Gordon et al (https://www.biorxiv.org/content/10.1101/2020.03.22.002386v1.supplementary-material?versioned=true). These drugs are currently being screened by the authors: Loratadine, Daunorubicin, Midostaurin, Ponatinib, Silmitasertib, Valproic Acid, Haloperidol, Metformin, Migalastat, S-verapamil, Indomethacin, Ruxolitinib, Mycophenolic acid, Entacapone, Ribavirin, E-52862, Merimepodib, RVX-208, XL413, AC-55541, Apicidin, AZ3451, AZ8838, Bafilomycin A1, CCT365623, GB110, H-89, JQ1, PB28, PD-144418, RS-PPCC, TMCB, UCPH-101, ZINC1775962367, ZINC4326719, ZINC4511851, ZINC95559591.

<div id='PartCexp'></div>
###### C.1.2 Construction of training set

We carried out a machine learning exercise, with the hypothesis that the drugs that will be potentially effective should overlap globally in function of these drug targets. We could extract the chemical structure of 34 of the 37 drugs proposed by the authors, which are used as positive examples. The second positive set is the combination of the first positive set and four other drugs that are currently under clinical trial and whose chemical structure can be extracted: remdesivir, hydroxychloroquine, favipiravir and Vitamin C, and thus 38 in total.

The negative training set, which is also the candidate set, is constructed using the FDA approved list, which was downloaded in Oct 2019 from https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm. This list has a total of 7305 drugs, 5596 of which we could obtain the fingerprinting structure.

In [None]:
#!/usr/bin/env python
# coding: utf-8
import sys
!conda install --yes --prefix {sys.prefix} -c rdkit rdkit
!pip install pubchempy
!conda install --yes --prefix {sys.prefix} -c openbabel openbabel
import numpy as np
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import Chem
from pubchempy import *
from rdkit.Chem import MACCSkeys, AllChem
import csv
import openbabel
import pybel
import random
import time

### map fingerprinting to features; the appended label is later removed in the training code

def fp_to_feature_v1(fp, data_set, label):
    tmp = fp.ToBitString()
    the_feature = list(map(int, tmp))
    if label == 1:
        the_feature.append(1)
    else:
        the_feature.append(0)
    data_set.append(the_feature)
    return data_set

def fp_to_feature_v2(fp, data_set, label):
    the_feature = [0 for i in range(1024)]
    for n in fp.bits:
        the_feature[n] = 1
    if label == 1:
        the_feature.append(1)
    else:
        the_feature.append(0)
    data_set.append(the_feature)
    return data_set

def fp_prepare(fp, max_bit, fps):
    if len(fp.bits) > 0 and max(fp.bits) > max_bit:
        max_bit = max(fp.bits)
    fps.append(fp.bits)
    return max_bit, fps

def fp_to_feature_v3(max_bit, fps, num_valid_pos, data_set):
    m = 1
    for bits in fps:
        the_feature = [0 for i in range(max_bit + 1)]
        for n in bits:
            the_feature[n] = 1
        if m <= num_valid_pos:
            the_feature.append(1)
        else:
            the_feature.append(0)
        data_set.append(the_feature)
        m += 1
    return data_set

def get_all_data(smile, maccs_data, morgan_data, fp2_data, fp3_max_bit, fp3s, fp4_max_bit, fp4s, label, num):
    ms = Chem.MolFromSmiles(smile)
    mol = pybel.readstring("smi", smile)
    if ms and mol:
        fp = MACCSkeys.GenMACCSKeys(ms)
        maccs_data = fp_to_feature_v1(fp, maccs_data, label)
        fp = AllChem.GetMorganFingerprintAsBitVect(ms,2,nBits=1024)
        morgan_data = fp_to_feature_v1(fp, morgan_data, label)
        fp2 = mol.calcfp('FP2')
        fp2_data = fp_to_feature_v2(fp2, fp2_data, label)
        fp3 = mol.calcfp('FP3')
        fp3_max_bit, fp3s = fp_prepare(fp3, fp3_max_bit, fp3s)
        fp4 = mol.calcfp('FP4')
        fp4_max_bit, fp4s = fp_prepare(fp4, fp4_max_bit, fp4s)
        num += 1
    return maccs_data, morgan_data, fp2_data, fp3_max_bit, fp3s, fp4_max_bit, fp4s, num


maccs_data = []
morgan_data = []
fp2_data = []
fp3_data = []
fp3_max_bit = 0
fp3s = []
fp4_data = []
fp4_max_bit = 0
fp4s = []

## the random time sleep is inserted because continuous search from pubchem will result in blocking of our IP 
## I was told by pubchem that there are better methods to download, but we never had a chance to explore
num_pos = 0
num_valid_pos = 0
with open("../input/drugdata/bioarchive.list.csv") as f:
    a = 0
    for line in f:
        line=line.strip()
        if a % 100 == 0:
            print(a // 100)
        time.sleep(random.randint(0, 5))
        m = get_compounds(line, 'name')
        if len(m) > 0:
            smile = m[0].isomeric_smiles
            maccs_data, morgan_data, fp2_data, fp3_max_bit, fp3s, fp4_max_bit, fp4s, num_valid_pos = get_all_data(smile, maccs_data, morgan_data, fp2_data, fp3_max_bit, fp3s, fp4_max_bit, fp4s, 1, num_valid_pos)
            #num_valid_pos += 1
        num_pos += 1
        a += 1

csvfile = open("../input/drugdata/fda.csv", encoding='utf-8')
reader = csv.reader(csvfile)
num_neg = 0
num_valid_neg = 0
#a = 0
for item in reader:
    time.sleep(random.randint(0, 5))
    if item[0]:
        time.sleep(random.randint(0, 5))
        m = get_compounds(item[0], 'name')
        if len(m) > 0:
            smile = m[0].isomeric_smiles
            maccs_data, morgan_data, fp2_data, fp3_max_bit, fp3s, fp4_max_bit, fp4s, num_valid_neg = get_all_data(smile, maccs_data, morgan_data, fp2_data, fp3_max_bit, fp3s, fp4_max_bit, fp4s, 0, num_valid_neg)
            #num_valid_neg += 1
        num_neg += 1
    #a += 1

csvfile.close()

fp3_data = fp_to_feature_v3(fp3_max_bit, fp3s, num_valid_pos, fp3_data)
fp4_data = fp_to_feature_v3(fp4_max_bit, fp4s, num_valid_pos, fp4_data)

### generate individual feature files in the format of features, followed by the last element to be labels
with open("combined_maccs_data.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(maccs_data)

with open("combined_morgan_data.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(morgan_data)

with open("combined_fp2_data.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(fp2_data)

with open("combined_fp3_data.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(fp3_data)

with open("combined_fp4_data.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(fp4_data)

[Go Top](#top)


###### Code for extracting chemical features 2

In [None]:
#!/usr/bin/env python
# coding: utf-8
import numpy as np
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import Chem
from pubchempy import *
from rdkit.Chem import rdmolops
import pybel
import csv
import time
import random

data = []
drug_name_id = []

num_pos = 0
num_valid_pos = 0
with open("../input/drugdata/bioarchive.list.csv") as f:
    a = 0
    for line in f:
        line=line.strip()
        if a % 100 == 0:
            print(a // 100)
        m = get_compounds(line, 'name')
        if len(m) > 0:
            smile = m[0].isomeric_smiles
            #labels.append(1)
            ms = Chem.MolFromSmiles(smile)
            mol = pybel.readstring("smi", smile)
            if ms and mol:
                fp = rdmolops.RDKFingerprint(ms)
                tmp = fp.ToBitString()
                the_feature = list(map(int, tmp))
                the_feature.append(1)
                data.append(the_feature)
                drug_name_id.append(['pos_name', line[30:-1]])
                num_valid_pos += 1
        num_pos += 1
        a += 1

csvfile = open("../input/drugdata/fda.csv", encoding='utf-8')
reader = csv.reader(csvfile)
num_neg = 0
num_valid_neg = 0
#a = 0
for item in reader:
    if reader.line_num % 10 == 0:
        time.sleep(random.randint(0, 5))
    if reader.line_num % 100 == 0:
        print(reader.line_num // 100)
    if item[0]:
        m = get_compounds(item[0], 'name')
        if len(m) > 0:
            smile = m[0].isomeric_smiles
            ms = Chem.MolFromSmiles(smile)
            mol = pybel.readstring("smi", smile)
            if ms and mol:
                fp = rdmolops.RDKFingerprint(ms)
                tmp = fp.ToBitString()
                the_feature = list(map(int, tmp))
                the_feature.append(0)
                data.append(the_feature)
                drug_name_id.append(['neg_name', item[0]])
                num_valid_neg += 1
        num_neg += 1
    #a += 1

csvfile.close()
with open("combined_top_data.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)

with open("combine_drug_name_id.csv","w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(drug_name_id)

Visualize the proposed drug similarities:

In [5]:

import numpy as np

REF=open('../input/drugdata/combine_drug_name_id.csv','r')
DATA=open('../input/drugdata/combined_fp2_data.csv','r')

drug=[]
all_drug={}
for ref in REF:
    if ('pos' in ref):
        ref=ref.strip()
        rrr=ref.split(',')
        drug.append(rrr[1])

        data=DATA.readline()
        data=data.strip()
        data=data.split(',')
        kkk=0
        for i in data:
            data[kkk]=float(i)
            kkk+1
        all_drug[rrr[1]]=np.asarray(data).astype(np.float)

REF.close()
DATA.close()

connections1=[]
connections2=[]
for drug1 in drug:
    for drug2 in drug:
        if (drug1<drug2):
            cor=np.corrcoef(all_drug[drug1],all_drug[drug2])
            if (cor[0,1]>0.35):
                connections1.append(drug1)
                connections2.append(drug2)
                
import sys
import plotly.graph_objects as go
import networkx as nx

node_list=list(all_drug.keys())
G = nx.Graph()
for i in node_list:
    G.add_node(i)

i=0
for drug1 in connections1:
    drug2=connections2[i]
    G.add_edges_from([(drug1,drug2)])
    i=i+1



pos = nx.spring_layout(G, k=0.5, iterations=50)
for n, p in pos.items():
    G.nodes[n]['pos'] = p
    
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=1,color='#888'),
    hoverinfo='none',
    mode='lines')


for edge in G.edges():
    x0, y0 = G.nodes[edge[0]]['pos']
    x1, y1 = G.nodes[edge[1]]['pos']
    edge_trace['x'] += tuple([x0, x1, None])
    edge_trace['y'] += tuple([y0, y1, None])
    
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='RdBu',
        reversescale=True,
        color=[],
        size=15,
        colorbar=dict(
            thickness=5,
       #     title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=0)))

for node in G.nodes():
    x, y = G.nodes[node]['pos']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])

for node, adjacencies in enumerate(G.adjacency()):
    node_trace['marker']['color']+=tuple([len(adjacencies[1])])
   # node_info = adjacencies[0] +' # of connections: '+str(len(adjacencies[1]))
    node_info = adjacencies[0]
    node_trace['text']+=tuple([node_info])

fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='Similarity of chemical structures among the drugs that were proposed by in vivo virus protein binding',
                titlefont=dict(size=12),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=50,l=100,r=100,t=50),
                annotations=[ dict(
                   # text="No. of connections",
                    text="",
                    showarrow=False,
                    xref="paper", yref="paper") ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
fig.show()


[Go Top](#top)


###### C.1.3 Nested CV to prioritize drug candidates 
For each round, we randomly selected 80% of the example as training, and 20% as testing. The prediction scores for the test set are recorded in each round. We repeated this process for 20 times ensuring all examples occurred in the test set (100 experiments in total). Then the average of each example was taken as the final prediction score.

In [None]:
#!/usr/bin/env python
# coding: utf-8


import numpy as np
import csv
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.model_selection import KFold
from scipy import interp
import matplotlib.pyplot as plt
import lightgbm as lgb
import pickle



def readcsv(filename):
    csvfile = open(filename, encoding='utf-8')
    reader = csv.reader(csvfile)
    X = []
    Y = []
    num_all = 0
    num_positive = 0
    for item in reader:
        X.append(list(map(int, item[:-1])))
        Y.append(int(item[-1]))
        if Y[-1] == 1:
            num_positive += 1
        num_all += 1
    X = np.array(X)
    Y = np.array(Y)
    return X, Y, num_all, num_positive

def train_cv(train, test, X, Y):
    X_train, X_test = X[train], X[test]
    Y_train, Y_test = Y[train], Y[test]
    train_data = lgb.Dataset(X_train, label=Y_train)
    bst = lgb.train(param, train_data, num_round)
    prediction = bst.predict(X_test)
    return prediction, Y_test

X1, Y1, num_all, num_positive = readcsv("../input/drugdata/combined_top_data.csv")
X2, Y2, num_all, num_positive = readcsv("../input/drugdata/combined_maccs_data.csv")
X3, Y3, num_all, num_positive = readcsv("../input/drugdata/combined_morgan_data.csv")
X4, Y4, num_all, num_positive = readcsv("../input/drugdata/combined_fp2_data.csv")
X5, Y5, num_all, num_positive = readcsv("../input/drugdata/combined_fp3_data.csv")
X6, Y6, num_all, num_positive = readcsv("../input/drugdata/combined_fp4_data.csv")
X = np.hstack((X1, X2, X3, X4, X5, X6))
Y = Y1

print(len(X))
print(len(X[0]))

the_round=0
while (the_round<20):
    num_fold = 5
    kf = KFold(n_splits=num_fold, random_state=None, shuffle=True)
    param = {'num_leaves': 31, 'objective': 'binary'}
    param['metric'] = 'auc'
    num_round = 10

    AUC = []
    AUPRC = []
    AUC_score = []
    predictions = []
    test_sets = []
    num = 0
    for train, test in kf.split(X):
        #prediction1, Y_test1 = train_cv(train, test, X1, Y1)
        #prediction2, Y_test2 = train_cv(train, test, X2, Y2)
        X_train, X_test = X[train], X[test]
        Y_train, Y_test = Y[train], Y[test]
        train_data = lgb.Dataset(X_train, label=Y_train)
        bst = lgb.train(param, train_data, num_round)
        num += 1
        pickle.dump(bst, open('all_lightgbm_model-'+str(num)+'.'+str(the_round)+'.sav', 'wb'))
        with open("all_lightgbm_Xtest-"+str(num)+'.'+str(the_round)+".csv","w") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(X_test)
        with open("all_lightgbm_testid-"+str(num)+'.'+str(the_round)+".csv","w") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows([test])
        prediction = bst.predict(X_test)
        np.savetxt(('predictions.txt.'+str(num)+'.'+str(the_round)),prediction)
        predictions.append(prediction)
        test_sets.append(Y_test)
    the_round=the_round+1

[Go Top](#top)

Next we pulled all rounds of predictions together. 

In [None]:
import glob

all_dir=glob.glob("../input/drugdata/all_lightgbm_testid-*")

for the_dir in all_dir:
    FILE=open(the_dir,'r')
    line=FILE.readline()
    line=line.strip()
    sss=line.split(',')
    FILE.close()
    pred=the_dir
    pred=pred.replace('all_lightgbm_testid-','predictions.txt.')
    PRED=open(pred,'r')
    i=0
    ref={}
    count={}
    for line in PRED:
        line=line.strip()
        if lll[i] in ref:
            ref[lll[i]]=ref[lll[i]]+float(line)
            count[lll[i]]=count[lll[i]]+1
        else:
            ref[lll[i]]=float(line)
            count[lll[i]]=1


REF=open("../input/drugdata/combine_drug_name_id.csv",'r')
NEW=open('assembled_prediction.dat','w')
i=0;
for line in REF:
    line=line.strip()
    t=line.split(',')
    t[1]=t[1].replace(' ','_')
    NEW.write(t[0])
    NEW.write('\t')
    NEW.write(t[1])
    val=ref[i]/count[i];
    NEW.write('\t')
    NEW.write(str(val))
    NEW.write('\t')
    NEW.write(count[i])
    NEW.write('\n')
REF.close()
NEW.close()



[Go Top](#top)


#### C.2 Results

###### C.2.1 Top candidates in FDA approved drugs

Among the FDA approved drugs, we identified some top candidates that do not exist in the training gold standard. We hand-searched in literature for each of the top candidates with a probability >0.05 (55 in total). Most of them come from contaminations, i.e., overlapping with an example in the training set even though the drug appears with a different name. 

Cleaned-up list:

| Drug name | Original usage | Potential issues in the candidate|
| --- | --- | --- |
| OLUMIANT(Baricitinib) | Janus kinase (JAK) inhibitor ||
| MEKTOVI | Targeted therapy to treat BRAF V600E or V600K cancers | May come from bias in cancer targeted therapy/screening|
| BRIMONIDINE | Treating glaucoma ||
| CAPRELSA | kinase inhibitor, medullary thyroid cancer (MTC) | May come from bias in cancer targeted therapy/screening|
| EDURANT(rilpivirine)| Treating Human Immunodeficiency Virus-1 (HIV-1) ||
| MARPLAN | Treating depression | Some schizophrenia drugs are used in the protein interaction training set, and might result in an implicit contamination here|
| Corlanor (ivabradine) | reduces the spontaneous pacemaker activity of the cardiac sinus node||
| LORBRENA | kinase inhibitor, ALK mutant cancer| May come from bias in cancer targeted therapy/screening|
| BRAFTOVI | kinase inhibitor, Metastatic Melanoma| May come from bias in cancer targeted therapy/screening|
| TAVALISSE | kinase inhibitor indicated for the treatment of thrombocytopenia| May come from bias in cancer targeted therapy/screening|

<div id='PartClim'></div>

#### C.3 Limitations and biases in the finding

Drugs proposed by in vitro or computational protein targets/gene-gene network approaches are definitely biased towards targeted therapies in cancers, because these drugs were intensively screened in cell line experiments. This is true for both the above list and probably the original list proposed through the binding experiments, and certainly other studies.

Second, low scores only mean the drugs are not similar to others that are being investigated in the study, rather than they are not useful. Remdesivir had a high score of 0.09 (we are not sure if this is an implicit contamination from the training set), the others had low scores, including Vitamin C, hydroxychloroquine and favipiravir.


[Go Top](#top)


<div id='PartD'></div>
## Part D. Epitope study for vaccines

#### D.1 Methods

We identied all paragraphs that contain the word vaccine and COVID-19/SARS-COV-2. Then, we looked through each of the abstract. If deemed relevant, we go to the original paper and record down their methods and proposed epitopes




In [None]:
!ls /kaggle/input/CORD-19-research-challenge/
root_path = '/kaggle/input/CORD-19-research-challenge/'

!find ../input/CORD-19-research-challenge/*  -type f | xargs grep -i vaccine |grep -E -- 'COVID-19|COVID19|sars-cov-2' >vaccine_paragraphs


[Go Top](#top)


<div id='PartDcat1'></div>
#### D.2 Results

##### D.2.1 Subtyping major approaches in vaccine research

###### D.2.1.1 Homology-based approach

**Example 1.** 181b7b57851e6f58a601b68e613d10c10616f774.json. used conserved sequence with SARS-COV (2003 version of SARS), which already have experimentally validated antigenic sequences.

**Example 2.** a2a6e262098539eb875a26800d9f6d3d0d5d1875.json. tested an epitope of Ebola in mouse, and suggested that this epitope is conserved in COVID-19.

**Example 3** 74b00f19c3af87d1081644f02490ba250f57b7ca.json used conserved sequences between COVID-19 and human Coronavirus (HCov-HKU1) to identify epitope.

<div id='PartDcat2'></div>
###### D.2.1.1 Immunoinformatics

Docking/molecular dynamics/protein structures and immunoinformatics such as antigenticity

**Example 1** 73c8af41cfdbf52c0dfba37727e3b94cb56b495e.json used antigenicity Prediction, Docking simulation structural prediction

**Example 2** b38ed62b303eaa444d188deb2ab0b23bbdb79211.json used structure prediction.

Background: Many studies use a combination of the above approaches. There are numerous existing software for epitope prediction, including whether the epitope is on the surface, their docking score and MHC classes. However, sequence similarity between COVID-19 and SARS-COV (the 2003 version ~82), Ebola (~40%) and Human Coronavirus (65-70% depending on the exact strain) is very limited. It should be taken critically that an exactly conserved epitope (~10 amino acids) indicates same effectiveness across these species.

<div id='PartDepi'></div>
##### D.2.2 Compiled list of epitopes across the above publications
|Epitope|Protein|T/B cell|MHC class|
| --- | --- | --- | --- |
|ILLNKHID|N|T cell|I|
|AFFGMSRIGMEVTPSGTW|N|T cell|NA|
|MEVTPSGTWL|N|T cell|I|
|GMSRIGMEV|N|T cell|I|
|ILLNKHIDA|N|T cell|I|
|ALNTPKDHI|N|T cell|I|
|IRQGTDYKHWPQIAQFA|N|T cell|NA|
|KHWPQIAQFAPSASAFF|N|T cell|NA|
|LALLLLDRL|N|T cell|I|
|LLLDRLNQL|N|T cell|I|
|LLNKHIDAYKTFPPTEPK|N|T cell|NA|
|LQLPQGTTL|N|T cell|I|
|AQFAPSASAFFGMSR|N|T cell|II|
|AQFAPSASAFFGMSRIGM|N|T cell|NA|
|RRPQGLPNNTASWFT|N|T cell|I|
|YKTFPPTEPKKDKKKK|N|T cell|NA|
|GAALQIPFAMQMAYRF|S|T cell|II|
|MAYRFNGIGVTQNVLY|S|T cell|II|
|QLIRAAEIRASANLAATK|S|T cell|II|
|FIAGLIAIV|S|T cell|I|
|ALNTLVKQL|S|T cell|I|
|LITGRLQSL|S|T cell|I|
|NLNESLIDL|S|T cell|I|
|QALNTLVKQLSSNFGAI|S|T cell|II|
|RLNEVAKNL|S|T cell|I|
|VLNDILSRL|S|T cell|I|
|VVFLHVTYV|S|T cell|I|
|DVVNQNAQALNTLVKQL|S|B cell||
|EAEVQIDRLITGRLQSL|S|B cell|
|EIDRLNEVAKNLNESLIDLQELGKYEQY|S|B cell|
|EVAKNLNESLIDLQELG|S|B cell|
|GAALQIPFAMQMAYRFN|S|B cell|
|GAGICASY|S|B cell|
|AISSVLNDILSRLDKVE|S|B cell|
|GSFCTQLN|S|B cell|
|ILSRLDKVEAEVQIDRL|S|B cell|
|KGIYQTSN|S|B cell|
|AMQMAYRF|S|B cell|
|KNHTSPDVDLGDISGIN|S|B cell|
|MAYRFNGIGVTQNVLYE|S|B cell|
|AATKMSECVLGQSKRVD|S|B cell|
|PFAMQMAYRFNGIGVTQ|S|B cell|
|QALNTLVKQLSSNFGAI|S|B cell|
|QLIRAAEIRASANLAAT|S|B cell|
|QQFGRD|S|B cell|
|RASANLAATKMSECVLG|S|B cell|
|RLITGRLQSLQTYVTQQ|S|B cell|
|EIDRLNEVAKNLNESLIDLQELGKYEQY|S|B cell|
|SLQTYVTQQLIRAAEIR|S|B cell|
|DLGDISGINASVVNIQK|S|B cell|
|FFGMSRIGMEVTPSGTW|N|B cell|
|GLPNNTASWFTALTQHGK|N|B cell|
|GTTLPK|N|B cell|
|IRQGTDYKHWPQIAQFA|N|B cell|
|KHIDAYKTFPPTEPKKDKKK|N|B cell|
|KHWPQIAQFAPSASAFF|N|B cell|
|YNVTQAFGRRGPEQTQGNF|N|B cell|
|KTFPPTEPKKDKKKK|N|B cell|
|LLPAAD|N|B cell|
|LNKHIDAYKTFPPTEPK|N|B cell|
|LPQGTTLPKG|N|B cell|
|LPQRQKKQ|N|B cell|
|PKGFYAEGSRGGSQASSR|N|B cell|
|QFAPSASAFFGMSRIGM|N|B cell|
|QGTDYKHW|N|B cell|
|QLPQGTTLPKGFYAE|N|B cell|
|QLPQGTTLPKGFYAEGSR|N|B cell|
|QLPQGTTLPKGFYAEGSRGGSQ|N|B cell|
|TFPPTEPK|N|B cell|
|RRPQGLPNNTASWFT|N|B cell|
|SQASSRSS|N|B cell|
|SRGGSQASSRSSSRSR|N|B cell|
|AGLPYGANK|N|T cell|
|AADLDDFSK|N|T cell|
|QLESKMSGK|N|T cell|
|QELIRQGTDYKH|N|T cell|
|LIRQGTDYKHWP|N|T cell|
|RLNQLESKMSGK|N|T cell|
|LNQLESKMSGKG|N|T cell|
|LDRLNQLESKMS|N|T cell|
|SVLNDILSR|S|T cell|
|GVLTESNKK|S|T cell|
|RLFRKSNLK|S|T cell|
|QIAPGQTGK|S|T cell|
|TSNFRVQPTESI|S|T cell|
|SNFRVQPTESIV|S|T cell|
|LLIVNNATNVVI|S|T cell|
|MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTAS|N|B cell|
|RIRGGDGKMKDL|N|B cell|
|TGPEAGLPYGANK|N|B cell|
|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD|N|B cell|
|SKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYN|N|B cell|
|KTFPPTEPKKDKKKKADETQALPQRQKKQQ|N|B cell|
|LTPGDSSSGWTAG|S|B cell|
|VRQIAPGQTGKIAD|S|B cell|
|YQAGSTPCNGV|S|B cell|
|QTQTNSPRRARSV|S|B cell|
|VYQVNNLEEIC|
|SMATYYLFDESGEFK|orf1ab|
|MATYYLFDESGEFKL|orf1ab|
|ATYYLFDESGEFKLA|orf1ab|
|DSATLVSDIDITFLK|orf1ab|
|SNPTTFHLDGEVITF|orf1ab|
|NPTTFHLDGEVITFD|orf1ab|
|PTTFHLDGEVITFDN|orf1ab|
|DGEVITFDNLKTLLS|orf1ab|
|EVRTIKVFTTVDNIN|orf1ab|
|VRTIKVFTTVDNINL|orf1ab|
|RTIKVFTTVDNINLH|orf1ab|
|HEGKTFYVLPNDDTL|orf1ab|
|EGKTFYVLPNDDTLR|orf1ab|
|GKTFYVLPNDDTLRV|orf1ab|
|KTFYVLPNDDTLRVE|orf1ab|
|DLMAAYVDNSSLTIK|orf1ab|
|LMAAYVDNSSLTIKK|orf1ab|
|MAAYVDNSSLTIKKP|orf1ab|
|AAYVDNSSLTIKKPN|orf1ab|
|YREGYLNSTNVTIAT|orf1ab|
|REGYLNSTNVTIATY|orf1ab|
|IINLVQMAPISAMVR|orf1ab|
|VAAIFYLITPVHVMS|orf1ab|
|AAIFYLITPVHVMSK|orf1ab|
|PDTRYVLMDGSIIQF|orf1ab|
|DTRYVLMDGSIIQFP|orf1ab|
|TRYVLMDGSIIQFPN|orf1ab|
|RLTKYTMADLVYALR|orf1ab|
|TMADLVYALRHFDEG|orf1ab|
|TKRNVIPTITQMNLK|orf1ab|
|YEAMYTPHTVLQAVG|orf1ab|
|YDHVISTSHKLVLSV|orf1ab|
|SQSIIAYTMSLGAEN|S|
|SNNSIAIPTNFTISV|S|
|AIPTNFTISVTTEIL|S|
|IPTNFTISVTTEILP|S|
|PTNFTISVTTEILPV|S|
|TNFTISVTTEILPVS|S|
|VKPSFYVYSRVKNLN|E|
|KPSFYVYSRVKNLNS|E|
|PSFYVYSRVKNLNSS|E|
|ATKAYNVTQAFGRRG|N|
|KAYNVTQAFGRRGPE|N|
|YTGAIKLDDKDPNFK|N|

##### D.2.3 Compiling list of unique epitopes by sequence overlapping
Now we consolidate the epitopes from various publications by sequence overlap. If one epitope is a subsequence of another, or if one overlaps to the other by more than 5 consequtive amino acides, and only the outside flanking ragion is not overlapping, the two epitodes would be considered in the same group.


[Go Top](#top)

In [None]:
import numpy as np


all_epi={}
FILE=open('../input/drugdata/all_epitope','r')
for line in FILE:
    line=line.strip()
    all_epi[line]=1

the_map={}
for epi1 in all_epi.keys():
    for epi2 in all_epi.keys():
## now we calculate the maximal similarity score:
        if (len(epi1)<len(epi2)):
            if (epi1 in epi2):
                print (epi1,' is a subsequence' ,epi2)
                if (epi1 in the_map):
                    the_map[epi1]=the_map[epi1]+'\t'+epi2
                else:
                    the_map[epi1]=epi2
                if (epi2 in the_map):
                    the_map[epi2]=the_map[epi2]+'\t'+epi1
                else:
                    the_map[epi2]=epi1
            cut=1
            while (cut<5):
                epi1_cut=epi1[cut:]
                if (len(epi1_cut)>5):
                    if (epi1_cut==epi2[0:len(epi1_cut)]):
                        print (epi1,' is almost subsequence by ' ,cut,' to ',epi2)
                        if (epi1 in the_map):
                            the_map[epi1]=the_map[epi1]+'\t'+epi2
                        else:
                            the_map[epi1]=epi2
                        if (epi2 in the_map):
                            the_map[epi2]=the_map[epi2]+'\t'+epi1
                        else:
                            the_map[epi2]=epi1

                epi1_cut=epi1[:-cut]
                if (len(epi1_cut)>5):
                    if (epi1_cut == epi2[(len(epi2)-len(epi1_cut)):]):
                        print (epi1,' is almost subsequence by ' ,cut,' to ',epi2)
                        if (epi1 in the_map):
                            the_map[epi1]=the_map[epi1]+'\t'+epi2
                        else:
                            the_map[epi1]=epi2
                        if (epi2 in the_map):
                            the_map[epi2]=the_map[epi2]+'\t'+epi1
                        else:
                            the_map[epi2]=epi1

                cut=cut+1

group={}
for epi in all_epi.keys():
    if (epi in the_map):
        the_group=the_map[epi].split('\t')
        uniq_group={}
        for ggg in the_group:
            uniq_group[ggg]=1
        uniq_group[epi]=1
        the_member=[]
        for k in sorted(uniq_group, key=len, reverse=False):
            the_member.append(k)
        string=the_member.pop(0)
        k=1
        for mmm in the_member:
            string=string+'|'
            string=string+mmm
            k=k+1
        while(k<9):
            string=string+'|'
            k=k+1

        group[string]=1
    else:

        string=epi
        k=1
        while(k<9):
            string=string+'|'
            k=k+1
        group[string]=1

g_i=1
for uniq_group in sorted(group.keys()):
    string='|'+str(g_i)+'|'+uniq_group+'|'
    print(string)
    g_i=g_i+1


This is the unique epitope groups that have so far been published:

<div id='PartDuniq'></div>

|Group|Epitope 1|Epitope 2|Epitope 3|Epitope 4|Epitope 5|Epitope 6|Epitope 7|Epitope 8|Epitope 9|
|---|---|---|---|---|---|---|---|---|---|
|1|AADLDDFSK|||||||||
|2|AAIFYLITPVHVMSK|||||||||
|3|AATKMSECVLGQSKRVD|||||||||
|4|AAYVDNSSLTIKKPN|||||||||
|5|AGLPYGANK|TGPEAGLPYGANK||||||||
|6|AIPTNFTISVTTEIL|||||||||
|7|ALNTLVKQL|DVVNQNAQALNTLVKQL||||||||
|8|ALNTLVKQL|QALNTLVKQLSSNFGAI|DVVNQNAQALNTLVKQL|||||||
|9|ALNTLVKQL|QALNTLVKQLSSNFGAI||||||||
|10|ALNTPKDHI|||||||||
|11|AMQMAYRF|GAALQIPFAMQMAYRF|GAALQIPFAMQMAYRFN|PFAMQMAYRFNGIGVTQ||||||
|12|AMQMAYRF|GAALQIPFAMQMAYRF|GAALQIPFAMQMAYRFN|||||||
|13|AMQMAYRF|MAYRFNGIGVTQNVLY|PFAMQMAYRFNGIGVTQ|||||||
|14|AQFAPSASAFFGMSR|KHWPQIAQFAPSASAFF|QFAPSASAFFGMSRIGM|AQFAPSASAFFGMSRIGM||||||
|15|AQFAPSASAFFGMSR|KHWPQIAQFAPSASAFF||||||||
|16|ATKAYNVTQAFGRRG|KAYNVTQAFGRRGPE|YNVTQAFGRRGPEQTQGNF|||||||
|17|ATKAYNVTQAFGRRG|YNVTQAFGRRGPEQTQGNF||||||||
|18|ATYYLFDESGEFKLA|||||||||
|19|DGEVITFDNLKTLLS|||||||||
|20|DLGDISGINASVVNIQK|||||||||
|21|DLMAAYVDNSSLTIK|||||||||
|22|DSATLVSDIDITFLK|||||||||
|23|DTRYVLMDGSIIQFP|||||||||
|24|EGKTFYVLPNDDTLR|||||||||
|25|EVRTIKVFTTVDNIN|||||||||
|26|FIAGLIAIV|||||||||
|27|GAGICASY|||||||||
|28|GKTFYVLPNDDTLRV|||||||||
|29|GMSRIGMEV|AQFAPSASAFFGMSR|QFAPSASAFFGMSRIGM|AQFAPSASAFFGMSRIGM||||||
|30|GMSRIGMEV|FFGMSRIGMEVTPSGTW|QFAPSASAFFGMSRIGM|AFFGMSRIGMEVTPSGTW|AQFAPSASAFFGMSRIGM|||||
|31|GMSRIGMEV|MEVTPSGTWL|FFGMSRIGMEVTPSGTW|AFFGMSRIGMEVTPSGTW||||||
|32|GSFCTQLN|||||||||
|33|GTTLPK|LPQGTTLPKG|QLPQGTTLPKGFYAE|QLPQGTTLPKGFYAEGSR|QLPQGTTLPKGFYAEGSRGGSQ|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD||||
|34|GTTLPK|LQLPQGTTL|LPQGTTLPKG|QLPQGTTLPKGFYAE|PKGFYAEGSRGGSQASSR|QLPQGTTLPKGFYAEGSR|QLPQGTTLPKGFYAEGSRGGSQ|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD||
|35|GTTLPK|LQLPQGTTL|LPQGTTLPKG|QLPQGTTLPKGFYAE|QLPQGTTLPKGFYAEGSR|QLPQGTTLPKGFYAEGSRGGSQ|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD|||
|36|GTTLPK|SQASSRSS|LPQGTTLPKG|QLPQGTTLPKGFYAE|SRGGSQASSRSSSRSR|PKGFYAEGSRGGSQASSR|QLPQGTTLPKGFYAEGSR|QLPQGTTLPKGFYAEGSRGGSQ|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD|
|37|GVLTESNKK|||||||||
|38|HEGKTFYVLPNDDTL|||||||||
|39|IINLVQMAPISAMVR|||||||||
|40|ILLNKHID|ILLNKHIDA|LNKHIDAYKTFPPTEPK|LLNKHIDAYKTFPPTEPK||||||
|41|ILLNKHID|TFPPTEPK|ILLNKHIDA|LNKHIDAYKTFPPTEPK|LLNKHIDAYKTFPPTEPK|KHIDAYKTFPPTEPKKDKKK||||
|42|ILSRLDKVEAEVQIDRL|||||||||
|43|IPTNFTISVTTEILP|||||||||
|44|KAYNVTQAFGRRGPE|YNVTQAFGRRGPEQTQGNF||||||||
|45|KGIYQTSN|||||||||
|46|KNHTSPDVDLGDISGIN|||||||||
|47|KPSFYVYSRVKNLNS|||||||||
|48|KTFYVLPNDDTLRVE|||||||||
|49|LALLLLDRL|||||||||
|50|LITGRLQSL|EAEVQIDRLITGRLQSL|RLITGRLQSLQTYVTQQ|||||||
|51|LITGRLQSL|EAEVQIDRLITGRLQSL||||||||
|52|LITGRLQSL|RLITGRLQSLQTYVTQQ||||||||
|53|LLIVNNATNVVI|||||||||
|54|LLLDRLNQL|LDRLNQLESKMS||||||||
|55|LLLDRLNQL|QLESKMSGK|LDRLNQLESKMS|||||||
|56|LLPAAD|||||||||
|57|LMAAYVDNSSLTIKK|||||||||
|58|LPQRQKKQ|KTFPPTEPKKDKKKKADETQALPQRQKKQQ||||||||
|59|LPQRQKKQ|TFPPTEPK|KTFPPTEPKKDKKKK|YKTFPPTEPKKDKKKK|KTFPPTEPKKDKKKKADETQALPQRQKKQQ|||||
|60|LQLPQGTTL|LPQGTTLPKG|QLPQGTTLPKGFYAE|QLPQGTTLPKGFYAEGSR|QLPQGTTLPKGFYAEGSRGGSQ|||||
|61|LTPGDSSSGWTAG|||||||||
|62|MAAYVDNSSLTIKKP|||||||||
|63|MATYYLFDESGEFKL|||||||||
|64|MAYRFNGIGVTQNVLY|MAYRFNGIGVTQNVLYE|PFAMQMAYRFNGIGVTQ|||||||
|65|MAYRFNGIGVTQNVLY|MAYRFNGIGVTQNVLYE||||||||
|66|MEVTPSGTWL|FFGMSRIGMEVTPSGTW|AFFGMSRIGMEVTPSGTW|||||||
|67|NLNESLIDL|EVAKNLNESLIDLQELG|EIDRLNEVAKNLNESLIDLQELGKYEQY|||||||
|68|NLNESLIDL|RLNEVAKNL|EVAKNLNESLIDLQELG|EIDRLNEVAKNLNESLIDLQELGKYEQY||||||
|69|NPTTFHLDGEVITFD|||||||||
|70|PDTRYVLMDGSIIQF|||||||||
|71|PSFYVYSRVKNLNSS|||||||||
|72|PTNFTISVTTEILPV|||||||||
|73|PTTFHLDGEVITFDN|||||||||
|74|QGTDYKHW|LIRQGTDYKHWP|IRQGTDYKHWPQIAQFA|||||||
|75|QGTDYKHW|QELIRQGTDYKH|IRQGTDYKHWPQIAQFA|||||||
|76|QGTDYKHW|QELIRQGTDYKH|LIRQGTDYKHWP|IRQGTDYKHWPQIAQFA||||||
|77|QIAPGQTGK|VRQIAPGQTGKIAD||||||||
|78|QLESKMSGK|LNQLESKMSGKG||||||||
|79|QLESKMSGK|RLNQLESKMSGK|LNQLESKMSGKG|LDRLNQLESKMS|SKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYN|||||
|80|QLESKMSGK|RLNQLESKMSGK||||||||
|81|QLESKMSGK|SKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYN||||||||
|82|QLIRAAEIRASANLAAT|QLIRAAEIRASANLAATK||||||||
|83|QQFGRD|||||||||
|84|QTQTNSPRRARSV|||||||||
|85|RASANLAATKMSECVLG|||||||||
|86|REGYLNSTNVTIATY|||||||||
|87|RIRGGDGKMKDL|||||||||
|88|RLFRKSNLK|||||||||
|89|RLNEVAKNL|EVAKNLNESLIDLQELG|EIDRLNEVAKNLNESLIDLQELGKYEQY|||||||
|90|RLTKYTMADLVYALR|||||||||
|91|RRPQGLPNNTASWFT|GLPNNTASWFTALTQHGK|MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTAS|||||||
|92|RRPQGLPNNTASWFT|GLPNNTASWFTALTQHGK||||||||
|93|RRPQGLPNNTASWFT|MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTAS||||||||
|94|RTIKVFTTVDNINLH|||||||||
|95|SLQTYVTQQLIRAAEIR|||||||||
|96|SMATYYLFDESGEFK|||||||||
|97|SNFRVQPTESIV|||||||||
|98|SNNSIAIPTNFTISV|||||||||
|99|SNPTTFHLDGEVITF|||||||||
|100|SQASSRSS|PKGFYAEGSRGGSQASSR|QLPQGTTLPKGFYAEGSRGGSQ|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD||||||
|101|SQASSRSS|SRGGSQASSRSSSRSR|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD|||||||
|102|SQASSRSS|SRGGSQASSRSSSRSR|PKGFYAEGSRGGSQASSR|GTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGD||||||
|103|SQSIIAYTMSLGAEN|||||||||
|104|SVLNDILSR|AISSVLNDILSRLDKVE||||||||
|105|TFPPTEPK|KTFPPTEPKKDKKKK|YKTFPPTEPKKDKKKK|KHIDAYKTFPPTEPKKDKKK|KTFPPTEPKKDKKKKADETQALPQRQKKQQ|||||
|106|TFPPTEPK|KTFPPTEPKKDKKKK|YKTFPPTEPKKDKKKK|LNKHIDAYKTFPPTEPK|LLNKHIDAYKTFPPTEPK|KHIDAYKTFPPTEPKKDKKK|KTFPPTEPKKDKKKKADETQALPQRQKKQQ|||
|107|TFPPTEPK|KTFPPTEPKKDKKKK|YKTFPPTEPKKDKKKK|LNKHIDAYKTFPPTEPK|LLNKHIDAYKTFPPTEPK|KHIDAYKTFPPTEPKKDKKK||||
|108|TKRNVIPTITQMNLK|||||||||
|109|TMADLVYALRHFDEG|||||||||
|110|TNFTISVTTEILPVS|||||||||
|111|TRYVLMDGSIIQFPN|||||||||
|112|TSNFRVQPTESI|||||||||
|113|VAAIFYLITPVHVMS|||||||||
|114|VKPSFYVYSRVKNLN|||||||||
|115|VLNDILSRL|AISSVLNDILSRLDKVE||||||||
|116|VLNDILSRL|SVLNDILSR|AISSVLNDILSRLDKVE|||||||
|117|VRTIKVFTTVDNINL|||||||||
|118|VVFLHVTYV|||||||||
|119|VYQVNNLEEIC|||||||||
|120|YDHVISTSHKLVLSV|||||||||
|121|YEAMYTPHTVLQAVG|||||||||
|122|YQAGSTPCNGV|||||||||
|123|YREGYLNSTNVTIAT|||||||||
|124|YTGAIKLDDKDPNFK|||||||||


[Go Top](#top)

## Acknowledgement

I thank Kaggle organizing team for trouble shooting, my research assistant Jiantao Guo for sharing the feature extraction code and my husband Wei Dong for useful comments. 

This work is supported by NIGMS R35-GM133346 'Machine Learning for Drug Response Prediction' 
