# Aim
Oran: It occurred to me that I don't want this web scraping business to appear too mysterious, so I will outline the steps with lots of comments below. As ever, I will do this with a view towards expanding the example. The basic steps I want to achieve are:

1. Access the html page containing the full licensed drug list on the [British National Formulary site][bnf].
2. Make a list of all the drugs and the url of the page each one links to, leaving all other information for now.
3. Generate a dataframe which reformats the drug name, holds the url, and finds the 3-7 letter suffixes in the drug name, as suggested by Alfredo.

[bnf]: https://bnf.nice.org.uk/drug/
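
For step 1, the page has to be requested with a browser-like `User-Agent`. The first code cell does this by subclassing `FancyURLopener`, which its own docstring flags as deprecated. A minimal sketch of the same idea with `urllib.request.Request` could look like this. The helper names `build_request` and `fetch_html` are made up for illustration, and nothing is fetched until `fetch_html` is actually called.

```python
import urllib.error
import urllib.request

BNF_URL = 'https://bnf.nice.org.uk/drug/'  # same address as the [bnf] link above

def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request that declares a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

def fetch_html(url, timeout=10):
    """Return the page as a str, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as error:  # HTTPError (4xx/5xx) is a subclass of URLError
        print(f'Connection error: {error}')
        return None

request = build_request(BNF_URL)
print(request.get_header('User-agent'))  # urllib stores header names capitalised like this
```

Calling `fetch_html(BNF_URL)` would then play the role of `get_html(url)`, assuming the site still answers at that address.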

In [1]:
#Code in this cell was written by Oran and webscrapes the BNF site
# Imports and preliminaries:

from uonsol.gng import * # This is how you get the libraries in your local copy of the repo!
import urllib.request # get raw html as data (the request submodule has to be imported explicitly).
import re # python's regular expression implementation
import pandas as pd # dataframes and series

class AppURLopener(urllib.request.FancyURLopener): # Make a variant of the urllib.request.FancyURLopener class
    """Handles URL requests. Currently a deprecated method, replace functionality in the future."""
    version = "Mozilla/5.0" # Change the version of browser declared to evade detection. 

def get_html(url):
    """str: url -> str, None
    Queries url. If query successful, returns html as string object. Otherwise returns None"""
    opener = AppURLopener() # makes an instance of our FancyURLOpener object
    response = opener.open(url) # now open the url
    if response.code == 200: # if the request was granted
        html = response.file.read().decode('utf-8') # take the response's html file, read and decode the bytes
        return html
    else:
        print(f'Connection error: code:{response.code}')
        return None
    
url = 'https://bnf.nice.org.uk/drug/'
html = get_html(url) # See the html
html


'<!DOCTYPE html>\n<html lang="en-gb">\n<head>\n  \n  <meta charset="utf-8" />\n  <meta content="IE=edge,chrome=1" name="X-UA-Compatible" />\n  <meta content="width=device-width, initial-scale=1, user-scalable=yes" name="viewport" />\n  \n  <!-- meta tags -->\n  <meta content="NICE" name="DC.Publisher" />\n  <meta content="All content on this site is NICE copyright unless otherwise stated. You can download material for private research, study or in-house use only. Do not distribute or publish any material from this site without first obtaining NICE\'s permission. Where Crown copyright applies, see the Office of Public Sector Information (formerly HMSO) website for information." name="DC.Rights.Copyright" />\n  <meta content="eng" name="DC.Language" scheme="DCTERMS.ISO639-2T" />\n  <meta content="Health, well-being and care" name="DC.Subject" scheme="eGMS.IPSV" />\n  <meta content="Double-A" name="eGMS.accessibility" scheme="eGMS.WCAG10" />\n  <meta content="2014-08-05" name="DC.Issued" 

Oran:
In the html above, you can see that we are looking for entries like this:

- `<li><a href="abacavir.html"><span>ABACAVIR</span></a></li>`
- `<li><a href="abacavir-with-dolutegravir-and-lamivudine.html"><span>ABACAVIR WITH DOLUTEGRAVIR AND LAMIVUDINE</span></a></li>`

and we want to turn them into a dictionary in this format:

- `{'drug_name': ['ABACAVIR', 'ABACAVIR WITH DOLUTEGRAVIR AND LAMIVUDINE'], 'drug_url': [f'{url}abacavir.html', f'{url}abacavir-with-dolutegravir-and-lamivudine.html']}`

Luckily, the indentation of each drug name is identical in the html, which makes the regex job quite easy. We need to find all instances of `\n            <li><a href="drug.html"><span>DRUG</span>` where exactly 12 spaces follow the `\n` (the compiled pattern below is looser and accepts 8 to 20 spaces). When we find a match, we need to tell python to keep looking from the point in the html just after the match, until there are no matches left. This can be achieved using a while loop, which keeps running for as long as its condition is true.
I should also add that python and regex escape characters, especially `\`, overlap, making some basic patterns a little difficult to get right. On the other hand, python's regex lets you be very explicit about splitting what you find into capturing groups. The code is explained below.
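
To make the point about overlapping escapes and capturing groups concrete, here is a small self-contained sketch. The one-line `pattern` is a simplified stand-in for the fuller regex used in the next cell, not a copy of it.

```python
import re

# In an ordinary string '\n' is one character (a newline); in a raw string it is two.
print(len('\n'), len(r'\n'))  # 1 2

# Without the r prefix, every regex backslash has to be doubled.
assert '\\w+' == r'\w+'

# Named groups: each (?P<label>...) can be pulled out of the match by its label.
pattern = re.compile(r'<a\shref="(?P<page>[\w-]+\.html)"><span>(?P<name>[\w ]+)</span>')
match = pattern.search('<li><a href="abacavir.html"><span>ABACAVIR</span></a></li>')
print(match.group('page'), match.group('name'))  # abacavir.html ABACAVIR
print(match.groupdict())
```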



In [2]:
# Code written by Oran

chars_in_html = len(html) # get number of chars in html


re_prog = re.compile(r"""                     # raw string, so backslashes reach the regex engine unchanged
            (?P<pre>\n[ ]{8,20}<li><a\shref=") # match newline + 8 to 20 spaces + html tags and group as <pre>
            (?P<page>[\w-]+?\.html)           # match only alphanumerics, _ or - until .html: group as <page>
            (?P<btwn>"><span>)                # match the quotes, closing tag, then open <span> tag.
            (?P<name>[\w ]+?)                 # match arbitrary alphanumerics with spaces: group as <name>
            (?P<end></span>)                  # match the close of the span group.
            """, 
            re.X) # verbose mode: pattern whitespace is ignored and these comments are allowed

# more on python regex: https://docs.python.org/3/library/re.html

# set up loop
searching = True
urls = [] # generate lists of strings to slot into the dictionary. Lists will preserve order.
names = []
start_position = 0 # start trying to match from the beginning of the html
while searching: # searching is True until a search yields nothing, which ends the loop.
    match = re_prog.search(html, start_position) # search html for the compiled regex from start position
    if match is None:
        searching = False # stop searching
    else: # if there is a match, a match object is returned.
        names.append(match.group('name')) # Append the all-caps name from the html to a list each time.
        urls.append(f'{url}{match.group("page")}') # Prepend the original url to the relative page, then append it.
        start_position = match.end() # match.end() is the end of the pattern. Search from there next time.
        
names_and_urls = {'drug_name':names, 'drug_url':urls} # put the two in an equivalently indexed dictionary.
# df = pd.DataFrame(names_and_urls) # generate a dataframe from the data, with the dict keys as columns

prettier_names = [name.title() for name in names] # list comprehension reformatting all caps to title case
drug_dict = dict(zip(prettier_names, urls))
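
As an aside, the search-and-advance bookkeeping of the while loop can also be left to `re.finditer`, which yields each non-overlapping match in turn. The sketch below runs on a two-entry `sample_html` string built from the examples quoted earlier rather than on the live page, and the `sample_` names are only there to avoid clobbering the notebook's own variables.

```python
import re

indent = ' ' * 12  # the indentation described in the text above
sample_html = (
    f'\n{indent}<li><a href="abacavir.html"><span>ABACAVIR</span></a></li>'
    f'\n{indent}<li><a href="abacavir-with-dolutegravir-and-lamivudine.html">'
    '<span>ABACAVIR WITH DOLUTEGRAVIR AND LAMIVUDINE</span></a></li>\n'
)
base_url = 'https://bnf.nice.org.uk/drug/'

entry = re.compile(r"""
    \n[ ]{8,20}<li><a\shref="    # newline, indentation, opening tags
    (?P<page>[\w-]+?\.html)      # relative page, e.g. abacavir.html
    "><span>
    (?P<name>[\w ]+?)            # all-caps drug name
    </span>
    """, re.X)

sample_matches = list(entry.finditer(sample_html))
sample_names = [m.group('name').title() for m in sample_matches]
sample_urls = [base_url + m.group('page') for m in sample_matches]
sample_dict = dict(zip(sample_names, sample_urls))
print(sample_dict)
```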

In [3]:
# code written by Alfredo to find 'families' of miscellaneous medicines that do not fall into the 'small drug-like' category
EXCLUSION_WORDS = ['isophane', 'with', 'insulin', 'alfa', 'alpha', 'trophin', 'tropin', 'reotide', 'vaccine', 'vaccines', 'oil', 'venom', 'cellulose', 'immunoglobulin', 'hormone', 'antitoxin', 'vitamins', 'extract', 'mab']
def exclude_non_drugs_dict(drugs_dict):
    """
    returns a new dict without the drugs whose names contain any of the EXCLUSION_WORDS
    """
    culled_drug_dict = {}
    for drug, url in drugs_dict.items():
        ok = True
        for word in EXCLUSION_WORDS:
            if word in drug.lower():
                ok = False
                break
        if ok:
            culled_drug_dict[drug]=url
    return culled_drug_dict         

drug_dict = dict(zip(prettier_names, urls))
culled_drug_dict = exclude_non_drugs_dict(drug_dict)

df = pd.DataFrame(culled_drug_dict.items())
df.columns =['first_word', 'drug_url']
df['drug_name_as_title'] = df['first_word'].values


Unnamed: 0,first_word,drug_url,drug_name_as_title
0,Abacavir,https://bnf.nice.org.uk/drug/abacavir.html,Abacavir
1,Abatacept,https://bnf.nice.org.uk/drug/abatacept.html,Abatacept
2,Abemaciclib,https://bnf.nice.org.uk/drug/abemaciclib.html,Abemaciclib
3,Abiraterone Acetate,https://bnf.nice.org.uk/drug/abiraterone-aceta...,Abiraterone Acetate
4,Acamprosate Calcium,https://bnf.nice.org.uk/drug/acamprosate-calci...,Acamprosate Calcium
...,...,...,...
1147,Zonisamide,https://bnf.nice.org.uk/drug/zonisamide.html,Zonisamide
1148,Zopiclone,https://bnf.nice.org.uk/drug/zopiclone.html,Zopiclone
1149,Zuclopenthixol,https://bnf.nice.org.uk/drug/zuclopenthixol.html,Zuclopenthixol
1150,Zuclopenthixol Acetate,https://bnf.nice.org.uk/drug/zuclopenthixol-ac...,Zuclopenthixol Acetate
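
One thing worth keeping in mind about the exclusion step is that `word in drug.lower()` is a substring test, so a fragment such as `'alfa'` also removes names that merely contain it. A small sketch of that behaviour follows. The three-word list, the `is_excluded` helper and the sample entries are illustrative only and are not taken from the scraped data.

```python
EXAMPLE_EXCLUSIONS = ['vaccine', 'alfa', 'oil']  # a short, illustrative subset

def is_excluded(name, words=EXAMPLE_EXCLUSIONS):
    """True if any exclusion fragment occurs anywhere in the lower-cased name."""
    lowered = name.lower()
    return any(word in lowered for word in words)

sample = {
    'Abacavir': 'abacavir.html',
    'Rabies Vaccine': 'rabies-vaccine.html',
    'Alfacalcidol': 'alfacalcidol.html',  # contains 'alfa', so it is dropped too
}
kept = {name: page for name, page in sample.items() if not is_excluded(name)}
print(kept)  # {'Abacavir': 'abacavir.html'}
```

If whole-word matching were wanted instead, comparing against `name.lower().split()` would be one option, at the cost of no longer catching fragments such as `'mab'`.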


Now we want to add columns to the dataframe which will let us carry out a few of Alfredo's suggestions.

1. Firstly, a set of columns which capture suffixes of arbitrary length
2. Capture and isolate special cases (e.g. salts?)
3. Group the dataframe by these

Looking at the set above, we want to count the members of each group in an interactive way, as below. We might also want to hold out some first name examples. A comprehensive list of metals might be an obvious example.

In [4]:
# Special cases that need removing before creating the df. 
# These special cases are more specific than what the function defined in the cell before has removed (Less likely to be 'families' of anomalies)
# Code in this cell was written by Oran and special cases were added by Alfredo after manually going through the JSON file to find non small drug like medicines that were not captured by the fn
special_cases = {
    'metals': ['Lithium', 'Beryllium', 'Sodium', 'Magnesium', 'Aluminium', 'Potassium', 'Calcium', 'Scandium', 'Titanium', 'Vanadium', 'Chromium', 'Manganese', 'Iron', 'Cobalt', 'Nickel', 'Copper', 'Zinc', 'Gallium', 'Rubidium', 'Strontium', 'Yttrium', 'Zirconium', 'Niobium', 'Molybdenum', 'Technetium', 'Ruthenium', 'Rhodium', 'Palladium', 'Silver', 'Cadmium', 'Indium', 'Tin', 'Cesium', 'Barium', 'Lanthanum', 'Cerium', 'Praseodymium', 'Neodymium', 'Promethium', 'Samarium', 'Europium', 'Gadolinium', 'Terbium', 'Dysprosium', 'Holmium', 'Erbium', 'Thulium', 'Ytterbium', 'Strontium', 'Aluminium', 'Lutetium', 'Hafnium', 'Tantalum', 'Tungsten', 'Rhenium', 'Osmium', 'Iridium', 'Platinum', 'Gold', 'Mercury', 'Thallium', 'Lead', 'Bismuth', 'Polonium', 'Francium', 'Radium', 'Actinium', 'Thorium', 'Protactinium', 'Uranium', 'Neptunium', 'Plutonium', 'Americium', 'Curium', 'Berkelium', 'Californium', 'Einsteinium', 'Fermium', 'Mendelevium', 'Nobelium', 'Lawrencium', 'Rutherfordium', 'Dubnium', 'Selenium', 'Seaborgium', 'Bohrium', 'Hassium', 'Meitnerium', 'Darmstadtium', 'Roentgenium', 'Copernicium', 'Nihonium', 'Flerovium', 'Moscovium', 'Livermorium', 'Colecalciferol'], 
    'misc': ['Dairy', 'Abatacept', 'Calamine', 'Cranberry', 'Isophane', 'Dexrazoxane', 'Menotrophin', 'Pancreatin', 'Darbepoetin Alfa', 'Insulin', 'Kaolin With Morphine', 'Albumin Solution', 'Somatropin', 'Lutropin Alfa', 'Urofollitropin', 'Follitropin', 'otropin', 'Fluorescein', 'Melatonin', 'Bivalirudin', 'Cenegermin', 'Glycerol', 'Nonoxinol', 'Ichthammol', 'Alcohol', 'Levomenthol', 'Tuberculin Purified Protein Derivative', 'Macrogol', 'Macrogol 3350', 'Iodide', 'Tetracosactide', 'Cetrimide', 'Lixisenatide', 'Pasireotide', 'Lanreotide', 'Octreotide', 'Tryptophan', 'Human Papillomavirus Vaccines', 'Soybean Oil', 'European Viper Snake Venom Antiserum', 'Patisiran', 'Biphasic', 'Biphasic Insulin', 'Japanese Encephalitis Vaccine', 'Asfotase Alfa', 'Crisantaspase', 'Pegaspargase', 'Rasburicase', 'Asfotase', 'Agalsidase', 'Hyaluronidase', 'Laronidase', 'Urokinase', 'Streptokinase', 'Galsulfase', 'Idursulfase', 'Elosulfase Alfa', 'Tenecteplase', 'Alteplase', 'Imiglucerase', 'Velaglucerase', 'Hydroxyethylcellulose', 'Methylcellulose', 'Botulinum', 'Phosphate', 'Haemophilus Influenzae Type B With Meningococcal Group C Vaccine', 'Tetanus Immunoglobulin', 'Eucalyptus With Menthol ', 'Rotavirus Vaccine', 'Cytomegalovirus Immunoglobulin', 'Normal Immunoglobulin', 'Coal Tar', 'Coal', 'Meningococcal', 'Parenteral', 'Enteral', 'Nusinersen', 'Inotersen', 'Hydroxypropyl', 'Polyvinyl Alcohol', 'Interferon Beta', 'Peginterferon Alfa', 'Parathyroid Hormone', 'Liquid Paraffin', 'Liquid Paraffin With White Soft Paraffin And Wool Alcohols ', 'Conestat Alfa', 'Hepatitis A And B Vaccine', 'Hepatitis A Vaccine', 'Hepatitis A With Typhoid Vaccine', 'Hepatitis B Immunoglobulin', 'Hepatitis B Vaccine', 'Sevelamer', 'Filgrastim', 'Lipegfilgrastim', 'Pegfilgrastim', 'Dried Prothrombin Complex', 'Sterculia', 'Diphtheria Antitoxin', 'Diphtheria With Tetanus And Poliomyelitis Vaccine', 'Abatacept', 'Belatacept', 'Aflibercept', 'Etanercept', 'Vitamins A And D', 'Vitamins With Minerals And Trace Elements', 'Degarelix', 'Ganirelix', 'Cetrorelix', 'Andexanet Alfa', 'Anthrax Vaccine', 'Rabies Immunoglobulin', 'Rabies Vaccine', 'Tree Pollen Extract', 'Bee Venom Extract', 'Cholera Vaccine', 'Anakinra'],
    'monoclonal antibodies': ['Blinatumomab', 'Guselkumab', 'Ramucirumab','Denosumab', 'Bezlotoxumab', 'Romosozumab', 'Ranibizumab','Palivizumab', 'Ocrelizumab', 'Tocilizumab', 'Reslizumab', 'Atezolizumab', 'Certolizumab', 'Mepolizumab', 'Pembrolizumab', 'Vedolizumab', 'Atezolizumab', 'Benralizumab', 'Natalizumab', 'Omalizumab', 'Benralizumab', 'Eculizumab', 'Mogamulizumab', 'Ravulizumab', 'Eculizumab', 'Bevacizumab', 'Caplacizumab', 'Brolucizumab', 'Idarucizumab', 'Tildrakizumab', 'Ixekizumab', 'Risankizumab', 'Polatuzumab', 'Pertuzumab', 'Obinutuzumab', 'Alemtuzumab', 'Gemtuzumab', 'Alemtuzumab', 'Elotuzumab', 'Inotuzumab', 'Trastuzumab', 'Fremanezumab', 'Galcanezumab', 'Nivolumab', 'Brodalumab', 'Durvalumab', 'Lanadelumab', 'Avelumab', 'Dupilumab', 'Sarilumab', 'Adalimumab', 'Belimumab', 'Ipilimumab', 'Golimumab', 'Necitumumab','Panitumumab', 'Canakinumab', 'Ustekinumab', 'Secukinumab', 'Evolocumab', 'Alirocumab', 'Cetuximab', 'Rituximab', 'Siltuximab', 'Brentuximab', 'Dinutuximab', 'Infliximab', 'Basiliximab'], 
    'vitamins': ['Thiamine', 'Pyridoxine Hydrochloride', 'Phytomenadione', 'Ascorbic', 'Ascorbic Acid', 'Vitamin', 'Cyanocobalamin', 'Hydroxocobalamin'], 'small_drug_molecules_in_wrong_suffix': ['Pentamidine', 'Cefradine', 'Levamisole', 'Clofazimine', 'Carbimazole', 'Alprostadil', 'Tranylcypromine', 'Vinblastine', 'Vincristine', 'Atomoxetine', 'Dapoxetine', 'Trientine', 'Nicotine', 'Terbutaline', 'Apomorphine','Domperidone', 'Diazoxide', 'Meptazinol', 'Metaraminol', 'Salbutamol', 'Cefiderocol', ]}

outline = {} # empty dict in which to add categorised entries.
match_first_word = re.compile(r'\w+') # Find the first word when called (raw string avoids an invalid-escape warning)
suffix_lengths = range(3,9) # Change this range to change the range of suffix lengths

new_columns = [f'suffix_{i}' for i in suffix_lengths] # set appropriate column names
for col in new_columns:
    df[col] = None # assign new columns to df filled with None (=Null)
print(df) # show the dataframe once, after all the new columns have been added

                    first_word  \
0                     Abacavir   
1                    Abatacept   
2                  Abemaciclib   
3          Abiraterone Acetate   
4          Acamprosate Calcium   
...                        ...   
1147                Zonisamide   
1148                 Zopiclone   
1149            Zuclopenthixol   
1150    Zuclopenthixol Acetate   
1151  Zuclopenthixol Decanoate   

                                               drug_url  \
0            https://bnf.nice.org.uk/drug/abacavir.html   
1           https://bnf.nice.org.uk/drug/abatacept.html   
2         https://bnf.nice.org.uk/drug/abemaciclib.html   
3     https://bnf.nice.org.uk/drug/abiraterone-aceta...   
4     https://bnf.nice.org.uk/drug/acamprosate-calci...   
...                                                 ...   
1147       https://bnf.nice.org.uk/drug/zonisamide.html   
1148        https://bnf.nice.org.uk/drug/zopiclone.html   
1149   https://bnf.nice.org.uk/drug/zuclopenthixol.html   
1

In [5]:
#code in this cell was written by Oran and makes a dataframe that organises drugs by suffix
match_first_word = re.compile(r'\w+') # regex compiled to match a word

for k, v in special_cases.items(): # flag the special cases once, a whole column at a time
    df[f'sc_{k}'] = df['first_word'].isin(v) # consults whether each element of the series is in the list.

for index, drug in df['drug_name_as_title'].items(): # iterate through the series of drug names, noting index position
    match = match_first_word.match(drug) # match the first word of the 'drug name'
    first_word = match.group(0) # extract the string corresponding to the match

    for i in suffix_lengths:    # iterate through the suffix lengths
        if i > len(first_word)-1: 
            suffix = None # a 'suffix' as long as the whole word (or longer) is meaningless, so leave it empty
        else:
            suffix = first_word[-i:] # slice out the last i letters from the drug name
        df.loc[index,f'suffix_{i}'] = suffix # set the answer in location

Unnamed: 0,first_word,drug_url,drug_name_as_title,suffix_3,suffix_4,suffix_5,suffix_6,suffix_7,suffix_8,sc_metals,sc_misc,sc_monoclonal antibodies,sc_vitamins,sc_small_drug_molecules_in_wrong_suffix
0,Abacavir,https://bnf.nice.org.uk/drug/abacavir.html,Abacavir,vir,avir,cavir,acavir,bacavir,,False,False,False,False,False
1,Abatacept,https://bnf.nice.org.uk/drug/abatacept.html,Abatacept,ept,cept,acept,tacept,atacept,batacept,False,True,False,False,False
2,Abemaciclib,https://bnf.nice.org.uk/drug/abemaciclib.html,Abemaciclib,lib,clib,iclib,ciclib,aciclib,maciclib,False,False,False,False,False
3,Abiraterone Acetate,https://bnf.nice.org.uk/drug/abiraterone-aceta...,Abiraterone Acetate,one,rone,erone,terone,aterone,raterone,False,False,False,False,False
4,Acamprosate Calcium,https://bnf.nice.org.uk/drug/acamprosate-calci...,Acamprosate Calcium,ate,sate,osate,rosate,prosate,mprosate,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1147,Zonisamide,https://bnf.nice.org.uk/drug/zonisamide.html,Zonisamide,ide,mide,amide,samide,isamide,nisamide,False,False,False,False,False
1148,Zopiclone,https://bnf.nice.org.uk/drug/zopiclone.html,Zopiclone,one,lone,clone,iclone,piclone,opiclone,False,False,False,False,False
1149,Zuclopenthixol,https://bnf.nice.org.uk/drug/zuclopenthixol.html,Zuclopenthixol,xol,ixol,hixol,thixol,nthixol,enthixol,False,False,False,False,False
1150,Zuclopenthixol Acetate,https://bnf.nice.org.uk/drug/zuclopenthixol-ac...,Zuclopenthixol Acetate,xol,ixol,hixol,thixol,nthixol,enthixol,False,False,False,False,False
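
The row-by-row loop above can also be expressed with pandas string methods, which is one way to get the group counts mentioned earlier. This is a sketch on a four-row `demo` frame with a tiny stand-in for `special_cases`. It assumes single-word names, so it skips the first-word regex step that the real cell needs for names such as 'Abiraterone Acetate'.

```python
import pandas as pd

demo = pd.DataFrame({'first_word': ['Abacavir', 'Aciclovir', 'Zopiclone', 'Zinc']})
demo_special_cases = {'metals': ['Zinc', 'Iron']}  # tiny stand-in for special_cases

for key, members in demo_special_cases.items():
    demo[f'sc_{key}'] = demo['first_word'].isin(members)

word_lengths = demo['first_word'].str.len()
for i in range(3, 6):
    tails = demo['first_word'].str[-i:]
    # keep the suffix only when it is strictly shorter than the word
    demo[f'suffix_{i}'] = tails.where(word_lengths > i)

print(demo)
print(demo['suffix_3'].value_counts())  # number of drugs sharing each 3-letter suffix
```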


In [6]:
# Code in this cell was written by Alfredo
#this cell looks up pubchem data and retrieves SMILES
import pubchempy as pcp
for name in df['drug_name_as_title'].iloc[:1151]:
    compounds = pcp.get_compounds(name, 'name')
    if len(compounds) > 0:
        compound = compounds[0]
        mw = compound.molecular_weight 
        mf = compound.molecular_formula 
        smiles = compound.canonical_smiles
        if 50 < mw < 1500:
            print(compound, name, mw, mf, smiles)
        else:
            print(name, f'molecular weight {mw} outside of range 50 < mw < 1500')
    else:
        print(name, 'not found in pubchem database')

Compound(441300) Abacavir 286.33 C14H18N6O C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)C4CC(C=C4)CO
Abatacept not found in pubchem database
Compound(46220502) Abemaciclib 506.6 C27H32F2N8 CCN1CCN(CC1)CC2=CN=C(C=C2)NC3=NC=C(C(=N3)C4=CC5=C(C(=C4)F)N=C(N5C(C)C)C)F


In [7]:
#Code in this cell was written by Alfredo
# test to successfully export this data
pubchem_data = []            
for name in df['drug_name_as_title'].iloc[:1151]:
    compounds = pcp.get_compounds(name, 'name')
    if len(compounds) > 0:
        compound = compounds[0]
        mw = compound.molecular_weight 
        mf = compound.molecular_formula 
        smiles = compound.canonical_smiles       
        pubchem_data.append([name, compound, mw, mf, smiles])
    else:
        pubchem_data.append([name, 'not in pubchem', None, None, None])
from pprint import pprint
pprint(pubchem_data)

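import csv
with open("smilesofbnf.csv", "w", newline="") as f: # newline="" stops the csv module writing blank rows on Windows
    wr = csv.writer(f)
    wr.writerows(pubchem_data) # one row per drug: name, compound, mw, mf, smiles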

In [9]:
#code in this cell was written by Oran
import json

specials = [df[c] for c in df.columns if c[0:3] == 'sc_']

outline = {}

for series in specials:
    selected = series == True
    outline[series.name] = df[selected].set_index('first_word')['drug_url'].T.to_dict()
    outline[series.name]['count'] = df[selected].shape[0]

special_cols = [series.name for series in specials]
non_specials = df[special_cols].any(axis=1) == False
dfn = df[non_specials]

suffixes = [c for c in df.columns if c[0:3] == 'suf']
suffixes.append('first_word')

# def count_clusters(df, outline, col, branch=''):
#     counts = df.apply(pd.value_counts)
#     counts['sum'] = counts.sum(axis=1)
#     counts = counts.sort_values(col, axis=0, ascending=False)
#     is_summed = counts[col] > 0
#     counts = counts[is_summed]
#     for i in counts.index:
#     return counts

# for suffix in suffixes:
#     print(count_clusters(dfn, outline, suffix).index)

def recursive_sort(df,suffixes,dictionary,level=0, branch='', prior_count=0):
    col = suffixes[level]
    counts = df[suffixes].apply(pd.value_counts).sort_values(col, axis=0, ascending=False)
    singles = counts[counts[col] == 1][col]
    count = df.shape[0]

    print(col, f'level={level}', branch, f'iteration:{prior_count}')
    dictionary['count'] = count
    dictionary['singles'] = {}
    dictionary['singles']['count'] = singles.shape[0]
    singlet_data = ['drug_name_as_title', 'drug_url', 'first_word']
    
    for idx in singles.index:
        single_drug = df[df[col]==idx][singlet_data].set_index('drug_name_as_title')
        single_drug_dict = single_drug.to_dict(orient='index')
        drug_name = single_drug.index[0]
        dictionary['singles'][drug_name] = single_drug_dict[drug_name]
    
    groups = counts[counts[col] > 1][suffixes][col]
    constrained_suffixes = suffixes[suffixes.index(col):]
    #print(groups)
    if len(groups.index) == 0:
        drug_first_word = df['first_word'].iloc[0] # select by column name rather than by position
        #print('JUMP!',drug_first_word)
        dictionary[drug_first_word] = {}
        same_first = df[singlet_data][df['first_word']==drug_first_word].set_index('drug_name_as_title')
        count_sames = same_first.shape[0]
        dictionary[drug_first_word] = same_first.to_dict(orient='index')
    else:
        for idx in groups.index:

            dfn = df[df[col]==idx]

            if level < len(suffixes)-1:
                if len(groups.index) == 1:
                    recursive_sort(dfn, suffixes, dictionary, level+1, idx, prior_count+1)
                else:
                    dictionary[idx] = {}
                    recursive_sort(dfn,suffixes,dictionary[idx],level+1, idx, prior_count+1)
            else:
                dictionary[idx] = {}
                same_first = df[singlet_data][df['first_word']==idx].set_index('drug_name_as_title')
                count_sames = same_first.shape[0]
                #dictionary[idx]['count'] = count_sames
                dictionary[idx] = same_first.to_dict(orient='index')
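
The idea behind `recursive_sort` can be easier to follow on plain lists. The function below is a deliberately simplified stand-in, not the notebook's implementation: it groups words by their last `length` letters and recurses with a longer suffix whenever a group has more than one member. The name `suffix_tree` and its parameters are invented for this sketch.

```python
def suffix_tree(words, length=3, max_length=5):
    """Group words by their last `length` letters, recursing to longer suffixes."""
    tree = {'count': len(words)}
    groups = {}
    for word in words:
        # words too short for this suffix length are grouped under their whole name
        key = word[-length:].lower() if len(word) > length else word.lower()
        groups.setdefault(key, []).append(word)
    for key, members in sorted(groups.items()):
        if len(members) > 1 and length < max_length:
            tree[key] = suffix_tree(members, length + 1, max_length)  # split further
        else:
            tree[key] = sorted(members)  # leaf: a single drug, or the longest suffix
    return tree

example_tree = suffix_tree(['Abacavir', 'Aciclovir', 'Ganciclovir', 'Zopiclone'])
print(example_tree)
```

Here 'Zopiclone' ends up alone under `'one'`, while the three `'vir'` names are split again into `'avir'` and `'ovir'`, and the two `'ovir'` names stay together under `'lovir'`.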



In [2]:
#Code in this cell was written by Alfredo
#This cell aims to clean up the CSV upon reflection of a discussion with Isobel

# get mw and molecular formula out of the smilesofbnf CSV file and into a dataframe so that they can be added to the original df 
df_w_smiles = pd.read_csv('smilesofbnf.csv', names=['Name of compound', 'Compound ID', 'molecular_weight', 'molecular_formula', 'SMILES'])
# remove all molecules that do not have a SMILES string
new_df = df_w_smiles.dropna()
# remove all molecules that have a MW over 1000
df_newer = new_df[new_df['molecular_weight'] < 1000]
# Save to CSV to export
# df_newer.to_csv('BNF_new_smiles.csv')

NameError: name 'SMILES' is not defined
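
For reference, the clean-up steps in the cell above can be tried without the real file by reading a few rows from an in-memory string. In this sketch the first three rows reuse values printed earlier in the notebook, the fourth row is hypothetical, and `dropna(subset=['SMILES'])` is a swapped-in variant that only drops rows lacking a SMILES string rather than rows with any missing value.

```python
import io
import pandas as pd

# The first three rows reuse values shown earlier; the last row is hypothetical.
csv_text = (
    'Abacavir,Compound(441300),286.33,C14H18N6O,C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)C4CC(C=C4)CO\n'
    'Abatacept,not in pubchem,,,\n'
    'Abemaciclib,Compound(46220502),506.6,C27H32F2N8,'
    'CCN1CCN(CC1)CC2=CN=C(C=C2)NC3=NC=C(C(=N3)C4=CC5=C(C(=C4)F)N=C(N5C(C)C)C)F\n'
    'Hypothetical Heavy Compound,Compound(0),1200.0,C,C\n'
)
columns = ['Name of compound', 'Compound ID', 'molecular_weight', 'molecular_formula', 'SMILES']

smiles_demo = pd.read_csv(io.StringIO(csv_text), names=columns)
with_smiles = smiles_demo.dropna(subset=['SMILES'])          # drop rows without a SMILES string
small = with_smiles[with_smiles['molecular_weight'] < 1000]  # keep MW under 1000
print(small['Name of compound'].tolist())  # ['Abacavir', 'Abemaciclib']
```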

In [None]:
# Code written by Oran
recursive_sort(dfn,suffixes,outline)

with open('drug_tree_culled.json', 'w') as fp:
    json.dump(outline, fp, indent=4)

with open('drug_tree_culled.json', 'r') as fp:
    json_raw = fp.read()

result = re.subn(r'{\n\s+"count"', r'{ "count"', json_raw)
print(f'{result[1]} nodes summed.')
with open('drug_tree_culled.json', 'w') as fp:
    fp.write(result[0])

In [None]:
# Code written by Oran
with open('drug_tree.json', 'r') as fp:
    json_raw = fp.read()

result = re.subn(r'{\n\s+"count"', r'{ "count"', json_raw)
print(f'{result[1]} nodes summed.')
with open('drug_tree.json', 'w') as fp:
    fp.write(result[0])
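
To see what the `re.subn` step does to the file, here is a small sketch on a hand-made dictionary (`tiny_tree` is invented for the example). It pulls each `"count"` key up onto the line of its opening brace and checks that the result is still valid JSON.

```python
import json
import re

tiny_tree = {'vir': {'count': 2, 'avir': {'count': 1}}}
raw = json.dumps(tiny_tree, indent=4)

# Replace '{' + newline + indentation + '"count"' with '{ "count"' on one line.
compact, n_changed = re.subn(r'\{\n\s+"count"', '{ "count"', raw)

print(compact)
print(f'{n_changed} nodes summed.')
assert json.loads(compact) == tiny_tree  # only the whitespace has changed
```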

In [None]:
# Alfredo's notes for future editing and general cleaning of dataset to make the suffix approach work more effectively: 
# deflazacort is not a special - prednisolone derivative - need to remove it from special? maybe
# family drug classes that do not make sense:
#     'ine'
#     'idine'
#     'radine'
#     'mine' MAOI antidep with anti-leprosy drug
#     'amine'
#     'tamine' ACEi inhibitor with eye drops to remove crystal deposits

# family drug classes that need editing:
# 'udine' HIV drug should include lamivudine and zidovudine    

    





    
