# Introduction

This exercise makes use of the database you created in `Exercise02` and the BEL statement parsers you wrote with regular expressions in `Reading_searching_sending.ipynb`.

In [1]:
import pandas as pd
import os, json, re, time
time.asctime()

'Thu Oct  6 09:56:34 2016'

In [2]:
base = os.path.join(os.environ['BUG_FREE_EUREKA_BASE'])
base

'C:\\Users\\Dell\\Documents\\GitHub\\bug-free-eureka'

# Task 1

This task is about loading the HGNC data to create a dictionary that maps each HGNC symbol to a set of enzyme IDs.

## 1.1 Load Data

Load json data from `/data/exercise02/hgnc_complete_set.json`.

In [3]:
data_path = os.path.join(base, 'data', 'exercise02', 'hgnc_complete_set.json')
with open(data_path) as f:
    hgnc_json = json.load(f)
# hgnc_json is now a nested Python dict mirroring the JSON structure

## 1.2 Reorganize Data into `pd.DataFrame`

Identify the relevant subdictionaries under `dictionary -> response -> docs`. Load them into a data frame,
then create a new data frame with just the HGNC symbol and enzyme ID.

In [4]:
# docs is a list of dicts, one per gene; docs[0] would be a single gene record
docs = hgnc_json['response']['docs']
# Load the records into a data frame to inspect their structure
df_hgnc = pd.DataFrame(docs)

list(df_hgnc.columns)

['_version_',
 'alias_name',
 'alias_symbol',
 'bioparadigms_slc',
 'ccds_id',
 'cd',
 'cosmic',
 'date_approved_reserved',
 'date_modified',
 'date_name_changed',
 'date_symbol_changed',
 'ena',
 'ensembl_gene_id',
 'entrez_id',
 'enzyme_id',
 'gene_family',
 'gene_family_id',
 'hgnc_id',
 'homeodb',
 'horde_id',
 'imgt',
 'intermediate_filament_db',
 'iuphar',
 'kznf_gene_catalog',
 'lncrnadb',
 'location',
 'location_sortable',
 'locus_group',
 'locus_type',
 'lsdb',
 'mamit-trnadb',
 'merops',
 'mgd_id',
 'mirbase',
 'name',
 'omim_id',
 'orphanet',
 'prev_name',
 'prev_symbol',
 'pseudogene.org',
 'pubmed_id',
 'refseq_accession',
 'rgd_id',
 'snornabase',
 'status',
 'symbol',
 'ucsc_id',
 'uniprot_ids',
 'uuid',
 'vega_id']
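
The records in `docs` do not all carry the same keys, which is why a column such as `enzyme_id` is empty for many genes. The following is a minimal sketch of that behaviour using made-up records (`toy_docs` and its values are illustrative, not taken from the HGNC file): pandas builds the columns from the union of all keys and fills the gaps with `NaN`.

```python
import pandas as pd

# Toy records standing in for hgnc_json['response']['docs'] (illustrative values only)
toy_docs = [
    {'symbol': 'GENE1', 'enzyme_id': ['2.7.11.1']},
    {'symbol': 'GENE2'},  # this record has no 'enzyme_id' key at all
]

toy_df = pd.DataFrame(toy_docs)

# The columns are the union of all keys seen in the records
print(sorted(toy_df.columns))

# Missing entries become NaN, while present entries keep their list value
print(toy_df['enzyme_id'].isna().tolist())
```

Because a missing value is a float `NaN` rather than a list, a type check such as `isinstance(value, list)` is one simple way to tell annotated genes from unannotated ones.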

In [5]:
pd.DataFrame(hgnc_json).head()
#hgnc_json['response']['docs'][0].keys()

| | response | responseHeader |
|---|---|---|
| QTime | | 16.0 |
| docs | [{'status': 'Approved', 'pubmed_id': [2591067]... | |
| numFound | 41049 | |
| start | 0 | |
| status | | 0.0 |


## 1.3 Build dictionary for lookup

Iterate over this data frame to build a dictionary of the form `{hgnc symbol: set of enzyme IDs}`. Call this dictionary `symbol2ec`.

In [6]:
#symbol2ec = {}
df_hgnc[['symbol','enzyme_id']].head(6)

| | symbol | enzyme_id |
|---|---|---|
| 0 | A1BG | |
| 1 | A1BG-AS1 | |
| 2 | A1CF | |
| 3 | A2M | |
| 4 | A2M-AS1 | |
| 5 | A2ML1 | |


In [7]:
# Dictionary mapping each HGNC symbol to its enzyme IDs
symbol2ec = {}
# Keep only the two columns needed for the lookup
df_hgnc_sliced = df_hgnc[['symbol', 'enzyme_id']]
df_hgnc_sliced.head(5)



| | symbol | enzyme_id |
|---|---|---|
| 0 | A1BG | |
| 1 | A1BG-AS1 | |
| 2 | A1CF | |
| 3 | A2M | |
| 4 | A2M-AS1 | |


In [8]:
# Iterate over the rows; itertuples() yields (index, symbol, enzyme_id)
for idx, symbol, enzyme_ids in df_hgnc_sliced.itertuples():
    if isinstance(enzyme_ids, list):
        # Stored as the original list; wrapping it in set() would also drop duplicates
        symbol2ec[symbol] = enzyme_ids
    else:
        # Genes without an EC annotation have NaN here, so store None for them
        symbol2ec[symbol] = None
            
    

In [9]:
'AKT1' in symbol2ec  # checks whether this symbol is a key of the dict

True
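
The task statement asks for sets of enzyme IDs. A minimal sketch of a set-based variant, assuming a toy data frame (`toy_df`, `toy_symbol2ec`, and the values are illustrative, not part of the exercise data), could look like this. Storing an empty set for genes without an annotation means every value can be iterated without a special case.

```python
import pandas as pd

# Toy frame standing in for df_hgnc[['symbol', 'enzyme_id']] (illustrative values only)
toy_df = pd.DataFrame([
    {'symbol': 'GENE1', 'enzyme_id': ['2.7.11.1', '2.7.11.1']},
    {'symbol': 'GENE2'},  # missing enzyme_id becomes NaN
])

toy_symbol2ec = {}
for symbol, enzyme_ids in zip(toy_df['symbol'], toy_df['enzyme_id']):
    # Lists become sets (duplicates collapse); NaN becomes an empty set
    toy_symbol2ec[symbol] = set(enzyme_ids) if isinstance(enzyme_ids, list) else set()

print(toy_symbol2ec)
```

With this layout, a membership test such as `'GENE2' in toy_symbol2ec` still answers whether the symbol exists, and looping over `toy_symbol2ec['GENE2']` simply does nothing.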

# Task 2

This subexercise is about validating protein and kinase activity statements in BEL. Refer to last Thursday's work in `Reading_searching_sending.ipynb`.

## 2.1 Valid HGNC

Write a function, `valid_hgnc(hgnc_symbol, symbol2ec_instance)`, that takes a name and the dictionary from Task 1.3 and returns whether the name is a valid HGNC symbol.

In [10]:
def valid_hgnc(hgnc_symbol, symbol2ec_instance):
    # A symbol is valid if it is a key of the lookup dictionary
    return hgnc_symbol in symbol2ec_instance

print(valid_hgnc('APP', symbol2ec))
# assert raises an AssertionError only if the condition is False
assert valid_hgnc('AKT1', symbol2ec)
assert not valid_hgnc('AKTT1', symbol2ec)
assert not valid_hgnc("boogeyman", symbol2ec)

True


## 2.2 Valid Kinase Activity

Write a function, `valid_kinase(hgnc_symbol, symbol2ec_instance)` that takes a name and the dictionary from Task 1.3 and returns whether this protein has kinase activity. Hint: an enzyme code reference can be found [here](http://brenda-enzymes.org/ecexplorer.php?browser=1&f[nodes]=132&f[action]=open&f[change]=153)

In [11]:
symbol2ec['AKT1'],symbol2ec['PIK3CA'],symbol2ec['AKT2']


(['2.7.11.1'], ['2.7.1.153'], ['2.7.11.1'])

In [12]:
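def valid_kinase(hgnc_symbol, symbol2ec_instance):
    # An unknown symbol cannot have kinase activity
    if not valid_hgnc(hgnc_symbol, symbol2ec_instance):
        return False
    # Symbols without an EC annotation may be stored as None; treat that as "no enzyme IDs"
    enzyme_ids = symbol2ec_instance[hgnc_symbol] or ()
    # EC class 2.7 covers enzymes that transfer phosphorus-containing groups, which includes kinases
    return any(e.startswith('2.7.') for e in enzyme_ids)


assert valid_kinase('AKT1', symbol2ec)
assert not valid_kinase('AKT23', symbol2ec)
assert not valid_kinase('APP', symbol2ec)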
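
The `2.7.` prefix is a broad test: it accepts any enzyme that transfers phosphorus-containing groups. If only protein kinases should count, a stricter check is possible. The sketch below assumes that the protein kinase sub-subclasses are EC 2.7.10 to 2.7.13; that list and the helper name `is_protein_kinase` are assumptions for illustration and should be checked against the EC reference linked above.

```python
# Assumed protein kinase sub-subclasses (verify against the EC reference before relying on them)
PROTEIN_KINASE_PREFIXES = ('2.7.10.', '2.7.11.', '2.7.12.', '2.7.13.')


def is_protein_kinase(ec_ids):
    """Return True if any EC number starts with one of the assumed protein kinase prefixes."""
    # `ec_ids or ()` lets None and empty collections fall through to False
    return any(ec.startswith(PROTEIN_KINASE_PREFIXES) for ec in (ec_ids or ()))


print(is_protein_kinase(['2.7.11.1']))   # EC number listed for AKT1 above
print(is_protein_kinase(['2.7.1.153']))  # EC number listed for PIK3CA above
print(is_protein_kinase(None))           # symbol without an EC annotation
```

Note the trade-off: under this stricter rule the EC number shown for `PIK3CA` no longer counts, although it passes the broader `2.7.` test, so the choice depends on how `kin(...)` terms are meant to be used in the exercise.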

In [13]:
#match_protein = re.compile("^([a-z])\(([A-Z]+):([a-zA-Z0-9]+)\)\s(->|-\|)\s([a-z])\(([A-Z]+):([a-zA-Z0-9]+)\)$")

In [20]:
# re.compile() turns a pattern into a regular expression object with match() and search() methods.
# Raw strings (r'...') keep backslashes such as \( and \w intact.
match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
# groupdict() returns a dict of all named groups in the match, keyed by group name
match_protein.match('p(HGNC:ABC)').groupdict()
match_kin = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')
match_kin.match('kin(p(HGNC:APP))').groupdict()


{'name': 'APP'}

In [15]:
match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
# match() anchors at the start of the string, so a term beginning with 'kin(' gives None
match_protein.match('kin(p(HGNC:ADC))').groupdict()

AttributeError: 'NoneType' object has no attribute 'groupdict'
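
The error above comes from calling `.groupdict()` on `None`: `match()` only succeeds when the pattern matches at the start of the string. The following sketch contrasts `match()`, `search()`, and `fullmatch()` on the same term and shows one way to guard against a failed match. The variable names are illustrative.

```python
import re

protein_pattern = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
term = 'kin(p(HGNC:ADC))'

# match() anchors at the start: the term begins with 'kin(', so there is no match
print(protein_pattern.match(term))

# search() scans the whole string and finds the inner protein term
print(protein_pattern.search(term).groupdict())

# fullmatch() requires the entire string to match, so trailing text is rejected
print(protein_pattern.fullmatch('p(HGNC:ADC) extra'))

# Guard against None before asking for groups
m = protein_pattern.match(term)
name = m.group('name') if m else None
print(name)
```

For validating a complete BEL term, `fullmatch()` is usually the safer choice, because it rejects both unexpected prefixes and trailing characters.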

## 2.3 Putting it all together

Write a function, `validate_bel_term(term, symbol2ec_instance)` that parses a BEL term about either a protein, or the kinase activity of a protein and validates it.

```python
def validate_bel_term(term, symbol2ec_instance):
    pass
```

### Examples

```python
>>> # check that the proteins have valid HGNC codes
>>> validate_bel_term('p(HGNC:APP)', symbol2ec)
True
>>> validate_bel_term('p(HGNC:ABCDEF)', symbol2ec)
False
>>> # check that kinase activity annotations are only on proteins that are
>>> # actually protein kinases (hint: check EC annotation)
>>> validate_bel_term('kin(p(HGNC:APP))', symbol2ec)
False
>>> validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec)
True
```

In [23]:
def validate_bel_term(term, symbol2ec_instance):
    # Raw strings keep the backslashes intact; fullmatch() requires the whole term to match
    match_protein = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
    match_kin = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')

    protein = match_protein.fullmatch(term)
    if protein:
        # A protein term is valid if its name is a known HGNC symbol
        return valid_hgnc(protein.group('name'), symbol2ec_instance)

    kinase = match_kin.fullmatch(term)
    if kinase:
        # valid_kinase already checks the symbol; bool() turns a None result into False
        return bool(valid_kinase(kinase.group('name'), symbol2ec_instance))

    # Neither a protein term nor a kinase activity term
    return False


# APP is a valid HGNC symbol without an EC annotation, so the expected result is False.
# This only works if valid_kinase copes with symbols whose enzyme IDs are missing (None);
# a bare `except:` would hide the resulting TypeError instead of revealing it.
validate_bel_term('kin(p(HGNC:APP))', symbol2ec)


In [None]:
>>> # check that the proteins have valid HGNC codes
>>> validate_bel_term('p(HGNC:APP)', symbol2ec)

>>> validate_bel_term('p(HGNC:ABCDEF)', symbol2ec)

>>> # check that kinase activity annotations are only on proteins that are
>>> # actually protein kinases (hint: check EC annotation)
>>> validate_bel_term('kin(p(HGNC:APP))', symbol2ec)

>>> validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec)



# Task 3

This task is about manual curation of text. You will be guided through translating the following text into BEL statements, written as strings within a Python list.

## Document Definitions

Recall citations are written with source, title, then identifier as follows:

```
SET Citation = {"PubMed", "Nat Cell Biol 2007 Mar 9(3) 316-23", "17277771"}
```

Use these annotations and these namespaces:

```
DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"

DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}
```


## Source Text

> The following statements are from the document "BEL Exercise" in edition 00001 of the PyBEL Journal.
> The kinase activity of PI3K causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm, 
> but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum. 
> Additionally, the abundances of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei.
> AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER.

In [None]:
def get_symbol(name_in):
    return list(df_hgnc[df_hgnc.name == name_in]['symbol'])[0]

In [None]:
get_symbol('AKT serine/threonine kinase 2')

In [None]:
get_symbol('AKT serine/threonine kinase 1')
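
`get_symbol` raises an `IndexError` when no row carries the given name, for example after a typo in the gene name. A hedged sketch of a more forgiving lookup follows; `get_symbol_or_none` and the toy frame are hypothetical names and values used only for illustration.

```python
import pandas as pd

# Toy frame standing in for df_hgnc (illustrative values only)
toy_df = pd.DataFrame({
    'name': ['toy gene one', 'toy gene two'],
    'symbol': ['TOY1', 'TOY2'],
})


def get_symbol_or_none(df, name_in):
    """Return the symbol of the first row whose name matches, or None if there is none."""
    matches = df.loc[df['name'] == name_in, 'symbol']
    return matches.iloc[0] if not matches.empty else None


print(get_symbol_or_none(toy_df, 'toy gene one'))
print(get_symbol_or_none(toy_df, 'no such gene'))
```

Returning `None` makes the missing case explicit, so the caller can decide whether to skip the name or report it.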

In [None]:
definition_statements = [
    'SET DOCUMENT name = "BEL Exercise"',
    'DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"',
    'DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}',
]

In [None]:
# hint: there should be 11 statements from this text
your_statements = [
    'SET Citation = {"PubMed", "BEL Exercise", "00001"}',
    'SET Evidence = "The kinase activity of PI3K causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm, but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum. Additionally, the abundances of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei. AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER."',
    'SET CellLocation = "cytoplasm"',
    'kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT1)',
    'kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT2)',
    'SET CellLocation = "endoplasmic reticulum"',
    'kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT1)',
    'SET CellLocation = "cell nucleus"',
    'p(HGNC:AKT1) positiveCorrelation p(HGNC:AKT2)',
    'SET CellLocation = {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}',
    'p(HGNC:AKT2) increases p(HGNC:GSK3B)',
]
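
When writing statements as a list of strings, a missing comma is easy to overlook: Python joins adjacent string literals into one, so two statements silently become a single entry. The toy lists below (made-up strings, not BEL) illustrate the effect, and comparing `len()` against the expected count is a quick sanity check.

```python
# Adjacent string literals are concatenated when the comma between them is missing
missing_comma = [
    'first statement'
    'second statement',
    'third statement',
]

with_commas = [
    'first statement',
    'second statement',
    'third statement',
]

print(len(missing_comma))  # one entry fewer than intended
print(missing_comma[0])    # the two strings were joined without a separator
print(len(with_commas))
```

Applied to this task, a check such as `len(your_statements) == 11` would flag a merged pair of statements, given the hint that the text yields 11 statements.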

In [None]:
statements = definition_statements + your_statements

# Task 4

This task is again about regular expressions. Return to `Reading_searching_sending.ipynb` and find your regular expressions that parse the subject, predicate, and object from a statement like `p(HGNC:AKT1) pos p(HGNC:AKT2)`

## 4.1 Validating Statements

Write a function `validate_bel_statement(statement, symbol2ec)` that takes a subject-predicate-object BEL statement as a string and determines whether its subject and object are valid.

In [None]:
def validate_bel_statement(statement, symbol2ec):
    pass
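
One possible shape for such a function is sketched below. It is not the reference solution: it assumes that terms contain no whitespace, that the three parts are separated by spaces, and it uses a toy lookup (`toy_symbol2ec`) and a helper (`validate_term`) that are made up for this example. The predicate itself is not checked.

```python
import re

# Toy lookup standing in for symbol2ec (illustrative values only)
toy_symbol2ec = {'AKT1': {'2.7.11.1'}, 'AKT2': {'2.7.11.1'}, 'APP': set()}

PROTEIN = re.compile(r'p\(HGNC:(?P<name>\w+)\)')
KINASE = re.compile(r'kin\(p\(HGNC:(?P<name>\w+)\)\)')
# Three whitespace-separated parts: subject, predicate, object
STATEMENT = re.compile(r'(?P<subject>\S+)\s+(?P<predicate>\S+)\s+(?P<object>\S+)')


def validate_term(term, symbol2ec_instance):
    """Validate a protein term or a kinase activity term against the lookup."""
    protein = PROTEIN.fullmatch(term)
    if protein:
        return protein.group('name') in symbol2ec_instance
    kinase = KINASE.fullmatch(term)
    if kinase:
        # Unknown symbols and missing annotations both yield an empty iterable
        ec_ids = symbol2ec_instance.get(kinase.group('name')) or ()
        return any(ec.startswith('2.7.') for ec in ec_ids)
    return False


def validate_bel_statement(statement, symbol2ec_instance):
    """Return True if the statement has three parts and both terms are valid."""
    parsed = STATEMENT.fullmatch(statement.strip())
    if parsed is None:
        return False
    return (validate_term(parsed.group('subject'), symbol2ec_instance)
            and validate_term(parsed.group('object'), symbol2ec_instance))


print(validate_bel_statement('p(HGNC:AKT1) positiveCorrelation p(HGNC:AKT2)', toy_symbol2ec))
print(validate_bel_statement('kin(p(HGNC:APP)) increases p(HGNC:AKT1)', toy_symbol2ec))
```

Under this sketch, lines such as `SET CellLocation = "cytoplasm"` do not parse as subject, predicate, and object, so they are reported as invalid. Filtering out `SET` lines before validation would be one way to avoid that.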

## 4.2 Validating Your Statements

Run this cell to validate the BEL statements you've written.

In [None]:
for statement in your_statements:
    valid = validate_bel_statement(statement, symbol2ec)
    print('{} is {}valid'.format(statement, '' if valid else 'in'))

## 4.3 Visualization

Use `pybel` to visualize the network.

In [None]:
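try:
    import pybel
    import networkx as nx  # the drawing call below uses the `nx` alias
except ImportError:
    print('PyBEL or networkx not installed')
else:
    # Only import failures are caught above, so errors in these calls stay visible
    g = pybel.from_bel(statements)
    nx.draw_spring(g, with_labels=True)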