# Built-in functions and Python modules

## Built-in functions
Python 2.7.12: https://docs.python.org/2/library/functions.html

Python 3.6: https://docs.python.org/3.6/library/functions.html

There are many built-in functions in Python. For example, type() identifies the type of an object, and isinstance() checks whether an object is an instance of a specific class.

Here we will use the built-in function open() to work with a file on your machine. What makes the built-in functions special is that you don't need an import: you can use them directly in your code.

In [7]:
print('Type of "1":',type(1))
print('Type of "String":',type("String"))

print('Is "1" a string?:',isinstance(1, str))

Type of "1": <class 'int'>
Type of "String": <class 'str'>
Is "1" a string?: False


Now we need a file that contains some interesting information. Let's use a simplified file written in Biological Expression Language (BEL). The definitions for biological interactions look like this:

p(HGNC:BCL2) -| p(HGNC:CYCS)

This translates to:

The protein BCL2 (defined in namespace HGNC) inhibits the protein CYCS (defined in HGNC).

Let's open a file and print it out line by line! To do so, we store the return value of open() in a variable. This variable represents our file as a so-called file object; here it is named fd, short for 'file descriptor'.

In [12]:
import os
fd = open(os.path.join('..', 'data', 'reading_searching_sending', 'interactions.txt'))

The open() function takes the path to the file and, optionally, a mode that states whether the file is opened for reading ('r'), writing ('w') and so on. Here is a full table of the possible modes:

'r' 	open for reading (default)

'w' 	open for writing, truncating the file first

'x' 	open for exclusive creation, failing if the file already exists

'a' 	open for writing, appending to the end of the file if it exists

'b' 	binary mode

't' 	text mode (default)

'+' 	open a disk file for updating (reading and writing)

'U' 	universal newlines mode (deprecated)
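
As a minimal sketch of how some of these modes behave, the following cell writes, appends to and re-reads a throwaway file. The file name modes_demo.txt and the temporary directory are assumptions made for this illustration only; they are not part of the course data.

```python
import os
import tempfile

# A throwaway directory, so the example does not touch any real file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    with open(path, 'w') as fd:   # 'w' creates the file (or truncates it)
        fd.write('first line\n')

    with open(path, 'a') as fd:   # 'a' appends to the end of the file
        fd.write('second line\n')

    with open(path, 'r') as fd:   # 'r' is the default mode
        content = fd.read()

    try:
        open(path, 'x')           # 'x' refuses to open an existing file
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

print(content)
print("Did 'x' fail on the existing file?", exclusive_failed)
```

The with statement closes the file automatically at the end of each block, which is why no explicit fd.close() is needed here.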


In [13]:
type(fd)

_io.TextIOWrapper

In [14]:
lines = fd.readlines()
print("Type of lines:", type(lines))
print(lines)
fd.readlines()

Type of lines: <class 'list'>
['p(HGNC:CYCS) -> act(p(HGNC:CASP9))\n', 'p(HGNC:BCL2) -| p(HGNC:CYCS)']


[]
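
The second call to fd.readlines() returns an empty list because the first call already moved the read position to the end of the file. The sketch below reproduces this on a temporary copy of the two example lines (the file name interactions_demo.txt is an assumption for this illustration) and shows that fd.seek(0) rewinds the file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for interactions.txt with the same two lines.
    path = os.path.join(tmp_dir, 'interactions_demo.txt')
    with open(path, 'w') as fd:
        fd.write('p(HGNC:CYCS) -> act(p(HGNC:CASP9))\np(HGNC:BCL2) -| p(HGNC:CYCS)')

    with open(path) as fd:
        first_read = fd.readlines()    # reads everything up to the end
        second_read = fd.readlines()   # nothing left to read
        fd.seek(0)                     # jump back to the start of the file
        third_read = fd.readlines()    # the lines are available again

print("First read: ", first_read)
print("Second read:", second_read)
print("Third read: ", third_read)
print("File closed after the with block:", fd.closed)
```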

## re - Regular expression operations
Python 2.7.12: https://docs.python.org/2/library/re.html

Python 3.6: https://docs.python.org/3.6/library/re.html

Regular expressions can be used to analyse strings and filter them for specific information. They can be used to read
large text files and automatically extract the interesting data from them.
The re module provides this functionality in Python.

In [15]:
import re

The module requires a valid regular expression, so the correct syntax is needed.

Let's use the regular expression module to analyse a text and extract the interesting information.

We need to set up a text that we want to analyse. Our goal is to extract which proteins are interacting and how they are interacting. First of all we have to write a pattern (regular expression) that represents our question.

In [16]:
text = "Protein A interacts directly with protein B." # the text we want to analyse.

Basic syntax for regular expressions:

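'^' - Beginning of the string.

'(' - Beginning of a group.

')' - End of a group.

'*' - The preceding element 0 to n times.

'+' - The preceding element 1 to n times.

'\s' - Whitespace.

'[a-z]' - Any lowercase letter.

'[A-Z]' - Any uppercase letter.

'?' - Different meanings. Here, placed after '+', it makes the quantifier lazy: match as few characters as possible before the next part of the pattern is reached.

'$' - End of the string.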
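
The elements of this list can be tried out one at a time. The following sketch applies each of them to the example sentence; the variable names are chosen only for this illustration.

```python
import re

sample = "Protein A interacts directly with protein B."

uppercase_letters = re.findall(r"[A-Z]", sample)   # each single uppercase letter
lowercase_runs = re.findall(r"[a-z]+", sample)     # '+': runs of 1 to n lowercase letters
starts_with_protein = re.search(r"^Protein", sample) is not None  # '^' anchors at the start
ends_with_dot = re.search(r"\.$", sample) is not None             # '$' anchors at the end
grouped = re.search(r"with\s(protein)", sample).group(1)          # '(...)' captures a group

greedy = re.search(r"P.+t", sample).group()   # '+' alone takes as much as possible
lazy = re.search(r"P.+?t", sample).group()    # '+?' takes as little as possible

print(uppercase_letters)
print(lowercase_runs)
print(starts_with_protein, ends_with_dot)
print(grouped)
print(greedy)
print(lazy)
```

The last two lines show the effect of '?' after a quantifier: the greedy version runs up to the last 't' in the sentence, while the lazy version stops at the first one.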

We will define three groups in our pattern:

1. First protein.
2. The way it interacts.
3. Second protein.

In [17]:
pattern_definition = r"^Protein\s([A-Z]+)\sinteracts\s([a-z]+?)\swith\sprotein\s([A-Z]+?)\.$" # raw string (r"...") keeps the backslashes as written

Now we use the re.compile() function to create an object that represents our defined pattern.

In [18]:
pattern = re.compile(pattern_definition) # compiling the pattern_definition. Creates an object.
type(pattern)

_sre.SRE_Pattern

Now we can use the pattern object to match it with the given text and extract the groups.

In [19]:
result = pattern.match(text)
result.groups()

('A', 'directly', 'B')

Our pattern can now be used to extract information from texts similar to:

Protein [Protein_name] interacts [interaction_type] with protein [Protein_name].

In [22]:
text_2 = "Protein SOMETHING interacts somehow with protein ANOTHER."
pattern.match(text_2).groups()

('SOMETHING', 'somehow', 'ANOTHER')
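
If a text does not fit the pattern, match() returns None, and calling .groups() on it raises an AttributeError. A minimal sketch of a safer way to use the pattern is shown below; the second example sentence is made up for this illustration.

```python
import re

pattern = re.compile(r"^Protein\s([A-Z]+)\sinteracts\s([a-z]+?)\swith\sprotein\s([A-Z]+?)\.$")

texts = [
    "Protein A interacts directly with protein B.",
    "Protein A binds protein B.",   # hypothetical sentence that does not fit the pattern
]

extracted = []
for text in texts:
    result = pattern.match(text)
    if result is None:              # no match: there are no groups to read
        print("No match:", text)
        extracted.append(None)
    else:
        print("Match:", result.groups())
        extracted.append(result.groups())
```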

Let's make our pattern more generic, so it will find more results!

Firstly: Let's exchange the definitions for letters ([a-z] / [A-Z]) with this: \w. This change allows us to find any alphanumeric character (and the underscore) in the text.

Secondly: We have defined that the text will start with an uppercase 'Protein'. Distinguishing between upper- and lowercase may decrease the number of results. Let's get rid of that by adding the re.IGNORECASE (re.I) flag!

In [23]:
generic_pattern_definition = r"^protein\s*(\w+?)\s*interacts\s*(\w+?)\swith\sprotein\s(\w+?)\.$" # changed definition (raw string)
generic_pattern = re.compile(generic_pattern_definition, re.I) # we added re.I to ignore case!
print("Text_2 matches:",generic_pattern.match(text_2).groups()) # let's see if the new definition works with text_2.

text_3 = "protein Some1 interacts badly with protein oTHER32." # new text to search in.
print("Text_3 matches:",generic_pattern.match(text_3).groups())

Text_2 matches: ('SOMETHING', 'somehow', 'ANOTHER')
Text_3 matches: ('Some1', 'badly', 'oTHER32')


At the moment we receive a tuple of strings. Without knowing the order of the groups in our pattern, we would not know which element is the subject and which one is the object of the interaction. Let's add something that helps us identify the elements. A dictionary would be great here!

In [24]:
generic_pattern_definition = r"^protein\s(?P<subject>\w+?)\sinteracts\s(?P<interaction_type>\w+?)\swith\sprotein\s(?P<object>\w+?)\.$"
generic_pattern = re.compile(generic_pattern_definition, re.I)
print("Text_3 matches:",generic_pattern.match(text_3).groupdict()) # we want to receive a dictionary!

Text_3 matches: {'subject': 'Some1', 'object': 'oTHER32', 'interaction_type': 'badly'}
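
Named groups defined with (?P<name>...) can also be read one at a time. The following sketch shows that a named group stays reachable by its position as well; the variable names are chosen only for this illustration.

```python
import re

named_pattern = re.compile(
    r"^protein\s(?P<subject>\w+?)\sinteracts\s(?P<interaction_type>\w+?)"
    r"\swith\sprotein\s(?P<object>\w+?)\.$",
    re.I,
)

match = named_pattern.match("protein Some1 interacts badly with protein oTHER32.")

by_name = match.group('subject')    # access by group name
by_index = match.group(1)           # the same group, accessed by position
as_dict = match.groupdict()         # all named groups at once

print(by_name, by_index)
print(as_dict['interaction_type'], as_dict['object'])
```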


Let's make sure that we also identify interactions when they are only part of a sentence!

In [25]:
generic_pattern_definition = r".*?(protein\s(?P<subject>[^\s]\w+?)\sinteracts\s(?P<interaction_type>\w+?)\swith\sprotein\s(?P<object>[^\s]\w+))"
generic_pattern = re.compile(generic_pattern_definition, re.I)

text_4 = "Our research resulted in protein Some1 interacts directly with protein SomeElse32!" # new text to match against!
print("Text_4 matches:",generic_pattern.match(text_4).groupdict())

Text_4 matches: {'subject': 'Some1', 'object': 'SomeElse32', 'interaction_type': 'directly'}
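
An alternative to the leading .*? is to let the re module look for the pattern anywhere in the text: match() only tries at the start of the string, while search() scans the whole string and finditer() returns every occurrence. The sketch below assumes a slightly simplified pattern and a made-up text with two interactions.

```python
import re

interaction_pattern = re.compile(
    r"protein\s(?P<subject>\w+)\sinteracts\s(?P<interaction_type>\w+)"
    r"\swith\sprotein\s(?P<object>\w+)",
    re.I,
)

# Hypothetical text that contains two interactions.
long_text = ("Our research resulted in protein Some1 interacts directly with protein SomeElse32! "
             "Also, protein X1 interacts weakly with protein Y2.")

at_start = interaction_pattern.match(long_text)       # None: the text starts with 'Our'
first_hit = interaction_pattern.search(long_text)     # first occurrence anywhere in the text
all_hits = [m.groupdict() for m in interaction_pattern.finditer(long_text)]

print(at_start)
print(first_hit.groupdict())
print(all_hits)
```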


## json - JSON encoder and decoder
Python 2.7.12: https://docs.python.org/2/library/json.html

Python 3.6: https://docs.python.org/3.6/library/json.html

JSON (JavaScript Object Notation) is a data format whose objects look similar to Python dictionaries. It is widely used to exchange data in a standardized way. Python's json module provides an API to work with JSON data from Python.

In [1]:
import json

Let's create a simple list of data.

In [2]:
data = [1, 'one', {'cat':'Katze','dog':'Hund','nothing':None,'thats_right':True,'thats_wrong':False}]
type(data)

list

The type of our data list is 'list', so it is a basic Python datatype. In addition, our list contains a Python dictionary as its third element (index 2). Now we can transform our Python list into a JSON string by using the json.dumps() function.

In [3]:
json_data = json.dumps(data)
type(json_data)

str

What's that? The datatype of our json_data is a string? That's right! Python has no type called 'json', so json.dumps() transforms our data list into a JSON-formatted string. We can also see that Python values like 'None', 'True' and 'False' were transformed to their JSON equivalents 'null', 'true' and 'false'!

In [4]:
print(json_data)

[1, "one", {"nothing": null, "thats_right": true, "cat": "Katze", "thats_wrong": false, "dog": "Hund"}]
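
The layout of the resulting string can be controlled with optional arguments of json.dumps(). As a small sketch, sort_keys=True gives a stable key order and indent=2 produces a more readable, multi-line string:

```python
import json

data = [1, 'one', {'cat': 'Katze', 'dog': 'Hund', 'nothing': None,
                   'thats_right': True, 'thats_wrong': False}]

compact = json.dumps(data, sort_keys=True)            # one line, keys sorted alphabetically
pretty = json.dumps(data, indent=2, sort_keys=True)   # indented, one value per line

print(compact)
print(pretty)
```

Both strings describe the same data; only the whitespace and the key order differ.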


We can exchange this string with users of other programming languages, as long as we tell them that the format is JSON. This allows them to parse it and extract the data directly.

Now let's use our JSON data the other way around: extract the information from the string and transform it into a Python datatype. To fulfil this task we will use the json.loads() function.

In [6]:
python_data = json.loads(json_data)
type(python_data)

python_data[2]['cat']

'Katze'

The json.loads() function transforms JSON strings into Python objects. This allows us to use data that other users sent to us in the JSON format directly in Python.
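
For files, the json module also offers json.dump() and json.load(), which write to and read from an open file object instead of a string. The following is a minimal sketch; the file name data_demo.json and the temporary directory are assumptions for this illustration.

```python
import json
import os
import tempfile

data = {'cat': 'Katze', 'numbers': [1, 2, 3], 'nothing': None}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data_demo.json')   # hypothetical file name

    with open(path, 'w') as fd:
        json.dump(data, fd)          # serialize directly into the file

    with open(path) as fd:
        restored = json.load(fd)     # parse the file content back into Python objects

print(restored == data)

# Not every Python type survives the round trip unchanged: tuples come back as lists.
round_tripped_tuple = json.loads(json.dumps((1, 2)))
print(round_tripped_tuple)
```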

# Exercise

Let's combine the three approaches. 

1. First of all we want to load the interactions file into our program.
2. Use the `re` module to filter the lines for the information.
3. Store the information in the JSON format and create a new file that contains this JSON string. Name it decoded_interactions.txt
4. Upload your solution and resulting JSON file to your GitHub.


## Steps you should follow
1. open the file
2. read the lines
3. define the regular expression
4. compile the regular expression
5. loop over lines and match regular expression
6. print results

Output:
p(HGNC:CYCS) -> act(p(HGNC:CASP9))
```json
{"subject": {"namespace": "HGNC", "name": "CYCS", "biological_type": "p"},
 "relation": "->",
 "object": {"namespace": "HGNC", "name": "CASP9", "biological_type": "p"}}
```

In [44]:
import os, re
fd = open(os.sep.join(("..","data","reading_searching_sending","interactions.txt")))

lines = fd.readlines()
                      #   p     ( HGNC   :  BCL2         )     -|        p    (  HGNC  :   CYCS        )
pattern_definition = r"^([a-z])\(([A-Z]+):([a-zA-Z0-9]+)\)\s(->|-\|)\s([a-z])\(([A-Z]+):([a-zA-Z0-9]+)\)$"
pattern = re.compile(pattern_definition)
for line in lines:
    match_result = pattern.match(line)
    if match_result:
        result = match_result.groups()
        print(result)
        print(type(result))

        sub_obj_keys=('biological_type','namespace','name')
        # the first three groups describe the subject, the last three the object
        sub_dict = dict(zip(sub_obj_keys,result[0:3]))
        obj_dict = dict(zip(sub_obj_keys,result[-3:]))
        dictionary = {'subject':sub_dict,
                        'relation':result[3],
                        'object':obj_dict}

        print(dictionary)

('p', 'HGNC', 'BCL2', '-|', 'p', 'HGNC', 'CYCS')
<class 'tuple'>
{'object': {'name': 'CYCS', 'biological_type': 'p', 'namespace': 'HGNC'}, 'relation': '-|', 'subject': {'name': 'BCL2', 'biological_type': 'p', 'namespace': 'HGNC'}}
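
The pattern in the cell above only produces a result for the second line of the file: the first line wraps its object in act(...), which the pattern does not allow. One possible way to cover both lines and to produce the JSON string for step 3 is sketched below. It is not the only solution; the optional act( ... ) part is an assumption about how the wrapper should be handled, and the lines are given inline instead of being read from the file.

```python
import json
import re

# The two lines of interactions.txt, given inline for this sketch.
lines = ['p(HGNC:CYCS) -> act(p(HGNC:CASP9))\n', 'p(HGNC:BCL2) -| p(HGNC:CYCS)']

# One BEL term such as p(HGNC:BCL2): biological type, namespace and name.
term = r"([a-z])\(([A-Z]+):([a-zA-Z0-9]+)\)"
# Assumption: the object may optionally be wrapped in act( ... ).
bel_pattern = re.compile(r"^" + term + r"\s(->|-\|)\s(?:act\()?" + term + r"\)?$")

keys = ('biological_type', 'namespace', 'name')
interactions = []
for line in lines:
    match_result = bel_pattern.match(line)
    if match_result is None:
        continue                      # skip lines that do not fit the pattern
    groups = match_result.groups()
    interactions.append({
        'subject': dict(zip(keys, groups[0:3])),
        'relation': groups[3],
        'object': dict(zip(keys, groups[-3:])),
    })

json_string = json.dumps(interactions, indent=2)
print(json_string)
```

Note that this pattern is deliberately permissive: it does not check that an opening act( is closed again, and it drops the information that the object was wrapped in act(). The resulting json_string could then be written to decoded_interactions.txt with a file opened in 'w' mode.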


In [None]:
text="p(HGNC:BCL2) -| p(HGNC:CYCS)"

pattern = "p(HGNC:BCL2) -| p(HGNC:CYCS)"