In [None]:
# from IPython.display import Image
from IPython.display import clear_output
from IPython.display import FileLink, FileLinks

## Introduction to

![title](img/python-logo-master-flat.png)

### for scientific computing

#### - Lecture 10

## Regular Expressions

<br><br><br>
- A smarter way of searching text

- Search and replace

- Relatively advanced topic
    - But incredibly useful
- https://xkcd.com/208/

## Regular Expressions

- A formal language for defining search patterns

- Lets you search not only for exact strings but also for controlled variations of a string.

- Why?

- Examples:
   - Find specific patterns in text
     - Find an email address in text: `The results should be sent to user@mail.com automatically`
     - Find all hydrocarbons in a text containing compounds `C2H6, NA2, H2O, CH4`
   - American/British spelling, endings and other variants:
     - salpeter, salpetre, saltpeter, nitre, niter or KNO3
     - hemaglobin, heamoglobin, hemaglobins, heamoglobin's
     - catalyze, catalyse, catalyzed...
   - Find/Replace

## Regular Expressions

- When?


- To find information
    - in your files
    - in your code
    - in a database
    - online
    - in a bunch of articles
    - ...

 - Search/replace
   - becuase &rarr; because
   - color &rarr; colour
   - `\t` (tab) &rarr; `"    "` (four spaces)

- Supported by most programming languages, text editors, search engines...

### Defining a search pattern

<center>
<img src="img/color.png" alt="regex" width="50%"/>
<img src="img/salpeter.png" alt="regex" width="50%"/>

</center>

#### Common operations
Building blocks for creating patterns
- `.` matches any single character (except a newline, by default)
- `?` repeat previous pattern 0 or 1 times
- `*` repeat previous pattern 0 or more times
- `+` repeat previous pattern 1 or more times

Pattern for matching the colour family

`colour.*`


`.*` matches everything (including the empty string)!

Pattern for matching the different spellings

`salt?peter`

<center>What about the different endings: er-re?</center>

<center><code>"salt?pet.."</code></center>



<center><font color="green">saltpeter</font></center>


<center><font color="red">"saltpet88"</font></center>
<center><font color="red">"salpetin"</font></center>
<center><font color="red">"saltpet  "</font></center>


#### More common operations - classes of characters

- `\w` matches any letter or number, and the underscore
- `\d` matches any digit
- `\D` matches any non-digit
- `\s` matches any whitespace (spaces, tabs, ...)
- `\S` matches any non-whitespace

`\w+`

![result](img/regex_w.png)

`\d+`

![result](img/regex_d.png)

`\s+`

![result](img/regex_s.png)

#### More common operations - classes of characters

- `[abc]` matches a single character defined in this set {a, b, c}
- `[^abc]` matches a single character that is **not** a, b or c

#### `[a-z]` matches any single lowercase letter from `a` to `z` (the English alphabet).

#### `[a-z]+` matches any lowercase English word.
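
The character classes above can be tried out in Python. This is a minimal sketch that reuses the compound list from the hydrocarbon example; `re.findall` is a standard-library function, the variable names are made up for illustration, and the hydrocarbon pattern is a deliberate simplification.

```python
import re

compounds = "C2H6, NA2, H2O, CH4"

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", compounds)

# [A-Z]+ : one or more uppercase letters in a row
letters = re.findall(r"[A-Z]+", compounds)

# [^A-Z\d]+ : one or more characters that are neither uppercase letters nor digits
separators = re.findall(r"[^A-Z\d]+", compounds)

# C\d*H\d+ : a C, optional digits, an H, then at least one digit
hydrocarbons = re.findall(r"C\d*H\d+", compounds)

print(digits)
print(letters)
print(separators)
print(hydrocarbons)
```

The last pattern only looks for a `C...H...` shape, so it would need refining before being used on real chemical formulas.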

<center><code>salt<font color="red">?</font>pet<font color="red">[</font>er<font color="red">]+</font>
</code></center>

<center><font color="green">saltpeter</font></center>
<center><font color="green">salpetre</font></center>

<center><strike><font color="red">"saltpet88"</font></strike></center>
<center><strike><font color="red">"salpetin"</font></strike></center>
<center><strike><font color="red">"saltpet  "</font></strike></center>
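
As a quick check, the pattern can be run against the five candidate strings in Python. This is only a sketch: it uses `search` from the `re` module, and the names `candidates` and `results` are made up for illustration.

```python
import re

pattern = re.compile(r"salt?pet[er]+")
candidates = ["saltpeter", "salpetre", "saltpet88", "salpetin", "saltpet  "]

results = {}
for word in candidates:
    # search returns a match object, or None when nothing matches
    results[word] = pattern.search(word) is not None

for word, matched in results.items():
    print(repr(word), "matches" if matched else "does not match")
```

Note that `[er]+` accepts any mix of `e` and `r`, so a string such as `salpetee` would also match.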

**Example - finding patterns in data about genetic mutations**

<font size="5"><code>
1	920760	rs80259304	T	C	.	PASS	AA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131	GT:DP:CB	0/1:1:SM 0/0:4/SM...
</code></font>

- Each row contains a number of samples; each sample starts with two values, each 0 or 1, separated by `/`

`0/0`  `0/1`  `1/1`  ...

`"[01]/[01]"`      (or   `"\d/\d"`)

```\s[01]/[01]:```

**Example - finding patterns in vcf**

<font size="5"><code>
1	920760	rs80259304	T	C	.	PASS	AA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131	GT:DP:CB	0/1:1:SM 0/0:4/SM...
</code></font>

- Find all lines containing more than one homozygous (`1/1`) sample.

`... 1/1:...  ... 1/1:...  ...`

```.*1/1.*1/1.*```

```.*\s1/1:.*\s1/1:.*```
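
A possible way to apply this pattern in Python is sketched below. The two rows are shortened, made-up examples in a VCF-like layout, not real data. Because `search` looks anywhere in the string, the leading and trailing `.*` can be left out.

```python
import re

# Hypothetical, shortened tab-separated rows (not real data)
lines = [
    "1\t920760\trs1\tT\tC\t.\tPASS\tAA=T\tGT:DP:CB\t0/1:1:SM\t1/1:4:SM\t1/1:2:SM",
    "1\t920761\trs2\tG\tA\t.\tPASS\tAA=G\tGT:DP:CB\t0/0:3:SM\t1/1:4:SM\t0/1:2:SM",
]

# Two occurrences of "1/1", each preceded by whitespace and followed by ":"
pattern = re.compile(r"\s1/1:.*\s1/1:")

matching = []
for line in lines:
    if pattern.search(line):
        matching.append(line)

print(len(matching), "line(s) with more than one 1/1 sample")
```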

## Test your regexes online before writing the code

* https://regex101.com
* https://regexr.com/

### Regular expressions in Python

In [1]:
import re

In [2]:
p = re.compile('ab*')
p

re.compile(r'ab*', re.UNICODE)

### Searching

In [11]:
p = re.compile('ab.')

if p.search('cdefg e90834uq'):
    print("found")
else:
    print("not found")
    
result = p.search("abcd")
result

not found


<re.Match object; span=(0, 3), match='abc'>

In [35]:
print(p.search('cb'))

None


In [12]:
p = re.compile('HELLO')
m = p.search('gsdfgsdfgs  HELLO  __!@£§≈[|ÅÄÖ‚…’ﬁ]')

print(m)

<re.Match object; span=(12, 17), match='HELLO'>


### Case insensitivity

In [37]:
p = re.compile('[a-z]+')
result = p.search('ATGAAA')
print(result)

None


In [38]:
p = re.compile('[a-z]+', re.IGNORECASE)

result = p.search('ATGAAA')
result

<re.Match object; span=(0, 6), match='ATGAAA'>

### The match object

In [13]:
p = re.compile('[ATCGU]+', re.IGNORECASE)

result = p.search('123 ATGAAA 456')
result

<re.Match object; span=(4, 10), match='ATGAAA'>

`result.group()`: Return the string matched by the expression

`result.start()`: Return the starting position of the match

`result.end()`: Return the ending position of the match

`result.span()`: Return both (start, end)

In [14]:
result.group()

'ATGAAA'

In [43]:
result.start()

4

In [44]:
result.end()

10

In [45]:
result.span()

(4, 10)

### Zero or more...?

In [15]:
p = re.compile('.*HELLO.*')

In [16]:
m = p.search('lots of text  HELLO  more text and characters!!! ^^')

In [17]:
m.group()

'lots of text  HELLO  more text and characters!!! ^^'

The `*` is **greedy**: it matches as many characters as it can while still letting the whole pattern match.
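
To illustrate, the sketch below compares `.*` with `.*?`. The `*?` form is the non-greedy (lazy) variant from Python's `re` syntax, which these slides do not otherwise cover; the variable names are made up.

```python
import re

text = "lots of text  HELLO  more text  HELLO ... and characters!!! ^^"

# Greedy: .* grabs as much as possible, so the match ends at the last HELLO
greedy = re.compile(r".*HELLO").search(text).group()

# Lazy: .*? grabs as little as possible, so the match ends at the first HELLO
lazy = re.compile(r".*?HELLO").search(text).group()

print(greedy)
print(lazy)
```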

### Finding all the matching patterns

In [18]:
p = re.compile('HELLO')
objects = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')
print(objects)

<callable_iterator object at 0x7fbd9c0d4e80>


In [19]:
for m in objects:
    print(f'Found {m.group()} at position {m.start()}')

Found HELLO at position 14
Found HELLO at position 32


In [51]:
objects = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')
for m in objects:
    print('Found {} at position {}'.format(m.group(), m.start()))

Found HELLO at position 14
Found HELLO at position 32


### How to find a full stop?

In [20]:
txt = "The first full stop is here: ."
p = re.compile('.')

m = p.search(txt)
print('"{}" at position {}'.format(m.group(), m.start()))

"T" at position 0


In [21]:
p = re.compile(r'\.')  # raw string, so Python passes the backslash to re unchanged

m = p.search(txt)
print('"{}" at position {}'.format(m.group(), m.start()))

"." at position 29


### More operations
- `\` escaping a character
- `^` beginning of the string
- `$` end of string
- `|` boolean `or`

`^hello$`

<center>
    <code>salt<font color="red">?</font>pet<font color="red">(</font>er<font color="red">|</font>re<font color="red">)</font><font color="red">|</font>nit<font color="red">(</font>er<font color="red">|</font>re<font color="red">)</font><font color="red">|</font>KNO3</code>
</center>
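
A minimal sketch of these operations in Python, using the spelling variants listed earlier in the lecture. The pattern is written without spaces around `|`, since a space inside a pattern is matched literally. The variable names are made up for illustration.

```python
import re

# Spelling variants from the earlier slide
text = "salpeter, salpetre, saltpeter, nitre, niter or KNO3"

pattern = re.compile(r"salt?pet(er|re)|nit(er|re)|KNO3")
found = [m.group() for m in pattern.finditer(text)]
print(found)

# ^ and $ tie the pattern to the start and the end of the string
anchored = re.compile(r"^hello$")
print(anchored.search("hello"))
print(anchored.search("hello world"))
```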

### Substitution

#### Finally, we can fix our spelling mistakes!

In [54]:
txt = "Do it   becuase   I say so,     not becuase you want!"

In [55]:
import re
p = re.compile('becuase')
txt = p.sub('because', txt)
print(txt)

Do it   because   I say so,     not because you want!


In [56]:
p = re.compile(r'\s+')  # raw string: one or more whitespace characters
p.sub(' ', txt)

'Do it because I say so, not because you want!'
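
The other search/replace cases mentioned earlier (color to colour, tab to four spaces) follow the same recipe. A small sketch with a made-up input line:

```python
import re

line = "color\tred\tblue"  # hypothetical tab-separated line

line = re.compile(r"color").sub("colour", line)  # American -> British spelling
line = re.compile(r"\t").sub("    ", line)       # each tab -> four spaces

print(line)
```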

#### Overview

 - Construct regular expressions
 
     ```py
     p = re.compile(pattern)
     ```
     
 - Searching
 
     ```py
     p.search(text)
     ```
     
 - Substitution
 
     ```py
     p.sub(replacement, text)
     ```

**Typical code structure:**

```python
p = re.compile( ... )
m = p.search('string goes here')
if m:
    print('Match found: ', m.group())
else:
    print('No match')
```

### Regular expressions


- A powerful tool to search and modify text

- There is much more to read in the [docs](https://docs.python.org/3/library/re.html)

- Note: regex comes in different flavours. If you use it outside Python, there might be small variations in the syntax.

<h3> <center>Sum up!</center></h3>

#### Processing files - looping through the lines

```py
fh = open('myfile.txt')
for line in fh:
    do_stuff(line)
```

#### Store values

```py
iterations = 0
information = []

fh = open('myfile.txt', 'r')
for line in fh:
    iterations += 1
    information += do_stuff(line)
```

#### Values

- Base types:

  ```py
  str     "hello"
  int     5
  float   5.2
  bool    True
  ```

- Collections:

  ```py
  list  ["a", "b", "c"]
  dict  {"a": "alligator", "b": "bear", "c": "cat"}
  tuple ("this", "that")
  set   {"drama", "sci-fi"}
  ```

**Assign values**
```py
iterations = 0
score = 5.2
```

#### Compare and membership
```py
+, -, *,...  # mathematical
and, or, not  # logical 
==, !=        # comparisons
<, >, <=, >=  # comparisons
in            # membership
```

In [22]:
value = 4
nextvalue = 1
nextvalue += value
print('nextvalue: ', nextvalue, 'value: ', value)

nextvalue:  5 value:  4


In [23]:
x = 5
y = 7
z = 2
x > 6 and y == 7 or z > 1

True

In [25]:
(x > 6 and y == 7) or z > 1

True

#### Strings

- Work like a list of characters
  - ```py
  s += "more words"  # add content
  ```
  - ```py
  s[4]               # get character at index 4
  ```
  - ```py
  'e' in s           # check for membership
  ```
  
  - ```py
  len(s)             # check size
  ```


- But are immutable

  - ```py
  > s[2] = 'i'
  ```
  ---
  ```py
  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  TypeError: 'str' object does not support item assignment
```

#### Strings

Raw text

- Common manipulations:
 
  -  ```py
  s.strip()  # remove unwanted spacing
  ```
  -  ```py  
  s.split()  # split line into columns
  ```
  -  ```py
  s.upper(), s.lower()  # change the case
  ```



- Regular expressions help you find and replace strings.

  - ```py
  p = re.compile('A.A.A')
  p.search(dnastring)
  ```
  - ```py
  p = re.compile('T')
  p.sub('U', dnastring)
  ```

In [26]:
import re

p = re.compile(r'p.*\sp')  # the greedy star!

p.search('a python programmer writes python code').group()

'python programmer writes p'

#### Collections

Can contain strings, integers, booleans...
- **Mutable**: you can *add*, *remove*, *change* values

  - Lists:
  ```py
    mylist.append('value')
```

  - Dicts:
```py
    mydict['key'] = 'value'
```

  - Sets:
```py
    myset.add('value')
```

#### Collections

- Test for membership:
```py
value in myobj
```

- Check size:
```py
len(myobj)
```

#### Lists

- Ordered!

```py
todolist = ["work", "sleep", "eat", "work"]

todolist.sort()
todolist.reverse()
todolist[2]
todolist[-1]
todolist[2:6]
```

In [27]:
todolist = ["work", "sleep", "eat", "work"]

In [28]:
todolist.sort()
print(todolist)

['eat', 'sleep', 'work', 'work']


In [29]:
todolist.reverse()
print(todolist)

['work', 'work', 'sleep', 'eat']


In [64]:
todolist[2]

'sleep'

In [65]:
todolist[-1]

'eat'

In [66]:
todolist[2:]

['sleep', 'eat']

#### Dictionaries

- Keys have values

```py
mydict = {"a": "alligator", "b": "bear", "c": "cat"}
counter = {"cats": 55, "dogs": 8}

mydict["a"]
mydict.keys()
mydict.values()
```



In [31]:
counter = {'cats': 0, 'others': 0}

for animal in ['zebra', 'cat', 'dog', 'cat']:
    if animal == 'cat':
        counter['cats'] += 1
    else:
        counter['others'] += 1
    if "zebra" in counter:
        counter["zebra"] +=1
    else:
        counter["zebra"] = 1
        
counter

{'cats': 2, 'others': 2, 'zebra': 4}
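
A common variation of this pattern counts every distinct item by using the item itself as the key. A minimal sketch with the same animal list:

```python
counter = {}

for animal in ['zebra', 'cat', 'dog', 'cat']:
    if animal in counter:
        counter[animal] += 1   # seen before: increase its count
    else:
        counter[animal] = 1    # first occurrence: create the key

print(counter)
```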

#### Sets

- Bag of values
 
  - No order
  
  - No duplicates 

  - Fast membership checks
  
  - Logical set operations (union, difference, intersection...)


```py
myset = {"drama", "sci-fi"}

myset.add("comedy")

myset.remove("drama")
```
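
The logical set operations can be sketched as follows. The two sets and their names are made up for illustration; `&`, `|` and `-` are the built-in operators for intersection, union and difference.

```python
seen = {"drama", "sci-fi", "comedy"}
wishlist = {"comedy", "horror"}

both = seen & wishlist        # intersection: in both sets
either = seen | wishlist      # union: in at least one set
only_seen = seen - wishlist   # difference: in seen but not in wishlist

# sorted() gives a stable order for printing, since sets are unordered
print(sorted(both))
print(sorted(either))
print(sorted(only_seen))
```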


In [69]:
todolist = ["work", "sleep", "eat", "work"]

todo_items = set(todolist)
todo_items

{'eat', 'sleep', 'work'}

In [71]:
todo_items.add("study")
todo_items

{'eat', 'sleep', 'study', 'work'}

In [72]:
todo_items.add("eat")
todo_items

{'eat', 'sleep', 'study', 'work'}

#### Tuples

- A group (usually two) of values that belong together
  
  - ```py
tup = (max_length, sequence)
```
  - An ordered sequence (like lists)

  - ```py
length = tup[0]  # get content at index 0
```
  - Immutable

In [32]:
tup = (2, 'xy')
tup[0]

2

In [33]:
tup[0] = 2

TypeError: 'tuple' object does not support item assignment

#### Tuples in functions
```py
def find_longest_seq(file):
    # some code here...
    return length, sequence

```

```py
answer = find_longest_seq(filepath)
print('length', answer[0])
print('sequence', answer[1])
```

```py
# Or unpack the returned tuple directly into two variables
length, sequence = find_longest_seq(filepath)
```
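
A runnable sketch of the same idea. To keep it self-contained, this hypothetical version of `find_longest_seq` takes a list of sequences instead of a file path; the body is one possible implementation, not the one from the lecture.

```python
def find_longest_seq(sequences):
    """Return (length, sequence) for the longest string in a list."""
    longest = ""
    for seq in sequences:
        if len(seq) > len(longest):
            longest = seq
    return len(longest), longest   # the two values are returned as one tuple


answer = find_longest_seq(["ATG", "ATGAAA", "TT"])
length, sequence = answer          # unpack the tuple into two variables

print(answer)
print(length, sequence)
```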

#### Deciding what to do

```py
if count > 10:
   print('big')
elif count > 5:
   print('medium')
else:
   print('small')
```

In [34]:
shopping_list = ['bread', 'egg', 'butter', 'milk']
tired         = True

if len(shopping_list) > 4:
    print('Really need to go shopping!')
elif not tired:
    print('Not tired? Then go shopping!')
else:
    print('Better to stay at home')   

Better to stay at home


#### Deciding what to do - if statement

<img src="img/if_else_statement.png" alt="Drawing" style="width: 600px;"/> 

#### Program flow - for loops

```py
information = []
fh = open('myfile.txt', 'r')

for line in fh:
    if is_comment(line):
        use_comment(line)
    else:
        information += read_data(line)
```

<img src="img/forloop.png" alt="Drawing" style="width: 600px;"/> 

#### Program flow - while loops

```py
keep_going = True
information = []
index = 0

while keep_going:
    current_line = lines[index]
    information += read_line(current_line)
    index += 1
    if check_something(current_line):
        keep_going = False
```

<img src="img/whileloop.png" alt="Drawing" style="width: 600px;"/> 

#### Different types of loops

__`For` loop__

is a control flow statement that performs operations over a known number of steps.

__`While` loop__

is a control flow statement that allows code to be executed repeatedly based on a given Boolean condition.

<br>

__Which one to use?__

`For` loops - standard for iterations over lists and other iterable objects

`While` loops - more flexible and can iterate an unspecified number of times


In [36]:
user_input = "thank god it's friday"
for letter in user_input:
    print(letter.upper())

T
H
A
N
K
 
G
O
D
 
I
T
'
S
 
F
R
I
D
A
Y


In [37]:
i = 0
while i < len(user_input):
    letter = user_input[i]
    print(letter.upper())
    i += 1

T
H
A
N
K
 
G
O
D
 
I
T
'
S
 
F
R
I
D
A
Y


In [None]:
i = 0
go_on = True
while go_on:
    c = user_input[i]
    print(c.upper())
    i += 1
    if c == 'd':
        go_on = False

#### Controlling loops

- `break` - stop the loop
- `continue` - go on to the next iteration

In [39]:
user_input = "thank god it's friday"
for letter in user_input:
    
    if letter == 'd':
        continue
    print(letter.upper())

T
H
A
N
K
 
G
O
 
I
T
'
S
 
F
R
I
A
Y


In [None]:
i = 0
while True:    
    c = user_input[i]
    i += 1
    if c in 'aoueiy':
        continue
    print(c.upper())
    if c == 'd':
        break

**Watch out!**

In [79]:
# DON'T RUN THIS
i = 0
while i < 10:    # i is never incremented, so the condition stays True
    print(user_input[i])


While loops may be infinite!
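
For contrast, here is a sketch of a `while` loop that does terminate, because the counter is updated on every pass. The list `printed` is only there to collect the characters for illustration.

```python
user_input = "thank god it's friday"

i = 0
printed = []
while i < 10:
    printed.append(user_input[i])
    i += 1   # without this line the condition would never become False

print("".join(printed))
```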

#### Input/Output

- In:

  - Read files: `fh = open(filename, 'r')`
      - `for line in fh:`
      - `fh.read()`
      - `fh.readlines()`
  - Read information from command line: `sys.argv[1:]`

- Out:

  - Write files: `fh = open(filename, 'w')`
       - `fh.write(text)`
  - Printing: `print('my_information')`
  


#### Input/Output

- Open files should be closed:
    - `fh.close()`
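
A small sketch of both ways to make sure a file gets closed. The file is written to a temporary directory so the example is self-contained, and the file name is made up. The `with` statement is standard Python, but it is an addition here that the slides do not cover.

```python
import os
import tempfile

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

fh = open(path, "w")
fh.write("first line\nsecond line\n")
fh.close()                    # explicit close

with open(path, "r") as fh:   # closed automatically when the block ends
    lines = fh.readlines()

print(lines)
print(fh.closed)
```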


#### Formatting

<pre>
{[field_name] [: <font color="green">format_spec</font>]}
</pre>

*Format_spec:*
<pre>
<font color="green">filling alignment width precision type</font>
   -       &gt;        10    .2       f
</pre>

```py
print('|{:30}|{:^10}|{:^10.2f}|'.format(movie, votes, total/votes))
```
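
A runnable sketch of the format specification above. The values for `movie`, `votes` and `total` are made up for illustration.

```python
movie, votes, total = "Alien", 4, 17   # hypothetical values

# Width 30 (left-aligned text), width 10 centred, width 10 centred with 2 decimals
row = '|{:30}|{:^10}|{:^10.2f}|'.format(movie, votes, total / votes)
print(row)

# filling "-", alignment ">", width 10, precision .2, type f
filled = '{:->10.2f}'.format(5.2)
print(filled)
```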

#### Code structure
- Functions
- Modules

#### Functions

- A named piece of code that performs a certain task.

<img src="img/function_structure_explained.png" alt="Drawing" style="width: 600px;"/>  


- Takes a number of input arguments
  - which are in scope within the function body
- Returns a result (possibly `None`)

#### Functions - keyword arguments

```py
def prettyprinter(name, value, delim=":", end=None):
    out = "The " + name + " is " + delim + " " + value
    if end:
        out += end
    return out
```


- used to set default values (often `None`)
- can be skipped in function calls
- improve readability
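
A short usage sketch with the `prettyprinter` function from this slide, repeated here so the example runs on its own. The argument values are made up.

```python
def prettyprinter(name, value, delim=":", end=None):
    out = "The " + name + " is " + delim + " " + value
    if end:
        out += end
    return out


# Keyword arguments skipped: the defaults delim=":" and end=None are used
default_call = prettyprinter("title", "Alien")

# Keyword arguments given by name, which also documents what they mean
custom_call = prettyprinter("title", "Alien", delim="=", end="!")

print(default_call)
print(custom_call)
```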

#### Files (modules)

- A (larger) piece of code containing functions, classes...
- Corresponds to a file
- Can be imported

```py
import mymodule
```
- Can be run as a script

```> python3 mymodule.py```

#### Using your code

Any longer pieces of code that have been used and will be re-used should be saved

- Save it as a file `mycode.py`

- To run it:
`python3 mycode.py`

- Import it:
`import mycode`




#### Documentation and comments

- ```py
""" This is a doc-string explaining what the purpose of this function/module is """
```
- ```py
# This is a comment that helps you understand the code
```

- Comments *will* help you

- Undocumented code rarely gets used

- Try to keep your code readable: use informative variable and function names

<div style="overflow-y:scroll; max-height:900px">
<img src="img/example_code.png" alt="Module"/>
    </div>

#### Why programming?

- Computers are fast
- Computers don't get bored
- Computers don't get sloppy

- Create reproducible results
- Extract large amounts of information

#### Final advice

- Stop and think before you start coding
    - use pseudocode
    - use top-down programming (divide and conquer)
    - use paper and pen
    - take breaks

- You know the basics - don't be afraid to try, it's the only way to learn
- You will get faster


#### Final advice (for real)

- Getting help
  - search the web ("pandas filter dataframe multiple columns", "python find all regexes")
  - ask colleagues
  - talk about your problem (get a rubber duck https://en.wikipedia.org/wiki/Rubber_duck_debugging)
  - maybe send me an email

## Final project

* Just a way to show you understand the basics
* Nothing complicated if you have gone through the slides
* Instructions [here](https://github.com/clami66/workshop-python/blob/0422/project/instructions.ipynb) [(download link)](https://github.com/clami66/workshop-python/raw/0422/project/instructions.ipynb)
