# Regular Expressions and `re.split()`

I want to parse a GL String and put all of its alleles into a list. I will use `re.split()` to do this.

A GL String containing all five delimiters (`^`, `|`, `+`, `~`, `/`):

In [1]:
glstring = 'HLA-A*01:01/HLA-A*01:02+HLA-A*24:02|HLA-A*01:03/HLA-A*01:04+HLA-A*24:03^HLA-B*08:01+HLA-B*44:02^HLA-DRB5*01:01~HLA-DRB1*03:01'
print(glstring)

HLA-A*01:01/HLA-A*01:02+HLA-A*24:02|HLA-A*01:03/HLA-A*01:04+HLA-A*24:03^HLA-B*08:01+HLA-B*44:02^HLA-DRB5*01:01~HLA-DRB1*03:01


First, without regular expressions, splitting on one delimiter at a time with nested loops:

In [2]:
def gl_to_allele(gl):
    """Split a GL String on each delimiter in turn, from the outermost (^) to the innermost (/)."""
    alleles = []
    for locus_block in gl.split('^'):
        for genotype in locus_block.split('|'):
            for locus in genotype.split('+'):
                for allele_list in locus.split('~'):
                    for allele in allele_list.split('/'):
                        alleles.append(allele)
    return alleles

In [3]:
gl_to_allele(glstring)

['HLA-A*01:01',
 'HLA-A*01:02',
 'HLA-A*24:02',
 'HLA-A*01:03',
 'HLA-A*01:04',
 'HLA-A*24:03',
 'HLA-B*08:01',
 'HLA-B*44:02',
 'HLA-DRB5*01:01',
 'HLA-DRB1*03:01']

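The same list comes from a single `re.split()` call on a character set containing all five delimiters:

In [4]:
import re
re.split(r'[\^+|~/]', glstring)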

['HLA-A*01:01',
 'HLA-A*01:02',
 'HLA-A*24:02',
 'HLA-A*01:03',
 'HLA-A*01:04',
 'HLA-A*24:03',
 'HLA-B*08:01',
 'HLA-B*44:02',
 'HLA-DRB5*01:01',
 'HLA-DRB1*03:01']
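
As a quick check that the two approaches agree, the sketch below runs both on the same GL String and compares the results. It restates `glstring` and `gl_to_allele` from the cells above so that it runs on its own; the names `alleles_from_loops` and `alleles_from_regex` are only illustrative.

```python
import re

# Same GL String as above, split across two literals for readability.
glstring = ('HLA-A*01:01/HLA-A*01:02+HLA-A*24:02|HLA-A*01:03/HLA-A*01:04+HLA-A*24:03'
            '^HLA-B*08:01+HLA-B*44:02^HLA-DRB5*01:01~HLA-DRB1*03:01')

def gl_to_allele(gl):
    """Nested-loop version: split on one delimiter at a time."""
    alleles = []
    for locus_block in gl.split('^'):
        for genotype in locus_block.split('|'):
            for locus in genotype.split('+'):
                for allele_list in locus.split('~'):
                    for allele in allele_list.split('/'):
                        alleles.append(allele)
    return alleles

alleles_from_loops = gl_to_allele(glstring)
# Regex version: one character set holding all five delimiters.
alleles_from_regex = re.split(r'[\^+|~/]', glstring)

print(alleles_from_loops == alleles_from_regex)  # True
```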

I learned something new about regular expressions in Python. In the example above, I escape the `^` as `\^` because `^` is the negation character in a character set. But it only negates when it is the first character in the set. If I reorder the set so that `^` is anywhere other than first, it is treated as a literal `^`. For example, `re.split('[+^|~/]', glstring)` has the desired behavior, is more readable, and needs no escaping.

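Here's what happens if I don't escape the `^`. With `^` first, the set matches every character except `+`, `|`, `~` and `/`, so the string is split on every character of every allele name (and on the `^` delimiters themselves). Only empty strings and the four remaining delimiters are left: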

In [5]:
re.split('[^+|~/]', glstring)

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '/',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '+',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '|',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '/',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '+',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '+',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '~',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
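
The output above is long because every character of every allele name is treated as a separator. A shorter input makes the effect of the position of `^` easier to see. The string `'a^b+c'` and the variable names here are made up for illustration; it is not a GL String.

```python
import re

sample = 'a^b+c'  # hypothetical short input, not a real GL String

# '^' escaped: the set matches a literal '^' or '+'.
escaped = re.split(r'[\^+]', sample)
print(escaped)    # ['a', 'b', 'c']

# '^' not in first position: also a literal '^', no escape needed.
reordered = re.split(r'[+^]', sample)
print(reordered)  # ['a', 'b', 'c']

# '^' first and unescaped: the set is negated and matches anything except '+'.
negated = re.split(r'[^+]', sample)
print(negated)    # ['', '', '', '+', '']
```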

Here's what happens when I move the `^` away from the first position in the set. This is the desired behavior.

In [6]:
re.split('[+^|~/]', glstring)

['HLA-A*01:01',
 'HLA-A*01:02',
 'HLA-A*24:02',
 'HLA-A*01:03',
 'HLA-A*01:04',
 'HLA-A*24:03',
 'HLA-B*08:01',
 'HLA-B*44:02',
 'HLA-DRB5*01:01',
 'HLA-DRB1*03:01']

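The same pattern written as a raw string, which is the usual convention for regular expressions in Python:

In [7]: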
re.split(r'[+^|~/]', glstring)

['HLA-A*01:01',
 'HLA-A*01:02',
 'HLA-A*24:02',
 'HLA-A*01:03',
 'HLA-A*01:04',
 'HLA-A*24:03',
 'HLA-B*08:01',
 'HLA-B*44:02',
 'HLA-DRB5*01:01',
 'HLA-DRB1*03:01']
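
If the delimiters are kept in a string, one way to avoid thinking about the position of `^` at all is to pass them through `re.escape()` before building the set. This is only a sketch and not part of the notebook above; the names `GL_DELIMITERS`, `GL_SPLIT_PATTERN`, and `split_alleles` are hypothetical.

```python
import re

# '^' is deliberately first here: re.escape() backslash-escapes it,
# so it cannot be read as the negation character.
GL_DELIMITERS = '^|+~/'
GL_SPLIT_PATTERN = re.compile('[' + re.escape(GL_DELIMITERS) + ']')

def split_alleles(gl):
    """Return the alleles of a GL String by splitting on any delimiter."""
    return GL_SPLIT_PATTERN.split(gl)

result = split_alleles('HLA-B*08:01+HLA-B*44:02^HLA-DRB5*01:01~HLA-DRB1*03:01')
print(result)
# ['HLA-B*08:01', 'HLA-B*44:02', 'HLA-DRB5*01:01', 'HLA-DRB1*03:01']
```

Compiling the pattern once is a small convenience when many GL Strings are parsed; a plain `re.split()` call with the same pattern string behaves the same way.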