# Data in Python

## Stats 141B

## Lecture 7


## Computer architecture

### CPU
- a CPU works with a fixed number of bits at a time (most are now 64-bit)
- each core can run a separate process in parallel
- the CPU has dedicated memory that is very fast to access: registers and cache
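
As a rough illustration of the first two points, the standard library can report the word size of the running interpreter and the number of logical cores. The variable names are chosen here for illustration, and the numbers depend on the machine.

```python
import os
import struct

# Size of a C pointer in bytes, times 8, gives the interpreter's word size in bits
word_bits = struct.calcsize("P") * 8

# Number of logical cores visible to the OS (None if it cannot be determined)
n_cores = os.cpu_count()

print(f"{word_bits}-bit interpreter, {n_cores} logical cores")
```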

### Memory hierarchy
- registers hold (among other things) addresses of variables in main memory
- primary storage (memory), random-access memory, ~10 GB/sec, ~GB in size
- secondary storage, disk and SSD, ~1 GB/sec, ~TB in size
- tertiary storage, ~0.1 GB/sec, ~EB in size (1 billion GB)

### GPU
- specialized circuit for rendering graphics
- not general purpose like CPU, but fast at array operations

In [1]:
%load_ext memory_profiler

In [2]:
# The following data is from https://www.pombase.org/downloads/protein-datasets
# It contains the amino acid sequences for proteins in fission yeast

! head data/peptide.fa

>SPAC1002.01:pep
MLPPTIRISGLAKTLHIPSRSPLQALKGSFILLNKRKFHYSPFILQEKVQSSNHTIRSDT
KLWKRLLKITGKQAHQFKDKPFSHIFAFLFLHELSAILPLPIFFFIFHSLDWTPTGLPGE
YLQKGSHVAASIFAKLGYNLPLEKVSKTLLDGAAAYAVVKVSYFVENNMVSSTRPFVSN*
>SPAC1002.02:pep
MASTFSQSVFARSLYEDSAENKVDSSKNTEANFPITLPKVLPTDPKASSLHKPQEQQPNI
IPSKEEDKKPVINSMKLPSIPAPGTDNINESHIPRGYWKHPAVDKIAKRLHDQAPSDRTW
SRMVSNLFAFISIQFLNRYLPNTTAVKVVSWILQALLLFNLLESVWQFVRPQPTFDDLQL
TPLQRKLMGLPEGGSTSGKHLTPPRYRPNFSPSRKAENVKSPVRSTTWA*
>SPAC1002.03c:pep


In [3]:
peppath = "data/peptide.fa"

In [4]:
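def pep_search(filename, pep):
    """Return the sequence of the first record whose header starts with '>' + pep."""
    header = ">" + pep
    with open(filename, 'r') as pepfile:
        # advance the file iterator to the matching header line
        for line in pepfile:
            if line.startswith(header):
                break
        # continue from the same position: collect lines until the next header
        pepstr = ""
        for line in pepfile:
            if line.startswith(">"):
                break
            pepstr += line.strip()
    return pepstr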

In [5]:
pepstr = pep_search(peppath,"SPMTR.03")
pepstr

'MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*'

In [6]:
!tail data/peptide.fa

MKRVAVLLKTVMCEFLKCDYNGYDRIISLLRRILTLICTPNLNGLTIKRVIDSMQSLEYI
KQTCNFKLQMCISSMAFKRNNALQNCNHYAWCDDHCSDIGRPMTTVRGQCSKCTKPHLMR
WLLLHYDNPYPSNSEFYDLSAATGLTRTQLRNWFSNRRR*
>SPMTR.03:pep
MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*
>SPMTR.04:pep
MDSHQELSAGSPISYDFLDPDWCFKRYLTKDALHSIETGKGAAYFVPDGFTPILIPNSQS
YLLDGNSAQLPRPQPISFTLDQCKVPGYILKSLRKDTTSTERTPRPPNAFILYRKEKHAT
LLKSNPSINNSQVSKLVGEMWRNESKEVRMRYFKMSEFYKAQHQKMYPGYKYQPRKNKVK
R*


In [7]:
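def pep_search_fullread(filename, pep):
    """Same search as pep_search, but read the whole file into memory first."""
    with open(filename, 'r') as pepfile:
        pepread = pepfile.read()
    pepiter = iter(pepread.split("\n"))
    for line in pepiter:
        if line.startswith(">" + pep):
            break
    pepstr = ""
    for line in pepiter:
        # startswith is safe on the empty string that follows a trailing newline
        if line.startswith(">"):
            break
        pepstr += line.strip()
    return pepstr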

In [8]:
pep_search_fullread(peppath,"SPMTR.03")

'MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*'

## Profiling with IPython

- %time: Time the execution of a single statement
- %timeit: Time repeated execution of a single statement for more accuracy
- %prun: Run code with the profiler
- %memit: Measure the memory use of a single statement, requires %load_ext memory_profiler in IPython
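
Outside IPython, the standard-library `timeit` module provides the same repeated-timing idea as `%timeit`. The sketch below times two hypothetical functions (`concat_loop` and `concat_join` are made up for this example) that build the same string in different ways.

```python
import timeit

def concat_loop(n=1000):
    """Build a string by repeated concatenation."""
    out = ""
    for i in range(n):
        out += str(i)
    return out

def concat_join(n=1000):
    """Build the same string with a single join."""
    return "".join(str(i) for i in range(n))

# timeit.repeat calls the function `number` times per trial
# and returns one total time (in seconds) per trial
loop_times = timeit.repeat(concat_loop, number=200, repeat=3)
join_times = timeit.repeat(concat_join, number=200, repeat=3)

# The minimum over trials is usually the least noisy summary
print(f"loop: {min(loop_times):.4f} s, join: {min(join_times):.4f} s")
```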

In [9]:
%time %memit pepstr = pep_search(peppath,"SPMTR.03")

peak memory: 57.84 MiB, increment: 0.03 MiB
CPU times: user 57.4 ms, sys: 5.06 ms, total: 62.5 ms
Wall time: 172 ms


In [10]:
%time %memit pepstr = pep_search_fullread(peppath,"SPMTR.03")

peak memory: 66.88 MiB, increment: 9.04 MiB
CPU times: user 46 ms, sys: 19.9 ms, total: 65.9 ms
Wall time: 172 ms
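
The memory gap between the two approaches can also be measured without `memory_profiler`, using the standard-library `tracemalloc` module. This is a minimal sketch on a synthetic file, not on `data/peptide.fa`; the helper names (`count_lines_stream`, `count_lines_fullread`, `peak_bytes`) are assumptions made for the example.

```python
import os
import tempfile
import tracemalloc

def count_lines_stream(path):
    """Iterate over the file object so only one line is held at a time."""
    n = 0
    with open(path, "r") as f:
        for _ in f:
            n += 1
    return n

def count_lines_fullread(path):
    """Read the whole file into one string, then split it."""
    with open(path, "r") as f:
        text = f.read()
    return len(text.splitlines())

def peak_bytes(func, *args):
    """Return (result, peak traced memory in bytes) for a single call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Synthetic stand-in for a sequence file: 50,000 lines of 60 characters
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(("A" * 60 + "\n") * 50_000)
    sample_path = tmp.name

n_stream, peak_stream = peak_bytes(count_lines_stream, sample_path)
n_full, peak_full = peak_bytes(count_lines_fullread, sample_path)
os.remove(sample_path)

print(f"stream: {peak_stream / 2**20:.2f} MiB, full read: {peak_full / 2**20:.2f} MiB")
```

`tracemalloc` only counts memory allocated through Python's allocators, so its numbers will not match `%memit` exactly, but the ordering of the two approaches should be the same.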


In [13]:
def pep_search(filename,pep):
    with open(filename,'r') as pepfile:
        pep_len = len(pep) + 1
        for line in pepfile:
            if line[:pep_len] == ">" + pep:
                break
        pepstr = ""
        for line in pepfile:
            if line[0] == ">":
                break
            pepstr += line.strip()
    return pepstr

In [15]:
def pep_reader(filename='data/peptide.fa'):
    with open(filename,'r') as pepfile:
        pepname = False # no record seen yet (start of file)
        for line in pepfile: 
            if line[0] == '>': # protein id line
                if pepname:
                    yield (pepname,pepseq) # emit the record that just ended
                pepname = line.split(':')[0][1:] # get the id
                pepseq = "" # init seq
            else:
                pepseq += line.strip() # append to seq
        # NOTE: as written, the final record is never yielded, because no '>' line follows it

In [16]:
pep = pep_reader()
pepdict = {k:v for k,v in pep}
[k for i,k in enumerate(pepdict.keys()) if i < 10] # first 10 keys

['SPAC1002.01',
 'SPAC1002.02',
 'SPAC1002.03c',
 'SPAC1002.04c',
 'SPAC1002.05c',
 'SPAC1002.06c',
 'SPAC1002.07c',
 'SPAC1002.08c',
 'SPAC1002.09c',
 'SPAC1002.10c']
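
`pep_reader` yields a record only when the next header line arrives, so the last record in the file is never emitted. That is why `prot_ids[-1]` later in these notes is SPMTR.03 even though the `tail` output shows the file ending with SPMTR.04. A possible variant that also emits the final record is sketched below; `pep_reader_all` is a hypothetical name, and the toy file uses made-up ids and sequences.

```python
import os
import tempfile

def pep_reader_all(filename):
    """Yield (id, sequence) pairs for every record, including the last one."""
    pepname, chunks = None, []
    with open(filename, "r") as pepfile:
        for line in pepfile:
            if line.startswith(">"):
                if pepname is not None:
                    yield pepname, "".join(chunks)   # record that just ended
                pepname = line[1:].split(":")[0].strip()
                chunks = []
            else:
                chunks.append(line.strip())
    if pepname is not None:
        yield pepname, "".join(chunks)               # flush the final record

# Toy two-record file (made-up ids and sequences, not taken from peptide.fa)
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">toy.01:pep\nMLPP\nTIRI*\n>toy.02:pep\nMAST*\n")
    toy_path = tmp.name

records = list(pep_reader_all(toy_path))
os.remove(toy_path)
print(records)
```

Collecting the pieces in a list and joining once also avoids repeated string concatenation, although for short sequences the difference is small.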

## Data structures

- Strategically arrange the data in memory (primary, secondary, or tertiary)
- The data structure can be specific to the data type, for example dictionary keys need to be hashable (str, int)
- Optimized for certain operations: insertion, deletion, indexing (lookup), etc.
- Measure the complexity of an operation by the rough number of basic operations it requires; $O(n)$ means at most $Cn$ operations for some constant $C$ and all sufficiently large $n$.

### Python Lists
- Array data structure: data are indexed by non-negative integers
- Insertion at beginning is $O(n)$, insertion at end is amortized $O(1)$
- To find an index with a certain value (reverse lookup) takes $O(n)$ time
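
The insertion costs can be seen by timing how long it takes to build a list from the front versus from the back. This is a sketch with hypothetical function names; `collections.deque` is included as a standard-library alternative with $O(1)$ insertion at both ends. Absolute times vary by machine.

```python
import timeit
from collections import deque

n = 20_000

def build_front_list():
    out = []
    for i in range(n):
        out.insert(0, i)   # shifts every existing element: O(len(out))
    return out

def build_back_list():
    out = []
    for i in range(n):
        out.append(i)      # amortized O(1)
    return out

def build_front_deque():
    out = deque()
    for i in range(n):
        out.appendleft(i)  # O(1) at either end
    return out

for build in (build_front_list, build_back_list, build_front_deque):
    seconds = timeit.timeit(build, number=3)
    print(f"{build.__name__}: {seconds:.4f} s")
```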

## Dictionaries are hash tables

- a hash function is used to give the keys integer ids (probably unique but maybe not)
- a hash table maps these ids to values

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg/315px-Hash_table_3_1_1_0_1_0_0_SP.svg.png)
*Image from Wikipedia*
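
To make the two steps concrete, here is a toy hash table that resolves collisions by keeping a small list per bucket (separate chaining). It is only a teaching sketch: the class name and methods are invented for this example, and CPython's `dict` uses a different collision scheme (open addressing).

```python
class ChainedHashTable:
    """Toy hash table with separate chaining."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() gives an integer id; the modulus maps it to a bucket index
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:             # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # new key, possibly sharing the bucket

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(n_buckets=2)   # 3 keys in 2 buckets forces a collision
for i, name in enumerate(["SPAC1002.01", "SPAC1002.02", "SPAC1002.03c"]):
    table.put(name, i)
print(table.get("SPAC1002.02"))
```

Lookup cost is proportional to the length of one bucket, which stays short when the number of buckets grows with the number of keys.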

In [17]:
pepdict['SPAC1002.01'] # select element

'MLPPTIRISGLAKTLHIPSRSPLQALKGSFILLNKRKFHYSPFILQEKVQSSNHTIRSDTKLWKRLLKITGKQAHQFKDKPFSHIFAFLFLHELSAILPLPIFFFIFHSLDWTPTGLPGEYLQKGSHVAASIFAKLGYNLPLEKVSKTLLDGAAAYAVVKVSYFVENNMVSSTRPFVSN*'

In [18]:
hash('SPAC1002.01') # the hash value

-2508259212812006471

In [19]:
pep = pep_reader() # init again
prot_ids, prot_seqs = zip(*pep) # make 2 lists
prot_ids = list(prot_ids)
prot_seqs = list(prot_seqs)

In [20]:
lastid = prot_ids[-1] # select last id

In [21]:
%time prot_seqs[prot_ids.index(lastid)] # time selecting using list.index

CPU times: user 46 µs, sys: 8 µs, total: 54 µs
Wall time: 57 µs


'MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*'

In [22]:
%time pepdict[lastid] # time select using dict

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.81 µs


'MSAEDLFTIQILCDQIELKLASIVINSNIKLQLKRKKKTQQL*'
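
A single `%time` call can be noisy at the microsecond scale. The sketch below repeats the same comparison with `timeit` on synthetic data (the ids and values are generated here, standing in for `prot_ids` and `prot_seqs`), looking up the last element so the linear scan does the most work.

```python
import timeit

# Synthetic ids and values standing in for prot_ids and prot_seqs
ids = [f"id{i:06d}" for i in range(100_000)]
vals = list(range(100_000))
lookup = dict(zip(ids, vals))   # one O(n) pass builds the index
target = ids[-1]                # worst case for a linear scan

t_list = timeit.timeit(lambda: vals[ids.index(target)], number=100)
t_dict = timeit.timeit(lambda: lookup[target], number=100)
print(f"list.index: {t_list:.5f} s, dict: {t_dict:.5f} s for 100 lookups")
```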

## Regular expressions (regex)

- syntax for representing patterns in text
- descriptive language (like SQL, HTML, markdown)

- metacharacter: matches a character or set of characters
- escape: \ escapes special chars or indicates special forms
- quantifiers and anchors: specify the number and place of matches
- concatenation: regex A and B are concatenated by AB

In [23]:
## Using the re package

import re

alpha = re.compile('a')
alpha.match('a'), alpha.match('A'), alpha.match('hello')

(<re.Match object; span=(0, 1), match='a'>, None, None)

- match() : determine if the RE matches at the beginning of the string.
- search() : scan through a string, looking for any location where this RE matches.
- findall() : find all substrings where the RE matches, and returns them as a list.
- finditer() : find all substrings where the RE matches, and returns them as an iterator.
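
The four methods can be compared on one small string. The sketch below uses a toy FASTA-like string (two headers with shortened sequences) and a hypothetical pattern named `header`.

```python
import re

fasta = ">SPAC1002.01:pep\nMLPP\n>SPAC1002.02:pep\nMAST\n"
header = re.compile(r">[A-Za-z0-9.]+")

first = header.match(fasta)           # anchored at position 0: finds the first header
none_here = header.match(fasta[1:])   # string now starts with 'S', so this is None
found = header.search(fasta[1:])      # scans forward until the second header
all_ids = header.findall(fasta)       # list of matched substrings
spans = [m.span() for m in header.finditer(fasta)]  # match objects carry positions

print(first.group(), none_here, found.group())
print(all_ids)
print(spans)
```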

[ ] specifies character classes (sets of characters)

In [24]:
alpha = re.compile('[a-z]')
alpha.match('a'), alpha.match('A'), alpha.match('hello')
# match() only checks the start of the string; [a-z] consumes one character, so 'hello' yields 'h'

(<re.Match object; span=(0, 1), match='a'>,
 None,
 <re.Match object; span=(0, 1), match='h'>)

In [25]:
"""
\ escapes characters

\' single quote
\" double quote
\\ backslash
\n new line
\r carriage return
\t tab
"""

'\n\\ escapes characters\n\n\' single quote\n" double quote\n\\ backslash\n\n new line\n\r carriage return\n\t tab\n'

In [26]:
"""
\d any decimal digit; [0-9].
\D any non-digit character; [^0-9].
\s any whitespace character; [ \t\n\r\f\v].
\S any non-whitespace character; [^ \t\n\r\f\v].
\w any alphanumeric character; [a-zA-Z0-9_].
\W any non-alphanumeric character; [^a-zA-Z0-9_].
"""

'\n\\d any decimal digit; [0-9].\n\\D any non-digit character; [^0-9].\n\\s any whitespace character; [ \t\n\r\x0c\x0b].\n\\S any non-whitespace character; [^ \t\n\r\x0c\x0b].\n\\w any alphanumeric character; [a-zA-Z0-9_].\n\\W any non-alphanumeric character; [^a-zA-Z0-9_].\n'

In [27]:
alpha = re.compile(r'\d\n')  # raw string: the backslashes reach the regex engine unchanged
alpha.match('5'), alpha.match('5\n'), alpha.match('five')

(None, <re.Match object; span=(0, 2), match='5\n'>, None)
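
Python string literals and the `re` module both interpret backslashes, which is why patterns are usually written as raw strings (`r'...'`). The sketch below shows the difference; the `\b` example is an extra illustration beyond the escapes listed above.

```python
import re

# In a normal string literal, Python turns \n into a single newline character
print(len("\n"))    # 1
# In a raw string literal the backslash is kept, so re receives the two characters \ and n
print(len(r"\n"))   # 2

# re understands \n itself, so the raw pattern still matches a digit plus newline
digit_nl = re.compile(r"\d\n")
print(digit_nl.match("5\n"))

# Raw strings matter most for escapes that Python would otherwise consume, such as \b
word_cat = re.compile(r"\bcat\b")      # for re, \b is a word boundary
not_boundary = re.compile("\bcat\b")   # here Python has turned \b into a backspace character
print(word_cat.search("a cat sat"), not_boundary.search("a cat sat"))
```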

In [28]:
## Complement set with ^ in []

alpha = re.compile('[^432a-z]')
alpha.match('5'), alpha.match('abs'), alpha.match('1hat')

(<re.Match object; span=(0, 1), match='5'>,
 None,
 <re.Match object; span=(0, 1), match='1'>)

In [29]:
## . is a wildcard for any character except a newline; \. matches a literal dot

alpha = re.compile(r'.\.com')
alpha.match('a.com'), alpha.match('e.com'), alpha.match('com.org')

(<re.Match object; span=(0, 5), match='a.com'>,
 <re.Match object; span=(0, 5), match='e.com'>,
 None)

In [30]:
## * match 0 or more repetitions of the preceding regex
## () groups the metachars as regex

alpha = re.compile('(an)*')
alpha.match('banana'), alpha.match('an apple'), alpha.match('na no apple')

(<re.Match object; span=(0, 0), match=''>,
 <re.Match object; span=(0, 2), match='an'>,
 <re.Match object; span=(0, 0), match=''>)

In [31]:
## + match 1 or more repetitions of the preceding regex

alpha = re.compile('(an)+')
alpha.match('banana'), alpha.match('an apple'), alpha.match('na no apple')

(None, <re.Match object; span=(0, 2), match='an'>, None)

In [32]:
## ? match 0 or 1 repetitions of the preceding regex

alpha = re.compile('n[ao]?')
alpha.match('banana'), alpha.match('noah\'s apple'), alpha.match('na no apple')

(None,
 <re.Match object; span=(0, 2), match='no'>,
 <re.Match object; span=(0, 2), match='na'>)
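
The `?` character has a second role: placed directly after another quantifier, it makes that quantifier lazy. Quantifiers such as `*` and `+` are greedy by default, meaning they take as many characters as they can while still allowing a match. A short sketch on a made-up string:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

greedy = re.match(r"<.+>", text)   # .+ takes as much as it can
lazy = re.match(r"<.+?>", text)    # .+? stops at the first '>'

print(greedy.group())  # the whole string, up to the last '>'
print(lazy.group())    # '<b>'
```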

In [33]:
# {m} exactly m repetitions
# {m,n} between m and n repetitions
# | alternation (or); ^ and $ match the beginning and end of the string

alpha = re.compile('^[A-Z]+|[0-9]{4}$')
alpha.match('BARN'), alpha.match('2002'), alpha.match('2002 BARNS')

(<re.Match object; span=(0, 4), match='BARN'>,
 <re.Match object; span=(0, 4), match='2002'>,
 None)
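
Alternation has the lowest precedence, so in the pattern above `^` belongs only to the first branch and `$` only to the second. Grouping the alternatives makes both anchors apply to either branch. The pattern names below are chosen for this sketch, which also shows the `{m,n}` form.

```python
import re

loose = re.compile(r"^[A-Z]+|[0-9]{4}$")        # (^[A-Z]+) or ([0-9]{4}$)
strict = re.compile(r"^(?:[A-Z]+|[0-9]{4})$")   # whole string is letters, or exactly 4 digits

print(loose.match("BARN 2002"))    # matches 'BARN': the first branch has no $
print(strict.match("BARN 2002"))   # None
print(strict.match("2002"))        # matches '2002'

pin = re.compile(r"^[0-9]{4,6}$")  # {m,n}: between 4 and 6 digits
print(pin.match("12345"), pin.match("123"))
```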

In [34]:
email = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]{2,4}$")

In [35]:
email.match('jsharpna@ucdavis.edu')

<re.Match object; span=(0, 20), match='jsharpna@ucdavis.edu'>

In [36]:
email.match('j@j.edu')

<re.Match object; span=(0, 7), match='j@j.edu'>

In [37]:
email.match('jsharpna@ucdavis.elevator')

In [38]:
valid = ["email@example.com",
"firstname.lastname@example.com",
"email@subdomain.example.com",
"firstname+lastname@example.com",
"email@123.123.123.123",
"email@[123.123.123.123]",
"\"email\"@example.com",
"1234567890@example.com",
"email@example-one.com",
"_______@example.com"]

In [39]:
for em in valid:
    if not email.match(em):
        print(em)

email@subdomain.example.com
email@123.123.123.123
email@[123.123.123.123]
"email"@example.com

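
The first two rejections come from the domain part: the pattern allows only one label before the dot and at most four characters after it. One possible relaxation, sketched below under the hypothetical name `email_sub`, accepts any number of dot-separated labels. It still rejects bracketed IP literals and quoted local parts, and it no longer limits the length of the last label, so it is a trade-off and not a full address validator.

```python
import re

# One or more ".label" groups after the first domain label
email_sub = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+$")

candidates = [
    "email@subdomain.example.com",
    "email@123.123.123.123",
    "email@[123.123.123.123]",
    '"email"@example.com',
    "jsharpna@ucdavis.elevator",
]
rejected = [em for em in candidates if not email_sub.match(em)]
print(rejected)
```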