## Summing Lists and Counting Characters

### Summing lists by looping through each item
We'll quickly look at 3 ways to loop through the items in a list and add them up. Then we'll see a more general way to do this type of thing that doesn't involve writing the loop ourselves.

In [1]:
# here's the simple list we will use
numlist = [1, 2, 3.2, 4.7]
print numlist

[1, 2, 3.2, 4.7]


**(1)** <font color='blue'>**while loop**</font> -- this is probably most familiar to those with other programming language experience.

In [2]:
numlist = [1, 2, 3.2, 4.7]

addition = 0; i = 0
while i < len(numlist):
    addition = addition + numlist[i]
    i = i+1
    
print addition

10.9


Here's the same loop using the **+=** operator. I think it reads more cleanly. We'll use this when we can.

In [3]:
addition = 0; i = 0
while i < len(numlist):
    addition += numlist[i]
    i += 1

print addition

10.9


**(2)** <font color='blue'>**for loop**</font> -- this initializes and increments the index in just one line by using the **range()** function. The while loop took 3 statements to handle that.

In [4]:
numlist = [1, 2, 3.2, 4.7]

addition = 0
for i in range( len(numlist) ):
    addition += numlist[i]
    
print addition

10.9


**(3a)** <font color='blue'>**for .. in loop**</font> -- we don't actually need the index. This form hands us each list item in turn so we can deal with it. It's the cleanest look so far.

In [5]:
numlist = [1, 2, 3.2, 4.7]

addition = 0
for val in numlist:
    addition += val
    
print addition

10.9


**(3b)** <font color='blue'>**for .. in loop**</font> -- when we do want both the index and the value of each item, this style of loop can supply them if we pass the list to the **enumerate()** function.

In [6]:
numlist = [1, 2, 3.2, 4.7]

addition = 0
for i, val in enumerate(numlist):
    addition += val
    
print "{} sum with last index of {}".format(addition, i)

10.9 sum with last index of 3


### The sum() built-in function
***Sum*** is a special case in Python. There's a built-in function called **sum()** that adds up a list of numbers.
For anything other than adding up int or float values, you have to do it some other way. That's what we will look at in the next sections. Let's see sum() in action.

In [7]:
# there's a built-in sum() function for numeric lists: int and float types.
# anything more complicated you'll have to do yourself

print sum([1, 2, 3.2, 4.7])
numlist = [1, 2, 3.2, 4.7]
print sum(numlist), "it still adds up"

10.9
10.9 it still adds up


You might notice above that <font color='green'>**print**</font> is in <font color='green'>**bold green**</font> and <font color='green'>sum</font> is in <font color='green'>regular green</font>. This is Jupyter's way of distinguishing *keywords* from *built-in* functions.

Why do we care? Because built-in names can be reused as variable names. That means you lose them as a function. Yikes!

I've seen more than a few questions on Stack Overflow that were due to someone assigning a value to a variable named **sum** and then trying to use the **sum()** function later in their code. You'll get this green highlighting of built-ins in ***IPython*** but not in the ***Python*** command line interpreter. Everybody who has Jupyter installed will have **IPython** installed too. I'd recommend using it over the regular Python interpreter.

A quick example so you won't do this, or if you do, you can recognize what happened. (I'm using an advanced syntax to restore sum(); ignore that for now.)

In [8]:
summation = sum([1, 2, 3.2, 4.7])
print summation, "from variable named summation"
print sum([1, 2, 3.2, 4.7]), "directly from sum() function"

# this is what gets us in hot water -- not typically such a direct assignment to sum
sum = sum([1, 2, 3.2, 4.7])

print sum, "from variable named sum -- huh?"
print
print "trying to use sum as a function again"
try:
    print sum([1, 2, 3.2, 4.7]), "directly from sum() function after sum assignment"
finally:
    del sum # this restores the sum() function our variable was hiding

10.9 from variable named summation
10.9 directly from sum() function
10.9 from variable named sum -- huh?

trying to use sum as a function again


TypeError: 'float' object is not callable
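
As an aside, the shadowed function is never really gone. Built-ins live in their own module, named `__builtin__` in Python 2 and `builtins` in Python 3. Here's a rough sketch of that idea; the variable names are mine, and it restores sum by plain assignment rather than the `del` used above.

```python
try:
    import builtins                   # Python 3
except ImportError:
    import __builtin__ as builtins    # Python 2

numlist = [1, 2, 3.2, 4.7]

sum = 10.9                            # oops: sum is now a float, not a function
shadowed = not callable(sum)          # True while the variable hides the built-in

still_works = builtins.sum(numlist)   # the real function is still reachable here

sum = builtins.sum                    # point the name back at the built-in
restored_total = sum(numlist)

print("shadowed: {}  via module: {}  after restore: {}".format(
    shadowed, still_works, restored_total))
```

The simplest defense is still to pick another name, like **summation** or **total**, in the first place.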

### Summing numbers represented as strings as well as int and float types
We'll have more to say about this below, but I want to point out a way to handle lists where numbers are represented as strings, or as int or float values.

You might have read numeric values in from a file and they'll be strings. We'll define a function to handle both cases.

In [9]:
allnumlist = [1, 2, 3.2, 4.7]
strnumlist = ['1', '2', '3.2', '4.7'] # same list but the numbers are strings

def sum_list(numlist):
    addition = 0
    for val in numlist:
        addition += float(val)
    return addition
        
print "{} summed from {}".format( sum_list(allnumlist), allnumlist)
print "{} summed from {}".format( sum_list(strnumlist), strnumlist )

10.9 summed from [1, 2, 3.2, 4.7]
10.9 summed from ['1', '2', '3.2', '4.7']


The reason that the above function **sum_list** works for both types of list is that every value is converted to a float via the **float(val)** call. This is true whether the value is a string, an int, or float.
The only potential downside to this is that a sum of ints comes back as a float.
We'll work on that a little bit below.
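
The same conversion can also be handed straight to the built-in sum() by wrapping it in a generator expression. That's a construct this section hasn't introduced, so treat this as a sketch rather than the recommended route. It also makes the int-to-float downside easy to see (`intlist` is a list I've made up for the purpose).

```python
allnumlist = [1, 2, 3.2, 4.7]
strnumlist = ['1', '2', '3.2', '4.7']
intlist = [1, 2, 3]

# float(val) is applied to each item as sum() asks for it
total_from_numbers = sum(float(val) for val in allnumlist)
total_from_strings = sum(float(val) for val in strnumlist)

# the downside mentioned above: ints go in, a float comes out
int_total = sum(float(val) for val in intlist)

print("{} and {} match; ints summed to {!r}".format(
    total_from_numbers, total_from_strings, int_total))
```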

### Handling more than addition of lists

There's a more general function named **reduce()** we can use to do things to list items, e.g., multiply the values. Of course, we can do that with the loops we looked at above but this function handles things for you in the general case.

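**reduce()** calls a two-argument function repeatedly: it passes in the result so far along with the next list item, and whatever the function returns becomes the new result.
When the list is used up, that final result is returned. There's a 3rd argument that lets us define the starting value. If we leave it out, the first list item is used as the starting value, which is fine when we have a list of int or float values.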

In [10]:
# define a function to multiply 2 values and pass it to reduce

def mult(current_result, next_item_in_list):
    return current_result * next_item_in_list

numlist = [1, 2, 3.2, 4.7]
print reduce(mult, numlist)

30.08
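
To make the mechanics concrete, here is roughly what reduce() does, written as a plain loop. The name `my_reduce` and the `_MISSING` marker are made up for this sketch. The real function is built in under Python 2 and lives in `functools` in Python 3; the import below works in both.

```python
from functools import reduce  # built in under Python 2, but this import works in both

_MISSING = object()  # hypothetical marker meaning "no starting value was given"

def my_reduce(func, items, start=_MISSING):
    iterator = iter(items)
    if start is _MISSING:
        # no starting value: the first item becomes the initial result
        # (an empty list with no start is an error here, as it is for reduce())
        result = next(iterator)
    else:
        result = start
    for item in iterator:
        # feed the running result and the next item back into func
        result = func(result, item)
    return result

def mult(current_result, next_item_in_list):
    return current_result * next_item_in_list

numlist = [1, 2, 3.2, 4.7]
print(my_reduce(mult, numlist) == reduce(mult, numlist))
print(my_reduce(mult, numlist, 10) == reduce(mult, numlist, 10))
```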


Here's another way we could do mixed type additions. I.e., numbers that are strings, ints, and floats. This one uses **reduce()** as well, so we can concentrate on what we are doing instead of the looping over the elements.

We'll define a function **add()** that adds 2 things and returns the results. It calls a function **to_num()** on the item that turns a string numeric into a float but leaves non-strings as is.

**to_num()** is defined after the **add()** function. It could just as easily have been defined before. What matters is that it's defined by the time **add()** is actually called.


In [13]:
def add(current_result, next_item_in_list):
    return current_result + to_num( next_item_in_list )

def to_num(d):
    return float(d) if isinstance(d, basestring) else d

mixed_types = [1,'25',3, 3.2]
print mixed_types, "notice '25' is a numeric string, not an int or float type"

print "adds to", reduce(add, mixed_types, 0) # pass 0 as 3rd arg so first item can be a string

[1, '25', 3, 3.2] notice '25' is a numeric string, not an int or float type
adds to 32.2


The **to_num()** function, though short, may look very foreign to you. Ignore its implementation complexity if you'd like. (It depends on the fact that in Python every value is an object, i.e., an instance of some type, so we can ask whether it is a string. It also uses a form of if..else that is itself an expression, so it yields a value without having to do an assignment in each clause.)
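
If the one-liner is hard to read, the longhand version below behaves the same way. The name `to_num_longhand` is mine, and the `string_types` shim is an assumption I've added so the sketch also runs on Python 3, where `basestring` no longer exists.

```python
try:
    string_types = basestring   # Python 2: covers both str and unicode
except NameError:
    string_types = str          # Python 3: basestring is gone

def to_num_longhand(d):
    if isinstance(d, string_types):   # is d some kind of string?
        return float(d)               # yes: convert it
    else:
        return d                      # no: hand it back untouched

print("{!r} {!r} {!r}".format(
    to_num_longhand('25'), to_num_longhand(3), to_num_longhand(3.2)))
```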

---
We can do the same thing as above but do the addition inline with what is called a **lambda** function in Python.
These are often called *anonymous functions* in other languages: that just means a function without a name.

The syntax differs from **def** for named functions, but it is similar. A **lambda** function is limited to a single expression, whose value is returned automatically. We'll show average while we're at it.

In [14]:
# we won't use an add() function here, we'll use its equivalent lambda function in the reduce() call

def to_num(d):
    return float(d) if isinstance(d, basestring) else d

mixed_types = [1,'25',3, 3.2]
mixed_sum = reduce(lambda cursum,nxtitem: cursum+to_num(nxtitem), mixed_types, 0)

print "sum: {} avg: {}".format( mixed_sum, mixed_sum/len(mixed_types) )

sum: 32.2 avg: 8.05
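
A lambda is really just shorthand for a small def. The sketch below (the names are illustrative only) puts the two forms side by side, along with the standard library's `operator.add`, which saves writing either one when the items are already numbers.

```python
from functools import reduce  # needed on Python 3; harmless on Python 2.6+
import operator

def add_named(cursum, nxtitem):
    return cursum + nxtitem

add_lambda = lambda cursum, nxtitem: cursum + nxtitem  # same body, no def or return

numlist = [1, 2, 3.2, 4.7]
by_def = reduce(add_named, numlist, 0)
by_lambda = reduce(add_lambda, numlist, 0)
by_operator = reduce(operator.add, numlist, 0)

print(by_def == by_lambda == by_operator)
```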


We run into a problem with both of the above when the string is not convertible to an int or a float value, even if it begins with digits. We'll be reusing the **to_num()** function we defined above in the following code block.

In [15]:
bad_mixed_types = mixed_types + ['22 skidoo']
print bad_mixed_types

another_mixed_sum = reduce(lambda a,d: a+to_num(d), bad_mixed_types, 0)
print another_mixed_sum

[1, '25', 3, 3.2, '22 skidoo']


ValueError: invalid literal for float(): 22 skidoo

Let's ease into a solution for this.
I like to define a function named **atoi()** that closely replicates the semantics of the
function of the same name from the C programming language. Its best feature is that it <font color='blue'>stops at a non-digit</font>; it doesn't <font color='red'>explode</font>, aka throw an error.

The **int()** and **float()** functions do throw an error, i.e., explode, if there are alphas in the string, ***anywhere***. You saw this above. This makes for very brittle code.

**atoi()** also skips any space at the beginning, as do **int()** and **float()**. In our **atoi()**, if the arg isn't a string we'll just return the arg.


In [16]:
def atoi(int_str=""):
    if not hasattr(int_str,"lstrip"):
        return int_str # could be an int or float passed in (if not, good luck downstream)

    istr = int_str.lstrip()
    rslt = 0; mult = 1; 
    val0 = ord('0'); val9 = ord('9')
    
    if len(istr) and istr[0] == '-': # handle negative ints
        mult = -1; istr = istr[1:]
        
    for ch in istr:
        val = ord(ch)
        if val0 <= val <= val9: # it's a digit: slide everything over by 10s and add the digit
            rslt = rslt*10 + (val-val0)
        else: # not a digit, we'll take whatever we have so far
            break
            
    return rslt * mult

# now let's test it to see what we get
def atoi_test(int_str): print "{:5d} from {}".format(atoi(int_str), repr(int_str))
        
atoi_test("  -22skidoo0123456789")
atoi_test("327")
atoi_test(" 12.34") # note we get ints not floats from atoi()
atoi_test("twelve")
atoi_test(12)


  -22 from '  -22skidoo0123456789'
  327 from '327'
   12 from ' 12.34'
    0 from 'twelve'
   12 from 12


Now we'll repeat what caused the error earlier, but with the **atoi()** function instead of **to_num()**.

In [17]:
bad_mixed_types = mixed_types + ['22 skidoo']
print bad_mixed_types

another_mixed_sum = reduce(lambda a,d: a+atoi(d), bad_mixed_types, 0)
print another_mixed_sum
print "If the above is 32.2 + 22 it worked"

[1, '25', 3, 3.2, '22 skidoo']
54.2
If the above is 32.2 + 22 it worked


The above worked. However, we have only put **int**s into strings, not a **float** value. Let's see what happens there.

In [18]:
v3_mixed_types = ['fenix', '3.14'] + bad_mixed_types
v4_mixed_types = ['fenix', 3.14]   + bad_mixed_types

print v3_mixed_types
sum3 = reduce(lambda a,d: a+atoi(d), v3_mixed_types, 0)
print sum3
print
print v4_mixed_types
sum4 = reduce(lambda a,d: a+atoi(d), v4_mixed_types, 0)
print sum4
print
print "Difference:", sum4 - sum3

['fenix', '3.14', 1, '25', 3, 3.2, '22 skidoo']
57.2

['fenix', 3.14, 1, '25', 3, 3.2, '22 skidoo']
57.34

Difference: 0.14


The value **'3.14'** in v3_mixed_types was turned into an **int 3** instead of **float 3.14**.
That's why we have a difference of ***.14***

In some cases that might be what we want, but this way is treating a string '3.14'
differently than a float 3.14 and that seems inconsistent.
We could change our to_num function to always convert to int as the **to_num_v3()** example below does.

In [19]:
def to_num_v3(d): return atoi(d) if isinstance(d, basestring) else int(d)
print reduce(lambda a,d: a+to_num_v3(d), v3_mixed_types, 0)

57


The above makes things consistent, but I would prefer keeping floats float, wouldn't you?

Let's make a **strtonum()** function that does what atoi() does for an integer string
but can also return a float if there is a fractional part after a period. There are a few ways to go, such as doing everything with regular expressions; however, we'll just see if we can add some smarts to the atoi() algorithm.

In [20]:
def strtonum(num_str=""):
    if not hasattr(num_str,"lstrip"):
        return num_str # could be an int or float passed in (if not, good luck downstream)

    nstr = num_str.lstrip()
    rslt = 0; mult = 1; 
    val0 = ord('0'); val9 = ord('9')
    frac = ""
    
    if len(nstr) and nstr[0] == '-': # handle negative numbers
        mult = -1; nstr = nstr[1:] # set multiplier to -1 and delete the '-' char
        
    for ch in nstr:
        val = ord(ch)
        if val0 <= val <= val9: # it's a digit
            if len(frac)==0: # slide everything over by 10s and add the digit
                rslt = rslt*10 + (val-val0)
            else: # gather fractional part in a string
                frac += ch
        elif ch == '.' and len(frac)==0:
            frac = "0."
        else: # not a digit, we'll take whatever we have so far
            break
            
    return (rslt * mult) if len(frac)==0 else (rslt+float(frac)) * mult
    

Make sure the cell above has been run so that **strtonum()** is defined, then we'll see what it does with ['fenix 5', '3.14', 1, '25', 3, 3.2, '22 skidoo'].

In [32]:
v3_mixed_types = ['fenix 5', '3.14', 1, '25', 3, 3.2, '22 skidoo']
for val in v3_mixed_types:
    num = strtonum(val)
    print "{:>11s} == {:>4} {}".format(repr(val), num, type(num))

  'fenix 5' ==    0 <type 'int'>
     '3.14' == 3.14 <type 'float'>
          1 ==    1 <type 'int'>
       '25' ==   25 <type 'int'>
          3 ==    3 <type 'int'>
        3.2 ==  3.2 <type 'float'>
'22 skidoo' ==   22 <type 'int'>


Let's rerun the code where we had the difference of .14 substituting **strtonum()** for **atoi()** in the lambda function we use in reduce().

In [34]:
v3_mixed_types = ['fenix', '3.14'] + bad_mixed_types
v4_mixed_types = ['fenix', 3.14]   + bad_mixed_types

print v3_mixed_types
sum3 = reduce(lambda a,d: a+strtonum(d), v3_mixed_types, 0)
print sum3
print
print v4_mixed_types
sum4 = reduce(lambda a,d: a+strtonum(d), v4_mixed_types, 0)
print sum4
print
print "Difference:", sum4 - sum3

['fenix', '3.14', 1, '25', 3, 3.2, '22 skidoo']
57.34

['fenix', 3.14, 1, '25', 3, 3.2, '22 skidoo']
57.34

Difference: 0.0


Things look pretty good with this solution. We can combine int, float, or numeric strings, the way you'd expect. We can also handle strings that would cause an error with the typical Python built-ins int() and float().
And int values stay int and aren't converted to float. Btw, the sum() function maintains this too.
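
For comparison, the regular-expression route mentioned earlier might look like the sketch below. The function name `strtonum_re` and the pattern are my own, not part of the notebook. It is meant to mirror strtonum() for the kinds of input shown here, not to be a verified replacement in every corner case.

```python
import re

# optional leading whitespace, optional '-', digits, optional '.' plus digits
_NUM_PREFIX = re.compile(r'\s*(-?)([0-9]*)(\.[0-9]*)?')

def strtonum_re(num_str=""):
    if not hasattr(num_str, "lstrip"):
        return num_str                  # not a string: pass it through unchanged
    # every part of the pattern is optional, so match() always succeeds
    sign, whole, frac = _NUM_PREFIX.match(num_str).groups()
    mult = -1 if sign else 1
    if frac is None:                    # no decimal point seen: keep it an int
        return mult * int(whole or "0")
    return mult * float((whole or "0") + frac)   # e.g. "3" + ".14"

for val in ['fenix 5', '3.14', 1, '25', 3, 3.2, '22 skidoo']:
    print("{!r:>11} -> {!r}".format(val, strtonum_re(val)))
```

Whether this reads better than the character loop is a matter of taste. The loop spells out each step, while the pattern packs the same rules into one line.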

---
Remember though that 'Fahrenheit 451' becomes 0, while '451 Fahrenheit' becomes 451. You can strip the alphas at the beginning of a string with the regular expression module **re** and its **sub()** function.
We'll see this below but won't work this into our strtonum() strategy. Then we'll move on.

In [23]:
import re # re is python's regular expression module
F451 = 'Fahrenheit 451'

rip_prefix_alphas = re.sub("^[a-zA-Z ]*", "", F451) # we're also getting rid of prefix spaces
print repr(rip_prefix_alphas)

'451'


### Counting Characters
We can count the number of a specific character, or substring, with the string **count()** method.

In [31]:
seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"

print 'G:', seq.count('G')
print 'C:', seq.count('C')
print 'GTG:', seq.count('GTG')

G: 24
C: 18
GTG: 3
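
One thing to be aware of: count() only counts non-overlapping occurrences, so 'GGG' is found once in 'GGGG', not twice. If overlapping matches matter to you, a small loop around the string **find()** method can handle it. `count_overlapping` is a helper name I've made up, not a string method.

```python
def count_overlapping(st, sub):
    count = 0
    start = st.find(sub)
    while start != -1:
        count += 1
        # resume one character past the last hit so overlaps are seen
        start = st.find(sub, start + 1)
    return count

seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
print("GG non-overlapping: {}  overlapping: {}".format(
    seq.count('GG'), count_overlapping(seq, 'GG')))
```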


Let's use the counts to calculate GC percentage of the same nucleotide string.

In [25]:
seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
GCnum = seq.count('G') + seq.count('C')

GCfrac = (GCnum+0.0) / len(seq) # we add 0.0 to get float division (not needed in python3)
GCpct = 100 * GCfrac # we print GCfrac using a % format specifier, but this is how you'd get the percentage

print '{0} GC is {0:.1%} of seq length {1}'.format(GCfrac, len(seq))

0.6 GC is 60.0% of seq length 70
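
Wrapped up as a function, the same calculation could look like this. `gc_fraction` is a name I've made up, and `float()` stands in for the +0.0 trick so the division comes out the same on Python 2 and Python 3. The empty-string guard is my addition too.

```python
def gc_fraction(seq):
    if not seq:
        return 0.0                      # avoid dividing by zero on an empty string
    gc_num = seq.count('G') + seq.count('C')
    return float(gc_num) / len(seq)     # float() forces true division on Python 2

seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
print('{0:.1%} GC in {1} bases'.format(gc_fraction(seq), len(seq)))
```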


Another way to display the counts is to define a function that prints all our character counts at once.

Notice we pass in 'ACGT' as the characters to count in **prt_chrnum(**'ACGT', seq**)**

In [30]:
def prt_chrnum(chrs, st, fmtstr="{} {}", toupper=False): 
    if toupper:
        chrs = chrs.upper(); st = st.upper()
    for ch in chrs:
        print fmtstr.format(ch, st.count(ch))
        
seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
prt_chrnum('ACGT', seq)

A 14
C 18
G 24
T 14


The above works to print the char and its count, but we usually need to use the char
counts in calculations.

Let's make a Python dictionary with them by **def**ining a function that's similar to **prt_chrnum()** which we'll call **create_chrnum_map()**.

In [27]:
def create_chrnum_map(chrs, st, toupper=False):
    chrnum_map = {} # {} represents an empty dictionary
    if toupper:
        chrs = chrs.upper(); st = st.upper()
    for ch in chrs:
        chrnum_map[ch] = st.count(ch)
    return chrnum_map

seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
print create_chrnum_map('ACGT',seq)

{'A': 14, 'C': 18, 'T': 14, 'G': 24}


Let's use the dictionary created by **create_chrnum_map()** to calculate GC percentage of a nucleotide string just like before with the counts.

In [28]:
seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
nucmap = create_chrnum_map('ACGT',seq)
print nucmap
print

# only difference is how we get our G and our C counts. now we ask the map, aka dictionary, for them
GCnum = nucmap['G'] + nucmap['C']

GCfrac = GCnum*1.0 / len(seq) # we multiply by 1.0 to get float division (not needed in python3)
GCpct = 100 * GCfrac # we print GCfrac using a % format specifier, but this is how you'd get the percentage

print '{0} GC is {0:.1%} of seq length {1}'.format(GCfrac,len(seq))

{'A': 14, 'C': 18, 'T': 14, 'G': 24}

0.6 GC is 60.0% of seq length 70
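
The standard library also has a ready-made tool for this kind of tally: `collections.Counter` (available since Python 2.7). Unlike create_chrnum_map(), it counts every distinct character it sees rather than just the ones you pass in, so take this as an alternative sketch, not a drop-in replacement.

```python
from collections import Counter

seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"
nucmap = Counter(seq)            # one pass over seq, counting each character

GCnum = nucmap['G'] + nucmap['C']
GCfrac = float(GCnum) / len(seq)

print('{0:.1%} GC'.format(GCfrac))
print(nucmap['N'])               # a character that never appears counts as 0, no KeyError
```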


There's a slightly different way to create a dictionary that is more *modern*. It is called a dictionary comprehension. Frankly, I don't think it's that easy to comprehend (the term comes from set theory).
I use the traditional method.

However, let's take a quick look. You might like it. It is more compact.

In [29]:
seq = "ATGTTCTGGGCCGCACGCGTGCTACACTGAGCGGGTCAACGGGTGAGGATGTGCGAGAGCACTTCCCAAT"

nucmap = {ch: seq.count(ch) for ch in set(seq)} # set(seq) visits each distinct character once, not all 70
print nucmap

{'A': 14, 'C': 18, 'T': 14, 'G': 24}


In the above, the variable name before the colon, **ch**, becomes the dictionary element's key and the expression after the colon but before the Python keyword <font color='green'>**for**</font> is that key's value.
That's the **seq.count(ch)** part in the example.
Each key, **ch** in this example, is produced in turn by the **for** clause that follows the value expression.

Dictionary comprehensions were preceded in the language by list comprehensions, which use a similar syntax to create lists.
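
For reference, a list comprehension has the same shape, just without the key and colon and with square brackets instead of braces. Here's a small sketch using the numeric strings from earlier in this section; the variable names `floats` and `whole_only` are mine.

```python
strnumlist = ['1', '2', '3.2', '4.7']

# expression first, then the for clause that supplies each val
floats = [float(val) for val in strnumlist]

# an optional if clause at the end filters the items
whole_only = [val for val in strnumlist if '.' not in val]

print("{} {}".format(floats, whole_only))
```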