# Problem 1 - Writing Mathematical Functions

## A)
Write a function that accepts an array of floats as inputs. Return an array where every value of the input array has been divided by 1.5.

In [None]:
import numpy as np

def divide_by_1p5(array):
    return array / 1.5

# Test it out
test_array = np.array([1, 2, 3, 4, 5])
out_array = divide_by_1p5(test_array)
out_array
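
The function above relies on its input already being a NumPy array; a plain Python list would raise a `TypeError` on the `/` operator. A minimal sketch of a more forgiving variant, assuming we want to accept any sequence of numbers (the name `divide_by_1p5_any` is hypothetical, not part of the original solution):

```python
import numpy as np

def divide_by_1p5_any(values):
    # np.asarray converts lists or tuples to an array; dtype=float
    # guarantees floating-point output even for integer input.
    return np.asarray(values, dtype=float) / 1.5

result = divide_by_1p5_any([3, 6, 9])
print(result)
```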

## B)

Use a random function (uniform or normal) to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation greater or less than the mean of the array.

In [None]:
def getOutliers(array):
    mean = np.mean(array)
    std = np.std(array)
    highCutoff = mean + std
    lowCutoff = mean - std
    # Boolean masks marking the values beyond each cutoff
    greater = array > highCutoff
    less = array < lowCutoff
    greater_or_less = greater | less
    return array[greater_or_less]

# Test it out
a = np.random.normal(50, 10, 100)
print(getOutliers(a))

# Sorting makes it easier to see
print(np.sort(getOutliers(a)))

## B) - Alternate solution

In [None]:
def getOutliers(array):
    mean = np.mean(array)
    std = np.std(array)
    highCutoff = mean + std
    lowCutoff = mean - std
    outliers = []
    # Check each value against the two cutoffs one at a time
    for i in array:
        if i > highCutoff or i < lowCutoff:
            outliers.append(i)
    return np.array(outliers)

# Test it out
a = np.random.normal(50, 10, 100)
print(getOutliers(a))

# Sorting makes it easier to see
print(np.sort(getOutliers(a)))
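
Both versions test two cutoffs, but "more than one standard deviation from the mean" can also be written as the single condition |x - mean| > std. A short sketch of that variant follows; the name `get_outliers_abs` and the fixed seed are assumptions made for this illustration only.

```python
import numpy as np

def get_outliers_abs(array):
    # Distance of every element from the mean, in the same units as the data
    deviation = np.abs(array - np.mean(array))
    # Keep the elements lying more than one standard deviation from the mean
    return array[deviation > np.std(array)]

rng = np.random.RandomState(0)  # fixed seed so the example is repeatable
a = rng.normal(50, 10, 100)
outliers = get_outliers_abs(a)
print(np.sort(outliers))
```

This should select the same values as the two-cutoff versions, in their original order.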

## C)
Write a function that estimates a p-value from the exponential distribution (another distribution in numpy). The function should take a number as an input (let's call it x), and return an estimate of the probability that a number drawn from the exponential distribution will be equal to or greater than x.

To do this, generate many samples from the exponential distribution (use the default scale=1.0), count the number of samples greater than x, and divide the result by the number of samples you generated. 

Don't use a loop to count the number of samples greater than x. Instead, look at what happens when you use np.sum() on a boolean array, or read about the function np.count_nonzero().
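
As a small illustration of that hint (the sample values here are made up), comparing an array to a number yields a boolean array, and summing it treats `True` as 1 and `False` as 0:

```python
import numpy as np

values = np.array([0.2, 1.7, 3.4, 0.9, 5.1])
mask = values > 1.0                    # one True/False entry per element

count_sum = np.sum(mask)               # True counts as 1, False as 0
count_nonzero = np.count_nonzero(mask) # same count, computed directly
fraction = np.mean(mask)               # proportion of True entries

print(count_sum)
print(count_nonzero)
print(fraction)
```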

Calling your function should look like this:
```python
out = my_function(3)
print(out)  # prints 0.050316 (or close to this number)
```

In [None]:
def estimate_p(measurement):
    N_SAMPLES = 1000000
    samples = np.random.exponential(scale=1.0, size=N_SAMPLES)
    # Summing a boolean array counts its True entries
    return np.sum(samples > measurement) / float(len(samples))

print("Estimated p-value for 3 is {}".format(estimate_p(3)))
print("Estimated p-value for 5 is {}".format(estimate_p(5)))

We can also check how close we got by using the 'expon' distribution in scipy.stats.

Importing this object gives us access to a variety of functions on the distribution - see the documentation [here](http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.expon.html).

The sf (survival function) method computes the integral of the distribution's density from x to infinity, which is exactly what we estimated above by drawing many samples and counting the proportion of them greater than x.

In [None]:
from scipy.stats import expon

print("Actual p-value for 3 is {}".format(expon.sf(3)))
print("Actual p-value for 5 is {}".format(expon.sf(5)))
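
For the exponential distribution with scale 1.0, the survival function has the closed form sf(x) = e^(-x), so the exact answer can also be checked with NumPy alone. The sketch below additionally reports a binomial standard error, sqrt(p(1-p)/N), as a rough guide to how far a sampled estimate might sit from the exact value. The function name `estimate_p_with_error` and the fixed seed are assumptions for this illustration.

```python
import numpy as np

def estimate_p_with_error(x, n_samples=1000000, seed=0):
    rng = np.random.RandomState(seed)  # fixed seed for repeatability
    samples = rng.exponential(scale=1.0, size=n_samples)
    # Fraction of samples above x: the Monte Carlo p-value estimate
    p_hat = np.mean(samples > x)
    # Binomial standard error of that fraction
    std_err = np.sqrt(p_hat * (1.0 - p_hat) / n_samples)
    return p_hat, std_err

p_hat, std_err = estimate_p_with_error(3)
exact = np.exp(-3.0)  # closed-form survival function at x = 3
print("estimate = {:.6f} +/- {:.6f}, exact = {:.6f}".format(p_hat, std_err, exact))
```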

# Problem 2 - Strings to Arrays


## A)

So we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho-dependent genes (by and large, we don't). Nevertheless:

Make a function that takes a DNA string as input (only G, C, A, or T characters) and an arbitrary substring (e.g. "CT"). The function should find all locations of the substring in the string and return them as an array. For example:
```python
a = find_substring("GCACTTGCACGTACGCCGT", "AC") 
#output a contains [2, 8, 12] (or a numpy array with these values)
```

In [None]:
def findsubstring(string, Substring):
    length = len(Substring)
    posList = []
    for pos, letter in enumerate(string):
        # Compare the slice starting at pos against the substring
        if string[pos:pos+length] == Substring:
            posList.append(pos)
    return np.array(posList)

x = findsubstring('ACTAGGGCTAATAGATTACGGACTATG', 'CT')
print(x)

## A) - Alternate
If you looked at a list of python string methods, you might notice the "find" method will locate a substring within a string.  However, it only finds the first match after the 'start' position.  So to search for all matches, you need to loop through, finding each match, and then updating the 'start' position so it looks for the next match next time.  Here's what a solution using this method looks like.

In [None]:
def findsubstring(string, Substring):
    posList = []
    start_position = 0
    while True:
        next_pos = string.find(Substring, start_position)
        if next_pos == -1:
            # find returns -1 when there are no more matches
            break
        posList.append(next_pos)
        # Resume one character past this match so overlapping matches are found
        start_position = next_pos + 1
    return np.array(posList)

x = findsubstring('ACTAGGGCTAATAGATTACGGACTATG', 'CT')
print(x)

## A) - Alternate #2
Python has a more advanced module for string searching called 're' (stands for Regular Expressions).  Regular Expression syntax is a whole language of its own, but it lets you use wildcards and other customizations to search for particular patterns.  However, we can also use it for our simple example.

In [None]:
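import re
def findsubstring(string, Substring):
    # re.escape treats the substring literally; the (?=...) lookahead lets
    # overlapping matches be reported, as in the other two solutions
    pattern = '(?=' + re.escape(Substring) + ')'
    return np.array([m.start() for m in re.finditer(pattern, string)])

x = findsubstring('ACTAGGGCTAATAGATTACGGACTATG', 'CT')
print(x)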
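
Regular expressions pay off for patterns that a literal search cannot express. As a sketch tied to the pyrimidine question above, the character class `[CT]` matches either pyrimidine, and wrapping a pattern in a lookahead `(?=...)` reports overlapping matches as well. The function name `find_pattern` is hypothetical.

```python
import re
import numpy as np

def find_pattern(string, pattern):
    # (?=...) is a zero-width lookahead: it matches at a position without
    # consuming characters, so overlapping occurrences are all reported.
    lookahead = '(?=' + pattern + ')'
    return np.array([m.start() for m in re.finditer(lookahead, string)])

seq = 'ACTAGGGCTAATAGATTACGGACTATG'
pyr_pairs = find_pattern(seq, '[CT][CT]')  # two pyrimidines in a row
overlapping = find_pattern('AAAA', 'AA')   # overlapping matches
print(pyr_pairs)
print(overlapping)
```

Note that here the pattern is interpreted as a regular expression, so special characters in it are not treated literally.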

## B)

Using the result of find_substring from (a), find the distance between each pair of adjacent substrings (i.e. how many base pairs separate each position where we found the substring). Check if a numpy function does this.

For example:

`differences = find_differences(a)`

In [None]:
difference = np.diff(x)  # np.diff returns the differences between adjacent elements
print(difference)
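
To see what `np.diff` is doing, it can be compared with subtracting the array from a copy of itself shifted by one element. A brief sketch, using example positions chosen for this illustration:

```python
import numpy as np

positions = np.array([1, 7, 22])             # example match positions
gaps_diff = np.diff(positions)               # differences between neighbours
gaps_slice = positions[1:] - positions[:-1]  # the same thing, written by hand
print(gaps_diff)
print(gaps_slice)
```

The result has one fewer element than the input, since n positions have n - 1 gaps between them.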