# Site Sample

The diceEM framework can be applied to a real bioinformatics problem: 
inferring the sequences recognized by a transcription factor (TF).

Just as in diceEM, we'll start by generating samples of the kind of data we
later hope to infer parameters from. This gives you examples to work with and
helps you understand how to frame the problem.

Consider a slight variation on the bag-of-dice problem. In this version, the
bag contains two types of dice collections. Each collection contains dice whose
faces represent DNA bases, and it specifies the order
in which the dice are to be rolled. Each die may have different face
probabilities. After a collection is drawn at random from the bag, the dice
are rolled in the indicated order, the collection is returned to the bag, and
then the next collection is drawn. One type of collection is known as the motif
model and the other as the background model. The motif model represents the types
of sequences that are bound by the transcription factor, while the background
model represents all sequences in a genome. In the background model there is
just one die that is rolled the designated number of times to produce a
sequence of the required length.

We'll be using numbers to represent bases for convenience in this lab, with the
following mapping:

```raw
A <-> 0
C <-> 1
G <-> 2
T <-> 3
```
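
When reading sampled output, it can help to convert between letters and integer codes. The following is a minimal sketch of the mapping above; `encode` and `decode` are hypothetical helper names used only for illustration and are not part of the assignment files.

```python
# Hypothetical helpers for the A/C/G/T <-> 0/1/2/3 mapping shown above.
BASES = "ACGT"
BASE_TO_INT = {base: code for code, base in enumerate(BASES)}


def encode(seq):
    """Convert a DNA string such as 'GCTTT' into a list of integer codes."""
    return [BASE_TO_INT[base] for base in seq.upper()]


def decode(codes):
    """Convert integer codes (0 to 3) back into a DNA string."""
    return "".join(BASES[int(code)] for code in codes)


print(encode("ACGT"))
print(decode([2, 1, 3, 3, 3]))
```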

Your sampling function should output integers in the range 0 to 3.

For this exercise, you will write a function called `site_sample` that takes
3 arguments:

- `sequence_model` = an object that stores the parameters of
our binding site probability model
- `num_draws` = the number of times to draw a sequence from the bag
- `seed` = a random seed that makes the randomness reproducible 

## SequenceModel

The SequenceModel object is similar to the BagOfDice in that it stores the
parameters that describe the probability of observing a sequence drawn from
the motif model versus one drawn from the background model. The SequenceModel
object has the following attributes:

- `site_prior` = the probability that a sequence drawn from the bag is bound
  by the transcription factor. This is a number between 0 and 1.

- `background_base_probs` = a list of 4 numbers that sum to 1, representing the
  probability of seeing each base in a sequence drawn from the background model.
  We assume that the bases of non-bound sequences are drawn independently, one
  at a time, from a single distribution, so the chance of seeing an 'A' is the
  same regardless of the position in the sequence. One possible
  `background_base_probs` input would look like this:

  ```python
  background_base_probs = [.2, .2, .3, .3]
  ```

  which would indicate that non-bound sequences are roughly 20% As, 20% Cs,
  30% Gs, and 30% Ts.

- `site_base_probs` = a list of lists, with each sublist containing 4 numbers
  that sum to 1, representing the probability of seeing each base at a given
  position in a sequence drawn from the motif (foreground) model. This allows
  you to specify position-dependent base frequencies. One possible
  `site_base_probs` would look like this:

  ```python
  site_base_probs = [[.1, .4, .4, .1],
                     [.1, .4, .4, .1],
                     [.8, .1, .1, 0],
                     [.8, .1, .1, 0],
                     [.2, .3, .2, .3]]
  ```

  This would specify a site that is 5 bases wide, with a high and equal chance
  of seeing a C or a G in each of the first two positions, a very high chance of
  seeing an A in each of the next two positions, and Cs and Ts slightly favored
  over As and Gs in the last position.
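
Since every probability vector above must sum to 1, it can be worth checking your inputs before building a model. The following is a minimal sketch using NumPy; `check_probs` is a hypothetical helper name, and nothing is assumed here about whether SequenceModel performs its own validation.

```python
import numpy as np


def check_probs(probs):
    """Return probs as a 2-D array, or raise ValueError if any row is not
    a valid probability distribution over the 4 bases."""
    arr = np.atleast_2d(np.asarray(probs, dtype=float))
    if arr.shape[1] != 4:
        raise ValueError("each row needs exactly 4 entries (A, C, G, T)")
    if (arr < 0).any():
        raise ValueError("probabilities must be non-negative")
    if not np.allclose(arr.sum(axis=1), 1.0):
        raise ValueError("each row must sum to 1")
    return arr


background_base_probs = [.2, .2, .3, .3]
site_base_probs = [[.1, .4, .4, .1],
                   [.1, .4, .4, .1],
                   [.8, .1, .1, 0],
                   [.8, .1, .1, 0],
                   [.2, .3, .2, .3]]

background = check_probs(background_base_probs)  # a single row of 4 values
site = check_probs(site_base_probs)              # one row per motif position
```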

Constructing a SequenceModel object would look like this:

```python
>>> site_prior = 0.2
>>> sm = SequenceModel(site_prior,
...                    site_base_probs,
...                    background_base_probs)
```

<!--
For more details see the
[SequenceModel object documentation](https://cse587a.github.io/cse587Autils/SequenceObjects/API/SequenceModel.html)
and [usage examples](https://cse587a.github.io/cse587Autils/SequenceObjects/Usage/SequenceModel.html)
-->

## site_sample Example 

A call to site_sample would look like this:

```python
>>> print(site_sample(sm, 3))
[[2 1 3 3 3]
 [1 3 1 3 1]
 [1 2 0 0 2]]
```

Each sequence is as long as the motif specified by `site_base_probs`.
There are 3 sequences, per the `num_draws` argument.

## Provided Files

As with the other assignments, in the subdirectory `assignment` you will find a file called [assignment.py](assignment.py) with the function `site_sample()` partially implemented. Fill in the missing code (the places where you should enter code currently raise `NotImplementedError`). You can see the tests you will need to pass in [test_assignment.py](test_assignment.py). Note that any changes you make to the test_assignment.py file will not affect the autograder.

## A possible approach

As with `dice_sample`, numpy's `random.choice()` with weights (its `p`
argument) will work. This is very similar to `dice_sample`, so you should use
your implementation of it for reference. The big difference between the two is
that `site_base_probs` specifies different probabilities for each position,
whereas a die, like `background_base_probs` here, has face probabilities that
do not depend on the position in a drawn sequence.
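
The position-dependent versus position-independent distinction can be sketched as follows. This is an illustration under stated assumptions, not the expected implementation of `site_sample`: it works on plain lists rather than a SequenceModel, it uses NumPy's `default_rng` generator (the legacy `np.random.choice` accepts the same `p` argument), and the seed value is arbitrary.

```python
import numpy as np

# Seeded generator so the draws are reproducible (seed chosen arbitrarily).
rng = np.random.default_rng(42)

site_base_probs = [[.1, .4, .4, .1],
                   [.1, .4, .4, .1],
                   [.8, .1, .1, 0],
                   [.8, .1, .1, 0],
                   [.2, .3, .2, .3]]
background_base_probs = [.2, .2, .3, .3]

# Motif: one weighted draw per position, each with its own probabilities.
site_seq = [int(rng.choice(4, p=probs)) for probs in site_base_probs]

# Background: the same die is rolled once for every position.
background_seq = rng.choice(4, size=len(site_base_probs),
                            p=background_base_probs)

print(site_seq)
print(background_seq)
```

Choosing between the two models for each draw, using `site_prior`, is the remaining step and is left to your implementation.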