# Assignment 5: Experimentation

The biological and social sciences are largely built upon experimentation. With that in mind, this assignment focuses on using Python applications, debugging, and writing good, clean, well-documented code. It builds up to the application of experimentation.

This assignment is out of 8 points, worth 8% of your grade.

**PLEASE DO NOT CHANGE THE NAME OF THIS FILE.**

**PLEASE DO NOT COPY & PASTE OR DELETE CELLS INCLUDED IN THE ASSIGNMENT.**

## Overview

Make sure you delete `raise NotImplementedError()` whenever you see it in the assignment, replacing it with your code for the question. To have the best chance at receiving full credit, be sure your completed assignment passes all the asserts. This assignment also has hidden tests - which means that passing all the asserts you see does not guarantee that you have the correct answer! Make sure to double check and re-run your code to make sure it does what you expect.

## Setup

#### Package Installation

To use the `pandas` and `numpy` packages, you *may* have to install them first.

Up to now, we've only used packages that have already been installed for you. Notice the `!` below. This indicates that we're running a command in the shell, not in Python. 

`pip install --user` followed by the name of the package you want is how you install new packages from the terminal on your computer or on datahub. Note that if you are installing from a Jupyter Notebook (as we are here), you'll need to start the line with a `!` to indicate that this is a terminal/shell command.

**First try the code imports in the section below**. If they run, the packages have been installed. If they error, uncomment each of the two lines below to install the `pandas` and `numpy` packages. Each will take a minute or two to run. You only have to run this once (not every time you open the notebook). Once you have installed the packages, re-comment the lines of code:

In [None]:
# !pip install --user pandas

In [None]:
# !pip install --user numpy

### Q1 - Imports (0.25 pts)

Once you've installed the packages, you must import them to make their functionality available to you.

Import `pandas` as `pd`.

Import `numpy` as `np`.

Finally, import the `random` module.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert np
assert pd
assert random

## Part 1: `numpy`

`numpy` makes working with matrices that store homogeneous data (most typically a whole bunch of numbers) in rows and columns much simpler than it would be with the standard Python library.

In this section, you'll get practice creating and working with `numpy` arrays.
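
As a warm-up, here is a minimal sketch of building a small array from a list of rows and inspecting it. The name `small` and its values are illustrative only and are not part of the assignment.

```python
import numpy as np

# A 2 x 2 array built from a list of rows (values are illustrative only).
small = np.array([[10, 20],
                  [30, 40]])

print(small.shape)   # (rows, columns)
print(small[0])      # the 0th row
print(small[1, 0])   # the value in row 1, column 0
```

Indexing starts at 0, so "the 0th row" is the first list passed to `np.array()`.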

### Q2 - arrays (0.5 pts)

Create a 3 x 3 `numpy` array. 

The 0th row should store `[1, 2, 3]`, the next row should contain `[4, 5, 6]` and the final row should contain `[7, 8, 9]`. 

Store this output in `my_array`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert(isinstance(my_array, np.ndarray))

In [None]:
assert(my_array.shape == (3,3))


### Q3 - `row_sums()` (0.5 pts)

Write a function `row_sums()` that takes a `numpy` array `np_array` as its input parameter. 

Within the function, determine the sum of each row.

Store the sum of each row as a list. 

Return the list of the row sums from the function.
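
NumPy reductions such as `sum()` accept an `axis` argument that controls which dimension is collapsed. The sketch below shows the idea with column totals rather than row sums; the names `grid` and `col_totals` are made up for this example.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [10, 20, 30]])

# axis=0 collapses the rows, giving one total per column.
col_totals = grid.sum(axis=0)

# .tolist() converts the resulting ndarray into a plain Python list.
col_totals_list = col_totals.tolist()
print(col_totals_list)
```

Choosing the other axis collapses the columns instead.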

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_output = row_sums(np.array([[1,1], [2,2]]))
assert(isinstance(test_output, list))

In [None]:
assert(len(test_output) == 2)


### Q4 - `array_min()` (0.75 pts)

Write a function `array_min()` that takes a `numpy` array `np_array` as its input parameter. 

Within the function, determine the minimum value stored within the array. Store this in `min_val`.

Then, store the *position* where this minimum value is stored in the input array in `out`. This output should be a tuple of `ndarrays`. 

For example, if the smallest value in the input array is in the first row, second column, this function would return: `(array([0]), array([1]))`

**Note:** The `where()` function from `numpy` will likely be helpful.

Return `out` (which stores the minimum position) from the function.
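
To get a feel for what `np.where()` returns, here is a small sketch that uses a different condition from the one this question needs. The array `scores` and the threshold of 10 are illustrative assumptions.

```python
import numpy as np

scores = np.array([[3, 12],
                   [15, 7]])

# With a single condition, np.where returns a tuple of index arrays,
# one per dimension: (row indices, column indices).
rows, cols = np.where(scores > 10)
print(rows, cols)
```

Reading the two arrays in parallel gives the matching positions: here, row 0 column 1 and row 1 column 0.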

In [None]:
# uncomment to see the documentation for np.where
# np.where?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_array = np.array([[17,1], [25,50]])
assert(isinstance(array_min(test_array), tuple))

In [None]:
assert(len(array_min(test_array)) == 2)


### Background: Experiment

Now that you've got some practice with `numpy` and arrays, for the rest of this assignment, we'll focus on writing code that is helpful when running and analyzing experiments. 

The goal of this experiment is to understand randomness a little better.

To do this, we'll use a simple task. Adults were asked to randomly pick a number between 0 and Infinity. 

The population for this experiment was COGS 18 students from Spring 2019. 

Here, our goal is to determine: 

1. When asked to be random, are the numbers humans chose *actually* random?
2. If they are not random, what is/are the most commonly chosen number(s)? 

To do this, we'll use `pandas` DataFrames and functions you write to answer these questions.

## Part 2: `pandas`

While `numpy` revolutionized how to work with matrices in Python, `pandas` took things one step further in its introduction of the DataFrame. 

Here, you'll begin working with a dataset that stores heterogeneous information to gain practice with `pandas`.


### Q5 - Data (0.5 pts)

One very helpful function in `pandas` allows users to read datasets into Python from a CSV file and store them as a DataFrame.

Look up documentation for `pandas` online and figure out how to read in a CSV file into Python as a DataFrame using `read_csv()`. Use the default settings for `read_csv` here.

The data you will read into this notebook are in `data/random_guess.csv`. 

Assign this DataFrame to the variable `df`.
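
If you'd like to see `read_csv()` in action before touching the real file, the sketch below reads CSV text from an in-memory buffer instead of from disk. The column names and values are made up and do not reflect the assignment's dataset.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the column names and values are made up.
csv_text = "ID,Guess\n1,7\n2,42\n3,7\n"

# read_csv accepts a file path or a file-like object such as StringIO.
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.shape)
print(demo_df.head())
```

With the default settings, the first line of the CSV is used for the column names.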

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# check to ensure that what you read in is a pandas dataframe
assert(isinstance(df, pd.DataFrame))

In [None]:
assert(df.shape == (272,2))


### Look at the data

Run the code below to get a sense of the data you've just read in:

In [None]:
# see the first few rows of data frame you read in
df.head()

In this dataset, each row is data collected from one student. While we only see the first five rows, there are 272 individuals who participated in our experiment. The first column includes the experimental ID for the student. The second column includes their random guess.

We'll be working with these data throughout the rest of the assignment.

## Part 3: Experimentation


### Q6 - Unique Items (1 pt)

If people chose a number between 0 and Infinity truly at random, we would not expect a sample of a couple hundred people to frequently choose the same number. 

Let's write a function to start to assess randomness in human choice.

Define a function called `calculate_unique`. 

Inputs:
- `data` - DataFrame
- `variable` - name of column in `data` to summarize (string)

Output: 
- `num_unique`, `num_total`, `num_unique/num_total`

Procedure:
1. Calculate the number of unique responses in the specified variable of the input DataFrame. Store this in `num_unique`. (Hint: there is a `unique()` function in `pandas`)
2. Calculate the number of total responses. Store this in `num_total`
3. Return `num_unique`, `num_total`, and the proportion of unique responses (`num_unique/num_total`). Return all three, separated by commas, in a single `return` statement.
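
Two ideas from the procedure above, sketched on toy data: `unique()` on a `pandas` Series, and returning several values from one function. The names `colors` and `min_and_max` are hypothetical and are not part of the assignment.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green'])

# unique() returns the distinct values, in order of first appearance.
distinct = colors.unique()
print(distinct)
print(len(distinct), len(colors))

# A function can hand back several values at once; they arrive as a tuple.
def min_and_max(values):
    return min(values), max(values)

low, high = min_and_max([4, 9, 2])
print(low, high)
```

Values returned together can be unpacked into separate variables, as the last assignment shows.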


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_df = pd.DataFrame({'ID' : [1, 2, 3], 
                        'response' : ['a', 'a', 'b']})
num_unique, num_total, prop_unique = calculate_unique(test_df, 'response')
assert(num_unique == 2)
assert(num_total == 3)

### Q7 - Debugging (1 pt)

Often, as we write functions, we don't get it *quite* right the first time around. A function here has been provided for you, but it's not working just yet. 

This function should accomplish the following: 

1. Randomly sample `n` rows from the dataset (where `n` is the number of rows, specified when the function is called).
2. Return the smaller subset as a data frame from the function.

Read and understand the instructions above, the function provided here, and the error message(s) produced to determine how to debug the function to (1) get it working and (2) accomplish the task specified above.

Function to start with provided below.

Do not change the function name or the input parameter name. However, adding additional parameters to your function definition is required.
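
Error messages usually name the exact thing Python could not make sense of. As a small, unrelated illustration, the hypothetical function `scale()` below contains a deliberate bug, and the sketch catches the resulting exception so the message can be read.

```python
def scale(values, factor):
    # Deliberate bug: the parameter is called `values`, but the body uses `vals`.
    return [v * factor for v in vals]

try:
    scale([1, 2, 3], 2)
except NameError as err:
    # The message names the variable Python could not find.
    message = str(err)
    print(message)
```

Comparing the name in the message against the function's parameters is often the quickest way to spot this kind of mistake.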

In [None]:
def select_sample(dataset):
    out = data.sample(10, replace = False)
    return out

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
## Use this cell to test, if needed

In [None]:
sub_df = select_sample(df, 10)
assert(sub_df.shape == (10, 2))
assert(sub_df.iloc[0,1] in df.Number.values)

sub_df2 = select_sample(df, 15)
assert(sub_df2.shape == (15, 2))
assert(sub_df2.iloc[0,1] in df.Number.values)

### Q8 - Refactoring (0.75 pts)

Beyond just getting a function working, it's important, as we write functions, to make sure they're as clear and streamlined as possible. 

The process of restructuring existing code that is working to improve its readability or speed is called **refactoring**.

By refactoring your code, the output shouldn't change, but the efficiency and/or readability of the code should. For example, if you have a whole bunch of `if`/`elif`/`else` statements that are all very similar and involve lots of copy and pasting, there's probably a way to refactor the code to improve the readability. If you have a whole bunch of nested `for` loops, there's almost certainly a way to refactor the code.

Refactored code is easier to read, easier to debug, and easier to modify.
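
As a generic illustration of the `if`/`elif`/`else` case mentioned above, here is a sketch of one possible refactoring. Both functions and the grade values are hypothetical and unrelated to `calculate_common()`.

```python
def grade_points_long(letter):
    # Repetitive version: one branch per case.
    if letter == 'A':
        return 4
    elif letter == 'B':
        return 3
    elif letter == 'C':
        return 2
    else:
        return 0


def grade_points_short(letter):
    # Refactored version: the mapping lives in one dictionary,
    # and .get() supplies the fallback value.
    points = {'A': 4, 'B': 3, 'C': 2}
    return points.get(letter, 0)


for letter in ['A', 'B', 'C', 'F']:
    print(letter, grade_points_long(letter), grade_points_short(letter))
```

The two versions return the same values for every input; only the readability changes.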

The code provided in the `calculate_common()` function works, but it's really confusing to figure out what's going on. Refactor the code provided so that the code within the function has fewer than 5 lines of code, while accomplishing the same task. (Note: you may have to use the Internet to gather some possible ideas/approaches.)

If you're unsure where to start:

1. Run the code that you have to start. 
2. Before changing anything, call the function and see what it outputs.
3. Add comments to each chunk of code in the function to understand how it works currently.
4. Refactor the code - this could involve changing small pieces at a time or taking an entirely different approach.
5. Be sure to remove any comments that no longer apply.

Function to start with is provided for you below: 

In [None]:
def calculate_common(dataset, column):

    items = dataset[column].unique()

    out_count = {}

    for item in items:
        item_counter = 0

        for ind in dataset.index:
            if dataset.loc[ind, column] == item:
                item_counter += 1

        out_count[item] = item_counter

    def sort_dict(dictionary, reverse_it = True):

        sorted_dictionary = sorted(dictionary.items(), 
                                   key = lambda x: x[1], 
                                   reverse = reverse_it)

        return sorted_dictionary

    sorted_dict = sort_dict(out_count)

    return sorted_dict[0]

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
## Use this cell to test your function
calculate_common(df, 'Number')

In [None]:
test_df = pd.DataFrame({'ID' : [1, 2, 3], 
                        'response' : ['a', 'a', 'b']})

assert(calculate_common(test_df, 'response') == ('a', 2))

In [None]:
assert(callable(calculate_common))


### Q9 - `split_dataset` (1 pt)

Earlier, we defined a function `select_sample()` which allowed us to randomly sample a subset from our larger dataset.

Now, rather than specifying how many samples we want to randomly choose from a dataframe, let's take our entire dataframe and split it into thirds. We'll consider each third a different **replicate** of our experiment.

Write a function called `split_dataset`, which will return a list of dataframes. Each dataframe in the list will be a random subset (sampled without replacement) of the input dataset.

Input(s):
- `dataset` : DataFrame
- `n_split` : int, default: 3

Output(s):
- `result` :  list of DataFrame(s)

Procedure(s):
- Use `sample()` with the input parameters `frac = 1` and `replace = False`. Store this in `shuffled`.
- Use `np.array_split` with `shuffled` and `n_split` as its input parameters. Store this in `result`.
- return `result`, which stores a list of dataframes
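
To see how `np.array_split` behaves when the data cannot be divided evenly, here is a minimal sketch on a plain `numpy` array rather than a DataFrame. The names `values` and `pieces` are illustrative.

```python
import numpy as np

values = np.arange(7)                # seven items: 0 through 6
pieces = np.array_split(values, 3)   # a list of three arrays

for piece in pieces:
    print(piece)
```

Unlike `np.split`, `np.array_split` does not require an even division; the earlier pieces simply receive the extra items.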


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_df = pd.DataFrame({'ID' : [1, 2, 3, 4, 5, 6, 7], 
                        'response' : ['a', 'a', 'b', 'c', 'd', 'd', 'e']})
list_return = split_dataset(test_df, 2)
assert(len(list_return) == 2)

In [None]:
assert(callable(split_dataset))


So far, we have a bunch of functions in a notebook. Ultimately, we'll want to use three of these functions (`calculate_unique()`, `split_dataset()` and `calculate_common()`) across different notebooks, on different input datasets. So, it's time to move these to a module, where we'll add comments and docstrings before importing them into this notebook.

### Q10 - `my_experiment` module (1 pt)

1. Move (copy) the code for the three functions (`calculate_unique()`, `split_dataset()` and `calculate_common()`) into a file named `my_experiment.py`. This file should be stored in the same directory as this notebook.
2. Add `numpy` docstrings to each function.
3. Add lines to `import` `numpy` as `np` and `pandas` as `pd` at the top of your module.
4. Save your changes to `my_experiment.py`.
5. Return to this notebook, restart your kernel, and re-run all code in this notebook.
6. Import the three functions using the following syntax: `import my_experiment as exp`.
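
For reference, a `numpy`-style docstring generally has a one-line summary followed by `Parameters` and `Returns` sections. The sketch below shows the layout on a hypothetical function, `rescale()`, that is not part of this assignment.

```python
def rescale(values, factor=2):
    """Multiply every value in a list by a constant.

    Parameters
    ----------
    values : list of float
        Numbers to rescale.
    factor : float, optional
        Multiplier applied to each value. Default is 2.

    Returns
    -------
    scaled : list of float
        The rescaled numbers, in the original order.
    """
    return [value * factor for value in values]


print(rescale.__doc__)
```

Each parameter is listed as `name : type`, with an indented description on the line below.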

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert(callable(exp.calculate_common))
assert(callable(exp.split_dataset))
assert(callable(exp.calculate_unique))

In [None]:
#hidden tests for this question

Now, let's put it all together. Because you restarted your kernel, make sure the cell that reads `df` in with `pandas` has been re-run, then run the code below to:

1. Randomly split up our data into 5 different experiments.
2. Determine the most common 'random' number in each of our 5 datasets.
3. Calculate the proportion of unique values in each of our 5 datasets.

In [None]:
split_datasets = exp.split_dataset(df, 5)

# loop through to calculate most common 
print('Calculating for each dataset: \n')
count = 1

for dataset in split_datasets:    
    id_max, max_val = exp.calculate_common(dataset, 'Number')
    unique, total, proportion = exp.calculate_unique(dataset, 'Number')
    print('Dataset', count, ': most common:', id_max, '(chosen', max_val, 'times); proportion unique:', proportion)
    count += 1

### Q11 - Interpretation (0.75 pts)

We set out to answer the questions: 
1. When asked to be random, are the numbers humans chose *actually* random?
2. What is/are the most commonly chosen numbers? 

Consider the output from your five experiments above.

In the variable `humans_random`, store `True` if the data suggests humans are random and `False` if not.

In the variable `most_common`, specify the value your experiment suggests is the most commonly chosen number when humans are asked to choose a number at random.

In the variable `proportion_unique`, given the results above, store the proportion unique you'd expect from a new sample, expressed as a percentage and rounded to the nearest 10. For example, if you thought the proportion unique was 54%, you would store `proportion_unique = 50`. If you thought the proportion unique was 87%, you would store `proportion_unique = 90`.

Note that each time you run the cell above your answers will change slightly due to the randomness we built in. We're accounting for this when we check your answers below.
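
If you'd rather compute the rounding than do it by eye, one possible approach is Python's built-in `round()` with a negative number of digits. The value `0.54` below is a hypothetical proportion, not a result from the data.

```python
proportion = 0.54                        # hypothetical proportion, as a fraction
percent = proportion * 100               # convert to a percentage

# ndigits=-1 rounds to the nearest ten; int() drops the trailing ".0".
nearest_ten = int(round(percent, -1))
print(nearest_ten)
```

Note that `round()` sends exact ties such as 45 to the nearest even multiple of ten (40), which differs from the "round half up" rule you may have learned.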

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert(isinstance(most_common, int))
assert(proportion_unique % 10 == 0)

In [None]:
assert(isinstance(humans_random, bool))


# The End!

Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double check that everything is working properly and all of your asserts pass silently. 

When you are ready, submit on datahub.