# Code Example: Representing Multivariate Data and Distributions

Representing multivariate data is easy in any representation based on rows and columns -- just add a column for each variable.
We already saw examples of that when we talked about linear regression; we just called them columns before splitting them into inputs and outputs.
For the moment, we are not concerned with inputs and outputs, so each variable is just another column.

Consider the following data describing ten fictional mangos.
Each row in this table corresponds to a single mango.

| yellowness | softness |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 2 | 2 |
| 3 | 2 |
| 3 | 2 |
| 3 | 3 |
| 3 | 4 |
| 4 | 4 |
| 4 | 4 |
| 5 | 5 |

We can compact this table by combining rows with the same values and adding a count for the number of combined rows.
This is essentially a histogram over more than one variable, with one bin per distinct row rather than buckets covering ranges of values.

| yellowness | softness | count |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 2 |
| 3 | 3 | 1 |
| 3 | 4 | 1 |
| 4 | 4 | 2 |
| 5 | 5 | 1 |
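
As a rough sketch of this compaction step, the snippet below sorts the rows so that identical rows sit next to each other, then counts each run with `itertools.groupby`. The names `mango_rows` and `compact_rows` are illustrative rather than taken from the text, and this is only one of several ways to do the counting.

```python
from itertools import groupby

# The ten mangos from the first table, as [yellowness, softness] rows.
mango_rows = [[1, 1], [2, 1], [2, 2], [3, 2], [3, 2],
              [3, 3], [3, 4], [4, 4], [4, 4], [5, 5]]


def compact_rows(rows):
    """Combine identical rows into [value, ..., count] rows (illustrative helper)."""
    compacted = []
    # Sorting puts equal rows side by side so groupby sees each as one run.
    for row, run in groupby(sorted(rows)):
        compacted.append(list(row) + [len(list(run))])
    return compacted


compact_table = compact_rows(mango_rows)
for yellowness, softness, count in compact_table:
    print(yellowness, softness, count)
```

The counts in the last column add up to ten, the number of rows in the original table.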

From this table, we can turn the counts into probabilities by dividing each count by the total count (which equals the original number of rows).

| yellowness | softness | sample probability |
|---|---|---|
| 1 | 1 | 1/10 = 0.1 |
| 2 | 1 | 1/10 = 0.1 |
| 2 | 2 | 1/10 = 0.1 |
| 3 | 2 | 2/10 = 0.2 |
| 3 | 3 | 1/10 = 0.1 |
| 3 | 4 | 1/10 = 0.1 |
| 4 | 4 | 2/10 = 0.2 |
| 5 | 5 | 1/10 = 0.1 |
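
The division can be sketched directly on the compact table. Here the table is hard-coded as `[yellowness, softness, count]` rows so the block runs on its own, and `fractions.Fraction` is used (an assumption of this sketch, not something the text requires) so the results print as exact fractions alongside their float values.

```python
from fractions import Fraction

# Compact table with counts: [yellowness, softness, count] rows.
count_table = [[1, 1, 1], [2, 1, 1], [2, 2, 1], [3, 2, 2],
               [3, 3, 1], [3, 4, 1], [4, 4, 2], [5, 5, 1]]

# The total count equals the number of rows in the original table.
total_count = sum(count for _, _, count in count_table)

# Replace each count with count / total_count.
probability_table = [[yellowness, softness, Fraction(count, total_count)]
                     for yellowness, softness, count in count_table]

for yellowness, softness, probability in probability_table:
    print(yellowness, softness, probability, float(probability))
```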

At this point, the table is a concise representation of the sample probability distribution.
This table can easily be represented as a list or array in Python, but depending on your needs, you may prefer a dictionary of probabilities like the ones we have used over the last couple of weeks.
But what will the key be?

Previously, we described probability distributions of one variable, so we used the values of that variable as the keys.
Now, we are looking at multiple variables, so there are multiple values at once.
The simplest way to handle this is to make a tuple of the values.
The main thing to remember here is to use a consistent order of variables when assembling the tuples.
Here is some example code to do so.


In [None]:
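def build_multivariate_distribution(input_data):
    """Build a sample probability distribution from array-like rows.

    Returns a dictionary mapping each distinct row (as a tuple)
    to the fraction of input rows equal to it."""

    # Count how many times each distinct row appears.
    counts = {}
    for row in input_data:
        key = tuple(row)  # tuples can be dictionary keys; lists cannot
        counts[key] = counts.get(key, 0) + 1

    # The counts add up to the number of input rows.
    total_count = sum(counts.values())

    # Divide each count by the total to get a probability.
    return {key: count / total_count for key, count in counts.items()}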

**Code Notes:**
* The string at the very start of the function body is called a docstring.
  * Docstrings serve as built-in documentation.
  * A number of tools can automatically extract docstrings to generate separate documentation. Many libraries' documentation web pages are generated this way.
  * The docstring can be accessed programmatically as the `__doc__` attribute. In this case, that would be `build_multivariate_distribution.__doc__`.
* The `for` loop iterates over each row of data.
  * This is where the "array-like" requirement comes in. The phrasing is from NumPy, and in this context it means we expect the `for` loop to visit one row of data per iteration.
  * Each row is converted to a tuple containing the same values to get a key. A tuple is immutable, so its contents cannot change after it is created, and that is what lets it be used as a dictionary key (as long as the values inside it are themselves hashable, as numbers are).
* `counts.get(key, 0)` is used to handle cases where the key is not in the dictionary yet.
  * If it is not in the dictionary yet, `0` will be returned as the default value, meaning no matching rows were previously counted.
  * Just using `counts[key]` would raise a `KeyError` the first time each distinct row is seen.
  * Alternatively, you could check `key in counts`, but using `counts.get(key, 0)` makes the overall code simpler.
* `counts.values()` returns a view of the values in the dictionary, which can be iterated over like a sequence.
  * Since this dictionary has row/count pairs, the sequence is of row counts.
  * Summing the row counts gives the total number of input rows.
* The last line is a dictionary comprehension.
  * A dictionary comprehension is like a list comprehension, but it is surrounded with braces (`{}`) instead of brackets (`[]`), and each entry is written as a key expression and a value expression separated by a colon (`:`).
  * This dictionary comprehension iterates over the key/count pairs from `counts.items()` and maps each key to its count divided by the total count.
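
A few of the points above can be checked in a short, self-contained sketch. The toy `counts` dictionary and the `row` list here are made up for illustration and are separate from the function above.

```python
row = [3, 2]
counts = {}

# A list cannot be a dictionary key because it is mutable (unhashable).
try:
    counts[row] = 1
    list_key_worked = True
except TypeError:
    list_key_worked = False

# Indexing a key that is not in the dictionary raises KeyError ...
key = tuple(row)
try:
    counts[key]
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# ... while get() returns the supplied default instead.
first_lookup = counts.get(key, 0)

# Counting the same row twice with the get() pattern.
counts[key] = counts.get(key, 0) + 1
counts[key] = counts.get(key, 0) + 1

print(list_key_worked, missing_key_raised, first_lookup, counts)
```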


In [None]:
# mango data: two columns: yellowness and softness
mango_data = [[1,1],[2,1],[2,2],[3,2],[3,2],[3,3],[3,4],[4,4],[4,4],[5,5]]

# output keys are (yellowness, softness) tuples
mango_distribution = build_multivariate_distribution(mango_data)
mango_distribution

{(1, 1): 0.1,
 (2, 1): 0.1,
 (2, 2): 0.1,
 (3, 2): 0.2,
 (3, 3): 0.1,
 (3, 4): 0.1,
 (4, 4): 0.2,
 (5, 5): 0.1}

**Code Notes:**
* `mango_data` has the same data as in the first table above, structured as a list of lists.
* The resulting distribution has tuples of yellowness and softness as keys.
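
Looking up individual mangos shows why the order of variables inside the tuple matters. The sketch below hard-codes the distribution printed above so that it runs on its own. The `math.isclose` check is an addition of this sketch, used because floating-point sums are not always exactly 1.

```python
import math

# Distribution from the output above; keys are (yellowness, softness).
mango_distribution = {(1, 1): 0.1, (2, 1): 0.1, (2, 2): 0.1, (3, 2): 0.2,
                      (3, 3): 0.1, (3, 4): 0.1, (4, 4): 0.2, (5, 5): 0.1}

# Yellowness 3, softness 2 was observed twice out of ten mangos.
p_yellow3_soft2 = mango_distribution.get((3, 2), 0.0)

# Swapping the order asks about yellowness 2, softness 3, which never occurred.
p_yellow2_soft3 = mango_distribution.get((2, 3), 0.0)

# Probabilities should sum to 1, up to floating-point rounding.
sums_to_one = math.isclose(sum(mango_distribution.values()), 1.0)

print(p_yellow3_soft2, p_yellow2_soft3, sums_to_one)
```

Using `get` with a default of `0.0` treats combinations that never appeared in the sample as having sample probability zero.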

**Side Note:**
* The function `build_multivariate_distribution` implements one of the ways to turn multiple variables into one variable: wrapping the multiple variables up into a single tuple.
  * We previously said we would avoid this, and we maintain that stance when reasoning about variables.
  * When programming, it may be much more convenient to merge them into one, but we emphasize that these tuple keys are still transparent about the original variable values.
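
To illustrate that the tuple keys keep the original variables visible, this sketch unpacks each key back into its two values and adds up the probability of every mango with a given yellowness. The helper name `probability_of_yellowness` is hypothetical, and the distribution is hard-coded so the block runs on its own.

```python
import math

# Keys are (yellowness, softness) tuples, as in the output above.
mango_distribution = {(1, 1): 0.1, (2, 1): 0.1, (2, 2): 0.1, (3, 2): 0.2,
                      (3, 3): 0.1, (3, 4): 0.1, (4, 4): 0.2, (5, 5): 0.1}


def probability_of_yellowness(distribution, target):
    """Sum the probabilities of all keys whose first value equals target."""
    total = 0.0
    # Each key unpacks back into the original two variables.
    for (yellowness, _softness), probability in distribution.items():
        if yellowness == target:
            total += probability
    return total


p_yellow3 = probability_of_yellowness(mango_distribution, 3)

# Four of the ten mangos have yellowness 3; compare with a tolerance
# because the result is a sum of floats.
print(p_yellow3, math.isclose(p_yellow3, 0.4))
```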