# Ranking ingredients

Similar to the suggested pairings problem I solved with Dijkstra's algorithm, I had to find a way to solve a mundane problem that had a very clear naive solution. The problem was stated as follows:

> Given a set of ingredients, find all ingredients that are commonly paired with the given set and rank them.

Assuming this data is stored in a relational database such as MySQL, and depending on the schema design, it is not difficult to see that this would be a sequence of `INNER JOIN`s. As it turned out, the way we designed our schema for Saffron made it slightly trickier, but it was still very much doable with a junction table. To address the ranking, we could rank by the total number of times that an ingredient $x$ showed up.

...But what's the fun in that? So I took it upon myself to do it in Python, a language I wasn't familiar with, and to find a more interesting method to rank the ingredients.

Side note: $\LaTeX$ doesn't seem to display properly on GitHub.

## Adjacency Matrix

Let's take a look at a simple example. In our actual adjacency matrix, columns and rows have the same labels, but for demonstration purposes, I've made them different. What the user gives us are the rows, and what we try to match those with are the columns.

Suppose we have the following matrix:

    import pandas as pd
    df = pd.DataFrame(
        [[1, 2, 0, 3], [0, 1, 0, 0], [1, 0, 1, 1], [0, 0, 0, 0]],
        index=['a', 'b', 'c', 'd'],
        columns=['a1', 'b1', 'c1', 'd1'],
    )
    print(df)

       a1  b1  c1  d1
    a   1   2   0   3
    b   0   1   0   0
    c   1   0   1   1
    d   0   0   0   0


And we're only interested in ingredients `a` and `c`:

    dftemp = df.loc[['a', 'c'], :]
    print(dftemp)

       a1  b1  c1  d1
    a   1   2   0   3
    c   1   0   1   1


We take only the ingredients that have been paired with *all* of the ingredients the user has given us:

    print(dftemp.loc[:, dftemp.columns[(dftemp >= 1).all()]])

       a1  d1
    a   1   3
    c   1   1


We've now solved the first issue of finding the intersection. If total count were our ranking criterion, the above result would be `[d1, a1]`. But suppose two ingredients each had a total of 100: it's no longer clear which one should be ranked higher.
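As a quick check of the naive approach, here is a minimal sketch that ranks the toy matrix by total count alone. The names `selected`, `common`, and `count_ranking` are only illustrative and are not part of the project code.

```python
import pandas as pd

# Same toy matrix as above: rows are what the user gives us,
# columns are the candidates we try to match.
df = pd.DataFrame(
    [[1, 2, 0, 3], [0, 1, 0, 0], [1, 0, 1, 1], [0, 0, 0, 0]],
    index=["a", "b", "c", "d"],
    columns=["a1", "b1", "c1", "d1"],
)

selected = df.loc[["a", "c"], :]

# Keep only candidates paired at least once with every selected ingredient.
common = selected.loc[:, (selected >= 1).all()]

# Naive ranking: total pairing count per candidate, highest first.
count_ranking = common.sum().sort_values(ascending=False)
print(count_ranking.index.tolist())  # ['d1', 'a1']
```

Here `d1` totals 4 and `a1` totals 2, so the ordering is unambiguous; ties are where this criterion stops being useful.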

This is when I started looking at edge cases and thinking about the distribution of each column. It made sense that ingredients that pair well would be ones that are often paired with all of the ingredients being considered, as opposed to an ingredient that was paired only once with each of nine out of, say, ten ingredients, but 100 times with the tenth.

My first approach was to consider a Pearson's $\chi^2$ test for goodness of fit, but this only provides a test, as opposed to a measure. We also didn't have enough data to properly conduct such a test. So I sought to find a simple estimate to measure the uniformity of our observed data. Entropy seemed to be quite reasonable.

$$
\begin{aligned}
\mathrm{H}(x) &= \sum_{i=1}^n {\mathrm{P}(x_i)\,\mathrm{I}(x_i)}\\
 &= -\sum_{i=1}^n {\mathrm{P}(x_i) \log_b \mathrm{P}(x_i)}
\end{aligned}
$$

[Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) measures the unpredictability of information content. Its value is maximized by a [uniform distribution](https://en.wikipedia.org/wiki/Entropy_(information_theory)), where it equals

$$\log(n), \; n = b-a+1,$$

where the logarithm uses the same base as the entropy, and `b` and `a` here are the upper and lower bounds of the distribution (not the base of the logarithm).
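To see the maximum in practice, here is a small sketch using `scipy.stats.entropy` with made-up counts. The vectors `uniform` and `skewed` are hypothetical and only serve to illustrate the property.

```python
import numpy as np
from scipy import stats

uniform = [5, 5, 5, 5]   # evenly spread over n = 4 outcomes
skewed = [17, 1, 1, 1]   # same total, concentrated on one outcome

# stats.entropy normalizes the counts to probabilities before computing H.
h_uniform = stats.entropy(uniform, base=2)
h_skewed = stats.entropy(skewed, base=2)

# The uniform case reaches the maximum, log2(n) = 2 bits for n = 4.
print(h_uniform, np.log2(len(uniform)))
# The skewed case falls below that maximum.
print(h_skewed)
```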

We now have two quantities to take into account: total count and entropy. One possibility would be to express the two as a linear combination, but I wanted to penalize heavily skewed distributions and reward those that were well distributed. Thus, I decided to take the sum of each ingredient's column and raise it to that column's entropy. This in effect heavily rewards uniformly distributed vectors, while penalizing skewed ones.
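To make the effect concrete, here is a minimal sketch with made-up counts in which two candidates tie on total count. The column names `even` and `skewed` and the row labels are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical counts: both candidates total 100 across two user ingredients.
counts = pd.DataFrame(
    {"even": [50, 50], "skewed": [1, 99]},
    index=["ingredient_1", "ingredient_2"],
)

sums = counts.sum()                                 # total count per candidate
entropy = stats.entropy(counts.to_numpy(), base=2)  # per-column entropy, in bits
scores = np.power(sums, entropy)                    # total raised to its entropy

print(scores.sort_values(ascending=False))
```

The evenly paired candidate keeps its full total (100 raised to an entropy of 1 bit), while the skewed one collapses to a score below 2, so the tie on total count is broken in favor of the well-distributed column.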

Here's the code:

    import numpy as np
    from scipy import stats


    def find_intersection(clean_df, user_ingredients):
        # Consider only the rows of the desired user_ingredients
        df_temp = clean_df.loc[user_ingredients, :]

        # Drop the columns that match `user_ingredients`, so the given
        # ingredients are not returned as their own pairings
        df_temp = df_temp.drop(labels=user_ingredients, axis=1)

        # Then consider only the columns that have been paired with all ingredients of interest
        new_cols = df_temp.columns[(df_temp > 0).all()]
        final_df = df_temp.loc[:, new_cols]

        # Rank by examining uniformity of distribution along columns,
        # penalizing for non-uniformity, while rewarding for uniformity
        sums = final_df.sum()
        entropy = stats.entropy(final_df, base=2)
        rankings = np.power(sums, entropy)

        # sort_values() sorts ascending, so the highest-ranked column comes last
        rankings_sorted = rankings.sort_values()

        # Format into list of tuples (<col_name>, <ranking>), pairing each
        # column name with its own score from the sorted series
        cols = rankings_sorted.index.values
        rankings_list = list(zip(cols, rankings_sorted))

        return rankings_list

Please see the related functions for further details. Thanks for viewing :)