# Item Similarity Model

Our similarity model uses the following features of cultural heritage items to determine how similar two items are.

## List of Features

### 1. Hidden Tags
Hidden tags are tags extracted from an item's description using natural language processing techniques. In particular, we use the [TextBlob](https://textblob.readthedocs.io/en/dev/) library to extract noun phrases from the descriptions. After finding the hidden tags for each item, we compute **the number of matching hidden tags** between two items.
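
As a rough illustration of this step, the sketch below extracts noun phrases with TextBlob's `noun_phrases` property and counts the tags two items share. The function names `extract_hidden_tags` and `count_matching_tags` are hypothetical, and the case-insensitive exact matching is an assumption; the model may compare tags differently. TextBlob also needs its corpora downloaded before noun phrase extraction works.

```python
def extract_hidden_tags(description):
    """Return the set of noun phrases TextBlob finds in a description."""
    # Imported lazily so the matching helper below also works
    # in environments where TextBlob is not installed.
    from textblob import TextBlob

    return set(TextBlob(description).noun_phrases)


def count_matching_tags(tags_a, tags_b):
    """Count tags shared by two items.

    Assumption: two tags match when they are equal after trimming
    whitespace and lowercasing.
    """
    normalized_a = {tag.strip().lower() for tag in tags_a}
    normalized_b = {tag.strip().lower() for tag in tags_b}
    return len(normalized_a & normalized_b)
```

The same counting helper could serve the user-entered tag and title features, since both also reduce to counting shared strings between two items.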

### 2. User Entered Tags
User-entered tags describe an item from the user's perspective. We use **the number of matching tags** between two items.

### 3. Title
The title string contains the most descriptive words about an item. We use **the number of matching noun phrases** between the titles of two items.

### 4. Location
Including the distance between two items in the model lets us find items that are close to each other. We use the **distance between locations** in the following formula:

$$
\begin{align*}
    \frac{C}{Distance(L_1, L_2)} \quad \quad \text{where, } \quad  & L_1: \text{ Location of item 1} \\ 
                                                          & L_2: \text{ Location of item 2} \\
                                                          & C: \text{ A constant to prevent very low values}
\end{align*}
$$
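
The formula leaves the distance function open. The sketch below assumes locations are `(latitude, longitude)` pairs in degrees and uses the haversine great-circle distance in kilometers, which is one possible choice and not necessarily the one used in the model. The `min_distance_km` floor is also an assumption, added so that two items at the same location do not cause a division by zero.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(loc1, loc2):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, loc1)
    lat2, lon2 = map(math.radians, loc2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def location_feature(loc1, loc2, c=1.0, min_distance_km=1.0):
    """Compute C / Distance(L1, L2).

    Assumption: distances below min_distance_km are raised to that floor,
    so co-located items yield c / min_distance_km instead of an error.
    """
    distance = max(haversine_km(loc1, loc2), min_distance_km)
    return c / distance
```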

### 5. Time Average Difference
Using the difference between the time values of two items lets us find items that are close in time. To measure this, we use **the difference of the average years** of the two items in the following formula:

$$
\begin{align*}
    \frac{C}{|T_{avg_1} - T_{avg_2}|} \quad \quad \text{where, } \quad & T_{avg_1} = \frac{T_{start_1} + T_{end_1}}{2} \\
                                                                     & T_{avg_2} = \frac{T_{start_2} + T_{end_2}}{2} \\
                                                                     & C : \text{A constant to prevent very low values}
\end{align*}
$$
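
A minimal sketch of this feature follows, assuming each timeframe is a `(start_year, end_year)` pair. The `min_difference` floor is an assumption that is not part of the formula; it keeps the result finite when both items have the same average year.

```python
def average_year(timeframe):
    """Return (T_start + T_end) / 2 for a (start_year, end_year) pair."""
    start, end = timeframe
    return (start + end) / 2


def time_average_feature(tf1, tf2, c=1.0, min_difference=1.0):
    """Compute C / |T_avg_1 - T_avg_2|.

    Assumption: differences below min_difference are raised to that floor
    to avoid dividing by zero for items with identical average years.
    """
    difference = abs(average_year(tf1) - average_year(tf2))
    return c / max(difference, min_difference)
```

For example, timeframes `(1500, 1600)` and `(1700, 1800)` have average years 1550 and 1750, so with `c=100` the feature is `100 / 200 = 0.5`.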

### 6. Timeframe Overlapping Percentage
How much the timeframes of two items overlap also matters: it helps find contemporary items, and it avoids the mistake of treating a very short timeframe as comparable to a very long one. We define a timeframe and its related operations as follows:

$$
\begin{align*}
    TF &= \big(T_{start}, T_{end}\big) \\
    TF_1 \cup TF_2 &= \big(\min(T_{start_1}, T_{start_2}), \max(T_{end_1}, T_{end_2}) \big) \\
    TF_1 \cap TF_2 &= \big(\max(T_{start_1}, T_{start_2}), \min(T_{end_1}, T_{end_2}) \big) \\
    Y(TF) &= T_{end} - T_{start}
\end{align*}
$$


We then compute a number between 0 and 1 that represents the fraction of overlap between two timeframes using the following formula:

$$
\begin{align*}
    \frac{Y(TF_1 \cap TF_2)}{Y(TF_1 \cup TF_2)}
\end{align*}
$$
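
The timeframe operations and the ratio can be sketched as below, again assuming `(start_year, end_year)` pairs. Two details are assumptions rather than part of the definitions: for disjoint timeframes the intersection length as defined is negative, so the sketch clamps it to 0 to stay within the stated 0 to 1 range, and a zero-length union (both items dated to the same single year) is treated as full overlap.

```python
def timeframe_length(tf):
    """Y(TF) = T_end - T_start."""
    return tf[1] - tf[0]


def timeframe_union(tf1, tf2):
    """Smallest timeframe covering both inputs."""
    return (min(tf1[0], tf2[0]), max(tf1[1], tf2[1]))


def timeframe_intersection(tf1, tf2):
    """Timeframe shared by both inputs; end < start when they are disjoint."""
    return (max(tf1[0], tf2[0]), min(tf1[1], tf2[1]))


def overlap_ratio(tf1, tf2):
    """Compute Y(TF1 ∩ TF2) / Y(TF1 ∪ TF2), clamped to [0, 1]."""
    union_years = timeframe_length(timeframe_union(tf1, tf2))
    if union_years == 0:
        # Assumption: both items share the same single year, so count it as full overlap.
        return 1.0
    # Assumption: a negative intersection length (disjoint timeframes) counts as no overlap.
    shared_years = max(0, timeframe_length(timeframe_intersection(tf1, tf2)))
    return shared_years / union_years
```

For example, `(1500, 1600)` and `(1550, 1650)` share 50 years out of a combined span of 150, giving a ratio of one third, while a 20-year timeframe fully inside a 100-year one gives 0.2.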

## Combining

Let the $k^{th}$ feature above be called $F^k$. After computing all of the features above, we combine them using a simple linear model with real-valued coefficients as follows:

$$ S_{i, j} = \sum_{k = 1}^6 C_kF^k_{i, j} \quad \quad \text{where } \quad C_k : \text{Coefficient of feature $k$} $$

Here, $S_{i, j}$ is the similarity of item $i$ and item $j$, and $F^k_{i, j}$ is the value of feature $k$ for that pair.
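
The linear combination can be sketched as a weighted sum over the six feature values of one item pair. The function name `combine_features` and the coefficient values in the usage example are purely illustrative; the document does not state how the coefficients are chosen.

```python
def combine_features(features, coefficients):
    """Return the weighted sum of feature values for one pair of items.

    features[k] holds the value of feature k+1 for the pair and
    coefficients[k] holds its coefficient.
    """
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return sum(c * f for c, f in zip(coefficients, features))


# Illustrative values only: six feature values for one pair and made-up coefficients.
pair_features = [2, 1, 0, 0.5, 0.25, 0.2]
example_coefficients = [1.0, 1.0, 2.0, 4.0, 4.0, 10.0]
similarity = combine_features(pair_features, example_coefficients)
```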