# Coffee rating prediction
The objective is to predict the rating of coffee beans based on their origin, flavour and tasting notes.

In [None]:
import numpy as np
import pandas as pd

pd.options.plotting.backend = "plotly"

In [None]:
df = pd.read_csv("./data/simplified_coffee.csv")
for col in ["name", "roaster", "roast", "loc_country", "origin", "review"]:
    df[col] = df[col].astype("string")

df["review_date"] = pd.to_datetime(df["review_date"])
df = df.rename(columns={"loc_country": "roaster_country"})
df.head()

Let us first check for NaNs.

In [None]:
df.isna().sum()

The only column with NaNs is the roast. Since there are only 12 missing values, we could just remove these rows. However, since most coffees have the same roast type (as we will see later), let us fill with the modal value.

In [None]:
df["roast"] = df["roast"].fillna(df["roast"].mode().iloc[0])
df.isna().sum()
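
As a minimal sketch of the modal fill on a toy column (the values below are made up, not taken from the dataset): `mode()` returns a Series because ties are possible, which is why the first entry is selected.

```python
import pandas as pd

# Toy roast column with one missing value (hypothetical data).
roast = pd.Series(["Medium-Light", "Light", None, "Medium-Light"], dtype="string")

# mode() ignores missing values and returns a Series, so take its first entry.
modal_roast = roast.mode().iloc[0]
filled = roast.fillna(modal_roast)

print(modal_roast)
print(filled.isna().sum())
```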

Let's fix a typo in the roaster country for one coffee.

In [None]:
df["roaster_country"] = df["roaster_country"].str.replace("New Taiwan", "Taiwan")

# Exploratory data analysis

## Ratings
We can see that the ratings are approximately normally distributed. However, there is a large offset: the median rating is ~94%, which is very high.

In [None]:
df["rating"].hist()

## Coffee pricing
The distribution of coffee prices has a very long tail. This suggests that there may be a benefit in applying a log transformation.

In [None]:
df["100g_USD"].hist()

After applying the log transformation (here `log1p`, i.e. log(1 + x)), the distribution is closer to a normal distribution.

In [None]:
df["100g_USD"].apply(np.log1p).hist()
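
To illustrate why the transformation helps, here is a small sketch with made-up prices (not values from the dataset). It compares how far the largest value sits from the median before and after `log1p`, and shows that `expm1` recovers the original prices.

```python
import numpy as np

# Hypothetical prices per 100 g, with one very expensive coffee in the tail.
prices = np.array([3.0, 5.0, 8.0, 130.0])
log_prices = np.log1p(prices)  # log(1 + x)

# Ratio of the maximum to the median, before and after the transformation.
raw_ratio = prices.max() / np.median(prices)
log_ratio = log_prices.max() / np.median(log_prices)
print(raw_ratio, log_ratio)

# expm1 is the inverse of log1p, so the original prices can be recovered.
recovered = np.expm1(log_prices)
```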

## Roasting style
The vast majority of the coffees have the medium-light roast type. This large imbalance in the dataset may make it challenging for a model to detect any impact of roast style on coffee rating.

In [None]:
df["roast"].hist()

## Roaster country
Most of the data we have is from US roasters.

In [None]:
df["roaster_country"].value_counts()

If we look at the distribution of pricing for the most common countries, we see that it is quite different in each country. In particular, the distribution for coffees sold in the US is much more "peaky". This likely indicates that there is some bias in the dataset. Given that the source of the data is from the US, most coffees in the database are at an affordable price point (for US customers).

In [None]:
import plotly.express as px

countries = ["United States", "Taiwan", "Guatemala"]
px.histogram(
    df[df["roaster_country"].isin(countries)],
    x="100g_USD",
    color="roaster_country",
    barmode="group",
    histnorm='percent',
)

## Country of origin
As expected, most of the coffees come from the largest coffee producers in the world. With the exception of Hawaii, all examples are from one of the following regions:

- Africa
- Central or South America

In [None]:
df["origin"].hist(histnorm='percent')

## Highly and lowly rated coffees

If we look at the highest and lowest rated coffees, we see that they are dominated by certain roasters. This suggests that either:
- Certain roasters find the best/worst coffees or roast them particularly well
- The reviewers favour/dislike certain roasters

In either case, our model may need access to the roaster.

In [None]:
df[df["rating"] > 96]

In [None]:
df[df["rating"] < 90]

# Feature engineering

## Roaster
There is evidence that certain roasters produce particularly good/poor coffee (or are preferred/disliked by the reviewers). The model may therefore need a feature giving it this information.

We cannot simply convert the roaster using one-hot encoding as there are too many different values. Let us instead only include the most common roasters (those with > 10 coffees).

In [None]:
roasters = df["roaster"].value_counts()
ROASTERS = roasters[roasters > 10].index

In [None]:
df["roaster"] = df["roaster"].where(df["roaster"].apply(lambda r: r in ROASTERS), "Other")
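
The same grouping can be sketched on a toy Series. The roaster names and the threshold of 1 are made up for illustration (the notebook uses a threshold of 10), and `isin` is used here as an equivalent membership test.

```python
import pandas as pd

# Hypothetical roaster column: A appears 3 times, B twice, C once.
roaster = pd.Series(["A", "A", "A", "B", "B", "C"])

counts = roaster.value_counts()
common = counts[counts > 1].index  # keep roasters with more than one coffee

# Rare roasters are collapsed into a single "Other" category.
grouped = roaster.where(roaster.isin(common), "Other")
print(grouped.tolist())
```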

## Region of origin
The different regions of the world typically produce coffees which are similar in style. For example, African coffees are typically more acidic. It therefore seems possible that the region may provide as much information as the country of origin, so we will engineer this feature.

In [None]:
import json

with open("./data/regions.json", "r") as f:
    REGIONS = json.load(f)

regions = {}
for r, countries in REGIONS.items():
    for c in countries:
        regions[c] = r

In [None]:
df["region"] = df["origin"].map(regions).fillna("Other")
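
As a sketch of the lookup above: the loop assumes that `regions.json` maps each region to a list of countries. The dictionary below is a hypothetical stand-in for that file, not its real contents.

```python
# Hypothetical stand-in for the contents of regions.json.
REGIONS = {
    "Africa": ["Ethiopia", "Kenya"],
    "Central America": ["Guatemala", "Panama"],
}

# Invert region -> countries into country -> region.
regions = {country: region for region, countries in REGIONS.items() for country in countries}

# Origins that are missing from the lookup fall back to "Other".
origins = ["Kenya", "Panama", "Hawaii"]
mapped = [regions.get(origin, "Other") for origin in origins]
print(mapped)
```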

The vast majority of coffees in the dataset come from the major coffee producing regions of the world as expected.

In [None]:
df["region"].hist(histnorm="percent")

## Flavour notes
As it stands, we cannot glean any information from the review column as it is unstructured. Let's begin by analysing the keywords present in the reviews.

In [None]:
import re


def extract_words(string: str) -> list[str]:
    return re.findall(r'\w+', string.lower())


words = pd.Series([word for review in df["review"] for word in extract_words(review)]).value_counts()

GENERIC_WORDS = ["and", "in", "with", "the", "of", "to", "a", "by", "like", "is", "around"]
COFFEE_WORDS = ["cup", "notes", "finish", "aroma", "hint", "undertones", "resonant", "high", "consolidates", "flavor"]
words = words.drop(GENERIC_WORDS + COFFEE_WORDS)
words.head()
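
The counting step can also be sketched with only the standard library. The two reviews and the stop-word set below are invented for illustration.

```python
import re
from collections import Counter


def extract_words(string: str) -> list[str]:
    return re.findall(r"\w+", string.lower())


# Two made-up reviews (not from the dataset).
reviews = [
    "Crisply sweet, with cocoa and berry notes.",
    "Sweet cocoa finish; berry in the aroma.",
]
STOP_WORDS = {"with", "and", "in", "the", "notes", "finish", "aroma"}

# Count every word that is not a generic or coffee-specific filler word.
counts = Counter(w for review in reviews for w in extract_words(review) if w not in STOP_WORDS)
print(counts.most_common(3))
```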

We can see that the most common words relate to the flavour of the coffee. This suggests that we can extract some features for the different flavours in the coffee.

Using this information and the [coffee flavour wheel](https://www.anychart.com/products/anychart/gallery/Sunburst_Charts/Coffee_Flavour_Wheel.php), we can manually define some flavours and corresponding keywords which are stored in `flavours.json`.

In [None]:
import json

with open("./data/flavours.json", "r") as f:
    FLAVOURS = json.load(f)

We can now add boolean features for each flavour.

In [None]:
def review_contains_words(review: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears as a whole word in the review."""
    words = set(extract_words(review))
    return any(w in words for w in keywords)


for flavour, keywords in FLAVOURS.items():
    df[flavour] = df["review"].apply(review_contains_words, args=(keywords,))
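
Here is a small sketch of the matching logic, with a made-up flavour dictionary standing in for `flavours.json`. Note that keywords are matched against whole words, so "strawberry" does not trigger the keyword "berry"; the keyword lists need to account for this.

```python
import re

# Hypothetical stand-in for flavours.json.
FLAVOURS = {
    "fruity": ["berry", "cherry", "citrus"],
    "chocolate": ["cocoa", "chocolate"],
}


def extract_words(string: str) -> list[str]:
    return re.findall(r"\w+", string.lower())


def flavour_flags(review: str, flavours: dict[str, list[str]]) -> dict[str, bool]:
    """Return one boolean per flavour: does any keyword occur as a whole word?"""
    words = set(extract_words(review))
    return {flavour: not words.isdisjoint(keywords) for flavour, keywords in flavours.items()}


flags = flavour_flags("Bright cherry up front, cocoa in the finish.", FLAVOURS)
no_match = flavour_flags("Strawberry jam sweetness.", FLAVOURS)
print(flags, no_match)
```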

### Popularity of flavours
It is useful to examine the popularity of the different flavours by plotting the fraction of coffees in which each one occurs. We can see that the most common flavours are:
- Caramelly
- Acidic
- Fruity
- Chocolate

Intuitively, this makes sense as these are the sorts of flavours we see on coffee packets.

In [None]:
df[list(FLAVOURS.keys())].sum().divide(df.shape[0]).sort_values(ascending=False).plot.bar()

### Number of flavours per coffee
It is also convenient to check how many flavours the different coffees have. If we have done a good job at defining the flavour keywords, we would expect few coffees to have no flavours.

This appears to be the case. In fact, most coffees have 6 flavours!

In [None]:
num_flavours = df[list(FLAVOURS.keys())].sum(axis=1)
num_flavours.hist()

# Building a model

In [None]:
features = ["roaster", "roast", "roaster_country", "region", "100g_USD"] + list(FLAVOURS.keys())
X = df[features].copy()
X["100g_USD"] = X["100g_USD"].apply(np.log1p)
y = df["rating"]


We split the dataset into train/validation/test sets with a 60%/20%/20% distribution, using the `train_test_split` function with the `random_state` parameter set to 1.

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 25% of the remaining 80% gives a 20% validation set and a 60% training set.
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=1)
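
The 60%/20%/20% proportions follow from the two successive splits: 20% is held out first, and 25% of the remaining 80% is another 20% of the full data. A quick check on synthetic arrays (100 rows, chosen only to make the arithmetic obvious) confirms the sizes and that features and targets stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: each feature row carries the same value as its target.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))

# Rows and targets were shuffled together, so they still match.
aligned = bool((X_train.ravel() == y_train).all())
```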

In [None]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
dv.fit(X_train.to_dict(orient="records"))


def _transform(df: pd.DataFrame):
    return dv.transform(df.to_dict(orient="records"))

## Linear regression
Let's start with the simplest model, which is a linear regressor (here with ridge regularisation).

In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

scores = pd.DataFrame(columns=["train", "validation"])
for alpha in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]:
    model = Ridge(alpha=alpha)
    model.fit(_transform(X_train), y_train)
    # RMSE on each split; the square root is taken explicitly because the
    # `squared` argument is deprecated in newer scikit-learn releases.
    scores.loc[alpha, :] = pd.Series(
        {
            "train": np.sqrt(mean_squared_error(y_train, model.predict(_transform(X_train)))),
            "validation": np.sqrt(mean_squared_error(y_val, model.predict(_transform(X_val)))),
        }
    )

scores.plot(log_x=True)

This suggests that the best value is 1.0 since this gives the same loss on the training and validation sets.
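
For reference, the loss used here is the root mean squared error (RMSE), which is in the same units as the rating itself. A minimal hand computation on made-up ratings:

```python
import numpy as np

# Hypothetical true and predicted ratings.
y_true = np.array([90.0, 94.0, 96.0])
y_pred = np.array([92.0, 94.0, 95.0])

# Errors are -2, 0 and 1, so the mean squared error is (4 + 0 + 1) / 3.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```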

In [None]:
linear_model = Ridge(alpha=1.0)
linear_model.fit(_transform(X_train_val), y_train_val)

In [None]:
import plotly.graph_objects as go

fig = px.scatter(x=linear_model.predict(_transform(X_val)), y=y_val)
fig.add_trace(go.Scatter(x=[80, 100], y=[80, 100], showlegend=False))

This model captures the central part of the distribution quite well, but fails to predict the very high or low ratings.

In [None]:
pd.DataFrame(
    {"true": y_train_val, "prediction": np.round(linear_model.predict(_transform(X_train_val)), decimals=0)}
).hist(histnorm="percent", barmode="group")

## Gradient-boosted trees

In [None]:
import xgboost as xgb

eval_sets = {
    "train": (_transform(X_train), y_train),
    "validation": (_transform(X_val), y_val),
}

scores = {}
for max_depth in [1, 2, 3, 4, 5]:
    xgb_params = {
        'max_depth': max_depth,
        'min_child_weight': 1,
        'objective': 'reg:squarederror',
        'seed': 1,
        'verbosity': 1,
    }

    model = xgb.XGBRegressor(**xgb_params, eval_metric="rmse")
    model.fit(_transform(X_train), y_train, eval_set=list(eval_sets.values()))

    results = model.evals_result()
    scores[max_depth] = pd.DataFrame({k: results[f"validation_{i}"]["rmse"] for i, k in enumerate(eval_sets)})

pd.DataFrame({depth: df["validation"] for depth, df in scores.items()}).plot(
    labels={"index": "n_estimators", "variable": "max_depth", "value": "rmse"}
)

Let us select max depth 3 with 90 estimators, since this gives the lowest validation loss.
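
Rather than reading the values off the plot, the minimum could also be located programmatically. The sketch below uses made-up RMSE curves (one column per `max_depth`, one row per boosting round) rather than the real results, so the numbers it prints are not those of this notebook.

```python
import pandas as pd

# Hypothetical validation RMSE curves: columns are max_depth, rows are boosting rounds.
curves = pd.DataFrame(
    {
        1: [2.00, 1.60, 1.50, 1.45],
        2: [1.90, 1.40, 1.30, 1.35],
        3: [1.80, 1.35, 1.25, 1.30],
    }
)

# Depth whose curve reaches the lowest validation RMSE.
best_depth = curves.min().idxmin()

# Rows are 0-indexed boosting rounds, so add 1 to get the number of estimators.
best_n_estimators = curves[best_depth].idxmin() + 1
print(best_depth, best_n_estimators)
```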

In [None]:
scores = {}
for eta in [0.01, 0.03, 0.1, 0.3, 1.0]:
    xgb_params = {
        'max_depth': 3,
        'n_estimators': 90,
        "eta": eta,
        'min_child_weight': 1,
        'objective': 'reg:squarederror',
        'seed': 1,
        'verbosity': 1,
    }

    model = xgb.XGBRegressor(**xgb_params, eval_metric="rmse")
    model.fit(_transform(X_train), y_train, eval_set=list(eval_sets.values()))

    results = model.evals_result()
    scores[eta] = pd.DataFrame({k: results[f"validation_{i}"]["rmse"] for i, k in enumerate(eval_sets)})

pd.DataFrame({eta: df["validation"] for eta, df in scores.items()}).plot(
    labels={"index": "n_estimators", "variable": "eta", "value": "rmse"}
)

We select `eta` = 0.3.

In [None]:
xgb_params = {
    'max_depth': 3,
    'n_estimators': 90,
    "eta": 0.3,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'seed': 1,
    'verbosity': 1,
}
xgb_model = xgb.XGBRegressor(**xgb_params, eval_metric="rmse")
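# Only the combined train+validation set is monitored here, so "validation_0" holds the training RMSE.
xgb_model.fit(_transform(X_train_val), y_train_val, eval_set=[(_transform(X_train_val), y_train_val)])
results = xgb_model.evals_result()
scores = pd.Series(results["validation_0"]["rmse"])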

In [None]:
scores.plot(labels={"index": "n_estimators", "value": "rmse"})

In the same way as the linear model, this model fails to capture the very low or high ratings.

In [None]:
pd.DataFrame(
    {
        "true": y_train_val,
        "prediction": np.round(xgb_model.predict(_transform(X_train_val)), decimals=0),
    }
).hist(histnorm="percent", barmode="group")

# Comparison of the models



In [None]:
models = {"linear": linear_model, "xgb": xgb_model}

Both models perform similarly well on the test set.

In [None]:
scores = pd.Series(dtype=float)
for name, model in models.items():
    y_pred = model.predict(_transform(X_test))
    # Square root taken explicitly: the `squared` argument is deprecated in newer scikit-learn releases.
    scores[name] = np.sqrt(mean_squared_error(y_test, y_pred))

scores.plot.bar()

We also see that they lead to the same distribution of ratings. This suggests that the choice of model is not the reason for failing to predict the highest/lowest scores; it is more likely due to some other, more systematic error, such as:
- Lack of information in the features (e.g. perhaps we need more detailed information about the origin)
- Systematic error in the reviews (e.g. different reviewers)

In [None]:
pd.DataFrame(
    {"true": y_test} | {name: np.round(model.predict(_transform(X_test)), decimals=0) for name, model in models.items()}
).hist(histnorm="percent", barmode="group")

## Feature importances
We can get a bit more insight by evaluating the importance of the different features.

In [None]:
from sklearn.inspection import permutation_importance

importances = {}
for name, model in models.items():
    r = permutation_importance(model, _transform(X_test), y_test, n_repeats=10, random_state=0)
    importances[name] = pd.Series(dict(zip(dv.get_feature_names_out(), r.importances_mean)))

importances = pd.DataFrame(importances)
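
As a sanity check of what permutation importance measures, here is a sketch on synthetic data (not the coffee dataset). Only the first column drives the target, so shuffling it should hurt the score far more than shuffling the noise column.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Only the first column influences the target; the second is pure noise.
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

demo_model = Ridge(alpha=1.0).fit(X_demo, y_demo)
result = permutation_importance(demo_model, X_demo, y_demo, n_repeats=10, random_state=0)

# Mean drop in the model's default score (R^2) when each column is shuffled.
informative, noise = result.importances_mean
print(informative, noise)
```

Note that in the notebook the importances are computed per encoded column, so a categorical feature such as the region is spread over several `region=...` entries.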

We can see in both cases that the biggest influence is the price. This suggests that either:
- Price is genuinely an indicator of quality
- Price biases the reviewers

Other than the price, the region of origin has a large influence. Surprisingly, the flavour notes do not have that much influence.

In [None]:
importances.loc[importances.max(axis=1).sort_values(ascending=False).index].head(10)

## Final model selection
Overall, the two models have very similar performance. Since the linear regression model is simpler (and has slightly better performance), this is the preferred model.