# CARTE-Enbridge Bootcamp
## AI in Market Strategy

We are starting off today a little differently! Because the value of AI in Market Strategy is centred around specific applications, we are going to work through three different case studies. Each case study will focus on both a different domain and a different technology. By the end, we will have a strong understanding of the growing role of AI in Market Strategy!

## Case Study 1: Predictive Analytics

To begin with, we will be looking at a dataset of avocado prices and demand over a three-year period. Grocery stores need to understand trends in demand and pricing for avocados, to ensure they have enough stock and to ensure they are pricing their avocados competitively. We will be using a toolkit from Meta (aka Facebook) called [Prophet](https://facebook.github.io/prophet/). Prophet is a forecasting tool that is designed to be easy to use, and to produce forecasts that are both accurate and explainable.

Load the dataset in the cell below. Because we are using time-series data, we instruct Pandas to `parse` the dates in the dataset. This allows us to do things like compute the time between two dates, or to group data by year, month, or day. We specify the format to be `YYYY-MM-DD`, which is represented by `%Y-%m-%d`.

In [None]:
import pandas as pd

df = pd.read_csv("https://github.com/alexwolson/carte_workshop_datasets/raw/main/avocado.csv.zip", compression="zip", index_col=0)
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
df.set_index("Date", inplace=True)
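
To make the benefit of parsed dates concrete, here is a minimal sketch on a tiny made-up table (the values are illustrative and not taken from the avocado dataset). It shows the two operations mentioned above: computing the time between two dates, and grouping by year.

```python
import pandas as pd

# Toy data standing in for the avocado file (values are made up)
toy = pd.DataFrame({
    "Date": ["2017-12-24", "2017-12-31", "2018-01-07"],
    "AveragePrice": [1.10, 1.20, 1.30],
})
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Subtracting two parsed dates gives a Timedelta
gap = toy["Date"].iloc[2] - toy["Date"].iloc[0]
print(f"Days between first and last row: {gap.days}")

# The .dt accessor exposes date parts such as the calendar year
yearly_mean = toy.groupby(toy["Date"].dt.year)["AveragePrice"].mean()
print(yearly_mean)
```

Neither operation would work if the `Date` column were still stored as plain strings.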

In [None]:
df.head() # 4046, 4225, 4770 are the PLU codes for different types of avocados

As ever, we will start by exploring the data. Let's plot the average price of avocados over time. We can use the `resample` method to group the data by month, and then take the average of each group. We can then plot the result using the `plot` method.

In [None]:
import matplotlib.pyplot as plt

df.resample("M")["AveragePrice"].mean().plot(figsize=(15,7))
plt.ylabel("Average Price")
plt.title("Average Price of Avocados")
plt.show()
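
If `resample` is new to you, the following sketch shows what it does on four made-up rows. It assumes the `"MS"` (month start) frequency alias rather than the `"M"` alias used above; the two group rows into the same calendar months and differ only in how each group is labelled.

```python
import pandas as pd

# Four made-up weekly observations spanning two months
dates = pd.to_datetime(["2018-01-07", "2018-01-21", "2018-02-04", "2018-02-18"])
toy = pd.DataFrame({"AveragePrice": [1.0, 1.2, 1.4, 1.8]}, index=dates)

# Group rows into calendar months, then average each group
monthly = toy.resample("MS")["AveragePrice"].mean()
print(monthly)
```

Each month collapses to a single row, which is what makes the plotted line smoother than the raw weekly data.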

Let's also look at the volume of each PLU code sold over time:

In [None]:
df.resample("M")[["4046", "4225", "4770"]].sum().plot(figsize=(15,7))
plt.ylabel("Volume")
plt.title("Volume of Avocados Sold")
plt.show()

Now let's move to building a predictive model. We will use Prophet to predict the total weekly volume of avocados sold. We will start by creating a new DataFrame with the columns that Prophet expects: `ds` for the date, and `y` for the value we want to predict. Since we want to be able to evaluate the quality of our predictions, we will separate out the last 6 months (26 weeks) of data as a test set.

In [None]:
prophet_df = df[["Total Volume"]].resample("W").sum().reset_index() # Aggregate to the week level
prophet_df.columns = ["ds", "y"]
prophet_df_train = prophet_df[:-26] # All but the last six months
prophet_df_test = prophet_df[-26:]

Now we can create a Prophet model and fit it to our training data. Prophet supports automatically considering holidays, but we don't expect holidays to have a large impact on avocado sales, so we won't take advantage of this. In other contexts, considering things like weather, 'shocks' (e.g. a pandemic), or other events can be very important.

In [None]:
!pip install -U -q prophet plotly fastapi kaleido python-multipart uvicorn "typing-extensions<4.6.0"

In [None]:
from prophet import Prophet
from time import time

model = Prophet(interval_width=1)
start_time = time()
model.fit(prophet_df_train)
print(f'Training time: {time() - start_time} seconds')

In [None]:
predictions = model.predict(prophet_df_test)
# Calculate percentage of true values that fall between yhat_lower and yhat_upper
correct = []
for i in range(len(predictions)):
    if (prophet_df_test["y"].iloc[i] >= predictions["yhat_lower"].iloc[i]) and (prophet_df_test["y"].iloc[i] <= predictions["yhat_upper"].iloc[i]):
        correct.append(1)
    else:
        correct.append(0)
print(f"Percentage of true values that fall between yhat_lower and yhat_upper: {sum(correct)/len(correct) * 100:.2f}%")
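
The same coverage check can be written without a loop. The sketch below uses made-up numbers and the pandas `Series.between` method; the variable names `actual`, `lower`, and `upper` are illustrative stand-ins for the `y`, `yhat_lower`, and `yhat_upper` columns.

```python
import pandas as pd

# Made-up true values and interval bounds
actual = pd.Series([10.0, 12.0, 15.0, 9.0])
lower = pd.Series([8.0, 13.0, 14.0, 9.0])
upper = pd.Series([11.0, 14.0, 16.0, 10.0])

# between() is inclusive on both ends by default, matching the >= and <= checks
inside = actual.between(lower, upper)
coverage = inside.mean() * 100
print(f"Coverage: {coverage:.2f}%")
```

When applying this to the real data, note that the test frame and the prediction frame may not share the same row index, so the bounds would likely need converting with `.to_numpy()` before the comparison.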

We can see how often the true values fall inside the model's uncertainty interval, which gives a rough sense of how reliable the forecast is. We can visualize the predictions using the `plot_plotly` function. The black dots represent the observed values that the model was trained on, and the blue line represents the predictions. The shaded blue area represents the uncertainty in the predictions.

In [None]:
from prophet.plot import plot_plotly, plot_components_plotly

fig = plot_plotly(model, model.predict(prophet_df_test), figsize=(1300,600))
fig.show()

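We can also break down the predictions into their components. The first plot shows the overall trend in avocado sales, and the second plot shows the seasonality that Prophet detected. Because our data is aggregated to one row per week, this is typically the yearly pattern rather than a day-of-week pattern.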

In [None]:
fig = plot_components_plotly(model, model.predict(prophet_df), figsize=(1300,400))
fig.show()

**Your Turn**

The avocado dataset breaks down the data by organic and conventional avocados. Using the separated datasets below, fit two distinct models to predict the volume of organic and conventional avocados sold. How do the predictions compare? What are the main differences between the two models?

In [None]:
prophet_df_conventional = df[df["type"] == "conventional"][["Total Volume"]].resample("W").sum().reset_index()
prophet_df_conventional.columns = ["ds", "y"]
prophet_df_conventional_train = prophet_df_conventional[:-26]
prophet_df_conventional_test = prophet_df_conventional[-26:]

prophet_df_organic = df[df["type"] == "organic"][["Total Volume"]].resample("W").sum().reset_index()
prophet_df_organic.columns = ["ds", "y"]
prophet_df_organic_train = prophet_df_organic[:-26]
prophet_df_organic_test = prophet_df_organic[-26:]

In [None]:
# Your Code Here

## Case Study 2: Natural Language Processing

In this case study, we will be looking at a dataset of natural-text reviews of wine. We will be using a toolkit called [spaCy](https://spacy.io/). spaCy is a Python library for Natural Language Processing (NLP) that is designed to be fast and production-ready. spaCy is a very powerful toolkit, and we will only be scratching the surface of what it can do today.

Load the dataset in the cell below. We will be using the `description` column, which contains the text of the review, and the `points` column, which contains the score given to the wine by the reviewer.

In [None]:
df = pd.read_csv("https://github.com/alexwolson/carte_workshop_datasets/raw/main/winemag-data-130k-v2.csv.zip", compression="zip", index_col=0).sample(frac=0.5)

In [None]:
df.head()

In [None]:
!pip install -U -q "spacy<3.7.0,>=3.6.0"

With spaCy, we can use a number of different language models made available for 73 different languages. To make sure that our code runs quickly, we will download the smallest English model, `en_core_web_sm`.

In [None]:
!python -m spacy download en_core_web_sm -q

In [None]:
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

The first step in any NLP task is to tokenize the text. Tokenization is the process of breaking up a string into a list of words. When we looked at encoding on Tuesday, the HuggingFace library handled this for us, but spaCy leaves us to decide how we want to accomplish this. spaCy provides a `tokenizer` object that we can use to tokenize a string. We can then iterate over the tokens to get the individual words. spaCy also provides a `lemmatizer` object that we can use to get the root form of each word. This is useful because it allows us to group together words that have the same meaning, but different forms (e.g. "run", "runs", "running").

In [None]:
tokens = []
lemmas = []
first_doc = nlp(df["description"].iloc[0].lower())
print('word       root       part       stop')
print('--------------------------------------')
for token in first_doc:
    tokens.append(token.text)
    if not token.is_stop and not token.is_punct:
        lemmas.append(token.lemma_)
    # Only print the root form when it differs from the word itself
    if token.text != token.lemma_:
        print(f'{token.text:10} {token.lemma_:10} {token.pos_:10} {token.is_stop if token.is_stop else ""}')
    else:
        print(f'{token.text:10}            {token.pos_:10} {token.is_stop if token.is_stop else ""}')

We will apply the process to the entire dataset, using the `nlp.pipe` method. This method allows us to efficiently process a large number of documents. We will also remove stop words, which are words that are very common and don't add much meaning to the text (e.g. "the", "and", "a").

To speed up the process, we will disable the `parser` and `ner` components of the spaCy pipeline. The `parser` component is used to determine the syntactic structure of the text, and the `ner` component is used to identify named entities (e.g. people, places, organizations). Since we are only interested in the tokens, we can disable these components to speed up the process. We are also using the smallest spaCy model, which is faster but less accurate than the larger models. In a production setting, we would likely use a larger model, and a GPU to speed up the process.

In [None]:
tokens = []
for doc in tqdm(nlp.pipe(df["description"].str.lower(), disable=["parser", "ner"]), total=len(df)):
    tokens.append(" ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct]))

Now that we have tokenized the text, we can use it to build a predictive model. We will use the `tokens` list as our input, and the `points` column as our output. We will use a `CountVectorizer` to convert each review's tokens into a vector of word counts. We will then use a `LinearRegression` model to predict the score given to the wine by the reviewer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("vectorizer", CountVectorizer(min_df=0.01)), # Only include words that appear in at least 1% of reviews
    ("regressor", LinearRegression())
])

x_train, x_test, y_train, y_test = train_test_split(tokens, df["points"], test_size=0.2, random_state=42)

start_time = time()
model.fit(x_train, y_train)
print(f'Training time: {time() - start_time} seconds')

In [None]:
print(f'MAE: {mean_absolute_error(y_test, model.predict(x_test)):.2f}')

This is a very strong result! We are able to predict the score given to a wine by the reviewer with a mean absolute error of around 1.63 points on a 100-point scale (the exact figure will vary slightly from run to run, because we sampled half of the dataset at random). Let's look at the words that are most associated with high and low scores. We can do this by looking at the coefficients of the `LinearRegression` model.

In [None]:
# Get words most associated with high scores
words = model.named_steps["vectorizer"].get_feature_names_out()
coefficients = model.named_steps["regressor"].coef_
word_scores = pd.DataFrame({"word": words, "score": coefficients})
word_scores.sort_values("score", ascending=False).head(10)

In [None]:
# Get words most associated with low scores
word_scores.sort_values("score", ascending=True).head(10)

In [None]:
# Get top 10 reviews with worst predictions in the test set
test_df = pd.DataFrame({"text": x_test, "actual": y_test, "predicted": model.predict(x_test)})
test_df["error"] = abs(test_df["actual"] - test_df["predicted"])
test_df.sort_values("error", ascending=False).head(10)

**Your Turn**

Our model predicts the score given to a wine based on the text of the review. But there are a few different columns that we could alternatively predict! Choose one of the following columns, and build a model to predict it based on the text of the review. Explore the results. Do you find anything interesting?

* `country`
* `price`
* `variety`
* `winery`

In [None]:
# Your code here

## Case Study 3: Recommendation

For our last case study, we are going to look at a dataset of movie ratings. We're going to start by building a simple recommendation system that recommends movies by finding the most similar users. Then, we will move on to using a powerful library that implements some of the state-of-the-art approaches.

Let's begin by loading part of the MovieLens dataset. This is a popular dataset of user ratings of movies. We will be using the `ratings` dataset, which contains the ratings given by users to movies. We will also load the `movies` dataset, which contains information about each movie.

In [None]:
movies = pd.read_csv("https://github.com/alexwolson/carte_workshop_datasets/raw/main/movies.csv.zip", compression="zip")
ratings = pd.read_csv("https://github.com/alexwolson/carte_workshop_datasets/raw/main/ratings.csv.zip", compression="zip")

In [None]:
movies.head()

In [None]:
ratings.head()

As you can see, the `ratings` dataset contains a `userId`, a `movieId`, a `rating`, and a `timestamp`. The `movies` dataset contains a `movieId`, a `title`, and a list of `genres`. We are not going to make predictions based on genre today, but it's a common approach to recommendation in this area. Instead, we will just focus on the users and their ratings.

Let's look at a random user to get a sense of what a user's ratings could look like:

In [None]:
ratings[ratings.userId == 42].merge(movies, on="movieId") # Merging so that we can see what the movies are

If we are working with users who have already rated a number of movies on the system, one approach is to look for the most similar users, and then recommend movies that those users have rated highly. We can do this by computing the similarity between users. We will use the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the ratings of two users as our measure of similarity. The cosine similarity is a measure of the angle between two vectors. If the angle is small, the vectors are similar. If the angle is large, the vectors are dissimilar. We will use the `cosine_similarity` function from the `sklearn.metrics.pairwise` module to compute the cosine similarity between users.
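
As a rough illustration of the idea, the sketch below computes cosine similarity by hand with NumPy on three made-up rating vectors. The helper name `cosine` is hypothetical; in the notebook we rely on scikit-learn's `cosine_similarity` instead.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over product of lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([5.0, 0.0, 3.0])
b = np.array([10.0, 0.0, 6.0])  # points in the same direction as a
c = np.array([0.0, 4.0, 0.0])   # shares no rated movies with a

sim_ab = cosine(a, b)
sim_ac = cosine(a, c)
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

Note that `a` and `b` come out as maximally similar even though `b`'s ratings are twice as large: cosine similarity looks only at direction, not magnitude.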

We will also need to convert the format of our data from a list of users and ratings, to a matrix of users and movies. We can do this using the `pivot_table` method. This method takes a DataFrame, and converts it from a long format to a wide format. We will use the `userId` as the index, the `movieId` as the columns, and the `rating` as the values. We will also fill in any missing values with 0, so that a movie a user has not rated contributes nothing to the cosine similarity.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

ratings_matrix = ratings.pivot_table(index="userId", columns="movieId", values="rating", fill_value=0)
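
To see the long-to-wide conversion on a small scale, here is a sketch with four made-up ratings. The column names match the MovieLens files, but the values are invented.

```python
import pandas as pd

# Long format: one row per (user, movie) rating
long_df = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Wide format: one row per user, one column per movie, 0 where no rating exists
wide = long_df.pivot_table(index="userId", columns="movieId", values="rating", fill_value=0)
print(wide)
```

Most cells in the real matrix will be 0, because each user has rated only a small fraction of the movies.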

In [None]:
# Use .loc to select by userId; .iloc would select by row position instead
user_one = ratings_matrix.loc[42]
user_two = ratings_matrix.loc[43]
print(f'Cosine similarity between user 42 and user 43: {cosine_similarity([user_one], [user_two])[0][0]:.2f}')

Now that we have our matrix and our method, let's go ahead and compute the similarity between each pair of users. The result is a square matrix with one row and one column per user, which we will then wrap in a DataFrame indexed by `userId`.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
import numpy as np

# Convert the ratings matrix to a sparse matrix format if not already
ratings_sparse = csr_matrix(ratings_matrix.values)

# Compute the cosine similarity matrix in a vectorized way
# This computes the full n x n similarity matrix
similarities = cosine_similarity(ratings_sparse)

# Since the similarity with itself is always 1, we can fill the diagonal with 1s
np.fill_diagonal(similarities, 1)

Now that we have the similarity between each pair of users, we can use it to make recommendations. For user 42, we can take the top 10 users who are most similar, and then recommend the movies that they have rated most highly.

In [None]:
similarities_df = pd.DataFrame(similarities, index=ratings_matrix.index, columns=ratings_matrix.index)

In [None]:
# Drop user 42 first, since every user is trivially most similar to themselves
similar_users = similarities_df[42].drop(42).sort_values(ascending=False).head(10)

In [None]:
recommended_movies = ratings_matrix.loc[similar_users.index].mean().sort_values(ascending=False)
# Remove movies that the user has already rated
recommended_movies = recommended_movies[~recommended_movies.index.isin(ratings_matrix.loc[42].replace(0, np.nan).dropna().index)]

In [None]:
for movie_id, rating in recommended_movies.head(10).items():
    print(f'{movies[movies["movieId"] == movie_id]["title"].iloc[0]} ({rating:.2f})')

And there we have it - a simple recommendation system! Unfortunately, this approach has some major problems.

1. Scalability - while it doesn't take too long to calculate similarities between 600 or so users, a company like Netflix has many millions of users!
2. Cold start - what if we have a new user who hasn't rated any movies yet? We can't make any recommendations for them.
3. Popularity bias - this approach will recommend popular movies, since lots of people have rated them, even if they are not a good fit for the user.

Let's use the same data, but employ a more sophisticated approach. We will use a library called Surprise, which implements a number of state-of-the-art methods for recommendation.

In [None]:
!pip install -U -q surprise

First, we have to convert the data into a format that Surprise can understand. We will use the `Reader` class to specify the range of ratings, and then use the `Dataset` class to convert the data.

In [None]:
from surprise import Dataset, Reader, SVD

reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

We are going to use Singular Value Decomposition, or SVD. SVD works by breaking down our single, huge user-movie matrix into three smaller matrices. This process allows us to capture the most important patterns in the data using fewer details, which is essential when working with millions or even _billions_ of users. Using these three smaller matrices, SVD can approximate the expected values for missing entries in the user-movie matrix. This allows us to predict ratings for user-movie pairs that we have never observed, including movies that have not been rated by many users.
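
The sketch below illustrates the idea of "three smaller matrices" using NumPy's classical SVD on a tiny, fully filled-in, made-up ratings matrix. This is an assumption-laden simplification: Surprise's `SVD` class learns its factors only from the ratings that actually exist, so treat this as intuition for the low-rank idea rather than a description of what Surprise does internally.

```python
import numpy as np

# Made-up ratings: rows are users, columns are movies.
# Users 0-1 like the first two movies; users 2-3 like the last two.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Classical SVD: R = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k strongest patterns
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("Singular values:", np.round(s, 2))
print("Largest reconstruction error:", np.round(np.abs(R - R_k).max(), 2))
```

With only two of the four patterns kept, the reconstruction stays close to the original because the two "taste groups" explain most of the structure in this toy matrix.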

In [None]:
model = SVD(random_state=42)
start_time = time()
model.fit(data.build_full_trainset())
print(f'Training time: {time() - start_time} seconds')

In [None]:
# Get top 10 movies for user 42
user_42_movies = ratings[ratings["userId"] == 42]["movieId"].unique()
predicted_ratings = []
for movie_id in movies["movieId"].unique():
    if movie_id in user_42_movies:
        continue
    predicted_ratings.append((movie_id, model.predict(42, movie_id).est))
predicted_ratings.sort(key=lambda x: x[1], reverse=True)

In [None]:
for movie_id, rating in predicted_ratings[:10]:
    print(f'{movies[movies["movieId"] == movie_id]["title"].iloc[0]} ({rating:.2f})')

As you can see, while many of these films are certainly popular, the SVD approach allows us to recommend movies that are more tailored to the user. We can also use the model to predict the rating that a user will give to a movie. Comparing these predictions against the user's actual ratings gives a quick check of the model's quality, although an optimistic one, since the model was trained on these same ratings.

In [None]:
ratings[ratings["userId"] == 42].merge(movies, on="movieId") # Merging so that we can see what the movies are

In [None]:
predictions = []
for movie_id in user_42_movies:
    predictions.append({
        "title": movies[movies["movieId"] == movie_id]["title"].iloc[0],
        "predicted": model.predict(42, movie_id).est,
        "actual": ratings[(ratings["userId"] == 42) & (ratings["movieId"] == movie_id)]["rating"].iloc[0]
    })
predictions_df = pd.DataFrame(predictions)
predictions_df["error"] = abs(predictions_df["predicted"] - predictions_df["actual"])
print(f'MAE: {predictions_df["error"].mean():.2f}')

Let's compare this against our original method:

In [None]:
# Drop user 42 first, so that their own ratings do not leak into the prediction
similar_users = similarities_df[42].drop(42).sort_values(ascending=False).head(10)
recommended_movies = ratings_matrix.loc[similar_users.index].mean().sort_values(ascending=False)

predictions = []
for movie_id in user_42_movies:
    predictions.append({
        "title": movies[movies["movieId"] == movie_id]["title"].iloc[0],
        "predicted": recommended_movies[movie_id],
        "actual": ratings[(ratings["userId"] == 42) & (ratings["movieId"] == movie_id)]["rating"].iloc[0]
    })
predictions_df = pd.DataFrame(predictions)
predictions_df["error"] = abs(predictions_df["predicted"] - predictions_df["actual"])
print(f'MAE: {predictions_df["error"].mean():.2f}')

As we can see, the SVD approach is not only much faster, but more accurate in reproducing the user's original ratings. SVD is simple, effective, and highly scalable - which is why it was the industry standard for companies like Amazon, Netflix, and Spotify for many years.

**Your Turn**

One challenge with a five-star rating system (and a big reason why companies like YouTube and Netflix have long since moved to a 'thumbs up, thumbs down' approach) is that each user has a different idea of what each rating means. For example, one user might give a 5-star rating to their favourite movie, while another user might only give a 5-star rating to a movie that they consider to be perfect. Try setting all ratings to 1 if a user rated 4 or 5, or 0 otherwise. How does this affect prediction quality?

In [None]:
# Your code here

## Conclusion, and bonus

We have covered a lot of ground today! We have looked at three different case studies, each of which uses a different approach to AI in Market Strategy. We have seen how AI can be used to predict the future, to understand text, and to make recommendations. We've looked at not just real-world examples, but also state-of-the-art toolkits that are used by many companies today.

As a bonus exercise, pick the case study that you found most interesting, and see if you can expand on the results from today. Some ideas:

* Predictive Analytics: Can you visualize the data in a more informative way? Can you predict the volume of avocados sold for a specific region, or for a specific type of avocado?
* Natural Language Processing: Can you identify reviewers' favourite regions or varieties of wine? Can you identify the most common words used to describe different types of wine?
* Recommendation: Surprise includes a number of different models. Can you try a different model, and compare the results? Can you use the model to recommend movies to a new user?