#### authors: Rafael Dousse, Eva Ray, Massimo Stefani

# Exercise 1 - Analysis

The bank UBS is offering the possibility to invest money in investment funds. A fund is composed of financial values such as stocks or bonds. For example, a fund composed mostly of stocks has more return potential but is riskier in case of a stock market recession. There are thousands of funds available, see https://fundgate.ubs.com/. The probability to invest or not in a fund is conditioned by the profile of the fund and of the client. For example, a younger client with no child is potentially more interested in funds composed of stocks, showing higher risks but also higher potential returns. A family father will be more inclined to invest in low-risk funds. UBS wants to build a system as illustrated in Figure~\ref{fig:ubs_system}, taking as input a set of values characterizing the fund and the client profile.

An investment fund can be characterized by the following elements: 

- The name of the fund.
- The current value of 1 share in the fund, expressed in CHF.
- The proportion of stock and bonds composing the fund (2 values in percentage).
- A vector of float values with the 5 last yearly returns over years from 2015 to 2019 (5 values expressed in percentage).
- A level of risk expressed with A, B, C, D, E with A representing the highest risk and E representing the lowest risk level.
- A sectorial information such as technology, pharmaceutical, financial. There are 24 different sectors available in UBS funds.
- As the set of funds is worldwide, the emitting location is also available with the address of the managing entity of the fund, e.g. Market Street 1234, New York, USA.

A client profile contains the following information: 

- First name and last name of the client.
- The mother tongue of the client (mostly de, fr, it and en but other languages are present).
- The age of the client.
- The number of children of the client.
- The current wealth of the client that could be used to buy funds, expressed in CHF (total of cash available in the different accounts, not yet invested in funds).
- The postal code of the address of the client.
- A level of acceptance to risk expressed with A, B, C, D, E with A representing the highest level of acceptance of risk and E representing the lowest acceptance of risk.

Answer the following questions:

1. For each available information in the fund and client profile, explain how you would prepare the data: encoding, normalization, outlier treatment, etc.
2. How could you collect targets (output of the system) to train the system? How would you prepare the different sets?

**Be as comprehensive as possible.**  Don't limit your explanation to the "how" but also the "why".

---

**a) For each available information in the fund and client profile, explain why and how you would prepare the data: encoding, normalization, outlier treatment, etc. You may also decide not to include a piece of information, in that case, explain why.**

Our goal is to prepare the data so that they can be used as input to a machine learning model, and to ensure that the model can learn effectively from the data. To do that, we need to ensure that the data is in a suitable format, that the features are appropriately scaled, and that any categorical variables are encoded in a way that the model can understand.

#### Fund information
- **Name of the fund**: This value doesn't bring any information for the model, so we can drop it. If we were to use it, we would consider the name as a sequence of words and use word embeddings to represent it numerically.
- **Current value of 1 share in the fund, expressed in CHF**: This is a continuous numerical variable. It can vary a lot between different funds, so we can use z-score normalization to scale it. We could also consider using log transformation to reduce the impact of extreme values, if there are any funds with very high share values and some with very low share values.
- **Proportion of stock and bonds composing the fund (2 values in percentage)**: Each percentage is a numerical variable that takes values between 0 and 100. Since the bounds are known, we can use min-max rescaling (here, a division by 100) to scale it to a range between 0 and 1. If the two values always sum to 100, one of them is redundant and can be dropped.
- **Vector of float values with the 5 last yearly returns over years from 2015 to 2019 (5 values expressed in percentage)**: Each yearly return is a numerical variable that can take positive or negative values. We can use min-max normalization to scale it to a range between -1 and 1. Since min-max scaling is sensitive to extreme values, we can switch to z-score normalization if we find too many outliers. We could also do feature engineering by computing the average return or the standard deviation of the returns over the 5 years, to reduce the dimensionality of the input.
- **Level of risk expressed with A, B, C, D, E with A representing the highest risk and E representing the lowest risk level**: This is a categorical variable with an order, so we can use ordinal encoding to represent it numerically. For example, we can map A to 5, B to 4, C to 3, D to 2 and E to 1.
- **Sectorial information such as technology, pharmaceutical, financial. There are 24 different sectors available in UBS funds**: This is a categorical variable without order, that can take 24 different values. We can use one-hot encoding to represent it numerically. The resulting vector will have 24 dimensions, with a 1 in the dimension corresponding to the sector of the fund and 0 in all other dimensions.
- **Emitting location with the address of the managing entity of the fund**: This is a textual variable that can take many different values. The exact address is not necessarily relevant, but the country might be. We can extract the country from the address and use one-hot encoding to represent it numerically (so, basically treat it like a categorical variable without order).
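
The fund encodings listed above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the share prices and the three-sector list are invented toy values (the real data has 24 sectors), and the helper names `zscore` and `one_hot` are hypothetical. Only the A to E mapping comes from the text.

```python
import math
import statistics

RISK_ORDER = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}  # mapping given in the text


def zscore(values):
    """Standardize values to mean 0 and standard deviation 1.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical toy data, invented for illustration only.
share_prices_chf = [12.5, 98.0, 1450.0, 230.0]
sectors = ["technology", "pharmaceutical", "financial"]  # 24 in the real data

# Log transform first to compress very high share values, then z-score.
log_prices = [math.log(p) for p in share_prices_chf]
scaled_prices = zscore(log_prices)

stock_scaled = 70.0 / 100.0  # min-max with known bounds 0 and 100

risk_encoded = RISK_ORDER["B"]
sector_encoded = one_hot("financial", sectors)
```

In a real pipeline, the mean and standard deviation used by `zscore` would be computed on the training set only and reused for the other sets.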

#### Client information

- **First name and last name of the client**: This value doesn't bring any information for the model, so we can drop it.
- **Mother tongue of the client**: This is a categorical variable without order, that can take a few different values (de, fr, it, ...). We can use one-hot encoding to represent it numerically.
- **Age of the client**: This is a numerical variable that can take values between 0 and about 110. Since we know the minimum and maximum possible values, we can use min-max normalization to scale it to a range between 0 and 1.
- **Number of children of the client**: This is a numerical variable that can take values between 0 and a small number (probably less than 10). We can use min-max normalization to scale it to a range between 0 and 1 or even leave it as is, since the range of values is already small.
- **Current wealth of the client that could be used to buy funds, expressed in CHF (total of cash available in the different accounts, not yet invested in funds)**: This is a continuous numerical variable. It can vary a lot between different clients, so we can use z-score normalization to scale it. We could also consider using log transformation to reduce the impact of extreme values, if there are any clients with very high wealth and some with very low wealth.
- **Postal code of the address of the client**: This is a categorical variable without order, even though it looks like a number. The exact postal code is not necessarily relevant and has too many distinct values for one-hot encoding, but the region might be informative. We can map the postal code to a coarser region (for example, by grouping codes that share the same leading digits) and use one-hot encoding to represent it numerically (so, basically treat it like a categorical variable without order).
- **Level of acceptance to risk expressed with A, B, C, D, E with A representing the highest level of acceptance of risk and E representing the lowest acceptance of risk**: As before, this is a categorical variable with an order, so we can use ordinal encoding to represent it numerically. For example, we can map A to 5, B to 4, C to 3, D to 2 and E to 1.
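
As a small sketch of two of the client encodings above, the code below one-hot encodes the mother tongue and rescales the age. Grouping all remaining languages into a single "other" slot is our own assumption (the problem statement only says that other languages are present), and the function names are hypothetical.

```python
KNOWN_LANGUAGES = ["de", "fr", "it", "en"]  # most frequent values per the text


def encode_language(lang):
    """One-hot encode the mother tongue, with a final slot for any other language.

    The extra "other" slot is an assumption made for this sketch.
    """
    vector = [1 if lang == known else 0 for known in KNOWN_LANGUAGES]
    vector.append(0 if lang in KNOWN_LANGUAGES else 1)
    return vector


def min_max(value, low, high):
    """Rescale `value` from [low, high] to [0, 1], clipping out-of-range inputs."""
    clipped = min(max(value, low), high)
    return (clipped - low) / (high - low)


lang_fr = encode_language("fr")   # known language
lang_pt = encode_language("pt")   # falls into the "other" slot
age_scaled = min_max(55, 0, 110)  # bounds 0 and 110 taken from the text
```

Clipping in `min_max` also acts as a simple outlier treatment: an erroneous age above 110 is mapped to 1 instead of producing a value outside the expected range.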

**b) How could you collect targets (output of the system) to train the system?**

The output of the system is a binary variable indicating whether the client invested in the fund or not. It can be encoded as 1 for "buy" and 0 for "not buy".

To collect the targets, we can use historical data from UBS, where we have records of which clients invested in which funds. The goal is to have a balanced dataset, with a similar enough number of "buy" and "not buy" examples. We can proceed as follows:
- **buy examples**: For each record of a client investing in a fund, we can create a positive example with the corresponding fund and client information, and the target set to 1.
- **not buy examples**: For each record of a client investing in a fund, we can create a negative example by pairing the client with another fund that they did not invest in, and the target set to 0. We should select funds that are diverse enough to avoid biasing the model.
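
The construction of "buy" and "not buy" examples described above could look like the following sketch. It assumes the historical data is available as a list of `(client_id, fund_id)` purchase records. This format, the identifiers and the function name `build_examples` are hypothetical. Negatives are drawn uniformly at random here, which is only one possible way to keep them diverse.

```python
import random


def build_examples(purchases, all_fund_ids, seed=0):
    """Return (client_id, fund_id, target) triples, one negative per positive."""
    rng = random.Random(seed)

    # Collect the set of funds each client has actually bought.
    bought = {}
    for client_id, fund_id in purchases:
        bought.setdefault(client_id, set()).add(fund_id)

    examples = []
    for client_id, fund_id in purchases:
        examples.append((client_id, fund_id, 1))  # "buy" example
        candidates = sorted(set(all_fund_ids) - bought[client_id])
        if candidates:  # skip if the client already holds every fund
            examples.append((client_id, rng.choice(candidates), 0))  # "not buy"
    return examples


# Hypothetical toy records, invented for illustration only.
purchases = [("c1", "f1"), ("c1", "f2"), ("c2", "f3")]
funds = ["f1", "f2", "f3", "f4"]
examples = build_examples(purchases, funds)
```

Note that a sampled "not buy" pair only means that no purchase was recorded, not that the client explicitly refused the fund, so these negative labels are noisier than the positive ones.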

If historical data is not available or not enough, we can also consider using surveys or interviews to collect data from clients about their investment preferences. However, this approach is more time-consuming and may not be as reliable as using historical data.

**How would you prepare the different sets?**

We want to create 3 different sets:
- **Training set**: Used to learn the model parameters. It should be the largest set, containing about 60% of the data.
- **Validation set**: Used to tune the hyperparameters and select the best model. It should contain about 20% of the data.
- **Test set**: Used to evaluate the final performance of the model. The model must never have seen this data before, neither during training nor during hyperparameter tuning. It should contain about 20% of the data.

Before splitting the data into these 3 sets, we should shuffle the data to ensure that the examples are randomly distributed. Here are some other ideas to ensure that the 3 sets are representative of the overall data distribution:
- Ensure that the proportion of "buy" and "not buy" examples is similar in all 3 sets.
- Ensure that the number of examples per client is similar in all 3 sets, so that no set is dominated by a few very active clients.

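For all the normalizations involved, we should compute the parameters (mean, std, min, max, ...) on the training set only, and then apply the same transformation to the validation and test sets. Otherwise, information from the validation and test sets would leak into training and make the performance estimates too optimistic.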
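
A minimal sketch of the 60/20/20 split with similar class proportions, followed by a z-score normalization whose parameters come from the training set only. The dataset (a single wealth feature with a binary target) is invented for illustration, and `stratified_split` and `scale` are hypothetical helper names.

```python
import random
import statistics


def stratified_split(examples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split (feature, target) pairs, keeping class ratios similar."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in (0, 1):
        group = [ex for ex in examples if ex[1] == label]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    for split in (train, val, test):
        rng.shuffle(split)  # mix the two classes inside each set
    return train, val, test


# Hypothetical data: one feature (wealth in CHF) and a binary target.
data = [(float(1000 * i), i % 2) for i in range(1, 101)]
train, val, test = stratified_split(data)

# Normalization parameters are computed on the training set only.
train_x = [x for x, _ in train]
mean = statistics.fmean(train_x)
std = statistics.pstdev(train_x)


def scale(split):
    """Apply the training-set z-score parameters to any of the three sets."""
    return [((x - mean) / std, y) for x, y in split]


train_s, val_s, test_s = scale(train), scale(val), scale(test)
```

After scaling, the training set has mean 0 and standard deviation 1 by construction, while the validation and test sets are only approximately centered, which is expected.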

The 3 sets technique is usually good enough to get a reliable estimate of the model performance and to detect overfitting. However, when the amount of data is limited, we can go one step further and use cross-validation on the training and validation data, while still keeping the test set aside for the final evaluation. This technique consists in splitting the data into k folds (for example, k = 5), and then training the model k times, each time using a different fold as the validation set and the remaining folds as the training set. The final performance is then computed as the average performance over the k runs. This way, every example is used both for training and for validation, and the performance estimate depends less on one particular split.
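
The k-fold procedure can be sketched as follows. The `evaluate` callback is a hypothetical placeholder for "train a model on the training indices and return its score on the validation indices". It is not tied to any particular library or model. The data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra example.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, evaluate, k=5):
    """Average the score returned by `evaluate` over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, k=5))
```

Each example appears in exactly one validation fold and in k - 1 training folds, which is what makes the averaged score less dependent on a single split.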