## Prudential risk prediction

KATE expects your code to define variables with specific names, each holding one of the results that will be assessed.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.
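
To illustrate the point about copies, here is a minimal sketch on a toy DataFrame. The names `df_original`, `df_work` and `df_alias` and the values are hypothetical and are not part of the assignment.

```python
import pandas as pd

# Toy frame standing in for an original DataFrame (hypothetical values).
df_original = pd.DataFrame({"BMI": [0.3, 0.5], "Ht": [0.7, 0.6]})

# .copy() returns an independent DataFrame, so edits leave df_original intact.
df_work = df_original.copy()
df_work["BMI"] = df_work["BMI"] * 2

# Plain assignment only creates a second name for the same object,
# so any in-place change through df_alias would also change df_original.
df_alias = df_original
```

After this runs, `df_original["BMI"]` still holds its initial values, while `df_work` carries the modified column.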

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

We will work with a dataset published by an insurance company which contains anonymised information about their clients.

The aim is to predict people's risk profile based on their properties.

You are given a description of the dataset, and your goal is to develop a prediction model.

## Dataset

The data provided consists of three csv files in the `data/` folder:
* `X_train.csv`: the training set
* `y_train.csv`: the target for the training set, valued from 1 to 8
* `X_test.csv`: the test set that will be evaluated

Below we describe the data features; some are categorical, others numerical. The dataset has been thoroughly anonymised, which makes the task more challenging.

Although the risk profile is ordered, we will treat this as a classification problem, and your model will be evaluated on exact-category accuracy. The data has low signal and there are 8 classes, so accuracy can be quite low.
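
As a rough illustration of exact-category accuracy, the sketch below scores some made-up labels and compares the result with always predicting the most frequent class. The helper names and the label arrays are hypothetical; KATE's own scoring code is not shown in this notebook.

```python
import numpy as np

def exact_accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true class exactly."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return float(np.mean(true_labels == predicted_labels))

def majority_class_accuracy(true_labels):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(true_labels, return_counts=True)
    return float(counts.max() / counts.sum())

# Made-up labels in the range 1..8 (not taken from the real data).
demo_true = np.array([8, 8, 6, 1, 8, 2, 6, 8])
demo_pred = np.array([8, 7, 6, 2, 8, 2, 5, 8])

demo_accuracy = exact_accuracy(demo_true, demo_pred)
demo_baseline = majority_class_accuracy(demo_true)
```

On these toy labels `demo_accuracy` is 0.625 and `demo_baseline` is 0.5. Note that predicting 7 when the truth is 8 counts as wrong just like predicting 1 would, because the ordering of the classes is ignored.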

## Get Started

Your task is to train a model to predict the target variable. You should save the predictions for the test set in the variable called `y_pred`, which will be evaluated against the ground truth. Below we give you a sample baseline implementation.

You are free to use all your modelling skills to get the best possible performance.
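
The files listed above do not include labels for the test set, so one way to estimate accuracy before uploading is to hold out part of the training data. The sketch below does this on synthetic data; `X_demo`, `y_demo` and the choice of a random forest are assumptions for illustration only, not a recommended model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data: 5 features, labels 1..8.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = rng.integers(1, 9, size=400)

# stratify keeps the class proportions similar in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_fit, y_fit)

# Exact-category accuracy on the held-out part.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
```

Because the synthetic labels are random, `val_accuracy` here is close to chance; on the real data the same pattern gives an estimate of how a model might score on the test set.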

Good luck!

### Dataset info

**Variable descriptions:**
- Id - A unique identifier associated with an application.
- Product_Info_1-7 - A set of normalized variables relating to the product applied for.
- Ins_Age - Normalized age of applicant.
- Ht - Normalized height of applicant.
- Wt - Normalized weight of applicant.
- BMI - Normalized BMI of applicant.
- Employment_Info_1-6 - A set of normalized variables relating to the employment history of the applicant.
- InsuredInfo_1-7 - A set of normalized variables providing information about the applicant.
- Insurance_History_1-9 - A set of normalized variables relating to the insurance history of the applicant.
- Family_Hist_1-5 - A set of normalized variables relating to the family history of the applicant.
- Medical_History_1-41 - A set of normalized variables relating to the medical history of the applicant.
- Medical_Keyword_1-48 - A set of dummy variables indicating the presence or absence of a medical keyword associated with the application.
- Response - The target variable, an ordinal variable relating to the final decision associated with an application.

**Categorical (nominal) features:**
```
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
```

**Continuous features:**
```
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
```

**Discrete features:**
```
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
```
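
The feature lists above can be turned into Python lists for use in a column transformer. This is a small sketch; the helper `expand_range` and the list names are hypothetical, and the column names are copied from the lists above.

```python
def expand_range(prefix, start, stop):
    """Build names such as Medical_Keyword_1 ... Medical_Keyword_48."""
    return [f"{prefix}_{i}" for i in range(start, stop + 1)]

# Dummy (0/1) variables, written above as Medical_Keyword_1-48.
keyword_cols = expand_range("Medical_Keyword", 1, 48)

# Continuous features, as listed above.
continuous_cols = [
    "Product_Info_4", "Ins_Age", "Ht", "Wt", "BMI",
    "Employment_Info_1", "Employment_Info_4", "Employment_Info_6",
    "Insurance_History_5",
] + expand_range("Family_Hist", 2, 5)

# Discrete features, as listed above.
discrete_cols = [f"Medical_History_{i}" for i in (1, 10, 15, 24, 32)]
```

Lists built this way can be passed directly as the column selections of `make_column_transformer`.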

### Baseline model

In [37]:
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

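# Load the training features, the training target and the test features.
X_train = pd.read_csv("data/X_train.csv")
y_train = pd.read_csv("data/y_train.csv")
X_test = pd.read_csv("data/X_test.csv")

# The baseline uses only a small subset of the available features.
categories = ["Product_Info_1", "Product_Info_2", "Product_Info_3",
              "Product_Info_5", "Product_Info_6", "Product_Info_7"]

cont_cols = ["Ins_Age", "Ht", "Wt", "BMI"]

# One-hot encode the categorical columns and standardise the continuous ones.
# Columns not listed here are dropped (remainder="drop" is the default).
preprocessor = make_column_transformer((OneHotEncoder(handle_unknown="ignore"), categories),
                                       (StandardScaler(), cont_cols))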

# Alternative baseline: a random forest. Uncomment to try it.
# rf = RandomForestClassifier(n_estimators=100, random_state=42)
# pipeline = Pipeline(steps=[('preprocessor', preprocessor),
#                            ('classifier', rf)])

# pipeline.fit(X_train, y_train.values.ravel())
# y_pred = pipeline.predict(X_test)

In [38]:
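from sklearn.neural_network import MLPClassifier

# A single hidden layer of 100 units; random_state fixes the initialisation.
nn = MLPClassifier(hidden_layer_sizes=(100,), random_state=42)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', nn)])

# Pass the target as a 1-D array: fitting on the single-column DataFrame
# triggers scikit-learn's DataConversionWarning.
pipeline.fit(X_train, y_train.values.ravel())
y_pred = pipeline.predict(X_test)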
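
One possible way to extend the baseline is to impute missing values before scaling and to score the whole pipeline with cross-validation. The sketch below runs on a toy DataFrame; the `toy_` names, the category values, the injected missing values and the use of logistic regression are all assumptions for illustration, and the source does not state whether the real data contains missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one categorical and two continuous columns (hypothetical values).
rng = np.random.default_rng(0)
n_rows = 200
toy_X = pd.DataFrame({
    "Product_Info_2": rng.choice(["A1", "B2", "D3"], size=n_rows),
    "Ins_Age": rng.random(n_rows),
    "BMI": rng.random(n_rows),
})
toy_X.loc[::10, "BMI"] = np.nan  # inject some missing values
toy_y = rng.integers(1, 9, size=n_rows)  # labels 1..8

# Continuous columns: fill gaps with the median, then standardise.
toy_preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Product_Info_2"]),
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["Ins_Age", "BMI"]),
)
toy_pipeline = make_pipeline(toy_preprocessor, LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy; preprocessing is refitted inside each fold.
toy_scores = cross_val_score(toy_pipeline, toy_X, toy_y, cv=5, scoring="accuracy")
```

Keeping the imputer and scaler inside the pipeline means they are fitted only on each fold's training part, which avoids leaking information from the validation part.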

