# Capstone proposal - Liberty Mutual Group Fire Peril Loss Cost

## Predict expected fire losses for insurance policies

### (a) What are the main project idea and goals?


https://www.kaggle.com/c/liberty-mutual-fire-peril

This data set represents almost a million insurance records and the task is to predict a transformed ratio of loss to total insured value (called "target" within the data set). The provided features contain policy characteristics, information on crime rate, geodemographics, and weather.

### (b) What story would you like to tell with the data and what would you like to achieve at the end?

My main goal is to show how important variables can be identified and how groups of variables can be compressed into a smaller representation. Furthermore, I'd like to show a possible approach to dealing with highly imbalanced data. Lastly, I'd like to show, if it turns out to be beneficial, how an ensemble of several models can be combined into one aggregate model.

### (c) What is the main motivation behind your project?

My main motivation is to work with a complex data set that is inherently imbalanced and contains many features. Also, in my job at the exchange, we are particularly interested in working with imbalanced data, as anomaly and fraud detection are a major concern. However, since I cannot provide a data set from my employer for this capstone project, and since I also happen to be very interested in the particularities of the insurance industry, I chose the Liberty Mutual Fire Peril data set as an interesting alternative.

## The data set

### (a) What is the size and format of the data that you plan to use?

- CSV
- Train: 902'789 rows and 302 columns
- Test: 450'728 rows and 301 columns

### (b) How do you expect to get, manage and process the data?

The data can be downloaded as a zip file.
I will download the zip file and load the CSV files into the Jupyter notebook in the usual way (using `pd.read_csv()`).
Since I will not be able to upload the zip file to GitHub for submission, I will provide the Kaggle link and download instructions so that you can run the Jupyter notebook.
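
A minimal sketch of the loading step, assuming the archive is kept zipped and each CSV is read straight from it. The helper name `load_csv_from_zip` and the member name `train.csv` are assumptions for illustration, not taken from the Kaggle download; the small in-memory archive only stands in for the real file so that the snippet runs on its own.

```python
import io
import zipfile

import pandas as pd


def load_csv_from_zip(zip_source, member, **read_csv_kwargs):
    """Read one CSV member of a zip archive into a DataFrame without unpacking to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)


# Tiny in-memory archive (made-up rows) so the sketch runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "id,target\n1,0.0\n2,0.5\n")
buffer.seek(0)

demo = load_csv_from_zip(buffer, "train.csv")
```

With the real data, `zip_source` would be the path of the downloaded archive, and keyword arguments such as `usecols` or `nrows` could be passed through to `pd.read_csv()` to limit memory use during exploration.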

## The analysis and methods

### (a) What are the main challenges that you envision for completing the project and how do you plan to get around each one?

- Preliminary exploration indicates that the data is already of reasonable quality. Hence, the focus will mainly be on exploring the variables, feature engineering and building the models.
- The main challenge with the data set is that it is very imbalanced, which makes predictions harder. Hence, the challenge will be to find techniques that can overcome the imbalance.
- Also, the column names carry hardly any meaning, which will make feature engineering particularly difficult. The variables are named with a group prefix so that related variables can be identified. Further details on the meaning of the variables, however, are not provided, possibly for confidentiality reasons. One way around this problem could be the application of feature compression techniques such as PCA or t-SNE.
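
The compression idea from the last point could look roughly like the sketch below: all columns sharing a group prefix are standardized and replaced by their first principal components. The prefix `weatherVar`, the number of components and the helper name `compress_group` are assumptions for illustration, and the sketch assumes the columns are numeric with missing values already imputed.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compress_group(df, prefix, n_components=2):
    """Replace all columns starting with `prefix` by their first principal components."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Standardize first so that no single column dominates the components.
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=n_components))
    components = pipeline.fit_transform(df[cols])
    names = [f"{prefix}_pc{i + 1}" for i in range(n_components)]
    compressed = pd.DataFrame(components, columns=names, index=df.index)
    return df.drop(columns=cols).join(compressed)


# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=[f"weatherVar{i}" for i in range(1, 6)],
)
demo["var1"] = rng.normal(size=100)

reduced = compress_group(demo, "weatherVar", n_components=2)
```

In the actual project the pipeline would be fitted on the training data only and then applied unchanged to the test data, so that no information leaks from the test set.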

### (b) What are the steps that you plan to take to achieve the end goals?

- Online research
    - How to handle imbalanced data
    - How to handle model predictions where the target variable is zero-inflated (most observations have a target of exactly zero)
    - How to define a cut-off below which an observation is predicted to be zero, since I expect many small non-zero predictions (the coefficients will always have some sensitivity)
    - How to use PCA or, possibly, t-SNE for feature compression in regression problems
- Data exploration
- Data cleansing 
- Data visualisation
- Feature engineering
- Defining prediction benchmark
- Model selection and evaluation
- Final predictions
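
The cut-off question from the research list above could, as one possible approach, be answered empirically on held-out data: try several candidate cut-offs and keep the one with the lowest error. The sketch below assumes mean squared error as the criterion, which is my own choice and not the competition metric; the helper names and the toy numbers are hypothetical.

```python
import numpy as np


def apply_cutoff(predictions, cutoff):
    """Set every prediction below `cutoff` to exactly zero."""
    predictions = np.asarray(predictions, dtype=float)
    return np.where(predictions < cutoff, 0.0, predictions)


def best_cutoff(y_true, predictions, candidates):
    """Return the candidate cut-off with the lowest mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    errors = [
        np.mean((y_true - apply_cutoff(predictions, c)) ** 2)
        for c in candidates
    ]
    return candidates[int(np.argmin(errors))]


# Toy validation data: three non-claims and one claim.
y_valid = np.array([0.0, 0.0, 0.0, 1.0])
preds = np.array([0.05, 0.10, 0.02, 0.80])

cutoff = best_cutoff(y_valid, preds, candidates=[0.0, 0.03, 0.2])
cleaned_preds = apply_cutoff(preds, cutoff)
```

The cut-off should be chosen on a validation split, never on the test set, and the criterion should be swapped for whatever metric the final evaluation uses.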

### (c) Show us that you have a pipeline in place and that you understand the feasibility of your project goals.

#### 1. Data exploration
The data exploration phase will focus on the distribution of the data and on evaluating the quality of the data set. From this, I will derive measures for the data cleansing step and get an idea of which variables might be interesting and which might not.

#### 2. Data cleansing (missing value imputation and removal of unnecessary rows and columns)
Here, I will decide which rows or columns to drop from the data set. I will also fill missing values according to a fixed rule, to be chosen during exploration.
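
One possible form of this step is sketched below: columns that are mostly missing are dropped, and the remaining numeric gaps are filled with the column median. Both the 50% threshold and the median rule are assumptions for illustration; the actual rules would follow from the exploration phase.

```python
import numpy as np
import pandas as pd


def clean(df, max_missing_share=0.5):
    """Drop columns that are mostly missing, then fill numeric gaps with the column median."""
    keep = df.columns[df.isna().mean() <= max_missing_share]
    cleaned = df[keep].copy()
    numeric = cleaned.select_dtypes(include="number").columns
    cleaned[numeric] = cleaned[numeric].fillna(cleaned[numeric].median())
    return cleaned


# Toy frame: "b" is two-thirds missing and gets dropped, "a" gets its gap filled.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 5.0],
    "c": ["x", "y", "z"],
})

cleaned = clean(demo)
```

The medians should be computed on the training data and reused for the test data, so that both sets are imputed with the same values.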

#### 3. Data visualisation (to show the relationship of the variables)
In this chapter, I intend to visualize the variables, their relations to each other and their relations to the target variable.

#### 4. Feature engineering

Initial online research hinted at the following steps that might be important when dealing with imbalanced data:

- determination of feature importance using methods provided in sklearn
- capping the target variable to make it less sensitive to outliers (censor large claims)
- downsampling of non-claims (true negatives) to reduce data set imbalance
- noise reduction through feature selection
- PCA and, possibly, t-SNE for feature compression
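
Two of the steps above, capping the target and downsampling non-claims, can be sketched as follows. The column name `target` comes from the data set description; the 99% quantile, the ratio of five non-claims per claim and the helper names are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def cap_target(df, target="target", quantile=0.99):
    """Censor large claims: cap the target at a quantile of the non-zero values."""
    cap = df.loc[df[target] > 0, target].quantile(quantile)
    return df.assign(**{target: df[target].clip(upper=cap)})


def downsample_non_claims(df, target="target", ratio=5, seed=0):
    """Keep all claims and at most `ratio` randomly chosen non-claims per claim."""
    claims = df[df[target] > 0]
    non_claims = df[df[target] == 0]
    n_keep = min(len(non_claims), ratio * len(claims))
    sampled = non_claims.sample(n=n_keep, random_state=seed)
    # Shuffle so that claims and non-claims are not grouped together.
    return pd.concat([claims, sampled]).sample(frac=1, random_state=seed)


# Synthetic stand-in: 990 non-claims and 10 claims.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"target": np.r_[np.zeros(990), rng.exponential(size=10)]})

balanced = downsample_non_claims(cap_target(demo), ratio=5)
```

Downsampling changes the share of non-claims the model sees, so the average prediction level shifts and may need to be recalibrated before predicting on the full data.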
    
#### 5. Defining prediction benchmark
The intention is to define a simple benchmark, such as linear regression.
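
A minimal sketch of such a benchmark, assuming cross-validated RMSE as the yardstick (my assumption, not necessarily the metric used for the final evaluation). A constant-mean predictor is added as a floor below linear regression; that extra model and the synthetic data are illustrative additions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def benchmark_rmse(model, X, y, cv=5):
    """Mean cross-validated root mean squared error of `model`."""
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


# Synthetic stand-in data with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

rmse_mean = benchmark_rmse(DummyRegressor(strategy="mean"), X, y)
rmse_linear = benchmark_rmse(LinearRegression(), X, y)
```

Any later model would then be judged by how much it improves on `rmse_linear` under the same cross-validation splits.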

#### 6. Model selection and evaluation
Models that could be used:

- K-Nearest Neighbors Regression
- Support Vector Regression
- Random Forest Regression
- Neural Network
- Extreme Gradient Boosting (XGBoost)

During the course of building and evaluating the model, and if time allows, I would also like to experiment with combinations and aggregations of multiple models in order to improve performance.
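
One simple way to aggregate several models is to average their predictions, for which scikit-learn provides `VotingRegressor`. The sketch below is only an illustration under assumed choices: the three member models, their hyperparameters and the synthetic data are placeholders, not the final selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data with a partly non-linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

ensemble = VotingRegressor(estimators=[
    ("linear", LinearRegression()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X[:200], y[:200])
ensemble_pred = ensemble.predict(X[200:])

# Without weights, the ensemble prediction is the plain mean of the members' predictions.
member_preds = np.column_stack(
    [member.predict(X[200:]) for member in ensemble.estimators_]
)
```

The `weights` parameter of `VotingRegressor` allows stronger members to count more; whether averaging helps at all would have to be checked against the benchmark on validation data.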

#### 7. Final predictions
The test set will not be used or looked at until the end. It will only be used with the final model, to draw final conclusions about that model's pros and cons.

## The communication


I will prepare a PowerPoint slide deck or Jupyter notebook that goes through the findings of each of the seven steps of the pipeline. In the final communication, I intend to present only a summary of each of the seven steps.

I deem this project interesting because there does not seem to be much data cleansing necessary. Hence, I might be able to focus strongly on feature engineering, model building, evaluation and tuning.