# Allstate claims severity - Kaggle challenge

## Intro
When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect. Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience...

## Content:
- Housekeeping and Imports
- Data Loading
- Data Exploration
- Data Cleaning
- Feature Engineering
- Data Transformation and Preparation
- Model Exploration and Performance Analysis
- Final Model Building
- Predictions on Test Set


## Housekeeping and Imports

For importing the libraries necessary for the project, and for basic preprocessing functions (ex: typeset conversion for NLP projects).

## Data Loading

For loading data files into appropriate variables.

## Data Exploration

Section for **exploratory analysis** on the available data. 

The exploration techniques vary for numerical, categorical, or time-series variables.

Here we typically:

- look at example records in the dataset
- investigate the datatypes of variables in the dataset
- calculate and investigate descriptive statistics (ex: central tendencies, variability etc.)
- investigate distribution of feature vectors (ex: to check for skewness and outliers)
- investigate distribution of prediction vector
- check out the relationship (ex: correlation) between different features
- check out the relationship between feature vectors and prediction vector

Common steps to check the health of the data:

- Check for missing data
- Check the skewness of the data, outlier detection
- etc...
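The checks listed above can be sketched with pandas roughly as follows. This is a minimal illustration on a toy frame, not the project's actual notebook code: the column names `cat1`, `cont1`, and `loss` and all values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; column names and values are
# illustrative assumptions, not taken from the real files.
df = pd.DataFrame({
    "cat1": ["A", "B", "A", "B", "A"],
    "cont1": [0.2, 0.5, np.nan, 0.9, 0.4],
    "loss": [1200.0, 800.0, 15000.0, 950.0, 1100.0],
})

print(df.head())      # example records
print(df.dtypes)      # data type of each column
print(df.describe())  # descriptive statistics for numerical columns

numeric = df.select_dtypes(include="number")

missing_counts = df.isnull().sum()  # number of missing values per column
skewness = numeric.skew()           # skewness of each numerical column
correlations = numeric.corr()       # pairwise Pearson correlations
```

Keeping the results in variables (rather than only printing them) makes it easy to reuse them later, for example to pick the columns that need a skew-reducing transformation.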

### Look at Example Records

### Data-types, completeness Information

### Descriptive Statistics

### Visualization: Distribution of features

*Section has great potential for expansion.* 

Visualization techniques differ depending on the type of the feature vector (i.e. numerical: continuous or discrete, categorical: ordinal etc). Techniques will also depend on the type of data being dealt with, and the insight that we want to extract from it. 

Common visualization techniques include:
- Bar Plots: Visualize the frequency distribution of categorical features.
- Histograms: Visualize the frequency distribution of numerical features.
- Box Plots: Visualize a numerical feature, while providing more information like the median, lower/upper quartiles, etc.
- Scatter Plots: Visualize the relationship (usually the correlation) between two features. Can include a goodness of fit line, to serve as a regression plot.

### Investigating correlations between features

### Visualizing prediction vector

### Investigating missing values

### Outlier Detection

Outliers can skew any result that is computed from the affected data points.

One approach to detecting outliers is Tukey's method: an outlier step is calculated as 1.5 times the interquartile range (IQR), where the IQR is the distance between the first quartile (Q1) and the third quartile (Q3). A data point whose value for a feature lies more than one outlier step below Q1 or above Q3 of that feature is considered abnormal.
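A minimal sketch of Tukey's method with NumPy is shown below. The function name `tukey_outlier_mask` and the sample values are hypothetical, and the sketch assumes the feature has no missing values.

```python
import numpy as np

def tukey_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value is a Tukey outlier.

    Hypothetical helper for illustration; assumes `values` has no NaNs.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    step = k * (q3 - q1)  # outlier step: k times the IQR
    return (values < q1 - step) | (values > q3 + step)

# Made-up sample with one obviously extreme value.
sample = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])
mask = tukey_outlier_mask(sample)
print(sample[mask])  # values flagged as outliers
```

The mask can then be used either to inspect the flagged records or to drop them in the data cleaning step.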

## Data Cleaning

### Imputing missing values

### Cleaning outliers or error values

## Feature Engineering

Section to extract more features from those currently available.

## Data Transformation and Preparation

### Transforming Skewed Continuous Features 

It is common practice to apply a logarithmic transformation to highly skewed continuous feature distributions.
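As a small sketch, the transformation can be applied with NumPy. Using `np.log1p` (the logarithm of `1 + x`) instead of a plain logarithm is a choice made here so that zero values stay defined; the input values are made up.

```python
import numpy as np

# Made-up, heavily right-skewed values (note the zero).
skewed = np.array([0.0, 10.0, 100.0, 1000.0, 100000.0])

# log(1 + x) compresses large values and is defined at x = 0.
transformed = np.log1p(skewed)

# expm1 is the inverse, useful for mapping predictions back
# to the original scale if the target itself was transformed.
restored = np.expm1(transformed)
```

If the prediction vector is transformed this way before training, the model's outputs have to be passed through the inverse before scoring them on the original scale.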

### Normalizing Numerical Features 

Another common practice is to scale the numerical features. Linear scaling (for example min-max scaling or standardization) does not change the shape of each feature's distribution, but it puts all features on a comparable range so that no feature dominates a supervised learner simply because of its units.
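One possible form of scaling is min-max scaling to the range [0, 1], sketched below in plain NumPy. The helper `min_max_scale` is hypothetical; scikit-learn's `MinMaxScaler` offers the same idea with separate fit and transform steps.

```python
import numpy as np

def min_max_scale(column):
    """Linearly rescale a 1-D feature to the range [0, 1].

    Hypothetical helper for illustration only.
    """
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
```

In a real project the minimum and maximum should be computed on the training split only and then reused for the validation and test data, to avoid leaking information.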

### One Hot Encoding Categorical Features

Consider wrapping the preprocessing steps in a single pipeline function rather than spreading them across separate script blocks.
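A rough sketch of such a function is given below. It chains a log transformation, min-max scaling, and one-hot encoding via `pd.get_dummies`. The function name `preprocess`, its parameters, and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def preprocess(df, skewed_cols, numeric_cols, categorical_cols):
    """Hypothetical preprocessing function: log-transform, scale, encode."""
    out = df.copy()
    # 1. Reduce skew with log(1 + x).
    out[skewed_cols] = np.log1p(out[skewed_cols])
    # 2. Min-max scale numerical columns (guard against zero range).
    col_min = out[numeric_cols].min()
    col_range = (out[numeric_cols].max() - col_min).replace(0, 1)
    out[numeric_cols] = (out[numeric_cols] - col_min) / col_range
    # 3. One-hot encode categorical columns.
    return pd.get_dummies(out, columns=categorical_cols)

# Toy data with made-up column names and values.
raw = pd.DataFrame({
    "cat1": ["A", "B", "A"],
    "cont1": [0.0, 9.0, 99.0],
    "cont2": [1.0, 2.0, 3.0],
})
features = preprocess(
    raw,
    skewed_cols=["cont1"],
    numeric_cols=["cont1", "cont2"],
    categorical_cols=["cat1"],
)
print(features)
```

Having one function makes it straightforward to apply exactly the same steps to the test set, although the scaling statistics and the set of dummy columns should then come from the training data.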

### Shuffle and Split Data

## Model Exploration

### Naive Predictor Performance

To set a baseline for the performance of the predictor. 

Common techniques:
- For categorical prediction vector, choose the most common class
- For numerical prediction vector, choose a measure of central tendency

Then calculate the evaluation metric (ex: accuracy or F-score for classification, mean absolute error for regression).
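For a numerical prediction vector, a naive baseline could look like the sketch below. It assumes the median as the measure of central tendency and mean absolute error (MAE) as the metric; the arrays are made-up values, not competition data.

```python
import numpy as np

# Made-up target values for a training and a validation split.
y_train = np.array([100.0, 200.0, 300.0, 400.0, 5000.0])
y_valid = np.array([150.0, 250.0, 1000.0])

# Naive predictor: always predict the training median.
baseline_value = np.median(y_train)
baseline_pred = np.full_like(y_valid, baseline_value)

# Mean absolute error of the naive predictor on the validation split.
baseline_mae = np.mean(np.abs(y_valid - baseline_pred))
print(f"Baseline MAE: {baseline_mae:.1f}")
```

Any trained model should beat this number; if it does not, the features or the training setup deserve a second look.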

### Choosing scoring metrics

### Creating a Training and Prediction Pipeline

### Model Evaluation

## Final Model Building

Using grid search (GridSearchCV) over different parameter/value combinations, we can tune the chosen model for better results.
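A minimal sketch of a grid search with scikit-learn follows. The `Ridge` estimator, the `alpha` grid, the synthetic data, and the MAE-based scoring are all assumptions chosen to keep the example small; they are not the model or parameters used in this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, generated only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)

# Candidate parameter values to try (assumed, not tuned for any real data).
param_grid = {"alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    Ridge(),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(best_params, best_mae)
```

After fitting, `search.best_estimator_` holds the model refit on all the data passed to `fit` with the best parameter combination, and can be used directly for predictions.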

Next steps can include feature importance extraction, predictions on the test set, etc.

## Predictions on Test Set