<center> <h1>XXX Datathon - Team n°1</h1> </center>

## In this notebook:

This Jupyter Notebook preprocesses the data. Starting from the given dataset ```DATASET_Train2.xlsx```, we successively apply the following transformations:

1) Data preprocessing

2) Data transformation, as specified in data_transform.py

3) Merging of external data: hazard, UW ratio, market index

The output dataset will be stored in the ```processed_data_train.csv``` or ```processed_data_test.csv``` file, depending on which dataset is processed.


### Assumptions

1) Some of the data was already preprocessed using external tools. In particular, we preprocessed the column ```DEDUCTIBLES```, which had a messy structure. As a result, the following features were added to the original XXX dataset, giving the new ```DATASET_Train2``` dataset that is imported in this Notebook.

| Feature name          | Number of values | Type    |
|-----------------------|------------------|---------|
| PD_price(M)           | 5146 non-null    | float64 |   
| BI_price(M)           | 5146 non-null    | float64 |  
| BI_time(Days)         | 5146 non-null    | float64 |   
| Both_price(M)         | 5146 non-null    | float64 |  
| PD_percent_loss (%)   | 5146 non-null    | float64 |  
| BI_percent_loss (%)   | 5146 non-null    | float64 |   
| Both_percent_loss (%) | 5146 non-null    | float64 |  
| PD_percent_tiv (%)    | 5146 non-null    | float64 | 

#### Preprocessing of the column ```DEDUCTIBLES```

- Principle: The column may indicate a price, a number of days, or a percentage for PD, BI or both. We create a column for every possible case (these columns are listed above). For each row, the relevant columns are filled with the values parsed from the column ```DEDUCTIBLES```, and the others are filled with 0s.
- Process:
    1. we export the column ```DEDUCTIBLES``` to a csv file
    2. we use the following regular expressions (vim format) to convert the 1-column format to our 8-column format; the final command replaces the remaining commas with dots.
        ```vimscript
1,$s/^PD: *\(.*\)M *, *BI: *\(.*\)Day(s)$/\1; 0; \2; 0; 0; 0; 0; 0 
1,$s/^PD: *\(.*\)M *, *BI: *\(.*\)M *$/\1; \2; 0; 0; 0; 0; 0; 0 
1,$s/^PD: *\(.*\)% *of *loss *, *BI: *\(.*\)Day(s)$/0; 0; \2; 0; \1; 0; 0; 0
1,$s/^PD: *\(.*\)% *of *tiv *, *BI: *\(.*\)Day(s)$/0; 0; \2; 0; 0; 0; 0; \1
1,$s/^PD: *\(.*\)% *of *loss *, *BI: *\(.*\)M$/0; \2; 0; 0; \1; 0; 0; 0
1,$s/^PD: *\(.*\)M, *BI: *\(.*\)% *of *loss *$/\1; 0; 0; 0; 0; \2; 0; 0
1,$s/^PD,BI: *\(.*\)M$/0; 0; 0; \1; 0; 0; 0; 0/ 
1,$s/^ *BI: *\(.*\)Day(s)$/0; 0; \1; 0; 0; 0; 0; 0 
1,$s/^PD: *\(.*\)M *$/\1; 0; 0; 0; 0; 0; 0; 0 
1,$s/^PD: *\(.*\)% *of *loss *$/0; 0; 0; 0; \1; 0; 0; 0
1,$s/^PD: *\(.*\)% *of *tiv *$/0; 0; 0; 0; 0; 0; 0; \1
%s/,/./g

        ```
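
The same idea can be sketched in Python. This is only an illustration, not the tool that was actually used: the function name `parse_deductible`, the reduced rule list and the sample strings are assumptions inferred from the vim patterns above, and only a subset of the cases is covered.

```python
import re

# Output columns, in the order of the feature table above.
COLUMNS = [
    "PD_price(M)", "BI_price(M)", "BI_time(Days)", "Both_price(M)",
    "PD_percent_loss (%)", "BI_percent_loss (%)",
    "Both_percent_loss (%)", "PD_percent_tiv (%)",
]

NUM = r"([\d.,]+)"  # a number, possibly written with a decimal comma

# (pattern, target column of each captured group); illustrative subset only.
RULES = [
    (rf"PD:\s*{NUM}M\s*,\s*BI:\s*{NUM}Day\(s\)", ["PD_price(M)", "BI_time(Days)"]),
    (rf"PD:\s*{NUM}M\s*,\s*BI:\s*{NUM}M", ["PD_price(M)", "BI_price(M)"]),
    (rf"PD:\s*{NUM}%\s*of\s*loss\s*,\s*BI:\s*{NUM}Day\(s\)",
     ["PD_percent_loss (%)", "BI_time(Days)"]),
    (rf"PD,BI:\s*{NUM}M", ["Both_price(M)"]),
    (rf"PD:\s*{NUM}M", ["PD_price(M)"]),
]

def parse_deductible(text):
    """Map one DEDUCTIBLES string to the 8 columns; unmatched text gives all zeros."""
    row = dict.fromkeys(COLUMNS, 0.0)
    for pattern, targets in RULES:
        match = re.fullmatch(pattern, text.strip())
        if match:
            for column, value in zip(targets, match.groups()):
                # Same effect as the final vim command: decimal comma -> dot.
                row[column] = float(value.replace(",", "."))
            break
    return row

example = parse_deductible("PD: 0.5M, BI: 30Day(s)")
```

As in the vim version, the order of the rules matters: the more specific two-part patterns are tried before the single-part ones.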


2) UW ratio data was already preprocessed in the ```UW_Ratio_Preprocessing``` file and will be directly imported in this Notebook.

# <font color='darkorange'>Imports </font>

In [1]:
import pandas as pd
import data_transform
import numpy as np
from tqdm import tqdm
from pandas import DataFrame

# Visualization options
pd.set_option('display.max_columns', 5000)
pd.set_option('display.max_rows', 5000)

# <font color='darkorange'>Data Loading </font>
#### Please choose which dataset you want to load

In [2]:
preprocess_train = False
preprocess_test = not preprocess_train

In [3]:
#Original data with the preprocessing of column 'Deductible'
if preprocess_train:
    df = pd.read_excel('data/DATASET_Train2.xlsx') 
else:
    df = pd.read_csv("data/DATASET_TEST_Processed.csv", 
                          parse_dates=["INCEPTION", "EXPIRY", "PRICING_DATE"])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 63 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   SUBMISSION_ID          3259 non-null   int64         
 1   DIVISION_ID            3259 non-null   int64         
 2   FACUL_NUM              3259 non-null   object        
 3   TAG                    3259 non-null   object        
 4   DIVISION_NUM           3259 non-null   int64         
 5   SEGMENT_LOB            3259 non-null   object        
 6   UF_STATUS              3259 non-null   object        
 7   CT_STATUS              3259 non-null   object        
 8   UWYEAR                 3259 non-null   object        
 9   INCEPTION_MONTH        3259 non-null   object        
 10  INCEPTION              3259 non-null   datetime64[ns]
 11  EXPIRY                 3259 non-null   datetime64[ns]
 12  CT_PERIOD              3259 non-null   int64         
 13  MAR

# <font color='darkorange'> Preprocessing Data </font>

The data transformation function is in the **data_transform.py** file. (More analysis and explanations of certain steps can be found in the **Exploration Analysis** notebook.)

To sum up, this function includes:

1) Dealing with missing values

2) Changing Cover_BI to a boolean variable that indicates whether BI is included in the contract

3) Grouping several sectors together

4) Grouping regional markets together (the groups can be found in **constants.py**)

5) Removing useless variables (variables with the same value in 99% of the rows)

6) Extracting *UWYEAR* and *Inception month*

7) Dropping certain variables that we considered useless for the premium (listed in **constants.py**)
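
As an illustration of step 5, the sketch below drops columns whose most frequent value covers at least 99% of the rows. It is not the code of **data_transform.py**: the function name `drop_near_constant` and the way the threshold is applied are assumptions made for the example.

```python
import pandas as pd

def drop_near_constant(df, threshold=0.99):
    """Drop columns whose most frequent value covers at least `threshold` of the rows.

    Hypothetical helper, shown only to illustrate the idea of step 5.
    Returns the reduced frame and the list of dropped column names.
    """
    to_drop = [
        col for col in df.columns
        # value_counts sorts by frequency, so the first entry is the top share
        if df[col].value_counts(normalize=True, dropna=False).iloc[0] >= threshold
    ]
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "REPORTCCY": ["USD"] * 200,   # constant column: should be dropped
    "NBLOCS": list(range(200)),   # varied column: should be kept
})
reduced, dropped = drop_near_constant(toy)
```

Counting missing values as a category (`dropna=False`) means a column that is almost entirely empty is also treated as near-constant.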

In [5]:
df.drop(columns=['DEDUCTIBLES'], inplace=True)
df['DIVISION_NUM'] = df['DIVISION_NUM'].astype('str')

In [6]:
df_t = data_transform.transform_data(df)
if preprocess_test:
    df_t = df_t.drop(columns = ['BI_price(M)','PD_percent_tiv (%)'])
df_t.head()

(3259, 62)
(3259, 63)
(3259, 63)
(3259, 64)
## Dropped useless features:  ['TC_PROFIT_COMMISSION', 'REPORTCCY', 'BI_percent_loss (%)', 'Both_percent_loss (%)']
(3259, 60)
(3259, 61)
(3259, 62)
(3259, 48)


Unnamed: 0,SUBMISSION_ID,FACUL_NUM,DIVISION_NUM,SEGMENT_LOB,UWYEAR,CT_PERIOD,MAINOCCUPANCY,SECTOR,BUSINESSUNIT,UWCENTER,SCOPE_PERILS,SUBSIDIARY,PARTTYPE,GUARANTEE,MAIN_PRICING_CATEG,BI_TYPE,BI_PERIOD,INSUREDVALUEPD,INSUREDVALUEBI,TOTALINSUREDVALUE,NBLOCS,OIL,LIMIT,ATTACHMENT,XXX_SHARE,MODELED_CAT_EXPLOSS,DISCOUNTS,DEDUCTION,EXT_EXPENSE,WORDING,QUALITY_RISK_MGT,ASSET_QUALITY,BI_MITIGATION,MB_QUALITY,TXCHANGE,FXRATEUSD,TOP_MPL,TOP_FMLS,PD_price(M),BI_time(Days),Both_price(M),PD_percent_loss (%),COVER_BI,GEO_MARKET_SEGMENT,UWYEAR_label,INCEPTION_month
0,1,FA0020462,1,Ppty Non Energy,2020,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,638676500.0,309468600.0,948145000.0,3,0.0,180635834.0,0.0,0.025,0.0,0.0,0.072997,0.072997,Standard,Average,Average,Average,Average,0.903179,1,250044400.0,0.0,0.5,30.0,0.0,0.0,True,Latin America,1,6
1,2,11F008861,1,Ppty Non Energy,2019,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,612051400.0,191214100.0,803265500.0,3,0.0,175963400.0,0.0,0.075,0.0,0.0,0.09342,0.09342,Standard,Average,Average,Average,Average,0.879817,1,209700500.0,0.0,0.4,30.0,0.0,0.0,True,Latin America,1,6
2,3,FA0003626,1,Ppty Non Energy,2017,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX RE US,Excess of Loss,EU,Mining - Hard Rock,Loss of profit,24.0,1585276000.0,562431200.0,2147707000.0,1,0.0,278732700.0,0.0,0.075,0.0,0.0,0.095564,0.095564,Standard,Average,Average,Average,Average,0.929109,1,882653600.0,0.0,4.6,30.0,0.0,0.0,True,Latin America,1,10
3,4,11F007069,2,Ppty Non Energy,2018,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,24.0,825993400.0,973995000.0,1799988000.0,2,0.0,589622642.0,0.0,0.12,0.0,0.0,0.635696,0.635696,Standard,Average,Average,Average,Average,0.842318,1,928537700.0,0.0,2.1,30.0,0.0,0.0,True,Latin America,1,5
4,5,11F008861,1,Ppty Non Energy,2020,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,638728300.0,309429200.0,948157500.0,3,0.0,180635834.0,0.0,0.075,0.0,0.0,0.080394,0.080394,Standard,Average,Average,Average,Average,0.903179,1,250045200.0,0.0,0.5,30.0,0.0,0.0,True,Latin America,1,6


We now have our preliminary dataset. It has 46 different features, both categorical and numerical. We will now merge external data into this dataset to make it more relevant for our ```PREMIUM``` prediction problem.

# <font color='darkorange'> Merge External Data </font>

### HAZARD data

In this section, we use the information in ```exposure_data_v3.csv``` to obtain a new predictor, **WEIGHTED HAZARD**, which describes the seismic hazard as Peak Ground Acceleration (PGA), averaged across all the locations of each contract (grouped using **UWYEAR**, **FACUL_NUM** and **DIVISION_NUM**) and weighted by the total insured value (**TIV**). We fill the missing values of the BI and PD insured values with 0, then add them together to obtain the **TIV** of each location.

The rationale for adding this data is that there is a positive correlation between **HAZARD** and the risk of incurring losses due to seismic activity, which in turn affects the pricing of the premium.


To derive this predictor, we use the database "Global Seismic Hazard Map Data" (http://gmo.gfz-potsdam.de/pub/gshap_data/gshap_data_frame.html), which is indexed by longitude and latitude. We merge the exposure data with this database using **LONGITUDE** and **LATITUDE** as keys. We then aggregate and average the seismic hazard across all sites of each contract.
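
The merge in this notebook joins on the exact coordinate values, which works when the exposure coordinates already coincide with the grid points of the hazard file. If they did not, one option would be to snap both tables to the grid first. The sketch below assumes a regular 0.1° grid (an assumption to check against ```GSHPUB.DAT```) and uses a hypothetical helper `merge_on_grid`; it is not part of the pipeline.

```python
import pandas as pd

def merge_on_grid(sites, grid, step=0.1):
    """Left-merge `sites` onto a regular lon/lat grid using integer grid indices.

    Hypothetical helper: `step` is the assumed grid spacing in degrees.
    Rows with missing coordinates must be removed beforehand.
    """
    sites = sites.copy()
    grid = grid.copy()
    for frame in (sites, grid):
        # Integer indices avoid exact float comparisons in the join keys.
        frame["lon_idx"] = (frame["LONGITUDE"] / step).round().astype(int)
        frame["lat_idx"] = (frame["LATITUDE"] / step).round().astype(int)
    grid = grid[["lon_idx", "lat_idx", "HAZARD"]]
    merged = sites.merge(grid, how="left", on=["lon_idx", "lat_idx"])
    return merged.drop(columns=["lon_idx", "lat_idx"])

sites = pd.DataFrame({"LONGITUDE": [12.34], "LATITUDE": [45.67], "TIV": [100.0]})
grid = pd.DataFrame({
    "LONGITUDE": [12.3, 12.4],
    "LATITUDE": [45.7, 45.7],
    "HAZARD": [0.8, 0.9],
})
snapped = merge_on_grid(sites, grid)
```

Here the site at (12.34, 45.67) is matched to the grid point (12.3, 45.7) and keeps its original coordinates in the output.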




In [7]:
df_exp = pd.read_csv('data/exposure_data_v3.csv')
df_exp = df_exp.drop_duplicates()
# Deal with missing values in columns ['IV_BI_VAL', 'IV_PD'], then compute the TIV of each location
df_exp['IV_BI_VAL'] = df_exp['IV_BI_VAL'].fillna(0.)
df_exp['IV_PD'] = df_exp['IV_PD'].fillna(0.)
df_exp['TIV'] = df_exp['IV_BI_VAL'] + df_exp['IV_PD']

In [8]:
lat_long = df_exp[['FACUL_NUM','YEAR', 'DIVISION_NUM', 'LONGITUDE','LATITUDE','TIV']]

In [9]:
header_list = ['LONGITUDE', 'LATITUDE', 'HAZARD']
smc = pd.read_csv('data/GSHPUB.DAT', 
                 sep=r"\s+",  # whitespace separator (raw string avoids an invalid escape)
                 header=None,
                 names = header_list)

The weighted average hazard is calculated here. We use the total insured value as the weight because it corresponds to the severity of a possible loss. The weighted hazard can thus be read as a proxy for the maximum loss times the probability of loss.

In [10]:
hazard = pd.merge(lat_long, smc,  how='left',on=['LONGITUDE','LATITUDE'])
hazard['DIVISION_NUM'] = hazard['DIVISION_NUM'].astype('str')
hazard['weighted_hazard'] = hazard['TIV']*hazard['HAZARD']
# Per contract: mean of TIV * HAZARD and total TIV
grouped = hazard.groupby(['FACUL_NUM', 'YEAR', 'DIVISION_NUM'])
hazard_mean = DataFrame({'weighted_hazard': grouped['weighted_hazard'].mean(),
                         'sum_TIV': grouped['TIV'].sum()}).reset_index()
hazard_mean['weighted_hazard'] = hazard_mean['weighted_hazard'] / hazard_mean['sum_TIV']
hazard_mean = hazard_mean.drop(columns=['sum_TIV'])
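
For comparison, a conventional TIV-weighted mean divides the sum of TIV × HAZARD by the sum of TIV within each contract. The cell above divides the group *mean* of the product by the summed TIV, so when every location has a hazard value its result is the conventional weighted mean divided by the number of locations. The sketch below shows the conventional variant on toy data; the helper name `tiv_weighted_hazard` and the toy values are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

def tiv_weighted_hazard(hazard, keys=("FACUL_NUM", "YEAR", "DIVISION_NUM")):
    """TIV-weighted mean of HAZARD per contract: sum(TIV * HAZARD) / sum(TIV)."""
    # Locations without a hazard value are left out of both sums.
    tmp = hazard.dropna(subset=["HAZARD"]).copy()
    tmp["tiv_x_hazard"] = tmp["TIV"] * tmp["HAZARD"]
    sums = tmp.groupby(list(keys))[["tiv_x_hazard", "TIV"]].sum()
    out = (sums["tiv_x_hazard"] / sums["TIV"]).rename("weighted_hazard")
    return out.reset_index()

toy = pd.DataFrame({
    "FACUL_NUM": ["A", "A", "B"],
    "YEAR": [2020, 2020, 2019],
    "DIVISION_NUM": ["1", "1", "1"],
    "TIV": [100.0, 300.0, 50.0],
    "HAZARD": [1.0, 2.0, 0.5],
})
result = tiv_weighted_hazard(toy)
# Contract A: (100 * 1.0 + 300 * 2.0) / 400 = 1.75
```

Whether the extra division by the number of locations is intended is a modelling choice; switching variants would change the `weighted_hazard` values shown in the outputs below.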

In [11]:
df_th = pd.merge(hazard_mean,
                df_t,
                how='right',
                left_on=['FACUL_NUM', 'YEAR', 'DIVISION_NUM'],
                right_on=['FACUL_NUM', 'UWYEAR', 'DIVISION_NUM']
               )

In [12]:
df_th.head()

Unnamed: 0,FACUL_NUM,YEAR,DIVISION_NUM,weighted_hazard,SUBMISSION_ID,SEGMENT_LOB,UWYEAR,CT_PERIOD,MAINOCCUPANCY,SECTOR,BUSINESSUNIT,UWCENTER,SCOPE_PERILS,SUBSIDIARY,PARTTYPE,GUARANTEE,MAIN_PRICING_CATEG,BI_TYPE,BI_PERIOD,INSUREDVALUEPD,INSUREDVALUEBI,TOTALINSUREDVALUE,NBLOCS,OIL,LIMIT,ATTACHMENT,XXX_SHARE,MODELED_CAT_EXPLOSS,DISCOUNTS,DEDUCTION,EXT_EXPENSE,WORDING,QUALITY_RISK_MGT,ASSET_QUALITY,BI_MITIGATION,MB_QUALITY,TXCHANGE,FXRATEUSD,TOP_MPL,TOP_FMLS,PD_price(M),BI_time(Days),Both_price(M),PD_percent_loss (%),COVER_BI,GEO_MARKET_SEGMENT,UWYEAR_label,INCEPTION_month
0,FA0020462,2020,1,0.449224,1,Ppty Non Energy,2020,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,638676500.0,309468600.0,948145000.0,3,0.0,180635834.0,0.0,0.025,0.0,0.0,0.072997,0.072997,Standard,Average,Average,Average,Average,0.903179,1,250044400.0,0.0,0.5,30.0,0.0,0.0,True,Latin America,1,6
1,11F008861,2019,1,0.479085,2,Ppty Non Energy,2019,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,612051400.0,191214100.0,803265500.0,3,0.0,175963400.0,0.0,0.075,0.0,0.0,0.09342,0.09342,Standard,Average,Average,Average,Average,0.879817,1,209700500.0,0.0,0.4,30.0,0.0,0.0,True,Latin America,1,6
2,FA0003626,2017,1,2.87224,3,Ppty Non Energy,2017,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX RE US,Excess of Loss,EU,Mining - Hard Rock,Loss of profit,24.0,1585276000.0,562431200.0,2147707000.0,1,0.0,278732700.0,0.0,0.075,0.0,0.0,0.095564,0.095564,Standard,Average,Average,Average,Average,0.929109,1,882653600.0,0.0,4.6,30.0,0.0,0.0,True,Latin America,1,10
3,11F007069,2018,2,0.025928,4,Ppty Non Energy,2018,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,24.0,825993400.0,973995000.0,1799988000.0,2,0.0,589622642.0,0.0,0.12,0.0,0.0,0.635696,0.635696,Standard,Average,Average,Average,Average,0.842318,1,928537700.0,0.0,2.1,30.0,0.0,0.0,True,Latin America,1,5
4,11F008861,2020,1,0.449208,5,Ppty Non Energy,2020,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,638728300.0,309429200.0,948157500.0,3,0.0,180635834.0,0.0,0.075,0.0,0.0,0.080394,0.080394,Standard,Average,Average,Average,Average,0.903179,1,250045200.0,0.0,0.5,30.0,0.0,0.0,True,Latin America,1,6


In [13]:
df_t = df_th.drop(columns=['YEAR'])

### UW Ratio 

The preprocessing of UW Ratio is explained in the file ```A - UW Ratio Preprocessing```. 
We have constructed a UW index that captures the influence of past losses. The formula is $uw\_index = uw_{-1} + \alpha \cdot uw_{-2} + \dots + \alpha^{my-1} \cdot uw_{-my}$, where $uw_{-i}$ is the UW ratio $i$ years before the underwriting year (UWYEAR) of the contract and $my$ is the maximum number of past years taken into account. The decay parameter $\alpha$ models the fading influence of older losses, giving a kind of 'present value' of the UW ratio. We tuned these two parameters with a basic XGBoost regressor; the best result was obtained with $\alpha=0.5$ and $my=7$. This is also consistent with the convention of keeping accident records for 7 years in some types of insurance (car insurance, for example).
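
To make the formula concrete, here is a small worked example on toy data. It assumes, as in the notebook's own function, that the past UW ratios are stored in columns named after the year (`'2019'`, `'2018'`, ...); the helper name `uw_index_for_row` and the choice to treat a missing year as 0 are assumptions made for the illustration.

```python
import pandas as pd

def uw_index_for_row(row, alpha=0.5, max_year=7):
    """Decayed sum of the UW ratios of the `max_year` years before row['UWYEAR']."""
    year = int(row["UWYEAR"])
    return sum(
        # Year columns that are absent count as 0 (assumption for this sketch).
        row.get(str(year - i - 1), 0.0) * alpha ** i
        for i in range(max_year)
    )

toy = pd.DataFrame({"UWYEAR": [2020], "2019": [0.4], "2018": [0.2], "2017": [0.8]})
toy["uw_index"] = toy.apply(uw_index_for_row, axis=1, max_year=3)
# With alpha = 0.5 and my = 3: 0.4 + 0.5 * 0.2 + 0.25 * 0.8 = 0.7
```

The most recent year enters with weight 1, and each additional year back is discounted by a further factor of $\alpha$.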

In [14]:
%run 'A - UW Ratio Preprocessing.ipynb'



  0%|          | 0/2133 [00:00<?, ?it/s]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18481 entries, 0 to 18480
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   FACUL_NUM      18481 non-null  object 
 1   UWYEAR         18481 non-null  int64  
 2   UW_RATIO_2014  10717 non-null  float64
 3   UW_RATIO_2015  11932 non-null  float64
 4   UW_RATIO_2016  13281 non-null  float64
 5   UW_RATIO_2017  14743 non-null  float64
 6   UW_RATIO_2018  16180 non-null  float64
 7   UW_RATIO_2019  17475 non-null  float64
 8   UW_RATIO_2020  18375 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 1.3+ MB


100%|██████████| 2133/2133 [01:36<00:00, 22.04it/s]


In [15]:
df_uw = data  # `data` is defined by the 'A - UW Ratio Preprocessing.ipynb' notebook run above
# Merge on the key columns only; the left merge keeps every contract
df_t_uw = pd.merge(df_t, df_uw, how='left', on=['FACUL_NUM', 'UWYEAR']).reset_index(drop=True)
df_t_uw = df_t_uw.fillna(0)

In [16]:
def uw_index(df, alpha, max_year):
    """
    Add a 'uw_index' column: the decayed sum of past UW ratios.

    alpha is the decay parameter;
    max_year is the number of past years used
    (max_year=1: only the last year's UW ratio is taken into account).
    """
    df1 = df.copy()
    df1['uw_index'] = 0.
    for index, row in df1.iterrows():
        y = row['UWYEAR']
        total = 0.
        for i in range(max_year):
            # UW ratio of year y-i-1, discounted by alpha**i
            total += row[str(y - i - 1)] * pow(alpha, i)
        df1.at[index, 'uw_index'] = total
    # The yearly UW ratio columns are no longer needed once the index is built
    df1.drop(columns=['2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019'], inplace=True)
    return df1

In [17]:
df_t_uw = uw_index(df_t_uw,0.5,7)

In [18]:
df_t_uw.tail()

Unnamed: 0,FACUL_NUM,DIVISION_NUM,weighted_hazard,SUBMISSION_ID,SEGMENT_LOB,UWYEAR,CT_PERIOD,MAINOCCUPANCY,SECTOR,BUSINESSUNIT,UWCENTER,SCOPE_PERILS,SUBSIDIARY,PARTTYPE,GUARANTEE,MAIN_PRICING_CATEG,BI_TYPE,BI_PERIOD,INSUREDVALUEPD,INSUREDVALUEBI,TOTALINSUREDVALUE,NBLOCS,OIL,LIMIT,ATTACHMENT,XXX_SHARE,MODELED_CAT_EXPLOSS,DISCOUNTS,DEDUCTION,EXT_EXPENSE,WORDING,QUALITY_RISK_MGT,ASSET_QUALITY,BI_MITIGATION,MB_QUALITY,TXCHANGE,FXRATEUSD,TOP_MPL,TOP_FMLS,PD_price(M),BI_time(Days),Both_price(M),PD_percent_loss (%),COVER_BI,GEO_MARKET_SEGMENT,UWYEAR_label,INCEPTION_month,uw_index
3254,FA0020373,1,0.238015,3255,Ppty Non Energy,2017,12,Textiles,BS CP,BS Property,LAC,All Risks,XXX ASIA PACIFIC,Quota-Share,EU,Textile industry,Loss of profit,0.0,887930300.0,0.0,887930300.0,1,0.0,515340000.0,0.0,0.075,0.0,0.0,0.043635,0.043635,Standard,Average,Average,Non applicable,Average,4.1e-05,0.04402,104450500.0,0.0,0.9,0.0,0.0,0.0,False,Emerging Asia,1,5,0.0
3255,FA0045173,1,0.238015,3256,Ppty Non Energy,2018,12,Textiles,BS CP,BS Property,LAC,All Risks,XXX ASIA PACIFIC,Quota-Share,EU,Textile industry,Loss of profit,0.0,826958600.0,0.0,826958600.0,1,0.0,467082000.0,0.0,0.075,0.0,0.0,0.052358,0.052358,Standard,Average,Average,Non applicable,Average,3.7e-05,0.04401,94669420.0,0.0,0.0,0.0,0.0,10.0,False,Emerging Asia,1,5,0.0
3256,FA0045173,1,0.238015,3257,Ppty Non Energy,2019,12,Textiles,BS CP,BS Property,LAC,All Risks,XXX ASIA PACIFIC,Quota-Share,EU,Textile industry,Loss of profit,0.0,891857900.0,0.0,891857900.0,1,0.0,475650000.0,0.0,0.025,0.0,0.0,0.038689,0.038689,Standard,Average,Average,Non applicable,Average,3.8e-05,0.04291,96406010.0,0.0,0.0,0.0,0.0,10.0,False,Emerging Asia,1,5,0.0
3257,FA0041457,1,0.004342,3258,Ppty Non Energy,2020,12,Telecommunications,BS CP,BS Property,EMEA,All Risks,XXX UK,Quota-Share,EU,Telecommunications & Media,Gross earning,12.0,461470300.0,73310920.0,534781300.0,395,0.0,90317917.0,0.0,0.1,42026.986571,0.0,0.0,0.0,Standard,Average,Average,Average,Average,0.903179,1.0,67890870.0,0.0,0.0,0.0,0.0,0.0,True,Others,1,6,1.363152
3258,02F052652,1,0.000656,3259,Ppty Non Energy,2017,12,Industrial Conglomerate,BS CP,BS Property,EMEA,All Risks,XXX REASSURANCE,Excess of Loss,EU,Food producers,Loss of profit,24.0,8977364000.0,8954379000.0,17931740000.0,507,0.0,238534455.0,69600705.0,0.09,0.0,0.0,1.201958,1.201958,Standard,Average,Average,Average,Average,0.067573,7272936.0,803712100.0,0.0,0.0,0.0,0.0,0.0,True,Africa,1,9,0.019349


### Market Index

The detailed explanation of the market index can be found in the **Market Data Scrapping** notebook. In short, we create an index that indicates the market situation of each industry for each year.

In [19]:
df_fd = pd.read_csv('data/market_values.csv')  

In [20]:
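ratios = []

for i, row in tqdm(df_t_uw.iterrows()):
    year = row['UWYEAR']
    mpc = row["MAIN_PRICING_CATEG"]
    # Market value of this pricing category for the underwriting year
    match = df_fd.loc[df_fd["MAIN_PRICING_CATEG"] == mpc, str(year)]
    try:
        # .item() requires exactly one matching row
        ratio = float(match.item())
    except (ValueError, TypeError):
        # No unique usable value: report the category and keep a missing value
        print(mpc)
        print(match)
        ratio = np.nan
    ratios.append(ratio)

data = df_t_uw.assign(financial_ratio=ratios)
data.head()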

3259it [00:03, 1081.65it/s]


Unnamed: 0,FACUL_NUM,DIVISION_NUM,weighted_hazard,SUBMISSION_ID,SEGMENT_LOB,UWYEAR,CT_PERIOD,MAINOCCUPANCY,SECTOR,BUSINESSUNIT,UWCENTER,SCOPE_PERILS,SUBSIDIARY,PARTTYPE,GUARANTEE,MAIN_PRICING_CATEG,BI_TYPE,BI_PERIOD,INSUREDVALUEPD,INSUREDVALUEBI,TOTALINSUREDVALUE,NBLOCS,OIL,LIMIT,ATTACHMENT,XXX_SHARE,MODELED_CAT_EXPLOSS,DISCOUNTS,DEDUCTION,EXT_EXPENSE,WORDING,QUALITY_RISK_MGT,ASSET_QUALITY,BI_MITIGATION,MB_QUALITY,TXCHANGE,FXRATEUSD,TOP_MPL,TOP_FMLS,PD_price(M),BI_time(Days),Both_price(M),PD_percent_loss (%),COVER_BI,GEO_MARKET_SEGMENT,UWYEAR_label,INCEPTION_month,uw_index,financial_ratio
0,FA0020462,1,0.449224,1,Ppty Non Energy,2020,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,638676500.0,309468600.0,948145000.0,3,0.0,180635834.0,0.0,0.025,0.0,0.0,0.072997,0.072997,Standard,Average,Average,Average,Average,0.903179,1,250044400.0,0.0,0.5,30.0,0.0,0.0,True,Latin America,1,6,0.051708,1.177278
1,11F008861,1,0.479085,2,Ppty Non Energy,2019,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,612051400.0,191214100.0,803265500.0,3,0.0,175963400.0,0.0,0.075,0.0,0.0,0.09342,0.09342,Standard,Average,Average,Average,Average,0.879817,1,209700500.0,0.0,0.4,30.0,0.0,0.0,True,Latin America,1,6,0.144084,1.076494
2,FA0003626,1,2.87224,3,Ppty Non Energy,2017,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX RE US,Excess of Loss,EU,Mining - Hard Rock,Loss of profit,24.0,1585276000.0,562431200.0,2147707000.0,1,0.0,278732700.0,0.0,0.075,0.0,0.0,0.095564,0.095564,Standard,Average,Average,Average,Average,0.929109,1,882653600.0,0.0,4.6,30.0,0.0,0.0,True,Latin America,1,10,0.0,1.159407
3,11F007069,2,0.025928,4,Ppty Non Energy,2018,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,24.0,825993400.0,973995000.0,1799988000.0,2,0.0,589622642.0,0.0,0.12,0.0,0.0,0.635696,0.635696,Standard,Average,Average,Average,Average,0.842318,1,928537700.0,0.0,2.1,30.0,0.0,0.0,True,Latin America,1,5,0.0,1.042942
4,11F008861,1,0.449208,5,Ppty Non Energy,2020,12,Precious Metals Mines,BS Energy,BS Energy,LAC,All Risks,XXX CANADA,Quota-Share,EU,Mining - Hard Rock,Loss of profit,12.0,638728300.0,309429200.0,948157500.0,3,0.0,180635834.0,0.0,0.075,0.0,0.0,0.080394,0.080394,Standard,Average,Average,Average,Average,0.903179,1,250045200.0,0.0,0.5,30.0,0.0,0.0,True,Latin America,1,6,0.07672,1.177278


# <font color='darkorange'> Save processed data </font>

In [21]:
# Save the processed dataset (train or test, depending on the choice made above)
if preprocess_train:
    data.to_csv('data/processed_data_train.csv', index=False)
else:
    data.to_csv('data/processed_data_test.csv', index=False)


Our final dataset now looks like this. We are now ready to train our models.

In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3259 entries, 0 to 3258
Data columns (total 49 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FACUL_NUM            3259 non-null   object 
 1   DIVISION_NUM         3259 non-null   object 
 2   weighted_hazard      3259 non-null   float64
 3   SUBMISSION_ID        3259 non-null   int64  
 4   SEGMENT_LOB          3259 non-null   object 
 5   UWYEAR               3259 non-null   int64  
 6   CT_PERIOD            3259 non-null   int64  
 7   MAINOCCUPANCY        3259 non-null   object 
 8   SECTOR               3259 non-null   object 
 9   BUSINESSUNIT         3259 non-null   object 
 10  UWCENTER             3259 non-null   object 
 11  SCOPE_PERILS         3259 non-null   object 
 12  SUBSIDIARY           3259 non-null   object 
 13  PARTTYPE             3259 non-null   object 
 14  GUARANTEE            3259 non-null   object 
 15  MAIN_PRICING_CATEG   3259 non-null   o