<div style="text-align: center; margin-bottom: 5px;">
    <h1 style="font-size: 2.5em;">Uncovering the Cost of Risk</h1>
    <h2 style="font-weight: normal; margin-bottom: 15px; color: gray;">
        An ML Exploration of Life Insurance Premiums in the US and UK
    </h2>
    <hr style="width: 50%; margin: 0 auto; border: 1px solid lightgray">
</div>

<div style="text-align: left;">
    <h2 style="font-size: 2.25em; margin-bottom: 10px;">Introduction</h2>
    <hr style="width: 98%; margin: 0 auto; border: 0.5px solid #ccc; border-top: 0.0005px solid #333;">
</div>

Why does a 30-year-old man in the UK pay significantly less for life insurance than his counterpart in the US, and why might a 40-year-old non-smoker in excellent health face higher premiums than a 20-year-old smoker? Insurance pricing is often complex and opaque, but uncovering the patterns behind it is crucial. For consumers, greater transparency means fairer treatment, stronger trust in an increasingly algorithm-driven market, and the ability to conduct a proper cost-benefit analysis so that they are adequately insured relative to their needs and means. For insurers, rigorous and justified pricing is essential not only for maintaining profitability (by avoiding losses from underpricing risky customers and by retaining and attracting customers through competitive pricing) but also for ensuring accurate reserving, as mispricing risks can lead to either under-reserving (threatening solvency, which is a core regulatory concern) or over-reserving (tying up capital that could be more efficiently allocated elsewhere).

By offering financial protection against the economic impact of an unexpected death, life insurance fundamentally provides peace of mind for the insured and their loved ones: it secures the family's well-being through income replacement, covers immediate expenses like funeral costs and estate taxes, and, critically, ensures that large outstanding debts (such as mortgages) and other financial obligations can be met. But this raises a deeper question: why is it especially important to understand how life insurance is priced? The primary distinction is the long-term contractual agreement inherent to life insurance (present in all types but especially pertinent with level cover), with the terms and pricing of an agreement often being set in stone for decades at a time (usually 10-30 years). Because of this long-term commitment, even minor pricing differences can accumulate into significant financial impacts: reducing disposable income, straining household budgets, and increasing the risk of underinsurance if customers are forced to scale back or cancel coverage altogether. On the flip side, taking the time to secure lower premiums can lead to substantial lifetime savings, freeing up resources for other financial goals such as investing, debt repayment, or building a stronger safety net. Additionally, although term policies can usually be cancelled at any time, often at no additional cost, there are considerable consequences for not obtaining suitable cover from the start. Firstly, if someone realises that their cover is insufficient only after their health has deteriorated (which is relatively common and a symptom of present bias), obtaining additional or replacement insurance becomes difficult or prohibitively expensive, exacerbating an already difficult situation. Moreover, any future application would face an age and risk reset: it would be reassessed at an older age and potentially in worse health, leading to substantially higher premiums or even denial of coverage.

Regulators in both the UK and US are well aware of this issue and have implemented a range of measures to ensure clients understand how premiums are determined. These regulations also ensure that insurers base their pricing on a complete risk profile, as insurers too face significant consequences if clients are either overinsured, which leads to unnecessary costs (due to higher claims and increased risk of moral hazard), or underinsured, which can result in financial instability for both the client and the insurer. For instance, in the UK, the FCA has introduced guidelines such as the Insurance Conduct of Business Sourcebook (ICOBS), which includes a stipulation that "A firm must take reasonable steps to ensure that the insurance product it recommends or offers is suitable for the customer’s demands and needs" (FCA, 2021), reducing the incidence of clients obtaining inadequate cover. In the US, under Proposition 103, California requires that the methodology used to obtain premiums, along with justifications for rate changes, be submitted to the state department of insurance for approval before implementation, a requirement that was likely introduced to mitigate over- and underinsurance. Despite these attempts, there is still clearly a disconnect between the consumer and the insurer (in both directions), as is evident in a study by the ABI (Association of British Insurers, 2019). The study found that only 29% of customers believed that they understood how their premiums are calculated, that "70% of clients incorrectly assume that gender is taken into account when pricing" (a reality that will become more evident as we go on), and, shockingly, that 41% would prefer to keep "information sharing with their insurer to a minimum", even if it means premiums may rise.

Given this persistent lack of transparency, the primary goal of this project is to open the so-called "black box" of life insurance pricing, demystifying pricing model behaviour and highlighting the key factors that influence premiums, subsequently enabling consumers to recognise the trade-offs involved in their policy choices. For insurers, examining a wide dataset of quotes — including cross-country comparisons between the US and the UK — may reveal key differences in pricing models and average pricing within the market. This could help identify gaps in their own risk models and enable them to benchmark their offerings against competitors, improving their competitive edge and pricing strategies in both markets. To achieve these objectives, machine learning will be employed due to its ability to handle large, complex datasets and reveal non-linear relationships between variables that traditional statistical methods might miss. By applying interpretable machine learning models, we can not only predict premium prices based on given risk information, but also provide insights into which factors most significantly influence pricing, helping both consumers and insurers understand the underlying drivers of premium calculations.

<div style="text-align: left;">
    <h2 style="font-size: 2.25em; margin-bottom: 10px;">Methodology</h2>
    <hr style="width: 98%; margin: 0 auto; border: 0.5px solid #ccc; border-top: 0.0005px solid #333;">
</div>

To fulfil these goals, life insurance quotes were scraped from independent insurance brokerage platforms (lifeinsure.com in the US, drewberryinsurance.co.uk in the UK), which provide consumer-facing premium estimates based on user-inputted risk profiles across multiple insurers. Given that these quotes originate from actual insurers and are intended for real customers, they not only provide an authentic depiction of real-world market conditions but also span a diverse range of products across multiple insurers, capturing the intricacies of differing pricing models, underwriting criteria and risk tolerances in a single representative dataset on which to train our model. Additionally, unlike proprietary insurer algorithms, which usually incorporate undisclosed risk factors (e.g., geodemographic data, credit score, past claims history), these platforms clearly reveal the marginal effect that changing a single risk factor has on premiums, with the caveat that these are base premiums (which will likely be subject to additional underwriting) and therefore may not accurately indicate what the consumer will ultimately pay. Issues may also arise from sampling bias, as commission agreements and partnerships between brokers and certain insurers could artificially limit product competition, skewing the data toward specific offerings. However, this concern is mitigated by the fact that brokers are legally required to act with fiduciary duty to their clients, meaning they must prioritise the best interests of their client above any financial incentives from insurers, reducing the likelihood of biased or overinflated product recommendations.

Nonetheless, as with any online data collection, platforms are subject to radical and sudden changes, as was experienced firsthand when LifeInsure implemented scraping restrictions shortly after final data collection had taken place. At the time, no explicit restrictions were in place, and scraping had occurred smoothly over several days without interruption, suggesting that their scraping tolerance had suddenly changed. Of course, this damages the replicability and perhaps somewhat undermines the validity of this study, yet the data actually collected does still remain valid and representative within the context of the time period obtained and will therefore aptly provide valuable contributions to our understanding of life insurance pricing.

<table style="margin: 0 auto; text-align: center; border-collapse: collapse;">
  <thead>
    <tr>
      <th>Variable Name</th>
      <th>Description</th>
      <th>Data Type</th>
      <th>Sample Size</th>
      <th>Summary / Distribution</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>Premium (£)</code></td>
      <td>Monthly premium in GBP, essentially price of holding the insurance policy.</td>
      <td>Numeric (Continuous)</td>
      <td>14,180</td>
      <td>Target Variable</td>
      <td>USD values converted using ~0.746 exchange rate (XE, n.d.).</td>
    </tr>
    <tr>
      <td><code>ln(Premium)</code></td>
      <td>Natural log of premium.</td>
      <td>Numeric (Continuous)</td>
      <td>14,180</td>
      <td>Target Variable</td>
      <td></td>
    </tr>
        <tr>
      <td><code>ln(Coverage_Amount)</code></td>
      <td>Natural log of coverage amount: the maximum amount payable in case of a claim, also known as the "sum insured".</td>
      <td>Numeric (Discrete)</td>
      <td>14,180</td>
      <td>ln([100,000, 250,000, 350,000, 500,000,<br>750,000, 1,000,000, 2,000,000, 5,000,000])</td>
      <td>USD values converted using ~0.746 exchange rate (XE, n.d.).</td>
    </tr>
        <tr>
      <td><code>Term_Length</code></td>
      <td>Duration of insurance term (in years).</td>
      <td>Numeric (Discrete)</td>
      <td>14,180</td>
      <td>[10, 15, 20, 25, 30]</td>
      <td>All terms are level; coverage remains constant over time (no interaction with term).</td>
    </tr>
        <tr>
      <td><code>Age</code></td>
      <td>Age of the individual in years.</td>
      <td>Numeric (Discrete)</td>
      <td>14,180</td>
      <td>[20, 30, 40, 45, 50, 60, 65, 70]</td>
      <td>Ages calculated based on 1st Jan baseline.</td>
    </tr>
        <tr>
      <td><code>Is_Male</code></td>
      <td>Gender of applicant<br>(1 = Male, 0 = Female).</td>
      <td>Categorical</td>
      <td>14,180</td>
      <td>Male: 7,133,<br>Female: 7,078</td>
      <td></td>
    </tr>
        <tr>
      <td><code>Is_Smoker</code></td>
      <td>Nicotine use of applicant<br>(1 = Smoker, 0 = Non-Smoker).</td>
      <td>Categorical</td>
      <td>14,180</td>
      <td>Smoker: 7,102,<br>Non-Smoker: 7,078</td>
      <td></td>
    </tr>
        <tr>
      <td><code>Is_UK</code></td>
      <td>Country of quote<br>(1 = UK, 0 = US).</td>
      <td>Categorical</td>
      <td>14,180</td>
      <td>UK: 11,894,<br>US: 2,286</td>
      <td></td>
    </tr>
    
  </tbody>
</table>
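
The currency conversion and log transformations listed in the table can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing code: the function name `prepare_quote` and the example quote are hypothetical, and only the ~0.746 USD-to-GBP rate comes from the table notes.

```python
import math

USD_TO_GBP = 0.746  # approximate rate quoted in the table notes (XE, n.d.)

def prepare_quote(premium, coverage, currency):
    """Return (premium_gbp, ln_premium, ln_coverage) for a single quote.

    Hypothetical helper: US quotes are converted to GBP first, then both
    the premium and the coverage amount are log-transformed.
    """
    rate = USD_TO_GBP if currency == "USD" else 1.0
    premium_gbp = premium * rate
    coverage_gbp = coverage * rate
    return premium_gbp, math.log(premium_gbp), math.log(coverage_gbp)

# Made-up US quote: $40.00 per month for $500,000 of cover.
premium_gbp, ln_premium, ln_coverage = prepare_quote(40.0, 500_000, "USD")
print(round(premium_gbp, 2), round(ln_premium, 3), round(ln_coverage, 3))
```

Exponentiating a log-scale value (`math.exp`) recovers the original amount in GBP, which is useful when reading predictions back on the premium scale.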


To ensure consistency and comparability across the datasets, the variables selected for this analysis had to be limited to those shared between the two broker platforms. Consequently, certain key assumptions had to be made wherever variables were not shared, to preserve the integrity and validity of the analysis. Interestingly, the US site required certain health information (subjective personal health ratings, along with height and weight, presumably to calculate BMI), which was not requested by its UK equivalent for the initial quote calculations, although this does not necessarily mean the UK quotes would not have required it at inception. To solve this problem, standard baselines were chosen where no loading was likely to take place, such as an average health rating and a healthy BMI of ~24 (height: 5'10", weight: 167 lbs). Similarly, the UK platform specifically asked for employment status and occupation (which would supposedly flag especially risky jobs and load rates accordingly), so in much the same vein the default options of 'Employed' and 'Others - Not Listed' were set as the employment status and occupation respectively. As general assumptions across both tools, all quotes were based on level cover with no medical exams, allowing for greater consistency between them, since the rate of cover reduction in decreasing term policies, the complexity of the medical exams required, and the harshness of the associated underwriting all vary widely. Furthermore, supplemental policy benefits were ignored as they are too numerous and sporadic to realistically analyze, while critical illness cover was also omitted for simplicity, ensuring the focus remains on core life insurance variables.
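
As a quick check on the baseline chosen for the US site, the BMI implied by 5'10" and 167 lbs can be computed directly. The helper name below is hypothetical; the conversion constants are the standard definitions of the pound and the inch.

```python
LB_TO_KG = 0.45359237  # international pound, exact by definition
IN_TO_M = 0.0254       # inch, exact by definition

def bmi_from_imperial(feet, inches, pounds):
    """Body mass index (kg / m^2) from height in feet and inches and weight in pounds."""
    height_m = (feet * 12 + inches) * IN_TO_M
    weight_kg = pounds * LB_TO_KG
    return weight_kg / height_m ** 2

baseline_bmi = bmi_from_imperial(5, 10, 167)
print(round(baseline_bmi, 1))  # → 24.0
```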

Of course, studying every possible combination of variables, especially for the UK site where some inputs allow free-text entry, would be far too time-consuming and inefficient. Instead, representative samples were chosen for certain variables, as shown in the table above. For cover amounts, the sample was designed to have higher resolution at lower values, with wider gaps between values as the amounts increased. This approach reflects general market trends, where the majority of consumers typically hold policies within the £100,000–£500,000 range, ensuring that the model is trained with finer detail where most customers fall, leading to better predictive accuracy and more realistic pricing estimates. As coverage amounts increase, the number of policies falls off sharply, following an approximately exponential decay, meaning less granularity is needed to capture the complete distribution towards the higher ends of cover. To address this expected right-skewed distribution, a log transformation was applied, compressing the extreme values with the aim of ensuring that Random Forest splits remained sensitive to the full spectrum of coverage amounts, while giving more weight to prevalent values. Ages and term lengths were sampled on roughly the same premise, keeping a consistent distribution, with finer granularity added where marginal changes were likely to be disproportionately high. For example, ages 45 and 65 were chosen because they lie in intervals where most people begin to seriously consider purchasing life insurance (implying premiums may be inflated due to excess demand); age 65 also marks the cut-off where significant underwriting requirements and loadings are imposed, reflecting the heightened risk of health complications at these ages.
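
The sampling design described above can be sketched as a grid of risk profiles built from the values listed in the variable table. This is only an illustration: the variable names are ours, the scraper itself is not reproduced, and the sketch does not model how many insurers return a quote for each profile (which is why the grid size differs from the dataset's sample size).

```python
import itertools
import math

# Values taken from the variable table above.
coverage_amounts = [100_000, 250_000, 350_000, 500_000,
                    750_000, 1_000_000, 2_000_000, 5_000_000]
term_lengths = [10, 15, 20, 25, 30]
ages = [20, 30, 40, 45, 50, 60, 65, 70]
binary = [0, 1]  # used for both Is_Male and Is_Smoker

# One risk profile per combination, for a single country.
profiles = [
    {"Coverage_Amount": c, "Term_Length": t, "Age": a, "Is_Male": m, "Is_Smoker": s}
    for c, t, a, m, s in itertools.product(
        coverage_amounts, term_lengths, ages, binary, binary)
]
print(len(profiles))  # → 1280

# Spacing between neighbouring coverage values, raw versus log scale.
pairs = list(zip(coverage_amounts, coverage_amounts[1:]))
raw_gaps = [hi - lo for lo, hi in pairs]
log_gaps = [math.log(hi / lo) for lo, hi in pairs]
print(max(raw_gaps) / min(raw_gaps))            # widest raw gap relative to the narrowest
print(round(max(log_gaps) / min(log_gaps), 2))  # the same ratio on the log scale
```

On the raw scale the widest gap is thirty times the narrowest, whereas on the log scale the gaps are far more even, which is the compression effect the log transformation is meant to achieve.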

Another important thing to note from the table above is the severe imbalance between US and UK samples, with the UK sample containing roughly 5 times more data than the US sample. Consequently, the patterns and relationships the model captures are more reflective of UK market dynamics, with UK-specific consumer behaviors, pricing structures, and underwriting practices exerting a stronger influence on the learned outcomes. Although the model remains valid and internally consistent within the combined dataset, the relatively small and potentially less representative US sample limits the depth and breadth of US-specific trends captured during training. This raises the possibility that certain effects observed for the US subset may be disproportionately shaped by UK-driven trends or by noise within the limited US data. As such, predictions and inferences related specifically to the US market should be interpreted cautiously to account for the heightened risk of bias and potential underrepresentation.
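
To put a number on the imbalance, the sketch below computes the UK-to-US ratio from the sample sizes in the table, along with the inverse-frequency weights that would give both countries equal total influence. Reweighting is shown purely as an illustration of one common mitigation; nothing in this project indicates that sample weights were applied, and the helper name is hypothetical.

```python
n_uk, n_us = 11_894, 2_286  # sample sizes from the variable table

ratio = n_uk / n_us
print(round(ratio, 2))  # → 5.2

def balanced_weights(counts):
    """Inverse-frequency weights: total / (number of groups * group size)."""
    total = sum(counts.values())
    return {group: total / (len(counts) * n) for group, n in counts.items()}

weights = balanced_weights({"UK": n_uk, "US": n_us})
print({group: round(w, 3) for group, w in weights.items()})

# With these weights, each country contributes the same total weight.
print(round(weights["UK"] * n_uk, 1), round(weights["US"] * n_us, 1))
```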

In this project, Random Forests were used as the primary machine learning model due to their ability to handle large, complex datasets and capture non-linear relationships between variables. A Random Forest is an ensemble method that builds multiple decision trees — where each tree splits the data based on different features (like age or smoking status) — and then predicts the target variable by averaging the outputs of all these trees. By combining the results of these trees, Random Forests improve predictive stability and capture more generalized patterns within the training data, making them well-suited for modeling life insurance premiums, where input-output relationships are often complex. Their robustness to noise and straightforward interpretability make them a practical choice, especially considering the need for the results to be accessible and reliable for a broad, multifaceted audience. 
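
To make the mechanics concrete, here is a deliberately simplified, standard-library sketch of the ingredients described above: bootstrap resampling, a random choice of feature per tree, and averaging across trees. It is not the model used in this project: each "tree" is restricted to a single split (a stump), and the toy data and its coefficients are made up for illustration.

```python
import random
import statistics

def fit_stump(rows, targets, features):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # chosen feature was constant in this bootstrap sample
        mean = statistics.fmean(targets)
        return (features[0], float("inf"), mean, mean)
    return best[1:]

def fit_forest(rows, targets, n_trees=50, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(rows)), k=len(rows))   # bootstrap sample
        features = rng.sample(range(len(rows[0])), k=1)    # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Average the predictions of all stumps."""
    return statistics.fmean([lm if row[f] <= thr else rm for f, thr, lm, rm in forest])

# Made-up toy data: (Age, Is_Smoker) -> ln(Premium).
rows = [(age, smoker) for age in (20, 30, 40, 50, 60, 70) for smoker in (0, 1)]
targets = [2.0 + 0.05 * age + 0.8 * smoker for age, smoker in rows]

forest = fit_forest(rows, targets)
older_smoker = predict(forest, (70, 1))
young_non_smoker = predict(forest, (20, 0))
print(round(older_smoker, 2), round(young_non_smoker, 2))
```

A full implementation grows much deeper trees and considers several candidate features at every split, but the averaging step that stabilises predictions is the same.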

To enhance the interpretability and insightfulness of the Random Forest model, Partial Dependence Plots (PDPs) and SHAP values were computed from its predicted outcomes. PDPs reveal how an individual variable influences predictions on average, by fixing that variable at a given value and averaging the model's predictions over the observed values of all other variables. SHAP values, by contrast, break down each feature's specific contribution to a given predicted outcome. Together, these tools help clarify the impact of factors like age, smoking status, and coverage amount on premium prices, providing a deeper understanding of how the model interprets these variables and their relationships with the target outcome.
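
The partial dependence calculation can be written out by hand in a few lines. In the sketch below, `toy_model` is a made-up stand-in for a fitted model's prediction function and the data is synthetic, so the numbers are purely illustrative. The per-row curves are the Individual Conditional Expectation (ICE) curves, and the partial dependence at each grid value is their average.

```python
import statistics

def toy_model(age, is_smoker):
    """Stand-in for a fitted model's predict function (coefficients are made up)."""
    return 2.0 + 0.04 * age + 0.7 * is_smoker + 0.01 * age * is_smoker

# Synthetic, balanced data: (Age, Is_Smoker).
data = [(age, smoker) for age in (20, 30, 40, 50, 60) for smoker in (0, 1)]
age_grid = [20, 40, 60]

# ICE: for each row, vary Age over the grid while keeping that row's other values.
ice_curves = [[toy_model(age, smoker) for age in age_grid] for _, smoker in data]

# PDP: average the ICE curves at each grid value.
pdp = [statistics.fmean(curve[i] for curve in ice_curves) for i in range(len(age_grid))]

for age, value in zip(age_grid, pdp):
    print(age, round(value, 3))
```

Because the toy model includes an age-by-smoker interaction, the smoker and non-smoker ICE curves have different slopes, which the averaged PDP alone would hide.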

In [4]:
from IPython.display import HTML  # HTML is used to display the figures stored in ./notebooks_dev/Figures.
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig1-premdist.png" style="width:70vw"/>
</div>
""")

These two plots illustrate the distribution of insurance premiums before and after log transformation. The raw premiums in our dataset are heavily right-skewed, with the majority of values clustered at lower amounts (under £1,000) and a long tail extending to very high premiums (over £15,000) — a typical pattern in insurance because only a small proportion of policies/high-risk individuals warrant premiums that extreme. After applying a log transformation, the distribution becomes much more symmetric and bell-shaped, simultaneously reducing skewness and compressing the range of values. While random forests are generally robust to non-normal targets, highly skewed data can still lead to suboptimal splits, where a disproportionate number of splits focus on rare, extreme values rather than the dense middle range where most premiums lie. Therefore, by transforming the target to a more balanced distribution, the model can partition the data more evenly, allowing it to better capture variations across the full range of typical premiums without being overly influenced by a small number of outliers. Furthermore, log-transforming premiums makes sense conceptually because insurance pricing often operates multiplicatively — risk factors tend to increase premiums by a percentage (loading) rather than by a fixed amount — and the log scale naturally captures this proportional relationship. For these reasons, log-transformed premiums will be used as the target variable during model training, ensuring that the random forest can learn the structure of the data more accurately and produce more stable, interpretable predictions.
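
Both arguments for the log transformation can be illustrated with synthetic numbers. The sketch below draws made-up premiums from a log-normal distribution (an assumption chosen only because it is right-skewed, not a claim about the scraped data), compares skewness before and after the transformation, and shows that multiplicative loadings become additive on the log scale. The loading values are hypothetical.

```python
import math
import random

def skewness(xs):
    """Sample skewness: third central moment divided by variance to the power 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
premiums = [rng.lognormvariate(4.0, 1.0) for _ in range(5000)]  # synthetic, right-skewed
log_premiums = [math.log(p) for p in premiums]

skew_raw = skewness(premiums)
skew_log = skewness(log_premiums)
print(round(skew_raw, 2), round(skew_log, 2))

# Multiplicative loadings turn into additive terms after taking logs.
base_premium, smoker_loading, age_loading = 20.0, 2.0, 1.5  # hypothetical values
loaded = base_premium * smoker_loading * age_loading
additive = math.log(base_premium) + math.log(smoker_loading) + math.log(age_loading)
print(math.isclose(math.log(loaded), additive))  # → True
```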

<div style="text-align: left;">
    <h2 style="font-size: 2.25em; margin-bottom: 10px;">Analysis</h2>
    <hr style="width: 98%; margin: 0 auto; border: 0.5px solid #ccc; border-top: 0.0005px solid #333;">
</div>

In [5]:
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig2-corrhmap.png" style="height:450px;"/>
</div>
""")

In the correlation matrix above, we can first observe that there is very little correlation between the explanatory variables, which makes sense since the input features were systematically varied across a wide range of combinations, resulting in minimal dependency between the variables. There was, however, one exception: Is_UK appears to be moderately correlated with coverage amounts (moderately, at least, in the context of insurance modelling, where hundreds of factors come into play), perhaps suggesting that policies originating from the UK tend towards higher cover amounts than those from the US. This would be surprising, seeing as the US's astronomical healthcare costs would be expected to inflate the amount of cover needed drastically: not only can families be left with huge unpaid medical bills after a death (which must be settled from the estate), but health insurance is often tied to employment, meaning family members may also lose their health coverage, all of which would certainly be taken into account by the insured and the insurer. The more likely explanation becomes evident when looking at the data: unlike the UK quotes dataset, the US one had much less data at the upper end of cover ($2,000,000 and $5,000,000), mostly because these tended to be policies requiring a medical exam (which makes sense, as insurers need to justify the larger risk and guard against adverse selection by unhealthy applicants taking out very large covers), and such policies were explicitly excluded from our scraping.

This bodes well for our model since, although correlated features are far less problematic for random forests than for other models, minimizing such correlation still offers meaningful benefits. The independence of features enhances interpretability, making it easier to understand the individual contribution of variables without needing to account for complex relationships. Additionally, with little correlation between inputs, the risk of redundant features is minimized, since highly correlated inputs can distort the splits or introduce overlapping signals. As a result, each tree can focus more cleanly on the relevant features, improving both prediction accuracy and the clarity of the model’s decision process.
 
Secondly, we can observe that the variables most strongly correlated with premiums are Age and ln(Coverage_Amount), with correlation coefficients of 0.75 and 0.57 respectively, while Is_Smoker also shows a notable correlation of 0.23. This is not surprising, as these variables tend to have more direct, monotonic relationships with premiums: typically, as age or coverage amount increases, so does the premium, which aligns with general insurance pricing logic. Older individuals tend to pose higher risk, and higher coverage naturally incurs higher costs. Similarly, smoking is a well-known risk factor, hence its positive association with higher premiums.

In contrast, variables like Term_Length, Is_Male, and Is_UK show little to no linear correlation with premiums. However, this lack of correlation does not necessarily imply a lack of predictive value. These features may influence premiums in nonlinear or threshold-based ways that a simple correlation metric cannot detect. For example, Term_Length might exhibit a threshold effect, where premiums are relatively flat for short- to medium-term policies but increase noticeably once the term exceeds, say, 25 years, due to the insurer’s extended exposure to risk. Similarly, Is_Male might show a plateau effect: gender may influence premiums only under specific conditions, such as within certain age bands or policy types, making the overall linear correlation appear negligible even if the effect is meaningful in practice. Is_UK could follow a piecewise relationship, where the policy being written in the UK has minimal impact on premiums at lower coverage levels or younger ages but results in a significant pricing adjustment at higher tiers due to regional underwriting policies or regulatory factors.

In short, while correlation is a helpful first step and has provided a general idea of what to expect, it is only a partial lens. The true strength of models like random forests lies in their ability to detect subtle nonlinear patterns — such as thresholds, plateaus, or piecewise effects — that often drive complex outcomes like insurance premiums, even when linear relationships appear weak or nonexistent.
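
A small synthetic example shows how a feature can have a near-zero linear correlation with the target and still matter. Here the made-up effect of `is_uk` is positive at high cover and negative at low cover, so the two cancel in the overall correlation; the coefficients are invented for illustration and are not estimates from the project's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def toy_ln_premium(is_uk, high_cover):
    """Made-up pricing rule: the UK effect flips sign with the coverage tier."""
    uk_effect = 0.3 if high_cover else -0.3
    return 3.0 + 1.0 * high_cover + uk_effect * is_uk

# Balanced synthetic design: 25 rows for each (is_uk, high_cover) combination.
rows = [(uk, high) for uk in (0, 1) for high in (0, 1) for _ in range(25)]
is_uk = [uk for uk, _ in rows]
ln_premium = [toy_ln_premium(uk, high) for uk, high in rows]

r = pearson(is_uk, ln_premium)
gap_high = toy_ln_premium(1, 1) - toy_ln_premium(0, 1)  # UK minus US at high cover
gap_low = toy_ln_premium(1, 0) - toy_ln_premium(0, 0)   # UK minus US at low cover
print(round(abs(r), 6), round(gap_high, 2), round(gap_low, 2))
```

The correlation is essentially zero even though the country changes the log premium by 0.3 in each coverage tier, which is exactly the kind of conditional pattern a tree-based model can pick up by splitting on coverage first.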

In [6]:
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig3-avgdem.png" style="width:50vw;"/>
</div>
""")

In [7]:
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig4-PDPs.png" style="height:500px;"/>
</div>
""")

In [8]:
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig5-ICEs.png" style="width:75vw"/>
</div>
""")

In [9]:
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig6-SHAPimp.png" style="height:500px"/>
</div>
""")

Comparing with our correlations

In [10]:
HTML("""
<div style="text-align:center;">
    <img src="./notebooks_dev/Figures/Fig7-SHAPbee.png" style="height:500px;"/>
</div>
""")

In [25]:
HTML("""
<div style="display: flex; justify-content: center; margin-bottom: 20px;">
    <img src="./notebooks_dev/Figures/Fig8-SHAPwater.png" style="width: 45vw;" />
</div>

<div style="display: flex; justify-content: center;">
    <img src="./notebooks_dev/Figures/Fig9-SHAPwater.png" style="width: 45vw; margin: 10px;" />
    <img src="./notebooks_dev/Figures/Fig10-SHAPwater.png" style="width: 45vw; margin: 10px;" />
</div>
""")


<div style="text-align: left;">
    <h2 style="font-size: 2.25em; margin-bottom: 10px;">Conclusion</h2>
    <hr style="width: 98%; margin: 0 auto; border: 0.5px solid #ccc; border-top: 0.0005px solid #333;">
</div>

<div style="text-align: left;">
    <h2 style="font-size: 2.25em; margin-bottom: 10px;">References</h2>
    <hr style="width: 98%; margin: 0 auto; border: 0.5px solid #ccc; border-top: 0.0005px solid #333;">
</div>

**Project Repository Link: https://github.com/freitas-andrew/empirical_project.git**

FCA (2021). ICOBS 6.1.1 - Suitability. FCA Handbook.<br>Retrieved from https://www.handbook.fca.org.uk/handbook/ICOBS/6/1.html

Association of British Insurers. (2019). Consumer attitudes towards data and insurance.<br>Retrieved from https://www.abi.org.uk/globalassets/files/publications/public/data/britain_thinks_consumer_data_insurance_report.pdf

XE.com. (n.d.). XE Currency Converter. Retrieved April 21st, 2025,<br>from <https://www.xe.com/currencyconverter/convert/?Amount=1&From=USD&To=GBP>