> [Welcome to Colab](https://colab.research.google.com/), *a hosted `python` notebook environment, where code & documentation live side-by-side.*
>
> **Note:** if your notebook sections are expanded, please go to *`View > Collapse sections`* or press `Ctrl/⌘ + [`
---

<img src="https://upload.wikimedia.org/wikipedia/de/6/62/International_Actuarial_Association_Logo.svg" width="210" height="210">

# **The Actuarial Data Science Toolkit** 
### *A Practical Introduction*

# **0. Table of contents** 
---

↖️ `click  task bar`

# **1. Introduction**

---



In recent years the rise of data-intensive methods has seen terms like *Big Data*, *Data Science* and *Artificial Intelligence* (AI) enter the public lexicon. The use of such methods in data-rich, mostly internet-based businesses has produced tools that can be applied far more broadly, and data science tools and methods are now used across many domains. Actuaries have deep expertise with data, yet we have been slow to adopt newer data science tools.

>   ℹ️ *An actuarial view of Data Science may be found [here](https://actuaries.org.uk/learn/lifelong-learning/data-science-an-actuarial-viewpoint/) and an overview of AI in Actuarial Science may be found [here](https://www.actuarialsociety.org.za/wp-content/uploads/2018/10/2018-Richman-FIN.pdf).*

This practical tutorial is designed to *introduce* data science tools that may prove useful in actuarial workflows. No prior knowledge is assumed and concepts are outlined with provided **reusable code**. Links to more thorough learning materials for individual topics are also provided. The hope is that it will spark curiosity in members and that they will use such tools more regularly in their day-to-day work.

>   ℹ️ *Typical actuarial work needs better data maturity, along with practical upskilling to take full advantage of advanced AI. As a profession of thought leaders we should help shape our industries' futures to take advantage of these rapidly evolving technologies.*
 

# **2. From Excel to `python`**  🐍

---


#### **2.0 Section learning objectives**

>The purpose of this section is to introduce a member, who is proficient in Excel, to basic `python` tools. This is intended for an absolute beginner and  afterwards the member should feel comfortable using `python` for common tasks they would normally perform in Excel.
>
>   * Basic system operation
>       * Introduce notebooks
>       * `wget` data 
>       * `os` functionality
>       * `pip install`
>       * `import` packages
>
>   * `pandas`
>       * Link to learning resources
>       * Note on `v2.0.0` 
>       * Summarise data
>       * Read `.xlsx` data into df
>       * Rename columns
>       * Drop `na`
>       * Query with wildcards
>       * Filter 
>       * Conditional formatting
>
>   * Mention `openpyxl` & `xlwings`

#### 2.1 **We can't ignore Excel, for now...**

There is no denying that the majority of actuarial work is done in Excel (for now!). As such, learning to integrate Excel as a tool in your data science workflow is essential for practical benefit.


#### 2.2 **Why `python`?**

In the data science community there has been fierce debate over whether `python` or `R` is the better data science language. While both languages are capable, we have opted for `python` as it has become the de facto language for modern AI tools. It is also worth mentioning that `julia`, a newer language, is growing and there is already an [actuarial `julia` community](https://juliaactuary.org/).

A major advantage of Python is the popularity of notebooks, like this document. A notebook is a "living document" with live code and documentation displayed side-by-side. We have opted to use Colab as it provides free GPU access, needed for modern AI. We will need this GPU below!  

#### 2.3 **Getting data from an external source**
GitHub is a platform used to store code. In making this tutorial we created the Actuarial Society's GitHub account.

> ℹ️ *GitHub is where open-source code is stored. It is a great place to collaborate and there are lots of free-to-use data science projects (repositories) to explore. If you don't have an account you can get one [here](https://github.com/).*

In [None]:
# clone the repository from the actuarialsociety's github 
# bash script (!)
!git clone https://github.com/actuarialsociety/dstoolkit.git

Let's check that our files are there

In [None]:
# Import os package to enable us to work with files
import os
print(f"Getting our current working directory: {os.getcwd()} \n")
print(f"Listing what is in our directory \n {os.listdir()} \n") 

We can see the repository we just cloned `dstoolkit` is listed. 

In [None]:
# Examining our repo
os.listdir("./dstoolkit")

 ↖️ We could have also just used the task bar.

#### 2.4 **How to use `python` to work with Excel**

In order to extend the functionality of basic `python` we need to import **packages** into our notebook environment, in which all code is executed.

The `python` packages `numpy` and `pandas` are among the most widely used packages in any programming language. `numpy` provides broad mathematical functionality and `pandas` allows for data manipulation from multiple sources, including spreadsheets like Excel.

Typically packages are installed into our `python` environment using the `pip` command. However, as we see below, these packages are so popular that they are already installed in the Colab notebook environment.

In [None]:
# notebook magic command (%)
# we use -q (quiet) to suppress the output
# `scikit-learn` is the package name on PyPI (it is imported as `sklearn`)
%pip install numpy pandas Jinja2 openpyxl matplotlib seaborn scikit-learn -q
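
Before installing anything, it can help to check what is already present in the environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is our own and not part of any package.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report on some of the packages this tutorial relies on
for name in ["numpy", "pandas", "openpyxl"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a package shows as not installed, `%pip install <package>` in a notebook cell will add it.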

> ℹ️ `pandas` has very good documentation and a short introduction can be found [here](https://pandas.pydata.org/docs/user_guide/10min.html).

`pandas` has a very capable I/O (input/output) API and can read data from a [variety of sources](https://pandas.pydata.org/docs/user_guide/io.html), including common cloud platforms.

> ℹ️ *`pandas` version 2 (`pandas == 2.0`) was recently released. You can find the release notes [here](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#), as well as a short video on what is new [here](https://www.youtube.com/watch?v=cSLPyRI_ZD8).*

In [None]:
# import necessary packages
import numpy as np
import pandas as pd

# path to data we just downloaded
# if you are interested in how modelpoints were created,
# check out the `generating_modelpoints.ipynb` file in the repo
DATA_XLS = "./dstoolkit/model_point.xlsx"

# ingesting our data into a dataframe (df)
df = pd.read_excel(DATA_XLS)


# now printing out some information about our data
print(f"\n Getting some basic info about our data... \n")
print(f"* size: {df.size} \n")
print(f"* shape:{df.shape} \n")

# now displaying some information
# `display` is like `print` but renders pd df's better
# (`df.info()` prints its own summary and returns None, so we call it directly)
df.info()

print("\n Getting an idea about the ranges of each column... \n")
display(df.describe())

display(df.head(3))

We can already see from our `df.info()` command that there are `null` values that we probably want to drop. In reality it is probably worth checking whether the data could be better sourced to have fewer `null` values.

In [None]:
print(f"\n We examine the number of null values in our data before dropping them... \n")
print(f"\n Number of null items: \n {df.isnull( ).sum()}")
print(f"\n Shape before dropping nulls: \n {df.shape}")
df = df.dropna()
print(f"\n Shape after dropping nulls: \n {df.shape}")

print(f"\n We probably also want to trim spaces in our column names, convert to lower case and replace spaces with _ to make it easier to work with\n")
new_columns = [str(col).strip().lower().replace(" ","_") for col in list(df.columns)]
print(new_columns)

# creating dict to rename df (old name -> new name)
rename_dict = dict(zip(df.columns, new_columns))

# renaming df
df = df.rename(columns=rename_dict)

# converting df.sex to a Categorical variable
df['sex'] = pd.Categorical(df['sex'])

# converting df.issue_date to a datetime
df['issue_date'] = pd.to_datetime(df['issue_date'], format='%d/%m/%Y')
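
To see what these cleaning steps do in isolation, here is a minimal sketch on a made-up three-row table. The column names and values are invented for illustration and are not taken from `model_point.xlsx`.

```python
import numpy as np
import pandas as pd

# Toy data with untidy headers and one missing value (all values invented)
toy = pd.DataFrame(
    {
        " Age At Entry": [34, 51, 62],
        "Sum Assured ": [250000.0, np.nan, 400000.0],
        "Issue Date": ["01/02/2015", "15/07/2018", "30/11/2020"],
    }
)

# Drop rows containing any missing value
toy = toy.dropna()

# Tidy the headers: trim, lower-case, replace spaces with underscores
toy.columns = [str(col).strip().lower().replace(" ", "_") for col in toy.columns]

# Parse day/month/year strings into proper datetimes
toy["issue_date"] = pd.to_datetime(toy["issue_date"], format="%d/%m/%Y")

print(toy.dtypes)
print(toy)
```

The row with the missing sum assured is removed, and the tidy headers can now be used directly in expressions such as `toy.age_at_entry`.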



In [None]:
# Importantly, the tidy column names make columns easier to reference in queries
# Setting query wildcards
query_age = 55
query_sa = 100000

# Querying with wildcards (`@` refers to a python variable)
(df
 .query('age_at_entry > @query_age and sum_assured > @query_sa')
 .head(3))
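
For readers coming from Excel filters, the sketch below shows on invented data that a `query` string with `@` variables selects the same rows as an explicit boolean mask. The table `policies` is hypothetical.

```python
import pandas as pd

# Invented policies, for illustration only
policies = pd.DataFrame(
    {
        "age_at_entry": [45, 58, 61, 70],
        "sum_assured": [150000, 90000, 250000, 120000],
    }
)

query_age = 55
query_sa = 100000

# `@name` pulls the value of a python variable into the query string
via_query = policies.query("age_at_entry > @query_age and sum_assured > @query_sa")

# The same filter written as a boolean mask
mask = (policies["age_at_entry"] > query_age) & (policies["sum_assured"] > query_sa)
via_mask = policies[mask]

print(via_query)
print(via_query.equals(via_mask))
```

Both approaches are valid; `query` tends to read more like a sentence, while masks are easier to build up programmatically.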

In [None]:
# We may wish to group by certain columns and perform aggregations
(df
 .groupby(['sex','policy_term'])
 .sum_assured
 .agg(['sum', 'mean','count']))
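
A grouped aggregation plays much the same role as an Excel PivotTable. As a minimal sketch on invented data, named aggregation lets us choose the output column names ourselves (`total`, `average` and `n_policies` are our own labels):

```python
import pandas as pd

# Invented policies, for illustration only
policies = pd.DataFrame(
    {
        "sex": ["F", "M", "F", "M", "F"],
        "policy_term": [10, 10, 20, 10, 10],
        "sum_assured": [100000, 200000, 150000, 300000, 50000],
    }
)

summary = (
    policies
    .groupby(["sex", "policy_term"])["sum_assured"]
    .agg(total="sum", average="mean", n_policies="count")
    .reset_index()  # turn the group keys back into ordinary columns
)
print(summary)
```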

In [None]:
# We may also wish to perform conditional formatting on our dataframe

(df
 .sort_values('age_at_entry')
 .head(5)[['age_at_entry', 'premium']]
 .style
 .background_gradient("Blues"))

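Pulling in an external source of underwriter outcomes, we left join it onto our model points using `uuid` and apply a 10% premium discount where one was granted.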

In [None]:

# path as given; adjust it if the file sits elsewhere in your environment
underwriter_outcomes = pd.read_csv('underwriter_outcomes.csv', index_col=0)
# left joining our model points (df) and underwriter_outcomes on uuid
model_point_table_discount = df.merge(underwriter_outcomes, on='uuid', how='left')
# applying a 10% discount where the underwriter granted one
model_point_table_discount['final_premium'] = np.where(
    model_point_table_discount['discount'] == True,
    model_point_table_discount['premium'] * 0.9,
    model_point_table_discount['premium'],
)

model_point_table_discount
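
A left join behaves much like a `VLOOKUP` that keeps every row of the left-hand table. The sketch below, on invented data, uses `indicator=True` to show which policies found a match and treats unmatched policies as having no discount. All names and values here are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented policies and underwriter outcomes, for illustration only
policies = pd.DataFrame({"uuid": ["a1", "b2", "c3"], "premium": [100.0, 200.0, 300.0]})
outcomes = pd.DataFrame({"uuid": ["a1", "c3"], "discount": [True, False]})

# how="left" keeps every policy; indicator=True records whether a match was found
joined = policies.merge(outcomes, on="uuid", how="left", indicator=True)

# Unmatched policies have a missing `discount`; comparing with True treats them as no discount
has_discount = joined["discount"].eq(True)
joined["final_premium"] = np.where(has_discount, joined["premium"] * 0.9, joined["premium"])

print(joined)
```

The `_merge` column added by `indicator=True` is a quick way to spot policies that have no underwriter outcome.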

> * ℹ️ *[`openpyxl`](https://openpyxl.readthedocs.io/en/stable/) allows one to read named ranges from Excel*
> * ℹ️ *[`xlwings`](https://www.xlwings.org/) is a very good package for replacing VBA with python*

# **3. Make it visual 👀** 
---


#### **3.0 Section learning objectives** 

>The purpose of this section is to build upon the previous section; having ingested data, simple visualisation techniques are explored
>  * Slice list
>  * Basic plotting by looping with `matplotlib.pyplot` & `seaborn`
>  * Reduce distinct x-axis values for better visualisation
>  * Write & use a basic python function for repeated code
>  * Docstrings from own function and from packages
  
After looking at all available columns, we decide we wish to drop `'model_point'` & `'uuid'`, and thus only focus on columns with index 2 and above.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#only focusing on columns with index 2 and above.
cols_to_plot = list(df.columns)[2:]
print(cols_to_plot)
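
If list slicing is new to you, the short sketch below shows the idea on an illustrative list of column names (the order shown is an assumption, not the actual order in our data).

```python
# Illustrative column names; positions start at 0
columns = ["model_point", "uuid", "age_at_entry", "sex", "premium"]

# [2:] keeps everything from position 2 onwards
cols_to_plot = columns[2:]
print(cols_to_plot)

# Other common slices
first_two = columns[:2]  # the first two items
last_item = columns[-1]  # the last item
print(first_two, last_item)
```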

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set a Seaborn style="white", "dark", "whitegrid", "darkgrid", "ticks"
sns.set(style="darkgrid", font_scale=1.1)

# Define subplot grid (rounding up so an odd number of columns still gets a panel)
n_rows = (len(cols_to_plot) + 1) // 2
fig, axs = plt.subplots(nrows=n_rows, ncols=2, figsize=(15, 12))

# Set figure title
fig.suptitle("Examining df... take 1", fontsize=18, y=0.95)

# Loop through cols_to_plot
for col, ax in zip(cols_to_plot, axs.ravel()):
    # Getting an idea of the distribution by counting values
    # (naming the columns explicitly gives the same result in pandas 1.x and 2.x)
    counts = df[col].value_counts().rename_axis("value").reset_index(name="count")
    
    # Use Seaborn barplot with custom colors
    sns.barplot(data=counts, x="value", y="count", ax=ax, palette='viridis')

    # Chart formatting
    # Chart title
    ax.set_title(col)
    # Chart y label
    ax.set_ylabel("count")
    # Chart x label
    ax.set_xlabel("value")

plt.tight_layout()
plt.show()


We can see that for `issue_date, sum_assured, premium` there are too many distinct values to count, so the charts are hard to read.

We will create new variables that aggregate these columns for fewer values and replace those columns in our `cols_to_plot` list. 


In [None]:
df['issue_date_year'] = df['issue_date'].dt.to_period('Y')

print(f"For `issue_date_year`, we previously had {df['issue_date'].nunique() } unique values; now we have {df['issue_date_year'].nunique() } ")
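
As a minimal sketch of what `to_period('Y')` does, the invented dates below collapse from three distinct days to two distinct years.

```python
import pandas as pd

# Invented issue dates in day/month/year format
issue_dates = pd.to_datetime(
    pd.Series(["01/02/2015", "15/07/2015", "30/11/2020"]), format="%d/%m/%Y"
)

# Keep only the year part of each date
issue_years = issue_dates.dt.to_period("Y")

print(issue_years.tolist())
print(issue_dates.nunique(), "->", issue_years.nunique())
```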

In [None]:
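# using list comprehension to create 10 equal-width bins
# starting at 0 so that values below 10% of the maximum are not left unbinned (NaN)
bins = [(df['sum_assured'].max())*(i/10) for i in range(0,11)]
df['sum_assured_binned'] = pd.cut(df['sum_assured'], bins, include_lowest=True)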

print(f"For `sum_assured`, we previously had {df['sum_assured'].nunique() } unique values; now we have {df['sum_assured_binned'].nunique() } ")

In [None]:
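# using list comprehension to create 10 equal-width bins
# starting at 0 so that values below 10% of the maximum are not left unbinned (NaN)
bins = [(df['premium'].max())*(i/10) for i in range(0,11)]
df['premium_binned'] = pd.cut(df['premium'], bins, include_lowest=True)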

print(f"For `premium`, we previously had {df['premium'].nunique() } unique values; now we have {df['premium_binned'].nunique() } ")
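
One thing to watch with `pd.cut` is that any value falling outside the bin edges becomes `NaN`. The sketch below, on invented premiums, compares edges that start at 0 with edges that start at 10% of the maximum.

```python
import pandas as pd

# Invented premiums, for illustration only
premiums = pd.Series([50, 120, 480, 990, 1000])

# Eleven edges from 0 up to the maximum give ten equal-width bins
edges = [premiums.max() * (i / 10) for i in range(0, 11)]
binned = pd.cut(premiums, edges)

# Dropping the first edge leaves the smallest premium outside every bin
binned_without_zero = pd.cut(premiums, edges[1:])

print(binned.value_counts().sort_index())
print("unbinned with edges from 0:", binned.isna().sum())
print("unbinned with edges from 10% of max:", binned_without_zero.isna().sum())
```

Checking `.isna().sum()` after binning is a cheap way to confirm that no policies have silently dropped out of the chart.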

In [None]:
# we make a dictionary to change items in our cols_to_plot list
rename_dict = dict()

for i in cols_to_plot:
  rename_dict.update({i:i})
rename_dict["issue_date"] = "issue_date_year"
rename_dict["sum_assured"] = "sum_assured_binned"
rename_dict["premium"] = "premium_binned"

# now we map rename_dict to cols_to_plot 
cols_to_plot = [*map(rename_dict.get, cols_to_plot)]

if 'policy_count' in cols_to_plot: cols_to_plot.remove('policy_count')

cols_to_plot
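
The same substitution can be written more compactly with `dict.get`, which falls back to a default when a key is missing. A minimal sketch, using a shortened illustrative list:

```python
cols = ["sex", "issue_date", "premium"]
replacements = {"issue_date": "issue_date_year", "premium": "premium_binned"}

# get(key, default) returns the replacement if there is one, else the original name
cols = [replacements.get(col, col) for col in cols]
print(cols)
```

This avoids having to pre-fill the dictionary with every column name mapped to itself.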

Now that we have updated `cols_to_plot` we will turn our plot commands into a function. Note the description (known as a docstring), which outlines the inputs and outputs. This will appear if you run `help(function_name)`.

Notebooks are messy and often used for ad-hoc investigations. In a production environment we would want to turn all our commands into neat functions. 

In [None]:
from typing import List

def plot_distribution(df: pd.DataFrame, cols_to_plot: List[str], style: str = 'whitegrid', palette: str = 'viridis'):
    """
    Plot the distribution of the specified columns in a DataFrame using Seaborn's barplot.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the columns to plot
    cols_to_plot (list): A list of column names to plot
    style (str): The Seaborn style to use for the plot. Options: "darkgrid", "whitegrid", "dark", "white", "ticks"
    palette (str): The color palette to use for the plot. Options: "deep", "muted", "pastel", "bright", "dark", "colorblind", "viridis", "inferno", "plasma", "magma", "cividis"

    Returns:
    None
    """

    # Set the Seaborn style passed in by the caller
    sns.set(style=style, font_scale=0.7)

    # Define subplot grid (rounding up so an odd number of columns still gets a panel)
    n_rows = (len(cols_to_plot) + 1) // 2
    fig, axs = plt.subplots(nrows=n_rows, ncols=2, figsize=(15, 12))

    # Set figure title
    fig.suptitle("Examining data distribution", fontsize=18, y=1)

    # Loop through cols_to_plot
    for col, ax in zip(cols_to_plot, axs.ravel()):
        # Getting an idea of the distribution by counting values
        # (naming the columns explicitly gives the same result in pandas 1.x and 2.x)
        counts = df[col].value_counts().rename_axis("value").reset_index(name="count")
        
        # Use Seaborn barplot with the palette passed in by the caller
        sns.barplot(data=counts, x="value", y="count", ax=ax, palette=palette)
        ax.tick_params(axis="x", rotation=45)

        # Chart formatting
        # Chart title
        ax.set_title(col)
        # Chart y label
        ax.set_ylabel("count", fontsize = 10)
        # Chart x label
        ax.set_xlabel("value", fontsize = 10)

    plt.tight_layout()
    plt.show()


Examining our doc string

In [None]:
help(plot_distribution)

Importantly, `help` also provides in-context documentation for any function imported from a package.

In [None]:
help(pd.Categorical)
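
To make the link between a docstring and `help` concrete, here is a minimal sketch with a small hypothetical function (`annual_to_monthly` is invented for this example). The docstring is simply the first string inside the function, and it can also be read programmatically.

```python
import inspect


def annual_to_monthly(premium: float) -> float:
    """Convert an annual premium to a monthly premium by dividing by 12."""
    return premium / 12


# The docstring is stored on the function itself
print(annual_to_monthly.__doc__)

# inspect.getdoc returns the same text with indentation cleaned up
print(inspect.getdoc(annual_to_monthly))
```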

Now running our function

In [None]:
# Executing our function
plot_distribution(df, cols_to_plot)

> * ℹ️ *[`pygwalker`](https://github.com/Kanaries/pygwalker) turns your `pandas` dataframe into a Tableau-style User Interface for visual exploration.*

# **4. A taste of cutting edge AI**  🦾🤖
---


#### 4.0 **Section learning objectives** 
>This section brings some basic NLP (natural language processing) from `huggingface` into our workflow.
>  * Brief introduction to `huggingface`
>  * Live demo of Whisper
>  * Sentiment analysis of free-text

Within AI, language models have improved dramatically in the last few years. This is due to a sequence-based model called the [Transformer](https://www.youtube.com/watch?v=SZorAJ4I-sA). If you have heard of *ChatGPT*, the "T" in "GPT" is for Transformer!

[`huggingface`](https://huggingface.co/) 🤗 rose to popularity through their `transformers` package. Since then they have built a cutting-edge community, with lots of open-source models that are shared and freely available. [Huggingface spaces](https://huggingface.co/?recent=update-space) allows one to explore and run these models in a browser. Importantly, there are lots of [educational materials](https://huggingface.co/course/) that one can use to upskill.



In [None]:
# notebook magic command (%)
# installing packages
%pip install torch torchaudio transformers ipywebrtc librosa ipywidgets -q

We will use OpenAI's [Whisper](https://huggingface.co/openai/whisper-base.en), a highly capable automatic speech recognition model.

We will use a huggingface [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines), an easy way to perform inference (which may be thought of as prediction).

In [None]:
# importing necessary packages
import librosa # for audio processing
import torch # for AI
import torchaudio # for AI audio processing
from transformers import WhisperTokenizer, pipeline # tokenizer class and the inference pipeline helper
from ipywebrtc import AudioRecorder, CameraStream # for recording audio

from IPython.display import Audio # for playing audio

# as mentioned, modern AI requires a Graphic Processing Unit (GPU)
# we set the device we will run on based on whether a GPU is available
# cuda is a software layer, allowing us to control the GPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
  "automatic-speech-recognition",
  model="openai/whisper-base.en",
  chunk_length_s=1,
  device=device,
)

In [None]:
# enable third-party widgets in Colab before creating the recorder
from google.colab import output
output.enable_custom_widget_manager()

camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder

In [None]:
# record phrase
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav recording.wav -y -hide_banner -loglevel panic
sig, sr = torchaudio.load("recording.wav")
print(sig.shape)

In [None]:
#load audio file we just recorded
pipe('recording.wav')

Now let's say we wanted to perform a task other than "automatic-speech-recognition".

In [None]:
pipe_sentiment = pipeline("text-classification")
pipe_sentiment(["This restaurant is awesome", "This restaurant is awful"])
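
A text-classification pipeline returns one dictionary per input, with a `label` and a `score`. The sketch below shows one way such output could be used to flag negative free-text comments. The results are written out by hand rather than produced by a model, so the scores are made up and the label names are an assumption about the default model.

```python
comments = ["This restaurant is awesome", "This restaurant is awful"]

# Hand-written stand-in for pipeline output (scores are invented)
results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
]

# Pair each comment with its result and keep only confident negatives
flagged = [
    text
    for text, res in zip(comments, results)
    if res["label"] == "NEGATIVE" and res["score"] >= 0.9
]
print(flagged)
```

In practice `results` would come from `pipe_sentiment(comments)`, and the 0.9 threshold is a judgment call to revisit for your own data.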

What about GPT, can we access that?

In [None]:
generator = pipeline('text-generation', model='openai-gpt')
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

It's literally *that* easy to integrate high-quality open-source models into your workflows.

# **5. Resources**  📓
---


* VSCode is a popular Integrated Development Environment (IDE). It makes working with code **much** easier. In addition, their [training material](https://code.visualstudio.com/docs/introvideos/basics) is a great place to learn practical development skills.

* [Kaggle](https://www.kaggle.com/) is an online community of data scientists and machine learning practitioners. They provide free access to GPUs and their forums are particularly good for finding solutions to data science code related problems.

* [Swiss Association of Actuaries Data Science initiative](https://www.actuarialdatascience.org/) seems to be the most mature data science effort by any actuarial society.

* [A curated list of free and open source actuarial software](https://github.com/genedan/actuarial-foss)

* [fast.ai](https://www.fast.ai/), a course and python `package` for accessible deep learning.

* [Distill](https://distill.pub/), interactive articles about machine learning & AI.









![Alt Text](https://media.tenor.com/oFO9mCbbj98AAAAC/rocket-lift-off.gif)