> [Welcome to Colab](https://colab.research.google.com/), *a hosted `python` notebook environment, where code & documentation live side-by-side.*

> *Please go to `View > Collapse sections`* 

<img src="https://www.actuarialsociety.org.za/wp-content/uploads/2022/10/ASSA-LOGO-20224.png"  width="420" height="90">

# **The Actuarial Data Science Toolkit: A Practical Guide** 


# **0. Table of contents** 
---

↖️ `click task bar`

# **1. Introduction**

---



In recent years the rise of data-intensive methods has seen terms like *Big Data*, *Data Science* and *Artificial Intelligence* (AI) enter the public lexicon. The use of such methods in data-rich, mostly internet-based businesses has resulted in tools that may be applied more broadly, and indeed data science tools and methods are now used across many domains. Actuaries have deep expertise with data, yet we have been slow to adopt newer data science tools.

> ℹ️ *An actuarial view of Data Science may be found [here](https://actuaries.org.uk/learn/lifelong-learning/data-science-an-actuarial-viewpoint/) and an overview of AI in Actuarial Science may be found [here](https://www.actuarialsociety.org.za/wp-content/uploads/2018/10/2018-Richman-FIN.pdf).*

This practical tutorial is designed to *introduce* data science tools that may prove useful in actuarial workflows. No prior knowledge is assumed and concepts are outlined with **reusable code snippets**. Links to more thorough learning materials for individual topics are also provided. The hope is that it will spark curiosity in members and that they will use such tools more regularly in their day-to-day work.

> ℹ️ *Typical actuarial work needs better data maturity, along with practical upskilling to take full advantage of advanced AI. As a profession of thought leaders we should help shape our industries' futures to take advantage of these rapidly evolving technologies.*
 

# **2. From Excel to `python`**  🐍

---


#### **2.0 Section learning objectives** 

>The purpose of this section is to introduce a member, who is proficient in Excel, to basic `python` tools. This is intended for an absolute beginner and  afterwards the member should feel comfortable using `python` for common tasks they would normally perform in Excel. 
  * Introduce notebooks
  * `wget` data 
  * `pip install`
  * `import` packages
  * Link to `pd` learning resources
  * Note on `pd==2.0.0` 
  * `pd` feature: summarise data
  * `pd` feature: read data into df
  * `pd` feature: rename columns
  * `pd` feature: dropna
  * `pd` feature: drop duplicates `TODO`
  * `pd` feature: query
  * `pd` feature: filter 
  * `pd` feature: conditional formatting

#### 2.1 **We can't ignore Excel, for now...**

There is no denying that the majority of actuarial work is done in Excel (for now!). As such, learning to integrate Excel into your data science workflow is essential for practical benefit.


#### 2.2 **Why `python`?**

In the Data Science community there has been fierce debate over whether `python` or `R` is the better data science coding language. While both languages are capable, we have opted for `python` as it has become the de facto language used for modern AI tools. It is also worth mentioning that `julia`, a newer language, is growing and there is already an [actuarial `julia` community](https://juliaactuary.org/).

A major advantage of Python is the popularity of notebooks, like this document. A notebook is a "living document" with live code and documentation displayed side-by-side. We have opted to use Colab as it provides free GPU access, needed for modern AI.  

#### 2.3 **Getting data**
Firstly, download data from the Actuarial Society's GitHub, a platform used to store code.

In [None]:
# get data from actuarialsociety's github
# bash script (!)
# use the `raw` URL: the `blob` URL returns the GitHub web page, not the file itself
!wget https://github.com/actuarialsociety/dstoolkit/raw/main/model_point.xlsx

#### 2.4 **How to use `python` to work with Excel**

In order to extend the functionality of basic `python` we need to import packages into our notebook environment, in which all code is executed. The `python` packages `numpy` and `pandas` are among the most widely used `python` packages. `numpy` provides broad mathematical functionality and `pandas` allows for data manipulation from multiple sources, including spreadsheets like Excel.

Typically packages are installed into our `python` environment using a `pip` command, however as we see below, due to their popularity, these packages are already installed in the Colab notebook environment. 

In [None]:
# bash script (!)
!pip install numpy pandas

> ℹ️ `pandas` has very good documentation and a short introduction can be found [here](https://pandas.pydata.org/docs/user_guide/10min.html).

`pandas` has a very capable I/O (input/output) API and can read data from a [variety of sources](https://pandas.pydata.org/docs/user_guide/io.html), including common cloud platforms.

In the code snippet below, we read the downloaded `.xlsx` spreadsheet into a dataframe and print some summary information about it.

> ℹ️ *`pandas` version 2 (`pandas==2.0`) was recently released. You can find the release notes [here](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#), as well as a short video on what is new [here](https://www.youtube.com/watch?v=cSLPyRI_ZD8).*

In [None]:
# import necessary packages
import numpy as np
import pandas as pd

# path to data we just downloaded
DATA_XLS = "/content/model_point.xlsx"

# ingesting our data into a dataframe (df)
df = pd.read_excel(DATA_XLS)


# now printing out some information about our data
print(f"\n Getting some basic info about our data... \n")
print(f"* size: {df.size} \n")
print(f"* shape:{df.shape} \n")

# now displaying some information
# `display` is like `print` but renders pd df's better
# `df.info()` prints its summary directly and returns `None`, so it is not wrapped in `display`
df.info()

print("\n Getting an idea about the ranges of each column... \n")
display(df.describe())

display(df.head(3))

We can already see from our `df.info()` output that there are `null` values, which we probably want to drop. In practice it is probably worth checking whether the data could be better sourced to have fewer `null` values.

In [None]:
print(f"\n We examine the number of null values in our data before dropping them... \n")
print(f"\n Number of null items: \n {df.isnull( ).sum()}")
print(f"\n Shape before dropping nulls: \n {df.shape}")
df = df.dropna()
print(f"\n Shape after dropping nulls: \n {df.shape}")

print(f"\n We probably also want to trim spaces in our column names, convert to lower case and replace spaces with _ \n")
new_columns = [str(col).strip().lower().replace(" ","_") for col in list(df.columns)]
print(new_columns)

# creating dict to rename df
rename_dict = dict()
for i in range(0,len(list(df.columns))):
  rename_dict.update({list(df.columns)[i]:new_columns[i]})

# renaming df
df = df.rename(columns=rename_dict)

# converting df.sex to a Categorical variable
df.sex = pd.Categorical(df.sex)

# converting df.issue_date to a datetime
df.issue_date =  pd.to_datetime(df.issue_date, format='%d/%m/%Y')
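The learning objectives list dropping duplicates, which the cleaning cell above does not cover. A minimal sketch on a toy dataframe follows; the column names are borrowed from this tutorial but the values are made up, and whether duplicates should be judged on the whole row or on a key such as `uuid` is an assumption to check against your own data.

```python
import pandas as pd

# toy data (made-up values): rows 2 and 3 are identical, row 4 repeats the key only
toy = pd.DataFrame({
    "uuid": ["a1", "b2", "b2", "b2", "c3"],
    "premium": [100.0, 250.0, 250.0, 300.0, 80.0],
})

# drop rows that are identical across all columns
deduped = toy.drop_duplicates()

# drop rows that repeat a key column, keeping the first occurrence
deduped_by_key = toy.drop_duplicates(subset=["uuid"], keep="first")

print(f"original: {toy.shape}, full-row dedupe: {deduped.shape}, key dedupe: {deduped_by_key.shape}")
```

Deduplicating on a key silently discards the later rows, so it is worth inspecting them first (for example with `toy[toy.duplicated(subset=["uuid"], keep=False)]`).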



In [None]:
# Importantly, clean column names make columns easier to reference when running queries
# Setting query parameters (referenced inside the query string with `@`)
query_age = 55
query_sa = 100000

# Querying with wildcards
(df
 .query('age_at_entry > @query_age and sum_assured > @query_sa')\
 .head(3))
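The objectives also mention filtering. As a sketch, the same selection can be written with a boolean mask instead of `query`; the toy values below are made up purely for illustration.

```python
import pandas as pd

# toy data (made-up values) using the tutorial's column names
toy = pd.DataFrame({
    "age_at_entry": [30, 58, 61, 45],
    "sum_assured": [50_000, 150_000, 90_000, 200_000],
})

query_age = 55
query_sa = 100_000

# option 1: query string, with `@` referring to python variables
via_query = toy.query("age_at_entry > @query_age and sum_assured > @query_sa")

# option 2: boolean mask; note `&` and the brackets around each condition
mask = (toy["age_at_entry"] > query_age) & (toy["sum_assured"] > query_sa)
via_mask = toy[mask]

print(via_mask)
print(f"Same result: {via_query.equals(via_mask)}")
```

Both forms should return the same rows; `query` tends to read more like a sentence, while masks are easier to build up programmatically.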

In [None]:
# We may wish to group by certain columns and perform aggregations
(df
.groupby(['sex','policy_term'])\
.sum_assured\
.agg(['sum', 'mean','count']))
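For readers coming from Excel, a grouped aggregation like this is close to a pivot table. The sketch below uses `pivot_table` on made-up values to lay one grouping column out across the columns; it is an illustration, not part of the original workflow.

```python
import pandas as pd

# toy data (made-up values) using the tutorial's column names
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "policy_term": [10, 20, 10, 10, 20],
    "sum_assured": [100, 200, 150, 50, 300],
})

# rows: sex, columns: policy_term, cells: total sum assured
wide = toy.pivot_table(
    index="sex",
    columns="policy_term",
    values="sum_assured",
    aggfunc="sum",
)
print(wide)
```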

In [None]:
# We may also wish to perform conditional formatting on our dataframe

(df.
 sort_values('age_at_entry').head(5)[['age_at_entry', 'premium']]\
.style\
.background_gradient("Blues"))

# **3. Make it visual 👀** 
---



#### **3.0 Section learning objectives**

>The purpose of this section is to build upon the previous section; having ingested data, we now explore simple visualisation techniques.
  * Basic plotting by looping with `matplotlib.pyplot`
  * Reduce distinct x-axis values for better visualisation
  * Write & use a basic python function for repeated code
  * Docstrings from own function and from packages
  
After looking at all available columns, we decide we wish to drop `'model_point'` & `'uuid'`, and thus only focus on columns with index 2 and above.

In [None]:
import matplotlib.pyplot as plt
#only focusing on columns with index 2 and above.
cols_to_plot = list(df.columns)[2:]
cols_to_plot

In [None]:
# define subplot grid
fig, axs = plt.subplots(nrows=int(len(cols_to_plot)/2) + 1, ncols=2, figsize=(15, 12))

# setting figure title
fig.suptitle("Examining df... take 1", fontsize=18, y=0.95)

# loop through cols_to_plot
for col, ax in zip(cols_to_plot, axs.ravel()):
    # getting an idea of the distribution by counting values
    df[col].value_counts().plot(kind='bar',ax=ax)

    # chart formatting
    # chart title
    ax.set_title(col)
    # chart y-axis label
    ax.set_ylabel("count")
    # chart x-axis label
    ax.set_xlabel("value")

plt.show()


We can see that `issue_date`, `sum_assured` and `premium` have too many distinct values to count individually, and thus we struggle to see what is happening.

We will create new variables that aggregate these columns for fewer values and replace those columns in our `cols_to_plot` list. 


In [None]:
df['issue_date_month'] = df['issue_date'].dt.to_period('M')

print(f"For `issue_date_month`, we previously had {df['issue_date'].nunique() } unique values; now we have {df['issue_date_month'].nunique() } ")
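To see what `dt.to_period('M')` does in isolation, here is a small sketch on three made-up dates in the same `%d/%m/%Y` format used earlier.

```python
import pandas as pd

# three made-up issue dates, two of them in the same month
dates = pd.Series(
    pd.to_datetime(["01/02/2020", "15/02/2020", "03/03/2020"], format="%d/%m/%Y")
)

# collapse each date to its calendar month
months = dates.dt.to_period("M")

print(months.tolist())
print(f"unique dates: {dates.nunique()}, unique months: {months.nunique()}")
```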

In [None]:
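# using list comprehension to create bins
# bin edges start at 0 so that values below 10% of the maximum are not left unbinned
bins = [(df['sum_assured'].max())*(i/10) for i in range(0,11)]
df['sum_assured_binned'] = pd.cut(df['sum_assured'], bins, include_lowest=True)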
print(f"For `sum_assured`, we previously had {df['sum_assured'].nunique() } unique values; now we have {df['sum_assured_binned'].nunique() } ")


In [None]:
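# using list comprehension to create bins
# bin edges start at 0 so that values below 10% of the maximum are not left unbinned
bins = [(df['premium'].max())*(i/10) for i in range(0,11)]
df['premium_binned'] = pd.cut(df['premium'], bins, include_lowest=True)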
print(f"For `premium`, we previously had {df['premium'].nunique() } unique values; now we have {df['premium_binned'].nunique() } ")
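The choice of bin edges matters: `pd.cut` leaves any value outside the edges as `NaN`. The sketch below, on made-up amounts, compares edges that start at 10% of the maximum with edges that start at zero.

```python
import numpy as np
import pandas as pd

# made-up amounts, one of them well below 10% of the maximum
values = pd.Series([5, 45, 100])

# edges starting at 10% of the maximum: smaller values fall outside every bin
edges_from_10pct = [values.max() * (i / 10) for i in range(1, 11)]
missed = pd.cut(values, edges_from_10pct)

# edges starting at zero, with the lowest edge included in the first bin
edges_from_zero = np.linspace(0, values.max(), 11)
covered = pd.cut(values, edges_from_zero, include_lowest=True)

print(f"unbinned with edges from 10%: {missed.isna().sum()}")
print(f"unbinned with edges from zero: {covered.isna().sum()}")
```

Passing an integer, as in `pd.cut(values, bins=10)`, is another option; pandas then derives equal-width edges from the data range itself.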

In [None]:
rename_dict = dict()
for i in cols_to_plot:
  rename_dict.update({i:i})
rename_dict["issue_date"] = "issue_date_month"
rename_dict["sum_assured"] = "sum_assured_binned"
rename_dict["premium"] = "premium_binned"

cols_to_plot = [*map(rename_dict.get, cols_to_plot)]
cols_to_plot

Now that we have updated `cols_to_plot` we will turn our plot commands into a function. Note the description (known as a docstring), which outlines the inputs and outputs. This will appear if you call `help(function_name)`.

Notebooks are messy and often used for ad-hoc investigations. In a production environment we would want to turn all our commands into neat functions. 
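As a small illustration of that idea, the column-name tidying from section 2 could be wrapped in a function of its own. The name `clean_column_names` is hypothetical and the toy dataframe is made up; treat this as a sketch rather than a prescribed design.

```python
import pandas as pd


def clean_column_names(data):
    """
    Returns a copy of data with tidied column names
    Args:
      * data: dataframe whose column names may contain spaces or capitals
    Returns:
      * dataframe with stripped, lower-case, underscore-separated column names
    """
    new_columns = [str(col).strip().lower().replace(" ", "_") for col in data.columns]
    # pair each old name with its new name and rename in one step
    return data.rename(columns=dict(zip(data.columns, new_columns)))


# toy dataframe (made-up values) with untidy column names
toy = pd.DataFrame({" Sum Assured ": [100], "Age at Entry": [40]})
print(list(clean_column_names(toy).columns))
```

Because the function returns a new dataframe rather than changing its input, it can be re-run safely and tested on small examples like the one above.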

In [None]:
# NOTE: This must be blank in the tutorial after assert
def plot_data_cols(data, plot_cols):
    """
    Returns plot of plot_cols from data
    Args:
      * data: dataframe with columns names in plot_cols
      * plot_cols: column names in data that will be plotted
    Returns:
      * plot of plot_cols from data
    """
  
    # define subplot grid
    fig, axs = plt.subplots(nrows=int(len(plot_cols)/2) + 1, ncols=2, figsize=(15, 12))

    # setting figure title
    fig.suptitle("Examining df... with our function", fontsize=18, y=0.95)

    # loop through cols_to_plot
    for col, ax in zip(plot_cols, axs.ravel()):
        # getting an idea of the distribution by counting values
        data[col].value_counts().plot(kind='bar',ax=ax)

        # chart formatting
        # chart title
        ax.set_title(col)
        # chart y-axis label
        ax.set_ylabel("count")
        # chart x-axis label
        ax.set_xlabel("value")

    plt.show()

Examining our doc string

In [None]:
help(plot_data_cols)

Importantly, `help` will also provide in-context documentation for any function imported from a package.

In [None]:
help(pd.Categorical)

Now running our function

In [None]:
plot_data_cols(df, cols_to_plot)

# **4. A taste of cutting edge AI**  🦾🤖
---


#### 4.0 **Section learning objectives** 
>This section brings some basic NLP (natural language processing) from `huggingface` into our workflow.
* Brief introduction to `huggingface`
* Live demo of whisper
* Sentiment analysis of free-text
* `pd` feature: merge 
* `pd` feature: pivot 
* `pd` feature: Write Excel output

Within AI, language models have improved dramatically over the last few years. This is due to a sequence-based model called the [Transformer](https://www.youtube.com/watch?v=SZorAJ4I-sA), which is now applied beyond language. If you have heard of *ChatGPT*, the "T" in "GPT" is for "Transformer"!

[`huggingface`](https://huggingface.co/) rose to popularity through their `transformers` package. Since their inception they have built a cutting-edge community, with lots of open-source models that are shared and freely available. [Huggingface spaces](https://huggingface.co/?recent=update-space) allows one to explore and run these models in a browser. Importantly, there are lots of [educational materials](https://huggingface.co/course/) that one can use to upskill.


In [None]:
# bash script (!)
! pip install -q transformers ipywebrtc

In [None]:
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

#load pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h") #TODO replace with OpenAI Whisper
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h") #TODO replace with OpenAI Whisper

In [None]:
from ipywebrtc import AudioRecorder, CameraStream
import torchaudio
from IPython.display import Audio

In [None]:
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
from google.colab import output
output.enable_custom_widget_manager()
recorder

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav file.wav -y -hide_banner -loglevel panic
sig, sr = torchaudio.load("file.wav")
print(sig.shape)
Audio(data=sig, rate=sr)

In [None]:
#load any audio file of your choice
speech, rate = librosa.load("file.wav",sr=16000)

In [None]:
input_values = tokenizer(speech, return_tensors = 'pt').input_values

In [None]:
#Store logits (non-normalized predictions)
# no gradients are needed for inference, so we skip tracking them
with torch.no_grad():
    logits = model(input_values).logits

In [None]:
#Store predicted id's
predicted_ids = torch.argmax(logits, dim =-1)

In [None]:
#decode the audio to generate text
transcription = tokenizer.decode(predicted_ids[0])
print(transcription)
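The `argmax` and `decode` steps above follow what is usually called greedy CTC decoding: pick the most likely token for each audio frame, collapse consecutive repeats, then drop the "blank" token. The sketch below shows that collapsing logic in plain `python`; the vocabulary and frame predictions are made up, and the real tokenizer also handles details such as word separators.

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """
    Returns ids with consecutive repeats collapsed and blanks removed
    Args:
      * ids: per-frame predicted token ids
      * blank_id: id of the CTC blank token (assumed to be 0 here)
    Returns:
      * list of token ids after greedy CTC collapsing
    """
    collapsed = []
    previous = None
    for current in ids:
        # keep a token only if it differs from the previous frame and is not blank
        if current != previous and current != blank_id:
            collapsed.append(current)
        previous = current
    return collapsed


# hypothetical vocabulary and per-frame predictions, for illustration only
vocab = {0: "<blank>", 1: "H", 2: "I"}
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 2]

print("".join(vocab[i] for i in ctc_greedy_collapse(frame_ids)))
```

The blank token is what allows a genuinely repeated letter to survive: the final `2` is kept because blanks separate it from the earlier run of `2`s.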

`#TODO`
* create second data source of *Underwriter text comments*, where only some `uuid` values match `model_point` so that `merge` is needed to extract relevant items called `underwrite.csv`
* Perform sentiment analysis with an appropriate `huggingface` model
* Give discount to policies with positive underwriter sentiment
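Until those items are built, the merge and discount steps can be sketched with stand-in data. Everything below is hypothetical: the two toy dataframes stand in for `model_point` and `underwrite.csv`, the `sentiment` labels are assumed model output rather than real predictions, and the 5% discount rate is made up.

```python
import pandas as pd

# hypothetical stand-in for the policy data
policies = pd.DataFrame({
    "uuid": ["a1", "b2", "c3"],
    "premium": [100.0, 200.0, 300.0],
})

# hypothetical stand-in for underwriter comments after sentiment analysis;
# only some uuid values match the policy data
underwrite = pd.DataFrame({
    "uuid": ["b2", "c3", "z9"],
    "sentiment": ["POSITIVE", "NEGATIVE", "POSITIVE"],
})

# left merge keeps every policy; policies without a comment get a missing sentiment
merged = policies.merge(underwrite, on="uuid", how="left")

DISCOUNT = 0.05  # hypothetical discount rate
is_positive = merged["sentiment"].eq("POSITIVE")

# keep the premium where sentiment is not positive, otherwise apply the discount
merged["discounted_premium"] = merged["premium"].where(
    ~is_positive, merged["premium"] * (1 - DISCOUNT)
)
print(merged)
```

A left merge is used so that policies without an underwriter comment are kept at their original premium rather than dropped.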

# **5. Resources**  📓
---


* VSCode is a popular Integrated Development Environment (IDE). It makes working with code easier. In addition, their [training material](https://code.visualstudio.com/docs/introvideos/basics) is a great place to learn practical development skills.

* [Kaggle](https://www.kaggle.com/) is an online community of data scientists and machine learning practitioners. They provide free access to GPUs for deep learning and their forums are particularly good for finding solutions to data science code-related problems.

* [Swiss Association of Actuaries Data Science initiative](https://www.actuarialdatascience.org/)

* [fast.ai](https://www.fast.ai/), a course and python `package` for accessible deep learning.

* [Distill](https://distill.pub/), interactive articles about machine learning.