# Feature extraction part 2

This notebook continues on from `12_feature_extraction.ipynb`, looking at replicating some of the other methods in the FHRMA toolbox.

In [1]:
# Import packages
from dataclasses import dataclass
import glob
from itertools import compress, groupby
import matplotlib.pyplot as plt
import math
import numpy as np
import os
import pandas as pd
from scipy import io
from statistics import mode

In [2]:
# Define file paths
@dataclass(frozen=True)
class Paths:
    '''Singleton object for storing paths to data and database.'''

    fhrma_train_csv = './fhrma/train_test_data/traindata_csv/'
    fhrma_test_csv = './fhrma/train_test_data/testdata_csv/'


paths = Paths()

## Maeda et al. 2012 Baseline FHR

[Maeda et al. 2012](https://benthamopen.com/contents/pdf/TOMDJ/TOMDJ-4-28.pdf) - Central Computerized Automatic Fetal Heart Rate Diagnosis with a Rapid and Direct Alarm System

FHR was sampled every 250ms over a 5-minute period and averaged every 2 seconds, giving 150 FHR values (there are 30 x 2 seconds in a minute, so 150 x 2 seconds in 5 minutes); 150 uterine contraction values were also obtained. The FHR values were counted in intervals of 10 beats per minute (bpm) ranging from 0 to 200 bpm. The values in the interval containing the most FHR data were then averaged to determine the FHR baseline.

So basically...

1. **Find the average of every 2 seconds**

2. **Look at data from a five minute period** - this will mean you are looking at a sample of 150 FHR (as each represents average of 2 seconds, and there are 150 x 2 seconds in 5 minutes)

3. **Look at frequency of data in bins of 10bpm** - i.e. number of FHR that are 140-149.99, 150-159.99, and so on.

4. **Find the most frequent bin** - for example, if 140-150 has the most records, then just use the data from that bin

5. **Find the average of the heartrates from that bin** - so might get a result like 145.5, or so on. That represents the baseline FHR for that 5 minute portion of the data.
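Steps 3 to 5 can be sketched for a single 5-minute window as follows. This is a minimal illustration rather than the FHRMA code: `window_baseline` and the example values are hypothetical, and the bins are labelled by rounding up (so a bin covers values above 140 up to and including 150), which follows the MATLAB `ceil` convention rather than the 140-149.99 wording above.

```python
import math
from collections import Counter


def window_baseline(window_values, bin_width=10):
    """Baseline for one window: mean of the values in the most frequent bin.

    Hypothetical helper for illustration; assumes window_values is non-empty.
    """
    # Step 3: label each value with its bin (e.g. 145.5 -> bin 15)
    labels = [math.ceil(v / bin_width) for v in window_values]

    # Step 4: find the bin label that occurs most often
    best_bin, _ = Counter(labels).most_common(1)[0]

    # Step 5: average only the values that fall in that bin
    in_bin = [v for v, b in zip(window_values, labels) if b == best_bin]
    return sum(in_bin) / len(in_bin)


# Made-up 2-second averages: four fall in the (140, 150] bin
example = [142.0, 145.5, 148.0, 151.0, 139.0, 146.5]
print(window_baseline(example))  # 145.5
```

Here the mean is taken only over values inside the window, which is how the description above reads.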

### MATLAB Implementation

Boudet et al. implement this method [in the FHRMA toolbox using MATLAB](https://github.com/utsb-fmm/FHRMA/blob/master/aammaeda.m), and this is copied below:

```
sFHR=avgsubsamp(FHR,8);
baseline=zeros(1,length(FHR));

for win=[0:150:length(sFHR)-151 length(sFHR)-150]
    
    bins=zeros(1,25);

    for i=1:150
        bins(ceil(sFHR(win+i)/10))=bins(ceil(sFHR(win+i)/10))+1;
    end
    [~,bestbins]=max(bins(1:20));
    
    baseline(win*8+1:win*8+1200)=mean(sFHR( sFHR<=bestbins*10 & sFHR>(bestbins-1)*10 ));

end


baseline(win*8+1201:length(FHR))=baseline(win*8+1200);
```

They use a function `avgsubsamp` for subsampling by average, which is also copied below:

```
function y=avgsubsamp(x,factor)
    y=zeros(1,floor(length(x)/factor));
    for i=1:length(y)
        y(i)=mean(x((i-1)*factor+1:i*factor));
    end
end
```
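A possible NumPy equivalent of `avgsubsamp` is sketched below. The name `avg_subsample` is an assumption for this notebook, not part of FHRMA. Like the MATLAB function, it uses the floor of `length / factor`, so any trailing records that do not fill a complete chunk are dropped.

```python
import numpy as np


def avg_subsample(x, factor):
    """Average consecutive chunks of `factor` records, dropping any remainder."""
    x = np.asarray(x, dtype=float)
    n_out = len(x) // factor  # floor, as in the MATLAB version
    # Reshape into (n_out, factor) and average across each row
    return x[:n_out * factor].reshape(n_out, factor).mean(axis=1)


# Seven records with factor 3: two full chunks, the last record is dropped
print(avg_subsample([1, 2, 3, 4, 5, 6, 7], 3))  # [2. 5.]
```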

### Python Implementation

#### Set up

In [3]:
# Load the FHR for train01
fhr = pd.read_csv(os.path.join(paths.fhrma_train_csv, 'train01.csv'),
                  header=None)[0].values
fhr[0:10]

array([168., 168., 168., 170., 170., 170., 172., 172., 172., 173.])

In [4]:
# Load FHRMA version of results
md_std = io.loadmat('./fhrma/MD_std.mat')

# Get array listing filenames (and hence order of the data)
fhrma_files = np.concatenate(np.concatenate(md_std['data']['filename']))

# Get array with the baseline signal as per Maeda when implemented in FHRMA
fhrma_md = np.concatenate(md_std['data']['baseline'])

# Convert array into dictionary so each record is accompanied by relevant name
fhrma_maeda = {
    fhrma_files[i].replace('.fhr', ''): 
    fhrma_md[i][0] for i in range(len(fhrma_files))}

# Extract the same result as I am currently processing
fhrma_result = fhrma_maeda['train01']

#### Part 1. Mean of FHR from every 2 seconds (i.e. 8 records)

In FHRMA, they separate the FHR into chunks of 8 (i.e. first 8 records, then next 8, then next 8, and so on). They then find the mean of each of those chunks.

<mark>The FHR is not cleaned beforehand, so the averages include long periods of 0 and values outside the normal range</mark>

In [5]:
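# Find mean of every 8 records
# Note: the final chunk here has fewer than 8 records (14007 is not a multiple
# of 8) but is still averaged, whereas avgsubsamp in FHRMA drops the remainder
sfhr = []
start = 0
end = len(fhr)
step = 8
for i in range(start, end, step):
    sfhr.append(np.mean(fhr[i:i+step]))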

# Preview head and tail
print(sfhr[:10])
print(sfhr[-10:])

[169.75, 172.25, 169.125, 166.75, 166.0, 166.0, 166.25, 168.5, 171.125, 172.875]
[160.25, 161.125, 162.0, 160.375, 157.25, 154.625, 156.5, 158.375, 159.0, 159.85714285714286]


#### Part 2. Find the most common heartrate bin in each 5 minute interval

We're looking at every 5 minutes / 300 seconds (which equates to 150 of the 2 second results).

We sort the heartrates into bins of 10bpm (e.g. 130-140, 140-150, 150-160), then look to see which bin is most common for that 5 minute period.

FHRMA then finds the mean of all heartrates from that bin across the entire FHR recording, but I am minded to suggest that this should be the mean of only the heartrates from that bin in the current five minute interval.

**Filter to first five minutes**

<mark>My start points are currently wrong: the windows should be consecutive 5 minute blocks (so around 12 blocks for this record, with the last one being smaller), not every possible window of 150 data points</mark>

In [6]:
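# For each of the possible start points
# (for every possible window of 150 data points)
# Note: this loop only overwrites `current` on each pass, so it has no effect
# on the results that follow
start_points = len(sfhr)-149
for i in np.arange(0, start_points):
    # Filter to data from that 5 minute segment
    current = sfhr[0+i:150+i]

# Use only the first 5 minute window for the rest of this walkthrough
current = sfhr[0:150]
print(current)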

[169.75, 172.25, 169.125, 166.75, 166.0, 166.0, 166.25, 168.5, 171.125, 172.875, 174.0, 173.25, 170.75, 171.0, 172.0, 172.0, 173.875, 175.75, 176.0, 175.875, 174.625, 173.125, 172.0, 172.0, 173.625, 174.625, 176.0, 176.0, 176.0, 174.875, 174.0, 174.0, 175.125, 176.0, 176.0, 175.25, 173.75, 172.125, 172.0, 173.625, 174.0, 174.0, 172.75, 172.0, 172.0, 172.0, 172.125, 173.75, 174.0, 174.0, 174.0, 174.0, 173.875, 172.25, 172.0, 172.0, 172.0, 172.0, 172.0, 172.375, 173.0, 172.0, 170.25, 170.0, 170.0, 170.0, 170.0, 170.625, 172.0, 171.875, 167.25, 141.25, 131.875, 116.5, 122.875, 127.25, 128.5, 129.75, 130.5, 132.0, 123.0, 118.125, 110.5, 119.0, 126.5, 132.125, 133.25, 135.75, 142.25, 151.375, 159.0, 163.875, 166.5, 165.875, 162.125, 155.75, 150.875, 149.75, 153.125, 156.0, 154.625, 146.875, 142.125, 141.5, 140.875, 140.875, 147.625, 152.875, 156.875, 158.0, 155.75, 146.75, 139.375, 139.5, 149.5, 158.25, 161.0, 158.25, 152.125, 144.125, 134.5, 119.75, 109.5, 99.0, 79.5, 75.75, 75.25, 76.0, 7
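The window start points used in the MATLAB loop (`[0:150:length(sFHR)-151 length(sFHR)-150]`) could be reproduced as in the sketch below. The function name `window_starts` is hypothetical. The example uses 1750 subsampled values, which is what the floor of 14007 / 8 gives for this record.

```python
def window_starts(n_samples, window=150):
    """Start indices (0-based) of each window, following the MATLAB loop.

    Windows step forward by `window` samples, and one extra window is
    anchored at the end of the signal, so it may overlap the previous one.
    """
    if n_samples < window:
        raise ValueError('Signal is shorter than one window')
    # range() stops before n_samples - window, matching 0:150:length-151
    return list(range(0, n_samples - window, window)) + [n_samples - window]


print(window_starts(1750))
# [0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1600]
```

Under this reading, the final window is a full 150 samples that starts at 1600 and overlaps the window that starts at 1500, rather than being a shorter block.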

**Find most common bin**

In [7]:
# Divide each value by 10 then round up to nearest integer
bins = [math.ceil(x/10) for x in current]
print(bins)

# Find the most common bin (multiplied by 10 to give its upper edge in bpm)
mode_bin = mode(bins)*10
print(mode_bin)

[17, 18, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 17, 17, 17, 17, 18, 18, 18, 17, 15, 14, 12, 13, 13, 13, 13, 14, 14, 13, 12, 12, 12, 13, 14, 14, 14, 15, 16, 16, 17, 17, 17, 17, 16, 16, 15, 16, 16, 16, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 15, 14, 14, 15, 16, 17, 16, 16, 15, 14, 12, 11, 10, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 10, 10, 11, 11, 12, 13, 14, 12, 11, 12, 12]
180
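Two details of the MATLAB version are not captured by `statistics.mode`: it only considers bins 1 to 20 (`max(bins(1:20))`, i.e. up to 200 bpm), and on a tie `max` returns the lowest bin, whereas `mode` returns the first one encountered in the data. The sketch below is one way to mirror that behaviour. `most_frequent_bin` is a hypothetical name, and ignoring values that fall outside bins 1 to 20 (such as 0) is an assumption made here.

```python
import numpy as np


def most_frequent_bin(values, max_bin=20):
    """Return the most frequent 10 bpm bin index, preferring the lowest on ties."""
    idx = np.ceil(np.asarray(values, dtype=float) / 10).astype(int)
    # Assumption: ignore values outside bins 1..max_bin (e.g. 0 or > 200 bpm)
    idx = idx[(idx >= 1) & (idx <= max_bin)]
    counts = np.bincount(idx, minlength=max_bin + 1)
    # argmax returns the first maximum, i.e. the lowest bin on a tie
    return int(np.argmax(counts[1:])) + 1


# Two values fall in the (170, 180] bin, so its upper edge is printed
print(most_frequent_bin([172.0, 175.5, 168.0, 141.0, 0.0, 230.0]) * 10)  # 180
```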


**FHRMA implementation - mean of all FHR from that bin**

In [8]:
# Filter sFHR based on mode_bin
# If mode_bin is 160, then values must be less than or equal to 160
# and greater than 10 less than that, so greater than 150
mask = [(x <= mode_bin) & (x > mode_bin-10) for x in sfhr]
filtered = list(compress(sfhr, mask))

# Preview list and see its length
print(filtered[:10])
print(len(filtered))

# Find mean of list
mean = np.mean(filtered)
print(mean)

[172.25, 171.125, 172.875, 174.0, 173.25, 170.75, 171.0, 172.0, 172.0, 173.875]
318
173.36320754716982


**What I think this should be - mean of 5-min FHR from that bin**

In [9]:
# Filter the current five minutes to only those values that fall in the most
# common bin
mask = [(x <= mode_bin) & (x > mode_bin-10) for x in current]
filtered = list(compress(current, mask))

# Preview list and see its length
print(filtered[:10])
print(len(filtered))

# Find mean of list
mean = np.mean(filtered)
print(mean)

[172.25, 171.125, 172.875, 174.0, 173.25, 170.75, 171.0, 172.0, 172.0, 173.875]
59
173.260593220339


**Compare to FHRMA**

In [10]:
fhrma_result[:10]

array([173.36320755, 173.36320755, 173.36320755, 173.36320755,
       173.36320755, 173.36320755, 173.36320755, 173.36320755,
       173.36320755, 173.36320755])

#### Part 3. Set that mean as the baseline for that 5 minutes

In [11]:
# Create array of zeros of length of fhr
baseline = [0] * len(fhr)

# Set the first 1200 records of baseline (i.e. first 5 minutes) to that
# calculated mean
baseline[0:1200] = [mean] * 1200

In FHRMA, a baseline heart rate is set for every sample in the raw FHR trace, so the baseline has the same length as the FHR.

In [12]:
print(len(fhrma_result))
print(len(fhr))

14007
14007


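The output below groups consecutive equal baseline values. Every run length is a multiple of 1200 samples (one or more 5 minute blocks sharing the same baseline) except the last, which is 2007: one block of 1200 plus the 807 samples that remain. Note that in the MATLAB code the final window starts at `length(sFHR)-150`, so it appears to be a full 5 minute window anchored at the end of the recording (overlapping the previous window) rather than a shorter one.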

In [13]:
# View counts of consecutive equal values in the results
[(k, sum(1 for i in g)) for k,g in groupby(fhrma_result)]

[(173.36320754716982, 1200),
 (165.35695043103448, 3600),
 (173.36320754716982, 1200),
 (165.35695043103448, 3600),
 (155.50259067357513, 1200),
 (165.35695043103448, 1200),
 (155.50259067357513, 2007)]

In [14]:
print(f'''
This equates to {len(fhr)//1200} sets of 5 minutes, 
and then a final block of {(len(fhr) % 1200)/60/4} minutes
''')


This equates to 11 sets of 5 minutes, 
and then a final block of 3.3625 minutes
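Putting the parts together, the sketch below is one possible Python port of the MATLAB routine. It has not been checked against `MD_std.mat`, so treat it as a starting point. The names `maeda_baseline` and `local_mean` are assumptions: `local_mean=False` follows FHRMA (mean over the whole recording), while `local_mean=True` uses only the current window, as suggested earlier. It assumes the FHR contains no NaN values and at least 5 minutes of data.

```python
import numpy as np


def maeda_baseline(fhr, factor=8, window=150, local_mean=False):
    """Sketch of a Maeda-style baseline, one value per raw FHR sample."""
    fhr = np.asarray(fhr, dtype=float)

    # Average every `factor` records, dropping any incomplete final chunk
    n = len(fhr) // factor
    sfhr = fhr[:n * factor].reshape(n, factor).mean(axis=1)
    if n < window:
        raise ValueError('Signal is shorter than one window')

    baseline = np.zeros(len(fhr))
    # Consecutive windows, plus a final window anchored at the end
    starts = list(range(0, n - window, window)) + [n - window]

    for start in starts:
        current = sfhr[start:start + window]

        # Count values per 10 bpm bin, considering bins 1..20 only
        idx = np.ceil(current / 10).astype(int)
        idx = idx[(idx >= 1) & (idx <= 20)]
        counts = np.bincount(idx, minlength=21)
        best = int(np.argmax(counts[1:])) + 1

        # Mean of values in the best bin, from the whole record or the window
        pool = current if local_mean else sfhr
        in_bin = (pool <= best * 10) & (pool > (best - 1) * 10)
        baseline[start * factor:(start + window) * factor] = pool[in_bin].mean()

    # Copy the last baseline value onto any samples left at the end
    end = (starts[-1] + window) * factor
    baseline[end:] = baseline[end - 1]
    return baseline


# Synthetic trace: 5 minutes at 145 bpm, then 5 minutes (+7 samples) at 155
demo = np.concatenate([np.full(1200, 145.0), np.full(1207, 155.0)])
demo_baseline = maeda_baseline(demo)
print(len(demo_baseline), demo_baseline[0], demo_baseline[-1])
```

If every value in a window falls outside bins 1 to 20, the mean is taken over an empty selection and the result is NaN, so real data may need cleaning first.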

