# Procedure volume outlier in Python


The goal of this exercise is to use Python to replicate the dataframe output shown in R during the clinical informatics training session on Friday, April 3, 2020.

We start by importing the CSV file and creating three separate lists from it for the calculation:
1. Colonoscopy volume
2. Log2 of the colonoscopy volume
3. Outlier view

The colonoscopy volume and log2 volume are kept as separate lists so that their means can be calculated later on.

In [1]:
from csv import reader
import math
import statistics

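# Read the whole file, then separate the header row from the data rows.
# The with-block closes the file once it has been read.
with open("colonoscopy_volumes.csv", newline="") as file_opened:
    colonoscopy = list(reader(file_opened))
header = colonoscopy[0]
colonoscopy = colonoscopy[1:]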

vol_list = []
for row in colonoscopy:
    volume = float(row[1])
    vol_list.append(volume)

We run a loop to compute the log_volume value for each row and append it, together with the volume value, to the **outlier_view** list, which mirrors the outlier_view dataframe demonstrated during the R session mentioned above.

In [2]:
volume_list = []
log_vol_list = []
outlier_view = []

for row in colonoscopy:
    volume = float(row[1])
    log_vol = math.log2(volume)
    volume_list.append(volume)
    log_vol_list.append(log_vol)
    outlier_view.append([volume, log_vol])

for row in outlier_view:
    print(row)


[50.0, 5.643856189774724]
[960.0, 9.906890595608518]
[500.0, 8.965784284662087]
[360.0, 8.491853096329674]
[50.0, 5.643856189774724]
[500.0, 8.965784284662087]
[100.0, 6.643856189774724]
[200.0, 7.643856189774724]
[100.0, 6.643856189774724]
[100.0, 6.643856189774724]
[200.0, 7.643856189774724]
[1000.0, 9.965784284662087]
[200.0, 7.643856189774724]
[900.0, 9.813781191217037]
[300.0, 8.228818690495881]
[500.0, 8.965784284662087]
[240.0, 7.906890595608519]
[500.0, 8.965784284662087]
[100.0, 6.643856189774724]
[240.0, 7.906890595608519]
[200.0, 7.643856189774724]
[120.0, 6.906890595608519]
[12.0, 3.584962500721156]
[100.0, 6.643856189774724]
[50.0, 5.643856189774724]
[36.0, 5.169925001442312]
[15.0, 3.9068905956085187]
[500.0, 8.965784284662087]
[50.0, 5.643856189774724]
[600.0, 9.228818690495881]
[200.0, 7.643856189774724]
[70.0, 6.129283016944966]
[400.0, 8.643856189774725]
[30.0, 4.906890595608519]
[1000.0, 9.965784284662087]
[250.0, 7.965784284662087]
[100.0, 6.643856189774724]
[500.0,
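The volume and log2 pairs can also be built without an explicit loop. The sketch below is a minimal illustration, not the notebook's original code: `sample_volumes` is a hypothetical stand-in holding the first four volumes printed above, so the block runs without the CSV file.

```python
import math

# Hypothetical sample: the first four volumes from the printed output.
sample_volumes = [50.0, 960.0, 500.0, 360.0]

# Build the [volume, log2(volume)] pairs in a single comprehension.
pairs = [[v, math.log2(v)] for v in sample_volumes]

# Pull the log column back out when a flat list is needed for the mean.
log_values = [log_v for _, log_v in pairs]

for row in pairs:
    print(row)
```

Note that `math.log2` raises a `ValueError` for zero or negative input, so a volume of 0 in the data would need to be handled before this step.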

To generate the final outlier_view table, we run another loop that adds the mean of the actual volume, the mean and standard deviation of the log volume, the outlier threshold, and a boolean stating whether the volume is an outlier.
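As a quick check of the threshold arithmetic, the sketch below recomputes it from the mean and standard deviation that appear in the output printed further down, then converts the threshold back to a raw volume. The name `volume_cutoff` is introduced here for illustration only and is not part of the original notebook.

```python
import math

# Values copied from the printed output in this notebook.
mean_log = 7.312484538963016
sd_log = 1.9458551793956582

# Threshold on the log2 scale, as defined in the loop: 1.2 * (mean + sd).
outlier_th = 1.2 * (mean_log + sd_log)

# Convert back to a raw volume to see how large a value must be to be flagged.
volume_cutoff = 2 ** outlier_th

print(outlier_th)
print(round(volume_cutoff))
```

With these inputs the cutoff works out to roughly 2,200 procedures, which is why every row visible in the output is marked `False`.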

In [3]:
   
outlier_view_final = []

# These summary statistics are the same for every row, so compute them once.
mean_act = statistics.mean(volume_list)
mean_log = statistics.mean(log_vol_list)
sd_log = statistics.stdev(log_vol_list)  # sample standard deviation
outlier_th = 1.2 * (mean_log + sd_log)

for row in outlier_view:
    volume = row[0]
    log_vol = row[1]
    is_outlier = log_vol > outlier_th
    outlier_view_final.append([volume, log_vol, mean_act, mean_log, sd_log, outlier_th, is_outlier])

# outlier_view_final[0] = ["volume", "log_vol", "mean_act", "mean_log", "sd_log", "outlier_th", "is_outlier"]
# The line above 👆 was meant to add a header, but assigning to index 0 overwrites the first data row
# (outlier_view_final.insert(0, ...) would add a row instead).
# I left it out to keep the rows identical to the dataframe we had on Friday.

for element in outlier_view_final:
    print(element)
    
print("\nTotal entries:", len(outlier_view_final), "rows, the same number as in the R dataframe.")
    

[50.0, 5.643856189774724, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[960.0, 9.906890595608518, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[500.0, 8.965784284662087, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[360.0, 8.491853096329674, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[50.0, 5.643856189774724, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[500.0, 8.965784284662087, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[100.0, 6.643856189774724, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[200.0, 7.643856189774724, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662030407, False]
[100.0, 6.643856189774724, 335.08550185873605, 7.312484538963016, 1.9458551793956582, 11.110007662
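One way to show column names without touching the data rows is to keep the header in its own list and format it only at print time. The sketch below assumes that approach. `format_table` is a hypothetical helper, and the two rows are copied from the output above so the block is self-contained.

```python
header = ["volume", "log_vol", "mean_act", "mean_log", "sd_log", "outlier_th", "is_outlier"]

# Two rows copied from the printed output above.
rows = [
    [50.0, 5.643856189774724, 335.08550185873605, 7.312484538963016,
     1.9458551793956582, 11.110007662030407, False],
    [960.0, 9.906890595608518, 335.08550185873605, 7.312484538963016,
     1.9458551793956582, 11.110007662030407, False],
]

def format_table(header, rows, width=11):
    """Return the header and rows as right-aligned, fixed-width text lines."""
    lines = ["".join(f"{name:>{width}}" for name in header)]
    for row in rows:
        cells = []
        for value in row:
            # bool is a subclass of int, so test for it before formatting numbers.
            if isinstance(value, bool):
                cells.append(f"{str(value):>{width}}")
            else:
                cells.append(f"{value:>{width}.3f}")
        lines.append("".join(cells))
    return lines

table_lines = format_table(header, rows)
for line in table_lines:
    print(line)
```

Because the header never enters the list of rows, the row order and row count stay the same as in the R dataframe.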

The end result shows successful replication of the dataframe provided in the R discussion. Although the exercise was interesting to work on, it shows that R handles this kind of project far more efficiently.
The strongest advantages of R over Python on this occasion are:
1. Shorter code
2. A more user-friendly interface, particularly for displaying the dataframe