### Identifying Outliers with Julia Language

- Based on a dataframe of patients diagnosed with cancer, the goal is to detect rows that contain outliers, i.e. patients whose values lie far from the distribution mean of a given variable
- source: https://stats.oarc.ucla.edu/r/codefragments/mesimulation/

In [1]:
# Importing packages

import Pkg
# add_packages = ["CSV","DataFrames","ScikitLearn","Clustering","Distances"]
# for package in add_packages Pkg.add(package) end
using CSV, DataFrames, ScikitLearn, Clustering, Distances, Statistics

In [None]:
# Loading the dataset straight into a DataFrame
dataset = CSV.read("data/dataset.csv", DataFrame)

In [5]:
# Importing ScikitLearn -> Label Encoding to transform string columns to numerical columns
@sk_import preprocessing: LabelEncoder

# Selecting string columns
col_types = Dict(names(dataset) .=> eltype.(eachcol(dataset)))
string_col = filter(p -> (last(p) == String7 || last(p) == String3 ), col_types)

# Label Encoding - transforming columns
le = LabelEncoder()
for column in keys(string_col) dataset[!, column] = le.fit_transform(dataset[!, column]) end
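
To make the encoding step concrete, here is a minimal plain-Julia sketch of what label encoding does: each distinct string is mapped to an integer code, with the classes sorted first and numbered from 0. The helper `label_encode` and the toy values are hypothetical and only illustrate the idea; the notebook itself relies on ScikitLearn's `LabelEncoder`.

```julia
# Minimal label-encoding sketch (hypothetical helper, not the ScikitLearn API)
function label_encode(values::AbstractVector)
    classes = sort(unique(values))                              # sorted distinct labels
    lookup = Dict(c => i - 1 for (i, c) in enumerate(classes))  # label => 0-based integer code
    return [lookup[v] for v in values], classes
end

# Toy column (made-up values)
codes, classes = label_encode(["yes", "no", "no", "yes", "maybe"])
println(classes)  # ["maybe", "no", "yes"]
println(codes)    # [2, 1, 1, 2, 0]
```

Note that the integer codes carry an artificial ordering ("maybe" < "no" < "yes"), which a distance-based method will treat as a real numeric difference.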

In [7]:
first(dataset,5)

Row,tumorsize,co2,pain,wound,mobility,ntumors,nmorphine,remission,lungcapacity,Age,Married,FamilyHx,SmokingHx,Sex,CancerStage,LengthofStay,WBC,RBC,BMI,IL6,CRP,DID,Experience,School,Lawsuits,HID,Medicaid
,Float64,Float64,Int64,Int64,Int64,Int64,Int64,Int64,Float64,Float64,Int64,Int64,Int64,Int64,Int64,Int64,Float64,Float64,Float64,Float64,Float64,Int64,Int64,Int64,Int64,Int64,Float64
1,67.9812,1.53433,4,4,2,0,0,0,0.801088,64.9682,0,0,1,1,1,6,6087.65,4.86842,24.1442,3.69898,8.08642,1,25,0,3,1,0.605867
2,64.7025,1.67613,2,3,2,0,0,0,0.326444,53.9171,0,0,1,0,1,6,6700.31,4.67905,29.4052,2.62748,0.803488,1,25,0,3,1,0.605867
3,51.567,1.53345,6,3,2,0,0,0,0.565031,53.3473,1,0,2,0,1,5,6042.81,5.00586,29.4826,13.8962,4.03416,1,25,0,3,1,0.605867
4,86.438,1.4533,3,3,2,0,0,0,0.848411,41.368,0,0,1,1,0,5,7162.7,5.26506,21.5573,3.00803,2.12586,1,25,0,3,1,0.605867
5,53.4002,1.56635,3,4,2,0,0,0,0.886491,46.8004,0,0,2,1,1,6,6443.44,4.98426,29.8152,3.8907,1.34932,1,25,0,3,1,0.605867


In [15]:
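# Converting the dataset to a matrix (one row per patient)
matrix = Matrix(dataset)
distance_type = Euclidean()
# Pairwise Euclidean distance between every pair of rows (dims=1 treats rows as observations)
distances = pairwise(distance_type, matrix, dims=1)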

8525×8525 Matrix{Float64}:
    0.0     612.842     50.725  1075.49   …   473.169  2186.07    532.313
  612.842     0.0      657.752   463.14       552.294  2790.32    489.015
   50.725   657.752      0.0    1120.58       497.574  2142.09    562.426
 1075.49    463.14    1120.58      0.0        929.533  3248.54    838.722
  356.665   257.23     400.839   720.086      423.912  2536.53    408.036
  713.244   101.213    757.857   364.022  …   624.504  2889.56    550.955
  149.682   464.227    196.177   926.388      417.871  2332.45    451.344
 2090.12   2702.66    2045.27   3165.11      2365.29    412.115  2466.32
  565.32     49.5982   610.236   510.681      521.594  2743.33    464.432
  310.332   922.325    265.828  1384.84       684.168  1882.88    769.033
    ⋮                                     ⋱                      
  412.285   689.639    421.245  1097.09       184.126  2204.52    286.92
  957.452   480.286    998.45    458.91       626.619  3014.04    524.28
 1786.08   2386.64    

![title](images/euclidiana.png)

Euclidean distance formula and Euclidean distance matrix

source: https://www.dabblingbadger.com/blog/2020/2/27/implementing-euclidean-distance-matrix-calculations-from-scratch-in-python

In this project, distances are computed row to row (for example, row 1 = A1 and row 2 = A2). The model groups the rows into clusters based on these distances. A row that fits poorly in its group is one that differs strongly from the other rows: one or more of its variables are discrepant, and when they enter the Euclidean formula they make the distance very large, which marks the row as an outlier.
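
As a small illustration of the row-to-row computation, the sketch below writes the Euclidean formula out by hand on three made-up rows. The names `euclid`, `A1`, `A2`, `A3` and all values are assumptions for illustration; the notebook itself uses `pairwise` from Distances.jl.

```julia
# Euclidean distance between two rows, written out from the formula
euclid(a, b) = sqrt(sum(abs2, a .- b))

# Toy rows (made-up values): A3 differs from A1 in a single variable only
A1 = [3.0, 4.0, 0.0]
A2 = [0.0, 0.0, 0.0]
A3 = [3.0, 4.0, 120.0]

rows = [A1, A2, A3]
# Row-to-row distance matrix: entry (i, j) is the distance between row i and row j
D = [euclid(rows[i], rows[j]) for i in 1:3, j in 1:3]

println(D[1, 2])  # 5.0
println(D[1, 3])  # 120.0
```

A single discrepant variable is enough to push A3 far away from A1, even though the two rows agree on every other variable.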

In [24]:
# Function to report the mean Silhouette coefficient
# A Silhouette coefficient close to +1 indicates that a point is well inside its own cluster and far from the other clusters;
# a value close to 0 indicates that the point lies near the boundary between two clusters (or the clusters overlap),
# and a negative value suggests the point may have been assigned to the wrong cluster.

function calc_silhouette(sil)
    silMean = mean(sil)
    println("Silhouette Coeficient Mean: $silMean")
end

calc_silhouette (generic function with 1 method)
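
For reference, the coefficient of a single point can be written as s = (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes it from scratch on made-up 1-D data; `silhouette_of` is a hypothetical helper, not the Clustering.jl implementation, and it assumes every cluster has at least two points.

```julia
using Statistics

# Silhouette of point i, given a pairwise distance matrix D and cluster labels
function silhouette_of(i, D, labels)
    own = [j for j in eachindex(labels) if labels[j] == labels[i] && j != i]
    a = mean(D[i, own])                      # mean distance to the rest of its own cluster
    b = minimum(mean(D[i, labels .== k])     # smallest mean distance to any other cluster
                for k in unique(labels) if k != labels[i])
    return (b - a) / max(a, b)
end

# Toy 1-D data (made-up values): two tight, well-separated groups
x = [0.0, 1.0, 10.0, 11.0]
labels = [1, 1, 2, 2]
D = [abs(p - q) for p in x, q in x]

s = [silhouette_of(i, D, labels) for i in eachindex(x)]
println(s)
println("Mean silhouette: ", mean(s))
```

With groups this well separated, every coefficient comes out close to +1; overlapping groups would pull the mean toward 0.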

In [17]:
num_clusters = 30
model = @time kmeans(distances, num_clusters)

131.155318 seconds (4.33 M allocations: 298.956 MiB, 0.47% gc time, 5.46% compilation time)


KmeansResult{Matrix{Float64}, Float64, Int64}([772.6924512384683 2619.8756593920903 … 1209.4966875567268 423.9783921038607; 1368.3732439959467 2010.4855721189822 … 1813.9868505409413 339.22015129573504; … ; 1432.606306586023 4763.008019260484 … 990.5193720753512 2517.0313552056264; 1101.4387536063941 2279.7185683928856 … 1545.491358786313 218.65291366369047], [15, 3, 15, 19, 30, 3, 6, 23, 21, 4  …  15, 18, 13, 28, 13, 4, 16, 6, 23, 30], [5.0479120614356995e7, 3.850676761209488e7, 3.392814274025345e7, 2.0657063395240784e7, 2.6776067961841583e7, 4.9212512300144196e7, 6.061230577410126e7, 1.3052726934370422e8, 5.475664035165024e7, 3.1885794318145752e7  …  5.9710417181575775e7, 3.305618056431961e7, 2.475804458933258e7, 9.00819408720398e6, 2.4353679088935852e7, 3.179032204541397e7, 1.834185376046753e7, 3.3701660810230255e7, 3.950539938864136e7, 3.1088912300247192e7], [393, 45, 441, 428, 360, 471, 103, 364, 305, 158  …  463, 452, 77, 438, 46, 11, 403, 171, 303, 407], [393, 45, 441, 428, 360,

In [31]:
calc_silhouette(silhouettes(model, distances))

Silhouette Coeficient Mean: 0.20488518559748964
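
The outlier search in the next cell ranks rows by their k-means cost, which in Clustering.jl is, as far as I understand it, the squared distance from each point to its assigned centroid. As a hedged alternative sketch, the largest costs can also be picked without overwriting `model.costs`, using `partialsortperm` from Base. The `costs` vector below is a made-up stand-in for `model.costs`.

```julia
# Toy stand-in for model.costs (made-up values): one cost per row
costs = [2.5, 40.0, 1.0, 17.5, 9.0, 33.0]

# Indices of the 3 largest costs, largest first; `costs` itself is left untouched
top = partialsortperm(costs, 1:3, rev = true)

for index in top
    println("Candidate outlier on row $index (cost = $(costs[index]))")
end
```

Leaving the cost vector intact means the fitted model can still be inspected or reused after the ranking.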


In [34]:
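# Identifying the 5 strongest outliers
# model.costs holds one cost per row: the squared distance from the row to its assigned centroid.
# The rows with the largest costs fit their cluster worst.
for i = 1:5
    _, index = findmax(model.costs) # position of the largest remaining cost
    println("\nOutlier found on line $index\n")
    show(dataset[index, 1:10], allcols = true)
    model.costs[index] = 0 # zero out this cost so the next iteration finds the next-largest one
end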


Outlier found on line 664

DataFrameRow
 Row │ tumorsize  co2      pain   wound  mobility  ntumors  nmorphine  remission  lungcapacity  Age     
     │ Float64    Float64  Int64  Int64  Int64     Int64    Int64      Int64      Float64       Float64 
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
 664 │   58.1622  1.36571      4      6         3        0          0          0      0.957281  51.3678
Outlier found on line 6202

DataFrameRow
  Row │ tumorsize  co2      pain   wound  mobility  ntumors  nmorphine  remission  lungcapacity  Age     
      │ Float64    Float64  Int64  Int64  Int64     Int64   