# Brief Explanation

The aim of this notebook is to define a binary metric that evaluates my object detection model purely on its classification task ("Lung Opacity" vs. "No finding").
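As a rough illustration of what an image-level binary evaluation could look like, the sketch below collapses detections into one label per image and computes sensitivity and specificity. The rule "an image is positive if at least one box is detected" and the toy values are assumptions made for the example; they are not the decision rule this notebook ends up with.

```python
def image_level_confusion(y_true, y_pred):
    """Count TP, FP, TN, FN for binary image labels (1 = "Lung Opacity", 0 = "No finding")."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity); NaN when a class is absent."""
    tp, fp, tn, fn = image_level_confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: number of boxes the detector returned for each image.
boxes_per_image = [2, 0, 1, 0, 0]
y_true = [1, 1, 1, 0, 0]

# Assumed rule for this sketch only: positive if at least one box was detected.
y_pred = [1 if n > 0 else 0 for n in boxes_per_image]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Reporting both numbers matters here because the planned test set is imbalanced, so plain accuracy would be dominated by the "No finding" images.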

The first step is to define a new test set. The previous one contained only 278 images because few "Lung Opacity" images were available. Since this is a test set, and the original dataset has plenty of unused "No finding" images, the new one will have an 80% - 20% ("No finding" - "Lung Opacity") proportion.
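The arithmetic behind the 80% - 20% proportion can be sketched as follows: given the available positive images, compute how many negatives are needed and draw them at random from the unused pool. The function names, the toy IDs and the fixed seed are hypothetical and only serve the example.

```python
import random


def negatives_needed(n_positive, negative_pct=80):
    """Number of negatives so that they make up `negative_pct` percent of the test set."""
    if not 0 < negative_pct < 100:
        raise ValueError("negative_pct must be strictly between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-(n_positive * negative_pct) // (100 - negative_pct))


def build_test_ids(positive_ids, negative_pool, negative_pct=80, seed=0):
    """Combine all positives with a random sample of negatives in the requested proportion."""
    n_neg = negatives_needed(len(positive_ids), negative_pct)
    if n_neg > len(negative_pool):
        raise ValueError("not enough negative images in the pool")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return list(positive_ids) + rng.sample(list(negative_pool), n_neg)


# Hypothetical IDs: 5 positives require 20 negatives for an 80/20 split.
positives = [f"pos_{i}" for i in range(5)]
negatives = [f"neg_{i}" for i in range(50)]
test_ids = build_test_ids(positives, negatives)
print(len(test_ids), negatives_needed(100))
```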



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install bbox-visualizer
import os      # operating-system-dependent functionality
import shutil  # high-level operations on files
from glob import glob  # retrieve files/pathnames matching a specified pattern

import bbox_visualizer as bbv
import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from skimage.io import imread
from sklearn.model_selection import GroupKFold, train_test_split
from tqdm.notebook import tqdm  # progress bar


Collecting bbox-visualizer
  Downloading https://files.pythonhosted.org/packages/e2/ed/3fee03fcc9913a772a802e9407a49dfb026f78bab4f1385e8b91eb544e4a/bbox_visualizer-0.1.0-py2.py3-none-any.whl
Installing collected packages: bbox-visualizer
Successfully installed bbox-visualizer-0.1.0


In [None]:
fds_ws = pd.read_excel('/content/drive/MyDrive/Quinto_Anio/TESIS_Eugenia_Berrino/Part_II_DS/vinbigdata/fds_withsplit.xlsx')
original_test = fds_ws[fds_ws['Group']=='Test']
original_test.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,image_id,class_name,class_id,rad_id,x_min,y_min,x_max,y_max,width,height,Group
5,5,32939,efc7bc78ce88e95191fdab525f974c24,Consolidation,7,R9,205.0,1095.0,920.0,1574.0,2304,2880,Test
15,15,59353,65ad4fb69f36c807fce87e66a1c6533d,Consolidation,8,R16,638.0,1388.0,1173.0,1888.0,2504,2930,Test
16,16,43678,fc34c8cc6321cfc97ec35783a5daa937,Consolidation,7,R8,319.0,698.0,1049.0,1622.0,2304,2880,Test
26,26,22204,c699f16ba0b86f474390da9515bcad7a,Consolidation,8,R8,916.0,1086.0,950.0,1122.0,2738,3174,Test
36,36,6575,c0440c09698f89df168dc146af067fe7,Consolidation,7,R11,1140.0,324.0,1356.0,459.0,2205,2273,Test


In [None]:
original_test.class_name.value_counts()

Consolidation    849
No finding        14
Name: class_name, dtype: int64

In [None]:
original_test = original_test.drop(columns=['Unnamed: 0.1','Group'])
original_test.columns

Index(['Unnamed: 0', 'image_id', 'class_name', 'class_id', 'rad_id', 'x_min',
       'y_min', 'x_max', 'y_max', 'width', 'height'],
      dtype='object')

In [None]:
ds = pd.read_excel('/content/drive/MyDrive/Quinto_Anio/TESIS_Eugenia_Berrino/Part_II_DS/vinbigdata/My_DS.xlsx')
ds

Unnamed: 0.1,Unnamed: 0,image_id,class_name,class_id,rad_id,x_min,y_min,x_max,y_max,width,height
0,57877,7b30d37b73be405bfd91ed5e2d46c473,Consolidation,7,R8,1148.0,911.0,1693.0,1482.0,2304,2880
1,4860,7acb16c6d6f5cfc41a958e0b41e25106,Consolidation,7,R10,761.0,964.0,976.0,1415.0,2304,2880
2,25382,6c79f2551808438721052023e043ab4d,Consolidation,4,R8,803.0,1156.0,1345.0,1496.0,3072,3072
3,61581,ecf474d5d4f65d7a3e23370a68b8c6a0,Consolidation,8,R8,675.0,620.0,757.0,706.0,2408,2692
4,12091,4b001bab36d94f73c1ead3ab74690dbc,Consolidation,8,R9,1574.0,923.0,1597.0,951.0,1936,2488
...,...,...,...,...,...,...,...,...,...,...,...
5986,22141,8cb084ad48ad4a21e15bdb8f4567ed8f,Consolidation,8,R17,1234.0,435.0,1705.0,1059.0,2466,2347
5987,27395,ec6ec12533b8495bb7344d8895dd4f05,Consolidation,7,R10,1591.0,1309.0,1805.0,1701.0,2304,2880
5988,28828,5d6c0df203f0e3f04467e27507029026,Consolidation,1,R9,1795.0,658.0,2380.0,1446.0,2851,2967
5989,25185,0b98b21145a9425bf3eeea4b0de425e7,Consolidation,4,R10,340.0,1077.0,846.0,1550.0,2208,2688


In [None]:
fds =  ds[~ds['image_id'].isin(original_test['image_id'])]
len(fds)

5128

In [None]:
nof = fds[fds['class_name']=='No finding']
print(nof.head())
print(len(nof))

      Unnamed: 0                          image_id  class_name  ...   w   h  area
4046       35643  2e968b23fc1bc1150bf1943e1b53b031  No finding  ... NaN NaN   NaN
2747       38968  32557c0fb41d37a9e3c8485d3dcdb8cc  No finding  ... NaN NaN   NaN
664        23788  9d82b2a79a46c1a91802cad4043cc36d  No finding  ... NaN NaN   NaN
5638       53868  f4b4ec6b571cc34532856f68890a89b1  No finding  ... NaN NaN   NaN
193        57501  50742e2516c5af7abede56edbd3bb6bb  No finding  ... NaN NaN   NaN

[5 rows x 16 columns]
158


In [None]:
fds = pd.concat([fds,nof]).sample(frac=1)
fds.head()  

Unnamed: 0.1,Unnamed: 0,image_id,class_name,class_id,rad_id,x_min,y_min,x_max,y_max,width,height
5925,3845,997fca43b97287e53c551c3d4753edcb,Consolidation,8,R8,1599.0,1507.0,1666.0,1591.0,2048,2500
2502,32411,60398630bcbf4f2fa6f5730fd9a9f4dc,Consolidation,7,R10,752.0,1375.0,1181.0,1820.0,3072,3072
5084,10874,05721adb43ab7c061733568b274c006b,Consolidation,8,R10,1605.0,1461.0,1685.0,1545.0,2304,2880
2218,49176,f5f8866773cc80861a7f5c30502d0fbb,Consolidation,7,R8,1681.0,1107.0,1964.0,1514.0,2304,2880
3969,53656,a6bcb9f5d59588d699c5aa83cd3039c7,Consolidation,8,R8,2108.0,991.0,2177.0,1068.0,2540,3072


In [None]:
fds.class_name.value_counts()

Consolidation    5049
No finding        158
Name: class_name, dtype: int64

In [None]:
# Normalizing annotations

# BB normalized limits
fds['x_min'] = fds['x_min'] / fds['width']
fds['y_min'] = fds['y_min'] / fds['height']

fds['x_max'] = fds['x_max'] / fds['width']
fds['y_max'] = fds['y_max'] / fds['height']

# BB normalized center
fds['x_mid'] = (fds['x_max'] + fds['x_min']) / 2
fds['y_mid'] = (fds['y_max'] + fds['y_min']) / 2

# BB normalized width & height
fds['w'] = fds['x_max'] - fds['x_min']
fds['h'] = fds['y_max'] - fds['y_min']

# BB area as a fraction of the image area
fds['area'] = fds['w'] * fds['h']
fds.head()

Unnamed: 0.1,Unnamed: 0,image_id,class_name,class_id,rad_id,x_min,y_min,x_max,y_max,width,height,x_mid,y_mid,w,h,area
5925,3845,997fca43b97287e53c551c3d4753edcb,Consolidation,8,R8,0.780762,0.6028,0.813477,0.6364,2048,2500,0.797119,0.6196,0.032715,0.0336,0.001099
2502,32411,60398630bcbf4f2fa6f5730fd9a9f4dc,Consolidation,7,R10,0.244792,0.447591,0.38444,0.592448,3072,3072,0.314616,0.52002,0.139648,0.144857,0.020229
5084,10874,05721adb43ab7c061733568b274c006b,Consolidation,8,R10,0.696615,0.507292,0.731337,0.536458,2304,2880,0.713976,0.521875,0.034722,0.029167,0.001013
2218,49176,f5f8866773cc80861a7f5c30502d0fbb,Consolidation,7,R8,0.729601,0.384375,0.852431,0.525694,2304,2880,0.791016,0.455035,0.12283,0.141319,0.017358
3969,53656,a6bcb9f5d59588d699c5aa83cd3039c7,Consolidation,8,R8,0.829921,0.322591,0.857087,0.347656,2540,3072,0.843504,0.335124,0.027165,0.025065,0.000681


In [None]:
unique = fds.drop_duplicates(subset = ["image_id"])
unique

Unnamed: 0.1,Unnamed: 0,image_id,class_name,class_id,rad_id,x_min,y_min,x_max,y_max,width,height,x_mid,y_mid,w,h,area
5925,3845,997fca43b97287e53c551c3d4753edcb,Consolidation,8,R8,0.780762,0.602800,0.813477,0.636400,2048,2500,0.797119,0.619600,0.032715,0.033600,0.001099
2502,32411,60398630bcbf4f2fa6f5730fd9a9f4dc,Consolidation,7,R10,0.244792,0.447591,0.384440,0.592448,3072,3072,0.314616,0.520020,0.139648,0.144857,0.020229
5084,10874,05721adb43ab7c061733568b274c006b,Consolidation,8,R10,0.696615,0.507292,0.731337,0.536458,2304,2880,0.713976,0.521875,0.034722,0.029167,0.001013
2218,49176,f5f8866773cc80861a7f5c30502d0fbb,Consolidation,7,R8,0.729601,0.384375,0.852431,0.525694,2304,2880,0.791016,0.455035,0.122830,0.141319,0.017358
3969,53656,a6bcb9f5d59588d699c5aa83cd3039c7,Consolidation,8,R8,0.829921,0.322591,0.857087,0.347656,2540,3072,0.843504,0.335124,0.027165,0.025065,0.000681
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5852,41084,088d83359d1a00ba24251220ace42edc,Consolidation,7,R10,0.660543,0.232053,0.726837,0.271411,2504,3176,0.693690,0.251732,0.066294,0.039358,0.002609
977,53296,a61be8051f02ec494ed40696e988f6d1,Consolidation,8,R8,0.657552,0.670486,0.707031,0.711806,2304,2880,0.682292,0.691146,0.049479,0.041319,0.002044
2642,6515,d571f85ab9434bcb8bc11bd175453c96,Consolidation,7,R10,0.373598,0.226562,0.449514,0.361003,2674,3072,0.411556,0.293783,0.075916,0.134440,0.010206
252,61956,d79068eb77a5aa51eb57904fbfce1720,Consolidation,7,R10,0.588108,0.181944,0.825087,0.249306,2304,2880,0.706597,0.215625,0.236979,0.067361,0.015963


In [None]:
unique = unique[['image_id','class_name']]
unique.class_name.value_counts()

Consolidation    1573
No finding         79
Name: class_name, dtype: int64

In [None]:
imgs_test = pd.read_csv('/content/drive/MyDrive/Quinto_Anio/TESIS_Eugenia_Berrino/Part_II_DS/vinbigdata/ctest.csv')
imgs_test

Unnamed: 0.1,Unnamed: 0,image_id
0,0,21cf533a9fe77bdbee21babd427a0d1f
1,1,bc2be005526db7ab9d5ec6741ddee945
2,2,79c5d4d7f3b2e7a5a183bfbe664c699d
3,3,c34e6aa7a5db3386850b830dd3c45a98
4,4,c24029f31fb9ae265934082ce6b47d33
...,...,...
273,273,071ff9c782ead87dfa9b1c025c25e769
274,274,1bbd7232924e951e7fa87ffa0f62ec3d
275,275,f58ecf974a05d2f5ece85aa9393cf9d6
276,276,7978725e43f1b301551e0fdbc32faef5


In [None]:
tests = np.concatenate((nof.image_id.to_numpy(),imgs_test.image_id.to_numpy()))
tests

array(['910e6bb0acb1c0b9156b4b7ae0dc752c',
       '91783a3a0b479febd50ea952f711d776',
       '50742e2516c5af7abede56edbd3bb6bb',
       'bee99c179da019550328e64a55713d5e',
       '524f09c24a034a2daae14151142ff9be',
       'cfbb4effef1cb2fe0311d38e3d35b5ea',
       '3212a84727e182171bd4ab3da7d560a5',
       '758ecebbb7d3da26eeadc428aafc991b',
       'daa9fa42d787183c91519315f5927cce',
       '5b65be064e0cee0795a5c98148360b63',
       '9d82b2a79a46c1a91802cad4043cc36d',
       'b7c2310d1b1f98c1bb88789595d78613',
       '52a8241a351307b812b63dd80912e0dd',
       'fe095046c8e69bfe5d169425c73b3135',
       '78d5ff1cb48bc317bd20d7e3ddae34f5',
       'b2c002bcee4f53855b238c491956f953',
       '799915888fe8902eddfe9a19627c618c',
       'fe17eb35352cbd7bbe33d90206276c44',
       'aff94c85d299cc26f549b22be3c54d16',
       '51c9555c13a2ce6554d4592c95b80a39',
       'e1a603beea2b53b3031a337d8a87cd72',
       '67f6b4a6c560951eef3128e2e71419c7',
       '5dcadceca5d9be91e3a484b004989a11',
       'baa

To obtain an optimal model for predicting the probability that an image belongs to a given class, the following variables have to be taken into consideration:

1. Number of boxes
2. Confidence  
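Both variables can be derived from the raw detections of each image. The sketch below assumes the detections are available as `(image_id, confidence)` pairs; the function name, the `min_conf` parameter and the choice of `0.0` as the maximum confidence for an image without boxes are assumptions for illustration.

```python
from collections import defaultdict


def image_features(detections, image_ids, min_conf=0.0):
    """Return {image_id: (n_boxes, max_confidence)} for every image in `image_ids`.

    Images without any detection get (0, 0.0); that default is an assumption.
    """
    confs = defaultdict(list)
    for image_id, conf in detections:
        if conf >= min_conf:
            confs[image_id].append(conf)
    return {
        image_id: (len(confs[image_id]), max(confs[image_id], default=0.0))
        for image_id in image_ids
    }


# Hypothetical detections: two boxes on img_a, one on img_b, none on img_c.
detections = [("img_a", 0.91), ("img_a", 0.40), ("img_b", 0.15)]
features = image_features(detections, ["img_a", "img_b", "img_c"])
print(features)
```

Passing the full list of image IDs explicitly keeps images with zero detections in the result, which matters because those are exactly the cases a "No finding" prediction relies on.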

The IoU threshold is a further variable, but changing it here would require changing it for the validation images as well.
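For reference, the IoU (intersection over union) of two boxes is the area of their overlap divided by the area of their union. A minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples as in the annotation columns used in this notebook:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```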

The chosen approach is to build a regression model and obtain the optimal coefficients for both variables by maximum likelihood (frequentist approach). In fact, the idea is to create four different models, compare them, and choose the best of them.

The models are the following, where $BB$ is the number of bounding boxes detected in the image, $C_{max}$ is the maximum confidence level in the image and $p$ is the probability of classifying the image as belonging to the class "Lung Opacity":


1. $\mathrm{logit}(p) = \beta_0 + \beta_1 \cdot BB + \beta_2 \cdot C_{max}$
2. $\mathrm{logit}(p) = \beta_0 + \beta_1 \cdot BB \cdot C_{max}$
3. $\mathrm{logit}(p) = \beta_0 + \beta_1 \cdot BB$
4. $\mathrm{logit}(p) = \beta_0 + \beta_2 \cdot C_{max}$
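The four candidates above can be fitted and compared along the following lines. This is only a sketch under stated assumptions: the data are synthetic, the fit is a small hand-written Newton-Raphson maximum-likelihood routine in plain Python (in practice a statistics library would normally be used), and comparing the models by AIC is an assumed criterion, since the text does not fix how "the best" model is chosen.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression; returns (beta, log_likelihood)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # X^T W X (observed information)
        for x, t in zip(rows, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (t - p) * x[i]
                for j in range(k):
                    info[i][j] += w * x[i] * x[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for x, t in zip(rows, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loglik += math.log(p) if t == 1 else math.log(1.0 - p)
    return beta, loglik


# Design rows (intercept first) for the four models listed above.
designs = {
    "1: BB + C_max": lambda bb, c: [1.0, bb, c],
    "2: BB * C_max": lambda bb, c: [1.0, bb * c],
    "3: BB": lambda bb, c: [1.0, bb],
    "4: C_max": lambda bb, c: [1.0, c],
}

# Synthetic data (assumption): 300 images with made-up BB, C_max and labels.
rng = random.Random(0)
data = []
for _ in range(300):
    bb = rng.randint(0, 4)
    c_max = rng.uniform(0.1, 1.0) if bb > 0 else 0.0
    p_true = sigmoid(-2.0 + 0.5 * bb + 3.0 * c_max)
    data.append((bb, c_max, 1 if rng.random() < p_true else 0))

labels = [label for _, _, label in data]
results = {}
for name, design in designs.items():
    rows = [design(bb, c) for bb, c, _ in data]
    beta, loglik = fit_logistic(rows, labels)
    results[name] = {"beta": beta, "loglik": loglik, "aic": 2 * len(beta) - 2 * loglik}

best = min(results, key=lambda name: results[name]["aic"])
for name, r in results.items():
    print(name, [round(b, 3) for b in r["beta"]], round(r["aic"], 2))
print("lowest AIC:", best)
```

AIC penalizes the extra coefficient of model 1, so it only wins if the additional variable improves the likelihood enough; a held-out validation metric would be an equally reasonable way to choose.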
