In [1]:
import numpy as np
import sys
import torch

Q1 and Q2: Generating Fake Images with an efficient data structure

In Python, an array can be represented as a Python list, a NumPy array, or a tensor. I am using NumPy arrays to represent the images.

However, storing raw intensity values would be very memory-intensive for a high-resolution 100,000*100,000 image. We have to use encoding/compression techniques to reduce the size of the image.
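To make the scale concrete, here is a rough back-of-the-envelope calculation of the raw size of one 100,000*100,000 image at a few bit depths. The bit depths listed are illustrative assumptions, not measurements from this notebook.

```python
# Raw memory footprint of one 100,000 x 100,000 image at several bit depths.
SIDE = 100_000
pixels = SIDE * SIDE  # 10**10 pixels


def footprint_gb(bits_per_pixel):
    """Return the raw size in gigabytes (10**9 bytes) for a given bit depth."""
    return pixels * bits_per_pixel / 8 / 1e9


# Assumed representations, chosen only for illustration.
sizes = {
    "float32 intensity": footprint_gb(32),
    "uint8 intensity or NumPy bool": footprint_gb(8),
    "bit-packed binary mask": footprint_gb(1),
}

for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```

Even a one-byte-per-pixel representation needs about 10 GB for a single image at this resolution, which is why a more compact encoding matters.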

My idea in the code below is to store each image as a binary mask (boolean values). As we are dealing with images which are binary in nature (parasite cells/not parasite cells and cancer cells/not cancer cells), we can represent each pixel with a 1 or a 0 rather than with an intensity value, which is the traditional way of representing an image.
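A NumPy boolean array still spends one full byte per pixel. As a possible extension (not used in the code of this notebook), the mask could be bit-packed with `np.packbits` so that eight pixels share one byte. The sketch below uses a NumPy random generator with an arbitrary seed as a stand-in for the generated images.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, for reproducibility
mask = rng.random((512, 512)) > 0.25       # boolean mask, 1 byte per pixel

# Pack 8 pixels into each byte along the row axis.
packed = np.packbits(mask, axis=-1)

# Unpack and trim back to the original width to recover the mask exactly.
restored = np.unpackbits(packed, axis=-1)[:, :mask.shape[1]].astype(bool)

print(mask.nbytes, packed.nbytes)          # bytes before and after packing
```

The packed array is one eighth of the size and the round trip is lossless, at the cost of an unpacking step before any pixel-wise comparison.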


Furthermore, we could use lossless coding techniques such as Huffman coding (used in the entropy-coding stage of JPEG) to further downsize the image.
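As a hedged illustration of lossless compression, the sketch below uses Python's `zlib` module, whose DEFLATE format combines LZ77 matching with Huffman coding. This is a swapped-in stand-in, not a pure Huffman coder, and the filled disc is a hypothetical parasite shape chosen only because it has structure to exploit.

```python
import zlib

import numpy as np

# Hypothetical structured mask: a filled disc standing in for a parasite body.
yy, xx = np.mgrid[:512, :512]
mask = (yy - 256) ** 2 + (xx - 256) ** 2 < 150 ** 2

# Bit-pack the mask, then compress the bytes losslessly with DEFLATE.
raw = np.packbits(mask).tobytes()
compressed = zlib.compress(raw, 9)

# Decompress and unpack to confirm that nothing was lost.
unpacked = np.unpackbits(np.frombuffer(zlib.decompress(compressed), dtype=np.uint8))
restored = unpacked[:mask.size].reshape(mask.shape).astype(bool)

print(len(raw), len(compressed))
```

Masks with large uniform regions compress well this way. The uniformly random masks generated in this notebook have little structure, so they would likely gain much less.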

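I am using the torch.rand function to generate uniform random numbers in [0, 1). Since we expect the parasitic cells to occupy at least 25% of the image, I am converting the random tensor to a boolean mask by thresholding at 0.25, which marks about 75% of the pixels as parasite. The dye image uses a threshold of 0.901, so about 9.9% of its pixels are lit.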
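The effect of the two thresholds can be checked empirically. The sketch below uses NumPy's random generator as a stand-in for `torch.rand` (both sample uniformly from [0, 1)); the seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)               # stand-in for torch.rand, arbitrary seed

# A uniform sample exceeds a threshold t with probability 1 - t.
parasite = rng.random((512, 512)) > 0.25     # expected coverage: 0.75
dye = rng.random((512, 512)) > 0.901         # expected coverage: 0.099

parasite_fraction = parasite.mean()          # fraction of True pixels
dye_fraction = dye.mean()

print(parasite_fraction, dye_fraction)
```

Because the dye coverage sits just below 10%, the per-image ratio computed later hovers very close to the 10% decision threshold.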




In [2]:
# Function to generate one pair of post-processed (binary) images

def gen_image(rows, cols):
  """Return (parasite mask, dye mask) as boolean NumPy arrays of shape (rows, cols)."""
  # torch.rand samples uniformly from [0, 1); thresholding yields a boolean mask
  parasite1 = torch.rand(rows, cols) > 0.25   # about 75% of pixels are parasite
  dye_image = torch.rand(rows, cols) > 0.901  # about 9.9% of pixels are dye-lit
  dye_image = np.array(dye_image)
  parasite1 = np.array(parasite1)

  return parasite1, dye_image
  

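Calculating the worst-case storage size. A NumPy boolean array uses one byte per pixel regardless of the pixel values, so a 512*512 image (for example one with all 1's, where every pixel corresponds to a parasite cell) takes 262,264 bytes (about 262 KB), including the array object's overhead.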

In [24]:
#worst case storage scenario in bytes

sys.getsizeof(np.array(torch.rand(512,512)>0))

262264
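The measured size can be split into pixel data and object overhead. In the sketch below, `nbytes` reports only the pixel buffer, while `sys.getsizeof` also counts the array object itself; the exact overhead is an implementation detail that varies between NumPy versions.

```python
import sys

import numpy as np

all_ones = np.ones((512, 512), dtype=bool)    # every pixel is a parasite cell
all_zeros = np.zeros((512, 512), dtype=bool)  # no pixel is a parasite cell

# The pixel buffer is one byte per pixel, whatever the values are.
print(all_ones.nbytes, all_zeros.nbytes)

# The remainder is the array object's own bookkeeping.
overhead = sys.getsizeof(all_ones) - all_ones.nbytes
print(overhead)
```

So the dense boolean representation has no best or worst case with respect to content: the size depends only on the image dimensions.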

Q2 and Q3) In the code below, I am generating 1000 image pairs of size 512*512 (1000 parasite images and 1000 dye-tested parasite images). I am using the locations of the parasite pixels as a mask to extract only the dye pixels within the parasite's boundary (as we are not concerned with the dye-lit pixels outside the parasite's body).

Then, for each image pair, I am calculating the ratio of cancer (dye-lit) pixels inside the parasite to the total area of the parasite's body.

In [3]:
rows = 512
cols = 512
ratio = []

for i in range(1000):
  dye_ROI_count = []

  parasite_img, dye_img = gen_image(rows, cols)
  # (row, col) coordinates of every parasite pixel
  parasite_pos = np.where(parasite_img)
  positions = np.array(list(zip(parasite_pos[0], parasite_pos[1])))
  # collect the dye value at each parasite pixel
  for j in positions:
    dye_ROI_count.append(dye_img[j[0], j[1]])
  cancer_count = dye_ROI_count.count(True)      # dye-lit pixels inside the parasite
  para_count = np.count_nonzero(parasite_img)   # parasite area in pixels
  ratio.append(cancer_count / para_count)
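The per-pixel loop can be expressed as a single boolean mask operation. The following is a minimal sketch of an equivalent calculation; `cancer_ratio` is a hypothetical helper name, and returning 0.0 for an empty parasite mask is an assumption the original loop does not make.

```python
import numpy as np


def cancer_ratio(parasite_img, dye_img):
    """Fraction of parasite pixels that are also dye-lit (both inputs boolean)."""
    para_count = np.count_nonzero(parasite_img)
    if para_count == 0:
        return 0.0  # assumption: treat an empty parasite mask as ratio 0
    # Logical AND keeps only the dye-lit pixels inside the parasite's body.
    return np.count_nonzero(parasite_img & dye_img) / para_count


# Tiny hand-checkable example: 6 parasite pixels, 2 of them dye-lit.
parasite = np.array([[1, 1, 0],
                     [1, 0, 0],
                     [1, 1, 1]], dtype=bool)
dye = np.array([[1, 0, 1],
                [0, 0, 0],
                [0, 1, 0]], dtype=bool)

example_ratio = cancer_ratio(parasite, dye)
print(example_ratio)
```

This avoids building a Python list of coordinates for every image, so it should run considerably faster on 512*512 inputs while producing the same ratio.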

  


Converting the ratio list to a NumPy array

In [4]:

ratio = np.array(ratio)

Q3) Using a comparison operator to convert the ratio array to a boolean array (with 1's corresponding to the image pairs where the cancer cells occupy more than 10% of the parasite's body)

In [5]:
ratio_ROI = ratio>0.1

In [6]:

np.count_nonzero(ratio_ROI)

82

**Conclusion**

Based on the generated fake images, 82 out of 1000 image pairs (8.2%) have cancer cells occupying more than 10% of the parasite's body.
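As a rough plausibility check on that count, the ratio for one image pair can be treated as a binomial proportion and approximated with a normal distribution. The sketch below rests on assumptions: each parasite pixel is dye-lit independently with probability 1 - 0.901, the parasite area is fixed at its expected value, and no continuity correction is applied.

```python
import math

p_dye = 1 - 0.901                # probability that a pixel is dye-lit
n_par = 0.75 * 512 * 512         # expected number of parasite pixels per image

# Standard deviation of the per-image ratio under the binomial model.
sd = math.sqrt(p_dye * (1 - p_dye) / n_par)

# Probability that the ratio exceeds 0.1 (upper tail of the normal approximation).
z = (0.1 - p_dye) / sd
p_exceed = 0.5 * math.erfc(z / math.sqrt(2))

expected_images = 1000 * p_exceed
print(p_exceed, expected_images)
```

Under these assumptions the model predicts that several percent of the image pairs exceed the 10% threshold, which is the same order as the observed count. The exact number changes from run to run because the images are random.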

***Improvement***

Possible improvements:
a. Use compression techniques to compress the images further (for example, transforms such as wavelet transforms, which compress without much loss) and then apply the functions to the decompressed data later.

b. Read the images into a pandas DataFrame and use pandas boolean masks and comparisons in place of the per-pixel loop, which could make the code faster.