<a href="https://colab.research.google.com/github/adiojha629/TEWH_Malaria_Adi_Files/blob/master/Rescaling_Malaria_BroadInstitute_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Task Description

**Reformat the Broad Institute dataset into the 'numpy.float32' format.** Create numpy arrays labeled “red_blood_cell_features”, “rings_features”, “schizonts_features”, “trophozoites_features”, and “gametocyte_features”.

## Specifics
The Broad Institute dataset is the dataset we will use to test our convolutional neural network for image classification. *In order to test these images, the dataset must be formatted in the appropriate np.array structure.* 

The folder containing the Broad Institute dataset is named “Malaria_BroadInstitute_Dataset”. It contains the following folders: “gametocyte”, “red_blood_cell”, “rings”, “schizonts”, “trophozoites”, and “white_blood_cell”. Note that gametocytes, rings, schizonts, and trophozoites are all infected red blood cells.

The image data should be stored in a ‘numpy.float32’ data format, with the shape of [x 128 128 3], where x is the number of images. Dimensions 1 and 2 ([128 128]) indicate the desired 128x128 pixels, whereas Dimension 3 ([3]) is for each of the RGB color channels. This means that unlike the training set, for the test set you must resize the images to 128x128 pixels. 
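
As a minimal sketch of the target layout, the snippet below stacks a few synthetic 128x128 RGB images into one array and converts it to float32. The random pixel values stand in for real cell images, and the helper name `stack_as_float32` is hypothetical (it is not part of the original notebook).

```python
import numpy as np

def stack_as_float32(images):
    """Stack a list of (128, 128, 3) images into one (x, 128, 128, 3) float32 array."""
    return np.stack(images, axis=0).astype(np.float32)

# Synthetic stand-ins for resized cell images (uint8, as an image reader would return)
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8) for _ in range(5)]

features = stack_as_float32(fake_images)
print(features.shape, features.dtype)  # (5, 128, 128, 3) float32
```

Here x is 5 because five images were stacked; with the real dataset it would be the number of images in the class folder.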

## Step one: Download Images
We first download the Broad Institute dataset as a zip file, and then we unzip it. Inside, there are six folders, one per class, each containing all of the images for that class.

In [1]:
# Download zip file with images if not already exists
!wget -nc https://utexas.box.com/shared/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip -O Malaria_BroadInstitute_Dataset.zip

--2020-08-04 19:23:31--  https://utexas.box.com/shared/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip
Resolving utexas.box.com (utexas.box.com)... 107.152.29.197
Connecting to utexas.box.com (utexas.box.com)|107.152.29.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip [following]
--2020-08-04 19:23:32--  https://utexas.box.com/public/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip
Reusing existing connection to utexas.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://utexas.app.box.com/public/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip [following]
--2020-08-04 19:23:32--  https://utexas.app.box.com/public/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip
Resolving utexas.app.box.com (utexas.app.box.com)... 107.152.29.201
Connecting to utexas.app.box.com (utexas.app.box.com)|107.152.29.201|:443... connected.
HTTP request sent, awaiting response... 302 Found
Loca

In [2]:
import numpy as np
import os
from shutil import copyfile
from zipfile import ZipFile

# Here we download the Broad Institute dataset as a zip file
!wget -nc https://utexas.box.com/shared/static/6bo67mzvhyoqjd1dgvar3kltniyv503c.zip -O Malaria_BroadInstitute_Dataset.zip

ROOT_DIR = os.path.join("/", "content")
DATASET_DIR = "/content/Malaria_BroadInstitute_Dataset/"

# Extract images if not already extracted
if not os.path.isdir(DATASET_DIR):
    print("Extracting images...")
    with ZipFile(os.path.join(ROOT_DIR, "Malaria_BroadInstitute_Dataset.zip"), "r") as zipObj:
        # Extract into ROOT_DIR explicitly so the result does not depend on the current working directory
        zipObj.extractall(ROOT_DIR)
    print("Done!")

File ‘Malaria_BroadInstitute_Dataset.zip’ already there; not retrieving.
Extracting images...
Done!


## Practice Accessing Image Files
Now to access the actual images located in these folders, we need to know the names of these files. If we use the function ```os.listdir("DirectoryHere")```, we can retrieve a list of every single file name. 

Note that the main directory on this Google Colab document is ```"/content/"```. So for example, if we want to access the folder "red_blood_cell", the directory that goes in the ```os.listdir()``` function would be ```"/content/Malaria_BroadInstitute_Dataset/red_blood_cell"```. 

Here we generate a list of all of the file names in the "red_blood_cell" folder and store it in a variable, conveniently called ```RBCFiles```. We then do the same for the other cell types.

In [3]:
import cv2

# Generate a list of file names for each type of cell
RBCFiles = os.listdir("/content/Malaria_BroadInstitute_Dataset/red_blood_cell")
RingsFiles = os.listdir("/content/Malaria_BroadInstitute_Dataset/rings")
SchiFiles = os.listdir("/content/Malaria_BroadInstitute_Dataset/schizonts")
TrophoFiles = os.listdir("/content/Malaria_BroadInstitute_Dataset/trophozoites")
GameFiles = os.listdir("/content/Malaria_BroadInstitute_Dataset/gametocyte")

# Import an example image as a np.array
ExampleImage = cv2.imread('/content/Malaria_BroadInstitute_Dataset/red_blood_cell/' + RBCFiles[0])
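
Note that ```os.listdir()``` returns names in arbitrary order and includes any non-image files that happen to be in the folder. The sketch below shows one possible way to get a sorted, image-only list. The helper name `list_image_files` and the extension list are assumptions for illustration, and the demonstration runs on a temporary folder rather than the real dataset.

```python
import os
import tempfile

# Assumed set of image extensions; adjust to match the actual dataset
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def list_image_files(directory):
    """Return the image file names in `directory`, sorted for a reproducible order."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )

# Demonstration on a temporary folder instead of the real dataset
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.jpg", "a.JPG", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_files = list_image_files(demo_dir)

print(demo_files)  # ['a.JPG', 'b.jpg']
```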



## Creating New Folders to Save Rescaled Images
Before reshaping our images, we should figure out how to store and export these rescaled images. 


In [4]:
from PIL import Image

#First I make the folder called rescaled_images
#os.makedirs with exist_ok=True does not raise an error if the cell is run more than once
rescaled_image_DIR = "/content/rescaled_images/"
os.makedirs(rescaled_image_DIR, exist_ok=True)

#Now I make subfolders for each type of cell we need (i.e. RBC, Schizonts etc.)
rescaled_RBC_DIR = rescaled_image_DIR + "red_blood_cell_features/"
os.makedirs(rescaled_RBC_DIR, exist_ok=True)

rescaled_Ring_DIR = rescaled_image_DIR + "ring_features/"
os.makedirs(rescaled_Ring_DIR, exist_ok=True)

rescaled_Schi_DIR = rescaled_image_DIR + "schizonts_features/"
os.makedirs(rescaled_Schi_DIR, exist_ok=True)

rescaled_Troph_DIR = rescaled_image_DIR + "trophozoites_features/"
os.makedirs(rescaled_Troph_DIR, exist_ok=True)

rescaled_Game_DIR = rescaled_image_DIR + "gametocyte_features/"
os.makedirs(rescaled_Game_DIR, exist_ok=True)

## Reshaping a Single Image
Here we first develop the approach on a single image, to confirm that the method works before applying it to the whole dataset.

In the code chunk below, a particular image is loaded into a np.array for us to work with.

Here are some useful functions that I used:

- ```np.shape(ArrayHere)```: tells you the shape of the array
- ```cv2.imread('FileDirectoryHere')```: reads an image file and returns it as a np.array.
- ```cv2.resize(InputImage, dsize=(128,128))```: resizes the image so that the resulting np.array has shape (128,128,3).

In [5]:
import numpy as np
import cv2

# Import example image as a np.array
ExampleImage = cv2.imread('/content/Malaria_BroadInstitute_Dataset/red_blood_cell/a1ff36df-71df-4e6e-a65f-ced9a2b381c3cell4.jpg')

# Check the shape of the np.array
print('The shape for this image is:',np.shape(ExampleImage))

# Resize the np.array into (128,128,3)
ResizedImage = cv2.resize(ExampleImage, dsize=(128,128))
print('The new shape for this image is:',np.shape(ResizedImage))

The shape for this image is: (126, 105, 3)
The new shape for this image is: (128, 128, 3)
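
One detail worth keeping in mind is that ```cv2.imread``` returns the color channels in BGR order, while the task describes RGB channels. The sketch below illustrates the conversion on a tiny synthetic array (not a real cell image) by reversing the last axis. On a real image, ```cv2.cvtColor(image, cv2.COLOR_BGR2RGB)``` does the same reordering.

```python
import numpy as np

# Synthetic 2x2 image in BGR order: every pixel is pure blue (B=255, G=0, R=0)
bgr_image = np.zeros((2, 2, 3), dtype=np.uint8)
bgr_image[..., 0] = 255

# Reversing the last axis swaps the channel order from BGR to RGB
rgb_image = bgr_image[..., ::-1]

print(bgr_image[0, 0].tolist())  # [255, 0, 0]
print(rgb_image[0, 0].tolist())  # [0, 0, 255]
```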


## Rescaling and saving all images



In [6]:
#Adi Code
#The code below rescales the images from each category and saves them in their respective folders

#A debugging switch is included:
#If debug = True, all images will be rescaled and saved
#If debug = False, only the first 'x' images of each category will be rescaled and saved

debug = False ##Change this to True for all images to be resized

x = 3 ##Number of images per category to process when debug is False

if debug:
  loop_control_RBC = len(RBCFiles) ##These variables determine how many iterations each for loop below runs
  loop_control_Rings = len(RingsFiles)
  loop_control_Schi = len(SchiFiles)
  loop_control_Troph = len(TrophoFiles)
  loop_control_Game = len(GameFiles)
else:
  loop_control_RBC = x
  loop_control_Rings = x
  loop_control_Schi = x
  loop_control_Troph = x
  loop_control_Game = x



#Red Blood Cells
for index in range(loop_control_RBC):
  image_at_index = cv2.imread('/content/Malaria_BroadInstitute_Dataset/red_blood_cell/' + RBCFiles[index])
  resized_image = cv2.resize(image_at_index, dsize=(128,128))
  #cv2.imread returns BGR channel order, so convert to RGB before handing the array to PIL
  rgb_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
  Img = Image.fromarray(rgb_image)
  Img.save(rescaled_RBC_DIR+"rescaled_RBC_"+str(index)+".png")

#Rings 
for index in range(loop_control_Rings):
  image_at_index = cv2.imread('/content/Malaria_BroadInstitute_Dataset/rings/' + RingsFiles[index])
  resized_image = cv2.resize(image_at_index, dsize=(128,128))
  rgb_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
  Img = Image.fromarray(rgb_image)
  Img.save(rescaled_Ring_DIR+"rescaled_Ring_"+str(index)+".png")

#Schizonts
for index in range(loop_control_Schi):
  image_at_index = cv2.imread('/content/Malaria_BroadInstitute_Dataset/schizonts/' + SchiFiles[index])
  resized_image = cv2.resize(image_at_index, dsize=(128,128))
  rgb_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
  Img = Image.fromarray(rgb_image)
  Img.save(rescaled_Schi_DIR+"rescaled_Schizonts_"+str(index)+".png")

#Trophozoites
for index in range(loop_control_Troph):
  image_at_index = cv2.imread('/content/Malaria_BroadInstitute_Dataset/trophozoites/' + TrophoFiles[index])
  resized_image = cv2.resize(image_at_index, dsize=(128,128))
  rgb_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
  Img = Image.fromarray(rgb_image)
  Img.save(rescaled_Troph_DIR+"rescaled_Trophozoites_"+str(index)+".png")

#Gametocytes
for index in range(loop_control_Game):
  image_at_index = cv2.imread('/content/Malaria_BroadInstitute_Dataset/gametocyte/' + GameFiles[index])
  resized_image = cv2.resize(image_at_index, dsize=(128,128))
  rgb_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)
  Img = Image.fromarray(rgb_image)
  Img.save(rescaled_Game_DIR+"rescaled_Gametocyte_"+str(index)+".png")
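
The task ultimately asks for numpy arrays rather than image files. Assuming the resized images have already been collected into a float32 array, one possible way to persist and reload that array is sketched below. The zero-filled placeholder array, the temporary output folder, and the `.npy` file name are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile
import numpy as np

# Placeholder for a real (x, 128, 128, 3) array of resized images
red_blood_cell_features = np.zeros((3, 128, 128, 3), dtype=np.float32)

with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "red_blood_cell_features.npy")
    np.save(path, red_blood_cell_features)  # write the array to disk
    reloaded = np.load(path)                # read it back into memory

print(reloaded.shape, reloaded.dtype)  # (3, 128, 128, 3) float32
```

The `.npy` format preserves both the shape and the float32 data type, so the reloaded array can be used directly as test input.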