# Day 1: Introduction to RxRx19 dataset

Today we will be familiarizing ourselves with the dataset structure, and performing a few operations to understand it better.

## 1. Loading the datasets

The first step is to clone the repository with the data. The original dataset available on Kaggle has more than 305,520 images, with a total size of more than 400 GB. Hence, we created a subset of 16,000 images and their labels.

> If you are curious to know how we created this subset, you can check [here](https://github.com/ai4all-sfu/comp-biology-2020/blob/master/day0-data-preprocessing.ipynb).



In [None]:
! git clone https://github.com/ai4all-sfu/comp-biology-2020.git

To check if the files are available, we use the code below. We should see two folders: 'sample_data' and 'comp-biology-2020'.

In [None]:
!ls

comp-biology-2020  sample_data


Let us take a look at the structure of the dataset.

![alt text](https://drive.google.com/uc?id=14wlQF2nvj_V1_JuSNpikM4UXWFh3p_Nu)
RxRx19-images.zip consists of the image dataset. As discussed on the slides, two types of cells are being considered: HRCE and Vero cells. Each folder consists of cell images taken from 26 plates. Each plate has ~26,000 images. Each cell site is imaged in five channels, so each channel has ~5,000 images per plate. (The slides discuss how the channels differ from one another.)

Let us take a look at the metadata and embeddings file. To reduce the size of the files, we saved them in pickle format. Let's unpickle them now.



In [None]:
import pandas as pd
import pickle

embeddings = pd.read_pickle('comp-biology-2020/embeddings.pkl', compression = 'xz')
metadata = pd.read_pickle('comp-biology-2020/metadata.pkl', compression = 'xz')

#changing the index
embeddings.set_index('site_id', inplace=True)
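
As a small aside, the compressed-pickle round trip used above can be sketched on a toy table. The column values below are made up for illustration; only `to_pickle`, `read_pickle`, `set_index`, and the `compression='xz'` option mirror what the notebook does.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the embeddings table (hypothetical values)
toy = pd.DataFrame({
    "site_id": ["a_1_A01_1", "a_1_A01_2"],
    "feature_0": [0.12, -0.45],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    # Write an xz-compressed pickle, then read it back the same way
    toy.to_pickle(path, compression="xz")
    restored = pd.read_pickle(path, compression="xz")

# Use site_id as the row index, as done for the real embeddings
restored = restored.set_index("site_id")
print(restored)
```

Reading with the same `compression` value that was used for writing is what lets pandas decompress the file before unpickling it.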

Let us print the head of metadata.pkl so that we can understand what it contains.

In [None]:
metadata.head()

## 2. Understanding the Data

* 'site_id': the identifier of the cell site under consideration. Every cell site has a unique site_id. As we discussed, every cell is analyzed under 4 sites, and each site is imaged in 5 different channels. The format of site_id is 'experiment_plate_well_site'. Each site_id therefore corresponds to five images, one per channel. The comparison between the channels is given below for your reference.

![alt text](https://drive.google.com/uc?id=1MhREWlcUghzmwiMMVz8GXU_p6zWZYonu)

* 'disease_condition': it refers to whether the cell is infected with the SARS-CoV-2 virus or not. In the original 'metadata.csv' there were three disease conditions: Active SARS-CoV-2 (the cell has been infected with the virus), UV Inactivated SARS-CoV-2, and Mock (mock preparations of SARS-CoV-2 on cells). We combined the disease conditions 'UV Inactivated SARS-CoV-2' and 'Mock' into one class: 'Inactive' (because they are similar), so that we can consider two distinct classes in our classification task: 'Active' and 'Inactive'.

* 'treatment': the drug used to treat the cell against the virus. 'treatment_conc' is the concentration at which that drug was applied. For an 'Inactive' cell site, the values under 'treatment' and 'treatment_conc' will be NaN.
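
To make the 'experiment_plate_well_site' format and the two-class labelling concrete, here is a minimal sketch on made-up rows. The example site_id strings are hypothetical (the real values may look different), and the mapping only illustrates how the three original conditions could be collapsed into 'Active' and 'Inactive'; it is not necessarily how the subset was actually built.

```python
import pandas as pd

# Hypothetical rows; the real site_id values may look different
toy = pd.DataFrame({
    "site_id": ["HRCE-1_1_AA02_1", "VERO-2_3_B10_4", "HRCE-2_7_C05_2"],
    "disease_condition": [
        "Active SARS-CoV-2",
        "Mock",
        "UV Inactivated SARS-CoV-2",
    ],
})

# Split from the right into at most four pieces, so any extra
# underscores would stay inside the experiment name
parts = toy["site_id"].str.rsplit("_", n=3, expand=True)
parts.columns = ["experiment", "plate", "well", "site"]

# Collapse the three original conditions into two classes
to_two_classes = {
    "Active SARS-CoV-2": "Active",
    "UV Inactivated SARS-CoV-2": "Inactive",
    "Mock": "Inactive",
}
toy["label"] = toy["disease_condition"].map(to_two_classes)

print(parts.join(toy["label"]))
```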

Let us print embeddings.pkl and take a look at its head.

In [None]:
embeddings.head()

In the above result, for every site_id we can observe 1024 feature values. These are lower-dimensional feature vectors (embeddings) of the images, which give some indication of what each image contains.

In our subset we have 16,000 images in total, chosen from all four cell subfolders (HRCE-1, HRCE-2, Vero-1, and Vero-2). Each image is a 1024 x 1024 x 1 grayscale image. We will not be directly handling the images in our project. Instead, we will be using the embeddings.pkl file.

Let us print the shape of metadata.pkl and embeddings.pkl.

In [None]:
print("Metadata : ",metadata.shape)
print("Embeddings : ",embeddings.shape)

As we can see above, 'metadata' has 16,000 rows for the images and 10 columns for the metadata values for each image. 'embeddings' has 16,000 rows too with 1024 columns denoting 1024 feature values for each image.

Let us join 'metadata' and 'embeddings' to better understand how they relate to each other.

In [None]:
merged = pd.merge(embeddings, metadata, on=['site_id'], how='inner')
merged.head()
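
To see what an inner join on 'site_id' does, here is a small sketch on two made-up tables. The table contents are hypothetical, and 'site_id' is kept as an ordinary column in both tables for simplicity; `validate` and `indicator` are optional `pd.merge` arguments that can help check a join.

```python
import pandas as pd

toy_embeddings = pd.DataFrame({
    "site_id": ["s1", "s2", "s3"],
    "feature_0": [0.1, 0.2, 0.3],
})
toy_metadata = pd.DataFrame({
    "site_id": ["s1", "s2", "s4"],
    "disease_condition": ["Active", "Inactive", "Active"],
})

# Inner join keeps only the site_ids present in both tables;
# validate raises an error if a site_id is duplicated on either side
inner = pd.merge(toy_embeddings, toy_metadata, on="site_id",
                 how="inner", validate="one_to_one")

# Outer join with indicator=True adds a _merge column that shows
# which table(s) each row came from
audit = pd.merge(toy_embeddings, toy_metadata, on="site_id",
                 how="outer", indicator=True)

print(inner)
print(audit[["site_id", "_merge"]])
```

Here 's3' and 's4' appear in only one table each, so they are dropped by the inner join and flagged as 'left_only' and 'right_only' in the audit table.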

Let us pick out the first row and take a clear look at the information that we get about each cell site.

In [None]:
merged.iloc[0,:]

We will be using the merged dataframe in all of our exercises today. Let's eliminate all the other column values from 'merged' and just have our 'feature embeddings' for each image and the corresponding 'disease_condition'.

In [None]:
#keep every row; head() is only used to preview the result
feat_disease = merged.iloc[:, list(range(1025)) + [-3]]
feat_disease.head()
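
Selecting columns by position works only as long as the column order never changes. As a hedged alternative, the sketch below picks the columns by name on a made-up table. The 'feature_' prefix is an assumption about how the embedding columns are named; check `merged.columns` before relying on it.

```python
import pandas as pd

# Made-up stand-in for the merged table (column names are assumptions)
toy_merged = pd.DataFrame({
    "site_id": ["s1", "s2", "s3"],
    "feature_0": [0.1, 0.2, 0.3],
    "feature_1": [1.0, 2.0, 3.0],
    "plate": [1, 1, 2],
    "disease_condition": ["Active", "Inactive", "Inactive"],
})

# Collect the embedding columns by their shared prefix
feature_cols = [c for c in toy_merged.columns if c.startswith("feature_")]

# Keep the identifier, the features, and the label
toy_feat_disease = toy_merged[["site_id"] + feature_cols + ["disease_condition"]]
print(toy_feat_disease)
```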

If you want to learn more about this dataset, explore the following links:
* RxRx19: The First Morphological Imaging Dataset on SARS-CoV-2 Virus ([link](https://www.rxrx.ai/rxrx19) and [github](https://gist.github.com/bmabey/ae215f5c154cbc5c3b7e0a519e3d403b))
* RxRx19a COVID-19 Image Embeddings ([link](https://www.kaggle.com/tunguz/rxrx19a))


Below, given an image, we are going to create feature embeddings using Computer Vision techniques.


On Day 2, we are going to reduce the dimension of our 'embeddings' dataset, learn about several factor models, and see how to scale the dataset.
On Day 3, we will take the 'embeddings', parse the 'disease_condition' value for each of them, and append these values to a list of labels. We will train our model on this data and evaluate it against our test dataset. The result will be a predicted 'disease_condition' label for a new cell image from the test dataset.

Now, we are going to plot the frequency count of each disease condition ('Active' and 'Inactive') in the 'feat_disease' dataframe. We use pyplot to give the figure a title and value_counts from pandas to count the number of occurrences of each category.

In [None]:
import numpy as np
import matplotlib.pyplot as pyplot

fig = pyplot.figure()
fig.suptitle('\nFrequency distribution of active and inactive disease conditions')
feat_disease['disease_condition'].value_counts().plot.bar()
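
To see what value_counts returns before it is plotted, here is a minimal sketch on a made-up list of labels (the labels are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Made-up labels standing in for the disease_condition column
labels = pd.Series(
    ["Active", "Inactive", "Inactive", "Active", "Inactive"],
    name="disease_condition",
)

# Number of occurrences of each category, most frequent first
counts = labels.value_counts()

# The same information expressed as fractions of the total
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

The result is a pandas Series indexed by category, which is why `.plot.bar()` can be called on it directly.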

### Activity 1

Plot the same graph, but in the form of a 'line plot', 'scatter plot', 'pie plot', and a 'density plot'. Try using the 'groupby' function from pandas to plot the graph. Add a legend and title too.


In [None]:
#ADD YOUR CODE HERE 

### Activity 2

Plot the frequency distribution of all 26 plates in the 'merged' dataframe. Use any two plot types of your choice. The plates on the x-axis should be in increasing order.
Then try plotting the same using plot functions from matplotlib.

In [None]:
#ADD YOUR CODE HERE

## Feature Extraction from an image

First, we use scikit-image, a library containing a collection of algorithms for image processing. We will use its 'imread' and 'imshow' functions to read an image and display it.

In [None]:
import numpy as np
from skimage.io import imread, imshow

image = imread('comp-biology-2020/supplement_images_day1/E08_s2_w1.png') 
image.shape, imshow(image)

We will be discussing two methods for feature extraction.

**1. Raw Pixel Feature Vector**

The simplest way is to extract the raw pixel feature vector. The image shape here is 1024 x 1024. Hence, the number of features should be 1,048,576. We can generate this using the reshape function from NumPy where we specify the dimension of the image:

In [None]:
features = np.reshape(image, (1024*1024))

features.shape, features

The shape of the feature vector is (1048576, ). This is nothing but (1024*1024, ).
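
The same flattening can be checked on a tiny made-up array, where every value is easy to follow. The 3 x 4 array below is purely illustrative.

```python
import numpy as np

# A tiny 3 x 4 "image" whose pixel values are 0, 1, ..., 11
toy_image = np.arange(12, dtype=np.uint8).reshape(3, 4)

# Flatten into a 1-D feature vector; -1 lets NumPy infer the length
toy_features = toy_image.reshape(-1)

print(toy_image)
print(toy_features.shape)  # (12,)
print(toy_features)
```

Pixels are laid out row by row, so the first pixel of the second row (`toy_image[1, 0]`) ends up at position 4 of the feature vector.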

Now, let us consider a different method for feature extraction.


**2. Extracting Edge Features**

Consider two objects: a car and a bus. You can recognize the objects in an instant. What are the features that you considered while differentiating each of these images? The shape could be one important factor, followed by color, or size. What if the machine could also identify the shape as we do?

A similar idea is to extract edges as features and use them as the input for the model. An edge is basically a place where there is a sharp change in color. Look at the image below:


In [None]:
image1 = imread('comp-biology-2020/supplement_images_day1/edgedetection.png') 
imshow(image1)

The first image is the original image, and the second one contains the outlines for every object in the image. These are the edges in the image: basically boundaries between different image intensities.

We can see that the original image on the left has various colors and shades, while the “edges-only” representation on the right is black and white. The image on the right requires less storage. By detecting the edges of an image, we are doing away with much of the detail, thereby making the image “more lightweight”. Thus, edge detection can be incredibly useful in cases where we don’t need to maintain all the intricate details of an image, but rather only care about the overall shape.

An image is represented in the form of numbers. Let us consider an image of a black square on a white background.

In [None]:
image1 = imread('comp-biology-2020/supplement_images_day1/pixel_workingimg.png') 
imshow(image1)

Above is a pixel-level representation of the image, where each pixel has a value between 0 (black) and 1 (white). Let us determine if the pixel in the green box is an edge or not. At first sight, we can say that it is an edge as there is a change in pixel intensity from 0 (the black region) to 1 (the white region in the green box). We can help the computer reach the same conclusion by using the neighbouring pixels.

Let’s take a small 3 x 3 box of local pixels centered at the green pixel in question. This box is shown in red. Then, let’s “apply” a filter to this little box. 

We multiply each pixel in the red box by the corresponding value in the filter, element-wise. So, the top left pixel in the red box is 1 whereas the top left value in the filter is -1, so multiplying these gives -1. Each pixel in the result is obtained in exactly the same way.

The next step is to obtain the vertical score. We sum up the pixels in the result, giving us -4. Note that -4 is actually the smallest value we can get by applying this filter (since the pixels in the original image can only be between 0 and 1). Thus, we know the pixel in question is part of a top edge, because we reach the minimum value of -4. In the image below, 'Sum' refers to the 'Vertical Score'.

There are various kinds of filters that can be used to highlight the edges in an image. We are using the 'Sobel' filter in our exercise. Other options include the Prewitt filter and the Canny edge detector.

![alt text](https://drive.google.com/uc?id=1UfTCfQDgGfXZzj9rjVXSNBVY0upc--6D)
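
The arithmetic described above can be checked in a few lines of NumPy. The 3 x 3 window below is an assumption made for illustration: its top row is white (1) and the other two rows are black (0), which is what a window at the top edge of a black square would look like.

```python
import numpy as np

# Vertical Sobel filter used in this notebook
vertical_filter = np.array([[-1, -2, -1],
                            [ 0,  0,  0],
                            [ 1,  2,  1]])

# Assumed 3 x 3 window: white row above, black rows below
window = np.array([[1, 1, 1],
                   [0, 0, 0],
                   [0, 0, 0]])

# Element-wise product, then the sum gives the vertical score
result = vertical_filter * window
vertical_score = result.sum()

print(result)
print(vertical_score)  # -4
```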

### Activity 3 

Try out the same (on paper) by applying the filter on a window from the bottom of the square. What value are you getting? Is it big enough to be declared an edge? 

How will you do the same to find a horizontal edge? Try taking the transpose of the vertical filter and applying this new filter to the image to derive the result, the horizontal score, and ultimately the horizontal edges.

Let us first test it out on a sample image of a puppy. We read the image, and convert it to grayscale.

In [None]:
puppy = imread('comp-biology-2020/supplement_images_day1/pupper.jpg', as_gray=True) 
puppy.shape, imshow(puppy)

Instead of doing the math behind applying a filter to an image ourselves, we can use built-in functions from the skimage library that apply the Sobel horizontal and vertical filters directly, and then display the result.

In [None]:
from skimage.filters import sobel_h, sobel_v
from skimage import feature

#calculating horizontal edges using sobel kernel
edges_sobel_horizontal = sobel_h(puppy)
#calculating vertical edges using sobel kernel
edges_sobel_vertical = sobel_v(puppy)

imshow(edges_sobel_vertical, cmap='gray')

Display edges_sobel_horizontal as well and observe the differences between the two results.

Now, let us obtain the same kind of result by coding the math behind applying the filter to the image.

First, read the image again without converting it to grayscale. Then, assign the Sobel filter values (these are fixed values) for the vertical and horizontal filters.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

puppy = imread('comp-biology-2020/supplement_images_day1/pupper.jpg') 
#define the vertical filter
vertical_filter = [[-1,-2,-1], [0,0,0], [1,2,1]]

#define the horizontal filter
horizontal_filter = [[-1,0,1], [-2,0,2], [-1,0,1]]

#get the dimensions of the image
n,m,d = puppy.shape

Copy the original image into edges_img. We will be inserting our edge values into it one by one. Next, loop over the pixels in our original image: for each pixel, take the 3x3 window of the original image centered on it, to which we will apply the filter. Then, multiply the values in the window by the ones in the vertical filter element-wise, and sum them to obtain the vertical score. Do the same for the horizontal filter.

In our earlier example of applying a filter to an image, we only detected a single horizontal/vertical edge. In order to detect the horizontal edges, vertical edges, and edges that fall somewhere in between, we can combine the vertical and horizontal scores by taking their Euclidean norm (the square root of the sum of their squares) and inserting this edge score into our image. This gives a result comparable to the one we obtained in our previous example by directly using the built-in function.

In [None]:
import numpy as np

#initialize the edges image as a float copy, so that edge scores
#are not truncated or wrapped around by an integer dtype
edges_img = puppy.astype(float)

#convert the filters to arrays so the element-wise products are explicit
vertical_kernel = np.asarray(vertical_filter)
horizontal_kernel = np.asarray(horizontal_filter)

#loop over the interior pixels, where a full 3x3 window is available
for row in range(1, n-1):
    for col in range(1, m-1):
        
        #create little local 3x3 box from the first colour channel
        local_pixels = puppy[row-1:row+2, col-1:col+2, 0]
        
        #apply the vertical filter
        vertical_transformed_pixels = vertical_kernel*local_pixels
        #remap the vertical score
        vertical_score = vertical_transformed_pixels.sum()/4
        
        #apply the horizontal filter
        horizontal_transformed_pixels = horizontal_kernel*local_pixels
        #remap the horizontal score
        horizontal_score = horizontal_transformed_pixels.sum()/4
        
        #combine the horizontal and vertical scores into a total edge score
        edge_score = (vertical_score**2 + horizontal_score**2)**.5
        
        #insert this edge score into all three channels of the edges image
        edges_img[row, col] = [edge_score]*3
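
To check that this procedure behaves as expected, the sketch below runs the same idea on a tiny synthetic image: a black square on a white background. It is a simplified variant, not the notebook's exact code: `edge_scores` and `toy` are hypothetical names, the input is a 2-D grayscale array with values between 0 and 1, and the output starts from zeros instead of a copy of the image.

```python
import numpy as np

def edge_scores(gray):
    """Combine vertical and horizontal Sobel scores for each interior pixel."""
    vertical_filter = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])
    horizontal_filter = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
    n, m = gray.shape
    scores = np.zeros((n, m), dtype=float)
    # Border pixels have no full 3 x 3 window, so they are left at zero
    for row in range(1, n - 1):
        for col in range(1, m - 1):
            window = gray[row - 1:row + 2, col - 1:col + 2]
            vertical_score = (vertical_filter * window).sum() / 4
            horizontal_score = (horizontal_filter * window).sum() / 4
            # Euclidean norm of the two scores
            scores[row, col] = np.hypot(vertical_score, horizontal_score)
    return scores

# Synthetic image: white (1.0) background with a black (0.0) square
toy = np.ones((8, 8))
toy[2:6, 2:6] = 0.0

scores = edge_scores(toy)
print(np.round(scores, 2))
```

Pixels deep inside the square or far from it score 0, while pixels along the square's boundary get positive scores, which is the outline we expect an edge detector to produce.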

Normalize the values in the image, and display.

In [None]:
#remap the values in the 0-1 range in case they went out of bounds
edges_img = edges_img/edges_img.max()
imshow(edges_img)

Save your image results.

In [None]:
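from skimage import img_as_ubyte
from skimage.io import imsave
#convert the 0-1 float image to 8-bit before writing the JPEG
imsave('edge_puppy.jpg', img_as_ubyte(edges_img))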

### Activity 4
Extract the raw pixel feature vector for the set of cell images given in the directory 'test_images_day1'. Think about how we can transform this feature vector back into the image. Try it out if you have any ideas.



In [None]:
#ADD YOUR CODE HERE 

### Activity 5 (Advanced)
Use the edge detection method we just discussed to extract the features for the same set of images from Activity 4. Use both the built-in Sobel filter in skimage and the detailed (manual) method.



In [None]:
#ADD YOUR CODE HERE

### Activity 6 (Advanced)
For those who are interested in exploring this topic further, read about Prewitt filters and Canny filters as alternatives to the Sobel filters we used above. Implement them if you feel comfortable, and we can discuss the results during office hours.

Note: For each activity, save your image results using imsave, to display during the next session.

Feature extraction will not be used in the following days; this is just an introduction to image processing and to how a dataset can be constructed using basic Computer Vision techniques.

In [None]:
#ADD YOUR CODE HERE