-----------------------------------------------------------------------------------------------------------------------
<h2> <u>Goals of this notebook</u> </h2>

* The first step of doing data analysis is to **read the data**!
* After this, some **simple visualization** gives us some intuition of how the data is structured.
* With this in mind, let's start by reading and visualizing some commonly used data formats.

-----------------------------------------------------------------------------------------------------------------------
<h2> <u>What's a Jupyter notebook, anyway?</u> </h2>

* If you have not used Jupyter notebooks before, the following instructions may be helpful:
  * You are currently inside a Jupyter notebook.
  * The notebook consists of 'cells' of code. Cells are the grey boxes with an "In [ ]:" on the left.
  * When you click on a cell, a green box appears around the cell to indicate that the cell has been selected. Now, you can execute the code within a cell in one of two ways:
      * by clicking on 'Run' in the toolbar on the top of the notebook.
      * by pressing 'Ctrl + enter'.
  * You can create a new cell by clicking on 'Insert' in the toolbar on the top of the notebook.

-----------------------------------------------------------------------------------------------------------------------
<h2> <u>What am I supposed to do?</u> </h2>

* The code in the first few cells is already written.
* Read through this code and make sure you understand it. Feel free to insert new code cells in between and print intermediate values to better understand what is going on.
* Towards the end, some blocks are left empty for you to fill in.

-----------------------------------------------------------------------------------------------------------------------
<h2> <u>Some tips</u> </h2>

* The print command is your friend. Once you read anything into a variable v, you can print(v) to see the contents of v. Use this to understand the data inside v. Further, depending on the type of v, you can do print(v.shape), print(len(v)), print(v.size). Use print extensively to understand as well as to debug your own code!
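
As a small illustration of this tip, the sketch below inspects a made-up NumPy array `v` (it is not part of the workshop data; the name and values are chosen only for this example).

```python
import numpy as np

# A made-up 2D array standing in for data you have just read.
v = np.arange(6).reshape(3, 2)

print(v)        # the contents of v
print(type(v))  # the kind of object v is
print(v.shape)  # size along each dimension
print(len(v))   # length of the first dimension
print(v.size)   # total number of elements
```

Note that `len(v)` and `v.size` differ as soon as `v` has more than one dimension.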

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u>Import required modules</u> </h2>


In [None]:
import numpy as np # numerical library
import matplotlib.pyplot as plt # for plotting
import pandas as pd # data handling similar to R

<h2> <u>Mount Google drive folder</u> </h2>


In [None]:
# Mount Google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd /content/drive/My Drive/ML_workshop

In [None]:
ls

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u>Toy data for Regression</u> </h2>

In regression, the goal is to approximate a mapping from an input to an output, where the output is an integer or floating point value. Some instances of a regression task could be:
* to map the salary of a person (input) to the rent of their apartment (output). 
* to map a person's brain MR image (input) to their age (output).

In the machine learning literature, inputs are sometimes also called 'features', and outputs are sometimes also called 'labels'. In the following two cells, we will read and visualize features and labels for a toy regression task.

-----------------------------------------------------------------------------------------------------------------------

The following cell reads data from .txt files

<a href="https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html">np.loadtxt documentation</a>

In [None]:
features = np.loadtxt('machine_learning/data/features_linear_regression.txt')
labels = np.loadtxt('machine_learning/data/labels_linear_regression.txt')
nsamples = features.size # total number of elements; equals the number of samples for a 1D array
print('Number of samples: {}'.format(nsamples))
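
To see what `np.loadtxt` does without needing the workshop files, here is a minimal sketch that reads from an in-memory text buffer instead of a file on disk. The contents of `fake_file` are invented for this example; `np.loadtxt` accepts either a filename or a file-like object.

```python
import io
import numpy as np

# Hypothetical file contents: one value per line, as in a plain .txt file.
fake_file = io.StringIO("0.5\n1.5\n2.5\n3.5\n")

values = np.loadtxt(fake_file)  # parses the text into a float array
print(values)        # the parsed values
print(values.shape)  # one dimension with 4 entries
print(values.size)   # total number of elements
```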

-----------------------------------------------------------------------------------------------------------------------

The following cell visualizes the read data in a scatter plot. Do you think that the labels are correlated with the features? If yes, what kind of a model could be used to express this correlation? For instance, would a linear model suffice in this toy example?

<a href="https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.scatter.html">plt.scatter documentation</a>

In [None]:
plt.scatter(features, labels)
plt.grid(True)
plt.xlabel('features')
plt.ylabel('labels')
plt.show()
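
One possible way to back up the visual impression with numbers is to compute the Pearson correlation and fit a straight line. The sketch below does this on synthetic data, not on the workshop files: `x`, `y`, the slope of 2 and the noise level are all assumptions made for this example.

```python
import numpy as np

# Synthetic data: a straight line plus Gaussian noise (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Least-squares fit of a degree-1 polynomial (a straight line).
slope, intercept = np.polyfit(x, y, deg=1)

print('Pearson correlation: {:.3f}'.format(r))
print('Fitted line: y = {:.2f} * x + {:.2f}'.format(slope, intercept))
```

A correlation close to 1 or -1 suggests that a linear model may be a reasonable first choice; it does not prove that the relationship is linear.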

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u>Toy data for Binary Classification</u> </h2>

In classification, the goal is to approximate a mapping from an input to an output, where the output is a discrete variable that can take one of a specified set of values. Some instances of a classification task could be:
* to map the text of an email (input) to a label as to whether the email is spam or not (output). 
* to map a person's brain MR image (input) to a label as to whether the person has a brain tumour or not (output).
* to map a person's lung CT image (input) to a label as to whether the person is COVID positive or not (output).


As before, inputs are sometimes also called 'features', and outputs are sometimes also called 'labels'; for classification tasks, outputs may also be called 'categories'. In the following two cells, we will read and visualize features and labels for a toy classification task.

-----------------------------------------------------------------------------------------------------------------------

The following cell reads data from .csv files

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">pd.read_csv documentation</a>

<a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe">Intro. to pandas dataframes</a>

In [None]:
features = pd.read_csv('machine_learning/data/features_linear_classification.csv')
labels = pd.read_csv('machine_learning/data/labels_linear_classification.csv')
nsamples = features.shape[0] # the .shape attribute holds the size of each dimension as a tuple; (100, 2) in this case
ndims = features.shape[1]
print(labels.shape)
print('Number of samples: {}'.format(nsamples))
print('Number of features: {}'.format(ndims))
print('Feature names: {}'.format(features.columns))
print(features)
print(labels['0'])
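
The indexing `labels['0']` uses a string because `pd.read_csv` takes the column names from the first line of the file and stores them as strings. The sketch below shows this on made-up CSV text; `toy_features` and `toy_labels` are hypothetical names, and the assumption that the workshop label file has a header line containing `0` is inferred from the code above.

```python
import io
import pandas as pd

# Made-up CSV text; the first line of each buffer is the header row.
features_csv = io.StringIO("feature1,feature2\n0.1,1.2\n0.4,-0.3\n-0.7,0.8\n")
labels_csv = io.StringIO("0\n1\n-1\n1\n")

toy_features = pd.read_csv(features_csv)
toy_labels = pd.read_csv(labels_csv)

print(toy_features.shape)          # (number of rows, number of columns)
print(list(toy_features.columns))  # column names taken from the header
print(list(toy_labels.columns))    # the header "0" becomes the string '0'
print(toy_labels['0'])             # select that column by its name
```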

-----------------------------------------------------------------------------------------------------------------------

The following cell visualizes the read data in a plot, with the two types of labels being marked with different symbols. Again, think of the complexity of the 'decision boundary' required to separate the two classes. For instance, would a linear decision boundary suffice to do this binary classification task?

<a href="https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot">plt.plot documentation</a>

In [None]:
pos_rows = labels['0'] > 0
neg_rows = labels['0'] <= 0
plt.plot(features.feature1[pos_rows],features.feature2[pos_rows],'+',markersize=10,mew=2)
plt.plot(features.feature1[neg_rows],features.feature2[neg_rows],'_',markersize=10,mew=2)
plt.grid(True)
plt.xlabel('feature1',fontsize=16)
plt.ylabel('feature2',fontsize=16)
plt.show()
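
The plotting cell relies on boolean masks: comparing a column with a number gives one True/False value per row, and indexing with that mask keeps only the rows marked True. A minimal sketch on a made-up table (the values and the `toy_` names are assumptions for this example):

```python
import pandas as pd

# Made-up features and labels with four samples.
toy_features = pd.DataFrame({'feature1': [0.1, 0.4, -0.7, 0.9],
                             'feature2': [1.2, -0.3, 0.8, 0.0]})
toy_labels = pd.DataFrame({'0': [1, -1, 1, -1]})

pos_rows = toy_labels['0'] > 0   # True where the label is positive
neg_rows = toy_labels['0'] <= 0  # True for all remaining rows

print(pos_rows.tolist())                         # one boolean per sample
print(toy_features.feature1[pos_rows].tolist())  # feature1 of the positive class
print(toy_features.feature1[neg_rows].tolist())  # feature1 of the negative class
```

The mask and the table must have the same number of rows (and, for pandas objects, matching row indices) for this selection to work.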

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u>Medical Images</u> </h2>

Reading medical image formats can be slightly different than reading other types of data. There are various useful libraries for this purpose. "nibabel" is one of the popular ones.

<a href="https://nipy.org/nibabel/gettingstarted.html">Nibabel introduction</a>

In [None]:
import nibabel as nib

-----------------------------------------------------------------------------------------------------------------------

Read an MR image. This returns the pixel values of the image, plus information about the imaging protocol.

In [None]:
I = nib.load('machine_learning/data/example_mri.nii') 
print(I)

-----------------------------------------------------------------------------------------------------------------------

Get the pixel values.

In [None]:
V = I.get_fdata()
print(V.shape)
print(V.dtype)
print(type(I))
print(type(V))

-----------------------------------------------------------------------------------------------------------------------

The image we read is 3D. The following cell visualizes 2D slices in different orientations.

In [None]:
plt.figure(figsize = (12, 6))

# Note: np.int was removed in NumPy 1.24, so integer division (//) is used instead
cx = V.shape[0] // 2 # index of the central slice along the x-direction
cy = V.shape[1] // 2 # index of the central slice along the y-direction
cz = V.shape[2] // 2 # index of the central slice along the z-direction

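# The plane names below assume the array axes follow the usual (x, y, z) = (left-right, back-front, bottom-top) order
plt.subplot(1,3,1), plt.imshow(V[:,:,cz], cmap='gray') # fixed z: axial plane
plt.subplot(1,3,2), plt.imshow(V[:,cy,:], cmap='gray') # fixed y: coronal plane
plt.subplot(1,3,3), plt.imshow(V[cx,:,:], cmap='gray') # fixed x: sagittal plane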
plt.show()
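
To check which array axes end up in each 2D slice, the sketch below slices a made-up volume whose three dimensions all have different sizes. The array `vol` and its shape are assumptions for this example, not the workshop image.

```python
import numpy as np

# Made-up 3D volume with a different size along each axis.
vol = np.zeros((4, 6, 8))

# Central index along each axis, using integer division.
cx, cy, cz = (n // 2 for n in vol.shape)

print(vol[cx, :, :].shape)  # fixing x leaves the (y, z) axes
print(vol[:, cy, :].shape)  # fixing y leaves the (x, z) axes
print(vol[:, :, cz].shape)  # fixing z leaves the (x, y) axes
```

Fixing one index removes that axis, so each slice is a 2D array that `plt.imshow` can display directly.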

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u> Exercise 1A:</u></h2>

* (i) Read features from a csv file and labels from a text file
* (ii) Plot the features and the labels using the same procedure we used previously.
* The feature file to be read is: "data/ex1_features_classification.csv", and the label file is: "data/ex1_labels_classification.txt"

Essentially, you need to copy and adapt parts of the code from some of the previous cells.

For this new data, think of the complexity of the 'decision boundary' required to separate the two classes. For instance, would a linear decision boundary suffice to do this binary classification task?

In [None]:
# TODO - reading and visualizing classification data from csv and txt files

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u> Exercise 1B:</u></h2>

* (i) Read an MR image.
* (ii) Display it the same way we did above.
* The MR image to be read is: "data/ex1_mri.nii.gz"

Essentially, you need to copy and adapt parts of the code from some of the previous cells.

In [None]:
# TODO - reading and visualizing MRI image

-----------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------

<h2> <u> End of notebook</u></h2>
    
-----------------------------------------------------------------------------------------------------------------------
