# Data Visualization (2017/18)

## Solutions for Assignment 4 - Visualizing tabular data with many features

Presented by Group ??: 
- Name1
- Name2

Date: xx.xx.2017

**Due date**: Friday, 22. Dec. 2017, 22:00

Please hand in a copy of the solved notebook and an HTML export of it.

Remark: To keep the file size low, the notebook contains PNG images of the solutions you should obtain. Double-click on an image to see its markdown code. Do not execute these cells, as you do not have the image files.

In [1]:
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from bokeh.models import ColumnDataSource, ColorBar, LinearColorMapper, CategoricalColorMapper
from bokeh.models import Arrow, NormalHead, LabelSet, Label
from bokeh.plotting import figure, output_notebook, show
from bokeh.palettes import Category10, Category20, Viridis
from bokeh.transform import factor_cmap, linear_cmap
from bokeh.io import export_png
from bokeh.layouts import gridplot, row
from bokeh.core.properties import value

output_notebook()

# Exercise 1: Characterizing car types

In [None]:
# load the data
cars = pd.read_csv( '93cars.dat.csv', sep=r'\s+', na_values='*')

# substitute missing values
cars.LuggageCapacity = cars.LuggageCapacity.fillna(cars.LuggageCapacity.median())
cars.Cylinders = cars.Cylinders.fillna(cars.Cylinders.median())
cars.RearSeatRoom = cars.RearSeatRoom.fillna(cars.RearSeatRoom.median())

In [None]:
cars.columns

<font color='deeppink'>**Task 1a)**</font> Understand your data.
- Which of the variables are quantitative and can be used in a PCA? Store their names in the following variable `var`.

In [None]:
var = []

- Which variables can be used to divide the cars into groups? How many groups do you obtain and what size do they have? Do you expect differences between the groups and, if so, which?

 - Variable:
 - Number of groups + size of each group:
 - Hypothesis:

<font color='deeppink'>**Task 1b)**</font> Compute the PCA.

scikit-learn PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [None]:
# store standardized data in cars_std
cars_std = 

# store PCA in variable pca
pca = 

Important variables for further analysis (see the documentation linked above):

In [None]:
pca.components_
pca.explained_variance_
pca.explained_variance_ratio_
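
As a minimal sketch of the standardize-then-PCA workflow, the following uses a small synthetic matrix instead of the cars data. The names `X`, `X_std` and `pca_demo` are illustrative only and are not part of the assignment; the sketch merely shows what shape the three attributes above have.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a table with 50 rows and 4 quantitative columns
# (the columns deliberately live on very different scales)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4)) * [1.0, 10.0, 100.0, 0.1]

# standardize each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# fit a PCA that keeps all components
pca_demo = PCA().fit(X_std)

print(pca_demo.components_.shape)                # one row per component, one column per variable
print(pca_demo.explained_variance_.shape)        # one variance per component
print(pca_demo.explained_variance_ratio_.sum())  # ratios add up to 1 when all components are kept
```

Without the standardization step, the column with the largest scale would dominate the first component.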

<font color='deeppink'>**Task 1c)**</font> How many principal components to use.

In [None]:
var_exp = pca.explained_variance_ratio_*100
cum_var_exp = np.cumsum(var_exp)
x = ['PC%s' %(i+1) for i in range(len(var))]

source = ColumnDataSource( dict(x=x, var_exp=var_exp, cum_var_exp=cum_var_exp) )

p = figure( plot_width=520, plot_height=400, toolbar_location=None, x_range=x, y_range=(-2,105) )

p.vbar( source=source, x='x', top='var_exp', width=0.9, bottom=0, legend='Explained variance' )

p.circle( x, cum_var_exp, color='orange', size=5, legend="Cumulative explained variance")
p.line( x, cum_var_exp, color='orange', line_width=2 )

p.legend.location = (235,155)
p.legend.border_line_color = None
p.xgrid.visible = False
p.yaxis.axis_label = "Explained variance in percent"

show(p)

Here is a sample image to check your routine:
![](img2.png)

- Roughly how much variance do the first and the first two principal components explain?

- How many components do you need to explain 90% of variance in the data (roughly, use figure estimate)?
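
Besides reading the value off the figure, the count can also be derived from the cumulative sums. The sketch below assumes made-up percentages (`var_exp_demo` is hypothetical and not taken from the cars data) and looks for the first cumulative value that reaches 90.

```python
import numpy as np

# hypothetical explained variance in percent, for illustration only
var_exp_demo = np.array([55.0, 20.0, 10.0, 8.0, 4.0, 3.0])
cum_demo = np.cumsum(var_exp_demo)

# index of the first cumulative value that reaches 90, plus one to get a count
n_needed = int(np.argmax(cum_demo >= 90.0)) + 1
print(n_needed)
```

The same two lines should work with `cum_var_exp` from the cell above, provided the threshold is actually reached.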

<font color='deeppink'>**Task 1d)**</font> Interpret the axes.

The following chart presents a projection of the cars data onto the first two principal components. Use techniques you learned in class to derive the meaning of the axes.

In [None]:
pca_auto = pd.DataFrame( pca.transform(cars_std), columns=['PC%i' % (i+1) for i in range(pca.n_components_)])

pca_auto['label'] = cars.Type
pca_auto['label'] = pca_auto['label'].astype('str')

factors = sorted(pca_auto.label.unique())

In [None]:
source = ColumnDataSource(pca_auto)

p = figure( plot_width=600, plot_height=600, y_range=(-4.5,4.8) )

p.circle( source=source, x='PC1', y='PC2', size=9, legend='label',
          color=factor_cmap('label', palette=Category10[10], factors=factors))
p.xaxis.axis_label = 'PC1' 
p.yaxis.axis_label = 'PC2' 

show(p)

You should get the following chart:
![](img1.png)

- Explain the PC1 and PC2 axes. What is typical for cars on the left vs. right (x-axis)? What is typical for cars located towards the top vs. bottom of the chart (y-axis)?

- Use one of the groupings you discussed in Task 1a) and color the plot accordingly. What can you tell about the groups? Feel free to comment on additional findings here.
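
One common way to interpret principal axes is to inspect the loadings, i.e. the entries of `pca.components_`. The sketch below is only an illustration on synthetic data; the column names `size`, `weight` and `noise` and the variables `demo`, `pca_demo` and `loadings` are assumptions and not part of the cars data set.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic data: 'size' and 'weight' are strongly correlated, 'noise' is independent
rng = np.random.RandomState(1)
size = rng.normal(size=200)
demo = pd.DataFrame({
    'size': size,
    'weight': size + 0.1 * rng.normal(size=200),
    'noise': rng.normal(size=200),
})

pca_demo = PCA().fit(StandardScaler().fit_transform(demo))

# rows: original variables, columns: principal components
loadings = pd.DataFrame(pca_demo.components_.T, index=demo.columns,
                        columns=['PC%i' % (i+1) for i in range(pca_demo.n_components_)])

# variables with the largest absolute loading dominate an axis
print(loadings['PC1'].abs().sort_values(ascending=False))
```

In this toy example the two correlated variables should carry most of the weight on PC1. The sign of a loading indicates the direction in which a variable pushes a point along the axis.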

# Exercise 2: Analyzing tumor data

The following code reads the tumor data and does some preprocessing. You have access to the following variables:

- **df** contains the tabular data with genes and labels. Labels are stored in the last column
- **labels** the last column as a separate array
- **df_std** a standardized version of the data
- **ngenes** the number of genes
- **cancer_types** list of cancers contained in the data

In [None]:
# read the gene expression data, transpose it, and name the columns
df = pd.read_csv('nci.csv', header=None, sep=r'\s+')
df = df.T
df.columns = ['G%i' % i for i in range(len(df.columns))]

ngenes = len(df.columns)

# read the labels and add them to the dataframe
labels = pd.read_csv( 'nci-labels.csv' )
df['cancer'] = labels

# remove synthetic tumors
df = df[df['cancer'].isin(['CNS', 'RENAL', 'BREAST', 'NSCLC', 'UNKNOWN', 'OVARIAN', 'MELANOMA',
       'PROSTATE', 'LEUKEMIA', 'COLON' ])]
df = df.reset_index()
labels = df.cancer

# get a standardized version of the data
df_std = StandardScaler().fit_transform(df.iloc[:,1:-1])

# get the occurring cancer types
cancer_types = sorted(labels.unique())

**Helper routines**

In [None]:
def getLegend( df ):
    df = df.sort_values(by=['cancer'])
    source = ColumnDataSource(df)

    l = figure( plot_width=200, plot_height=300, y_range=(100,101) )
    l.circle( source=source, x='G0', y='G1', size=7, legend='cancer', 
              color=factor_cmap('cancer', palette=Category10[10], factors=cancer_types))
    l.xaxis.visible = False
    l.yaxis.visible = False
    l.xgrid.visible = False
    l.ygrid.visible = False
    l.legend.location = "top_left"
    l.legend.border_line_color = None
    
    return l

In [None]:
df.cancer.value_counts()

<font color='deeppink'>**Task 2a)**</font> Make a feasibility check.

- Show 6 scatterplots of your data using random genes for the axes. 

In [None]:
row = [] 

p = figure( plot_width=300, plot_height=300 )
ids = np.random.choice( ngenes, 2, replace=False )   # two distinct gene indices in 0..ngenes-1
var1 = 'G'+str(ids[0])
var2 = 'G'+str(ids[1])
p.circle( source=df, x=var1, y=var2, size=7,
          color=factor_cmap('cancer', palette=Category10[10], factors=cancer_types))
p.xaxis.axis_label = var1
p.yaxis.axis_label = var2
row.append(p)

row.append(getLegend(df))
    
show( gridplot([row]))


Should give:
![](img3.png)

- Looking at the charts you created above, can you find clusters for the different cancer types?

- Explain the following chart (also understand the code to compute it).

 Hint: The method `np.linalg.norm(x-y)` computes the Euclidean distance between two vectors `x` and `y`.
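
A small sketch of the hint, using made-up points that are unrelated to the tumor data: it computes one distance and then finds the nearest other point by sorting the distances, which is the idea the cell below applies to every sample.

```python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

# norm of the difference vector, i.e. sqrt(3**2 + 4**2)
d = np.linalg.norm(x - y)
print(d)

# three illustrative points; the first one is the query point
pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
dists = np.array([np.linalg.norm(pts[0] - q) for q in pts])

# position 0 of the sorted order is the query itself (distance 0),
# position 1 is its nearest other point
order = np.argsort(dists)
nearest = int(order[1])
print(nearest)
```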

In [None]:
cnt = pd.DataFrame( index=cancer_types, columns=['same', 'other']).fillna(0)

for i in df_std:
    dfc = pd.DataFrame( {'d': [np.linalg.norm(i-j) for j in df_std], 'l': labels} ).sort_values(by='d')
    if dfc.iloc[0,1] == dfc.iloc[1,1]: 
        cnt.loc[dfc.iloc[0,1],'same'] += 1
    else: 
        cnt.loc[dfc.iloc[0,1],'other'] += 1
        
p = figure( y_range=cancer_types, plot_height=350 )
p.hbar_stack( ['same','other'], y='index', color=['steelblue','orange'], height=0.9, source=cnt,
              legend=[value(x) for x in ['same','other']])
p.legend.location = 'bottom_right'
show(p)

![](img4.png)

- In summary, what do you think? Is it possible to classify different types of cancer using gene expression data? Which cancer types are probably very hard to describe, and which are easy? Why?

<font color='deeppink'>**Task 2b)**</font> Compute the PCA projection and visualize the first two components in a scatterplot. Add outlines to each cancer type to make analysis easier.

Hint: scipy provides a convex hull implementation https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html.
**hull.vertices** returns the indices of the points on the hull boundary; for 2-D input they are in counterclockwise order.
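
A minimal sketch of how `ConvexHull` behaves, assuming a handful of made-up 2-D points instead of the projected tumor data (`pts`, `xs` and `ys` are illustrative names):

```python
import numpy as np
from scipy.spatial import ConvexHull

# four corners of a unit square plus one interior point
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])

hull = ConvexHull(pts)

# indices of the boundary points; the interior point (index 4) is not included
print(sorted(hull.vertices))

# coordinates of the outline in boundary order
xs = pts[hull.vertices, 0]
ys = pts[hull.vertices, 1]
```

Coordinates in this form could then be handed to a polygon glyph such as bokeh's `patch`, which is one possible way to draw the outline. Note that `ConvexHull` needs at least three points that are not all on one line, so very small groups may need special handling.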

In [None]:
# compute pca of data df_std
pca = 

# create a dataframe of the projected data
pca_data = pd.DataFrame( pca.transform(df_std), columns=['PC%i' % (i+1) for i in range(pca.n_components_)])
pca_data['label'] = labels

In [None]:
source = ColumnDataSource(pca_data)

# color mapper. Usage: mapper['LEUKEMIA'] returns the color of the given cancer type.
mapper = dict(zip(cancer_types, Category10[10]))

p = figure( plot_width=600, plot_height=600, toolbar_location=None )

# render the convex hulls first
for c in cancer_types:
    # get the data for the current cancer type
    data = pca_data[pca_data.label == c].reset_index()
    
    # your code is missing here


p.circle( source=source, x='PC1', y='PC2', size=9, 
          color=factor_cmap('label', palette=Category10[10], factors=cancer_types))
p.xaxis.axis_label = 'PC1' 
p.yaxis.axis_label = 'PC2' 

        
show( gridplot( [[p, getLegend(df)]] ) )

Sample solution:
![](img5.png)

- Identify tumor classes (+ explain briefly):
 - Cluster
 - No compact cluster
 - Hard to distinguish

<font color='deeppink'>**Task 2c)**</font> How confident should you be in the results?

In [None]:
n = 60

var_exp = pca.explained_variance_ratio_[:n]*100
cum_var_exp = np.cumsum(var_exp)
x = ['PC%s' %(i+1) for i in range(n)]

source = ColumnDataSource( dict(x=x, var_exp=var_exp, cum_var_exp=cum_var_exp) )

p = figure( plot_width=520, plot_height=400, toolbar_location=None, x_range=x, y_range=(-2,105) )

p.vbar( source=source, x='x', top='var_exp', width=0.9, bottom=0, legend='Explained variance' )

p.circle( x, cum_var_exp, color='orange', size=5, legend="Cumulative explained variance")
p.line( x, cum_var_exp, color='orange', line_width=2 )

p.legend.location = (235,155)
p.legend.border_line_color = None
p.xgrid.visible = False
p.yaxis.axis_label = "Explained variance in percent"

show(p)

![](img6.png)

- Take a look at the provided explained variance plot. How reliable is your analysis based on the first two components?