# Advanced Section Activities (ML Classification Techniques):


### Andy and Juan Notes to Complete in Notebook
### Delete once finished
1. ML basics
    * Explain what training means
    * Explain overfitting / variance
    * Evaluation on testing data
* Logistic Regression or Classification 
    * Use age vs. (other)
* Clustering
    * K nearest neighbor 
    * Gaussian Mixture Model 
    * DBSCAN (density based)
    * Explain strengths / weaknesses of each
* Use dimension reduction to visualize clusters
    * Briefly explain dimension reduction
    * Add label coloring to clusters
* SVM??


# Overview

This is the expert-level notebook of the FredHutch.io Data Science (DS) and Machine Learning (ML) tutorial, in which we work from beginning to end through different aspects and techniques of DS and ML for research and analysis.

In this notebook we will work through Machine Learning techniques and strategies using the gene data (datasets available [here](https://www.dropbox.com/sh/jke9h4km90ner9l/AAD1UyucvlXIFbKTjl-D15U6a?dl=0)) for the same five cancer types (BRCA, KIRC, COAD, LUAD, PRAD) from the TCGA projects, available from the [National Cancer Institute's Genomic Data Commons](https://gdc.cancer.gov/).

We will keep working with the *Python libraries* introduced in the Beginner and Intermediate Tutorials and introduce some new libraries with special purposes in **Machine Learning**.
> **Libraries Used in This Tutorial**
* Data Manipulation and Processing
    - [pandas](https://pandas.pydata.org/)
    - [numpy](https://numpy.org/)
* Data Visualization
    - [Matplotlib](https://matplotlib.org/)
    - [Seaborn](https://seaborn.pydata.org/)
    - [Altair](https://altair-viz.github.io/)
* Statistics
    - [Scipy](https://www.scipy.org/)
    - [Statsmodels](https://www.statsmodels.org/stable/index.html)
* Machine Learning
    - [Scikit-Learn](https://scikit-learn.org/stable/)
    
In this notebook we will focus specifically on Machine Learning modeling in **Python**. We'll primarily focus on:
* Introducing what Machine Learning is
* Fundamental concepts and techniques in ML
* Getting familiar with **Scikit-Learn**
* The fundamental categories of models
* Some specific examples of regression and clustering ML models

# Table of Contents
[1. Background on Machine Learning: What is Machine Learning?](#1.-Background-on-Machine-Learning:-What-is-Machine-Learning?)
* [1.1 Types of Machine Learning Models](#1.1-Types-of-Machine-Learning-Models)
    * [1.1.1 Supervised Machine Learning](#1.1.1-Supervised-Machine-Learning)
    * [1.1.2 Unsupervised Machine Learning](#1.1.2-Unsupervised-Machine-Learning)

# 1. Background on Machine Learning: _What is Machine Learning?_

In the world of analytics, and specifically Data Science, _"Machine Learning"_ is a ubiquitous buzzword. It shows up in claims such as _"We use ML in our (insert product)!"_ or _"Just use ML and you'll get the answer"_, which makes it seem like an esoteric concept in Data Science, often associated with AI.

A better way of viewing Machine Learning is as the union of computer programming (Beginner Notebook) and statistical concepts (Intermediate Notebook) in Data Science. The primary idea to remember is that we are building models out of data: our models "learn", or become tuned, from data and can then make predictions on similar but never-before-seen data.
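
The idea of "learning" from data and then predicting on unseen data can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the data is synthetic (not the tutorial's gene data), and `LogisticRegression` is just one convenient Scikit-Learn model chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the tutorial's gene data):
# 200 samples, 2 features, and a binary label that depends on the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the samples so the model never sees them while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# "Learning": the model's parameters are tuned from the training data only.
model = LogisticRegression()
model.fit(X_train, y_train)

# Prediction on similar but previously unseen data.
y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"Accuracy on unseen samples: {test_accuracy:.2f}")
```

Scoring the model on the held-out samples, rather than on the samples it was trained on, gives a more honest estimate of how it would behave on new data.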


## 1.1 Types of Machine Learning Models

There is a large variety of Machine Learning models available for us to work with, but it is important to always remember that the problem or question at hand dictates the type of model we can use. In other words, _don’t try to paint a wall with a hammer_ or _screw on a shelf with a saw_: each tool has a best-use scenario. It might sound enticing to use a fancy-sounding model, say “K-Means” or “Support Vector Machine”, for your research, but if you are trying to predict the effect of drug dosage on how fast a cancer metastasizes, you might be using the wrong tool for the job.

Let’s first discuss the two main categories of Machine Learning algorithms: [_Supervised_](https://en.wikipedia.org/wiki/Supervised_learning) and [_Unsupervised_](https://en.wikipedia.org/wiki/Unsupervised_learning).

### 1.1.1 Supervised Machine Learning

Recall that Machine Learning is about models “learning” from data; _Supervised Machine Learning_ refers to models that “learn” from data that already has labels attached. Models of this type include _Regression_ and _Classification_ models, such as Linear Regression, which predicts a continuous value (age, dosage, time, etc.), or Logistic Regression, which classifies samples into categories (sex, smoker, education level, etc.). If our data already has labels attached and we want to learn how the other variables relate to those labels, we are dealing with a Supervised Learning problem.
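
The difference between the two supervised tasks can be sketched as follows. This is a hedged illustration on synthetic data: the single feature and both label columns are made up for the example and do not come from the tutorial's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# One hypothetical feature for 100 samples (synthetic, for illustration only).
X = rng.uniform(0, 10, size=(100, 1))

# Regression: the label is a continuous value.
y_continuous = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=100)
regressor = LinearRegression().fit(X, y_continuous)
print("Predicted continuous value:", regressor.predict([[4.0]]))

# Classification: the label is a category (here 0 or 1).
y_category = (X[:, 0] > 5).astype(int)
classifier = LogisticRegression().fit(X, y_category)
print("Predicted class:", classifier.predict([[8.0]]))
```

In both cases the model is given the features `X` together with known labels, which is what makes the problem supervised.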

### 1.1.2 Unsupervised Machine Learning

Whereas Supervised Learning requires labels for our data, _Unsupervised Machine Learning_ is the name given to models that use unlabeled data. In practice this mainly means we “omit” any labels that exist in our data and let the model learn patterns from the features alone. These types of models tend to be used for _Clustering_ or for _Dimension Reduction_.
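
Both uses can be sketched on synthetic data. The choice of `KMeans` for clustering and `PCA` for dimension reduction is an assumption made for this illustration, and the two groups of samples are made up rather than taken from the gene data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic data: two well-separated groups in 5 dimensions. The group
# membership is deliberately NOT given to the models.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 5))
group_b = rng.normal(loc=6.0, scale=1.0, size=(50, 5))
X = np.vstack([group_a, group_b])

# Clustering: assign each sample to one of two clusters using features only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimension reduction: project the 5 features onto 2 components for plotting.
X_2d = PCA(n_components=2).fit_transform(X)

print(cluster_ids[:5], cluster_ids[-5:])
print(X_2d.shape)
```

Note that the cluster ids are arbitrary: the model can tell the two groups apart, but it has no notion of what each group means.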

## 1.2 Key Basic ML Ideas and Concepts



In [None]:
# Data Manipulation
import pandas as pd
import numpy as np

# Statistics
from scipy import stats
import statsmodels.api as sm

# Machine Learning
from sklearn.model_selection import train_test_split

# Visualization
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns

# setting up the plot style
plt.style.use('ggplot')
%matplotlib inline

In [None]:
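def create_genes_subset(split_size=.1):
    """
    Create a smaller dataframe from the large 'genes.csv' file, based on a split of the metadata file.

    Returns the log2-transformed genes of the small split merged with their metadata, plus the
    big and small metadata splits. The index of the big split identifies the remaining samples
    in 'genes.csv', so they can be kept independent of the subset.
    """
    metadata = pd.read_csv('../metadata.csv')

    # Fixed random_state so that every run selects the same samples
    big_split, small_split = train_test_split(metadata, test_size=split_size, random_state=4)

    # File lines to skip when reading genes.csv; the +1 accounts for the header line.
    # This relies on metadata.csv and genes.csv listing the samples in the same row order.
    skiplines_small = np.sort(big_split.index) + 1
    skiplines_big = np.sort(small_split.index) + 1  # lines to skip to load the remaining samples (unused here)

    # Load only the rows that belong to the small split
    genes_small = pd.read_csv('../genes.csv', skiprows=skiplines_small)

    # Drop the columns (genes) that are zero for every sample
    genes_nonAllZero = genes_small.loc[:, ~genes_small.isin([0]).all(axis=0)]

    # log2(x + 1) transform of every column except the first one
    genes_log2_trans = np.log2(genes_nonAllZero.iloc[:, 1:] + 1)
    genes_log2_trans['barcode'] = genes_small['barcode']

    # Attach the metadata of each sample through its barcode
    genes_merged = pd.merge(left=small_split, right=genes_log2_trans, how='left', on='barcode')

    return genes_merged, big_split, small_split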
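
The row-skipping step in `create_genes_subset` can be tried out without the real files. Below is a minimal sketch, assuming tiny made-up tables in place of `metadata.csv` and `genes.csv`: the barcodes `S0` to `S9`, the `cancer_type` column, and the `gene_1`/`gene_2` columns are all hypothetical.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up tables standing in for metadata.csv and genes.csv
# (hypothetical values, same sample order in both tables).
metadata = pd.DataFrame({
    "barcode": [f"S{i}" for i in range(10)],
    "cancer_type": ["BRCA", "KIRC"] * 5,
})
genes_csv = "barcode,gene_1,gene_2\n" + "\n".join(
    f"S{i},{i},{2 * i}" for i in range(10)
)

big_split, small_split = train_test_split(metadata, test_size=0.2, random_state=4)

# File line 0 is the header, so metadata row i sits on file line i + 1.
skip = np.sort(big_split.index) + 1
genes_small = pd.read_csv(io.StringIO(genes_csv), skiprows=skip)

# The rows that were read should be exactly the samples of the small split.
print(sorted(genes_small["barcode"]) == sorted(small_split["barcode"]))
```

Skipping lines at read time means the large file never has to be loaded in full, but it only works while both files keep their samples in the same order.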