# This notebook walks you through some helpful tools for the final project.
## 1 Installation
This time there is no more Vocareum! Because the project is very open-ended, working on your own machine allows more creativity, so you need to install Jupyter Notebook and the other packages locally. We recommend installing Anaconda first. Anaconda is a Python distribution that helps you manage packages and create virtual environments. A virtual environment is a Python environment such that the Python interpreter, libraries and scripts installed into it are isolated from those installed in other virtual environments, and (by default) any libraries installed in a “system” Python, i.e., one which is installed as part of your operating system [1].
### Recommended packages
There are many resources out there about installing the packages we will talk about below, and we recommend you follow this order:  
* Install Anaconda, and create a virtual environment. Then within the environment:  
* Install Python 3.5+
* Install Jupyter Notebook. Make sure "Python 3" is listed as a kernel. 
* Install whatever packages you like, for example:  
    * "conda install scikit-learn"   
    * https://pytorch.org/get-started/locally/   
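
Once the steps above are done, a quick check from inside the notebook can confirm that the kernel is running in the intended environment and can find the packages. The following is a minimal sketch; the helper name `check_packages` is made up for illustration, and the list of packages is an assumption based on the tools covered in this notebook.

```python
import importlib.util
import sys

def check_packages(names):
    """Return a dict mapping each import name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# These are import names, not install names: scikit-learn is imported as "sklearn".
status = check_packages(["pandas", "sklearn", "torch"])

# sys.prefix is the root folder of the active environment.
print("Python", sys.version_info.major, sys.version_info.minor, "at", sys.prefix)
for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

If a package is reported as missing, the notebook kernel is probably not the one from the environment where you installed it.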

## 2 Pandas
Pandas is a good tool for processing tabular data, such as CSV and Excel files, in Python.

### Basic usage
Assume we have a file called thanksgiving.xls: 

In [1]:
import pandas as pd
# Always start by reading the file into memory (use pd.read_csv for a CSV file).
# The return value is a pandas.DataFrame object.
df = pd.read_excel('./thanksgiving.xls')
# Gives you first few rows.
df.head()

FileNotFoundError: [Errno 2] No such file or directory: './thanksgiving.xls'
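
The `FileNotFoundError` above only means that `thanksgiving.xls` was not in the working directory when the cell was run. If you do not have the file, one way to follow along is to build an equivalent DataFrame by hand. The sketch below uses only the values visible in the outputs of this section; the real file may contain more rows or columns.

```python
import pandas as pd

# Values copied from the cell outputs in this section (an assumption about the file's content).
df = pd.DataFrame({
    "Name": ["Anna", "Peter", "Brian"],
    "Food": ["Turkey", "Seafood", "Turkey"],
    "Studied?": ["No", "No", "Yes"],
})

# Same call as with a DataFrame loaded from disk.
print(df.head())
```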

In [26]:
# Select single column
df['Food']

0     Turkey
1    Seafood
2     Turkey
Name: Food, dtype: object

In [29]:
# To select multiple columns, pass a list of column names
df[['Name', 'Studied?']].to_numpy()

array([['Anna', 'No'],
       ['Peter', 'No'],
       ['Brian', 'Yes']], dtype=object)

In [28]:
# Row selection
df[df["Food"] =='Turkey']

Unnamed: 0,Name,Food,Studied?
0,Anna,Turkey,No
2,Brian,Turkey,Yes


You may also look into how to find NaNs in the data, how to convert a DataFrame into a NumPy array, etc.
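
As a starting point for both tasks, here is a minimal sketch on a made-up table. The column names `Hours` and `Score` and all values are hypothetical. Dropping incomplete rows is only one possible way to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries.
df = pd.DataFrame({
    "Name": ["Anna", "Peter", "Brian"],
    "Hours": [2.0, np.nan, 5.0],
    "Score": [80.0, 70.0, np.nan],
})

nan_mask = df.isna()                      # boolean DataFrame, True where a value is missing
nan_per_column = nan_mask.sum()           # number of missing values in each column
rows_with_nan = df[nan_mask.any(axis=1)]  # rows that contain at least one NaN

clean = df.dropna()                       # one option: drop the incomplete rows
features = clean[["Hours", "Score"]].to_numpy()  # plain NumPy array of floats

print(nan_per_column)
print(features)
```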

## 3 Sklearn
Scikit-learn is a free machine learning library for Python. It features various algorithms such as support vector machines, random forests, and k-nearest neighbors, and it works with Python numerical and scientific libraries like NumPy and SciPy [2]. This makes it a great tool for completing the baseline part of this project.

### Basic Usage [3]
Although the mechanisms of different ML algorithms vary a lot, Sklearn provides a few neat methods that almost all of them share, so you don't need much code to try many algorithms.
 
__fit(X, y)__  
Fit the model using X as training data and y as target values

__get_params([deep])__  
Get parameters for this estimator.

__predict(X)__  
Predict the class labels for the provided data.

__predict_proba(X)__  
Return probability estimates for the test data X.

__score(X, y[, sample_weight])__  
Return the mean accuracy on the given test data and labels.

In [32]:
# example 
from sklearn.neighbors import KNeighborsClassifier
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1],[3.5]]))
print(neigh.predict_proba([[0.9]]))

[0 1]
[[0.66666667 0.33333333]]
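
Because these methods are shared, trying several baselines can be a short loop. The sketch below reuses the toy data from the example above. The choice of the three classifiers is arbitrary, and the score is computed on the training data only to keep the example small. For the project you would score on held-out data instead.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Any estimators with fit/predict/score would work here; these three are just examples.
models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "tree": DecisionTreeClassifier(random_state=0),
    "logreg": LogisticRegression(),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                   # same call for every estimator
    scores[name] = model.score(X, y)  # mean accuracy, here on the training data
    print(name, scores[name], model.predict([[1.1], [3.5]]))
```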


## 4 PyTorch
If you are not satisfied with the baseline methods, here is the more advanced part: neural networks (NN)! Since most students won't have access to GPUs, we keep the data size for this project very small, so you can start with the CPU version of PyTorch. However, please remember that once the data or the model gets larger, as they always do in real research, using CPUs alone will be too slow.

We recommend PyTorch as the deep learning framework because of its popularity and its similarity to NumPy. If you want to know more about how PyTorch compares with TensorFlow, here are some good articles: https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/ and https://towardsdatascience.com/pytorch-vs-tensorflow-in-2020-fe237862fae1

The basic pipeline contains 5 steps: preparing the data, building the model, training, validating, and testing. We give a short introduction to the first three, since the code for validating and testing is similar to the code for training.

### Data preparation
To create your own dataset, you need to subclass torch.utils.data.Dataset and implement the following methods yourself. This class is the interface between a customized dataset and the general torch.utils.data.DataLoader, which can process any dataset that provides those methods.

In [None]:
import torch
from torch.utils.data import Dataset

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        # Where the initial logic happens like reading a csv, doing data augmentation, etc.
        raise NotImplementedError

    def __len__(self):
        # Returns count of samples (an integer) you have. 
        raise NotImplementedError

    def __getitem__(self, idx):
        # Given an index, returns the corresponding datapoint. 
        # The dataloader calls this function for you, which is equivalent to:
        # sample, label = dataset[99]  # i.e. dataset.__getitem__(99), the item at index 99
        raise NotImplementedError
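
To make the template above concrete, here is one possible way to fill it in. It is a sketch under strong assumptions: every column of the CSV is numeric and the last column is the target. The class name `NumericCSVDataset` and the toy file are made up for illustration.

```python
import os
import tempfile

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

class NumericCSVDataset(Dataset):
    """Hypothetical dataset: all columns are numeric and the last one is the target."""

    def __init__(self, csv_file):
        frame = pd.read_csv(csv_file)
        values = torch.tensor(frame.to_numpy(), dtype=torch.float32)
        self.features = values[:, :-1]   # all columns except the last
        self.targets = values[:, -1:]    # last column, kept with shape (N, 1)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

# Write a tiny CSV to a temporary folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    with open(path, "w") as f:
        f.write("x1,x2,y\n1,2,3\n4,5,9\n7,8,15\n")
    dataset = NumericCSVDataset(path)    # the data is fully loaded into memory here

loader = DataLoader(dataset, batch_size=2, shuffle=False)
first_features, first_targets = next(iter(loader))
print(len(dataset), first_features.shape, first_targets.shape)
```

Loading everything in `__init__` is fine for small files like the ones in this project. For data that does not fit in memory, you would read each sample lazily in `__getitem__` instead.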

### Model Architecture
Similarly, to build your own model, you need to subclass torch.nn.Module and implement the following 2 methods, so that your model can be called as __prediction=model(data)__. Below is an example of a CNN, but for this project an MLP may be enough.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        """
        You need to initialize most NN layers here.
        """
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        """
        Define in what order the input x is forwarded through all the NN layers to become a final output. 
        """
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # Flatten everything except the batch dimension before the linear layers
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
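
Since an MLP may be enough for this project, here is a minimal sketch of one that follows the same two-method pattern. The class name `MLP`, the single hidden layer, and all layer sizes are arbitrary choices for illustration, not a recommended architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        # Two affine layers with one hidden representation in between.
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.fc2 = nn.Linear(hidden_features, out_features)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # raw outputs, no activation on the last layer

# Hypothetical sizes: 4 input features, 16 hidden units, 1 output value.
model = MLP(in_features=4, hidden_features=16, out_features=1)
batch = torch.randn(8, 4)   # a batch of 8 samples
prediction = model(batch)   # calls forward() through nn.Module.__call__
print(prediction.shape)
```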

### Training 
The example below shows the most fundamental steps in training a NN.

In [None]:
import torch.optim as optim
from torch.utils.data import DataLoader

def train(csv_file):
    # Initialize an object of the model class
    net = Net()
    # Define your loss function
    criterion = nn.MSELoss()
    # Create your optimizer
    optimizer = optim.SGD(net.parameters(), lr=0.01)
    # Initialize an object of the dataset class
    dataset = CSVDataset(csv_file)
    # Wrap a dataloader around the dataset object.
    dataloader = DataLoader(dataset)
    # Begin training!
    for batch_idx, (input, target) in enumerate(dataloader):
        # You always want to use zero_grad(), backward(), and step() in the following order.
        # zero_grad clears old gradients from the last step (otherwise you’d just accumulate the gradients from all loss.backward() calls).
        optimizer.zero_grad()
        # As noted above, you can call net(input) like this only because the network subclasses nn.Module.
        output = net(input)
        loss = criterion(output, target)
        # loss.backward() computes the derivative of the loss w.r.t. the parameters (or anything requiring gradients) using backpropagation.
        loss.backward()
        # optimizer.step() causes the optimizer to take a step based on the gradients of the parameters.
        optimizer.step()
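
Validation and testing follow the same loop without the three optimizer-related calls. The sketch below shows one way to write it. The function name `evaluate`, the linear model, and the random data are placeholders chosen so that the example runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(net, dataloader, criterion):
    """Return the average per-sample loss over a dataloader without updating weights."""
    net.eval()                      # put layers such as dropout into evaluation mode
    total_loss, total_samples = 0.0, 0
    with torch.no_grad():           # no gradients are needed for evaluation
        for inputs, targets in dataloader:
            outputs = net(inputs)
            batch_size = inputs.shape[0]
            # criterion returns the batch mean, so weight it by the batch size.
            total_loss += criterion(outputs, targets).item() * batch_size
            total_samples += batch_size
    net.train()                     # switch back to training mode
    return total_loss / total_samples

torch.manual_seed(0)
net = nn.Linear(3, 1)               # placeholder model
data = TensorDataset(torch.randn(10, 3), torch.randn(10, 1))  # placeholder data
val_loss = evaluate(net, DataLoader(data, batch_size=4), nn.MSELoss())
print(val_loss)
```

A common pattern is to call such a function on the validation set after every pass over the training data, and on the test set only once at the end.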

## 5 Reference
[1] https://docs.python.org/3/library/venv.html  
[2] https://www.dataquest.io/blog/sci-kit-learn-tutorial  
[3] http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html  