# Task FluType

The task is to build a very simple classification model that predicts the virus type from the given features `P`.

To build the classifier you have a training data set
```
./data/flutype_train.csv
```
with which the classifier is trained, and a test data set
```
./data/flutype_test.csv
```
with which the performance of the classifier is subsequently tested.

Training and test data sets have an identical format,
consisting of a column `virus`, which is the outcome (classification prediction, dependent variable), and the columns `P1, ..., P13`, which are the independent variables (features) with which the classification model is fitted.

To start the task, clone the repository from GitHub
```
git clone https://github.com/matthiaskoenig/flutype-task
```
and start the Jupyter notebook
```
cd flutype-task
jupyter notebook
```
This document is `task.ipynb`; the data is in the subfolder `data`.

In [1]:
# loading test dataset
import pandas as pd
dtest = pd.read_csv('./data/flutype_test.csv', sep="\t")
dtest.head()




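| Unnamed: 0 | virus | P1 | P3 | P4 | P5 | P6 | P7 | P8 | P10 | P13 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X31 | 1569 | 2185 | 3988 | 3104 | 161 | 394 | 788 | 4396 | 443 |
| 1 | X31 | 1840 | 2203 | 5003 | 2975 | 148 | 613 | 726 | 4284 | 486 |
| 2 | X31 | 2039 | 2269 | 5163 | 3067 | 126 | 689 | 692 | 3493 | 372 |
| 3 | X31 | 1168 | 1578 | 4211 | 1948 | 172 | 568 | 258 | 3472 | 314 |
| 4 | H1 | 510 | 892 | 2609 | 2874 | 484 | 849 | 1650 | 2791 | 1445 |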


The training data has the same format.

Your task is to build a simple classification model in Python with either scikit-learn (http://scikit-learn.org/stable/) or TensorFlow (https://www.tensorflow.org/), using the training data set, and to test the performance of the classifier on the test data set.

The problem is a so-called supervised learning problem (http://scikit-learn.org/stable/supervised_learning.html) with multi-class classification (i.e. the classification outcome can be one of multiple virus classes).

You can use whatever classification algorithm or method you want.
The simplest solution is probably logistic regression; alternatives include support vector machines, neural networks, decision trees, ...
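
The alternatives listed above can be tried side by side. The following is a minimal sketch, not part of the original task: it uses synthetic data from `make_classification` as a stand-in for the FluType features (an assumption; the real data would come from the CSV files), and the helper name `compare_classifiers` is hypothetical. Features are standardized for the scale-sensitive models, which is a design choice of this sketch rather than a requirement of the task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X_train, y_train, X_test, y_test):
    """Fit several classifiers and return {name: accuracy on the test set}."""
    candidates = {
        "logistic regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)),
        "support vector machine": make_pipeline(StandardScaler(), SVC()),
        "neural network": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy
    return scores


# Synthetic stand-in data: 3 classes, 9 features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

scores = compare_classifiers(X_tr, y_tr, X_te, y_te)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

With the real data, the same function could be called with the feature and outcome columns of the training and test files.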

The main outcome of the task is the fitted classifier and an evaluation of its performance on the training and test data sets. You should provide a table and graph(s) on the performance (i.e. correct/incorrect classifications, ...).

## Additional info
* manage your solution in a GitHub repository
* document the solution in a Jupyter notebook
* write down what you learned and what surprised you

In [2]:
# read the training data
dtrain = pd.read_csv('./data/flutype_train.csv', sep="\t")
dtrain

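| Unnamed: 0 | virus | P1 | P3 | P4 | P5 | P6 | P7 | P8 | P10 | P13 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X31 | 1779 | 2285 | 4711 | 2976 | 145 | 792 | 694 | 3709 | 384 |
| 1 | X31 | 1286 | 1678 | 4854 | 3133 | 141 | 402 | 738 | 4583 | 343 |
| 2 | X31 | 2147 | 1688 | 4821 | 2520 | 153 | 426 | 543 | 3737 | 422 |
| 3 | X31 | 1526 | 2012 | 4708 | 2707 | 141 | 569 | 1396 | 3697 | 324 |
| 4 | X31 | 1409 | 1894 | 4811 | 3449 | 170 | 556 | 328 | 3357 | 386 |
| 5 | X31 | 1930 | 2001 | 5294 | 2381 | 138 | 437 | 565 | 3870 | 370 |
| 6 | X31 | 1920 | 2140 | 4192 | 1899 | 142 | 474 | 936 | 4254 | 468 |
| 7 | X31 | 1433 | 2030 | 4337 | 2581 | 152 | 617 | 710 | 4894 | 499 |
| 8 | X31 | 1628 | 2202 | 5215 | 2348 | 187 | 505 | 617 | 4476 | 475 |
| 9 | X31 | 1047 | 2514 | 3970 | 2894 | 165 | 479 | 327 | 4005 | 428 |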


In [3]:
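# build classifier with training dataset
import os
from sklearn.linear_model import LogisticRegression

# features: every column except the outcome `virus`
# (`.ix` was removed from pandas; `.loc` with a boolean mask selects the same columns)
X_train = dtrain.loc[:, dtrain.columns != 'virus']
Y_train = dtrain['virus']

clf = LogisticRegression()
clf = clf.fit(X_train, Y_train)

# save classifier to disk (create the output folder if it is missing)
os.makedirs("results", exist_ok=True)
pd.to_pickle(clf, "results/logistic_regressor.dat")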
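
A classifier saved this way can be loaded again later without refitting. The sketch below is an illustration under assumptions: it fits a throwaway model on made-up numbers, writes it to a temporary directory rather than `results/`, and reads it back with `pd.read_pickle`, the counterpart of `pd.to_pickle`.

```python
import os
import tempfile

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set (hypothetical values: two features, two classes)
X = pd.DataFrame({"P1": [1.0, 1.2, 3.0, 3.2], "P3": [0.5, 0.4, 2.0, 2.1]})
y = pd.Series(["X31", "X31", "H1", "H1"])
clf = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "logistic_regressor.dat")
    pd.to_pickle(clf, path)          # save the fitted classifier
    restored = pd.read_pickle(path)  # load it back from disk

# The restored model should give the same predictions as the original
same = bool((restored.predict(X) == clf.predict(X)).all())
print(same)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.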

In [4]:
# run classifier on test dataset
# (`.ix` was removed from pandas; `.loc` with a boolean mask selects the same columns)
X_test = dtest.loc[:, dtest.columns != 'virus']
Y_test = dtest['virus']
predicted_test = clf.predict(X_test)

# run classifier on training dataset
predicted_train = clf.predict(X_train)

In [5]:
# evaluation & graphs (performance)
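
One possible way to fill in the evaluation step is a table of true versus predicted classes together with the overall accuracy. The sketch below is an assumption about what such an evaluation could look like, not the notebook's own solution: the helper `performance_table` is a hypothetical name, the labels are made up for illustration (including the class `H3`), and the table is built with `pd.crosstab` rather than scikit-learn's metrics module.

```python
import pandas as pd


def performance_table(y_true, y_pred):
    """Return (confusion table, accuracy) for true and predicted labels."""
    y_true = pd.Series(list(y_true), name="true")
    y_pred = pd.Series(list(y_pred), name="predicted")
    # rows: true class, columns: predicted class, cells: number of samples
    table = pd.crosstab(y_true, y_pred)
    # fraction of samples whose prediction matches the true class
    accuracy = (y_true == y_pred).mean()
    return table, accuracy


# Made-up labels for illustration only (not results from the FluType data)
true_labels = ["X31", "X31", "X31", "H1", "H1", "H3"]
pred_labels = ["X31", "X31", "H1", "H1", "H1", "H3"]

table, accuracy = performance_table(true_labels, pred_labels)
print(table)
print(f"accuracy: {accuracy:.2f}")
```

In the notebook, the same helper could be called as `performance_table(Y_test, predicted_test)` and `performance_table(Y_train, predicted_train)`. If matplotlib is installed, `table.plot(kind="bar")` is one way to turn the table into a graph.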