# This assignment was made by Andreas Uldall Leonhard (cph-al141), Jacob Sørensen (cph-js284) and Yosuke Ueda (cph-yu173)

## Project 2: Pipelines and optimisations

This assignment still focuses on predictive machine learning, but starts to spread
into the area of data preprocessing and model optimisation.
You will still be working with supervised classification tasks, but start to use more 
powerful machine learning tools from `sklearn` and the data processing library `pandas`.
We will also be asking you to work on a completely new type of data (voice) and 
reflect on the external validity of your model.

The data is based on the [Kaggle](https://kaggle.com) dataset [Gender Recognition by Voice](https://www.kaggle.com/primaryobjects/voicegender).

### Part 1: Data exploration
Your first task is to download and explore the data. What features are there?
How are they related?
Hand in two lines that describe
* what a frequency is
* what the median frequency means
* what the output label is

### Part 2: Data preparation
When we train our model we'll use a 10-fold `KFold` 
cross-validator **with** shuffling.
Instantiate a `KFold` class and store it in a meaningful variable.

When that is done, illustrate that you indeed do get 10 
iterations of your data by iterating over the folds and simply
printing *the shape of* the four variables: `x_train`, `y_train`, `x_test` and
`y_test` (see 
the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) for examples on
how to do this).

Hand in
* the instantiation of the k-fold cross-validator
* a loop that prints the *shape* of `x_train`, `y_train`, `x_test` and `y_test`

### Part 3: Model construction
We will use four different classification models for this task:
1. Logistic Regression
2. Support Vector Machine classifier
3. Decision Tree classifier
4. k-Nearest Neighbors classifier

Instantiate the four different classifiers in *four different 
pipelines*.

For now the default parameters are fine.

Hand in 
* the code for constructing the four pipelines
* one line of text per model describing how you think the classifier will perform, given the data type you are working with (voice)

### Part 4: Model validation
Now the time comes to train and validate your model.
This training and testing **should happen for all four models**.
The easiest way to do this is to use the `cross_val_score` 
function from `sklearn` once for all the four models.
The code should look something like this:
```python
pipeline1 = ...
pipeline2 = ...
pipeline3 = ...
pipeline4 = ...
my_kfold_validator = ...
for model in ... :
    score = cross_val_score(model, ...)
    print(score)
```

Hand in
* a list per model (four lists in total) of 10 values each, showing the scores of the 10 folds,
* at least one paragraph of text that describes what the 'score' means
* at least one paragraph of text that describes why the scores are different

### Part 5: Model optimisation: scaling
On the website of the [Gender Recognition by Voice](https://www.kaggle.com/primaryobjects/voicegender) dataset, they say
we can do better. So let's try!

One thing that's very easy to do is to use a 
[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
It's particularly easy, because it fits right into your existing
pipelines. So simply add four (separate!) instances of the
`StandardScaler` to the pipelines, one for each pipeline.

Now repeat the above validation code, where you run the 
`cross_val_score` for *each* of the four pipelines. But this 
time the `StandardScaler` is included in the pipeline.

Hand in
* the code for your new pipelines (that includes the `StandardScaler`)
* at least one line of text that describes what scaling actually is
* the **mean** of the 10 scores of the four models (this time it's only **one** number per model)
* at least two lines of text describing which model performed well, and whether this aligned with your expectation from part 3

### Part 6: Manual Hyperparameter Tuning

For the fourth classifier in this project -- namely kNN -- conduct a manual search for the value of $k$ (the hyperparameter `n_neighbors`) that yields the highest score.

That means:

  1. choosing a value (positive integer >= 1), 
  2. putting it into the model, 
  3. (re-)training the kNN model, and 
  4. calculating the score. 
  5. Then try 1)-4) all over again. 

Do these steps at least 10 times to find a good value of the hyperparameter.

Please hand in:

* A list of hyperparameter values, plus the scores, from the 10 times you changed the hyperparameter. The scores should be the **mean** from your 10-fold cross validation runs
* A paragraph reflecting on why the value you found for `n_neighbors` -- for the highest score -- has that value.


## Part 1: Data exploration

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [3]:
df = pd.read_csv("voice.csv")

In [4]:
df.head()

|   | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | ... | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange | modindx | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.059781 | 0.064241 | 0.032027 | 0.015071 | 0.090193 | 0.075122 | 12.863462 | 274.402906 | 0.893369 | 0.491918 | ... | 0.059781 | 0.084279 | 0.015702 | 0.275862 | 0.007812 | 0.007812 | 0.007812 | 0.0 | 0.0 | male |
| 1 | 0.066009 | 0.06731 | 0.040229 | 0.019414 | 0.092666 | 0.073252 | 22.423285 | 634.613855 | 0.892193 | 0.513724 | ... | 0.066009 | 0.107937 | 0.015826 | 0.25 | 0.009014 | 0.007812 | 0.054688 | 0.046875 | 0.052632 | male |
| 2 | 0.077316 | 0.083829 | 0.036718 | 0.008701 | 0.131908 | 0.123207 | 30.757155 | 1024.927705 | 0.846389 | 0.478905 | ... | 0.077316 | 0.098706 | 0.015656 | 0.271186 | 0.00799 | 0.007812 | 0.015625 | 0.007812 | 0.046512 | male |
| 3 | 0.151228 | 0.072111 | 0.158011 | 0.096582 | 0.207955 | 0.111374 | 1.232831 | 4.177296 | 0.963322 | 0.727232 | ... | 0.151228 | 0.088965 | 0.017798 | 0.25 | 0.201497 | 0.007812 | 0.5625 | 0.554688 | 0.247119 | male |
| 4 | 0.13512 | 0.079146 | 0.124656 | 0.07872 | 0.206045 | 0.127325 | 1.101174 | 4.333713 | 0.971955 | 0.783568 | ... | 0.13512 | 0.106398 | 0.016931 | 0.266667 | 0.712812 | 0.007812 | 5.484375 | 5.476562 | 0.208274 | male |


In [5]:
df.shape

(3168, 21)

In [6]:
len(df)

3168

## Your first task is to download and explore the data. What features are there? How are they related? Hand in two lines that describe

* what a frequency is
* what the median frequency means
* what the output label is

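### The columns in the dataset are summary statistics that describe how the frequencies in each voice sample are distributed. The data is the result of an analysis of the acoustic properties of the voices.

### Frequency is the number of times a repeating event occurs per unit of time. For sound it is the number of vibrations per second (measured in Hz; the dataset reports kHz).
### The median frequency is the middle value of the frequencies in a voice sample: it splits the sample's frequency distribution into two equal halves.
### The output label states whether the voice is male or female. In the dataset it is the column named `label`.

### In short, the feature columns describe the frequency content of each recording, the median is the value that splits the sorted observations into two equal halves (not the midpoint between the minimum and maximum values), and the last column is the output label, which gives the gender: male or female.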
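
To make the idea of a median concrete, here is a minimal sketch that compares the median with the midpoint between the smallest and largest value. The frequency values are hypothetical and chosen only for illustration; they are not taken from `voice.csv`.

```python
import numpy as np

# Hypothetical frequencies in kHz, chosen only for illustration.
freqs_khz = np.array([0.05, 0.10, 0.12, 0.15, 0.90])

# The median is the middle element of the sorted values.
median_khz = np.median(freqs_khz)

# The midrange is halfway between the minimum and the maximum.
midrange_khz = (freqs_khz.min() + freqs_khz.max()) / 2

print(median_khz, midrange_khz)
```

A single large value pulls the midrange upwards, while the median stays with the bulk of the observations.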

## Part 2: Data preparation

In [7]:
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

In [8]:
x = df.iloc[:, 0:20].values
y = df['label']
label = LabelEncoder()
y = label.fit_transform(y)

In [9]:
kfolder = KFold(10, shuffle=True)
kfolder.get_n_splits(df)

10

In [10]:
print(kfolder) 

KFold(n_splits=10, random_state=None, shuffle=True)


## Instantiation of the k-fold cross-validator with 10 folds and shuffling enabled (`shuffle=True`)

In [11]:
for i, (train_index, test_index) in enumerate(kfolder.split(x), start=1):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(i, x_train.shape, x_test.shape, y_train.shape, y_test.shape)

1 (2851, 20) (317, 20) (2851,) (317,)
2 (2851, 20) (317, 20) (2851,) (317,)
3 (2851, 20) (317, 20) (2851,) (317,)
4 (2851, 20) (317, 20) (2851,) (317,)
5 (2851, 20) (317, 20) (2851,) (317,)
6 (2851, 20) (317, 20) (2851,) (317,)
7 (2851, 20) (317, 20) (2851,) (317,)
8 (2851, 20) (317, 20) (2851,) (317,)
9 (2852, 20) (316, 20) (2852,) (316,)
10 (2852, 20) (316, 20) (2852,) (316,)


### The cell above shows that we get 10 iterations, one per fold, as illustrated by the 10 printed lines. With 3168 rows, eight folds hold out 317 rows for testing and two hold out 316.
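
The fold sizes can be reproduced without the data set. The sketch below only assumes the row count of 3168 reported by `df.shape`; the placeholder array and the fixed `random_state` are assumptions added for repeatability and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 3168                  # row count reported by df.shape
dummy = np.zeros((n_samples, 1))  # placeholder data; only the number of rows matters

# random_state is fixed here only so the sketch is repeatable.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
test_sizes = [len(test_idx) for _, test_idx in kf.split(dummy)]

print(test_sizes)
```

Because 3168 is not divisible by 10, the first `3168 % 10 = 8` folds receive one extra test row.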

In [12]:
print(x_train , x_test)

[[0.05978098 0.06424127 0.03202691 ... 0.0078125  0.         0.        ]
 [0.06600874 0.06731003 0.04022873 ... 0.0546875  0.046875   0.05263158]
 [0.0773155  0.08382942 0.03671846 ... 0.015625   0.0078125  0.04651163]
 ...
 [0.14205626 0.09579843 0.18373124 ... 2.9375     2.9296875  0.19475862]
 [0.14365874 0.09062826 0.18497617 ... 3.59375    3.5859375  0.31100218]
 [0.16550895 0.09288354 0.18304392 ... 0.5546875  0.546875   0.35      ]] [[0.16051433 0.07676688 0.14433678 ... 0.5390625  0.53125    0.28393665]
 [0.17124697 0.07487157 0.15280665 ... 0.5703125  0.5625     0.1383547 ]
 [0.17020985 0.07202301 0.1460039  ... 0.9140625  0.90625    0.17155172]
 ...
 [0.13462555 0.08440291 0.15453076 ... 4.1640625  4.15625    0.10213033]
 [0.14973093 0.08285205 0.1809322  ... 6.03125    5.828125   0.3657004 ]
 [0.1785725  0.04667868 0.1643883  ... 6.1484375  6.         0.10129123]]


In [13]:
y_train

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)
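
The integers in `y_train` come from the `LabelEncoder`. As a small sketch with made-up input strings, the encoder stores the class names in sorted order and uses their positions as codes:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Made-up labels, only to show the mapping.
encoded = encoder.fit_transform(["male", "female", "male"])

print(list(encoder.classes_))  # class names in sorted order
print(list(encoded))           # position of each label in classes_
```

With the two labels used in this data set, `female` is encoded as 0 and `male` as 1.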

## Part 3: Model construction
1. Logistic Regression
2. Support Vector Machine classifier
3. Decision Tree classifier
4. k-Nearest Neighbors classifier

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing

In [15]:
model_svm = svm.SVC()
model_dt = tree.DecisionTreeClassifier()
model_lr = LogisticRegression()
model_knn = KNeighborsClassifier()

## Part 4: Model validation

In [16]:
pipeline_lr = Pipeline([('logisticModel',model_lr)])
pipeline_svm = Pipeline([('svm',model_svm)])
pipeline_dt = Pipeline([('decision tree',model_dt)])
pipeline_knn = Pipeline([('knn',model_knn)])

### We expect the logistic regression pipeline to perform well, because the data set has only two classes. Most samples will probably fall clearly on one side or the other of the decision boundary.
### We expect the SVM pipeline to perform well too. With a linear kernel the situation would be essentially the same as above (note that the default `SVC` kernel is RBF, not linear).
### The performance of the decision tree pipeline will depend on the depth of the tree, but we predict that this model will also perform well.
### The performance of the model using the KNN ??? 

In [17]:
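# Note: x_train and y_train are the training split of the last fold from the loop in Part 2
# (2852 rows), so the scores are computed on that subset rather than on all of x and y.
for xmodel in [pipeline_lr, pipeline_svm, pipeline_dt, pipeline_knn]:
    score = cross_val_score(xmodel, x_train, y_train, cv=kfolder)
    print(score)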

[0.90909091 0.91958042 0.87719298 0.9122807  0.93684211 0.90526316
 0.88421053 0.91578947 0.91578947 0.8877193 ]
[0.69230769 0.76223776 0.75087719 0.70526316 0.71578947 0.6877193
 0.72280702 0.73684211 0.75087719 0.74736842]
[0.95454545 0.97902098 0.97192982 0.96140351 0.96491228 0.96491228
 0.95789474 0.96491228 0.96491228 0.96140351]
[0.74475524 0.7027972  0.68070175 0.75438596 0.72631579 0.70526316
 0.73684211 0.72982456 0.70175439 0.75789474]


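### The score is the accuracy on the held-out fold: the fraction of test samples whose gender the model predicts correctly. The closer the score is to 1, the better the model is at telling male and female voices apart.
### The scores differ from fold to fold because each fold trains and tests on a different part of the data. Our intuition is that the differences between the models are due to the default parameters applied to each model. In addition, the SVM and kNN pipelines score much lower (about 0.68 to 0.76) than logistic regression and the decision tree, most likely because they rely on distances between samples and the unscaled features have very different ranges (for example `kurt` in the hundreds versus `meanfreq` below 1).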
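
As a sanity check on what the score means, the following sketch compares `cross_val_score` with a manual per-fold accuracy. It uses synthetic data from `make_classification` as a stand-in for `voice.csv` and a fixed `random_state` so that both passes see the same folds; both choices are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; the notebook uses voice.csv).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# A fixed random_state makes every call to split() return the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scores as computed by cross_val_score with the default scoring.
auto_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv)

# The same scores computed by hand as the accuracy on each held-out fold.
manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    fitted = LogisticRegression().fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fitted.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], predictions))

scores_match = np.allclose(auto_scores, manual_scores)
print(scores_match)
```

Note that the notebook's `kfolder` has `random_state=None`, so each call to `cross_val_score` shuffles the rows differently. Part of the variation between runs is therefore likely due to the fold assignment itself.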

In [18]:
from sklearn.preprocessing import StandardScaler

In [19]:
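# Scaled copy of x_train for inspection only; it is not used below, since each pipeline fits its own scaler.
datascale = StandardScaler().fit_transform(x_train)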
pipeline_lr = Pipeline([('standardscaler',StandardScaler()),('logisticModel',model_lr)])
pipeline_svm = Pipeline([('standardscaler',StandardScaler()),('svm',model_svm)])
pipeline_dt = Pipeline([('standardscaler',StandardScaler()),('decision tree',model_dt)])
pipeline_knn = Pipeline([('standardscaler',StandardScaler()),('knn',model_knn)])
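
One reason to put the scaler inside the pipeline is that it is then fitted only on the rows passed to `fit`, so the held-out rows do not influence the scaling. The sketch below illustrates this with small made-up arrays; the data and the 15/5 split are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 20 rows, 2 features, alternating class labels.
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Pretend the first 15 rows are the training part of one fold.
X_fold_train, y_fold_train = X_demo[:15], y_demo[:15]

pipe = Pipeline([("standardscaler", StandardScaler()),
                 ("logisticModel", LogisticRegression())])
pipe.fit(X_fold_train, y_fold_train)

# The scaler inside the pipeline has learned its mean from the training rows only.
learned_mean = pipe.named_steps["standardscaler"].mean_
uses_train_only = np.allclose(learned_mean, X_fold_train.mean(axis=0))
differs_from_full = not np.allclose(learned_mean, X_demo.mean(axis=0))
print(uses_train_only, differs_from_full)
```

When such a pipeline is passed to `cross_val_score`, it is refitted in this way for every fold.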

In [20]:
pipeline_lr

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticModel', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [21]:
for xmodel in [pipeline_lr, pipeline_svm, pipeline_dt, pipeline_knn]:
    score = cross_val_score(xmodel,x_train, y_train, cv=kfolder)
    print(score)

[0.95454545 0.98251748 0.97894737 0.96842105 0.95789474 0.97894737
 0.9754386  0.95438596 0.97192982 0.98245614]
[0.97552448 0.98951049 0.99298246 0.98596491 0.96491228 0.98245614
 0.97894737 0.98245614 0.98245614 0.9754386 ]
[0.95104895 0.97902098 0.95087719 0.96842105 0.96491228 0.97192982
 0.95087719 0.97894737 0.96140351 0.95438596]
[0.97902098 0.98951049 0.97192982 0.9754386  0.97192982 0.95789474
 0.97192982 0.97894737 0.96491228 0.97192982]


## Part 5: Model optimisation: scaling

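### Scaling transforms the features so that they are on a comparable scale. `StandardScaler` standardises each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with mean 0 and standard deviation 1.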
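
A minimal sketch of what `StandardScaler` computes, using a small made-up array with two features on very different scales (the numbers are illustrative and only loosely resemble `meanfreq` and `kurt`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is small, column 1 is large.
X_demo = np.array([[0.06, 274.0],
                   [0.07, 634.0],
                   [0.08, 1024.0],
                   [0.15, 4.0]])

scaled = StandardScaler().fit_transform(X_demo)

# The same transformation by hand: subtract the column mean and divide
# by the column standard deviation (population version, ddof=0).
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both columns contribute on a similar scale to distance computations, which matters for models such as kNN and an SVM with an RBF kernel.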

In [22]:
for xmodel in [pipeline_lr, pipeline_svm, pipeline_dt, pipeline_knn]:
    score = cross_val_score(xmodel, x_train, y_train, cv=kfolder)
    # Mean over the folds, without hard-coding the number of folds
    print(sum(score) / len(score))

0.9716071647650596
0.9817629738682371
0.9614329530119005
0.9737001594896333


### The mean scores show that the Support Vector Machine performed best on the scaled data (0.982), followed by kNN (0.974), logistic regression (0.972) and the decision tree (0.961). This aligns fairly well with our expectations from Part 3.

In [23]:
model_knn = KNeighborsClassifier(n_neighbors=3)

In [24]:
pipeline_knn = Pipeline([('standardscaler',StandardScaler()),('knn',model_knn)])

In [25]:
score = cross_val_score(pipeline_knn,x_train, y_train, cv=kfolder)
score_avg = sum(score)
print(score_avg/10)

0.9758041958041959


## Part 6: Manual Hyperparameter Tuning

In [26]:
arr = [100, 600, 150, 50, 30, 15, 5, 1, 2, 4, 3]

def tryvalidate(u):
    # Build a fresh scaler + kNN pipeline with n_neighbors=u and print the mean 10-fold score
    model_knn = KNeighborsClassifier(n_neighbors=u)
    pipeline_knn = Pipeline([('standardscaler', StandardScaler()), ('knn', model_knn)])
    score = cross_val_score(pipeline_knn, x_train, y_train, cv=kfolder)
    print(u, " ", sum(score) / len(score))

for a in arr:
    tryvalidate(a)

100   0.9354778554778556
600   0.8516783216783217
150   0.9246153846153847
50   0.9446129309287205
30   0.9568617347564716
15   0.9642264752791068
5   0.974054717212612
1   0.9754569991412098
2   0.9747564715985769
4   0.9733468286099864
3   0.9730045393203289
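
The manual search can also be written so that the best value is picked automatically. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` instead of `voice.csv`, an arbitrary list of candidate values, and a fixed `random_state` so that every candidate is scored on the same folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses voice.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Fixed random_state so that all candidates are compared on identical folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

mean_scores = {}
for k in [1, 3, 5, 15, 50]:
    pipe = Pipeline([("standardscaler", StandardScaler()),
                     ("knn", KNeighborsClassifier(n_neighbors=k))])
    # Mean accuracy over the 10 folds for this value of n_neighbors.
    mean_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=cv).mean()

# The candidate with the highest mean score.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

In the notebook the folds are reshuffled on every call because `kfolder` has no `random_state`, so small differences such as those between k = 1 and k = 5 (roughly 0.002) may partly reflect the shuffling rather than the hyperparameter.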


In [27]:
0.9274260826892406

0.9274260826892406