# Data set
The task for this project is to predict metal ion concentrations using the provided data set. The data set has 3 explanatory features, Mod1, Mod2 and Mod3, and 3 objective values, Pb, Cd and c_total, where c_total is the sum of Pb and Cd. There are 201 training examples. Each objective triplet was measured 3 times, so every 3 consecutive training examples share the same objective values. Low-concentration samples (few ions) are overrepresented, and high-concentration samples are underrepresented.
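
The structure described above can be checked mechanically. The following is a minimal sketch, assuming the rows are available as dictionaries keyed by the objective names; the helper `check_structure` and the toy rows are hypothetical and not part of the project code.

```python
import math


def check_structure(rows, group=3, tol=1e-9):
    """Return True if c_total == Pb + Cd in every row and each block of
    `group` consecutive rows shares the same Pb and Cd values."""
    for row in rows:
        if not math.isclose(row["c_total"], row["Pb"] + row["Cd"], abs_tol=tol):
            return False
    for start in range(0, len(rows), group):
        block = rows[start:start + group]
        first = (block[0]["Pb"], block[0]["Cd"])
        if any((r["Pb"], r["Cd"]) != first for r in block):
            return False
    return True


# Made-up values for illustration only (not taken from Water_data.csv).
toy_rows = (
    [{"Pb": 1.0, "Cd": 2.0, "c_total": 3.0}] * 3
    + [{"Pb": 0.0, "Cd": 5.0, "c_total": 5.0}] * 3
)
print(check_structure(toy_rows))
```

Running the same check on the real data would confirm that the triplets are aligned with the row order, which matters for the cross-validation splits used later.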

The following pairwise scatterplot matrix shows c_total and the modulator features. The colors represent distinct c_total values, which are not binned in any way. The plot shows no clear linear dependence between c_total and the modulators, while the modulators themselves show some linear dependence on each other.

![](img/scatters_manual.png)

# Results
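Leave-1-out cross-validation leads to perfect prediction on this data set. This is because every triplet of objective values is repeated 3 times, and the modulators vary little between measurements of the same sample, so the nearest neighbors of a left-out example include its own repeats. The measurement error of the modulators is thus too small to cause prediction errors that affect the c-index. Note also that if the c-index is computed separately for each fold, a single-sample test fold contains no comparable pairs, in which case the c_index function shown under Implementation returns 1 by construction. Below is the resulting plot, with a small amount of random noise added to the lines so that one can clearly see that all c-index values are exactly 1.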
![](img/water_cindex_plots1.png)


From the c-index plots, we can see that neighbor counts that are multiples of 3 tend to perform better. This is most likely because each sample was measured 3 times. The neighbor count $k=6$ seems to be the best for this data set, although this choice is somewhat biased by the structure of the data set. The predictions are much better than the baseline prediction (the mean).

![](img/water_cindex_plots3.png)
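
To make the fold structure concrete, here is a minimal sketch of how consecutive folds of size 3 partition the 201 examples. It assumes unshuffled, consecutive folds (the default behaviour of scikit-learn's `KFold`, for example) and that the triplets start at the first row; `consecutive_folds` is a hypothetical helper, not the project's cross-validation code.

```python
def consecutive_folds(n_samples, leave_out):
    """Split indices 0..n_samples-1 into consecutive test folds of size
    leave_out (n_samples is assumed to be divisible by leave_out)."""
    return [list(range(start, start + leave_out))
            for start in range(0, n_samples, leave_out)]


folds = consecutive_folds(201, 3)
print(len(folds))   # 67
print(folds[0])     # [0, 1, 2]
print(folds[-1])    # [198, 199, 200]
```

Under these assumptions each test fold holds exactly one measured triplet, so a test example never has its own repeats in the training set.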


After removing the baseline (mean) line and the value for $k=1$, the differences are easier to see. $k=6$ still seems to be the best value. However, for production use a larger value may be better if more measurements are made.
![](img/water_cindex_plots32.png)



# Implementation
Cross-validation was already covered in the first week's report, so it is not included here. It already supported leave-X-out cross-validation, so no changes were needed. The code for the older classes can be found at https://github.com/alileino/ml_study/, in particular in the files measures/cv.py and models/knn.py. The full code for this report is in the same GitHub repository under analysis/water_ion.py.
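
For readers without access to the repository, the idea behind k-nearest-neighbor regression can be sketched as follows. This is not the `KNN` class from models/knn.py; it is a minimal illustration that assumes Euclidean distance and an unweighted mean of the neighbors' targets, and the name `knn_regress` and the toy points are hypothetical.

```python
import math


def knn_regress(train_X, train_y, query, k):
    """Predict the mean target of the k training points closest to query."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Made-up points: two tight clusters with different target values.
train_X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
train_y = [1.0, 1.0, 5.0, 5.0]
print(knn_regress(train_X, train_y, (0.05, 0.0), k=2))  # 1.0
print(knn_regress(train_X, train_y, (0.05, 0.0), k=4))  # 3.0
```

With k=2 only the nearby cluster contributes; with k=4 the distant cluster pulls the prediction toward the overall mean, which is why the choice of k interacts with the number of repeated measurements per sample.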

## Data preparation
The modulators were z-score normalized. The output values were not.
```python
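import pandas as pd


class WaterDataProvider:
    objective_columns = ["c_total", "Cd", "Pb"]

    def __init__(self):
        self.X, self.Y, self.df = self.load_data()

    def load_data(self):
        df = pd.read_csv("../data/Water_data.csv")
        y = df[WaterDataProvider.objective_columns]

        # The remaining columns are the modulator features Mod1..Mod3.
        X = df.drop(WaterDataProvider.objective_columns, axis=1)
        # Z-score: subtract the column mean and divide by the column
        # standard deviation (not the variance).
        X = (X - X.mean()) / X.std()
        return X, y, df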
```
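
As a worked example of z-score normalization, the sketch below scales a single made-up column with the standard library only. It assumes the sample standard deviation (ddof=1), which is also what pandas' `std()` uses by default; `z_score` is a hypothetical helper.

```python
import statistics


def z_score(values):
    """Scale values to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (ddof=1)
    return [(v - mean) / std for v in values]


toy = [2.0, 4.0, 6.0]  # made-up values for illustration
print(z_score(toy))    # [-1.0, 0.0, 1.0]
```

Here the mean is 4 and the standard deviation is 2. Dividing by the variance (4) instead would give [-0.5, 0.0, 0.5], so the two choices produce different feature scales, and that affects the distances used by a nearest-neighbor model.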
 
## Scatter plot
The following scatter function can be called with any of the objective names (Pb, Cd, c_total). It drops the other two objectives to keep the plot small, because we are not interested in the relationships between the objective values.
```python
import matplotlib.pyplot as plt
import seaborn as sns


def scatters(y):
    data = WaterDataProvider()
    # Build a new list so the shared class attribute is not mutated,
    # which would break repeated calls.
    others = [c for c in WaterDataProvider.objective_columns if c != y]
    df = data.df.drop(others, axis=1)
    sns.set(style="ticks")
    sns.pairplot(df, hue=y, palette="hls")
    plt.show()
```

## C-index
The c-index was calculated with the method described in the slides. The division-by-zero error was fixed by returning 1 when there are no comparable pairs, that is, when no two samples have different true values.
```python
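def c_index(truey, predy):
    n = 0        # number of comparable pairs (different true values)
    h_sum = 0    # concordance score accumulated over those pairs
    for i in range(len(truey)):
        t = truey[i]
        p = predy[i]
        for j in range(i+1, len(truey)):
            # Named next_t/next_p to avoid shadowing numpy's usual alias np.
            next_t = truey[j]
            next_p = predy[j]
            if t != next_t:
                n = n+1
                if (p < next_p and t < next_t) or (p > next_p and t > next_t):
                    h_sum += 1    # concordant pair
                elif p == next_p:
                    h_sum += 0.5  # tied predictions
                # discordant pairs add nothing
    if n != 0:
        return h_sum/n
    # No comparable pairs: avoid division by zero.
    return 1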
```
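
A few small cases help to see how the c-index behaves. The sketch below is a compact, self-contained restatement of the pairwise definition under the same conventions (ties in the prediction count 0.5, no comparable pairs returns 1); the name `c_index_sketch` and the toy values are for illustration only.

```python
def c_index_sketch(true_y, pred_y):
    """Fraction of comparable pairs whose predictions are ordered correctly."""
    pairs = 0
    score = 0.0
    for i in range(len(true_y)):
        for j in range(i + 1, len(true_y)):
            if true_y[i] == true_y[j]:
                continue  # equal true values: pair is not comparable
            pairs += 1
            dt = true_y[i] - true_y[j]
            dp = pred_y[i] - pred_y[j]
            if dp == 0:
                score += 0.5          # tied predictions
            elif (dt > 0) == (dp > 0):
                score += 1.0          # concordant pair
    return score / pairs if pairs else 1.0


true_y = [1.0, 2.0, 3.0]
print(c_index_sketch(true_y, [0.1, 0.2, 0.3]))  # 1.0, all pairs concordant
print(c_index_sketch(true_y, [0.3, 0.2, 0.1]))  # 0.0, all pairs discordant
print(c_index_sketch(true_y, [2.0, 2.0, 2.0]))  # 0.5, constant prediction
print(c_index_sketch([5.0], [4.0]))             # 1.0, no comparable pairs
```

The third case shows why a constant prediction such as the mean scores 0.5, and the last case shows that a single-sample input scores 1 regardless of the prediction.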



## C-index plots
The following function plots the c-index for each neighbor count in the range [llim, hlim-1].

```python
import matplotlib.pyplot as plt
import numpy as np
# KNN, cv_score, KFold and savefig come from the project code or its
# imports, which are not shown here.


def water_cindex_plots(leave_out, randomize=0, llim=1, hlim=15, show_mean=True):
    '''
    :param leave_out: how many consecutive samples to leave in the test set of each fold
    :param randomize: float describing the amount of random noise added to the plotted lines
    :param llim: the inclusive lower limit for the neighbor count
    :param hlim: the exclusive upper limit for the neighbor count
    :param show_mean: True if the (baseline) mean c-index should be plotted
    :return: None
    '''
    plt.figure()
    data = WaterDataProvider()
    X = data.X.values
    Y = data.Y
    neighbors = np.arange(llim, hlim)
    for ycolumn in Y.columns:
        y = Y[ycolumn].values
        scores = []
        for k in neighbors:
            knn = KNN(n_neighbors=k, regression=True)
            folds = KFold(n_splits=len(X) // leave_out)
            s = cv_score(knn, X, y, cv=folds, score_func=c_index)
            # Optional jitter keeps overlapping lines distinguishable.
            scores.append(np.mean(s) + np.random.rand() * randomize)

        plt.plot(neighbors, scores, label=ycolumn)
    if show_mean:
        # A constant (mean) prediction ties every pair, giving a c-index of 0.5.
        plt.plot(neighbors, np.repeat(0.5, len(neighbors)), label="mean (baseline)")
    plt.suptitle("Leave-%i-Out CV c-index vs. neighbors" % leave_out)
    plt.ylabel("C-index")
    plt.legend()
    plt.xticks(neighbors)
    plt.xlabel("neighbors")
    savefig("water_cindex_plots%i%i" % (leave_out, llim))
```