# 7a Machine Learning - Part I

In this part of the exercise you will take a processed DataFrame containing all provided spectral data, divide it into training and test subsets, and use these to set up and test a k-nearest neighbours machine learning model.
<br>
<br>Once trained, this model should be able to classify unknown substitution patterns based on the labelled spectral data it has been presented with.

---

**Note:** Running this code will prevent some irrelevant errors popping up later on in the exercise.

In [1]:
import warnings
warnings.filterwarnings("ignore")

---

✏️ To begin with, import all of the libraries that you've been using in the previous notebooks.

In [63]:
import C317
import pandas as pd
import numpy as np

✏️ Using your newly created function, load in a DataFrame of all the spectra, reduced to 20 principal components. Print the first five rows of the DataFrame to remind yourself what it looks like.

In [64]:
import importlib
importlib.reload(C317)
C317.load_spectra(20,0).head()


---

### Train/Test Splits

When testing the success of a machine learning algorithm, a dataset is typically divided into two: a training dataset and a test dataset.
<br>
<br>The training dataset is the information that is passed to the algorithm, and is what it bases its 'learning' on. The test dataset is then used as a benchmark to see how well the algorithm can predict the right result (correctly identify data subsets) after learning from the training set.
<br>
<br>When assigning train-test splits, i.e. dividing the data into the two subsets, it is important that all repeats of a given measurement are passed into the same one of the two sets. Consider a case in which four of five repeated measurements are passed to the training dataset and the remaining one is passed to the test dataset. Since the algorithm will already have encountered four instances of essentially the same data within the training dataset, it is highly likely that it will successfully assign the final repeat in the test dataset. As a result, the apparent success of the machine learning is exaggerated and is not a true representation of how well the model handles compounds it has not seen before.
<br>
<br>Thus, while it is useful to save spectra with numbers indicating different repeats (computers do not like files with exactly the same name), it is now best to rename the data so that these repeat indicators are removed.

✏️ Print all the column titles in your DataFrame.

In [8]:
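# Call the function through the imported C317 module; the second argument (0) is assumed to keep the original filenames
print(C317.load_spectra(0,0).columns)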

✏️ Use this list to construct a *new* list of column titles, `columns_identical`, where each of the 5 repeats of a given compound has the same title (e.g. 'm_N11_m-toluicacid', 'm_N11_m-toluicacid', ..., or 'm-toluicacid', 'm-toluicacid', ...)
<br>
<br>*Hint:* There are quite a few ways of doing this. The `split()` function from the `re` library, which is further explained in the documentation below, might be useful.
<br>
<br>https://docs.python.org/3/library/re.html
<br>
<br>Alternatively, you might recall that strings can be sliced like lists, so you could selectively read only a subsection of the characters in each filename. Is there a consistent number of characters that needs to be cut from the filename to make all repeats identical?
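Both approaches from the hints can be sketched as follows. The filenames used here are hypothetical (a compound name followed by `_` and a single-digit repeat number); your own files may follow a different pattern, in which case the regular expression or the slice length will need adjusting.

```python
import re

# Hypothetical filenames: compound name, underscore, single-digit repeat number
names = ["m-toluicacid_1", "m-toluicacid_2", "p-xylene_1"]

# Option 1: split on the trailing "_<digits>" and keep the part before it
by_regex = [re.split(r"_\d+$", name)[0] for name in names]

# Option 2: slice off a fixed number of characters (here the last two)
by_slice = [name[:-2] for name in names]

print(by_regex)
print(by_slice)
```

Slicing is shorter but only works if every repeat indicator has the same length; the regular expression also copes with multi-digit repeat numbers.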

In [None]:
#I chose to update the C317.py file to slice off the final number for each filename

Recall that the column titles of a DataFrame can be overwritten by just setting the `.columns` attribute of the DataFrame to the new list of column titles.

✏️ Replace the columns in your DataFrame with the list you've just created, so that all columns containing repeated spectra are named identically.

In [None]:
#I chose to update the C317.py file to slice off the final number for each filename

✏️ See what happens now if you pull the "m-toluicacid" column(s) from the DataFrame.

In [32]:
import importlib
importlib.reload(C317)
C317.load_spectra(20,1)['m-toluicacid']

Unnamed: 0,m-toluicacid,m-toluicacid.1,m-toluicacid.2,m-toluicacid.3,m-toluicacid.4
0,-3.100364e-05,-4.235043e-05,-0.0001487827,-7.138975e-05,3.908406e-06
1,0.0004703055,0.0004856876,0.0004162011,0.0004745816,0.0004669203
2,0.0005705629,0.0005774984,0.0003823026,0.000545453,0.0005934345
3,-0.0003957711,-0.0004086561,-0.0003647751,-0.0004008522,-0.0003805889
4,0.0001576995,0.0001642062,0.0001517702,0.0001643989,0.0001591021
5,-1.970974e-05,-1.777299e-05,8.350223e-05,-5.821251e-07,-3.751903e-05
6,1.281415e-06,-4.523194e-06,9.963518e-06,-3.787585e-06,-6.180583e-07
7,-5.623824e-06,-1.647756e-06,1.280394e-06,9.68271e-07,4.784684e-06
8,-2.470677e-06,2.914009e-06,-5.941903e-06,8.999422e-06,-4.27068e-06
9,2.919991e-06,1.602038e-05,-9.108665e-06,7.89803e-07,-1.321349e-05


Notice that now all five repeats stay together, which is what we wanted.
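The behaviour seen above can be reproduced on a small toy DataFrame. The values and column titles here are made up purely for illustration.

```python
import pandas as pd

# Toy DataFrame with made-up values; columns still carry repeat numbers
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a_1", "a_2", "b_1"])

# Overwrite the column titles so that repeats share a name
df.columns = ["a", "a", "b"]

repeats = df["a"]      # both "a" columns come back together as a DataFrame
print(repeats.shape)   # (2, 2)
```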

✏️ Update the `load_spectra()` function in your C317 library so that it returns a DataFrame where all repeats have the same column title, as above. Then, reload it below to check it gives the correct output.

For an added challenge, add a parameter to the function that allows you to choose whether the original filenames or modified filenames for repeated columns are shown in the output.

In [33]:
C317.load_spectra(0,1)['m-toluenesulfonyl chloride']

Unnamed: 0,m-toluenesulfonyl chloride,m-toluenesulfonyl chloride.1,m-toluenesulfonyl chloride.2,m-toluenesulfonyl chloride.3,m-toluenesulfonyl chloride.4
630,0.000288,0.000288,0.000287,0.000287,0.000288
631,0.000289,0.000288,0.000287,0.000288,0.000291
632,0.000290,0.000287,0.000286,0.000288,0.000287
633,0.000288,0.000287,0.000289,0.000287,0.000288
634,0.000290,0.000287,0.000288,0.000290,0.000290
635,0.000290,0.000289,0.000287,0.000287,0.000289
636,0.000290,0.000288,0.000289,0.000288,0.000287
637,0.000289,0.000287,0.000290,0.000289,0.000289
638,0.000288,0.000287,0.000288,0.000289,0.000290
639,0.000289,0.000288,0.000290,0.000289,0.000289


✏️ It will be useful to have a list of all unique compound names for future reference. Recalling the `set()` function, construct a new list of all the column titles in your DataFrame *without repeats* (e.g. m-toluicacid, m-aminoacetophenone, ...)

It should have 76 items - check this.

*Hint:* Because of the manner in which the headings are stored in `DataFrame.columns`, you will need to use the `list()` function to turn the output of `set()` back into a usable list.
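As a small illustration with made-up column titles: a set has no guaranteed order, so if a reproducible ordering matters, one option (not required by the exercise) is `sorted()`, which also returns a list directly.

```python
# Made-up column titles with repeats
columns = ["m-a", "m-a", "o-b", "o-b", "p-c"]

# set() removes the repeats; sorted() turns the result into an ordered list
unique_names = sorted(set(columns))

print(len(unique_names), unique_names)
```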

In [36]:
UniqueNames=list(set(C317.load_spectra(0,1).columns))

The library `sklearn.model_selection` has a function `train_test_split()`.
<br>
<br>`train_test_split` takes in a list and splits it into two: a training dataset and a test dataset. By default the split is 75:25 training:test, but this can be altered by passing a different value of the `test_size` parameter (`test_size=x`, where `x` is a value between 0 and 1 giving the fraction of the full dataset that makes up the test dataset).
<br>
<br>`train_samples, test_samples = sklearn.model_selection.train_test_split(samples,test_size=x)`
<br>
<br>**Note:** `train_test_split()` shuffles the data at random, so each time it is called it will return a different distribution of data between the two subsets (unless a fixed `random_state` is passed).
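A minimal sketch of the call, using placeholder compound names rather than the real dataset. Passing `random_state` is optional and is used here only to show that fixing it makes the split repeatable; leave it out when you want a fresh split each time.

```python
from sklearn.model_selection import train_test_split

# Placeholder compound names, standing in for the list of unique names
names = [f"compound_{i}" for i in range(12)]

# Same random_state twice: the two calls produce the same split
train_a, test_a = train_test_split(names, test_size=0.33, random_state=0)
train_b, test_b = train_test_split(names, test_size=0.33, random_state=0)

print(len(train_a), len(test_a))
print(train_a == train_b)
```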

✏️ Import the `sklearn.model_selection` library and use the `train_test_split()` function to split the list of all the different compounds, i.e. the unique compound names, into training and a test sets with a 67:33 ratio.

In [38]:
import sklearn.model_selection

train_samples, test_samples = sklearn.model_selection.train_test_split(UniqueNames,test_size=0.33)


✏️ Hence, split your DataFrame into a training, and a testing DataFrame.

*Hint:* Passing a list of column names to a DataFrame returns all of those columns from the DataFrame.
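The hint can be illustrated on a toy DataFrame with repeated column titles. The values, compound names and split below are all made up. Reading the labels back from the selected columns is one possible way to keep them aligned with the data without assuming a fixed number of repeats.

```python
import pandas as pd

# Toy frame with repeated column titles (made-up values)
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6]],
    columns=["o-a", "o-a", "m-b", "m-b", "p-c", "p-c"],
)

# Hypothetical split of the unique compound names
train_samples = ["p-c", "o-a"]
test_samples = ["m-b"]

# Each name pulls out every column carrying that title
train_df = df[train_samples]
test_df = df[test_samples]

# First character of each column title is the o/m/p label
train_labels = [name[0] for name in train_df.columns]

print(list(train_df.columns))
print(train_labels)
```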

In [39]:
C317.load_spectra(0,1)[train_samples].head()

Unnamed: 0,o-1-chloro-2-nitrobenzene,o-1-chloro-2-nitrobenzene.1,o-1-chloro-2-nitrobenzene.2,o-1-chloro-2-nitrobenzene.3,o-1-chloro-2-nitrobenzene.4,p-(trifluoromethyl)benzaldehyde,p-(trifluoromethyl)benzaldehyde.1,p-(trifluoromethyl)benzaldehyde.2,p-(trifluoromethyl)benzaldehyde.3,p-(trifluoromethyl)benzaldehyde.4,p-(trifluoromethyl)acetophenone,p-(trifluoromethyl)acetophenone.1,p-(trifluoromethyl)acetophenone.2,p-(trifluoromethyl)acetophenone.3,p-(trifluoromethyl)acetophenone.4,m-toluenesulfonyl chloride,m-toluenesulfonyl chloride.1,m-toluenesulfonyl chloride.2,m-toluenesulfonyl chloride.3,m-toluenesulfonyl chloride.4
630,0.000244,0.000242,0.000239,0.000239,0.000245,0.000281,0.000286,0.000283,0.000282,0.000283,0.000275,0.000278,0.000279,0.00028,0.00028,0.000288,0.000288,0.000287,0.000287,0.000288
631,0.000244,0.000242,0.000238,0.00024,0.000246,0.000284,0.000287,0.000284,0.000286,0.000285,0.000272,0.000275,0.000277,0.000276,0.000276,0.000289,0.000288,0.000287,0.000288,0.000291
632,0.000241,0.000239,0.000237,0.000238,0.000244,0.000287,0.000288,0.000287,0.000289,0.000289,0.000268,0.000272,0.00027,0.000272,0.000272,0.00029,0.000287,0.000286,0.000288,0.000287
633,0.000239,0.000237,0.000234,0.000237,0.00024,0.000287,0.000289,0.000289,0.00029,0.000288,0.000277,0.000279,0.000281,0.000282,0.000281,0.000288,0.000287,0.000289,0.000287,0.000288
634,0.000237,0.000235,0.00023,0.000232,0.000238,0.00029,0.000288,0.000289,0.000288,0.000287,0.000281,0.000282,0.000281,0.000285,0.000285,0.00029,0.000287,0.000288,0.00029,0.00029


✏️ Construct two more lists, `train_labels` and `test_labels`, which contain the labels ("o", "m" or "p") for each column in each DataFrame. Use a `for` loop to accomplish this.

*Hint:* You have already done this in Notebook 6.

In [55]:
train_labels=[]
test_labels=[]
for name in train_samples:
    train_labels+=5*[name[0]]
for name in test_samples:
    test_labels+=5*[name[0]]

print(test_labels)

['m', 'm', 'm', 'm', 'm', 'o', 'o', 'o', 'o', 'o']


---

### The k-Nearest Neighbours Algorithm

With an appropriately prepared dataset separated into training and testing subsets, it is now possible to pass the data to a machine learning algorithm. In this case you will employ the k-nearest neighbours algorithm, which categorises data based on distance in dataspace.
<br>
<br>We will employ the `sklearn.neighbors` library to carry out the machine learning.

✏️ Begin by importing the `sklearn.neighbors` library.

In [58]:
import sklearn.neighbors


The `sklearn.neighbors` library lets you create an object called a `KNeighborsClassifier()`, which takes as an argument the number of nearest neighbours to use. (You might draw parallels between this and the `PCA` object generated in the previous notebook.)

✏️ Create a 3-nearest neighbours object.

In [59]:
OBJ=sklearn.neighbors.KNeighborsClassifier(3)

The `KNeighborsClassifier` object has a method called `fit(training data, training labels)`, which sets up/trains your model, ready to make predictions. As you encountered with PCA, the `sklearn` library expects your data with samples as rows and features as columns. Recall that a DataFrame has an attribute which could help here.
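The orientation requirement can be seen on a toy example. The data below is made up, with each column holding one sample and each row one feature, which is the layout assumed for the spectra DataFrame.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made-up values): each column is one sample, each row one feature
train = pd.DataFrame(
    [[0.0, 0.1, 0.2, 1.0, 1.1, 1.2],
     [0.0, 0.2, 0.1, 1.0, 1.2, 1.1]],
    columns=["o-a", "o-a", "o-b", "p-c", "p-c", "p-d"],
)
labels = [name[0] for name in train.columns]

print(train.shape)     # (2, 6): features x samples
print(train.T.shape)   # (6, 2): samples x features, as sklearn expects

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train.T, labels)
```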

✏️ Pass in your training DataFrame, and training labels.
<br>
<br>**Note:** This will return an object (called a 'KNeighborsClassifier'), which can be ignored.

In [61]:
OBJ.fit(C317.load_spectra(0,1)[train_samples].T,train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

You have now made a `KNeighborsClassifier` object, and trained it with your processed data.

The `score(testing data, testing labels)` method can be used to test a trained `KNeighborsClassifier` object with testing data. It returns the fraction of the test data that your model classifies correctly, as a number between 0 and 1 (so a score of 0.5 corresponds to 50%).

**Note:** Again, the testing data must be handed to the method with samples as rows and features as columns.
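To make the meaning of the score concrete, the sketch below uses made-up data that is already arranged with samples as rows, and compares `score()` with the fraction of matching predictions computed by hand. The third test label is chosen so that the model gets it wrong.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data, already arranged with samples as rows and features as columns
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 1.2], [1.2, 1.1]])
train_y = ["o", "o", "o", "p", "p", "p"]

test_X = np.array([[0.05, 0.05], [1.05, 1.05], [0.9, 0.9]])
test_y = np.array(["o", "p", "o"])  # third sample lies among the "p" points

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)

score = model.score(test_X, test_y)
by_hand = np.mean(model.predict(test_X) == test_y)  # fraction of correct predictions

print(score, by_hand)
```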

✏️ Pass your test data and labels to the `score()` method of your `KNeighborsClassifier` and see what success score your particular train/test split arrives at.

In [67]:
OBJ.score(C317.load_spectra(0,1)[test_samples].T,test_labels)

0.5

You will likely get a value between 60% and 90%. The variability is caused by different possible test/train splits - if you train your model with a different subset of the data, it will perform differently.

✏️ By combining the processes above, write a function for your C317 library which takes in a suitably prepared DataFrame, splits it into training and test subsets, carries out 3-nearest neighbours machine learning, and outputs the test result. This will allow the machine learning process to be run multiple times with ease.
<br>
<br>As an extension, you may wish to add further arguments to your function which control the train/test split proportions and the number of nearest neighbours employed by the algorithm. (This would allow you to explore the effects of these parameters on the success of the machine learning results.)

In [None]:
import sklearn.model_selection
import sklearn.neighbors

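def MachineLearn(dataframe,t_size,k):
    """Split by compound, train a k-nearest neighbours model and return its test score."""
    # Split the unique compound names so that all repeats of a compound land in the same subset
    UniqueNames=list(set(dataframe.columns))
    train_samples, test_samples = sklearn.model_selection.train_test_split(UniqueNames,test_size=t_size)
    train_data=dataframe[train_samples]
    test_data=dataframe[test_samples]
    # The first character of each column title is its label ('o', 'm' or 'p');
    # reading it from the selected columns avoids assuming exactly 5 repeats per compound
    train_labels=[name[0] for name in train_data.columns]
    test_labels=[name[0] for name in test_data.columns]
    # sklearn expects samples as rows and features as columns, hence the transpose
    OBJ=sklearn.neighbors.KNeighborsClassifier(k)
    OBJ.fit(train_data.T,train_labels)
    return OBJ.score(test_data.T,test_labels)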

To get a better idea of the accuracy of the model, it is better to average the score over a few hundred different test train splits (recall that each time you use `train_test_split()` it generates a new split).

✏️ Find the average score over 500 different test train splits. It should sit at around 75-80%. In addition, calculate the standard error for this number of runs (standard deviation divided by square root of number of data points).

**Note:** You may wish to use `np.round(n, x)`, which rounds a number, `n`, to `x` decimal places, to make your numerical results neater.
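The mean and standard error can be computed as sketched below, using made-up scores in place of real results. Note that `np.std` divides by the number of points by default (`ddof=0`); passing `ddof=1` gives the sample standard deviation instead, which yields a slightly larger standard error. Which of the two is intended here is not specified, so both are shown.

```python
import numpy as np

scores = np.array([0.8, 0.7, 0.9, 0.75])  # made-up accuracy scores
n = len(scores)

mean = scores.mean()
sem_population = scores.std() / np.sqrt(n)     # np.std default, ddof=0
sem_sample = scores.std(ddof=1) / np.sqrt(n)   # sample standard deviation

print(np.round(mean, 3), np.round(sem_population, 3), np.round(sem_sample, 3))
```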

In [79]:
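n_runs=5 #the exercise asks for 500; only 5 runs were used here
data=C317.load_spectra(20,1) #load the spectra once rather than on every run
total=[C317.MachineLearn(data,0.33,3) for i in range(n_runs)]
x=np.std(np.array(total))
print(sum(total)/n_runs,x/np.sqrt(n_runs)) #mean score and standard error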

0.8 0.1788854381999832


Once a `KNeighborsClassifier` object has been trained, it is also possible to employ the `predict()` method to predict the labels of particular samples from your test subset.

✏️ By examination of the test subset head, or printing the test subset column names, identify a particular chemical name from the test set.

In [73]:
#print(C317.load_spectra(0,1).columns)
print(test_samples)
#I choose the o-aminobenzylalcohol

['m-toluicacid', 'o-aminobenzylalcohol']


✏️ Use the `predict()` method of your trained `KNeighborsClassifier` object, passing the transpose of your selected chemical column(s) as an argument. This will return the predicted label (o, m or p) for each repeat, based on the training data that has been passed to the algorithm. See if you can identify a chemical that is correctly classified for all five repeats and one that is not.
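The sketch below shows the shape of the call on toy data (all values and compound names are made up). It also covers one possible pitfall: if a name matches only a single column, selecting it with single brackets gives a one-dimensional Series, so double brackets are used to keep the data two-dimensional.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made-up values): columns are samples, rows are features
train = pd.DataFrame(
    [[0.0, 0.1, 0.2, 1.0, 1.1, 1.2],
     [0.0, 0.2, 0.1, 1.0, 1.2, 1.1]],
    columns=["o-a", "o-a", "o-a", "p-b", "p-b", "p-b"],
)
test = pd.DataFrame(
    [[0.05, 0.15, 1.05],
     [0.05, 0.10, 1.05]],
    columns=["o-c", "o-c", "p-d"],
)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train.T, [name[0] for name in train.columns])

# A repeated name returns a DataFrame; transpose so that each repeat is a row
predicted = model.predict(test["o-c"].T)
print(predicted)  # one predicted label per repeat

# A name with a single column: double brackets keep the selection 2-D
single = model.predict(test[["p-d"]].T)
print(single)
```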

In [77]:
OBJ.predict(C317.load_spectra(0,1)['o-aminobenzylalcohol'].T) #it worked for all 5 repeats (surprisingly); m-toluicacid did not

array(['o', 'o', 'o', 'o', 'o'], dtype='<U1')

---