# Task 3

### Description

We used pandas DataFrames to import and handle the data. During preprocessing, we split each four-letter amino acid sequence into four separate categorical features, one per position, and applied a one-hot encoder to turn these features into a numeric form that the classifier can work with. Had we passed the raw data instead (i.e. by mapping each character to an integer from 1 to 21), the classifier would have treated the values as continuous and ordered, and would have fitted a misleading model. For the classification itself, we used a multi-layer perceptron (`MLPClassifier`).
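To illustrate why integer codes can mislead a classifier, here is a minimal, dependency-free sketch. The six-letter alphabet, the toy sequences, and the helper names (`integer_encode`, `one_hot_encode`, `sq_dist`) are assumptions made for this example and are not part of the notebook. It compares squared distances between sequences that each differ by exactly one substitution.

```python
# Hypothetical subset of the amino acid alphabet, for illustration only
alphabet = ["A", "C", "D", "E", "F", "G"]

def integer_encode(seq):
    # Map each letter to 1..len(alphabet); this implies an ordering A < C < D < ...
    return [alphabet.index(ch) + 1 for ch in seq]

def one_hot_encode(seq):
    # One 0/1 indicator per (position, letter) pair
    return [1 if ch == letter else 0 for ch in seq for letter in alphabet]

def sq_dist(u, v):
    # Squared Euclidean distance between two encoded sequences
    return sum((a - b) ** 2 for a, b in zip(u, v))

# seq_b and seq_c each differ from seq_a by a single substitution in the last position
seq_a, seq_b, seq_c = "ACDE", "ACDG", "ACDF"

int_ab = sq_dist(integer_encode(seq_a), integer_encode(seq_b))
int_ac = sq_dist(integer_encode(seq_a), integer_encode(seq_c))
ohe_ab = sq_dist(one_hot_encode(seq_a), one_hot_encode(seq_b))
ohe_ac = sq_dist(one_hot_encode(seq_a), one_hot_encode(seq_c))

print(int_ab, int_ac)  # integer codes: the two substitutions look unequally "far"
print(ohe_ab, ohe_ac)  # one-hot: both substitutions are equally far
```

With integer codes, the distance depends on the arbitrary position of a letter in the alphabet; with one-hot indicators, every single substitution is treated alike.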

### Main

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neural_network import MLPClassifier

In [2]:
pd.set_option('display.max_columns', 100)

In [3]:
# Specify location of project folder
folder = '/Users/gian-andreagottini/Documents/Coding/Introduction to Machine Learning/Task 3/'

In [4]:
# Import data
train_data = pd.read_csv(folder + 'train.csv')
x_train = pd.DataFrame(data = train_data['Sequence'])
y_train = pd.DataFrame(data = train_data['Active'])
x_test = pd.read_csv(folder + 'test.csv')

In [5]:
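# Split every sequence into single characters and one-hot encode each position
def split(train, test):
    # Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2; adjust for newer versions
    ohe = OneHotEncoder(sparse = False)
    
    # list('ACDE') -> ['A', 'C', 'D', 'E'], i.e. one categorical feature per position
    proteins_train = [list(seq) for seq in train['Sequence']]
    proteins_test = [list(seq) for seq in test['Sequence']]
    
    # Fit the encoder on the training data only and reuse it for the test data
    proteins_train_ohe = ohe.fit_transform(proteins_train)
    proteins_test_ohe = ohe.transform(proteins_test)
    
    # Column labels are the amino acid letters of each position in turn,
    # so the same letter appears once per position
    cats = np.concatenate(ohe.categories_).ravel()
    
    transformed_train = pd.DataFrame(data = proteins_train_ohe, columns = cats)
    transformed_test = pd.DataFrame(data = proteins_test_ohe, columns = cats)
    
    return transformed_train, transformed_test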
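The behaviour of the encoder used in `split` can be checked on a few toy sequences. The sketch below is illustrative only: the sequences are made up, and it relies on the encoder's default sparse output plus `.toarray()` so that it does not depend on the `sparse` / `sparse_output` argument name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: each sequence is split into one character per position
toy_train = [list("ACDE"), list("ACDF"), list("GCDE")]
toy_test = [list("GCDF")]

# Fit on the training rows only, then apply the same mapping to the test row
ohe = OneHotEncoder()
train_encoded = ohe.fit_transform(toy_train).toarray()
test_encoded = ohe.transform(toy_test).toarray()

# categories_ holds one array per position; concatenating them gives the column labels
cats = np.concatenate(ohe.categories_).ravel()

print(ohe.categories_)
print(cats)
print(train_encoded.shape)
print(test_encoded)
```

Because the labels are built per position, a letter that occurs at several positions appears several times in `cats`, which is worth keeping in mind when selecting columns of the resulting DataFrame by name.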

In [6]:
X_train_split, X_test_split = split(x_train, x_test)

In [7]:
clf = MLPClassifier(random_state = 1)
# Pass the labels as a 1-D array; a single-column DataFrame makes scikit-learn emit a DataConversionWarning
clf.fit(X_train_split, y_train.values.ravel())

MLPClassifier(random_state=1)

In [8]:
y_pred = clf.predict(X_test_split)

In [9]:
predictions = pd.DataFrame(data = y_pred)

In [10]:
predictions.to_csv(folder + 'solution.csv', header = False, index = False)