## Setting

* Clustering is often used for feature engineering.
* We create clusters and use each sample's cluster membership as a feature.
* Today we will try clustering as a feature engineering technique on a simple dataset, iris, and see whether it improves results.
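To make the idea concrete, here is a minimal sketch of turning cluster membership into an extra feature column. It assumes scikit-learn's `KMeans` with three clusters; `n_init=10` and `random_state=0` are illustrative choices that are not part of the notebook, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the four iris measurements (150 samples x 4 features).
X = load_iris().data

# Fit k-means. n_clusters=3, n_init=10 and random_state=0 are assumptions
# chosen for illustration only.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# labels_ holds one cluster id per sample; reshape it into a column.
membership = km.labels_.reshape(-1, 1)

# Append the cluster id as a fifth feature column.
X_augmented = np.hstack([X, membership])
print(X.shape, X_augmented.shape)
```

The exercise below does the same thing, but on a train/test split and inside a class.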

## Task

1. For the `iris` dataset, build a logistic regression model against the cluster-membership variable alone.
2. For the `iris` dataset, build an SVM model by adding the cluster-membership variable to the feature matrix. Did the model performance go up?


You need to write a class `Iris`, which has the following methods:

* `__init__()`
    * Calls the class's own `load_data()` method
    * returns None
    
* `load_data()`
    * loads the given built-in dataset
    * splits it into train and test datasets
    * accepts 
        * datafn (sklearn dataset loader function, default `load_iris`)
    * returns 
        * 1
    
* `Kmeans()`
    * Performs k-means clustering on the loaded data
    * Adds the cluster membership to the original data (if the output parameter is "all") or replaces the original data with the cluster membership (a column matrix, if the output parameter is "one")
    * accepts 
        * technique (k-means initialization technique) (default "random")
        * n_clusters (number of clusters) (default 2)
        * output ("all", "one") (default "all")
    * returns
        * self
        
* `model()`
    * fits the given model on the training data and returns its accuracy score on the test data
    * accepts 
        * model (sklearn estimator object, default logistic regression)
    * returns
        * accuracy score
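The `technique` argument above is handed to `KMeans(init=...)`, for which scikit-learn accepts the strings `"random"` and `"k-means++"`. The sketch below fits both on the full iris data and prints the final inertia (within-cluster sum of squares) so the two can be compared. The `n_init=1` and `random_state=0` settings are assumptions made here so that each run uses a single initialization; they are not part of the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

results = {}
for technique in ("random", "k-means++"):
    # n_init=1 means a single initialization, so the effect of `init`
    # is not averaged away. random_state=0 is an illustrative choice.
    km = KMeans(init=technique, n_clusters=3, n_init=1, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the closest centroid.
    results[technique] = km.inertia_

for technique, inertia in results.items():
    print(technique, inertia)
```

A lower inertia means tighter clusters, but on a dataset this small the two initializations may well end up at the same solution.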

In [1]:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import svm
import pandas as pd, numpy as np

In [34]:
class Iris(object):

    def __init__(self):
        # Load and split the data as soon as the object is created.
        self.load_data()

    def load_data(self, datafn=load_iris):
        # datafn is an sklearn loader function such as load_iris.
        data = datafn()
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            data.data, data.target, test_size=0.3, random_state=42)
        return 1

    def Kmeans(self, technique='random', n_clusters=2, output='all'):
        # Fit k-means on the training split only, then label both splits.
        # KMeans is not seeded here, so results can vary between runs.
        km = KMeans(init=technique, n_clusters=n_clusters)
        km.fit(self.X_train)
        train_km = km.labels_.reshape(-1, 1)
        test_km = km.predict(self.X_test).reshape(-1, 1)
        if output == 'all':
            # Append the cluster label as an extra feature column.
            self.X_train = np.hstack([self.X_train, train_km])
            self.X_test = np.hstack([self.X_test, test_km])
        elif output == 'one':
            # Keep only the cluster label (column matrix).
            self.X_train = train_km
            self.X_test = test_km
        return self

    def model(self, model=None):
        # Create the default here to avoid sharing one estimator across calls.
        if model is None:
            model = LogisticRegression()
        model.fit(self.X_train, self.y_train)
        print(self.X_train.shape)
        predictions = model.predict(self.X_test)
        return accuracy_score(self.y_test, predictions)


#### Iteration 1

In [35]:
Iris().Kmeans(n_clusters = 3).model()

(105, 5)


0.9555555555555556

#### Iteration 2

In [36]:
Iris().model()

(105, 4)


0.97777777777777775

#### Iteration 3

In [37]:
Iris().Kmeans(n_clusters = 3, output='one').model()

(105, 1)


0.51111111111111107
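One possible reason for this low score, offered as an assumption rather than something the notebook verifies: logistic regression reads the cluster id as a single ordered number, while the ids k-means assigns are arbitrary. A common way to test that idea is to one-hot encode the cluster id first. The sketch below does so with scikit-learn's `OneHotEncoder`; the `n_init` and `random_state` values and all variable names are illustrative and not taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

# Fit k-means on the training split only (seed chosen for illustration).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
train_ids = km.labels_.reshape(-1, 1)
test_ids = km.predict(X_test).reshape(-1, 1)

# One indicator column per cluster instead of one ordered number.
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train_ids)
test_onehot = encoder.transform(test_ids)

clf = LogisticRegression().fit(train_onehot, y_train)
onehot_accuracy = accuracy_score(y_test, clf.predict(test_onehot))
print(train_onehot.shape, onehot_accuracy)
```

Whether this closes the gap depends on how well the clusters line up with the species, so the printed accuracy should be checked rather than assumed.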

#### Iteration 4

In [39]:
from sklearn.svm import SVC
Iris().Kmeans(n_clusters = 3, output='one').model(SVC())

(105, 1)


0.93333333333333335
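The run above uses the cluster label alone, whereas Task 2 asks about adding it to the feature matrix. A self-contained sketch of that comparison follows. It assumes the same 70/30 split as the class, a default `SVC()`, and a seeded `KMeans` (`n_init=10`, `random_state=0`) so that reruns are repeatable; none of these seeds come from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

# Baseline: SVC on the four original features.
baseline = SVC().fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

# Fit k-means on the training split, then label both splits.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
X_train_aug = np.hstack([X_train, km.labels_.reshape(-1, 1)])
X_test_aug = np.hstack([X_test, km.predict(X_test).reshape(-1, 1)])

# Same model, with the cluster label as a fifth column.
augmented = SVC().fit(X_train_aug, y_train)
augmented_accuracy = accuracy_score(y_test, augmented.predict(X_test_aug))

print("SVC, original features:  ", baseline_accuracy)
print("SVC, with cluster column:", augmented_accuracy)
```

With only 45 test samples, a difference of one or two correct predictions moves the score by several points, so a single split is weak evidence either way.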