# Logistic Regression

## SML Query

This notebook provides an example of how to use SML to read in a dataset, split the data into training and testing data, replace troublesome values such as NaNs in the dataset, and perform classification on the dataset. For this use case we use the publicly available [Chronic Kidney Disease Data](https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease) and use logistic regression to classify the classes of kidney disease. **[Clarify with Mike]**.

**[ Why are we Preprocessing Kidney Data??]**

### Imports

We make the necessary imports to use SML to read in the dataset, split it into training and testing data, replace troublesome values, and perform classification.

In [2]:
from sml import execute

### Query

Next we create the query statement. `READ` loads the data: the file is delimited by `,`, has no header row (`header = None`), and contains both numeric and string values. `REPLACE` substitutes the mode of the column for any value of `?`. `SPLIT` uses 80% of the dataset for training and 20% for testing. Lastly, `CLASSIFY` performs classification using logistic regression on the 25th column, using columns 1-24 as the predictors.

In [6]:
query = 'READ "../data/chronic.csv" (separator = ",", header = None) AND \
REPLACE (missing="?", strategy = "mode") AND SPLIT (train = .8, test = 0.2) AND CLASSIFY \
(predictors = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24],\
 label = 25, algorithm = logistic)'

execute(query, verbose=True)


Sml Summary:
   Dataset Path:        ../data/chronic.csv
   Delimiter:      ,
   Training Set Split:       80.00%
   Testing Set Split:        20.00%
   Predictiors:        ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24']
   Label:         25
   Algorithm:     logistic
   Dataset Preview:
   0   1   2   3   4   5   6   7   8    9  ...  15  16  17  18  19  20  21  \
0  65   4   0   1   3   0   0   0   0  112 ...   3  88  33   0   4   3   1   
1  59   2   0   4   3   0   0   0   0   88 ...   2  10  33   1   7   3   1   
2  50   4   2   5   2   0   0   0   0   38 ...  32  63  33   1   4   3   0   
3  65   5   1   4   3   0   1   1   0  130 ...  23  84  42   0   7   3   0   
4  69   4   2   5   3   0   0   0   0   71 ...  10  89   1   1   7   3   1   

   22  23  24  
0   1   1   2  
1   1   1   2  
2   1   0   2  
3   0   0   2  
4   1   1   2  

[5 rows x 25 columns]
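
The predictor list in the query above is typed out by hand. As a minimal sketch, the same query string could be assembled in Python with `range`. The helper name `build_query` is hypothetical and not part of SML, and the sketch assumes SML accepts exactly the clause syntax used in the query above.

```python
def build_query(path, n_predictors, label, train=0.8, test=0.2):
    """Assemble an SML-style query string (hypothetical helper, not part of SML)."""
    # Columns 1..n_predictors serve as predictors, mirroring the hand-typed list.
    predictors = ",".join(str(i) for i in range(1, n_predictors + 1))
    clauses = [
        f'READ "{path}" (separator = ",", header = None)',
        'REPLACE (missing="?", strategy = "mode")',
        f"SPLIT (train = {train}, test = {test})",
        f"CLASSIFY (predictors = [{predictors}], label = {label}, algorithm = logistic)",
    ]
    # Clauses are chained with the AND keyword, as in the query above.
    return " AND ".join(clauses)


query = build_query("../data/chronic.csv", n_predictors=24, label=25)
print(query)
```

The resulting string could then be passed to `execute(query, verbose=True)` in the same way as the hand-written one.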




## Manually
The cells below show how the same actions as the SML query can be performed manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [7]:
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.preprocessing import LabelEncoder
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

# sklearn.cross_validation was removed in scikit-learn 0.20 in favor of model_selection
from sklearn import model_selection, metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression


### READ

By default the Chronic Kidney Disease dataset does not include its headers, so we specify the headers manually and read the file into a pandas DataFrame.

In [8]:
import pandas as pd

# One name per column: 24 features followed by the class label
names = [
    'age', 'blood Pressure', 'Specific Gravity', 'Albumin', 'Sugar', 'Red Blood Cells', 'Pus Cell',
    'Pus Cell clumps', 'Bacteria', 'Blood Glucose', 'Blood Urea', 'Serum Creatinine', 'Sodium',
    'Potassium', 'Hemoglobin', 'Packed Cell Volume', 'White Blood Cell Count',
    'Red Blood Cell Count', 'Hypertension', 'Diabetes Mellitus', 'Coronary Artery Disease',
    'Appetite', 'Pedal Edema', 'Anemia', 'Class']

data = pd.read_csv('../data/chronic.csv', names = names)

data.head()

| | age | blood Pressure | Specific Gravity | Albumin | Sugar | Red Blood Cells | Pus Cell | Pus Cell clumps | Bacteria | Blood Glucose | ... | Red Blood Cell Count | Hypertension | Diabetes Mellitus | Coronary Artery Disease | Appetite | Pedal Edema | Anemia | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 80 | 1.02 | 1 | 0 | ? | normal | notpresent | notpresent | 121 | ... | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7 | 50 | 1.02 | 4 | 0 | ? | normal | notpresent | notpresent | ? | ... | ? | no | no | no | good | no | no | ckd |
| 2 | 62 | 80 | 1.01 | 2 | 3 | normal | normal | notpresent | notpresent | 423 | ... | ? | no | yes | no | poor | no | yes | ckd |
| 3 | 48 | 70 | 1.005 | 4 | 0 | normal | abnormal | present | notpresent | 117 | ... | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51 | 80 | 1.01 | 2 | 0 | normal | normal | notpresent | notpresent | 106 | ... | 4.6 | no | no | no | good | no | no | ckd |
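
When column names are supplied by hand, a quick check that they are unique and that their count matches the number of fields in the file can catch misaligned columns early. The sketch below uses a two-row inline sample instead of `../data/chronic.csv`; the helper `check_names` and the sample rows are illustrative assumptions, not part of the notebook.

```python
import csv
import io


def check_names(names, csv_text):
    """Return a list of problems found when pairing names with CSV fields."""
    problems = []
    # A repeated name makes label-based column lookups ambiguous.
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append(f"duplicate names: {duplicates}")
    # Compare the number of names with the number of fields in the first row.
    first_row = next(csv.reader(io.StringIO(csv_text)))
    if len(first_row) != len(names):
        problems.append(f"{len(names)} names for {len(first_row)} columns")
    return problems


sample = "48,80,ckd\n7,50,ckd\n"  # made-up three-column sample
ok = check_names(["age", "blood Pressure", "Class"], sample)
bad = check_names(["age", "age", "blood Pressure", "Class"], sample)
print(ok)
print(bad)
```

An empty list means the names line up with the file; any message in the list points at a name list that needs correcting before calling `pd.read_csv`.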


### Preprocess
Next we encode the categorical values so that the data can be passed into the sklearn machine learning library to perform logistic regression.

#### Encode Categorical Values


In [9]:
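def encode_categorical(df, cols=None, missing='?'):
    """Return a copy of df with categorical columns replaced by integer codes.

    Entries equal to `missing` become NaN so that they can be imputed afterwards.
    """
    df = df.copy()
    for i, col in enumerate(df.columns):
        # Look columns up by position: df[col] returns a DataFrame (which has
        # no .dtype attribute) whenever a column name is duplicated.
        if cols is None:
            is_categorical = not pd.api.types.is_numeric_dtype(df.dtypes.iloc[i])
        else:
            is_categorical = col in cols
        if is_categorical:
            column = df.iloc[:, i]
            # factorize numbers the distinct values 0..k-1 and marks NaN with -1
            codes, _ = pd.factorize(column.where(column != missing))
            df.isetitem(i, np.where(codes == -1, np.nan, codes))
    return df

data_encoded = encode_categorical(data)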
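
To see what integer encoding does to a single column, the sketch below applies `pandas.factorize` to a small made-up column. This is one common way to map categories to codes; it is shown for illustration only and is not necessarily how SML encodes values internally.

```python
import pandas as pd

# A made-up categorical column for illustration.
column = pd.Series(["normal", "abnormal", "normal", "normal"])

# factorize assigns integer codes in order of first appearance.
codes, categories = pd.factorize(column)

print(list(codes))       # [0, 1, 0, 0]
print(list(categories))  # ['normal', 'abnormal']
```

Because the codes follow the order of first appearance, the same input always produces the same mapping.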

### REPLACE

We impute missing values in our pandas DataFrame to account for NaNs in our dataset.

In [10]:
# Replace NaNs with the most frequent value (the mode) of each column
from sklearn.impute import SimpleImputer

class ImputeCategorical(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns
        self.imputer = None

    def fit(self, data, target=None):
        if self.columns is None:
            self.columns = data.columns
        self.imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        self.imputer.fit(data[self.columns])
        return self

    def transform(self, data):
        """
        Uses the fitted imputer to fill in the missing values of a data frame.
        """
        output = data.copy()
        output[self.columns] = self.imputer.transform(output[self.columns])
        return output

# Impute every column of the encoded kidney dataset
imputer = ImputeCategorical()
data = imputer.fit_transform(data_encoded)
data.head()

# Separate the label column from the features
labels = data['Class']
features = data.drop(columns='Class')
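
The `REPLACE (missing="?", strategy = "mode")` step of the query can also be approximated with plain pandas. This is a sketch on a made-up frame, assuming that "mode" means the most frequent non-missing value of each column; ties are resolved by taking the first mode pandas returns, which may differ from what SML does.

```python
import pandas as pd

# Made-up values for illustration, with '?' marking missing entries.
toy = pd.DataFrame({
    "Appetite": ["good", "?", "good", "poor"],
    "Anemia": ["no", "no", "?", "yes"],
})

# Mask the '?' marker so pandas treats those cells as missing (NaN).
masked = toy.mask(toy == "?")

# mode() ignores NaN; its first row holds the most frequent value per column.
filled = masked.fillna(masked.mode().iloc[0])

print(filled)
```

Here the missing `Appetite` entry becomes `good` and the missing `Anemia` entry becomes `no`, the most frequent values of their columns.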

### SPLIT

After separating our labels from our features, we use sklearn's `train_test_split` to perform an 80%/20% split into training and testing datasets, respectively.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    test_size=.2, random_state=42)
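
As a quick sanity check of the 80%/20% split, the sketch below runs `train_test_split` on ten made-up rows and prints the resulting sizes. The `toy_` variable names are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each, plus a binary label.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)

print(len(toy_X_train), len(toy_X_test))  # 8 2
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs select the same rows for each set.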

### CLASSIFY

Lastly, we fit our logistic regression model on the training dataset, make predictions on the testing dataset, and display the accuracy.

In [8]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)

# Display the accuracy on the held-out test set
print('Accuracy:', accuracy_score(y_test, pred))
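
Accuracy is the fraction of test rows whose predicted label matches the true label. The sketch below computes it by hand and with `accuracy_score` on made-up labels; the numbers are illustrative and are not results from the kidney dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for four test rows.
toy_true = np.array([1, 0, 1, 1])
toy_pred = np.array([1, 0, 0, 1])

# Three of the four predictions match, so both values are 0.75.
manual_accuracy = np.mean(toy_true == toy_pred)
library_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy, library_accuracy)  # 0.75 0.75
```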