# Computer Graphics Mini-Project
## Disaster Damage Assessment
## Project: Classifying Damaged and Undamaged Superpixels

## Importing and Exploring the Data
The code cell below loads the necessary Python libraries and the image data. The last column of the dataset, `'target'`, is the target label: 1 if the superpixel is damaged, 0 if it is undamaged. All other columns are features of each superpixel.

In [6]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
# train_test_split, KFold and GridSearchCV live in sklearn.model_selection;
# the older sklearn.cross_validation and sklearn.grid_search modules were removed
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestClassifier

# Read image data
data = pd.read_csv("data_final_csv_A.csv")   # labeled superpixels used for training and testing
data1 = pd.read_csv("pyne_2_data.csv")       # superpixels of the sample image classified later
print("Damage data read successfully!")

Damage data read successfully!


### Implementation: Data Exploration
Let's begin by investigating the dataset to determine how many superpixels we have information on, and learn about the damage rate among these superpixels. In the code cell below, we compute the following:
- The total number of superpixels, `n_superpixels`.
- The total number of features for each superpixel, `n_features`.
- The number of damaged superpixels, `n_damaged`.
- The number of undamaged superpixels, `n_undamaged`.
- The share of damaged superpixels, `damaged_rate`, in percent (%).


In [2]:
# Calculate number of superpixels
n_superpixels = data.shape[0]

# Calculate number of features (every column except 'target')
n_features = data.shape[1] - 1

# Calculate number of damaged superpixels
n_damaged = data[data['target'] == 1].shape[0]

# Calculate number of undamaged superpixels
n_undamaged = data[data['target'] == 0].shape[0]

# Calculate the damaged rate as a percentage of all superpixels
damaged_rate = 100.0 * n_damaged / n_superpixels

# Print the results
print("Total number of superpixels: {}".format(n_superpixels))
print("Number of features: {}".format(n_features))
print("Number of superpixels damaged: {}".format(n_damaged))
print("Number of superpixels undamaged: {}".format(n_undamaged))
print("Damaged percentage: {:.2f}%".format(damaged_rate))

Total number of superpixels: 1508
Number of features: 62
Number of superpixels damaged: 1004
Number of superpixels undamaged: 504
Damaged percentage: 66.58%

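Using the counts printed above, two different quantities can be checked by hand: the share of damaged superpixels among all superpixels, and the ratio of damaged to undamaged superpixels. The sketch below is plain arithmetic on those counts; the helper name `class_rate` is ours and purely illustrative.

```python
def class_rate(n_class, n_total):
    """Return the percentage of samples that belong to one class."""
    return 100.0 * n_class / n_total

# Counts taken from the output above
n_damaged = 1004
n_undamaged = 504
n_superpixels = n_damaged + n_undamaged

# Share of damaged superpixels among all superpixels
damaged_rate = class_rate(n_damaged, n_superpixels)

# Ratio of damaged to undamaged superpixels, expressed as a percentage
damaged_to_undamaged = 100.0 * n_damaged / n_undamaged

print("Damaged share of all superpixels: {:.2f}%".format(damaged_rate))
print("Damaged-to-undamaged ratio: {:.2f}%".format(damaged_to_undamaged))
```

Roughly two out of three superpixels are damaged, so the classes are imbalanced. That is worth keeping in mind when reading accuracy figures later.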

## Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
The code cell below is run to separate the image data into feature and target columns to see if any features are non-numeric.

In [7]:
# Extract feature columns
feature_cols = list(data.columns[:-1])

# Extract target column 'target'
target_col = data.columns[-1]

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = data[feature_cols]
y_all = data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())

Feature columns:
['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49', 'a50', 'a51', 'a52', 'a53', 'a54', 'a55', 'a56', 'a57', 'a58', 'a59', 'a60', 'a61', 'a62']

Target column: target

Feature values:
         a1         a2        a3  a4  a5  a6  a7  a8  a9  a10 ...   a53  a54  \
0  4.713330  13.562540  0.999252  16  32   0  11  32  13  129 ...    32   15   
1  5.028532  26.282548  0.998761   8  33   0  19  38  13  191 ...    35   23   
2  4.965613  15.500211  0.998826  18  41   0  21  41  19  159 ...    27   20   
3  5.100038  18.949694  0.999275  11  44   0  14  30  20  128 ...    22   23   
4  4.746942  12.734803  0.999223  10  31   0  13  44   9  164 ...    25   18   

   a55  a56  a57  a58  a59  a60  a61

### Preprocess Feature Columns

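The preview above shows only numeric values. The function below is a safeguard for any non-numeric columns: it would convert `yes`/`no` values into `1`/`0` (binary) values and any other categorical column into dummy variables. For this dataset it leaves the features unchanged, as the output (still 62 features) confirms.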

In [8]:
def preprocess_features(X):
    ''' Preprocesses the superpixel data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.items():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If the column is still non-numeric, it is categorical: convert to dummy variables
        if col_data.dtype == object:
            # Example: a column 'c' with values 'x' and 'y' => 'c_x' and 'c_y'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (62 total features):
['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10', 'a11', 'a12', 'a13', 'a14', 'a15', 'a16', 'a17', 'a18', 'a19', 'a20', 'a21', 'a22', 'a23', 'a24', 'a25', 'a26', 'a27', 'a28', 'a29', 'a30', 'a31', 'a32', 'a33', 'a34', 'a35', 'a36', 'a37', 'a38', 'a39', 'a40', 'a41', 'a42', 'a43', 'a44', 'a45', 'a46', 'a47', 'a48', 'a49', 'a50', 'a51', 'a52', 'a53', 'a54', 'a55', 'a56', 'a57', 'a58', 'a59', 'a60', 'a61', 'a62']
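Since every column in this dataset is already numeric, the output above does not show what the encoding step does. The following is a minimal pure-Python sketch of the same idea on a toy example. The column names `cloudy` and `sensor` and the helper `encode_column` are hypothetical and not part of the superpixel data or of pandas.

```python
def encode_column(name, values):
    """Encode one column; returns {new_column_name: list_of_numbers}."""
    # Numeric columns pass through unchanged
    if all(isinstance(v, (int, float)) for v in values):
        return {name: list(values)}
    # Binary yes/no columns become 1/0
    if set(values) <= {"yes", "no"}:
        return {name: [1 if v == "yes" else 0 for v in values]}
    # Any other text column becomes one 0/1 dummy column per category
    categories = sorted(set(values))
    return {
        "{}_{}".format(name, cat): [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical toy columns (not part of the superpixel dataset)
toy = {
    "a1": [4.71, 5.03, 4.97],
    "cloudy": ["yes", "no", "yes"],
    "sensor": ["A", "B", "A"],
}

encoded = {}
for name, values in toy.items():
    encoded.update(encode_column(name, values))

for name in sorted(encoded):
    print(name, encoded[name])
```

The numeric column is untouched, the yes/no column becomes a single 0/1 column, and the two-category column expands into two dummy columns.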


### Implementation: Training and Testing Data Split
At this point every feature is numeric. For the next step, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, we implement the following:
- Randomly shuffle and split the data (`X_all`, `y_all`) into training and testing subsets, holding out 25% of the data for testing.
- Set a `random_state` for the function(s) we use.
- Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [9]:
# Fraction of the data held out for testing; with 1508 superpixels this gives
# 1131 training points and 377 testing points
test_fraction = 0.25

# Shuffle and split the dataset using the fraction above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=test_fraction, random_state=42)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 1131 samples.
Testing set has 377 samples.
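The sizes above follow directly from holding out 25% of 1508 samples. As a rough illustration of what a seeded shuffle-and-split does, here is a pure-Python sketch. It is not scikit-learn's implementation (which uses its own random number generator, so the selected indices differ), and the helper name `shuffle_split` and the choice to round the test size up are assumptions of this sketch.

```python
import math
import random

def shuffle_split(n_samples, test_size, seed):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle, and therefore the split, repeatable
    random.Random(seed).shuffle(indices)
    # Number of held-out samples, rounded up
    n_test = int(math.ceil(test_size * n_samples))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(1508, test_size=0.25, seed=42)
print("Training set has {} samples.".format(len(train_idx)))
print("Testing set has {} samples.".format(len(test_idx)))
```

Calling the function again with the same seed returns the same indices, which is the role `random_state=42` plays in the cell above.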


## Training and Evaluating Models


### Setup
The code cell below defines three helper functions used for training and testing the supervised learning models. The functions are as follows:
- `train_classifier` - takes as input a classifier and training data and fits the classifier to the data.
- `predict_labels` - takes as input a fit classifier, features, and a target labeling, makes predictions, prints the accuracy, and returns the F<sub>1</sub> score.
- `train_predict` - takes as input a classifier, and the training and testing data, and performs `train_classifier` and `predict_labels`.
 - This function reports the accuracy and the F<sub>1</sub> score for both the training and testing data separately.

In [10]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Uncomment to print the training time
    #print('Trained model in {:.4f} seconds'.format(end - start))

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier, prints the accuracy
        and returns the F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results (uncomment the first line to print the prediction time)
    #print("Made predictions in {:.4f} seconds.".format(end - start))
    print("Accuracy: {:.4f}".format(accuracy_score(target.values, y_pred)))
    return f1_score(target.values, y_pred, pos_label=1)


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {} : ".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print("F1 score training set: {:.4f}".format(predict_labels(clf, X_train, y_train)))
    print("F1 score     test set: {:.4f}\n".format(predict_labels(clf, X_test, y_test)))

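The helpers above report two metrics, accuracy and the F<sub>1</sub> score with damaged (1) as the positive class. As a reminder of how the two are computed, here is a minimal pure-Python sketch on made-up labels. The function names and the toy labels are illustrative only; the notebook itself relies on `accuracy_score` and `f1_score` from scikit-learn.

```python
def confusion_counts(y_true, y_pred, pos_label=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == pos_label and t == pos_label:
            tp += 1
        elif p == pos_label:
            fp += 1
        elif t == pos_label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

def f1(y_true, y_pred, pos_label=1):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, pos_label)
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

# Made-up labels: 1 = damaged, 0 = undamaged
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

print("Accuracy: {:.4f}".format(accuracy(y_true, y_pred)))
print("F1 score: {:.4f}".format(f1(y_true, y_pred)))
```

Note that F<sub>1</sub> ignores true negatives, so on imbalanced data it can look quite different from accuracy.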
### Implementation: Model Performance Metrics
With the helper functions above, we now import several supervised learning models and run the `train_predict` function for each one. The following code cell does the following:
- Import the supervised learning models.
- Initialize the models and store them in `clf_A` (decision tree), `clf_B` (SVC), `clf_C` (Gaussian naive Bayes), `clf_D` (random forest), `clf_V` (hard-voting ensemble of `clf_A`, `clf_C`, and `clf_D`), and `clf_E` (AdaBoost).
 - A `random_state` is set for the decision tree and the SVC.
 - **Note:** Apart from `n_estimators=900` for the random forest, each model uses its default settings.
- Create the training set used to train each model.
 - *The data is not reshuffled or resplit; the training points are drawn from `X_train` and `y_train`.*
- Fit each model on the first 1130 training points and make predictions on the test set (6 runs in total). Runs with 400 and 500 training points are present in the cell but disabled.

In [11]:
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, AdaBoostClassifier


# Initialize the models
clf_A = DecisionTreeClassifier(random_state=42)
clf_B = SVC(random_state=42)
clf_C = GaussianNB()
clf_D = RandomForestClassifier(n_estimators=900)
clf_V = VotingClassifier(estimators=[('dt', clf_A), ('gnb', clf_C), ('rf', clf_D)], voting='hard')
clf_E = AdaBoostClassifier()

# Smaller training set sizes (disabled): the first 400 and the first 500 training points
"""
print("\nTrain size 400\n")

X_train_400 = X_train[:400]
y_train_400 = y_train[:400]
train_predict(clf_A, X_train_400, y_train_400, X_test, y_test)
train_predict(clf_B, X_train_400, y_train_400, X_test, y_test)
train_predict(clf_C, X_train_400, y_train_400, X_test, y_test)
train_predict(clf_D, X_train_400, y_train_400, X_test, y_test)
train_predict(clf_V, X_train_400, y_train_400, X_test, y_test)

print("\nTrain size 500\n")

X_train_500 = X_train[:500]
y_train_500 = y_train[:500]
train_predict(clf_A, X_train_500, y_train_500, X_test, y_test)
train_predict(clf_B, X_train_500, y_train_500, X_test, y_test)
train_predict(clf_C, X_train_500, y_train_500, X_test, y_test)
train_predict(clf_D, X_train_500, y_train_500, X_test, y_test)
train_predict(clf_V, X_train_500, y_train_500, X_test, y_test) """


print("\nTrain size 1130\n")

# Train on the first 1130 of the 1131 training points
X_train_1130 = X_train[:1130]
y_train_1130 = y_train[:1130]
train_predict(clf_A, X_train_1130, y_train_1130, X_test, y_test)
train_predict(clf_B, X_train_1130, y_train_1130, X_test, y_test)
train_predict(clf_C, X_train_1130, y_train_1130, X_test, y_test)
train_predict(clf_D, X_train_1130, y_train_1130, X_test, y_test)
train_predict(clf_V, X_train_1130, y_train_1130, X_test, y_test)
train_predict(clf_E, X_train_1130, y_train_1130, X_test, y_test)


Train size 1130

Training a DecisionTreeClassifier using a training set size of 1130 : 
Accuracy: 1.0000
F1 score training set: 1.0000
Accuracy: 0.8568
F1 score     test set: 0.8911

Training a SVC using a training set size of 1130 : 
Accuracy: 1.0000
F1 score training set: 1.0000
Accuracy: 0.6684
F1 score     test set: 0.8013

Training a GaussianNB using a training set size of 1130 : 
Accuracy: 0.8416
F1 score training set: 0.8675
Accuracy: 0.8568
F1 score     test set: 0.8821

Training a RandomForestClassifier using a training set size of 1130 : 
Accuracy: 1.0000
F1 score training set: 1.0000
Accuracy: 0.8886
F1 score     test set: 0.9132

Training a VotingClassifier using a training set size of 1130 : 
Accuracy: 1.0000
F1 score training set: 1.0000
Accuracy: 0.8859
F1 score     test set: 0.9095

Training a AdaBoostClassifier using a training set size of 1130 : 
Accuracy: 0.9389
F1 score training set: 0.9538
Accuracy: 0.8594
F1 score     test set: 0.8938
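The SVC result deserves a second look. Its test accuracy (0.6684) and test F<sub>1</sub> score (0.8013) are what a model that labels every test superpixel as damaged would score, if 252 of the 377 test superpixels are damaged. The count 252 is our inference from 0.6684 × 377 and is not reported anywhere in the notebook, so the sketch below is a consistency check under that assumption, not a proof.

```python
n_test = 377
# ASSUMPTION: inferred from accuracy 0.6684 * 377; not a figure reported in the notebook
n_damaged_test = 252

# A model that predicts "damaged" for every sample:
# every damaged sample is a true positive, every undamaged sample is a
# false positive, and there are no false negatives.
tp = n_damaged_test
fp = n_test - n_damaged_test
fn = 0

accuracy = tp / float(n_test)
f1 = 2.0 * tp / (2 * tp + fp + fn)

print("Accuracy: {:.4f}".format(accuracy))
print("F1 score: {:.4f}".format(f1))
```

If that reading is right, the SVC's behaviour on the test set is no better than a constant guess, despite its perfect training score.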


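The `VotingClassifier` above is configured with `voting='hard'`, meaning each member model casts one vote per sample and the majority label wins. A minimal pure-Python sketch of majority voting follows. The predictions are made up, the helper name `hard_vote` is ours, and breaking ties toward the smaller label is a choice made in this sketch.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote across models for each sample."""
    voted = []
    # zip(*...) regroups the lists so that each tuple holds one sample's votes
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        best = max(counts.values())
        # Among the labels with the most votes, pick the smallest one
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted

# Made-up predictions from three hypothetical models (1 = damaged, 0 = undamaged)
tree_pred   = [1, 0, 1, 1, 0]
nb_pred     = [1, 1, 0, 1, 0]
forest_pred = [0, 0, 1, 1, 1]

print(hard_vote([tree_pred, nb_pred, forest_pred]))
```

With three voters and two classes a tie cannot occur, which is one practical reason to combine an odd number of models.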

## Displaying the result as an image
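The predictions have to be displayed in a suitable form, as an image matrix. The cell below predicts a label for every superpixel of the sample image with the random forest (`clf_D`), reads a workbook whose cells each hold the superpixel index of one pixel, and writes a new workbook in which a cell is filled red if its superpixel is predicted as damaged and green otherwise.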

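The idea behind the display step is a lookup: each pixel carries the index of the superpixel it belongs to, and that index selects the predicted label. A minimal pure-Python sketch on a tiny grid, with no spreadsheet involved, looks like this. The grid, the predictions, and the helper name `damage_map` are hypothetical and only illustrate the mapping.

```python
def damage_map(label_grid, predictions):
    """Map each pixel's superpixel index to 'R' (damaged) or 'G' (undamaged)."""
    return [
        ["R" if predictions[idx] == 1 else "G" for idx in row]
        for row in label_grid
    ]

# Hypothetical 3x4 label grid: each entry is the superpixel index of that pixel
label_grid = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [3, 3, 2, 1],
]

# Hypothetical per-superpixel predictions (1 = damaged, 0 = undamaged)
predictions = [0, 1, 0, 1]

for row in damage_map(label_grid, predictions):
    print("".join(row))
```

The notebook cell applies the same lookup per spreadsheet cell and uses a fill color instead of a letter.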
In [12]:
import openpyxl
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

# Apply the classification model to a sample image
X_sample_test = data1[feature_cols]

pred_image = clf_D.predict(X_sample_test)
print(pred_image)

# Check whether the superpixel with the given index is predicted as damaged (1) or undamaged (0)
def check_damage(value):
    if pred_image[value] == 1:
        return 1
    else:
        return 0


# Label workbook: the value of each cell is used as an index into pred_image
wbread = load_workbook('pyne_2_data_labels.xlsx')
wsread = wbread.active

# Output workbook that receives the colored cells
wb = openpyxl.Workbook()
ws = wb.active


redFill = PatternFill(start_color='FFFF0000',
                   end_color='FFFF0000',
                   fill_type='solid')

greenFill = PatternFill(start_color='0000FF00',
                   end_color='0000FF00',
                   fill_type='solid')

# Change the cell range below to match the size of the image
# Iterate through all the pixels, coloring the damaged ones red and the undamaged ones green
for row in wsread['A1:AJD637']:
    for cell in row:
        xy_address = cell.coordinate
        if check_damage(cell.value) == 1:
            print(cell.value)
            print(xy_address)
            ws[xy_address].fill = redFill
        else:
            ws[xy_address].fill = greenFill


wb.save("sample.xlsx")


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0]



41
GT111
41
GU111
...
[output truncated: one pair of lines, the superpixel index and the cell address, is printed for every pixel predicted as damaged]