# Samuel Watkins, 3032132676

# HW 6: Homebrew Computer Vision
## Due Monday Apr 2, 2018 at 2 PM

1. Download the [zip file](https://www.dropbox.com/s/cst9awcjpp08k33/50_categories.tar.gz). Look at some of the images, noting that there are 50 classes in 4244 images (e.g. "goldfish", “llama”, “speed-boat”, ...). Caution: it’s a pretty large file (~208M).
2. Write a set of methods that takes as input one of these images, and then computes real-numbered features as the return. You should produce at least 15 features.
3. Based on the feature set for each image, build a random forest classifier. Produce metrics on your estimated error rates using cross-validation. How much better is this than the expectation with random guessing? What are the 3 most important features?
4. Make sure your final classifier can run on a directory of different images, where a call like `run_final_classifier("/new/directory/path/")` on a directory that contains files like `validation1.jpg`, `validation2.jpg`, etc. will produce an output file that looks like:  
```
filename              predicted_class  
``` 
` `-----------------------------------------------------------------
```
validation1.jpg       unicorn  
validation2.jpg       camel  
```

    We will have a validation set to test how good your classifier is.

# Function to Extract Features from an Image

In [97]:
import matplotlib.pyplot as plt
import numpy as np

def extractImageFeatures(pathToImage):
    imageArray = plt.imread(pathToImage).astype("float")
    
    arraySize = np.prod(imageArray.shape)
    avgAllChans = np.mean(imageArray)
    stdAllChans = np.std(imageArray)
    ratioStdAvgAllChans = stdAllChans/avgAllChans
    
    avgRedChan = np.mean(imageArray[0])
    stdRedChan = np.std(imageArray[0])
    
    avgBlueChan = np.mean(imageArray[1])
    stdBlueChan = np.std(imageArray[1])
    
    avgGreenChan = np.mean(imageArray[2])
    stdGreenChan = np.std(imageArray[2])
    
    ratioRedBlue = avgRedChan/avgBlueChan
    ratioBlueGreen = avgBlueChan/avgGreenChan
    ratioRedGreen = avgRedChan/avgGreenChan
    
    ratioStdAvgRedChan = stdRedChan/avgRedChan
    ratioStdAvgBlueChan = stdBlueChan/avgBlueChan
    ratioStdAvgGreenChan = stdGreenChan/avgGreenChan
    
    features=np.array([arraySize,avgRedChan,stdRedChan,avgBlueChan,stdBlueChan,
              avgGreenChan,stdGreenChan,ratioRedBlue,ratioBlueGreen,
              ratioRedGreen,ratioStdAvgRedChan,ratioStdAvgBlueChan,ratioStdAvgGreenChan,
              avgAllChans,stdAllChans,ratioStdAvgAllChans])
    features[np.isnan(features)]=0.0
    features[np.isinf(features)]=0.0
    
    return features

In [98]:
pathToImage = "/home/sam/Documents/watkins-ay250-s2018-hw/hw_6/50_categories/airplanes/airplanes_0001.jpg"
print(extractImageFeatures(pathToImage))

[1.95816000e+05 2.22739531e+02 1.54667741e+01 2.22814908e+02
 1.53525868e+01 2.22737018e+02 1.54015331e+01 9.99661706e-01
 1.00034969e+00 1.00001128e+00 6.94388374e-02 6.89028708e-02
 6.91467150e-02 1.69752548e+02 6.90050805e+01 4.06503945e-01]


# Extract Features from All Images

In [99]:
from glob import glob

In [137]:
pathToImageFolders = "50_categories/"
eachFolder = glob(pathToImageFolders+"*/")
train = 0.2 # ratio of training dataset to total dataset
X_train = list()
Y_train = list()
X_test = list()
Y_test = list()

for folder in eachFolder:
    filesInFolder = glob(folder+"*.jpg")
    for iFile,imageFile in enumerate(filesInFolder):
        if iFile < int(0.5*len(filesInFolder)):
            X_train.append(extractImageFeatures(imageFile))
            Y_train.append(folder[len(pathToImageFolders):-1])
        else:
            X_test.append(extractImageFeatures(imageFile))
            Y_test.append(folder[len(pathToImageFolders):-1])

X_train = np.vstack(X_train)
Y_train = np.array(Y_train)
X_test = np.vstack(X_test)
Y_test = np.array(Y_test)




# Build A Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import cross_validation

In [138]:
classifier = RandomForestClassifier(n_estimators=50)

classifier.fit(X_train,Y_train)

pred_rf = classifier.predict(X_test)

In [144]:
print(f"Score: {metrics.accuracy_score(Y_test,pred_rf)}")
scores = cross_validation.cross_val_score(classifier,np.vstack([X_train,X_test]),np.concatenate([Y_train,Y_test]),cv=5)
print(f"Accuracy from cross-validation:{np.mean(scores)} (+/- {np.std(scores)})")

Score: 0.19588592800374008
Accuracy from cross-validation:0.20859869355131982 (+/- 0.010412088090862723)


In [140]:
classifier.feature_importances_

array([0.11381806, 0.05620071, 0.05189537, 0.05245607, 0.05354846,
       0.05437473, 0.05092574, 0.06310891, 0.06761372, 0.06308358,
       0.05322788, 0.04843142, 0.05221076, 0.07823844, 0.07042652,
       0.07043962])

# Compare to Random Guessing

In [120]:
from sklearn.dummy import DummyClassifier

In [141]:
dummyclf = DummyClassifier(strategy="uniform",random_state=42)

dummyclf.fit(X_train,Y_train)

dummypred_rf = dummyclf.predict(X_test)

In [142]:
print(metrics.accuracy_score(Y_test,dummypred_rf))
scores = cross_validation.cross_val_score(dummyclf,np.vstack([X_train,X_test]),np.concatenate([Y_train,Y_test]),cv=5)
print(f"{np.mean(scores)} (+/- {np.std(scores)})")

0.021972884525479196
0.017004725253143567 (+/- 0.0038749670420298575)
