# Assignment 6

## Assignment Instructions
Repo: https://github.com/cesar-ca/cs156-assign6

### Machine Learning Fashionista 2.0

In this assignment we revisit the dataset from the **dimension reduction unit**. The pictures of clothing are all originally taken from **ImageNet**, which is a large dataset containing over a million photos with many different categories. Every year there is a competition to see which techniques perform the best. The winning entry is then open-sourced and made available to all machine learning researchers for further research or to allow the development of novel applications.

**Support Vector Machines**

- Train a support vector classifier using each of the following kernels:
    - Linear
    - Poly (degree = 2)
    - RBF
- If you encounter training-time or memory issues, you may use a reduced dataset, but carefully detail why and how you reduced it.
- Unnecessarily reducing the dataset will result in reduced grades!
- Report your error rates on the testing dataset for the different kernels.
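The requested comparison can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the Fashionista images; the dataset sizes, the `random_state` values, and the `error_rates` dictionary are assumptions made for the example. The error rate is taken as one minus the accuracy that `score` returns on the held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image matrix (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# One classifier per kernel named in the assignment
kernels = {
    "linear": SVC(kernel="linear"),
    "poly (degree=2)": SVC(kernel="poly", degree=2),
    "rbf": SVC(kernel="rbf"),
}

error_rates = {}
for name, clf in kernels.items():
    clf.fit(X_tr, y_tr)
    # Error rate on the held-out test split = 1 - accuracy
    error_rates[name] = 1.0 - clf.score(X_te, y_te)
    print(f"{name}: test error rate = {error_rates[name]:.3f}")
```

Evaluating on `X_te` rather than `X_tr` is what makes the number an estimate of generalization error instead of a measure of fit.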

In [1]:
#!pip3 install python-resize-image
#!pip3 install resizeimage
# The PyPI package names are scikit-learn and scikit-image (imported as sklearn and skimage)
#!pip3 install scikit-learn
#!pip3 install scikit-image

In [2]:
# Importing libraries for handling image data
from PIL import Image
import PIL.ImageOps
from glob import glob

# More relevant libraries
import matplotlib.pyplot as plt
from random import shuffle, seed, random, sample
from collections import defaultdict

# Importing statistical packages
import numpy as np
import pylab as pl
import pandas as pd

In [3]:
# Machine learning models and utilities from sklearn
from sklearn import svm
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [4]:
from skimage import io
import imageio

In [5]:
from skimage.transform import resize
from skimage.io import imread_collection

In [6]:
# After loading photos to a folder, doing pre-processing
man_folder = '/Users/dmstudent1/Projects/Assignment6/man/*.JPEG'
man_sample = '/Users/dmstudent1/Projects/Assignment6/man_sample/*.JPEG'
man_files = glob('/Users/dmstudent1/Projects/Assignment6/man/*')

woman_folder = '/Users/dmstudent1/Projects/Assignment6/woman/*.JPEG'
woman_sample = '/Users/dmstudent1/Projects/Assignment6/woman_sample/*.JPEG'
woman_files = glob('/Users/dmstudent1/Projects/Assignment6/woman/*')

# Taking the photos in the folder together
man_clothes = imread_collection(man_folder)
man_sample_clothes = imread_collection(man_sample)

woman_clothes = imread_collection(woman_folder)
woman_sample_clothes = imread_collection(woman_sample)

In [7]:
# Resizing the photos to work with the same dimensions throughout the data
height_resize = 160
width_resize = 120
    
# Resizing all male photos in the data
man_resize = [resize(man_clothes[i], (height_resize, width_resize),
                     mode='constant', anti_aliasing=True,
                     anti_aliasing_sigma=None) for i in range(len(man_clothes))]

man_sample_resize = [resize(man_sample_clothes[i], (height_resize, width_resize),
                           mode='constant', anti_aliasing=True,
                           anti_aliasing_sigma=None) for i in range(len(man_sample_clothes))]

In [8]:
# Resizing all female clothing photos in the data 
woman_resize = [resize(woman_clothes[i], (height_resize, width_resize), 
                       mode='constant', anti_aliasing=True,
                       anti_aliasing_sigma=None) for i in range(len(woman_clothes))]

woman_sample_resize = [resize(woman_sample_clothes[i], (height_resize, width_resize),
                             mode='constant', anti_aliasing=True,
                             anti_aliasing_sigma=None) for i in range(len(woman_sample_clothes))]



In [9]:
# To work with the simple linear classifiers, the image data has to be stored as arrays
man_array = np.array([i.flatten() for i in man_resize])
woman_array = np.array([i.flatten() for i in woman_resize])

# Consolidating the data in a single place with 1, 0 to determine if it is male or female clothing
raw_data = [(row, '1') for row in man_array] + [(row, '0') for row in woman_array]

In [10]:
# Splitting the data into training set and test set
X = np.array([x for (x,y) in raw_data])
y = np.array([y for (x,y) in raw_data])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
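
A split like the one above changes on every run because no seed is given. As a hedged sketch, the variant below fixes `random_state` and passes `stratify` so that both classes keep their proportions in the test set. The toy arrays, the seed `42`, and the `_demo` variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 100 samples, 8 features, string labels like the notebook's '1'/'0'
rng = np.random.default_rng(0)
X_demo = rng.random((100, 8))
y_demo = np.array(['1'] * 50 + ['0'] * 50)

# stratify keeps the class ratio in both splits; random_state makes the split repeatable
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_train_demo.shape, X_test_demo.shape)
print("class '1' in test:", int((y_test_demo == '1').sum()))
```

With a fixed seed, scores from different kernels are computed on the same held-out rows and can be compared directly.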

**Support Vector Machines | Linear Kernel**

Using the Fashionista dataset, we train a support vector classifier with a linear kernel.

The cells below report its accuracy and classification report on the training split (`X_train`, `y_train`).

In [11]:
clf_linear = svm.SVC(kernel='linear')
clf_linear.fit(X_train, y_train)

SVC(kernel='linear')

In [13]:
print("train score: ", clf_linear.score(X_train, y_train))

train score:  1.0


In [14]:
print(metrics.classification_report(y_train, clf_linear.predict(X_train)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1018
           1       1.00      1.00      1.00       991

    accuracy                           1.00      2009
   macro avg       1.00      1.00      1.00      2009
weighted avg       1.00      1.00      1.00      2009



**Support Vector Machines | Linear Kernel (With Sample)**

Here we train the same linear-kernel classifier on the reduced dataset loaded from the `man_sample` and `woman_sample` folders.

The cells below report its accuracy and classification report on the training split (`X_train_sample`, `y_train_sample`).
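The notebook does not show how the sample folders were built, so the following is only one possible way to document a reduction: draw the same number of rows from each class with a fixed seed. The helper `stratified_subsample`, its parameters, and the toy arrays are hypothetical and not part of the original workflow.

```python
import numpy as np

def stratified_subsample(X, y, n_per_class, seed=0):
    """Return a random subset with n_per_class rows for each label (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)          # row indices of this class
        keep.append(rng.choice(idx, size=n_per_class, replace=False))
    keep = np.concatenate(keep)
    rng.shuffle(keep)                             # mix the classes again
    return X[keep], y[keep]

# Toy data: 60 rows labelled '1' and 40 rows labelled '0'
X_full = np.arange(200, dtype=float).reshape(100, 2)
y_full = np.array(['1'] * 60 + ['0'] * 40)

X_small, y_small = stratified_subsample(X_full, y_full, n_per_class=25)
print(X_small.shape, np.unique(y_small, return_counts=True))
```

Recording the seed and the per-class count alongside the results is one way to state how the dataset was reduced.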

In [15]:
# To work with the simple linear classifiers, the image data has to be stored as arrays
man_array_sample = np.array([i.flatten() for i in man_sample_resize])
woman_array_sample = np.array([i.flatten() for i in woman_sample_resize])

# Consolidating the data in a single place with 1, 0 to determine if it is male or female clothing
raw_data_sample = [(row, '1') for row in man_array_sample] + [(row, '0') for row in woman_array_sample]

In [16]:
# Splitting the data into training set and test set
X_sample = np.array([x for (x,y) in raw_data_sample])
y_sample = np.array([y for (x,y) in raw_data_sample])

X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(X_sample, y_sample, test_size = 0.2)

In [17]:
clf_sample = svm.SVC(kernel='linear')
clf_sample.fit(X_train_sample, y_train_sample)

SVC(kernel='linear')

In [18]:
print("train score: ", clf_sample.score(X_train_sample, y_train_sample))

train score:  1.0


In [19]:
print(metrics.classification_report(y_train_sample, clf_sample.predict(X_train_sample)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       698
           1       1.00      1.00      1.00       710

    accuracy                           1.00      1408
   macro avg       1.00      1.00      1.00      1408
weighted avg       1.00      1.00      1.00      1408



**Support Vector Machines | Poly (degree = 2) Kernel**

Using the Fashionista dataset, we train a support vector classifier with a polynomial kernel of degree 2.

The cells below report its accuracy and classification report on the training split (`X_train`, `y_train`).
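For reference, scikit-learn's polynomial kernel is `(gamma * <x, x'> + coef0) ** degree`. With the `SVC` defaults, `coef0` is 0 and `gamma='scale'` equals `1 / (n_features * X.var())`. The sketch below checks a manual computation against `polynomial_kernel` on a small random matrix; the matrix shape and seed are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Small random matrix standing in for the flattened images (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((5, 4))

# Same value that SVC uses for gamma='scale'
gamma = 1.0 / (X_demo.shape[1] * X_demo.var())
coef0 = 0.0   # SVC default
degree = 2

# Manual degree-2 polynomial kernel matrix
K_manual = (gamma * (X_demo @ X_demo.T) + coef0) ** degree

# Library computation with the same parameters
K_library = polynomial_kernel(X_demo, X_demo, degree=degree, gamma=gamma, coef0=coef0)

print(np.allclose(K_manual, K_library))
```

Because `coef0` is 0 by default, the degree-2 kernel contains only second-order terms; setting `coef0=1` in `SVC` would add linear and constant terms as well.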

In [20]:
clf_poly = svm.SVC(kernel='poly',degree=2)
clf_poly.fit(X_train, y_train)

SVC(degree=2, kernel='poly')

In [21]:
print("train score: ", clf_poly.score(X_train, y_train))

train score:  0.9342956694873071


In [22]:
print(metrics.classification_report(y_train, clf_poly.predict(X_train)))

              precision    recall  f1-score   support

           0       0.95      0.92      0.93      1018
           1       0.92      0.95      0.93       991

    accuracy                           0.93      2009
   macro avg       0.93      0.93      0.93      2009
weighted avg       0.93      0.93      0.93      2009



**Support Vector Machines | RBF Kernel**

Using the Fashionista dataset, we train a support vector classifier with an RBF kernel.

The cells below report its accuracy and classification report on the training split (`X_train`, `y_train`).
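The RBF kernel is `exp(-gamma * ||x - x'||^2)`, so the similarity of two images decays with their squared Euclidean distance. As a small check under assumed toy data (the matrix shape and seed are arbitrary), the sketch below computes it by hand and compares it with `rbf_kernel`, using the same `gamma='scale'` value that `SVC` applies by default.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Small random matrix standing in for the flattened images (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((5, 4))

# Same value that SVC uses for gamma='scale'
gamma = 1.0 / (X_demo.shape[1] * X_demo.var())

# Pairwise squared Euclidean distances via broadcasting
diff = X_demo[:, None, :] - X_demo[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)

# Manual RBF kernel matrix and the library equivalent
K_manual = np.exp(-gamma * sq_dists)
K_library = rbf_kernel(X_demo, X_demo, gamma=gamma)

print(np.allclose(K_manual, K_library))
```

The diagonal is always 1 because every point has zero distance to itself; larger `gamma` values make the kernel more local.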

In [23]:
clf_rbf = svm.SVC(kernel='rbf')
clf_rbf.fit(X_train, y_train)

SVC()

In [24]:
print("train score: ", clf_rbf.score(X_train, y_train))

train score:  0.8680935788949726


In [25]:
print(metrics.classification_report(y_train, clf_rbf.predict(X_train)))

              precision    recall  f1-score   support

           0       0.87      0.88      0.87      1018
           1       0.87      0.86      0.87       991

    accuracy                           0.87      2009
   macro avg       0.87      0.87      0.87      2009
weighted avg       0.87      0.87      0.87      2009



Support vector machines separated the female and male clothing in the Fashionista dataset with high precision and recall for every kernel. All scores above are computed on the training split, so they measure how well each model fits the data it was trained on.

With a linear kernel, the classifier reached 100% training accuracy, which suggests overfitting, so I also trained a linear kernel on a reduced (sampled) dataset to get another look at it.

On the reduced dataset, the linear kernel again reached 1.00 precision and recall on the training split.

For the poly (degree = 2) kernel, training accuracy was about 0.93, with precision and recall between 0.92 and 0.95.

The RBF kernel scored lowest on the training split, with accuracy, precision, and recall of about 0.87.
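A perfect training score from a linear kernel is expected when there are far more features than samples, because such data is almost always linearly separable. Here the flattened 160 x 120 images have at least 19,200 features each against about 2,000 training rows. The sketch below illustrates the effect on random data with random labels; the array sizes, the seed, and `C=1e3` are assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 200, 2000   # many more features than samples

# Pure noise: the labels carry no information about the features
X_tr = rng.standard_normal((n_train, n_features))
y_tr = rng.integers(0, 2, size=n_train)
X_te = rng.standard_normal((n_test, n_features))
y_te = rng.integers(0, 2, size=n_test)

clf = SVC(kernel="linear", C=1e3).fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)   # fit on data the model has seen
test_acc = clf.score(X_te, y_te)    # performance on unseen data

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two numbers is what distinguishes memorization from generalization, which is why the held-out score is the more informative one for comparing kernels.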

In [26]:
from IPython.display import HTML
HTML("""
<style>

div.cell { /* Tunes the space between cells */
margin-top:1em;
margin-bottom:1em;
}

div.text_cell_render h1 { /* Main titles bigger, centered */
font-size: 2.2em;
line-height:1.4em;
text-align:center;
}

div.text_cell_render h2 { /* Parts names nearer from the text */
margin-bottom: -0.4em;
}


div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.2em;
line-height:1.4em;
padding-left:2em;
padding-right:2em;
}
</style>
"""
)