# Classifying Men's and Women's Fashion

In [35]:
#Import necessary packages and libraries

import zipfile
import os
from PIL import Image

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

## Data Preprocessing

This step extracts the images from the zipped archives, opens each image file, resizes it to a common 70×70 shape, converts it into a flattened array, adds class labels for classification, and finally merges and shuffles the data.
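The resize-and-flatten step explains the 14700 that appears in the preprocessing code: a 70×70 image with three color channels flattens to 70 × 70 × 3 = 14700 values. The sketch below illustrates this on a synthetic in-memory image, which is an assumption made for illustration (the notebook reads real image files, presumably RGB).

```python
import numpy as np
from PIL import Image

# Synthetic stand-in for a clothing photo (hypothetical size and color)
img = Image.new("RGB", (120, 90), color=(200, 30, 30))

# Resize to the common shape, then flatten height x width x channels
resized = img.resize((70, 70))
arr = np.asarray(resized)
flat = arr.flatten()

print(arr.shape)   # (70, 70, 3)
print(flat.shape)  # (14700,)
```

An image stored in another mode (grayscale or RGBA, for example) would flatten to a different length, so converting every image to RGB first keeps the rows consistent.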

In [2]:
#extract images from zipped files 
with zipfile.ZipFile("female-clothing.zip", 'r') as zip_ref:
    zip_ref.extractall()
with zipfile.ZipFile("male-clothing.zip", 'r') as zip_ref:
    zip_ref.extractall()
    
#get image file paths 
m_files = os.listdir('../fashion-classifier/men')
w_files = os.listdir('../fashion-classifier/women')
print('men:',len(m_files),'| women:',len(w_files))

#merge both datasets 
all_files = m_files + w_files
print('total:',len(all_files))

men: 1242 | women: 1270
total: 2512


In [3]:
#convert image files into arrays
def read_img(files, folder):
    rows = []
    for fname in files:
        #open, force 3 channels, and resize for standard sizing
        with Image.open(os.path.join(folder, fname)) as img:
            arr = np.asarray(img.convert('RGB').resize((70, 70)))
        rows.append(arr.flatten()) #flatten to 70*70*3 = 14700 values
    #stack once at the end instead of growing the array on every iteration
    return np.array(rows, dtype=float)

#read images from files 
men = read_img(m_files[:500],'men')
women = read_img(w_files[:500],'women')


In [4]:
#turn into dataframes and add class labels
men = pd.DataFrame(men)
men['label'] = 0
women = pd.DataFrame(women)
women['label'] = 1

#merge and shuffle
fashion = pd.concat([men, women], ignore_index=True) #DataFrame.append was removed in pandas 2.0
fashion = shuffle(fashion)

#separate dependent and independent variables
X = fashion.loc[:, fashion.columns != 'label']
y = fashion['label']

#see dataframe
fashion.head()

#split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)

## Classification using SVC 

The classification pipeline starts with a StandardScaler(), which removes each feature's mean and scales it to unit variance. This puts all features on a comparable scale and prevents features with large variance from dominating the objective function. The model is a linear Support Vector Classifier (LinearSVC) with the regularization parameter C. Smaller values of C mean stronger regularization, so C is set to 0.2, a relatively strong setting, to keep the model from overfitting. This is especially useful when classifying on the original pixel data, since there can be a lot of unnecessary noise that the model could try to learn. Lastly, I compared training accuracy with cross-validated accuracy to get an idea of how much overfitting might still be present.
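As a small illustration of what the scaling step does, the sketch below standardizes a toy array with two features on very different scales. The values are made up for the example and are not taken from the fashion data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (made-up values)
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

X_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, each column has mean 0 and unit variance
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means)
print(col_stds)
```

Standardizing changes only the location and scale of each feature; it does not change the shape of its distribution.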


### Classifying Original Pixel Data
 

In [22]:
svc = make_pipeline(StandardScaler(), LinearSVC(random_state=0, C=.2))
svc.fit(X_train, y_train)
ypred = svc.predict(X_test)

cv_scores = cross_val_score(svc, X_train, y_train, cv=10)
print('SVC score:',svc.score(X_train, y_train))
print("CV average score: %.2f" % cv_scores.mean())


SVC score: 1.0
CV average score: 0.59


Fitting the model on raw pixel data yields a seemingly perfect accuracy score of 1.0 on the training data, in contrast with 0.59 when the score is checked with cross-validation. This gap indicates that the model overfit the original data. This makes intuitive sense, since many of the raw pixel features probably have little predictive power, but the model fit them anyway. Overall precision, accuracy, and recall hover around 60%, which suggests that the model learned some difference between men's and women's fashion but is not a reliable classifier for telling them apart.
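The gap between training and cross-validated accuracy can be reproduced on data that contains no signal at all. The following is a minimal sketch on synthetic noise, not the notebook's images: with far more features than samples, a linear model can typically separate the training set perfectly even when the labels are unrelated to the features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 samples, 5000 pure-noise features,
# labels that are unrelated to the features
X_noise = rng.normal(size=(100, 5000))
y_noise = np.array([0, 1] * 50)

model = make_pipeline(StandardScaler(), LinearSVC(random_state=0, C=0.2))
model.fit(X_noise, y_noise)

# Accuracy on the data the model was fit on vs. on held-out folds
train_acc = model.score(X_noise, y_noise)
cv_acc = cross_val_score(model, X_noise, y_noise, cv=5).mean()

print("training accuracy: %.2f" % train_acc)
print("CV accuracy: %.2f" % cv_acc)
```

A high training score is therefore weak evidence on its own when the number of features (14700 here) far exceeds the number of training images.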

### Classifying PCA representations 

In [34]:
pca = PCA(8)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

svc.fit(X_train_pca, y_train)
ypred = svc.predict(X_test_pca)

cv_scores = cross_val_score(svc, X_train_pca, y_train, cv=7)

print('SVC score:',svc.score(X_train_pca, y_train))
print("CV average score: %.2f" % cv_scores.mean())


SVC score: 0.66
CV average score: 0.64


When applying PCA, the error rates stopped improving after about 6-8 components, which suggests that these components carry more explanatory power than the rest of the features. As expected, the gap between training accuracy and cross-validated accuracy shrank, indicating less overfitting when the principal components (eigenvectors) are used as feature representations. This depended on how many components were used: overfitting increased as more components were added, and 8 components seemed to be the optimal number to minimize overfitting and maximize accuracy. However, the improvement over the raw pixel data was modest, which suggests that while PCA can lower the error rate by training on more useful feature representations, the model could still classify only about 64% of the images correctly.
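One way to explore the number of components is to sweep over several values and compare training and cross-validated accuracy for each. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration, not the fashion images), and places PCA inside the pipeline so that each cross-validation fold refits the projection on its own training split.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the flattened images (not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=200,
                                     n_informative=10, random_state=0)

results = {}
for n in (2, 4, 8, 16, 32):
    # Same order as the notebook: PCA first, then scaling, then the linear SVC
    pipe = make_pipeline(PCA(n_components=n), StandardScaler(),
                         LinearSVC(random_state=0, C=0.2))
    cv_acc = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
    train_acc = pipe.fit(X_demo, y_demo).score(X_demo, y_demo)
    results[n] = (train_acc, cv_acc)
    print("n=%2d  train=%.2f  cv=%.2f  gap=%.2f"
          % (n, train_acc, cv_acc, train_acc - cv_acc))
```

The component count with the best cross-validated score and a small train/CV gap is a reasonable candidate; the numbers on synthetic data say nothing about the fashion dataset itself.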

### Classifying LDA representations

In [24]:
lda = LDA()
X_train_lda = lda.fit_transform(X_train,y_train)
X_test_lda = lda.transform(X_test)

svc.fit(X_train_lda, y_train)
ypred_ = svc.predict(X_test_lda)

cv_scores = cross_val_score(svc, X_train_lda, y_train, cv=10)

print('SVC score:',svc.score(X_train_lda, y_train))
print("CV average score: %.2f" % cv_scores.mean())


SVC score: 0.902
CV average score: 0.90


Linear Discriminant Analysis provided a significant improvement in the accuracy score, with no apparent gap between training accuracy (0.902) and cross-validated accuracy (0.90), which suggests that the model is neither overfit nor underfit. This might be attributed to the fact that LDA explicitly looks for linear combinations of features that separate the classes, so this supervised dimensionality reduction method performs better on fashion data, which may not always have a clearly defined separation. PCA looks for patterns in an unsupervised fashion, and there may well be all kinds of patterns in the feature data, but in this classification task we are concerned with the categories of men's vs. women's fashion rather than the myriad other patterns that are no doubt present. LDA takes these class labels into account while extracting useful linear combinations of features, and as a result provides better error rates than Principal Component Analysis.
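In the cell above, the LDA projection is fit on all of `X_train` before cross-validation, so every validation fold has already influenced the projection. A possible cross-check, sketched here on synthetic data (an assumption, not the notebook's images), is to put LDA inside the pipeline so that it is refit within each fold, and to score the untouched test split as well. The sketch also shows that with two classes LDA yields a single discriminant direction.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class stand-in for the fashion data
X_demo, y_demo = make_classification(n_samples=400, n_features=50,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.5, random_state=42)

# With two classes, LDA projects onto n_classes - 1 = 1 direction
Z_tr = LinearDiscriminantAnalysis().fit_transform(X_tr, y_tr)
print(Z_tr.shape)  # (200, 1)

# LDA inside the pipeline is refit on each fold's training split
pipe = make_pipeline(LinearDiscriminantAnalysis(), StandardScaler(),
                     LinearSVC(random_state=0, C=0.2))
cv_acc = cross_val_score(pipe, X_tr, y_tr, cv=10).mean()
test_acc = pipe.fit(X_tr, y_tr).score(X_te, y_te)

print("CV accuracy: %.2f" % cv_acc)
print("test accuracy: %.2f" % test_acc)
```

If the pipeline-based cross-validation score and the held-out test score on the real data stay close to 0.90, that would support the conclusion drawn above.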