# Homework 5 : Mini-Project
**Gaurav Pant** 

UIC CS 412, Spring 2018

This project uses the following Python files -

* ``dataApp.py`` : This loads the dataset, handles the conversion of categorical data to numbers, and splits the data into training, development and test sets

* ``dt.py`` : This runs the Decision Tree Classifier on the dataset

* ``randomForest.py``: This runs the Random Forest Classifier on the dataset

* ``svm.py`` : This runs the Support Vector Classifier on the dataset

* ``adaBoost.py``: This runs the AdaBoost Classifier on the dataset

* ``mlp.py`` : This runs the Multi-layer Perceptron Classifier on the dataset

* ``votingClassifier.py`` : This runs the Voting Classifier on the dataset


## 1. The Task & Dataset
This project aims to predict a person’s “empathy” on a scale from 1 to 5 using any of the other attributes in the dataset. From this rating, student volunteers can be recruited to help Alzheimer’s patients at a non-profit organization.

The dataset used in this project is the [Young People Survey](https://www.kaggle.com/miroslavsabo/young-people-survey/) dataset. It consists of 1010 survey responses covering 150 attributes, such as music preferences, personality traits, and views on life and opinions.

![Feature](image.png)

As the above graph shows, there is a visible trend between gender and empathy, i.e. females tend to have a higher empathy score. Our goal is to make similar predictions using machine learning tools.

## 2. Preprocessing
For any classifier to predict accurately, the data must be 'clean', i.e. inconsistent data must be handled beforehand.

1. Out of the 150 columns, 11 were categorical and had to be converted to numerical values.

e.g. 
* The ``Gender`` column had two possible categorical values -

``[Male, Female]``

* On preprocessing -

In [None]:
df['Gender'] = pd.Categorical(df['Gender'])
df['Gender'] = df['Gender'].cat.codes

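* The categorical values were now numerical (this is necessary for classification later on). ``pd.Categorical`` assigns codes in sorted category order, so ``Female`` maps to 0 and ``Male`` to 1 -

``[0-Female, 1-Male]``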
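
The mapping from categories to codes can be checked on a toy column. This is a minimal sketch, assuming a hypothetical three-row ``Gender`` column rather than the survey data. It shows that codes follow the sorted category order and that a missing value is encoded as -1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the survey data
toy = pd.DataFrame({'Gender': ['Male', 'Female', np.nan]})

cat = pd.Categorical(toy['Gender'])
categories = list(cat.categories)   # inferred categories, in sorted order
codes = cat.codes.tolist()          # code = position in categories, -1 = missing

print(categories, codes)
```

Because missing entries become -1 rather than NaN, a later NaN-based imputation step would not pick them up unless they are mapped back to NaN first.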

2. Another issue was missing data. Some values were empty across various columns; they had to be filled with an appropriate value before classification could proceed.

e.g. 
* ``Imputer`` from ``sklearn.preprocessing`` was used to solve this problem, replacing each missing value with the most frequent value in its column -

In [None]:
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp = imp.fit(X_train)
X_train_imp = imp.transform(X_train)
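
``Imputer`` was deprecated in scikit-learn 0.20 and later removed; ``SimpleImputer`` in ``sklearn.impute`` is its replacement. Below is a minimal sketch of the same most-frequent imputation with the newer class. The toy arrays are hypothetical stand-ins, not the survey data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the training and development sets
X_train_toy = np.array([[1.0, 2.0],
                        [1.0, np.nan],
                        [np.nan, 2.0],
                        [3.0, 2.0]])
X_dev_toy = np.array([[np.nan, np.nan]])

# Learn the per-column most frequent value on the training split only
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit(X_train_toy)

X_train_filled = imp.transform(X_train_toy)
X_dev_filled = imp.transform(X_dev_toy)   # reuses the training-set values
print(X_dev_filled)
```

Fitting on the training split and only transforming the other splits keeps information from the development and test data out of the imputation.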

## 3. The Solution

### The ML Solution
After cleaning the data, we could proceed to use the attribute values to predict the Empathy of any person. The following classifiers were analyzed in this project -

#### 3.1 Decision Tree Classifier
We first started with a simple Decision Tree Classifier, which tries to predict the empathy score by using simple decision rules inferred from the training data that we provide it.

In [3]:
from dt import *
clf = tree.DecisionTreeClassifier()
resArray = np.zeros(10)
for i in range(0, 10):
    clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_dev_imp, y=y_dev_imp)
    resArray[i] = result

print(str(np.mean(resArray)))
#0.34702970297

0.34702970297
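
To see what such decision rules look like, a fitted tree can be printed as text. The sketch below uses a hypothetical one-feature toy dataset (the feature name ``x`` and the data are assumptions for illustration) and ``export_text`` from ``sklearn.tree``, which is available in scikit-learn 0.21 and later.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature, two classes
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Human-readable view of the learned threshold rules
rules = export_text(toy_tree, feature_names=['x'])
print(rules)

# Points on either side of the learned threshold
toy_pred = toy_tree.predict([[0.5], [2.5]]).tolist()
```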


#### 3.2 Random Forest Classifier
The Random Forest Classifier is an ensemble method, in which each tree in the ensemble is built from a bootstrap sample from the training data.

In [5]:
from randomForest import *
clf = RandomForestClassifier(n_estimators=10)
resArray = np.zeros(10)
for i in range(0, 10):
    clf = clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_dev_imp, y=y_dev_imp) 
    resArray[i] = result

print(str(np.mean(resArray)))
#0.355445544554

0.355445544554
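
A bootstrap sample draws as many rows as the training set has, with replacement, so some rows repeat and others are left out. Below is a minimal NumPy sketch of one such draw. The size of 10 is an arbitrary assumption; scikit-learn performs this step internally for each tree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10                      # hypothetical training-set size

# Draw n_samples row indices with replacement: one bootstrap sample
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Rows never drawn are "out-of-bag" for the tree built on this sample
oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)

print(sorted(boot_idx.tolist()), oob_idx.tolist())
```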


#### 3.3 Support Vector Classifier

In [None]:
from svm import *
clf = SVC()
resArray = np.zeros(10)
for i in range(0, 10):
    clf = clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_dev_imp, y=y_dev_imp) 
    resArray[i] = result

print(str(np.mean(resArray)))
#0.470297029703

Since the SVM model achieved the highest accuracy, I then tried bagging it.

In [None]:
bagging = BaggingClassifier(svm.SVC(), max_samples=0.5, max_features=0.5)
resArray = np.zeros(10)
for i in range(0, 10):
    bagging = bagging.fit(X=X_train_imp, y=y_train_imp)
    result = bagging.score(X=X_dev_imp, y=y_dev_imp) 
    resArray[i] = result

print(str(np.mean(resArray)))
#0.372277227723

However, this led to a drop in accuracy. As a result, I removed bagging and tuned the hyperparameter C using the development data. The best value of C was then used to predict on the test data.

In [19]:
#bagging = BaggingClassifier(svm.SVC(), max_samples=0.5, max_features=0.5)
#0.372277227723
resArray = np.zeros(10)
for i in range(10):
    clf = svm.SVC(C=i+1)
    clf = clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_dev_imp, y=y_dev_imp) 
    resArray[i] = result

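#tuning hyperparameter C
# Note: np.argmax returns a zero-based index, so the value printed here is C - 1
print('C = ')
print(str(np.argmax(resArray)))
c = np.argmax(resArray) + 1  # selected C (a printed index of 1 means C = 2)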

clf = svm.SVC(C=c)
for i in range(10):
    clf = clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_test_imp, y=y_test_imp) 
    resArray[i] = result
    
print(str(np.mean(resArray)))
#0.435643564356

C = 
1
0.435643564356
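
The select-on-development, report-on-test procedure can be sketched as below. This is an illustrative sketch on synthetic data from ``make_classification``, not the survey data; the names ``candidates`` and ``best_C`` and the split sizes are assumptions. Indexing the candidate list with the ``argmax`` result avoids mixing up the zero-based index with the value of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the survey data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dv, X_te, y_dv, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = list(range(1, 11))      # C = 1, 2, ..., 10
dev_scores = [SVC(C=C).fit(X_tr, y_tr).score(X_dv, y_dv) for C in candidates]

# Map the argmax index back to the actual value of C
best_C = candidates[int(np.argmax(dev_scores))]

# Refit with the chosen C and report accuracy on the held-out test split
test_acc = SVC(C=best_C).fit(X_tr, y_tr).score(X_te, y_te)
print(best_C, test_acc)
```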


#### 3.4 AdaBoost Classifier

In [9]:
from adaBoost import *
clf = AdaBoostClassifier(n_estimators=100)
resArray = np.zeros(10)
for i in range(0, 10):
    clf = clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_dev_imp, y=y_dev_imp) 
    resArray[i] = result

print(str(np.mean(resArray)))
#0.376237623762

0.376237623762


#### 3.5 Multi-layer Perceptron Classifier


In [11]:
from mlp import *
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
resArray = np.zeros(10)
for i in range(0, 10):
    clf = clf.fit(X=X_train_imp, y=y_train_imp)
    result = clf.score(X=X_dev_imp, y=y_dev_imp) 
    resArray[i] = result

print(str(np.mean(resArray)))
#0.386138613861

0.386138613861


#### 3.6 Voting Classifier
A Voting Classifier combines multiple classifiers and makes its prediction from their votes; with ``voting='hard'``, as used here, the class predicted by the majority wins. I tried various combinations of the above classifiers to achieve the best accuracy possible.

In [16]:
from votingClassifier import *
clf1 = tree.DecisionTreeClassifier()
clf2 = RandomForestClassifier(n_estimators=10)
clf3 = svm.SVC()
clf4 = AdaBoostClassifier(n_estimators=100)
clf5 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

eclf = VotingClassifier(estimators=[('dt', clf1), ('rf', clf2), ('svm', clf3), ('ada', clf4), ('mlp', clf5)], voting='hard')
resArray = np.zeros(10)
for i in range(0, 10):
    eclf = eclf.fit(X=X_train_imp, y=y_train_imp)
    result = eclf.score(X=X_dev_imp, y=y_dev_imp)
    resArray[i] = result

print(str(np.mean(resArray)))
#0.373762376238

0.438613861386


The combinations tried (all with ``voting='hard'``) and their respective accuracies are listed below -

* ``[('dt', clf1), ('rf', clf2), ('svm', clf3), ('ada', clf4)]`` : 0.383663366337

* ``[('dt', clf1), ('rf', clf2), ('svm', clf3), ('ada', clf4), ('mlp', clf5)]`` : 0.373762376238

* ``[('rf', clf2), ('svm', clf3), ('ada', clf4), ('mlp', clf5)]`` : 0.390594059406

* ``[('rf', clf2), ('svm', clf3), ('ada', clf4)]`` : 0.384653465347

* ``[('rf', clf2), ('svm', clf3)]`` : 0.350495049505

* ``[('rf', clf2), ('svm', clf3), ('mlp', clf5)]`` : 0.433663366337
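
Hard voting itself amounts to a per-sample majority vote over the predicted labels. Below is a minimal NumPy sketch, assuming hypothetical predictions from three classifiers on a 1 to 5 scale; in this sketch, ties go to the smallest label.

```python
import numpy as np

# Hypothetical predictions from three classifiers for four samples
preds = np.array([[1, 3, 5, 2],    # classifier A
                  [1, 4, 5, 3],    # classifier B
                  [2, 3, 5, 4]])   # classifier C

# Majority vote per sample (one column per sample);
# argmax returns the first maximum, so ties go to the smallest label
majority = np.array([np.bincount(col).argmax() for col in preds.T])
print(majority.tolist())
```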

### The Evaluation Process
For the classification process, the data was split into three sets (training, development and testing) in the ratio 60:20:20. Accuracy was used as the metric to evaluate the classifiers.
* All the classification models were trained using the training data
* Once the model with the best accuracy was found, its hyperparameters were tuned using the development data
* Lastly, the accuracy was calculated on the test data.
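
``dataApp.py`` is not shown here, so the following is only one way a 60:20:20 split could be produced, using two calls to ``train_test_split``. The placeholder arrays, variable names, and ``random_state`` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the survey
X_all = np.arange(1010 * 2).reshape(1010, 2)
y_all = np.arange(1010) % 5 + 1      # fake labels on a 1-5 scale

# First carve off 40%, then halve it into development and test sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))
```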

### The Result
SVM performed the best amongst all the classifiers, with an average accuracy of 47.03% on the development data. Bagging the SVM dropped the accuracy to 37.23%, so it was not considered in the final model. With the tuned value of C, the SVM reached 43.56% accuracy on the test data.