For this assignment, I experimented with RandomForest classification, principal component analysis, and recursive feature elimination. Things were running fine yesterday--and today, something has gone wrong. I cannot get models to build, in a notebook or when I export the code to a Python script. I include observations about feature extraction with RandomForest, and I report the accuracies I'd gotten with PCA yesterday, before the code stopped running. I wasn't able to build an RFE model.

In [4]:
from scipy.io import arff
import pandas
from sklearn import svm, metrics
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [5]:
data, meta = arff.loadarff('emobase2010.arff')

In [6]:
df = pandas.DataFrame.from_records(data)

In [7]:
df.columns = data.dtype.names

In [8]:
# remove neutral, unknown and other classes
a = df['class']!=b'NEU'
b = df['class']!=b'UNK'
c = df['class']!=b'OTH'
df = df.loc[a&b&c]

In [9]:
df['class'].value_counts()

b'DIS'    467
b'SUR'    452
b'ACC'    450
b'ANT'    412
b'SAD'    285
b'FEA'    239
b'JOY'    226
b'ANG'    212
Name: class, dtype: int64

In [7]:
adata = df.as_matrix()

<h3> Experiment 1: feature extraction with RandomForest </h3>
following
https://chrisalbon.com/machine-learning/feature_selection_using_random_forest.html

In [None]:
## Create column with random boolean values to make test set
df['is_train'] = np.random.uniform(0,1,len(df)) <= .75

In [14]:
## create train and test dataframes
X_train, X_test = df[df['is_train'] == True], df[df['is_train'] == False]

In [15]:
## get labels for train and test
y_train, y_test = list(X_train["class"]), list(X_test["class"])

## remove labels from train and test sets
X_train, X_test = X_train.drop("class", axis=1), X_test.drop("class", axis=1)

In [16]:
## train random forest classifier with train data and labels
clf = RandomForestClassifier(n_estimators = 1000, random_state = 0, n_jobs = 1)
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [17]:
## Make list of feature names
feat_labels = list(X_train.columns.values)

In [18]:
## Create list of importance values to get idea of where to set threshold
feature_importance = [float(feature[1]) for feature in zip(feat_labels, clf.feature_importances_)]

In [19]:
## Based on list, set threshold at .001
for i in sorted(feature_importance):
    print(i)

0.0
3.203248765449538e-06
0.00021252683121117196
0.00021983113100411945
0.0003726158272709695
0.0003835705002436763
0.0004164912027243639
0.00042292869410031204
0.00042400948542095017
0.00042750553819220966
0.00042757721351884605
0.0004408069297079444
0.0004417925077562446
0.00044247700274108984
0.00044312964253919595
0.00044418305598408
0.00044421714791860955
0.00044645146176128096
0.00044789839158930765
0.00044968500177646354
0.000451191165419074
0.00045236246660464776
0.0004543267824006463
0.0004546428867194536
0.0004548952669475171
0.0004555357385890501
0.00045637697178565464
0.00045684925008243474
0.0004574505282600815
0.0004607290946233226
0.00046135063186556
0.00046143765351661634
0.00046244002051114703
0.00046270693127506207
0.00046361569295062945
0.00046387742785451654
0.00046434046300659426
0.0004647206135089396
0.00046567279331261713
0.00046656882313409625
0.00046668704663207876
0.00046730212041169744
0.00046767799718052246
0.00046787358166088623
0.000468491777863593
0.00046

In [21]:
## create model with most important features
sfm = SelectFromModel(clf, threshold=0.001)
sfm.fit(X_train, y_train)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False),
        prefit=False, threshold=0.001)

In [22]:
## Returns 50 features
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])

pcm_loudness_sma_linregc1
pcm_loudness_sma_stddev
pcm_loudness_sma_percentile99.0
pcm_fftMag_mfcc_sma[0]_percentile99.0
logMelFreqBand_sma[0]_percentile99.0
logMelFreqBand_sma[1]_quartile3
logMelFreqBand_sma[1]_percentile99.0
logMelFreqBand_sma[4]_percentile99.0
logMelFreqBand_sma[5]_amean
logMelFreqBand_sma[5]_quartile2
logMelFreqBand_sma[5]_quartile3
logMelFreqBand_sma[5]_percentile99.0
logMelFreqBand_sma[6]_amean
logMelFreqBand_sma[6]_quartile3
logMelFreqBand_sma[6]_percentile99.0
logMelFreqBand_sma[7]_amean
logMelFreqBand_sma[7]_quartile3
lspFreq_sma[4]_quartile1
F0finEnv_sma_amean
F0finEnv_sma_linregc1
F0finEnv_sma_quartile1
F0finEnv_sma_quartile2
F0finEnv_sma_quartile3
pcm_loudness_sma_de_amean
pcm_loudness_sma_de_linregc2
pcm_loudness_sma_de_quartile2
pcm_fftMag_mfcc_sma_de[0]_amean
logMelFreqBand_sma_de[5]_amean
F0finEnv_sma_de_amean
F0finEnv_sma_de_linregc2
F0finEnv_sma_de_kurtosis
F0finEnv_sma_de_quartile1
F0finEnv_sma_de_quartile2
F0finEnv_sma_de_quartile3
F0finEnv_sma_de_iq

In [21]:
## Create data subset with most important features
X_important_train, X_important_test = sfm.transform(X_train), sfm.transform(X_test)

In [22]:
## train new classifier with feature subset
clf_import = RandomForestClassifier(n_estimators = 1000, random_state = 0, n_jobs = 1)
clf_import.fit(X_important_train, list(y_train))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=1000, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [25]:
## Run full classifier on test data
y_pred=clf.predict(X_test)
print(type(y_pred),type(y_test))

metrics.accuracy_score(list(y_test), list(y_pred))

<class 'numpy.ndarray'> <class 'pandas.core.series.Series'>


0.32831325301204817

In [26]:
## Run classifier with feature subset on test data
y_import_pred = clf_import.predict(X_important_test)
metrics.accuracy_score(list(y_test), list(y_import_pred))

0.29518072289156627

With features reduced from over 1500 to 50, we lose very little accuracy, still scoring above chance (0.16) and above the SVM baseline (0.27).

<h3> Experiment 2: Dimension reduction with PCA </h3>
following http://machinelearningmastery.com/feature-selection-machine-learning-python/

In [7]:
from sklearn.decomposition import PCA

In [8]:
## Look at shape of data
df.shape

(2743, 1583)

In [10]:
## split dataframe into features & labels
labels = list(df["class"])
features = df.drop("class", axis = 1)

In [12]:
## Run PCA. I wanted to use maximum likelihood estimation, but it would not run.
# pca = PCA(n_components='mle', svd_solver='full')
pca = PCA(n_components=2)
fit = pca.fit_transform(features)

In [10]:
## run same classifier as baseline with PCA reduced dimensions
wclf = svm.SVC(kernel='linear', class_weight='balanced')

In [None]:
predicted = cross_val_predict(wclf, fit, labels)

In [None]:
metrics.accuracy_score(labels, predicted)

When I ran this yesterday, I was able to get accuracies--today, it isn't working. However, when it was running, reducing dimensionality with n_components equalling 2 and 3, accuracies were around 21%--beating chance, but not the baseline of the full feature set using the same classifier. I hope that using MLE would help accuracies. I can't figure out why it ran yesterday and not today...!


<h3> Experiment 3: Recursive feature elimination </h3>
following http://machinelearningmastery.com/feature-selection-machine-learning-python/

I wasn't able to get this model to build and I'm not sure why. When I couldn't get it to run using the same classifier as used in the baseline, I tried it using logistic regression, which is the classifier used in the tutorial. Still no luck.

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
model = LogisticRegression()

In [14]:
rfe = RFE(model, 3)

In [None]:
fit = rfe.fit(features, labels)