In [45]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

from sklearn import ensemble

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score

from sklearn.model_selection import GridSearchCV

from sklearn.feature_selection import SelectKBest

Let's build a neural network. We will have multiple features we feed into our model, each of which will go through a set of perceptron models to arrive at a response which will be trained to our output.

Like many models we've covered, this can be used as either a regression or a classification model.

First, we need to load our dataset. For this example we'll use The Museum of Modern Art in New York's [public dataset](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv) on their collection.

In [11]:
# artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')
# Save the downloaded file to csv:
# artworks.to_csv('artworks' , sep = ',' , index =False)


artworks = pd.read_csv('U4L3.3_artworks.csv'  )

In [12]:
artworks

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.600000,,,168.900000,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.640100,,,29.845100,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.300000,,,31.800000,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.800000,,,50.800000,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.400000,,,19.100000,,
5,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1976-77,Gelatin silver photograph,...,http://www.moma.org/media/W1siZiIsIjE0OCJdLFsi...,,,,35.600000,,,45.700000,,
6,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1976-77,Gelatin silver photographs,...,http://www.moma.org/media/W1siZiIsIjE0OSJdLFsi...,,,,35.600000,,,45.700000,,
7,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1976-77,Gelatin silver photograph,...,http://www.moma.org/media/W1siZiIsIjE0OSJdLFsi...,,,,35.600000,,,45.700000,,
8,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1976-77,Gelatin silver photograph,...,http://www.moma.org/media/W1siZiIsIjE1MCJdLFsi...,,,,35.600000,,,45.700000,,
9,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1976-77,Gelatin silver photograph,...,http://www.moma.org/media/W1siZiIsIjE1MSJdLFsi...,,,,35.600000,,,45.700000,,


In [13]:
artworks.shape

(152576, 29)

In [14]:
artworks.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

We'll also do a bit of data processing and cleaning, selecting columns of interest and converting URLs to booleans indicating whether they are present.

In [15]:
# Display first 3 values to peek in:
artworks['URL'].value_counts(dropna=False)[:3]

NaN                                           74928
http://www.moma.org/collection/works/84024        1
http://www.moma.org/collection/works/91329        1
Name: URL, dtype: int64

There are a lot of missing values!

In [16]:
# Display first 5 values to peek in:
artworks['ThumbnailURL'].value_counts(dropna=False)[:5]

NaN                                                                                                                               85452
http://www.moma.org/media/W1siZiIsIjE5OTkxNCJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=cabab63ac5a7d402       59
http://www.moma.org/media/W1siZiIsIjEzNjMxNSJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=fbf6ad0392f98ee3       28
http://www.moma.org/media/W1siZiIsIjIzMDI1NyJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=1ef42de47aa2e7fb       11
http://www.moma.org/media/W1siZiIsIjIyODM1MyJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=a1619cfbf3adbe4e        9
Name: ThumbnailURL, dtype: int64

In [17]:
artworks['Department'].value_counts(dropna=False)

Prints & Illustrated Books               62017
Photography                              30327
Architecture & Design                    19166
NaN                                      17169
Drawings                                 11498
Painting & Sculpture                      3812
Film                                      3759
Media and Performance Art                 2736
Fluxus Collection                         2070
Architecture & Design - Image Archive       22
Name: Department, dtype: int64

In [19]:
# Select columns. Take an explicit copy so the assignments below modify
# a new DataFrame rather than a slice of the original
# (this avoids pandas' SettingWithCopyWarning).
artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department',
                    'DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']].copy()

# Convert URLs to booleans.
artworks['URL'] = artworks['URL'].notnull()
artworks['ThumbnailURL'] = artworks['ThumbnailURL'].notnull()

# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Media and Performance Art']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()


In [21]:
print(artworks.shape)
artworks.head()

(105817, 10)


Unnamed: 0,Artist,Nationality,Gender,Date,Department,DateAcquired,URL,ThumbnailURL,Height (cm),Width (cm)
0,Otto Wagner,(Austrian),(Male),1896,Architecture & Design,1996-04-09,True,True,48.6,168.9
1,Christian de Portzamparc,(French),(Male),1987,Architecture & Design,1995-01-17,True,True,40.6401,29.8451
2,Emil Hoppe,(Austrian),(Male),1903,Architecture & Design,1997-01-15,True,True,34.3,31.8
3,Bernard Tschumi,(),(Male),1980,Architecture & Design,1995-01-17,True,True,50.8,50.8
4,Emil Hoppe,(Austrian),(Male),1903,Architecture & Design,1997-01-15,True,True,38.4,19.1


## Building a Model

Now, let's see if we can use multi-layer perceptron modeling (or "MLP") to classify the department a piece should go into using everything but the department name.

Before we import MLP from SKLearn and establish the model, we first have to ensure correct typing for our data and do some other cleaning.

In [22]:
# Get data types.
artworks.dtypes

Artist           object
Nationality      object
Gender           object
Date             object
Department       object
DateAcquired     object
URL                bool
ThumbnailURL       bool
Height (cm)     float64
Width (cm)      float64
dtype: object

The `DateAcquired` column is an object. Let's transform that to a datetime object and add a feature for just the year the artwork was acquired.

In [23]:
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year
artworks['YearAcquired'].dtype

dtype('int64')

Great. Let's do some more miscellaneous cleaning.

In [24]:
# `Gender` holds multiple values for works with several artists:
artworks[artworks['Gender'].str.contains(r'\) \(')]

Unnamed: 0,Artist,Nationality,Gender,Date,Department,DateAcquired,URL,ThumbnailURL,Height (cm),Width (cm),YearAcquired
65,"Peter Eisenman, Robert Cole",(American) (),(Male) (Male),1975,Architecture & Design,1980-01-08,True,True,34.9251,113.3477,1980
66,"Rem Koolhaas, Madelon Vriesendorp",(Dutch) (Dutch),(Male) (Female),1987,Architecture & Design,2000-01-19,True,True,63.5001,99.0602,2000
76,"Aldo Rossi, Gianni Braghieri, M. Bosshard",(Italian) (Italian) (Italian),(Male) (Male) (Male),1974,Architecture & Design,1980-01-08,True,True,72.4000,91.4000,1980
107,"Erik Gunnar Asplund, Sigurd Lewerentz",(Swedish) (Swedish),(Male) (Male),1937,Architecture & Design,1990-01-17,True,True,41.3000,96.2000,1990
110,"Paul Nelson, Frantz Jourdain, Oscar Nitzchke",(American) (French) (American),(Male) (Male) (Male),1938,Architecture & Design,1966-01-01,True,True,37.5000,95.3000,1966
111,"Paul Nelson, Frantz Jourdain, Oscar Nitzchke",(American) (French) (American),(Male) (Male) (Male),1938,Architecture & Design,1966-01-01,True,True,37.5000,95.9000,1966
112,"Paul Nelson, Oscar Nitzchke, Frantz Jourdain",(American) (American) (French),(Male) (Male) (Male),1938,Architecture & Design,1966-01-01,True,True,71.1000,71.1000,1966
113,"Paul Nelson, Frantz Jourdain, Oscar Nitzchke",(American) (French) (American),(Male) (Male) (Male),1938,Architecture & Design,1966-01-01,True,True,71.0000,127.6000,1966
151,"Diller + Scofidio, Elizabeth Diller, Ricardo S...",(American) (American) (American),() (Female) (Male),1989,Architecture & Design,1992-01-15,True,True,121.0000,92.7000,1992
154,"Rem Koolhaas, Zoe Zenghelis, Elia Zenghelis, M...",(Dutch) (British) (British) (Dutch),(Male) (Female) (Male) (Female),1975,Architecture & Design,1992-01-15,True,True,113.0000,68.6000,1992


In [25]:
artworks['Date'].head(10)

0       1896
1       1987
2       1903
3       1980
4       1903
5    1976-77
6    1976-77
7    1976-77
8    1976-77
9    1976-77
Name: Date, dtype: object

In [26]:
# Collapse multiple nationalities, genders, and artists into single categories.
# The patterns are raw strings so the regex escapes are passed through unchanged.
artworks.loc[artworks['Gender'].str.contains(r'\) \('), 'Gender'] = '(multiple_persons)'
artworks.loc[artworks['Nationality'].str.contains(r'\) \('), 'Nationality'] = '(multiple_nationalities)'
artworks.loc[artworks['Artist'].str.contains(','), 'Artist'] = 'Multiple_Artists'

# Convert dates to the first four-digit year, cutting down the number of distinct values.
artworks['Date'] = artworks['Date'].str.extract(r'([0-9]{4})', expand=False)

# Final column drops.
X = artworks.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)
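To make the date conversion concrete, here is a minimal sketch that applies the same `[0-9]{4}` pattern with the standard library's `re` module. The helper name `first_year` is hypothetical, and every sample string except `1976-77` (which appears in the `Date` column above) is made up for illustration.

```python
import re

def first_year(date_text):
    """Return the first run of four digits in date_text, or None if there is none."""
    match = re.search(r'[0-9]{4}', date_text)
    return match.group(0) if match else None

# '1976-77' comes from the Date column shown above; the other strings are invented.
for raw in ['1896', '1976-77', 'c. 1903', 'n.d.']:
    print(repr(raw), '->', first_year(raw))
```

Collapsing ranges such as `1976-77` to their first year is what reduces the number of distinct values that later become dummy columns.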



In [27]:
# Create dummies separately.
artists = pd.get_dummies(artworks.Artist)
nationalities = pd.get_dummies(artworks.Nationality)
dates = pd.get_dummies(artworks.Date)

# Concat with other variables, but artists slows this wayyyyy down so we'll keep it out for now
X = pd.get_dummies(X, sparse=True)
X = pd.concat([X, nationalities, dates], axis=1)

Y = artworks.Department

In [28]:
Y

0              Architecture & Design
1              Architecture & Design
2              Architecture & Design
3              Architecture & Design
4              Architecture & Design
5              Architecture & Design
6              Architecture & Design
7              Architecture & Design
8              Architecture & Design
9              Architecture & Design
10             Architecture & Design
11             Architecture & Design
12             Architecture & Design
13             Architecture & Design
14             Architecture & Design
15             Architecture & Design
16             Architecture & Design
17             Architecture & Design
18             Architecture & Design
19             Architecture & Design
20             Architecture & Design
21             Architecture & Design
22             Architecture & Design
23             Architecture & Design
24             Architecture & Design
25             Architecture & Design
26             Architecture & Design

In [29]:
# Alright! We've done our prep, let's build the model.
# Neural networks are hugely computationally intensive.
# This may take several minutes to run.

# Import the model.
from sklearn.neural_network import MLPClassifier


import time
start_time = time.time()

# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X, Y)

print("\n--- %s seconds ---" % (time.time() - start_time))


--- 105.47419905662537 seconds ---


In [35]:
import time
start_time = time.time()

print(mlp.score(X, Y))

print("\n--- %s seconds ---" % (time.time() - start_time))

0.5813054613152896

--- 2.693662405014038 seconds ---


In [17]:
Y.value_counts()/len(Y)

Prints & Illustrated Books    0.521192
Photography                   0.228186
Architecture & Design         0.113148
Drawings                      0.103717
Painting & Sculpture          0.033756
Name: Department, dtype: float64

In [28]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([0.57890012, 0.61883387, 0.51859377, 0.5554768 , 0.47930252])

We got a lot of information from all of this. First, the model seems to overfit somewhat: the training accuracy (0.58) is above the mean of the cross-validation scores (about 0.55), though there is still some remaining performance when validated with cross validation. This is a feature of neural networks that aren't given enough data for the number of features present. _Neural networks, in general, like_ a lot _of data_. You may also have noticed something else about neural networks: _they can take a_ long _time to run_. Try increasing the layer size by adding a zero. Feel free to interrupt the kernel if you don't have time...

Also note that we created dummy variables for the artists' names but left them out, for both of the reasons above: including them would take much longer to run and would make the model much more prone to overfitting.
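One way to put these scores in context is to compare them with a majority-class baseline, that is, the accuracy of always predicting the most common department. The sketch below copies its numbers from the outputs above; the variable names are illustrative.

```python
# Class shares copied from the Y.value_counts() / len(Y) output above.
class_shares = {
    'Prints & Illustrated Books': 0.521192,
    'Photography': 0.228186,
    'Architecture & Design': 0.113148,
    'Drawings': 0.103717,
    'Painting & Sculpture': 0.033756,
}

# Scores copied from the mlp.score and cross_val_score outputs above.
train_accuracy = 0.5813054613152896
cv_scores = [0.57890012, 0.61883387, 0.51859377, 0.5554768, 0.47930252]

majority_baseline = max(class_shares.values())
cv_mean = sum(cv_scores) / len(cv_scores)

print('Majority-class baseline:', round(majority_baseline, 3))
print('Training accuracy:      ', round(train_accuracy, 3))
print('Cross-validation mean:  ', round(cv_mean, 3))
```

Seen this way, the 1000-unit network is only a few points better than always guessing the largest department, and two of the five folds fall below that baseline.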

## Model parameters

Now, before we move on and let you loose with some tasks to work on the model, let's go over the parameters.

We included one parameter: hidden layer size. Remember the previous lesson, where we talked about layers in a neural network? This parameter tells the model how many hidden layers to use and how big to make each one. Pass in a tuple that specifies each layer's size. Our network has a single hidden layer that is 1000 neurons wide. `(100, 4)` would create a network with two hidden layers, one 100 wide and the other 4.
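To get a feel for how the layer tuple drives model size, the sketch below counts the weights and biases in a fully connected network. The function name `count_parameters` and the input width of 300 are assumptions for illustration (the real width is the number of columns in `X`); the 5 outputs correspond to the five departments left in `Y`.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases in a fully connected feed-forward network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per connection plus one bias per unit in the next layer.
        total += fan_in * fan_out + fan_out
    return total

# 300 input features is a made-up width; 5 is the number of departments in Y.
for sizes in [(1000,), (100, 4), (10, 10, 10)]:
    print(sizes, count_parameters(300, sizes, 5))
```

Under these assumptions a single 1000-unit layer has roughly ten times as many parameters as `(100, 4)`, which is one reason it is so much slower to fit.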

How many layers to include is determined by two things: computational resources and cross validation searching for convergence. It's generally less than the number of input variables you have.

You can also set an alpha. Neural networks like this use a regularization parameter that penalizes large coefficients just like we discussed in the advanced regression section. Alpha scales that penalty.
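scikit-learn describes `alpha` as the strength of an L2 regularization term that is divided by the sample size when added to the loss. The function below is a sketch of that idea only: the name `l2_penalty`, the 0.5 factor, and the example weights are assumptions for illustration, and the library's internal computation may differ in detail.

```python
def l2_penalty(weight_layers, alpha, n_samples):
    """Sketch of an L2 penalty: alpha scales the sum of squared weights."""
    squared_sum = sum(w ** 2 for layer in weight_layers for w in layer)
    return 0.5 * alpha * squared_sum / n_samples

# Made-up weights for two tiny layers.
weights = [[1.0, -2.0], [3.0]]

for alpha in (0.0001, 0.01, 1.0):
    print(alpha, l2_penalty(weights, alpha, n_samples=10))
```

Larger values of `alpha` make large weights more costly, which pushes the fitted network toward smaller coefficients.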

Lastly, we'll discuss the activation function. The activation function determines how the weighted input to an individual perceptron is turned into its output, for example whether that output is binary or continuous. By default this is 'relu', the 'rectified linear unit' function, which outputs zero for negative inputs and the input itself otherwise. In the exercise we went through earlier we used a binary function, but we discussed the _sigmoid_ as a reasonable alternative. The _sigmoid_ (called 'logistic' by SKLearn because it's a 'logistic sigmoid function') allows for continuous values between 0 and 1, which allows for a more nuanced model. It does come at the cost of increased computational complexity.
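The three activation functions mentioned here are short enough to write out. This is a plain-Python sketch for intuition, not scikit-learn's implementation; the function names are illustrative.

```python
import math

def step(x):
    """Binary step: the output is either 0 or 1."""
    return 1.0 if x >= 0 else 0.0

def relu(x):
    """Rectified linear unit: zero for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def logistic(x):
    """Logistic sigmoid: a continuous value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, step(x), relu(x), round(logistic(x), 3))
```

Only the step function is binary. Both `relu` and `logistic` are continuous, and `relu` is unbounded above while `logistic` stays between 0 and 1.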

If you want to learn more about these, study [activation functions](https://en.wikipedia.org/wiki/Activation_function) and [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron). The [Deep Learning](http://www.deeplearningbook.org/) book referenced earlier goes into great detail on the linear algebra involved.

You could also just test the models with cross validation. Unless neural networks are your specialty, cross validation should be sufficient.

For the other parameters and their defaults, check out the [MLPClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier).

## Drill: Playing with layers

Now it's your turn. Using the space below, experiment with different hidden layer structures. You can try this on a subset of the data to improve runtime. See how things vary. See what seems to matter the most. Feel free to manipulate other parameters as well. It may also be beneficial to do some real feature selection work...

In [13]:
# Your code here. Experiment with hidden layers to build your own model.



In [49]:
#artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

In [89]:
artworks.head()

Unnamed: 0,Artist,Nationality,Gender,Date,Department,DateAcquired,URL,ThumbnailURL,Height (cm),Width (cm),YearAcquired
0,Otto Wagner,(Austrian),(Male),1896,Architecture & Design,1996-04-09,True,True,48.6,168.9,1996
1,Christian de Portzamparc,(French),(Male),1987,Architecture & Design,1995-01-17,True,True,40.6401,29.8451,1995
2,Emil Hoppe,(Austrian),(Male),1903,Architecture & Design,1997-01-15,True,True,34.3,31.8,1997
3,Bernard Tschumi,(),(Male),1980,Architecture & Design,1995-01-17,True,True,50.8,50.8,1995
4,Emil Hoppe,(Austrian),(Male),1903,Architecture & Design,1997-01-15,True,True,38.4,19.1,1997


First, I will create subsets to reduce the runtime and compare the full dataset to the reduced dataset for several configurations of the neural network.

In [36]:
art_50 = artworks.sample(frac=0.5)

X_50 = art_50.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

artists = pd.get_dummies(art_50.Artist)
nationalities = pd.get_dummies(art_50.Nationality)
dates = pd.get_dummies(art_50.Date)

X_50 = pd.get_dummies(X_50, sparse=True)
X_50 = pd.concat([X_50, nationalities, dates], axis=1)

Y_50 = art_50.Department

In [37]:
art_10 = artworks.sample(frac=0.1)

X_10 = art_10.drop(['Department', 'DateAcquired', 'Artist', 'Nationality', 'Date'], axis=1)

artists = pd.get_dummies(art_10.Artist)
nationalities = pd.get_dummies(art_10.Nationality)
dates = pd.get_dummies(art_10.Date)

X_10 = pd.get_dummies(X_10, sparse=True)
X_10 = pd.concat([X_10, nationalities, dates], axis=1)

Y_10 = art_10.Department

In [40]:
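# Create a function to fit a model and cross-validate it:
def run_mlp(X, Y, sizes):
    """Fit an MLP with the given hidden layer sizes and print its scores."""
    mlp = MLPClassifier(hidden_layer_sizes=sizes)
    mlp.fit(X, Y)
    print('Hidden Layer Sizes: {sizes}'.format(sizes=sizes))
    # Note: this is accuracy on the same data the model was trained on.
    print('Accuracy: ', mlp.score(X, Y))
    # cross_val_score refits clones of the model on each of the 5 folds.
    scores = cross_val_score(mlp, X, Y, cv=5)
    print('Cross Val Scores: {scores}'.format(scores=scores))
    print('\nCross Val Mean: ', scores.mean())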
    


# Test the neural net with different settings:


## 2 layers of length 100 and 4:

In [35]:
# Import the model.
from sklearn.neural_network import MLPClassifier

import time
start_time = time.time()

# Test with the original data set:
run_mlp(X, Y, [100,4])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [100, 4]
Accuracy:  0.5211922469924493
Cross Val Scores: [0.52116602 0.52116802 0.52119265 0.52121728 0.52121728]
Cross Val Mean:  0.5211922484380136

--- 53.598079204559326 seconds ---


In [36]:
import time
start_time = time.time()

run_mlp(X_50, Y_50, [100,4])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [100, 4]
Accuracy:  0.5222461631511303
Cross Val Scores: [0.52225267 0.52220752 0.52220752 0.52220752 0.52235561]
Cross Val Mean:  0.5222461692338147

--- 38.10286021232605 seconds ---


In [37]:
import time
start_time = time.time()

run_mlp(X_10, Y_10, [100,4])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [100, 4]
Accuracy:  0.5317520317520318
Cross Val Scores: [0.55004721 0.53163362 0.53141238 0.53191489 0.53169347]
Cross Val Mean:  0.5353403145368477

--- 16.16951274871826 seconds ---


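With a network structure of [100,4], the accuracy of the network stays nearly the same across datasets (0.52, 0.52, and 0.53). This is an interesting finding and not what I expected, since neural nets tend to need a lot of data.

None of the networks are overfitting by much either. However, on the full dataset the training accuracy and every cross-validation fold equal 0.521, which is exactly the share of the largest department (Prints & Illustrated Books) shown earlier. That suggests the network is predicting the most common class for every row rather than learning from the features, possibly because a second hidden layer of only 4 units is too narrow.

Perhaps the lack of data can be made up for by expanding the size of the hidden layers, so more analysis is being done despite having less data.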
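Because the [100, 4] scores on the full dataset match the share of the largest department, it is worth checking how the predictions are distributed across classes. Below is a minimal sketch of such a check; the helper name `prediction_shares` and both prediction lists are hypothetical.

```python
from collections import Counter

def prediction_shares(predictions):
    """Return the fraction of predictions assigned to each label."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: count / total for label, count in counts.items()}

# Hypothetical predictions: one collapsed model and one that uses several classes.
collapsed = ['Prints & Illustrated Books'] * 8
varied = ['Prints & Illustrated Books'] * 4 + ['Photography'] * 3 + ['Drawings']

print(prediction_shares(collapsed))
print(prediction_shares(varied))
```

With a fitted model, the same function could be called on the output of `mlp.predict(X)`. A single label with a share of 1.0 would confirm that the network has collapsed to the majority class.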

## 3 layers of length 10:
Let's test with a 3-layer structure to see what happens:

In [83]:
import time
start_time = time.time()

# Test with the original data set:
run_mlp(X, Y, [10,10,10])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [10, 10, 10]
Accuracy:  0.6582874207357986
Cross Val Scores: [0.6106964  0.61264411 0.55195388 0.56185616 0.51384557]

Cross Val Mean:  0.5701992247641375

--- 150.14933276176453 seconds ---


In [84]:
import time
start_time = time.time()

run_mlp(X_50, Y_50, [10,10,10])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [10, 10, 10]
Accuracy:  0.6549104105239283
Cross Val Scores: [0.61513748 0.63853714 0.63778114 0.62256662 0.62898195]

Cross Val Mean:  0.6286008657787081

--- 48.0041983127594 seconds ---


In [42]:
import time
start_time = time.time()

run_mlp(X_10, Y_10, [10,10,10])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [10, 10, 10]
Accuracy:  0.6238896238896239
Cross Val Scores: [0.58829084 0.61661945 0.5862069  0.53380615 0.62298959]

Cross Val Mean:  0.5895825858082173

--- 9.418059349060059 seconds ---


With a minimal network structure of **[10, 10, 10]**, the training accuracy drops from 0.66 to 0.65 to 0.62 as we reduce the dataset, so the step from the full dataset to the 50% dataset is much smaller than the step down to the 10% dataset. The cross-validation means are less orderly (0.57, 0.63, and 0.59), with the 50% dataset scoring highest. Although we see a benefit in reduced runtime, it may not be worth our while to reduce datasets aggressively in the future, because training performance falls off in a non-linear manner as the sample shrinks.
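The training and cross-validation numbers for [10, 10, 10] can be lined up to see the gap for each dataset size. The values below are copied from the three outputs above; the variable names are illustrative.

```python
# (training accuracy, cross-validation mean) copied from the [10, 10, 10] runs above.
results = {
    'full': (0.6582874207357986, 0.5701992247641375),
    '50%': (0.6549104105239283, 0.6286008657787081),
    '10%': (0.6238896238896239, 0.5895825858082173),
}

# A large positive gap means the model does much better on data it has seen.
gaps = {name: train - cv for name, (train, cv) in results.items()}

for name, gap in gaps.items():
    train, cv = results[name]
    print(f'{name:>4}: train={train:.3f}  cv={cv:.3f}  gap={gap:.3f}')
```

The gap between training and cross-validation accuracy is largest on the full dataset, so for this structure the full data shows the most overfitting by this measure.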

I have read advice online that says beyond 2-3 hidden layers, there tends not to be a big increase in performance. 

Let's test that theory by running 5 layers with 10 perceptrons each.


## 5 layers of length 10:

In [43]:
import time
start_time = time.time()

# Test with the original data set:
run_mlp(X, Y, [10,10,10,10,10])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [10, 10, 10, 10, 10]
Accuracy:  0.6119999621988905
Cross Val Scores: [0.56864783 0.64675865 0.50630818 0.50642661 0.47093847]

Cross Val Mean:  0.5398159491844005

--- 105.76184105873108 seconds ---


In [44]:
import time
start_time = time.time()

run_mlp(X_50, Y_50, [10,10,10,10,10])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [10, 10, 10, 10, 10]
Accuracy:  0.6650223028653511
Cross Val Scores: [0.54672588 0.63683614 0.63674164 0.62559063 0.63815105]

Cross Val Mean:  0.6168090668546738

--- 52.31616759300232 seconds ---


In [45]:
import time
start_time = time.time()

run_mlp(X_10, Y_10, [10,10,10,10,10])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [10, 10, 10, 10, 10]
Accuracy:  0.5562275562275563
Cross Val Scores: [0.60812087 0.55146364 0.5578649  0.49881797 0.55534532]

Cross Val Mean:  0.5543225401389625

--- 9.831654071807861 seconds ---



3 layers of [10,10,10] gave training accuracies of 0.66, 0.65, and 0.62. Increasing to 5 layers gave 0.61, 0.67, and 0.56. This shows that adding more layers doesn't necessarily improve results. It seems like the size of the layers is more important.

Here we see that the 50% reduced dataset actually performed the best (cross-validation mean of 0.62), despite some overfitting. This finding would contradict the earlier finding that more data produces better results, but this finding also uses 5 small hidden layers, when standard practice is to use fewer larger layers.




## Neural Net with 1 layer, higher length:
Let's expand the size of the hidden layers, but decrease the number to the minimum (1).

In [94]:
import time
start_time = time.time()

# Test with the original data set:
run_mlp(X, Y, [150])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [150]
Accuracy:  0.4196301161439088
Cross Val Scores: [0.52385902 0.58821584 0.53087936 0.56951139 0.53076269]

Cross Val Mean:  0.5486456597002354

--- 63.540528297424316 seconds ---


In [93]:
import time
start_time = time.time()

run_mlp(X_50, Y_50, [150])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [150]
Accuracy:  0.5033076283359794
Cross Val Scores: [0.5803099  0.62726757 0.55722522 0.60718336 0.58899707]

Cross Val Mean:  0.5921966249896583

--- 24.53026056289673 seconds ---


In [88]:
import time
start_time = time.time()

run_mlp(X_10, Y_10, [150])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [150]
Accuracy:  0.5753165753165753
Cross Val Scores: [0.56421152 0.51935788 0.57817667 0.57635934 0.49858089]

Cross Val Mean:  0.5473372595124187

--- 5.443480491638184 seconds ---




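Now we have our lowest training accuracies yet on the two larger datasets: 0.42, 0.50, and 0.58 across the three datasets, with cross-validation means of 0.55, 0.59, and 0.55. Perhaps more than one layer is needed, or if there is only one layer, it must be much larger than this.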




## NN with 1 layer, size 2:
Let's see what happens when we continue with the minimum of 1 layer, but this time with a minimal size of 2. Will the score get even worse?

In [95]:
import time
start_time = time.time()
# Test with the original data set:
run_mlp(X, Y, [2])
print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [2]
Accuracy:  0.5211922469924493
Cross Val Scores: [0.52116602 0.52116802 0.5211454  0.52121728 0.52121728]

Cross Val Mean:  0.521182797982029

--- 29.751643896102905 seconds ---


In [93]:
import time
start_time = time.time()
run_mlp(X_50, Y_50, [2])
print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [150]
Accuracy:  0.5033076283359794
Cross Val Scores: [0.5803099  0.62726757 0.55722522 0.60718336 0.58899707]

Cross Val Mean:  0.5921966249896583

--- 24.53026056289673 seconds ---


In [88]:
import time
start_time = time.time()

run_mlp(X_10, Y_10, [2])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [150]
Accuracy:  0.5753165753165753
Cross Val Scores: [0.56421152 0.51935788 0.57817667 0.57635934 0.49858089]

Cross Val Mean:  0.5473372595124187

--- 5.443480491638184 seconds ---




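On the full dataset, hidden layers=[2] scores 0.52 on both the training data and every cross-validation fold, which again matches the share of the largest department. Note that the outputs shown for the 50% and 10% datasets are labeled `Hidden Layer Sizes: [150]` and are identical to the earlier [150] runs, so those two cells appear not to have been re-run with [2]. The 0.59 and 0.55 cross-validation means therefore belong to hidden layers=[150] and cannot be used to compare the two structures.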




## NN with 1 layer, size 1000:
Let's see what happens when we keep the minimum of 1 layer, but expand the size drastically.

... computer died 2 times :(

In [30]:
import time
start_time = time.time()
# Test with the original data set:
run_mlp(X, Y, [1000])
print("\n--- %s seconds ---" % (time.time() - start_time))


--- 0.0 seconds ---


In [43]:
import time
start_time = time.time()
run_mlp(X_50, Y_50, [1000])
print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [1000]
Accuracy:  0.5791751719966735
Cross Val Scores: [0.60994047 0.27978834 0.64987715 0.62234193 0.61036015]

Cross Val Mean:  0.5544616082699998

--- 322.0889313220978 seconds ---


In [42]:
import time
start_time = time.time()

run_mlp(X_10, Y_10, [1000])

print("\n--- %s seconds ---" % (time.time() - start_time))

Hidden Layer Sizes: [1000]
Accuracy:  0.2373842373842374
Cross Val Scores: [0.5434372  0.5879017  0.53497164 0.24007561 0.19848771]

Cross Val Mean:  0.42097477557563134

--- 33.0052604675293 seconds ---




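When we expanded the layer, the run on the full dataset died twice.

On the 50% reduced dataset, the training accuracy was 0.58 and the cross-validation mean was 0.55, with one fold as low as 0.28.

On the 10% dataset, the training accuracy was only 0.24 and the cross-validation mean was 0.42, so the model was dramatically underfitting, which is interesting. The large swings between folds suggest that a layer this wide trains unstably on this data, so a bigger layer is not automatically better.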
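The fold-to-fold spread in the [1000] runs is easier to judge with a few summary statistics. The scores below are copied from the two outputs above; the dictionary layout and names are illustrative.

```python
import statistics

# Cross-validation scores copied from the hidden_layer_sizes=[1000] outputs above.
cv_runs = {
    '50% sample': [0.60994047, 0.27978834, 0.64987715, 0.62234193, 0.61036015],
    '10% sample': [0.5434372, 0.5879017, 0.53497164, 0.24007561, 0.19848771],
}

summary = {}
for name, scores in cv_runs.items():
    summary[name] = {
        'mean': statistics.mean(scores),
        'stdev': statistics.stdev(scores),   # sample standard deviation
        'range': max(scores) - min(scores),
    }
    print(name, {key: round(value, 3) for key, value in summary[name].items()})
```

A range of more than 0.35 between the best and worst fold means the mean alone says little about how a single fit will perform.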


Let's perform some basic feature selection to see if we can improve this algorithm. Now that we are trying to make a robust algorithm (instead of just tinkering with variables) we will split our data into training and testing datasets to see the accuracy.

In [46]:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.3,
                                                    random_state=0)



In [65]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier




In [67]:
import time
start_time = time.time()

pipe = Pipeline([
    ('feat', VarianceThreshold()),
    ('mlp', MLPClassifier())
])

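param_grid = [
    {
        # The variance of a boolean feature is p * (1 - p). Note that p = 0.6 and
        # p = 0.4 both give 0.24, so only two distinct thresholds are searched.
        'feat__threshold': [(.8 * (1 - .8)),(.6 * (1 - .6)),(.4 * (1 - .4))],
        'mlp__hidden_layer_sizes': [[10,10,10],[1000]]
    }
]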

grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)
grid.fit(X_train, Y_train)
print(grid.best_params_)
prediction = grid.predict(X_test)
print(accuracy_score(Y_test, prediction))

print("\n--- %s seconds ---" % (time.time() - start_time))

{'feat__threshold': 0.24, 'mlp__hidden_layer_sizes': [10, 10, 10]}
0.5192465192465192

--- 230.45436120033264 seconds ---


With a test accuracy of 0.52, feature selection via variance threshold did not improve our results. We used two of the hidden layer structures from earlier in the notebook to test this.

When using a hidden layer structure of [10,10,10] on the full dataset without feature selection, the cross-validation mean was 0.57, so the drop to 0.52 here is strange.
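The thresholds in the grid use the fact that a boolean (0/1) feature that is 1 in a fraction `p` of rows has variance `p * (1 - p)`. The short sketch below evaluates the three expressions from the grid; the function name is illustrative.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 feature that equals 1 in a fraction p of the rows."""
    return p * (1 - p)

for p in (0.8, 0.6, 0.4):
    print(p, round(bernoulli_variance(p), 4))
```

Because `p * (1 - p)` is symmetric around 0.5, the expressions for 0.6 and 0.4 give the same threshold (0.24), so the grid above effectively searches two thresholds rather than three. A threshold of 0.24 is also strict: it removes any dummy column that is 1 in fewer than 40% or more than 60% of rows, which likely covers most of the nationality and date dummies.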

## SelectKBest
We will try one more method of feature reduction -- SelectKBest.

In [68]:
import time
start_time = time.time()

pipe = Pipeline([
    ('feat', SelectKBest(k=100)),
    ('mlp', MLPClassifier(hidden_layer_sizes = [10,10,10]))
])

pipe.fit(X_train, Y_train)
prediction = pipe.predict(X_test)
print(accuracy_score(Y_test, prediction))

print("\n--- %s seconds ---" % (time.time() - start_time))

  f = msb / msw


0.5767340767340767




Alright! Using SelectKBest with 100 features, we were able to obtain a score of 0.58 on the held-out test set. This is still not great, but it is an improvement over the 0.52 from the variance-threshold pipeline. Since it comes from a single train/test split, I would treat the size of the gain as tentative.
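When no `score_func` is given, SelectKBest scores each feature with the ANOVA F-test (`f_classif`), and the `f = msb / msw` fragment in the output above looks like part of a runtime warning from that computation. The sketch below is a plain-Python version of the statistic for a single feature, written for intuition rather than as scikit-learn's implementation; the function name and toy data are assumptions.

```python
from collections import defaultdict

def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature; NaN if within-group variance is zero."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    ss_between = 0.0
    ss_within = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)

    msb = ss_between / (k - 1)  # mean square between groups
    msw = ss_within / (n - k)   # mean square within groups
    return msb / msw if msw > 0 else float('nan')

labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(anova_f([1, 2, 3, 4, 5, 6], labels))  # feature that separates the groups
print(anova_f([1, 1, 1, 1, 1, 1], labels))  # constant feature
```

A constant column has zero variance within every group, so `msw` is zero and the ratio is undefined. Dummy columns that are all zero in the training split would behave the same way, which is one plausible source of the warning.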



## Conclusion

I have experimented with hidden layer structures, dataset sizes, and feature selection to classify this set using neural nets. With the full feature set, I generally found that including all datapoints led to greater accuracy scores, but this was not always the case. Sometimes including all the datapoints led to overfitting, so I will make a note to be cautious about this in the future. When the structure was very small and minimal, I saw more consistency in results between the differently-sized datasets.

Given that there were so many categorical features conveying little information per feature, I thought that setting a variance threshold to reduce the feature set would be beneficial. However, even after iterating through multiple threshold values and hidden layer structures using GridSearchCV, I did not see an increase in performance. I did, however, see test accuracy rise from 0.52 to 0.58 when using SelectKBest. I will continue using these methods as I learn more about data science.
