In [491]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
import sklearn.preprocessing  # loads the submodule used by sklearn.preprocessing.scale below
%matplotlib inline

We're ready to build our first neural network. We will feed multiple features into our model, each of which will pass through a set of perceptron models to arrive at a response, and the network will be trained so that this response matches our output.

Like many models we've covered, this can be used as either a regression or a classification model.
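To make "features going through a set of perceptrons" concrete, here is a minimal sketch of a single forward pass in NumPy. The layer sizes, the random weights and the helper name `forward` are all made up for illustration; this is not how scikit-learn implements it internally, just the same idea in a few lines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 input features, 4 hidden perceptrons, 2 output classes.
n_features, n_hidden, n_classes = 3, 4, 2

# Randomly initialised weights and biases; training would adjust these.
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(x):
    """Push a batch of rows through one hidden layer and a softmax output."""
    hidden = np.maximum(0, x @ W1 + b1)    # each column is one perceptron (ReLU)
    scores = hidden @ W2 + b2              # one raw score per class
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

x = rng.normal(size=(5, n_features))   # five made-up rows
probs = forward(x)
print(probs.shape)         # (5, 2)
print(probs.sum(axis=1))   # each row sums to 1, up to rounding
```

Training amounts to adjusting `W1`, `b1`, `W2` and `b2` so that these probabilities line up with the known labels.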

First, we need to load our dataset. For this example we'll use The Museum of Modern Art in New York's [public dataset](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv) on their collection.

In [492]:
artworks = pd.read_csv('https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv')

In [493]:
artworks.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

We'll also do a bit of data processing and cleaning: selecting columns of interest, dropping departments that are tricky to model, and dropping rows with missing data.

In [494]:
# Select Columns.
#artworks = artworks[['Artist', 'Nationality', 'Gender', 'Date', 'Department', 'Classification','Medium','DateAcquired', 'URL', 'ThumbnailURL', 'Height (cm)', 'Width (cm)']]
artworks = artworks[[ 'Date', 'Department','Medium', 'Classification','DateAcquired', 'Height (cm)', 'Width (cm)']]


# Drop films and some other tricky rows.
artworks = artworks[artworks['Department']!='Film']
artworks = artworks[artworks['Department']!='Fluxus Collection']

# Drop missing data.
artworks = artworks.dropna()

In [495]:
artworks.head()

Unnamed: 0,Date,Department,Medium,Classification,DateAcquired,Height (cm),Width (cm)
0,1896,Architecture & Design,Ink and cut-and-pasted painted pages on paper,Architecture,1996-04-09,48.6,168.9
1,1987,Architecture & Design,Paint and colored pencil on print,Architecture,1995-01-17,40.6401,29.8451
2,1903,Architecture & Design,"Graphite, pen, color pencil, ink, and gouache ...",Architecture,1997-01-15,34.3,31.8
3,1980,Architecture & Design,Photographic reproduction with colored synthet...,Architecture,1995-01-17,50.8,50.8
4,1903,Architecture & Design,"Graphite, color pencil, ink, and gouache on tr...",Architecture,1997-01-15,38.4,19.1


## Building a Model

Now, let's see if we can use multi-layer perceptron modeling (or "MLP") to classify the department a piece should go into using everything but the department name.

Before we import MLP from SKLearn and establish the model, we first have to ensure correct typing for our data and do some other cleaning.

In [496]:
# Get data types.
artworks.dtypes

Date               object
Department         object
Medium             object
Classification     object
DateAcquired       object
Height (cm)       float64
Width (cm)        float64
dtype: object

The `DateAcquired` column is an object. Let's transform that to a datetime object and add a feature for just the year the artwork was acquired.

In [497]:
artworks['DateAcquired'] = pd.to_datetime(artworks.DateAcquired)
artworks['YearAcquired'] = artworks.DateAcquired.dt.year
artworks['YearAcquired'].dtype

dtype('int64')

In [498]:
artworks.Department.value_counts()

Drawings & Prints        59402
Photography              24497
Architecture & Design    12203
Painting & Sculpture      3652
Media and Performance      474
Name: Department, dtype: int64

Great. Let's do some more miscellaneous cleaning.

In [499]:
#Feature engineering
artworks['medium_groups'] = 'med_other'
artworks.loc[artworks['Medium'].str.contains("paper|Book|canvas|print|Lithograph|lithograph|Letterpress|Photograph|Drawing|Poster|Notebook", regex=True),'medium_groups'] = 'Med_Drawings_prints'
artworks.loc[artworks['Medium'].str.contains("Photo|Collage|photo", regex=True),'medium_groups'] = 'Med_Photo'
artworks.loc[artworks['Medium'].str.contains("Architecture|Design|Textile|Model", regex=True),'medium_groups'] = 'Med_Archi'
artworks.loc[artworks['Medium'].str.contains("Painting|Oil|Drypoint|Etching|Sculpture|Wood|Bronze|Plaster|Marble|Stone|plate", regex=True),'medium_groups'] = 'Med_Paint_scul'
artworks.loc[artworks['Medium'].str.contains("Media|Video|Audio|Performance|Film|Tape|Software", regex=True),'medium_groups'] = 'Med_Media'

In [500]:
#Feature engineering
artworks['Class'] = 'other'
artworks.loc[artworks['Classification'].str.contains("Paper|Book|Print|Photograph|Drawing|Poster|Notebook", regex=True),'Class'] = 'Drawings or prints'
artworks.loc[artworks['Classification'].str.contains("Photo|Collage", regex=True),'Class'] = 'Photo'
artworks.loc[artworks['Classification'].str.contains("Architecture|Design|Textile|Model", regex=True),'Class'] = 'Archi'
artworks.loc[artworks['Classification'].str.contains("Painting|Sculpture", regex=True),'Class'] = 'Paint_scul'
artworks.loc[artworks['Classification'].str.contains("Media|Video|Audio|Performance|Film|Tape|Software", regex=True),'Class'] = 'Media'

In [501]:

# Convert dates to start date, cutting down number of distinct examples.
artworks['Date'] = pd.Series(artworks.Date.str.extract(
    '([0-9]{4})', expand=False))[:-1]


# Scale the numeric columns, then drop every column that is unused or dummy-encoded below.
artworks[['Height (cm)', 'Width (cm)']] = sklearn.preprocessing.scale(artworks[['Height (cm)', 'Width (cm)']])
X = artworks.drop(['Department', 'Medium', 'medium_groups', 'DateAcquired', 'Date', 'YearAcquired', 'Classification', 'Class'], axis=1)

# Create dummies separately.

dates = pd.get_dummies(artworks.Date)
mediums = pd.get_dummies(artworks.medium_groups)
classitems = pd.get_dummies(artworks.Class)
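To see what the date extraction and dummy encoding above do, here is a small sketch on made-up strings written in the style of the `Date` column. The sample values are invented; only `str.extract` and `pd.get_dummies` come from the cell above.

```python
import pandas as pd

# Made-up date strings in the style of the 'Date' column.
raw_dates = pd.Series(['1896', 'c. 1917', '1976-77', 'Unknown'])

# Keep the first run of four digits; rows with no match become NaN.
years = raw_dates.str.extract('([0-9]{4})', expand=False)
print(years.tolist())

# One indicator column per distinct year; NaN rows get all zeros.
year_dummies = pd.get_dummies(years)
print(year_dummies.columns.tolist())
```

Collapsing ranges such as `1976-77` to a single start year is what keeps the number of dummy columns manageable.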


In [502]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100228 entries, 0 to 138053
Data columns (total 2 columns):
Height (cm)    100228 non-null float64
Width (cm)     100228 non-null float64
dtypes: float64(2)
memory usage: 2.3 MB


In [503]:

# Concatenate with the other variables. Adding artists slows this way down, so we'll leave them out for now.

X = pd.concat([X, mediums, dates, classitems], axis=1)



Y = artworks.Department

In [None]:
# Alright! We've done our prep, let's build the model.
# Neural networks are hugely computationally intensive.
# This may take several minutes to run.

# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X, Y)

In [10]:
mlp.score(X, Y)

0.68022899006559012

In [308]:
Y.value_counts()/len(Y)

Drawings & Prints        0.590534
Photography              0.245183
Architecture & Design    0.121846
Painting & Sculpture     0.036552
Media and Performance    0.004744
Film                     0.001141
Name: Department, dtype: float64

In [12]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([ 0.57519536,  0.52577072,  0.36922856,  0.48580744,  0.54039799])

We got a lot of information from all of this. First, the model seems to overfit: it scores 0.68 on the data it was trained on, but only between 0.37 and 0.58 under cross validation. There is still some remaining performance when validated this way. Overfitting like this is a feature of neural networks that aren't given enough data for the number of features present. _Neural networks, in general, like_ a lot _of data_. You may also have noticed something else about neural networks: _they can take a_ long _time to run_. Try increasing the layer size by adding a zero. Feel free to interrupt the kernel if you don't have time...
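A quick way to look for overfitting, besides cross validation, is to hold out part of the data and compare the two scores. The sketch below uses synthetic data from `make_classification` as a stand-in for the artworks features, so the variable names and numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: few rows relative to the number of features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

mlp_demo = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
mlp_demo.fit(X_train, y_train)

# A large gap between these two scores is the usual sign of overfitting.
train_score = mlp_demo.score(X_train, y_train)
test_score = mlp_demo.score(X_test, y_test)
print('train:', train_score)
print('test: ', test_score)
print('gap:  ', train_score - test_score)
```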

Also note that the artist's name was left out of the features. Both of the above points are the reason for that: including it would take much longer to run and would make the model much more prone to overfitting.

## Model parameters

Now, before we move on and let you loose with some tasks to work on the model, let's go over the parameters.

We included one parameter: hidden layer size. Recall from the previous lesson that a neural network is organized into layers. This parameter sets how many hidden layers to use and how big to make each one. Pass in a tuple that specifies each layer's size. Our network is a single layer, 1000 neurons wide. `(100, 4)` would create a network with two layers, one 100 wide and the other 4.
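One way to confirm how the tuple is read is to fit a small model and inspect the shapes of its weight matrices. This sketch uses synthetic data (10 features, 3 classes) rather than the artworks data; `n_layers_` and `coefs_` are attributes of a fitted `MLPClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 10 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0)

# Two hidden layers: 100 perceptrons, then 4.
mlp_demo = MLPClassifier(hidden_layer_sizes=(100, 4), max_iter=50, random_state=0)
mlp_demo.fit(X_demo, y_demo)

# n_layers_ counts the input and output layers as well as the hidden ones.
print(mlp_demo.n_layers_)
# One weight matrix per connection between consecutive layers.
print([w.shape for w in mlp_demo.coefs_])
```

The shapes read left to right as inputs to first hidden layer, first to second hidden layer, and second hidden layer to outputs. Each entry in the tuple adds a layer; the value is that layer's width.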

How many layers to include is determined by two things: computational resources and cross validation searching for convergence. It's generally less than the number of input variables you have.

You can also set an alpha. Neural networks like this use a regularization parameter that penalizes large coefficients just like we discussed in the advanced regression section. Alpha scales that penalty.
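As a rough sketch of how you might compare a few alpha values, the loop below cross-validates the same small network under three penalties. The data is synthetic and the three values are arbitrary picks for illustration, apart from `1e-4`, which is scikit-learn's default.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

alpha_scores = {}
for alpha in [1e-4, 1e-2, 1.0]:   # 1e-4 is scikit-learn's default
    model = MLPClassifier(hidden_layer_sizes=(50,), alpha=alpha,
                          max_iter=300, random_state=0)
    # Mean accuracy over 3 folds for this penalty strength.
    alpha_scores[alpha] = cross_val_score(model, X_demo, y_demo, cv=3).mean()

for alpha, score in alpha_scores.items():
    print(alpha, round(score, 3))
```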

Lastly, we'll discuss the activation function. The activation function determines what an individual perceptron outputs for a given weighted input, and in particular whether that output is binary or continuous. By default this is 'relu', the rectified linear unit function, which outputs zero for negative inputs and the input itself otherwise. In the exercise we went through earlier we used a binary function, but we discussed the _sigmoid_ as a reasonable alternative. The _sigmoid_ (called 'logistic' by SKLearn because it's a 'logistic sigmoid function') produces continuous values between 0 and 1, which allows for a more nuanced model. It does come at the cost of increased computational complexity.
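To make the difference between these functions concrete, here they are written out in NumPy. The function names are our own; in scikit-learn you would simply pass `activation='relu'` or `activation='logistic'` (the other accepted values are `'identity'` and `'tanh'`).

```python
import numpy as np

def step(x):
    """Binary step: 1 where the input is positive, else 0."""
    return (x > 0).astype(float)

def relu(x):
    """Rectified linear unit: passes positive inputs through, zeroes the rest."""
    return np.maximum(0.0, x)

def logistic(x):
    """Logistic sigmoid: squashes any input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 2.0])
print(step(z))      # [0. 0. 1.]
print(relu(z))      # [0. 0. 2.]
print(logistic(z))  # approximately [0.119 0.5 0.881]
```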

If you want to learn more about these, study [activation functions](https://en.wikipedia.org/wiki/Activation_function) and [multilayer perceptrons](https://en.wikipedia.org/wiki/Multilayer_perceptron). The [Deep Learning](http://www.deeplearningbook.org/) book referenced earlier goes into great detail on the linear algebra involved.

You could also just test the models with cross validation. Unless neural networks are your specialty cross validation should be sufficient.
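One way to run that kind of test systematically is a grid search, which cross-validates every combination of the settings you list. The sketch below uses synthetic data and a deliberately tiny, arbitrary grid so it runs quickly; on the artworks data you would swap in `X` and `Y` and expect it to take far longer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

# Arbitrary small grid: three layer structures, two activation functions.
param_grid = {
    'hidden_layer_sizes': [(20,), (50,), (20, 4)],
    'activation': ['relu', 'logistic'],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```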

For the other parameters and their defaults, check out the [MLPClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier).

## Drill: Playing with layers

Now it's your turn. Using the space below, experiment with different hidden layer structures. You can try this on a subset of the data to improve runtime. See how things vary. See what seems to matter the most. Feel free to manipulate other parameters as well. It may also be beneficial to do some real feature selection work...

In [506]:
# Your code here. Experiment with hidden layers to build your own model.

# Take a 10% subset of the data to go easier on runtime.

print ('Dimensions of X', X.shape)
print ('Dimensions of Y',Y.shape)
X=X.sample(frac=0.1, replace=False, random_state=2)
Y=Y.sample(frac=0.1, replace=False, random_state=2)
print ('Subset of X', X.shape)
print ('Subset of Y', Y.shape)

Dimensions of X (100228, 212)
Dimensions of Y (100228,)
Subset of X (10023, 212)
Subset of Y (10023,)


In [507]:
Y.reset_index(drop=True, inplace=True)
X.reset_index(drop=True, inplace=True)
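An alternative way to take the subset, sketched below on made-up data, is to split `X` and `Y` in a single call with `train_test_split`. That keeps the rows aligned by construction, and `stratify` preserves the class proportions, which matters when some departments are rare. The frame `X_full`, the series `Y_full` and their contents are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-ins for X and Y with an imbalanced target.
rng = np.random.default_rng(0)
X_full = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])
Y_full = pd.Series(['big'] * 900 + ['small'] * 100, name='Department')

# Keep 10% of the rows; stratify so class proportions are preserved.
X_sub, _, Y_sub, _ = train_test_split(
    X_full, Y_full, train_size=0.1, stratify=Y_full, random_state=2)

print(X_sub.shape, Y_sub.shape)
print(Y_sub.value_counts().to_dict())
print((X_sub.index == Y_sub.index).all())   # rows still line up
```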

In [106]:
# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,))
mlp.fit(X, Y)
mlp.score(X, Y)

0.59218436873747493

In [34]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([ 0.59360731,  0.52995392,  0.61111111,  0.625     ,  0.60185185])

### Now using two layers, 1000 and 4 wide

In [13]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,4))
mlp.fit(X, Y)
mlp.score(X, Y)



0.59594095940959413

In [14]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)



array([ 0.59360731,  0.59447005,  0.59722222,  0.59722222,  0.59722222])

### Now using two layers, 1000 and 6 wide

In [77]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,6))
mlp.fit(X, Y)
mlp.score(X, Y)



0.59686346863468631

In [78]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)



array([ 0.59360731,  0.59447005,  0.59722222,  0.59722222,  0.59722222])

Conclusion: adding a second hidden layer after the 1000-perceptron layer, and widening it from 4 to 6 units, gives essentially the same accuracy and takes more time to converge. Note that `(1000, 4)` and `(1000, 6)` both describe two hidden layers; the second number is the width of the second layer, not a layer count.

The training score and the cross-validation scores are nearly identical, so there is no sign of overfitting in either two-layer network.

That is not necessarily good news, though. A score of about 0.596 matches the share of the largest class, Drawings & Prints, in the sample (see the class counts below), which suggests the model is predicting the majority class for every row rather than learning from the features.
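A useful sanity check for scores like these is a majority-class baseline: a "model" that always predicts the most common label. The sketch below uses made-up labels with roughly the imbalance seen in this dataset. If a network's accuracy only matches this baseline, it has probably not learned anything from the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with roughly the imbalance seen in this dataset.
y_demo = np.array(['Drawings & Prints'] * 60 + ['Photography'] * 25
                  + ['Architecture & Design'] * 15)
X_demo = np.zeros((len(y_demo), 1))   # features are ignored by the baseline

# Always predict whichever label is most common in the training data.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_score = baseline.score(X_demo, y_demo)
print(baseline_score)   # 0.6, the share of the largest class
```

On the real data, `np.unique(mlp.predict(X), return_counts=True)` would show directly whether the network is predicting more than one department.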

### To overcome the convergence warning or make the model converge better

In [79]:
# changed the solver to lbfgs, which is better with smaller datasets.
mlp = MLPClassifier(hidden_layer_sizes=(1000,4), solver='lbfgs', max_iter=400)
mlp.fit(X, Y)
mlp.score(X, Y)

0.23985239852398524

In [80]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([ 0.59360731,  0.59447005,  0.59722222,  0.59722222,  0.59722222])

I also tried raising the maximum number of iterations to 400, but it had no effect on the cross-validated accuracy, and the training score dropped to 0.24.

In [54]:
Y.value_counts()

Drawings & Prints        646
Photography              260
Architecture & Design    125
Painting & Sculpture      47
Media and Performance      6
Name: Department, dtype: int64

It looks like there is a class imbalance problem. Let's see whether addressing it increases accuracy.

In [157]:
# SMOTE-NC: oversample the minority classes until every class is the same size.
from imblearn.over_sampling import SMOTENC
from collections import Counter
# Note: categorical_features=[0] marks column 0 as categorical; in X that is the scaled 'Height (cm)'.
smote_NC = SMOTENC(categorical_features=[0], random_state=0)
X, Y = smote_NC.fit_resample(X, Y)
print('Resampled dataset samples per each class\n {}'.format(Counter(Y)))

Resampled dataset samples per each class
 Counter({'Drawings & Prints': 5967, 'Photography': 5967, 'Architecture & Design': 5967, 'Painting & Sculpture': 5967, 'Media and Performance': 5967})


In [61]:
# changed the solver to lbfgs, which is better with smaller datasets.
mlp = MLPClassifier(hidden_layer_sizes=(1000,4), solver='lbfgs')
mlp.fit(X, Y)
mlp.score(X, Y)

0.20000000000000001

In [62]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([ 0.2       ,  0.2744186 ,  0.14883721,  0.2       ,  0.2       ])

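But that had a negative effect: a score of 0.20 is what you would get by predicting a single class for every row when five classes are perfectly balanced. I'm not sure yet why this happened and need to study it further. One thing to check is `categorical_features=[0]`, which tells SMOTENC to treat the first column, the scaled height, as categorical.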

### Feature selection to have an effect on outcome variable

I added the classification and medium features separately at the beginning of the script and ran it again.

In [None]:
#Feature engineering


#Shown above

In [461]:
X.head()

Unnamed: 0,Height (cm),Width (cm),YearAcquired,Med_Archi,Med_Drawings_prints,Med_Media,Med_Paint_scul,Med_Photo,med_other,1501,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,YearAcquired.1
0,-0.021569,-0.233174,1947,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1947
1,-0.645021,-0.482975,1992,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1992
2,-0.392717,-0.366442,1976,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1976
3,-0.354489,-0.363599,1982,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1982
4,0.142475,0.011582,2011,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2011


In [508]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,4), solver='lbfgs')
mlp.fit(X, Y)
mlp.score(X, Y)

0.99800458944427817

In [510]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([ 0.9775673 ,  0.97457627,  0.98802993,  0.98303393,  0.98551449])

Next, I doubled the width of the second hidden layer, from 4 to 8.

In [511]:
mlp = MLPClassifier(hidden_layer_sizes=(1000,8), solver='lbfgs')
mlp.fit(X, Y)
mlp.score(X, Y)

0.99730619574977553

In [512]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, X, Y, cv=5)

array([ 0.97706879,  0.97956132,  0.98603491,  0.98602794,  0.98751249])

Accuracy went from about 59% to about 98% under cross validation (99.8% on the training data) after some feature engineering and selection.

Introducing one group of features, the medium on which the artwork was done, raised accuracy from 59% to 82%. Introducing a second group, the classification of the artwork, raised it to 99%.

From this one can say with some confidence that feature engineering is crucial, and often the most important element, for the success of a data science model.

The difference between the training score and the cross-validation scores is small, around 1 to 2 percentage points, suggesting only slight overfitting.
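For reference, the gap can be worked out directly from the scores printed above for the `(1000, 4)` lbfgs run. The numbers are copied from those outputs; nothing here is refit.

```python
import numpy as np

# Scores copied from the (1000, 4) lbfgs run above.
train_score = 0.99800458944427817
cv_scores = np.array([0.9775673, 0.97457627, 0.98802993, 0.98303393, 0.98551449])

cv_mean = cv_scores.mean()
gap = train_score - cv_mean
print(round(cv_mean, 4))   # 0.9817
print(round(gap, 4))       # 0.0163
```

So the training score sits about 1.6 percentage points above the mean cross-validation score.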