<a href="https://colab.research.google.com/github/chabelicastano/cap4770-spring23/blob/main/Labs/ChabeliCastano_bagging_n_pasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bagging and Pasting

#### Part of the [Inquiryum Machine Learning Fundamentals Course](http://inquiryum.com/machine-learning/)

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/bagging..png)


Now we are about to embark on our journey from simple decision trees to algorithms that use decision trees as components. The path goes like this:


![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/dbxg.png)

The use of decision trees began in the 1980s and XGBoost was introduced in 2016. Throughout the next few notebooks we will explore this progression of algorithms.  

### A collective of classifiers

To gain an intuition on how this works, let's look at how our confidence might increase when more people tell us something. Whether it is multiple doctors giving us the same diagnosis or something as simple as ...

#### The Mary Spender example

Let's say one of your friends mentions over lunch that you would love a particular musical artist on YouTube, say Mary Spender, whom you have never heard before.

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/MarySpender2.png)

What is the chance that you will actually like Mary Spender's music? Maybe slightly better than chance? Let's say you think there is a 60% chance you will like her. You will file away the recommendation but you are not going to rush home and watch a YouTube video. Now, in addition to the lunch friend's recommendation, an old music school friend, now living in Austin, messages you saying you should check out Mary Spender, and the friend predicts you will absolutely love her. Then a week later, while talking with an old bandmate over the phone, that bandmate, again, recommends Mary Spender. Over the course of less than 10 days, three of your friends independently (because they don't know one another) recommend Mary Spender. Now what is the likelihood of you liking Mary Spender? I am guessing you think that now it is higher than 60%. Maybe now you think it is 90% likely you will like her. It is the aggregate of these 3 people's opinions (3 classifiers) that ups the accuracy of the prediction.

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/spender22.png)


This is similar to how bagging works. One aggregates the votes of a number of classifiers and the vote of that ensemble of classifiers is more accurate than that of a single classifier. Even if the accuracy of each component classifier is low (known as a weak classifier), the ensemble can be a strong (high accuracy) classifier. Of course there are some caveats. 

Back to the Mary Spender example. Suppose one of your friends went to a Mary Spender concert and then later in the week met with four of your other friends and mentioned that she thought you would love Mary Spender's music. Then, over the course of a week, all those friends recommended Mary Spender to you. In this case the recommendations are not that independent: all are based on one person's opinion. Thus, the accuracy would not be as great as in the example above. Similarly, if you made 10 copies of the exact same classifier, each trained on exactly the same data, the accuracy of the ensemble of clones would not be any better than the accuracy of a single copy. Moving away from Mary Spender and our musical tastes and back to machine learning, we can try to create independence among the classifiers in two ways:

1. We can change the type of classifier. For example, we can use a k-Nearest Neighbor Classifier with Manhattan distance and a k of 5, a k-Nearest Neighbor Classifier with Euclidean distance and a k of 3, a decision tree classifier using entropy and a max depth of 5, and a decision tree classifier using gini and no max depth specified. Hopefully, the accuracy of the ensemble of the four classifiers would be greater than that of a single classifier.
2. We can have an ensemble of the same classifier (for example, 10 decision tree classifiers with identical hyperparameters) but each classifier can get a different subset of the training data. The classifiers would thus build different models (different 'rules') and, again, the accuracy of the ensemble should be greater than that of a single classifier. This is the approach we will take.
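To put rough numbers on the claim that independent weak classifiers can form a stronger ensemble, here is a minimal sketch. It assumes a two-class problem, a simple majority vote, and classifiers whose errors are fully independent (real classifiers trained on overlapping data are not). The helper name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are right,
    when each voter is right with probability p (n is assumed to be odd)."""
    needed = n // 2 + 1  # smallest number of correct votes that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(needed, n + 1))

# Each voter is right 60% of the time, as in the Mary Spender example.
for n in (1, 3, 11, 101):
    print(n, "voters:", round(majority_vote_accuracy(0.6, n), 3))
```

With three voters the result is 3(0.6²)(0.4) + 0.6³ = 0.648, a modest gain over 0.6, and the figure keeps rising as more independent voters are added. Clones of one classifier would all vote the same way, so the ensemble would stay at 0.6.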

### Bagging and Pasting


In this Jupyter notebook, we are going to explore Bagging algorithms. Bagging algorithms come in a variety of 'flavors' including one called 'bagging' and one called 'pasting'.

But first, an experiment on what *with replacement* means. As you will see shortly, that term is the crucial difference between bagging and pasting.

### A small experiment
NOTE: The following code is just used for illustration and is nothing we will be using for machine learning. 

Consider a list of 5 red balls and 5 blue balls:

In [1]:
bag = ['red', 'red', 'red', 'red', 'red',
       'blue', 'blue', 'blue', 'blue', 'blue']

Suppose we want to pick 7 random balls from this list. Python's `random` module offers two functions that will give us random elements from a list. One is called `choices`, which selects a sample with replacement: once a ball is selected it is put back in the bag, so it can be selected again. Let's give it a try, and because things are random let's do this 100 times:

In [2]:
import random
total = 0
for i in range(100):
    # 7 picks with replacement (named `draw` to avoid shadowing the built-in `set`)
    draw = random.choices(bag, k=7)
    blue = draw.count('blue')
    red = draw.count('red')
    if blue > 5 or red > 5:
        print("%i blue and %i red" % (blue, red))
        total += 1
print("Balls selected exceeded balls in bag: %i" % (total))

6 blue and 1 red
1 blue and 6 red
1 blue and 6 red
6 blue and 1 red
1 blue and 6 red
1 blue and 6 red
1 blue and 6 red
1 blue and 6 red
6 blue and 1 red
1 blue and 6 red
Balls selected exceeded balls in bag: 10


*A reminder: please don't mindlessly execute the code. Look at it and understand it*

There are five blue balls and five red balls in the bag. Since we are doing the selection with replacement, there are times when we select more than 5 blue balls (or more than 5 red ones).

The final line of the output, for example

```
Balls selected exceeded balls in bag: 13
```

shows how many of the 100 draws that happened. The exact count changes from run to run.


When I ran this, 14 times out of 100 had more of one color ball than there were in the original bag. In fact, several times I ended up with all 7 of the balls the same color, even though the original list had only 5 balls of each color:

```
7 blue and 0 red
6 blue and 1 red
7 blue and 0 red
1 blue and 6 red
0 blue and 7 red
1 blue and 6 red
1 blue and 6 red
6 blue and 1 red
6 blue and 1 red
1 blue and 6 red
6 blue and 1 red
1 blue and 6 red
6 blue and 1 red
6 blue and 1 red
Balls selected exceeded balls in bag: 14
```
Again, this is called selecting with replacement (we put what we selected back in the set before selecting again). 
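The counts of 10, 13 and 14 out of 100 seen above are in line with what probability predicts. As a short worked check, the sketch below treats each of the 7 picks as an independent 50/50 choice between blue and red (which is what drawing with replacement from 5 blue and 5 red gives) and adds up the binomial probabilities of the draws that exceed the bag.

```python
from math import comb

n_draws = 7
p_blue = 5 / 10  # with replacement, every pick sees 5 blue balls out of 10

def prob_blue_count(k):
    """Binomial probability of exactly k blue balls in n_draws picks."""
    return comb(n_draws, k) * p_blue**k * (1 - p_blue)**(n_draws - k)

# More than 5 blue means 6 or 7 blue; more than 5 red means 0 or 1 blue.
p_exceed = sum(prob_blue_count(k) for k in (0, 1, 6, 7))
print(p_exceed)  # 0.125
```

So about 12 or 13 draws in every 100 are expected to contain more balls of one color than the bag holds.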

The alternative is to select without replacement: once we select something we can't select it again. Python's `random.sample` does this:


In [3]:
import random
total = 0
for i in range(1000):
    # 7 picks without replacement (named `draw` to avoid shadowing the built-in `set`)
    draw = random.sample(bag, k=7)
    blue = draw.count('blue')
    red = draw.count('red')
    if blue > 5 or red > 5:
        print("%i blue and %i red" % (blue, red))
        total += 1
print("Balls selected exceeded balls in bag: %i" % (total))

Balls selected exceeded balls in bag: 0


As you can see, the number of a specific colored ball that we select never exceeded the number of balls of that color in the original set.

Now back to bagging and pasting. In both approaches we are going to sample the training data. Let's say we want 70% of the training data in our sample. In bagging ([Breiman, 1996](https://link.springer.com/content/pdf/10.1007/BF00058655.pdf)), if our training dataset has 1,000 instances and we want 70% for a particular classifier, the algorithm will randomly select 700 out of the 1,000 **with replacement**. With pasting ([Breiman, 1998](https://link.springer.com/article/10.1023/A:1007563306331)), the selection is done **without replacement**.
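The same two `random` functions from the ball experiment can illustrate the difference on row numbers. This is only a sketch of the sampling step, not of how any library implements it; the seed is fixed just to make the run repeatable.

```python
import random

rng = random.Random(0)           # fixed seed so the sketch is repeatable
n_instances, n_draw = 1000, 700  # sizes from the example in the text
row_ids = range(n_instances)

bagging_rows = rng.choices(row_ids, k=n_draw)  # with replacement
pasting_rows = rng.sample(row_ids, k=n_draw)   # without replacement

print("bagging:", len(bagging_rows), "draws,", len(set(bagging_rows)), "distinct rows")
print("pasting:", len(pasting_rows), "draws,", len(set(pasting_rows)), "distinct rows")
```

Both samples contain 700 draws, but the bagging sample repeats some rows and therefore covers fewer distinct instances, while the pasting sample always covers exactly 700.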

#### but wait, there is more ...

There are two other options. Instead of selecting a random subset of training data instances, we can select a random subset of columns (features). Let's say we have a dataset of 1,000 instances each with 100 features. When we select a random subset of columns, we still have 1,000 instances but now they have just a subset of the features. This is called Random Subspaces ([Ho, 1998](https://pdfs.semanticscholar.org/b41d/0fa5fdaadd47fc882d3db04277d03fb21832.pdf?_ga=2.196949164.1638238666.1596910000-1073138517.1596910000)).

Finally, we can train a classifier on both random subsets of instances and random subsets of features. This is known as Random Patches ([Louppe and Geurts, 2012](https://www.researchgate.net/publication/262212941_Ensembles_on_Random_Patches)).

In summary, the four methods are:

* **bagging** - select a subset of data set instances with replacement
* **pasting** - select a subset of data set instances without replacement
* **Random Subspaces** - select a subset of features
* **Random Patches** - select both a subset of features and of instances
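As a rough picture of what the last two methods hand to each classifier, the sketch below draws row and column numbers for the 1,000 by 100 dataset mentioned above. The subset sizes (700 rows, 30 columns) are arbitrary values chosen for illustration.

```python
import random

rng = random.Random(1)               # fixed seed so the sketch is repeatable
n_instances, n_features = 1000, 100  # sizes from the example in the text
n_rows, n_cols = 700, 30             # hypothetical subset sizes

# Random Subspaces: keep every row, sample only the columns.
subspace_rows = list(range(n_instances))
subspace_cols = sorted(rng.sample(range(n_features), k=n_cols))

# Random Patches: sample both rows and columns
# (rows are drawn without replacement in this sketch).
patch_rows = rng.sample(range(n_instances), k=n_rows)
patch_cols = sorted(rng.sample(range(n_features), k=n_cols))

print("subspace shape:", (len(subspace_rows), len(subspace_cols)))  # (1000, 30)
print("patch shape:   ", (len(patch_rows), len(patch_cols)))        # (700, 30)
```

Each classifier in the ensemble would get its own independent draw, so different classifiers see different columns (and, for patches, different rows).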

Let's see how this works!

First, let's grab the Wisconsin Cancer data we used before:

#### Wisconsin Cancer Dataset
![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/aimam.png)
image from Nvidia's [AI Improves Breast Cancer Diagnoses by Factoring Out False Positives](https://blogs.nvidia.com/blog/2018/02/01/making-mammography-more-meaningful/)

[A description of the Cancer Database](#Breast-Cancer-Database)

In this dataset we are trying to predict the diagnosis---either M (malignant) or B (benign).

Let's load the dataset and split it into training and testing sets:

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

colNames = ['id', 'diagnosis', 'radiusAvg', 'textureAvg', 'perimeterAvg', 'areaAvg',
            'smoothnessAvg', 'compactnessAvg', 'concavityAvg', 'concavityPointsAvg',
            'symmetryAvg', 'FractalDimensionAvg', 'radiusSE', 'textureSE', 'perimeterSE',
            'areaSE','smoothnessSE', 'compactnessSE', 'concavitySE', 'concavityPointsSE',
            'symmetrySE', 'FractalDimensionSE', 'radiusWorst', 'textureWorst', 'perimeterWorst',
            'areaWorst', 'smoothnessWorst', 'compactnessWorst', 'concavityWorst', 'concavityPointsWorst',
            'symmetryWorst', 'FractalDimensionWorst']
len(colNames)

data = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/wdbc.data', names=colNames)
data.set_index('id', inplace=True)

trainingdata, testdata = train_test_split(data, test_size = 0.2)
testdata

id,diagnosis,radiusAvg,textureAvg,perimeterAvg,areaAvg,smoothnessAvg,compactnessAvg,concavityAvg,concavityPointsAvg,symmetryAvg,...,radiusWorst,textureWorst,perimeterWorst,areaWorst,smoothnessWorst,compactnessWorst,concavityWorst,concavityPointsWorst,symmetryWorst,FractalDimensionWorst
897630,M,18.770,21.43,122.90,1092.0,0.09116,0.14020,0.10600,0.06090,0.1953,...,24.540,34.37,161.10,1873.0,0.14980,0.48270,0.46340,0.20480,0.3679,0.09870
9110720,B,11.990,24.89,77.61,441.3,0.10300,0.09218,0.05441,0.04274,0.1820,...,12.980,30.36,84.48,513.9,0.13110,0.18220,0.16090,0.12020,0.2599,0.08251
9112712,B,9.755,28.20,61.68,290.9,0.07984,0.04626,0.01541,0.01043,0.1621,...,10.670,36.92,68.03,349.9,0.11100,0.11090,0.07190,0.04866,0.2321,0.07211
871001502,B,8.219,20.70,53.27,203.9,0.09405,0.13050,0.13210,0.02168,0.2222,...,9.092,29.72,58.08,249.8,0.16300,0.43100,0.53810,0.07879,0.3322,0.14860
9113846,B,12.270,29.97,77.42,465.4,0.07699,0.03398,0.00000,0.00000,0.1701,...,13.450,38.05,85.08,558.9,0.09422,0.05213,0.00000,0.00000,0.2409,0.06743
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
895633,M,16.260,21.88,107.50,826.8,0.11650,0.12830,0.17990,0.07981,0.1869,...,17.730,25.21,113.70,975.2,0.14260,0.21160,0.33440,0.10470,0.2736,0.07953
906616,B,11.610,16.02,75.46,408.2,0.10880,0.11680,0.07097,0.04497,0.1886,...,12.640,19.67,81.93,475.7,0.14150,0.21700,0.23020,0.11050,0.2787,0.07427
899147,B,11.950,14.96,77.23,426.7,0.11580,0.12060,0.01171,0.01787,0.2459,...,12.810,17.72,83.09,496.2,0.12930,0.18850,0.03122,0.04766,0.3124,0.07590
883539,B,12.420,15.04,78.61,476.5,0.07926,0.03393,0.01053,0.01108,0.1546,...,13.200,20.37,83.85,543.4,0.10370,0.07776,0.06243,0.04052,0.2901,0.06783


Now divide up the data into the features and labels

In [5]:
colNames.remove('id')
colNames.remove('diagnosis')
trainingDataFeatures = trainingdata[colNames]
testDataFeatures = testdata[colNames]
trainingDataLabels = trainingdata['diagnosis']
testDataLabels = testdata['diagnosis']


Let's get a base accuracy using a single decision tree classifier:

In [6]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(trainingDataFeatures, trainingDataLabels)
predictions = clf.predict(testDataFeatures)

accuracy_score(testDataLabels, predictions)

0.9385964912280702

Now we will see if we can improve on that accuracy.

### Building a bagging classifier

Let's build a collective of 20 decision tree classifiers (`n_estimators`). Let's train each one with 100 random samples from our dataset (`max_samples`) with replacement (`bootstrap=True`). `n_jobs` means how many jobs to run in parallel. `n_jobs=-1` means use all available CPU cores.   

Just to reinforce the vocabulary we are learning, `n_estimators`, `max_samples`, `bootstrap` are among the **hyperparameters** of the bagging classifier.

In [7]:
from sklearn.ensemble import BaggingClassifier
clf = tree.DecisionTreeClassifier(criterion='entropy')

bagging_clf = BaggingClassifier(clf, n_estimators=20, max_samples=100, 
                                bootstrap=True, n_jobs=-1)
bagging_clf.fit(trainingDataFeatures, trainingDataLabels)
predictions = bagging_clf.predict(testDataFeatures)
accuracy_score(testDataLabels, predictions)

0.9385964912280702

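When I did this, the single decision tree classifier was 90.3% accurate, while the bagging classifier was 96.5% accurate, cutting the error rate from 9.7% to 3.5%. That's pretty good! Your numbers will vary because the train/test split and the bootstrap samples are random; in the run shown above, both classifiers scored 93.9%.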
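Because a single random split can flatter or penalize either model, one way to get a steadier comparison is to average over several splits. The sketch below does that with 10-fold cross-validation. It assumes that scikit-learn's bundled `load_breast_cancer` data is an acceptable stand-in for the CSV file loaded above, and the `random_state` values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the wdbc.data file used above.
X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(criterion='entropy'),
                                 n_estimators=20, max_samples=100,
                                 bootstrap=True, random_state=0)

# Each call fits and scores the model on 10 different train/test partitions.
tree_scores = cross_val_score(single_tree, X, y, cv=10)
bag_scores = cross_val_score(bagged_trees, X, y, cv=10)

print("single tree: %.3f +/- %.3f" % (tree_scores.mean(), tree_scores.std()))
print("bagging:     %.3f +/- %.3f" % (bag_scores.mean(), bag_scores.std()))
```

The spread (the value after `+/-`) gives a feel for how much a single split can move the accuracy in either direction.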


### Pasting
Let's try the same thing with pasting (without replacement):

For that we set the hyperparameter: `bootstrap=False`


In [8]:
pasting_clf = BaggingClassifier(clf, n_estimators=20, max_samples=100, 
                                bootstrap=False, n_jobs=-1)
pasting_clf.fit(trainingDataFeatures, trainingDataLabels)
predictions = pasting_clf.predict(testDataFeatures)
accuracy_score(testDataLabels, predictions)

0.956140350877193

### Random Subspaces
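Again, random subspaces are when we randomly select feature subsets rather than subsets of the dataset instances. This time we will create 50 classifiers for our collective and each will train on a dataset with 7 features (`max_features=7`). Note that the cell below leaves `bootstrap=True`, so each classifier also trains on a bootstrap resample of the rows; setting `bootstrap=False` would keep every training instance, which is the pure Random Subspaces setup.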

In [9]:
subspace_clf = BaggingClassifier(clf, n_estimators=50, max_features=7, 
                                bootstrap=True, n_jobs=-1)
subspace_clf.fit(trainingDataFeatures, trainingDataLabels)
predictions = subspace_clf.predict(testDataFeatures)
accuracy_score(testDataLabels, predictions)

0.9473684210526315

### Random Patches
Finally, let's combine things and try random patches. In this example each classifier will be given a subset of 100 training instances with 7 features each:

In [10]:
# named patches_clf so it is not confused with the random subspaces classifier above
patches_clf = BaggingClassifier(clf, n_estimators=100, max_features=7, 
                                max_samples=100, bootstrap=False, n_jobs=-1)
patches_clf.fit(trainingDataFeatures, trainingDataLabels)
predictions = patches_clf.predict(testDataFeatures)
accuracy_score(testDataLabels, predictions)

0.9473684210526315

While it is common to use a decision tree as the base classifier, we can use any classifier. Here we use kNN:

In [11]:
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier()
bagging_clf = BaggingClassifier(kNN, n_estimators=20, max_samples=100, 
                                bootstrap=True, n_jobs=-1)
bagging_clf.fit(trainingDataFeatures, trainingDataLabels)
predictions = bagging_clf.predict(testDataFeatures)
accuracy_score(testDataLabels, predictions)

0.9122807017543859

#### Summary
As you can see, these simple bagging algorithms typically match or outperform a single classifier, although the margin varies from one random train/test split to the next.


### Review

We import the bagging classifier class with:

```
from sklearn.ensemble import BaggingClassifier
```

and create an instance of one with:

```
my_bagging_classifier = BaggingClassifier(baseClassifier, Hyperparameters,n_jobs=-1)
```

#### Base Classifier
While any classifier can be used, we typically use a decision tree.

#### Hyperparameters
Here is a list of the hyperparameters (from the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)):

* `n_estimators`: integer, default value = 10, the number of classifiers (estimators) in the ensemble.
* `max_samples`: integer or float, default value = 1.0 (meaning use all the training instances), the number of samples (instances) to draw from the training dataset to train each base classifier.
    * if integer, then draw `max_samples` samples.
    * if float, then draw `max_samples * X.shape[0]` samples. For example, if `max_samples` is 0.7 and there are 100 instances in the training dataset, then draw 70 samples.
* `max_features`: integer or float, default value = 1.0, the number of features to draw from the training dataset to train each base estimator.
    * if integer, then draw `max_features` features.
    * if float, then draw `max_features * X.shape[1]` features.
* `bootstrap`: boolean, default value = True, whether samples (instances) are drawn with replacement. If False, sampling without replacement is performed. Whether features are drawn with replacement is controlled separately by `bootstrap_features` (default value = False).

For other hyperparameters, consult the documentation.
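The integer-versus-float rule for `max_samples` and `max_features` can be sketched in a few lines. The helper `resolve_count` below is hypothetical and is not part of scikit-learn; it truncates with `int()`, and the library's own rounding may differ in edge cases.

```python
def resolve_count(value, total):
    """Turn an int-or-float hyperparameter into an absolute count.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    if isinstance(value, int):
        return value           # an integer is already an absolute count
    return int(value * total)  # a float is a fraction of the total, truncated

print(resolve_count(100, 1000))  # integer: used as given
print(resolve_count(0.7, 7992))  # 70% of 7,992 rows
print(resolve_count(0.7, 58))    # 70% of 58 features
```

Passing a float is often more convenient than an integer, because the same setting keeps working when the size of the training set changes.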


![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/torchdivide.png)


# <font color='#EE4C2C'>You Try ...</font> 
## <font color='#EE4C2C'>Predicting musical genres from audio file attributes</font> 


When you listen to even a few seconds of a song you can identify it as blues, country, classical, or any other genre. How do you do this? What attributes are you hearing in the audio file that help you make this classification? And, more to the point, can we train a computer to do it?

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/bluesClassical.png)

We are going to be using the [GTZAN Dataset for Music Genre Classification](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification). It provides data for 100 songs in each of 10 genres. The data is in several formats:

* 30 second audio files (wav)
* spectral images of those 30 second clips (see image above)
* a csv file containing acoustic attributes of the 30 second clip
* a csv file containing acoustic attributes of 3 second clips (the 30 second clips were split into 3 second ones)

We are going to use the 3 second csv file which is available at 

https://raw.githubusercontent.com/zacharski/ml-class/master/data/gtzan.csv

Go ahead and load the data into a dataframe (the first row contains feature names)


In [12]:
music = pd.read_csv("https://raw.githubusercontent.com/zacharski/ml-class/master/data/gtzan.csv")
music

Unnamed: 0,filename,length,chroma_stft_mean,chroma_stft_var,rms_mean,rms_var,spectral_centroid_mean,spectral_centroid_var,spectral_bandwidth_mean,spectral_bandwidth_var,...,mfcc16_var,mfcc17_mean,mfcc17_var,mfcc18_mean,mfcc18_var,mfcc19_mean,mfcc19_var,mfcc20_mean,mfcc20_var,label
0,blues.00000.0.wav,66149,0.335406,0.091048,0.130405,0.003521,1773.065032,167541.630869,1972.744388,117335.771563,...,39.687145,-3.241280,36.488243,0.722209,38.099152,-5.050335,33.618073,-0.243027,43.771767,blues
1,blues.00000.1.wav,66149,0.343065,0.086147,0.112699,0.001450,1816.693777,90525.690866,2010.051501,65671.875673,...,64.748276,-6.055294,40.677654,0.159015,51.264091,-2.837699,97.030830,5.784063,59.943081,blues
2,blues.00000.2.wav,66149,0.346815,0.092243,0.132003,0.004620,1788.539719,111407.437613,2084.565132,75124.921716,...,67.336563,-1.768610,28.348579,2.378768,45.717648,-1.938424,53.050835,2.517375,33.105122,blues
3,blues.00000.3.wav,66149,0.363639,0.086856,0.132565,0.002448,1655.289045,111952.284517,1960.039988,82913.639269,...,47.739452,-3.841155,28.337118,1.218588,34.770935,-3.580352,50.836224,3.630866,32.023678,blues
4,blues.00000.4.wav,66149,0.335579,0.088129,0.143289,0.001701,1630.656199,79667.267654,1948.503884,60204.020268,...,30.336359,0.664582,45.880913,1.689446,51.363583,-3.392489,26.738789,0.536961,29.146694,blues
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9985,rock.00099.5.wav,66149,0.349126,0.080515,0.050019,0.000097,1499.083005,164266.886443,1718.707215,85931.574523,...,42.485981,-9.094270,38.326839,-4.246976,31.049839,-5.625813,48.804092,1.818823,38.966969,rock
9986,rock.00099.6.wav,66149,0.372564,0.082626,0.057897,0.000088,1847.965128,281054.935973,1906.468492,99727.037054,...,32.415203,-12.375726,66.418587,-3.081278,54.414265,-11.960546,63.452255,0.428857,18.697033,rock
9987,rock.00099.7.wav,66149,0.347481,0.089019,0.052403,0.000701,1346.157659,662956.246325,1561.859087,138762.841945,...,78.228149,-2.524483,21.778994,4.809936,25.980829,1.775686,48.582378,-0.299545,41.586990,rock
9988,rock.00099.8.wav,66149,0.387527,0.084815,0.066430,0.000320,2084.515327,203891.039161,2018.366254,22860.992562,...,28.323744,-5.363541,17.209942,6.462601,21.442928,2.354765,24.843613,0.675824,12.787750,rock


Let's examine the values of the label column (the genres):

In [13]:
music.label.unique()

array(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz',
       'metal', 'pop', 'reggae', 'rock'], dtype=object)

Those are the 10 genres we are trying to predict. So if we were just to guess without hearing the clip, we would be accurate 10% of the time. How accurate do you think you would be based on hearing a 3 second clip? I am pretty confident I could correctly label the 30 second clips, but I am much less confident about labeling 3 second ones. Since guessing randomly would give me 10% accuracy, I am estimating maybe 50-60% accuracy. Let's see how a computer does.
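The 10% figure for random guessing can be checked with a quick simulation. The sketch assumes the genres are equally common and that each guess is picked uniformly at random; the number of simulated clips is an arbitrary choice.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
genres = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

n_clips = 100_000  # hypothetical number of simulated clips
true_labels = [rng.choice(genres) for _ in range(n_clips)]
guesses = [rng.choice(genres) for _ in range(n_clips)]

# Fraction of clips where the blind guess happens to match the true genre.
accuracy = sum(t == g for t, g in zip(true_labels, guesses)) / n_clips
print("random-guess accuracy: %.3f" % accuracy)
```

Any classifier worth keeping should land well above this baseline.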

#### Feature Names
So the column we are trying to predict is `label`. Now let's get the names of the feature columns:

In [14]:
featureNames = list(music.columns)
featureNames.remove('filename')
featureNames.remove('label')
print(featureNames)
print(len(featureNames))

['length', 'chroma_stft_mean', 'chroma_stft_var', 'rms_mean', 'rms_var', 'spectral_centroid_mean', 'spectral_centroid_var', 'spectral_bandwidth_mean', 'spectral_bandwidth_var', 'rolloff_mean', 'rolloff_var', 'zero_crossing_rate_mean', 'zero_crossing_rate_var', 'harmony_mean', 'harmony_var', 'perceptr_mean', 'perceptr_var', 'tempo', 'mfcc1_mean', 'mfcc1_var', 'mfcc2_mean', 'mfcc2_var', 'mfcc3_mean', 'mfcc3_var', 'mfcc4_mean', 'mfcc4_var', 'mfcc5_mean', 'mfcc5_var', 'mfcc6_mean', 'mfcc6_var', 'mfcc7_mean', 'mfcc7_var', 'mfcc8_mean', 'mfcc8_var', 'mfcc9_mean', 'mfcc9_var', 'mfcc10_mean', 'mfcc10_var', 'mfcc11_mean', 'mfcc11_var', 'mfcc12_mean', 'mfcc12_var', 'mfcc13_mean', 'mfcc13_var', 'mfcc14_mean', 'mfcc14_var', 'mfcc15_mean', 'mfcc15_var', 'mfcc16_mean', 'mfcc16_var', 'mfcc17_mean', 'mfcc17_var', 'mfcc18_mean', 'mfcc18_var', 'mfcc19_mean', 'mfcc19_var', 'mfcc20_mean', 'mfcc20_var']
58


So we have 58 features. 

#### Training and test sets
Now it is time to construct the training and test sets:

In [16]:
## divide the original data 80% going into the music_training dataset 
## the rest in music_test
music_training, music_test = train_test_split(music, train_size = 0.8, shuffle=True, random_state=1)
                              
## now create the DataFrames for just the features (excluding the label column
## and the filename column)
music_training_features = music_training[featureNames]
music_test_features = music_test[featureNames]
                              
## now create the labels data structure for both the training and test sets                              
music_training_labels = music_training['label']
music_test_labels = music_test['label']
music_test_labels

1571    classical
9567         rock
2564      country
2012      country
6197        metal
          ...    
1123    classical
4301       hiphop
9385         rock
7182          pop
1995    classical
Name: label, Length: 1998, dtype: object

### Building a single decision tree classifier
Let's build a single decision tree classifier called `music_clf` using entropy, fit it to the data, make predictions and determine the accuracy:

In [18]:
## Create music_clf, an instance of the Decision Tree Classifier
music_clf = tree.DecisionTreeClassifier(criterion='entropy')

## Fit it to the data
music_clf.fit(music_training_features, music_training_labels)

## get the predictions for the test set
pred = music_clf.predict(music_test_features)

## get the accuracy score
acc_s = accuracy_score(music_test_labels, pred)
acc_s

0.6746746746746747

When I did this I got 66% accuracy. That doesn't sound great but keep in mind that random guessing would only be 10% accuracy. 

### Building a Random Patch Classifier

Now we are going to build a random patch classifier.

* the base classifier will be a decision tree using entropy
* the ensemble will contain 20 base classifiers
* each classifier will use a random sample of 70% of the training data
* each classifier will use a random sample of 70% of the features
* the sampling will be done with replacement
* it will use all available cpu cores.

We are going to

1. build the classifier
2. train the classifier on the data
3. make predictions on the test set
4. determine the accuracy
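Read literally, the bullet list above maps onto `BaggingClassifier` hyperparameters as in the sketch below. To keep it self-contained it trains on synthetic data from `make_classification` rather than the music data, so the data, the variable names and the `random_state` values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 500 rows, 58 features, 10 classes.
X, y = make_classification(n_samples=500, n_features=58, n_informative=20,
                           n_classes=10, random_state=0)

spec_clf = BaggingClassifier(DecisionTreeClassifier(criterion='entropy'),
                             n_estimators=20,   # 20 base classifiers
                             max_samples=0.7,   # 70% of the training rows
                             max_features=0.7,  # 70% of the features
                             bootstrap=True,    # rows drawn with replacement
                             n_jobs=-1,         # use all available CPU cores
                             random_state=0)
spec_clf.fit(X, y)

print(len(spec_clf.estimators_), "trees,",
      len(spec_clf.estimators_features_[0]), "features per tree")
```

Using floats for `max_samples` and `max_features` avoids computing the counts by hand; integer counts are equally valid.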

In [19]:
#Base classifier - Decision Tree using entropy
base_clf = tree.DecisionTreeClassifier(criterion='entropy')

In [32]:
#random patch classifier

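#70% of the training data
samples = int(0.7 * music_training.shape[0])
#70% of the number of features
features = int(0.7 * len(featureNames))
print('{} {}'.format(samples, features))

# NOTE: the exercise asks for sampling with replacement, which would be bootstrap=True;
# the results recorded in this notebook were produced with bootstrap=False, as written here.
rpatch_clf = BaggingClassifier(base_clf, n_estimators = 20, max_features=features, max_samples=samples, bootstrap=False, n_jobs=-1)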

5594 40


In [22]:
#fit the data
rpatch_clf.fit(music_training_features, music_training_labels)

In [25]:
#predict
music_pred1 = rpatch_clf.predict(music_test_features)

In [26]:
accuracy1 = accuracy_score(music_test_labels, music_pred1)
accuracy1

0.8588588588588588

What accuracy did you get? Was it better than using a single classifier?
Keep your original code above. Make a copy of it below and
experiment a bit with the hyperparameters (try 3 or 4 different things). What is the best accuracy you can get?

Answer: The accuracy of the random patch classifier (about 0.859) was better than that of the single decision tree classifier (about 0.675).

For the following four classifiers, see notes below showing accuracy of each of them

In [28]:
# -1- Let's try the same random patch, but with a higher number of estimators. 
rpatch_clf2 = BaggingClassifier(base_clf, n_estimators = 100, max_features=features, max_samples=samples, bootstrap=False, n_jobs=-1)
#fit the data
rpatch_clf2.fit(music_training_features, music_training_labels)
#predict
music_pred2 = rpatch_clf2.predict(music_test_features)
#accuracy
accuracy2 = accuracy_score(music_test_labels, music_pred2)
accuracy2

0.8943943943943944

In [29]:
# -2- Let's try now a bagging classifier with 100 estimators
bagging_clf3 = BaggingClassifier(base_clf, n_estimators=100, max_samples=samples, bootstrap=True, n_jobs=-1)
#fit the data
bagging_clf3.fit(music_training_features, music_training_labels)
#predict
music_pred3 = bagging_clf3.predict(music_test_features)
#accuracy
accuracy3 = accuracy_score(music_test_labels, music_pred3)
accuracy3

0.8593593593593594

In [30]:
# -3- Let's try a pasting classifier with the same parameters as the bagging classifier above
pasting_clf4 = BaggingClassifier(base_clf, n_estimators=100, max_samples=samples, bootstrap=False, n_jobs=-1)
#fit the data
pasting_clf4.fit(music_training_features, music_training_labels)
#predict
music_pred4 = pasting_clf4.predict(music_test_features)
#accuracy
accuracy4 = accuracy_score(music_test_labels, music_pred4)
accuracy4

0.8673673673673674

In [35]:
# -4- Since the random patch with 100 estimators had the best accuracy, let's try a variation
# to see if we can get something above 90%
features2 = int(0.85 * len(featureNames))
samples2 = int(0.90 * music_training.shape[0])
rpatch_clf5 = BaggingClassifier(base_clf, n_estimators = 120, max_features=features2, max_samples=samples2, bootstrap=False, n_jobs=-1)
#fit the data
rpatch_clf5.fit(music_training_features, music_training_labels)
#predict
music_pred5 = rpatch_clf5.predict(music_test_features)
#accuracy
accuracy5 = accuracy_score(music_test_labels, music_pred5)
accuracy5


0.8828828828828829

* **Classifier:** `rpatch_clf = BaggingClassifier(base_clf, n_estimators = 20, max_features=features, max_samples=samples, bootstrap=False, n_jobs=-1)`; **Accuracy** = 0.8588588588588588
* **Classifier:** `rpatch_clf2 = BaggingClassifier(base_clf, n_estimators = 100, max_features=features, max_samples=samples, bootstrap=False, n_jobs=-1)`; **Accuracy** = 0.8943943943943944
* **Classifier:** `bagging_clf3 = BaggingClassifier(base_clf, n_estimators=100, max_samples=samples, bootstrap=True, n_jobs=-1)`; **Accuracy** = 0.8593593593593594
* **Classifier:** `pasting_clf4 = BaggingClassifier(base_clf, n_estimators=100, max_samples=samples, bootstrap=False, n_jobs=-1)`; **Accuracy** = 0.8673673673673674
* **Classifier:** `rpatch_clf5 = BaggingClassifier(base_clf, n_estimators = 120, max_features=features2, max_samples=samples2, bootstrap=False, n_jobs=-1)`; **Accuracy** = 0.8828828828828829
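To compare these runs at a glance, the short sketch below ranks them and reports how much of the single decision tree's error each ensemble removed. The accuracy values are copied from the outputs recorded in this notebook; a fresh run would give somewhat different numbers.

```python
# Accuracies copied from the recorded outputs in this notebook.
single_tree_accuracy = 0.6746746746746747
ensemble_accuracy = {
    'rpatch_clf': 0.8588588588588588,
    'rpatch_clf2': 0.8943943943943944,
    'bagging_clf3': 0.8593593593593594,
    'pasting_clf4': 0.8673673673673674,
    'rpatch_clf5': 0.8828828828828829,
}

# Rank from most to least accurate and show the relative error reduction.
for name, acc in sorted(ensemble_accuracy.items(),
                        key=lambda item: item[1], reverse=True):
    reduction = 1 - (1 - acc) / (1 - single_tree_accuracy)
    print("%-13s accuracy %.3f, error cut by %.0f%%" % (name, acc, 100 * reduction))

best = max(ensemble_accuracy, key=ensemble_accuracy.get)
print("best:", best)
```

Since every figure comes from a single train/test split, small gaps between the ensembles should not be over-interpreted.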


## <font color='#52BE80'>NOTES</font>

### Breast Cancer Database

[back](#Wisconsin-Cancer-Dataset)

  This breast cancer database was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg.  If you publish results
   when using this database, then please include this information in your
   acknowledgements.  Also, please cite one or more of:

   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear 
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

   2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of 
      pattern separation for medical diagnosis applied to breast cytology", 
      Proceedings of the National Academy of Sciences, U.S.A., Volume 87, 
      December 1990, pp 9193-9196.

   3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition 
      via linear programming: Theory and application to medical diagnosis", 
      in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
      Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

   4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming 
      discrimination of two linearly inseparable sets", Optimization Methods
      and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

Title: Wisconsin Breast Cancer Database (January 8, 1991)


Sources:
   -- Dr. William H. Wolberg (physician)
      University of Wisconsin Hospitals
      Madison, Wisconsin
      USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
      Received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992