<a href="https://colab.research.google.com/github/drozzel/Portfolio/blob/main/BachProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bach Project
## Notebook Summary
### 1. Import Necessary Libraries
### 2. Read In the Data Files
### 3. Clean and Organize the Data
### 4. Build and Evaluate Basic Model
### 5. Find the Ideal Parameters
### 6. Build and Evaluate Ideal Model



## Import Necessary Libraries

In [None]:

#Import the pandas library
#This library will be used to read in and organize the data.
import pandas as pd

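#Import the train_test_split function.
#This function will be used to create our training and test datasets.
from sklearn.model_selection import train_test_split
#Import the tree module, which provides the decision tree classifier.
from sklearn import tree
#Import accuracy_score to evaluate the predictions.
from sklearn.metrics import accuracy_score
#Import BaggingClassifier to build an ensemble of decision trees.
from sklearn.ensemble import BaggingClassifier
#Import GridSearchCV to search for the best hyperparameters.
from sklearn.model_selection import GridSearchCV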

## Read in the Data Files
### A brief description of the data being utilized.
In this notebook we will build a bagging ensemble of decision trees (scikit-learn's BaggingClassifier) to analyze the Bach Chorales dataset.
A chord is determined by the combination of notes being played. The possible notes are C, C#, D, D#, E, F, F#, G, G#, A, A#, and B. Each row also records the bass note and the meter of the event. The dataset contains chords drawn from Johann Sebastian Bach's chorales.
The model will try to identify which chord is being played by analyzing the notes being played, the bass note, and the meter.


In [None]:
#Use the wget command to download the data into the notebook's directory
!wget https://raw.githubusercontent.com/zacharski/ml-class/master/data/bach.zip
#Use the unzip command to unzip the file.
!unzip bach.zip
#Read in the csv file.
bach = pd.read_csv('bach.csv')
#View a sample of the dataset
print(bach.head())


--2021-10-16 18:32:38--  https://raw.githubusercontent.com/zacharski/ml-class/master/data/bach.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41761 (41K) [application/zip]
Saving to: ‘bach.zip’


2021-10-16 18:32:39 (13.3 MB/s) - ‘bach.zip’ saved [41761/41761]

Archive:  bach.zip
  inflating: bach.csv                
  choral_ID  event_number    C  C#   D  D#  ...    A  A#   B bass meter chord_label
0  000106b_             1  YES  NO  NO  NO  ...  YES  NO  NO    F     3         F_M
1  000106b_             2  YES  NO  NO  NO  ...   NO  NO  NO    E     5         C_M
2  000106b_             3  YES  NO  NO  NO  ...   NO  NO  NO    E     2         C_M
3  000106b_             4  YES  NO  NO  NO  ...  YES  NO  NO    F     3         F_M
4  000106b_             

## Organize the Data
### First we want to separate the labels and the features.
We will be using the notes, bass, and meter for the features and the chord_label for the labels.


In [None]:
#Print all the columns 
print(bach.columns)
#Determine which columns to drop to create the features.
bach_features = bach.drop(columns=['choral_ID','event_number','chord_label'])
#Create the label dataframe which consists of just the chord_label column.
bach_labels = bach['chord_label']


Index(['choral_ID', 'event_number', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G',
       'G#', 'A', 'A#', 'B', 'bass', 'meter', 'chord_label'],
      dtype='object')


Looking at the note columns, we can see that they are not represented numerically, so they must be changed to hold a 0 if the note is not being played and a 1 if it is. We also want a numerical representation of the bass column so the model can analyze it. This can be done by one-hot encoding the bass column.

In [None]:
#Replace YES, NO with 1,0 for the notes
bach_features.replace(('YES','NO'),(1,0),inplace=True)
#One hot encode the bass column
bach_features = pd.get_dummies(bach_features)
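To make the effect of the encoding concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data and the `note_columns` list are assumptions for illustration only, not rows from bach.csv, and the YES/NO conversion is written as a comparison rather than the `replace` call used in the cell.

```python
import pandas as pd

# Hypothetical miniature of the Bach features: two note columns, bass, meter.
toy = pd.DataFrame({
    "C": ["YES", "NO", "YES"],
    "E": ["NO", "YES", "YES"],
    "bass": ["F", "E", "E"],
    "meter": [3, 5, 2],
})

note_columns = ["C", "E"]
encoded = toy.copy()

# Comparing against 'YES' gives booleans; casting to int yields 1 and 0.
encoded[note_columns] = (encoded[note_columns] == "YES").astype(int)

# get_dummies expands the text column into one indicator column per value.
encoded = pd.get_dummies(encoded, columns=["bass"], dtype=int)

print(encoded)
```

Each distinct bass value becomes its own 0/1 column (here `bass_E` and `bass_F`), while the numeric meter column passes through unchanged.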



In [None]:
#Split the data into train and test data (60% train, 40% test).
bach_train_features,bach_test_features,bach_train_labels,bach_test_labels=train_test_split(bach_features,bach_labels,test_size = .4)
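The split is random, so the accuracy figures in this notebook will differ slightly from run to run. A minimal sketch of one way to make a split repeatable, assuming a fixed `random_state` (the value 42 and the toy `features` and `labels` are arbitrary choices for illustration, not part of the original notebook):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten one-feature rows with alternating chord labels.
features = [[i] for i in range(10)]
labels = ["C_M", "F_M"] * 5

# The same random_state produces the same shuffle, and so the same split.
split_a = train_test_split(features, labels, test_size=0.4, random_state=42)
split_b = train_test_split(features, labels, test_size=0.4, random_state=42)

train_features, test_features, train_labels, test_labels = split_a
print(len(train_features), len(test_features))
print(split_a == split_b)
```

With `test_size=0.4`, ten rows are divided into six training rows and four test rows.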

## Build and Evaluate the Basic Model
Next we will build a basic bagging model without searching for the best parameters. Each base classifier in the ensemble will be a decision tree.


In [None]:
#Create and fit the model.
clf = tree.DecisionTreeClassifier(criterion='entropy')
bagging_clf = BaggingClassifier(clf, n_estimators=20, max_samples=100, 
                                bootstrap=True, n_jobs=-1)
bagging_clf.fit(bach_train_features, bach_train_labels)
predictions = bagging_clf.predict(bach_test_features)
accuracy_score(bach_test_labels,predictions)

0.6672550750220653

My accuracy for this base model was 66.7%, not bad! Let's see if we can do better though!


## Finding the Ideal Parameters
For our next step we want to figure out the ideal parameters for our predictive model. The parameters we'll be adjusting are n_estimators, bootstrap, and max_samples.

The n_estimators parameter is the number of trees used within the ensemble.

The bootstrap parameter is a boolean that controls whether samples are drawn with replacement. True means a training row is put back after it is drawn, so it can be drawn again for the same tree.

The max_samples parameter is the number of samples drawn from the training data to train each tree.
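The difference between the two bootstrap settings can be sketched with the standard library alone. This is only an illustration of the sampling idea, not scikit-learn's internal code; `training_rows`, the seed, and the small `max_samples` value are made up for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

training_rows = list(range(10))  # stand-in for row indices of the training set
max_samples = 6                  # rows drawn for each tree

# bootstrap=True: draw with replacement, so a row may appear more than once.
with_replacement = rng.choices(training_rows, k=max_samples)

# bootstrap=False: draw without replacement, so every drawn row is distinct.
without_replacement = rng.sample(training_rows, k=max_samples)

print(with_replacement)
print(without_replacement)
```

In a bagging ensemble this draw is repeated once per tree, n_estimators times in total, so each tree sees a different subset of the training data.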

In [None]:
#First create the param_grid that will contain the various values we would like to test
hyperparam_grid = [
    {'bootstrap': [True,False],'n_estimators':[80,100,120],'max_samples': [1200,1400,1600,1800]}
  ]


Next we want to utilize GridSearchCV to find the ideal parameters.

In [None]:
grid_search = GridSearchCV(bagging_clf, hyperparam_grid, cv=10)
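It helps to know how much work the search will do before fitting it. The sketch below expands the same grid values with `itertools.product` and counts the model fits implied by 10-fold cross-validation. The variable names are illustrative, and the grid is written as a plain dict rather than the list-of-dicts form passed to GridSearchCV.

```python
from itertools import product

# Same values as the hyperparameter grid in this notebook.
grid = {
    "bootstrap": [True, False],
    "n_estimators": [80, 100, 120],
    "max_samples": [1200, 1400, 1600, 1800],
}
n_folds = 10

# Every combination of one value per parameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

n_combinations = len(combinations)
n_fits = n_combinations * n_folds

print(n_combinations)  # 2 * 3 * 4 = 24
print(n_fits)          # 24 * 10 = 240
```

By default GridSearchCV then refits the best combination once more on the whole training set, which is the model exposed as `best_estimator_`.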

Next we want to fit this model to find the best parameters.

In [None]:
grid_search.fit(bach_train_features, bach_train_labels)



GridSearchCV(cv=10, error_score=nan,
             estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                               class_weight=None,
                                                                               criterion='entropy',
                                                                               max_depth=None,
                                                                               max_features=None,
                                                                               max_leaf_nodes=None,
                                                                               min_impurity_decrease=0.0,
                                                                               min_impurity_split=None,
                                                                               min_samples_leaf=1,
                                                                     

In [None]:
#Display the best parameters
grid_search.best_params_

{'bootstrap': True, 'max_samples': 1400, 'n_estimators': 120}

Here we can see that the best parameters are bootstrap: True, meaning samples are drawn with replacement; max_samples: 1400, meaning each tree is trained on 1400 samples; and n_estimators: 120, meaning the ensemble contains 120 trees.

In [None]:
#Have the new model make the predictions
idealPredictions = grid_search.best_estimator_.predict(bach_test_features)

In [None]:
#Test the accuracy of the final model
accuracy_score(bach_test_labels, idealPredictions)

0.7427184466019418

We end up with a final accuracy of 74.3%, an increase of about 7.5 percentage points in our model's accuracy just from a simple scan of some parameter combinations. There are of course many more hyperparameters we could tune to try to further increase the accuracy of our model!
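The improvement can be computed directly from the two unrounded scores printed above, which avoids the small error introduced by subtracting already-rounded percentages. A short sketch (the variable names are illustrative):

```python
# Accuracy values copied from the notebook outputs above.
base_accuracy = 0.6672550750220653
tuned_accuracy = 0.7427184466019418

# Absolute gain, expressed in percentage points.
absolute_gain = tuned_accuracy - base_accuracy

# Relative gain, expressed as a fraction of the base accuracy.
relative_gain = absolute_gain / base_accuracy

print(f"{absolute_gain * 100:.2f} percentage points")  # 7.55 percentage points
print(f"{relative_gain:.1%} relative improvement")     # 11.3% relative improvement
```

The two numbers answer different questions: the first is how many more test predictions out of 100 are correct, the second is how large that gain is compared with where the base model started.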
