In [35]:
# STEP 3 -> CALCULATING THE ACCURACY

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns=['genre']) # drop() returns a new DataFrame; the original dataset is not modified
y = music_data['genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # arguments: input set, output set, fraction held out for testing (here 20%)

model = DecisionTreeClassifier() # model = new instance of the class
model.fit(X_train, y_train)
predictions = model.predict(X_test) # predict a genre for every row of the test set

score = accuracy_score(y_test, predictions) # two arguments: the expected values (y_test) and the model's predictions
score

1.0
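
The score above can change from run to run because the split is random. A minimal sketch of one way to make it repeatable, assuming a small invented DataFrame in place of music.csv (the rows below are hypothetical) and using the random_state parameter of train_test_split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for music.csv: these values are invented for illustration.
music_data = pd.DataFrame({
    'age':    [20, 23, 25, 26, 29, 30, 31, 33, 34, 37],
    'gender': [1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    'genre':  ['HipHop', 'HipHop', 'HipHop', 'Dance', 'Dance',
               'Jazz', 'Jazz', 'Acoustic', 'Acoustic', 'Classical'],
})
X = music_data.drop(columns=['genre'])
y = music_data['genre']

# Fixing random_state makes the shuffle repeatable, so both calls pick the same rows.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))   # 8 2
print(X_test.equals(split_b[1]))   # True
```

The value 42 is arbitrary; any fixed integer gives a repeatable split.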

In [43]:
# STEP 4 -> PERSISTING MODELS

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib # module with functions for saving and loading models

# music_data = pd.read_csv('music.csv')
# X = music_data.drop(columns=['genre']) # drop() returns a new DataFrame; the original dataset is not modified
# y = music_data['genre']

# model = DecisionTreeClassifier() # model = new instance of the class
# model.fit(X,y)

# joblib.dump(model, 'music-recommender.joblib') # save the trained model
model = joblib.load('music-recommender.joblib') # load the trained model
predictions = model.predict([[21,1]]) # prediction for a 21-year-old with gender code 1
predictions




array(['HipHop'], dtype=object)

In [57]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree # this module has a function for exporting a decision tree in a graphical format

music_data = pd.read_csv('music.csv') # dataset
X = music_data.drop(columns=['genre']) # input set
y = music_data['genre'] # output set

model = DecisionTreeClassifier() # model
model.fit(X,y) # train it

tree.export_graphviz(model, out_file='music-recommender.dot', 
                      feature_names=['age', 'gender'], # age and gender
                      class_names=sorted(y.unique()), # unique list of genres
                      label='all', # every node gets labels that we can read
                      rounded=True, # draw nodes with rounded corners
                      filled=True) # fill each node with a color

In [None]:
'''
STEP 1 -> PREPARING THE DATA

Separate the dataset into two sets: input and output
input = the data attributes needed to make a prediction or calculate an average
output = the result or answer (average/prediction) for the given inputs

obs: By convention, a capital X represents the input dataset and a lowercase y represents the output dataset
'''
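
A minimal sketch of the input/output separation described in Step 1. It assumes a tiny invented DataFrame with the same column names as music.csv; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for music.csv.
music_data = pd.DataFrame({
    'age':    [20, 23, 26, 29],
    'gender': [1, 1, 0, 0],
    'genre':  ['HipHop', 'HipHop', 'Dance', 'Dance'],
})

X = music_data.drop(columns=['genre'])  # input set: every column except the answer
y = music_data['genre']                 # output set: the answer column only

print(list(X.columns))                  # ['age', 'gender']
print('genre' in music_data.columns)    # True
```

The second print shows that drop() left music_data untouched and returned a new DataFrame.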

In [None]:
'''
STEP 2 -> LEARNING AND PREDICTING

Import from the library scikit-learn:
* the package: sklearn;
* the module: tree;
* the class: DecisionTreeClassifier (implements the decision tree algorithm)

Create a model so that it can learn patterns in the data.
--model = DecisionTreeClassifier()

Train it
--model.fit(X,y)

Ask the model to make a prediction
Ex.: What kind of music does a 21-year-old male like?
--model.predict([[input],[...]])
'''
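
The create, train, and predict sequence from Step 2 can be sketched as follows. The data is invented for illustration, and the two rows passed to predict() are rows the tree has already seen, so the answers are known in advance.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv; the rows are invented for illustration.
X = pd.DataFrame({'age':    [20, 23, 25, 26, 29, 30],
                  'gender': [1, 1, 1, 0, 0, 0]})
y = pd.Series(['HipHop', 'HipHop', 'HipHop', 'Dance', 'Dance', 'Dance'], name='genre')

model = DecisionTreeClassifier()  # create the model
model.fit(X, y)                   # train it on the input and output sets

# predict() takes one row per sample, with the same columns used for training.
samples = pd.DataFrame({'age': [20, 29], 'gender': [1, 0]})
predictions = model.predict(samples)
print(predictions.tolist())       # ['HipHop', 'Dance']
```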


In [None]:
'''
STEP 3 -> CALCULATING THE ACCURACY

A general rule of thumb is to allocate 70 to 80 percent of our data for training and the other
20 to 30 percent for testing.

Split the dataset into two sets, one for training and one for testing.
--train_test_split() -> This function returns a list, so we can unpack it into four variables:
X(train/test) and y(train/test)

Now, to train the model, pass only the training dataset
--model.fit(X_train,y_train)

And to make predictions, pass the input testing dataset
--model.predict(X_test)

To calculate the accuracy, compare the output set for testing (y_test) with the predictions
--accuracy_score(y_test, predictions) -> returns a score between 0 and 1

obs: Every time we run the cell again, the dataset is split into different training and test sets,
because this function randomly picks data for training and testing

tip: Press ctrl + enter to run the current cell without adding a new cell

Key concepts in machine learning

Training a model on very little data worsens its accuracy.
The more data we give our model and the cleaner that data is, the better the result. Avoid:

* duplicate data
* irrelevant data
* incomplete data

Otherwise the model will learn bad patterns from the data, so it's really important to clean our data before training our model.
Also, have enough data. The more complex the problem, the more data we need.

'''
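
The cleaning advice above (duplicates, irrelevant data, incomplete data) can be sketched with pandas. The DataFrame and its user_id column are hypothetical; which columns count as irrelevant depends on the real dataset.

```python
import pandas as pd

# Hypothetical raw data: one duplicated sample, one irrelevant column, two incomplete rows.
raw = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age':     [20, 20, 25, None, 30],
    'gender':  [1, 1, 0, 0, 1],
    'genre':   ['HipHop', 'HipHop', 'Dance', 'Dance', None],
})

clean = (
    raw.drop(columns=['user_id'])  # irrelevant: an id carries no pattern to learn
       .drop_duplicates()          # duplicates: keep one copy of each identical row
       .dropna()                   # incomplete: drop rows with missing values
)

print(len(raw), len(clean))        # 5 2
```

Dropping rows is only one option; filling in missing values is another, depending on how much data is available.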

In [None]:
'''
STEP 4 -> PERSISTING MODELS

In a real application we might have a dataset with thousands or millions of samples. Training a model on that much data may take
seconds, minutes, or even hours. That's why model persistence is important.
Once in a while we build and train our model and then save it to a file.

The next time we want to make predictions, we simply load the model from the file and ask it to make predictions. That model
is already trained, so we don't need to retrain it. It's like an intelligent person.

First of all, import joblib.

After training the model, call joblib.dump to save it:
joblib.dump(model, 'name-of-file-model')

Later, call joblib.load to get the trained model back:
model = joblib.load('name-of-file-model')

tip: comment all selected lines with ctrl + /

'''
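
A minimal sketch of the save and load round trip from Step 4, assuming invented training data and a temporary directory so that no file is left behind. The file name demo-model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv; the rows are invented for illustration.
X = pd.DataFrame({'age':    [20, 23, 25, 26, 29, 30],
                  'gender': [1, 1, 1, 0, 0, 0]})
y = pd.Series(['HipHop', 'HipHop', 'HipHop', 'Dance', 'Dance', 'Dance'], name='genre')

model = DecisionTreeClassifier().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-model.joblib')
    joblib.dump(model, path)    # save the trained model to disk
    loaded = joblib.load(path)  # read it back without retraining

# The loaded model should answer exactly like the one that was saved.
same = (loaded.predict(X) == model.predict(X)).all()
print(same)                     # True
```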


In [None]:
'''
STEP 5 -> VISUALIZING A DECISION TREE

Export the model in a visual format to see how it makes predictions.

First, import the tree module:
-- from sklearn import tree

After training the model, call:
tree.export_graphviz(model,out_file='name.dot', 
                      feature_names=['columns-input', 'columns-input'],
                      class_names=sorted(y.unique()),
                      label='all',
                      rounded=True,
                      filled=True)
                      
-> arguments = the model, the name of the output file, the features (input columns) of the dataset, the classes (labels) in the
output dataset such as HipHop/Dance, label='all', rounded=True, filled=True.

.unique() -> returns the unique list of classes, without duplicates.
sorted() -> sorts the result alphabetically

obs: a .dot file is written in DOT, a graph description language

'''
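
To inspect the export without writing a file, export_graphviz can be called with out_file=None, in which case it returns the DOT source as a string. A short sketch, assuming invented data in place of music.csv:

```python
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for music.csv; the rows are invented for illustration.
X = pd.DataFrame({'age':    [20, 23, 25, 26, 29, 30],
                  'gender': [1, 1, 1, 0, 0, 0]})
y = pd.Series(['HipHop', 'HipHop', 'HipHop', 'Dance', 'Dance', 'Dance'], name='genre')

model = DecisionTreeClassifier().fit(X, y)

# out_file=None makes export_graphviz return the DOT text instead of writing a file.
dot_source = tree.export_graphviz(model, out_file=None,
                                  feature_names=['age', 'gender'],
                                  class_names=sorted(y.unique()),
                                  label='all',
                                  rounded=True,
                                  filled=True)

print(dot_source.splitlines()[0])  # digraph Tree {
```

The returned text is the same content that would be written to a .dot file, so it can be pasted into any Graphviz viewer.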