This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve


# Problem description
In the text cell below, enter what you will predict in this classification problem (y) and which columns will be used for the prediction (X).

In [147]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [148]:
df = pd.read_csv('albums.csv')

2. Display columns and describe the data set

In [149]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    100000 non-null  int64  
 1   artist_id             100000 non-null  int64  
 2   album_title           100000 non-null  object 
 3   genre                 100000 non-null  object 
 4   year_of_pub           100000 non-null  int64  
 5   num_of_tracks         100000 non-null  int64  
 6   num_of_sales          100000 non-null  int64  
 7   rolling_stone_critic  100000 non-null  float64
 8   mtv_critic            100000 non-null  float64
 9   music_maniac_critic   100000 non-null  float64
dtypes: float64(3), int64(5), object(2)
memory usage: 7.6+ MB
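The `df.info()` output above reports no null values, so this data set needs little cleaning. For data that is less tidy, the "Clean the Data" step could look roughly like the sketch below. The tiny `sample` frame is a hypothetical stand-in for `albums.csv`, with one missing value and one duplicated row added on purpose.

```python
import pandas as pd

# Hypothetical miniature stand-in for albums.csv (not the real data).
sample = pd.DataFrame({
    "genre": ["Folk", "Metal", "Pop", "Pop", None],
    "year_of_pub": [2006, 2014, 2017, 2017, 2010],
    "rolling_stone_critic": [4.0, 3.0, 1.5, 1.5, 4.5],
})

# Count missing values per column and fully duplicated rows.
missing_per_column = sample.isna().sum()
duplicate_rows = int(sample.duplicated().sum())

# Drop incomplete rows, then duplicated rows, before modelling.
cleaned = sample.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("duplicates:", duplicate_rows, "rows kept:", len(cleaned))
```

Whether dropping rows is appropriate depends on the data; filling missing values is another common option.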


In [150]:
df.describe()

| | id | artist_id | year_of_pub | num_of_tracks | num_of_sales | rolling_stone_critic | mtv_critic | music_maniac_critic |
|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 50000.5 | 24981.78205 | 2009.52096 | 8.4895 | 500044.72656 | 2.748945 | 2.75178 | 2.748225 |
| std | 28867.657797 | 14450.407866 | 5.776074 | 4.04511 | 288033.73321 | 1.435789 | 1.437516 | 1.434577 |
| min | 1.0 | 1.0 | 2000.0 | 2.0 | 1009.0 | 0.5 | 0.5 | 0.5 |
| 25% | 25000.75 | 12388.0 | 2004.0 | 5.0 | 251603.5 | 1.5 | 1.5 | 1.5 |
| 50% | 50000.5 | 24940.5 | 2010.0 | 8.0 | 499531.5 | 2.5 | 3.0 | 3.0 |
| 75% | 75000.25 | 37498.25 | 2015.0 | 12.0 | 749354.25 | 4.0 | 4.0 | 4.0 |
| max | 100000.0 | 50000.0 | 2019.0 | 15.0 | 999994.0 | 5.0 | 5.0 | 5.0 |


3. Prepare Data

In [151]:
# Run this section to inspect X.
# Drop every column except year_of_pub and rolling_stone_critic.
X = df.drop(columns=['num_of_tracks', 'id', 'artist_id', 'album_title', 'num_of_sales', 'genre', 'mtv_critic', 'music_maniac_critic'])
X

| | year_of_pub | rolling_stone_critic |
|---|---|---|
| 0 | 2006 | 4.0 |
| 1 | 2014 | 3.0 |
| 2 | 2000 | 2.5 |
| 3 | 2017 | 1.5 |
| 4 | 2010 | 4.5 |
| ... | ... | ... |
| 99995 | 2016 | 2.5 |
| 99996 | 2013 | 5.0 |
| 99997 | 2018 | 2.0 |
| 99998 | 2007 | 4.0 |


In [152]:
# Run this section to inspect y
y = df['genre']
y

0               Folk
1              Metal
2             Latino
3                Pop
4        Black Metal
            ...     
99995       Pop-Rock
99996          Retro
99997          Indie
99998            Pop
99999           Rock
Name: genre, Length: 100000, dtype: object
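Before training, it can help to look at how the target classes are distributed, since that sets a reference point for any accuracy score. The sketch below uses a short hypothetical `genres` series in place of `df['genre']`; the two baseline names are illustrative, not part of the tutorial.

```python
import pandas as pd

# Hypothetical labels standing in for df['genre'].
genres = pd.Series(
    ["Pop", "Rock", "Pop", "Indie", "Folk", "Pop", "Rock", "Metal"],
    name="genre",
)

# Fraction of rows belonging to each genre, most common first.
class_share = genres.value_counts(normalize=True)
num_classes = genres.nunique()

# Accuracy obtained by always guessing the most common genre.
majority_baseline = class_share.max()
# Expected accuracy of guessing uniformly at random among the genres.
uniform_baseline = 1 / num_classes

print(class_share)
print(majority_baseline, uniform_baseline)
```

A model whose accuracy is close to these baselines has learned little from the features.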

In [153]:
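# Train on the full data set, then predict the genre of a
# 2014 album with a Rolling Stone rating of 4
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict([[2014, 4]])
predictions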



array(['Indie'], dtype=object)

4. Calculate accuracy

In [154]:
# Train on 80% of the data set and use the rest to test
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score


0.09025
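An accuracy score is easier to judge next to a baseline. The sketch below compares a decision tree with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. It runs on synthetic data whose labels are drawn independently of the features; that is an assumption made for illustration, not a claim about `albums.csv`. Fixing `random_state` makes the split repeatable, and `stratify` keeps the class proportions similar in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data: labels are random and unrelated to the features.
X_demo = pd.DataFrame({
    "year_of_pub": rng.integers(2000, 2020, size=n),
    "rolling_stone_critic": rng.integers(1, 11, size=n) / 2,
})
y_demo = pd.Series(rng.choice(["Folk", "Indie", "Metal", "Pop", "Rock"], size=n))

# Reproducible 80/20 split that preserves class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree_model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tree_acc = accuracy_score(y_te, tree_model.predict(X_te))
print("baseline:", baseline_acc, "tree:", tree_acc)
```

If the tree does not clearly beat the baseline on the real data, the selected features may carry little information about the target, and trying other columns is a reasonable next step.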

5. Persisting Models

In [155]:
# Save the model to file
joblib.dump(model, 'MODELNAME.joblib')


['MODELNAME.joblib']

5.b. Import the model and make predictions

In [156]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MODELNAME.joblib')
predictions = model.predict(X_test)
predictions

array(['Indie', 'Indie', 'Indie', ..., 'Indie', 'Indie', 'Indie'],
      dtype=object)
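A quick sanity check after loading a persisted model is to confirm that it predicts the same values as the model that was saved. This sketch uses hypothetical toy data and a temporary directory so it does not touch any existing files; `demo-model.joblib` is an illustrative file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [year_of_pub, rolling_stone_critic] -> genre.
X_small = np.array([[2006, 4.0], [2014, 3.0], [2000, 2.5], [2017, 1.5]])
y_small = np.array(["Folk", "Metal", "Latino", "Pop"])
original = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.joblib")
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    original.predict(X_small), restored.predict(X_small)
)
print(same_predictions)
```

Saved models are generally best loaded with the same scikit-learn version that created them.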

6. (Optional) Visualize decision trees

In [157]:
tree.export_graphviz(model, out_file = 'music-recommender.dot',
                    feature_names = list(X.columns),
                    class_names = sorted(y.unique()),
                    label = 'all',
                    rounded = True,
                    filled = True)

# Download the file music-recommender.dot and open it in VS Code.
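If a Graphviz viewer is not available, a tree can also be inspected as plain text with `sklearn.tree.export_text`. The sketch below is an alternative to the `.dot` export, shown on hypothetical toy data with a depth limit so the output stays short.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [year_of_pub, rolling_stone_critic] -> genre.
X_small = [[2006, 4.0], [2014, 3.0], [2000, 2.5], [2017, 1.5]]
y_small = ["Folk", "Metal", "Latino", "Pop"]

# Limit the depth so the printed rules stay readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(X_small, y_small)

# Render the learned splits as indented text rules.
rules = export_text(
    small_tree, feature_names=["year_of_pub", "rolling_stone_critic"]
)
print(rules)
```

For a tree trained on the full data set, a small `max_depth` argument to `export_text` would similarly keep the printout manageable.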
