This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0).

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve


# Problem description
In the text cell below, state what you will predict in this classification problem (y) and which columns you will use to make the prediction (X).

In [4]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [5]:
df = pd.read_csv('cleanedfile.csv')
df = df.drop(columns = ['Name', 'Country'])
df

Unnamed: 0,Rank,Sales $,Profit $,Assets $,Market Value $
0,1,190500000000,45800000000,4914700000000,249500000000
1,2,136200000000,40400000000,3689300000000,464800000000
2,3,245500000000,42500000000,873700000000,624400000000
3,4,173500000000,39300000000,4301700000000,210400000000
4,5,229700000000,49300000000,510300000000,1897200000000
...,...,...,...,...,...
495,496,9900000000,2600000000,17700000000,94700000000
496,497,19000000000,401300000,48400000000,66800000000
497,498,9400000000,1400000000,89700000000,17600000000
498,499,7300000000,1300000000,170300000000,20000000000


2. Display columns and describe the data set

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Rank            500 non-null    int64
 1   Sales $         500 non-null    int64
 2   Profit $        500 non-null    int64
 3   Assets $        500 non-null    int64
 4   Market Value $  500 non-null    int64
dtypes: int64(5)
memory usage: 19.7 KB


In [7]:
df.describe()

Unnamed: 0,Rank,Sales $,Profit $,Assets $,Market Value $
count,500.0,500.0,500.0,500.0,500.0
mean,250.5,49794600000.0,4173743000.0,322928000000.0,104809400000.0
std,144.481833,55455790000.0,7457710000.0,631083400000.0,208070200000.0
min,1.0,4900000000.0,-22400000000.0,14800000000.0,1500000000.0
25%,125.75,18075000000.0,1400000000.0,48375000000.0,28950000000.0
50%,250.5,31400000000.0,2500000000.0,109200000000.0,52350000000.0
75%,375.25,58250000000.0,4725000000.0,298475000000.0,104100000000.0
max,500.0,559200000000.0,63900000000.0,4914700000000.0,2252300000000.0


3. Prepare Data

In [8]:
# Run this section to inspect X
X = df.drop(columns = ['Profit $'])
X

Unnamed: 0,Rank,Sales $,Assets $,Market Value $
0,1,190500000000,4914700000000,249500000000
1,2,136200000000,3689300000000,464800000000
2,3,245500000000,873700000000,624400000000
3,4,173500000000,4301700000000,210400000000
4,5,229700000000,510300000000,1897200000000
...,...,...,...,...
495,496,9900000000,17700000000,94700000000
496,497,19000000000,48400000000,66800000000
497,498,9400000000,89700000000,17600000000
498,499,7300000000,170300000000,20000000000


In [9]:
# Run this section to inspect y
y = df['Profit $']
y

0      45800000000
1      40400000000
2      42500000000
3      39300000000
4      49300000000
          ...     
495     2600000000
496      401300000
497     1400000000
498     1300000000
499     1800000000
Name: Profit $, Length: 500, dtype: int64
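
A decision tree classifier treats every distinct value of `y` as a separate class, so it can help to check how many distinct values the target has before training. The sketch below illustrates the check on a small made-up frame (the numbers are hypothetical, not taken from `cleanedfile.csv`); on the real data the same two calls would be applied to `df['Profit $']`.

```python
import pandas as pd

# Hypothetical stand-in for the real data set; all values are made up.
toy = pd.DataFrame({
    "Sales $": [100, 200, 300, 400, 500, 600],
    "Profit $": [10, 20, 20, 35, 40, 55],
})

y_toy = toy["Profit $"]

# Number of distinct target values = number of classes the tree would see.
n_classes = y_toy.nunique()

# Rows per class; many classes with a single row each means the
# classifier has almost nothing to generalize from.
rows_per_class = y_toy.value_counts()

print(n_classes)
print(rows_per_class.max())
```

When the number of classes is close to the number of rows, the target is behaving like a continuous quantity rather than a set of categories.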

4. Calculate accuracy

In [10]:
# Train on 80% of the data set and use the remaining 20% to test
X_train, X_test, y_train, y_test = train_test_split(
                                    X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.01
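
An accuracy near 0.01 is what one would expect from this setup: `Profit $` is a continuous amount, so almost every company ends up in its own class and an exact match on the test set is rare. One possible remedy, sketched below under stated assumptions, is to group the target into a few categories before training. The synthetic data, the three quantile bins, the labels `low`/`mid`/`high`, and `random_state=0` are all illustrative choices and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): profit grows with sales.
rng = np.random.default_rng(0)
sales = rng.uniform(5e9, 5e11, size=500)
profit = 0.08 * sales + rng.normal(0, 2e9, size=500)
demo = pd.DataFrame({"Sales $": sales, "Profit $": profit})

# Turn the continuous target into three classes of roughly equal size.
y_binned = pd.qcut(demo["Profit $"], q=3,
                   labels=["low", "mid", "high"]).astype(str)
X_demo = demo[["Sales $"]]

# Fixing random_state makes the split and the tree reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_binned, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_tr, y_tr)

# Accuracy is now measured over three classes instead of hundreds.
binned_score = accuracy_score(y_te, clf.predict(X_te))
print(binned_score)
```

If the actual amount matters more than a category, a regression model such as `DecisionTreeRegressor` with a regression metric would be the more natural choice.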

5. Persisting Models

In [11]:
# Save the model to file
joblib.dump(model, 'MARKET.joblib')



['MARKET.joblib']

5.b. Import the model and make predictions

In [12]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MARKET.joblib')
predictions = model.predict(X_test)
predictions

array([  2800000000,   3000000000,   8200000000,   5400000000,
        14100000000,   1600000000,   8500000000,   3200000000,
         1000000000,   1500000000,     33100000,   6300000000,
         1800000000,   1200000000,   5600000000,   4700000000,
         1900000000,   3100000000,   6200000000,   1500000000,
        40300000000,   9300000000,  -5000000000,   3100000000,
         2700000000,   5700000000,   3100000000,   1300000000,
         1400000000,  45800000000,   1500000000,   3600000000,
         3400000000,  17100000000,   2100000000,    767200000,
         4300000000,   5200000000,  27900000000,   6200000000,
         7500000000,  14700000000,  10100000000,    739000000,
        -1300000000,   9700000000,   1200000000,   2500000000,
         1500000000,   2500000000,  12900000000,   3800000000,
         2900000000,   1500000000,   4700000000,   1600000000,
         2500000000,   5400000000,   3100000000, -22200000000,
          767200000,   1200000000,   3700000000,   5100
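
A saved model is only useful if it reloads with the same behaviour. The following sketch checks that round trip on a small synthetic classifier; the data, the temporary directory, and the file name `demo_model.joblib` are placeholders chosen for the example rather than part of the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem (assumption): label is 1 when the feature exceeds 0.5.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(100, 1))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

original = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(original, path)
    restored = joblib.load(path)

# The restored model should give exactly the same predictions.
same_predictions = np.array_equal(
    original.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```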

6. (Optional) Visualize decision trees

In [15]:
tree.export_graphviz(model, out_file = 'MODELNAME.dot',
                    feature_names = list(X.columns),
                    # One label per class, in the same order as model.classes_
                    class_names = [str(c) for c in model.classes_],
                    label = 'all',
                    rounded = True, 
                    filled = True)

# Download the file MODELNAME.dot and open it in VS Code.
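
If Graphviz tooling is not available, scikit-learn's `tree.export_text` prints the tree structure as plain text instead. A minimal sketch on synthetic data follows; the feature names `sales` and `assets` and the depth limit are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature problem (assumption): the class depends on "sales" only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# A shallow tree keeps the printed rules short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the split rules as an indented string.
rules = tree.export_text(clf, feature_names=["sales", "assets"])
print(rules)
```

Each indented line is one split or leaf, so the printout can be read top to bottom as a set of nested if/else rules.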
