# Experiments with Decision Trees

**Task**

Try various settings for decision tree training. You'll need to achieve an accuracy score of at least 0.86 on the test set.

Spoiler alert: You can do better if you dare to!

Write the code in Practicum's JupyterHub, as explained in the previous lesson. The path to the data file is `/datasets/train_data_us.csv`. Alternatively, you can download the dataset and train the model locally on your machine (not recommended). Then upload the trained model here and test it.

In [4]:
import pandas as pd
import joblib

# import the decision tree classifier from the sklearn library
from sklearn.tree import DecisionTreeClassifier

# load the training data; fall back to a local copy if the URL cannot be read
try:
    df = pd.read_csv('https://code.s3.yandex.net/datasets/train_data_us.csv')
except Exception:
    df = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/5. Introduction to Machine Learning/train_data_us.csv')
print('Data has been read correctly!')

# label listings priced above 113000 as class 1 and all others as class 0
df.loc[df['last_price'] > 113000, 'price_class'] = 1
df.loc[df['last_price'] <= 113000, 'price_class'] = 0

# store features and target in separate variables
features = df.drop(['last_price', 'price_class'], axis=1)
target = df['price_class']

# create an empty model and assign it to a variable
# train a model by calling the fit() method
model = DecisionTreeClassifier(random_state=12345, max_depth=5)
model.fit(features, target)

# save the model:
# the first argument is the model,
# the second argument is the path to the file
joblib.dump(model, 'model.joblib')

Data has been read correctly!


['model.joblib']
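The task asks you to try various settings. One way to do that is to loop over several `max_depth` values and compare accuracy on a held-out validation split. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the real dataset, and the names `results`, `candidate`, and `best_depth` are hypothetical. Replace `X` and `y` with your own `features` and `target` to apply it to the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the course dataset).
X, y = make_classification(n_samples=1000, n_features=8, random_state=12345)

# Hold out part of the data so the depths are compared on unseen rows.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Train one tree per depth and record its validation accuracy.
results = {}
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    candidate.fit(X_train, y_train)
    results[depth] = accuracy_score(y_valid, candidate.predict(X_valid))

# Pick the depth with the highest validation accuracy.
best_depth = max(results, key=results.get)
print('Best max_depth:', best_depth, 'validation accuracy:', results[best_depth])
```

Choosing the depth on a validation split rather than on the test set keeps the final test accuracy an honest estimate. Other settings, such as `min_samples_leaf`, can be compared the same way.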

# Test Dataset

In [5]:
test_df = pd.read_csv('https://code.s3.yandex.net/datasets/test_data_us.csv')

test_df.loc[test_df['last_price'] > 113000, 'price_class'] = 1
test_df.loc[test_df['last_price'] <= 113000, 'price_class'] = 0

test_features = test_df.drop(['last_price', 'price_class'], axis=1)
test_target = test_df['price_class']
test_predictions = model.predict(test_features)

# count the number of wrong predictions
def error_count(answers, predictions):
    count = 0
    for i in range(len(answers)):
        if answers[i] != predictions[i]:
            count += 1
    return count

# compute accuracy as the share of correct predictions
def accuracy(answers, predictions):
    correct = 0
    for i in range(len(answers)):
        if answers[i] == predictions[i]:
            correct += 1
    return correct / len(answers)

print('Accuracy:', accuracy(test_target, test_predictions))

Accuracy: 0.8904320987654321
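The hand-written `accuracy` function above can be cross-checked against scikit-learn's `accuracy_score`, which computes the same share of correct predictions. The sketch below uses small made-up label lists purely for illustration; they are not taken from the test dataset.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
answers = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 0]

def accuracy(answers, predictions):
    # Same logic as the notebook's function: count matching positions.
    correct = 0
    for i in range(len(answers)):
        if answers[i] == predictions[i]:
            correct += 1
    return correct / len(answers)

manual = accuracy(answers, predictions)
library = accuracy_score(answers, predictions)

print('Manual accuracy:', manual)
print('accuracy_score:', library)
```

Four of the five predictions match, so both values should equal 0.8. In the notebook, the equivalent call would be `accuracy_score(test_target, test_predictions)`.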


You can load and run the saved model with the `joblib.load` function. The review code on the training platform does this for you.

In [6]:
import joblib

# the argument is the path to the file
# the return value is the model
model = joblib.load('model.joblib')

# ...
# testing the model
# ...
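To see that a model saved with `joblib.dump` comes back unchanged from `joblib.load`, a round trip can be checked directly. The sketch below is an assumption-laden illustration, not the platform's review code: it trains a small tree on synthetic data, writes it to a temporary directory, reloads it, and compares predictions. The names `restored` and `same_predictions` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the course dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=12345)

model = DecisionTreeClassifier(random_state=12345, max_depth=5)
model.fit(X, y)

# Save the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The reloaded model should give the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print('Predictions match after reload:', same_predictions)
```

Note that joblib files are pickle-based, so a model should be loaded with a scikit-learn version compatible with the one that saved it, and only from sources you trust.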