Simplify data so that it is easy to understand

In [1]:
import pandas as pd

dataset_path = '../resources/datasets/melb_data.csv'
df = pd.read_csv(dataset_path)

df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The Melbourne data has some missing values (some houses for which some variables weren't recorded).<br>
We'll learn to handle missing values in a later tutorial.<br>
Your Iowa data doesn't have missing values in the columns you use.<br>
So we will take the simplest option for now and drop houses with missing values from our data.<br>
Don't worry about this much for now. The code below uses dropna, which drops rows with missing values (think of na as "not available").<br>

In [2]:
df = df.dropna(axis=0)
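
To see what dropna does, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the Melbourne file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Melbourne file
toy = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [500000.0, np.nan, 900000.0],
    'Landsize': [150.0, 200.0, np.nan],
})

# Count the missing values in each column before dropping anything
print(toy.isna().sum())

rows_before = len(toy)
# axis=0 drops every row that contains at least one missing value
toy_clean = toy.dropna(axis=0)
rows_after = len(toy_clean)

print(f"rows before: {rows_before}, rows after: {rows_after}")
```

Comparing the row counts before and after is a quick way to check how much data this simple approach throws away.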

Selecting the prediction target<br>
We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y.

In [3]:
y = df.Price
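
Dot notation is shorthand for selecting a column by name with brackets. The sketch below, on a hypothetical `toy` DataFrame, shows that both forms return the same Series.

```python
import pandas as pd

# Hypothetical toy data; only the column name 'Price' mirrors the tutorial
toy = pd.DataFrame({'Price': [500000, 750000], 'Rooms': [2, 3]})

y_dot = toy.Price          # dot notation
y_bracket = toy['Price']   # bracket notation

# Both forms select the same column
same = y_dot.equals(y_bracket)
print(same)
```

Bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame method name, since dot notation cannot reach those columns.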

Choosing "Features" (the columns that are fed into our model)

In [4]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By convention, this data is called X.

In [5]:
X = df[melbourne_features]
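
Feature names must match the column names in the file exactly, including unusual spellings such as 'Lattitude' and 'Longtitude'. The following sketch, on a hypothetical `toy` DataFrame, checks a feature list against the available columns before selecting, and shows that a list of names yields a DataFrame while a single name yields a Series.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Melbourne file
toy = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Bathroom': [1, 2, 1],
    'Price': [500000, 750000, 900000],
})

toy_features = ['Rooms', 'Bathroom']

# Guard against typos: list any requested feature that is not a column
missing = [name for name in toy_features if name not in toy.columns]
print("missing columns:", missing)

toy_X = toy[toy_features]   # list of names -> DataFrame
toy_y = toy['Price']        # single name -> Series

print(type(toy_X).__name__, toy_X.shape)
print(type(toy_y).__name__, toy_y.shape)
```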

Review the data we'll be using to predict house prices using the describe method, which summarizes each column, and the head method, which shows the top few rows.

In [6]:
X.describe()

          Rooms  Bathroom    Landsize   Lattitude  Longtitude
count    6196.0    6196.0      6196.0      6196.0      6196.0
mean   2.931407   1.57634   471.00694  -37.807904  144.990201
std    0.971079  0.711362  897.449881     0.07585    0.099165
min         1.0       1.0         0.0   -38.16492   144.54237
25%         2.0       1.0       152.0  -37.855438  144.926198
50%         3.0       1.0       373.0   -37.80225    144.9958
75%         4.0       2.0       628.0    -37.7582    145.0527
max         8.0       8.0     37000.0   -37.45709   145.52635


Building Your Model<br>
You will use the scikit-learn library to create your models.<br>

The steps to building and using a model are:

- Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- Fit: Capture patterns from provided data. This is the heart of modeling.
- Predict: Just what it sounds like.
- Evaluate: Determine how accurate the model's predictions are.
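
The four steps above can be sketched end to end on a tiny made-up dataset. The `toy_` names and values are hypothetical, and mean absolute error is just one possible choice of metric; the tutorial does not prescribe one here.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, not taken from the Melbourne file
toy_X = pd.DataFrame({'Rooms': [1, 2, 3, 4], 'Landsize': [100, 200, 300, 400]})
toy_y = pd.Series([300000, 500000, 700000, 900000])

# Define: choose the model type and its parameters
toy_model = DecisionTreeRegressor(random_state=0)

# Fit: capture patterns from the provided data
toy_model.fit(toy_X, toy_y)

# Predict: here on the same rows the model was trained on
toy_preds = toy_model.predict(toy_X)

# Evaluate: average absolute gap between predictions and true prices
toy_mae = mean_absolute_error(toy_y, toy_preds)
print(toy_mae)
```

Because the predictions are made on the training rows themselves, this error is optimistic. A fairer evaluation would use houses the model has not seen.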

In [7]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
model = DecisionTreeRegressor(random_state=48)

# Fit model
model.fit(X=X, y=y)

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
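
As a rough illustration of that reproducibility, the sketch below trains two separate trees with the same random_state on hypothetical toy data and checks that their predictions agree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, not taken from the Melbourne file
toy_X = pd.DataFrame({
    'Rooms': [1, 2, 3, 4, 5, 6],
    'Bathroom': [1, 1, 2, 2, 3, 3],
})
toy_y = pd.Series([300000, 350000, 500000, 520000, 800000, 790000])

# Two independent models built with the same seed
first = DecisionTreeRegressor(random_state=48).fit(toy_X, toy_y).predict(toy_X)
second = DecisionTreeRegressor(random_state=48).fit(toy_X, toy_y).predict(toy_X)

# With a fixed random_state the two runs should match exactly
identical = np.array_equal(first, second)
print(identical)
```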

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [8]:
print(X.head())
print(model.predict(X.head()))

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
[1035000. 1465000. 1600000. 1876000. 1636000.]


In [9]:
# Predict with our own data

df1 = pd.DataFrame({
    'Rooms': [2, 4, 1, 2, 3],
    'Bathroom': [1, 2, 2, 1, 3],
    'Landsize': [450, 600, 750, 300, 500],
    'Lattitude': [-37.8079, -37.8543, -37.8093, -37.7863, -37.7969],
    'Longtitude': [144.9934, 144.9874, 144.9943, 144.9983, 144.9823]
})

print(df1.head())
print(model.predict(df1.head()))

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2         1       450   -37.8079    144.9934
1      4         2       600   -37.8543    144.9874
2      1         2       750   -37.8093    144.9943
3      2         1       300   -37.7863    144.9983
4      3         3       500   -37.7969    144.9823
[ 480000. 3625000.  517500. 1350000. 2950000.]
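
When predicting for hand-built data, the new table needs the same feature columns as the training data, in the same order. One cautious way to do this, sketched here with hypothetical `toy_` names and values, is to select the columns through the same feature list used for training.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

toy_features = ['Rooms', 'Landsize']

# Hypothetical training data, not taken from the Melbourne file
toy_train = pd.DataFrame({
    'Rooms': [1, 2, 3, 4],
    'Landsize': [100, 200, 300, 400],
    'Price': [300000, 500000, 700000, 900000],
})
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(toy_train[toy_features], toy_train['Price'])

# New houses typed in with the columns in a different order
new_houses = pd.DataFrame({'Landsize': [200, 400], 'Rooms': [2, 4]})

# Selecting with the feature list restores the training column order
new_preds = toy_model.predict(new_houses[toy_features])
print(new_preds)
```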
