# [Starting your ML project](https://www.kaggle.com/dansbecker/starting-your-ml-project/notebook)

## Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  
Fork this notebook and write your code in it.  
You will see examples predicting home prices using data from Melbourne, Australia.  
You will then write code to build a model predicting prices in the US state of Iowa.  
The data from the tutorial, the Melbourne data, is not available in this workspace.  
You will need to translate the concepts to work with the data in this notebook, the Iowa data.  

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum with any questions or comments.

# Write Your Code Below

In [1]:
import pandas as pd

# Load the data and greet everyone.
main_file_path = 'input/melbourne_data.csv'
data = pd.read_csv(main_file_path)
print('hello world')

hello world


## Use pandas to get familiar with your data

The first thing you'll want to do is familiarize yourself with the data.  
You'll use the Pandas library for this.  
Pandas is the primary tool that modern data scientists use for exploring and manipulating data.  
Most people abbreviate pandas in their code as pd.  
We do this with the command  
`import pandas as pd`  
The most important part of the Pandas library is the DataFrame.  
A DataFrame holds the type of data you might think of as a table.  
This is similar to a sheet in Excel, or a table in a SQL database.  
The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.  
Let's start by looking at a basic data overview with our example data from Melbourne and the data you'll be working with from Iowa:

In [2]:
# Explore some summary statistics:
data.describe()

| | Unnamed: 0 | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 18396.0 | 18396.0 | 18396.0 | 18395.0 | 18395.0 | 14927.0 | 14925.0 | 14820.0 | 13603.0 | 7762.0 | 8958.0 | 15064.0 | 15064.0 | 18395.0 |
| mean | 11826.787073 | 2.93504 | 1056697.0 | 10.389986 | 3107.140147 | 2.913043 | 1.538492 | 1.61552 | 558.116371 | 151.220219 | 1965.879996 | -37.809849 | 144.996338 | 7517.975265 |
| std | 6800.710448 | 0.958202 | 641921.7 | 6.00905 | 95.000995 | 0.964641 | 0.689311 | 0.955916 | 3987.326586 | 519.188596 | 37.013261 | 0.081152 | 0.106375 | 4488.416599 |
| min | 1.0 | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 5936.75 | 2.0 | 633000.0 | 6.3 | 3046.0 | 2.0 | 1.0 | 1.0 | 176.5 | 93.0 | 1950.0 | -37.8581 | 144.931193 | 4294.0 |
| 50% | 11820.5 | 3.0 | 880000.0 | 9.7 | 3085.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.803625 | 145.00092 | 6567.0 |
| 75% | 17734.25 | 3.0 | 1302000.0 | 13.3 | 3149.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 2000.0 | -37.75627 | 145.06 | 10331.0 |
| max | 23546.0 | 12.0 | 9000000.0 | 48.1 | 3978.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


## Interpreting data description

The results show 8 numbers for each numeric column in your original dataset (non-numeric columns are left out by default).  
The first number, the **count**, shows how many rows have non-missing values.  
Missing values arise for many reasons.  
For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house.  
We'll come back to the topic of missing data.  
The second value is the **mean**, which is the average.  
Under that, **std** is the standard deviation, which measures how numerically spread out the values are.  
To interpret the **min, 25%, 50%, 75%** and **max** values, imagine sorting each column from lowest to highest value.  
The first (smallest) value is the min.  
If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.  
That is the 25% value (pronounced "25th percentile").  
The 50th and 75th percentiles are defined analogously, and the max is the largest number.
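
As a small illustration of how these summary rows behave, the sketch below builds a toy column with one missing value. The DataFrame `toy` and its values are made up for this example and are not taken from the Melbourne or Iowa data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, chosen only to illustrate describe().
toy = pd.DataFrame({"Rooms": [1, 2, 3, 4, np.nan]})

summary = toy["Rooms"].describe()
print(summary)

# count skips the missing entry, so it reports 4 rows rather than 5.
# The 25%, 50% and 75% rows agree with quantile() on the same column.
quartiles = toy["Rooms"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```

Comparing the two printouts shows that the percentile rows of `describe()` are the same numbers that `quantile()` returns.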

# [Selecting and filtering in pandas](https://www.kaggle.com/dansbecker/selecting-and-filtering-in-pandas)

This is part of Kaggle's Learn Machine Learning series.  

Your dataset had too many variables to wrap your head around, or even to print out nicely.  
How can you pare down this overwhelming amount of data to something you can understand?  
To show you the techniques, we'll start by picking a few variables using our intuition.  
Later tutorials will show you statistical techniques to automatically prioritize variables.  
Before we can choose variables/columns, it is helpful to see a list of all columns in the dataset.  
That is done with the **columns** property of the DataFrame.

In [3]:
data.columns

Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

There are lots of ways to go about selecting different subsets of your data.  
Let's start with the basics.

### Selecting a single column

You can pull out any variable (or column) with **dot-notation**.  
This single column is stored in a pandas **Series**, which is kind of like a DataFrame with a single column.

In [4]:
# Store the Series labeled Price separately:
price_data = data.Price
# Read the first few entries:
price_data.head()

0    1480000.0
1    1035000.0
2    1465000.0
3     850000.0
4    1600000.0
Name: Price, dtype: float64
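
Dot-notation only works when the column name is a valid Python identifier. Bracket notation is an alternative that works for any name, including ones such as `1stFlrSF` from the Iowa predictor list. The sketch below uses a made-up two-row DataFrame (`toy`); the `1stFlrSF` values are placeholders.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from this notebook.
toy = pd.DataFrame({"Price": [1480000.0, 1035000.0], "1stFlrSF": [856, 1262]})

price_dot = toy.Price          # dot-notation
price_bracket = toy["Price"]   # bracket notation, returns the same Series
first_floor = toy["1stFlrSF"]  # toy.1stFlrSF would be a SyntaxError

print(price_dot.equals(price_bracket))
print(type(first_floor).__name__)
```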

### Selecting multiple columns

You can select multiple columns from a DataFrame by providing a list of column names inside brackets.  
Each item in that list must be a string ('in quotes').

In [5]:
columns_of_interest = ['Landsize', 'BuildingArea']
two_columns = data[columns_of_interest]
two_columns.head()

| | Landsize | BuildingArea |
|---|---|---|
| 0 | 202.0 | NaN |
| 1 | 156.0 | 79.0 |
| 2 | 134.0 | 150.0 |
| 3 | 94.0 | NaN |
| 4 | 120.0 | 142.0 |


In [6]:
two_columns.describe()

| | Landsize | BuildingArea |
|---|---|---|
| count | 13603.0 | 7762.0 |
| mean | 558.116371 | 151.220219 |
| std | 3987.326586 | 519.188596 |
| min | 0.0 | 0.0 |
| 25% | 176.5 | 93.0 |
| 50% | 440.0 | 126.0 |
| 75% | 651.0 | 174.0 |
| max | 433014.0 | 44515.0 |
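
One detail that is easy to miss: indexing with a single string returns a Series, while indexing with a list of strings returns a DataFrame, even when the list has only one item. A minimal sketch, using a three-row toy frame built from the values displayed above (the name `toy` is an assumption for this example):

```python
import pandas as pd

# Three rows copied from the outputs shown in this notebook.
toy = pd.DataFrame({
    "Landsize": [202.0, 156.0, 134.0],
    "BuildingArea": [None, 79.0, 150.0],
    "Rooms": [2, 2, 3],
})

one_column = toy["Landsize"]                     # a string gives a Series
two_columns = toy[["Landsize", "BuildingArea"]]  # a list gives a DataFrame
also_frame = toy[["Landsize"]]                   # a one-item list is still a DataFrame

print(type(one_column).__name__, type(two_columns).__name__, type(also_frame).__name__)
```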


# [Your first scikit-learn model](https://www.kaggle.com/dansbecker/your-first-scikit-learn-model)

This tutorial is part of the series [Learning Machine Learning](https://www.kaggle.com/dansbecker/learn-machine-learning).

## Choosing the prediction target

You have the code to load your data, and you know how to index it.  
You are ready to choose which column you want to predict.  
This column is called the prediction target.  
By convention, the prediction target is called **y**.  

In [7]:
y = data.Price
y.describe()

count    1.839600e+04
mean     1.056697e+06
std      6.419217e+05
min      8.500000e+04
25%      6.330000e+05
50%      8.800000e+05
75%      1.302000e+06
max      9.000000e+06
Name: Price, dtype: float64

## Choosing Predictors

Next we will select the predictors.  
There may be times when you use all of the other variables besides the target as predictors.  
It's possible to model with non-numeric variables, but we'll start with a narrower set of numeric variables.

In [8]:
data.columns

Index(['Unnamed: 0', 'Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method',
       'SellerG', 'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom',
       'Car', 'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea',
       'Lattitude', 'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [9]:
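# You may need to remove or replace NaN values from some of the predictors:
# http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values
# Caution: 'Price' is also the target y, so including it here lets the model
# simply copy the answer. A real model should leave the target out of its predictors.
data_predictors = ['Price', 'Rooms']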

By convention, this data is called **X**:

In [10]:
X = data[data_predictors]
X.describe()

| | Price | Rooms |
|---|---|---|
| count | 18396.0 | 18396.0 |
| mean | 1056697.0 | 2.93504 |
| std | 641921.7 | 0.958202 |
| min | 85000.0 | 1.0 |
| 25% | 633000.0 | 2.0 |
| 50% | 880000.0 | 3.0 |
| 75% | 1302000.0 | 3.0 |
| max | 9000000.0 | 12.0 |
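
The code comment above mentions removing or replacing NaN values. Two common options are sketched below: dropping incomplete rows with `dropna`, or filling gaps with scikit-learn's `SimpleImputer`. This is one possible approach under the assumption that mean imputation is acceptable; the toy frame reuses a few values from the earlier output and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy predictors with gaps in BuildingArea.
toy = pd.DataFrame({
    "Landsize": [202.0, 156.0, 134.0, 94.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(dropped.shape)
print(imputed)
```

Dropping rows is simpler but discards data; imputation keeps every row at the cost of inventing plausible values.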


## Building your model

You will use the **scikit-learn** library to create your models.  
When coding, this library is written as `sklearn`, as you will see in the sample code.  
Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.  
The steps to building and using a model are:  
* **Define**: What type of model will it be? A decision tree? Some other type of model? Other parameters of the model are specified at this step too.
* **Fit**: Capture patterns from provided data. This is the heart of modeling.
* **Predict**: Just what it sounds like.
* **Evaluate**: Determine how accurate the model's predictions are.

In [11]:
from sklearn.tree import DecisionTreeRegressor

# Define the model:
melbourne_model = DecisionTreeRegressor()

# Fit the model:
melbourne_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for.  
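Here we'll make predictions for the first rows of the training data to see how the predict function works. Because Price is itself one of the predictors in this example, the predictions match the known prices exactly.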

In [12]:
print(X.head())
melbourne_model.predict(X.head())

       Price  Rooms
0  1480000.0      2
1  1035000.0      2
2  1465000.0      3
3   850000.0      3
4  1600000.0      4


array([1480000., 1035000., 1465000.,  850000., 1600000.])
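
The cells above cover Define, Fit and Predict, but not Evaluate. The sketch below walks through all four steps on the five rows displayed in this notebook, with the target kept out of the predictors. Using `mean_absolute_error` as the metric is an assumption made for this example, not something the tutorial prescribes, and the names `toy`, `X_toy`, `y_toy` and `model` are illustrative.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# First five rows as displayed earlier; Price is the target and is not in X_toy.
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4],
    "Landsize": [202.0, 156.0, 134.0, 94.0, 120.0],
    "Price": [1480000.0, 1035000.0, 1465000.0, 850000.0, 1600000.0],
})
X_toy = toy[["Rooms", "Landsize"]]
y_toy = toy["Price"]

model = DecisionTreeRegressor(random_state=0)  # Define
model.fit(X_toy, y_toy)                        # Fit
predictions = model.predict(X_toy)             # Predict
mae = mean_absolute_error(y_toy, predictions)  # Evaluate (on the training rows)

print(predictions)
print(mae)
```

An error measured on the same rows the model was fitted to is optimistic, since an unrestricted decision tree can memorize them. A fair evaluation needs data the model has not seen.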

# Build a model for the Iowa data

Now it's time for you to define and fit a model for your data (in your notebook).  
Select the target variable you want to predict.  
You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable).  
Save this to a new variable called y.  
Create a list of the names of the predictors we will use in the initial model.  
Use just the following columns in the list (you may need to remove or replace NaN values from some of the predictors):
* LotArea
* YearBuilt
* 1stFlrSF
* 2ndFlrSF
* FullBath
* BedroomAbvGr
* TotRmsAbvGrd  
        
Using the list of variable names you just created, select a new DataFrame of the predictors data.  
Save this with the variable name X.  
Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model).  
Ensure you've done the relevant import so you can run this command.  
Fit the model you have created using the data in X and the target data you saved above.  
Make a few predictions with the model's predict command and print out the predictions.
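
The steps above can be sketched as follows. This is only an outline under stated assumptions: the target column is assumed to be named `SalePrice`, and the four rows of `iowa_data` are made-up placeholders standing in for the real file, which you would load with `pd.read_csv` in your notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the Iowa data; every value below is a placeholder.
iowa_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "1stFlrSF": [856, 1262, 920, 961],
    "2ndFlrSF": [854, 0, 866, 756],
    "FullBath": [2, 2, 2, 1],
    "BedroomAbvGr": [3, 3, 3, 3],
    "TotRmsAbvGrd": [8, 6, 6, 7],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Target: the column name SalePrice is an assumption.
y = iowa_data["SalePrice"]

# Predictors: the seven columns listed above.
predictor_names = ["LotArea", "YearBuilt", "1stFlrSF", "2ndFlrSF",
                   "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"]
X = iowa_data[predictor_names]

# Define and fit the model, then predict for the first few rows.
iowa_model = DecisionTreeRegressor(random_state=0)
iowa_model.fit(X, y)
iowa_predictions = iowa_model.predict(X.head())
print(iowa_predictions)
```

Selecting the predictors with a list of strings sidesteps the fact that names like `1stFlrSF` cannot be used with dot-notation.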