# [Starting your ML project](https://www.kaggle.com/dansbecker/starting-your-ml-project/notebook)

## Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  
Fork this notebook and write your code in it.  
You will see examples predicting home prices using data from Melbourne, Australia.  
You will then write code to build a model predicting prices in the US state of Iowa.  
The data from the tutorial, the Melbourne data, is not available in this workspace.  
You will need to translate the concepts to work with the data in this notebook, the Iowa data.  

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum with any questions or comments.

# Write Your Code Below

In [1]:
import pandas as pd

# Load the data and greet everyone.
main_file_path = 'input/train.csv'
data = pd.read_csv(main_file_path)
print('hello world')

hello world


## Use pandas to get familiar with your data

The first thing you'll want to do is familiarize yourself with the data.  
You'll use the Pandas library for this.  
Pandas is the primary tool that modern data scientists use for exploring and manipulating data.  
Most people abbreviate pandas in their code as pd. We do this with the command  
`import pandas as pd`  
The most important part of the Pandas library is the DataFrame.  
A DataFrame holds the type of data you might think of as a table.  
This is similar to a sheet in Excel, or a table in a SQL database.  
The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.  
Let's start with a basic overview of the Iowa data you'll be working with:

In [2]:
# Explore some summary statistics:
data.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


## Interpreting data description

The results show 8 numbers for each numeric column in your original dataset (by default, `describe()` skips non-numeric columns).  
The first number, the **count**, shows how many rows have non-missing values.  
Missing values arise for many reasons.  
For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house.  
We'll come back to the topic of missing data.  
The second value is the **mean**, which is the average.  
Under that, **std** is the standard deviation, which measures how numerically spread out the values are.  
To interpret the **min, 25%, 50%, 75%** and **max** values, imagine sorting each column from lowest to highest value.  
The first (smallest) value is the min.  
If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.  
That is the 25% value (pronounced "25th percentile").  
The 50th and 75th percentiles are defined analogously, and the max is the largest number.
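
To make the eight numbers concrete, here is a minimal sketch on a toy column. The values are invented for illustration and are not taken from the Iowa file. It assumes pandas' default behaviour, where percentiles are interpolated linearly between neighbouring sorted values.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (invented data, not from the Iowa file).
toy = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], name="toy")

summary = toy.describe()
print(summary)

# count skips the missing entry, so it reports 4 rather than 5.
# The percentiles are read off the sorted non-missing values 1, 2, 3, 4.
print("rows:", len(toy), "non-missing:", int(summary["count"]))
```

A count that is smaller than the number of rows, as with `LotFrontage` above, is the quickest sign that a column has missing values.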

# [Selecting and filtering in pandas](https://www.kaggle.com/dansbecker/selecting-and-filtering-in-pandas)

This is part of Kaggle's Learn Machine Learning series.  

Your dataset has too many variables to wrap your head around, or even to print out nicely.  
How can you pare down this overwhelming amount of data to something you can understand?  
To show you the techniques, we'll start by picking a few variables using our intuition.  
Later tutorials will show you statistical techniques to automatically prioritize variables.  
Before we can choose variables/columns, it is helpful to see a list of all columns in the dataset.  
That is done with the **columns** property of the DataFrame.

In [3]:
data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

There are lots of ways to go about selecting different subsets of your data.  
Let's start with the basics.

### Selecting a single column

You can pull out any variable (or column) with **dot-notation**.  
This single column is stored in a pandas **Series**, which is kind of like a DataFrame with a single column.

In [4]:
# Store the Series labeled SalePrice separately:
price_data = data.SalePrice
# Read the first few entries:
price_data.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
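
Dot-notation only works when the column name is a valid Python identifier. Several Iowa columns, such as `1stFlrSF`, start with a digit, so they need bracket-notation instead. The sketch below uses a tiny hand-made table: the `1stFlrSF` values are placeholders, and `toy` is a hypothetical name.

```python
import pandas as pd

# Hypothetical two-row table that reuses two Iowa column names.
# The 1stFlrSF values are placeholders for illustration only.
toy = pd.DataFrame({"SalePrice": [208500, 181500], "1stFlrSF": [856, 1262]})

# Dot-notation and bracket-notation return the same Series.
by_dot = toy.SalePrice
by_bracket = toy["SalePrice"]
print(by_dot.equals(by_bracket))

# '1stFlrSF' starts with a digit, so toy.1stFlrSF would be a syntax error.
# Bracket-notation works for any column name.
first_floor = toy["1stFlrSF"]
print(first_floor.tolist())
```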

### Selecting multiple columns

You can select multiple columns from a DataFrame by providing a list of column names inside brackets.  
Each item in that list must be a string ('in quotes').

In [5]:
columns_of_interest = ['LotArea', 'GrLivArea']
two_columns = data[columns_of_interest]
two_columns.head()

| | LotArea | GrLivArea |
|---|---|---|
| 0 | 8450 | 1710 |
| 1 | 9600 | 1262 |
| 2 | 11250 | 1786 |
| 3 | 9550 | 1717 |
| 4 | 14260 | 2198 |


In [6]:
two_columns.describe()

| | LotArea | GrLivArea |
|---|---|---|
| count | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1515.463699 |
| std | 9981.264932 | 525.480383 |
| min | 1300.0 | 334.0 |
| 25% | 7553.5 | 1129.5 |
| 50% | 9478.5 | 1464.0 |
| 75% | 11601.5 | 1776.75 |
| max | 215245.0 | 5642.0 |
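
What goes inside the brackets decides what comes back: a single string gives a Series, while a list of strings gives a DataFrame, even when the list has only one item. A minimal sketch, assuming a toy table built from the first two rows shown above (`toy` is a hypothetical name):

```python
import pandas as pd

# First two rows of the Iowa columns shown above, rebuilt by hand.
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "GrLivArea": [1710, 1262],
    "SalePrice": [208500, 181500],
})

one_column = toy["LotArea"]                  # a string selects a Series
one_column_frame = toy[["LotArea"]]          # a one-item list selects a DataFrame
two_columns = toy[["LotArea", "GrLivArea"]]  # a longer list selects a DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```

The distinction matters later, because many modelling tools expect the predictors as a two-dimensional table rather than a single Series.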


# [Your first scikit-learn model](https://www.kaggle.com/dansbecker/your-first-scikit-learn-model)

This tutorial is part of the series [Learn Machine Learning](https://www.kaggle.com/dansbecker/learn-machine-learning).

## Choosing the prediction target

You have the code to load your data, and you know how to index it.  
You are ready to choose which column you want to predict.  
This column is called the prediction target.  
By convention, the prediction target is called **y**.

In [7]:
y = data.SalePrice
y.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

## Choosing Predictors

Next we will select the predictors.  
There may be times when you use all of the other variables besides the target as predictors.  
It's possible to model with non-numeric variables, but we'll start with a narrower set of numeric variables.
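
One way to see which columns are numeric is pandas' `select_dtypes`. The sketch below is illustrative only: `toy` is a hypothetical hand-made table, and its `YearBuilt` and `Street` values are placeholders rather than rows from the Iowa file.

```python
import pandas as pd

# Hypothetical table mixing numeric and non-numeric columns (placeholder values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "YearBuilt": [2003, 1976],
    "Street": ["Pave", "Pave"],
    "SalePrice": [208500, 181500],
})

# Keep only the numeric columns, then leave out the target itself.
numeric_columns = toy.select_dtypes(include="number").columns
candidate_predictors = [name for name in numeric_columns if name != "SalePrice"]
print(candidate_predictors)
```

This only lists candidates. Picking which of them to use is still a judgment call.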

In [8]:
data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [9]:
# SalePrice is the prediction target (y), so it is left out of the predictors.
data_predictors = ['TotRmsAbvGrd', 'MasVnrArea', 'LotFrontage', 'LotArea', 'YearBuilt']

By convention, this data is called **X**:

In [10]:
X = data[data_predictors]
X.describe()

| | TotRmsAbvGrd | MasVnrArea | LotFrontage | LotArea | YearBuilt |
|---|---|---|---|---|---|
| count | 1460.0 | 1452.0 | 1201.0 | 1460.0 | 1460.0 |
| mean | 6.517808 | 103.685262 | 70.049958 | 10516.828082 | 1971.267808 |
| std | 1.625393 | 181.066207 | 24.284752 | 9981.264932 | 30.202904 |
| min | 2.0 | 0.0 | 21.0 | 1300.0 | 1872.0 |
| 25% | 5.0 | 0.0 | 59.0 | 7553.5 | 1954.0 |
| 50% | 6.0 | 0.0 | 69.0 | 9478.5 | 1973.0 |
| 75% | 7.0 | 166.0 | 80.0 | 11601.5 | 2000.0 |
| max | 14.0 | 1600.0 | 313.0 | 215245.0 | 2010.0 |
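
Two quick sanity checks are worth running before modelling. First, the target must not appear among the predictors, otherwise the model is handed the answer. Second, the `count` row above shows that `LotFrontage` and `MasVnrArea` have fewer non-missing values than there are rows. A minimal sketch of both checks on a toy table follows; `toy`, `X_toy`, and `y_toy` are hypothetical names, and the `LotFrontage` values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table; the LotFrontage values are placeholders.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

target_name = "SalePrice"
predictor_names = ["LotFrontage", "LotArea"]

# Check 1: the target must stay out of the predictor list.
assert target_name not in predictor_names

X_toy = toy[predictor_names]
y_toy = toy[target_name]

# Check 2: count the missing values in each predictor column.
missing_per_column = X_toy.isnull().sum()
print(missing_per_column)
```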


## Building your model