**[Machine Learning Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**

---


# Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress. 

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this micro-course.

The steps in this notebook are:
1. Build a Random Forest model with all of your data (**X** and **y**)
2. Read in the "test" data, which doesn't include values for the target.  Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

## Recap
Here's the code you've written so far. Start by running it again.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor



# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = 'F:\\ML App\\app\\Dataset\\train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))


Validation MAE when not specifying max_leaf_nodes: 29,653
Validation MAE for best value of max_leaf_nodes: 27,283
Validation MAE for Random Forest Model: 22,762
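
The recap uses `max_leaf_nodes=100` as the "best value" without showing how such a value is chosen. A minimal sketch of one way to do it is to fit a tree for each candidate size and keep the one with the lowest validation MAE. The helper `get_mae`, the candidate list, and the synthetic data from `make_regression` are all assumptions made for illustration; with the real housing data you would pass `train_X`, `val_X`, `train_y`, and `val_y` instead.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given size limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic stand-in data (assumption): swap in the real X and y.
X_demo, y_demo = make_regression(n_samples=400, n_features=7, noise=20.0, random_state=1)
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

# Hypothetical candidate sizes; any list of integers greater than 1 works.
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {
    n: get_mae(n, train_X_demo, val_X_demo, train_y_demo, val_y_demo)
    for n in candidate_max_leaf_nodes
}

# The candidate with the lowest validation MAE wins.
best_tree_size = min(scores, key=scores.get)
print("Best max_leaf_nodes on the demo data:", best_tree_size)
```

The winning size depends on the data, so the value printed for the synthetic data says nothing about the housing data.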




In [3]:
print(X.iloc[6])

LotArea         10084
YearBuilt        2004
1stFlrSF         1694
2ndFlrSF            0
FullBath            2
BedroomAbvGr        3
TotRmsAbvGrd        7
Name: 6, dtype: int64


In [4]:
print(y.iloc[6])

307000


# Creating a Model For the Competition

Build a Random Forest model and train it on all of **X** and **y**.  

In [5]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1)

# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X,y)




RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=1, verbose=0,
                      warm_start=False)
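
Once the model is trained on all of **X** and **y**, there is no held-out validation set left to score it on. One common option, shown here only as a sketch, is to estimate the error with k-fold cross-validation first and then refit on every row. The synthetic data, the choice of 5 folds, and `n_estimators=50` are assumptions for illustration and are not settings used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): swap in the real X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=20.0, random_state=1)

model = RandomForestRegressor(n_estimators=50, random_state=1)

# scikit-learn reports negated MAE so that higher is always better;
# flip the sign back to get an ordinary error value.
neg_mae_scores = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -neg_mae_scores.mean()
print("Cross-validated MAE on the demo data: {:,.1f}".format(cv_mae))

# After estimating the error, refit on every row before predicting on test data.
model.fit(X_demo, y_demo)
```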

# Make Predictions
Read the file of "test" data, then apply your model to make predictions.

In [6]:
# path to file you will use for predictions
test_data_path = 'F:\\ML App\\app\\Dataset\\test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]

# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_X)

# The lines below show how to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
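
Before submitting, it can help to sanity-check the `output` DataFrame. The sketch below uses a hypothetical helper, `check_submission`, and made-up IDs and prices in place of `test_data.Id` and `test_preds`. The checks themselves (exact column names, matching IDs, no missing predictions) are assumptions about what a well-formed file looks like, not rules quoted from the competition.

```python
import numpy as np
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns should be exactly ['Id', 'SalePrice']")
        return problems  # the remaining checks need both columns
    if len(submission) != len(expected_ids):
        problems.append("row count does not match the test data")
    elif not (submission["Id"].to_numpy() == np.asarray(expected_ids)).all():
        problems.append("Id values do not match the test data")
    if submission["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    return problems


# Hypothetical stand-ins for test_data.Id and test_preds.
demo_ids = pd.Series([1461, 1462, 1463])
demo_preds = np.array([125000.0, 160500.0, 182300.0])
demo_output = pd.DataFrame({"Id": demo_ids, "SalePrice": demo_preds})

print(check_submission(demo_output, demo_ids))  # prints []
```

With the real data, the call would be `check_submission(output, test_data.Id)`.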

In [7]:
import joblib
joblib.dump(rf_model_on_full_data, "model.pkl")

['model.pkl']

In [8]:
print(train_X)

      LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
6       10084       2004      1694         0         2             3   
807     21384       1923      1072       504         1             3   
955      7136       1946       979       979         2             4   
1040    13125       1957      1803         0         2             3   
701      9600       1969      1164         0         1             3   
303      9800       1972       894         0         1             3   
1264     4060       1998      1337         0         2             2   
216      8450       2004      1436         0         2             3   
1157     5001       2007      1314         0         2             2   
350      7820       2007      1869         0         2             2   
1350    11643       1969      1338      1296         2             6   
1086     1974       1973       546       546         1             3   
202      7000       1924       865       445         2          

In [14]:
print(train_y)

6       307000
807     223500
955     145000
1040    155000
701     140000
303     149900
1264    181000
216     210000
1157    230000
350     318061
1350    200000
1086     83500
202     112000
1192    125000
527     446261
310     165600
916      35311
1437    394617
8       129900
241     110500
983     255900
623     168500
1267    378500
1393    163000
494      91300
774     395000
997     185000
731     187500
864     250580
275     205000
         ...  
141     260000
1110    188000
753     275500
1001     86000
1239    265900
580     181900
562     108000
398      67000
668     168000
252     173000
907     250000
468     250000
914     173733
357     134000
1278    237000
1300    225000
1202    117000
1305    325000
1414    207000
508     161000
749      98000
129     150000
144     125000
960     116500
847     133500
715     165000
905     128000
1096    127000
235      89500
1061     81000
Name: SalePrice, Length: 1095, dtype: int64


In [9]:
print(type(train_X.iloc[:1,:]))

<class 'pandas.core.frame.DataFrame'>


In [10]:
print(train_X.iloc[:1,:])

   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
6    10084       2004      1694         0         2             3   

   TotRmsAbvGrd  
6             7  


In [11]:
train_X.iloc[:1,:].index

Int64Index([6], dtype='int64')

In [12]:
sample_data = train_X.iloc[:1,:]

In [13]:
print(sample_data)

   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
6    10084       2004      1694         0         2             3   

   TotRmsAbvGrd  
6             7  


In [14]:
joblib.dump(sample_data, "sample_data.pkl")

['sample_data.pkl']
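
The two `joblib.dump` calls above write the model and a one-row sample to disk. As a rough sketch of how those files might be used later, the example below saves a small model and sample, loads both back with `joblib.load`, and confirms that the loaded model gives the same prediction. The synthetic data, the temporary directory, and the small `n_estimators=10` are assumptions chosen to keep the example self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Small synthetic stand-in (assumption) for the housing features and prices.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.integers(500, 2000, size=(50, 2)), columns=["1stFlrSF", "2ndFlrSF"]
)
y_demo = 100 * X_demo["1stFlrSF"] + 50 * X_demo["2ndFlrSF"]

model = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)
sample = X_demo.iloc[:1, :]  # one-row DataFrame, like sample_data above

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    sample_path = os.path.join(tmp_dir, "sample_data.pkl")
    joblib.dump(model, model_path)
    joblib.dump(sample, sample_path)

    # Load both artifacts back, as a separate script or app would.
    loaded_model = joblib.load(model_path)
    loaded_sample = joblib.load(sample_path)

same_prediction = np.allclose(
    model.predict(sample), loaded_model.predict(loaded_sample)
)
print("Loaded model reproduces the original prediction:", same_prediction)
```

Files written by `joblib.dump` are pickles, so only load ones that come from a source you trust.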

# Test Your Work
After filling in the code above:
1. Click the **Commit** button. 
2. After your code has finished running, click the "Open Version" button.  This brings you into the "viewer mode" for your notebook. You will need to scroll down to get back to these instructions.
3. Click the **Output** button on the left of your screen. 

This will bring you to a part of the screen that looks like this: 
![](https://imgur.com/a/QRHL7Uv)

Select the button to submit and you will see your score. You have now successfully submitted to the competition.

4. If you want to keep working to improve your model, select the edit button. Then you can change your model and repeat the process to submit again. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 
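
To see which columns can be added without hitting those errors, one option is to list the numeric columns that have no missing values. This is only a sketch: `demo_home_data` is a hypothetical three-row miniature of `home_data`, built so that one column is text and one has a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of home_data: "Street" is text and "LotFrontage" has a gap.
demo_home_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
    "OverallQual": [7, 6, 7],
    "SalePrice": [208500, 181500, 223500],
})

# Keep numeric columns only, then drop any that contain missing values.
numeric_cols = demo_home_data.select_dtypes(include="number").columns
complete_cols = [c for c in numeric_cols if demo_home_data[c].notna().all()]

# The target must never be used as a feature.
candidate_features = [c for c in complete_cols if c != "SalePrice"]
print(candidate_features)  # prints ['LotArea', 'OverallQual']
```

A column that is complete in the training data can still have missing values in the test data, so the same check is worth running on `test_data` before predicting.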

The [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) micro-course will teach you how to handle these types of features. You will also learn to use **XGBoost**, a gradient boosting library that often gives better accuracy than Random Forest.


# Other Micro-Courses
The **[Pandas Micro-Course](https://kaggle.com/Learn/Pandas)** will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** micro-course, where you will build models with better-than-human level performance at computer vision tasks.

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*