# Kaggle Project

Kaggle is one of the best-known data science communities. It's a place to learn, share, and compete with other data scientists.

Your task is to **compete** in a housing price prediction competition.

Join the competition here: https://www.kaggle.com/c/home-data-for-ml-course

The leaderboard is ranked by your MAE (mean absolute error) score. The lower your error, the higher your place on the leaderboard!
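
To make the metric concrete, here is a minimal sketch of how MAE can be computed by hand. The helper `mae` and the prices below are hypothetical, chosen only for illustration; `sklearn.metrics.mean_absolute_error`, which the code in this Notebook already imports, should give the same value for the same inputs.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |true - predicted|."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sale prices and predictions (illustration only, not real data)
true_prices = [200_000, 150_000, 320_000]
predicted_prices = [210_000, 140_000, 300_000]

# Absolute errors are 10000, 10000 and 20000, so the mean is 40000 / 3
example_mae = mae(true_prices, predicted_prices)
print(f"MAE: {example_mae:.0f}")  # prints "MAE: 13333"
```

An MAE of about 13,333 means that, on average, a prediction is off by that many dollars, regardless of whether it is too high or too low.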

In this Notebook, I help you create a first model, predict values for the given test set, and create a submission for Kaggle.

Now:
1. Download at least the `train.csv` and `test.csv` files from Kaggle.
1. Move them to this folder.
1. Run this code as is.
1. Submit your first `submission.csv` to the Kaggle competition.

After that, your job is to build the best model you can.

Note that on Kaggle, you can only upload 5 submissions per day, so **start early**.


## Constraints
* Return this Notebook with the code for your best model
    - Explain step by step what you did to improve your model and why, what worked and what did not.
    - Write your observations.
    - Feel free to give your MAE at each improvement step.
    - You can show code for your older models if you find it meaningful, but try to keep it to a minimum.
* Include in this Notebook **your username on Kaggle, and a screenshot of your place in the leaderboard**.
* Send your work to my inbox: laure.daumal@ext.devinci.fr
* **Deadline**: Wednesday, 22nd of April, 23h42.

Don't forget to:
1. Clear all outputs before saving & sending it to me (**Cell >> All Output >> Clear**).
1. Restart the kernel and run all cells to check no errors are found (**Cell >> Run All**).

## Hints
* Select the best features: use **feature selection**!
* You know how to visualize data, and doing so can help you understand it.
* You will get bonus points for each interesting thing you do in this Notebook. Show me what you learnt. There are a lot of things we saw in the previous Notebooks that you can reuse. For example:
    - Handle missing values
    - Convert categorical (string) features to numerical ones
    - Normalize data (StandardScaler)
    - Heatmaps (if some features are highly correlated, keep only one)
    - Use dimensionality reduction (keep the components with the highest eigenvalues)
    - Try different models
    - ...
* The higher your place in the leaderboard, the higher your grade.
* You need to be able to explain your work. Using the solution from another person won't do the trick.
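
Several of the hints above (missing values, categorical conversion, scaling, trying different models) can be combined in a single scikit-learn pipeline. The following is only a sketch under stated assumptions, not the expected solution: it runs on a tiny synthetic DataFrame rather than `train.csv`, the `Neighborhood` column is assumed to be a string feature (check your own data), and the random forest is just one possible model to try. It also uses cross-validation to estimate MAE locally, which helps when submissions are limited.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for train.csv (hypothetical values, illustration only)
rng = np.random.default_rng(0)
n = 60
demo_df = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, size=n).astype(float),
    "YearBuilt": rng.integers(1950, 2010, size=n).astype(float),
    "Neighborhood": rng.choice(["A", "B", "C"], size=n),  # assumed string column
})
demo_df.loc[::10, "LotArea"] = np.nan  # inject a few missing values
demo_y = (50_000
          + 10 * demo_df["LotArea"].fillna(10_000)
          + 500 * (demo_df["YearBuilt"] - 1950))

numeric_cols = ["LotArea", "YearBuilt"]
categorical_cols = ["Neighborhood"]

# Numeric features: fill missing values with the median, then standardize.
# (Scaling matters little for tree models, but helps if you swap in a linear one.)
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: fill missing values, then one-hot encode the strings.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=50, random_state=1)),
])

# 5-fold cross-validation; scikit-learn returns negated MAE, so flip the sign.
scores = cross_val_score(pipeline, demo_df, demo_y, cv=5,
                         scoring="neg_mean_absolute_error")
cv_mae = -scores.mean()
print(f"Cross-validated MAE: {cv_mae:.0f}")
```

Keeping the preprocessing inside the pipeline means it is re-fitted on each training fold, so the validation folds do not leak into the imputation or scaling statistics.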

In [None]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Load files
train_data = 'train.csv'  # training dataset
test_data = 'test.csv'    # same features, without the target `SalePrice`

"""
train_data and test_data have the same features,
but test_data does not have the target `SalePrice`.

Your goal is to predict it as accurately as possible.
You can only know your error (MAE) on the test set by
uploading your submission to Kaggle.
"""

# Prepare input
train_df = pd.read_csv(train_data)

y = train_df.SalePrice

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
            'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

X = train_df[features]

# Split the training data into a training set and a held-out validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Train
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# Validate on the held-out split (mean_absolute_error expects y_true first)
val_predictions = model.predict(X_test)
val_mae = mean_absolute_error(y_test, val_predictions)
print(f"MAE: {val_mae:.0f}")

# Train again, this time on the full (unsplit) training data
test_df = pd.read_csv(test_data)
test_X = test_df[features]

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

test_preds = model.predict(test_X)

# Prepare submission CSV to send to Kaggle
submission = pd.DataFrame({'Id': test_df.Id,
                           'SalePrice': test_preds})
submission.to_csv('submission.csv', index=False)
submission