In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/saas-2020-fall-cx-kaggle-compeition/sample_submission.csv
/kaggle/input/saas-2020-fall-cx-kaggle-compeition/train_features.csv
/kaggle/input/saas-2020-fall-cx-kaggle-compeition/test_features.csv
/kaggle/input/saas-2020-fall-cx-kaggle-compeition/train_targets.csv


# Career Exploration Kaggle Competition: Real Estate Price Prediction

### Table Of Contents

* [1. Exploratory Data Analysis](#eda)
* [2. Feature Engineering](#feature-engineering)
* [3. Modeling](#modeling)
    * [3.1 Validation and Evaluation](#validation)
    * [3.2 Linear Regression](#linear-regression)
    * [3.3 Regularized Regression](#reg)
    * [3.4 Random Forest](#random-forest)
    * [3.5 Neural Network](#nn)
    * [3.6 XGBoost](#xgb)


### Hosted and maintained by the [Students Association of Applied Statistics (SAAS)](https://saas.berkeley.edu). Authored by Derek Cai (dcai@berkeley.edu).


## Note: Please upload/share your work in the Notebook section after the competition deadline! This will give you a chance to showcase your work to other fellows in CX :)

## Data Loading

In [2]:
X_train = pd.read_csv("/kaggle/input/saas-2020-fall-cx-kaggle-compeition/train_features.csv")
y_train = pd.read_csv("/kaggle/input/saas-2020-fall-cx-kaggle-compeition/train_targets.csv")
X_test = pd.read_csv("/kaggle/input/saas-2020-fall-cx-kaggle-compeition/test_features.csv")
sample_submission = pd.read_csv("/kaggle/input/saas-2020-fall-cx-kaggle-compeition/sample_submission.csv")

When we do EDA and feature engineering on a dataset, we often examine the training features and the test features together. That way, complex feature engineering and data cleaning only has to be done once, and you don't have to worry about a transformation not being applied to the test set.

In [3]:
df = pd.concat((X_train, X_test), axis=0)
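A minimal sketch of the concatenate-then-split round trip on toy data. The column names (`AREA`, `ZONE`) and the squared-area feature are made up for illustration and are not taken from the competition files. The key point is to record the number of training rows before transforming, so the combined frame can be cut back at the right place later.

```python
import pandas as pd

# Toy stand-ins for the competition files (hypothetical column names).
toy_train = pd.DataFrame({"AREA": [50, 80, 120], "ZONE": ["a", "b", "a"]})
toy_test = pd.DataFrame({"AREA": [60, 95], "ZONE": ["b", "c"]})

# Record the training row count, then stack train on top of test.
n_train = len(toy_train)
combined = pd.concat((toy_train, toy_test), axis=0, ignore_index=True)

# Any transformation is now applied once, to both parts.
combined["AREA_SQ"] = combined["AREA"] ** 2

# Cut the combined frame back at the recorded row count.
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(train_part.shape, test_part.shape)
```

Using `ignore_index=True` is a design choice here: it gives the combined frame a clean 0..n-1 index, which avoids duplicate index labels from the two source frames.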

<span id="eda"></span>

## 1. Exploratory Data Analysis

Provide at least two plots that demonstrate interesting aspects of the dataset, especially the influence of certain features on the target variable, the sale price.


In [4]:
# space for sick plots
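Before plotting, it can help to rank the numeric features by how strongly they correlate with the target, so you know which ones deserve a plot. The sketch below does this on synthetic data. The feature names and the relationship between them and the target are assumptions made up for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features (hypothetical names) and a target driven mostly by AREA.
toy_X = pd.DataFrame({
    "AREA": rng.uniform(30, 200, size=300),
    "AGE": rng.uniform(0, 100, size=300),
})
toy_y = 1000 * toy_X["AREA"] + rng.normal(scale=5000, size=300)

# Absolute Pearson correlation of every feature with the target, strongest first.
corr = toy_X.corrwith(toy_y).abs().sort_values(ascending=False)
print(corr)
```

Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless; a scatter plot of that feature against the target is still worth a look.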

## 2. Feature Engineering

The data you are given is already pretty clean (no NaN values). Real datasets can be a lot messier and require a lot of data cleaning beforehand. As a general rule of thumb, data cleaning should be done before, or together with, feature engineering.

### Feature Selection
Not all features are useful. What are some features you can/should get rid of in this dataset? And why should you get rid of them?

In [5]:
bad_feature = []

In [6]:
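def feature_dropper(bad_feature):
    # drop each unwanted column from both the training and the test features
    # note: df was concatenated earlier, so also drop these columns from df (or rebuild it)
    for feature in bad_feature:
        del X_train[feature]
        del X_test[feature]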

### Dimensionality Reduction
When the data is high-dimensional, PCA is a useful way to reduce the number of dimensions during feature engineering.
Since we only have around 20 features, this is not necessary here, but it could potentially help with your Kaggle score.
PS: PCA is designed for continuous variables, so you may want to leave the categorical columns out of PCA.

In [7]:
# optional
from sklearn.decomposition import PCA
# implement your own PCA function here
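One possible shape for such a PCA helper, shown on random toy data. The function name `pca_numeric` and the choice to standardize first are assumptions of this sketch, not part of the assignment. It keeps only numeric columns, in line with the PS above, and scales them because PCA is sensitive to the units of each feature.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_numeric(frame, n_components=2):
    """Project the numeric columns of `frame` onto their top principal components."""
    numeric = frame.select_dtypes(include="number")   # skip categorical columns
    scaled = StandardScaler().fit_transform(numeric)  # zero mean, unit variance
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled)
    names = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(components, columns=names, index=frame.index), pca

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "x3": rng.normal(size=100),
    "label": ["a", "b"] * 50,  # categorical column, ignored by the helper
})
pcs, fitted = pca_numeric(toy, n_components=2)
print(pcs.shape, fitted.explained_variance_ratio_.sum())
```

The summed `explained_variance_ratio_` tells you how much of the variance the kept components retain, which is a reasonable guide when choosing `n_components`.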

### One-Hot Encoding
Some algorithms (e.g. decision trees) can deal with categorical data directly, while others need a bit of preprocessing first.
One-hot encoding is one of many preprocessing methods for handling categorical data. For more information on one-hot encoding, read this article: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

In [8]:
def one_hot_encoding(df):
    df = pd.get_dummies(df)
    return df 
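A small illustration of what `pd.get_dummies` produces, and of why encoding the combined frame is safer than encoding train and test separately. The toy columns are hypothetical. When a category appears in only one of the two sets, separate encoding yields mismatched columns.

```python
import pandas as pd

toy_train = pd.DataFrame({"AREA": [50, 80, 120], "ZONE": ["a", "b", "a"]})
toy_test = pd.DataFrame({"AREA": [60, 95], "ZONE": ["b", "c"]})

# Encoded separately, the two frames disagree on their columns.
train_cols = pd.get_dummies(toy_train).columns.tolist()
test_cols = pd.get_dummies(toy_test).columns.tolist()

# Encoded together, every category gets a column in both parts.
combined = pd.concat((toy_train, toy_test), axis=0, ignore_index=True)
combined_cols = pd.get_dummies(combined).columns.tolist()

print(train_cols)
print(test_cols)
print(combined_cols)
```

Numeric columns such as `AREA` pass through unchanged; only the string column is expanded into one indicator column per category.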


## Feel free to do more feature engineering on df! The methods listed above are only meant to help you get started.

In [9]:
# Splitting up our engineered df back into training and test
n_train = X_train.shape[0]  # number of training rows, recorded before X_train is reassigned
X_train = df[:n_train]
X_test = df[n_train:]

<span id="modeling"/>

## 3. Modeling

For each of the models we try, make sure you also run the [Prediction](#prediction) cells at the bottom, so you can submit your predictions to the competition! This is how we'll be making sure you're keeping up with the project.

<span id="validation"/>

### 3.1 Validation and Evaluation

Our Kaggle competition uses Root-Mean-Square-Error (RMSE). In mathematical notation, it is:

$$\text{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - \hat{y}_i)^2}$$

#### Evaluation

Complete the function below.

In [10]:
from sklearn.metrics import mean_squared_error

def evaluate(y_pred, y_true):
    """Returns the RMSE(y_pred, y_true)"""
    return np.sqrt(mean_squared_error(y_true, y_pred))
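To make the formula concrete, here is a tiny worked example in plain NumPy, with made-up numbers. The errors are 1, 0, -2 and -1, their squares are 1, 0, 4 and 1, the mean of the squares is 1.5, and the RMSE is the square root of that.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

squared_errors = (y_true - y_pred) ** 2  # [1, 0, 4, 1]
mse = squared_errors.mean()              # 1.5
rmse = np.sqrt(mse)                      # sqrt(1.5)
print(rmse)
```

Because the errors are squared before averaging, a single large miss costs more than several small ones, and the result is in the same units as the target.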

#### Validation

Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split your training data into a training set and a validation set. When neither `test_size` nor `train_size` is given, as in the cell below, `train_test_split` holds out 25% of the rows for validation; pass `test_size=0.2` if you want a 20% validation set.

In [11]:
from sklearn.model_selection import train_test_split

train_X, valid_X, train_y, valid_y = train_test_split(X_train, y_train, random_state=666)

The single split above is usable but not very robust, because the score depends on which rows happen to land in the validation set. K-Fold Cross-Validation averages the score over several splits and is usually more reliable. Feel free to set up your own K-Fold cross-validation scheme. For more information, please read https://towardsdatascience.com/cross-validation-a-beginners-guide-5b8ca04962cd.

#### K-Fold Cross Validation

In [12]:
#K-Fold Cross Validation code
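One way such a scheme could look, sketched on synthetic NumPy arrays. The helper name `kfold_rmse`, the shuffling and the seed are choices made for this example. With pandas objects you would index the folds with `.iloc` instead of plain brackets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def kfold_rmse(model, X, y, n_splits=5, seed=0):
    """Return the validation RMSE of `model` on each of `n_splits` folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])  # refit from scratch on each fold
        pred = model.predict(X[valid_idx])
        scores.append(np.sqrt(mean_squared_error(y[valid_idx], pred)))
    return np.array(scores)

# Synthetic regression problem: linear signal plus noise with std 0.1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = kfold_rmse(LinearRegression(), X_toy, y_toy)
print(scores.mean(), scores.std())
```

The mean of the fold scores is the number to compare across models; the spread gives a rough idea of how much a single train/validation split could mislead you.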

<span id="linear-regression"/>

### 3.2 Linear Regression

Fit a linear regression model to your data and report your RMSE.

In [13]:
from sklearn.linear_model import LinearRegression

In [14]:
# instantiating linear regression object (model)
lm = LinearRegression()

# fitting model on training sets
lm.fit(train_X, train_y)

# using model to predict on validation set
y_valid_pred = lm.predict(valid_X)

# IMPORTANT: This model is a "dumb" model that can predict negative values for some sale prices
# A sale price cannot be negative, so we clip those predictions to 0 before scoring
# Ideally you create a better model that doesn't predict negative prices
y_valid_pred[y_valid_pred < 0] = 0

# evaluating prediction on validation set
evaluate(y_valid_pred, valid_y)

ValueError: could not convert string to float: 'train8017'

<span id="reg" />

### 3.3 Regularized Regression

Fit a [LASSO regression model](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) to your data with $\lambda = 1$

In [15]:
from sklearn.linear_model import Lasso

In [16]:
# YOUR CODE HERE
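A sketch of what fitting a LASSO with `alpha=1` does, on synthetic data rather than the competition set. In scikit-learn, `Lasso` minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, so a larger `alpha` shrinks the coefficients harder and can set some of them exactly to zero. Comparing against plain least squares makes the shrinkage visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only the first two features matter in this toy target.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("LASSO coefficients:", np.round(lasso.coef_, 2))
```

On real data the features usually live on very different scales, and the L1 penalty treats all coefficients alike, so standardizing the features before fitting is worth considering.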

#### 3.3.1 Hyperparameter Tuning

Perform [5-fold cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) over the parameter $\lambda$, which is called **alpha** when you pass it into Lasso. Find the best value of $\lambda \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$, that is, the one with the lowest cross-validated RMSE, and report the **RMSE** on the validation set when you use this value.
PS: The given $\lambda$ list may not contain the optimal $\lambda$ for the model. Feel free to find better ones!

In [17]:
from sklearn.model_selection import KFold

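n_splits = 5
kf = KFold(n_splits=n_splits)

alphas = [1e-3, 5e-3, 1e-2, 5e-2, 0.1]

cv_scores = np.zeros(len(alphas))

# assumes X_train is all-numeric after feature engineering
for alphai, alpha in enumerate(alphas):
    print('Training alpha =', alpha, end='\t')
    scores = np.zeros(n_splits)
    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        # fit on the training folds, then score RMSE on the held-out fold
        fold_model = Lasso(alpha=alpha)
        fold_model.fit(X_train.iloc[train_index], y_train.iloc[train_index])
        fold_pred = fold_model.predict(X_train.iloc[test_index])
        scores[i] = evaluate(fold_pred, y_train.iloc[test_index])
    cv_scores[alphai] = scores.mean()
    print('RMSE = ', cv_scores[alphai])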

In [18]:
# lower RMSE is better, so pick the alpha with the smallest mean CV score
best_alpha = alphas[np.argmin(cv_scores)]
best_alpha

In [19]:
model = Lasso(alpha=best_alpha)
# fit on the raw target so predictions are on the scale the RMSE is computed on
model.fit(train_X, train_y)
training_rmse = evaluate(model.predict(train_X), train_y)
validation_rmse = evaluate(model.predict(valid_X), valid_y)

print('Training RMSE', training_rmse)
print('Validation RMSE', validation_rmse)

<span id="random-forest"/>

### 3.4 Random Forest

Fit a random forest model to your data and report your RMSE.

**NOTE:** If you're finding that your model is performing worse than your linear regression, make sure you tune the parameters to the RandomForestRegressor!

Try to understand what the parameters mean by looking at the Decision Trees lecture.

In [20]:
from sklearn.ensemble import RandomForestRegressor

In [21]:
# YOUR CODE HERE
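A possible starting point, shown on a synthetic nonlinear problem so it runs on its own. The hyperparameter values below are illustrative guesses, not tuned settings for the competition data. `n_estimators` is the number of trees, `max_depth` caps how deep each tree grows, and `min_samples_leaf` sets the smallest allowed leaf, which smooths the predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic target with a nonlinear dependence on the first two features.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(400, 4))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

rf = RandomForestRegressor(
    n_estimators=200,    # more trees: steadier predictions, slower training
    max_depth=None,      # let trees grow until the leaf-size limit stops them
    min_samples_leaf=2,  # larger values regularize each tree
    random_state=0,
)
rf.fit(tr_X, tr_y)
rf_rmse = np.sqrt(mean_squared_error(va_y, rf.predict(va_X)))
print(rf_rmse)
```

A useful sanity check is to compare the validation RMSE with the standard deviation of the validation target, which is roughly the RMSE of always predicting the mean.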

<span id="nn" />

### 3.5 Neural Network

This section is optional.

Train a neural network on the data. Report your RMSE.

**NOTE**: Neural networks take a long time to train, and it is better to train them on a GPU. Kaggle provides free weekly GPU usage (37 hours/week). To use a GPU, choose 'GPU' under Accelerator in the Settings panel on the right side of your screen.

In [22]:
# YOUR CODE HERE
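As a lightweight stand-in, the sketch below trains scikit-learn's `MLPRegressor` on synthetic data. It runs on the CPU only, so it does not use the Kaggle GPU; a Keras or PyTorch network would be the route for that. The layer sizes and iteration limit are arbitrary choices for the example. Scaling the inputs first matters, because neural networks train poorly on unscaled features.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(400, 4))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data only, then feeds the network.
nn = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
)
nn.fit(tr_X, tr_y)
nn_rmse = np.sqrt(mean_squared_error(va_y, nn.predict(va_X)))
print(nn_rmse)
```

For a target with a wide range, such as prices, scaling or log-transforming the target as well may help training; remember to invert the transformation before computing the RMSE.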

<span id="xgb" />

### 3.6 XGBoost (Stretch)

Now that we've tried many different types of models, it's time to bring out the big guns.

Below are hyperparameters for an XGBoost model: tinker around with these to achieve the best validation score (below). Learn about what some of the hyperparameters mean [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train).

**NOTE**: Feel free to reach out on Slack if you run into any trouble <3

In [23]:
from xgboost import train

In [24]:
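# illustrative starting values, not tuned; adjust them against your validation RMSE
params = {
    'eta': 0.1,               # learning rate
    'max_depth': 6,           # maximum depth of each tree
    'subsample': 0.8,         # fraction of rows sampled for each tree
    'colsample_bytree': 0.8,  # fraction of columns sampled for each tree
    'verbosity': 0,           # replaces the deprecated 'silent' flag
}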

In [25]:
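# the local `xgb` helper module is not available here, so this calls xgboost.train directly
from xgboost import DMatrix
dtrain = DMatrix(train_X, label=train_y)
booster = train(params, dtrain, num_boost_round=200)  # illustrative round count, tune it
xgb_preds = booster.predict(DMatrix(valid_X))
evaluate(xgb_preds, valid_y)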

## Prediction

In [26]:
# pred should be a 1-D array with one prediction per test row

In [27]:
sample_submission.shape

(9289, 2)

In [28]:
#model here should be your best model based on your validation RMSE; pred should be an array with one prediction per test row

#replace the following line with model.predict(X_test) or similar statements to generate your predictions
pred = np.ones((sample_submission.shape[0], 1))

In [29]:
pred.shape

(9289, 1)

In [30]:
#sanity check: you are predicting 9289 targets
#this statement should return True
pred.shape[0] == 9289

True
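Model predictions sometimes come back with shape `(n, 1)` rather than `(n,)`. The sketch below shows one way to flatten them and to clip negative prices before writing them into a submission frame. The `id` column name and the toy values are assumptions for illustration; use the columns of your real `sample_submission`.

```python
import numpy as np
import pandas as pd

n_test = 5
raw_pred = np.array([[250000.0], [-1200.0], [410000.0], [99000.0], [0.0]])  # shape (5, 1)

flat_pred = np.ravel(raw_pred)           # shape (5,)
flat_pred = np.clip(flat_pred, 0, None)  # a sale price cannot be negative

# Hypothetical submission frame with an id column and the target column.
toy_submission = pd.DataFrame({"id": [f"test{i}" for i in range(n_test)]})
toy_submission["SALE PRICE"] = flat_pred
print(toy_submission)
```

Checking `flat_pred.shape[0]` against the number of rows in the submission file before assigning catches most alignment mistakes early.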

## Submission

In [31]:
sample_submission['SALE PRICE'] = pred

In [32]:
sample_submission.to_csv("submission.csv", index=False)