# California Housing Prices Data Set

Create a regression model to predict house prices. For a description of the data, see https://www.kaggle.com/camnugent/california-housing-prices.



## Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

import seaborn as sns
sns.set()

## Load and inspect data set

In [None]:
# Fetch the file
my_file = project.get_file("housing.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
original_data = pd.read_csv(my_file)

original_data.head()

In [None]:
original_data.describe(include='all') # descriptive statistics for all columns

In [None]:
original_data.isnull().sum() # check for null values

In [None]:
original_data[original_data.duplicated(keep=False)] # check for duplicate rows

There are no duplicate rows, but "total_bedrooms" has missing values. Decide how to handle these null values:

In [None]:
data_wo_null = # your code
data_wo_null.isnull().sum() # check
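
Two common ways to handle the missing values are dropping the affected rows or filling them with a summary statistic. The sketch below shows both on a small made-up frame; `toy`, `dropped`, and `filled` are hypothetical names, and only the column names mirror the data set. Which option suits the exercise is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for original_data; the values are made up.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Option A: drop every row that contains a missing value.
dropped = toy.dropna()

# Option B: keep all rows and fill the gaps with the column median.
median_bedrooms = toy["total_bedrooms"].median()
filled = toy.fillna({"total_bedrooms": median_bedrooms})

print(dropped.shape, filled.shape)
print(filled.isnull().sum())
```

Dropping is simpler but discards rows; filling keeps every row at the cost of inventing values for one column.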

## Select predictors

Create a correlation map:

In [None]:
# your code
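
A correlation map starts from a correlation matrix of the numeric columns. The following is a minimal sketch on made-up values; `toy`, `corr`, `redundant`, and the 0.9 threshold are assumptions chosen for illustration, not part of the exercise.

```python
import pandas as pd

# Hypothetical stand-in for data_wo_null; the values are made up.
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274, 1627],
    "households": [126, 1138, 177, 219, 259],
    "median_income": [8.3, 8.3, 7.3, 5.6, 3.8],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

# Correlation is only defined for numeric columns, so exclude the categorical one.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))

# Feature pairs with |r| above an arbitrary 0.9 threshold are candidates for removal.
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.9
]
print(redundant)
```

To draw the map itself, the matrix can be passed to `sns.heatmap(corr, annot=True)`.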

Remove redundant features: 

In [None]:
data_reduced_features = data_wo_null[['<your feature 1>', '<your feature 2>','...']]
data_reduced_features.head()

## Remove outliers

The next step is to detect outliers and handle them:

In [None]:
data_reduced_features.hist(figsize=(25,25), bins=50)

In [None]:
q = # your code

data_reduced_features_2 = # your code
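
One simple way to handle outliers is to cut off the upper tail of a skewed column at a chosen quantile. The sketch below assumes this quantile approach; `toy`, `q_demo`, `trimmed`, and the 0.9 cut-off are hypothetical and picked only to suit the tiny example. On real data a higher quantile such as 0.99 may be more appropriate.

```python
import pandas as pd

# Hypothetical column with one extreme value at the upper end.
toy = pd.DataFrame(
    {"median_income": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 15.0]}
)

# Value below which 90% of the observations fall.
q_demo = toy["median_income"].quantile(0.9)

# Keep only the rows below the cut-off.
trimmed = toy[toy["median_income"] < q_demo]
print(q_demo, len(toy), len(trimmed))
```

The histograms from the previous cell help decide which columns need trimming and where a sensible cut-off lies.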

## Prepare data for modeling

Get dummies since there is a categorical feature:

In [None]:
dummies = # your code
dummies.head()
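
As a minimal sketch of dummy encoding, `pd.get_dummies` turns each category level into its own indicator column. The frame `toy` and the result name `encoded` are hypothetical; `drop_first=True` is one possible choice that avoids perfectly collinear columns in a linear model.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "median_income": [8.3, 7.3, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# drop_first=True drops the first level of each categorical column.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```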

Set X and y (predictors and target) according to your dataframe:

In [None]:
target = dummies['<your target column>']
predictors = # your code

Split data into training and test sets: 

In [None]:
X_train, X_test, y_train, y_test = # your code
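
The split can be sketched with `train_test_split` on made-up arrays. The names `X_demo`, `y_demo`, the 80/20 ratio, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```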

Use StandardScaler to scale your predictors:

In [None]:
scaler = # your code 
# your code

X_train = # your code
X_test = # your code
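
A minimal sketch of the scaling step, assuming the usual pattern of fitting the scaler on the training set only and reusing it for the test set, so that no information from the test data leaks into training. All names and values here are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test predictors with very different column scales.
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[2.0, 400.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(X_tr)  # learn mean and standard deviation from the training data only

X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuse the training statistics
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

`transform` returns NumPy arrays, which is why the later validation cells can call `.tolist()` on `X_test` directly.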

## Regression model and evaluation

Create a linear regression model: 

In [None]:
# your code
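
The cells that follow expect the fitted model in a variable named `reg`. As a rough sketch of the fitting step on made-up, noise-free data (`X_demo`, `y_demo`, and `reg_demo` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data with a known linear relationship and no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0

reg_demo = LinearRegression()
reg_demo.fit(X_demo, y_demo)

# The fitted coefficients should recover the slopes and intercept used above.
print(reg_demo.coef_, reg_demo.intercept_)
print(reg_demo.score(X_demo, y_demo))  # R^2 on the data the model was fit on
```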

In [None]:
print('training performance')
print(reg.score(X_train,y_train))
print('test performance')
print(reg.score(X_test,y_test))

In [None]:
y_pred = reg.predict(X_test)
test = pd.DataFrame({'Predicted':y_pred,'Actual':y_test})
fig = plt.figure(figsize=(16, 8))
test = test.reset_index(drop=True)
plt.plot(test[:50])
plt.legend(['Predicted', 'Actual'])  # same order as the DataFrame columns
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg');

Interpret the result and feel free to try out further analyses.
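
One possible further analysis is to report error metrics in the units of the target, using the `metrics` module imported at the top. The sketch below uses made-up numbers; `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

mae = metrics.mean_absolute_error(y_true, y_hat)           # average absolute error
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))  # penalizes large errors more
r2 = metrics.r2_score(y_true, y_hat)                       # same quantity as reg.score
print(mae, rmse, r2)
```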

# Deployment

Deploy the linear regression model via the _Watson Machine Learning_ (WML) service on IBM Cloud. Please refer to the documentation for more details about the [watson-machine-learning-client](https://pypi.org/project/watson-machine-learning-client/) or the [REST API](https://watson-ml-api.mybluemix.net/).

In [None]:
# reminder: make sure to review the list of stored artifacts (and delete artifacts that are no longer needed)

In [None]:
# import Python client library (documentation available at http://ibm-wml-api-pyclient.mybluemix.net/)
from ibm_watson_machine_learning import APIClient

In [None]:
# set your IBM Cloud API key 
api_key = "..."

# set the URL of your WML instance 
# depending on the region you chose during instance creation it will take one of the below values:
# - Frankfurt: https://eu-de.ml.cloud.ibm.com
# - Dallas: https://us-south.ml.cloud.ibm.com
# - London: https://eu-gb.ml.cloud.ibm.com
# - Tokyo: https://jp-tok.ml.cloud.ibm.com
wml_url = "https://us-south.ml.cloud.ibm.com"

In [None]:
# setup the API client
wml_client = # your code

In [None]:
# list all existing deployment spaces
wml_client.spaces.list()

In [None]:
# set the id of the deployment space you want to use as default
# your code

In [None]:
# setup required properties to store the model
software_spec_uid = wml_client.software_specifications.get_id_by_name("default_py3.7")
metadata = {
            wml_client.repository.ModelMetaNames.NAME: 'California Model',
            wml_client.repository.ModelMetaNames.TYPE: 'scikit-learn_0.23',
            wml_client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}
metadata

In [None]:
# assign your favorite model to the deployment_model variable
deployment_model = # your code
deployment_model

In [None]:
# store the scikit-learn model in WML
model = # your code

In [None]:
# review available models in your WML instance
wml_client.repository.list()

In [None]:
# retrieve the id of the model you stored
published_model_uid = wml_client.repository.get_model_uid(model)
published_model_uid

In [None]:
# setup required properties to deploy the model
metadata = {
    wml_client.deployments.ConfigurationMetaNames.NAME: "Deployment of California model",
    wml_client.deployments.ConfigurationMetaNames.ONLINE: {}
}

In [None]:
# deploy the model as a web service
# (an API endpoint is generated for your deployment, so your tools and apps
# can use a REST API to send data to your deployed model for analysis)
created_deployment = # your code

In [None]:
# keep the REST API endpoint for evaluation
scoring_endpoint = wml_client.deployments.get_scoring_href(created_deployment)
scoring_endpoint

## Deployment validation

Use the stored deployment to make a prediction.

In [None]:
# review test data
X_test[0:2]

In [None]:
y_test[0:2]

In [None]:
# import requests module
import requests

In [None]:
# setup the request payload as per the API documentation
scoring_values = X_test[0:2].tolist() # please note that the syntax is different than for the classification model since X_test is already an array
payload_scoring = {"input_data": [{"values": scoring_values}]}
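
To make the payload structure concrete, the sketch below builds the same dictionary from a made-up array and prints it as JSON. `X_demo`, `values_demo`, and `payload_demo` are hypothetical names; the `input_data` and `values` keys are the ones used in the cell above.

```python
import json
import numpy as np

# Hypothetical scaled predictors: 2 rows with 3 features each.
X_demo = np.array([[0.5, -1.2, 0.0], [1.1, 0.3, 1.0]])

# tolist() converts the array into nested Python lists, which are JSON-serializable.
values_demo = X_demo[0:2].tolist()
payload_demo = {"input_data": [{"values": values_demo}]}

print(json.dumps(payload_demo, indent=2))
```

Each inner list is one row to score, so the number of predictions returned should match the number of inner lists.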

In [None]:
# create a token to make an authenticated request
token_response = requests.post('https://iam.eu-de.bluemix.net/identity/token', data={"apikey": api_key, "grant_type": 'urn:ibm:params:oauth:grant-type:apikey'})
mltoken = token_response.json()["access_token"]

In [None]:
# send the scoring request
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
response_scoring = requests.post(f'{scoring_endpoint}?version=2020-10-10', json=payload_scoring, headers=header)
response_scoring.content

Do the results match your expectation? Are the estimations accurate?

In [None]:
# use the local model to make the same prediction in your notebook and compare the results
reg.predict(X_test[0:2])