# About this Notebook

#### Most of the content in this notebook was provided by Coursera as a final project. I took some of the original content and adjusted it to provide more context and instructions and to run in Google Colab, and I added a section at the end that uses our trained model on new data.

#### *Read the comments to see what content needs to be adjusted depending on whether you use the training data or your own data.*

# About the Dataset

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It was taken from [here](https://www.kaggle.com/harlfoxem/housesalesprediction?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-wwwcourseraorg-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01). It was also slightly modified.

| Variable      | Description                                                                                                 |
| ------------- | ----------------------------------------------------------------------------------------------------------- |
| id            | A notation for a house                                                                                      |
| date          | Date house was sold                                                                                         |
| price         | Price is prediction target                                                                                  |
| bedrooms      | Number of bedrooms                                                                                          |
| bathrooms     | Number of bathrooms                                                                                         |
| sqft_living   | Square footage of the home                                                                                  |
| sqft_lot      | Square footage of the lot                                                                                   |
| floors        | Total floors (levels) in house                                                                              |
| waterfront    | House which has a view to a waterfront                                                                      |
| view          | Has been viewed                                                                                             |
| condition     | How good the condition is overall                                                                           |
| grade         | Overall grade given to the housing unit, based on King County grading system                                |
| sqft_above    | Square footage of house apart from basement                                                                 |
| sqft_basement | Square footage of the basement                                                                              |
| yr_built      | Built Year                                                                                                  |
| yr_renovated  | Year when house was renovated                                                                               |
| zipcode       | Zip code                                                                                                    |
| lat           | Latitude coordinate                                                                                         |
| long          | Longitude coordinate                                                                                        |
| sqft_living15 | Living room area in 2015 (implies some renovations); this may or may not have affected the lot size area    |
| sqft_lot15    | Lot size area in 2015 (implies some renovations)                                                            |


In [None]:
# Download Required Libraries #
# No virtual env is created here: `myenv` was never created, and each `!` command runs in its own
# subshell, so `source myenv/bin/activate` would not carry over to the pip install below.
!pip install numpy pandas matplotlib seaborn scikit-learn requests pyodide-http

In [None]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [None]:
# Run if you receive errors from the imports below, then try them again #
#!pip uninstall -y numpy scipy pandas matplotlib seaborn scikit-learn
#!pip install numpy scipy pandas matplotlib seaborn scikit-learn

In [None]:
# Here for Troubleshooting #
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
import sklearn

print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("Scikit-learn version:", sklearn.__version__)

In [None]:
# Required imports #
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

# **Importing Data Sets**

##### Run all steps provided below if using the test data. Please read the comments if using personal data.

In [None]:
# Ignore if using local data #
import requests

def download(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
    else:
        print(f"Failed to download file. Status code: {response.status_code}")


In [None]:
# Replace if using local data #

filepath='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv'

In [None]:
# If using different dataset, replace "housing.csv" and file_name with dataset details #
download(filepath, "housing.csv")
file_name="housing.csv"

In [None]:
# Create dataframe #
df = pd.read_csv(file_name)

In [None]:
# Review the data; uncomment a different line to view different stats #

#df.head()
#df.dtypes
df.describe()

# **Data Wrangling**

##### Our data set contains some unnecessary columns that would be difficult to deal with down the line, so we will remove them now. You may not need to do this with your data, or you can replace the column names as necessary.

In [None]:
# Replace the names in the list to drop those columns #

df.drop(['date', 'id', 'Unnamed: 0'], axis=1, inplace=True)

In [None]:
# The next few cells find and deal with missing values #

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())

In [None]:
# We use the column average (mean) to replace missing values #
# Assigning the result back avoids the chained inplace replace, which newer pandas versions warn about #

df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].mean())

df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].mean())

In [None]:
# Confirm values were handled properly #

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
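
If you want to convince yourself that mean imputation behaves as expected before applying it to real data, a tiny frame with hand-checkable values helps. The sketch below is only an illustration: the `toy` frame and its values are made up, and it uses `fillna` with assignment rather than an inplace replace.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "bedrooms": [2.0, 3.0, np.nan, 5.0],
    "bathrooms": [1.0, np.nan, 2.0, 3.0],
})

# Count missing values for every column at once
print(toy.isnull().sum())

# Fill each column's gaps with that column's mean (NaN is skipped when averaging)
for col in ["bedrooms", "bathrooms"]:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

The same loop should work on your own data if you swap in your column names, as long as those columns are numeric.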

# **Exploratory Data Analysis**

##### Here, we will examine different plot types and explore the correlation between different variables. Read the comments for more details. Most of this section is not required for model training, but it helps you understand your data better!

In [None]:
# Count the number of houses for each unique floor value #
# NOT REQUIRED #

floor_count = df['floors'].value_counts().to_frame()

print(floor_count)

In [None]:
# Here we use a Boxplot to determine whether houses with or without a waterfront view have more price outliers #
# NOT REQUIRED #

sns.boxplot(x='waterfront', y='price', data=df)

plt.show()

In [None]:
# Use Regplot to determine if the feature sqft_above is correlated to price #
# NOT REQUIRED #

sns.regplot(x='sqft_above', y='price', data=df, line_kws={'color': 'magenta'})
plt.ylim(0,)
plt.show()

In [None]:
# Determine correlation strength between features #
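# NOT REQUIRED BUT USEFUL #

# numeric_only=True skips non-numeric columns (helpful if your own data contains text columns)
df.corr(numeric_only=True)['price'].sort_values()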

# View correlation between ALL features
#df.corr()
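
Sorting the raw correlation values puts strong negative correlations at the opposite end of the list from strong positive ones, so it can help to rank by absolute value when choosing features. This is a small sketch on made-up numbers (the `toy` frame is hypothetical), not output from the King County data.

```python
import pandas as pd

# Hypothetical values: price tracks sqft_living closely and long barely at all
toy = pd.DataFrame({
    "sqft_living": [800, 1200, 1500, 2000, 2600],
    "long": [-122.3, -122.1, -122.4, -122.2, -122.3],
    "price": [200000, 310000, 370000, 500000, 640000],
})

# Pearson correlation of every numeric column with the target, minus the target itself
corr_with_price = toy.corr(numeric_only=True)["price"].drop("price")

# Rank by absolute value so strong negative correlations are not overlooked
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)
```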

# **Model Development!**

##### The portion we're all here for: we will begin to develop our model. See the comments for additional details.

### Below we create a ***Linear Regression Model***, using a single dependent feature and a single independent feature

In [None]:
# Fit a Linear Regression model. Change variables X and Y as needed
# Refer back to df.corr() to see which variables correlate more strongly with each other (e.g. sqft_living:price vs. long:price)
# Tip: X is the independent variable, and Y is dependent on X

X = df[['sqft_living']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y)
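
The R^2 score tells you how well the line fits, but the fitted slope and intercept are what actually produce a prediction. The sketch below uses made-up data that follows an exact line (price = 50,000 + 200 per square foot, purely an assumption for illustration) so the recovered values are easy to check.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from price = 50_000 + 200 * sqft_living
toy = pd.DataFrame({"sqft_living": [1000, 1500, 2000, 2500]})
toy["price"] = 50_000 + 200 * toy["sqft_living"]

toy_lm = LinearRegression()
toy_lm.fit(toy[["sqft_living"]], toy["price"])

print("slope:", toy_lm.coef_[0])        # price change per extra square foot
print("intercept:", toy_lm.intercept_)  # predicted price at 0 square feet
print("R^2:", toy_lm.score(toy[["sqft_living"]], toy["price"]))

# Predict for a new house; the column name must match the one used in fit
new_house = pd.DataFrame({"sqft_living": [1800]})
toy_pred = toy_lm.predict(new_house)[0]
print("prediction:", toy_pred)
```

On the real data, `lm.coef_` and `lm.intercept_` can be read the same way after `lm.fit(X, Y)`.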

### Here we will create a ***Multilinear Regression Model***, using a list of independent features and still using a single dependent feature ('price')

In [None]:
# Adjust list if using your own data

features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]

In [None]:
# Use our list 'features' to train the model #
# You should notice that the R^2 score is higher than using the Linear Model #

X = df[features]
Y = df['price']

# Train the Multiple Linear Regression Model #
multi_lm = LinearRegression()
multi_lm.fit(X, Y)
multi_lm.score(X, Y)

Create a list of tuples. The first element of each tuple contains the name of the estimator:

<code>'scale'</code>

<code>'polynomial'</code>

<code>'model'</code>

The second element of each tuple contains the model constructor:

<code>StandardScaler()</code>

<code>PolynomialFeatures(include_bias=False)</code>

<code>LinearRegression()</code>

In [None]:
Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]

In [None]:
# We use the list to create a pipeline object to predict price, fitting the object using 'features', then we calculate the R^2 Score #

from sklearn.metrics import r2_score

pipe = Pipeline(Input)
Z = X.astype(float)
pipe.fit(Z, Y)
y_pipe = pipe.predict(Z)
print(r2_score(Y, y_pipe))
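
To see why the polynomial step can raise the R^2 score, it helps to try the same three-step pipeline on data where the relationship is clearly curved. This sketch uses synthetic data (y = x squared, an assumption chosen for illustration) and compares a plain linear fit with the scale, polynomial, model pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic curved data: a straight line cannot follow y = x^2
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()

# Plain linear regression on the raw feature
linear_r2 = LinearRegression().fit(x, y).score(x, y)

# Same steps as the notebook's pipeline; PolynomialFeatures defaults to degree 2
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
])
toy_pipe.fit(x, y)
pipe_r2 = r2_score(y, toy_pipe.predict(x))

print("linear R^2:", linear_r2)
print("pipeline R^2:", pipe_r2)
```

Both scores here are computed on the training data, so they say nothing about how the model handles unseen houses.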

# Model Evaluation and Refinement

In [None]:
# Import required modules #

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [None]:
# Split data into training and testing sets #

X = df[features]
Y = df['price']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)


print("number of test samples:", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
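
`cross_val_score` is imported above but not used in the cells that follow. As a rough sketch of what it offers, the example below scores a linear model on several train/test splits instead of a single one. The data is synthetic (random features with known weights, an assumption for illustration), and `cv=4` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 200 rows, 3 features, known weights plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Each of the 4 folds is held out once; regressors are scored with R^2 by default
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=4)

print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```

With the housing data, the equivalent call would pass `X` and `Y` in place of the toy arrays. A large spread between folds suggests the single-split score depends heavily on which rows landed in the test set.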

In [None]:
# Import Ridge; the next cells create and fit Ridge Regression Models #
# Ridge Models attempt to limit over-fitting by adding a 'penalty' to the loss function #

from sklearn.linear_model import Ridge

In [None]:
# NOT REQUIRED but good to compare to the model after this one #

RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train, y_train)
yhat = RidgeModel.predict(x_test)
print(r2_score(y_test, yhat))

In [None]:
# Apply a second-order polynomial transform to both the training and testing sets #
# Create and train another Ridge Regression Model using the training data, then calculate the R^2 Score using the test data #

pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)
x_test_pr = pr.transform(x_test)

RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train_pr, y_train)

yhat = RidgeModel.predict(x_test_pr)

print(r2_score(y_test, yhat))
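
The `alpha` argument controls how strong the Ridge penalty is: larger values pull the coefficients closer to zero. The sketch below shows that effect on synthetic data (the features, weights, and the list of alphas are all assumptions for illustration), which may help when you experiment with values other than 0.1.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 5 features, only the first three actually influence y
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 5))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(size=60)

# Fit one model per alpha and record the size (L2 norm) of its coefficient vector
coef_norms = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: coefficient norm = {coef_norms[alpha]:.4f}")
```

A very small alpha behaves almost like plain linear regression, while a very large one shrinks the coefficients so much that the model underfits.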


# Attempting to Predict data
##### Here we will try to predict the price of homes using our trained model. Remember to replace variables if you used your own data

In [None]:
# Training data is based on King County info. Keep this in mind if changing 'lat'!!! #

new_house_features = pd.DataFrame({
    'floors': [2],
    'waterfront': [0],
    'lat': [47.5112],
    'bedrooms': [3],
    'sqft_basement': [0],
    'view': [0],
    'bathrooms': [1.0],
    'sqft_living15': [1340],
    'sqft_above': [1340],
    'grade': [7],
    'sqft_living': [1340]
})

# Transform the new house's features using the same PolynomialFeatures object
new_house_features_poly = pr.transform(new_house_features)

# Predict the price using the Ridge regression model
predicted_price = RidgeModel.predict(new_house_features_poly)
print('Predicted Home Price:', predicted_price)
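
One thing to watch when predicting on new data is that the columns must match the training features, in the same order. A possible way to make this less error-prone is to bundle the transform and the model into one pipeline and select the columns by a shared list. The sketch below is a variation, not the notebook's exact method: the toy data, the `toy_features` list, and the added `StandardScaler` step are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical training data with two features
toy_features = ["sqft_living", "grade"]
train = pd.DataFrame({
    "sqft_living": [900, 1400, 1800, 2300, 3000, 3500],
    "grade": [6, 7, 7, 8, 9, 10],
    "price": [210000, 320000, 400000, 520000, 700000, 850000],
})

# One object handles scaling, the polynomial transform, and the Ridge model
toy_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=0.1),
)
toy_model.fit(train[toy_features], train["price"])

# New data typed in a different column order; indexing by the list restores the order
new_house = pd.DataFrame({"grade": [8], "sqft_living": [2000]})
predicted = toy_model.predict(new_house[toy_features])
print("Predicted Home Price:", predicted[0])
```

With the notebook's own objects, selecting `new_house_features[features]` before calling `pr.transform` gives the same protection against a mistyped column order.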

# **Authored by:** Noah Bonaguidi
# bonaguidin@gmail.com

###  Coursera Data Analysis with Python W/ a focus on using the predictive model