## Instructions {-}

1. Please answer the following questions as part of your project proposal.

2. Write your answers in the *Markdown* cells of the Jupyter notebook. You don't need to write any code, but if you want to, you may use the *Code* cells.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The project proposal is worth 8 points, and is due on **18th April 2023 at 11:59 pm**. 

5. You must make one submission as a group, and not individually.

6. Maintaining a GitHub repository is optional, though encouraged for the project.

7. Share the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0) (optional).

# 1) Team name
Mention your team name.

*(0 points)*

Python Princesses

# 2) Member names
Mention the names of your team members.

*(0 points)*

Avanti Parkhe, Ava Serin, Emily Zhang, Ada Zhong

# 3) Link to the GitHub repository (optional)
Share the link of the team's project repository on GitHub.

Also, put the link of your project's GitHub repository [here](https://github.com/avaserin/Stat_303-3_Project).

We believe there is no harm in having other teams view your GitHub repository. However, if you don't want anyone to see your team's work, you may make the repository *Private* and add your instructor and graduate TA as *Collaborators* in it.

*(0 points)*

# 4) Topic
Mention the topic of your course project.

*(0 points)*

Predicting median house prices of Boston neighborhoods

# 5) Problem statement

*(4 points)*

Explain the problem statement. The problem statement must include:

## 5a) The problem

We are trying to predict the median house price of Boston neighborhoods/towns based on a number of features. We chose this problem because we are college students nearing graduation, finding housing after graduation is a complex process, and we want to learn which factors influence house prices. One member of our group is moving to Boston after graduation, so this dataset fits our needs particularly well.

## 5b) Type of response
Is it about predicting a continuous response or a binary response or a combination of both?

We are predicting a continuous response (median house price).

## 5c) Performance metric
How will you assess model accuracy?

  - If it is a classification problem, then which measure(s) will you optimize for your model – precision, recall, false negative rate (FNR), accuracy, ROC-AUC etc., and why?
  - If it is a regression problem, then which measure(s) will you optimize for your model – RMSE (Root mean squared error), MAE (mean absolute error), maximum absolute error etc., and why?

We currently plan to optimize RMSE. Because RMSE squares each error before averaging, it penalizes large errors more heavily than MAE does, so optimizing it discourages predictions that miss badly. Large prediction errors affect consumers' purchasing decisions, given how high housing prices are, and reduce the accuracy of the information provided to stakeholders. We want to penalize larger errors to minimize these negative consequences.
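
To illustrate why RMSE is more sensitive to large errors than MAE, here is a minimal sketch. The prices are made-up values in $1000s, not rows from the Boston data, and the helper functions `rmse` and `mae` are written out only for illustration. Both sets of predictions have the same total absolute error, but one concentrates it in a single large miss.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))

# Hypothetical median values in $1000s (illustrative only)
y_true = np.array([20.0, 22.0, 25.0, 30.0])

# Same total absolute error (8.0), distributed differently
even_errors = y_true + np.array([2.0, -2.0, 2.0, -2.0])
one_big_error = y_true + np.array([0.0, 0.0, 0.0, 8.0])

mae_even, mae_big = mae(y_true, even_errors), mae(y_true, one_big_error)
rmse_even, rmse_big = rmse(y_true, even_errors), rmse(y_true, one_big_error)

print("MAE :", mae_even, mae_big)    # identical for both prediction sets
print("RMSE:", rmse_even, rmse_big)  # larger when the error is concentrated
```

MAE cannot tell the two cases apart, while RMSE doubles when the error is concentrated in one prediction.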

## 5d) Naive model accuracy
What is the accuracy of the naive model (Standard deviation of response in case of continuous response / proportion of the majority class in case of classification model)

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.35)

# sklearn has 100s of models - grouped in sublibraries, such as linear_model
from sklearn.linear_model import LogisticRegression, LinearRegression

# sklearn has many tools for cleaning/processing data, also grouped in sublibraries
# splitting one dataset into train and test, computing cross validation score, cross validated prediction
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

#sklearn module for scaling data
from sklearn.preprocessing import StandardScaler

#sklearn modules for computing the performance metrics
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score, \
roc_curve, auc, precision_score, recall_score, confusion_matrix

data = pd.read_csv('boston.csv')

In [2]:
# Separating the predictors and response - THIS IS HOW ALL SKLEARN OBJECTS ACCEPT DATA (different from statsmodels)
y = data.MEDV
X = data.drop("MEDV", axis = 1)

# Creating training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 45)

# Scale the predictors so that features with different orders of magnitude are comparable.
# The scaler is fit on the training data only and then applied to both training and test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # Do NOT refit the scaler with the test data, just transform it.

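# NOTE: the model is fit on the full, unscaled data (X, y), which includes the test rows,
# so the test RMSE reported below is optimistic. Fitting on the training data alone would avoid this.
lr_model = LinearRegression()
lr_model.fit(X, y);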

In [3]:
y_std = np.std(y)
print("Standard deviation of response variable: ", y_std)


Standard deviation of response variable:  9.188011545278206


In [4]:
# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the values are unchanged
print("RMSE on train data:")
print("Linear regression:", np.sqrt(mean_squared_error(y_train, lr_model.predict(X_train))))

print("RMSE on test data:")
print("Linear regression:", np.sqrt(mean_squared_error(y_test, lr_model.predict(X_test))))


RMSE on train data:
Linear regression: 4.534072623998566
RMSE on test data:
Linear regression: 5.214447779533273


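The naive model, which always predicts the mean of the response, has an RMSE equal to the standard deviation of the response: about 9.19 (MEDV is measured in $1000s). For comparison, a baseline linear regression has an RMSE of about 4.53 on the training data and about 5.21 on the test data.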
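
The link between the naive model and the standard deviation can be checked directly. The sketch below uses a synthetic response as a stand-in for MEDV (the values are randomly generated, not taken from `boston.csv`) and shows that a model that always predicts the mean has an RMSE equal to the population standard deviation, which is what `np.std` computes by default (`ddof=0`).

```python
import numpy as np

# Synthetic stand-in for the response; NOT the real MEDV values
rng = np.random.default_rng(0)
y = rng.normal(loc=22.5, scale=9.0, size=506)

# Naive model: predict the mean of the response for every observation
naive_pred = np.full_like(y, y.mean())
naive_rmse = float(np.sqrt(np.mean((y - naive_pred) ** 2)))

# Population standard deviation (np.std uses ddof=0 by default)
y_std = float(np.std(y))

print("Naive RMSE:", naive_rmse)
print("Std of y  :", y_std)
```

The two numbers agree up to floating-point error when both are computed on the same data, so any model worth keeping should achieve an RMSE clearly below the standard deviation of the response.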

# 6) Data

## 6a) Source
What data sources will you use, and how will the data help solve the problem? Explain.
If the data is open source, share the link of the data.

*(0.5 point)*

We are using a dataset published on the University of Toronto's DELVE (Data for Evaluating Learning in Valid Experiments) website. It contains information collected by the U.S. Census Service on housing in the Boston area. The dataset came as a .tar.gz file, which we converted into a CSV file. The [documentation](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) and [data](https://www.cs.toronto.edu/~delve/data/boston/desc.html) are linked here.

## 6b) Response & predictors
What is the response, and mention some of the predictors.

*(0.5 point)*

The response variable is MEDV, the median value of owner-occupied homes in $1000s. Some of the predictors are CRIM (per capita crime rate by town), RM (average number of rooms per dwelling), PTRATIO (pupil-teacher ratio by town), and TAX (full-value property-tax rate per $10,000).

## 6c) Size
What is the number of continuous predictors, categorical predictors, and observations in your dataset(s). If you are using multiple datasets, please provide the information for each dataset. When counting predictors, count only those that have sufficient non-missing values, and will be useful.

*(1 point)*

Number of continuous predictors: 12 (the dataset has 14 attributes, one of which is the response MEDV)

Number of categorical predictors: 1 (CHAS, a binary indicator)

Number of observations in the dataset: 506
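
Counts like these can be derived from the data rather than tallied by hand. The sketch below uses a small made-up frame whose column names mirror the Boston data (the values are illustrative only) and assumes, as a simple heuristic, that a predictor with at most two distinct values is categorical.

```python
import pandas as pd

# Hypothetical miniature frame; column names mirror the Boston data, values are made up
data = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.027, 0.032, 0.069],
    "RM":   [6.6, 6.4, 7.2, 7.0, 7.1],
    "CHAS": [0, 0, 1, 0, 1],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})

# Separate the predictors from the response
X = data.drop(columns="MEDV")

# Assumption: at most 2 distinct values means the predictor is categorical
is_categorical = X.nunique() <= 2
n_categorical = int(is_categorical.sum())
n_continuous = int((~is_categorical).sum())
n_observations = len(data)

print("Continuous predictors :", n_continuous)
print("Categorical predictors:", n_categorical)
print("Observations          :", n_observations)
```

The threshold of two distinct values is a judgment call. An index-like predictor with a handful of levels could reasonably be counted either way, so the rule should be stated alongside the counts.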

# 7) Existing solutions
Are there existing solutions of your problem? Almost all Kaggle datasets have existing solutions. If yes, then how do you plan to build upon those solutions? **What is the highest model accuracy / performance achieved in the existing solutions?**

*(1 point)*

Although we did not obtain our dataset from Kaggle, existing solutions for it are available there. Those solutions mostly report cross-validated R² scores, the highest of which is around 0.85. We plan to build more in-depth models than these solutions and to evaluate them with a different metric, RMSE, as a way to distinguish our project. We will also tune different models with various C values and add interactions (poly(), interaction terms, etc.).
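
As a rough sketch of how cross-validated RMSE could be reported alongside cross-validated R², the example below uses scikit-learn's `cross_val_score` with the `"neg_root_mean_squared_error"` and `"r2"` scorers. The data come from `make_regression`, not from the Boston dataset, and the pipeline (scaling, pairwise interaction terms, linear regression) is only one possible configuration, not the model we have settled on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression data as a stand-in for the Boston predictors and response
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scaling and interaction terms live inside the pipeline, so each
# cross-validation fold fits them on its own training portion only
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)

# scikit-learn maximizes scores, so RMSE is returned negated
neg_rmse = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
cv_rmse = -neg_rmse
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")

print("Cross-validated RMSE per fold:", np.round(cv_rmse, 2))
print("Mean cross-validated RMSE    :", round(float(cv_rmse.mean()), 2))
print("Mean cross-validated R^2     :", round(float(cv_r2.mean()), 3))
```

Keeping the preprocessing inside the pipeline means no information from a held-out fold leaks into the fitted scaler or model, which makes the two metrics comparable across candidate models.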


# 8) Stakeholders

Who are the stakeholders, and how will your project benefit them? Explain.

*(1 point)*

Our primary stakeholders are current and prospective Boston residents, local realtors, and the Boston Housing Authority. Current and prospective residents will be able to predict house prices better and budget more effectively. Local realtors will be able to price houses more accurately and will have better information on which houses to present to which clients. The Boston Housing Authority will gain more information on which neighborhoods have more affordable housing.