Welcome! This notebook is a data science and machine learning project.



# Framing the Problem and Looking at the Big Picture

The objective of this project is to build a model of median rental rates in California using city data from Towncharts.com, which collates information from a large variety of government agencies and public data sources.

Hypothetically, this model would be built to serve a particular purpose in a company, and that purpose would in turn determine the specifications and approach of the project.

In this project, our supervisor Adam Smith has told us that our model's output will be fed to another Machine Learning system along with many other signals. This downstream system will give current property owners insight into setting their rental amount. Getting this right is critical: the price must be low enough to attract applicants yet high enough to cover costs. Pricing units too low may attract problematic tenants, while pricing them too high can lead to longer vacancies.

Current solutions and workarounds: Mr. Smith says that estimates of median rental rates are currently outsourced. This is costly, and the estimates are off by more than 200 dollars. That is why the company has chosen to investigate whether an in-house estimate based on publicly available data can match or improve upon the outsourced estimates.


This project is a typical supervised learning task, since the training examples are labeled (each instance comes with the expected output, i.e., the city's median rental rate). Moreover, we are asked to predict the value of $Y$ (median rental rate) using $X_{1}, \dots, X_{n}$, where **n** = *number of features*. This is a typical regression task, in particular a multiple regression problem. Additionally, since there is no continuous flow of data coming into the ML system, we do not need to adjust rapidly to changing data, and the data is small enough to fit in memory, so plain batch learning should do just fine.
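
As an illustration of what a multiple regression task looks like in code, the following minimal sketch fits a linear model (an intercept plus one weight per feature) by ordinary least squares. The data is synthetic, and the helper names `fit_linear` and `predict_linear` are hypothetical; it is only meant to show the shape of the problem, not the model this project ends up using.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares fit; returns [intercept, weight_1, ..., weight_k]."""
    X_b = np.c_[np.ones(len(X)), X]  # prepend a column of ones for the intercept
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta

def predict_linear(theta, X):
    """Apply fitted parameters to feature rows."""
    return theta[0] + X @ theta[1:]

# Synthetic data: 50 hypothetical cities, 2 made-up features,
# and a known noise-free relationship so the fit can be checked by eye.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 2))
y = 800 + 120 * X[:, 0] - 35 * X[:, 1]

theta = fit_linear(X, y)
print(np.round(theta, 2))  # recovers the intercept and the two weights used above
```

Because the whole data set is passed to the fit at once, this is also a small instance of batch learning.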


We need a performance measure for our ML system. We have chosen the Root Mean Square Error (RMSE). This measure aligns with the business objective because it gives an idea of how much error the system typically makes in its predictions, in the same units as the target (dollars). To meet the business objective, we need an RMSE of at most 200.

> $RMSE(X,h) = \sqrt{ \frac 1m \sum\limits_{i=1}^m ( h(x^i) - y^i )^2 }$

Here $m$ is the number of instances (cities) being evaluated, $x^i$ is the feature vector of the $i$-th instance, $y^i$ is its label, and $h$ is the model's prediction function.

If our data has many outlier cities, we may consider using the Mean Absolute Error (MAE) instead, since it is less sensitive to large errors:

> $MAE(X,h) = \frac 1m \sum\limits_{i=1}^m \mid h(x^i) - y^i \mid$
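
The two measures can be compared with a short sketch. The functions `rmse` and `mae` are hypothetical helpers written directly from the formulas above, and the rent figures are invented for illustration. The second set of predictions contains one badly mispredicted city to show how a single outlier affects each measure.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Made-up median rents for five cities (illustrative numbers only)
y_true = [1000, 1500, 2000, 1200, 1800]

# Every prediction is off by exactly 100 dollars
y_pred = [1100, 1400, 2100, 1100, 1900]
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # 100.0 100.0

# Same predictions, except the last city is now off by 900 dollars
y_pred_outlier = [1100, 1400, 2100, 1100, 2700]
print(round(rmse(y_true, y_pred_outlier), 2))  # 412.31
print(mae(y_true, y_pred_outlier))             # 260.0

# Check against the project's target of an RMSE of at most 200
print(rmse(y_true, y_pred) <= 200)  # True
```

With identical errors the two measures agree; with one large error the RMSE grows much faster than the MAE, which is why the MAE may be preferable when there are many outlier cities.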

After framing the problem and looking at the big picture, it is good practice to list and verify the assumptions made so far (by me and others). This helps avoid mismatched expectations and confirms that the characteristics of my system's output correspond to the signal that the downstream system expects.

# Data Collection

The data was gathered by web scraping Towncharts.com. The website provides information and data about every geographic location in the United States, including city, county, zip code, state, and more. The additional variables latitude and longitude for the cities were merged from a [Kaggle dataset](https://www.kaggle.com/camnugent/california-housing-feature-engineering#cal_cities_lat_long.csv). Typically I would be asked, and it would be good practice, to import the CSV file into a table in the company's chosen database. This enables collaboration, since all a coworker would need to access the data is their credentials and access authorizations.
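
The merge of coordinates onto the scraped city data can be sketched as follows. This is a minimal illustration on toy frames, not the code that produced the final CSV: the frame names and the `Name` column of the coordinates table are assumptions, and Albany is deliberately left out of the coordinates to show how unmatched cities can be detected.

```python
import pandas as pd

# Toy stand-in for the scraped Towncharts table (column names assumed)
towncharts = pd.DataFrame({
    "city": ["Adelanto", "Alameda", "Albany"],
    "med_rental_rate": [1059.0, 1607.0, 1739.0],
})

# Toy stand-in for the Kaggle coordinates file (column names assumed)
coords = pd.DataFrame({
    "Name": ["Adelanto", "Alameda", "Alhambra"],
    "Latitude": [34.582769, 37.765206, 34.095286],
    "Longitude": [-117.409214, -122.241636, -118.127014],
})

# A left join keeps every scraped city; indicator=True adds a `_merge`
# column recording whether a matching coordinate row was found.
merged = towncharts.merge(
    coords, how="left", left_on="city", right_on="Name", indicator=True
).drop(columns="Name")

# Cities with no coordinates get NaN latitude/longitude
unmatched = merged.loc[merged["_merge"] == "left_only", "city"].tolist()
print(unmatched)  # ['Albany']
```

Checking the unmatched list before dropping the `_merge` column is a cheap way to catch city names that are spelled differently in the two sources.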

For more information about how string methods, pandas, Beautiful Soup, and more were used to produce Towncharts_California_Housing.csv, the following link outlines the process:
> https://github.com/clazaro97chosen/American-Community-Survey-Project/blob/master/Scrape_the_Data.ipynb 

Checking the legal obligations in the user agreement on Towncharts.com shows that use of this data is allowed and encouraged.

**Download the Data**

Best practice calls for me to use a Python file, which I have created, for fetching the data, and to write a function to load the data.

# Setup

```python
# Common imports
import numpy as np
import pandas as pd
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
```

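```python
def load_housing_data(housing_path):
    """Load the housing CSV found in `housing_path`, using its first column as the index."""
    csv_path = os.path.join(housing_path, 'california_housing.csv')
    return pd.read_csv(csv_path, index_col=0)
```

```python
# The CSV is expected in a local `datasets` directory
housing = load_housing_data('datasets')
```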

Data dictionary (provided by Towncharts.com):

* city_name: Name of the city
* housing_units: Total number of housing units in the area
* housing_density: The number of housing units per square mile in the area
* change_hunits: Change in housing units from 2010 to 2017
* percent_of_rent_to_total: The percent of all occupied housing units that are rental housing units (%)
* owned_homes: The percent of all occupied housing units that are owned housing units (%)
* med_homeval: Median home value, i.e., how much the property (house and lot, mobile home and lot, or condominium unit) would be worth if it were for sale (\$)
* med_rental_rate: The median monthly rental amount for a rental unit in this area (\$)
* med_owner_cost: The monthly cost of housing for property owners, including mortgage payment, taxes, insurance, and utilities (\$)
* med_own_cost_aspercentof_income: The monthly owner cost as a percent of the household income. This measure is an excellent way to understand how affordable housing is for owners in an area (%)
* med_hval_aspercentof_medearn: How much the property (house and lot, mobile home and lot, or condominium unit) would be worth if it were for sale, as a percent of the median earnings for a worker in the area (%)
* med_hcost_ownmortg: Median housing cost for homeowners with a mortgage, including the cost of the mortgage or other debt (\$)
* med_hcost_own_wo_mortg: Median housing cost for homeowners who do not have a mortgage. This isolates the cost of ownership separate from the financing cost of debt (\$)
* hcost_aspercentof_hincome_ownmortg: Total cost (including mortgage debt) for homeowners with a mortgage, as a percent of household income (%)
* hcost_as_perc_of_hincome_womortg: Total cost for homeowners without a mortgage, as a percent of household income (%)
* med_real_estate_taxes: The median real estate taxes paid by owners of homes in the area (\$)
* family_members_per_hunit: The average number of related family members who live together in a housing unit
* median_num_ofrooms: The average number of total rooms for housing units in the area
* median_year_house_built: The average year the housing units were built in the area. This indicates the average age of housing units in the area
* household_size_of_howners: For people who own their homes, how many people on average live in them, whether or not they are related
* household_size_for_renters: The average size of a household for people who are renting
* med_year_moved_in_for_owners: The median year that a homeowner moved into their home
* med_year_renter_moved_in: The median year that a renter moved into their home
* The following variables give monthly rental rates by size of rental (in bedrooms) as a percentage:
  * studio_1000_1499, studio_1500_more, studio_750_999
  * onebed_1000_1499, onebed_1500_more, onebed_750_999
  * twobed_1000_1499, twobed_1500_more, twobed_750_999
  * threebed_1000_1499, threebed_1500_more, threebed_750_999
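
The definitions above suggest a simple sanity check: since `percent_of_rent_to_total` and `owned_homes` are both shares of all occupied housing units, they should add up to roughly 100 for every city. The sketch below tries this on the five cities shown in the data preview in this notebook. The helper name `tenure_shares_consistent` and the tolerance are assumptions made for illustration.

```python
import pandas as pd

# Tenure columns for the five cities shown in the data preview
sample = pd.DataFrame({
    "city": ["Adelanto", "Agoura Hills", "Alameda", "Albany", "Alhambra"],
    "percent_of_rent_to_total": [50.7, 25.6, 53.0, 52.4, 59.7],
    "owned_homes": [49.3, 74.4, 47.0, 47.6, 40.3],
})

def tenure_shares_consistent(df, tol=0.2):
    """Return a boolean Series: True where rental + owned shares sum to ~100%.

    `tol` allows for the values being rounded to one decimal place.
    """
    total = df["percent_of_rent_to_total"] + df["owned_homes"]
    return (total - 100).abs() <= tol

# List any cities whose shares do not add up
inconsistent = sample.loc[~tenure_shares_consistent(sample), "city"].tolist()
print(inconsistent)  # []
```

Run against the full data set, a non-empty list would point to scraping or parsing problems worth investigating before modeling.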

```python
housing.head()
```

| Unnamed: 0 | housing_units | housing_density | change_hunits | percent_of_rent_to_total | owned_homes | med_homeval | med_rental_rate | med_owner_cost | med_own_cost_aspercentof_income | med_hval_aspercentof_medearn | ... | onebed_750_999 | twobed_1000_1499 | twobed_1500_more | twobed_750_999 | threebed_1000_1499 | threebed_1500_more | threebed_750_999 | city | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8751.0 | 156.2 | -3.7 | 50.7 | 49.3 | 151600.0 | 1059.0 | 1093.0 | 24.0 | 427.0 | ... | 0.096 | 0.131 | 0.032 | 0.508 | 0.583 | 0.137 | 0.181 | Adelanto | 34.582769 | -117.409214 |
| 1 | 7674.0 | 984.7 | 1.2 | 25.6 | 74.4 | 745000.0 | 2261.0 | 2488.0 | 23.0 | 928.0 | ... | 0.0 | 0.064 | 0.899 | 0.038 | 0.0 | 0.962 | 0.0 | Agoura Hills | 34.153339 | -118.761675 |
| 2 | 32414.0 | 3104.1 | 0.2 | 53.0 | 47.0 | 729100.0 | 1607.0 | 2259.0 | 21.0 | 1006.0 | ... | 0.058 | 0.261 | 0.633 | 0.019 | 0.119 | 0.667 | 0.036 | Alameda | 37.765206 | -122.241636 |
| 3 | 7724.0 | 4319.4 | -2.1 | 52.4 | 47.6 | 766000.0 | 1739.0 | 2501.0 | 21.0 | 991.0 | ... | 0.031 | 0.138 | 0.821 | 0.012 | 0.084 | 0.849 | 0.0 | Albany | 37.886869 | -122.297747 |
| 4 | 30990.0 | 4061.1 | 0.2 | 59.7 | 40.3 | 553800.0 | 1286.0 | 1629.0 | 22.0 | 1296.0 | ... | 0.24 | 0.575 | 0.312 | 0.054 | 0.18 | 0.629 | 0.039 | Alhambra | 34.095286 | -118.127014 |


"Towncharts.com - United States Demographics Data." United States Demographics data. N.p., 15 Dec. 2016. Web. 04 Sep. 2019. <http://www.towncharts.com/>.