# Housing Prices Prediction

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np

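# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston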
from sklearn.linear_model import LinearRegression

## Building our model

In [2]:
# Load the Boston housing dataset (a Bunch with data, target, feature_names and DESCR)
boston = load_boston()

In [3]:
X = boston.data
y = boston.target

# Shape of data
print(X.shape)
print(y.shape)

(506, 13)
(506,)


### Visualization of data

### Statistics for Boston housing dataset

In [4]:
# Minimum price of the data
min_price = np.amin(y)

# Maximum price of the data
max_price = np.amax(y)

# Mean price of the data
mean_price = np.mean(y)

# Median price of the data
median_price = np.median(y)

# Standard deviation of prices of the data
std_price = np.std(y)

# Show the calculated statistics
# Note: the target (MEDV) is in thousands of dollars, so $5.0 means $5,000.
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${}".format(min_price))
print("Maximum price: ${}".format(max_price))
print("Mean price: ${}".format(mean_price))
print("Median price: ${}".format(median_price))
print("Standard deviation of prices: ${}".format(std_price))

Statistics for Boston housing dataset:

Minimum price: $5.0
Maximum price: $50.0
Mean price: $22.532806324110677
Median price: $21.2
Standard deviation of prices: $9.188011545278203
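
One detail worth keeping in mind when comparing these numbers with pandas output: `np.std` computes the population standard deviation by default (`ddof=0`), whereas `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). The sketch below shows the difference on a small toy array; the array itself is an assumption for illustration and is not taken from the Boston data.

```python
import numpy as np

# Toy values chosen for illustration only (not from the dataset).
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation: divide by n (NumPy's default, ddof=0).
pop_std = np.std(values)

# Sample standard deviation: divide by n - 1 (what pandas uses by default).
sample_std = np.std(values, ddof=1)

print("Population std:", pop_std)
print("Sample std:", sample_std)
```

With 506 observations the two estimates differ only slightly, but the distinction explains small mismatches between NumPy and pandas summaries.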


In [5]:
# Feature names
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [6]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

The features can be summarized as follows:

- CRIM: per capita crime rate by town.
- ZN: proportion of residential land zoned for lots larger than 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
- NOX: nitric oxides concentration (parts per 10 million).
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built prior to 1940.
- DIS: weighted distance to five Boston employment centers.
- RAD: index of accessibility to radial highways.
- TAX: full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: calculated as 1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town.
- LSTAT: percentage of the population considered lower status.

In [7]:
# Build a DataFrame with the feature names as column labels
df = pd.DataFrame(X, columns=boston.feature_names)
df.head()

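|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |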


In [8]:
df.describe()

|       | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|-------|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 |


### LinearRegression

In [9]:
reg = LinearRegression().fit(X, y)
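
`LinearRegression` fits an ordinary least squares model with an intercept by default. To make that step less of a black box, here is a minimal sketch of the same kind of fit using `np.linalg.lstsq`. The data is synthetic and noise-free (an assumption for illustration, not the Boston data), and the names `X_demo`, `y_demo` and `X_aug` are hypothetical.

```python
import numpy as np

# Synthetic, noise-free data for illustration: y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

# Append a column of ones so the intercept is estimated along with the slopes.
X_aug = np.column_stack([X_demo, np.ones(len(X_demo))])

# Solve the least squares problem: minimize ||X_aug @ w - y_demo||^2 over w.
w, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
coef, intercept = w[:-1], w[-1]

print("Coefficients:", coef)
print("Intercept:", intercept)
```

Because the toy data has no noise, the recovered coefficients match the ones used to generate it up to floating-point error. On real data such as the housing features, the fit minimizes the squared error instead of reproducing the targets exactly.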

## Measuring the model

In [10]:
reg.score(X, y)

0.7406426641094095
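
For a regressor, `score` returns the coefficient of determination R². Note that the value above is computed on the same data the model was fitted on, so it describes the fit to the training data and may be optimistic for unseen homes. The sketch below shows how R² is computed from targets and predictions; `r_squared` is a hypothetical helper and the numbers are toy values, not model output.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy targets and predictions for illustration only.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 6.5, 9.5]

r2_demo = r_squared(y_true_demo, y_pred_demo)
print("R^2:", r2_demo)
```

A value of 1.0 means the predictions match the targets exactly, while 0.0 means the model does no better than always predicting the mean of the targets.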

## Predicting Selling Prices
Imagine we are a real estate agent in the Boston area who wants to use this model to help price homes that clients wish to sell.

- What price would we recommend each client sell their home at?
- Do these prices seem reasonable given the values of the respective features?

To answer these questions, we run the following code and discuss its output.

In [11]:
# Produce a matrix for client data
client_data = np.array([[0.01000, 0.0, 8.07, 0.0, 0.269, 6.57, 75.2, 1.0900, 1.0, 102.0, 17.8, 392.83, 9.14], # Client 1
                        [0.03237, 0.0, 1.81, 0.0, 0.469, 5.14, 78.0, 6.0622, 1.0, 242.0, 17.8, 392.83, 4.03], # Client 2
                        [0.06905, 18.0, 2.81, 0.0, 0.458, 3.23, 54.8, 6.0622, 3.0, 244.0, 18.7, 396.90, 5.33]])  # Client 3
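
Before trusting a prediction, it helps to check whether each client's feature values fall inside the range the model was fitted on, since a linear model extrapolates without warning. The sketch below compares the three client rows against the `min` and `max` rows of the `df.describe()` output above; `out_of_range` is a hypothetical helper, and plain lists are used so the check runs without the dataset.

```python
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Per-feature minimum and maximum, copied from the df.describe() output.
feature_min = [0.00632, 0.0, 0.46, 0.0, 0.385, 3.561, 2.9,
               1.1296, 1.0, 187.0, 12.6, 0.32, 1.73]
feature_max = [88.9762, 100.0, 27.74, 1.0, 0.871, 8.78, 100.0,
               12.1265, 24.0, 711.0, 22.0, 396.9, 37.97]

# The same three rows as client_data.
clients = [
    [0.01000, 0.0, 8.07, 0.0, 0.269, 6.57, 75.2, 1.0900, 1.0, 102.0, 17.8, 392.83, 9.14],
    [0.03237, 0.0, 1.81, 0.0, 0.469, 5.14, 78.0, 6.0622, 1.0, 242.0, 17.8, 392.83, 4.03],
    [0.06905, 18.0, 2.81, 0.0, 0.458, 3.23, 54.8, 6.0622, 3.0, 244.0, 18.7, 396.90, 5.33],
]

def out_of_range(row):
    """Return the names of features whose value lies outside [min, max]."""
    return [name
            for name, value, lo, hi in zip(feature_names, row, feature_min, feature_max)
            if not lo <= value <= hi]

flagged = [out_of_range(row) for row in clients]
for i, names in enumerate(flagged, start=1):
    print("Client {}: {}".format(i, names or "all features within range"))
```

Any feature flagged here lies outside the observed data, so the corresponding prediction should be read with extra caution.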

In [12]:
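# Show predictions (in thousands of dollars, the unit of the target)
# The loop variable is named `price` so it does not overwrite the feature matrix X.
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i+1, price))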

Predicted selling price for Client 1's home: $36.27
Predicted selling price for Client 2's home: $20.76
Predicted selling price for Client 3's home: $13.60
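
The target of this dataset (MEDV) is the median home value in thousands of dollars, so the figures above read as $36,270, $20,760 and $13,600. A minimal sketch of the conversion, using the three printed predictions as plain numbers (the names `predicted_medv` and `formatted` are hypothetical):

```python
# Predictions as printed above, in thousands of dollars.
predicted_medv = [36.27, 20.76, 13.60]

# Scale to dollars and format with a thousands separator.
formatted = ["${:,.0f}".format(medv * 1000) for medv in predicted_medv]

for i, text in enumerate(formatted, start=1):
    print("Client {}: {}".format(i, text))
```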


## Conclusion
From the statistics computed at the beginning of the project (prices are in thousands of dollars), we found:

- Minimum price: $5.0
- Maximum price: $50.0
- Mean price: $22.53
- Median price: $21.2
- Standard deviation of prices: $9.19

Given these values, we can conclude:

- The predicted selling price for client 1 is $36.27, roughly 1.5 standard deviations above the mean, although still below the dataset maximum of $50.0. This is plausible given its features (about 6.6 rooms, an LSTAT of 9.14 and a pupil-teacher ratio of 17.8, both below the dataset medians); the house may be in a wealthy neighborhood.
- For client 2, the features are intermediate between those of the other two clients, and its predicted price of $20.76 is close to the mean and median.
- The predicted selling price for client 3, $13.60, is the lowest of the three, about one standard deviation below the mean. This is consistent with its features, notably the very low room count (3.23).
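
To make the comparison with the dataset statistics concrete, each prediction can be expressed as a number of standard deviations from the mean. The sketch below uses the mean and standard deviation printed earlier and the three predicted prices; the names `predictions` and `z_scores` are hypothetical.

```python
# Statistics and predictions copied from the outputs above (thousands of dollars).
mean_price = 22.532806324110677
std_price = 9.188011545278203
predictions = {"Client 1": 36.27, "Client 2": 20.76, "Client 3": 13.60}

# Distance from the mean, measured in standard deviations.
z_scores = {name: (price - mean_price) / std_price
            for name, price in predictions.items()}

for name, z in z_scores.items():
    print("{}: {:+.2f} standard deviations from the mean".format(name, z))
```

This is only a rough plausibility check: it places each prediction within the spread of observed prices but says nothing about whether the model is accurate for a particular home.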