<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Boston Housing Data 

---


Check the [source and data dictionary of the Boston housing data](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html).

#### 1. Load the Boston housing data (provided) and examine the first few rows

Refer to the data dictionary above to make sense of the columns!

In [1]:
import pandas as pd

housing = pd.read_csv("../datasets/boston_housing_data.csv")
housing.head()

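| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |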


#### 2. Make your life easier by giving the columns more meaningful names

In [2]:
better_names = ["crime_rate", "proportion_zoned", "non_retail", "dummy_charles_river",
                "nitric_oxide", "rooms_per_dwelling", "pre_1940", "dist_to_employment_centre",
                "highway_access", "tax_rate", "pupil_teacher_ratio", "ethnicity_factor",
                "pct_lower_status", "median_home_value"]

housing.columns = better_names
housing.head()

| | crime_rate | proportion_zoned | non_retail | dummy_charles_river | nitric_oxide | rooms_per_dwelling | pre_1940 | dist_to_employment_centre | highway_access | tax_rate | pupil_teacher_ratio | ethnicity_factor | pct_lower_status | median_home_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |
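
Assigning a list to `housing.columns` only works if the new names are in exactly the same order as the existing columns. A more defensive alternative, sketched below, maps each old name to its new name with `DataFrame.rename`. This is a minimal sketch, not the approach used in this notebook: the small `sample` frame is a stand-in built from a few columns of the first two rows shown in step 1, and `name_map` is a hypothetical name.

```python
import pandas as pd

# Stand-in frame: three columns from the first two rows shown in step 1.
sample = pd.DataFrame({
    "CRIM": [0.00632, 0.02731],
    "RM": [6.575, 6.421],
    "MEDV": [24.0, 21.6],
})

# Explicit old -> new mapping, so column order no longer matters.
name_map = {
    "CRIM": "crime_rate",
    "RM": "rooms_per_dwelling",
    "MEDV": "median_home_value",
}

# errors="raise" makes pandas raise a KeyError if a key in name_map
# is not an existing column, instead of silently ignoring it.
renamed = sample.rename(columns=name_map, errors="raise")
print(renamed.columns.tolist())
```

The same pattern extends to all fourteen columns by adding the remaining pairs to the mapping.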


#### 3. Conduct a brief integrity check of your data.

This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense (e.g. is one variable a percentage, but some observations are above 100%?).

Summarize your findings in a few sentences, including what you checked and, if appropriate, any steps you took to rectify potential integrity issues.

In [3]:
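# Checking for features with improperly recorded observations:

# Given the data dictionary and a look at a few observations, I believe
# there are 7 features whose values should stay within a bounded range
# (0-1 or 0-100). Only some of them are true percentages:

# - CHAS (0 or 1; dummy variable),
# - ZN (0-100; percentage),
# - INDUS (0-100; percentage),
# - LSTAT (0-100; percentage),
# - CRIM (0-100; a per-capita rate, so 100 is only a loose ceiling),
# - RM (0-100; an average room count, so 100 is only a loose ceiling),
# - PTRATIO (0-100; a ratio, so 100 is only a loose ceiling)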

In [4]:
max_one = ["dummy_charles_river"]
max_hund = ['proportion_zoned','non_retail','rooms_per_dwelling',
            'pct_lower_status','pupil_teacher_ratio', 'crime_rate']

In [5]:
for feature in max_one:
    # a dummy variable should only ever take the values 0 or 1
    if not housing[feature].isin([0, 1]).all():
        print('Abnormal value found in ' + feature)
# if nothing is printed then nothing unusual was found

In [6]:
for feature in max_hund:
    if len(housing[(housing[feature] < 0) | (housing[feature] > 100)]) > 0:
        print('Abnormal value found in ' + feature)
# if nothing is printed then nothing unusual was found

In [7]:
housing.isnull().sum()

crime_rate                   0
proportion_zoned             0
non_retail                   0
dummy_charles_river          0
nitric_oxide                 0
rooms_per_dwelling           0
pre_1940                     0
dist_to_employment_centre    0
highway_access               0
tax_rate                     0
pupil_teacher_ratio          0
ethnicity_factor             0
pct_lower_status             0
median_home_value            0
dtype: int64
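
The range checks above can also be written as a single table-driven pass. The sketch below is one possible way to do that, not the method used in this notebook: `bounds` and `out_of_range` are hypothetical names, the bounds are assumptions read off the data dictionary (including `pre_1940`, which is also a percentage), and the small frame holds made-up values with one deliberately invalid entry so the check has something to find.

```python
import pandas as pd

# Made-up stand-in data; the 120.0 is deliberately invalid.
sample = pd.DataFrame({
    "dummy_charles_river": [0, 0, 1],
    "proportion_zoned": [18.0, 0.0, 0.0],
    "pre_1940": [65.2, 78.9, 120.0],
})

# Assumed valid (low, high) range per column, both ends inclusive.
bounds = {
    "dummy_charles_river": (0, 1),
    "proportion_zoned": (0, 100),
    "pre_1940": (0, 100),
}

def out_of_range(df, bounds):
    """Return {column: count of values outside [low, high]} for offending columns.

    Missing values are counted as out of range, because
    Series.between returns False for NaN.
    """
    problems = {}
    for column, (low, high) in bounds.items():
        n_bad = int((~df[column].between(low, high)).sum())
        if n_bad > 0:
            problems[column] = n_bad
    return problems

problems = out_of_range(sample, bounds)
print(problems)

# Total number of missing values across the whole frame.
print(int(sample.isnull().sum().sum()))
```

Keeping the bounds in a dictionary makes the assumptions behind each check visible in one place and easy to revise after rereading the data dictionary.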

#### 4. For what two attributes does it make the *least* sense to calculate mean and median? Why?

**Potential Solution:** _The dummy variable `CHAS` and the categorical variable `RAD`._
- `CHAS` is a dummy (categorical) variable. Its 0 and 1 are labels for whether a tract bounds the Charles River, so a median tells you little and a mean is only the share of tracts coded 1.
- `RAD` is an index of accessibility to radial highways. It has many low values and, after a large gap, some much higher values. It stands to reason that this is not a "_true_" quantitative variable, in the sense that the difference between `RAD = 1` and `RAD = 2` may not be the same as the difference between `RAD = 2` and `RAD = 3`.
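
To make the point about `CHAS` and `RAD` concrete, here is a small sketch on made-up values (the two Series below are illustrative only and are not taken from the dataset). For a 0/1 dummy the mean is just the share of ones, and for a coded index a frequency table or the mode is usually more informative than an average.

```python
import pandas as pd

# Made-up values for illustration only.
chas = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="CHAS")
rad = pd.Series([1, 2, 2, 3, 3, 3, 24, 24], name="RAD")

# The mean of a 0/1 dummy equals the proportion of rows coded 1.
share_of_ones = (chas == 1).sum() / len(chas)
print(chas.mean(), share_of_ones)

# For a coded variable, counts per level describe it better than an average.
level_counts = rad.value_counts().sort_index()
print(level_counts)

# Most frequent level(s), and a mean that matches none of the actual levels.
print(rad.mode().tolist())
print(rad.mean())
```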

#### 5. Univariate analysis of your choice

Conduct a full univariate (single variable) analysis on MEDV, CHAS, TAX, and RAD. 

For each variable, you should report the three things generally covered in a univariate analysis, using the most appropriate metrics:
- A measure of central tendency
- A measure of spread
- A description of the shape of the distribution (using a plot)

If you feel there is additional information that is relevant, include it. 

---
_**Sketch of Answer:**_   
You should report at least one **measure of center**, one **measure of spread**, and a plot of the shape of the **distribution** of each variable.  
- Defending which of these choices is better is a plus (e.g. the median is a better measure of center than the mean because...).
- Including multiple measures of center and/or spread and interpreting what these reveal about the distribution of a variable is especially good.
- Including an informative plot that goes along with these metrics and this description would turn this from a "good" analysis into a "great" one. A report to a stakeholder would ideally include these.
---
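
As a starting point for the analysis outlined above, the following shows one way to collect center, spread, and shape statistics for a numeric column. It is a sketch under stated assumptions: `summarize` is a hypothetical helper, the input values are made up, and the choice of statistics is one reasonable option rather than a required answer.

```python
import pandas as pd

def summarize(series):
    """Collect common center, spread, and shape statistics for one numeric column."""
    q1, q3 = series.quantile([0.25, 0.75])
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),    # sample standard deviation (ddof=1)
        "iqr": q3 - q1,         # spread that is less sensitive to outliers
        "min": series.min(),
        "max": series.max(),
        "skew": series.skew(),  # positive values suggest a longer right tail
    })

# Made-up home values, for illustration only.
values = pd.Series([21.6, 24.0, 33.4, 34.7, 36.2, 50.0], name="MEDV")
stats = summarize(values)
print(stats)
```

On the real data, a call such as `housing["median_home_value"].hist(bins=30)` (which needs matplotlib installed) would supply the accompanying plot. For `CHAS` and `RAD`, `value_counts()` is likely a better fit than this numeric summary.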