<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Practice Loading and Describing Data 

_Authors: Matt Brems (DC)_

---

In this lab you will practice loading data with Python and describing it with summary statistics.

It might be a good idea to first check the [source of the Boston housing data](https://archive.ics.uci.edu/ml/datasets/Housing).

### 1. Load the Boston housing data (provided)

In [2]:
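# Download the data and save it to a file called "housing.data".

# `import urllib` alone does not reliably load the `request` submodule, so import it explicitly.
import urllib.request
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"

# This saves a file called 'housing.data' locally (the ./datasets directory must already exist).
urllib.request.urlretrieve(data_url, './datasets/housing.data')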

('./datasets/housing.data', <http.client.HTTPMessage at 0x208b6c534a8>)

The data file does not contain the column names in the first line, so we'll need to add those in manually. You can find the names and explanations [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names). We've extracted the names below for your convenience. You may choose to edit the names, should you decide it would be more helpful to do so.

In [3]:
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
         "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

From housing data naming:
Concerns housing values in suburbs of Boston.

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000 USD's

### 2. Load the `housing.data` file with Python

Using any method of your choice.

> _**Hint:** although this file has an unusual `.data` extension, opening it with Python's `open()` and calling `file.read()` or `file.readlines()` shows that it is a plain-text file formatted much like a CSV, except that the fields are separated by whitespace. You can use string operations to format the data._

In [4]:
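# A:
data = []
# Plain 'r' mode: the old 'rU' flag is deprecated and was removed in Python 3.11.
with open('./datasets/housing.data', 'r') as f:
    for row in f:
        # Fields are whitespace-separated; convert each one to a float.
        data.append([float(x) for x in row.split()])
# No f.close() needed: the with-block closes the file automatically.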




In [5]:
# Build a dict that maps each column name to the list of that column's values
d = {key_name: [row[index] for row in data] for index, key_name in enumerate(names)}

In [6]:
#Shows the data within the first two rows
data[0:2]

[[0.00632,
  18.0,
  2.31,
  0.0,
  0.538,
  6.575,
  65.2,
  4.09,
  1.0,
  296.0,
  15.3,
  396.9,
  4.98,
  24.0],
 [0.02731,
  0.0,
  7.07,
  0.0,
  0.469,
  6.421,
  78.9,
  4.9671,
  2.0,
  242.0,
  17.8,
  396.9,
  9.14,
  21.6]]

In [7]:
#Shows the data within column headed CRIM
d['CRIM']

[0.00632,
 0.02731,
 0.02729,
 0.03237,
 0.06905,
 0.02985,
 0.08829,
 0.14455,
 0.21124,
 0.17004,
 0.22489,
 0.11747,
 0.09378,
 0.62976,
 0.63796,
 0.62739,
 1.05393,
 0.7842,
 0.80271,
 0.7258,
 1.25179,
 0.85204,
 1.23247,
 0.98843,
 0.75026,
 0.84054,
 0.67191,
 0.95577,
 0.77299,
 1.00245,
 1.13081,
 1.35472,
 1.38799,
 1.15172,
 1.61282,
 0.06417,
 0.09744,
 0.08014,
 0.17505,
 0.02763,
 0.03359,
 0.12744,
 0.1415,
 0.15936,
 0.12269,
 0.17142,
 0.18836,
 0.22927,
 0.25387,
 0.21977,
 0.08873,
 0.04337,
 0.0536,
 0.04981,
 0.0136,
 0.01311,
 0.02055,
 0.01432,
 0.15445,
 0.10328,
 0.14932,
 0.17171,
 0.11027,
 0.1265,
 0.01951,
 0.03584,
 0.04379,
 0.05789,
 0.13554,
 0.12816,
 0.08826,
 0.15876,
 0.09164,
 0.19539,
 0.07896,
 0.09512,
 0.10153,
 0.08707,
 0.05646,
 0.08387,
 0.04113,
 0.04462,
 0.03659,
 0.03551,
 0.05059,
 0.05735,
 0.05188,
 0.07151,
 0.0566,
 0.05302,
 0.04684,
 0.03932,
 0.04203,
 0.02875,
 0.04294,
 0.12204,
 0.11504,
 0.12083,
 0.08187,
 0.0686,
 0.14866

In [39]:
d['RAD']

[1.0,
 2.0,
 2.0,
 3.0,
 3.0,
 3.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 5.0,
 5.0,
 5.0,
 5.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 3.0,
 4.0,
 4.0,
 4.0,
 4.0,
 3.0,
 5.0,
 2.0,
 5.0,
 8.0,
 8.0,
 8.0,
 8.0,
 8.0,
 8.0,
 3.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 4.0,
 4.0,
 4.0,
 4.0,
 3.0,
 3.0,
 3.0,
 3.0,
 2.0,
 2.0,
 2.0,
 2.0,
 4.0,
 4.0,
 4.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 6.0,
 6.0,
 6.0,
 6.0,
 6.0,
 6.0,
 6.0,
 6.0,
 6.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 2.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 4.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0,
 5.0

In [8]:
d.keys()

dict_keys(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'])

### 3.  Conduct a brief integrity check of your data. 

This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense (e.g., is one variable a percentage, but there are observations above 100%?).

Summarize your findings in a few sentences, including what you checked and, if appropriate, any 
steps you took to rectify potential integrity issues.

In [23]:
for i in names:
    print(i, 'count is: ', len(d[i]))

CRIM count is:  506
ZN count is:  506
INDUS count is:  506
CHAS count is:  506
NOX count is:  506
RM count is:  506
AGE count is:  506
DIS count is:  506
RAD count is:  506
TAX count is:  506
PTRATIO count is:  506
B count is:  506
LSTAT count is:  506
MEDV count is:  506


In [32]:
# LSTAT is a percentage, so every value should lie between 0 and 100
for i in d['LSTAT']:
    if i > 100 or i < 0:
        print('Out-of-range LSTAT value:', i)
print('Loop done')

Loop done


In [38]:
test_set = ['AGE', 'RM']
for test in test_set:
    for i in d[test]:
        if i == 0:
            print('A zero exists')
        else:
            continue
print('loops complete')

loops complete


I found no missing values: every column has 506 entries, and every field converted to a float without error. I also spot-checked some of the columns for odd values. For example, none of the percentage values in LSTAT are above 100% or below 0%. I also checked AGE and RM for zeros, as it would not make sense for an age or a number of rooms to be 0.
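The checks above look at one column at a time. A more general pass can report the number of missing values and the range of every column at once. The following is a minimal sketch, assuming the data is held in a dict of numeric lists like `d`; the helper name `summarize_columns` and the toy values are hypothetical and only illustrate the idea.

```python
import math

def summarize_columns(columns):
    """Return {name: (n_missing, minimum, maximum)} for a dict of numeric lists."""
    summary = {}
    for name, values in columns.items():
        # Treat NaN as "missing"; every other float counts as present.
        present = [v for v in values if not math.isnan(v)]
        n_missing = len(values) - len(present)
        summary[name] = (n_missing, min(present), max(present))
    return summary

# Toy stand-in for the notebook's `d` dictionary (made-up values, not the real data).
toy = {"PCT": [4.98, 9.14, float("nan")], "FLAG": [0.0, 1.0, 0.0]}
for name, (n_missing, lo, hi) in summarize_columns(toy).items():
    print(name, "missing:", n_missing, "min:", lo, "max:", hi)
```

Comparing each reported minimum and maximum against the variable's definition (for example, 0 to 100 for a percentage, or only 0 and 1 for a dummy variable) covers the "logical sense" part of the check.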

### 4. For what two attributes does it make the *least* sense to calculate mean and median? Why?

The 'CHAS' variable is a dummy variable indicating whether or not the tract of land borders the Charles River. The values are either 1 or 0. A mean that is neither 1 nor 0 does not correspond to any possible value of the attribute (it only gives the proportion of tracts on the river), and the median simply reports the more common of the two values.

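'RAD' is an index of accessibility to radial highways, not a measured distance. The values are all whole numbers, and there is a jump in the data from single-digit values to values of 24. There is no clear indication of what the difference between a 1 and a 2, or between an 8 and a 24, actually means, so averaging the index values is hard to interpret.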
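A quick way to support this kind of argument is to count how often each distinct value occurs: a column with only a handful of distinct values is probably categorical or ordinal rather than continuous. The sketch below assumes plain Python lists like `d['CHAS']` and `d['RAD']`; the helper `value_counts` and the short toy lists are hypothetical.

```python
from collections import Counter

def value_counts(values):
    """Count how often each distinct value occurs, sorted by value."""
    return sorted(Counter(values).items())

# Toy stand-ins for d['CHAS'] and d['RAD'] (made-up short lists, not the real columns).
toy_chas = [0.0, 0.0, 1.0, 0.0]
toy_rad = [1.0, 2.0, 2.0, 24.0]
print(value_counts(toy_chas))
print(value_counts(toy_rad))
```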


### 5. Which two variables have the strongest linear association? 

Report both variables, the metric you chose as the basis for your comparison, and the value of that metric. *(Hint: Make sure you consider only variables for which it makes sense to find a linear association.)*

In [46]:
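import numpy as np

# CHAS is a binary dummy and RAD is an index, so leave both out of the comparison.
not_include = ['CHAS', 'RAD']
relationships = []
for name in d.keys():
    if name not in not_include:
        for other in d.keys():
            if (name != other) and (other not in not_include):
                # Pearson correlation coefficient between the two columns
                relationships.append([name, other, np.corrcoef(d[name], d[other])[0, 1]])

# The pair with the largest absolute correlation has the strongest linear association.
strongest = max(relationships, key=lambda r: abs(r[2]))
print(strongest)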

### 6. Look at distributional qualities of variables.

Answer the following questions:
1. Which variable has the most symmetric distribution? 
2. Which variable has the most left-skewed (negatively skewed) distribution? 
3. Which variable has the most right-skewed (positively skewed) distribution? 

Defend your method for determining this.

In [7]:
# A:
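One way to make "most symmetric" and "most skewed" concrete is the moment coefficient of skewness, which is near 0 for a symmetric distribution, negative for a left skew, and positive for a right skew. This is a minimal sketch of that metric, not the lab's prescribed method; the function name `skewness` and the toy lists (standing in for columns of `d`) are assumptions.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Toy lists for illustration only (not columns from the housing data).
toy_symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]
print(skewness(toy_symmetric))
print(skewness(toy_right_skewed))
```

Applied to the notebook's data, something like `{name: skewness(d[name]) for name in names}` would give one number per variable to rank, with the usual caveat that the value is not very meaningful for CHAS or RAD.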

### 8. Repeat question 6 but scale the variables by their range first.

As you may have noticed, the spread of the distribution contributed significantly to the results in question 6.

In [8]:
# A:
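The remark about spread applies to measures such as the gap between mean and median, which grows with the units of the variable; the moment coefficient of skewness, by contrast, does not change under rescaling. The sketch below assumes the mean-minus-median gap was the measure used in question 6 and shows how min-max scaling makes two variables comparable. The helper names and toy lists are hypothetical.

```python
import statistics

def scale_by_range(values):
    """Min-max scale values into [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mean_median_gap(values):
    """Positive when the mean sits above the median (suggesting a right skew)."""
    return statistics.mean(values) - statistics.median(values)

# Two toy lists with the same shape; the second has 100x the spread.
toy_small = [1.0, 1.0, 2.0, 6.0]
toy_large = [100.0, 100.0, 200.0, 600.0]

# Unscaled, the gap mostly reflects the units of each variable.
print(mean_median_gap(toy_small), mean_median_gap(toy_large))
# After scaling by range, the two gaps can be compared directly.
print(mean_median_gap(scale_by_range(toy_small)),
      mean_median_gap(scale_by_range(toy_large)))
```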

### 9. Univariate analysis of your choice

Conduct a full univariate analysis on MEDV, CHAS, TAX, and RAD. 

For each variable, you should answer the three questions generally asked in a univariate analysis using the most appropriate metrics.
- A measure of central tendency
- A measure of spread
- A description of the shape of the distribution (plot or metric based)

If you feel there is additional information that is relevant, include it. 

In [9]:
# A:
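The three questions listed above map onto a small set of standard-library calls. This is a minimal sketch, assuming each variable is a plain list of floats; the helper `describe` and the toy list are hypothetical, and the population standard deviation is used on the assumption that the data set is treated as a census.

```python
import statistics

def describe(values):
    """Basic univariate summary: central tendency, spread, and range."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "pstdev": statistics.pstdev(values),  # population standard deviation
        "iqr": q3 - q1,                       # interquartile range
        "min": min(values),
        "max": max(values),
    }

# Toy list for illustration only (not a column from the housing data).
toy = [1.0, 2.0, 3.0, 4.0, 5.0]
for key, value in describe(toy).items():
    print(key, value)
```

For a binary variable such as CHAS, a frequency count of each value is likely a more appropriate summary than a mean or standard deviation.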

### 10. Have you been using inferential statistics, descriptive statistics, or both?

For each exercise, identify the branch of statistics on which you relied for your answer.

In [10]:
# A:

### 11. Reducing the number of observations

It seems likely that this data is a census, that is, the data set includes the entire target population. Suppose that the 506 observations were too many for our computer (as unlikely as this might be) and we needed to pare this down to fewer observations.

**11.A Use the `random.sample()` function to select 50 observations from `'AGE'`.**

([This documentation](https://docs.python.org/2/library/random.html) may be helpful.)

In [51]:
# A:
import random

# Sample from the list of AGE values, not from the 3-character string 'AGE'.
age_sample = random.sample(d['AGE'], 50)
len(age_sample)

**11.B Identify the type of sampling we just used.**

In [12]:
# A:

### 12. [BONUS] Of the remaining types of sampling, describe (but do not execute) how you might implement at least one of these types of sampling.


In [13]:
# A:
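Systematic sampling is one of the remaining types: choose a random starting point, then take every k-th observation. The sketch below illustrates the idea only, under the assumption that the observations are in a list with at least as many items as the sample size; the function name `systematic_sample` and the stand-in population of row indices are hypothetical.

```python
import random

def systematic_sample(values, k, seed=None):
    """Take k items at a fixed step after a random start (requires k <= len(values))."""
    rng = random.Random(seed)
    step = len(values) // k       # spacing between selected items
    start = rng.randrange(step)   # random offset within the first step
    return [values[start + i * step] for i in range(k)]

# Stand-in population: row indices 0..505, matching the 506 observations.
population = list(range(506))
sample = systematic_sample(population, 50, seed=0)
print(len(sample))
```

Because the rows may be ordered in a meaningful way (for example, grouped by town), a systematic sample can inherit that pattern, which is worth checking before relying on it.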