<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Practice Loading and Describing Data 

_Authors: Matt Brems (DC)_

---

In this lab you will practice loading data with Python and describing it with statistics.

It might be a good idea to first check the [source of the Boston housing data](https://archive.ics.uci.edu/ml/datasets/Housing).

### 1. Load the Boston housing data (provided)

In [21]:
# Download the data and save it to a file called "housing.data".

import os
import urllib.request
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"

# Make sure the target folder exists, then save 'housing.data' locally.
os.makedirs('./datasets', exist_ok=True)
urllib.request.urlretrieve(data_url, './datasets/housing.data')

('./datasets/housing.data', <http.client.HTTPMessage at 0x11bfae2b0>)

The data file does not contain the column names in the first line, so we'll need to add those in manually. You can find the names and explanations [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names). We've extracted the names below for your convenience. You may choose to edit the names, should you decide it would be more helpful to do so.

In [15]:
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
         "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
# 1. CRIM      - per capita crime rate by town
# 2. ZN        - proportion of residential land zoned for lots over 25,000 sq.ft.
# 3. INDUS     - proportion of non-retail business acres per town
# 4. CHAS      - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
# 5. NOX       - nitric oxides concentration (parts per 10 million)
# 6. RM        - average number of rooms per dwelling
# 7. AGE       - proportion of owner-occupied units built prior to 1940
# 8. DIS       - weighted distances to five Boston employment centres
# 9. RAD       - index of accessibility to radial highways
# 10. TAX      - full-value property-tax rate per 10,000 dollars. 
# 11. PTRATIO  - pupil-teacher ratio by town
# 12. B        - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
# 13. LSTAT    - Percent lower status of the population
# 14. MEDV     - Median value of owner-occupied homes in $1000's

### 2. Load the `housing.data` file with Python

Use any method of your choice.

> _**Hint:** Despite the unusual `.data` extension, this is a plain text file formatted much like a CSV. You can load it with Python's `with open(...) as file` and `file.read()` or `file.readlines()`, then use string operations to parse each row._

In [25]:
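# A:
data = []
# Each line holds 14 whitespace-separated numbers.
# Plain 'r' mode replaces the deprecated 'rU', and the `with` block
# closes the file on exit, so no explicit f.close() is needed.
with open('./datasets/housing.data', 'r') as f:
    for row in f:
        data.append([float(x) for x in row.split()])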



In [27]:
# A:
data[0:2]

[[0.00632,
  18.0,
  2.31,
  0.0,
  0.538,
  6.575,
  65.2,
  4.09,
  1.0,
  296.0,
  15.3,
  396.9,
  4.98,
  24.0],
 [0.02731,
  0.0,
  7.07,
  0.0,
  0.469,
  6.421,
  78.9,
  4.9671,
  2.0,
  242.0,
  17.8,
  396.9,
  9.14,
  21.6]]

In [35]:
# Transpose the rows into a dict that maps each column name to its list of values.
d = {key_name: [row[index] for row in data] for index, key_name in enumerate(names)}


In [37]:
print(d.keys())

dict_keys(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'])
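
The dictionary comprehension above turns row-oriented data into one list per column. As a minimal sketch of the same idea, `zip(*rows)` performs the transpose directly. The two rows below are copied from the `data[0:2]` output above; the name `columns` is only illustrative.

```python
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
         "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# First two rows of the dataset, as printed earlier in this notebook.
rows = [
    [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2,
     4.09, 1.0, 296.0, 15.3, 396.9, 4.98, 24.0],
    [0.02731, 0.0, 7.07, 0.0, 0.469, 6.421, 78.9,
     4.9671, 2.0, 242.0, 17.8, 396.9, 9.14, 21.6],
]

# zip(*rows) yields one tuple per column; pair each with its name.
columns = {name: list(col) for name, col in zip(names, zip(*rows))}

print(columns["CRIM"])  # [0.00632, 0.02731]
print(columns["MEDV"])  # [24.0, 21.6]
```

Both forms give the same result; the comprehension with `enumerate` makes the column index explicit, while `zip` avoids indexing altogether.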


### 4. For what two attributes does it make the *least* sense to calculate mean and median? Why?

In [38]:
# A:

# CHAS - a dummy (categorical) variable, so quantitative summaries such as the mean and median do not apply.

# RAD - an index, which is not a true quantitative variable.
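
For a categorical variable like CHAS, counts and proportions are usually more informative than a mean or median. The sketch below uses made-up 0/1 values (not taken from `housing.data`) to show one way to summarize such a column with `collections.Counter`.

```python
from collections import Counter

# Hypothetical CHAS-style dummy values, for illustration only.
chas = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

counts = Counter(chas)                       # frequency of each category
mode_value, mode_count = counts.most_common(1)[0]
share_on_river = counts[1] / len(chas)       # proportion of tracts with CHAS == 1

print(counts)
print("most common category:", mode_value)
print("share bounding the river:", share_on_river)
```

For a 0/1 dummy the arithmetic mean happens to equal the proportion of ones, so reporting the proportion directly says the same thing without implying that values between 0 and 1 exist.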




### 5. Which two variables have the strongest linear association? 

Report both variables, the metric you chose as the basis for your comparison, and the value of that metric. *(Hint: Make sure you consider only variables for which it makes sense to find a linear association.)*

In [41]:
# A:

import numpy as np

relationships = []

for name in d.keys():
    if name not in ['RAD','CHAS']:
        for key in d.keys():
            if (name != key) and (key not in ['RAD','CHAS']):
                relationships.append([name, key, np.corrcoef(d[name], d[key])[0,1]])


In [42]:
sort_rel = sorted(relationships, key=lambda x: np.abs(x[2]), reverse=True)

In [47]:
for i in range(2):
    print(sort_rel[i])

['NOX', 'DIS', -0.7692301132258278]
['DIS', 'NOX', -0.7692301132258278]
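
The two printed rows are the same pair in both orders, because the nested loops visit every ordered pair. As a hedged sketch, the version below computes Pearson's correlation coefficient by hand (the quantity read from `np.corrcoef(...)[0, 1]`) and uses `itertools.combinations` so each pair appears once. The `toy` columns are hypothetical values, not from the housing data.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns for illustration.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a, so r(a, b) is 1
    "c": [1.0, 3.0, 2.0, 5.0],   # increasing overall, but not perfectly linear
}

# combinations() yields each unordered pair exactly once.
pairs = [(x, y, pearson_r(toy[x], toy[y])) for x, y in combinations(toy, 2)]
strongest = max(pairs, key=lambda p: abs(p[2]))

for pair in pairs:
    print(pair)
print("strongest:", strongest[:2])
```

Ranking by absolute value matters here: a strong negative association, such as the one between NOX and DIS, is just as linear as a strong positive one.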


### 6. Look at distributional qualities of variables.

Answer the following questions:
1. Which variable has the most symmetric distribution? 
2. Which variable has the most left-skewed (negatively skewed) distribution? 
3. Which variable has the most right-skewed (positively skewed) distribution? 

Defend your method for determining this.

In [57]:
# A:

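# Method: compare the mean with the median. The mean is pulled toward a long
# tail, so mean - median near 0 suggests symmetry, a negative difference
# suggests left skew, and a positive difference suggests right skew.
symmetric = sorted([[key_name, np.abs(np.mean(y) - np.median(y))] for key_name, y in d.items()],
                   key=lambda x: x[1])
print(symmetric[0])

left = sorted([[k, np.mean(y) - np.median(y)] for k, y in d.items()],
              key=lambda x: x[1])
print(left[0])

right = sorted([[k, np.mean(y) - np.median(y)] for k, y in d.items()],
               key=lambda x: x[1], reverse=True)
print(right[0])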

['NOX', 0.01669505928853754]
['B', -34.765968379446576]
['TAX', 78.23715415019763]
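
The answer above reads skew from the sign of `mean - median`. A small sketch with hypothetical samples shows the intended behavior: a long right tail drags the mean above the median, and a long left tail drags it below. This is a heuristic rather than a guarantee, so it is worth checking against a histogram.

```python
from statistics import mean, median

# Hypothetical samples with an obvious tail on one side.
right_skewed = [1, 1, 2, 2, 3, 20]        # long tail toward large values
left_skewed = [-20, -3, -2, -2, -1, -1]   # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5]

samples = {"right": right_skewed, "left": left_skewed, "symmetric": symmetric}
diffs = {label: mean(values) - median(values) for label, values in samples.items()}

for label, diff in diffs.items():
    print(label, diff)
```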


### 8. Repeat question 6 but scale the variables by their range first.

As you may have noticed, the spread of the distribution contributed significantly to the results in question 6.

In [55]:
# A:

def scaled_diff(y):
    return (np.mean(y) - np.median(y)) / np.ptp(y)

symmetric = sorted([[k, np.abs(scaled_diff(y))] for k, y in d.items()],
                   key=lambda x: x[1])
print(symmetric[0])

left = sorted([[k, scaled_diff(y)] for k,y in d.items()],
              key=lambda x: x[1])
print(left[0])

right = sorted([[k, scaled_diff(y)] for k,y in d.items()],
               key=lambda x: x[1], reverse=True)
print(right[0])

['RM', 0.01458792629848224]
['AGE', -0.09191656863263897]
['RAD', 0.19780030933150025]
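
To see why the unscaled answers were dominated by spread, the sketch below expresses the same hypothetical prices in two units. The raw `mean - median` difference grows with the unit, while the range-scaled version stays the same. The helper names and values are illustrative assumptions, not part of the lab.

```python
from statistics import mean, median

def raw_diff(values):
    return mean(values) - median(values)

def scaled_diff(values):
    # Same idea as the answer above: divide by the range (max - min).
    return raw_diff(values) / (max(values) - min(values))

# Hypothetical prices in thousands of dollars, then the same prices in dollars.
in_thousands = [10.0, 12.0, 15.0, 20.0, 50.0]
in_dollars = [v * 1000 for v in in_thousands]

print(raw_diff(in_thousands), raw_diff(in_dollars))        # differ by a factor of 1000
print(scaled_diff(in_thousands), scaled_diff(in_dollars))  # agree up to rounding
```

This is why variables measured on a large scale, such as TAX and B, topped the unscaled rankings.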


### 9. Univariate analysis of your choice

Conduct a full univariate analysis on MEDV, CHAS, TAX, and RAD. 

For each variable, you should answer the three questions generally asked in a univariate analysis using the most appropriate metrics.
- A measure of central tendency
- A measure of spread
- A description of the shape of the distribution (plot or metric based)

If you feel there is additional information that is relevant, include it. 
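
As one possible starting point, the three items above can be collected in a small helper built on the standard-library `statistics` module. The function name `describe` and the sample values are hypothetical. `pstdev` (the population standard deviation) is used on the assumption that the data is a census, as suggested later in this lab; use `stdev` if you treat it as a sample.

```python
from statistics import mean, median, pstdev, quantiles

def describe(values):
    """Central tendency, spread, and a rough shape indicator for one variable."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),                               # population standard deviation
        "iqr": q3 - q1,                                      # interquartile range
        "range": max(values) - min(values),
        "mean_minus_median": mean(values) - median(values),  # sign hints at skew
    }

# Hypothetical values, for illustration only.
summary = describe([18.0, 21.0, 22.0, 24.0, 50.0])
for key, value in summary.items():
    print(key, value)
```

For CHAS and RAD, counts per category are likely to be more useful than these numeric summaries.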

In [67]:
# A:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# `data` is a plain list of rows and has no .plot method, so build a
# DataFrame from the column dict `d` and plot from that instead.
df = pd.DataFrame(d)
df.plot(y="MEDV", kind='hist')

### 10. Have you been using inferential statistics, descriptive statistics, or both?

For each exercise, identify the branch of statistics on which you relied for your answer.

In [10]:
# A: Descriptive statistics

### 11. Reducing the number of observations

It seems likely that this data is a census - that is, the data set includes the entire target population. Suppose that the 506 observations were too many for our computer (unlikely as this might be) and we needed to pare them down to fewer observations.

**11.A Use the `random.sample()` function to select 50 observations from `'AGE'`.**

([This documentation](https://docs.python.org/3/library/random.html) may be helpful.)

In [71]:
# A:

import random

age = d['AGE']
age_sample = random.sample(age, 50)

print(len(age_sample))

50
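
`random.sample()` draws without replacement, and the result changes on every run unless the generator is seeded. The sketch below uses a private `random.Random` instance with an arbitrary seed so the sample can be reproduced. The `age` list is a hypothetical stand-in for `d['AGE']` with the same length.

```python
import random

# Stand-in for d['AGE']: 506 hypothetical values, one per observation.
age = [float(i % 100) for i in range(506)]

rng = random.Random(42)             # private generator; the seed value is arbitrary
age_sample = rng.sample(age, 50)    # 50 draws without replacement

# Re-creating the generator with the same seed reproduces the same sample.
assert age_sample == random.Random(42).sample(age, 50)
print(len(age_sample))
```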


**11.B Identify the type of sampling we just used.**

In [73]:
# A: Simple random sampling (without replacement; every observation is equally likely to be chosen)


### 12. [BONUS] Of the remaining types of sampling, describe (but do not execute) how you might implement at least one of these types of sampling.


In [74]:
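# A: Stratified random sampling: split the observations into groups (strata) that share a characteristic, for example CHAS = 0 versus CHAS = 1, then draw a simple random sample from each group.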
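
Although the exercise asks only for a description, a short sketch can make the idea concrete. It assumes proportional allocation (the same fraction from every stratum) and uses hypothetical `(CHAS, AGE)` records; the function name `stratified_sample` and its parameters are illustrative, not part of the lab.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each stratum given by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))       # simple random sample within the stratum
    return sample

# Hypothetical records: 40 tracts on the river (CHAS = 1) and 460 off it (CHAS = 0).
records = [(1, float(i)) for i in range(40)] + [(0, float(i)) for i in range(460)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)

on_river = sum(1 for chas, _ in sample if chas == 1)
print(len(sample), on_river)
```

Compared with simple random sampling, this ensures that a small group such as the river-bounding tracts is represented in the sample in proportion to its size.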
