<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Practice Loading and Describing Data 

_Authors: Matt Brems (DC)_

---

In this lab you will practice loading data using Python and describing it with statistics.

It might be a good idea to first check the [source of the Boston housing data](https://archive.ics.uci.edu/ml/datasets/Housing).

### 1. Load the Boston housing data (provided)

In [1]:
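# Download the data and save it to a file called "housing.data".

# Import the submodule explicitly: `import urllib` alone does not guarantee
# that `urllib.request` is loaded.
import urllib.request
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"

# This saves a file called 'housing.data' in the local ./datasets directory
urllib.request.urlretrieve(data_url, './datasets/housing.data')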

('./datasets/housing.data', <http.client.HTTPMessage at 0x105f1b2b0>)

The data file does not contain the column names in its first line, so we'll need to add them manually. You can find the names and explanations [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names). We've extracted the names below for your convenience. You may edit the names if you find that more helpful.

In [2]:
names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
         "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

### 2. Load the `housing.data` file with Python

Use any method of your choice.

> _**Hint:** although this file has an unusual `.data` extension, it is a plain text file formatted much like a CSV. You can load it with Python's `with open(...) as file` together with `file.read()` or `file.readlines()`, then use string operations to format the data._

In [3]:
# A:
with open('./datasets/housing.data') as file:
    my_data = {name: [] for name in names}
    for line in file:
        # Fields are separated by runs of whitespace
        for name, datum in zip(names, line.split()):
            try:
                my_data[name].append(float(datum))
            except ValueError:
                # Keep anything that is not numeric as the raw string
                my_data[name].append(datum)

my_data

{'CRIM': [0.00632,
  0.02731,
  0.02729,
  0.03237,
  0.06905,
  0.02985,
  0.08829,
  0.14455,
  0.21124,
  0.17004,
  0.22489,
  0.11747,
  0.09378,
  0.62976,
  0.63796,
  0.62739,
  1.05393,
  0.7842,
  0.80271,
  0.7258,
  1.25179,
  0.85204,
  1.23247,
  0.98843,
  0.75026,
  0.84054,
  0.67191,
  0.95577,
  0.77299,
  1.00245,
  1.13081,
  1.35472,
  1.38799,
  1.15172,
  1.61282,
  0.06417,
  0.09744,
  0.08014,
  0.17505,
  0.02763,
  0.03359,
  0.12744,
  0.1415,
  0.15936,
  0.12269,
  0.17142,
  0.18836,
  0.22927,
  0.25387,
  0.21977,
  0.08873,
  0.04337,
  0.0536,
  0.04981,
  0.0136,
  0.01311,
  0.02055,
  0.01432,
  0.15445,
  0.10328,
  0.14932,
  0.17171,
  0.11027,
  0.1265,
  0.01951,
  0.03584,
  0.04379,
  0.05789,
  0.13554,
  0.12816,
  0.08826,
  0.15876,
  0.09164,
  0.19539,
  0.07896,
  0.09512,
  0.10153,
  0.08707,
  0.05646,
  0.08387,
  0.04113,
  0.04462,
  0.03659,
  0.03551,
  0.05059,
  0.05735,
  0.05188,
  0.07151,
  0.0566,
  0.05302,
  0.04684

### 3.  Conduct a brief integrity check of your data. 

This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense (e.g., is one variable a percentage, yet some observations exceed 100%?).

Summarize your findings in a few sentences, including what you checked and, if appropriate, any steps you took to rectify potential integrity issues.

In [9]:
# Create a median function to help describe the columns
def median(array):
    array_sorted = sorted(array)
    midpoint = len(array_sorted) // 2
    if len(array_sorted) % 2 == 1:
        # Odd length: the middle element is the median
        return array_sorted[midpoint]
    # Even length: average the two middle elements
    return (array_sorted[midpoint - 1] + array_sorted[midpoint]) / 2

test = [3, 6, 1, 4, 2, 5]
print(median(test))

test2 = [2, 5, 5, 1, 0]
print(median(test2))

3.5
2


In [8]:
from collections import Counter

# Create a mode function
def mode(array):
    # most_common(1) returns a list holding one (value, count) pair
    value, _ = Counter(array).most_common(1)[0]
    return value

mode_test = [2, 5, 1, 0, 0, 2, 3, 3, 5, 6, 3]
print(mode(mode_test))


3


In [12]:
# A:
# Check for nulls, summarize each column
print('========DATA SUMMARY=========')
for key, value in my_data.items():
    non_nulls = len(list(filter(lambda x: x is not None, value)))
    if isinstance(value[0], str):
        print("Column {0}: {1} non-nulls, type: {2}".format(key, non_nulls, type(value[0])))
    else:
        mean = sum(value) / len(value)
        med = median(value)
        mode_ = mode(value)
        minimum = min(value)
        maximum = max(value)
        summary = "Column {0}: {1} non-nulls, min: {2}, max: {3}, mean: {4}, median: {5}, mode: {6}"
        print(summary.format(key, non_nulls, minimum, maximum, mean, med, mode_))
    

Column CRIM: 506 non-nulls, min: 0.00632, max: 88.9762, mean: 3.6135235573122535, median: 0.25651, mode: 0.01501
Column ZN: 506 non-nulls, min: 0.0, max: 100.0, mean: 11.363636363636363, median: 0.0, mode: 0.0
Column INDUS: 506 non-nulls, min: 0.46, max: 27.74, mean: 11.136778656126504, median: 9.69, mode: 18.1
Column CHAS: 506 non-nulls, min: 0.0, max: 1.0, mean: 0.0691699604743083, median: 0.0, mode: 0.0
Column NOX: 506 non-nulls, min: 0.385, max: 0.871, mean: 0.5546950592885372, median: 0.538, mode: 0.538
Column RM: 506 non-nulls, min: 3.561, max: 8.78, mean: 6.284634387351787, median: 6.2085, mode: 5.713
Column AGE: 506 non-nulls, min: 2.9, max: 100.0, mean: 68.57490118577078, median: 77.5, mode: 100.0
Column DIS: 506 non-nulls, min: 1.1296, max: 12.1265, mean: 3.795042687747034, median: 3.2074499999999997, mode: 3.4952
Column RAD: 506 non-nulls, min: 1.0, max: 24.0, mean: 9.549407114624506, median: 5.0, mode: 24.0
Column TAX: 506 non-nulls, min: 187.0, max: 711.0, mean: 408.237154
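
Summary statistics show the overall range of each column, but they do not point at individual suspicious rows. The sketch below is one possible way to flag them. The function name `find_integrity_issues` and the `ASSUMED_BOUNDS` table are hypothetical, and the bounds are assumptions read off the variable descriptions in `housing.names` (percentages for `ZN` and `AGE`, a 0/1 indicator for `CHAS`). The toy data is made up for illustration and is not taken from the housing file.

```python
# Assumed bounds (hypothetical): ZN and AGE treated as percentages, CHAS as a 0/1 flag
ASSUMED_BOUNDS = {"ZN": (0.0, 100.0), "AGE": (0.0, 100.0), "CHAS": (0.0, 1.0)}


def find_integrity_issues(data, bounds):
    """Return a list of (column, row_index, value, reason) tuples."""
    issues = []
    for column, values in data.items():
        # Columns without a declared bound are only checked for missing values
        low, high = bounds.get(column, (float("-inf"), float("inf")))
        for row, value in enumerate(values):
            if value is None or isinstance(value, str):
                issues.append((column, row, value, "missing or non-numeric"))
            elif value != value:  # NaN is the only float not equal to itself
                issues.append((column, row, value, "NaN"))
            elif not low <= value <= high:
                issues.append((column, row, value, "out of range"))
    return issues


# Toy data with three planted problems
toy = {
    "ZN": [0.0, 12.5, 120.0],
    "AGE": [65.2, None, 100.0],
    "CHAS": [0.0, 1.0, "NA"],
}
issues = find_integrity_issues(toy, ASSUMED_BOUNDS)
for issue in issues:
    print(issue)
```

With the dictionary built earlier, the call would be `find_integrity_issues(my_data, ASSUMED_BOUNDS)`; an empty list means none of the checks fired.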

### 4. For what two attributes does it make the *least* sense to calculate mean and median? Why?

In [None]:
# A: 
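# The attributes 'CHAS' and 'RAD'.
# 'CHAS' is a binary indicator (0 or 1), so its mean is just the share of 1s and its median says very little.
# 'RAD' is an index of accessibility to radial highways, so its values act as ordered labels rather than measured quantities.
# 'ZN' also has a median of 0, but it is a percentage (0 to 100), not a two-valued variable.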

### 5. Which two variables have the strongest linear association? 

Report both variables, the metric you chose as the basis for your comparison, and the value of that metric. *(Hint: Make sure you consider only variables for which it makes sense to find a linear association.)*

In [26]:
# A:
import numpy as np

columns = list(my_data.keys())
for i in range(len(columns)):
    for j in range(len(columns)):
        if j != i:
            colname1, colname2 = columns[i], columns[j]
            col1 = my_data[colname1]
            col2 = my_data[colname2]
            cov_ = np.cov(col1, col2)
            print('Covariance for {0} and {1}: {2}'.format(colname1, colname2, cov_))


Covariance for CRIM and ZN: [[ 73.9865782  -40.21595603]
 [-40.21595603 543.93681368]]
Covariance for CRIM and INDUS: [[73.9865782  23.99233881]
 [23.99233881 47.06444247]]
Covariance for CRIM and CHAS: [[ 7.39865782e+01 -1.22108643e-01]
 [-1.22108643e-01  6.45129730e-02]]
Covariance for CRIM and NOX: [[7.39865782e+01 4.19593894e-01]
 [4.19593894e-01 1.34276357e-02]]
Covariance for CRIM and RM: [[73.9865782  -1.32503785]
 [-1.32503785  0.49367085]]
Covariance for CRIM and AGE: [[ 73.9865782   85.40532232]
 [ 85.40532232 792.35839851]]
Covariance for CRIM and DIS: [[73.9865782  -6.87672154]
 [-6.87672154  4.43401514]]
Covariance for CRIM and RAD: [[73.9865782  46.84776101]
 [46.84776101 75.81636598]]
Covariance for CRIM and TAX: [[   73.9865782    844.82153807]
 [  844.82153807 28404.75948812]]
Covariance for CRIM and PTRATIO: [[73.9865782   5.39933079]
 [ 5.39933079  4.68698912]]
Covariance for CRIM and B: [[  73.9865782  -302.38181632]
 [-302.38181632 8334.75226292]]
Covariance for CR

In [18]:
list(my_data.keys())

['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'B',
 'LSTAT',
 'MEDV']
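
Covariance depends on the units of each column, so its size is hard to compare across pairs (a pair involving `TAX` will look large simply because `TAX` values are large). A unit-free alternative is the Pearson correlation coefficient, which is what `np.corrcoef` also reports. The sketch below computes it in plain Python so it runs on its own; the helper names `pearson_r` and `strongest_pair` are hypothetical, and the toy columns are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_pair(data, exclude=()):
    """Return (name_a, name_b, r) for the pair with the largest |r|."""
    columns = [name for name in data if name not in exclude]
    best = None
    for a, b in combinations(columns, 2):  # visit each unordered pair once
        r = pearson_r(data[a], data[b])
        if best is None or abs(r) > abs(best[2]):
            best = (a, b, r)
    return best


# Toy columns: y is exactly 2x + 1, z is only weakly related to either
toy = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, 7.0, 9.0, 11.0],
    "z": [2.0, -1.0, 4.0, 0.0, 1.0],
}
pair = strongest_pair(toy)
print(pair)
```

On the lab data, a call such as `strongest_pair(my_data, exclude=("CHAS", "RAD"))` would skip columns for which a linear association is hard to interpret; which columns to exclude is a judgment call.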

### 6. Look at distributional qualities of variables.

Answer the following questions:
1. Which variable has the most symmetric distribution? 
2. Which variable has the most left-skewed (negatively skewed) distribution? 
3. Which variable has the most right-skewed (positively skewed) distribution? 

Defend your method for determining this.

In [None]:
# A:
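
One possible basis for answering these questions is the moment coefficient of skewness, which is zero for a perfectly symmetric sample, negative for a left tail and positive for a right tail. The sketch below is a minimal plain-Python version; the function name `skewness` is hypothetical and the toy columns are made up, not drawn from the housing data.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5  # fails on a constant column, where m2 is zero


toy = {
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tail": [1.0, 1.0, 2.0, 2.0, 10.0],
    "left_tail": [-10.0, -2.0, -2.0, -1.0, -1.0],
}
skews = {name: skewness(column) for name, column in toy.items()}

most_symmetric = min(skews, key=lambda name: abs(skews[name]))
most_left = min(skews, key=skews.get)
most_right = max(skews, key=skews.get)
print(most_symmetric, most_left, most_right)
```

Applying the same dictionary comprehension to `my_data` would rank the housing columns; a histogram of each column is a useful cross-check, since a single number can hide multimodal shapes.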

### 8. Repeat question 6 but scale the variables by their range first.

As you may have noticed, the spread of the distribution contributed significantly to the results in question 6.

In [None]:
# A:
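
Scaling by range usually means min-max scaling: subtract the column minimum and divide by the difference between maximum and minimum, so every column ends up on [0, 1]. The sketch below assumes that reading and uses the gap between mean and median as the skew indicator, since that gap is the kind of measure that grows with the spread. The names `scale_by_range` and `mean_median_gap` are hypothetical, and the toy columns are made up.

```python
from statistics import mean, median


def scale_by_range(values):
    """Min-max scale values onto the interval [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        raise ValueError("cannot scale a constant column")
    return [(v - low) / span for v in values]


def mean_median_gap(values):
    """Positive when the mean is pulled above the median (right tail)."""
    return mean(values) - median(values)


# Two toy columns with the same shape; "wide" is just "narrow" times 100
narrow = [1.0, 1.0, 2.0, 2.0, 10.0]
wide = [100.0 * v for v in narrow]

raw_gaps = (mean_median_gap(narrow), mean_median_gap(wide))
scaled_gaps = (
    mean_median_gap(scale_by_range(narrow)),
    mean_median_gap(scale_by_range(wide)),
)
print(raw_gaps)
print(scaled_gaps)
```

Before scaling, the wider column appears far more skewed purely because of its units; after scaling, the two gaps agree. A moment-based skewness coefficient is unchanged by this rescaling, so only spread-sensitive measures are affected.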

### 9. Univariate analysis of your choice

Conduct a full univariate analysis on MEDV, CHAS, TAX, and RAD. 

For each variable, report the three things a univariate analysis generally covers, using the most appropriate metrics:
- A measure of central tendency
- A measure of spread
- A description of the shape of the distribution (plot or metric based)

If you feel there is additional information that is relevant, include it. 

In [None]:
# A:
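
The three components listed above can be sketched with the standard library's `statistics` module. This is only one way to organise the summary: the helpers `describe_numeric` and `describe_categorical` are hypothetical names, the choice of standard deviation and interquartile range as spread measures is an assumption, and the toy inputs are made up.

```python
from collections import Counter
from statistics import mean, median, pstdev, quantiles


def describe_numeric(values):
    """Central tendency, spread and a rough shape indicator for one column."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),          # population standard deviation
        "iqr": q3 - q1,                 # interquartile range
        "min": min(values),
        "max": max(values),
        # Sign hints at skew direction: positive suggests a right tail
        "mean_minus_median": mean(values) - median(values),
    }


def describe_categorical(values):
    """Share of observations at each level, most frequent first."""
    counts = Counter(values)
    return {level: count / len(values) for level, count in counts.most_common()}


summary = describe_numeric([10.0, 20.0, 30.0, 40.0, 50.0])
shares = describe_categorical([0.0, 0.0, 0.0, 1.0])
print(summary)
print(shares)
```

For this lab, something like `describe_numeric(my_data['MEDV'])` and `describe_categorical(my_data['CHAS'])` would be the natural calls; whether `TAX` and `RAD` are better treated as numeric or categorical is part of the exercise.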

### 10. Have you been using inferential statistics, descriptive statistics, or both?

For each exercise, identify the branch of statistics on which you relied for your answer.

In [None]:
# A:

### 11. Reducing the number of observations

It seems likely that this data is a census; that is, the data set includes the entire target population. Suppose that 506 observations were too many for our computer (as unlikely as this might be) and we needed to pare the data down to fewer observations.

**11.A Use the `random.sample()` function to select 50 observations from `'AGE'`.**

([This documentation](https://docs.python.org/3/library/random.html) may be helpful.)

In [None]:
# A:
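
`random.sample(population, k)` returns `k` items drawn without replacement, so no observation is picked twice. The sketch below uses a stand-in list of 506 values in place of `my_data['AGE']` so that it runs on its own, and a seeded `random.Random` instance (an optional choice, made here only so the draw is repeatable).

```python
import random

# Stand-in for my_data['AGE']: 506 distinct made-up values
ages = [float(i) for i in range(506)]

rng = random.Random(42)             # seeded generator for a repeatable draw
age_sample = rng.sample(ages, 50)   # 50 observations, without replacement

print(len(age_sample))
```

In the lab itself the call would simply be `random.sample(my_data['AGE'], 50)`.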

**11.B Identify the type of sampling we just used.**

In [None]:
# A:

### 12. [BONUS] Of the remaining types of sampling, describe (but do not execute) how you might implement at least one of these types of sampling.


In [None]:
# A:
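
As an illustration of one alternative, systematic sampling picks a random starting row and then takes every `step`-th observation after it. The sketch below is a minimal version under that definition; the function name `systematic_sample` is hypothetical, and the input is a stand-in list rather than the housing data.

```python
import random


def systematic_sample(values, k, rng=random):
    """Return k values: a random start, then every step-th value after it."""
    step = len(values) // k
    if step < 1:
        raise ValueError("k cannot exceed the number of observations")
    start = rng.randrange(step)  # random offset within the first interval
    return [values[start + i * step] for i in range(k)]


# Stand-in for a column of 506 observations
rows = list(range(506))
picked = systematic_sample(rows, 50, rng=random.Random(0))
print(picked[:5])
```

This approach only behaves like a random sample when the row order carries no pattern; if the file is sorted in some meaningful way, the fixed step can line up with that pattern and bias the result.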