# CSCI 4800: Introduction to Data Science
## Assignment 2: Exploratory Data Analysis

**NOTE** click near here to select this cell, esc-Enter will get you into cell edit mode, shift-Enter gets you back

#### **Name**: April Hudspeth

#### **Student ID**: 995032557
===

## Overview

Exploratory Data Analysis (EDA) is the process of examining and visualizing a novel dataset to understand its characteristics and patterns, before attempting more formal analysis. 

### The Dataset

The dataset we'll use can be downloaded from Canvas, under Assignment 2.

It's a dataset containing various attributes of abalone specimens, in particular the number of "rings" (last column), which indicates the approximate age of the specimen. The dataset is typically used to predict the number of rings from the other attributes.

The data directory contains these files:

* **abalone.data**, A csv file with data on a number of abalone specimens.
* **abalone.names**, A text file with background information on the dataset.

Create an A2 directory on your VM and download this data into it.

### Deliverables

Complete all the exercises below and turn in a write up in the form of an IPython notebook, that is, **an .ipynb file**.
The write up should include your code, answers to exercise questions, and plots of results.
The submission will be as an assignment on Canvas with this file (after your edits) as an attachment. 

You have to use this notebook and fill in answers inline.
Don't forget to include answers to questions that ask for natural language responses, i.e., in English, not code!

We will test your code automatically, so use the function names requested by the questions and make sure the whole notebook can be executed with "Cell > Run all".

### Guidelines

#### Code

This assignment can be done with basic python and matplotlib.
Feel free to use pandas, too, which you may find well suited to several exercises.

You're not required to do your coding in IPython, so feel free to use your favorite editor or IDE.
But when you're done, remember to put your code into an IPython notebook for your write up and make sure all cells work properly.

#### Collaboration

This assignment is to be done individually.  Everyone should be getting a hands on experience in this course.  You are free to discuss course material with fellow students, and we encourage you to use Internet resources to aid your understanding, but the work you turn in, including all code and answers, must be your own work.

## Part 0: Reading

### Exercise 0

Step 0 is to read the dataset. First download it from Canvas (see above), and save it into a data directory such as the path in the cell below. Look at the first few lines of the file. Notice that most columns are numeric, but the first column is a string with one of three values (the specimen's sex).

Now construct two versions of the data table. First produce a variable 'abalone_raw', which is a list of records where each record is a list of strings. Then construct the variable 'abalone', a list of lists of numbers, by parsing the numeric strings into float values. For the first column, map the string values to numeric ones, and create a dictionary and an inverse dictionary to map between the string values and the numeric values.

In [1]:
# Please preserve the format of this line so we can use it for automated testing.
DATA_PATH = "/home/datascience/HWs/A2/data/" # Make this the /path/to/the/data`
import re
import csv
from itertools import chain
# TODO Load data files here...
def loaddatafile(fname):
    with open(fname, 'rb') as f:
        reader = csv.reader(f,  delimiter=',')
        return list(reader)

def rawtodata(table):
    # convert the string table to a numeric one, and return dicts
    labels = [row[0] for row in table]   # first column: sex as a string

    # Give each distinct label an integer code (order follows set iteration).
    k = {}
    for code, label in enumerate(set(labels)):
        k[label] = code
    inverse = dict((v, key) for key, v in k.items())

    # Numeric rows: coded sex, then the remaining columns parsed as floats.
    c = [[k[row[0]]] + [float(v) for v in row[1:]] for row in table]
    return (c, k, inverse)

abalone_raw = loaddatafile(DATA_PATH + "abalone.data")
print abalone_raw[0]
abalone,adict,alkup = rawtodata(abalone_raw)
# abalone
adict                 # check the string -> number map for the first column

['M', '0.455', '0.365', '0.095', '0.514', '0.2245', '0.101', '0.15', '15']


{'F': 2, 'I': 0, 'M': 1}
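As a minimal sketch of the dictionary and inverse dictionary idea, the example below encodes a tiny in-memory table and decodes it again. The rows after the first are made-up values, and the names `to_code` and `to_label` are illustrative rather than required by the assignment. Unlike the mapping shown in the output above, which follows set iteration order, this sketch sorts the labels so the codes are the same on every run.

```python
# Toy records in the same layout as abalone.data (most values are made up).
raw = [
    ["M", "0.455", "15"],
    ["F", "0.530", "9"],
    ["I", "0.330", "7"],
    ["M", "0.350", "7"],
]

# Sort the labels so the codes are reproducible from run to run.
labels = sorted(set(row[0] for row in raw))
to_code = dict((label, code) for code, label in enumerate(labels))
to_label = dict((code, label) for label, code in to_code.items())

# Numeric table: coded first column, remaining columns parsed as floats.
numeric = [[to_code[row[0]]] + [float(v) for v in row[1:]] for row in raw]

# Round trip: decoding the first column recovers the original strings.
decoded = [to_label[row[0]] for row in numeric]
print(to_code)
print(decoded)
```

The inverse dictionary is what lets plot legends or reports show "M", "F" and "I" again after the analysis has been done on numbers.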

## Part 1: Basic Statistics

Create a list of the column names for this dataset from the Dataset description. Preserve the case and the spaces in these names:

In [2]:
colnames=['Sex', 'Length', 'Diameter', 'Height', 'Whole Weight', 'Shucked Weight', 'Viscera Weight', 'Shell Weight', 'Rings']

Now create a dictionary 'coldict' mapping column name to column, and use this to define a "getcol" function which returns a named column from the abalone table.

In [3]:
coldict = {col:[] for col in colnames}
for value in abalone:
    for i in range(len(value)):
        coldict[colnames[i]].append(float(value[i]))



def getcol(colname):
    return coldict[colname]



> TODO: What is the min, max, average and std deviation of the Height column?

In [4]:
import numpy
print max(coldict['Height'])
print min(coldict['Height'])
print sum(coldict['Height'])/ len(coldict['Height'])
print numpy.std(coldict['Height'], axis=0)

1.13
0.0
0.13951639933
0.0418220494777


In [5]:
summaries = []
for name in colnames:
    col = coldict[name]
    # one row per column: [max, min, mean, std]
    summaries.append([max(col), min(col), sum(col) / len(col), numpy.std(col, axis=0)])
    
summaries

[[2.0, 0.0, 0.9916207804644481, 0.79631463906237065],
 [0.815, 0.075, 0.5239920995930099, 0.1200785362060288],
 [0.65, 0.055, 0.407881254488869, 0.099227986099363466],
 [1.13, 0.0, 0.1395163993296614, 0.041822049477699706],
 [2.8255, 0.002, 0.82874215944458, 0.49033031361377233],
 [1.488, 0.001, 0.35936748862820106, 0.22193637778166947],
 [0.76, 0.0005, 0.18059360785252604, 0.10960112830473712],
 [1.005, 0.0015, 0.23883085946851795, 0.13918600552884758],
 [29.0, 1.0, 9.933684462532918, 3.2237830658211966]]
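For reference, here is a small sketch of the same four statistics computed by hand on made-up numbers. The helper name `summarize` is hypothetical. It divides by n when computing the variance, which matches the default behavior of `numpy.std` (population standard deviation, `ddof=0`); a sample standard deviation would divide by n - 1 and come out slightly larger. Note that the sketch returns the values as (min, max, mean, std), whereas each row of `summaries` above is ordered [max, min, mean, std].

```python
import math

def summarize(values):
    """Return (min, max, mean, population std) for a list of numbers."""
    n = float(len(values))
    mean = sum(values) / n
    # Population variance: divide by n, not n - 1.
    variance = sum((v - mean) ** 2 for v in values) / n
    return min(values), max(values), mean, math.sqrt(variance)

heights = [0.095, 0.090, 0.135, 0.125, 0.080]   # made-up sample values
lo, hi, mean, std = summarize(heights)
print("min=%.3f max=%.3f mean=%.3f std=%.4f" % (lo, hi, mean, std))
```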

## Part 2: Histograms

> TODO: Now create a 3x3 grid of histograms, one for each column. Make sure your figure is large enough (should consume most of the width of the page). We recommend you use pylab, and its 'subplots' function. Include the column name as a title above each subfigure. Try to use loops rather than enumerating all 9 column names.
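One way to loop over the columns instead of enumerating them is to convert each column's index into a (row, column) position in the grid. The sketch below shows only that index arithmetic, with no plotting; the `positions` dictionary is purely illustrative. With `f, axes = plt.subplots(3, 3, figsize=(15, 15))`, the resulting pair could be used as `axes[row, col]`.

```python
colnames = ['Sex', 'Length', 'Diameter', 'Height', 'Whole Weight',
            'Shucked Weight', 'Viscera Weight', 'Shell Weight', 'Rings']

ncols = 3
positions = {}
for idx, name in enumerate(colnames):
    # divmod(idx, 3) maps 0..8 to (0, 0) .. (2, 2) in row-major order.
    row, col = divmod(idx, ncols)
    positions[name] = (row, col)

for name in colnames:
    print("%-15s -> %s" % (name, positions[name]))
```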

## Part 3: Scatter plots

> TODO: Now create a grid of scatter plots for each column vs the "Rings" column. Use color to distinguish the sex of the specimen in each plot. Make titles of the form "<colname> vs Rings". It's fine to include "Rings vs Rings" as the last plot.

In [None]:
import matplotlib.pyplot as plt

# This is the 3x3 histogram of each column
f, axes = plt.subplots(3, 3, figsize=(15, 15))
for idx, name in enumerate(colnames):
    ax = axes[idx // 3, idx % 3]   # row-major position in the grid
    ax.hist(getcol(name))
    ax.set_title(name)
plt.subplots_adjust(hspace=.5)
plt.show()

# 3x3 grid of scatter plots of each column vs Rings, colored by sex,
# with a least-squares regression line (Part 4) drawn over each plot
rings = numpy.array(getcol("Rings"))
sex = getcol("Sex")
fig, axs = plt.subplots(3, 3, figsize=(15, 15))
for idx, name in enumerate(colnames):
    ax = axs[idx // 3, idx % 3]
    x = numpy.array(getcol(name))
    ax.scatter(x, rings, c=sex)
    m, b = numpy.polyfit(x, rings, 1)
    ax.plot(x, m * x + b, '-', color='red')
    ax.set_title(name + " vs Rings")
plt.subplots_adjust(hspace=0.5)
plt.show()

> TODO: Do you notice any issues with the dataset, e.g. outliers? In the Height vs Rings and Viscera Weight vs Rings plots there are a couple of noticeable outliers.
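The Height statistics printed in Part 1 give a rough numerical check on this. The sketch below plugs in those reported values and measures how many standard deviations the maximum lies from the mean; it also flags the reported minimum of 0.0, which is not a plausible height for a physical specimen. The variable names are illustrative, and a z-score is only one of several possible outlier criteria.

```python
# Summary statistics for Height as printed in Part 1 of this notebook.
height_max = 1.13
height_min = 0.0
height_mean = 0.13951639933
height_std = 0.0418220494777

# Distance of the maximum from the mean, in standard deviations.
z_max = (height_max - height_mean) / height_std
print("max Height is %.1f standard deviations above the mean" % z_max)

# A height of zero cannot be a real measurement.
has_zero_height = height_min <= 0.0
print("non-positive Height present: %s" % has_zero_height)
```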

## Part 4: Regression lines

Add regression lines to the scatter plots above as per lab 3.

> TODO: code to generate scatter plots with regression lines

## Part 5: Prediction Error

> TODO: Next we would like to explore prediction, and find the feature that gives the best (lowest error) predictions of number of rings. You can do this with polyfit, once again predicting the Rings feature from one of the others, by adding an option to return the "residual" of the fit, which is a measure of its prediction error. Read the documentation for polyfit on how to do this. Then make a 3 x 3 array of residuals. 

In [8]:
import numpy as np
residuals = np.zeros([3,3])
# TODO get the residuals returned by polyfit
# Fit Rings against every column; with full=True, polyfit also returns
# the sum of squared residuals of the fit.
y = np.array(getcol("Rings"))
r = []
for c in colnames:
    x = np.array(getcol(c))
    z, res, _, _, _ = np.polyfit(x, y, 1, full=True)
    r.append(res[0])  # res is a one-element array

# Fill the grid in reverse column order: residuals[0,0] is "Rings"
# and residuals[2,2] is "Sex".
h = len(r) - 1
for i in range(3):
    for j in range(3):
        residuals[i,j] = r[h]
        h = h - 1

residuals

array([[  1.37713325e-23,   2.63133893e+04,   3.23915437e+04],
       [  3.57207389e+04,   3.07338147e+04,   2.99199168e+04],
       [  2.90749668e+04,   2.99560836e+04,   3.64146651e+04]])

> TODO: What feature gives the smallest residual (other than Rings of course)? The Shell Weight gives the next lowest residual.
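To make the meaning of the residual concrete, the sketch below fits a straight line by hand on made-up data and computes the sum of squared errors, which is the quantity `np.polyfit(x, y, 1, full=True)` reports as its residual. The helper `fit_line` is a hypothetical name and is not part of NumPy.

```python
def fit_line(xs, ys):
    """Least-squares line y = m*x + b; returns (m, b, sse)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Sum of squared errors between actual and predicted values.
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return m, b, sse

xs = [1.0, 2.0, 3.0, 4.0]   # made-up feature values
ys = [2.0, 4.1, 5.9, 8.0]   # made-up targets
m, b, sse = fit_line(xs, ys)
print("slope=%.3f intercept=%.3f sse=%.4f" % (m, b, sse))
```

A feature that predicts the target well leaves a small sum of squared errors, which is why a column fitted against itself (Rings vs Rings) gives a residual that is essentially zero.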

The residuals are sums of the squared error for all the predictions. A more useful measure is the RMS (root-mean-squared) distance for each point. This is an estimate of how far the actual rings count for a specimen is from its prediction. From the residuals above, compute the RMS value for each residual. 

In [None]:
import math

# RMS = math.sqrt(sum([error1**2, error2**2, ..., errorN**2]) / N) = math.sqrt(residuals[i,j] / N)
# where each error is (actual rings - predicted rings) and N is the number of values in a column
def calc_rms():
    r = []
    for i in range(3):
        for j in range(3):
            r.append(math.sqrt(residuals[i,j]/len(getcol("Rings"))))
            
    return r

rms_residuals = calc_rms()

rms_residuals
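As a small worked example of the conversion, the sketch below turns a sum of squared errors into an RMS error. The numbers are hypothetical and the helper name `rms_error` is illustrative.

```python
import math

def rms_error(sse, n):
    """Root-mean-squared error from a sum of squared errors over n points."""
    return math.sqrt(sse / float(n))

# Hypothetical numbers: squared errors summing to 0.018 over 4 points.
rms = rms_error(0.018, 4)
print("rms=%.4f" % rms)

# A perfect fit leaves no error at all.
rms_perfect = rms_error(0.0, 10)
```

Because the RMS is in the same units as the target, it can be read directly as a typical prediction error in rings, which the raw sum of squares cannot.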

## Part 6: Significance

So far we have studied prediction without worrying about chance. The linear regression coefficient between any two data sequences of the same size will normally be non-zero due to noise, which might suggest that one sequence "predicts" the other. For example, pick a random woman and man from a room: their ages are almost surely different. The age and gender attributes predict each other perfectly on this sample, but the direction of influence is completely arbitrary! Obviously this doesn't generalize.

Statistical tests measure the likelihood that an observation may be due to chance if there is no "real" influence between two variables. The probability of the observations due to chance when there is no influence is called a p-value. You want this probability to be small, say less than 0.01. 
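One way to build intuition for a p-value is a permutation test: shuffle one variable many times to destroy any real relationship, and count how often chance alone produces a slope at least as large as the observed one. The sketch below does this on made-up data. It is a different technique from the analytic test that `scipy.stats.linregress` uses, and the helper names `slope` and `permutation_pvalue` are hypothetical.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def permutation_pvalue(xs, ys, trials=2000, seed=0):
    """Fraction of shuffles whose |slope| is at least the observed |slope|."""
    rng = random.Random(seed)   # fixed seed so the result is repeatable
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    return hits / float(trials)

xs = list(range(20))
ys_linear = [2.0 * x + 1.0 for x in xs]    # strong linear relationship
ys_curve = [(x - 9.5) ** 2 for x in xs]    # symmetric U shape, zero linear slope

p_linear = permutation_pvalue(xs, ys_linear)
p_curve = permutation_pvalue(xs, ys_curve)
print("p_linear=%.4f p_curve=%.4f" % (p_linear, p_curve))
```

The U-shaped example is a reminder that a large p-value for a linear fit only says there is no evidence of a linear trend; the variables can still be related in some other way.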

> TODO: Use the 'linregress' function from scipy.stats to perform linear fits between each data column and the rings column. Save the p-values it returns for each fit into a 3 x 3 array.

In [None]:
from scipy.stats import linregress
pvalues = np.zeros([3,3])
# TODO: fill in the pvalues array


def get_p_values(pvalues):
    p = []
    y = np.array(getcol("Rings"))
    
    for i in range(len(colnames)):
        x = np.array(getcol(colnames[i]))
        slope, intercept, r_value, p_value, std_err = linregress(x,y)
        p.append(p_value)
    k = len(p) - 1
    for i in range(3):
        for j in range(3):
            pvalues[i,j] = p[k]
            k = k - 1
    return pvalues
    
#slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)


pvalues = get_p_values(pvalues)
pvalues



> TODO: Are all the p-values less than 0.01? Yes.

# Submission

Please submit your IPython notebook on Canvas under Assignment 2.