# NBA Player Statistics Workshop

Given a dataset of NBA players' performance and salaries in 2014, use Python to load the dataset and compute the following summary statistics for the `SALARY` field:

- mean
- median
- mode
- minimum
- maximum

You will need the `csv` module to load the data and interact with it. The computations should require only simple arithmetic. (For the purposes of this exercise, attempt to use pure Python and no third-party dependencies like Pandas; you can then compare and contrast the use of Pandas for this task later.)

**Bonus:**

Determine the relationship of PER (Player Efficiency Rating) to Salary via a visualization of the data.


NBA 2014 Players Dataset: [http://bit.ly/gtnbads](http://bit.ly/gtnbads)

In [12]:
# Imports

import os
import csv
import json
import urllib2

from collections import Counter
from operator import itemgetter

## Fetching the Data

You have a couple of options for fetching the data set to begin your analysis:

1. Click on the link above and download the file.
2. Write a Python function that automatically downloads the file and writes it to disk. 

In either case, you'll have to be cognizant of where the CSV file lands. Here is a quick implementation of a function that downloads the contents of a URL and writes them to disk. There are many other approaches, as outlined here: [How do I download a file over HTTP using Python?](http://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python).

In [4]:
def download(url, path):
    """
    Downloads a URL and writes it to the specified path.
    Note the use of "with" to automatically close files.
    """
    response = urllib2.urlopen(url)
    # Write in binary mode so the bytes are saved exactly as received.
    with open(path, 'wb') as f:
        f.write(response.read())
    
    response.close()
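
The function above targets Python 2, where `urllib2` exists. Under Python 3 that module is gone and `urlopen` lives in `urllib.request`. A minimal sketch of an equivalent is shown here; the name `download_py3` is hypothetical and chosen only to keep it distinct from `download`.

```python
import shutil
from urllib.request import urlopen  # Python 3 home of urlopen


def download_py3(url, path):
    """
    Downloads a URL and writes the raw bytes to the specified path.
    Hypothetical Python 3 counterpart of download().
    """
    # "with" closes both the response and the file, even on error.
    with urlopen(url) as response, open(path, 'wb') as f:
        # Copy in chunks rather than reading the whole body into memory.
        shutil.copyfileobj(response, f)
```

Either version takes the same two arguments: the URL to fetch and the path to write to.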

**Your turn: use the above function to download the data!**

In [6]:
## Write the Python to download the file here:

## Loading the Data

Now that we have the CSV file, we need to open it and read it into memory. The trick is to read only a single line at a time, since CSV files can be very large. Python provides memory-efficient iteration in the form of generators and iterators, and the `csv.reader` function returns one such iterator, which reads the data from the CSV one row at a time. We also want to parse the data so that we can access specific fields by name. The `csv.DictReader` class gives you each row as a dictionary whose keys are taken from the first (header) line of the file.
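
To see how `csv.DictReader` derives its keys from the header, here is a small sketch on made-up data. The `PLAYER` column name and all values are hypothetical, and a plain list of lines stands in for an open file, since the reader accepts any iterable of strings.

```python
import csv

# Hypothetical in-memory "file": the first line is the header.
lines = [
    "PLAYER,SALARY,PER",
    "Player A,1000000,15.2",
    "Player B,0,9.8",
]

reader = csv.DictReader(lines)

# next() pulls a single row; the remaining rows are not read yet.
first = next(reader)

player = first["PLAYER"]   # looked up by header name
salary = first["SALARY"]   # still a string, not a number
```

Note that every value comes back as a string, so numeric fields need converting before doing arithmetic on them.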

Here is a function that reads data from disk one line at a time and `yield`s it to the user. 

In [8]:
def read_csv(path):
    # First open the file
    with open(path, 'r') as f:
        # Create a DictReader to parse the CSV
        reader = csv.DictReader(f)
        for row in reader:
            # Yield each row one at a time.
            yield row

**Your turn: use the above function to print out the first row of the CSV!**

To do this, you'll need to pass the path of the data you downloaded into the function. Calling the function returns a generator, so you'll need a `for` loop to access the data. Note that `break` stops a `for` loop from running. For example, the code:

```python
for idx in xrange(100):
    if idx > 10:
        break
```

stops the `for` loop once `idx` reaches 11, after the values 0 through 10 have been processed.
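
The same pattern applies to generators. The sketch below uses a toy generator (`numbers`, a hypothetical stand-in rather than `read_csv` itself) to show that `break` leaves the loop before the remaining values are ever produced.

```python
def numbers():
    # A toy generator: values are produced lazily, one per iteration.
    for n in [10, 20, 30]:
        yield n


seen = []
for value in numbers():
    seen.append(value)
    break  # leave the loop after the first value

# Only the first value was consumed; 20 and 30 were never yielded.
```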

In [10]:
## Write the Python to print the first row of the CSV here!

## Summary Statistics

In this section, you'll use the CSV data to write computations for mean, median, mode, minimum, and maximum. These summary statistics can be computed inside of a single function, which will take as an argument the field that you want to compute the statistics upon. The function will then load the data from disk, make the computation, and return a dictionary with the summary statistics. 

The function stub is below; it's up to you to fill in the details. Test your function by running the print command at the end of the cell. Note that what we're looking for is working out the _algorithm_ involved in computing summary statistics. In real life, you'll use _tools_ for solving these problems (like Pandas). Please attempt to make the computations using only the `sorted` (and `itemgetter`) and `Counter` tools already imported for you.

(You'll have to sort the data in order to get the median, and you'll need something to keep track of frequency to get the mode.) How do you find out about `sorted`, `itemgetter`, and `Counter` in Python?
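
One way to find out is to call `help(sorted)` or `help(Counter)` in the interpreter and then try the tools on a tiny example. The sketch below does that with made-up records (the `name` and `score` keys are hypothetical and not part of the NBA dataset); it shows what each tool does, not the full statistics function.

```python
from collections import Counter
from operator import itemgetter

# Hypothetical toy records, not taken from the NBA dataset.
rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 1},
    {"name": "c", "score": 3},
]

# sorted returns a new list; itemgetter("score") builds the key function.
by_score = sorted(rows, key=itemgetter("score"))
names_in_order = [row["name"] for row in by_score]

# Counter tallies how often each value occurs.
counts = Counter(row["score"] for row in rows)
most_common_value, frequency = counts.most_common(1)[0]
```

Here `names_in_order` is `['b', 'a', 'c']` (ties keep their original order because `sorted` is stable), and the most common score is 3, which occurs twice.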

**Your turn: fill in the function below to return a dictionary of summary statistics for a specific field**

In [19]:
def statistics(path, field):
    """
    Takes as input a path to `read_csv` and the field to
    compute the summary statistics upon.
    """
    
    # Uncomment below to load the CSV into a list
    # data = list(read_csv(path))
    
    # Fill in the function here
    
    return {
        'mean': None,
        'median': None,
        'mode': None,
        'maximum': None,
        'minimum': None,
    }

# Runs the function to ensure everything is working
print json.dumps(statistics('', 'SALARY'), indent=4)

{
    "minimum": null, 
    "median": null, 
    "mode": null, 
    "maximum": null, 
    "mean": null
}


Keep playing with the above function to get it to work more efficiently or to handle bad data in the computation, e.g. what are all those zero salaries?
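
As a starting point for handling bad data, here is a sketch that converts raw CSV strings to numbers and drops entries that are empty or zero. The sample values and the helper name `to_number` are hypothetical, and whether zero salaries should really be excluded is a judgment call that depends on what they mean in the dataset.

```python
# Hypothetical raw values as they might come out of csv.DictReader (all strings).
raw = ["1500000", "0", "", "2750000", "0"]


def to_number(text):
    """Converts a CSV string to a float, or returns None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


values = [to_number(text) for text in raw]

# Keep only values that parsed successfully and are greater than zero.
cleaned = [v for v in values if v is not None and v > 0]
```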


## Visualization

Congratulations if you've made it this far! It's time for the bonus round!

You now have some summary statistics about the salaries of NBA players, but what we're really interested in is the relationship between `SALARY` and the rest of the fields in the data set. The `PER` (Player Efficiency Rating) is an aggregate score of all performance statistics; therefore, if we determine the relationship of `PER` to `SALARY`, we might learn a lot about how to model NBA salaries.

To explore this, let's create a scatter plot of `SALARY` against `PER`, where each point is an NBA player.
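
Whichever plotting library you pick, it will expect numeric x and y values rather than the strings that `csv.DictReader` produces. The sketch below, using made-up rows, shows one way to build the two parallel lists; the plotting call itself is left to the library you choose.

```python
# Hypothetical rows shaped like the output of read_csv (values are strings).
rows = [
    {"PER": "15.2", "SALARY": "1000000"},
    {"PER": "22.9", "SALARY": "12500000"},
    {"PER": "9.8", "SALARY": "0"},
]

# Build two parallel lists: x values for PER, y values for SALARY.
per = [float(row["PER"]) for row in rows]
salary = [float(row["SALARY"]) for row in rows]

# Each (x, y) pair is one player, i.e. one point in the scatter plot.
points = list(zip(per, salary))
```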

Visualization is going to require a third-party library; select one of the following:

- matplotlib
- seaborn
- bokeh
- pandas

Then `pip install` it. Follow the documentation to create the scatter plot inline in the notebook in the following cells. (Our recommendation is Bokeh, since it is a bit more interactive.) Note that you probably already have matplotlib, so it might be the simplest choice if you're having trouble with installation.

In [20]:
# Insert your Python to create the visualization here