## Lab: csvs, functions, numpy, and distributions

Run the cell below to load the required packages and set up plotting in the notebook!

In [1]:
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
%matplotlib inline

### Sales data

For this lab we will use a truncated version of a sales dataset that we will examine in more detail later.

The csv has about 200 rows of data and 4 columns. The relative path to the csv ```sales_info.csv``` is provided below. If you copied or moved the files, the path may be different for you, and you will need to work out the correct relative path.

In [2]:
sales_csv_path = '../../assets/datasets/sales_info.csv'
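If the path does not work on your machine, a quick check like the one below can help. This is only a sketch: `check_path` is a hypothetical helper, not part of the lab, and it simply resolves a relative path against the current working directory and reports whether anything exists there.

```python
import os


def check_path(path):
    """Return (absolute_path, exists) for a possibly relative path."""
    absolute = os.path.abspath(path)
    return absolute, os.path.exists(absolute)


# Same relative path as in the lab; adjust it if your files live elsewhere.
sales_csv_path = '../../assets/datasets/sales_info.csv'
absolute, exists = check_path(sales_csv_path)

print('Working directory: ' + os.getcwd())
print('Resolves to:       ' + absolute)
print('File exists:       ' + str(exists))
```

If `File exists` is `False`, compare the resolved path with where the csv actually sits and adjust the number of `../` segments accordingly.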

#### 1. Loading the data

Set up an empty list called ```rows```.

Using the pattern for loading csvs we learned earlier, add all of the rows in the csv file to the rows list.

For your reference, the pattern is:
```python
with open(my_csv_path, 'r') as f:
    reader = csv.reader(f)
    ...
```

Beyond this, adding the rows in the csv file to the ```rows``` variable is up to you.

In [2]:
rows = []
with open(sales_csv_path, 'r') as f:
    reader = csv.reader(f)
    # Each item yielded by the reader is one csv row, as a list of strings.
    for line in reader:
        rows.append(line)

#### 2. Separate header and data

The header of the csv is the first element (index 0) of the ```rows``` variable, as it is the first row in the csv file. 

Use python indexing to create two new variables: ```header```, which contains the 4 column names, and ```data```, which contains the list of lists, each sub-list representing a row from the csv.

Lastly, print ```header``` to see the names of the columns.

In [3]:
header = rows[0]
data = rows[1:]
print(header)
data

['volume_sold', '2015_margin', '2015_q1_sales', '2016_q1_sales']


[['18.4207604861', '93.8022814583', '337166.53', '337804.05'],
 ['4.77650991918', '21.0824246877', '22351.86', '21736.63'],
 ['16.6024006077', '93.6124943024', '277764.46', '306942.27'],
 ['4.29611149826', '16.8247038328', '16805.11', '9307.75'],
 ['8.15602328201', '35.0114570034', '54411.42', '58939.9'],
 ['5.00512242518', '31.8774372328', '255939.81', '332979.03'],
 ['14.60675', '76.5189730216', '319020.69', '302592.88'],
 ['4.45646649485', '19.3373453608', '45340.33', '55315.23'],
 ['5.04752965097', '26.142470349', '57849.23', '42398.57'],
 ['5.38807023767', '22.4270237673', '51031.04', '56241.57'],
 ['9.34734863474', '41.892132964', '68657.91', '3536.14'],
 ['10.9303977273', '66.4030492424', '4151.93', '137416.93'],
 ['6.27020860495', '47.8693242069', '121837.56', '158476.55'],
 ['12.3959191176', '86.7601504011', '146725.31', '125731.51'],
 ['4.55771189614', '22.9481762576', '119287.76', '21834.49'],
 ['4.20012242627', '18.7060545353', '20335.03', '39609.55'],
 ['10.2528698945', '4

#### 3. Create a dictionary with the data

Use loops or list comprehensions to create a dictionary called ```sales_data```, where the keys of the dictionary are the column names, and the values of the dictionary are lists of the data points of the column corresponding to that column name.
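As a rough illustration of the two approaches the prompt mentions, here is a sketch on a tiny made-up table. The names `sample_header`, `sample_data`, `by_loop` and `by_comprehension` are hypothetical and the values are invented; only the pattern carries over to the real data.

```python
# Hypothetical miniature of a header row plus data rows (values are made up).
sample_header = ['volume_sold', '2015_margin']
sample_data = [['18.4', '93.8'],
               ['4.7', '21.0'],
               ['16.6', '93.6']]

# Option 1: loop over column positions and collect that position from each row.
by_loop = {}
for i, name in enumerate(sample_header):
    by_loop[name] = [row[i] for row in sample_data]

# Option 2: zip(*sample_data) transposes the rows into columns, and a dict
# comprehension pairs each column with its name.
by_comprehension = {name: list(column)
                    for name, column in zip(sample_header, zip(*sample_data))}

print(by_loop['volume_sold'])  # ['18.4', '4.7', '16.6']
```

Both options give the same dictionary. The values are still strings at this point, which is what step 4 addresses.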

In [4]:
# An alternative way to build the dictionary, using a dict comprehension:
# sales_data = {hed: [x[header.index(hed)] for x in data] for hed in header}

# Keys are the column names; values are the data points for that column.
sales_data = {}

# Convert the string values to floats, then transpose so that each row of
# the array holds one column of the csv.
data = np.array(data, dtype=np.float64)
data = data.T
for i in range(len(header)):
    sales_data[header[i]] = data[i]
sales_data

{'2015_margin': array([  93.80228146,   21.08242469,   93.6124943 ,   16.82470383,
          35.011457  ,   31.87743723,   76.51897302,   19.33734536,
          26.14247035,   22.42702377,   41.89213296,   66.40304924,
          47.86932421,   86.7601504 ,   22.94817626,   18.70605454,
          44.04117663,   62.19900401,   14.25180952,   16.04326864,
          25.19117143,   31.75306583,   23.1614514 ,   48.82074074,
          73.23150442,   23.45033357,   14.14479263,   36.40852849,
          36.17186191,   59.89347792,   37.10855486,   52.40559169,
          30.68109917,   48.13336834,   47.74068036,   97.22433987,
          31.29239268,   35.27017991,   31.9091556 ,   29.14825321,
          32.62359167,   47.98937045,   55.5221865 ,   31.94163795,
          49.34206285,   42.86938521,   53.18490733,   25.40500628,
          43.93909625,   44.53483184,   39.53006519,   31.51060332,
          50.13319728,   28.71158014,   52.42356307,   24.002801  ,
          47.31843443,   49.19443

**3.A** Print out the first 10 items of the 'volume_sold' column.

In [61]:
sales_data['volume_sold'][:10]


array([ 18.42076049,   4.77650992,  16.60240061,   4.2961115 ,
         8.15602328,   5.00512243,  14.60675   ,   4.45646649,
         5.04752965,   5.38807024])

#### 4. Convert data from string to float

As you can see, the data is still in string format (which is how it is read in from the csv). For each key:value pair in our ```sales_data``` dictionary, convert the values (column data) from string values to float values.

In [None]:
#did this already when creating the above dictionary.
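For reference, if the dictionary values were still lists of strings, the conversion could be sketched as below. The dictionary `string_columns` is a hypothetical stand-in with made-up values, not the lab's data.

```python
# Hypothetical dictionary whose values are still strings, as read from a csv.
string_columns = {'volume_sold': ['18.4', '4.7'],
                  '2015_margin': ['93.8', '21.0']}

# Build a new dictionary with every value converted to float.
float_columns = {name: [float(value) for value in values]
                 for name, values in string_columns.items()}

print(float_columns['volume_sold'])  # [18.4, 4.7]
```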

#### 5. Write function to print summary statistics

Now write a function to print out summary statistics for the data.

Your function should:

- Accept two arguments: the column name and the data associated with that column
- Print out information, clearly labeling each item when you print it:
    1. Print out the column name
    2. Print the mean of the data using ```np.mean()```
    3. Print out the median of the data using ```np.median()```
    4. Print out the variance of the data using ```np.var()```
    5. Print out the standard deviation of the data using ```np.std()```
    
Remember that you will need to convert the numeric output of these functions to strings by wrapping it in the ```str()``` function.
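Before writing the function, it may help to see what these four numpy calls return on a small made-up sample where the answers can be checked by hand. This is only a sketch; `sample` and `manual_variance` are hypothetical names. Note that `np.var()` and `np.std()` default to the population formulas (`ddof=0`), dividing by `n` rather than `n - 1`.

```python
import numpy as np

# Small made-up sample: the values sum to 40, so the mean is 5.
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=np.float64)

mean = np.mean(sample)        # 5.0
median = np.median(sample)    # 4.5, the average of the two middle values
variance = np.var(sample)     # 4.0, the mean squared deviation (ddof=0)
std = np.std(sample)          # 2.0, the square root of the variance

# The variance computed by hand, to show what np.var() does by default.
manual_variance = np.sum((sample - mean) ** 2) / len(sample)

print('mean: ' + str(mean))
print('median: ' + str(median))
print('variance: ' + str(variance))
print('standard deviation: ' + str(std))
```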

In [67]:
def sumstats(column, data):
    print('column: ' + column)
    print('mean: ' + str(np.mean(data)))
    print('median: ' + str(np.median(data)))
    print('variance: ' + str(np.var(data)))
    print('standard deviation: ' + str(np.std(data)))

**5.A** Using your function, print the summary statistics for 'volume_sold'

In [68]:
sumstats('volume_sold', sales_data['volume_sold'])

column: volume_sold
mean: 10.018684079
median: 8.16634551564
variance: 84.1299652005
standard deviation: 9.1722388325


**5.B** Using your function, print the summary statistics for '2015_margin'

In [71]:
sumstats('2015_margin', sales_data['2015_margin'])

column: 2015_margin
mean: 46.8588951379
median: 36.5621438181
variance: 2016.06166296
standard deviation: 44.9005753077


**5.C** Using your function, print the summary statistics for '2015_q1_sales'

In [70]:
sumstats('2015_q1_sales', sales_data['2015_q1_sales'])

column: 2015_q1_sales
mean: 154631.6682
median: 104199.41
variance: 47430301462.3
standard deviation: 217784.989066


**5.D** Using your function, print the summary statistics for '2016_q1_sales'

In [69]:
sumstats('2016_q1_sales', sales_data['2016_q1_sales'])

column: 2016_q1_sales
mean: 154699.17875
median: 103207.2
variance: 47139411653.4
standard deviation: 217116.124812
standard deviation: 217116.124812


#### 6. Plot the distributions

We've provided a plotting function below called ```distribution_plotter()```. It takes two arguments, the name of the column and the data associated with that column.

In individual cells, plot the distributions for each of the 4 columns. Do the data appear skewed? Symmetrical? If skewed, what would be your hypothesis for why?

In [None]:
def distribution_plotter(column, data):
    sns.set(rc={"figure.figsize": (10, 7)})
    sns.set_style("white")
    dist = sns.distplot(data, hist_kws={'alpha':0.2}, kde_kws={'linewidth':5})
    dist.set_title('Distribution of ' + column + '\n', fontsize=16)
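
As a numeric complement to the plots, skewness can also be estimated directly. The sketch below uses made-up samples and a hypothetical helper, `skewness`, which computes the Fisher-Pearson coefficient from population moments. To the best of my knowledge `scipy.stats.skew` returns the same quantity with its default settings, but treat that as an assumption to verify. A positive value, like a mean noticeably larger than the median, suggests a right-skewed distribution.

```python
import numpy as np


def skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    values = np.asarray(values, dtype=np.float64)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)   # second central moment (variance, ddof=0)
    m3 = np.mean(centered ** 3)   # third central moment
    return m3 / m2 ** 1.5


# Made-up samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]

print('symmetric skewness: ' + str(skewness(symmetric)))
print('right-skewed skewness: ' + str(skewness(right_skewed)))
print('right-skewed mean vs median: '
      + str(np.mean(right_skewed)) + ' vs ' + str(np.median(right_skewed)))
```

In the right-skewed sample, one large value pulls the mean (8.5) well above the median (3.0). The sales columns show the same pattern of a mean larger than the median in their summary statistics.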