# Introduction to Data Science – Homework 2
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

Due: Friday, January 23, 11:59pm.

This homework is designed to reinforce the skills we covered last week, including various descriptive statistics. Go through the lectures again if you have any trouble.

**Note: Please provide comments describing what your code does at each stage**.

**Note: Please fill in each cell below that says `Your Interpretation` or `Your description`**.


## Your Data
Fill out the following information: 

*First Name:*   
*Last Name:*   
*E-mail:*   
*UID:*  

## Part 1: Vector data

We will first work with a vector of monthly average temperatures for Taylorsville, UT, which we downloaded from [NOAA](https://www.ncdc.noaa.gov/). The data is included in this repository in the file `TaylorSville.csv`.

The data is stored in CSV format, a simple text file of 'Comma Separated Values'.
To load the data, we use the `read_csv` function from the pandas library. The following code reads the file and extracts the temperature and precipitation columns as vectors (Python lists):

In [None]:
# import the csv library
import csv
# import the math library we'll use later
import math
# import the pandas library
import pandas as pd

# read in and print the data
data = pd.read_csv("TaylorSville.csv")
print(data)

# define vectors of interest
temperature_vector = data["TAVG"].tolist()
precipitation_vector = data["PRCP"].tolist()
        
# print the vectors to see if it worked
print(temperature_vector)
print(precipitation_vector)
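
To make the CSV format concrete, here is a minimal sketch that parses a tiny in-memory sample with the standard `csv` module. The sample rows and the `DATE` column are made up for illustration; only the `TAVG` and `PRCP` column names come from the code above.

```python
import csv
import io

# Hypothetical sample: the values and the DATE column are invented,
# only the TAVG and PRCP column names mirror the real file.
sample = "DATE,TAVG,PRCP\n2015-01,30.5,1.2\n2015-02,35.0,0.8\n"

# DictReader uses the first line as the header and yields one dict per row
rows = list(csv.DictReader(io.StringIO(sample)))

# The csv module returns every field as a string, so numeric columns
# need an explicit conversion (pandas does this for you)
tavg = [float(row["TAVG"]) for row in rows]
print(tavg)
```

Unlike `pd.read_csv`, the `csv` module leaves type conversion to you, which is why the import code in Part 2 calls `int()` on each value.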

We'll next use descriptive statistics to analyze the data in `temperature_vector`.

In this problem, we'll do calculations that are also available in NumPy. For the purpose of this homework, however, **we want you to implement the solutions using standard python functionality and the math library, and then check your results using functions in NumPy** . 

See the [NumPy library](http://docs.scipy.org/doc/numpy-1.11.0/reference/routines.statistics.html) documentation and include the checks as a separate code cell. 

### Task 1.1: Calculate the Sum of a Vector

Write a function that calculates and returns the sum of a vector that you pass into it. 

Pass the temperature and precipitation vectors into this function and print the results.

In [None]:
# your code goes here

In [None]:
# check results using Numpy

### Task 1.2: Calculate the Mean of a Vector

Write a function that calculates and returns the [arithmetic mean](https://en.wikipedia.org/wiki/Arithmetic_mean) of a vector that you pass into it. Use the sum function you defined above to calculate the sum of all the elements of the vector.

Pass the temperature and precipitation vectors into this function and print the results. Provide a written interpretation of your results (e.g., "The mean temperature for TaylorSville is XXX degrees Fahrenheit. The mean precipitation is XXX inches.")

In [None]:
# your code goes here

In [None]:
# check results using Numpy

**Your Interpretation:** TODO

### Task 1.3: Calculate the Median of a Vector
Write a function that calculates and returns the [median](https://en.wikipedia.org/wiki/Median) of a vector. Pass the temperature vector into this function and print the result. Make sure that your function works for vectors with an even number of elements as well as for vectors with an odd number of elements. In the case of an even number of elements, use the mean of the two middle values. Provide a written interpretation of your results.

Hint: the [`sorted()`](https://docs.python.org/3/library/functions.html#sorted) function might be helpful for this.
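
As a brief illustration of the hint, `sorted()` returns a new list and leaves its argument unchanged, so the original vector keeps its order. The values below are made up for the example:

```python
values = [7, 2, 9, 4]

# sorted() builds a new, ascending list
ordered = sorted(values)

print(ordered)  # [2, 4, 7, 9]
print(values)   # [7, 2, 9, 4], the input is not modified
```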

In [None]:
# your code goes here

In [None]:
# check results using Numpy

**Your Interpretation:** TODO

### Task 1.4: Calculate the Variance and Standard Deviation of a Vector

Write a function that calculates and returns the [variance](https://en.wikipedia.org/wiki/Variance) of a vector. Pass the temperature and precipitation vectors into this function and print the results. Provide a written interpretation of your results. Is the variance high? Why or why not? 

Here we use the sample variance, which is the sum of the squared deviations from the mean divided by $N-1$, i.e.,

$$ Var(X) = {\sigma}^{2} = \frac{1}{N-1} \sum_{i=1}^{N} {{(x_i - \mu)}^2} $$

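where $\mu$ is the mean of the vector. Hint: use your mean function to calculate it. Note that NumPy's `var` and `std` divide by $N$ by default; pass `ddof=1` when checking your results so that they match the $N-1$ formula above.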
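
The effect of dividing by $N-1$ rather than $N$ is easiest to see on a tiny vector. The following is a minimal sketch using Python's standard `statistics` module, which is used here only for illustration and is not one of the tools the assignment asks for; the four values are made up.

```python
import statistics

values = [2.0, 4.0, 4.0, 6.0]  # mean is 4.0, squared deviations sum to 8.0

# Sample variance: divides the sum of squared deviations by N - 1
sample_var = statistics.variance(values)

# Population variance: divides the same sum by N
population_var = statistics.pvariance(values)

print(sample_var, population_var)
```

Here the sample variance is 8/3 and the population variance is 2, so a check that uses the wrong divisor will disagree with your function even when your function is correct.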

Finally, calculate the standard deviation $\sigma$, defined as the square root of the variance. Explain which measure (variance or standard deviation) is more interpretable in this context, and why.

In [None]:
# your code goes here

In [None]:
# check results using Numpy

**Your Interpretation:** TODO

### Task 1.5: Histogram

Write a function called `histogram` that takes a vector and an integer `b` and calculates a [histogram](https://en.wikipedia.org/wiki/Histogram) with `b` bins. The function should return an array containing two arrays. The first should be the counts for each bin, the second should contain the borders of the bins.

For `b=5` for the temperature vector, your output should look like this: 

`[[19, 102, 78, 72, 95], [11.0, 23.0, 35.0, 47.0, 59.0, 71.0]]`

Here, the first array gives the number of entries in each bin, and the second gives the bin borders, which are equally spaced between the minimum and maximum value. That is, the first bin, from 11.0 to 23.0, has 19 entries, the second, from 23.0 to 35.0, has 102 entries, etc. 
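
The structure of this output can be checked with a few lines of Python. This sketch only inspects the sample output quoted above; it does not compute a histogram.

```python
# Sample output for b=5, copied from the text above
counts = [19, 102, 78, 72, 95]
edges = [11.0, 23.0, 35.0, 47.0, 59.0, 71.0]

# b bins need b + 1 borders
assert len(edges) == len(counts) + 1

# Equally spaced borders: every bin has the same width
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print(widths)

# The counts add up to the number of values in the vector
total = sum(counts)
print(total)
```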


Calculate the histogram for precipitation and temperature when `b=20`. Once you have calculated the bins and counts, plot the histograms for precipitation and temperature using the `hist` function from the `matplotlib.pyplot` library. Provide a written interpretation of your results. Comment on the shape of each histogram: is it unimodal or bimodal? Is the distribution skewed (if so, in which direction)?

Explain how changing the number of bins affects the appearance of the histogram.

In [None]:
# your code goes here

# the call to your function
# the number of bins is passed positionally, so it works whether the parameter is named b or bins
print(histogram(temperature_vector, 5))
print(histogram(precipitation_vector, 5))

In [None]:
# check results using Numpy

**Your interpretation:** TODO

## Part 2: Working with Matrices

For the second part of the homework, we are going to work with matrices. The [dataset we will use](https://www.wunderground.com/history/airport/KSLC/2015/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2015&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=) contains different properties of the weather in Salt Lake City for 2015 (temperature, humidity, sea level, wind speed ...). It is stored in the file [`SLC_2015.csv`](SLC_2015.csv) in this repository.

We first read the data from the file and store it in a nested Python array (`wind_matrix`). A nested Python array is an array in which each element is itself an array. Here is a simple example: 

In [None]:
arr1 = [1,2,3]
arr2 = ['a', 'b', 'c']

nestedArr = [arr1, arr2]
nestedArr
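
Elements of a nested list are reached with two indices, first the row and then the position within that row. A minimal sketch with made-up values (the rows deliberately differ in length, as the months will later):

```python
arr1 = [1, 2, 3]
arr2 = ['a', 'b', 'c', 'd']  # rows need not have the same length
nested = [arr1, arr2]

first_row = nested[0]   # the whole first inner list
item = nested[1][2]     # row index 1, then position 2 within that row

print(first_row)
print(item)
```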

We provide you with the import code, which writes the mean wind speed data into the nested list `wind_matrix`. The list contains one list for each month, which, in turn, contains the mean wind speed (MPH) of every day of that month. 

In [None]:
# initialize the 12 arrays for the months
wind_matrix = [[] for i in range(12)]

# open the file and append the values of the last column to the array
with open('SLC_2015.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # get rid of the header
    next(filereader)
    for row in filereader:
        month = int(row[0].split('/')[0])
        mean_wind = int(row[17])
        wind_matrix[month-1].append(mean_wind)

print(wind_matrix)

# the mean wind speed on August 23. Note the index offset:
print("Mean wind speed on August 23: " + str(wind_matrix[7][22]))

We will next compute the same descriptive statistics as in Part 1 using the nested array `wind_matrix`. 

In this problem, **we again want you to implement the solutions using standard python functionality and the math library**. We recommend you check your results using NumPy.

**Note:** Since the lists in the matrix are of varying lengths (28 to 31 days), many of the standard NumPy functions won't work directly.
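
One common way to deal with rows of unequal length is to flatten them into a single list first. The sketch below uses `itertools.chain` from the standard library on made-up data; it is one possible approach, not the required one.

```python
from itertools import chain

# Hypothetical ragged "matrix": each row has a different length
ragged = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Slicing selects a range of rows (here the first two)
subset = ragged[0:2]

# chain.from_iterable concatenates the selected rows into one flat list
flat = list(chain.from_iterable(subset))
print(flat)
```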

### Task 2.1: Calculate the Avg. Wind Speed from February to July (6 months)

Write a function that calculates the mean of a matrix from a given starting month to a given ending month. For this version, calculate the mean over all elements in that range as if they formed one large vector. 
Pass in the matrix with the wind speed data and the starting and ending months, and return the result. Provide a written interpretation of your results.
Can you use your function from Part 1 and get a valid result?

In [None]:
# your code goes here

In [None]:
# check results with numpy

**Your Interpretation:** TODO

### Task 2.2: Calculate the mean wind speed of each month in the second half of the year (July to December)

Write a function that calculates the mean wind speed of each month in the second half of the year and returns an array with one mean per row. Provide a written interpretation of your results. Can you use the function you implemented in Part 1 here efficiently? If so, use it.

In [None]:
# your code goes here

**Your Interpretation:** TODO

### Task 2.3:  Calculate the median wind speed from February to July

Write a function that calculates and returns the median of a matrix over all values from February to July (regardless of which row they come from). Provide a written interpretation of your results. Can you use your function from Part 1 and get a valid result?

In [None]:
# your code goes here

In [None]:
# check results with numpy

**Your Interpretation:** TODO

### Task 2.4: Calculate the median wind speed of each month in the first half of the year (January to June)

Write a function that calculates the median wind speed of each sub-array (i.e., each row in `wind_matrix`) in the first half of the year and returns an array of medians (one entry for each row). To do so, use the function you implemented in Part 1. Provide a written interpretation of your results. 

In [None]:
# your code goes here

**Your Interpretation:** TODO

### Task 2.5: Calculate the standard deviation of a whole matrix

Write a function that calculates and returns the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of a matrix over all values in the matrix (regardless of which row they come from). Can you use your variance function from Part 1 and get a valid result? Provide a written interpretation of your results. 

In [None]:
# your code goes here

In [None]:
# check results with numpy

**Your Interpretation:** TODO

### Task 2.6: Calculate the standard deviation of each vector of a matrix

Write a function that calculates the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of each array in the matrix and returns an array of standard deviations (one for each row). To do so, use the variance function you implemented in Part 1. 
Pass in the matrix with the wind speed data and return the result. Provide a written interpretation of your results. Is the standard deviation consistent across the seasons? 

In [None]:
# your code goes here

**Your Interpretation:** TODO

## Part 3: Poisson distribution 

In class, we looked at [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution) and [binomial](https://en.wikipedia.org/wiki/Binomial_distribution) discrete random variables. Another example of a discrete random variable is a *Poisson random variable*. 

Read the [Wikipedia article on the Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution).

### Part 3.1. Descriptive statistics

Describe what a Poisson random variable is. What is the parameter $\lambda$? What are the min, max, mean, and variance of a Poisson random variable? 

**Your description:** TODO

### Part 3.2. Example 

Give an example of an application that is described by a Poisson random variable. What type(s) of experiment(s) generate data that is Poisson distributed?

**Your description:** TODO

### Part 3.3. Probability mass function

For the parameter $\lambda = 4.5$, plot the probability mass function (you may use scipy). 
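
If you want to cross-check library values by hand, the probability mass function given in the Wikipedia article, $P(X = k) = \lambda^k e^{-\lambda} / k!$, can be evaluated with the `math` library alone. The helper name `poisson_pmf` and the cutoff at $k = 39$ are choices made for this sketch, not part of the assignment.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam (hypothetical helper)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 4.5

# Evaluate the PMF on k = 0, 1, ..., 39; the remaining tail is negligible for lam = 4.5
probabilities = [poisson_pmf(k, lam) for k in range(40)]

# The probabilities over this range should sum to (almost exactly) 1
total = sum(probabilities)
print(total)
```

Plotting is left out of the sketch on purpose; the list `probabilities` can be passed to whichever plotting function you choose.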

In [None]:
# your code

### Part 3.4. Poisson sampling

Write Python code that takes 1700 samples from the Poisson distribution with parameter $\lambda = 4.5$. Make a histogram of the samples and compute the sample mean and variance. How does the histogram compare to the probability mass function?

In [None]:
# your code

**Your description:** TODO

### Part 3.5. Rare events

Explain why extremely unlikely events under a Poisson model still have nonzero probability. How can this be related to the example you gave in Part 3.2?

**Your description:** TODO