  ### Benny Cohen
  
  
  #### 9/22/2019
  #### Analytics Programming Module 4 Jupyter Notebook 

In this notebook we will use NumPy to analyze a CSV file containing information about water consumption by residents of New York City.
The data set contains information about:
1. The year
2. The population of New York City during that year
3. The amount of water consumed during that year (in millions of gallons per day)
4. The gallons consumed per person per day

Since we are going to be using numpy, we have to import it!

In [15]:
import numpy as np

Now let's load the data. I downloaded it to my local machine from https://data.cityofnewyork.us/Environment/Water-Consumption-In-The-New-York-City/ia2d-e54m
and am loading it from a local file path. We pass the delimiter as ',' since this is a comma-separated file. We also pass skip_header so that the header row is skipped. Strictly speaking, skip_header is the number of lines to skip at the start of the file; True works here because it is treated as 1.

In [16]:
data = np.genfromtxt("Water_Consumption_In_The_New_York_City.csv", delimiter = ',', skip_header = True)
type(data)

numpy.ndarray
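
As a minimal sketch of what skip_header does, the snippet below parses a tiny in-memory CSV instead of the real file. The header names are assumptions made up for illustration (the actual file's headers are not shown in this notebook); the two data rows are copied from the output printed further down.

```python
import io
import numpy as np

# Hypothetical two-row sample; the header names are assumed for illustration only.
sample_csv = io.StringIO(
    "Year,Population,Consumption_MGD,Per_Capita_GPD\n"
    "1979,7102100,1512,213\n"
    "1980,7071639,1506,213\n"
)

# skip_header is the number of leading lines to skip; 1 skips the header row.
sample = np.genfromtxt(sample_csv, delimiter=",", skip_header=1)
print(sample.shape)  # (2, 4)
print(sample.dtype)  # float64
```

Without skip_header, the header line would be parsed as data and show up as a row of nan values.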

Let's set our print options and see what the data looks like. 

In [17]:
np.set_printoptions(precision = 3, suppress = True)

In [18]:
print("The type of data is " + str(data.dtype))
rows, columns = np.shape(data)
print("There are " + str(rows) + " rows (years)  and " + str(columns) + " columns")
print(data)


The type of data is float64
There are 40 rows (years)  and 4 columns
[[   1979.  7102100.     1512.      213. ]
 [   1980.  7071639.     1506.      213. ]
 [   1981.  7089241.     1309.      185. ]
 [   1982.  7109105.     1382.      194. ]
 [   1983.  7181224.     1424.      198. ]
 [   1984.  7234514.     1465.      203. ]
 [   1985.  7274054.     1326.      182. ]
 [   1986.  7319246.     1351.      185. ]
 [   1987.  7342476.     1447.      197. ]
 [   1988.  7353719.     1484.      202. ]
 [   1989.  7344175.     1402.      191. ]
 [   1990.  7335650.     1424.      194. ]
 [   1991.  7374501.     1469.      199. ]
 [   1992.  7428944.     1369.      184. ]
 [   1993.  7506166.     1368.5     182. ]
 [   1994.  7570458.     1357.7     179. ]
 [   1995.  7633040.     1325.7     174. ]
 [   1996.  7697812.     1297.9     169. ]
 [   1997.  7773443.     1205.5     155. ]
 [   1998.  7858259.     1219.5     155. ]
 [   1999.  7947660.     1237.2     156. ]
 [   2000.  8008278.     124

As mentioned at the start, the first column holds the year, the second the population, the third the total consumption, and the fourth the per capita consumption.

Let's find the year with the maximum NYC water consumption in millions of gallons per day. Each row represents a year, and the 3rd column contains the consumption for that year. We therefore need the row where the 3rd column reaches its maximum.

In [19]:
yearWithMost = data[data[:,2] == data[:,2].max()] # that is, select the row(s) where the 3rd column equals the max of the 3rd column
int(yearWithMost[0][0])

1979

In [20]:
print(str(int(yearWithMost[0][0])) + " has the most water consumption in millions of gallons per day with " + str(yearWithMost[0][2]))

1979 has the most water consumption in millions of gallons per day with 1512.0
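
An alternative way to locate the same row is np.argmax, which returns the index of the maximum directly. The sketch below uses only the first three rows copied from the printed data, so it illustrates the technique rather than reproducing the full analysis.

```python
import numpy as np

# First three rows copied from the printed data: year, population, consumption, per capita.
sample = np.array([
    [1979.0, 7102100.0, 1512.0, 213.0],
    [1980.0, 7071639.0, 1506.0, 213.0],
    [1981.0, 7089241.0, 1309.0, 185.0],
])

row_index = np.argmax(sample[:, 2])  # index of the first row with the largest consumption
peak_year = int(sample[row_index, 0])
peak_value = sample[row_index, 2]
print(peak_year, peak_value)  # 1979 1512.0
```

One difference worth noting: the boolean mask keeps every row that ties for the maximum, while np.argmax returns only the first one.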


Let's find the number of years. We can check this by seeing how many unique values there are in the year column (the first column).

In [21]:
numYears = np.unique(data[:,0]).shape
print("There are " + str(numYears[0]) + " years")

There are 40 years


We can use aggregation functions to summarize the data, for example to find
the mean and the standard deviation of the per capita daily water consumption (the last column).

In [22]:
mean = data[:,-1].mean()
std = data[:,-1].std()
print("The mean is " + str(mean) + " and the standard deviation is  " + str(std))

The mean is 159.425 and the standard deviation is  31.58709190476388
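
Note that the std method computes the population standard deviation by default (it divides by N). If the sample standard deviation (dividing by N - 1) is wanted instead, the ddof parameter controls this. A small sketch, using four per capita values copied from the first rows of the printed data:

```python
import numpy as np

# Four per capita values copied from the first rows of the printed data.
per_capita = np.array([213.0, 213.0, 185.0, 194.0])

population_std = per_capita.std()    # ddof=0 (NumPy's default): divide by N
sample_std = per_capita.std(ddof=1)  # ddof=1: divide by N - 1

print(population_std < sample_std)  # True
```

With 40 rows the two values are close, so the choice matters little here.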


Let's find the difference in population from year to year.

In [23]:
pop_diff = np.diff(data[:,1])
pop_diff

array([-30461. ,  17602. ,  19864. ,  72119. ,  53290. ,  39540. ,
        45192. ,  23230. ,  11243. ,  -9544. ,  -8525. ,  38851. ,
        54443. ,  77222. ,  64292. ,  62582. ,  64772. ,  75631. ,
        84816. ,  89401. ,  60618. ,  16685.5,  16685.5,  16685.5,
        16685.5,  16685.5,  16685.5,  16685.5,  16685.5,  16685.5,
        16685.5,  97830. ,  75069. ,  50707. ,  38648. ,  30794. ,
         7795. , -37705. , -39523. ])

And let's get the type of that array

In [24]:
type(pop_diff)

numpy.ndarray

And the type of the elements in the array...

In [25]:
pop_diff.dtype

dtype('float64')

In [26]:
print("There are " + str(len(pop_diff)) + " elements")

There are 39 elements


It makes sense that there are 39 elements even though there are 40 years: 40 values have 39 differences between consecutive pairs. It's easier to think of it in terms of indices. The zeroth element is arr[1] - arr[0], the first is arr[2] - arr[1], and so on. A 39th element would have to be arr[40] - arr[39], but there is no arr[40], since indexing starts at 0 and the last valid index is 39. Therefore the last element, the 38th, is arr[39] - arr[38]. We can verify this by computing the differences with an explicit loop.

In [28]:
pop = data[:,1]
popDif = []
for i in range(len(pop) - 1):
    popDif.append(pop[i+1] - pop[i])  # difference between consecutive years
pop_diff_check = np.array(popDif)

In [29]:
np.sum(pop_diff != pop_diff_check)

0

The sum is 0 because the two arrays are equal at every position, so the inequality comparison evaluates to False everywhere.
np.diff should be faster, but our CSV is small, so it doesn't matter.
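
The same check can be sketched without a Python loop by subtracting two shifted slices and comparing with np.array_equal. The four population values below are copied from the first rows of the printed data.

```python
import numpy as np

# First four population values copied from the printed data.
pop_sample = np.array([7102100.0, 7071639.0, 7089241.0, 7109105.0])

diff_builtin = np.diff(pop_sample)
diff_sliced = pop_sample[1:] - pop_sample[:-1]  # each element minus its predecessor

print(len(diff_builtin))                          # 3
print(np.array_equal(diff_builtin, diff_sliced))  # True
```

Four values give three differences, which is the same N - 1 relationship as 40 years giving 39 differences.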

Now that we know the mean and standard deviation of per capita daily water use over this time period, we can simulate per capita daily water use over the same time frame. Take the number of years, multiply by 365.25, and round to the nearest integer. That's the number of values we want to generate.


In [34]:
years,_ = np.shape(data)
numValues = round(years * 365.25)
numValues

14610

Now let's generate an array of normally distributed values with the mean and std from above and convert it to a list.

In [42]:
simulated_daily_use_array = np.random.normal(mean,std,numValues)
simulated_daily_use_array

array([122.12 , 137.371, 186.389, ..., 141.247, 170.599, 182.797])
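
Because np.random.normal draws from NumPy's global random state, the values above will differ on every run. If reproducible output is wanted, one option is a seeded Generator. The sketch below reuses the mean, standard deviation, and count reported earlier in this notebook; the seed value 42 is an arbitrary assumption.

```python
import numpy as np

# Values reported earlier in the notebook; the seed is an arbitrary choice.
mean, std, num_values = 159.425, 31.58709190476388, 14610

rng = np.random.default_rng(seed=42)
simulated = rng.normal(loc=mean, scale=std, size=num_values)

print(simulated.shape)  # (14610,)
print(simulated.mean(), simulated.std())  # close to the requested mean and std
```

Re-running with the same seed produces the same array, which makes the later timing comparison easier to repeat.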

In [43]:
type(simulated_daily_use_array)

numpy.ndarray

In [44]:
simulated_daily_use_list = list(simulated_daily_use_array)

In [45]:
type(simulated_daily_use_list)

list
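
A detail that may affect the loop timings: list(array) keeps each element as a NumPy scalar, whereas the array's tolist method converts the elements to built-in Python floats. A small sketch, using the first three simulated values shown above:

```python
import numpy as np

values = np.array([122.12, 137.371, 186.389])  # first three simulated values shown above

as_list = list(values)       # elements stay NumPy scalars (np.float64)
as_pylist = values.tolist()  # elements become built-in Python floats

print(type(as_list[0]).__name__)    # float64
print(type(as_pylist[0]).__name__)  # float
```

Either form works with math.sqrt; which one is faster in a pure-Python loop is something to measure rather than assume.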

Now let's use the provided profiling code to measure performance.

In [46]:
from timeit import timeit
import math

def sqrt_1():
    # This one uses a loop.  Slow!
    results = []
    for i in range(len(simulated_daily_use_list)):  # cover every index, including the last
        results.append(math.sqrt(simulated_daily_use_list[i]))
    return results

def sqrt_2():
    # This one uses list comprehension. Better!
    return([math.sqrt(x) for x in simulated_daily_use_list])

def sqrt_3():
    # This one uses numpy.  Best!
    return(np.sqrt(simulated_daily_use_array))

print("Time for loop: " + str(timeit(sqrt_1, number = 1000)))
print("Time for list comprehension: " + str(timeit(sqrt_2, number = 1000)))
print("Time for numpy vectorized operation: " + str(timeit(sqrt_3, number = 1000)))

Time for loop: 6.32414402400002
Time for list comprehension: 4.440285979999999
Time for numpy vectorized operation: 0.0184440650000397


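We see that the vectorized NumPy operation is far faster than the loop or the list comprehension: about 0.018 seconds for 1000 runs versus 4.4 and 6.3 seconds, a speedup of more than 200x.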
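
A timing comparison is only meaningful if the approaches compute the same thing. As a quick sanity check, the sketch below runs all three styles on a three-value sample (the first simulated values shown earlier) and confirms the results agree.

```python
import math
import numpy as np

values = np.array([122.12, 137.371, 186.389])  # small sample for illustration

# Explicit loop
loop_result = []
for v in values.tolist():
    loop_result.append(math.sqrt(v))

# List comprehension
comprehension_result = [math.sqrt(v) for v in values.tolist()]

# Vectorized NumPy call
vectorized_result = np.sqrt(values)

print(loop_result == comprehension_result)          # True
print(np.allclose(loop_result, vectorized_result))  # True
```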