# Review: Data Types and Summary Statistics

|Aggregate Stat | Quantitative Continuous | Quantitative Discrete | Qualitative Ordinal | Qualitative Nominal |
|------|------|------|------|------|
| unique values | yes* | yes | yes | yes |
| min | yes | yes | yes | no |
| max | yes | yes | yes | no |
| range |  |  |  |  |
| mean |  |  |  |  |
| median | | | |  |
| mode |  |  |  |  |
| variance |  |  |  |  |


## Copying numpy arrays

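This is a reminder that assigning an array to a new name does not copy it: both names refer to the same data. Use `.copy()` when you need an independent ("deep") copy.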

In [None]:
import numpy as np

# let's try the obvious thing
nparray = np.array([[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]])
nparray2 = nparray
print("nparray")
print(nparray)
print("nparray2")
print(nparray2)

In [None]:
nparray[0,0] = 200
print("nparray")
print(nparray)
print("nparray2")
print(nparray2)

# whaaat just happened?

In [None]:
# how do we stop that happening?? hint, what are we doing? we are *copying*
nparray2 = nparray.copy()
print("nparray")
print(nparray)
print("nparray2")
print(nparray2)

In [None]:
nparray[0,0] = 0
print("nparray")
print(nparray)
print("nparray2")
print(nparray2)
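
One way to check whether two names share data is `np.shares_memory`. The sketch below uses a small made-up array (`a`, `alias`, `view`, and `independent` are illustrative names, not from the notebook). It also shows that a slice behaves like the plain assignment above: it is a view onto the same data, not a copy.

```python
import numpy as np

a = np.array([[0, 1, 2], [10, 11, 12]])  # small illustrative array

alias = a                # plain assignment: a second name for the same array
view = a[0, :]           # basic slicing: a view onto the same data
independent = a.copy()   # an independent copy of the data

a[0, 0] = 200            # modify the original

print(alias[0, 0])        # follows the change
print(view[0])            # follows the change
print(independent[0, 0])  # keeps the old value

print(np.shares_memory(a, view))
print(np.shares_memory(a, independent))
```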

## Doing things to whole numpy arrays (broadcasting)

In [None]:
import numpy as np

nparray = np.array([[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]])
print("nparray\n", nparray)
print("nparray shape\n", nparray.shape)

In [None]:
# what if I want every element in nparray * 2?
print(nparray)
print(nparray*2)

In [None]:
# what if I want every element in nparray / 2?
print(nparray/2)
# watch out!!
(nparray/2).dtype
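
The dtype change above can be checked directly. A minimal sketch, assuming a small integer array `ints` (an illustrative name): `/` is true division and produces floats, while `//` is floor division and keeps the integer dtype.

```python
import numpy as np

ints = np.array([0, 1, 2, 3])  # illustrative integer array

halved_true = ints / 2     # true division: result is floating point
halved_floor = ints // 2   # floor division: result stays integer

print(ints.dtype, halved_true.dtype, halved_floor.dtype)
print(halved_true)
print(halved_floor)
```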

In [None]:
# let's get some summary statistics

data = np.genfromtxt('data/vehiclesNumeric.csv', dtype=int, delimiter=',', skip_header=1, encoding='utf8')
print(data[0:10])
print(data.shape)
for col in range(data.shape[1]): 
    print(col, data[:,col].min(), data[:,col].mean(), data[:,col].max())

In [None]:
# (review!) how do we assign value(s) to a row or column?
# set the first row to zeros (nparray[:, 0] would address the first column instead)
nparray[:1] = np.zeros(nparray.shape[1])
print(nparray)

In [None]:
# let's sum across each column
np.sum(nparray, axis=0)

In [None]:
# how would we sum across each row?


In [None]:
# what if we had a tensor?
nptensorFloat = np.ones([3, 4, 5])
print(nptensorFloat)

np.sum(nptensorFloat, axis=2)
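
Summing along an axis removes that axis from the result. A small sketch, assuming the same 3 x 4 x 5 shape as above (the name `tensor` is illustrative), that prints the shape of the result for each choice of `axis`:

```python
import numpy as np

tensor = np.ones([3, 4, 5])  # same shape as the tensor above

for axis in range(tensor.ndim):
    summed = np.sum(tensor, axis=axis)
    # the summed axis disappears from the shape
    print("axis", axis, "-> shape", summed.shape)
```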

In [None]:
# what if we don't specify an axis?

In [None]:
# what other functions can we apply across axes?

In [None]:
# let's take it up a notch

nparrayRandomInt = np.random.randint(low=0, high=10, size=(3,4))
print(nparrayRandomInt)

print(nparrayRandomInt - np.min(nparrayRandomInt, axis=0))

# whaaat just happened? let's look at the shapes


In [None]:
# what if we try to do the subtract-min thing across axis 1?
print(nparrayRandomInt - np.min(nparrayRandomInt, axis=1))


In [None]:
# how can we fix that? make the arrays shape-compatible!
print(nparrayRandomInt - np.min(nparrayRandomInt, axis=1)[:, np.newaxis])

In [None]:
# is there another way to achieve this?
print(nparrayRandomInt - np.min(nparrayRandomInt, axis=1, keepdims=True))
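
The shapes explain the results above. NumPy compares shapes from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. A sketch with a fixed array instead of random values (the names `arr`, `col_min`, `row_min`, and `row_min_column` are illustrative):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # fixed 3 x 4 array for reproducibility

col_min = np.min(arr, axis=0)       # shape (4,): one minimum per column
row_min = np.min(arr, axis=1)       # shape (3,): one minimum per row
print(arr.shape, col_min.shape, row_min.shape)

# (3, 4) with (4,): trailing dimensions match, so this broadcasts
print((arr - col_min).shape)

# (3, 4) with (3,): trailing dimensions 4 and 3 differ, so this fails
try:
    arr - row_min
except ValueError as err:
    print("ValueError:", err)

# (3, 4) with (3, 1): the length-1 dimension stretches across the columns
row_min_column = np.min(arr, axis=1, keepdims=True)
print(row_min_column.shape)
print(arr - row_min_column)
```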


## Why numpy?

Numpy is space efficient (reference: https://www.geeksforgeeks.org/python-lists-vs-numpy-arrays/)

- an array stores its elements in one contiguous, fixed-type C buffer, whereas a Python list stores pointers to separate Python objects

In [None]:
# importing numpy package
import numpy as np
  
# importing system module
import sys
  
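# declaring a list of 1000 elements (range() alone is not a list)
S = list(range(1000))

# printing size of one element of the list (a Python int object)
print("Size of one element of the list in bytes: ", sys.getsizeof(S[0]))

# printing size of the whole list: the list's pointer array plus every int object
print("Size of the whole list in bytes: ", sys.getsizeof(S) + sum(sys.getsizeof(x) for x in S))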
  
# declaring a Numpy array of 1000 elements 
D= np.arange(1000)
  
# printing size of each element of the Numpy array
print("Size of each element of the Numpy array in bytes: ",D.itemsize)
  
# printing size of the whole Numpy array
print("Size of the whole Numpy array in bytes: ",D.size*D.itemsize)
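
NumPy also exposes the total directly as `nbytes`, and the element size depends on the dtype. A minimal sketch (the names `default_ints` and `small_ints` are illustrative, and the default integer size depends on the platform):

```python
import numpy as np

default_ints = np.arange(1000)                # platform default integer dtype
small_ints = np.arange(1000, dtype=np.int16)  # 2 bytes per element

# nbytes is the same quantity as size * itemsize
print(default_ints.dtype, default_ints.nbytes)
print(small_ints.dtype, small_ints.nbytes)
```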

Numpy *can be* more time efficient (reference: https://stackoverflow.com/questions/9708783/numpy-vs-list-comprehension-which-is-faster)

In [19]:
import numpy
import timeit  # times how long a callable takes to run

def numpysum(n):
    a = numpy.arange(n) ** 2
    b = numpy.arange(n) ** 3
    return a + b

def pythonsum(n):
    a = [i ** 2 for i in range(n)]
    b = [i ** 3 for i in range(n)]
    return [a[i] + b[i] for i in range(n)]

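# timeit runs the callable 1,000,000 times by default, which takes very long
# for the pure-Python version; number=1000 keeps the comparison quick
for size in [10, 100, 1000]:
    print("size", size)
    print("time with python", timeit.timeit(lambda: pythonsum(size), number=1000))
    print("time with numpy", timeit.timeit(lambda: numpysum(size), number=1000))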