## **Programming with Python for Data Science**


###**Lesson: Numerical Analysis with NumPy** 

## **Notebook Preparation for Lesson in 1-2 steps:**
Each lesson will start with a similar template:  
1. **save** the notebook to your google drive (copy to drive)<br/> ![](https://drive.google.com/uc?export=view&id=1NXb8jeYRc1yNCTz_duZdEpkm8uwRjOW5)
 
2. **update** the NET_ID to be your netID (no need to include @illinois.edu)

In [None]:
LESSON_ID = 'p4ds:ds:numpy1'   # keep this as is
NET_ID    = 'salonis3' # CHANGE_ME to your netID (keep the quotes)

#**Lesson Numerical Analysis with NumPy**
###**Welcome to NumPy; Python's Super Library**
One of the more popular libraries for working with arrays of data is NumPy (Numerical Python, pronounced "num pie"). You could use the list type for all your array needs; however, NumPy makes working with arrays (both single and multiple dimensions) easy by offering efficient syntax and a large number of utility methods and functions to handle just about any numerical need. NumPy is also very fast: much of the library is written in C (the language the standard Python interpreter is also implemented in), so array operations run much faster than the equivalent Python loops. For easy and fast array and matrix manipulation, NumPy comes to the rescue.

For doing numerical linear algebra and working with vectors and matrices, NumPy is the standard library (it's also at the core of most machine learning algorithms). Other popular libraries like Pandas and SciPy (Scientific Python) use NumPy as well. Learning how to effectively work with NumPy makes the other libraries easier to learn. Much of what you will learn in Pandas comes directly from NumPy.

In the same spirit of this class, where we are more interested in understanding the concepts rather than learning recipes on how to use an API (the public functions and methods), we will learn the basics of NumPy within the context of helping us work with data.

To use NumPy you must import it; by convention, it's imported as np:

```
import numpy as np
```

###**Python List vs Numpy Array**
As we have seen, a Python list can hold a set of heterogeneous items (e.g. strings, numbers, booleans, functions, etc). A NumPy array, however, can only hold items of the same type. NumPy arrays are statically typed (vs dynamic) and homogeneous. The type of the elements is determined when the array is created, which is why arrays can be memory efficient and fast.

NumPy also makes a distinction between numbers that have no decimal points (aka. integers) and those that do (aka. floating point - since the decimal point is allowed to 'float'). Not only can you specify different types in NumPy (e.g. integer, float, boolean) but also how much storage you want those values to occupy. If you are only dealing with small numbers (e.g. -128 to 127) you can manage by using a single byte (8 bits) of data for each number; however, if you need to hold very large numbers you can specify up to 64 bits (8 bytes). You can see the type by using the .dtype property. For now, we won't need to worry about selecting the most suitable size.
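As a small sketch of the .dtype property and of explicit storage sizes (the variable names here are only illustrative), the following creates arrays with specific types. np.iinfo reports the range of values an integer type can hold:

```python
import numpy as np

# ask for a specific storage size when creating the array
small = np.array([1, 2, 3], dtype=np.int8)    # 1 byte per element
big   = np.array([1, 2, 3], dtype=np.int64)   # 8 bytes per element
reals = np.array([1.5, 2.5, 3.5])             # floats default to float64

print(small.dtype, small.itemsize)   # int8 1
print(big.dtype, big.itemsize)       # int64 8
print(reals.dtype)                   # float64

# the range of values a single byte (int8) can hold
limits = np.iinfo(np.int8)
print(limits.min, limits.max)        # -128 127
```

The itemsize attribute is the number of bytes used by each element, so the int64 array needs eight times the memory of the int8 array for the same three numbers.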

> ***Coder's Log:*** You don't have to master bits and bytes for this class, but you should be curious why everything in computers seems to be a power of 2 (2,4,8,16,32,64,etc). A bit (the smallest unit capable of holding a value) can either be on (i.e. hold a 1) or off (i.e. hold a 0). So one bit can hold 2 states. Two bits could hold 4 states (on, on; on, off; off, on; off, off). Eight bits (a byte) can hold 2⁸ == (2*2*2*2 * 2*2*2*2) == 256. So 8 bits can hold 256 unique values. One possible range of values would be the numbers from -128 to +127.

Let's take a quick look at both the Python list and the NumPy single dimension array:

In [1]:
import numpy as np

py_list = list(range(0,10))

# numpy's version of range
np_num  = np.arange(0,10)

print(type(np_num))
print(py_list)
print(np_num)

<class 'numpy.ndarray'>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0 1 2 3 4 5 6 7 8 9]


What did you notice about how each gets printed?

###**Array Creation**
There are several functions to create arrays (including arrays with more than one dimension, as we will see):

In [2]:
# create an array from a Python list
data = np.array([1,2,3])
print(data)

[1 2 3]


![](https://drive.google.com/uc?export=view&id=1GI5aW0FX4XG3cvW7unZLutL6cc1ipscp)
 
In addition, there are convenience functions to create arrays/vectors/matrices filled with 1's, 0's, and random numbers:

![](https://drive.google.com/uc?export=view&id=1yNQslZbTHQt7A5mH4xpp5L45fpy4jQMu)
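A minimal sketch of these convenience functions for one dimensional arrays, assuming the helpers pictured are np.ones, np.zeros, and np.random.random (the variable names are illustrative):

```python
import numpy as np

ones  = np.ones(3)            # [1. 1. 1.]
zeros = np.zeros(3)           # [0. 0. 0.]
rand  = np.random.random(3)   # three random floats in [0.0, 1.0)

print(ones)
print(zeros)
print(rand)                   # values differ on every run
print(ones.dtype)             # float64: these helpers create floats by default
```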





###**Linspace (not a typo)**
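NumPy provides a way to easily divide a range into equally sized intervals: np.linspace(start, stop, num) returns num evenly spaced values from start to stop, with both endpoints included. This is very useful when you need to create grid spacing or evenly split up a range.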
```
x_axis = np.linspace(0, 10, 100)
```
This becomes super useful when you need to 'spin' around a portion of a circle:
```
x_axis = np.linspace(-np.pi, np.pi, 100)
```

In [None]:
# type&run the above example/exercise in this cell

###**Size and Shape**
You can find out how many elements are in the array using the .size property. This is the preferred way (the len function also works, but it only gives the element count for one dimensional arrays).

Each NumPy array also has a shape attribute. The shape is returned as a tuple that describes the size of each dimension. For single dimension arrays, shape[0] is the number of elements in the array:

```
np_num = np.arange(0,10)
print(np_num.size, np_num.shape)
```


In [None]:
# type&run the above example/exercise in this cell

For the above single dimension array with 10 elements, NumPy reports a shape of (10,).

###**Operator Overload**
As we have seen, some of the mathematical operators work with lists:
```
print(py_list * 3)       # repetition: the list repeated 3 times
print(py_list + py_list) # concatenation
print(py_list + 2)       # will **not** work: TypeError (comment this out before submitting)
``` 

In [None]:
# type&run the above example/exercise in this cell

 
These operators (* and +) are defined on the list class as methods. This is one of the powerful features of Python. When you create your own class (we have not done that yet), you can also define how some binary and unary operations will work. We have seen that the set class defines the - operator for doing set difference between two sets.
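To make that concrete, here is a short sketch showing that the list operators are ordinary method calls. The Money class is hypothetical and exists only to illustrate how a class can define its own + operator:

```python
py_list = [0, 1, 2]

# the + and * operators call special methods on the list class
print(py_list + [3])          # [0, 1, 2, 3]
print(py_list.__add__([3]))   # same call, written as a method
print(py_list * 2)            # [0, 1, 2, 0, 1, 2]
print(py_list.__mul__(2))     # same call, written as a method


class Money:
    """Hypothetical class, used only to show operator overloading."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # called for: money_a + money_b
        return Money(self.cents + other.cents)


total = Money(150) + Money(275)
print(total.cents)            # 425
```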

Since NumPy knows all the elements are the same type (usually numbers), the mathematical operators work on the individual elements:
```
print(np_num * 3)      # multiply each item by 3
print(np_num + np_num) # add the corresponding items together 
print(np_num + 2)      # add two to each item
print(np_num % 2)      # modulus/remainder
```

In [None]:
# type&run the above example/exercise in this cell

These are called *element-wise* operations since the mathematical operation is performed on each element in the array.
```
data = np.array([1,2])
ones = np.ones(2)
print(data + ones)
print(data * data)
```

In [None]:
# type&run the above example/exercise in this cell

It's very easy to build vectors/matrices of answers without using any loops:

In [None]:
miles = np.array([1,2,3])
kilometers = miles * 1.6
print(kilometers)

This can be visualized much like:
```
[1] * [1.6] = [1.6]
[2] * [1.6] = [3.2]
[3] * [1.6] = [4.8]
```
Operations that are performed on each element in an array are called vectorized operations. The NumPy functions that implement them are called universal functions ('ufuncs' for short).

Some other functions that work on an element-wise basis:
```
np.abs    --> Calculate the absolute value element-wise.

np.sin    --> Trigonometric sine, element-wise.

np.log    --> Natural logarithm, element-wise.

np.square --> Element-wise square of the input
```
If you need to perform an operation, element-by-element, most likely NumPy has a function to do it for you.
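A quick sketch of the four functions listed above, using small made-up arrays so the results are easy to check by hand:

```python
import numpy as np

data = np.array([-2.0, -1.0, 1.0, 2.0])

print(np.abs(data))      # [2. 1. 1. 2.]
print(np.square(data))   # [4. 1. 1. 4.]

# sin(0) is 0 and sin(pi/2) is 1
angles = np.array([0.0, np.pi / 2])
print(np.sin(angles))

# log(1) is 0 and log(e) is 1
powers = np.array([1.0, np.e])
print(np.log(powers))
```

Each call returns a new array of the same shape as its input; the original array is left unchanged.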

###**None in NumPy**
NumPy has a special constant, numpy.nan (not a number), that is used to indicate a missing value or a value that can't be represented. Older code also uses the aliases numpy.NaN and numpy.NAN, which mean the same thing; they were removed in NumPy 2.0, so prefer numpy.nan.

There are several functions that can be used to determine if a value is NaN:

In [None]:
data = np.array([1,2,3, np.nan])
print(data[3])
print(np.isnan(data[3]))
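One reason a dedicated function is needed: NaN never compares equal to anything, including itself. The sketch below shows the comparison failing and np.isnan working on a whole array at once:

```python
import numpy as np

data = np.array([1, 2, 3, np.nan])

print(np.nan == np.nan)   # False: NaN is not equal to anything, even itself
print(data == np.nan)     # [False False False False], so == cannot find NaN

missing = np.isnan(data)  # boolean array, True where the value is NaN
print(missing)            # [False False False  True]
print(np.sum(missing))    # 1 (each True counts as 1)
```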

###**Simple Functions**
There are also aggregation functions (functions that return a single value like max) that work with NumPy arrays as well:

![](https://drive.google.com/uc?export=view&id=1H1hXGJ_B1PKg8-7Y-Sy6xtDBRj-O7f2b)

```
np.sum        --> Compute sum of elements

np.prod       --> Compute product of elements

np.mean       --> Compute mean of elements

np.std        --> Compute standard deviation

np.var        --> Compute variance

np.min        --> Find minimum value

np.max        --> Find maximum value

np.argmin     --> Returns the index of minimum value

np.argmax     --> Returns the index of maximum value

np.median     --> Compute median of elements

np.percentile --> Compute rank-based statistics of elements

np.any        --> Evaluate whether any elements are true

np.all        --> Evaluate whether all elements are true

np.ptp        --> Calculates the range (np.max - np.min)
```
 


In [None]:
x = np.array([0,1,2,3,4,5,6])
print(np.sum(x))
print(x + 2*x)
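Here is a short sketch that exercises several of the aggregation functions from the table on a small made-up array:

```python
import numpy as np

x = np.array([3, 7, 1, 9, 4])

print(np.min(x), np.max(x))           # 1 9
print(np.argmin(x), np.argmax(x))     # 2 3 (positions of the min and max)
print(np.mean(x))                     # 4.8
print(np.median(x))                   # 4.0
print(np.ptp(x))                      # 8 (9 - 1)
print(np.any(x > 8), np.all(x > 0))   # True True
```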

Also, most aggregates have a NaN-safe counterpart that computes the result while ignoring missing values, which are marked by the np.nan value.

In [None]:
x = np.array([1,2,0,3,6,5,np.nan,4])
print(np.nansum(x))
print(np.nanargmin(x))
print(np.nanargmax(x))
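To see why the NaN-safe versions matter, this sketch compares the regular aggregates with their NaN-safe counterparts on the same array:

```python
import numpy as np

x = np.array([1, 2, 0, 3, 6, 5, np.nan, 4])

print(np.sum(x))       # nan: a single missing value "poisons" the total
print(np.nansum(x))    # 21.0
print(np.max(x))       # nan
print(np.nanmax(x))    # 6.0
print(np.nanmean(x))   # 3.0 (21 divided by the 7 values that are present)
```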

###**2D and beyond**

We mentioned a few times that the power of Numpy comes into play as soon as we extend beyond a simple series of numbers (i.e. a 1 dimensional array).

![](https://drive.google.com/uc?export=view&id=1GgqPn5CE3Otp5TgHSTdGU1UVIh02Sqv7)

Using NumPy to manage 2 dimensional arrays is just as easy:

In [None]:
data = np.array([
  [1,  2, 3], 
  [4,  5, 6],  
  [7,  8, 9],  
  [10,11,12]]) 
  
print(data)

np.ones, np.zeros, and np.random work in multiple dimensions as well:

![](https://drive.google.com/uc?export=view&id=1H1Io5pRTciXA5vTOtDeTprSr41zKRFiX)
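A minimal sketch of the same helpers in two dimensions. The only change from the one dimensional case is that the shape is passed as a (rows, columns) tuple; np.random.random is assumed here as the random helper:

```python
import numpy as np

ones  = np.ones((3, 2))            # 3 rows, 2 columns of 1.0
zeros = np.zeros((3, 2))           # 3 rows, 2 columns of 0.0
rand  = np.random.random((3, 2))   # values differ on every run

print(ones)
print(zeros.shape)   # (3, 2)
print(rand.shape)    # (3, 2)
```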

We will explore higher dimensions in the next lesson.

###**Size, Dimension, Shape**
The size attribute returns the number of elements, and the ndim attribute returns the number of dimensions. Try to find out what the len function returns on data:

In [None]:
print(data.size, data.ndim)

You can ask a Numpy 2D array its shape as well:
```
print(data.shape)  # 4 rows, 3 columns
```
Each element of the returned tuple gives the size of one dimension.
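Because the shape is a plain tuple, it can be unpacked into separate variables. A small sketch using the same 4 by 3 array (the names rows and cols are illustrative):

```python
import numpy as np

data = np.array([
  [1,  2, 3],
  [4,  5, 6],
  [7,  8, 9],
  [10,11,12]])

rows, cols = data.shape
print(rows, cols)                     # 4 3
print(data.size == rows * cols)       # True
print(data.ndim == len(data.shape))   # True
```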

###**Indexing**
Standard array/list indexing works as well:

In [None]:
one_d = np.array([1,2,3])
two_d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(one_d[1])
print(two_d[1])
print(two_d[1][1])
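NumPy also accepts the row and column inside a single pair of brackets, separated by a comma. The sketch below shows that both forms reach the same element:

```python
import numpy as np

two_d = np.array([[1,2,3],[4,5,6],[7,8,9]])

print(two_d[1][1])    # 5: take row 1, then item 1 of that row
print(two_d[1, 1])    # 5: same element, using one [row, column] index
print(two_d[-1, 0])   # 7: negative indexes count from the end
```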

###**Slicing And Dicing**
Just as with Python lists you can slice NumPy arrays using the same syntax:
```
[start : stop : step]
```

In [None]:
x = np.arange(100).reshape(20,5)
print("Rows 2,3,4 ", x[2:5])

# Advanced slicing
print("First 5 rows\n", x[:5])
print("Row 18 to the end\n", x[18:])
print("Last 5 rows\n", x[-5:])
print("Reverse the rows\n", x[::-1])

However, for multiple dimensions, this syntax is extended to each dimension -- where each dimension is separated by a comma:

In [None]:
data = np.arange(1,21).reshape(4,5)
print(data)
print(data[:2  , :3])     # two rows, three columns
print(data[:3:2, :4:2])   # ???

So it's incredibly easy to extract an attribute (column) from all the rows in a dataset:
```
print(data[:, 2]) # all rows, column '3'
```
Do not go past this point unless you understand what is happening. Most of the lesson (and NumPy and Pandas) depends on understanding this compact syntax.
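To practice reading the syntax, here is a sketch on a small made-up grid where every result can be checked by eye. The comments list the values selected, not the exact printed layout:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(grid[1, :])       # one row, all columns:      4 5 6 7
print(grid[:, 1])       # all rows, one column:      1 5 9
print(grid[:2, 1:3])    # rows 0-1, columns 1-2:     1 2 / 5 6
print(grid[::2, ::3])   # every 2nd row, 3rd column: 0 3 / 8 11
```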


![](https://drive.google.com/uc?export=view&id=1GshJQE1LD5gxh9EYKdUoBr0vQMmhXVaN)

![](https://drive.google.com/uc?export=view&id=1Gxaoom96zt84hvuWnAh4IKE-4E6KLGCJ)

![](https://drive.google.com/uc?export=view&id=1Gj0nJtsxet4Q9oVGYy9eETpy7DkSgoGF)

###**Operator Indirection and Masking**
What's even more powerful is that you can use the result of one NumPy array to extract values from another NumPy array. In this example, we build an array of the integers that represents an index for values we want to extract. In the example below, we will extract the values at index 0, 5, and 9:

In [None]:
np_num = np.arange(0,100,10) 
# [0,10,20,30,40,50,..,90]

np_idx = np.array([0,5,9])   
# want the first, sixth, and last items

print(np_num[ np_idx ])      
# [0, 50, 90]

If the index array contains boolean values, we extract the values whose corresponding entry is True. This is called a **boolean mask**. For example, the following code creates a boolean array (from a Python list) that is filled with alternating True, False values (10 total):

```
np_num  = np.arange(0,10)
np_bool = np.array([True, False]*5)
print(type(np_bool))
```
Here is a quick illustration of the above situation:
```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] np_num
[T, F, T, F, T, F, T, F, T, F] np_bool
```
We can then use this boolean mask to extract the values inside of np_num. Those values that are True inside np_bool will "filter" those in np_num:

In [None]:
np_num  = np.arange(0,10)
np_bool = np.array([True, False]*5)
np_wow  = np_num[np_bool]
print(np_wow)

The above situation:
```
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] np_num
[T, F, T, F, T, F, T, F, T, F] np_bool
------------------------------
[0,    2,    4,    6,    8,  ] np_wow

```
###**Now for the Fun**
With that background information, we can easily create boolean arrays based on a condition and then use that array to extract the values:

In [None]:
np_num  = np.arange(0,10)
np_bool = np_num < 5
print(np_num[np_bool])

That's an efficient way to get all the values inside np_num whose value is less than five. We can even combine the two steps into a single, more readable expression:

```
print(np_num[np_num<5])
```
See the appendix for yet another (very useful) way to mask with Numpy.
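Two related patterns go slightly beyond the example above, so treat this as an optional sketch: counting matches by summing a mask, and combining conditions with & (and) or | (or):

```python
import numpy as np

np_num = np.arange(0, 10)

small = np_num < 5
print(np.sum(small))    # 5: each True counts as 1, so this counts the matches

# each condition needs its own parentheses when combined
middle = (np_num > 2) & (np_num < 7)
print(np_num[middle])   # [3 4 5 6]

edges = (np_num < 2) | (np_num > 7)
print(np_num[edges])    # [0 1 8 9]
```

The Python keywords and / or do not work element-wise on arrays, which is why the & and | operators are used here.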

###**Working with Strings**
Even though Numpy is meant for heavy duty processing of numeric data, it also can be used to process strings, unicode, and character data (https://docs.scipy.org/doc/numpy/reference/routines.char.html).

In [None]:
fruit = np.array("apples,pears,bananas,oranges".split(','))
print(fruit)
lens = np.char.str_len(fruit)
print(lens)


You can do masking as well:
```
short = (lens <= 5)
print(fruit[short])
print(np.char.count(fruit, "a"))
```
The function np.char.count counts every occurrence inside each string, so it also works when each element holds several words:
```
data = ["apple pear pizza",
        "apple lemon bagel milk lemon",
        "pizza soda apple"]
trips = np.array(data)

print(np.char.count(trips, 'apple'))
print(np.char.count(trips, 'lemon'))
print(np.char.count(trips, 'pizza'))
```


In [None]:
# type&run the above example/exercise in this cell

You should see the following output:
```
[1 1 1] <= bought an apple on all three trips
[0 2 0] <= bought 2 lemons on trip 2
[1 0 1] <= bought a pizza on trip 1 and trip 3
```

###**A word on words**
All of the following will be used at times when dealing with NumPy data: lists, arrays, vectors, columns, rows, matrices (and a few others). In general, a list, an array, and a vector all describe a one dimensional ordered set of numbers. The word vector could mean a row vector or a column vector. Both are components inside a matrix -- which is usually a 2 dimensional grid of values:

![](https://drive.google.com/uc?export=view&id=1H0Lf3zEmHA6djJbGj4LPj2XG-qCtcPVT)
 

###**Before you go, you should know:**


* numpy array creation


* how to index an array


* how operators (e.g. +, -, *) work on numpy arrays


* how masking works

#**Lesson Assignment**
One of the great outcomes of learning Python, NumPy, Matplotlib (and soon Pandas) is that you can do everything (and more) using your newfound power. What's great is that once you have a workflow (i.e. pipeline) that moves data from a raw source to information (and action), re-running the whole pipeline is SUPER easy if the data ever changes or gets updated (and it usually does).

This lesson is one of the first steps to getting NumPy to do some interesting analysis on a simple dataset.

You can even use Excel (if you want) to confirm your answers, but the objective is to start understanding how to write programs for tasks you might once have done in Excel.

###**Simple Data Analysis**
The lesson comes with its own dataset, data.csv. It records the amounts spent at various merchants. You will use NumPy to analyze the data.

The data.csv file is available to download from Moodle. It's a semicolon-separated data file that shows spending around campus town. You must upload it here before you can use it.



In [79]:
def display_data():
  p = 'data.csv'
  with open(p, 'r') as fd:
    data = fd.read()
    print(data)

display_data()

Date;Merchant;Type;Price;Tax;Total
01/02/2017;Bevande Café;Restaurant; 4.57 ; 0.41 ; 4.98 
01/03/2017;Papa Del's Pizza;Restaurant; 17.57 ; 1.58 ; 19.15 
01/04/2017;Radio Maria;Restaurant; 56.20 ; 5.06 ; 61.26 
01/05/2017;Bevande Café;Restaurant; 3.56 ; 0.32 ; 3.88 
01/06/2017;County Market;Grocery; 30.15 ; 2.71 ; 32.86 
01/07/2017;Papa Del's Pizza;Restaurant; 18.08 ; 1.63 ; 19.71 
01/08/2017;Bombay Indian Grill;Restaurant; 13.07 ; 1.18 ; 14.25 
01/09/2017;Lai Lai Wok;Restaurant; 20.67 ; 1.86 ; 22.53 
01/10/2017;Bevande Café;Restaurant; 1.99 ; 0.18 ; 2.17 
01/11/2017;Bombay Indian Grill;Restaurant; 20.07 ; 1.81 ; 21.88 
01/12/2017;Meijer;Grocery; 26.98 ; 2.43 ; 29.41 
01/13/2017;Bevande Café;Restaurant; 2.18 ; 0.20 ; 2.38 
01/14/2017;Café Kopi;Restaurant; 4.10 ; 0.37 ; 4.47 
01/15/2017;Bevande Café;Restaurant; 2.73 ; 0.25 ; 2.98 
01/16/2017;Bevande Café;Restaurant; 2.00 ; 0.18 ; 2.18 
01/17/2017;Radio Maria;Restaurant; 55.67 ; 5.01 ; 60.68 
01/19/2017;Bombay Indian Grill;Restaura

In [80]:
def read_data():
  import numpy as np
  # dtype=None lets NumPy infer each column's type;
  # names=True uses the header row as the field names
  my_data = np.genfromtxt('data.csv', dtype=None, delimiter=';', names=True, encoding=None)
  return my_data

data = read_data()

NumPy provides a utility function named genfromtxt that will read in data from a file (or URL) and return a structured array that is ready for analysis. We can read this file into a structured array via np.genfromtxt and then perform some basic analysis on the returned NumPy array.

Don't worry, we won't spend too much time on the NumPy function np.genfromtxt since we will be using Pandas very soon, which adds more functionality on top of NumPy.
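To get a feel for what a structured array looks like without needing the file, here is a sketch that reads a tiny hypothetical sample from a string. The merchants and amounts are made up for illustration; only the semicolon-separated layout mirrors data.csv:

```python
import io
import numpy as np

# hypothetical sample, in the same semicolon-separated layout as data.csv
text = """Merchant;Type;Total
Shop A;Grocery;10.50
Shop B;Restaurant;4.25
Shop A;Grocery;2.00
"""

sample = np.genfromtxt(io.StringIO(text), dtype=None, delimiter=';',
                       names=True, encoding='utf-8')

print(sample.dtype.names)   # ('Merchant', 'Type', 'Total')
print(sample.shape)         # (3,): one record per data row
print(sample['Total'])      # the Total column as an array of floats

# a column can build a mask that selects whole records
is_grocery = sample['Type'] == 'Grocery'
print(np.sum(sample[is_grocery]['Total']))   # 12.5
```

Each column is reached by its header name, and each column is itself a regular NumPy array, so everything from the lesson (masks, aggregates, element-wise operators) applies to it.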

All of the questions will be from analyzing the data found in data.csv. You will be passed a NumPy structured array. Each question will ask you to manipulate the data and return a result.

* All the information to solve these is given in this lesson
* You should solve these by using only the Numpy library
* Avoid converting to pure Python to solve these
* There's no need for loops, regular expressions, etc

Each can be solved using NumPy operators and functions given in the lesson. Many of these answers will use np.sum on an np.array. If you didn't read each of the sections and run each of the samples, it will be impossible (unless you already know NumPy) to answer these questions.

###**q0**
###**Find the total amount spent on all merchants.**
answer: already done for you!

In [81]:
import numpy as np
def q0(data):
  return np.sum(data['Total'])

print(q0(data))

901.3900000000001


You can test it like this:
```
def test_q0():
  data = read_data()
  print(q0(data))
```

In order to get a feel for the result of each stage, just print each step:
```
sub_set = data['Total']
print(sub_set)
total = np.sum(sub_set)
print(total)
```

###**q1**
###**How many transactions were on Restaurants?**


In [83]:
def q1(data): 
    return len(data[data['Type']=='Restaurant'])

###**q2:**
###**Find the total spent on Restaurants**
Hint: you will want to solve this in two parts:

* First get the subset that is restaurants (re-read the section on indirection).
* Second, you can do something similar as in q0.

In [84]:
# type&run the above example/exercise in this cell
def q2(data):
    x = np.sum(data[data['Type']=='Restaurant']['Total'])
    return x

###**q3:**
###**How much was spent on taxes?**

In [85]:
def q3(data):
    return np.sum(data['Tax'])

In [None]:
# type&run the above example/exercise in this cell

###**q4:**
###**How much was spent on taxes for Groceries?**

In [86]:
def q4(data):
    return np.sum(data[data['Type']=='Grocery']['Tax'])

In [None]:
# type&run the above example/exercise in this cell

###**q5:**
###**What was the mean (i.e. average) of all transactions?**

In [36]:
def q5(data):
    return np.mean(data['Total'])

In [None]:
# type&run the above example/exercise in this cell

###**q6:**
###**What was the mean (i.e. average) of all transactions whose total was under 3.00?**

In [None]:
# type&run the above example/exercise in this cell

In [89]:
def q6(data):
    return np.mean(data[data['Total']<3.00]['Total'])

###**q7:**
###**What is the sum of price on transactions where price was a whole dollar amount (i.e. cents were 0) ?**

Hints:

* Think about the math needed to solve this
* You want values that are whole dollars (4.00, 23.00, etc)
* The mod operator could be useful (but not mandatory)
* Don't use Python to post process the data. A good Numpy mask will do

In [125]:
def q7(data):
    x = np.mod(data['Price'], 1)
    mask = (x == 0)
    return np.sum(data[mask]['Price'])

In [None]:
# type&run the above example/exercise in this cell

###**q8:**
###**How much was spent total on Merchants whose name is less than 7 characters long?**

In [None]:
# type&run the above example/exercise in this cell

In [53]:
def q8(data):
    return np.sum(data[np.char.str_len(data['Merchant'])<7]['Total'])

###**q9:**
###**Total number of transactions where the Merchant's name had the word 'Café' in it?**

In [90]:
def q9(data):
    return data[np.char.find(data['Merchant'],'Café')>=0].shape[0]

In [None]:
# type&run the above example/exercise in this cell

###**q10:**
###**How many Merchants have the word 'Café' in them? (i.e. unique set Merchants)?**

In [None]:
# type&run the above example/exercise in this cell

In [139]:
def q10(data):
    return len(np.unique(data[np.char.find(data['Merchant'],'Café')>=0]['Merchant']))

##**Submission**

After implementing all the functions and testing them, please download the notebook as "solution.py" and submit it to Gradescope under the "Week12:DS:NumPy" assignment tab and to Moodle.

**NOTES**

* Be sure to use the function names and parameter names as given. 
* Do NOT use your own function or parameter names. 
* Your file MUST be named "solution.py". 
* Comment out any lines of code and/or function calls that produce errors. If your solution has errors, then you have to fix them; if there were any errors in the examples/exercises, comment them out before submitting to Gradescope.
* Grading cannot be performed if any of these are violated.

**References and Additional Readings**
* https://hackernoon.com/introduction-to-numpy-1-an-absolute-beginners-guide-to-machine-learning-and-data-science-5d87f13f0d51
* https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
* http://jalammar.github.io/visual-numpy

# Appendix: (but still very useful)
Here's yet another way to use masking in Numpy.

###**Using the Where Clause**
NumPy allows similar masking using the where function.
You pass a condition to the where function and get back the indices where the condition is True; those indices can then be used to extract values. For example, let's find all elements of an array whose value is 5:

Here's the code to generate an array of 10 numbers:
```
np.random.seed(101)
values = np.random.randint(0,7, (10,))
print(values)
```
Now let's create a condition:
```
mask = values == 5
print(mask)
```
And finally pass that condition/mask to the where function:
```
indices = np.where(mask)
print(len(indices), indices)
```

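When given only a condition, the where function returns a tuple of arrays, one array per dimension of the input. Each array holds the indices (along that dimension) where the condition is True. Since values is one dimensional, the tuple holds a single array, so len(indices) is 1.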

You can iterate over the first element of the tuple to get the values:
```
for i in indices[0]:   # indices[0] is the array of matching positions
  print(values[i]) # should be a 5!
```


In [None]:
# type&run the above example/exercise in this cell

###**One more time, in 2D**

Let's do the same process, but now with a two dimensional array:
```
np.random.seed(55)
values = np.random.randint(0,7, (2,3))
print(values)

mask = values == 5
print(mask)

rows, cols = np.where(mask)
print(rows)
print(cols)
# NOTE len(rows) == len(cols)
```

**A few important notes:**

* np.where still returns a tuple, but we unpack (assign the different parts of) the tuple to different variables
* since values is 2D, np.where will return indices for both rows and columns

We can now iterate over these indices:

```
for idx in range(0, len(rows)): 
  r = rows[idx]
  c = cols[idx]
  print(values[r][c]) # should be a 5!
```

In [None]:
# type&run the above example/exercise in this cell
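The loop is useful for seeing what np.where returns, but the index arrays can also be used directly, with no loop at all. The sketch below uses a small fixed array (instead of random values) so the results are predictable:

```python
import numpy as np

values = np.array([[5, 1, 5],
                   [2, 5, 0]])

mask = values == 5
rows, cols = np.where(mask)
print(rows)                 # [0 0 1]
print(cols)                 # [0 2 1]

# the two index arrays are paired up: (0,0), (0,2), (1,1)
print(values[rows, cols])   # [5 5 5]
print(values[mask])         # [5 5 5]: the mask alone selects the same values
```

Use np.where when you need the positions themselves; when you only need the matching values, the boolean mask is enough.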

 
That's a lot to learn. Be sure you understand the different parts of all of this. You don't need to memorize anything, just understand how numpy works at the syntax level.