# Python for Data Science

In this notebook, we will explore the basics of using the Python programming language for data science. These exercises are implemented in IPython notebooks. [IPython](https://ipython.org/) is an interactive environment for Python programming commonly used for data science and machine learning. 

This notebook is adapted from the [Exploratory Computing course](https://github.com/mbakker7/exploratory_computing_with_python) by Mark Bakker. 

## Notebook 2: Arrays

In this notebook, we will do math on arrays using functions of the `numpy` package. A nice overview of `numpy` functionality can be found [here](http://wiki.scipy.org/Tentative_NumPy_Tutorial). The exercises in this notebook will also require plotting and the packages required are called below.

In [None]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# this code is to generate the dataset for exercises
# you can ignore it!
data = np.random.uniform(low=-10.0,high=10.0,size=(1000,2))
np.savetxt('xypoints.dat',data,fmt='%.2f',delimiter=',')

### One-dimensional arrays
There are many ways to create arrays using Numpy. For example, you can enter the individual elements of an array.

In [None]:
# create a 1-dimensional array of size 4
# add 4 elements to the array
np.array([1, 7, 2, 12])

* The `array` function takes one sequence of points between square brackets. 
* You can use `ones(shape)` to create an array of the specified `shape` filled with the value 1. 
* You can use `zeros(shape)` to create an array filled with the value 0. 
* Similar to the `linspace` function, the `arange(start, end, step)` function creates an array starting at `start`, taking steps equal to `step`, and stopping before it reaches `end`.

In [None]:
# takes default step of 1 and doesn't include 7
print(np.arange(1, 7)) 

# starts at 0 and ends at 4, giving 5 numbers
print(np.arange(5))
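
The other creation functions from the list above can be tried in the same way. The sketch below is only an illustration (the variable names are made up for this example): it builds small arrays with `zeros` and `ones`, and contrasts `arange`, which takes a step and excludes the end value, with `linspace`, which takes a number of points and includes the end value.

```python
import numpy as np

# arrays filled with 0 and with 1 (floats by default)
zeros_arr = np.zeros(3)
ones_arr = np.ones(2)
print(zeros_arr)
print(ones_arr)

# arange: fixed step of 0.25, the end value 1 is excluded
stepped = np.arange(0, 1, 0.25)
print(stepped)

# linspace: fixed number of points (5), the end value 1 is included
spaced = np.linspace(0, 1, 5)
print(spaced)
```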

Arrays have a dimension. So far we have only used one-dimensional arrays, so the dimension is 1. The number of dimensions and the length of a 1-D array can be checked as follows:

In [None]:
# initialize array
x = np.array([1, 7, 2, 12])

# check number of dimensions of x
print('number of dimensions of x:', np.ndim(x))

# check the size of x
print('length of x:', len(x))
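
The same information is also stored on the array itself. As a small sketch, the attributes `ndim`, `shape`, and `size` report the number of dimensions, the length along each dimension, and the total number of elements.

```python
import numpy as np

x = np.array([1, 7, 2, 12])

# number of dimensions, same as np.ndim(x)
print('ndim :', x.ndim)

# tuple with the length along each dimension
print('shape:', x.shape)

# total number of elements, equal to len(x) for a 1-D array
print('size :', x.size)
```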

Individual elements of an array can be accessed with their index. Indices start at 0. 

In [None]:
# create an array
x = np.arange(20, 30)

# print the complete array
print(x)

# print first element
# indexing starts from 0!
print(x[0])

# print 6th element
print(x[5])

A range of indices may be specified using the colon syntax: `x[start:end_before]` or `x[start:end_before:step]`. If `start` isn't specified, 0 will be used. If `end_before` isn't specified, the range runs to the end of the array. If `step` isn't specified, 1 will be used. 

In [None]:
# create an array
x = np.arange(20, 30)

# print the complete array
print(x)

# get first 5 elements
print(x[0:5])

# same as previous line 
print(x[:5]) 

# get 4th to 7th elements
print(x[3:7])

# get every second element from index 2 up to (not including) index 9
print(x[2:9:2]) 

You can also start at the end and count back. Generally, the index of the end is not known. You can find out how long the array is and access the last value by typing `x[len(x)-1]` but it would be inconvenient to have to type `len(arrayname)` all the time. Luckily, there is a shortcut: `x[-1]` is the last value in the array. 

In [None]:
# create array with step 10
xvalues = np.arange(0, 100, 10)

# print the array
print(xvalues)

# print the last value in array
print(xvalues[len(xvalues) - 1])  

# same as previous line but shorter
print(xvalues[-1])  

# prints the array backwards
# start at the end and go back with steps of -1
print(xvalues[-1::-1])  
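
Negative indices also work inside a range. The following sketch (the variable names are chosen for this example only) selects the last three values, everything except the last value, and the reversed array using the shorter form `[::-1]`.

```python
import numpy as np

xvalues = np.arange(0, 100, 10)

# last three values
last_three = xvalues[-3:]
print(last_three)

# everything except the last value
all_but_last = xvalues[:-1]
print(all_but_last)

# reversed array; with a negative step the start defaults to the end
reversed_values = xvalues[::-1]
print(reversed_values)
```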

You can assign one value to a range of an array by specifying a range of indices, 
or you can assign an array to a range of another array, as long as the ranges have equal length. 

In [None]:
# create an array and fill with 20
x = 20 * np.ones(10)
print(x)

# assign 40 to first 5 elements
x[0:5] = 40
print(x)

# reassign new array to first 5 elements
x[0:5] = np.arange(40, 50, 2)
print(x)

#### Exercise 1: Arrays and indices<a name="back1"></a>
Create an array of zeros with length 20. 
* Change the first 5 values to 10. 
* Change the next 10 values to a sequence starting at 12 and increasing with steps of 2 to 30 - do this with one command. 
* Set the final 5 values to 30. 
* Plot the value of the array on the $y$-axis vs. the index of the array on the $x$-axis. 
* Draw vertical dashed lines at $x=4$ and $x=14$ (i.e., the section between the dashed lines is where the line increases from 10 to 30). 
* Set the minimum and maximum values of the $y$-axis to 8 and 32 using the `ylim` command.

In [None]:
# put your code here


[Answer for Exercise 1](#ex1answer)

### Two-dimensional arrays
Arrays may have arbitrary dimensions. We will make frequent use of two-dimensional arrays. They can be created with any of the aforementioned functions by specifying the number of rows and columns of the array. Note that the number of rows and columns must be passed as a tuple!

In [None]:
# create an array with 3 rows and 4 columns
# fill array with 1s
x = np.ones((3, 4)) 
print(x)

Arrays may also be defined by specifying all the values in the array. The `array` function gets passed one list consisting of separate lists for each row of the array. 
* In the example below the rows are entered on different lines. That may make it easier to enter the array, but it is not required. 
* You can change the size of an array to any shape using the `reshape` function as long as the total number of entries doesn't change. 

In [None]:
# define array with different entries
x = np.array([[4, 2, 3, 2],
              [2, 4, 3, 1],
              [0, 4, 1, 3]])

# print the array
print(x)

# reshape array to 6 rows, 2 columns
print(np.reshape(x, (6, 2)))

# reshape array to 1 row, 12 columns
print(np.reshape(x, (1, 12)))  
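
To illustrate the condition on the total number of entries, here is a small sketch using the `reshape` method of an array, which behaves like `np.reshape`. A shape that asks for a different number of entries raises a `ValueError`. The flag `reshape_failed` is only a helper for this example.

```python
import numpy as np

# 12 entries in total
x = np.arange(12)

# 3 rows times 4 columns is 12 entries, so this works
grid = x.reshape((3, 4))
print(grid)

# 5 rows times 2 columns is 10 entries, so this fails
reshape_failed = False
try:
    x.reshape((5, 2))
except ValueError as err:
    reshape_failed = True
    print('reshape failed:', err)
```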

The index of a two-dimensional array is specified with two values. First the row index, then the column index.

In [None]:
# create array with zeros
x = np.zeros((3, 8))
print(x)

# set element in 1st row and 1st column to 100
x[0,0] = 100
print(x)

# set row with index 1, columns starting from 4 to 200
x[1,4:] = 200
print(x)

# set row with index 2, columns from the last one back to index 5 (stopping before index 4) to 400
x[2,-1:4:-1] = 400  
print(x)
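
Reading values works the same way as assigning them. As a sketch on a small example array (the name `grid` is made up for this illustration), a single colon selects everything along that dimension, so a whole row, a whole column, or a rectangular block can be picked out.

```python
import numpy as np

# 3 rows, 4 columns, filled with 0 to 11
grid = np.arange(12).reshape((3, 4))
print(grid)

# row with index 1, all columns
second_row = grid[1, :]
print(second_row)

# all rows, last column
last_column = grid[:, -1]
print(last_column)

# rows 0 and 1, columns 1 and 2
block = grid[0:2, 1:3]
print(block)
```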

#### Arrays are not matrices
We may think that arrays are matrices, or that one-dimensional arrays are vectors. But, it is crucial to understand that *arrays are not vectors or matrices*. 
* In numpy, the multiplication and division of two arrays is term by term and not like matrix or vector operations.

In [None]:
# create 2 1-d arrays
a = np.arange(4, 20, 4)
b = np.array([2, 2, 4, 4])

# print the arrays
print('array a:', a)
print('array b:', b)

# perform multiplication
print('a * b  :', a * b)  

# perform division
print('a / b  :', a / b)  
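
For comparison, an actual matrix product or inner product has to be requested explicitly. The sketch below uses the `@` operator and `np.dot` on small example arrays (`A` and `B` are made up for this illustration) to show how the results differ from the term-by-term product.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# term-by-term product
elementwise = A * B
print(elementwise)

# matrix product
matrix_product = A @ B
print(matrix_product)

# inner product of two 1-d arrays
a = np.arange(4, 20, 4)
b = np.array([2, 2, 4, 4])
inner = np.dot(a, b)
print(inner)
```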

#### Exercise 2: Two-dimensional array indices<a name="back2"></a>
For array `x` shown below, write code to print: 

* first row of `x`
* first column of `x`
* third row of `x`
* last two columns of `x`
* four values in the upper right hand corner of `x`
* four values at the center of `x`

`x = np.array([[4, 2, 3, 2],
              [2, 4, 3, 1],
              [2, 4, 1, 3],
              [4, 1, 2, 3]])`

In [None]:
# put your code here


[Answer for Exercise 2](#ex2answer)

### Visualizing two-dimensional arrays
Two-dimensional arrays can be visualized with the `plt.matshow` function. 
* A colorbar is added as a legend showing that, with the default color map, the value 2 corresponds to dark purple and the value 8 corresponds to yellow. 
* Ticks in the colorbar are specified to be 2, 4, 6, and 8. 
* Note that the first row of the matrix (with index 0) is plotted at the top, which corresponds to the location of the first row in the matrix.

In [None]:
# define 2-d array
x = np.array([[8, 4, 6, 2],
              [4, 8, 6, 2],
              [4, 8, 2, 6],
              [8, 2, 4, 6]])
print(x)

# use matshow to plot matrix
plt.matshow(x)

# add colorbar as legend to plot
plt.colorbar(ticks=[2, 4, 6, 8])

If you want other colors, you can choose one of the other color maps. To find out all the available color maps, go 
[here](http://matplotlib.org/examples/color/colormaps_reference.html). 
* To change the color map, you need to import the `cm` part of the matplotlib package, which contains all the color maps. 
* After you have imported the color map package (which we call `cm` below), you can specify any of the available color maps with the `cmap` keyword.

In [None]:
# import the colormap subpackage
import matplotlib.cm as cm

# create new plot with rainbow color map
plt.matshow(x, cmap=cm.rainbow)

# add a legend
plt.colorbar(ticks=np.arange(2, 9, 2));

#### Exercise 3: Create and visualize an array<a name="back3"></a>
Create an array of size 10 by 10. 
* The upper left-hand quadrant of the array should get the value 4, 
* the upper right-hand quadrant the value 3, 
* the lower right-hand quadrant the value 2,
* the lower left-hand quadrant the value 1. 

First create an array of 10 by 10 using the `zeros` command, then fill each quadrant by specifying the correct index ranges. 
* Note that the first index is the row number. 
* The second index runs from left to right. 

Visualize the array using `matshow`. Use the `jet` colormap to plot the matrix.

In [None]:
# put your code here


[Answer for Exercise 3](#ex3answer)

### Using Conditions on Arrays
If you have a variable, you can check whether its value is smaller or larger than a certain other value. This is called a *conditional* statement.

In [None]:
# create a variable
a = 4

# example of checking conditions
print('a < 2:', a < 2)
print('a > 2:', a > 2)

The statement `a < 2` returns a variable of type boolean, which means it can either be `True` or `False`. 
* Besides smaller than or larger than, there are several other conditions you can use:

In [None]:
a = 4

# check different possible conditions
print(a < 4)

# a is smaller than or equal to 4
print(a <= 4) 

# a is equal to 4. Note: there are 2 equal signs
print(a == 4) 

print(a >= 4) 

print(a > 4)

# a is not equal to 4
print(a != 4) 

It is important to understand the difference between one equal sign like `a=4` and two equal signs like `a==4`. 
* One equal sign means assignment. Whatever is on the right side of the equal sign is assigned to what is on the left side of the equal sign. 
* Two equal signs is a comparison and results in either `True` (when the left and right sides are equal) or `False`.

In [None]:
# checking equality
print(4 == 4)

# assign a with output of equality condition
a = 4 == 5

# print output a
print(a)
print(type(a))

You can also perform comparison statements on arrays, and it will return an array of booleans (`True` and `False` values) for each value in the array. 
* For example, let's create an array and find out which values of the array are below 3:

In [None]:
# create array
data = np.arange(5)
print(data)

# check for values less than 3
print(data < 3)

The statement `data<3` returns an array of type `boolean` that has the same length as the array `data` and for each item in the array it is either `True` or `False`. 
* A useful feature is that this array of `True` and `False` values can be used to specify the indices of an array:

In [None]:
# create an array
a = np.arange(5)

# select a subset of array a using boolean array
b = np.array([ True, True, True, False, False ])

# output is only the elements with true
print(a[b])

When the indices of an array are specified with a boolean array, only the values of the array where the boolean array is `True` are selected. 
* All values of an array that are less than, for example, 3 may be obtained by specifying a condition as the indices.

In [None]:
# initialize array a
a = np.arange(5)

# print complete array
print('complete array:', a)

# print subarray with values less than 3
print('values less than 3:', a[a < 3])
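
A boolean array can also be summarized directly. In the sketch below, `np.sum` counts how many values satisfy the condition (each `True` counts as 1), while `np.any` and `np.all` check whether at least one value or every value satisfies it. The variable names are chosen for this example only.

```python
import numpy as np

a = np.arange(5)

# boolean array, True where a is less than 3
mask = a < 3
print(mask.dtype)

# number of values less than 3
count = np.sum(mask)
print('number of values less than 3:', count)

# is at least one value larger than 3?
any_large = np.any(a > 3)
print('any value larger than 3:', any_large)

# are all values less than 3?
all_small = np.all(a < 3)
print('all values less than 3:', all_small)
```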

If we want to replace all values that are less than 3 by, for example, the value 10, use the following short syntax:

In [None]:
# create an array
a = np.arange(5)
print(a)

# replace values less than 3 with 10
a[a < 3] = 10
print(a)
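
The assignment above changes `a` itself. If the original values need to be kept, one alternative (not used elsewhere in this notebook) is `np.where(condition, value_if_true, value_if_false)`, which returns a new array. A minimal sketch:

```python
import numpy as np

a = np.arange(5)

# new array: 10 where a < 3, otherwise the value from a
replaced = np.where(a < 3, 10, a)

# a is unchanged, replaced holds the new values
print(a)
print(replaced)
```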

#### Exercise 4: Replace high and low in Arrays<a name="back4"></a>
Create an array for variable $x$ consisting of 100 values from 0 to 20. 
* Compute $y=\sin(x)$ and plot $y$ vs. $x$ with a blue line. 
* Next, replace all values of $y$ that are larger than $0.5$ by $0.5$, and all values that are smaller than $-0.75$ by $-0.75$. 
* Plot the modified $y$ vs. $x$ using a red line on the same graph. 

In [None]:
# put your code here


[Answer to Exercise 4](#ex4answer)

#### Exercise 5: Change marker color based on value<a name="back5"></a>
Create an array for variable $x$ consisting of 100 points from 0 to 20 and compute $y=\sin(x)$.
* Plot a blue dot for every $y$ that is larger than zero, and a red dot otherwise.

In [None]:
# put your code here


[Answer to Exercise 5](#ex5answer)

### Select indices based on multiple conditions
Multiple conditions can be given as well: 
* When two conditions both have to be true, use the `&` symbol. 
* When at least one of the conditions needs to be true, use the `|` symbol. 
* For example, let's plot $y=\sin(x)$ and plot blue markers when $y>0.7$ or $y<-0.5$ (using one plot statement), and a red marker when $-0.5\le y\le 0.7$. 
* When there are multiple conditions, each condition needs to be between parentheses.

In [None]:
# create array
x = np.linspace(0, 6 * np.pi, 50)

# get y data using sin function
y = np.sin(x)

# plot the output satisfying several conditions
plt.plot(x[(y > 0.7) | (y < -0.5)], y[(y > 0.7) | (y < -0.5)], 'bo')
plt.plot(x[(y >= -0.5) & (y <= 0.7)], y[(y >= -0.5) & (y <= 0.7)], 'ro');
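
Writing the same condition twice in one line gets long quickly. One option, sketched here without the plotting commands, is to store the boolean array in a variable and reuse it. The `~` operator inverts a boolean array, so the complement of a condition does not have to be written out by hand. The names `outer` and `inner` are made up for this example.

```python
import numpy as np

x = np.linspace(0, 6 * np.pi, 50)
y = np.sin(x)

# True where y > 0.7 or y < -0.5
outer = (y > 0.7) | (y < -0.5)

# True everywhere else, i.e., where -0.5 <= y <= 0.7
inner = ~outer

# the stored arrays can be reused as indices for both x and y
x_outer, y_outer = x[outer], y[outer]
x_inner, y_inner = x[inner], y[inner]

print('points outside the band:', len(x_outer))
print('points inside the band :', len(x_inner))
```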

#### Exercise 6: Multiple conditions<a name="back6"></a>
The file `xypoints.dat` contains 1000 randomly chosen $x,y$ locations of points; both $x$ and $y$ vary between -10 and 10. 
* Load the data using `loadtxt`, and store the first column of the array in an array called `x` and the second column in an array called `y`. 
* First, plot a red dot for all points. 
* On the same graph, plot a blue dot for all $x,y$ points where $x<-2$ and $-5\le y \le 0$.
* Finally, plot a green dot for any point that lies in the circle with center $(x_c,y_c)=(5,0)$ and with radius $R=5$. 
* Hint: it may be useful to compute a new array for the radial distance $r$ between any point and the center of the circle using the formula $r=\sqrt{(x-x_c)^2+(y-y_c)^2}$. 
* Use the `plt.axis('equal')` command to make sure the scales along the two axes are equal and the circular area looks like a circle.

In [None]:
# put your code here


[Answer to Exercise 6](#ex6answer)

#### Exercise 7: Fix the error<a name="back7"></a>
The code below is meant to give the last 5 values of the array `x` the values [50, 52, 54, 56, 58] and print the result to the screen, but it contains some errors.
* Remove the comment markers and run the code to see the error message. Then fix the code and run it again.

In [None]:
# fix the code below

#x = np.ones(10)
#x[5:] = np.arange(50, 62, 1)
#print(x)

[Answer to Exercise 7](#ex7answer)

### Solutions

The solutions to the exercises are available here. Please refrain from looking at them before solving the exercise.

<a name="ex1answer">Answer to Exercise 1</a>

In [None]:
# solution to exercise 1

# create array
x = np.zeros(20)

# assign 10 to first 5 elements
x[:5] = 10

# assign range to next 10 elements
x[5:15] = np.arange(12, 31, 2)

# assign 30 till the end
x[15:] = 30

# plot the array
plt.plot(x)

# plot the dashed lines in the same graph
plt.plot([4, 4], [8, 32],'k--')
plt.plot([14, 14], [8, 32],'k--')

# set the ylimit to the graph
plt.ylim(8, 32);

[Back to Exercise 1](#back1)

<a name="ex2answer">Answer to Exercise 2</a>

In [None]:
# solution to exercise 2

# initialize the array in question
x = np.array([[4, 2, 3, 2],
              [2, 4, 3, 1],
              [2, 4, 1, 3],
              [4, 1, 2, 3]])

# print first row
print('the first row of x')
print(x[0])

# print first column
print('the first column of x')
print(x[:, 0])

# print third row
print('the third row of x')
print(x[2])

# print last two columns
print('the last two columns of x')
print(x[:, -2:])

# print four values in corner
print('the four values in the upper right hand corner')
print(x[:2, 2:])

# print four values in center
print('the four values at the center of x')
print(x[1:3, 1:3])

[Back to Exercise 2](#back2)

<a name="ex3answer">Answer to Exercise 3</a>

In [None]:
# solution to exercise 3

# create array
x = np.zeros((10, 10))

# assign values to 4 quadrants
x[:5, :5] = 4
x[:5, 5:] = 3
x[5:, 5:] = 2
x[5:, :5] = 1

# print and show array
print(x)
plt.matshow(x, cmap='jet')
plt.colorbar(ticks=[1, 2, 3, 4]);

[Back to Exercise 3](#back3)

<a name="ex4answer">Answer to Exercise 4</a>

In [None]:
# solution to exercise 4

# define x
x = np.linspace(0, 20, 100)

# compute y values using sin function
y = np.sin(x)

# plot x vs y
plt.plot(x, y, 'b')

# reassign values in array y
y[y > 0.5] = 0.5
y[y < -0.75] = -0.75

# plot the new y
plt.plot(x, y, 'r');

[Back to Exercise 4](#back4)

<a name="ex5answer">Answer to Exercise 5</a>

In [None]:
# solution to exercise 5

# define x
x = np.linspace(0, 20, 100)

# compute y values using sin function
y = np.sin(x)

# plot the blue markers
plt.plot(x[y > 0], y[y > 0], 'bo')

# plot the red markers
plt.plot(x[y <= 0], y[y <= 0], 'ro');

[Back to Exercise 5](#back5)

<a name="ex6answer">Answer to Exercise 6</a>

In [None]:
# solution to exercise 6

# load the dataset
data = np.loadtxt('xypoints.dat',delimiter=',')
x = data[:,0]
y = data[:,1]

# plot dataset with red markers
plt.plot(x, y, 'ro')

# plot the points that satisfy condition
plt.plot(x[(x < -2) & (y >= -5) & (y <= 0)], y[(x < -2) & (y >= -5) & (y <= 0)], 'bo')

# compute the radius for all points
r = np.sqrt((x - 5) ** 2 + y ** 2)

# plot the points that satisfy radius condition
plt.plot(x[r < 5], y[r < 5], 'go')

# fix the scaling of plot
plt.axis('scaled');

[Back to Exercise 6](#back6)

<a name="ex7answer">Answer to Exercise 7</a>

In [None]:
# this is the corrected code snippet
x = np.ones(10)
x[5:] = np.arange(50, 60, 2)
print(x)

[Back to Exercise 7](#back7)