# 1. Basic Arrays and Data Types

In [4]:
import numpy

## 1.1 What is an array?

An array is an ordered and structured collection of elements. Arrays are structured around the number of dimensions they contain, as well as how many elements exist along each dimension. Today we will focus on arrays that have only one dimension.

For example, we can have a one-dimensional array that contains six elements:

  4 5 6 7 8 9


### 1.1.1 Why numpy arrays and not just python lists or tuples?

Numpy arrays are structured in memory very differently from Python lists or tuples. The full array is stored within a single large block in the computer's memory, which makes many operations much quicker, especially when the number of elements in the array becomes large.
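As a small illustration of the difference in use, the sketch below doubles every element once with a Python list and once with a numpy array. The variable names are made up for this example, and no timing is measured here; the actual speed-up depends on the machine and the array size.

```python
import numpy

# A plain Python list needs an explicit loop (here a list comprehension)
# to double every element.
values_list = [1, 2, 3, 4, 5]
doubled_list = [2 * v for v in values_list]

# A numpy array applies the multiplication to all elements at once.
values_array = numpy.array(values_list)
doubled_array = 2 * values_array

print( doubled_list )
print( doubled_array )
```

Both approaches give the same numbers; the array version avoids the element-by-element Python loop.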

## 1.2 Creating an array

Perhaps the most straightforward way to create a numpy array is by passing a python list or a tuple to the `numpy.array` function. 

In [3]:
# create a numpy array from a python list:
my_list = [1, 3, 5, 7, 9]
print( numpy.array(my_list) )

# or without pre-specifying the python list:
print( numpy.array( [2, 4, 6, 8] ) )

# numpy arrays can also be created based on python tuples:
print( numpy.array( (11, 12, 13, 14, 15) ) )

[1 3 5 7 9]
[2 4 6 8]
[11 12 13 14 15]


### 1.2.1 Creating arrays based on range

Numpy has a function similar to Python's `range()` function: `arange(a, b, s)`, where $a$ is the start point, $b$ is the end point (which is not included in the result) and $s$ is the step size. The inputs can be integers or floats.

Another function to create a range is `linspace(a, b, i)`, where $a$ is the start point, $b$ is the end point (which is included by default), and $i$ is the number of items.

The advantage of `linspace` is that you can specify the number of items, and the advantage of `arange` is that you can specify the step size.

In [4]:
# fixed step size:
x = numpy.arange(1, 11.9, 2.1)
print( x )

# fixed number of items:
y = numpy.linspace(0, 2, 9)
print( y )

[  1.    3.1   5.2   7.3   9.4  11.5]
[ 0.    0.25  0.5   0.75  1.    1.25  1.5   1.75  2.  ]
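The two functions also treat the end point differently, which is easy to miss. The short sketch below uses values chosen only for illustration: `arange` stops before the end point, while `linspace` includes it by default.

```python
import numpy

# arange stops before the end point: 1.0 itself is not included.
x = numpy.arange(0, 1, 0.25)
print( x )

# linspace includes the end point by default.
y = numpy.linspace(0, 1, 5)
print( y )
```

Here `x` holds four values (0 up to 0.75) and `y` holds five values (0 up to 1).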


### 1.2.2 Creating arrays of random values

If we want to create an array of random numbers, we can use the `numpy.random.random` function. It returns an array of numbers drawn uniformly from the interval from 0 (inclusive) to 1 (exclusive).

The expression `numpy.random.uniform(x, y, z)` returns an array with `z` random numbers uniformly drawn from the interval between `x` and `y`.

In [5]:
# create array with random numbers between 0 and 1:
print( numpy.random.random(4) )

# create array with random numbers between 5 and 10:
print( numpy.random.uniform(5, 10, 5) )

[ 0.16631777  0.03131789  0.55566663  0.234745  ]
[ 9.2635129   7.08976559  8.55498765  6.60740643  5.24548956]
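Because the output is random, it can be reassuring to check that the values really fall inside the requested interval. A minimal sketch, assuming the array methods `min()` and `max()` (not introduced above) to find the smallest and largest sample:

```python
import numpy

# Draw 1000 samples between 5 and 10 and inspect the extremes.
samples = numpy.random.uniform(5, 10, 1000)

print( samples.size )
print( samples.min() >= 5 )
print( samples.max() <= 10 )
```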


## Exercise 1.1

Create an array of length 15 with equally spaced values from 13 to 99.

Create an array that ranges from 0 to 1 in steps of 0.015.

Create an array of length 10,000,000 with random values between 0 and 100.


For more information about creating arrays: 
http://docs.scipy.org/doc/numpy/user/basics.creation.html

### 1.2.3 Size and shape

It is often useful to check the size and shape of an array when debugging your code. For an array `a` you can use the attribute `a.size` to get the total number of elements in the array and the attribute `a.shape` to get its shape. These can differ when we consider arrays that contain more than one dimension (next week).

In [6]:
a = numpy.random.uniform(1, 100, 1000000)

print(a.size, a.shape)

1000000 (1000000,)


## 1.3 Arrays and data types

### 1.3.1 Data types in Python

Before discussing Numpy datatypes, first a small list of the data types in Python:

##### Immutable types:
- boolean (True, False)
- int (integer)
- float
- complex 
- str (string)
- bytes
- tuple ( )

The value of an object of these types cannot be changed after it is created; any "change" produces a new object.

##### Mutable types:
- list [ ]
- set
- dict { } (dictionary)

The contents of an object of these types can be changed in place after it is created.
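The difference between mutable and immutable objects can be sketched with a list and a tuple. The variable names below are chosen only for this illustration.

```python
# A list is mutable: its contents can be changed in place.
my_list = [1, 2, 3]
my_list[0] = 99
print( my_list )

# A tuple is immutable: assigning to an element raises a TypeError.
my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99
except TypeError as error:
    print( "TypeError:", error )
```

After running this, `my_list` starts with 99, while `my_tuple` is unchanged.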

### 1.3.2 Data types in Numpy

Numpy is based on **arrays**. You can think of an array as a list, or a table, where each cell of the table contains an item of the same **datatype**. The elements of an array must all be the same data type. 

The basic data types of a Numpy array are:
- float (float16, float32, or float64)
- integer (int8, int16, int32, or int64)
- unsigned integer: this number cannot be negative (uint8, uint16, uint32, or uint64)
- boolean (bool)
- complex (complex64 or complex128)
- string (for example <U3 or <U64, where the number indicates the maximum length of the strings)

The data type of the array `x` can be obtained with the `x.dtype` attribute. 

#### 1.3.2.1 Memory Details

The numbers 8, 16, 32, 64, 128 in the name of datatypes are used to indicate the amount of memory storage.

For example, the data type `int8` only uses 1 byte (8 bits) and therefore this variable can only store a number in the range of -128 to 127. The data type `int16` uses 2 bytes and therefore this variable can store a number in the range of -32768 to 32767. 

For the unsigned integer, the number can never be negative and therefore the data type `uint8` can have a value in the range of 0 to 255, where $255 = 2^8 - 1$. 
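These ranges do not have to be memorized. As a sketch, the `numpy.iinfo` function (not used elsewhere in this notebook) reports the smallest and largest value of an integer type, and `numpy.dtype(...).itemsize` gives the number of bytes per element:

```python
import numpy

# Print the value range and the storage size in bytes of a few integer types.
for integer_type in [numpy.int8, numpy.int16, numpy.uint8]:
    info = numpy.iinfo(integer_type)
    n_bytes = numpy.dtype(integer_type).itemsize
    print( info.dtype, info.min, info.max, n_bytes )
```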

### A few examples of different types:


In [7]:
# array with integers:
x = numpy.array([1, 3, 5, 7, 9])
print( x, x.dtype )

[1 3 5 7 9] int64


In [8]:
# array with floats:
y = numpy.array([2.2, 4.4, 6.6, 8.8])
print( y, y.dtype )

[ 2.2  4.4  6.6  8.8] float64


In [9]:
# array with booleans:
z = numpy.array([True, False, True])
print( z, z.dtype )

[ True False  True] bool


In [10]:
# array with strings:
x = numpy.array(["a", "b", "cde"])
print( x, x.dtype )

['a' 'b' 'cde'] <U3


### 1.3.3 Specifying data types

If you initialize an array with multiple data types, the elements are converted to the same type. 

You can specify the data type of the array when you create it with the `array()` function by using the `dtype` keyword argument. Note that not every combination is possible. For example, when your array contains text you cannot choose float as the data type, because the text cannot be converted to floats, unless the text exactly represents a floating point number.


In [11]:
# array with mixed data types:
x = numpy.array([1, 3.4, True, 2.3+4.5j, "a"])
print( x, x.dtype )

['1' '3.4' 'True' '(2.3+4.5j)' 'a'] <U64
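The conversion picks a type that can represent every element. The sketch below uses smaller mixtures, chosen only for illustration, to show the usual pattern: booleans become integers, integers become floats, and floats become complex numbers. The exact integer width (for example `int64`) can depend on the platform.

```python
import numpy

# booleans mixed with integers become integers
bool_int = numpy.array([True, 2])
print( bool_int, bool_int.dtype )

# integers mixed with floats become floats
int_float = numpy.array([1, 2.5])
print( int_float, int_float.dtype )

# floats mixed with complex numbers become complex
float_complex = numpy.array([1, 2.5, 1+2j])
print( float_complex, float_complex.dtype )
```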


In [12]:
# explicitly specify the data type:
y = numpy.array([9, 8, 7, 6], dtype='float')
print( y, y.dtype )

[ 9.  8.  7.  6.] float64


In [13]:
# Try to convert strings to integers
strings = ["12", "3", "24"]
z = numpy.array(strings, dtype='int')
print( z, z.dtype )

[12  3 24] int64


In [14]:
numerals = ["one", "two", "three"]
a = numpy.array(numerals, dtype='int')
print( a, a.dtype )

ValueError: invalid literal for int() with base 10: 'one'

### 1.3.4 Converting to a different data type

Use the `astype()` method to convert an existing Numpy array to a different type.

In [16]:
# convert an array with integers to the data type float:
x = numpy.array([1, 3, 5, 7, 9])
print( x, x.dtype )

g = x.astype('float')
print( g, g.dtype )

[1 3 5 7 9] int64
[ 1.  3.  5.  7.  9.] float64


In [17]:
# convert an array with strings to float:
y = numpy.array(["1.4", "3.4", "5.4"])
print( y, y.dtype )

h = y.astype('float')
print( h, h.dtype )

['1.4' '3.4' '5.4'] <U3
[ 1.4  3.4  5.4] float64


Sometimes the data type is converted but the content changes slightly. 
For example, when converting from float to integer, the fractional part is discarded (truncation toward zero), so positive numbers are rounded down. 
When converting an array of integers to the Boolean data type, 0 is converted to False and every nonzero value, such as 1, is converted to True. 

In [18]:
# from float to integer
x = numpy.array([2.2, 3.2, 2.8])
print( x, x.dtype )

a = x.astype('int')
print( a, a.dtype )

[ 2.2  3.2  2.8] float64
[2 3 2] int64


In [19]:
# from 0-1 to boolean
x = numpy.array([0, 1, 1, 0])
print(x, x.dtype)
a = x.astype('bool')
print(a, a.dtype)

[0 1 1 0] int64
[False  True  True False] bool
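Negative numbers show how float-to-integer conversion differs from rounding down. The sketch below compares `astype('int')` with `numpy.floor` (a function not used elsewhere in this notebook) and also converts values other than 0 and 1 to booleans. The variable names are chosen only for this example.

```python
import numpy

x = numpy.array([-2.8, -0.5, 0.5, 2.8])

# astype('int') discards the fractional part (truncation toward zero)
truncated = x.astype('int')
print( truncated )

# numpy.floor always rounds down, so negative values differ
floored = numpy.floor(x)
print( floored )

# every nonzero value becomes True, only 0 becomes False
flags = numpy.array([0, 2, -1]).astype('bool')
print( flags )
```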


Not every data type can be converted to all other data types. Some examples:


In [20]:
# from string to boolean
x = numpy.array(["true", "false"])
a = x.astype('bool')
print( a, a.dtype )

ValueError: invalid literal for int() with base 10: 'true'

In [21]:
# from non-numeric strings to integer
x = numpy.array(["a", "b", "c"])
#x = numpy.array(["1", "2"])
a = x.astype('int')
print( a, a.dtype )

ValueError: invalid literal for int() with base 10: 'a'

## Exercise 1.2

Part A) Create an array named `a` of length 5 containing random values between 1 and 10. Determine the type of `a`. Convert `a` to an array of type `<U8`, which is a string type that holds at most 8 characters, and then convert it back to an array of float values. Describe what happens to the values.

Part B) Convert the resulting array from Part A into an array of integers. Then convert it back to float values. Describe what happens to the values.

## 1.4 Indexing arrays

Indexing, as you might remember from Data Processing or Intro to R, involves selecting a subset of an array. There are many ways to address content in an array; we will discuss three here and more next week. For more complete information about indexing, see
http://docs.scipy.org/doc/numpy/user/basics.indexing.html

### 1.4.1 Linear indexing

Indexing in a 1-dimensional array is the same as the indexing in a Python list (we will revisit this next week when we discuss multi-dimensional arrays).

When accessing more than one element from an array, the slicing operator `":"` can be used, and it works the same way as with Python lists.
If the index is `[a:b]` then indices that are used are `a` up to but not including `b`. 


In [23]:
a = numpy.arange(10)

print( a )
print( a[:] )
print( a[3:6] )
print( a[:4] )

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
[3 4 5]
[0 1 2 3]
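Since slicing works as it does for Python lists, negative indices and a step size are also available. A short sketch with the same kind of array:

```python
import numpy

a = numpy.arange(10)

print( a[-1] )     # last element: 9
print( a[-3:] )    # last three elements: [7 8 9]
print( a[::2] )    # every second element: [0 2 4 6 8]
print( a[1:8:3] )  # from index 1 up to (not including) 8, step 3: [1 4 7]
```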


Some functions return linear indexes. Two examples are the `argmin` and `argmax` functions:

In [24]:
a = numpy.random.uniform(-0.5, .5, 12)
print( a )

# Print the index of the maximum value
max_index = numpy.argmax(a)
print( max_index )

# Use the index to look up the maximum value itself
print( a[max_index] )

[ 0.35455843  0.28424377  0.19526181 -0.08488424 -0.05738835  0.10225253
  0.11434307  0.41496516 -0.23599021  0.18925508 -0.2678831  -0.43501644]
7
0.414965162489


## Exercise 1.3

Build an array of 10,000,000 random values between 0 and 1. Print the minimum value of the array as well as the five values that occur before the minimum.

### 1.4.2 Boolean indexing

Indexing an array with a boolean array of the same length returns all values at the positions where the boolean array is True.

In [26]:
a = numpy.random.uniform(-0.5, .5, 25)

# Create a boolean index for positive numbers in array a
index = a > 0.0
print( index )

# Print all the positive numbers
print( a[index] )

[False False  True  True  True  True  True  True False  True False  True
  True False False  True  True  True False  True  True False  True False
  True]
[ 0.16883431  0.38054613  0.11656546  0.26608041  0.22479248  0.00765304
  0.43948191  0.15957886  0.12361741  0.45781249  0.3733486   0.33505375
  0.42694833  0.35379198  0.03713069  0.41881297]
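With random input the result changes on every run, so here is the same idea as a sketch on a small fixed array (values chosen only for illustration). It also uses the `~` operator, not shown above, to invert a boolean index.

```python
import numpy

a = numpy.array([-0.3, 0.2, 0.0, 0.4, -0.1])

# True where the value is strictly positive
index = a > 0.0
print( index )

# select the positive values
positive = a[index]
print( positive )

# ~index flips True and False, selecting everything that is not positive
not_positive = a[~index]
print( not_positive )
```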


### 1.4.3 Indexing with an array of indices

Store the indices as integers in a separate array and use that array as the index: the result contains exactly the elements at those indices. 

In [27]:
b = numpy.linspace(0, 1, 10)
print( b )

# Print numbers at prime indices
index = numpy.array([ 2, 3, 5, 7])
print( b[index] )

[ 0.          0.11111111  0.22222222  0.33333333  0.44444444  0.55555556
  0.66666667  0.77777778  0.88888889  1.        ]
[ 0.22222222  0.33333333  0.55555556  0.77777778]
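The indices do not have to be sorted or unique. In this sketch (values chosen only for illustration), the result follows the order of the index array and an index that appears twice is returned twice.

```python
import numpy

b = numpy.array([10, 20, 30, 40, 50])

# indices may appear in any order and may repeat
index = numpy.array([4, 0, 0, 2])
selected = b[index]
print( selected )
```

This prints `[50 10 10 30]`.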


## Exercise 1.4

Draw 1,000,000 samples from a normal distribution with a mean of 1 and a standard deviation of 0.5. Compute the proportion of samples that are below 0.


## Exercise 1.5

Build an array with 10,000 random values between 0 and 1. Determine the largest value among all entries whose index is divisible by 5.
