# Numpy
Numpy is a Python package for fast numerical computing and a staple of Data Science work. This document introduces Numpy and the Numpy array, a faster and more powerful alternative to Python lists for numerical data. Python lists can hold any data type, and even different types at the same time. As flexible as lists are, data analysis often means carrying out operations over entire collections of values, with a requirement for speedy results, and here Python lists become problematic. For example, consider some height and weight data:

In [1]:
# Create a python list of height in meters
height = [1.73, 1.68, 1.71, 1.89, 1.79]

# Create a python list of weight in kilograms
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Calculate the BMI for each person
weight / height ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

As the output above shows, Python lists do not support element-wise mathematical operations. The calculation could be done by extracting the individual `height` and `weight` values for each person and calculating each BMI separately, *but* this takes time and is inefficient. The better solution is to use __Numeric Python__ (Numpy) and its __Numpy array__.
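
For comparison, the element-by-element approach can be sketched in plain Python with `zip` and a list comprehension. This is only an illustration; the name `bmi_list` is an assumption and not part of the original notebook. It works, but every value is handled one at a time by the interpreter.

```python
# Height (m) and weight (kg) lists from the example above
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Pair each weight with the matching height and compute the BMI one person at a time
bmi_list = [w / h ** 2 for w, h in zip(weight, height)]

# Round only for display
print([round(b, 2) for b in bmi_list])
```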

## Vector (single dimension) Numpy Arrays
The Numpy array is similar to the Python list, except that with Numpy arrays we can perform calculations over entire arrays in a very fast and programmatically efficient way. For example:

In [2]:
import numpy as np

# Create a numpy array from `height`
np_height = np.array(height)

# Create a numpy array from `weight`
np_weight = np.array(weight)

# Calculate the BMI of each person
bmi = np_weight / np_height ** 2
bmi

array([ 21.85171573,  20.97505669,  21.75028214,  24.7473475 ,  21.44127836])

Each calculation is performed element-wise: the first person's BMI is calculated by dividing the first element of `np_weight` by the square of the first element of `np_height`, and so on.

__Note:__ Numpy arrays cannot contain elements of different types. If you try to build such an array, some of the elements' types are changed so that the array ends up homogeneous. This is known as *type coercion*. Additionally, the typical arithmetic operators, such as `+`, `-`, `*` and `/`, have a different meaning for regular Python lists and Numpy arrays: for example, `+` concatenates two lists but adds two arrays element by element.
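
Both points in the note can be checked with a short sketch. The variable names here (`mixed`, `list_sum`, `array_sum`) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Mixing an int, a float and a boolean: Numpy coerces everything to one common type
mixed = np.array([1, 2.5, True])
print(mixed.dtype)  # float64

# `+` concatenates Python lists...
list_sum = [1, 2, 3] + [1, 2, 3]
print(list_sum)  # [1, 2, 3, 1, 2, 3]

# ...but adds Numpy arrays element by element
array_sum = np.array([1, 2, 3]) + np.array([1, 2, 3])
print(array_sum)  # [2 4 6]
```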

### Subsetting Numpy Arrays
Beyond that, Numpy arrays can be subset in the very same way as Python lists (e.g. `bmi[1]` for index `1`). They can also be subset with an array of booleans, as follows:

In [3]:
# Flag which values are above `23` (returns an array of booleans)
bmi > 23

array([False, False, False,  True, False], dtype=bool)

In [4]:
# Subset based on the boolean
bmi[bmi > 23]

array([ 24.7473475])

In [5]:
# Slice the first four elements (indexing starts at `0`; index `4` is excluded)
bmi[0:4]

array([ 21.85171573,  20.97505669,  21.75028214,  24.7473475 ])

__Note:__ When subsetting, don't forget that indexing starts at `0`.
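
Boolean subsetting also extends to more than one condition. The following is a minimal sketch, assuming rounded versions of the BMI values shown earlier; the name `in_range` is hypothetical.

```python
import numpy as np

# Rounded BMI values from the example above
bmi = np.array([21.85, 20.98, 21.75, 24.75, 21.44])

# Combine two conditions element-wise with `&`; each condition needs its own parentheses
in_range = (bmi > 21.5) & (bmi < 23)

# Keep only the values where both conditions hold
print(bmi[in_range])  # [21.85 21.75]
```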

## Two-dimensional Numpy Arrays
Two-dimensional (or even n-dimensional) arrays can be created from a Python list of lists:

In [6]:
# 2D array from `height` and `weight` lists
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

# View the shape
np_2d.shape

(2, 5)

Here we have created a Numpy array with __2__ rows and __5__ columns. Just as with single-dimension arrays, Numpy arrays can only contain data of a single type. This is very important to remember in Data Science, as in many cases we deal with different types of data. Coercing such data into a Numpy array forces *ALL* of it to a single type, which can impact any future analysis.

In [7]:
# Create a Numpy array but add a "string" type to existing "float" types
np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
          [65.4, 59.2, 63.6, 88.4, "68.7"]])

array([['1.73', '1.68', '1.71', '1.89', '1.79'],
       ['65.4', '59.2', '63.6', '88.4', '68.7']], 
      dtype='|S32')

The output above shows that all the *float* values have been coerced to a single type: *string*.

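
If the strings all represent valid numbers, the coercion can be undone with the array's `astype` method. This is a minimal sketch under that assumption; the names `coerced` and `restored` are illustrative.

```python
import numpy as np

# One string among the floats forces every element to become a string
coerced = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                    [65.4, 59.2, 63.6, 88.4, "68.7"]])

# Convert back to floats; this raises ValueError if any string is not a valid number
restored = coerced.astype(float)

print(restored.dtype)  # float64
```
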
### Subsetting Two-dimensional Numpy Arrays
With single-dimension arrays, index `0` provides the first element of the vector. Since two-dimensional Numpy arrays are in essence lists of lists, index `0` instead provides the first element of the outer list: the entire first row.

In [8]:
# Get the first element of the `np_2d` array
np_2d[0]

array([ 1.73,  1.68,  1.71,  1.89,  1.79])

To index an element within the row, we can extend the same call with a second index:

In [9]:
# Get the third element of the first row
np_2d[0][2]

1.71

__Remember__, indexing starts at `0`, so the $3^{rd}$ element is at index `2`.

An alternative way of subsetting uses a single pair of *square brackets* and a *comma* to separate the row (or individual Python list) from the element within the row (or list), as follows:

In [10]:
# Alternate subsetting method
np_2d[0, 2]

1.71

As can be seen, this method is more intuitive and allows for further flexibility. For example, suppose we want to select the `height` and `weight` of both the $2^{nd}$ and $3^{rd}$ family members:

In [11]:
# Both `height` and `weight` of the 2nd and 3rd members
np_2d[:, 1:3]

array([[  1.68,   1.71],
       [ 59.2 ,  63.6 ]])

Here we use `:` to indicate that we want all rows (in this case, both of them), and then select the $2^{nd}$ and $3^{rd}$ columns with the slice `1:3`. __Remember__ that the end index, `3`, is not included. The result of this intersection is a two-dimensional array with __2__ rows and __2__ columns. This is particularly useful in Data Science when we have to sift through large amounts of data to look at the formatting of particular variables, find the value of a particular variable, or select an entire "column".

In [12]:
# Create a relatively large list manually (note: `x` is overwritten by the `column_stack` call below)
x = [[74, 180], [74, 215], [72, 210], [72, 210], [73, 188], [69, 176], [69, 209], [71, 200], [76, 231], [71, 180],
     [73, 188], [73, 180], [74, 185], [74, 160], [69, 180], [70, 185], [73, 189], [75, 185], [78, 219], [79, 230],
     [76, 205], [74, 230], [76, 195], [72, 180], [71, 192], [75, 225], [77, 203], [74, 195], [73, 182], [74, 188],
     [78, 200], [73, 180], [75, 200], [73, 200], [75, 245], [75, 240], [74, 215], [69, 185], [71, 175], [74, 199],
     [73, 200], [73, 215], [76, 200], [74, 205], [74, 206], [70, 186], [72, 188], [77, 220], [74, 210], [70, 195],
     [73, 200], [75, 200], [76, 212], [76, 224], [78, 210], [74, 205], [74, 220], [76, 195], [77, 200], [81, 260],
     [78, 228], [75, 270], [77, 200], [75, 210], [76, 190], [74, 220], [72, 180], [72, 205], [75, 210], [73, 220],
     [73, 211], [73, 200], [70, 180], [70, 190], [70, 170], [76, 230], [68, 155], [71, 185], [72, 185], [75, 200],
     [75, 225], [75, 225], [75, 220], [68, 160], [74, 205], [78, 235], [71, 250], [73, 210], [76, 190], [74, 160],
     [74, 200], [79, 205], [75, 222], [73, 195], [76, 205], [74, 220], [74, 220], [73, 170], [72, 185], [74, 195],
     [73, 220], [74, 230], [72, 180], [73, 220], [69, 180], [72, 180], [73, 170], [75, 210], [75, 215], [73, 200],
     [72, 213], [72, 180], [76, 192], [74, 235], [72, 185], [77, 235], [74, 210], [77, 222], [75, 210], [76, 230],
     [80, 220], [74, 180], [74, 190], [75, 200], [78, 210], [73, 194], [73, 180], [74, 190], [75, 240], [76, 200],
     [71, 198], [73, 200], [74, 195], [76, 210], [76, 220], [74, 190], [73, 210], [74, 225], [70, 180], [72, 185],
     [73, 170], [73, 185], [73, 185], [73, 180], [71, 178], [74, 175], [74, 200], [72, 204], [74, 211], [71, 190],
     [74, 210], [73, 190], [75, 190], [75, 185], [79, 290], [73, 175], [75, 185], [76, 200], [74, 220], [76, 170],
     [78, 220], [74, 190], [76, 220], [72, 205], [74, 200], [76, 250], [74, 225], [75, 215], [78, 210], [75, 215],
     [72, 195], [74, 200], [72, 194], [74, 220]]

# Create a relatively large data set automatically by drawing 5000 samples from a normal distribution
height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(60.32, 15, 5000), 2)

# Use the `column_stack` function to paste the two columns together
x = np.column_stack((height, weight))

# Create a numpy array (`x` is already an array here, so this makes a copy)
np_x = np.array(x)

# View the "size" of the array
print "The shape of the array is:", np_x.shape

# Print the 50th row of np_x (Remember that the 50th row is index 49)
print "The 50th row of the array is:", np_x[49, :]

# Make a new variable, np_y, containing the entire second column of np_x
np_y = np_x[:, 1]
print "The shape of the new array is:", np_y.shape

# Print the first-column element of the 270th row of np_x
print "The first column element of the 270th row is:", np_x[269, 0]

The shape of the array is: (5000, 2)
The 50th row of the array is: [  1.48  60.36]
The shape of the new array is: (5000,)
The first column element of the 270th row is: 1.82


__Note:__ in the above Python code, to generate a random normal distribution, the syntax used is as follows:
`x = np.random.normal(<distribution mean>, <distribution standard deviation>, <no. of samples>)`
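
The same call can be written with keyword arguments, which makes the role of each value explicit. The sketch below also fixes the random seed so that repeated runs draw the same numbers; the seed value and the name `sample_heights` are arbitrary choices for illustration.

```python
import numpy as np

# Fix the seed so the same "random" numbers are drawn on every run (optional)
np.random.seed(42)

# loc = distribution mean, scale = standard deviation, size = number of samples
sample_heights = np.random.normal(loc=1.75, scale=0.20, size=5000)

print(sample_heights.shape)  # (5000,)
```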


Finally, n-dimensional arrays also allow us to perform element-wise calculations, for example:

In [13]:
# Build 2d array
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

print "Every element multiplied by 2:\n", np_mat * 2, "\n"
print "2d array adding another array:\n", np_mat + np.array([10, 10]), "\n"
print "2d array added to itself:\n", np_mat + np_mat, "\n"

Every element multiplied by 2:
[[ 2  4]
 [ 6  8]
 [10 12]] 

2d array adding another array:
[[11 12]
 [13 14]
 [15 16]] 

2d array added to itself:
[[ 2  4]
 [ 6  8]
 [10 12]] 
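
The second calculation above adds a 1D array to a 2D array. Numpy handles this through *broadcasting*: the shorter array is applied to every row, provided its length matches the number of columns. A minimal sketch, with `row_offsets` and `shapes_matched` as illustrative names:

```python
import numpy as np

np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

# Shape (3, 2) plus shape (2,): the 1D array is added to every row
row_offsets = np.array([10, 20])
print(np_mat + row_offsets)

# Shape (3, 2) plus shape (3,) does not line up, so Numpy raises ValueError
try:
    np_mat + np.array([10, 20, 30])
    shapes_matched = True
except ValueError:
    shapes_matched = False

print(shapes_matched)  # False
```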



## Numpy Basic Statistics
Since this document outlines the use of Numpy in the Data Science process, we will leverage it to get to know the data. Aside from performing mathematical operations on arrays, Numpy also allows us to compute __summary statistics__ such as the *mean*, *median*, *correlation coefficient* and *standard deviation*.

In [14]:
# Get the summary statistics of the `height` (first column) from the relatively large array used before
print "The average person's height is:\n", np.mean(np_x[:, 0]), "\n"
print "The middle person's height (median) is:\n", np.median(np_x[:, 0]), "\n"
print "What is the correlation coefficient of Height vs. Weight?\n", np.corrcoef(np_x[:, 0], np_x[:, 1]), "\n"
print "The standard deviation of a person's height is:\n", np.std(np_x[:, 0])

The average person's height is:
1.75 

The middle person's height (median) is:
1.75 

What is the correlation coefficient of Height vs. Weight?
[[ 1.          0.02290169]
 [ 0.02290169  1.        ]] 

The standard deviation of a person's height is:
0.204925645052


Numpy also provides counterparts to standard Python functions, such as `np.sum()` and `np.sort()`, but the key difference here is __speed__. Because a Numpy array holds data of a single type, the calculations are faster.
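
One way to see the difference on your own machine is to time both versions with the standard `timeit` module. This is a rough sketch rather than a rigorous benchmark: the array size and repeat count are arbitrary, and the measured times vary between machines.

```python
import timeit
import numpy as np

# The same 100,000 integers as a Python list and as a Numpy array
values = list(range(100000))
np_values = np.array(values, dtype=np.int64)

# Both approaches give the same total
list_total = sum(values)
array_total = int(np.sum(np_values))
print("Totals match: " + str(list_total == array_total))

# Time 100 repetitions of each summation
list_time = timeit.timeit(lambda: sum(values), number=100)
array_time = timeit.timeit(lambda: np.sum(np_values), number=100)
print("list sum: %.4f s, np.sum: %.4f s" % (list_time, array_time))
```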

## Data Science Exercise
You've contacted FIFA for some data and they handed you two lists. The lists are the following:
```
positions = ['GK', 'M', 'A', 'D', ...]
heights = [191, 184, 185, 180, ...]
```
Each element in the lists corresponds to a player. The first list, positions, contains strings representing each player's position. The possible positions are: 'GK' (goalkeeper), 'M' (midfield), 'A' (attack) and 'D' (defense). The second list, heights, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (191 cm).

You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills.

### Instructions
- Convert `heights` and `positions`, which are regular lists, to numpy arrays. Call them `np_heights` and `np_positions`.
- Extract all the heights of just the goalkeepers. Assign the result to `gk_heights`.
- Extract the heights of all the other players. Assign the result to `other_heights`.
- Print out the median height of the goalkeepers.
- Do the same for the other players.

In [15]:
# Manually create 50 sample `positions`
positions = ['GK', 'M', 'A', 'D', 'M', 'D', 'M', 'M', 'M', 'A', 'M', 'M', 'A', 'A', 'A', 'D', 'A', 'D', 'M', 'GK',
             'D', 'D', 'M', 'M', 'M', 'M', 'D', 'M', 'GK', 'D', 'GK', 'D', 'D', 'M', 'A', 'M', 'D', 'M', 'GK', 'M',
             'GK', 'A', 'D', 'GK', 'A', 'GK', 'GK', 'GK', 'GK', 'M']

# Manually create 50 sample `heights`
heights = [191, 184, 185, 180, 181, 187, 170, 179, 183, 186, 185, 170, 187, 183, 173, 188, 183, 180, 188, 175, 193,
           180, 185, 170, 183, 173, 185, 185, 168, 190, 178, 185, 185, 193, 183, 184, 178, 180, 177, 188, 177, 187,
           186, 183, 189, 179, 196, 190, 189, 188]

# Import numpy
import numpy as np

# Convert positions and heights to numpy arrays: np_positions, np_heights
np_heights = np.array(heights)
np_positions = np.array(positions)

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == "GK"]

# Heights of the other players: other_heights
other_heights = np_heights[np_positions != "GK"]

# Print out the median height of goalkeepers
print("Median height of goalkeepers: " + str(np.median(gk_heights)))

# Print out the median height of other players
print("Median height of other players: " + str(np.median(other_heights)))

Median height of goalkeepers: 179.0
Median height of other players: 185.0
