# NumPy

In describing numpy arrays, let's start with Python lists and why a data scientist might need something better.  As a reminder, a list is a container object that can hold different data types or objects.

The numpy package should come with the Anaconda distribution.  If it is missing, install it with the **pip install** command (use **pip3** if that is how Python 3 is set up on your system).
```
$ pip install numpy
$ pip3 install numpy
```

In [None]:
h = [67, 70, 65, 63, 72]
w = [160, 150, 165, 120, 205]

**Body Mass Index (BMI)**

Formula: weight (lb) / [height (in)]<sup>2</sup> × 703

In [None]:
# Raises a TypeError: Python lists do not support element-wise arithmetic.
bmi = w / (h**2) * 703

Lists are not built for calculations: they have no element-wise arithmetic, so the cell above raises a TypeError.  To calculate the BMI for each person, we would have to write a loop, which is slow on large datasets and tedious to write.  A more elegant solution is to use NumPy arrays.
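
For comparison, here is a minimal sketch of the loop that plain lists would require. The names `bmi_list`, `height_in`, and `weight_lb` are only illustrative and do not appear elsewhere in this notebook.

```python
# Sample data from the cells above: heights in inches, weights in pounds.
h = [67, 70, 65, 63, 72]
w = [160, 150, 165, 120, 205]

# Lists have no element-wise arithmetic, so each BMI is computed one at a time.
bmi_list = []
for height_in, weight_lb in zip(h, w):
    bmi_list.append(weight_lb / height_in**2 * 703)

print([round(value, 2) for value in bmi_list])
```

The loop works, but every new calculation needs its own loop, which is what the array version avoids.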

NumPy performs calculations over entire arrays at once, which makes them quick to run and easy to write.

In [None]:
import numpy as np

In [None]:
np_ht = np.array([67, 70, 65, 63, 72])

In [None]:
np_ht

In [None]:
np_wt = np.array(w)

In [None]:
np_wt

In [None]:
bmi = np_wt / (np_ht**2) * 703

In [None]:
bmi

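NumPy calculations are fast and efficient because every element in an array has the same data type.  If the values you pass in have different types, NumPy converts them all to one common type, as the next cell shows.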

In [None]:
same = np.array([1.2, "is", True])

In [None]:
same
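
One way to see the common type is to inspect the array's `dtype`. The sketch below uses `dtype.kind`, a one-letter code for the type family; the variable names `ints`, `floats`, and `mixed` are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5])
mixed = np.array([1.2, "is", True])  # float, string, and bool together

print(ints.dtype.kind)    # 'i': signed integer
print(floats.dtype.kind)  # 'f': floating point
print(mixed.dtype.kind)   # 'U': Unicode string, so every item became text
print(mixed)
```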

## List vs np.array

In [None]:
python_list = [1, 2, 3, 4]

In [None]:
np_array = np.array([1, 2, 3, 4])

In [None]:
python_list + python_list

In [None]:
np_array + np_array

Different objects have different behaviors!  The `+` operator concatenates two lists, but it adds two NumPy arrays element by element.
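
A side-by-side sketch of the two behaviors, using the same values as the cells above (`doubled_list` and `doubled_array` are illustrative names):

```python
import numpy as np

python_list = [1, 2, 3, 4]
np_array = np.array([1, 2, 3, 4])

doubled_list = python_list + python_list  # concatenation: the list gets longer
doubled_array = np_array + np_array       # element-wise sum: same length

print(doubled_list)   # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled_array)  # [2 4 6 8]
```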

## NumPy subsetting

In [None]:
bmi

In [None]:
bmi[1]

In [None]:
bmi > 25

In [None]:
bmi[bmi > 25]
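
The comparison `bmi > 25` produces an array of booleans, and indexing with that array keeps only the positions that are `True`. A small self-contained sketch with the five sample people from earlier (the name `mask` is illustrative):

```python
import numpy as np

np_ht = np.array([67, 70, 65, 63, 72])
np_wt = np.array([160, 150, 165, 120, 205])
bmi = np_wt / np_ht**2 * 703

mask = bmi > 25          # one boolean per person
print(mask)              # [ True False  True False  True]
print(bmi[mask])         # only the BMI values above 25
print(mask.sum())        # True counts as 1, so this is the number of matches: 3
```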

## Baseball example

Suppose you are researching the BMI of MLB players and obtain a list of the players' heights and weights (Source: [stat.ucla.edu](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights)).  For this exercise, you can find the stats in the baseball.csv file.

In [None]:
import csv

In [None]:
height = []
weight = []
with open('baseball.csv') as file:
    player_reader = csv.reader(file)
    next(player_reader)  # skip the first row
    for row in player_reader:
        height.append(int(row[3]))  # height is in column index 3
        weight.append(int(row[4]))  # weight is in column index 4


In [None]:
len(height)

In [None]:
len(weight)

In [None]:
np_ht = np.array(height)
np_wt = np.array(weight)

In [None]:
bmi = np_wt / (np_ht**2) * 703

In [None]:
bmi

### Lightweight players
Let's look at the players with a smaller build, taken here as a BMI below 21.

In [None]:
light = bmi < 21

In [None]:
light

In [None]:
print(bmi[light])

## Multi-dimensional Arrays

In [None]:
h = [67, 70, 65, 63, 72]
w = [160, 150, 165, 120, 205]

In [None]:
np_ht = np.array(h)
np_wt = np.array(w)

In [None]:
type(np_ht)

In [None]:
type(np_wt)

In [None]:
np_2d = np.array([ [67, 70, 65, 63, 72], [160, 150, 165, 120, 205] ])

In [None]:
np_2d

In [None]:
np_2d.shape

The shape attribute tells us that the np_2d array has 2 rows and 5 columns.
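
Two related attributes can help when reading a shape. The sketch below rebuilds the same array and prints `shape` next to `ndim` (number of dimensions) and `size` (total number of elements):

```python
import numpy as np

np_2d = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, 205]])

print(np_2d.shape)  # (2, 5): 2 rows, 5 columns
print(np_2d.ndim)   # 2
print(np_2d.size)   # 10
```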

In [None]:
np_2d = np.array([ [67, 70, 65, 63, 72], [160, 150, 165, 120, "205"] ])

In [None]:
np_2d

Changing one item to a string causes NumPy to convert every item in the array to a string.
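
If an array ends up as strings by accident, one possible fix (not part of the original walkthrough) is to convert it back with `astype`. This sketch assumes every string holds a valid integer; `np_text` and `np_restored` are illustrative names.

```python
import numpy as np

np_text = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, "205"]])
print(np_text.dtype.kind)      # 'U': every item is now a string

# astype returns a new array; it raises ValueError if a string is not a number.
np_restored = np_text.astype(int)
print(np_restored.dtype.kind)  # 'i': integers again
print(np_restored[1, 4] + 1)   # arithmetic works again: 206
```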

In [None]:
np_2d = np.array([ [67, 70, 65, 63, 72], [160, 150, 165, 120, 205] ])

In [None]:
np_2d

### Subsetting a 2-D array

In [None]:
np_2d[0]

In [None]:
np_2d[0][2]

In [None]:
np_2d[0, 2]
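
Both `np_2d[0][2]` and `np_2d[0, 2]` select row 0, column 2. The comma form also accepts slices, which makes it possible to pull out a whole column. A short sketch (the names `second_person` and `weights_row` are illustrative):

```python
import numpy as np

np_2d = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, 205]])

print(np_2d[0][2])   # 65: take row 0, then item 2 of that row
print(np_2d[0, 2])   # 65: same element in a single indexing step

second_person = np_2d[:, 1]  # every row, column 1
weights_row = np_2d[1, :]    # row 1, every column
print(second_person)         # [ 70 150]
print(weights_row)           # [160 150 165 120 205]
```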

### 2-D array using the baseball data

In [None]:
np_baseball = np.array([height, weight])

In [None]:
np_baseball.shape

In [None]:
bmi = np_baseball[1] / np_baseball[0]**2 * 703

In [None]:
bmi

## Using NumPy for basic statistics

In [None]:
np.mean(np_baseball[0])

In [None]:
np.mean(np_baseball[1])

In [None]:
np.corrcoef(np_baseball[0], np_baseball[1])

In [None]:
np.std(np_baseball[0])

In [None]:
np.std(np_baseball[1])
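
Since baseball.csv may not be at hand, here is a sketch of the same calls on the five-person sample from earlier. Note that `np.std` computes the population standard deviation by default (`ddof=0`), and `np.corrcoef` returns a 2×2 matrix whose off-diagonal entries hold the correlation between the two inputs.

```python
import numpy as np

np_sample = np.array([[67, 70, 65, 63, 72], [160, 150, 165, 120, 205]])
heights, weights = np_sample[0], np_sample[1]

print(np.mean(heights))          # average height
print(np.std(heights))           # population standard deviation (ddof=0)
print(np.std(heights, ddof=1))   # sample standard deviation

corr = np.corrcoef(heights, weights)
print(corr.shape)                # (2, 2)
print(corr[0, 1])                # correlation between height and weight
```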