# NumPy
When analyzing data, you'll often want to carry out operations over entire collections of values, and you want to do this fast. With regular Python lists, this is a problem: arithmetic operators don't work element by element, so you end up writing loops.
<br>
The solution is to use NumPy, short for Numerical Python. It's a Python package that, among other things, provides an alternative to the regular Python list: the NumPy array. The NumPy array is pretty similar to the list, but has one additional feature: you can perform calculations over entire arrays. It's really easy, and super-fast as well.
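
To make the difference concrete, here is a minimal sketch with a few made-up heights (the names `heights_list` and `heights_arr` are only for illustration and are not part of the MLB data):

```python
import numpy as np

heights_list = [74, 72, 75]            # made-up heights in inches
heights_arr = np.array(heights_list)

# Multiplying a list repeats it; it does not scale the values.
print(heights_list * 2)   # [74, 72, 75, 74, 72, 75]

# Multiplying an array scales every element.
doubled = heights_arr * 2
print(doubled)            # [148 144 150]

# Element-wise math on a list needs an explicit loop or comprehension,
# while the array version is a single expression.
meters_list = [h * 0.0254 for h in heights_list]
meters_arr = heights_arr * 0.0254
```

Both `meters_list` and `meters_arr` hold the same values; the array version simply avoids the loop.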

#### Import Libraries

In [None]:
import numpy as np
import pandas as pd

#### Import Data

In [None]:
# Use pandas to import data from csv
mlb_data = pd.read_csv('../../data/mlb.csv')

In [None]:
# Create lists from columns
mlb_heights_in = mlb_data['Height'].tolist()
mlb_weights_lb = mlb_data['Weight'].tolist()

# Your First NumPy Array
We're going to dive into the world of baseball. Along the way, you'll get comfortable with the basics of `numpy`, a powerful package to do data science.
<br>
The list `mlb_heights_in` was already defined in the import cells above; it holds the heights of the baseball players in inches. The cell below creates a `numpy` array from it.

In [None]:
# Create a numpy array from mlb_heights_in: np_mlb_heights_in
np_mlb_heights_in = np.array(mlb_heights_in)

# Print out the type of np_mlb_heights_in
print(type(np_mlb_heights_in))

np_mlb_heights_in

In [None]:
# Convert np_mlb_heights_in to meters: np_mlb_heights_m
np_mlb_heights_m = np_mlb_heights_in * 0.0254

np_mlb_heights_m

# Baseball players' BMI
The MLB data also includes the players' weights. Again, both are available as regular Python lists: `mlb_heights_in` and `mlb_weights_lb`. `mlb_heights_in` is in inches and `mlb_weights_lb` is in pounds. BMI is weight in kilograms divided by the square of height in meters, so both need to be converted to metric units first.

In [None]:
# Create array from mlb_weights_lb with metric units: np_mlb_weight_kg
np_mlb_weight_kg = np.array(mlb_weights_lb) * 0.453592

# Calculate the BMI: bmi
bmi = np_mlb_weight_kg / (np_mlb_heights_m ** 2)

bmi
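
As a quick sanity check of the formula, here is a small sketch using two hypothetical players (the values below are invented, not taken from the MLB data):

```python
import numpy as np

heights_m = np.array([1.80, 1.90])     # hypothetical heights in meters
weights_kg = np.array([81.0, 90.25])   # hypothetical weights in kilograms

# BMI = weight / height squared, computed element by element
bmi_example = weights_kg / heights_m ** 2
print(bmi_example)   # both values come out at roughly 25
```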

# Lightweight baseball players
To subset both regular Python lists and `numpy` arrays, you can use square brackets: `[]`. For `numpy` specifically, you can also use boolean `numpy` arrays.

In [None]:
# Create the light array
light = bmi < 21

# Print out light
print(light)

In [None]:
# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light])
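
The same pattern on a tiny made-up array may make it easier to see what the boolean array looks like and how it selects values (`bmi_example` and `mask` are illustrative names):

```python
import numpy as np

bmi_example = np.array([20.5, 23.1, 19.8, 25.4])   # made-up BMI values

# The comparison returns one True/False per element
mask = bmi_example < 21
print(mask)                # [ True False  True False]

# Indexing with the boolean array keeps only the True positions
print(bmi_example[mask])   # [20.5 19.8]

# Because True counts as 1, summing the mask counts the matches
print(mask.sum())          # 2
```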

# Subsetting NumPy Arrays
You've seen it with your own eyes: Python lists and `numpy` arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same.
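
A short side-by-side sketch, using made-up weights, shows the same index and slice expressions applied to a list and to an array:

```python
import numpy as np

weights_list = [180, 215, 210, 188, 176]   # made-up weights in pounds
weights_arr = np.array(weights_list)

# Single elements: same syntax, same value
print(weights_list[1], weights_arr[1])     # 215 215

# Slices: the stop index is excluded in both cases
print(weights_list[1:4])                   # [215, 210, 188]
print(weights_arr[1:4])                    # [215 210 188]

# Negative indexes count from the end in both cases
print(weights_list[-1], weights_arr[-1])   # 176 176
```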

In [None]:
# Create a numpy array from mlb_weights_lb: np_mlb_weights_lb
np_mlb_weights_lb = np.array(mlb_weights_lb)

# Select the weight at index 50
np_mlb_weights_lb[50]

In [None]:
# Subset a sub-array of np_mlb_heights_in: index 100 up to and including index 110
np_mlb_heights_in[100:111]

# 2D NumPy Arrays
The arrays `np_mlb_heights_in` and `np_mlb_weights_lb` are one-dimensional arrays, but it's perfectly possible to create two-dimensional, three-dimensional, heck even seven-dimensional arrays!
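
As a rough illustration, the `ndim` and `shape` attributes report how many dimensions an array has and how long each one is. The arrays below are invented for the example:

```python
import numpy as np

one_d = np.array([74, 72, 75])             # 3 values
two_d = np.array([[74, 180], [72, 215]])   # 2 rows, 2 columns
three_d = np.zeros((2, 3, 4))              # 2 blocks of 3 rows and 4 columns

for arr in (one_d, two_d, three_d):
    print(arr.ndim, arr.shape)
# 1 (3,)
# 2 (2, 2)
# 3 (2, 3, 4)
```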

# Your First 2D NumPy Array
Before working on the actual MLB data, let's try to create a 2D `numpy` array from a small list of lists.

`baseball` is a list of lists. The main list contains 4 elements. Each of these elements is a list containing the height and the weight of one of 4 baseball players, in this order. `baseball` is defined at the top of the cell below.

In [None]:
# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the type of np_baseball
print(type(np_baseball))

# Print out the shape of np_baseball
print(np_baseball.shape)

np_baseball

# Baseball data in 2D form
You have another look at the MLB data and realize that it makes more sense to restructure all this information in a 2D `numpy` array. This array should have 1015 rows, corresponding to the 1015 baseball players you have information on, and 2 columns (for height and weight).
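
One way to build such an array from two equal-length lists is `np.column_stack`. The sketch below uses three made-up players and also shows an equivalent construction from a list of rows:

```python
import numpy as np

heights = [74, 72, 75]       # made-up heights in inches
weights = [180, 215, 210]    # made-up weights in pounds

# Each input list becomes one column; each player becomes one row
stacked = np.column_stack((heights, weights))
print(stacked.shape)   # (3, 2)

# Building the rows by hand gives the same array
from_rows = np.array([[h, w] for h, w in zip(heights, weights)])
print(np.array_equal(stacked, from_rows))   # True
```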

In [None]:
# Stack the two lists as columns of a 2D array (one row per player)
mlb_baseball = np.column_stack((mlb_heights_in, mlb_weights_lb))

In [None]:
# Create np_baseball from mlb_baseball (already a 2D array, so np.array just copies it)
np_baseball = np.array(mlb_baseball)
np_baseball

In [None]:
# Print out the shape of np_baseball
print(np_baseball.shape)

# Subsetting 2D NumPy Arrays
If your 2D `numpy` array has a regular structure, i.e. every row has the same number of values, complicated ways of subsetting become very easy.
<br>
With a regular Python list of lists, you would have to chain square brackets and loop to pull out a column. For 2D `numpy` arrays, however, it's pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. A lone `:` selects everything along that dimension.
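
On a small made-up array, the row/column notation can be sketched like this (`grid` is an illustrative name):

```python
import numpy as np

grid = np.array([[74, 180],
                 [72, 215],
                 [75, 210]])   # made-up (height, weight) rows

second_row = grid[1, :]     # whole second row: 72 and 215
first_col = grid[:, 0]      # whole first column: 74, 72, 75
one_value = grid[2, 1]      # third row, second column: 210
first_two = grid[0:2, :]    # first two rows, all columns

print(second_row, first_col, one_value)
print(first_two.shape)      # (2, 2)
```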

In [None]:
# Select the 50th row of np_baseball
np_baseball[49,:]

In [None]:
# Select the entire second column of np_baseball: np_weight_lb
np_weight_lb = np_baseball[:,1]
np_weight_lb

In [None]:
# Print out height of 124th player
np_baseball[123,0]

# 2D Arithmetic
Remember how you calculated the Body Mass Index (BMI) for all baseball players? `numpy` was able to perform all calculations element-wise (i.e. element by element). For 2D numpy arrays this isn't any different! You can combine matrices with single numbers, with vectors, and with other matrices.

In [None]:
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat

In [None]:
np_mat * 2

In [None]:
np_mat + np.array([10, 10])

In [None]:
np_mat + np_mat
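
Combining a matrix with a vector only works when the shapes line up. The following sketch, with invented vectors, shows a row vector, a column vector, and a vector whose length matches neither dimension in the required way:

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4],
                [5, 6]])                      # shape (3, 2)

row = np.array([10, 20])                      # shape (2,): added to every row
col = np.array([[100], [200], [300]])         # shape (3, 1): added across each row

print(mat + row)   # [[11 22] [13 24] [15 26]]
print(mat + col)   # [[101 102] [203 204] [305 306]]

# A length-3 vector does not match the last dimension (2), so this fails
broadcast_failed = False
try:
    mat + np.array([1, 2, 3])
except ValueError as err:
    broadcast_failed = True
    print("Shapes do not line up:", err)
```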

# NumPy: Basic Statistics
A typical first step in analyzing your data is getting to know it in the first place. For the NumPy arrays from before, this is pretty easy, because it isn't a lot of data. However, as a data scientist, you'll be crunching thousands, if not millions or billions, of numbers.

# Average versus median
You now know how to use `numpy` functions to get a better feeling for your data. It basically comes down to importing `numpy` and then calling several simple functions on the `numpy` arrays.
<br>
The baseball data is available as a 2D `numpy` array with 3 columns (height, weight, age) and 1015 rows. The name of this `numpy` array is `np_baseball`.
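
Why look at both the mean and the median? A small sketch with made-up heights, one of which is deliberately mistyped, suggests how a single bad value can pull the mean while leaving the median alone:

```python
import numpy as np

heights_in = np.array([72, 73, 74, 75, 76])     # made-up heights in inches
with_error = np.array([72, 73, 74, 75, 760])    # last value mistyped

print(np.mean(heights_in), np.median(heights_in))   # 74.0 74.0
print(np.mean(with_error), np.median(with_error))   # 210.8 74.0
```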

In [None]:
# Create np.array from mlb_data
np_mlb = np.array(mlb_data)

# Create a 2D array with desired columns
np_baseball = np.array(np_mlb[:, 3:6])

np_baseball

In [None]:
# Create np_height_in from np_baseball
np_height_in = np.array(np_baseball[:,0])

# Print out the mean of np_height_in
print(f'mean: {np.mean(np_height_in)}')

# Print out the median of np_height_in
print(f'median: {np.median(np_height_in)}')

In [92]:
# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))

# Print median height
med = np.median(np_baseball[:,0])
print("Median: " + str(med))

# Print out the standard deviation on height
stddev = np.std(np_baseball[:,0])
print("Standard Deviation: " + str(stddev))

# Print out correlation between first and second column (left commented out)
# corr = np.corrcoef(np_baseball[:,0], np_baseball[:,1])
# print("Correlation: " + str(corr))

Average: 73.6896551724138
Median: 74.0
Standard Deviation: 2.3127918810465395
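
The correlation step is commented out in the cell above. As a hedged sketch of what `np.corrcoef` returns, here is a toy example with invented, perfectly linear values. The `.astype(float)` step reflects an assumption: an array built from a mixed-type DataFrame may end up with `object` dtype, in which case converting to floats first may be needed.

```python
import numpy as np

heights = np.array([70.0, 72.0, 74.0, 76.0])        # made-up heights
weights = np.array([170.0, 185.0, 200.0, 215.0])    # made-up, exactly linear in height

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
corr_matrix = np.corrcoef(heights, weights)
print(corr_matrix.shape)    # (2, 2)
print(corr_matrix[0, 1])    # approximately 1 for perfectly linear data

# Assumption: columns stored as Python objects can be converted to floats first
as_objects = np.array([70, 72, 74, 76], dtype=object)
as_floats = as_objects.astype(float)
print(as_floats.dtype)      # float64
```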


# Blend it all together

In [119]:
fifa = pd.read_csv('../../data/fifa.csv')
fifa.head()

Unnamed: 0,id,name,rating,position,height,foot,rare,pace,shooting,passing,dribbling,defending,heading,diving,handling,kicking,reflexes,speed,positioning
0,1001,Gábor Király,69,GK,191,Right,0,,,,,,,70.0,66.0,63.0,74.0,35.0,66.0
1,100143,Frederik Boi,65,M,184,Right,0,61.0,65.0,63.0,59.0,62.0,62.0,,,,,,
2,100264,Tomasz Szewczuk,57,A,185,Right,0,65.0,54.0,43.0,53.0,55.0,74.0,,,,,,
3,100325,Steeve Joseph-Reinette,63,D,180,Left,0,68.0,38.0,51.0,46.0,64.0,71.0,,,,,,
4,100326,Kamel Chafni,72,M,181,Right,0,75.0,64.0,67.0,72.0,57.0,66.0,,,,,,


In [118]:
# Two lists from the fifa DataFrame
positions = fifa[' position'].tolist()
heights = fifa[' height'].tolist()

# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np.array(positions)
np_heights = np.array(heights)

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == " GK"]
# gk_heights

# Heights of the other players: other_heights
other_heights = np_heights[np_positions != " GK"]

# Print out the median height of goalkeepers.
print("Median height of goalkeepers: " + str(np.median(gk_heights)))

# Print out the median height of other players.
print("Median height of other players: " + str(np.median(other_heights)))

Median height of goalkeepers: 188.0
Median height of other players: 181.0
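
The same idea, filtering one array with a condition on another, can be sketched on a handful of invented players. The leading spaces in `raw_positions` mimic the `" GK"` labels used above; stripping them with `np.char.strip` is an optional step, not something the original cell does.

```python
import numpy as np

raw_positions = np.array([" GK", " M", " A", " GK", " D"])   # made-up labels
heights_cm = np.array([191, 184, 185, 188, 180])              # made-up heights

# Optional: remove surrounding whitespace so comparisons can use "GK"
stripped = np.char.strip(raw_positions)

# The boolean array built from positions selects from heights
is_gk = stripped == "GK"
gk = heights_cm[is_gk]
others = heights_cm[~is_gk]     # ~ flips True and False

print(np.median(gk))       # 189.5
print(np.median(others))   # 184.0
```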
