# NumPy

**NumPy** stands for Numerical Python

* Used for scientific computing.

* Primary data structure: the *n-dimensional array* (or ndarray)

NumPy is used for:

* Fast (vectorized) array operations for data processing
* Efficient descriptive statistics (mean, median, standard deviation, etc.)
* Manipulations for merging multiple data sets

You need an import statement to use NumPy:

In [1]:
import numpy as np

## ndarrays

Unlike vanilla Python lists, ndarrays are **homogeneous** (only one type of data is allowed in each ndarray).
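
To see what "one type of data" means in practice, here is a minimal sketch (the variable names are just for illustration). When a list that mixes integers and floats is converted to an ndarray, NumPy picks a single dtype that can hold every element:

```python
import numpy as np

mixed_list = [1, 2.5, 3]      # a Python list can mix ints and floats
arr = np.array(mixed_list)    # NumPy converts every element to one common dtype

print(arr)                    # all elements are now floats
print(arr.dtype)              # float64
```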

We will use NumPy's generating functions to generate some ndarrays to play with:

In [2]:
# np.arange
arr1d = np.arange(10)
arr1d

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]:
# np.ones, np.zeros
np.ones((5,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [6]:
# np.random.rand, np.random.randn
arr2d = np.random.rand(3,3)
arr2d

array([[0.65507384, 0.4043348 , 0.73669794],
       [0.97132408, 0.25470983, 0.55360637],
       [0.25173892, 0.54591791, 0.91453562]])
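
The comments above also mention `np.zeros` and `np.random.randn`, which are not demonstrated. As a small sketch (the shapes and variable names here are arbitrary choices): `np.zeros` works like `np.ones` but fills with 0.0, and `np.random.randn` draws from a standard normal distribution, whereas `np.random.rand` draws uniformly from [0, 1).

```python
import numpy as np

zeros = np.zeros((2, 3))          # 2x3 array filled with 0.0
normal = np.random.randn(2, 3)    # 2x3 samples from a standard normal distribution

print(zeros)
print(normal)                     # values differ on every run; some may be negative
```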

**Getting ndarray metadata**

In [7]:
print(arr2d.ndim) # number of dimensions
print(arr2d.shape) # shape of ndarray
print(arr2d.dtype) # data type

2
(3, 3)
float64


**Operations with ndarrays**

* Scalar operations (one ndarray interacting with a scalar value)

* Unary operations (just acting on one ndarray and modifying all values)
 
* Binary operations (two ndarrays interacting with each other)

* Aggregation operators (shrinking one ndarray into one scalar value, e.g. mean or median)

These operations sometimes follow very different rules from vanilla Python lists.

Again, these operations are **fast**!
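
To illustrate how the rules differ from lists, here is a short sketch comparing the same operators on a plain list and on an ndarray (the names `py_list` and `nd` are just for illustration):

```python
import numpy as np

py_list = [1, 2, 3]
nd = np.array([1, 2, 3])

print(py_list * 2)          # list repetition: [1, 2, 3, 1, 2, 3]
print(nd * 2)               # element-wise multiplication: [2 4 6]

print(py_list + py_list)    # list concatenation: [1, 2, 3, 1, 2, 3]
print(nd + nd)              # element-wise addition: [2 4 6]
```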

In [17]:
# Scalar Operations

print("arr2d from before...")
print(arr2d)

print("scalar addition (+1)")
print(arr2d + 1)

print("scalar subtraction (-2.2)")
print(arr2d - 2.2)

print("scalar multiplication (x5)")
print(arr2d * 5)

print("scalar division (/2)")
print(arr2d / 2)

arr2d from before...
[[0.65507384 0.4043348  0.73669794]
 [0.97132408 0.25470983 0.55360637]
 [0.25173892 0.54591791 0.91453562]]
scalar addition (+1)
[[1.65507384 1.4043348  1.73669794]
 [1.97132408 1.25470983 1.55360637]
 [1.25173892 1.54591791 1.91453562]]
scalar subtraction (-2.2)
[[-1.54492616 -1.7956652  -1.46330206]
 [-1.22867592 -1.94529017 -1.64639363]
 [-1.94826108 -1.65408209 -1.28546438]]
scalar multiplication (x5)
[[3.2753692  2.02167399 3.68348971]
 [4.85662038 1.27354915 2.76803186]
 [1.25869459 2.72958957 4.57267812]]
scalar division (/2)
[[0.32753692 0.2021674  0.36834897]
 [0.48566204 0.12735491 0.27680319]
 [0.12586946 0.27295896 0.45726781]]


In [14]:
# Another operation with scalars

# arr2d from before
print(arr2d)

# Foreshadowing masking...
print(arr2d > 0.5)

[[0.65507384 0.4043348  0.73669794]
 [0.97132408 0.25470983 0.55360637]
 [0.25173892 0.54591791 0.91453562]]
[[ True False  True]
 [ True False  True]
 [False  True  True]]
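
As a preview of where masking leads, here is a minimal sketch on a small hand-written array (not `arr2d`): the boolean ndarray produced by a comparison can be used to select the matching entries.

```python
import numpy as np

values = np.array([0.2, 0.7, 0.4, 0.9])
mask = values > 0.5       # boolean ndarray: [False  True False  True]

print(values[mask])       # keeps only entries where mask is True: [0.7 0.9]
print(mask.sum())         # True counts as 1, so this counts the matches: 2
```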


In [21]:
# Unary Operations
# (Google online for more)

print("arr2d from before...")
print(arr2d)

print("Apply sqrt():")
print(np.sqrt(arr2d))

arr2d from before...
[[0.65507384 0.4043348  0.73669794]
 [0.97132408 0.25470983 0.55360637]
 [0.25173892 0.54591791 0.91453562]]
Apply sqrt():
[[0.80936632 0.63587326 0.8583111 ]
 [0.98555775 0.50468785 0.74404729]
 [0.5017359  0.73886258 0.95631356]]
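
`np.sqrt` is only one of many unary functions. A few others, sketched on a small hand-written array (the choice of functions here is just a sample):

```python
import numpy as np

arr = np.array([-1.5, 0.0, 2.25])

print(np.abs(arr))       # absolute value of each element
print(np.square(arr))    # each element multiplied by itself
print(np.round(arr))     # each element rounded to the nearest integer (ties go to the even value)
```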


In [23]:
# Binary Operations
# You can add, subtract, multiply, and divide 2 ndarrays element by element
# as long as they have the same shape (or shapes that NumPy can broadcast together)

print("arr2d from before...")
print(arr2d)

print("Squaring an ndarray with multiplication binary operator:")
print(arr2d * arr2d)

arr2d from before...
[[0.65507384 0.4043348  0.73669794]
 [0.97132408 0.25470983 0.55360637]
 [0.25173892 0.54591791 0.91453562]]
Squaring an ndarray with multiplication binary operator:
[[0.42912174 0.16348663 0.54272386]
 [0.94347046 0.0648771  0.30648001]
 [0.06337248 0.29802637 0.83637541]]
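
Two points are worth a quick sketch here, using small integer arrays chosen for illustration: `*` is element-wise (the matrix product uses `@`), and shapes do not have to match exactly when NumPy can broadcast the smaller array across the larger one.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
row = np.array([10, 20])    # shape (2,), broadcast across each row of a

print(a * a)      # element-wise product: [[1, 4], [9, 16]]
print(a @ a)      # matrix product: [[7, 10], [15, 22]]
print(a + row)    # broadcasting: [[11, 22], [13, 24]]
```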


In [27]:
# Aggregation Operators

print("arr2d from before...")
print(arr2d)

print("Mean of entire dataset")
print(np.mean(arr2d))

print("Mean of each column")
print(np.mean(arr2d, axis=0))

print("Mean of each row")
print(np.mean(arr2d, axis=1))

# It's not really clear what row vs column is when 
# we view the ndarray in the 2D list print format.
# This will be a lot more clear with pandas DataFrames!

arr2d from before...
[[0.65507384 0.4043348  0.73669794]
 [0.97132408 0.25470983 0.55360637]
 [0.25173892 0.54591791 0.91453562]]
Mean of entire dataset
0.5875488123873033
Mean of each column
[0.62604561 0.40165418 0.73494665]
Mean of each row
[0.59870219 0.59321343 0.57073082]
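
The `axis` argument is easier to follow on a tiny array whose means can be checked by hand. A minimal sketch (the array `grid` is made up for illustration): `axis=0` collapses the rows and gives one value per column, while `axis=1` collapses the columns and gives one value per row.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])      # 2 rows, 3 columns

print(np.mean(grid))            # mean of all six values: 3.5
print(np.mean(grid, axis=0))    # one mean per column: [2.5 3.5 4.5]
print(np.mean(grid, axis=1))    # one mean per row: [2. 5.]
```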


## Closing Remark on NumPy: GOOGLE IS YOUR BEST FRIEND!

It's not possible to go through all of the functionality offered by NumPy, and it's not really the best way to learn this stuff either.

The best way to learn is to be given some data and some questions to answer about it, and then to figure things out on your own.

You'll get very familiar with documentation pages like this:
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

# pandas

* Built on top of NumPy.
* Primary data structure is the **DataFrame**
* A secondary data structure is the **Series** (another type of vector)
* DataFrames are basically just 2-D tables we are used to (like in Excel)
* THERE IS SO MUCH YOU CAN DO WITH DATAFRAMES. They are the core data structure in a lot of beginner data science projects.


**Just like with NumPy, the best way to learn is to just tackle problems and use Google as you go along. I will demo some useful pandas operations below.**

You need an import statement to use pandas:

In [30]:
import pandas as pd
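
As a small illustration of the DataFrame and Series structures described above, here is a sketch built from made-up toy data (the column names and values are hypothetical, not taken from any dataset):

```python
import pandas as pd

# Hypothetical toy data: each dict key becomes a column
df = pd.DataFrame({
    "player": ["A", "B", "C"],
    "points": [20.5, 15.9, 16.3],
})

points = df["points"]             # selecting one column gives a Series
print(type(points).__name__)      # Series
print(points.mean())              # Series support NumPy-style aggregation
print(points.to_numpy())          # the underlying values as a NumPy ndarray
```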

**We can finally start working with wild data!!!**

In [48]:
# importing data from csv files is super straightforward and useful...

nba = pd.read_csv("NBA Reg Season Player Avgs with Win Pct 2000-2019.csv")

display(nba)

# (To load an Excel sheet into a pandas DataFrame, one option is to save the sheet as a csv file
# and repeat the steps above.)

|  | Unnamed: 0 | Year | Player | Tm | start_pct | MP | PTS | TRB | AST | FGA | ... | 2PA | 2P% | FTA | FT% | STL | BLK | TOV | PF | start_pct.1 | win_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2001 | Shareef Abdur-Rahim | VAN | 1.000000 | 40.0 | 20.5 | 9.1 | 3.1 | 15.8 | ... | 15.0 | 0.487 | 6.6 | 0.834 | 1.1 | 1.0 | 2.9 | 2.9 | 1.000000 | 0.280488 |
| 1 | 1 | 2001 | Mike Bibby | VAN | 1.000000 | 38.9 | 15.9 | 3.7 | 8.4 | 14.1 | ... | 10.6 | 0.478 | 2.3 | 0.761 | 1.3 | 0.1 | 3.0 | 1.8 | 1.000000 | 0.280488 |
| 2 | 2 | 2001 | Michael Dickerson | VAN | 0.985714 | 37.4 | 16.3 | 3.3 | 3.3 | 14.6 | ... | 11.3 | 0.429 | 3.9 | 0.763 | 0.9 | 0.4 | 2.3 | 3.0 | 0.985714 | 0.280488 |
| 3 | 3 | 2001 | Othella Harrington | VAN | 0.909091 | 28.8 | 10.9 | 6.6 | 0.8 | 8.8 | ... | 8.7 | 0.470 | 3.5 | 0.779 | 0.4 | 0.6 | 2.4 | 3.1 | 0.909091 | 0.280488 |
| 4 | 4 | 2001 | Bryant Reeves | VAN | 0.640000 | 24.4 | 8.3 | 6.0 | 1.1 | 7.4 | ... | 7.3 | 0.462 | 1.9 | 0.796 | 0.6 | 0.7 | 1.2 | 3.2 | 0.640000 | 0.280488 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3505 | 3505 | 2019 | Reggie Bullock | DET | 1.000000 | 30.8 | 12.1 | 2.8 | 2.5 | 10.0 | ... | 3.3 | 0.463 | 1.5 | 0.875 | 0.5 | 0.1 | 1.2 | 1.8 | 1.000000 | 0.500000 |
| 3506 | 3506 | 2019 | Andre Drummond | DET | 1.000000 | 33.5 | 17.3 | 15.6 | 1.4 | 13.3 | ... | 12.8 | 0.548 | 5.2 | 0.590 | 1.7 | 1.7 | 2.2 | 3.4 | 1.000000 | 0.500000 |
| 3507 | 3507 | 2019 | Wayne Ellington | DET | 0.928571 | 27.3 | 12.0 | 2.1 | 1.5 | 9.8 | ... | 2.0 | 0.607 | 1.2 | 0.758 | 1.1 | 0.1 | 0.9 | 1.9 | 0.928571 | 0.500000 |
| 3508 | 3508 | 2019 | Blake Griffin | DET | 1.000000 | 35.0 | 24.5 | 7.5 | 5.4 | 17.9 | ... | 10.9 | 0.525 | 7.3 | 0.753 | 0.7 | 0.4 | 3.4 | 2.7 | 1.000000 | 0.500000 |


In [49]:
# Get column names
print(nba.columns)

Index(['Unnamed: 0', 'Year', 'Player', 'Tm', 'start_pct', 'MP', 'PTS', 'TRB',
       'AST', 'FGA', 'FG%', '3PA', '3P%', '2PA', '2P%', 'FTA', 'FT%', 'STL',
       'BLK', 'TOV', 'PF', 'start_pct.1', 'win_pct'],
      dtype='object')
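
A column named `Unnamed: 0` typically shows up when the first column of a CSV file has an empty header, which often happens when a DataFrame was saved together with its index. As a hedged sketch, using a tiny made-up CSV string in place of the real file, passing `index_col=0` to `pd.read_csv` tells pandas to use that first column as the index instead:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk; note the empty first header
csv_text = ",Year,Player\n0,2001,Player A\n1,2001,Player B\n"

plain = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())      # ['Unnamed: 0', 'Year', 'Player']
print(indexed.columns.tolist())    # ['Year', 'Player']
```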


In [50]:
# Chop off the left-most column since it is useless (we already have a dedicated, built-in index in the DataFrame)

# By default, most pandas operations return a modified copy and leave the original DataFrame untouched.
# Passing inplace=True modifies the original DataFrame directly instead.
nba.drop(columns=["Unnamed: 0"], inplace=True)
display(nba)

|  | Year | Player | Tm | start_pct | MP | PTS | TRB | AST | FGA | FG% | ... | 2PA | 2P% | FTA | FT% | STL | BLK | TOV | PF | start_pct.1 | win_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2001 | Shareef Abdur-Rahim | VAN | 1.000000 | 40.0 | 20.5 | 9.1 | 3.1 | 15.8 | 0.472 | ... | 15.0 | 0.487 | 6.6 | 0.834 | 1.1 | 1.0 | 2.9 | 2.9 | 1.000000 | 0.280488 |
| 1 | 2001 | Mike Bibby | VAN | 1.000000 | 38.9 | 15.9 | 3.7 | 8.4 | 14.1 | 0.454 | ... | 10.6 | 0.478 | 2.3 | 0.761 | 1.3 | 0.1 | 3.0 | 1.8 | 1.000000 | 0.280488 |
| 2 | 2001 | Michael Dickerson | VAN | 0.985714 | 37.4 | 16.3 | 3.3 | 3.3 | 14.6 | 0.417 | ... | 11.3 | 0.429 | 3.9 | 0.763 | 0.9 | 0.4 | 2.3 | 3.0 | 0.985714 | 0.280488 |
| 3 | 2001 | Othella Harrington | VAN | 0.909091 | 28.8 | 10.9 | 6.6 | 0.8 | 8.8 | 0.466 | ... | 8.7 | 0.470 | 3.5 | 0.779 | 0.4 | 0.6 | 2.4 | 3.1 | 0.909091 | 0.280488 |
| 4 | 2001 | Bryant Reeves | VAN | 0.640000 | 24.4 | 8.3 | 6.0 | 1.1 | 7.4 | 0.460 | ... | 7.3 | 0.462 | 1.9 | 0.796 | 0.6 | 0.7 | 1.2 | 3.2 | 0.640000 | 0.280488 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3505 | 2019 | Reggie Bullock | DET | 1.000000 | 30.8 | 12.1 | 2.8 | 2.5 | 10.0 | 0.413 | ... | 3.3 | 0.463 | 1.5 | 0.875 | 0.5 | 0.1 | 1.2 | 1.8 | 1.000000 | 0.500000 |
| 3506 | 2019 | Andre Drummond | DET | 1.000000 | 33.5 | 17.3 | 15.6 | 1.4 | 13.3 | 0.533 | ... | 12.8 | 0.548 | 5.2 | 0.590 | 1.7 | 1.7 | 2.2 | 3.4 | 1.000000 | 0.500000 |
| 3507 | 2019 | Wayne Ellington | DET | 0.928571 | 27.3 | 12.0 | 2.1 | 1.5 | 9.8 | 0.421 | ... | 2.0 | 0.607 | 1.2 | 0.758 | 1.1 | 0.1 | 0.9 | 1.9 | 0.928571 | 0.500000 |
| 3508 | 2019 | Blake Griffin | DET | 1.000000 | 35.0 | 24.5 | 7.5 | 5.4 | 17.9 | 0.462 | ... | 10.9 | 0.525 | 7.3 | 0.753 | 0.7 | 0.4 | 3.4 | 2.7 | 1.000000 | 0.500000 |
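
The copy-versus-in-place behavior of `drop` can be checked on a toy DataFrame. A minimal sketch (the column name `junk` and the values are made up for illustration): without `inplace=True`, the original is left alone and the result has to be kept in a variable.

```python
import pandas as pd

df = pd.DataFrame({"junk": [0, 1], "Year": [2001, 2019]})  # hypothetical toy data

trimmed = df.drop(columns=["junk"])    # returns a new DataFrame
print(df.columns.tolist())             # ['junk', 'Year'] (original unchanged)
print(trimmed.columns.tolist())        # ['Year']

df = df.drop(columns=["junk"])         # reassigning is an alternative to inplace=True
print(df.columns.tolist())             # ['Year']
```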


**Hope you are starting to realize why we work with cells in notebooks for data science stuff.**

Cells give us more control in deciding which steps to run. We don't ALWAYS have to run the WHOLE .py file like we did before. \
This helps us find bugs more easily.