# Lesson 5
### Written by Adithya Solai

_Copyright © 2021 Adithya Solai. All rights are reserved.
You cannot use, modify, or redistribute this code without 
explicit permission from Adithya Solai._

# NumPy

**NumPy** stands for Numerical Python

* Used for scientific computing.
* Primary data structure: *the n-dimensional array* (or `ndarray`)

NumPy is used for:

* Fast (vectorized) array operations for data processing
* Efficient descriptive statistics (mean, median, standard deviation, etc.)
* Manipulations for merging multiple data sets

You need an import statement to use NumPy:

In [23]:
import numpy as np

## ndarrays

Unlike vanilla Python lists, ndarrays are **homogeneous** (only one type of data is allowed in each ndarray).
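Here is a small sketch of what that means in practice, using made-up values: when a list mixes integers and floats, NumPy converts every element to one common type.

```python
import numpy as np

# A vanilla Python list can hold values of different types.
mixed_list = [1, 2.5, 3]
print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'int']

# An ndarray stores a single dtype, so the ints are upcast to floats.
arr = np.array(mixed_list)
print(arr)
print(arr.dtype)  # float64
```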

We will use NumPy's generating functions to generate some ndarrays to play with:

In [24]:
# np.arange
arr1d = np.arange(10)
arr1d

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [25]:
# np.ones, np.zeros
np.ones((5,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [26]:
# np.random.rand, np.random.randn
arr2d = np.random.rand(3,3)
arr2d

array([[0.86264944, 0.23328849, 0.46849595],
       [0.38584548, 0.17964964, 0.04744465],
       [0.88314166, 0.86490599, 0.61361069]])
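The cell comments above mention `np.zeros` and `np.random.randn` without running them. A minimal sketch of those two, plus `np.arange` with a start, stop, and step (the shapes and ranges here are arbitrary picks for illustration):

```python
import numpy as np

zeros = np.zeros((2, 3))        # 2 rows, 3 columns, every value 0.0
evens = np.arange(2, 10, 2)     # start at 2, stop before 10, step by 2
normal = np.random.randn(2, 2)  # samples from a standard normal distribution

print(zeros)
print(evens)         # the values 2, 4, 6, 8
print(normal.shape)  # (2, 2)
```

Unlike `np.random.rand`, which draws from the range [0, 1), `np.random.randn` can return negative values.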

**Getting ndarray metadata**

In [27]:
print(arr2d.ndim) # number of dimensions
print(arr2d.shape) # shape of ndarray
print(arr2d.dtype) # data type

2
(3, 3)
float64


**Operations with ndarrays**

* Scalar operations (one ndarray interacting with a scalar value)

* Unary operations (just acting on one ndarray and modifying all values)
 
* Binary operations (two ndarrays interacting with each other)

* Aggregation operators (reducing one ndarray to a single scalar value, e.g. mean or median)

(These operations sometimes follow very different rules from vanilla Python lists.)

Again, these operations are **fast** !
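To see how the rules differ from vanilla Python lists, here is a short comparison with made-up values. The same `*` and `+` operators do very different things on a list and on an ndarray:

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array([1, 2, 3])

# * repeats a list, but multiplies every element of an ndarray.
print(py_list * 2)  # [1, 2, 3, 1, 2, 3]
print(arr * 2)      # the values 2, 4, 6

# + concatenates lists, but adds ndarrays element by element.
print(py_list + py_list)  # [1, 2, 3, 1, 2, 3]
print(arr + arr)          # the values 2, 4, 6
```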

In [28]:
# Scalar Operations

print("arr2d from before...")
print(arr2d)

print("scalar addition (+1)")
print(arr2d + 1)

print("scalar subtraction (-2.2)")
print(arr2d - 2.2)

print("scalar multiplication (x5)")
print(arr2d * 5)

print("scalar division (/2)")
print(arr2d / 2)

arr2d from before...
[[0.86264944 0.23328849 0.46849595]
 [0.38584548 0.17964964 0.04744465]
 [0.88314166 0.86490599 0.61361069]]
scalar addition (+1)
[[1.86264944 1.23328849 1.46849595]
 [1.38584548 1.17964964 1.04744465]
 [1.88314166 1.86490599 1.61361069]]
scalar subtraction (-2.2)
[[-1.33735056 -1.96671151 -1.73150405]
 [-1.81415452 -2.02035036 -2.15255535]
 [-1.31685834 -1.33509401 -1.58638931]]
scalar multiplication (x5)
[[4.31324721 1.16644245 2.34247977]
 [1.9292274  0.89824821 0.23722324]
 [4.41570831 4.32452994 3.06805346]]
scalar division (/2)
[[0.43132472 0.11664424 0.23424798]
 [0.19292274 0.08982482 0.02372232]
 [0.44157083 0.43245299 0.30680535]]


**Another operation with scalars**

In [29]:
# arr2d from before
print(arr2d)

# Foreshadowing masking...
print(arr2d > 0.5)

[[0.86264944 0.23328849 0.46849595]
 [0.38584548 0.17964964 0.04744465]
 [0.88314166 0.86490599 0.61361069]]
[[ True False False]
 [False False False]
 [ True  True  True]]
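The boolean ndarray from a comparison can be used as a mask to pull out only the matching values. A minimal sketch with a small made-up array:

```python
import numpy as np

values = np.array([[0.9, 0.2, 0.5],
                   [0.4, 0.1, 0.7]])

mask = values > 0.5   # boolean ndarray with the same shape as values
print(mask)
print(values[mask])   # 1-D array of the values where mask is True: 0.9 and 0.7
print(mask.sum())     # True counts as 1, so this counts the matches: 2
```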


**Unary Operations**

(Search online for more examples besides just `sqrt()`.)

In [30]:
print("arr2d from before...")
print(arr2d)

print("Apply sqrt():")
print(np.sqrt(arr2d))

arr2d from before...
[[0.86264944 0.23328849 0.46849595]
 [0.38584548 0.17964964 0.04744465]
 [0.88314166 0.86490599 0.61361069]]
Apply sqrt():
[[0.92878923 0.48299947 0.68446764]
 [0.62116462 0.42385097 0.21781793]
 [0.93975617 0.93000322 0.78333307]]
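A few other unary functions work the same way as `sqrt()`. This is just a small sample with made-up input values, not a complete list:

```python
import numpy as np

arr = np.array([-4.0, 0.0, 9.0])

print(np.abs(arr))     # absolute values: 4, 0, 9
print(np.square(arr))  # each element squared: 16, 0, 81
print(np.sign(arr))    # sign of each element: -1, 0, 1
print(np.exp(arr))     # e raised to each element
```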


**Binary Operations**

You can add, subtract, multiply, and divide 2 ndarrays element by element as long as they have the same shape (the same m by n dimensions) or shapes that NumPy can broadcast together.

In [31]:
print("arr2d from before...")
print(arr2d)

print("Squaring an ndarray with multiplication binary operator:")
print(arr2d * arr2d)

arr2d from before...
[[0.86264944 0.23328849 0.46849595]
 [0.38584548 0.17964964 0.04744465]
 [0.88314166 0.86490599 0.61361069]]
Squaring an ndarray with multiplication binary operator:
[[0.74416406 0.05442352 0.21948846]
 [0.14887673 0.03227399 0.00225099]
 [0.77993919 0.74806237 0.37651808]]
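Note that `*` between two ndarrays multiplies matching elements; it is not matrix multiplication. The sketch below uses small made-up integer arrays to contrast `*` with the `@` matrix-product operator. It also shows a simple case of broadcasting, where a 1-D row is combined with every row of a 2-D array:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

print(a * b)  # element-wise: [[10, 40], [90, 160]]
print(a @ b)  # matrix product: [[70, 100], [150, 220]]

row = np.array([100, 200])
print(a + row)  # row is added to each row of a: [[101, 202], [103, 204]]
```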


**Aggregation Operators**

In [32]:
print("arr2d from before...")
print(arr2d)

print("Mean of entire dataset")
print(np.mean(arr2d))

print("Mean of each column")
print(np.mean(arr2d, axis=0))

print("Mean of each row")
print(np.mean(arr2d, axis=1))

# It's not really clear what row vs column is when 
# we view the ndarray in the 2D list print format.
# This will be a lot more clear with pandas DataFrames!

arr2d from before...
[[0.86264944 0.23328849 0.46849595]
 [0.38584548 0.17964964 0.04744465]
 [0.88314166 0.86490599 0.61361069]]
Mean of entire dataset
0.5043368887572185
Mean of each column
[0.71054553 0.42594804 0.3765171 ]
Mean of each row
[0.52147796 0.20431326 0.78721945]
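The `axis` argument is easier to follow on a tiny array where the means can be checked by hand. A minimal sketch with made-up values:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])  # 2 rows, 3 columns

print(np.mean(grid))          # 3.5, the mean of all six values
print(np.mean(grid, axis=0))  # one mean per column: 2.5, 3.5, 4.5
print(np.mean(grid, axis=1))  # one mean per row: 2.0, 5.0
```

One way to remember it: `axis=0` collapses the rows (leaving one value per column), and `axis=1` collapses the columns (leaving one value per row).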


## Closing Remark on NumPy: GOOGLE IS YOUR BEST FRIEND!

It's not possible to go through all of the functionality offered by NumPy, and it's not really the best way to learn this stuff either.

The best way to learn is to be given some data and some questions to answer about the data, and then to figure things out on your own.

You'll get very familiar with documentation pages like this:
https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

# pandas

* Built on top of NumPy.
* Primary data structure is the **DataFrame**
* A secondary data structure is the **Series** (a one-dimensional labeled array; each column of a DataFrame is a Series)
* DataFrames are basically just 2-D tables we are used to (like in Excel)
* THERE IS SO MUCH YOU CAN DO WITH DATAFRAMES. They are the core data structure in a lot of beginner data science projects.


**Just like with NumPy, the best way to learn is to just tackle problems and use Google as you go along. I will demo some useful pandas operations below.**

You need an import statement to use pandas:

In [33]:
import pandas as pd

**We can finally start working with wild data!!!**

Importing data from csv files is super straightforward and useful...

In [34]:
nba = pd.read_csv("NBA Reg Season Player Avgs with Win Pct 2000-2019.csv")

display(nba)

# (if you want to load an Excel file into a pandas DataFrame, just download the Excel 
# sheet as a csv file, and repeat what we did above)

,Unnamed: 0,Year,Player,Tm,start_pct,MP,PTS,TRB,AST,FGA,...,2PA,2P%,FTA,FT%,STL,BLK,TOV,PF,start_pct.1,win_pct
0,0,2001,Shareef Abdur-Rahim,VAN,1.000000,40.0,20.5,9.1,3.1,15.8,...,15.0,0.487,6.6,0.834,1.1,1.0,2.9,2.9,1.000000,0.280488
1,1,2001,Mike Bibby,VAN,1.000000,38.9,15.9,3.7,8.4,14.1,...,10.6,0.478,2.3,0.761,1.3,0.1,3.0,1.8,1.000000,0.280488
2,2,2001,Michael Dickerson,VAN,0.985714,37.4,16.3,3.3,3.3,14.6,...,11.3,0.429,3.9,0.763,0.9,0.4,2.3,3.0,0.985714,0.280488
3,3,2001,Othella Harrington,VAN,0.909091,28.8,10.9,6.6,0.8,8.8,...,8.7,0.470,3.5,0.779,0.4,0.6,2.4,3.1,0.909091,0.280488
4,4,2001,Bryant Reeves,VAN,0.640000,24.4,8.3,6.0,1.1,7.4,...,7.3,0.462,1.9,0.796,0.6,0.7,1.2,3.2,0.640000,0.280488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3505,3505,2019,Reggie Bullock,DET,1.000000,30.8,12.1,2.8,2.5,10.0,...,3.3,0.463,1.5,0.875,0.5,0.1,1.2,1.8,1.000000,0.500000
3506,3506,2019,Andre Drummond,DET,1.000000,33.5,17.3,15.6,1.4,13.3,...,12.8,0.548,5.2,0.590,1.7,1.7,2.2,3.4,1.000000,0.500000
3507,3507,2019,Wayne Ellington,DET,0.928571,27.3,12.0,2.1,1.5,9.8,...,2.0,0.607,1.2,0.758,1.1,0.1,0.9,1.9,0.928571,0.500000
3508,3508,2019,Blake Griffin,DET,1.000000,35.0,24.5,7.5,5.4,17.9,...,10.9,0.525,7.3,0.753,0.7,0.4,3.4,2.7,1.000000,0.500000
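`pd.read_csv` also accepts a file-like object, which makes it easy to experiment without a file on disk. The sketch below uses hypothetical CSV text (the player names and numbers are placeholders, not from the NBA file). It shows where a column like `Unnamed: 0` typically comes from: the first header field is empty, which happens when a DataFrame is saved together with its index.

```python
import io
import pandas as pd

# Hypothetical CSV text; note the empty first field in the header row.
csv_text = """,Player,PTS
0,Player A,20.5
1,Player B,15.9
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # ['Unnamed: 0', 'Player', 'PTS']
print(df.shape)             # (2, 3)

# Passing index_col=0 tells pandas to use that first column as the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())  # ['Player', 'PTS']
```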


**Get column names**

In [35]:
print(nba.columns)

Index(['Unnamed: 0', 'Year', 'Player', 'Tm', 'start_pct', 'MP', 'PTS', 'TRB',
       'AST', 'FGA', 'FG%', '3PA', '3P%', '2PA', '2P%', 'FTA', 'FT%', 'STL',
       'BLK', 'TOV', 'PF', 'start_pct.1', 'win_pct'],
      dtype='object')


**Chop off the left-most column since it is useless (we already have a dedicated, built-in index in the DataFrame)**

In [36]:
# Most pandas operations return a modified copy and leave the original DataFrame untouched 
# unless you specify inplace=True (or assign the result back to the same variable).
nba.drop(columns=["Unnamed: 0"], inplace=True)
display(nba)

,Year,Player,Tm,start_pct,MP,PTS,TRB,AST,FGA,FG%,...,2PA,2P%,FTA,FT%,STL,BLK,TOV,PF,start_pct.1,win_pct
0,2001,Shareef Abdur-Rahim,VAN,1.000000,40.0,20.5,9.1,3.1,15.8,0.472,...,15.0,0.487,6.6,0.834,1.1,1.0,2.9,2.9,1.000000,0.280488
1,2001,Mike Bibby,VAN,1.000000,38.9,15.9,3.7,8.4,14.1,0.454,...,10.6,0.478,2.3,0.761,1.3,0.1,3.0,1.8,1.000000,0.280488
2,2001,Michael Dickerson,VAN,0.985714,37.4,16.3,3.3,3.3,14.6,0.417,...,11.3,0.429,3.9,0.763,0.9,0.4,2.3,3.0,0.985714,0.280488
3,2001,Othella Harrington,VAN,0.909091,28.8,10.9,6.6,0.8,8.8,0.466,...,8.7,0.470,3.5,0.779,0.4,0.6,2.4,3.1,0.909091,0.280488
4,2001,Bryant Reeves,VAN,0.640000,24.4,8.3,6.0,1.1,7.4,0.460,...,7.3,0.462,1.9,0.796,0.6,0.7,1.2,3.2,0.640000,0.280488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3505,2019,Reggie Bullock,DET,1.000000,30.8,12.1,2.8,2.5,10.0,0.413,...,3.3,0.463,1.5,0.875,0.5,0.1,1.2,1.8,1.000000,0.500000
3506,2019,Andre Drummond,DET,1.000000,33.5,17.3,15.6,1.4,13.3,0.533,...,12.8,0.548,5.2,0.590,1.7,1.7,2.2,3.4,1.000000,0.500000
3507,2019,Wayne Ellington,DET,0.928571,27.3,12.0,2.1,1.5,9.8,0.421,...,2.0,0.607,1.2,0.758,1.1,0.1,0.9,1.9,0.928571,0.500000
3508,2019,Blake Griffin,DET,1.000000,35.0,24.5,7.5,5.4,17.9,0.462,...,10.9,0.525,7.3,0.753,0.7,0.4,3.4,2.7,1.000000,0.500000
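To make the copy behavior of `drop()` concrete, here is a minimal sketch on a tiny made-up DataFrame. Without `inplace=True`, the original is left alone, and assigning the result back to the same variable is a common alternative:

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": [0, 1], "PTS": [20.5, 15.9]})

# Without inplace=True, drop() returns a new DataFrame and leaves df alone.
trimmed = df.drop(columns=["Unnamed: 0"])
print(df.columns.tolist())       # ['Unnamed: 0', 'PTS']
print(trimmed.columns.tolist())  # ['PTS']

# Reassigning the result has the same effect as inplace=True.
df = df.drop(columns=["Unnamed: 0"])
print(df.columns.tolist())       # ['PTS']
```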


**Hope you are starting to realize why we work with cells in notebooks for data science stuff.**

Cells give us more control in deciding which steps to run. We don't ALWAYS have to run the WHOLE .py file like we did before. \
This helps us find bugs more easily.

**Aggregates of particular columns**

In [37]:
print("Mean PTS", nba['PTS'].mean())
print("Std Dev PTS", nba['PTS'].std())

Mean PTS 13.314472934472924
Std Dev PTS 5.736595009838543
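One detail worth knowing: pandas' `.std()` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` computes the population standard deviation (`ddof=0`) by default. A small sketch with made-up values where the difference is easy to check by hand:

```python
import numpy as np
import pandas as pd

pts = pd.Series([10.0, 20.0, 30.0], name="PTS")

print(pts.mean())                # 20.0
print(pts.std())                 # 10.0 (sample std, divides by n - 1)
print(np.std(pts.to_numpy()))    # about 8.165 (population std, divides by n)
print(pts.describe())            # count, mean, std, min, quartiles, max
```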


**Filtering on a column using masks**

Only display players that averaged at least 30 points in a season.

In [38]:
# Again, this does not actually modify the nba df down to this smaller df.
# This just returns a new, filtered-down df that we can just assign to `nba` if we want.
nba[nba['PTS'] >= 30]

,Year,Player,Tm,start_pct,MP,PTS,TRB,AST,FGA,FG%,...,2PA,2P%,FTA,FT%,STL,BLK,TOV,PF,start_pct.1,win_pct
87,2001,Allen Iverson *,PHI,1.0,42.0,31.1,3.8,4.6,25.5,0.42,...,21.2,0.441,10.1,0.814,2.5,0.3,3.3,2.1,1.0,0.682927
310,2002,Allen Iverson *,PHI,0.983333,43.7,31.4,4.5,5.5,27.8,0.398,...,23.4,0.419,9.8,0.812,2.8,0.2,4.0,1.7,0.983333,0.52439
443,2003,Tracy McGrady *,ORL,0.986667,39.4,32.1,6.5,5.5,24.2,0.457,...,18.2,0.481,9.7,0.793,1.7,0.8,2.6,2.1,0.986667,0.512195
504,2003,Kobe Bryant *,LAL,1.0,41.5,30.0,6.9,5.9,23.5,0.451,...,19.5,0.465,8.7,0.843,2.2,0.8,3.5,2.7,1.0,0.609756
875,2005,Allen Iverson *,PHI,1.0,42.3,30.7,4.0,7.9,24.2,0.424,...,19.7,0.451,10.5,0.835,2.4,0.1,4.6,1.9,1.0,0.52439
956,2006,LeBron James,CLE,1.0,42.5,31.4,7.0,6.6,23.1,0.48,...,18.3,0.518,10.3,0.738,1.6,0.8,3.3,2.3,1.0,0.609756
1026,2006,Allen Iverson *,PHI,1.0,43.1,33.0,3.2,7.4,25.3,0.447,...,22.2,0.465,11.5,0.814,1.9,0.1,3.4,1.7,1.0,0.463415
1072,2006,Kobe Bryant *,LAL,1.0,41.0,35.4,5.3,4.5,27.2,0.45,...,20.7,0.482,10.2,0.85,1.8,0.4,3.1,2.9,1.0,0.54878
1137,2007,Allen Iverson *,PHI,1.0,42.7,31.2,2.7,7.3,24.4,0.413,...,20.9,0.444,11.6,0.885,2.2,0.1,4.4,1.4,1.0,0.426829
1251,2007,Kobe Bryant *,LAL,1.0,40.8,31.6,5.7,5.4,22.8,0.463,...,17.6,0.497,10.0,0.868,1.4,0.5,3.3,2.7,1.0,0.512195


**Get only LeBron James' Stats**

In [39]:
nba[nba['Player'] == 'LeBron James']

,Year,Player,Tm,start_pct,MP,PTS,TRB,AST,FGA,FG%,...,2PA,2P%,FTA,FT%,STL,BLK,TOV,PF,start_pct.1,win_pct
637,2004,LeBron James,CLE,1.0,39.5,20.9,5.5,5.9,18.9,0.417,...,16.1,0.438,5.8,0.754,1.6,0.7,3.5,1.9,1.0,0.426829
892,2005,LeBron James,CLE,1.0,42.4,27.2,7.4,7.2,21.1,0.472,...,17.2,0.499,8.0,0.75,2.2,0.7,3.3,1.8,1.0,0.512195
956,2006,LeBron James,CLE,1.0,42.5,31.4,7.0,6.6,23.1,0.48,...,18.3,0.518,10.3,0.738,1.6,0.8,3.3,2.3,1.0,0.609756
1259,2007,LeBron James,CLE,1.0,40.9,27.3,6.7,6.0,20.8,0.476,...,16.8,0.513,9.0,0.698,1.6,0.7,3.2,2.2,1.0,0.609756
1308,2008,LeBron James,CLE,0.986667,40.4,30.0,7.9,7.2,21.9,0.484,...,17.1,0.531,10.3,0.712,1.8,1.1,3.4,2.2,0.986667,0.54878
1629,2009,LeBron James,CLE,1.0,37.7,28.4,7.6,7.2,19.9,0.489,...,15.2,0.535,9.4,0.78,1.7,1.1,3.0,1.7,1.0,0.804878
1816,2010,LeBron James,CLE,1.0,39.0,29.7,7.3,8.6,20.1,0.503,...,15.0,0.56,10.2,0.767,1.6,1.0,3.4,1.6,1.0,0.743902
1915,2011,LeBron James,MIA,1.0,38.8,26.7,7.5,7.0,18.8,0.51,...,15.3,0.552,8.4,0.759,1.6,0.6,3.6,2.1,1.0,0.707317
2113,2012,LeBron James,MIA,1.0,37.5,27.1,7.9,6.2,18.9,0.531,...,16.5,0.556,8.1,0.771,1.9,0.8,3.4,1.5,1.0,0.69697
2259,2013,LeBron James,MIA,1.0,37.9,26.8,8.0,7.3,17.8,0.565,...,14.5,0.602,7.0,0.753,1.7,0.9,3.0,1.4,1.0,0.804878


**Get all instances where a player attempted 14 or fewer field goals per game, but still averaged at least 20 pts in a season.**

In [40]:
nba[np.logical_and(nba['FGA'] <= 14, nba['PTS'] >= 20)]

,Year,Player,Tm,start_pct,MP,PTS,TRB,AST,FGA,FG%,...,2PA,2P%,FTA,FT%,STL,BLK,TOV,PF,start_pct.1,win_pct
631,2004,Corey Maggette,LAC,0.986301,36.0,20.7,5.9,3.1,13.9,0.447,...,10.7,0.482,8.5,0.848,0.9,0.2,2.8,3.0,0.986301,0.341463
949,2006,Shaquille O'Neal *,MIA,0.983051,30.6,20.0,9.2,1.9,13.6,0.6,...,13.6,0.6,8.0,0.469,0.4,1.8,2.8,3.9,0.983051,0.634146
1084,2007,Kevin Martin,SAC,1.0,35.2,20.2,4.3,2.2,13.3,0.473,...,9.2,0.515,7.1,0.844,1.2,0.1,1.7,2.3,1.0,0.402439
1202,2007,Amar'e Stoudemire,PHO,0.95122,32.8,20.4,9.6,1.0,12.9,0.575,...,12.8,0.577,7.1,0.781,1.0,1.3,2.8,3.6,0.95122,0.743902
1350,2008,Dwight Howard,ORL,1.0,37.7,20.7,14.2,1.3,11.9,0.599,...,11.8,0.601,10.9,0.59,0.9,2.1,3.2,3.3,1.0,0.634146
1510,2009,Dwight Howard,ORL,1.0,35.7,20.6,13.8,1.4,12.4,0.572,...,12.4,0.573,10.7,0.594,1.0,2.9,3.0,3.4,1.0,0.719512
1871,2011,Dwight Howard,ORL,1.0,37.6,22.9,14.1,1.4,13.4,0.593,...,13.3,0.597,11.7,0.596,1.4,2.4,3.6,3.3,1.0,0.634146
2100,2012,Dwight Howard,ORL,1.0,38.3,20.6,14.5,1.9,13.4,0.573,...,13.3,0.579,10.6,0.491,1.5,2.1,3.2,2.9,1.0,0.560606
2477,2014,Brook Lopez,BRK,1.0,31.4,20.7,6.0,0.9,13.5,0.563,...,13.4,0.566,6.8,0.817,0.5,1.8,1.6,3.1,1.0,0.536585
2690,2015,Jimmy Butler,CHI,1.0,38.7,20.0,5.8,3.3,14.0,0.462,...,11.1,0.484,7.1,0.834,1.8,0.6,1.4,1.7,1.0,0.609756


**Get all instances where a player averaged a Triple Double for a season**

In [41]:
nba[np.logical_and(nba['AST'] >= 10, (np.logical_and(nba['PTS'] >= 10, nba['TRB'] >= 10)))]

,Year,Player,Tm,start_pct,MP,PTS,TRB,AST,FGA,FG%,...,2PA,2P%,FTA,FT%,STL,BLK,TOV,PF,start_pct.1,win_pct
2951,2017,Russell Westbrook,OKC,1.0,34.6,31.6,10.7,10.4,24.0,0.425,...,16.8,0.459,10.4,0.845,1.6,0.4,5.4,2.3,1.0,0.573171
3127,2018,Russell Westbrook,OKC,1.0,36.4,25.4,10.1,10.3,21.1,0.449,...,17.0,0.485,7.1,0.737,1.8,0.3,4.8,2.5,1.0,0.585366
3312,2019,Russell Westbrook,OKC,1.0,36.0,22.9,11.1,10.7,20.2,0.428,...,14.5,0.481,6.2,0.656,1.9,0.5,4.5,3.4,1.0,0.597561
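Nesting `np.logical_and` calls gets harder to read as conditions pile up. Two alternatives for three or more conditions are chaining `&` or using `DataFrame.query` with a string expression. A sketch on made-up data (placeholder names and numbers):

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "PTS": [31.6, 27.0],
    "TRB": [10.7, 8.0],
    "AST": [10.4, 11.0],
})

# Chain as many conditions as needed; each one needs its own parentheses.
chained = df[(df["PTS"] >= 10) & (df["TRB"] >= 10) & (df["AST"] >= 10)]

# query() takes the same logic as a string that refers to column names.
queried = df.query("PTS >= 10 and TRB >= 10 and AST >= 10")

print(chained["Player"].tolist())  # ['Player A']
print(chained.equals(queried))     # True
```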


## Closing Remarks on pandas: We are just scratching the surface!

I touched mostly on filtering and working with a well-made data table. However, most of your work with pandas in the wild will be data cleaning and processing. In the data science lifecycle, very little time is actually spent on neat analytical queries like the ones we just did. You will get lots of practice with the tedious data processing/cleaning procedures in your projects, where you will try to extract data from the wild internet and put it into a usable pandas DataFrame.

**Here are some pandas concepts & functions that are useful for processing & cleaning that you can explore through Google:**

* The mask filtering and `np.logical_or`/`np.logical_and` shown in this lesson.
* `df.loc` vs `df.iloc`
* Using binary operators like +, -, *, and / on columns to make new columns
* `df.merge()` (basically allows you to do SQL joins between 2 pandas dfs)
* `df.str` functions when working with text data in a pandas df.
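As a starting point for exploring the items in this list, here is a minimal sketch that touches each of them once. All of the data, column values, and the `PTS_per_MP` column name are made up for illustration:

```python
import pandas as pd

stats = pd.DataFrame(
    {"Player": ["player a", "player b"], "PTS": [20.0, 10.0], "MP": [40.0, 25.0]},
    index=[10, 20],
)
teams = pd.DataFrame({"Player": ["player a", "player b"], "Tm": ["AAA", "BBB"]})

# loc selects by index label, iloc selects by integer position.
print(stats.loc[10, "PTS"])  # 20.0 (row labeled 10)
print(stats.iloc[1, 1])      # 10.0 (second row, second column)

# Binary operators on columns build a new column element by element.
stats["PTS_per_MP"] = stats["PTS"] / stats["MP"]

# merge joins two DataFrames on a shared column, like a SQL join.
merged = stats.merge(teams, on="Player")

# .str applies a string method to every value in a text column.
merged["Player"] = merged["Player"].str.title()
print(merged[["Player", "Tm", "PTS_per_MP"]])
```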