# Part 3


## NumPy

In [6]:
import numpy as np # import numpy and give it the short form ('nickname') np

<b>From Wikipedia:
    
NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.</b>

<a href = "https://en.wikipedia.org/wiki/NumPy">Full article here</a>

Some important **NumPy** features:
<ul>
<li>ndarray: fast and space-efficient multidimensional array with vectorized arithmetic and sophisticated broadcasting</li>
<li>Standard vectorized math</li>
<li>Reading / writing arrays to disk</li>
<li>Memory-mapped file access</li>
<li>Linear algebra, random number generation, Fourier transforms</li>
<li>Tools for integrating C, C++, and Fortran code</li>
</ul>

**Creating NumPy arrays**

The function *array()* is commonly used to create numpy ndarrays on the fly from other Python sequence-like objects such as tuples and lists.

In [None]:
np.array(range(3))

In [None]:
np.array((1, 2, 3)) # from a tuple

In [None]:
np.array([1, 2, 3]) # from a list

In [None]:
type(np.array(range(3)))

Nested lists result in multidimensional arrays:

In [1]:
import random
nestedList = [[random.uniform(0, 9) for x in range(3)] for y in range(4)]
nestedList

[[8.536348291941575, 7.891253521865471, 7.102633827261235],
 [3.535019178626484, 5.375644307938026, 3.7624950708902203],
 [5.202315910294648, 5.316586496061198, 4.295567253931664],
 [2.9787051451413915, 3.1780647001719107, 2.609184108942338]]

In [4]:
type(nestedList)

list

In [7]:
myArray = np.array(nestedList)
myArray

array([[8.53634829, 7.89125352, 7.10263383],
       [3.53501918, 5.37564431, 3.76249507],
       [5.20231591, 5.3165865 , 4.29556725],
       [2.97870515, 3.1780647 , 2.60918411]])

**Important attributes of arrays**

In [8]:
print(myArray.ndim)  # Number of dimensions
print(myArray.shape) # Shape of the ndarray
print(myArray.dtype) # Data type contained in the array

2
(4, 3)
float64


#### Other functions to create arrays

**arange** - this is equivalent to the built-in range function, except that it returns a one-dimensional array instead of a range object:

In [None]:
np.arange(10)

**ones**, **zeros**, **ones_like**, **zeros_like** - to create arrays filled with ones or zeros, either with a given shape or with the same shape as an existing array:

In [None]:
np.ones(3)

In [None]:
np.ones((3,4))

In [None]:
np.zeros(4)

In [None]:
np.zeros((4,3))
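
The cells above cover *ones* and *zeros*; the two *_like* variants can be sketched as follows. The array `template` is a hypothetical example, and only its shape and dtype matter here.

```python
import numpy as np

# Hypothetical 2 x 3 template array; only its shape and dtype are reused.
template = np.array([[1.5, 2.5, 3.5],
                     [4.5, 5.5, 6.5]])

ones_copy = np.ones_like(template)    # same shape and dtype as template, filled with ones
zeros_copy = np.zeros_like(template)  # same shape and dtype as template, filled with zeros

print(ones_copy.shape, ones_copy.dtype)
print(zeros_copy)
```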

We can explicitly specify the data type with which an array should be created:

In [None]:
newArray = np.array(nestedList, dtype = int) # the np.int alias was removed from NumPy; use the built-in int
newArray

In [None]:
newArray = np.array(nestedList, dtype = float) # the np.float alias was removed from NumPy; use the built-in float
newArray

In [None]:
newArray = np.array(nestedList, dtype = bool) # values equal to 0 become False, all others True
newArray

In [None]:
newArray = np.array(nestedList, dtype = str) # Unicode strings
newArray

In [None]:
newArray = np.array(nestedList, dtype = object) # arbitrary Python objects
newArray
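
As a small sketch of what these conversions do to the data: recent NumPy releases no longer provide aliases such as `np.int`, so this example passes Python's built-in types as the `dtype`. The list `values` is hypothetical sample data.

```python
import numpy as np

values = [[1.9, 0.0], [-2.7, 3.2]]  # hypothetical sample data

as_int = np.array(values, dtype=int)    # fractional parts are dropped (truncation toward zero)
as_bool = np.array(values, dtype=bool)  # 0.0 becomes False, every other value True
as_str = np.array(values, dtype=str)    # each number is stored as a Unicode string

print(as_int)
print(as_bool)
print(as_str.dtype.kind)  # 'U' marks a Unicode string dtype
```

Note that the integer conversion truncates rather than rounds, so 1.9 becomes 1 and -2.7 becomes -2.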

## Vectorized math

One of the most powerful and useful features of NumPy, especially for data science, is the ability to vectorize mathematical operations - that is, to apply an operation to every element of a vector or array in a single expression, without writing an explicit *for* loop.
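
To make the contrast concrete, here is a minimal sketch that computes the same result with an explicit loop and with a vectorized expression. The array `data` is hypothetical sample data.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical sample values

# Explicit loop: handle one element at a time
looped = np.empty_like(data)
for i in range(len(data)):
    looped[i] = data[i] * 2 + 1

# Vectorized: one expression applied to every element
vectorized = data * 2 + 1

print(np.array_equal(looped, vectorized))  # True
```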

In [9]:
myArray

array([[8.53634829, 7.89125352, 7.10263383],
       [3.53501918, 5.37564431, 3.76249507],
       [5.20231591, 5.3165865 , 4.29556725],
       [2.97870515, 3.1780647 , 2.60918411]])

In [10]:
%%time
myArray + 3

CPU times: user 199 µs, sys: 94 µs, total: 293 µs
Wall time: 313 µs


array([[11.53634829, 10.89125352, 10.10263383],
       [ 6.53501918,  8.37564431,  6.76249507],
       [ 8.20231591,  8.3165865 ,  7.29556725],
       [ 5.97870515,  6.1780647 ,  5.60918411]])

In [11]:
myArray2 = np.array([[random.uniform(0, 3) for x in range(3)] for y in range(4)]) 
# creating a nested list and then a 2D array in one step
myArray2

array([[2.53433959, 0.34878003, 1.22546269],
       [2.85883185, 0.85258116, 1.80331168],
       [0.05563128, 2.0013303 , 1.38297307],
       [2.74105944, 0.62555095, 0.72315589]])

In [12]:
myArray + myArray2

array([[11.07068788,  8.24003355,  8.32809652],
       [ 6.39385103,  6.22822547,  5.56580675],
       [ 5.25794719,  7.3179168 ,  5.67854032],
       [ 5.71976459,  3.80361565,  3.33234   ]])

In [13]:
newArray = myArray.astype(int) # the np.int alias was removed from NumPy; use the built-in int
newArray

array([[8, 7, 7],
       [3, 5, 3],
       [5, 5, 4],
       [2, 3, 2]])

In [14]:
newArray * 3

array([[24, 21, 21],
       [ 9, 15,  9],
       [15, 15, 12],
       [ 6,  9,  6]])

In [15]:
newArray * [1,2,3] 
# Elementwise multiplication on rows
# That is, in each row, the first column is multiplied by 1, the second by 2 and the third by 3

array([[ 8, 14, 21],
       [ 3, 10,  9],
       [ 5, 10, 12],
       [ 2,  6,  6]])

In [19]:
x = np.array([[1], [2], [3], [4]])
x.shape

(4, 1)

In [20]:
newArray * [[1], [2], [3], [4]] # Elementwise multiplication on columns
# That is, each row is multiplied by one value: the first row by 1, the second by 2, and so on

array([[ 8,  7,  7],
       [ 6, 10,  6],
       [15, 15, 12],
       [ 8, 12,  8]])
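
The two multiplications above rely on broadcasting: NumPy stretches the smaller operand so that its shape matches the larger one. The sketch below restates both cases with explicit shapes; `grid` holds the same values as `newArray`, and the other names are illustrative.

```python
import numpy as np

grid = np.array([[8, 7, 7],
                 [3, 5, 3],
                 [5, 5, 4],
                 [2, 3, 2]])               # shape (4, 3)

per_column = np.array([1, 2, 3])           # shape (3,): one factor per column
per_row = np.array([[1], [2], [3], [4]])   # shape (4, 1): one factor per row

print((grid * per_column).shape)  # (4, 3)
print((grid * per_row).shape)     # (4, 3)

# A one-dimensional array of length 4 does not line up with the last axis (length 3)
try:
    grid * np.array([1, 2, 3, 4])
except ValueError as err:
    print("cannot broadcast:", err)
```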

These types of NumPy operations make it easy and efficient to work with matrices and perform matrix algebra. Note that `*` multiplies element by element; it is not the matrix product.
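
For the matrix product itself, NumPy provides the `@` operator. A minimal sketch with two hypothetical 2 x 2 matrices `A` and `B`:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

elementwise = A * B      # multiplies matching entries
matrix_product = A @ B   # row-by-column matrix product, same as np.matmul(A, B)
transposed = A.T         # transpose

print(elementwise)
print(matrix_product)
print(transposed)
```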

In [None]:
np.sqrt(newArray)

In [26]:
np.log(newArray + 1) # add 1 so that zeros do not produce log(0), which gives -inf and a warning

array([[2.19722458, 2.07944154, 2.07944154],
       [1.38629436, 1.79175947, 1.38629436],
       [1.79175947, 1.79175947, 1.60943791],
       [1.09861229, 1.38629436, 1.09861229]])

These vectorized functions return an array of the appropriate shape - the same shape as the input array in the examples above.

There are also operations that aggregate an array. They return a single scalar value or, when an axis is specified, a lower-dimensional array.

In [23]:
np.sum(newArray,axis = 0)

array([18, 20, 16])

In [24]:
np.max(newArray)

8

In [25]:
np.mean(newArray)

4.5
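
The effect of the `axis` argument can be sketched as follows; `counts` holds the same values as `newArray`.

```python
import numpy as np

counts = np.array([[8, 7, 7],
                   [3, 5, 3],
                   [5, 5, 4],
                   [2, 3, 2]])

column_sums = counts.sum(axis=0)  # collapse the rows: one sum per column
row_sums = counts.sum(axis=1)     # collapse the columns: one sum per row
total = counts.sum()              # no axis: a single scalar for the whole array

print(column_sums)  # [18 20 16]
print(row_sums)     # [22 11 14  7]
print(total)        # 54
```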

#### What we have looked at here is just a small fraction of the operations and functionality available with numpy arrays - we have only looked at some of the features that are most relevant for us.

## Pandas

In [None]:
import pandas as pd

<b>From Wikipedia:

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.</b>

<a href = "https://en.wikipedia.org/wiki/Pandas_(software)">Full article here</a>

**Pandas** is built on top of *NumPy* and has the following features:
<ul>
<li>Data structures with labeled axes</li>
<li>Arithmetic Operations and reductions</li>
<li>Integrated time series functionality</li>
<li>Handling of Missing data</li>
<li>Merging, sorting, filtering and other functionality</li>
</ul>

*Pandas* has two main data structures - *Series* and *DataFrame*

## Series

In [27]:
from pandas import Series, DataFrame # this is actually optional; see below

In [28]:
a = [1,2,3,4]
a

[1, 2, 3, 4]

In [29]:
a[1:]

[2, 3, 4]

In [30]:
obj = Series([3,6,9,12])

# I can call Series directly because I imported it explicitly from pandas in the previous step
# Without that import, I would need to write pandas.Series or pd.Series

In [31]:
obj

0     3
1     6
2     9
3    12
dtype: int64

In [32]:
type(obj)

pandas.core.series.Series

In [33]:
obj.values # just the values

array([ 3,  6,  9, 12])

In [34]:
obj.index # just the indices

RangeIndex(start=0, stop=4, step=1)

In [35]:
# casualty counts in WW2
ww2_cas = Series([8.7e6,4.3e6,3.0e6,2.1e6,4e5],index=['USSR','Germany','China','Japan','USA'])
ww2_cas

USSR       8700000.0
Germany    4300000.0
China      3000000.0
Japan      2100000.0
USA         400000.0
dtype: float64

In [36]:
ww2_cas.values

array([8700000., 4300000., 3000000., 2100000.,  400000.])

In [37]:
ww2_cas.index

Index(['USSR', 'Germany', 'China', 'Japan', 'USA'], dtype='object')

In [38]:
ww2_cas['USA']

400000.0

In [39]:
ww2_cas[ww2_cas > 4e6]

USSR       8700000.0
Germany    4300000.0
dtype: float64

In [40]:
sum(ww2_cas > 4e6)

2

In [41]:
ww2_cas[ww2_cas > 4e6] # check which countries had casualties > 4 million

USSR       8700000.0
Germany    4300000.0
dtype: float64
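
The comparison inside the brackets produces a boolean Series, which is why it can be used both for counting and for selecting. A small sketch using a subset of the values above (the variable names are illustrative):

```python
import pandas as pd

counts = pd.Series([8.7e6, 4.3e6, 4e5], index=['USSR', 'Germany', 'USA'])

mask = counts > 4e6               # boolean Series with the same index
n_above = int(mask.sum())         # True counts as 1, False as 0
names = list(counts[mask].index)  # labels of the selected entries

print(mask)
print(n_above)  # 2
print(names)    # ['USSR', 'Germany']
```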

In [42]:
'USSR' in ww2_cas

True

In [None]:
'France' in ww2_cas

In [None]:
ww2_dict = ww2_cas.to_dict() # convert series into a dictionary
ww2_dict

In [None]:
ww2_series = Series(ww2_dict) # convert dictionary into a series
ww2_series

In [None]:
countries = ['China','Germany','Japan','USA','USSR','Argentina'] # a list

In [None]:
ww2_dict

In [None]:
obj2 = Series(ww2_dict,index = countries) # passing indices from a list
obj2

In [None]:
import pandas as pd
pd.isnull(obj2) # check for null values

In [None]:
pd.notnull(obj2) # check for non-null values
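
Once missing values have been located, they are usually dropped or filled. The following sketch uses a hypothetical two-entry dictionary and the Series methods `dropna` and `fillna`; which of the two is appropriate depends on the analysis.

```python
import pandas as pd

casualties = {'USSR': 8.7e6, 'USA': 4e5}   # hypothetical subset of the data
wanted = ['USA', 'USSR', 'Argentina']      # 'Argentina' has no entry in the dictionary

s = pd.Series(casualties, index=wanted)    # the missing label gets NaN

print(s.isnull())
cleaned = s.dropna()   # remove entries with missing values
filled = s.fillna(0)   # or replace missing values with a chosen constant

print(cleaned)
print(filled)
```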

## Data frames

In [None]:
#import webbrowser

In [None]:
#website = 'https://en.wikipedia.org/wiki/NFL_win%E2%80%93loss_records'

In [None]:
#webbrowser.open(website) # opens the webpage in browser

<a href = "https://en.wikipedia.org/wiki/NFL_win%E2%80%93loss_records">Click here </a>

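Copy the first few rows (including the header) of the table on that page; `pd.read_clipboard()` can then read them straight from the clipboard. In the cell below, the same data is loaded from a saved CSV file instead.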

In [43]:
import pandas as pd

In [44]:
#nfl_frame = pd.read_clipboard()
nfl_frame = pd.read_csv('data/nfl_frame.csv')

In [None]:
type(nfl_frame)

In [45]:
nfl_frame

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division
0,0,1,Dallas Cowboys,882,502,374,6,0.573,1960,NFC East
1,1,2,Green Bay Packers,1336,737,562,37,0.565,1921,NFC North
2,2,3,Chicago Bears,1370,749,579,42,0.562,1920,NFC North
3,3,4,Miami Dolphins,800,445,351,4,0.559,1966,AFC East
4,4,5,New England Patriots[b],884,489,386,9,0.558,1960,AFC East
5,5,6,Minnesota Vikings,870,470,390,10,0.546,1961,NFC North
6,6,7,Baltimore Ravens,352,190,161,1,0.541,1996,AFC North
7,7,8,New York Giants,1305,687,585,33,0.539,1925,NFC East
8,8,9,Denver Broncos,884,470,404,10,0.537,1960,AFC West
9,9,10,San Francisco 49ers,1002,528,460,14,0.534,1950,NFC West


In [46]:
nfl_frame.columns

Index(['Unnamed: 0', 'Rank', 'Team', 'GP', 'Won', 'Lost', 'Tied', 'Pct.',
       'First NFL Season', 'Division'],
      dtype='object')

In [47]:
nfl_frame.shape

(12, 10)

In [48]:
nfl_frame.Lost

0     374
1     562
2     579
3     351
4     386
5     390
6     161
7     585
8     404
9     460
10    538
11    411
Name: Lost, dtype: int64

In [49]:
nfl_frame.First NFL Season # In R:nfl_frame$Rank
# this won't work for column names with spaces

SyntaxError: invalid syntax (<ipython-input-49-a29c00d3d266>, line 1)

In [None]:
nfl_frame.Total Games # also a SyntaxError: attribute access cannot contain a space

In [50]:
nfl_frame['First NFL Season']

0     1960
1     1921
2     1920
3     1966
4     1960
5     1961
6     1996
7     1925
8     1960
9     1950
10    1933
11    1960
Name: First NFL Season, dtype: int64
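
To summarize the two ways of selecting columns, here is a minimal sketch on a hypothetical two-row frame named `teams`:

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Dallas Cowboys', 'Chicago Bears'],
                      'First NFL Season': [1960, 1920],
                      'Lost': [374, 579]})

lost = teams.Lost                           # attribute access: only for names that are valid identifiers
first = teams['First NFL Season']           # bracket access: works for any column name
both = teams[['Team', 'First NFL Season']]  # a list of names selects several columns

print(lost)
print(first)
print(both)
```

Bracket access is the safer default, since it also works when a column name contains spaces or clashes with an existing DataFrame attribute.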

In [None]:
nfl_frame[nfl_frame.Won > 500]

In [None]:
nfl_frame[(nfl_frame.Won > 500) & (nfl_frame.Lost < 500)]

In [None]:
nfl_frame[(nfl_frame.Won > 500) | (nfl_frame.Lost < 500)]
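
The filters above combine conditions with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Dallas Cowboys', 'Green Bay Packers', 'Miami Dolphins'],
                      'Won': [502, 737, 445],
                      'Lost': [374, 562, 351]})

both = teams[(teams.Won > 500) & (teams.Lost < 500)]    # both conditions must hold
either = teams[(teams.Won > 500) | (teams.Lost < 500)]  # at least one condition holds
negated = teams[~(teams.Won > 500)]                     # ~ inverts a condition

print(both.Team.tolist())     # ['Dallas Cowboys']
print(either.Team.tolist())   # all three teams
print(negated.Team.tolist())  # ['Miami Dolphins']
```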

In [51]:
my_new_df = DataFrame(nfl_frame,columns=['Team','First NFL Season','Total Games'])

In [52]:
my_new_df

Unnamed: 0,Team,First NFL Season,Total Games
0,Dallas Cowboys,1960,
1,Green Bay Packers,1921,
2,Chicago Bears,1920,
3,Miami Dolphins,1966,
4,New England Patriots[b],1960,
5,Minnesota Vikings,1961,
6,Baltimore Ravens,1996,
7,New York Giants,1925,
8,Denver Broncos,1960,
9,San Francisco 49ers,1950,


In [53]:
DataFrame(nfl_frame,columns=['Team','First NFL Season','Total Games','Stadium'])

Unnamed: 0,Team,First NFL Season,Total Games,Stadium
0,Dallas Cowboys,1960,,
1,Green Bay Packers,1921,,
2,Chicago Bears,1920,,
3,Miami Dolphins,1966,,
4,New England Patriots[b],1960,,
5,Minnesota Vikings,1961,,
6,Baltimore Ravens,1996,,
7,New York Giants,1925,,
8,Denver Broncos,1960,,
9,San Francisco 49ers,1950,,


If a requested column doesn't exist, pandas creates a column with that name and fills it with null values (NaN).
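
A more explicit way to express the same idea is the `reindex` method, sketched here on a hypothetical two-row frame. It is shown as an alternative, not as what the cells above do internally.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['Dallas Cowboys', 'Chicago Bears'],
                      'First NFL Season': [1960, 1920]})

# 'Stadium' is not a column of teams, so it is created and filled with NaN
subset = teams.reindex(columns=['Team', 'First NFL Season', 'Stadium'])

print(subset)
print(subset['Stadium'].isnull().all())  # True
```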

In [54]:
nfl_frame.head() # head; default first 5 rows

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division
0,0,1,Dallas Cowboys,882,502,374,6,0.573,1960,NFC East
1,1,2,Green Bay Packers,1336,737,562,37,0.565,1921,NFC North
2,2,3,Chicago Bears,1370,749,579,42,0.562,1920,NFC North
3,3,4,Miami Dolphins,800,445,351,4,0.559,1966,AFC East
4,4,5,New England Patriots[b],884,489,386,9,0.558,1960,AFC East


In [55]:
nfl_frame.head(3)

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division
0,0,1,Dallas Cowboys,882,502,374,6,0.573,1960,NFC East
1,1,2,Green Bay Packers,1336,737,562,37,0.565,1921,NFC North
2,2,3,Chicago Bears,1370,749,579,42,0.562,1920,NFC North


In [56]:
nfl_frame.tail()

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division
7,7,8,New York Giants,1305,687,585,33,0.539,1925,NFC East
8,8,9,Denver Broncos,884,470,404,10,0.537,1960,AFC West
9,9,10,San Francisco 49ers,1002,528,460,14,0.534,1950,NFC West
10,10,11,Pittsburgh Steelers,1179,614,538,20,0.532,1933,AFC North
11,11,12,Oakland Raiders,884,462,411,11,0.529,1960,AFC West


In [57]:
nfl_frame.tail(2)

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division
10,10,11,Pittsburgh Steelers,1179,614,538,20,0.532,1933,AFC North
11,11,12,Oakland Raiders,884,462,411,11,0.529,1960,AFC West


In [None]:
nfl_frame

In [58]:
nfl_frame['Stadium'] = "Levi's Stadium" # create a new column with the same value for all rows

In [59]:
nfl_frame

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division,Stadium
0,0,1,Dallas Cowboys,882,502,374,6,0.573,1960,NFC East,Levi's Stadium
1,1,2,Green Bay Packers,1336,737,562,37,0.565,1921,NFC North,Levi's Stadium
2,2,3,Chicago Bears,1370,749,579,42,0.562,1920,NFC North,Levi's Stadium
3,3,4,Miami Dolphins,800,445,351,4,0.559,1966,AFC East,Levi's Stadium
4,4,5,New England Patriots[b],884,489,386,9,0.558,1960,AFC East,Levi's Stadium
5,5,6,Minnesota Vikings,870,470,390,10,0.546,1961,NFC North,Levi's Stadium
6,6,7,Baltimore Ravens,352,190,161,1,0.541,1996,AFC North,Levi's Stadium
7,7,8,New York Giants,1305,687,585,33,0.539,1925,NFC East,Levi's Stadium
8,8,9,Denver Broncos,884,470,404,10,0.537,1960,AFC West,Levi's Stadium
9,9,10,San Francisco 49ers,1002,528,460,14,0.534,1950,NFC West,Levi's Stadium


In [None]:
nfl_array = np.array(nfl_frame)

In [None]:
nfl_array

In [None]:
nfl_array.shape

In [None]:
import numpy as np
np.arange(10)

In [60]:
nfl_frame['StadiumNumber'] = np.arange(nfl_frame.shape[0]) # create a new column with consecutive numbering

In [61]:
nfl_frame

Unnamed: 0.1,Unnamed: 0,Rank,Team,GP,Won,Lost,Tied,Pct.,First NFL Season,Division,Stadium,StadiumNumber
0,0,1,Dallas Cowboys,882,502,374,6,0.573,1960,NFC East,Levi's Stadium,0
1,1,2,Green Bay Packers,1336,737,562,37,0.565,1921,NFC North,Levi's Stadium,1
2,2,3,Chicago Bears,1370,749,579,42,0.562,1920,NFC North,Levi's Stadium,2
3,3,4,Miami Dolphins,800,445,351,4,0.559,1966,AFC East,Levi's Stadium,3
4,4,5,New England Patriots[b],884,489,386,9,0.558,1960,AFC East,Levi's Stadium,4
5,5,6,Minnesota Vikings,870,470,390,10,0.546,1961,NFC North,Levi's Stadium,5
6,6,7,Baltimore Ravens,352,190,161,1,0.541,1996,AFC North,Levi's Stadium,6
7,7,8,New York Giants,1305,687,585,33,0.539,1925,NFC East,Levi's Stadium,7
8,8,9,Denver Broncos,884,470,404,10,0.537,1960,AFC West,Levi's Stadium,8
9,9,10,San Francisco 49ers,1002,528,460,14,0.534,1950,NFC West,Levi's Stadium,9


In [None]:
nfl_frame['Total2'] = nfl_frame['Won'] + nfl_frame.Lost # create a new column by applying some calculations to existing columns

In [None]:
nfl_frame

In [None]:
del nfl_frame['StadiumNumber']

In [None]:
nfl_frame

In [None]:
del nfl_frame['Stadium']

In [None]:
nfl_frame

In [63]:
old_dict = {'SF':8.37e5,'LA':3.88e6,'NYC':8.4e6}
old_dict

{'LA': 3880000.0, 'NYC': 8400000.0, 'SF': 837000.0}

In [64]:
old = DataFrame(old_dict,index=old_dict.keys())

In [65]:
old

Unnamed: 0,LA,NYC,SF
SF,3880000.0,8400000.0,837000.0
LA,3880000.0,8400000.0,837000.0
NYC,3880000.0,8400000.0,837000.0
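
The repeated rows above arise because each dictionary key becomes a column and each scalar value is copied to every index label. If the goal is one row per city, two possible approaches are sketched below; the names `populations`, `as_series` and `as_frame` are illustrative.

```python
import pandas as pd

populations = {'SF': 8.37e5, 'LA': 3.88e6, 'NYC': 8.4e6}

# Option 1: a Series, with the city names as the index
as_series = pd.Series(populations)

# Option 2: a two-column DataFrame built from the (key, value) pairs
as_frame = pd.DataFrame(list(populations.items()), columns=['City', 'Population'])

print(as_series)
print(as_frame)
```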


In [66]:
data_dict = {'City':['SF','LA','NYC'],'Population':[8.37e5,3.88e6,8.4e6]} # create a dictionary

In [67]:
data_dict

{'City': ['SF', 'LA', 'NYC'], 'Population': [837000.0, 3880000.0, 8400000.0]}

In [69]:
city_frame = DataFrame(data_dict) #create a data frame from a dictionary

In [None]:
city_frame

## End of part 3