# Part 3


## NumPy

In [2]:
import numpy as np # import numpy and give it the short form ('nickname') np

<b>From Wikipedia:
    
NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.</b>

<a href = "https://en.wikipedia.org/wiki/NumPy">Full article here</a>

Some important **NumPy** features:
<ul>
<li>ndarray: fast and space-efficient multidimensional array with vectorized arithmetic and sophisticated broadcasting</li>
<li>Standard vectorized math</li>
<li>Reading / writing arrays to disk</li>
<li>Memory-mapped file access</li>
<li>Linear algebra, random number generation, Fourier transforms</li>
<li>Integration with C, C++ and Fortran code</li>
</ul>

**Creating NumPy arrays**

The function *array()* is commonly used to create NumPy ndarrays on the fly from other Python sequence-like objects such as ranges, tuples and lists.

In [3]:
np.array(range(3))

array([0, 1, 2])

In [4]:
np.array((1, 2, 3)) # from a tuple

array([1, 2, 3])

In [5]:
np.array([1, 2, 3]) # from a list

array([1, 2, 3])

Nested lists result in multidimensional arrays:

In [None]:
import random
nestedList = [[random.uniform(0, 9) for x in range(3)] for y in range(4)]
nestedList

In [None]:
type(nestedList)

In [None]:
myArray = np.array(nestedList)
myArray

**Important attributes of arrays**

In [None]:
print(myArray.ndim)  # Number of dimensions
print(myArray.shape) # Shape of the ndarray
print(myArray.dtype) # Data type contained in the array

#### Other functions to create arrays

**arange** - the NumPy equivalent of the built-in range function, except that it returns a one-dimensional array instead of a range object:

In [None]:
np.arange(10)

**ones**, **zeros**, **ones_like**, **zeros_like** - create arrays filled with ones or zeros, either with a given shape or with the same shape as an existing array:

In [None]:
np.ones(3)

In [None]:
np.ones((3,4))

In [None]:
np.zeros(4)

In [None]:
np.zeros((4,3))
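
The cells above cover *ones* and *zeros*; the *_like* variants can be sketched in the same way. The array name `template` below is a hypothetical example, not something defined elsewhere in this notebook.

```python
import numpy as np

# Hypothetical 2x3 integer array, used only as a shape/dtype template
template = np.array([[1, 2, 3], [4, 5, 6]])

ones_copy = np.ones_like(template)    # same shape and dtype as template, filled with 1
zeros_copy = np.zeros_like(template)  # same shape and dtype as template, filled with 0

print(ones_copy)
print(zeros_copy)
print(ones_copy.shape, ones_copy.dtype == template.dtype)
```

Because the result inherits the dtype of the template, an integer template gives integer ones and zeros rather than the floats that `np.ones(3)` returns.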

We can explicitly specify the data type with which an array should be created:

In [None]:
newArray = np.array(nestedList, dtype=int) # integers; the fractional part is dropped
newArray

In [None]:
newArray = np.array(nestedList, dtype=float) # floating-point numbers
newArray

In [None]:
newArray = np.array(nestedList, dtype=bool) # any nonzero number becomes True
newArray

In [None]:
newArray = np.array(nestedList, dtype=str) # Unicode strings
newArray

In [None]:
newArray = np.array(nestedList, dtype=object) # arbitrary Python objects
newArray
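
Since *nestedList* holds random numbers, the effect of each data type is easier to see on fixed values. The following is a small sketch using a hypothetical list called `values`; it is meant only to illustrate the conversions, not to replace the cells above.

```python
import numpy as np

# Hypothetical fixed values, chosen so the conversions are easy to follow
values = [[0.0, 1.7, 2.2], [3.9, 0.0, 5.5]]

as_int = np.array(values, dtype=int)    # truncates toward zero: 1.7 -> 1, 3.9 -> 3
as_bool = np.array(values, dtype=bool)  # 0.0 -> False, anything else -> True
as_str = np.array(values, dtype=str)    # each number becomes its string form

print(as_int)
print(as_bool)
print(as_str)
```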

## Vectorized math

One of the most powerful and useful features of NumPy, especially for data science, is the ability to vectorize mathematical operations, that is, to apply an operation to every element of a vector or array at once without writing a *for* loop.
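
As a rough illustration of what vectorization replaces, the sketch below computes the same result twice: once with an explicit loop over a plain list and once with a single array expression. The names `values`, `looped` and `vectorized` are made up for this example.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]  # hypothetical input data

# Explicit Python loop over a list
looped = []
for v in values:
    looped.append(v * 2 + 1)

# Vectorized: the same arithmetic applied to the whole array in one expression
arr = np.array(values)
vectorized = arr * 2 + 1

print(looped)
print(vectorized)
```

Both versions produce the same numbers; the vectorized form is shorter and, for large arrays, usually much faster because the loop runs inside NumPy's compiled code.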

In [None]:
myArray

In [None]:
myArray + 3

In [None]:
myArray2 = np.array([[random.uniform(0, 3) for x in range(3)] for y in range(4)]) 
# creating a nested list and then a 2D array in one step
myArray2

In [None]:
myArray + myArray2

In [None]:
newArray = myArray.astype(int) # convert to integers; the fractional part is dropped
newArray

In [None]:
newArray * 3

In [None]:
newArray * [1,2,3] 
# Elementwise multiplication on rows
# That is, in each row, the first column is multiplied by 1, the second by 2 and the third by 3

In [None]:
newArray * [[1], [2], [3], [4]] # Elementwise multiplication on columns
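
The two multiplications above rely on broadcasting: NumPy stretches the smaller operand so that its shape matches the array. A minimal sketch on an array of ones (the name `grid` is hypothetical) makes the pattern visible:

```python
import numpy as np

grid = np.ones((4, 3), dtype=int)  # hypothetical 4x3 array of ones

# A flat list has shape (3,): it is repeated for each of the 4 rows
by_column = grid * [1, 2, 3]

# A nested list has shape (4, 1): it is repeated across each of the 3 columns
by_row = grid * [[1], [2], [3], [4]]

print(by_column)
print(by_row)
```

In `by_column` every row reads 1, 2, 3; in `by_row` the first row is all ones, the second all twos, and so on.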

These types of NumPy operations make it very easy and efficient to work with matrices and perform matrix algebra.
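
One point worth keeping in mind for matrix algebra: the `*` operator multiplies element by element, while the matrix product uses the `@` operator (or `np.matmul`). The sketch below uses two small hypothetical matrices, `a` and `b`, to show the difference.

```python
import numpy as np

# Two hypothetical 2x2 matrices
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b     # multiplies matching entries: [[5, 12], [21, 32]]
matrix_product = a @ b  # rows of a times columns of b: [[19, 22], [43, 50]]

print(elementwise)
print(matrix_product)
```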

In [None]:
np.sqrt(newArray)

In [None]:
np.log(newArray + 1) # add 1 to avoid log(0), which gives -inf and a warning

These vectorized functions return an array of the appropriate shape - the same shape as the input array in the examples above.

There are also operations that are applied to an array and return a scalar value, usually the result of some type of aggregation.

In [None]:
np.sum(newArray)

In [None]:
np.max(newArray)

In [None]:
np.mean(newArray)
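
These aggregations also accept an `axis` argument, in which case they return one value per column or per row instead of a single scalar. A short sketch with a hypothetical array called `table`:

```python
import numpy as np

# Hypothetical 2x3 array
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

total = np.sum(table)                # all elements: 21
column_sums = np.sum(table, axis=0)  # collapse the rows: [5, 7, 9]
row_means = np.mean(table, axis=1)   # collapse the columns: [2.0, 5.0]

print(total)
print(column_sums)
print(row_means)
```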

#### What we have looked at here is just a small fraction of the operations and functionality available with NumPy arrays; we have covered only the features most relevant for us.

## Pandas

In [None]:
import pandas as pd

<b>From Wikipedia:

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.</b>

<a href = "https://en.wikipedia.org/wiki/Pandas_(software)">Full article here</a>

**Pandas** is built on top of *NumPy* and has the following features:
<ul>
<li>Data structures with labeled axes</li>
<li>Arithmetic operations and reductions</li>
<li>Integrated time series functionality</li>
<li>Handling of missing data</li>
<li>Merging, sorting, filtering and other functionality</li>
</ul>

*Pandas* has two main data structures: *Series* and *DataFrame*.

## Series

In [None]:
from pandas import Series, DataFrame # this is actually optional; see below

In [None]:
a = [1,2,3,4]
a

In [None]:
a[1:]

In [None]:
obj = Series([3,6,9,12])

# I can call Series directly because I imported it explicitly from pandas in the previous step
# Without that import, I would need to write pd.Series (or pandas.Series if pandas is imported without an alias)

In [None]:
obj

In [None]:
type(obj)

In [None]:
obj.values # just the values

In [None]:
obj.index # just the indices

In [None]:
# casualty counts in WW2
ww2_cas = Series([8.7e6,4.3e6,3.0e6,2.1e6,4e5],index=['USSR','Germany','China','Japan','USA'])
ww2_cas

In [None]:
ww2_cas.values

In [None]:
ww2_cas.index

In [None]:
ww2_cas['USA']

In [None]:
ww2_cas > 4e6

In [None]:
sum(ww2_cas > 4e6) # True counts as 1, so this is the number of countries above 4 million

In [None]:
ww2_cas[ww2_cas > 4e6] # select the countries that had casualties > 4 million

In [None]:
'USSR' in ww2_cas

In [None]:
'France' in ww2_cas

In [None]:
ww2_dict = ww2_cas.to_dict() # convert series into a dictionary
ww2_dict

In [None]:
ww2_series = Series(ww2_dict) # convert dictionary into a series
ww2_series

In [None]:
countries = ['China','Germany','Japan','USA','USSR','Argentina'] # a list

In [None]:
ww2_dict

In [None]:
obj2 = Series(ww2_dict,index = countries) # passing indices from a list; 'Argentina' is not in the dictionary, so its value is NaN
obj2

In [None]:
pd.isnull(obj2) # check for null values

In [None]:
pd.notnull(obj2) # check for non-null values
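
Once missing values have been located, they are typically either dropped or replaced. The sketch below shows both options on a hypothetical Series named `scores`, built the same way as *obj2* so that one label has no value.

```python
import pandas as pd

# Hypothetical data: label 'C' is not in the dictionary, so it becomes NaN
scores = pd.Series({'A': 1.0, 'B': 2.0}, index=['A', 'B', 'C'])

missing_count = scores.isnull().sum()  # number of missing entries
without_missing = scores.dropna()      # drop the entries that are NaN
filled = scores.fillna(0)              # replace NaN with a chosen value

print(missing_count)
print(without_missing)
print(filled)
```

Whether dropping or filling is appropriate depends on the analysis; filling with 0 here is only an illustrative choice.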

## Data frames

In [None]:
import webbrowser

In [None]:
#website = 'https://en.wikipedia.org/wiki/NFL_win%E2%80%93loss_records'

In [None]:
#webbrowser.open(website) # opens the webpage in browser

<a href = "https://en.wikipedia.org/wiki/NFL_win%E2%80%93loss_records">NFL win-loss records on Wikipedia</a>

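Copy the first few lines (including the header) from the table on the webpage if you want to load them with read_clipboard; otherwise the CSV file below contains the same data.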

In [None]:
import pandas as pd

In [None]:
#nfl_frame = pd.read_clipboard()
nfl_frame = pd.read_csv('data/nfl_frame.csv')

In [None]:
type(nfl_frame)

In [None]:
nfl_frame

In [None]:
nfl_frame.columns

In [None]:
nfl_frame.shape

In [None]:
nfl_frame.Lost

In [None]:
# nfl_frame.First NFL Season # In R: nfl_frame$Rank
# Left commented out: attribute access is a SyntaxError for column names with spaces

In [None]:
nfl_frame['Total Games'] # nfl_frame.Total Games is a SyntaxError because of the space

In [None]:
nfl_frame['First NFL Season']

In [None]:
nfl_frame[nfl_frame.Won > 500]

In [None]:
nfl_frame[(nfl_frame.Won > 500) & (nfl_frame.Lost < 500)]

In [None]:
nfl_frame[(nfl_frame.Won > 500) | (nfl_frame.Lost < 500)]
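
The filters above combine boolean Series with `&` (and) and `|` (or). The parentheses are required because these operators bind more tightly than comparisons such as `>`. A small sketch on a hypothetical frame called `records`, with made-up numbers, shows which rows each filter keeps:

```python
import pandas as pd

# Hypothetical win/loss data, not taken from the NFL table
records = pd.DataFrame({
    'Team': ['A', 'B', 'C', 'D'],
    'Won':  [600, 450, 520, 300],
    'Lost': [400, 480, 510, 700],
})

# Both conditions must hold: only team A
both = records[(records.Won > 500) & (records.Lost < 500)]

# At least one condition must hold: teams A, B and C
either = records[(records.Won > 500) | (records.Lost < 500)]

print(both)
print(either)
```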

In [None]:
my_new_df = DataFrame(nfl_frame,columns=['Team','First NFL Season','Total Games'])

In [None]:
my_new_df

In [None]:
DataFrame(nfl_frame,columns=['Team','First NFL Season','Total Games','Stadium'])

If a requested column doesn't exist, pandas creates a column with that name and fills it with null values (NaN).
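
The same behaviour can be reproduced on a tiny example. The frame `teams` below is hypothetical; the sketch also shows `reindex`, a pandas method that makes the column selection explicit and likewise fills unknown columns with NaN.

```python
import pandas as pd

# Hypothetical two-row frame without a 'Stadium' column
teams = pd.DataFrame({'Team': ['A', 'B'], 'Won': [10, 7]})

# Asking for a column that does not exist yields a column of NaN
selected = pd.DataFrame(teams, columns=['Team', 'Won', 'Stadium'])

# reindex expresses the same selection explicitly
reindexed = teams.reindex(columns=['Team', 'Won', 'Stadium'])

print(selected)
print(reindexed)
```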

In [None]:
nfl_frame.head() # head; default first 5 rows

In [None]:
nfl_frame.head(3)

In [None]:
nfl_frame.tail()

In [None]:
nfl_frame.tail(2)

In [None]:
nfl_frame

In [None]:
nfl_frame['Stadium'] = "Levi's Stadium" # create a new column with the same value for all rows

In [None]:
nfl_frame

In [None]:
import numpy as np
np.arange(10)

In [None]:
nfl_frame['StadiumNumber'] = np.arange(nfl_frame.shape[0]) # create a new column numbered consecutively from 0

In [None]:
nfl_frame

In [None]:
nfl_frame['Total2'] = nfl_frame['Won'] + nfl_frame.Lost # create a new column by applying some calculations to existing columns

In [None]:
nfl_frame

In [None]:
del nfl_frame['StadiumNumber']

In [None]:
nfl_frame

In [None]:
del nfl_frame['Stadium']

In [None]:
nfl_frame

In [None]:
old_dict = {'SF':8.37e5,'LA':3.88e6,'NYC':8.4e6}
old_dict

In [None]:
data_dict = {'City':['SF','LA','NYC'],'Population':[8.37e5,3.88e6,8.4e6]} # create a dictionary

In [None]:
data_dict

In [None]:
city_frame = DataFrame(data_dict) #create a data frame from a dictionary

In [None]:
city_frame
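
A dictionary shaped like *old_dict*, which maps each city directly to its population, can also be turned into a data frame. One possible approach, sketched here with the hypothetical name `city_frame_from_pairs`, is to pass the key/value pairs as rows and name the columns:

```python
import pandas as pd

old_dict = {'SF': 8.37e5, 'LA': 3.88e6, 'NYC': 8.4e6}

# Each (key, value) pair becomes one row; the column names are supplied separately
city_frame_from_pairs = pd.DataFrame(list(old_dict.items()),
                                     columns=['City', 'Population'])

print(city_frame_from_pairs)
```

This gives the same two-column layout as *city_frame* without having to restructure the dictionary by hand.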

## End of part 3