# NumPy and Pandas ![NumPy logo](Resources\Images\numpy_logo.png) ![Pandas logo](Resources\Images\pandas_logo.png) 
These are specialised libraries for handling large arrays of data.
* [NumPy](http://www.numpy.org/) is *"the fundamental package for scientific computing with Python"* - it implements fast arrays and matrix calculations. [Documentation](https://docs.scipy.org/doc/).
* [Pandas](https://pandas.pydata.org/) is *"an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language"* - it adds features for data science (named columns and rows). [Documentation](http://pandas.pydata.org/pandas-docs/stable/).



## NumPy ![NumPy logo](Resources\Images\numpy_logo.png)

From the documentation:
> NumPy contains among other things:
> 
> 1. a powerful N-dimensional array object
> 2. sophisticated (broadcasting) functions
> 3. tools for integrating C/C++ and Fortran code
> 4. useful linear algebra, Fourier transform, and random number capabilities
> 
> Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy arrays function in a similar way to Matlab arrays. If you are already familiar with Matlab, you will find a comparison of rough Matlab equivalents in the [online guide for Matlab users](https://docs.scipy.org/doc/numpy-1.15.0/user/numpy-for-matlab-users.html).

We will focus on the use of NumPy to handle large arrays of data.


### References

* [Loading CSV into 2D matrices](https://stackoverflow.com/questions/4315506/load-csv-into-2d-matrix-with-numpy-for-plotting)
* [Structured arrays in NumPy](https://docs.scipy.org/doc/numpy/user/basics.rec.html)
* [Vectorisation in place of for-loops](https://towardsdatascience.com/why-you-should-forget-for-loop-for-data-science-code-and-embrace-vectorization-696632622d5f)
* [NumPy array programming](https://realpython.com/numpy-array-programming/)
* [Computation on Arrays using ufuncs](https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html)


### NumPy Arrays
NumPy Arrays (ndarray) have some key features:
1. They are built for fast array and matrix operations, especially for large arrays. 
2. They are homogeneous and strongly typed - you need to decide what the data type is, and that is the only type that it can contain - i.e. all integers or all floats or all strings
3. It is best to define the dimensions of the array at the outset - appending to arrays is relatively slow
4. Many matrix operations are defined and they are fast
5. NumPy arrays are used by other libraries, such as SciPy, pandas, tensorflow and scikit-learn
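
As a small illustration of point 2 (a sketch that is not part of the original notebook; the variable names are made up), the data type is fixed when the array is created, and values assigned later are converted to it:

```python
import numpy as np

# Mixed int/float input is promoted to a single common type (float64)
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# An integer array keeps its dtype: assigning a float drops the fraction
ints = np.zeros(3, dtype=int)
ints[0] = 7.9
print(ints)
```

The silent truncation in the second case is one reason to choose the `dtype` deliberately at the outset.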

In [None]:
# First we will create a 1D array filled with zeros
import numpy as np

x = np.zeros(4, dtype=int)
x

In [None]:
# We can create an array of any dimension
y = np.ones((2,3,4), dtype = float)
y

In [None]:
# We can identify the dimensions of the array by referencing the `shape` property
y.shape

In [None]:
import numpy as np

np_arr_1 = np.arange(0, 6).reshape(3, 2)
np_arr_2 = np.arange(0, 8).reshape(2, 4)

print('First array')
print(np_arr_1)
print('Second array')
print(np_arr_2)

The two types of array multiplication can be carried out using the `*` operator (element-wise multiplication) and the `@` operator (matrix multiplication).  

In [None]:
# Element-wise multiplication
my_result_arr = np_arr_1 * np_arr_1
print('First array x First array (element-wise)')
print(my_result_arr)
np_arr_1.shape

In [None]:
# Matrix Multiplication
my_result_arr = np_arr_1 @ np_arr_2
print('Matrix shapes:', np_arr_1.shape, np_arr_2.shape, my_result_arr.shape)
print('First array x Second array')
print(my_result_arr)

Matrix multiplication can also be carried out using the `np.matmul` function:

In [None]:
np.matmul(np_arr_1, np_arr_2)

### Comparison with Pure Lists

In [None]:
# We can convert NumPy arrays into simple lists
list_1 = np_arr_1.tolist()
list_2 = np_arr_2.tolist()
list_2

#### Matrix Multiplication using Simple Python For-loops
Note that the `@` operator cannot be used on pure Python lists that are arranged as arrays or matrices, so we have to write our own function. Nor do the standard operators (`+`, `-`, `*`, `/` etc.) perform element-wise operations on such lists: `+` concatenates the two lists, and `list_1 * list_2` raises a `TypeError`. A list comprehension provides a quick solution for element-wise operations: `[x * y for x, y in zip(list_1, list_2)]` for two 1D lists of equal length, and `[[x * y for x, y in zip(a, b)] for a, b in zip(list_1, list_1)]` for a 2D list (multiplied by itself here, because `list_1` and `list_2` have different shapes).
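
The behaviour of the list operators can be checked with a short sketch. The nested list `list_a` is a hypothetical stand-in with the same values as `list_1`, so that the example runs on its own:

```python
list_a = [[0, 1], [2, 3], [4, 5]]   # hypothetical 3x2 nested list

# '+' concatenates lists rather than adding element by element
concatenated = [1, 2] + [3, 4]
print(concatenated)                  # [1, 2, 3, 4]

# '*' between two lists is not defined at all
try:
    list_a * list_a
except TypeError as err:
    print('TypeError:', err)

# Element-wise product of a 2D nested list with itself
squared = [[x * y for x, y in zip(row_a, row_b)]
           for row_a, row_b in zip(list_a, list_a)]
print(squared)                       # [[0, 1], [4, 9], [16, 25]]
```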

In [None]:
# Matrix Multiplication 
def mmult(list_1, list_2):
    my_result_arr = [[0 for col in range(len(list_2[0]))] for row in range(len(list_1))]
    # iterate through rows of list_1
    for i in range(len(list_1)):
        # iterate through columns of list_2
        for j in range(len(list_2[0])):
            # iterate through rows of list_2
            for k in range(len(list_2)):
                my_result_arr[i][j] += list_1[i][k] * list_2[k][j]
    return my_result_arr


print('For-loop Matrix Multiplication')
mmult(list_1, list_2)

#### Matrix Multiplication using List Comprehension

In [None]:
print('List Comprehension')
[[sum(a*b for a,b in zip(X_row,Y_col)) for Y_col in zip(*list_2)] for X_row in list_1]

#### Timing Comparison - Small Arrays

In [None]:
print('For-loop Multiplication  : ', end="")
%timeit mmult(list_1, list_2)
print('List Comp Multiplication : ', end="")
%timeit [[sum(a*b for a,b in zip(X_row,Y_col)) for Y_col in zip(*list_2)] for X_row in list_1]
print('NumPy Multiplication     : ', end="")
%timeit np.matmul(np_arr_1, np_arr_2)

Note that NumPy can be significantly faster, but is not always a lot faster for operations on short lists or small arrays. List comprehension is almost always faster than for-loops, but by less than an order of magnitude.

#### Timing Comparison - Large Arrays

In [None]:
# Building large arrays
np_arr_3 = np.arange(0, 60000).reshape(300, 200)
np_arr_4 = np.arange(0, 80000).reshape(200, 400)
list_3 = np_arr_3.tolist()
list_4 = np_arr_4.tolist()

In [None]:
print('For-loop Multiplication  : ', end="")
%timeit mmult(list_3, list_4)
print('List Comp Multiplication : ', end="")
%timeit [[sum(a*b for a,b in zip(X_row,Y_col)) for Y_col in zip(*list_4)] for X_row in list_3]
print('NumPy Multiplication     : ', end="")
%timeit np.matmul(np_arr_3, np_arr_4)

Note that NumPy is significantly faster than pure Python in this case. Compared with the for-loop version:
* list comprehensions are ~1.6x faster
* NumPy is ~500x faster
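
The `%timeit` magic only works inside Jupyter / IPython. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. The function name `mmult_loops` and the array sizes below are assumptions chosen so that the example runs quickly; the timings will differ from machine to machine.

```python
import timeit
import numpy as np

def mmult_loops(a, b):
    """Triple for-loop matrix product of two nested lists."""
    result = [[0] * len(b[0]) for _ in range(len(a))]
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(len(b)):
                result[i][j] += a[i][k] * b[k][j]
    return result

# Modest sizes (an assumption) so that the pure-Python loops finish quickly
arr_a = np.arange(0, 600).reshape(30, 20)
arr_b = np.arange(0, 800).reshape(20, 40)
list_a, list_b = arr_a.tolist(), arr_b.tolist()

# Both approaches must give the same answer before timing means anything
assert mmult_loops(list_a, list_b) == (arr_a @ arr_b).tolist()

# Total time in seconds for 5 repetitions of each approach
t_loops = timeit.timeit(lambda: mmult_loops(list_a, list_b), number=5)
t_numpy = timeit.timeit(lambda: arr_a @ arr_b, number=5)
print(f'for-loops: {t_loops:.4f} s, NumPy: {t_numpy:.4f} s (5 runs each)')
```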

## NumPy Structured Arrays
NumPy arrays are fast, but not very flexible, and they do not carry metadata that describes the data they contain. In addition to simple ndarrays, NumPy provides structured arrays, which allow you to mix different types of data (although all values in a given field, or column, must be of the same type) and which carry field names that describe the data.

In [None]:
# Structured Arrays
steel_mat = np.array([('S275', 275, 200000), ('S355', 355, 200000), ('S460', 460, 200000)], 
             dtype=[('name', 'U10'), ('fy', 'f4'), ('E_mod', 'f4')])
steel_mat

It is now possible to address the data by name:

In [None]:
steel_mat['name']
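
Because each field behaves like an ordinary 1D array, it can also be used to filter the records. The following sketch rebuilds the same data under the hypothetical name `steel` so that it can be run on its own:

```python
import numpy as np

# Same layout as the steel_mat example: name, yield strength, elastic modulus
steel = np.array([('S275', 275, 200000), ('S355', 355, 200000), ('S460', 460, 200000)],
                 dtype=[('name', 'U10'), ('fy', 'f4'), ('E_mod', 'f4')])

# A comparison on one field gives a boolean mask over the records
strong = steel[steel['fy'] > 300]
print(strong['name'])

# Several fields can be selected at once by passing a list of names
print(steel[['name', 'fy']])
```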

## Pandas ![Pandas logo](Resources\Images\pandas_logo.png)

Pandas is a popular tool in data science. It provides a clean interface to structured data and also many tools for cleaning and processing this data. It is fast because it is based on NumPy. 

The key data types are `Series`, `Index` and `DataFrame` ([a summary is provided here](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)). These are based on underlying data types such as integers and floats ([there is a summary here](http://pbpython.com/pandas_dtypes.html)). Pandas DataFrames are based on the data frames created for the R language ([R is a popular and powerful statistical programming language](https://cran.r-project.org/) - [data frame documentation here](https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/data.frame)).

The following website provides a useful introduction to Pandas.
* [10 Minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)

There is also a Jupyter-based tutorial called [Pandas Cookbook](http://pandas.pydata.org/pandas-docs/stable/tutorials.html)

### Pandas Series and DataFrames 
Series and DataFrames can be created directly and DataFrames can also be created from Series.

In [None]:
import pandas as pd
s = pd.Series([1,3,5,np.nan,8]) # note that while all data has to be of the same data type, some can be 'Not a Number' (NaN)
s

We can provide metadata for each data element (index) and also for the whole series (name):

In [None]:
s = pd.Series([1,3,5,np.nan,8], name = 'Animals', index = ('Mouse','Rat','Camel','Bat','Zebra'))
s

We can create a pair of Series and then combine them into a DataFrame:

In [None]:
s1 = pd.Series(['Lily', 'MingChun', 'Mary', 'Chris','Lee'], name = 'Name') 
s2 = pd.Series([3088, 3142, 5514, 2221, 3001], name = 'Number', dtype = 'int') 
s1

In [None]:
s2

These can be combined using the concatenation command (`concat`):

In [None]:
df = pd.concat([s1, s2], axis=1)
print(df.index)
print(df.columns)
df

In [None]:
# It is then possible to access elements of the dataframe using the column names
df['Name']

In [None]:
# If the name is simple then it can also be used to access the column directly
df.Name

### Importing Python lists
Pandas can directly import Python collections - lists, tuples and dictionaries. It carries out intelligent interpretation of the data types.

In [None]:
py_list = [[5,8],[10,3],[6,1]]
py_dict = [{ 'species' : 'pangolin', 'class': 'mammal' },
           { 'species' : 'gecko', 'class': 'lizard' },
           { 'species' : 'mantis', 'class': 'insect' },
           { 'species' : 'echidna', 'class': 'mammal' },
           { 'species' : 'bulbul', 'class': 'bird' }]

In [None]:
pd.DataFrame(py_list, columns = ['A', 'B'], index = [11,12,13])

In [None]:
# py_dict is a list of dictionaries, one per row; the keys become the column names
pd.DataFrame.from_dict(py_dict)

### Importing Data into Pandas
In addition to direct importing of Python lists, Pandas has a number of tools for importing (and exporting) data ([full listing here](https://pandas.pydata.org/pandas-docs/stable/io.html)):
1. pandas.read_csv
2. pandas.read_excel
3. pandas.read_json
4. pandas.read_sql
5. pandas.read_clipboard
6. ... and others...
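
The readers accept file-like objects as well as file names, which makes it easy to try them out without a file on disk. The following is a minimal sketch with `read_csv`; the text in `raw_text` is made-up sample data, and `sep='\t'` shows how tab-separated input can be handled.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
raw_text = "x\tsin_x\n0.0\t0.0\n1.0\t0.84\n2.0\t0.91\n"

# sep='\t' tells read_csv that the columns are tab-separated;
# the first line is used as the column names by default
demo_df = pd.read_csv(io.StringIO(raw_text), sep='\t')
print(demo_df)
```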

In [None]:
import pandas as pd
pd_data_df = pd.read_csv('data7.csv')
pd_data_df.head()

In [None]:
# We can print out information on the meta-data
print('Indexes:', pd_data_df.index)
print('Columns:', pd_data_df.columns)

In [None]:
# We can access the individual columns using the column names 
# (NB head() limits the number of lines)
pd_data_df['sin_x'].head()

In [None]:
# If the header is simple text, then we can also access it using a simple reference
pd_data_df.sin_x.head()

In [None]:
# We can identify which of the elements in a column meet a certain criterion 
pd_data_df['sin_x'] > 0

In [None]:
# We can then use this pattern (mask) to filter the data in the full table
pd_data_df[pd_data_df['sin_x'] > 0]
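
Masks can also be combined. The sketch below uses a small made-up DataFrame (`demo_df`, standing in for the contents of the data file, which are not shown here) and joins two conditions with `&`. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data file: x and sin(x) sampled at 8 points
x = np.linspace(0, 2 * np.pi, 8)
demo_df = pd.DataFrame({'x': x, 'sin_x': np.sin(x)})

# Each condition is a boolean Series; combine with & (and), | (or), ~ (not)
mask = (demo_df['sin_x'] > 0) & (demo_df['x'] < 2)
subset = demo_df[mask]
print(subset)
```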

### Applying functions to pandas DataFrames
One of the strengths of pandas dataframes is that you can carry out NumPy vectorised operations. For example, you can apply the NumPy `log10` operation to all the data in the dataframe. In the following example, we are extracting the first six rows from the dataframe above (using the `[]` notation) and are applying the NumPy function. 

Note that this will generate some warnings: log10 of zero is negative infinity (`-inf`), and log10 of a negative number is not valid and results in 'Not a Number' (`NaN`).

In [None]:
np.log10(pd_data_df[0:6])
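
The same behaviour can be reproduced on a small made-up DataFrame (`demo_df` is hypothetical). In this sketch `np.errstate` is used to silence the NumPy warnings for the zero and negative entries; whether to silence them or leave them visible is a matter of preference.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({'a': [100.0, 1.0, 0.0], 'b': [10.0, -1.0, 1000.0]})

# np.errstate silences the RuntimeWarnings for log10(0) and log10(negative)
with np.errstate(divide='ignore', invalid='ignore'):
    logged = np.log10(demo_df)

print(logged)          # still a DataFrame with the same row and column labels
print(logged.isna())   # True only where the input was negative
```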

### Cleaning data
***Note: this section is not covered by the video***

One of the standard problems with imported data is that it can include missing or irregular data (mis-typed or incorrectly configured). Cleaning up the data can be very time-consuming. NumPy and Pandas include tools to assist with this. On importing, bad data will either be replaced by an `NaN` data object (**n**ot **a** **n**umber), or will be present as inconsistent data.

This is a large topic. Some guidance on data cleaning is available at these web-sites:
* [Pandas Cookbook: Chapter 7: Cleaning data](https://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb)
* https://www.dataoptimal.com/data-cleaning-with-python-2018/
* https://realpython.com/python-data-cleaning-numpy-pandas/

In [None]:
# Create a dataframe from a list of lists containing building data.
# Note that the data is 'dirty' because some data is missing and because Storeys and GFA 
# are recorded in inconsistent formats

import numpy as np  # imported so that this cell can be run on its own...
import pandas as pd
dirty_data = [['Building_Name','GFA','Storeys','Zipcode'], ['Richland Bldg', '20000', 5, 98105],
              ['Prosper Court', '15,000', 4, '--'], ['Bumper Tower', '55,000sf', 12, '10045'],
              ['MegaMall', '214,000', '3F 1B' , ]]
# import into DataFrame and use first line as column titles (using list slicing)
dirty_df = pd.DataFrame(dirty_data[1:], columns = dirty_data[0]) 
dirty_df

In [None]:
# Example - cleaning up the GFA column
# We can clean up the GFA column by doing a string replace using a regex pattern that matches every  
# character that is not either a number or a dot - [0-9.] - the '^' reverses the selection
# We can also convert the data from strings to floats using the 'astype' function
dirty_df.GFA.str.replace('[^0-9.]', '', regex=True).astype(float)

In [None]:
# Finally we can create a cleaner dataframe where we replace the data in the dataframe
cleaner_df = dirty_df.copy()
cleaner_df.GFA = dirty_df.GFA.str.replace('[^0-9.]', '', regex=True).astype(float)
cleaner_df
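
A similar approach can be sketched for a column such as Zipcode, where some entries are placeholders or missing. The Series `zipcodes` below is made-up data, and `pd.to_numeric` with `errors='coerce'` is only one of several possible ways of handling it.

```python
import pandas as pd

# Hypothetical column mixing integers, numeric strings and placeholders
zipcodes = pd.Series([98105, '--', '10045', None], name='Zipcode')

# errors='coerce' turns anything that cannot be parsed into NaN
zip_numeric = pd.to_numeric(zipcodes, errors='coerce')
print(zip_numeric)

# Count the entries that are missing after cleaning
print('Missing values:', zip_numeric.isna().sum())
```

Note that converting codes to numbers drops any leading zeros, so for some identifiers it may be better to keep them as strings.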

### Applying User-Defined Functions to a DataFrame
It is possible to apply user-defined functions to a DataFrame. In the following example we will take the second and last columns and average the values row-by-row. Note that `apply` with `axis = 1` calls the function once for each row, so it is convenient but is not a truly vectorised calculation and can be slow on large DataFrames.

Functions can be defined in the normal way and referenced by name, or they can be included in the operation as an un-named function (a lambda function). The lambda function below is equivalent to the function `my_func`.

In [None]:
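def my_func(a):
    """Average the second and last elements of a row (passed in as a Series)"""
    # .iloc selects by position; plain a[1] on a labelled Series is deprecated in recent pandas
    return (a.iloc[1] + a.iloc[-1]) / 2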

In [None]:
res_df = pd_data_df.apply(my_func, axis = 1)
res_df.head()

Functions can also be defined using the `lambda` format. This is good for simple calculations, especially as it is more compact.
```python
lambda a: (a.iloc[1] + a.iloc[-1]) / 2
```

In [None]:
res_df = pd_data_df.apply(lambda a: (a.iloc[1] + a.iloc[-1]) / 2, axis = 1)
res_df.head()

In [None]:
# We can add this to the original dataframe by assigning it to a new column name
pd_data_df['Ave'] = pd_data_df.apply(my_func, axis = 1)
pd_data_df.head()
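
Row-wise `apply` calls the function once per row. For simple arithmetic like this, the same result can usually be obtained by operating on whole columns, which is typically much faster on large tables. The sketch below compares the two on a small made-up DataFrame (`demo_df` and its column names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data file
x = np.linspace(0, 1, 5)
demo_df = pd.DataFrame({'x': x, 'sin_x': np.sin(x), 'cos_x': np.cos(x)})

# Row-by-row: the function is called once for every row
by_apply = demo_df.apply(lambda a: (a.iloc[1] + a.iloc[-1]) / 2, axis=1)

# Column arithmetic: one operation on whole columns (second and last)
by_columns = (demo_df.iloc[:, 1] + demo_df.iloc[:, -1]) / 2

# Both approaches give the same values
print(np.allclose(by_apply, by_columns))
```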

### Organising data using *DataFrame.groupby*
***Note: this section is not covered by the video***

One powerful feature of Pandas is the ability to group data by one of the columns. This effectively treats the data as multidimensional. In the following example we will group the data and then summarise the data using the `sum` function.

Additional information is available:
* http://pandas.pydata.org/pandas-docs/stable/groupby.html
* [Pandas Cookbook: Ch 4: Groupby / Aggregate](https://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.2/cookbook/Chapter%204%20-%20Find%20out%20on%20which%20weekday%20people%20bike%20the%20most%20with%20groupby%20and%20aggregate.ipynb)

In [None]:
# Create the dataframe from a list of lists containing building (clean) data:
building_data = [['Building_Name', 'Type', 'GFA', 'Storeys', 'Zipcode'], ['Richland Bldg', 'Residential', 20000., 5, 98105],
              ['Prosper Court', 'Residential', 15000., 4, 98105], ['Bumper Tower', 'Commercial', 55000., 12, 10045],
              ['MegaMall', 'Commercial', 214000., 3, 10045], ['Mini-mall', 'Commercial', 75000., 1, 98105]]
df = pd.DataFrame(building_data[1:], columns = building_data[0])
df

In [None]:
# Sort into groups according to 'Zipcode' and report the group contents
df.groupby(['Zipcode']).groups

In [None]:
# After grouping by Zipcode, calculate the sum of the GFA for each Zipcode.  
df.groupby(['Zipcode'])['GFA'].sum()
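
Several summaries can be calculated in one pass with `agg`. The sketch below groups a reduced copy of the building data by `Type` instead of `Zipcode`; the output column names (`total_gfa`, `mean_gfa`, `max_storeys`) are made up for the illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Type': ['Residential', 'Residential', 'Commercial', 'Commercial', 'Commercial'],
    'GFA': [20000., 15000., 55000., 214000., 75000.],
    'Storeys': [5, 4, 12, 3, 1],
})

# Named aggregation: output column = (input column, aggregation function)
summary = demo_df.groupby('Type').agg(
    total_gfa=('GFA', 'sum'),
    mean_gfa=('GFA', 'mean'),
    max_storeys=('Storeys', 'max'),
)
print(summary)
```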

## Conclusions
We have seen how there are libraries that assist in working with large datasets. NumPy can bring significant savings in calculations, especially if calculations are vectorised instead of processed as for-loops.

Note that additional time savings may be made in some cases by using the [Numba library](http://numba.pydata.org/), which can not only speed up vectorised formulas but can also send calculations to multiple processors (parallel processing) and make use of GPUs and clusters (such as through the [Dask library](http://docs.dask.org/en/latest/)).

## Exercise
1. Read the data from the accompanying file called `Traffic_Data.txt` (containing tab-separated data) into a pandas dataframe (note that the first line contains the vehicle categories). 
2. Clean up the missing data.
3. Create new columns for commercial vehicles (sum of trucks, buses and taxis) and private vehicles (sum of cars and motorcycles).
4. Plot a bar chart of commercial and private vehicles against the hour.

### Reference Material
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
* https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.plot.bar.html