# Making Sense of Python for Data Science with NumPy and Pandas


#### Calin Groza<br/>
Feb. 21, 2019


--------


## Introduction


#### Background
 * experience with imperative languages: C, C++, Java
 * using RDBMS (Oracle, MySQL) to handle medium to large (but not very large) amounts of data
 * interested in machine learning

#### First Reaction Looking at the Data Science/Machine Learning Examples
  * very few control structures: if, for, while
  * extensive method overloading, i.e. the same method accepts multiple parameter types
  * concise code using operators like _indexing_ \[ \] and _slice_ .. : .. : ..
  * in-place changes to one object that contains all the data
  * easy to read, hard to write
  
#### What Is Covered in this Notebook
  * some of the unusual Python constructs commonly used with NumPy and Pandas
  * the origin of and rationale for these constructs
  * differences between Java/SQL and Python/NumPy/Pandas in implementing typical data manipulation tasks
  
-----------------------------------
 

## Access to the Presentation


 * **Option 1 - online**
      * go to the github page - https://github.com/flowidcom/pytorials/blob/master/pt10-data-science-idioms/MakingSenseOfPythonForDataScience.ipynb - and follow the presentation online
      * the web page is a saved snapshot of the notebook and cannot run the code fragments
 * **Option 2 - download a zip file with the presentation and data**
      * download the zip file from: https://github.com/flowidcom/pytorials/releases/tag/release-1.4
      * unzip the file on the local machine in the Jupyter notebook area
      * launch Jupyter, navigate to the extracted folder and open the notebook pt10-data-science-idioms/MakingSenseOfPythonForDataScience.ipynb
      * further instructions on how to install a Jupyter zip file locally are available at  https://github.com/ylabrj/ylab_Jupyter_Intro/blob/master/Ylab%20how%20to%20start%20with%20Jupyter%20Notebook.ipynb
 * **Option 3 - clone the git repository**
      * GitHub repository is at https://github.com/flowidcom/pytorials
      * launch Jupyter on the local computer and point to the local git repository
      
---------

## Python for Everyone

Python is a general-purpose programming language similar to (and slightly different from) C++ and Java:
* imperative - statements, if, for, while
* object-oriented:
    * like C++, Java
        * classes
        * modules (like packages in Java)
    * similar to C++ and unlike Java:
        * supports operator overloading through special methods, e.g. \_\_getitem\_\_, \_\_setitem\_\_
    * unlike C++ and unlike Java:
        * some built-in types that have syntactic support: slice (1:5:2), ellipsis (...)
        * dynamically typed: variables carry no declared type, and types are checked at run time
* functional: lambda expressions
* modular, with a rich set of open-source libraries
* emphasis on readability
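
To make the operator-overloading and slice bullets concrete, here is a minimal sketch. The class `KeyEcho` is hypothetical (it is not part of Python or NumPy); it simply returns whatever key the indexing operator hands to `__getitem__`, so the built-in slice and ellipsis objects become visible.

```python
class KeyEcho:
    """Hypothetical helper: returns the key that the [] operator receives."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

single = echo[3]          # a plain integer
sliced = echo[1:5:2]      # a slice object
pair = echo[:, 0:-1]      # an implicit tuple of two slices
dots = echo[..., 0]       # the Ellipsis built-in followed by an integer

print(single)
print(sliced)
print(pair)
print(dots)
```

Libraries such as NumPy and Pandas receive exactly these objects and decide for themselves how to interpret them.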

-------

## What Is Machine Learning?

![Machine Learning Humor](img/MachineLearningHumor.png)

## Python Saves the Day

![Drivers for Python for Data Science](img/PythonDataScienceSources.png)


-----

## From MatLab to NumPy


*MatLab* is a high-level language (primarily) for linear algebra. 
  Matrix and vector are first-class entities in MatLab.

*NumPy* is a Python library for linear algebra. Matrices and vectors are not first-class objects in Python, 
   but through extensive use of the index (\[ \]) operator and the slice type, 
   the NumPy library brings MatLab concepts and syntax into the Python domain.


#### Example - Linear Regression

The following is an example from the Machine Learning course on Coursera by Andrew Ng.

![Cost Function for Linear Regression](img/LinearRegressionCostFunction.png)
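
For readers who cannot see the image, the formula can be reconstructed from the `computeCost` function later in this notebook. The notation below is an assumption based on the usual conventions of the course: $m$ training examples, hypothesis $h_\theta(x) = \theta^T x$, and $X$ the data matrix with a leading column of ones.

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2m} (X\theta - y)^T (X\theta - y)
$$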

The MatLab/Octave program to calculate the cost function is:

![Linear Regression MatLab](img/LinearRegressionOctave.png)

Notes on this example:
 * this is a screenshot of another Jupyter notebook available 
    at [LinearRegressionOctaveNotebook](LinearRegression1-Octave.ipynb).
    That notebook requires the Octave Kernel to be installed in Jupyter
 * notice the matrix operations - slice, concatenation, initialization with zeros, multiplication with scalar, multiplication of matrices
 
------

## Linear Regression in NumPy


In [1]:
import numpy as np

In [2]:
data = np.loadtxt('real-estate-prices.csv', delimiter=',')
X = data[:, 0:-1]  # this is syntactic sugar for X = data.__getitem__((slice(None, None), slice(0, -1)))
y = data[:, -1]
m = len(y)

In [3]:
print('Dimensions:', X.shape, y.shape)  # Notice the shapes - 2d vs 1d

Dimensions: (47, 1) (47,)


In [4]:
X = np.hstack([np.ones((m, 1)), X])  # concatenate ones to the left, theta_0

In [5]:
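def computeCost(X, y, theta):
    m = len(y)
    # y[:, np.newaxis] turns the 1-d array y into an (m, 1) column to match X @ theta
    J = 1 / (2 * m) * sum((X @ theta - y[:, np.newaxis]) ** 2)
    # The expression inside sum() is syntactic sugar for
    #         X.__matmul__(theta).__sub__(y.__getitem__((slice(None, None), np.newaxis))).__pow__(2)
    return J.item()  # J is a one-element array; np.asscalar has been removed from recent NumPy versions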


In [6]:
theta = np.zeros([2, 1]) # array of zeros, 2 rows, 1 column

In [7]:
J = computeCost(X, y, theta)

In [8]:
print('Cost: ', J)

Cost:  65591548106.45744


What allowed the program to be concise?
* Object-oriented Python lets the same method or operator have different semantics depending on the class
* Syntactic sugar for: \_\_getitem\_\_, slice, \_\_sub\_\_ and \_\_matmul\_\_
* Use of an implicit tuple between \[ \]. \_\_getitem\_\_ takes only one parameter, but because of the implicit tuple it looks as if two parameters are passed to the indexing call
* Method/operator overloading with different parameter types
* Module-level functions in NumPy that play the role of static factory methods in Java: np.zeros, np.ones
* Polymorphic built-in functions: _sum_ accepts any Iterable and _len_ any object that defines \_\_len\_\_. In this case the object is a NumPy array
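
The bullets above can be checked on a small array. This is a minimal sketch with made-up values; it shows that the operator syntax and the explicit special-method call return the same data, and that the built-in _sum_ treats a 2-d array as an iterable of rows, which is why `computeCost` ends up with a one-element array rather than a plain number.

```python
import numpy as np

# Made-up values, only for illustration.
a = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Operator syntax and the explicit special-method call give the same result.
sugar = a[:, 0:-1]
explicit = a.__getitem__((slice(None, None), slice(0, -1)))

# The built-ins treat the 2-d array as a sequence of rows.
n_rows = len(a)          # number of rows
col_totals = sum(a)      # adds the rows element-wise: one total per column
grand_total = np.sum(a)  # NumPy's own sum collapses every axis by default

print(sugar.shape, n_rows, col_totals, grand_total)
```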


Python constructs that are common to data analysis programs:
* the initial data read, followed by slicing the data into input features and outcome. Note how a negative index selects the last column
* no Python loops; use NumPy methods instead
   * first, because it makes the code more concise and reduces the chance of off-by-one errors at the boundaries
   * second, because of performance - vectorization
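
As a small illustration of the "no loops" idiom, the sketch below computes the same ratio twice, once with an explicit loop and once with a single array expression. The four sizes and prices are the first rows shown in this notebook's output; the variable names are chosen for this example only.

```python
import numpy as np

sizes = np.array([2104.0, 1600.0, 2400.0, 1416.0])
prices = np.array([399900.0, 329900.0, 369000.0, 232000.0])

# Loop version: explicit index handling, easy to get the bounds wrong.
ratio_loop = np.empty(len(sizes))
for i in range(len(sizes)):
    ratio_loop[i] = prices[i] / sizes[i]

# Vectorized version: one expression, evaluated in compiled code.
ratio_vec = prices / sizes

print(np.allclose(ratio_loop, ratio_vec))
```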


Notes when converting code from MatLab to Python:
* operations between 1d-arrays and 2d-arrays are not semantically the same in MatLab and Python.
Compare X * theta - y in MatLab with X @ theta - y[:, np.newaxis] in NumPy
* some keywords from MatLab do not exist in Python, e.g. end; NumPy uses negative indices instead
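
The difference between 1-d and 2-d arrays is easy to miss because NumPy does not raise an error; it broadcasts. The sketch below uses made-up numbers to show what happens when the `np.newaxis` is forgotten, and how a negative index stands in for MatLab's `end`.

```python
import numpy as np

# Made-up values, only for illustration.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # shape (3, 2)
theta = np.array([[0.5], [2.0]])    # shape (2, 1)
y = np.array([2.0, 4.0, 6.0])       # shape (3,), a 1-d array

pred = X @ theta                    # shape (3, 1)

wrong = pred - y                    # (3, 1) and (3,) broadcast to (3, 3)
right = pred - y[:, np.newaxis]     # (3, 1) minus (3, 1) stays (3, 1)

last = y[-1]                        # MatLab would write y(end)

print(wrong.shape, right.shape, last)
```

Checking `.shape` after each step, as the notebook does in cell 3, is a cheap way to catch this kind of silent broadcasting.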



------

## From R and SQL to Pandas

![Technical Difficulties](img/TechnicalDifficulties.png)

Pandas is a Python library for data manipulation and analysis. Its core data structure - DataFrame - is inspired by and similar to the R data.frame object.

Wes McKinney - creator of pandas:

>\[...\] many features found in pandas are typically either part of the R core implementation or provided by add-on packages
    
>The pandas name itself is derived from _panel_ _data_, an econometric term for multidimensional structured datasets and a play on the phrase Python Data Analysis.

#### SQL to Pandas - Concept Mapping

* a DataFrame looks like a table in the relational database, i.e. a large set of rows with the same record structure
* a Series is a one-dimensional labeled array; it can represent a row, but also a column, of a DataFrame ... which makes it quite confusing at first
* the SELECT/WHERE clause is implemented by overloading the indexing operator (\[ \]) or using the indexer attributes _loc_ and _iloc_.
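
The mapping above can be sketched on a tiny DataFrame. The rows are the first four shown in this notebook's output; the variable names and the SQL statement in the comment are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

houses = pd.DataFrame({'Size': [2104, 1600, 2400, 1416],
                       'Price': [399900, 329900, 369000, 232000]})

# Roughly: SELECT Price FROM houses WHERE Size > 2000
by_label = houses.loc[houses['Size'] > 2000, 'Price']

# Position-based access: first two rows, first column
by_position = houses.iloc[0:2, 0]

# A single column and a single row are both Series objects
column = houses['Size']
row = houses.loc[0]

print(by_label.tolist(), by_position.tolist())
print(type(column).__name__, type(row).__name__)
```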

#### Query Example: find all the rows where the first column (Size) is between 1000 and 2000

In [9]:
import pandas as pd
import numpy as np

In [10]:
df = pd.read_csv('real-estate-prices.csv', header=None, names = ['Size', 'Price'])

In [11]:
df[(df['Size'] > 1000) & (df['Size'] < 2000)].head()

|   | Size | Price  |
|---|------|--------|
| 1 | 1600 | 329900 |
| 3 | 1416 | 232000 |
| 5 | 1985 | 299900 |
| 6 | 1534 | 314900 |
| 7 | 1427 | 198999 |


#### Update Example: add a new column representing the price per square foot

In [12]:
df['PricePerSquareFoot'] = df['Price'] / df['Size']

In [13]:
df.head()

|   | Size | Price  | PricePerSquareFoot |
|---|------|--------|--------------------|
| 0 | 2104 | 399900 | 190.06654          |
| 1 | 1600 | 329900 | 206.1875           |
| 2 | 2400 | 369000 | 153.75             |
| 3 | 1416 | 232000 | 163.841808         |
| 4 | 3000 | 539900 | 179.966667         |


Notes:
* the indexing (\[ \]) operator  is overloaded in several ways:
    * given a string, it returns a column
    * given a boolean vector (as a pandas.Series) it returns all the rows where the vector value is true
    * when it is on the left-hand side of an assignment, the method invoked is \_\_setitem\_\_
* the arithmetic and boolean operators are overloaded to apply to the entire set of rows
* very easy to create new columns
* logical indexing is the preferred way to do queries - it is vectorized

Python idioms:
* query/filter data set - this is very common
* create a new column with a derived value


#### Complex Query Example: return the top 5 most expensive houses larger than 3000 square feet


In [14]:
df = pd.read_csv('real-estate-prices.csv', header=None, names = ['Size', 'Price'])

In [15]:
top5 = df[['Price']] \
    [df['Size'] > 3000] \
    .sort_values(by='Price', ascending=False) \
    .head(5) 

In [16]:
top5

|    | Price  |
|----|--------|
| 13 | 699900 |
| 19 | 599000 |
| 33 | 579900 |
| 24 | 573900 |
| 38 | 549000 |


Notes:
* Notice the sequence of operations, called chained invocation (method chaining); it helps make the code more concise
* ... but it raises a question: is the variable top5 a copy of the data or a view into the variable df?

In [17]:
top5['Type'] = 'Luxury' # create a new column and set the value

In [18]:
df.loc[13]

Size       4478
Price    699900
Name: 13, dtype: int64

Notice that the value was not set in df: top5 is a copy, not a view into df.
The correct way is to use the _loc_\[ \] indexer on df, which is guaranteed to operate on the object itself.

In [19]:
df.loc[top5.index, 'Type'] = 'Luxury'

In [20]:
df.loc[13]

Size       4478
Price    699900
Type     Luxury
Name: 13, dtype: object

Conclusion: to update a subset of rows in a DataFrame, always use a single df.loc or df.iloc assignment rather than chaining indexing operators.

#### Other SQL-like Methods in Pandas

Joining data sets is supported through the pandas method called _merge_. LEFT and RIGHT joins are supported in a similar way to SQL.
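
A minimal sketch of _merge_ follows. The two tables, their column names and the SQL statements in the comments are hypothetical, invented for this example; only the `merge` call with its `on` and `how` parameters comes from pandas.

```python
import pandas as pd

# Hypothetical tables, invented for this example.
houses = pd.DataFrame({'HouseId': [1, 2, 3],
                       'Size': [2104, 1600, 2400]})
agents = pd.DataFrame({'HouseId': [1, 3, 4],
                       'Agent': ['Ana', 'Bob', 'Cy']})

# Roughly: SELECT * FROM houses LEFT JOIN agents USING (HouseId)
left = houses.merge(agents, on='HouseId', how='left')

# Roughly: SELECT * FROM houses INNER JOIN agents USING (HouseId)
inner = houses.merge(agents, on='HouseId', how='inner')

print(left)
print(inner)
```

In the left join, the house without an agent keeps its row and gets a missing value (NaN) in the Agent column, much like NULL in SQL.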

The concept of an index is explicit in Pandas, while it operates behind the scenes in SQL. The index can be removed (reset) or rebuilt (re-indexed), and it can be multi-level.

The pandas method _groupby_ implements a function similar to GROUP BY in SQL.
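
The sketch below shows _groupby_ next to its SQL counterpart, and also how the explicit index shows up in the result. The data and column names are hypothetical, invented for this example.

```python
import pandas as pd

# Hypothetical data, invented for this example.
sales = pd.DataFrame({'City': ['Austin', 'Austin', 'Boston', 'Boston'],
                      'Price': [300000, 500000, 400000, 800000]})

# Roughly: SELECT City, AVG(Price) FROM sales GROUP BY City
avg_price = sales.groupby('City')['Price'].mean()

# The grouping column became the index of the result;
# reset_index() turns it back into an ordinary column.
as_table = avg_price.reset_index()

print(avg_price)
print(as_table)
```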


------

## Functional Programming and Pandas

Python has capabilities for functional programming:
  * Functions are objects
  * Can be assigned to variables
  * Can be passed as parameters to methods
  * functions can be anonymous - lambda functions

Pandas:
 * allows functions (callables) to be used in many places where conditions are allowed
 * this is most useful in indexing

The following example selects all the rows where column A is a prime number:

In [62]:
N = 100000
ab = pd.DataFrame({'A': np.arange(1, N), 'B': np.arange(1, N) ** 2})
ab.head()

|   | A | B  |
|---|---|----|
| 0 | 1 | 1  |
| 1 | 2 | 4  |
| 2 | 3 | 9  |
| 3 | 4 | 16 |
| 4 | 5 | 25 |


In [63]:
def is_prime(n):
    if n == 2:
        return True
    if n % 2 == 0 or n <= 1:
        return False

    sqr = int(n ** 0.5) + 1

    for divisor in range(3, sqr, 2):
        if n % divisor == 0:
            return False
    return True

In [64]:
ab[lambda df : df.apply(lambda row: is_prime(row['A']), axis = 1)].head()

|    | A  | B   |
|----|----|-----|
| 1  | 2  | 4   |
| 2  | 3  | 9   |
| 4  | 5  | 25  |
| 6  | 7  | 49  |
| 10 | 11 | 121 |


But in some cases using a function may be much slower than the alternative vectorized expression.

In [65]:
%timeit ab['C'] = ab['A'] + ab['B']

825 µs ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [66]:
%timeit ab['C'] = ab.apply(lambda row: row['A'] + row['B'], axis = 1)

2.23 s ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The reason for the difference is that the indexing and addition operators are implemented in C and are very fast. The apply/lambda function is interpreted in Python and needs to be evaluated for every row.
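
The same reasoning applies to the prime-number filter. As a hedged sketch of one possible alternative (a swapped-in technique, not the notebook's method), the rows can be selected with a boolean mask built by a sieve of Eratosthenes. The function name `prime_mask` is hypothetical. The sieve still contains a short Python loop over candidates up to the square root of the limit, but the per-row work is done with slice assignments and array indexing.

```python
import numpy as np
import pandas as pd


def prime_mask(limit):
    """Boolean array where mask[n] is True when n is prime, for 0 <= n <= limit."""
    mask = np.ones(limit + 1, dtype=bool)
    mask[:2] = False                      # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if mask[p]:
            mask[p * p::p] = False        # cross out multiples of p in one step
    return mask


N = 100000
ab = pd.DataFrame({'A': np.arange(1, N), 'B': np.arange(1, N) ** 2})

# Look up every value of column A in the precomputed mask (integer array indexing).
primes = ab[prime_mask(N)[ab['A'].to_numpy()]]
print(primes.head())
```

No timing is claimed here; running `%timeit` on both versions in the notebook is the way to compare them.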

## Conclusions

The Data Science/Machine Learning programs are valid, if somewhat cryptic, Python code.

![Why it works](img/WhyItWorks.png)

Compared with other programs:
 * there are fewer control statements: _if_s and _for_s
 * the programs are sequences of calls to NumPy and Pandas methods that implement matrix or data frame manipulations
 * the NumPy and Pandas libraries overload arithmetic and logical operators to apply to entire multi-dimensional arrays or data frames

Takeaways for C++ and Java developers:
 * when trying to execute an operation on the whole dataset, look for a dedicated NumPy or Pandas method; don't try to iterate over the dataset or to use the _apply_ method
 * look at ways to express the application logic in terms of linear algebra
 * use Jupyter to explore data and try different options before writing long programs
 * build a collection of code fragments for specific tasks. It is easier to go back to an example than to read the documentation
 

## Next Steps

Books:
  * Python for Data Analysis - Wes McKinney, the original author of Pandas
  * Learning IPython for Interactive Computing and Data Visualization - Cyrille Rossant 
  
Web sites:
  * DataFrame join/merge: https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/
  * Conversion from MatLab to NumPy: http://mathesaurus.sourceforge.net/matlab-numpy.html

Check other notebooks in the Pytorials repository:
  * [Linear Regression using NumPy](LinearRegression1-NumPy.ipynb)
  * [Data Frame Queries using Pandas](DataFrameQueries-Pandas.ipynb)
  * [Linear Regression using Octave](LinearRegression1-Octave.ipynb)
  * ... and more
