# CPU Bound Programs

## Bounds vs. Limitations

In the last course, we covered the idea of memory limitations and worked out some strategies to overcome them with Pandas. As a quick refresher, **a memory limitation is when a dataset won't fit into the memory available on your computer**. When this happens, you need to rely on workarounds, like:
* processing the data in batches that do fit into memory
* relying on tools, like SQLite, that keep the data on disk instead of in memory while doing processing

The important thing to note here is that **available memory is a hard limitation on what's possible to process**. If you have a 6 gigabyte dataset and 4 gigabytes of memory, there's no way you can load your data into Pandas and process it without using a workaround.

In this course, we'll cover **the idea of program bounds**. A program bound is similar to a limitation in that it affects how you're able to process your data. **However, a program bound isn't a hard limitation -- if your program is bound, your computer will still eventually be able to process the data**. A program bound mostly limits *how quickly the program can be executed*. There are two primary ways a program can be bound that you'll need to be aware of:

* **CPU-bound** -- A CPU-bound program will be dependent on your CPU to execute quickly. **The faster your processor is, the faster your program will be**.
* **I/O-bound** -- An I/O-bound program will be dependent on external resources, like files on disk and network services, to execute quickly. **The faster these external resources can be accessed, the faster your program will run**.

As you work with larger datasets, understanding these bounds and how to make your program more efficient to deal with them is critical. **Relatively simple optimizations can mean the difference between processing a gigabyte of data in 30 minutes or in 30 seconds**.
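One rough way to see the difference between the two bounds is to compare wall-clock time with CPU time. The sketch below is only an illustration: `measure`, `cpu_bound_task`, and `io_bound_task` are hypothetical helpers, and `time.sleep` merely stands in for waiting on a disk or network service.

```python
import time


def measure(task, *args):
    """Return (wall_seconds, cpu_seconds) spent running task(*args)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, cpu_seconds


def cpu_bound_task(n):
    # Hypothetical workload: pure arithmetic keeps the processor busy.
    total = 0
    for value in range(n):
        total += value * value
    return total


def io_bound_task(delay):
    # Hypothetical workload: sleeping stands in for waiting on an external resource.
    time.sleep(delay)


cpu_wall, cpu_cpu = measure(cpu_bound_task, 200_000)
io_wall, io_cpu = measure(io_bound_task, 0.2)

print(f"CPU-bound: wall={cpu_wall:.3f}s cpu={cpu_cpu:.3f}s")
print(f"I/O-bound: wall={io_wall:.3f}s cpu={io_cpu:.3f}s")
```

For the CPU-bound task, the two timings should be close to each other. For the simulated I/O-bound task, almost all of the wall-clock time is spent waiting, so the CPU time should be far smaller.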

Here's a diagram that shows how various components work together:

![bounds-vs-limitations](https://s3.amazonaws.com/dq-content/168/CPU+and+I_O+bounds.png)

As you can see above, **every time you load data from disk into memory, then process it, it travels through the I/O bridge twice, which takes time**. The more efficient you make your code, the fewer round trips need to be made, and the faster your code will run.

You can make your code more efficient by minimizing how many times you access data, or by ensuring that the processor has to run fewer instructions.

In this mission, we'll learn more about CPU-bound programs, and how we can understand and improve their performance.

## The Dataset

In this mission, we'll be working with a dataset of search terms and matching products from [CrowdFlower](https://www.crowdflower.com/data-for-everyone/). This dataset is an expanded version of the data used for a [Kaggle competition](https://www.kaggle.com/c/crowdflower-search-relevance). The full dataset contains `267373` rows, each of which represents a search for a product, with the search query used and the search result given. Each search result was scored for its relevance to the original query. There are quite a few columns in the dataset, but we're mainly interested in the `query` column. Here's a look at all of the columns:

```python
import pandas as pd

# Load the dataset and preview the first five rows.
ecommerce = pd.read_csv('../data/ecommerce5000.csv', encoding='latin-1')
ecommerce.head()
```

| Unnamed: 0.1 | Unnamed: 0 | _unit_id | relevance | relevance:variance | product_image | product_link | product_price | product_title | query | rank | source | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 711158459 | 3.67 | 0.471 | `http://thumbs2.ebaystatic.com/d/l225/m/mzvzEUI...` | `http://www.ebay.com/itm/Sony-PlayStation-4-PS4...` | `$329.98` | Sony PlayStation 4 (PS4) (Latest Model)- 500 G... | playstation 4 | 1 | eBay | `http://www.ebay.com/sch/i.html?_from=R40&_trks...` |
| 1 | 1 | 711158460 | 4.0 | 0.0 | `http://thumbs3.ebaystatic.com/d/l225/m/mJNDmSy...` | `http://www.ebay.com/itm/Sony-PlayStation-4-Lat...` | `$324.84` | Sony PlayStation 4 (Latest Model)- 500 GB Jet ... | playstation 4 | 2 | eBay | `http://www.ebay.com/sch/i.html?_from=R40&_trks...` |
| 2 | 2 | 711158461 | 4.0 | 0.0 | `http://thumbs4.ebaystatic.com/d/l225/m/m10NZXA...` | `http://www.ebay.com/itm/Sony-PlayStation-4-PS4...` | `$324.83` | Sony PlayStation 4 PS4 500 GB Jet Black Console | playstation 4 | 3 | eBay | `http://www.ebay.com/sch/i.html?_from=R40&_trks...` |
| 3 | 3 | 711158462 | 3.67 | 0.471 | `http://thumbs2.ebaystatic.com/d/l225/m/mZZXTmA...` | `http://www.ebay.com/itm/Sony-PlayStation-4-500...` | `$350.00` | Sony - PlayStation 4 500GB The Last of Us Rema... | playstation 4 | 4 | eBay | `http://www.ebay.com/sch/i.html?_from=R40&_trks...` |
| 4 | 4 | 711158463 | 3.33 | 0.471 | `http://thumbs3.ebaystatic.com/d/l225/m/mzvzEUI...` | `http://www.ebay.com/itm/Sony-PlayStation-4-PS4...` | `$308.00\nTrending at\n$319.99` | Sony PlayStation 4 (PS4) (Latest Model)- 500 G... | playstation 4 | 5 | eBay | `http://www.ebay.com/sch/i.html?_from=R40&_trks...` |


## Finding duplicate values

To illustrate the idea of CPU bounds, let's start with a task that we've done numerous times before -- finding which values in a column are duplicates.

With Pandas, you can use methods like [pandas.DataFrame.duplicated](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html) to find duplicate values in columns. You can also find duplicate values using the [GROUP BY](https://www.sqlite.org/lang_select.html) statement in a SQL query. However, there are cases when you'll have to write your own function to find duplicate values:

* You have complex custom logic around what constitutes a duplicate.
* Your dataset doesn't fit into memory, and would take too long to batch process.
* Your data is streaming, and you want to find duplicates in realtime.
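For comparison, the `GROUP BY` approach mentioned above might look like the following sketch. It uses Python's built-in `sqlite3` module with an in-memory database. The table name `searches`, the column name `search_term`, and the sample values are all hypothetical and are not taken from the dataset.

```python
import sqlite3

# Hypothetical sample values standing in for the `query` column.
sample_queries = ["playstation 4", "xbox one", "playstation 4", "kindle", "xbox one"]

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE searches (search_term TEXT)")
conn.executemany(
    "INSERT INTO searches (search_term) VALUES (?)",
    [(term,) for term in sample_queries],
)

# Group identical values together and keep only groups with more than one row.
rows = conn.execute(
    """
    SELECT search_term, COUNT(*) AS occurrences
    FROM searches
    GROUP BY search_term
    HAVING COUNT(*) > 1
    ORDER BY search_term
    """
).fetchall()
conn.close()

print(rows)
# → [('playstation 4', 2), ('xbox one', 2)]
```

This works well when the built-in notion of equality is enough. The cases listed above are the ones where it may not be.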

A simple algorithm for finding duplicates is to:
* Create a list to store duplicate items.
* Loop through each item in the `query` column (the outer loop).
  * Loop through each item in the `query` column again (the inner loop).
    * If the inner loop and the outer loop are on the same position, keep going.
    * If a match is found, mark the outer item as a duplicate.
  * If the outer item was marked as a duplicate, add it to the duplicates list.

The above algorithm will iterate through each value in the `query` column, and compare it to every other value in the `query` column. This will help us find duplicates. The algorithm results in a nested for loop, like the one below:

```python
# Initialize a list to store our duplicates.
duplicates = []

# Loop through each item in the query column.
for i, item in enumerate(query):
    duplicate = False

    # Loop through each item in the query column.
    for z, item2 in enumerate(query):
        # If the outer and inner loops are on the same value, keep going.
        # Without this, we'll falsely detect rows as duplicates.
        if i == z:
            continue
        # Mark as duplicate if we find a match.
        if item == item2:
            duplicate = True
    # Add to the duplicates list.
    if duplicate:
        duplicates.append(item)
```
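To try the algorithm on a small input, it can be wrapped in a function. This is a sketch only: the function name `find_duplicates_nested` and the sample list are hypothetical, and the logic is the same nested loop as above.

```python
def find_duplicates_nested(values):
    """Return every item in values that also appears at another position."""
    duplicates = []
    for i, item in enumerate(values):
        duplicate = False
        for z, item2 in enumerate(values):
            # Skip comparing an item with itself.
            if i == z:
                continue
            if item == item2:
                duplicate = True
        if duplicate:
            duplicates.append(item)
    return duplicates


# Hypothetical sample data, not taken from the dataset.
sample = ["ps4", "xbox", "ps4", "kindle", "xbox", "ps4"]
print(find_duplicates_nested(sample))
# → ['ps4', 'xbox', 'ps4', 'xbox', 'ps4']
```

Note that every occurrence of a repeated value is appended, so a value that appears three times shows up three times in the result.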

If we want to optimize our code later, we need to be able to figure out how long our code is taking to run. One easy way to do this is to count up the number of "operations" our code is performing. Let's say that an "operation" is any time we:
* Assign a value to a variable
* Modify a variable value
* Check if two variables are equal

Let's look through our code from above line by line, and see where the "operations" occur:

* `duplicates = []` -- this assigns to a variable, so it is an operation.
* `duplicate = False` -- this assigns to a variable, so it is an operation. Note that this occurs inside the for loop, so this operation could be called many times.
* `if i == z` -- this checks if two variables are equal, so it's an operation. This occurs inside 2 for loops, so this operation will be called many times.
* `if item == item2` -- this checks if two variables are equal as well, so it's an operation. This also occurs inside two for loops.
* `duplicate = True` -- this assigns to a variable, so it's an operation. This is inside two for loops, but also an if statement, so it may not be called that many times.
* `if duplicate` -- this checks whether `duplicate` is true, so it's an operation. It's only inside one for loop.
* `duplicates.append(item)` -- this modifies a list, so it's an operation. This is only inside one for loop.

We can count up how many times each of these operations occurs by incrementing a counter just before the operation occurs. Here's an example where we count up how many times the `i == z` operation is called:

```python
iz_operations = 0
# Initialize a list to store our duplicates
duplicates = []

# Loop through each item in the query column.
for i, item in enumerate(query):
    duplicate = False

    # Loop through each item in the query column.
    for z, item2 in enumerate(query):
        # If the outer and inner loops are on the same value, keep going.
        # Without this, we'll falsely detect rows as duplicates.
        iz_operations += 1
        if i == z:
            continue
        # Mark as duplicate if we find a match.
        if item == item2:
            duplicate = True
    # Add to the duplicates list.
    if duplicate:
        duplicates.append(item)
```

Let's count up and print how many times each operation is called, so we can see what parts of our code take the longest to run. We've read the `query` column into the `query` variable, which is a list, and only kept the first `5000` values.

* Initialize the following operation counter variables:
  * `iz_operations` -- Count how many times the `i == z` check is performed.
  * `item_operations` -- Count how many times the `item == item2` check is performed.
  * `duplicates_init` -- Count how many times the `duplicates = []` operation is performed.
  * `duplicates_false` -- Count how many times the `duplicate = False` operation is performed.
  * `duplicates_true` -- Count how many times the `duplicate = True` operation is performed.
  * `if_duplicate` -- Count how many times the `if duplicate` operation is performed.
  * `duplicates_append` -- Count how many times the `duplicates.append(item)` operation is performed.
* Perform the duplicate checking from above.
  * Add in code to count up how many times each operation occurs.
* View the operation counter variables in the variable inspector.
* Do you see any interesting trends or patterns in the operation counts?

```python
# Keep the first 5000 values of the query column as a plain list.
query = list(ecommerce['query'][:5000])
```

```python
# Operation counters, all starting at zero.
iz_operations = 0
item_operations = 0
duplicates_init = 0
duplicates_false = 0
duplicates_true = 0
if_duplicate = 0
duplicates_append = 0
```

```python
# Initialize a list to store our duplicates
duplicates_init += 1
duplicates = []

# Loop through each item in the query column.
for i, item in enumerate(query):
    duplicates_false += 1
    duplicate = False

    # Loop through each item in the query column.
    for z, item2 in enumerate(query):
        # If the outer and inner loops are on the same value, keep going.
        # Without this, we'll falsely detect rows as duplicates.
        iz_operations += 1
        if i == z:
            continue

        # Mark as duplicate if we find a match.
        item_operations += 1
        if item == item2:
            duplicates_true += 1
            duplicate = True

    # Add to the duplicates list.
    if_duplicate += 1
    if duplicate:
        duplicates_append += 1
        duplicates.append(item)
```
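To see how these counts change with the size of the input, the same bookkeeping can be wrapped in a function and run on lists of different lengths. This is a sketch under assumptions: `count_operations` is a hypothetical helper that stores the counters in a `collections.Counter`, and the inputs are made-up lists with no repeated values so that only the loop structure affects the counts.

```python
from collections import Counter


def count_operations(values):
    """Run the nested-loop duplicate check and tally each operation."""
    counts = Counter()
    counts["duplicates_init"] += 1
    duplicates = []
    for i, item in enumerate(values):
        counts["duplicates_false"] += 1
        duplicate = False
        for z, item2 in enumerate(values):
            counts["iz_operations"] += 1
            if i == z:
                continue
            counts["item_operations"] += 1
            if item == item2:
                counts["duplicates_true"] += 1
                duplicate = True
        counts["if_duplicate"] += 1
        if duplicate:
            counts["duplicates_append"] += 1
            duplicates.append(item)
    return counts


# Hypothetical inputs of increasing length, each with no repeated values.
for n in (10, 100, 1000):
    counts = count_operations(list(range(n)))
    print(n, counts["duplicates_false"], counts["iz_operations"], counts["item_operations"])
```

For an input of length `n`, the outer-loop counters come out to `n`, while `iz_operations` comes out to `n * n` and `item_operations` to `n * n - n`.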

## Big O notation

As we mentioned earlier, our `query` list has `5000` values. This means that some operations happened once per value (in the "outer" for loop). Other operations, like checking if `i == z` and `item == item2`, ran about `25000000` times, though. This is because they occur inside the inner for loop, so they ran over every element in `query` `5000` times. Here's a diagram showing how the nested for loops run operations:

![nested-operations](https://s3.amazonaws.com/dq-content/168/nested_operations.png)

Each item in the "outer" for loop spawns an "inner" for loop, which executes as many times as `query` has elements. This means that the body of the inner loop will run `len(query) * len(query)` times, or `5000 * 5000`, which equals `25000000`. Let's say your code has a single operation that takes 1 second to run. If it runs `25000000` times, it would take about `289` days to run, more than nine months!
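The arithmetic behind those figures can be checked with a few lines of Python. The variable names here are only illustrative.

```python
n = 5000                      # number of values in `query`
inner_runs = n * n            # times the body of the inner loop runs
seconds_per_operation = 1     # the hypothetical cost used in the text
seconds_per_day = 60 * 60 * 24

total_days = inner_runs * seconds_per_operation / seconds_per_day
print(inner_runs)             # 25000000
print(round(total_days, 1))   # 289.4
```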

Compared to the "cost" in terms of time of the nested for loops, the outer for loop and other operations take almost no time. The two comparisons inside the nested for loops ran `49995000` times in total. All the other operations ran `179774` times, a negligible amount comparatively, so they don't really matter from a performance perspective. If we wanted to improve our algorithm, removing the operation `duplicate = False` would save us too little time to be worth it. We'd have to optimize the operations that run about `25000000` times each.

It also doesn't really matter that there are two operations that run about `25000000` times instead of one. Each operation individually would take about `289` days to run `25000000` times at `1` second per execution, so just removing one of the operations from our program won't really help us. We need to remove both to get our performance to a reasonable level.

Needing to measure how long algorithms such as this one take is very common in computer science, and it isn't practical to always count up the number of operations directly. This has led to the usage of [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation) to measure time complexity.

Big O notation is based on the same intuition that we just had -- that an algorithm's performance is limited by the operation that runs the most times. Big O notation expresses time complexity in terms of the length of the input variable, represented as `n`.

For example, here's a single for loop that runs across each element in a list once:

```python
counter = 0
for item in query:
    counter += 1
```

The above algorithm would be represented in big O notation as having `O(n)` time complexity since we run the `counter += 1` operation `len(query)` times. Recall from above that `n` equals `len(query)` in our notation.
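To make the contrast with the nested loops from earlier concrete, the sketch below counts operations for a single loop and for a loop inside a loop as the input grows. Both function names and the inputs are hypothetical. The nested version performs `n * n` operations, which is conventionally written as `O(n^2)`.

```python
def single_loop_operations(values):
    """Count the operations performed by one pass over values."""
    counter = 0
    for _ in values:
        counter += 1
    return counter


def nested_loop_operations(values):
    """Count the operations performed by a loop nested inside another loop."""
    counter = 0
    for _ in values:
        for _ in values:
            counter += 1
    return counter


for n in (100, 200, 400):
    values = list(range(n))  # hypothetical input of length n
    print(n, single_loop_operations(values), nested_loop_operations(values))
# → 100 100 10000
# → 200 200 40000
# → 400 400 160000
```

Each time the input doubles, the single loop does twice as much work, while the nested loops do four times as much.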

* Initialize a variable `total`, and set it to `0`.
* Initialize the following operation counter variables:
  * `sum_increments` -- count how many times `total` is incremented.
* Write an algorithm that adds the length of each item in `query` to `total`.
* Look at the value of `sum_increments`. Is it what you expected?