Note: This notebook was created along with the DataCamp course of the same name

# Writing Efficient Code with pandas

The ability to work efficiently with big datasets and extract valuable information is an indispensable skill for every aspiring data scientist. When working with a small amount of data, we often don’t realize how slow code execution can be. This course builds on your knowledge of Python and the pandas library and introduces efficient built-in pandas functions that perform tasks faster. These built-in functions let you tackle everything from the simplest tasks, like targeting specific entries and features in the data, to the most complex ones, like applying functions to groups of entries, much faster than plain Python approaches. By the end of this course, you will be able to apply a function to data based on a feature value, iterate through big datasets rapidly, and manipulate data belonging to different groups efficiently. You will apply these methods to a variety of real-world datasets, such as poker hands and restaurant tips.

**Instructor:** Leonidas Souliotis, PhD @ University of Warwick

In [2]:
import time

# $\star$ Chapter 1: Selecting columns and rows efficiently 
This chapter gives an overview of why efficient code matters and shows how to select specific and random rows and columns efficiently.

### The need for efficient coding I
* How do we measure time?
    * In this course, we use the `time()` function from Python's built-in `time` module.
    * **`time.time()`** returns the current time as a floating-point number of seconds since 12:00am UTC, January 1, 1970
    * Each time we want to measure some code's execution time, we record the current time before execution using `time()`, execute the operation we are interested in, and record the time again right after the code finishes.
    * In the end, we print the elapsed time in seconds, in a compact but meaningful message

In [3]:
# record time before execution
start_time = time.time()
# execute operation
result = 5 + 2
# record time after execution
end_time = time.time()
print("Result calculated in {} sec".format(end_time - start_time))

Result calculated in 9.703636169433594e-05 sec
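
The before/after pattern above can be wrapped in a small reusable helper. This is only a minimal sketch: the name `timed` is hypothetical and not part of the course material, and it assumes a single run is representative enough for a rough measurement.

```python
import time

def timed(func, *args, **kwargs):
    """Hypothetical helper: run func and return (result, elapsed_seconds)."""
    # record time before execution
    start_time = time.time()
    # execute operation
    result = func(*args, **kwargs)
    # record time after execution
    end_time = time.time()
    return result, end_time - start_time

# Same operation as above (5 + 2), expressed as a function call
result, elapsed = timed(sum, [5, 2])
print("Result {} calculated in {} sec".format(result, elapsed))
```

For very short operations like this one, the standard library's `time.perf_counter()` offers a higher-resolution clock and could be swapped in for `time.time()`.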


### For-loop vs. list comprehension
* List comprehension:

In [4]:
list_comp_start_time = time.time()
result = [i*i for i in range(0, 1000000)]
list_comp_end_time = time.time()
print("Time using the list_comprehension: {} sec".format(list_comp_end_time-list_comp_start_time))

Time using the list_comprehension: 0.07631111145019531 sec


* For loop:

In [5]:
for_loop_start_time = time.time()
result = []
for i in range(0, 1000000):
    result.append(i*i)
for_loop_end_time = time.time()
print("Time using the for loop: {} sec".format(for_loop_end_time - for_loop_start_time))

Time using the for loop: 0.13492989540100098 sec


* In the majority of cases, a list comprehension is a faster way to perform a simple operation than a for loop
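
A single measurement can be noisy, so one way to check this claim more carefully is to repeat each measurement with the standard library's `timeit` module. The sketch below is an assumption about how one might do that, not the course's method. The function names are illustrative, and `N` is reduced from the 1,000,000 used above to keep the repeated runs quick.

```python
import timeit

N = 100_000  # assumption: smaller than above so repeated runs stay fast

def with_list_comprehension():
    return [i * i for i in range(N)]

def with_for_loop():
    result = []
    for i in range(N):
        result.append(i * i)
    return result

# Each measurement calls the function 10 times; keep the best of 3 measurements
list_comp_time = min(timeit.repeat(with_list_comprehension, number=10, repeat=3))
for_loop_time = min(timeit.repeat(with_for_loop, number=10, repeat=3))

print("Time using the list comprehension: {:.4f} sec".format(list_comp_time))
print("Time using the for loop: {:.4f} sec".format(for_loop_time))
```

The exact numbers depend on the machine and Python version, so no expected output is shown here.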

### Where time matters I
* Calculate $1 + 2 + ... + 1000000$
* The most intuitive way to do it is by brute force, adding each number to the sum one by one:

In [6]:
def sum_brute_force(N):
    res = 0
    for i in range(1, N+1):
        # add the current number, not a constant 1
        res += i
    return res

* A cleverer way to proceed is to use a well-known formula attributed to Gauss: $1 + 2 + ... + N = \frac{N(N + 1)}{2}$

In [7]:
def sum_formula(N):
    # integer division keeps the result an exact int, matching the brute-force sum
    return N*(N+1)//2

* After running both methods, the formula turns out to be faster by over 160,000%, which clearly demonstrates why we need efficient and optimized code, even for simple tasks

<img src='data/efficient26.png' width="600" height="300" align="center"/>
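
A comparison like this can be reproduced with the same `time.time()` pattern used earlier. The following is a minimal sketch, not the exact code behind the figure. It redefines both functions so it runs on its own, with the brute-force loop adding `i` on each iteration and the formula using integer division. The timings will differ from machine to machine.

```python
import time

def sum_brute_force(N):
    res = 0
    for i in range(1, N + 1):
        res += i
    return res

def sum_formula(N):
    return N * (N + 1) // 2

N = 1000000

# Time the brute-force approach
bf_start_time = time.time()
bf_result = sum_brute_force(N)
bf_end_time = time.time()
print("Time using brute force: {} sec".format(bf_end_time - bf_start_time))

# Time the formula approach
formula_start_time = time.time()
formula_result = sum_formula(N)
formula_end_time = time.time()
print("Time using the formula: {} sec".format(formula_end_time - formula_start_time))

# Both approaches should agree on the answer
print("Results match:", bf_result == formula_result)
```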