<div style="float:left;font-size:20px;">
    <h1>Pandas</h1>
</div><div style="float:right;"><img src="../assets/banner.jpg"></div>

In [1]:
import pandas as pd
import numpy as np

## Tips

### Error: value is trying to be set on a copy of a slice from a DataFrame
For efficiency in Pandas, selections and slices return a reference to the original data known as a view. If this error occurs it is because you are trying make an assignment of a view of data. To fix this issue, simply prepend any operation that throws this error with a `.copy()` to ensure that view is copied into a new memory location when assigned.

### View a non-truncated view of a DataFrame

Easy way:  
```print(df.to_string())```

Complicated way:  
```
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  
    print(df)
```


# Optimised Pandas

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
- Python for Data Analysis
- https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2
- https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html


## Berlin Airbnb dataset

In [2]:
path = 'V:/Kaggle/Berlin Airbnb Data/'

listings = pd.read_csv(path + 'listings.csv')
reviews  = pd.read_csv(path + 'reviews.csv')

### Listings

In [3]:
listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2015,Berlin-Mitte Value! Quiet courtyard/very central,2217,Ian,Mitte,Brunnenstr. Süd,52.534537,13.402557,Entire home/apt,60,4,118,2018-10-28,3.76,4,141
1,2695,Prenzlauer Berg close to Mauerpark,2986,Michael,Pankow,Prenzlauer Berg Nordwest,52.548513,13.404553,Private room,17,2,6,2018-10-01,1.42,1,0
2,3176,Fabulous Flat in great Location,3718,Britta,Pankow,Prenzlauer Berg Südwest,52.534996,13.417579,Entire home/apt,90,62,143,2017-03-20,1.25,1,220
3,3309,BerlinSpot Schöneberg near KaDeWe,4108,Jana,Tempelhof - Schöneberg,Schöneberg-Nord,52.498855,13.349065,Private room,26,5,25,2018-08-16,0.39,1,297
4,7071,BrightRoom with sunny greenview!,17391,Bright,Pankow,Helmholtzplatz,52.543157,13.415091,Private room,42,2,197,2018-11-04,1.75,1,26


In [4]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 16 columns):
id                                22552 non-null int64
name                              22493 non-null object
host_id                           22552 non-null int64
host_name                         22526 non-null object
neighbourhood_group               22552 non-null object
neighbourhood                     22552 non-null object
latitude                          22552 non-null float64
longitude                         22552 non-null float64
room_type                         22552 non-null object
price                             22552 non-null int64
minimum_nights                    22552 non-null int64
number_of_reviews                 22552 non-null int64
last_review                       18644 non-null object
reviews_per_month                 18638 non-null float64
calculated_host_listings_count    22552 non-null int64
availability_365                  22552 non-null int64

### Reviews

In [5]:
reviews.head()

Unnamed: 0,listing_id,date
0,2015,2016-04-11
1,2015,2016-04-15
2,2015,2016-04-26
3,2015,2016-05-10
4,2015,2016-05-14


In [6]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401963 entries, 0 to 401962
Data columns (total 2 columns):
listing_id    401963 non-null int64
date          401963 non-null object
dtypes: int64(1), object(1)
memory usage: 6.1+ MB


# Merges

### Indices

Merges are more efficient on dataframes with indices

In [7]:
%%timeit
listings.merge(reviews, left_on='id', right_on='listing_id')

210 ms ± 3.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
reviews_ = reviews.set_index('listing_id')
listings_ = listings.set_index('id')

In [9]:
%%timeit
listings_.merge(reviews_, left_index=True, right_index=True)

168 ms ± 961 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Filtering

Always perform all the required filtering prior to merges or other operations

In [10]:
listings_filtered = listings[:int(len(listings)/10)]

In [11]:
%%timeit
listings_filtered.merge(reviews_, left_index=True, right_index=True)

3.7 ms ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Single value access

`at` is more efficient than `loc` for accessing single values.

In [12]:
listings.loc[3]

id                                                             3309
name                              BerlinSpot Schöneberg near KaDeWe
host_id                                                        4108
host_name                                                      Jana
neighbourhood_group                          Tempelhof - Schöneberg
neighbourhood                                       Schöneberg-Nord
latitude                                                    52.4989
longitude                                                   13.3491
room_type                                              Private room
price                                                            26
minimum_nights                                                    5
number_of_reviews                                                25
last_review                                              2018-08-16
reviews_per_month                                              0.39
calculated_host_listings_count                  

In [13]:
listings.loc[3, ['price', 'norm_price']]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return getattr(section, self.name)[new_key]


price          26
norm_price    NaN
Name: 3, dtype: object

In [14]:
%%timeit
listings[listings['id'] == 3309]['price']

720 µs ± 4.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [15]:
%%timeit
listings.loc[3]['price']

211 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [16]:
%%timeit
listings.loc[3, 'price']

8.48 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [17]:
%%timeit
listings.at[3, 'price']

5.52 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


# Vectorize

`vectorize` is the most efficient followed by `map`/`apply`. Avoid `iterrows()`. 
`apply` works on a row / column basis of a DataFrame, `applymap` works element-wise on a DataFrame, and `map` works element-wise on a Series.

Demonstrate with an example in normalising the price data.

In [18]:
min_price = listings['price'].min()
max_price = listings['price'].max()

### Iterrows

In [19]:
%%timeit
norm_prices = np.zeros(len(listings,))
for i, row in listings.iterrows():
    norm_prices[i] = (row['price'] - min_price) / (max_price - min_price)
listings['norm_price'] = norm_prices

2.41 s ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### iloc

In [20]:
%%timeit
norm_prices = np.zeros(len(listings,))
for i in range(len(norm_prices)):
    norm_prices[i] = (listings.loc[i, 'price'] - min_price) / (max_price - min_price)
listings['norm_price'] = norm_prices

210 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### map

In [21]:
%%timeit 
listings['norm_price'] = listings['price'].map(lambda x: (x - min_price) / (max_price - min_price))

17.1 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### apply

In [22]:
%%timeit 
listings['norm_price'] = listings['price'].apply(lambda x: (x - min_price) / (max_price - min_price))

16.9 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Vectorize

In [23]:
%%timeit
listings['norm_price'] = (listings['price'] - min_price) / (max_price - min_price)

451 µs ± 7.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In Summary:

Operation | Relative speed
----------|---------------
iterrows  | 1
iloc      | 10
map/apply | 134
vectorize | 5200

## Memory utilisation

Types are automatically inferred from `csv` files which can lead to an inefficient underlying storage format. Other data formats, such as `parquet` and `pickle`, can define the data type and use the most efficient data type. Commonly repeated data can be further compressed by using the `category` datatype.

For example below the integer datatypes are assigned the largest container where `number_of_reviews` is between 0 and 498.

In [24]:
listings.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,norm_price
count,22552.0,22552.0,22552.0,22552.0,22552.0,22552.0,22552.0,18638.0,22552.0,22552.0,22552.0
mean,15715600.0,54033550.0,52.509824,13.406107,67.143668,7.157059,17.840679,1.135525,1.918233,79.852829,0.00746
std,8552069.0,58162900.0,0.030825,0.057964,220.26621,40.665073,36.769624,1.507082,3.667257,119.368162,0.024474
min,2015.0,2217.0,52.345803,13.103557,0.0,1.0,0.0,0.01,1.0,0.0,0.0
25%,8065954.0,9240002.0,52.489065,13.375411,30.0,2.0,1.0,0.18,1.0,0.0,0.003333
50%,16866380.0,31267110.0,52.509079,13.416779,45.0,2.0,5.0,0.54,1.0,4.0,0.005
75%,22583930.0,80675180.0,52.532669,13.439259,70.0,4.0,16.0,1.5,1.0,129.0,0.007778
max,29867350.0,224508100.0,52.65167,13.757642,9000.0,5000.0,498.0,36.67,45.0,365.0,1.0


In [25]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 17 columns):
id                                22552 non-null int64
name                              22493 non-null object
host_id                           22552 non-null int64
host_name                         22526 non-null object
neighbourhood_group               22552 non-null object
neighbourhood                     22552 non-null object
latitude                          22552 non-null float64
longitude                         22552 non-null float64
room_type                         22552 non-null object
price                             22552 non-null int64
minimum_nights                    22552 non-null int64
number_of_reviews                 22552 non-null int64
last_review                       18644 non-null object
reviews_per_month                 18638 non-null float64
calculated_host_listings_count    22552 non-null int64
availability_365                  22552 non-null int64

In [26]:
# Tools for optimising types  automatically:
from typing import List

def optimize_floats(df: pd.DataFrame) -> pd.DataFrame:
    floats = df.select_dtypes(include=['float64']).columns.tolist()
    df[floats] = df[floats].apply(pd.to_numeric, downcast='float')
    return df

def optimize_ints(df: pd.DataFrame) -> pd.DataFrame:
    ints = df.select_dtypes(include=['int64']).columns.tolist()
    df[ints] = df[ints].apply(pd.to_numeric, downcast='integer')
    return df


def optimize_objects(df: pd.DataFrame, datetime_features: List[str]) -> pd.DataFrame:
    for col in df.select_dtypes(include=['object']):
        if col not in datetime_features:
            num_unique_values = len(df[col].unique())
            num_total_values = len(df[col])
            if float(num_unique_values) / num_total_values < 0.5:
                df[col] = df[col].astype('category')
        else:
            df[col] = pd.to_datetime(df[col])
    return df

def optimize(df: pd.DataFrame, datetime_features: List[str] = []):
    return optimize_floats(optimize_ints(optimize_objects(df, datetime_features)))

In [27]:
optimized_listings = optimize(listings, ['last_review'])

In [28]:
optimized_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 17 columns):
id                                22552 non-null int32
name                              22493 non-null object
host_id                           22552 non-null int32
host_name                         22526 non-null category
neighbourhood_group               22552 non-null category
neighbourhood                     22552 non-null category
latitude                          22552 non-null float32
longitude                         22552 non-null float32
room_type                         22552 non-null category
price                             22552 non-null int16
minimum_nights                    22552 non-null int16
number_of_reviews                 22552 non-null int16
last_review                       18644 non-null datetime64[ns]
reviews_per_month                 18638 non-null float32
calculated_host_listings_count    22552 non-null int8
availability_365                  22552

The memory footprint has gone from 2.8MB to 1.3MB, _46%_ of the original utilisation.

## Data processing tests

Determine the most efficient method to apply a semi-complex function across all rows of a dataframe, conclusions:
- `apply` is terrible.
- `np.vectorize` applied to a function is not very efficient.
- `vectorize` is very efficient. The Pandas `vectorize` is less efficient that ndarray `vectorize` for small amounts of rows.
- `numba` is the most efficient. Manually operating the loops is less efficient than the vectorized method. NOTE: the @njit decorator must be added to any functions called by @njit decoratored functions, otherwise Numba may generate
much slower code.

In [29]:
def average_price_per_night(price, minimum_nights):
    return price/minimum_nights

### Apply

In [30]:
%%timeit
optimized_listings.apply(lambda row: average_price_per_night(row['price'], row['minimum_nights']), axis=1)

2.67 s ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [31]:
%%timeit
optimized_listings.apply(lambda row: row['price']/row['minimum_nights'], axis=1)

2.67 s ± 7.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Vectorised

In [32]:
%%timeit
optimized_listings['price']/optimized_listings['minimum_nights']

175 µs ± 6.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [33]:
%%timeit
average_price_per_night(optimized_listings['price'].to_numpy(), optimized_listings['minimum_nights'].to_numpy())

64.2 µs ± 557 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [34]:
%%timeit
optimized_listings['price'].to_numpy()/optimized_listings['minimum_nights'].to_numpy()

63.8 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### Numpy

In [35]:
%%timeit
optimized_listings['price'].to_numpy()/optimized_listings['minimum_nights'].to_numpy()

63.8 µs ± 226 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [36]:
%%timeit
np.vectorize(average_price_per_night)(optimized_listings['price'].to_numpy(), optimized_listings['minimum_nights'].to_numpy())

3.64 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Numba

In [37]:
import numba

@numba.njit()
def numba_average_price_per_night(price, minimum_nights):
    return price/minimum_nights

@numba.njit()
def numba_loop_average_price_per_night(price, minimum_nights):
    n = len(price)
    r = np.empty(n, dtype=np.float32)
    
    for i in range(n):
        r[i] = price[i]/minimum_nights[i]
    return r

In [38]:
%%timeit
numba_average_price_per_night(optimized_listings['price'].to_numpy(), optimized_listings['minimum_nights'].to_numpy())

48.2 µs ± 31 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [39]:
x = optimized_listings['price'].to_numpy()
y = optimized_listings['minimum_nights'].to_numpy()

In [40]:
%%timeit 
numba_average_price_per_night(x, y)

19.8 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [41]:
%%timeit
numba_loop_average_price_per_night(optimized_listings['price'].to_numpy(), optimized_listings['minimum_nights'].to_numpy())

49.3 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### Cython

In [42]:
 %load_ext Cython

In [43]:
%%cython
def cython_average_price_per_night(price, minimum_nights):
    return price/minimum_nights

In [44]:
%%timeit
cython_average_price_per_night(optimized_listings['price'].to_numpy(), optimized_listings['minimum_nights'].to_numpy())

66.1 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [45]:
%prun -l 4 cython_average_price_per_night(optimized_listings['price'].to_numpy(), optimized_listings['minimum_nights'].to_numpy())

 

         68 function calls in 0.000 seconds

   Ordered by: internal time
   List reduced from 29 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {_cython_magic_426610cf046191d324deb79ce9bfdf99.cython_average_price_per_night}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 frame.py:2964(__getitem__)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)

## Fast lookups - Index

- Use an index where possible
- Use `loc`, or even better, `at` if there is a single value to query.
- Use the variable to access within `loc`/`at`, we see in the example below it increases performance by ~500x.

In [46]:
%%timeit
listings[listings['id'] == 29856708]['price']

1.48 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [47]:
listings_ = listings.set_index('id')

Below shows why you should incorporate your selection in your `loc`/`at` call.

In [48]:
%%timeit
listings_.loc[29856708]['price']

2.56 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [49]:
%%timeit
listings_.loc[29856708, 'price']

8.26 µs ± 34.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [50]:
%%timeit
listings_.at[29856708, 'price']

5.19 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## Fast lookups - non-Index

This appears to be the fastest method:

In [72]:
%%timeit
listings[listings['host_name'] == 'Ian']

1.4 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [80]:
%%timeit
listings[listings['host_name'].isin(['Ian'])]

1.85 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [88]:
%%timeit
listings.loc[listings.host_name == 'Ian']

1.42 ms ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Adding columns

In [110]:
listings['room_type'].unique()

[Entire home/apt, Private room, Shared room]
Categories (3, object): [Entire home/apt, Private room, Shared room]

In [123]:
categories = {'Entire home/apt': 0, 'Private room': 1, 'Shared room': 2}
categories_df = pd.DataFrame({'room_type': list(categories.keys()), 'value': list(categories.values())})
categories_df

Unnamed: 0,room_type,value
0,Entire home/apt,0
1,Private room,1
2,Shared room,2


In [127]:
%%timeit
listings.merge(categories_df, on='room_type', copy=False)

14.1 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [128]:
categories_df

Unnamed: 0,room_type,value
0,Entire home/apt,0
1,Private room,1
2,Shared room,2


In [118]:
%%timeit
listings['room_type_enum'] = listings.apply(lambda x: categories[x['room_type']], axis=1)

2.48 s ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [115]:
%%timeit
listings['room_type_enum'] = listings[['room_type']].applymap(categories.get)

8.89 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Enhancing performance

Examples from: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html

In [51]:
 df = pd.DataFrame({'a': np.random.randn(1000),
                    'b': np.random.randn(1000),
                    'N': np.random.randint(100, 1000, (1000)),
                    'x': 'x'})

### Complex operations

In [52]:
def f(x):
    return x * (x - 1)
 
def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

#### Apply

In [53]:
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)

181 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [54]:
%prun -l 4 df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)

 

         659000 function calls (653974 primitive calls) in 0.275 seconds

   Ordered by: internal time
   List reduced from 217 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.142    0.000    0.209    0.000 <ipython-input-52-e95b0e814a6f>:4(integrate_f)
   539093    0.067    0.000    0.067    0.000 <ipython-input-52-e95b0e814a6f>:1(f)
     3000    0.008    0.000    0.044    0.000 base.py:4702(get_value)
     3000    0.005    0.000    0.050    0.000 series.py:1068(__getitem__)

#### Cython

In [55]:
%%cython
def f_plain(x):
    return x * (x - 1)
def integrate_f_plain(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx

In [56]:
%timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)

104 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Cython with type defined

In [57]:
%%cython
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx


In [58]:
%timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)

33.6 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Currently we are using a `Series` datatype and we can see that calls to `Series` methods are taking a considerable amount of the time.

In [59]:
%prun -l 4 df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)

 

         119907 function calls (114881 primitive calls) in 0.061 seconds

   Ordered by: internal time
   List reduced from 216 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3000    0.008    0.000    0.041    0.000 base.py:4702(get_value)
     3000    0.004    0.000    0.046    0.000 series.py:1068(__getitem__)
     6001    0.003    0.000    0.010    0.000 {pandas._libs.lib.values_from_object}
        1    0.003    0.003    0.060    0.060 {pandas._libs.reduction.reduce}

### Cython with nd.array

In [60]:
%%cython
cimport numpy as np
import numpy as np
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
                                           np.ndarray col_N):
    assert (col_a.dtype == np.float
            and col_b.dtype == np.float and col_N.dtype == np.int)
    cdef Py_ssize_t i, n = len(col_N)
    assert (len(col_a) == len(col_b) == n)
    cdef np.ndarray[double] res = np.empty(n)
    for i in range(len(col_a)):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res


In [61]:
%timeit apply_integrate_f(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())

1.04 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [62]:
%prun -l 4 apply_integrate_f(df['a'].to_numpy(), df['b'].to_numpy(), df['N'].to_numpy())

 

         100 function calls in 0.001 seconds

   Ordered by: internal time
   List reduced from 29 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 {built-in method _cython_magic_8ee3fb7b2964ecfc84139940c886c1f0.apply_integrate_f}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        3    0.000    0.000    0.000    0.000 frame.py:2964(__getitem__)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)

### Numba

Generally easier to implement and normally provides the best performance.

In [63]:
import numba

@numba.njit()
def f_plain(x):
    return x * (x - 1)

@numba.njit()
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx

@numba.njit()
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype=np.float64)
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result

def compute_numba(df):
    result = apply_integrate_f_numba(df['a'].to_numpy(),
                                     df['b'].to_numpy(),
                                     df['N'].to_numpy())
    return pd.Series(result, index=df.index, name='result')

In [64]:
%timeit compute_numba(df)

637 µs ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Numba with Vectorize
Numba supports `vectorize` which takes out the effort and writing loops and makes the code clearer to read. By default this executes code in `nopython` mode, so provides optimal performance. If using a smaller dataformat it might be beneficial to provide the signature for the call e.g. if the floats are 32-bit we can use:

```python
@numba.vectorize(['float32(float32, float32, int32)', 'float64(float64, float64, int32)'])
```

Here we have provided a fallback to a 64-bit implementation.

In [65]:
@numba.vectorize(['float64(float64, float64, int32)'])
def apply_integrate_f_numba_vectorize(col_a, col_b, col_N):
    return integrate_f_numba(col_a, col_b, col_N)

def compute_numba_vectorize(df):
    result = apply_integrate_f_numba_vectorize(df['a'].to_numpy(),
                                     df['b'].to_numpy(),
                                     df['N'].to_numpy())
    return pd.Series(result, index=df.index, name='result')

In [66]:
%timeit compute_numba_vectorize(df)

592 µs ± 5.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [67]:
apply_integrate_f_numba.inspect_types()

apply_integrate_f_numba (array(float64, 1d, C), array(float64, 1d, C), array(int32, 1d, C))
--------------------------------------------------------------------------------
# File: <ipython-input-63-f6cd71261512>
# --- LINE 15 --- 

@numba.njit()

# --- LINE 16 --- 

def apply_integrate_f_numba(col_a, col_b, col_N):

    # --- LINE 17 --- 
    # label 0
    #   col_a = arg(0, name=col_a)  :: array(float64, 1d, C)
    #   col_b = arg(1, name=col_b)  :: array(float64, 1d, C)
    #   col_N = arg(2, name=col_N)  :: array(int32, 1d, C)
    #   $0.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $0.3 = call $0.1(col_N, func=$0.1, args=[Var(col_N, <ipython-input-63-f6cd71261512>:17)], kws=(), vararg=None)  :: (array(int32, 1d, C),) -> int64
    #   del $0.1
    #   n = $0.3  :: int64
    #   del $0.3

    n = len(col_N)

    # --- LINE 18 --- 
    #   $0.4 = global(np: <module 'numpy' from 'C:\\Users\\Mark\\.conda\\envs\\CatAna\\lib\\site-packages\\numpy\

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
a    1000 non-null float64
b    1000 non-null float64
N    1000 non-null int32
x    1000 non-null object
dtypes: float64(2), int32(1), object(1)
memory usage: 27.5+ KB


In [69]:
200 row df, index = deal (str), 5 columns with ratings

Need to select deal and rating to get dictionary

SyntaxError: invalid syntax (<ipython-input-69-a5341f45359f>, line 1)