Commit
Merge https://github.com/pydata/pandas into fastsplit
* https://github.com/pydata/pandas: (26 commits)
  disable some deps on 3.2 build
  Fix meantim typo
  DOC: use current ipython in doc build
  PERF: write basic datetimes faster pandas-dev#10271
  TST: fix for bottleneck >= 1.0 nansum behavior, xref pandas-dev#9422
  add numba example to enhancingperf.rst
  BUG: SparseSeries constructor ignores input data name
  BUG: Raise TypeError only if key DataFrame is not empty pandas-dev#10126
  ENH: groupby.apply for Categorical should preserve categories (closes pandas-dev#10138)
  DOC: add in whatsnew/0.17.0.txt
  DOC: move whatsnew from 0.17.0 -> 0.16.2
  BUG:  Holiday(..) with both offset and observance raises NotImplementedError pandas-dev#10217
  BUG: Index.union cannot handle array-likes
  BUG: SparseSeries.abs() resets name
  BUG: Series arithmetic methods incorrectly hold name
  ENH: Don't infer WOM-5MON if we don't support it (pandas-dev#9425)
  BUG: Series.align resets name when fill_value is specified
  BUG: GroupBy.get_group raises ValueError when group key contains NaT
  Close mysql connection in TestXMySQL to prevent tests freezing
  BUG: plot doesnt default to matplotlib axes.grid setting (pandas-dev#9792)
  ...
cgevans committed Jun 4, 2015
2 parents c155059 + bc7d48f commit db330d4
Showing 38 changed files with 822 additions and 206 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -123,7 +123,7 @@ conda install pandas
- xlrd >= 0.9.0
- [XlsxWriter](https://pypi.python.org/pypi/XlsxWriter)
- Alternative Excel writer.
- [Google bq Command Line Tool](https://developers.google.com/bigquery/bq-command-line-tool/)
- [Google bq Command Line Tool](https://cloud.google.com/bigquery/bq-command-line-tool)
- Needed for `pandas.io.gbq`
- [boto](https://pypi.python.org/pypi/boto): necessary for Amazon S3 access.
- One of the following combinations of libraries is needed to use the
5 changes: 3 additions & 2 deletions ci/build_docs.sh
@@ -13,9 +13,10 @@ fi


if [ x"$DOC_BUILD" != x"" ]; then
# we're running network tests, let's build the docs in the meantim

# we're running network tests, let's build the docs in the meantime
echo "Will build docs"
pip install sphinx==1.1.3 ipython==1.1.0
conda install sphinx==1.1.3 ipython

mv "$TRAVIS_BUILD_DIR"/doc /tmp
cd /tmp/doc
11 changes: 0 additions & 11 deletions ci/requirements-3.2.txt
@@ -1,15 +1,4 @@
python-dateutil==2.1
pytz==2013b
xlsxwriter==0.4.6
xlrd==0.9.2
numpy==1.7.1
cython==0.19.1
numexpr==2.1
tables==3.0.0
matplotlib==1.2.1
patsy==0.1.0
lxml==3.2.1
html5lib
scipy==0.12.0
beautifulsoup4==4.2.1
statsmodels==0.5.0
2 changes: 2 additions & 0 deletions doc/source/api.rst
@@ -358,6 +358,8 @@ Computations / Descriptive Stats
Series.median
Series.min
Series.mode
Series.nlargest
Series.nsmallest
Series.pct_change
Series.prod
Series.quantile
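The ``Series.nlargest``/``Series.nsmallest`` entries added to the API listing above can be illustrated with a minimal sketch (the values here are arbitrary):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

# nlargest/nsmallest return the n extreme values, already sorted,
# without sorting the entire Series.
top3 = s.nlargest(3)
bottom2 = s.nsmallest(2)

print(list(top3))     # [9, 6, 5]
print(list(bottom2))  # [1, 1]
```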
86 changes: 70 additions & 16 deletions doc/source/enhancingperf.rst
@@ -7,7 +7,7 @@
import os
import csv
from pandas import DataFrame
from pandas import DataFrame, Series
import pandas as pd
pd.options.display.max_rows=15
@@ -68,9 +68,10 @@ Here's the function in pure python:
We achieve our result by using ``apply`` (row-wise):

.. ipython:: python
.. code-block:: python
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
In [7]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 174 ms per loop
But clearly this isn't fast enough for us. Let's take a look and see where the
time is spent during this operation (limited to the most time consuming
@@ -97,7 +98,7 @@ First we're going to need to import the cython magic function to ipython:

.. ipython:: python
%load_ext cythonmagic
%load_ext Cython
Now, let's simply copy our functions over to cython as is (the suffix
@@ -122,9 +123,10 @@ is here to distinguish between function versions):
to be using bleeding edge ipython for paste to play well with cell magics.


.. ipython:: python
.. code-block:: python
%timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
In [4]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 85.5 ms per loop
Already this has shaved a third off, not too bad for a simple copy and paste.

@@ -150,9 +152,10 @@ We get another huge improvement simply by providing type information:
...: return s * dx
...:

.. ipython:: python
.. code-block:: python
%timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
In [4]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 20.3 ms per loop
Now, we're talking! It's now over ten times faster than the original python
implementation, and we haven't *really* modified the code. Let's have another
@@ -229,9 +232,10 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros array
Loops like this would be *extremely* slow in python, but in Cython looping
over numpy arrays is *fast*.

.. ipython:: python
.. code-block:: python
%timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
In [4]: %timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
1000 loops, best of 3: 1.25 ms per loop
We've gotten another big improvement. Let's check again where the time is spent:

@@ -278,20 +282,70 @@ advanced cython techniques:
...: return res
...:

.. ipython:: python
.. code-block:: python
%timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
In [4]: %timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
1000 loops, best of 3: 987 us per loop
Even faster, with the caveat that a bug in our cython code (an off-by-one error,
for example) might cause a segfault because memory access isn't checked.
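As a minimal illustration of that caveat (a plain NumPy sketch, not the Cython code above): ordinary Python-level indexing always performs the bounds check that ``@cython.boundscheck(False)`` removes.

```python
import numpy as np

arr = np.zeros(5)

# At the Python/NumPy level an off-by-one index is caught safely:
caught = False
try:
    arr[5] = 1.0          # one past the end
except IndexError:
    caught = True

# Under @cython.boundscheck(False) this exact check is disabled, so the
# same off-by-one write would go past the buffer -- hence the possible segfault.
print(caught)  # True
```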


Further topics
~~~~~~~~~~~~~~
.. _enhancingperf.numba:

Using numba
-----------

A recent alternative to statically compiling cython code is to use a *dynamic jit-compiler*, ``numba``.

Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.

.. note::

You will need to install ``numba``. This is easy with ``conda``: ``conda install numba``; see :ref:`installing using miniconda<install.miniconda>`.

We simply take the plain python code from above and annotate with the ``@jit`` decorator.

.. code-block:: python

    import numba

    @numba.jit
    def f_plain(x):
        return x * (x - 1)

    @numba.jit
    def integrate_f_numba(a, b, N):
        s = 0
        dx = (b - a) / N
        for i in range(N):
            s += f_plain(a + i * dx)
        return s * dx

    @numba.jit
    def apply_integrate_f_numba(col_a, col_b, col_N):
        n = len(col_N)
        result = np.empty(n, dtype='float64')
        assert len(col_a) == len(col_b) == n
        for i in range(n):
            result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
        return result

    def compute_numba(df):
        result = apply_integrate_f_numba(df['a'].values, df['b'].values, df['N'].values)
        return Series(result, index=df.index, name='result')
Similar to above, we pass ``numpy`` arrays directly to the numba function. Further,
we wrap the results to provide a nice interface by passing/returning pandas objects.

.. code-block:: python
- Loading C modules into cython.
In [4]: %timeit compute_numba(df)
1000 loops, best of 3: 798 us per loop
Read more in the `cython docs <http://docs.cython.org/>`__.
Read more in the `numba docs <http://numba.pydata.org/>`__.

.. _enhancingperf.eval:

8 changes: 4 additions & 4 deletions doc/source/groupby.rst
@@ -784,11 +784,11 @@ will be (silently) dropped. Thus, this does not pose any problems:
df.groupby('A').std()
NA group handling
~~~~~~~~~~~~~~~~~
NA and NaT group handling
~~~~~~~~~~~~~~~~~~~~~~~~~

If there are any NaN values in the grouping key, these will be automatically
excluded. So there will never be an "NA group". This was not the case in older
If there are any NaN or NaT values in the grouping key, these will be automatically
excluded. So there will never be an "NA group" or "NaT group". This was not the case in older
versions of pandas, but users were generally discarding the NA group anyway
(and supporting it was an implementation headache).
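A minimal sketch of the behaviour described above (column and key names are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", np.nan, "a"],
                   "val": [1, 2, 3, 4]})

# Rows whose group key is missing are dropped automatically;
# there is no NA group in the result.
sums = df.groupby("key")["val"].sum()

print(list(sums.index))  # ['a', 'b']
print(sums["a"])         # 1 + 4 = 5
```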

2 changes: 1 addition & 1 deletion doc/source/whatsnew.rst
@@ -18,7 +18,7 @@ What's New

These are new features and improvements of note in each release.

.. include:: whatsnew/v0.17.0.txt
.. include:: whatsnew/v0.16.2.txt

.. include:: whatsnew/v0.16.1.txt

90 changes: 90 additions & 0 deletions doc/source/whatsnew/v0.16.2.txt
@@ -0,0 +1,90 @@
.. _whatsnew_0162:

v0.16.2 (June 12, 2015)
-----------------------

This is a minor bug-fix release from 0.16.1 and includes a large number of
bug fixes along with several new features, enhancements, and performance improvements.
We recommend that all users upgrade to this version.

Highlights include:

- Documentation on how to use ``numba`` with *pandas*, see :ref:`here <enhancingperf.numba>`

Check the :ref:`API Changes <whatsnew_0162.api>` before updating.

.. contents:: What's new in v0.16.2
:local:
:backlinks: none

.. _whatsnew_0162.enhancements:

New features
~~~~~~~~~~~~

.. _whatsnew_0162.enhancements.other:

Other enhancements
^^^^^^^^^^^^^^^^^^

.. _whatsnew_0162.api:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0162.api_breaking:

.. _whatsnew_0162.api_breaking.other:

Other API Changes
^^^^^^^^^^^^^^^^^

- ``Holiday`` now raises ``NotImplementedError`` if both ``offset`` and ``observance`` are used in the constructor. (:issue:`10217`)

.. _whatsnew_0162.performance:

Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`)
- Modest improvement in datetime writing speed in to_csv (:issue:`10271`)

.. _whatsnew_0162.bug_fixes:

Bug Fixes
~~~~~~~~~

- Bug where ``read_hdf`` ``store.select`` modifies the passed columns list when
  multi-indexed (:issue:`7212`)
- Bug in ``Categorical`` repr with ``display.width`` of ``None`` in Python 3 (:issue:`10087`)

- Bug in ``groupby.apply`` aggregation for ``Categorical`` not preserving categories (:issue:`10138`)
- Bug in ``mean()`` where integer dtypes can overflow (:issue:`10172`)
- Bug where ``Panel.from_dict`` does not set dtype when specified (:issue:`10058`)
- Bug in ``Index.union`` raising ``AttributeError`` when passed array-likes (:issue:`10149`)
- Bug in ``Timestamp``'s ``microsecond``, ``quarter``, ``dayofyear``, ``week`` and ``daysinmonth`` properties returning ``np.int`` type, not built-in ``int``. (:issue:`10050`)
- Bug in ``NaT`` raising ``AttributeError`` when accessing the ``daysinmonth``, ``dayofweek`` properties. (:issue:`10096`)


- Bug in getting timezone data with ``dateutil`` on various platforms (:issue:`9059`, :issue:`8639`, :issue:`9663`, :issue:`10121`)
- Bug in displaying datetimes with mixed frequencies; 'ms' datetimes now display to the proper precision. (:issue:`10170`)

- Bug in ``Series`` arithmetic methods incorrectly holding names (:issue:`10068`)

- Bug in ``DatetimeIndex`` and ``TimedeltaIndex`` names being lost after timedelta arithmetic (:issue:`9926`)


- Bug in ``Series.plot(label="LABEL")`` not correctly setting the label (:issue:`10119`)

- Bug in ``plot`` not defaulting to the matplotlib ``axes.grid`` setting (:issue:`9792`)

- Bug in ``Series.align`` resetting ``name`` when ``fill_value`` is specified (:issue:`10067`)
- Bug in ``SparseSeries.abs`` resetting ``name`` (:issue:`10241`)


- Bug in ``GroupBy.get_group`` raising ``ValueError`` when group key contains ``NaT`` (:issue:`6992`)
- Bug in ``SparseSeries`` constructor ignoring input data name (:issue:`10258`)

- Bug where ``infer_freq`` infers a timerule (WOM-5XXX) unsupported by ``to_offset`` (:issue:`9425`)

- Bug when masking an empty ``DataFrame`` (:issue:`10126`)
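Several of the fixes above concern ``name`` propagation; a minimal sketch of the ``Series.align`` case (:issue:`10067`), assuming a pandas version that includes the fix:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], name="x")
s2 = pd.Series([10, 20], name="x")

# With fill_value specified, align no longer resets the shared name:
left, right = s1.align(s2, fill_value=0)

print(left.name, right.name)  # x x
print(list(right))            # [10, 20, 0]
```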
23 changes: 2 additions & 21 deletions doc/source/whatsnew/v0.17.0.txt
@@ -1,9 +1,9 @@
.. _whatsnew_0170:

v0.17.0 (July ??, 2015)
v0.17.0 (July 31, 2015)
-----------------------

This is a major release from 0.16.1 and includes a small number of API changes, several new features,
This is a major release from 0.16.2 and includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that all
users upgrade to this version.

@@ -53,26 +53,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`)

.. _whatsnew_0170.bug_fixes:

Bug Fixes
~~~~~~~~~

- Bug in ``Categorical`` repr with ``display.width`` of ``None`` in Python 3 (:issue:`10087`)


- Bug where Panel.from_dict does not set dtype when specified (:issue:`10058`)
- Bug in ``Timestamp``'s' ``microsecond``, ``quarter``, ``dayofyear``, ``week`` and ``daysinmonth`` properties return ``np.int`` type, not built-in ``int``. (:issue:`10050`)
- Bug in ``NaT`` raises ``AttributeError`` when accessing to ``daysinmonth``, ``dayofweek`` properties. (:issue:`10096`)

- Bug in getting timezone data with ``dateutil`` on various platforms ( :issue:`9059`, :issue:`8639`, :issue:`9663`, :issue:`10121`)
- Bug in display datetimes with mixed frequencies uniformly; display 'ms' datetimes to the proper precision. (:issue:`10170`)


- Bug in ``DatetimeIndex`` and ``TimedeltaIndex`` names are lost after timedelta arithmetics ( :issue:`9926`)

- Bug in `Series.plot(label="LABEL")` not correctly setting the label (:issue:`10119`)


