Commit
Merge https://github.com/pydata/pandas into fastsplit
* https://github.com/pydata/pandas: (26 commits)
  disable some deps on 3.2 build
  Fix meantim typo
  DOC: use current ipython in doc build
  PERF: write basic datetimes faster pandas-dev#10271
  TST: fix for bottleneck >= 1.0 nansum behavior, xref pandas-dev#9422
  add numba example to enhancingperf.rst
  BUG: SparseSeries constructor ignores input data name
  BUG: Raise TypeError only if key DataFrame is not empty pandas-dev#10126
  ENH: groupby.apply for Categorical should preserve categories (closes pandas-dev#10138)
  DOC: add in whatsnew/0.17.0.txt
  DOC: move whatsnew from 0.17.0 -> 0.16.2
  BUG:  Holiday(..) with both offset and observance raises NotImplementedError pandas-dev#10217
  BUG: Index.union cannot handle array-likes
  BUG: SparseSeries.abs() resets name
  BUG: Series arithmetic methods incorrectly hold name
  ENH: Don't infer WOM-5MON if we don't support it (pandas-dev#9425)
  BUG: Series.align resets name when fill_value is specified
  BUG: GroupBy.get_group raises ValueError when group key contains NaT
  Close mysql connection in TestXMySQL to prevent tests freezing
  BUG: plot doesnt default to matplotlib axes.grid setting (pandas-dev#9792)
  ...
cgevans committed Jun 4, 2015
2 parents c155059 + bc7d48f commit db330d4
Showing 38 changed files with 822 additions and 206 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -123,7 +123,7 @@ conda install pandas
- xlrd >= 0.9.0
- [XlsxWriter](https://pypi.python.org/pypi/XlsxWriter)
- Alternative Excel writer.
- [Google bq Command Line Tool](https://developers.google.com/bigquery/bq-command-line-tool/)
- [Google bq Command Line Tool](https://cloud.google.com/bigquery/bq-command-line-tool)
- Needed for `pandas.io.gbq`
- [boto](https://pypi.python.org/pypi/boto): necessary for Amazon S3 access.
- One of the following combinations of libraries is needed to use the
5 changes: 3 additions & 2 deletions ci/build_docs.sh
@@ -13,9 +13,10 @@ fi


if [ x"$DOC_BUILD" != x"" ]; then
# we're running network tests, let's build the docs in the meantim

# we're running network tests, let's build the docs in the meantime
echo "Will build docs"
pip install sphinx==1.1.3 ipython==1.1.0
conda install sphinx==1.1.3 ipython

mv "$TRAVIS_BUILD_DIR"/doc /tmp
cd /tmp/doc
11 changes: 0 additions & 11 deletions ci/requirements-3.2.txt
@@ -1,15 +1,4 @@
python-dateutil==2.1
pytz==2013b
xlsxwriter==0.4.6
xlrd==0.9.2
numpy==1.7.1
cython==0.19.1
numexpr==2.1
tables==3.0.0
matplotlib==1.2.1
patsy==0.1.0
lxml==3.2.1
html5lib
scipy==0.12.0
beautifulsoup4==4.2.1
statsmodels==0.5.0
2 changes: 2 additions & 0 deletions doc/source/api.rst
@@ -358,6 +358,8 @@ Computations / Descriptive Stats
Series.median
Series.min
Series.mode
Series.nlargest
Series.nsmallest
Series.pct_change
Series.prod
Series.quantile
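The ``Series.nlargest``/``Series.nsmallest`` entries added to the API listing above can be illustrated with a minimal sketch (the values here are arbitrary):

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

# nlargest/nsmallest return the n extreme values, already sorted,
# without sorting the entire Series.
top3 = s.nlargest(3)
bottom2 = s.nsmallest(2)

print(list(top3))     # [9, 6, 5]
print(list(bottom2))  # [1, 1]
```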
86 changes: 70 additions & 16 deletions doc/source/enhancingperf.rst
@@ -7,7 +7,7 @@
import os
import csv
from pandas import DataFrame
from pandas import DataFrame, Series
import pandas as pd
pd.options.display.max_rows=15
@@ -68,9 +68,10 @@ Here's the function in pure python:
We achieve our result by using ``apply`` (row-wise):

.. ipython:: python
.. code-block:: python
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
In [7]: %timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 174 ms per loop
But clearly this isn't fast enough for us. Let's take a look and see where the
time is spent during this operation (limited to the most time consuming
@@ -97,7 +98,7 @@ First we're going to need to import the cython magic function to ipython:

.. ipython:: python
%load_ext cythonmagic
%load_ext Cython
Now, let's simply copy our functions over to cython as is (the suffix
@@ -122,9 +123,10 @@ is here to distinguish between function versions):
to be using bleeding edge ipython for paste to play well with cell magics.


.. ipython:: python
.. code-block:: python
%timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
In [4]: %timeit df.apply(lambda x: integrate_f_plain(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 85.5 ms per loop
Already this has shaved a third off, not too bad for a simple copy and paste.

@@ -150,9 +152,10 @@ We get another huge improvement simply by providing type information:
...: return s * dx
...:

.. ipython:: python
.. code-block:: python
%timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
In [4]: %timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)
10 loops, best of 3: 20.3 ms per loop
Now, we're talking! It's now over ten times faster than the original python
implementation, and we haven't *really* modified the code. Let's have another
@@ -229,9 +232,10 @@ the rows, applying our ``integrate_f_typed``, and putting this in the zeros array
Loops like this would be *extremely* slow in python, but in Cython looping
over numpy arrays is *fast*.

.. ipython:: python
.. code-block:: python
%timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
In [4]: %timeit apply_integrate_f(df['a'].values, df['b'].values, df['N'].values)
1000 loops, best of 3: 1.25 ms per loop
We've gotten another big improvement. Let's check again where the time is spent:

@@ -278,20 +282,70 @@ advanced cython techniques:
...: return res
...:

.. ipython:: python
.. code-block:: python
%timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
In [4]: %timeit apply_integrate_f_wrap(df['a'].values, df['b'].values, df['N'].values)
1000 loops, best of 3: 987 us per loop
Even faster, with the caveat that a bug in our cython code (an off-by-one error,
for example) might cause a segfault because memory access isn't checked.
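As a minimal illustration of that caveat (a plain NumPy sketch, not the Cython code above): ordinary Python-level indexing always performs the bounds check that ``@cython.boundscheck(False)`` removes.

```python
import numpy as np

arr = np.zeros(5)

# At the Python/NumPy level an off-by-one index is caught safely:
caught = False
try:
    arr[5] = 1.0          # one past the end
except IndexError:
    caught = True

# Under @cython.boundscheck(False) this exact check is disabled, so the
# same off-by-one write would go past the buffer -- hence the possible segfault.
print(caught)  # True
```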


Further topics
~~~~~~~~~~~~~~
.. _enhancingperf.numba:

Using numba
-----------

A recent alternative to statically compiling cython code is to use a *dynamic jit-compiler*, ``numba``.

Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.

.. note::

You will need to install ``numba``. This is easy with ``conda``: ``conda install numba``; see :ref:`installing using miniconda<install.miniconda>`.

We simply take the plain python code from above and annotate with the ``@jit`` decorator.

.. code-block:: python

    import numba

    @numba.jit
    def f_plain(x):
        return x * (x - 1)

    @numba.jit
    def integrate_f_numba(a, b, N):
        s = 0
        dx = (b - a) / N
        for i in range(N):
            s += f_plain(a + i * dx)
        return s * dx

    @numba.jit
    def apply_integrate_f_numba(col_a, col_b, col_N):
        n = len(col_N)
        result = np.empty(n, dtype='float64')
        assert len(col_a) == len(col_b) == n
        for i in range(n):
            result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
        return result

    def compute_numba(df):
        result = apply_integrate_f_numba(df['a'].values, df['b'].values, df['N'].values)
        return Series(result, index=df.index, name='result')
Similar to above, we pass ``numpy`` arrays directly to the numba function. Further,
we wrap the results to provide a nice interface by passing/returning pandas objects.

.. code-block:: python
- Loading C modules into cython.
In [4]: %timeit compute_numba(df)
1000 loops, best of 3: 798 us per loop
Read more in the `cython docs <http://docs.cython.org/>`__.
Read more in the `numba docs <http://numba.pydata.org/>`__.

.. _enhancingperf.eval:

8 changes: 4 additions & 4 deletions doc/source/groupby.rst
@@ -784,11 +784,11 @@ will be (silently) dropped. Thus, this does not pose any problems:
df.groupby('A').std()
NA group handling
~~~~~~~~~~~~~~~~~
NA and NaT group handling
~~~~~~~~~~~~~~~~~~~~~~~~~

If there are any NaN values in the grouping key, these will be automatically
excluded. So there will never be an "NA group". This was not the case in older
If there are any NaN or NaT values in the grouping key, these will be automatically
excluded. So there will never be an "NA group" or "NaT group". This was not the case in older
versions of pandas, but users were generally discarding the NA group anyway
(and supporting it was an implementation headache).
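A minimal sketch of the behaviour described above (column and key names are arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", np.nan, "a"],
                   "val": [1, 2, 3, 4]})

# Rows whose group key is missing are dropped automatically;
# there is no NA group in the result.
sums = df.groupby("key")["val"].sum()

print(list(sums.index))  # ['a', 'b']
print(sums["a"])         # 1 + 4 = 5
```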

2 changes: 1 addition & 1 deletion doc/source/whatsnew.rst
@@ -18,7 +18,7 @@ What's New

These are new features and improvements of note in each release.

.. include:: whatsnew/v0.17.0.txt
.. include:: whatsnew/v0.16.2.txt

.. include:: whatsnew/v0.16.1.txt

90 changes: 90 additions & 0 deletions doc/source/whatsnew/v0.16.2.txt
@@ -0,0 +1,90 @@
.. _whatsnew_0162:

v0.16.2 (June 12, 2015)
-----------------------

This is a minor bug-fix release from 0.16.1 and includes a large number of
bug fixes along with several new features, enhancements, and performance improvements.
We recommend that all users upgrade to this version.

Highlights include:

- Documentation on how to use ``numba`` with *pandas*, see :ref:`here <enhancingperf.numba>`

Check the :ref:`API Changes <whatsnew_0162.api>` before updating.

.. contents:: What's new in v0.16.2
:local:
:backlinks: none

.. _whatsnew_0162.enhancements:

New features
~~~~~~~~~~~~

.. _whatsnew_0162.enhancements.other:

Other enhancements
^^^^^^^^^^^^^^^^^^

.. _whatsnew_0162.api:

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0162.api_breaking:

.. _whatsnew_0162.api_breaking.other:

Other API Changes
^^^^^^^^^^^^^^^^^

- ``Holiday`` now raises ``NotImplementedError`` if both ``offset`` and ``observance`` are used in the constructor. (:issue:`10217`)

.. _whatsnew_0162.performance:

Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`)
- Modest improvement in datetime writing speed in to_csv (:issue:`10271`)

.. _whatsnew_0162.bug_fixes:

Bug Fixes
~~~~~~~~~

- Bug where ``read_hdf`` ``store.select`` modifies the passed columns list when
  multi-indexed (:issue:`7212`)
- Bug in ``Categorical`` repr with ``display.width`` of ``None`` in Python 3 (:issue:`10087`)

- Bug in ``groupby.apply`` aggregation for ``Categorical`` not preserving categories (:issue:`10138`)
- Bug in ``mean()`` where integer dtypes can overflow (:issue:`10172`)
- Bug where ``Panel.from_dict`` does not set dtype when specified (:issue:`10058`)
- Bug in ``Index.union`` raising ``AttributeError`` when passed array-likes (:issue:`10149`)
- Bug in ``Timestamp``'s ``microsecond``, ``quarter``, ``dayofyear``, ``week`` and ``daysinmonth`` properties returning ``np.int`` type, not built-in ``int``. (:issue:`10050`)
- Bug in ``NaT`` raising ``AttributeError`` when accessing the ``daysinmonth``, ``dayofweek`` properties. (:issue:`10096`)


- Bug in getting timezone data with ``dateutil`` on various platforms (:issue:`9059`, :issue:`8639`, :issue:`9663`, :issue:`10121`)
- Bug in displaying datetimes with mixed frequencies; 'ms' datetimes now display to the proper precision. (:issue:`10170`)

- Bug in ``Series`` arithmetic methods incorrectly holding names (:issue:`10068`)

- Bug in ``DatetimeIndex`` and ``TimedeltaIndex`` names being lost after timedelta arithmetic (:issue:`9926`)


- Bug in ``Series.plot(label="LABEL")`` not correctly setting the label (:issue:`10119`)

- Bug in ``plot`` not defaulting to the matplotlib ``axes.grid`` setting (:issue:`9792`)

- Bug in ``Series.align`` resetting ``name`` when ``fill_value`` is specified (:issue:`10067`)
- Bug in ``SparseSeries.abs`` resetting ``name`` (:issue:`10241`)


- Bug in ``GroupBy.get_group`` raising ``ValueError`` when group key contains ``NaT`` (:issue:`6992`)
- Bug in ``SparseSeries`` constructor ignoring input data name (:issue:`10258`)

- Bug where ``infer_freq`` infers a timerule (WOM-5XXX) unsupported by ``to_offset`` (:issue:`9425`)

- Bug when masking an empty ``DataFrame`` (:issue:`10126`)
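Several of the fixes above concern ``name`` propagation; a minimal sketch of the ``Series.align`` case (:issue:`10067`), assuming a pandas version that includes the fix:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], name="x")
s2 = pd.Series([10, 20], name="x")

# With fill_value specified, align no longer resets the shared name:
left, right = s1.align(s2, fill_value=0)

print(left.name, right.name)  # x x
print(list(right))            # [10, 20, 0]
```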
23 changes: 2 additions & 21 deletions doc/source/whatsnew/v0.17.0.txt
@@ -1,9 +1,9 @@
.. _whatsnew_0170:

v0.17.0 (July ??, 2015)
v0.17.0 (July 31, 2015)
-----------------------

This is a major release from 0.16.1 and includes a small number of API changes, several new features,
This is a major release from 0.16.2 and includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that all
users upgrade to this version.

@@ -53,26 +53,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

- Improved ``Series.resample`` performance with dtype=datetime64[ns] (:issue:`7754`)

.. _whatsnew_0170.bug_fixes:

Bug Fixes
~~~~~~~~~

- Bug in ``Categorical`` repr with ``display.width`` of ``None`` in Python 3 (:issue:`10087`)


- Bug where Panel.from_dict does not set dtype when specified (:issue:`10058`)
- Bug in ``Timestamp``'s' ``microsecond``, ``quarter``, ``dayofyear``, ``week`` and ``daysinmonth`` properties return ``np.int`` type, not built-in ``int``. (:issue:`10050`)
- Bug in ``NaT`` raises ``AttributeError`` when accessing to ``daysinmonth``, ``dayofweek`` properties. (:issue:`10096`)

- Bug in getting timezone data with ``dateutil`` on various platforms ( :issue:`9059`, :issue:`8639`, :issue:`9663`, :issue:`10121`)
- Bug in display datetimes with mixed frequencies uniformly; display 'ms' datetimes to the proper precision. (:issue:`10170`)


- Bug in ``DatetimeIndex`` and ``TimedeltaIndex`` names are lost after timedelta arithmetics ( :issue:`9926`)

- Bug in `Series.plot(label="LABEL")` not correctly setting the label (:issue:`10119`)


