Add STL to Time Series Documentation (#3835)
* Update PolynomialDecomposer section of docs

* Add weather part of stl for docs

* Add STL with synthetic data to docs

* Remove "could not find time index None" warning
eccabay committed Nov 16, 2022
1 parent 1742fed commit 9def394
Showing 3 changed files with 129 additions and 21 deletions.
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -15,6 +15,7 @@ Release Notes
* Removed Featuretools version upper bound and prevent Woodwork 0.20.0 from being installed :pr:`3813`
* Updated min Featuretools version to 0.16.0, min nlp-primitives version to 2.9.0 and min Dask version to 2022.2.0 :pr:`3823`
* Documentation Changes
* Added information about STL Decomposition to the time series docs :pr:`3835`
* Testing Changes

.. warning::
142 changes: 124 additions & 18 deletions docs/source/user_guide/timeseries.ipynb
@@ -255,7 +255,7 @@
"metadata": {},
"source": [
"## Trending and Seasonality Decomposition\n",
"Decomposing a target signal into a trend and/or a cyclical signal is a common pre-processing step for time series modeling. Having an understanding of the presence or absence of these component signals can provide additional insight and decomposing the signal into these constituent components can enable non-time-series aware estimators to perform better while attempting to model this data.\n",
"Decomposing a target signal into a trend and/or a cyclical signal is a common pre-processing step for time series modeling. Understanding the presence or absence of these component signals can provide additional insight, and decomposing the signal into its constituent components can enable non-time-series-aware estimators to perform better while modeling this data. We have two unique decomposers, the `PolynomialDecomposer` and the `STLDecomposer`.\n",
"\n",
"Let's first take a look at a year's worth of the weather dataset."
]
@@ -279,7 +279,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"With the knowledge that this is a weather dataset and the data itself is daily weather data, we can assume that the seasonal data will have a period of approximately 365 data points. Let's build and fit a polynomial decomposer to detrend and deseasonalize this data."
"With the knowledge that this is a weather dataset and the data itself is daily weather data, we can assume that the seasonal data will have a period of approximately 365 data points. Let's build and fit decomposers to detrend and deseasonalize this data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Polynomial Decomposer"
]
},
{
@@ -305,7 +312,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is the residual signal, with the trend and seasonality removed. If we want to look at what the component identified as the trend and seasonality, we can call the `get_trend_df()` function."
"The result is the residual signal, with the trend and seasonality removed. If we want to look at what the component identified as the trend and seasonality, we can call the `plot_decomposition()` function."
]
},
{
@@ -314,18 +321,7 @@
"metadata": {},
"outputs": [],
"source": [
"res = pdc.get_trend_dataframe(X_train_time, y_train_time)\n",
"fig, axs = plt.subplots(4)\n",
"fig.set_size_inches(18.5, 14.5)\n",
"axs[0].plot(y, \"r\")\n",
"axs[0].set_title(\"signal\")\n",
"axs[1].plot(res[0][\"trend\"], \"b\")\n",
"axs[1].set_title(\"trend\")\n",
"axs[2].plot(res[0][\"seasonality\"], \"g\")\n",
"axs[2].set_title(\"seasonality\")\n",
"axs[3].plot(res[0][\"residual\"], \"y\")\n",
"axs[3].set_title(\"residual\")\n",
"plt.show()"
"res = pdc.plot_decomposition(X_train_time, y_train_time)"
]
},
{
@@ -349,9 +345,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The children of the `Decomposer` class, if not explicitly set in the constructor, set their `seasonal_period` parameter\n",
"based on a `statsmodels` function, `freq_to_period`, that considers the frequency of the datetime data. This will give a\n",
"reasonable guess as to what the frequency could be. For example, if the `PolynomialDecomposer` object is fit with\n",
"The `PolynomialDecomposer` class, if not explicitly set in the constructor, will set its `seasonal_period` parameter\n",
"based on the `statsmodels` function `freq_to_period`, which considers the frequency of the datetime data. This gives a reasonable guess as to what the period could be. For example, if the `PolynomialDecomposer` object is fit with\n",
"`seasonal_period` not explicitly set, it will default to 7, which is appropriate for daily data signals with a known weekly seasonal component.\n",
"\n",
@@ -372,6 +367,117 @@
"assert 363 < pdc.seasonal_period < 368"
]
},
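As context for the default described above, here is a minimal sketch of the `statsmodels` helper the decomposer relies on. This assumes `freq_to_period` is importable from `statsmodels.tsa.tsatools`, which is where it lives in recent statsmodels versions; it is not part of the committed notebook.

```python
# Sketch: how statsmodels maps a pandas frequency string to a default
# seasonal period. Assumes statsmodels.tsa.tsatools.freq_to_period exists
# (true for recent versions; the module path may differ in older releases).
from statsmodels.tsa.tsatools import freq_to_period

daily_period = freq_to_period("D")   # daily data -> weekly seasonality
weekly_period = freq_to_period("W")  # weekly data -> annual seasonality

print(daily_period, weekly_period)
```

This is why a `PolynomialDecomposer` fit on daily data without an explicit `seasonal_period` ends up with a value of 7.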
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### STLDecomposer\n",
"\n",
"The `STLDecomposer` is built on [statsmodels' implementation](https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.STL.html) of [STL decomposition](https://otexts.com/fpp3/stl.html). Let's take a look at how STL decomposes the weather dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.pipelines.components import STLDecomposer\n",
"\n",
"stl = STLDecomposer()\n",
"X_t, y_t = stl.fit_transform(X_train_time, y_train_time)\n",
"\n",
"res = stl.plot_decomposition(X_train_time, y_train_time)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't look nearly as good as the `PolynomialDecomposer` did. This is because STL decomposition performs best when the data has a small seasonal period, generally less than 14 time units. The weather dataset's seasonal period of approximately 365 days does not work as well, since STL extracted a shorter-term seasonality for decomposition.\n",
"\n",
"We can generate some synthetic data that better highlights where STL performs well. For this example, we'll generate monthly data with an annual seasonal period."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from datetime import datetime\n",
"from sklearn.preprocessing import minmax_scale\n",
"\n",
"\n",
"def generate_synthetic_data(\n",
" period=12,\n",
" num_periods=25,\n",
" scale=10,\n",
" seasonal_scale=2,\n",
" trend_degree=1,\n",
" freq_str=\"M\",\n",
"):\n",
" freq = 2 * np.pi / period\n",
" x = np.arange(0, period * num_periods, 1)\n",
" dts = pd.date_range(datetime.today(), periods=len(x), freq=freq_str)\n",
" X = pd.DataFrame({\"x\": x})\n",
" X = X.set_index(dts)\n",
"\n",
" if trend_degree == 1:\n",
" y_trend = pd.Series(scale * minmax_scale(x + 2))\n",
" elif trend_degree == 2:\n",
" y_trend = pd.Series(scale * minmax_scale(x**2))\n",
" elif trend_degree == 3:\n",
" y_trend = pd.Series(scale * minmax_scale((x - 5) ** 3 + x**2))\n",
" y_seasonal = pd.Series(seasonal_scale * np.sin(freq * x))\n",
" y_random = pd.Series(np.random.normal(0, 1, len(X)))\n",
" y = y_trend + y_seasonal + y_random\n",
" return X, y\n",
"\n",
"\n",
"X_stl, y_stl = generate_synthetic_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how the `STLDecomposer` does at decomposing this data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stl = STLDecomposer()\n",
"X_t_stl, y_t_stl = stl.fit_transform(X_stl, y_stl)\n",
"\n",
"res = stl.plot_decomposition(X_stl, y_stl)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On top of decomposing this type of data well, the statsmodels implementation of STL automatically determines the seasonal period of the data, which this component saves at fit time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stl = STLDecomposer()\n",
"assert stl.seasonal_period == 7\n",
"stl.fit(X_stl, y_stl)\n",
"print(stl.seasonal_period)"
]
},
{
"cell_type": "markdown",
"metadata": {},
7 changes: 4 additions & 3 deletions evalml/utils/gen_utils.py
@@ -666,9 +666,10 @@ def get_time_index(X: pd.DataFrame, y: pd.Series, time_index_name: str):

# If user's provided time_index doesn't exist, log it and find some datetimes to use
elif (time_index_name is None) or time_index_name not in X.columns:
logger.warning(
f"Could not find requested time_index {time_index_name}",
)
if time_index_name is not None:
logger.warning(
f"Could not find requested time_index {time_index_name}",
)
# Use the feature data's index, preferentially
num_datetime_features = X.ww.select("Datetime").shape[1]
if isinstance(X.index, pd.DatetimeIndex):
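The intent of the change above can be sketched as a standalone helper. `warn_if_missing_time_index` is a hypothetical name for illustration only, not part of evalml's API; it mirrors the guard added in `get_time_index` so that a `time_index_name` of `None` no longer produces a confusing "Could not find requested time_index None" warning.

```python
import logging

logger = logging.getLogger("example")

def warn_if_missing_time_index(time_index_name, columns):
    """Warn only when the caller actually named a time_index that is missing.

    Returns the warning message when one is emitted, else None, so callers
    (and tests) can observe the behavior without capturing log output.
    """
    if time_index_name is not None and time_index_name not in columns:
        msg = f"Could not find requested time_index {time_index_name}"
        logger.warning(msg)
        return msg
    # time_index_name is None, or the column exists: nothing to warn about.
    return None
```

The fix keeps the fallback behavior (using the feature data's datetime index or columns) unchanged; only the spurious warning is suppressed.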
