Upgrade all dependencies to the latest versions #761
Conversation
Upgraded to the most recent version of everything, with the following caveats:
- miniconda is 4.8.3 instead of 4.8.4 since there doesn't appear to be a 4.8.4 labelled binary yet and I don't want to use `latest`
- remove versioned python since it is installed automatically
  - specifying python also causes major dependency incompatibilities, since not all packages have been upgraded to the latest versions
- remove the manual installs since we are upgrading to the most recent version

Testing done:
- Setup works locally
- Running tests on CI; will run locally if there are persistent failures
Add the dependencies to environment.yml, but only the direct dependencies since the indirect dependencies take care of themselves.
MongoDB has deprecated `count()` on cursors and has no plans of ever restoring it (https://jira.mongodb.org/browse/PYTHON-1724). We used `Collection.count()` and `Cursor.count()` extensively. This commit replaces all of those with the new API calls. Concretely, the find-and-replace regexes were:
- `Collection.find\((.*)\)\.count\(\)` ➡️ `Collection.count_documents(\1)`
- `Collection.find\(\)\.count\(\)` ➡️ `Collection.estimated_document_count()`
- `Collection.count\(\)` ➡️ `Collection.estimated_document_count()`

In a couple of files, we returned a cursor from a function, checked the count, and then used values from the cursor. I had to replace these with a separate call to get the count. These were in:
- `emission/storage/timeseries/builtin_timeseries.py`
- `emission/storage/decorations/section_queries.py`
- `emission/storage/decorations/stop_queries.py`

I also had to change `$where` ➡️ `$expr` in `emission/analysis/modelling/tour_model/prior_unused/exploratory_scripts/explore_smoothing_trajectories.py`.
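To make the replacement concrete, here is a minimal sketch of the new pymongo calls (the database, collection, and query names are hypothetical, not taken from the codebase):

```python
import pymongo

client = pymongo.MongoClient()
db = client.Stage_database        # hypothetical database name
coll = db.Stage_timeseries        # hypothetical collection name

query = {"metadata.key": "background/location"}

# old: coll.find(query).count()
matching = coll.count_documents(query)

# old: coll.find().count() or coll.count()
total = coll.estimated_document_count()

print(matching, total)
```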
`as_matrix` has been removed from pandas. From the [0.22 docs](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.as_matrix.html):

> This method is provided for backwards compatibility. Generally, it is recommended to use ‘.values’.

However, in the [most recent docs](https://pandas.pydata.org/pandas-docs/version/1.1.0/reference/api/pandas.DataFrame.values.html):

> We recommend using DataFrame.to_numpy() instead.

Fortunately, that doesn't appear to be [deprecated (yet)](https://pandas.pydata.org/pandas-docs/version/1.1.0/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy). Replace all `as_matrix` by `to_numpy`.
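A quick illustration of the replacement (the DataFrame here is made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"lat": [37.390123, 37.391456],
                   "lon": [-122.086753, -122.088912]})

# old (removed in pandas 1.0): df.as_matrix()
arr = df.to_numpy()  # plain numpy ndarray with the same values
print(arr.shape)     # (2, 2)
```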
Due to changes in the way that pandas interacts with numpy, `numpy.nonzero` does not work properly with pandas series (see example below).

```
$ ./e-mission-py.bash
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:37:09)
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([True, False, True, False, False, True])
>>> np.nonzero(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in nonzero
  File "/Users/kshankar/miniconda-4.8.3/envs/emission/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1908, in nonzero
    return _wrapfunc(a, 'nonzero')
  File "/Users/kshankar/miniconda-4.8.3/envs/emission/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 55, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/Users/kshankar/miniconda-4.8.3/envs/emission/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 48, in _wrapit
    result = wrap(result)
  File "/Users/kshankar/miniconda-4.8.3/envs/emission/lib/python3.7/site-packages/pandas/core/generic.py", line 1787, in __array_wrap__
    return self._constructor(result, **d).__finalize__(
  File "/Users/kshankar/miniconda-4.8.3/envs/emission/lib/python3.7/site-packages/pandas/core/series.py", line 314, in __init__
    f"Length of passed values is {len(data)}, "
ValueError: Length of passed values is 1, index implies 6.
>>> np.nonzero(s.to_numpy())
(array([0, 2, 5]),)
```

We fix this by changing `np.nonzero(s)` ➡️ `np.nonzero(s.to_numpy())` iff `s` is a pandas series.

This change was only needed in one file:
- `emission/analysis/intake/segmentation/section_segmentation_methods/smoothed_high_confidence_motion.py`: fixed

The others were already numpy arrays and did not need conversion:
- `emission/analysis/classification/inference/mode/seed/pipeline.py`: the `featureMatrix` is a numpy array
- `emission/analysis/classification/inference/Classifier.ipynb`: ditto
- `emission/analysis/modelling/tour_model/prior_unused/exploratory_scripts/plot_error_types.py`: the array is already a numpy array
- `emission/analysis/intake/cleaning/cleaning_methods/jump_smoothing.py`: `inlier_mask_` seems to be a normal array. Tested in `emission/tests/analysisTests/intakeTests/TestLocationSmoothing.py`
- `emission/incomplete_tests/TestGpsSmoothing.py`: ditto
While testing this locally, I ran into a regression caused by an assertion error in place squishing. The mismatch is in one of the initial speed computations.
The change is small, so depending on the results of the investigation, we could just fix the "ground truth" and move on. But let's try to understand how it happened first.
So the original speeds were almost identical, but after inserting, there was a mismatch.
The change happens while inserting
Looking at the code, we insert and then fix a bunch of values. The insert check depends on the
And the reset values are identical, except for some rounding.
Let's see where we got these values from and how we rounded them. We got those
But back in section segmentation, the related place seems to be accurate
Ok so looking further, the
Which is because the raw trip point is different
Not sure why the change would trigger an assert, though. While this may be different from the ground truth, it should be internally consistent.
So the mismatch while inserting the entry is caused by the distance mismatch
But why is it different while recalculating? The new reconstructed location is the same
So the newly added point is the same, and the second point is the same, so the recalculated speeds are the same. But the distance was different, so the non-validated insertion was different. What is different between the two distance calculations?
Ah, so I think I have the root cause! It looks like the coordinates in the
When we compute the delta distance, we use the coordinates, which are rounded in this version
But when we recompute the speeds and distances, we use the latitude and longitude fields.
A simple fix would be to use a consistent value (either coordinates or lat/lon fields) throughout. In this case, since we are upgrading, probably fixing the rounding is the safer fix.
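A hypothetical illustration of the two representations involved (the field names and values here are my assumption for the sketch, not verbatim from the codebase):

```python
# The same location carries both a GeoJSON geometry (rounded by newer geojson)
# and flat latitude/longitude fields (unrounded), so distance computations
# disagree depending on which representation is read.
loc = {
    "loc": {"type": "Point", "coordinates": [-122.086753, 37.390123]},  # rounded
    "longitude": -122.08675309,  # unrounded
    "latitude": 37.39012345,     # unrounded
}

delta_from_coords = loc["loc"]["coordinates"]            # used for the delta distance
delta_from_fields = [loc["longitude"], loc["latitude"]]  # used when recomputing
assert delta_from_coords != delta_from_fields            # the mismatch
```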
Aha! From https://pypi.org/project/geojson/: the library now rounds coordinates, with a default precision of 6 decimal places.
So our options are:
Given that this change is already long and complex, I vote for (4).
From https://github.com/jazzband/geojson/blob/master/CHANGELOG.rst, this was introduced in version 2.5.0.
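A quick way to see the rounding (assuming geojson >= 2.5.0; the coordinates are made up):

```python
import geojson

# geojson 2.5.0+ rounds coordinates to 6 decimal places by default
p = geojson.Point((-122.08675309, 37.39012345))
print(p["coordinates"])  # [-122.086753, 37.390123]
```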
That fixed the assertion error but resulted in a ground truth regression. Let's try reverting back to 2.3.0, and if that doesn't work, we have to debug further.
To fix e-mission#761 (comment). Rationale in e-mission#761 (comment).
Reverting back to 2.3.0 didn't fix it. However, looking at the full list of regressions, we have an issue with some of the modules (e.g. location smoothing). If we are not smoothing out the correct values, that could be a reason for the difference in the distances. Let's fix that first.
In 0bb8e59, we determined that the `inlier_mask_` was a numpy array. This is true for `SmoothBoundary` and `SmoothPosdap`, but NOT for `SmoothZigzag`. `SmoothZigzag` does use a pandas Series:

```
self.inlier_mask_ = pd.Series([True] * with_speeds_df.shape[0])
```

So we need to use `to_numpy()` while accessing it.
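A minimal standalone sketch of the access pattern (the mask values are made up):

```python
import numpy as np
import pandas as pd

# stand-in for SmoothZigzag's inlier_mask_
inlier_mask = pd.Series([True, False, True, False, False, True])

# np.nonzero on the raw Series raises in newer pandas (see the earlier
# comment); converting to a numpy array first gives the expected positions
print(np.nonzero(inlier_mask.to_numpy()))  # (array([0, 2, 5]),)
```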
The filtering error is
And indeed we deleted 8 points instead of the original 12
The main difference seems to start here:
Checking the segment list, we have (in both cases):
Segment 9 is clearly the shortest non-cluster segment. Why is the code not picking it?
In pandas, the behavior of `argmin` has changed. It used to return the **label** of the minimum. It now returns the **position** of the minimum. This was foreshadowed in https://pandas-docs.github.io/pandas-docs-travis/reference/api/pandas.Series.argmin.html

> Deprecated since version 0.21.0.
> The current behaviour of ‘Series.argmin’ is deprecated, use ‘idxmin’ instead. The behavior of ‘argmin’ will be corrected to return the positional minimum in the future. For now, use ‘series.values.argmin’ or ‘np.argmin(np.array(values))’ to get the position of the minimum row.

And verified through experimentation.

```
$ ./e-mission-py.bash
Python 3.6.1 | packaged by conda-forge | (default, May 11 2017, 18:00:28)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> s = pd.Series([8.6, 14, 15, 7, 0.7, 0.1, 1.6], index=[0,2,4,6,7,9,10])
>>> s.argmin()
9
>>> s.idxmin()
9
>>>
```

```
$ ./e-mission-py.bash
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:37:09)
[Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> s = pd.Series([8.6, 14, 15, 7, 0.7, 0.1, 1.6], index=[0,2,4,6,7,9,10])
>>> s.argmin()
5
>>> s.idxmin()
9
>>>
```

Let's fix this by replacing `argmin` ➡️ `idxmin` everywhere.
It turns out that we were also passing cursors around in the pipeline. Let's fix it the same way as f122aa2, by returning both the cursor and the count. This now works properly; the `TestPipelineSeed` tests now pass completely.
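A sketch of the pattern (the function name and query are hypothetical, not the actual pipeline code):

```python
# Since Cursor.count() is gone, compute the count up front with
# count_documents() and return it alongside the cursor, instead of letting
# callers call .count() on the cursor they receive.
def find_entries_with_count(collection, query):
    count = collection.count_documents(query)
    cursor = collection.find(query)
    return cursor, count
```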
Removes a bunch of UserWarnings of the form

```
e-mission-server/emission/analysis/intake/cleaning/location_smoothing.py:55: UserWarning: DataFrame columns are not unique, some columns will be omitted.
```
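For reference, one common way to drop duplicate columns (a sketch; not necessarily the exact fix in this commit):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["ts", "lat", "ts"])  # duplicated "ts"

# keep only the first occurrence of each column name
df = df.loc[:, ~df.columns.duplicated()]
print(df.columns.tolist())  # ['ts', 'lat']
```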
To avoid version-mismatch warnings while unpickling the model

```
/home/runner/miniconda-4.8.3/envs/emissiontest/lib/python3.7/site-packages/sklearn/base.py:334: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.18.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
```
This is a follow-on fix to 3032f79. Without this fix, there were failures similar to

```
  File "/home/runner/work/e-mission-server/e-mission-server/emission/analysis/intake/cleaning/clean_and_resample.py", line 119, in save_cleaned_segments_for_timeline
    filtered_trip = get_filtered_trip(ts, trip)
  ...
KeyError: "['heading'] not found in axis"
```
Due to the warning

```
e-mission-server/e-mission-server/emission/tests/common.py:41: DeprecationWarning: collection_names is deprecated. Use list_collection_names instead.
  collections = db.collection_names()
```
Due to the warning

```
e-mission-server/emission/tests/coreTests/TestBase.py:105: DeprecationWarning: Please use assertRaisesRegex instead.
  with self.assertRaisesRegexp(AttributeError, ".*not defined.*"):
```
This is a follow-on to 3e9c566
Now that all the software dependencies are done
Before this change, we were returning the values in the database order:

```
def getKeyListForType(self, message_type):
    return self.db.find({"user_id": self.user_id,
        "metadata.type": message_type}).distinct("metadata.key")
```

The database order changed between 3.4.0 and 4.4.0, leading to a regression when we tested with c71712f:

```
======================================================================
FAIL: testDeleteObsoleteEntries (__main__.TestBuiltinUserCacheHandlerOutput)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "emission/tests/netTests/TestBuiltinUserCacheHandlerOutput.py", line 121, in testDeleteObsoleteEntries
    self.assertEqual(uc.getDocumentKeyList(), ["2015-12-30", "2015-12-29"])
AssertionError: Lists differ: ['2015-12-29', '2015-12-30'] != ['2015-12-30', '2015-12-29']

First differing element 0:
'2015-12-29'
'2015-12-30'

- ['2015-12-29', '2015-12-30']
+ ['2015-12-30', '2015-12-29']
```

We now force an order and update the checks to verify that order.
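A minimal sketch of "forcing an order" (the descending sort is my assumption, based on the expected list in the test above):

```python
def getKeyListForType(self, message_type):
    keys = self.db.find({"user_id": self.user_id,
        "metadata.type": message_type}).distinct("metadata.key")
    # distinct() returns keys in database order, which changed between
    # MongoDB 3.4.0 and 4.4.0; sort explicitly so the order is deterministic
    return sorted(keys, reverse=True)
```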
If we use the e-mission server base image instead, it has the conda version built in, and we can't test with a changed conda version. Also, dockerhub does not currently have an image for the version of conda that we are using (4.8.3); the most recent version on dockerhub is 4.8.2, published 5 months ago. So let's just start with a blank ubuntu image and install everything from scratch. That will allow us to test with the changed conda version before we merge.
With this fix, `docker-compose` seems to work locally
Make corresponding changes to all the scripts in `bin` as well. Concretely:
- change all instances of `Cursor.count()`, similar to e-mission@f122aa2

None of the other changes identified apply to these fairly simple scripts:

```
$ grep -r as_matrix bin/
$ grep -r nonzero bin/
$ grep -r argmin bin/
$
```
This makes it consistent with 57c83732591742f4967e41978b7047b14c47dbb0 and 98882c79c86d613024cb4dc85070185e89ed5634, and with e-mission/e-mission-server#761 in general. Other example updates TBD.