v-measure #19
base: master
Conversation
Your data also comes with labels. The labels are represented by a 3-tuple of values. Why can't you use a similar technique (e.g.
How are you doing this?
Yes. That is the way that all ML algorithms typically work. I think this is why we have been talking at cross-purposes for a few days.
I don't understand this statement. The score in the notebook is
Thanks for the clarification on
Yes, that's what I am doing
@shankari I have a question about user_input
and
?? If I remember correctly,
Correct. There is no difference. Both of those are invalid.
There should be very few entries like that.
I haven't counted all users.
For user[2]
For user[3]
For user[4], there are some NaN in both
At this point, I am going to get the v-measure for the original user_input, for the user_input after changing the language, and for the user_input after converting purposes and modes, on all bins, on bins above the cutoff, and on clusters above the cutoff, as @shankari suggested.
I don't recall seeing any. I double-checked the entries. I suspect this may be because of your mapping code, but you can try the code from the public dashboard to see if you still have them.
I didn't convert anything for this user; the output is from non_empty_trips.
So, should I map
@corinne-hcr I am not sure which user you are talking about; the order that you and I have may not be the same. I don't think we should share UUIDs in this public forum, and by design, the public dashboard only performs aggregate analysis. But I can confirm that I don't see any. Again, it would be helpful to commit your code here. My code is checked in and you can verify against it any time.
"for i in range(len(bins)):\n", | ||
" for trip in bins[i]:\n", | ||
" labels_pred.append(i)\n", | ||
"labels_pred" |
Although the actual labels can be anything (1,1,1,1 == 5,5,5,5), I am pretty sure that the metric expects labels_true and labels_pred to be in the same order - e.g. if labels_true is generated from trips [t1, t2, t3, t4], then labels_pred must be generated from the same trips in the same order.

Are you sure that your implementation here matches that?

While generating labels_true, you are iterating over the trips in bin_trips order.
While generating labels_pred, you are iterating bin by bin.

So it would seem that if the bins are (t2, t3) and (t1, t4), then labels_true would be in the order [t1, t2, t3, t4] and labels_pred would be in the order [t2, t3, t1, t4].

Maybe I don't fully recall the structure of bin_trips. But to be on the safe side, can you add verification to the notebook to show that the trips are actually being processed in the same order?
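To make the ordering requirement concrete, here is a small illustration (not code from this PR) that calls sklearn directly; the trips t1..t4 are hypothetical:

```python
# Minimal sketch: v-measure is invariant to renaming the label values, but it
# is NOT invariant to reordering one list relative to the other, because
# element i of labels_true and labels_pred must refer to the same trip.
from sklearn import metrics

# ground-truth labels for trips [t1, t2, t3, t4]
labels_true = [0, 1, 1, 0]

# renaming the cluster ids does not change the score
print(metrics.v_measure_score(labels_true, [5, 7, 7, 5]))  # 1.0

# but building labels_pred in a different trip order, e.g. [t2, t3, t1, t4],
# breaks the element-wise correspondence and changes the score
print(metrics.v_measure_score(labels_true, [7, 7, 5, 5]))  # 0.0
```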
I think they are in the same order.
You can see the bin_df - that's the trip order after binning. It follows the non_empty_trips indices in the bins.
The following is the list from which I collect labels_true; it is indeed ordered according to the indices in bin_trips.
I do consider the trip order. I think if bin_trips_user_input_ls has the same trip order as the trips in bin_df, then the way I collect labels should be consistent. Although I only show part of the trips, you can see from the picture that the trips are in the same order.
Is that clear?
I still don't see this.
The bin_user_input that you have listed above uses bins, not bin_trips, the same as labels_pred.
I think that your main argument is that the labels are the same through spot checking. That is good, but it could also happen by chance, and only for the first few entries. I certainly did not look at every single entry, and I am not sure I would be very accurate at doing so. This is the kind of checking that is much better for computers to do programmatically.
I would feel a lot more confident comparing the trip indices in the original list directly in code. Add an index to the original list (if it doesn't have one) and use it as the dataframe index for both labels_pred and labels_true, and verify that the indices are the same.
If you don't want to add an index, use the timestamp, which is also guaranteed to be unique.
Either way, use a unique value in the input, pass it through the pipeline, and confirm that it is the same.
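A sketch of what that check could look like (the names true_trip_ids and pred_trip_ids are hypothetical placeholders for whatever unique identifier is carried through each label-generation step):

```python
# Hedged sketch, not code from the PR: keep a unique per-trip identifier next
# to each generated label and compare the two orderings in code.
import pandas as pd

labels_true_df = pd.DataFrame({"label": labels_true}, index=true_trip_ids)
labels_pred_df = pd.DataFrame({"label": labels_pred}, index=pred_trip_ids)

# both labelings must have been built from the same trips in the same order
assert labels_true_df.index.equals(labels_pred_df.index), "trip order mismatch"
```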
The red list of bins above contains the original indices of the trips after Naomi filtered out the trips that are points. bin_trips is a new list that collects trips according to the indices in bins.
Which keyword is the timestamp that you want me to add?
trip["data"]["start_ts"] should work as a timestamp. Again, I don't want you to print it; I want you to check it programmatically.
That looks good, but it is not using the full power of pandas/numpy. Nevertheless, please add it to the PR.
Sure.
Did you investigate and determine the reason for the discrepancy?
The following is my investigation. In my file, the trips are from 2020/11 to 2021/1. Here is
So, you can see, the total number of labeled trips from your code is the same as from my code. I also got NaN output from your code. I think that's because the user didn't input a value.
Can you put the related UUID into the Teams chat? As I said earlier, I don't see
does not include
I think you should pick 2020/12 or 2021/1.
I'm not sure what you mean by that. I still need to know the user so that the indices match up, right?
I have already put the UUID in Teams. In 2020/11, there is no NaN for this user, but in 2020/12 and 2021/1, you can see NaN.
nvm, you are right! Instead of manually looking at the
And double-checking those entries from the dataframe does indicate that they are missing.
I think we should skip these entries for now. There are not many of them, and the user has only a few labeled trips to begin with, so I wasn't counting him in my mental training set anyway. You can just skip all users with a <50% labeling rate for this training and tuning stage.
Btw, this is the way Naomi collected
Here is how she collected the indices in bins
Here
Do you mean that, among all users' labeled trips, I should skip users that have <50% fully labeled trips across the three columns?
Correct. You will end up with ~6 users, I think.
We can add 4 other users from another project later; their confirmed labels will be a 2-tuple of
Can you please make sure to remove outputs before committing? I am finding this hard to review.
non_empty_trips = [t for t in trips if t["data"]["user_input"] != {}]
valid_trips = [t for t in non_empty_trips if 'mode_confirm' in t["data"]["user_input"] and
               'purpose_confirm' in t["data"]["user_input"] and 'replaced_mode' in t["data"]["user_input"]]
This is combining valid trip filtering and binning. In the next PR, I would pull out the valid trip filtering into a separate function so it can be reused across the various modules - v-score calculation, query calculation, future clustering, etc.
Also, is there a reason you are not using dropna here instead of hardcoding the entry checks?
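For reference, one way the dropna version could look (a sketch, not the PR's code; it assumes the user_input dicts can be expanded into a dataframe):

```python
import pandas as pd

# expand the user_input dicts into a dataframe, one row per trip
ui_df = pd.DataFrame([t["data"]["user_input"] for t in non_empty_trips])
# make sure the three label columns exist even if no trip contains them
ui_df = ui_df.reindex(columns=["mode_confirm", "purpose_confirm", "replaced_mode"])
# dropna keeps only the rows where all three confirmed labels are present
valid_trips = [non_empty_trips[i] for i in ui_df.dropna().index]
```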
if len(filter_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Again, as we discussed, please combine this with the other validity checks.
if len(bin_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Ditto
if len(filter_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Ditto
if len(bin_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Ditto
labels_true.append(no_dup_list.index(trip))
labels_pred = feat.labels

# compare the points in cluster_trips and those in feat.points, return nothing if two data frames are the same
Can you fix the comment here? Right now, if the frames are the same, we just go on to the score calculation.
What is the difference between the various notebooks here?
Can you add a description at the top of each notebook indicating what it does?
I would split the code into multiple functions so that they can be composed into a pipeline.
the functions would be:
- get_valid_trips
- map_labels (sp2en or cvt_pur_mo)
- valid_user_check
- fit_clusters (with two implementations, bin and cluster)
- get_pred_labels (two implementations?)
- get_user_labels (two implementations?)
- verify_clusters (with two implementations)
- compute scores
Then you have two pipelines, one with the bin implementations and the other with the cluster implementations, calling one after the other without any intermediate steps or repetition.
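A rough sketch of that composition (the function names follow the list above; the signatures, and the bin/cluster variant names in the commented calls, are placeholders rather than a final design):

```python
def run_pipeline(trips, fit, get_pred_labels, get_user_labels, verify):
    # common steps
    valid = map_labels(get_valid_trips(trips))
    if not valid_user_check(valid, trips):
        return None  # caller records NaN scores for this user
    # bin- or cluster-specific steps are passed in as functions
    model = fit(valid)
    labels_pred = get_pred_labels(model, valid)
    labels_true = get_user_labels(valid)
    verify(model, valid)
    return compute_scores(labels_true, labels_pred)

# bin_scores = run_pipeline(trips, fit_bins, bin_pred_labels, bin_user_labels, verify_bins)
# cluster_scores = run_pipeline(trips, fit_clusters, cluster_pred_labels, cluster_user_labels, verify_clusters)
```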
On a second look, the current pipelines are not too bad, since several of the steps will have separate implementations anyway. I will accept a simpler refactoring in which you pull out the common functionality. If you want to implement the cleaner, but more complex, refactoring, I won't complain!
I can pull out And what do you want me to do with
Not sure what this means. The code that uses
I was hoping you would finish the refactoring and add the descriptions for the other notebooks first so I could merge this.
For the scatter plot, the one I have is based on the bins above the cutoff. The trips are grouped only based on coordinates and radius. Back then, we compared the difference between bins and clusters (min_clusters = 0) and found that the bins are better than the clusters. But when we decided to use a second round of clustering, I changed to using k-means for the first round of clustering and set min_clusters = len(bins). So, should I just leave it like that for now? Later, when I have the results from hierarchical clustering, should I write separate code for the improved scatter plot?
Not sure about your question - are you asking whether you need to go back and change the results for the trip-end-only clusters? In general, for any results, you don't have to present only one set of results. You can present a set of results and pick the best one to explore further.
@corinne-hcr please get this PR to a reviewable state before continuing. It is very hard to review large chunks of code with no context. I'm sure that the code is clear to you, but I'd like (at least this time) for the code to be relatively clear to others as well.
That means digestible code chunks with lots of comments explaining what you are doing.
Please get this done before you commit your refactored code.
I added some comments. Please let me know if it is still not digestible.
def valid_user_check(filter_trips,trips,homo_score,comp_score,v_score):
    if not valid_user(filter_trips, trips):
        homo_score.append(NaN)
        comp_score.append(NaN)
        v_score.append(NaN)
        skip = True
    else:
        skip = False
    return homo_score,comp_score,v_score,skip
I'm not going to insist on this since you are planning to refactor anyway, but you don't need to have homo_score, comp_score and v_score as both input and output. You can create a class that encapsulates all the scores and just pass it in directly.
Also, I really don't think you need to have the append code in here - you can just have this function return valid or invalid. If invalid, append the values and continue; if valid, proceed with the computation.
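One possible shape for that (a sketch, not part of the PR; the class name is made up):

```python
from dataclasses import dataclass, field
from numpy import nan

@dataclass
class ClusteringScores:
    homo: list = field(default_factory=list)
    comp: list = field(default_factory=list)
    v: list = field(default_factory=list)

    def append_nan(self):
        # record a skipped user
        self.homo.append(nan)
        self.comp.append(nan)
        self.v.append(nan)

# caller side, with valid_user_check reduced to returning a boolean:
# if not valid_user_check(filter_trips, trips):
#     scores.append_nan()
#     continue
```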
Please file an issue to clean this up later.
return bin_date
# compare the trip orders in bin_trips with those in filter_trips above cutoff |
Improve the comment. What is "trip_orders"? Why does it matter?
Just repeating the function name in the comment doesn't add much.
for bin in bin_date:
    if day:
        if match_day(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break
    if month:
        if match_month(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break
It seems like there should be a better way to implement this in pandas using groupby. Please file an issue for this so you can explore how to clean it up later.
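A rough illustration of the groupby idea (a sketch; it assumes filter_trips can be flattened into a dataframe with start_ts in seconds, and it ignores timezone handling):

```python
import pandas as pd

ts = pd.to_datetime([t["data"]["start_ts"] for t in filter_trips], unit="s")
trip_df = pd.DataFrame({"trip_index": range(len(filter_trips))}, index=ts)

# one list of trip indices per calendar day; use index.to_period("M") for months
bins_by_day = trip_df.groupby(trip_df.index.date)["trip_index"].apply(list).tolist()
```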
return homo_score,comp_score,v_score

# This function is to compare a trip with a group of trips to see if they happened in a same day
What is the meaning of the parameters?
You have a bin and filter_trips. Why do you need to pass in both?
Why not just pass in two trips?
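For illustration, the two-trip version could be as simple as this sketch (hypothetical; it skips the local-timezone handling the real code would need):

```python
from datetime import datetime, timezone

def same_day(trip_a, trip_b):
    # compare only the calendar date of the two start timestamps
    a = datetime.fromtimestamp(trip_a["data"]["start_ts"], tz=timezone.utc)
    b = datetime.fromtimestamp(trip_b["data"]["start_ts"], tz=timezone.utc)
    return (a.year, a.month, a.day) == (b.year, b.month, b.day)
```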
" bin_day = evaluation.bin_date(req_trips_ls,filter_trips,day=True)\n", | ||
" req_day_ls = []\n", | ||
" for bin in bin_day:\n", | ||
" req_day_ls.append(len(bin))\n", | ||
" \n", |
Please add a comment here that describes what bin_day is expected to look like, to make the rest of this easier to review.
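Based on the bin_date implementation quoted later in this PR, the comment might say something along these lines (a suggestion, not the author's wording):

```python
# bin_day: output of evaluation.bin_date(..., day=True); a list of bins,
# where each bin is a list of trip indices whose trips started on the same
# calendar day, e.g. [[0, 3, 4], [1, 2], [5]]
```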
trip = filter_trips[trip_index]

for bin in bin_date:
    if day:
        if match_day(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break
    if month:
        if match_month(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break

if not added:
    bin_date.append([trip_index])
This code is just so messy and non-pythonic
Please add this to the list of areas to improve the implementation.
" if evaluation.match_day(req_trip,valid_trips_bin,filter_trips):\n", | ||
" proportion = round(len(req_trips_bin)/len(valid_trips_bin), 2)\n", | ||
" propor_single_user.append(proportion)\n", | ||
" match = True\n", | ||
" break\n", |
This is the problem with the current match_day implementation. Reading this, it appears that you compare one trip to a list to find a match, and then you use it to determine the proportion for that day for that user. But how is comparing only the first trip OK? What is valid_trips_bin?
This needs more comments. It took me 10 minutes of toggling between the two implementations to figure this out, and that is too much time to spend on what is supposed to be fairly simple code.
My question for the v-measure was what to put in labels_true and labels_pred. (But I have some clues now.)

metrics.homogeneity_score(labels_true, labels_pred)
metrics.completeness_score(labels_true, labels_pred)
metrics.v_measure_score(labels_true, labels_pred)
plot_kmeans_digits and plot_document_clustering are two examples from sklearn. I added some test code to print some of the intermediate output. In plot_document_clustering, the author uses k-means for clustering, but the data come with labels, so he uses the number of distinct labels to determine the number of clusters. Then he calls KMeans and runs the v-measure like this:
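(The screenshots are not preserved in this thread; the block below is only a paraphrase of what the sklearn plot_document_clustering example roughly does, where X and labels are that example's own data and ground-truth labels.)

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# the number of clusters comes from the number of distinct ground-truth labels
true_k = np.unique(labels).shape[0]
km = KMeans(n_clusters=true_k)
km.fit(X)

# the metrics compare the ground-truth labels against km.labels_
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
```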
labels is a 1D ndarray, and so is km.labels_; the size of the array is the size of the whole dataset (so I guess I need to pass in the labels for all of the data, instead of passing in the labels from one bin at a time and then calculating the mean score). However, in the documentation examples, the author passes in lists for both labels_true and labels_pred.

Since the author didn't set random_state in KMeans, the Homogeneity, Completeness, and V-measure are different every time (the labels from clustering are different every time).

In the documentation, there is a sentence - "This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value in any way." I understand what this means now. For example, permuting the cluster label values gives the same score.

The contents above are what I see from the example code. Here are my questions:

For homogeneity_score: in the documentation page, there are some examples of homogeneity_score, but I don't understand them. I tried a similar test in the notebook. The homogeneity_score should be bounded in [0,1], but if I run it on my data, this circumstance won't happen, since there is more than 1 bin/cluster. I just don't understand why this would happen.

Now, I am working on labeling the ground truth based on the user_input (on the data above the cutoff) as labels_true. I will give the same label to trips with the same user_input. Also, I will label the trips in bins according to the bin index as labels_pred. Finally, I will put them into the function.
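For reference, the bounds question can be reproduced with tiny hand-made labelings (the values follow the sklearn documentation examples):

```python
from sklearn import metrics

metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 1, 1])  # 1.0 - every cluster contains a single class
metrics.homogeneity_score([0, 0, 1, 1], [0, 1, 2, 3])  # 1.0 - over-split, but each cluster is still pure
metrics.homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0])  # 0.0 - one cluster mixes both classes
```

And a sketch of the labeling plan just described (hypothetical variable names, mirroring the no_dup_list pattern used elsewhere in this PR):

```python
# ground truth: trips with identical user_input dicts share a label
user_inputs = [t["data"]["user_input"] for t in bin_trips]
no_dup_list = []
for ui in user_inputs:
    if ui not in no_dup_list:
        no_dup_list.append(ui)
labels_true = [no_dup_list.index(ui) for ui in user_inputs]

# predicted labels: every trip in bin i gets label i
labels_pred = []
for i, curr_bin in enumerate(bins):
    labels_pred.extend([i] * len(curr_bin))
```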