v-measure #19

Open · wants to merge 17 commits into master

Conversation

corinne-hcr
Contributor

My question for v-measure was what to put in labels_true and labels_pred. (But I have some clues now)
metrics.homogeneity_score(labels_true, labels_pred)
metrics.completeness_score(labels_true, labels_pred)
metrics.v_measure_score(labels_true, labels_pred)

plot_kmeans_digits and plot_document_clustering are two examples from sklearn. I added some test code to print some of the intermediate values.

In plot_document_clustering, the author uses k-means for clustering, but the data come with labels, so he uses the number of unique labels to determine the number of clusters.

labels = dataset.target
true_k = np.unique(labels).shape[0]

Then he calls KMeans like this

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, verbose=opts.verbose)

He runs v-measure like this

print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))

labels is a 1D ndarray, and so is km.labels_; the size of each array is the number of samples in the whole dataset (so I guess I need to pass in the labels for all the data at once, instead of passing in labels from one bin at a time and then calculating the mean score)

The type of labels is <class 'numpy.ndarray'>
the shape of labels is n_samples 3387
The type of km.labels_ is <class 'numpy.ndarray'>
the shape of km.labels_ is n_samples 3387
true_k is 4
np.unique(labels) is [0 1 2 3]
np.unique(labels).shape is 4

(However, in the documentation examples, the author passes in plain lists for both labels_true and labels_pred)

Since the author didn't set random_state in KMeans, the homogeneity, completeness, and V-measure are different every time (the labels from clustering are different on every run)

In the documentation, there is a sentence: "This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won't change the score value in any way."
I understand what this means now.
For example,

labels_true = [0,0,1,1]
labels_pred = [1,1,2,3]
>>>metrics.homogeneity_score(labels_true, labels_pred)
0.9999999999999999
>>>metrics.completeness_score(labels_true, labels_pred)
0.6666666666666666
>>>metrics.v_measure_score(labels_true, labels_pred)
0.7999999999999999
labels_true = [11,11,1,1]
labels_pred = [2,2,3,4]
>>>metrics.homogeneity_score(labels_true, labels_pred)
0.9999999999999999
>>>metrics.completeness_score(labels_true, labels_pred)
0.6666666666666666
>>>metrics.v_measure_score(labels_true, labels_pred)
0.7999999999999999

The contents above are what I see from the example code.

Here are my questions:
For homogeneity_score:
In the documentation page, there are some examples of homogeneity_score, but I don't understand them.

Clusters that include samples from different classes do not make for an homogeneous labeling:
>>> print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))
0.0...
>>> print("%.6f" % homogeneity_score([0, 0, 1, 1], [0, 0, 0, 0]))
0.0...

I tried a similar test in a notebook.

labels_true = [0,1,0,1]
labels_pred = [0,0,0,0]

>>>metrics.homogeneity_score(labels_true, labels_pred)
1.6017132519074588e-16

The homogeneity_score should be bounded in [0,1]. But if I run on my data, this circumstance won't happen since there is more than one bin/cluster. I just don't understand why this happens.

Now, I am working on labeling the ground truth based on the user_input (on data above the cutoff), which gives labels_true. I will give the same label to trips with the same user_input. Also, I will label the trips in bins according to the bin index, which gives labels_pred. Finally, I will pass both into the function.

@shankari
Contributor

shankari commented Feb 21, 2021

but the data come with labels, so he uses the length of labels to determine the clusters.

Your data also comes with labels. The labels are represented by a 3-tuple of values. Why can't you use a similar technique (e.g. drop_duplicates, which you are already familiar with) to identify the set of unique labels?
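For instance, a minimal sketch of that technique (toy tuples and my own variable names; drop_duplicates on a DataFrame of the three columns would do the same thing):

```python
# Sketch: turn each trip's 3-tuple user input into an integer class label,
# analogous to true_k = np.unique(labels).shape[0] in the sklearn example.
# The tuples below are made-up illustrations, not real data.
user_inputs = [
    ("pilot_ebike", "home", "pilot_ebike"),
    ("drove_alone", "work", "no_travel"),
    ("pilot_ebike", "home", "pilot_ebike"),
    ("walk", "home", "walk"),
]

unique_labels = list(dict.fromkeys(user_inputs))   # de-duplicate, keep first-seen order
labels_true = [unique_labels.index(u) for u in user_inputs]

print(labels_true)         # [0, 1, 0, 2]
print(len(unique_labels))  # 3 -> the "true_k" analogue
```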

Now, I am working on labeling the ground truth based on the user_input (on data above cutoff)

How are you doing this?

so I guess I need to pass in all data labels, instead of data labels from one bin a time then calculate the mean score

Yes. That is the way that all ML algorithms typically work. I think this is why we have been talking at cross-purposes for a few days.

The homogeneity_score should be bounded [0,1]. But if I run on my data, this circumstance won't happen since there are more than 1 bin/cluster. I just don't understand why would this happen.

I don't understand this statement. The score in the notebook is 1.6017132519074588e-16, which is within [0,1] and prints as 0 when formatted the same way as the example.

>>> print("%.6f" % 1.6017132519074588e-16)
0.000000
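To make this concrete: in exact arithmetic the score is exactly 0 here, so the 1.6e-16 is just floating-point rounding inside the library's entropy computation. A from-scratch sketch of the homogeneity formula (pure Python, my own helper names, no sklearn) shows the two entropies cancelling exactly:

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(classes, clusters):
    """H(classes | clusters): weighted entropy of the true classes inside each cluster."""
    n = len(classes)
    total = 0.0
    for k in set(clusters):
        members = [c for c, kk in zip(classes, clusters) if kk == k]
        total += (len(members) / n) * entropy(members)
    return total

def homogeneity(labels_true, labels_pred):
    h_c = entropy(labels_true)
    if h_c == 0:
        return 1.0
    return 1.0 - conditional_entropy(labels_true, labels_pred) / h_c

# All samples in one cluster, classes split 50/50: H(C|K) == H(C), so h == 0 exactly.
print(homogeneity([0, 1, 0, 1], [0, 0, 0, 0]))  # 0.0
# The documentation example with a perfectly homogeneous labeling:
print(homogeneity([0, 0, 1, 1], [1, 1, 2, 3]))  # 1.0
```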

@corinne-hcr
Contributor Author

Thanks for the clarification on 1.6017132519074588e-16

(e.g. drop_duplicates)

Yes, that's what I am doing

@corinne-hcr
Contributor Author

@shankari I have a question about user_input
What's the difference between

51 pilot_ebike home pilot_ebike

and

98 pilot_ebike home same_mode

?

If I remember correctly, same_mode in replaced_mode is the same as the one in mode_confirm

@shankari
Contributor

correct. There is no difference. Both of those are invalid replaced_mode for pilot eBike since we wanted to know what mode people would have used if they didn't have the eBike, but not everybody fixed their labels. The mode and purpose are valid, though.

@shankari
Contributor

There should be very few entries like that

@corinne-hcr
Contributor Author

I haven't counted all users.
For user[0],

['pilot_ebike', 'community_walk', 'same_mode']
['pilot_ebike', 'home', 'same_mode']
['pilot_ebike', 'home', 'same_mode']
['walk', 'home', 'same_mode']

For user[2]

['pilot_ebike', 'work', 'same_mode']
['pilot_ebike', 'work', 'same_mode']
['pilot_ebike', 'work', 'same_mode']
['pilot_ebike', 'shopping', 'same_mode']
['pilot_ebike', 'work', 'same_mode']
['drove_alone', 'work', 'same_mode']
['drove_alone', 'work', 'same_mode']
['drove_alone', 'pick_drop', 'same_mode']
['drove_alone', 'pick_drop', 'same_mode']

For user[3]

['walk', 'pick_drop', 'same_mode']
['walk', 'home', 'same_mode']

For user[4], there are some NaN in both purpose_confirm and replaced_mode, the score is lower.

  mode_confirm purpose_confirm replaced_mode
0  shared_ride             NaN           NaN
1  drove_alone        shopping           NaN
2  drove_alone            work     no_travel
3  shared_ride            work           NaN
  mode_confirm purpose_confirm replaced_mode
0  pilot_ebike   entertainment   drove_alone
1  drove_alone             NaN           NaN
2  pilot_ebike            home   drove_alone
  mode_confirm purpose_confirm replaced_mode
0  pilot_ebike   entertainment   drove_alone
1  drove_alone             NaN           NaN
2  pilot_ebike            home   drove_alone

@corinne-hcr
Contributor Author

At this point, I am going to get v-measure for original user_input, after changing language, and converting purposes and mode, on all bins, bin above cutoff, and clusters above cutoff

@shankari suggested

we should then move on to the number of user requests per month as the other part of the tradeoff
and to the second task to see if we can bump up the accuracies. As part of that, I really want to try out other clustering techniques as well such as DBSCAN

as you are working on the binning, I would suggest making a wrapper around the bin algorithm that is consistent with the sklearn interfaces

that will make it easy for you to experiment with multiple algorithms using sklearn (https://scikit-learn.org/stable/modules/model_evaluation.html)
e.g. https://scikit-learn.org/stable/modules/grid_search.html#grid-search

sklearn has a specific interface that it expects (e.g. https://scikit-learn.org/stable/developers/develop.html). If you implement that interface, you can use the built-in GridSearch (https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search) without having to call the functions directly

one caveat is that I am not sure whether GridSearchCV works well for clustering (as opposed to classification) algorithms (https://stackoverflow.com/questions/25633383/how-can-gridsearchcv-be-used-for-clustering-meanshift-or-dbscan)
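A minimal sketch of the interface sklearn expects (see the develop.html page above for the full contract; the binning logic here is a placeholder, not the real bin algorithm):

```python
# Sketch of a wrapper that follows the sklearn estimator conventions:
# constructor args stored under the same names, get_params/set_params,
# and a fit() that sets trailing-underscore attributes like labels_.
class BinEstimator:
    def __init__(self, radius=100):
        # every constructor arg must be stored unchanged under the same
        # name so that get_params/set_params (and GridSearchCV) work
        self.radius = radius

    def get_params(self, deep=True):
        return {"radius": self.radius}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # placeholder: one "bin" per distinct point; the real version
        # would group trips whose endpoints fall within self.radius
        seen = {}
        self.labels_ = [seen.setdefault(tuple(x), len(seen)) for x in X]
        return self

est = BinEstimator(radius=50).fit([[0, 0], [0, 0], [1, 1]])
print(est.labels_)  # [0, 0, 1]
```

In practice one would subclass sklearn's BaseEstimator to inherit get_params/set_params instead of writing them by hand.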

@shankari
Contributor

For user[4], there are some NaN in both purpose_confirm and replaced_mode, the score is lower.

I don't recall seeing any NaN in the purpose entries.

I double-checked the entries using
https://github.com/e-mission/em-public-dashboard/blob/main/viz_scripts/mode_purpose_share.ipynb

I suspect this may be because of your mapping code. But you can try the code from the public dashboard to see if you still have N/A entries and if not, debug the differences between that code and yours.

>>> expanded_ct["purpose_confirm"].unique()

array(['meal', 'personal_med', 'transit_transfer', 'shopping', 'home',
       'exercise', 'work', 'entertainment', 'pick_drop', 'school',
       ...], dtype=object)

@corinne-hcr
Contributor Author

corinne-hcr commented Feb 21, 2021

I didn't convert anything for this user; the output is from non_empty_trips.
For this user, there are 453 trips in total, but only 33 trips have user_input; after filtering out trips that are points, 28 are left. And that is from all the data, not just data above the cutoff. Do you really have a different output from mine?
For example, 'user_input': {'mode_confirm': 'shared_ride'}}}; I guess the user didn't fill in the other values


mode_confirm | purpose_confirm | replaced_mode

pilot_ebike | entertainment | drove_alone
pilot_ebike | entertainment | drove_alone
shared_ride | NaN | NaN
drove_alone | NaN | NaN
drove_alone | NaN | NaN
pilot_ebike | NaN | NaN
drove_alone | NaN | NaN
shared_ride | NaN | NaN
pilot_ebike | NaN | NaN
drove_alone | NaN | NaN
drove_alone | NaN | NaN
pilot_ebike | exercise | NaN
drove_alone | NaN | NaN
drove_alone | work | no_travel
drove_alone | NaN | no_travel
drove_alone | shopping | NaN
drove_alone | exercise | no_travel
drove_alone | shopping | NaN
drove_alone | work | no_travel
drove_alone | exercise | no_travel
drove_alone | NaN | NaN
drove_alone | NaN | NaN
pilot_ebike | home | drove_alone
pilot_ebike | home | drove_alone
shared_ride | work | NaN
drove_alone | entertainment | NaN
drove_alone | work | NaN
drove_alone | work | no_travel

@corinne-hcr
Contributor Author

So, should I map same_mode to the mode in mode_confirm?

@shankari
Contributor

shankari commented Feb 21, 2021

@corinne-hcr I am not sure which user you are talking about; the order that you and I have may not be the same. I don't think we should share UUIDs in this public forum, and by design, the public dashboard only performs aggregate analysis.

But I can confirm that I don't see any N/A entries for purpose_confirm across all participants. Are you sure you are filtering the users to only look at participants?

Again, it would be helpful to commit your code here. My code is checked in and you can verify against it any time.

Comment on lines +213 to +216
"for i in range(len(bins)):\n",
" for trip in bins[i]:\n",
" labels_pred.append(i)\n",
"labels_pred"
Contributor

@shankari shankari Feb 21, 2021

Although the actual labels can be anything (1,1,1,1 == 5,5,5,5), I am pretty sure that the metric expects the labels_true and labels_pred to be in the same order - e.g. if labels_true is generated from trips [t1, t2, t3, t4] then labels_pred is generated from the same trips in the same order.

Are you sure that your implementation here matches that?

While generating labels_true, you are iterating over the trips in bin_trips order.
While generating labels_pred, you are iterating bin by bin.

So it would seem that if the bins are (t2, t3) and (t1, t4)
then the labels_true would be in order [t1, t2, t3, t4]
and the labels_pred would be in order [t2, t3, t1, t4]

Maybe I don't fully recall the structure of bin_trips. But to be on the safe side, can you add verification to the notebook to show that the trips are actually being processed in the same order?

Contributor Author

I think they are in the same order.

You can see the bin_df
Screen Shot 2021-02-21 at 11 17 43 AM
That's the trip order after binning.
It follows the non_empty_trips indices in the bins
image

The following is the list from which I collect labels_true; it is indeed ordered according to the indices in bin_trips
image

I do consider the trip order. I think if bin_trips_user_input_ls has the same trip order as the trips in bin_df, then the way I collect labels should be the same. Although I only show part of the trips, you can see from the pictures that the trips are in the same order.

Is that clear?

Contributor

@shankari shankari Feb 22, 2021

I still don't see this.

bin_user_input that you have listed above uses bins, not bin_trips, the same as labels_pred.
I think that your main argument is that the labels are the same through spot checking. That is good, but it could also happen by chance and only for the first few entries. I certainly did not look at every single entry, and I am not sure I would be very accurate at doing so. This is the kind of checking that computers do much better programmatically.

I would feel a lot more confident by comparing the trip indices in the original list directly in code. Add an index to the original list (if it doesn't have one) and use it as the dataframe index for both labels_pred and labels_true and verify that the indices are the same.

If you don't want to add the index, use the timestamp, which is also guaranteed to be unique.

But use a unique value in the input, pass it through the pipeline and confirm that it is the same.
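A sketch of that programmatic check (the two trip lists here are hypothetical stand-ins for the bin_trips traversal and the bin-by-bin traversal):

```python
# Compare the start_ts sequence behind labels_true with the one behind
# labels_pred; if they are identical, the two label arrays are aligned.
trips_for_true = [{"data": {"start_ts": 100}}, {"data": {"start_ts": 200}}]
trips_for_pred = [{"data": {"start_ts": 100}}, {"data": {"start_ts": 200}}]

ts_true = [t["data"]["start_ts"] for t in trips_for_true]
ts_pred = [t["data"]["start_ts"] for t in trips_for_pred]
assert ts_true == ts_pred, "labels_true and labels_pred are not aligned"
print("orders match")
```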

Contributor Author

The red list of bins above contains the original indices from trips after Naomi filtered out trips that are points. bin_trips is a new data list that collects trips according to the indices in bins.
Which keyword for the timestamp do you want me to add?

Contributor

trip["data"]["start_ts"] should work as a timestamp. Again, I don't want you to print it, I want you to check it programmatically

Contributor Author

image

Is that ok?

Contributor

@shankari shankari Feb 22, 2021

That looks good, but is not using the full power of pandas/numpy. Nevertheless, please add to PR.

@shankari
Contributor

So, should I map same_mode to the mode in mode_confirm?

Sure

@shankari
Contributor

Do you really have a different output from mine?

But I can confirm that I don't see any N/A entries for purpose_confirm across all participants.

Did you investigate and determine the reason for the discrepancy?

@corinne-hcr
Contributor Author

The following is my investigation.
I copied your scaffolding code, then chose the specific UUID I want to look at.

In my file, the trips are from 2020/11 to 2021/1.
So, I screenshotted those three months of labeled trips from your mode_purpose_share to compare.

2020/11
image
2020/12
image
2021/1
image

Here is non_empty_trips after binning from my file

  mode_confirm purpose_confirm replaced_mode
0  shared_ride             NaN           NaN
1  drove_alone        shopping           NaN
2  drove_alone            work     no_travel
3  shared_ride            work           NaN
  mode_confirm purpose_confirm replaced_mode
0  pilot_ebike   entertainment   drove_alone
1  drove_alone             NaN           NaN
2  pilot_ebike            home   drove_alone
  mode_confirm purpose_confirm replaced_mode
0  pilot_ebike   entertainment   drove_alone
1  drove_alone             NaN           NaN
2  pilot_ebike            home   drove_alone

So, you can see, the total labeled trips from your code are the same as from my code. I also got NaN output from your code. I think that's because the user didn't input a value.

@shankari
Contributor

can you put the related UUID into the teams chat? As I said earlier, I don't see NaN as a value in purpose_confirm when I run it

expanded_ct["purpose_confirm"].unique()

does not include NaN

@shankari
Contributor

>>> expanded_ct["purpose_confirm"].isnull()
106     False
108     False
109     False
111     False
113     False
        ...  
4105    False
4106    False
4107    False
4108    False
4109    False
Name: purpose_confirm, Length: 2425, dtype: bool

@corinne-hcr
Contributor Author

I think you should pick 2020/12 or 2021/1

@shankari
Contributor

shankari commented Feb 22, 2021

I'm not sure what you mean by that. I still need to know the user so that the indices match up, right?
Otherwise, if I look at entry 599 from your screenshots above (which is missing some values), I get

>>> expanded_ct.iloc[599]
....
mode_confirm                                                     pilot_ebike
purpose_confirm                                                         work
replaced_mode                                                           bike

@corinne-hcr
Contributor Author

I have already put the uuid in Teams. In 2020/11, there is no NaN for this user, but in 2020/12 and 2021/1, you can see NaN

@shankari
Contributor

nvm, you are right!

Instead of manually looking at the isnull() output, I checked it using code, and there are some missing entries, though they don't match your indices above.

>>> pc = expanded_ct["purpose_confirm"].isnull()
>>> pc[pc == True]
1919    True
2095    True
2129    True
2131    True
2132    True
2136    True
2800    True
2805    True
2807    True
2809    True
2811    True
2813    True
2914    True
2916    True
2918    True
2934    True
2940    True
2945    True
3039    True
3041    True
3899    True
3909    True

@shankari
Contributor

And double checking those entries from the dataframe does indicate that they are missing.

pc = expanded_ct["purpose_confirm"].isnull()
expanded_ct.loc[pc[pc == True].index][["mode_confirm", "purpose_confirm", "replaced_mode"]]
mode_confirm purpose_confirm replaced_mode
pilot_ebike NaN NaN
pilot_ebike NaN NaN
pilot_ebike NaN NaN
pilot_ebike NaN NaN
pilot_ebike NaN NaN
pilot_ebike NaN taxi
shared_ride NaN NaN
drove_alone NaN NaN
drove_alone NaN NaN
pilot_ebike NaN NaN
drove_alone NaN NaN
shared_ride NaN NaN
pilot_ebike NaN NaN
drove_alone NaN NaN
drove_alone NaN NaN
drove_alone NaN NaN
drove_alone NaN NaN
drove_alone NaN no_travel
drove_alone NaN NaN
drove_alone NaN NaN
not_a_trip NaN no_travel
not_a_trip NaN no_travel

@shankari
Contributor

I think we should skip these entries for now. There are not many of them, and the user has only a few labeled trips to begin with. So I wasn't counting him in my mental training set anyway.

You can just skip all users with < 50% labeling rate for this training and tuning stage.

@corinne-hcr
Contributor Author

corinne-hcr commented Feb 22, 2021

Btw, this is how Naomi collected bin_trips for the bins above the cutoff

    #delete lower portion of bins
    def delete_bins(self):
        self.calc_cutoff_bins()
        for i in range(len(self.bins) - num):
            self.bins.pop()
        newdata = []
        for bin in self.bins:
            for b in bin:
                d = self.data[b]
                newdata.append(self.data[b])
        self.newdata = newdata if len(newdata) > 1 else self.data

bin_trips is self.newdata, and it is built exactly in the order of the indices from bins.

Here is how she collected the indices in bins:

        logging.debug('After removing trips that are points, there are %s data points' % len(self.data))
        self.size = len(self.data)

    #create bins
    def bin_data(self):
        for a in range(self.size):
            added = False
            for bin in self.bins:
                try:
                    if self.match(a,bin):
                        bin.append(a)
                        added = True
                        break
                except:
                    added = False
            if not added:
                self.bins.append([a])
        self.bins.sort(key=lambda bin: len(bin), reverse=True)

Here a is the original index of the trip after she filtered out trips that are points. In delete_bins(self), she used b instead of a to represent the index.
So, in the algorithm, the trip order in bins is exactly the same as the trip order in bin_trips.
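A toy sketch of this argument (made-up indices, not real trips): both newdata and labels_pred walk bins in the same order, so the i-th entries of the two lists refer to the same trip:

```python
# data stands in for the filtered trip list; bins holds indices into it,
# largest bin first, as in bin_data()/delete_bins() above.
data = ["t0", "t1", "t2", "t3", "t4"]
bins = [[1, 3], [0, 4], [2]]

bin_trips = [data[b] for bin in bins for b in bin]            # delete_bins order
labels_pred = [i for i, bin in enumerate(bins) for _ in bin]  # bin-by-bin labels

print(bin_trips)    # ['t1', 't3', 't0', 't4', 't2']
print(labels_pred)  # [0, 0, 1, 1, 2]
```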

@corinne-hcr
Contributor Author

You can just skip all users with < 50% labeling rate for this training and tuning stage.

Do you mean that, among all users with labeled trips, I should skip users that have <50% fully labeled trips across the three columns?

@shankari
Contributor

Do you mean that, among all users with labeled trips, I should skip users that have <50% fully labeled trips across the three columns?

Correct. You will end up with ~ 6 users I think.

@shankari
Contributor

shankari commented Feb 22, 2021

We can add 4 other users from another project later; their confirmed labels will be a 2-tuple of mode_confirm and purpose_confirm only. So ~ 10 users total but for multiple months each

Contributor

@shankari shankari left a comment

Can you please make sure to remove outputs before committing? I am finding this hard to review.

Comment on lines 28 to 30
non_empty_trips = [t for t in trips if t["data"]["user_input"] != {}]
valid_trips = [t for t in non_empty_trips if 'mode_confirm' in t["data"]["user_input"] and
               'purpose_confirm' in t["data"]["user_input"] and 'replaced_mode' in t["data"]["user_input"]]
Contributor

This is combining valid trips and binning. In the next PR, I would pull out the valid trip filtering into a separate function so it can be reused across the various modules - v-score calculation, query calculation, future clustering, etc.

Contributor

also, is there a reason you are not using dropna here instead of hardcoding the entry checks?

Comment on lines 51 to 55
if len(filter_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Contributor

Again, as we discussed, please combine this with the other validity checks.

Comment on lines 70 to 74
if len(bin_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Contributor

Ditto

Comment on lines 140 to 144
if len(filter_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue
Contributor

Ditto

Comment on lines 150 to 155
if len(bin_trips) < 10:
    homo_score.append(NaN)
    comp_score.append(NaN)
    v_score.append(NaN)
    continue

Contributor

Ditto

labels_true.append(no_dup_list.index(trip))
labels_pred = feat.labels

# compare the points in cluster_trips and those in feat.points, return nothing if two data frames are the same
Contributor

can you fix the comment here? Right now if the frames are the same, we just go on to the score calculation

Contributor

@shankari shankari left a comment

What is the difference between the various notebooks here?
Can you add a description at the top of each notebook indicating what it does?

Contributor

@shankari shankari left a comment

I would split the code into multiple functions so that they can be composed into a pipeline.

the functions would be:

  • get_valid_trips
  • map_labels (sp2en or cvt_pur_mo)
  • valid_user_check
  • fit_clusters (with two implementations, bin and cluster)
  • get_pred_labels (two implementations?)
  • get_user_labels (two implementations?)
  • verify_clusters (with two implementations)
  • compute scores

Then you have two pipelines, one with the bin implementations and the other with the cluster implementations, calling one after the other without any intermediate steps or repetition.
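A sketch of how that composition could look (function names are from the list above; the bodies are placeholders, not the real implementations):

```python
# Each pipeline is just an ordered list of steps; run_pipeline threads the
# output of one step into the next with no intermediate bookkeeping.
def get_valid_trips(trips):
    return [t for t in trips if t.get("user_input")]

def map_labels(trips):
    return trips  # sp2en / cvt_pur_mo mapping would go here

def fit_bins(trips):
    return [[t] for t in trips]  # placeholder binning

def compute_scores(bins):
    return {"n_bins": len(bins)}  # placeholder for homogeneity/completeness/v

def run_pipeline(trips, steps):
    result = trips
    for step in steps:
        result = step(result)
    return result

bin_pipeline = [get_valid_trips, map_labels, fit_bins, compute_scores]
print(run_pipeline([{"user_input": {"mode_confirm": "walk"}}, {}], bin_pipeline))
# {'n_bins': 1}
```

A cluster pipeline would reuse the same run_pipeline with fit_clusters swapped in for fit_bins.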

tour_model_eval/confirmed_trips_eval_bins_clusters.py (outdated, resolved)
@shankari
Contributor

shankari commented Mar 24, 2021

On a second look, the current pipelines are not too bad since several of the steps will have separate implementations anyway. I will accept a simpler refactoring in which you pull out the common functionality map_labels, valid_user_check and compute scores into separate functions which you call from the two separate functions.

If you want to implement the cleaner, but more complex refactoring, I won't complain!

@corinne-hcr
Contributor Author

corinne-hcr commented Mar 24, 2021

I can pull out map_labels, but valid_user_check is already a single function; it is actually based on filter_trips and trips, which come from filter_data (another separate function). Because I need different return values from these two functions for the following steps, I don't combine them into one function.

And what do you want me to do with compute score? What should it include?

@shankari
Contributor

valid_user_check is already a single function,

valid_user_check is not a single function, valid_user is.

Because I need different return values for the following steps from these two functions, I don't make them into one function.

Not sure what this means. The code that uses valid_user appears to be identical in both cases. Can you point out where it is different?


        if not valid_user(filter_trips,trips):
            homo_score.append(NaN)
            comp_score.append(NaN)
            v_score.append(NaN)
            continue

@shankari
Contributor

shankari commented Mar 24, 2021

I was hoping you would finish the refactoring and add the descriptions for the other notebooks first so I could merge this.
You could then submit a new PR for the second round of clustering so I can review your code properly instead of being relegated to viewing it over screenshare. I cannot review code when somebody else is navigating it remotely.

@corinne-hcr
Contributor Author

For the scatter plot, the one I have is based on the bins above the cutoff. The trips are grouped only based on coordinates and radius. Back then we compared the difference between bins and clusters (min_clusters = 0) and found that the bins are better than the clusters. But when we decided to use a second round of clustering, I changed to using k-means for the first round and set min_clusters = len(bins). So, should I just leave it like that? Later, when I have results from hierarchical clustering, should I write another version of the code for the improved scatter plot?

@shankari
Contributor

@corinne-hcr

  1. Not sure that you need to change to kmeans for the first round. You could use bins for the first round and agglomerative clustering for the second round.
  2. IIRC, the results for the bins above cutoff and clusters above cutoff were not that different. So I don't think it matters a lot which you use for the first round.

not sure about your question - are you asking whether you need to go back and change the results for the trip end only clusters? In general, for any results, you don't have to present only one set of results. You can present a set of results and pick the best one to explore further.

Contributor

@shankari shankari left a comment

@corinne-hcr please get this PR to a reviewable state before continuing. It is very hard to review large chunks of code with no context. I'm sure that the code is clear to you, but I'd like (at least this time) for the code to be relatively clear to others as well.

That means digestible code chunks with lots of comments explaining what you are doing.
Please get this done before you commit your refactored code.

@corinne-hcr
Contributor Author

I added some comments. Please let me know if it is still not digestible.

Comment on lines +94 to +102
def valid_user_check(filter_trips,trips,homo_score,comp_score,v_score):
    if not valid_user(filter_trips, trips):
        homo_score.append(NaN)
        comp_score.append(NaN)
        v_score.append(NaN)
        skip = True
    else:
        skip = False
    return homo_score,comp_score,v_score,skip
Contributor

I'm not going to insist on this since you are planning to refactor anyway, but you don't need to have homo_score, comp_score and v_score as both input and output. You can create a class that encapsulates all the scores and just pass it in directly.

Also, I really don't think you need to have append code in here - you can just have this function return valid or invalid. If invalid, append the values and continue; if valid, proceed with computation.
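A sketch of the suggested shape (the validity rule inside valid_user here is a made-up placeholder, not the real check):

```python
# valid_user returns a plain boolean; the caller appends the NaN sentinels
# and continues, so the check function carries no score state at all.
NaN = float("nan")

def valid_user(filter_trips, trips):
    # placeholder rule: at least half of the trips are fully labeled
    return len(trips) > 0 and len(filter_trips) / len(trips) >= 0.5

homo_score, comp_score, v_score = [], [], []
for filter_trips, trips in [([1, 2], [1, 2, 3, 4, 5]), ([1, 2, 3], [1, 2, 3, 4])]:
    if not valid_user(filter_trips, trips):
        homo_score.append(NaN)
        comp_score.append(NaN)
        v_score.append(NaN)
        continue
    # ...proceed with the real score computation here...
    homo_score.append(1.0)
    comp_score.append(1.0)
    v_score.append(1.0)

print(homo_score)  # [nan, 1.0]
```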

Contributor

Please file an issue to clean this up later.

return bin_date


# compare the trip orders in bin_trips with those in filter_trips above cutoff
Contributor

Improve the comment. What is "trip_orders"? Why does it matter?
Just repeating the function name in the comment doesn't add much.

Comment on lines +144 to +155
for bin in bin_date:
    if day:
        if match_day(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break
    if month:
        if match_month(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break

Contributor

It seems like there should be a better way to implement this in pandas using groupby. Please file an issue for this so you can explore how to clean it up later.
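A stdlib sketch of the grouping idea (the timestamps are made up; with pandas, a single groupby on the trip start date would replace the explicit loop):

```python
# Group trip indices by the calendar day of their start timestamp.
from collections import defaultdict
from datetime import datetime, timezone

start_ts = [1613900000, 1613905000, 1614000000]  # hypothetical trip start times

bins_by_day = defaultdict(list)
for idx, ts in enumerate(start_ts):
    day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
    bins_by_day[day].append(idx)

print(sorted(len(b) for b in bins_by_day.values()))  # [1, 2]
```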

return homo_score,comp_score,v_score


# This function is to compare a trip with a group of trips to see if they happened in a same day
Contributor

@shankari shankari Apr 9, 2021

What is the meaning of the parameters?
You have a bin and filter_trips. Why do you need to pass in both?
Why not just pass in two trips?

Comment on lines +149 to +153
" bin_day = evaluation.bin_date(req_trips_ls,filter_trips,day=True)\n",
" req_day_ls = []\n",
" for bin in bin_day:\n",
" req_day_ls.append(len(bin))\n",
" \n",
Contributor

Please add a comment here that describes what bin_day is expected to look like to make the rest of this easier to review

Comment on lines +142 to +158
trip = filter_trips[trip_index]

for bin in bin_date:
    if day:
        if match_day(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break
    if month:
        if match_month(trip,bin,filter_trips):
            bin.append(trip_index)
            added = True
            break

if not added:
    bin_date.append([trip_index])

Contributor

This code is just so messy and non-pythonic

Contributor

Please add this to the list of areas to improve the implementation.

Comment on lines +177 to +181
" if evaluation.match_day(req_trip,valid_trips_bin,filter_trips):\n",
" proportion = round(len(req_trips_bin)/len(valid_trips_bin), 2)\n",
" propor_single_user.append(proportion)\n",
" match = True\n",
" break\n",
Contributor

@shankari shankari Apr 9, 2021

This is the problem with the current match_day implementation. Reading this, it appears that you compare one trip to a list to find a match, and then use that to determine the proportion for that day for that user. But how is comparing only the first trip OK? What is valid_trips_bin?

Needs more comments here. It took me 10 minutes to toggle between the two implementations to figure this out, and that is too much time to spend on what is supposed to be fairly simple code.
