
Add neural network model #36

Closed
andrewtavis opened this issue Apr 6, 2021 · 21 comments · Fixed by #47
Labels
enhancement (New feature or request) · good first issue (Good for newcomers) · help wanted (Extra attention is needed)

Comments


andrewtavis commented Apr 6, 2021

This issue is for adding an embeddings neural network implementation to wikirec. This package was originally based on the linked blog post, but the original model implementation has not been included so far. That original work and the provided code could serve as the basis for adding such a model to wikirec, which ideally would also be documented and tested. The original model was based on analyzing the links between pages, which could serve as a basis for the wikirec version with modifications to wikirec.data_utils, or the model could instead focus on the article texts. Partial implementations are more than welcome though :)

Please first indicate your interest in working on this, as it is a feature implementation :)

Thanks for your interest in contributing!

andrewtavis added the documentation (Improvements or additions to documentation), enhancement (New feature or request), and good first issue (Good for newcomers) labels on Apr 6, 2021
andrewtavis added the help wanted (Extra attention is needed) label on Apr 7, 2021
andrewtavis changed the title from "Add embeddings neural network model" to "Add neural network model" on Apr 7, 2021
andrewtavis removed the documentation label on Apr 12, 2021

victle commented Apr 16, 2021

I'm personally trying to learn more about neural networks, so I'd love to work on and contribute to this in pieces. I took a quick skim through the linked resources, and they also use t-SNE to visualize the books. I know that you're looking to put together t-SNE for wikirec as well (#35), so that could definitely be something I work on in the future too. Let me know what you think a good first step would be!

andrewtavis commented:

For the neural network model, the question would be whether we implement the method from the blog post directly, where it looks at the links to other Wikipedia pages; that would then require the cleaning process to have an option to prepare the data in this way. Website URLs are being removed as of now, though maybe they shouldn't be? We could of course devise another method though :)

We already have pretrained NNs covered with BERT, so an NN approach that tries to create embeddings from scratch might be wasted effort, as that's a lot of computing to try to beat something that's explicitly trained on Wikipedia in the first place.

Another model that's popped up in the last few years is XLNet, which I guess would be the other natural implementation to look into.

Let me know what you think on this :)


victle commented Apr 17, 2021

I think dealing with the links themselves might be a good first approach instead of training a whole new NN. I'll look into that blog post and see if I can find how links are explicitly addressed, unless you have any thoughts on that?

I'll have to look into XLNet as it does look interesting! Though I am definitely lacking in terms of raw computing power.


andrewtavis commented Apr 17, 2021

The original blogpost author finds the wikilinks in the data preparation steps, which are shown in this notebook. He's using wiki.filter_wikilinks() from mwparserfromhell, which we also have as a dependency. Basically in his cleaning steps he's getting titles, the infobox information, the internal wikilinks, the external links, and a few other diagnostic features, whereas we're just getting titles and the texts themselves.

Implementing this his way would honestly be a huge change to the way the cleaning works, so maybe the best way to go about it is to give an option in wikirec.data_utils.clean where the websites would not be removed (might be best anyway), and then use string methods or regular expressions to extract the links from the texts themselves. For this it'd basically be finding instances of https://en.wikipedia.org/wiki/Article_Name by matching the first part, then extracting Article Name and making sure there's a way to avoid repeat entries.
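A minimal sketch of that kind of extraction, assuming the links survive cleaning as plain https://en.wikipedia.org/wiki/... URLs (the regex and the extract_wikilinks helper are illustrative, not part of wikirec):

import re

def extract_wikilinks(text):
    # Grab everything after /wiki/ up to the next whitespace character.
    names = re.findall(r"https://en\.wikipedia\.org/wiki/(\S+)", text)
    # Convert underscores back to spaces and drop repeat entries, keeping order.
    return list(dict.fromkeys(name.replace("_", " ") for name in names))

print(extract_wikilinks(
    "... a historical novel by https://en.wikipedia.org/wiki/Aleksey_Konstantinovich_Tolstoy ..."
))  # ['Aleksey Konstantinovich Tolstoy']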

Once we have those it's basically following the original blogpost :)


andrewtavis commented Apr 17, 2021

For XLNet it looks like we'd be able to use sentence-transformers like the BERT implementation, so the big question then becomes what a suitable model would be. It'd be a matter of loading in a Hugging Face model, as sentence-transformers allows those models to be loaded alongside its BERT models. I'm not sure if it's similar to BERT, where there are multilingual models and ones that are better for certain use cases than others.
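For illustration, loading such a model might look like the sketch below, assuming sentence-transformers is installed; xlnet-base-cased is just an example Hugging Face checkpoint, not a vetted choice:

from sentence_transformers import SentenceTransformer

# sentence-transformers wraps a plain Hugging Face model with a pooling layer
# when the name isn't one of its own pretrained models.
model = SentenceTransformer("xlnet-base-cased")

corpus = ["Text of the first article ...", "Text of the second article ..."]
embeddings = model.encode(corpus, show_progress_bar=True)
print(embeddings.shape)  # (number of texts, embedding dimension)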

References for this are the XLNet documentation from huggingface and this issue so far.


victle commented Apr 23, 2021

@andrewtavis Hey! Just wanted to give an update: I've been pretty busy, but I still want to work on this issue. I wanted some clarification on what we can do to incorporate these different cleaning methods. As it is now, we're just grabbing the text of the article, which includes the text displayed for the internal wikilinks (I don't think they're getting cleaned out, are they?). How would grabbing the wikilinks themselves substantially improve the "performance" of the recommendation models, since we already have the text from the links' names as part of the inputs to the models?

andrewtavis commented:

@victle, hey :) No worries on a bit of silence :)

The thing is that the URLs are being cleaned out as of now. As seen at this line, the websites are being removed, but not the texts that they're the links for. I'm thinking now that this is a step that doesn't actually need to be included in the cleaning process. We could simply remove it, and you could then extract the internal Wikipedia links from the parsed text corpora.

Grabbing the links themselves would basically just make a new modeling approach. The assumption would shift from "I believe recommendations can be made based on which articles have similar texts" to "... which articles are linked to similar things." The second assumption is the one from the blog post, and he also got strong results, so we could implement that approach here as well :)

It kind of adds another layer of depth to a combined model as well. Right now we can combine BERT and TFIDF and get something that accounts for semantic similarity (BERT) and explicit word inclusions (TFIDF), both of which are working well as of now and even better when combined. This could give us a third strong-performing model that adds a degree of direct relatedness to other articles. To combine it with the others the embeddings would need to be changed a bit though, as his approach embeds all target articles along with everything they're linked to. It could be a situation where we simply remove rows and columns of the embeddings matrix based on indexing though.

I checked our results against the ones he has in the blog post. The most direct way this could help is with books that are historical in context. So far we've been picking fantasy novels for examples, which ultimately seem to perform well, as they have unique words and places that lead to books by the same author. An example is his results for War and Peace:

Books closest to War and Peace:

Book: Anna Karenina               Similarity: 0.92
Book: The Master and Margarita    Similarity: 0.92
Book: Demons (Dostoevsky novel)   Similarity: 0.91
Book: The Idiot                   Similarity: 0.9
Book: Crime and Punishment        Similarity: 0.9

Our results for a combined TFIDF-BERT approach are:

[['Prince Serebrenni', 0.6551468483971056],
 ['Resurrection', 0.6545365970449271],
 ['History of the Russian State from Gostomysl to Timashev',
  0.6189035165863549],
 ['A Confession', 0.6068009160763292],
 ['August 1914', 0.5945890587900238],
 ['The Tale of Savva Grudtsyn', 0.5914113595267484],
 ['The Don Flows Home to the Sea', 0.5905929292423463],
 ['Anna Karenina', 0.5887544831477559],
 ['Special Assignments', 0.5798599047827274],
 ['Russia, Bolshevism, and the Versailles Peace', 0.5791726426273041]]

They're all classic Russian books, but his results are "better" in my opinion. We get similar results for The Fellowship of the Ring, as expected :)

Sorry for the wall of text 😱 Let me know what your thoughts are on the above!


victle commented Apr 23, 2021

I want to try and summarize my understanding below:

After reading that first paragraph, this is my understanding. I'm using the beginning text of Prince Serebrenni as an example 😄 Before cleaning the raw text, you'll get something like "Prince Serebrenni (Russian: Князь Серебряный) is a historical novel by [[https://en.wikipedia.org/wiki/Aleksey_Konstantinovich_Tolstoy]](Aleksey Konstantinovich Tolstoy)...". And through that line you referenced in wikirec.data_utils.clean, it's simply removing the link entirely, such that only "Aleksey Konstantinovich Tolstoy" is left?

So then, if we implement this third approach/model that looks at the internal links, we could combine that with TFIDF and BERT to make an overall stronger model (hopefully!). Although I'm not knowledgeable enough to know how you would combine embeddings, so that might need more explanation 😅

Either way, it does sound interesting and challenging! I think I'll begin with removing that cleaning step, and replicating the approach from the blog post to extract the links. Does that sound like a reasonable starting step?

andrewtavis commented:

Hey there :) Yes, your understanding is correct. From [[https://en.wikipedia.org/wiki/Aleksey_Konstantinovich_Tolstoy]](Aleksey Konstantinovich Tolstoy) only Aleksey Konstantinovich Tolstoy will remain. That string will then be tokenized, n-grams will be created such that we would likely get Aleksey_Konstantinovich_Tolstoy (while maintaining Aleksey Konstantinovich Tolstoy as well), the tokens will be lowercased, and then common names are removed (these steps then lead to lemmatization or stemming). Common names are removed to make sure that we're not getting recommendations for things just because a character has the same name, but they're removed after the n-grams are created so that we still have, for example, harry_potter.
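For illustration, that kind of n-gram step could be done with gensim's Phrases, which joins frequently co-occurring tokens with an underscore; this is a rough sketch, not necessarily wikirec's actual implementation, and the min_count/threshold values are arbitrary:

from gensim.models.phrases import Phrases

token_lists = [
    ["Aleksey", "Konstantinovich", "Tolstoy", "wrote", "the", "novel"],
    ["Aleksey", "Konstantinovich", "Tolstoy", "was", "a", "poet"],
]

# Pairs that co-occur often enough get joined, e.g. Aleksey_Konstantinovich,
# after which the tokens can be lowercased and common names removed.
bigrams = Phrases(token_lists, min_count=1, threshold=1)
print([token.lower() for token in bigrams[token_lists[0]]])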

I think there's a better way to do this that still removes the URLs (I think they're ultimately a lot of filler, and they'll further become nonsense once the punctuation's removed, leaving us with a lot of httpsenwikipediaorg). The way he gets the URLs is in the parsing phase, which is again this notebook. If you look at cell 40 as run, he seems to just be getting the article names that are then used as inputs for the model. So to get the links you honestly could do a regular expression search over the texts and grab everything between https://en.wikipedia.org/ and the next space, then convert the underscores to spaces and make sure those names don't go through the later cleaning steps. They could then be saved as an optional output of the cleaning process for if someone wanted to get the internal links, say by adding a get_wikilinks argument?

This is the simplest way I can think of to go about this, as there are all kinds of cleaning steps that follow (removing punctuation and such) that would otherwise need to be accounted for. If you just put an if get_wikilinks: conditional right before the URLs are removed, we could grab them, return a third value, and maintain everything else as is :)
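Roughly, that conditional might sit in the cleaning flow as sketched below; the get_wikilinks flag and the simplified structure are hypothetical, not wikirec's actual code:

import re

def clean(texts, get_wikilinks=False):
    cleaned, wikilinks = [], []
    for text in texts:
        if get_wikilinks:
            # Grab internal link names before the URL-removal step strips them.
            names = re.findall(r"https://en\.wikipedia\.org/wiki/(\S+)", text)
            wikilinks.append([name.replace("_", " ") for name in names])
        text = re.sub(r"https?://\S+", "", text)  # the existing URL-removal step
        # ... punctuation removal, tokenization, n-grams, etc. would continue here ...
        cleaned.append(text)
    return (cleaned, wikilinks) if get_wikilinks else cleaned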

Lemme know what you think!


victle commented Apr 23, 2021

I was messing around with the wikirec.data_utils.clean() function, and I wanted to see exactly what websites were getting removed by the code lines you linked to. I put an image of the websites that get removed in the first 500 texts or so below. It seems like in the cleaning process there are hardly any (perhaps no) internal wikilinks to other articles; I really only see links to images and such. Was this intended? Are the internal links to other articles getting removed in an earlier step? FYI, the texts that I'm parsing only come from the enwiki_books.ndjson that's provided in the repository, so I didn't go through the data_utils.parse_to_ndjson function or anything.

[image: websites removed from the first ~500 texts]

Other than that, I'm a fan of the get_wikilinks = True argument. For now, I'll go ahead and work with the clean() function such that a separate list containing the internal wikilinks is returned 👍


andrewtavis commented Apr 23, 2021

Very, very interesting, and sorry for putting you on the wrong track. Honestly, I last really referenced the parsing code years ago when I originally wrote the LDA version of this (it was a project for my master's), and it didn't occur to me that Wikipedia doesn't actually use URLs for internal links.

Referencing the source of Prince Serebrenni, "Aleksey Konstantinovich Tolstoy" is [[Aleksey Konstantinovich Tolstoy]], and the "1862" link is [[1862 in literature|1862]], i.e. double brackets indicate internal links, and there's a bar separator for when the displayed text isn't the name of the article (weird that it's different from what you're seeing). I had forgotten how advanced mwparserfromhell, which we're using for the parsing, is: wikicode.strip_code().strip() just gets the texts without the internal links, which you then need to request explicitly via wikicode.filter_wikilinks(). The URLs that you're seeing look to be for references, which appear at the end of Wikipedia articles, but in the source these are actually found in the texts themselves and are moved to the bottom when displayed.
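A minimal sketch of that difference, assuming mwparserfromhell is installed (the sample wikitext is made up for illustration):

import mwparserfromhell

wikitext = (
    "''Prince Serebrenni'' is a historical novel by "
    "[[Aleksey Konstantinovich Tolstoy]], written in [[1862 in literature|1862]]."
)
wikicode = mwparserfromhell.parse(wikitext)

# strip_code() keeps only the display text, dropping the link targets and markup.
print(wikicode.strip_code().strip())

# filter_wikilinks() returns the internal links themselves.
print([str(link.title) for link in wikicode.filter_wikilinks()])
# ['Aleksey Konstantinovich Tolstoy', '1862 in literature']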

Again, sorry for the false lead. We'll need to get the links in the parsing phase, which in the long run makes this easier :) The main difference is that a third element will be added to the json files, where we'll be able to do the following:

import json

with open("./enwiki_books.ndjson", "r") as f:
    books = [json.loads(l) for l in f]

titles = [b[0] for b in books]
texts = [b[1] for b in books]
wikilinks = [b[2] for b in books]  # <- access it if you need it

So basically we just get an optional element that isn't even used if we're applying the current techniques. More to the point, we don't need to screw with the cleaning process as of now. It's something that should be looked at again in the future, as I think that BERT could potentially benefit from even raw texts with very little processing, but let's check this later :) I can run this by some friends as well.

I will do an update tomorrow with the changes to wikirec.data_utils, which will include edits to _process_article (adding wikicode.filter_wikilinks() as a third returned value), and then propagate this change to iterate_and_parse_file and parse_to_ndjson. I'll then download the April dump, parse it, and update the zip of enwiki_books.ndjson. From there we'll be good to get started on the NN model 😊

andrewtavis commented:

I'll also do a parse and get us a copy of enwiki_books_stories_plays.ndjson so long as it's within the upload size limitations :)


victle commented Apr 23, 2021

Cool! I'm glad we cleared that up, and that it's an easy fix. Let me know if there's something I can look into as well. I can keep reviewing the blogpost, as I imagine a lot of the methods and insights for the NN model will derive from that.


andrewtavis commented Apr 24, 2021

Hey there :) Thanks for your offer to help on the parsing! It was literally just the line for wikicode.filter_wikilinks() being added and some other slight changes. That all went through with #43, and I've just updated the zip of enwiki_books.ndjson - it now has wikilinks at index 2. I decided to hold off on updating the Wikipedia dump for now (the parsing itself is 4-5 hours), so we're still using the February one. We can update that when the examples get updated. For now I also decided against enwiki_books_stories_plays.ndjson, as the books alone are now almost 78MB, so even zipped the combined file would likely be more than the 100MB max file size.

To keep track of this, the recent steps and those left are (with my estimates of time/difficulty):

  • Add wikilink parsing (2)
  • Update/upload enwiki_books.ndjson with wikilinks (3)
  • Implement updated wikilinks NN method in model.recommend (8)
  • Add model description to the readme (currently WIP) and docstrings (checking documentation) (2)
  • Add wikilinks NN method to examples (includes combination approaches) (5)
  • Add wikilinks NN results to the readme (1)
  • Add testing for wikilinks NN method (4)
  • Close this issue 🎉

Let me know what all from the above you'd have interest in, and I'll take the rest. Also of course let me know if you need support on anything. Would be happy to help 😊


victle commented Apr 25, 2021

I'd love to talk more about breaking down the third task. And again, correct me if I'm not understanding 😅 Following the blog post, we'd have to train a neural network (treating it as a supervised task) to generate embeddings between books and the internal wikilinks. Then we can generate a similarity matrix for each book based on those embeddings. However, how do we combine the recommendations based off this NN with those of TFIDF and BERT?
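For reference, that kind of (book, wikilink) embedding model could be sketched roughly as below, loosely following the blog post's setup and assuming Keras; the sizes and layer choices here are illustrative only:

from tensorflow.keras import layers, Model

n_books, n_links, embed_dim = 1000, 5000, 50  # illustrative sizes

# Each training sample is a (book_index, link_index) pair labeled 1 if the
# book's article contains that wikilink, and 0 for a randomly drawn negative pair.
book_in = layers.Input(shape=(1,), name="book")
link_in = layers.Input(shape=(1,), name="link")

book_emb = layers.Embedding(n_books, embed_dim, name="book_embedding")(book_in)
link_emb = layers.Embedding(n_links, embed_dim, name="link_embedding")(link_in)

dot = layers.Dot(axes=-1, normalize=True)([book_emb, link_emb])
out = layers.Dense(1, activation="sigmoid")(layers.Flatten()(dot))

model = Model(inputs=[book_in, link_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")

# After training, the learned book vectors can be pulled from the embedding layer:
# book_vectors = model.get_layer("book_embedding").get_weights()[0]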


andrewtavis commented Apr 25, 2021

Let's definitely break it down a bit more :) Just wanted to put out everything so there's a general roadmap, and I'd of course do the data uploads and testing (not sure if you have experience/interest in unit tests).

Answering your question (as well as I can right now 😄), you're right in that we'll be combining similarities into a sim_matrix as with the other models, and we'd make sure that it's ordered by titles in the same way. The one wrinkle from the blog post is that the titles are modeled together with everything that they're also linked to. We'd just need to make sure that anything that's not in selected_titles isn't included in the sim_matrix, and then as before we'd have matrices of equal dimension that can be weighted. We'd need to experiment with the weights from there, but that's the fun part, where we can see if there's any value in combining this kind of similarity with the others we already have to ultimately get something more representative than any of the models by themselves :)
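A minimal sketch of that combination step, assuming the similarity matrices all end up ordered by the same titles; the helper, weights, and variable names here are illustrative, not wikirec's API:

import numpy as np

def combine_sim_matrices(link_embeddings, selected_idx, sim_bert, sim_tfidf,
                         weights=(0.4, 0.4, 0.2)):
    # Cosine similarity between all embedded pages (books plus linked pages).
    norms = np.linalg.norm(link_embeddings, axis=1, keepdims=True)
    sim_links = (link_embeddings @ link_embeddings.T) / (norms @ norms.T)

    # Keep only the rows/columns for the selected titles so the matrix has the
    # same dimensions and title ordering as the BERT and TFIDF matrices.
    sim_links = sim_links[np.ix_(selected_idx, selected_idx)]

    w_bert, w_tfidf, w_links = weights
    return w_bert * sim_bert + w_tfidf * sim_tfidf + w_links * sim_links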

Let me know what your thoughts are on the above! As far as breaking the task down, if you want to add something that's similar to the blogpost into model.gen_embeddings, then I could potentially work on getting model.gen_sim_matrix set up from the vector embeddings? Let me know what you'd be interested in doing and have the time for 😊


victle commented Apr 27, 2021

I'm familiar with unit tests, but I wouldn't say I'm well-versed! Either way, what you've outlined makes sense. In terms of what I can do, I can start by building the architecture for the NN that will eventually generate the embeddings between titles and links. I'm interested in training the model myself, but we'll see if I have the computing power to do so in a reasonable amount of time 😅 To keep model.gen_embeddings clean, I might write a private function or something that will set everything up and train. We'll see!


andrewtavis commented Apr 27, 2021

A private function on the side would be totally fine for this! All sounds great, and looking forward to it :)

In terms of computing power, have you ever used Google Colab? That might be a solution for this, as I don't remember the training being mentioned as taking too long in the blog post. Plus it's from 2016, when GPUs weren't as readily available as today (ML growth is nuts 😮). The big thing is that you do need to activate the GPU in the notebook, as it's not on by default. As stated in the examples for this, you'd go to Edit > Notebook settings > Hardware accelerator and select GPU.

I used Colab for some university projects, and it's built with Keras in mind. You'd have 24 hours or so of GPU time before the kernel restarts, which hopefully would be enough. If it's not, just lower the parameters and send along something that works, and I'm happy to make my computer wheeze a bit for the full run 😄

andrewtavis commented:

@victle, what do you think about combining model.gen_embeddings and model.gen_sim_matrix? The idea would be to make gen_embeddings a private function; that way the recommendation process would just be gen_sim_matrix followed by recommend.


victle commented May 1, 2021

@andrewtavis I actually like the modularity of two separate functions for computing the embeddings and then the similarity matrix. Someone might just be interested in the embeddings, or might want more customization in how the similarity matrices are computed. Though this might be a rare case! But I do see the benefit of making the recommendation process simpler. Plus, generating the similarity matrix is pretty simple after computing the embeddings, so it's like... why not? 😆


andrewtavis commented May 2, 2021

@victle, if you like the modularity we can keep it as is :) I was kind of on the fence about it and wanted to check, but it makes sense that someone might just want the embeddings. Plus, if we keep it as is it's less work 😄 Thanks for your input!
