Learn Regular Expressions by Building a Spam Filter #122

Closed
QuincyLarson opened this issue Mar 8, 2019 · 25 comments

@QuincyLarson
Contributor

No description provided.

@lionel-rowe
Contributor

I'd be interested in contributing to this challenge.

@QuincyLarson
Contributor Author

@lionel-rowe Awesome! I think this will be a fun one. See if you can get an extremely simple demo that checks a line of input for spam words and the use of characters meant to mask spam words like v1agr@ or something like that 😄 - I'm excited to see what you come up with.

@lionel-rowe
Contributor

I'm actually wondering how best to approach this one. In order to make a truly extensible filter that would catch v1agr@, vi@gra, v|agRa, V I A G R A, etc. it'd really make more sense to have programmatically-generated regexes, rather than regex literals:

https://gist.github.com/lionel-rowe/0724546e4b5c1a71be29502aac4c825e

However, this approach would probably introduce too many disparate concepts at once, plus it has some gotchas (double-escaping, for starters). Any thoughts on how to simplify it while still guiding students toward writing extensible code?

Note that the regexes generated by this method are almost completely unreadable:

[
  /\bv\s*[i|]\s*[a@4]\s*[g69]\s*r\s*[a@4]\b/i,
  /\bf\s*r\s*[e3]\s*[e3]\s* \s*m\s*[o0]\s*n\s*[e3]\s*y\b/i,
  /\bw\s*[o0]\s*r\s*k\s* \s*f\s*r\s*[o0]\s*m\s* \s*h\s*[o0]\s*m\s*[e3]\b/i,
  /\b[s5]\s*[t7]\s*[o0]\s*[c{\[(]\s*k\s* \s*[a@4]\s*l\s*[e3]\s*r\s*[t7]\b/i,
  /\bd\s*[e3]\s*[a@4]\s*r\s* \s*f\s*r\s*[i|]\s*[e3]\s*n\s*d\b/i
]

Alternatively, we could just take the hard-coding approach and limit the test cases somewhat to avoid having completely unreadable regexes.
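For readers who do not want to open the gist, here is a minimal sketch of how regexes like the ones above could be generated. It is not the gist's code: the `lookalikes` table, `escapeForClass`, and `buildSpamRegex` are hypothetical names chosen for illustration, and the table only covers a few letters. Note the double-escaping gotcha: `\b` and `\s*` have to be written as `'\\b'` and `'\\s*'` inside strings.

```javascript
// Hypothetical substitution table: each letter maps to the characters
// a spammer might use in its place (an assumption, not the gist's data).
const lookalikes = {
  a: 'a@4',
  e: 'e3',
  g: 'g69',
  i: 'i|',
  o: 'o0',
  s: 's5',
  t: 't7',
};

// Escape characters that are special inside a character class.
const escapeForClass = (chars) => chars.replace(/[\\\]^-]/g, '\\$&');

// Escape characters that are special in a regex outside a class.
const escapeLiteral = (ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

function buildSpamRegex(phrase) {
  const parts = [...phrase.toLowerCase()].map((ch) => {
    const set = lookalikes[ch];
    return set ? `[${escapeForClass(set)}]` : escapeLiteral(ch);
  });
  // Allow optional whitespace between every pair of characters.
  return new RegExp(`\\b${parts.join('\\s*')}\\b`, 'i');
}

const viagra = buildSpamRegex('viagra');
console.log(viagra.test('buy vi@gra now')); // true
console.log(viagra.test('V I A G R A'));    // true
console.log(viagra.test('viaduct'));        // false
```

With this table, `buildSpamRegex('viagra')` yields the same pattern as the first entry in the list above.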

@brandenbyers

brandenbyers commented Mar 22, 2019

Yes, generating the regexes programmatically might make the most sense in the real world, but as you have pointed out, the result really is not readable (nor does it teach much about regex). What about making the lesson slowly build on itself? Start with the normal spelling of a spam word. The next lesson adds an alternative spelling. The one after that adds more alternative spellings. Oh, wait...the spammers have gotten smarter and are now doing “x” things instead...how can we improve it from here?

I’m imagining a text based game of sorts. Outsmarting the spammers. And with each step learning a bit more regex. And how to alter regex when the requirements change.

It could include a bit of history too, for those who aren’t familiar with how spam used to be regexed when spam was a relatively new concept. Let the user experience what it might have been like as a mail server admin back in the simpler days...

Ultimately, the final step could be acknowledging the limits of human readable spam filtering. But that’s ok. The user isn’t trying to become a regex master with these lessons. Mostly what they need to learn is how to read and write basic regex. They can search the internet, or use one of the visual regex tools, for anything more complex.

@lionel-rowe
Contributor

lionel-rowe commented Mar 22, 2019

@brandenbyers yeah, that sounds like a good way to approach it. Here's an updated version:

https://gist.github.com/lionel-rowe/72a19bf346858ede6f406ad20e7c157a

As I've split out the logic of deleting intra-word spaces from the logic of de-mangling, this version has a clearer path for how to iterate:

  • Step 1 - plain words (can be solved with includes, doesn't even need regexes)
  • Step 2 - plain words with mixed case (can be solved with toLowerCase, but here we introduce regexes with i flag as an alternative)
  • Step 3 - mangled words
  • Step 4 - mangled words + intra-word spaces
  • Step 5 - the spammers have moved on to using images, now you're into the realm of OCR and NLP... but that's for another lesson 😉
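To make steps 1 and 2 concrete, here is a small sketch. The word list and the function names are hypothetical, picked only for illustration; they are not taken from the gist.

```javascript
// Hypothetical spam phrases, for illustration only.
const spamWords = ['free money', 'work from home'];

// Step 1: plain words. String.prototype.includes is enough.
const isSpamPlain = (msg) => spamWords.some((word) => msg.includes(word));

// Step 2a: mixed case, solved by normalising with toLowerCase.
const isSpamLower = (msg) =>
  spamWords.some((word) => msg.toLowerCase().includes(word));

// Step 2b: the same check with regex literals and the i flag.
const spamRegexes = [/free money/i, /work from home/i];
const isSpamRegex = (msg) => spamRegexes.some((re) => re.test(msg));

console.log(isSpamPlain('Get FREE MONEY today')); // false
console.log(isSpamLower('Get FREE MONEY today')); // true
console.log(isSpamRegex('Get FREE MONEY today')); // true
```

The regexes deliberately omit the `g` flag, so repeated `test` calls do not depend on `lastIndex`.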

The biggest challenge conceptually is probably the wordCondenser regex used in step 4: /(?:^|\s)\S(?:(\s+)\S)(?:\1\S)*(?:$|\s)/g, but the funkiest bits of that ((?:^|\s) and (?:$|\s)) can actually be introduced in step 2 or 3.
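As a rough illustration of what the wordCondenser regex does, the sketch below uses it to collapse evenly spaced-out words. The regex is copied from this comment; the `condenseSpacedWords` wrapper and its replace callback are assumptions about how it might be applied, not the gist's actual code.

```javascript
// Regex quoted above: a run of single characters separated by the
// same whitespace each time, bounded by whitespace or string edges.
const wordCondenser = /(?:^|\s)\S(?:(\s+)\S)(?:\1\S)*(?:$|\s)/g;

// Hypothetical wrapper: strip whitespace inside each matched run,
// keeping any leading or trailing whitespace the match consumed.
function condenseSpacedWords(text) {
  return text.replace(wordCondenser, (run) =>
    run.replace(/(\S)\s+(?=\S)/g, '$1')
  );
}

console.log(condenseSpacedWords('buy V I A G R A now')); // 'buy VIAGRA now'
console.log(condenseSpacedWords('I am a spy'));          // 'I am a spy'
```

The backreference `\1` is what requires the spacing to be consistent within a run, and the `(?:^|\s)` and `(?:$|\s)` groups stop the match from starting or ending in the middle of an ordinary word.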

@QuincyLarson
Copy link
Contributor Author

@lionel-rowe I agree with Branden that it's much more important that we teach these regex techniques even if real life approaches to spam filtering would be different.

Remember that the entire curriculum will be a series of individual tests, and getting one test to pass at a time. So we will only be testing one aspect of their regular expressions at a time. And any concepts we want to impart, we will need to do so in just a few words as part of a test description. There won't be any paragraphs of explainer text.

@scissorsneedfoodtoo
Contributor

@lionel-rowe, just wanted to check in and see how everything's going with the project. Your gist looks like a great start!

My only suggestion would be to keep things simple, teaching just one regex concept with some repetition/review of earlier concepts between. Your project will be replacing the lessons here, so the wordCondenser regex you have (while really cool!) is probably too advanced for students at this point in the curriculum.

Anyway, hope that helps! Please let us know if there's anything we can help out with.

@lionel-rowe
Contributor

@scissorsneedfoodtoo "Your project will be replacing the lessons here, so the wordCondenser regex you have (while really cool!) is probably too advanced for students at this point in the curriculum."

The idea will be to build each regex up incrementally (based on the second gist I posted, not the first, which I agree is overcomplicated). The only thing I'm concerned about is not covering enough concepts (e.g. \d, \w, etc. won't be covered). I think this is probably OK as long as the broader concept is covered. For example, if u flag isn't covered, the idea of flags in general is covered; if \W isn't covered, \S is, and so on.
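A quick sketch of why covering one member of each family may be enough: the shorthand classes come in lower-case/upper-case pairs that behave the same way, and flags all attach to the regex in the same position. The sample string is made up for illustration.

```javascript
const sample = 'Fr33 m0ney!';

// \S is the negation of \s, just as \W negates \w and \D negates \d.
const chunks = sample.match(/\S+/g);  // ['Fr33', 'm0ney!']
const nonWord = sample.match(/\W/g);  // [' ', '!']
const digits = sample.match(/\d/g);   // ['3', '3', '0']

// Flags follow the closing slash; u makes . match a whole code point.
const withoutU = /^.$/.test('😄');    // false
const withU = /^.$/u.test('😄');      // true

console.log(chunks, nonWord, digits, withoutU, withU);
```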

I'll work on making this into a proper lesson where the incremental approach is clearer.

@scissorsneedfoodtoo
Contributor

scissorsneedfoodtoo commented Apr 4, 2019

@lionel-rowe, okay, that sounds great. I don't think you need to worry about the coverage of your project. Like you said, teaching the broad concept of flags and using several in depth is better than covering all the flags. Looking forward to seeing your lessons!

@scissorsneedfoodtoo
Contributor

@lionel-rowe, just checking in to see how things are going. Did you start breaking this project down into steps?

@lionel-rowe
Contributor

I'll open a work-in-progress PR sometime this weekend.

@scissorsneedfoodtoo
Contributor

Hi @lionel-rowe, just wanted to check on the status of this project, too. Were you able to start on another draft by any chance?

@lionel-rowe
Contributor

Draft 2 should be coming shortly, though I'm somewhat swamped with work at the moment. Aiming for within the week.

@scissorsneedfoodtoo
Contributor

@lionel-rowe, thank you for the update! Looking forward to seeing it soon.

@scissorsneedfoodtoo
Contributor

@lionel-rowe, were you able to make any progress on your next draft? Looking forward to seeing it soon.

@scissorsneedfoodtoo
Contributor

Hi @lionel-rowe, were you able to start on a new draft?

@Bam92

Bam92 commented Nov 9, 2019

Yo!

How far is this project? Can I consider it as unclaimed?

@scissorsneedfoodtoo
Contributor

Hi @Bam92, thank you for your patience and sorry about the delay.

Yes, this project is unclaimed. Feel free to work on a prototype and post updates here as you go along.

@CatalanCabbage

Hi @Bam92 and @scissorsneedfoodtoo , I'd be glad to contribute.
I'm back-end and not familiar with JS or Python, which rules out many of the listed projects; however, I believe I can contribute to this section, since it's regex based.
Do let me know if I can help!

@Bam92

Bam92 commented Dec 9, 2019

Hi @CatalanCabbage
I am busy for now, so go for it if you can.

@scissorsneedfoodtoo
Contributor

scissorsneedfoodtoo commented Dec 9, 2019

Hi @CatalanCabbage, thank you for picking this project up. There's already been some work done that's been merged into the repo, but please feel free to start from scratch if that's easier.

Though this project will focus on regex, we'll still be using JavaScript to teach the fundamentals. Please go ahead and start working on CodePen, CodeSandbox, or some other similar platform, and post a link to your prototype here whenever it's ready.

@CatalanCabbage

CatalanCabbage commented Dec 12, 2019

@scissorsneedfoodtoo, first off, you're doing great work. :)
I have some questions, and I'd be grateful if you could point me in the right direction. Also, seeing how this seems to change hands so often, this could serve as a guide for anybody to pick it up if I fail to (touch wood!) :)

UI: You had asked me to make a mock-up on CodePen. With regex, the learner generally progresses in terms of regex complexity, so I'm at a loss as to how to improve the UI step by step (or maybe we look into it later), since each step would basically just involve a regex and running tests. Is one common UI enough for now? How do you visualize this course?

Concept-first vs Product-first : The last person who took this task up did a pretty good job, actually. The regex is built say, step by step to morph into a spam filter; however, should we shift focus onto various concepts, and look at the filter as a means to an end, even if it's actually a roundabout way? Should we do a concept-first approach and then try to somehow integrate it into the filter even if it's not how we'd actually do a filter? I think this will benefit the users more.

So assuming this is your line of reasoning, I'd like to first list out various general regex concepts and JS-based regex functions so we can integrate them into the lessons. Any we can't fit in could be mentioned in hints or somewhere along the way, so learners are aware of them. It was stated that concepts need to be explained in a few words; do we provide links for further reading (and if so, are there approved sites)? Also, there are some concepts such as catastrophic backtracking which are extremely important, but which I'm not sure how to introduce in the lesson (I will look into this later).
This will also be helpful as a future reference if I compile this list.
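For anyone unfamiliar with catastrophic backtracking, here is a minimal sketch. The patterns and the `timeTest` helper are illustrative, not part of any lesson plan. The input is kept short on purpose, because the work done by the risky pattern roughly doubles with every extra `a`.

```javascript
// Nested quantifiers: a run of "a"s can be split between the inner
// and outer + in exponentially many ways before the match fails.
const risky = /^(a+)+$/;

// Matches exactly the same strings, without the nesting.
const safe = /^a+$/;

// A near-miss input: all "a"s, then one character that forces failure.
const nearMiss = 'a'.repeat(18) + '!';

// Hypothetical helper that reports the result and elapsed time.
function timeTest(re, input) {
  const start = Date.now();
  const result = re.test(input);
  return { result, ms: Date.now() - start };
}

console.log(timeTest(risky, nearMiss)); // result is false; time grows fast with length
console.log(timeTest(safe, nearMiss));  // result is false; time stays flat
```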

My final question is documentation: Where do I place problem statements, hints, problem completion message and so on? Do I introduce them as comments now and we migrate them later into documentation? There are comments in the repo on the JS files themselves, just wanted to be sure.

@scissorsneedfoodtoo
Contributor

scissorsneedfoodtoo commented Dec 17, 2019

Hi @CatalanCabbage, thanks again for your patience. You have some great questions here, and I'll do my best to answer them one by one.

UI: I don't think there needs to be any sort of UI for this project. Looking at the current Regular Expressions section, learners just have the editor to focus on while they build up the code and learn concepts along the way. I see this project working similarly, where they start out with a blank editor and build up the code line by line. If we want them to see any output, we could prompt them to log something to the console.

Concept-first vs. Product-first: Great question, and this is something we've been trying to reconcile with a lot of these new projects. I agree that the concept-first approach is better in the long run, even if it's not how you would normally build a production ready spam filter. The RSA Cryptography project does something similar, where we explain several times that what we're teaching is not secure at all, and is for educational purposes only.

Ideally the spam filter will cover most of the concepts in the current Regular Expressions section. But we shouldn't go out of our way to introduce concepts or methods if they're not necessary for the spam filter. In some of the other projects we've introduced basic concepts like if/else statements, then later go back and refactor them into ternary operators. I could see doing that in this project as well.
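As a sketch of what that refactor could look like in this project (the function names and the spam phrase are hypothetical), a learner might first write the check with if/else and later condense it into a ternary:

```javascript
// First pass: explicit if/else.
function labelWithIfElse(msg) {
  if (/free money/i.test(msg)) {
    return 'spam';
  } else {
    return 'ok';
  }
}

// Later refactor: the same logic as a ternary expression.
const labelWithTernary = (msg) => (/free money/i.test(msg) ? 'spam' : 'ok');

console.log(labelWithIfElse('FREE MONEY inside'));  // 'spam'
console.log(labelWithTernary('FREE MONEY inside')); // 'spam'
console.log(labelWithTernary('Lunch tomorrow?'));   // 'ok'
```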

Your idea to list out the various concepts first before finishing the prototype sounds good. As for things like catastrophic backtracking, I fear it might be too early in the curriculum to introduce a concept like that. If we do, I would recommend keeping it as simple as possible since this will only be the third project where learners work with JavaScript if they start from the very beginning. The two projects that come before this are the Basic JS RPG game and the Intermediate JS Calorie Counter, both of which are pretty simple.

Also that's a very good question about the documentation. These new projects will be quite different than the current challenges, and won't include things like completion messages or hints, at least for the time being. Right now we're just focused on building the prototypes, then breaking them down into short individual steps. The commented out sections are the instructions for each step, so you can introduce them as comments for now.

@CatalanCabbage

Hi!
I've built up a basic lesson plan here, please check it out; any feedback/discussion is appreciated. :)

Some comments:

  • Proceeded with the basic idea we had: start off small, add complexity to the filter
  • Lessons count: As of that commit, the number of lessons has already crossed 20, and we have probably 2 sections left.
    Reduce the number of lessons? That might reduce practice/retention (too many concepts at once) and increase the jump in complexity between consecutive exercises; I don't want to make it too long either, so I'm trying to find the balance. Better a bit too much, where students grasp it, than too little, leaving them exasperated or feeling inadequate, imo.
    I could work it out after deciding on all the lessons, but if there are any pointers/concerns, feel free to share, so that iterations can be reduced and there's less redundant effort. :)
  • Any suggestions on appropriate spam words/witty examples/low-key plots are appreciated!
    Right now it's just a framework with one generic example, but we could revise that after the basic plan to make it more engaging, and maybe even map out a kind of story to keep it flowing, like lionel-rowe had already envisioned.
    I'd like to refrain from words such as 'viagra', which might be deemed to be distasteful in some cultures.
  • Are the concepts right? I understand that I've written this with my experience in learning regexes, i.e., the order I would like to take concepts in; it might not be the same for everyone.
    If you have any concerns regarding the concepts introduced, their complexity or order of introduction please let me know so we can discuss and make this better for all who take it. :)
    cc @scissorsneedfoodtoo

@CatalanCabbage

CatalanCabbage commented May 22, 2020

A suggestion regarding the UI:
I believe the stance is:

"I don't think there needs to be any sort of UI for this project."
"If we want them to see any output, we could prompt them to log something to the console."

However, imo, regex isn't like other lessons.
In other lessons, we have outputs/errors that we debug; however, here, we have 'matches' that we need to convey to the learner.
There are no errors/outputs per se; only expectations of matches.
Now, we can convey this by text or the UI; I find that the UI is very intuitive and helps grasp data quickly.

Example, current view, text (and I'm assuming, the current stance): From FCC Regex course

image

Contrast it to this overview: From regex101

image

For this extremely simple example, how long did it take to grasp the overall objective in each case?
The second seems more intuitive and better suited to regexes than repetitions of "your regex should match...".

My idea:
A graphical output window, where we can see what needs to be matched, and what the current regex matches.
We're building a spam filter, so I modified my ZohoMail layout a bit. We're filtering mails by subject, count of the mails we've flagged as Spam on the left.

image

Here, yellow is what needs to be matched, red is what your regex matches (incorrectly) and green is what your regex matches (correctly).
Please note that this is a very crude idea, just used Paint, this is just a basic vision for this; please overlook colors/formats/layout etc.

So the overall screen could have the graphical output pane (again, please overlook colors/order/sizes/font size):
image

TLDR, my case for Graphical output:

  • More intuitive
  • There aren't native errors, and no direct textual output comparison is required (as in, doing console.log(outputArray) to check output), so graphical output is better suited. Regex is fundamentally different from those lessons.
  • Encourages experimentation: We all know regex comes from trying things out (like everything else). The question might say, "Can you match 'cccc'?" The student can look at the text and say, "Hey, 'cccc' is fine. How can I match 'p1p2p3'?"
    • Current Text output: Your regex does not match 'cccc'
    • Graphical output: Indicates that your regex failed. AND, indicates what it matched. Student learns more.
  • Easier to debug: Can see what's being matched exactly, can work on even complex regexes with more ease.
  • Questions are grasped easily. Even when the question explains it well, like must match all words that... vs just understanding it at a glance. It's not a substitute, but a great supplement.
  • More engaging over repetitive your regex does not match... outputs since it's dynamic.
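The colour scheme described above could be driven by a small function like the sketch below. Everything here is hypothetical (the `classifySubjects` name, the data shape, and the sample subjects); it only shows that the information for a graphical pane, including what the regex actually matched, is cheap to compute.

```javascript
// Hypothetical inbox: each subject says whether it should be flagged.
const subjects = [
  { text: 'FREE MONEY inside', shouldMatch: true },
  { text: 'Feel free to reply', shouldMatch: false },
  { text: 'Fr33 m0ney now', shouldMatch: true },
  { text: 'Meeting at noon', shouldMatch: false },
];

// green: matched correctly, red: matched incorrectly,
// yellow: should have matched but did not, none: correctly ignored.
function classifySubjects(regex, items) {
  return items.map(({ text, shouldMatch }) => {
    const match = text.match(regex);
    let status = 'none';
    if (match && shouldMatch) status = 'green';
    else if (match && !shouldMatch) status = 'red';
    else if (!match && shouldMatch) status = 'yellow';
    return { text, status, matchedText: match ? match[0] : null };
  });
}

console.log(classifySubjects(/free/i, subjects));
```

With the learner's attempt `/free/i`, the four subjects come back as green, red, yellow, and none respectively, and `matchedText` shows which substring triggered each match.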

As always, feedback is appreciated :)

Status: Not Started
7 participants