New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an advanced text generator based on markov chains. #254

Merged
merged 8 commits into from Mar 3, 2014

Conversation

Projects
None yet
5 participants
@TimWolla
Copy link
Contributor

TimWolla commented Feb 18, 2014

The advanced text generator's text()-generator is compatible to
Lorem's one and can be used as a drop in replacement.

see #109

Examples (The number is the maximum amount of characters, the line breaks are here for better formatting and not generated):

  10: Poor.

  20: The Duchess tucked her.

  50: Mary Ann, what are you getting on?" said the Hatter, and he.

  70: This time there could be no mistake about it—it was neither more nor l
      ess than a.

  90: She drew herself up and said very gravely, "I think I must be off, and s
      he dropped it hastily, just in time to.

 120: Very soon the Rabbit came up to her great delight, it fitted! Alice open
      ed the door and tried to make out which were the cook took the least not
      ice.

 150: I should like to be talking in its sleep, "that 'I breathe when I breath
      e!'" "It is the same height as herself. She stretched herself up on tipt
      oe and peeped over the list. Imagine.

 200: She stretched herself up on tiptoe and peeped over the list. Imagine her
       surprise when he finds out who I was when I get it home?" when it saw h
      er. "Cheshire-Puss," began Alice, rather timidly, "would you please tell
       me your history, you.

 500: However, when they liked and left off quarreling with the others. IX—W
      HO STOLE THE TARTS? The King and Queen of Hearts, she made some tarts, A
      ll on a summer day; The Knave of Hearts, carrying the King's crown on a 
      bough of a book," thought Alice, "without pictures or conversations in i
      t, "and what is the same thing," said the Cat, and vanished. Alice had f
      ound her head pressing against the ceiling, and had no reason to be seen
      —everything seemed to quiver all over with William the Conqueror." So 
      she was now the right paw 'round, "lives a March Hare. Alice gave a wear
      y sigh. "I think you ought to.
Add an advanced text generator based on markov chains.
The advanced text generator's text()-generator is compatible to
Lorem's one and can be used as a drop in replacement.
namespace Faker\Provider;
class Text extends \Faker\Provider\Base

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

Since this provider contains English text, it should probably be under the en_US locale. So I think you should commit one class with no locale and a basic text (see for instance the base Person generator), and another one, more complete with the en_US locale.

The non-locale Text provider can even be abstract, to force a proper text.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

Okay, I'll change it.

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014

Owner

After thinking about it, putting lorem ipsum text in this class brings confusion with the text class. That's why you should leave the $baseText empty and make the class abstract, to be instanciated in locales.

class Text extends \Faker\Provider\Base
{
protected static $baseText = <<<EOT

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

Please link to the original text source in the variable comments. This must be a location where the Licence of the text is apparent. Please also add the License to the comments.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

Well, the text is Alice in Wonderland (https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland), it was written in 1865, therefore all copyright should be expired.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

All the problem lies in the "should"... 😉

I think the text can be joined with the "Project Gutenberg Licence" (if you extract the text from here: http://www.gutenberg.org/cache/epub/11/pg11.txt). Again, please mention the source and the licence in the variable comments to ease future updates.

class Text extends \Faker\Provider\Base
{
protected static $baseText = <<<EOT
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do. Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

Does wrapping the text break the markov chain algorithm? It would improve code readability on GitHub.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

It does not, line breaks are normalized away anyway.

*
* @example 'Poor Alice! It was the first question, you know." Alice opened the door as you are; secondly, because they're making.'
* @param integer $maxNbChars Maximum number of characters the text should contain
* @param integer $indexSize Index size of the markov chains, higher index causes a better, but less random text.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

$indexSize and $indexUnit do not make much sense from a user's point of view. Could you abstract these into some kind of $entropy parameter?

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

I'll see that can be done.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

What about this improvement of documentation:

     * @param  integer $indexSize Determines how many words / chars are considered for the generation of the next token (higher number = better quality text, lower number = more random)
     * @param  string $indexUnit Determines whether 'words' or 'chars' represent the basis of the generator.

I don't think this can be abstracted away without giving up some freedom.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

"Token" doesn't mean anything for devs who don't know the Markov chain algo. Also, the $indexUnit description doesn't explain how it alters the generation.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

I have no idea what to write then. If you want you can give me some example, otherwise I'll fix the other issues and then update the pull request. It's up to you whether you merge it or not.

// take a random starting point
$next = static::randomKey($table);
while($resultLength < $maxNbChars && isset($table[$next])) {

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014

Owner

Coding Standards: please add a whitespace after while. You can use php-cs-fixer to do it all for you.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014

Contributor

Whoopsie, seems I missed that one.

@fzaninotto

This comment has been minimized.

Copy link
Owner

fzaninotto commented Feb 24, 2014

I absolutely love that contribution. I think it adds great value to Faker. Thanks a lot!

@Anahkiasen

This comment has been minimized.

Copy link

Anahkiasen commented Feb 24, 2014

I've been wanting something like that for a while now so I really appreciate this too.

@TimWolla

This comment has been minimized.

Copy link
Contributor

TimWolla commented Feb 24, 2014

@fzaninotto The updates are now added to this pull request.

* @example 'Lorem ipsum dolor sit amet'
* @param integer $maxNbChars Maximum number of characters the text should contain
* @param integer $indexSize Determines how many words / chars are considered for the generation of the next token (higher number = correcter, lower number = more random)
* @param string $indexUnit Determines whether 'words' or 'chars' represent the basis of the generator.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014

Owner

To simplify, I suggest to remove the ability to generate character-based markov text. It only produces legible text with 4+ characters, and by that time it's almost equivalent to a word-based markov chain.

*
* @example 'Lorem ipsum dolor sit amet'
* @param integer $maxNbChars Maximum number of characters the text should contain
* @param integer $indexSize Determines how many words / chars are considered for the generation of the next token (higher number = correcter, lower number = more random)

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014

Owner

If we abandon the ability to generate text based on chars, I think the definition of this pattern can be:

Determines how many words are considered for the generation of the next word. The minimum is 1, and it produces the higher level of randomness, although the generated text usually doesn't make sense. Higher index size (up to 10) produce more correct text, at the price of less randomness.

This comment has been minimized.

@TimWolla

TimWolla Feb 25, 2014

Contributor

Sounds good, I'll update once I'm back to my development box. Thanks!

/**
* Generate a text string.
* Depending on the $maxNbChars, returns a random valid looking text.
*

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014

Owner

I suggest you add a comment about the Markow chain algorithm.

* @param string $indexUnit Determines whether 'words' or 'chars' represent the basis of the generator.
* @return string
*/
public static function text($maxNbChars = 200, $indexSize = 2, $indexUnit = 'words')

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014

Owner

After thinking it further, I think the provider shouldn't override the Lorem one - they serve different purposes. We need another formatter name than text, so I suggest realText.

This comment has been minimized.

@TimWolla

TimWolla Feb 25, 2014

Contributor

Should I rename the class as well?

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014

Owner

No, it's not necessary to rename the class.

@TimWolla

This comment has been minimized.

Copy link
Contributor

TimWolla commented Feb 27, 2014

@fzaninotto Everything should be fine now.

throw new \InvalidArgumentException('indexSize must be at most 10');
}
if (!isset(static::$tables[$indexSize])) {

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014

Owner

I think I see a potential bug when switching locales.

$faker = Faker\Factory::create('fr_FR');
$faker->realText(100); // generates static $table cache for French locale
$faker = Faker\Factory::create('en_EN');
$faker->realText(100); // uses static $table cache for French locale, generates French text

That probably means that $tables should not be static, and therefore realText shouldn't be static either.

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014

Owner

also, $tables is a pretty generic name. I suggest $consecutiveWords.

$text = static::getNormalizedText();
// split into look up parts
$parts = preg_split('/ /u', $text, -1, PREG_SPLIT_NO_EMPTY);

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014

Owner

just out of curiosity, why not use split() instead, which is much faster?

This comment has been minimized.

@TimWolla

TimWolla Mar 3, 2014

Contributor

split() is deprecated. In case you mean explode(): Forgot to change it after removing chars as index.

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014

Owner

yes, sorry, I meant explode().

$append = static::randomElement($table[$next]);
// calculate next index
$next = preg_split('/ /u', $next, -1, PREG_SPLIT_NO_EMPTY);

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014

Owner

Why not use split() here either?

@TimWolla

This comment has been minimized.

Copy link
Contributor

TimWolla commented Mar 3, 2014

@fzaninotto Done.

@fzaninotto

This comment has been minimized.

Copy link
Owner

fzaninotto commented Mar 3, 2014

Awesome. I'll merge it right away. Thanks a lot for your patch!

@fzaninotto

This comment has been minimized.

Copy link
Owner

fzaninotto commented Mar 3, 2014

oops, sorry can't merge: tests fail.

@TimWolla

This comment has been minimized.

Copy link
Contributor

TimWolla commented Mar 3, 2014

oops, sorry can't merge: tests fail.

Funny, they did work on my development box though they shouldn't. Anyway: Fix gone out just now.

Even more funny: Hiphop worked.

fzaninotto added a commit that referenced this pull request Mar 3, 2014

Merge pull request #254 from TimWolla/advancedTextProvider
Add an advanced text generator based on markov chains.

@fzaninotto fzaninotto merged commit 680b36d into fzaninotto:master Mar 3, 2014

1 check passed

default The Travis CI build passed
Details

@TimWolla TimWolla deleted the TimWolla:advancedTextProvider branch Mar 3, 2014

@staabm

This comment has been minimized.

Copy link

staabm commented Jun 5, 2014

❤️

@tablatronix

This comment has been minimized.

Copy link

tablatronix commented Jan 15, 2015

wonderful, Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment