Skip to content
This repository has been archived by the owner. It is now read-only.

Add an advanced text generator based on markov chains. #254

Merged
merged 8 commits into from Mar 3, 2014
Merged

Add an advanced text generator based on markov chains. #254

merged 8 commits into from Mar 3, 2014

Conversation

@TimWolla
Copy link
Contributor

@TimWolla TimWolla commented Feb 18, 2014

The advanced text generator's text()-generator is compatible to
Lorem's one and can be used as a drop in replacement.

see #109

Examples (The number is the maximum amount of characters, the line breaks are here for better formatting and not generated):

  10: Poor.

  20: The Duchess tucked her.

  50: Mary Ann, what are you getting on?" said the Hatter, and he.

  70: This time there could be no mistake about it—it was neither more nor l
      ess than a.

  90: She drew herself up and said very gravely, "I think I must be off, and s
      he dropped it hastily, just in time to.

 120: Very soon the Rabbit came up to her great delight, it fitted! Alice open
      ed the door and tried to make out which were the cook took the least not
      ice.

 150: I should like to be talking in its sleep, "that 'I breathe when I breath
      e!'" "It is the same height as herself. She stretched herself up on tipt
      oe and peeped over the list. Imagine.

 200: She stretched herself up on tiptoe and peeped over the list. Imagine her
       surprise when he finds out who I was when I get it home?" when it saw h
      er. "Cheshire-Puss," began Alice, rather timidly, "would you please tell
       me your history, you.

 500: However, when they liked and left off quarreling with the others. IX—W
      HO STOLE THE TARTS? The King and Queen of Hearts, she made some tarts, A
      ll on a summer day; The Knave of Hearts, carrying the King's crown on a 
      bough of a book," thought Alice, "without pictures or conversations in i
      t, "and what is the same thing," said the Cat, and vanished. Alice had f
      ound her head pressing against the ceiling, and had no reason to be seen
      —everything seemed to quiver all over with William the Conqueror." So 
      she was now the right paw 'round, "lives a March Hare. Alice gave a wear
      y sigh. "I think you ought to.
The advanced text generator's text()-generator is compatible to
Lorem's one and can be used as a drop in replacement.

namespace Faker\Provider;

class Text extends \Faker\Provider\Base

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

Since this provider contains English text, it should probably be under the en_US locale. So I think you should commit one class with no locale and a basic text (see for instance the base Person generator), and another one, more complete with the en_US locale.

The non-locale Text provider can even be abstract, to force a proper text.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

Okay, I'll change it.

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014
Owner

After thinking about it, putting lorem ipsum text in this class brings confusion with the text class. That's why you should leave the $baseText empty and make the class abstract, to be instanciated in locales.


class Text extends \Faker\Provider\Base
{
protected static $baseText = <<<EOT

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

Please link to the original text source in the variable comments. This must be a location where the Licence of the text is apparent. Please also add the License to the comments.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

Well, the text is Alice in Wonderland (https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland), it was written in 1865, therefore all copyright should be expired.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

All the problem lies in the "should"... 😉

I think the text can be joined with the "Project Gutenberg Licence" (if you extract the text from here: http://www.gutenberg.org/cache/epub/11/pg11.txt). Again, please mention the source and the licence in the variable comments to ease future updates.

class Text extends \Faker\Provider\Base
{
protected static $baseText = <<<EOT
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do. Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

Does wrapping the text break the markov chain algorithm? It would improve code readability on GitHub.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

It does not, line breaks are normalized away anyway.

*
* @example 'Poor Alice! It was the first question, you know." Alice opened the door as you are; secondly, because they're making.'
* @param integer $maxNbChars Maximum number of characters the text should contain
* @param integer $indexSize Index size of the markov chains, higher index causes a better, but less random text.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

$indexSize and $indexUnit do not make much sense from a user's point of view. Could you abstract these into some kind of $entropy parameter?

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

I'll see that can be done.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

What about this improvement of documentation:

     * @param  integer $indexSize Determines how many words / chars are considered for the generation of the next token (higher number = better quality text, lower number = more random)
     * @param  string $indexUnit Determines whether 'words' or 'chars' represent the basis of the generator.

I don't think this can be abstracted away without giving up some freedom.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

"Token" doesn't mean anything for devs who don't know the Markov chain algo. Also, the $indexUnit description doesn't explain how it alters the generation.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

I have no idea what to write then. If you want you can give me some example, otherwise I'll fix the other issues and then update the pull request. It's up to you whether you merge it or not.


// take a random starting point
$next = static::randomKey($table);
while($resultLength < $maxNbChars && isset($table[$next])) {

This comment has been minimized.

@fzaninotto

fzaninotto Feb 24, 2014
Owner

Coding Standards: please add a whitespace after while. You can use php-cs-fixer to do it all for you.

This comment has been minimized.

@TimWolla

TimWolla Feb 24, 2014
Author Contributor

Whoopsie, seems I missed that one.

@fzaninotto
Copy link
Owner

@fzaninotto fzaninotto commented Feb 24, 2014

I absolutely love that contribution. I think it adds great value to Faker. Thanks a lot!

@Anahkiasen
Copy link

@Anahkiasen Anahkiasen commented Feb 24, 2014

I've been wanting something like that for a while now so I really appreciate this too.

@TimWolla
Copy link
Contributor Author

@TimWolla TimWolla commented Feb 24, 2014

@fzaninotto The updates are now added to this pull request.

* @example 'Lorem ipsum dolor sit amet'
* @param integer $maxNbChars Maximum number of characters the text should contain
* @param integer $indexSize Determines how many words / chars are considered for the generation of the next token (higher number = correcter, lower number = more random)
* @param string $indexUnit Determines whether 'words' or 'chars' represent the basis of the generator.

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014
Owner

To simplify, I suggest to remove the ability to generate character-based markov text. It only produces legible text with 4+ characters, and by that time it's almost equivalent to a word-based markov chain.

*
* @example 'Lorem ipsum dolor sit amet'
* @param integer $maxNbChars Maximum number of characters the text should contain
* @param integer $indexSize Determines how many words / chars are considered for the generation of the next token (higher number = correcter, lower number = more random)

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014
Owner

If we abandon the ability to generate text based on chars, I think the definition of this pattern can be:

Determines how many words are considered for the generation of the next word. The minimum is 1, and it produces the higher level of randomness, although the generated text usually doesn't make sense. Higher index size (up to 10) produce more correct text, at the price of less randomness.

This comment has been minimized.

@TimWolla

TimWolla Feb 25, 2014
Author Contributor

Sounds good, I'll update once I'm back to my development box. Thanks!

/**
* Generate a text string.
* Depending on the $maxNbChars, returns a random valid looking text.
*

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014
Owner

I suggest you add a comment about the Markow chain algorithm.

* @param string $indexUnit Determines whether 'words' or 'chars' represent the basis of the generator.
* @return string
*/
public static function text($maxNbChars = 200, $indexSize = 2, $indexUnit = 'words')

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014
Owner

After thinking it further, I think the provider shouldn't override the Lorem one - they serve different purposes. We need another formatter name than text, so I suggest realText.

This comment has been minimized.

@TimWolla

TimWolla Feb 25, 2014
Author Contributor

Should I rename the class as well?

This comment has been minimized.

@fzaninotto

fzaninotto Feb 25, 2014
Owner

No, it's not necessary to rename the class.

@TimWolla
Copy link
Contributor Author

@TimWolla TimWolla commented Feb 27, 2014

@fzaninotto Everything should be fine now.

throw new \InvalidArgumentException('indexSize must be at most 10');
}

if (!isset(static::$tables[$indexSize])) {

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014
Owner

I think I see a potential bug when switching locales.

$faker = Faker\Factory::create('fr_FR');
$faker->realText(100); // generates static $table cache for French locale
$faker = Faker\Factory::create('en_EN');
$faker->realText(100); // uses static $table cache for French locale, generates French text

That probably means that $tables should not be static, and therefore realText shouldn't be static either.

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014
Owner

also, $tables is a pretty generic name. I suggest $consecutiveWords.

$text = static::getNormalizedText();

// split into look up parts
$parts = preg_split('/ /u', $text, -1, PREG_SPLIT_NO_EMPTY);

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014
Owner

just out of curiosity, why not use split() instead, which is much faster?

This comment has been minimized.

@TimWolla

TimWolla Mar 3, 2014
Author Contributor

split() is deprecated. In case you mean explode(): Forgot to change it after removing chars as index.

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014
Owner

yes, sorry, I meant explode().

$append = static::randomElement($table[$next]);

// calculate next index
$next = preg_split('/ /u', $next, -1, PREG_SPLIT_NO_EMPTY);

This comment has been minimized.

@fzaninotto

fzaninotto Mar 3, 2014
Owner

Why not use split() here either?

@TimWolla
Copy link
Contributor Author

@TimWolla TimWolla commented Mar 3, 2014

@fzaninotto Done.

@fzaninotto
Copy link
Owner

@fzaninotto fzaninotto commented Mar 3, 2014

Awesome. I'll merge it right away. Thanks a lot for your patch!

@fzaninotto
Copy link
Owner

@fzaninotto fzaninotto commented Mar 3, 2014

oops, sorry can't merge: tests fail.

@TimWolla
Copy link
Contributor Author

@TimWolla TimWolla commented Mar 3, 2014

oops, sorry can't merge: tests fail.

Funny, they did work on my development box though they shouldn't. Anyway: Fix gone out just now.

Even more funny: Hiphop worked.

fzaninotto added a commit that referenced this pull request Mar 3, 2014
Add an advanced text generator based on markov chains.
@fzaninotto fzaninotto merged commit 680b36d into fzaninotto:master Mar 3, 2014
1 check passed
1 check passed
@fzaninotto
default The Travis CI build passed
Details
@TimWolla TimWolla deleted the TimWolla:advancedTextProvider branch Mar 3, 2014
@staabm
Copy link

@staabm staabm commented Jun 5, 2014

❤️

@tablatronix
Copy link

@tablatronix tablatronix commented Jan 15, 2015

wonderful, Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants