# Info
This notebook works pretty differently from txt_from_web.ipynb. That notebook is meant to get you a bunch of text files to analyze. In theory, you would go through it, get what you want, and go back to it as needed for more text.

By contrast, this notebook is more interactive, and you're intended to fill certain blocks with your own code to make use of the main object: marky the Markov Model Manager! The functions you can make use of are described in the code blocks, but I'll also have a block with examples of the functions being used, and how you might make use of a Markov Model Manager object.

The main idea is that marky will take the folder you set in section 0 as `root`, create any missing folders, and allow you to handle file management and markov model building all in one place! You can:

* import text files and turn them into markov models
* import markov models from json files
* create new models and then export them as json files
* combine models and export the result as a json file
* create formatted sentence outputs for any model so you can test models and fiddle with settings

(Note: jsons are used here because you can convert a markovify model to json, save the json, and convert it back to a model later. Making a model from json is way faster than remaking the model from a string. Get more info on markovify here: https://github.com/jsvine/markovify)
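To make the "save as json, reload later" idea concrete, here is a minimal sketch that uses only the standard library. It is a toy stand-in, not markovify's actual storage format: `build_chain` is a hypothetical helper that builds a tiny word-to-followers table, and markovify does the real equivalent through `to_json()` and `from_json()`.

```python
import json
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it (toy state size of 1)."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

text = "the cat sat on the mat and the cat slept"
chain = build_chain(text)          # the slow step on a large corpus
chain_json = json.dumps(chain)     # this string is what you would save to disk
restored = json.loads(chain_json)  # reloading skips re-reading the original text
print(restored == chain)
```

The reloaded table is identical to the one built from the text, which is the property that makes caching models as json worthwhile.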

# 0. Universal Stuff - Run First

In [4]:
import re
import os
import markovify
import spacy
nlp = spacy.load("en_core_web_sm")

# defines class that extends markovify model by incorporating part-of-speech tagging
# note that using POSifiedText(string) over markovify.Text(string) will take a lot longer
class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_,word.pos_)) for word in nlp(sentence)]
    
    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

# note the / at the end!! no matter how long the folder path, it needs to end in / or shit breaks
root = "data/"
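If you'd rather not rely on remembering the trailing `/`, one option is to normalize the path before handing it to the manager. This is only a sketch; `with_trailing_slash` is a hypothetical helper and is not part of this notebook's code.

```python
def with_trailing_slash(path):
    """Return path unchanged if it already ends in "/", otherwise append one."""
    return path if path.endswith("/") else path + "/"

# either spelling of the folder ends up in the form the manager expects
root = with_trailing_slash("data")
print(root + "txts/")
```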

# marky the Markov Model Manager

### Run next block first

In [47]:
# this defines the class that we will make an instance of to do our file and model management;
# i recommend reading a function's description if you want to know what it is/how to use it
class markovManager(dict):
    def __init__(self, root_folder):
        '''
        Initializes an instance of markovManager given the root folder where your data is stored, defined in section 0.
        Note that this runs directory_setup(), which means it will check for folders called "txts/" and "jsons/".
        If these folders do not exist, it will create them. 
        These are where your .txt and markovify .jsons should be stored, and where jsons will be written out.
        '''
        self.root_folder = root_folder
        # if you want to change where text files are read from, this is where you'll do it
        self.txts_folder = self.root_folder+"txts/"
        # if you want to change where json files are read from and saved to, this is where you'll do it
        self.jsons_folder = self.root_folder+"jsons/"
        self.directory_setup()
        
        # creates dictionary that maps filenames in txts folder (w/o extension) to contents from the associated files
        self.txts = self.folder_to_dict(self.txts_folder)
        # creates dictionary that maps filenames in jsons folder (w/o extension) to contents from the associated files
        self.jsons = self.folder_to_dict(self.jsons_folder)
        # creates dictionary that maps filenames in the jsons folder (w/o extension) to markovify model objects
        self.models = {key:self.json_interpreter(self.jsons[key]) for key in self.jsons.keys()}

# NOTE: the next set of functions are ones you generally will *not* be calling
    def directory_setup(self):
        '''
        Checks the root_folder for subfolders called "txts" and "jsons"; if they do not exist, they will be created.
        Note this runs upon initialization, but should not generally be called.
        '''
        for folder in [self.root_folder, self.txts_folder, self.jsons_folder]:
            if not os.path.isdir(folder):
                path_out = (os.getcwd()+"/"+folder).replace("\\", "/")
                print("creating {}".format(path_out))
                os.mkdir(folder)
    
    def file_reader(self, path):
        '''
        Receives a string representing a filepath, and returns a string of that file's contents.
        Note that you will generally not be running this function; it's used for other functions you will use.
        '''
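        # read as UTF-8 so emoji and accented characters don't come out garbled on systems with another default
        with open(path, 'r', encoding="utf-8", errors="ignore") as f: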
            content = f.read()
        return content
    
    def json_interpreter(self, json):
        '''
        Receives a string of json representing a stored markovify model, and returns a markovify model object.
        Note that it checks if the model represented by the json is POSified or just a regular markovify object.
        Note that you will generally not be running this function; it's used for other functions you will use.
        '''
        if "::" in json:
            model = POSifiedText.from_json(json)
        else:
            model = markovify.Text.from_json(json)
        return model
    
    def folder_to_dict(self, folder):
        '''
        Receives a string representing a folder path, and returns a dictionary mapping file names in the folder
        to a string of that file's contents.
        Note that you will generally not be running this function; it's used for other functions you will use.
        '''
        files = dict()
        filenames = os.listdir(folder)
        for filename in filenames:
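            # splitext only splits on the last ".", so names like "my.notes.txt" don't break the unpacking
            filename_noext, filename_ext = os.path.splitext(filename)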
            content = self.file_reader(folder+filename)
            files[filename_noext] = content
        return files
    
# NOTE: the next set of functions are ones you *will* be calling
    def update(self):
        '''
        Checks the txts and jsons folders for new files to import.
        If the filename (w/o extension) is already in the self.txts or self.jsons dictionary,
            then that file will be skipped.
        Any new jsons will be automatically converted to a markov model and added to self.models.
        '''
        for (folder, dictionary) in [(self.txts_folder, self.txts), (self.jsons_folder, self.jsons)]:
            filenames = os.listdir(folder)
            for filename in filenames:
                # splitext only splits on the last "." and keeps it in the extension, e.g. ("emma", ".txt")
                filename_noext, filename_ext = os.path.splitext(filename)
                if filename_noext not in dictionary.keys() and filename_ext == ".txt":
                    dictionary[filename_noext] = self.file_reader(folder+filename)
                elif filename_noext not in dictionary.keys() and filename_ext == ".json":
                    dictionary[filename_noext] = self.file_reader(folder+filename)
                    new_model = self.json_interpreter(dictionary[filename_noext])
                    self.models[filename_noext] = new_model
    
    def add_model(self, name, string, pos=False, state_size=2):
        '''
        Receives a string for the name of the new model and a string to make a model from.
        Creates a model and adds a pairing of the new name to the model to self.models.
        Note that you can optionally create a POSified model or change the model's state size
            by passing the optional parameters `pos=True` or `state_size=n` where n is a positive integer.
        Note that this does NOT automatically export the model to self.jsons! You need to do that manually
            using export_model().
        '''
        # any truthy value for pos builds a POSified model; this way model is always defined
        if pos:
            model = POSifiedText(string, state_size=state_size)
        else:
            model = markovify.Text(string, state_size=state_size)
        self.models[name] = model
        
    def combine_models(self, name, model_names, weights=False):
        '''
        Receives a string for the name of the new model and a list of model names to combine.
        Creates a new combined model and adds a pairing of the new name to the model to self.models.
        Note that you can optionally weight how much each model contributes to the combined model
            by passing the optional parameter `weights=[weight1, weight2, ..., weightN]` where 
            each weight is a positive number and N is the number of models being combined.
            Not passing a weights parameter automatically sets each weight to 1.
        Note that this does NOT automatically export the model to self.jsons! You need to do that manually
            using export_model().        
        '''
        if not weights:
            weights = [1] * len(model_names)
        models = list()
        for model_name in model_names:
            models.append(self.models[model_name])
        model_combo = markovify.combine(models, weights)
        self.models[name] = model_combo
        
    def export_model(self, model_name):
        '''
        Receives a string representing a model name in self.models.keys(), converts the model to json,
            adds it to self.jsons, and then exports that json out to a file with the same name as the model.
        Note that this will overwrite any files with the same name in the jsons folder.
        The reason to make/use jsons is it's much faster than creating the model anew from a string.
        '''
        model = self.models[model_name]
        model_json = model.to_json()
        self.jsons[model_name] = model_json
        with open(self.jsons_folder+model_name+".json",'w') as f:
            f.write(model_json)
    
    def output(self, model_name):
        '''
        Receives a string representing a model name, generates a new sentence using that model, and then returns
            a formatted string made from that sentence that fixes some awkward formatting issues.
        Note that the replacements were largely determined by the files I was using, so you may need to adjust them.
        '''
        model = self.models[model_name]
        sentence = model.make_sentence()
        if sentence is not None:
            sentence = sentence.replace(" ,",",").replace(" .",".")
            # str.replace() only matches literal text, so the lookbehind pattern has to go through re.sub()
            sentence = re.sub(r'(?<=[^:]) \.\.\.', "...", sentence)
            sentence = sentence.replace(":...",": ...")\
                .replace(" ?","?").replace(" !","!").replace(" ;",";").replace(" :",":").replace("  "," ").replace("   ","  ")\
                .replace("\n","").replace(" )", ")").replace("( ","(").replace(";;",";").replace("::",":")
            try: 
                # fixes apostrophe errors for contractions, e.g. "I 'm" or "do n't"
                patterns = ["[A-Za-z]* '[a-z]", "[A-Za-z]* [a-z]'[a-z]"]
                for pattern in patterns:
                    matches = re.findall(pattern, sentence)
                    for match in matches:
                        match_f = match.replace(" ","")
                        sentence = sentence.replace(match, match_f)
                # fixes missing spaces after questions and exclamations, e.g. "Hi!It's" or "Yes?This"
                # re.sub only touches the ? or ! that is directly followed by a capital letter
                odd_pattern = "([?!])(?=[A-Z])"
                sentence = re.sub(odd_pattern, r"\1 ", sentence)
                if sentence[-1] == " ":
                    sentence = sentence[0:-1]
            except Exception:
                pass
            return sentence
        else:
            return "Sentence could not be generated for '{}', sorry. :( Try again!!".format(model_name)
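The cleanup that `output()` does can be tried out on its own, without a model. Below is a pared-down sketch of the same idea using `re.sub`; `tidy` is a hypothetical helper with only three of the rules, so treat it as a starting point for your own replacements rather than a copy of what `output()` does.

```python
import re

def tidy(sentence):
    """Toy cleanup: close gaps before punctuation and inside split contractions."""
    # remove the space before , . ? ! ; :
    sentence = re.sub(r" ([,.?!;:])", r"\1", sentence)
    # rejoin contractions split by a tokenizer, e.g. "I 'm" or "do n't"
    sentence = re.sub(r"(\w) (n't|'[a-z]+)\b", r"\1\2", sentence)
    # add the missing space after ? or ! when a capital letter follows directly
    sentence = re.sub(r"([?!])(?=[A-Z])", r"\1 ", sentence)
    return sentence.strip()

print(tidy("I 'm sure , are n't you ?"))
print(tidy("Hi!It 's me ."))
```

Testing rules on hand-written strings like these is quicker than waiting for a model to happen to generate the awkward case.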

### a. In these next blocks, you'll create an instance of the Markov Model Manager, and then do stuff with it.

In [48]:
'''
Here you're creating an instance of markovManager given the root folder you defined in section 0.
You CAN change the name, but just so you know the canonical name is marky.
If you want to import text files or jsons, set up "txts/" and "jsons/" folders in the folder you defined as root,
    and then add files to them before running this block. You can add stuff later, though.
'''
marky = markovManager(root)

### b. The next set of blocks are meant to be changed! They are just showing some examples of what you can do.

The main "things" marky has that you'll be using are three dictionaries. If you're unsure how to make use of dictionaries, I'd read through this resource: https://www.w3schools.com/python/python_dictionaries.asp

The dictionaries are:
1. marky.txts: maps names of text files (w/o extension) to their contents
2. marky.models: maps names of models to markovify model objects
3. marky.jsons: maps names of jsons to the json version of markovify models

In [49]:
print("txts in marky: ", marky.txts.keys())
print("models in marky: ", marky.models.keys())
print("jsons in marky: ", marky.jsons.keys())

print("\ntxt contents\n-------")
for key in marky.txts.keys():
    print(key+": ", marky.txts[key][0:100])

print("\nmodels\n-------")
for key in marky.models.keys():
    print(key+": ", marky.models[key])

txts in marky:  dict_keys(['emma', 'gigi_transcripts', 'gigi_tweets', 'mystery', 'pathologic_dia', 'pathologic_script', 'scifi'])
models in marky:  dict_keys(['gigi_transcripts', 'gigi_tweets', 'mystery', 'scifi', 'test'])
jsons in marky:  dict_keys(['gigi_transcripts', 'gigi_tweets', 'mystery', 'scifi', 'test'])

txt contents
-------
emma:  VOLUME I CHAPTER I Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy dis
gigi_transcripts:  hey guys it's Gigi Happy Thanksgiving Cheers join me making my apple cranberry rosemary stuffing tha
gigi_tweets:  NEW VIDEO💖 2 GIRLS 1 WISH  Mr  is on the podcast this week and let me tell you, she brings the LA
mystery:  There were thirty-eight patients on the bus the morning I left for Hanover, most of them disturbed a
pathologic_dia:  And so we finally meet, oynon. Just a pair of pawns on this dreadful chessboard where each square is
pathologic_script:  Haruspex: And so we finally meet, oynon. Just a pair of pawns on this d

In [52]:
for key in marky.models.keys():
    print(key+": ", marky.output(key))

gigi_transcripts:  Sentence could not be generated for 'gigi_transcripts', sorry. :( Try again!!
gigi_tweets:  Sentence could not be generated for 'gigi_tweets', sorry. :( Try again!!
mystery:  He keeps riding me because I like to sing something of Glendora's dismal look and see if you will.
scifi:  He turned back to his brothers, merge without let.
test:  Sentence could not be generated for 'test', sorry. :( Try again!!


The main functions you'll be using are:
1. `marky.update()`
2. `marky.add_model(new_name, string, pos=False, state_size=2)`
3. `marky.combine_models(new_name, model_names, weights=False)`
4. `marky.export_model(model_name)`
5. `marky.output(model_name)`
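The `weights` parameter of `combine_models` is the least obvious of these, so here is a toy illustration of what weighting means. It assumes (my understanding, not something this notebook confirms) that combining scales each model's transition counts by its weight before adding them up. `combine_counts` and the counts below are made up for the example.

```python
from collections import Counter

def combine_counts(count_tables, weights=None):
    """Add up follower counts from several tables, scaling each table by its weight."""
    if weights is None:
        weights = [1] * len(count_tables)
    combined = Counter()
    for table, weight in zip(count_tables, weights):
        for word, count in table.items():
            combined[word] += count * weight
    return combined

# hypothetical counts of words seen after "the" in two different corpora
mystery_after_the = {"detective": 3, "ship": 1}
scifi_after_the = {"ship": 4, "robot": 2}

equal = combine_counts([mystery_after_the, scifi_after_the])
scifi_heavy = combine_counts([mystery_after_the, scifi_after_the], weights=[1, 3])
print(equal)
print(scifi_heavy)
```

With equal weights "ship" has a count of 5, and with the second table weighted 3 it rises to 13, so the combined model leans toward whichever source you weight more heavily.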

If you want to learn about what a function does or how to use it, simply type `help(marky.NAME)`, where NAME is the function you want to learn more about without parentheses. See example below.

In [60]:
help(marky.update)

Help on method update in module __main__:

update() method of __main__.markovManager instance
    Checks the txts and jsons folders for new files to import.
    If the filename (w/o extension) is already in the self.txts or self.jsons dictionary,
        then that file will be skipped.
    Any new jsons will be automatically converted to a markov model and added to self.models.
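`help()` gets this text from the function's docstring, the `'''` block right under the `def` line. If you add your own functions, giving them a docstring makes them show up the same way. A small sketch, with `greet` as a made-up function:

```python
import inspect

def greet(name):
    '''
    Receives a string representing a name, and returns a greeting for that name.
    '''
    return "Hello, {}!".format(name)

# inspect.getdoc returns the docstring with its indentation cleaned up
print(inspect.getdoc(greet))
```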

