A character-level language model built using TensorFlow
-
Clone the repository
git clone https://github.com/coder3101/char2char
OR
Download as a zip from this link and then unzip
-
Install the required packages using the command below:
pip install -r requirements.txt
Run the above command from the directory where you downloaded or cloned the repository. You need a working internet connection for the command to complete.
To train a model on your own dataset, run the train.py script. First put your data in a text file; say data.txt contains a large body of text. Copy that file into the folder where you cloned or unzipped the repository.
Run the following to train the model on your own data (say, data.txt):
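Before training, it can help to see the text the way a character-level model does: as a sequence drawn from a small vocabulary of distinct characters. The snippet below is a minimal sketch of that view, not code from this repository. The helper name `describe_text` and the sample string are made up for illustration.

```python
from collections import Counter


def describe_text(text):
    """Summarise a corpus the way a character-level model sees it."""
    counts = Counter(text)
    vocabulary = sorted(counts)  # every distinct character, in sorted order
    # Map each character to an integer id (an illustrative encoding only;
    # the repository's own encoding may differ).
    char_to_id = {ch: i for i, ch in enumerate(vocabulary)}
    return {
        "total_chars": len(text),
        "vocab_size": len(vocabulary),
        "char_to_id": char_to_id,
        "most_common": counts.most_common(3),
    }


# Tiny in-memory stand-in; for a real check, read your own file instead, e.g.
#   text = open("./data.txt", encoding="utf-8").read()
summary = describe_text("acme corp\nacorn labs\n")
print(summary["total_chars"], summary["vocab_size"])
```

If reading your real file raises a `UnicodeDecodeError`, the file is probably not in the encoding you expected, which is worth knowing before you pass `--encoding` to the train script.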
python train.py --file_name "./data.txt" --name "name_of_model"
To tune hyperparameters such as batch_size and learning_rate, pass additional arguments to the train script:
python train.py --file_name "./data.txt" \
--name "name_of_model" \
--encoding "utf-8" \
--epochs 50 \
--batch_size 100 \
--units 256 \
--num_layers 3 \
--cell_type "gru" \
--input_dropout_keep_prob 0.8 \
--output_dropout_keep_prob 0.8 \
--learning_rate 0.01 \
--optimizer "rms"
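If you want to try several hyperparameter settings, the command line above can be assembled programmatically. The following is a rough sketch that uses only flags shown in this README. The helper `build_train_command`, the grid values, and the run-name pattern are assumptions for illustration, not part of the repository.

```python
import itertools
import subprocess  # only needed if you uncomment the launch line below
import sys


def build_train_command(file_name, name, **hyperparams):
    """Return the argument list for one train.py run."""
    command = [sys.executable, "train.py", "--file_name", file_name, "--name", name]
    for flag, value in hyperparams.items():
        command += [f"--{flag}", str(value)]
    return command


# Hypothetical grid: the values are illustrative, not recommendations.
grid = {"units": [128, 256], "learning_rate": [0.01, 0.001]}
commands = []
for units, lr in itertools.product(grid["units"], grid["learning_rate"]):
    run_name = f"model_u{units}_lr{lr}"  # made-up naming scheme
    commands.append(
        build_train_command("./data.txt", run_name, units=units, learning_rate=lr)
    )

for command in commands:
    print(" ".join(command))
    # To actually launch training (requires train.py in the working directory):
    # subprocess.run(command, check=True)
```

Giving each run its own `--name` keeps the runs distinguishable, on the assumption that the script names its output files after the model.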
To see what each argument does, run:
python train.py --help
The train script runs for the specified number of epochs and shows a progress bar in the terminal window. When it completes, it produces a JSON file and a folder of trained parameter files. The sample script uses these files to get predictions from the trained model.
File(s) produced | Use |
---|---|
name_of_model.json | Contains the configuration used by the sample script. |
saved-v1 directory | Contains the parameters learned during training. |
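To check what a training run recorded, you can open the JSON file with Python's standard `json` module. This README does not document the file's keys, so the sketch below writes a stand-in file with made-up keys (`units`, `cell_type`) purely to demonstrate the loading step. Point `load_config` at your own name_of_model.json to see the real contents.

```python
import json
import os
import tempfile


def load_config(path):
    """Read a JSON config file and return it as a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in file with hypothetical keys; the real file's keys may differ.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "name_of_model.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({"units": 256, "cell_type": "gru"}, handle)
    config = load_config(demo_path)

for key, value in sorted(config.items()):
    print(f"{key}: {value}")
```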
To get predictions from the trained model, run the sample script, which writes the predictions to a file:
python sample.py --output_json "name_of_model.json" \
--seq_len 200 \
--source_file "./data.txt"
The above script generates a file named name_of_model-output.txt, which contains the output produced by the model. The number of characters in the file is set by the --seq_len argument.
Here, --output_json is the name of the JSON file generated by the train script and --source_file is the file you trained on.
For more information, run:
python sample.py --help
We trained the model on company_names.txt and then sampled predictions, which gave the following new names:
- Marsen
- Penin
- Genir
Many more names were generated; you can look at new_names.txt, the file that was generated by the sample script.
The names may not seem very novel. That is because we trained for only a small number of epochs. You can always train the model again with more epochs and more units, which should produce more novel text.
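One simple way to judge novelty is to check which sampled names never appear in the training data. The sketch below assumes both files hold one name per line, which this README does not confirm. The helper `novel_names` and the training names in the example are invented for illustration; only Marsen, Penin, and Genir come from the run described above.

```python
def novel_names(generated_text, training_text):
    """Return generated lines that never appear as a line in the training text."""
    # Compare case-insensitively so "Acme" and "acme" count as the same name.
    known = {
        line.strip().lower()
        for line in training_text.splitlines()
        if line.strip()
    }
    seen = set()
    fresh = []
    for line in generated_text.splitlines():
        name = line.strip()
        key = name.lower()
        if name and key not in known and key not in seen:
            seen.add(key)  # skip later duplicates of the same name
            fresh.append(name)
    return fresh


# Placeholder data; in practice read company_names.txt and new_names.txt.
training = "Acme\nGlobex\nInitech\n"
generated = "Marsen\nAcme\nPenin\nmarsen\nGenir\n"
print(novel_names(generated, training))  # prints ['Marsen', 'Penin', 'Genir']
```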
SPECIAL THANKS TO Rohan FOR HIS HARD WORK IN COLLECTING THE DATA