Additional notes for GitHub
## Using regex 
Regular expressions were used to clean up text files and extract the dialogue of individual characters. This was an earlier project (Oct-Dec 2021). 

## Using regex to clean up data from online txt files of Oscar Wilde plays

The corpus for this NLP project consisted of Oscar Wilde plays. The goal was to create individual files containing the lines of each character in each play. Eventually, these files would be fed into a fine-tuned LSTM to generate outputs. 


In [4]:
#Import all necessary packages
import numpy as np
import re

In [3]:
#NLPMP1
#At the beginning, the file is manually cleaned of any introductory or concluding text, 
#or any general text associated with online ebooks that is not a part of the play. 
#Additionally, scene beginnings and set designs are also deleted manually. 

lwfo = open(r'C:\Users\____\Documents\GitHub\nlp-21-22\data\lwfan1.txt', 'r') 
# opens the original file in read mode (the file is not edited or changed)
pfr = lwfo.read() 
#pfr - play first read; pfr is a variable that stores the full text of the play as a single string

reg1 = r"\[.+?\n?.+?\]" # regex to remove anything in square brackets 
#This works for bracketed text that sits on one line or is split by a single newline
#I tried other quantifiers such as "+", "*", or "{}"
#to handle bracketed text that runs over more than two lines, 
#but with limited success, so the remaining square brackets were cleaned up manually

pfr = re.sub(reg1, "", str(pfr), flags=re.M) 
#replaces the square brackets and their contents with an empty string, thereby removing them

reg2 = r'(?<=[A-Z]{5})(?:\.\s\s){1}' #regex to mark the start of each character's spoken lines with an equals sign
#In Lady Windermere's Fan, each speech starts with the character's name, 
#which ends in at least 5 capital letters, followed by a full stop and two spaces ".  ",
#after which the character's lines begin.
#Thus, the regex changes the ".  " to an equals sign.
#An equals sign was chosen because manual inspection showed that equals signs
#were absent from the text or at least rare.
#This makes an equals sign a useful delimiter for separating character lines
#when regex is later used to create txt files of individual characters' lines. 

pfr = re.sub(reg2, '=', pfr, flags=re.M) #substitutes an equals sign for ".  "

lwfe = open(r'C:\Users\____\Documents\GitHub\nlp-21-22\data\lwfan2.txt', 'w') 
#creates a new file to write things into
per = lwfe.write(str(pfr)) 
#writes into new file the cleaned up string from the original txt file

lwfe.close() 
lwfo.close()
#closes both txt files, which means manual editing can take place if necessary
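
Both substitutions above can be tried on a small sample. The sketch below swaps the two-line bracket pattern for a negated character class, `\[[^\]]*\]`, which is one way to handle stage directions of any length, and then applies the `=` delimiter step. It assumes stage directions never contain nested square brackets; the sample text and the function name `clean_play` are invented for illustration and are not part of the original notebook.

```python
import re

# [^\]]* matches any character except "]", including newlines,
# so a bracketed stage direction may span any number of lines.
BRACKETS = re.compile(r"\[[^\]]*\]")
# Equivalent to the notebook's delimiter pattern: ".  " after 5 capital letters.
SPEAKER_END = re.compile(r"(?<=[A-Z]{5})\.\s\s")

def clean_play(text: str) -> str:
    """Hypothetical helper: strip stage directions, then mark speakers with '='."""
    text = BRACKETS.sub("", text)
    return SPEAKER_END.sub("=", text)

# Invented sample with a stage direction that runs over three lines.
sample = (
    "LORD DARLINGTON.  How do you do? [Shakes hands\n"
    "and crosses the room,\n"
    "then sits.] Charming day.\n"
)
cleaned = clean_play(sample)
print(cleaned)
```

Removing a bracketed span leaves the spaces on either side of it, so a later whitespace clean-up may still be wanted.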



In [172]:
#NLPMP2
#The following code was used to remove newlines

lwfo = open(r'C:\Users\____\Documents\GitHub\nlp-21-22\data\lwfan_n9.txt', 'r') 
pfr = lwfo.read() 
#as earlier, the file is opened in read mode and its full text is read into a single string

#reg2 = r'\n(?=[a-z]{3})' # removes any newline followed by three lowercase letters
#reg2 = r'\n(?=\n{1})' # removes any newline followed by another newline
#reg2 = r'\n(?=[^A-Z]{3})' # removes any newline followed by three characters that are all non-capitals
reg2 = r'\n(?![A-Z]{3})' # removes any newline not followed by three capital letters
#The last two expressions are similar but not equivalent: '\n(?![A-Z]{3})' also matches
#newlines followed by a mix of capitals and other characters, which '\n(?=[^A-Z]{3})' skips
#These 4 expressions overlap in part;
#all of them were used to make sure the data cleaning was thorough.
#The 4 were applied in succession, and if manual inspection showed any remaining issues,
#the relevant expressions were re-applied
pfr = re.sub(reg2, ' ', pfr, flags=re.M) 
#the above line replaces each matched newline with a space. 
#The space matters because without it, words separated only by a newline
#would be joined together. 

lwfe = open(r'C:\Users\____\Documents\GitHub\nlp-21-22\data\lwfan_n10.txt', 'w') 
per = lwfe.write(str(pfr)) 
lwfe.close()
lwfo.close()
#as earlier, a new file is used to write the cleaned txt file into and the files are then closed
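
The effect of the negative lookahead used above can be checked on a small sample. This is a minimal sketch; the sample text is invented for illustration, and speaker lines are assumed to start with at least three capital letters.

```python
import re

# Invented sample: one speech wrapped over three lines, then a new speaker.
sample = (
    "LADY WINDERMERE=I am so glad\n"
    "you have come. I want\n"
    "to talk to you.\n"
    "LORD DARLINGTON=Do you?"
)

# Replace a newline with a space unless the next three characters are capitals,
# i.e. unless the next line starts with a speaker name.
joined = re.sub(r"\n(?![A-Z]{3})", " ", sample)
print(joined)
```

A wrapped line that happens to begin with three capital letters is left unjoined, which is one reason a manual check afterwards remains useful.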

In [16]:
#NLPMP3

lwfo = open(r'C:\Users\____\Documents\GitHub\nlp-21-22\data\lwfan_n10.txt', 'r') 
pfr = lwfo.read() 
#As above, the file is opened and read

reg2 = r'(?<=\nMRS\. ERLYNNE=).+(?=\n)'
#The above regex finds the text between a character's name and the following newline. 
#As most newlines were removed earlier, we are left with a txt file where each speech begins 
#with the character's name in capital letters, followed by an equals sign, and ends 
#with a newline. 

#An alternative way of writing the above is below
#name = 'CHARACTER NAME'
#reg2 = r'(?<=\n' + re.escape(name) + '=).+(?=\n)'
#Here the character's name is stored in a variable, so it only has to be typed once;
#re.escape keeps characters such as the "." in "MRS." literal
pfr = re.findall(reg2, str(pfr)) 

lwfe = open(r'C:\Users\____\Documents\GitHub\nlp-21-22\data\lwfan_Erlynne.txt', 'a') 
#the above opens a new document in append mode, where each string produced is added to the text file

#a for loop that goes through all the matches with the regex expression
#and appends them to the new text file 
#and separates each match with a newline
for i in pfr:
    lwfe.write(i+'\n')
    
#closing the files at the end
lwfe.close()
lwfo.close()

#attempts were made to automate this with a for loop
#the idea was to list the character names and then loop over that list
#however, when it came to creating an output file such as "lwfe",
#there were issues with splitting the filepath into manageable variable sections:
#some parts were treated as literal strings and others as variable text 
#As a result, the characters' lines were separated into one file per character individually
#Text files were also combined manually to build archetypes or to group characters by age, gender, 
#or perceived social status 
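
The file-path problem described above can be sidestepped by building each output path from the character's name with `pathlib` and an f-string. This is a minimal sketch, assuming the cleaned format shown earlier (`NAME=speech`, one speech per line). The character list, the sample text, the temporary output folder, and the file-naming scheme are illustrative assumptions, not the original project's.

```python
import re
import tempfile
from pathlib import Path

# Invented sample in the cleaned "NAME=speech" format.
play = (
    "\nMRS. ERLYNNE=How do you do, Lord Windermere?\n"
    "LORD WINDERMERE=Quite well, thank you.\n"
    "MRS. ERLYNNE=What a charming room.\n"
)

characters = ["MRS. ERLYNNE", "LORD WINDERMERE"]  # assumed list of speakers
out_dir = Path(tempfile.mkdtemp())  # stand-in for the project's data folder

for name in characters:
    # re.escape keeps the "." and the space in the name literal.
    pattern = r"(?<=\n" + re.escape(name) + r"=).+(?=\n)"
    speeches = re.findall(pattern, play)

    # Build the path from variables: "MRS. ERLYNNE" -> lwfan_MRS_ERLYNNE.txt
    safe_name = re.sub(r"\W+", "_", name)
    out_path = out_dir / f"lwfan_{safe_name}.txt"
    out_path.write_text("\n".join(speeches) + "\n", encoding="utf-8")

print(sorted(p.name for p in out_dir.iterdir()))
```

Because the path is assembled from `Path` objects and an f-string, no part of it has to be hand-split into string and variable pieces.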

## Discussion and citations

Much of this work was made possible by the course materials provided in the Natural Language Processing module of my MSc course, the Python `re` documentation, and general sources such as Stack Overflow. They are cited below.  

Citations:

W3schools.com. 2021. Python For Loops. [online] Available at: <https://www.w3schools.com/python/python_for_loops.asp> [Accessed 1 December 2021].

W3schools.com. 2021. Python File Open. [online] Available at: <https://www.w3schools.com/python/python_file_handling.asp> [Accessed 1 December 2021].

Docs.python.org. 2021. re — Regular expression operations — Python 3.10.0 documentation. [online] Available at: <https://docs.python.org/3/library/re.html> [Accessed 1 December 2021].

Stack Overflow. 2021. How to open a file using the open with statement. [online] Available at: <https://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement> [Accessed 1 December 2021].

Texts accessed:

Wilde, O., 2021. A Woman of No Importance. [online] Available at: <https://www.gutenberg.org/ebooks/854> [Accessed 1 December 2021].

Wilde, O., 2021. An Ideal Husband : Wilde, Oscar, 1854-1900 : Free Download, Borrow, and Streaming : Internet Archive. [online] Internet Archive. Available at: <https://archive.org/details/anidealhusband00885gut> [Accessed 1 December 2021].

Wilde, O., 2021. Lady Windermere's Fan : Wilde, Oscar, 1854-1900 : Free Download, Borrow, and Streaming : Internet Archive. [online] Internet Archive. Available at: <https://archive.org/details/ladywindermeresf00790gut> [Accessed 1 December 2021].

Wilde, O., 2021. The Importance of Being Earnest. [online] Available at: <https://www.gutenberg.org/ebooks/844> [Accessed 1 December 2021].

Colab notebooks were used to generate text. The Week 5 notebooks for fine-tuning a pre-trained LSTM were used; these rely on the following GitHub repositories:

Woolf, M., 2021. GitHub - minimaxir/gpt-2-simple: Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts. [online] GitHub. Available at: <https://github.com/minimaxir/gpt-2-simple> [Accessed 1 December 2021].

Woolf, M., 2021. GitHub - minimaxir/textgenrnn: Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code. [online] GitHub. Available at: <https://github.com/minimaxir/textgenrnn> [Accessed 1 December 2021].
