# Team Viviane Solomon and Brandon Bonifacio
# How We Split Up The Work: ...

# HW7: Train a Sequence Classifier That Can Predict if a Sentence is in English or Spanish.

The goal of this assignment is to train a sequence classifier that can predict if a sentence is in 
English or Spanish. You should use the official PyTorch documentation to build your system 
from scratch. You may use other online sources as well but must cite your sources and indicate 
clearly what portions of your code have been copied and modified from elsewhere. You may 
work individually or with a partner on this assignment.


Each team should submit one assignment as a single jupyter notebook on Sakai. At the top of 
your notebook, please indicate both team members’ names and who did what. To speed up 
training, you may want to run your jupyter notebook in Google Colab with a GPU. Note: The 
datasets provided below are very large, and you don’t need to train on everything!

In fact, as you develop your code, I would recommend using a tiny subset of data to iterate quickly, and 
wait until your code is debugged to start training on larger subsets of data. It is much better to 
have a functioning model that is trained on 1% of the data than a non-functional model that 
failed to train on 100% of the data.

An additional 10 points will be graded for the organization and clarity of your notebook. Your 
notebook should read like a tutorial and be understandable to others.

## Part 1: Basic System with fixed-length inputs (65 points)

In the first part of the assignment you will do the following:


● Prepare the data (20 points). Get two large text files: one English file (WikiText-103, 
181MB) and one Spanish file (e.g. Spanish text corpus, 155MB). Convert to lowercase 
and remove all punctuation except “.” so the data only contains alphabet characters, 
whitespace, and periods. Determine a set of unique characters and map all characters 
to integers. Split the data into train & validation sets, and split each into chunks of fixed 
length.


● Train 1-layer model (20 Points). Define an LSTM model containing 1 LSTM layer 
followed by an output linear layer. Your model should classify a fixed-length sequence 
of characters as English or Spanish. Show your training & validation loss curves, along 
with your validation classification accuracy.


● Experimentation (20 points). Experiment with different aspects of the model: the 
number of LSTM layers, the number of fully connected layers, the size of the hidden 
layer, etc. Train the corresponding models, compare their performance, and provide 
plots to demonstrate the effect of at least two different hyperparameters of interest.


● Intuition (5 points). Show the output of your model for several specific sentences. Pick 
inputs that demonstrate the behavior of the system, and try to figure out what things 
the model is focusing on. Explain your intuition about what the model is doing.

## Welcome to our Tutorial for Preparing the Data! 

### In the cells below, we convert the Spanish sentences to lowercase and remove all punctuation except "." so that the data contains only alphabetic characters, whitespace, and periods. We also save the result locally so we don't have to repeat this every time we load the file.

To illustrate what we want to do with this data, here are the first few sentences from the Spanish sentences.txt file. 

*la enciclopedia libre Jorge Hess De Wikipedia#

*la enciclopedia libre Saltar a Jorge Hess de julio es un y cofundador de la Liga Argentina de Esperanto Hess escribió un manual para el aprendizaje de esperanto que fue editado por primera vez en y se titula Sabe Usted Esperanto#

*Es uno de los más conocidos libros en español que tratan sobre el tema junto con Curso Práctico de Esperanto Ferenc Szilágyi#

*el cual Hess adaptó para los en#


## As you can see, each sentence begins with an asterisk (*) and ends with a hash sign and a newline character (#\n). After processing, these sentences in the txt file should look like: 

la enciclopedia libre jorge hess de wikipedia
la enciclopedia libre saltar a jorge hess de julio es un y cofundador de la liga argentina de esperanto hess escribió un manual para el aprendizaje de esperanto que fue editado por primera vez en y se titula sabe usted esperanto
es uno de los más conocidos libros en español que tratan sobre el tema junto con curso práctico de esperanto ferenc szilágyi
el cual hess adaptó para los en


(Note that each sentence sits on its own line in the file; the Markdown rendering in Jupyter Notebook merges lines that are separated by only a single \n character.)

In [1]:
#Import Statements
from tqdm import tqdm
import string

In [2]:
## There is an important aspect of Spanish sentences we must consider: they contain accented characters,
## namely é, á, í, ó, ú, ñ, and ü, so our string operations need to handle them. Python 3 strings are Unicode,
## so methods such as lower() and upper() work on these characters directly. We will still
## check this throughout the process to make sure everything works as intended. 
## An example of Python string methods working with Spanish characters is provided below. 

#Example of Python working with Spanish Characters - Python can work with Spanish!
espanol = "ÁÉÍÓÚÑÜ áéíóúñü."
print(espanol.lower())  
print(espanol.upper())

áéíóúñü áéíóúñü.
ÁÉÍÓÚÑÜ ÁÉÍÓÚÑÜ.
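
One caveat worth checking, sketched here under the assumption that a corpus might contain decomposed Unicode: an accented letter such as é can be stored either as one code point or as a plain letter followed by a combining accent. The two look identical when printed, but a character-by-character filter treats them differently. The standard-library function `unicodedata.normalize` with the "NFC" form composes such pairs into single code points. The helper name `to_nfc` is ours, for illustration only.

```python
import unicodedata

def to_nfc(text):
    """Compose letter + combining-accent pairs into single code points (NFC form)."""
    return unicodedata.normalize("NFC", text)

composed = "\u00e9"      # "é" stored as one code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)                     # False: different code points
print(to_nfc(decomposed) == composed)             # True after normalization
print(len(decomposed), len(to_nfc(decomposed)))   # 2 1
```

If the raw text turned out to contain decomposed characters, normalizing each line before filtering would keep the accents from being silently dropped.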


In [3]:
#Here, we define the process helper function, which processes a single sentence as the assignment requires:
##we convert the text to lowercase and remove all punctuation except "."
def process(sentence):
    """
    Processes the given sentence string as the assignment requires:
    - Convert all characters to lowercase.
    - Keep only letters, spaces, and ".". Newlines, digits, and all other punctuation are removed.
    """
    sentence = sentence.lower() #convert all characters to lowercase, O(n) time 
    
    #We keep a collection of allowed characters and drop any character that is not in it.
    #The instructions say to keep only alphabetic characters, periods, and whitespace, so those 
    #are the allowed characters.
    
    #We use a dictionary for O(1) lookup time
    allowed_characters = {
        " ": 0,
        "a": 1,
        "b": 2,
        "c": 3, 
        "d": 4,
        "e": 5, 
        "f": 6, 
        "g": 7, 
        "h": 8, 
        "i": 9, 
        "j": 10,
        "k": 11, 
        "l": 12,
        "m": 13, 
        "n": 14, 
        "o": 15,
        "p": 16,
        "q": 17, 
        "r": 18, 
        "s": 19, 
        "t": 20, 
        "u": 21, 
        "v": 22, 
        "w": 23, 
        "x": 24, 
        "y": 25, 
        "z": 26, 
        ".": 27, 
        "á": 28,
        "é": 29, 
        "í": 30,
        "ó": 31, 
        "ú": 32,
        "ñ": 33, 
        "ü": 34
        }
    
    processed_sentence = ''.join(char for char in sentence if char in allowed_characters)
    return processed_sentence

test_string = "Hi #%@ my 123415#$% name ..,3453,.,.343 !is B–RAND---ÓN B31onifacío. .@#$@$--=!~234324` . "
print(process(test_string))

hi  my  name .... is brandón bonifacío. . . 
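
The same filter can be written more compactly. The sketch below is an alternative, not the version used in the rest of this notebook: it builds the allowed characters from `string.ascii_lowercase` (the `string` module is imported above) and stores them in a set, which also gives O(1) lookups. The names `ALLOWED` and `process_compact` are hypothetical.

```python
import string

# Allowed characters: space, a-z, period, and the Spanish accented letters
ALLOWED = set(" " + string.ascii_lowercase + "." + "áéíóúñü")

def process_compact(sentence):
    """Lowercase the sentence and keep only the characters in ALLOWED."""
    return "".join(char for char in sentence.lower() if char in ALLOWED)

print(process_compact("¿Dónde está la Biblioteca? No sé."))
# dónde está la biblioteca no sé.
```

A set is enough here because the filter only tests membership; the integer values are needed later, when characters are mapped to indices.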


In [4]:
## sentences.txt is our file of Spanish sentences. 
## On Brandon's computer, the raw file is stored in: 
## /Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Raw\Spanish/sentences.txt

## After processing this data, we store it in 
## /Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\Spanish/sentences.txt

## AS AN IMPORTANT NOTE, THIS CELL SHOULD ONLY BE RUN ONCE. 
## IF THE CODE BELOW IS COMMENTED OUT, UNCOMMENT IT TO RUN IT:
# Windows: Ctrl + / 
# Mac: Cmd + /


# Path to the raw data (raw string, so the backslashes are not treated as escape sequences)
input_path = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Raw\Spanish/sentences.txt"
# Path to the processed data
output_path = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\Spanish/sentences.txt"

#Open the input file and take out the sentences
#https://stackoverflow.com/questions/2081836/how-to-read-specific-lines-from-a-file-by-line-number
#We pass encoding="utf-8" explicitly (assuming the corpus is UTF-8 encoded), because the default
#encoding of open() depends on the platform
with open(input_path, "r", encoding="utf-8") as file:
    raw_sentences = file.readlines()

#Process each sentence
processed_sentences = []
for sentence in tqdm(raw_sentences, desc="Processing sentences"):
    if sentence.startswith('*') and sentence.endswith("#\n"): #Every sentence starts with a * and ends with a # and a newline
        processed_sentences.append(process(sentence[1:-2]))
    else:
        #A bare raise has no active exception to re-raise here, so we raise an explicit error instead
        raise ValueError(f"Something went wrong! Here's the current sentence: {sentence}")

#Now write the processed sentences to the output file
with open(output_path, "w", encoding="utf-8") as file:
    for sentence in tqdm(processed_sentences, desc="Writing to file"):
        file.write(sentence + "\n") #process() removes newlines, so we add one back at the end of each sentence

print("Data processing complete!")

Processing sentences: 100%|██████████████████████████████████████████████| 6075660/6075660 [00:50<00:00, 120148.15it/s]
Writing to file: 100%|███████████████████████████████████████████████████| 6075660/6075660 [00:09<00:00, 619557.00it/s]

Data processing complete!
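
Reading all six million lines with readlines() holds the whole file in memory. If memory is tight, the file can be processed one line at a time instead. The sketch below is one way to do that, assuming the same line format as above (each line starts with * and ends with #). The names `keep`, `clean_spanish_line`, and `stream_process` are illustrative, and `keep` is only a stand-in that mimics the process function defined earlier.

```python
import os
import tempfile

ALLOWED = set(" abcdefghijklmnopqrstuvwxyz.áéíóúñü")

def keep(sentence):
    """Stand-in for process(): lowercase and drop characters outside ALLOWED."""
    return "".join(c for c in sentence.lower() if c in ALLOWED)

def clean_spanish_line(line):
    """Strip the leading * and trailing # from one raw line, then filter it."""
    line = line.rstrip("\n")
    if not (line.startswith("*") and line.endswith("#")):
        raise ValueError(f"Unexpected line format: {line!r}")
    return keep(line[1:-1])

def stream_process(input_path, output_path):
    """Process the file line by line so only one sentence is in memory at a time."""
    count = 0
    with open(input_path, "r", encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(clean_spanish_line(line) + "\n")
            count += 1
    return count

# Small demonstration on a temporary file instead of the real corpus
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.txt")
    out = os.path.join(tmp, "out.txt")
    with open(raw, "w", encoding="utf-8") as f:
        f.write("*Es uno de los más conocidos libros en español#\n")
        f.write("*el cual Hess adaptó para los en#\n")
    n = stream_process(raw, out)
    with open(out, encoding="utf-8") as f:
        result = f.read()

print(n)
print(result)
```

The trade-off is that the progress bar no longer knows the total number of lines in advance unless the file is counted first.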





## Now that we have processed the Spanish sentences, we move on to the English sentences. These are stored in .tokens files, which are plain text and can be opened in an editor such as VSCode. As an example, here are the first few lines of one of the files: 


 = Robert Boulter = 
 
 Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as " Craig " in the episode " Teddy 's Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the <unk> Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . 
 In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the television series , Doctors , followed by a role in the 2007 theatre production of How to Curse directed by Josie Rourke . How to Curse was performed at Bush Theatre in the London Borough of Hammersmith and Fulham . Boulter starred in two films in 2008 , Daylight Robbery by filmmaker Paris <unk> , and Donkey Punch directed by Olly Blackburn . In May 2008 , Boulter made a guest appearance on a two @-@ part episode arc of the television series Waking the Dead , followed by an appearance on the television series Survivors in November 2008 . He had a recurring role in ten episodes of the television series Casualty in 2010 , as " Kieron Fletcher " . Boulter starred in the 2011 film Mercenaries directed by Paris <unk> . 
 
 = = Career = = 
 
 
 = = = 2000 – 2005 = = = 
 
 In 2000 Boulter had a guest @-@ starring role on the television series The Bill ; he portrayed " Scott Parry " in the episode , " In Safe Hands " . Boulter starred as " Scott " in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . A review of Boulter 's performance in The Independent on Sunday described him as " horribly menacing " in the role , and he received critical reviews in The Herald , and Evening Standard . He appeared in the television series Judge John Deed in 2002 as " <unk> Armitage " in the episode " Political <unk> " , and had a role as a different character " Toby Steele " on The Bill . 
 He had a recurring role in 2003 on two episodes of The Bill , as character " Connor Price " . In 2004 Boulter landed a role as " Craig " in the episode " Teddy 's Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . Boulter starred as " Darren " , in the 2005 theatre productions of the Philip Ridley play Mercury Fur . It was performed at the Drum Theatre in Plymouth , and the <unk> Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . Boulter received a favorable review in The Daily Telegraph : " The acting is shatteringly intense , with wired performances from Ben Whishaw ( now unrecognisable from his performance as Trevor Nunn 's Hamlet ) , Robert Boulter , Shane Zaza and Fraser Ayres . " The Guardian noted , " Ben Whishaw and Robert Boulter offer tenderness amid the savagery . " 
 
 = = = 2006 – present = = =

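## As you can see, the formatting is a bit more complex than before. To match the format of the Spanish sentences, we want the post-processed text to have one sentence per line, roughly like the illustration below. (Note that process() also drops digits and the cell below removes the `<unk>` token entirely, so years such as 2000 and the word unk do not appear in the final file.) 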

 robert boulter is an english film  television and theatre actor .
he had a guest  starring role on the television series the bill in 2000 .
this was followed by a starring role in the play herons written by simon stephens  which was performed in 2001 at the royal court theatre .
he had a guest role in the television series judge john deed in 2002 .
in 2004 boulter landed a role as  craig  in the episode  teddy s story  of the television series the long firm  he starred alongside actors mark strong and derek jacobi .
he was cast in the 2005 theatre productions of the philip ridley play mercury fur  which was performed at the drum theatre in plymouth and the unk chocolate factory in london .
he was directed by john tiffany and starred alongside ben whishaw  shane zaza  harry kent  fraser ayres  sophie stanton and dominic hall .
 in 2006  boulter starred alongside whishaw in the play citizenship written by mark ravenhill .
he appeared on a 2006 episode of the television series  doctors  followed by a role in the 2007 theatre production of how to curse directed by josie rourke .
how to curse was performed at bush theatre in the london borough of hammersmith and fulham .
boulter starred in two films in 2008  daylight robbery by filmmaker paris unk  and donkey punch directed by olly blackburn .
in may 2008  boulter made a guest appearance on a two  part episode arc of the television series waking the dead  followed by an appearance on the television series survivors in november 2008 .
he had a recurring role in ten episodes of the television series casualty in 2010  as  kieron fletcher  .
boulter starred in the 2011 film mercenaries directed by paris unk .
 in 2000 boulter had a guest  starring role on the television series the bill  he portrayed  scott parry  in the episode   in safe hands  .
boulter starred as  scott  in the play herons written by simon stephens  which was performed in 2001 at the royal court theatre .
a review of boulter s performance in the independent on sunday described him as  horribly menacing  in the role  and he received critical reviews in the herald  and evening standard .
he appeared in the television series judge john deed in 2002 as  unk armitage  in the episode  political unk   and had a role as a different character  toby steele  on the bill .
 he had a recurring role in 2003 on two episodes of the bill  as character  connor price  .
in 2004 boulter landed a role as  craig  in the episode  teddy s story  of the television series the long firm  he starred alongside actors mark strong and derek jacobi .
boulter starred as  darren   in the 2005 theatre productions of the philip ridley play mercury fur .
it was performed at the drum theatre in plymouth  and the unk chocolate factory in london .
he was directed by john tiffany and starred alongside ben whishaw  shane zaza  harry kent  fraser ayres  sophie stanton and dominic hall .
boulter received a favorable review in the daily telegraph   the acting is shatteringly intense  with wired performances from ben whishaw  now unrecognisable from his performance as trevor nunn s hamlet   robert boulter  shane zaza and fraser ayres .
 the guardian noted   ben whishaw and robert boulter offer tenderness amid the savagery .

## To format this, we follow the same approach as with the Spanish sentences, with extra steps to fix inconsistent spacing and to skip lines that begin with "=", because these are section headings rather than sentences. 


In [12]:
##Note: This should only be run once, so uncomment this code when we need to make a new sentences.txt file

# Path to the raw data (raw string, so the backslashes are not treated as escape sequences)
input_path = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Raw\English/"
# And the files we need to process:
files = ["wiki.test.tokens", "wiki.train.tokens", "wiki.valid.tokens"]
# Path to the processed data
output_path = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\English/sentences.txt"

#Open the input file and take out the sentences
#https://stackoverflow.com/questions/2081836/how-to-read-specific-lines-from-a-file-by-line-number
raw_sentences = []
#Add the lines through each file to a single list
for file in files:
    #As a note, we specify utf-8 explicitly because the default encoding used by open() depends on the platform:
    #https://stackoverflow.com/questions/36303919/what-encoding-does-open-use-by-default
    with open(input_path + file, "r", encoding='utf-8') as reading_file:
        sentences = reading_file.readlines() #initially split it by lines, this will allow us to skip "=" lines
        for sentence in tqdm(sentences, desc="Going through sentences for " + file): #go through each sentence
            if len(sentence) > 3: #only keep sentences that aren't newlines and are actually sentences
                if sentence[0] != "=" and sentence[0:2] != " =": #We don't want to keep the "=" lines
                    real_sentences = sentence.split(". ") #Once we have the sentences now, split by periods
                    for real_sentence in real_sentences: #Go through each sentence we have now
                        real_sentence = real_sentence.replace("<unk>", "") #remove this annoying string that's EVERYWHERE in the data
                        real_sentence = real_sentence.replace("  ", " ") #replace double spaces
                        real_sentence = real_sentence.replace(" .", ".") #get rid of spaces before periods
                        raw_sentences.append(real_sentence + ".") #add the period back


#Process each sentence
processed_sentences = []
for sentence in tqdm(raw_sentences, desc="Processing sentences"):
    if len(sentence) > 1: #We only want non-empty sentences
        if sentence[0:2] != " =": #We don't want the sentences that start with " =" as shown above
            processed_sentence = process(sentence)
            if len(processed_sentence) > 5: #Skip fragments of 5 characters or fewer, which are too short to be real sentences
                #We check the length again after processing because processing removes
                #characters
                processed_sentences.append(processed_sentence)

#Now write the processed sentences to the output file
with open(output_path, "w", encoding = 'utf-8') as file:
    for sentence in tqdm(processed_sentences, desc="Writing to file"):
        file.write(sentence + "\n") #Add new line whitespace at the end of each sentence, in same structure as espanol

print("Data processing complete!")

Going through sentences for wiki.test.tokens: 100%|████████████████████████████| 4358/4358 [00:00<00:00, 299509.69it/s]
Going through sentences for wiki.train.tokens: 100%|█████████████████████| 1801350/1801350 [00:06<00:00, 295421.23it/s]
Going through sentences for wiki.valid.tokens: 100%|███████████████████████████| 3760/3760 [00:00<00:00, 257386.46it/s]
Processing sentences: 100%|███████████████████████████████████████████████| 4738808/4738808 [00:49<00:00, 95677.41it/s]
Writing to file: 100%|██████████████████████████████████████████████████| 3950213/3950213 [00:03<00:00, 1037825.89it/s]

Data processing complete!
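
To make the splitting logic easier to follow, here is a small stand-alone sketch of what happens to a single line of a .tokens file. It is a simplified illustration rather than the exact cell above: the function name `split_wikitext_line` is ours, whitespace is collapsed in one step, and the final character filtering done by process is left out.

```python
def split_wikitext_line(line):
    """Split one raw WikiText line into sentences, skipping blanks and headings."""
    stripped = line.strip()
    if not stripped or stripped.startswith("="):
        return []  # blank line, or a heading such as " = Robert Boulter = "
    sentences = []
    for piece in stripped.split(" . "):
        piece = piece.replace("<unk>", "")   # drop the unknown-word token
        piece = " ".join(piece.split())      # collapse repeated whitespace
        piece = piece.rstrip(" .")           # remove a trailing " ." if present
        if piece:
            sentences.append(piece + " .")   # add the period back
    return sentences

sample = " He had a guest role in 2002 . Boulter starred in two films , directed by Paris <unk> . \n"

print(split_wikitext_line(" = = Career = = \n"))   # []
for sentence in split_wikitext_line(sample):
    print(sentence)
```

Running it on the sample prints an empty list for the heading line, followed by the two sentences, each ending in " .".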





## Now that we have processed the data as required by the problem, we determine the set of unique characters and map each character to an integer. Because each sentence consists only of alphabetic characters, spaces, and periods, these are the only characters in the map. We leave out the newline character because, by design, it is only used to separate the sentences in the txt files. 

In [14]:
unique_character_map = {
    " ": 0,
    "a": 1,
    "b": 2,
    "c": 3, 
    "d": 4,
    "e": 5, 
    "f": 6, 
    "g": 7, 
    "h": 8, 
    "i": 9, 
    "j": 10,
    "k": 11, 
    "l": 12,
    "m": 13, 
    "n": 14, 
    "o": 15,
    "p": 16,
    "q": 17, 
    "r": 18, 
    "s": 19, 
    "t": 20, 
    "u": 21, 
    "v": 22, 
    "w": 23, 
    "x": 24, 
    "y": 25, 
    "z": 26, 
    ".": 27, 
    "á": 28,
    "é": 29, 
    "í": 30,
    "ó": 31, 
    "ú": 32,
    "ñ": 33, 
    "ü": 34
}

# Raw strings, so the backslashes in the Windows paths are not treated as escape sequences
english_file = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\English/sentences.txt"

spanish_file = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\Spanish/sentences.txt"
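
With the map in place, turning a processed sentence into integers is one dictionary lookup per character. The sketch below rebuilds the same 35-entry map programmatically so that it is self-contained; `char_map`, `index_map`, `encode`, and `decode` are illustrative names, not part of the cells above.

```python
# Same ordering as unique_character_map: space, a-z, period, then accented letters
characters = " abcdefghijklmnopqrstuvwxyz.áéíóúñü"
char_map = {char: index for index, char in enumerate(characters)}
index_map = {index: char for char, index in char_map.items()}

def encode(sentence):
    """Map each character of a processed sentence to its integer index."""
    return [char_map[char] for char in sentence]

def decode(indices):
    """Inverse of encode: map integer indices back to a string."""
    return "".join(index_map[i] for i in indices)

encoded = encode("el niño.")
print(encoded)          # [5, 12, 0, 14, 9, 33, 15, 27]
print(decode(encoded))  # el niño.
```

Decoding is handy for debugging, since it lets us print a training sample back as readable text.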






## We now split the data into train & validation sets, and split each into chunks of fixed length.
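
A minimal sketch of this step, under a few assumptions of our own: sentences are shuffled with a fixed seed, 90% go to training, each split's sentences are joined with spaces into one long string, and that string is cut into non-overlapping chunks of `chunk_len` characters with any leftover tail discarded. The function names, the 90/10 ratio, and the chunk length of 8 in the demo are illustrative choices, not values from the assignment.

```python
import random

def train_val_split(sentences, val_fraction=0.1, seed=0):
    """Shuffle a copy of the sentences and split off a validation portion."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def make_chunks(sentences, chunk_len):
    """Join sentences into one string and cut it into fixed-length chunks."""
    text = " ".join(sentences)
    n_chunks = len(text) // chunk_len  # the incomplete tail is dropped
    return [text[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Toy data standing in for the processed English sentences
english = ["this is sentence " + chr(ord("a") + i) + "." for i in range(20)]
train, val = train_val_split(english)
train_chunks = make_chunks(train, chunk_len=8)

print(len(train), len(val))                              # 18 2
print(all(len(chunk) == 8 for chunk in train_chunks))    # True
```

Splitting into train and validation before chunking keeps every validation chunk free of text that also appears in a training chunk. The same two functions would be applied to the Spanish sentences, with a different label attached to each language's chunks.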

## Part 2: Realistic system with variable-length inputs (25 points)

In the second part of the assignment you will do the following:


● Prepare the data (10 points). Your data should be the same as in part 1, except that 
each sample should contain one complete sentence rather than a fixed-length sequence of characters. This means that your training & validation samples should have variable 
length. You will need to zero-pad your inputs.


● Train model (15 points). Use the best model architecture that you found from part 1, 
and train a model on your data. Be careful to handle the zero-padding correctly, since 
you can no longer use the same index for all batch samples. Show the training & 
validation loss curves and validation accuracy. Compare your results to the 
corresponding model in part 1.
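
One detail to watch when zero-padding with the character map defined in Part 1: index 0 is already assigned to the space character, so a padded position would be indistinguishable from a real space. The sketch below shows one possible workaround, which is our assumption rather than a requirement of the assignment: shift every character index up by one so that 0 is reserved for padding, and keep each sentence's true length so the model output at the last real character can be selected per sample. The names `PAD`, `shifted_map`, `encode_shifted`, and `pad_batch` are hypothetical.

```python
characters = " abcdefghijklmnopqrstuvwxyz.áéíóúñü"
PAD = 0  # reserved for padding
# Shift indices by one so that no real character uses the padding index
shifted_map = {char: index + 1 for index, char in enumerate(characters)}

def encode_shifted(sentence):
    """Encode a processed sentence with character indices starting at 1."""
    return [shifted_map[char] for char in sentence]

def pad_batch(sentences):
    """Encode and right-pad a batch of sentences; also return the true lengths."""
    encoded = [encode_shifted(s) for s in sentences]
    lengths = [len(seq) for seq in encoded]
    max_len = max(lengths)
    padded = [seq + [PAD] * (max_len - len(seq)) for seq in encoded]
    return padded, lengths

padded, lengths = pad_batch(["hola.", "hi."])
print(padded)   # [[9, 16, 13, 2, 28], [9, 10, 28, 0, 0]]
print(lengths)  # [5, 3]
```

With the lengths available, the last valid time step for sample i is lengths[i] - 1, which is the index to read the LSTM output from instead of a single shared index. In PyTorch, `torch.nn.utils.rnn.pad_sequence` and `torch.nn.utils.rnn.pack_padded_sequence` provide tensor versions of the padding and length handling.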




Running List of Resources Used: 


https://stackoverflow.com/questions/20935151/how-to-encode-and-decode-from-spanish-in-python

https://datagy.io/python-remove-punctuation-from-string/

https://stackoverflow.com/questions/2081836/how-to-read-specific-lines-from-a-file-by-line-number

https://docs.python.org/3/library/string.html

https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/#download

https://stackoverflow.com/questions/36303919/what-encoding-does-open-use-by-default