# Team Viviane Solomon and Brandon Bonifacio
# How We Split Up The Work: ...

# HW7: Train a Sequence Classifier That Can Predict if a Sentence is in English or Spanish.

The goal of this assignment is to train a sequence classifier that can predict if a sentence is in 
English or Spanish. You should use the official PyTorch documentation to build your system 
from scratch. You may use other online sources as well but must cite your sources and indicate 
clearly what portions of your code have been copied and modified from elsewhere. You may 
work individually or with a partner on this assignment.


Each team should submit one assignment as a single jupyter notebook on Sakai. At the top of 
your notebook, please indicate both team members’ names and who did what. To speed up 
training, you may want to run your jupyter notebook in Google Colab with a GPU. Note: The 
datasets provided below are very large, and you don’t need to train on everything!

In fact, as you develop your code, I would recommend using a tiny subset of data to iterate quickly, and 
wait until your code is debugged to start training on larger subsets of data. It is much better to 
have a functioning model that is trained on 1% of the data than a non-functional model that 
failed to train on 100% of the data.

An additional 10 points will be graded for the organization and clarity of your notebook. Your 
notebook should read like a tutorial and be understandable to others.

## Part 1: Basic System with fixed-length inputs (65 points)

In the first part of the assignment you will do the following:


● Prepare the data (20 points). Get two large text files: one English file (WikiText-103, 
181MB) and one Spanish file (e.g. Spanish text corpus, 155MB). Convert to lowercase 
and remove all punctuation except “.” so the data only contains alphabet characters, 
whitespace, and periods. Determine a set of unique characters and map all characters 
to integers. Split the data into train & validation sets, and split each into chunks of fixed 
length.


● Train 1-layer model (20 Points). Define an LSTM model containing 1 LSTM layer 
followed by an output linear layer. Your model should classify a fixed-length sequence 
of characters as English or Spanish. Show your training & validation loss curves, along 
with your validation classification accuracy.


● Experimentation (20 points). Experiment with different aspects of the model: the 
number of LSTM layers, the number of fully connected layers, the size of the hidden 
layer, etc. Train the corresponding models, compare their performance, and provide 
plots to demonstrate the effect of at least two different hyperparameters of interest.


● Intuition (5 points). Show the output of your model for several specific sentences. Pick 
inputs that demonstrate the behavior of the system, and try to figure out what things 
the model is focusing on. Explain your intuition about what the model is doing.
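The character mapping, train/validation split, and fixed-length chunking described in the data preparation step can be sketched as follows. This is a minimal sketch, not the final pipeline: the helper names (`build_vocab`, `encode`, `split_into_chunks`, `train_val_split`), the 10% validation fraction, and the choice to reserve index 0 for padding are all our own assumptions.

```python
def build_vocab(text):
    """Map each unique character to an integer (assumption: 0 is reserved for padding)."""
    chars = sorted(set(text))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Convert a string into a list of integer character ids."""
    return [vocab[ch] for ch in text]


def split_into_chunks(ids, chunk_len):
    """Split ids into non-overlapping chunks of chunk_len, dropping the leftover tail."""
    n_chunks = len(ids) // chunk_len
    return [ids[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def train_val_split(items, val_fraction=0.1):
    """Hold out the last val_fraction of items for validation (no shuffling)."""
    n_val = int(len(items) * val_fraction)
    split_at = len(items) - n_val
    return items[:split_at], items[split_at:]


# Tiny example on a made-up string instead of the real corpus
text = "hola mundo. esto es una prueba."
vocab = build_vocab(text)
chunks = split_into_chunks(encode(text, vocab), chunk_len=5)
train_chunks, val_chunks = train_val_split(chunks, val_fraction=0.2)
print(len(vocab), len(chunks), len(train_chunks), len(val_chunks))
```

Splitting without shuffling keeps the sketch deterministic; for real training it may be better to shuffle the chunks first so that the validation set is not drawn from a single region of the corpus.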

## Welcome to our Tutorial for Preparing the Data! 

### In the cells below, we convert the Spanish sentences to lowercase and remove all punctuation except "." so that the data contains only alphabet characters, whitespace, and periods. We also save the result locally so we do not have to repeat this step every time we load the file.

To show what we want to do with this data, here are the first few sentences from the Spanish sentences.txt file.

*la enciclopedia libre Jorge Hess De Wikipedia#

*la enciclopedia libre Saltar a Jorge Hess de julio es un y cofundador de la Liga Argentina de Esperanto Hess escribió un manual para el aprendizaje de esperanto que fue editado por primera vez en y se titula Sabe Usted Esperanto#

*Es uno de los más conocidos libros en español que tratan sobre el tema junto con Curso Práctico de Esperanto Ferenc Szilágyi#

*el cual Hess adaptó para los en#


## As you can see, each sentence begins with an asterisk (*) and ends with a hashtag and a newline character (#\n). After processing, these sentences in the txt file should look like:

la enciclopedia libre jorge hess de wikipedia

la enciclopedia libre saltar a jorge hess de julio es un y cofundador de la liga argentina de esperanto hess escribió un manual para el aprendizaje de esperanto que fue editado por primera vez en y se titula sabe usted esperanto

es uno de los más conocidos libros en español que tratan sobre el tema junto con curso práctico de esperanto ferenc szilágyi

el cual hess adaptó para los en


In [11]:
#Import Statements
from tqdm import tqdm
import string

In [12]:
## There is an important aspect of Spanish sentences we must consider: the accented characters é, á, í, ó, ú, ñ, and ü.
## Python 3 strings are Unicode, so built-in string methods such as lower() and upper() handle these characters correctly.
## Even so, we will keep checking this throughout the process to make sure it is working as intended.
## An example of Python string methods working with Spanish characters is provided below.

#Example of Python working with Spanish Characters - Python can work with Spanish!
espanol = "ÁÉÍÓÚÑÜ áéíóúñü."
print(espanol.lower())  
print(espanol.upper())

áéíóúñü áéíóúñü.
ÁÉÍÓÚÑÜ ÁÉÍÓÚÑÜ.
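One related check that may be worth adding: Unicode allows an accented letter to be stored either as a single code point or as a base letter followed by a combining accent, and the two forms would receive different integers in a character mapping. We have not verified whether sentences.txt contains the second form, so the sketch below is only a precaution. It uses `unicodedata.normalize` from the standard library.

```python
import unicodedata

composed = "\u00e9"      # "é" stored as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings render the same but are not equal as Python strings
print(composed == decomposed)

# NFC normalization merges the base letter and the accent into one code point
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
print(len(decomposed), len(normalized))
```

If the corpus did mix both forms, applying NFC normalization before building the character set would keep the vocabulary from containing two versions of the same letter.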


In [20]:
#Here, we define the process helper function, which cleans a single sentence as the assignment requires:
##convert the text to lowercase and remove all punctuation except "."
def process(sentence):
    """
    Processes the given sentence string as the assignment requires:
    - Convert all characters to lowercase.
    - Remove all ASCII punctuation except ".". Whitespace (newlines and spaces) is kept.
    """
    sentence = sentence.lower() #convert all characters to lowercase, O(n) time

    #https://datagy.io/python-remove-punctuation-from-string/
    #https://docs.python.org/3/library/string.html

    #Build a translation table that deletes every punctuation character EXCEPT "."
    #string.punctuation contains ASCII punctuation only, so digits and non-ASCII marks
    #such as "¿" and "¡" are NOT removed by this function
    punctuations_without_period = string.punctuation.replace(".", "")
    translator = str.maketrans('', '', punctuations_without_period)
    return sentence.translate(translator)

test_string = "Hi #%@ my #$% name ..,,.,. !is BRANDÓN Bonifacío. .@#$@$--=!~` . "
print(process(test_string))

hi  my  name .... is brandón bonifacío. . . 
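Because `string.punctuation` only lists ASCII punctuation, Spanish marks such as "¿" and "¡" and any digits pass through `process` unchanged. A stricter alternative is to keep only the characters we want instead of deleting the ones we do not want. The function name `process_strict` is hypothetical, and this is a sketch of one possible approach rather than the version used to produce the outputs in this notebook.

```python
def process_strict(sentence):
    """
    Hypothetical stricter variant of process:
    lowercase the text, then keep only letters, whitespace, and periods.
    """
    sentence = sentence.lower()
    # str.isalpha() is True for any Unicode letter, including á, é, í, ó, ú, ñ, ü
    return "".join(
        ch for ch in sentence
        if ch.isalpha() or ch.isspace() or ch == "."
    )


print(process_strict("¿Cómo estás? ¡Bien! 123."))
```

Note that `str.isalpha()` also accepts letters from other alphabets, so text in other scripts would still be kept. Filtering against an explicit set of allowed characters would be stricter still.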


In [21]:
## sentences.txt is our file of Spanish sentences.
## On Brandon's computer, this raw file is stored at:
## /Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Raw\Spanish/sentences.txt

## After processing this data, we store it at:
## /Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\Spanish/sentences.txt

## AS AN IMPORTANT NOTE, THIS CELL ONLY NEEDS TO BE RUN ONCE.
## TO RUN IT, UNCOMMENT THE CODE BELOW:
# Windows: Ctrl + /
# Mac: Cmd + /


# # Path to the raw data (raw strings keep the backslashes from being read as escape sequences)
# input_path = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Raw\Spanish/sentences.txt"
# # Path to the processed data
# output_path = r"/Users\Brandon\Desktop\Classes\E208\Homework\E208HW7\Data\Processed\Spanish/sentences.txt"

# #Open the input file and read in all of the sentences
# #https://stackoverflow.com/questions/2081836/how-to-read-specific-lines-from-a-file-by-line-number
# #encoding="utf-8" assumes the corpus is UTF-8; without it, Windows may fall back to a different default encoding
# with open(input_path, "r", encoding="utf-8") as file:
#     raw_sentences = file.readlines()

# #Process each sentence
# processed_sentences = []
# for sentence in tqdm(raw_sentences, desc="Processing sentences"):
#     if sentence.startswith('*') and sentence.endswith("#\n"): #Every sentence starts with * and ends with #\n
#         processed_sentences.append(process(sentence[1:-2])) #Strip the leading * and the trailing #\n
#     else:
#         #A bare raise has no active exception to re-raise here, so we raise an explicit error instead
#         raise ValueError(f"Something went wrong! Here's the current sentence: {sentence}")

# #Now write the processed sentences to the output file
# with open(output_path, "w", encoding="utf-8") as file:
#     for sentence in tqdm(processed_sentences, desc="Writing to file"):
#         file.write(sentence + "\n") #Add a newline at the end of each sentence

# print("Data processing complete!")

Processing sentences: 100%|██████████████████████████████████████████████| 6075660/6075660 [00:46<00:00, 129433.35it/s]
Writing to file: 100%|███████████████████████████████████████████████████| 6075660/6075660 [00:09<00:00, 649456.43it/s]

Data processing complete!
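The cell above loads all 6,075,660 sentences into memory before writing them out. If memory becomes a concern, the same steps can be done one line at a time. The sketch below is an alternative we did not use for the run shown above: the function name `process_file` is hypothetical, UTF-8 encoding is an assumption about the corpus, and `process` is redefined here only so that the block runs on its own.

```python
import string

# Same rule as process above: delete ASCII punctuation except "."
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace(".", ""))


def process(sentence):
    """Lowercase the sentence and remove ASCII punctuation except '.'."""
    return sentence.lower().translate(_PUNCT_TABLE)


def process_file(input_path, output_path):
    """Stream sentences from input_path to output_path; return how many were written."""
    count = 0
    with open(input_path, "r", encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line_number, line in enumerate(src, start=1):
            line = line.rstrip("\n")
            # Every raw sentence is expected to look like "*...#"
            if not (line.startswith("*") and line.endswith("#")):
                raise ValueError(f"Unexpected format on line {line_number}: {line!r}")
            dst.write(process(line[1:-1]) + "\n")
            count += 1
    return count
```

Calling `process_file(input_path, output_path)` with the two paths defined in the cell above would replace both the reading loop and the writing loop. Wrapping `src` in `tqdm` would restore the progress bar, although without a known line count it would only show a running total.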





## Part 2: Realistic system with variable-length inputs (25 points)

In the second part of the assignment you will do the following:


● Prepare the data (10 points). Your data should be the same as in part 1, except that 
each sample should contain one complete sentence rather than a fixed-length sequence of characters. This means that your training & validation samples should have variable 
length. You will need to zero-pad your inputs.


● Train model (15 points). Use the best model architecture that you found from part 1, 
and train a model on your data. Be careful to handle the zero-padding correctly, since 
you can no longer use the same index for all batch samples. Show the training & 
validation loss curves and validation accuracy. Compare your results to the 
corresponding model in part 1.
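The zero-padding step described above can be sketched in plain Python. This is a minimal sketch under our own assumptions: the helper names `pad_batch` and `last_real_index` are hypothetical, and index 0 is assumed to be reserved for padding so that it never collides with a real character id.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


def last_real_index(lengths):
    """Index of the last non-padding position in each sequence."""
    return [n - 1 for n in lengths]


# Three encoded sentences of different lengths (made-up ids)
batch = [[3, 1, 4], [1, 5], [9]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
print(last_real_index(lengths))
```

The per-sample indices matter because the LSTM output at the final time step of a padded sequence has already processed padding, so each sample should be read out at its own last real position. PyTorch also provides `torch.nn.utils.rnn.pad_sequence` and `torch.nn.utils.rnn.pack_padded_sequence`, which may be a more convenient way to handle this inside the model.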




Running List of Resources Used: 


https://stackoverflow.com/questions/20935151/how-to-encode-and-decode-from-spanish-in-python

https://datagy.io/python-remove-punctuation-from-string/

https://stackoverflow.com/questions/2081836/how-to-read-specific-lines-from-a-file-by-line-number

https://docs.python.org/3/library/string.html