# Lab Notebook for Final Project of Computational Models of Cognition
## Authors: Gabe Brookman, James Yang

In [3]:
import pandas as pd
import data_cleaning

### Notes Oct 10th:
We met to write our project proposal, which we then both edited independently over the next day or two. Throughout the week leading up to this meeting, we had been exchanging and reading papers and brainstorming topics for our final project, so the meeting was a natural culmination of our work so far. We did not begin implementation; the work to date consists of background research, settling on our idea, and writing the proposal.

### Notes Oct 21st:
We first updated the read_me and todo files in our GitHub repo (https://github.com/formidify/CMoC_project). We also created this Jupyter notebook to keep track of our progress. Then, we wrote a Python file to clean the Enron email dataset. It keeps only the subject and body (for now) of each email and does basic cleaning, such as removing symbols and common stop words ('a', 'and', 'an', etc.) from the email subject lines. We also dropped reply and forward emails, as well as emails whose subjects include significant words that do not appear in the body, so that extractive summarisation can be run on every remaining email.
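
The filtering rules above can be sketched roughly as follows. This is only an illustration, not the code in `data_cleaning`: the stop-word list, the reply/forward heuristic (a subject starting with "re:", "fw:" or "fwd:"), and all function names are assumptions made for the example.

```python
import re

# Hypothetical stop-word list; the real script may use a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "on"}


def significant_words(subject):
    """Lowercase the subject, replace symbols with spaces, drop stop words."""
    words = re.sub(r"[^a-z0-9\s]", " ", subject.lower()).split()
    return [w for w in words if w not in STOP_WORDS]


def is_reply_or_forward(subject):
    """Assumed heuristic: the subject starts with 're:', 'fw:' or 'fwd:'."""
    return re.match(r"\s*(re|fw|fwd)\s*:", subject, flags=re.IGNORECASE) is not None


def is_extractable(subject, body):
    """True if every significant subject word also appears in the body."""
    body_words = set(re.sub(r"[^a-z0-9\s]", " ", body.lower()).split())
    words = significant_words(subject)
    return bool(words) and all(w in body_words for w in words)
```

An email would be kept only if `is_reply_or_forward(subject)` is false and `is_extractable(subject, body)` is true.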

Later, on the network side of things, we settled on a framework (Keras) and a paradigm (functional) for our model. We began implementing a first draft for both of these.

### Notes Oct 23rd:
We continued cleaning the Enron directory of text files and compiling them into a readable CSV file. We tidied up a few operations, for example using `" ".join(string.split())` instead of `strip()`. This works well: it removes all empty lines and escape characters such as "\n" and "\t". In addition, we edited the code to handle cases where forward and subject information from another email is still visible in the body. We removed those lines so that only relevant information is kept. The number of usable emails for training and testing is displayed below, after running everything: at most 3309 emails are valid and can be used effectively.
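
A small illustration of why `" ".join(string.split())` behaves differently from `strip()` (the sample string is made up):

```python
raw = "\tMeeting moved\n\n to  3pm \n"

# strip() only trims whitespace at the two ends of the string.
stripped = raw.strip()

# split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines), so joining the pieces with a single space also cleans the interior.
collapsed = " ".join(raw.split())

print(repr(stripped))
print(repr(collapsed))
```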

In [4]:
data_list = data_cleaning.read_text() # if the Enron txt directory is in the same directory
csv = data_cleaning.create_csv(data_list)

Number of reply/forward or non-extractable emails: 11088
Number of emails after basic cleaning, such as getting rid of ones with empty subject/body: 14397
Number of actual emails used in the `final`-ish dataset: 3309


We say "at most" because the data cleaning is not over: some emails may need further processing. For example, some emails include the name, company, and title of the sender. We do not want that information, and if it is all the email body contains, we may have to exclude that email from our dataset. The same goes for numbers: the subject of an email full of uninterpretable numbers cannot be properly extracted by a program. We will continue to explore these cases and hopefully find ways to deal with them (numbers are less of a problem given tokenization tools such as `nltk`, but removing personal information will be quite difficult).

### Notes Oct 28th:
Today, we moved on to constructing our model. We want the model to iterate over all of the words in the message body and return the probability that each of them is in the title (the email's subject line). The network will take in one letter at a time. When it encounters whitespace, it will return the probability that the word just before that whitespace is in the title.
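
The per-letter, per-word control flow described above might look roughly like this. It is a sketch under assumptions: the 27-symbol alphabet (26 lowercase letters plus a space) is our guess at the input encoding, and `step` and `readout` are hypothetical stand-ins for the recurrent network and its output layers.

```python
import string

# Assumed alphabet: 26 lowercase letters plus the space character (27 symbols).
ALPHABET = string.ascii_lowercase + " "


def one_hot(ch):
    """Encode one character as a 27-dimensional one-hot vector."""
    vec = [0.0] * len(ALPHABET)
    vec[ALPHABET.index(ch)] = 1.0
    return vec


def score_words(body, step, readout, state=None):
    """Feed `body` one character at a time and emit one score per word.

    `step(vector, state) -> state` and `readout(state) -> float` are
    hypothetical placeholders for the network.
    """
    scores = []
    word = []
    for ch in body.lower() + " ":  # trailing space flushes the last word
        if ch not in ALPHABET:
            continue  # skip anything outside the assumed alphabet
        state = step(one_hot(ch), state)  # state changes on every character
        if ch == " ":
            if word:  # a score is produced only at word boundaries
                scores.append(("".join(word), readout(state)))
                word = []
        else:
            word.append(ch)
    return scores
```

With a real model, `readout` would return the sigmoid output for the word that just ended.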

We switched to PyTorch because it is much lower level than Keras, which makes it easier to program the network to update only on a per-word basis while still updating its hidden state on a per-letter basis.

Do we want all of our hidden layers to be LSTMs, or just the first one? It's unclear, and something we'll have to look into. https://arxiv.org/pdf/1503.04069.pdf may have answers: "Adding full gate recurrence (FGR) did not significantly change performance on TIMIT or IAM Online, but led to worse results on the JSB Chorales dataset. Given that this variant greatly increases the number of parameters, we generally advise against using it." Well, that answers that question.

We decided on one LSTM layer (27 inputs to 100 nodes), followed by two linear layers (100 to 50 and 50 to 1 nodes, respectively). We'll use ReLU for all of our activations except the last one, for which we'll use a sigmoid (since we want the output to be a probability).
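
As a rough sanity check on model size, the layer sizes above imply the following number of trainable parameters. The LSTM count assumes the `torch.nn.LSTM` layout (input weights, recurrent weights, and two bias vectors for each of the four gate blocks); the helper names are ours.

```python
def lstm_params(input_size, hidden_size):
    # Four blocks (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and two bias vectors.
    return 4 * (hidden_size * input_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output unit.
    return in_features * out_features + out_features


total = lstm_params(27, 100) + linear_params(100, 50) + linear_params(50, 1)
print(total)  # 51600 + 5050 + 51 = 56701
```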

We now must decide how big a state (cell state, hidden state) we want to pass between successive calls to our LSTM. The author of this answer has some ideas: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw. We'll just trust it for now, and stay nimble and experiment with this parameter.

Upon further inspection, it seems that our hidden state and output state are the same, and so must have the same size. For some reason, our cell state must also have the same size as our hidden state. We don't quite understand why that is. Anyways, it should be fine (?) I guess.
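
A likely explanation for the size constraint comes from the standard LSTM update equations (the symbols below follow the usual textbook notation, not our code):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The last line multiplies $o_t$ and $\tanh(c_t)$ elementwise, so the cell state $c_t$ and the hidden state $h_t$ must have the same number of entries. The LSTM's output at each step is $h_t$ itself, which is why the output size matches as well.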

We completed our forward and backward passes.

Next

### Notes Oct 30th:
Today, we edited the data cleaning script to remove all numbers, the '$' symbol, and apostrophes, after agreeing that these characters are unnecessary and could distract our network.
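
One way this extra cleaning step could be written is shown below. It is a sketch, not the code in our script; the function name is hypothetical, and only the straight apostrophe `'` is handled.

```python
import re


def strip_distractors(text):
    """Remove digits, '$', and apostrophes, then collapse leftover whitespace."""
    cleaned = re.sub(r"[0-9$']", "", text)
    return " ".join(cleaned.split())


print(strip_distractors("Don't pay $100 today"))  # Dont pay today
```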