# Climate Chatbot

### W207, Section 1, Final Project

### Eric Hulburd

## Motivation

The objective of this project is to build a chatbot that can respond to basic input prompts as well as answer questions about climate science. While configuring and implementing a deep learning model can be quite cumbersome, it is the only realistic way to accomplish this end.

Natural language processing is a field of computing and artificial intelligence that bridges the gap between natural human language and computer input. In the past 5-10 years, natural language processing has made significant strides, largely due to the growing availability of data and of the computing power required to process it. Deep learning has played an especially important role in this transformation. Whereas past approaches to natural language processing made explicit attempts to codify grammar and syntax, deep learning allows computers to identify linguistic patterns on their own, much as a child would pick up different idioms by listening to parents or older siblings speak.

This view is, of course, not without its detractors. Famed linguist Noam Chomsky has been notably skeptical of statistics-based methods for language learning. At MIT's 150th birthday Brains, Minds, and Machines Symposium, he remarked [1]:

> if you uh uh took a ton of video tapes of what's happening outside my office window, let's say, you know, leaves flying and various things, and you did an extensive analysis of them, uh you would get some kind of prediction of what's likely to happen next, certainly way better than anybody in the physics department could do.
Well that's a notion of success which is I think novel, I don't know of anything like it in the history of science. Uh and in- in those terms you get some kind of successes, and if you look at the literature in the field, a lot of these papers are listed as successes. And when you look at them carefully, they're successes in this particular sense, and not the sense that science has ever been interested in.
But it does give you ways of approximating unanalyzed data, you know analysis of ((a)) corpus and so on and so forth. I don't know of any other cases, frankly.
Uh so there are successes where things are integrated with some of the properties of language, but I know of-((the sec-)) know of none in which they're not.

Many leading technology companies have gained traction in both speech recognition and natural language processing through products such as Apple’s Siri, Amazon's Echo, and Google's and Microsoft's machine learning APIs. Regardless of whether, as Chomsky questions, these successes in natural language processing constitute genuine scientific insight, they do show the extent to which people are communicating with computers through human language.

The purpose of this project is both to explore the capability of deep learning, through Google’s open-sourced Tensorflow library, to engage in human conversation, and at the same time to sidestep some of the limitations of statistical models that Chomsky pointed out. Specifically, this project will create a chatbot that trains on a movie conversation database as well as a dataset of climate change FAQs. The point is not to teach the chatbot concepts in climate science, but rather to have it feign an understanding by simply identifying common questions about climate change and responding with pre-constructed answers to those questions.

## Neural Networks and Natural Language Processing

The fundamental challenge of machine translation and conversation is recognizing input significance, both at the sentence and word level, and generating an appropriate output. The model architecture and Tensorflow methods used in this project to address this challenge are discussed below.

### Recurrent Neural Network Architecture

Recurrent Neural Networks (RNNs) represent the current paradigm for training chatbot models. A critical difference between feed-forward networks and RNNs is that RNN units form a cyclic computation graph: the state computed at one step is fed back in at the next. Consider the equation:

![RNN Update Rule](images/rnn_update_rule.png)

For instance:

![RNN Update Rule](images/rnn_update_example.png)


In other words, the current state depends on the prior state. Unrolled over positions, this recurrence can be represented as the following acyclic graph:

![RNN Update Rule](images/rnn_update_diagram.png)


Note that t does not actually have to represent time, but can rather represent any sequential position - word position in a sentence in our case. 

There are two features of this architecture that make RNNs a powerful tool in natural language processing. First, weights can be shared across different positions t for a given parameter x. For instance, the word “bat” will return a different output state based on its position without the model having to create a completely separate parameter for the word at every position in the sentence. You will therefore get a different output state if the word is preceded by a modifier such as “baseball” or “flying”, even though “bat” need only be represented by a single parameter within the model.

Second, the same transition function f can be used at every time step t and accept an input sequence of arbitrary length because it describes transition from one state to the next, rather than a variable length history of states.
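
As a minimal sketch of these two properties (not the project's model), the recurrence can be written with a scalar state. The weights `w_state` and `w_input` and the numeric inputs are hypothetical values chosen only for illustration.

```python
import math


def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """One application of the shared transition function f (hypothetical weights)."""
    return math.tanh(w_state * state + w_input * x)


def run_rnn(inputs, initial_state=0.0):
    """Apply the same f, with the same weights, at every position t."""
    state = initial_state
    states = []
    for x in inputs:
        state = rnn_step(state, x)
        states.append(state)
    return states


# The same input value (1.0, standing in for "bat") produces a different
# state depending on the value that precedes it.
after_first_modifier = run_rnn([0.2, 1.0])[-1]
after_second_modifier = run_rnn([-0.7, 1.0])[-1]

# The same function handles sequences of any length.
short_run = run_rnn([0.1] * 3)
long_run = run_rnn([0.1] * 7)
```

Because `rnn_step` only describes the move from one state to the next, nothing in it depends on how long the sequence is.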

### LSTM and GRU Cells

While the above RNN structure is sufficient for creating a neural network with a basic recurrence mechanism, it is not sufficient for creating a model that can adequately capture the larger context of a word within a sentence, let alone a paragraph.

The reason for this shortcoming is a problem shared with other deep neural network architectures, namely vanishing or exploding gradients. Consider the following abstraction from the Recurrent Neural Network chapter of Goodfellow et al 2016 [2]:

![RNN Weight Exponential Decay](images/rnn_weight_exponential_decay.png)

where h is a simplified recurrence relationship without inputs or a non-linear activation function, W is a weight matrix that admits the eigendecomposition above, and Q is the orthogonal matrix of eigenvectors of W. It is clear from this simplified abstraction that a signal propagated across τ time steps is scaled by the eigenvalues of W raised to the power τ, so gradients with respect to states τ steps away from the current state are subject to exponential decay or growth (Goodfellow et al 2016).
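
To make the exponential behavior concrete, here is a small sketch that repeatedly applies a hypothetical 2x2 symmetric weight matrix to a state vector. The matrix is an assumption chosen so that its eigenvalues are 1.1 and 0.9, with eigenvectors along (1, 1) and (1, -1).

```python
# Hypothetical symmetric weight matrix with eigenvalues 1.1 and 0.9.
W = [[1.0, 0.1],
     [0.1, 1.0]]


def matvec(w, h):
    """Multiply matrix w by vector h."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in w]


def unroll(h0, steps):
    """Apply the linear recurrence h_t = W h_{t-1} for `steps` steps."""
    h = list(h0)
    for _ in range(steps):
        h = matvec(W, h)
    return h


def project(h):
    """Components of h along the eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)."""
    s = 2 ** 0.5
    return (h[0] + h[1]) / s, (h[0] - h[1]) / s


# After 50 steps, one component has grown by 1.1**50 and the other
# has shrunk by 0.9**50.
growing, decaying = project(unroll([1.0, 0.0], 50))
```

The component along the eigenvalue above 1 explodes while the one below 1 vanishes, which is the behavior the gated units discussed next are meant to mitigate.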

Many solutions have been proposed to alleviate this problem, but recently researchers have had success with neural networks that use gated recurrence units. Essentially, these gated neural networks create additional weight parameters for each unit, which are used to determine what information the unit accepts and passes on to the subsequent unit.

The better-known gated recurrence unit is the long short-term memory (LSTM) unit. It has the following architecture:

![LSTM Diagram](images/lstm_diagram.png)


A gated recurrent unit (GRU) is very similar, but it lacks an output gate, making the computation more efficient. Research has shown the GRU cell to achieve performance comparable to that of LSTM cells, making the tradeoff worthwhile for its efficiency benefits (Chung et al 2014) [3].
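
For illustration only, a GRU update for a scalar state might look like the sketch below. It follows the update form given in Chung et al 2014; the parameter names and values are hypothetical and the weights are not trained.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_params(**overrides):
    """Hypothetical scalar weights and gate biases for a one-unit GRU."""
    params = dict(wz=0.0, uz=0.0, bz=0.0,   # update gate
                  wr=0.0, ur=0.0, br=0.0,   # reset gate
                  wh=1.0, uh=1.0)           # candidate state
    params.update(overrides)
    return params


def gru_step(h_prev, x, p):
    """One GRU update: the gates decide how much old state to keep."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * candidate


# With the update gate nearly closed, the previous state passes through
# almost unchanged; with it nearly open, the candidate replaces it.
keep_old = gru_step(0.8, 1.0, make_params(bz=-20.0))
take_new = gru_step(0.8, 1.0, make_params(bz=20.0))
```

When the update gate stays near zero, the state is copied forward across many steps, which is how gating lets information survive longer than in the plain recurrence.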

### Encoder-Decoder Sequence to Sequence Models

The recurrent architectures discussed above can directly map a fixed-length vector to a variable-length sequence, and vice versa. However, in translation and conversation, no grammatical rule limits the length of either an input or an output phrase or sentence, and the two lengths need not match. An additional abstraction is necessary for the model to process input of variable length and output an optimal response of undefined length. This is where the encoder-decoder structure comes into play.

The encoder-decoder structure forges a connection between input of arbitrary length and output of arbitrary length. It does so by creating an intermediary context variable of fixed length, which essentially contains a summary of the input. Once this context has been encoded from the input, the decoder can calculate an optimal response. This allows the model to consider the full context of an input before generating any output, which is useful in translation as well as in chatbots. Concretely, the encoder may calculate a similar context variable for the phrases “How are you?”, “How’s it going?”, or “Cómo estás?”. Their meanings are similar and may all evoke similar responses. The decoder may then generate an appropriate response or translation: “Doing great, thanks!”, “I’ve been better”, or “How are you?”.

The following is an adaptation of the encoder-decoder architecture from Goodfellow et al 2016, 396 [2].

![Encode Decoder Diagram](images/encoder_decoder_diagram.png)
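
The data flow in the diagram can be sketched with a toy, untrained example. Everything here is an assumption for illustration: the "weights" are arbitrary sine and cosine values, the token ids are made up, and the output is meaningless. The point is only that inputs of any length collapse into a fixed-length context, from which the decoder emits tokens until it produces an end id.

```python
import math


def encode(token_ids, size=4):
    """Fold a variable-length input into a fixed-length context vector."""
    context = [0.0] * size
    for token in token_ids:
        context = [math.tanh(0.5 * c + math.sin(token * (i + 1)))
                   for i, c in enumerate(context)]
    return context


def decode(context, end_id=0, vocab_size=6, max_len=10):
    """Emit one token at a time from the context until end_id or max_len."""
    state = list(context)
    output = []
    for _ in range(max_len):
        # Score every vocabulary token against the current state.
        scores = [sum(s * math.cos(token * (i + 1)) for i, s in enumerate(state))
                  for token in range(vocab_size)]
        token = max(range(vocab_size), key=scores.__getitem__)
        if token == end_id:
            break
        output.append(token)
        # Feed the chosen token back in to update the decoder state.
        state = [math.tanh(0.5 * s + math.sin(token * (i + 1)))
                 for i, s in enumerate(state)]
    return output


short_context = encode([1, 2])
long_context = encode([3, 1, 4, 1, 5, 9, 2])
reply = decode(long_context)
```

Both contexts have the same length even though the inputs do not, and the length of `reply` is decided by the decoder rather than by the input.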



## Implementation

As mentioned above, much progress has been made in natural language processing performance in the past 5-10 years, mainly due to increased computing power and the progress of deep learning algorithms.
This has specifically been the case for both machine translation and chatbots. These areas within NLP represent very similar problems, with two main tasks. The first is to extract meaning from a verbal prompt. The second is to generate an output from that prompt - either a translation, or a response in the case of a chatbot.
The major challenge in these tasks is to develop a model that (1) develops an adequate sense of context and (2) effectively relates inputs to outputs. Sequence to sequence models provide an appropriate architecture to meet these challenges. These models are charged with the following tasks:
* accept a tokenized input (ie word to vector),
* develop a sense of meaning of the full input,
* develop a sense of significance of each token within the broader context of the full input,
* generate a response that corresponds to the significance of the input, as well as the formulation of its individual tokens.

In order to accomplish this, the model needs a large set of inputs and outputs. In our case, we use a database of conversations from movie scripts. We then add a set of questions and answers pertaining to climate change. We remove complexity by meta-tokenizing the long, scientifically dense answers from this climate set so the model only has to recognize a class of response to climate change questions, rather than generate a response itself.

### Metrics

Determining a metric for machine translation and chatbot learning is much more difficult than for simple classification tasks. While such learning is still considered supervised learning, since models learn from human-transcribed “answers” from the real world, there are myriad different “answers” for any given input. For instance, below are valid translations and responses for the question “How are you?”:
* I’m good, how are you?
* I’ve been better.
* Doing great.
* Cómo estás?
* Qué tal?

Those are all appropriate responses and translations, and they are clearly better than the following responses and translations:
* The sky is blue.
* The ocean is deep.
* Latinoamérica incluye México, Centroamérica, y Sudamérica.

Ultimately, we calculate the loss using a softmax function, where each category represents a token within our generated vocabulary. Since we use a vocabulary of 25,000 tokens, in practice we use a sampled softmax function, which evaluates the function only over a subset of the vocabulary for each loss calculation.

![Softmax Equation](images/softmax_equation.png)

This softmax function outputs the probability that a given input vector produces the output `y = j`. The resulting probability distribution can then be used to compute the model’s cross entropy.
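
A minimal sketch of the full (unsampled) softmax over a list of scores follows. The input scores are hypothetical, and subtracting the maximum is a standard numerical-stability step rather than something specific to this project.

```python
import math


def softmax(logits):
    """Convert a list of scores into a probability distribution."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a three-token vocabulary.
probs = softmax([2.0, 1.0, 0.1])

# Large scores are handled without overflow thanks to the shift.
tied = softmax([1000.0, 1000.0])
```

The sampled softmax used in practice applies the same idea to a subset of the vocabulary during training, so that the sum in the denominator does not run over all 25,000 tokens.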

Also of note, these losses are computed separately for different length categories of sentences. This is discussed in further detail below, but basically a given input-output pair is bucketed based on the maximum length of the two. This allows the model to use a separate set of weights for sentences of 5 words and 50 words, thus enabling the model to optimize more efficiently by avoiding calculations on padding symbols for short inputs and outputs.
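
As a rough sketch of bucketing, assume a hypothetical list of `(max input length, max output length)` buckets and a hypothetical padding id of 0. Neither value is taken from the project's configuration.

```python
# Hypothetical (max input length, max output length) pairs, smallest first.
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]
PAD_ID = 0  # hypothetical id of the padding symbol


def choose_bucket(input_ids, output_ids, buckets=BUCKETS):
    """Return the index of the smallest bucket that fits both sequences, or None."""
    for index, (max_in, max_out) in enumerate(buckets):
        if len(input_ids) <= max_in and len(output_ids) <= max_out:
            return index
    return None


def pad_to_bucket(ids, length, pad_id=PAD_ID):
    """Append padding symbols until the sequence reaches the bucket length."""
    return ids + [pad_id] * (length - len(ids))


short_pair_bucket = choose_bucket([1] * 3, [1] * 4)
longer_pair_bucket = choose_bucket([1] * 8, [1] * 4)
padded = pad_to_bucket([7, 8], 5)
```

A three-word prompt lands in the first bucket and is padded to 5 tokens rather than 40, which is where the saving on padding computations comes from.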

Of course, in natural language processing, for the reason mentioned above, there is no known probability distribution. The best we have is the input and output data from our test set. The following equation calculates the cross entropy of a given set of inputs and outputs (N), and the probability determined by the softmax function above (q(x)).

![Cross Entropy Equation](images/cross_entropy_equation.png)

Note that the higher the probability a given input outputs the corresponding output from the test set, the lower the cross entropy. This metric is taken a step further in natural language modeling by calculating the perplexity.

![Perplexity Equation](images/perplexity_equation.png)

Perplexity can be thought of as the effective number of equally likely outcomes of a probability distribution. For instance, rolling a fair six-sided die has a perplexity of 6, while choosing a number 1-10 uniformly at random has a perplexity of 10. A higher perplexity reflects a higher degree of uncertainty.

With regard to our specific model, we can interpret the perplexity as the uncertainty with which our model predicts a given output from our dataset. In this interpretation, the cross entropy value above represents the average number of bits required to encode a given word within our vocabulary. Given a particular input and a cross entropy of 4 bits (ie the model assigns a probability of 1/16, or 6.25%, to the output in the dataset), the model would have a perplexity of 2^4 = 16 per word.

Note that the upper bound on the entropy of the 5.96 million character Brown Corpus of English text was estimated in 1992 as 1.75 bits per character or 7.95 bits per word, yielding a perplexity of about 247 (Brown et al 1992) [4].
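
The relationship between cross entropy and perplexity can be checked numerically. This sketch assumes base-2 logarithms, consistent with the discussion in bits above, and takes as input the probabilities a model assigned to each observed output.

```python
import math


def cross_entropy_bits(probabilities):
    """Average negative log2 probability assigned to the observed outputs."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


def perplexity(probabilities):
    """Perplexity is 2 raised to the cross entropy in bits."""
    return 2 ** cross_entropy_bits(probabilities)


# A fair six-sided die: every outcome has probability 1/6.
die = perplexity([1 / 6] * 6)

# A model that assigns probability 1/16 to each observed word
# has a cross entropy of 4 bits per word.
four_bits = perplexity([1 / 16] * 3)

# The 7.95 bits per word estimate for the Brown Corpus.
brown_word = 2 ** 7.95
```

Here `die` comes out at 6 (up to floating point error), `four_bits` at 16, and `brown_word` at roughly 247, matching the figures in the text.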


### Data Preprocessing

As mentioned earlier, the Cornell Movie conversation dataset provides the base data for training the model in general conversation structure. The data pre-processing for this dataset is as follows:
1. Create the vocabulary for both encoder and decoder inputs (ie the prompts and responses). This involves iterating over all of the prompts and responses and counting occurrences of every word. The lists of prompt and response words are then filtered to the n most frequent words (we used 25,000 for n). These vocabularies are then stored for later processing of inputs and outputs. Additionally, the following functional words are included in the vocabulary:
   * A start_id, which represents the first word of a response string.
   * An end_id, which represents the end of a response string.
   * A pad_id, which is appended to the end of all prompt and response strings up to the maximum length of their corresponding bucket.
   * An `unk`, which is used for any prompt or response word not saved in the vocabulary above.
2. Encode every prompt and response using the id assigned to each word in the prompt and response vocabularies.
3. Add the functional vocabulary words mentioned above to create sequences of appropriate length.
4. Reverse the order of the prompt sequences, which creates more short term dependencies, simplifies the optimization problem, and in turn yields better performance (Sutskever et al 2014, 2).
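
The steps above can be sketched as follows. This is an illustration under assumptions, not the project's code: the special-token names, the tiny corpus, and the choice to pad before reversing (so padding ends up at the front of the prompt) are all hypothetical.

```python
from collections import Counter

# Hypothetical names for the functional vocabulary words.
PAD, START, END, UNK = "pad", "start", "end", "unk"


def build_vocab(sentences, n):
    """Map the n most frequent words, plus the functional words, to ids."""
    counts = Counter(word for s in sentences for word in s.split())
    words = [w for w, _ in counts.most_common(n)]
    return {w: i for i, w in enumerate([PAD, START, END, UNK] + words)}


def encode_prompt(sentence, vocab, length):
    """Encode, pad to the bucket length, then reverse the prompt."""
    ids = [vocab.get(w, vocab[UNK]) for w in sentence.split()]
    ids = ids + [vocab[PAD]] * (length - len(ids))
    return ids[::-1]


def encode_response(sentence, vocab, length):
    """Encode with start and end ids, then pad to the bucket length."""
    ids = [vocab[START]]
    ids += [vocab.get(w, vocab[UNK]) for w in sentence.split()]
    ids.append(vocab[END])
    return ids + [vocab[PAD]] * (length - len(ids))


corpus = ["how are you", "how are things", "how now"]
vocab = build_vocab(corpus, n=2)          # keeps only "how" and "are"
prompt_ids = encode_prompt("how are you", vocab, 5)
response_ids = encode_response("how are you", vocab, 6)
```

With this toy corpus, "you" falls outside the two-word vocabulary and is encoded as the unknown id in both the prompt and the response.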

Including the climate data posed additional pre-processing challenges. The goal of this project was to create a chatbot with a practical and specific purpose, namely answering questions about climate change. Given the complexity of both natural language processing and climate change, it made little sense to combine these two tasks without some tricks. As Chomsky's point suggests, deep learning results are purely statistical models. They neither create nor understand fundamental concepts, such as the greenhouse effect or the significance of the words “climate”, “temperature”, or “flood”.

To create a chatbot that did indeed have a sense of the significance of these words would require not just conversational data, but also climate and economic data. These data would be processed by a completely different model than the architecture discussed here. I, therefore, decided to meta-tokenize the response data used in the model. Data pre-processing procedures for the climate FAQs were as follows:
1. I searched the internet for FAQs on climate change. I saved 209 FAQs in total from the sources listed in the Climate FAQ References section below.
2. For each of these FAQs, I created 5 additional paraphrases of the questions and gleaned a further three from Amazon’s Mechanical Turk.
  * I originally planned to use Microsoft’s paraphrase API, but it has been deprecated. Ultimately, while this cost me an additional day and some Mechanical Turk fees, the input data was probably much better from a human source than from another AI source.
3. For each response, I created a unique meta-token, which was saved in a JSON lookup file, and appended each paraphrase along with the response meta-token to the Cornell movie dialog dataset.
4. I included all vocabulary words from the climate augmented dataset in the final vocabularies by removing the least used vocabulary words from the Cornell movie dialog dataset.

The drawback to the meta-token approach is that the decoder training data is not representative of conversational grammar. However, given that the climate data set was smaller, this seemed like a much more reasonable approach than to expect our model to learn long, scientifically dense output responses.
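
A minimal sketch of the meta-token idea is shown below. The record layout, the token format `climate_answer_<index>`, and the placeholder answer text are all assumptions made for illustration; the real lookup file and token names may differ.

```python
import json


def add_meta_tokens(faqs):
    """Build a token-to-answer lookup and (question, token) training pairs.

    `faqs` is a list of dicts with 'answer' and 'paraphrases' keys
    (a hypothetical layout).
    """
    lookup, pairs = {}, []
    for index, faq in enumerate(faqs):
        token = "climate_answer_{}".format(index)  # hypothetical token format
        lookup[token] = faq["answer"]
        pairs.extend((question, token) for question in faq["paraphrases"])
    return lookup, pairs


def respond(decoded_tokens, lookup):
    """Swap a decoded meta-token for its stored answer, else join the words."""
    if len(decoded_tokens) == 1 and decoded_tokens[0] in lookup:
        return lookup[decoded_tokens[0]]
    return " ".join(decoded_tokens)


faqs = [{
    "answer": "(stored answer text for the first FAQ)",
    "paraphrases": ["what is the greenhouse effect",
                    "how does the greenhouse effect work"],
}]
lookup, pairs = add_meta_tokens(faqs)
lookup_json = json.dumps(lookup)  # what would be written to the lookup file
answer = respond(["climate_answer_0"], lookup)
chat = respond(["doing", "great"], lookup)
```

The decoder only has to learn to emit a single token for a climate question, and the long answer is filled in afterwards from the lookup.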

### Tensorlayer

To simplify our implementation, we employed layers from the [Tensorlayer library](http://tensorlayer.readthedocs.io/en/stable/) to implement our sequence to sequence model. Our model was composed of two types of layers:

#### Embedding Layer

An embedding layer is essentially a matrix that converts one-hot word vectors into dense embedding vectors that can be optimized during training. This additional layer allows our model to develop a sense of meaning between different words (ie similar words will have similar values in their embedding vectors). It also makes the network more efficient: because words are converted to their embedding vectors by lookup rather than multiplication, it avoids sparse matrix multiplication.

Note that these embedding layers were applied on both input and output sequences.
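
The equivalence between one-hot multiplication and lookup can be seen in a small sketch. The embedding table here is a hypothetical 3-word vocabulary with 2-dimensional vectors, not values from the trained model.

```python
# Hypothetical embedding table: one row per vocabulary word.
EMBEDDINGS = [
    [0.1, 0.3],
    [0.7, -0.2],
    [0.0, 0.5],
]


def one_hot(index, size):
    """Sparse vector with a single 1.0 at the word's id."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_by_matmul(index, table):
    """Multiply the one-hot vector by the embedding matrix."""
    vec = one_hot(index, len(table))
    dims = len(table[0])
    return [sum(vec[row] * table[row][d] for row in range(len(table)))
            for d in range(dims)]


def embed_by_lookup(index, table):
    """Read the row directly, skipping the multiplication."""
    return list(table[index])


matmul_result = embed_by_matmul(1, EMBEDDINGS)
lookup_result = embed_by_lookup(1, EMBEDDINGS)
```

Both functions return the same row, but the lookup touches one row instead of multiplying through the whole vocabulary, which matters with 25,000 words.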

#### Sequence to Sequence Layer

This layer maps the input and output embedded layers to each other. Essentially, this layer is responsible for discerning meaning from a sequence of embedded inputs (ie calculating a context vector). This context can then be used to recurrently calculate a sequence of embedded outputs.

Note that our sequence to sequence layer used a long short-term memory cell, as discussed above.

### Optimization and Tuning Hyperparameters

There are a number of different parameters we could tune on this model. We chose to focus on the following.

* Dropout
* Embedding dimension
* Number of Seq2Seq layers
* Vocab size
* Batch size
* Optimizer
* Learning rate

Note that we expect the latter 3 parameters to have more impact on training efficiency than on the final outcome.



## Results

## Discussion

### Addressing Shortcomings

## Research References

1. http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html
2. Goodfellow et al, Deep Learning. http://www.deeplearningbook.org, MIT Press, 2016.
3. arXiv:1412.3555v1 [cs.NE]. Chung et al, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.” https://arxiv.org/pdf/1412.3555v1.pdf, 11 Dec 2014.
4. Brown et al, ‘An Estimate of an Upper Bound for the Entropy of English’. http://www.cs.cmu.edu/~roni/11761/PreviousYearsHandouts/gauntlet.pdf, CMU, March 1992.

## Code References

This code base combines concepts primarily from:

* Tensorlayer's [seq2seq-chatbot](https://github.com/tensorlayer/seq2seq-chatbot)
* Google Cloud's [Using Distributed TensorFlow with Cloud ML Engine and Cloud Datalab ](https://cloud.google.com/ml-engine/docs/tensorflow/distributed-tensorflow-mnist-cloud-datalab).

## Climate FAQ References

Please see this Google spreadsheet for the original list of questions and answers, as well as the paraphrases: https://docs.google.com/spreadsheets/d/1yexja22mo94y0h4Dwk4VQz9fZPrOkm779SXguQA3SL0/edit?usp=sharing

The original FAQ questions and answers were derived from the following sources:

* “Climate Science FAQs.” USDA Climate Hubs. Web. 15 July 2017. https://www.climatehubs.oce.usda.gov/content/climate-science-faqs
* “Frequently Asked Questions.” Global Climate Change Vital Signs of the Planet. NASA. Web. 15 July 2017. https://climate.nasa.gov/faq/.
* “FAQs About Rapid Climate Change.” Utah Education Network. Web. 15 July 2017. http://www.uen.org/climate/faq.shtml.
* “Climate Science FAQ.” ClimatePrediction.net. Web. 15 July 2017.  http://www.climateprediction.net/climate-science/faqs/.
* “Five Outstanding Questions in Earth Science.” Earth Magazine. Web. 15 July 2017. https://www.earthmagazine.org/article/five-outstanding-questions-earth-science.
* “FAQ on Climate Models.” RealClimate. Web. 15 July 2017. http://www.realclimate.org/index.php/archives/2008/11/faq-on-climate-models/.
* IPCC, 2007: Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change [Solomon, S., D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M.Tignor and H.L. Miller (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA. Web. 15 July 2017. https://www.ipcc.ch/pdf/assessment-report/ar4/wg1/ar4-wg1-faqs.pdf.
* “Global Warming/ Climate Change Frequently Asked Questions.” Environmental and Energy Study Institute. Web. 15 July 2017. http://www.eesi.org/climate-change-FAQ.
* “Frequently Asked Questions About Climate Change.” Environmental Protection Agency. Web. 15 July 2017. https://19january2017snapshot.epa.gov/climatechange/frequently-asked-questions-about-climate-change_.html
* “Frequently Asked Questions.” University Corporation for Atmospheric Research. Web. 15 July 2017. https://www2.ucar.edu/contact-us/faq.
* “Global Warming FAQ.” Union of Concerned Scientists. Web. 15 July 2017. http://www.ucsusa.org/global_warming/science_and_impacts/science/global-warming-faq.html
* Gillis, Justin. “Short Answers to Hard Questions About Climate Change.” 6 July 2017. Web. 15 July 2017. https://www.nytimes.com/interactive/2015/11/28/science/what-is-climate-change.html?_r=4
* “How do human CO2 emissions compare to natural CO2 emissions?” Skeptical Science. Web. 15 July 2017. https://skepticalscience.com/human-co2-smaller-than-natural-emissions.htm.
* “Could Warmer Oceans Make Atmospheric Carbon Dioxide Rise Faster Than Expected?” Science Daily. Web. 15 July 2017. https://www.sciencedaily.com/releases/2007/10/071023163953.htm.
* “Global Warming Frequently Asked Questions.” Climate.gov. Web. 15 July 2017. https://www.climate.gov/news-features/understanding-climate/global-warming-frequently-asked-questions#hide4.
* “Frequently Asked Questions.” National Centers for Environmental Information. Web. 15 July 2017. https://www.ncdc.noaa.gov/monitoring-references/faq/.
* “FAQs.” National Climate Assessment. Web. 15 July 2017. http://nca2014.globalchange.gov/report/appendices/faqs.

