<a href="https://colab.research.google.com/github/daviddhc20120601/colab/blob/master/data_science_recruitment_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Challenge

## Overview

The focus of this exercise is a field within machine learning called [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing) (NLP). We can think of this field as the intersection between language and machine learning. Tasks in this field include automatic translation (Google Translate), intelligent personal assistants (Siri), predictive text, and speech recognition.

NLP uses many of the same techniques as traditional data science, but also features a number of specialised skills and approaches. You are not expected to have any experience with NLP; however, the following skills will be useful for completing the challenge:

- understanding of the Python programming language, or a similar third-generation language
- understanding of basic machine learning concepts, e.g. supervised learning


### Instructions

1. Answer each of the provided questions, including your source code as cells in this notebook.
2. Send us your solution in a zip file at your earliest convenience.

### Task description

You will be performing a task known as [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). Here, the goal is to predict sentiment -- the emotional intent behind a statement -- from text. For example, the sentence "*This movie was terrible!*" has a negative sentiment, whereas "*loved this cinematic masterpiece*" has a positive sentiment.

To simplify the task, we consider sentiment binary: labels of `1` indicate a sentence has a positive sentiment, and labels of `0` indicate that the sentence has a negative sentiment.

### Dataset

The dataset is split across three files, representing three different sources -- Amazon, Yelp and IMDB. Your task is to build a sentiment analysis model using both the Yelp and IMDB data as your training set, and to test the performance of your model on the Amazon data.

Each file can be found in the `../input` directory, and contains 1000 rows of data. Each row contains a sentence, a `tab` character and then a label -- `0` or `1`. 
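
The row format described above can be parsed with the standard library alone. The following is a minimal sketch, assuming every line is `sentence<TAB>label`; the helper name `parse_labelled_lines` and the sample rows are illustrative and do not come from the real files. Quote handling is switched off because review sentences may contain unbalanced `"` characters, which a quote-aware parser can otherwise merge across several rows.

```python
import csv
import io


def parse_labelled_lines(text):
    """Parse 'sentence<TAB>label' lines into (sentence, label) pairs."""
    # QUOTE_NONE treats double quotes as ordinary characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sentence, label = fields
        rows.append((sentence.strip(), int(label)))
    return rows


# Illustrative rows only (hypothetical, not taken from the dataset).
sample = (
    'This movie was terrible!\t0\n'
    'He said "wow and left.\t0\n'
    'Loved this cinematic masterpiece.\t1\n'
)
rows = parse_labelled_lines(sample)
print(rows)
```

A quick sanity check after loading is to compare the number of parsed rows with the expected 1000 per file.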

**Notes**
- This environment comes with a wide range of ML libraries installed. If you wish to include more, go to the 'Settings' tab and input the `pip install` command as required.
- Suggested libraries: `sklearn` (for machine learning), `pandas` (for loading/processing data).
- As mentioned, you are not expected to have previous experience with this exact task. You are free to refer to external tutorials/resources to assist you. However, you will be asked to justify the choices you have made -- so make sure you understand the approach you have taken.

In [6]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving yelp_labelled.txt to yelp_labelled.txt
User uploaded file "yelp_labelled.txt" with length 61320 bytes


In [4]:
!head "amazon_cells_labelled.txt"

So there is no way for me to plug it in here in the US unless I go by a converter.	0
Good case, Excellent value.	1
Great for the jawbone.	1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!	0
The mic is great.	1
I have to jiggle the plug to get it to line up right to get decent volume.	0
If you have several dozen or several hundred contacts, then imagine the fun of sending each of them one by one.	0
If you are Razr owner...you must have this!	1
Needless to say, I wasted my money.	0
What a waste of money and time!.	0


# Tasks
### 1. Read and concatenate data into test and train sets.

In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
import sklearn
import pprint

In [7]:
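# Amazon is the test set; Yelp and IMDB together form the training set.
test = pd.read_table("amazon_cells_labelled.txt", header=None)
test.columns = ["sentence", "label"]
train = pd.concat(
    [pd.read_table("yelp_labelled.txt", header=None),
     pd.read_table("imdb_labelled.txt", header=None)],
    ignore_index=True)  # replaces DataFrame.append, which newer pandas versions no longer provide
train.columns = ["sentence", "label"]
# Fewer than the expected 2000 training rows would suggest lines were merged during parsing (possibly by quote handling).
print(len(test), len(train))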

1000 1748


In [0]:
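# Training repeats and shuffles the data indefinitely; the number of steps is set in train().
train_input_fn = tf.estimator.inputs.pandas_input_fn(train, train["label"], num_epochs=None, shuffle=True)
# Evaluation needs one unshuffled pass: with num_epochs=None the input never ends and evaluate() would not return.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(test, test["label"], num_epochs=1, shuffle=False)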

### 2. Prepare the data for input into your model.

In [0]:
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence", 
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")



In [10]:
estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column ],
    n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))


INFO:tensorflow:Using default config.


INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpvy1x5ydt', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff47e8e2cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}



In [11]:
estimator.train(input_fn=train_input_fn, steps=5000);

Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use `tf.cast` instead.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpvy1x5ydt/model.ckpt.
INFO:tensorflow:loss = 89.08295, step = 1
INFO:tensorflow:global_step/sec: 106.904
INFO:tensorflow:loss = 63.595245, step = 101 (0.945 sec)
INFO:tensorflow:global_step/sec: 130.881
INFO:tensorflow:loss = 63.957138, step = 201 (0.757 sec)
INFO:tensorflow:global_step/sec: 135.596
INFO:tensorflow:loss = 48.17328, step = 301 (0.740 sec)
INFO:tensorflow:global_step/sec: 134.29
INFO:tensorflow:loss = 47.797714, step = 401 (0.742 sec)
INFO:tensorflow:global_step/sec: 133.56
INFO:tensorflow:loss = 37.019775, step = 501 (0.754 sec)
INFO:tensorflow:global_step/sec: 132.799
INFO:tensorflow:loss = 36.64075, step = 601 (0.748 sec)
INFO:tensorflow:global_step/sec: 130.555
INFO:tensorflow:loss = 31.93021, step = 701 (0.766 sec)
INFO:tensorflow:global_step/sec: 126.597
INFO:tensorflow:loss = 27.657173, step = 801 (0.790 sec)
INFO:tensorflow:global_step/sec: 131.75
INFO:tensorflow:loss = 24.858578, step = 901 (0.761 sec)
INFO:tensorflow:global_step/sec: 130.464
INFO:tensorflow:loss = 23.646076, step = 1001 (0.768 sec)
INFO:tensorflow:global_step/sec: 132.882
INFO:tensorflow:loss = 17.853737, step = 1101 (0.752 sec)
INFO:tensorflow:global_step/sec: 133.654
INFO:tensorflow:loss = 15.69716, step = 1201 (0.749 sec)
INFO:tensorflow:global_step/sec: 134.429
INFO:tensorflow:loss = 11.651931, step = 1301 (0.744 sec)
INFO:tensorflow:global_step/sec: 131.582
INFO:tensorflow:loss = 11.398531, step = 1401 (0.757 sec)
INFO:tensorflow:global_step/sec: 131.228
INFO:tensorflow:loss = 11.704571, step = 1501 (0.768 sec)
INFO:tensorflow:global_step/sec: 132.763
INFO:tensorflow:loss = 8.359665, step = 1601 (0.750 sec)
INFO:tensorflow:global_step/sec: 132.011
INFO:tensorflow:loss = 7.214984, step = 1701 (0.761 sec)
INFO:tensorflow:global_step/sec: 131.312
INFO:tensorflow:loss = 6.998951, step = 1801 (0.756 sec)
INFO:tensorflow:global_step/sec: 130.281
INFO:tensorflow:loss = 6.302157, step = 1901 (0.765 sec)
INFO:tensorflow:global_step/sec: 131.412
INFO:tensorflow:loss = 6.218002, step = 2001 (0.762 sec)
INFO:tensorflow:global_step/sec: 128.599
INFO:tensorflow:loss = 5.620042, step = 2101 (0.781 sec)
INFO:tensorflow:global_step/sec: 125.04
INFO:tensorflow:loss = 3.7959285, step = 2201 (0.802 sec)
INFO:tensorflow:global_step/sec: 126.579
INFO:tensorflow:loss = 3.5860922, step = 2301 (0.786 sec)
INFO:tensorflow:global_step/sec: 127.132
INFO:tensorflow:loss = 2.9925306, step = 2401 (0.788 sec)
INFO:tensorflow:global_step/sec: 130.464
INFO:tensorflow:loss = 2.7290182, step = 2501 (0.766 sec)
INFO:tensorflow:global_step/sec: 133.413
INFO:tensorflow:loss = 3.0873284, step = 2601 (0.748 sec)
INFO:tensorflow:global_step/sec: 129.256
INFO:tensorflow:loss = 3.0123105, step = 2701 (0.776 sec)
INFO:tensorflow:global_step/sec: 133.659
INFO:tensorflow:loss = 2.6113572, step = 2801 (0.748 sec)
INFO:tensorflow:global_step/sec: 129.336
INFO:tensorflow:loss = 2.163141, step = 2901 (0.776 sec)
INFO:tensorflow:global_step/sec: 129.839
INFO:tensorflow:loss = 1.7032461, step = 3001 (0.764 sec)
INFO:tensorflow:global_step/sec: 129.446
INFO:tensorflow:loss = 1.9032897, step = 3101 (0.775 sec)
INFO:tensorflow:global_step/sec: 131.806
INFO:tensorflow:loss = 1.7411642, step = 3201 (0.756 sec)
INFO:tensorflow:global_step/sec: 134.69
INFO:tensorflow:loss = 1.8741642, step = 3301 (0.743 sec)
INFO:tensorflow:global_step/sec: 129.717
INFO:tensorflow:loss = 1.479543, step = 3401 (0.774 sec)
INFO:tensorflow:global_step/sec: 129.318
INFO:tensorflow:loss = 1.6191914, step = 3501 (0.773 sec)
INFO:tensorflow:global_step/sec: 132.909
INFO:tensorflow:loss = 1.630455, step = 3601 (0.753 sec)
INFO:tensorflow:global_step/sec: 133.858
INFO:tensorflow:loss = 1.3769958, step = 3701 (0.746 sec)
INFO:tensorflow:global_step/sec: 138.708
INFO:tensorflow:loss = 1.3313228, step = 3801 (0.721 sec)
INFO:tensorflow:global_step/sec: 137.481
INFO:tensorflow:loss = 1.0254672, step = 3901 (0.726 sec)
INFO:tensorflow:global_step/sec: 134.132
INFO:tensorflow:loss = 1.0217093, step = 4001 (0.747 sec)
INFO:tensorflow:global_step/sec: 130.139
INFO:tensorflow:loss = 1.2786353, step = 4101 (0.771 sec)
INFO:tensorflow:global_step/sec: 132.243
INFO:tensorflow:loss = 1.0562612, step = 4201 (0.751 sec)
INFO:tensorflow:global_step/sec: 133.47
INFO:tensorflow:loss = 0.9215454, step = 4301 (0.754 sec)
INFO:tensorflow:global_step/sec: 133.847
INFO:tensorflow:loss = 0.9118127, step = 4401 (0.743 sec)
INFO:tensorflow:global_step/sec: 130.428
INFO:tensorflow:loss = 1.1543596, step = 4501 (0.765 sec)
INFO:tensorflow:global_step/sec: 134.779
INFO:tensorflow:loss = 0.9645288, step = 4601 (0.746 sec)
INFO:tensorflow:global_step/sec: 132.617
INFO:tensorflow:loss = 0.8535518, step = 4701 (0.756 sec)
INFO:tensorflow:global_step/sec: 133.168
INFO:tensorflow:loss = 0.8311044, step = 4801 (0.751 sec)
INFO:tensorflow:global_step/sec: 119.581
INFO:tensorflow:loss = 0.5366992, step = 4901 (0.832 sec)
INFO:tensorflow:Saving checkpoints for 5000 into /tmp/tmpvy1x5ydt/model.ckpt.
INFO:tensorflow:Loss for final step: 0.7412214.


#### 2a: Find the ten most frequent words in the training set.
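
One possible approach to this question, sketched with the standard library only. It assumes a simple tokenisation (lowercasing, then keeping runs of letters and apostrophes); the helper name `top_words` and the demo sentences are hypothetical. In the notebook, the input would be the `train["sentence"]` column.

```python
import re
from collections import Counter


def top_words(sentences, n=10):
    """Return the n most common lowercase word tokens with their counts."""
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenisation: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", sentence.lower()))
    return counts.most_common(n)


# Illustrative sentences (hypothetical, not from the dataset).
demo = ["The food was great", "The service was slow", "Great food"]
print(top_words(demo, n=4))
```

With real review text, the top of such a list is likely to be dominated by stop words such as "the" and "was"; whether to filter them out is a design choice worth justifying.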

In [0]:
# Evaluate the trained model on the Amazon test set.
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)
print("Test set accuracy: {accuracy}".format(**test_eval_result))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-11-07T14:52:01Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpvy1x5ydt/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.



### 3. Train your model and justify your choices.

In [0]:
#Q3 soln.

### 4. Evaluate your model using metric(s) you see fit and justify your choices.

In [0]:
#Q4 soln.
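
A minimal sketch of metrics that could be reported for a binary sentiment classifier, written in plain Python so the arithmetic is visible. The labels and predictions below are made up for illustration and are not results from this notebook; the helper names `confusion_counts` and `binary_metrics` are hypothetical. `sklearn.metrics` provides equivalents such as `accuracy_score`, `confusion_matrix` and `classification_report`.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels where 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1, guarding against division by zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred), metrics)
```

Accuracy is easy to interpret when the two classes are roughly balanced; precision, recall and the confusion counts show whether errors are concentrated on one class.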