# Game 1 - Spectrum Labeling

What makes a situation safe? What makes it different? Can you develop a program to navigate future pandemics?
Recent events have demonstrated an ongoing struggle to determine what is "safe" during a pandemic. In this game, contestants will be tasked with utilizing readily available information, such as 100+ pages of the CDC Guidelines, and developing an algorithm to map a spectrum of scenarios, from safe to dangerous, to reduce the spread of COVID-19.

## Brainstorming
- pdf scraper
- multi-class classification?  Logistic regression? Recommender system?
- bag of words, N-Grams, Tf-Idf
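
As a rough illustration of the bag-of-words / N-gram / Tf-Idf idea combined with logistic regression, here is a minimal sketch using scikit-learn. The ten phrases and scores are copied from the training sample shown in Part 1; the pipeline itself (`baseline`, the `ngram_range`, `max_iter`) is an assumption for illustration, not the approach the notebook ends up taking.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Ten phrases and scores copied from the training sample in Part 1
# (0 = low, 1 = medium, 2 = high risk).
phrases = [
    "hugging",
    "more than six feet apart",
    "see your doctor",
    "school",
    "Taking a taxi or a ride-sharing service",
    "getting restaurant food as takeout",
    "Visiting a hospital emergency department",
    "going to a sports stadium",
    "hair, nail salon, or barbershop visit",
    "Interacting with more people",
]
scores = [2, 0, 0, 2, 1, 0, 1, 2, 1, 2]

# Unigrams and bigrams weighted by TF-IDF, fed to a multi-class
# logistic regression (hypothetical baseline, not tuned).
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(phrases, scores)

predictions = baseline.predict(["going to school", "see your doctor"])
print(predictions)
```

With only ten phrases the predictions mean little; the point is the shape of the pipeline, which could be fit on the full labeled dataset instead.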

## Plan of Attack
We are treating this like a sentiment analysis problem.  Instead of the labels being positive or negative, the labels will be low, medium, or high risk.  Using the research Rhodora did to find the risky/safe activities, I created a small labeled dataset to train the model on.  

## Part 1: Read the Data
Reading in the self-generated labeled training dataset and fake test dataset.

In [2]:
import pandas as pd
import numpy as np

In [3]:
train = pd.read_csv("selfGeneratedLabelledDataset.csv")
train.sample(10)

| Unnamed: 0 | key_words | risk_score |
|---|---|---|
| 28 | hugging | 2 |
| 131 | more than six feet apart | 0 |
| 128 | see your doctor | 0 |
| 10 | school | 2 |
| 203 | Taking a taxi or a ride-sharing service | 1 |
| 104 | getting restaurant food as takeout | 0 |
| 204 | Visiting a hospital emergency department | 1 |
| 16 | going to a sports stadium | 2 |
| 208 | hair, nail salon, or barbershop visit | 1 |
| 156 | Interacting with more people | 2 |


This training set was derived from research by Rhodora.  I captured the activities and assigned a risk score:
- 0 = low risk
- 1 = medium risk
- 2 = high risk

These websites provided lists of activities scored by risk level, as published by the CDC.

[1] https://www.cdc.gov/coronavirus/2019-ncov/community/large-events/considerations-for-events-gatherings.html

[2] https://www.texmed.org/TexasMedicineDetail.aspx?id=54216

[3] https://finance.yahoo.com/news/coronavirus-health-experts-ranked-activities-risk-132702304.html

[4] https://www.ksdk.com/article/news/health/coronavirus/covid-19-risk-chart/63-723ae01d-4dc6-4a17-a8f0-1e68013515af

[5] https://library.stockton.edu/publichealth/COVID-19

[6] https://www.businessinsider.com/charts-show-coronavirus-risk-for-activities-2020-10

![HACKtheMACHINE](https://www.texmed.org/uploadedimages/Current/2016_About_TMA/Newsroom/News_Releases/COVID-19/TMA%20COVID%20%20309193%20Risk%20Assessment%20Chart.png)

![HACKtheMACHINE](https://infobeautiful4.s3.amazonaws.com/2020/03/Coronavirus-COVID19-riskiest-activities-03.png)

Next, the test set. These daily activities were taken from the following websites:
- https://games4esl.com/daily-routine-examples/
- https://englishlive.ef.com/blog/english-in-the-real-world/useful-english-phrases-describe-daily-routine/


We will train a model on the training set and predict on the test set. I assigned the test-set risk scores myself, based on my general understanding of the research behind the training set. This is not a rigorous test, but it is a reasonable proof of concept.

In [4]:
# Note: pandas 1.3+ replaces error_bad_lines/warn_bad_lines with on_bad_lines='warn'
test = pd.read_csv("fakeTestData.csv", sep=',', error_bad_lines=False, warn_bad_lines=True)
test.sample(10)

| Unnamed: 0 | daily_activities | risk_score |
|---|---|---|
| 8 | have lunch with others | 2 |
| 0 | wake up | 0 |
| 30 | feed the dog | 0 |
| 26 | relax | 0 |
| 9 | wash the dishes | 0 |
| 35 | watch TV with friends | 2 |
| 2 | brush your teeth | 0 |
| 12 | cook dinner | 0 |
| 21 | clean the house | 0 |
| 5 | get dressed | 0 |


## Part 2: Data processing

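Vectorize the text with the Keras Tokenizer. For this first pass, the three risk levels are collapsed into a binary label: 0 for low risk, 1 for medium or high risk.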


In [41]:
X_train = train.key_words
y_train = np.where(train.risk_score > 0, 1, 0)  # 0 = low risk, 1 = medium or high risk

X_test = test.daily_activities
y_test = np.where(test.risk_score > 0, 1, 0)

In [43]:
from keras.models import Sequential,Model
from keras.layers import Dense,Dropout,Activation
from keras.layers import Flatten,Input
from keras.layers import Embedding
from keras.layers import concatenate
from keras.utils import to_categorical
from keras.layers import LSTM, Bidirectional

from IPython.display import SVG
#from keras.utils import model_to_dot
from keras.utils.vis_utils import model_to_dot

In [44]:
train.risk_score.value_counts()

2    133
1     69
0     58
Name: risk_score, dtype: int64
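
Because the labels are binarized with `risk_score > 0`, the counts above translate into a fairly imbalanced two-class problem. The following sketch just does that arithmetic, using the numbers from the `value_counts()` output; the variable names are made up for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {2: 133, 1: 69, 0: 58}

total = sum(counts.values())
low = counts[0]                    # stays label 0
elevated = counts[1] + counts[2]   # risk_score > 0 becomes label 1

# Accuracy of always predicting the larger class.
majority_baseline = max(low, elevated) / total
print(total, low, elevated, round(majority_baseline, 3))
```

A model trained on these labels would need to beat roughly 78% accuracy on similarly distributed data before it says anything beyond "most activities are risky".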

In [46]:
MAX_NB_WORDS = 40000
MAX_SEQUENCE_LENGTH = 100

In [47]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

Found 366 unique tokens.
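
To make the tokenizing and padding step more concrete, here is a simplified pure-Python sketch of what it does: rank words by frequency, replace each phrase with its word ranks, and left-pad with zeros to a fixed length. The helper names (`tokenize`, `fit_word_index`, `pad_left`) are hypothetical, and the real Keras `Tokenizer` filters punctuation slightly differently, so treat this as an approximation rather than a reimplementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into words (simplified punctuation handling)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    """Rank words by frequency, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its rank; unknown words are dropped."""
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

def pad_left(sequences, maxlen):
    """Left-pad with zeros (and keep only the last maxlen ids) to a fixed length."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

texts = ["going to a sports stadium", "going to school", "school"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
data = pad_left(sequences, maxlen=6)
print(word_index)
print(data)
```

In the notebook the padded length is 100 while the phrases are only a few words long, so most of each input row is zeros.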


In [48]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

y = le.fit_transform(y_train)

In [50]:
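from keras.utils import to_categorical
import numpy as np
# One-hot encode the labels. Run this cell only once: y is overwritten in place,
# so a second run encodes the one-hot rows again and yields shape (n, 2, 2),
# as in the output below.
y = to_categorical(np.asarray(y))
y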

array([[[1., 0.],
        [0., 1.]],

       [[1., 0.],
        [0., 1.]],

       [[1., 0.],
        [0., 1.]],

       ...,

       [[1., 0.],
        [0., 1.]],

       [[0., 1.],
        [1., 0.]],

       [[0., 1.],
        [1., 0.]]], dtype=float32)

In [51]:
# Load the whole GloVe embedding into memory
embeddings_index = dict()
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        try:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
        except Exception as e:
            print(e)
print('Loaded %s word vectors.' % len(embeddings_index))


Loaded 400000 word vectors.


In [52]:
# Recover the tokenized training phrases from their integer sequences
vocab = tokenizer.sequences_to_texts(sequences)
# Adding 1 because of reserved 0 index
vocab_size = len(tokenizer.word_index) + 1

print(vocab_size)

# create a weight matrix for words in training docs;
# words without a GloVe vector keep an all-zero row
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
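
Words that GloVe does not know end up as all-zero rows in the weight matrix, so it can be worth checking how many there are. The sketch below shows one way to do that on toy data; the three-word `word_index`, the 3-dimensional vectors, and the word "covidsafe" are made up for illustration and stand in for `tokenizer.word_index` and `embeddings_index`.

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and embeddings_index (values are made up).
word_index = {"school": 1, "hugging": 2, "covidsafe": 3}
embeddings_index = {
    "school": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "hugging": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
embedding_dim = 3

# Row 0 is reserved for padding, hence the + 1.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
missing = []
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is None:
        missing.append(word)      # no pretrained vector: row stays all zeros
    else:
        embedding_matrix[i] = vector

coverage = 1 - len(missing) / len(word_index)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```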

367


In [53]:
# Build the model with the functional API
# Embedding layer: output dimension is 100 because we use GloVe 100d vectors
Embed_Layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=True)
# Define inputs
review_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='review_input')
review_embedding = Embed_Layer(review_input)
Flatten_Layer = Flatten()
review_flatten = Flatten_Layer(review_embedding)
output_size = 2  # two classes: low risk vs. medium or high risk

dense1 = Dense(100, activation='relu')(review_flatten)

dense2 = Dense(32, activation='relu')(dense1)
predict = Dense(output_size, activation='softmax')(dense2)

sentiment_model = Model(inputs=[review_input], outputs=[predict])
sentiment_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

sentiment_model.summary()


Model: "functional_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
review_input (InputLayer)    [(None, 100)]             0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 100, 100)          36700     
_________________________________________________________________
flatten_5 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_15 (Dense)             (None, 100)               1000100   
_________________________________________________________________
dense_16 (Dense)             (None, 32)                3232      
_________________________________________________________________
dense_17 (Dense)             (None, 2)                 66        
Total params: 1,040,098
Trainable params: 1,040,098
Non-trainable params: 0
___________________________________________
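
The parameter counts in the summary can be reproduced by hand, which helps show where the model's capacity sits. This is only a check of the arithmetic, using the sizes from the cells above; `dense_params` is a helper defined here for illustration.

```python
vocab_size = 367        # len(tokenizer.word_index) + 1
embedding_dim = 100     # GloVe 100d
sequence_length = 100   # MAX_SEQUENCE_LENGTH

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

embedding = vocab_size * embedding_dim           # 36,700
flattened = sequence_length * embedding_dim      # 10,000 inputs to the first dense layer
dense_100 = dense_params(flattened, 100)         # 1,000,100
dense_32 = dense_params(100, 32)                 # 3,232
dense_out = dense_params(32, 2)                  # 66

total = embedding + dense_100 + dense_32 + dense_out
print(total)
```

Nearly all of the parameters come from the first dense layer, which is a consequence of flattening 100 padded positions of 100 dimensions each. With a training set of a few hundred short phrases, that is a lot of capacity, so overfitting is a plausible concern.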

In [54]:

sentiment_model.fit(data,y,epochs= 5,batch_size=32,verbose=True)

Epoch 1/5


ValueError: in user code:

    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\engine\training.py:806 train_function  *
        return step_function(self, iterator)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\engine\training.py:796 step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:1211 run
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2585 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2945 _call_for_each_replica
        return fn(*args, **kwargs)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\engine\training.py:789 run_step  **
        outputs = model.train_step(data)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\engine\training.py:748 train_step
        loss = self.compiled_loss(
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\engine\compile_utils.py:204 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\losses.py:149 __call__
        losses = ag_call(y_true, y_pred)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\losses.py:253 call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\util\dispatch.py:201 wrapper
        return target(*args, **kwargs)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\losses.py:1535 categorical_crossentropy
        return K.categorical_crossentropy(y_true, y_pred, from_logits=from_logits)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\util\dispatch.py:201 wrapper
        return target(*args, **kwargs)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\keras\backend.py:4687 categorical_crossentropy
        target.shape.assert_is_compatible_with(output.shape)
    D:\Anaconda3\envs\htm\lib\site-packages\tensorflow\python\framework\tensor_shape.py:1134 assert_is_compatible_with
        raise ValueError("Shapes %s and %s are incompatible" % (self, other))

    ValueError: Shapes (None, 2, 2) and (None, 2) are incompatible
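
The shape `(None, 2, 2)` in this error suggests the labels were one-hot encoded twice, most likely because the `to_categorical` cell was run more than once on the same variable `y`. That is an inference from the three-dimensional array printed earlier, not something the notebook states. The sketch below reproduces the effect with a small NumPy stand-in for `to_categorical`; the `one_hot` helper and the three example labels are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Minimal stand-in for keras.utils.to_categorical: appends a class axis."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels, dtype=int)]

labels = np.array([1, 1, 0])    # binary labels, like y_train
once = one_hot(labels, 2)       # shape (3, 2): matches the softmax output
twice = one_hot(once, 2)        # shape (3, 2, 2): matches the shape in the error

print(once.shape, twice.shape)
print(twice[0])
```

One way to avoid this is to build `y` from `y_train` every time, for example `y = to_categorical(y_train, num_classes=2)`, so that re-running the cell cannot change the shape of `y`.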
