# Game 1 - Spectrum Labeling

What makes a situation safe? What makes it different? Can you develop a program to navigate future pandemics?
Recent events have demonstrated an ongoing struggle to determine what is "safe" during a pandemic. In this game, contestants will be tasked with utilizing readily available information, such as 100+ pages of the CDC Guidelines, and developing an algorithm to map a spectrum of scenarios, from safe to dangerous, to reduce the spread of COVID-19.

## Brainstorming
- pdf scraper
- multi-class classification?  Logistic regression? Recommender system?
- bag of words, N-Grams, Tf-Idf

## Plan of Attack
We are treating this as a sentiment analysis problem. Instead of positive or negative, the labels will be low, medium, or high risk. Using Rhodora's research on risky and safe activities, I created a small labeled dataset to train the model on.

## Part 1: Read the Data
Reading in the self-generated labeled training dataset and fake test dataset.

In [93]:
import pandas as pd
import numpy as np

In [108]:
train = pd.read_csv("singleWordLabels.csv")
train.sample(10)

Unnamed: 0,word,risk
172,convalescence,1
11,school,1
14,theater,1
69,touch,1
34,drunk,1
111,playdate,1
157,run,0
198,nature,0
40,enclosed,1
28,handshake,1


This training set was derived from research by Rhodora.  I captured the activities and assigned a risk score:
- 0 = low risk
- 1 = high risk


These websites provided lists of activities, scored by risk level as published by the CDC.

[1] https://www.cdc.gov/coronavirus/2019-ncov/community/large-events/considerations-for-events-gatherings.html

[2] https://www.texmed.org/TexasMedicineDetail.aspx?id=54216

[3] https://finance.yahoo.com/news/coronavirus-health-experts-ranked-activities-risk-132702304.html

[4] https://www.ksdk.com/article/news/health/coronavirus/covid-19-risk-chart/63-723ae01d-4dc6-4a17-a8f0-1e68013515af

[5] https://library.stockton.edu/publichealth/COVID-19

[6] https://www.businessinsider.com/charts-show-coronavirus-risk-for-activities-2020-10

![HACKtheMACHINE](https://www.texmed.org/uploadedimages/Current/2016_About_TMA/Newsroom/News_Releases/COVID-19/TMA%20COVID%20%20309193%20Risk%20Assessment%20Chart.png)

![HACKtheMACHINE](https://infobeautiful4.s3.amazonaws.com/2020/03/Coronavirus-COVID19-riskiest-activities-03.png)

Next, the test set. These daily activities were taken from the following websites:
- https://games4esl.com/daily-routine-examples/
- https://englishlive.ef.com/blog/english-in-the-real-world/useful-english-phrases-describe-daily-routine/


We will train a model on the training set and predict on the test set. I assigned the test risk scores myself, based on my general understanding of the research above (the same research behind the training set). Obviously, this is not a great test, but it's a reasonable proof of concept.

In [109]:
# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, warn_bad_lines=True
test = pd.read_csv("fakeTestData.csv", sep=',', on_bad_lines='warn')
test.sample(10)

Unnamed: 0,daily_activities,risk_score
19,go for a walk,0
16,exercise indoors,2
32,take a taxi,2
22,read the newspaper,0
30,feed the dog,0
6,go to school,2
13,go to bed,0
20,take out the trash,0
12,cook dinner,0
18,go shopping,1
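
The sampled test rows use scores 0, 1, and 2, while the training labels are binary. One possible way to compare binary predictions against these scores, offered only as a sketch, is to collapse 1 and 2 into a single high-risk label. The `collapse_to_binary` and `accuracy` helpers and the prediction list are hypothetical; only the `risk_score` values come from the sample above.

```python
def collapse_to_binary(risk_score):
    """Map a 0/1/2 risk score to a binary label (assumption: 1 and 2 both count as high risk)."""
    return 0 if risk_score == 0 else 1


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# risk_score values of the ten sampled test rows shown above
test_scores = [0, 2, 2, 0, 0, 2, 0, 0, 0, 1]
binary_truth = [collapse_to_binary(score) for score in test_scores]

# Made-up predictions, only to illustrate the comparison
hypothetical_predictions = [0, 1, 1, 0, 0, 1, 0, 1, 0, 0]

print(binary_truth)
print(accuracy(hypothetical_predictions, binary_truth))
```

Collapsing loses the distinction between medium and high risk, so it only suits a model trained on the binary labels.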


## Part 2: Data processing

Vectorize the data using pretrained GloVe word vectors.


In [110]:
X_train = train.word
y_train = train.risk


X_test = test.daily_activities
y_test = test.risk_score

In [82]:
# Load pretrained GloVe word vectors with gensim (the Word2Vec import below is not used)
  
# importing all necessary modules 
from nltk.tokenize import sent_tokenize, word_tokenize 
import warnings 
  
warnings.filterwarnings(action = 'ignore') 
  
import gensim 
from gensim.models import Word2Vec
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')

In [64]:
glove_vectors.most_similar('coronavirus')

[('hantavirus', 0.7137583494186401),
 ('norovirus', 0.6828611493110657),
 ('irukandji', 0.6680245995521545),
 ('carcinogen', 0.6634572744369507),
 ('malady', 0.6494699120521545),
 ('prion', 0.6485769748687744),
 ('microbe', 0.6472789645195007),
 ('h5n1', 0.6433577537536621),
 ('prions', 0.6385928392410278),
 ('superbug', 0.6375430822372437)]

In [26]:
cs = glove_vectors['crowds']  # index the KeyedVectors directly; .wv was removed in gensim 4

In [27]:
c = glove_vectors['crowd']  # index the KeyedVectors directly; .wv was removed in gensim 4

In [28]:
np.sum(np.square(c-cs))

4.787545

In [29]:
d = glove_vectors['dog']  # index the KeyedVectors directly; .wv was removed in gensim 4

In [15]:
np.sum(np.square(c-d))

34.36852
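
The cells above compare embeddings by squared Euclidean distance, whereas `most_similar` ranks neighbors by cosine similarity. The sketch below shows both measures side by side. The three 3-dimensional vectors are made-up stand-ins for the real 50-dimensional GloVe vectors, and the helper names are hypothetical.

```python
import math


def squared_euclidean(u, v):
    """Sum of squared component differences (smaller means closer)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (larger means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, not real GloVe values
crowd = [1.0, 2.0, 0.5]
crowds = [1.1, 1.9, 0.6]
dog = [-2.0, 0.5, 3.0]

print(squared_euclidean(crowd, crowds), squared_euclidean(crowd, dog))
print(cosine_similarity(crowd, crowds), cosine_similarity(crowd, dog))
```

The two measures point in opposite directions: a related pair has a small distance but a large cosine similarity.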

In [18]:
glove_vectors.most_similar('gathering')

[('gather', 0.84010249376297),
 ('gathered', 0.8312792778015137),
 ('meetings', 0.7875051498413086),
 ('participants', 0.7808489799499512),
 ('conference', 0.7629228830337524),
 ('meeting', 0.7627481818199158),
 ('organized', 0.7598081231117249),
 ('forum', 0.7544101476669312),
 ('addressing', 0.748741626739502),
 ('attending', 0.7391371130943298)]

New plan of attack: use the single-word risk labels. For a given word, find its average distance to the words in the risky category and its average distance to the words in the non-risky category, then assign the word to whichever category it is closer to overall.
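
The plan above can be sketched as follows. The toy 2-dimensional vectors and the `classify_by_average_distance` helper are hypothetical, and the notebook itself goes on to fit a random forest instead. Distance here is Euclidean, so the category with the smaller average wins; with a similarity score such as cosine, the larger average would win.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_distance(vector, category_vectors):
    """Mean distance from `vector` to every vector in one category."""
    total = sum(euclidean(vector, other) for other in category_vectors)
    return total / len(category_vectors)


def classify_by_average_distance(vector, risky_vectors, safe_vectors):
    """Return 1 (high risk) if the vector is closer on average to the risky words, else 0."""
    risky_avg = average_distance(vector, risky_vectors)
    safe_avg = average_distance(vector, safe_vectors)
    return 1 if risky_avg < safe_avg else 0


# Toy embeddings standing in for the GloVe vectors of labeled words
risky = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]
safe = [[-1.0, -1.0], [-0.8, -1.2]]

print(classify_by_average_distance([0.7, 0.9], risky, safe))    # lies near the risky cluster
print(classify_by_average_distance([-0.9, -0.7], risky, safe))  # lies near the safe cluster
```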

In [97]:
train.risk.value_counts()

1    135
0     74
Name: risk, dtype: int64

In [104]:
FEATURE_SHAPE = c.shape[0]

In [111]:
X_train = np.zeros((len(train), FEATURE_SHAPE))
y_train = np.array(train.risk)

In [112]:
for i, word in enumerate(train.word):
    X_train[i] = glove_vectors[word]  # look up the 50-d GloVe vector for each training word

In [113]:
X_train

array([[ 0.51141   ,  0.67690003,  0.26820999, ...,  0.52095997,
         0.20545   ,  0.41402   ],
       [ 0.78944999,  1.12670004,  0.094963  , ...,  0.37224001,
         0.12707999,  0.079093  ],
       [-0.94531   ,  0.39686   , -0.80605   , ..., -1.02310002,
         0.95393997, -0.0635    ],
       ...,
       [-0.2157    , -0.63352001,  0.87094998, ...,  1.34940004,
         1.5934    ,  0.44295001],
       [-0.0059413 ,  0.40832001, -0.18948001, ...,  1.27550006,
         0.63156003,  0.42811999],
       [-0.033329  , -0.08402   ,  0.29251999, ...,  1.0697    ,
         1.02649999,  0.59463   ]])

In [114]:
y_train

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int64)

In [115]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=2, random_state=0)

In [None]:
# Try a random word

word = 'eating'
vect = glove_vectors[word]
# predict expects a 2D array of shape (n_samples, n_features)
clf.predict(vect.reshape(1, -1))

In [None]:
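# Next step: process the test data; the stopword list used below must be downloaded first
import nltk
nltk.download()  # opens the interactive downloader; nltk.download('stopwords') fetches only the stopword list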


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [None]:
from nltk.corpus import stopwords
print(stopwords.words('english'))
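
The test activities are multi-word phrases, while the classifier was trained on single-word vectors. One possible way to bridge that gap, which this notebook does not implement, is to drop stopwords and average the remaining word vectors. In the sketch below, `toy_vectors`, `toy_stopwords`, and `phrase_vector` are hypothetical stand-ins for the GloVe vectors, the NLTK stopword list, and a helper that would still need to be written.

```python
def phrase_vector(phrase, vectors, stopwords, dim):
    """Average the vectors of the non-stopword tokens that have an embedding.

    Returns a zero vector of length `dim` when no token is usable.
    """
    tokens = [token for token in phrase.lower().split() if token not in stopwords]
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the i-th components of all vectors together
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 2-dimensional embeddings and a tiny stopword set, for illustration only
toy_vectors = {"go": [0.0, 1.0], "walk": [2.0, 3.0], "school": [4.0, -1.0]}
toy_stopwords = {"for", "a", "to", "the"}

print(phrase_vector("go for a walk", toy_vectors, toy_stopwords, dim=2))
print(phrase_vector("read the newspaper", toy_vectors, toy_stopwords, dim=2))
```

With the real data, the resulting vector would be reshaped to `(1, -1)` before being passed to `clf.predict`. The zero-vector fallback for phrases with no known words is a design choice in this sketch; skipping such rows is another option.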