# Game 1 - Spectrum Labeling

What makes a situation safe? What makes it different? Can you develop a program to navigate future pandemics?
Recent events have demonstrated an ongoing struggle to determine what is "safe" during a pandemic. In this game, contestants will be tasked with utilizing readily available information, such as 100+ pages of the CDC Guidelines, and developing an algorithm to map a spectrum of scenarios, from safe to dangerous, to reduce the spread of COVID-19.

## Brainstorming
- pdf scraper
- multi-class classification?  Logistic regression? Recommender system?
- bag of words, N-Grams, Tf-Idf

## Plan of Attack
We are treating this like a sentiment analysis problem.  Instead of the labels being positive or negative, the labels will be low, medium, or high risk.  Using the research Rhodora did to find the risky/safe activities, I created a small labeled dataset to train the model on.  

## Part 1: Read the Data
Reading in the self-generated labeled training dataset and fake test dataset.

In [59]:
import pandas as pd
import numpy as np

In [48]:
train = pd.read_csv("selfGeneratedLabelledDataset.csv")
train.sample(10)

Unnamed: 0,key_words,risk_score
161,outdoor spaces,0
172,stay home,0
105,pumping gasoline,0
134,don't wear a mask,2
71,getting nails done,1
248,going for a jog,0
101,staying at a hotel,1
245,watching tv with friends,2
176,cough,2
221,exposed to covid-19,2
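
The `Unnamed: 0` column in the output above is most likely the row index that was written out when the CSV was saved. A minimal sketch of how `index_col=0` keeps it out of the data columns, using a hypothetical in-memory CSV (three rows copied from the sample above) in place of `selfGeneratedLabelledDataset.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; rows copied from the sample output above.
csv_text = """,key_words,risk_score
161,outdoor spaces,0
134,don't wear a mask,2
71,getting nails done,1
"""

# index_col=0 uses the first (unnamed) column as the row index
# instead of loading it as a data column called "Unnamed: 0".
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo.columns.tolist())
```

Whether the real file has this layout is an assumption; if it does, the same argument should work on the `pd.read_csv` calls in this notebook.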


This training set was derived from research by Rhodora.  I captured the activities and assigned a risk score:
- 0 = low risk
- 1 = medium risk
- 2 = high risk
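
As a rough illustration of the scoring scheme, the sketch below attaches readable names to the scores and counts each class. The `RISK_NAMES` mapping is a hypothetical helper, not part of the original notebook. The rows are the ten sampled training rows shown earlier, where the highest score is 2.

```python
import pandas as pd

# Hypothetical helper mapping (not in the original notebook).
RISK_NAMES = {0: "low", 1: "medium", 2: "high"}

# The ten rows sampled from the training set earlier in this notebook.
sample = pd.DataFrame({
    "key_words": [
        "outdoor spaces", "stay home", "pumping gasoline", "don't wear a mask",
        "getting nails done", "going for a jog", "staying at a hotel",
        "watching tv with friends", "cough", "exposed to covid-19",
    ],
    "risk_score": [0, 0, 0, 2, 1, 0, 1, 2, 2, 2],
})

# Translate the numeric score into a readable label and count each class.
sample["risk_label"] = sample["risk_score"].map(RISK_NAMES)
label_counts = sample["risk_label"].value_counts()
print(label_counts)
```

Running the same `value_counts()` on the full `train.risk_score` column would show how balanced the classes are, which matters for any model that falls back to the majority class.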

These websites provided lists of activities and scored them according to risk levels as published by the CDC.

[1] https://www.cdc.gov/coronavirus/2019-ncov/community/large-events/considerations-for-events-gatherings.html

[2] https://www.texmed.org/TexasMedicineDetail.aspx?id=54216

[3] https://finance.yahoo.com/news/coronavirus-health-experts-ranked-activities-risk-132702304.html

[4] https://www.ksdk.com/article/news/health/coronavirus/covid-19-risk-chart/63-723ae01d-4dc6-4a17-a8f0-1e68013515af

[5] https://library.stockton.edu/publichealth/COVID-19

[6] https://www.businessinsider.com/charts-show-coronavirus-risk-for-activities-2020-10

![HACKtheMACHINE](https://www.texmed.org/uploadedimages/Current/2016_About_TMA/Newsroom/News_Releases/COVID-19/TMA%20COVID%20%20309193%20Risk%20Assessment%20Chart.png)

![HACKtheMACHINE](https://infobeautiful4.s3.amazonaws.com/2020/03/Coronavirus-COVID19-riskiest-activities-03.png)

Next, the test set. These daily activities were taken from these websites:
- https://games4esl.com/daily-routine-examples/
- https://englishlive.ef.com/blog/english-in-the-real-world/useful-english-phrases-describe-daily-routine/


We will train a model on the training set and predict on the test set. I assigned the test risk scores myself, based on my general understanding of the research above (the same research behind the training set). Obviously, this is not a great test, but it's a reasonable proof of concept.

In [49]:
# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, warn_bad_lines=True
test = pd.read_csv("fakeTestData.csv", sep=',', on_bad_lines='warn')
test.sample(10)

Unnamed: 0,daily_activities,risk_score
32,take a taxi,2
9,wash the dishes,0
14,practice the guitar,0
1,have breakfast,0
5,get dressed,0
31,walk the dog,0
26,relax,0
30,feed the dog,0
16,exercise indoors,2
3,take a shower,0


## Part 2: Data processing

Vectorize the data using sklearn's `CountVectorizer`.

This is a simple way to encode each phrase as a vector of n-gram counts (a bag-of-words encoding rather than a true one-hot encoding). Follow-on steps will try more sophisticated approaches.
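
To make the encoding concrete, here is a small sketch with made-up phrases (not the real training set) and the bigram-only setting `ngram_range=(2, 2)` used in this notebook. The point it illustrates is that a phrase with fewer than two usable tokens produces no features at all:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up phrases for illustration only.
phrases = ["crowded indoor bar", "bike ride", "bar", "gym"]

# Bigram-only vectorizer, mirroring the ngram_range used in this notebook.
bigram_vectorizer = CountVectorizer(
    analyzer="word", ngram_range=(2, 2), stop_words="english"
)
counts = bigram_vectorizer.fit_transform(phrases).toarray()

# The learned vocabulary contains only two-word sequences.
vocab = sorted(bigram_vectorizer.vocabulary_)
# Number of bigram hits in each phrase; one-word phrases get zero.
hits_per_phrase = counts.sum(axis=1).tolist()

print(vocab)
print(hits_per_phrase)
```

Single-word entries such as "bar" and "gym" end up as all-zero rows, so a model trained on these vectors cannot tell them apart. Using `ngram_range=(1, 2)` would keep the single words as features as well.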

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

In [76]:
X_train = train.key_words
y_train = train.risk_score

X_test = test.daily_activities
y_test = test.risk_score

In [101]:
vectorizer = CountVectorizer(
    analyzer = 'word',
    ngram_range=(2,2),
    lowercase = True,
    strip_accents='unicode',
    stop_words='english'
)

In [102]:
features = vectorizer.fit_transform(
    X_train
)
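# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
vectorizer.get_feature_names()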

['19 positive',
 '19 vaccine',
 '500 plus',
 'achy muscles',
 'air facilities',
 'air filtration',
 'air travel',
 'alcohol bar',
 'amusement park',
 'amusement parks',
 'attend wedding',
 'attending backyard',
 'attending concert',
 'attending large',
 'attending religious',
 'backyard barbeque',
 'bank online',
 'bank person',
 'bar club',
 'bar friends',
 'barber shop',
 'barbershop visit',
 'barriers like',
 'basketball friends',
 'beauty salon',
 'bicycle ride',
 'bike ride',
 'boat ride',
 'building air',
 'busy area',
 'car ride',
 'choir practice',
 'class indoors',
 'class inside',
 'cleaning disinfection',
 'close contact',
 'close proximity',
 'closely interact',
 'commonly used',
 'concert music',
 'concert play',
 'concert venue',
 'confined area',
 'contact people',
 'contact sports',
 'convalescence homes',
 'coronavirus positive',
 'cover mouth',
 'covid 19',
 'crowded beach',
 'crowded classroom',
 'crowded grocery',
 'crowded household',
 'crowded retail',
 'crowded s

In [103]:
features_arr = features.toarray() # for easy usage
features_arr.shape

(258, 306)

In [105]:
features_arr

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [106]:
# now apply same vectorizer to test data
X_test_vect = vectorizer.transform(X_test).toarray()
X_test_vect

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
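
Because `transform` only counts n-grams that were seen during `fit`, a test phrase with no known bigram becomes an all-zero row. The sketch below uses made-up phrases and hypothetical variable names to show one way to measure how many test rows have any vocabulary hit at all:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up phrases for illustration only.
train_phrases = ["crowded bar", "bike ride", "indoor gym class"]
test_phrases = ["wake up", "bike ride to school", "crowded bar"]

vec = CountVectorizer(ngram_range=(2, 2), stop_words="english")
vec.fit(train_phrases)

# Unknown bigrams are silently ignored by transform().
test_counts = vec.transform(test_phrases).toarray()
hits = test_counts.sum(axis=1)

# Fraction of test phrases that share at least one bigram with the training set.
coverage = (hits > 0).mean()
print(hits.tolist(), coverage)
```

Applying the same check to `X_test_vect` is a quick way to see whether the model has any signal to work with on the test set.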

## Part 3: Train a Model

Since the labels range from 0 to 2 and are ordered (ordinal rather than purely categorical), I will treat this as a regression problem and then threshold the predictions to 0, 1, or 2.

I'll try a random forest regressor as a first model.
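
The notebook does not spell out the thresholding step, so the sketch below is only one possible version. It assumes rounding to the nearest integer and clipping to the label range; `to_risk_class` is a hypothetical helper name.

```python
import numpy as np

def to_risk_class(raw_pred, low=0, high=2):
    """Round continuous predictions to the nearest integer and clip to [low, high]."""
    return np.clip(np.rint(raw_pred), low, high).astype(int)

# Made-up regression outputs, including values outside the label range.
raw = np.array([-0.3, 0.2, 1.332856, 1.7, 2.6])
classes = to_risk_class(raw)
print(classes.tolist())
```

Other cut points (for example, treating anything above 1.2 as high risk) are equally valid choices and could be tuned on held-out data.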

In [107]:
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators=1000, max_features=0.7, bootstrap=True, max_depth=2, random_state=0)

In [108]:
regr.fit(features_arr, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
                      max_features=0.7, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=1000,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [109]:
regr.score(X_test_vect, y_test)

-1.2413828545202987
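
For a regressor, `score` returns the coefficient of determination R^2, which is 0 for a model that always predicts the mean of the true labels and negative for anything worse than that. A small worked example with made-up labels shows how a constant prediction far from the mean produces a negative score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up labels: mostly low risk, one high risk.
y_true = np.array([0, 0, 0, 2])
# A model that predicts roughly the same value for every row.
y_const = np.full(4, 1.33)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_const) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_const)
# Predicting the mean of y_true gives exactly 0.
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))
print(r2_manual, r2_sklearn, r2_mean)
```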

In [110]:
test['pred'] = regr.predict(X_test_vect)
test

Unnamed: 0,daily_activities,risk_score,rfcPred,vectSum,pred
0,wake up,0,0,0,1.332856
1,have breakfast,0,0,0,1.332856
2,brush your teeth,0,2,1,1.332856
3,take a shower,0,0,0,1.332856
4,take a bath,0,0,0,1.332856
5,get dressed,0,0,0,1.332856
6,go to school,2,2,1,1.332856
7,study English,0,0,0,1.332856
8,have lunch with others,2,0,0,1.332856
9,wash the dishes,0,2,1,1.332856


In [111]:
train['pred'] = regr.predict(features_arr)
train

Unnamed: 0,key_words,risk_score,pred
0,exposure during travel,2,1.332856
1,bar,2,1.332856
2,large music concerts,2,1.332856
3,buffet,2,1.332856
4,gym,2,1.332856
...,...,...,...
253,zoom meetings with family,0,1.282310
254,convalescence homes,2,1.332856
255,retirement facilities,2,1.332856
256,virtual meeting,0,1.306309


## OK, what to do now...
Clearly the results are bad: the R^2 score is negative and nearly every prediction is the same value (about 1.33).

Some options:
1. play with the vectorization encoding (currently `ngram_range=(2, 2)`)
2. play with the analyzer (`'word'`, `'char'`, `'char_wb'`)
3. play with the model: try a different kind of model and tune the hyperparameters
    - classification vs regression
4. try a different NLP encoder (e.g. word2vec)
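
As one way to explore these options, the sketch below swaps in Tf-Idf features over unigrams and bigrams with a logistic regression classifier, both of which appear in the brainstorming list. It is not what the notebook ends up doing, and the phrases and scores are a made-up toy set rather than the real training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (not the real labeled dataset).
toy_phrases = [
    "crowded bar", "indoor gym", "large music concert",
    "getting nails done", "staying at a hotel",
    "going for a jog", "stay home", "virtual meeting",
]
toy_scores = [2, 2, 2, 1, 1, 0, 0, 0]

# Unigrams + bigrams so that single-word phrases still produce features.
pipe = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipe.fit(toy_phrases, toy_scores)

toy_pred = pipe.predict(["jog in the park", "bar with friends"])
print(toy_pred)
```

Wrapping the vectorizer and model in one pipeline makes it easy to change `ngram_range`, the analyzer, or the estimator without touching the rest of the code.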

In [112]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=1000, max_features=0.7, bootstrap=True, max_depth=2, random_state=0)

In [115]:
rfc.fit(features_arr, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features=0.7, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [116]:
rfc.score(X_test_vect, y_test)

0.19444444444444445

In [117]:
test['rfcPred'] = rfc.predict(X_test_vect)
test

Unnamed: 0,daily_activities,risk_score,rfcPred,vectSum,pred
0,wake up,0,2,0,1.332856
1,have breakfast,0,2,0,1.332856
2,brush your teeth,0,2,1,1.332856
3,take a shower,0,2,0,1.332856
4,take a bath,0,2,0,1.332856
5,get dressed,0,2,0,1.332856
6,go to school,2,2,1,1.332856
7,study English,0,2,0,1.332856
8,have lunch with others,2,2,0,1.332856
9,wash the dishes,0,2,1,1.332856


Looking at this list, the classifier appears to default to 2, presumably because that is the largest class in the training set. But we really want it to default to 0 when no "hits" are found in our keyword list. So I'm going to hack the algorithm: if the sum of a row's feature vector is 0, default to class 0.
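
The override described above can be sketched as a small wrapper. `predict_with_default` is a hypothetical helper, and a `DummyClassifier` that always predicts the majority class stands in for the random forest so that the example stays self-contained:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

def predict_with_default(model, X, default_class=0):
    """Predict with `model`, but return `default_class` for all-zero feature rows."""
    X = np.asarray(X)
    raw_pred = model.predict(X)
    no_hits = X.sum(axis=1) == 0  # rows with no vocabulary matches
    return np.where(no_hits, default_class, raw_pred)

# Toy training data where class 2 is the majority class.
X_toy = np.array([[1, 0], [0, 1], [1, 1]])
y_toy = np.array([2, 2, 0])
majority_model = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# First row has no hits, second row has one.
X_new = np.array([[0, 0], [1, 0]])
adjusted = predict_with_default(majority_model, X_new)
print(adjusted.tolist())
```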

In [118]:
test['vectSum'] = np.sum(X_test_vect, axis=1) 
test

Unnamed: 0,daily_activities,risk_score,rfcPred,vectSum,pred
0,wake up,0,2,0,1.332856
1,have breakfast,0,2,0,1.332856
2,brush your teeth,0,2,0,1.332856
3,take a shower,0,2,0,1.332856
4,take a bath,0,2,0,1.332856
5,get dressed,0,2,0,1.332856
6,go to school,2,2,0,1.332856
7,study English,0,2,0,1.332856
8,have lunch with others,2,2,0,1.332856
9,wash the dishes,0,2,0,1.332856


In [119]:
test.rfcPred = np.where(test.vectSum==0, 0, test.rfcPred)

In [72]:
test

Unnamed: 0,daily_activities,risk_score,rfcPred,vectSum
0,wake up,0,0,0
1,have breakfast,0,0,0
2,brush your teeth,0,0,2
3,take a shower,0,0,0
4,take a bath,0,0,0
5,get dressed,0,0,0
6,go to school,2,2,3
7,study English,0,0,0
8,have lunch with others,2,2,2
9,wash the dishes,0,2,2


In [120]:
# fraction of rows where the prediction matches the label
accuracy = (test.rfcPred == test.risk_score).mean()
accuracy

0.7777777777777778

In [99]:
test[test.rfcPred==test.risk_score]

Unnamed: 0,daily_activities,risk_score,rfcPred,vectSum,pred
0,wake up,0,0,0,1.347026
1,have breakfast,0,0,0,1.347026
3,take a shower,0,0,0,1.347026
4,take a bath,0,0,0,1.347026
5,get dressed,0,0,0,1.347026
6,go to school,2,2,1,1.347026
7,study English,0,0,0,1.347026
10,read a book,0,0,0,1.347026
11,do your homework,0,0,0,1.347026
13,go to bed,0,0,0,1.347026


In [121]:
test[test.rfcPred!=test.risk_score]

Unnamed: 0,daily_activities,risk_score,rfcPred,vectSum,pred
6,go to school,2,0,0,1.332856
8,have lunch with others,2,0,0,1.332856
15,play with friends,2,0,0,1.332856
16,exercise indoors,2,0,0,1.332856
18,go shopping,1,0,0,1.332856
32,take a taxi,2,0,0,1.332856
33,go out to eat,2,0,0,1.332856
34,go to the mall,1,0,0,1.332856
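
Every row in the mismatch table above has `vectSum` 0 and `rfcPred` 0, which suggests the accuracy comes mostly from the default rule rather than from the model. The sketch below reconstructs that situation under stated assumptions: 36 test rows (28 labeled 0, 2 labeled 1, 6 labeled 2, inferred from the accuracy and the eight mismatches) and a predictor that always outputs 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Assumed label counts, inferred from the tables above (not read from the file).
y_true = np.array([0] * 28 + [1] * 2 + [2] * 6)
# Baseline that always predicts low risk.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# Recall per class: how many of the true rows of each class were found.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[0, 1, 2])
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(acc)
print(per_class_recall)
print(cm)
```

Under these assumptions the always-0 baseline matches the reported accuracy while finding none of the medium or high risk activities, so per-class recall or a confusion matrix is a more informative metric here than plain accuracy.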
