# Game 1 - Spectrum Labeling

What makes a situation safe? What makes it different? Can you develop a program to navigate future pandemics?
Recent events have demonstrated an ongoing struggle to determine what is "safe" during a pandemic. In this game, contestants will be tasked with utilizing readily available information, such as 100+ pages of the CDC Guidelines, and developing an algorithm to map a spectrum of scenarios, from safe to dangerous, to reduce the spread of COVID-19.

## Brainstorming
- pdf scraper
- multi-class classification?  Logistic regression? Recommender system?
- bag of words, N-Grams, Tf-Idf

## Plan of Attack
We are treating this like a sentiment analysis problem.  Instead of the labels being positive or negative, the labels will be low, medium, or high risk.  Using the research Rhodora did to find the risky/safe activities, I created a small labeled dataset to train the model on.  

## Part 1: Read the Data
Reading in the self-generated labeled training dataset and fake test dataset.

In [1]:
import pandas as pd
import numpy as np

In [4]:
train = pd.read_csv("selfGeneratedLabelledDataset.csv")
train.sample(10)

Unnamed: 0,key_words,risk_score
170,telework,0
234,walking the dog,0
79,taxi ride,1
256,virtual meeting,0
97,visiting an elderly friend or relative,1
55,meeting new people,2
39,see a concert or play,2
9,public pools,2
109,grocery shopping,0
212,Watching sports,2


This training set was derived from research by Rhodora.  I captured the activities and assigned a risk score:
- 0 = low risk
- 1 = medium risk
- 2 = high risk

These websites provided lists of activities and scored them according to risk levels as published by the CDC.

[1] https://www.cdc.gov/coronavirus/2019-ncov/community/large-events/considerations-for-events-gatherings.html

[2] https://www.texmed.org/TexasMedicineDetail.aspx?id=54216

[3] https://finance.yahoo.com/news/coronavirus-health-experts-ranked-activities-risk-132702304.html

[4] https://www.ksdk.com/article/news/health/coronavirus/covid-19-risk-chart/63-723ae01d-4dc6-4a17-a8f0-1e68013515af

[5] https://library.stockton.edu/publichealth/COVID-19

[6] https://www.businessinsider.com/charts-show-coronavirus-risk-for-activities-2020-10

![HACKtheMACHINE](https://www.texmed.org/uploadedimages/Current/2016_About_TMA/Newsroom/News_Releases/COVID-19/TMA%20COVID%20%20309193%20Risk%20Assessment%20Chart.png)

![HACKtheMACHINE](https://infobeautiful4.s3.amazonaws.com/2020/03/Coronavirus-COVID19-riskiest-activities-03.png)

Next, the test set. These daily activities were derived from these websites:
- https://games4esl.com/daily-routine-examples/
- https://englishlive.ef.com/blog/english-in-the-real-world/useful-english-phrases-describe-daily-routine/


We will train a model on the training set and predict on the test set. I assigned the test-set risk scores myself, based on my general understanding of the research behind the training set. Obviously, this is not a great test, but it's a reasonable proof of concept.

In [5]:
# on_bad_lines replaces error_bad_lines/warn_bad_lines (pandas >= 1.3); 'warn' skips malformed rows with a warning
test = pd.read_csv("fakeTestData.csv", sep=',', on_bad_lines='warn')
test.sample(10)

Unnamed: 0,daily_activities,risk_score
16,exercise indoors,2
0,wake up,0
10,read a book,0
20,take out the trash,0
25,watch TV alone,0
29,iron the clothes,0
23,surf the internet,0
2,brush your teeth,0
30,feed the dog,0
8,have lunch with others,2


## Part 2: Data processing

Vectorize the data using sklearn's `CountVectorizer`.

This is a simple way to encode each phrase as a vector of word counts (a bag-of-words encoding). Follow-on steps will try more sophisticated approaches.
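To make the encoding concrete, here is a minimal sketch on three made-up phrases (they are not taken from the training file, and `demo_vectorizer` is just an illustrative name). Each column is one vocabulary word and each cell is how many times that word appears in the phrase.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases, invented for illustration only
phrases = ["bike ride", "car ride", "ride the bus to the bus stop"]

demo_vectorizer = CountVectorizer()
counts = demo_vectorizer.fit_transform(phrases).toarray()

# Recover the vocabulary in column order from the fitted word -> index mapping
vocab = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(vocab)
print(counts)
```

Note that the last phrase gets a count of 2 for "bus" and "the", so the vectors hold counts rather than strict 0/1 flags. A phrase made only of words outside the vocabulary would become an all-zero row.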

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
import nltk


class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

In [103]:
def nlpCleanup(df, columnName):
    # raw strings avoid invalid-escape warnings for \d, \b, \w and \s
    df[columnName] = df[columnName].str.replace(r'\d+', '', regex=True) # remove digits
    df[columnName] = df[columnName].str.replace(r'(\b\w{1,2}\b)', '', regex=True) # remove words of 1-2 characters
    df[columnName] = df[columnName].str.replace(r'[^\w\s]', '', regex=True) # remove punctuation
    return df
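
A quick way to see what these three regex passes do is to run them on a few sample strings. The sketch below uses a hypothetical `cleanup_demo` helper that mirrors the same patterns on a plain Series; the sample strings are made up.

```python
import pandas as pd

def cleanup_demo(series):
    # Hypothetical stand-in that applies the same three patterns as nlpCleanup
    series = series.str.replace(r'\d+', '', regex=True)          # digits
    series = series.str.replace(r'\b\w{1,2}\b', '', regex=True)  # words of 1-2 characters
    series = series.str.replace(r'[^\w\s]', '', regex=True)      # punctuation
    return series

sample = pd.Series(["go to school", "wake up", "talking on the phone"])
cleaned = cleanup_demo(sample)
print(cleaned.tolist())

# Collapse the leftover whitespace for easier reading
tidy = cleaned.str.split().str.join(" ")
print(tidy.tolist())
```

Short words such as "go", "to", "up" and "on" are dropped entirely, which is why cleaned phrases can lose some of their meaning (and why stray spaces are left behind until the whitespace is collapsed).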

In [22]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jaimie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jaimie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [106]:
train = nlpCleanup(train, columnName='key_words')
train

Unnamed: 0,key_words,risk_score,pred
0,exposure during travel,2,1.336815
1,bar,2,1.339192
2,large music concerts,2,1.355943
3,buffet,2,1.336815
4,gym,2,1.338362
...,...,...,...
255,retirement facilities,2,1.330153
256,virtual meeting,0,1.218284
257,person meeting,2,1.294385
258,talking the phone,0,1.319490


In [107]:
test = nlpCleanup(test, columnName='daily_activities')
test

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
0,wake,0,1.336815,0,0
1,have breakfast,0,1.336815,0,0
2,brush your teeth,0,1.336815,2,1
3,take shower,0,1.336815,0,0
4,take bath,0,1.336815,0,0
5,get dressed,0,1.336815,0,0
6,school,2,1.336815,2,1
7,study English,0,1.336815,0,0
8,have lunch with others,2,1.336815,0,0
9,wash the dishes,0,1.255282,2,1


In [108]:
X_train = train.key_words
y_train = train.risk_score


X_test = test.daily_activities
y_test = test.risk_score

In [109]:
vectorizer = CountVectorizer(
    analyzer = 'word',
    ngram_range=(2,2),
    lowercase = True,
    strip_accents='unicode',
    stop_words='english'
)

In [110]:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode', 
                                stop_words = 'english', 
                                lowercase = True, 
                                )

In [111]:
features = vectorizer.fit_transform(
    X_train
)
list(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

['achy muscles',
 'air filtration',
 'air travel',
 'alcohol bar',
 'amusement park',
 'amusement parks',
 'attend wedding',
 'attending backyard',
 'attending concert',
 'attending large',
 'attending religious',
 'backyard barbeque',
 'bank online',
 'bank person',
 'bar club',
 'bar friends',
 'barber shop',
 'barbershop visit',
 'barriers like',
 'basketball friends',
 'beauty salon',
 'bicycle ride',
 'bike ride',
 'boat ride',
 'building air',
 'busy area',
 'car ride',
 'choir practice',
 'class indoors',
 'class inside',
 'cleaning disinfection',
 'close contact',
 'close proximity',
 'closely interact',
 'commonly used',
 'concert music',
 'concert play',
 'concert venue',
 'confined area',
 'contact people',
 'contact sports',
 'convalescence homes',
 'coronavirus positive',
 'cover mouth',
 'covid positive',
 'covid vaccine',
 'crowded beach',
 'crowded classroom',
 'crowded grocery',
 'crowded household',
 'crowded retail',
 'crowded shopping',
 'crowded space',
 'crowded s

In [112]:
features = tf_vectorizer.fit_transform(
    X_train
)
list(tf_vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2



['achy',
 'activity',
 'air',
 'airplane',
 'airport',
 'alcohol',
 'amusement',
 'apart',
 'appointment',
 'area',
 'attend',
 'attending',
 'backyard',
 'bank',
 'bar',
 'barbeque',
 'barber',
 'barbershop',
 'barrier',
 'basketball',
 'bathroom',
 'bbq',
 'beach',
 'beauty',
 'bicycle',
 'bike',
 'biking',
 'boat',
 'bowling',
 'buffet',
 'building',
 'bus',
 'busy',
 'camping',
 'car',
 'cart',
 'casino',
 'celebration',
 'chill',
 'choir',
 'church',
 'cinema',
 'class',
 'classroom',
 'cleaning',
 'close',
 'closely',
 'club',
 'clubbing',
 'commonly',
 'concert',
 'confined',
 'contact',
 'convalescence',
 'cook',
 'coronavirus',
 'cough',
 'coughing',
 'cover',
 'covid',
 'crowd',
 'crowded',
 'curbside',
 'cut',
 'daycare',
 'dentist',
 'department',
 'dining',
 'dinner',
 'disinfect',
 'disinfecting',
 'disinfection',
 'distance',
 'distancing',
 'doctor',
 'dog',
 'don',
 'drink',
 'drinking',
 'drunk',
 'eating',
 'elderly',
 'emergency',
 'enclosed',
 'event',
 'exercise',

In [113]:
features_arr = features.toarray() # for easy usage
features_arr.shape

(260, 300)

In [114]:
features_arr

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [115]:
# now apply same vectorizer to test data
X_test_vect = tf_vectorizer.transform(X_test).toarray()
X_test_vect

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## Part 3: Train a Model

Since the labels range from 0 to 2 and are ordered (ordinal rather than purely categorical), I will treat this as a regression problem and then threshold the output to 0, 1, or 2.
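
The thresholding step itself is not shown in the cells here. A minimal sketch, assuming simple rounding to the nearest label and clipping to the valid range (the helper name `to_risk_class` and the example values are hypothetical):

```python
import numpy as np

def to_risk_class(continuous_pred, low=0, high=2):
    """Round continuous predictions to the nearest integer label within [low, high]."""
    return np.clip(np.rint(continuous_pred), low, high).astype(int)

# Made-up regressor outputs, including values outside the label range
example_preds = np.array([-0.2, 0.4, 1.34, 1.6, 2.7])
print(to_risk_class(example_preds))
```

Other cut points are possible (for example, thresholds tuned on held-out data); rounding is just the simplest choice.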

Going to try a random forest regressor as a first model.

In [116]:
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(n_estimators=1000, max_features=0.7, bootstrap=True, max_depth=2, random_state=0)

In [117]:
regr.fit(features_arr, y_train)

RandomForestRegressor(max_depth=2, max_features=0.7, n_estimators=1000,
                      random_state=0)

In [118]:
regr.score(X_test_vect, y_test)

-1.236333840612851
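
For a regressor, `score` returns the coefficient of determination (R²), so a negative value means the model does worse than always predicting the mean of the test labels. A small sketch with made-up labels (not the real test set) shows how a near-constant prediction far from that mean ends up below zero:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up labels, mostly low risk
y_true = np.array([0, 0, 0, 0, 2])

# Predicting the mean of y_true everywhere gives R^2 of zero
mean_baseline = np.full(len(y_true), y_true.mean())
r2_mean = r2_score(y_true, mean_baseline)
print(r2_mean)

# A constant guess far from that mean scores below zero
constant_guess = np.full(len(y_true), 1.34)
r2_const = r2_score(y_true, constant_guess)
print(r2_const)
```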

In [119]:
test['pred'] = regr.predict(X_test_vect)
test

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
0,wake,0,1.336685,0,0
1,have breakfast,0,1.336685,0,0
2,brush your teeth,0,1.336685,2,1
3,take shower,0,1.336685,0,0
4,take bath,0,1.336685,0,0
5,get dressed,0,1.336685,0,0
6,school,2,1.336685,2,1
7,study English,0,1.336685,0,0
8,have lunch with others,2,1.336685,0,0
9,wash the dishes,0,1.261665,2,1


In [120]:
train['pred'] = regr.predict(features_arr)
train

Unnamed: 0,key_words,risk_score,pred
0,exposure during travel,2,1.336685
1,bar,2,1.338400
2,large music concerts,2,1.360909
3,buffet,2,1.336685
4,gym,2,1.337434
...,...,...,...
255,retirement facilities,2,1.334018
256,virtual meeting,0,1.209264
257,person meeting,2,1.290780
258,talking the phone,0,1.317945


## OK, what to do now...
Clearly the results are poor: the R² score is negative, and nearly every prediction shown sits close to 1.34 no matter what the activity is.

Some options:

1. play with the vectorization encoding (e.g. `ngram_range=(2, 2)`)
2. play with the `analyzer` setting (`'word'`, `'char'`, `'char_wb'`)
3. play with the model: try a different kind of model and tune the hyperparameters
    - classification vs regression
4. try a different NLP encoder (e.g. word2vec)
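
As one illustration of combining a few of these options, here is a sketch that swaps in character n-grams with Tf-Idf weighting and a logistic regression classifier. This is not what the notebook runs next; the pipeline, the `ngram_range=(2, 4)` setting, and `class_weight="balanced"` are all assumptions chosen for the example, and the ten rows are the sampled training rows shown in Part 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The ten sampled rows from Part 1; a real run would use train.key_words / train.risk_score
toy_text = ["telework", "walking the dog", "taxi ride", "virtual meeting",
            "visiting an elderly friend or relative", "meeting new people",
            "see a concert or play", "public pools", "grocery shopping",
            "Watching sports"]
toy_risk = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

alt_model = make_pipeline(
    # Character n-grams inside word boundaries can match partial or unseen words
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    # Balanced class weights are an assumption, meant to soften majority-class bias
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
alt_model.fit(toy_text, toy_risk)
print(alt_model.predict(["virtual concert", "taxi ride"]))
```

Character n-grams mean a test phrase rarely maps to an all-zero vector, at the cost of matching on spelling rather than meaning.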

In [121]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=1000, max_features=0.7, bootstrap=True, max_depth=2, random_state=0)

In [122]:
rfc.fit(features_arr, y_train)

RandomForestClassifier(max_depth=2, max_features=0.7, n_estimators=1000,
                       random_state=0)

In [123]:
rfc.score(X_test_vect, y_test)

0.19444444444444445

In [124]:
test['rfcPred'] = rfc.predict(X_test_vect)
test

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
0,wake,0,1.336685,2,0
1,have breakfast,0,1.336685,2,0
2,brush your teeth,0,1.336685,2,1
3,take shower,0,1.336685,2,0
4,take bath,0,1.336685,2,0
5,get dressed,0,1.336685,2,0
6,school,2,1.336685,2,1
7,study English,0,1.336685,2,0
8,have lunch with others,2,1.336685,2,0
9,wash the dishes,0,1.261665,2,1


Looking at this list, the classifier appears to default to 2, because this is the largest class in the training set. But we really want it to default to 0 when no "hits" are found in our keyword vocabulary. Because of this, I'm going to hack the algorithm: if the sum of a row's feature vector is 0, default to class 0.
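
The same fallback rule can be written as a small reusable helper. This is a sketch only: `predict_with_fallback` and the `AlwaysTwo` stand-in model are hypothetical names, and the demo matrix is made up.

```python
import numpy as np

def predict_with_fallback(model, X_vect, fallback_class=0):
    """Predict with `model`, but return `fallback_class` for all-zero feature rows."""
    X_vect = np.asarray(X_vect)
    preds = np.asarray(model.predict(X_vect))
    no_hits = X_vect.sum(axis=1) == 0  # rows containing no known vocabulary words
    return np.where(no_hits, fallback_class, preds)


class AlwaysTwo:
    """Stand-in model that always predicts class 2, like the majority-class behaviour above."""
    def predict(self, X):
        return np.full(len(X), 2)


X_demo = np.array([[0, 0, 0], [0, 1, 0], [2, 0, 1]])
print(predict_with_fallback(AlwaysTwo(), X_demo))
```

Only the first row has no vocabulary hits, so only that row is forced to the fallback class; the other rows keep the model's prediction.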

In [125]:
test['vectSum'] = np.sum(X_test_vect, axis=1) 
test

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
0,wake,0,1.336685,2,0
1,have breakfast,0,1.336685,2,0
2,brush your teeth,0,1.336685,2,1
3,take shower,0,1.336685,2,0
4,take bath,0,1.336685,2,0
5,get dressed,0,1.336685,2,0
6,school,2,1.336685,2,1
7,study English,0,1.336685,2,0
8,have lunch with others,2,1.336685,2,0
9,wash the dishes,0,1.261665,2,1


In [126]:
test.rfcPred = np.where(test.vectSum==0, 0, test.rfcPred)
test

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
0,wake,0,1.336685,0,0
1,have breakfast,0,1.336685,0,0
2,brush your teeth,0,1.336685,2,1
3,take shower,0,1.336685,0,0
4,take bath,0,1.336685,0,0
5,get dressed,0,1.336685,0,0
6,school,2,1.336685,2,1
7,study English,0,1.336685,0,0
8,have lunch with others,2,1.336685,0,0
9,wash the dishes,0,1.261665,2,1


In [127]:
accuracy = (test.rfcPred == test.risk_score).mean()  # fraction of rows where prediction matches label
accuracy

0.5833333333333334

In [128]:
#correct results
test[test.rfcPred==test.risk_score]

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
0,wake,0,1.336685,0,0
1,have breakfast,0,1.336685,0,0
3,take shower,0,1.336685,0,0
4,take bath,0,1.336685,0,0
5,get dressed,0,1.336685,0,0
6,school,2,1.336685,2,1
7,study English,0,1.336685,0,0
10,read book,0,1.336685,0,0
11,your homework,0,1.336685,0,0
13,bed,0,1.336685,0,0


In [129]:
#incorrect results
test[test.rfcPred!=test.risk_score]

Unnamed: 0,daily_activities,risk_score,pred,rfcPred,vectSum
2,brush your teeth,0,1.336685,2,1
8,have lunch with others,2,1.336685,0,0
9,wash the dishes,0,1.261665,2,1
12,cook dinner,0,1.317937,2,2
14,practice the guitar,0,1.333913,2,1
17,brush your hair,0,1.336685,2,1
18,shopping,1,1.336685,2,1
19,for walk,0,1.324371,2,1
21,clean the house,0,1.336685,2,1
24,water the plants,0,1.334088,2,1
