## Determine Relevant Texts: NLP and Classification Modeling
_Authors: Amy Taylor and Veronica Giannota_

**Summary:** Collecting every tweet that contains the phrase "road closed" returns a large number of useless tweets, so a pre-filtering step is needed to separate the relevant tweets from the rest. A relevant tweet is one that (1) gives cross-street information for a distinct location that can be placed on a map and (2) reports a road that is FULLY blocked and inaccessible rather than partially closed (e.g., some lanes closed).

**Our strategy:** Build a classifier that can filter relevant tweets from useless tweets using natural language processing methods and a logistic regression classifier model. 

**Method**: 
<br> A corpus of 143 tweets (collected in Notebook #1) was downloaded and pre-labeled as either:
- **0 = unrelated OR useless** (EX: "MVHS will remain closed tomorrow due to concerns about road conditions.")
- **1 = related, but road not FULLY blocked** (EX: "Road construction. right lanes closed in #Pima on I-10 EB at Ruthrauff Rd")
- **2 = relevant, road is fully blocked AND street info provided**

Once labeled, the corpus was: 
1. Split into training and testing data
2. Word-vectorized using scikit-learn's `CountVectorizer`
3. Fit to a logistic regression model and scored on the accuracy of the predicted class (0, 1, or 2)

This notebook outlines these steps, analyzes the mislabeled tweets, and discusses the viability of the model altogether.

### Sections
  [1. Classification Modeling](#classification)
<br>  [2. Misclassified Tweet Analysis](#analysis)

### Section 1: Classification Modeling

Load imports

In [123]:
import pandas as pd
from sklearn.model_selection import train_test_split
   
from sklearn.feature_extraction.text import CountVectorizer
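# The old sklearn.feature_extraction.stop_words module is no longer importable in
# current scikit-learn; the stop-word list is exposed publicly as ENGLISH_STOP_WORDS.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS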
from sklearn.linear_model import LogisticRegression

%matplotlib inline

Read in the dataframe containing the text of the 143 tweets along with the pre-labeled `rating` column (the class label).

In [124]:
df = pd.read_csv("../data/twitter_corpus.csv", sep='\t', encoding='latin-1')
df.head()

Unnamed: 0,tweed_id,rating,tweet
0,0,1,"Road construction, left lane closed in #Albuqu..."
1,1,1,Road construction. right lanes closed in #Pima...
2,2,1,"Road construction, shoulder closed in #ElPaso ..."
3,3,0,Ughhh at the dentist for a cleaning and the si...
4,4,1,Road constructions. two right lanes closed in ...


**Step 1.1_ Determine the baseline model accuracy score**

In [125]:
df['rating'].value_counts()

2    58
1    54
0    31
Name: rating, dtype: int64

In [131]:
print(58 / (58 + 54 + 31))
print(54 / (58 + 54 + 31))
print(31 / (58 + 54 + 31))

0.40559440559440557
0.3776223776223776
0.21678321678321677


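Our baseline score is ~40%: always predicting the most common class (class 2, 58 of 143 tweets) would be right about 40.6% of the time. Any model we fit should perform better than that.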
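
The same baseline can be read directly from the class proportions. The following is a minimal sketch that rebuilds the label column from the counts printed above; in the notebook itself `df['rating']` would be used instead, and the variable names here are illustrative only.

```python
import pandas as pd

# Rebuild the label column from the value_counts() output shown above
# (stand-in for df['rating'] in the notebook).
ratings = pd.Series([2] * 58 + [1] * 54 + [0] * 31, name="rating")

# normalize=True returns each class's share instead of its raw count
class_share = ratings.value_counts(normalize=True)

majority_class = class_share.idxmax()    # most frequent label
baseline_accuracy = class_share.max()    # accuracy of always predicting it

print(class_share.round(3))
print(f"baseline: always predict class {majority_class} -> {baseline_accuracy:.3f}")
```

Using `value_counts(normalize=True)` avoids retyping the counts by hand if the corpus changes.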

 <a id='classification'></a>
**Step 1.2_ Train/test/split the data and CountVectorize**
- For `CountVectorizer` we keep the default parameters except `ngram_range=(1, 2)`, which adds bigrams alongside single words. `analyzer = "word"` is already the default and is passed explicitly only to make the word-level tokenization visible.

In [127]:
X = df['tweet']
y = df['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

# instantiate countvectorizer
vect = CountVectorizer(analyzer = "word", ngram_range=(1, 2))

# fit on the training data, transform training and test data
train_data = vect.fit_transform(X_train)
test_data = vect.transform(X_test)
train_data = train_data.toarray()

In [128]:
# How many features did countvectorize create?
print(train_data.shape)
print(test_data.shape)

(95, 1831)
(48, 1831)


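>Using these `CountVectorizer` parameters, 1831 features were created from the 95 training tweets (the vectorizer is fit on the training split only), far more features than there are tweets.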
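
To make the effect of `ngram_range=(1, 2)` concrete, here is a small sketch on two made-up tweets (they are not from the corpus, and the `toy_` names are illustrative). It lists the unigram and bigram features that the vectorizer learns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical tweets, used only to illustrate the feature space
toy_tweets = [
    "Road closed at Main St",
    "Right lane closed on Main St",
]

toy_vect = CountVectorizer(analyzer="word", ngram_range=(1, 2))
toy_counts = toy_vect.fit_transform(toy_tweets)

# vocabulary_ maps each learned n-gram to its column index
features = sorted(toy_vect.vocabulary_)
unigrams = [f for f in features if " " not in f]
bigrams = [f for f in features if " " in f]

print(toy_counts.shape)
print("unigrams:", unigrams)
print("bigrams: ", bigrams)
```

Two short tweets already yield 16 features (8 unigrams and 8 bigrams), which is why the feature count grows past the number of tweets so quickly.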

**Step 1.3_ Fit a Logistic Regression model to our vectorized data**

In [129]:
lr = LogisticRegression()

lr.fit(train_data, y_train)
print(lr.score(train_data, y_train))
print(lr.score(test_data, y_test))

1.0
0.8958333333333334




On the 48 held-out tweets the model scores ~90% accuracy (43 of 48 predicted correctly across all three classes). The perfect training score (1.0) against ~0.90 on the test set indicates that the model is overfit to the training data, which is not surprising with 1831 features and only 95 training tweets. A much larger set of labeled tweets could reduce this.
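
With so few tweets, a single train/test split gives a noisy estimate. One option, which this notebook does not use, is stratified k-fold cross-validation. The sketch below assumes a tiny hypothetical corpus as a stand-in for `df['tweet']` and `df['rating']`, and wraps the vectorizer and classifier in a pipeline so that the vocabulary is re-fit inside each fold.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in corpus (three tweets per class), not real data
texts = np.array([
    "school closed tomorrow due to road conditions",
    "closed the book on that long road trip",
    "the store on the main road closed early",
    "right lane closed on I-10 at Main St",
    "left lane closed on Route 9 near Elm St",
    "shoulder closed on I-25 at Oak Ave",
    "road closed at Main St and First Ave",
    "Elm St closed in both directions at Oak Ave",
    "Route 9 closed at Pine Rd due to a crash",
])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# The pipeline fits CountVectorizer on each training fold only,
# so no vocabulary leaks in from the held-out fold.
model = make_pipeline(
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv)

print("fold accuracies:", scores)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much the ~90% figure might move with a different split.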

Since we have three classes of tweets to distinguish, let's determine which classes the model predicts correctly and incorrectly. What matters most for filtering is class 2: which relevant tweets are missed, and which other tweets are wrongly predicted as class 2.

<a id='analysis'></a>

### Section 2. Misclassified Tweet Analysis: Examine which predictions are wrong


**Step 2.1_ Generate predictions on the test set**

In [14]:
pred = lr.predict(test_data)
pred

array([2, 1, 1, 0, 0, 2, 1, 2, 2, 1, 0, 2, 1, 1, 1, 0, 2, 2, 1, 2, 2, 1,
       0, 2, 2, 2, 1, 1, 0, 1, 0, 0, 2, 2, 0, 2, 2, 0, 2, 1, 1, 2, 1, 0,
       2, 0, 2, 0])

In [15]:
lr.coef_.shape

(3, 1831)

**Step 2.2_ Create a dataframe of the predictions from the test set**
- Add columns for the `actual` classification and the `predicted` classification
- Add a column for `new_score`, which distinguishes whether the tweet was:
    - correctly labeled
    - incorrectly labeled (confusing classes 0/1, 1/2, or 0/2)
- `new_score` = `actual` + (4 * `predicted`)
- The possible `new_score` values are summarized in the following table (rows are the predicted class, columns are the actual class):

| -| Class 0 = useless tweet| Class 1 = related tweet| Class 2 = relevant tweet|
|---|---|---|---|
| **labeled as 0**| 0 = correct prediction|1 |2 |
| **labeled as 1**| 4| 5 = correct prediction| 6|
| **labeled as 2**| 8| 9|10 = correct prediction |
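
Because the three classes are 0, 1, and 2, multiplying `predicted` by 4 keeps every (actual, predicted) pair on its own score, so a score can be decoded back into the pair. The sketch below illustrates this; the helper names and the toy labels are hypothetical, and `pd.crosstab` is shown as an alternative way to get the same grid without an encoded score.

```python
import pandas as pd

def encode_score(actual, predicted):
    """Combine the two labels into one score, as in the table above."""
    return actual + 4 * predicted

def decode_score(new_score):
    """Recover (actual, predicted) from an encoded score."""
    predicted, actual = divmod(new_score, 4)
    return actual, predicted

# Reproduce the table: one row per predicted class, one column per actual class
for predicted in range(3):
    print([encode_score(actual, predicted) for actual in range(3)])

# Alternative: a confusion table built directly from the two label columns.
# These five label pairs are made up for illustration.
toy_actual = pd.Series([2, 1, 0, 2, 2], name="actual")
toy_predicted = pd.Series([2, 1, 0, 0, 1], name="predicted")
confusion = pd.crosstab(toy_predicted, toy_actual)
print(confusion)
```

For example, `decode_score(6)` gives `(2, 1)`: an actual class 2 tweet predicted as class 1.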
    

In [67]:
wrong_pred = pd.DataFrame(X_test, columns =['tweet'])

In [81]:
wrong_pred.loc[:, 'actual'] = y_test
wrong_pred.loc[:, 'predicted'] = pred
wrong_pred.loc[:, 'new_score'] = wrong_pred['actual'] + (4* wrong_pred['predicted'])
wrong_pred.head()

Unnamed: 0,tweet,actual,predicted,new_score
117,Road closed. broken glass on roadway. in #Cora...,2,2,10
19,Road construction. two left lanes closed in #F...,1,1,5
82,"Road construction, left lane closed in #Brevar...",1,1,5
97,ThereÕs no Wells Fargo Bank on Beatties Ford R...,0,0,0
56,@FAIRImmigration @NBCNews @JuliaEAinsley You m...,0,0,0


How many distinct (actual, predicted) combinations occurred? (i.e., how many different `new_score` values were generated?)

In [82]:
wrong_pred['new_score'].unique()

array([10,  5,  0,  2,  6,  1])

In [83]:
wrong_pred['new_score'].value_counts()

10    20
5     14
0      9
2      3
6      1
1      1
Name: new_score, dtype: int64

- Out of the nine possible `new_score` values, only six were generated.
- Scores of 10, 5, and 0 are correct predictions (43 of 48 tweets)
- Incorrect predictions have scores of 2, 6, and 1 (5 of 48 tweets)
    - Score = 2: Three **relevant** tweets incorrectly predicted as **useless**
    - Score = 6: One **relevant** tweet incorrectly predicted as **related**
    - Score = 1: One **related** tweet incorrectly predicted as **useless**

> These results work in our favor, because no related or useless tweets in the test set were incorrectly labeled as relevant. That is, the model produced no false positives for class 2, at the expense of four relevant tweets being incorrectly filtered out.
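
The trade-off can be expressed as precision and recall for class 2. The sketch below recomputes both from the `new_score` counts reported above; the dictionary layout and the helper function are illustrative, not part of the original notebook.

```python
# (actual, predicted) -> number of test tweets, copied from the new_score counts
pair_counts = {
    (2, 2): 20,  # new_score 10
    (1, 1): 14,  # new_score 5
    (0, 0): 9,   # new_score 0
    (2, 0): 3,   # new_score 2
    (2, 1): 1,   # new_score 6
    (1, 0): 1,   # new_score 1
}

def precision_recall(pair_counts, cls):
    """Precision and recall for one class from (actual, predicted) counts."""
    true_pos = pair_counts.get((cls, cls), 0)
    predicted_as_cls = sum(n for (_, p), n in pair_counts.items() if p == cls)
    actually_cls = sum(n for (a, _), n in pair_counts.items() if a == cls)
    precision = true_pos / predicted_as_cls if predicted_as_cls else float("nan")
    recall = true_pos / actually_cls if actually_cls else float("nan")
    return precision, recall

precision_2, recall_2 = precision_recall(pair_counts, cls=2)
print(f"class 2 precision: {precision_2:.3f}, recall: {recall_2:.3f}")
```

On this test set, every tweet predicted as class 2 really was class 2 (precision 1.0), while 20 of the 24 relevant tweets were recovered (recall of about 0.83).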

> Ideally, our model will read in a set of unseen tweets, classify them as relevant or not, and only feed the relevant tweets into the next step (or notebook) that uses regex to extract the location from the Tweets.
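
A minimal sketch of that filtering step follows, assuming a fitted vectorizer and classifier are available. The `keep_relevant` helper, the toy training tweets, and the unseen tweets are all hypothetical; in the notebook the fitted `vect` and `lr` objects would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def keep_relevant(tweets, vectorizer, classifier, relevant_class=2):
    """Return only the tweets that the classifier predicts as relevant."""
    features = vectorizer.transform(tweets)   # reuse the fitted vocabulary
    predicted = classifier.predict(features)
    return [t for t, label in zip(tweets, predicted) if label == relevant_class]

# Hypothetical training data, standing in for the labeled corpus
train_tweets = [
    "school closed tomorrow due to road conditions",
    "the store on the main road closed early",
    "right lane closed on I-10 at Main St",
    "left lane closed on Route 9 near Elm St",
    "road closed at Main St and First Ave",
    "Route 9 closed at Pine Rd due to a crash",
]
train_labels = [0, 0, 1, 1, 2, 2]

toy_vect = CountVectorizer(analyzer="word", ngram_range=(1, 2))
toy_lr = LogisticRegression(max_iter=1000)
toy_lr.fit(toy_vect.fit_transform(train_tweets), train_labels)

# Hypothetical unseen tweets
unseen = [
    "road closed at Oak Ave and Pine Rd due to a crash",
    "gym closed today, going for a road run instead",
]
relevant = keep_relevant(unseen, toy_vect, toy_lr)
print(relevant)
```

Only the tweets returned by `keep_relevant` would be passed on to the location-extraction step.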

**Step 2.3_ Isolate Misclassified Tweets**
- Let's put these misclassified tweets into a dataframe so we can see where our model went wrong

In [108]:
# keep only the misclassified rows (new_score of 1, 2, or 6)
wrong = wrong_pred[wrong_pred['new_score'].isin([1, 2, 6])]

wrong = wrong.sort_values(by='new_score', ascending=False)
wrong

Unnamed: 0,tweet,actual,predicted,new_score
125,ROAD CLOSUREPark Road (Southbound lanes only) ...,2,1,6
51,Cougar fans traveling to Lakeland for the Girl...,2,0,2
10,The drainage project on Center Street in #Vine...,2,0,2
42,Manchester road off Wayne avenue is closed off...,2,0,2
40,Southbound 101 freeway still closed at lost hi...,1,0,1


In [110]:
for i in wrong['tweet']:
    print(i)
    print("--------")

ROAD CLOSUREPark Road (Southbound lanes only) between Selwyn Avenue and Tyvola Road are CLOSED and will re-opÉ https://t.co/TNZvF9lB8A
--------
Cougar fans traveling to Lakeland for the Girls NECC championship game. State Road 9 is closed in Wolcottville due to a structure fire, plan accordingly and find an alternative route. https://t.co/Be4mIPlhAM
--------
The drainage project on Center Street in #VineyardHaven continues this week - the road is closed to parking andÉ https://t.co/X04Uiougpu
--------
Manchester road off Wayne avenue is closed off by police. Giving traffic delivery updated as I see them
--------
Southbound 101 freeway still closed at lost hills. Agoura Road is jam southbound. Heading to the Civic arts Plaza fÉ https://t.co/tWzhAVh47x
--------


|Tweet| Actual Category | Misclassified As | Score |
|---| ---| ---| ---|
|ROAD CLOSUREPark Road (Southbound lanes only) between Selwyn Avenue and Tyvola Road are CLOSED and will re-op… | 2 | 1 = related / not fully blocked | 6 |
|Cougar fans traveling to Lakeland for the Girls NECC championship game. State Road 9 is closed in Wolcottville due to a structure fire, plan accordingly and find an alternative route.| 2 | 0 = unrelated | 2 |
|The drainage project on Center Street in #VineyardHaven continues this week - the road is closed to parking and… | 2 | 0 = unrelated | 2 |
|Manchester road off Wayne avenue is closed off by police. Giving traffic delivery updated as I see them| 2 | 0 = unrelated | 2 |
|Southbound 101 freeway still closed at lost hills. Agoura Road is jam southbound. Heading to the Civic arts Plaza f… | 1 | 0 = unrelated | 1 |