## Background

This project was started because I was curious how different populations view the top three Democratic presidential candidates. Joe Biden, Bernie Sanders, and Elizabeth Warren consistently poll as the top three, but not always in the same order. 
Prior to the creation of this notebook, tweets were gathered using [GetOldTweets3](https://pypi.org/project/GetOldTweets3/). This was accomplished using the shell scripts found in the /shellscripts folder. 

## Set-up

Below are the library imports, functions, and data imports that make this project possible.

### Imports

In [1]:
# PyData
import pandas as pd
import numpy as np

In [2]:
# VADER sentiment analyzer (from the vaderSentiment package)
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [3]:
# Statistics library
import statistics as stat

In [4]:
# Scikit Learn imports for ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

### Functions

I defined three functions to streamline the process of gathering and analyzing sentiment polarity scores.

In [5]:
# Function to get mean, median, min, max, and standard deviation of input
def get_stats(info):
    me = stat.mean(info)
    med = stat.median(info)
    mini = min(info)
    maxi = max(info)
    sdev = stat.stdev(info)
    
    return me,med,mini,maxi,sdev
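
As a quick illustration of what `get_stats()` returns, here is a minimal sketch using made-up polarity scores (the `scores` list is hypothetical and not taken from the tweet data). It also shows that `statistics.stdev` is the sample standard deviation, so it needs at least two values.

```python
import statistics as stat

def get_stats(info):
    # Same five summary statistics as get_stats() in the notebook
    return (stat.mean(info), stat.median(info),
            min(info), max(info), stat.stdev(info))

# Hypothetical negative-polarity scores for four tweets
scores = [0.0, 0.1, 0.2, 0.5]
me, med, mini, maxi, sdev = get_stats(scores)
print(me, med, mini, maxi, round(sdev, 4))

# stdev() raises an error when given fewer than two values
try:
    get_stats([0.3])
    single_value_ok = True
except stat.StatisticsError as err:
    single_value_ok = False
    print("too few values:", err)
```

A date with only one scraped tweet would therefore fail inside `get_stats()`, which is worth checking before running the analysis on a new date.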

In [6]:
# Function to get sentiment polarities
def get_sentiment(tweets):
    # Lists for each category of sentiment polarity
    neg = [] # negative
    pos = [] # positive
    neu = [] # neutral
    
    # Run the sentiment intensity analysis for input
    for t in tweets:
        # Get polarity scores for each tweet
        tmp_sia = analyzer.polarity_scores(t)
        # Variable assignment for the negative, positive, 
        # and neutral scores
        tmp_neg = tmp_sia['neg']
        tmp_pos = tmp_sia['pos']
        tmp_neu = tmp_sia['neu']
        
        # Append each tweet's neg, pos,and neu scores to
        # their respective lists
        neg.append(tmp_neg)
        pos.append(tmp_pos)
        neu.append(tmp_neu)
    
    # Return the lists
    return neg,pos,neu

In [7]:
def namedate(tweets):
    '''Take in the tweets from one date, get the sentiment
    scores, and get summary statistics from those scores.'''
    # Get the negative, positive, and neutral score lists
    neg, pos, neu = get_sentiment(tweets)
    # Get summary stats for each list and stack them into a
    # NumPy array of shape (3, 5): rows are neg, pos, neu and
    # columns are mean, median, min, max, standard deviation
    sents = np.array([get_stats(neg), get_stats(pos), get_stats(neu)])
    # Return the converted numpy array
    return sents

### Read in data

There are 13 different dates being analyzed and each candidate has one .csv file per date. After using Pandas to read in the .csv files, I consolidated all input into one array per candidate. 

In [8]:
biden0808 = pd.read_csv('csv/biden0808.csv')['text']
biden0815 = pd.read_csv('csv/biden0815.csv')['text']
biden0827 = pd.read_csv('csv/biden0827.csv')['text']
biden0907 = pd.read_csv('csv/biden0907.csv')['text']
biden0911 = pd.read_csv('csv/biden0911.csv')['text']
biden0912 = pd.read_csv('csv/biden0912.csv')['text']
biden0917 = pd.read_csv('csv/biden0917.csv')['text']
biden0921 = pd.read_csv('csv/biden0921.csv')['text']
biden0924 = pd.read_csv('csv/biden0924.csv')['text']
biden0929 = pd.read_csv('csv/biden0929.csv')['text']
biden1003 = pd.read_csv('csv/biden1003.csv')['text']
biden1007 = pd.read_csv('csv/biden1007.csv')['text']
biden1016 = pd.read_csv('csv/biden1016.csv')['text']

In [9]:
biden1016[0]

'Obama Endorses Justin Trudeau. He Still Hasn’t Endorsed Joe Biden - https://go.shr.lc/35E6gEF'

In [10]:
bidens = [biden0808,biden0815,biden0827,biden0907,biden0911,biden0912,biden0917,biden0921,biden0924,biden0929,biden1003,biden1007,biden1016]

In [11]:
sanders0808 = pd.read_csv('csv/sanders0808.csv')['text']
sanders0815 = pd.read_csv('csv/sanders0815.csv')['text']
sanders0827 = pd.read_csv('csv/sanders0827.csv')['text']
sanders0907 = pd.read_csv('csv/sanders0907.csv')['text']
sanders0911 = pd.read_csv('csv/sanders0911.csv')['text']
sanders0912 = pd.read_csv('csv/sanders0912.csv')['text']
sanders0917 = pd.read_csv('csv/sanders0917.csv')['text']
sanders0921 = pd.read_csv('csv/sanders0921.csv')['text']
sanders0924 = pd.read_csv('csv/sanders0924.csv')['text']
sanders0929 = pd.read_csv('csv/sanders0929.csv')['text']
sanders1003 = pd.read_csv('csv/sanders1003.csv')['text']
sanders1007 = pd.read_csv('csv/sanders1007.csv')['text']
sanders1016 = pd.read_csv('csv/sanders1016.csv')['text']

In [12]:
sanderss = [sanders0808,sanders0815,sanders0827,sanders0907,sanders0911,sanders0912,sanders0917,sanders0921,sanders0924,sanders0929,sanders1003,sanders1007,sanders1016]

In [13]:
sanders1016[0]

'BREAKING NEWS FROM WASHINGTON DC QUESTIONS NEED TO BE ANSWERED REGARDING MIKE PENCE AN CAIR MUSLIM BROTHERHOOD WHY WOULD MIKE PENCE BRING SOMEONE TO WASHINGTON DC TIED TO JOE BIDEN.. BERNIE SANDERS.. PLUS THEY HATE PRESIDENT TRUMP. THE QUESTION IS WHY \u2066 @realDonaldTrump\u2069pic.twitter.com/CwO6f13xp4'

In [14]:
warren0808 = pd.read_csv('csv/warren0808.csv')['text']
warren0815 = pd.read_csv('csv/warren0815.csv')['text']
warren0827 = pd.read_csv('csv/warren0827.csv')['text']
warren0907 = pd.read_csv('csv/warren0907.csv')['text']
warren0911 = pd.read_csv('csv/warren0911.csv')['text']
warren0912 = pd.read_csv('csv/warren0912.csv')['text']
warren0917 = pd.read_csv('csv/warren0917.csv')['text']
warren0921 = pd.read_csv('csv/warren0921.csv')['text']
warren0924 = pd.read_csv('csv/warren0924.csv')['text']
warren0929 = pd.read_csv('csv/warren0929.csv')['text']
warren1003 = pd.read_csv('csv/warren1003.csv')['text']
warren1007 = pd.read_csv('csv/warren1007.csv')['text']
warren1016 = pd.read_csv('csv/warren1016.csv')['text']

In [15]:
warrens = [warren0808,warren0815,warren0827,warren0907,warren0911,warren0912,warren0917,warren0921,warren0924,warren0929,warren1003,warren1007,warren1016]

In [16]:
warren1016[0]

"Warren becomes debate target as moderates vie for breakout: At Tuesday's Democratic presidential debate in Ohio, attacks on Sen. Elizabeth Warren started early and came from all sides, particularly from more… http://dlvr.it/RGLk7R #25thAmendmentNow #ImpeachTrump #TheResistancepic.twitter.com/1nDpcafANn"

## Data Manipulation and Analysis

The namedate function runs both the get_sentiment() and get_stats() functions, for sentiment polarity scores and for summary statistics, respectively. It then returns a NumPy array of the summary statistics for all sentiment scores on one date. For each candidate, I used a list comprehension to apply the namedate function to each collection of 100 tweets.

After this, I reshaped the arrays to have 13 rows (one per date), each with 15 items. The original shape was (13, 3, 5) because the negative, positive, and neutral arrays were still individually separated. To illustrate this point, I am displaying the shapes of both biden_stats (original) and the reshaped biden_np below, followed by the first item in each array.

#### Biden

In [17]:
biden_stats = [namedate(b) for b in bidens]
biden_np = np.array(biden_stats).reshape(13,15)

In [18]:
np.array(biden_stats).shape, biden_np.shape

((13, 3, 5), (13, 15))

In [19]:
biden_stats[0]

array([[0.12768   , 0.125     , 0.        , 0.636     , 0.12656079],
       [0.07823   , 0.05      , 0.        , 0.333     , 0.09125797],
       [0.79403   , 0.817     , 0.312     , 1.        , 0.14131578]])

In [20]:
biden_np[0]

array([0.12768   , 0.125     , 0.        , 0.636     , 0.12656079,
       0.07823   , 0.05      , 0.        , 0.333     , 0.09125797,
       0.79403   , 0.817     , 0.312     , 1.        , 0.14131578])

In [21]:
warren_stats = [namedate(w) for w in warrens]
warren_np = np.array(warren_stats).reshape(13,15)

In [22]:
sanders_stats = [namedate(s) for s in sanderss]
sanders_np = np.array(sanders_stats).reshape(13,15)
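
To make the reshaped layout concrete, here is a small sketch using a stand-in array (the `stats` array below is filled with a running counter and is purely illustrative). It shows that, after the row-major reshape, the statistic at position `(sentiment, statistic)` lands in column `sentiment * 5 + statistic`, so column 0 is the negative mean and column 5 is the positive mean.

```python
import numpy as np

# Stand-in for one candidate: 13 dates x 3 sentiments x 5 statistics.
# The values are a running counter so that positions are easy to trace.
stats = np.arange(13 * 3 * 5).reshape(13, 3, 5)
flat = stats.reshape(13, 15)

# Row-major reshape keeps each date's rows in order: neg, pos, neu
sentiment, statistic = 1, 0           # positive sentiment, mean
column = sentiment * 5 + statistic    # column 5
same = bool((flat[:, column] == stats[:, sentiment, statistic]).all())

# Column slicing pulls the value for every date at once
neg_means = flat[:, 0]
pos_means = flat[:, 5]
print(same, neg_means.shape, pos_means.shape)
```

The same column slicing (for example `biden_np[:, 0]`) could stand in for an index-by-index loop when collecting the per-date means.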

### Detour: Gather data for Tableau visualization

In [23]:
# Define empty lists to hold positive and negative means
b_neg_means = []
b_pos_means = []
w_neg_means = []
w_pos_means = []
s_neg_means = []
s_pos_means = []

for i in range(0,13):
    # Index the 2D list to get the 1st and 6th items
    # The mean negative scores are always 1st and the mean positive
    # scores are always 6th
    b_neg_means.append(biden_np[i][0])
    b_pos_means.append(biden_np[i][5])
    w_neg_means.append(warren_np[i][0])
    w_pos_means.append(warren_np[i][5])
    s_neg_means.append(sanders_np[i][0])
    s_pos_means.append(sanders_np[i][5])

# Create DataFrames so that they can be written to csv files using Pandas
b_means_df = pd.DataFrame([b_neg_means,b_pos_means],
                        columns=['Aug8','Aug15',
                                 'Aug27','Sept7',
                                 'Sept11','Sept12',
                                'Sept17','Sept21','Sept24',
                                'Sept29','Oct3','Oct7','Oct16'],
                       index=['Negative','Positive']).T
w_means_df = pd.DataFrame([w_neg_means,w_pos_means],
                        columns=['Aug8','Aug15',
                                 'Aug27','Sept7',
                                 'Sept11','Sept12',
                                'Sept17','Sept21','Sept24',
                                'Sept29','Oct3','Oct7','Oct16'],
                       index=['Negative','Positive']).T
s_means_df = pd.DataFrame([s_neg_means,s_pos_means],
                        columns=['Aug8','Aug15',
                                 'Aug27','Sept7',
                                 'Sept11','Sept12',
                                'Sept17','Sept21','Sept24',
                                 'Sept29','Oct3','Oct7','Oct16'],
                       index=['Negative','Positive']).T

In [24]:
w_means_df

| | Negative | Positive |
|---|---|---|
| Aug8 | 0.10073 | 0.10188 |
| Aug15 | 0.06788 | 0.13062 |
| Aug27 | 0.08026 | 0.109 |
| Sept7 | 0.08866 | 0.13287 |
| Sept11 | 0.06112 | 0.11366 |
| Sept12 | 0.0713 | 0.09764 |
| Sept17 | 0.07877 | 0.10619 |
| Sept21 | 0.09815 | 0.09444 |
| Sept24 | 0.04136 | 0.09838 |
| Sept29 | 0.08754 | 0.09469 |


In [25]:
# Writing Pandas DataFrames to csv files for Tableau
b_means_df.to_csv('b_means.csv')
w_means_df.to_csv('w_means.csv')
s_means_df.to_csv('s_means.csv')

### Incorporate Poll Order

The poll orderings are as follows:

* August 8 via SurveyUSA: 
    Biden, Sanders, Warren
    
* August 15 order for likely voters via Fox News: 
    Biden, Warren, Sanders
    
* August 27 LV via Emerson College: 
    Biden, Sanders, Warren
    
* September 7 LV via Suffolk University: 
    Biden, Sanders, Warren 
    
* September 11 via RKM Research and Communications Inc.: 
    Sanders, Biden, Warren
    
* September 12 LV via YouGov: 
    Biden, Warren, Sanders 
    
* September 17 LV via NBC News/Wall Street Journal: 
    Biden, Warren, Sanders
    
* September 21 LV via Selzer and Co: 
    Warren, Biden, Sanders
    
* September 24 LV via Monmouth University: 
    Warren, Biden, Sanders
    
* September 29 LV via CNN/SSRS:
    Biden, Warren, Sanders

* October 3rd LV via Public Policy Institute of California:
    Warren, Biden, Sanders
    
* October 7th LV via Morning Consult:
    Biden, Warren, Sanders
    
* October 16th LV via YouGov:
    Warren, Biden, Sanders

In [26]:
# One array per candidate in chronological order
# 0 = 1st place; 1 = 2nd place; 2 = 3rd place

b_target = np.array([0,0,0,0,1,0,0,1,1,0,1,0,1])
s_target = np.array([1,2,1,1,0,2,2,2,2,2,2,2,2])
w_target = np.array([2,1,2,2,2,1,1,0,0,1,0,1,0])
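
As a cross-check, the target arrays can be derived directly from the poll orderings listed above. This is a minimal sketch; the `poll_orders` encoding and the `ranks_for` helper are my own shorthand, not part of the original notebook.

```python
# Poll orderings copied from the list above, in chronological order.
# B = Biden, S = Sanders, W = Warren; position in the string is the rank.
poll_orders = ["BSW", "BWS", "BSW", "BSW", "SBW", "BWS", "BWS",
               "WBS", "WBS", "BWS", "WBS", "BWS", "WBS"]

def ranks_for(initial, orders):
    # 0 = 1st place; 1 = 2nd place; 2 = 3rd place
    return [order.index(initial) for order in orders]

b_ranks = ranks_for("B", poll_orders)
s_ranks = ranks_for("S", poll_orders)
w_ranks = ranks_for("W", poll_orders)

print(b_ranks)
print(s_ranks)
print(w_ranks)
```

Deriving the ranks this way avoids transcription slips when a poll date is added or corrected.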

I created a train/test split for each candidate.

In [27]:
BX_train, BX_test, by_train, by_test = train_test_split(biden_np, b_target, test_size=0.33, random_state=42)

In [28]:
WX_train, WX_test, wy_train, wy_test = train_test_split(warren_np, w_target, test_size=0.33, random_state=42)

In [29]:
SX_train, SX_test, sy_train, sy_test = train_test_split(sanders_np, s_target, test_size=0.33, random_state=42)
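
It helps to know how small these splits are. The sketch below works out the split sizes by hand, assuming (as I understand scikit-learn's behavior for a float `test_size`) that the test count is rounded up. The result agrees with the confusion matrices later in this notebook, which each sum to 5.

```python
import math

n_samples = 13      # one row per poll date
test_size = 0.33

# Assumption: the test count is the ceiling of test_size * n_samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# With so few test dates, accuracy can only move in steps of 1 / n_test
possible_accuracies = [k / n_test for k in range(n_test + 1)]
print(possible_accuracies)
```

Every test-set accuracy reported below is therefore a multiple of 20%, and a single changed prediction shifts it by a full step.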

### Classifiers

In [30]:
clf_nb = GaussianNB()
clf_linsvc = LinearSVC()
clf_dt = tree.DecisionTreeClassifier()
clf_knn =  KNeighborsClassifier(n_neighbors=3)
#clf_nn = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf_nn = MLPClassifier(solver='lbfgs', alpha=2)

For the neural network, the LBFGS solver was chosen because, "For small datasets, however, ‘lbfgs’ can converge faster and perform better" than other solvers, such as the default 'adam'. This quote is taken from the scikit-learn documentation, which can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

## Model Selection

One problem I ran into here is that, for Sanders, 3-fold cross validation generates a warning because the least-populated class has fewer than 3 items. I originally adjusted for this warning, but the accuracy was better when I proceeded with 3-fold despite the warning message.
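
The warning can be anticipated by counting how often each class appears in the target arrays. This is a small sketch using only the targets defined earlier; the variable names are my own.

```python
from collections import Counter

targets = {
    "Biden":   [0,0,0,0,1,0,0,1,1,0,1,0,1],
    "Sanders": [1,2,1,1,0,2,2,2,2,2,2,2,2],
    "Warren":  [2,1,2,2,2,1,1,0,0,1,0,1,0],
}
n_splits = 3

smallest_class = {}
for name, y in targets.items():
    counts = Counter(y)
    smallest_class[name] = min(counts.values())
    status = "ok" if smallest_class[name] >= n_splits else "too few for stratified folds"
    print(name, dict(sorted(counts.items())), status)
```

Sanders polled in 1st place only once, so that class cannot be represented in every fold of a stratified 3-fold split, while Biden and Warren have at least 4 dates in every class.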

#### Biden

In [31]:
b_crossval = [stat.mean(cross_val_score(clf_nb, biden_np, b_target, cv=3)),
              stat.mean(cross_val_score(clf_linsvc, biden_np, b_target, cv=3)),
              stat.mean(cross_val_score(clf_dt, biden_np, b_target,cv=3)),
              stat.mean(cross_val_score(clf_knn, biden_np, b_target,cv=3)),
              stat.mean(cross_val_score(clf_nn, biden_np, b_target,cv=3))]

In [32]:
pd.DataFrame(b_crossval,index=['Naive Bayes','Linear SVC','Decision Tree','K-Nearest Neighbors','Neural Network'],columns=['Avg 3-fold cross-val score'])

| | Avg 3-fold cross-val score |
|---|---|
| Naive Bayes | 0.511111 |
| Linear SVC | 0.622222 |
| Decision Tree | 0.2 |
| K-Nearest Neighbors | 0.488889 |
| Neural Network | 0.622222 |


#### Sanders

In [33]:
s_crossval = [stat.mean(cross_val_score(clf_nb, sanders_np, s_target,cv=3)),
              stat.mean(cross_val_score(clf_linsvc, sanders_np, s_target,cv=3)),
              stat.mean(cross_val_score(clf_dt, sanders_np, s_target,cv=3)),
              stat.mean(cross_val_score(clf_knn, sanders_np, s_target,cv=3)),
              stat.mean(cross_val_score(clf_nn, sanders_np, s_target,cv=3))]



In [34]:
pd.DataFrame(s_crossval,index=['Naive Bayes','Linear SVC','Decision Tree','K-Nearest Neighbors','Neural Network'],columns=['Avg 3-fold cross-val score'])

| | Avg 3-fold cross-val score |
|---|---|
| Naive Bayes | 0.416667 |
| Linear SVC | 0.7 |
| Decision Tree | 0.45 |
| K-Nearest Neighbors | 0.533333 |
| Neural Network | 0.7 |


#### Warren

In [35]:
w_crossval = [stat.mean(cross_val_score(clf_nb, warren_np, w_target, cv=3)),
              stat.mean(cross_val_score(clf_linsvc, warren_np, w_target, cv=3)),
              stat.mean(cross_val_score(clf_dt, warren_np, w_target,cv=3)),
              stat.mean(cross_val_score(clf_knn, warren_np, w_target,cv=3)),
              stat.mean(cross_val_score(clf_nn, warren_np, w_target,cv=3))]

In [36]:
pd.DataFrame(w_crossval,index=['Naive Bayes','Linear SVC','Decision Tree','K-Nearest Neighbors','Neural Network'],columns=['Avg 3-fold cross-val score'])

| | Avg 3-fold cross-val score |
|---|---|
| Naive Bayes | 0.388889 |
| Linear SVC | 0.305556 |
| Decision Tree | 0.361111 |
| K-Nearest Neighbors | 0.277778 |
| Neural Network | 0.388889 |


Overall, the best performing classifiers are Linear SVC and the Neural Network. Naive Bayes was moderately successful for all candidates, and the KNN model performed similarly well. The least accurate model overall was the Decision Tree model. However, this was not true for Warren, and for Sanders the DT model's accuracy is not far behind the others.

* Biden: At the time of my video presentation, I was getting the result that the Linear SVC outperforms all other models, followed by the Neural Network, then Naive Bayes. K-Nearest Neighbors performed slightly worse and the Decision Tree classifier performed worst of all. However, as I added final comments to this file, I realized that I had not fine-tuned all of the parameters for the neural network. I do not recall the source of the original parameters, which is an ethical misstep. As such, I updated the parameters. This increased the efficacy of the Neural Network and now it matches the average accuracy of the Linear SVC classifier. Naive Bayes is still the third most accurate, on average.


* Sanders: Again, at the time of the video presentation I had different results. At that time, the Linear SVC far outperformed all other models, followed by the Neural Network. After the update, the Neural Network is equally successful. Previously, the KNN classifier was fourth, but it has swapped places with Naive Bayes. I do not believe that adjusting the Neural Network parameters should have affected the KNN classifier, so I am unsure what caused this shift. I cleared the kernel before I re-ran all cells, so this is not a vestigial result. The Decision Tree classifier remains near the bottom; in the table above it scores 45%, just ahead of Naive Bayes at 41.67%.


* Warren: In this iteration, the Naive Bayes classifier and the Neural Network were the only ones to reach the 38.89% average accuracy that was previously shared by three classifiers: the Neural Network, Naive Bayes, and Linear SVC. The Linear SVC is now fourth. Prior to the parameter tuning of the Neural Network, those three classifiers were followed by the KNN classifier, then the Decision Tree classifier. Now the Decision Tree is the second best model.

## Model Evaluation

#### Biden

In [37]:
clf_linsvc.fit(BX_train,by_train)
predict_lin_b = clf_linsvc.predict(BX_test)
confusion_matrix(by_test,predict_lin_b)

array([[2, 2],
       [1, 0]])

In [38]:
clf_nb.fit(BX_train,by_train)
predict_nb_b = clf_nb.predict(BX_test)
confusion_matrix(by_test,predict_nb_b)

array([[0, 4],
       [0, 1]])

In [39]:
clf_nn.fit(BX_train,by_train)
predict_nn_b = clf_nn.predict(BX_test)
confusion_matrix(by_test,predict_nn_b)

array([[4, 0],
       [1, 0]])

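The best performing model for Biden, Linear SVC, had only 2 correct predictions, both true positives. Here I treat class 0 (1st place) as the positive class; rows of the matrix are the true labels and columns are the predicted labels. It had 2 false negatives and one false positive.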

The tied-for-first model, the Neural Network, had better results. It had 4 correct predictions (80% success) and one false positive.

The third-place model, Naive Bayes, only had 1 correct result, with 1 true negative and 4 false negatives.
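
The success rates quoted here can be read straight off the confusion matrices: correct predictions sit on the diagonal. Below is a minimal sketch with the three Biden matrices copied from the outputs above; the helper name `accuracy_from_confusion` is my own.

```python
def accuracy_from_confusion(matrix):
    # Correct predictions are on the diagonal (true label == predicted label)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

biden_matrices = {
    "Linear SVC":     [[2, 2], [1, 0]],
    "Naive Bayes":    [[0, 4], [0, 1]],
    "Neural Network": [[4, 0], [1, 0]],
}

biden_accuracy = {name: accuracy_from_confusion(m)
                  for name, m in biden_matrices.items()}
for name, acc in biden_accuracy.items():
    print(f"{name}: {acc:.0%}")
```

On this split the Linear SVC is right 2 times out of 5, Naive Bayes 1 time out of 5, and the Neural Network 4 times out of 5.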

#### Sanders

In [40]:
clf_linsvc.fit(SX_train,sy_train)
predict_lin_s = clf_linsvc.predict(SX_test)
confusion_matrix(sy_test,predict_lin_s)

array([[0, 1],
       [0, 4]])

In [41]:
clf_knn.fit(SX_train,sy_train)
predict_knn_s = clf_knn.predict(SX_test)
confusion_matrix(sy_test,predict_knn_s)

array([[0, 0, 0],
       [0, 0, 1],
       [2, 0, 2]])

In [42]:
clf_nn.fit(SX_train,sy_train)
predict_nn_s = clf_nn.predict(SX_test)
confusion_matrix(sy_test,predict_nn_s)

array([[0, 1],
       [0, 4]])

The best performing models for Sanders were also the Linear SVC and Neural Network. In Sanders' case, they both got 4 correct predictions, all true negatives. The remaining prediction for both was a false negative. Thus, 80% of the predictions were accurate, as opposed to the 70% found by the cross-validated model.

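Sanders' data had KNN as its third place model, with a 53.33% average accuracy. Here, it performed somewhat worse, with a 40% success rate (2 correct predictions and 3 incorrect). This matrix is 3x3 instead of 2x2 because KNN predicted class 0 (1st place) for two dates, so three classes appear across the true and predicted labels. The other two models only ever predicted class 2, so their matrices cover just the two classes present in the test set.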
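
The size of a confusion matrix follows from the set of labels that appear in either the true or the predicted values. The pure-Python sketch below mimics that rule. The predictions are copied from this notebook, but the ordering of `y_true` is an assumption: it is one ordering consistent with the matrices shown, since the test labels themselves are never printed.

```python
def confusion(y_true, y_pred):
    # Labels are the sorted union of true and predicted classes
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1   # row = true, column = predicted
    return labels, matrix

y_true = [1, 2, 2, 2, 2]      # assumed ordering of Sanders' test labels
knn_pred = [2, 0, 2, 2, 0]    # predict_knn_s
svc_pred = [2, 2, 2, 2, 2]    # predict_lin_s

knn_labels, knn_matrix = confusion(y_true, knn_pred)
svc_labels, svc_matrix = confusion(y_true, svc_pred)
print(knn_labels, knn_matrix)
print(svc_labels, svc_matrix)
```

KNN is the only model that predicts class 0, which adds a third row and column even though no test date actually has that label.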

#### Warren

In [43]:
clf_nb.fit(WX_train,wy_train)
predict_nb_w = clf_nb.predict(WX_test)
confusion_matrix(wy_test,predict_nb_w)

array([[1, 0, 0],
       [3, 0, 0],
       [1, 0, 0]])

Naive Bayes was tied for the most accurate model on average, with an average success rate of almost 40%. Here it was only 20% accurate (1 out of 5).

In [44]:
clf_dt.fit(WX_train,wy_train)
predict_dt_w = clf_dt.predict(WX_test)
confusion_matrix(wy_test, predict_dt_w)

array([[1, 0, 0],
       [3, 0, 0],
       [0, 0, 1]])

Decision Tree was the second best at around a 36% average accuracy. Here, it performed better than the Naive Bayes classifier, as it was 40% accurate (2 out of 5). Its test result is also closer to its cross-validated average than was the case for the NB classifier.

In [46]:
clf_nn.fit(WX_train,wy_train)
predict_nn_w = clf_nn.predict(WX_test)
confusion_matrix(wy_test,predict_nn_w)

array([[0, 0, 1],
       [0, 0, 3],
       [0, 0, 1]])

The Neural Network was tied for first, with an average accuracy of 38.89%. Here, 20% of the predictions (1/5) were correct. This was a surprising discrepancy. 

Warren has particularly interesting predictions because she has polled in 1st, 2nd, and 3rd place. I believe this made it harder for the models to predict her poll order. Through cross-validation, the most accurate models were shown to be Naive Bayes and the Neural Network, followed by Decision Tree in third.

### Model evaluation findings
For Warren and Biden, none of the learning algorithms perform particularly well, indicating that there is not a strong correlation between Twitter sentiment and opinion polls. There is a stronger correlation for Biden than for Warren, as his best model had an average 62.2% accuracy when cross-validated. The only candidate with decent predictions was Sanders, with one prediction showing an 80% success. This anomaly will be explored by investigating the predictions for the testing set. Overall, I conclude that the two populations don't have strongly correlating opinions, i.e. are distinct in their opinions.

## Predictions

Above, I explained that the populations do not agree strongly and thus any predictions based on a relationship between the two populations are not particularly reliable. Still, as stated, I want to determine if there is a reason for Sanders' predictions of the testing set being more successful. I will explore the predictions for the other candidates as well and interpret the models.

#### Biden

In [47]:
# Linear SVC
predict_lin_b

array([0, 1, 1, 0, 0])

The Linear SVC has mixed predictions. The average cross-validated accuracy of this model was about 62%, but on this test set it was correct only 2 times out of 5 (40%), as shown by its confusion matrix above. With only five test dates, a single split is not necessarily a typical result. These results are thus semi-reliable at best.

In [48]:
# Naive Bayes
predict_nb_b

array([1, 1, 1, 1, 1])

Here, all of the predictions are the same. Of course, this model only had an approximately 50% success rate. I will partially attribute this result to Naive Bayes' high bias and low variance.

In [49]:
# Neural network
predict_nn_b

array([0, 0, 0, 0, 0])

The Neural Network, in the example shown, had an 80% success rate. Here all of the predictions are the same, as they are in the Naive Bayes approach. However, I am more confident in these predictions due to the example success rate.

None of Biden's models gave the same predictions, so it is not surprising that they had mixed results for their predictions.

#### Sanders

In [50]:
# Linear SVC
predict_lin_s

array([2, 2, 2, 2, 2])

For Sanders, the Linear SVC had an average accuracy of 70%, as did the Neural Network. In the predictions, both had an 80% success. This is less of an anomaly than the 80% for Biden, mentioned above. I would consider these predictions to be reasonably reliable. This means that both moderately successful models place Sanders third.

In [51]:
# K-Nearest Neighbors
predict_knn_s

array([2, 0, 2, 2, 0])

KNN had an average accuracy of slightly over 50%, while the predictions only had a 40% success rate. It is not surprising that this prediction is inconsistent with the others. 

In [52]:
# Neural network
predict_nn_s

array([2, 2, 2, 2, 2])

As stated above, these predictions were as successful and as reliable as those of the Linear SVC.

#### Warren

In [54]:
# Decision Tree
predict_dt_w

array([0, 0, 2, 0, 0])

In [55]:
# Naive Bayes
predict_nb_w

array([0, 0, 0, 0, 0])

In [56]:
# Neural network
predict_nn_w

array([2, 2, 2, 2, 2])

### Prediction consensus

Most models place Warren or Biden in 1st, and Sanders in 3rd.

I find it interesting that the models placed Warren in either 1st or 3rd place and never in 2nd: the Neural Network chose 3rd every time, while the DT chose 1st four times and 3rd once. For Warren, 9/15 predictions across the three models put her in 1st place, or 60% of predictions. 6/15 placed her in 3rd, or 40%.

Biden had 8/15 predictions across the three models that he would be in 1st place. This is 53.33% of predictions. Warren is placed in 1st more often than Biden.

Sanders was predicted in 1st only twice, or 13.33% of the time. He was predicted in 3rd the remaining 86.67% of the time. Since Sanders' input/target is the most consistent, it's not surprising that these were the most successful predictions. This was true on average and shown even more strongly with these specific predictions.
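
The tallies in this section can be reproduced from the prediction arrays printed above. This is a short sketch; the `predictions` dictionary simply groups those arrays by candidate.

```python
from collections import Counter

# Prediction arrays copied from the outputs above (three models each)
predictions = {
    "Biden":   [[0, 1, 1, 0, 0], [1, 1, 1, 1, 1], [0, 0, 0, 0, 0]],
    "Sanders": [[2, 2, 2, 2, 2], [2, 0, 2, 2, 0], [2, 2, 2, 2, 2]],
    "Warren":  [[0, 0, 2, 0, 0], [0, 0, 0, 0, 0], [2, 2, 2, 2, 2]],
}

counts = {}
for name, arrays in predictions.items():
    # Pool the predictions of all three models for this candidate
    counts[name] = Counter(p for arr in arrays for p in arr)
    total = sum(counts[name].values())
    shares = {place: f"{counts[name][place] / total:.2%}" for place in (0, 1, 2)}
    print(name, shares)
```

Pooling the predictions this way weights every model equally, regardless of how well it scored in cross-validation.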

## Conclusion

I do not believe that the predictions herein can be considered fully reliable, but there are some slight correlations shown. These correlations can be compared to those uncovered by the Tableau viz, found [here](https://public.tableau.com/profile/emma.highland#!/vizhome/MSDS692/Pollcomparison). I found that less of a correlation is shown in the Tableau viz than shown here. The visualized time series only shows a pattern (an inverse correlation) with Warren, where the Twitter sentiments become less positive over time, while her poll order rises. Sanders has stagnated in his poll ordering, as shown here as well with the consistency of his 3rd place prediction. His sentiment scores have become slightly more positive over time, but I would have expected the positive sentiment to correlate with rising poll order. This did not occur. Biden goes through patches of more positive and more negative sentiments, but this does not correlate positively or negatively with the changes in his poll order. He has only vacillated between 1st and 2nd, rather than the range and rise shown by Warren.

I conclude that the populations (poll takers and randomly-chosen Twitter users) have distinct opinions that do not follow the same trend. 