# Machine Learning Engineer Nanodegree
## Capstone Project
Brian Long  
November 5th, 2019

## I. Definition

### Project Overview

Twitter is a micro-blogging platform where users can post short messages referred to as
“tweets”. Tweets used to be limited to 140 characters, but this limit was increased to 280
characters in 2017. This enforced conciseness made Twitter a popular platform for
people to quickly voice their opinion about a topic without needing to publish a
multi-page article. Millions of people write multiple tweets every day to express their
thoughts or opinions, both formal and informal. Twitter has become a place for friends to
share pictures of their lunch, as well as a place for professionals to discuss emerging
practices and technologies.

The thing that interests me the most about Twitter as a social media platform is the
frequency and openness with which people post. In many ways, a person’s Twitter feed
is like a journal of their day-to-day life. There is a wealth of information that seems
almost purposefully structured for data science. For example, the use of hashtags to
label tweets allows them to be categorized by topic without breaking up the platform into
multiple smaller communities that are defined by that topic. Using the public Twitter API,
developers can access data about tweets that can then be aggregated and used to
drive business decisions across a variety of fields.

Twitter stands out as a tool for sentiment analysis because the text differs heavily from
the usual data that comes from sources such as movie reviews, product reviews, or
news articles. These other sources are typically large bodies of text with proper
grammar, whereas tweets have a strict character limit that keeps them short and
an informal tone that often ignores grammatical convention. Twitter has
been studied as a data source for sentiment analysis in the following papers:

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant
supervision. CS224N Project Report, Stanford, 1(12).
https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

Agarwal, A., Xie, B., Vovsha, I., Rambow, O. and Passonneau, R., 2011. Sentiment analysis of
Twitter data. Columbia University.
http://www.cs.columbia.edu/~julia/papers/Agarwaletal11.pdf



### Problem Statement


The information I would most like to extract from tweets is their general sentiment,
that is, whether a tweet has a positive or negative tone. Sentiment analysis is a popular
problem in the field of Natural Language Processing. The intricacies of language make
it difficult to algorithmically extract meaning from a given body of text. While true
sentiment contains much more complexity than can be expressed by the broad
definitions of having a positive or negative point of view, this is still valuable information
as it can be used to quantify the general public’s feelings towards a company, product,
or idea.

I will attempt to create a model that determines whether a tweet expresses a
positive or negative sentiment. This model could eventually be used to build a tool that
pulls new tweets from the Twitter API and evaluates their sentiment. Such a tool could
show trends in how the general public views a given topic, which could in turn inform
business decisions.



### Metrics


The main metric I will use to evaluate the benchmark and solution models is
accuracy. I chose to focus on accuracy over precision or recall because mislabeling a
positive tweet as negative has the same impact as mislabeling a negative tweet as
positive. In this instance we are not specifically trying to identify all positive tweets
or to avoid incorrectly labeling a negative tweet as positive, but to correctly identify
as many tweets as possible, regardless of whether they are positive or negative. Since
the target classes are perfectly balanced between positive and negative tweets, a model
cannot achieve high accuracy simply by favoring one class over the other. I will also
report F1-score as a secondary metric so that precision and recall are not entirely
discounted, although my main focus will be on improving accuracy.
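As a worked example of how these metrics are derived, the sketch below recomputes accuracy, precision, recall, and F1 from the confusion-matrix counts reported for the Naive Bayes benchmark in the notebook output of this report. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1 (positive) as the class of interest; the variable names are illustrative.

```python
# Confusion-matrix counts reported for the Naive Bayes benchmark
# (rows = true class, columns = predicted class; class 1 = positive).
tn, fp = 69739, 17673
fn, tp = 21531, 66057

total = tn + fp + fn + tp
accuracy = (tp + tn) / total                         # share of all tweets labeled correctly
precision = tp / (tp + fp)                           # of tweets predicted positive, share truly positive
recall = tp / (tp + fn)                              # of truly positive tweets, share found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way agrees with the `accuracy_score` value printed for the benchmark, and the rounded precision, recall, and F1 agree with the class 1 row of its classification report.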




## II. Analysis

### Data Exploration

I will be using a subset of the Sentiment140 dataset to build my Twitter sentiment analysis model.
This dataset is a collection of 1.6 million tweets that were gathered from the Twitter API
and labeled as positive or negative. I have cut the dataset down to 700k tweets and added column headers to the CSV file included in this project.
Since the dataset was intended for sentiment analysis, it was constructed with a perfect class balance: exactly
50% of the training data is positive and 50% is negative. The tweets were
labeled based on the emoticons they contained, with variations of smiling emoticons
labeled positive and variations of frowning emoticons labeled negative. The
emoticons were removed from the tweets after the sentiment labels were added.
The dataset was assembled using distant supervision, in part to demonstrate that
labeling data this way is viable. Its authors trained Naive Bayes,
Maximum Entropy, and Support Vector Machine models on the data, all of which scored over 80%
accuracy. The complete Sentiment140 dataset is available at the following link:

http://help.sentiment140.com/for-students

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant
supervision. CS224N Project Report, Stanford, 1(12).
https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf
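
The emoticon-based labeling described above can be sketched roughly as follows. This is only an illustration of the idea of distant supervision: the emoticon lists and the `label_by_emoticon` helper are hypothetical, and the lists actually used to build Sentiment140 may differ.

```python
# Hypothetical emoticon lists, for illustration only; the lists actually used
# to build Sentiment140 may differ.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")

def label_by_emoticon(tweet):
    """Return (label, cleaned_text), or None if the tweet cannot be labeled."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or mixed signals: skip the tweet.
        return None
    label = 4 if has_pos else 0  # same 0/4 encoding as the dataset
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")  # strip the emoticon once it has served as the label
    return label, " ".join(tweet.split())

print(label_by_emoticon("great day at the beach :)"))
print(label_by_emoticon("missed the bus again :("))
print(label_by_emoticon("no emoticon here"))
```

Removing the emoticon after labeling matters: if it stayed in the text, a model could learn to read the label directly from it instead of from the words.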

Below, we will begin by importing the dataset and viewing a sample of the entries.

The dataset contains 6 columns:

- sentiment: either 0 or 4, where 0 represents a negative sentiment and 4 represents a positive sentiment.
- tweet_id: the id of the tweet from the Twitter API.
- date: the date that the tweet was posted to Twitter.
- query: the search term used to find the tweet.
- user: the username of the person who posted the tweet.
- tweet: the content of the tweet that was posted to Twitter.

In [4]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 1000
df = pd.read_csv("twitter_data_700k.csv")

In [5]:
df.head()

| | sentiment | tweet_id | date | query | user | tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |


### Exploratory Visualization
<font color="red">
In this section, you will need to provide some form of visualization that summarizes or extracts a relevant characteristic or feature about the data. The visualization should adequately support the data being used. Discuss why this visualization was chosen and how it is relevant. Questions to ask yourself when writing this section:
    
- _Have you visualized a relevant characteristic or feature about the dataset or input data?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_
</font>

### Algorithms and Techniques
<font color="red">
In this section, you will need to discuss the algorithms and techniques you intend to use for solving the problem. You should justify the use of each one based on the characteristics of the problem and the problem domain. Questions to ask yourself when writing this section:
    
- _Are the algorithms you will use, including any default variables/parameters in the project clearly defined?_
- _Are the techniques to be used thoroughly discussed and justified?_
- _Is it made clear how the input data or datasets will be handled by the algorithms and techniques chosen?_
</font>


### Benchmark
<font color="red">
In this section, you will need to provide a clearly defined benchmark result or threshold for comparing across performances obtained by your solution. The reasoning behind the benchmark (in the case where it is not an established result) should be discussed. Questions to ask yourself when writing this section:
    
- _Has some result or value been provided that acts as a benchmark for measuring performance?_
- _Is it clear how this result or value was obtained (whether by data or by hypothesis)?_
</font>

## III. Methodology
<font color="red">_(approx. 3-5 pages)_</font>

### Data Preprocessing
<font color="red">
In this section, all of your preprocessing steps will need to be clearly documented, if any were necessary. From the previous section, any of the abnormalities or characteristics that you identified about the dataset will be addressed and corrected here. Questions to ask yourself when writing this section:
    
- _If the algorithms chosen require preprocessing steps like feature selection or feature transformations, have they been properly documented?_
- _Based on the **Data Exploration** section, if there were abnormalities or characteristics that needed to be addressed, have they been properly corrected?_
- _If no preprocessing is needed, has it been made clear why?_
</font>



### Implementation
<font color="red">
In this section, the process for which metrics, algorithms, and techniques that you implemented for the given data will need to be clearly documented. It should be abundantly clear how the implementation was carried out, and discussion should be made regarding any complications that occurred during this process. Questions to ask yourself when writing this section:
    
- _Is it made clear how the algorithms and techniques were implemented with the given datasets or input data?_
- _Were there any complications with the original metrics or techniques that required changing prior to acquiring a solution?_
- _Was there any part of the coding process (e.g., writing complicated functions) that should be documented?_
</font>

### Refinement
<font color="red">
In this section, you will need to discuss the process of improvement you made upon the algorithms and techniques you used in your implementation. For example, adjusting parameters for certain models to acquire improved solutions would fall under the refinement category. Your initial and final solutions should be reported, as well as any significant intermediate results as necessary. Questions to ask yourself when writing this section:
    
- _Has an initial solution been found and clearly reported?_
- _Is the process of improvement clearly documented, such as what techniques were used?_
- _Are intermediate and final solutions clearly reported as the process is improved?_
</font>




## IV. Results
<font color="red">_(approx. 2-3 pages)_</font>

### Model Evaluation and Validation
<font color="red">
In this section, the final model and any supporting qualities should be evaluated in detail. It should be clear how the final model was derived and why this model was chosen. In addition, some type of analysis should be used to validate the robustness of this model and its solution, such as manipulating the input data or environment to see how the model’s solution is affected (this is called sensitivity analysis). Questions to ask yourself when writing this section:
    
- _Is the final model reasonable and aligning with solution expectations? Are the final parameters of the model appropriate?_
- _Has the final model been tested with various inputs to evaluate whether the model generalizes well to unseen data?_
- _Is the model robust enough for the problem? Do small perturbations (changes) in training data or the input space greatly affect the results?_
- _Can results found from the model be trusted?_
</font>



### Justification
<font color="red">
In this section, your model’s final solution and its results should be compared to the benchmark you established earlier in the project using some type of statistical analysis. You should also justify whether these results and the solution are significant enough to have solved the problem posed in the project. Questions to ask yourself when writing this section:
    
- _Are the final results found stronger than the benchmark result reported earlier?_
- _Have you thoroughly analyzed and discussed the final solution?_
- _Is the final solution significant enough to have solved the problem?_
</font>



## V. Conclusion
<font color="red">_(approx. 1-2 pages)_</font>

### Free-Form Visualization
<font color="red">
In this section, you will need to provide some form of visualization that emphasizes an important quality about the project. It is much more free-form, but should reasonably support a significant result or characteristic about the problem that you want to discuss. Questions to ask yourself when writing this section:
    
- _Have you visualized a relevant or important quality about the problem, dataset, input data, or results?_
- _Is the visualization thoroughly analyzed and discussed?_
- _If a plot is provided, are the axes, title, and datum clearly defined?_
</font>

### Reflection
<font color="red">
In this section, you will summarize the entire end-to-end problem solution and discuss one or two particular aspects of the project you found interesting or difficult. You are expected to reflect on the project as a whole to show that you have a firm understanding of the entire process employed in your work. Questions to ask yourself when writing this section:
    
- _Have you thoroughly summarized the entire process you used for this project?_
- _Were there any interesting aspects of the project?_
- _Were there any difficult aspects of the project?_
- _Does the final model and solution fit your expectations for the problem, and should it be used in a general setting to solve these types of problems?_
</font>



### Improvement
<font color="red">
In this section, you will need to provide discussion as to how one aspect of the implementation you designed could be improved. As an example, consider ways your implementation can be made more general, and what would need to be modified. You do not need to make this improvement, but the potential solutions resulting from these changes are considered and compared/contrasted to your current solution. Questions to ask yourself when writing this section:
    
- _Are there further improvements that could be made on the algorithms or techniques you used in this project?_
- _Were there algorithms or techniques you researched that you did not know how to implement, but would consider using if you knew how?_
- _If you used your final solution as the new benchmark, do you think an even better solution exists?_
</font>























Import the libraries that will be used in this project.

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 1000

Read the Twitter data into a pandas DataFrame. I originally ran into an encoding issue and had to re-save the CSV as UTF-8. The original CSV file does not include a header row, so the column names were added manually (see the commented-out line in the next cell).

In [2]:
# df = pd.read_csv("twitter_data.csv", header=None, names=["sentiment", "tweet_id", "date", "query", "user", "tweet"])
df = pd.read_csv("twitter_data_700k.csv")

In [3]:
df.head()

| | sentiment | tweet_id | date | query | user | tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |


In [4]:
# num = 350000
# h = df.head(num)
# t = df.tail(num)

# f = pd.concat([h,t], axis=0)
# f.to_csv('twitter_data_700k.csv', index=False)
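
The commented-out cell above appears to build the 700k subset from the first and last 350,000 rows of the full file. That keeps the classes balanced only if the full file is ordered by label, with negative tweets first and positive tweets last; the all-negative sample rows are consistent with this, but it is an assumption here. A toy sketch of the idea in plain Python, with a hypothetical `head_tail` helper standing in for `df.head(num)` plus `df.tail(num)`:

```python
def head_tail(rows, n):
    """Return the first n and last n rows, mirroring df.head(n) + df.tail(n)."""
    return rows[:n] + rows[-n:]

# Toy stand-in for the full file: 8 negative (0) rows followed by 8 positive (4) rows.
labels = [0] * 8 + [4] * 8

subset = head_tail(labels, 3)
print(subset)                             # [0, 0, 0, 4, 4, 4]
print(subset.count(0), subset.count(4))   # 3 3
```

If the file were shuffled instead of sorted, the same slicing would give an arbitrary class mix, so a stratified sample would be the safer choice in that case.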



Remove columns that don't seem useful for sentiment analysis.

In [5]:
df.drop(["tweet_id", "date", "query", "user"], axis=1, inplace=True)
df.head()

| | sentiment | tweet |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 1 | 0 | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 2 | 0 | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |


Temporary cell: sample a subset of rows to speed up preprocessing while testing. Remove this cell before submitting the project.

In [6]:
# df = df.sample(100000)
# df.shape

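Perform data preprocessing: strip mentions, hashtags, URLs, and punctuation, collapse repeated characters, drop very short words, and lowercase the text.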

In [13]:
import re

def processTweet(tweet):
    """Normalize a raw tweet before vectorizing or tokenizing it."""
    # Raw strings are used for patterns containing \w or \S to avoid invalid-escape warnings.
    tweet = re.sub(r"[@|#]\w+\S", "", tweet) # remove @usernames and #hashtags
    tweet = re.sub(r"https?://\S+", "", tweet) # remove urls
    tweet = re.sub(r"(.)\1\1+", r"\1\1", tweet) # collapse any character repeated 3+ times down to 2
    tweet = re.sub('[!&()+,-./:;<=>?[\\]_{|}~]', ' ', tweet) # replace certain punctuation with a space
    tweet = re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]', '', tweet) # remove remaining punctuation (e.g. apostrophes)
    tweet = re.sub(r'\b\w{1,2}\b', '', tweet) # remove words that are less than 3 characters
    tweet = tweet.lower() # lowercase
    return tweet

In [14]:
df['tweet'] = df['tweet'].map(processTweet)

df.head()

| | sentiment | tweet |
|---|---|---|
| 0 | 0 | aww thats bummer you shoulda got david carr third day |
| 1 | 0 | upset that cant update his facebook texting and might cry result school today also blah |
| 2 | 0 | dived many times for the ball managed save the rest out bounds |
| 3 | 0 | whole body feels itchy and like its fire |
| 4 | 0 | its not behaving all mad why here because cant see you all over there |
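
To make the effect of each substitution easier to follow, the sketch below applies the same regular expressions one at a time to the first sample tweet and prints the intermediate text. The `STEPS` list and the `trace` helper are illustrative names, not part of the notebook, and the final whitespace collapse is added only for display (the notebook's function leaves extra spaces in place).

```python
import re

# The same substitutions as processTweet, listed so each stage can be inspected.
STEPS = [
    ("mentions/hashtags", r"[@|#]\w+\S", ""),
    ("urls", r"https?://\S+", ""),
    ("repeated characters", r"(.)\1\1+", r"\1\1"),
    ("punctuation to space", '[!&()+,-./:;<=>?[\\]_{|}~]', " "),
    ("remaining punctuation", '[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]', ""),
    ("short words", r"\b\w{1,2}\b", ""),
]

def trace(tweet):
    """Apply each step in order, printing the text after every substitution."""
    for name, pattern, replacement in STEPS:
        tweet = re.sub(pattern, replacement, tweet)
        print(f"{name:>22}: {tweet!r}")
    # Lowercase, then collapse leftover whitespace (display only).
    return " ".join(tweet.lower().split())

sample = ("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. "
          "You shoulda got David Carr of Third Day to do it. ;D")
result = trace(sample)
print(result)
```

The final string matches the first processed row shown above. Note that dropping words shorter than three characters also removes tokens such as "no" or "ok", which can carry sentiment, so that step is a trade-off rather than a pure cleanup.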


In [15]:
from sklearn.model_selection import train_test_split

y = df['sentiment'].map({0: 0, 4: 1})
X = df.drop(['sentiment'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

train_vector = count_vector.fit_transform(X_train['tweet'])
test_vector = count_vector.transform(X_test['tweet'])
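
For intuition about what the vectorizer produces, here is a minimal pure-Python sketch of a bag-of-words count representation. It is not scikit-learn's implementation: it returns dense lists rather than a sparse matrix, and `fit_vocabulary` and `transform` are illustrative helpers. It does mirror two default behaviors of `CountVectorizer`: text is lowercased, and only tokens of two or more word characters are kept.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters

def fit_vocabulary(docs):
    """Map each token seen in the training documents to a column index."""
    tokens = sorted({t for d in docs for t in TOKEN.findall(d.lower())})
    return {t: i for i, t in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are ignored."""
    rows = []
    for d in docs:
        counts = Counter(TOKEN.findall(d.lower()))
        rows.append([counts.get(t, 0) for t in vocabulary])
    return rows

train_docs = ["good day good mood", "bad day"]
test_docs = ["good movie"]

vocab = fit_vocabulary(train_docs)
print(vocab)                         # {'bad': 0, 'day': 1, 'good': 2, 'mood': 3}
print(transform(train_docs, vocab))  # [[0, 1, 2, 1], [1, 1, 0, 0]]
print(transform(test_docs, vocab))   # [[0, 0, 1, 0]]
```

The vocabulary is built from the training documents only, which is why the notebook calls `fit_transform` on the training set and plain `transform` on the test set: the unseen word "movie" simply contributes nothing.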

In [17]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(train_vector, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
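
The benchmark relies on scikit-learn's `MultinomialNB`. As a rough illustration of the underlying idea, the sketch below implements multinomial Naive Bayes with additive (Laplace) smoothing in plain Python, using `alpha=1.0` to match the default shown above. The function names and the toy data are hypothetical, and this is a simplified sketch rather than the library's implementation.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from whitespace-tokenized docs."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary word gets alpha extra counts.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((word_counts[c][w] + alpha) / denom)
                             for w in vocab}
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest posterior log score; unseen words are skipped."""
    log_prior, log_likelihood = model
    scores = {c: log_prior[c] + sum(log_likelihood[c][w]
                                    for w in doc.split() if w in log_likelihood[c])
              for c in log_prior}
    return max(scores, key=scores.get)

toy_docs = ["good great day", "love this", "bad sad day", "hate this"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "great love"))  # 1
print(predict_nb(toy_model, "sad hate"))    # 0
```

Smoothing keeps a word that never appeared with a class from driving that class's probability to zero, which matters for tweets where most words are rare.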

In [18]:
predictions = naive_bayes.predict(test_vector)

In [19]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

[[69739 17673]
 [21531 66057]]
              precision    recall  f1-score   support

           0       0.76      0.80      0.78     87412
           1       0.79      0.75      0.77     87588

   micro avg       0.78      0.78      0.78    175000
   macro avg       0.78      0.78      0.78    175000
weighted avg       0.78      0.78      0.78    175000

0.7759771428571428


The cells below build and evaluate the recurrent neural network (LSTM) model.

In [14]:
from keras.preprocessing import sequence, text

vocabulary_size = 20000 # keep only the most frequent words

# Build the word index from the training tweets only
tokenizer = text.Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(X_train['tweet'])

# Convert each tweet to a list of integer word indices
train_token = tokenizer.texts_to_sequences(X_train['tweet'])
test_token = tokenizer.texts_to_sequences(X_test['tweet'])

# Pad or truncate every sequence to a fixed length
max_words = 50
train_padded = sequence.pad_sequences(train_token, maxlen=max_words)
test_padded = sequence.pad_sequences(test_token, maxlen=max_words)
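
The following pure-Python sketch approximates what the tokenizer and padding steps do, under the Keras defaults as I understand them: words are indexed from 1 in order of decreasing frequency, only indices below `num_words` are kept, and `pad_sequences` pads and truncates at the front of each sequence. The helper names are illustrative, and the real `Tokenizer` also lowercases and filters punctuation, which this sketch skips.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign index 1 to the most frequent word, 2 to the next, and so on."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def pad_sequences_pre(sequences, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items of longer sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_texts = ["good day", "bad day today", "day after day"]
word_index = fit_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, word_index, num_words=4)
toy_padded = pad_sequences_pre(toy_sequences, maxlen=3)
print(word_index)
print(toy_sequences)
print(toy_padded)
```

Index 0 is reserved for padding, which is why word indices start at 1. With `max_words = 50` and very short words already removed, truncation should rarely affect real tweets.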

Using TensorFlow backend.


In [20]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_size = 32
model = Sequential()
# Map each word index to a 32-dimensional dense vector
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
# Single LSTM layer with dropout on the inputs and on the recurrent connections
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
# Sigmoid output: probability that the tweet is positive
model.add(Dense(1, activation='sigmoid'))
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 32)            640000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 648,353
Trainable params: 648,353
Non-trainable params: 0
_________________________________________________________________
None
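
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes used in the notebook, assuming the standard LSTM formulation with four gates, each with input weights, recurrent weights, and a bias.

```python
vocabulary_size = 20000
embedding_size = 32
lstm_units = 32

# Embedding: one vector of length embedding_size per vocabulary entry.
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 640000 8320 33 648353
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size is the main lever on model size.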


In [21]:
loss_func = 'binary_crossentropy'

model.compile(loss=loss_func, 
             optimizer='adam', 
             metrics=['accuracy'])

In [22]:
from keras.callbacks import ModelCheckpoint  

saved_model_path = 'model.weights.best.hdf5'

checkpointer = ModelCheckpoint(filepath=saved_model_path, 
                               monitor='val_loss',
                               save_best_only=True)

batch_size = 32
num_epochs = 5
model.fit(train_padded, 
          y_train, 
          validation_split=0.2, 
          batch_size=batch_size, 
          epochs=num_epochs,
          shuffle=True,
          callbacks=[checkpointer])


Train on 420000 samples, validate on 105000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f7fd4c53780>
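
The sample counts in the training log follow directly from the two split settings used in the notebook (`test_size=.25` and `validation_split=0.2`) applied to the 700k-row subset. A quick arithmetic check, with illustrative variable names:

```python
total_rows = 700_000
test_size = 0.25
validation_split = 0.2

test_rows = round(total_rows * test_size)         # held out by train_test_split
train_rows = total_rows - test_rows               # passed to model.fit
val_rows = round(train_rows * validation_split)   # held out by Keras for validation
fit_rows = train_rows - val_rows                  # actually used for weight updates

print(test_rows, train_rows, val_rows, fit_rows)
# 175000 525000 105000 420000
```

Keras takes the validation samples from the end of the training arrays before shuffling; since `train_test_split` already shuffled the rows, that tail should still be a representative sample.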

In [23]:
model.load_weights(saved_model_path)

In [24]:
scores = model.evaluate(test_padded, y_test, verbose=0)
print('Test accuracy:', scores[1])

Test accuracy: 0.8091828571401324


In [25]:
rnn_pred = model.predict_classes(test_padded)
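
Newer Keras releases no longer provide `Sequential.predict_classes`. For a single sigmoid output, the same labels can be obtained by thresholding the predicted probabilities at 0.5, for example `(model.predict(test_padded) > 0.5).astype("int32")`. The pure-Python sketch below shows the thresholding step on a flat list of made-up probabilities; `to_classes` is a hypothetical helper.

```python
def to_classes(probabilities, threshold=0.5):
    """Map sigmoid outputs to class labels: 1 (positive) above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# Made-up probabilities for illustration only.
example_probs = [0.91, 0.08, 0.50, 0.63]
print(to_classes(example_probs))  # [1, 0, 0, 1]
```

A probability of exactly 0.5 maps to class 0 here because the comparison is strict.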

In [26]:
print(confusion_matrix(y_test, rnn_pred))
print(classification_report(y_test, rnn_pred))
print(accuracy_score(y_test, rnn_pred))

[[69751 17661]
 [15732 71856]]
              precision    recall  f1-score   support

           0       0.82      0.80      0.81     87412
           1       0.80      0.82      0.81     87588

   micro avg       0.81      0.81      0.81    175000
   macro avg       0.81      0.81      0.81    175000
weighted avg       0.81      0.81      0.81    175000

0.8091828571428571
