# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [1]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [2]:
' '.join(whitespace_string.strip().split())

'This is a string that has a lot of extra whitespace.'
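The same normalization can also be written with a regular expression. The sketch below is an alternative to the `join`/`split` approach rather than a replacement for it; it reuses the string from the cell above and checks that both approaches agree.

```python
import re

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then trim the single space left at each end.
cleaned = re.sub(r"\s+", " ", whitespace_string).strip()

# str.split() with no arguments already discards leading and trailing
# whitespace, so the strip() in the join-based version is optional.
assert cleaned == " ".join(whitespace_string.split())
print(cleaned)
```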

### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [3]:
import requests
import pandas as pd
import re
url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt'
r = requests.get(url)
r.text

'March 8, 2015\r\nMarch 15, 2015\r\nMarch 22, 2015\r\nMarch 29, 2015\r\nApril 5, 2015\r\nApril 12, 2015\r\nApril 19, 2015\r\nApril 26, 2015\r\nMay 3, 2015\r\nMay 10, 2015\r\nMay 17, 2015\r\nMay 24, 2015\r\nMay 31, 2015\r\nJune 7, 2015\r\nJune 14, 2015\r\nJune 21, 2015\r\nJune 28, 2015\r\nJuly 5, 2015\r\nJuly 12, 2015\r\nJuly 19, 2015'

In [4]:
# Separate each line into a string
string_dates = r.text.split('\n')

# Strip carriage returns from the string
string_dates = [x.strip() for x in string_dates]

# Split each string into a list of words
list_dates = [re.findall(r'\w+', row) for row in string_dates]

# Extract the elements of that list into a dictionary
components = {'Day': [x[1] for x in list_dates],
              'Month': [x[0] for x in list_dates],
              'Year': [x[2] for x in list_dates]}

# Turn the dictionary into a DF
df_dates = pd.DataFrame(components)
df_dates.head()

|   | Day | Month | Year |
|---|-----|-------|------|
| 0 | 8 | March | 2015 |
| 1 | 15 | March | 2015 |
| 2 | 22 | March | 2015 |
| 3 | 29 | March | 2015 |
| 4 | 5 | April | 2015 |
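A regular expression with named groups can pull out all three fields in one pass instead of relying on word positions. This is a minimal sketch on a short sample in the same format as the file; the pattern and the names `sample` and `date_pattern` are assumptions based on the "Month D, YYYY" layout shown above.

```python
import re

# A short sample in the same format as dates.txt (CRLF line endings).
sample = "March 8, 2015\r\nMarch 15, 2015\r\nApril 5, 2015"

# One named group per column. The pattern assumes every date looks like
# "Month D, YYYY", as in the file shown above.
date_pattern = re.compile(
    r"(?P<Month>[A-Za-z]+)\s+(?P<Day>\d{1,2}),\s*(?P<Year>\d{4})"
)

# groupdict() returns {'Month': ..., 'Day': ..., 'Year': ...} for each match.
rows = [match.groupdict() for match in date_pattern.finditer(sample)]
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give a dataframe with one column per named group.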


# Part 2 - Bag of Words 

### Use the Twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents, ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [5]:
url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv'
df = pd.read_csv(url)
df.head()

|   | Sentiment | SentimentText |
|---|-----------|---------------|
| 0 | 0 | is so sad for my APL frie... |
| 1 | 0 | I missed the New Moon trail... |
| 2 | 1 | omg its already 7:30 :O |
| 3 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 0 | i think mi bf is cheating on me!!! ... |


In [6]:
raw_feels = df.SentimentText.tolist()
len(raw_feels)

99989

In [7]:
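import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK 'stopwords' corpus and tokenizer data have already been
# fetched with nltk.download().
# Map every punctuation character to a space.
punct_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
stop_words = set(stopwords.words('english'))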

def cleanup(tweet_list):
    cleaned_tweets = []

    for tweet in tweet_list:

        # Strip punctuation everywhere,
        # replacing it with spaces so that words separated only
        # by punctuation don't get smooshed together
        tweet = tweet.translate(punct_table)
        
        # Tokenize by word
        tweet = word_tokenize(tweet)

        # Make all words lowercase
        tweet = [w.lower() for w in tweet]

        # Remove words that aren't alphabetic
        tweet = [w for w in tweet if w.isalpha()]

        # Remove stopwords
        tweet = [w for w in tweet if w not in stop_words]

        # Append to list
        cleaned_tweets.append(tweet)
    
    return cleaned_tweets
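To illustrate why the punctuation is replaced with spaces rather than deleted, the sketch below runs both options on a made-up tweet. The tweet text and the names `to_space` and `to_nothing` are hypothetical; `to_space` builds the same mapping as `punct_table` in the cell above.

```python
import string

# Every punctuation character maps to a space (same mapping as punct_table).
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
# Alternative table: every punctuation character is deleted outright.
to_nothing = str.maketrans("", "", string.punctuation)

tweet = "omg...its already 7:30!!!no way"  # hypothetical tweet

spaced = tweet.translate(to_space).split()
deleted = tweet.translate(to_nothing).split()

print(spaced)   # words separated only by punctuation stay separate
print(deleted)  # the same words get merged into 'omgits' and '730no'
```

With deletion, the later `isalpha()` filter would throw away `'730no'` entirely, losing the word "no".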

In [8]:
feels = cleanup(raw_feels)
feels[:10]

[['sad', 'apl', 'friend'],
 ['missed', 'new', 'moon', 'trailer'],
 ['omg', 'already'],
 ['omgaga',
  'im',
  'sooo',
  'im',
  'gunna',
  'cry',
  'dentist',
  'since',
  'suposed',
  'get',
  'crown',
  'put'],
 ['think', 'mi', 'bf', 'cheating'],
 ['worry', 'much'],
 ['juuuuuuuuuuuuuuuuussssst', 'chillin'],
 ['sunny', 'work', 'tomorrow', 'tv', 'tonight'],
 ['handed', 'uniform', 'today', 'miss', 'already'],
 ['hmmmm', 'wonder', 'number']]

### How should TF-IDF scores be interpreted? How are they calculated?

TF-IDF scores combine two measures of a word's importance. They apply when we have a collection of documents, each of which is a collection of words. Every word in a document is assigned a score. The score rises when the word makes up a large share of that document's words, and falls when the word also appears in a large share of the documents in the collection.

TF-IDF is designed to match our intuition for how distinctive a word is. A word is most distinctive when one document uses it a lot and few other documents mention it.

The equation used to calculate TF-IDF has one factor for each of these measures: TF (Term Frequency) and IDF (Inverse Document Frequency).

`TF(w, d)` = (Number of times word `w` appears in document `d`) / (Total number of words in document `d`).

`IDF(w)` = log_2(Total number of documents / Number of documents that contain word `w`).

`TF-IDF(w, d) = TF(w, d) * IDF(w)`

Note that some versions of TF-IDF use a different base (such as `e`) for the logarithm in IDF.
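The formulas above can be worked through by hand on a tiny corpus. The following is a minimal sketch that implements exactly the base-2 definitions given here; the toy `docs` list is hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, natural-log variant with normalization by default, so their numbers will differ from these.

```python
import math

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["sad", "apl", "friend"],
    ["omg", "already"],
    ["miss", "already"],
    ["im", "sooo", "im", "gunna"],
]

def tf(word, doc):
    """Share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log2(total documents / documents containing `word`).

    Assumes `word` occurs in at least one document; otherwise the
    division below fails.
    """
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log2(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("im", docs[3], docs))       # 2/4 * log2(4/1) = 1.0
print(tf_idf("already", docs[1], docs))  # 1/2 * log2(4/2) = 0.5
```

"im" scores higher than "already" even though both make up half of their document, because "im" appears in only one of the four documents.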

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
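Step 3 matters because the vocabulary must come from the training data alone; words that only occur in the test set are then ignored at transform time. The sketch below shows that behaviour with a hand-rolled count vectorizer. The data and the helper names `fit_vocabulary` and `transform` are hypothetical and stand in for a vectorizer's `fit` and `transform` methods.

```python
from collections import Counter

# Hypothetical miniature train/test split of space-joined tweets.
X_train = ["sad friend", "miss friend already", "omg already"]
X_test = ["sad omg", "sunny work tomorrow"]

def fit_vocabulary(documents):
    """Map each word seen in `documents` to a column index (alphabetical)."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """Turn each document into word counts, one slot per vocabulary word."""
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # Words missing from the vocabulary have no slot and are dropped.
        vectors.append([counts[word] for word in vocabulary])
    return vectors

vocabulary = fit_vocabulary(X_train)  # built from training data only
train_vectors = transform(X_train, vocabulary)
test_vectors = transform(X_test, vocabulary)

print(vocabulary)
print(test_vectors)
```

The second test tweet shares no words with the training set, so it becomes an all-zero vector.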



In [24]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [22]:
# Put the data in the right format
X = [' '.join(x) for x in feels]
y = df.Sentiment.tolist()

In [28]:
def vect_and_classify(vect, clf, X, y):
    """
    Split the data, vectorize it, classify it, and evaluate the
    classification
    """
    # Split into train and test data
    X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)
    
    # Fit the vectorizer to the training data
    vect.fit(X_train)

    # Transform train and test into word vectors
    train_vect = vect.transform(X_train)
    test_vect = vect.transform(X_test)
    
    # Fit the classifier, using training data only
    clf.fit(train_vect, y_train)
    
    # Test the classifier's predictions
    train_preds = clf.predict(train_vect)
    test_preds = clf.predict(test_vect)
    
    print(f'Train Accuracy: {accuracy_score(y_train, train_preds):.4f}')
    print(f'Test Accuracy: {accuracy_score(y_test, test_preds):.4f}')
    print()
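    # Note: ROC AUC is computed here from hard 0/1 predictions rather than
    # from decision scores, so it reduces to balanced accuracy.
    print(f'Train ROC AUC score: {roc_auc_score(y_train, train_preds):.4f}')
    print(f'Test ROC AUC score: {roc_auc_score(y_test, test_preds):.4f}')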

In [31]:
vect = CountVectorizer(max_features=None,
                       ngram_range=(1,2),
                       stop_words='english')

clf = LogisticRegression(random_state=42)

print('Count Vectorizer & Logistic Regression')
print()
vect_and_classify(vect, clf, X, y)

Count Vectorizer & Logistic Regression

Train Accuracy: 0.9811
Test Accuracy: 0.7508

Train ROC AUC score: 0.9797
Test ROC AUC score: 0.7400


In [32]:
vect = TfidfVectorizer(max_features=None,
                       ngram_range=(1,2),
                       stop_words='english')

clf = LogisticRegression(random_state=42)

print('TF-IDF Vectorizer & Logistic Regression')
print()
vect_and_classify(vect, clf, X, y)

TF-IDF Vectorizer & Logistic Regression

Train Accuracy: 0.8820
Test Accuracy: 0.7470

Train ROC AUC score: 0.8736
Test ROC AUC score: 0.7361


In [33]:
vect = CountVectorizer(max_features=None,
                       ngram_range=(1,2),
                       stop_words='english')

clf = LinearSVC(random_state=42)

print('Count Vectorizer & LinearSVC Classifier')
print()
vect_and_classify(vect, clf, X, y)

Count Vectorizer & LinearSVC Classifier

Train Accuracy: 0.9960
Test Accuracy: 0.7380

Train ROC AUC score: 0.9955
Test ROC AUC score: 0.7304




In [34]:
vect = TfidfVectorizer(max_features=None,
                       ngram_range=(1,2),
                       stop_words='english')

clf = LinearSVC(random_state=42)

print('TF-IDF Vectorizer & LinearSVC Classifier')
print()
vect_and_classify(vect, clf, X, y)

TF-IDF Vectorizer & LinearSVC Classifier

Train Accuracy: 0.9946
Test Accuracy: 0.7465

Train ROC AUC score: 0.9942
Test ROC AUC score: 0.7396


In [36]:
# Every run so far used ngram_range=(1,2). As a check, try unigrams only,
# with ngram_range=(1,1), to see whether that works better.
vect = CountVectorizer(max_features=None,
                       ngram_range=(1,1),
                       stop_words='english')

clf = LogisticRegression(random_state=42)

print('Count Vectorizer & Logistic Regression, with an ngram range of (1,1)')
print()
vect_and_classify(vect, clf, X, y)

Count Vectorizer & Logistic Regression, with an ngram range of (1,1)

Train Accuracy: 0.8830
Test Accuracy: 0.7453

Train ROC AUC score: 0.8768
Test ROC AUC score: 0.7351


# Part 4 - Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized Twitter dataset.

2) Display the 10 words that are most similar to the word "twitter".

In [37]:
from gensim.models.word2vec import Word2Vec

In [39]:
# Peek at the input that will go into Word2Vec
print(len(feels))
feels[:5]

99989


[['sad', 'apl', 'friend'],
 ['missed', 'new', 'moon', 'trailer'],
 ['omg', 'already'],
 ['omgaga',
  'im',
  'sooo',
  'im',
  'gunna',
  'cry',
  'dentist',
  'since',
  'suposed',
  'get',
  'crown',
  'put'],
 ['think', 'mi', 'bf', 'cheating']]

In [83]:
%%time
# Fit to our cleaned/tokenized data.
# I tried a few different parameter settings before settling on these.
# Note: `size` is the gensim 3.x name; gensim 4.0+ calls it `vector_size`.
w2v = Word2Vec(feels, min_count=10, window=3, size=500, negative=5)

CPU times: user 17 s, sys: 124 ms, total: 17.1 s
Wall time: 6.46 s


In [84]:
# How many words are in our vocabulary?
# (gensim 3.x API; gensim 4.0+ exposes the vocabulary as w2v.wv.key_to_index)
len(w2v.wv.vocab)

6837

In [85]:
# Example words from the vocabulary
list(w2v.wv.vocab)

['sad',
 'friend',
 'missed',
 'new',
 'moon',
 'trailer',
 'omg',
 'already',
 'im',
 'sooo',
 'gunna',
 'cry',
 'dentist',
 'since',
 'get',
 'crown',
 'put',
 'think',
 'mi',
 'bf',
 'cheating',
 'worry',
 'much',
 'chillin',
 'sunny',
 'work',
 'tomorrow',
 'tv',
 'tonight',
 'handed',
 'uniform',
 'today',
 'miss',
 'hmmmm',
 'wonder',
 'number',
 'must',
 'positive',
 'thanks',
 'haters',
 'face',
 'day',
 'weekend',
 'sucked',
 'far',
 'jb',
 'isnt',
 'showing',
 'australia',
 'ok',
 'thats',
 'win',
 'lt',
 'way',
 'feel',
 'right',
 'man',
 'completely',
 'useless',
 'rt',
 'funny',
 'twitter',
 'http',
 'myloc',
 'feeling',
 'fine',
 'gon',
 'na',
 'go',
 'listen',
 'celebrate',
 'huge',
 'roll',
 'thunder',
 'scary',
 'cut',
 'beard',
 'growing',
 'well',
 'year',
 'start',
 'happy',
 'meantime',
 'iran',
 'one',
 'see',
 'cause',
 'else',
 'following',
 'pretty',
 'awesome',
 'level',
 'writing',
 'massive',
 'blog',
 'tweet',
 'myspace',
 'comp',
 'shut',
 'lost',
 'positi

In [86]:
# Display the 10 words that are most similar to the word "twitter"
w2v.wv.most_similar('twitter', topn=10)

[('dm', 0.8473275303840637),
 ('email', 0.8394845128059387),
 ('link', 0.8251531720161438),
 ('sent', 0.8088736534118652),
 ('facebook', 0.7989238500595093),
 ('message', 0.7944740056991577),
 ('info', 0.7845165729522705),
 ('myspace', 0.7794843912124634),
 ('list', 0.7783143520355225),
 ('profile', 0.7758128046989441)]

In [87]:
# The main context for the word "miss" is people texting
# their loved ones.
w2v.wv.most_similar(positive=['miss'], topn=10)

[('luv', 0.7986624240875244),
 ('sooo', 0.7817818522453308),
 ('xx', 0.7746195793151855),
 ('guys', 0.7677934169769287),
 ('r', 0.7640038728713989),
 ('doin', 0.7634164094924927),
 ('boo', 0.7572261691093445),
 ('goodnight', 0.7514300346374512),
 ('missed', 0.7511347532272339),
 ('soooo', 0.7458885312080383)]

In [88]:
# But if we remove the context of the word "love", "miss" becomes 
# instead about times and events that people missed.
w2v.wv.most_similar(positive=['miss'], negative=['love'], topn=10)

[('school', 0.5754621028900146),
 ('bed', 0.5715541839599609),
 ('sleep', 0.5299521088600159),
 ('summer', 0.49757999181747437),
 ('im', 0.48902300000190735),
 ('home', 0.48606789112091064),
 ('tomorrow', 0.47004780173301697),
 ('weekend', 0.46783655881881714),
 ('tonight', 0.4603281021118164),
 ('work', 0.4577789306640625)]

In [104]:
# Given this random set of words from the vocabulary, we can query
# the word "joe" and see that it's closest to the word "matt" from the list.
random_list = ['praise',
               'sry',
               'matt',
               'couldnt',
               'near',
               'cell',
               'screen',
               'dead',
               'texts',
               'calling',
               'sickness']

w2v.wv.most_similar_to_given('joe', random_list)

'matt'
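Similarity queries like the ones above rank words by the cosine similarity of their vectors. The sketch below reproduces the idea behind `most_similar_to_given` with plain Python; the three-dimensional `vectors` are hypothetical values chosen for illustration, not taken from the trained model, and the helper functions are simplified stand-ins rather than gensim's implementation.

```python
import math

# Hypothetical 3-dimensional embeddings; a trained model would supply
# real vectors with hundreds of dimensions.
vectors = {
    "joe": [0.9, 0.1, 0.2],
    "matt": [0.8, 0.2, 0.1],
    "screen": [0.1, 0.9, 0.3],
    "dead": [0.2, 0.3, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar_to_given(word, candidates):
    """Return the candidate whose vector is closest in angle to `word`."""
    return max(
        candidates,
        key=lambda c: cosine_similarity(vectors[word], vectors[c]),
    )

print(most_similar_to_given("joe", ["matt", "screen", "dead"]))
```

Because cosine similarity depends only on direction, two words score as similar when they occur in similar contexts, regardless of how often each one appears.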