# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [1]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [2]:
 ##### Your Code Here #####
" ".join(whitespace_string.split())

'This is a string that has a lot of extra whitespace.'
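The `split()`/`join()` idiom above can also be written as a regular expression, which connects to the RegEx exercise that follows. This is a minimal sketch of that alternative, not a replacement for the answer above; the variable name `collapsed` is introduced here for illustration.

```python
import re

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then trim the leading and trailing space that remains.
collapsed = re.sub(r"\s+", " ", whitespace_string).strip()
print(collapsed)
```

Both approaches should give the same string; `split()` with no arguments is usually the simpler choice when no other pattern matching is needed.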

### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [55]:
string = """
March 8, 2015
March 15, 2015
March 22, 2015
March 29, 2015
April 5, 2015
April 12, 2015
April 19, 2015
April 26, 2015
May 3, 2015
May 10, 2015
May 17, 2015
May 24, 2015
May 31, 2015
June 7, 2015
June 14, 2015
June 21, 2015
June 28, 2015
July 5, 2015
July 12, 2015
July 19, 2015
"""
import re

# Capture the month name, the day, and the four-digit year
regex = r"([a-zA-Z]+) (\d+), (\d{4})"

search_result = re.findall(regex, string)

for match in search_result:
    print(match)

('March', '8', '2015')
('March', '15', '2015')
('March', '22', '2015')
('March', '29', '2015')
('April', '5', '2015')
('April', '12', '2015')
('April', '19', '2015')
('April', '26', '2015')
('May', '3', '2015')
('May', '10', '2015')
('May', '17', '2015')
('May', '24', '2015')
('May', '31', '2015')
('June', '7', '2015')
('June', '14', '2015')
('June', '21', '2015')
('June', '28', '2015')
('July', '5', '2015')
('July', '12', '2015')
('July', '19', '2015')


In [56]:
import pandas as pd

df = pd.DataFrame(search_result, columns=['Month', 'Day', 'Year'])
df

Unnamed: 0,Month,Day,Year
0,March,8,2015
1,March,15,2015
2,March,22,2015
3,March,29,2015
4,April,5,2015
5,April,12,2015
6,April,19,2015
7,April,26,2015
8,May,3,2015
9,May,10,2015


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [4]:
import re
import string

import nltk

nltk.download('stopwords')
from nltk.tokenize import sent_tokenize # Sentence Tokenizer
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.probability import FreqDist

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Alexander/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv')
df.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [6]:
table = str.maketrans('','', string.punctuation)
stop_words = set(stopwords.words('english'))

def nltk_tokenize(text):
  """Return lowercase, punctuation-free, alphabetic, non-stopword tokens from text."""
  # Tokenize by word
  tokens = word_tokenize(text)
  # Make all words lowercase
  lowercase_tokens = [w.lower() for w in tokens]
  # Strip punctuation from within words
  no_punctuation = [x.translate(table) for x in lowercase_tokens]
  # Keep only purely alphabetic tokens (drops numbers and empty strings)
  alphabetic = [word for word in no_punctuation if word.isalpha()]
  # Remove stopwords
  words = [w for w in alphabetic if w not in stop_words]
  return words

In [7]:
df['SentimentCleaned'] = df['SentimentText'].apply(nltk_tokenize)
df.head()

Unnamed: 0,Sentiment,SentimentText,SentimentCleaned
0,0,is so sad for my APL frie...,"[sad, apl, friend]"
1,0,I missed the New Moon trail...,"[missed, new, moon, trailer]"
2,1,omg its already 7:30 :O,"[omg, already]"
3,0,.. Omgaga. Im sooo im gunna CRy. I'...,"[omgaga, im, sooo, im, gunna, cry, dentist, si..."
4,0,i think mi bf is cheating on me!!! ...,"[think, mi, bf, cheating, tt]"


In [8]:
df['SentimentCleaned'][:5]

0                                   [sad, apl, friend]
1                         [missed, new, moon, trailer]
2                                       [omg, already]
3    [omgaga, im, sooo, im, gunna, cry, dentist, si...
4                        [think, mi, bf, cheating, tt]
Name: SentimentCleaned, dtype: object
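The four cleaning properties (lowercase, no stopwords, no punctuation, word-level tokens) can be approximated without NLTK for a quick check. This is only a sketch under stated assumptions: `SMALL_STOPWORDS` is a hypothetical, deliberately tiny stand-in for NLTK's English stopword list, `simple_tokenize` is a name made up here, the input strings are made-up samples modeled on the tweets above, and whitespace splitting is not equivalent to `word_tokenize`.

```python
import string

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
SMALL_STOPWORDS = {"i", "is", "so", "for", "my", "the", "on", "me", "its"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords and non-alphabetic tokens."""
    lowered = text.lower()
    no_punct = lowered.translate(PUNCT_TABLE)
    tokens = no_punct.split()
    return [t for t in tokens if t.isalpha() and t not in SMALL_STOPWORDS]

print(simple_tokenize("I missed the New Moon trailer!"))
print(simple_tokenize("i think mi bf is cheating on me!!!"))
```

Because punctuation is removed before splitting here, contractions and emoticons are handled differently than in the NLTK version, so the two tokenizers will not always agree.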

### How should TF-IDF scores be interpreted? How are they calculated?

#### Your Answer Here #####

TF-IDF equals **term frequency (how often a particular term t appears in a particular document d)** multiplied by **inverse document frequency (the logarithm of the ratio of the total number of documents to the number of documents that contain the term t)**.

A TF-IDF score is interpreted as a measure of how important a term is to a particular document: a term scores high when it appears often in that document but in few other documents. Words that occur frequently across the whole corpus are downweighted in the feature vectors, so the scores highlight what is distinctive about each document.
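The calculation described above can be worked through by hand on a small example. The sketch below uses a hypothetical three-document toy corpus and the plain textbook formula (tf as a share of the document's tokens, idf as the natural log of N over document frequency); the function names are introduced here for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["sad", "friend", "sad"],
    ["new", "moon", "trailer"],
    ["sad", "new", "day"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens equal to `term`.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of documents containing `term`).
    # Assumes the term appears in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "sad" is frequent in document 0 but also appears in another document.
print(tf_idf("sad", docs[0], docs))
# "friend" appears once, but only in document 0, so its idf is higher.
print(tf_idf("friend", docs[0], docs))
```

In this toy corpus "friend" ends up with a higher score than "sad" in document 0 even though "sad" occurs twice, which is the downweighting effect described above.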

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.



In [9]:
##### Your Code Here #####
df.head()

Unnamed: 0,Sentiment,SentimentText,SentimentCleaned
0,0,is so sad for my APL frie...,"[sad, apl, friend]"
1,0,I missed the New Moon trail...,"[missed, new, moon, trailer]"
2,1,omg its already 7:30 :O,"[omg, already]"
3,0,.. Omgaga. Im sooo im gunna CRy. I'...,"[omgaga, im, sooo, im, gunna, cry, dentist, si..."
4,0,i think mi bf is cheating on me!!! ...,"[think, mi, bf, cheating, tt]"


In [10]:
# Creating a new column of df['SentimentCleaned'] that can be fed into scikit-learn vectorizers

sentiment_vector = []
for i in df['SentimentCleaned']:
    new_sentiment = " ".join(i)
    sentiment_vector.append(new_sentiment)
    
df['sentiment_vector'] = sentiment_vector
df.head()

Unnamed: 0,Sentiment,SentimentText,SentimentCleaned,sentiment_vector
0,0,is so sad for my APL frie...,"[sad, apl, friend]",sad apl friend
1,0,I missed the New Moon trail...,"[missed, new, moon, trailer]",missed new moon trailer
2,1,omg its already 7:30 :O,"[omg, already]",omg already
3,0,.. Omgaga. Im sooo im gunna CRy. I'...,"[omgaga, im, sooo, im, gunna, cry, dentist, si...",omgaga im sooo im gunna cry dentist since supo...
4,0,i think mi bf is cheating on me!!! ...,"[think, mi, bf, cheating, tt]",think mi bf cheating tt


In [12]:
df.shape

(99989, 4)

In [57]:
# Tina, I ended up truncating the dataset to about half its size because I was getting runtime errors
# when running Logistic Regression on the whole dataset. That's what I'm doing here:

In [19]:
df = df.loc[:49999, :]
df.shape

(50000, 4)

In [20]:
from sklearn.model_selection import train_test_split

X = df.sentiment_vector
y = df.Sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [21]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(40000,)
(10000,)
(40000,)
(10000,)


In [22]:
# Using CountVectorizer to Create Bag-of-Words

vectorizer = CountVectorizer(max_features=None, ngram_range=(1, 1), stop_words='english')

vectorizer.fit(X_train)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [60]:
print(vectorizer.vocabulary_)



In [23]:
train_word_counts = vectorizer.fit_transform(X_train)
test_word_counts = vectorizer.transform(X_test)

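# Showing the vectorized data as a DataFrame
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()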
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())
X_test_vectorized.head()

Unnamed: 0,aa,aaa,aaaaaa,aaaaaaaaa,aaaaaaaaaaaaaaaaaa,aaaaaaaaahh,aaaaaaah,aaaaaaalcohol,aaaaaahhhhhhhh,aaaaaand,...,zzzzz,zzzzzzzzzzzz,ãªnfase,ðµ,ðµñ,ðº,ðºð,ðºðµ,øª,øªù
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_preds = lr.predict(X_train_vectorized)
test_preds = lr.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_preds)}')



Train Accuracy: 0.89995
Test Accuracy: 0.7443
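Converting the word counts to a dense DataFrame with `.toarray()` may be one reason the full dataset was hard to fit, since scikit-learn estimators such as `LogisticRegression` accept the sparse matrix that the vectorizer returns. The sketch below shows that pattern with a pipeline, uses TF-IDF as the second vectorization method, and reports F1 alongside accuracy. The texts and labels are hypothetical toy data standing in for `df.sentiment_vector` and `df.Sentiment`, so the scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df.sentiment_vector / df.Sentiment.
train_texts = ["love great day", "happy fun great", "sad bad day", "hate bad sad"]
train_labels = [1, 1, 0, 0]
test_texts = ["great fun", "bad hate"]
test_labels = [1, 0]

# The pipeline passes the vectorizer's sparse output straight to the classifier,
# so no dense array or DataFrame is ever built.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))
clf.fit(train_texts, train_labels)

test_preds = clf.predict(test_texts)
print("Test accuracy:", accuracy_score(test_labels, test_preds))
print("Test F1:", f1_score(test_labels, test_preds))
```

Fitting the vectorizer inside the pipeline also means the vocabulary is learned from the training texts only, which matches step 3 of the instructions.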


In [61]:
"""

# Trying to use RandomForestClassifier, but it ended up taking too long to even train the truncated
# dataset.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_preds = rf.predict(X_train_vectorized)
test_preds = rf.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_preds)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_preds)}')

"""


# Part 4 - Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [26]:
##### Your Code Here #####
from nltk.tokenize import word_tokenize
from gensim.models.word2vec import Word2Vec

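# Note: this tokenizes the raw tweets; df.SentimentCleaned already holds the cleaned tokens
sentences = [word_tokenize(text) for text in df.SentimentText]

# `size` is the gensim 3.x parameter name; gensim 4.0 renamed it to `vector_size`
model = Word2Vec(sentences, min_count=1, size=200)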

print(model)

Word2Vec(vocab=77675, size=200, alpha=0.025)


In [28]:
model.wv.most_similar('twitter')

[('everyone', 0.9191207885742188),
 ('everything', 0.907377302646637),
 ('300', 0.9018738865852356),
 ('emails', 0.9001036882400513),
 ('laugh', 0.8997212648391724),
 ('yourself', 0.899293065071106),
 ('facebook', 0.8976020812988281),
 ('talking', 0.8908904790878296),
 ('us', 0.8891047239303589),
 ('free', 0.8869091272354126)]
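The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of what that ranking does, the example below computes cosine similarity by hand. The three-dimensional vectors are hypothetical values invented for illustration (the model above uses 200 dimensions), and `cosine_similarity`, `vectors`, and `ranked` are names introduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, not taken from the trained model.
vectors = {
    "twitter": [0.9, 0.1, 0.3],
    "facebook": [0.8, 0.2, 0.4],
    "dentist": [-0.2, 0.9, 0.1],
}

query = "twitter"
# Rank every other word by its similarity to the query word, highest first.
ranked = sorted(
    ((word, cosine_similarity(vectors[query], vec))
     for word, vec in vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Scores close to 1 mean the vectors point in nearly the same direction. When every neighbor scores around 0.9, as in the output above, the embeddings may not be separating words very sharply.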