# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [1]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [2]:
print(" ".join(whitespace_string.split()))

This is a string that has a lot of extra whitespace.
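
Called with no arguments, `str.split()` treats any run of whitespace (spaces, tabs, newlines) as a single delimiter and ignores leading and trailing whitespace, which is why re-joining the pieces with one space normalizes the string. As a minimal sketch, the same result can be reached with a regular expression; the names `via_split` and `via_regex` are illustrative only.

```python
import re

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

# split() with no separator treats any run of whitespace as one delimiter
via_split = " ".join(whitespace_string.split())

# Regex alternative: collapse each whitespace run to one space, then trim the ends
via_regex = re.sub(r"\s+", " ", whitespace_string).strip()

print(via_split == via_regex)
```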


### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [109]:
import re
import pandas as pd

regex = r"(?:,|\s)\s*"

# A multi-character regex separator is only supported by the Python parsing engine
df = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt", sep=regex, header=None, engine="python")
df.columns = ['Month', 'Day', 'Year']
df.head(20)



  


Unnamed: 0,Month,Day,Year
0,March,8,2015
1,March,15,2015
2,March,22,2015
3,March,29,2015
4,April,5,2015
5,April,12,2015
6,April,19,2015
7,April,26,2015
8,May,3,2015
9,May,10,2015
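
The cell above uses the regular expression only as a column separator. A more explicit alternative is to capture each part of the date with named groups. The sketch below assumes every line looks like `March 8, 2015` (inferred from the output above) and uses a few hard-coded sample lines instead of downloading the file; `date_pattern` and `rows` are illustrative names.

```python
import re

# Sample lines in the assumed "Month Day, Year" layout (values taken from the output above)
lines = ["March 8, 2015", "March 15, 2015", "April 5, 2015"]

# Named groups: month name, 1-2 digit day, 4-digit year
date_pattern = re.compile(r"(?P<Month>[A-Za-z]+)\s+(?P<Day>\d{1,2}),?\s+(?P<Year>\d{4})")

rows = []
for line in lines:
    match = date_pattern.search(line)
    if match:  # skip lines that do not look like a date
        row = match.groupdict()
        row["Day"] = int(row["Day"])
        row["Year"] = int(row["Year"])
        rows.append(row)

print(rows)
```

Passing `rows` to `pd.DataFrame(rows, columns=["Day", "Month", "Year"])` would give a dataframe with the three requested columns.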


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [114]:
# import dataset
df = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv")
print(df.shape)
df.head()

(99989, 2)


Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [115]:
df_tweets = df['SentimentText']

In [116]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [117]:
import string

table = str.maketrans('','', string.punctuation)
stop_words = set(stopwords.words('english'))

cleaned_tweets = []

for tweet in df_tweets:
    # Tokenize by word
    tokens = word_tokenize(tweet)
    # Make all words lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    # Strip punctuation from within words
    no_punctuation = [x.translate(table) for x in lowercase_tokens]
    # Optional step, disabled here: remove words that aren't alphabetic
    #alphabetic = [word for word in no_punctuation if word.isalpha()]
    # Remove stopwords
    words = [w for w in no_punctuation if w not in stop_words]
    # Append to list
    cleaned_tweets.append(words)

In [118]:
df['SentimentText_tokenized'] = cleaned_tweets

In [119]:
df.head()

Unnamed: 0,Sentiment,SentimentText,SentimentText_tokenized
0,0,is so sad for my APL frie...,"[sad, apl, friend, , , , , ]"
1,0,I missed the New Moon trail...,"[missed, new, moon, trailer, ]"
2,1,omg its already 7:30 :O,"[omg, already, 730, ]"
3,0,.. Omgaga. Im sooo im gunna CRy. I'...,"[, omgaga, , im, sooo, im, gunna, cry, , denti..."
4,0,i think mi bf is cheating on me!!! ...,"[think, mi, bf, cheating, , , , tt]"
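
The tokenized column above still contains empty strings: a token made up only of punctuation becomes `''` once its characters are stripped. A minimal sketch of one way to drop them follows. To stay self-contained it uses a small hand-written stopword set and an already-tokenized list in place of NLTK's `stopwords` and `word_tokenize`; both are stand-ins for illustration, not the real data.

```python
import string

table = str.maketrans("", "", string.punctuation)

# Hypothetical stand-ins for NLTK's stopword list and word_tokenize output
stop_words = {"is", "so", "for", "my", "i", "the"}
tokens = ["is", "so", "sad", "for", "my", "APL", "friend", ".", ".", "!"]

cleaned = []
for token in tokens:
    word = token.lower().translate(table)
    # An empty string means the token consisted only of punctuation
    if word and word not in stop_words:
        cleaned.append(word)

print(cleaned)  # ['sad', 'apl', 'friend']
```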


### How should TF-IDF scores be interpreted? How are they calculated?

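TF-IDF weighs how often a term appears in a document (term frequency) against how many documents in the corpus contain it (inverse document frequency). A term scores high when it is frequent in one document but rare across the corpus, so it helps distinguish that document. A term that appears in most documents, such as a stopword, scores close to 0. The scores are relative weights rather than probabilities: they fall between 0 and 1 only when each document vector is normalized, which scikit-learn's `TfidfVectorizer` does by default.

TF-IDF is calculated as:

TF = # times a word appears in a document / total words in that document
IDF = log(total # documents / # documents containing the word)

**TF-IDF = TF * IDF**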
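
As a worked sketch of the calculation, the snippet below computes TF-IDF by hand on a made-up three-document corpus. It uses the logarithmic form of IDF, which is the common convention; libraries such as scikit-learn add smoothing terms on top of it, so their numbers will differ. The helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

# Toy corpus of already-tokenized documents (made up for illustration)
docs = [
    ["good", "day", "today"],
    ["bad", "day"],
    ["good", "good", "coffee"],
]

def tf(term, doc):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)  # assumes the term occurs somewhere

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "coffee" occurs in one document only, "good" in two of the three
print(tf_idf("coffee", docs[2], docs))
print(tf_idf("good", docs[2], docs))
```

Although "good" appears twice in the third document and "coffee" only once, "coffee" gets the higher score because it occurs in fewer documents.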

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.



In [120]:
from sklearn.model_selection import train_test_split

X = df['SentimentText']
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [121]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(79991,)
(19998,)
(79991,)
(19998,)


In [126]:
# Use Count Vectorizer for cleaning and preprocessing
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=400, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)


{'sleep': 290, 'watching': 366, 'quot': 259, 'school': 280, 'haha': 131, 'just': 171, 'forgot': 103, 'la': 177, 'love': 200, 'music': 227, 'little': 188, 'gotta': 124, 'dont': 78, 'know': 176, 'coffee': 52, 'lt': 203, 'gt': 126, 'twitpic': 348, 'working': 382, 'wanna': 360, 'hear': 143, 'bet': 29, 'cute': 62, 'like': 184, 'friend': 107, 'told': 334, 'got': 123, 'video': 357, 'll': 190, '10': 0, '30': 2, 'week': 369, 'ago': 6, 'http': 160, 'com': 54, 'time': 329, 'long': 193, 'party': 241, 'kinda': 174, 'thing': 320, 'good': 122, 'night': 232, 'sun': 308, 'way': 367, 'did': 70, 'today': 333, 'yes': 398, 'sigh': 288, 'cool': 58, 'things': 321, 'hot': 156, 'day': 66, 'right': 270, 'lol': 192, 'new': 229, 'nice': 231, 'meet': 216, 'happy': 137, 'sad': 274, 'come': 55, 'makes': 210, 'mad': 208, 'yeah': 394, 'remember': 267, 'yep': 397, 'does': 74, 'doesn': 75, 'think': 322, 'old': 237, 'far': 91, 'im': 165, 'car': 46, 'having': 141, 'iphone': 167, 'really': 265, 'want': 361, 'don': 77, 'bad

In [127]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(79991, 400)


Unnamed: 0,10,100,30,able,actually,add,ago,agree,ah,alexalltimelow,...,xx,ya,yay,yea,yeah,year,years,yep,yes,yesterday
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [128]:
# Vectorize X_test
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(19998, 400)


Unnamed: 0,10,100,30,able,actually,add,ago,agree,ah,alexalltimelow,...,xx,ya,yay,yea,yeah,year,years,yep,yes,yesterday
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
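
The cells above fit the vocabulary on the training data only, then transform both the training and test data with it. The plain-Python sketch below shows that pattern on made-up sentences. It is not how `CountVectorizer` is implemented, and `fit_vocabulary` and `transform` are hypothetical helpers.

```python
from collections import Counter

# Made-up documents for illustration
train_docs = ["good morning twitter", "good night"]
test_docs = ["good evening twitter"]

def fit_vocabulary(docs):
    """Map each distinct word in docs to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

vocabulary = fit_vocabulary(train_docs)  # learned from the training data only
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)

print(vocabulary)
print(test_matrix)
```

The word "evening" appears only in the test data, so it has no column and is dropped. Fitting on the test set as well would leak information about it into the features.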


#### Run classification models to see accuracy

In [129]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.7116300583815679
Test Accuracy: 0.7059205920592059


In [131]:
# MultinomialNB
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.7077671238014277
Test Accuracy: 0.7054205420542055


In [130]:
# Check against Random Forest
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=200).fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.8965008563463389
Test Accuracy: 0.6791179117911791
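
The random forest's gap between train accuracy (about 0.897) and test accuracy (about 0.679) suggests it overfits compared with the two linear models. For the stretch goal of a metric other than accuracy, precision, recall, and F1 can be derived from the confusion counts. The sketch below does this in plain Python on made-up labels; `binary_metrics` is a hypothetical helper. In practice `sklearn.metrics` provides `precision_score`, `recall_score`, and `f1_score`.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class, labelled 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and predictions for illustration
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```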


# Part 4 - Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [132]:
import gensim

In [137]:
from gensim.models.word2vec import Word2Vec

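# gensim 4.x parameter name; in gensim 3.x this was size=400
w2v = Word2Vec(df.SentimentText_tokenized, min_count=20, window=3, vector_size=400, negative=20)

In [138]:
# gensim 4.x attribute; in gensim 3.x this was w2v.wv.vocab
words = list(w2v.wv.key_to_index)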
print(f'Vocabulary Size: {len(words)}')

Vocabulary Size: 3731


In [139]:
w2v.wv.most_similar('twitter', topn=10)

[('email', 0.723713755607605),
 ('updates', 0.7034378051757812),
 ('facebook', 0.701595664024353),
 ('address', 0.6986019611358643),
 ('account', 0.6972944736480713),
 ('myspace', 0.6967622637748718),
 ('dm', 0.6963673233985901),
 ('info', 0.6834712028503418),
 ('list', 0.6820206046104431),
 ('others', 0.6798641681671143)]
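
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below ranks neighbours by cosine similarity in plain Python, using made-up 3-dimensional vectors rather than the trained model's 400-dimensional ones; the `vectors` dictionary and both helper functions are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up word vectors for illustration only
vectors = {
    "twitter": [0.9, 0.1, 0.3],
    "facebook": [0.8, 0.2, 0.4],
    "coffee": [0.1, 0.9, 0.2],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("twitter", vectors))
```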