# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [1]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [2]:
print(" ".join(whitespace_string.split()))

This is a string that has a lot of extra whitespace.
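
The same result can be reached with a regular expression. The following is a minimal sketch, assuming a helper named `normalize_whitespace` (a hypothetical name, not part of the notebook), that collapses every run of whitespace into one space and trims the ends.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "
cleaned = normalize_whitespace(whitespace_string)
print(cleaned)
```

For this input the two approaches agree; `split()` followed by `join()` is arguably the simpler choice when no other pattern matching is needed.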


### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [3]:
import pandas as pd

dates = "March 8, 2015 March 15, 2015 March 22, 2015 March 29, 2015 April 5, 2015 April 12, 2015 April 19, 2015 \
         April 26, 2015 May 3, 2015 May 10, 2015 May 17, 2015 May 24, 2015 May 31, 2015 June 7, 2015 June 14, 2015 \
         June 21, 2015 June 28, 2015 July 5, 2015 July 12, 2015 July 19, 2015"

dates

'March 8, 2015 March 15, 2015 March 22, 2015 March 29, 2015 April 5, 2015 April 12, 2015 April 19, 2015          April 26, 2015 May 3, 2015 May 10, 2015 May 17, 2015 May 24, 2015 May 31, 2015 June 7, 2015 June 14, 2015          June 21, 2015 June 28, 2015 July 5, 2015 July 12, 2015 July 19, 2015'

In [4]:
import re 

regex = r"(\w+)\s+(\d\d?)\s*,\s*(\d{4})"

search_result = re.findall(regex, dates)
print(search_result)

[('March', '8', '2015'), ('March', '15', '2015'), ('March', '22', '2015'), ('March', '29', '2015'), ('April', '5', '2015'), ('April', '12', '2015'), ('April', '19', '2015'), ('April', '26', '2015'), ('May', '3', '2015'), ('May', '10', '2015'), ('May', '17', '2015'), ('May', '24', '2015'), ('May', '31', '2015'), ('June', '7', '2015'), ('June', '14', '2015'), ('June', '21', '2015'), ('June', '28', '2015'), ('July', '5', '2015'), ('July', '12', '2015'), ('July', '19', '2015')]


In [5]:
df = pd.DataFrame(search_result, columns=['Month', 'Day', 'Year'])
df.head()

Unnamed: 0,Month,Day,Year
0,March,8,2015
1,March,15,2015
2,March,22,2015
3,March,29,2015
4,April,5,2015
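
As a possible variation, the same pattern can be written with named groups so that each match carries its own column labels. This sketch uses a shortened sample string and illustrative group names (`month`, `day`, `year`). It also parses each match with `datetime.strptime`, which assumes full English month names.

```python
import re
from datetime import datetime

# Same pattern as above, with named groups (the group names are illustrative).
DATE_PATTERN = re.compile(r"(?P<month>\w+)\s+(?P<day>\d\d?)\s*,\s*(?P<year>\d{4})")

sample = "March 8, 2015 March 15, 2015 April 5, 2015"

# One dict per match, keyed by group name.
records = [m.groupdict() for m in DATE_PATTERN.finditer(sample)]

# Parse each match into a real date; %B expects a full month name.
parsed = [
    datetime.strptime(f"{r['month']} {r['day']} {r['year']}", "%B %d %Y").date()
    for r in records
]

print(records[0])
print(parsed)
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`, which uses the keys as column names.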


# Part 2 - Bag of Words 

### Use the Twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [6]:
tweets = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv")
tweets.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [7]:
from nltk.corpus import stopwords, wordnet
import string

stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)

df_copy = tweets.copy()
df_copy['SentimentText'] = df_copy['SentimentText'].str.lower()  #Text is lowercase
df_copy['SentimentText'] = df_copy['SentimentText'].str.translate(table) #Remove punctuation
df_copy['SentimentText'] = df_copy['SentimentText'].str.split() #Tokenize words
df_copy['SentimentText'] = df_copy['SentimentText'].apply(lambda x: [item for item in x if item not in stop_words]) #Remove stop words
print(df_copy['SentimentText'][0])
print(df_copy['SentimentText'][1])
print(df_copy['SentimentText'][2])

# df_copy.head()

['sad', 'apl', 'friend']
['missed', 'new', 'moon', 'trailer']
['omg', 'already', '730']
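
The four cleaning steps can also be sketched as one plain-Python function, which makes the order of operations easier to see. The stop-word set here is a tiny illustrative stand-in for NLTK's English list, and `clean_tweet` is a hypothetical helper name.

```python
import string

# A tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"i", "the", "is", "so", "for", "my"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = clean_tweet("I missed the New Moon trailer...")
print(tokens)
```

One consequence of this ordering is that punctuation is removed before the stop-word check, so a contraction such as "don't" becomes "dont" and may no longer match an entry in the stop-word list.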


### How should TF-IDF scores be interpreted? How are they calculated?

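#### TF-IDF stands for Term Frequency-Inverse Document Frequency. It scores how important a word is to one document relative to the whole collection of documents. Term frequency (TF) is how often the word occurs in the document, and inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents that contain the word; the score is TF multiplied by IDF. Words that appear in nearly every document score close to 0, while words that are frequent in one document but rare elsewhere score higher.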
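
A small worked example may make the calculation concrete. This sketch uses the textbook definition (TF times the log of total documents over documents containing the term) on three made-up documents. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its values will differ from these.

```python
import math

# Three tiny, made-up documents, already tokenized.
docs = [
    ["good", "movie", "good", "cast"],
    ["bad", "movie"],
    ["good", "day", "movie"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    """Unsmoothed, unnormalized TF-IDF score of `term` in `doc`."""
    return tf(term, doc) * idf(term, corpus)

score_movie = tfidf("movie", docs[0], docs)  # "movie" is in every document
score_good = tfidf("good", docs[0], docs)    # frequent here, in 2 of 3 documents
score_cast = tfidf("cast", docs[0], docs)    # appears only in this document

print(score_movie, score_good, score_cast)
```

Here "movie" scores exactly 0 because it appears in all three documents, while "cast", which occurs only once but in a single document, scores highest.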

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
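
Step 3 is the part that most often goes wrong: the vocabulary should come from the training data only, and the test data is then mapped onto those same columns. The following plain-Python sketch illustrates the idea with made-up documents; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

# Made-up documents for illustration only.
train_docs = ["good movie", "bad movie", "good good day"]
test_docs = ["good film", "bad day"]

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Turn each document into a count vector; tokens outside vocab are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

vocab = fit_vocabulary(train_docs)          # learned from training data only
X_train_counts = transform(train_docs, vocab)
X_test_counts = transform(test_docs, vocab)  # same columns as the training matrix

print(vocab)
print(X_test_counts)
```

The word "film" never appears in the training documents, so it has no column and is silently dropped from the test vectors. This mirrors calling `fit` on the training set and `transform` on both sets.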



In [8]:
tweets.Sentiment.value_counts()

1    56457
0    43532
Name: Sentiment, dtype: int64
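
Because the classes are not perfectly balanced, it can help to know what accuracy a model would get by always predicting the majority class. This short calculation uses the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 56457, 0: 43532}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier's accuracy is best read against this baseline rather than against 50%.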

In [9]:
from sklearn.model_selection import train_test_split

X = tweets['SentimentText']
y = tweets['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.2,
                                                   random_state=42)

In [10]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(79991,)
(19998,)
(79991,)
(19998,)


In [11]:
## Using Count Vectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



In [12]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(79991, 10000)


Unnamed: 0,00,000,007peter,00am,02,04,05,06,07,09,...,ð½ðµ,ð½ð¾,ð½ñ,ð¾,ð¾ð,ð¾ñ,øª,øªø,øªù,ø¹ù
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(19998, 10000)


Unnamed: 0,00,000,007peter,00am,02,04,05,06,07,09,...,ð½ðµ,ð½ð¾,ð½ñ,ð¾,ð¾ð,ð¾ñ,øª,øªø,øªù,ø¹ù
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

In [16]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.802002725306597
Test Accuracy: 0.7499749974997499


In [20]:
LR.fit(X_train_vectorized, y_train)
y_pred_proba = LR.predict_proba(X_test_vectorized)[:,1]
print('Validation ROC AUC:', roc_auc_score(y_test, y_pred_proba))

Validation ROC AUC: 0.8194183143669986
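
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on a four-example toy input. The `roc_auc` helper is hypothetical and far slower than `roc_auc_score`, so it is meant only to show what the number means.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This O(n_pos * n_neg) loop is for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. A value of 0.5 would mean the scores rank the classes no better than chance.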


In [21]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)
y_pred_proba = MNB.predict_proba(X_test_vectorized)[:,1]

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print('Validation ROC AUC:', roc_auc_score(y_test, y_pred_proba))

Train Accuracy: 0.7839631958595342
Test Accuracy: 0.7498249824982498
Validation ROC AUC: 0.81840942897785


In [22]:
## Using TF-IDF Vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



In [23]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(79991, 10000)


Unnamed: 0,00,000,007peter,00am,02,04,05,06,07,09,...,ð½ðµ,ð½ð¾,ð½ñ,ð¾,ð¾ð,ð¾ñ,øª,øªø,øªù,ø¹ù
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(19998, 10000)


Unnamed: 0,00,000,007peter,00am,02,04,05,06,07,09,...,ð½ðµ,ð½ð¾,ð½ñ,ð¾,ð¾ð,ð¾ñ,øª,øªø,øªù,ø¹ù
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)
y_pred_proba = LR.predict_proba(X_test_vectorized)[:,1]


print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print('Validation ROC AUC:', roc_auc_score(y_test, y_pred_proba))

Train Accuracy: 0.7930642197247191
Test Accuracy: 0.7513751375137514
Validation ROC AUC: 0.8290713879292828


In [26]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)
y_pred_proba = MNB.predict_proba(X_test_vectorized)[:,1]

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')
print('Validation ROC AUC:', roc_auc_score(y_test, y_pred_proba))

Train Accuracy: 0.7843632408645973
Test Accuracy: 0.744924492449245
Validation ROC AUC: 0.8212273724852673


# Part 4 - Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized Twitter dataset.

2) Display the 10 words that are most similar to the word "twitter"

In [None]:
!pip install -U gensim
import gensim

In [None]:
import nltk
nltk.download('all')

In [30]:
from gensim.models.word2vec import Word2Vec
# Note: gensim 4.0 renamed the `size` parameter to `vector_size`; this cell uses the older name
model = Word2Vec(df_copy['SentimentText'], min_count=5, size=200)

In [31]:
print(model)
# Note: in gensim 4.0+, `model.wv.vocab` is replaced by `model.wv.key_to_index`
print(list(model.wv.vocab))
print(len(model.wv.vocab))

Word2Vec(vocab=12978, size=200, alpha=0.025)
12978


In [32]:
model.wv.most_similar('twitter')

[('facebook', 0.8220177292823792),
 ('list', 0.8020410537719727),
 ('link', 0.7957506775856018),
 ('page', 0.7953722476959229),
 ('following', 0.7829118967056274),
 ('sent', 0.7820925116539001),
 ('dm', 0.778730571269989),
 ('grats', 0.7720205783843994),
 ('email', 0.7697209119796753),
 ('site', 0.7667928338050842)]
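
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, the following computes it by hand on made-up 3-dimensional vectors; real Word2Vec vectors here would have 200 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" purely for illustration.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score about 1 regardless of length, and orthogonal vectors score 0. A score near 0.82 for "facebook" therefore means its vector points in a direction close to that of "twitter".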