
# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [47]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [48]:
 ##### Your Code Here #####
print(' '.join(whitespace_string.split()))

This is a string that has a lot of extra whitespace.
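
As an alternative sketch, the same cleanup can also be written with a regular expression. `collapse_whitespace` is a hypothetical helper name introduced here for illustration; it is not part of the original notebook.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

sample = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Both approaches treat spaces, tabs, and newlines the same way, so they should agree on this string.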


### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [41]:
##### Your Code Here #####
import re

import pandas as pd
import requests

data = requests.get('https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt')
# Normalize line endings, then keep one date string per line
dates = data.text.replace('\r', '').split('\n')

# The patterns assume each line has a capitalized month name,
# a 1-2 digit day followed by a comma, and a 4-digit year
months = [re.findall(r'[A-Z][a-z]+', date)[0] for date in dates]
days = [re.findall(r'\d{1,2},', date)[0][:-1] for date in dates]
years = [re.findall(r'\d{4}', date)[0] for date in dates]
df_dates = pd.DataFrame({'Day' : days,
                        'Month' : months,
                        'Year' : years})
df_dates.head()

Unnamed: 0,Day,Month,Year
0,8,March,2015
1,15,March,2015
2,22,March,2015
3,29,March,2015
4,5,April,2015
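
The three separate `findall` calls can also be folded into one pattern with named groups. The sketch below assumes each line looks like `March 8, 2015` (inferred from the patterns and the output above, not checked against the file); `DATE_PATTERN`, `parse_date`, and the sample lines are hypothetical names and data used only for illustration.

```python
import re

# Assumed line format: "<Month> <day>, <year>", e.g. "March 8, 2015".
DATE_PATTERN = re.compile(
    r'(?P<Month>[A-Z][a-z]+)\s+(?P<Day>\d{1,2}),\s*(?P<Year>\d{4})'
)

def parse_date(line):
    """Return a dict with Day, Month, Year, or None if the line does not match."""
    match = DATE_PATTERN.search(line)
    if match is None:
        return None
    parts = match.groupdict()
    return {'Day': int(parts['Day']),
            'Month': parts['Month'],
            'Year': int(parts['Year'])}

# Made-up sample lines, including an empty line and a non-date.
sample_lines = ['March 8, 2015', 'April 5, 2015', '', 'not a date']
records = [rec for rec in map(parse_date, sample_lines) if rec is not None]
print(records)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, and skipping non-matching lines avoids an `IndexError` on blank lines.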


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

 ### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [2]:
##### Your Code Here #####
import pandas as pd
twitters = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv')
twitters.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [0]:
twitters.shape

In [0]:
pip install -U gensim

In [4]:
import gensim
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
text = twitters['SentimentText']

In [6]:
import string

from nltk.corpus import stopwords

# Build the stop-word set and punctuation table once, not once per tweet
stop_words = set(stopwords.words('english'))
punctuation_table = str.maketrans('', '', string.punctuation)

# turn a doc into clean tokens
def clean_doc(doc):
  # lowercase, then split into tokens by white space
  tokens = str(doc).lower().split()
  # remove punctuation from each token
  tokens = [w.translate(punctuation_table) for w in tokens]
  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  # filter out stop words
  tokens = [w for w in tokens if w not in stop_words]
  # filter out short tokens
  tokens = [word for word in tokens if len(word) > 1]
  return tokens

sentences = [clean_doc(tweet) for tweet in text]
print(sentences[:10])

[['sad', 'apl', 'friend'], ['missed', 'new', 'moon', 'trailer'], ['omg', 'already'], ['omgaga', 'im', 'sooo', 'im', 'gunna', 'cry', 'ive', 'dentist', 'since', 'suposed', 'get', 'crown', 'put'], ['think', 'mi', 'bf', 'cheating', 'tt'], ['worry', 'much'], ['juuuuuuuuuuuuuuuuussssst', 'chillin'], ['sunny', 'work', 'tomorrow', 'tv', 'tonight'], ['handed', 'uniform', 'today', 'miss', 'already'], ['hmmmm', 'wonder', 'number']]


### How should TF-IDF scores be interpreted? How are they calculated?

**TF-IDF**: 
Term Frequency (the share of a document's words accounted for by a given word) times Inverse Document Frequency (a penalty for the word appearing in a large number of documents).

The purpose of TF-IDF is to find what is _unique_ to each document. To do this, it penalizes the term frequencies of words that are common across all documents, which lets each document's most distinctive words rise to the top. A high score means the word is frequent in that document but rare in the rest of the corpus; a score near zero means the word is either rare in the document or present in almost every document.

To calculate it, count how many documents contain the word and turn that count into an inverse proportion, typically log(N / document frequency), where N is the total number of documents. Then, within each document, multiply that weight by the word's term frequency.
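
A minimal sketch of the calculation, assuming the textbook formulation score = tf × log(N / df). The function name `tf_idf` and the toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute textbook TF-IDF scores for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Made-up toy corpus: 'friend' appears in every document.
toy_docs = [['sad', 'friend'], ['new', 'friend'], ['happy', 'friend', 'happy']]
toy_scores = tf_idf(toy_docs)
for doc_scores in toy_scores:
    print(doc_scores)
```

In this toy corpus, `friend` scores zero everywhere because it occurs in all three documents, while `happy` gets the highest score because it is frequent in one document and absent from the others.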

# Part 3 - Document Classification

1) Use train_test_split to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
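
Step 3 above asks for the vocabulary to come from the training data only. The sketch below illustrates that idea with plain word counts; `fit_vocabulary`, `transform`, and the toy documents are hypothetical and stand in for what a scikit-learn vectorizer's `fit` and `transform` do conceptually.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features=None):
    """Map the most frequent training terms to column indices."""
    term_counts = Counter(term for doc in train_docs for term in doc)
    kept = sorted(term for term, _ in term_counts.most_common(max_features))
    return {term: index for index, term in enumerate(kept)}

def transform(docs, vocabulary):
    """Turn tokenized documents into count vectors; unseen terms are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in doc:
            if term in vocabulary:
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

toy_train = [['good', 'day'], ['bad', 'day']]
toy_test = [['good', 'night']]  # 'night' never appears in training

vocabulary = fit_vocabulary(toy_train)          # learned from training data only
train_matrix = transform(toy_train, vocabulary)
test_matrix = transform(toy_test, vocabulary)   # reuses the training vocabulary
print(vocabulary)
print(test_matrix)
```

Because the test set is transformed with the training vocabulary, both matrices have the same columns, and words seen only at test time cannot leak into the features.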



In [0]:
##### Your Code Here #####
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

In [0]:
twitters['text'] = sentences
twitters.head()

In [0]:
twitters['text'] = [' '.join(tweet) for tweet in twitters['text']]

In [11]:
#split the data for the df
def split_data(df, target):
  from sklearn.model_selection import train_test_split
  X, y = df.text, df[target]
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  return X_train, X_test, y_train, y_test



#separate data into training and test
X_train, X_test, y_train, y_test = split_data(twitters, 'Sentiment')
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(79991,)
(19998,)
(79991,)
(19998,)


In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=200, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()


(79991, 200)


Unnamed: 0,actually,amazing,amp,away,awesome,aww,awww,baby,bad,bed,...,wow,wrong,xx,ya,yay,yeah,year,yes,yesterday,youre
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
print(vectorizer.vocabulary_)

{'sleep': 136, 'watching': 179, 'school': 131, 'haha': 56, 'love': 95, 'music': 106, 'little': 86, 'gotta': 51, 'dont': 26, 'know': 79, 'working': 189, 'wanna': 174, 'hear': 62, 'cute': 20, 'like': 85, 'friend': 39, 'got': 50, 'video': 172, 'ill': 74, 'week': 181, 'time': 158, 'long': 89, 'friends': 40, 'thing': 153, 'good': 49, 'night': 110, 'youre': 199, 'way': 180, 'thats': 151, 'today': 160, 'yes': 197, 'cool': 19, 'things': 154, 'hot': 70, 'right': 126, 'lol': 88, 'new': 108, 'nice': 109, 'happy': 58, 'day': 22, 'im': 75, 'sad': 127, 'come': 17, 'makes': 98, 'yeah': 195, 'doesnt': 25, 'think': 155, 'old': 114, 'far': 30, 'whats': 185, 'theres': 152, 'really': 125, 'want': 175, 'bad': 8, 'miss': 102, 'wont': 187, 'help': 64, 'following': 36, 'amp': 2, 'twitter': 169, 'looks': 92, 'gonna': 48, 'let': 83, 'oh': 111, 'didnt': 24, 'ok': 112, 'thanks': 150, 'said': 128, 'work': 188, 'year': 196, 'read': 123, 'talk': 147, 'tomorrow': 161, 'awww': 6, 'poor': 120, 'hope': 69, 'feel': 31, '

In [22]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(19998, 200)


Unnamed: 0,actually,amazing,amp,away,awesome,aww,awww,baby,bad,bed,...,wow,wrong,xx,ya,yay,yeah,year,yes,yesterday,youre
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.66453,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
#Multinomial Naive Bayes Model on TF-IDF vectorized data
from sklearn.naive_bayes import MultinomialNB
MNB = MultinomialNB().fit(X_train_vectorized, y_train)
train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)
print(f'Train Roc-Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc-Auc: {roc_auc_score(y_test, test_predictions)}')


Train Roc-Auc: 0.6641576477345039
Test Roc-Auc: 0.6602522759601707
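
The Roc-Auc values above are computed from the hard class labels returned by `predict`. ROC AUC is a ranking metric, so it is usually computed from scores such as the positive-class column of `predict_proba`. The sketch below uses made-up toy numbers and hypothetical helper functions `accuracy` and `roc_auc` to show how the two can differ. It is an illustration, not the notebook's evaluation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Pairwise ROC AUC: chance a positive outranks a negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_probabilities = [0.2, 0.4, 0.6, 0.45]  # made-up positive-class probabilities
toy_hard = [1 if p >= 0.5 else 0 for p in toy_probabilities]  # threshold at 0.5

print(accuracy(toy_labels, toy_hard))
print(roc_auc(toy_labels, toy_hard))           # AUC from hard labels
print(roc_auc(toy_labels, toy_probabilities))  # AUC from probabilities
```

Here the probabilities rank every positive above every negative, so their AUC is higher than the AUC from thresholded labels, which loses the ranking information below the cutoff.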


In [0]:
#XGBoostClassifier Model on TF-IDF vectorized data

from xgboost.sklearn import XGBClassifier
# Sentiment is binary, so binary:logistic is used here instead of multi:softmax with num_class
XGB = XGBClassifier(n_estimators=200, objective='binary:logistic').fit(X_train_vectorized, y_train)
train_predictions = XGB.predict(X_train_vectorized)
test_predictions = XGB.predict(X_test_vectorized)
print(f'Train Roc-Auc: {roc_auc_score(y_train, train_predictions)}')
print(f'Test Roc-Auc: {roc_auc_score(y_test, test_predictions)}')

# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [35]:
# get new data including stop words
def clean_tweet_doc(doc):
  # lowercase, then split into tokens by white space
  tokens = str(doc).lower().split()
  # remove punctuation from each token
  table = str.maketrans('', '', string.punctuation)
  tokens = [w.translate(table) for w in tokens]
  # remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  # stop words are deliberately kept here (no stop-word filter)
  # filter out short tokens
  tokens = [word for word in tokens if len(word) > 1]
  return tokens

w2v_sentences = [clean_tweet_doc(tweet) for tweet in twitters['SentimentText']]
print(w2v_sentences[:10])

[['is', 'so', 'sad', 'for', 'my', 'apl', 'friend'], ['missed', 'the', 'new', 'moon', 'trailer'], ['omg', 'its', 'already'], ['omgaga', 'im', 'sooo', 'im', 'gunna', 'cry', 'ive', 'been', 'at', 'this', 'dentist', 'since', 'was', 'suposed', 'just', 'get', 'crown', 'put', 'on'], ['think', 'mi', 'bf', 'is', 'cheating', 'on', 'me', 'tt'], ['or', 'just', 'worry', 'too', 'much'], ['juuuuuuuuuuuuuuuuussssst', 'chillin'], ['sunny', 'again', 'work', 'tomorrow', 'tv', 'tonight'], ['handed', 'in', 'my', 'uniform', 'today', 'miss', 'you', 'already'], ['hmmmm', 'wonder', 'how', 'she', 'my', 'number']]


In [0]:
##### Your Code Here #####

from gensim.models import Word2Vec
w2v = Word2Vec(w2v_sentences)

In [45]:
# gensim >= 4.0 exposes the vocabulary as wv.key_to_index (wv.vocab on 3.x)
words = list(w2v.wv.key_to_index)
print(f'Vocabulary Size: {len(words)}')

Vocabulary Size: 12311


In [46]:
w2v.wv.most_similar('twitter')


[('facebook', 0.7481160163879395),
 ('myspace', 0.7107597589492798),
 ('list', 0.6638249158859253),
 ('youtube', 0.6508448719978333),
 ('fb', 0.6391708254814148),
 ('message', 0.624267578125),
 ('comment', 0.6164907217025757),
 ('tweetdeck', 0.6153228878974915),
 ('updates', 0.6095948815345764),
 ('everyone', 0.6074749827384949)]
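
The scores in the list above are cosine similarities between word vectors. The sketch below shows that calculation on made-up two-dimensional vectors; `cosine_similarity`, `most_similar`, and `toy_vectors` are hypothetical and only illustrate the idea, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-D vectors; real Word2Vec vectors have many more dimensions.
toy_vectors = {
    'twitter': [1.0, 0.0],
    'facebook': [0.8, 0.6],
    'dentist': [0.0, 1.0],
}
neighbors = most_similar('twitter', toy_vectors, topn=2)
print(neighbors)
```

Words whose vectors point in nearly the same direction score close to 1, and unrelated directions score close to 0.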