# Natural Language Processing Sprint Challenge

# Part 1 - Working with Text Data

Use Python string methods to remove irregular whitespace from the following string:

In [4]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [5]:
" ".join(whitespace_string.split())

'This is a string that has a lot of extra whitespace.'
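
The one-liner above works because `str.split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace. As a minimal sketch, the same result can be reached with a regular expression; the names `via_split` and `via_regex` are introduced here only for illustration.

```python
import re

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

# split() with no arguments breaks on runs of spaces, tabs, and newlines,
# so joining the pieces with one space normalizes the string.
via_split = " ".join(whitespace_string.split())

# Regex alternative: collapse every whitespace run to one space, then trim.
via_regex = re.sub(r"\s+", " ", whitespace_string).strip()

print(via_split)
print(via_split == via_regex)
```

The `split`/`join` form needs no import and is usually the simpler choice; the regex form is handy when only certain whitespace characters should be collapsed.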

### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt)

- Day
- Month
- Year

In [6]:
import re
import requests
import pandas as pd

r = requests.get("https://raw.githubusercontent.com/"
                 "ryanleeallred/datasets/master/dates.txt")
dates = re.findall(r'[a-zA-Z]+ \d+, \d{4}', r.text)
pd.DataFrame({'dates':dates})

| | dates |
|---|---|
| 0 | March 8, 2015 |
| 1 | March 15, 2015 |
| 2 | March 22, 2015 |
| 3 | March 29, 2015 |
| 4 | April 5, 2015 |
| 5 | April 12, 2015 |
| 6 | April 19, 2015 |
| 7 | April 26, 2015 |
| 8 | May 3, 2015 |
| 9 | May 10, 2015 |
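
The result above holds each date in a single column, while the prompt asks for separate Day, Month, and Year columns. One possible way to get there is to add capture groups to the pattern. The sketch below uses a short hard-coded `sample_text` (an assumption standing in for the downloaded file) and builds plain dictionaries; `date_pattern` and `records` are illustrative names.

```python
import re

# Sample lines in the same "Month D, YYYY" shape as the dates extracted above.
sample_text = "March 8, 2015\nMarch 15, 2015\nApril 5, 2015\nMay 10, 2015\n"

# Three capture groups: month name, day of month, four-digit year.
date_pattern = re.compile(r"([A-Za-z]+) (\d{1,2}), (\d{4})")

# With groups in the pattern, findall returns one (month, day, year) tuple per match.
records = [
    {"Day": int(day), "Month": month, "Year": int(year)}
    for month, day, year in date_pattern.findall(sample_text)
]

for row in records:
    print(row)
```

Passing `records` to `pd.DataFrame` would give a dataframe with `Day`, `Month`, and `Year` columns. With pandas, `Series.str.extract` and named groups is another route to the same layout.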


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [7]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# word_tokenize and WordNetLemmatizer also rely on the 'punkt' and 'wordnet'
# resources; fetch them with nltk.download() if they are not installed yet.
nltk.download('stopwords')

# Build these once instead of on every call
punct_table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def tokenize_words(doc):
    """Tokenize one document and return its cleaned, lemmatized tokens."""
    # Tokenize by word
    tokens = word_tokenize(doc)
    # Lowercase and strip punctuation from within words
    tokens = [x.lower().translate(punct_table)
              for x in tokens]
    # Keep purely alphabetic tokens (drops numbers and empty strings)
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    tokens = [w for w in tokens if w not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(w) for w in tokens]

    return tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\City_Year\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
df = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/"
                 "datasets/master/twitter_sentiment_binary.csv")

df['SentimentText_token'] = df['SentimentText'].apply(tokenize_words)

df['SentimentText_token'].head()

0                                   [sad, apl, friend]
1                         [missed, new, moon, trailer]
2                                       [omg, already]
3    [omgaga, im, sooo, im, gunna, cry, dentist, si...
4                        [think, mi, bf, cheating, tt]
Name: SentimentText_token, dtype: object

### How should TF-IDF scores be interpreted? How are they calculated?

TF-IDF scores represent how distinctive a given term is to a given document relative to the rest of the corpus. The higher the score, the more characteristic that term is of the document in which it appears: it occurs often there but in few other documents.

TF-IDF is the product of two factors. Term frequency (TF) is the count of a term in a document divided by the total number of words in that document. Inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents containing the term.
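
To make the calculation concrete, here is a minimal sketch of the definition above in plain Python. The three-document corpus `docs` is made up for illustration, and the helper names `tf`, `idf`, and `tf_idf` are not from any library. Note that scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will differ from this textbook form.

```python
import math

# Tiny hypothetical corpus of already-tokenized documents.
docs = [
    ["sad", "friend", "sad"],
    ["new", "moon", "trailer"],
    ["new", "friend"],
]


def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "sad" appears twice in the first document and nowhere else, so it scores
# higher than "friend", which appears once there and also in another document.
print(tf_idf("sad", docs[0], docs))
print(tf_idf("friend", docs[0], docs))
```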

# Part 3 - Document Classification

1) Use `train_test_split` to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.
 - Stretch goal: Track your results in a DataFrame and produce a visualization of the results.
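
For the stretch goal of using a metric other than accuracy, precision, recall, and F1 are common choices for binary classification. The sketch below computes them by hand on made-up labels so the definitions are visible; `precision_recall_f1`, `y_true`, and `y_pred` are illustrative names. In practice, `sklearn.metrics` provides `precision_score`, `recall_score`, and `f1_score` for the same purpose.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```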



In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes


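def vectorize(vectorizer, X_train, X_test):
    """Fit the vectorizer on X_train only, then transform both splits."""
    # Learn the vocabulary from the training data only
    X_train_vect = vectorizer.fit_transform(X_train)
    X_test_vect = vectorizer.transform(X_test)

    # get_feature_names_out() replaces get_feature_names(), which was
    # removed in scikit-learn 1.2
    columns = vectorizer.get_feature_names_out()
    X_train_trans = pd.DataFrame(X_train_vect.toarray(), columns=columns)
    X_test_trans = pd.DataFrame(X_test_vect.toarray(), columns=columns)

    return X_train_trans, X_test_trans, vectorizer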


def assess_model(model, X_train, X_test, 
                 y_train, y_test):
    """Expects transformed inputs"""
    model.fit(X_train, y_train)
    
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)
    
    result = {}
    result['model'] = type(model).__name__
    result['acc_train'] = accuracy_score(y_train, train_predictions)
    result['acc_test'] = accuracy_score(y_test, test_predictions)
    print(result)
    
    return result

In [10]:
X = df['SentimentText_token'].apply(lambda x: " ".join(x))
y = df['Sentiment']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=812)

In [11]:
tfidf = TfidfVectorizer(max_features=20)
X_train, X_test, _ = vectorize(tfidf, X_train, X_test)

In [12]:
classifiers = [LogisticRegression(solver='lbfgs'),
               MultinomialNB(),
               RandomForestClassifier(n_estimators=100)]

results = []
for model in classifiers:
    result = assess_model(
        model,
        X_train, X_test, y_train, y_test)
    
    results.append(result)
    
pd.DataFrame.from_records(results)

{'model': 'LogisticRegression', 'acc_train': 0.5960795589503819, 'acc_test': 0.6033103310331033}
{'model': 'MultinomialNB', 'acc_train': 0.5958295308222175, 'acc_test': 0.6047604760476047}
{'model': 'RandomForestClassifier', 'acc_train': 0.6105186833518771, 'acc_test': 0.6002600260026003}


| | acc_test | acc_train | model |
|---|---|---|---|
| 0 | 0.60331 | 0.59608 | LogisticRegression |
| 1 | 0.60476 | 0.59583 | MultinomialNB |
| 2 | 0.60026 | 0.610519 | RandomForestClassifier |


# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [13]:
from gensim.models import Word2Vec

In [None]:
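# gensim 4.0 renamed `size` to `vector_size`; use `size=5` on gensim 3.x
model = Word2Vec(
    df['SentimentText_token'],
    min_count=1, vector_size=5)

# The 10 words most similar to "twitter"
model.wv.most_similar('twitter', topn=10)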
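
`most_similar` ranks the other words in the vocabulary by the cosine similarity of their vectors to the query word's vector. The sketch below shows that ranking in plain Python. The three-dimensional `vectors` are invented for illustration and are not output from the model above; `cosine_similarity` and this `most_similar` are hypothetical helpers, not gensim functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for illustration only.
vectors = {
    "twitter": [1.0, 0.0, 1.0],
    "tweet": [0.9, 0.1, 0.8],
    "dentist": [-0.5, 1.0, 0.0],
    "facebook": [1.0, 0.2, 0.7],
}


def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("twitter", vectors, topn=2)
for word, score in ranking:
    print(word, round(score, 3))
```

With only five dimensions and `min_count=1`, the real model's neighbors for "twitter" may be noisy; a larger vector size is a common adjustment if the results look arbitrary.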