
In [0]:
import pandas as pd
import re
import numpy as np
import string

import nltk
from nltk.tokenize import sent_tokenize # Sentence Tokenizer
from nltk.tokenize import word_tokenize # Word Tokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

import gensim
from gensim.models.word2vec import Word2Vec

# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [0]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [0]:
print(" ".join(whitespace_string.split()))
    

This is a string that has a lot of extra whitespace.
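
The one-liner above works because `str.split()` with no arguments splits on any run of whitespace (spaces, tabs, newlines) and ignores leading and trailing whitespace. As a rough cross-check, the sketch below compares it with a regular-expression version; the names `via_split` and `via_regex` are only illustrative.

```python
import re

whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

# str.split() with no arguments splits on runs of any whitespace
# and drops leading/trailing whitespace, so join() rebuilds a clean string.
via_split = " ".join(whitespace_string.split())

# Regex alternative: collapse every whitespace run to one space, then trim the ends.
via_regex = re.sub(r"\s+", " ", whitespace_string).strip()

print(via_split)
print(via_split == via_regex)
```

Both approaches give the same string here; the `split()`/`join()` form avoids the regex import and is usually the simpler choice.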


### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [0]:
# dates.txt must be in the working directory (download it from the link above first)
with open('dates.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

contents

FileNotFoundError: ignored

In [0]:
regex = r"([a-zA-Z]+) (\d+), (\d\d\d\d)" 

search_result = re.findall(regex, contents)

for match in search_result:
  print(match)

In [0]:
dfDate = pd.DataFrame(search_result, columns=['Month', 'Day','Year'])
dfDate.head(100)
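
Because `dates.txt` was not available when the cell above ran, the sketch below applies the same "Month Day, Year" pattern to a made-up string. The sample text and the names `sample`, `date_pattern`, and `rows` are assumptions for illustration only; the real file may contain other date formats that this pattern would miss.

```python
import re

# Hypothetical sample text; the real dates.txt may use other formats.
sample = "The meeting moved from March 8, 2015 to April 21, 2015, then to January 3, 2016."

# Named groups make each captured field self-describing.
date_pattern = re.compile(r"(?P<Month>[A-Za-z]+) (?P<Day>\d{1,2}), (?P<Year>\d{4})")

# One dict per match, keyed by group name.
rows = [m.groupdict() for m in date_pattern.finditer(sample)]
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, which would produce the Month, Day, and Year columns without listing them by hand.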

# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


#### Load data

In [2]:
pd.set_option('display.max_colwidth', 200)
url='https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv'
df=pd.read_csv(url)

# Create punctuation table
table = str.maketrans('','', string.punctuation)

# Create Stop word list
stop_words = set(stopwords.words('english'))

df2=df.copy()
df.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL friend.............
1,0,I missed the New Moon trailer...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
4,0,i think mi bf is cheating on me!!! T_T


In [0]:
df=df2.copy()
def clean_doc(doc):
    # split into word-level tokens with NLTK's word tokenizer
    tokens = word_tokenize(doc)
    # lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    # remove punctuation from each token
    no_punctuation = [w.translate(table) for w in lowercase_tokens]
    # keep only purely alphabetic tokens (drops numbers and empty strings)
    alphabetic = [word for word in no_punctuation if word.isalpha()]
    # filter out stop words
    words = [w for w in alphabetic if w not in stop_words]
    # filter out single-character tokens
    tokens = [w for w in words if len(w) > 1]
    return tokens


df['SentimentText'] = df['SentimentText'].apply(clean_doc)

# join each token list back into one space-separated string for the vectorizers
df['SentimentText'] = df.SentimentText.apply(' '.join)

# show some cleaned tweets
df['SentimentText'].head()
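
To make the effect of the cleaning steps visible without downloading NLTK data, here is a simplified stand-in for `clean_doc`. It is only a sketch: `SAMPLE_STOP_WORDS` is a deliberately tiny, hypothetical stop-word list (the notebook uses NLTK's English list), and it splits on whitespace instead of calling `word_tokenize`, so results can differ from the real function on some inputs.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list.
SAMPLE_STOP_WORDS = {"i", "is", "on", "me", "the", "a", "my", "for", "so"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_doc_sketch(doc):
    """Lowercase, strip punctuation, split on whitespace, drop stop words and 1-char tokens."""
    tokens = doc.lower().translate(PUNCT_TABLE).split()
    return [t for t in tokens
            if t.isalpha() and t not in SAMPLE_STOP_WORDS and len(t) > 1]

# Two tweets taken from the df.head() output above.
print(clean_doc_sketch("i think mi bf is cheating on me!!! T_T"))
print(clean_doc_sketch("is so sad for my APL friend............."))
```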

### How should TF-IDF scores be interpreted? How are they calculated?

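#### Term Frequency / Inverse Document Frequency

Term Frequency: the share of a document's words accounted for by a given word.
    TF = (number of times the word occurs in the document) / (total number of words in the document)

Inverse Document Frequency: a penalty for words that appear in a large number of documents.
    IDF = log(total number of documents / number of documents containing the word)

The base of the logarithm only rescales the scores. scikit-learn's TfidfVectorizer, with its default smooth_idf=True, uses the natural log in the smoothed form ln((1 + n) / (1 + df)) + 1 and then L2-normalizes each document vector.

Each word has a TF score in each document and one IDF score for the whole corpus. The product of the two is the TF-IDF weight of that word in that document: TF rewards words that occur often in the document, while IDF discounts words that occur in many documents and therefore say little about any one of them.

Put simply, a high TF-IDF score means the term occurs often in that document but rarely in the rest of the corpus. A term that appears in almost every document scores close to zero, however often it occurs.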
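
As a worked example of the definitions above, the sketch below computes unsmoothed TF-IDF by hand on a made-up three-document corpus. The corpus, the function names, and the choice of the natural log are assumptions for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers would differ.

```python
import math

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["good", "morning", "twitter"],
    ["good", "night", "twitter"],
    ["sad", "sad", "twitter"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Assumes `term` occurs in at least one document (otherwise this divides by zero).
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ["twitter", "good", "sad"]:
    print(term, [round(tf_idf(term, d, docs), 3) for d in docs])
```

"twitter" appears in every document, so its IDF is log(1) = 0 and it scores zero everywhere. "sad" is concentrated in a single document and gets the highest weight there.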


# Part 3 - Document Classification

1) Use train_test_split to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.



In [0]:
X = df.SentimentText.tolist()
y = df.Sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [5]:
#vectorizer = CountVectorizer(max_features=9999, ngram_range=(1,1), stop_words='english')
vectorizer = TfidfVectorizer(max_features=9999, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)
# print(vectorizer.vocabulary_)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=9999, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [6]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names())

# vectorize X_test; it uses the same vocabulary as the training dataset, so just call .transform() on X_test
test_word_counts = vectorizer.transform(X_test)
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names())

print(X_test_vectorized.shape)
print(X_test_vectorized.head())
print(X_train_vectorized.shape)
print(X_train_vectorized.head())

(24998, 9999)
    aa  aaa  aaah  aaahh  aaaww  aafreen  aah  aahh  aahhh  aakomas ...   \
0  0.0  0.0   0.0    0.0    0.0      0.0  0.0   0.0    0.0      0.0 ...    
1  0.0  0.0   0.0    0.0    0.0      0.0  0.0   0.0    0.0      0.0 ...    
2  0.0  0.0   0.0    0.0    0.0      0.0  0.0   0.0    0.0      0.0 ...    
3  0.0  0.0   0.0    0.0    0.0      0.0  0.0   0.0    0.0      0.0 ...    
4  0.0  0.0   0.0    0.0    0.0      0.0  0.0   0.0    0.0      0.0 ...    

   zombie  zombies  zomg  zone  zones  zoo  zoom  zune   ðµ  ðµñ  
0     0.0      0.0   0.0   0.0    0.0  0.0   0.0   0.0  0.0  0.0  
1     0.0      0.0   0.0   0.0    0.0  0.0   0.0   0.0  0.0  0.0  
2     0.0      0.0   0.0   0.0    0.0  0.0   0.0   0.0  0.0  0.0  
3     0.0      0.0   0.0   0.0    0.0  0.0   0.0   0.0  0.0  0.0  
4     0.0      0.0   0.0   0.0    0.0  0.0   0.0   0.0  0.0  0.0  

[5 rows x 9999 columns]
(74991, 9999)
    aa  aaa  aaah  aaahh  aaaww  aafreen  aah  aahh  aahhh  aakomas ...   \
0  0.0  0.0 

In [7]:
RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.9668760251230147
Test Accuracy: 0.7163773101848148
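
The gap between train and test accuracy above suggests the random forest is overfitting, and accuracy alone does not show which class the errors fall on. As a sketch of the stretch goal, the function below computes precision, recall, and F1 from two label lists. The demo labels are made up and are not the notebook's predictions; in practice `sklearn.metrics.f1_score` or `precision_recall_fscore_support` would do the same job.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the `positive` class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true_demo, y_pred_demo)
print(f'Precision: {precision}, Recall: {recall}, F1: {f1}')
```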


In [0]:
clf = XGBClassifier(n_jobs=-1).fit(X_train_vectorized, y_train)
print(f'Test Accuracy: {accuracy_score(y_test, clf.predict(X_test_vectorized))}')

# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [0]:
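from gensim.models import Word2Vec
# Word2Vec expects an iterable of token lists; raw strings would be treated as sequences of characters.
# Note: gensim >= 4.0 renames the `size` parameter to `vector_size`.
w2v = Word2Vec(df.SentimentText.str.split().tolist(), min_count=20, window=3, size=300, negative=20)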

In [0]:
w2v.wv.most_similar('twitter', topn=10)
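
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below reproduces that ranking on a few made-up 3-dimensional vectors; `toy_vectors` and `most_similar_sketch` are hypothetical stand-ins, not the trained model or gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors for illustration only.
toy_vectors = {
    "twitter": [0.9, 0.1, 0.3],
    "facebook": [0.8, 0.2, 0.4],
    "dentist": [0.1, 0.9, 0.2],
    "tweet": [0.9, 0.2, 0.2],
}

def most_similar_sketch(word, vectors, topn=10):
    # Score every other word against `word`, highest similarity first.
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

for other, score in most_similar_sketch("twitter", toy_vectors):
    print(other, round(score, 3))
```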