
# Part 1 - Working with Text Data

### Use Python string methods to remove irregular whitespace from the following string:

In [25]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [28]:
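import re

# Note: \W+ matches every run of non-word characters, so it also strips punctuation (the final period is lost).
re.sub(r"\W+", ' ', whitespace_string).strip()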


'This is a string that has a lot of extra whitespace'

In [0]:
import re

# \s+ matches only runs of whitespace, so punctuation is preserved.
re.sub(r"\s+", ' ', whitespace_string).strip()


'This is a string that has a lot of extra whitespace.'
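
Since the prompt asks for string methods, the same result can also be sketched without `re`. Calling `str.split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace, so joining the pieces with single spaces normalizes the string. This minimal sketch redefines the string so that it runs on its own.

```python
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

# split() with no separator splits on runs of whitespace and drops empty pieces
normalized = " ".join(whitespace_string.split())
print(normalized)
```

Unlike the `\W+` version, this keeps the final period, because only whitespace is touched.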

### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [0]:
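import pandas as pd

url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt'

# Alternative: split on punctuation instead, which yields "Month Day" and "Year" columns.
# df = pd.read_csv(url, header=None, sep=r"[^\w\s]")
# Split each line on runs of whitespace (a regex separator), giving Month, Day, Year.
df = pd.read_csv(url, header=None, sep=r"\s+")
df.columns = ["Month", "Day", "Year"]
# Strip the trailing comma from Day; regex=True is required in recent pandas versions.
df['Day'] = df['Day'].str.replace(r'[^\w\s]', '', regex=True)
df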


Unnamed: 0,Month,Day,Year
0,March,8,2015
1,March,15,2015
2,March,22,2015
3,March,29,2015
4,April,5,2015
5,April,12,2015
6,April,19,2015
7,April,26,2015
8,May,3,2015
9,May,10,2015
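
The task asks for regular expressions specifically, so here is a hedged alternative that uses `re.findall` with capture groups instead of a regex separator. The sample text is an assumption based on the output above (lines of the form `March 8, 2015`); with the real file, the text would be read from the URL instead. The names `sample_text`, `date_pattern`, and `dates_df` are illustrative only.

```python
import re
import pandas as pd

# Assumed sample of the dates.txt content, inferred from the DataFrame output above
sample_text = "March 8, 2015\nMarch 15, 2015\nApril 5, 2015\n"

# Capture month name, day number, and four-digit year
date_pattern = re.compile(r"([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})")
matches = date_pattern.findall(sample_text)

dates_df = pd.DataFrame(matches, columns=["Month", "Day", "Year"])
# Reorder to the requested Day, Month, Year columns and make the numbers numeric
dates_df = dates_df[["Day", "Month", "Year"]].astype({"Day": int, "Year": int})
print(dates_df)
```

Capturing each field with its own group avoids the separate comma-stripping step on the `Day` column.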


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv'
df = pd.read_csv(url)
print(df.shape)
df.head()

(99989, 2)


Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [0]:
df.Sentiment.value_counts()

1    56457
0    43532
Name: Sentiment, dtype: int64

In [0]:
!pip install -U nltk

Requirement already up-to-date: nltk in /usr/local/lib/python3.6/dist-packages (3.4)


In [7]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [18]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # word-level tokenizer

s_text = df.SentimentText

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))
cleaned_listings = []

for listing in s_text:
    # 1) remove punctuation
    no_punctuation = listing.translate(table)
    # 2) tokenize at the word level
    tokens = word_tokenize(no_punctuation)
    # 3) lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    # 4) keep purely alphabetic tokens (drops numbers such as "730")
    alphabetic = [w for w in lowercase_tokens if w.isalpha()]
    # 5) remove stopwords
    words = [w for w in alphabetic if w not in stop_words]

    cleaned_listings.append(words)

len(cleaned_listings)

99989

In [0]:
cleaned_listings[0:5]

[['sad', 'apl', 'friend'],
 ['missed', 'new', 'moon', 'trailer'],
 ['omg', 'already'],
 ['omgaga',
  'im',
  'sooo',
  'im',
  'gunna',
  'cry',
  'ive',
  'dentist',
  'since',
  'suposed',
  'get',
  'crown',
  'put'],
 ['think', 'mi', 'bf', 'cheating', 'tt']]

### How should TF-IDF scores be interpreted? How are they calculated?

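A TF-IDF score measures how important a term is to one document relative to the whole corpus, rather than simply counting how often the term appears.

TF-IDF is an information retrieval technique that weighs a term's frequency (TF) against its inverse document frequency (IDF). TF is computed per document and grows with the number of times the term occurs in that document. IDF is computed over the corpus and shrinks as the term appears in more documents. The product of the TF and IDF scores of a term is called the TF-IDF weight of that term in that document.

Hence a term that occurs in many documents (such as "the") is penalized with a low weight, while a term that is frequent in one document but rare across the corpus receives a high weight.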
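
To make the calculation concrete, here is a minimal sketch of the textbook formula, `tf * log(N / df)`, in plain Python. The three-document corpus and the helper names `tf`, `idf`, and `tfidf` are made up for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math
from collections import Counter

# Tiny hypothetical corpus of already-tokenized documents
docs = [
    ["good", "day", "good", "mood"],
    ["bad", "day"],
    ["good", "night"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Inverse document frequency: log of (number of docs / docs containing term)
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    # TF-IDF weight of `term` in `doc`
    return tf(term, doc) * idf(term, corpus)

for term in ["good", "mood"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

In the first document, "mood" ends up with a higher weight than "good" even though "good" occurs twice, because "good" also appears in another document. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each row by default, so its numbers differ from this textbook version.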


# Part 3 - Document Classification

1) Use Train_Test_Split to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.



In [0]:
from sklearn.model_selection import train_test_split

#X = df.SentimentText[:1000]
#y = df.Sentiment[:1000]

X = df.SentimentText
y = df.Sentiment


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [0]:
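import string
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words('english') for every word is very slow.
STOP_WORDS = set(stopwords.words('english'))

def text_clean(message):
    # Drop punctuation characters, then lowercase and split on whitespace
    nopunc = [ch for ch in message if ch not in string.punctuation]
    tokens = "".join(nopunc).lower().split()
    # Remove stopwords
    return [word for word in tokens if word not in STOP_WORDS]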

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv_transformer = CountVectorizer(analyzer = text_clean)
cv_transformer.fit(X_train)
print(cv_transformer.vocabulary_)



In [9]:
from sklearn.feature_extraction.text import CountVectorizer

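# max_features=30 keeps the dense DataFrames built from this vectorizer small enough to fit in memory.
vectorizer = CountVectorizer(max_features=30, ngram_range=(1,1), stop_words='english')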

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)



In [10]:
# The vectorizer was fitted with max_features=30 because the full vocabulary caused a memory crash.

train_word_counts = vectorizer.transform(X_train)

# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(79991, 30)


Unnamed: 0,amp,com,day,don,going,good,got,great,haha,hope,...,quot,really,sorry,thanks,think,time,today,want,work,yeah
0,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
test_word_counts = vectorizer.transform(X_test)

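# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())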

print(X_test_vectorized.shape)
X_test_vectorized.head()

(19998, 30)


Unnamed: 0,amp,com,day,don,going,good,got,great,haha,hope,...,quot,really,sorry,thanks,think,time,today,want,work,yeah
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

RFC = RandomForestClassifier().fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')



Train Accuracy: 0.6397469715342976
Test Accuracy: 0.6046104610461046
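
The dense `DataFrame` conversion is the likely source of the memory pressure that forced the 30-feature cap. As a sketch of one way around it, most scikit-learn classifiers accept the sparse matrix returned by the vectorizer directly, so the vocabulary need not be capped. The toy tweets and labels below are made up for illustration. `LogisticRegression` and the F1 score are swapped in as examples of a second classifier and a second metric; they are not what the notebook ran.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train and X_test / y_test
train_texts = ["love this great day", "so happy today", "great fun with friends",
               "so sad and tired", "hate this awful day", "feeling sick and sad"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["happy great day", "sad awful day"]
test_labels = [1, 0]

# The pipeline keeps the count matrix sparse between the vectorizer and the classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

train_pred = clf.predict(train_texts)
test_pred = clf.predict(test_texts)

print(f"Train accuracy: {accuracy_score(train_labels, train_pred)}")
print(f"Test accuracy: {accuracy_score(test_labels, test_pred)}")
print(f"Test F1: {f1_score(test_labels, test_pred)}")
```

Fitting the vectorizer inside the pipeline also guarantees that the vocabulary is built from the training texts only and then reused for the test texts.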


In [13]:
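from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=30)

feature_matrix = tfidf.fit_transform(X_train)
# Get feature names to use as DataFrame column headers
# (get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names())
feature_names = tfidf.get_feature_names_out()

# View the feature matrix as a DataFrame; use a new name so the tweet DataFrame `df` is not overwritten
tfidf_df = pd.DataFrame(feature_matrix.toarray(), columns=feature_names)
tfidf_df.head()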

Unnamed: 0,and,are,at,be,but,can,for,good,have,in,...,on,quot,so,that,the,to,was,with,you,your
0,0.0,0.0,0.0,0.0,0.0,0.36371,0.0,0.0,0.0,0.0,...,0.0,0.785454,0.0,0.289374,0.226658,0.0,0.0,0.0,0.228078,0.0
1,0.0,0.0,0.365391,0.0,0.0,0.0,0.0,0.0,0.0,0.294329,...,0.0,0.0,0.0,0.0,0.0,0.0,0.683179,0.0,0.443487,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.461299,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.322241,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.464306,0.423926,0.0,0.0,0.0,0.0,0.39324,0.0,...,0.0,0.0,0.0,0.0,0.0,0.28039,0.434061,0.0,0.0,0.0


# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [0]:
!pip install -U gensim
import gensim

In [0]:
import nltk
nltk.download('all')

In [19]:
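from gensim.models.word2vec import Word2Vec

# This cell uses the gensim 3.x API. In gensim >= 4.0, `size` is named `vector_size`
# and `model.wv.vocab` is replaced by `model.wv.key_to_index`.
model = Word2Vec(cleaned_listings, min_count=1, size=5)
print(model)
print(list(model.wv.vocab))
print(len(model.wv.vocab))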

Word2Vec(vocab=104445, size=5, alpha=0.025)
104445


In [20]:
model.wv.most_similar('twitter')

[('astuteslytherin', 0.9986425042152405),
 ('websites', 0.9986316561698914),
 ('lollipopsquot', 0.9979644417762756),
 ('andrewrimmer', 0.9979069828987122),
 ('boil', 0.997797429561615),
 ('investigated', 0.9974745512008667),
 ('manners', 0.997321605682373),
 ('quotonly', 0.9972283840179443),
 ('giants', 0.9969967007637024),
 ('addy', 0.9967241287231445)]

In [24]:
model.wv.most_similar('bf')

[('diego', 0.9999207854270935),
 ('doctor', 0.9994810819625854),
 ('bein', 0.9994375705718994),
 ('grr', 0.9994285702705383),
 ('doubt', 0.9993072748184204),
 ('spiders', 0.9992938041687012),
 ('shut', 0.9992657899856567),
 ('read', 0.9991635680198669),
 ('always', 0.9990581274032593),
 ('kill', 0.9990249276161194)]
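
`most_similar` ranks words by the cosine similarity between their vectors. As a rough illustration in plain Python (the vectors below are made up, not taken from the model), cosine similarity depends only on the angle between two vectors, not on their length. With only 5 dimensions (`size=5`) there is little room for word vectors to point in different directions, which may be part of why unrelated words above score close to 1.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 5-dimensional word vectors
vec_a = [0.9, 0.1, 0.4, 0.3, 0.2]
vec_b = [1.8, 0.2, 0.8, 0.6, 0.4]        # same direction as vec_a, twice the length
vec_c = [-0.9, -0.1, -0.4, -0.3, -0.2]   # opposite direction to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing the same way score near 1 and opposite vectors score near -1, so a list where every neighbor scores above 0.99 suggests the embedding has not separated words well. Refitting with a larger vector size would be one thing to try.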