# Part 1 - Working with Text Data

### Use Python string methods remove irregular whitespace from the following string:

In [1]:
whitespace_string = "\n\n  This is a    string   that has  \n a lot of  extra \n   whitespace.   "

print(whitespace_string)



  This is a    string   that has  
 a lot of  extra 
   whitespace.   


In [2]:
print(" ".join(whitespace_string.strip().split()))

This is a string that has a lot of extra whitespace.


### Use Regular Expressions to take the dates in the following .txt file and put them into a dataframe with columns for:

[RegEx dates.txt](https://github.com/ryanleeallred/datasets/blob/master/dates.txt)

- Day
- Month
- Year


In [16]:
import requests
import re
import pandas as pd

r = requests.get("https://raw.githubusercontent.com/ryanleeallred/datasets/master/dates.txt")
text = str(r.content)[2:-1].split("\\r\\n")
data = []
for l in text:
    d = {}
    groups = re.match(r'(\w+)\s(\d+),\s(\d+)', l).groups()
    d["Day"] = int(groups[1])
    d["Month"] = groups[0]
    d["Year"] = int(groups[2])
    data.append(d)
df = pd.DataFrame(data)
df

Unnamed: 0,Day,Month,Year
0,8,March,2015
1,15,March,2015
2,22,March,2015
3,29,March,2015
4,5,April,2015
5,12,April,2015
6,19,April,2015
7,26,April,2015
8,3,May,2015
9,10,May,2015


# Part 2 - Bag of Words 

### Use the twitter sentiment analysis dataset found at this link for the remainder of the Sprint Challenge:

[Twitter Sentiment Analysis Dataset](https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv)

 ### Clean and tokenize the documents ensuring the following properties of the text:

1) Text should be lowercase.

2) Stopwords should be removed.

3) Punctuation should be removed.

4) Tweets should be tokenized at the word level. 

(The above don't necessarily need to be completed in that specific order.)

### Output some cleaned tweets so that we can see that you made all of the above changes.


In [18]:
df_base = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/datasets/master/twitter_sentiment_binary.csv")
df_base.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [22]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def split_lemma(v):
    stop_words = stopwords.words('english')
    return [w.lower() for w in word_tokenize(v) if w.isalpha() and w not in stop_words]

df = df_base.copy()
df["SentimentText"] = df["SentimentText"].apply(split_lemma)
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\brit2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brit2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\brit2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Sentiment,SentimentText
0,0,"[sad, apl, friend]"
1,0,"[i, missed, new, moon, trailer]"
2,1,"[omg, already, o]"
3,0,"[omgaga, im, sooo, im, gunna, cry, i, dentist,..."
4,0,"[think, mi, bf, cheating]"


### How should TF-IDF scores be interpreted? How are they calculated?

#### TF-IDF Scores

TF-IDF, or Term Frequency - Inverse Document Frequency, is a measure of how important a word or term is to a document in a collection of documents. It is calculated through the multiplucation of the relative frequency of the term in the document and the inverse frequency of the term in the collection of documents compared to the total number of terms. However, for both the relative frequency of the term in the document and the inverse frequency of the term in the collection of documents, there are numerous ways to calculate these values. A common and simple method of doing this is:

$${tf}(t,d) = \frac{n_{t}}{n_{d}}$$
$${idf}(t,D) = \log{\frac{N_{D}}{N_{d}}}$$
$${tfidf} = {tf}(t,d) \cdot {idf}(t,D)$$

where:
- ${tf(t,d)}$ is the relative frequency of the term in the document
- ${idf(t,D)}$ is the inverse frequency of the term in the collection of documents
- ${tfidf}$ is the term frequency inverse document frequency (TF-IDF)
- ${n_{t}}$ is the frequency of the term in the document
- ${n_{d}}$ is the frequency of all terms in the document
- ${N_{d}}$ is the frequency of documents that the term appears in in the collection of documents
- ${N_{D}}$ is the frequency of all terms in the collection of documents

This allows for getting the importance of a term not only to the document, but also to the collection of documents.

# Part 3 - Document Classification

1) Use Train_Test_Split to create train and test datasets.

2) Vectorize the tokenized documents using your choice of vectorization method. 

 - Stretch goal: Use both of the methods that we talked about in class.

3) Create a vocabulary using the X_train dataset and transform both your X_train and X_test data using that vocabulary.

4) Use your choice of binary classification algorithm to train and evaluate your model's accuracy. Report both train and test accuracies.

 - Stretch goal: Use an error metric other than accuracy and implement/evaluate multiple classifiers.



In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[["SentimentText"]], df["Sentiment"], train_size=0.8, shuffle=True)

In [43]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train_vec)
train_vec_2 = pd.DataFrame(vectorizer.transform(X_train_vec).toarray(), columns=vectorizer.get_feature_names(), index=X_train.index)
test_vec_2 = pd.DataFrame(vectorizer.transform(X_test_vec).toarray(), columns=vectorizer.get_feature_names(), index=X_test.index)

In [44]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(train_vec, y_train)
print(LR.score(train_vec_2, y_train))
print(LR.score(test_vec_2, y_test))



0.5012563913440262
0.5074007400740074


In [45]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(train_vec, y_train)
print(MNB.score(train_vec_2, y_train))
print(MNB.score(test_vec_2, y_test))

0.5049193034216349
0.5096009600960096


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train_vec = [" ".join(c) for c in X_train["SentimentText"].values]
X_test_vec = [" ".join(c) for c in X_test["SentimentText"].values]

tfidf = TfidfVectorizer(ngram_range=(1,1), max_features=10000)
tfidf.fit(X_train_vec)
train_vec = pd.DataFrame(tfidf.transform(X_train_vec).toarray(), columns=tfidf.get_feature_names(), index=X_train.index)
test_vec = pd.DataFrame(tfidf.transform(X_test_vec).toarray(), columns=tfidf.get_feature_names(), index=X_test.index)

In [37]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(train_vec, y_train)
print(LR.score(train_vec, y_train))
print(LR.score(test_vec, y_test))



0.7915015439236914
0.7545754575457546


In [42]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(train_vec, y_train)
print(MNB.score(train_vec, y_train))
print(MNB.score(test_vec, y_test))

0.7841882211748822
0.7464246424642464


In [46]:
from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_train, LR.predict_proba(train_vec)[:,1]))
print(roc_auc_score(y_test, LR.predict_proba(test_vec)[:,1]))



0.8710169237800546
0.8311230606585811


# Part 4 -  Word2Vec

1) Fit a Word2Vec model on your cleaned/tokenized twitter dataset. 

2) Display the 10 words that are most similar to the word "twitter"

In [47]:
from gensim.models.word2vec import Word2Vec
for i in range(1,11):
    model = Word2Vec(df["SentimentText"].values, min_count=1, size=i)
    print("Size:", i)
    for v in model.wv.most_similar('twitter'):
        print("\t", v)
    print()

Size: 1
	 ('alexmace', 1.0)
	 ('homecoming', 1.0)
	 ('lovemaking', 1.0)
	 ('liveee', 1.0)
	 ('pottstown', 1.0)
	 ('canfield', 1.0)
	 ('alexmatheson', 1.0)
	 ('thoguht', 1.0)
	 ('feaked', 1.0)
	 ('alexmerced', 1.0)

Size: 2
	 ('didsys', 1.0000001192092896)
	 ('chuckrey', 1.0)
	 ('clemartdesign', 1.0)
	 ('berniegrace', 1.0)
	 ('tulog', 1.0)
	 ('skype', 1.0)
	 ('walked', 1.0)
	 ('mall', 1.0)
	 ('nah', 1.0)
	 ('constantknot', 1.0)

Size: 3
	 ('biiig', 0.9999903440475464)
	 ('arrested', 0.9999868869781494)
	 ('cheekylamb', 0.9999712109565735)
	 ('aliciawag', 0.9999397993087769)
	 ('implement', 0.9999380111694336)
	 ('chinta', 0.9998846650123596)
	 ('zeit', 0.9998795390129089)
	 ('brettislame', 0.9998740553855896)
	 ('sincerely', 0.9998714923858643)
	 ('azveganchik', 0.9998651742935181)

Size: 4
	 ('disapointed', 0.999488115310669)
	 ('masseuse', 0.9993946552276611)
	 ('tuenti', 0.9989905953407288)
	 ('kreactive', 0.9987278580665588)
	 ('interlude', 0.9986570477485657)
	 ('chefrosebud', 0.99