# An Introduction to Natural Language Processing (NLP)

### What is NLP?

Natural language processing (NLP) is the set of techniques developed to automatically process, analyze, understand, and generate natural human languages.

### Why is this useful?

Most of the knowledge that has accrued over the course of human history is stored as unstructured text and we need some way to make sense out of all of it.

**Simply put, NLP enables the automatic, quantitative analysis of this unstructured text.**

### Common NLP subproblems

- **Speech recognition and generation**: [Apple Siri](https://www.apple.com/ios/siri/)
    - Speech to text
    - Text to speech
- **Question answering**: [IBM Watson](https://www.ibm.com/watson/)
    - Match query with knowledge base
    - Reasoning about intent of question
- **Machine translation**: [Google Translate](https://translate.google.com/)
    - One language to another
- **Information retrieval**: [Google](https://www.google.com/)
    - Finding relevant results
    - Finding similar results
- **Information extraction**: [Gmail](https://www.google.com/gmail/)
    - Structured information from unstructured documents
- **Assistive technologies**: Google autocompletion
    - Predictive text input
    - Text simplification
- **Natural Language Generation**: [Narrative Science](https://narrativescience.com/)
    - Generating text from data
- **Automatic summarization**: [Google News](https://news.google.com/news/?ned=us&gl=US&hl=en)
    - Extractive summarization
    - Abstractive summarization
- [**Sentiment analysis**](https://en.wikipedia.org/wiki/Sentiment_analysis):
    - Attitude of speaker

### What are some of the lower level components?

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: 
    - repetitive & redundant gap-filling utterances (e.g., "like") 
    - bridge words (e.g., a/an/the)
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

### Why is NLP hard?

- **Ambiguity**:
    - Teacher Strikes Idle Kids
    - Red Tape Holds Up New Bridges
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: tweets/text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"

### How does NLP work?

- Build probabilistic model using data about a language
- Requires an understanding of the language
- Requires an understanding of the world (or a particular domain)

## Reading in a subset of SemEval

In [1]:
# dependencies #
import pandas as pd
import numpy as np
import scipy as sp

# stratified splitting keeps class proportions the same in train and test sets
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# vectorizers (bag-of-words: they ignore word order and focus on token frequency)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# classifier, evaluation metrics, and plotting
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sns
%matplotlib inline

In [2]:
# read the tweets into a DataFrame
semeval_data = pd.read_csv("../data/semeval_sampled_cleaned_data.csv",sep="\t",names=["id","sentiment","tweet"])

In [3]:
print(semeval_data.shape)
semeval_data.head(10)

(7105, 3)


Unnamed: 0,id,sentiment,tweet
0,633616008910082048,negative,Donald Trump and Scott Walker would Negros bac...
1,639974060466663424,negative,@YidVids2 probably not cause he played with th...
2,664049298624114688,negative,"Woaw just because briana is ""having"" louis' ba..."
3,665627471899975680,negative,I wrote this about the 'SAS response' after th...
4,522951269271339008,negative,@MasterDebator_ @NFLosophy 2nd best in luck dr...
5,109016744177319937,negative,"Boehner tells Obama that sorry, the House is a..."
6,641475286228271108,negative,I think Google may be worried that if they all...
7,636734909646815233,negative,This goes right up there with Rolling Stone pu...
8,636155339067424768,negative,If Carly Fiorina ran the US the way she ran HP...
9,111733236627025920,negative,@HartsPub ru showing the All Blacks v Tonga ga...


In [None]:
semeval_data.tail(10)

In [4]:
# frequency of positive vs. negative tweets
semeval_data.sentiment.value_counts()/semeval_data.shape[0]

positive    0.511612
negative    0.488388
Name: sentiment, dtype: float64

In [5]:
# read one example tweet (row 20)
semeval_data.tweet[20]

'I wish I was in Bolton tonight :('

**Terminology**
- **corpus:** collection of documents 
    - document: an individual row in the data, i.e., a tweet
- **corpora:** plural form of corpus

In [6]:
# encode tweets as UTF-8 bytes (later cells decode them again before passing them to TextBlob)
semeval_data.tweet = semeval_data.tweet.map(lambda x: x.encode("utf-8"))

In [None]:
# encode sentiment as 1 (positive) and 0 (negative)
semeval_data["target"] = (semeval_data.sentiment=="positive").astype(int)

In [None]:
# split the new DataFrame into training and testing sets, keeping relative frequencies of targets unchanged
splitter = StratifiedShuffleSplit(n_splits=1,
                                  test_size=0.3)
train_indices,test_indices=list(splitter.split(semeval_data.tweet,semeval_data.target))[0]
X_train,y_train = semeval_data.tweet.iloc[train_indices],semeval_data.target.iloc[train_indices]
X_test,y_test = semeval_data.tweet.iloc[test_indices],semeval_data.target.iloc[test_indices]

In [None]:
# check that the training set preserves the positive/negative class frequencies
y_train.value_counts()/y_train.shape[0]

In [None]:
y_train.shape

In [None]:
y_test.shape

In [None]:
y_test.value_counts()/y_test.shape[0]

## Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages
- **Tweets:** Traditional NLP preprocessing often discards non-alphabetic characters and numbers. In tweets, some of those symbols carry meaning (hashtags, emoticons, etc.), so tweets need to be parsed a bit differently from ordinary text.
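
To make the point about tweets concrete, here is a minimal plain-Python sketch comparing an alphabetic-only split with a tweet-aware one. The example tweet and the `TWEET_TOKEN` pattern are hypothetical and far simpler than a real tweet tokenizer; they only illustrate what gets lost.

```python
import re

tweet = "Loving the new phone :) #happy @friend"

# Keep only alphabetic runs: the emoticon disappears, and the hashtag and
# mention are reduced to ordinary words.
alpha_only = re.findall(r"[A-Za-z]+", tweet)

# Hypothetical tweet-aware pattern: try emoticons, then hashtags/mentions,
# and only then fall back to plain words.
TWEET_TOKEN = re.compile(r"[:;=]-?[()DPp]|[#@]\w+|\w+")
tweet_aware = TWEET_TOKEN.findall(tweet)

print(alpha_only)   # ['Loving', 'the', 'new', 'phone', 'happy', 'friend']
print(tweet_aware)  # ['Loving', 'the', 'new', 'phone', ':)', '#happy', '@friend']
```

The second list keeps `:)` and `#happy` as distinct tokens, which is the kind of signal a sentiment model can use.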

Below we first define a tweet-aware regular expression, then import NLTK's tweet-specific tokenizer. Both handle the cases described above:

In [None]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)

In [None]:
# tokenizer 
from nltk.tokenize import TweetTokenizer

In [None]:
tokenizer_for_tweets = TweetTokenizer(strip_handles = True, # remove @'s
                           preserve_case = False, # otherwise upper and lower-cases are treated as different words
                           reduce_len = True)

- **token_pattern:** string
- Regular expression denoting what constitutes a "token". The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
- **tokenizer:** callable/function
- Function that converts a string into a list of tokens using some arbitrary logic.
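
As a rough illustration of what the default `token_pattern` does, the sketch below applies it directly with Python's `re` module and compares it with a relaxed pattern that also keeps one-character tokens. The sample sentence is made up, and this only mimics the tokenization step, not the full vectorizer.

```python
import re

text = "I am so happy, u r 2 kind!"

# Default CountVectorizer pattern: tokens of two or more word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Relaxed pattern: single-character tokens are kept as well.
relaxed_pattern = re.compile(r"(?u)\b\w+\b")

default_tokens = default_pattern.findall(text.lower())
relaxed_tokens = relaxed_pattern.findall(text.lower())

print(default_tokens)  # ['am', 'so', 'happy', 'kind']
print(relaxed_tokens)  # ['i', 'am', 'so', 'happy', 'u', 'r', '2', 'kind']
```

Informal text such as tweets tends to use short tokens ("u", "r", "2"), so the choice of pattern can change the vocabulary noticeably.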

In [None]:
# use CountVectorizer to create document-term matrices from X_train and X_test
## dictionary contains unique tokens extracted from documents
vect = CountVectorizer(tokenizer = tokenizer_for_tweets.tokenize)

# fit the vocabulary on the training data only, then reuse it on the test data
train_dtm = vect.fit_transform(X_train) # learn the vocabulary and count tokens
test_dtm = vect.transform(X_test) # count tokens using the vocabulary learned above

In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")
print("# of tokens in training data:", train_dtm.shape[1])
print("# of tokens in testing data:", test_dtm.shape[1])

In [None]:
# last 50 features
print(vect.get_feature_names()[-50:])

In [None]:
# compressed sparse format for a large dataset
train_dtm

In [None]:
# show vectorizer options
vect

Look at the [CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to get a better understanding of how it works.

- **lowercase:** `boolean`, `True` by default
- Convert all characters to lowercase before tokenizing.

In [None]:
# don't convert to lowercase
vect = CountVectorizer(lowercase = False,
                       tokenizer = tokenizer_for_tweets.tokenize)

train_dtm = vect.fit_transform(X_train)
train_dtm.shape

What is the impact of not lowercasing the text?

In [None]:
# allow tokens of one character
vect = CountVectorizer(token_pattern = r'(?u)\b\w+\b')

train_dtm = vect.fit_transform(X_train)

# number of tokens 
print("# of tokens for this tokenizer:", train_dtm.shape[1])

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
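
The `min_n <= n <= max_n` rule can be sketched in a few lines of plain Python. The `ngrams` helper below is a hypothetical illustration, not the vectorizer's own implementation.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = ["not", "very", "good"]
print(ngrams(tokens, 1, 2))
# ['not', 'very', 'good', 'not very', 'very good']
```

Bigrams such as "not very" capture a little word order that single tokens lose, at the cost of a much larger vocabulary.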

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range = (1, 2), # min = 1, max = 2
                       tokenizer = tokenizer_for_tweets.tokenize)
train_dtm = vect.fit_transform(X_train)

print("# of tokens for this tokenizer:", train_dtm.shape[1])

In [None]:
# last 50 features
print(vect.get_feature_names()[-50:])

**Now, build a LR model predicting sentiment:**

In [None]:
# use CountVectorizer with the tweet tokenizer and otherwise default options
vect = CountVectorizer(tokenizer = tokenizer_for_tweets.tokenize)

# create document-term matrices
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

# use Logistic Regression to predict the sentiment label
lr = LogisticRegression()
lr.fit(train_dtm, y_train)
y_pred_class = lr.predict(test_dtm)

# calculate accuracy
print(round(metrics.accuracy_score(y_test, y_pred_class), 2))

In [None]:
# calculate null accuracy (accuracy of always predicting the majority class)
max(y_test.mean(), 1 - y_test.mean())

In [None]:
lr.coef_.shape

Raw token counts alone already give a usable classifier. Compare the accuracy above with the null accuracy to see how much the model improves on always predicting the majority class.

Let's examine the most positive and most negative words (their coefficients tell us whether they are most indicative of a positive or a negative tweet).
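
To see why the sign of a coefficient matters, here is a small sketch of how a logistic model turns token counts into a probability. The coefficients and intercept are invented for illustration and are not taken from the fitted model.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
coeffs = {"love": 1.5, "great": 1.0, "hate": -2.0}
intercept = 0.0


def positive_probability(tokens):
    """Logistic model: sigmoid(intercept + sum of coefficients of the tokens)."""
    # each occurrence of a token adds its coefficient; unknown tokens add 0
    score = intercept + sum(coeffs.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


p_pos = positive_probability(["love", "this", "great", "phone"])
p_neg = positive_probability(["hate", "this", "phone"])
print(round(p_pos, 3), round(p_neg, 3))
```

Tokens with large positive coefficients push the probability towards 1 (positive), large negative ones push it towards 0, and tokens with coefficients near zero barely move it.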

In [None]:
# zip pairs up the elements of the two lists position by position
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]

list(zip(a, b))

In [None]:
feature_coeffs = pd.DataFrame(list(zip(vect.get_feature_names(), 
                                       lr.coef_[0])),
                              columns=["word","coeff"]) # naming the two columns
feature_coeffs = feature_coeffs.sort_values(by="coeff",ascending=False).reset_index(drop=True)

Words most indicative of a positive tweet:

In [None]:
# most positive words
feature_coeffs.head(10)

Words most indicative of a negative tweet:

In [None]:
# most negative words
feature_coeffs.tail(10)

In [None]:
feature_coeffs["abs_coeff"] = feature_coeffs.coeff.abs()

Most predictive words, regardless of polarity:

In [None]:
feature_coeffs.sort_values(by = "abs_coeff",
                           inplace = True,
                           ascending = False)
feature_coeffs.head(100)

Uninformative words (they tell you almost nothing about the polarity of the tweet):

In [None]:
feature_coeffs.tail(20)

## Pipelines To Make CV/Transformations Easier

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

In [None]:
first_pipeline = Pipeline([("countVect", # string that explains what it does
                            CountVectorizer(tokenizer = tokenizer_for_tweets.tokenize)), # step 1
                           ("lr",LogisticRegression())]) # step 2

In [None]:
# passing the raw data through the defined pipeline
first_pipeline.fit(X_train, y_train)

In [None]:
# accuracy of the pipeline's predictions on the held-out test data
# the pipeline exposes .predict because its final step is an estimator
print(round(metrics.accuracy_score(y_test, first_pipeline.predict(X_test)), 2))

In [None]:
strat_cv = StratifiedKFold(n_splits=10) # balance splits by frequency of each target class
# cv = 10 #10-fold cross-validation

In [None]:
# average accuracy over 10 stratified cross-validation folds
np.mean(cross_val_score(first_pipeline,
                semeval_data.tweet,
                semeval_data.target,
                scoring = "accuracy",
                cv = strat_cv,
                n_jobs = -1, # parallel processing
                verbose = 1)).round(2)

In [None]:
# fit the pipeline on the entire dataset
first_pipeline.fit(semeval_data.tweet, semeval_data.target)

In [None]:
# to access the innards
first_pipeline.steps[0][1].get_feature_names() 

In [None]:
feature_names = first_pipeline.steps[0][1].get_feature_names()
feature_coeffs = first_pipeline.steps[1][1].coef_

feature_coeffs = pd.DataFrame(list(zip(feature_names,feature_coeffs.reshape((-1)))), 
                              columns = ["word","coeff"])
feature_coeffs = feature_coeffs.sort_values(by = "coeff",
                                            ascending = False).reset_index(drop = True)
 
feature_coeffs

In [None]:
first_pipeline.steps

In [None]:
# let's create a function that accepts a vectorizer and a classifier and returns a table of coefficients plus the cross-validated accuracy
def tokenize_test(vect,clf):
    pipe = Pipeline([("vect",vect),("lr",clf)])
    pipe.fit(semeval_data.tweet,semeval_data.target)
    num_features = len(pipe.steps[0][1].get_feature_names())
    print('Num Features: ', num_features)

    zipped_coeffs = list(zip(pipe.steps[0][1].get_feature_names(),
                             pipe.steps[1][1].coef_[0]))
    feature_coeffs = pd.DataFrame(zipped_coeffs,columns=["word","coeff"]).sort_values(by="coeff",ascending=False)
    feature_coeffs.reset_index(drop=True,inplace=True)

    strat_cv = StratifiedKFold(n_splits=10)
    acc = np.mean(cross_val_score(pipe,
                                  semeval_data.tweet,
                                  semeval_data.target,
                                  scoring="accuracy",
                                  cv=strat_cv,
                                  n_jobs=-1,
                                  verbose=1))
    print("Accuracy:", acc)
    return (feature_coeffs, acc)

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(tokenizer=tokenizer_for_tweets.tokenize,
                       ngram_range=(1,2))
feature_coeffs,acc = tokenize_test(vect,LogisticRegression())

In [None]:
feature_coeffs.head(10) # most positive words

In [None]:
feature_coeffs.sort_values("coeff").head(10) # most negative words

In [None]:
feature_coeffs["abs_coeffs"] = feature_coeffs.coeff.abs() 

In [None]:
feature_coeffs.sort_values("abs_coeffs",ascending=False).head(10) # absolute values

## Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text
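
A minimal sketch of the idea, using a tiny hypothetical stop word list (real lists are much longer):

```python
# A tiny, hypothetical stop word list used only for this illustration.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "i", "it"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


tokens = ["I", "love", "the", "new", "phone", "and", "it", "is", "great"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['love', 'new', 'phone', 'great']
```

One caveat for sentiment work: many stop word lists also contain negations such as "not", so it is worth checking whether removing them helps or hurts accuracy.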

In [None]:
# show vectorizer options
vect

- **stop_words:** `string` {`'english'`}, `list`, or `None` (default)
- If `'english'`, a built-in stop word list for English is used.
- If a `list`, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- If `None`, no stop words will be used. 
- `max_df` can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on within-corpus document frequency of terms.

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words = 'english',
                       tokenizer = tokenizer_for_tweets.tokenize)

feature_coeffs,acc = tokenize_test(vect,
                                   LogisticRegression())

In [None]:
feature_coeffs.head(10)

In [None]:
feature_coeffs.tail(10)

In [None]:
feature_coeffs["abs_coeffs"] = feature_coeffs.coeff.abs()
feature_coeffs.sort_values("abs_coeffs").head(10)

In [None]:
# set of stop words
print(vect.get_stop_words())

## Other CountVectorizer Options

- **max_features:** int or None, default=None
- If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
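
The effect of document-frequency thresholds can be sketched in plain Python. The `prune_vocabulary` helper and the four toy documents are hypothetical; here `min_df` is treated as an absolute count and `max_df` as a proportion, which are two of the forms the parameters accept.

```python
from collections import Counter

# Four toy documents, already tokenized.
docs = [
    ["good", "phone"],
    ["good", "battery"],
    ["good", "screen", "phone"],
    ["bad", "battery"],
]


def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms with document count >= min_df and document share <= max_df."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, c in df.items()
                  if c >= min_df and c / n_docs <= max_df)


vocab = prune_vocabulary(docs, min_df=2, max_df=0.7)
print(vocab)  # ['battery', 'phone']
```

In this toy corpus "good" is dropped for being too common (3 of 4 documents), while "screen" and "bad" are dropped for being too rare.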

In [None]:
# remove English stop words and only keep words that appear in at least 0.5% of documents
# and in no more than 70% of documents
vect = CountVectorizer(stop_words ='english', 
                       min_df = 0.005, 
                       max_df = 0.7,
                       tokenizer = tokenizer_for_tweets.tokenize)

features,acc = tokenize_test(vect,
                             LogisticRegression())

In [None]:
features.head()

In [None]:
features.sort_values("coeff").head()

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 20 documents
vect = CountVectorizer(ngram_range=(1, 2), 
                       min_df=20,
                       tokenizer = tokenizer_for_tweets.tokenize)

features,acc = tokenize_test(vect,LogisticRegression())

In [None]:
features.head(10)

In [None]:
features.sort_values("coeff").head(10)

In [None]:
features["abs_coeffs"] = features.coeff.abs()
features.sort_values("abs_coeffs").head(10)

## Term Frequency - Inverse Document Frequency (TF-IDF) Transformation

- **What:** Computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document 
    - high frequency in that document, low frequency in other documents
- **Notes:** Used for search engine scoring, text summarization, document clustering
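
As a worked example, the sketch below computes TF-IDF weights by hand for three short documents. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is what scikit-learn's `TfidfVectorizer` uses by default; other definitions of TF-IDF exist. The `tokenize` and `tfidf` helpers are hypothetical names.

```python
import math
import re
from collections import Counter

docs = ["call you tonight", "Call me a cab", "please call me... PLEASE!"]


def tokenize(text):
    # lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)
# document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc_tokens):
    """Term count times smoothed idf, then scaled to unit L2 length."""
    tf = Counter(doc_tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


weights = tfidf(tokenized[0])
print({t: round(w, 3) for t, w in weights.items()})
```

In the first document, "call" occurs in every document and so receives a lower weight than "you" and "tonight", which occur only there.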

In [None]:
# example documents
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE!']

In [None]:
# CountVectorizer
vect = CountVectorizer()

pd.DataFrame(vect.fit_transform(train_simple).toarray(), 
             columns = vect.get_feature_names())

In [None]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(train_simple).toarray(), 
             columns = vect.get_feature_names())

In [None]:
vect = TfidfVectorizer(tokenizer = tokenizer_for_tweets.tokenize)
tfidf_coeffs,acc = tokenize_test(vect,LogisticRegression())

In [None]:
for max_features in (1000,10000):
    vect = TfidfVectorizer(max_features = max_features,
                           tokenizer = tokenizer_for_tweets.tokenize)
    tokenize_test(vect,LogisticRegression())

In [None]:
for ngram_range in (2,3):
    vect = TfidfVectorizer(max_features = 10000,
                           ngram_range = (1,ngram_range),
                           tokenizer = tokenizer_for_tweets.tokenize)
    tokenize_test(vect,LogisticRegression())    

In [None]:
tfidf_coeffs.head()

In [None]:
tfidf_coeffs.sort_values("coeff").head()

In [None]:
tfidf_coeffs["abs_coeffs"] = tfidf_coeffs.coeff.abs()
tfidf_coeffs.sort_values("abs_coeffs").head(10)

## [Introduction to TextBlob](http://textblob.readthedocs.io/en/dev/)


In [None]:
# print the example tweet (row 20)
print(semeval_data.tweet.values[20])

In [None]:
import textblob

In [None]:
tweet = textblob.TextBlob(semeval_data.tweet.values[20].decode("utf-8"))

If the command below fails, run `conda install -c conda-forge textblob` from a terminal window.

In [None]:
import nltk
nltk.download("punkt")
# list the words
tweet.words

In [None]:
# list the sentences
tweet.sentences

In [None]:
# some string methods are available
tweet.lower()

## Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms
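
To give a feel for the rule-based approach, here is a deliberately naive suffix-stripping stemmer. It is a toy with hypothetical rules, not the Snowball algorithm, which uses a much more careful rule set.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


stems = [toy_stem(w) for w in ["wishing", "wished", "wishes", "wish"]]
print(stems)             # ['wish', 'wish', 'wish', 'wish']
print(toy_stem("news"))  # 'new' (a rule-based mistake)
```

The last line shows the typical weakness of blind rules: "news" is reduced to "new" even though the two words are unrelated.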

If the import below fails, run `conda install -c conda-forge nltk` in a terminal window.

In [None]:
# initialize stemmer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# stem each word
print([stemmer.stem(word) for word in tweet.words])

**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
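- **Why:** Can be better than stemming, because the result is a real dictionary word and irregular forms can be resolved using the part of speech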
- **Notes:** Uses a dictionary-based approach (slower than stemming)
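
The dictionary-based idea can be sketched with a lookup table keyed by word and part of speech. The `LEMMAS` table and `toy_lemmatize` helper are hypothetical stand-ins for a real lexical database such as WordNet.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma.
LEMMAS = {
    ("was", "v"): "be",
    ("is", "v"): "be",
    ("geese", "n"): "goose",
    ("indices", "n"): "index",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; fall back to the word."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


print(toy_lemmatize("geese"))         # 'goose'
print(toy_lemmatize("was"))           # 'was' (treated as a noun, no entry)
print(toy_lemmatize("was", pos="v"))  # 'be'
```

The result depends on the part of speech supplied, which is why the same word can be lemmatized differently when treated as a noun or as a verb.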

In [None]:
nltk.download("wordnet")
# assume every word is a noun
print([word.lemmatize() for word in tweet.words])

In [None]:
# assume every word is a verb
print([word.lemmatize(pos='v') for word in tweet.words])

In [None]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = text.lower()
    words = textblob.TextBlob(text.decode("utf-8")).words
    return [word.lemmatize() for word in words]

In [None]:
# use split_into_lemmas as the feature extraction function
vect = CountVectorizer(analyzer=split_into_lemmas)
features,acc = tokenize_test(vect,LogisticRegression())

In [None]:
features.head()

In [None]:
features.sort_values("coeff").head()

## Sentiment Analysis in TextBlob

In [None]:
print(tweet)

In [None]:
# polarity ranges from -1 (most negative) to 1 (most positive)
tweet.sentiment

In [None]:
semeval_data['tweet_length'] = semeval_data.tweet.apply(len)

In [None]:
# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    return textblob.TextBlob(text.decode("utf-8")).sentiment.polarity

In [None]:
# create a new DataFrame column for sentiment
semeval_data['textblob_sentiment'] = semeval_data.tweet.apply(detect_sentiment)

In [None]:
# boxplot of TextBlob polarity grouped by labeled sentiment
semeval_data.boxplot(column='textblob_sentiment', by='sentiment')

In [None]:
# tweets with most positive sentiment
semeval_data[semeval_data.textblob_sentiment == 1].tweet.head()

In [None]:
# tweets with most negative sentiment
semeval_data[semeval_data.textblob_sentiment == -1].tweet.head()

In [None]:
# widen the column display
pd.set_option('display.max_colwidth', 500)

In [None]:
# negative textblob-computed sentiment in a positively labeled tweet
semeval_data[(semeval_data.sentiment == "positive") & (semeval_data.textblob_sentiment < -0.8)].head()

In [None]:
# positive textblob-computed sentiment in a negatively labeled tweet
semeval_data[(semeval_data.sentiment == "negative") & (semeval_data.textblob_sentiment > 0.7)].head()

## Adding Extra Features to a Document-Term Matrix

In [None]:
feature_cols = ['tweet', 'textblob_sentiment','tweet_length']
X = semeval_data[feature_cols]
y = semeval_data.target
#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# use CountVectorizer with text column only
pipeline = Pipeline([("countVect",
                      CountVectorizer(stop_words="english", 
                                      tokenizer = tokenizer_for_tweets.tokenize)),
                     ("lr",LogisticRegression())])

strat_cv = StratifiedKFold(n_splits=10)
tweet_only_acc = np.mean(cross_val_score(pipeline,
                        X.tweet,
                        y,
                        scoring="accuracy",
                        cv=strat_cv,
                        n_jobs=-1,
                        verbose=1))
print("text only accuracy: {:0.3f}".format(tweet_only_acc))

In [None]:
print("Original matrix: ",CountVectorizer(stop_words="english").fit_transform(X.tweet).shape)

In [None]:
from sklearn.preprocessing import StandardScaler
# scale the length feature on a copy of X, so semeval_data is left untouched
X = X.copy()
X["tweet_length"] = StandardScaler().fit_transform(X[["tweet_length"]]).ravel()


In [None]:
# cast other feature columns to float and convert to a sparse matrix
addl_features = sp.sparse.csr_matrix(X.iloc[:, 1:].astype(float))
addl_features.shape

In [None]:
# combine sparse matrices
X_with_addl = sp.sparse.hstack((CountVectorizer(stop_words="english",
                                                token_pattern=tokens_re).fit_transform(X.tweet),
                                addl_features))
print("Matrix with extra features: ",X_with_addl.shape)

In [None]:
# use logistic regression with all features
text_and_other_features_acc = np.mean(cross_val_score(LogisticRegression(),
                                                      X_with_addl,
                                                      y,
                                                      scoring="accuracy",
                                                      cv=strat_cv,
                                                      n_jobs=-1,
                                                      verbose=1))
print("text and extra features acc: {:0.3f}".format(text_and_other_features_acc))

Looks like adding the textblob polarity and tweet length helped a bit.

## Other Fun TextBlob Features

In [None]:
# spelling correction
textblob.TextBlob('15 minuets laate').correct()

In [None]:
from textblob import Word
# spellcheck
Word('bloaud').spellcheck()

In [None]:
# definitions - optionally pass the part of speech you want definitions for
Word('goodbye').define()

In [None]:
# language identification (relies on an online translation service; removed from recent TextBlob releases)
textblob.TextBlob('здраствуйте').detect_language()