<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">

## Naive Bayes Language Detection Lab

_Author: David Yerrington (SF)_

In this lab, we’ll use Naive Bayes (and other classifiers) to auto-detect the language of a given tweet. We’ll then assess the performance of our classifier.

In [1]:
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import learning_curve
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline

%matplotlib inline

sns.set_style("darkgrid")

In [3]:
tweets_df = pd.read_csv("datasets/tweets_language.csv", index_col=0)

In [4]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9431 entries, 0 to 9408
Data columns (total 2 columns):
LANG    9409 non-null object
TEXT    9409 non-null object
dtypes: object(2)
memory usage: 221.0+ KB


In [5]:
# Note: Some of the rows above are null, so we can't use them for training.
tweets_df = tweets_df.dropna()

In [7]:
tweets_df.head()

|   | LANG | TEXT |
|---|------|------|
| 0 | en | The #Yolo bailout: Greece's ex-finance chief h... |
| 1 | en | Another mental Saturday night. It will be near... |
| 2 | en | Sometimes you take bedtime selfies w yer hat s... |
| 3 | en | Currently just changed my entire outfit includ... |
| 4 | en | I just like listening to @SpotifyAU's top 100 ... |


### 1) Data exploration.

#### 1.A) Explore a list of tweet words that occur more than 50 times.
Plot a histogram that might be helpful.

In [8]:
# Let's use the CountVectorizer to count words for us.
cvt      =  CountVectorizer(strip_accents='unicode', ngram_range=(1,1))
X_all    =  cvt.fit_transform(tweets_df['TEXT'])

# Complete the code.

#### 1.B) Investigate the `counts` histogram.

#### 1.C) Try it again with stop word removal.

In [None]:
# Let's use the CountVectorizer to count words for us.
cvt      =  CountVectorizer(strip_accents='unicode')
X_all    =  cvt.fit_transform(tweets_df['TEXT'])

# Complete the code.

#### 1.D) Explore n-grams between two and four.
Display the top 75 n-grams with frequencies. Look at each class to see their similarities and differences.

In [7]:
# Look up the appropriate parameters.
# CountVectorizer?


#### 1.E) (Optional) Try expanding the list of stop words.
There are definitely some non-words, such as web URLs, that could be removed to help us improve the score. Identify words/tokens that don't add much value to any language. **You should also look at n-grams per language to fine-tune your preprocessing. This has the greatest potential to improve your results without tuning any model parameters.**

Using `nltk.corpus`, we can get a baseline list of stop words. Try to expand it and pass it to our vectorizer.

In [8]:
from nltk.corpus import stopwords
stop = stopwords.words('english')


### 2) Set up a train/test split of your data using any method you wish.
Try 70/30 to start.
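A minimal sketch of a 70/30 split with `train_test_split` follows. The `toy` DataFrame is a made-up stand-in for `tweets_df`, and stratifying on `LANG` is an optional choice (an assumption here) that keeps the language proportions similar in both halves.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for tweets_df with two balanced languages.
toy = pd.DataFrame({
    "LANG": ["en"] * 10 + ["es"] * 10,
    "TEXT": ["hello world %d" % i for i in range(10)]
            + ["hola mundo %d" % i for i in range(10)],
})

# test_size=0.3 gives the 70/30 split; random_state makes it reproducible.
tweets_train, tweets_test = train_test_split(
    toy, test_size=0.3, stratify=toy["LANG"], random_state=42
)
print(len(tweets_train), len(tweets_test))
```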

### 3) Set up a pipeline to vectorize and use the MultinomialNB classifier.
Use `lowercase`, `strip_accents`, `Pipeline`, and (optionally) your updated `stop_words`. Use the tweet text (`TEXT`) as your feature and the language label (`LANG`) as your response.

Fit your training data set to your pipeline, then score it.

In [None]:
# Here's the code — you can adapt it from here on out.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
]) 

pipeline.fit(tweets_train["TEXT"], tweets_train["LANG"])

# Don't forget to score.
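Scoring can be as simple as calling `pipeline.score` on held-out data, which reports mean accuracy for classifiers. The sketch below uses a handful of made-up sentences in place of `tweets_train` and `tweets_test`, so the variable names and the data are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training and test data.
train_text = ["good morning everyone", "have a nice day",
              "buenos dias a todos", "que tengas un buen dia"]
train_lang = ["en", "en", "es", "es"]
test_text = ["good day", "buen dia"]
test_lang = ["en", "es"]

pipeline = Pipeline([
    ('vect', CountVectorizer(lowercase=True, strip_accents='unicode')),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB()),
])
pipeline.fit(train_text, train_lang)

# Mean accuracy on the held-out examples.
accuracy = pipeline.score(test_text, test_lang)
print(accuracy)
```

On the lab data, the equivalent call would be `pipeline.score(tweets_test["TEXT"], tweets_test["LANG"])`.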

#### 3.A) Swap out MultinomialNB with BernoulliNB in the pipeline.
How do they compare? Do you have a guess as to why BernoulliNB is so poor?
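One way to run the comparison is to build the same pipeline around each classifier and cross-validate it. The sketch below does this on a small made-up corpus. The data, the `scores` dictionary, and the choice of three folds are assumptions, and results on twelve toy sentences say nothing about how the two models rank on the real tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical corpus: six English and six Spanish sentences.
texts = [
    "good morning everyone", "have a nice day", "see you later tonight",
    "what a great song", "i am going home now", "this is my favorite movie",
    "buenos dias a todos", "que tengas un buen dia", "nos vemos esta noche",
    "que gran cancion", "me voy a casa ahora", "esta es mi pelicula favorita",
]
langs = ["en"] * 6 + ["es"] * 6

scores = {}
for name, clf in [("MultinomialNB", MultinomialNB()), ("BernoulliNB", BernoulliNB())]:
    pipeline = Pipeline([
        ('vect', CountVectorizer(strip_accents='unicode')),
        ('tfidf', TfidfTransformer()),
        ('cls', clf),
    ])
    # Mean accuracy across three stratified folds.
    scores[name] = cross_val_score(pipeline, texts, langs, cv=3).mean()

print(scores)
```

When interpreting the gap, keep in mind that `BernoulliNB` binarizes its input (by default at a threshold of 0) and also models the absence of each word, whereas `MultinomialNB` works with the weighted counts directly.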

#### 3.B) Try logistic regression and random forests in the pipeline.
How do they compare? Recall that logistic regression is discriminative, whereas Naive Bayes is generative. Logistic regression uses optimization to fit a formula that discriminates between the classes, while Naive Bayes essentially just computes aggregate statistics. So, logistic regression should have a longer training time than Naive Bayes — but does it here? (See `%time`.)

**Note**: Logistic regression and random forests both allow you to see feature importance/coefficients. In this case, these coefficients will inform you how strongly each word indicates a language. Optionally, see if you can sort these coefficients by their values to get the strongest and weakest indicator words for languages.

#### 3.C) Also, try tweaking the parameters of `CountVectorizer` and `TfidfTransformer`.

Remove TF-IDF. Is this good or bad?

### 4) Check your score.
For which languages does your model work best? Run a classification report for all languages. [Plot the area under curve/ROC](../../week-04/2.3-evaluating_model_fit/code/AUC-ROC-codealong.ipynb) for particular languages (versus all others) and compare them — do they indicate that some languages perform better? Does our model perform worse while guessing on some languages versus others? Additionally, [review the classification reporting metrics](../../week-04/4.3-advanced-model_evaluation/code/starter-code/week4-4.1-classification-report.ipynb).

In [17]:
# Update the code to display the classification report.
classification_report?
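For reference, here is a small sketch of `classification_report` on hand-written labels. The `y_true` and `y_pred` lists are invented for the example. In the lab they would be `tweets_test["LANG"]` and the output of `pipeline.predict`.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted languages.
y_true = ["en", "en", "en", "es", "es", "fr"]
y_pred = ["en", "en", "es", "es", "es", "en"]

# Text report with precision, recall, F1, and support per language.
# zero_division=0 avoids a warning for "fr", which is never predicted.
report = classification_report(y_true, y_pred, zero_division=0)
print(report)

# The same numbers as a nested dict, which is easier to sort or plot.
report_dict = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report_dict["es"]["recall"])
```

Low recall for a language means its tweets are being assigned to other languages. Low precision means other languages' tweets are being assigned to it.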

### Revisiting: ROC/AUC.

In [None]:
from sklearn.metrics import roc_curve, auc

def multi_roc(classifier, X, y, cv):
    # Plot one ROC curve per cross-validation fold, plus the mean curve.
    # Assumes X and y are NumPy arrays, y is binary (for example, LANG == "en"),
    # and cv is a scikit-learn splitter such as ShuffleSplit.
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = np.zeros_like(mean_fpr)

    for i, (train, test) in enumerate(cv.split(X, y)):
        probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
        # Compute the ROC curve and area under the curve.
        fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
        # Interpolate this fold's curve onto the shared FPR grid.
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

    mean_tpr /= cv.get_n_splits(X, y)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, 'k--',
             label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
from sklearn.metrics import roc_curve

def plot_roc(y, probs, threshmarkers=None):
    fpr, tpr, thresh = roc_curve(y, probs)

    plt.figure(figsize=(8,8))
    plt.plot(fpr, tpr, lw=2)
   
    plt.xlabel("False Positive Rate\n(1 - Specificity)")
    plt.ylabel("True Positive Rate\n(Sensitivity)")
    plt.xlim([-0.025, 1.025])
    plt.ylim([-0.025, 1.025])
    plt.xticks(np.linspace(0, 1, 21), rotation=45)
    plt.yticks(np.linspace(0, 1, 21))
    plt.show()

In [18]:
# Using your pipeline, predict the probabilities of each language.
# Then, call plot_roc.

## Your code to predict the probabilities of each class:

# Example of testing a particular language:
plot_roc(tweets_test['LANG'].apply(lambda x: x == "en"), predicted_proba[:, list(pipeline.classes_).index("en")])
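To compare languages numerically rather than by eye, one option is to compute a one-versus-rest AUC for each column of `predict_proba`. The sketch below does this on made-up sentences. The data and the `auc_by_lang` dictionary are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training and test data in three languages.
train_text = ["good morning everyone", "have a nice day",
              "buenos dias a todos", "que tengas un buen dia",
              "bonjour tout le monde", "bonne journee a tous"]
train_lang = ["en", "en", "es", "es", "fr", "fr"]
test_text = ["nice morning", "buen dia a todos", "bonjour a tous"]
y_test = np.array(["en", "es", "fr"])

pipeline = Pipeline([
    ('vect', CountVectorizer(strip_accents='unicode')),
    ('cls', MultinomialNB()),
])
pipeline.fit(train_text, train_lang)

# One column of probabilities per language, ordered as in pipeline.classes_.
predicted_proba = pipeline.predict_proba(test_text)

# Treat each language in turn as the positive class against all others.
auc_by_lang = {}
for idx, lang in enumerate(pipeline.classes_):
    auc_by_lang[lang] = roc_auc_score(y_test == lang, predicted_proba[:, idx])

print(auc_by_lang)
```

`roc_auc_score` needs both positive and negative examples in the test labels, so a language that is absent from the test set has to be skipped.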

### 5) Check out your baseline.

What is the chance that you'll randomly guess correctly without any modeling? Assume your input phrase's language has the same chance of appearing as the languages in your training set.
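Two common baselines can be read straight off the class proportions. If language $i$ makes up a fraction $p_i$ of the data, always guessing the most common language is correct with probability $\max_i p_i$, while guessing at random in proportion to the class frequencies is correct with probability $\sum_i p_i^2$. The sketch below works this through on invented label counts. The `langs` Series is a stand-in for `tweets_train["LANG"]`.

```python
import pandas as pd

# Hypothetical label distribution: 60% en, 30% es, 10% fr.
langs = pd.Series(["en"] * 6 + ["es"] * 3 + ["fr"] * 1)
proportions = langs.value_counts(normalize=True)

# Always predict the most frequent language.
majority_baseline = proportions.max()

# Guess each language with probability equal to its frequency.
proportional_baseline = (proportions ** 2).sum()

print(majority_baseline, proportional_baseline)
```

Here the two baselines are 0.6 and 0.36 + 0.09 + 0.01 = 0.46. A model is only useful if it clearly beats both.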

### 6) What is your model not getting right?

Check out the incorrectly classified tweets. Are there any noticeable patterns? Can you explain why many of these are incorrectly classified given what you know about how Naive Bayes works? Pay particular attention to the recall metric. What could be done in the preprocessing steps to improve accuracy?

- Try to improve your **preprocessing first**.
- Then, try to tweak your **parameters to your model(s)**.
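A starting point for this investigation is to put predictions next to the true labels, filter the mismatches, and tabulate a confusion matrix. The sketch below uses hand-written predictions, so `results`, its `predicted` column, and the example tweets are all hypothetical. In the lab, `predicted` would come from `pipeline.predict(tweets_test["TEXT"])`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical test tweets with true and predicted languages.
results = pd.DataFrame({
    "TEXT": ["lol ok", "hola amigos", "bonjour paris", "no problem", "si claro", "ok merci"],
    "LANG": ["en", "es", "fr", "en", "es", "fr"],
    "predicted": ["en", "es", "fr", "es", "es", "en"],
})

# Rows where the model got the language wrong.
errors = results[results["LANG"] != results["predicted"]]
print(errors)

# Rows are true languages, columns are predicted languages.
labels = sorted(results["LANG"].unique())
cm = pd.DataFrame(
    confusion_matrix(results["LANG"], results["predicted"], labels=labels),
    index=labels, columns=labels,
)
print(cm)
```

Off-diagonal cells show which pairs of languages get confused. Reading the tweets in `errors` for those pairs often points to the preprocessing step worth changing.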

## Additional Practice

There are two additional data sets in the directory that you can use for more practice:

- **/datasets/tweets_sentiment.csv**: Sentiment analysis.

- **/datasets/insults_train.csv**: [Kaggle data set](https://www.kaggle.com/c/detecting-insults-in-social-commentary). _Warning:_ This content is fairly provocative and contains offensive and insensitive words. However, this type of problem is common across comment threads throughout the web.

    - Check out [this blog post](http://webmining.olariu.org/my-first-kaggle-competition-and-how-i-ranked/) by a guy who used support vector machines, a "neural network," and a ton of cleaning to place third in a Kaggle competition using this same data set. Additionally, see [this post](http://peekaboo-vision.blogspot.de/2012/09/recap-of-my-first-kaggle-competition.html) — he took sixth place and found that the best model was a simple logistic regression.

#### Where to next?

If you're interested in this type of problem, a great area to read up on is sentiment analysis. This [Kaggle data set](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) offers an excellent opportunity for more practice. The following papers are also great for further exploration of this topic:

- [Fast and accurate sentiment classification using an enhanced Naive Bayes model](http://arxiv.org/pdf/1305.6143.pdf): *a great overview!*
- [Sarcasm detection](http://www.aclweb.org/anthology/P15-2124)
- [Making computers laugh: Investigations in automatic humor recognition](http://www.aclweb.org/anthology/H05-1067)
- [Modeling sarcasm in Twitter, a novel approach](http://www.aclweb.org/anthology/W14-2609)
- [Narcissism and lie detection](https://deepblue.lib.umich.edu/bitstream/handle/2027.42/107345/zarins.finalthesis.pdf?sequence=1): *this study's metrics are interesting.*