# NLP EDA

Basically, exploration and modeling boil down to a single question:

How do we quantify our data/text?

In this lesson, we'll explore answers to this question that will aid in visualization.

- word frequency (by label)
- ngrams
- word cloud
- sentiment analysis
- other common features

## Setup

The data is a set of text messages, each labeled spam or ham (not spam).

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import nltk
import unicodedata
import re

In [2]:
plt.style.available

['Solarize_Light2',
 '_classic_test_patch',
 '_mpl-gallery',
 '_mpl-gallery-nogrid',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

In [3]:
# setting basic style parameters for matplotlib
plt.rc('figure', figsize=(13, 7))
plt.style.use('seaborn-colorblind')

In [2]:
ADDITIONAL_STOPWORDS = ['r', 'u', '2', 'ltgt']
def clean(text):
    'A simple function to cleanup text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
             .encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]
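To see what the normalization and punctuation steps in `clean` do on their own, here is a minimal sketch using only the standard library. The helper name `basic_clean` and the sample sentence are assumptions made for illustration; stopword removal and lemmatization are deliberately left out.

```python
import re
import unicodedata

def basic_clean(text):
    """Lower-case, strip accents and punctuation, and split into words."""
    text = (unicodedata.normalize('NFKD', text)   # split accented chars into base + mark
            .encode('ascii', 'ignore')            # drop anything that is not ASCII
            .decode('utf-8', 'ignore')
            .lower())
    # remove everything that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', text).split()

# "Caf\u00e9" is "Café", written with an escape to keep this snippet ASCII-only.
tokens = basic_clean("Caf\u00e9 owner says: FREE entry, don't wait!")
print(tokens)
```

Note that the apostrophe is removed as well, so "don't" becomes "dont"; contractions may then no longer match stopword entries that keep the apostrophe.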

In [5]:
# basic cleaning function:
# ADDITIONAL_STOPWORDS = ['r', 'u', '2', 'ltgt']

# def clean(text):
#     '''Simplified text cleaning function'''
#     stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
#     text = text.lower()
#     text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
#     words = re.sub(r"[^a-z0-9\s]", '', text).split()  # split into words, not characters
#     return [word for word in words if word not in stopwords]

In [3]:
# acquire data from spam_db

from env import get_db_url

# def get_db_url(database, host=host, user=user, password=password):
#     return f'mysql+pymysql://{user}:{password}@{host}/{database}'


url = get_db_url("spam_db")
sql = "SELECT * FROM spam"

df = pd.read_sql(sql, url)
df.head()

Unnamed: 0,id,label,text
0,0,ham,"Go until jurong point, crazy.. Available only ..."
1,1,ham,Ok lar... Joking wif u oni...
2,2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,3,ham,U dun say so early hor... U c already then say...
4,4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# drop the id column (arbitrary index)
df = df.drop(columns='id')

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train_validate, test = train_test_split(df,
                                        random_state=1349,
                                        train_size=0.8,
                                        stratify=df.label)
train, validate = train_test_split(train_validate,
                                   random_state=1349,
                                   train_size=0.7,
                                   stratify=train_validate.label)

In [10]:
train.shape, validate.shape, test.shape

((3119, 2), (1338, 2), (1115, 2))
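These shapes follow from chaining the two splits: 20% of the rows are held out for test, and 70% of the remaining 80% go to train. A quick arithmetic check, using only the row counts printed above:

```python
# Row counts copied from the shapes printed above.
n_train, n_validate, n_test = 3119, 1338, 1115
n_total = n_train + n_validate + n_test

# Expected fractions from train_size=0.8 followed by train_size=0.7.
expected = {
    'train': 0.8 * 0.7,     # 56% of all rows
    'validate': 0.8 * 0.3,  # 24% of all rows
    'test': 0.2,            # 20% of all rows
}
observed = {
    'train': n_train / n_total,
    'validate': n_validate / n_total,
    'test': n_test / n_total,
}

for name in expected:
    print(f'{name:<8} expected {expected[name]:.2f}, observed {observed[name]:.4f}')
```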

### If we look at this in the context of a classification problem, we may ask:
 - What leads to a spam text?
 - What leads to a ham text?
 

In [11]:
# let's clean that data

In [12]:
# let's get some insight into word frequency by breaking
# the messages back apart into words:
# we will join the messages for each label into one string,
# clean it (which splits it into a list of words),
# cast that list as a Series,
# and then take the value counts of that Series.
# We will do this for ham, spam, and all messages.
ham_df = train[train.label=='ham']

In [16]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /Users/aaron/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/aaron/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/aaron/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/aaron/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/aaron/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |     /Users/aaron/nltk_data...
[nltk_data]    | Downloading pac

False

In [15]:
ham_words = clean(' '.join(train[train.label=='ham']['text']))
spam_words = clean(' '.join(train[train.label=='spam']['text']))
all_words = clean(' '.join(train['text']))

LookupError: 
**********************************************************************
  Resource omw-1.4 not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('omw-1.4')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/omw-1.4

  Searched in:
    - '/Users/aaron/nltk_data'
    - '/opt/homebrew/anaconda3/nltk_data'
    - '/opt/homebrew/anaconda3/share/nltk_data'
    - '/opt/homebrew/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [None]:
len(ham_words), len(spam_words), len(all_words)

In [None]:
len(ham_words) + len(spam_words) == len(all_words)

In [None]:
ham_freq = pd.Series(ham_words).value_counts()
spam_freq = pd.Series(spam_words).value_counts()
all_freq = pd.Series(all_words).value_counts()

In [None]:
ham_freq.head()

In [None]:
spam_freq.head()

## Exploration

Represent text as word frequencies.

In [None]:
# concat all frequencies together into a dataframe
word_counts = pd.concat([ham_freq, spam_freq,all_freq], axis=1
         ).fillna(0
                 ).astype(int)
word_counts.columns = ['ham','spam','all']
word_counts.head()

- What are the most frequently occurring words?
- Are there any words that uniquely identify a spam or ham message? I.e. words present in one type of message but not the other?
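One way to approach the second question is with set differences over the per-label counts. The sketch below uses `collections.Counter` and small made-up token lists; in the notebook, `ham_words` and `spam_words` would take their place.

```python
from collections import Counter

# Toy token lists (made up for illustration); in the notebook these
# would be the ham_words and spam_words lists produced by clean().
ham_tokens = ['call', 'later', 'home', 'ok', 'call', 'love']
spam_tokens = ['free', 'call', 'prize', 'free', 'claim', 'txt']

ham_counts = Counter(ham_tokens)
spam_counts = Counter(spam_tokens)

# Words that appear under one label but never under the other.
only_ham = sorted(set(ham_counts) - set(spam_counts))
only_spam = sorted(set(spam_counts) - set(ham_counts))

print(only_ham)
print(only_spam)
print(spam_counts.most_common(1))
```

With the `word_counts` DataFrame, the equivalent filters would be `word_counts[word_counts.spam == 0]` for ham-only words and `word_counts[word_counts.ham == 0]` for spam-only words.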

In [None]:
word_counts.sort_values('all', ascending=False).head()

In [None]:
word_counts.sort_values(['ham','spam','all'], ascending=False).head()

### Visualization

- ham vs spam count for 20 most common words
- ham vs spam proportion for 20 most common words

In [None]:
word_counts.sort_values('all', ascending=False
                       )[['ham','spam']].head(20).plot.barh()

In [None]:
word_counts.sort_values('all', ascending=False
                       ).head(20).apply(
    lambda row: row/row['all'], axis=1
)[['ham','spam']].plot.barh(
    stacked=True, legend=False, ec='black', 
    width=1).set(title='Ham word proportion v Spam word proportion for 20 most popular words');
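The lambda in the cell above divides each row by its `all` count, so the ham and spam values become shares that sum to 1 for every word. A plain-Python sketch of the same arithmetic, with made-up counts that are not taken from the dataset:

```python
# Made-up counts for three words (not taken from the dataset).
counts = {
    'call': {'ham': 30, 'spam': 20},
    'free': {'ham': 5, 'spam': 45},
    'ok': {'ham': 40, 'spam': 0},
}

proportions = {}
for word, row in counts.items():
    total = row['ham'] + row['spam']  # plays the role of the 'all' column
    proportions[word] = {label: n / total for label, n in row.items()}

for word, row in proportions.items():
    print(word, row)
```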
                                                         

## n-grams

**bigram**: combinations of 2 words

Represent text as combinations of 2 words

In [None]:
# let's test this out on a sentence:
some_string = 'here is a thing to put into bigrams ok'

In [None]:
list(nltk.bigrams(some_string.split()))

**Be Careful!** Make sure you are making bigrams out of *words*.
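The pitfall is easy to reproduce: a string is a sequence of characters, so pairing its neighbors yields character pairs. The sketch below uses a small `zip`-based helper as a standard-library stand-in for `nltk.bigrams`, which likewise pairs up whatever items the sequence it is given contains.

```python
def bigrams(sequence):
    """Pair each item with its successor (stdlib stand-in for nltk.bigrams)."""
    items = list(sequence)
    return list(zip(items, items[1:]))

sentence = 'here is a thing'

# Pitfall: iterating over a string yields characters, not words.
char_pairs = bigrams(sentence)
# Intended: split into words first.
word_pairs = bigrams(sentence.split())

print(char_pairs[:3])
print(word_pairs)
```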

- what are the most common bigrams? spam bigrams? ham bigrams?
- visualize 20 most common bigrams, most common ham bigrams
- ngrams

Find the most common bigram and then find a representative text
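A rough sketch of that exercise, assuming a handful of made-up messages and only the standard library. Unlike the cells in this lesson, which join all messages into one long word list, this version counts bigrams per message so that no pair spans two messages.

```python
from collections import Counter

# Made-up messages, already lower-cased and free of punctuation.
messages = [
    'please call me later',
    'call me when you can',
    'can you call me later today',
    'see you at home',
]

def bigrams(words):
    """Pair each word with the word that follows it."""
    return list(zip(words, words[1:]))

# Count bigrams per message so pairs never span two messages.
bigram_counts = Counter()
for message in messages:
    bigram_counts.update(bigrams(message.split()))

top_bigram, top_count = bigram_counts.most_common(1)[0]

# A representative text is the first message containing the bigram.
representative = next(m for m in messages if top_bigram in bigrams(m.split()))

print(top_bigram, top_count)
print(representative)
```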

In [None]:
pd.Series(nltk.bigrams(ham_words)).value_counts().head(20).plot.barh()

In [None]:
# making other types of ngrams:
list(nltk.ngrams(some_string.split(), 6))

In [None]:
pd.Series(nltk.ngrams(spam_words, 3)
         ).value_counts().head(20).plot.barh()

In [None]:
ham_grams = [thing[0] + '_' + thing[1] for thing in list(nltk.bigrams(ham_words))]

## Word Cloud

`python -m pip install --upgrade wordcloud`

documentation: https://amueller.github.io/word_cloud/

In [None]:
from wordcloud import WordCloud

In [None]:
# wordcloud expects a single string
img = WordCloud(background_color='White',
         ).generate(' '.join(ham_words))

In [None]:
img

In [None]:
plt.imshow(img)
plt.axis('off')
plt.title('Most common ham words')
plt.show()

In [None]:
img = WordCloud(background_color='White',
         ).generate(' '.join(spam_words))
plt.imshow(img)
plt.axis('off')
plt.title('Most common spam words')
plt.show()

In [None]:
# wordcloud expects a single string
img = WordCloud(background_color='White',
         ).generate(' '.join(ham_grams))
plt.imshow(img)
plt.axis('off')
plt.title('Most common ham bigrams')
plt.show()

## Other Common Features

Any NLP dataset will have domain specific features, for example: number of retweets, number of @mentions, number of upvotes, or mean time to respond to a support chat. In addition to these domain specific features, some common measures for a document are:

- character count
- word count
- sentence count
- stopword count
- unique word count
- punctuation count
- average word length
- average words per sentence
- word to stopword ratio

Create one or more of the above features and visualize it.

In [None]:
# for a word count on cleaned text, apply clean to each message and chain len;
# to count words without cleaning first,
# split the raw text on whitespace instead
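Several of the measures listed above can be computed from the raw text with the standard library alone. The sketch below is one possible take: the function name `text_features`, the tiny stopword set, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
import string

# Tiny stand-in stopword list (an assumption for this sketch); the
# notebook would use nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {'a', 'an', 'and', 'are', 'is', 'the', 'to', 'you'}

def text_features(text):
    """Compute a few simple per-document measures from raw text."""
    words = text.split()
    # Normalize words for matching: lower-case, trim surrounding punctuation.
    normalized = [w.lower().strip(string.punctuation) for w in words]
    # Naive sentence split on ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r'[.!?]+(?:\s+|$)', text) if s.strip()]
    return {
        'char_count': len(text),
        'word_count': len(words),
        'sentence_count': len(sentences),
        'stopword_count': sum(w in STOPWORDS for w in normalized),
        'unique_word_count': len(set(normalized)),
        'punctuation_count': sum(ch in string.punctuation for ch in text),
        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = text_features('You are a winner! Call now to claim the prize.')
print(features)
```

In the notebook, a single feature such as character count could be added with `train['text'].apply(len)` and then plotted by label.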


## Sentiment

A number indicating whether the document is positive or negative.

- knowledge-based + statistical approach
- relies on human-labelled data
    - combination of qualitative and quantitative methods
    - then empirically validate
- different models for different domains (e.g. social media vs news)
- for social media
    - Afinn ([github](https://github.com/fnielsen/afinn) + [whitepaper](http://www2.imm.dtu.dk/pubdb/edoc/imm6006.pdf))
    - Vader ([github](https://github.com/cjhutto/vaderSentiment) + [whitepaper](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf)): **V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner, a pre-trained sentiment analyzer available as `nltk.sentiment.vader.SentimentIntensityAnalyzer`
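To make the knowledge-based idea concrete, here is a toy lexicon scorer: each known word carries a hand-assigned valence, and a document's score is the sum over its words. The lexicon and its values are invented for this sketch and are not taken from Afinn or Vader, which use much larger, empirically validated lexicons (and, in Vader's case, additional rules).

```python
import re

# Hypothetical mini lexicon: word -> valence. These values are invented for
# illustration and are not taken from Afinn or Vader.
TOY_LEXICON = {'good': 2, 'great': 3, 'well': 2, 'bad': -2, 'awful': -3, 'hate': -3}

def toy_sentiment(text):
    """Sum the valence of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(word, 0) for word in words)

print(toy_sentiment('Things are going really well!'))
print(toy_sentiment('Things are going really awful!'))
print(toy_sentiment('The food was great but the service was bad'))
```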


From your terminal:
`python -c 'import nltk;nltk.download("vader_lexicon")'`

In [None]:
import nltk.sentiment

In [None]:
sia = nltk.sentiment.SentimentIntensityAnalyzer()

Things that can influence the sentiment score:
1. Punctuation. Can increase the intensity
2. Capitalization. Can increase the intensity
3. Degree modifiers
4. Conjunctions

It can also handle emojis and slang.

In [None]:
sia.polarity_scores('Things are going really well!')

In [None]:
sia.polarity_scores('Things are going really awful!')

In [None]:
sia.polarity_scores('she is vegan :(')['compound']

Apply this to the text message data

In [None]:
# grab the sentiment of each text as it stands:
# apply a lambda function to each cell in the text column
# and keep the value associated with the "compound"
# key of polarity_scores' result
train['compound_sentiment'] = train['text'
                                   ].apply(
    lambda x: sia.polarity_scores(x)['compound'])

In [None]:
# are the mean and median sentiment scores different for ham vs spam?
train.groupby('label')['compound_sentiment'].agg(['mean', 'median'])

### Takeaways:
 - Spam messages seem to have roughly the same message length, whereas ham message length varies a lot.
 - Spam messages tend to have a very positive sentiment.
 - If we wanted to use these features for modeling, we would want to follow up with a comparison-of-means test to establish their viability.
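As a rough sketch of what such a means test involves, the snippet below computes Welch's t statistic for two groups using only the standard library. The scores are made up for illustration and are not results from this dataset; the helper name `welch_t` is also an assumption.

```python
from math import sqrt
from statistics import mean, variance

# Made-up compound scores for illustration only (not results from the dataset).
ham_scores = [0.10, -0.20, 0.00, 0.30, 0.15, -0.05]
spam_scores = [0.60, 0.45, 0.70, 0.30, 0.55, 0.65]

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t(spam_scores, ham_scores)
print(round(t_stat, 2))
```

With SciPy installed, `scipy.stats.ttest_ind(spam_scores, ham_scores, equal_var=False)` reports the same statistic along with a p-value.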

## More Resources

- [kaggle wikipedia movie plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots)
    - Suggestion: narrow to top n genres that aren't unknown
- [wikitable extractor](https://wikitable2csv.ggor.de/) (Try with, e.g. [helicopter prison escapes](https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes))
- [Textblob library](https://textblob.readthedocs.io/en/dev/)