# Module 20 -  Text Mining in Python -- Analyzing Sentiment


**_Author: Jessica Cervi_**

**Expected time = 3 hours**

**Total points = 120 points**

## Assignment Overview


Sentiment analysis falls into the broad category of text classification tasks: you are supplied with a phrase or a list of phrases, and your classifier is supposed to tell whether the sentiment behind each one is positive or negative.

Consider the following phrases:
- "Titanic is a great movie."
- "Titanic is not a great movie."

The phrases correspond to short movie reviews, and each conveys a different sentiment. The first phrase denotes positive sentiment about the film Titanic, while the second treats the movie as not so great (negative sentiment). In this assignment, you'll work with a data set to tokenize text, examine word frequencies, and train and apply a sentiment classifier.

This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question.
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 


### Learning Objectives

- Use `nltk` to tokenize text
- Use `nltk` to examine word frequencies
- Use `scikit-learn` and `LogisticRegression` to train a sentiment classifier
- Apply our sentiment classifier to Stephen King reviews
- Use `textblob` to examine *sentiment* and *polarity* of a text
- Remove **stopwords** and punctuation from text
- Compute **complexity** of text using Lexical Diversity


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk

In [2]:
df = pd.read_csv('./data/king_clean.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,author,date,review,title
0,Stephen King,2011-11-13,"By ERROL MORRISNOV. 10, 2011 In all of Stephen...",11/22/63
1,Stephen King,2011-10-31,"By JANET MASLINOCT. 30, 2011 Stephen King’s la...",11/22/63
2,Stephen King,2004-01-04,"By ANDREW O'HEHIRJAN. 4, 2004 Wolves of the Ca...",Wolves of the Calla
3,Stephen King,1993-10-24,"By RICHARD E. NICHOLLSOCT. 24, 1993 ...",Nightmares and Dreamscapes
4,Stephen King,2001-11-04,"By MARY ELIZABETH WILLIAMSNOV. 4, 2001 BLACK H...",Black House


[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*10 points*

To get started, we want to focus on a straightforward approach to sentiment analysis similar to what we saw in the lectures. Before we can count **positive** and **negative** words, however, we need to **tokenize** our reviews, that is, split each review string into a list of individual words and punctuation marks.
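To make the idea of tokenizing concrete, here is a minimal sketch that uses only the standard library. The helper `simple_tokenize` is a hypothetical stand-in, not NLTK's algorithm: `word_tokenize` applies many more rules (in the output shown later in this question it keeps `'L.'` and `'11/22/63'` as single tokens, which this regex would split apart).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Illustrative stand-in only; NLTK's word_tokenize is more sophisticated.
    """
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches one character that is neither a word character
    #          nor whitespace, i.e. a punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Titanic is not a great movie."))
# ['Titanic', 'is', 'not', 'a', 'great', 'movie', '.']
```

Note that the final period becomes its own token, which is why punctuation shows up in the token counts later in this assignment.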

In [4]:
df['review'][0][:500]

'By ERROL MORRISNOV. 10, 2011 In all of Stephen King’s work there is an admixture of the ordinary and the supernatural — call it the weird quotidian. In his new novel, “11/22/63,” it is a rabbit hole into the past that pops up in Lisbon Falls, a woebegone corner of Maine. On one end is 2011. An unpopular diner has finally been bought out by L. L. Bean. The diner — and the time portal inside it — may last a few more weeks in the footprint of a burned textile mill. On the other end is America under'

Use the `nltk` function `word_tokenize` to create a list of tokens from the `review_0` string defined below. Assign the resultant list of tokens to the `ans1` variable.

In [6]:
from nltk import word_tokenize
review_0 = df['review'][0]

In [7]:
#Construct tokens (words/sentences) from the text
# text = le_monde_data.raw()
# import nltk
# from nltk import sent_tokenize,word_tokenize 
# sentences = nltk.Text(sent_tokenize(text))
# print(len(sentences))
# words = nltk.Text(word_tokenize(text))
# print(len(words))

In [11]:
### GRADED

### YOUR SOLUTION HERE
ans1 = word_tokenize(review_0)

###
### YOUR CODE HERE
###


In [12]:
ans1

['By',
 'ERROL',
 'MORRISNOV',
 '.',
 '10',
 ',',
 '2011',
 'In',
 'all',
 'of',
 'Stephen',
 'King',
 '’',
 's',
 'work',
 'there',
 'is',
 'an',
 'admixture',
 'of',
 'the',
 'ordinary',
 'and',
 'the',
 'supernatural',
 '—',
 'call',
 'it',
 'the',
 'weird',
 'quotidian',
 '.',
 'In',
 'his',
 'new',
 'novel',
 ',',
 '“',
 '11/22/63',
 ',',
 '”',
 'it',
 'is',
 'a',
 'rabbit',
 'hole',
 'into',
 'the',
 'past',
 'that',
 'pops',
 'up',
 'in',
 'Lisbon',
 'Falls',
 ',',
 'a',
 'woebegone',
 'corner',
 'of',
 'Maine',
 '.',
 'On',
 'one',
 'end',
 'is',
 '2011',
 '.',
 'An',
 'unpopular',
 'diner',
 'has',
 'finally',
 'been',
 'bought',
 'out',
 'by',
 'L.',
 'L.',
 'Bean',
 '.',
 'The',
 'diner',
 '—',
 'and',
 'the',
 'time',
 'portal',
 'inside',
 'it',
 '—',
 'may',
 'last',
 'a',
 'few',
 'more',
 'weeks',
 'in',
 'the',
 'footprint',
 'of',
 'a',
 'burned',
 'textile',
 'mill',
 '.',
 'On',
 'the',
 'other',
 'end',
 'is',
 'America',
 'under',
 'Eisenhower',
 '.',
 'The',
 'mi

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q2'></a>

### Question 2:

*10 points*

Extending the idea of single tokens, we can generate **bigrams**, or pairs of adjacent tokens, from the text. Note that the `bigrams` function returns a Python *generator*, which allows for lazy evaluation. Generators are common in NLP tasks; feel free to read more about them [here](https://docs.python.org/3/tutorial/classes.html#generators).

To create a list, we have to wrap the call to `bigrams` in a `list` constructor. Further, `bigrams` expects a list of tokens as an input rather than a single string.
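As a rough illustration of what lazy evaluation means here, the sketch below builds adjacent pairs with a plain Python generator. The function `adjacent_pairs` is a hypothetical stand-in written for this example, not NLTK's implementation of `bigrams`.

```python
def adjacent_pairs(tokens):
    """Yield (tokens[i], tokens[i + 1]) pairs one at a time."""
    for left, right in zip(tokens, tokens[1:]):
        yield left, right

tokens = ["Alice", "took", "up", "the", "fan"]

pairs = adjacent_pairs(tokens)  # no pairs have been produced yet
first = next(pairs)             # produce just the first pair on demand
rest = list(pairs)              # consume the remaining pairs into a list

print(first)        # ('Alice', 'took')
print(rest)         # [('took', 'up'), ('up', 'the'), ('the', 'fan')]
print(list(pairs))  # [] because the generator is now exhausted
```

The last line shows why the `list` wrapper matters: a generator can only be consumed once, so store the list if you need the pairs more than once.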
  

In [13]:
from nltk import bigrams

In [14]:
text = '''
Alice took up the fan and gloves and she kept fanning herself all the
time she went on talking. "Dear, dear! How queer everything is to-day!
And yesterday things went on just as usual. _Was_ I the same when I got
up this morning? But if I'm not the same, the next question is, 'Who in
the world am I?' Ah, _that's_ the great puzzle!"
'''
tokens = word_tokenize(text)
bi_tokens = list(bigrams(tokens))
print(bi_tokens[:5])

[('Alice', 'took'), ('took', 'up'), ('up', 'the'), ('the', 'fan'), ('fan', 'and')]


Create a list of bigrams of `review_0`. Save your results to `ans2` below.

In [17]:
### GRADED

### YOUR SOLUTION HERE

tokens = word_tokenize(review_0)
ans2 = list(bigrams(tokens))

###
### YOUR CODE HERE
###


In [18]:
ans2

[('By', 'ERROL'),
 ('ERROL', 'MORRISNOV'),
 ('MORRISNOV', '.'),
 ('.', '10'),
 ('10', ','),
 (',', '2011'),
 ('2011', 'In'),
 ('In', 'all'),
 ('all', 'of'),
 ('of', 'Stephen'),
 ('Stephen', 'King'),
 ('King', '’'),
 ('’', 's'),
 ('s', 'work'),
 ('work', 'there'),
 ('there', 'is'),
 ('is', 'an'),
 ('an', 'admixture'),
 ('admixture', 'of'),
 ('of', 'the'),
 ('the', 'ordinary'),
 ('ordinary', 'and'),
 ('and', 'the'),
 ('the', 'supernatural'),
 ('supernatural', '—'),
 ('—', 'call'),
 ('call', 'it'),
 ('it', 'the'),
 ('the', 'weird'),
 ('weird', 'quotidian'),
 ('quotidian', '.'),
 ('.', 'In'),
 ('In', 'his'),
 ('his', 'new'),
 ('new', 'novel'),
 ('novel', ','),
 (',', '“'),
 ('“', '11/22/63'),
 ('11/22/63', ','),
 (',', '”'),
 ('”', 'it'),
 ('it', 'is'),
 ('is', 'a'),
 ('a', 'rabbit'),
 ('rabbit', 'hole'),
 ('hole', 'into'),
 ('into', 'the'),
 ('the', 'past'),
 ('past', 'that'),
 ('that', 'pops'),
 ('pops', 'up'),
 ('up', 'in'),
 ('in', 'Lisbon'),
 ('Lisbon', 'Falls'),
 ('Falls', ','),
 (',

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*10 points*



Once we have a list of tokens or bigrams, we can use the `FreqDist` class to create a count of the tokens.  From here, we can view the most common words and their counts using the `.most_common()` method.
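`FreqDist` behaves much like Python's built-in `collections.Counter`, so the counting idea can be sketched without NLTK. The sentence below is made up for illustration, and `str.split` is used as a simplified stand-in for `word_tokenize`.

```python
from collections import Counter

# Whitespace splitting stands in for word_tokenize in this sketch.
tokens = "she said she was sure she was late".split()

counts = Counter(tokens)          # maps each token to its count
top_two = counts.most_common(2)   # list of (token, count), highest first

print(top_two)  # [('she', 3), ('was', 2)]
```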

In [19]:
from nltk import FreqDist

In [20]:
text = '''
As she said this, she looked down at her hands and was surprised to see
that she had put on one of the Rabbit's little white kid-gloves while
she was talking. "How _can_ I have done that?" she thought. "I must be
growing small again." She got up and went to the table to measure
herself by it and found that she was now about two feet high and was
going on shrinking rapidly. She soon found out that the cause of this
was the fan she was holding and she dropped it hastily, just in time to
save herself from shrinking away altogether.
'''
tokens = word_tokenize(text)
fdist = FreqDist(tokens)
fdist.most_common(5)

[('she', 8), ('was', 6), ('and', 5), ('.', 5), ('to', 4)]

Complete the function `top_n` below that will accept a single string of text and an integer, `n`. Your function should return a list of tuples of the `n` most common tokens from the text of the form (token, count).

In [21]:
### GRADED

### YOUR SOLUTION HERE
def top_n(text, n):
    tokens = word_tokenize(text)
    fdist = FreqDist(tokens)
    return fdist.most_common(n)

###
### YOUR CODE HERE
###


In [22]:
top_n(text,5)

[('she', 8), ('was', 6), ('and', 5), ('.', 5), ('to', 4)]

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q4'></a>

### Question 4:

*10 points*

`scikit-learn` has functionality similar to that of the NLTK library for text tokenization. It is especially well suited to working with DataFrames, so we will bring in a dataset labeled with sentiment from the IMDB Movie Reviews dataset curated by Maas. This is a small sample of the data; if you are interested in the entire dataset, feel free to visit: http://ai.stanford.edu/~amaas/data/sentiment/




In [23]:
imdb = pd.read_csv('./data/sample_revs.csv', index_col=0)

In [24]:
imdb.head()

Unnamed: 0,review,sentiment
0,Badland is one of the worst movies I ever seen...,negative
1,New Year's Day. The day after consuming a few ...,negative
2,There's really not a whole lot to say about th...,negative
3,...for this movie defines a new low in Bollywo...,negative
4,What seemed at first just another introverted ...,positive


As you can see, we have the text of movie reviews that have been given labels of either "positive" or "negative". We aim to train a logistic regression classifier using the tokenized version of the reviews as input and the sentiment feature as the target. To begin, we can create our input array using the `CountVectorizer` class.

The `CountVectorizer` is a transformer, for which we will use the `.fit_transform()` method to create a [sparse array](https://en.wikipedia.org/wiki/Sparse_matrix).  
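To see what a document-term matrix contains, here is a small standard-library sketch. The helper `tiny_dtm` is hypothetical and returns plain dense lists rather than a sparse array. It mimics two `CountVectorizer` defaults: text is lowercased, and only tokens of two or more word characters are kept, so a word like "a" is dropped.

```python
import re
from collections import Counter

def tiny_dtm(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Lowercase, then keep tokens made of at least two word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # One column per distinct token, in alphabetical order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = ["A great movie, a great cast.", "Not a great movie."]
vocabulary, rows = tiny_dtm(docs)

print(vocabulary)  # ['cast', 'great', 'movie', 'not']
print(rows)        # [[1, 2, 1, 0], [0, 1, 1, 1]]
```

Most entries in a real document-term matrix are zero because each document uses only a small part of the vocabulary, which is why `CountVectorizer` returns a sparse array.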

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

In [32]:
#instantiate the class
cvect = CountVectorizer()

In [37]:
type(df.review)

pandas.core.series.Series

In [33]:
#create a document term matrix from king reviews
dtm = cvect.fit_transform(df.review)

In [34]:
#examine the first ten tokens
#similar to word_tokenize from earlier
#note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
cvect.get_feature_names()[:10]

['00', '000', '02', '03844', '074', '10', '100', '103', '11', '12']

In [35]:
#convert sparse array to dense array
dtm = dtm.todense()
#get column names
features = cvect.get_feature_names()
#build DataFrame
dtm_df = pd.DataFrame(dtm,  columns = features)
dtm_df.head()

Unnamed: 0,00,000,02,03844,074,10,100,103,11,12,...,zeitgeist,zeppelin,zeros,zestful,zevon,zombie,zombies,zombified,zone,zuckerman
0,0,0,0,0,0,1,0,0,7,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,6,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [40]:
dtm_df

Unnamed: 0,00,000,02,03844,074,10,100,103,11,12,...,zeitgeist,zeppelin,zeros,zestful,zevon,zombie,zombies,zombified,zone,zuckerman
0,0,0,0,0,0,1,0,0,7,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,6,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,2,2,0,0,0
60,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,3
61,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
62,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0


Write a function, `dtm_maker`, that accepts a Series of string objects and returns a DataFrame of token counts with one row per entry in the Series. Each term will be a column header, and each cell will hold the count of that term in the corresponding row's text.

In [41]:
### GRADED

### YOUR SOLUTION HERE
def dtm_maker(series):
    from sklearn.feature_extraction.text import CountVectorizer
    cvect = CountVectorizer()
    dtm = cvect.fit_transform(series)
    dtm = dtm.todense()
    features = cvect.get_feature_names()
    return  pd.DataFrame(dtm,  columns = features)

###
### YOUR CODE HERE
###


In [39]:
dtm_maker(df.review)

Unnamed: 0,00,000,02,03844,074,10,100,103,11,12,...,zeitgeist,zeppelin,zeros,zestful,zevon,zombie,zombies,zombified,zone,zuckerman
0,0,0,0,0,0,1,0,0,7,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,6,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,2,2,0,0,0
60,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,3
61,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
62,0,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q5'></a>

### Question 5:

*10 points*

Now, we will use our dataframe from above to train a `LogisticRegression` classifier on the labeled sentiment. We can also use the coefficients of the fitted model to understand the "importance" of features: a token whose coefficient has a large absolute value is "important", and the sign of the coefficient tells us whether the token pushes a review toward positive or negative sentiment. Your correct solution should generate the table below.




    
| tokens	| coefs| 
| -------- | -------- |
| great	| 0.8791126599377999| 
| us	| 0.6056992793234783| 
| well	| 0.5752492839208284| 
| best	| 0.5504774974120047| 
| my	| 0.5414566720604057| 
| short	| 0.5071974167589889| 
| always	| 0.5033530077459032| 
| liked	| 0.493808270195112| 
| excellent	| 0.4874796582725917| 
| most	| 0.47949228301494345| 


Write a function, `important_words`, that accepts `X` and `y`, which are both `pandas.core.series.Series` objects.
- Both objects contain string values.

Your function should use the `LogisticRegression` class to fit the classifier on our `imdb` reviews (text) as input and sentiment as target.
- Note that sentiment values will be strings, `positive` and `negative`.
- When instantiating the `LogisticRegression` class, use the following values:
    - `solver='lbfgs'`
    - `max_iter=2000`
    - `random_state=24`

Your function should save the resulting coefficients and their tokens as a DataFrame and return the top 10 coefficients (see the sample table above).
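To build intuition for why coefficient signs track sentiment, the sketch below fits a tiny logistic regression by hand on four made-up two-word "reviews". It is a from-scratch gradient-descent stand-in, not scikit-learn's `lbfgs` solver, and it omits the regularization that `LogisticRegression` applies by default. The data, learning rate, and iteration count are all assumptions chosen for illustration.

```python
import math

# Toy data, invented for this sketch: 1 = positive, 0 = negative.
docs = ["great movie", "great fun", "awful movie", "awful mess"]
labels = [1, 1, 0, 0]

# Build a small document-term matrix by hand.
vocab = sorted({word for doc in docs for word in doc.split()})
X = [[doc.split().count(word) for word in vocab] for doc in docs]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.0] * len(vocab)
bias = 0.0
learning_rate = 0.5

for _ in range(500):
    grad_w = [0.0] * len(vocab)
    grad_b = 0.0
    for row, y in zip(X, labels):
        # Predicted probability that this document is positive.
        p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
        error = p - y
        for j, x in enumerate(row):
            grad_w[j] += error * x
        grad_b += error
    # Plain gradient-descent step on the log-loss.
    weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b

# Rank tokens from most positive to most negative coefficient.
ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
for token, coef in ranked:
    print(f"{token:>6} {coef:+.3f}")
```

In this toy setting `great` ends up with the largest positive coefficient and `awful` with the most negative one, while `movie`, which appears equally often in both classes, stays near zero.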

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
### GRADED

### YOUR SOLUTION HERE
def important_words(X, y):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### YOUR CODE HERE
###


[Back to top](#Index:) 
<a id='q6'></a>

### Question 6:

*10 points*

Up to this point, we have not done any hyperparameter tuning, nor did we consider different models when building the sentiment classifier.  Instead, we assume that we have arrived at a solid model, and reuse this model on new data to provide sentiment scores.  

This is where we call on `textblob`, which ships with a ready-made sentiment analyzer, so we can score new text without training a model ourselves. We demonstrate the use of the `TextBlob` class and its `sentiment` attribute below.

In [None]:
from textblob import TextBlob

In [None]:
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

In [None]:
blob = TextBlob(text)
blob.sentiment

According to textblob's documentation:

```
The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). 
The polarity score is a float within the range [-1.0, 1.0]. 
The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
```
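Because the result is a namedtuple, its values can be read by name, by position, or by unpacking. The sketch below uses a stand-in `Sentiment` namedtuple with the two field names from the quote above; the numbers are made up and are not the output of any real analysis.

```python
from collections import namedtuple

# Stand-in with the same field names as in the documentation quote.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

# Made-up values for illustration only.
s = Sentiment(polarity=0.25, subjectivity=0.6)

print(s.polarity)            # access by field name
print(s[1])                  # access by position (subjectivity)
polarity, subjectivity = s   # unpack like a regular tuple
as_plain_tuple = tuple(s)    # convert to an ordinary tuple of floats
print(as_plain_tuple)        # (0.25, 0.6)
```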

In [None]:
king_rev = pd.read_csv('./data/king_clean.csv', index_col = 0)
tommyknockers = king_rev.loc[40, 'review']

Using the `tommyknockers` variable above and the `TextBlob` class, write a function `pol_sub` that takes the `tommyknockers` variable as an argument. Your function should return a tuple of float values with the first value being the polarity and the second value being the subjectivity.

In [None]:
### GRADED

### YOUR SOLUTION HERE
def pol_sub(text=tommyknockers):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q7'></a>

### Question 7:

*10 points*



To use `textblob` to assign sentiment scores to our entire review dataset, we will use the `.apply()` method with a custom function that returns both the polarity and the subjectivity of each review. Recall that in pandas we use `.apply()` in place of an explicit loop over every row when applying a function to a column. A simple example follows.

In [None]:
text = '''It was the White Rabbit, trotting slowly back again and looking
anxiously about as it went, as if it had lost something; Alice heard it
muttering to itself, "The Duchess! The Duchess! Oh, my dear paws! Oh, my
fur and whiskers! She'll get me executed, as sure as ferrets are
ferrets! Where _can_ I have dropped them, I wonder?" Alice guessed in a
moment that it was looking for the fan and the pair of white kid-gloves
and she very good-naturedly began hunting about for them, but they were
nowhere to be seen--everything seemed to have changed since her swim in
the pool, and the great hall, with the glass table and the little door,
had vanished completely.'''

df = pd.DataFrame([text], columns = ['Text'])
df.head()

In [None]:
#creating a list of words using the .split() method
df['Text'].apply(lambda x: x.split())

In [None]:
#counting the words
df['Text'].apply(lambda x: len(x.split()))

In [None]:
def num_tokens(string):
    from nltk import word_tokenize
    return len(word_tokenize(string))

In [None]:
king_rev['review'].apply(num_tokens)

Write a function `num_tokens` that accepts a string as an argument and returns the number of tokens as an integer.

Write another function `series_tokens` that accepts a `pandas.core.series.Series` object as an argument.
- This Series object is assumed to have string values without nulls.

`series_tokens` should use the `.apply` method in conjunction with the `num_tokens` function to return another Series object holding the integer number of tokens for each row.

In [None]:
### GRADED

### YOUR SOLUTION HERE
def num_tokens(string):
    return

def series_tokens(series):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q8'></a>

### Question 8:

*10 points*



Write a function `series_pol_sub` that accepts a `pandas.core.series.Series` object as an input.
- Assume this Series object has non-null string values.

Your function should call the `pol_sub` function you wrote earlier with an `.apply` method.
Your function should return a Series object in which each value is a tuple containing:
- polarity (float)
- subjectivity (float)

Calling `series_pol_sub` with `king_rev['review']` should produce a Series whose first rows look something like this:




|-|-|
|---|---|
|0|(0.06928269905881844, 0.48262103982253235)|
|1|(0.07120610870610873, 0.5133349366682699)|
|2|(0.05425619070049119, 0.46552828432362114)|

In [None]:
### GRADED

### YOUR SOLUTION HERE
def series_pol_sub(series):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q9'></a>

### Question 9:

*10 points*

Similar to the lectures, we can compute a measure of complexity through the lexical diversity of a review. Here, we define lexical diversity as the fraction of tokens that are unique. We will rely on the earlier `word_tokenize` operation and work with the resulting list of tokens. We will define lexical diversity as follows:

$$\text{Lexical Diversity} = \frac{\text{number unique tokens}}{\text{number total tokens}}$$

For us, this means:

- `tokens_u` = number of unique tokens in list resulting from `word_tokenize` applied to text.
- `tokens_t` = total number of tokens in list resulting from `word_tokenize` applied to text.
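As a quick worked example (the token list is made up for illustration), take the five tokens `['the', 'cat', 'saw', 'the', 'dog']`. The token `'the'` appears twice, so `tokens_u` = 4 and `tokens_t` = 5:

$$\text{Lexical Diversity} = \frac{4}{5} = 0.8$$

A text that never repeats a token scores 1.0, and the score falls toward 0 as tokens repeat more often.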

Write a function `complexity_score` that takes a multi-token string as an argument (i.e. 'The quick brown fox jumped over the lazy dog').
Your function should return a complexity score  as a float.
- _Hint_ : The `set` data structure is one way to 'de-dupe' (retrieve unique records) from a `list`.
- _Hint_ : Use the `word_tokenize` function from the `nltk` library to parse your input string.

In [None]:
### GRADED

### YOUR SOLUTION HERE
def complexity_score(string):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q10'></a>

### Question 10:

*10 points*

In our earlier complexity calculation and in our `LogisticRegression` classifier, we did not pay much attention to the elements in our word tokens.  Not only did we include things like punctuation, but we also relied on some potentially redundant vocabulary such as **"this, the, that, or, and, but, ...".**  As we saw in the lectures, these are called "stopwords".  NLTK provides us a list of stopwords that we can utilize to filter stopwords from a text.

In [None]:
from nltk.corpus import stopwords
#create a list of stopwords
stops = stopwords.words('english')
#print every twentieth stopword
print(stops[::20])

Write a function `remove_stopwords` that takes a string as an input. Your function should return a `list` of string tokens as an output with English stopwords removed.

- _Hint_ : You will need to:
    - Tokenize your input text with `word_tokenize`
    - Remove the token if it exists within the list of `nltk`'s English stopwords (this is case sensitive!)
    - Return the result
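The case-sensitivity note matters because the stopword list is lowercase. The sketch below shows the difference on a pre-tokenized toy sentence; `tiny_stops` is a hypothetical, hand-picked stand-in for the much longer NLTK list.

```python
# Hand-picked stand-in; the real NLTK list is much longer.
tiny_stops = {"the", "is", "a", "in", "that", "too"}

# Already tokenized, to keep the focus on the filtering step.
tokens = ["The", "soup", "is", "too", "hot", "in", "the", "pot"]

# Case-sensitive check: capitalized "The" is not in the lowercase set.
case_sensitive = [t for t in tokens if t not in tiny_stops]

# Lowercasing before the check also removes "The".
case_insensitive = [t for t in tokens if t.lower() not in tiny_stops]

print(case_sensitive)    # ['The', 'soup', 'hot', 'pot']
print(case_insensitive)  # ['soup', 'hot', 'pot']
```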

In [None]:
### GRADED

### YOUR SOLUTION HERE
def remove_stopwords(string):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q11'></a>

### Question 11:

*10 points*

After we've removed the stopwords, we can focus on eliminating non-alphabetic tokens as a way to eliminate punctuation.

In [None]:
character = ['!', '@', '#', '$', '%', '.', 'Tony']
  

Write a function `remove_punctuation_and_stopwords` that accepts a string as an input.
Your function should return a `list` of string tokens with English stopwords removed and punctuation removed (as dictated by `str.isalpha`).
        
_Example:_ `"There's certainly too much pepper in that soup!"` --> `{'certainly', 'soup', 'pepper', 'much'}`
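To see how `str.isalpha` separates words from punctuation, the sketch below applies it to the `character` list from the cell above and to a few extra tokens chosen for illustration.

```python
character = ['!', '@', '#', '$', '%', '.', 'Tony']

# Keep only tokens made up entirely of alphabetic characters.
kept = [c for c in character if c.isalpha()]
print(kept)  # ['Tony']

# isalpha is strict: an apostrophe, hyphen, or digit makes it return False.
extra_tokens = ["'s", "kid-gloves", "2011", "soup"]
flags = [token.isalpha() for token in extra_tokens]
print(flags)  # [False, False, False, True]
```

Note that this filter also drops hyphenated words and numbers, not just punctuation marks.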

In [None]:
### GRADED

### YOUR SOLUTION HERE
def remove_punctuation_and_stopwords(string):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q12'></a>

### Question 12:

Now, using our earlier defined functions, we should be able to take in a column of text, remove the stopwords and punctuation, and compute the complexity score for each review. We apply this to our Stephen King reviews, and might interpret the result as speaking to the diversity of language the book induced in a critic.

Write a function `clean_complexity` that accepts a `pandas.core.series.Series` object as an argument. Assume the values of this series are non-null strings.
   
Your function should remove all stop words and punctuation from the values in this series and return a `Series` object whose values are the corresponding lexical diversity score (float) for each input row.

In [None]:
### GRADED

### YOUR SOLUTION HERE
def clean_complexity(series):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### What's the right Stephen King book?

Now, we might use our function to approximate the best book to read based on the diversity of language elicited from the critic. There are many additional approaches for conducting such an analysis; our aim was to give you insight into how some of these tools work. You can adjust things like the words included and excluded in training a sentiment model or computing complexity, consider n-grams, only certain parts of speech, etc.

In [None]:
king_rev = pd.read_csv('./data/king_clean.csv', index_col=0, nrows=100)
king_rev['complexity'] = clean_complexity(king_rev['review'])
king_rev.sort_values('complexity', ascending = False).head()