# Assignment 1 - Working with Text

In this assignment, I worked with messy medical data and used regex to extract relevant information from it. The goal was to correctly identify and sort all of the date variants encoded in the dataset. After extracting the date patterns from the text, I normalized the dates and sorted them in ascending chronological order, according to the following rules:

* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
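
The rules above can be sketched with the standard library alone. The helper `normalize_numeric` below is a hypothetical name, not part of the assignment code, and it covers only the numeric variants (mm/dd/yy, mm/dd/yyyy, mm/yyyy, yyyy). Month-name formats and typo correction are left out. Making the two-digit-year rule explicit avoids relying on a date parser's own pivot year, which may map some two-digit years to the 2000s.

```python
import re
from datetime import date

def normalize_numeric(text):
    """Parse a numeric date string under the assignment's rules (illustrative helper)."""
    parts = [int(p) for p in re.split(r'[/-]', text.strip())]
    if len(parts) == 3:
        # mm/dd/yy or mm/dd/yyyy
        month, day, year = parts
    elif len(parts) == 2:
        # mm/yyyy: a missing day defaults to the 1st of the month
        month, year = parts
        day = 1
    else:
        # yyyy: a missing month and day default to January 1st
        month, day, year = 1, 1, parts[0]
    if year < 100:
        # Two-digit years are assumed to be in the 1900s
        year += 1900
    return date(year, month, day)

samples = ['1/5/89', '9/2009', '2010', '03/25/93']
ordered = sorted(samples, key=normalize_numeric)
print(ordered)  # ['1/5/89', '03/25/93', '9/2009', '2010']
```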

With these rules, the task was to extract the correct date from each medical note and return a pandas Series containing the original Series' indices in chronological order of those dates.

*Note: Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.*

In [1]:
import pandas as pd
import re 

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0         03/25/93 Total time of visit (in minutes):\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level (N...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9             (4/10/71)Score-1Audit C Score Current:\n
dtype: object

In [2]:
def date_sorter():
    # Step 1: Regex
    ## A) Numeric date. E.g., mm/dd/yyyy, mm/dd/yy, mm-dd-yyyy, mm-dd-yy
    num = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'
    ## B) Date. E.g., Month Day Year, Jan 4th 2000, August 2nd, 1999, Oct. 1, 1987
    day_second = r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[.]?[a-z]*[+\s]\d{1,2}(?:st|nd|rd|th)?[,]?[+\s]\d{4})'
    ## C) Date, with day optional. E.g., Day Month Year, 14 Feb. 2009, 22 July 2001, Mar 2009
    day_first = r'((\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[.]?[a-z]*[,]? \d{4})'
    ## D) Year (yyyy format) and, optionally, the month (m or mm format) for the 1900s and 2000s
    ## Note: [12] matches a leading 1 or 2; the earlier [1|2] also matched a literal '|'
    year = r'((\d{1,2}[/-])?[12]\d{3})'

    ## Compile a list of all extracted dates
    all_dates = '({}|{}|{}|{})'.format(num, day_first, day_second, year)
    date = df.str.extract(all_dates, expand=True)

    # Step 2. Correct misspellings
    date = date[date.columns[0]].str.replace('Janaury', 'January').str.replace('Decemeber', 'December')

    # Step 3: Final result. Convert series to datetime in ascending order
    date = pd.to_datetime(date)
    return pd.Series(date.sort_values().index)

date_sorter().head(10)

0    474
1    153
2     13
3    129
4     98
5    111
6    225
7     31
8    171
9    191
dtype: int64

---
# Assignment 2 - Introduction to NLTK

In part 1 of this assignment, I used nltk to explore the Herman Melville novel, *Moby Dick*. Then in part 2, I created a spelling recommender function that uses nltk to find words similar to the misspelling. 

## Part 1 - Analyzing Moby Dick

In [3]:
import nltk
import pandas as pd
import numpy as np
nltk.download('punkt')

# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Example 1 - How many tokens (words and punctuation symbols) are in text1?

In [4]:
def example_one():
    
    return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)

example_one()

254989

### Example 2 - How many unique tokens (unique words and punctuation) does text1 have?

In [5]:
def example_two():
    
    return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))

example_two()

20755

### Example 3 - After lemmatizing the verbs, how many unique tokens does text1 have?

In [6]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...


16900

### Question 1 - What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

In [7]:
def answer_one():

    # Ratio =  (Unique Tokens) / (Total # of Tokens)
    return  len(set(moby_tokens)) / len(moby_tokens)

answer_one()

0.08139566804842562
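
As a small worked example of the ratio above, the sketch below computes lexical diversity on a made-up token list (not taken from the novel), using only `collections.Counter`.

```python
from collections import Counter

# Toy token list; punctuation counts as a token, as it does in text1
tokens = "the whale , the sea , and the ship".split()
counts = Counter(tokens)

# Lexical diversity = unique tokens / total tokens
lexical_diversity = len(counts) / len(tokens)
print(len(counts), len(tokens))      # 6 9
print(round(lexical_diversity, 3))   # 0.667
```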

### Question 2 - What percentage of tokens is 'whale' or 'Whale'?

In [8]:
def answer_two():
    
    # Step 1) Find number of "whale" and "Whale" tokens
    whale = text1.count("whale") + text1.count("Whale")
    
    # Step 2) Find percentage: (Whale Tokens) / (Total Tokens) * 100%
    return (whale / len(moby_tokens)) * 100

answer_two()

0.4125668166077752

### Question 3 - What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

In [9]:
from nltk import FreqDist
dist = FreqDist(text1)

def answer_three():
    return dist.most_common(20)

answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2097),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]

### Question 4 - What tokens have a length of greater than 5 and frequency of more than 150?

In [10]:
def answer_four():
    freq_words = [t for t in text1 if len(t) > 5 and dist[t] > 150]
    # Note: Use set() to obtain unique tokens
    return sorted(set(freq_words))

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

### Question 5 - Find the longest word in text1 and that word's length.

In [11]:
# Unique words within the novel:
words = set(text1)
    
def answer_five():
    
    #Order words by decreasing length
    longest_word = sorted(dist, reverse=True, key=len) 
    
    return (longest_word[0], len(longest_word[0]))

answer_five()

("twelve-o'clock-at-night", 23)

### Question 6 - What unique words have a frequency of more than 2000? What is their frequency?

In [12]:
def answer_six():
   
    # Collect (frequency, word) pairs for alphabetic tokens occurring more than 2000 times.
    # A list of tuples is used rather than a dict keyed by frequency, which would
    # silently drop words that share the same count.
    filtered = [(freq, word) for word, freq in dist.items() if word.isalpha() and freq > 2000]
    
    # Return the pairs ordered from most to least frequent
    return sorted(filtered, reverse=True)

answer_six()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2097, 'I')]

### Question 7 - What is the average number of tokens per sentence?

In [13]:
def answer_seven():
    # Split the raw text into sentences
    sent_tokens = nltk.sent_tokenize(moby_raw)
    # Number of tokens in each sentence
    num_sentence = [len(nltk.word_tokenize(s)) for s in sent_tokens]
    # Average number of tokens per sentence
    return np.mean(num_sentence)

answer_seven()

25.881952902963864

### Question 8 - What are the 5 most frequent parts of speech in this text? What is their frequency?

In [14]:
nltk.download('averaged_perceptron_tagger')
from collections import Counter 

def answer_eight():
    # Parts of speech for the entire novel:
    pos_tag = nltk.pos_tag(text1)
    # Count the most frequent parts of speech (pos):
    freq_pos = Counter(part for word, part in pos_tag)
    
    print("The top 5 most frequent parts of speech: ")
    return freq_pos.most_common(5)

answer_eight()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
The top 5 most frequent parts of speech: 


[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]

## Part 2 - Spelling Recommender

For this part of the assignment, I created three different spelling recommenders. Each takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender finds the word in `correct_spellings` that has the shortest distance to it and starts with the same letter, and returns that word as the recommendation. The distance is calculated using either the Jaccard distance or the edit distance.

Each of the recommenders provides recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.

<br>
*Distance metrics*:
**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index)** and **[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**


*Note: Since there is a lot of overlapping code for the three recommendation systems, I created a generalized recommendation function that analyzes the word entries based on the number of specified ngrams. The function also includes an option for using the Jaccard or Edit distance, as specified in Questions 9-11.*
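
To make the n-gram comparison concrete, here is a minimal standard-library sketch of the Jaccard distance on character n-grams. The helper names (`char_ngrams`, `jaccard_distance`, `closest`) and the three-word shortlist are assumptions for illustration; the notebook itself relies on `nltk.ngrams` and `nltk.jaccard_distance` over the full word list.

```python
def char_ngrams(word, n):
    """Return the set of character n-grams in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; assumes at least one set is non-empty."""
    return 1 - len(a & b) / len(a | b)

def closest(misspelled, candidates, n):
    """Pick the candidate whose n-gram set is nearest to the misspelling's."""
    target = char_ngrams(misspelled, n)
    return min(candidates, key=lambda c: jaccard_distance(target, char_ngrams(c, n)))

# Hypothetical shortlist of words starting with the same letter
shortlist = ['validate', 'valid', 'validity']
print(closest('validrate', shortlist, 3))  # validate
print(closest('validrate', shortlist, 4))  # valid
```

On this shortlist the winner changes with the n-gram size: with 4-grams, the short word `valid` shares all of its n-grams with the misspelling, so its distance drops below that of `validate`.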

In [15]:
nltk.download('words')
from nltk.corpus import words

correct_spellings = words.words()
correct_spellings[:10]

[nltk_data] Downloading package words to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron']

### Question 9 - For this recommender, provide recommendations for the three default words using the Jaccard Distance on the trigrams of the two words.

In [16]:
# Use this function for Question #9-11:

def recommend(entries, ngram):
    
    distance = []

    for e in entries:
        # Keep only candidate words that start with the same letter as the entry
        correct = [cor for cor in correct_spellings if cor.startswith(e[0])]
        
        # Optional: Edit Distance (Question #11)
        if ngram == 0:
            edit = [nltk.edit_distance(e, cor) for cor in correct]
            distance.append(correct[np.argmin(edit)])
            
        # Default: Jaccard Distance (Question #9-10)
        else: 
            jaccard = [nltk.jaccard_distance(set(nltk.ngrams(e, n=ngram)), set(nltk.ngrams(cor, n=ngram)))
                                            for cor in correct]
            distance.append(correct[np.argmin(jaccard)])
    
    # Return the closest recommendation for each misspelled word
    return distance

def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    
    return recommend(entries, ngram=3)
    
answer_nine()



['corpulent', 'indecence', 'validate']

### Question 10 - For this recommender, provide recommendations for the three default words using the Jaccard Distance on the 4-grams of the two words. 

In [17]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    
    return recommend(entries, ngram=4)
    
answer_ten()



['cormus', 'incendiary', 'valid']

### Question 11 - For this recommender, provide recommendations for the three default words using the Edit Distance of the two words.

In [18]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    
    # Note: Use ngram=0 to measure Edit Distance rather than the default Jaccard Distance
    return recommend(entries, ngram=0)
    
answer_eleven()

['corpulent', 'intendence', 'validate']

---
# Assignment 3 - Text Classifications

In this assignment, I explored text message data and built models to predict whether a message is spam. The models included Logistic Regression, Naive Bayes, and a Support Vector Machine. I also added text features, such as the length of the document and the number and type of characters in a text message.

In [19]:
import pandas as pd
import numpy as np

spam_data = pd.read_csv('spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


In [20]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

### Question 1 - What percentage of the documents in `spam_data` are spam?

In [21]:
def answer_one():
    ratio = spam_data['target'].mean()
    return ratio * 100

answer_one()

13.406317300789663

**Fit the training data `X_train` using a Count Vectorizer with default parameters.**

### Question 2 - What is the longest token in the vocabulary?

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

def answer_two():
    
    ## A) Count Vectorizer: Convert training text into counts per word, a.k.a., "tokens"
    count_vect = CountVectorizer().fit(X_train)
    
    ## B) Find longest token
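    # Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
    tokens = count_vect.get_feature_names()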
    longest = sorted(tokens, reverse=True, key=len)
    
    return longest[0]

answer_two()

'com1win150ppmx3age16subscription'

**Fit and transform the training data `X_train` using a Count Vectorizer with default parameters.** 

**Next, fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1`.**

### Question 3 - What is the area under the curve (AUC) score using the transformed test data?

In [23]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

def answer_three():
    
    # Step 1) Count Vectorizer: Fit and Transform text into matrix-form counts per word, a.k.a., "tokens"
    count_vect = CountVectorizer().fit(X_train)
    X_train_vect = count_vect.transform(X_train)
        
    # Step 2) Multinomial Naive Bayes:
    model = MultinomialNB(alpha=0.1)
    model.fit(X_train_vect, y_train)
    
    # Step 3) Predict spam/not-spam. Obtain AUC Score by comparing predictions to test set
    predict = model.predict(count_vect.transform(X_test))
    auc_score = roc_auc_score(y_test, predict)
    
    return auc_score

answer_three()

0.97208121827411165
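
The AUC functions in this assignment pass the hard 0/1 labels from `model.predict` to `roc_auc_score`. With hard labels the ROC curve has a single operating point, so the area reduces to the mean of the true positive rate and the true negative rate. The sketch below checks that arithmetic on made-up labels; `auc_from_hard_labels` is a hypothetical helper, not a scikit-learn function. Scoring with predicted probabilities instead would generally give a different value.

```python
def auc_from_hard_labels(y_true, y_pred):
    """AUC of a ROC curve with one operating point: (TPR + TNR) / 2."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tpr = tp / positives   # share of spam correctly flagged
    tnr = tn / negatives   # share of non-spam correctly passed
    return (tpr + tnr) / 2

# Made-up labels: 4 spam and 6 non-spam messages
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
auc = auc_from_hard_labels(y_true, y_pred)
print(round(auc, 4))  # 0.7917
```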

**Fit and transform the training data `X_train` using a Tfidf Vectorizer with default parameters.**

### Question 4 - What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?

**Put these features in two series where each series is sorted by tf-idf value and then alphabetically by feature name. The index of the series should be the feature name, and the data should be the tf-idf.**

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use for Question 4:
## Takes a fitted vectorizer, the transformed training data, the number of features to return (n), and any extra feature names
def coeff(vect, train, n, added):
    
    # A) Find features:
    features = np.array(vect.get_feature_names() + added)
    
    # B) Sorted features and indices:
    maxed = train.max(0).toarray()[0]
    sorted_index = maxed.argsort()
    sorted_features = maxed[sorted_index]
    
    # C) Find n-smallest and n-largest coefficients:
    small_coeff = pd.Series(sorted_features[:n], index=features[sorted_index[:n]])
    large_coeff = pd.Series(sorted_features[-n:][::-1], index=features[sorted_index[-n:][::-1]])
    
    return small_coeff, large_coeff


# Find the 20 smallest and largest coefficients with Tfidf Vectorizer
def answer_four():
    
    tf_vect = TfidfVectorizer().fit(X_train)
    X_train_vect = tf_vect.transform(X_train)
    
    smallest, largest = coeff(tf_vect, X_train_vect, n=20, added=[])
    
    return (smallest, largest)

answer_four()

(sympathetic     0.074475
 healer          0.074475
 aaniye          0.074475
 dependable      0.074475
 companion       0.074475
 listener        0.074475
 athletic        0.074475
 exterminator    0.074475
 psychiatrist    0.074475
 pest            0.074475
 determined      0.074475
 chef            0.074475
 courageous      0.074475
 stylist         0.074475
 psychologist    0.074475
 organizer       0.074475
 pudunga         0.074475
 venaam          0.074475
 diwali          0.091250
 mornings        0.091250
 dtype: float64, 146tf150p    1.000000
 havent       1.000000
 home         1.000000
 okie         1.000000
 thanx        1.000000
 er           1.000000
 anything     1.000000
 lei          1.000000
 nite         1.000000
 yup          1.000000
 thank        1.000000
 ok           1.000000
 where        1.000000
 beerage      1.000000
 anytime      1.000000
 too          1.000000
 done         1.000000
 645          1.000000
 tick         0.980166
 blank        0.932702
 dtype: float64)
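
Many of the largest values above are exactly 1.0. A plausible explanation, assuming the vectorizer's default settings (smoothed idf and L2 normalization of each row), is that a message containing a single distinct token normalizes to a weight of 1.0 for that token. The pure-Python sketch below reproduces this on a made-up corpus. `tfidf_rows` is a hypothetical helper, and its whitespace tokenization is a simplification of the library's token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed, L2-normalized tf-idf per document (non-empty documents only)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

docs = ['ok', 'see you at home', 'ok see you later']  # toy corpus
rows = tfidf_rows(docs)
print(rows[0])          # the single-token message gets a weight of about 1.0
print(max(rows[1].values()) < 1.0)  # True: weights are shared in longer messages
```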

**Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.**

**Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1`**

### Question 5 - After transforming the test data (described above), compute the area under the curve (AUC) score.

In [25]:
def answer_five():
    
    # Step 1) TFIDF: Fit and Transform text 
    vect = TfidfVectorizer(min_df=3).fit(X_train)
    X_train_vect = vect.transform(X_train)
    
    # Step 2) Multinomial Naive Bayes:
    model = MultinomialNB(alpha=0.1)
    model.fit(X_train_vect, y_train)
    
    # Step 3) Predict spam/not-spam. Obtain AUC Score by comparing predictions to test set
    predict = model.predict(vect.transform(X_test))
    auc_score = roc_auc_score(y_test, predict)
    
    return auc_score

answer_five()

0.94162436548223349

### Question 6 - What is the average length of documents (number of characters) for not spam and spam documents?

*Note: Since many questions involve finding the number of characters, I created a 'num_char' function to avoid repeated code.* 

In [26]:
# Use for Questions 6-7, 9 and 11:
## Number of characters/Length of document (e.g., for non-spam and spam text, training and testing data)
def num_char(a, b):
    char_a = [len(w) for w in a]
    char_b = [len(w) for w in b]
    return char_a, char_b

def answer_six():
    
    # Use to find characters for non-spam and spam:
    target = spam_data['target']
    not_spam = spam_data.loc[target==0, 'text']
    spam = spam_data.loc[target==1, 'text']
    
    # Number of characters:
    not_spam, spam = num_char(not_spam, spam)
    
    return (np.average(not_spam), np.average(spam))

answer_six()

(71.023626943005183, 138.8661311914324)

<br>
<br>
The following function has been provided to help combine new features into the training data:

In [27]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

**Fit and transform the training data X_train using a Tfidf Vectorizer, ignoring terms that have a document frequency strictly lower than **5**.**

**Using this document-term matrix and an additional feature, **the length of document (number of characters)**, fit a Support Vector Classification model with regularization `C=10000`.**

### Question 7 - After transforming the test data (described above), compute the area under the curve (AUC) score.

In [28]:
from sklearn.svm import SVC

def answer_seven():
    
    # Step 1) Fit and transform data
    vect = TfidfVectorizer(min_df=5).fit(X_train)
    X_train_vect = vect.transform(X_train)
    ## Use transformed X_test later for model predictions:
    X_test_vect = vect.transform(X_test)
    
    # Step 2) Add feature to model: Number of characters (a.k.a. length of document)
    ## Use X_train for model fitting and X_test later for model predictions
    char_train, char_test = num_char(X_train, X_test)

    ## Added feature: number of characters in document
    X_train_added = add_feature(X_train_vect, char_train)
    X_test_added = add_feature(X_test_vect, char_test)
    
    # Step 3) Fit model: Support Vector Classification
    model = SVC(C=10000).fit(X_train_added, y_train)
    
    # Step 4) Model predictions & Evaluation: AUC score
    predict = model.predict(X_test_added)
    auc_score = roc_auc_score(y_test, predict)
    
    return auc_score

answer_seven()

0.95813668234215565

### Question 8 - What is the average number of digits per document for not spam and spam documents?

*Note: Since many questions involve finding the number of digits, I created a 'num_digits' function to avoid repeated code.*

In [29]:
# Use function for questions 8-9, and 11:
## Number of digits (e.g., for not-spam and spam documents, training and testing data):
def num_digits(a, b):
    num_a = [sum(x.isnumeric() for x in d) for d in a]
    num_b = [sum(x.isnumeric() for x in d) for d in b]
    
    return num_a, num_b


def answer_eight():
    # Use to find digits for non-spam and spam text:
    target = spam_data['target']
    not_spam = spam_data.loc[target==0, 'text']
    spam = spam_data.loc[target==1, 'text']
    
    # Number of digits:
    not_spam, spam = num_digits(not_spam, spam)
    
    return (np.average(not_spam), np.average(spam))

answer_eight()

(0.29927461139896372, 15.76037483266399)

**Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **word n-grams from n=1 to n=3** (unigrams, bigrams, and trigrams).**

**Using this document-term matrix, fit a Logistic Regression model with regularization `C=100`, using the following additional features:**
* the length of document (number of characters)
* **number of digits per document**

### Question 9 - After transforming the test data (described above), compute the area under the curve (AUC) score.

In [30]:
from sklearn.linear_model import LogisticRegression

def answer_nine():
    
    # Step 1) TFidf Vectorizer
    vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
    X_train_vect = vect.transform(X_train)
    X_test_vect = vect.transform(X_test)
    
    # Step 2) Update model, upon finding the number of...
    ## Characters (length of document):
    char_train, char_test = num_char(X_train, X_test)
    ## Digits per document:
    num_train, num_test = num_digits(X_train, X_test)
    
    ## Update model with added features
    X_train_added = add_feature(X_train_vect, [char_train, num_train])
    X_test_added = add_feature(X_test_vect, [char_test, num_test])
    
    # Step 3) Logistic regression model
    model = LogisticRegression(C=100).fit(X_train_added, y_train)
    
    # Step 4) Evaluation: AUC score
    predict = model.predict(X_test_added)
    auc_score = roc_auc_score(y_test, predict)
    
    return auc_score

answer_nine()

0.96533283533945646

### Question 10 - What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?

*Note: Since many questions involve finding the number of non-word characters, I created a 'num_non_words' function to avoid repeated code.*

In [31]:
# Use for Questions 10-11:
## Number of non-word characters (e.g., for non-spam and spam documents)
def num_non_words(a, b):
    char_a = a.str.count(r'\W')
    char_b = b.str.count(r'\W')
    
    return char_a, char_b

def answer_ten():
    
    # Use to find non-word characters for non-spam and spam text:
    target = spam_data['target']
    not_spam = spam_data.loc[target==0, 'text']
    spam = spam_data.loc[target==1, 'text']
    
    # Number of non-word characters:
    not_spam, spam = num_non_words(not_spam, spam)
    
    return (np.average(not_spam), np.average(spam))
    
answer_ten()

(17.291813471502589, 29.041499330655956)

For the finale...

**Fit and transform the training data `X_train` using a Count Vectorizer, ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5.**

To tell Count Vectorizer to use character n-grams, **pass in `analyzer='char_wb'`.** This creates character n-grams only from text inside word boundaries. This will also make the model more robust to spelling mistakes.
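
As a rough illustration of what "inside word boundaries" means, the sketch below pads each word with a space on both sides and slides a window over it, so no n-gram spans two words. `char_wb_ngrams` is a hypothetical helper that approximates the analyzer's behavior; the library handles some edge cases, such as very short words, slightly differently.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Approximate character n-grams taken within word boundaries."""
    grams = []
    for word in text.lower().split():
        padded = f' {word} '  # a space marks each word boundary
        for n in range(min_n, max_n + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

grams = char_wb_ngrams('Go now', 2, 3)
print(grams)
# [' g', 'go', 'o ', ' go', 'go ', ' n', 'no', 'ow', 'w ', ' no', 'now', 'ow ']
```

Features with a leading or trailing space, such as `' go'`, therefore mark the start or end of a word.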

**Using this document-term matrix, fit a Logistic Regression model with regularization `C=100`, using the following additional features:**
* the length of document (number of characters)
* number of digits per document
* **number of non-word characters (anything other than a letter, digit or underscore.)**
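
The three hand-built features listed above can be sketched for a single message with the standard library. `extra_features` is a hypothetical helper and the sample messages are made up; the notebook computes the same quantities over whole Series with `num_char`, `num_digits`, and `num_non_words`.

```python
import re

def extra_features(message):
    """Return (length, digit count, non-word character count) for one message."""
    length = len(message)
    digits = sum(ch.isnumeric() for ch in message)
    # \W matches anything other than a letter, digit, or underscore
    non_word = len(re.findall(r'\W', message))
    return length, digits, non_word

print(extra_features('WINNER!! Call 0800 now'))  # (22, 4, 5)
print(extra_features('Ok lar'))                  # (6, 0, 1)
```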


### Question 11 - After transforming the test data (described above), compute the area under the curve (AUC) score.  Also, **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple.

*Note: The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.*

In [32]:
# Resembles Question 9, but combines several exercises together...

def answer_eleven():
    
    # Step 1) Count Vectorizer: Build a matrix of counts per character n-gram
    ## Use character n-grams from text inside word boundaries ('char_wb') 
    count_vect = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb').fit(X_train)
    X_train_vect = count_vect.transform(X_train)
    ## Use for model evaluation:
    X_test_vect = count_vect.transform(X_test)
    
    
    # Step 2) Find the number of...
    ## Characters (length of document):
    char_train, char_test = num_char(X_train, X_test)
    ## Digits per document:
    num_train, num_test = num_digits(X_train, X_test)
    ## Non-word characters:
    non_train, non_test = num_non_words(X_train, X_test)
    
    ## Update document-matrix with added training and testing features
    X_train_added = add_feature(X_train_vect, [char_train, num_train, non_train])
    X_test_added = add_feature(X_test_vect, [char_test, num_test, non_test])
    
    
    # Step 3) Logistic regression model
    model = LogisticRegression(C=100).fit(X_train_added, y_train)
    
    # Step 4) Evaluation: AUC score
    predict = model.predict(X_test_added)
    auc_score = roc_auc_score(y_test, predict)
    
    # Step 5) Find 10 smallest and largest coefficients:
    features = np.array(count_vect.get_feature_names() + ['length_of_doc', 'digit_count', 'non_word_char_count'])
    sorted_index = model.coef_[0].argsort()
    smallest = list(features[sorted_index[:10]])
    largest = list(features[sorted_index[:-11:-1]])
    
    return (auc_score, smallest, largest)

answer_eleven()

(0.97885931107074342,
 ['. ', '..', '? ', ' i', ' y', ' go', ':)', ' h', 'go', ' m'],
 ['digit_count', 'ne', 'ia', 'co', 'xt', ' ch', 'mob', ' x', 'ww', 'ar'])