# 4.4.5 Challenge: Build your own NLP model
__Instructions__
For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.


In [1]:
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

# NLP 
import spacy
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

### Corpus Processing - PLOS Abstracts
The corpus for this challenge comes from PLOS, a non-profit publisher of scientific articles. I pulled abstracts from the PLOS API using three title searches: cancer, HIV, and heart disease.

In [2]:
response1 = requests.get(r'http://api.plos.org/search?q=title:"cancer"&fl=abstract&wt=json&api_key=51bQhx63o6--UjRBhkHb')
response2 = requests.get(r'http://api.plos.org/search?q=title:"HIV"&fl=abstract&wt=json&api_key=51bQhx63o6--UjRBhkHb')
response3 = requests.get(r'http://api.plos.org/search?q=title:"heart disease"&fl=abstract&wt=json&api_key=51bQhx63o6--UjRBhkHb')
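The three requests differ only in the title term, so the query string can be assembled from parameters instead of being written out by hand. Below is a minimal sketch using only the standard library. `build_plos_url` and the `YOUR_API_KEY` placeholder are hypothetical names introduced here; the optional `start` and `rows` parameters mirror the ones used in later request URLs in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'http://api.plos.org/search'

def build_plos_url(title_term, api_key, start=None, rows=None):
    """Hypothetical helper: build a PLOS search URL for one title term."""
    params = {
        'q': 'title:"{}"'.format(title_term),  # search article titles only
        'fl': 'abstract',                      # return just the abstract field
        'wt': 'json',                          # ask for a JSON response
        'api_key': api_key,
    }
    # Paging parameters are added only when the caller supplies them
    if start is not None:
        params['start'] = start
    if rows is not None:
        params['rows'] = rows
    return BASE_URL + '?' + urlencode(params)

# One URL per topic; pass each to requests.get() to fetch the results
urls = {term: build_plos_url(term, 'YOUR_API_KEY')
        for term in ['cancer', 'HIV', 'heart disease']}
for term, url in urls.items():
    print(term, '->', url)
```

`urlencode` also escapes the quotes and the space in "heart disease", which the hand-written URLs leave to `requests`.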

Parsing the JSON responses into Python dictionaries.

In [3]:
cancer_raw = response1.json()
hiv_raw = response2.json()
heart_raw = response3.json()

Pulling out just the abstracts of each article into one string. 

In [4]:
cancer = ''
for article in cancer_raw['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    cancer = cancer + art

cancer[0:200]

"'\\nSocietal perceptions may factor into the high rates of nontreatment in patients with lung cancer. To determine whether bias exists toward lung cancer, a study using the Implicit Association Test me"

In [5]:
hiv = ''
for article in hiv_raw['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    hiv = hiv + art

hiv[0:200]

"'Background: Whether spontaneous low levels of HIV-1 RNA in blood plasma correlate with low levels of HIV-1 RNA in seminal plasma has never been investigated in HIV controller (HIC) men so far. Method"

In [6]:
heart = ''
for article in heart_raw['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    heart = heart + art

heart[0:200]

"'\\nThere are 16.5 million newborns in China annually. However, the incidence of congenital heart disease (CHD) has not been evaluated. In 2004, we launched an active province-wide hospital-based CHD r"

### Text Cleaning
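Replacing `--` (which our NLP models do not process) with a space, and removing stray quotation marks, backslashes, and all digits. Note that stripping the backslash from a literal `\n` sequence leaves a stray `n` at the start of some abstracts, as the output below shows.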

In [7]:
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'[\']', '', text)
    text = re.sub(r'[\\]', '', text)
    text = re.sub(r'\d', '', text)
    return text

In [8]:
cancer_clean = text_cleaner(cancer)
hiv_clean = text_cleaner(hiv)
heart_clean = text_cleaner(heart)

In [9]:
heart_clean[0:200]

'nThere are . million newborns in China annually. However, the incidence of congenital heart disease (CHD) has not been evaluated. In , we launched an active province-wide hospital-based CHD registry i'
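The stray `n` in `nThere` comes from the literal backslash-plus-`n` sequence produced by `str()`: removing only the backslash leaves the `n` behind. One possible fix, sketched here as a hypothetical `text_cleaner_v2` rather than the cleaner actually used in this notebook, is to replace that sequence with a space before stripping backslashes.

```python
import re

def text_cleaner_v2(text):
    """Hypothetical variant of text_cleaner that handles literal '\\n' sequences."""
    text = text.replace('\\n', ' ')       # literal backslash + n left by str(list)
    text = re.sub(r'--', ' ', text)       # double dashes become a space
    text = re.sub(r"['\\]", '', text)     # drop stray quotes and backslashes
    text = re.sub(r'\d', '', text)        # drop all digits
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

# Sample shaped like the raw heart-disease string shown earlier
sample = "'\\nThere are 16.5 million newborns in China annually."
cleaned = text_cleaner_v2(sample)
print(cleaned)
```

Digits are still removed, so the sentence keeps the orphaned "." from "16.5", just as in the original cleaner.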

### Language Parsing with Spacy

In [10]:
nlp = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'
cancer_doc = nlp(cancer_clean)
hiv_doc = nlp(hiv_clean)
heart_doc = nlp(heart_clean)

Splitting each topic into individual sentences for processing.

In [11]:
cancer_sents = [[sent, 'Cancer'] for sent in cancer_doc.sents]
hiv_sents = [[sent, 'HIV'] for sent in hiv_doc.sents]
heart_sents = [[sent, 'Heart'] for sent in heart_doc.sents]

sentences = pd.DataFrame(cancer_sents + hiv_sents + heart_sents)
sentences.head()
print(len(sentences))

370


### Creating BOW Features
Defining functions to identify the most common words and then create features from those words in the text.

In [12]:
def bag_of_words(text):
    allwords = [token.lemma_
               for token in text
               if not token.is_punct
               and not token.is_stop]
    return [item[0] for item in Counter(allwords).most_common(2000)]

def bow_features(sentences, common_words):
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:,common_words] = 0
    
    for i, sentence in enumerate(df['text_sentence']):
        words = [token.lemma_ 
                for token in sentence
                if (
                    not token.is_punct
                    and not token.is_stop
                    and token.lemma_ in common_words
                )]
        for word in words:
            df.loc[i, word] += 1
        if i%100 == 0:
            print('Processing row {}'.format(i))
    return df
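To make the two steps concrete, here is a toy version in plain Python. The lemma lists are made-up stand-ins for spaCy output with punctuation and stop words already removed, and the cap of 2 words stands in for the 2000 used above.

```python
from collections import Counter

# Toy lemmatized sentences (hypothetical data)
toy_sentences = [
    ['lung', 'cancer', 'patient', 'cancer'],
    ['hiv', 'cancer', 'plasma'],
    ['heart', 'disease', 'patient'],
]

# Step 1, as in bag_of_words: the most common words across the corpus
all_words = [word for sent in toy_sentences for word in sent]
vocab = [word for word, _ in Counter(all_words).most_common(2)]

# Step 2, as in bow_features: one row of counts per sentence,
# one column per vocabulary word
rows = [[Counter(sent)[word] for word in vocab] for sent in toy_sentences]

print(vocab)
for row in rows:
    print(row)
```

Building all rows first and creating the DataFrame once, for example with `pd.DataFrame(rows, columns=vocab)`, is typically much faster than incrementing individual cells with `df.loc`.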

Finding common words from three searches.

In [13]:
cancerwords = bag_of_words(cancer_doc)
hivwords = bag_of_words(hiv_doc)
heartwords = bag_of_words(heart_doc)

common_words = set(cancerwords + hivwords + heartwords)

Creating features from all searches from common words.

In [14]:
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 100
Processing row 200
Processing row 300


| | substantially | transmission | therapeutics.nauthor | incorporate | disease | cardioprotectors/ | estimate | sperm | arr | multiple | ... | comparable | concomitant | underlie | comparison | subtype | like | child | soluble | text_sentence | text_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (nSocietal, perceptions, may, factor, into, th... | Cancer |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (To, determine, whether, bias, exists, toward,... | Cancer |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Participants, were, primarily, recruited, fro... | Cancer |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Explicit, attitudes, regarding, lung, and, br... | Cancer |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Participants’, responses, to, descriptive, an... | Cancer |


### Creating tf-idf features
Second NLP method: converting sentences into numeric vectors.

In [15]:
vectorizer = TfidfVectorizer(encoding='ASCII',
                             max_df=0.5, # drop words that occur in more than half the sentences
                             min_df=2, # only use words that appear in at least two sentences
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case
                             use_idf=True, # weight terms by inverse document frequency
                             norm=u'l2', # normalize each vector so longer and shorter sentences are treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

#Applying the vectorizer
sent_tfidf=vectorizer.fit_transform(cancer_clean)
print("Number of features: %d" % sent_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(sent_tfidf, test_size=0.4, random_state=0)


#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]
#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]
#List of features
terms = vectorizer.get_feature_names()
#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

#Only nonzero tf-idf scores are stored, so each dictionary holds just the terms that occur in that sentence.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

ValueError: Iterable over raw text documents expected, string object received.

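This cell keeps throwing errors. If I pass the cleaned text, `fit_transform` raises the error above, because it expects an iterable of documents rather than a single string. If I pass the spaCy sentences, it complains that the tokens have no attribute `lower`, because the vectorizer tries to lowercase each document as if it were a plain string. I've tried to make this work, without success.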
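One way around both errors, sketched under the assumption that each entry of `sentences[0]` is a spaCy span exposing a `.text` attribute, is to hand the vectorizer a list of plain strings (for example `[sent.text for sent in sentences[0]]`). The sentences and labels below are made-up stand-ins so the example runs on its own; they are not the PLOS data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in sentences, one plain string per document
docs = [
    "Lung cancer patients often face bias.",
    "Breast cancer screening reduces mortality.",
    "HIV RNA levels were measured in plasma.",
    "HIV controllers keep low RNA levels.",
    "Congenital heart disease incidence was evaluated.",
    "Heart disease registry covers newborns.",
]
labels = ['Cancer', 'Cancer', 'HIV', 'HIV', 'Heart', 'Heart']

vectorizer = TfidfVectorizer(max_df=0.5,           # drop terms in more than half the sentences
                             min_df=2,             # keep terms found in at least two sentences
                             stop_words='english')

# fit_transform receives a list of strings, not one long string
sent_tfidf = vectorizer.fit_transform(docs)
print("Number of features: %d" % sent_tfidf.shape[1])
print(sorted(vectorizer.vocabulary_))

# Split features and labels together so they stay aligned
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    sent_tfidf, labels, test_size=0.4, random_state=0)
```

With the real data, the same matrix could be passed to the classifiers used for the BoW features, which would allow the two feature sets to be compared.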

### Supervised Learning Models to Predict Outcomes
I will use the BOW features to predict outcomes with a Random Forest Classifier, Logistic Regression, and a Gradient Boosting Classifier to see which best predicts the topic of each sentence.

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rfc = RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9684684684684685

Test set score: 0.7432432432432432


In [17]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(222, 1499) (222,)
Training set score: 0.963963963963964

Test set score: 0.8243243243243243


In [18]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9504504504504504

Test set score: 0.7972972972972973


All three models scored about the same on the training set (0.95 to 0.97), well above their test scores, which suggests overfitting. Logistic regression had the highest test score (0.82, compared with 0.80 for gradient boosting and 0.74 for random forest), so I will try to optimize that model.
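The scores above come from a single train/test split, while the challenge asks for cross-validation. A minimal sketch of that step with scikit-learn's `cross_val_score` follows; the feature matrix is synthetic (random counts with a per-class boost) so the example runs on its own, and `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the word-count matrix: 90 sentences, 20 word columns
X_demo = rng.poisson(0.3, size=(90, 20))
y_demo = np.repeat(['Cancer', 'HIV', 'Heart'], 30)

# Give each class its own block of more frequent words so there is signal
for k in range(3):
    X_demo[k * 30:(k + 1) * 30, k * 5:(k + 1) * 5] += rng.poisson(1.0, size=(30, 5))

# 5-fold cross-validation; folds are stratified by class for classifiers
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)
print('Fold accuracies:', scores)
print('Mean: {:.3f}, std: {:.3f}'.format(scores.mean(), scores.std()))
```

In this notebook the equivalent call would be along the lines of `cross_val_score(lr, X, Y, cv=5)`, repeated for each classifier so that mean fold accuracies can be compared.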

#### Optimizing Logistic Regression Model
Starting from the default logistic regression model, which had the highest accuracy on the test set, I will attempt to increase its accuracy.

First, I noticed that there were only around 1500 common words, even though the function allows up to 2000 per topic. I will try reducing the number of common words used by the function to reduce overfitting.
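The gap between the 2000-word cap and the roughly 1500 columns has two likely sources: `most_common(n)` returns fewer than `n` items when a topic has fewer distinct lemmas, and the set union drops words shared between topics. A toy illustration (the token lists and the `top_words` helper are made up for this example):

```python
from collections import Counter

def top_words(tokens, n):
    """Return up to n most frequent tokens (fewer if there are not n distinct ones)."""
    return [word for word, _ in Counter(tokens).most_common(n)]

# Hypothetical lemma lists for two topics
topic_a = ['cancer', 'patient', 'study', 'cancer', 'tumor']
topic_b = ['hiv', 'patient', 'study', 'rna']

cap = 10  # plays the role of the 2000-word cap
words_a = top_words(topic_a, cap)
words_b = top_words(topic_b, cap)

# Union across topics: shared words ('patient', 'study') are counted once
vocab = set(words_a + words_b)

print(len(words_a), len(words_b), len(vocab))
```

Each topic contributes only 4 words despite the cap of 10, and the union holds 6 rather than 8 because two words overlap.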

In [19]:
def bag_of_words2(text):
    allwords = [token.lemma_
               for token in text
               if not token.is_punct
               and not token.is_stop]
    return [item[0] for item in Counter(allwords).most_common(1000)]


In [20]:
response1b = requests.get(r'http://api.plos.org/search?q=title:"cancer"&fl=abstract&wt=json&api_key=51bQhx63o6--UjRBhkHb&start=10&rows=20')
response2b = requests.get(r'http://api.plos.org/search?q=title:"HIV"&fl=abstract&wt=json&api_key=51bQhx63o6--UjRBhkHb&start=10&rows=20')
response3b = requests.get(r'http://api.plos.org/search?q=title:"heart disease"&fl=abstract&wt=json&api_key=51bQhx63o6--UjRBhkHb&start=10&rows=20')

Parsing the JSON responses into Python dictionaries.

In [21]:
cancer_raw2 = response1b.json()
hiv_raw2 = response2b.json()
heart_raw2 = response3b.json()

Pulling out just the abstracts of each article into one string. 

In [22]:
for article in cancer_raw2['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    cancer = cancer + art

cancer[0:200]

"'\\nSocietal perceptions may factor into the high rates of nontreatment in patients with lung cancer. To determine whether bias exists toward lung cancer, a study using the Implicit Association Test me"

In [23]:
for article in hiv_raw2['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    hiv = hiv + art

hiv[0:200]

"'Background: Whether spontaneous low levels of HIV-1 RNA in blood plasma correlate with low levels of HIV-1 RNA in seminal plasma has never been investigated in HIV controller (HIC) men so far. Method"

In [24]:
for article in heart_raw2['response']['docs']:
    art = re.sub(r'[\[\]]','', str(article['abstract']))
    heart = heart + art

heart[0:200]

"'\\nThere are 16.5 million newborns in China annually. However, the incidence of congenital heart disease (CHD) has not been evaluated. In 2004, we launched an active province-wide hospital-based CHD r"

In [25]:
cancer_clean2 = text_cleaner(cancer)
hiv_clean2 = text_cleaner(hiv)
heart_clean2 = text_cleaner(heart)

In [26]:
cancer_doc2 = nlp(cancer_clean2)
hiv_doc2 = nlp(hiv_clean2)
heart_doc2 = nlp(heart_clean2)

Splitting each topic into individual sentences for processing.

In [27]:
cancer_sents2 = [[sent, 'Cancer'] for sent in cancer_doc2.sents]
hiv_sents2 = [[sent, 'HIV'] for sent in hiv_doc2.sents]
heart_sents2 = [[sent, 'Heart'] for sent in heart_doc2.sents]

sentences2 = pd.DataFrame(cancer_sents2 + hiv_sents2 + heart_sents2)
sentences2.head()
print(len(sentences2))

1151


In [28]:
cancerwords2 = bag_of_words2(cancer_doc2)
hivwords2 = bag_of_words2(hiv_doc2)
heartwords2 = bag_of_words2(heart_doc2)

common_words2 = set(cancerwords2 + heartwords2 + hivwords2)
len(common_words2)

2120

In [None]:
word_counts2 = bow_features(sentences2, common_words2)
word_counts2.head()

Processing row 0
Processing row 100
Processing row 200


In [None]:
word_counts2['sent_length'] = word_counts2.text_sentence.map(len)  # number of tokens in each spaCy sentence

In [None]:
Y2 = word_counts2['text_source']
X2 = np.array(word_counts2.drop(['text_sentence','text_source'], axis=1))

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, 
                                                    Y2,
                                                    test_size=0.4,
                                                    random_state=0)

In [None]:
lr2 = LogisticRegression()
train = lr2.fit(X_train2, y_train2)
print(X_train2.shape, y_train2.shape)
print('Training set score:', lr2.score(X_train2, y_train2))
print('\nTest set score:', lr2.score(X_test2, y_test2))

In [None]:
type(sentences[0])