# Basic Confounder Demo
---

**How Confounder is Used:** Below is a workflow describing what Confounder needs in order to operate and how it's used in practice. It's strongly recommended that you check out [the README for this project](https://github.com/analyticascent/confounder/blob/master/README.md) first for context.

Now for the workflow...

* You start with a set of sample documents (news, studies, blog posts, etc.) related to a given topic.
* Some contain certain pieces of information while others don't mention them at all (binary classification). 
* Confounder uses this set as training data to "learn" to tell the two apart (supervised machine learning).
* You repeat this process for any given piece of information you want to check for the presence of.
* With the right training data and settings, it can categorize new text passages by what they contain.

&nbsp;

**This can be thought of as a much more scalable alternative to constantly using** `Ctrl + F` **and trying to come up with as many synonyms as possible that might indicate whether a document contains specific information or not.**

Confounder's focus is particularly on detecting information that pertains to the following:

![Methodological Criteria for Accurate Research](https://raw.githubusercontent.com/analyticascent/confounder/master/images/research_methodology.png)

**Side Note:** A *"confounder"* in statistics is a [third variable](https://explorable.com/confounding-variables) that can distort what the true relationship is between the original [independent (X) and responding (Y) variables](https://explorable.com/research-variables). Confounders are also sometimes called *extraneous* variables.

**Bear in mind that there is no such thing as true "fact-checking" software (in part due to [second-order questions](https://www.bloomberg.com/view/articles/2016-12-23/fact-checking-s-infinite-regress-problem)) and thus Confounder is only meant to check if something was even _mentioned_ in a text sample.**

**I can't emphasize this enough:** _Checking for omissions is not the same thing as trying to rank how "true" or "false" a passage of text is._

---

### What This Demo Notebook Illustrates

This demo version of [Confounder](https://github.com/analyticascent/confounder) can be repurposed for whatever subject matter or binary text classification task you choose. Doing so only requires that you read in a different CSV file with sample text and labels for what the text contains. 

Note that each of the criteria you have listed has to be checked independently from others. This will not only allow you to know which specific parts of the criteria are met by a text document, it will also boost the accuracy of the overall results by reducing how much [dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) the classifier must parse through to get the end results.

**Demo Subject Matter:** The demo notebook you're reading right now is focused on a specific statistical claim, and a specific trade-off that often gets overlooked when that claim is made:

&nbsp;

![college earnings graph](https://raw.githubusercontent.com/analyticascent/confounder/master/images/college_earnings.gif)

* **Claim:** Obtaining a four-year degree will enable you to earn a million dollars more over your lifetime
* **Trade-off:** Tuition cost (including interest on student loans) can often cancel out most of those earnings.

&nbsp;

That trade-off is by no means the only issue with the statistic, but for now we will look at training a text classifier to pick up on whether it was mentioned or not. The same process would be repeated for each of the other issues listed below, though this demo does not cover them:

* The sunk cost of not working as much (or at all) before obtaining the degree.
* Super-earner outliers skewing the average for degree recipients (use the median!).
* Different degrees lead to very different lifetime earnings ([new data on this incoming](https://www.insidehighered.com/views/2019/03/26/president-trumps-embrace-program-level-earnings-data-game-changing-opinion)).
* Jobs for high-earning majors tend to be in more expensive cities ([San Francisco anyone?](https://www.youtube.com/watch?v=ExgxwKnH8y4)).
* Students that complete college may already be prone to succeed (so they [may not benefit much](https://www.theatlantic.com/magazine/archive/2018/01/whats-college-good-for/546590/)).


**With all this in mind, let's take a look at the Python code that's involved.**

---
## Step 1: Importing Needed Libraries

You will need `pandas` to read in rows and columns (the raw article text, plus a column for each of the criteria of interest). A **0** in a column indicates a piece of information is absent, while a **1** indicates it was mentioned.

`NumPy` adds functionality that you will depend on throughout the notebook. Very specific tools are also imported from `scikit-learn`. Additionally, a few natural language processing tools are imported which may be used to boost model accuracy (with iterative trial and error).

Below, we will import those:

In [1]:
# These are the "libraries" you need to import to run the scripts that follow. It may take several seconds for some to load.

import pandas as pd # allows you to read in rows/columns of data
import numpy as np # allows you to work with vectors/matrices

from sklearn.model_selection import train_test_split  # train_test_split will allow you to check accuracy on existing data
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # these structure text into word vectors

# CountVectorizer turns raw text into word frequency counts, aka "bag of words." Very simple way of structuring text
# TfidfVectorizer is a little more complex. "Term Frequency-Inverse Document Frequency" will find relative importance of terms

from sklearn.naive_bayes import MultinomialNB  # multinomial naive bayes will classify text samples from the vectorizer results
from sklearn import metrics  # metrics will be used to evaluate the accuracy of the model when you run train_test_split

---
## Step 2: Reading in the CSV File Containing the Training Corpus

As previously mentioned, you will need some labeled text passages in CSV form to train the classifier. **This is how *supervised machine learning* is supposed to work - the columns with "labels" of what criteria the sample text failed or met are the categories we are aiming to classify articles under.** 

The classifier needs to have a sample of various articles to compare future articles against. It is strongly advised that you spend time gathering a varied and large sample of articles that meet or fail as wide a variety of criteria as possible, otherwise Confounder may not be able to predict new samples accurately.

In [2]:
# Read sampletext.csv into a DataFrame. Any CSV with a column of raw article text and 0/1 label columns should work.
# If you're offline, replace the link with the file location for sampletext.csv if you have it stored locally.

url = 'https://github.com/analyticascent/confounder/raw/master/data/sampletext.csv'
data = pd.read_csv(url, index_col=0, encoding = 'utf8')

In [3]:
# IGNORE THIS CELL UNLESS YOU NEED TO SET VARIABLES FOR A DIFFERENT CSV FILE

# data = pd.DataFrame({"variable": [1,0,1,0,1,1,0,1,1,1,0,0,0,1,0]})

# data['variable'] = [1,0,1,0,1,1,0,1,1,1,0,0,0,1,0]  # one label per row, so 15 values for 15 articles

The cells below will be used to verify if the CSV file has loaded properly. You need to have a CSV file with properly labeled columns and rows containing the raw article text (first column), and then a **0** or a **1** within each criteria column.
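If you are swapping in your own CSV, a quick sanity check of that layout can save some confusion later. The sketch below uses only the standard library; the three sample rows and the `find_bad_labels` helper are made up for illustration and are not part of Confounder.

```python
import csv
import io

# Hypothetical three-row sample in the same layout as sampletext.csv:
# an index column, the raw text, then one 0/1 column per criterion.
SAMPLE_CSV = """,raw_text,variable
0,"Tuition and loan interest can offset the earnings premium.",1
1,"College graduates earn more over a lifetime.",0
2,"Student debt eats into the degree payoff.",1
"""

def find_bad_labels(csv_text, label_column):
    """Return the row numbers whose label is not exactly '0' or '1'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [i for i, row in enumerate(reader) if row[label_column] not in ("0", "1")]

# An empty list means every label in the column is valid
print(find_bad_labels(SAMPLE_CSV, "variable"))
```

For a real file you would read its contents from disk first and run the same check once per criteria column.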

### Skip to Step 3 if you have loaded a CSV file and aren't creating a dataframe from scratch

In [4]:
# see the two columns

list(data)

['raw_text', 'variable']

`raw_text` is the column that stores the text from articles and studies about college lifetime earnings, while `variable` is used to store `0` or `1` depending on whether tuition cost was mentioned or not.

In [5]:
# check the first five rows/articles

data.head()

| | raw_text | variable |
|---|---|---|
| 0 | COMMENTARY \n\nBeyond the College Earnings Pre... | 1 |
| 1 |  \n\ny • By Richard Rothstein • July 21, 200... | 0 |
| 2 | (https://dailyreckoning.com/author/ericfry)\nB... | 1 |
| 3 |  \n\nt • By Lawrence Mishel • February 21, 2... | 0 |
| 4 | The college earnings premium is near\nrecord h... | 1 |


In [6]:
# check the number of rows and columns

data.shape

(15, 2)

Notice we're only working with fifteen articles (the number of rows). Accuracy would be much higher with a greater sample size, but for demo purposes this will suffice.

In [7]:
list(data['variable'])

[1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]

Eight of the articles mention tuition cost, while seven don't. These were labels I added into the CSV file before it was uploaded.
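It helps to turn that class balance into a baseline before looking at any model accuracy. The sketch below counts the labels from the list above and works out how often you would be right by always guessing the more common label; the `baseline` name is just for illustration.

```python
from collections import Counter

labels = [1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]  # the `variable` column shown above

counts = Counter(labels)
# Accuracy of a "classifier" that always predicts the most common label
baseline = max(counts.values()) / len(labels)

print(counts[1], counts[0])   # 8 7
print(round(baseline, 3))     # 0.533
```

Any accuracy score later in the notebook is worth comparing against that roughly 53% figure.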

---
## Step 3: Define Variables for *train_test_split* Accuracy Experiments

Given the contents of the raw article text, did it likely meet or fail a given criterion? **We will do a train/test split with the training data to measure predictive accuracy for each criteria column.** Define **X** as the raw article text (the manipulated variable), and some **y** variables as the individual criteria columns you are trying to classify future text into.

In [8]:
# define X and y - the manipulated variable and responding variable

X = data.raw_text  # this defines X as the csv column that contains sample article text
y = data.variable  # does the text mention tuition cost or not as a trade-off?


# split the new DataFrame into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

You now have the columns of the CSV file defined as Python variables. Now you just need to *vectorize* the `raw_text` you have stored in the `X` variable, meaning you'll quantify the text in some manner so machine learning can be performed on it.
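Since no size was passed to `train_test_split`, it falls back to its default of holding out 25% of the rows for testing, rounded up. The arithmetic below is a plain-Python sketch of that rule (not scikit-learn's own code), and it explains why the training matrices in this notebook have 11 rows.

```python
import math

n_samples = 15      # rows in the demo CSV
test_size = 0.25    # train_test_split's default when no size is given

n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 11 4

# With only four test articles, accuracy can only land on these values:
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]
print(possible_accuracies)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

That coarse scale is worth remembering when reading the accuracy numbers further down.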

---
## Step 4: Use *CountVectorizer* to Turn *X_train* and *X_test* into Document-Term Matrices

- **What:** Turn the training and testing portions of your framework samples into *document-term matrices*
- **Why:** Gives structure to previously unstructured text; you now have word frequency counts
- **Notes:** Easier with English text, not easy with languages where the beginning/end of words or sentences is ambiguous

We are now going to create what are called *document-term matrices* of the sample articles. **Think of these as *rows and columns* which store numbers representing *how often* certain terms appear in different text samples.** See the image below to better understand what that looks like:

&nbsp;

![Document-Term Matrix](https://raw.githubusercontent.com/analyticascent/confounder/master/images/text_vectorization.png)

&nbsp;

Basically, the frequency of certain words (and word sequences) will be different in text samples that mention tuition cost versus those that don't mention it at all. It's this difference that an algorithm "learns" to use to tell the difference between the two.
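To make the idea concrete, here is a hand-rolled sketch of a document-term matrix in plain Python. It borrows the default token pattern shown in the `CountVectorizer` options (words of two or more characters) and lowercases the text, but it is only an illustration of the concept, not scikit-learn's implementation; the two sample sentences are made up.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # same pattern as CountVectorizer's default

def tokenize(text):
    """Lowercase the text and pull out words of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())

def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["Tuition costs keep rising and tuition loans add interest.",
        "A degree raises lifetime earnings."]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
print(matrix[0][vocabulary.index("tuition")])  # 2
```

Each row is a document and each column a term, which is the same shape of data the classifier receives below.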

In [9]:
# use CountVectorizer to create document-term matrices from X_train and X_test

vect = CountVectorizer()  # because vect is way easier to type than CountVectorizer
X_train_dtm = vect.fit_transform(X_train)  # stores a vectorized X_train sample into X_train_dtm
X_test_dtm = vect.transform(X_test)  # stores a vectorized X_test sample into X_test_dtm, using the vocabulary learned from X_train

# now we have quantitative info about articles that classifier can work with

**Just to clarify what's going on in the adjacent cells:** All the **rows** are of course the *individual articles* that are stored in the CSV file. But the astronomical crapload of **columns** is literally *each unique term* that appears. Those are going to be the "features" used to tell articles that mention tuition cost apart from those that don't.

In [10]:
# rows are documents, columns are terms (aka "tokens" or "features")

X_train_dtm.shape

(11, 6343)

In [11]:
# last 50 features (on scikit-learn 1.0 and later, use vect.get_feature_names_out() here and in later cells)

print(vect.get_feature_names()[-50:])

['zimmermann', 'zt', 'zukunft', 'zur', 'zxx', 'ˆα', 'ˆπ', 'ˆρ', 'ˆτ', 'ˆτb', 'ˆτsim', 'β0', 'β1i', 'βf', 'βijt', 'γ0', 'γ1f', 'γ2f', 'γf', 'δ0', 'δ1f', 'δ2f', 'δf', 'ηi', 'θ0', 'θ1f', 'θ2f', 'θf', 'μi', 'πi', 'τb', 'ﬁeld', 'ﬁelds', 'ﬁg', 'ﬁgure', 'ﬁgures', 'ﬁle', 'ﬁnal', 'ﬁnance', 'ﬁnancial', 'ﬁnancing', 'ﬁnd', 'ﬁnds', 'ﬁrst', 'ﬁt', 'ﬁve', 'ﬁxed', 'ﬂagship', 'ﬂat', 'ﬂexible']


In [12]:
# show vectorizer options
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- Parameter **lowercase:** boolean, True by default
    - If True, Convert all characters to lowercase before tokenizing.

In [13]:
# We will not convert to lowercase this time, but if we did it would reduce the number of quantified features

vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(11, 7371)

In [14]:
# last 50 features

print(vect.get_feature_names()[-50:])

['yu', 'zXX', 'zero', 'zi', 'zur', 'ˆα', 'ˆπ', 'ˆρ', 'ˆτ', 'ˆτb', 'ˆτsim', 'β0', 'β1I', 'βf', 'βijt', 'γ0', 'γ1f', 'γ2f', 'γf', 'δ0', 'δ1f', 'δ2f', 'δf', 'ηi', 'θ0', 'θ1f', 'θ2f', 'θf', 'μi', 'πI', 'τb', 'ﬁeld', 'ﬁelds', 'ﬁg', 'ﬁgure', 'ﬁgures', 'ﬁle', 'ﬁnal', 'ﬁnance', 'ﬁnancial', 'ﬁnancing', 'ﬁnd', 'ﬁnds', 'ﬁrst', 'ﬁt', 'ﬁve', 'ﬁxed', 'ﬂagship', 'ﬂat', 'ﬂexible']


The cell below changes how *CountVectorizer* works by including a range of **n-grams.** These are *word sequences,* so a 2-gram, for instance, is a *pair* of consecutive words. With that range included, the resulting *document-term matrices* will contain frequency counts of how often pairs of words appear, as well as single terms if the range starts at 1.

- Parameter **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
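As a rough picture of what an `ngram_range` of `(1, 2)` produces, the sketch below builds word n-grams from a list of tokens in plain Python. The `word_ngrams` helper is hypothetical and only mirrors the idea; it is not how scikit-learn is implemented internally.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return every run of min_n..max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["student", "loan", "interest"]
print(word_ngrams(tokens, 1, 2))
# ['student', 'loan', 'interest', 'student loan', 'loan interest']
```

Every extra value of n adds another batch of columns, which is why the feature count jumps so sharply in the next cell.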

In [15]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))  # sets the vectorizer to look at single words as well as pairs of words
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(11, 40655)

In [16]:
# last 50 features

print(vect.get_feature_names()[-50:])

['ﬁle', 'ﬁle of', 'ﬁnal', 'ﬁnal exit', 'ﬁnal graduation', 'ﬁnance', 'ﬁnance and', 'ﬁnancial', 'ﬁnancial payoff', 'ﬁnancing', 'ﬁnancing of', 'ﬁnd', 'ﬁnd an', 'ﬁnd are', 'ﬁnd evidence', 'ﬁnd of', 'ﬁnd positive', 'ﬁnd that', 'ﬁnds', 'ﬁnds that', 'ﬁrst', 'ﬁrst 15', 'ﬁrst estimate', 'ﬁrst outcome', 'ﬁrst paper', 'ﬁrst row', 'ﬁrst sample', 'ﬁrst stage', 'ﬁrst years', 'ﬁt', 'ﬁt to', 'ﬁve', 'ﬁve in', 'ﬁve types', 'ﬁve universities', 'ﬁxed', 'ﬁxed at', 'ﬁxed eﬀects', 'ﬁxed number', 'ﬂagship', 'ﬂagship institution', 'ﬂagship state', 'ﬂagship university', 'ﬂat', 'ﬂat along', 'ﬂexible', 'ﬂexible below', 'ﬂexible linear', 'ﬂexible polynomial', 'ﬂexible polynomials']


---
## Step 5: Test Predictive Accuracy for First Criteria Feature

How accurately can we predict whether a text sample met or failed a criterion, compared to the way you originally labeled it?
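Before running the real thing, it may help to see roughly what a multinomial Naive Bayes classifier does with word counts. The sketch below is a minimal hand-rolled version with add-one smoothing (`alpha=1.0`, the same default that `MultinomialNB` reports). The `train_nb` and `predict_nb` helpers and the toy documents are hypothetical, and this is not scikit-learn's code.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs are lists of tokens, labels are 0/1. Returns per-class log prior and log likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    model = {}
    for cls in set(labels):
        cls_docs = [d for d, y in zip(docs, labels) if y == cls]
        counts = Counter(tok for d in cls_docs for tok in d)
        total = sum(counts.values()) + alpha * len(vocab)  # smoothed denominator
        log_prior = math.log(len(cls_docs) / len(docs))
        log_lik = {t: math.log((counts[t] + alpha) / total) for t in vocab}
        model[cls] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for cls, (log_prior, log_lik) in model.items():
        # tokens never seen in training are skipped, since they have no column
        scores[cls] = log_prior + sum(log_lik[t] for t in doc if t in log_lik)
    return max(scores, key=scores.get)

train_docs = [["tuition", "loans", "debt"], ["tuition", "cost"],
              ["earnings", "premium"], ["lifetime", "earnings"]]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["tuition", "debt", "earnings"]))  # 1
print(predict_nb(model, ["lifetime", "premium"]))          # 0
```

Words that show up more often in one class pull the prediction toward that class, which is all the "learning" amounts to here.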

In [17]:
# use default options for CountVectorizer

vect = CountVectorizer()


# create document-term matrices using CountVectorizer, and store them as variables

X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)


# use Naive Bayes to predict the first feature of the list criteria

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)


# calculate accuracy

print(metrics.accuracy_score(y_test, y_pred_class))

0.5


In [18]:
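# re-split with a different random_state to see how much the score depends on which articles land in the test set

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1000)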

In [19]:
# use default options for CountVectorizer

vect = CountVectorizer()


# create document-term matrices using CountVectorizer, and store them as variables

X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)


# use Naive Bayes to predict the first feature of the list criteria

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)


# calculate accuracy

print(metrics.accuracy_score(y_test, y_pred_class))

0.75
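The accuracy score itself is nothing more than the share of predictions that match your labels. Here is a plain-Python sketch of that calculation with made-up labels for a four-article test set; the `accuracy` helper is only illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the hand-assigned labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels and predictions for a four-article test set
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
```

With a test set this small, the gap between 0.5 and 0.75 comes down to a single article, so it is not surprising that changing the random state moves the score.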


The cell below will eliminate the need for typing in the same code over and over again, as well as produce an output that includes all the information we need to know about how the number of unique features is affecting the classifier accuracy.

In [20]:
# define a function that accepts a vectorizer and calculates the accuracy

def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])  # this output will be unique words or n-grams
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class) * 100, "%")

In [21]:
vect = CountVectorizer()
tokenize_test(vect)

Features:  6343
Accuracy:  75.0 %


---

## Step 6: Test Stopword Removal and N-Grams to Potentially Boost Accuracy

In [22]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  40655
Accuracy:  75.0 %


- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [23]:
# show vectorizer options

vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
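A rough plain-Python picture of stop word removal is shown below. The tiny `STOP_WORDS` set is made up for the example (the built-in `'english'` list is much longer), and the tokenizing is simplified compared to what scikit-learn does.

```python
import re

# Hypothetical mini stop list; scikit-learn's built-in 'english' list is far longer
STOP_WORDS = {"the", "of", "and", "is", "in", "to", "on"}

def tokens_without_stop_words(text, stop_words=STOP_WORDS):
    """Tokenize into lowercase words of 2+ characters, then drop the stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

print(tokens_without_stop_words("The cost of tuition is rising in the city"))
# ['cost', 'tuition', 'rising', 'city']
```

What remains are the terms more likely to carry information about the topic.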

In [24]:
# remove English stop words

vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

Features:  6086
Accuracy:  75.0 %


In [25]:
# set of stop words

print(vect.get_stop_words())

frozenset({'nowhere', 'several', 'even', 'thereby', 'a', 'towards', 'twenty', 'afterwards', 'would', 'amoungst', 'together', 'yours', 'seem', 'see', 'by', 'system', 'name', 'latter', 'while', 'except', 'ten', 'toward', 'ltd', 'up', 'than', 'others', 'too', 'beside', 'therefore', 'due', 'and', 'becoming', 'your', 'should', 'over', 'beyond', 'whence', 'along', 'hereupon', 'often', 'somehow', 'has', 'ever', 'about', 'was', 'out', 'we', 'my', 'never', 'be', 'seemed', 'get', 'she', 'something', 'under', 'were', 'almost', 'third', 'hundred', 'describe', 'hers', 'put', 'else', 'only', 'though', 'fifty', 'well', 'everything', 'amount', 'any', 'already', 'on', 'against', 'one', 'when', 'may', 'none', 'of', 'themselves', 'myself', 'anyone', 'empty', 'became', 'its', 'thru', 'sometime', 'thus', 'either', 'at', 'you', 'whole', 'mostly', 'itself', 'however', 'all', 'cannot', 'thin', 'us', 'three', 'least', 'indeed', 'whose', 'thereupon', 'could', 'thereafter', 'our', 'otherwise', 'eleven', 'herself

---

## Step 7: Other CountVectorizer Options to Raise Predictive Accuracy

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
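In other words, terms are ranked by their total count over all documents and only the top few are kept as columns. A minimal sketch of that ranking, using a hypothetical `top_terms` helper and made-up token lists (how ties are broken here may differ from scikit-learn):

```python
from collections import Counter

def top_terms(tokenized_docs, max_features):
    """Keep the max_features terms with the highest total count across all documents."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [term for term, _ in totals.most_common(max_features)]
    return sorted(kept)

docs = [["college", "tuition", "college", "debt"],
        ["college", "earnings", "tuition"]]
print(top_terms(docs, 2))  # ['college', 'tuition']
```

Cutting the vocabulary this way throws out the long tail of rare terms.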

In [26]:
vect = CountVectorizer(max_features=25) # I only need the 25 most frequent vocabulary terms to get this result
tokenize_test(vect)

Features:  25
Accuracy:  100.0 %


In [27]:
# remove English stop words and only keep 16 features

vect = CountVectorizer(stop_words='english', max_features=16) # Removing stop words brings the number of features needed down to 16
tokenize_test(vect)

Features:  16
Accuracy:  100.0 %


In [28]:
# all 16 features (on scikit-learn 1.0 and later, use vect.get_feature_names_out() instead)

print(vect.get_feature_names())

['changes', 'college', 'degree', 'demand', 'education', 'graduates', 'high', 'labor', 'relative', 'returns', 'school', 'university', 'wage', 'wages', 'workers', 'year']


In [29]:
# include 1-grams and 2-grams, and limit the number of features

vect = CountVectorizer(ngram_range=(1, 2), max_features=150)
tokenize_test(vect)

Features:  150
Accuracy:  100.0 %


In [30]:
print(vect.get_feature_names())

['00', '10', '16', '1963', '1980s', '1987', '2000', '2009', 'admission', 'all', 'also', 'an', 'analysis', 'and', 'and the', 'are', 'as', 'at', 'at the', 'average', 'be', 'been', 'between', 'but', 'by', 'can', 'changes', 'changes in', 'china', 'college', 'college graduates', 'data', 'degree', 'demand', 'each', 'earnings', 'economic', 'economics', 'education', 'elite', 'elite university', 'employment', 'estimates', 'experience', 'figure', 'for', 'for the', 'from', 'from the', 'graduates', 'group', 'growth', 'has', 'have', 'high', 'high school', 'higher', 'if', 'in', 'in relative', 'in the', 'income', 'industry', 'inequality', 'is', 'is the', 'it', 'it is', 'job', 'jobs', 'journal', 'journal of', 'labor', 'less', 'market', 'more', 'most', 'not', 'of', 'of college', 'of the', 'ols', 'on', 'on the', 'one', 'only', 'or', 'other', 'our', 'over', 'over the', 'percent', 'period', 'premium', 'relative', 'relative wages', 'returns', 'returns to', 'rural', 'sample', 'school', 'schooling', 'score',

In [31]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  40655
Accuracy:  75.0 %


In [32]:
# include 1-grams and 2-grams, and limit the number of features

vect = CountVectorizer(ngram_range=(1, 2), max_features=100)
tokenize_test(vect)

Features:  100
Accuracy:  100.0 %


- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
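Note that *document frequency* counts how many documents a term appears in, not how many times it appears overall. The sketch below shows the integer form of the threshold with a hypothetical `filter_by_min_df` helper and made-up token lists.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms that appear in at least min_df documents (not min_df total occurrences)."""
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return sorted(t for t, df in doc_freq.items() if df >= min_df)

docs = [["tuition", "tuition", "tuition", "debt"],
        ["earnings", "debt"],
        ["earnings", "premium"]]
print(filter_by_min_df(docs, 2))  # ['debt', 'earnings']
```

Here "tuition" is dropped even though it occurs three times, because all three occurrences are in the same document.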

In [33]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents

vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

Features:  29877
Accuracy:  75.0 %


In [34]:
# remove English stop words and only keep 150 features

vect = CountVectorizer(stop_words='english', max_features=150)
tokenize_test(vect)

Features:  150
Accuracy:  100.0 %


**Note:** _The perfect accuracy here is likely the result of overfitting: the vectorizer settings were tuned against the same four test articles that the score is measured on._

In [35]:
print(vect.get_feature_names()) # notice that only 150 features are listed

['00', '10', '11', '12', '15', '16', '17', '18', '19', '1963', '1975', '1979', '1980s', '1987', '1991', '1995', '20', '2000', '2001', '2004', '2005', '2007', '2009', '2010', '25', '30', 'admission', 'advanced', 'al', 'analysis', 'available', 'average', 'bachelor', 'bureau', 'business', 'change', 'changes', 'china', 'college', 'column', 'data', 'degree', 'degrees', 'demand', 'differences', 'differentials', 'discontinuity', 'distribution', 'earnings', 'economic', 'economics', 'educated', 'education', 'elite', 'employment', 'epi', 'estimate', 'estimates', 'et', 'evidence', 'experience', 'factor', 'female', 'figure', 'given', 'graduates', 'group', 'groups', 'growth', 'high', 'higher', 'important', 'income', 'increase', 'increased', 'index', 'industry', 'inequality', 'job', 'jobs', 'journal', 'just', 'labor', 'level', 'li', 'log', 'major', 'male', 'market', 'measure', 'median', 'new', 'number', 'ols', 'openings', 'percent', 'percentage', 'period', 'premium', 'private', 'production', 'rate',

---
## Step 8: Using GridSearchCV for Hyperparameter Optimization

Although it wasn't covered comprehensively in the part-time course I took, it turns out there is a rather simple way to do multiple train/test splits on the same data with cross-validation (to avoid overfitting), as well as try out a variety of parameter settings to see which ones output the most predictive results. 

**Enter GridSearchCV.**

In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import preprocessing

In [37]:
steps=[('vectorize',CountVectorizer()),\
       ('clf',MultinomialNB())]

In [38]:
pipe=Pipeline(steps)

In [39]:
X_train, X_test, y_train, y_test=\
train_test_split(data['raw_text'], data['variable'], random_state=1)

In [40]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorize', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [41]:
pred=pipe.predict(X_test)

In [42]:
print('Accuracy = {:3f}'.format(accuracy_score(y_test,pred)))

Accuracy = 0.500000


In [43]:
pipe.named_steps.keys()

dict_keys(['vectorize', 'clf'])

In [44]:
param_grid = dict(vectorize__binary=[True,False],\
                  vectorize__stop_words=[None,'english'],\
                  #vectorize__min_df=[1,3,5,7,9,12],\
                  vectorize__lowercase=[True,False],\
                  vectorize__ngram_range=[(1,1),(1,2)],\
                  vectorize__max_features=[95,100,105] # did a lot of iteration to whittle it down to those
                 )
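To get a feel for how much work that grid asks for, you can count the combinations yourself. The sketch below expands the same values with `itertools.product`; it only illustrates the bookkeeping and is not what `GridSearchCV` runs internally.

```python
from itertools import product

# Same values as the param_grid above (the commented-out min_df line is left out)
grid = {
    "vectorize__binary": [True, False],
    "vectorize__stop_words": [None, "english"],
    "vectorize__lowercase": [True, False],
    "vectorize__ngram_range": [(1, 1), (1, 2)],
    "vectorize__max_features": [95, 100, 105],
}

# One dict per combination, pairing each parameter name with one of its values
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 48
```

Each of those 48 settings is fitted once per cross-validation fold, so adding even one more option to a parameter grows the search quickly.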

In [45]:
grid_search = GridSearchCV(pipe, param_grid=param_grid,\
                           scoring=make_scorer(accuracy_score),n_jobs=6)

In [46]:
%time res=grid_search.fit(data['raw_text'], data['variable'])



CPU times: user 672 ms, sys: 68 ms, total: 740 ms
Wall time: 10.3 s




In [47]:
res.best_params_

{'vectorize__binary': True,
 'vectorize__lowercase': True,
 'vectorize__max_features': 95,
 'vectorize__ngram_range': (1, 1),
 'vectorize__stop_words': 'english'}

In [48]:
print("Accuracy: " + str(res.best_score_))

Accuracy: 0.8


**Keep in mind that the above result is the best mean cross-validated accuracy among the parameter combinations in the grid.** Earlier results in this notebook that reached 100% accuracy without `GridSearchCV` were probably the result of overfitting to a single small train/test split.
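Cross-validation is what makes that score more trustworthy than a single split: the data is cut into folds, and each fold takes a turn as the test set. Below is a simplified plain-Python sketch of that rotation. The `k_fold_indices` helper is hypothetical, and it leaves out the stratification by label that scikit-learn applies by default for classifiers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test
        start += size

folds = list(k_fold_indices(15, 3))
print([len(test) for _, test in folds])  # [5, 5, 5]
```

Every article ends up in a test fold exactly once, and the reported score is the average over the folds.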

---
## Concluding Tips and Tricks

- It will always be easier to determine if an article *omits* something than to screen how "true" a statement is
- This is all only as good as the criteria you set out to check articles and studies for
- Those criteria also have a large influence on how good your training sample is
- Your workflow should check each of the criteria independently