# Lecture Eight - July 18th 2017

* Machine Learning Recap
* Feature Engineering
* Machine Learning with Text
* Final Project

## Machine Learning Recap

* Anyone remember the two main types of machine learning?

### Machine Learning Summary


* *Supervised learning*: Models that can predict labels based on labeled training data
    * *Classification*: Models that predict labels as two or more discrete categories
    * *Regression*: Models that predict continuous labels

* *Unsupervised learning*: Models that identify structure in unlabeled data
     * *Clustering*: Models that detect and identify distinct groups in the data
     * *Dimensionality reduction*: Models that detect and identify lower-dimensional structure in higher-dimensional data

### Machine learning with Scikit Learn

* Scikit Learn is the most popular library for doing machine learning in Python
* Has a nice interface and integrates with Pandas and Matplotlib

#### Features & Features Matrix

* In machine learning the term *features* is used to indicate the dimensions of the dataset. They are also called *predictors* or *variables* depending upon the discipline. 
* Most machine learning algorithms want your data in the form of a *features matrix*. 

#### Target Array

* Supervised learning encodes existing knowledge as *a priori* labels; these labels form the *target array*.
* Using these *a priori* labels the model *learns* to predict labels based on values in the features matrix.
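
To make the two objects concrete, here is a minimal sketch built from an invented three-row table. The column names and values are made up for illustration and are not taken from the scorecard.

```python
import pandas as pd

# Toy data: two feature columns and one label column (all values invented)
toy = pd.DataFrame({
    "cost": [21000, 48000, 15000],
    "enrollment": [4200, 1300, 9800],
    "control": ["Public", "Private nonprofit", "Public"],
})

# Features matrix: one row per sample, one column per feature
X_toy = toy[["cost", "enrollment"]]

# Target array: one a priori label per sample
y_toy = toy["control"]

print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)
```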


### The Stages of Modeling Data

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the `fit()` method of the model instance.
5. Apply the fitted instance of the model to new data
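
The five steps above can be sketched end to end in a few lines. This is only one possible instance: the choice of `KNeighborsClassifier`, the value `n_neighbors=3`, and the bundled iris data are assumptions made for illustration.

```python
# 1. Choose a class of model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 2. Choose hyperparameters by instantiating the class
knn = KNeighborsClassifier(n_neighbors=3)

# 3. Arrange data into a features matrix and target vector
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, random_state=0)

# 4. Fit the model to the training data
knn.fit(X_train, y_train)

# 5. Apply the fitted model to new data
y_pred = knn.predict(X_test)
print(y_pred[:10])
```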

## Feature Engineering

* Most of the data we have been playing with is nicely cleaned and prepared
* However, in the real world data are *rarely* so tidy 
* Machine learning is very difficult if your data aren't in the form of an N×M features matrix plus labels in a target array
* There is a kind of data fitness work, closely related to but distinct from data cleaning, called *feature engineering*. It focuses on taking the data and information you have and getting it into shape for machine learning. That is, making a *features matrix*

In [1]:
# load up pandas and the scorecard data
import pandas as pd


In [2]:
scorecard = pd.read_csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv",
                        low_memory=False)
scorecard.shape

(7703, 1743)

* So we have a giant dataset, which, as we saw last week, has some columns ready for machine learning
* However, there are other corners of the dataset that need a bit of work before we can feed them to the algorithm
* Especially categorical data

In [6]:
# select just the columns we are interested in as features
feature_list = ['REGION',
                'COSTT4_A',
                'UGDS',
                'CONTROL',
                'INSTNM']
scorecard_subset = scorecard[feature_list].dropna()#.astype({"COSTT4_A":int, "UGDS":int})

# Select 10 random rows so we can see what the dataframe looks like
scorecard_subset.sample(10)

| | REGION | COSTT4_A | UGDS | CONTROL | INSTNM |
|---|---|---|---|---|---|
| 551 | 4 | 30930 | 8658 | 3 | Walden University |
| 1411 | 5 | 33107 | 1844 | 2 | University of the Cumberlands |
| 2950 | 3 | 62051 | 2961 | 2 | Oberlin College |
| 157 | 5 | 18679 | 3226 | 1 | Henderson State University |
| 677 | 1 | 19704 | 8133 | 1 | Southern Connecticut State University |
| 1069 | 3 | 40643 | 1452 | 3 | Kendall College |
| 2175 | 4 | 16439 | 1537 | 1 | Peru State College |
| 1448 | 5 | 11815 | 3158 | 1 | Big Sandy Community and Technical College |
| 214 | 8 | 13885 | 261 | 2 | Bethesda University |
| 1022 | 3 | 20602 | 1086 | 3 | Le Cordon Bleu College of Culinary Arts-Chicago |


* So can't we just use the original mapping from the raw data?
```
0	U.S. Service Schools
1	New England (CT, ME, MA, NH, RI, VT)
2	Mid East (DE, DC, MD, NJ, NY, PA)
3	Great Lakes (IL, IN, MI, OH, WI)
4	Plains (IA, KS, MN, MO, NE, ND, SD)
5	Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
6	Southwest (AZ, NM, OK, TX)
7	Rocky Mountains (CO, ID, MT, UT, WY)
8	Far West (AK, CA, HI, NV, OR, WA)
9	Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)
```
* The problem with this mapping is that there is an implied ordering. New England is less than the Mid East, which is less than the Great Lakes. As a New Englander, I take offense!
* This ordering can cause problems because the algorithms don't realize they are categories
* Let's map the raw data to the human-readable mappings

In [7]:
# mapping numbers to categories for CONTROL and REGION
control_values = {1:"Public",
                  2:"Private nonprofit",
                  3:"Private for-profit"}

regions = """U.S. Service Schools
New England (CT, ME, MA, NH, RI, VT)
Mid East (DE, DC, MD, NJ, NY, PA)
Great Lakes (IL, IN, MI, OH, WI)
Plains (IA, KS, MN, MO, NE, ND, SD)
Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
Southwest (AZ, NM, OK, TX)
Rocky Mountains (CO, ID, MT, UT, WY)
Far West (AK, CA, HI, NV, OR, WA)
Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)""".split("\n")

# a bit of python trickery to create a mapping dictionary
region_values = {key: value for key, value in enumerate(regions)}

# replace numbers with labels
scorecard_subset = scorecard_subset.replace({"CONTROL": control_values,
                                             "REGION": region_values})

# Select 10 random rows so we can see what the dataframe looks like
scorecard_subset.sample(10)

| | REGION | COSTT4_A | UGDS | CONTROL | INSTNM |
|---|---|---|---|---|---|
| 2376 | Mid East (DE, DC, MD, NJ, NY, PA) | 13300 | 42 | Private nonprofit | Beth Hamedrash Shaarei Yosher Institute |
| 3648 | Southwest (AZ, NM, OK, TX) | 47404 | 1272 | Private nonprofit | Austin College |
| 834 | Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,... | 19306 | 30279 | Public | University of South Florida-Main Campus |
| 1966 | Plains (IA, KS, MN, MO, NE, ND, SD) | 12774 | 8755 | Public | Century College |
| 3015 | Great Lakes (IL, IN, MI, OH, WI) | 37601 | 694 | Private nonprofit | Ursuline College |
| 603 | Rocky Mountains (CO, ID, MT, UT, WY) | 16043 | 476 | Public | Lamar Community College |
| 3459 | Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,... | 32818 | 1138 | Private nonprofit | Coker College |
| 125 | Southwest (AZ, NM, OK, TX) | 9800 | 996 | Private for-profit | Western International University |
| 212 | Far West (AK, CA, HI, NV, OR, WA) | 14455 | 1588 | Public | Barstow Community College |
| 1403 | Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,... | 27812 | 1003 | Private nonprofit | Brescia University |


* So while these data are nice and clean, they are not in the correct shape for machine learning.
* We need numbers.
* These data need to be transformed into *vectors* in a process called...*vectorization*

### Vectorizing Categorical Data

* Often, especially with the scorecard data, you have non-numerical data in the form of discrete categories. 
* One common technique for dealing with categorical features is *one-hot encoding*, which creates a column for each category and uses `0` or `1` to indicate a particular sample's categorical value.
* In our `scorecard_subset` we have lots of categorical data. Vectorizing all of that categorical data would be a *huge* pain in pure Python.
* Fortunately, scikit-learn has us covered!
* What we need to do is use a combination of helper functions from Pandas and Scikit learn to reconfigure the data
* First, we will use [`to_dict()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html) to transform our dataframe into a list of dictionaries

In [8]:
# transform the dataframe into a list of python dictionaries
scorecard_subset_dict = scorecard_subset.drop('INSTNM', axis=1).to_dict(orient="records")
# look at the first four entries
scorecard_subset_dict[0:4]

[{'CONTROL': 'Public',
  'COSTT4_A': 21475.0,
  'REGION': 'Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)',
  'UGDS': 4206.0},
 {'CONTROL': 'Public',
  'COSTT4_A': 20621.0,
  'REGION': 'Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)',
  'UGDS': 11383.0},
 {'CONTROL': 'Private nonprofit',
  'COSTT4_A': 16370.0,
  'REGION': 'Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)',
  'UGDS': 291.0},
 {'CONTROL': 'Public',
  'COSTT4_A': 21107.0,
  'REGION': 'Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)',
  'UGDS': 5451.0}]

* Now we can use the helper function [`DictVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) to transform this into a feature matrix

In [9]:
# import vectorizer helper function and transform our data into a feature matrix
from sklearn.feature_extraction import DictVectorizer

# instantiate a vectorizer and transform the data
scorecard_subset_vector = DictVectorizer(sparse=False, dtype=int)
X = scorecard_subset_vector.fit_transform(scorecard_subset_dict)
X

array([[    0,     0,     1, ...,     0,     0,  4206],
       [    0,     0,     1, ...,     0,     0, 11383],
       [    0,     1,     0, ...,     0,     0,   291],
       ..., 
       [    1,     0,     0, ...,     1,     0,   253],
       [    0,     1,     0, ...,     0,     0,    94],
       [    0,     0,     1, ...,     0,     0,  1782]], dtype=int32)

In [11]:
# List the names of the features (columns) in our vectorized data structure
# (newer scikit-learn releases replace this method with get_feature_names_out())
scorecard_subset_vector.get_feature_names()

['CONTROL=Private for-profit',
 'CONTROL=Private nonprofit',
 'CONTROL=Public',
 'COSTT4_A',
 'REGION=Far West (AK, CA, HI, NV, OR, WA)',
 'REGION=Great Lakes (IL, IN, MI, OH, WI)',
 'REGION=Mid East (DE, DC, MD, NJ, NY, PA)',
 'REGION=New England (CT, ME, MA, NH, RI, VT)',
 'REGION=Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)',
 'REGION=Plains (IA, KS, MN, MO, NE, ND, SD)',
 'REGION=Rocky Mountains (CO, ID, MT, UT, WY)',
 'REGION=Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)',
 'REGION=Southwest (AZ, NM, OK, TX)',
 'REGION=U.S. Service Schools',
 'UGDS']

In [12]:
print("Human Readable Representation")
print("-----------------------------")
print(scorecard_subset.iloc[0]) # The original, labeled values for row 0
print()
print("Machine Readable Representation")
print("-------------------------------")
print(X[0]) # The vectorized representation of row 0

Human Readable Representation
-----------------------------
REGION      Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,...
COSTT4_A                                                21475
UGDS                                                     4206
CONTROL                                                Public
INSTNM                               Alabama A & M University
Name: 0, dtype: object

Machine Readable Representation
-------------------------------
[21475     0     0     0     0     0     0     0     1     0     0  4206]
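
To see one-hot encoding at a scale small enough to check by eye, here is a sketch that uses `pandas.get_dummies` rather than the `DictVectorizer` route shown above. The three-row frame is invented for illustration; only the column names echo the scorecard.

```python
import pandas as pd

# Invented data: one categorical column and one numeric column
toy = pd.DataFrame({
    "CONTROL": ["Public", "Private nonprofit", "Public"],
    "UGDS": [4206, 291, 5451],
})

# get_dummies makes one 0/1 column per category and leaves numeric columns alone
encoded = pd.get_dummies(toy, columns=["CONTROL"], dtype=int)
print(encoded)
```

Each row has exactly one `1` among the `CONTROL_*` columns, which is where the name "one-hot" comes from.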


### Why do we have to do this?

* Machine learning algorithms only eat numbers, so you need to transform text into numbers
* The models complain when you feed them text; they hate text

In [13]:
# Load our model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# instantiate our model
model = LogisticRegression()

In [None]:
# create features matrix and target array without vectorizing
X = scorecard_subset.drop(["INSTNM", "CONTROL"],axis=1)
y = scorecard_subset["CONTROL"]

# try and fit the data
model.fit(X,y)
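
The cell above has no recorded output in these notes. As a stand-in, here is a small sketch of the kind of failure to expect, using an invented four-row frame with one text column. The data and variable names are assumptions, not the scorecard.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented data: one text column, one numeric column
X_text = pd.DataFrame({"REGION": ["Plains", "Far West", "Plains", "Far West"],
                       "UGDS": [1200, 300, 4500, 800]})
y_toy = ["Public", "Private nonprofit", "Public", "Private nonprofit"]

try:
    LogisticRegression().fit(X_text, y_toy)
    failed = False
except ValueError as err:
    # scikit-learn cannot convert the strings in REGION to floats
    failed = True
    print("fit() raised ValueError:", err)
```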

* Well that didn't work.
* What happens when we vectorize the data first?

In [14]:
# Create our vectorized feature matrix and a target array
X_dict = scorecard_subset.drop(['INSTNM','CONTROL'],axis=1).to_dict(orient="records")
vectorizer = DictVectorizer(sparse=False, dtype=int)
X = vectorizer.fit_transform(X_dict)
y = scorecard_subset["CONTROL"]

# Create training and testing data
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
                                                train_size=.5, # split dataset 50/50
                                                random_state=1) # fix random sampling

# Fit the model and test its accuracy
model.fit(Xtrain,ytrain)
y_model = model.predict(Xtest)
print("The accuracy of this model is: {:.1%}".format(accuracy_score(ytest, y_model)))

The accuracy of this model is: 72.6%


* Well, this isn't a very good model, but at least we can run it without Python barfing!

## Machine learning with Text

* Machine learning with textual data is tricky because human readable text isn't readily digestible by algorithms
* This means a whole bunch of *feature engineering*
* Human readable, prosaic documents != feature matrix
* So just like we vectorized categorical data, we have to vectorize text data.

### From words & documents to text features and feature vectors

* The most common representation of text for machine learning is the [*bag of words* model](https://en.wikipedia.org/wiki/Bag-of-words_model)
* Documents become rows, words become columns, and the cells are frequency counts
* This feature matrix is called the [*document-term matrix*](https://en.wikipedia.org/wiki/Document-term_matrix)
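
Before reaching for a library, it can help to build a tiny document-term matrix by hand. The sketch below uses only the standard library and an invented two-document corpus; it illustrates the idea and is not how scikit-learn implements it.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # invented mini-corpus

# Count the words in each document
counts = [Counter(doc.split()) for doc in docs]

# Columns of the matrix: the sorted vocabulary across all documents
vocabulary = sorted(set(word for c in counts for word in c))

# Rows of the matrix: one list of counts per document
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```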

In [15]:
# set us up the documents
documents = ["John likes to watch movies. Mary likes movies too.",
          "John also likes to watch football games.",
          "Watching Football is a thing."]

* We have three documents that we need to transform into text feature vectors.
* Fortunately, Scikit Learn provides a helper function to make this easy

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(documents)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names())

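| | also | football | games | is | john | likes | mary | movies | thing | to | too | watch | watching |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 2 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |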


* Text is transformed into features through a process called *tokenization*
* Basically, each document is broken up into a list of words (like `.split()`) and punctuation is removed (if that is desired).
* When you split into single words you have *unigrams*
* If you want to look for clusters of two words you can create *bigrams* (or *trigrams* or...)
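
A rough sketch of tokenization and n-grams in plain Python follows. The regular expression mirrors `CountVectorizer`'s default token pattern (runs of two or more word characters), and the helper names `tokenize` and `ngrams` are made up for this example.

```python
import re

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    # slide a window of width n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Watching Football is a thing.")
print(tokens)
print(ngrams(tokens, 2))
```

Note that the one-letter word `a` is dropped, which is why a bigram like `is thing` can appear.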

In [17]:
# unigram and bigram example
bi_vec = CountVectorizer(ngram_range=(1,2)) # tell CountVectorizer to make unigrams and bigrams
bi_X = bi_vec.fit_transform(documents)
pd.DataFrame(bi_X.toarray(), columns=bi_vec.get_feature_names()).T # rotate for readability

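| | 0 | 1 | 2 |
|---|---|---|---|
| also | 0 | 1 | 0 |
| also likes | 0 | 1 | 0 |
| football | 0 | 1 | 1 |
| football games | 0 | 1 | 0 |
| football is | 0 | 0 | 1 |
| games | 0 | 1 | 0 |
| is | 0 | 0 | 1 |
| is thing | 0 | 0 | 1 |
| john | 1 | 1 | 0 |
| john also | 0 | 1 | 0 |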


* There are all kinds of other preprocessing steps, too many to cover in today's lecture (or this course)
* *Stemming* makes similar word tokens comparable. Depending on the aggressiveness of the technique, you can strip pluralization, strip suffixes, or cut words down to their morphological root.
* This is how search engines return results for similar words.
* In the case above we could use a stemmer to transform `watching` into `watch`, which would mean we'd have one less column and the third document (row 2) would have a 1 in the `watch` column.
* Scikit-Learn only has some very basic tools for text processing. For more advanced text processing check out the [Natural Language Toolkit](http://www.nltk.org), [spaCy](https://spacy.io), and [TextBlob](https://textblob.readthedocs.io/en/dev/), and for even more advanced text analysis, [gensim](https://radimrehurek.com/gensim/).
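
To give a feel for what a stemmer does, here is a deliberately crude suffix stripper. It is a toy written for this example (the function name, suffix list, and minimum stem length are all assumptions) and is far simpler than the stemmers in the libraries listed above.

```python
def crude_stem(word, suffixes=("ing", "s"), min_stem=3):
    # strip the first matching suffix, unless the remainder would be too short
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["watching", "watch", "likes", "movies", "thing", "is"]:
    print(w, "->", crude_stem(w))
```

With this rule `watching` and `watch` collapse into the same token, while short words such as `thing` and `is` are left alone.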

### Text data example

* The scorecard dataset doesn't include a lot of document-centric, textual data, so we need to find another dataset
* Topic modeling works best on a large number of short documents. The original LDA paper used abstracts of scientific literature.
* If you want to model novel length texts, you have to break them up into smaller documents.
* In this example, we are going to use the [20 newsgroups text dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).
* This dataset contains messages sent to 20 newsgroups, which work much like mailing lists, covering a variety of topics including computers, recreation, science, politics, and religion.
* Like the iris dataset, it is a commonly used dataset in machine learning.

In [19]:
# Load the data loader
from sklearn.datasets import fetch_20newsgroups 

# Download the data 
dataset = fetch_20newsgroups(#data_home="/ihome/mburton/mcburton/scikit_learn_data",
                             shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))

# How many emails
len(dataset['data'])

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


11314

In [20]:
# look at the first two emails
dataset['data'][0:2]

["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n",
 "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap

* Now that we have the data in raw textual form we need to vectorize it

In [21]:
# specify how many samples and word features we want in our matrix
n_samples = 2000
n_features = 1000

# extract a subset of the data
data_samples = dataset.data[:n_samples]

# instantiate a vectorizer 
tf_vectorizer = CountVectorizer(max_df=0.95, # ignore words in more than 95% of documents
                                min_df=2,    # ignore words in fewer than 2 documents
                                max_features=n_features,
                                stop_words='english')

# vectorize it
tf = tf_vectorizer.fit_transform(data_samples) ### <---<<< THIS IS USED LATER IN LDA

# print out a sample of the matrix
pd.DataFrame(tf.toarray(), 
             columns=tf_vectorizer.get_feature_names()) \
            .sample(10).sample(10,axis=1)

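| | happy | network | wasn | board | takes | agree | particular | hand | piece | unfortunately |
|---|---|---|---|---|---|---|---|---|---|---|
| 1175 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 508 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1844 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 991 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 740 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1334 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 789 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1044 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |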
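
The effect of a document-frequency cutoff is easier to see on a tiny corpus. This sketch reuses the three example sentences from earlier in the lecture and applies only `min_df=2`; the variable names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games.",
        "Watching Football is a thing."]

# min_df=2 keeps only words that appear in at least two documents
vec_min = CountVectorizer(min_df=2)
vec_min.fit(docs)

# vocabulary_ maps each kept word to its column index
kept = sorted(vec_min.vocabulary_)
print(kept)
```

Words that occur in a single document, such as `mary` or `thing`, are dropped from the vocabulary.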


## Topic Modeling 

* Topic modeling is a general term for a class of unsupervised, clustering algorithms
* The most famous of which is [*Latent Dirichlet Allocation*](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA)
* LDA is a method for finding latent patterns of co-occurrence within large amounts of data (most often, but not necessarily, text)
* It is popular in the digital humanities because it is useful for exploring large corpora of unlabeled text 

### How does it work?

* LDA is a *generative* model, meaning the model describes a process, with a set of variables (topics), for generating data (documents)
* Because we already have the data, we work backwards to figure out the parameters for the generative process that most likely produced the data we have
* These parameters, the topics, can provide insight into latent structures within the data




### What do I mean by "topic?"

* Topics in LDA are not the same as human topics 
* In LDA, topics are *probability distributions over words*

![word-distribution2.png](attachment:word-distribution2.png)

* Imagine topics as bags filled with word tokens. When you reach into the bag you are more or less likely to pull out some words over others (because they have higher probability).
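
The bag metaphor translates directly into weighted random sampling. In this sketch the "topic" and its probabilities are invented for illustration.

```python
import random

# An invented "topic": word probabilities that sum to 1
topic = {"football": 0.5, "games": 0.3, "watch": 0.15, "movies": 0.05}

random.seed(0)
# Reach into the bag 10 times; likelier words come out more often
draws = random.choices(list(topic), weights=list(topic.values()), k=10)
print(draws)
```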

### Documents are composed from a *mixture* of topics

* Documents are composed of words drawn from multiple topic distributions
* Each document has a different mixture of topics

![document-mixture.png](attachment:document-mixture.png)


### The LDA generative process model

* Documents are generated by repeatedly sampling from these distributions
* Remember, documents are just word-frequency counts not prose

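```python
import numpy as np

rng = np.random.default_rng(0)

# Specify the number of topics a priori
number_of_topics = 10
# Toy vocabulary and document lengths (invented for illustration)
vocabulary = ["football", "games", "movies", "watch", "likes"]
document_lengths = [8, 5, 6]

# Get the topic distributions: one distribution over words per topic
topics = rng.dirichlet(np.ones(len(vocabulary)), size=number_of_topics)

# Loop for generating documents
documents = []
for length in document_lengths:
    # Get a topic mixture for a specific document
    topic_mixture = rng.dirichlet(np.ones(number_of_topics))
    document = []
    # Loop for selecting words
    for _ in range(length):
        # Get a specific topic by sampling from the document's mixture
        topic = rng.choice(number_of_topics, p=topic_mixture)
        # Sample a word from that topic distribution
        document.append(rng.choice(vocabulary, p=topics[topic]))
    documents.append(document)
```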
        
* Here is a representation of that process using *plate notation*

![Plate notation of LDA](http://mcburton.net/blog/joy-of-tm/images/image02.png)

### Estimation

* Now, the model I just described is just that: a model. It is a mathematically and computationally grounded fiction about how textual documents (as word frequency matrices) are created
* This is just like the simple model for linear regression we talked about last week: $y=mx+b$
* Just like with linear regression, we need to *fit* the model with the data we have
* With linear regression we are trying to figure out the values for $m$ and $b$ (the slope and the y-intercept)
* With LDA we are trying to figure out the topic distributions and the document topic mixtures
* We need to *estimate* the parameters of the model most likely to have generated our documents

![estimating.png](attachment:estimating.png)
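
The linear regression half of this analogy can be sketched in a few lines. Here the data are generated from a known line, $y = 2x + 1$, plus noise, so the estimates can be compared against the truth. The slope, intercept, and noise level are assumptions picked for the example.

```python
import numpy as np

# Generate noisy data from a known line, y = 2x + 1 (values chosen for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# "Fitting" means estimating m and b from the data alone
m, b = np.polyfit(x, y, deg=1)
print(m, b)
```

LDA estimation follows the same logic, except the parameters being recovered are the topic distributions and the document topic mixtures rather than a slope and an intercept.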

* With Scikit Learn this process looks the same for LDA as it does for linear regression
* This code is derived from Scikit Learn's [LDA tutorial](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py) 

In [43]:
### COME BACK TO THIS
import os
import re

# directory of plain-text files to split into sentences
corpus_dir = '/home/jason/Documents/AA Chapters/output/bigbook/'
sentences = []

for fn in os.listdir(corpus_dir):
    with open(os.path.join(corpus_dir, fn), 'r') as infile:
        for line in infile:
            # split each line on sentence-ending punctuation
            for sentence in re.split('[?.;!]', line):
                sentences.append(sentence)
len(sentences)


3884

In [44]:
#dataset = fetch_20newsgroups(#data_home="/ihome/mburton/mcburton/scikit_learn_data",
#                             shuffle=True, random_state=1,
#                             remove=('headers', 'footers', 'quotes'))

dataset = sentences 

# specify how many samples and word features we want in our matrix
n_samples = 2000
n_features = 1000

# extract a subset of the data
#data_samples = dataset.data[:n_samples]
data_samples = dataset[:n_samples]

# instantiate a vectorizer 
tf_vectorizer = CountVectorizer(max_df=0.95, # ignore words in more than 95% of documents
                                min_df=2,    # ignore words in fewer than 2 documents
                                max_features=n_features,
                                stop_words='english')

# vectorize it
tf = tf_vectorizer.fit_transform(data_samples)

# print out a sample of the matrix
pd.DataFrame(tf.toarray(), 
             columns=tf_vectorizer.get_feature_names()) \
            .sample(10).sample(10,axis=1)

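| | roaring | age | quick | husband | adopt | skeletons | require | drink | controversy | matters |
|---|---|---|---|---|---|---|---|---|---|---|
| 1524 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 737 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1713 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1697 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 240 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 725 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1284 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 288 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 646 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |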


In [47]:
# Import the LDA algorithm 
from sklearn.decomposition import LatentDirichletAllocation

# Set the number of topics (just like clusters in k-means)
n_topics = 5

# instantiate the model with hyperparameters
# (newer scikit-learn releases name the first parameter n_components instead of n_topics)
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

# Now estimate the topic distributions from the document-term matrix `tf`
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_jobs=1, n_topics=5, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

* Now that we have trained a model on our existing data we can inspect the *components* of the model
* These components are the estimated topic distributions, that is, the distributions the model assumes were used to generate the documents
* People use the top words in the distributions to get a sense of what was being written about without having to read each and every document
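
The fitted model also gives the other half of what LDA estimates: the topic mixture of each document. Below is a minimal sketch on an invented four-document corpus, assuming a scikit-learn release in which the number of topics is passed as `n_components`. The corpus and the name `toy_lda` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["football games football watch",
        "movies watch movies likes",
        "football football games",
        "movies movies likes"]

counts = CountVectorizer().fit_transform(docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# fit_transform returns one row per document: its mixture over the 2 topics
doc_topics = toy_lda.fit_transform(counts)
print(doc_topics.round(2))
```

Each row sums to 1, so it can be read as the share of the document attributed to each topic.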

In [48]:
# Number of words to display
n_top_words = 20

# Get the actual words, because the model only knows them as column numbers
tf_feature_names = tf_vectorizer.get_feature_names()

# Loop over each topic distribution and print out the top words
for topic_idx, topic in enumerate(lda.components_):
    print()
    print("Topic #%d:" % topic_idx)
    print(" ".join([tf_feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))


Topic #0:
husband man drinking know think make alcoholic let better business family good great self feel does ought things life little

Topic #1:
alcoholics people men new way world anonymous did alcoholism power home course ask spiritual place friends women life fellowship problems

Topic #2:
spiritual experience alcoholic love need father best away just don right problem religious book understand tell family means service general

Topic #3:
years man wife long day face life said far book action thinking new sure mind alcohol friend old lose possible

Topic #4:
god time like thought past drink say got reason family dad step use helpful drinking fellows large suggestions talk ing
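
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the cell above is compact enough to be cryptic. This small sketch unpacks it with invented words and weights.

```python
import numpy as np

words = np.array(["god", "time", "drink", "family", "step"])
weights = np.array([0.9, 0.1, 0.4, 0.7, 0.2])  # invented topic weights
n_top = 3

# argsort gives indices from smallest to largest weight;
# [:-n_top - 1:-1] walks backwards from the end, taking n_top indices
top_idx = weights.argsort()[:-n_top - 1:-1]
print(top_idx)
print(words[top_idx])
```

The result lists the indices of the largest weights first, which is why the printed topic words run from most to least probable.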


* Do these topics make sense for the selection of newsgroups included in this corpus?

In [None]:
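# List the newsgroups included in the dataset
# (requires `dataset` to be the object returned by fetch_20newsgroups, not the `sentences` list assigned in In [44])
dataset.target_names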

### Landscape of Machine Learning Algorithms

* A big part of knowing data science is knowing about the landscape of various algorithms and when to use them.
* This decision tree from the Scikit Learn documentation is helpful (but not comprehensive)

![scikit-learn algorithms cheatsheet](http://scikit-learn.org/stable/_static/ml_map.png)

* We won't spend more time in this course covering the landscape of machine learning
