#### #1
First step, get the newsgroup data.

In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball']
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42, categories=categories)
corpus = dataset.data

I inspected corpus and found a large collection of baseball posts, along with headers, quoted text, and other extraneous material.

#### #2
Adapt code from the LSA lecture to process the data in corpus.

In [2]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/adriancavallaris/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [30]:
#I'm not going to go too crazy on the stop words yet
stopset = set(stopwords.words('english'))
stopset.update(['nntp', 'edu', 'com'])

In [31]:
# Pass the stop words as a list; recent scikit-learn versions reject a set here
vectorizer = TfidfVectorizer(stop_words=list(stopset),
                             use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)
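
To get a feel for why `ngram_range=(1, 3)` produces so many features, here is a rough pure-Python sketch of the n-gram expansion. It only approximates what `TfidfVectorizer` does (the real tokenizer differs), and `word_ngrams` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return every contiguous run of min_n..max_n tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Hand-picked tokens for illustration (assumed to be what is left after stop-word removal)
tokens = ["jewish", "baseball", "players"]
grams = word_ngrams(tokens)
print(grams)
```

Three tokens already yield six features, so the vocabulary grows much faster than the word count. Because scikit-learn drops stop words before it builds the n-grams, words that were not adjacent in the original post can end up in the same bigram or trigram.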

In [24]:
#Run the first entry in corpus to see what the entry looks like
corpus[0]

u"From: writingctr@leo.bsuvc.bsu.edu\nSubject: Re: CUB fever.\nOrganization: Ball State University, Muncie, In - Univ. Computing Svc's\nLines: 21\n\n\nIn article <kingoz.735285670@camelot>, kingoz@camelot.bradley.edu (Orin Roth) writes:\n> \n>    CUB fever is hitting me again. I'm beginning to think they have a \n>    chance this year. (what the heck am i thinking?)\n>    Sorry. Just a moment of incompetence.\n>    I'll be ok. Really. \n>    Orin.\n>    Bradley U.\n> \n> --\n> I'm really a jester in disguise!                                   \nI hear ya!  Then again, we must remember that we are indeed Cub fans, and\nthat the Cubs will eventually blow it.  After all, the Cubs are the easiest\nteam in the National League to root for.  No Pressure.  You know they will\nlose eventually.  Oh well, I suppose we must have faith.  After all, they\ndo look pretty good, and they don't even have Sandberg back yet.  \n\nCUBS IN '93!!!!!\n\nCHA\n"
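
The entry above starts with header lines (From, Subject, Organization, Lines) whose words end up in the vocabulary. As a minimal sketch, assuming the header always ends at the first blank line (true for this post, not checked for the whole corpus), the two parts could be separated like this. `split_header` is a hypothetical helper and `sample` is an abbreviated copy of the post. `fetch_20newsgroups` also accepts a `remove=('headers', 'footers', 'quotes')` argument that does this kind of stripping at load time.

```python
def split_header(post):
    """Split a post into (header, body) at the first blank line."""
    header, _, body = post.partition("\n\n")
    return header, body

# Abbreviated version of corpus[0], for illustration only
sample = (
    "From: writingctr@leo.bsuvc.bsu.edu\n"
    "Subject: Re: CUB fever.\n"
    "Lines: 21\n"
    "\n"
    "I hear ya!  Then again, we must remember that we are indeed Cub fans.\n"
)

header, body = split_header(sample)
print(body)
```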

In [25]:
#Get the shape so I can include all concepts
X.shape

(994, 191592)

In [26]:
lsa = TruncatedSVD(n_components=994, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=994, n_iter=100,
       random_state=None, tol=0.0)
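
Fitting all 994 components is the expensive part of this notebook. As a hedged sketch of the alternative, the cell below fits only two concepts on a tiny corpus invented for illustration (not the newsgroup data) and looks at `explained_variance_ratio_` to see how much each concept captures. The `toy_` names are made up for this example.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration (not the newsgroup data)
toy_corpus = [
    "the cubs won the game",
    "the cubs lost the game",
    "hall of fame voting for pitchers",
    "pitchers in the hall of fame",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_corpus)

# Keep only 2 concepts; random_state makes the randomized solver repeatable
toy_lsa = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
doc_concepts = toy_lsa.fit_transform(toy_X)

print(doc_concepts.shape)                 # one row per document, one column per concept
print(toy_lsa.explained_variance_ratio_)  # share of variance captured by each concept
```

On the real data, a much smaller `n_components` than 994 would likely cut the run time considerably, at the cost of discarding the later concepts.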

#### #3
Print discovered concepts

In [27]:
import sys

In [28]:
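# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_):
    # Pair each term with its weight in this concept
    termsInComp = zip(terms, comp)
    # Keep the 10 terms with the largest weights
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sortedTerms:
        print(term[0])
    print(" ")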

Concept 0:
year
team
would
writes
game
article
cs
re
baseball
players
 
Concept 1:
jewish
lafayette
lafibm
vb30
lafayette vb30
lafibm lafayette
lafibm lafayette vb30
baseball players
jewish baseball
jewish baseball players
 
Concept 2:
03
02
04
05
01
lost
06
03 03
won
00
 
Concept 3:
morris
ibm
aix
kingston
maynard
ca
jack
laurentian
team
cs
 
Concept 4:
ibm
aix
kingston
aix kingston
aix kingston ibm
kingston ibm
mjones
gant
mike jones
hirschbeck
 
Concept 5:
bonds
williams
batting
aix
clutch
ibm
kingston
batting 4th
4th
punjabi
 
Concept 6:
gant
hirschbeck
uiuc
cs uiuc
steph
stephenson
dale
dale stephenson
cs
defensive
 
Concept 7:
indiana
journalism
journalism indiana
dwarner
dwarner journalism
dwarner journalism indiana
david
mail
reply
1993 rap
 
Concept 8:
clutch
netcom
sabo
mss
mss netcom
mark singer
singer
samuel
performance
mark
 
Concept 9:
hall
hall fame
fame
kingman
dave
dave kingman
winfield
princeton
smith
roger
 
Concept 10:
hall
games
fame
game
kingman
duke
hall fame
dav
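
The printing loop sorts every term for every concept just to keep the top ten. A possible shortcut, sketched here with made-up terms and weights standing in for `terms` and one row of `lsa.components_`, is `heapq.nlargest`, which avoids the full sort. `top_terms` is a hypothetical helper.

```python
import heapq

def top_terms(terms, weights, k=10):
    """Return the k terms with the largest weights, highest first."""
    pairs = heapq.nlargest(k, zip(terms, weights), key=lambda pair: pair[1])
    return [term for term, _ in pairs]

# Made-up stand-ins for `terms` and one row of lsa.components_
toy_terms = ["year", "team", "game", "baseball", "players"]
toy_weights = [0.31, 0.28, 0.12, 0.25, 0.05]

best = top_terms(toy_terms, toy_weights, k=3)
print(best)
```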

I definitely learned a valuable lesson here. I initially thought something was wrong with the interpreter, but then I realized the run took about 10 minutes because I computed all 994 concepts. Apart from NLTK's English stop words, I only removed nntp, com, and edu, but I think a lot of the concepts offer interesting and potentially valuable information as they stand.

I don't think you touched on this in the lecture, but I think the main value of this method (at least at this early stage in the semester) is being able to search and process a huge data source quickly and effectively. Instead of searching one big ugly data file for keywords, I now have a nicely organized list of concepts, each with its top keywords. It's pretty awesome.