LSA on a group of newsgroup posts from the newsgroup 'rec.sport.baseball.'
<li> Get the newsgroup data
<li> As the data is already in text format we dont need to convert them into XML
<li> Take the words and convert them into matrix
<li> LSA
    <li> Read Data
    <li> Stop Words
    <li> TF-IDF Vectorizing
    <li> Know the terms

In [26]:
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from __future__ import print_function

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Arati/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### GET THE NEWSGROUP DATA

In [13]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball']
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)
corpus = dataset.data


### Converting the data into lowercase

In [14]:
postDocs = [x.lower() for x in corpus]


In [9]:
postDocs

["from: writingctr@leo.bsuvc.bsu.edu\nsubject: re: cub fever.\norganization: ball state university, muncie, in - univ. computing svc's\nlines: 21\n\n\nin article <kingoz.735285670@camelot>, kingoz@camelot.bradley.edu (orin roth) writes:\n> \n>    cub fever is hitting me again. i'm beginning to think they have a \n>    chance this year. (what the heck am i thinking?)\n>    sorry. just a moment of incompetence.\n>    i'll be ok. really. \n>    orin.\n>    bradley u.\n> \n> --\n> i'm really a jester in disguise!                                   \ni hear ya!  then again, we must remember that we are indeed cub fans, and\nthat the cubs will eventually blow it.  after all, the cubs are the easiest\nteam in the national league to root for.  no pressure.  you know they will\nlose eventually.  oh well, i suppose we must have faith.  after all, they\ndo look pretty good, and they don't even have sandberg back yet.  \n\ncubs in '93!!!!!\n\ncha\n",
 'from: schmke@cco.caltech.edu (kevin todd schmi

### Setting up stop words


In [15]:
stopset = set(stopwords.words('english'))
stopset.update(['lt','p','/p','br','amp','quot','field','font','normal','span','0px','rgb',
                'style','51','spacing','text','helvetica','size','family', 'space', 'arial',
                'height', 'indent', 'letter','line','none','sans','serif','transform','line',
                'variant','weight','times', 'new','strong', 'video', 'title'
                'white','word','letter', 'roman','0pt','16','color','12','14','21',
                'neue', 'apple', 'class','me','again','you','in',  ])



### TF-IDF Vectorizing

In [16]:
postDocs[0]

"from: writingctr@leo.bsuvc.bsu.edu\nsubject: re: cub fever.\norganization: ball state university, muncie, in - univ. computing svc's\nlines: 21\n\n\nin article <kingoz.735285670@camelot>, kingoz@camelot.bradley.edu (orin roth) writes:\n> \n>    cub fever is hitting me again. i'm beginning to think they have a \n>    chance this year. (what the heck am i thinking?)\n>    sorry. just a moment of incompetence.\n>    i'll be ok. really. \n>    orin.\n>    bradley u.\n> \n> --\n> i'm really a jester in disguise!                                   \ni hear ya!  then again, we must remember that we are indeed cub fans, and\nthat the cubs will eventually blow it.  after all, the cubs are the easiest\nteam in the national league to root for.  no pressure.  you know they will\nlose eventually.  oh well, i suppose we must have faith.  after all, they\ndo look pretty good, and they don't even have sandberg back yet.  \n\ncubs in '93!!!!!\n\ncha\n"

### Scikit-learn's TF-IDF vectorizer is used to take corpus data set and convert each document into a sparse matrix of TFIDF Features...

In [17]:
vectorizer = TfidfVectorizer(stop_words=stopset,
                                 use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(postDocs)





In [18]:
X[0]

<1x188088 sparse matrix of type '<class 'numpy.float64'>'
	with 229 stored elements in Compressed Sparse Row format>

In [28]:
print (X[0])

  (0, 51441)	0.0739614450823
  (0, 187358)	0.0739614450823
  (0, 28939)	0.0739614450823
  (0, 145086)	0.0739614450823
  (0, 64046)	0.0739614450823
  (0, 77593)	0.0739614450823
  (0, 132368)	0.0739614450823
  (0, 102622)	0.0739614450823
  (0, 66435)	0.0739614450823
  (0, 113709)	0.0739614450823
  (0, 161841)	0.0739614450823
  (0, 179412)	0.0739614450823
  (0, 118825)	0.0739614450823
  (0, 64186)	0.0739614450823
  (0, 103114)	0.0739614450823
  (0, 94983)	0.0739614450823
  (0, 132283)	0.0739614450823
  (0, 142244)	0.0739614450823
  (0, 97593)	0.0739614450823
  (0, 114349)	0.0739614450823
  (0, 164212)	0.0739614450823
  (0, 59267)	0.0739614450823
  (0, 51481)	0.0739614450823
  (0, 35716)	0.0739614450823
  (0, 64172)	0.0739614450823
  :	:
  (0, 183911)	0.015655015889
  (0, 142480)	0.0555802309707
  (0, 121494)	0.111160461941
  (0, 37508)	0.0996785439776
  (0, 40682)	0.105257944488
  (0, 16584)	0.0739614450823
  (0, 94387)	0.111160461941
  (0, 25764)	0.0161922460089
  (0, 100274)	0.010320800

In [29]:
X.shape

(994, 188088)

In [30]:
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=27, n_iter=100,
       random_state=None, tol=0.0)

### First row 

In [31]:
lsa.components_[0]

array([ 0.01601322,  0.00499985,  0.00078314, ...,  0.00105261,
        0.00105261,  0.00105261])

In [37]:
import sys
print (sys.version)

3.5.2 |Anaconda 4.1.1 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]


In [38]:
terms = vectorizer.get_feature_names()
for i, comp in enumerate(lsa.components_): 
    termsInComp = zip (terms,comp)
    sortedTerms =  sorted(termsInComp, key=lambda x: x[1], reverse=True) [:10]
    print ("Concept %d:" % i)
    for term in sortedTerms:
        print (term[0])
    print (" ")

Concept 0:
edu
com
year
writes
team
would
game
article
cs
baseball
 
Concept 1:
com
writes
baseball
last
pitcher
nntp posting host
posting host
baseball players
organization
jewish baseball players
 
Concept 2:
com
better
would
players
game
writes
clutch
ibm
usa
organization university
 
Concept 3:
time
last
posting
com
game
runs
edu
aix
ibm
00 00 chicago
 
Concept 4:
posting
season
well
host
really
runs
team
pitching
back
reply
 
Concept 5:
good
year
posting
league
would
run
uiuc
way
pitching
ball
 
Concept 6:
00
00 00
first
would
john
braves
game
00 00 00
better
runs
 
Concept 7:
00
00 00
00 00 00
00 00 000
sox
get
think
00 00 01
know
probably
 
Concept 8:
hit
university
00
lines
pitching
back
game
last
nntp posting host
posting host
 
Concept 9:
writes
00
baseball
would
season
com
00 00
university
morris
play
 
Concept 10:
one
university
lines
like
players
edu
posting
first
would
david
 
Concept 11:
know
last
year
hitter
well
may
teams
00 00 000
last year
mike
 
Concept 12:
game
pos

In [39]:
lsa.components_

array([[ 0.01601322,  0.00499985,  0.00078314, ...,  0.00105261,
         0.00105261,  0.00105261],
       [-0.00877844, -0.00791926, -0.02896111, ..., -0.00082239,
        -0.00082239, -0.00082239],
       [-0.03918546, -0.02242537, -0.01590993, ...,  0.00158357,
         0.00158357,  0.00158357],
       ..., 
       [ 0.0079341 , -0.02801328,  0.04591085, ..., -0.00053878,
        -0.00053878, -0.00053878],
       [-0.00725124, -0.01022441, -0.02442806, ...,  0.00122791,
         0.00122791,  0.00122791],
       [ 0.00480382,  0.00910228, -0.00976061, ..., -0.00060225,
        -0.00060225, -0.00060225]])