### LSA Experiment on the 20 Newsgroups Text Dataset

In [1]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/asyam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Use only the sci.space newsgroup for this experiment
categories = ['sci.space']
dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42, categories=categories)
corpus = dataset.data
corpus = [x.lower() for x in corpus]

### TF-IDF Vectorization

In [4]:
corpus[0]

'from: aws@iti.org (allen w. sherzer)\nsubject: re: orbital repairstation\norganization: evil geniuses for a better tomorrow\nlines: 20\n\nin article <c5hcbo.joy@zoo.toronto.edu> henry@zoo.toronto.edu (henry spencer) writes:\n\n>the biggest problem with this is that all orbits are not alike.  it can\n>actually be more expensive to reach a satellite from another orbit than\n>from the ground.  \n\nbut with cheaper fuel from space based sources it will be cheaper to \nreach more orbits than from the ground.\n\nalso remember, that the presence of a repair/supply facility adds value\nto the space around it. if you can put your satellite in an orbit where it\ncan be reached by a ready source of supply you can make it cheaper and gain\nbenefit from economies of scale.\n\n  allen\n-- \n+---------------------------------------------------------------------------+\n| lady astor:   "sir, if you were my husband i would poison your coffee!"   |\n| w. churchill: "madam, if you were my wife, i would 

In [5]:
stopset = set(stopwords.words('english'))
stopset.update(['18084tm','__','___','_____','acad3','access','added','af','afit','alaska','also','april','article','au','available','baalke','base','bbs','borden','ca','cacs','cain','caltech','claudio','cmu','com','command','could','cs','cso','daily','darling','david','dc','digest','digex','distribution','dseg','dublin','eder','edu','egalon','eng','enzo','fidonet','fnal','fnalf','forwarded','fraering','fred','free','gary','gehrels','gif','gothamcity','gov','government','henry','hst','ibm','ie','image','images','institute','international','ireland','isu','james ','jeff','jpl','jsc','ke4zv','kelvin','king','kjenks','larc','larrison ','lick','like','lines','loss','mankato','matthew ','mccall','mil','moon','msb','msu','msus','net','nicho','nsmca','ofa123','oh','oliveira','oort','org','organization','palmer','pat','pgf','phil','prb','raider','rborden','read','ron','ross','sci','sigurdsson','smiley','software','speak','spencer','sq','steinn','subject','sysmgr','tcd','temporary','test','ti','tkelso','topaz','toronto','ts','ucsc','uiuc','uk','umd','updated','usl','uvic','vacation','vax1','venari','via','victor','vnet','washington','would','wpi','wright','zoo',])

In [6]:
# Recent scikit-learn releases expect stop_words as a list rather than a set.
vectorizer = TfidfVectorizer(stop_words=sorted(stopset),
                             use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)
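
To make the weights stored in `X` less opaque, here is a minimal sketch of the default TF-IDF weighting (raw term counts, smoothed IDF, L2 row normalization) written in plain Python. The toy corpus and the `tfidf_row` helper are hypothetical, and whitespace splitting stands in for the vectorizer's real tokenizer, so treat it as an illustration of the formula rather than a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the newsgroup data).
docs = [
    "shuttle launch orbit",
    "shuttle mission",
    "gamma ray burst",
]

# Simplification: split on whitespace instead of using a regex tokenizer.
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term occurs in.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    tf = Counter(tokens)
    raw = {
        t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
        for t, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items()):
    print(f"{term:8s} {weight:.4f}")
```

Terms that occur in fewer documents get a larger IDF factor, so in the first toy document `launch` and `orbit` outweigh `shuttle`, which also appears in the second document.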

In [7]:
print(X[0])

  (0, 34185)	0.0498322212419
  (0, 114771)	0.0494776232733
  (0, 24011)	0.0872670364014
  (0, 197853)	0.0481621165785
  (0, 156675)	0.0392436656799
  (0, 183531)	0.0727026934559
  (0, 80901)	0.0536625361385
  (0, 94730)	0.0559038066394
  (0, 37638)	0.0326014427375
  (0, 226766)	0.0526907149563
  (0, 7952)	0.0349027842898
  (0, 43542)	0.0804443835168
  (0, 116335)	0.064961003395
  (0, 246150)	0.0169840725169
  (0, 38258)	0.0624687358625
  (0, 172136)	0.0366402639184
  (0, 157102)	0.102750416523
  (0, 23947)	0.0804443835168
  (0, 20641)	0.0361005986642
  (0, 82156)	0.0447187301827
  (0, 179660)	0.101938043517
  (0, 190811)	0.0741831878493
  (0, 26533)	0.0370915939246
  (0, 156068)	0.0575770691298
  (0, 99528)	0.084181062362
  :	:
  (0, 156478)	0.0804443835168
  (0, 179765)	0.0804443835168
  (0, 180068)	0.0804443835168
  (0, 204486)	0.0804443835168
  (0, 215807)	0.0804443835168
  (0, 131390)	0.0804443835168
  (0, 48600)	0.0804443835168
  (0, 93330)	0.0804443835168
  (0, 37263)	0.080444383

### LSA

In [8]:
X.shape

(987, 249225)

In [9]:
# The scikit-learn documentation recommends n_components=100 for LSA.
lsa = TruncatedSVD(n_components=100, n_iter=10)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=10,
       random_state=None, tol=0.0)
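
As a rough sanity check on what the fitted model provides, the sketch below runs the same TF-IDF plus `TruncatedSVD` pipeline on a tiny, hypothetical corpus. The corpus, `n_components=2`, and `random_state=42` are assumptions chosen for illustration and do not reproduce the run above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus with two loose topics (launches and gamma rays).
toy_corpus = [
    "shuttle launch orbit mission",
    "shuttle mission launch delayed",
    "gamma ray bursters observed",
    "gamma ray energy model",
]

toy_X = TfidfVectorizer().fit_transform(toy_corpus)

# Fixing random_state makes the randomized solver repeatable across runs.
toy_lsa = TruncatedSVD(n_components=2, n_iter=10, random_state=42)

# fit_transform returns one dense vector per document in the reduced space.
doc_vectors = toy_lsa.fit_transform(toy_X)

print(doc_vectors.shape)            # (n_documents, n_components)
print(toy_lsa.components_.shape)    # (n_components, n_features)
print(toy_lsa.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` gives a quick indication of how much of the TF-IDF variance the retained components capture, which can help when judging whether a given `n_components` is reasonable for a corpus.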

In [10]:
lsa.components_[0]

array([ 0.0150963 ,  0.00078789,  0.00040133, ...,  0.00217403,
        0.00217403,  0.00217403])

In [11]:
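# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    # Pair each n-gram with its weight in this component and keep the 10 largest.
    terms_in_comp = zip(terms, comp)
    sorted_terms = sorted(terms_in_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sorted_terms:
        print(term[0])
    print(" ")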

Concept 0:
space
nasa
writes
shuttle
one
orbit
launch
mission
earth
posting
 
Concept 1:
elements
element
shuttle
air force technology
force technology
two line
orbital
current
kelso
element sets
 
Concept 2:
gene
theporch
billion
gene theporch
theporch gene
reward
year
gene theporch gene
first
residents
 
Concept 3:
gamma
energetic
gamma ray
ray
bursters
gamma ray bursters
ray bursters
big capacitor
capacitor
model
 
Concept 4:
sky
night
rights
night sky
vandalizing
vandalizing sky
george
light
canon
george krumins
 
Concept 5:
mission
servicing
zoology
boost
arrays
servicing mission
shuttle
mission scheduled
sky
11 days
 
Concept 6:
kuiper
space
object
nicoll
uwo
prism
james
belt object
kuiper belt object
kuiper belt
 
Concept 7:
higgins
allen
josh
josh hopkins
jbh55289
jbh55289 uxa
hopkins
iti
manned
uxa
 
Concept 8:
kuiper
object
nicoll
james
uwo
prism
belt object
kuiper belt object
karla
kuiper belt
 
Concept 9:
knox
bby
bond
gregory bond
gnb
knox box
gregory
buckeridge
buckeridge
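
The listing above ranks terms by signed weight. Because the sign of an SVD component is arbitrary, terms with large negative loadings can be just as characteristic of a concept. The sketch below is one possible variation that ranks by absolute weight. The `top_terms` helper and the toy loadings are hypothetical and not part of the original experiment.

```python
def top_terms(terms, weights, k=10, by_abs=True):
    """Return the k (term, weight) pairs with the largest loadings."""
    if by_abs:
        key = lambda tw: abs(tw[1])
    else:
        key = lambda tw: tw[1]
    return sorted(zip(terms, weights), key=key, reverse=True)[:k]


# Hypothetical loadings for a single component (illustrative values only).
toy_terms = ["space", "nasa", "gamma", "ray", "orbit"]
toy_weights = [0.40, 0.35, -0.60, -0.55, 0.05]

signed = top_terms(toy_terms, toy_weights, k=3, by_abs=False)
absolute = top_terms(toy_terms, toy_weights, k=3, by_abs=True)

print([t for t, _ in signed])
print([t for t, _ in absolute])
```

With the notebook's objects, this could be called as `top_terms(terms, comp)` inside the loop over `lsa.components_`.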