# Latent Semantic Analysis Lab
**University of Illinois**
<br>CSC 570 - Data Science Essentials
<br>Author: Arthur Putnam

## Lab Directions
Your assignment for this week is to do LSA on a group of newsgroup posts from the newsgroup 'rec.sport.baseball'. (Feel free to pick another newsgroup if you like; the list is here: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)

1.  To get the newsgroup data, use this code:

from sklearn.datasets import fetch_20newsgroups<br>
categories = ['rec.sport.baseball']<br>
dataset = fetch_20newsgroups(subset='all',shuffle=True, random_state=42, categories=categories)<br>
corpus = dataset.data<br>

2.  Next, you'll be adapting my LSA code for your problem.  This shouldn't be too hard, but please spend some time understanding what my code is doing.  

3.  When you print the discovered concepts you'll probably find they don't make sense.  Consider adjusting the words in the stop word list to remove things like nntp, and people's names...

4.  Once you're satisfied with your work, submit the link to your work.

## Libraries 
* NLTK - Natural Language Toolkit [http://www.nltk.org/](http://www.nltk.org/)
* sklearn - Python Machine Learning kit [http://scikit-learn.org/stable/](http://scikit-learn.org/stable/)

## Terms
Latent Semantic Analysis (LSA) - a technique in natural language processing, in particular distributional semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. [Wikipedia](https://en.wikipedia.org/wiki/Latent_semantic_analysis)

corpus - a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. [Merriam-Webster](https://www.merriam-webster.com/dictionary/corpus)
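
One way to make the Latent Semantic Analysis definition above concrete is the truncated singular value decomposition it is usually built on. As a sketch, assume the documents-as-rows layout that scikit-learn uses, with $X$ an $n \times m$ TF-IDF matrix ($n$ documents, $m$ terms) and $k$ the chosen number of concepts. These symbols are introduced here for illustration and do not appear in the lab text:

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{n \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{m \times k}
$$

Each row of $V_k^{\top}$ can be read as a concept (one weight per term), and each row of $U_k \Sigma_k$ gives one document's coordinates in that concept space.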

## Data Set
[The 20 newsgroups text dataset](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)

In [198]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball']
dataset = fetch_20newsgroups(subset='all',shuffle=True, 
                             random_state=42, 
                             categories=categories, 
                             remove=('headers', 'footers', 'quotes'))
corpus = dataset.data
print("Data set loaded!")

Data set loaded!


## Data preparation
Before vectorizing, we clean each post:
* Lowercase the text
* Remove email addresses
* Remove punctuation and digits
* Remove common first names
* Remove common words like: and, the, or, of.

In [199]:
import nltk
import re
import string
from nltk.corpus import stopwords
from nltk.corpus import names
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
nltk.download('stopwords')
nltk.download('names')

# a set of words to 'ignore' 
stopset = set()
# add common words from the english language 
stopset.update(stopwords.words('english'))
# add common names to the stopset (lowercased)
stopset.update([name.lower() for name in names.words()])
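# words that were found to have little meaning
# note: filtering compares single lowercased tokens, so capitalized entries
# ('Alomar') and multi-word entries ('oh well') in this list never match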
stopset.update(['from', 'to', 're', 'subject', 'dl', 'i think', 'dont', 'th', 'maybe', 'gilkey', 'yorku', 
                'Alomar', 'alomar', 'baerga', 'Baerga', 'kubey', 'Kubey', 'kirsch', 'Kirsch', 'Traven', 
                'traven', 'koufax', 'would', 'think', 'list', 'thanks', 'mailing','mailing list','please',
                'anyone','email','mail','send','please email','extra', 'dcon','dops','nhs','contribution',
                'compared','hes','pm','am', 'net', 'com','md','hp','hewlettpackard', 'animal', 'beyond', 
                'natural', 'yall', 'chop', 'spanishspeaking', 'although', 'internet', 'comes', 'tanstaafl',
                'something', 'like', 'rf','era','cf','lf','bb', 'idle', 'hs', 'formula', 'thought', 'ap',
                'also', 'read', 'able', 'much', 'humor', 'ss', 'ab', 'rbi', 'bchmbiochemdukeedu', 'ls', 'gif','oh well',
               'neb', 'anyway', 'want', 'find', 'ny', 'widespread', 'fla', 'spanish', 'could', 'wife', 'young', 'name'])

def remove_emails(text):
    """ Uses regex to remove email addresses from text"""
    return re.sub(r'\S*@\S*\s?', '', text)

def remove_numbers(text):
    """ Removes every digit character from the text """
    return ''.join(ch for ch in text if not ch.isdigit())

def lowercase(text):
    """ Lowercases the text """
    return text.lower()

def remove_punctuation(text):
    """ Removes every character listed in string.punctuation """
    return "".join(ch for ch in text if ch not in string.punctuation)

def remove_stopset_words(text):
    """ Drops whitespace-separated tokens that appear in stopset """
    querywords = text.split()
    resultwords = [word for word in querywords if word.lower() not in stopset]
    result = ' '.join(resultwords)
    return result

def clean_text(text):
    """ Runs every cleaning step on one post, in order """
    text = lowercase(text)
    text = remove_emails(text)
    text = remove_punctuation(text)
    text = remove_stopset_words(text)
    text = remove_numbers(text)
    return text
    
cleaned_corpus = [clean_text(text) for text in corpus]
print('The data has been lowercased and emails, punctuation, numbers, and stopwords have been removed.')

[nltk_data] Downloading package stopwords to C:\Users\Aj-
[nltk_data]     Pu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to C:\Users\Aj-
[nltk_data]     Pu\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
The data has been lowercased and emails, punctuation, numbers, and stopwords have been removed.


### Data before being cleaned

In [200]:
print(corpus[0])

I hear ya!  Then again, we must remember that we are indeed Cub fans, and
that the Cubs will eventually blow it.  After all, the Cubs are the easiest
team in the National League to root for.  No Pressure.  You know they will
lose eventually.  Oh well, I suppose we must have faith.  After all, they
do look pretty good, and they don't even have Sandberg back yet.  

CUBS IN '93!!!!!


### Data after being cleaned

In [201]:
print(cleaned_corpus[0])

hear ya must remember indeed cub fans cubs eventually blow cubs easiest team national league root pressure know lose eventually oh well suppose must look pretty good even sandberg back yet cubs 
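
The cleaned post above still contains "oh well" even though `'oh well'` is in the stop list. A likely reason is that the filter compares one whitespace-separated token at a time, so a two-word entry can never equal a single token. The sketch below reproduces this on a miniature stop list. The sample sentence, the miniature `stopset`, and the `remove_stop_phrases` helper are illustrative assumptions, not part of the lab code.

```python
import re

# Hypothetical miniature stop list: one single word and one two-word phrase.
stopset = {"think", "oh well"}

def remove_stopset_words(text, stopset):
    """Token-level filter, same idea as the lab's remove_stopset_words."""
    return " ".join(w for w in text.split() if w.lower() not in stopset)

def remove_stop_phrases(text, phrases):
    """Hypothetical helper: delete whole phrases before token filtering."""
    for phrase in phrases:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    # collapse the extra spaces left behind by the substitutions
    return " ".join(text.split())

sample = "oh well i think we must have faith"

# 'think' is removed, but the phrase 'oh well' survives the token filter
token_filtered = remove_stopset_words(sample, stopset)
print(token_filtered)

# removing the phrase explicitly is one possible workaround
phrase_filtered = remove_stop_phrases(token_filtered, ["oh well"])
print(phrase_filtered)
```

If phrase removal matters, an alternative is to add the individual words (`'oh'`, `'well'`) to the stop list, at the cost of also removing them where they appear on their own.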


## TF-IDF Vectorizing

In [202]:
# recent scikit-learn versions require stop_words to be a list, not a set
vectorizer = TfidfVectorizer(stop_words=list(stopset), use_idf=True, ngram_range=(1,3))
X = vectorizer.fit_transform(cleaned_corpus)

In [203]:
# rows are documents, columns are terms (1- to 3-word n-grams)
print('Number of documents:', X.shape[0])
print('Number of terms:', X.shape[1])

Number of documents: 994
Number of terms: 100222
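
The term count is far larger than the document count mainly because `ngram_range=(1,3)` adds every two-word and three-word sequence as its own column. The sketch below shows the effect on a made-up two-document corpus. The documents and variable names are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the newsgroup data.
docs = ["cubs lose game", "cubs win game"]

# unigrams only
uni = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
# unigrams, bigrams, and trigrams, as in the lab
tri = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

n_uni = len(uni.vocabulary_)
n_tri = len(tri.vocabulary_)

print("unigram vocabulary:", sorted(uni.vocabulary_))
print("1-3 gram vocabulary:", sorted(tri.vocabulary_))
print(n_uni, "terms vs", n_tri, "terms")
```

Here four distinct words grow into ten columns. On real posts most n-grams occur only once, so the growth is much steeper.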


In [204]:
X[0]

<1x100222 sparse matrix of type '<class 'numpy.float64'>'
	with 89 stored elements in Compressed Sparse Row format>

In [194]:
print(X[0])

  (0, 36857)	0.0805087888282
  (0, 98197)	0.094599979551
  (0, 54914)	0.133840048217
  (0, 71580)	0.0621540871968
  (0, 41704)	0.0889356534341
  (0, 18381)	0.094599979551
  (0, 27063)	0.0610192531333
  (0, 18421)	0.193076812856
  (0, 25255)	0.184958756475
  (0, 9820)	0.0906088753149
  (0, 23172)	0.10805584877
  (0, 86366)	0.0455113827919
  (0, 55205)	0.0704637489118
  (0, 46478)	0.0504275562406
  (0, 73399)	0.0970480341075
  (0, 67134)	0.0906088753149
  (0, 44858)	0.0452049889868
  (0, 49503)	0.0757082905978
  (0, 58308)	0.0699750666738
  (0, 95310)	0.0443197550349
  (0, 85122)	0.0835921648883
  (0, 49009)	0.0582613171425
  (0, 67160)	0.0575854313267
  (0, 34222)	0.0438484765461
  (0, 24947)	0.0502876353387
  :	:
  (0, 18385)	0.114495007563
  (0, 27091)	0.114495007563
  (0, 18449)	0.114495007563
  (0, 25257)	0.114495007563
  (0, 9824)	0.114495007563
  (0, 18447)	0.114495007563
  (0, 23176)	0.114495007563
  (0, 86641)	0.114495007563
  (0, 55237)	0.114495007563
  (0, 46695)	0.11449500756

In [205]:
lsa = TruncatedSVD(n_components=30, n_iter=100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=30, n_iter=100,
       random_state=None, tol=0.0)
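
The fitted `TruncatedSVD` object holds one row per concept in `components_` and can also project documents into concept space. The sketch below illustrates both on a made-up corpus. The documents, `n_components=2`, and `random_state=42` are assumptions for illustration; the lab's own call leaves `random_state` unset, so its concepts may differ slightly between runs.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the newsgroup data.
docs = [
    "cubs lose game",
    "cubs win game",
    "sox win game",
    "pitcher throws strike",
    "pitcher throws ball",
    "batter hits ball",
]

X = TfidfVectorizer().fit_transform(docs)

# fixing random_state makes the randomized solver repeatable
lsa = TruncatedSVD(n_components=2, random_state=42)

# one row per document, one column per concept
doc_concepts = lsa.fit_transform(X)

# one row per concept, one column per term
concept_terms = lsa.components_

# share of the TF-IDF variance kept by the two concepts
explained = lsa.explained_variance_ratio_.sum()

print("documents x concepts:", doc_concepts.shape)
print("concepts x terms:", concept_terms.shape)
print("explained variance ratio: %.3f" % explained)
```

The explained variance ratio is one rough guide when choosing `n_components`: a very low value suggests the concepts capture little of the corpus.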

In [206]:
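# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    terms_in_comp = zip(terms, comp)
    # keep the 10 terms with the largest weights in this concept
    sorted_terms = sorted(terms_in_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d:" % i)
    for term in sorted_terms:
        print(term[0])
    print(" ")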

Concept 0:
lost
year
game
team
last
games
one
good
hit
sox
 
Concept 1:
lost
new york
york
chicago
san
new
american
angels
national
sox
 
Concept 2:
hits
average
stolen
hits stolen
defensive
fielder
bases
outs
prevented
average fielder
 
Concept 3:
year
last year
clutch
last
better
good
team
years
average
players
 
Concept 4:
team
games
runs
pitching
jays
sox
last year
toronto
staff
last
 
Concept 5:
stats
baseball
players
station
know
sox
jewish
heard
local
day
 
Concept 6:
sox
year
station
last
last year
hit
heard sox
heard sox wrol
know scmets
know scmets yankmes
 
Concept 7:
runs
clutch
sox
stats
hit
station
batting
run
home
pitcher
 
Concept 8:
jewish
kingman jewish
kingman
tune
fingers
staub
pitcher
article
blomberg
greenberg
 
Concept 9:
stats
clutch
tune
kids
games
people
stick
stadium
giants
game
 
Concept 10:
jewish
games
game
field
stick
clutch
year
stats
trash
baseball
 
Concept 11:
year
university
last year
stats
last
pitcher
baseball
rule
colorado boulder
university color