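## LSA (Latent Semantic Analysis)
- Can uncover the latent meaning of documents as well as the latent meaning of words
- Meaning acts as the medium that links documents and words; the reduced dimensions play this role
- Implemented with truncated SVD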
<br/>

**SVD (Singular Value Decomposition):** factorizes an m x n matrix into the product of three matrices <br/><br/>
$$ X = U \Sigma V^T $$ <br/>
``U and V: orthogonal matrices of size m x m and n x n, respectively
Σ: an m x n diagonal matrix ``

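Multiplying the three factors back together reconstructs the original data <br/>
With truncated SVD, however, exact reconstruction is not possible; the product is only the closest approximation available at the chosen rank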
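
The two statements above can be checked numerically. The following is a minimal sketch using NumPy on a small random matrix; the matrix size, the seed, and the names `X_full`, `X_k`, and `k` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # toy m x n matrix with m=6, n=4 (assumption)

# Full SVD: U is (m, m), Vt is (n, n), s holds the min(m, n) singular values
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Place the singular values on the diagonal of an m x n matrix
Sigma = np.zeros(X.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Multiplying the three factors recovers X up to floating-point error
X_full = U @ Sigma @ Vt
print("Exact reconstruction:", np.allclose(X, X_full))

# Truncated SVD: keep only the k largest singular values
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the norm of the discarded singular values
err = np.linalg.norm(X - X_k)
print("Truncated reconstruction error: {:.3f}".format(err))
```

Keeping more singular values shrinks the error, and keeping all of them brings it to zero.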

In [1]:
# Load the 20 Newsgroups dataset
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# train
news_train = fetch_20newsgroups(subset='train',
                                remove=('headers', 'footers', 'quotes'),
                                categories=categories)

# test
news_test = fetch_20newsgroups(subset='test',
                               remove=('headers', 'footers', 'quotes'),
                               categories=categories)

In [3]:
# Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

cachedStopWords = stopwords.words('english')   # stop words

from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

# train/test split
X_train = news_train.data
y_train = news_train.target

X_test = news_test.data
y_test = news_test.target

# Tokenization (raw string so that \w is not treated as a string escape)
reg = RegexpTokenizer(r"[\w']{3,}")
english_stops = set(stopwords.words('english'))

In [6]:
# Import the tokenizer from the pca notebook
import import_ipynb
from pca import tokenizer

# tfidf
tfidf = TfidfVectorizer(tokenizer=tokenizer)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

importing Jupyter notebook from pca.ipynb
# Train score: 0.962
# Test score: 0.761
Original Tfidf matrix shape:  (2034, 20085)
PCA Converted matrix shape:  (2034, 2000)
Sum of explained variance ratio: 1.000
# Train score: 0.962
# Test score: 0.761
# Train score: 0.790
# Test score: 0.718
# Used features: 321 out of  (2034, 20085)
PCA Converted matrix shape:  (2034, 321)
Sum of explained variance ratio: 0.437
# Train score: 0.875
# Test score: 0.751


In [7]:
# LSA
from sklearn.decomposition import TruncatedSVD

# As with PCA, set the number of dimensions to 2000
svd = TruncatedSVD(n_components=2000, random_state=7)
X_train_lsa = svd.fit_transform(X_train_tfidf)
X_test_lsa = svd.transform(X_test_tfidf)

print("LSA Converted X shape: ", X_train_lsa.shape)

print("Sum of explained variance ratio: {:.3f}".format(svd.explained_variance_ratio_.sum()))

LSA Converted X shape:  (2034, 2000)
Sum of explained variance ratio: 1.000
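
A ratio this close to 1 is expected here: a matrix with 2034 rows has rank at most 2034, so 2000 components can capture nearly all of its variance. To see how the ratio grows with the number of components, the cumulative sum of `explained_variance_ratio_` can be inspected. The sketch below uses a synthetic sparse matrix as a stand-in for the TF-IDF matrix; the matrix, the `target` threshold, and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for a TF-IDF matrix (assumption)
X_demo = sparse_random(200, 500, density=0.05, format="csr", random_state=0)

svd_demo = TruncatedSVD(n_components=100, random_state=0)
svd_demo.fit(X_demo)

# Running total of the variance explained by the first 1, 2, ..., 100 components
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

target = 0.5  # hypothetical threshold, not taken from the notebook
reached = np.flatnonzero(cumulative >= target)
if reached.size > 0:
    print("Components needed for target:", reached[0] + 1)
else:
    print("Target not reached; max ratio: {:.3f}".format(cumulative[-1]))
```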


In [8]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_lsa, y_train)

print("# Train score: {:.3f}".format(lr.score(X_train_lsa, y_train)))
print("# Test score: {:.3f}".format(lr.score(X_test_lsa, y_test)))

# Train score: 0.962
# Test score: 0.761


In [9]:
# Reduce to 100 dimensions
svd = TruncatedSVD(n_components=100, random_state=1)
X_train_lsa = svd.fit_transform(X_train_tfidf)
X_test_lsa = svd.transform(X_test_tfidf)

print("LSA Converted X shape: ", X_train_lsa.shape)

print("Sum of explained variance ratio: {:.3f}".format(svd.explained_variance_ratio_.sum()))

lr.fit(X_train_lsa, y_train)

print("# Train score: {:.3f}".format(lr.score(X_train_lsa, y_train)))
print("# Test score: {:.3f}".format(lr.score(X_test_lsa, y_test)))

LSA Converted X shape:  (2034, 100)
Sum of explained variance ratio: 0.209
# Train score: 0.810
# Test score: 0.745
