# SVD
- Singular Value Decomposition (SVD): a method of factoring an m by n real matrix A into a product of three matrices.

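- Geometric meaning of SVD
  - If the matrix A is viewed as the matrix of a linear transformation f(x) = Ax on a coordinate space, an orthogonal matrix corresponds to a rotation (possibly combined with a reflection), and a diagonal matrix corresponds to stretching or shrinking along each basis vector direction.
  - In the decomposition A = abc, a and c are orthogonal matrices and b is a rectangular diagonal matrix. Applying A therefore means rotating by c, then stretching or shrinking by b, then rotating by a. Each singular value (a diagonal entry of b) is the scaling ratio of the transformation along the corresponding direction.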
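
The rotate, scale, rotate reading can be checked numerically. The following is a minimal sketch using NumPy; the 2 by 3 matrix and the vector x are made-up values, and the names a, b, c simply mirror the notation A = abc used in these notes.

```python
import numpy as np

# Hypothetical 2x3 example matrix (not taken from the notes).
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])

# np.linalg.svd returns a, the singular values, and c (already transposed).
a, singular_values, c = np.linalg.svd(A)   # a: 2x2, c: 3x3

# Build the rectangular diagonal matrix b with the same shape as A.
k = len(singular_values)
b = np.zeros(A.shape)
b[:k, :k] = np.diag(singular_values)

# a and c are orthogonal: transpose times matrix gives the identity.
assert np.allclose(a.T @ a, np.eye(2))
assert np.allclose(c @ c.T, np.eye(3))

# Applying c, then b, then a to a vector matches applying A directly.
x = np.array([1.0, -2.0, 0.5])
step_by_step = a @ (b @ (c @ x))
assert np.allclose(step_by_step, A @ x)

# Orthogonal maps preserve length, so only b can change the length of x.
assert np.isclose(np.linalg.norm(c @ x), np.linalg.norm(x))
```

Because the two orthogonal factors never change lengths, all stretching or shrinking in the transformation comes from the singular values on the diagonal of b.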

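- Existence of SVD
  - A singular value decomposition A = abc exists for every m by n matrix A.
  - In other words, every matrix can be decomposed this way, whether or not it is square or invertible.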

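- Non-uniqueness of SVD
  - In the decomposition A = abc of an m by n matrix A, b is unique (once the singular values are sorted in descending order), but there can be several valid choices of a and c.
  - The SVD of A is therefore not unique.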
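
One simple source of non-uniqueness is the sign of the singular vectors. The sketch below, with a made-up 3 by 2 matrix, flips the sign of one column of a and the matching row of c and shows that the product is unchanged while b stays the same.

```python
import numpy as np

# Hypothetical example matrix (not taken from the notes).
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Reduced SVD: a is 3x2, s holds the diagonal of b, c is 2x2.
a, s, c = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the first left and right singular vectors together.
a_alt = a.copy()
c_alt = c.copy()
a_alt[:, 0] *= -1
c_alt[0, :] *= -1

# Both factorizations rebuild A with the same singular values.
reconstructed = a @ np.diag(s) @ c
reconstructed_alt = a_alt @ np.diag(s) @ c_alt
```

This is one reason two solvers run on the same data can return topic columns that differ only in sign.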

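- Applications of SVD
  - Low-rank approximation and data compression
    - Expanding a matrix through its SVD gives a sum of terms, one per singular value, and terms with smaller singular values contribute less to the matrix. Dropping the smallest singular values therefore compresses the data with little loss.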
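
As a rough illustration of the compression idea, the sketch below builds rank-k approximations of a small random matrix and measures the reconstruction error. The matrix, the seed, and the helper name `rank_k_approximation` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical 6x4 matrix drawn from a seeded random generator.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

a, s, c = np.linalg.svd(A, full_matrices=False)


def rank_k_approximation(k):
    """Keep only the k largest singular values and their singular vectors."""
    return a[:, :k] @ np.diag(s[:k]) @ c[:k, :]


# Frobenius-norm error of the approximation for k = 1, 2, 3, 4.
errors = [np.linalg.norm(A - rank_k_approximation(k)) for k in range(1, 5)]

# The error for rank k equals the root of the sum of the squared dropped
# singular values, so it shrinks as more singular values are kept.
expected_rank1_error = np.sqrt(np.sum(s[1:] ** 2))
```

A rank-k approximation of an m by n matrix only needs k(m + n + 1) stored numbers instead of mn, which is where the compression comes from.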

# LSA
- How LSA (Latent Semantic Analysis) works
  - LSA basically uses SVD to compress a TF-IDF matrix and uses the vectors over the compressed dimensions as topic vectors.

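  - 1. Start from a TF-IDF matrix. (In this convention the matrix passed to SVD must be a word-document matrix, with words as rows and documents as columns; if it is laid out the other way, transpose it first.)
  - 2. Compress the matrix with SVD.
  - 3. Use the vectors over the compressed dimensions as topic vectors.
  - Conclusion: the goal is to obtain the word-context (topic) matrix a.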

  - SVD orders its output by importance, so the topics in the LSA result are also ordered by importance. The computation does not tell us what each topic is about, so a person has to interpret the topics.

  - Full SVD decomposes the matrix into m by m, m by n, and n by n matrices. LSA instead reduces (compresses) the dimensionality by keeping only the desired number p of singular values, which are sorted by importance.
    - The result is m by p, p by p, and p by n matrices.

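  - Traditional LSA extracts the topic vector for each word from the matrix a.

  - The more dimensions are removed, the harder it becomes to recover the original term-document matrix.

  - TF-IDF generally performs better than raw BoW counts as the input to LSA.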
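
The truncation step can be sketched on a toy word-document matrix. Everything below is hypothetical: the five terms, the four documents, and the weights are invented for illustration, and plain NumPy SVD stands in for whatever solver a real pipeline would use.

```python
import numpy as np

# Hypothetical word-document matrix: rows are terms, columns are 4 documents.
terms = ["free", "win", "prize", "meeting", "lunch"]
X = np.array([
    [1.0, 1.0, 0.0, 0.0],   # free
    [1.0, 1.0, 0.0, 0.0],   # win
    [1.0, 0.0, 0.0, 0.0],   # prize
    [0.0, 0.0, 1.0, 1.0],   # meeting
    [0.0, 0.0, 1.0, 1.0],   # lunch
])

p = 2  # number of topics (singular values) to keep

a, s, c = np.linalg.svd(X, full_matrices=False)

# Truncate to the p largest singular values: m by p, p by p, p by n.
a_p = a[:, :p]
b_p = np.diag(s[:p])
c_p = c[:p, :]

# Each row of a_p is the p-dimensional topic vector of one term.
word_topic_vectors = a_p

# Reconstruction error grows as fewer topics are kept.
error_p2 = np.linalg.norm(X - a_p @ b_p @ c_p)
error_p1 = np.linalg.norm(X - a[:, :1] @ np.diag(s[:1]) @ c[:1, :])
```

In this toy data "free" and "win" occur in exactly the same documents, so they end up with the same topic vector, which is the kind of latent similarity LSA is meant to expose.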

# PCA
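- PCA (Principal Component Analysis)
  - A representative method for reducing the dimensionality of a data matrix.
  - It reduces the dimensionality while minimizing the loss of the information carried by the original vectors.
  - It is an unsupervised learning method.
  - It is used for preprocessing (reducing the dimensionality of the data).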

- Advantages of PCA
  - Better computational efficiency
  - Data visualization
  - Noise removal
  - Mitigation of overfitting

- Disadvantages of PCA
  - Information is lost, and underfitting can occur.

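- PCA procedure
  - Compute the mean (centroid) -> center the data -> compute the covariance matrix -> extract the eigenvalues and eigenvectors -> select the principal components. The resulting basis is the one that maximizes the variance of the projected data.
  - Principal component selection: take the desired number of eigenvectors, starting from the one with the largest eigenvalue.

  - 1. Center the data so the mean vector is 0 -> compute the covariance matrix of the data -> perform eigendecomposition -> obtain V (the principal component vectors) -> project the centered data onto V (matrix product with V).
  - 2. Compute the mean vector -> subtract it from the data matrix -> apply SVD -> obtain V -> project the centered data onto V (matrix product with V).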
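
The two routes above lead to the same principal components. The sketch below checks this on made-up data with NumPy; the data, the seed, and the variable names are assumptions for illustration, and the components agree only up to the sign of each column.

```python
import numpy as np

# Hypothetical data: 50 samples, 3 features with clearly different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.2])

# Shared first step: center the data so every column has mean 0.
X_centered = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]             # largest eigenvalue first
eigenvalues = eigenvalues[order]
V_eig = eigenvectors[:, order]

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
V_svd = Vt.T

# Project the centered data onto the first k principal components.
k = 2
scores_eig = X_centered @ V_eig[:, :k]
scores_svd = X_centered @ V_svd[:, :k]
```

The eigenvalues of the covariance matrix equal the squared singular values divided by n - 1, so ordering by eigenvalue and ordering by singular value select the same components.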

- PCA vs SVD
  - PCA gives more consistent results, is easier to interpret in terms of principal components, and makes it easy to add centering and normalization steps.

# LSA Based on PCA and Truncated SVD

In [1]:
# PCA-based LSA
import pandas as pd

In [2]:
# Build the TF-IDF vectors, then check the matrix shape and the number of spam messages
sms = pd.read_csv('sms-spam.csv', index_col=0)

In [3]:
sms # rows: document number; columns: spam label (1 = spam) and message text
# That is, the data has 4837 documents and 2 columns.

Unnamed: 0,spam,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
4832,1,This is the 2nd time we have tried 2 contact u...
4833,0,Will ü b going to esplanade fr home?
4834,0,"Pity, * was in mood for that. So...any other s..."
4835,0,The guy did some bitching but I acted like i'd...


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize

In [5]:
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray() # build the dense TF-IDF array



In [7]:
tfidf_docs = pd.DataFrame(tfidf_docs)
tfidf_docs -= tfidf_docs.mean() # centering: subtract each column's mean

In [8]:
print(tfidf_docs.shape) # (number of documents, number of tokens)
print(sms.spam.sum()) # number of spam messages

(4837, 9232)
638


In [11]:
tfidf_docs.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9222,9223,9224,9225,9226,9227,9228,9229,9230,9231
0,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
1,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
2,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,0.096125,0.12734,0.124007,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
3,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
4,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
5,0.150338,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
6,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
7,-0.025643,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,0.262816,0.125986,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
8,0.235275,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05
9,0.062493,-0.00584,-0.000228,-5.3e-05,-0.000156,-0.000943,-0.000463,-0.006695,-0.004035,-0.002745,...,-0.000264,-0.000426,-7.667659e-07,-0.001598,-0.000148,-9.9e-05,-0.00066,-5.5e-05,-5.5e-05,-5.5e-05


In [12]:
# Reduce the dimensionality to 16 principal components with PCA
from sklearn.decomposition import PCA

In [13]:
pca = PCA(n_components=16) # reduce the data to 16 dimensions
pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)

In [15]:
pca

In [16]:
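# Column labels: one per topic (principal component).
columns = [f'topic{i}' for i in range(pca.n_components)]
# Row labels: one per message, with "!" appended when the message is spam.
index = [f'sms{i}{"!"*j}' for (i,j) in zip(range(len(sms)),sms.spam)]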

In [17]:
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=index)
pca_topic_vectors.round(3).head(7)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,-0.003,-0.037,0.011,-0.019,-0.053,-0.039,-0.066,0.012,-0.082,0.006,-0.006,-0.01,-0.036,-0.015,0.038
sms1,0.404,0.094,0.078,0.051,0.1,0.047,-0.023,0.065,0.023,-0.024,-0.005,0.032,-0.039,-0.014,0.051,-0.022
sms2!,-0.03,0.048,-0.09,-0.067,0.091,-0.043,0.0,-0.001,-0.058,0.052,0.125,0.024,-0.035,-0.017,-0.04,0.06
sms3,0.329,0.033,0.035,-0.016,0.052,0.056,0.166,-0.074,0.062,-0.107,0.019,0.02,-0.078,-0.03,0.025,-0.079
sms4,0.002,-0.031,-0.038,0.034,-0.075,-0.092,0.044,0.06,-0.045,0.03,0.026,-0.011,-0.026,0.042,-0.08,-0.012
sms5!,-0.016,-0.059,-0.014,-0.006,0.122,-0.04,-0.005,0.166,-0.022,0.063,0.046,0.06,0.051,0.057,-0.011,-0.023
sms6,-0.066,0.099,0.045,-0.03,-0.032,0.036,0.02,-0.016,-0.016,-0.078,0.081,0.011,0.092,0.056,-0.047,-0.08


In [18]:
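pca.components_.shape # holds the principal component vectors
# 16 -> the data is compressed onto 16 axes, one per topic (principal component).
# 9232 -> number of tokens
# Each of the 16 principal component vectors lives in the 9232-dimensional token space (one dimension per token), so each column gives the topic vector of one word.
# The principal component vectors (the 16 axes) can each be read as a topic, so this matrix works as a topic-word matrix (its transpose is a word-topic matrix).
# Once each of the 9232 dimensions is matched to its token, the result can be printed in a human-readable form.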

(16, 9232)
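
To make the topic-word reading concrete, here is a small sketch that lists the highest-weighted terms of each topic. The vocabulary and the `components` array are invented stand-ins for the real token list and `pca.components_`, and `top_terms` is a hypothetical helper.

```python
import numpy as np

# Hypothetical vocabulary and a stand-in for pca.components_ (2 topics, 5 terms).
terms = np.array(["!", "free", "win", "lunch", "meeting"])
components = np.array([
    [0.7, 0.6, 0.3, -0.2, -0.1],
    [-0.1, 0.0, 0.1, 0.6, 0.7],
])

# Transposing gives one row per term: the word-topic matrix.
term_topic = components.T


def top_terms(topic_row, n=2):
    """Return the n terms with the largest weights in one topic."""
    idx = np.argsort(topic_row)[::-1][:n]
    return terms[idx].tolist()


top_per_topic = [top_terms(row) for row in components]
print(top_per_topic)
```

Reading off the heaviest terms per topic is a common first step when a person tries to name what each topic is about.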

In [20]:
# TfidfVectorizer stores a dictionary mapping each token to its column index; to label the topic-word matrix, the tokens need to be sorted by that index.
tfidf.vocabulary_

{'go': 3807,
 'until': 8487,
 'jurong': 4675,
 'point': 6296,
 ',': 13,
 'crazy': 2549,
 '..': 21,
 'available': 1531,
 'only': 5910,
 'in': 4396,
 'bugis': 1973,
 'n': 5594,
 'great': 3894,
 'world': 8977,
 'la': 4811,
 'e': 3056,
 'buffet': 1971,
 '...': 25,
 'cine': 2277,
 'there': 8071,
 'got': 3855,
 'amore': 1296,
 'wat': 8736,
 'ok': 5874,
 'lar': 4848,
 'joking': 4642,
 'wif': 8875,
 'u': 8395,
 'oni': 5906,
 'free': 3604,
 'entry': 3195,
 '2': 471,
 'a': 1054,
 'wkly': 8933,
 'comp': 2386,
 'to': 8192,
 'win': 8890,
 'fa': 3328,
 'cup': 2608,
 'final': 3450,
 'tkts': 8180,
 '21st': 497,
 'may': 5272,
 '2005': 487,
 '.': 15,
 'text': 8020,
 '87121': 948,
 'receive': 6688,
 'question': 6574,
 '(': 9,
 'std': 7651,
 'txt': 8379,
 'rate': 6628,
 ')': 10,
 't': 7889,
 '&': 7,
 "c's": 2020,
 'apply': 1383,
 '08452810075': 115,
 'over': 6003,
 '18': 438,
 "'": 8,
 's': 6959,
 'dun': 3041,
 'say': 7034,
 'so': 7438,
 'early': 3069,
 'hor': 4207,
 'c': 2019,
 'already': 1268,
 'then': 

In [21]:
column_nums, terms = zip(*sorted(zip(tfidf.vocabulary_.values(), tfidf.vocabulary_.keys())))

In [24]:
terms[:10]

('!', '"', '#', '#150', '#5000', '$', '%', '&', "'", '(')

In [25]:
column_nums[:10]

(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
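
The sort-and-unzip idiom used above can be tried on a tiny dictionary. The `vocabulary` below is a made-up miniature of `tfidf.vocabulary_`, used only to show how the pairs are ordered.

```python
# Hypothetical miniature of tfidf.vocabulary_: token -> column index.
vocabulary = {"go": 3, "until": 4, "!": 0, "free": 2, "$": 1}

# Pair each index with its token, sort by index, then unzip into two tuples.
column_nums, terms = zip(*sorted(zip(vocabulary.values(), vocabulary.keys())))

# An equivalent spelling: sort the tokens by their column index.
terms_alt = tuple(sorted(vocabulary, key=vocabulary.get))

print(column_nums)
print(terms)
```

After sorting, position i of `terms` is the token for column i, which is what allows the tokens to be used as column labels for the component matrix.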

In [26]:
weights = pd.DataFrame(pca.components_, columns = terms, index = [f'topic{i}' for i in range(16)])
weights.head(5).round(3) # each column of this result is the topic vector of one word

Unnamed: 0,!,"""",#,#150,#5000,$,%,&,',(,...,ü'll,–,—,‘,’,“,…,┾,〨ud,鈥
topic0,-0.071,0.008,-0.001,-0.0,-0.001,0.003,-0.0,-0.012,-0.007,-0.005,...,0.003,-0.0,-0.0,-0.004,-0.001,-0.001,-0.002,0.001,0.001,0.001
topic1,-0.064,-0.008,-0.0,0.0,0.001,0.001,0.002,0.016,0.016,-0.001,...,-0.002,-0.001,0.0,-0.004,0.001,0.001,-0.003,-0.001,-0.001,-0.001
topic2,-0.071,-0.027,-0.0,-0.001,-0.002,-0.0,-0.001,-0.059,-0.008,-0.019,...,-0.0,-0.001,0.0,-0.002,-0.0,-0.001,-0.002,0.001,0.001,0.001
topic3,-0.059,-0.032,-0.001,-0.0,-0.001,0.001,-0.003,-0.028,0.001,-0.01,...,-0.001,-0.001,0.0,0.0,-0.0,-0.0,0.001,0.001,0.001,0.001
topic4,0.381,-0.008,0.001,0.001,0.004,0.003,-0.001,0.067,0.002,0.02,...,0.0,-0.0,0.0,0.0,0.001,0.001,0.002,0.001,0.001,0.001


In [27]:
# Explore the topics associated with words likely to appear in discount ads or ads for illicit products
deals = weights['! ;) :) half off free crazy deal only $ 80 %'.split()].round(3)*100
deals

Unnamed: 0,!,;),:),half,off,free,crazy,deal,only,$,80,%
topic0,-7.1,0.1,-0.5,-0.0,-0.4,-2.0,-0.0,-0.1,-2.2,0.3,-0.0,-0.0
topic1,-6.4,-0.0,-7.4,-0.1,-0.4,2.3,0.2,0.1,3.8,0.1,0.0,0.2
topic2,-7.1,-0.2,0.1,-0.0,-0.3,-4.4,-0.1,0.1,-0.7,-0.0,-0.0,-0.1
topic3,-5.9,-0.3,-7.1,0.2,0.3,-0.2,0.0,0.1,-2.3,0.1,-0.1,-0.3
topic4,38.1,-0.1,-12.5,-0.1,-0.2,9.8,0.1,-0.2,3.0,0.3,0.1,-0.1
topic5,-26.5,0.1,-1.6,-0.3,-0.7,-1.3,-0.6,-0.2,-1.8,-0.9,0.0,0.0
topic6,10.8,0.5,-19.9,0.4,0.9,0.5,0.2,0.1,1.4,0.0,0.0,0.1
topic7,16.1,0.1,-18.1,0.8,0.8,-2.9,-0.0,0.0,-1.8,-0.3,0.0,-0.1
topic8,34.4,0.1,5.1,-0.4,-0.5,0.1,-0.4,-0.4,3.3,-0.6,-0.0,-0.2
topic9,7.1,-0.3,16.4,1.4,-0.9,6.2,-0.5,-0.4,3.1,-0.5,-0.0,-0.0


In [29]:
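# Summing these values across the selected words for each topic gives a rough measure of how strongly the topic is associated with discount ads or ads for illicit products.
deals.sum(axis = 1)

# Sum of the weights of words that are strongly associated with spam.
# Among the rows displayed (topic0 to topic9), topics 4, 8, and 9 show high scores; topics 11, 13, and 14 were also noted as high but are not shown in this output.
# A negative value suggests the topic is likely opposed to deals.

# Limitations of LSA
# Even though LSA derives latent meaning from the TF-IDF matrix and compresses it into topics, identifying and explaining those topics remains relatively hard.
# The compressed, latent meaning is quantified, but it can be difficult to interpret intuitively.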

Unnamed: 0,0
topic0,-11.9
topic1,-7.6
topic2,-12.7
topic3,-15.5
topic4,38.2
topic5,-33.8
topic6,-5.0
topic7,-5.4
topic8,40.5
topic9,31.6


In [30]:
# LSA based on Truncated SVD
# TruncatedSVD approximates the singular values and singular vectors with an iterative solver, so the number of iterations is raised above the default (5) to get a more stable result.
from sklearn.decomposition import TruncatedSVD

In [31]:
svd = TruncatedSVD(n_components=16, n_iter=100)
svd_topic_vectors = svd.fit_transform(tfidf_docs.values)

In [32]:
svd_topic_vectors

array([[ 0.20116915, -0.00277055, -0.03723445, ..., -0.03553471,
        -0.01376018, -0.03691745],
       [ 0.40437963,  0.09387529,  0.07751254, ..., -0.0208238 ,
         0.05062044,  0.0420136 ],
       [-0.03045864,  0.04810443, -0.09019072, ..., -0.02029215,
        -0.0424939 , -0.05233934],
       ...,
       [ 0.07670815, -0.04336999,  0.01939749, ...,  0.0576479 ,
         0.02065193,  0.01571955],
       [-0.02927204, -0.00663187, -0.00079168, ..., -0.04096908,
         0.02848882,  0.05362645],
       [-0.03771067,  0.07779567, -0.01600405, ..., -0.0207629 ,
        -0.0456668 ,  0.01584656]])

In [35]:
svd_topic_vectors_df = pd.DataFrame(svd_topic_vectors, columns=columns, index=index)
svd_topic_vectors_df.round(3).head(7)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10,topic11,topic12,topic13,topic14,topic15
sms0,0.201,-0.003,-0.037,0.011,-0.019,-0.053,-0.039,-0.066,0.012,-0.083,0.007,-0.007,0.002,-0.036,-0.014,-0.037
sms1,0.404,0.094,0.078,0.051,0.1,0.047,-0.023,0.065,0.023,-0.024,-0.004,0.036,0.043,-0.021,0.051,0.042
sms2!,-0.03,0.048,-0.09,-0.067,0.091,-0.043,0.0,-0.001,-0.057,0.051,0.125,0.023,0.026,-0.02,-0.042,-0.052
sms3,0.329,0.033,0.035,-0.016,0.052,0.056,0.166,-0.074,0.063,-0.108,0.022,0.023,0.073,-0.046,0.022,0.07
sms4,0.002,-0.031,-0.038,0.034,-0.075,-0.093,0.044,0.061,-0.045,0.029,0.028,-0.009,0.027,0.034,-0.083,0.021
sms5!,-0.016,-0.059,-0.014,-0.006,0.122,-0.04,-0.005,0.167,-0.023,0.064,0.041,0.055,-0.037,0.075,-0.001,-0.02
sms6,-0.066,0.099,0.045,-0.03,-0.032,0.036,0.02,-0.015,-0.016,-0.078,0.083,0.018,-0.09,0.068,-0.058,0.098


In [36]:
import numpy as np
# Normalize each row to unit length so that dot products become cosine similarities.
svd_topic_vectors_df = (svd_topic_vectors_df.T / np.linalg.norm(svd_topic_vectors_df, axis=1)).T

In [37]:
svd_topic_vectors_df.iloc[:10].dot(svd_topic_vectors_df.iloc[:10].T).round(1)
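# LSA based on Truncated SVD
# Besides PCA and SVD, there are various other approaches to semantic analysis, such as LDA.
# Centering and normalization should be applied. If they are skipped, frequently mentioned topics receive more weight than they deserve, and the model becomes worse at distinguishing subtle, rare topics.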

Unnamed: 0,sms0,sms1,sms2!,sms3,sms4,sms5!,sms6,sms7,sms8!,sms9!
sms0,1.0,0.6,-0.1,0.6,-0.0,-0.3,-0.3,-0.1,-0.3,-0.3
sms1,0.6,1.0,-0.2,0.8,-0.2,0.0,-0.2,-0.2,-0.1,-0.1
sms2!,-0.1,-0.2,1.0,-0.2,0.1,0.4,0.0,0.3,0.5,0.4
sms3,0.6,0.8,-0.2,1.0,-0.2,-0.3,-0.1,-0.3,-0.2,-0.1
sms4,-0.0,-0.2,0.1,-0.2,1.0,0.2,0.0,0.1,-0.4,-0.2
sms5!,-0.3,0.0,0.4,-0.3,0.2,1.0,-0.1,0.1,0.3,0.4
sms6,-0.3,-0.2,0.0,-0.1,0.0,-0.1,1.0,0.1,-0.2,-0.2
sms7,-0.1,-0.2,0.3,-0.3,0.1,0.1,0.1,1.0,0.1,0.4
sms8!,-0.3,-0.1,0.5,-0.2,-0.4,0.3,-0.2,0.1,1.0,0.3
sms9!,-0.3,-0.1,0.4,-0.1,-0.2,0.4,-0.2,0.4,0.3,1.0
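
The table above is a cosine-similarity matrix: after each topic vector is scaled to unit length, the dot product of two rows is the cosine of the angle between them. A minimal sketch with three made-up topic vectors:

```python
import numpy as np

# Hypothetical topic vectors for three messages (not taken from the output above).
topic_vectors = np.array([
    [0.2, 0.0, -0.1],
    [0.4, 0.1, -0.2],
    [-0.1, 0.3, 0.2],
])

# Scale every row to unit length.
norms = np.linalg.norm(topic_vectors, axis=1, keepdims=True)
unit_vectors = topic_vectors / norms

# Dot products of unit vectors are cosine similarities.
cosine_similarity = unit_vectors @ unit_vectors.T

# The same value computed directly from the definition, for rows 0 and 1.
direct_01 = (topic_vectors[0] @ topic_vectors[1]) / (
    np.linalg.norm(topic_vectors[0]) * np.linalg.norm(topic_vectors[1])
)
```

Values near 1 mean two messages point in nearly the same topic direction, values near 0 mean they are unrelated, and negative values mean they lean toward opposite topics.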
