# Part 2: Dimensionality Reduction

**In Part 1, we learnt:**

* How to import the dataset
* How to preprocess the data
* How to create tokens, a vocabulary of known words, and a TF-IDF matrix
* How to find the 10 most significant terms in a class

A TF-IDF matrix typically has thousands of columns, one per vocabulary term. It is also sparse and approximately low-rank. Many learning algorithms perform poorly on such high-dimensional data, so we want to reduce the number of dimensions.

We can transform the features into a lower-dimensional space with Latent Semantic Indexing or Non-negative Matrix Factorization. This part shows how to apply both methods.

Latent Semantic Indexing (LSI) minimizes the mean squared residual between the original data and its reconstruction from the low-dimensional representation.
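To make the "minimizes the residual" statement concrete, here is a minimal NumPy sketch on a small random matrix (a toy stand-in, not the newsgroup data). It keeps the top `k` singular values and checks that the squared reconstruction error equals the sum of the squared singular values that were discarded. The matrix size and `k = 2` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # toy stand-in for a small document-term matrix

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of dimensions to keep (arbitrary for this sketch)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k reconstruction

# Squared Frobenius error of the reconstruction
residual = np.linalg.norm(A - A_k, "fro") ** 2
# Sum of the squared singular values that were dropped
discarded = np.sum(s[k:] ** 2)

print(residual, discarded)
```

The two printed numbers should agree up to floating-point error. No other rank-`k` matrix gives a smaller squared error, which is why the truncated SVD is the natural tool for LSI.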

You can learn more about LSI [here](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/) and about the underlying Singular Value Decomposition [here](https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/).


**We are going to use Singular Value Decomposition (SVD) to transform the matrix into 50 dimensions.**

### Importing Part_1
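I first converted Part_1 from .ipynb to .py and then imported it below, which avoids rewriting the same code. The wildcard import also brings in the names that Part_1 itself imported, such as `pd`, `fetch_20newsgroups`, `CountVectorizer` and `TfidfTransformer`, and the cells in this part rely on that.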

In [16]:
from Part_1 import *

In [18]:
vectorizer_tfidf  # the TF-IDF matrix from Part 1 is available after the import

<3903x17594 sparse matrix of type '<class 'numpy.float64'>'
	with 190613 stored elements in Compressed Sparse Row format>

## Latent Semantic Indexing using SVD

In [19]:
#Importing SVD
from sklearn.decomposition import TruncatedSVD

We will reduce the dimensionality of the training data for our class, 'Computer Technology'.

In [21]:
#Preprocessing training data
training_data=fetch_20newsgroups(subset='train',categories=computer_technology_subclasses,shuffle=True,random_state=42,
                                 remove=('headers','footers','quotes'))
preprocess(training_data.data)

Let's build the TF-IDF matrix for the training data.

In [22]:
counts2=CountVectorizer(min_df=2,stop_words ='english')
X_counts2=counts2.fit_transform(training_data.data)
tfidf_transformer2=TfidfTransformer()
X_tfidf2=tfidf_transformer2.fit_transform(X_counts2)
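As a side note, scikit-learn's `TfidfVectorizer` combines the counting and weighting steps used above into one object. The sketch below checks that equivalence on a hypothetical four-document toy corpus (not the newsgroup data); the documents are made up for illustration, while `min_df=2` and the English stop-word list mirror the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, only for illustration
docs = [
    "the graphics card renders the image",
    "the image file is stored on the disk",
    "the disk drive and the graphics card",
    "windows loads the image from the disk",
]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer(min_df=2, stop_words="english")
two_step = TfidfTransformer().fit_transform(counts.fit_transform(docs))

# One-step route with the same settings
one_step = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

Keeping the two steps separate, as this notebook does, is still useful when you also want the raw count matrix.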

Now for the main step of this part: reducing the TF-IDF matrix to 50 dimensions with LSI.

In [23]:
#Reducing dimensionality using LSI for min_df=2
svd2=TruncatedSVD(n_components=50,n_iter=10,random_state=42)
svd2.fit(X_tfidf2)
LSI2=svd2.transform(X_tfidf2)
print("The LSI shape when min_df=2 is:")
print(LSI2.shape)

The LSI shape when min_df=2 is:
(2343, 50)


**This means that each of the 2343 documents is now represented by 50 latent components. Each component is a weighted combination of the original terms, not a single word.**

## Dimensionality reduction using Non-negative Matrix Factorization

**Importing required library**

In [25]:
from sklearn.decomposition import NMF

We have already created the TF-IDF matrix for the training data, so we can fit an NMF model on it directly.

In [26]:
model2=NMF(n_components=50,init='random',random_state=0)
W_train2=model2.fit_transform(X_tfidf2)
print("The NMF shape when min_df=2 is:")
print(W_train2.shape)

The NMF shape when min_df=2 is:
(2343, 50)
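The shape matches the LSI result, but the factors behave differently: NMF approximates the TF-IDF matrix as a product `W @ H` in which every entry is non-negative, whereas LSI coordinates can be negative. Here is a minimal sketch on a hypothetical toy corpus (not the newsgroup data) that checks this. The corpus, `n_components=2` and `max_iter=500` are choices made for this example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
docs = [
    "graphics card renders image on screen",
    "image rendering needs a fast graphics card",
    "windows driver crashes on startup",
    "install the windows driver again",
    "graphics driver for windows",
    "screen image looks blurry",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)  # documents x components
H = model.components_       # components x terms

# Relative Frobenius error of the approximation X ~ W @ H
dense = X.toarray()
rel_err = np.linalg.norm(dense - W @ H) / np.linalg.norm(dense)

print(W.shape, H.shape)
print(W.min() >= 0, H.min() >= 0, rel_err)
```

Because the weights cannot cancel each other out, NMF components are often easier to read as additive "topics" than LSI components.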


## Play around and learn

In [28]:
LSI2

array([[ 0.07751021,  0.04337386, -0.08391913, ..., -0.18113037,
         0.01686565,  0.05719849],
       [ 0.1019845 , -0.05958745,  0.00906318, ..., -0.00321237,
         0.02005059,  0.06316133],
       [ 0.10844307,  0.02680134, -0.00158244, ...,  0.02670837,
        -0.02454688,  0.02236188],
       ...,
       [ 0.16934633, -0.010743  , -0.05744238, ...,  0.06469548,
        -0.00476247,  0.02484388],
       [ 0.17604228, -0.04144455, -0.05037223, ...,  0.00850844,
        -0.01332187, -0.09518748],
       [ 0.21284895,  0.02356831,  0.09449639, ..., -0.01797934,
        -0.02382705,  0.07352551]])

In [29]:
pd.DataFrame(LSI2)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.077510 | 0.043374 | -0.083919 | -0.015489 | -0.092739 | -0.016976 | 0.064764 | -0.003232 | -0.064106 | 0.085931 | ... | -0.003005 | -0.073323 | 0.145341 | 0.019965 | -0.016572 | -0.049595 | -0.002285 | -0.181130 | 0.016866 | 0.057198 |
| 1 | 0.101985 | -0.059587 | 0.009063 | 0.025379 | 0.041841 | 0.000788 | -0.042434 | 0.002649 | -0.074039 | -0.033108 | ... | 0.041000 | 0.042990 | -0.007490 | 0.018251 | 0.007365 | -0.017857 | 0.014250 | -0.003212 | 0.020051 | 0.063161 |
| 2 | 0.108443 | 0.026801 | -0.001582 | -0.063800 | 0.038635 | -0.012820 | -0.020113 | 0.001866 | -0.003943 | 0.006689 | ... | -0.010341 | 0.003859 | -0.025393 | -0.012624 | -0.009689 | -0.053246 | -0.028851 | 0.026708 | -0.024547 | 0.022362 |
| 3 | 0.121519 | -0.059068 | 0.096189 | -0.086386 | -0.018306 | 0.059667 | 0.068546 | -0.003620 | 0.048281 | -0.067909 | ... | 0.037597 | 0.146437 | -0.101385 | -0.043960 | -0.104119 | -0.152717 | -0.061542 | -0.054664 | 0.111509 | -0.108646 |
| 4 | 0.041004 | -0.031618 | 0.013208 | -0.009814 | 0.019057 | 0.025633 | 0.007283 | -0.000712 | 0.057504 | -0.001388 | ... | -0.069859 | 0.040361 | 0.049136 | -0.017830 | -0.031820 | -0.023086 | 0.030373 | 0.001856 | 0.012786 | 0.002905 |
| 5 | 0.260472 | -0.076860 | 0.092562 | 0.108238 | 0.003004 | -0.066161 | -0.120888 | 0.005490 | -0.169113 | 0.009348 | ... | -0.130181 | -0.089155 | 0.023462 | -0.100314 | 0.016184 | 0.040074 | 0.014925 | -0.080277 | -0.021059 | -0.026201 |
| 6 | 0.107004 | -0.029320 | 0.017272 | 0.006030 | -0.022066 | -0.029540 | -0.087775 | 0.003749 | -0.001804 | 0.072572 | ... | 0.011818 | 0.043341 | -0.006356 | -0.021150 | 0.015667 | -0.023614 | -0.035571 | 0.058767 | -0.027967 | -0.037564 |
| 7 | 0.157656 | 0.023854 | -0.065427 | -0.042922 | -0.091283 | -0.019557 | -0.025346 | -0.000328 | -0.047810 | 0.042244 | ... | -0.013355 | -0.001686 | -0.054995 | 0.045439 | 0.051098 | -0.112740 | 0.011695 | -0.022253 | -0.002840 | 0.053742 |
| 8 | 0.131603 | -0.071616 | 0.053008 | -0.038905 | -0.068898 | 0.030232 | 0.041983 | -0.001666 | 0.067095 | -0.062918 | ... | 0.013096 | 0.040480 | -0.021831 | -0.074891 | -0.023042 | 0.041003 | 0.004663 | 0.033680 | 0.038430 | 0.017865 |
| 9 | 0.137235 | -0.001010 | -0.000593 | 0.018038 | -0.088914 | -0.023792 | -0.011028 | -0.002180 | -0.022629 | -0.030047 | ... | -0.027547 | 0.013589 | 0.072611 | -0.091924 | 0.042243 | 0.028757 | -0.017705 | 0.038291 | 0.059115 | 0.004767 |
