# Part 2

In this part I use Latent Semantic Analysis (LSA) to search the downloaded articles. Given a search term, my code finds the 5 articles most related to it.

I have gathered around 10,000 Wikipedia articles in 8 different categories. These categories are:
- machine learning
- business software
- association football
- engineering
- quantum mechanics
- evolution
- music
- economics

Let's load the data from PostgreSQL:

In [1]:
import pandas as pd

In [2]:
from lib.database_manager import query_to_dataframe

In [3]:
articles_df = query_to_dataframe('SELECT * FROM articles')

## Vectorizing documents

I used `TfidfVectorizer` to vectorize the article content.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
tfidf_vectorizer = TfidfVectorizer(min_df=.005, max_df=.9, ngram_range=(1, 3), stop_words = 'english')

In [6]:
document_term_matrix = tfidf_vectorizer.fit_transform(articles_df['article_content'])

In [7]:
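# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
document_term_matrix_df = pd.DataFrame(document_term_matrix.toarray(),
                                       columns=tfidf_vectorizer.get_feature_names_out(),
                                       index=articles_df['article_title'])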

In [8]:
document_term_matrix_df.shape

(10496, 9066)
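To see how `min_df`, `max_df`, `ngram_range` and `stop_words` shape the vocabulary, here is a minimal sketch on a made-up four-document corpus (the `toy_docs` list is hypothetical and is not the Wikipedia data). It uses an integer `min_df=2`, which means "at least 2 documents", whereas the float `min_df=.005` above means "at least 0.5% of documents".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the parameters.
toy_docs = [
    "penalty kick in football",
    "free kick in football",
    "wave equation in physics",
    "wave function in physics",
]

# min_df=2      -> keep terms that appear in at least 2 documents
# max_df=0.9    -> drop terms that appear in more than 90% of documents
# ngram_range   -> build unigrams and bigrams (after stop word removal)
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.9,
                                 ngram_range=(1, 2), stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

Terms that occur in a single document (for example "penalty" or "wave equation") are pruned, which is the same mechanism that keeps the real document term matrix at a manageable number of columns.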

Let's pickle this vectorizer for future use (in part 3).

In [9]:
import pickle
# Note: despite the .py extension, this file holds a binary pickle, not Python source
with open('vectorizer.py', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
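For reference, loading a pickled vectorizer back could look like the sketch below. It uses a small stand-in vectorizer and a temporary file (both hypothetical) so that it runs on its own and does not touch the real pickle file. Pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the fitted vectorizer from the cells above.
toy_vectorizer = TfidfVectorizer().fit(["quantum mechanics", "machine learning"])

# Write to a temporary path so no existing file is overwritten.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_vectorizer.pkl')
with open(toy_path, 'wb') as f:
    pickle.dump(toy_vectorizer, f)

# Reload it, as a later notebook would.
with open(toy_path, 'rb') as f:
    restored_vectorizer = pickle.load(f)

# The restored object keeps the fitted vocabulary and can transform new text.
same_vocab = restored_vectorizer.vocabulary_ == toy_vectorizer.vocabulary_
restored_shape = restored_vectorizer.transform(["machine learning"]).shape
print(same_vocab, restored_shape)
```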

## Compute SVD of document term matrix

In [10]:
from sklearn.decomposition import TruncatedSVD

In [11]:
SVD = TruncatedSVD(n_components=350)

In [12]:
svd_matrix = SVD.fit_transform(document_term_matrix_df)

In [13]:
latent_semantic_analysis = pd.DataFrame(svd_matrix,
                                        index=document_term_matrix_df.index)

In [14]:
latent_semantic_analysis.shape

(10496, 350)
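One way to judge a choice such as `n_components=350` is to look at how much variance the retained components explain. The sketch below does this on random toy data (the matrix, its size and `n_components=5` are assumptions for illustration only); the same attribute is available on the fitted `SVD` object above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy matrix standing in for the document term matrix.
rng = np.random.RandomState(0)
toy_matrix = rng.rand(50, 20)

toy_svd = TruncatedSVD(n_components=5, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_matrix)

# Fraction of the total variance kept by the 5 components.
retained_variance = toy_svd.explained_variance_ratio_.sum()
print(toy_reduced.shape)
print(round(retained_variance, 3))
```

If the retained fraction is low, more components may be needed; if it barely changes when components are removed, fewer may be enough.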

Let's pickle the `latent_semantic_analysis` dataframe and the dimensionality reduction model as well.

In [15]:
with open('SVD.py', 'wb') as f:
    pickle.dump(SVD, f)

In [29]:
latent_semantic_analysis.to_pickle('lsa_df')

## Search the articles

The code below works for any number of search terms: it takes a list of search terms and returns the top 5 related articles for each one. To find the related articles I used two approaches, cosine similarity and nearest neighbors.

First we need to vectorize the search terms and reduce their dimensionality using the models fitted above.

In [16]:
search_terms = ['principal component analysis',
                'schrodinger equation',
                'penalty kick',
                'soil structure interaction']

In [17]:
search_terms_matrix = tfidf_vectorizer.transform(search_terms)

In [18]:
search_terms_svd_matrix = SVD.transform(search_terms_matrix)
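Note that the search terms go through `transform`, not `fit_transform`, so they land in the same latent space as the articles. As an alternative to calling the two models by hand, the two steps could be chained with scikit-learn's `make_pipeline`. This is only a sketch on a hypothetical toy corpus, not what the notebook above does:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "penalty kick in football",
    "free kick in football",
    "wave equation in physics",
    "wave function in physics",
]

# TF-IDF followed by truncated SVD, fitted once on the documents.
lsa_pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=2, random_state=0),
)
toy_doc_vectors = lsa_pipeline.fit_transform(toy_docs)

# Queries reuse the fitted steps via transform (no refitting).
toy_query_vectors = lsa_pipeline.transform(['penalty kick', 'wave equation'])
print(toy_doc_vectors.shape, toy_query_vectors.shape)
```

A single pipeline object also means only one file has to be pickled and reloaded later.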

### Cosine similarity

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
co_sim_df = pd.DataFrame(cosine_similarity(latent_semantic_analysis, search_terms_svd_matrix),
                         index=articles_df['article_title'],
                         columns=search_terms)

In [21]:
top_related_articles_list = [co_sim_df[search_term].sort_values(ascending=False)[:5] for search_term in search_terms]

In [22]:
for i, search_term in enumerate(search_terms):
    print('top 5 related articles to "{}" are:'.format(search_term))
    print('')
    for j in range(5):
        print(top_related_articles_list[i].index[j])
    print('')
    print('')

top 5 related articles to "principal component analysis" are:

Principal geodesic analysis
Correspondence analysis
Tucker decomposition
Multiple correspondence analysis
Multilinear principal component analysis


top 5 related articles to "schrodinger equation" are:

Schrödinger–Newton equation
Relativistic wave equations
Heisenberg-Langevin equation
Logarithmic Schrödinger equation
Soliton


top 5 related articles to "penalty kick" are:

Indirect free kick
Penalty kick (association football)
Direct free kick
Penalty area
Bicycle kick


top 5 related articles to "soil structure interaction" are:

Fender pier
Bistable structure
Gravity-based structure
Active structure
Prestressed structure




### Nearest neighbors

In [23]:
from sklearn.neighbors import NearestNeighbors

In [24]:
nn = NearestNeighbors()

In [25]:
nn.fit(svd_matrix)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [26]:
similar_article_indices = nn.kneighbors(search_terms_svd_matrix, n_neighbors=5)[1]

In [27]:
similar_article_indices

array([[2342, 2290, 2324, 1746, 1747],
       [5496, 5722, 5570, 5454, 6283],
       [2920, 3015, 2922, 2988, 2913],
       [5046, 4933, 4346, 9450, 4198]])

In [28]:
for i, search_term in enumerate(search_terms):
    print('top 5 related articles to "{}" are:'.format(search_term))
    print('')
    for j in range(5):
        print(articles_df['article_title'].iloc[similar_article_indices[i, j]])
    print('')
    print('')

top 5 related articles to "principal component analysis" are:

Principal geodesic analysis
Correspondence analysis
Tucker decomposition
Multilinear principal component analysis
Multilinear subspace learning


top 5 related articles to "schrodinger equation" are:

Heisenberg-Langevin equation
Schrödinger group
Non-Hermitian quantum mechanics
Faddeev equations
Diffuson


top 5 related articles to "penalty kick" are:

Penalty area
Sliding tackle
Penalty kick (association football)
Bicycle kick
Indirect free kick


top 5 related articles to "soil structure interaction" are:

Fender pier
Bistable structure
Strongback (girder)
Serial homology
Centring


