# Compute KNN on DBpedia Article Content

## Problem Statement

Given a dataset of DBpedia article pages about people, select one person at random and compute that person's 10 nearest neighbors based on the text of the article pages.

## Setup Software

In [1]:
%%capture
# Install textblob
!pip install -U textblob

In [2]:
%%capture
# Download corpora
!python -m textblob.download_corpora

In [3]:
%%capture output
# Install the wikipedia-api package
!pip3 install wikipedia-api

## Setup Libraries

In [4]:
import pandas as pd

from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer as BagOfWords
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.neighbors import NearestNeighbors

import wikipediaapi
import random

pd.options.display.max_columns = 100

## Get Data

In [5]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'

In [6]:
wiki_data_full = pd.read_csv(url)

In [7]:
wiki_data_full.head()

|  | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


In [8]:
wiki_data_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42786 entries, 0 to 42785
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URI     42786 non-null  object
 1   name    42786 non-null  object
 2   text    42786 non-null  object
dtypes: object(3)
memory usage: 1002.9+ KB


In [9]:
wiki_data_full.iloc[7639].text

'samuel hollingsworth young born december 26 1922 was a us representative from illinoisborn in casey illinois young graduated from urbana high school urbana illinois in 1940 he received an llb from the university of illinois in 1947 and a jd from university of illinois law school in 1948young served in the united states army paratroops from 1943 to 1946 and attained the rank of captain he was admitted to the illinois bar in 1948 and commenced practice in chicago with the united states securities and exchange commission he also served as a lawyer in private practice from 1947 to 1948 young was an instructor in economics at university of illinois and taught business finance at northwestern university from 1949 to 1950young served as securities commissioner of illinois from 1953 to 1955 and as assistant secretary of state from 1955 to 1957 he was financial vice president secretary and treasurer for a hospital supply company from 1965 to 1966 he also served as delegate to the illinois stat

## Sample Articles

In [12]:
page_list_orig = wiki_data_full['text'].tolist()

In [13]:
page_list_orig[0]

'digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vfla

## Clean Article Content

In [14]:
for i, page in enumerate(page_list_orig):
  page_list_orig[i] = (
    page
    .replace("\n"," ")
    .replace("\'s",'')
    .replace('\'','')
    .replace("(", "")
    .replace(")", "")
    .replace('"', "")
  )

page_list_orig[0]

'digby morrell born 10 october 1979 is a former australian rules footballer who played with the kangaroos and carlton in the australian football league aflfrom western australia morrell played his early senior football for west perth his 44game senior career for the falcons spanned 19982000 and he was the clubs leading goalkicker in 2000 at the age of 21 morrell was recruited to the australian football league by the kangaroos football club with its third round selection in the 2001 afl rookie draft as a forward he twice kicked five goals during his time with the kangaroos the first was in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell was traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 games for the blues before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullants carltons vfla
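
The replacement chain above can also be packaged as a function, which makes it easy to check on a short string. The following is a minimal sketch; `clean_page` and the sample string are illustrative and do not appear in the notebook. It keeps the same order of operations, dropping possessive `'s` before the remaining apostrophes are removed.

```python
def clean_page(text: str) -> str:
    """Apply the same replacements as the cleaning loop to a single page."""
    # Order matters: turn newlines into spaces and drop possessive 's first.
    text = text.replace("\n", " ").replace("'s", "")
    # Then delete leftover apostrophes, parentheses, and double quotes in one pass.
    return text.translate(str.maketrans("", "", "'()\""))


sample = "Young's career (1943 to 1946)\nwas \"notable\""
print(clean_page(sample))  # Young career 1943 to 1946 was notable
```

Applied to the notebook's data, this would be `page_list_orig = [clean_page(page) for page in page_list_orig]`.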

## Prep Article Content

In [15]:
page_list_prepped = page_list_orig.copy()

for i, page in enumerate(page_list_prepped):
  # Print progress every 1000 pages
  if (i % 1000) == 0:
    print(i)
  page_blob = TextBlob(page)
  # Singularize every word, sentence by sentence, then rejoin into one string
  singularized_sentences = [
    ' '.join(word.singularize() for word in sentence.words)
    for sentence in page_blob.sentences
  ]
  page_list_prepped[i] = ' '.join(singularized_sentences)

page_list_prepped[0]

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000


'digby morrell born 10 october 1979 is a former australian rule footballer who played with the kangaroo and carlton in the australian football league aflfrom western australium morrell played hi early senior football for west perth hi 44game senior career for the falcon spanned 19982000 and he wa the club leading goalkicker in 2000 at the age of 21 morrell wa recruited to the australian football league by the kangaroo football club with it third round selection in the 2001 afl rookie draft a a forward he twice kicked five goal during hi time with the kangaroo the first wa in a losing cause against sydney in 2002 and the other the following season in a drawn game against brisbaneafter the 2003 season morrell wa traded along with david teague to the carlton football club in exchange for corey mckernan he played 32 game for the blue before being delisted at the end of 2005 he continued to play victorian football league vfl football with the northern bullant carlton vflaffiliate in 2006 an
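
The output above shows that singularization also trims words that are not plurals ("his" becomes "hi", "was" becomes "wa", "australia" becomes "australium"). One possible mitigation, which is not part of the original notebook, is to skip a set of protected words. In the sketch below, `PROTECTED_WORDS` and `singularize_safely` are hypothetical names, and `naive_singularize` is only a stand-in for TextBlob's `Word.singularize()` so that the example runs without TextBlob installed.

```python
from typing import Callable, Iterable

# Hypothetical list of words that should never be singularized.
PROTECTED_WORDS = frozenset({"his", "was", "is", "its", "as", "this"})


def singularize_safely(
    words: Iterable[str],
    singularize: Callable[[str], str],
    protected: frozenset = PROTECTED_WORDS,
) -> str:
    """Singularize each word unless it is in the protected set, then rejoin."""
    return " ".join(w if w in protected else singularize(w) for w in words)


def naive_singularize(word: str) -> str:
    # Stand-in for Word.singularize(): strips a single trailing "s".
    return word[:-1] if word.endswith("s") else word


words = "he was one of his clubs leading goalkickers".split()
print(singularize_safely(words, naive_singularize))
# he was one of his club leading goalkicker
```

With TextBlob, the call would presumably look like `singularize_safely(sentence.words, lambda w: w.singularize())`. Whether such a guard improves the neighbors is untested here, since the mangled forms are applied consistently across all pages.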

## Bag of Words Using CountVectorizer

In [16]:
# Perform the count transformation
BoW =  BagOfWords(stop_words='english')
bow_vec = BoW.fit_transform(page_list_prepped)
# bow_vec.toarray()  # Avoid: converting the sparse matrix to a dense array exhausts memory.

## TF-IDF

In [17]:
# Perform the TF-IDF transformation
tf_idf_vec = TfidfTransformer()
tf_idf_pages = tf_idf_vec.fit_transform(bow_vec)
# tf_idf_pages.toarray()  # Avoid: converting the sparse matrix to a dense array exhausts memory.
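
To make the weighting concrete, here is a small pure-Python sketch of what `TfidfTransformer` computes with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, followed by L2 normalization of each row. The helper name `tf_idf_rows` and the toy documents are illustrative, and the whitespace tokenization is a simplification of what `CountVectorizer` does (no stop-word removal, no token pattern).

```python
import math
from collections import Counter


def tf_idf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, matching the TfidfTransformer default (smooth_idf=True).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        # Scale the row to unit length; an empty document keeps an empty row.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tf_idf_rows(["hockey player scored goal", "hockey coach", "jazz singer"])
# "hockey" occurs in two documents, so it is weighted lower than "player".
print(rows[0]["hockey"] < rows[0]["player"])  # True
```

Terms that appear in many articles therefore contribute less to the distance between two pages than terms that are specific to a few articles.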

## K Nearest Neighbors

In [18]:
# Fit nearest neighbors
nn = NearestNeighbors().fit(tf_idf_pages)
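
`NearestNeighbors()` is created with its defaults here, so it measures Euclidean distance, and `TfidfTransformer()` by default scales every row to unit length. For unit vectors the two facts combine into a simple identity, `distance² = 2 - 2 · cosine_similarity`, so ranking by Euclidean distance gives the same order as ranking by cosine similarity. The sketch below illustrates the conversion; `cosine_from_unit_euclidean` is a hypothetical helper, not part of the notebook.

```python
import math


def cosine_from_unit_euclidean(distance: float) -> float:
    """Convert a Euclidean distance between two unit vectors to cosine similarity."""
    return 1.0 - distance ** 2 / 2.0


# Check the identity on two hand-picked unit vectors.
a = (1.0, 0.0)
b = (0.6, 0.8)
distance = math.dist(a, b)
dot = sum(x * y for x, y in zip(a, b))
print(math.isclose(cosine_from_unit_euclidean(distance), dot))  # True
```

Under this identity, a distance of 0 means identical direction, a distance of about 1.16 corresponds to a cosine similarity of roughly 0.33, and the maximum distance between two non-negative unit vectors is √2 (no shared terms).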

In [19]:
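def compute_nearest_documents(idx):
  # Request 11 neighbors: the query article itself comes back first (distance 0),
  # leaving the 10 nearest other articles.
  distances, indices = nn.kneighbors(tf_idf_pages[idx], n_neighbors = 11)
  print(f"Distances : {distances}")
  print("\n\n")
  # indices has shape (1, 11), so this loop runs once and prints each column as a Series
  for i in indices:
    print(wiki_data_full.iloc[i]['name'])
    print(wiki_data_full.iloc[i]['URI'])
  text_blob = TextBlob(wiki_data_full.iloc[idx].text)
  # Use ['name']: the .name attribute of a row Series is its index label, not the 'name' column
  print(f"{wiki_data_full.iloc[idx]['name']} overview sentiment {text_blob.sentiment}")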

In [20]:
idx = wiki_data_full[wiki_data_full['name'] == 'Scott Pellerin'].index[0]
compute_nearest_documents(idx)

Distances : [[0.         1.15737884 1.1695926  1.17250642 1.18814802 1.19382726
  1.19388558 1.1947094  1.19473735 1.20504384 1.20739938]]



28055                          Scott Pellerin
16332                            Gord Sherven
31738                Steven King (ice hockey)
15618              Stephen Johns (ice hockey)
42546    Mike Stevens (ice hockey, born 1965)
19643                            Tanner Glass
35441            Willie Mitchell (ice hockey)
243                                Brett Hull
16118                             Mike McHugh
4610                              Steven Rice
20353                            Dane Jackson
Name: name, dtype: object
28055         <http://dbpedia.org/resource/Scott_Pellerin>
16332           <http://dbpedia.org/resource/Gord_Sherven>
31738    <http://dbpedia.org/resource/Steven_King_(ice_...
15618    <http://dbpedia.org/resource/Stephen_Johns_(ic...
42546    <http://dbpedia.org/resource/Mike_Stevens_(ice...
19643           <http://dbpedia

In [21]:
# randint is inclusive at both ends, so the upper bound is the last row position
sampled_idx = random.randint(0, len(wiki_data_full) - 1)
print(sampled_idx)
compute_nearest_documents(sampled_idx)

15838
Distances : [[0.         1.20730197 1.2225336  1.23563525 1.24882308 1.25377613
  1.25457647 1.26044798 1.26161957 1.26266737 1.26381929]]



15838                            Marc Laforge
16332                            Gord Sherven
4610                              Steven Rice
2791                             Todd Strueby
42546    Mike Stevens (ice hockey, born 1965)
15441                             Eric Brewer
24501                               John Byce
25250                            Dave Semenko
36992                            Raffi Torres
37652                            Peter Douris
24936                           Jordan Eberle
Name: name, dtype: object
15838           <http://dbpedia.org/resource/Marc_Laforge>
16332           <http://dbpedia.org/resource/Gord_Sherven>
4610             <http://dbpedia.org/resource/Steven_Rice>
2791            <http://dbpedia.org/resource/Todd_Strueby>
42546    <http://dbpedia.org/resource/Mike_Stevens_(ice...
15441            <http://
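
Because the person is drawn at random, rerunning the cell above picks a different article each time. If a repeatable pick is wanted, one option is to draw from a seeded generator. This is a sketch only; `pick_row_position` and the seed value are hypothetical, and `n_rows` would be `len(wiki_data_full)` in the notebook (42786 according to the `info()` output).

```python
import random
from typing import Optional


def pick_row_position(n_rows: int, seed: Optional[int] = None) -> int:
    """Pick a row position in [0, n_rows); the same seed gives the same pick."""
    rng = random.Random(seed)
    # randrange excludes the upper bound, so no "- 1" adjustment is needed.
    return rng.randrange(n_rows)


first = pick_row_position(42786, seed=7)
second = pick_row_position(42786, seed=7)
print(first == second)  # True
```

Using `randrange(n_rows)` rather than `randint(0, n_rows - 1)` avoids having to hard-code or adjust the upper bound.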

## Summary

Data to be carried to Step 2 of the project.

In [None]:
idx = wiki_data_full[wiki_data_full['name'] == 'Scott Pellerin'].index[0]
compute_nearest_documents(idx)

Distances : [[0.         1.18226405 1.20104838 1.20606542 1.20708284 1.21226424
  1.2155799  1.21655112 1.21775824 1.21862487 1.21981896]]



8997                  Scott Pellerin
9242    Willie Mitchell (ice hockey)
3621                      Guy Larose
1416                    Todd Strueby
3223        Don Jackson (ice hockey)
3643                  Chris McAlpine
1167                    Robyn Regehr
3429                      Byron Bitz
7208                     Zach Parise
1413                       Rick Nash
8131                      Adam Oates
Name: name, dtype: object
8997         <http://dbpedia.org/resource/Scott_Pellerin>
9242    <http://dbpedia.org/resource/Willie_Mitchell_(...
3621             <http://dbpedia.org/resource/Guy_Larose>
1416           <http://dbpedia.org/resource/Todd_Strueby>
3223    <http://dbpedia.org/resource/Don_Jackson_(ice_...
3643         <http://dbpedia.org/resource/Chris_McAlpine>
1167           <http://dbpedia.org/resource/Robyn_Regehr>
3429             <h