# Compute KNN on User-Selected Personality

## Problem Statement

Given a collection of DBpedia article pages about people, let the user select a person from a randomly drawn list of names. Compute the 10 nearest neighbors of the selected person based on the content of the article pages.

## Setup Software

In [None]:
%%capture
# Install textblob
!pip install -U textblob

In [None]:
%%capture
# Download corpora
!python -m textblob.download_corpora

In [None]:
%%capture output
# Install the Wikipedia-API package
!pip3 install wikipedia-api

## Setup Libraries

In [None]:
import pandas as pd

from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer as BagOfWords
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.neighbors import NearestNeighbors

import wikipediaapi
import random

from ipywidgets import widgets, interact, interact_manual
from IPython.display import display

pd.options.display.max_columns = 100

## Get Data

In [None]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'

In [None]:
wiki_data_full = pd.read_csv(url)

In [None]:
wiki_data_full.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


In [None]:
wiki_data_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42786 entries, 0 to 42785
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URI     42786 non-null  object
 1   name    42786 non-null  object
 2   text    42786 non-null  object
dtypes: object(3)
memory usage: 1002.9+ KB


In [None]:
wiki_data_full.iloc[7639].text

'samuel hollingsworth young born december 26 1922 was a us representative from illinoisborn in casey illinois young graduated from urbana high school urbana illinois in 1940 he received an llb from the university of illinois in 1947 and a jd from university of illinois law school in 1948young served in the united states army paratroops from 1943 to 1946 and attained the rank of captain he was admitted to the illinois bar in 1948 and commenced practice in chicago with the united states securities and exchange commission he also served as a lawyer in private practice from 1947 to 1948 young was an instructor in economics at university of illinois and taught business finance at northwestern university from 1949 to 1950young served as securities commissioner of illinois from 1953 to 1955 and as assistant secretary of state from 1955 to 1957 he was financial vice president secretary and treasurer for a hospital supply company from 1965 to 1966 he also served as delegate to the illinois stat

## Sample 10,000 Articles

In [None]:
sampled_wiki_data = wiki_data_full.sample(n=10000, random_state=42)
sampled_wiki_data = sampled_wiki_data.reset_index(drop=True)
sampled_wiki_data.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Tom_Bancroft> | Tom Bancroft | tom bancroft born 1967 london is a british jaz... |
| 1 | <http://dbpedia.org/resource/Bart_Zeller> | Bart Zeller | barton wallace zeller born july 22 1941 is a f... |
| 2 | <http://dbpedia.org/resource/Caitlin_Morrall> | Caitlin Morrall | caitlin shea morrall machol born may 2 1983 is... |
| 3 | <http://dbpedia.org/resource/Paddy_Roche> | Paddy Roche | patrick joseph christopher paddy roche born 4 ... |
| 4 | <http://dbpedia.org/resource/H._Jeff_Kimble> | H. Jeff Kimble | h jeff kimble is the william l valentine profe... |


In [None]:
page_list_orig = sampled_wiki_data['text'].tolist()

In [None]:
page_list_orig[0]

'tom bancroft born 1967 london is a british jazz drummer and composer he began drumming aged seven and started off playing jazz with his father and identical twin brother phil after studying medicine at cambridge university he spent a year studying composition and arranging at mcgill university in montreal canada qualifying as a doctor in 1992 he then worked as a jazz musician and composer supporting his music income with locum work as a hospital doctor until 1998 when he began starting music related companies he is married to singer gina rae and has two children sam and sophie in 2004 he received the creative scotland awardin 1998 he launched caber music with support from the national lottery fund which went on to release over thirty cds over the next seven years to critical acclaim including two bbc jazz awards for best cd and numerous album of the year placings he has subsequently started the company abc creative music with his twin brother phil bancroft which develops creative musi

## Clean Article Content

In [None]:
for i, page in enumerate(page_list_orig):
  page_list_orig[i] = (
    page
    .replace("\n"," ")
    .replace("\'s",'')
    .replace('\'','')
    .replace("(", "")
    .replace(")", "")
    .replace('"', "")
  )

page_list_orig[0]

'tom bancroft born 1967 london is a british jazz drummer and composer he began drumming aged seven and started off playing jazz with his father and identical twin brother phil after studying medicine at cambridge university he spent a year studying composition and arranging at mcgill university in montreal canada qualifying as a doctor in 1992 he then worked as a jazz musician and composer supporting his music income with locum work as a hospital doctor until 1998 when he began starting music related companies he is married to singer gina rae and has two children sam and sophie in 2004 he received the creative scotland awardin 1998 he launched caber music with support from the national lottery fund which went on to release over thirty cds over the next seven years to critical acclaim including two bbc jazz awards for best cd and numerous album of the year placings he has subsequently started the company abc creative music with his twin brother phil bancroft which develops creative musi
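
The cleaning step above can also be written as a reusable function. This is a minimal sketch, not the notebook's own code: `clean_page` and `_CLEANUP_TABLE` are hypothetical names, and the function is meant to apply the same replacements in the same order as the loop above.

```python
# Hypothetical helper; mirrors the chained .replace() calls in the cell above.
# Newlines become spaces; quotes and parentheses are deleted.
_CLEANUP_TABLE = str.maketrans({
    "\n": " ",
    "'": None,
    '"': None,
    "(": None,
    ")": None,
})


def clean_page(page: str) -> str:
    """Return the page text with possessives, quotes and parentheses removed."""
    # Drop the possessive "'s" first, so "young's" becomes "young" rather than "youngs".
    page = page.replace("'s", "")
    # Apply all single-character substitutions in one pass.
    return page.translate(_CLEANUP_TABLE)


sample = 'tom\'s "band" (uk)\nplays'
print(clean_page(sample))
```

With a function like this, the whole list could be cleaned with `[clean_page(p) for p in page_list_orig]` instead of mutating it in place.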

## Prep Article Content

In [None]:
page_list_prepped = page_list_orig.copy()

for i, page in enumerate(page_list_prepped):
  # Print progress every 1,000 pages.
  if (i % 1000) == 0:
    print(i)
  page_blob = TextBlob(page)
  # Singularize every word, sentence by sentence, then rejoin the page.
  singularized_sentences = [
    ' '.join(word.singularize() for word in sentence.words)
    for sentence in page_blob.sentences
  ]
  page_list_prepped[i] = ' '.join(singularized_sentences)

page_list_prepped[0]

0
1000
2000
3000
4000
5000
6000
7000
8000
9000


'tom bancroft born 1967 london is a british jazz drummer and composer he began drumming aged seven and started off playing jazz with hi father and identical twin brother phil after studying medicine at cambridge university he spent a year studying composition and arranging at mcgill university in montreal canada qualifying a a doctor in 1992 he then worked a a jazz musician and composer supporting hi music income with locum work a a hospital doctor until 1998 when he began starting music related company he is married to singer gina ra and ha two child sam and sophie in 2004 he received the creative scotland awardin 1998 he launched caber music with support from the national lottery fund which went on to release over thirty cd over the next seven year to critical acclaim including two bbc jazz award for best cd and numerou album of the year placing he ha subsequently started the company abc creative music with hi twin brother phil bancroft which develop creative music education resource

## Bag of Words Using CountVectorizer

In [None]:
# Perform the count transformation
BoW = BagOfWords(stop_words='english')
bow_vec = BoW.fit_transform(page_list_prepped)
# bow_vec.toarray()  # Left commented out: converting the sparse matrix to a dense array can exhaust memory.

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## TF-IDF

In [None]:
# Perform the TF-IDF transformation
tf_idf_vec = TfidfTransformer()
tf_idf_pages = tf_idf_vec.fit_transform(bow_vec)
# tf_idf_pages.toarray()  # Left commented out: converting the sparse matrix to a dense array can exhaust memory.

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
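
To make the weighting concrete, here is a small pure-Python sketch of the arithmetic that `TfidfTransformer` performs with its default settings (smoothed IDF, then L2 normalization of each row). The function name `tf_idf` and the whitespace tokenizer are assumptions made for illustration; `CountVectorizer` tokenizes differently and also removes stop words, so the numbers will not match the notebook's matrix.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, rows), where each row is an L2-normalized TF-IDF vector."""
    # Assumption: naive whitespace tokenization, for illustration only.
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)

    # Document frequency: how many documents contain each term.
    df = Counter(tok for doc in tokenised for tok in set(doc))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit length; guard against an empty document.
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


vocab, rows = tf_idf(["jazz drummer", "jazz pianist"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "jazz" appears in both documents, so it receives a lower weight than "drummer", which appears in only one.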

## K Nearest Neighbors

In [None]:
# Fit nearest neighbors
nn = NearestNeighbors().fit(tf_idf_pages)
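
A short derivation of why the default settings are a reasonable fit here, assuming the scikit-learn defaults are left unchanged (`TfidfTransformer` scales each row to unit L2 length, and `NearestNeighbors` measures Euclidean distance). For two unit-length rows $a$ and $b$:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b = 2 - 2\cos(a, b)
$$

Under that assumption, ranking neighbors by Euclidean distance gives the same order as ranking by cosine similarity, and two pages that share no terms sit at distance $\sqrt{2}$.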

In [None]:
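def compute_nearest_documents():
  # Row position of the selected person in the sampled data.
  idx = sampled_wiki_data[sampled_wiki_data['name'] == famous_person].index[0]
  distances, indices = nn.kneighbors(tf_idf_pages[idx], n_neighbors=number_of_neighbors)
  # kneighbors returns one row of indices per query, so iterate over row 0.
  # The selected person is normally listed first, at distance 0.
  for i in indices[0]:
    print(sampled_wiki_data.iloc[i]['name'])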
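
The lookup can also be written without global variables. The sketch below is one possible shape, not the notebook's method: `nearest_names`, the toy `names` and `texts`, and the use of `TfidfVectorizer` (which combines counting and TF-IDF in one step) are all assumptions made to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def nearest_names(person, k, names, matrix, model):
    """Return up to k (name, distance) pairs closest to `person`, excluding the person."""
    idx = names.index(person)
    # Ask for one extra neighbor because the query row matches itself.
    n_query = min(k + 1, len(names))
    # Slice rather than index so the query stays a 2-D (1 x n_terms) matrix.
    distances, indices = model.kneighbors(matrix[idx:idx + 1], n_neighbors=n_query)
    hits = [
        (names[i], float(d))
        for d, i in zip(distances[0], indices[0])
        if i != idx
    ]
    return hits[:k]


# Toy data standing in for the article names and texts (hypothetical).
names = ["drummer", "pianist", "footballer", "goalkeeper"]
texts = [
    "jazz drummer composer music band",
    "jazz pianist composer music album",
    "football striker club league goal",
    "football goalkeeper club league save",
]

matrix = TfidfVectorizer().fit_transform(texts)
model = NearestNeighbors().fit(matrix)

for name, distance in nearest_names("drummer", 2, names, matrix, model):
    print(f"{name}: {distance:.3f}")
```

Passing the name and neighbor count as arguments makes the function easier to test and lets it be wired directly into `interact_manual` with widget arguments.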

## Variables

In [None]:
def set_famous_person(name):
  print(f"Famous person selected: {name}")
  global famous_person
  famous_person = name

In [None]:
def set_number_of_neighbors(num):
  print(f"Number of neighbors selected: {num}")
  global number_of_neighbors
  number_of_neighbors = num

## Setup Interact

In [None]:
def setup_interact():
  print("Select a famous person.")
  interact(set_famous_person, name=sampled_wiki_data.sample(n=20)['name'].tolist());
  print("\n\n\n")
  print("Select number of neighbors.")
  interact(set_number_of_neighbors, num=[10,20,30,40,50]);
  print("\n\n\n")
  interact_manual(compute_nearest_documents);

## User Input

In [None]:
setup_interact()

Select a famous person.


interactive(children=(Dropdown(description='name', options=('Ed Tracy', 'Rebecca Probert', 'Liu Chaoying', 'Ti…





Select number of neighbors.


interactive(children=(Dropdown(description='num', options=(10, 20, 30, 40, 50), value=10), Output()), _dom_cla…







interactive(children=(Button(description='Run Interact', style=ButtonStyle()), Output()), _dom_classes=('widge…