# Compute KNN on Selected Wikipedia Articles

## Problem Statement

Step 1 of this project computed the 10 nearest neighbors of a randomly selected person based on the content of their DBpedia article.

In this step, we fetch the Wikipedia articles for that person and their 10 nearest neighbors to see whether the neighbor ranking and distances change when the source of the article content changes.

## Setup Software

In [None]:
%%capture
# Install textblob
!pip install -U textblob

In [None]:
%%capture
# Download corpora
!python -m textblob.download_corpora

In [None]:
%%capture output
# Install Wikipedia-API
!pip3 install wikipedia-api

## Setup Libraries

In [None]:
import pandas as pd

from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer as BagOfWords
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.neighbors import NearestNeighbors

import wikipediaapi

pd.options.display.max_columns = 100

## Setup Data

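The following names come from the NLP-Step-1 Jupyter notebook: the first entry is the randomly selected person, followed by the 10 nearest neighbors found from the DBpedia content.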

In [None]:
people_df = pd.DataFrame()
people_df["name"] = [
  'Scott Pellerin',
  'Willie Mitchell (ice hockey)',
  'Guy Larose',
  'Todd Strueby',
  'Don Jackson (ice hockey)',
  'Chris McAlpine',
  'Robyn Regehr',
  'Byron Bitz',
  'Zach Parise',
  'Rick Nash',
  'Adam Oates'
]
people_df["wiki topic"] = [
  'Scott_Pellerin',
  'Willie_Mitchell_(ice_hockey)',
  'Guy_Larose',
  'Todd_Strueby',
  'Don_Jackson_(ice_hockey)',
  'Chris_McAlpine',
  'Robyn_Regehr',
  'Byron_Bitz',
  'Zach_Parise',
  'Rick_Nash',
  'Adam_Oates'
]

people_df

Unnamed: 0,name,wiki topic
0,Scott Pellerin,Scott_Pellerin
1,Willie Mitchell (ice hockey),Willie_Mitchell_(ice_hockey)
2,Guy Larose,Guy_Larose
3,Todd Strueby,Todd_Strueby
4,Don Jackson (ice hockey),Don_Jackson_(ice_hockey)
5,Chris McAlpine,Chris_McAlpine
6,Robyn Regehr,Robyn_Regehr
7,Byron Bitz,Byron_Bitz
8,Zach Parise,Zach_Parise
9,Rick Nash,Rick_Nash


## Get Wikipedia Articles

In [None]:
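# Create the client once instead of on every iteration.
# In recent wikipedia-api releases the first positional argument is the user agent.
wikip = wikipediaapi.Wikipedia('foobar')

for index, row in people_df.iterrows():
  topic = row['wiki topic']
  page_ex = wikip.page(topic)
  # Store the full plain-text article for this person
  people_df.at[index, 'wiki page'] = page_ex.text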

people_df

Unnamed: 0,name,wiki topic,wiki page
0,Scott Pellerin,Scott_Pellerin,Scott Jaque-Frederick Pellerin (born January 9...
1,Willie Mitchell (ice hockey),Willie_Mitchell_(ice_hockey),"William Mitchell (born April 23, 1977) is a Ca..."
2,Guy Larose,Guy_Larose,"Guy B. Larose (born July 31, 1967) is a Canadi..."
3,Todd Strueby,Todd_Strueby,"Todd Kenneth Strueby (born June 15, 1963) is a..."
4,Don Jackson (ice hockey),Don_Jackson_(ice_hockey),"Donald Clinton Jackson (born September 2, 1956..."
5,Chris McAlpine,Chris_McAlpine,"Christopher Walter McAlpine (born December 1, ..."
6,Robyn Regehr,Robyn_Regehr,"Robyn Regehr (born April 19, 1980) is a Brazil..."
7,Byron Bitz,Byron_Bitz,"Byron John Bitz (born July 21, 1984) is a Cana..."
8,Zach Parise,Zach_Parise,"Zachary Justin Parise (born July 28, 1984) is ..."
9,Rick Nash,Rick_Nash,"Richard Nash (born June 16, 1984) is a Canadia..."


## Clean Wikipedia Articles

In [None]:
for index, row in people_df.iterrows():
  page = row['wiki page']
  people_df.at[index, 'clean page'] = (
    page
    .replace("\n"," ")
    .replace("\'s",'')
    .replace('\'','')
    .replace("(", "")
    .replace(")", "")
    .replace('"', "")
    .replace(',', "")
  )

people_df

Unnamed: 0,name,wiki topic,wiki page,clean page
0,Scott Pellerin,Scott_Pellerin,Scott Jaque-Frederick Pellerin (born January 9...,Scott Jaque-Frederick Pellerin born January 9 ...
1,Willie Mitchell (ice hockey),Willie_Mitchell_(ice_hockey),"William Mitchell (born April 23, 1977) is a Ca...",William Mitchell born April 23 1977 is a Canad...
2,Guy Larose,Guy_Larose,"Guy B. Larose (born July 31, 1967) is a Canadi...",Guy B. Larose born July 31 1967 is a Canadian ...
3,Todd Strueby,Todd_Strueby,"Todd Kenneth Strueby (born June 15, 1963) is a...",Todd Kenneth Strueby born June 15 1963 is a Ca...
4,Don Jackson (ice hockey),Don_Jackson_(ice_hockey),"Donald Clinton Jackson (born September 2, 1956...",Donald Clinton Jackson born September 2 1956 i...
5,Chris McAlpine,Chris_McAlpine,"Christopher Walter McAlpine (born December 1, ...",Christopher Walter McAlpine born December 1 19...
6,Robyn Regehr,Robyn_Regehr,"Robyn Regehr (born April 19, 1980) is a Brazil...",Robyn Regehr born April 19 1980 is a Brazilian...
7,Byron Bitz,Byron_Bitz,"Byron John Bitz (born July 21, 1984) is a Cana...",Byron John Bitz born July 21 1984 is a Canadia...
8,Zach Parise,Zach_Parise,"Zachary Justin Parise (born July 28, 1984) is ...",Zachary Justin Parise born July 28 1984 is an ...
9,Rick Nash,Rick_Nash,"Richard Nash (born June 16, 1984) is a Canadia...",Richard Nash born June 16 1984 is a Canadian f...
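The chained `.replace()` calls can also be expressed as a small reusable function. The following is a minimal sketch, not the notebook's method: the `clean_page` helper is a hypothetical name, and it is intended to strip the same characters as the cell above (possessive `'s`, apostrophes, parentheses, double quotes, commas) and turn newlines into spaces.

```python
import re


def clean_page(text: str) -> str:
    """Strip the same punctuation as the chained .replace() calls."""
    # Drop possessive "'s" first so "Pellerin's" becomes "Pellerin"
    text = re.sub(r"'s", "", text)
    # Remove remaining apostrophes, parentheses, double quotes, and commas
    text = re.sub(r"['()\",]", "", text)
    # Replace newlines with single spaces
    return text.replace("\n", " ")


sample = "Pellerin's (born January 9, 1970)\nis \"a\" player"
cleaned = clean_page(sample)
print(cleaned)
```

With a helper like this, the loop could be replaced by `people_df['clean page'] = people_df['wiki page'].apply(clean_page)`.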


## Prep Pages

In [None]:
page_list_prepped = people_df['clean page'].tolist()

for i, page in enumerate(page_list_prepped):
  page_blob = TextBlob(page)
  # Singularize every word, sentence by sentence
  singularized_sentences = [
    ' '.join(word.singularize() for word in sentence.words)
    for sentence in page_blob.sentences
  ]
  # Rejoin the sentences into a single string per page
  page_list_prepped[i] = ' '.join(singularized_sentences)

page_list_prepped[0]

'Scott Jaque-Frederick Pellerin born January 9 1970 is a Canadian former professional ice hockey left winger who played in the National Hockey League between 1992 and 2004 Playing career Pellerin wa born in Shediac New Brunswick He played high school hockey at the Athol Murray College of Notre Dame a boarding school in Wilcox Saskatchewan under coach Barry MacKenzie In Pellerin junior year hi midget AAA hockey team took 2nd place in the 1987 Air Canada Cup the national midget AAA final Hi high school hockey teammate included other future NHLer including Rod BrindAmmy Jeff Batter Jason Herter and Joby Messier In 1988 Pellerin senior year he played for the Hound junior AA team during it 1st season in the Saskatchewan Junior Hockey League That year the Hound won the Centennial Cup the National Junior AA championship behind goaltender Curti Joseph Pellerin wa drafted 47th overall by the New Jersey Devil in the 1989 NHL Entry Draft He won the Hobey Baker Award a the best collegiate player i
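The output above shows a side effect of singularizing every token: words that merely end in "s" are truncated as well ("was" becomes "wa", "his" becomes "hi", "as" becomes "a"). One possible mitigation, sketched below, is to skip a small set of protected words. Everything here is an assumption for illustration: the `PROTECTED` set and the `singularize_tokens` helper are hypothetical, and `naive_singularize` is only a stand-in so the sketch runs without TextBlob. In the notebook one would pass something like `lambda w: Word(w).singularize()` (with `Word` imported from `textblob`) instead.

```python
from typing import Callable, Iterable, List

# Hypothetical list of common words that end in "s" but are not plurals
PROTECTED = {"was", "his", "its", "as", "is", "this", "has"}


def naive_singularize(word: str) -> str:
    """Stand-in singularizer: strips one trailing 's' (illustration only)."""
    return word[:-1] if len(word) > 1 and word.endswith("s") else word


def singularize_tokens(
    tokens: Iterable[str],
    singularize: Callable[[str], str] = naive_singularize,
) -> List[str]:
    """Singularize each token unless its lowercase form is protected."""
    return [t if t.lower() in PROTECTED else singularize(t) for t in tokens]


tokens = ["Pellerin", "was", "drafted", "by", "the", "Devils"]
result = singularize_tokens(tokens)
print(" ".join(result))
```

A fixed word list does not protect proper nouns (the output above also shows "Curtis" shortened to "Curti"), so this only addresses the most frequent cases.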

## Bag of Words Using CountVectorizer

In [None]:
# Perform the count transformation, dropping English stop words
BoW = BagOfWords(stop_words='english')
bow_vec = BoW.fit_transform(page_list_prepped)
# bow_vec.toarray()  # Uncomment to inspect; converting the sparse matrix to a dense array can exhaust memory on a large corpus.

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 1, 2, ..., 0, 0, 0],
       [0, 0, 2, ..., 0, 0, 0],
       [2, 3, 1, ..., 1, 0, 0]])
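Conceptually, the count transformation builds one shared vocabulary and records how often each term occurs in each document. The pure-Python sketch below illustrates the idea only: the toy documents, the tiny `STOP` set, and the `bag_of_words` helper are made up for this example, and `CountVectorizer` tokenizes and filters stop words differently in detail.

```python
from collections import Counter
from typing import List, Set, Tuple

# Toy inputs (assumptions for illustration, not data from the notebook)
docs = ["he played hockey", "hockey hockey coach"]
STOP = {"he", "the", "a"}


def bag_of_words(documents: List[str], stop_words: Set[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and one row of term counts per document."""
    tokenized = [
        [w for w in doc.lower().split() if w not in stop_words]
        for doc in documents
    ]
    vocab = sorted({w for doc in tokenized for w in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for terms that are absent from this document
        rows.append([counts[term] for term in vocab])
    return vocab, rows


vocab, rows = bag_of_words(docs, STOP)
print(vocab)
print(rows)
```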

## TF-IDF

In [None]:
# Perform the TF-IDF transformation on the raw term counts
tf_idf_vec = TfidfTransformer()
tf_idf_pages = tf_idf_vec.fit_transform(bow_vec)
# tf_idf_pages.toarray()  # Uncomment to inspect; converting the sparse matrix to a dense array can exhaust memory on a large corpus.

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.00675774, 0.        , ..., 0.        , 0.01005934,
        0.01005934],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.00601683, 0.00537706, 0.00882476, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.01079366, ..., 0.        , 0.        ,
        0.        ],
       [0.01413994, 0.01895467, 0.00518469, ..., 0.0094051 , 0.        ,
        0.        ]])
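As a rough guide to what the numbers above mean: with its default settings, `TfidfTransformer` weights each raw count by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales every row to unit L2 length. The sketch below reproduces that computation in plain Python on a toy count matrix. The `tfidf` helper and the input counts are assumptions for illustration, and the defaults (`smooth_idf=True`, `norm='l2'`) are assumed to be in effect, as they are when the transformer is constructed without arguments.

```python
import math
from typing import List


def tfidf(counts: List[List[int]]) -> List[List[float]]:
    """Smoothed TF-IDF followed by L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # Guard against all-zero rows to avoid division by zero
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


toy_counts = [[0, 1, 1], [1, 2, 0]]  # toy data, not from the notebook
weights = tfidf(toy_counts)
for row in weights:
    print([round(v, 4) for v in row])
```

Because every row ends up with unit length, the individual weights stay small for long articles, which is consistent with the magnitudes in the output above.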

## K Nearest Neighbors

In [None]:
# Fit nearest neighbors (the default metric is Euclidean distance)
nn = NearestNeighbors().fit(tf_idf_pages)

In [None]:
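# Query with the first row (Scott Pellerin). Ask for 11 neighbors because
# the query point itself is returned first, at distance 0.
distances, indices = nn.kneighbors(tf_idf_pages[0], n_neighbors=11)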

In [None]:
distances

array([[0.        , 1.22842243, 1.23882648, 1.25129158, 1.25182851,
        1.25380916, 1.25542432, 1.27359214, 1.28131391, 1.28643915,
        1.3228988 ]])
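These distances can be read as cosine similarities. `NearestNeighbors()` uses Euclidean distance by default, and `TfidfTransformer()` L2-normalizes each row by default. For unit-length vectors, the Euclidean distance `d` and the cosine similarity satisfy `d**2 = 2 - 2 * cos`. The sketch below applies that identity to the distances printed above; it assumes both defaults were in effect, which matches the calls in this notebook.

```python
# Distances copied from the kneighbors output above
distances = [
    0.0, 1.22842243, 1.23882648, 1.25129158, 1.25182851,
    1.25380916, 1.25542432, 1.27359214, 1.28131391, 1.28643915,
    1.3228988,
]

# For unit vectors: d^2 = 2 - 2*cos  =>  cos = 1 - d^2 / 2
cosine_similarities = [1 - d ** 2 / 2 for d in distances]

for d, c in zip(distances, cosine_similarities):
    print(f"distance={d:.4f}  cosine similarity={c:.4f}")
```

Read this way, the closest neighbor has a cosine similarity of roughly 0.25 with the query article and the farthest roughly 0.12, so the neighbors are separated by fairly small margins.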

In [None]:
indices

array([[ 0,  5,  2,  6,  7,  3, 10,  1,  9,  8,  4]])

In [None]:
# indices has shape (1, 11); take its single row to look up the names
print(people_df.iloc[indices[0]]['name'])

0                   Scott Pellerin
5                   Chris McAlpine
2                       Guy Larose
6                     Robyn Regehr
7                       Byron Bitz
3                     Todd Strueby
10                      Adam Oates
1     Willie Mitchell (ice hockey)
9                        Rick Nash
8                      Zach Parise
4         Don Jackson (ice hockey)
Name: name, dtype: object


## Summary

In [None]:
print('Nearest neighbors based on dbpedia content.\n\n')
people_df['name']

Nearest neighbors based on dbpedia content.




0                   Scott Pellerin
1     Willie Mitchell (ice hockey)
2                       Guy Larose
3                     Todd Strueby
4         Don Jackson (ice hockey)
5                   Chris McAlpine
6                     Robyn Regehr
7                       Byron Bitz
8                      Zach Parise
9                        Rick Nash
10                      Adam Oates
Name: name, dtype: object

In [None]:
print('Nearest neighbors based on Wikipedia content.\n\n')
# indices has shape (1, 11); take its single row to look up the names
print(people_df.iloc[indices[0]]['name'])

Nearest neighbors based on Wikipedia content.


0                   Scott Pellerin
5                   Chris McAlpine
2                       Guy Larose
6                     Robyn Regehr
7                       Byron Bitz
3                     Todd Strueby
10                      Adam Oates
1     Willie Mitchell (ice hockey)
9                        Rick Nash
8                      Zach Parise
4         Don Jackson (ice hockey)
Name: name, dtype: object
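To make the comparison between the two listings concrete, the change in rank for each person can be computed directly. The sketch below assumes that the order of `people_df` reflects the DBpedia ranking from Step 1 (as the first summary cell suggests) and uses the Wikipedia-based `indices` printed above. The variable names are hypothetical.

```python
# Names in DBpedia rank order (position in the list = DBpedia rank)
names = [
    'Scott Pellerin', 'Willie Mitchell (ice hockey)', 'Guy Larose',
    'Todd Strueby', 'Don Jackson (ice hockey)', 'Chris McAlpine',
    'Robyn Regehr', 'Byron Bitz', 'Zach Parise', 'Rick Nash', 'Adam Oates',
]

# Row indices in Wikipedia rank order, copied from the kneighbors output
wiki_order = [0, 5, 2, 6, 7, 3, 10, 1, 9, 8, 4]

# Map each row index to its rank in the Wikipedia-based ordering
wiki_rank = {idx: rank for rank, idx in enumerate(wiki_order)}

# Positive shift: farther from the query person under Wikipedia content
shifts = {names[idx]: wiki_rank[idx] - idx for idx in range(len(names))}

for name, shift in shifts.items():
    print(f"{name:30s} {shift:+d}")
```

Under that assumption, only Scott Pellerin (the query itself) and Guy Larose keep their positions, so the ranking does change with the source of the article content.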


In [None]:
print(f"Sentiment of {people_df.iloc[0]['name']} wikipedia page. \n\n")
page_blob = TextBlob(people_df.iloc[0]['clean page'])
page_blob.sentiment

Sentiment of Scott Pellerin wikipedia page. 




Sentiment(polarity=0.12626482213438736, subjectivity=0.31510793554271804)