
## Problem definition

This project consists of two parts.

Part 1)
- The file contains a list of famous people, each with a brief overview.
- The goal of part 1) is to provide the capability to:
  - Take one person from the list as input and output the 10 other people whose overviews are "closest" to that person's in a Natural Language Processing sense
  - Output the sentiment of that person's overview

Part 2)
- For the same person from part 1), use the Wikipedia API to access the whole content of that person's Wikipedia page.
- The goal of part 2) is to provide the capability to:
  1. Determine the sentiment of the entire Wikipedia page
  1. Print out the Wikipedia article
  1. Collect the Wikipedia pages of the 10 nearest neighbors from part 1)
  1. Determine the nearness ranking of these 10 to the main subject based on their entire Wikipedia pages
  1. Compare the nearness ranking from part 1) with the Wikipedia page nearness ranking


## Imports

In [1]:
import numpy as np
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import NearestNeighbors
import random
import re
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
! python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


## File path & read data

In [3]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'
data1 = pd.read_csv(url)
data1.head()

Unnamed: 0,URI,name,text
0,<http://dbpedia.org/resource/Digby_Morrell>,Digby Morrell,digby morrell born 10 october 1979 is a former...
1,<http://dbpedia.org/resource/Alfred_J._Lewy>,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from un...
2,<http://dbpedia.org/resource/Harpdog_Brown>,Harpdog Brown,harpdog brown is a singer and harmonica player...
3,<http://dbpedia.org/resource/Franz_Rottensteiner>,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lowe...
4,<http://dbpedia.org/resource/G-Enka>,G-Enka,henry krvits born 30 december 1974 in tallinn ...


## Part 1

## Data Cleaning

In [4]:
data_copy = data1.copy() # Work on a copy so the original dataset stays untouched
df = data_copy.sample(frac=0.1).reset_index() # Keep a random 10% sample and reset the index for easier positional lookup
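
The sampling step above is not seeded, and the standard-library `random.seed` call used later in the notebook does not influence pandas. A minimal sketch of a reproducible sample follows, assuming a fixed `random_state` is acceptable for this project; the toy DataFrame and the value `10` are placeholders, and `drop=True` is an optional choice that discards the old index instead of keeping it as a column.

```python
import random
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({"name": [f"person_{i}" for i in range(100)]})

# Same random_state -> same rows, whatever the stdlib seed is
random.seed(1)
first = toy.sample(frac=0.1, random_state=10).reset_index(drop=True)
random.seed(2)
second = toy.sample(frac=0.1, random_state=10).reset_index(drop=True)

print(first.equals(second))
```

Seeding the sample would change which rows are drawn, so the outputs printed in this notebook would not be reproduced exactly.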

In [5]:
def process_data(text):
  '''
  Lower-case the text, strip every character that is not a word character or whitespace (i.e. punctuation), and return the result as a TextBlob
  '''
  text = text.lower()
  text = re.sub(r"[^\w\s]", "", text)
  text = TextBlob(text)

  return text
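
Because the pattern deletes punctuation outright, words separated only by a period or hyphen get fused, which is visible in the cleaned output as tokens such as `christianling` and `27besides`. The sketch below contrasts the pattern used in `process_data` with a hypothetical variant that substitutes a space instead; the sample sentence and both helper names are made up for illustration.

```python
import re

def strip_punctuation(text):
    # Same pattern as process_data: punctuation is deleted outright
    return re.sub(r"[^\w\s]", "", text.lower())

def space_punctuation(text):
    # Hypothetical variant: replace punctuation with a space,
    # then collapse runs of whitespace
    spaced = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", spaced).strip()

sample = "He is a Christian.Ling graduated at the age of 27.Besides, Jean-Pierre"
print(strip_punctuation(sample))
print(space_punctuation(sample))
```

The trade-off is that hyphenated names such as `jean-pierre` become two tokens rather than one fused token.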

In [6]:
df['text'] = df['text'].apply(process_data) # Apply the function to our dataset

In [8]:
df['text'].loc[1]  # Display some data to make sure the function cleaned the data

TextBlob("alan ling sie kiong chinese pinyin ln s jin born 19 february 1983 in sibu sarawak is a malaysian lawyer and politician he is currently holding office of dap sarawak secretary he spent his childhood in nanga medamit limbang sarawak where his parents started their first business venture by operating a small grocery shops next to a river in the rural area ling received his primary education in srb chung hua limbang north school miri and secondary school in smk st joseph miri he is a christianling graduated from the university of sheffield with a bachelor of laws honours degree he is an advocate solicitor of the high court in malaya and high court in sabah and sarawakling became the partner of an established law firm suhaili ling advocates in miri at the age of 27besides legal qualification he is a licensed auctioneer company secretary and also actively involved in business field even during his university time during his stay in united kingdom he was the president of kpum malays

In [51]:
def singularize_text(text):
  '''
  Lemmatize each whitespace-separated word and return the words joined back into a single string
  '''

  lemmatizer = WordNetLemmatizer()
  words = text.split()  # Split the text into words
  singularized_words = [lemmatizer.lemmatize(word) for word in words]

  return ' '.join(singularized_words)

In [10]:
df['text'] = df['text'].apply(singularize_text)

### Bag of words

In [11]:
vectorizer = CountVectorizer(stop_words='english') # Transform the text into a bag of words
bow = vectorizer.fit_transform(df["text"])
bow = bow.toarray() # Convert the sparse result to a dense NumPy array

### TF-IDF

In [12]:
# Perform the TF-IDF transformation - (CountVectorizer + TfidfTransformer)
tf_idf_tran = TfidfTransformer()
tf_idf = tf_idf_tran.fit_transform(bow)
tf_idf = tf_idf.toarray()
tf_idf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
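
The two-step route above (CountVectorizer followed by TfidfTransformer) can also be written with `TfidfVectorizer`, which is already imported. The sketch below checks on a tiny made-up corpus that both routes give the same matrix with default settings; it also keeps the result sparse, which may matter with a vocabulary of tens of thousands of columns.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Tiny made-up corpus, only for illustration
docs = [
    "flute soloist and chamber musician",
    "pianist and chamber musician from hong kong",
    "lawyer and politician",
]

# Two-step route, as in the cells above
counts = CountVectorizer(stop_words="english").fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route; the result stays a SciPy sparse matrix
one_step = TfidfVectorizer(stop_words="english").fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
print(one_step.shape)
```

`NearestNeighbors` can be fitted on a sparse matrix directly, so the `.toarray()` calls are a convenience for inspection rather than a requirement.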

### K-Nearest Neighbor

In [13]:
random.seed(10) # Seed the random module so the same index is drawn on every run
random_index = random.randint(0, len(df) - 1) # Draw a random row position in the sampled dataset
random_person = df.iloc[random_index]['name'] # Look up the name of the person at that position
random_index, random_person

(266, 'Marya Martin')

In [15]:
sent0 = np.array([tf_idf[random_index]]) # TF-IDF row of the selected person, shaped (1, n_features) for kneighbors
sent0.shape

(1, 84936)

In [16]:
kn = NearestNeighbors().fit(tf_idf) # Fit Nearest Neighbors on the tf_idf matrix

In [17]:
distances, indices = kn.kneighbors(
  X = sent0,
  n_neighbors = 11, # K = 11: the query person itself plus its 10 closest neighbors
)


In [18]:
distances # Display the distance

array([[0.        , 1.20012961, 1.20355055, 1.21317026, 1.21488492,
        1.24856663, 1.24914118, 1.2528284 , 1.25389802, 1.26453273,
        1.26577384]])
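
All of these distances sit between 1.2 and 1.27 for a reason: `TfidfTransformer` L2-normalizes each row by default, and `NearestNeighbors` uses Euclidean distance by default. For unit vectors the Euclidean distance is a function of cosine similarity, so a distance of about 1.20 corresponds to a cosine similarity of roughly 0.28, and no distance can exceed sqrt(2) ≈ 1.414 when all entries are non-negative. The sketch below verifies the identity on two made-up vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up non-negative vectors, L2-normalized like TF-IDF rows
a = rng.random(5)
a /= np.linalg.norm(a)
b = rng.random(5)
b /= np.linalg.norm(b)

euclidean = np.linalg.norm(a - b)
cosine_similarity = float(a @ b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(euclidean, np.sqrt(2 - 2 * cosine_similarity)))
```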

In [19]:
# Names of the selected person and its 10 neighbors; the `if i in indices[0]` filter returns them in DataFrame order, not ranked by distance
group_of_p = [ x for i,x in enumerate(df['name']) if i in indices[0] ]
group_of_p

['Marya Martin',
 'Trey Lee Chui-yee',
 'Alec Chien',
 'Kathryn Selby',
 'Clark Ross',
 'Sophia Yan',
 'Paul Hostetter',
 'Jeffrey Milarsky',
 'Yoko Misumi',
 'Joseph Twist',
 'Chan Ka Nin']
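
The list above follows row order in `df`, not nearness. To list neighbors from nearest to farthest, the names can be looked up in the order given by `indices[0]`, which is how the URI array in Part 2 is indexed. A minimal sketch with made-up names and a made-up `kneighbors`-style result:

```python
import numpy as np

# Made-up names and kneighbors-style output, for illustration only
names = ["Ana", "Ben", "Cleo", "Dev", "Eli"]
indices = np.array([[3, 0, 4]])  # query row first, then nearest to farthest

# Membership filter: result follows the order of `names`
by_position = [x for i, x in enumerate(names) if i in indices[0]]

# Lookup by index: result follows the order of `indices`, i.e. the ranking
by_rank = [names[i] for i in indices[0]]

# Drop the query row itself to keep only the neighbors
neighbors_only = by_rank[1:]

print(by_position)  # ['Ana', 'Dev', 'Eli']
print(by_rank)      # ['Dev', 'Ana', 'Eli']
```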

### Sentiment of the overview of the person selected

In [20]:
[ i for i,x in enumerate(df['text']) if i in indices[0] ] # Taking a look at the indices

[266, 484, 594, 1080, 1459, 1805, 1836, 2739, 2899, 3617, 4019]

In [21]:
# np.array(df['text'])[indices]

In [49]:
# Sentiment of the selected person (first) and of each neighbor, numbered from 1
for count, index in enumerate(indices[0], start=1):
  text = df.iloc[index]['text']
  sentiment_score = TextBlob(text).sentiment
  print(f" {count} Person sentiment scores are {sentiment_score}")

 1 Person sentiment scores are Sentiment(polarity=0.14822180134680138, subjectivity=0.3522516835016835)
 2 Person sentiment scores are Sentiment(polarity=0.08060146923783286, subjectivity=0.3457759412304866)
 3 Person sentiment scores are Sentiment(polarity=0.07588126159554731, subjectivity=0.2563955885384457)
 4 Person sentiment scores are Sentiment(polarity=0.15024891774891772, subjectivity=0.41317099567099563)
 5 Person sentiment scores are Sentiment(polarity=0.13716690716690716, subjectivity=0.29245297911964574)
 6 Person sentiment scores are Sentiment(polarity=0.030176767676767673, subjectivity=0.3515151515151515)
 7 Person sentiment scores are Sentiment(polarity=0.12024410774410772, subjectivity=0.26578282828282823)
 8 Person sentiment scores are Sentiment(polarity=0.09627840909090911, subjectivity=0.33225815850815854)
 9 Person sentiment scores are Sentiment(polarity=0.14750967117988392, subjectivity=0.3323823339780787)
 10 Person sentiment scores are Sentiment(polarity=0.230303

## Part 2 - Using the Wikipedia API

In [23]:
%%capture output
# Install the Wikipedia API package
!pip3 install wikipedia-api

In [24]:
import wikipediaapi

In [25]:
np.array(df['URI'])[indices]

array([['<http://dbpedia.org/resource/Marya_Martin>',
        '<http://dbpedia.org/resource/Sophia_Yan>',
        '<http://dbpedia.org/resource/Kathryn_Selby>',
        '<http://dbpedia.org/resource/Paul_Hostetter>',
        '<http://dbpedia.org/resource/Yoko_Misumi>',
        '<http://dbpedia.org/resource/Alec_Chien>',
        '<http://dbpedia.org/resource/Joseph_Twist>',
        '<http://dbpedia.org/resource/Jeffrey_Milarsky>',
        '<http://dbpedia.org/resource/Trey_Lee_Chui-yee>',
        '<http://dbpedia.org/resource/Clark_Ross>',
        '<http://dbpedia.org/resource/Chan_Ka_Nin>']], dtype=object)

In [26]:
topic = random_person
wikipedia = wikipediaapi.Wikipedia(user_agent = 'salcocho')
page_ex = wikipedia.page(topic)
wiki_text = page_ex.text
wiki_text

"Marya Martin is an American flautist, soloist, recitalist, and chamber musician.\nBorn Mary Martin in New Zealand, Martin studied at the University of Auckland, where she had lessons with Richard Giese, then principal flute in the New Zealand Symphony Orchestra. After graduating in 1976, Martin was awarded a Queen Elizabeth II Arts Council grant to study at Yale University. In 1979, Martin graduated from Yale with a master's degree in flute performance.  She shortly thereafter won the 1979 Young Concert Artists International Auditions. She went on to win top prizes in the Naumburg Competition, the Munich International Competition, the Jean-Pierre Rampal International Competition, and the Concert Artists Guild—all within a two-year period. To date, Martin is the only flutist to take top prizes in all of these major competitions. In 1980, Martin made her New York concert debut. Following these successes, Martin moved to Paris to study with Jean-Pierre Rampal at the Nationale Superieur C

In [27]:
clean_data = process_data(wiki_text)
clean_data

TextBlob("marya martin is an american flautist soloist recitalist and chamber musician
born mary martin in new zealand martin studied at the university of auckland where she had lessons with richard giese then principal flute in the new zealand symphony orchestra after graduating in 1976 martin was awarded a queen elizabeth ii arts council grant to study at yale university in 1979 martin graduated from yale with a masters degree in flute performance  she shortly thereafter won the 1979 young concert artists international auditions she went on to win top prizes in the naumburg competition the munich international competition the jeanpierre rampal international competition and the concert artists guildall within a twoyear period to date martin is the only flutist to take top prizes in all of these major competitions in 1980 martin made her new york concert debut following these successes martin moved to paris to study with jeanpierre rampal at the nationale superieur conservatoire de par

In [28]:
group_of_p

['Marya Martin',
 'Trey Lee Chui-yee',
 'Alec Chien',
 'Kathryn Selby',
 'Clark Ross',
 'Sophia Yan',
 'Paul Hostetter',
 'Jeffrey Milarsky',
 'Yoko Misumi',
 'Joseph Twist',
 'Chan Ka Nin']

In [29]:
data = []

In [30]:
for i in group_of_p:
  # Fetch the Wikipedia page for this person
  group = wikipedia.page(i)
  # Store the name and the full page text as one row
  data_row = {'title': i, 'text': group.text}

  data.append(data_row)
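
A name from the dataset may not match a Wikipedia page title, in which case the page text comes back empty and silently weakens the later comparison. The sketch below guards the loop with the `exists()` method that wikipedia-api page objects provide; `collect_pages` and `FakePage` are hypothetical helpers, and the fake page is only there so the example runs offline.

```python
def collect_pages(names, get_page):
    """Return (rows, missing): page text per name, plus names with no page."""
    rows, missing = [], []
    for name in names:
        page = get_page(name)
        if page.exists():  # wikipedia-api pages expose exists()
            rows.append({"title": name, "text": page.text})
        else:
            missing.append(name)
    return rows, missing


# Stand-in page object so the sketch runs offline (hypothetical)
class FakePage:
    def __init__(self, text):
        self.text = text

    def exists(self):
        return bool(self.text)


fake_site = {"Marya Martin": "Marya Martin is an American flautist."}
rows, missing = collect_pages(
    ["Marya Martin", "No Such Person"],
    lambda name: FakePage(fake_site.get(name, "")),
)
print([r["title"] for r in rows], missing)
```

In the notebook this would be called as `collect_pages(group_of_p, wikipedia.page)`.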

In [31]:
rsult_df = pd.DataFrame(data)

In [32]:
rsult_df

Unnamed: 0,title,text
0,Marya Martin,"Marya Martin is an American flautist, soloist,..."
1,Trey Lee Chui-yee,Trey Lee (Chinese:李垂誼; pinyin: Chui-yee Lee) i...
2,Alec Chien,Alec Chien is a pianist from Hong Kong.\nBorn ...
3,Kathryn Selby,Kathryn Shauna Selby AM (born 1962) is an Aust...
4,Clark Ross,"Clark Winslow Ross is a Canadian composer, gui..."
5,Sophia Yan,"Sophia Yan (嚴倩君, pinyin: Yán Qiànjūn, b. Octob..."
6,Paul Hostetter,"Paul Hostetter is an American conductor, the E..."
7,Jeffrey Milarsky,Jeffrey Milarsky is a conductor of contemporar...
8,Yoko Misumi,"Yoko Misumi (三角ようこ, Misumi Yōko) is a Japanese..."
9,Joseph Twist,Joseph Edward Twist (born 1982) is an Australi...


### Clean Data

In [33]:
rsult_df['text'] = rsult_df['text'].apply(process_data)

In [34]:
rsult_df['text'] = rsult_df['text'].apply(singularize_text)

### BoW

In [35]:
vectorizer2 = CountVectorizer(stop_words='english')
bow2 = vectorizer2.fit_transform(rsult_df['text'])
bow2 = bow2.toarray()
bow2

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 1, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

### TF-IDF

In [36]:
# Perform the TF-IDF transformation - (CountVectorizer + TfidfTransformer)
tf_idf_tran2 = TfidfTransformer()
tf_idf2 = tf_idf_tran2.fit_transform(bow2)
tf_idf2 = tf_idf2.toarray()
tf_idf2

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.13169949, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.05296468, ..., 0.        , 0.05296468,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.03240923, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

### K-Nearest Neighbor



In [37]:
kn2 = NearestNeighbors().fit(tf_idf2)

In [38]:
subject = np.array([tf_idf2[2]]) # NOTE: row 2 of rsult_df is 'Alec Chien'; the main subject 'Marya Martin' is row 0 (tf_idf2[0])
subject.shape

(1, 1318)

In [39]:
distances2, indices2 = kn2.kneighbors(
  X = subject,
  n_neighbors = 11,
)

In [40]:
distances2

array([[0.        , 1.33117378, 1.33404489, 1.34857411, 1.35386852,
        1.35528678, 1.35823684, 1.35970446, 1.36379427, 1.3713848 ,
        1.37367806]])

In [41]:
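# Titles filtered by membership in indices2, so they appear in rsult_df order rather than distance order
new_group = [ x for i,x in enumerate(rsult_df['title']) if i in indices2[0] ]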
new_group

['Marya Martin',
 'Trey Lee Chui-yee',
 'Alec Chien',
 'Kathryn Selby',
 'Clark Ross',
 'Sophia Yan',
 'Paul Hostetter',
 'Jeffrey Milarsky',
 'Yoko Misumi',
 'Joseph Twist',
 'Chan Ka Nin']

In [42]:
np.array(rsult_df['title'])[indices2]

array([['Alec Chien', 'Kathryn Selby', 'Sophia Yan', 'Yoko Misumi',
        'Trey Lee Chui-yee', 'Chan Ka Nin', 'Marya Martin',
        'Joseph Twist', 'Clark Ross', 'Paul Hostetter',
        'Jeffrey Milarsky']], dtype=object)

In [43]:
# Sorted distances from Part 1 (overviews) and Part 2 (full Wikipedia pages), side by side
data_to_compare = {'distance_1': distances.squeeze(), 'distance_2': distances2.squeeze()}
df2 = pd.DataFrame(data_to_compare)
df2

Unnamed: 0,distance_1,distance_2
0,0.0,0.0
1,1.20013,1.331174
2,1.203551,1.334045
3,1.21317,1.348574
4,1.214885,1.353869
5,1.248567,1.355287
6,1.249141,1.358237
7,1.252828,1.359704
8,1.253898,1.363794
9,1.264533,1.371385
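
The table pairs two independently sorted distance arrays, so a given row does not necessarily describe the same person in both columns. Comparing the rankings themselves means aligning them by name. A minimal sketch, assuming each ranking is a list of names ordered nearest first; `rank_shifts` is a hypothetical helper and the rankings are placeholders, not results from this notebook.

```python
def rank_shifts(ranking_a, ranking_b):
    """Map each name to (rank in a, rank in b, shift), ranks starting at 1."""
    pos_b = {name: r for r, name in enumerate(ranking_b, start=1)}
    return {
        name: (r, pos_b[name], pos_b[name] - r)
        for r, name in enumerate(ranking_a, start=1)
    }


# Placeholder rankings, nearest first (not results from this notebook)
overview_rank = ["A", "B", "C", "D"]
wiki_rank = ["B", "A", "C", "D"]

for name, (r1, r2, shift) in rank_shifts(overview_rank, wiki_rank).items():
    print(f"{name}: overview rank {r1}, Wikipedia rank {r2}, shift {shift:+d}")
```

A shift of 0 means the person holds the same position in both rankings; positive values mean the person ranks farther away on the full Wikipedia page than on the overview.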
