
# Natural Language Processing



This project gives you practical experience with Natural Language Processing techniques. It has three parts:
- in part 1) you will use a traditional dataset in a CSV file
- in part 2) you will use the Wikipedia API to directly access content on Wikipedia
- in part 3) you will make your notebook interactive


### Part 1)



- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview.
- The goal of part 1) is to provide the capability to
  - Take one person from the list as input and output the 10 other people whose overviews are "closest" to that person's overview in a Natural Language Processing sense
  - Also output the sentiment of the overview of the person
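
One common reading of "closest" is cosine similarity between word-count vectors. The sketch below illustrates that idea on four made-up overviews (the names and texts are hypothetical, not rows from NLP.csv) using only the standard library. The notebook itself relies on scikit-learn's vectorizers and `NearestNeighbors` further down.

```python
import math
from collections import Counter

# Hypothetical toy overviews; the real project uses the 'text' column of NLP.csv.
overviews = {
    "Ann": "british jazz drummer and composer",
    "Bob": "scottish jazz pianist and composer",
    "Cy": "football commentator for television",
    "Di": "jazz drummer from london",
}


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest(name, k=10):
    """Return up to k other names, most similar overview first."""
    query = Counter(overviews[name].split())
    scores = {
        other: cosine(query, Counter(text.split()))
        for other, text in overviews.items()
        if other != name
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(closest("Ann", k=2))  # ['Bob', 'Di']
```

"Bob" shares three of five words with "Ann" (similarity 0.6), "Di" shares two, and "Cy" shares none, so "Cy" ranks last.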



### Part 2)



- For the same person from part 1), use the Wikipedia API to access the full content of that person's Wikipedia page.
- The goal of part 2) is to provide the capability to:
  1. Determine the sentiment of the entire Wikipedia page
  1. Print out the Wikipedia article
  1. Collect the Wikipedia pages of the 10 nearest neighbors from part 1)
  1. Rank these 10 by nearness to your main subject based on their entire Wikipedia pages
  1. Compare the nearness ranking from part 1) with the Wikipedia page nearness ranking
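
The last step leaves the comparison method open. One option, sketched here as an assumption rather than the required approach, is to report how far each neighbor moves between the two rankings and to summarize the agreement with Spearman's rank correlation. The names and both rankings below are hypothetical placeholders, and each list is assumed to contain unique names.

```python
def rank_shifts(overview_rank, wiki_rank):
    """Return (name, overview position, wiki position, shift) for each shared name."""
    wiki_pos = {name: i for i, name in enumerate(wiki_rank, start=1)}
    return [
        (name, i, wiki_pos[name], wiki_pos[name] - i)
        for i, name in enumerate(overview_rank, start=1)
        if name in wiki_pos  # skip people whose page could not be fetched
    ]


def spearman_rho(overview_rank, wiki_rank):
    """Spearman rank correlation of two tie-free orderings, over the names they share."""
    shared = set(overview_rank) & set(wiki_rank)
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared names")
    # Re-rank within the shared names so both position sets run 0..n-1.
    pos_a = {name: i for i, name in enumerate(x for x in overview_rank if x in shared)}
    pos_b = {name: i for i, name in enumerate(x for x in wiki_rank if x in shared)}
    d2 = sum((pos_a[name] - pos_b[name]) ** 2 for name in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical rankings, nearest first.
overview_rank = ["A", "B", "C", "D"]
wiki_rank = ["A", "C", "B", "D"]

for row in rank_shifts(overview_rank, wiki_rank):
    print(row)
print(spearman_rho(overview_rank, wiki_rank))
```

A value of 1 means the two rankings agree exactly and -1 means one is the reverse of the other. Here only "B" and "C" swap places, so the correlation stays high.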



### Part 3)


Make an interactive notebook.

In addition to presenting the project slides, at the end of the presentation each student will demonstrate their code on a famous person who is suggested by the other students and exists in the DBpedia set.


# Import libraries

In [None]:
# NOTE: wikipediaapi is provided by the wikipedia-api package; on a fresh
# runtime, run the pip install cell below before this one.
import numpy as np
import pandas as pd
import random
import wikipediaapi

import ipywidgets as widgets
from IPython.display import clear_output, display
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer as BagOfWords
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from textblob import TextBlob, Word


pd.options.display.max_columns = 100


In [None]:
%%capture output
#install Wikipedia API
!pip3 install wikipedia-api

In [None]:
%%capture
# Download corpora
!python -m textblob.download_corpora


In [None]:
import nltk
nltk.download('omw-1.4')


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Load the Data

In [None]:
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'

In [None]:
train_orig = pd.read_csv(url)
train_orig.head()

| | URI | name | text |
|---|---|---|---|
| 0 | <http://dbpedia.org/resource/Digby_Morrell> | Digby Morrell | digby morrell born 10 october 1979 is a former... |
| 1 | <http://dbpedia.org/resource/Alfred_J._Lewy> | Alfred J. Lewy | alfred j lewy aka sandy lewy graduated from un... |
| 2 | <http://dbpedia.org/resource/Harpdog_Brown> | Harpdog Brown | harpdog brown is a singer and harmonica player... |
| 3 | <http://dbpedia.org/resource/Franz_Rottensteiner> | Franz Rottensteiner | franz rottensteiner born in waidmannsfeld lowe... |
| 4 | <http://dbpedia.org/resource/G-Enka> | G-Enka | henry krvits born 30 december 1974 in tallinn ... |


In [None]:
cvs_df = train_orig.copy()

In [None]:
cvs_df.shape

(42786, 3)

In [None]:
cvs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42786 entries, 0 to 42785
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URI     42786 non-null  object
 1   name    42786 non-null  object
 2   text    42786 non-null  object
dtypes: object(3)
memory usage: 1002.9+ KB


In [None]:
cvs_df.iloc[7639].text


'samuel hollingsworth young born december 26 1922 was a us representative from illinoisborn in casey illinois young graduated from urbana high school urbana illinois in 1940 he received an llb from the university of illinois in 1947 and a jd from university of illinois law school in 1948young served in the united states army paratroops from 1943 to 1946 and attained the rank of captain he was admitted to the illinois bar in 1948 and commenced practice in chicago with the united states securities and exchange commission he also served as a lawyer in private practice from 1947 to 1948 young was an instructor in economics at university of illinois and taught business finance at northwestern university from 1949 to 1950young served as securities commissioner of illinois from 1953 to 1955 and as assistant secretary of state from 1955 to 1957 he was financial vice president secretary and treasurer for a hospital supply company from 1965 to 1966 he also served as delegate to the illinois stat

In [None]:
Sampled_df = cvs_df.sample(n=10000, random_state=42)


In [None]:
page_list_orig = Sampled_df['text'].tolist()


### Clean the text: remove unnecessary characters

In [None]:
for i, page in enumerate(page_list_orig):
  page_list_orig[i] = (
    page
    .replace("\n"," ")
    .replace("\'s",'')
    .replace('\'','')
    .replace("(", "")
    .replace(")", "")
    .replace('"', "")
  )


In [None]:
page_list_orig[0]

'tom bancroft born 1967 london is a british jazz drummer and composer he began drumming aged seven and started off playing jazz with his father and identical twin brother phil after studying medicine at cambridge university he spent a year studying composition and arranging at mcgill university in montreal canada qualifying as a doctor in 1992 he then worked as a jazz musician and composer supporting his music income with locum work as a hospital doctor until 1998 when he began starting music related companies he is married to singer gina rae and has two children sam and sophie in 2004 he received the creative scotland awardin 1998 he launched caber music with support from the national lottery fund which went on to release over thirty cds over the next seven years to critical acclaim including two bbc jazz awards for best cd and numerous album of the year placings he has subsequently started the company abc creative music with his twin brother phil bancroft which develops creative musi

### Create text blobs out of the 'text' feature and singularize each word

In [None]:

# Singularize every word of every page so plural and singular forms share one token.
# Note: singularize() also trims some words that are not plurals; in the output
# below, 'his' becomes 'hi', 'as' becomes 'a', and 'has' becomes 'ha'.
page_list_prepped = []

for i, page in enumerate(page_list_orig):
  if (i % 1000) == 0:
    print(i)  # progress indicator
  page_blob = TextBlob(page)
  singularized_sentences = [
    ' '.join(word.singularize() for word in sentence.words)
    for sentence in page_blob.sentences
  ]
  page_list_prepped.append(' '.join(singularized_sentences))

page_list_prepped[0]


0
1000
2000
3000
4000
5000
6000
7000
8000
9000


'tom bancroft born 1967 london is a british jazz drummer and composer he began drumming aged seven and started off playing jazz with hi father and identical twin brother phil after studying medicine at cambridge university he spent a year studying composition and arranging at mcgill university in montreal canada qualifying a a doctor in 1992 he then worked a a jazz musician and composer supporting hi music income with locum work a a hospital doctor until 1998 when he began starting music related company he is married to singer gina ra and ha two child sam and sophie in 2004 he received the creative scotland awardin 1998 he launched caber music with support from the national lottery fund which went on to release over thirty cd over the next seven year to critical acclaim including two bbc jazz award for best cd and numerou album of the year placing he ha subsequently started the company abc creative music with hi twin brother phil bancroft which develop creative music education resource

# Bag of Words (BoW) with Word Vectorization

In [None]:
# Perform the count transformation
BoW =  BagOfWords(stop_words='english')
bow_vec = BoW.fit_transform(page_list_prepped)
bow_vec.toarray()


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
BoW.get_feature_names_out()

array(['00', '000', '00000', ..., 'zyzzyva', 'zyzzyza', 'zz'],
      dtype=object)

## TF-IDF

In [None]:
tf_idf_vec = TfidfTransformer()
tf_idf_pages = tf_idf_vec.fit_transform(bow_vec)
tf_idf_pages.toarray()


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
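
To make the weighting concrete, here is a minimal hand computation on a made-up 3-document, 3-term count matrix. It assumes the default `TfidfTransformer` settings used above (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The counts are hypothetical, not taken from the dataset.

```python
import math

# Hypothetical term counts: rows are documents, columns are terms.
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight the counts by idf, then scale every row to unit Euclidean length.
tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm if norm else 0.0 for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

The middle term occurs in every document, so its idf is exactly 1, the lowest possible weight, while the two rarer terms are weighted up.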

# K Nearest Neighbors

In [None]:
nn = NearestNeighbors().fit(tf_idf_pages)
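
With its default arguments, `NearestNeighbors` measures Euclidean distance. Because the default `TfidfTransformer` scales every row to unit length, that distance is tied to cosine similarity by `d**2 = 2 - 2 * cos`, so ranking by smallest distance gives the same order as ranking by largest cosine similarity. The sketch below checks the identity on two made-up vectors; the helper names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    # For unit-length vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def distance_to_cosine(d):
    """Convert a Euclidean distance between unit vectors to cosine similarity."""
    return 1 - d * d / 2


# Hypothetical vectors standing in for two TF-IDF rows.
a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([2.0, 0.0, 1.0])

d = euclidean(a, b)
c = cosine(a, b)
print(d * d, 2 - 2 * c)       # the two sides of the identity
print(distance_to_cosine(d))  # recovers the cosine similarity
```

Under this identity, a distance of 0 is a cosine similarity of 1 (the document itself), and a distance of `sqrt(2)`, about 1.414, is a similarity of 0 (no shared terms).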


In [None]:
def map_sentiment_to_emoji(polarity):
    if polarity > 0.5:
        return "😄"  # Happy face emoji for positive sentiment
    elif polarity < -0.5:
        return "😞"  # Sad face emoji for negative sentiment
    else:
        return "😐"  # Neutral face emoji for neutral sentiment


In [None]:
# NOTE: compute_nearest_documents is defined in the next cell; run that cell first.
sampled_idx = 453
compute_nearest_documents(sampled_idx)

In [None]:

# Define a function to fetch Wikipedia content
def fetch_wikipedia_content(topic):
    wikip = wikipediaapi.Wikipedia(user_agent='foobar')
    page = wikip.page(topic)
    if page.exists():
        return page.text
    else:
        return None

def compute_nearest_documents(idx):
    distances, indices = nn.kneighbors(tf_idf_pages[idx], n_neighbors=11)
    print(f"Distances : {distances}")
    print("\n\n")
    # Sorting indices based on distances
    sorted_indices = indices[0][np.argsort(distances[0])]
    for i in sorted_indices:
        print(Sampled_df.iloc[i]['name'])
        print(Sampled_df.iloc[i]['URI'])

    # Step 1: Pick the last entry of the sorted results. Because the results are sorted
    # by ascending distance, this is the farthest of the 11 rows; the query row itself
    # is sorted_indices[0] (distance 0).
    highest_ranked_person_idx = sorted_indices[-1]
    highest_ranked_person_name = Sampled_df.iloc[highest_ranked_person_idx]['name']

    # Step 2: Retrieve Wikipedia Content for the Highest Ranked Person
    person_page_content = fetch_wikipedia_content(highest_ranked_person_name)

    if person_page_content:
        # Step 3: Analyze Sentiment for the Highest Ranked Person
        text_blob = TextBlob(person_page_content)
        polarity = text_blob.sentiment.polarity
        emoji = map_sentiment_to_emoji(polarity)  # Map sentiment polarity to emoji
        print(f"{highest_ranked_person_name} overview sentiment {text_blob.sentiment} {emoji}")

        # Step 4: Print Wikipedia Article for the Highest Ranked Person
        print(person_page_content)
    else:
        print(f"Could not retrieve Wikipedia content for {highest_ranked_person_name}")

    # Step 5: Collect Wikipedia Pages of Nearest Neighbors
    nearest_neighbor_pages = []
    for i in sorted_indices[:-1]:  # All rows except the farthest one (this includes the query row)
        neighbor_name = Sampled_df.iloc[i]['name']
        neighbor_page_content = fetch_wikipedia_content(neighbor_name)
        if neighbor_page_content:
            nearest_neighbor_pages.append((neighbor_name, neighbor_page_content))
            # Step 6: Analyze Sentiment for Nearest Neighbors
            text_blob = TextBlob(neighbor_page_content)
            polarity = text_blob.sentiment.polarity
            emoji = map_sentiment_to_emoji(polarity)  # Map sentiment polarity to emoji
            print(f"{neighbor_name} overview sentiment {text_blob.sentiment} {emoji}")

            # Step 7: Print Wikipedia Article for Nearest Neighbors
            print(neighbor_page_content)
        else:
            print(f"Could not retrieve Wikipedia content for {neighbor_name}")

# Extract a list of names from the DataFrame
names = Sampled_df['name'].tolist()

# Select 20 random names
random_names = Sampled_df['name'].sample(n=20, random_state=42).tolist()

# Create a dropdown widget with the random names
name_dropdown = widgets.Dropdown(options=random_names, description='Select a name:')

# Define a function to handle dropdown value changes
def on_dropdown_change(change):
    clear_output(wait=True)  # Clear the output without clearing the input cells
    sampled_idx = names.index(change.new)  # Row position of the selected name in Sampled_df (matches the rows of tf_idf_pages)
    compute_nearest_documents(sampled_idx)  # Call the function with the selected index

# Attach the function to the dropdown's value change event
name_dropdown.observe(on_dropdown_change, names='value')

# Display the dropdown widget
display(name_dropdown)


Distances : [[0.         1.19041485 1.2031643  1.21811312 1.2309121  1.24422873
  1.24578009 1.25271097 1.254712   1.25698337 1.27305932]]



Peter Drury
<http://dbpedia.org/resource/Peter_Drury>
Martin Gillingham
<http://dbpedia.org/resource/Martin_Gillingham>
Rob Walker (sports announcer)
<http://dbpedia.org/resource/Rob_Walker_(sports_announcer)>
Ian Crocker (commentator)
<http://dbpedia.org/resource/Ian_Crocker_(commentator)>
Miles Harrison
<http://dbpedia.org/resource/Miles_Harrison>
Todd Parnell
<http://dbpedia.org/resource/Todd_Parnell>
Dickie Davies
<http://dbpedia.org/resource/Dickie_Davies>
Hosni Zaghdoudi
<http://dbpedia.org/resource/Hosni_Zaghdoudi>
Philip Sharp (referee)
<http://dbpedia.org/resource/Philip_Sharp_(referee)>
Chris Waddle
<http://dbpedia.org/resource/Chris_Waddle>
Ricardo Carvalho
<http://dbpedia.org/resource/Ricardo_Carvalho>
Ricardo Carvalho overview sentiment Sentiment(polarity=0.1340951774285107, subjectivity=0.36038372871706176) 😐
Ricardo Alberto Silveir