# Wikipedia

### Wikipedia API

If you intend to run scraping projects or automated requests, consider alternatives such as Pywikipediabot (now Pywikibot) or the MediaWiki API itself, which offer more features than the `wikipedia` package used here.

* wikipedia.search('keywords', results=2)
* wikipedia.suggest('keyword')
* wikipedia.summary('keywords', sentences=2)
* wikipedia.page('keywords')
* wikipedia.page('keywords').content
* wikipedia.page('keywords').references
* wikipedia.page('keywords').title
* wikipedia.page('keywords').url
* wikipedia.page('keywords').categories
* wikipedia.page('keywords').links
* wikipedia.geosearch(33.2075, 97.1526)
* wikipedia.set_lang('hi')
* wikipedia.languages()
* wikipedia.page('keywords').images[0]
* wikipedia.page('keywords').html()

An API (Application Programming Interface) is like a messenger that allows different software systems to talk to each other and exchange information or functionality. It's a set of rules and specifications that define how components or systems should interact.
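To make the "set of rules" concrete, here is a minimal sketch of what a search request to the MediaWiki Action API looks like as a URL. The helper name `build_search_url` is hypothetical and only builds the request; it does not send it. The parameters (`action=query`, `list=search`, `srsearch`, `srlimit`, `format=json`) are the ones the MediaWiki search module documents.

```python
from urllib.parse import urlencode

# Public endpoint of the English Wikipedia's MediaWiki Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_url(query: str, limit: int = 2) -> str:
    """Return the URL of a full-text search request (hypothetical helper)."""
    params = {
        "action": "query",   # use the query module
        "list": "search",    # full-text search submodule
        "srsearch": query,   # search terms
        "srlimit": limit,    # maximum number of hits
        "format": "json",    # ask for a JSON response
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_search_url("Titanic", limit=2)
print(url)
```

Fetching that URL with any HTTP client should return JSON containing the matching page titles, which is roughly the information `wikipedia.search('Titanic', results=2)` exposes as a Python list.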

In [None]:
# pip install wikipedia

In [None]:
# https://kleiber.me/blog/2017/07/22/tutorial-lda-wikipedia/
import pandas as pd
import random
import wikipedia

# rtitles = wikipedia.random(5)

# look up the best-matching Wikipedia page title for each keyword
titles = []
keywords = ['Titanic', 'JP Morgan', 'immigration', 'suffrage', 'racist']
for key in keywords:
    results = wikipedia.search(key, results=1)
    if results:  # a search may return no hits
        titles.append(results[0])

print(titles)
data = []

for title in titles:
    # If the title is ambiguous, fall back to a randomly chosen option.
    try:
        page = wikipedia.page(title, auto_suggest=False)
    except wikipedia.exceptions.DisambiguationError as e:
        page = wikipedia.page(random.choice(e.options), auto_suggest=False)
    # Summarize the page that was actually fetched, so content and summary match.
    summary = wikipedia.summary(page.title, auto_suggest=False, sentences=5)
    data.append([title, page.content, summary])

df = pd.DataFrame(data, columns=['title', 'content', 'summary'])
df.head()

['Titanic', 'J. P. Morgan', 'Immigration', 'Suffrage', 'Racism']


| | title | content | summary |
|---|---|---|---|
| 0 | Titanic | RMS Titanic was a British passenger liner, ope... | RMS Titanic was a British passenger liner, ope... |
| 1 | J. P. Morgan | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | John Pierpont Morgan Sr. (April 17, 1837 – Mar... |
| 2 | Immigration | Immigration is the international movement of p... | Immigration is the international movement of p... |
| 3 | Suffrage | Suffrage, political franchise, or simply franc... | Suffrage, political franchise, or simply franc... |
| 4 | Racism | Racism is discrimination and prejudice towards... | Racism is discrimination and prejudice towards... |


### LDA (Latent Dirichlet Allocation)

Latent Dirichlet Allocation (LDA) is a statistical model used in natural language processing (NLP) and machine learning (ML) to uncover hidden thematic structures within a collection of documents or texts. It assumes that each document is a mixture of various topics, and each topic is characterized by a distribution of words.

In simpler terms, LDA helps us understand the main themes or subjects that connect different documents in a dataset. For example, if we apply LDA to a collection of news articles, it might identify topics like "politics," "sports," or "technology" and show us which articles belong to each topic.
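The "mixture of topics" assumption can be sketched as a small generative simulation. This is an illustration only: the two topics and their word probabilities are hypothetical, and in the full model the topic-word distributions are themselves drawn from a Dirichlet prior rather than fixed by hand.

```python
import random

rng = random.Random(0)

# Hypothetical topics: each one is a probability distribution over words.
topics = {
    "ships": {"ship": 0.5, "ocean": 0.3, "voyage": 0.2},
    "voting": {"vote": 0.5, "suffrage": 0.3, "election": 0.2},
}


def sample_dirichlet(alpha, size):
    """Draw a probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha=0.5):
    """Generate one document following LDA's generative story."""
    names = list(topics)
    # 1. Draw the document's topic mixture.
    mixture = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the mixture.
        topic = rng.choices(names, weights=mixture)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return dict(zip(names, mixture)), words


mixture, words = generate_document(n_words=8)
print(mixture)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.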

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['summary'].values.astype('U'))

model = LatentDirichletAllocation(n_components=5, random_state=42)
model.fit(vectors)

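feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last five indices are the highest-weighted words
    top_words = [feature_names[i] for i in topic.argsort()[-5:]]
    print(f'Topic {index} top words: {top_words}')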

Topic 0 top words: ['active', 'passive', 'vote', 'right', 'suffrage']
Topic 1 top words: ['sinking', 'titanic', 'operated', 'ship', 'ocean']
Topic 2 top words: ['wall', 'including', 'american', 'street', 'morgan']
Topic 3 top words: ['racism', 'racist', 'ideology', 'practices', 'social']
Topic 4 top words: ['natives', 'migration', 'effects', 'countries', 'immigration']


In [None]:
topic_results = model.transform(vectors)
df['topic'] = topic_results.argmax(axis=1)
df.head()

| | title | content | summary | topic |
|---|---|---|---|---|
| 0 | Titanic | RMS Titanic was a British passenger liner, ope... | RMS Titanic was a British passenger liner, ope... | 1 |
| 1 | J. P. Morgan | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | 2 |
| 2 | Immigration | Immigration is the international movement of p... | Immigration is the international movement of p... | 4 |
| 3 | Suffrage | Suffrage, political franchise, or simply franc... | Suffrage, political franchise, or simply franc... | 0 |
| 4 | Racism | Racism is discrimination and prejudice towards... | Racism is discrimination and prejudice towards... | 3 |


### NMF (Non-Negative Matrix Factorization)

Non-negative matrix factorization (NMF) is a machine learning technique that approximates a large non-negative matrix as the product of two smaller matrices. Its key constraint is that every entry in these matrices must be positive or zero.

This method is particularly useful for finding hidden patterns in data, like identifying topics in a collection of documents or discovering features in images. Because the numbers are non-negative, the resulting patterns are often easier to interpret and understand compared to other matrix factorization methods.
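As a rough illustration of the factorization, the sketch below approximates a toy document-term matrix `V` by `W @ H` using the classic Lee and Seung multiplicative updates. The matrix values, the number of topics `k`, and the iteration count are assumptions made for this example, and this is not necessarily the solver scikit-learn uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term matrix: 4 documents x 5 terms, all entries non-negative.
V = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 3.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
])

k = 2  # number of topics (assumed)
W = rng.random((V.shape[0], k)) + 0.1  # document-topic weights
H = rng.random((k, V.shape[1])) + 0.1  # topic-term weights
eps = 1e-10  # avoids division by zero

initial_error = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates for the Frobenius-norm objective; because they
    # only multiply by non-negative ratios, W and H stay non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(V - W @ H)

print(f"reconstruction error: {initial_error:.3f} -> {final_error:.3f}")
print("all entries non-negative:", bool((W >= 0).all() and (H >= 0).all()))
```

Each row of `H` can be read as a topic (weights over terms) and each row of `W` as a document's weights over those topics, which is why the largest entry in a row of `W` is a natural topic assignment.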

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['summary'].values.astype('U'))
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
tfidf = pd.DataFrame(denselist, columns=feature_names)
tfidf.head()

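| | 147 | 15 | 17 | 1837 | 1912 | 1913 | 19th | 20th | 21 | 224 | ... | voting | voyage | wall | wave | western | white | workers | works | world | york |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.095086 | 0.0 | 0.0 | 0.095086 | 0.0 | 0.0 | 0.0 | 0.0 | 0.095086 | ... | 0.0 | 0.095086 | 0.0 | 0.0 | 0.0 | 0.190171 | 0.0 | 0.095086 | 0.0 | 0.095086 |
| 1 | 0.0 | 0.0 | 0.106841 | 0.106841 | 0.0 | 0.106841 | 0.106841 | 0.106841 | 0.106841 | 0.0 | ... | 0.0 | 0.0 | 0.213682 | 0.106841 | 0.106841 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.087722 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.087722 | 0.0 | 0.087722 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.098268 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |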


In [None]:
print(tfidf.shape)

(5, 258)
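Each of the 5 rows is one summary and each of the 258 columns is one vocabulary term. The sketch below shows, on a tiny stand-in corpus, how weights of this kind arise under scikit-learn's default scheme (smoothed idf followed by L2 row normalization). Tokenization is simplified to whitespace splitting, so this is an approximation of `TfidfVectorizer`, not a replacement for it.

```python
import math
from collections import Counter

# Tiny stand-in corpus (not the Wikipedia summaries used above).
docs = [
    "ship ocean ship",
    "vote suffrage vote",
    "ship vote",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df_counts = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed idf, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]      # term frequency * idf
    norm = math.sqrt(sum(x * x for x in raw))      # L2 norm of the row
    rows.append([x / norm for x in raw])

print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(x, 3) for x in row])
```

Terms that appear in fewer documents receive a larger idf, and the row normalization is why every non-zero value in the table above lies between 0 and 1.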


Figure 1: Illustration of the NMF model for topic modeling (source: https://arxiv.org/pdf/1706.05084.pdf)

In [None]:
from sklearn.decomposition import NMF

model = NMF(init='random', n_components=10, random_state=42)
model.fit(vectors)

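feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last five indices are the highest-weighted words
    top_words = [feature_names[i] for i in topic.argsort()[-5:]]
    print(f'Topic {index} top words: {top_words}')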

Topic 0 top words: ['disaster', 'white', 'rms', 'ocean', 'ship']
Topic 1 top words: ['racist', 'ideology', 'prejudice', 'practices', 'social']
Topic 2 top words: ['percent', 'migration', 'countries', 'effects', 'immigration']
Topic 3 top words: ['operated', 'line', 'star', 'ship', 'ocean']
Topic 4 top words: ['wall', 'street', 'american', 'including', 'morgan']
Topic 5 top words: ['called', 'active', 'right', 'vote', 'suffrage']
Topic 6 top words: ['ideology', 'racist', 'racism', 'social', 'practices']
Topic 7 top words: ['banking', 'head', '31', 'investment', 'sr']
Topic 8 top words: ['york', 'single', 'passenger', 'sank', 'april']
Topic 9 top words: ['passive', 'elections', 'vote', 'right', 'suffrage']


In [None]:
topic_results = model.transform(vectors)
df['topic'] = topic_results.argmax(axis=1)
df.head()

| | title | content | summary | topic |
|---|---|---|---|---|
| 0 | Titanic | RMS Titanic was a British passenger liner, ope... | RMS Titanic was a British passenger liner, ope... | 8 |
| 1 | J. P. Morgan | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | 7 |
| 2 | Immigration | Immigration is the international movement of p... | Immigration is the international movement of p... | 2 |
| 3 | Suffrage | Suffrage, political franchise, or simply franc... | Suffrage, political franchise, or simply franc... | 9 |
| 4 | Racism | Racism is discrimination and prejudice towards... | Racism is discrimination and prejudice towards... | 1 |


In [None]:
feature_names = vectorizer.get_feature_names_out()
suggested_words = []
for topic_id in df['topic']:
    # Build a search query from the five highest-weighted words of the assigned topic.
    top_words = [feature_names[i] for i in model.components_[topic_id].argsort()[-5:]]
    suggested_words.extend(wikipedia.search(' '.join(top_words), results=5))

suggested_words

['Passengers of the Titanic',
 'Lifeboats of the Titanic',
 'Olympic-class ocean liner',
 'Sinking of the Titanic',
 'Titanic',
 'Al-Rajhi Bank',
 'J. P. Morgan',
 'Bank of America',
 'Lehman Brothers',
 'Wallenberg family',
 'Immigration',
 'Immigration to Sweden',
 'Immigration to the United States',
 'History of immigration to the United States',
 'Immigration to Norway',
 'Suffrage',
 'Voting rights in Belgium',
 "Women's suffrage",
 'Universal suffrage',
 'Non-citizen suffrage',
 'Racism',
 'Woke',
 'Prejudice',
 'Nazism',
 'Social dominance orientation']

In [None]:
# https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

nlp = spacy.load('en_core_web_md')

In [None]:
# doc = nlp(df.loc[0]['content'])
summary_text = ' '.join(df['summary'])
doc = nlp(summary_text)

# keep content words only: skip stop words and punctuation
pos_tag = {'PROPN', 'ADJ', 'NOUN', 'VERB'}
keyword = [
    token.text for token in doc
    if token.text not in STOP_WORDS
    and token.text not in punctuation
    and token.pos_ in pos_tag
]

# count the most frequent words
freq_word = Counter(keyword)
print(freq_word.most_common(5))

# normalize to [0, 1] by dividing by the highest count
max_freq = freq_word.most_common(1)[0][1]
for word in freq_word:
    freq_word[word] /= max_freq

print(freq_word.most_common(5))


[('immigration', 4), ('right', 4), ('vote', 4), ('suffrage', 4), ('ship', 3)]
[('immigration', 1.0), ('right', 1.0), ('vote', 1.0), ('suffrage', 1.0), ('ship', 0.75)]
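These normalized frequencies are the first half of a frequency-based extractive summarizer: each sentence is scored by the summed weights of its words, and the highest-scoring sentences are kept. The sketch below condenses that idea into one self-contained function. It assumes regex tokenization and a tiny hand-written stop-word list in place of spaCy, so its scores will differ from those of the notebook's cells.

```python
import re
from collections import Counter
from heapq import nlargest

# Small hypothetical stop-word list; spaCy's STOP_WORDS is far larger.
STOP = {"the", "a", "an", "of", "in", "on", "is", "was", "to", "and", "it", "that"}


def summarize(text: str, n_sentences: int = 2) -> str:
    """Return the n highest-scoring sentences, in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    if not tokens:
        return ""
    freq = Counter(tokens)
    max_freq = max(freq.values())
    weight = {w: c / max_freq for w, c in freq.items()}  # normalize to [0, 1]

    def score(index: int) -> float:
        words = re.findall(r"[a-z']+", sentences[index].lower())
        return sum(weight.get(w, 0.0) for w in words)

    best = nlargest(n_sentences, range(len(sentences)), key=score)
    return " ".join(sentences[i] for i in sorted(best))


text = (
    "The ship sank in the ocean. "
    "The weather was cold. "
    "The ship was the largest ship afloat."
)
print(summarize(text, n_sentences=2))
```

Because scores are sums rather than averages, longer sentences tend to win. Sorting the selected indices restores document order, which usually reads better than ordering by score.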


In [None]:
# score each sentence by summing the normalized frequencies of its words
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]

print(sent_strength)

{RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, England to New York City, United States.: 8.75, Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it the deadliest sinking of a single ship up to that time.: 3.75, It remains the deadliest peacetime sinking of an ocean liner or cruise ship.: 3.5, The disaster drew public attention, provided foundational material for the disaster film genre, and has inspired many artistic works.
: 4.0, RMS Titanic was the largest ship afloat at the time she entered service and the second of three Olympic-class ocean liners operated by the White Star Line.: 6.5, John Pierpont Morgan Sr. (April 17, 1837 – March 31, 1913) was an American financier and investment banker who dominated corporate finance on Wall Street throughout the Gilded Age.: 5.75, As the head of the banking firm

In [None]:
# keep the ten highest-scoring sentences (ordered by score, not by position)
top_sentences = nlargest(10, sent_strength, key=sent_strength.get)
summary = ' '.join(sent.text for sent in top_sentences)
summary

"RMS Titanic was a British passenger liner, operated by the White Star Line, that sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton, England to New York City, United States. Suffrage, political franchise, or simply franchise, is the right to vote in public, political elections and referendums (although the term is sometimes used for any right to vote). In some languages, and occasionally in English, the right to vote is called active suffrage, as distinct from passive suffrage, which is the right to stand for election. Studies show that the elimination of barriers to migration would have profound effects on world GDP, with estimates of gains ranging between 67 and 147 percent for the scenarios in which 37 to 53 percent of the developing countries' workers migrate to the developed countries. RMS Titanic was the largest ship afloat at the time she entered service and the second of three Olympic-class ocean liners operate

# SPARQL-dataframe

* https://pypi.org/project/SPARQLWrapper/
* https://sparqlwrapper.readthedocs.io/en/latest/main.html
* https://pypi.org/project/sparql-dataframe/

In [None]:
# pip install sparql-dataframe

In [None]:
import sparql_dataframe

endpoint = "http://dbpedia.org/sparql"

q = """
SELECT ?label ?ship ?owner ?status ?port ?route
WHERE
{
 ?ship dbp:shipName ?label .
 ?ship rdf:type dbo:Ship .
 ?ship dbo:owner ?owner FILTER ( ?owner = dbr:White_Star_Line ) .
 ?ship dbo:status ?status .
 ?ship dbp:shipRegistry ?port .
 ?ship dbp:shipRoute ?route .
}
LIMIT 50
"""

sparql_df = sparql_dataframe.get(endpoint, q)
sparql_df.head()

| | label | ship | owner | status | port | route |
|---|---|---|---|---|---|---|
| 0 | SS Arabic | http://dbpedia.org/resource/SS_Arabic_(1881) | http://dbpedia.org/resource/White_Star_Line | Broken up August 1901 | http://dbpedia.org/resource/Rotterdam | `*Liverpool-New York \n*San Francisco-Hong Kong...` |
| 1 | SS Arabic | http://dbpedia.org/resource/SS_Arabic_(1881) | http://dbpedia.org/resource/White_Star_Line | Sold to theHolland America Linein February 1890 | http://dbpedia.org/resource/Rotterdam | `*Liverpool-New York \n*San Francisco-Hong Kong...` |
| 2 | SS Spaarndam | http://dbpedia.org/resource/SS_Arabic_(1881) | http://dbpedia.org/resource/White_Star_Line | Broken up August 1901 | http://dbpedia.org/resource/Rotterdam | `*Liverpool-New York \n*San Francisco-Hong Kong...` |
| 3 | SS Spaarndam | http://dbpedia.org/resource/SS_Arabic_(1881) | http://dbpedia.org/resource/White_Star_Line | Sold to theHolland America Linein February 1890 | http://dbpedia.org/resource/Rotterdam | `*Liverpool-New York \n*San Francisco-Hong Kong...` |
| 4 | SS Arabic | http://dbpedia.org/resource/SS_Arabic_(1881) | http://dbpedia.org/resource/White_Star_Line | Broken up August 1901 | http://dbpedia.org/resource/Liverpool | `*Liverpool-New York \n*San Francisco-Hong Kong...` |
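For readers curious how query results become rows, the sketch below flattens a response in the standard SPARQL 1.1 JSON results format into one dictionary per row. The sample payload is hand-written from values in the table above (with one variable deliberately left unbound), and `bindings_to_rows` is a hypothetical helper, not necessarily how `sparql_dataframe` works internally.

```python
import json

# Hand-written sample in the SPARQL 1.1 JSON results format; the values are
# taken from the table above, not from a live query.
raw = """
{
  "head": {"vars": ["label", "ship", "status"]},
  "results": {"bindings": [
    {"label": {"type": "literal", "value": "SS Arabic"},
     "ship": {"type": "uri", "value": "http://dbpedia.org/resource/SS_Arabic_(1881)"},
     "status": {"type": "literal", "value": "Broken up August 1901"}},
    {"label": {"type": "literal", "value": "SS Spaarndam"},
     "ship": {"type": "uri", "value": "http://dbpedia.org/resource/SS_Arabic_(1881)"}}
  ]}
}
"""


def bindings_to_rows(payload: dict) -> list:
    """Flatten SPARQL JSON bindings into one dict per result row."""
    columns = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Unbound variables are absent from a binding, so fall back to None.
        rows.append({col: binding.get(col, {}).get("value") for col in columns})
    return rows


rows = bindings_to_rows(json.loads(raw))
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame` would give a frame with one column per query variable. Note that several rows can describe the same ship: each combination of matching values (here, two labels and two statuses) yields its own row.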


## Knowledge Graph Extension

https://colab.research.google.com/github/om-hb/kgextension/blob/master/examples/book_genre_prediction.ipynb

The kgextension package lets you access and use Linked Open Data to augment existing datasets. It incorporates knowledge graph information into pandas DataFrames, which can then be used within scikit-learn pipelines.

Its functionality includes:

* Linking datasets to any Linked Open Data (LOD) source such as DBpedia, Wikidata or the EU Open Data Portal
* Generation of new features from the LOD Sources
* Hierarchy-based feature selection algorithms
* Data Integration of features from different sources

https://kgextension.readthedocs.io/en/latest/

In [None]:
# !pip install kgextension

In [None]:
from kgextension.linking_sklearn import DbpediaLookupLinker

linker = DbpediaLookupLinker(column='title')
df_extended = linker.fit_transform(df)
df_extended.head()

| | title | content | summary | topic | new_link |
|---|---|---|---|---|---|
| 0 | Titanic | RMS Titanic was a British passenger liner, ope... | RMS Titanic was a British passenger liner, ope... | 8 | http://dbpedia.org/resource/RMS_Titanic |
| 1 | J. P. Morgan | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | 7 | http://dbpedia.org/resource/J._P._Morgan |
| 2 | Immigration | Immigration is the international movement of p... | Immigration is the international movement of p... | 2 | http://dbpedia.org/resource/Slovenia |
| 3 | Suffrage | Suffrage, political franchise, or simply franc... | Suffrage, political franchise, or simply franc... | 9 | http://dbpedia.org/resource/Women's_suffrage |
| 4 | Racism | Racism is discrimination and prejudice towards... | Racism is discrimination and prejudice towards... | 1 | http://dbpedia.org/resource/Racism |


In [None]:
# https://kgextension.readthedocs.io/en/latest/source/usage_generators.html#specific-relation-generator
from kgextension.generator_sklearn import SpecificRelationGenerator

generator = SpecificRelationGenerator(columns=['new_link'], direct_relation='http://purl.org/dc/terms/subject')
df_extended = generator.fit_transform(df_extended)
df_extended.head()


| | title | content | summary | topic | new_link | `new_link_in_boolean_http://dbpedia.org/resource/Category:Businesspeople_from_Hartford,_Connecticut` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Burials_at_Cedar_Hill_Cemetery_(Hartford,_Connecticut)` | `new_link_in_boolean_http://dbpedia.org/resource/Category:U.S._Steel_people` | `new_link_in_boolean_http://dbpedia.org/resource/Category:20th-century_American_businesspeople` | `new_link_in_boolean_http://dbpedia.org/resource/Category:American_art_collectors` | ... | `new_link_in_boolean_http://dbpedia.org/resource/Category:Countries_in_Europe` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Member_states_of_NATO` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Member_states_of_the_Union_for_the_Mediterranean` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Republics` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Southern_European_countries` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Central_European_countries` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Member_states_of_the_European_Union` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Member_states_of_the_Three_Seas_Initiative` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Women's_suffrage` | `new_link_in_boolean_http://dbpedia.org/resource/Category:Suffrage` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Titanic | RMS Titanic was a British passenger liner, ope... | RMS Titanic was a British passenger liner, ope... | 8 | http://dbpedia.org/resource/RMS_Titanic | | | | | | ... | | | | | | | | | | |
| 1 | J. P. Morgan | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | John Pierpont Morgan Sr. (April 17, 1837 – Mar... | 7 | http://dbpedia.org/resource/J._P._Morgan | True | True | True | True | True | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | Immigration | Immigration is the international movement of p... | Immigration is the international movement of p... | 2 | http://dbpedia.org/resource/Slovenia | False | False | False | False | False | ... | True | True | True | True | True | True | True | True | False | False |
| 3 | Suffrage | Suffrage, political franchise, or simply franc... | Suffrage, political franchise, or simply franc... | 9 | http://dbpedia.org/resource/Women's_suffrage | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | True |
| 4 | Racism | Racism is discrimination and prejudice towards... | Racism is discrimination and prejudice towards... | 1 | http://dbpedia.org/resource/Racism | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
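The boolean columns above can be read as a one-hot encoding of each linked entity's `dcterms:subject` categories. The sketch below reproduces that idea with plain dictionaries. The entity-to-category mapping is a hand-picked subset of the column names shown in the output, and the code illustrates the encoding only; it is not kgextension's implementation.

```python
# Hypothetical mapping from entity URI to a few of its categories; the names
# are a small subset of the columns in the output above.
BASE = "http://dbpedia.org/resource/"
categories = {
    BASE + "J._P._Morgan": [
        BASE + "Category:U.S._Steel_people",
        BASE + "Category:American_art_collectors",
    ],
    BASE + "Women's_suffrage": [
        BASE + "Category:Suffrage",
        BASE + "Category:Women's_suffrage",
    ],
}

# One column per category seen anywhere in the data.
all_categories = sorted({c for cats in categories.values() for c in cats})

# True where the entity belongs to the category, False otherwise.
features = {
    entity: {f"new_link_in_boolean_{c}": c in cats for c in all_categories}
    for entity, cats in categories.items()
}

for entity, row in features.items():
    print(entity)
    for column, value in row.items():
        print("   ", column, value)
```

One difference from the real output is worth noting: the Titanic row above holds missing values rather than `False`, which this sketch does not model.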
