# Week 8 - Natural Language Processing

* https://www.nltk.org/index.html
* https://spacy.io/
* https://pypi.org/project/wikipedia/ (PyPI, pronounced "pie pea eye", is the Python Package Index)
* https://kgextension.readthedocs.io/en/latest/

NLTK Downloads

* install nltk: https://pypi.org/project/nltk/
* stopwords: https://pythonspot.com/nltk-stop-words/
* punkt: https://www.nltk.org/api/nltk.tokenize.punkt.html
* wordnet: https://www.tutorialspoint.com/how-to-get-synonyms-antonyms-from-nltk-wordnet-in-python
* averaged_perceptron_tagger: https://morioh.com/p/04a148fa2131

In [1]:
# downloads for processing raw text, meanings, pos tagging, and cleaning
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

### Text Analysis

From https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction :

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

> In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
> * tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators
> * counting the occurrences of tokens in each document
> * normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents

> In this scheme, features and samples are defined as follows:
> * each individual token occurrence frequency (normalized or not) is treated as a feature
> * the vector of all the token frequencies for a given document is considered a multivariate sample
> A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus

> We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
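
The "Bag of n-grams" idea mentioned in the quote can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation; the `ngrams` helper and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# whitespace tokenization keeps the example small
tokens = "i hope to learn how a machine learns".split()

unigrams = Counter(ngrams(tokens, 1))  # single words
bigrams = Counter(ngrams(tokens, 2))   # pairs of adjacent words

print(bigrams[("to", "learn")])
print(len(unigrams), len(bigrams))
```

In scikit-learn the comparable setting is the `ngram_range` parameter of `CountVectorizer`, for example `CountVectorizer(ngram_range=(1, 2))` to count both single words and adjacent pairs.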

### Word Tokens

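Tokens are the individual word occurrences in a corpus, counted every time they appear, repeats included; the distinct words are called types. Word tokenization splits text into these tokens (NLTK's `word_tokenize` also emits punctuation marks as tokens).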

In [2]:
# demonstrate word_tokenize
from nltk.tokenize import word_tokenize

text = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
word_tokenize(text)

['I',
 'love',
 'learning',
 '.',
 'I',
 'have',
 'learned',
 'so',
 'much',
 ',',
 'and',
 'hope',
 'to',
 'learn',
 'more',
 '.',
 'I',
 'also',
 'hope',
 'to',
 'learn',
 'how',
 'a',
 'machine',
 'learns',
 '.']
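
To make the difference between tokens (every occurrence) and types (distinct items) concrete, here is a small sketch that counts both for the output above. The token list is copied from that output so the example does not need NLTK.

```python
from collections import Counter

# tokens copied from the word_tokenize output above
tokens = ['I', 'love', 'learning', '.', 'I', 'have', 'learned', 'so', 'much', ',',
          'and', 'hope', 'to', 'learn', 'more', '.', 'I', 'also', 'hope', 'to',
          'learn', 'how', 'a', 'machine', 'learns', '.']

counts = Counter(tokens)

print(len(tokens))            # number of tokens, repeats included
print(len(counts))            # number of distinct types
print(counts.most_common(3))  # the most frequent tokens
```

The list has 26 tokens but only 19 types, because `I`, `.`, `hope`, `to`, and `learn` each appear more than once.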

In [3]:
# create dataframe from messages
import pandas as pd
from nltk.tokenize import word_tokenize

msgs = [
    'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns',
    'Learning about beautiful mice was so much fun till I learned there was going to be a quiz'
]

df = pd.DataFrame({'msgs': msgs})
print(df)

                                                msgs
0  I love learning. I have learned so much, and h...
1  Learning about beautiful mice was so much fun ...


In [4]:
# demonstrate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
matrix = cv.fit_transform(df['msgs'])
cv_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
cv_df

Unnamed: 0,about,also,and,be,beautiful,fun,going,have,hope,how,...,machine,mice,more,much,quiz,so,there,till,to,was
0,0,1,1,0,0,0,0,1,2,1,...,1,0,1,1,0,1,0,0,2,0
1,1,0,0,1,1,1,1,0,0,0,...,0,1,0,1,1,1,1,1,1,2


### Bag of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

https://en.wikipedia.org/wiki/Bag-of-words_model
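
A minimal sketch of the multiset idea, using `collections.Counter` as the bag. The `bag_of_words` helper and its whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a)
print(a == b)  # word order is discarded, so both sentences give the same bag
```

The two sentences mean different things but produce identical bags, which is exactly the information the model gives up.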

### Stemming

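Stemming reduces a word to its stem by stripping suffixes with heuristic rules, so related forms such as learned and learning map to the same string. The stem need not be a dictionary word (for example, machine becomes machin in the output below).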

In [5]:
# demonstrate stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['learn', 'learned', 'learning', 'learns']

for word in words:
    print(ps.stem(word))

learn
learn
learn
learn
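
To show what "stripping suffixes with rules" looks like, here is a deliberately tiny suffix stripper. It is a toy and not the Porter algorithm; the `toy_stem` name, the suffix list, and the minimum stem length of three are all assumptions made for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["learn", "learned", "learning", "learns", "sing", "news"]:
    print(w, "=>", toy_stem(w))
```

The four forms of learn collapse to the same stem, and sing is left alone because too little would remain. The word news is cut to new, a small example of the over-stemming that purely rule-based approaches can produce.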


In [6]:
# demonstrate stemming and tokenization
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
words = word_tokenize(sentence)
print([ps.stem(word) for word in words])

['i', 'love', 'learn', '.', 'i', 'have', 'learn', 'so', 'much', ',', 'and', 'hope', 'to', 'learn', 'more', '.', 'i', 'also', 'hope', 'to', 'learn', 'how', 'a', 'machin', 'learn', '.']


In [7]:
# demonstrate stemming and tokenization with the Snowball stemmer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

sb = SnowballStemmer(language='english')
sentence = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
words = word_tokenize(sentence)
print([sb.stem(word) for word in words])

['i', 'love', 'learn', '.', 'i', 'have', 'learn', 'so', 'much', ',', 'and', 'hope', 'to', 'learn', 'more', '.', 'i', 'also', 'hope', 'to', 'learn', 'how', 'a', 'machin', 'learn', '.']


### Lemmatization

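Lemmatization maps a word to its dictionary form (lemma) using vocabulary and part-of-speech context, so mice becomes mouse and was becomes be. The example below passes each token's POS tag to the WordNet lemmatizer.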

In [8]:
# demonstrate lemmatization
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = 'learning about beautiful mice was so much fun till I learned there was going to be a quiz'
tokens = word_tokenize(text)
lemma_function = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
    print(f'{token} => {lemma} ({tag_map[tag[0]]})')

learning => learn (v)
about => about (n)
beautiful => beautiful (a)
mice => mouse (n)
was => be (v)
so => so (r)
much => much (a)
fun => fun (n)
till => till (n)
I => I (n)
learned => learn (v)
there => there (n)
was => be (v)
going => go (v)
to => to (n)
be => be (v)
a => a (n)
quiz => quiz (n)


### TF-IDF (Term Frequency Inverse Document Frequency)

* Term frequency vs term usefulness
* A simple frequency count can be misleading because terms that are frequent in one document can also be frequent in every other document
* TF-IDF scores a word in the context of its document as well as the whole corpus; the higher the score, the more useful the word

For example, you are wondering what to take for your electives. You want the class to be good but you also want the class to be relevant to your major. It's easy to see that the class can be:
1. both good and relevant
2. good but not relevant
3. relevant but not good
4. not good and not relevant

In the same way, a term may be:
1. frequently used in your corpus and useful in the analysis of the document it is found in
2. frequently used in your corpus but useless
3. infrequently used in your corpus but useful
4. infrequently used in your corpus and useless

In [9]:
# tf-idf demonstration
from sklearn.feature_extraction.text import TfidfVectorizer

text1 = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
text2 = 'learning about beautiful mice was so much fun till I learned there was going to be a quiz'

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
df

Unnamed: 0,about,also,and,be,beautiful,fun,going,have,hope,how,...,machine,mice,more,much,quiz,so,there,till,to,was
0,0.0,0.223328,0.223328,0.0,0.0,0.0,0.0,0.223328,0.446656,0.223328,...,0.223328,0.0,0.223328,0.1589,0.0,0.1589,0.0,0.0,0.3178,0.0
1,0.253745,0.0,0.0,0.253745,0.253745,0.253745,0.253745,0.0,0.0,0.0,...,0.0,0.253745,0.0,0.180542,0.253745,0.180542,0.253745,0.253745,0.180542,0.50749
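
The scores in the table can be reproduced by hand, which makes the weighting easier to follow. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: lowercased tokens of two or more word characters, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The `tokenize` and `tfidf` helpers are written for this illustration and are not part of any library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs):
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        # term frequency times smoothed idf
        raw = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, tf in c.items()}
        # L2-normalize so each document vector has length 1
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

text1 = 'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns.'
text2 = 'learning about beautiful mice was so much fun till I learned there was going to be a quiz'

rows = tfidf([text1, text2])
print(round(rows[0]["also"], 6), round(rows[0]["much"], 6), round(rows[1]["was"], 6))
```

Words found in only one document (also, was) get a higher idf than words shared by both (much, so, to), which is why much scores lower than also in the first row even though each appears once there.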


### Stop Words

In [10]:
from nltk.corpus import stopwords

# use a separate name so the imported stopwords module is not shadowed
stop_words = set(stopwords.words('english'))

# add words
add_stopwords = ['word1', 'word2']
stop_words = stop_words.union(add_stopwords)

# remove words
remove_stopwords = {'word1', 'word2'}
stop_words = stop_words - remove_stopwords

In [11]:
# https://stackoverflow.com/questions/54366913/removing-stopwords-from-a-pandas-dataframe
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

msgs = [
    'I love learning. I have learned so much, and hope to learn more. I also hope to learn how a machine learns',
    'Learning about beautiful mice was so much fun till I learned there was going to be a quiz'
]

df = pd.DataFrame({'msgs': msgs})
stop_words = set(stopwords.words('english'))
# optional: strip punctuation first so tokens such as 'learning.' can match
# df['msgs'] = df['msgs'].str.replace(r"[^\w\s]", "", regex=True).str.lower()
df['msgs'] = df['msgs'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item.lower() not in stop_words]))
print(df.head())

                                                msgs
0  love learning. learned much, hope learn more. ...
1  learning beautiful mice much fun till learned ...
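
Notice that punctuation stays attached to words in the output above (`learning.`, `much,`, `more.`), because the text is split on whitespace only. A token such as `more.` cannot match a stop-word entry `more`. The sketch below tokenizes with a regular expression before filtering; the small `stop_words` set here is hypothetical and much shorter than NLTK's English list.

```python
import re

# small hypothetical stop-word set for illustration only
stop_words = {"i", "have", "so", "and", "to", "how", "a", "was", "there", "be", "about", "more"}

def remove_stopwords(text, stop_words):
    # keep runs of letters and apostrophes so punctuation is not glued to a word
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

print(remove_stopwords("I love learning. I have learned so much, and hope to learn more.", stop_words))
```

Whether to drop punctuation entirely or keep it as separate tokens depends on the downstream task; this sketch simply discards it.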


# spaCy

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations.

https://spacy.io/usage/linguistic-features

## Part of Speech (POS)

* https://machinelearningknowledge.ai/tutorial-on-spacy-part-of-speech-pos-tagging/

In [12]:
# pip install spacy

The next cell can take a while to run: the en_core_web_md pipeline is roughly a 40 MB download.

In [13]:
# !python -m spacy download en_core_web_md

In [14]:
# en_core_web_md is an English pipeline trained on written web text (blogs, news, comments), 
# that includes vocabulary, syntax, entities, and vectors
import spacy

nlp = spacy.load('en_core_web_md')

In [15]:
# create a spaCy Doc; the u prefix marks a unicode string and is optional in Python 3
doc = nlp(u"I like to read about data analysis everyday. I read a book on knowledge discovery last night.")
print(doc.text)

I like to read about data analysis everyday. I read a book on knowledge discovery last night.


In [16]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

I          PRON     PRP    pronoun, personal
like       VERB     VBP    verb, non-3rd person singular present
to         PART     TO     infinitival "to"
read       VERB     VB     verb, base form
about      ADP      IN     conjunction, subordinating or preposition
data       NOUN     NN     noun, singular or mass
analysis   NOUN     NN     noun, singular or mass
everyday   NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer
I          PRON     PRP    pronoun, personal
read       VERB     VBD    verb, past tense
a          DET      DT     determiner
book       NOUN     NN     noun, singular or mass
on         ADP      IN     conjunction, subordinating or preposition
knowledge  NOUN     NN     noun, singular or mass
discovery  PROPN    NNP    noun, proper singular
last       ADJ      JJ     adjective (English), other noun-modifier (Chinese)
night      NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer

## Named Entity Recognition (NER)

* GPE: Geopolitical entity (countries, cities, states)
* ORG: Organization (companies, agencies, institutions)

In [17]:
doc = nlp(u"I read about data analysis in Denton Texas everyday. I read a book on knowledge discovery last night from WikiData.")
for ent in doc.ents:
    print(ent.text, ent.label_, str(spacy.explain(ent.label_)))

Denton GPE Countries, cities, states
Texas GPE Countries, cities, states
last night TIME Times smaller than a day
WikiData ORG Companies, agencies, institutions, etc.


## Sentence Segmentation

In [18]:
doc = nlp(u"I read about data analysis in Denton Texas everyday. I read a book on knowledge discovery last night from WikiData.")
for sent in doc.sents:
    print(sent)

I read about data analysis in Denton Texas everyday.
I read a book on knowledge discovery last night from WikiData.


## Similarities

In [19]:
tokens = nlp('dog cat banana afskfsd')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov) # oov out-of-vocabulary

dog True 75.254234 False
cat True 63.188496 False
banana True 31.620354 False
afskfsd False 0.0 True
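
The cell above only checks whether each token has a vector. Once vectors are available, spaCy's `similarity` methods compare them, and the spaCy documentation describes the default measure as cosine similarity. The sketch below computes cosine similarity in plain Python on made-up three-dimensional vectors; real `en_core_web_md` vectors have many more dimensions, and returning 0.0 for a zero vector is a choice made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, or 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # nothing to compare, e.g. an out-of-vocabulary token
    return dot / (norm_u * norm_v)

# made-up vectors, for illustration only
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
    "afskfsd": [0.0, 0.0, 0.0],
}

for word in ["cat", "banana", "afskfsd"]:
    print("dog", word, round(cosine_similarity(vectors["dog"], vectors[word]), 3))
```

With the loaded pipeline, the equivalent call would be along the lines of `tokens[0].similarity(tokens[1])`.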


## Topic Modeling

Topic modeling helps us:

* discover hidden, or latent, topics, or themes in documents
* summarize documents
* search for similar documents
* classify documents

A document consists of topics, and topics consist of words. The same word can be part of multiple topics, and one topic can be part of multiple documents. We can assign a probability to how relevant a word is to a topic, and that probability can be larger or smaller in another topic; the same holds for the relationship between topics and documents. This lets us use topics, and the words in them, for knowledge discovery without reading the entire document.

Latent topics are like clusters, which we’ll cover a little more next week. Because they’re latent, we don’t know at first what the theme of a topic is, so we just say that this group of words belongs to topic 1, that group to topic 2, and so on. At some point we might be able to give a topic a name, but it’s not necessary for our purposes. We want to find documents with similar topics because those topics share the key words we’re looking for. In a sense, we can annotate our documents by topic to optimize our searching.


### Latent Dirichlet Allocation

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar. LDA is an example of a topic model. In this, observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document will contain a small number of topics.

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

* Documents with similar topics use similar groups of words
* Latent topics can be discovered by finding words that frequently occur together in documents
* It's up to the user to label each topic based on its words (a topic with the words Titanic and Carpathia might be labeled Ship Tragedies)
* LDA represents each document as a probability distribution over topics, and each topic as a probability distribution over words
* LDA requires us to choose the number of topics (K) up front; the topics themselves are only numbered, not named
* First, each word in each document is randomly assigned to one of the K topics
* Then, for each word, we find the proportion of words in its document currently assigned to each topic, p(topic t | document d)
* We also find the proportion of assignments to each topic that come from this word, p(word w | topic t)
* The word is then reassigned to a topic chosen with probability proportional to p(topic t | document d) * p(word w | topic t)
* This product reflects how likely it is that topic t generated word w
* These steps are repeated many times until the word-to-topic assignments settle (much like clustering)
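
The reassignment step in the list above can be sketched for a single word. This is a simplified illustration under stated assumptions: it uses raw counts only, whereas a full LDA sampler also applies Dirichlet smoothing and excludes the word's current assignment. The `topic_weights` helper and the counts are made up for the example.

```python
def topic_weights(word, doc_topic_counts, topic_word_counts):
    """Weight each topic by p(topic | document) * p(word | topic), then normalize."""
    doc_total = sum(doc_topic_counts)
    scores = []
    for t, word_counts in enumerate(topic_word_counts):
        p_topic_given_doc = doc_topic_counts[t] / doc_total
        p_word_given_topic = word_counts.get(word, 0) / sum(word_counts.values())
        scores.append(p_topic_given_doc * p_word_given_topic)
    total = sum(scores)
    if total == 0:
        # the word has not been seen in any topic yet: fall back to uniform
        return [1 / len(scores)] * len(scores)
    return [s / total for s in scores]

# K = 2 topics; in this document, 3 words are assigned to topic 0 and 1 word to topic 1
doc_topic_counts = [3, 1]
# how often each word is currently assigned to each topic across the corpus
topic_word_counts = [
    {"titanic": 4, "ship": 6},
    {"titanic": 1, "film": 9},
]

weights = topic_weights("titanic", doc_topic_counts, topic_word_counts)
print([round(w, 3) for w in weights])
```

Here topic 0 scores 0.75 * 0.4 = 0.3 and topic 1 scores 0.25 * 0.1 = 0.025, so after normalizing, the word is far more likely to be reassigned to topic 0. A sampler would draw the new topic from these weights, for example with `random.choices(range(2), weights=weights)`.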

https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

https://highdemandskills.com/topic-modeling-intuitive/

### Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a statistical method that reduces the dimensionality of the input corpus. It factors the non-negative document-term matrix into two smaller non-negative matrices, one relating documents to topics and one relating topics to words, so that words contributing little to a topic's coherence receive comparatively little weight.

https://www.analyticsvidhya.com/blog/2021/06/part-15-step-by-step-guide-to-master-nlp-topic-modelling-using-nmf/ 

* Performs dimensionality reduction and clustering
* Used with TF-IDF
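
To make the factorization concrete, the sketch below approximates a small non-negative matrix V (documents by terms) as the product W H using the classic multiplicative update rules of Lee and Seung. It is a plain-Python illustration, not how scikit-learn's `NMF` is implemented; the helper names, the example matrix, and the small `eps` added to avoid division by zero are assumptions.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, n_iter=200, seed=42, eps=1e-9):
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]  # documents x topics
    h = [[rng.random() for _ in range(m)] for _ in range(k)]  # topics x terms
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h

# 4 documents x 4 terms: the first two documents share terms 0-1, the last two share terms 2-3
v = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]

w, h = nmf(v, k=2)
print(round(frobenius_error(v, w, h), 4))
```

Each row of W says how strongly a document loads on each topic, and each row of H says how strongly each term belongs to a topic, which is what `model.transform(vectors)` and `model.components_` expose in the scikit-learn cells further down.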

In [20]:
# pip install wikipedia

In [21]:
# https://kleiber.me/blog/2017/07/22/tutorial-lda-wikipedia/
import pandas as pd
import random
import wikipedia

titles = wikipedia.random(5)

content = []
for title in titles:
    # handle disambiguation pages by picking one of the suggested options at random
    try:
        content.append([title, wikipedia.page(title).content])
    except wikipedia.exceptions.DisambiguationError as e:
        s = random.choice(e.options)
        content.append([title, wikipedia.page(s).content])

df = pd.DataFrame(content, columns=['title', 'content'])
df.head()

                               title                                            content
0                         Aue (Elbe)  The Aue is a river in northern Germany in the ...
1                      Marco Camargo  Marco Antonio Camargo González (born May 8, 19...
2  Girl Guides Association of Zambia  The Girl Guides Association of Zambia is the n...
3                             Asiedu  Asiedu is both a surname and a given name. Not...
4                       Carol Morley  Carol Anne Morley (born 14 January 1966) is an...


### LDA (Latent Dirichlet Allocation)

In [22]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['content'].values.astype('U'))

model = LatentDirichletAllocation(n_components=10, random_state=42)
model.fit(vectors)

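feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last five indices are the five highest-weighted words
    print(f'Topic {index} top words: {[feature_names[i] for i in topic.argsort()[-5:]]}')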

Topic 0 top words: ['km', 'lower', 'saxony', 'horneburg', 'river']
Topic 1 top words: ['girl', 'butterfly', 'association', 'ghanaian', 'asiedu']
Topic 2 top words: ['external', '2018', '1966', 'features', 'references']
Topic 3 top words: ['external', '2018', '1966', 'features', 'references']
Topic 4 top words: ['external', '2018', '1966', 'features', 'references']
Topic 5 top words: ['external', '2018', '1966', 'features', 'references']
Topic 6 top words: ['external', '2018', '1966', 'features', 'references']
Topic 7 top words: ['external', '2018', '1966', 'features', 'references']
Topic 8 top words: ['external', '2018', '1966', 'features', 'references']
Topic 9 top words: ['life', 'years', 'short', 'film', 'morley']


In [23]:
topic_results = model.transform(vectors)
df['topic'] = topic_results.argmax(axis=1)
df.head()

                               title                                            content  topic
0                         Aue (Elbe)  The Aue is a river in northern Germany in the ...      0
1                      Marco Camargo  Marco Antonio Camargo González (born May 8, 19...      1
2  Girl Guides Association of Zambia  The Girl Guides Association of Zambia is the n...      1
3                             Asiedu  Asiedu is both a surname and a given name. Not...      1
4                       Carol Morley  Carol Anne Morley (born 14 January 1966) is an...      9


### NMF (Non-Negative Matrix Factorization)

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['content'].values.astype('U'))
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
tfidf = pd.DataFrame(denselist, columns=feature_names)
tfidf.head()

Unnamed: 0,100,11,12,14,15,16,16mm,18,19,1924,...,world,written,wrote,year,years,yirenkyi,young,youth,zambia,zawe
0,0.0,0.0,0.077198,0.0,0.0,0.0,0.0,0.0,0.077198,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.214946,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086708,0.0,...,0.086708,0.0,0.0,0.086708,0.0,0.0,0.0,0.086708,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110441,...,0.089103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.331322,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.10309,0.0,0.0,0.0,0.0
4,0.0,0.016754,0.013517,0.050261,0.033507,0.033507,0.033507,0.016754,0.0,0.0,...,0.0,0.067015,0.016754,0.04055,0.201044,0.0,0.050261,0.013517,0.0,0.016754


In [25]:
print(tfidf.shape)

(5, 532)


Figure 1: Illustration of NMF model for topic modeling (source: https://arxiv.org/pdf/1706.05084.pdf)

In [26]:
from sklearn.decomposition import NMF

model = NMF(init='random', n_components=10, random_state=42)
model.fit(vectors)

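feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last five indices are the five highest-weighted words
    print(f'Topic {index} top words: {[feature_names[i] for i in topic.argsort()[-5:]]}')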

Topic 0 top words: ['life', 'years', 'short', 'film', 'morley']
Topic 1 top words: ['km', 'mi', 'aue', 'horneburg', 'river']
Topic 2 top words: ['organization', 'zambia', 'guides', 'girl', 'association']
Topic 3 top words: ['scouts', 'guides', 'zambia', 'girl', 'association']
Topic 4 top words: ['ecuador', 'fina', 'camargo', 'olympics', 'butterfly']
Topic 5 top words: ['2009', 'girl', 'guides', 'zambia', 'association']
Topic 6 top words: ['ghanaian', '2018', 'born', 'asiedu', 'people']
Topic 7 top words: ['documentary', 'years', 'short', 'morley', 'film']
Topic 8 top words: ['surname', 'given', 'politician', 'ghanaian', 'asiedu']
Topic 9 top words: ['200', 'summer', '100', 'olympics', 'butterfly']


In [27]:
topic_results = model.transform(vectors)
df['topic'] = topic_results.argmax(axis=1)
df.head()



                               title                                            content  topic
0                         Aue (Elbe)  The Aue is a river in northern Germany in the ...      1
1                      Marco Camargo  Marco Antonio Camargo González (born May 8, 19...      9
2  Girl Guides Association of Zambia  The Girl Guides Association of Zambia is the n...      2
3                             Asiedu  Asiedu is both a surname and a given name. Not...      8
4                       Carol Morley  Carol Anne Morley (born 14 January 1966) is an...      7


In [36]:
topic = model.components_[2]
keywords = ' '.join([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-5:]])
wikipedia.search(keywords, results=5)

['Girl Guides Association of Zambia',
 'List of World Association of Girl Guides and Girl Scouts members',
 'Scouting and Guiding in Zambia',
 'List of World Organization of the Scout Movement members',
 'Zambia']

### Wikipedia API

If you intend to do any scraping projects or automated requests, consider alternatives such as Pywikipediabot or the MediaWiki API, which have more advanced features.

* wikipedia.search('keywords', results=2)
* wikipedia.suggest('keyword')
* wikipedia.summary('keywords', sentences=2)
* wikipedia.page('keywords')
* wikipedia.page('keywords').content
* wikipedia.page('keywords').references
* wikipedia.page('keywords').title
* wikipedia.page('keywords').url
* wikipedia.page('keywords').categories
* wikipedia.page('keywords').links
* wikipedia.geosearch(33.2075, -97.1526)
* wikipedia.set_lang('hi')
* wikipedia.languages()
* wikipedia.page('keywords').images[0]
* wikipedia.page('keywords').html()

## SPARQL-dataframe

* The wikipedia api
* https://pypi.org/project/SPARQLWrapper/
* https://sparqlwrapper.readthedocs.io/en/latest/main.html 

In [29]:
# pip install sparql-dataframe

In [30]:
import sparql_dataframe

endpoint = "http://dbpedia.org/sparql"

q = """
SELECT ?label ?ship ?owner ?status ?port ?route
WHERE
{
 ?ship dbp:shipName ?label .
 ?ship rdf:type dbo:Ship .
 ?ship dbo:owner ?owner FILTER ( ?owner = dbr:White_Star_Line ) .
 ?ship dbo:status ?status .
 ?ship dbp:shipRegistry ?port .
 ?ship dbp:shipRoute ?route .
}
LIMIT 50
"""

df = sparql_dataframe.get(endpoint, q)
df.head()

Unnamed: 0,label,ship,owner,status,port,route
0,SS Arabic,http://dbpedia.org/resource/SS_Arabic_(1881),http://dbpedia.org/resource/White_Star_Line,Broken up August 1901,http://dbpedia.org/resource/Rotterdam,*Liverpool-New York \n*San Francisco-Hong Kong...
1,SS Spaarndam,http://dbpedia.org/resource/SS_Arabic_(1881),http://dbpedia.org/resource/White_Star_Line,Broken up August 1901,http://dbpedia.org/resource/Rotterdam,*Liverpool-New York \n*San Francisco-Hong Kong...
2,SS Arabic,http://dbpedia.org/resource/SS_Arabic_(1881),http://dbpedia.org/resource/White_Star_Line,Sold to theHolland America Linein February 1890,http://dbpedia.org/resource/Rotterdam,*Liverpool-New York \n*San Francisco-Hong Kong...
3,SS Spaarndam,http://dbpedia.org/resource/SS_Arabic_(1881),http://dbpedia.org/resource/White_Star_Line,Sold to theHolland America Linein February 1890,http://dbpedia.org/resource/Rotterdam,*Liverpool-New York \n*San Francisco-Hong Kong...
4,SS Arabic,http://dbpedia.org/resource/SS_Arabic_(1881),http://dbpedia.org/resource/White_Star_Line,Broken up August 1901,http://dbpedia.org/resource/Liverpool,*Liverpool-New York \n*San Francisco-Hong Kong...
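
The query above contains no `PREFIX` lines, so it appears to rely on the DBpedia endpoint predefining `rdf:`, `dbo:`, `dbp:`, and `dbr:`. Other endpoints may not, so it can be safer to declare prefixes explicitly. The sketch below prepends them to a query string; the `with_prefixes` helper is hypothetical, while the namespace URIs are the standard RDF and DBpedia ones.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dbo": "http://dbpedia.org/ontology/",
    "dbp": "http://dbpedia.org/property/",
    "dbr": "http://dbpedia.org/resource/",
}

def with_prefixes(query, prefixes=PREFIXES):
    """Prepend one PREFIX declaration per namespace to a SPARQL query."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return header + "\n" + query.strip()

q = with_prefixes("""
SELECT ?label ?ship
WHERE
{
 ?ship dbp:shipName ?label .
 ?ship rdf:type dbo:Ship .
 ?ship dbo:owner dbr:White_Star_Line .
}
LIMIT 5
""")
print(q)
```

The resulting string can be passed to `sparql_dataframe.get(endpoint, q)` in the same way as the query in the cell above.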


## Knowledge Graph Extension

https://colab.research.google.com/github/om-hb/kgextension/blob/master/examples/book_genre_prediction.ipynb

The kgextension package lets you access and use Linked Open Data to augment existing datasets. It incorporates knowledge graph information into pandas DataFrames and can be used within a scikit-learn pipeline.

Its functionality includes:

* Linking datasets to any Linked Open Data (LOD) Source such as DBpedia, WikiData or the EU Open Data Portal
* Generation of new features from the LOD Sources
* Hierarchy-based feature selection algorithms
* Data Integration of features from different sources

https://kgextension.readthedocs.io/en/latest/

In [31]:
# !pip install kgextension

In [32]:
# get the data
import pandas as pd

titles = pd.read_csv('titles.csv')
titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7


In [33]:
from kgextension.linking_sklearn import DbpediaLookupLinker

linker = DbpediaLookupLinker(column='title')
df_enhanced = linker.fit_transform(titles.head())
df_enhanced.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,new_link
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2,http://dbpedia.org/resource/Taxi_Driver
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8,http://dbpedia.org/resource/Monty_Python_and_t...
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8,http://dbpedia.org/resource/Monty_Python's_Lif...
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7,http://dbpedia.org/resource/The_Exorcist_(film)


In [34]:
# https://kgextension.readthedocs.io/en/latest/source/usage_generators.html#specific-relation-generator
from kgextension.generator_sklearn import SpecificRelationGenerator

generator = SpecificRelationGenerator(columns=['new_link'], direct_relation='http://purl.org/dc/terms/subject')
df_enhanced = generator.fit_transform(df_enhanced)
df_enhanced.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,...,new_link_in_boolean_http://dbpedia.org/resource/Category:Films_about_telekinesis,"new_link_in_boolean_http://dbpedia.org/resource/Category:Films_shot_in_Washington,_D.C.","new_link_in_boolean_http://dbpedia.org/resource/Category:Films_set_in_Washington,_D.C.",new_link_in_boolean_http://dbpedia.org/resource/Category:Demons_in_film,new_link_in_boolean_http://dbpedia.org/resource/Category:Films_set_in_Iraq,new_link_in_boolean_http://dbpedia.org/resource/Category:Films_shot_in_Iraq,new_link_in_boolean_http://dbpedia.org/resource/Category:Films_featuring_a_Best_Supporting_Actress_Golden_Globe-winning_performance,new_link_in_boolean_http://dbpedia.org/resource/Category:Films_scored_by_Jack_Nitzsche,new_link_in_boolean_http://dbpedia.org/resource/Category:Rating_controversies_in_film,new_link_in_boolean_http://dbpedia.org/resource/Category:Supernatural_drama_films
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,...,,,,,,,,,,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,...,False,False,False,False,False,False,False,False,False,False
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,...,False,False,False,False,False,False,False,False,False,False
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,...,False,False,False,False,False,False,False,False,False,False
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,...,True,True,True,True,True,True,True,True,True,True


In [35]:
# https://kgextension.readthedocs.io/en/latest/source/usage_generators.html#data-properties-generator
from kgextension.linking_sklearn import DbpediaLookupLinker
from kgextension.generator import data_properties_generator

linker = DbpediaLookupLinker(column='title')
df_props = linker.fit_transform(titles.head())


df_data_properties = data_properties_generator(df_props, 'new_link')
df_data_properties

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,...,new_link_data_http://dbpedia.org/property/starring,new_link_data_http://dbpedia.org/property/studio,new_link_data_http://dbpedia.org/property/title,new_link_data_http://dbpedia.org/property/totalWidth,new_link_data_http://dbpedia.org/property/type,new_link_data_http://dbpedia.org/property/width,new_link_data_http://dbpedia.org/property/writers,new_link_data_http://www.w3.org/2000/01/rdf-schema#comment,new_link_data_http://www.w3.org/2000/01/rdf-schema#label,new_link_data_http://xmlns.com/foaf/0.1/name
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,...,,,,,,,,,,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,...,Harvey Keitel,Bill/Phillips Productions,Awards for Taxi Driver,300.0,soundtrack,431.0,,Taxikář (v americkém originále: Taxi Driver) j...,出租车司机 (1976年电影),Taxi Driver
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,...,John Cleese,,,,,230.0,Eric Idle,Monty Python a Svatý Grál je britský film kome...,Monty Python e il Sacro Graal,Monty Python and the Holy Grail
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,...,Terry Gilliam,Python (Monty) Pictures,,,,30.0,Eric Idle,La vida de Brian (título original: Life of Bri...,Monty Python : La Vie de Brian,Monty Python's Life of Brian
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,...,Linda Blair,Hoya Productions,Awards for The Exorcist,,,,,طارد الأرواح الشريرة (بالإنجليزية: The Exorcis...,Der Exorzist,The Exorcist
