___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Non-Negative Matrix Factorization

Let's repeat the topic modeling task from the previous lecture, but this time we will use NMF instead of LDA.
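As a reminder of what NMF does: it approximates a non-negative document-term matrix V by the product of two smaller non-negative matrices, W (documents x topics) and H (topics x terms). The sketch below is a toy illustration using the classic multiplicative update rule on a made-up 4x4 matrix. The function name and the data are assumptions for illustration only; scikit-learn's `NMF` uses a coordinate descent solver by default, so this is not the code that runs later in this notebook.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0):
    """Toy NMF with multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Start from small positive random factors
    W = rng.random((n_docs, n_components)) + 1e-3
    H = rng.random((n_components, n_terms)) + 1e-3
    eps = 1e-10  # avoids division by zero
    for _ in range(n_iter):
        # Each update keeps the factors non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Made-up matrix with two obvious "topics" (columns 0-1 and columns 2-3)
V = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 3.0, 3.0]])

W, H = nmf_multiplicative(V, n_components=2)
reconstruction_error = np.linalg.norm(V - W @ H)
print(W.shape, H.shape, reconstruction_error)
```

Each row of H plays the role of a topic (a weight per term), and each row of W says how strongly a document expresses each topic.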

## Step 1: loading data

We will be using news articles about artificial intelligence stored in `../news_data.json`; the first rows of the dataset come from [artificialintelligence-news.com](https://artificialintelligence-news.com).

In [1]:
# Importing modules
import pandas as pd
import os

import json 

In [2]:
# os.chdir('..')

cwd = os.getcwd()
print(cwd)

/home/christopheschellinck/Documents/Projects/project_NLP_humain/ipynb_files


In [3]:
with open("../news_data.json", "r") as f:
    papers10 = json.load(f)

In [4]:
# json_normalize is available as pd.json_normalize (pandas >= 1.0), so no separate import is needed

In [5]:
papers = pd.json_normalize(papers10["data"])
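To make the effect of `pd.json_normalize` concrete, here is a small sketch on a made-up payload. The keys (`id`, `title`, `meta`) are hypothetical and not taken from `news_data.json`; the point is only that a list of records becomes one row per record, with nested dictionaries flattened into dotted column names.

```python
import pandas as pd

# Hypothetical payload with the same top-level "data" key as the real file
payload = {
    "data": [
        {"id": 1, "title": "First article", "meta": {"source": "AInews"}},
        {"id": 2, "title": "Second article", "meta": {"source": "AInews"}},
    ]
}

# One row per record; the nested "meta" dict becomes the column "meta.source"
toy_papers = pd.json_normalize(payload["data"])
print(toy_papers.columns.tolist())
print(toy_papers.shape)
```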

In [6]:
papers.head()

Unnamed: 0,id,title,summary,authors,tags,text,url,source,created_at,updated_at,author,date
0,10813,"ZingBox aims for ‘Internet of Trusted Things’,...",Cybersecurity provider ZingBox has announced t...,,device\niot\nguardian\napproach\ndevices\nindu...,Cybersecurity provider ZingBox has announced t...,https://artificialintelligence-news.com/2017/0...,AInews,2020-02-05T17:08:34.343Z,2020-02-05T17:08:34.343Z,James Bourne,2017-04-25
1,10814,AI may help create more sustainable data centres,Enterprise data centre provider Aegis Data arg...,,data\ncentre\nnatural\nnew\ntechnology\nindust...,Enterprise data centre provider Aegis Data arg...,https://artificialintelligence-news.com/2017/0...,AInews,2020-02-05T17:08:34.355Z,2020-02-05T17:08:34.355Z,James Bourne,2017-04-25
2,10815,Why a potential trillion dollar B2B bots indus...,"From Domino’s Pizza, to Uber, to Bank of Ameri...",,next\nbig\ngupshup\none\nbusiness\ntech\nimpac...,"From Domino’s Pizza, to Uber, to Bank of Ameri...",https://artificialintelligence-news.com/2017/0...,AInews,2020-02-05T17:08:34.365Z,2020-02-05T17:08:34.365Z,James Bourne,2017-04-25
3,10816,Why companies investing in AI today should exp...,Organisations investing in artificial intellig...,,ai\norganisations\nindustry\nemployees\nexpo\n...,Organisations investing in artificial intellig...,https://artificialintelligence-news.com/2017/0...,AInews,2020-02-05T17:08:34.375Z,2020-02-05T17:08:34.375Z,James Bourne,2017-04-25
4,10817,Tencent gears up for greater GPU acceleration ...,Tencent’s cloud computing services will be bee...,,gpu\naccelerators\ngpus\ncloud\nservices\ntesl...,Tencent’s cloud computing services will be bee...,https://artificialintelligence-news.com/2017/0...,AInews,2020-02-05T17:08:34.385Z,2020-02-05T17:08:34.385Z,James Bourne,2017-04-26


In [7]:
papers.columns

Index(['id', 'title', 'summary', 'authors', 'tags', 'text', 'url', 'source',
       'created_at', 'updated_at', 'author', 'date'],
      dtype='object')

In [8]:
papers.shape

(1626, 12)

In [9]:
papers.sample(1).text

246    The UK government has announced the opening of...
Name: text, dtype: object

Notice how we don't have the topic of the articles! Let's use NMF to attempt to figure out clusters of the articles.

## Step 2: Data cleaning

In [10]:
papers1 = papers

In [11]:
# Remove the columns that are not needed for topic modeling (drop returns a new DataFrame)
papers2 = papers1.drop(columns=['authors', 'url', 'source', 'created_at', 'updated_at', 'author', 'date'])

# Print out the first rows of papers
papers2.head()


Unnamed: 0,id,title,summary,tags,text
0,10813,"ZingBox aims for ‘Internet of Trusted Things’,...",Cybersecurity provider ZingBox has announced t...,device\niot\nguardian\napproach\ndevices\nindu...,Cybersecurity provider ZingBox has announced t...
1,10814,AI may help create more sustainable data centres,Enterprise data centre provider Aegis Data arg...,data\ncentre\nnatural\nnew\ntechnology\nindust...,Enterprise data centre provider Aegis Data arg...
2,10815,Why a potential trillion dollar B2B bots indus...,"From Domino’s Pizza, to Uber, to Bank of Ameri...",next\nbig\ngupshup\none\nbusiness\ntech\nimpac...,"From Domino’s Pizza, to Uber, to Bank of Ameri..."
3,10816,Why companies investing in AI today should exp...,Organisations investing in artificial intellig...,ai\norganisations\nindustry\nemployees\nexpo\n...,Organisations investing in artificial intellig...
4,10817,Tencent gears up for greater GPU acceleration ...,Tencent’s cloud computing services will be bee...,gpu\naccelerators\ngpus\ncloud\nservices\ntesl...,Tencent’s cloud computing services will be bee...


In [12]:
papers2.shape

(1626, 5)

### Applying regex

In [13]:
# Load the regular expression library
import re

# Collapse every run of whitespace (including newlines) into a single space
papers2['text_preprocessed'] = \
papers2['text'].map(lambda x: re.sub(r'\s+', ' ', x))

# Replace any remaining newline characters (already covered by the step above)
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: re.sub(r'\n', ' ', x))

# Remove straight apostrophes
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: re.sub("'", '', x))

# Remove punctuation
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the text to lowercase
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: x.lower())

# Print out the first rows of papers
papers2['text_preprocessed'].head()

0    cybersecurity provider zingbox has announced t...
1    enterprise data centre provider aegis data arg...
2    from domino’s pizza to uber to bank of america...
3    organisations investing in artificial intellig...
4    tencent’s cloud computing services will be bee...
Name: text_preprocessed, dtype: object
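Note that the output still contains curly apostrophes (for example in "domino’s"), because the pattern above only matches the straight apostrophe. The sketch below chains the same regex steps in a single function and, as an assumption that goes beyond the original notebook, also strips curly quotes. The function name `clean_text` and the sample sentence are made up for illustration.

```python
import re

def clean_text(text):
    """Apply the cleaning steps in one pass (hypothetical helper)."""
    text = re.sub(r"\s+", " ", text)    # collapse whitespace and newlines
    text = re.sub(r"['’‘]", "", text)   # assumption: also drop curly quotes
    text = re.sub(r"[,.!?]", "", text)  # remove basic punctuation
    return text.lower()

sample = "From Domino’s Pizza,\nto Uber!  It's   here."
cleaned = clean_text(sample)
print(cleaned)
```

On a DataFrame this could be applied with `papers2['text'].map(clean_text)` instead of five separate assignments.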

### Applying lemmatization

In [14]:
# Perform standard imports:
import spacy


### spaCy preparation for lemmatization

In [15]:
nlp_en_core_web_sm = spacy.load('en_core_web_sm')

In [16]:
def lemmatizing_article(line):
    # Run the spaCy pipeline and collect the lemma of every token
    doc = nlp_en_core_web_sm(line)
    return [token.lemma_ for token in doc]

In [17]:
papers2['text_lemmatized'] = \
papers2['text_preprocessed'].apply(lambda x: lemmatizing_article(x))

In [18]:
print(papers2.head(20))

       id                                              title  \
0   10813  ZingBox aims for ‘Internet of Trusted Things’,...   
1   10814   AI may help create more sustainable data centres   
2   10815  Why a potential trillion dollar B2B bots indus...   
3   10816  Why companies investing in AI today should exp...   
4   10817  Tencent gears up for greater GPU acceleration ...   
5   10818  Bonsai launches Early Access Program to help e...   
6   10819  AI falls on the final furlong in predicting Ke...   
7   10820  Most Britons want AI to support at least part ...   
8   10821  Medicine, law and IT may be affected by the ri...   
9   10822  Cisco acquires AI firm MindMeld to create more...   
10  10823  University of Cambridge bolsters AI research e...   
11  10824  Cray launches two new CS-Storm accelerated clu...   
12  10825  UNICEF joins Apple, Google, Facebook et al in ...   
13  10826  New intelligent street light software aims to ...   
14  10827  Cylance launches first claime

In [19]:
# Function to convert each row of a dataset column from a list of tokens to a string
def listToString(s):

    # join the tokens, using a single space as separator
    return " ".join(s)

In [20]:
papers2['text_lemmatized_string'] = \
papers2['text_lemmatized'].apply(lambda x: listToString(x))

In [21]:
print(papers2.head(20))

       id                                              title  \
0   10813  ZingBox aims for ‘Internet of Trusted Things’,...   
1   10814   AI may help create more sustainable data centres   
2   10815  Why a potential trillion dollar B2B bots indus...   
3   10816  Why companies investing in AI today should exp...   
4   10817  Tencent gears up for greater GPU acceleration ...   
5   10818  Bonsai launches Early Access Program to help e...   
6   10819  AI falls on the final furlong in predicting Ke...   
7   10820  Most Britons want AI to support at least part ...   
8   10821  Medicine, law and IT may be affected by the ri...   
9   10822  Cisco acquires AI firm MindMeld to create more...   
10  10823  University of Cambridge bolsters AI research e...   
11  10824  Cray launches two new CS-Storm accelerated clu...   
12  10825  UNICEF joins Apple, Google, Facebook et al in ...   
13  10826  New intelligent street light software aims to ...   
14  10827  Cylance launches first claime

### spaCy preparation for removing stop words (spaCy's built-in list plus a manual selection)

In [22]:
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

In [23]:
# Add corpus-specific stop words to spaCy's built-in STOP_WORDS set
STOP_WORDS |= {"ai", "artificial", "intelligence"}

In [24]:
# Create a blank English pipeline: it contains only the tokenizer (no tagger, parser, NER or word vectors)
nlp = English()

In [25]:
def removing_stopwords(line):
    # "nlp" only tokenizes here; no tagger or parser is loaded
    my_doc = nlp(line)

    # Keep the text of every token whose lexeme is not flagged as a stop word
    filtered_sentence = [
        token.text for token in my_doc
        if not nlp.vocab[token.text].is_stop
    ]

    return filtered_sentence
    
    
    

In [26]:
papers2['text_cleaned'] = \
papers2['text_lemmatized_string'].apply(lambda x: removing_stopwords(x))

In [27]:
print(papers2['text_cleaned'].head(10))

0    [cybersecurity, provider, zingbox, announce, l...
1    [enterprise, data, centre, provider, aegis, da...
2    [domino, pizza, uber, bank, america, bots, hot...
3    [organisation, invest, (, ), anticipate, 39, %...
4    [tencent, cloud, computing, service, beef, gpu...
5    [-PRON-, -, base, bonsai, set, engage, enterpr...
6    [kentucky, derby, race, triple, crown, horse, ...
7    [new, survey, commission, uc, expo, event, rev...
8    [gartner, tentative, guideline, 2022, smart, m...
9    [cisco, announce, -PRON-, intent, acquire, min...
Name: text_cleaned, dtype: object


In [28]:
papers2['text_cleaned_string'] = \
papers2['text_cleaned'].apply(lambda x: listToString(x))

In [29]:
print(papers2['text_cleaned_string'].head(20))

0     cybersecurity provider zingbox announce launch...
1     enterprise data centre provider aegis data arg...
2     domino pizza uber bank america bots hot proper...
3     organisation invest ( ) anticipate 39 % revenu...
4     tencent cloud computing service beef gpu accel...
5     -PRON- - base bonsai set engage enterprise ind...
6     kentucky derby race triple crown horse racing ...
7     new survey commission uc expo event reveal 85 ...
8     gartner tentative guideline 2022 smart machine...
9     cisco announce -PRON- intent acquire mindmeld ...
10    leverhulme centre future ( cfi ) join - - prof...
11    cray launch new cs - storm accelerate cluster ...
12    unicef announce -PRON- join partnership ( ) da...
13    tcs digital software & solutions group tata co...
14    australia - base cylance announce general avai...
15    peopleai receive $ 7 million series funding le...
16    ( ) deep learning help analyse image patient o...
17    san francisco - base crowdsource firm crow
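The cleaned strings still contain tokens that carry little topical information, such as brackets, `%`, bare numbers and spaCy's `-PRON-` placeholder. `TfidfVectorizer`'s default token pattern ignores single-character punctuation, but tokens like `pron` or `39` may still end up in the vocabulary. A minimal sketch of an extra filter is shown below; the helper name and the sample token list are assumptions for illustration, and this step is not part of the original pipeline.

```python
def keep_informative_tokens(tokens, placeholder="-PRON-"):
    """Keep purely alphabetic tokens and drop the pronoun placeholder (hypothetical helper)."""
    return [t for t in tokens if t != placeholder and t.isalpha()]

# Sample tokens modelled on the kind of output shown above
tokens = ["-PRON-", "-", "base", "bonsai", "39", "%", "revenue"]
filtered = keep_informative_tokens(tokens)
print(filtered)
```

The trade-off is that mixed tokens such as `b2b` would also be dropped, so whether this filter helps depends on the corpus.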

## Step 3: Splitting the articles into a training part and a test part
### This has to be done now, because after applying TF-IDF it is no longer possible to add the "Topic" column

In [30]:
from sklearn.model_selection import train_test_split

In [31]:
articles_train, articles_test = train_test_split(papers2, test_size = 0.25)
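The split above is random, so the topic numbers and outputs in the rest of this notebook change from run to run. A small sketch on a made-up DataFrame shows how passing `random_state` makes the split reproducible; the value 42 and the toy data are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for papers2
toy = pd.DataFrame({"id": range(8), "text": [f"doc {i}" for i in range(8)]})

# The same random_state yields the same rows in both calls
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```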
    

## Step 4: Preprocessing with TfidfVectorizer and fit_transform on the training data

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
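To see `max_df` and `min_df` in action, the sketch below fits a vectorizer with the same two thresholds on a made-up four-document corpus. The word "common" appears in every document (document frequency 4 of 4, above 95%) and "durian" in only one (below the minimum of 2), so both should be pruned from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration
docs = [
    "apple banana common",
    "apple cherry common",
    "banana cherry common",
    "durian common",
]

toy_tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
toy_dtm = toy_tfidf.fit_transform(docs)

# vocabulary_ maps each retained term to its column index
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)
print(toy_dtm.shape)
```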

In [33]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

## Step 4.b:  Applying fit_transform on the training data by TfidfVectorizer 

In [34]:
dtm_train = tfidf.fit_transform(articles_train['text_cleaned_string'])


In [35]:
dtm_train

<1219x12369 sparse matrix of type '<class 'numpy.float64'>'
	with 281356 stored elements in Compressed Sparse Row format>

In [36]:
dtm_train.shape

(1219, 12369)

<h1><p style="color:purple">Step 5: Building the NMF model with the training part of the data (THIS IS THE MODEL)</p></h1>

## Step 5.a: Making the model

In [37]:
from sklearn.decomposition import NMF

In [38]:
nmf_model = NMF(n_components=20,random_state=42)

In [39]:
### fit based on the train data

In [40]:
# This can take a while on large document collections
nmf_model.fit(dtm_train)

NMF(n_components=20, random_state=42)

## Step 5.b: Saving the model

In [41]:
import pickle

In [43]:
# Save to file in the current working directory
pkl_filename = "pickle_model_NLP.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(nmf_model, file)
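The pickled NMF model is only usable together with the fitted `TfidfVectorizer`, because new text has to be mapped onto exactly the same vocabulary. The sketch below, on a made-up corpus, pickles both objects together and checks that the restored pair gives the same topic scores. It serializes to an in-memory byte string rather than a file, and the dictionary keys are arbitrary assumptions.

```python
import pickle
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training corpus
docs = [
    "machine learning model data",
    "cloud security iot expo",
    "learning model algorithm data",
    "iot expo cloud enterprise",
]
toy_vec = TfidfVectorizer()
toy_model = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
toy_model.fit(toy_vec.fit_transform(docs))

# Serialize vectorizer and model together, then restore them
blob = pickle.dumps({"tfidf": toy_vec, "nmf": toy_model})
restored = pickle.loads(blob)

new_text = ["data model learning"]
scores = restored["nmf"].transform(restored["tfidf"].transform(new_text))
print(scores.shape)
```

Keep in mind that pickle files should only be loaded from trusted sources and may not load across different scikit-learn versions.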

## Step 5.c: Displaying Topics

In [44]:
len(tfidf.get_feature_names())

12369

In [45]:
import random

In [46]:
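for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    # (scikit-learn >= 1.0 offers get_feature_names_out() instead)
    random_word_id = random.randrange(len(tfidf.get_feature_names()))
    print(tfidf.get_feature_names()[random_word_id])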

indecipherable
greene
empower
evan
consortium
richness
norvig
filmmaking
uploaded
sprawl


In [47]:
for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_word_id = random.randrange(len(tfidf.get_feature_names()))
    print(tfidf.get_feature_names()[random_word_id])

alizadeh
dublin
conventional
lago
bonding
knowingly
skip
supersede
reskilling
lehman


In [48]:
len(nmf_model.components_)

20

In [49]:
nmf_model.components_

array([[0.0007206 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.00600381],
       [0.        , 0.00152527, 0.        , ..., 0.        , 0.        ,
        0.00158674],
       [0.00593661, 0.        , 0.00126903, ..., 0.        , 0.00012436,
        0.        ],
       ...,
       [0.        , 0.        , 0.0007598 , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [50]:
len(nmf_model.components_[0])

12369

In [51]:
single_topic = nmf_model.components_[0]

In [52]:
# Returns the indices that would sort this array.
single_topic.argsort()

array([6184, 9203, 4840, ...,  677, 2957, 7209])
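`argsort()` returns indices in ascending order of weight, so the least representative word comes first and the most representative word comes last. The toy example below, with made-up weights and a four-word vocabulary, shows how to read off the top words in descending order.

```python
import numpy as np

# Made-up topic weights and vocabulary
weights = np.array([0.1, 0.7, 0.0, 0.4])
vocab = np.array(["data", "model", "expo", "learn"])

order = weights.argsort()    # ascending: index of the smallest weight first
top2 = order[-2:][::-1]      # last two indices, reversed to descending order

print(order)
print(vocab[top2])
```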

In [53]:
# Weight of the word at vocabulary index 0 (the least representative word is at index 6184, the first entry of argsort)
single_topic[0]

0.0007205983726366147

In [54]:
# Weight of the word at vocabulary index 4197 (the most representative word is at index 7209, the last entry of argsort)
single_topic[4197]

0.0004850248179647465

In [55]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

array([ 9789, 11782,  6417,  6415,  6712,  9354,  2949,   677,  2957,
        7209])

In [56]:
top_word_indices = single_topic.argsort()[-10:]

In [57]:
for index in top_word_indices:
    print(tfidf.get_feature_names()[index])

science
use
learning
learn
machine
researcher
data
algorithm
datum
model


These look like machine learning research articles perhaps... Let's confirm by using .transform() on our vectorized articles to attach a label number. But first, let's view all 20 topics found.

In [58]:
print(nmf_model.components_)

[[0.0007206  0.         0.         ... 0.         0.         0.00600381]
 [0.         0.00152527 0.         ... 0.         0.         0.00158674]
 [0.00593661 0.         0.00126903 ... 0.         0.00012436 0.        ]
 ...
 [0.         0.         0.0007598  ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [59]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print([topic[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['train', 'prediction', 'veeramachaneni', 'problem', 'paper', 'science', 'use', 'learning', 'learn', 'machine', 'researcher', 'data', 'algorithm', 'datum', 'model']
[0.2717460201053615, 0.2776034423543002, 0.2814727603858789, 0.29804492437659585, 0.2981082070430484, 0.337891019680181, 0.4092908112061429, 0.41948146301344647, 0.4230123294562512, 0.5172362471896735, 0.5361625103603559, 0.5698032519378289, 0.6969971569325947, 0.8048838587210169, 1.1003275948509748]


THE TOP 15 WORDS FOR TOPIC #1
['event', 'enterprise', 'security', 'attend', 'cyber', 'amsterdam', 'upcoming', 'london', 'cloud', 'blockchain', 'leader', 'industry', 'iot', 'locate', 'expo']
[0.2459291715848743, 0.24711402397998056, 0.2599248977709651, 0.26281012772836526, 0.26286334833821423, 0.263802643948324, 0.26747678607118075, 0.27350408512461616, 0.2742712289584149, 0.28040142641791993, 0.2812799033479883, 0.28403736345255004, 0.306770774829742, 0.47565962855599225, 1.107956714637829]


THE

['student', 'working', 'new', 'social', 'discipline', 'science', 'department', 'chair', 'compute', 'group', 'faculty', 'schwarzman', 'computing', 'mit', 'college']
[0.17691424668542943, 0.17896067382611092, 0.1858760372761101, 0.18600318891812948, 0.18710116235462412, 0.18883778649629557, 0.201111944456366, 0.20192655236087612, 0.21093706290546158, 0.3386646656925198, 0.3552239914419984, 0.36942881921884996, 0.4407141262894141, 0.5176877580231162, 0.7771627790351121]


THE TOP 15 WORDS FOR TOPIC #18
['privacy', 'search', 'core', 'developer', 'federighi', 'device', 'app', 'machine', 'learning', 'salakhutdinov', 'ml', 'iphone', 'giannandrea', 'siri', 'apple']
[0.09833276312372996, 0.09933415172268444, 0.09935400761494506, 0.11124034608417244, 0.11427688221683152, 0.12530945363353974, 0.12746857371338033, 0.13337767439800127, 0.14407404019209105, 0.15239834454038584, 0.15977794425144928, 0.1938160575895991, 0.25932051909887804, 0.29807708472063366, 1.0726292892115095]


THE TOP 15 WORDS F

In [60]:
dfs = []
for index,topic in enumerate(nmf_model.components_):
#    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    names = [tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]]
    weight = [topic[i] for i in topic.argsort()[-15:]]
    d = {'Names' : names, 'Weight' : weight}
    df = pd.DataFrame(d)
    df = df.sort_values(by='Weight', ascending=False)
    dfs.append(df)


In [61]:
dfs[0].head(5)

Unnamed: 0,Names,Weight
14,model,1.100328
13,datum,0.804884
12,algorithm,0.696997
11,data,0.569803
10,researcher,0.536163


# Attaching Discovered Topic Labels to Original Articles

In [62]:
articles_train.shape

(1219, 10)

In [63]:
articles_test.shape

(407, 10)

### transform based on the train data

In [64]:
topic_results_train = nmf_model.transform(dtm_train)

In [65]:
topic_results_train.shape

(1219, 20)

In [66]:
topic_results_train[0]

array([0.        , 0.        , 0.01790269, 0.        , 0.00919475,
       0.00296454, 0.00169547, 0.        , 0.02064694, 0.        ,
       0.        , 0.15885794, 0.        , 0.0135937 , 0.00078248,
       0.        , 0.        , 0.        , 0.04868635, 0.02181076])

In [67]:
topic_results_train[0].round(2)

array([0.  , 0.  , 0.02, 0.  , 0.01, 0.  , 0.  , 0.  , 0.02, 0.  , 0.  ,
       0.16, 0.  , 0.01, 0.  , 0.  , 0.  , 0.  , 0.05, 0.02])

In [68]:
topic_results_train[0].argmax()

11

This means that our model thinks that the first article belongs to topic #11.
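Unlike LDA, NMF scores are not probabilities and do not sum to 1. If a relative measure is useful, the row can be normalized by its sum. The sketch below reuses the rounded scores of the first article printed above; the normalization step is an optional addition and not part of the original notebook.

```python
import numpy as np

# Rounded topic scores of the first training article (copied from the output above)
doc_topic = np.array([0.0, 0.0, 0.02, 0.0, 0.01, 0.0, 0.0, 0.0, 0.02, 0.0, 0.0,
                      0.16, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.05, 0.02])

best_topic = int(doc_topic.argmax())

# Guard against an all-zero row before dividing
total = doc_topic.sum()
shares = doc_topic / total if total > 0 else doc_topic

print(best_topic, round(float(shares[best_topic]), 2))
```

Here topic 11 accounts for a little over half of the article's total topic weight, which gives a rough sense of how dominant the assigned topic is.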

### Combining with Original Data

In [69]:
papers2.head()

Unnamed: 0,id,title,summary,tags,text,text_preprocessed,text_lemmatized,text_lemmatized_string,text_cleaned,text_cleaned_string
0,10813,"ZingBox aims for ‘Internet of Trusted Things’,...",Cybersecurity provider ZingBox has announced t...,device\niot\nguardian\napproach\ndevices\nindu...,Cybersecurity provider ZingBox has announced t...,cybersecurity provider zingbox has announced t...,"[cybersecurity, provider, zingbox, have, annou...",cybersecurity provider zingbox have announce t...,"[cybersecurity, provider, zingbox, announce, l...",cybersecurity provider zingbox announce launch...
1,10814,AI may help create more sustainable data centres,Enterprise data centre provider Aegis Data arg...,data\ncentre\nnatural\nnew\ntechnology\nindust...,Enterprise data centre provider Aegis Data arg...,enterprise data centre provider aegis data arg...,"[enterprise, data, centre, provider, aegis, da...",enterprise data centre provider aegis data arg...,"[enterprise, data, centre, provider, aegis, da...",enterprise data centre provider aegis data arg...
2,10815,Why a potential trillion dollar B2B bots indus...,"From Domino’s Pizza, to Uber, to Bank of Ameri...",next\nbig\ngupshup\none\nbusiness\ntech\nimpac...,"From Domino’s Pizza, to Uber, to Bank of Ameri...",from domino’s pizza to uber to bank of america...,"[from, domino, ’s, pizza, to, uber, to, bank, ...",from domino ’s pizza to uber to bank of americ...,"[domino, pizza, uber, bank, america, bots, hot...",domino pizza uber bank america bots hot proper...
3,10816,Why companies investing in AI today should exp...,Organisations investing in artificial intellig...,ai\norganisations\nindustry\nemployees\nexpo\n...,Organisations investing in artificial intellig...,organisations investing in artificial intellig...,"[organisation, invest, in, artificial, intelli...",organisation invest in artificial intelligence...,"[organisation, invest, (, ), anticipate, 39, %...",organisation invest ( ) anticipate 39 % revenu...
4,10817,Tencent gears up for greater GPU acceleration ...,Tencent’s cloud computing services will be bee...,gpu\naccelerators\ngpus\ncloud\nservices\ntesl...,Tencent’s cloud computing services will be bee...,tencent’s cloud computing services will be bee...,"[tencent, ’s, cloud, computing, service, will,...",tencent ’s cloud computing service will be bee...,"[tencent, cloud, computing, service, beef, gpu...",tencent cloud computing service beef gpu accel...


In [70]:
papers2.tail()

Unnamed: 0,id,title,summary,tags,text,text_preprocessed,text_lemmatized,text_lemmatized_string,text_cleaned,text_cleaned_string
1621,12434,Robotic pets may be bad medicine for melancholy,Sherry Turkle finds human-machine love unsettling,Technology and society\nResearch Laboratory of...,"In the face of techno-doomsday punditry, Sherr...",in the face of techno-doomsday punditry sherry...,"[in, the, face, of, techno, -, doomsday, pundi...",in the face of techno - doomsday punditry sher...,"[face, techno, -, doomsday, punditry, sherry, ...",face techno - doomsday punditry sherry turkle ...
1622,12435,MIT develops Anklebot for stroke patients,Research team foresees robotic gym,arm\ntrial\ntherapy\nclinical\nhogan\nmedical\...,Clinical trials have already shown that an MIT...,clinical trials have already shown that an mit...,"[clinical, trial, have, already, show, that, a...",clinical trial have already show that an mit r...,"[clinical, trial, mit, robotic, arm, help, str...",clinical trial mit robotic arm help stroke pat...
1623,12436,Notes from the Lab,UNDERWATER ROBOTS THAT MAP AND NAVIGATE,Research Laboratory of Electronics\nmit\nImagi...,Imagine driving down an unfamiliar road and tr...,imagine driving down an unfamiliar road and tr...,"[imagine, drive, down, an, unfamiliar, road, a...",imagine drive down an unfamiliar road and try ...,"[imagine, drive, unfamiliar, road, try, find, ...",imagine drive unfamiliar road try find -PRON- ...
1624,12437,Ray and Maria Stata give $25 million to MIT,Gift is largest ever for Institute building pr...,together\nnew\nCampus buildings and architectu...,MIT today announced a $25 million donation by ...,mit today announced a $25 million donation by ...,"[mit, today, announce, a, $, 25, million, dona...",mit today announce a $ 25 million donation by ...,"[mit, today, announce, $, 25, million, donatio...",mit today announce $ 25 million donation ray m...
1625,12438,Reuters Uses AI To Prototype First Ever Automa...,AI is coming for journalism. But rather than s...,journalism\ncontent generation\nreuters\nSynth...,AI is coming for journalism. But rather than s...,ai is coming for journalism but rather than si...,"[ai, be, come, for, journalism, but, rather, t...",ai be come for journalism but rather than simp...,"[come, journalism, simply, use, job, writer, r...",come journalism simply use job writer reuters ...


In [71]:
topic_results_train.argmax(axis=1)

array([11, 19, 19, ...,  0, 12,  7])

In [72]:
# Work on an explicit copy so that pandas does not raise a SettingWithCopyWarning
articles_train = articles_train.copy()
articles_train['Topic'] = topic_results_train.argmax(axis=1)


In [73]:
print(articles_train)

         id                                              title  \
727   11538  Amazon Wants Alexa to Hear Your Whispers and F...   
1557  12370                     More-flexible machine learning   
726   11537  Artificial Intelligence Has a Strange New Muse...   
1236  12049  Bringing artificial intelligence and MIT to mi...   
604   11415  Drag Queen vs. David Duke: Whose Tweets Are Mo...   
...     ...                                                ...   
831   11643  Google Is Giving Away AI That Can Build Your G...   
584   11395  Poll Finds Americans Trust Police Use of Facia...   
186   10998  Embracing the power of AI: The process behind ...   
668   11477      What Trump’s Executive Order on AI Is Missing   
331   11141  DeepMind’s first commercial product diagnoses ...   

                                                summary  \
727   Amazon announced new listening features for Al...   
1557  Giving machine-learning systems “partial credi...   
726   The brain's way of proce

## <p style="color:purple">Step 6: Using the trained model to infer the topic of an imported PDF file</p>

## Step 6.a: Importing the PDF file

In [74]:
# note the capitalization
import PyPDF2

import os

In [75]:
os.chdir('../reports')

cwd = os.getcwd()
print(cwd)

/home/christopheschellinck/Documents/Projects/project_NLP_humain/reports


In [76]:
url_MGI = 'MGI-The-Age-of-Analytics-Full-report.pdf'
url_DS_use_cases ='The Big Book of Data Science Use Cases.pdf'
url_big_Data = 'Using big data to make better pricing decisions.pdf'
url_AI_strat = 'aiadoptionstrategies-march2019pdf.pdf'
url_main = 'main.pdf'

In [77]:
def extract_text_from_pdf(url):
    # Note: this uses the PyPDF2 1.x/2.x API; PyPDF2 3.0 renamed these calls to
    # PdfReader, reader.pages and page.extract_text()
    text = ''

    # The context manager closes the file even if extraction fails
    with open(url, 'rb') as f:
        pdf_reader = PyPDF2.PdfFileReader(f)

        # Concatenate the text of every page
        for i in range(pdf_reader.getNumPages()):
            text += pdf_reader.getPage(i).extractText()

    return text

In [78]:
text = extract_text_from_pdf(url_main)

In [79]:
article = str(text)

In [80]:
# print(article)

## Step 6.b: Applying transform on the extracted PDF text with the fitted TfidfVectorizer

In [81]:
## The training step used fit_transform, but here we only call transform; otherwise the vocabulary
## (and therefore the number of columns) would no longer match the one the NMF model was trained on
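The difference between `transform` and `fit_transform` on new text can be shown with a small made-up corpus. Reusing the fitted vectorizer keeps the training vocabulary (8 terms here), while refitting on the new document builds a different vocabulary whose column count would not match the NMF model. All documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training documents and one new document
train_docs = ["machine learning model", "cloud security expo", "model data algorithm"]
new_doc = ["a report about data and pricing"]

fitted = TfidfVectorizer().fit(train_docs)
n_train_features = len(fitted.vocabulary_)

reused = fitted.transform(new_doc)                # keeps the training vocabulary
refit = TfidfVectorizer().fit_transform(new_doc)  # builds a new, incompatible vocabulary

print(n_train_features, reused.shape, refit.shape)
```

Words in the new document that were never seen during training (such as "pricing" here) are simply ignored by `transform`.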

In [82]:
"""
with open("../message.txt", "r") as f:
    
    single_article = f.read()
    #single_article = f
    
print(single_article)

"""



'\nwith open("../message.txt", "r") as f:\n    \n    single_article = f.read()\n    #single_article = f\n    \nprint(single_article)\n\n'

In [83]:
# article = str(single_article)

In [84]:
# dtm_test = tfidf.transform(articles_test['text_preprocessed'])
# dtm_test = tfidf.transform(article).toarray()
dtm_test = tfidf.transform([article])


In [85]:
dtm_test.shape

(1, 12369)

## topics with the test data

### transform based on the test data

In [86]:
topic_results_test = nmf_model.transform(dtm_test)

In [87]:
topic_results_test.shape

(1, 20)

In [88]:
topic_results_test[0]

array([0.00557138, 0.00557025, 0.        , 0.00217062, 0.00417199,
       0.        , 0.        , 0.0055254 , 0.00198606, 0.00354989,
       0.00157003, 0.00333757, 0.00881008, 0.0147009 , 0.00170519,
       0.00786329, 0.        , 0.        , 0.00372142, 0.00200447])

In [89]:
print(topic_results_test[0])

[0.00557138 0.00557025 0.         0.00217062 0.00417199 0.
 0.         0.0055254  0.00198606 0.00354989 0.00157003 0.00333757
 0.00881008 0.0147009  0.00170519 0.00786329 0.         0.
 0.00372142 0.00200447]


In [90]:
topic_results_test[0].round(4)

array([0.0056, 0.0056, 0.    , 0.0022, 0.0042, 0.    , 0.    , 0.0055,
       0.002 , 0.0035, 0.0016, 0.0033, 0.0088, 0.0147, 0.0017, 0.0079,
       0.    , 0.    , 0.0037, 0.002 ])

### The extracted text is closest to topic number 13

In [91]:
topic_results_test[0].argmax()

13

In [92]:
dtm_test.shape

(1, 12369)