___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Non-Negative Matrix Factorization

Let's repeat the topic modeling task from the previous lecture, but this time we will use NMF instead of LDA.
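As a quick reminder of what NMF does: it approximates a non-negative document-term matrix `V` (documents x terms) by the product of two smaller non-negative matrices, `W` (documents x topics) and `H` (topics x terms). The sketch below runs scikit-learn's `NMF` on a made-up toy matrix; the values, the variable names and the `init`/`max_iter` settings are illustrative assumptions, not part of the NPR workflow.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy "document-term" matrix: 4 documents x 5 terms (made-up values)
V = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
    [0.0, 0.0, 2.0, 1.0, 1.0],
])

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=42, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print(np.round(W @ H, 2))      # approximate reconstruction of V
```

Each row of `H` plays the role of a topic (a weighting over terms), and each row of `W` says how strongly a document expresses each topic.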

## Step 1: loading data

We will be using articles scraped from NPR (National Public Radio), obtained from their website [www.npr.org](http://www.npr.org)

In [1]:
# Importing modules
import pandas as pd
import os

import json 

In [2]:
# os.chdir('..')

cwd = os.getcwd()
print(cwd)

c:\Users\opaps\Downloads


In [4]:
with open("../da", "r") as f:
    papers10 = json.load(f)

FileNotFoundError: [Errno 2] No such file or directory: '../data/news_data.json'

In [None]:
# json_normalize is exposed directly as pd.json_normalize (pandas >= 1.0),
# so no separate import from pandas.io.json is needed

In [None]:
papers = pd.json_normalize(papers10["data"])
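To make the flattening step concrete, here is a minimal sketch of `pd.json_normalize` on a hypothetical payload. The `sample` dictionary and its field names are invented for illustration; the only thing assumed about the real file is that the records sit in a list under a top-level `"data"` key, as the cell above implies.

```python
import pandas as pd

# Hypothetical payload: a top-level "data" list of article records
sample = {
    "data": [
        {"title": "First", "text": "Some text.", "source": {"name": "npr"}},
        {"title": "Second", "text": "More text.", "source": {"name": "npr"}},
    ]
}

# Each record becomes a row; nested keys are flattened with a dot separator
sample_df = pd.json_normalize(sample["data"])
print(sample_df.columns.tolist())
print(sample_df.shape)
```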

In [None]:
papers.head()

In [None]:
papers.columns

In [None]:
papers.shape

In [None]:
papers.sample(1).text

Notice how we don't have the topic of the articles! Let's use NMF to attempt to figure out clusters of the articles.

## Step 2: Data cleaning

In [None]:
papers1 = papers

In [None]:
# Remove the metadata columns that are not needed for topic modeling
# (drop returns a new DataFrame, so papers1 is left unchanged)
papers2 = papers1.drop(columns=['authors', 'url', 'source', 'created_at', 'updated_at', 'author', 'date'])

# Print out the first rows of papers
papers2.head()


In [None]:
papers2.shape

### Applying regex

In [None]:
# Load the regular expression library
import re

# Collapse runs of whitespace (including newlines) into a single space
papers2['text_preprocessed'] = \
papers2['text'].map(lambda x: re.sub(r'\s+', ' ', x))

# Remove apostrophes
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: re.sub(r"'", '', x))

# Remove basic punctuation
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the text to lowercase
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: x.lower())

# Remove AI words. The result must be assigned back to take effect, and
# word boundaries (\b) keep 'ai' from being stripped out of words like 'said'.
papers2['text_preprocessed'] = \
papers2['text_preprocessed'].map(lambda x: re.sub(r'\b(ai|artificial|intelligence)\b', '', x))

# Print out the first rows of the preprocessed text
papers2['text_preprocessed'].head()
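The cleaning steps can be checked on a single string before running them over the whole column. The helper below is a sketch only: `clean_text` is a hypothetical name, and it bundles the whitespace, apostrophe, punctuation and lowercase steps into one function. The last two lines illustrate why removing a short word such as `ai` is safer with word boundaries than with a plain substring replace.

```python
import re

def clean_text(text):
    """Collapse whitespace, strip apostrophes and basic punctuation, lowercase."""
    text = re.sub(r"\s+", " ", text)     # newlines, tabs, runs of spaces -> one space
    text = re.sub(r"'", "", text)        # drop apostrophes
    text = re.sub(r"[,.!?]", "", text)   # drop basic punctuation
    return text.lower()

print(clean_text("Hello,\n  World! It's   fine."))

# Substring replacement also eats the "ai" inside "said"...
naive = "ai said".replace("ai", "")
# ...whereas a word-boundary regex only removes the standalone word
bounded = re.sub(r"\bai\b", "", "ai said")
print(repr(naive), repr(bounded))
```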

### Applying lemmatization

In [None]:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
def lemmatizing_article(line):
    """Return the list of spaCy lemmas, one per token in `line`."""
    doc = nlp(line)
    return [token.lemma_ for token in doc]

In [None]:
papers2['text_lemmatized'] = \
papers2['text_preprocessed'].apply(lambda x: lemmatizing_article(x))

In [None]:
print(papers2.head(20))

In [None]:
# Function to convert each row of a DataFrame column from a list of tokens to a string
def listToString(s):

    # Join the tokens with a single space as the separator
    return " ".join(s)

In [None]:
papers2['text_lemmatized_string'] = \
papers2['text_lemmatized'].apply(lambda x: listToString(x))

In [None]:
print(papers2.head(20))

## Step 3: Splitting the articles into a training part and a test part
### This has to be done now, because after applying TF-IDF it is no longer possible to add the "Topic" column

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
articles_train, articles_test = train_test_split(papers2, test_size = 0.25)
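A small sketch of what the split produces, using a made-up eight-row DataFrame. The `random_state=42` argument is an addition for illustration: it makes the split reproducible between runs, which the call above does not guarantee.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the articles DataFrame
toy = pd.DataFrame({"text": [f"article {i}" for i in range(8)]})

# 25% of the rows go to the test part; random_state fixes the shuffle
train_part, test_part = train_test_split(toy, test_size=0.25, random_state=42)

print(train_part.shape, test_part.shape)
# The original row index is preserved, so rows can be traced back to `toy`
print(sorted(test_part.index))
```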
    

## Step 4: Preprocessing with TfidfVectorizer and fit_transform on the training data

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

In [None]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

## Step 4.b: Applying fit_transform to the training data with TfidfVectorizer

In [None]:
dtm_train = tfidf.fit_transform(articles_train['text_lemmatized_string'])
# dtm_train = tfidf.fit_transform(articles_train['text_preprocessed'])


In [None]:
dtm_train

In [None]:
dtm_train.shape

<h1 style="color:purple">Step 5: Building the NMF model with the training part of the data (THIS IS THE MODEL)</h1>

In [None]:
from sklearn.decomposition import NMF

In [None]:
nmf_model = NMF(n_components=90,random_state=42)

In [None]:
### fit based on the train data

In [None]:
# This can take a while, we're dealing with a large number of documents!
nmf_model.fit(dtm_train)

## Step 5.a: Displaying Topics

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(tfidf.get_feature_names_out())

In [None]:
import random

In [None]:
feature_names = tfidf.get_feature_names_out()
for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_word_id = random.randrange(len(feature_names))
    print(feature_names[random_word_id])

In [None]:
# Run again to see another random sample of the vocabulary
for i in range(10):
    random_word_id = random.randrange(len(feature_names))
    print(feature_names[random_word_id])

In [None]:
len(nmf_model.components_)

In [None]:
nmf_model.components_

In [None]:
len(nmf_model.components_[0])

In [None]:
single_topic = nmf_model.components_[0]

In [None]:
# Returns the indices that would sort this array.
single_topic.argsort()

In [None]:
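# Weight of the word least representative of this topic
# (first index returned by argsort)
single_topic[single_topic.argsort()[0]]

In [None]:
# Weight of the word most representative of this topic
# (last index returned by argsort)
single_topic[single_topic.argsort()[-1]]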

In [None]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

In [None]:
top_word_indices = single_topic.argsort()[-10:]

In [None]:
for index in top_word_indices:
    print(tfidf.get_feature_names_out()[index])

These look like business articles perhaps... Let's confirm by using .transform() on our vectorized articles to attach a label number. But first, let's view all 90 topics found.

In [None]:
print(nmf_model.components_)

In [None]:
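feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    # Indices of the 15 highest-weighted words, in ascending order of weight
    top_indices = topic.argsort()[-15:]
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in top_indices])
    print([topic[i] for i in top_indices])
    print('\n')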
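The `argsort()[-n:]` pattern used above can be wrapped in a small helper that returns the strongest words first. This is a sketch: `top_terms` is a hypothetical name, and the vocabulary and weights below are made up rather than taken from the fitted model.

```python
import numpy as np

def top_terms(topic_weights, feature_names, n=3):
    """Return the n highest-weighted (term, weight) pairs, strongest first."""
    # argsort is ascending, so take the last n indices and reverse them
    top_idx = np.argsort(topic_weights)[-n:][::-1]
    return [(str(feature_names[i]), float(topic_weights[i])) for i in top_idx]

# Made-up vocabulary and one made-up row of topic weights
names = np.array(["bank", "election", "market", "senate", "stock"])
topic = np.array([0.9, 0.0, 1.4, 0.1, 2.0])

print(top_terms(topic, names))
```

With the real model, the same helper could be called as `top_terms(nmf_model.components_[0], tfidf.get_feature_names_out(), n=15)`.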

In [None]:
dfs = []
feature_names = tfidf.get_feature_names_out()
for index, topic in enumerate(nmf_model.components_):
    top_indices = topic.argsort()[-15:]
    names = [feature_names[i] for i in top_indices]
    weight = [topic[i] for i in top_indices]
    df = pd.DataFrame({'Names': names, 'Weight': weight})
    df = df.sort_values(by='Weight', ascending=False)
    # Keep the words with weight > 1; if there are none, keep only the top word
    if df['Weight'].max() > 1:
        df = df[df['Weight'] > 1]
    else:
        # Double brackets keep the result a DataFrame instead of a Series
        df = df.iloc[[0]]
    dfs.append(df)


In [None]:
dfs[0].head(5)

# Attaching Discovered Topic Labels to Original Articles

In [None]:
articles_train.shape

In [None]:
articles_test.shape

### transform based on the train data

In [None]:
topic_results_train = nmf_model.transform(dtm_train)

In [None]:
topic_results_train.shape

In [None]:
topic_results_train[0]

In [None]:
topic_results_train[0].round(2)

In [None]:
topic_results_train[0].argmax()

The index returned by `argmax()` is the topic with the highest weight for the first training article, i.e. the topic our model thinks this article belongs to.

### Combining with Original Data

In [None]:
papers2.head()

In [None]:
papers2.tail()

In [None]:
topic_results_train.argmax(axis=1)

In [None]:
articles_train['Topic'] = topic_results_train.argmax(axis=1)
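To illustrate what `argmax(axis=1)` does to the document-topic matrix, here is a minimal sketch with made-up weights for three documents and three topics. The `max(axis=1)` call is an optional extra that is not used above; it shows how strong the winning topic is for each document.

```python
import numpy as np

# Made-up document-topic weights: 3 documents x 3 topics
doc_topic = np.array([
    [0.02, 0.31, 0.00],
    [0.45, 0.01, 0.10],
    [0.00, 0.00, 0.27],
])

labels = doc_topic.argmax(axis=1)   # index of the strongest topic per row
strength = doc_topic.max(axis=1)    # weight of that strongest topic

print(labels)
print(strength)
```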

In [None]:
print(articles_train)

## <p style="color:purple">Step 6: Using the trained model to define the topics on the test-articles </p>

# Step X: topic results with the test *.txt file

## Step X.a:  Applying transform on the test article string by TfidfVectorizer

In [None]:
## For the training data, fit_transform was used; here only transform is applied, so the
## test data is encoded with the vocabulary learned during training. Otherwise the shape
## would not correspond to the one the NMF model was fitted on.

In [None]:
with open("../message.txt", "r") as f:
    single_article = f.read()

print(single_article)

In [None]:
article = str(single_article)

In [None]:
# dtm_test = tfidf.transform(articles_test['text_preprocessed'])
# dtm_test = tfidf.transform(article).toarray()
dtm_test = tfidf.transform([article])
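The difference between `transform` and `fit_transform` on new text can be seen on a made-up mini corpus. In this sketch the documents and variable names are invented; the point is only that `transform` reuses the fitted vocabulary, so the number of columns matches the training matrix, whereas refitting builds a new and incompatible vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "stocks rise on market news",
    "senate passes budget bill",
    "market falls as stocks slide",
]
new_doc = "budget news moves the market"

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)   # learns the vocabulary
X_new = vec.transform([new_doc])          # reuses it; note the list wrapper

# Refitting on the new text alone yields a different number of columns
X_refit = TfidfVectorizer().fit_transform([new_doc])

print(X_train.shape, X_new.shape, X_refit.shape)
```

The single article is wrapped in a list because `transform` expects an iterable of documents; passing a bare string raises an error.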


In [None]:
dtm_test.shape

## topics with the test data

### transform based on the test data

In [None]:
topic_results_test = nmf_model.transform(dtm_test)

In [None]:
topic_results_test.shape

In [None]:
topic_results_test[0]

In [None]:
print(topic_results_test[0])

In [None]:
topic_results_test[0].round(4)

### The text of the file is closest to topic number 11

In [None]:
topic_results_test[0].argmax()
