## Non-Negative Matrix Factorization

In [5]:
# Import the required libraries (the data is loaded in the next cell)
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [2]:
npr=pd.read_csv("npr.csv")
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [3]:
tfidf=TfidfVectorizer(max_df=0.9,min_df=2,stop_words="english")
dtm=tfidf.fit_transform(npr["Article"])
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

### NMF

In [6]:
nmf_model=NMF(n_components=7,random_state=42)
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
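For orientation, NMF approximates a non-negative matrix `V` (here, the document-term matrix) by the product of two smaller non-negative matrices, `W` (documents by topics) and `H` (topics by terms). In scikit-learn, `components_` holds `H`, and `fit_transform` or `transform` returns `W`. The following is a minimal sketch on a made-up 6x4 matrix, not the NPR data; the names `V` and `toy_model` and the `init` and `max_iter` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix standing in for a document-term matrix
# (6 "documents", 4 "terms"); the values are random, not real TF-IDF scores.
rng = np.random.default_rng(0)
V = rng.random((6, 4))

# Factorize V ~= W @ H with 2 topics (hypothetical settings for the toy case).
toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_model.fit_transform(V)   # document-topic weights, shape (6, 2)
H = toy_model.components_        # topic-term weights, shape (2, 4)

# Relative reconstruction error in the Frobenius norm (the default loss).
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, round(float(rel_error), 3))
```

Both factors are non-negative, which is what lets each row of `H` be read as a weighted list of terms for one topic.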

In [11]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf.get_feature_names_out()[4678]

'ballerinas'

In [12]:
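# Print ten randomly chosen vocabulary terms
feature_names=tfidf.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for _ in range(10):
    random_index=np.random.randint(0,len(feature_names))
    print(feature_names[random_index])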

moribund
bandit
betray
hookup
peregrine
juvenile
reinventing
invalidate
mineralization
memorized


### Displaying Topics

In [14]:
nmf_model.components_.shape

(7, 54777)

In [15]:
len(nmf_model.components_)

7

In [16]:
first_topic=nmf_model.components_[0]

In [18]:
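# argsort() sorts ascending, so the last ten indices are the ten largest weights (strongest last)
top_ten_words=first_topic.argsort()[-10:]
feature_names=tfidf.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for index in top_ten_words:
    print(feature_names[index])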

disease
percent
women
virus
study
water
food
people
zika
says


Terms such as "zika", "virus", and "disease" suggest that the first topic is related to public health.
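Because `argsort()` sorts in ascending order, the list above runs from the weakest of the ten terms to the strongest. A small helper can return the terms strongest-first together with their weights. This is a sketch only: `top_terms`, `vocab`, and `weights` are hypothetical names and toy values, not taken from the fitted model.

```python
import numpy as np

def top_terms(topic_weights, feature_names, n=10):
    """Return the n highest-weighted (term, weight) pairs, strongest first."""
    order = np.argsort(topic_weights)[::-1][:n]  # indices by descending weight
    return [(feature_names[i], float(topic_weights[i])) for i in order]

# Toy vocabulary and topic row (made-up values for illustration).
vocab = np.array(["zika", "senate", "virus", "music", "health"])
weights = np.array([0.9, 0.0, 0.7, 0.1, 0.8])

print(top_terms(weights, vocab, n=3))
```

With the fitted model, the same helper could be called as `top_terms(nmf_model.components_[0], tfidf.get_feature_names_out())`, assuming a scikit-learn version that provides `get_feature_names_out()`.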

### Now let's find all the topics

In [21]:
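# Fetch the vocabulary once instead of rebuilding it for every word lookup
feature_names=tfidf.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for i,topic in enumerate(nmf_model.components_):
    print(f"Top 20 words related to Topic #{i} are:")
    print([feature_names[index] for index in topic.argsort()[-20:]])
    print("\n")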

Top 20 words related to Topic #0 are:
['years', 'brain', 'university', 'researchers', 'scientists', 'new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


Top 20 words related to Topic #1 are:
['intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


Top 20 words related to Topic #2 are:
['insurers', 'federal', 'said', 'aca', 'repeal', 'senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


Top 20 words related to Topic #3 are:
['killed', 'reported', 'military', 'justice', 'city', 'officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']

### Attaching discovered topic labels to original dataframe

In [22]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [23]:
topic_results=nmf_model.transform(dtm)

In [24]:
topic_results.shape

(11992, 7)

In [25]:
topic_results[0].argmax()

1

The largest weight for the first article is at index 1, so it is assigned to the second discovered topic (topic indices start at 0).
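`argmax()` keeps only the single strongest topic, although each row of the transformed matrix holds a weight for every topic. A rough way to see how dominant the winning topic is, sketched below, is to rescale a row so its weights sum to 1. The row `doc_weights` is a made-up example with seven entries, not the actual output for any article, and NMF weights are not probabilities, so the shares are only a relative indication.

```python
import numpy as np

# Hypothetical topic weights for one document (7 topics); not real model output.
doc_weights = np.array([0.00, 0.12, 0.03, 0.05, 0.00, 0.00, 0.00])

best_topic = int(doc_weights.argmax())  # index of the largest weight

# Rescale to shares that sum to 1, guarding against an all-zero row,
# for which argmax() would silently return 0.
total = doc_weights.sum()
shares = doc_weights / total if total > 0 else np.zeros_like(doc_weights)

print(best_topic, shares.round(2))
```

A row whose largest share is only slightly above the others may be worth treating as a mixed-topic article rather than forcing a single label.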

In [28]:
# Add each article's dominant topic index as a new column in the original DataFrame
npr["Topic"]=topic_results.argmax(axis=1)
npr.head(6)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
5,I did not want to join yoga class. I hated tho...,5


In [29]:
# You can name these topics and add them to the data frame
topic_dict={
    0:"Health research",
    1:"Political Issues",
    2:"Government Policies",
    3:"Military|Police Topic",
    4:"Election",
    5:"Art",
    6:"Education",
}

In [31]:
topic_dict

{0: 'Health research',
 1: 'Political Issues',
 2: 'Government Policies',
 3: 'Military|Police Topic',
 4: 'Election',
 5: 'Art',
 6: 'Education'}

In [36]:
npr["Topic_labels"]=npr["Topic"].map(topic_dict)
npr.head(10)

Unnamed: 0,Article,Topic,Topic_labels
0,"In the Washington of 2016, even when the polic...",1,Political Issues
1,Donald Trump has used Twitter — his prefe...,1,Political Issues
2,Donald Trump is unabashedly praising Russian...,1,Political Issues
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Military|Police Topic
4,"From photography, illustration and video, to d...",6,Education
5,I did not want to join yoga class. I hated tho...,5,Art
6,With a who has publicly supported the debunk...,0,Health research
7,"I was standing by the airport exit, debating w...",0,Health research
8,"If movies were trying to be more realistic, pe...",0,Health research
9,"Eighteen years ago, on New Year’s Eve, David F...",5,Art
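Once the labels are attached, a quick sanity check is to count how many articles fall under each label. The sketch below uses only the ten topic assignments shown above in a small stand-in frame named `toy` (an assumption for illustration); on the full data the same call would be `npr["Topic_labels"].value_counts()`.

```python
import pandas as pd

# Same label mapping as in the notebook.
topic_dict = {0: "Health research", 1: "Political Issues", 2: "Government Policies",
              3: "Military|Police Topic", 4: "Election", 5: "Art", 6: "Education"}

# Stand-in frame holding only the ten topic indices displayed above.
toy = pd.DataFrame({"Topic": [1, 1, 1, 3, 6, 5, 0, 0, 0, 5]})
toy["Topic_labels"] = toy["Topic"].map(topic_dict)

# Number of articles per label, most frequent first.
label_counts = toy["Topic_labels"].value_counts()
print(label_counts)
```

A heavily skewed distribution on the full corpus can be a hint that the number of components or the hand-written labels deserve a second look.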


In [46]:
# Full text of the article at row 6 (labelled "Health research")
npr["Article"].iloc[6]

'With a   who has publicly supported the debunked claim that vaccines cause autism, suggested that climate change is a hoax dreamed up by the Chinese, and appointed to his Cabinet a retired neurosurgeon who doesn’t buy the theory of evolution, things might look grim for science. Yet watching Patti Smith sing ”A Hard Rain’s   Fall” live streamed from the Nobel Prize ceremony in early December to a room full of physicists, chemists and physicians  —   watching her twice choke up, each time stopping the song altogether, only to push on through all seven wordy minutes of one of Bob Dylan’s most beloved songs  —   left me optimistic. Taking nothing away from the very real anxieties about future funding and support for science, neuroscience in particular has had plenty of promising leads that could help fulfill Alfred Nobel’s mission to better humanity. In the spirit of optimism, and with input from the Society for Neuroscience, here are a few of the noteworthy neuroscientific achievements o