# NLP Final Project - Sentiment Analysis & Topic Modeling Comparison

# Summary

Compute per-article sentiment scores (the share of an article's words that appear in a word list) using the Positiv and Negativ lists from the Harvard IV-4 dictionary. Additionally, test tf-idf vectorization together with two topic modeling methods (LSA and NMF) to determine which combination of method and hyperparameters (n-gram range and number of topics) to use for training and testing classification models.

# Import Modules & Data

In [1]:
import pandas as pd
import re

In [2]:
articles = pd.read_csv('./articles_100K.csv', index_col = 0)
articles.head()

Unnamed: 0,news_source,pub_date,title,text,fake_news_binary,left_bias_avg,right_bias_avg,net_bias,fake_news_outlier,clean_txt,PoS_tags
0,The Sun,2018-10-17,Who left Big Brother last week and whos still ...,THE last ever Big Brother is underway with a f...,0,0.0,1.0,1.0,1,last ever Big Brother be underway final batch ...,DT JJ RB JJ NN VBZ RB IN DT JJ NN IN NNS . IN ...
1,True Pundit,2018-02-08,I Have to Let Them in DHS Chief Reveals How MS...,During a White House roundtable discussion on ...,1,0.0,1.0,1.0,-1,White House roundtable discussion immigration ...,"IN DT NNP NNP JJ NN IN NN CC NN NN , DT NN IN ..."
2,The Daily Caller,2018-11-16,Trump Works On Mueller Answers Avoiding Perjur...,President Donald Trump is working on written a...,0,0.0,1.0,1.0,1,President Donald Trump be work write answer qu...,NNP NNP NN VBZ VBG IN VBN NNS TO NNS IN JJ NN ...
3,The Daily Caller,2018-08-28,McConnell Announces Bipartisan Senate Group Wi...,Senate Majority Leader Mitch McConnell announc...,0,0.0,1.0,1.0,1,Senate Majority Leader Mitch McConnell announc...,NNP NN NN NN NNP VBD DT NN NNP NN MD VB VBG RB...
4,BBC,2018-07-23,Why millions listen to this nine-year-old girl...,A scheme to let parents close their streets to...,0,0.25,0.0,-0.25,1,scheme let parent close street car so child ca...,DT NN TO VB NNS RB PRP$ NNS TO NNS RB NNS MD N...


In [3]:
len(articles)

100000

In [4]:
# count articles whose text, clean text, and PoS tags are all non-null;
# a count equal to len(articles) means nothing is missing
len(articles[articles['text'].notnull() & articles['clean_txt'].notnull() & articles['PoS_tags'].notnull()])

100000

In [5]:
articles['fake_news_binary'].value_counts()

0    75072
1    24928
Name: fake_news_binary, dtype: int64

## Sentiment Score

In [6]:
hi4_df = pd.read_excel('inquireraugmented.xls')
hi4_df.head()

Unnamed: 0,Entry,Source,Positiv,Negativ,Pstv,Affil,Ngtv,Hostile,Strong,Power,...,Anomie,NegAff,PosAff,SureLw,If,NotLw,TimeSpc,FormLw,Othrtags,Defined
0,,,1915.0,2291,1045.0,557.0,1160,833.0,1902.0,689.0,...,30.0,193.0,126.0,175.0,132.0,25.0,428.0,368.0,,
1,A,H4Lvd,,,,,,,,,...,,,,,,,,,DET ART,| article: Indefinite singular article--some o...
2,ABANDON,H4Lvd,,Negativ,,,Ngtv,,,,...,,,,,,,,,SUPV,|
3,ABANDONMENT,H4,,Negativ,,,,,,,...,,,,,,,,,Noun,|
4,ABATE,H4Lvd,,Negativ,,,,,,,...,,,,,,,,,SUPV,|


In [7]:
pos_df = hi4_df[hi4_df.Positiv=="Positiv"]
neg_df = hi4_df[hi4_df.Negativ =="Negativ"]

In [8]:
pos_list = pos_df['Entry'].tolist()
neg_list = neg_df['Entry'].tolist()

In [9]:
# keep only uppercase letters A-Z in each entry (drops suffixes such as "#1"),
# then deduplicate by collecting the results in a set
pos_word = {re.sub(r'[^A-Z]', "", str(entry)) for entry in pos_list}

In [10]:
# same cleanup for the negative list
neg_word = {re.sub(r'[^A-Z]', "", str(entry)) for entry in neg_list}
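
To see what the cleanup regex in the two cells above does, here is a minimal sketch on a few made-up entries. The sample strings are assumptions chosen to resemble the dictionary's format (some Harvard IV-4 entries carry a sense suffix such as `#1`); they are not read from `inquireraugmented.xls`.

```python
import re

# Hypothetical entries in the dictionary's style; "#1"/"#2" mark word senses.
sample_entries = ["ABANDON", "ABOUT#1", "ABOUT#2", "ABLE"]

# Drop every character that is not an uppercase A-Z letter.
cleaned = {re.sub(r'[^A-Z]', "", str(entry)) for entry in sample_entries}

print(sorted(cleaned))
```

Because the results are stored in a set, different senses of the same word collapse into a single entry.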

In [11]:
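def sentiment_scorer(text_input, in_list):
    # returns the share of space-separated tokens in text_input found in in_list
    words_set = set(in_list)
    # dictionary entries are uppercase, so uppercase the text before matching
    tokens = text_input.upper().split(' ')
    score = 0
    for token in tokens:
        if token in words_set:
            score += 1
    return score / len(tokens)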
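
As a quick sanity check of the scorer, the sketch below repeats the same logic and applies it to a toy sentence. The two word sets are tiny hypothetical stand-ins for `pos_word` and `neg_word`, not the real Harvard IV-4 lists.

```python
def sentiment_scorer(text_input, in_list):
    # Same logic as the notebook cell: matched tokens / total tokens.
    words_set = set(in_list)
    tokens = text_input.upper().split(' ')
    score = sum(1 for token in tokens if token in words_set)
    return score / len(tokens)

toy_pos = {"GOOD"}   # hypothetical positive list
toy_neg = {"BAD"}    # hypothetical negative list
toy_text = "good bad good news"

pos_score = sentiment_scorer(toy_text, toy_pos)   # 2 of 4 tokens match
neg_score = sentiment_scorer(toy_text, toy_neg)   # 1 of 4 tokens match
net_score = pos_score - neg_score

print(pos_score, neg_score, net_score)
```

Note that `split(' ')` turns consecutive spaces into empty tokens, which still count toward the denominator.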

In [12]:
articles['pos_sent']=articles['clean_txt'].apply(lambda x: sentiment_scorer(x, pos_word))

In [13]:
articles['neg_sent']=articles['clean_txt'].apply(lambda x: sentiment_scorer(x, neg_word))

In [14]:
articles['net_sent'] = articles['pos_sent'] - articles['neg_sent']
articles.head()

Unnamed: 0,news_source,pub_date,title,text,fake_news_binary,left_bias_avg,right_bias_avg,net_bias,fake_news_outlier,clean_txt,PoS_tags,pos_sent,neg_sent,net_sent
0,The Sun,2018-10-17,Who left Big Brother last week and whos still ...,THE last ever Big Brother is underway with a f...,0,0.0,1.0,1.0,1,last ever Big Brother be underway final batch ...,DT JJ RB JJ NN VBZ RB IN DT JJ NN IN NNS . IN ...,0.062016,0.093023,-0.031008
1,True Pundit,2018-02-08,I Have to Let Them in DHS Chief Reveals How MS...,During a White House roundtable discussion on ...,1,0.0,1.0,1.0,-1,White House roundtable discussion immigration ...,"IN DT NNP NNP JJ NN IN NN CC NN NN , DT NN IN ...",0.118483,0.113744,0.004739
2,The Daily Caller,2018-11-16,Trump Works On Mueller Answers Avoiding Perjur...,President Donald Trump is working on written a...,0,0.0,1.0,1.0,1,President Donald Trump be work write answer qu...,NNP NNP NN VBZ VBG IN VBN NNS TO NNS IN JJ NN ...,0.061644,0.068493,-0.006849
3,The Daily Caller,2018-08-28,McConnell Announces Bipartisan Senate Group Wi...,Senate Majority Leader Mitch McConnell announc...,0,0.0,1.0,1.0,1,Senate Majority Leader Mitch McConnell announc...,NNP NN NN NN NNP VBD DT NN NNP NN MD VB VBG RB...,0.133333,0.033333,0.1
4,BBC,2018-07-23,Why millions listen to this nine-year-old girl...,A scheme to let parents close their streets to...,0,0.25,0.0,-0.25,1,scheme let parent close street car so child ca...,DT NN TO VB NNS RB PRP$ NNS TO NNS RB NNS MD N...,0.05,0.25,-0.2


In [15]:
articles.to_csv('articles_100K_plusSentiment.csv')

## Compare NMF & LSA

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.decomposition import TruncatedSVD

In [17]:
n_features = 20000
n_top_words = 20

def set_vectorizer(model = TfidfVectorizer,
                   max_df = .95,
                   min_df = 2,
                   max_features = n_features,
                   stop_words = 'english',
                   ngram_range = (1, 2)):

    return model(max_df = max_df,
                 min_df = min_df,
                 max_features = max_features,
                 stop_words = stop_words,
                 ngram_range = ngram_range)
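
The `ngram_range` argument controls which word sequences become features. As a rough illustration only (plain Python, not scikit-learn's tokenizer, and ignoring stop-word removal and the `min_df`/`max_df` filters), the hypothetical helper `word_ngrams` below lists the candidate terms for a given range:

```python
def word_ngrams(tokens, ngram_range):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Toy token list, for illustration only.
tokens = ["white", "house", "roundtable"]

unigrams = word_ngrams(tokens, (1, 1))
uni_bigrams = word_ngrams(tokens, (1, 2))

print(unigrams)
print(uni_bigrams)
```

Since `max_features` is fixed at 20000, widening the range means unigrams and longer n-grams compete for the same feature budget.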

In [18]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]+ ','
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
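
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` picks the indices of the `n_top_words` largest weights, largest first. A minimal sketch with made-up weights, using a plain-Python stand-in for NumPy's `argsort` so it runs without dependencies:

```python
# Hypothetical topic weights and vocabulary, for illustration only.
feature_names = ["game", "trump", "league", "vote"]
topic = [0.1, 0.7, 0.3, 0.5]
n_top_words = 2

# Indices that would sort the weights in ascending order (what argsort returns).
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# Walk backwards from the end, stopping after n_top_words entries.
top_idx = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

print(top_idx, top_words)
```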

In [19]:
data_samples = articles['clean_txt'].tolist()
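
The sections that follow refit the vectorizer and the topic model by hand for each n-gram range and topic count. A hedged sketch of how the same sweep could be organized as a loop is shown here; `run_grid` and `describe` are hypothetical names, and `describe` is only a placeholder for a function that would vectorize the text, fit the model, and return it.

```python
from itertools import product

def run_grid(fit_fn, ngram_ranges, topic_counts):
    """Call fit_fn once per (ngram_range, n_topics) pair and collect the results."""
    results = {}
    for ngram_range, n_topics in product(ngram_ranges, topic_counts):
        results[(ngram_range, n_topics)] = fit_fn(ngram_range, n_topics)
    return results

def describe(ngram_range, n_topics):
    # Placeholder: a real fit function would build the tf-idf matrix
    # and fit NMF or TruncatedSVD with these settings.
    return f"ngram_range={ngram_range}, n_components={n_topics}"

results = run_grid(describe, [(1, 1), (1, 2)], [10, 20, 50, 75, 100])

for key, value in results.items():
    print(key, "->", value)
```

Fitting the vectorizer once per n-gram range and reusing the matrix across topic counts would avoid repeating the most expensive shared step.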

### 10 topics
n_components = 10


#### ngram(1,1)

In [20]:
tf_gram1= set_vectorizer(ngram_range = (1, 1))

gram1_data = tf_gram1.fit_transform(data_samples)

nmf_word = tf_gram1.get_feature_names()

In [21]:
nmf10 = NMF(n_components = 10).fit(gram1_data)

print_top_words(nmf10, nmf_word, n_top_words)

Topic #0: say, people, year, just, make, time, know, new, think, work, life, woman, want, love, thing, company, come, look, dont, day,
Topic #1: nginx, forbidden, asylum, rule, ban, tigar, law, order, chinese, judge, guard, chau, conference, islamic, religious, world, federal, potter, opinion, tourist,
Topic #2: trump, president, mueller, white, cohen, house, say, trumps, donald, campaign, investigation, russia, fbi, tweet, russian, counsel, administration, attorney, news, comey,
Topic #3: democrats, election, vote, republican, voter, democratic, state, republicans, candidate, party, race, percent, senate, house, district, poll, seat, campaign, gop, primary,
Topic #4: game, league, season, player, club, team, win, play, premier, match, goal, manchester, cup, football, score, chelsea, city, mourinho, england, united,
Topic #5: police, say, gun, school, officer, child, report, arrest, student, kill, shoot, people, man, incident, attack, law, city, border, victim, suspect,
Topic #6: der, 

In [22]:
svd1 = TruncatedSVD(n_components = 10).fit(gram1_data)

svd1_word = list(sorted(tf_gram1.vocabulary_.keys()))

print_top_words(svd1, svd1_word, n_top_words)

Topic #0: trump, say, president, people, house, make, year, new, state, time, white, just, election, report, come, told, campaign, court, work, country,
Topic #1: nginx, forbidden, game, police, season, year, league, player, city, home, people, play, time, club, school, life, world, just, child, start,
Topic #2: trump, president, mueller, house, cohen, election, trumps, donald, white, nginx, democrats, forbidden, campaign, fbi, investigation, russia, senate, republicans, republican, counsel,
Topic #3: democrats, vote, senate, court, republican, republicans, democratic, supreme, voter, election, ford, judge, woman, party, candidate, committee, race, nominee, district, state,
Topic #4: trump, game, league, season, player, team, cohen, club, play, mueller, win, premier, match, manchester, cup, football, goal, fan, star, chelsea,
Topic #5: der, die, und, den, police, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, investigation, ist, auch,
Topic #6: der, die, und, vote, democ

#### ngram(1,2)

In [23]:
tf_gram2= set_vectorizer(ngram_range = (1, 2))

gram2_data = tf_gram2.fit_transform(data_samples)

nmf2_word = tf_gram2.get_feature_names()

In [24]:
nmf2 = NMF(n_components =10).fit(gram2_data)

print_top_words(nmf2, nmf2_word, n_top_words)

Topic #0: say, people, police, year, child, woman, just, make, school, time, life, family, work, know, new, use, told, day, home, come,
Topic #1: forbidden nginx, nginx, forbidden, asylum, ban, rule, law, chinese, guard, order, judge, apparently, islamic, potter, world, harry potter, tourist, annoy, religious, century,
Topic #2: trump, president, white house, white, house, donald, say, donald trump, trumps, president trump, administration, trump say, tweet, obama, press, president donald, news, cnn, think, people,
Topic #3: north, korea, kim, north korea, korean, nuclear, summit, north korean, south, pompeo, jong, south korea, meeting, pyongyang, kim jong, denuclearization, leader, moon, singapore, korean leader,
Topic #4: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #5: game, league, season, player, club, team, win, play, premier league, premier, goal, match, manchester, cup, football, score, chelsea, england, mourinho, 

In [25]:
svd2 = TruncatedSVD(n_components = 10).fit(gram2_data)

svd2_word = list(sorted(tf_gram2.vocabulary_.keys()))

print_top_words(svd2, svd2_word, n_top_words)

Topic #0: say, trump, president, people, year, make, house, new, state, time, just, come, report, white, told, work, election, know, want, country,
Topic #1: nginx, forbidden nginx, forbidden, game, police, season, year, league, city, player, home, world, child, life, play, people, club, time, school, start,
Topic #2: trump, president, white house, house, mueller, election, white, trumps, donald, democrats, cohen, donald trump, campaign, fbi, investigation, senate, russia, republican, republicans, president trump,
Topic #3: der, die, trump, und, north, korea, kim, iran, north korea, den, russia, nuclear, korean, russian, von, china, zu, president, trade, deal,
Topic #4: der, die, und, den, von, zu, mit, da, das, auf, fur, dem, sich, eine, ein, nicht, court, ist, auch, im,
Topic #5: trump, game, league, season, player, team, win, club, play, premier league, premier, match, goal, manchester, cup, football, score, president, chelsea, mueller,
Topic #6: vote, democrats, voter, state, party

#### The LSA topics overlap, so I will stop analyzing LSA at higher topic counts.
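
One way to put a number on that overlap is the Jaccard similarity between the top-word sets of two topics. This is a sketch of one possible check, not something the notebook computes; the word lists are the first ten words of Topics #2 and #4 from the unigram NMF and LSA outputs printed above.

```python
def jaccard(words_a, words_b):
    """Shared words divided by all distinct words across both lists."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# First ten words of Topics #2 and #4, copied from the unigram outputs above.
nmf_topic2 = ["trump", "president", "mueller", "white", "cohen",
              "house", "say", "trumps", "donald", "campaign"]
nmf_topic4 = ["game", "league", "season", "player", "club",
              "team", "win", "play", "premier", "match"]
lsa_topic2 = ["trump", "president", "mueller", "house", "cohen",
              "election", "trumps", "donald", "white", "nginx"]
lsa_topic4 = ["trump", "game", "league", "season", "player",
              "team", "cohen", "club", "play", "mueller"]

nmf_overlap = jaccard(nmf_topic2, nmf_topic4)
lsa_overlap = jaccard(lsa_topic2, lsa_topic4)

print("NMF overlap:", nmf_overlap)
print("LSA overlap:", lsa_overlap)
```

In this pair the NMF topics share no words, while the LSA politics and football topics share `trump`, `cohen`, and `mueller`.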

#### ngram(1,3)

In [26]:
tf_v3= set_vectorizer(ngram_range = (1, 3))

tfidf3_data = tf_v3.fit_transform(data_samples)

In [27]:
nmf3 = NMF(n_components = 10).fit(tfidf3_data)

In [28]:
nmf3_word = tf_v3.get_feature_names()

print_top_words(nmf3, nmf3_word, n_top_words)

Topic #0: say, people, police, year, child, make, just, woman, time, life, school, family, new, work, know, use, told, day, home, come,
Topic #1: forbidden nginx, nginx, forbidden, asylum, rule, ban, chinese, law, guard, order, judge, apparently, potter, world, islamic, harry potter, tourist, annoy, religious, century,
Topic #2: trump, president, white house, white, house, mueller, cohen, say, trumps, donald, donald trump, campaign, investigation, fbi, tweet, president trump, counsel, administration, news, attorney,
Topic #3: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #4: court, ford, supreme, supreme court, judge, senate, committee, allegation, brett, justice, hearing, sexual, nominee, assault, say, judiciary, kavanaughs, confirmation, fbi, judiciary committee,
Topic #5: game, league, season, player, club, team, win, play, premier league, premier, goal, match, manchester, cup, football, score, chelsea, england, mourinh

In [30]:
svd3 = TruncatedSVD(n_components = 10).fit(tfidf3_data)

svd3_word = list(sorted(tf_v3.vocabulary_.keys()))

print_top_words(svd3, svd3_word, n_top_words)

Topic #0: say, trump, president, people, year, make, house, new, state, time, just, come, report, white, work, told, election, know, want, country,
Topic #1: forbidden nginx, nginx, forbidden, game, police, season, league, year, city, player, world, club, home, play, life, people, child, time, woman, start,
Topic #2: trump, president, white house, house, mueller, election, white, trumps, democrats, donald, cohen, donald trump, campaign, fbi, senate, investigation, republican, russia, republicans, president trump,
Topic #3: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, trump, ein, nicht, ist, auch, sie,
Topic #4: der, die, court, und, democrats, vote, senate, supreme court, republican, supreme, woman, ford, republicans, judge, democratic, den, voter, election, von, committee,
Topic #5: trump, game, league, season, player, team, win, club, play, premier league, premier, match, goal, manchester, cup, football, score, chelsea, president, mueller,
Topic #6: vote, dem

### NMF: 20 topics

n_components = 20

#### ngram(1,1)

tf_gram1= set_vectorizer(ngram_range = (1, 1))

gram1_data = tf_gram1.fit_transform(data_samples)

nmf_word = tf_gram1.get_feature_names()

In [31]:
nmf20 = NMF(n_components = 20).fit(gram1_data)

In [32]:
print_top_words(nmf20, nmf_word, n_top_words)

Topic #0: say, just, people, know, think, woman, life, love, make, time, want, year, thing, im, really, star, dont, look, come, film,
Topic #1: nginx, forbidden, rule, ban, asylum, tigar, order, chinese, guard, judge, law, chau, conference, islamic, tourist, religious, potter, annoy, city, smoking,
Topic #2: trump, president, donald, trumps, tweet, say, obama, cohen, administration, meeting, news, daniels, campaign, putin, think, rally, twitter, people, fox, cnn,
Topic #3: democrats, election, vote, republican, voter, democratic, state, republicans, candidate, race, party, senate, district, campaign, seat, poll, percent, primary, gop, ballot,
Topic #4: game, league, season, player, club, team, win, premier, play, goal, match, manchester, cup, football, score, chelsea, mourinho, city, england, united,
Topic #5: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #6: russian, russia, putin, iran, syria, moscow, military, sanction,

#### ngram(1,2)

tf_gram2= set_vectorizer(ngram_range = (1, 2))

gram2_data = tf_gram2.fit_transform(data_samples)

nmf2_word = tf_gram2.get_feature_names()

In [34]:
nmf20_2 = NMF(n_components = 20).fit(gram2_data)

In [35]:
print_top_words(nmf20_2, nmf2_word, n_top_words)

Topic #0: say, just, people, know, life, make, time, love, think, year, woman, want, thing, star, im, look, film, really, come, work,
Topic #1: nginx, forbidden nginx, forbidden, ban, rule, asylum, guard, chinese, apparently, order, potter, islamic, tourist, city, harry potter, annoy, conference, world, century, displeasure,
Topic #2: trump, president, donald, donald trump, president trump, trumps, trump say, tweet, say, president donald, obama, administration, trade, think, putin, meeting, people, american, tariff, rally,
Topic #3: north, korea, kim, north korea, korean, nuclear, summit, north korean, south, jong, pompeo, south korea, pyongyang, kim jong, denuclearization, meeting, leader, moon, singapore, korean leader,
Topic #4: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #5: game, league, season, player, club, team, win, premier league, premier, play, goal, manchester, match, cup, football, score, chelsea, city, mour

### NMF: 50 topics

n_components = 50

#### ngram(1,1)

tf_gram1= set_vectorizer(ngram_range = (1, 1))

gram1_data = tf_gram1.fit_transform(data_samples)

nmf_word = tf_gram1.get_feature_names()

In [36]:
nmf50 = NMF(n_components = 50).fit(gram1_data)

In [37]:
print_top_words(nmf50, nmf_word, n_top_words)

Topic #0: think, just, thing, dont, know, want, people, im, really, make, thats, way, lot, good, time, look, feel, need, try, come,
Topic #1: nginx, forbidden, asylum, tigar, rule, ban, guard, order, chinese, chau, conference, tourist, smoking, annoy, potter, apparently, islamic, displeasure, reporting, footage,
Topic #2: trump, donald, tweet, trumps, administration, campaign, meeting, rally, melania, twitter, supporter, election, president, tower, summit, ivanka, republican, presidency, attack, bad,
Topic #3: vote, election, state, voter, ballot, race, county, voting, florida, governor, recount, abrams, georgia, kemp, candidate, gillum, scott, count, campaign, republican,
Topic #4: game, team, player, season, play, win, england, match, cup, score, football, goal, coach, league, second, ball, sport, start, played, world,
Topic #5: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #6: world, macron, french, france, europe, euro

#### ngram(1,2)

tf_gram2= set_vectorizer(ngram_range = (1, 2))

gram2_data = tf_gram2.fit_transform(data_samples)

nmf2_word = tf_gram2.get_feature_names()

In [38]:
nmf50_2 = NMF(n_components =50).fit(gram2_data)

In [39]:
print_top_words(nmf50_2, nmf2_word, n_top_words)

Topic #0: people, think, just, dont, thing, know, want, im, really, make, way, thats, time, right, look, feel, lot, good, life, come,
Topic #1: nginx, forbidden nginx, forbidden, ban, rule, asylum, guard, chinese, order, apparently, potter, tourist, islamic, annoy, harry potter, world, smoking, conference, displeasure, century,
Topic #2: trump, president, donald, donald trump, president trump, trumps, trump say, tweet, president donald, administration, obama, campaign, rally, twitter, say trump, trump administration, trump tweet, presidency, melania, supporter,
Topic #3: putin, president, vladimir, vladimir putin, meeting, russian president, president vladimir, summit, russia, russian, election, intelligence, helsinki, nato, leader, putin say, conference, putins, kremlin, security,
Topic #4: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #5: league, club, season, premier league, premier, chelsea, arsenal, city, liverpool, g

### NMF: 75 topics

#### ngram(1,1)

tf_gram1= set_vectorizer(ngram_range = (1, 1))

gram1_data = tf_gram1.fit_transform(data_samples)

nmf_word = tf_gram1.get_feature_names()

In [40]:
nmf75 = NMF(n_components = 75).fit(gram1_data)

In [41]:
print_top_words(nmf75, nmf_word, n_top_words)

Topic #0: think, just, dont, thing, know, want, im, really, people, thats, make, way, lot, good, look, feel, need, time, youre, talk,
Topic #1: nginx, forbidden, asylum, tigar, rule, ban, guard, order, chinese, chau, conference, potter, annoy, smoking, apparently, tourist, islamic, displeasure, reporting, footage,
Topic #2: trump, donald, administration, trumps, meeting, campaign, tweet, rally, melania, supporter, summit, tower, ivanka, presidency, republican, america, bad, attack, ally, midterm,
Topic #3: democrats, republicans, republican, democratic, party, senate, seat, house, candidate, district, race, gop, pelosi, win, primary, election, congressional, democrat, majority, midterm,
Topic #4: league, club, manchester, mourinho, premier, united, chelsea, season, arsenal, liverpool, player, jose, champions, goal, madrid, manager, pogba, getty, summer, tottenham,
Topic #5: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #6:

#### ngram(1,2)

tf_gram2= set_vectorizer(ngram_range = (1, 2))

gram2_data = tf_gram2.fit_transform(data_samples)

nmf2_word = tf_gram2.get_feature_names()

In [42]:
nmf75_2 = NMF(n_components =75).fit(gram2_data)

In [43]:
print_top_words(nmf75_2, nmf2_word, n_top_words)

Topic #0: think, just, dont, thing, know, want, im, really, people, thats, make, way, lot, good, look, feel, time, youre, need, talk,
Topic #1: nginx, forbidden nginx, forbidden, rule, asylum, ban, guard, order, apparently, potter, harry potter, annoy, islamic, tourist, smoking, displeasure, conference, broadway, chinese, loses,
Topic #2: trump, donald, donald trump, trump say, trumps, tweet, campaign, president donald, administration, president, say trump, president trump, trump administration, meeting, rally, trump tweet, trump told, trump campaign, melania, supporter,
Topic #3: putin, vladimir, vladimir putin, meeting, russian president, russian, russia, president vladimir, intelligence, summit, election, helsinki, president, putin say, putins, kremlin, conference, cia, security, coats,
Topic #4: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #5: league, club, premier league, premier, season, chelsea, arsenal, liverpool,

### NMF: 100 topics

n_components = 100

#### ngram(1,1)

tf_gram1= set_vectorizer(ngram_range = (1, 1))

gram1_data = tf_gram1.fit_transform(data_samples)

nmf_word = tf_gram1.get_feature_names()

In [44]:
nmf_100 = NMF(n_components =100).fit(gram1_data)

In [45]:
print_top_words(nmf_100, nmf_word, n_top_words)

Topic #0: think, just, dont, thing, know, want, im, really, thats, make, way, good, look, lot, feel, time, youre, talk, need, try,
Topic #1: nginx, forbidden, asylum, tigar, ban, rule, guard, chinese, chau, order, annoy, potter, smoking, conference, apparently, displeasure, islamic, footage, tourist, reporting,
Topic #2: trump, donald, administration, trumps, campaign, tweet, rally, meeting, melania, supporter, tower, presidency, ivanka, america, republican, reporter, golf, bad, election, ally,
Topic #3: republican, race, candidate, democratic, district, democrats, seat, primary, republicans, gop, win, campaign, run, senate, congressional, won, democrat, election, incumbent, governor,
Topic #4: game, season, play, team, win, score, goal, second, ball, start, point, coach, run, minute, player, inning, yard, played, pitch, yankees,
Topic #5: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #6: war, world, soldier, yemen, end, p

#### ngram(1,2)


tf_gram2= set_vectorizer(ngram_range = (1, 2))

gram2_data = tf_gram2.fit_transform(data_samples)

nmf2_word = tf_gram2.get_feature_names()

In [46]:
nmf100_2 = NMF(n_components =100).fit(gram2_data)

In [47]:
print_top_words(nmf100_2, nmf2_word, n_top_words)

Topic #0: think, just, dont, thing, know, im, want, really, thats, make, way, good, lot, look, time, feel, youre, need, talk, try,
Topic #1: nginx, forbidden nginx, forbidden, asylum, rule, ban, guard, chinese, apparently, order, potter, islamic, harry potter, annoy, displeasure, smoking, tourist, broadway, conference, century,
Topic #2: trump, donald trump, donald, trump say, trumps, tweet, president donald, administration, campaign, say trump, rally, trump administration, meeting, trump told, trump tweet, supporter, mr trump, melania, trump campaign, republican,
Topic #3: say, told, add, ask, statement, comment, interview, want, trump say, tuesday, monday, say say, wednesday, say statement, say want, believe, say add, spokesman, thursday, official say,
Topic #4: der, die, und, den, von, zu, mit, da, das, auf, fur, sich, dem, eine, ein, nicht, ist, auch, sie, nach,
Topic #5: league, premier league, premier, season, liverpool, wolves, goal, match, manchester, tottenham, win, live, west