## Alternus Vera
Dataset: PolitiFact/LIAR dataset (https://www.politifact.com)

Description of the train TSV file format:

- Column 1: the ID of the statement ([ID].json)
- Column 2: the label.
- Column 3: the statement.
- Column 4: the subject(s).
- Column 5: the speaker.
- Column 6: the speaker's job title.
- Column 7: the state info.
- Column 8: the party affiliation.
- Columns 9-13: the speaker's total credit history counts, including the current statement.
  - 9: barely true counts.
  - 10: false counts.
  - 11: half true counts.
  - 12: mostly true counts.
  - 13: pants on fire counts.
- Column 14: the context (venue / location of the speech or statement).
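
The column layout above can be sketched as a list of field names and a small parser. This is only an illustration using the standard library: the names in `LIAR_COLUMNS`, the helper `parse_liar_rows`, and the sample row are all hypothetical and do not come from the dataset or from this notebook.

```python
import csv
import io

# Hypothetical column names for the 14-column layout listed above.
LIAR_COLUMNS = [
    "id", "label", "statement", "subjects", "speaker", "job_title",
    "state", "party", "barely_true_counts", "false_counts",
    "half_true_counts", "mostly_true_counts", "pants_on_fire_counts",
    "context",
]


def parse_liar_rows(handle):
    """Yield one dict per tab-separated row, keyed by LIAR_COLUMNS."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(LIAR_COLUMNS):
            continue  # skip malformed lines rather than guess at their layout
        record = dict(zip(LIAR_COLUMNS, row))
        # Columns 9-13 are counts; convert them and leave blanks as None.
        for name in LIAR_COLUMNS[8:13]:
            record[name] = float(record[name]) if record[name] else None
        yield record


# Made-up sample row shaped like the rows in this notebook.
sample = (
    "1.json\thalf-true\tAn example statement.\teconomy\tsome-speaker\t"
    "\tTexas\tnone\t0\t1\t0\t0\t0\ta mailer\n"
)
rows = list(parse_liar_rows(io.StringIO(sample)))
print(rows[0]["label"], rows[0]["false_counts"])
```

If the TSV file has no header row, the same list could be passed to `pd.read_csv` through its `names` parameter together with `header=None`. The file read in this notebook appears to carry its own header (`file`, `value`, ...), so that would not apply here as-is.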

In [2]:
import re  # used by textProcessing below

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
df = pd.read_csv('/Users/vidhsharma/Documents/SJSU/Fall18/ML/Lab/fakeNews/liar_dataset/train.tsv', sep='\t')

In [4]:
df.head(10)

| | file | value | statement | topic | name | occupation | state | party | barely true counts | false counts | half true counts | mostly true counts | pants on fire counts | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2635.json | FALSE | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer |
| 1 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |
| 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |
| 3 | 1123.json | FALSE | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 4 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |
| 5 | 12465.json | TRUE | The Chicago Bears have had more starting quart... | education | robin-vos | Wisconsin Assembly speaker | Wisconsin | republican | 0.0 | 3.0 | 2.0 | 5.0 | 1.0 | a an online opinion-piece |
| 6 | 2342.json | barely-true | Jim Dunnam has not lived in the district he re... | candidates-biography | republican-party-texas | | Texas | republican | 3.0 | 1.0 | 1.0 | 3.0 | 1.0 | a press release. |
| 7 | 153.json | half-true | I'm the only person on this stage who has work... | ethics | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | a Democratic debate in Philadelphia, Pa. |
| 8 | 5602.json | half-true | However, it took $19.5 million in Oregon Lotte... | jobs | oregon-lottery | | | organization | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | a website |
| 9 | 9741.json | mostly-true | Says GOP primary opponents Glenn Grothman and ... | energy,message-machine-2014,voting-record | duey-stroebel | State representative | Wisconsin | republican | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | an online video |


In [5]:
df.shape

(10240, 14)

In [6]:
df.columns

Index(['file', 'value', 'statement', 'topic', 'name', 'occupation', 'state',
       'party', 'barely true counts', 'false counts', 'half true counts',
       'mostly true counts', 'pants on fire counts', 'context '],
      dtype='object')
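
The last column name in the output above is `'context '`, with a trailing space, so `df['context']` would raise a `KeyError`. A minimal sketch of one way to normalize the names, shown on a made-up toy frame (`toy` is hypothetical and only mimics the trailing space):

```python
import pandas as pd

# Toy frame that mimics the trailing space seen in 'context ' above.
toy = pd.DataFrame({"statement": ["example"], "context ": ["a mailer"]})

# Strip surrounding whitespace from every column name.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

The same one-line assignment could be applied to `df.columns` right after loading, if cleaner names are wanted for later cells.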

##### Data exploration

In [7]:
df['state'].value_counts()

Texas                                                1009
Florida                                               997
Wisconsin                                             713
New York                                              657
Illinois                                              556
Ohio                                                  447
Georgia                                               426
Virginia                                              407
Rhode Island                                          369
New Jersey                                            241
Oregon                                                239
Massachusetts                                         206
Arizona                                               182
California                                            159
Washington, D.C.                                      120
Vermont                                                98
Pennsylvania                                           90
New Hampshire 

In [8]:
df['party'].unique()

array(['republican', 'democrat', 'none', 'organization', 'independent',
       'columnist', 'activist', 'talk-show-host', 'libertarian',
       'newsmaker', 'journalist', 'labor-leader', 'state-official',
       'business-leader', 'education-official', 'tea-party-member', nan,
       'green', 'liberal-party-canada', 'government-body', 'Moderate',
       'democratic-farmer-labor', 'ocean-state-tea-party-action',
       'constitution-party'], dtype=object)

In [9]:
df.loc[df['topic']=='health-care']

| | file | value | statement | topic | name | occupation | state | party | barely true counts | false counts | half true counts | mostly true counts | pants on fire counts | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1123.json | FALSE | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 14 | 8705.json | barely-true | Most of the (Affordable Care Act) has already ... | health-care | george-will | Columnist | Maryland | columnist | 7.0 | 6.0 | 3.0 | 5.0 | 1.0 | comments on "Fox News Sunday" |
| 25 | 10215.json | FALSE | I dont know who (Jonathan Gruber) is. | health-care | nancy-pelosi | House Minority Leader | California | democrat | 3.0 | 7.0 | 11.0 | 2.0 | 3.0 | a news conference |
| 85 | 2044.json | half-true | In Rick Perrys Texas, we import nurses ... fro... | health-care | bill-white | Former mayor of Houston | Texas | democrat | 2.0 | 3.0 | 5.0 | 7.0 | 3.0 | a speech at the Texas Democratic Party convention |
| 92 | 2020.json | mostly-true | The insurance commissioner cant do squat about... | health-care | ralph-hudgens | | Georgia | republican | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | a radio interview broadcast over the Internet |
| 165 | 4504.json | barely-true | Says Rick Perry wrote a letter supporting Hill... | health-care | ron-paul | U.S. representative | Texas | republican | 5.0 | 8.0 | 8.0 | 8.0 | 3.0 | a presidential debate |
| 248 | 11772.json | FALSE | We now have driven (health care) costs down to... | health-care | hillary-clinton | Presidential candidate | New York | democrat | 40.0 | 29.0 | 69.0 | 76.0 | 7.0 | comments during the South Carolina Democratic ... |
| 314 | 8308.json | barely-true | The American people support defunding Obamacar... | health-care | marco-rubio | U.S. Senator | Florida | republican | 33.0 | 24.0 | 32.0 | 35.0 | 5.0 | a press release |
| 350 | 9031.json | FALSE | Most young Americans right now, theyre not cov... | health-care | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | a Funnyordie.com "Between Two Ferns" interview... |
| 389 | 5631.json | FALSE | Says Congressman Bill Pascrell voted to remove... | health-care | steve-rothman | | | democrat | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | an interview on NJToday |


In [10]:
df.isnull().sum()

file                       0
value                      0
statement                  0
topic                      2
name                       2
occupation              2897
state                   2208
party                      2
barely true counts         2
false counts               2
half true counts           2
mostly true counts         2
pants on fire counts       2
context                  102
dtype: int64

### Data Cleaning
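- Tokenizing: splitting a document into its atomic elements (tokens).
- Stopping: removing common words (stop words) that carry little meaning on their own.
- Stemming: reducing inflected forms of a word to a common stem so that they are counted together.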
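
The three cleaning steps can be illustrated on a single sentence. This is a toy sketch, not the notebook's pipeline: `TOY_STOPWORDS` is a tiny hand-written stop list and `toy_stem` is a deliberately naive suffix stripper standing in for the Porter stemmer (its output differs from Porter's). Both names are hypothetical.

```python
import re

# Tiny illustrative stop list; the notebook itself uses NLTK's English list.
TOY_STOPWORDS = {"the", "of", "on", "a", "is", "to"}


def toy_stem(word):
    """Naive suffix stripper, a stand-in for the Porter stemmer (not equivalent)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenizing
    kept = [t for t in tokens if t not in TOY_STOPWORDS]  # stopping
    return [toy_stem(t) for t in kept]                    # stemming


print(clean("The group supports cutting taxes on wages."))
```

The crude stems this produces (for example `cutt` and `taxe`) show why a real stemmer with proper rewrite rules is used in the cells that follow.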

In [14]:
df1 = df.drop(columns=['file', 'value', 'barely true counts', 'false counts',
                       'half true counts', 'mostly true counts', 'pants on fire counts'])

In [16]:
print (df1)

                                               statement  \
0      Says the Annies List political group supports ...   
1      When did the decline of coal start? It started...   
2      Hillary Clinton agrees with John McCain "by vo...   
3      Health care reform legislation is likely to ma...   
4      The economic turnaround started at the end of ...   
5      The Chicago Bears have had more starting quart...   
6      Jim Dunnam has not lived in the district he re...   
7      I'm the only person on this stage who has work...   
8      However, it took $19.5 million in Oregon Lotte...   
9      Says GOP primary opponents Glenn Grothman and ...   
10     For the first time in history, the share of th...   
11     Since 2000, nearly 12 million Americans have s...   
12     When Mitt Romney was governor of Massachusetts...   
13     The economy bled $24 billion due to the govern...   
14     Most of the (Affordable Care Act) has already ...   
15     In this last election in November

In [18]:
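raw = df1[['statement']]
raw[:5]  # preview only; not displayed because it is not the last expression
# Transposing the one-column frame gives a list holding a single inner list
# with every statement, so final[0] is the list of statements.
final = raw.values.T.tolist()
print(len(final[0]))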

10240


In [19]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vidhsharma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
## Porter stemmer
STOP_WORDS = set(stopwords.words('english'))  # build the stop-word set once
ps = PorterStemmer()

def textProcessing(doc):
    # Keep letters only, lowercase, and split on whitespace
    process = re.sub('[^a-zA-Z]', ' ', doc).lower().split()
    # Drop stop words, then stem what remains
    process = [ps.stem(word) for word in process if word not in STOP_WORDS]
    return ' '.join(process)

In [21]:
print('original document: ')
# final holds a single inner list with every statement, so copy that list
words = list(final[0])
print(words)



original document: 


In [24]:
result =[]
for i in final[0]:
    result.append(textProcessing(i))
print (result)

['say anni list polit group support third trimest abort demand', 'declin coal start start natur ga took start begin presid georg w bush administr', 'hillari clinton agre john mccain vote give georg bush benefit doubt iran', 'health care reform legisl like mandat free sex chang surgeri', 'econom turnaround start end term', 'chicago bear start quarterback last year total number tenur uw faculti fire last two decad', 'jim dunnam live district repres year', 'person stage work activ last year pass along russ feingold toughest ethic reform sinc waterg', 'howev took million oregon lotteri fund port newport eventu land new noaa marin oper center pacif', 'say gop primari oppon glenn grothman joe leibham cast compromis vote cost million higher electr cost', 'first time histori share nation popular vote margin smaller latino vote margin', 'sinc nearli million american slip middl class poverti', 'mitt romney governor massachusett didnt slow rate growth govern actual cut', 'economi bled billion due

### Tokenizing the words

In [25]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [27]:
final=[]
for i in result:
    tokens = tokenizer.tokenize(i)
    final.append(tokens)
print (final)



[['say', 'anni', 'list', 'polit', 'group', 'support', 'third', 'trimest', 'abort', 'demand'], ['declin', 'coal', 'start', 'start', 'natur', 'ga', 'took', 'start', 'begin', 'presid', 'georg', 'w', 'bush', 'administr'], ['hillari', 'clinton', 'agre', 'john', 'mccain', 'vote', 'give', 'georg', 'bush', 'benefit', 'doubt', 'iran'], ['health', 'care', 'reform', 'legisl', 'like', 'mandat', 'free', 'sex', 'chang', 'surgeri'], ['econom', 'turnaround', 'start', 'end', 'term'], ['chicago', 'bear', 'start', 'quarterback', 'last', 'year', 'total', 'number', 'tenur', 'uw', 'faculti', 'fire', 'last', 'two', 'decad'], ['jim', 'dunnam', 'live', 'district', 'repres', 'year'], ['person', 'stage', 'work', 'activ', 'last', 'year', 'pass', 'along', 'russ', 'feingold', 'toughest', 'ethic', 'reform', 'sinc', 'waterg'], ['howev', 'took', 'million', 'oregon', 'lotteri', 'fund', 'port', 'newport', 'eventu', 'land', 'new', 'noaa', 'marin', 'oper', 'center', 'pacif'], ['say', 'gop', 'primari', 'oppon', 'glenn', 'g

### Adding all the sentence tokens to one list

In [34]:
import itertools
flat=itertools.chain.from_iterable(final)
text = list(flat)
print (type(text))

<class 'list'>


In [36]:
from gensim import corpora, models

In [37]:
dictionary = corpora.Dictionary(final)
print (dictionary)

Dictionary(7696 unique tokens: ['abort', 'anni', 'demand', 'group', 'list']...)


In [38]:
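# Flatten the per-statement token lists into one list.
# np.asarray on lists of unequal length fails on recent NumPy versions,
# so a plain comprehension is used and final stays a list of lists.
raw1 = [token for doc in final for token in doc]
type(raw1)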

list

### Bag of words

In [39]:
# Keep tokens that appear in at least 20 documents and in at most 50% of them
dictionary.filter_extremes(no_below=20, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in final]
print(bow_corpus[50])



[(4, 1), (34, 1), (90, 1), (91, 1), (93, 1), (275, 1), (276, 1), (277, 1), (278, 1)]
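
Each pair in the output above is `(token id, count in this document)`. The idea can be sketched in plain Python. This is an illustration under simplifying assumptions, not gensim's implementation: `build_vocab` and `doc_to_bow` are hypothetical helpers, the id assignment (alphabetical) is arbitrary, and only the `no_below` / `no_above` document-frequency filtering is mimicked.

```python
from collections import Counter


def build_vocab(docs, no_below=1, no_above=1.0):
    """Map token -> id, keeping tokens by document frequency."""
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    kept = sorted(
        tok for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )
    return {tok: i for i, tok in enumerate(kept)}


def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens present in vocab."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())


docs = [["tax", "cut", "tax"], ["job", "tax"], ["job", "growth"]]
vocab = build_vocab(docs, no_below=2)
print(vocab)
print(doc_to_bow(docs[0], vocab))
```

Tokens filtered out of the vocabulary simply disappear from the vectors, which is why a statement can end up with fewer pairs than it has words.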


### TF-IDF on Bag of words

In [41]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In [56]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.35904583507755133),
 (1, 0.43205352296640565),
 (2, 0.490320729707899),
 (3, 0.4306640399228395),
 (4, 0.12019203984807553),
 (5, 0.2874852011836854),
 (6, 0.40179104146231087)]
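
The weights above are consistent with unit-length vectors (their squares sum to roughly 1). A minimal sketch of one way such weights can be computed, assuming raw term frequency, a base-2 logarithmic IDF `log2(N / df)`, removal of zero-weight terms, and L2 normalization. These choices match my understanding of gensim's `TfidfModel` defaults but should be treated as assumptions. `tfidf_vectors` and `toy_corpus` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(bow_corpus):
    """Weight each (token_id, tf) pair by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    doc_freq = Counter(tid for doc in bow_corpus for tid, _ in doc)
    idf = {tid: math.log2(n_docs / df) for tid, df in doc_freq.items()}
    vectors = []
    for doc in bow_corpus:
        weights = [(tid, tf * idf[tid]) for tid, tf in doc]
        # A token present in every document has idf 0 and is dropped.
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        vectors.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return vectors


toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 1)]]
for vec in tfidf_vectors(toy_corpus):
    print(vec)
```

In the toy corpus, token 0 occurs in all three documents, so it carries no weight and the third document ends up with an empty vector.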


#### Sentiment Analysis

In [59]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

In [61]:
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(score)))
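
`polarity_scores` returns a dict with `neg`, `neu`, `pos` and `compound` keys. A small sketch of turning the `compound` value into a label, using the ±0.05 cut-offs commonly suggested for VADER. The helper `label_from_compound` is hypothetical, and the three score dicts are copied from statements in this notebook's output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dicts copied from the notebook output for three statements.
scores = [
    {'neg': 0.115, 'neu': 0.692, 'pos': 0.192, 'compound': 0.25},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.34, 'neu': 0.66, 'pos': 0.0, 'compound': -0.5574},
]
labels = [label_from_compound(s['compound']) for s in scores]
print(labels)
```

A function that returns the score dict instead of printing it would let these labels be stored as a new column, rather than only appearing in the cell output.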

In [72]:
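# sentiment_analyzer_scores prints each result and returns None,
# so printing its return value produces the 'None' lines in the output.
for i in df['statement']:
    result = sentiment_analyzer_scores(i)
    print(result)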
    

Says the Annies List political group supports third-trimester abortions on demand. {'neg': 0.115, 'neu': 0.692, 'pos': 0.192, 'compound': 0.25}
None
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. {'neg': 0.0, 'neu': 0.902, 'pos': 0.098, 'compound': 0.3612}
None
Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran." {'neg': 0.107, 'neu': 0.687, 'pos': 0.206, 'compound': 0.3182}
None
Health care reform legislation is likely to mandate free sex change surgeries. {'neg': 0.0, 'neu': 0.606, 'pos': 0.394, 'compound': 0.7579}
None
The economic turnaround started at the end of my term. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades. {'neg': 0.12, 'neu': 0.836, 'pos': 0.043, 'compound': 

None
Says abortion doctors are flying into this state, performing abortions and flying out. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Speaker Gingrich for 20 years supported a federal individual mandate for health insurance. {'neg': 0.0, 'neu': 0.827, 'pos': 0.173, 'compound': 0.3182}
None
There have been at least 100 shootings each year [Angel Taveras] has been mayor of Providence. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Maurice Ferre says Kendrick Meek has voted 98.6 percent of the time with the Democrat party. {'neg': 0.0, 'neu': 0.847, 'pos': 0.153, 'compound': 0.4019}
None
Weve created more jobs in the United States than every other advanced economy combined since I came into office. {'neg': 0.0, 'neu': 0.688, 'pos': 0.312, 'compound': 0.7003}
None
One in four inmates is an illegal immigrant. {'neg': 0.34, 'neu': 0.66, 'pos': 0.0, 'compound': -0.5574}
None
Wisconsin is dead last in income growth among midwestern states during Gov. Scott Walkers 

China owns about 29 percent of (the U.S.) debt. {'neg': 0.238, 'neu': 0.762, 'pos': 0.0, 'compound': -0.3612}
None
Chris Sununu supported Obamas Common Core agenda, taking away local control of our schools. {'neg': 0.0, 'neu': 0.85, 'pos': 0.15, 'compound': 0.3182}
None
(President Barack Obama) said unemployment was never gonna go over 8 percent if we passed the stimulus plan. {'neg': 0.153, 'neu': 0.847, 'pos': 0.0, 'compound': -0.4404}
None
Gwinnett County government has made significant cutbacks in staffing. {'neg': 0.0, 'neu': 0.816, 'pos': 0.184, 'compound': 0.2023}
None
Says the Constitution specifically states the Congress shall write legislation for immigration policy, so Barack Obama lacks the authority to defer the deportation of young illegal immigrants. {'neg': 0.186, 'neu': 0.772, 'pos': 0.042, 'compound': -0.6705}
None
Crime rises in communities with casinos. {'neg': 0.412, 'neu': 0.588, 'pos': 0.0, 'compound': -0.5423}
None
Americans spend 6.1 billion hours a year comply

None
Says Eric Cantor was the co-author of the House GOP principles on immigration reform. Both the New York Times and the Washington Postsaid that thatcaptured the essence of what was in the Senate immigration bill. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Wisconsins 6th congressional district has more manufacturing jobs than almost any other in the nation. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
I lost my health insurance and my doctor because of Obamacare. {'neg': 0.204, 'neu': 0.796, 'pos': 0.0, 'compound': -0.3182}
None
As far as standing committees, we have 10 fewer standing committees. So weve saved the state about a quarter-of-a-million dollars through the reduction of the standing committees. {'neg': 0.0, 'neu': 0.89, 'pos': 0.11, 'compound': 0.4728}
None
(Obamas) entire national security team, including his secretary of state, said we want to arm and train and equip (Syrian rebel forces), and he made the unilateral decision to turn them do

None
Im not one for name calling.------------ {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says Sen. Pat Toomey even tried to shut down the federal government in order to eliminate funding for Planned Parenthood. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Wisconsin has one of the most progressive tax codes in the country. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
During the year Kagan barred military recruiters from Harvard Law School's Office of Career Services, "military recruiting actually went up." {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
ISIS reaches about 100 million people a day through social media. {'neg': 0.0, 'neu': 0.882, 'pos': 0.118, 'compound': 0.0516}
None
Texas added more jobs in 2010 than any other state. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
In the primary election, a right-wing group spent more than $100,000,000 to support Supreme Court Justice Rebecca Bradley. {'neg': 0.0, 'neu': 

We have been focusing so much, especially the Texas Department of Transportation, on that 97 percent of people in single-occupancy vehicles. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
There is no state in the U.S. where a 40-hour minimum wage work week is enough to afford a two-bedroom apartment. {'neg': 0.109, 'neu': 0.891, 'pos': 0.0, 'compound': -0.296}
None
When those restrictions expire (in the Iran nuclear deal), Iran will have an industrial-size military nuclear capability ready to go. {'neg': 0.0, 'neu': 0.884, 'pos': 0.116, 'compound': 0.3612}
None
Says Hillary Clinton will receive her congressional salary until she dies and the Secret Service pays her mortgage. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says theres no proven instance where hydraulic fracking has polluted groundwater. {'neg': 0.366, 'neu': 0.634, 'pos': 0.0, 'compound': -0.6369}
None
Sixty percent of the auto thefts that we have in (St. Petersburg) are caused by people leaving t

Half of Oregon university students are on Pell Grants. {'neg': 0.0, 'neu': 0.808, 'pos': 0.192, 'compound': 0.2263}
None
Says uncompensated health care costs absorbed by Texas hospitals are adding $1,800 a year to Texas private insurance rates. {'neg': 0.0, 'neu': 0.842, 'pos': 0.158, 'compound': 0.4939}
None
Invested $90 million in traffic fixes without raising taxes. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Countries bombed: Obama 7, Bush 4------- {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
U.S. Rep. Debbie Wasserman Schultz has blamed the Republicans for the creation of Hamas. {'neg': 0.18, 'neu': 0.698, 'pos': 0.122, 'compound': -0.25}
None
A report in Navy Times said that 7.3 percent of Army, Navy and Marines have thought about attempting suicide. {'neg': 0.209, 'neu': 0.791, 'pos': 0.0, 'compound': -0.6705}
None
We cut property taxes by one-third in the state of Texas while Ive been governor. {'neg': 0.13, 'neu': 0.87, 'pos': 0.0, 'compound': -0.2

None
Manufacturing is having its best employment year in almost 15 years. {'neg': 0.0, 'neu': 0.704, 'pos': 0.296, 'compound': 0.6369}
None
Georgia ranks No. 9 in the rate of women murdered by men. {'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.765}
None
Bicycle ownership drops by half while obesity in Rhode Island rises by 154 percent {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Sue Lowden gave Harry Reid's campaigns a thousand dollars in five different elections, helping his Senate dreams come true. {'neg': 0.0, 'neu': 0.661, 'pos': 0.339, 'compound': 0.7717}
None
A gay man who survived #orlando hate crime can STILL show up to work in FL tomorrow and have his boss fire him simply because he is gay. {'neg': 0.267, 'neu': 0.641, 'pos': 0.092, 'compound': -0.743}
None
Says Gov. Charlie Crist has called him "a rock star." {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says U.S. Rep. Connie Mack took seven and a half years to finish college. {'neg': 0.0,

None
\u201CThirty-four percent of Hispanics don\u2019t have any health care at all, don\u2019t have any health insurance.\u201D {'neg': 0.0, 'neu': 0.824, 'pos': 0.176, 'compound': 0.4939}
None
Says when New Jersey adopted guaranteed coverage and cost provisions without a mandate individual health insurance market rates doubled or tripled and enrollment dropped from 180,000 people to 80,000 people. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says PolitiFact has said he kept a campaign promise to toughen ethics rules. {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'compound': 0.34}
None
Says Milwaukee County Executive Chris Abele is a billionaire. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Florida high schools are four out of the top 10 in the entire United States. {'neg': 0.0, 'neu': 0.739, 'pos': 0.261, 'compound': 0.5574}
None
On supporting the eventual Republican presidential nominee {'neg': 0.0, 'neu': 0.674, 'pos': 0.326, 'compound': 0.4404}
None
State emp

The average person only pays about $1,800 in state taxes which is the lowest of all 50 states. {'neg': 0.133, 'neu': 0.867, 'pos': 0.0, 'compound': -0.3818}
None
The Austin Independent School Districts graduation rate reached an all-time high of 82.5 percent in 2012. {'neg': 0.0, 'neu': 0.915, 'pos': 0.085, 'compound': 0.1027}
None
Says Scott Brown voted with President Barack Obama 70 percent of the time in 2011. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
ISIS lures women with kittens, nutella.- {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
In the past six years alone, changes to the pension formula have saved over half a billion dollars. {'neg': 0.101, 'neu': 0.758, 'pos': 0.141, 'compound': 0.2023}
None
Texas added 6,600 miles of highway from 2001-2012, more than any other state. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Premeditation, in murder cases like the Oscar Pistorius case, can be formed in the twinkling of an eye. {'neg': 0.203, 

None
Says we are overcharging students . . . to help pay for the health care law. {'neg': 0.081, 'neu': 0.578, 'pos': 0.341, 'compound': 0.6705}
None
Says Ellen Rosenblum has said over and over again that this is a job where 80 percent of the job is being the governments lawyer. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
New Jersey has lost over half of our pharmaceutical jobs to states you know, not low-tax states like in the South, but high-tax states like New York. {'neg': 0.197, 'neu': 0.803, 'pos': 0.0, 'compound': -0.5954}
None
If the space-shuttle program is terminated, Russia and China will be the only nations ...with the capability to launch humans into space. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The Super Bowl has one of the highest levels of human sex trafficking activity of any event in the country. {'neg': 0.0, 'neu': 0.83, 'pos': 0.17, 'compound': 0.5994}
None
Says the reforms in state Issue 2 will save taxpayer dollars {'neg': 0.0, '

None
We have no idea what is contained in [electronic cigarette] vapor. {'neg': 0.18, 'neu': 0.82, 'pos': 0.0, 'compound': -0.296}
None
Says new estimates from the Congressional Budget Office conclude the final price-tag for the health care law will exceed $2 trillion more than double what was initially reported. {'neg': 0.0, 'neu': 0.894, 'pos': 0.106, 'compound': 0.4939}
None
SaysRuben Kihuen only managed in the minority to get one bill passed out of the eight to 10 he introduced during the 2015 legislative session. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
This is the true picture of Milwaukees Renaissance after 12 years of Mayor Tom Barretts leadership. {'neg': 0.0, 'neu': 0.843, 'pos': 0.157, 'compound': 0.4215}
None
Says President Barack Obama revealed in his State of the Union address that he now is against earmarks. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Guns have murdered more Americans here at home in recent years than have died on the bat

None
Says hes proposed the largest employer contribution to the Virginia Retirement System in history. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
While the sequester is in effect, the federal government is still funding a study on duck penises. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
A number of the national publications have put this race in a tossup race between Democrats and Republicans. They don't know who's going to win this race. {'neg': 0.0, 'neu': 0.819, 'pos': 0.181, 'compound': 0.6249}
None
Says New Jerseys job growth in May represents 25 percent of all the jobs created in the country. {'neg': 0.0, 'neu': 0.777, 'pos': 0.223, 'compound': 0.5574}
None
President Obama's bill won't bring down the costs (of health care) for average Americans -- or really for very few Americans, if any. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The United States has more of our people in prison than Russia, China, and North Korea combined. {'neg'

Recently Rick Scott closed 30 womens health care centers across the state. {'neg': 0.0, 'neu': 0.775, 'pos': 0.225, 'compound': 0.4939}
None
The income tax that started at 2 percent under Governor Byrne is now 9 percent. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says federal health care overhaul will cost Texas state government upwards of $30 billion over the next 10 years. {'neg': 0.0, 'neu': 0.849, 'pos': 0.151, 'compound': 0.4939}
None
Says President Barack Obama has put (up) a stop sign against oil drilling, against any kind of exploration offshore or in Alaska. {'neg': 0.096, 'neu': 0.833, 'pos': 0.07, 'compound': -0.1513}
None
Ive created over 40,000 jobs.----------- {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.25}
None
Passing a federal firearms background check through the NICS database . . . typically takes 90 seconds. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The debt-ceiling bill doesnt cut the debt. It will add about $7 trillion i

Says the state budget includes spending on commercials for Fortune 500 companies. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says high-speed rail would have created 60,000 jobs. {'neg': 0.0, 'neu': 0.778, 'pos': 0.222, 'compound': 0.25}
None
Says IBM leader told Obama that using IBM technology to cut fraud could pay for health care reform. {'neg': 0.298, 'neu': 0.571, 'pos': 0.131, 'compound': -0.4767}
None
Since President Obama took office, 2 million jobs. Gone. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Supported in-state tuition in Arkansas for illegal immigrants "if you'd sat in our schools from the time you're 5 or 6 years old and you had become an A-plus student," among other things. {'neg': 0.103, 'neu': 0.831, 'pos': 0.066, 'compound': -0.3182}
None
Health insurance premiums "have almost doubled ... since 2000." {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Our economy could fall $1 trillion short of its full capacity, which translat

None
The Chinese government provides their people no access to the Internet. {'neg': 0.18, 'neu': 0.82, 'pos': 0.0, 'compound': -0.296}
None
Sen. Clinton said "the surge of troops in Iraq was 'working.' Now.... Sen. Clinton says the surge 'has failed' and that we should 'begin the immediate withdrawal of U.S. troops.'" {'neg': 0.102, 'neu': 0.864, 'pos': 0.034, 'compound': -0.4939}
None
Says that except for foreign policy, Ron Pauls voting record and his voting record are virtually identical. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The African-American community lost half of their wealth as a result of the Wall Street collapse. {'neg': 0.266, 'neu': 0.58, 'pos': 0.155, 'compound': -0.3182}
None
What concerns me is there is only two sentences that have been written about minority business. {'neg': 0.0, 'neu': 0.926, 'pos': 0.074, 'compound': 0.0516}
None
Says the Congressional Budget Office estimates a cap-and-trade program would cost the average family the equivalent

None
We cut taxes for 95 percent of working families. {'neg': 0.208, 'neu': 0.792, 'pos': 0.0, 'compound': -0.2732}
None
The United States has a record number of abortions year after year after year. {'neg': 0.0, 'neu': 0.728, 'pos': 0.272, 'compound': 0.4767}
None
Drone technology now allows an individual to be recorded in their homes by drones as small as birds and immediately uploaded to the internet. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says the first word spoken from the moon was Houston. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
McCain is "selectively editing Joe Biden's words...Biden actually said about Barack Obama: 'They're gonna find out this guy's got steel in his spine.'" {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Donald Trumps strategy is pretty simple. They have even said in his campaign its to get women to stay home, get young people to stay home, get people of color to stay home, and get a lot smart, intelligent men

Says Obama administration delay of health care laws employer mandate affects about 1 percent of the American workforce. {'neg': 0.112, 'neu': 0.732, 'pos': 0.156, 'compound': 0.2263}
None
Says all my grandparents immigrated to America. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Said his mother had to fight with health insurance companies for treatment because of a pre-existing condition. {'neg': 0.148, 'neu': 0.852, 'pos': 0.0, 'compound': -0.3818}
None
Says Jeb Bush oversaw (an) average in-state tuition increase of 48.2 percent during his tenure. {'neg': 0.0, 'neu': 0.859, 'pos': 0.141, 'compound': 0.3182}
None
The entire state of Florida led the nation last year with the most prison inmates committing tax fraud. {'neg': 0.309, 'neu': 0.626, 'pos': 0.065, 'compound': -0.7801}
None
Mike Pence voted against expanding the Childrens Health Insurance program, which Hillary helped to start. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Most state employees could

Tim Kaine wants to thwart right-to-work reform measures in Midwest battleground states. {'neg': 0.197, 'neu': 0.803, 'pos': 0.0, 'compound': -0.4019}
None
New Virginia regulations on abortion clinics provide the same sanitary environment we expect of dental offices. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
On average, women make 77 cents for every dollar men make. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says Charlie Crist signed the nations harshest marijuana laws. {'neg': 0.328, 'neu': 0.672, 'pos': 0.0, 'compound': -0.5994}
None
We are the most generous in New England and New England is known for its generosity toward its welfare recipients. {'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'compound': 0.7841}
None
I used my line-item veto authority to veto $360 million dollars in special interest spending, so that our budget this year ... is still smaller than the fiscal year 2008 and 2009 budgets signed by my predecessor. {'neg': 0.0, 'neu': 0.825, 'pos'

None
Today in America, we have more people in jail than any other country on Earth. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The Muslim Brotherhood has openly stated they want to declare war on Israel. {'neg': 0.241, 'neu': 0.679, 'pos': 0.08, 'compound': -0.5574}
None
On financial reform, there is "a million dollars being spent, per congressman, in lobbying expenses on this issue. (The) industry has four lobbyists per member of the House and Senate working on this." {'neg': 0.039, 'neu': 0.961, 'pos': 0.0, 'compound': -0.0772}
None
Elena Kagan "violated the law of the United States at various points" with her opposition to military recruiters. {'neg': 0.153, 'neu': 0.721, 'pos': 0.126, 'compound': -0.1531}
None
Sixty million Americans depend on Social Security, and one-third of all the seniors in America depend on Social Security for 90 percent of their income. {'neg': 0.0, 'neu': 0.827, 'pos': 0.173, 'compound': 0.5859}
None
Already we've identified $2 trillion in d

None
Farouk is on fire.---------------------- {'neg': 0.444, 'neu': 0.556, 'pos': 0.0, 'compound': -0.34}
None
Says Patrick Murphy was named one of Americas least effective congressmen. {'neg': 0.203, 'neu': 0.797, 'pos': 0.0, 'compound': -0.3724}
None
Obamacare adds trillions to our deficits and to our national debt. {'neg': 0.2, 'neu': 0.8, 'pos': 0.0, 'compound': -0.3612}
None
Frankly, (Hillary Clinton) doesnt do very well with women. {'neg': 0.202, 'neu': 0.798, 'pos': 0.0, 'compound': -0.2572}
None
Says Democratic U.S. Senate candidate Russ Feingold voted over 250 times to raise taxes. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
The gun industry isthe only business in America that is wholly protected from any kind of liability. {'neg': 0.21, 'neu': 0.654, 'pos': 0.136, 'compound': -0.1513}
None
Your tellers were paid kickbacks for directing elderly consumers from ... safe deposits to risky ones. {'neg': 0.096, 'neu': 0.749, 'pos': 0.155, 'compound': 0.2732}
None
Say

Says union bosses bused protesters to aCentral Florida education protest. {'neg': 0.328, 'neu': 0.672, 'pos': 0.0, 'compound': -0.4404}
None
Says that at a campaign rally President Barack Obama spent so much time screaming at a protester, and frankly it was a disgrace. {'neg': 0.252, 'neu': 0.748, 'pos': 0.0, 'compound': -0.7239}
None
On transparency in dealing with the Republican Party of Florida's financial issues. {'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'compound': 0.4019}
None
Says a drug test can be performed for just $4 or $5. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Oregon House Republicans jobs plan could generate more than 50,000 jobs over five years. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Says that before Rick Perry became governor, only 4 percent of our total labor force was a minimum wage job ... Today, that number has more than doubled (to) 9.5 percent. {'neg': 0.0, 'neu': 0.956, 'pos': 0.044, 'compound': 0.0772}
None
When Hillary Cli

Says Rick Scott pled the Fifth 75 times to avoid jail for Medicare fraud. {'neg': 0.333, 'neu': 0.667, 'pos': 0.0, 'compound': -0.7184}
None
Over the time that President Obama has been in office, we have lost 2.5 million free enterprise system jobs, and, yet, 500,000 federal government jobs have been added. {'neg': 0.073, 'neu': 0.823, 'pos': 0.104, 'compound': 0.25}
None
Brock Turners early release will be a regular occurrence if Prop. 57 passes. {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
None
Taxes are lower on families than theyve been probably in the last 50 years. {'neg': 0.145, 'neu': 0.855, 'pos': 0.0, 'compound': -0.296}
None
As a state representative, David Cicilline argued against Megans Law and voted against mandatory registration of sex offenders. {'neg': 0.246, 'neu': 0.69, 'pos': 0.064, 'compound': -0.5719}
None
This guy didn't even support Ronald Reagan. {'neg': 0.273, 'neu': 0.727, 'pos': 0.0, 'compound': -0.3089}
None
Says Rick Scott is letting Duke (Energy)
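
The dictionaries printed above have the shape of VADER-style polarity scores (`neg`, `neu`, `pos`, `compound`). Below is a minimal sketch of turning the `compound` value into a coarse label. The function name `label_from_compound` and the ±0.05 cutoff are assumptions for illustration (a commonly used convention), not something taken from this notebook.

```python
def label_from_compound(scores, cutoff=0.05):
    """Map a polarity-score dict to 'positive', 'negative' or 'neutral'.

    `cutoff` is an assumed convention (0.05), not a value from the notebook.
    """
    compound = scores["compound"]
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


# Score dicts copied from the output above
examples = [
    {'neg': 0.112, 'neu': 0.732, 'pos': 0.156, 'compound': 0.2263},
    {'neg': 0.148, 'neu': 0.852, 'pos': 0.0, 'compound': -0.3818},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
labels = [label_from_compound(s) for s in examples]
print(labels)
```

Many statements score exactly 0.0 because none of their words carry sentiment in the lexicon, so a neutral band around zero keeps them out of the positive and negative classes.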

### LDA Model on bag of words

In [43]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
import numpy as np
np.random.seed(2018)

In [44]:
# Train LDA on the bag-of-words corpus: 10 topics, 2 passes over the data
lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=10, id2word=dictionary, passes=2)

In [45]:
print(lda_model)

LdaModel(num_terms=968, num_topics=10, decay=0.5, chunksize=2000)


In [47]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.126*"percent" + 0.041*"tax" + 0.027*"rate" + 0.024*"year" + 0.017*"incom" + 0.017*"pay" + 0.015*"american" + 0.015*"sinc" + 0.014*"top" + 0.013*"say"
Topic: 1 
Words: 0.049*"job" + 0.025*"state" + 0.022*"peopl" + 0.017*"creat" + 0.016*"get" + 0.016*"offic" + 0.015*"new" + 0.013*"health" + 0.013*"insur" + 0.013*"busi"
Topic: 2 
Words: 0.053*"vote" + 0.031*"say" + 0.026*"bill" + 0.022*"democrat" + 0.021*"support" + 0.018*"u" + 0.018*"john" + 0.016*"allow" + 0.016*"even" + 0.016*"mccain"
Topic: 3 
Words: 0.033*"say" + 0.025*"illeg" + 0.025*"immigr" + 0.024*"texa" + 0.023*"trump" + 0.022*"state" + 0.021*"one" + 0.020*"wage" + 0.020*"donald" + 0.016*"law"
Topic: 4 
Words: 0.053*"year" + 0.025*"state" + 0.023*"million" + 0.018*"peopl" + 0.018*"two" + 0.018*"got" + 0.016*"last" + 0.015*"work" + 0.013*"wisconsin" + 0.013*"everi"
Topic: 5 
Words: 0.051*"presid" + 0.039*"obama" + 0.024*"ever" + 0.021*"say" + 0.021*"state" + 0.019*"time" + 0.018*"bush" + 0.017*"debt" + 0.015*"f

Topic labels (7 of the 10 topics):
- tax rate
- job
- voting for bills
- immigration
- President Obama
- school fund
- tax on health care
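
Labels like these come from reading the highest-weighted stems in each topic string. Here is a small sketch of pulling `(word, weight)` pairs out of a printed topic string so the top terms are easier to scan. The helper `parse_topic` and the regular expression are illustrative assumptions; the string is Topic 0 as printed above.

```python
import re

# Matches one `weight*"term"` entry of a printed topic string
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Return [(word, weight), ...] parsed from a 'w*"term" + ...' string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# Topic 0 from the bag-of-words model output above
topic_0 = ('0.126*"percent" + 0.041*"tax" + 0.027*"rate" + 0.024*"year" + 0.017*"incom" '
           '+ 0.017*"pay" + 0.015*"american" + 0.015*"sinc" + 0.014*"top" + 0.013*"say"')
terms = parse_topic(topic_0)
top_three = [word for word, _ in terms[:3]]
print(top_three)
```

The first three stems here ("percent", "tax", "rate") are what motivate a label such as "tax rate".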

#### Score calculation

In [48]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.47892361879348755	 
Topic: 0.053*"vote" + 0.031*"say" + 0.026*"bill" + 0.022*"democrat" + 0.021*"support" + 0.018*"u" + 0.018*"john" + 0.016*"allow" + 0.016*"even" + 0.016*"mccain"

Score: 0.19224534928798676	 
Topic: 0.056*"say" + 0.022*"republican" + 0.020*"budget" + 0.020*"secur" + 0.020*"state" + 0.018*"billion" + 0.017*"cut" + 0.016*"rick" + 0.016*"clinton" + 0.016*"democrat"

Score: 0.17535533010959625	 
Topic: 0.051*"presid" + 0.039*"obama" + 0.024*"ever" + 0.021*"say" + 0.021*"state" + 0.019*"time" + 0.018*"bush" + 0.017*"debt" + 0.015*"first" + 0.015*"unit"

Score: 0.1073104590177536	 
Topic: 0.045*"say" + 0.034*"school" + 0.034*"fund" + 0.028*"public" + 0.027*"romney" + 0.025*"state" + 0.023*"student" + 0.020*"educ" + 0.020*"mitt" + 0.015*"would"
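
The four scores above do not add up to 1. A quick check of the arithmetic, assuming gensim's default `minimum_probability` of 0.01 is in effect so that topics below that cutoff are simply left out of the result. The labels in `doc_topics` are just the leading stems of each printed topic, used here in place of topic ids.

```python
# (leading stems of the topic, score) pairs copied from the output above
doc_topics = [
    ("vote", 0.47892361879348755),
    ("say/republican", 0.19224534928798676),
    ("presid", 0.17535533010959625),
    ("say/school", 0.1073104590177536),
]

# Dominant topic = the pair with the highest score
dominant_label, dominant_score = max(doc_topics, key=lambda pair: pair[1])

# Probability mass covered by the reported topics, and what is left over
covered = sum(score for _, score in doc_topics)
remainder = 1 - covered

print(dominant_label, round(dominant_score, 4))
print(round(covered, 4), round(remainder, 4))
```

The reported topics cover about 0.954 of the probability mass; under the stated assumption, the remaining 0.046 or so is spread over the six topics that fell below the cutoff.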


### LDA Model on TF-IDF

In [51]:
# Train a second LDA model on the TF-IDF-weighted corpus with the same settings
lda_model_tfidf = gensim.models.ldamodel.LdaModel(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2)
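
For intuition about what the TF-IDF corpus feeds into LDA, here is a rough pure-Python sketch of the weighting. It assumes `corpus_tfidf` was built with gensim's `TfidfModel` at its default settings (raw term count times `log2(N / df)`, then L2 normalization); that is an assumption about this notebook, and `tfidf_sketch` and `toy_bow` are hypothetical names used only for illustration.

```python
import math
from collections import Counter


def tfidf_sketch(bow_docs):
    """Re-weight [(term_id, count), ...] docs with tf * log2(N / df), then L2-normalize."""
    n_docs = len(bow_docs)
    doc_freq = Counter()
    for doc in bow_docs:
        doc_freq.update({term_id for term_id, _ in doc})

    weighted_docs = []
    for doc in bow_docs:
        weights = [(term_id, count * math.log2(n_docs / doc_freq[term_id]))
                   for term_id, count in doc]
        # Terms present in every document get weight 0 and are dropped
        weights = [(term_id, w) for term_id, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_docs.append([(term_id, w / norm) for term_id, w in weights] if norm else [])
    return weighted_docs


# Toy corpus (hypothetical): term 0 occurs in every document
toy_bow = [[(0, 1), (1, 2), (2, 1)], [(0, 1), (2, 1)], [(0, 3), (3, 1)]]
toy_tfidf = tfidf_sketch(toy_bow)
print(toy_tfidf[0])
```

In the toy corpus the ubiquitous term 0 disappears and the rare term 1 dominates the first document, which is why a TF-IDF-based model tends to surface topics built on less frequent stems than the bag-of-words model.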

In [52]:
print(lda_model_tfidf)

LdaModel(num_terms=968, num_topics=10, decay=0.5, chunksize=2000)


In [53]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic labels (9 of the 10 topics):
- tax
- sex
- wage
- college
- train
- job
- health care
- Donald Trump
- oil tax

#### Score calculation

In [54]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))




Score: 0.5462357401847839	 
Topic: 0.024*"health" + 0.023*"care" + 0.018*"insur" + 0.013*"compani" + 0.012*"walker" + 0.012*"job" + 0.012*"scott" + 0.012*"oil" + 0.012*"ga" + 0.011*"employe"

Score: 0.296714723110199	 
Topic: 0.014*"largest" + 0.013*"food" + 0.013*"john" + 0.012*"democrat" + 0.011*"board" + 0.011*"presid" + 0.011*"say" + 0.010*"member" + 0.010*"despit" + 0.010*"vote"

Score: 0.1031959056854248	 
Topic: 0.020*"job" + 0.016*"state" + 0.014*"percent" + 0.013*"wisconsin" + 0.012*"new" + 0.012*"measur" + 0.012*"california" + 0.012*"year" + 0.011*"florida" + 0.011*"lose"
