## Alternus Vera
Dataset: Politifact/Liar-Liar dataset (https://www.politifact.com)

Description of the train TSV file format:

- Column 1: the ID of the statement ([ID].json)
- Column 2: the label.
- Column 3: the statement.
- Column 4: the subject(s).
- Column 5: the speaker.
- Column 6: the speaker's job title.
- Column 7: the state info.
- Column 8: the party affiliation.
- Columns 9-13: the speaker's total credit history counts, including the current statement.
  - 9: barely true counts.
  - 10: false counts.
  - 11: half true counts.
  - 12: mostly true counts.
  - 13: pants on fire counts.
- Column 14: the context (venue / location of the speech or statement).
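
The column layout above can be sketched as a list of named fields. This is a minimal illustration using only the standard library; the field names in `LIAR_FIELDS`, the helper `read_liar_rows`, and the sample row are hypothetical and not taken from the dataset.

```python
import csv
import io

# Hypothetical field names for the 14 columns described above.
LIAR_FIELDS = [
    "id", "label", "statement", "subjects", "speaker", "job_title",
    "state", "party", "barely_true", "false", "half_true",
    "mostly_true", "pants_on_fire", "context",
]

def read_liar_rows(handle):
    """Yield one dict per tab-separated row, keyed by LIAR_FIELDS."""
    # QUOTE_NONE keeps literal double quotes inside statements untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(LIAR_FIELDS, row))

# Made-up row for illustration only (not a real record).
sample = ("0.json\ttrue\tAn example statement.\ttaxes\tjane-doe\tMayor\t"
          "Ohio\tnone\t0\t1\t0\t2\t0\ta speech\n")
rows = list(read_liar_rows(io.StringIO(sample)))
print(rows[0]["label"], "|", rows[0]["context"])
```

With pandas, the same idea would be to pass a list of column names to `pd.read_csv` when the file has no header row.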


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
df = pd.read_csv('/Users/vidhsharma/Documents/SJSU/Fall18/ML/Lab/fakeNews/liar_dataset/train.tsv', sep='\t')

In [4]:
df.head()

| | file | value | statement | topic | name | occupation | state | party | barely true counts | false counts | half true counts | mostly true counts | pants on fire counts | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2635.json | FALSE | Says the Annies List political group supports ... | abortion | dwayne-bohac | State representative | Texas | republican | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | a mailer |
| 1 | 10540.json | half-true | When did the decline of coal start? It started... | energy,history,job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | a floor speech. |
| 2 | 324.json | mostly-true | Hillary Clinton agrees with John McCain "by vo... | foreign-policy | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | Denver |
| 3 | 1123.json | FALSE | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 4 | 9028.json | half-true | The economic turnaround started at the end of ... | economy,jobs | charlie-crist | | Florida | democrat | 15.0 | 9.0 | 20.0 | 19.0 | 2.0 | an interview on CNN |


In [5]:
df.shape

(10240, 14)

In [6]:
df.columns

Index([u'file', u'value', u'statement', u'topic', u'name', u'occupation',
       u'state', u'party', u'barely true counts', u'false counts',
       u'half true counts', u'mostly true counts', u'pants on fire counts',
       u'context '],
      dtype='object')

##### Data exploration

In [7]:
df['state'].value_counts()

Texas                                                1009
Florida                                               997
Wisconsin                                             713
New York                                              657
Illinois                                              556
Ohio                                                  447
Georgia                                               426
Virginia                                              407
Rhode Island                                          369
New Jersey                                            241
Oregon                                                239
Massachusetts                                         206
Arizona                                               182
California                                            159
Washington, D.C.                                      120
Vermont                                                98
Pennsylvania                                           90
New Hampshire 

In [8]:
df['occupation'].value_counts()

President                                                         492
U.S. Senator                                                      479
Governor                                                          391
President-Elect                                                   273
U.S. senator                                                      263
Presidential candidate                                            254
Former governor                                                   176
U.S. Representative                                               172
Milwaukee County Executive                                        149
Senator                                                           147
State Senator                                                     108
U.S. representative                                               103
U.S. House of Representatives                                     102
Attorney                                                           81
Congressman         

In [9]:
df['party'].unique()

array(['republican', 'democrat', 'none', 'organization', 'independent',
       'columnist', 'activist', 'talk-show-host', 'libertarian',
       'newsmaker', 'journalist', 'labor-leader', 'state-official',
       'business-leader', 'education-official', 'tea-party-member', nan,
       'green', 'liberal-party-canada', 'government-body', 'Moderate',
       'democratic-farmer-labor', 'ocean-state-tea-party-action',
       'constitution-party'], dtype=object)

In [10]:
df.loc[df['topic']=='health-care']

| | file | value | statement | topic | name | occupation | state | party | barely true counts | false counts | half true counts | mostly true counts | pants on fire counts | context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1123.json | FALSE | Health care reform legislation is likely to ma... | health-care | blog-posting | | | none | 7.0 | 19.0 | 3.0 | 5.0 | 44.0 | a news release |
| 14 | 8705.json | barely-true | Most of the (Affordable Care Act) has already ... | health-care | george-will | Columnist | Maryland | columnist | 7.0 | 6.0 | 3.0 | 5.0 | 1.0 | comments on "Fox News Sunday" |
| 25 | 10215.json | FALSE | I dont know who (Jonathan Gruber) is. | health-care | nancy-pelosi | House Minority Leader | California | democrat | 3.0 | 7.0 | 11.0 | 2.0 | 3.0 | a news conference |
| 85 | 2044.json | half-true | In Rick Perrys Texas, we import nurses ... fro... | health-care | bill-white | Former mayor of Houston | Texas | democrat | 2.0 | 3.0 | 5.0 | 7.0 | 3.0 | a speech at the Texas Democratic Party convention |
| 92 | 2020.json | mostly-true | The insurance commissioner cant do squat about... | health-care | ralph-hudgens | | Georgia | republican | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | a radio interview broadcast over the Internet |
| 165 | 4504.json | barely-true | Says Rick Perry wrote a letter supporting Hill... | health-care | ron-paul | U.S. representative | Texas | republican | 5.0 | 8.0 | 8.0 | 8.0 | 3.0 | a presidential debate |
| 248 | 11772.json | FALSE | We now have driven (health care) costs down to... | health-care | hillary-clinton | Presidential candidate | New York | democrat | 40.0 | 29.0 | 69.0 | 76.0 | 7.0 | comments during the South Carolina Democratic ... |
| 314 | 8308.json | barely-true | The American people support defunding Obamacar... | health-care | marco-rubio | U.S. Senator | Florida | republican | 33.0 | 24.0 | 32.0 | 35.0 | 5.0 | a press release |
| 350 | 9031.json | FALSE | Most young Americans right now, theyre not cov... | health-care | barack-obama | President | Illinois | democrat | 70.0 | 71.0 | 160.0 | 163.0 | 9.0 | a Funnyordie.com "Between Two Ferns" interview... |
| 389 | 5631.json | FALSE | Says Congressman Bill Pascrell voted to remove... | health-care | steve-rothman | | | democrat | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | an interview on NJToday |


In [11]:
df.isnull().sum()

file                       0
value                      0
statement                  0
topic                      2
name                       2
occupation              2897
state                   2208
party                      2
barely true counts         2
false counts               2
half true counts           2
mostly true counts         2
pants on fire counts       2
context                  102
dtype: int64

### Data Cleaning
- Tokenizing: splitting a document into its atomic elements (tokens).
- Stopping: removing common words that carry little meaning (stop words).
- Stemming: reducing words to a common stem so that variants of the same word are merged.
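
The three steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list (NLTK's English list is far longer), and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, not the Porter algorithm. The example sentence is also made up.

```python
import re

# Tiny illustrative stop-word list; an assumption, not NLTK's list.
STOP_WORDS = {"the", "of", "at", "to", "a", "is", "in"}

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())          # tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stopping
    return [crude_stem(t) for t in tokens]               # stemming

cleaned = clean("The senators voted against raising taxes in the budget")
print(cleaned)
```

The crude stemmer produces stems such as `vot` and `taxe`, which shows why a proper algorithm such as Porter's is preferable in practice.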

In [12]:
df1 = df.drop(columns=['file', 'value', 'barely true counts', 'false counts',
                       'half true counts', 'mostly true counts', 'pants on fire counts'])

In [13]:
print(df1)


                                               statement  \
0      Says the Annies List political group supports ...   
1      When did the decline of coal start? It started...   
2      Hillary Clinton agrees with John McCain "by vo...   
3      Health care reform legislation is likely to ma...   
4      The economic turnaround started at the end of ...   
5      The Chicago Bears have had more starting quart...   
6      Jim Dunnam has not lived in the district he re...   
7      I'm the only person on this stage who has work...   
8      However, it took $19.5 million in Oregon Lotte...   
9      Says GOP primary opponents Glenn Grothman and ...   
10     For the first time in history, the share of th...   
11     Since 2000, nearly 12 million Americans have s...   
12     When Mitt Romney was governor of Massachusetts...   
13     The economy bled $24 billion due to the govern...   
14     Most of the (Affordable Care Act) has already ...   
15     In this last election in November

In [14]:
raw = df1[['statement']]
final = raw.values.T.tolist()  # one inner list holding every statement
print(len(final[0]))

10240


In [15]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vidhsharma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
## Porter stemmer: keep letters only, lowercase, drop stop words, stem each word
import re

STOP_WORDS = set(stopwords.words('english'))  # built once, not once per word
STEMMER = PorterStemmer()

def textProcessing(doc):
    words = re.sub('[^a-zA-Z]', ' ', doc).lower().split()
    return ' '.join(STEMMER.stem(word) for word in words if word not in STOP_WORDS)

In [17]:

print('original document: ')
words = list(final)  # copy of the nested list of raw statements
print(words)


original document: 


In [19]:
result = [textProcessing(statement) for statement in final[0]]
print(result)

[u'say anni list polit group support third trimest abort demand', u'declin coal start start natur ga took start begin presid georg w bush administr', u'hillari clinton agre john mccain vote give georg bush benefit doubt iran', u'health care reform legisl like mandat free sex chang surgeri', u'econom turnaround start end term', u'chicago bear start quarterback last year total number tenur uw faculti fire last two decad', u'jim dunnam live district repres year', u'person stage work activ last year pass along russ feingold toughest ethic reform sinc waterg', u'howev took million oregon lotteri fund port newport eventu land new noaa marin oper center pacif', u'say gop primari oppon glenn grothman joe leibham cast compromis vote cost million higher electr cost', u'first time histori share nation popular vote margin smaller latino vote margin', u'sinc nearli million american slip middl class poverti', u'mitt romney governor massachusett didnt slow rate growth govern actual cut', u'economi bl

In [21]:
print(len(result))

10240


In [22]:
processed_docs = df1['statement'].map(textProcessing)

In [23]:
processed_docs[:10]

0    say anni list polit group support third trimes...
1    declin coal start start natur ga took start be...
2    hillari clinton agre john mccain vote give geo...
3    health care reform legisl like mandat free sex...
4                     econom turnaround start end term
5    chicago bear start quarterback last year total...
6                 jim dunnam live district repres year
7    person stage work activ last year pass along r...
8    howev took million oregon lotteri fund port ne...
9    say gop primari oppon glenn grothman joe leibh...
Name: statement, dtype: object

### TF-IDF

In [29]:
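# Note: the cell counter (In [29]) suggests this ran after `final` was rebuilt as
# token lists, so final[0] holds only the tokens of the first statement and the
# vocabulary below covers just that statement. Pass `result` to vectorize all statements.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(final[0])
vectorizer.vocabulary_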


{u'abort': 0,
 u'anni': 1,
 u'demand': 2,
 u'group': 3,
 u'list': 4,
 u'polit': 5,
 u'say': 6,
 u'support': 7,
 u'third': 8,
 u'trimest': 9}
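
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF over tokenized documents. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which matches scikit-learn's documented defaults as far as I know. The function name `tfidf` and the three toy documents are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within this document
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy documents made up for illustration.
docs = [["tax", "cut", "tax"], ["health", "care", "tax"], ["health", "vote"]]
vectors = tfidf(docs)
for vec in vectors:
    print(sorted(vec.items()))
```

In the second document, `care` appears in only one document, so it receives a higher weight than `tax` or `health`, which each appear in two.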

### Tokenization

In [28]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [26]:
final = [tokenizer.tokenize(doc) for doc in result]
print(final)


[[u'say', u'anni', u'list', u'polit', u'group', u'support', u'third', u'trimest', u'abort', u'demand'], [u'declin', u'coal', u'start', u'start', u'natur', u'ga', u'took', u'start', u'begin', u'presid', u'georg', u'w', u'bush', u'administr'], [u'hillari', u'clinton', u'agre', u'john', u'mccain', u'vote', u'give', u'georg', u'bush', u'benefit', u'doubt', u'iran'], [u'health', u'care', u'reform', u'legisl', u'like', u'mandat', u'free', u'sex', u'chang', u'surgeri'], [u'econom', u'turnaround', u'start', u'end', u'term'], [u'chicago', u'bear', u'start', u'quarterback', u'last', u'year', u'total', u'number', u'tenur', u'uw', u'faculti', u'fire', u'last', u'two', u'decad'], [u'jim', u'dunnam', u'live', u'district', u'repres', u'year'], [u'person', u'stage', u'work', u'activ', u'last', u'year', u'pass', u'along', u'russ', u'feingold', u'toughest', u'ethic', u'reform', u'sinc', u'waterg'], [u'howev', u'took', u'million', u'oregon', u'lotteri', u'fund', u'port', u'newport', u'eventu', u'land', u

### Collecting all sentence tokens in one list

In [30]:
import itertools
text = list(itertools.chain.from_iterable(final))
print(type(text))

<type 'list'>


In [31]:
from gensim import corpora, models

In [32]:
dictionary = corpora.Dictionary(final)
print(dictionary)

Dictionary(7696 unique tokens: [u'foul', u'interchang', u'four', u'jihad', u'suzann']...)


In [33]:
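# Flatten the token lists in plain Python; recent NumPy versions reject
# ragged nested lists in np.asarray, and `final` stays a list for later cells.
raw1 = [token for doc in final for token in doc]
type(raw1)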

list

### Bag of words

In [36]:
bow_corpus = [dictionary.doc2bow(doc) for doc in final]  # `doc` avoids overwriting `result`
print(bow_corpus[50])

[(6, 1), (38, 1), (124, 1), (125, 1), (127, 1), (386, 1), (387, 1), (388, 1), (389, 1), (390, 1), (391, 1), (392, 1), (393, 1)]
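
Each pair in the output above is `(token id, count)`. The idea can be sketched without gensim as below. This is a simplification: the hypothetical helpers `build_vocab` and `to_bow` assign ids in first-seen order, which is not guaranteed to match the ids gensim's `Dictionary` assigns, and the second toy document is made up.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [
    ["econom", "turnaround", "start", "end", "term"],
    ["declin", "coal", "start", "start"],  # made-up toy document
]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(bow[1])
```

Word order is discarded: only which tokens occur and how often is kept, which is the representation the LDA model consumes.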


#### LDA Model

In [40]:
import gensim
import numpy as np
np.random.seed(2018)  # fix NumPy's global seed before training


In [41]:
lda_model = gensim.models.ldamodel.LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=2)

In [42]:
print(lda_model)

LdaModel(num_terms=7696, num_topics=3, decay=0.5, chunksize=2000)


In [43]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.021*"say" + 0.014*"state" + 0.011*"cut" + 0.010*"tax" + 0.010*"wisconsin" + 0.009*"billion" + 0.009*"year" + 0.008*"percent" + 0.007*"budget" + 0.006*"million"
Topic: 1 
Words: 0.021*"percent" + 0.020*"say" + 0.016*"tax" + 0.015*"state" + 0.013*"year" + 0.007*"obama" + 0.007*"nation" + 0.006*"rate" + 0.006*"presid" + 0.006*"one"
Topic: 2 
Words: 0.018*"say" + 0.016*"health" + 0.015*"care" + 0.012*"vote" + 0.011*"obama" + 0.011*"job" + 0.009*"presid" + 0.009*"democrat" + 0.007*"republican" + 0.007*"romney"
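
Each topic is printed as a string of `weight*"word"` terms joined by `+`. For further analysis it can help to turn such a string into `(word, weight)` pairs. The helper `parse_topic` below is a hypothetical sketch using only the standard library; the sample string is the start of Topic 2 from the output above.

```python
import re

def parse_topic(topic_string):
    """Turn '0.018*"say" + 0.016*"health"' into [('say', 0.018), ('health', 0.016)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic = '0.018*"say" + 0.016*"health" + 0.015*"care"'
terms = parse_topic(topic)
print(terms)
```

Note that `say`, `state`, `tax`, and `percent` rank highly in more than one of the three topics, so the topics overlap considerably; very frequent stems such as `say` are candidates for an extended stop-word list.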
