In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Task 1: Find and Clean Your Data: Source and format the required data for your project.

This task required us to:
- Create a database
- Create a data dictionary 

I have already created the database in the notebook [Obtaining Politifact Dataframe](Obtaining Politifact Dataframe.ipynb), which I've kept separate because it's a little tedious to read through. (Using the API was really time-consuming, so I ran the scraping in small chunks whenever I had free time. Most of that notebook is the exact same code run with different numbers, re-saving the dataframe as a csv each time. It's not very interesting!)

Below, I've loaded the data into this notebook and created a data dictionary.


In [30]:
#here is the data:
df = pd.read_csv('first_17500_csv', index_col=0)

In [31]:
df.head()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date
0,Texas,<p>\r\n\t&quot;The attorney general requires t...,Claim,in a Web site video,Barbara Ann Radnofsky,Democrat,,Crime,Pants on Fire!,2009-10-22
1,National,<p>President Clinton &quot;reduced the scale o...,Claim,"a Republican debate in Orlando, Fla.",Mitt Romney,Republican,Former governor,Military,Half-True,2007-10-21
2,National,"""New Mexico was 46th in teacher pay (when he w...",Claim,a TV ad.,Bill Richardson,Democrat,Governor,Education,Mostly True,2007-05-10
3,National,"""I used tax cuts to help create over 80,000 jo...",Claim,a TV Ad.,Bill Richardson,Democrat,Governor,Taxes,Mostly True,2007-05-10
4,National,"New Mexico moved ""up to"" sixth in the nation i...",Claim,a TV Ad.,Bill Richardson,Democrat,Governor,Job Accomplishments,Mostly True,2007-05-10


In [32]:
df.shape

(14135, 10)

In [33]:
df.describe()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date
count,14135,14135,14135,13999,14135,14135,9954,14135,14135,14135
unique,25,14120,3,5543,3666,25,1447,146,9,3239
top,National,<p>On a cap-and-trade plan.</p>\r\n,Claim,an interview,Barack Obama,Republican,President,Economy,Half-True,2011-10-11
freq,4724,3,12347,345,618,6230,1110,1180,2779,26


#### Data Dictionary:

edition: the edition of Politifact that the reporting comes from (e.g. National for country-wide politics, Florida for Florida-state-level politics, or PunditFact for political discussions in the media).

statement: the actual statement that Politifact evaluated.  The majority of these are direct quotes, but some have direct quotes embedded into a paraphrased statement for ease of understanding.

statement_type: the classification of the statement as a Claim (a general assertion), an Attack (against a particular person), or a Flip. A Flip rates not whether the statement is true, but whether it represents a change from the speaker's previous position.

statement_context: where the statement came from (e.g. a speech, a tweet, or an ad)

speaker: the name of the person who gave the statement

speaker_party: the party of the person who gave the statement, usually either Republican or Democrat.  Sometimes, when appropriate, this is actually the organization that the speaker represents rather than their political affiliation (for example, 'Newsmaker', 'Government body', or 'Education official').

speaker_job: the job of the person who gave the statement (for example, 'President' or 'Governor').

subject: the topic of the statement (for example, 'Military' or 'Education').

ruling: the Politifact judgement about the truthfulness of the statement (except for Flips, as noted above).

date: the date on which the statement was given 
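
A data dictionary like the one above can be cross-checked against the dataframe itself. The snippet below is a minimal sketch: `toy` is a made-up stand-in for the real data, and `summarize_columns` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the Politifact dataframe (values are illustrative only).
toy = pd.DataFrame({
    'edition': ['National', 'Texas', 'National'],
    'speaker_job': ['Governor', None, 'President'],
    'ruling': ['True', 'False', 'Half-True'],
})

def summarize_columns(frame):
    """Return one row per column: dtype, non-null count, and unique count."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'non_null': frame.notnull().sum(),
        'unique': frame.nunique(),   # missing values are not counted as a category
    })

summary = summarize_columns(toy)
print(summary)
```

Running the same helper on the real `df` would show at a glance which fields have missing values and which have a long tail of unique entries.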

### Task 2: Perform preliminary data munging and cleaning of your data: organize your data relevant to your project goals.
   - Review data to verify initial assumptions
   - Clean and munge data as necessary

There aren't that many 'Flip' statements, and they don't really fit my purposes (I'm not interested in looking at whether someone has changed position, just whether they're telling the truth or not), so I'm going to eliminate those from my dataset.

In [34]:
df['statement_type'].value_counts()

Claim     12347
Attack     1570
Flip        218
Name: statement_type, dtype: int64

In [35]:
df = df.loc[df['statement_type']!='Flip'].copy()  # .copy() so later column edits don't act on a slice

In [36]:
df.shape

(13917, 10)

In [37]:
df['statement_type'].unique()

array(['Claim', 'Attack'], dtype=object)

I don't know if I want to look at only the National edition or not, so I'm going to look at the breakdown for what the different editions are and how many observations are associated with each.

In [38]:
df['edition'].value_counts()

National               4658
Florida                1429
Texas                  1353
Wisconsin              1259
PunditFact              931
Georgia                 864
Ohio                    591
Rhode Island            544
Virginia                538
New Jersey              393
Oregon                  389
New Hampshire           155
California              119
Global News Service      91
North Carolina           87
New York                 87
Missouri                 87
Pennsylvania             80
Tennessee                76
Illinois                 62
Nevada                   41
Arizona                  38
Colorado                 30
Iowa                     12
NBC                       3
Name: edition, dtype: int64

In [39]:
df.loc[df['edition']=='PunditFact'].head()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date
7811,PunditFact,<p>&quot;1 percent of candidates that (the Nat...,Claim,"a discussion on CNN's ""Crossfire""",Van Jones,Democrat,"Co-host of CNN's ""Crossfire""",Campaign Finance,False,2013-09-13
8011,PunditFact,<p>&quot;The insurance industry is actually ru...,Claim,"comments on Fox News Channel's ""The Five.""",Dana Perino,Republican,"Co-host on Fox News Channel's ""The Five""",Health Care,False,2013-10-31
8012,PunditFact,<p>&quot;Even after Obamacare is fully impleme...,Claim,"a column in the ""Wall Street Journal""",Suzanne Somers,,Actress,Health Care,True,2013-11-28
8014,PunditFact,"<p>The average fast food worker is 29, and mos...",Claim,"comments on HBO's ""Real Time with Bill Maher""",Bill Maher,Independent,Host of Real Time with Bill Maher,Jobs,Mostly True,2013-10-25
8017,PunditFact,<p>Congress &quot;is bewildered at the scope a...,Claim,"in comments on Comedy Central's ""The Daily Show""",Jon Stewart,,,Foreign Policy,True,2013-10-30


In [40]:
df.loc[df['edition']=='Global News Service'].head()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date
11234,Global News Service,<p>&quot;Bioweapon! Zika virus is being spread...,Claim,comments on YouTube,Viral image,,,Animals,Pants on Fire!,2016-01-27
11258,Global News Service,<p>&quot;People think AIDS is done --&nbsp;it&...,Claim,"an interview with the ""Irish Times""",Bono,,,Public Health,Mostly True,2016-01-24
11259,Global News Service,<p>&quot;Obama spent $7 billion to bring elect...,Claim,an article on the Frontpage website,Daniel Greenfield,Republican,,Energy,Pants on Fire!,2015-07-30
11294,Global News Service,<p>&quot;The most likely triggering cause of (...,Claim,an article posted on the group's website,Centre for Research on Globalization,,,Public Health,False,2016-02-04
11300,Global News Service,<p>&quot;The richest 80 people in the world ow...,Claim,a video,Bernie Sanders,Independent,U.S. Senator,Economy,Mostly True,2016-02-07


In [41]:
df.loc[df['edition']=='NBC']

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date
11452,NBC,<p>&quot;Our campaign depends on small donatio...,Claim,a victory speech after the March 15 primaries,Hillary Clinton,Democrat,Presidential candidate,Campaign Finance,Mostly False,2016-03-15
11824,NBC,<p>&quot;More businesses went out of business ...,Claim,"an interview on NBC's ""Meet the Press Daily""",David Perdue,Republican,U.S. Senate candidate,Economy,False,2016-06-02
11883,NBC,<p>Even among &quot;second and third generatio...,Claim,an interview with Fox News' Sean Hannity,Donald Trump,Republican,President,Diversity,False,2016-06-15


I understand PunditFact and Global News Service as distinct and potentially valuable categories (the former dealing with statements made in the media, and the latter dealing primarily with events or people not within the space of American politics), but I don't think I understand what 'NBC' is all about (one of them seems to be from an NBC interview, but then one of them is from a Fox News interview, so...?).  Given that all three of the observations relate to national-level politics and are delivered by national-level politicians, I'm going to incorporate them into the 'National' edition category.  

Otherwise, I'm going to retain all of the editions, but I'm going to add an additional column indicating whether the edition is National, State, Media, or Global.

In [42]:
df.loc[df['edition']=='NBC', 'edition'] = 'National'

In [43]:
df.loc[df['edition']=='NBC']

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date


In [44]:
df['edition_type']=df['edition'].map(lambda x: 'National' if x=='National' else 'Media' if x=='PunditFact' else 'Global' if x=='Global News Service' else 'State')

In [45]:
df['edition_type'].value_counts()

State       8234
National    4661
Media        931
Global        91
Name: edition_type, dtype: int64
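
The chained conditional used to build `edition_type` can also be written as a lookup table, which may be easier to extend if more non-state editions turn up. This is only a sketch of that alternative: `EDITION_TYPES` and the toy `editions` series are names I've made up for illustration.

```python
import pandas as pd

# Explicit lookup for the non-state editions; every other edition is treated as a state.
EDITION_TYPES = {
    'National': 'National',
    'PunditFact': 'Media',
    'Global News Service': 'Global',
}

editions = pd.Series(['National', 'Texas', 'PunditFact', 'Global News Service', 'Ohio'])

# Editions missing from the dict map to NaN, which fillna then labels as 'State'.
edition_type = editions.map(EDITION_TYPES).fillna('State')
print(edition_type.tolist())
```

As with the lambda version, anything that is not explicitly listed falls through to 'State', so a new non-state edition would need its own dictionary entry.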

In [46]:
df.head()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date,edition_type
0,Texas,<p>\r\n\t&quot;The attorney general requires t...,Claim,in a Web site video,Barbara Ann Radnofsky,Democrat,,Crime,Pants on Fire!,2009-10-22,State
1,National,<p>President Clinton &quot;reduced the scale o...,Claim,"a Republican debate in Orlando, Fla.",Mitt Romney,Republican,Former governor,Military,Half-True,2007-10-21,National
2,National,"""New Mexico was 46th in teacher pay (when he w...",Claim,a TV ad.,Bill Richardson,Democrat,Governor,Education,Mostly True,2007-05-10,National
3,National,"""I used tax cuts to help create over 80,000 jo...",Claim,a TV Ad.,Bill Richardson,Democrat,Governor,Taxes,Mostly True,2007-05-10,National
4,National,"New Mexico moved ""up to"" sixth in the nation i...",Claim,a TV Ad.,Bill Richardson,Democrat,Governor,Job Accomplishments,Mostly True,2007-05-10,National


Next I want to have a look at the rulings category, to determine what sort of rulings I have, and how many of each there are. This will be my target variable, and I think I'll probably want to create a new binary variable that tries to capture either 'True' or 'False'.

In [47]:
df['ruling'].value_counts()

Half-True         2779
Mostly True       2686
False             2619
Mostly False      2322
True              2149
Pants on Fire!    1350
Full Flop           10
Half Flip            2
Name: ruling, dtype: int64

In [48]:
df.loc[df['ruling']=='Full Flop']

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date,edition_type
828,National,<p>U.S. Senate Majority Leader Harry Reid on w...,Claim,public statements.,Harry Reid,Democrat,Senate Democratic Leader,Legal Issues,Full Flop,2009-01-13,National
1116,National,<p>On whether the governor of Massachusetts sh...,Claim,votes in 2004 and 2009,Massachusetts legislature,Democrat,Making laws,Elections,Full Flop,2009-09-22,National
3308,National,<p>\r\n\tOn the president unilaterally authori...,Claim,"an interview with the ""Boston Globe""",Barack Obama,Democrat,President,Foreign Policy,Full Flop,2007-12-07,National
3313,National,<p>\r\n\tOn whether the United States should i...,Claim,televised interviews,Newt Gingrich,Republican,,Foreign Policy,Full Flop,2011-03-23,National
3629,National,<p>\r\n\tOn whether he would vote for the budg...,Claim,a public appearance and an op-ed,Scott Brown,Republican,,Deficit,Full Flop,2011-05-23,National
10262,Texas,<p>On reauthorization of Export-Import Bank.</...,Claim,an oped column,Rick Perry,Republican,U.S. Energy Secretary,Trade,Full Flop,2015-05-05,State
10867,Wisconsin,<p>On support for the Export-Import Bank</p>\r\n,Claim,On support for the Export-Import Bank,Ron Johnson,Republican,,Trade,Full Flop,2015-10-20,State
13041,National,<p>On leaks and the release of secret informat...,Claim,comments during and after the 2016 presidentia...,Donald Trump,Republican,President,Homeland Security,Full Flop,2017-02-16,National
13051,Wisconsin,<div>On whether the Senate should vote in a ti...,Claim,various public comments,Tammy Baldwin,Democrat,U.S. Representative,Legal Issues,Full Flop,2017-02-02,State
13867,National,<p>On whether higher federal deficits are acce...,Claim,"an interview on ""Fox News Sunday""",Mick Mulvaney,Republican,Director of the Office of Management and Budget,Debt,Full Flop,2017-10-01,National


As explained before, I'm not interested in whether someone has changed their position, just whether the statement is true, so I'm going to get rid of the observations with 'Full Flop' and 'Half Flip' rulings.

In [49]:
df = df.loc[(df['ruling']!='Full Flop')&(df['ruling']!='Half Flip')].copy()  # .copy() so new columns can be added safely

In [50]:
df.shape

(13905, 11)
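
When several values need to be excluded at once, `isin` keeps the filter readable. A small sketch on toy data (`rulings` and `FLIP_RULINGS` are illustrative names, not part of the notebook):

```python
import pandas as pd

rulings = pd.DataFrame({'ruling': ['True', 'Full Flop', 'Half-True', 'Half Flip', 'False']})

# Rulings that rate a change of position rather than truthfulness.
FLIP_RULINGS = ['Full Flop', 'Half Flip']

# ~ negates the mask, so only rows whose ruling is NOT in the list are kept.
kept = rulings.loc[~rulings['ruling'].isin(FLIP_RULINGS)].copy()
print(kept['ruling'].tolist())
```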

I'm not entirely sure what to do about the 'Half-True' ruling if I'm trying to create a binary variable.  I looked at the write-ups for five of these cases on Politifact.com, as well as their rating system, given [here](http://www.politifact.com/truth-o-meter/article/2011/feb/21/principles-truth-o-meter/), which describes the criteria for a Half-True ruling as, "The statement is partially accurate but leaves out important details or takes things out of context."  The tone of the write-ups suggests that the statements deemed 'Half-True' are, on balance, more true than false. Counting 'Half-True' as true also keeps things neat, with three 'True' rulings ('True', 'Mostly True', and 'Half-True') and three 'False' rulings ('Mostly False', 'False', and 'Pants on Fire!'), so that is the grouping I'll use.

In [51]:
df['ruling'].unique()

array(['Pants on Fire!', 'Half-True', 'Mostly True', 'True', 'False',
       'Mostly False'], dtype=object)

In [52]:
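def trueorfalse(x):
    # Treat the three most truthful rulings as 'True' and the other three as 'False'.
    if x in ('True', 'Mostly True', 'Half-True'):
        y = 'True'
    elif x in ('Mostly False', 'False', 'Pants on Fire!'):
        y = 'False'
    else:
        # Fail loudly on an unexpected ruling instead of returning an unassigned variable.
        raise ValueError('Unexpected ruling: {}'.format(x))
    return y

df['binary_ruling'] = df['ruling'].map(trueorfalse)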

In [53]:
df['binary_ruling'].value_counts()

True     7614
False    6291
Name: binary_ruling, dtype: int64

A lot of the statements contain HTML tags and character entities (such as `<p>` and `&quot;`) left over from the webpage formatting, which shouldn't be there for our analysis. I'm going to try to strip them out (although it's hard to see all of their formulations, so I'm not sure if I'll get all of them).

In [56]:
df.tail(50)

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date,edition_type,binary_ruling
14085,Wisconsin,"Under the House Republican tax plan, in Wiscon...",Claim,an interview,Paul Ryan,Republican,U.S. Representative,Children,Half-True,2017-12-01,State,True
14086,Texas,"""Earlier this year, Hurricane Harvey left more...",Claim,a guest editorial in Drilling Contractor,Ryan Sitton,Republican,"President and CEO, PinnacleAIS",Housing,False,2017-11-07,State,False
14087,National,"""We're ... getting into the pool of the 100 mi...",Claim,a cabinet meeting,Donald Trump,Republican,President,Economy,Mostly True,2017-12-06,National,True
14088,National,Says the concealed carry bill would allow resi...,Claim,a tweet,Brendan Boyle,Democrat,,Guns,False,2017-12-05,National,False
14089,Missouri,"""In the current law, if you report harassment,...",Claim,"an interview on ""Meet the Press""",Roy Blunt,Republican,Senator,Congressional Rules,Half-True,2017-11-19,State,True
14090,National,"Says ""Obama's deadliest cover-up (in an explos...",Claim,a campaign ad,Don Blankenship,Republican,,Elections,Pants on Fire!,2017-12-03,National,False
14091,National,"""Gloria Allred Accuser **ADMITS** She Tampered...",Claim,a blog post,Bloggers,,,Fake news,Pants on Fire!,2017-12-08,National,False
14092,PunditFact,Says Russian President Vladimir Putingota stan...,Claim,a chain email,Chain email,,,Immigration,Pants on Fire!,2017-12-05,Media,False
14093,PunditFact,"""The FBI has become America's secret police......",Claim,"an interview on ""Hannity""",Gregg Jarrett,Republican,,Crime,Pants on Fire!,2017-12-06,Media,False
14094,Illinois,"""I released way more (tax) information than Br...",Claim,a candidate forum,JB Pritzker,Democrat,Gubernatorial candidate,Campaign Finance,False,2017-12-06,State,False


In [57]:
def cleanstring(x):
    # Remove the HTML tags that wrap the statement.
    x = x.replace('<p>\r\n\t','')
    x = x.replace('<p>','')
    x = x.replace('</p>','')
    x = x.replace('<div>','')
    x = x.replace('</div>','')
    # Decode the HTML entities.
    x = x.replace('&quot;','"')
    x = x.replace('&lsquo;',"'")
    x = x.replace('&rsquo;',"'")
    x = x.replace('&#39;',"'")
    x = x.replace('&hellip;','...')
    # Use a space for &nbsp; so that neighbouring words are not joined together.
    x = x.replace('&nbsp;',' ')
    x = x.replace('\r\n','')
    return x
    
df['statement'] = df['statement'].map(cleanstring)

In [58]:
#this is looking cleaner now!
df.tail(50)

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date,edition_type,binary_ruling
14085,Wisconsin,"Under the House Republican tax plan, in Wiscon...",Claim,an interview,Paul Ryan,Republican,U.S. Representative,Children,Half-True,2017-12-01,State,True
14086,Texas,"""Earlier this year, Hurricane Harvey left more...",Claim,a guest editorial in Drilling Contractor,Ryan Sitton,Republican,"President and CEO, PinnacleAIS",Housing,False,2017-11-07,State,False
14087,National,"""We're ... getting into the pool of the 100 mi...",Claim,a cabinet meeting,Donald Trump,Republican,President,Economy,Mostly True,2017-12-06,National,True
14088,National,Says the concealed carry bill would allow resi...,Claim,a tweet,Brendan Boyle,Democrat,,Guns,False,2017-12-05,National,False
14089,Missouri,"""In the current law, if you report harassment,...",Claim,"an interview on ""Meet the Press""",Roy Blunt,Republican,Senator,Congressional Rules,Half-True,2017-11-19,State,True
14090,National,"Says ""Obama's deadliest cover-up (in an explos...",Claim,a campaign ad,Don Blankenship,Republican,,Elections,Pants on Fire!,2017-12-03,National,False
14091,National,"""Gloria Allred Accuser **ADMITS** She Tampered...",Claim,a blog post,Bloggers,,,Fake news,Pants on Fire!,2017-12-08,National,False
14092,PunditFact,Says Russian President Vladimir Putingota stan...,Claim,a chain email,Chain email,,,Immigration,Pants on Fire!,2017-12-05,Media,False
14093,PunditFact,"""The FBI has become America's secret police......",Claim,"an interview on ""Hannity""",Gregg Jarrett,Republican,,Crime,Pants on Fire!,2017-12-06,Media,False
14094,Illinois,"""I released way more (tax) information than Br...",Claim,a candidate forum,JB Pritzker,Democrat,Gubernatorial candidate,Campaign Finance,False,2017-12-06,State,False
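
Because it's hard to list every tag and entity by hand, a more general approach is to strip anything that looks like a tag and let the standard library decode the entities. The sketch below is an alternative I have not run on the full dataset; `clean_statement` and the sample string are made up for illustration. Note that `html.unescape` turns `&rsquo;` into a curly apostrophe rather than a straight one.

```python
import html
import re

TAG_PATTERN = re.compile(r'<[^>]+>')   # any HTML tag, e.g. <p> or </div>

def clean_statement(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_PATTERN.sub(' ', raw)        # drop tags, leaving a space in their place
    text = html.unescape(text)              # &quot; -> ", &#39; -> ', &nbsp; -> U+00A0
    text = text.replace('\u00a0', ' ')      # non-breaking space -> plain space
    text = re.sub(r'\s+', ' ', text)        # collapse \r, \n, \t and repeated spaces
    return text.strip()

# Made-up statement in the same style as the raw data.
sample = '<p>\r\n\t&quot;Taxes went up --&nbsp;that&#39;s a fact.&quot;</p>\r\n'
print(clean_statement(sample))
```

On the real column this could be applied with `df['statement'].map(clean_statement)`, in place of the hand-written list of replacements.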


Right now, the statement_context category is very difficult to interpret, because some entries are much more detailed than others (for instance, 'interview' versus 'an interview on WJAR-TV's "10 News Conference."')

I'm going to try to create a new variable called 'simple_context' which eliminates some of that additional information to create clearer categories.

In [59]:
df['statement_context'].value_counts()

an interview                                                      333
a news release                                                    319
a tweet                                                           317
a press release                                                   310
a speech                                                          291
a TV ad                                                           224
a campaign ad                                                     170
a television ad                                                   163
a radio interview                                                 133
a debate                                                          117
a news conference                                                 111
a press conference                                                108
a Facebook post                                                   104
a headline                                                        100
a campaign commercia

In [60]:
def simplifycontext(x):
    x = str(x)
    if 'interview' in x:
        y = 'an interview'
    elif 'news release' in x:
        y = 'a news release'
    elif 'press release' in x:
        y = 'a news release'
    elif 'campaign ad' in x:
        y = 'a campaign ad'
    elif 'speech' in x:
        y = 'a speech'
    elif 'debate' in x:
        y = 'a debate'
    elif 'news conference' in x:
        y = 'a news conference'
    elif 'press conference' in x:
        y = 'a news conference'
    elif 'TV ad' in x:
        y = 'a TV ad'
    elif 'op-ed' in x:
        y = 'an op-ed'
    elif 'inaugural' in x:
        y = 'a speech'  # inaugural addresses belong with the other speeches
    else: y = x
    return y 

df['simple_context'] = df['statement_context'].map(simplifycontext)

In [61]:
df['simple_context'].value_counts()

an interview                                                          2329
a speech                                                              1404
a debate                                                               960
a news release                                                         903
a TV ad                                                                394
a news conference                                                      354
a tweet                                                                317
a campaign ad                                                          208
an op-ed                                                               182
a television ad                                                        163
nan                                                                    114
a Facebook post                                                        104
a headline                                                             100
a campaign commercial    

This has helped a little bit, but there are just so many different unique contexts that I think it will be very time-consuming to properly classify all of them. I may revisit this at a later date if it seems like there may be some interesting relationships within this category. 
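
If I do revisit it, one option is to keep the keyword rules in a list so that new ones can be added without touching the function. The sketch below assumes a few extra rules that are not in my function above ('television ad' and 'campaign commercial'), and the names `CONTEXT_RULES` and `simplify_context` are hypothetical. It also uses `na_action='ignore'` so that missing contexts stay missing instead of becoming the string 'nan'.

```python
import pandas as pd

# (keyword, label) pairs checked in order; the first match wins.
# The 'television ad' and 'campaign commercial' rules are assumptions added for illustration.
CONTEXT_RULES = [
    ('interview', 'an interview'),
    ('news release', 'a news release'),
    ('press release', 'a news release'),
    ('campaign ad', 'a campaign ad'),
    ('campaign commercial', 'a campaign ad'),
    ('speech', 'a speech'),
    ('debate', 'a debate'),
    ('news conference', 'a news conference'),
    ('press conference', 'a news conference'),
    ('TV ad', 'a TV ad'),
    ('television ad', 'a TV ad'),
    ('op-ed', 'an op-ed'),
]

def simplify_context(context):
    """Return the label of the first matching keyword, else the original text."""
    for keyword, label in CONTEXT_RULES:
        if keyword in context:
            return label
    return context

contexts = pd.Series(['a radio interview', 'a television ad', None, 'a tweet'])

# na_action='ignore' skips missing values, so the function only ever sees strings.
simple = contexts.map(simplify_context, na_action='ignore')
print(simple.tolist())
```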

There are similar problems with the speaker_job field, which I will try to address below with the same approach.

In [62]:
df['speaker_job'].value_counts()

President                                                                 1084
U.S. Senator                                                               641
Governor                                                                   333
U.S. senator                                                               325
Presidential candidate                                                     312
U.S. Representative                                                        233
Senator                                                                    221
Former governor                                                            206
Milwaukee County Executive                                                 182
U.S. Energy Secretary                                                      171
State Senator                                                              148
U.S. representative                                                        141
U.S. House of Representatives                       

In [63]:
def simplifyjob(x):
    x = str(x)
    job = x.lower()  # lowercase once so every comparison is case-insensitive
    if 'u.s. senator' in job: y='U.S. Senator'
    elif 'united states senator' in job: y='U.S. Senator'
    elif job=='senator': y='U.S. Senator'
    elif 'u.s. representative' in job: y='U.S. Representative'
    elif 'congressman' in job: y='U.S. Representative'
    elif 'congresswoman' in job: y='U.S. Representative'
    elif 'u.s. house' in job: y='U.S. Representative'
    elif 'member of congress' in job: y='U.S. Representative'
    elif 'state representative' in job: y='State Representative'
    elif 'state senator' in job: y='State Senator'
    elif 'governor' in job: y='Governor'
    elif 'lawyer' in job: y='Lawyer'
    elif job=='political action committee': y='Political Action Committee'
    elif job=='pac': y='Political Action Committee'
    elif job=='super pac': y='Political Action Committee'
    elif 'state assembly' in job: y='State Assembly Member'
    else: y = x
    return y

df['simple_job'] = df['speaker_job'].map(simplifyjob)

In [64]:
df['simple_job'].value_counts()

nan                                                                          4133
U.S. Senator                                                                 1246
U.S. Representative                                                          1135
President                                                                    1084
Governor                                                                     1018
State Senator                                                                 330
Presidential candidate                                                        312
State Representative                                                          271
Milwaukee County Executive                                                    182
U.S. Energy Secretary                                                         171
Attorney                                                                      110
Social media posting                                                           99
State Assembly M

Again, this has helped slightly, but there are still an awful lot of unique entries in this field.  More work may be needed to generalize the categories further. 
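
One possible next step, which I have not applied here, is to lump the long tail of rare job titles into a single 'Other' category. The sketch below shows the idea on toy data; `lump_rare` and the `min_count` threshold are hypothetical, and the right threshold would depend on the modelling.

```python
import pandas as pd

def lump_rare(series, min_count):
    """Replace values seen fewer than min_count times with 'Other'; missing values stay missing."""
    counts = series.value_counts()                  # NaN is excluded from the counts
    common = counts[counts >= min_count].index
    keep = series.isin(common) | series.isna()
    return series.where(keep, 'Other')              # where() keeps values for which the mask is True

jobs = pd.Series(['President', 'President', 'Governor', 'Governor',
                  'Milwaukee County Executive', None])
lumped = lump_rare(jobs, min_count=2)
print(lumped.tolist())
```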

Finally: 'date' is a field that should not be a string object, but rather a datetime object, so I'll finish this cleaning by converting that column below.

In [65]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')  # dates are stored as YYYY-MM-DD strings

In [66]:
df.dtypes

edition                      object
statement                    object
statement_type               object
statement_context            object
speaker                      object
speaker_party                object
speaker_job                  object
subject                      object
ruling                       object
date                 datetime64[ns]
edition_type                 object
binary_ruling                object
simple_context               object
simple_job                   object
dtype: object
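
When converting dates, it can be worth checking for values that fail to parse and for an implausible date range. A minimal sketch on made-up strings (the real column may well have no bad values):

```python
import pandas as pd

raw_dates = pd.Series(['2009-10-22', '2007-05-10', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().sum())          # number of values that failed to parse
print(parsed.min(), parsed.max())   # date range of the values that did parse
```

Sorting by date, as in the next cells, serves the same purpose of spotting unusually early or late observations.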

In [67]:
dfcopy = df.copy()
dfcopy.sort_values('date').head()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date,edition_type,binary_ruling,simple_context,simple_job
2693,Georgia,The economic impact of Atlanta's 2000 Super Bo...,Claim,a study,Atlanta Sports Council,,,Economy,False,2000-10-01,State,False,a study,
3069,Florida,"""Financed the largest parking expansion progra...",Claim,city report reprinted as 2011 mayoral campaign...,Dick Greco,,,City Budget,Mostly False,2002-05-01,State,False,city report reprinted as 2011 mayoral campaign...,
22,National,"""The failings in our civil service are encoura...",Claim,Oklahoma City.,John McCain,Republican,U.S. senator,Federal Budget,Mostly True,2007-03-21,National,True,Oklahoma City.,U.S. Senator
12,National,"""(McCain) said he was opposed to overturning R...",Attack,"Stratham, N.H.",Mitt Romney,Republican,Former governor,Abortion,Half-True,2007-04-26,National,True,"Stratham, N.H.",Governor
11,National,"""Senator McCain voted against the Bush tax cut...",Attack,"Stratham, N.H.",Mitt Romney,Republican,Former governor,Taxes,True,2007-04-26,National,True,"Stratham, N.H.",Governor


In [68]:
dfcopy.sort_values('date').tail()

Unnamed: 0,edition,statement,statement_type,statement_context,speaker,speaker_party,speaker_job,subject,ruling,date,edition_type,binary_ruling,simple_context,simple_job
14129,National,"Says of the diversity visa lottery program, ""t...",Claim,a speech,Donald Trump,Republican,President,Homeland Security,Pants on Fire!,2017-12-15,National,False,a speech,President
14123,PunditFact,"""Alabama state police arrest 3 poll workers in...",Claim,a headline,American Revolution,,,Elections,Pants on Fire!,2017-12-15,Media,False,a headline,
14126,PunditFact,"""Breaking: Muslim New Jersey Mayor just banned...",Attack,a headline,TheLastLineOfDefense.org,,,City Government,Pants on Fire!,2017-12-16,Media,False,a headline,
14130,National,"""We essentially repealed Obamacare because we ...",Claim,a speech,Donald Trump,Republican,President,Health Care,False,2017-12-20,National,False,a speech,President
14131,National,"In April, ""the vast majority will be (filling ...",Claim,"an interview on ""Fox & Friends""",Ivanka Trump,,Assistant to President Donald Trump,Taxes,False,2017-12-21,National,False,an interview,Assistant to President Donald Trump


### Task 3: Describe your data: keep your intended audience(s) in mind.
   - Document your work so far in a Jupyter notebook.

In [69]:
df.isnull().sum()

edition                 0
statement               0
statement_type          0
statement_context     114
speaker                 0
speaker_party           0
speaker_job          4133
subject                 0
ruling                  0
date                    0
edition_type            0
binary_ruling           0
simple_context          0
simple_job              0
dtype: int64

In [70]:
df['statement_type'].value_counts()

Claim     12335
Attack     1570
Name: statement_type, dtype: int64

In [71]:
df['speaker'].value_counts()

Barack Obama                                    598
Donald Trump                                    481
Hillary Clinton                                 295
 Bloggers                                       238
Mitt Romney                                     206
John McCain                                     184
Scott Walker                                    182
 Chain email                                    180
Rick Perry                                      171
Marco Rubio                                     154
Rick Scott                                      148
Ted Cruz                                        123
Bernie Sanders                                  118
Chris Christie                                  102
 Facebook posts                                  99
Paul Ryan                                        83
Newt Gingrich                                    82
Charlie Crist                                    80
Jeb Bush                                         80
Joe Biden   

In [72]:
df['speaker_party'].value_counts()

Republican                         6097
Democrat                           4354
None                               2582
Organization                        285
Independent                         197
Newsmaker                            66
Libertarian                          60
Journalist                           52
Activist                             49
Columnist                            47
Talk show host                       33
State official                       27
Labor leader                         17
Business leader                      11
Tea Party member                     10
Green                                 3
Constitution Party                    3
Education official                    3
Government body                       2
county commissioner                   2
Moderate Party                        1
Ocean State Tea Party in Action       1
Democratic Farmer-Labor               1
Law enforcement official              1
Liberal Party of Canada               1


In [73]:
df['subject'].value_counts()

Economy                          1173
Health Care                      1036
Candidate Biography               757
Education                         745
Elections                         570
Federal Budget                    494
Crime                             493
Taxes                             424
Foreign Policy                    415
Immigration                       414
Abortion                          330
Energy                            326
State Budget                      255
Jobs                              245
Guns                              244
Campaign Finance                  236
Children                          226
Fake news                         216
Congress                          205
Corrections and Updates           197
Deficit                           186
Environment                       164
History                           161
Corporations                      141
Job Accomplishments               139
Climate Change                    137
Civil Rights

The data that I'm working with is a collection of statements from (mostly) American public figures, along with information about the speaker, the context of the statement, and the true/false rating the statement received from Politifact. (Refer to the data dictionary above for a more detailed explanation of the fields.)

The Politifact ruling will be my target variable in this investigation.  There are six different rulings, ranging from 'True' for the most veracious statements to 'Pants on Fire!' for the most egregious lies. Most of the ruling categories have roughly comparable numbers of observations (between 2100 and 2800), except for 'Pants on Fire!', which has 1350.  When the top three rulings are bundled together as 'True' and the bottom three rulings are bundled together as 'False', there are 7600 True statements and 6300 False statements -- so, reasonably balanced.  (This relieves one of my concerns, which was that Politifact would be mostly looking to research statements which seem fishy or controversial to begin with, and that I wouldn't therefore have sufficient 'True' observations to train my model with.  I'm glad that doesn't seem to be the case!)
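The bundling described above can be sketched as follows. This is only an illustration on a tiny made-up sample: the column name `binary_ruling_sketch` and the exact label strings are assumptions, and the real `binary_ruling` column in my dataframe may be encoded differently.

```python
import pandas as pd

# Assumed split: the three most truthful rulings are bundled as 'True',
# the remaining three as 'False'.
TRUE_RULINGS = {'True', 'Mostly True', 'Half-True'}

# Tiny made-up sample standing in for the real dataframe (one row per ruling).
sample = pd.DataFrame({
    'ruling': ['True', 'Mostly True', 'Half-True',
               'Mostly False', 'False', 'Pants on Fire!']
})

# Collapse the six-level ruling into a two-level label.
# 'binary_ruling_sketch' is a hypothetical column name for this example.
is_true = sample['ruling'].isin(TRUE_RULINGS)
sample['binary_ruling_sketch'] = is_true.map({True: 'True', False: 'False'})

# Check the class balance of the bundled label.
print(sample['binary_ruling_sketch'].value_counts())
```

On the full dataframe, the same `value_counts()` call is what gives the True/False balance quoted above.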

A majority of my observations (around 8000) come from state editions of Politifact, and then roughly 5000 come from the national edition.  A further 1000 come from editions focused on the media or international events/figures. 

The vast majority of my observations (12335) are Claims; the remaining 1570 are Attacks.

The statement_context variable is tricky, because Politifact seems to take wildly different approaches to how specific this field should be. I've done some work towards generalizing the overly-specific contexts, and the result is recorded as simple_context in my dataframe.  However, this isn't very helpful at the moment, because there are still far too many unique entries.  Generally though, we can see that the statements have been pulled from a wide variety of sources -- speeches, interviews, Twitter feeds, campaign materials, debates, articles, and more. Also, note that 114 observations are missing the context.

Statements are from a wide variety of speakers, but we can see that prominent politicians from the last three presidential election cycles (presidential candidates and their vice presidential candidates especially) are particularly well-represented.  Additionally, some of the speakers are not individuals at all, but rather organizations or groups of people like the Republican National Committee or 'Bloggers'. 

Most observations come from speakers labelled Democrat, Republican, or None. Together these account for 13033 of the 13905 observations, or roughly 94%.
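As a quick check on that share, here is the arithmetic using the counts printed above. The 'All other labels' figure is not a real category: it is my own sum of every remaining party label in the output.

```python
import pandas as pd

# Counts copied from the speaker_party value_counts() output above.
# 'All other labels' is a summed placeholder (872), not a real category.
party_counts = pd.Series({
    'Republican': 6097,
    'Democrat': 4354,
    'None': 2582,
    'All other labels': 872,
})

top_three = ['Republican', 'Democrat', 'None']
share = party_counts[top_three].sum() / party_counts.sum()

print(f"Top three labels cover {share:.1%} of observations")
```

On the real dataframe, `df['speaker_party'].value_counts(normalize=True).head(3).sum()` should give the same figure without retyping any counts.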

The speaker_job variable has similar problems to the statement_context field.  Many of the observations (around 4000) do not have this information recorded.  Of those that do, the most frequent jobs are those with national political prominence: the president, members of the US House and US Senate, and governors.  

The subject variable indicates the topic of the statement being assessed, and we can see that there is a wide variety of subjects that Politifact reports on.  The most common topics are the economy and healthcare, which reflects the prominence of these two subjects in the national political debate. 

The dates of these statements range from 2000 through to December 2017. However, there are only two observations from before 2007, since Politifact really started its work during the primaries for the 2008 presidential election.

### Task 4: Document your project goals (revise from your initial pitch)
   - Articulate specific aims
   - Outline proposed methods and models
   - Define risks & assumptions

#### Specific aims:

I will be trying to see whether I can predict the truthfulness of a statement from the information that I've collected from Politifact.

I would also like to identify what features are helpful in predicting truthfulness.  At a minimum, I'll be investigating the relationship between the truthfulness of a statement and:
- the identity of the speaker
- the political affiliation of the speaker
- the topic of the statement
- the nature of the statement (Claim vs Attack)
- the language used in the statement

#### Proposed methods and models:

This will be a classification problem. (I could potentially use regression models if I turned the ruling variable into a number from 1 to 6, but I don't think this makes sense: it would still be an ordinal scale, where we don't know whether the difference between 1 and 2 (i.e. Pants on Fire! to False) is the same as the difference between 2 and 3 (i.e. False to Mostly False).)

I will be using some NLP to help investigate the relationship between the language used in the statement and the statement's truthfulness, so the first step will be to use a vectorizer to obtain the language features.
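To make the vectorizer step concrete, here is a minimal sketch using scikit-learn's `TfidfVectorizer`. The three statements are short made-up examples, and the settings (English stop words, unigrams plus bigrams) are assumptions for illustration rather than choices I have settled on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up statements standing in for the 'statement' column.
statements = [
    "I used tax cuts to help create jobs.",
    "New Mexico was 46th in teacher pay.",
    "Tax cuts create jobs in New Mexico.",
]

# Assumed settings: drop common English words, keep single words and word pairs.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))

# X is a sparse matrix with one row per statement and one column per term.
X = vectorizer.fit_transform(statements)

print(X.shape)
print(sorted(vectorizer.vocabulary_)[:10])
```

The real statements contain HTML remnants such as `<p>` and `&quot;`, so they would need cleaning before a step like this.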

I will also have to do some EDA to get a better sense of what features may be predictive of truthfulness from those I already have in my dataframe (political affiliation, for example).

Then, I will use a variety of classification models from those that we've learned (logistic regression, k-nearest neighbors, support vector classification, decision trees, ensemble methods like random forest or bagging classifiers, etc.) and find the best model for accurately predicting truthfulness. 
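One way the model comparison could look is sketched below. It runs on synthetic data from `make_classification` rather than the Politifact features, and the three models and their settings are placeholders, so the scores themselves mean nothing for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real feature matrix and binary target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A few candidate classifiers with placeholder settings.
models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated precision for each model (positive class = 1).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring='precision').mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scoring every model with the same cross-validation folds and the same metric keeps the comparison fair.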

I will evaluate the success of my model on the basis of its accuracy as well as its precision.  (Here, I'm defining a true positive as correctly predicting a lie.)  Precision is important to me in this case because a high proportion of true 'Lies' out of all predicted 'Lies' would let journalists, for example, use this model to get a good list of leads to follow if they're trying to hold lying politicians to account.  I think it's more useful to be pretty confident that the 'Lies' the model identifies are actually lies than to be confident that we've caught all of the lies (because with politicians, that would NEVER be possible anyway, since they lie as easily as they breathe).  If a journalist had to spend a lot of time researching statements that ended up being true, the model might not be very useful.
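To show how precision differs from accuracy when a lie is the positive class, here is a small worked example in plain Python. The labels and predictions are made up for illustration.

```python
# Made-up true labels and model predictions for eight statements.
y_true = ['Lie', 'Lie', 'Not lie', 'Not lie', 'Lie', 'Not lie', 'Not lie', 'Lie']
y_pred = ['Lie', 'Not lie', 'Not lie', 'Not lie', 'Lie', 'Not lie', 'Not lie', 'Lie']

POSITIVE = 'Lie'  # a true positive is a correctly predicted lie

pairs = list(zip(y_true, y_pred))
tp = sum(t == POSITIVE and p == POSITIVE for t, p in pairs)  # lies flagged as lies
fp = sum(t != POSITIVE and p == POSITIVE for t, p in pairs)  # truths flagged as lies
fn = sum(t == POSITIVE and p != POSITIVE for t, p in pairs)  # lies that were missed

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)  # of the predicted lies, how many really are lies
recall = tp / (tp + fn)     # of the real lies, how many were caught

print(accuracy, precision, recall)  # 0.875 1.0 0.75
```

In this toy case every flagged statement really is a lie (precision 1.0) even though one lie slips through (recall 0.75), which is the trade-off I'm happy to accept. With scikit-learn, `precision_score(y_true, y_pred, pos_label='Lie')` computes the same precision.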



#### Risks and assumptions:

Some of my assumptions are:
- The Politifact writers are impartial and good at their jobs, so their ratings will be fair and accurate across different subjects, speakers, years, political affiliations, etc., and these ratings will be consistent from one Politifact writer to another. 
- Half-True is actually more true than false, so these statements should be grouped in as broadly 'True'.
- The statements selected by Politifact to evaluate are roughly representative of political statements generally (i.e., they are not selected because they sound suspicious or sensational, but are in fact typical of the national political debate).

In [74]:
df.to_csv('cleaned_politifact_df',encoding='utf-8')