# 1. Source data

Goal: get to know the source data

## 1.1. Data description

<img src="./Grahpics/politifact-logo.jpg" width="200">

PolitiFact.com is a fact-checking website that rates the accuracy of claims by elected officials and others on its Truth-O-Meter. PolitiFact.com is a nonprofit project operated by the Poynter Institute in St. Petersburg, Florida, with offices there and in Washington, D.C. It began in 2007 as a project of the Tampa Bay Times (then the St. Petersburg Times), with reporters and editors from the newspaper and its affiliated news media partners reporting on the accuracy of statements made by elected officials, candidates, their staffs, lobbyists, interest groups and others involved in U.S. politics. Its journalists evaluate original statements and publish their findings on the PolitiFact.com website, where each statement receives a "Truth-O-Meter" rating. The ratings range from "True" for completely accurate statements to "Pants on Fire" (from the taunt "Liar, liar, pants on fire") for false and ludicrous claims.

Source: https://en.wikipedia.org/wiki/PolitiFact

## 1.2. Import of packages

pip install pandas

pip install requests

etc., depending on your needs

In [1]:
import requests
import json
import pprint
import pandas as pd
import numpy as np

## 1.3. Exploring the data structure of the website
####   www.politifact.com

In [2]:
r = requests.get("https://www.politifact.com/api/factchecks/?format=json&page=1") # first subpage

In [3]:
try:
    v_page_json = r.json()
except json.decoder.JSONDecodeError:
    print("Invalid format")
else:     
    pprint.pprint(v_page_json) # JSON on the first subpage

{'count': 17614,
 'next': 'http://www.politifact.com/api/factchecks/?format=json&page=2',
 'previous': None,
 'results': [{'id': 18460,
              'publication_date': '2020-05-01T17:34:36-04:00',
              'ruling_comments': '<p>A virology lab in Wuhan, China, continues '
                                 'to draw scrutiny for work it did on bat '
                                 'viruses as part of American-funded '
                                 'research.</p>\n'
                                 '\n'
                                 '<p>To be clear, there is no sign that the '
                                 'coronavirus that has swept around the globe '
                                 'was bioengineered, but suspicions run high, '
                                 'including from <a '
                                 'href="https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-april-17-2020

              'publication_date': '2020-05-01T11:00:00-04:00',
              'ruling_comments': '<p>A<a '
                                 'href="https://www.facebook.com/photo.php?fbid=3042273349144351&amp;set=a.771765556195153&amp;type=3&amp;theater"> '
                                 'social media post</a>, which in April was '
                                 'shared widely on Facebook and made '
                                 'appearances on a conservative online '
                                 'discussion forum, asserts that former '
                                 'President Barack Obama signed legislation '
                                 'that caused companies to&nbsp; manufacture '
                                 'medical devices overseas, including items '
                                 'essential for the current coronavirus '
                                 'pandemic.</p>\n'
                                 '\n'
                                 '<p>Alongside a p

In [4]:
len(v_page_json) # number of top-level keys in the response (count, next, previous, results)

4
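The value 4 above is the number of top-level keys in the response, not the number of statements. A minimal sketch of that envelope, using a hand-trimmed copy of the first page (the `results` list is shortened to ids only, purely for illustration):

```python
# Trimmed copy of the first page; field values are taken from the output above,
# but 'results' is reduced to two id-only entries for brevity (an assumption).
page = {
    "count": 17614,
    "next": "http://www.politifact.com/api/factchecks/?format=json&page=2",
    "previous": None,
    "results": [{"id": 18460}, {"id": 18456}],
}

top_level_keys = list(page.keys())
print(len(page), top_level_keys)   # size of the envelope, not of the data
print(len(page["results"]))        # number of statements on this (trimmed) page
```

The statements themselves live under `results`; `count`, `next` and `previous` only describe the pagination.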

In [5]:
results = v_page_json['results'] # data included on the sample subpage
pprint.pprint(results)

[{'id': 18460,
  'publication_date': '2020-05-01T17:34:36-04:00',
  'ruling_comments': '<p>A virology lab in Wuhan, China, continues to draw '
                     'scrutiny for work it did on bat viruses as part of '
                     'American-funded research.</p>\n'
                     '\n'
                     '<p>To be clear, there is no sign that the coronavirus '
                     'that has swept around the globe was bioengineered, but '
                     'suspicions run high, including from <a '
                     'href="https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-april-17-2020/">President '
                     'Donald Trump</a>&rsquo;s lawyer Rudy Giuliani.</p>\n'
                     '\n'
                     '<p>Giuliani <a '
                     'href="https://twitter.com/RudyGiuliani/status/1254513987196248065">tweeted</a> '
                     'April 26, &quot;Why 

                     '\n'
                     '<div class="pf_subheadline">Other '
                     'factors&nbsp;&nbsp;</div>\n'
                     '\n'
                     '<p>So finally, does the now non-existent medical device '
                     'tax have anything to do with our current PPE shortage in '
                     'the face of the coronavirus pandemic?</p>\n'
                     '\n'
                     '<p>The answer to that is also no, said the '
                     'experts.&nbsp;&nbsp;</p>\n'
                     '\n'
                     '<p>The current personal protective equipment shortage '
                     'can be attributed to the lack of a stockpile reserve of '
                     'PPE, an initial slow response by the U.S. and the '
                     'tariffs imposed against Chinese goods by the Trump '
                     'administration, said Peter Petri, a professor of '
                     'international finance at Brandeis Univer

In [6]:
len(results) # number of statements posted on the page (the last subpage may have fewer statements than the others)

10
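With `count` from the first page and ten statements per full page, the number of subpages to expect can be estimated up front. A short worked calculation, assuming the page size stays constant at 10:

```python
import math

total_statements = 17614  # 'count' field reported on the first page
page_size = 10            # len(results) on a full page (assumed constant)

# Round up, because the last page may be only partially filled
expected_pages = math.ceil(total_statements / page_size)
last_page_size = total_statements - (expected_pages - 1) * page_size

print(expected_pages, last_page_size)  # 1762 4
```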

In [7]:
pprint.pprint(results[0]) # a single statement

{'id': 18460,
 'publication_date': '2020-05-01T17:34:36-04:00',
 'ruling_comments': '<p>A virology lab in Wuhan, China, continues to draw '
                    'scrutiny for work it did on bat viruses as part of '
                    'American-funded research.</p>\n'
                    '\n'
                    '<p>To be clear, there is no sign that the coronavirus '
                    'that has swept around the globe was bioengineered, but '
                    'suspicions run high, including from <a '
                    'href="https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-april-17-2020/">President '
                    'Donald Trump</a>&rsquo;s lawyer Rudy Giuliani.</p>\n'
                    '\n'
                    '<p>Giuliani <a '
                    'href="https://twitter.com/RudyGiuliani/status/1254513987196248065">tweeted</a> '
                    'April 26, &quot;Why did the US (NIH

## 1.4. Selection of elements - variables for modeling

In [8]:
results[0].keys() # key names

dict_keys(['id', 'slug', 'speaker', 'targets', 'statement', 'ruling_slug', 'publication_date', 'ruling_comments', 'sources'])

In [9]:
results[0]['id'] # example id

18460

In [10]:
# example author of the statement
results[0]['speaker']['full_name']


'Rudy Giuliani'

In [11]:
# number of targets for each statement on the first subpage
print('Targets:')
for i, result in enumerate(results):
    # collect the full names of all persons targeted by this statement
    targets = [target['full_name'] for target in result['targets']]
    if not targets:
        print('statement',i,'->','no targets')
    else:
        print('statement',i,'->',len(targets),'targets:',targets)

Targets:
statement 0 -> no targets
statement 1 -> no targets
statement 2 -> no targets
statement 3 -> no targets
statement 4 -> no targets
statement 5 -> no targets
statement 6 -> no targets
statement 7 -> no targets
statement 8 -> no targets
statement 9 -> 1 targets: ['Joe Biden']
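The same extraction can be wrapped in a small helper that returns an empty list when a statement has no targets. This is only a sketch: `target_names` is a hypothetical helper name, and the two sample records are cut down from the page shown above.

```python
def target_names(statement):
    """Return the full names of all targets of one statement (possibly empty)."""
    return [target["full_name"] for target in statement.get("targets", [])]

# Two records reduced to the fields used here (illustrative subset)
sample_results = [
    {"id": 18452, "targets": []},
    {"id": 18451, "targets": [{"slug": "joe-biden", "full_name": "Joe Biden",
                               "first_name": "Joe", "last_name": "Biden"}]},
]

names_per_statement = {r["id"]: target_names(r) for r in sample_results}
print(names_per_statement)
```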


In [12]:
results[9]['targets'] # example of a person who was the subject of a statement
# a statement may name no person at all, or more than one

[{'slug': 'joe-biden',
  'full_name': 'Joe Biden',
  'first_name': 'Joe',
  'last_name': 'Biden'}]

In [13]:
v_page_json['results'][0]['statement']  # example statement

'"Why did the US (NIH) in 2017 give $3.7m to the Wuhan Lab in China? Such grants were prohibited in 2014. Did Pres. Obama grant an exception?"'

In [14]:
v_page_json['results'][0]['ruling_slug']  # sample truthfulness rating of the statement

'false'

In [15]:
v_page_json['results'][0]['publication_date']  # date of publication

'2020-05-01T17:34:36-04:00'
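The publication date is an ISO 8601 string with a UTC offset. If only the calendar day is needed, one possible approach is to parse it with the standard library and keep the date part:

```python
from datetime import datetime

raw_date = "2020-05-01T17:34:36-04:00"  # value shown above

parsed = datetime.fromisoformat(raw_date)  # timezone-aware datetime
day_only = parsed.date().isoformat()       # local calendar day of the publication

print(day_only)  # 2020-05-01
```

Note that this keeps the day in the original offset (-04:00); converting to UTC first could shift late-evening publications to the next day.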

## 1.5. Downloading data from the website

In [16]:
### RELEVANT VARIABLES ###

#v_page_json             # content downloaded from the subpage (JSON format)
v_column_names = []      # list of column names
v_id = int               # statement identifier
v_publication_date = str # date of publication
v_statement = str        # statement
v_assessment = str       # assessment of the truth of the statement
v_speaker = str          # author of the statement
v_targets = []           # persons who were the subjects of the statement
v_nr_of_statements = int # number of downloaded statements
v_www_status = 200       # server response (200 - optimistic assumption to start :))
v_page = 1               # subpage number
v_url = "https://www.politifact.com/api/factchecks/?format=json&page="

r = requests.get(v_url+str(v_page))

In [17]:
### CONSTANTS ###
#NR_SUBPAGES = 10         # limit of subpages (for testing)

In [18]:
%%time
# DOWNLOAD DATA TO FRAME

# creating an empty data frame
v_column_names = ['id','speaker','targets','statement','assessment','publication_date']
DF_Statements = pd.DataFrame(columns=v_column_names)

# loop on subpages
while v_www_status == 200:
    #print('\n\n***** page:',v_page,'*****\n')
    try:
        v_page_json = r.json()
    except json.decoder.JSONDecodeError:
        print("Invalid format")
    else:
        # loop within one page
        results = v_page_json['results'] # statements listed on this subpage
        for result in results:
            v_id = result['id']
            #print(v_id)
            v_speaker = result['speaker']['full_name']
            # full names of all targets (empty list if there are none)
            v_targets = [target['full_name'] for target in result['targets']]
            v_statement = result['statement']
            v_assessment = result['ruling_slug']
            v_publication_date = result['publication_date'] # TODO: keep only the day

            # insert a row into the data frame
            DF_Statements.loc[v_id] = [v_id, v_speaker, v_targets, v_statement, v_assessment, v_publication_date]
    
    v_page += 1   
    #if v_page > NR_SUBPAGES: break # limited pages for testing
    
    # checking if another page exists
    r = requests.get(v_url+str(v_page))
    v_www_status = r.status_code

v_nr_of_statements = len(DF_Statements)
print(v_nr_of_statements ,'statements were collected from',v_page-1,'subpages')
v_page = 1

17614 statements were collected from 1762 subpages
Wall time: 31min 34s
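Appending rows to a data frame one at a time is comparatively slow, and the response already provides a `next` link for each page. A possible alternative, sketched under those assumptions, collects plain dictionaries and builds the frame once at the end. `collect_statements` and `fetch_json` are hypothetical names; the fetch function is injected so the sketch can run offline against two fake pages.

```python
import pandas as pd

COLUMNS = ["id", "speaker", "targets", "statement", "assessment", "publication_date"]

def collect_statements(fetch_json, first_url, max_pages=None):
    """Follow 'next' links from first_url and return all statements as one frame."""
    rows = []
    url = first_url
    pages_read = 0
    while url and (max_pages is None or pages_read < max_pages):
        page = fetch_json(url)
        for result in page["results"]:
            rows.append({
                "id": result["id"],
                "speaker": result["speaker"]["full_name"],
                "targets": [t["full_name"] for t in result["targets"]],
                "statement": result["statement"],
                "assessment": result["ruling_slug"],
                "publication_date": result["publication_date"],
            })
        url = page.get("next")  # None on the last page ends the loop
        pages_read += 1
    return pd.DataFrame(rows, columns=COLUMNS)

# Offline stand-in for the API: two tiny fake pages keyed by made-up URLs
FAKE_PAGES = {
    "page-1": {"next": "page-2", "results": [{
        "id": 18460, "speaker": {"full_name": "Rudy Giuliani"}, "targets": [],
        "statement": "example statement A", "ruling_slug": "false",
        "publication_date": "2020-05-01T17:34:36-04:00"}]},
    "page-2": {"next": None, "results": [{
        "id": 18451, "speaker": {"full_name": "Facebook posts"},
        "targets": [{"full_name": "Joe Biden"}],
        "statement": "example statement B", "ruling_slug": "pants-fire",
        "publication_date": "2020-04-30T14:53:24-04:00"}]},
}

df = collect_statements(FAKE_PAGES.__getitem__, "page-1")
print(df[["id", "speaker", "targets"]])
```

Against the real site, `fetch_json` could be something like `lambda url: requests.get(url, timeout=30).json()`, with `max_pages` set to a small number while testing.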


In [19]:
DF_Statements.head(25)

| | id | speaker | targets | statement | assessment | publication_date |
|---|---|---|---|---|---|---|
| 18460 | 18460 | Rudy Giuliani | [] | "Why did the US (NIH) in 2017 give $3.7m to th... | false | 2020-05-01T17:34:36-04:00 |
| 18456 | 18456 | Facebook posts | [] | Homeless people are immune to COVID-19 | pants-fire | 2020-05-01T17:19:18-04:00 |
| 18459 | 18459 | Facebook posts | [] | Says President Donald Trump is selling coronav... | false | 2020-05-01T17:18:14-04:00 |
| 18458 | 18458 | Viral image | [] | Walmart, Amazon, Kroger, Target and Costco “ha... | false | 2020-05-01T16:05:36-04:00 |
| 18457 | 18457 | Tony Evers | [] | Says Wisconsin measures have "prevented the de... | mostly-true | 2020-05-01T12:54:52-04:00 |
| 18455 | 18455 | Facebook posts | [] | The existence of a canine coronavirus vaccine ... | false | 2020-05-01T11:56:25-04:00 |
| 18454 | 18454 | Facebook posts | [] | Says President Barack Obama "signed the medica... | pants-fire | 2020-05-01T11:00:00-04:00 |
| 18446 | 18446 | Facebook posts | [] | The CDC recommends that only people with COVID... | false | 2020-05-01T08:25:44-04:00 |
| 18452 | 18452 | Facebook posts | [] | “Pelosi was in (Wuhan) China 6 days after the ... | pants-fire | 2020-04-30T15:39:26-04:00 |
| 18451 | 18451 | Facebook posts | [Joe Biden] | Says a video shows Joe Biden lolling his tongue. | pants-fire | 2020-04-30T14:53:24-04:00 |


In [20]:
DF_Statements.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17614 entries, 18460 to 31
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                17614 non-null  object
 1   speaker           17614 non-null  object
 2   targets           17614 non-null  object
 3   statement         17614 non-null  object
 4   assessment        17614 non-null  object
 5   publication_date  17614 non-null  object
dtypes: object(6)
memory usage: 963.3+ KB
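Every column above is stored as `object`, including `id` and `publication_date`. One way to give them proper types is sketched below on a two-row stand-in frame; the conversion to UTC is an assumption made so that all timestamps share one timezone.

```python
import pandas as pd

# Two-row stand-in with the same object dtypes as the downloaded frame
df_typed = pd.DataFrame({
    "id": pd.Series([18460, 18451], dtype="object"),
    "publication_date": ["2020-05-01T17:34:36-04:00",
                         "2020-04-30T14:53:24-04:00"],
})

df_typed["id"] = df_typed["id"].astype("int64")
# utc=True normalizes every offset to UTC, giving a single tz-aware dtype
df_typed["publication_date"] = pd.to_datetime(df_typed["publication_date"], utc=True)

print(df_typed.dtypes)
```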


## 1.6. Saving data to a file

In [22]:
DF_Statements.to_csv('./Data/Statements.csv', index=False) # write to CSV

In [23]:
DF_Statements.to_json('./Data/Statements.json') # write to JSON

In [24]:
DF_csv = pd.read_csv('./Data/Statements.csv',index_col = 0) # test reading
DF_csv.head(10)

| id | speaker | targets | statement | assessment | publication_date |
|---|---|---|---|---|---|
| 18460 | Rudy Giuliani | [] | "Why did the US (NIH) in 2017 give $3.7m to th... | false | 2020-05-01T17:34:36-04:00 |
| 18456 | Facebook posts | [] | Homeless people are immune to COVID-19 | pants-fire | 2020-05-01T17:19:18-04:00 |
| 18459 | Facebook posts | [] | Says President Donald Trump is selling coronav... | false | 2020-05-01T17:18:14-04:00 |
| 18458 | Viral image | [] | Walmart, Amazon, Kroger, Target and Costco “ha... | false | 2020-05-01T16:05:36-04:00 |
| 18457 | Tony Evers | [] | Says Wisconsin measures have "prevented the de... | mostly-true | 2020-05-01T12:54:52-04:00 |
| 18455 | Facebook posts | [] | The existence of a canine coronavirus vaccine ... | false | 2020-05-01T11:56:25-04:00 |
| 18454 | Facebook posts | [] | Says President Barack Obama "signed the medica... | pants-fire | 2020-05-01T11:00:00-04:00 |
| 18446 | Facebook posts | [] | The CDC recommends that only people with COVID... | false | 2020-05-01T08:25:44-04:00 |
| 18452 | Facebook posts | [] | “Pelosi was in (Wuhan) China 6 days after the ... | pants-fire | 2020-04-30T15:39:26-04:00 |
| 18451 | Facebook posts | ['Joe Biden'] | Says a video shows Joe Biden lolling his tongue. | pants-fire | 2020-04-30T14:53:24-04:00 |
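After the round trip through CSV, the `targets` column no longer holds lists but their string representations such as `"['Joe Biden']"`. A minimal sketch of turning them back into lists with the standard library, shown on two sample values:

```python
import ast

# String forms as they come back from the CSV file
raw_targets = ["[]", "['Joe Biden']"]

# literal_eval safely parses Python literals without executing code
parsed_targets = [ast.literal_eval(value) for value in raw_targets]

print(parsed_targets)  # [[], ['Joe Biden']]
```

With pandas, the same function could be applied at load time, for example via `converters={'targets': ast.literal_eval}` in `pd.read_csv`.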


In [25]:
DF_csv.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17614 entries, 18460 to 31
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   speaker           17614 non-null  object
 1   targets           17614 non-null  object
 2   statement         17614 non-null  object
 3   assessment        17614 non-null  object
 4   publication_date  17614 non-null  object
dtypes: object(5)
memory usage: 825.7+ KB


In [26]:
DF_csv.describe()

| | speaker | targets | statement | assessment | publication_date |
|---|---|---|---|---|---|
| count | 17614 | 17614 | 17614 | 17614 | 17614 |
| unique | 4229 | 877 | 17597 | 9 | 16382 |
| top | Donald Trump | [] | `<p>On a cap-and-trade plan.</p>\r\n` | false | 2008-09-04T00:00:00-04:00 |
| freq | 795 | 13700 | 3 | 3656 | 8 |
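The most frequent statement above still contains HTML tags and trailing line breaks, so some statements will likely need cleaning before modeling. A rough sketch, assuming a simple regular expression is good enough for this markup (`clean_statement` is a hypothetical helper, not part of the notebook):

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # naive matcher for HTML tags

def clean_statement(text):
    """Strip HTML tags, decode entities and trim surrounding whitespace."""
    without_tags = TAG_PATTERN.sub("", text)
    return html.unescape(without_tags).strip()

cleaned = clean_statement("<p>On a cap-and-trade plan.</p>\r\n")
print(cleaned)  # On a cap-and-trade plan.
```

For heavier markup, a real HTML parser would be a safer choice than a regular expression.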


In [27]:
DF_json = pd.read_json('./Data/Statements.json')
DF_json.head(10)

| | id | speaker | targets | statement | assessment | publication_date |
|---|---|---|---|---|---|---|
| 18460 | 18460 | Rudy Giuliani | [] | "Why did the US (NIH) in 2017 give $3.7m to th... | false | 2020-05-01T17:34:36-04:00 |
| 18456 | 18456 | Facebook posts | [] | Homeless people are immune to COVID-19 | pants-fire | 2020-05-01T17:19:18-04:00 |
| 18459 | 18459 | Facebook posts | [] | Says President Donald Trump is selling coronav... | false | 2020-05-01T17:18:14-04:00 |
| 18458 | 18458 | Viral image | [] | Walmart, Amazon, Kroger, Target and Costco “ha... | false | 2020-05-01T16:05:36-04:00 |
| 18457 | 18457 | Tony Evers | [] | Says Wisconsin measures have "prevented the de... | mostly-true | 2020-05-01T12:54:52-04:00 |
| 18455 | 18455 | Facebook posts | [] | The existence of a canine coronavirus vaccine ... | false | 2020-05-01T11:56:25-04:00 |
| 18454 | 18454 | Facebook posts | [] | Says President Barack Obama "signed the medica... | pants-fire | 2020-05-01T11:00:00-04:00 |
| 18446 | 18446 | Facebook posts | [] | The CDC recommends that only people with COVID... | false | 2020-05-01T08:25:44-04:00 |
| 18452 | 18452 | Facebook posts | [] | “Pelosi was in (Wuhan) China 6 days after the ... | pants-fire | 2020-04-30T15:39:26-04:00 |
| 18451 | 18451 | Facebook posts | [Joe Biden] | Says a video shows Joe Biden lolling his tongue. | pants-fire | 2020-04-30T14:53:24-04:00 |


In [28]:
DF_json.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17614 entries, 18460 to 31
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                17614 non-null  int64 
 1   speaker           17614 non-null  object
 2   targets           17614 non-null  object
 3   statement         17614 non-null  object
 4   assessment        17614 non-null  object
 5   publication_date  17614 non-null  object
dtypes: int64(1), object(5)
memory usage: 963.3+ KB
