## Task 2 


### Task 2.1
Get list of US politicians with political affiliation. 

2 sources:
- https://github.com/casmlab/politicians-tweets 
- https://www.congress.gov/members?q={%22congress%22:[%22110%22,%22111%22,%22112%22,%22113%22,%22114%22,%22115%22,%22116%22,117]}

Take 1st list, keep politicians whose affiliation is known. Then merge with congress list to be sure we have a good dataset.
List of politicians is in `data/ressources/politicians.json`.

In [9]:
import os 
import json
import time
import csv

from tqdm import tqdm
import requests
from bs4 import BeautifulSoup

In [29]:
def to_csv(file_name: str, pol_lst: list) -> None:
    """
    Write list to file to keep trace
    """

    csv_path = os.path.join("data", "resources", file_name)

    with open(csv_path, "w") as f:
        writer = csv.writer(f, delimiter=" ")
        writer.writerow(["Name", "Party"])

        for member in pol_lst:
            writer.writerow([el for el in member])
        

#### Github dataset

In [14]:
file_name = "politicians_github.json"
file_path = os.path.join("data", "resources", file_name)

with open(file_path, "r") as f:
    json = json.load(f)

len(json)

15

In [15]:
json.keys()

dict_keys(['id', 'id_str', 'screen_name', 'confirmed_account_type', 'state', 'twitter_name', 'real_name', 'bioguide', 'office_holder', 'party', 'district', 'level', 'woman', 'birthday', 'last_updated'])

In [35]:
json["office_holder"]["2"]

In [41]:
# Only keep politicians with politicl affiliation

politicians = []

for i in tqdm(range(1, len(json["id"]))):
    i = str(i)
    affiliation = json["party"][i]
    screen_name = json["screen_name"][i]
    elected = json["office_holder"][i] is not None

    if affiliation is not None and affiliation in ("Republican", "Democratic"):
        politicians.append((json["real_name"][i], affiliation, elected))
    elif screen_name == "realdonaldtrump":
        politicians.append(("Donald Trump", "Republican", True))
    elif screen_name == "barackobama":
        politicians.append(("Barack Obama", "Democratic", True))


100%|██████████| 9979/9979 [00:00<00:00, 134865.04it/s]


In [46]:
sum(pol[-1] for pol in politicians)

1107

All politicians are in Congress!

In [30]:
print(f"{len(politicians)=}") 
politicians[:10]

len(politicians)=1107


[('Mark Green', 'Republican'),
 ('Pete Stauber', 'Republican'),
 ('Derek Kilmer', 'Democratic'),
 ('Andy Harris', 'Republican'),
 ('Donald Payne', 'Democratic'),
 ('A. Ferguson', 'Republican'),
 ('Richard Hudson', 'Republican'),
 ('Edward Markey', 'Democratic'),
 ('Bobby Rush', 'Democratic'),
 ('Gregory Meeks', 'Democratic')]

In [31]:
# Write to file
to_csv("politicians_github.csv", politicians)

#### US Congress dataset

In [2]:
URL = 'https://www.congress.gov/members?q={"congress":["110","111","112","113","114","115","116",117]}&pageSize=250'

In [3]:
congress_members = []

r  = requests.get(URL, params={"page": 1})
soup = BeautifulSoup(r.text, "html.parser")

members = soup.find_all("li", class_="compact")

In [74]:
congress_members = []
for member in members:
    # Scrape the information
    items = member.find_all("span", class_="result-item")
    name = sanitize_name(member.span.a.text)
    
    for item in items:
        if item.strong.text == "Party:":
            affiliation = item.span.text

    congress_members.append((name, affiliation))

congress_members[:10]

[('Neil  Abercrombie', 'Democratic'),
 ('Ralph L  Abraham', 'Republican'),
 ('Gary L.  Ackerman', 'Democratic'),
 ('Alma S.  Adams', 'Democratic'),
 ('Sandy  Adams', 'Republican'),
 ('Robert B.  Aderholt', 'Republican'),
 ('John H.  Adler', 'Democratic'),
 ('P  Aguilar', 'Democratic'),
 ('Daniel K.  Akaka', 'Democratic'),
 ('W. Todd  Akin', 'Republican')]

In [4]:
def sanitize_name(name: str) -> str:
    """
    Strip and clean name.
    "Senator Cruz, Ted" -> "Ted Cruz"
    """

    for element in ("Representative", "Senator"):
        name = name.strip(element)

    name = " ".join(name.split(",")[::-1])
    name = name.strip()
    return name

# name = "Senator Cruz, Ted"
# sanitize_name(name)

In [5]:
congress_members = []

# Download each congress page
for page_number in tqdm(range(1, 6)):
    r  = requests.get(URL, params={"page": page_number})
    soup = BeautifulSoup(r.text, "html.parser")

    members = soup.find_all("li", class_="compact")

    for member in members:
        # Scrape the information
        items = member.find_all("span", class_="result-item")
        name = sanitize_name(member.span.a.text)
        
        for item in items:
            if item.strong.text == "Party:":
                affiliation = item.span.text

        congress_members.append((name, affiliation))

100%|██████████| 5/5 [00:14<00:00,  2.80s/it]


In [12]:
# Sanity check
len(congress_members) == 1158  # Nice

True

In [32]:
# Write to file
to_csv("politicians_congress.csv", politicians)

#### Compare lists

Actually might not be useful (and less of a headache) to just take the congress, since almost kept politicians from the github list are elected (congress members).