# Homework 1.5: Scraping Review 🏋️‍🏋️‍🏋️‍

For this assignment, you will be scraping an API and a live website.

### Table of Contents
1. CFPB API
2. Microworkers

## Prelude: Importing Your Libraries 
The *first* first thing we're going to do is make sure we're all set up and ready to go. 

That means importing some libraries! I've got a cell below all ready for you to put in some libraries.

Remember that third-party libraries need to be installed before you can import them.
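
If you're not sure whether a library is installed, here is a minimal sketch of one way to check before importing. The helper name `is_installed` is made up for this example; `requests` and `bs4` are the import names used in this assignment.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# "requests" and "bs4" are the import names used in this assignment
for name in ["requests", "bs4"]:
    if is_installed(name):
        print(name, "is installed")
    else:
        print(name, "is missing: try installing it with pip")
```

Note that the install name can differ from the import name: `bs4` is installed with `pip install beautifulsoup4`.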

In [1]:
# Import native libraries
import datetime
import csv

# Import third party libraries
import requests
from bs4 import BeautifulSoup

## Part One: Scraping the CFPB

The Consumer Financial Protection Bureau (CFPB) was founded in the aftermath of the 2008 financial crisis. One of the things it does is collect complaints from consumers about bad banks, lenders and other financial institutions. This complaint data is available to the public in many forms. While you can download a big, horrible CSV file with _all_ of the data, it's usually easier (for you and your computer's memory) to use the API to get only the data you need.

The CFPB uses [Socrata](https://www.tylertech.com/products/socrata) to manage their API. Socrata is a company that helps a lot of public agencies share their data with the rest of the world. The way they have you request data is kind of funky, but we will persevere together!

[The homepage for the Consumer Complaint Database](https://cfpb.github.io/api/ccdb/index.html)<br>[API Reference](https://dev.socrata.com/foundry/data.consumerfinance.gov/s6ew-h6mp)

### 1. Doing a single request

#### Open the page as a JSON object with Requests

In [2]:
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"
page = requests.get(cfpb_endpoint)
page_content = page.json()

#### Print the first item in the returned list

**NOTE!** Computers are weird, and they start counting with **zero** instead of **one**. To access the first value in a list, you must call `list_variable[0]`–not `list_variable[1]`!
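
Here is a tiny illustration of zero-based counting, using a made-up list (not the API data):

```python
# A made-up list, just to show how positions are counted
states = ["NY", "NJ", "NV"]

first = states[0]   # position 0 is the first item
second = states[1]  # position 1 is the second item
last = states[-1]   # a negative index counts from the end

print(first, second, last)
```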

In [3]:
page_content[0]

{'date_received': '2019-10-18T00:00:00.000',
 'product': 'Credit reporting, credit repair services, or other personal consumer reports',
 'sub_product': 'Credit reporting',
 'issue': 'Incorrect information on your report',
 'sub_issue': 'Account status incorrect',
 'company': 'FREEDOM FINANCIAL NETWORK',
 'state': 'CA',
 'zip_code': '925XX',
 'submitted_via': 'Web',
 'date_sent_to_company': '2019-10-18T00:00:00.000',
 'company_response': 'In progress',
 'timely': 'Yes',
 'consumer_disputed': 'N/A',
 'complaint_id': '3410006'}

Notice that there are a lot of fields!


### 2. Getting a lil bit more specific

When we get complaint data from just the endpoint, we are getting ALL the data–it's basically a firehose! However, we don't actually want all the complaints submitted to the CFPB! We only want specific kinds! 

In fact, we only want complaints that fit these criteria:
- The consumer is based in the state of New York
- It was received by the CFPB between January 1, 2018 and January 1, 2019
- It is about the product "Debt collection" and the sub-product "Mortgage debt"

Using the `cfpb_endpoint`, we will build a url that requests just these kinds of complaints!

We will first filter by each thing, and then write a url that filters all three at the same time! Woah!

#### Filtering by state

Look back at the piece of data we printed in Step 1. How can you tell which state the complaint is from? How are they formatting the state names–is it the full name, or an abbreviation of some sort? Consider checking out the [API documentation](https://dev.socrata.com/foundry/data.consumerfinance.gov/s6ew-h6mp)'s "Fields" section if you're feeling a little lost.

In [4]:
state_filter = cfpb_endpoint + "?state=NY"
page = requests.get(state_filter)
page_content = page.json()
page_content[0]

{'date_received': '2019-10-17T00:00:00.000',
 'product': 'Credit reporting, credit repair services, or other personal consumer reports',
 'sub_product': 'Credit reporting',
 'issue': 'Incorrect information on your report',
 'sub_issue': 'Account information incorrect',
 'company_public_response': 'Company has responded to the consumer and the CFPB and chooses not to provide a public response',
 'company': 'TRANSUNION INTERMEDIATE HOLDINGS, INC.',
 'state': 'NY',
 'zip_code': '10467',
 'consumer_consent_provided': 'Consent not provided',
 'submitted_via': 'Web',
 'date_sent_to_company': '2019-10-17T00:00:00.000',
 'company_response': 'Closed with explanation',
 'timely': 'Yes',
 'consumer_disputed': 'N/A',
 'complaint_id': '3409533'}

#### Filtering by date range

Read the [between...and...](https://dev.socrata.com/docs/functions/between.html) page in the API documentation. This will explain how to query for complaints within a particular timeframe! Now use that knowledge to call all the complaints between January 1, 2018 and January 1, 2019!

In [5]:
date_filter = cfpb_endpoint + "?$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'"
page = requests.get(date_filter)
page_content = page.json()
len(page_content)
page_content[0]

{'date_received': '2018-01-01T00:00:00.000',
 'product': 'Debt collection',
 'sub_product': 'I do not know',
 'issue': 'Attempts to collect debt not owed',
 'sub_issue': 'Debt is not yours',
 'complaint_what_happened': "I 've sent letters to Viking Client Services requesting specifics on this alleged debt. I do not have a contract with Viking or XXXX XXXX XXXX. I have not received nor made payments to either of these companies.",
 'company': 'Viking Client Services',
 'state': 'TX',
 'zip_code': '761XX',
 'consumer_consent_provided': 'Consent provided',
 'submitted_via': 'Web',
 'date_sent_to_company': '2018-01-01T00:00:00.000',
 'company_response': 'Closed with explanation',
 'timely': 'Yes',
 'consumer_disputed': 'N/A',
 'complaint_id': '2768843'}
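
If you would rather not type the timestamps by hand, the `datetime` library we imported can build them. This is a minimal sketch, not part of the required solution; the helper name `between_clause` is made up for this example.

```python
import datetime


def between_clause(field, start, end):
    """Build a "between ... and ..." condition from two datetime objects."""
    # isoformat() turns datetime(2018, 1, 1) into '2018-01-01T00:00:00'
    return f"{field} between '{start.isoformat()}' and '{end.isoformat()}'"


where = between_clause(
    "date_received",
    datetime.datetime(2018, 1, 1),
    datetime.datetime(2019, 1, 1),
)
print(where)
```

The resulting string is the same condition that follows `$where=` in the date filter.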

#### Filtering by sub-product

In [6]:
sub_product_filter = cfpb_endpoint + "?sub_product='Mortgage debt'"
page = requests.get(sub_product_filter)
page_content = page.json()
page_content[0]

{'date_received': '2019-08-18T00:00:00.000',
 'product': 'Debt collection',
 'sub_product': 'Mortgage debt',
 'issue': 'Attempts to collect debt not owed',
 'sub_issue': 'Debt was result of identity theft',
 'company': 'Caliber Home Loans, Inc.',
 'state': 'MI',
 'zip_code': '490XX',
 'tags': 'Servicemember',
 'consumer_consent_provided': 'Consent not provided',
 'submitted_via': 'Web',
 'date_sent_to_company': '2019-08-21T00:00:00.000',
 'company_response': 'Closed with explanation',
 'timely': 'Yes',
 'consumer_disputed': 'N/A',
 'complaint_id': '3345527'}

#### Putting it all together

Now that you've gotten data from each *individual* filter, let's combine them! You can use multiple filters by sticking an `&` between them.

In [7]:
whole_thing = cfpb_endpoint + "?sub_product=Mortgage debt&state=NY" + "&$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'"
page = requests.get(whole_thing)
page_content = page.json()
len(page_content)

68

**Gutcheck:** Count how many items you get back using the `len()` function. Is it 68? You're good to go!
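
Gluing strings together with `&` works, but it is easy to slip up on spaces and quotes. As a hedged alternative, here is a sketch that builds the same query from a dictionary using Python's built-in `urllib.parse`. The variable name `filters` is made up for this example.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"

# The same three filters, written as a dictionary
filters = {
    "sub_product": "Mortgage debt",
    "state": "NY",
    "$where": "date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'",
}

# urlencode joins the pairs with "&" and escapes spaces and quotes;
# safe="$" leaves the "$" in "$where" as-is
query = urlencode(filters, safe="$", quote_via=quote)
whole_thing = cfpb_endpoint + "?" + query
print(whole_thing)

# Decoding the query string gives back the original filters
decoded = parse_qs(urlsplit(whole_thing).query)
print(decoded["$where"][0])
```

Requests can also do this encoding for you if you pass the dictionary as `params=` to `requests.get()`.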

### 3. Saving the data into a CSV file

Now that we have a beautifully crafted URL that gives us all the data we want, let's save it in a CSV file so we can open it up in ｡･:*:･ﾟ★,｡･:*:･ﾟ☆𝔰𝔭𝔯𝔢𝔞𝔡𝔰𝔥𝔢𝔢𝔱 𝔣𝔬𝔯𝔪｡･:*:･ﾟ★,｡･:*:･ﾟ☆.

#### Save the data to a file called `"../output/2018_NY_mortgage_complaints.csv"`

In [8]:
with open("../output/2018_NY_mortgage_complaints.csv", "w", newline="") as f:
    headers = ['date_received', 'product', 'sub_product', 'issue', 'sub_issue', 'complaint_what_happened', 'company_public_response', 'company', 'state', 'zip_code', 'consumer_consent_provided', 'submitted_via', 'date_sent_to_company', 'company_response', 'timely', 'tags', 'consumer_disputed', 'complaint_id']
    writer = csv.DictWriter(f, headers)
    writer.writeheader()
    writer.writerows(page_content)
    

In [9]:
keys = set()
for content in page_content:
    for content_key in content.keys():
        keys.add(content_key)
keys = list(keys)
print(keys)

['state', 'company', 'tags', 'company_response', 'consumer_consent_provided', 'complaint_what_happened', 'date_sent_to_company', 'issue', 'zip_code', 'complaint_id', 'submitted_via', 'consumer_disputed', 'date_received', 'company_public_response', 'sub_product', 'sub_issue', 'product', 'timely']
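
Not every complaint has every field, which is why the header list has to cover all of them. Here is a minimal sketch of collecting the headers and writing the rows in one go, using two made-up records and an in-memory file instead of the real data:

```python
import csv
import io

# Two made-up records with different fields, like the API responses
records = [
    {"complaint_id": "1", "state": "NY", "tags": "Servicemember"},
    {"complaint_id": "2", "state": "NY", "company": "EXAMPLE BANK"},
]

# Collect every field name, keeping the order in which each first appears
headers = []
for record in records:
    for key in record:
        if key not in headers:
            headers.append(key)

# io.StringIO stands in for a real file so the result can be printed
f = io.StringIO()
writer = csv.DictWriter(f, fieldnames=headers, restval="")
writer.writeheader()
writer.writerows(records)  # fields missing from a record are left blank
csv_text = f.getvalue()
print(csv_text)
```

Using a list instead of a `set()` keeps the columns in a predictable order from one run to the next.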


### Bonus: Collect mortgage complaints from multiple states!
**For an extra point:** write a script that loops through the list of states below, downloads all complaints received between January 1, 2018 and January 1, 2019 about the sub-product "Mortgage debt", and saves each state's complaints into its own CSV with the filename format `../output/2018_STATENAME_mortgage_complaints.csv`.

In [10]:
states = ['NY', 'NJ', 'NV', 'ND', 'NM', 'NC']

In [11]:
for state in states:
    
    whole_thing = cfpb_endpoint + "?sub_product=Mortgage debt&state=" + state + "&$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'"
    page = requests.get(whole_thing)
    page_content = page.json()
    
    with open("../output/2018_" + state + "_mortgage_complaints.csv", "w", newline="") as f:
        headers = ['date_received', 'product', 'sub_product', 'issue', 'sub_issue', 'complaint_what_happened', 'company_public_response', 'company', 'state', 'zip_code', 'consumer_consent_provided', 'submitted_via', 'date_sent_to_company', 'company_response', 'timely', 'tags', 'consumer_disputed', 'complaint_id']
        writer = csv.DictWriter(f, headers)
        writer.writeheader()
        writer.writerows(page_content)
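
Before running a loop that downloads and writes files, it can help to do a dry run that only prints what it *would* do. This is a sketch under that assumption; the helper name `build_request` is made up for this example.

```python
cfpb_endpoint = "https://data.consumerfinance.gov/resource/s6ew-h6mp.json"
states = ['NY', 'NJ', 'NV', 'ND', 'NM', 'NC']


def build_request(state):
    """Return the (url, output path) pair for one state."""
    url = (
        cfpb_endpoint
        + "?sub_product=Mortgage debt&state=" + state
        + "&$where=date_received between '2018-01-01T00:00:00' and '2019-01-01T00:00:00'"
    )
    path = "../output/2018_" + state + "_mortgage_complaints.csv"
    return url, path


# Print the plan without downloading anything
for state in states:
    url, path = build_request(state)
    print(state, "->", path)
```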

## Part Two: Scraping Microworkers.com

For Part Two, you will be scraping an archive I've made of [Microworkers](https://www.microworkers.com/), a site that pays small amounts of money for the completion of short tasks. I have archived their "Twitter" job listings.

You will have to:
1. Scrape the homepage for links to each job listing
2. Figure out how to scrape a single job listing
3. Apply the knowledge you learned from **(2)** to each link from **(1)**

The link to the archive is here:<br>
**[http://maddy.zone/microworker/index.html](http://maddy.zone/microworker/index.html)**

### 1. Scraping the homepage

**NOTE!** This part, and the part after (scraping a single job), should give you all the code you need for the last part (scraping each job).

#### Open the homepage using Requests

In [12]:
url = "http://maddy.zone/microworker/index.html"
page = requests.get(url)
page_content = page.content


#### Parse the page using BeautifulSoup

In [13]:
soup = BeautifulSoup(page_content, "html.parser")
listings = soup.find_all('div', class_='jobname')

#### Isolate each job listing url and add them to an array

In [14]:
links = []
for listing in listings:
    href = (listing.find('a',href=True))
    if href is not None:
        links.append(href['href'])
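
The links on the homepage are relative (just a filename), so they need the rest of the address before Requests can open them. Here is a minimal sketch of one way to do that with the built-in `urllib.parse.urljoin`; the variable names are made up for this example.

```python
from urllib.parse import urljoin

homepage = "http://maddy.zone/microworker/index.html"

# One of the relative links found on the homepage
link = "54y2h5e4j5c4z213o503w2b4.html"

# urljoin resolves the link relative to the page it was found on
full_url = urljoin(homepage, link)
print(full_url)
```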

### 2. Scraping a single job listing

![screenshot of the linked page](example.png)

For each page, we will collect **five** different pieces of information:
1. Job title
2. Job ID
3. Employer ID
4. Payment
5. Description

But scraping them all at once can be overwhelming! Let's scrape a single listing first. For some of the pieces of information, you might want to look into the `.replace()` and `.strip()` functions for strings.

#### Open `http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html` using Requests

In [15]:
url = "http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html"
page = requests.get(url)
page_content = page.content

#### Parse the page using BeautifulSoup

In [16]:
soup = BeautifulSoup(page_content, "html.parser")
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<title>Microworkers - work &amp; earn or offer a micro job</title>
<meta content="make money, make money at home, make money from home, make money on the internet, make extra money, make money online, make money home based business, work at home, work from home, work from home data entry, work at home jobs, work at home opportunity, work from home jobs, work at home business, work at home moms, work at home business opportunity, at home work, work online, computer work at home, online jobs work from home, work at home mom, work from home business opportunity, legitimate work at home jobs, temporary jobs, online temporary jobs, temporary job search, earn money from home, earn money at home, earn money online." name="keywords"/>
<meta content="The online market place for work. We give businesses and developers access to an on-demand scalable workforce. Workers can work at home and make money by choosing from thousands of tasks and jobs." name="des

#### Isolate the job title

In [17]:
job_title = soup.find('h1').get_text()
print(job_title)

DE Shaw Twitter: Follow + Retweet


#### Isolate the job id

**NOTE!** Many of you did an awesome job using `.strip()` to get rid of extra whitespace at the ends of each variable. But there's so much more you can do! Check out how I used the `.replace()` function below to get rid of that weird "Job ID:" label. You can read more about how to use it [here](https://www.geeksforgeeks.org/python-string-replace/).

In [18]:
job_id = soup.find('div', class_='jobdetailsnoteleft').find_all('p')[3].get_text().replace("Job ID:", "").strip()
print(job_id)

b1befe34f477
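
To see what `.replace()` and `.strip()` are each doing, here they are one step at a time on a made-up string (the exact whitespace on the real page is an assumption):

```python
# A made-up string shaped like the text of the "Job ID" paragraph
raw_text = "\n  Job ID: b1befe34f477  \n"

# .replace() swaps the label for an empty string...
without_label = raw_text.replace("Job ID:", "")

# ...and .strip() trims the whitespace left at both ends
job_id = without_label.strip()
print(job_id)
```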


#### Isolate the employer id

In [19]:
employer_id = soup.find('div', class_="jobdetailsnoteright").find('a',href=True).get_text()
print(employer_id)

Member_1014973


#### Isolate the payment

In [20]:
payment = soup.find('div', class_="jobdetailsnoteleft").find_all('strong')[1].get_text()
print(payment)

$0.75
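
The payment comes back as text, dollar sign and all. That's fine for this assignment, but if you ever want to sort or add up payments, here is an optional sketch of converting it to a number. The variable name `payment_amount` is made up for this example.

```python
from decimal import Decimal

payment = "$0.75"

# Drop the dollar sign, then convert the remaining text to an exact decimal
payment_amount = Decimal(payment.replace("$", "").strip())
print(payment_amount)
```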


#### Isolate the description

In [21]:
description = soup.find('div', class_='jobdetailsbox').find_all('p')[1].get_text()
print(description)

1. Go to this link - https://twitter.com/DEShawInsider/status/1176597146776289281
2. Follow this account on Twitter
3. Retweet this recent post
4. Take a screenshot of the repost


#### Store each of your variables into this dictionary

In [22]:
job_listing = {
    'job_title':         job_title,
    'job_id':            job_id,
    'employer_id':       employer_id, 
    'payment':           payment, 
    'description':       description,
} 

### 3. Scraping all of the linked pages

#### Make an empty array for your data

In [23]:
data = []

#### Loop through each of the listing links that you saved in Step 1, and...<br>    Use the code from Step 2 to get the data from each listing page<br>And add the dictionary you make to the array from above

**NOTE!** This step looks kind of intimidating, but if you were able to do the first and second part of the Microworkers scraper, you have everything you need to do this one!

In the first part, I scraped all the urls of each job link into a list called `links`. 

Then, I wrote a for loop that went through each link, and then went through all the code I wrote for the single job listing! 

I then added the `job_listing` dict we made to a list I made called `data`. I saved that list to a csv using `DictWriter()`.
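
The shape of that loop can be sketched without touching the network. In this example, `scrape_all` and `fake_scrape` are made-up names, and `fake_scrape` stands in for the real single-listing code. The `try`/`except` is an optional extra, assuming you'd rather skip a page that is missing a tag than have the whole loop stop.

```python
def scrape_all(links, scrape_one):
    """Call scrape_one on each link and collect the resulting dictionaries."""
    data = []
    skipped = []
    for link in links:
        try:
            data.append(scrape_one(link))
        except (AttributeError, IndexError):
            # .find() returned None or a list was shorter than expected
            skipped.append(link)
    return data, skipped


# A stand-in for the real scraping code, so this runs without the network
def fake_scrape(link):
    if link == "broken.html":
        raise AttributeError("page is missing an expected tag")
    return {"job_title": "Title for " + link}


data, skipped = scrape_all(["a.html", "broken.html", "b.html"], fake_scrape)
print(len(data), "scraped,", len(skipped), "skipped")
```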

In [25]:
for link in links:
    print(link)
    url = "http://maddy.zone/microworker/" + link
    
    print(url)
    page = requests.get(url)
    page_content = page.content
    soup = BeautifulSoup(page_content, "html.parser")
        
    job_title = soup.find('h1').get_text()
    employer_id = soup.find('div', class_="jobdetailsnoteright").find('a',href=True).get_text()
    job_id = soup.find('div', class_='jobdetailsnoteleft').find_all('p')[3].get_text().replace("Job ID:", "").strip()
    payment = soup.find('div', class_="jobdetailsnoteleft").find_all('strong')[1].get_text()
    description = soup.find('div', class_='jobdetailsbox').find_all('p')[1].get_text()
    
    job_listing = {
            'job_title':         job_title,
            'job_id':            job_id,
            'employer_id':       employer_id, 
            'payment':           payment, 
            'description':       description,
    } 
    data.append(job_listing)

54y2h5e4j5c4z213o503w2b4.html
http://maddy.zone/microworker/54y2h5e4j5c4z213o503w2b4.html
s253k5534494b4z2k54374m5.html
http://maddy.zone/microworker/s253k5534494b4z2k54374m5.html
s213k5339403y2d4f433q2d4.html
http://maddy.zone/microworker/s213k5339403y2d4f433q2d4.html
44c4a4335413x2b4b4y2v2h5.html
http://maddy.zone/microworker/44c4a4335413x2b4b4y2v2h5.html
545354c4a42394x264y2x294.html
http://maddy.zone/microworker/545354c4a42394x264y2x294.html
54y2l5a444v2233374e4p254.html
http://maddy.zone/microworker/54y2l5a444v2233374e4p254.html
x24364b414c413y2f423p2l5.html
http://maddy.zone/microworker/x24364b414c413y2f423p2l5.html
y233i543149433a494b4v274.html
http://maddy.zone/microworker/y233i543149433a494b4v274.html
446384x274y2y2f4a4b444a4.html
http://maddy.zone/microworker/446384x274y2y2f4a4b444a4.html
s253a4x214d45303k5d434d4.html
http://maddy.zone/microworker/s253a4x214d45303k5d434d4.html
t2c4440374x22343o5d474b4.html
http://maddy.zone/microworker/t2c4440374x22343o5d474b4.html
740334z2h5

### 4. Saving the data into a CSV file

🎉 Wooo! You have all of the data!

#### Print each row into a spreadsheet called `"../output/twitter_microworkers.csv"`

In [26]:
with open('../output/twitter_microworkers.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)
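
A quick way to gut-check a CSV is to read it back and compare it with what you wrote. Here is a minimal sketch using made-up rows and an in-memory file, so it doesn't overwrite anything in `../output/`:

```python
import csv
import io

# Made-up rows shaped like the job_listing dictionaries
data = [
    {"job_title": "Example job", "job_id": "abc123", "payment": "$0.75"},
    {"job_title": "Another job", "job_id": "def456", "payment": "$0.10"},
]

# Write to an in-memory file instead of ../output/ so nothing is overwritten
f = io.StringIO(newline="")
writer = csv.DictWriter(f, fieldnames=data[0].keys())
writer.writeheader()
writer.writerows(data)

# Read the text back and compare it with what we started with
f.seek(0)
rows_read = list(csv.DictReader(f))
print(len(rows_read) == len(data))
```

The same idea works on the real file: open it in `"r"` mode, pass it to `csv.DictReader`, and check that the number of rows matches `len(data)`.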