# Python Immersion Course

### Data Collection with Python - Part 3

### Joe Blankenship - Just some dude

Once you have a good grasp of Python's basic functionality, you can interact with a number of data sources. This section focuses on the basics of extracting, transforming, and loading data from several formats into dataframes for analysis. Data manipulation inside the dataframes is saved for Part 4.

<hr>

## Web Scraping

<hr>

The first scraper we'll build will use core Python libraries to:

* Go to an HTTP website
* Gather the source code
* Print the output

In [None]:
# Here we'll import the urllib, io, and pprint modules to obtain our data

from urllib.request import Request, urlopen
from io import TextIOWrapper
from pprint import pprint

# Declare the URL
url = 'https://en.wikipedia.org/wiki/Department_of_Geography,_University_of_Kentucky'

# Open the URL
page = Request(url)
page_content = urlopen(page)
# page_content.read()

# Buffer our text stream from the website
# (decode explicitly as UTF-8 instead of relying on the platform default)
page_data = TextIOWrapper(page_content, encoding='utf-8')

# pprint out our data
for row in page_data:
    pprint(row)
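To see what `TextIOWrapper` is doing in the cell above without hitting the network, here is a minimal sketch that wraps an in-memory byte stream instead of a live response. The sample bytes and the name `fake_response` are made up for illustration.

```python
from io import BytesIO, TextIOWrapper

# Stand-in for the byte stream that urlopen() returns
# (made-up sample bytes, not real page content)
fake_response = BytesIO('<html>\n<body>caf\u00e9</body>\n</html>\n'.encode('utf-8'))

# Wrap the bytes so we can iterate over decoded lines of text
text_stream = TextIOWrapper(fake_response, encoding='utf-8')

# Collect each line without its trailing newline
lines = [row.rstrip('\n') for row in text_stream]

for row in lines:
    print(row)
```

The wrapper handles the bytes-to-text decoding, which is why the loop yields `str` lines rather than `bytes`.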

However, we may want something a bit more elegant. This is where `requests` and `beautifulsoup` come in to help us out.

In [None]:
# Import requests and beautifulsoup
# Import pandas, we'll use that at the end
import requests
from bs4 import BeautifulSoup
import pandas as pd

# we are going to scrape crime data from the University of Kentucky (UK) crime log at http://www.uky.edu/crimelog/
# substitute variables to fill in the query string criteria
start_month, start_day, start_year = 1, 1, 2018
end_month, end_day, end_year = 10, 4, 2018
crime_data_raw = requests.get('http://www.uky.edu/crimelog/log?field_log_category_value=All' +
                              '&field_log_report_value%5Bmin%5D%5Bmonth%5D=' + str(start_month) +
                              '&field_log_report_value%5Bmin%5D%5Bday%5D=' + str(start_day) +
                              '&field_log_report_value%5Bmin%5D%5Byear%5D=' + str(start_year) +
                              '&field_log_report_value%5Bmax%5D%5Bmonth%5D=' + str(end_month) +
                              '&field_log_report_value%5Bmax%5D%5Bday%5D=' + str(end_day) +
                              '&field_log_report_value%5Bmax%5D%5Byear%5D=' + str(end_year)
                             )
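The concatenation above hand-encodes the square brackets in each parameter name as `%5B` and `%5D`. As a rough sketch of an alternative, the same query string can be built from a dict with the standard library's `urlencode`. The helper name `build_crime_log_url` and its arguments are hypothetical, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.uky.edu/crimelog/log'


def build_crime_log_url(start, end, category='All'):
    """Build the crime log URL from (month, day, year) tuples.

    Hypothetical helper for illustration only.
    """
    params = {'field_log_category_value': category}
    for bound, date in (('min', start), ('max', end)):
        for part, value in zip(('month', 'day', 'year'), date):
            # urlencode percent-encodes the brackets for us
            params[f'field_log_report_value[{bound}][{part}]'] = value
    return BASE_URL + '?' + urlencode(params)


crime_log_url = build_crime_log_url((1, 1, 2018), (10, 4, 2018))
print(crime_log_url)
```

`requests.get` also accepts a `params` dict, which should produce an equivalent query, so the dict could be handed to it directly instead of building the URL by hand.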


In [None]:
# create a soup object 
crime_bs_proc = BeautifulSoup((crime_data_raw.text), "html5lib")

In [None]:
# create a filter for our soup object to pull out the table
crime_data_table = crime_bs_proc.find('table', {'class': 'views-table cols-8'})
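One thing to keep in mind with the filter above: passing a class string that contains a space matches the exact value of the `class` attribute, so the order of the class names matters. The sketch below uses made-up HTML (not the real crime log page) and the built-in `html.parser` backend to show the difference from a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page with two tables
sample_html = """
<table class="views-table cols-8"><tbody><tr><td>a</td></tr></tbody></table>
<table class="cols-2 views-table"><tbody><tr><td>b</td></tr></tbody></table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact string match on the class attribute: finds the first table
exact = soup.find('table', {'class': 'views-table cols-8'})

# Same class names in a different order: no match
reordered = soup.find('table', {'class': 'cols-8 views-table'})

# CSS selector: matches any table carrying the class, regardless of order
by_selector = soup.select('table.views-table')

print(exact.td.text, reordered, len(by_selector))
```

If the site ever reorders or adds class names, a selector such as `select_one('table.views-table')` may be the more forgiving choice.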

In [None]:
# find the table header in the data
crime_data_header = crime_data_table.find('thead')

In [None]:
# find all the table headers
crime_data_heads = crime_data_header.find_all('th')

In [None]:
# create an empty list for the header
header = []

# iterate through the header element to get text
for col in crime_data_heads:
    cols = col.find_all('a')
    cols = [ele.text.strip() for ele in cols]
    header.append([ele for ele in cols if ele])

# flatten the list to a single list
header = [item for sublist in header for item in sublist]

In [None]:
# find the table rows in the data
crime_data_body = crime_data_table.find('tbody')

In [None]:
# find all table rows
crime_data_rows = crime_data_body.find_all('tr')

In [None]:
# create an empty list for the rows of data
data = []

# iterate through the row elements to get the text of each cell
for row in crime_data_rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
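Note that the `if ele` filter in the loop above drops empty cells. The source does not say whether the crime log ever has blank fields, but if it does, the remaining values in that row shift left and no longer line up with the header. A small sketch with made-up cell text shows the effect:

```python
# Made-up cell text for two table rows; the second has a blank middle field
raw_rows = [
    ['2018-01-01', 'Theft', 'Closed'],
    ['2018-01-02', '', 'Open'],
]

# Filtering out empty strings shortens the second row
filtered = [[cell for cell in row if cell] for row in raw_rows]

# Keeping every cell preserves the column positions
kept = [list(row) for row in raw_rows]

print(filtered[1])
print(kept[1])
```

If misaligned columns show up in the final dataframe, dropping the `if ele` filter is one thing to try.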

In [None]:
# create a dataframe with our data using our header list
uk_crime_data = pd.DataFrame(data, columns=header)
uk_crime_data.head()

For larger scraping projects, such as crawling many pages, there is also the `scrapy` library in Python, a full crawling framework that comes with more setup and more considerations.

<hr>

## APIs

<hr>

APIs often have 'wrappers' in Python that you can use to interface with the underlying data.

Here we will use the data.world API to import some data.

  * docs at https://github.com/datadotworld/data.world-py

Before running the cells below, load your data.world API credentials into your active virtual env by running this in the terminal:

`export DW_AUTH_TOKEN=<YOUR_TOKEN>`
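A quick way to confirm that the token is visible to Python is to check the environment from inside the notebook. This is only a sketch; the helper name `has_dw_token` is hypothetical and not part of the data.world library.

```python
import os


def has_dw_token(environ=os.environ):
    """Return True if DW_AUTH_TOKEN is set to a non-empty value.

    Hypothetical helper for illustration only.
    """
    return bool(environ.get('DW_AUTH_TOKEN'))


if not has_dw_token():
    print('DW_AUTH_TOKEN is not set; export it in the terminal before starting the notebook.')
```

Because environment variables are inherited when a process starts, the export has to happen in the same terminal session, before the notebook server is launched.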

In [None]:
# import our API library

import datadotworld as dw

In [None]:
# load our data sets from the API using a known user data collection

lex_business_health = dw.load_dataset('inform8n/most-recent-lexington-ky-health-department-inspection-scores')

In [None]:
# list the dataframes available in the data set collection

lex_business_health.dataframes

In [None]:
# load a data set into a dataframe from the data collection

food_scores = lex_business_health.dataframes.get('most_recent_food_scores')
food_scores.head(5)

<hr>

## Flat files

<hr>

<hr>

## Streaming

<hr>

<hr>

## Resources

<hr>

**Note:** A lot of the open-source materials are provided by people who develop those materials for a living. So please consider sending them a thank you and, if you can, a few bucks to support their efforts. Thanks! :)

* 