
# Loading & Saving Data

### Reading
- Murach's, Chapters 5 and 6
- https://wesmckinney.com/book/accessing-data.html

### Learning Outcomes

- Loading data from files
- Loading data from a database
- Loading data from the internet
- Saving data to files


# Loading Data into Pandas

Pandas can import structured data from a variety of file formats and data sources.

Files can be in plain text or in certain binary formats that Pandas recognizes.

Files can be on the same computer as your program or on a remote system.



## Pandas Data-loading Methods

Some common methods and file formats are:

- **read_csv** - read delimited values from .csv or .tsv files
- **read_html** - read tables from a .html file
- **read_excel** - read .xls, .xlsx Excel spreadsheets
- **read_json** - read JavaScript Object Notation (JSON) from .json files
- **read_sas** - read a dataset created by SAS
- **read_spss** - read a data file created by SPSS
- **read_xml** - read Extensible Markup Language (XML) from .xml files
- **read_sql** - read results of an SQL query
- **read_sql_table** - read a whole SQL table (equivalent to selecting everything from the table with `read_sql`)

These methods work best when the data is in a tabular form. If the data isn’t tabular (e.g. complex or nested data), the read method may raise an error or return a DataFrame that needs further reshaping.

### Data Loading Options

All the above methods convert input data into a DataFrame, but each accepts optional arguments that control how the data is interpreted.

Some common considerations are:

- **Indexing** - which columns to read and whether to get column names from the file
- **Type inference** - converting data to optimal types
- **Date & time parsing** - identifying date/time values and combining multiple columns into one
- **Chunked iteration** - reading large files in chunks
- **Handling dirty data** - Includes skipping rows or comments, formatting numeric data, etc.

See a full list of [options for reading CSV files](https://wesmckinney.com/book/accessing-data#tbl-table_read_csv_function).
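As a rough illustration of the options listed above, the sketch below reads a small CSV held in a string rather than a file. The data, column names, and the `unknown` marker are made up for the example; the keyword arguments are standard `read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a messy CSV file on disk
raw = """# exported from a hypothetical sensor log
id,reading,taken_on,note
1,10.5,2023-01-15,ok
2,unknown,2023-01-16,retest
3,7.25,2023-01-17,ok
"""

df = pd.read_csv(
    io.StringIO(raw),
    comment="#",                            # skip comment lines
    index_col="id",                         # use the id column as the row index
    usecols=["id", "reading", "taken_on"],  # read only these columns
    na_values=["unknown"],                  # treat this marker as missing
    parse_dates=["taken_on"],               # convert text to datetime values
)
print(df.dtypes)

# Chunked iteration: read the same data two rows at a time
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(raw), comment="#", chunksize=2)
]
print(chunk_sizes)
```

Because `unknown` is declared as a missing-value marker, the `reading` column can be inferred as a floating-point column instead of text.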

## Loading Data From Files

### Local Files

Pandas can read files located on the same computer using relative or absolute file paths.

In [None]:
import pandas as pd
# file in same directory
df = pd.read_csv("data.csv")

# file path relative to current directory
df = pd.read_csv("examples/data.csv")

# absolute file path
df = pd.read_csv("/usr/johndoe/examples/data.csv")

### Internet Files

In [3]:
import pandas as pd
# get example data from a public internet location
url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Year,Age Group,Death Rate
0,1900,1-4 Years,1983.8
1,1901,1-4 Years,1695.0
2,1902,1-4 Years,1655.7
3,1903,1-4 Years,1542.1
4,1904,1-4 Years,1591.5


Sometimes it's helpful to retrieve a file from the internet and save to disk before reading into a DataFrame.

Python's `urllib.request` module is helpful for that:

In [None]:
from urllib import request
data_url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
request.urlretrieve(data_url, filename='mortality_data.csv')

('mortality_data.csv', <http.client.HTTPMessage at 0x7fb86ad41820>)
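One reason to save the file first is to avoid downloading it again on every run. The helper below is a minimal sketch of that idea; `fetch_csv` is a hypothetical name, not part of pandas or `urllib`.

```python
from pathlib import Path
from urllib import request

import pandas as pd


def fetch_csv(url, filename):
    """Download url to filename unless it already exists, then load it."""
    path = Path(filename)
    if not path.exists():
        # only touch the network when there is no local copy yet
        request.urlretrieve(url, filename=str(path))
    return pd.read_csv(path)
```

Calling `fetch_csv(data_url, "mortality_data.csv")` would download the file on the first call and read the saved copy on later calls.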

### Google Drive

## Loading Data from Databases

### Using SQL

Python has libraries for interacting with common relational database platforms:

- sqlite3 - SQLite
- pymysql - MySQL
- psycopg2 - PostgreSQL
- cx_Oracle - Oracle
- pymssql - MS SQL Server

You can query a database from Python by:

- creating a connection object with the `connect()` method
- getting a cursor object with the `cursor()` method
- executing an SQL query to fetch desired rows with `execute()` and `fetchall()`

For example, to list the tables in a database:
```
import sqlite3
fires_con = sqlite3.connect('Data/FPA_FOD_20170508.sqlite')
fires_cur = fires_con.cursor()
fires_cur.execute(
    'SELECT name FROM sqlite_master WHERE type="table"').fetchall()
```
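The same connect, cursor, and execute steps can be tried without a database file by using an in-memory SQLite database. This is only a sketch: the table and its rows are invented for the example.

```python
import sqlite3

# An in-memory database with a hypothetical table, so no file is needed
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fires (state TEXT, fire_year INTEGER, fire_size REAL)")
cur.executemany(
    "INSERT INTO fires VALUES (?, ?, ?)",
    [("WA", 2015, 120.5), ("OR", 2015, 40.0), ("WA", 2016, 3.2)],
)
con.commit()

# Pass values with ? placeholders rather than building the SQL string by hand
rows = cur.execute(
    "SELECT state, fire_size FROM fires WHERE fire_year = ? ORDER BY fire_size",
    (2015,),
).fetchall()
print(rows)  # [('OR', 40.0), ('WA', 120.5)]

con.close()
```

`fetchall()` returns a list of tuples, one per row, in the column order of the `SELECT`.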

SQL query results can be read directly into a DataFrame using the `read_sql_query` method:

```
fires = pd.read_sql_query(
'''SELECT STATE, FIRE_YEAR, DATETIME(DISCOVERY_DATE) AS DISCOVERY_DATE, FIRE_NAME, FIRE_SIZE, LATITUDE, LONGITUDE FROM Fires''', fires_con)

```
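For a version that runs without the wildfire database file, the sketch below builds a small in-memory SQLite table (hypothetical names and values) and loads a filtered query into a DataFrame. The `params` argument supplies the value for the `?` placeholder.

```python
import sqlite3

import pandas as pd

# Hypothetical table in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fires (state TEXT, fire_year INTEGER, fire_size REAL)")
con.executemany(
    "INSERT INTO fires VALUES (?, ?, ?)",
    [("WA", 2015, 120.5), ("OR", 2015, 40.0), ("WA", 2016, 3.2)],
)
con.commit()

# Column names in the DataFrame come from the SELECT list
big_fires = pd.read_sql_query(
    "SELECT state, fire_year, fire_size FROM fires WHERE fire_size > ?",
    con,
    params=(10,),
)
print(big_fires)

con.close()
```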

## Working with JSON Data

The JSON format is popular for transmitting data between applications and closely matches the structure of a Python `dict`, with the exception of its `null` value and some other minor syntax differences.

There are several Python libraries for handling JSON, including `json`, which is built into Python.



In [17]:
# sample JSON document stored in a Python string
json_string = """
{
  "state": "AK",
  "cities": [
  {"name": "Anchorage", "pop": 250000, "region": "south-central"},
  {"name": "Fairbanks", "pop": 75000, "region": "interior"},
  {"name": "Juneau", "pop": 25000, "region": "south-east"}
  ],
  "industries": ["fishing","mining","tourism"]
}
"""
from pprint import pprint as pp
import json
data = json.loads(json_string)

# use pretty-print to print formatted json data
pp(data)

cities = pd.DataFrame(data["cities"], columns=["name", "pop"])
cities

{'cities': [{'name': 'Anchorage', 'pop': 250000, 'region': 'south-central'},
            {'name': 'Fairbanks', 'pop': 75000, 'region': 'interior'},
            {'name': 'Juneau', 'pop': 25000, 'region': 'south-east'}],
 'industries': ['fishing', 'mining', 'tourism'],
 'state': 'AK'}


Unnamed: 0,name,pop
0,Anchorage,250000
1,Fairbanks,75000
2,Juneau,25000
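When nested records also need values from their parent object, `pd.json_normalize` is one option. The sketch below uses a shortened, hypothetical copy of the document above and attaches the `state` value to every city row.

```python
import json

import pandas as pd

# Hypothetical nested document, mirroring the structure used above
doc = json.loads("""
{
  "state": "AK",
  "cities": [
    {"name": "Anchorage", "pop": 250000},
    {"name": "Fairbanks", "pop": 75000}
  ]
}
""")

# One row per city; carry the parent "state" value onto each row
flat = pd.json_normalize(doc, record_path="cities", meta=["state"])
print(flat)
```

Here `record_path` names the list to expand into rows and `meta` lists the parent fields to repeat on each row.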


## Web Scraping

Often, data is only available as HTML on a web page. Python has a number of libraries for reading & writing HTML.

pandas' built-in `read_html` function uses some of these libraries to parse the tables in an HTML document and return them as a list of DataFrame objects.

In [20]:
import pandas as pd
# get example data from a public web page
url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"
tables = pd.read_html(url)


len(tables)

# html page might have multiple tables so specify which to use for DataFrame
failures = tables[0]
failures.head()

Unnamed: 0,Bank NameBank,CityCity,StateSt,CertCert,Acquiring InstitutionAI,Closing DateClosing,FundFund
0,Citizens Bank,Sac City,IA,8758,Iowa Trust & Savings Bank,"November 3, 2023",10545
1,Heartland Tri-State Bank,Elkhart,KS,25851,"Dream First Bank, N.A.","July 28, 2023",10544
2,First Republic Bank,San Francisco,CA,59017,"JPMorgan Chase Bank, N.A.","May 1, 2023",10543
3,Signature Bank,New York,NY,57053,"Flagstar Bank, N.A.","March 12, 2023",10540
4,Silicon Valley Bank,Santa Clara,CA,24735,First–Citizens Bank & Trust Company,"March 10, 2023",10539
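When a page holds several tables, the `match` argument of `read_html` can narrow the result to tables containing a given piece of text. The sketch below parses a made-up HTML string instead of a live page, and assumes an HTML parser such as `lxml` is installed alongside pandas.

```python
import io

import pandas as pd

# Hypothetical HTML with two tables
html = """
<table>
  <tr><th>Bank</th><th>State</th></tr>
  <tr><td>Example Bank</td><td>IA</td></tr>
</table>
<table>
  <tr><th>Fund</th><th>Amount</th></tr>
  <tr><td>10545</td><td>12.5</td></tr>
</table>
"""

# Keep only the tables whose text matches "Fund"
tables = pd.read_html(io.StringIO(html), match="Fund")
funds = tables[0]
print(funds)
```

`read_html` still returns a list here, so the single matching table is taken with `tables[0]`.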
