<a href="https://colab.research.google.com/github/brendenwest/cis276/blob/main/3_data_acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Acquisition & Cleaning

### Reading
- Murach's, Chapter 5, 6
- https://wesmckinney.com/book/accessing-data.html
- https://wesmckinney.com/book/data-cleaning.html

### Learning Outcomes

- Loading data from files
- Loading data from a database
- Loading data from the internet
- Saving data to files
- Handling missing values
- Handling outliers
- Handling data type problems

## Data Acquisition

### Importing Data into a DataFrame

pandas can import structured data from a variety of file formats and data sources.

Some common reader functions and the formats they handle are:

- **read_csv** - read delimited values from .csv or .tsv files
- **read_html** - read tables from an HTML file or web page (returns a list of DataFrames, one per table)
- **read_excel** - read .xls, .xlsx Excel spreadsheets
- **read_json** - read JavaScript Object Notation (JSON) from .json files
- **read_sas** - read a dataset created by SAS
- **read_spss** - read a data file created by SPSS
- **read_xml** - read Extensible Markup Language (XML) from .xml files
- **read_sql** - read results of an SQL query
- **read_sql_table** - read a whole SQL table (similar to selecting everything in a table using `read_sql`)

These functions expect the data to be in tabular form. If the data isn't tabular (e.g. complex or nested structures), the read function may raise an error or return columns that need further processing.


In [None]:
import pandas as pd
# get example data
url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
mortality_data = pd.read_csv(url)

### Downloading data to a file
Sometimes it's helpful to retrieve a file from the internet and save it to disk before reading it into a DataFrame.

Python's `urllib.request` module is helpful for that:

In [None]:
from urllib import request
data_url = "https://data.cdc.gov/api/views/v6ab-adf5/rows.csv?accessType=DOWNLOAD"
request.urlretrieve(data_url, filename='mortality_data.csv')

('mortality_data.csv', <http.client.HTTPMessage at 0x7fb86ad41820>)
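
Once a file is on disk, `read_csv` accepts the local path just as it accepts a URL, and `to_csv` writes a DataFrame back out. The sketch below is only an illustration: the DataFrame contents, column names, and the temporary file name are made up so that the example runs without a network connection.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for a downloaded file
sample = pd.DataFrame({"year": [1900, 1901], "rate": [1.5, 2.25]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample.csv"
    # index=False keeps the row index out of the saved file
    sample.to_csv(csv_path, index=False)
    # read the saved file back into a new DataFrame
    restored = pd.read_csv(csv_path)

print(restored)
```

Writing with `index=False` is a common choice here because otherwise the index is saved as an extra unnamed column and read back as data.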

### Working with JSON data
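
JSON often nests objects inside each record, which does not map directly onto rows and columns. A minimal sketch of one way to flatten such data with `pd.json_normalize` follows; the JSON text and its field names are invented for illustration.

```python
import json

import pandas as pd

# Hypothetical JSON text with a nested "location" object in each record
raw = """
[
  {"name": "Alpha", "location": {"state": "WA", "lat": 47.6}},
  {"name": "Beta",  "location": {"state": "OR", "lat": 45.5}}
]
"""

records = json.loads(raw)  # parse the text into a list of dicts

# json_normalize expands nested keys into dotted column names
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Each nested key becomes its own column (for example `location.state`), so the result can be filtered and cleaned like any other DataFrame.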

### Working with Databases

Python has libraries for interacting with common relational database platforms:

- sqlite3 - SQLite
- pymysql - MySQL
- psycopg2 - PostgreSQL
- cx_Oracle - Oracle
- pymssql - MS SQL Server

You can query a database from Python by:

- creating a connection object with the `connect()` method
- getting a cursor object with the `cursor()` method
- executing an SQL query to fetch desired rows with `execute()` and `fetchall()`

For example, to list the tables in a database:
```
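import sqlite3

# open a connection to the SQLite database file
fires_con = sqlite3.connect('Data/FPA_FOD_20170508.sqlite')
fires_cur = fires_con.cursor()
# list the table names recorded in SQLite's schema table
fires_cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()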
```

SQL query results can be read directly into a DataFrame using the `read_sql_query` function:

```
fires = pd.read_sql_query(
    '''SELECT STATE, FIRE_YEAR, DATETIME(DISCOVERY_DATE) AS DISCOVERY_DATE,
              FIRE_NAME, FIRE_SIZE, LATITUDE, LONGITUDE
       FROM Fires''',
    fires_con)
```
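
The connect, cursor, execute, and fetch steps above can be tried without the wildfire database file. The sketch below assumes an in-memory SQLite database and a small made-up `Fires` table (its columns and rows are hypothetical, loosely modeled on the query above).

```python
import sqlite3

import pandas as pd

# In-memory database so the example needs no file on disk
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical table and rows for illustration only
cur.execute("CREATE TABLE Fires (STATE TEXT, FIRE_YEAR INTEGER, FIRE_SIZE REAL)")
cur.executemany(
    "INSERT INTO Fires VALUES (?, ?, ?)",
    [("WA", 2014, 120.5), ("OR", 2015, 15.0), ("WA", 2015, 3.2)],
)
con.commit()

# List the tables using the cursor
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)

# Read a parameterized query straight into a DataFrame
wa_fires = pd.read_sql_query(
    "SELECT * FROM Fires WHERE STATE = ?", con, params=("WA",)
)
print(wa_fires)

con.close()
```

Passing values through `params` rather than formatting them into the SQL string lets the database driver handle quoting.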

### Working with Google Drive

## Data Cleaning

Cleaning data is a crucial and often time-consuming step in data science.

Data scientists might use pure Python, pandas, or other programming tools for this step. Examples here focus on pandas with a few other approaches for specific scenarios.

Common tasks are:

*   Handling missing data
*   Simplifying data
*   Data-type conversion

### Handling Missing Data

Often data analysts need to account for missing data values.

pandas uses the floating-point value NaN (Not a Number) to represent missing numeric data. This is a **sentinel** value that can be easily detected.

The built-in Python `None` value is also treated as NA.

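pandas has methods for detecting missing values in a Series or DataFrame:
- `isnull` (alias `isna`) - returns True where a value is missing
- `notnull` (alias `notna`) - returns True where a value is present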

These methods can be used as filters in a data query.

`data[data.notnull()]`

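Alternatively, `dropna` removes the rows (or, with `axis=1`, the columns) that contain missing data.

`dropna` takes options that control when a row or column is dropped: `how='any'` (the default) drops it if any value is missing, `how='all'` drops it only if every value is missing, and `thresh=n` keeps it only if it has at least `n` non-missing values.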
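
A short sketch of detecting and dropping missing values, using a made-up DataFrame (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; None is treated as missing too
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],
        "b": [4.0, 5.0, None, np.nan],
        "c": [7.0, 8.0, 9.0, np.nan],
    }
)

missing_per_column = df.isnull().sum()  # number of missing values per column
a_present = df["a"][df["a"].notnull()]  # keep only the non-missing values of "a"

any_dropped = df.dropna()               # drop rows with any missing value
all_dropped = df.dropna(how="all")      # drop rows where every value is missing
thresh_kept = df.dropna(thresh=2)       # keep rows with at least 2 non-missing values

print(missing_per_column)
print(thresh_kept)
```

None of these calls change `df` itself; each returns a new object.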

**Replacing missing values**

Sometimes it's more useful to replace missing data with a specific or derived value, using `fillna`.

The fill value passed to `fillna` can be a constant, a per-column mapping, or a computed value such as each column's mean (`df.fillna(df.mean())`).

`fillna` returns a new object, but has an `inplace` option.
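
A minimal sketch of a few common fill strategies. The DataFrame and its column names are made up, and which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps
df = pd.DataFrame({"temp": [20.0, np.nan, 24.0], "rain": [np.nan, 2.0, 4.0]})

zero_filled = df.fillna(0)                           # one constant for every gap
per_column = df.fillna({"temp": 0.0, "rain": -1.0})  # a different constant per column
mean_filled = df.fillna(df.mean())                   # computed value: each column's mean
forward_filled = df.ffill()                          # carry the last valid value forward

print(mean_filled)
```

Forward filling leaves a gap in place when there is no earlier value to carry forward, as with the first `rain` reading here.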

### Simplifying Data

- **removing duplicates** - DataFrames have built-in methods to identify which rows are `duplicated` and to `drop_duplicates`. By default, these methods consider all columns, but programs can specify a subset.

- **replacing values** - the `replace()` method is a simple approach for replacing values in a pandas object.

- **handling outliers** - programs may want to find and replace, or filter out, values that exceed some threshold.
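
The three tasks above can be sketched on a small made-up DataFrame. The `-999` placeholder code and the threshold of 120 are assumptions chosen for illustration, not values from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical records with a repeated row, a placeholder code, and an extreme value
df = pd.DataFrame(
    {
        "city": ["Seattle", "Seattle", "Tacoma", "Tacoma", "Olympia"],
        "temp": [55, 55, -999, 60, 140],
    }
)

dupe_mask = df.duplicated()                         # True for rows repeating an earlier row
deduped = df.drop_duplicates()                      # compares all columns by default
one_per_city = df.drop_duplicates(subset=["city"])  # compare only the "city" column

# Replace the placeholder code with a missing value
cleaned = deduped.assign(temp=deduped["temp"].replace(-999, np.nan))

# Outliers: flag values above the assumed threshold, then cap or filter them
too_hot = cleaned["temp"] > 120
capped = cleaned["temp"].clip(upper=120)
in_range = cleaned[~too_hot]

print(in_range)
```

Whether to cap or to filter an outlier depends on the analysis; capping keeps the row while filtering removes it.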

### String Conversion

Python has a wide range of built-in string methods. Some common ones are:
  - **split** - generate a list of substrings from a string based on a delimiter
  - **lower** - convert a string to lower case
  - **upper** - convert a string to upper case
  - **join** - combine strings with a delimiter
  - **index** - return the position where a substring is first found (raises `ValueError` if it is absent)
  - **find** - return the position where a substring is first found, or -1 if it is absent
  - **count** - number of occurrences of a substring in a string
  - **replace** - substitute occurrences of one substring with another
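
A quick sketch of these methods applied to one made-up string (the sample text is an assumption for illustration):

```python
record = "Seattle,WA,wildfire,wildfire"

fields = record.split(",")                    # list of substrings
lowered = record.lower()                      # all lower case
uppered = record.upper()                      # all upper case
rejoined = " | ".join(fields)                 # combine with a new delimiter
first_at = record.index("wildfire")           # position of first match; ValueError if absent
missing = record.find("flood")                # -1 because "flood" is not present
n_fires = record.count("wildfire")            # number of occurrences
swapped = record.replace("wildfire", "fire")  # substitute one substring for another

print(fields, first_at, missing, n_fires)
```

Strings are immutable, so each method returns a new value and leaves `record` unchanged.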


**Regular Expressions** provide a (mostly) language-agnostic logical syntax for finding/matching string patterns in text.

Regex patterns can be applied to strings with Python's [re module](https://docs.python.org/3/library/re.html).

In [None]:
import re
text = "foo    bar\t baz  \tqux"

# inline regex pattern (raw string, so \s is not treated as a string escape)
re.split(r'\s+', text)

# reusable compiled regex object
regex = re.compile(r'\s+')
regex.split(text)