
# Data Wrangling

Accessing and cleaning data is a crucial and often time-consuming step in data science.

Data scientists might use pure Python, pandas, or other programming tools for this step. Examples here focus on pandas, with a few other approaches for specific scenarios.

### Reading

- McKinney, Chapters 6 - 7
- Molin, “Data Wrangling with Pandas”

### Practice
- https://www.datacamp.com/courses/importing-data-in-python-part-1
- https://www.datacamp.com/courses/importing-data-in-python-part-2
- https://www.datacamp.com/courses/cleaning-data-in-python (cleaning data for analysis)

### Learning Outcomes
- Loading data from text files
- Working with common data formats
- Web scraping and API interaction
- Inspecting data with pandas
- Data cleaning


## Loading data files

pandas has a number of built-in functions for reading tabular data from plain-text files into a dataframe, including these common formats:

*   `read_csv` - comma-separated values
*   `read_table` - delimited text, tab-separated (tsv) by default
*   `read_fwf` - fixed-width columns
*   `read_html` - tables in an HTML document
*   `read_json` - JavaScript object notation

pandas' data-parsing functions support options for:
- indexing
- type inference and data conversion
- datetime parsing
- iterating over chunks of very large files
- handling unclean data


In [0]:
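import pandas as pd

# Each call returns a new DataFrame; df is reassigned each time.
# read_csv assumes comma-separated values.
df = pd.read_csv('sample_data/california_housing_test.csv')
# read_table defaults to tabs, so pass sep="," to read the same CSV file.
df = pd.read_table('sample_data/california_housing_test.csv', sep=",")
# read_json parses a JSON file into a DataFrame.
df = pd.read_json('sample_data/anscombe.json')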

#### Common options:

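- handle files with no header (`header=None`, optionally with `names`)
- set a specific column as the dataframe index (`index_col`)
- set a hierarchical index (pass a list of columns to `index_col`)
- skip specific rows (`skiprows`)
- attempt to parse dates (`parse_dates`)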
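
As a minimal sketch of these options, the following reads a small inline CSV. The sample data is made up for illustration and supplied through `io.StringIO` so the example needs no file on disk; the column names are likewise assumptions.

```python
import io

import pandas as pd

# Hypothetical sample data: one junk line, then rows with no header.
raw = """exported by some tool
2020-01-01,WA,Seattle,10
2020-01-02,WA,Tacoma,20
2020-01-03,OR,Portland,30
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=[0],                          # skip the junk first line
    header=None,                           # the data has no header row
    names=["date", "state", "city", "n"],  # so supply column names
    parse_dates=["date"],                  # attempt to parse dates
    index_col=["state", "city"],           # two columns give a hierarchical index
)
print(df)
```

Passing a single column name to `index_col` instead of a list would produce an ordinary (non-hierarchical) index.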

#### missing values

pandas recognizes common strings for missing data, such as `NA` and `NULL`. 

Programs can also specify additional values to treat as missing with the `na_values` option, which accepts a dict to use different values for different columns.
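
A short sketch of both behaviors, using invented sample data in which `-1` and `unknown` stand in for missing values (those sentinels are assumptions for illustration, not pandas defaults):

```python
import io

import pandas as pd

# Hypothetical sample data with several markers for "missing".
raw = """name,age,city
Ann,34,Seattle
Bob,NA,unknown
Cy,-1,NULL
"""

# Default behavior: NA and NULL are recognized as missing.
default = pd.read_csv(io.StringIO(raw))

# Per-column sentinels: "-1" is missing only in "age",
# "unknown" only in "city". The default markers still apply.
custom = pd.read_csv(
    io.StringIO(raw),
    na_values={"age": ["-1"], "city": ["unknown"]},
)

print(default.isna().sum())
print(custom.isna().sum())
```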



#### Reading files in parts

It sometimes makes sense to read part of a large file or iterate through it in small chunks.

- `nrows` to read a small number of rows
- `chunksize` to return an iterator (a `TextFileReader`) that yields DataFrames of `chunksize` rows each
- using Python's `csv` library:


```
# using Python's csv reader
import csv

# the with block closes the file automatically
with open('FILENAME', newline='') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)  # operate on each line (a list of strings)
```
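
The two pandas options can be sketched as follows. The ten-row inline dataset is a made-up stand-in for a large file, and the running total is just one example of work that can be done chunk by chunk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: 10 rows of made-up data.
raw = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

# nrows: read only the first few rows.
head = pd.read_csv(io.StringIO(raw), nrows=3)

# chunksize: iterate over the data a few rows at a time,
# keeping only a running total in memory.
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += chunk["value"].sum()
    n_chunks += 1

print(len(head), n_chunks, total)
```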




### Working with JSON

Programs can load JSON data with the pandas `read_json` function or with core Python's `json` module.

By default, `pandas.read_json` assumes each object in a JSON array is a table row.

```
import json

# read a JSON file into a Python object
with open(FILENAME) as f:
    data = json.load(f)

# convert a Python object to a JSON string
json_text = json.dumps(PYTHON_OBJECT)
```
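
As a small illustration of both routes, the sketch below parses the same JSON text with core Python and with pandas. The JSON content is invented for the example and passed through `io.StringIO` rather than read from a file.

```python
import io
import json

import pandas as pd

# Hypothetical JSON text: an array of objects, one per table row.
text = '[{"x": 1, "y": 2.5}, {"x": 2, "y": 4.0}, {"x": 3, "y": null}]'

# Core Python: parse to a list of dicts, then build a DataFrame.
records = json.loads(text)
df_core = pd.DataFrame(records)

# pandas: read_json accepts a path or a file-like object.
df_pandas = pd.read_json(io.StringIO(text))

print(df_pandas)
```

The core-Python route is handy when the JSON needs reshaping before it fits a table.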



### Working with HTML

The pandas `read_html` function depends on supporting parser libraries: `lxml`, or `beautifulsoup4` together with `html5lib`.

```
pip install lxml
pip install beautifulsoup4 html5lib
```

By default, it looks for and attempts to parse every `<table>` element in an HTML document, returning a list of DataFrames (one per table).

### Working with Excel

Programs can load data from Excel files using the pandas `ExcelFile` class together with `read_excel`.

```
xlsx = pd.ExcelFile(FILENAME)
pd.read_excel(xlsx, SHEETNAME)
```

## Web API integration

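Python programs can load data from websites using a number of approaches. One common approach is to request JSON from a web API with the `requests` library and load the parsed response into a DataFrame.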


- authentication


In [0]:
import requests
url = 'https://data.seattle.gov/resource/jguv-t9rb.json'
resp = requests.get(url)
data = resp.json() # parse HTTP response
licenses = pd.DataFrame(data)
licenses.head()

## Web scraping

## Data cleaning

- Missing data
- Type conversion
- Duplicate records
- Handling dates
- Handling strings
- Regular expressions
- Categorization & binning
- Handling outliers
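
Several of these steps can be sketched on a tiny, invented dataset. The column names, bin edges, and labels below are assumptions chosen for illustration, and filling missing ages with the median is just one possible strategy.

```python
import pandas as pd

# Hypothetical messy records, invented for illustration.
df = pd.DataFrame({
    "name": [" ann ", "Bob", "Bob", "cy"],
    "joined": ["2020-01-05", "2020-02-10", "2020-02-10", "2020-03-15"],
    "age": ["34", "51", "51", None],
})

df = df.drop_duplicates()                          # duplicate records
df["name"] = df["name"].str.strip().str.title()    # handling strings
df["joined"] = pd.to_datetime(df["joined"])        # handling dates
df["age"] = pd.to_numeric(df["age"])               # type conversion
df["age"] = df["age"].fillna(df["age"].median())   # missing data

# Categorization and binning: label each age by range.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 40, 60, 120], labels=["young", "middle", "senior"]
)

# Regular expressions: keep names that start with A or B.
ab = df[df["name"].str.contains(r"^[AB]")]
print(df)
```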
