<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/3_data_acquisition_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Acquisition & Cleaning

Accessing and cleaning data is a crucial and often time-consuming step in data science.

Data scientists might use pure Python, pandas, or other programming tools for this step. Examples here focus on pandas, with a few other approaches for specific scenarios.

### Reading

- McKinney, Chapters 6 - 7
- Molin, “Data Wrangling with Pandas”
- https://www.detroitnews.com/story/news/local/detroit-city/housing/2020/01/09/detroit-homeowners-overtaxed-600-million/2698518001/
- https://realpython.com/beautiful-soup-web-scraper-python/

### Practice
- https://www.datacamp.com/courses/importing-data-in-python-part-1
- https://www.datacamp.com/courses/importing-data-in-python-part-2
- https://www.datacamp.com/community/tutorials/making-web-crawlers-scrapy-python
- https://www.datacamp.com/courses/cleaning-data-in-python (cleaning data for analysis)
- https://github.com/wesm/pydata-book

### Learning Outcomes
- Loading data from text files
- Working with common data formats
- Web scraping and API interaction
- Interacting with databases
- Inspecting data with pandas
- Data cleaning


## Loading data files

pandas has a number of built-in functions for reading tabular data from plain-text files into a dataframe, including these common formats:

*   read_csv - comma-separated values
*   read_table - tab-separated values (tsv)
*   read_fwf - fixed-width columns
*   read_html - HTML tables
*   read_json - JavaScript object notation

pandas' data-parsing functions support options for:
- indexing
- type inference and data conversion
- datetime parsing
- iterating over chunks of very large files
- handling unclean data


In [0]:
import pandas as pd

df = pd.read_csv('sample_data/california_housing_test.csv')
# equivalent: read_table with an explicit comma delimiter
df = pd.read_table('sample_data/california_housing_test.csv', sep=",")


#### Common options:

- handle files with no header
- set a specific column as the dataframe index
- set a hierarchical index
- skip specific rows
- attempt to parse dates
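
The options above can be sketched in a single `read_csv` call. This is a minimal illustration, assuming a small inline dataset in place of a real file; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a file on disk
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "2020-01-03,Seattle,12\n"
    "2020-01-04,Tacoma,7\n"
    "2020-01-05,Seattle,9\n"
)

sales = pd.read_csv(
    raw,
    skiprows=1,                       # skip the leading comment line
    header=None,                      # the data has no header row
    names=["date", "city", "units"],  # so supply column names
    parse_dates=["date"],             # attempt to parse this column as dates
    index_col=["city", "date"],       # two columns form a hierarchical index
)
print(sales)
```

Passing a single column name to `index_col` would give an ordinary one-level index instead.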

#### missing values

pandas recognizes common strings for missing data, such as `NA` and `NULL`. 

Programs can also specify values to treat as missing and can use different values for different columns.
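
As a rough sketch of per-column missing values, the `na_values` option accepts a dict keyed by column name. The data and the sentinel strings (`-1`, `unknown`) below are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical data: "-1" marks a missing age, "unknown" a missing city
raw = io.StringIO(
    "name,age,city\n"
    "Ann,34,Seattle\n"
    "Raj,-1,unknown\n"
    "Lee,NA,Tacoma\n"
)

people = pd.read_csv(raw, na_values={"age": ["-1"], "city": ["unknown"]})

# The default markers (such as NA) are still recognized alongside the custom ones
missing_counts = people.isnull().sum()
print(missing_counts)
```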



#### Reading files in parts

It sometimes makes sense to read part of a large file or iterate through it in small chunks.

- `nrows` to read a small number of rows
- `chunksize` to return a `TextFileReader` object for iteration
- using python's csv library:


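```
# using python csv reader
import csv

# newline='' lets the csv module handle line endings itself
with open('FILENAME', newline='') as f:
    reader = csv.reader(f)
    for line in reader:
        print(line)  # operate on each line (a list of strings)
```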
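
The `nrows` and `chunksize` options listed above might be used as follows. This is a minimal sketch that assumes a tiny in-memory CSV; with a real file, the path would replace the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical data: one numeric column with ten rows
csv_text = "value\n" + "\n".join(str(n) for n in range(10)) + "\n"

# Read only the first three data rows
preview = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Iterate over the data four rows at a time, keeping a running total
total = 0
chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))   # each chunk is a regular DataFrame
    total += int(chunk["value"].sum())

print(len(preview), chunk_sizes, total)
```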




### Working with JSON

Programs can load JSON data with core python:

```
import json

with open(FILENAME) as f:
    data = json.load(f)  # read JSON file into python object
json_text = json.dumps(PYTHON_OBJECT)  # convert python object to JSON string
```
or with the pandas `read_json` function. By default, `pandas.read_json` assumes each object in a JSON array is a table row.
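
To make the row-per-object behavior concrete, here is a small sketch that parses the same JSON text with the `json` module and with pandas. The records are hypothetical and are wrapped in `StringIO` so that no file is needed.

```python
import io
import json
import pandas as pd

# Hypothetical JSON array of two objects
json_text = '[{"series": "I", "x": 10, "y": 8.04}, {"series": "I", "x": 8, "y": 6.95}]'

records = json.loads(json_text)    # JSON string -> list of dicts
round_trip = json.dumps(records)   # python object -> JSON string

# Each object in the array becomes one row; keys become column names
df = pd.read_json(io.StringIO(json_text))
print(df)
```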



In [0]:
import pandas as pd
df = pd.read_json('sample_data/anscombe.json')
df.head()


Unnamed: 0,Series,X,Y
0,I,10,8.04
1,I,8,6.95
2,I,13,7.58
3,I,9,8.81
4,I,11,8.33


### Working with HTML

The pandas `read_html` function depends on several supporting parser libraries.

```
pip install lxml
pip install beautifulsoup4 html5lib
```

By default, it finds and attempts to parse every `<table>` element in an HTML document, returning a list of DataFrames.

### Working with Excel

Programs can load data from Excel files using pure python with openpyxl, or with pandas' `ExcelFile` class or read_excel() method.

```
xlsx = pd.ExcelFile(FILENAME)
pd.read_excel(xlsx, SHEETNAME)
```

## Loading Web data

Python programs can load data from web sites using a number of approaches, depending on:

- whether data are available from an API
- whether data are publicly available or behind authentication


### Web API integration

When data are available from an API as structured data (e.g. JSON, XML, CSV), programs can fetch using libraries such as `requests`.


In [0]:
import requests
url = 'https://data.seattle.gov/resource/jguv-t9rb.json'
resp = requests.get(url)
data = resp.json() # parse HTTP response
licenses = pd.DataFrame(data)
licenses.head()

Unnamed: 0,license_issue_date,license_number,species,primary_breed,zip_code,animal_s_name,secondary_breed
0,2000-01-03T00:00:00.000,263574,Dog,Shepherd,98119.0,,
1,2000-01-05T00:00:00.000,119820,Dog,"Retriever, Labrador",98106.0,Fancy,
2,2000-01-06T00:00:00.000,10401,Dog,Siberian Husky,,Skip,Mix
3,2000-01-12T00:00:00.000,941592,Dog,German Shepherd,98107.0,Kanga,
4,2000-01-24T00:00:00.000,422763,Dog,"Retriever, Golden",,Oscar,


### Web scraping

Sometimes data are available on the internet, but not as structured data. So programs need to capture a full web page and then extract just the desired data.

Two widely used python libraries are:

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - easy to use but requires other supporting libraries.
- [Scrapy](https://scrapy.org/) - complete package & optimal for large, complicated tasks

## Data cleaning

- handling dates
- type conversion
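
A minimal sketch of both steps, assuming a small hypothetical frame in which every value arrives as text. `errors="coerce"` is one possible policy: unparseable entries become missing values instead of raising an error.

```python
import pandas as pd

# Hypothetical raw data: everything was loaded as strings
raw = pd.DataFrame({
    "issued": ["2000-01-03", "2000-01-05", "not a date"],
    "zip_code": ["98119", "98106", ""],
})

# Handling dates: convert text to datetimes (bad entries become NaT)
raw["issued"] = pd.to_datetime(raw["issued"], errors="coerce")

# Type conversion: convert text to numbers (bad entries become NaN)
raw["zip_code"] = pd.to_numeric(raw["zip_code"], errors="coerce")

print(raw.dtypes)
```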


### Handling missing data

Often data analysts need to account for missing data values.

pandas uses the floating-point value NaN (Not a Number) to represent missing numeric data. This is a **sentinel** value that can be easily detected.

The built-in Python `None` value is also treated as NA.

pandas has several methods for detecting `NaN` values in a Series or DataFrame:
- isnull
- notnull

These methods can be used as filters in a data query.

`data[data.notnull()]`

Alternatively, programs can use `dropna` to drop rows or columns that contain missing values.

`dropna` has options to control how many missing values a row or column should have to be dropped.
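
The filtering approaches above can be compared on a small frame. The data here are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

any_missing = data.dropna()                # drop rows with at least one NaN
all_missing = data.dropna(how="all")       # drop rows only if every value is NaN
at_least_two = data.dropna(thresh=2)       # keep rows with 2+ non-missing values
complete_in_a = data[data["a"].notnull()]  # boolean filter on a single column

print(len(any_missing), len(all_missing), len(at_least_two), len(complete_in_a))
```

Passing `axis=1` applies the same rules to columns instead of rows.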

**replace missing values**
Sometimes it's more useful to replace missing data with a specific or interpreted value, using `fillna`.

`fillna` accepts a constant, a computed value such as a column mean, or a dict that maps column names to fill values.

`fillna` returns a new object, but has an `inplace` option.
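
A short sketch of `fillna` on a hypothetical frame, filling with a constant, with a computed column mean, and with a different value per column:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"score": [1.0, np.nan, 3.0], "grade": ["A", None, "C"]})

zero_filled = data["score"].fillna(0)                       # constant fill value
mean_filled = data["score"].fillna(data["score"].mean())    # computed fill value
per_column = data.fillna({"score": 0, "grade": "unknown"})  # one value per column

# The original frame is unchanged because fillna returned new objects
print(data["score"].isnull().sum())
```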


### Transforming data

- **removing duplicates** - DataFrames have built-in methods to identify which rows are `duplicated` and to `drop_duplicates`. By default, these methods consider all columns, but programs can specify a subset.

- **transformation functions** - programs can use the Series **map** method to perform element-wise transformations.

- **replacing values** - the replace() method is a simple approach for replacing values in a pandas object.

- **transforming indexes** - pandas indexes support **map()** operations to produce new objects with different labels. **rename()** is useful for simple index changes.

- **handling outliers** - programs may want to find & replace or filter values that exceed some threshold.
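
The transformations listed above can be sketched on one small frame. All names and values are hypothetical, the `-999` sentinel and the threshold of 10 are arbitrary choices, and index relabeling here uses `rename`.

```python
import numpy as np
import pandas as pd

meals = pd.DataFrame(
    {"food": ["bacon", "ham", "ham", "nova lox"],
     "ounces": [4.0, 12.0, 12.0, -999.0]},
    index=["ohio", "colorado", "colorado", "new york"],
)

# Removing duplicates: the second "ham" row repeats the first
dup_flags = meals.duplicated()
unique_meals = meals.drop_duplicates()

# Transformation function: map each food to its source animal
source = {"bacon": "pig", "ham": "pig", "nova lox": "salmon"}
unique_meals = unique_meals.assign(animal=unique_meals["food"].map(source))

# Replacing values: treat the -999 sentinel as missing
unique_meals["ounces"] = unique_meals["ounces"].replace(-999.0, np.nan)

# Transforming indexes: map() builds new labels, rename() returns a relabeled copy
upper_index = unique_meals.index.map(str.upper)
renamed = unique_meals.rename(index={"ohio": "OH"}, columns=str.title)

# Handling outliers: cap values that exceed the threshold
capped = unique_meals["ounces"].clip(upper=10)

print(renamed)
```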


#### Binning

Sometimes it's necessary to separate continuous data into bins or groups for analysis.

pandas `cut()` method supports binning operations, including:

- dividing data into specific bins based on value ranges
- dividing data into equal-width bins
- assigning bin names
- determining which side of a bin is open or closed

pandas also provides `qcut()` to bin data based on sample quantiles.
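
A brief sketch of these binning options, assuming a hypothetical series of ages; the bin edges and labels are arbitrary.

```python
import pandas as pd

ages = pd.Series([20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32])
edges = [18, 25, 35, 60, 100]

# Explicit value ranges; intervals are closed on the right by default: (18, 25], ...
by_edges = pd.cut(ages, bins=edges)

# Named bins, closed on the left instead: [18, 25), [25, 35), ...
labels = ["youth", "young adult", "middle aged", "senior"]
named = pd.cut(ages, bins=edges, right=False, labels=labels)

# Four equal-width bins computed from the minimum and maximum of the data
equal_width = pd.cut(ages, 4)

# Four quantile-based bins holding roughly equal numbers of values
quartiles = pd.qcut(ages, 4)

print(named.value_counts())
```

The age 25 falls in `(18, 25]` with the default settings but in `young adult` once `right=False` closes the bins on the left.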

#### Indicator/Dummy variables

It's often necessary to convert a categorical variable into a `dummy` or `indicator` matrix for statistical modeling.

`get_dummies()` returns a matrix with one row for each value in a Series and a column for each distinct category value. 

The matrix value is 1 where the Series value at a given index matches the category and 0 otherwise.

In [0]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


### String manipulation

- string methods - python has a wide range of built-in string methods. Some common ones are:
  - **split** - generate a list of substrings from a string based on a delimiter
  - **lower** - convert a string to lower case
  - **upper** - convert a string to upper case
  - **join** - combine strings with a delimiter
  - **index** - position where a substring is first found; raises `ValueError` if it is absent
  - **find** - position where a substring is first found, or -1 if it is absent
  - **count** - number of occurrences of a substring in a string
  - **replace** - substitute occurrences of one pattern with another.
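
The methods above in action, using an arbitrary sample string:

```python
val = "a,b,  guido"

pieces = [x.strip() for x in val.split(",")]  # split, then trim whitespace
joined = "::".join(pieces)                    # combine with a delimiter

has_guido = "guido" in val        # membership test with the in operator
first_comma = val.index(",")      # position; raises ValueError if absent
no_colon = val.find(":")          # position, or -1 when not found
commas = val.count(",")           # number of occurrences
swapped = val.replace(",", "::")  # substitute one pattern for another
shout = val.upper()               # upper case; lower() is the inverse

print(pieces, joined, first_comma, no_colon, commas)
```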


**Regular Expressions** provide a (mostly) language-agnostic logical syntax for finding/matching string patterns in text.

`regex` patterns can be applied to strings with python's [re module](https://docs.python.org/3/library/re.html).


In [0]:
import re
text = "foo    bar\t baz  \tqux"

# inline regex pattern (raw string, so the backslash reaches the regex engine)
re.split(r'\s+', text)

# reusable regex object
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

**vectorized string functions**

pandas Series has array-oriented methods for string operations that will skip `NA` values.

In [0]:
name = "Bob"
name[1:]

'ob'
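
The same slice can be applied to every element of a Series through the `.str` accessor. This is a minimal sketch with hypothetical email data that includes one missing value.

```python
import numpy as np
import pandas as pd

emails = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmail.com", "Wes": np.nan}
)

has_gmail = emails.str.contains("gmail")  # element-wise substring test
upper = emails.str.upper()                # the missing entry stays missing
tails = emails.str[1:]                    # vectorized form of name[1:]
domains = emails.str.split("@").str[1]    # second piece of each split

print(domains)
```

Calling `str.upper` on each element directly would raise an error on the missing value, which is why the `.str` accessor is convenient for unclean data.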