<a name="Top"></a>

# Unit 2: Data Preparation

## Contents

* [Getting Started](#Getting-Started)
* Data Preparation
    * [Example 1: Franklin County Audit Data](#Example-1:-Franklin-County-Audit-Data)
        * [Sample Data](#1-SampleData)
        * [Clean Data](#1-CleanData)
        * [Construct Data](#1-ConstructData)
        * [Integrate Data](#1-IntegrateData)
    * [Example 2: Parsing Errors with Licking County Auditor Data](#Example-2:-Parsing-Errors-with-Licking-County-Auditor-Data)
        * [Concept: Append Data](#Concept-Append)
    * [Extraction from a GIS Dataset: Fairfield County Auditor Data](#GIS-dataset-extraction)
        * [Concept: Data Manipulation](#Concept-DataManip)
* [Joining Related Datasets](#Joining-Related-Datasets)
* [Lab Answers](#Lab-Answers)
* [Next Steps](#Next-Steps)
* [Resources and Further Reading](#Resources-and-Further-Reading)

### Exercises

[1](#Exercise-1), [2](#Exercise-2), [3](#Exercise-3), [4](#Exercise-4), [5](#Exercise-5), [6](#Exercise-6), [7](#Exercise-7), [8](#Exercise-8), [9](#Exercise-9),  [10](#Exercise-10), [11](#Exercise-11), [12](#Exercise-12), [13](#Exercise-13), [14](#Exercise-14), [15](#Exercise-15)

## Getting Started

In addition to the libraries we used in the last unit, this notebook relies on the [GeoPandas](https://geopandas.org/) library to process data from a [geographic information system](https://en.wikipedia.org/wiki/Geographic_information_system). To install the library, we use `pip` because the Anaconda package manager, `conda`, does not install one of the underlying support modules for `geopandas` the same way that `pip` does.

NOTE: If using [Anaconda](https://www.anaconda.com/download) and the following pip command fails, try installing `geopandas` with the Anaconda prompt on your computer by running the following command:

```
conda install --yes geopandas
```

In [None]:
import sys
!{sys.executable} -m pip install geopandas

<a name="Example-1:-Franklin-County-Audit-Data"></a>
### Example 1: Franklin County Auditor Data
We'll be working with county auditor data containing real estate information. The full dataset is quite large - 400,000+ rows - so we need to sample the data for initial data cleansing and construction for later combination with other datasets from other central Ohio counties. We will only be looking at 10% of the entire dataset, which is a somewhat arbitrary data sampling approach, but assuming we have no idea what might interest us with this dataset it's a good place to start.

[Top](#Top)

<a name="1-SampleData"></a>
### 1. Sample Data
To reduce the memory and time required to process the Franklin County data, the size of the dataset will be reduced; the `sample_file()` function below contains the code to do this. Two arguments are required when the function is called, `input_file` and `output_file`, to specify the source and target files, respectively. A keyword argument, `fraction`, can be specified to set the desired size of the output file relative to the input file; the default value is 0.1.

First, `sample_file()` calls `get_line_count()` to calculate the total number of lines in the source file. The total is multiplied by the specified fraction and truncated to an integer to determine the number of lines to sample. Next, the `sample()` function from the `random` module selects that many line indexes from a range that starts at 1, which skips the header line at index 0, and ends at the index of the last line in the source file. The sampled indexes are sorted, and index 0 is inserted at the front so that the header is always copied. Finally, the function uses `linecache.getline()` to retrieve each selected line from the source file and writes it to the output file.

**NOTE:** We are using the very simplistic `random.sample()` function which is good enough for our learning purposes in this lesson. We also initialize the `seed()` used to determine which rows we grab "randomly" from our dataset. This allows some consistency in our lesson, but is not necessary in real-world practice. As you become more proficient with Python you will find that there are more robust data sampling functions and methods available which you will want to use for more statistically defensible work.

In [None]:
import linecache
import random

def get_line_count(input_file):
    """Count number of lines in a file"""
    count = 0
    with open(input_file) as infile:
        for line in infile:
            count += 1
    return count

def sample_file(input_file, output_file, fraction=0.1):
    """Extract a subset of lines from a file"""
    total_line_count = get_line_count(input_file)
    sample_line_count = int(fraction * total_line_count)  # fraction of total
    random.seed(12345)  # fixed seed so random.sample() returns the same rows on every call
    sample_line_numbers = random.sample(range(1, total_line_count),
                                        sample_line_count)  # sample of line numbers
    sample_line_numbers.sort()
    sample_line_numbers.insert(0, 0)  # always keep the header line
    with open(output_file, 'w') as outfile:
        for line_number in sample_line_numbers:
            # linecache line numbers start at 1, so shift the zero-based index
            line = linecache.getline(input_file, line_number + 1)
            outfile.write(line)
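
As a quick sanity check of the sampling idea, the following minimal sketch applies the same steps to a small in-memory file, so it runs without the auditor data. `sample_lines` is a hypothetical helper written only for this illustration; unlike `sample_file()`, it reads the whole input into memory, which is reasonable only for small inputs.

```python
import io
import random

def sample_lines(infile, fraction=0.1, seed=12345):
    """Return the header plus a random fraction of the data lines (hypothetical helper)."""
    lines = infile.readlines()
    header, data = lines[0], lines[1:]
    sample_size = int(fraction * len(data))   # int() truncates toward zero
    rng = random.Random(seed)                 # local generator, so the result is repeatable
    keep = set(rng.sample(range(len(data)), sample_size))
    # keep the sampled lines in their original order
    return [header] + [line for i, line in enumerate(data) if i in keep]

# build a fake file with one header line and 50 data rows
fake_file = io.StringIO("id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(50)))
subset = sample_lines(fake_file, fraction=0.1)

print(len(subset))        # 6: the header plus 5 sampled rows
print(subset[0].strip())  # id,value
```

Using a local `random.Random(seed)` instance instead of the module-level `random.seed()` is a design choice made for this sketch: it keeps the sample repeatable without changing the state of the shared random number generator.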
                    

Now that we have a data sampling function, let's use it with Franklin County data. The Franklin County Auditor's website offers the ability to [generate datasets](https://apps.franklincountyauditor.com/reporter). Another approach is to download the [data files via FTP](ftp://apps.franklincountyauditor.com/), which is the approach we'll use here. 

We will first load the data file, then sample it using the `sample_file()` function that we defined above, and finally we will store the sampled output as a data file which we can use again in further steps.

The `pandas` package can be used to read the Franklin County data file:

In [None]:
import pandas as pd

# load full dataset for initial analysis
franklin = pd.read_csv('../data/county_auditor/OH-Franklin/Parcel.csv')

# sample the data (default is 10% of all rows)
sample_file('../data/county_auditor/OH-Franklin/Parcel.csv', 
            '../data/county_auditor/OH-Franklin/franklin_auditor_subset.csv')

[Top](#Top)

<a name="1-CleanData"></a>
### 2. Clean Data
In addition to simply extracting the data, we often need to address data quality issues as well; examples of quality issues include:

- *duplication*: the dataset unnecessarily includes repeated data
- *inconsistency*: different values are used to represent the same thing or the values do not fit the defined schema
- *incompleteness*: data is missing from the dataset
- *inaccuracy*: the data does not reflect what it purports to measure or represent

Resolving duplication issues is usually straightforward: duplicate data is removed before conducting further analysis. Inconsistencies can be resolved by determining the appropriate values and transforming the data as needed; however, realizing that multiple values correspond to the same thing might require some examination of the data.  While it can be easy to detect incompleteness of data, the approach for resolving the issue might be context-specific. Should a default value be used? Should a randomly generated value within some range be substituted? Should records with missing data be dropped entirely? Inaccuracies tend to be more difficult to detect and to resolve as they often require examining how the source data was collected.
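
The sketch below shows how the first three kinds of issues might surface in pandas. It uses a small made-up DataFrame; the column names and values are invented for illustration and do not come from the auditor data. Inaccuracy is left out because it usually cannot be detected from the data alone.

```python
import pandas as pd

# made-up records: one exact duplicate row, inconsistent city spellings, one missing price
toy = pd.DataFrame({
    "parcel": ["A1", "A2", "A2", "A3"],
    "city":   ["Columbus", "COLUMBUS", "COLUMBUS", "Dublin"],
    "price":  [150000.0, 98000.0, 98000.0, None],
})

# duplication: drop rows that repeat an earlier row exactly
deduped = toy.drop_duplicates()

# inconsistency: normalize different spellings of the same value
deduped = deduped.assign(city=deduped["city"].str.title())

# incompleteness: count missing values per column before deciding how to handle them
missing_counts = deduped.isna().sum()

print(len(toy), len(deduped))
print(deduped["city"].unique().tolist())
print(missing_counts["price"])
```

Counting the missing values first, rather than filling them immediately, leaves room for the context-specific decision described above.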

Let's begin by looking at the original documentation about the Franklin County dataset.

In [None]:
# display auditor documentation in the notebook (NOTE: May not work properly in all browsers)
from IPython.display import IFrame
IFrame(src="../data/county_auditor/OH-Franklin/documentation/ParcelDataReadme.pdf", height=800, width=1024)

In the following examples, we'll work on aggregating data from three different county auditor datasets. Our immediate objective is to collect any data regarding appraisal values, sale price, total area, the number of rooms, information about heating and cooling, and the year built for residential properties.

With the data loaded, we can now begin examining it. To start, we can see the complete list of columns in the dataset.  Compare this to the columns in the documentation - there are column names in the dataset that do not appear in the documentation and column names in the documentation that do not appear in the dataset.

In [None]:
franklin.columns.tolist()

We can view the first few rows using the DataFrame's *head()* method.  We'll increase the number of columns displayed to 200 to accommodate the datasets we'll be working with. 

In [None]:
pd.set_option('display.max_columns', 200)
franklin.head()

Examining these rows, we can get a sense of the type of data in each column.  We can also see that some values are `NaN` which stands for "Not a Number" and is used when no value is present, i.e. when data is missing.  We'll address these values later.

As noted earlier, we'd like to extract appraisal value, sale price, and other data for residential real estate.  Based on the documentation, the `PCLASS` field should indicate a parcel's property class.  The first few rows of data are consistent with the documentation.  To see all the values that appear in the `PCLASS` field, we can use the *unique()* method for that column.

In [None]:
franklin.PCLASS.unique()

We can also see the number of records with each value using the *value_counts()* method.

In [None]:
franklin.PCLASS.value_counts()

The documentation indicates that the `PROPTYP` field also includes property type information; however, the file we have from July 2015 doesn't include this field. Let's look for a different field that can help us filter our data by the type of property that each row represents.

<hr>

<a name="Exercise-1"></a><mark> **Exercise 1** Use the `unique()` and `value_counts()` methods in the cells below to confirm that the `DWELTYP` field contains only "1." or "nan" values.</mark>

<hr>

In a future unit, we'll explore how price or appraisal value depends on other factors such as the number of bathrooms or the year in which a building was built. To do this analysis, we'll need to extract the relevant data.

Based on the documentation and the first few rows of data, we might be interested in extracting the following columns from the larger dataset.

- `APPRLND`: 10 characters numeric representing the total appraised land value for taxable properties
- `APPRBLD`: 10 characters numeric representing the total appraised building/improvements value for taxable properties
- `LandUse`: 3 character numeric identifier representing the land use code for appraisal purposes
- `Cauv`   : 10 character numeric representing the agricultural use value
- `SCHOOL` : school district code which is supposed to be a 4 character code, but from our 2015 dataset it appears to be a string
- `HOMESTD`: (note the difference from the documentation of "HOMESTD" instead of "HOMSTD") 1 character indicating whether the property is eligible for a homestead exemption
- `TRANDT` : 10 character transfer date of the property (MM/DD/YYYY)
- `NAME1`, `NAME2`: Parcel owner's name lines 1 and 2 (25 characters max per field)
- `NBRHD`  : 6 character code (although our dataset appears to be comprised of mostly 4 digit codes) identifying the auditor's internal neighborhood code
- `PCLASS` : 1 character representing the parcel property class. Valid entries include:
    - C >> Commercial property
    - E >> Exempt property
    - F >> Agricultural property
    - I >> Industrial property
    - M >> Mineral
    - R >> Residential property
    - U >> Utility
- `PRICE`: 12 characters numeric representing the last known valid sale price
- `ACREA`: 10 characters representing an acreage amount
- `ROOMS`: 2 characters numeric field representing the total number of rooms.
- `BATHS`: 2 characters numeric field representing the number of full baths.
- `HBATHS`: 2 characters numeric field representing the number of half baths.
- `BEDRMS`: 2 characters numeric field representing the number of bedrooms.
- `AIRCOND`: 1 character representing central air. Valid entries are (0 >> None; 1 >> Heat; 2 >> Heat & Air)
- `FIREPLC`: 1 character (Y or N) representing presence of fireplaces
- `YEARBLT`: 4 characters representing what year the building was built. Valid entries include:
    - 1) any calendar year from 1920 to the present year.
    - 2) “OLD” for buildings remodeled before 1920.
    - 3) “E 99” where 99 is a valid calendar year for estimated year of remodel
- `USPS_CITY`: This undocumented field appears to contain the name of the city used by the US Postal Service (USPS).
- `AREA2`: Appears to be a renamed version of the `AREA` field noted in the documentation
- `Grade`: A useful field, but notice the mixed case of the column name

To extract these columns, we'll first create a list containing their names and then create a copy of the DataFrame consisting only of those columns. Let's switch over to using our sample dataset, `franklin_auditor_subset.csv`, which we saved with the `sample_file()` function.

In [None]:
# read in the subset data file, and assign to variable
franklin_subset_file = pd.read_csv('../data/county_auditor/OH-Franklin/franklin_auditor_subset.csv')

In [None]:
# home properties or fields we can use to filter out data
franklin_columns = ['ParcelNumber', 'APPRLND', 'APPRBLD', 'LandUse', 'Cauv', 'SCHOOL', 'HOMESTD', 'TRANDT',
                    'NAME1', 'NAME2', 'NBRHD', 'PCLASS', 'PRICE', 'ACREA', 'ROOMS', 'BATHS',
                    'ANN_TAX', 'DESCR1', 'TAXDESI', 'AREA2', 'DWELTYP', 'COND', 'Grade', 'USPS_CITY', 
                    'HBATHS', 'BEDRMS', 'AIRCOND', 'FIREPLC', 'YEARBLT', 'WALL']

# .copy() returns an independent DataFrame rather than a view of franklin_subset_file
franklin_subset = franklin_subset_file[franklin_columns].copy()

In [None]:
# re-run our frequency counts
franklin_subset.PCLASS.value_counts()

The data in `franklin_subset` is a copy of the source data file. We can manipulate the copy while leaving the full dataset unchanged. This can be helpful if we make a mistake or need to see what a value might have been prior to manipulation. Alternatively, we could just reload the data whenever necessary.
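
A small sketch of that independence, using a made-up DataFrame rather than the auditor data: changes made to the result of `.copy()` do not propagate back to the DataFrame it was copied from.

```python
import pandas as pd

# made-up stand-in for the DataFrame loaded from the subset file
source = pd.DataFrame({"PCLASS": ["R", "C", "R"], "PRICE": [100, 200, 300]})

# select one column and copy it so the result owns its own data
working = source[["PRICE"]].copy()
working["PRICE"] = working["PRICE"] * 2   # modify only the copy

print(source["PRICE"].tolist())    # [100, 200, 300]
print(working["PRICE"].tolist())   # [200, 400, 600]
```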

The first thing we can do is remove data for non-residential properties. As shown above, your version of the sample data file should contain between ~35,000 and 42,000 records which are coded "R", a residential property. To filter the data, we can use a mask and bracket notation with the DataFrame. The mask we'll need is one that evaluates to `True` when the value of `PCLASS` is *R*.

In [None]:
# filter the data using a mask
franklin_subset = franklin_subset[franklin_subset.PCLASS == 'R']

We can confirm that the DataFrame has been filtered by comparing the number of records in the `franklin_subset` DataFrame to the number of records in the `franklin_subset_file` DataFrame.

<hr>

<a name="Exercise-2"></a><mark> **Exercise 2** In the cell below, use the `<` comparison operator to confirm that the number of records in the `franklin_subset.PCLASS` Series is less than the number of records in the `franklin_subset_file.PCLASS` Series, and by extension, the DataFrame.</mark>

Once we've confirmed that the data has been filtered, we can drop the `PCLASS` column since it is no longer needed. To do this, we'll use the DataFrame's `drop()` method and specify the column name and axis. We'll specify an axis value of `1` to indicate that we'd like to drop a column as opposed to a value of `0` to drop a row. We'll also use the `inplace` keyword argument to indicate that we'd like to manipulate the DataFrame itself rather than to return a DataFrame with the dropped column.

In [None]:
# drop the PCLASS column
franklin_subset.drop(['PCLASS'], axis=1, inplace=True)

<hr>

Let's look at the appraisal-related fields, `APPRLND` and `APPRBLD`. From above, we know that the data in the `APPRBLD` field is stored as floating point numbers. We can use the `describe()` method to calculate some descriptive statistics for `APPRBLD`.

In [None]:
franklin_subset.APPRBLD.describe()

Notice that the minimum value is zero.  Let's see how many records have a building appraisal value of zero. Because the data is stored as floating point numbers, we should be aware of the [issues](https://docs.python.org/3/tutorial/floatingpoint.html) related to floating point values.  If we choose to continue working with the data as floating point values, we can use the [NumPy `isclose()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html) function to create a mask to compare values to zero.

In [None]:
import numpy as np
len(franklin_subset[np.isclose(franklin_subset.APPRBLD, 0)])
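
To see why an exact comparison can mislead, consider the sketch below. It uses `math.isclose()`, the standard-library scalar counterpart to `numpy.isclose()`, so it runs without any third-party packages. When comparing against zero, a relative tolerance alone can never succeed, so an absolute tolerance is needed; the value `1e-8` used here is an arbitrary choice for illustration.

```python
import math

total = 0.1 + 0.2

exact_match = (total == 0.3)              # exact comparison fails due to rounding error
close_match = math.isclose(total, 0.3)    # tolerant comparison succeeds
print(exact_match, close_match)           # False True

# comparing to zero: the default relative tolerance scales with the values themselves
tiny = 1e-12
zero_default = math.isclose(tiny, 0.0)                 # False
zero_abs_tol = math.isclose(tiny, 0.0, abs_tol=1e-8)   # True
print(zero_default, zero_abs_tol)
```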

As an alternative to working with floating point values, we can convert a column's datatype to `int` when appropriate. Here, an integer would represent whole dollar amounts and would be meaningful; even if decimals are used to record fractions of a dollar, we won't lose much information. As a Series, each column has an `astype()` method that can be used to convert the column's type. The method returns a copy, so we have to reassign the DataFrame's column when doing the conversion.

In [None]:
franklin_subset['APPRBLD'] = franklin_subset.APPRBLD.astype(int)
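
One caveat worth knowing, sketched here with made-up values: `astype(int)` raises an error if the column contains missing values, because `NaN` has no integer representation. Whether this matters for a given column depends on whether it has any missing data. The two workarounds shown are common options, not the only ones.

```python
import pandas as pd

values = pd.Series([125000.0, float("nan"), 0.0])   # made-up appraisal values

try:
    values.astype(int)          # fails: NaN cannot be converted to an integer
    conversion_failed = False
except ValueError:
    conversion_failed = True

filled = values.fillna(0).astype(int)   # option 1: substitute a value first
nullable = values.astype("Int64")       # option 2: nullable integer type keeps the gap

print(conversion_failed)
print(filled.tolist())
print(nullable.isna().sum())
```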

Now that the data is stored as integers, we can make comparisons more directly using the standard operators.

In [None]:
len(franklin_subset[franklin_subset.APPRBLD == 0])

Let's filter the data to include only the rows where the building appraisal value is greater than zero.

In [None]:
franklin_subset = franklin_subset[franklin_subset.APPRBLD > 0]

<hr>

<a name="Exercise-3"></a><mark> **Exercise 3** In the cell below, filter the `franklin_subset` DataFrame to exclude rows that have an `APPRLND` value of zero.</mark>

In [None]:
# redefine APPRLND as an int
franklin_subset['APPRLND'] = franklin_subset.APPRLND.astype(int)

In [None]:
# only keep those rows where APPRLND is greater than 0
franklin_subset = franklin_subset[franklin_subset.APPRLND > 0]

<hr>

To simplify comparisons and other analysis later, we might choose to combine the data related to the number of bathrooms into one column. Because the data currently distinguishes between full and half baths, we can calculate the total number of bathrooms as the sum of the value of `BATHS` and half the value of `HBATHS`. Note that this has the effect of counting two half-bathrooms as a full bathroom; while two half-bathrooms might affect price differently than a full bathroom, we'll ignore any such difference.

<hr>

<a name="Exercise-4"></a><mark> **Exercise 4** The `BATHS` and `HBATHS` columns should have a numeric data type (an integer or a floating point value) in order to calculate the combined value directly from the existing values. In the cell below, use the `dtypes` property to confirm that the `BATHS` and `HBATHS` columns have a numeric data type.</mark>

<hr>

Rather than using a for loop to calculate the number of bathrooms for each row, we can rely on pandas' support for element-wise multiplication and addition, which allows us to do the following. For this calculation we will treat missing data in the same way as a zero value. To do this, we use the `fillna()` (pronounced "fill N/A") method for the appropriate column and specify the value we'd like to use in place of missing values - zero, in this case.

In [None]:
franklin_subset["Bathrooms"] = franklin_subset.BATHS.fillna(0) + 0.5 * franklin_subset.HBATHS.fillna(0)

This calculates the number of bathrooms as defined above and stores each row's value in a new column named `Bathrooms`.

<hr>

<a name="Exercise-5"></a><mark> **Exercise 5** We no longer need the `BATHS` or `HBATHS` columns. In the cell below, use the *drop()* method to remove these columns from the `franklin_subset` DataFrame.</mark>

<hr>

Next, let's look at the `Bathrooms` column we created to see how frequently dwellings have different numbers of bathrooms.

In [None]:
franklin_subset.Bathrooms.value_counts()

As we can see, the `Bathrooms` column includes values of "0.0". At this point, we might contact the person responsible for maintaining the data to ask why residential dwellings can be listed as having zero bathrooms. However, a quick review of the first 50 or so records of our dataset offers some clues:

In [None]:
franklin_subset.head(50)

Moving on to the `FIREPLC` column, we can list the unique values to see that the dataset is inconsistent with the documentation; rather than containing a single 'Y' or 'N' character representing the presence or absence of a fireplace, the dataset instead contains integer values of 1 or 0.

In [None]:
franklin_subset.FIREPLC.unique()

You may also see that `nan` is among the values in your version of the `franklin_subset` dataset.

<hr>

<a name="Exercise-6"></a><mark> **Exercise 6** In the cell below, use the `fillna()` method with the `FIREPLC` column to replace missing values with zero. Either reassign the DataFrame's `FIREPLC` column with the modified data or specify `inplace=True` as an argument to `fillna()` to alter the column in-place.</mark>

In [None]:
# first replace nan values

# verify that it worked with a call of the unique() function on FIREPLC


Now we create a new variable, `Fireplaces`, which will be a boolean variable and contain the values of the `FIREPLC` variable. 

In [None]:
# create a new boolean column, 'Fireplaces', from the values in FIREPLC
franklin_subset['Fireplaces'] = franklin_subset.FIREPLC.astype(bool)

# verify that it worked by calling value_counts() on the new column
franklin_subset.Fireplaces.value_counts()

<hr>

At this point we have the following columns and data types.

In [None]:
franklin_subset.dtypes

[Top](#Top)

<a name="1-ConstructData"></a>
### 3. Construct Data

Before moving on to the next dataset, it might be useful to give the columns more descriptive names.  First, let's create a copy of the DataFrame in case we need access to the data in its current state later.

In [None]:
# copy dataframe to a new variable, home_data
home_data = franklin_subset.copy()

To rename the columns, we can use the DataFrame's `rename()` method. When calling the method, we can pass a dictionary that maps the existing column names to new names. We'll also specify that we want to change column names rather than index labels by specifying `axis=1` and that we'd like to alter the DataFrame itself rather than return a copy with the alteration using `inplace=True`.

In [None]:
home_data.rename(
    {'APPRLND': 'AppraisedTaxableLand',
     'APPRBLD': 'AppraisedTaxableBuilding',
     'SCHOOL':  'SchoolDistrict',
     'HOMESTD': 'HomesteadFlag',
     'TRANDT':  'TransferDate',
     'NAME1':   'OwnerNameLine1',
     'NAME2':   'OwnerNameLine2',
     'NBRHD':   'NeighborhoodCode',
     'PRICE':   'SalePrice',
     'ACREA':   'Acreage', 
     'AREA2':   'Area',
     'ROOMS':   'Rooms',
     'ANN_TAX': 'AnnualTaxes',
     'Cauv':    'CAUV',
     'DWELTYP': 'DwellingType',
     'COND':    'Condition',
     'BEDRMS':  'Bedrooms',
     'AIRCOND': 'AirConditioning',
     'YEARBLT': 'YearBuilt',
     'WALL':    'WallType'
    },
    axis=1,
    inplace=True
)

<hr>

<a name="Exercise-7"></a><mark> **Exercise 7** In the cell below, verify that the columns of the `home_data` DataFrame have been changed.</mark>

<hr>

As noted above, we'd also like to record whether or not a parcel includes heating.  Earlier we assumed that all residential properties in the data set did include heating.  To add a column with the same value for each row, we can write a statement that assigns that value to the new column in the DataFrame.  Similarly, we'll add a `County` column to indicate the source of the data.

In [None]:
home_data['Heat'] = True
home_data['County'] = "Franklin"

We can view the first few rows to examine the state of our data before moving on to the next dataset.

In [None]:
home_data.head()

Now is also a good time to use an additional feature of Jupyter Notebooks: the Save and Checkpoint option. At the top of the Jupyter Notebook menu bar, choose File -> Save and Checkpoint.

<img src="../images/Jupyter_how_to_image1.jpg"></img>

Doing so will allow you to close the notebook (and Jupyter, if you wish). Clicking on the floppy disk icon just below the 'File' menu option performs the same function. Saving your work in Jupyter Notebooks also saves the cell outputs, which can be helpful if you need to refer to your prior work but don't want to re-run all of the code.

If you shut down Jupyter, you will also shut down the kernel running Python in the background. In that case, when you reopen the notebook, Python will not be aware that you have imported modules, loaded datasets, changed variables, or done other important work. To quickly get back to where you left off without stepping through every cell in the notebook:
- (Optional) Choose the menu option Kernel -> Restart and Clear Output to clear prior code outputs
- Click on the last cell in the notebook that you executed
- Choose the menu option Cell -> Run All Above

<img src="../images/Jupyter_how_to_image2.jpg"></img>

Jupyter will re-run all of the preceding code in sequence, as if you were running each line from the command line. You can now continue where you left off!

[Top](#Top)

<a name="1-IntegrateData"></a>
### 4. Integrate Data

Now we will begin to use the same select, clean, and construct data steps that we performed with the Franklin County data on additional datasets. As we proceed, we will encounter some different data formats and ways that similar data is recorded which will require further work in order to integrate it with our Franklin County data. When integrating data into cohesive wholes in real-world analytical projects it is common to be required to manipulate and massage data to be useful for analysis. Recording the steps taken in the Data Preparation phase is important for providing context in any conclusions or recommendations at the end of the project. One of the benefits of using Jupyter Notebooks to perform our work is that all of that documentation takes place while we work - no more heavy-duty written work outside of our standard work process!

[Top](#Top)

### Example 2: Parsing Errors with Licking County Auditor Data

We can augment the Franklin County Auditor data with data from the [Licking County Auditor](https://www.lickingcountyohio.us/). The data we'll use was obtained directly from the auditor's site and was not sampled or modified. Let's try loading the data stored in `../data/county_auditor/OH-Licking/Parcels.txt`. When you run the `read_csv()` call below, you will receive an error:

In [None]:
licking = pd.read_csv("../data/county_auditor/OH-Licking/Parcels.txt")

The exception indicates that there was a problem parsing the data; specifically, pandas expected 8 fields but found 23 on line 3. This could be due to pandas incorrectly guessing what the delimiter is. We could use the csv module's `Sniffer` class to detect the delimiter but visual inspection will suffice.

In the code below, we'll print the first five lines.

In [None]:
line_number = 0 
with open("../data/county_auditor/OH-Licking/Parcels.txt") as infile:
    while line_number < 5:
        print(infile.readline())
        line_number += 1

Examining the output, we can see that the delimiter is probably a semicolon rather than a comma. The pandas `read_csv()` method takes a keyword argument, `delimiter`, that will allow us to specify the appropriate value.
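
For reference, here is a minimal sketch of the `Sniffer` approach mentioned earlier, run on a made-up sample rather than the real file. The field names and values are invented for illustration; only the `fld` prefix mirrors the auditor data.

```python
import csv

# made-up semicolon-delimited sample; the real file has many more columns
sample = (
    "fldParcel;fldOwner;fldLUC\n"
    "001;SMITH JOHN;510\n"
    "002;DOE JANE;511\n"
    "003;ACME LLC;430\n"
)

# restrict the candidates so letters or spaces are never mistaken for a delimiter
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
print(repr(dialect.delimiter))   # ';'
```

In practice the sample would be the first few kilobytes read from the file. Sniffing is a heuristic, so its answer is still worth checking by eye.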

*Note: You will encounter another error when running the below code:*

In [None]:
licking = pd.read_csv("../data/county_auditor/OH-Licking/Parcels.txt", delimiter=";")

The exception message indicates that pandas made it farther into the file before encountering an error. On line 11117, pandas expected to find 194 fields based on the previous lines but instead found 184. Let's investigate further.

While we could iterate through the file and collect the line or lines that are of interest to us, we can instead use the `linecache` module to access a specific line within a file. The code below extracts a typical line (one that did not cause a parser error) and the line that causes a problem. After extracting the lines, the code displays their content.

In [None]:
import linecache
typical_line = linecache.getline("../data/county_auditor/OH-Licking/Parcels.txt", 2)
error_line = linecache.getline("../data/county_auditor/OH-Licking/Parcels.txt", 11117)

display(typical_line)
display(error_line)

Printing the lines in their entirety isn't very revealing.  

<hr>

<a name="Exercise-8"></a><mark> **Exercise 8** A difference in the number of fields could be caused by a difference in the number of delimiters. In the cell below use the [`count()`](https://docs.python.org/3/library/stdtypes.html#str.count) method with each line to display the number of times the delimiter appears.</mark>

<hr>

It would be helpful if we could compare each field's values between the two lines. To do this we'll use the String [`split()`](https://docs.python.org/3/library/stdtypes.html#str.split) method to separate each line into a list of field values. In addition to the two lines we already have, we'll retrieve the first line from the data for column names. We can use the itertools module's [`zip_longest`](https://docs.python.org/3/library/itertools.html#itertools.zip_longest) function to combine the list of values extracted from each line for comparison.  

In [None]:
from itertools import zip_longest

header_line = linecache.getline("../data/county_auditor/OH-Licking/Parcels.txt", 1)

header_entries = header_line.split(";")
typical_entries = typical_line.split(";")
error_entries = error_line.split(";")

for entry in zip_longest(header_entries, typical_entries, error_entries):
    print(entry)

We'll start at the end of the output and work our way backwards. We can see that each valid value in the "error_line" output has been shifted by one field. As we look upwards through the fields, we will eventually see that this shift starts at the `fldTopo` header. The "typical line" has no value there, whereas the "error line" has a value that seems related to the value associated with the previous field, `fldLUC`. If we look back at the display of each line's content, we can see that "430 Resturant" and "cafeteria and/or bar" are separated by a semicolon but should be kept together rather than split apart as different field values; also note that "Restaurant" is misspelled in the source data. The source data should either quote any value that contains the delimiter or avoid using the delimiter within values.
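
To illustrate the point about quoting, the sketch below parses a made-up record with and without quotes around the problematic value. The trailing `LEVEL` value is invented for the example. Python's `csv` module honors the quotes, whereas a naive `split()` does not know about them.

```python
import csv
import io

# made-up records: the first field contains a semicolon
unquoted = '430 Resturant; cafeteria and/or bar;LEVEL'
quoted = '"430 Resturant; cafeteria and/or bar";LEVEL'

# naive splitting treats every semicolon as a field boundary
naive_fields = unquoted.split(";")
print(len(naive_fields))   # 3 fields instead of the intended 2

# the csv module keeps the quoted value intact
row = next(csv.reader(io.StringIO(quoted), delimiter=";"))
print(row)   # ['430 Resturant; cafeteria and/or bar', 'LEVEL']
```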

Now that we know what the problems are, there are a variety of ways to address them. One way is to replace all instances of "430 Resturant; cafeteria and/or bar" in the source text with something that doesn't contain a semicolon before loading it with pandas. In the code below, we assign the problematic value and its replacement value to variables. After reading the content of the file, we use the `replace()` method to substitute occurrences of the first value with the second. We then load the data into a pandas DataFrame. Because the pandas `read_csv()` function expects a file path or a file-like object rather than the raw text itself, we use the [`StringIO`](https://docs.python.org/3/library/io.html#io.StringIO) class to create a stream from the altered content. We specify "python" as the `engine` in the `read_csv()` call to avoid warnings about memory. The Python CSV engine provided by pandas is more feature-complete but is slower than the default C engine.

In [None]:
import io
old_value = "430 Resturant; cafeteria and/or bar"  # "Resturant" is spelled as it appears in the source file
new_value = "430 Resturant, cafeteria and/or bar"

with open("../data/county_auditor/OH-Licking/Parcels.txt") as infile:
    content = infile.read()
    
content = content.replace(old_value, new_value)
    
licking = pd.read_csv(io.StringIO(content), delimiter=";", engine="python")

# replace all the leading 'fld' portion of every column header to simplify outputs
licking.columns = licking.columns.str.replace("fld", "")

While this was relatively straightforward, there are disadvantages to this method.  The primary disadvantage here is that we iterate through the content of the file several times: first we read all the content, then we iterate through it to find and replace the problematic value, then iterate through it to load it into pandas; usually we only iterate through the file once when loading it into pandas.  While this is fine for relatively small files, we should avoid looping through the entirety of a file whenever possible.

An alternative method would be to make use of pandas' support for [regular expressions](https://docs.python.org/3.2/library/re.html) when specifying the delimiter. We can use a [negative look-behind assertion](https://www.regular-expressions.info/lookaround.html) to indicate that a delimiter is any semicolon that isn't immediately preceded by the string "Resturant".  We could do this with the following call to *read_csv()*:

```python
licking = pd.read_csv("../data/county_auditor/OH-Licking/Parcels.txt", delimiter=r"(?<!Resturant);", engine="python")
```
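The effect of the look-behind can be checked on a single string with the standard `re` module. This is only a sketch: the line below is a shortened, hypothetical fragment, not a real record from the file.

```python
import re

# Hypothetical fragment of a problem line; the real file has many more fields.
line = "R1;430 Resturant; cafeteria and/or bar;Level"

naive = line.split(";")                        # splits on every semicolon
guarded = re.split(r"(?<!Resturant);", line)   # skips the semicolon right after "Resturant"

print(len(naive), naive)
print(len(guarded), guarded)
```

The naive split yields one field too many, while the guarded split keeps the land-use description together. Note that this pattern is tied to one specific value; a different value containing a semicolon would still be split.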

With the data loaded, let's display the first few lines to get a sense of the data.

In [None]:
licking.head()

As with the Franklin County dataset, we'd like to filter this dataset for only residential buildings. Unfortunately, there isn't documentation available to describe the content of each column, so we'll have to do our best to infer meaning from the column names and values. Looking at the data above, it looks like `PropertyType` or `Style` might be useful to determine which properties are residential and which are not.

In [None]:
licking.PropertyType.unique()

In [None]:
licking.Style.unique()

It looks like most of the style values are related to residential-type properties. At this point, we might choose specific styles to filter on, or simply exclude records without style information and those that correspond to a commercial style.

Let's see the styles associated with the `Dwelling` property type.  

In [None]:
licking[licking.PropertyType == 'Dwelling'].Style.unique()

Filtering the data to include only the *Dwelling* property type didn't reduce the number of styles. For this example, we'll filter the data to include only *Single Family*, *MFD Home*, *Tri-Level*, *Duplex*, *Bi-Level*, *Multi-Level*, *Condominum*, *Mobile Home*, *Triplex*, and *4-Level*. Note that *Condominum* is misspelled in the source data.

We can create a list of acceptable style values now that can be used to filter the data later.

In [None]:
licking_styles = ['Single Family', 'MFD Home', 'Tri-Level', 'Duplex',
                  'Bi-Level', 'Multi-Level', 'Condominum', 'Mobile Home',
                  'Triplex', '4-Level']

Let's consider the other columns we'll need.  We had collected sales price data from the Franklin county dataset.  In this dataset, there are quite a few columns with "sales" in the name.

In [None]:
[column for column in licking.columns if "sales" in column.lower()]

Looking at the sample data above, we will likely be interested in the collection of `SalesPrice` columns to determine the sales price. Let's look at the values of these columns for a small number of rows.

In [None]:
licking[['SalesPrice1', 'SalesPrice2', 'SalesPrice3', 'SalesPrice4']].head(10)

We have nonzero, zero and `NaN` values.  In addition to these columns, the data also contains `SalesDate` columns. To get a better idea of what the prices represent, let's look at the date columns as well.

In [None]:
licking[['SalesPrice1', 'SalesPrice2', 'SalesPrice3', 'SalesPrice4', 
         'SalesDate1', 'SalesDate2', 'SalesDate3', 'SalesDate4']].head(10)

As we move from the first price/date column to the second, the second price/date column to the third, and so on, we move backward in time.  It seems reasonable then that `SalesPrice1` represents the most recent sales price and the other columns are used to record historic sales data (if it exists).  We'll use the most recent sales price for our work so we'll only need `SalesPrice1`.

Just as with the word "sale", there are a number of columns that contain the word "area". 

<hr>

<a name="Exercise-9"></a><mark> **Exercise 9** In the cell below, display all the columns with "area" in their name.</mark>

<hr>

Here, we'll assume `FinishedLivingArea` contains the data needed for area.

Next, let's look for bathroom data.

In [None]:
[column for column in licking.columns if "bath" in column.lower()]

We have columns corresponding to both full and half bathrooms as before but there is a third column for "other".  Let's see what values for this field look like.

In [None]:
licking.OtherBaths.value_counts()

The values themselves don't give a clear idea of what the field represents.  Given the lack of documentation, we'd likely contact the person or group responsible for the data for clarification; for our work here, we'll assume this field corresponds to quarter bathrooms.  

Examining the sample data above, we can identify the other columns of interest. Specifically, we'll extract the following columns from the Licking County dataset.

In [None]:
licking_columns = ["ParcelNo","Owner","Grade","Condition","MarketLand","LUC","SchoolDistrict",
                   "TaxDistrict","Neighborhood","Subtotal","CAUVTotal","LegalDesc","PropertyType",
                   "Exterior","MarketImprov","SalesDate1","SalesPrice1","AcreageTotal",
                   "FinishedLivingArea","Rooms","Bedrooms","FullBaths","HalfBaths","OtherBaths",
                   "Heating","Cooling","FireplaceOpenings","YearBuilt","MailingAddress5"]

We can create a mask to filter the data based on style values. Rather than compare one value to another as we did when filtering the Franklin County data, we'll instead check whether a value is among a list of values. To do this, we can use the column's `isin()` method to test a value's membership in a specified list.

Below we check if each value in the `Style` column is in the `licking_styles` list we created earlier.

In [None]:
licking.Style.isin(licking_styles)

We can apply this mask in the usual way using bracket notation.

In [None]:
licking_subset = licking[licking.Style.isin(licking_styles)].copy()

We can confirm that the filtered data contains only the style values we had wanted.

In [None]:
licking_subset.Style.unique()
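The same membership test can be tried on a toy DataFrame to see exactly what `isin()` returns. The style names and the missing value below are illustrative only.

```python
import pandas as pd

# Toy data; the style names are only illustrative.
toy = pd.DataFrame({"Style": ["Duplex", "Warehouse", None, "Triplex"]})
wanted = ["Duplex", "Triplex"]

mask = toy.Style.isin(wanted)   # boolean Series, one entry per row
kept = toy[mask]                # rows whose Style is in the list
dropped = toy[~mask]            # ~ inverts the mask

print(mask.tolist())
```

Rows with a missing style are treated as non-members, so they are excluded along with the unwanted styles.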

Now, let's extract only the columns we want. We use bracket notation again with the list of columns we specified above.

In [None]:
licking_subset = licking_subset[licking_columns]

We can display the first few rows of `licking_subset` to confirm we've extracted what we wanted.

In [None]:
licking_subset.head()

We can combine the various bathroom columns into one `Bathroom` column in the same way we combined them for the Franklin County dataset.

<hr>

<a name="Exercise-10"></a><mark> **Exercise 10** In the cell below, combine the values for full baths, half baths and other baths into one column named `Bathrooms`. Assume the value in `OtherBaths` is equivalent to a quarter of a full bathroom. Additionally, drop the original bathroom-related columns after computing the values for the new column.
</mark>

<hr>

Let's look at heating and cooling data. We extracted two columns from the original dataset, `Heating` and `Cooling`, that contain heating and cooling data, respectively. Let's look at the heating data first.

In [None]:
licking_subset.Heating.value_counts()

The target dataset doesn't differentiate among different heating types - it only indicates whether the property has heating or not. For the Licking County data, we'd like to associate `False` with `No Heat` and `True` otherwise. We can do this by comparing values.

In [None]:
licking_subset.Heating = licking_subset.Heating != "No Heat"

We can calculate the value counts of the field to confirm that the number of `False` entries corresponds to the previous number of `No Heat` entries.

In [None]:
licking_subset.Heating.value_counts()

<hr>

<a name="Exercise-11"></a><mark> **Exercise 11** In the cell below, replace the values in the `Cooling` column with `True` to indicate that a property has cooling and `False` otherwise.

</mark>

<hr>

We can assume `FireplaceOpenings` corresponds to the number of fireplaces. As we did with the Franklin County dataset, let's create a new variable, `Fireplaces`, that is a boolean representation of whether a residence has at least one fireplace.

In [None]:
licking_subset['FireplaceOpenings'] = licking_subset.FireplaceOpenings.fillna(0)  # assign back rather than filling in place
licking_subset['Fireplaces'] = licking_subset.FireplaceOpenings.astype(bool)
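The order of these two steps matters. The sketch below uses a small made-up Series to show why missing values are filled before the boolean conversion.

```python
import numpy as np
import pandas as pd

# Made-up counts of fireplace openings, including one missing value.
openings = pd.Series([2, np.nan, 0, 1])

filled = openings.fillna(0)           # treat "unknown" as zero openings
has_fireplace = filled.astype(bool)   # any nonzero count becomes True

# Without fillna, NaN is converted to True, which would overstate fireplaces.
unfilled_flag = openings.astype(bool)

print(has_fireplace.tolist())
```

Treating a missing count as zero is itself an assumption about the data; it is the same assumption the cell above makes.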

That should be the last modification needed for the Licking County data.  In order to combine the `franklin_subset` and `licking_subset` DataFrames, we need to make sure they have the same column names.  We'll rename columns in the same way we did previously.

In [None]:
licking_subset.rename(
    {'ParcelNo': 'ParcelNumber',
     'Owner': 'OwnerName',
     'MarketLand': 'AppraisedTaxableLand',
     'MarketImprov': 'AppraisedTaxableBuilding',
     'Subtotal': 'AnnualTaxes',
     'Neighborhood': 'NeighborhoodCode',
     'CAUVTotal': 'CAUV',
     'LegalDesc': 'LegalDescription',
     'PropertyType': 'DwellingType',
     'Exterior': 'WallType',
     'SalesDate1': 'TransferDate',
     'SalesPrice1': 'SalePrice',
     'AcreageTotal': 'Acreage',
     'FinishedLivingArea': 'Area', 
     'Heating': 'Heat',
     'Cooling': 'AirConditioning',
     'MailingAddress5': 'USPSCity',
     'LUC': 'LandUse'
    },
    axis=1 ,
    inplace=True
)

In [None]:
licking_subset.head()

We can also add a column to identify the source of this data.

In [None]:
licking_subset['County'] = "Licking"

We can confirm that the columns in `home_data` and `licking_subset` are the same using the Series [`eq()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.eq.html) method. Any differences will have to be corrected before merging the data. 

In [None]:
cols1 = pd.Series(home_data.columns.sort_values())
cols2 = pd.Series(licking_subset.columns.sort_values())
cols1.eq(cols2)

Oops! We have a different number of columns between the two data sets. Let's view the column lists in alphabetical order to see the error we've made:

In [None]:
display(home_data.columns.sort_values())
display(licking_subset.columns.sort_values())
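Reading two sorted lists side by side works, but a set-style comparison points directly at the names that differ. The sketch below uses two empty DataFrames with hypothetical column lists and the `difference()` method of the pandas `Index`.

```python
import pandas as pd

# Hypothetical column sets standing in for the two real DataFrames.
left = pd.DataFrame(columns=["OwnerName", "SalePrice", "Heat"])
right = pd.DataFrame(columns=["OwnerNameLine1", "OwnerNameLine2", "SalePrice", "Heat"])

only_left = left.columns.difference(right.columns)    # names missing from right
only_right = right.columns.difference(left.columns)   # names missing from left

print(list(only_left))
print(list(only_right))
```

When both results are empty, the two DataFrames share the same column names regardless of column order.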

Ah, we have two different fields for the Owner Name in `home_data` - which comes from Franklin County. We could develop some additional rules for concatenating these two fields, but for the sake of time and brevity in this lesson we'll just drop the `OwnerNameLine2` field, and rename `OwnerNameLine1` to `OwnerName` so that it matches the field structure we have from the Licking County data set.

In [None]:
# rename OwnerNameLine1 column to OwnerName along with other column name cleanups in the home_data data set
home_data.rename(
    {'DESCR1': 'LegalDescription',
     'TAXDESI': 'TaxDesignation',
     'USPS_CITY': 'USPSCity',
     'OwnerNameLine1': 'OwnerName'
    },
    axis=1 ,
    inplace=True
)

# drop the OwnerNameLine2 and HomesteadFlag columns from home_data
home_data.drop(['OwnerNameLine2','HomesteadFlag'], axis=1, inplace=True)

# now recheck
cols1 = pd.Series(home_data.columns.sort_values())
cols2 = pd.Series(licking_subset.columns.sort_values())
cols1.eq(cols2)

It turns out we have been inconsistent in naming the count of fireplace openings and the fireplace Y/N flag fields, as well as `TaxDesignation` vs. `TaxDistrict`. Let's clean this up:

In [None]:
home_data.rename({'FIREPLC':'FireplaceOpenings','Fireplaces':'FireplacesFlag'}, axis=1, inplace=True)
licking_subset.rename({'Fireplaces':'FireplacesFlag','TaxDistrict':'TaxDesignation'}, axis=1, inplace=True)

# now recheck
cols1 = pd.Series(home_data.columns.sort_values())
cols2 = pd.Series(licking_subset.columns.sort_values())
cols1.eq(cols2)

Excellent! Our dataframes now match in column naming conventions. We can proceed with appending them together.

<a name="Concept-Append"></a>

<hr>

#### Concept: Append DataFrames
To see how we can combine the DataFrames, let's look at an example. We'll start with two DataFrames, each with columns `A` and `B` and with two rows.

In [None]:
d1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
d2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BA'))

display(d1)
display(d2)

To append the content of one DataFrame to the end of another, we can use the DataFrame `append()` method.  We specify `ignore_index=True` to prevent duplication of index labels. We also pass the parameter `sort=True` to make this teaching example easy to follow, but using this parameter can be computationally expensive and potentially unnecessary on large datasets.

In [None]:
d1.append(d2, ignore_index=True, sort=True)
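In pandas 2.0 and later, the DataFrame `append()` method is no longer available. If the cell above raises an `AttributeError`, the same result can be obtained with `pd.concat()`, as in the sketch below; it accepts the same `ignore_index` and `sort` parameters.

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
d2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("BA"))

# Columns are matched by name, not by position, so d2's "BA" order is handled.
combined = pd.concat([d1, d2], ignore_index=True, sort=True)
print(combined)
```

The later `append()` calls in this notebook can be rewritten the same way, for example `pd.concat([home_data, licking_subset], ignore_index=True)`.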

[Top](#Top)

<hr>

Note that the `append()` method does not modify the original DataFrames directly but instead returns the combined DataFrame.  We can append the `licking_subset` DataFrame to `home_data` and assign the result to `home_data`.

*Note: Not using the `sort=True` parameter here saves us a few CPU cycles, but you may receive a warning from pandas that the default behavior will switch to `sort=False` in future versions. You can ignore this warning here.*

In [None]:
home_data = home_data.append(licking_subset, ignore_index=True)

We can see that `home_data` now has data for two counties.

In [None]:
home_data.County.value_counts()

Why are there more entries for Licking County than for Franklin County? Remember, we haven't taken a 10% sample of entries the way that we did with Franklin County. We need to correct this before proceeding.

We could use a variation of the sampling function we wrote earlier for reading in Franklin County data, but in this case, our data already resides in our `home_data` DataFrame. As with most things in Python, there is a simpler way to sample a subset of a DataFrame without much fuss. We simply use the pandas DataFrame indexers [`loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc) and [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc). These allow us to reference a set of rows or columns by label (`loc`) or by integer position (`iloc`). In our case, we'll limit our results to rows where `County` is "Licking" and then sample from there, using a random list of positions to select our subset of rows.

In [None]:
licking_subset_sample = home_data.loc[home_data['County'] == "Licking"]  # break out Licking County rows from home_data
t = len(licking_subset_sample)
s = int(0.1 * t)  # fraction of total
rows = random.sample(range(t), s)  # iloc is positional, so draw positions from 0 to t-1
sample_licking = licking_subset_sample.iloc[rows]
sample_licking.describe()
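pandas also provides a built-in `sample()` method that draws a random fraction of rows directly. The sketch below uses a toy DataFrame; the `random_state` value is an arbitrary choice made only so the draw is repeatable.

```python
import pandas as pd

# Toy stand-in for home_data: 5 Franklin rows and 20 Licking rows.
toy = pd.DataFrame({"County": ["Franklin"] * 5 + ["Licking"] * 20,
                    "SalePrice": range(25)})

licking_rows = toy[toy.County == "Licking"]
sample = licking_rows.sample(frac=0.1, random_state=0)  # 10% of the Licking rows

print(len(sample))
```

This avoids building the list of positions by hand and keeps the original index labels on the sampled rows.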

Now that we've confirmed we've got a 10% sample of the Licking County data, let's re-build our `home_data` dataset.

In [None]:
home_data = home_data.loc[home_data['County'] == "Franklin"]
home_data = home_data.append(sample_licking, ignore_index=True)
home_data.County.value_counts()

There, that's better. Let's save our progress by writing to a CSV file, but this time we'll use the pipe '|' separator instead of a comma so that we don't run into any further problems when we want to read in the file.

In [None]:
home_data.to_csv('../data/unit2_home_data.csv', sep='|', encoding='utf-8', index=False)  # index=False keeps the row index out of the file

[Top](#Top)

### 4. Integrate Data, Continued...

We continue our work in integrating datasets by now turning to an unexpected dataset format that still provides plenty of value to our analysis: geo-location information. Although this is but one example of gathering data to integrate with other datasets, think of the possibilities of combining other types of datasets: Twitter real-time messages with geo-location information, geo-location information with transactional purchasing information, or even demographic data with pharmacy counter sales. There are endless permutations of data that a project could explore! The possibilities may be endless, but the problem statement of any good analytical project is not. Remain focused on answering the questions of your project to help guide the data that you gather and how you integrate it.

[Top](#Top)

<a name="GIS-dataset-extraction"></a>

<hr>

### Extraction from a GIS Dataset: Fairfield County Auditor Data

The final dataset we'll work with is the [Fairfield County Auditor Data](https://www.co.fairfield.oh.us/gis/). This data is stored as [Geographic Information System (GIS)](https://en.wikipedia.org/wiki/Geographic_information_system) data, so loading it won't be as straightforward as reading a text file. To access the data, we'll use the [GeoPandas](http://geopandas.org/) library, which will load the GIS data into a DataFrame so we can work with it in the same way we manipulated the other datasets. Recall that we installed the library using `pip` (or `conda`) at the beginning of this notebook; with the library installed, we can import it.

In [None]:
import geopandas
import pandas as pd

The GIS data we're working with is stored in a [format](http://doc.arcgis.com/en/arcgis-online/reference/shapefiles.htm) specified by [ESRI](https://www.esri.com/en-us/home), the developer of [ArcGIS](https://www.arcgis.com/features/index.html), a popular GIS software product. Data in this format is stored across several files and can be distributed as a single zip file. We can load the data from the zip file using GeoPandas' `read_file()` function. The data we'll be using is stored in `data/county_auditor/OH-Fairfield/fairfield_gis_parcels.zip`.

In [None]:
# https://www.co.fairfield.oh.us/gis/
fairfield = geopandas.read_file("zip://../data/county_auditor/OH-Fairfield/fairfield_gis_parcels.zip")

The object returned by the `read_file()` function is a [GeoDataFrame](http://geopandas.org/data_structures.html#geodataframe), an extension of the pandas DataFrame with additional functionality. The attributes and methods we've used with other DataFrames are available when working with GeoDataFrames. For example, we can see the first few rows in the Fairfield County data using the `head()` method.

In [None]:
fairfield.head()

For the most part, this looks like what we'd expect for county auditor data. The last column, however, is something we haven't seen yet.  The `geometry` column contains [data](http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-classes/geometry.htm) used to represent the location and shape of geometric features.  We can use this data to construct plots with map data.  

In the code below, we use the [Matplotlib](https://matplotlib.org/) library to create a plot; we'll work with this library again later. Because the geometric data represents geographic objects on Earth's surface, position information is stored using a [coordinate reference system](http://geopandas.org/projections.html). The code below first displays the dataset's reference system and then sets that reference system explicitly before plotting. In the resulting plot of the county, which includes [Lancaster](https://www.google.com/maps/place/Lancaster,+OH+43130/@39.7234464,-82.678719,12z/data=!3m1!4b1!4m5!3m4!1s0x88478a5e4f80f267:0x136dd5d79e3b4de5!8m2!3d39.7136754!4d-82.5993294), we can see features such as roads.

In [None]:
fairfield.crs

In [None]:
# lancaster
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (16, 16)  # a square figure size works best in this situation
fairfield = fairfield.to_crs(epsg=3735)  # use the result from crs above to determine how to plot the map
fairfield.plot()

Returning to the task at hand, let's work on cleaning/filtering the Fairfield County data and merging it with the existing data.  

Access to the data that was used to construct the GIS dataset is available through a link on the Fairfield County Auditor's site. The data is hosted on an external [site](http://downloads.ddti.net/fairfieldoh/) and includes a description of the database structure. We can load the documentation in the notebook; the description for dwelling data is given on page 13.

In [None]:
from IPython.display import IFrame  # harmless if already imported earlier in the notebook
IFrame("../data/county_auditor/OH-Fairfield/02-fairfield-description.pdf", 1024, 1280)  # see page 13

Based on the first few rows and the documentation, we might be able to filter the data based on the values in the `CLASS` column.  Let's look at its values.

In [None]:
fairfield.CLASS.value_counts()

In [None]:
fairfield.columns.tolist()

We'll assume the value `R` represents residential data. We can see that we'll likely need the following columns as well:

- `PARID`: Parcel ID (not the same format as Franklin and Licking County Parcel Numbers)
- `ACRES`: Acreage
- AirConditioning: We'll create this column using the `HEAT` column in a moment
- `APRLAND`: Appraised land value
- `APRBLDG`: Appraised building value
- `SFLA`: Area (Total Living Area)
- `RMBED`: Bedrooms
- `FIXBATH`: Bathrooms
- `FIXHALF`: Half-bathrooms
- County (we'll add this later)
- `LEGAL1`: Legal description
- `OWN1`: Owner 1
- `OWN2`: Owner 2
- `WBFP_O`: 'Fireplaces'
- `GRDFACT`: Grade - notice this is different than the documentation
- `LUC`: LandUse
- `MCITYNAME`: 'USPS_CITY'
- `YRBLT`: Year built
- `RMTOT`: Rooms
- `HEAT`: Heat
- `TRANSDT`: Transfer Date
- `PRICE`: Sale price
- `EXTWALL`: Wall type
- We don't have any of these columns: NeighborhoodCode, SchoolDistrict, TAXDESI, HomesteadFlag, DwellingType, CAUV, Condition, AnnualTaxes

We can filter the data and extract the columns of interest.

In [None]:
fairfield_columns = ['PARID', 'ACRES', 'APRLAND', 'APRBLDG', 'SFLA', 'RMBED', 
                     'FIXBATH', 'FIXHALF', 'LEGAL1', 'OWN1', 'OWN2', 'WBFP_O',
                     'GRDFACT', 'LUC', 'MCITYNAME', 'YRBLT', 'RMTOT', 'HEAT',
                     'PRICE', 'EXTWALL', 'TRANSDT']
fairfield_subset = fairfield[fairfield.CLASS == 'R'][fairfield_columns].copy()

Let's look at the first few rows of the DataFrame.

In [None]:
fairfield_subset.head()

We can combine the `FIXBATH` and `FIXHALF` columns into a single column using the same method we used for the Franklin County data.

In [None]:
fairfield_subset['Bathrooms'] = fairfield_subset.FIXBATH + 0.5 * fairfield_subset.FIXHALF
fairfield_subset.drop(["FIXBATH", "FIXHALF"], axis=1, inplace=True)

Turning to the `HEAT` column, the data documentation indicates that the values in this column represent a "heat code".

In [None]:
fairfield_subset.HEAT.value_counts()

Among the tables in the database available online is one that defines these values.  The heat codes are as follows:

- 1: None
- 2: Basic
- 3: Air conditioning
- 4: Heat Pump

We'll have to extract both heating and cooling data from this column. Notice that the output of `value_counts()` includes a row count for what appears to be missing data.  Before continuing, let's try to determine why there is a missing value. To begin, let's use the `unique()` method for a better representation of the distinct values.

In [None]:
fairfield_subset.HEAT.unique() 

The *None* entry in the variable `HEAT` is ambiguous since the documentation does not indicate that this would be a valid entry. At this point we need to decide whether we should assume that a missing value means no heat or some other type of heating that doesn't correspond to a code. Let's look at a few rows where `HEAT` is missing.

<hr>

<a name="Exercise-12"></a><mark> **Exercise 12** Using a mask and the `head()` method, display the first five rows of `fairfield_subset` where the `HEAT` column `isna()`.
</mark>

<hr>

It appears that records where `HEAT` is a null value correspond to parcels with no living area (the variable `SFLA`). Let's investigate how many rows have no living area, and some attributes about those rows using the `describe()` method once again, before modifying our dataset.

In [None]:
fairfield_subset[fairfield_subset.HEAT.isna()].describe()

By looking at the `ACRES`, `APRLAND`, and `PRICE` variables, we can see that these are all fairly large plots of land without any livable building space, sold at widely varying prices: the mean price was about 283k dollars with a standard deviation of 1.67M dollars. These are most likely farm fields or other plots of land that have been zoned residential but have no home on them. Additionally, there are 9,671 rows in the dataset with no value for `HEAT`, which is about 20% of our dataset. If we did not remove these entries from our data, they would distort any characterization of home prices in the central Ohio region. Therefore, we filter the data again to exclude records with a null value in `HEAT`.

In [None]:
fairfield_subset = fairfield_subset[fairfield_subset.HEAT.notna()]

This leaves the following values in the `HEAT` column.

In [None]:
fairfield_subset.HEAT.value_counts()

While we could write code that iterates through the rows of the DataFrame and sets heating and cooling values at the same time, it is easier to split this into two tasks: set the cooling value, then set the heating value.

We can create a new column, `AirConditioning`, based on whether or not `HEAT` has a value of '3'. Because the column contains strings, it's important that our masks compare the column's values to another string rather than an integer, i.e., our mask for air conditioning should be

```python
fairfield_subset.HEAT == '3'
```

rather than

```python
fairfield_subset.HEAT == 3
```

In [None]:
fairfield_subset['AirConditioning'] = fairfield_subset.HEAT == '3'
fairfield_subset.AirConditioning.value_counts()
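The difference between the two masks is easy to miss because neither raises an error. A small sketch with made-up heat codes stored as strings shows what each comparison returns.

```python
import pandas as pd

# Made-up heat codes, stored as strings like the HEAT column.
codes = pd.Series(["1", "2", "3", "4", "3"])

int_mask = codes == 3     # compares strings to an integer: never equal
str_mask = codes == "3"   # compares strings to a string: matches as intended

print(int_mask.sum(), str_mask.sum())
```

The integer comparison silently marks every row `False`, which would have produced an `AirConditioning` column with no `True` values.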

Similarly, we can assign a new value to `HEAT` based on the existing value. 

In [None]:
fairfield_subset.HEAT = fairfield_subset.HEAT != '1'
fairfield_subset.HEAT.value_counts()

Let's see what the data looks like.

In [None]:
fairfield_subset.head()

The final steps are to rename the columns, add data about the source, and append the Fairfield subset to our larger dataset.


<hr>

<a name="Exercise-13"></a><mark> **Exercise 13** In the cell below, rename the columns of the `fairfield_subset` DataFrame so they are consistent with the columns in `home_data`.
</mark>

In [None]:
fairfield_subset.rename(
    {'PARID': 'ParcelNumber',
     'ACRES': 'Acreage',
     'APRLAND': 'AppraisedTaxableLand',
     'APRBLDG': 'AppraisedTaxableBuilding',
     'SFLA': 'Area',
     'RMBED': 'Bedrooms',
     'LEGAL1': 'LegalDescription',
     'OWN1': 'OwnerName',
     'WBFP_O': 'FireplaceOpenings',
     'GRDFACT': 'Grade',
     'LUC': 'LandUse',
     'MCITYNAME': 'USPSCity',
     'YRBLT': 'YearBuilt',
     'RMTOT': 'Rooms',
     'HEAT': 'Heat',
     'PRICE': 'SalePrice',
     'EXTWALL': 'WallType',
     'TRANSDT': 'TransferDate'
    },
    axis=1 ,
    inplace=True
)

Remember, we've thus far been looking at all of Fairfield County's data. Before we go further, it would be wise to sample our data now so that we don't have to do so after joining it to the larger dataset of `home_data`. 

In [None]:
import random
t2 = len(fairfield_subset)
s2 = int(0.1 * t2)  # fraction of total
rows = random.sample(range(t2), s2)  # iloc is positional, so draw positions from 0 to t2-1
fairfield_subset = fairfield_subset.iloc[rows]
fairfield_subset.describe()

Let's reload `home_data` from the file we saved earlier and once again compare the `fairfield_subset` DataFrame to it; we'll add the `County` column in a moment.

In [None]:
home_data = pd.read_csv('../data/unit2_home_data.csv', sep='|', encoding='utf-8', low_memory=False)

In [None]:
cols1 = pd.Series(home_data.columns.sort_values())
cols2 = pd.Series(fairfield_subset.columns.sort_values())
cols1.eq(cols2)

In [None]:
display(home_data.columns.sort_values())
display(fairfield_subset.columns.sort_values())

Remember that we didn't have all of the necessary columns in our Fairfield dataset that we had available in Franklin and Licking county datasets. Let's create some placeholder fields in the Fairfield dataset and set their values to nulls (*NaN* to be precise) so that we can cleanly merge `home_data` and `fairfield_subset` together. We'll deal with the null value fields in a programmatic way in a little bit.

In [None]:
import numpy as np
fairfield_subset['AnnualTaxes'] = np.nan
fairfield_subset['CAUV'] = np.nan
fairfield_subset['Condition'] = np.nan
fairfield_subset['County'] = 'Fairfield'
fairfield_subset['DwellingType'] = np.nan
fairfield_subset['FireplacesFlag'] = fairfield_subset.FireplaceOpenings > 0
fairfield_subset['NeighborhoodCode'] = np.nan
fairfield_subset.drop(['OWN2'], axis=1, inplace=True)
fairfield_subset['SchoolDistrict'] = np.nan
fairfield_subset['TaxDesignation'] = np.nan
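When many placeholder columns are needed, the DataFrame `reindex()` method offers a more compact route: any requested column that doesn't exist yet is created and filled with `NaN`. The column names and values below are a shortened, hypothetical example.

```python
import pandas as pd

# Hypothetical target layout and a DataFrame that has only some of the columns.
target_columns = ["ParcelNumber", "SalePrice", "CAUV", "Condition"]
partial = pd.DataFrame({"ParcelNumber": ["A1", "B2"],
                        "SalePrice": [100000, 250000]})

# Existing columns are kept; missing ones are added and filled with NaN.
aligned = partial.reindex(columns=target_columns)
print(aligned)
```

Note that `reindex()` also drops any column that is not in the requested list, so the list should name every column worth keeping.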

<hr>

Now we can append the `fairfield_subset` dataframe to our `home_data` dataframe.

In [None]:
home_data = home_data.append(fairfield_subset, ignore_index=True)

We can see that our dataset contains data from three counties.

In [None]:
home_data.County.value_counts()

Often when we work with data, we encounter duplication - repetition of data.  We can see if `home_data` contains duplicate data by comparing the number of rows that would be left if we removed duplicates using the `drop_duplicates()` method to the number of rows in the current DataFrame.

In [None]:
len(home_data.drop_duplicates())/len(home_data)

[Top](#Top)

<hr>

<a name="Concept-DataManip"></a>
#### Concept: Data Manipulation Resulting in Data Duplication

The result above shows that about 0.0001% of our data corresponds to duplicates. While the original data might not have contained duplicates, we created what appear to be duplicates by removing columns. To see how this happened more clearly, consider the following example DataFrame:

In [None]:
tmp = pd.DataFrame([[1, 2, 3], [1, 2, 4]], columns=list('ABC'))
tmp

The two rows of data are distinct. However, if we decide we no longer need column `C`, the rows will appear to be duplicates.

In [None]:
tmp.drop(['C'], axis=1, inplace=True)
tmp

We can calculate the percentage of rows that would be left after removing duplicates as we did above.

In [None]:
len(tmp.drop_duplicates())/len(tmp)
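To inspect the duplicated rows themselves rather than just count them, the DataFrame `duplicated()` method can be used as a mask. The sketch below uses a made-up three-row example; `keep=False` marks every member of a duplicate group, not only the repeats.

```python
import pandas as pd

# Made-up example: the first two rows are identical.
example = pd.DataFrame([[1, 2], [1, 2], [5, 6]], columns=list("AB"))

dupes = example[example.duplicated(keep=False)]   # all rows that have a twin
share_removed = 1 - len(example.drop_duplicates()) / len(example)

print(dupes)
print(share_removed)
```

Looking at the matching rows side by side helps decide whether they are distinct records that merely look alike or genuine recording errors.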

[Top](#Top)

<hr>

While the rows in `home_data` might have corresponded to distinct properties to begin with, at this point there is no way to distinguish between duplicated rows. Be careful with transactional data when this happens; it could be that those duplicate rows were only unique because of some existing errors in how data was recorded. By ignoring the duplication you could be double-counting transactional activity, or inflating statistical counts and averages incorrectly for the purposes of your analyses. In this case, however, we will leave the duplicates in the data knowing that they do represent different properties, not simply transactional data recording errors.

Let's look at the data types of our columns.

In [None]:
home_data.dtypes

Notice that `Area` data is stored as objects.  We would probably like to store area as an integer or as a floating point value.  Let's try to convert the area values to integers.

*Note: You will receive an error message when attempting to run the next step of code.*

In [None]:
home_data.Area = home_data.Area.astype(int)

The exception message indicates that some of the values contain commas. Before continuing, note the following.

In [None]:
from collections import defaultdict

types = defaultdict(int)
for value in home_data.Area:
    value_type = type(value)
    types[value_type] += 1

types

The `Area` data contains some data stored as integers and other data stored as strings. The number of strings corresponds to the number of entries from Licking County, so the area data was likely stored with commas in that dataset. To resolve this, we can iterate through each row and, if the data is a string, remove any commas and convert to an integer. To iterate through the rows of a DataFrame we can use `iterrows()`.

In [None]:
for index, row in home_data.iterrows():
    if isinstance(row.Area, str):
        new_value = int(float(row.Area.replace(',', '')))  # first remove commas, then cast everything to a float, and finally to int
        home_data.loc[index, 'Area'] = new_value

Iterating through the DataFrame row-by-row can be slow and should generally be avoided. An alternative approach is to create a function that would handle one value at a time and use the DataFrame's [`apply()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) method.
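As a rough sketch of that alternative, the cell below applies a small helper to a made-up `Area` column that mixes integers and comma-formatted strings. The `sample` DataFrame and the `to_int` helper are illustrative names, not part of the lesson data, and the sketch uses the column's (Series) `apply()` method because only one column is involved.

```python
import pandas as pd

# Hypothetical sample mimicking a column with ints and comma-formatted strings
sample = pd.DataFrame({"Area": [1200, "1,850", "2,400.0", 975]})

def to_int(value):
    """Convert one Area value to int, removing commas from strings first."""
    if isinstance(value, str):
        return int(float(value.replace(",", "")))
    return int(value)

# Series.apply calls the helper once per value; astype(int) then sets the dtype
sample["Area"] = sample["Area"].apply(to_int).astype(int)
print(sample.dtypes)
```

The final `astype(int)` makes the integer dtype explicit, so a later `dtypes` check does not depend on how pandas infers the result of `apply()`.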

Let's look at the data types for each column now.

In [None]:
home_data.dtypes

As we loaded each dataset, we checked some columns for missing values, represented by `NaN`, by examining the value counts of those columns. We can check our `home_data` DataFrame for missing values we might have overlooked. In the code below, we first use the `isna()` method to return `True` for every value that is `NaN` and `False` otherwise.  We then calculate the sum of `True` values for each column using the `sum()` method.

In [None]:
home_data.isna().sum()

Many columns contain missing values. For most of the columns, we might choose to remove any rows that contain missing information; for example, if we plan to use the data to see how different factors affect sales price, we probably don't want to keep records missing price data. Before we start dropping rows, let's address the `NeighborhoodCode` column, which has a significant number of missing values. Recall that when we were working with the Fairfield data, we didn't have any neighborhood codes, so we filled in the column with *NaN* values in order to join to our `home_data` dataset. We can confirm that all the missing `NeighborhoodCode` values correspond to records from Fairfield County.

In [None]:
home_data[home_data.NeighborhoodCode.isna()].County.value_counts()

When working through the next unit's learning objectives, we may need to filter our data further to remove rows with missing values. Some Python modules that produce charts and graphs won't work with columns containing missing values, and other statistical tools that we will begin learning about in Unit 4 won't produce valid results with missing values. However, for now we are still building our dataset, and can leave these entries as-is.
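When that filtering does become necessary, one option is the DataFrame's `dropna()` method with its `subset` argument, which removes rows only when specific columns are missing. The cell below is a small sketch on made-up data; the column names mirror ones used in this lesson, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names mirror the lesson's, the values are made up
sample = pd.DataFrame({
    "County": ["Franklin", "Licking", "Fairfield"],
    "SalePrice": [250000.0, np.nan, 180000.0],
    "NeighborhoodCode": ["A1", "B2", np.nan],
})

# Drop only rows missing SalePrice; a missing NeighborhoodCode is tolerated
priced = sample.dropna(subset=["SalePrice"])
print(priced)
```

Restricting the check to the columns an analysis actually needs avoids discarding, for example, every Fairfield record just because it lacks a neighborhood code.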

We'll store the data in a SQLite database using SQLAlchemy and the DataFrame's [`to_sql()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) method; the first argument to this method specifies the table name we'd like to use, and the `if_exists='replace'` argument indicates that we would like to replace any existing table of that name.

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///../data/output.sqlite')
home_data.to_sql("home_data", con=engine, if_exists='replace')

To read data back from the SQLite database, we use the pandas function `read_sql_query()`. (Now you can easily read/write your data to a simple SQLite database from within Jupyter without having to re-run all prior steps to re-create the same datasets!)

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///../data/output.sqlite')

query = "SELECT * from home_data;"
home_data = pd.read_sql_query(query, con=engine)
display(home_data.head())
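One detail worth knowing about this round trip: by default `to_sql()` also writes the DataFrame's index, which comes back as an extra column named `index` when the table is read. The sketch below illustrates this with a made-up DataFrame. As an assumption made to keep the example self-contained, it uses an in-memory database through Python's built-in `sqlite3` module (which pandas also accepts) rather than the SQLAlchemy engine used above.

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch leaves no file behind (assumption: the
# lesson itself uses a SQLAlchemy engine pointing at ../data/output.sqlite)
conn = sqlite3.connect(":memory:")

sample = pd.DataFrame({"County": ["Franklin", "Licking"], "Area": [1500, 2100]})

# Default behavior writes the DataFrame index as an extra column named "index"
sample.to_sql("with_index", con=conn, if_exists="replace")
# index=False writes only the data columns
sample.to_sql("without_index", con=conn, if_exists="replace", index=False)

with_index = pd.read_sql_query("SELECT * FROM with_index;", con=conn)
without_index = pd.read_sql_query("SELECT * FROM without_index;", con=conn)
conn.close()

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether to keep the index depends on whether it carries meaning; for a default integer index, passing `index=False` keeps the stored table identical in shape to the DataFrame.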

[Top](#Top)

## Joining Related Datasets

In the previous examples, we worked on appending the rows of one DataFrame to another. Pandas supports a [variety of methods](https://pandas.pydata.org/pandas-docs/stable/merging.html) of combining data. Another common method is similar to a [database join](https://en.wikipedia.org/wiki/Join_%28SQL%29), where we combine the columns of two or more datasets. Pandas DataFrames provide two methods that can be used for data joins: `join()` and `merge()`. The `join()` method can be used when combining datasets based on index values, and the `merge()` method can be used to combine datasets on any column values as well as index values; `merge()` is the more general method, and `join()` ultimately relies on `merge()` to combine DataFrames. While it is best practice for database administrators to use explicit join conditions to architect high-performing databases, performance is rarely a pressing concern in data analytics projects. By definition, we are working on a project, not building a system, and therefore are not as concerned about the performance or stability of a system. If the `merge()` method is appropriate to use and completes your tasks faster, use it!
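To make the difference between the two methods concrete, the sketch below combines two made-up tables both ways. The `homes` and `prices` DataFrames, along with their values, are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical miniature tables; names and values are made up for illustration
homes = pd.DataFrame({"county": ["Franklin", "Licking"], "area": [1500, 2100]})
prices = pd.DataFrame({"county": ["Franklin", "Licking"],
                       "median_listing_price": [240000, 210000]})

# merge() matches rows on a named column
merged = homes.merge(prices, on="county")

# join() matches on the index, so the key column is moved into the index first
joined = homes.set_index("county").join(prices.set_index("county")).reset_index()

print(merged)
print(joined)
```

Both approaches pair each home with the price row for its county; `merge()` simply avoids the extra index bookkeeping when the key is an ordinary column.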

In the next example, we'll load a new dataset from [Realtor.com](https://www.realtor.com/research/data/). We will then join some of this data to our auditor data file, `home_data` to create a new time-series based dataset. In a future lesson, we'll explore relationships within and between these datasets but let's just focus on how we can combine the datasets first. We begin by downloading the Realtor.com single-family home, county level data from the [Realtor.com](https://www.realtor.com/research/data/) site. *Note: To avoid network issues, we've included the file in your lesson materials: `./data/Realtor_com_Data/RDC_InventoryCoreMetrics_County_sfh.csv`.*

<hr>

<a name="Exercise-14"></a><mark> **Exercise 14** In the cell below, use the Pandas `read_csv()` function to load the data from `./data/Realtor_com_Data/RDC_InventoryCoreMetrics_County_sfh.csv` and store it in a variable named `realtor_county`. You might encounter a warning message about columns containing mixed types. One solution is to use the more feature-complete Python engine rather than the faster C engine; to do this, add `engine='python'` as an argument to `read_csv()`.
</mark>

As with most of the datasets we've dealt with in this lesson, we need to select, clean, and construct the dataset before integrating it with our existing `home_data`. Review the following code, run it, and describe what is happening in the Markdown cell, below:

In [None]:
display(realtor_county.describe())
display(realtor_county.columns.tolist())
display(realtor_county.head())

realtor_columns = ['Month', 'CountyName', 'CountyFIPS', 'Nielsen Rank', 'Median Listing Price', 'Active Listing Count ',
    'Days on Market ', 'New Listing Count ', 'Price Increase Count ', 'Price Decrease Count ', 'Pending Listing Count ',
    'Avg Listing Price', 'Total Listing Count', 'Pending Ratio']

realtor_subset = realtor_county[realtor_columns].copy()

<mark>Write your observation of what is happening in the code above, here:</mark>

<hr>

As we can see, this new dataset contains monthly information about aggregate home sales as recorded by Realtor.com for each county in the country, including our three central Ohio counties. Let's describe our data now.

In [None]:
realtor_subset.CountyName.value_counts()

In [None]:
realtor_subset.describe()

Very good. We have 83 months of data per county, and a nice set of aggregate measurements for home sales. Importantly, we now have data such as Nielsen Rank, Days on Market, and more information about listing vs. final sales prices which we did not have available in our auditor data. Now that we see some of the core data available, let's prepare it for joining to our existing `home_data` dataset.

We can use the same functionality that we used earlier to eliminate all but our three counties, Fairfield, Franklin, and Licking. We can also eliminate the trailing `", OH"` portion of the CountyName variable so that we can join it to our dataset. We'll store the result in a new field called `County`, and drop the `CountyName` variable.

In [None]:
# limit result set to the counties we're interested in
counties_list = {'Fairfield, OH','Franklin, OH','Licking, OH'}
realtor_subset = realtor_subset[realtor_subset.CountyName.isin(counties_list)].copy()

# clean up the CountyName values so we can store in new column County
realtor_subset['County'] = realtor_subset['CountyName'].str.replace(r', OH', '', regex=True)
realtor_subset.drop(['CountyName'], axis=1, inplace=True)
display(realtor_subset.head())

<hr>

We can now see that we have fairly good data for time series analysis (minimally), but our `Month` column has a lot of unnecessary time values: `00:00:00`. We can use the `replace()` method to replace the `00:00:00` values with an empty string - effectively removing it. We will then use the `strip()` method to remove any leading or trailing white space.  

We could iterate through the rows of the DataFrame using a for-loop; however, this tends to be slow. An alternative method is to use the DataFrame's `apply()` method, which lets us specify a function to run on every row or column without writing the loop ourselves. Note that `apply()` still calls the function once per row or column, so it is a convenience rather than true [vectorization](https://en.wikipedia.org/wiki/Array_programming).

In the code below, we first define the function that we would like to apply.  The function is written as though it is applied one row at a time, and pandas handles the iteration for us. We define our function `remove_time` first by specifying what we want to happen in each row. We then invoke the function by using the `apply()` method, specifying `axis=1`, which passes each row to the function so that we can access the values of any of its columns. If we had multiple date-time fields like `Month` in our table, this would make cleaning up formats very easy indeed! Since this is a simple dataset, we could have chosen to replace the values in the column `Month` directly, but for example purposes we're learning the benefit of the `apply()` method.

In [None]:
def remove_time(row):
    month = row['Month']
    # remove timestamp and any leading or trailing white space
    return month.replace("00:00:00", "").strip()
    
realtor_subset["Month"] = realtor_subset.apply(remove_time, axis=1)

In [None]:
realtor_subset.head()
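For comparison, replacing the values in a single column directly can be sketched with the column's `.str` accessor. The `sample` DataFrame below is made up, and the format of its `Month` values is an assumption based on the description above.

```python
import pandas as pd

# Hypothetical sample in the same shape as the Month values described above
sample = pd.DataFrame({"Month": ["2019-03-01 00:00:00", "2019-02-01 00:00:00"]})

# The .str accessor applies string methods to every value in the column
sample["Month"] = sample["Month"].str.replace("00:00:00", "", regex=False).str.strip()
print(sample)
```

Passing `regex=False` treats the pattern as literal text, which is all that is needed here.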

Good, now our dataset is cleaned up and ready to be joined. In order to join `home_data` and `realtor_subset`, we'll specify the columns whose values will be compared for matching. To simplify this, it is helpful to use the same column names, including the same case, in both datasets. For our data, we can convert the column names in both datasets to lowercase.

In [None]:
home_data_tmp = home_data.rename({col: col.lower() for col in home_data.columns}, axis=1)
home_data_tmp.dtypes

In [None]:
realtor_subset_tmp = realtor_subset.rename({col: col.lower() for col in realtor_subset.columns}, axis=1)
realtor_subset_tmp.dtypes

Before we merge, let's get rid of the extra spaces in the column names by replacing " " characters with "_".

In [None]:
realtor_subset_tmp.columns = realtor_subset_tmp.columns.str.replace(" ", "_")
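Several of the column names selected earlier, such as `'Active Listing Count '`, end with a space, so a plain replacement leaves a trailing underscore in the new name. If that is unwanted, one option is to strip whitespace before replacing. The sketch below compares the two on a made-up DataFrame; only the two column names are taken from the lesson.

```python
import pandas as pd

# Two column names copied from the realtor_columns list above; values are made up
sample = pd.DataFrame({"Median Listing Price": [240000], "Active Listing Count ": [310]})

# Plain replacement keeps the trailing space as a trailing underscore
plain = sample.columns.str.lower().str.replace(" ", "_")

# Stripping first removes leading/trailing whitespace before the replacement
stripped = sample.columns.str.strip().str.lower().str.replace(" ", "_")

print(plain.tolist())
print(stripped.tolist())
```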

Because this dataset will be useful to us in the future, let's save it now to our SQLite database:

In [None]:
#from sqlalchemy import create_engine
#engine = create_engine('sqlite:///../data/output.sqlite')
realtor_subset_tmp.to_sql("realtor_history", con=engine, if_exists='replace')

To merge the data, we can use the `merge()` method of one of the two DataFrames. When using `merge()`, we need to specify the other DataFrame and the columns used for matching.

In [None]:
# to save time typing, let's rename this dataset as our "Point-in-Time DataFrame (pit_df)"
pit_df = home_data_tmp.merge(realtor_subset_tmp, on=["county"])
pit_df.head(100)

Now it appears we've duplicated data again. Let's examine why.

First, notice that the information from the `home_data` dataset repeats until row 83, where the next property begins. If we think about our two datasets, what did they contain? The `home_data` dataset contains one row for every property, so each county appears in many rows. The `realtor_subset` dataset contains one row per county per month of aggregate data, so each county appears in 83 rows. Because `county` is not unique in either dataset, merging on `county` alone is a many-to-many join: every property row is paired with all 83 monthly rows for its county.

Second, we should ask at what point in time our county auditor datasets were gathered. It would make little sense to combine December 2018 `realtor_subset` prices with 2010 `home_data` auditor information. Market conditions for housing prices, sales volumes, and consumer behavior were wildly different between those two years!
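The row multiplication described in the first point can be reproduced on a small scale. The sketch below uses made-up `homes` and `monthly` DataFrames (hypothetical names and values) and also shows the `validate` argument of `merge()`, which asks pandas to raise an error when the keys are not unique on the side we expect.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets (values are made up)
homes = pd.DataFrame({"parcel": ["P1", "P2"], "county": ["Franklin", "Franklin"]})
monthly = pd.DataFrame({
    "county": ["Franklin"] * 3,
    "month": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "median_listing_price": [230000, 235000, 240000],
})

# Each of the 2 properties matches all 3 monthly rows for its county: 2 * 3 = 6 rows
all_months = homes.merge(monthly, on="county")
print(len(all_months))

# validate="many_to_one" asks pandas to confirm county is unique on the right side
try:
    homes.merge(monthly, on="county", validate="many_to_one")
    duplicated_keys = False
except pd.errors.MergeError:
    duplicated_keys = True
print(duplicated_keys)

# Filtering to one month first makes county unique, so the row count is preserved
one_month = homes.merge(monthly[monthly.month == "2019-03-01"],
                        on="county", validate="many_to_one")
print(len(one_month))
```

Using `validate` is optional, but it turns a silent row explosion into an immediate error, which can save time on larger merges.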

How can we solve this problem? Since we are only interested in showing the possibilities of the `merge()` method, let's pick just one month's aggregate numbers out of our `realtor_subset` data. We can then merge that one month of data with our original `home_data` dataset. Doing so will allow us to calculate a variety of metrics and attributes about a given property. We'll find uses for these additional metrics and attributes as we explore model building activities in Units 3 and 4 of this course.

First, let's see if we have good data for the month of March, 2019:

In [None]:
realtor_subset_tmp[realtor_subset_tmp.month == '2019-03-01']

Yes, we have good data for all three counties. Let's now merge our datasets using only March, 2019 data from the realtor data.

In [None]:
pit_df = home_data_tmp.merge(realtor_subset_tmp[realtor_subset_tmp.month == '2019-03-01'], on=["county"])
pit_df.head(10)

Notice that we filtered the realtor data inline, above. Rather than creating more datasets and variables, we just replaced our original code:

```python
pit_df = home_data_tmp.merge(realtor_subset_tmp, on=["county"])
pit_df.head(100)
```

with new, inline filtering on the `realtor_subset_tmp` dataframe from our previous step where we verified that we had useful data for March, 2019:

```python
pit_df = home_data_tmp.merge(realtor_subset_tmp[realtor_subset_tmp.month == '2019-03-01'], on=["county"])
pit_df.head(10)
```

It looks like the merge was successful.  We can use the DataFrame's *shape* property to see the number of columns and rows it contains.

In [None]:
pit_df.shape

Let's save this DataFrame to the database we created previously so we can access the data later.

<hr>

<a name="Exercise-15"></a><mark> **Exercise 15** In the cell below, use the `pit_df` DataFrame's `to_sql()` method to save the data in the same database where we stored the property information. Use the table name `pit_df`.
</mark>

Another dataset we will use later in this course can be the reverse of a single point-in-time. We could join all of the auditor results to each and every month's data from Realtor.com. To build an accurate time-series dataset we would need to retrieve all of the datasets corresponding to each month's data from each auditor's website. This step is too time consuming for this lesson, and we would find that most of the central Ohio auditor websites do not have good historical data readily available online as of March, 2019. An exception is Franklin County, which we will explore in more detail in a later lesson.

[Top](#Top)

<hr>

<a name="Lab-Answers"></a>
## Exercise Answers

1.  ```python
    franklin.PROPTYP.unique()
    ``` 

    or 

    ```python
    franklin.PROPTYP.value_counts()
    ```
    
    
2. ```python
   len(franklin_subset) < len(franklin)
   ```
   
   
3. ```python
   franklin_subset['APPRLND'] = franklin_subset.APPRLND.astype(int)
   franklin_subset = franklin_subset[franklin_subset.APPRLND > 0]
   ```
   
   
4. ```python
   franklin_subset[['BATHS', 'HBATHS']].dtypes
   ```
   
   
5. ```python
   franklin_subset.drop(['BATHS', "HBATHS"], axis=1, inplace=True)
   ```
   
   
6. ```python
   franklin_subset.FIREPLC = franklin_subset.FIREPLC.fillna(value=0)
   ```
   
   or
   
   ```python
   franklin_subset.FIREPLC.fillna(value=0, inplace=True)
   ```
   
   
7. ```python
   home_data.columns
   ```
   
   
8. ```python
   display(typical_line.count(";"))
   display(error_line.count(";"))
   ```
   
   
9. ```python
   for column in licking.columns:
       if "area" in column.lower():
           display(column)
   ```
   
   
10. ```python
    licking_subset['Bathrooms'] = (licking_subset.fldFullBaths.fillna(0) + 
                                   0.5 * licking_subset.fldHalfBaths.fillna(0) + 
                                   0.25 * licking_subset.fldOtherBaths.fillna(0))
    licking_subset.drop(["fldFullBaths", "fldHalfBaths", "fldOtherBaths"], axis=1, inplace=True)
    ```
    
    
11. ```python
    licking_subset.fldCooling = licking_subset.fldCooling == "Central"
    ```
    
    
12. ```python
    fairfield_subset[fairfield_subset.HEAT == ''].head()
    ```
    
    
13. ```python
    fairfield_subset.rename(
        {'PARID': 'ParcelNumber',
         'ACRES': 'Acreage',
         'APRLAND': 'AppraisedTaxableLand',
         'APRBLDG': 'AppraisedTaxableBuilding',
         'SFLA': 'Area',
         'RMBED': 'Bedrooms',
         'LEGAL1': 'LegalDescription',
         'OWN1': 'OwnerName',
         'WBFP_O': 'FireplaceOpenings',
         'GRDFACT': 'Grade',
         'LUC': 'LandUse',
         'MCITYNAME': 'USPSCity',
         'YRBLT': 'YearBuilt',
         'RMTOT': 'Rooms',
         'HEAT': 'Heat',
         'PRICE': 'SalePrice',
         'EXTWALL': 'WallType',
         'TRANSDT': 'TransferDate'
        },
        axis=1 ,
        inplace=True
    )
    ```


14. ```python
    realtor_county = pd.read_csv("../data/Realtor_com_Data/RDC_InventoryCoreMetrics_County_Hist.csv")
    ```
    
    or 
    
    ```python
    realtor_county = pd.read_csv("../data/Realtor_com_Data/RDC_InventoryCoreMetrics_County_Hist.csv", engine='python')
    ```


15. ```python
    #from sqlalchemy import create_engine
    #engine = create_engine('sqlite:///../data/output.sqlite')
    pit_df.to_sql("pit_df", con=engine, if_exists='replace')
    ```

[Top](#Top)

## Next Steps

Now that we've loaded and cleansed the data to some extent, the next step is to begin exploring it. In the next unit, we'll calculate simple, descriptive statistics and create exploratory visualizations using the data we prepared in this unit.

## Resources and Further Reading

- [GeoPandas Documentation](http://geopandas.org/)
- [*Data Cleaning: Problems and Current Approaches* by Rahm and Do](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.8661&rep=rep1&type=pdf)
- [*Data Mining: Concepts and Techniques* by Han, Pei, and Kamber, Section 3.2: Data Cleansing (Safari Books)](http://proquest.safaribooksonline.com.cscc.ohionet.org/book/databases/data-warehouses/9780123814791/3dot-data-preprocessing/32_data_cleaning?uicode=ohlink)
- [*Python Data Science Handbook* by VanderPlas, Chapter 3: Data Manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html)
- [*Python for Data Analysis* by Wes McKinney, Chapter 7: Data Cleaning and Preparation (Safari Books)](http://proquest.safaribooksonline.com.cscc.ohionet.org/book/programming/python/9781491957653/data-cleaning-and-preparation/data_preparation_html?uicode=ohlink)

[Top](#Top)