<a name="Top"></a>

# Group Project 1

## Contents

* [Lab Description](#Lab-Description)
* [Initial Imports & Settings](#Imports)
* [Get Data](#Get-Data)
    * [Franklin County Auditor Data](#Franklin-County-Audit-Data)
        * [Sample Franklin County Auditor Data](#SampleData)
    * [Delaware County Audit Data](#Delaware-County-Audit-Data)
    * [Hocking County Audit Data](#Hocking-County-Audit-Data)
* [Examine Data](#Examine-data)
    * [Decide which columns are common across counties](#ID-Columns)
* [Merge the Counties](#Merge)
    * [Create Merge Files for each County](#Create-merge-files)
        * [Madison Merge File](#Madison-merge)
        * [Morrow Merge File](#Morrow-merge)
        * [Franklin Merge File](#Franklin-merge)
    * [Validate Columns for All Merge Files](#Validate-columns)
    * [Merge Files into One (Using APPEND)](#Append-files)
        * [Validate All 3 Counties are in the Merged File](#Validate)
* [Explore Data](#Explore-data)
    * [Get dtypes for each Column](#Check-dtypes)
    * [Describe Data](#Describe)
    * [Get Value Counts](#Get-value-counts)
    * [Check for Duplicates](#Check-duplicates)
    * [Check for NaN values](#Check-isna)
* [Clean Data](#Clean-data)
    * [Clean Data Approach](#Clean-data-appraoch)
* [Results](#Results)
* [Next Steps](#Next-Steps)

<a name="Lab-Description"></a>
## Lab Description
Your groups have been assigned by Nationwide, and now it is time to begin working together to build towards your Capstone presentations! Your first group project will be to gather the additional county auditor data, clean it, merge it, and begin to understand it for those counties that are considered part of the Columbus Ohio MSA: Franklin, Fairfield, Licking, Delaware, Hocking, Madison, Morrow, Perry, Pickaway, and Union. We have already worked with Franklin, Fairfield, and Licking County auditor datasets during Unit 2, and will continue working with these datasets, but your job as a group is to combine all pf the county datasets together.

A copy of auditor data for all of these counties is attached to this entry on Blackboard. Note that Franiklin, Fairfield, and Licking county data is not included since that data was already made available in Unit 2 Exercises. Additionally, Union County data is unavailable at this time - Bonus points to any team who can figure out how to get it!

Group Project 1 presentations will be at the beginning of class on June 20, and are to be 10 minutes or less in length. Presentations for this first group project can be done by one ore more members of the group. You will be graded on the following for this first group presentation:

    Presentation of basic statistics and charts showing how your team chose to sample (or not sample) data files, what format you stored it in, and any interesting facts you found while reviewing basic statistics about the data.
    What your team would like to study next about the data.
    Presentation succinctness: less than 10min in duration, highlight main points, highlight any key questions or concerns about the data that your team had as you performed the work.

Remember that this data can and should be used by your group during your Capstone work, so take your time building it well. This is your team's chance to perform its first practice run at presenting to an audience - enjoy it! The pressure to have robust data and impressive data visualization and modeling techniques can wait until your Capstone.

<a name="Imports"></a>
# Python Imports
Initial Imports and settings

In [3]:
import linecache
import random
import samplingu2 as sampling 
import pandas as pd

# pandas settings
pd.set_option('display.max_column', 200)

<a name="Get-Data"></a>
# County Auditor Data
Gather the county auditor data for those Ohio counties that are considered part of the Columbus Ohio MSA: Franklin, Fairfield, Licking, Delaware, Hocking, Madison, Morrow, Perry, Pickaway, and Union. 

### Note
Load full dataset for initial analysis on Delaware and Hocking.

<a name="Delaware-County-Audit-Data"></a>
### Delaware County Auditor Data
<strong>Special Note:</strong> Delaware uses "|" as a delimiter.

In [4]:
delaware = pd.read_csv('../data/county_auditor/OH-Delaware/governmaxextract.txt', delimiter="|", low_memory=False)

<a name="Hocking-County-Audit-Data"></a>
### Hocking County Auditor Data
<strong>Special Note:</strong> Hocking uses "," as a delimiter.

In [22]:
hocking = pd.read_csv('../data/county_auditor/OH-Hocking/Grid_salef_2019-03-20.csv', delimiter=",", low_memory=False)

<a name="Examine-data"></a>
## Examine Data
We would like a glimpse of the data for each of these, so we will use the head() function

In [6]:
delaware.head()

Unnamed: 0,mpropertyNumber,mmapNumber,mcamaId,CardCount,NeighborhoodCode,mlegalDescription,mlocStrDir,mlocStrNo,mlocStrNo2,mlocStrName,mlocStrSuffix,mlocStrSuffixDir,msecondaryAddress,mlocCity,mlocState,mlocZipCode,mlocDescription,mClassificationId,macres,mfrontfootage,mdeedVolume,mdeedPage,Par_Deleted,DeededOwner,mlot,msublot,OwnName,OwnFirstName,OwnMiddleInitial,OwnLastName,OwnNameSuffix,OwnBusiness,OwnStreetName,OwnStreetNumber,OwnStreetDirection,OwnStreetSuffix,OwnSufDir,OwnSecondaryAddress,OwnCity,OwnState,OwnZipcode,OwnCountry,OwnAttentionline,TaxpName,TaxpFirstName,TaxpMiddleInitial,TaxpLastName,TaxpNameSuffix,TaxpBusiness,TaxpStreetName,TaxpStreetNumber,TaxpStreetDirection,TaxpStreetSuffix,TaxpStreetSuffixDir,TaxpSecondaryAddress,TaxpCity,TaxpState,TaxpZipcode,TaxpCountry,TaxpAttentionline,Spec_Flag,mtaxsetCode,mstateSchoolcode,mvaluedBy,MKT_Land_Value,MKT_Impr_Value,CAUV_Value,MKT_Total_Value,MKT_Tot_Total,SaleDate,ValidSale,SaleAmount,NumberOfPropertiesInSale,InstumentType,InstumentDate,Conveyance,DeedNumber,Hmstd_Flag,Foreclosed,NewConstruction,Reduction25,BORflag,DividedProperty,LandVal,ImproveVal,SpecAssessment,AnnualTax,TaxesPaid,Delinquent,mroutingNumber,Zoning,Parcel_Type,ForeclosureStep,TresCode,BC,activeYear,LenderId,TaxLienStatus,OwnOtherInfo,mFORCmessageid,msubdivision
0,100-100-00-000-251,,,0.0,,RANDALL-HOWISON DITCH ^11016 ...,,,,,,,,RADNOR,OH,43066,,101.0,0.0,0.0,,,N,DELAWARE COUNTY DITCH ASSESSMENT,,,DELAWARE COUNTY DITCH,,,DELAWARE COUNTY DITCH,,1.0,,,,,,,RADNOR,OH,43066,USA,ASSESSMENT,DELAWARE COUNTY,,,DELAWARE COUNTY,,1.0,SANDUSKY,101,N,POINT,,,DELAWARE,OH,43015,USA,COMMISSIONERS,Y,38.0,2102.0,MAN,0,0,0,0,0,11-23-1999,N,0,0,0,1999-11-23 12:00:00,0.0,,N,N,N,N,N,N,0,0,1,42.83,42.83,0.0,,,0,,S 9/02,,1990,0,,,0,0
1,100-100-00-000-252,,,0.0,,RANDALL-HOWISON DITCH ^11016 ...,,,,,,,,RADNOR,OH,43066,,101.0,0.0,0.0,,,N,THOMPSON TWP DITCH ASSESSMENT,,,THOMPSON TWP DITCH,TWP DITCH,,THOMPSON,,0.0,,,,,,,RADNOR,OH,43066,USA,ASSESSMENT,THOMPSON TOWNSHIP,THOMPSON,T,,,0.0,STATE ROUTE 257,4373,,,N,,RADNOR,OH,43066,USA,,Y,38.0,2102.0,MAN,0,0,0,0,0,11-23-1999,N,0,0,0,1999-11-23 12:00:00,0.0,,N,N,N,N,N,N,0,0,1,3.63,3.63,0.0,,,0,,S11/99,,1990,0,,,0,0
2,100-100-00-000-253,,,0.0,,RANDALL-HOWISON DITCH ^11016 ...,,,,,,,,RADNOR,OH,43066,,101.0,0.0,0.0,,,N,ODOT DITCH ASSESSMENT,,,ODOT DITCH ASSESSMENT,DITCH ASSESSMENT,,ODOT,,0.0,,,,,,,RADNOR,OH,43066,USA,,ODOT,ODOT,,,,0.0,WILLIAM,400,E,ST,,,DELAWARE,OH,43015,USA,,Y,38.0,2102.0,MAN,0,0,0,0,0,11-23-1999,N,0,0,0,1999-11-23 12:00:00,0.0,,N,N,N,N,N,N,0,0,1,20.21,20.21,0.0,,,0,,LT4/00,,1990,0,,,0,0
3,100-100-01-001-000,,,1.0,38002.0,LANDS SURVEY 675 ^ ...,,7511.0,,DAVIS-KIRK,RD,,,PROSPECT,OH,43342,,511.0,1.0,0.0,49.0,1746.0,N,BURNS JASON E HORNER JACQUELINE S,,,BURNS JASON E,JASON,E,BURNS,,0.0,DAVIS-KIRK,7511.0,,RD,,,PROSPECT,OH,43342,USA,HORNER JACQUELINE S,CORELOGIC,CORELOGIC,,,,0.0,HACKBERRY,3001,,,,,IRVING,TX,75063,USA,,N,39.0,5101.0,RVL,23800,146200,0,170000,170000,09-12-2000,Y,127000,1,16,2000-09-12 12:00:00,3898.0,,N,N,N,Y,N,N,8330,51170,0,2367.34,2367.34,0.0,,,0,,TU1213,,1990,1521,,,0,0
4,100-100-01-002-000,,,1.0,38002.0,LANDS SURVEY 675 ^ ...,,7491.0,,DAVIS-KIRK,RD,,,PROSPECT,OH,43342,,511.0,1.0,0.0,,,N,REEBEL RICK R TRUSTEE,,,REEBEL RICK R TRUSTEE,REEBEL,R,R,,0.0,DAVIS-KIRK,7491.0,,RD,,,PROSPECT,OH,43342,USA,,REEBEL RICK R TRUSTEE,REEBEL,R,R,,0.0,DAVIS-KIRK,7491,,RD,,,PROSPECT,OH,43342,USA,,N,39.0,5101.0,RVL,23800,115100,0,138900,138900,05-15-2018,N,0,2,18,2018-05-15 09:58:00,,912.0,Y,N,N,Y,N,N,8330,40290,0,1586.94,1586.94,0.0,,,0,,,,1990,0,,,0,0


In [9]:
hocking.head()

Unnamed: 0,Parcel Number,Number,Street,School District,Township,Corp / Village,Sale #,Month,Year,Sale Price,Sqft,Acres,Yr Built,Story,Basement,Frame,Bedrooms,BasementSqft,Fin Base,Air,GarageSqft,Type,Heat Type,Roof,RoofType,Rooms,FullBaths,HalfBaths,Grade,MHRE,NeighCode,PropClass,LandValue,BldgValue,Property Card
0,10000080900,19475.0,LINDEN,2A,4A,,287.0,7,10,40000,,18432.0,,,,,,,,,,,,,,,,,,,107.0,500,32840.0,,http://www.realestate.co.hocking.oh.us/cards/C...
1,10000530000,,,2A,4A,,536.0,12,10,230000,,70000.0,,,,,,,,,,,,,,,,,,,7015.0,120,10970.0,,http://www.realestate.co.hocking.oh.us/cards/C...
2,10000560101,20680.0,KEIFEL,2A,4A,,141.0,4,10,171600,1120.0,18365.0,2002.0,1.0,B,F,2.0,1120.0,F,,816.0,D,HEAT PMP,S,,6.0,2.0,,C,,7015.0,511,16390.0,122750.0,http://www.realestate.co.hocking.oh.us/cards/C...
3,10000700000,20644.0,CUPP,2A,4A,,474.0,11,10,65000,,73610.0,,,,,,,,,,,,,,,,,,,7015.0,502,24970.0,,http://www.realestate.co.hocking.oh.us/cards/C...
4,10000710600,23431.0,BIG PINE,2A,4A,,531.0,12,10,158950,608.0,90345.0,1993.0,1.0,,L,2.0,,,,750.0,D,FA GAS,S,GAB,,1.0,,C,,7015.0,581,27580.0,71240.0,http://www.realestate.co.hocking.oh.us/cards/C...


In [14]:
delaware.columns.tolist()

['mpropertyNumber',
 'mmapNumber',
 'mcamaId',
 'CardCount',
 'NeighborhoodCode',
 'mlegalDescription',
 'mlocStrDir',
 'mlocStrNo',
 'mlocStrNo2',
 'mlocStrName',
 'mlocStrSuffix',
 'mlocStrSuffixDir',
 'msecondaryAddress',
 'mlocCity',
 'mlocState',
 'mlocZipCode',
 'mlocDescription',
 'mClassificationId',
 'macres',
 'mfrontfootage',
 'mdeedVolume',
 'mdeedPage',
 'Par_Deleted',
 'DeededOwner',
 'mlot',
 'msublot',
 'OwnName',
 'OwnFirstName',
 'OwnMiddleInitial',
 'OwnLastName',
 'OwnNameSuffix',
 'OwnBusiness',
 'OwnStreetName',
 'OwnStreetNumber',
 'OwnStreetDirection',
 'OwnStreetSuffix',
 'OwnSufDir',
 'OwnSecondaryAddress',
 'OwnCity',
 'OwnState',
 'OwnZipcode',
 'OwnCountry',
 'OwnAttentionline',
 'TaxpName',
 'TaxpFirstName',
 'TaxpMiddleInitial',
 'TaxpLastName',
 'TaxpNameSuffix',
 'TaxpBusiness',
 'TaxpStreetName',
 'TaxpStreetNumber',
 'TaxpStreetDirection',
 'TaxpStreetSuffix',
 'TaxpStreetSuffixDir',
 'TaxpSecondaryAddress',
 'TaxpCity',
 'TaxpState',
 'TaxpZipcode',


In [27]:
hocking.columns.tolist()

['Parcel Number',
 'Number',
 'Street',
 'School District',
 'Township',
 'Corp / Village',
 'Sale #',
 'Month',
 'Year',
 'Sale Price',
 'Sqft',
 'Acres',
 'Yr Built',
 'Story',
 'Basement',
 'Frame',
 'Bedrooms',
 'BasementSqft',
 'Fin Base',
 'Air',
 'GarageSqft',
 'Type',
 'Heat Type',
 'Roof',
 'RoofType',
 'Rooms',
 'FullBaths',
 'HalfBaths',
 'Grade',
 'MHRE',
 'NeighCode',
 'PropClass',
 'LandValue',
 'BldgValue',
 'Property Card']

<a name="ID-Columns"></a>
### Decide which columns are common across Counties
To do this, we put the column names for each in an Excel workbook.  
Based on name and content, we made decisions about which columns were common across each.  

Along the way, we took notes about whether we needed to clean any of the data (ex., like merging address fields).

Please see ../data/Column mapping.xls for these details.

Now we will create a list of the columns that we identified as common across these counties. 
We copied the list from the output of the madison.columns.tolist() command we ran above and removed those that we did not need.  We also added a County field since we would want to  know which data came from which county.

In [12]:
columns = ["Parcel",
            "Owner",
            "PropertyAddress",
            "MailingAddress",
            "LandUse",
            "Acres",
            "LegalDescription",
            "NeighborhoodCode",
            "SaleDate",
            "SalePrice",
            "YearBuilt",
            "NumberOfStories",
            "FinishedArea",
            "NumberOfRooms",
            "NumberOfBedrooms",
            "NumberOfFullBaths",
            "NumberOfHalfBaths",
            "County"]

<a name="Data-Cleanup"></a>
# Data Cleanup
Using the columns defined above, we need to cleanup the data for each county.

<a name="Columns-Delaware"></a>
### Delaware County Columns
Delaware County data lacked information on the residental property specific to the building. When the query search was performed on the Auditor's website, the query would error out. The prepackaged data was the only information available.

In [36]:
# Filter Data
# Delaware as column called Parcel_Type, but all the values in the column are 0.
# Instead the column called mClassificationId contains Land Use Codes, which we use as a filter.
# I'm assuming the data of interest is Residental Property. According to http://codes.ohio.gov/oac/5703-25-10
# codes between 500 and 599 are Taxable residential real property
#delaware_subset = delaware[delaware.mClassificationId >= 500]
#delaware_subset = delaware_subset[delaware_subset.mClassificationId <= 599]

# Filter out sales above 1.5 million dollars and sales of zero dollars
#delaware_subset = delaware_subset[delaware_subset.SaleAmount < 1500000]
#delaware_subset = delaware_subset[delaware_subset.SaleAmount > 0]

# Delaware separates out the address across six columns. Combine them into one Street Address
street_address = delaware_subset.mlocStrDir.fillna('') + " "
street_address += delaware_subset.mlocStrNo.fillna('') + " "
street_address += delaware_subset.mlocStrNo2.fillna('') + " "
street_address += delaware_subset.mlocStrName.fillna('') + " "
street_address += delaware_subset.mlocStrSuffix.fillna('') + " "
street_address += delaware_subset.mlocStrSuffixDir.fillna('') + " "
street_address += delaware_subset.mlocCity.fillna('') + " "
street_address += delaware_subset.mlocState.fillna('') + " "
street_address += delaware_subset.mlocZipCode.fillna('')

# Add the street_address to the dataframe at index 2
delaware_subset.insert(2, "PropertyAddress", street_address)

#restrict the columns of interest
delaware_columns = [ 'mpropertyNumber', 'OwnName', 'PropertyAddress',
                     'mClassificationId','macres', 'mlegalDescription', 
                    'NeighborhoodCode', 'SaleDate', 'SaleAmount']

# Apply the column masks to the dataframe
delaware_subset = delaware_subset[delaware_columns].copy()

delaware_subset.rename(
    {'mpropertyNumber' : 'ParcelNumber',     
     'OwnName' : 'Owner',
     'mlocDescription' : 'Description',
     'mClassificationId' : 'LandUse',
     'macres' : 'Acres',     
     'mlegalDescription' : 'LegalDescription',
     'SaleAmount':'SalePrice'
    },
    axis=1,
    inplace=True
)

# Add Columns for data merge
mailing_address = delaware_subset.PropertyAddress
delaware_subset.insert(3, "MailingAddress", mailing_address)

delaware_subset.insert(10, "YearBuilt", " ")
delaware_subset.insert(11, "NumberOfStories", " ")
delaware_subset.insert(12, "FinishedArea", " ")
delaware_subset.insert(13, "NumberOfRooms", " ")
delaware_subset.insert(14, "NumberOfBedrooms", " ")
delaware_subset.insert(15, "NumberOfFullBaths", " ")
delaware_subset.insert(16, "NumberOfHalfBaths", " ")

delaware_subset['County'] = 'Delaware'
display(delaware_subset.head())

Unnamed: 0,ParcelNumber,Owner,PropertyAddress,MailingAddress,LandUse,Acres,LegalDescription,NeighborhoodCode,SaleDate,SalePrice,YearBuilt,NumberOfStories,FinishedArea,NumberOfRooms,NumberOfBedrooms,NumberOfFullBaths,NumberOfHalfBaths,County
3,100-100-01-001-000,BURNS JASON E,7511 DAVIS-KIRK RD PROSPECT OH 43342,7511 DAVIS-KIRK RD PROSPECT OH 43342,511.0,1.0,LANDS SURVEY 675 ^ ...,38002.0,09-12-2000,127000,,,,,,,,Delaware
7,100-100-01-005-000,VANHOESEN KEVIN D,7321 DAVIS-KIRK RD PROSPECT OH 43342,7321 DAVIS-KIRK RD PROSPECT OH 43342,511.0,0.8,LANDS SURVEY 675 ^ ...,38002.0,02-02-2016,128500,,,,,,,,Delaware
13,100-100-01-010-001,PORTUESI BEVERLY J,7745 STATE ROUTE 257 N PROSPECT OH 43342,7745 STATE ROUTE 257 N PROSPECT OH 43342,511.0,1.492,LANDS SURVEY 675 ^TRACT 1 ...,38002.0,10-27-2015,19500,,,,,,,,Delaware
14,100-100-01-010-002,VALENCIA CARLOS PEREZ,7741 STATE ROUTE 257 N PROSPECT OH 43342,7741 STATE ROUTE 257 N PROSPECT OH 43342,501.0,1.492,LANDS SURVEY 675 ^TRACT 2 ...,38002.0,02-17-2017,19000,,,,,,,,Delaware
15,100-100-01-011-000,WILLIAMS ROBERT H III,7715 STATE ROUTE 257 N PROSPECT OH 43342,7715 STATE ROUTE 257 N PROSPECT OH 43342,511.0,2.0,LANDS SURVEY 675 ^ ...,38002.0,11-06-2013,250000,,,,,,,,Delaware


<a name="Columns-Hocking"></a>
### Hocking County Columns

In [28]:
### Filters and Drops
hocking_subset = hocking

# filter out any property that has NaN Sqft, House Number (Number) is NaN (Assume no house on property)
hocking_subset = hocking_subset[pd.notna(hocking_subset.Sqft)]
hocking_subset = hocking_subset[pd.notna(hocking_subset.Number)]

### Address
# Convert Number and Street to same data type
hocking_subset['Number'] = hocking_subset.Number.astype(str)
hocking_subset['Street'] = hocking_subset.Street.astype(str)

# Combine Number and Street to create Address
address = hocking_subset.Number.fillna('') + " "
address += hocking_subset.Street.fillna('')

# Add the street_address to the dataframe at index 2
hocking_subset.insert(1, "Address", address)

### Bathrooms
hocking_subset.FullBaths = hocking_subset.FullBaths.fillna(0)
hocking_subset.HalfBaths = hocking_subset.HalfBaths.fillna(0)

### PropClass aka Land Use Codes

# PropClass is Land Use Codes; only include Residental Properties
hocking_subset = hocking_subset[hocking_subset.PropClass >= 500]
hocking_subset = hocking_subset[hocking_subset.PropClass <= 599]

# Bedrooms: If NaN, then 1
hocking_subset.loc[:, 'Bedrooms'] = hocking_subset['Bedrooms'].fillna(1)

### Sale Date from Month and Year
# convert both to str
hocking_subset['Month'] = hocking_subset.Month.astype(str)
hocking_subset['Year'] = hocking_subset.Year.astype(str)

# concatenate Month and Year to create SaleDate
sale_date = hocking_subset.Month + "-01-20" + hocking_subset.Year
hocking_subset.insert(hocking_subset.columns.get_loc('Month') - 1, "SaleDate", sale_date)


### Rename Columns

# Remove spaces in names and clarify names
hocking_subset.rename(
    {
     'Parcel Number' : 'ParcelNumber',
     'Address' : 'PropertyAddress',
     'School District' : 'SchoolDistrict',
     'Sale #': 'SaleNumber',
     'Sale Price' : 'SalePrice',
     'Yr Built' : 'YearBuilt',
     'Fin Base' : 'FinishedBasement',
     'Heat Type' : 'HeatType',
     'Property Card' : 'PropertyCard',
     'PropClass' : 'LandUse',
     'NeighCode' : 'NeighborhoodCode',
     'Story' : 'NumberOfStories',
     'Sqft' : 'FinishedArea',
     'Rooms': 'NumberOfRooms',
     'Bedrooms' : 'NumberOfBedrooms',
     'FullBaths':'NumberOfFullBaths',
     'HalfBaths':'NumberOfHalfBaths'
    },
    axis=1,
    inplace=True
)

### Owner
# Add Owner columns; Hocking does not have Owner data, so it will be empty
hocking_subset.insert(1, "Owner", " ")
# Add LegalDescription; Hocking does not have legalDescription in the data
hocking_subset.insert(6, "LegalDescription", " ")

# Add Columns for data merge
mailing_address = hocking_subset.PropertyAddress
hocking_subset.insert(3, "MailingAddress", mailing_address)

hocking_subset['County'] = 'Hocking'

hocking_columns = [
    'ParcelNumber',
    'Owner',
    'PropertyAddress',
    'MailingAddress',
    'LandUse',    
    'Acres',
    'LegalDescription',
    'NeighborhoodCode',
    'SaleDate',
    'SalePrice',
    'YearBuilt',
    'NumberOfStories',
    'FinishedArea',
    'NumberOfRooms',
    'NumberOfBedrooms',
    'NumberOfFullBaths',
    'NumberOfHalfBaths',
    'County'
]

# Apply the column masks to the dataframe
hocking_subset = hocking_subset[hocking_columns].copy()

hocking_subset.head()

Unnamed: 0,ParcelNumber,Owner,PropertyAddress,MailingAddress,LandUse,Acres,LegalDescription,NeighborhoodCode,SaleDate,SalePrice,YearBuilt,NumberOfStories,FinishedArea,NumberOfRooms,NumberOfBedrooms,NumberOfFullBaths,NumberOfHalfBaths,County
2,10000560101,,20680.0 KEIFEL,20680.0 KEIFEL,511,18365.0,,7015.0,4-01-2010,171600,2002,1,1120.0,6.0,2.0,2.0,0.0,Hocking
4,10000710600,,23431.0 BIG PINE,23431.0 BIG PINE,581,90345.0,,7015.0,12-01-2010,158950,1993,1,608.0,,2.0,1.0,0.0,Hocking
5,10000900000,,20327.0 ST RT 664,20327.0 ST RT 664,511,9012.0,,7015.0,11-01-2010,60000,1970,1,1120.0,4.0,1.0,1.0,0.0,Hocking
6,10001030000,,20672.0 ST RT 664,20672.0 ST RT 664,511,4630.0,,7015.0,9-01-2010,50000,2006,1,1518.0,5.0,3.0,2.0,0.0,Hocking
9,10002600200,,23270.0 ST RT 56,23270.0 ST RT 56,513,213240.0,,7015.0,12-01-2010,120000,1953,1,926.0,5.0,1.0,1.0,0.0,Hocking


<a name="Validate-columns"></a>
### Validate Columns before Merging
Before trying to merge, we should confirm that the columns across datasets are the same using the Series eq() method. Any differences will have to be corrected before merging the data

In [37]:
cols1 = pd.Series(delaware_subset.columns.sort_values())
cols2 = pd.Series(hocking_subset.columns.sort_values())
cols1.eq(cols2)

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
dtype: bool

In [38]:
display(delaware_subset.columns.sort_values())
display(hocking_subset.columns.sort_values())

Index(['Acres', 'County', 'FinishedArea', 'LandUse', 'LegalDescription',
       'MailingAddress', 'NeighborhoodCode', 'NumberOfBedrooms',
       'NumberOfFullBaths', 'NumberOfHalfBaths', 'NumberOfRooms',
       'NumberOfStories', 'Owner', 'ParcelNumber', 'PropertyAddress',
       'SaleDate', 'SalePrice', 'YearBuilt'],
      dtype='object')

Index(['Acres', 'County', 'FinishedArea', 'LandUse', 'LegalDescription',
       'MailingAddress', 'NeighborhoodCode', 'NumberOfBedrooms',
       'NumberOfFullBaths', 'NumberOfHalfBaths', 'NumberOfRooms',
       'NumberOfStories', 'Owner', 'ParcelNumber', 'PropertyAddress',
       'SaleDate', 'SalePrice', 'YearBuilt'],
      dtype='object')

<a name="Merge"></a>
# Merge the counties
<a name="Append-files"></a>
### APPEND
Since all of the columns in each file match, we can use APPEND to merge the county data.  
NOTE:
__Pandas concat Vs append Vs join Vs merge__
* __Concat__ gives the flexibility to join based on the axis( all rows or all columns)

* __Append__ is the specific case(axis=0, join='outer') of concat

* __Join__ is based on the indexes (set by set_index) on how variable =['left','right','inner','couter']

* __Merge__ is based on any particular column each of the two dataframes, this columns are variables on like 'left_on', 'right_on', 'on'

NOTE:  Came across a thread about Concat being faster than Append, would like to investigate this further

In [41]:
all_data = delaware_subset.append([hocking_subset], ignore_index=True, sort=True)
all_data.head()

Unnamed: 0,Acres,County,FinishedArea,LandUse,LegalDescription,MailingAddress,NeighborhoodCode,NumberOfBedrooms,NumberOfFullBaths,NumberOfHalfBaths,NumberOfRooms,NumberOfStories,Owner,ParcelNumber,PropertyAddress,SaleDate,SalePrice,YearBuilt
0,1.0,Delaware,,511.0,LANDS SURVEY 675 ^ ...,7511 DAVIS-KIRK RD PROSPECT OH 43342,38002.0,,,,,,BURNS JASON E,100-100-01-001-000,7511 DAVIS-KIRK RD PROSPECT OH 43342,09-12-2000,127000,
1,0.8,Delaware,,511.0,LANDS SURVEY 675 ^ ...,7321 DAVIS-KIRK RD PROSPECT OH 43342,38002.0,,,,,,VANHOESEN KEVIN D,100-100-01-005-000,7321 DAVIS-KIRK RD PROSPECT OH 43342,02-02-2016,128500,
2,1.492,Delaware,,511.0,LANDS SURVEY 675 ^TRACT 1 ...,7745 STATE ROUTE 257 N PROSPECT OH 43342,38002.0,,,,,,PORTUESI BEVERLY J,100-100-01-010-001,7745 STATE ROUTE 257 N PROSPECT OH 43342,10-27-2015,19500,
3,1.492,Delaware,,501.0,LANDS SURVEY 675 ^TRACT 2 ...,7741 STATE ROUTE 257 N PROSPECT OH 43342,38002.0,,,,,,VALENCIA CARLOS PEREZ,100-100-01-010-002,7741 STATE ROUTE 257 N PROSPECT OH 43342,02-17-2017,19000,
4,2.0,Delaware,,511.0,LANDS SURVEY 675 ^ ...,7715 STATE ROUTE 257 N PROSPECT OH 43342,38002.0,,,,,,WILLIAMS ROBERT H III,100-100-01-011-000,7715 STATE ROUTE 257 N PROSPECT OH 43342,11-06-2013,250000,


<a name="Validate"></a>
### Validate all Counties are in the merged file

In [43]:
all_data.County.value_counts()

Delaware    52988
Hocking      4630
Name: County, dtype: int64

In [None]:
### Sample County Data

# using sampleCounty in samplingsu2.py
hocking_subset = sampling.sampleCounty(hocking_subset, 0.1)

display(hocking_subset.head())

In [None]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///../data/output.sqlite')

delaware_subset.to_sql("Delaware", con=engine, if_exists='replace')
hocking_subset.to_sql("Hocking", con=engine, if_exists='replace')