# Prepare and clean the census data
* missing values: drop columns with too many missing values, then drop rows with too many missing values, fill with zero where that makes sense, and note any columns whose missing values you want to impute (imputation has to happen after the data is split).

* outlier: an observation that is distant from the other observations

* outliers: ignore them, drop the rows, snap to a selected max/min value, or create bins (cut, qcut)

* data errors: drop the rows/observations with the errors, or correct them to the intended values

* text normalization: correct and standardize the text, e.g. deck 'C' vs. 'c'

* tidy data: get your data into the shape it needs to be in for modeling and exploring. Every row should be an observation and every column should be a feature/attribute/variable. You want 1 observation per row and 1 row per observation. If you want to predict customer churn, each row should be a customer and each customer should appear on only 1 row. (address duplicates, aggregate, melt, reshape, ...)

* creating new variables out of existing variables (e.g. z = x - y)

* rename columns

* datatypes: the model needs numeric data as input (dummy vars, factor vars, manual encoding)

* scale numeric data: so that continuous variables carry the same weight and are on the same units when the algorithm is sensitive to differing scales, or when the data needs to be transformed toward a gaussian/normal distribution for statistical testing. (linear scalers and non-linear scalers)
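
The checklist above can be sketched on a small made-up frame. The column names, the 50% missing-value thresholds, and the 1st/99th percentile bounds are illustrative assumptions, not values chosen for the census data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; none of these columns come from the census files.
df = pd.DataFrame({
    "deck": ["C", "c", " B", "b ", None, "C"],
    "fare": [7.0, 8.0, 9.0, 500.0, 10.0, np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
    "sibs": [0.0, 1.0, np.nan, 2.0, 0.0, 1.0],
})

# Missing values: keep columns, then rows, that are at least 50% populated.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
df = df.dropna(axis=0, thresh=int(0.5 * df.shape[1]))

# Fill with zero only where zero is a sensible default (assumed here for a count).
df["sibs"] = df["sibs"].fillna(0)
# 'fare' still has a NaN: it is left alone so it can be imputed after the split.

# Outliers: snap values to the 1st and 99th percentiles instead of dropping rows.
low, high = df["fare"].quantile([0.01, 0.99])
df["fare"] = df["fare"].clip(lower=low, upper=high)

# Text normalization: trim whitespace and standardize case so 'C' and 'c' match.
df["deck"] = df["deck"].str.strip().str.upper()

# Datatypes: turn the categorical column into numeric dummy columns.
df = pd.get_dummies(df, columns=["deck"], dtype=int)

# Scaling: min-max scale a continuous column to the range [0, 1].
fare = df["fare"]
df["fare_scaled"] = (fare - fare.min()) / (fare.max() - fare.min())

print(df)
```

Each step is independent, so any of them can be skipped when it does not apply to the data at hand.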

In [1]:
# imports
from acquire import get_census_data
import pandas as pd

from prepare_module import summarize

In [2]:
# getting data
gdf10, gdf20 = get_census_data()

## 2010 gdf cleaning

In [3]:
gdf10.head()

Unnamed: 0,fid,uace10,geoid10,countyfp10,blockce10,ur10,statefp10,tractce10,intptlat10,uatype,name10,funcstat10,intptlon10,aland10,awater10,mtfcc10,SHAPE__Length,SHAPE__Area,geometry
0,1,,181457108005006,145,5006,R,18,710800,39.5206564,,Block 5006,S,-85.9116882,4317949,1247,G5040,8751.984794,4316672.0,"POLYGON ((-85.92794 39.52354, -85.92791 39.523..."
1,2,,180799606002051,79,2051,R,18,960600,38.9335629,,Block 2051,S,-85.6352221,1277980,31175,G5040,4977.961559,1308558.0,"POLYGON ((-85.64117 38.92621, -85.64121 38.926..."
2,3,,181259541002131,125,2131,R,18,954100,38.4154515,,Block 2131,S,-87.3437,1214448,2492,G5040,5833.011878,1215994.0,"POLYGON ((-87.35258 38.41400, -87.35256 38.413..."
3,4,,181259539002041,125,2041,R,18,953900,38.4463365,,Block 2041,S,-87.15854,4716036,68216,G5040,10184.51979,4780449.0,"POLYGON ((-87.17438 38.44355, -87.17440 38.443..."
4,5,,181259542004154,125,4154,R,18,954200,38.2436132,,Block 4154,S,-87.286756,7321864,52968,G5040,11325.53468,7369043.0,"POLYGON ((-87.29837 38.23219, -87.29838 38.231..."


In [None]:
summarize(gdf10.drop(['geometry'], axis=1))

#### Takeaways:
* The data is clean, as expected from the Census Bureau
* uatype has blanks for areas that are not urban
* I'm not sure yet which features are needed and which are not
    * For land area and water area, I think I will keep the areas; they may reflect how quickly certain fixes are completed (more land means more area to cover, more water possibly means more complications with seemingly simple repairs)
    * Drop:
        * ur10 - uatype can show whether it is rural, urban, or whatever that third letter signifies
        * statefp10 - all Indiana, so the same value on every row
        * intptlat10 - the lat and long don't seem necessary, so I will drop these because the geometry will allow us to plot on a map
        * intptlon10 - see above
        * funcstat10 - only one value
        * mtfcc10 - only one value
        * fid - the index already does the work, and I would prefer starting at 0 rather than 1
        * uace10 - the code is not very important for what I need to complete
        * uatype - drop because it isn't in the 2020 data
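
The "only one value" observations can be checked programmatically rather than by eye. A minimal sketch, assuming a plain pandas frame; the helper name `constant_columns` and the sample values are made up for illustration.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns holding a single distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Illustrative rows only; the real frame has many more columns.
sample = pd.DataFrame({
    "statefp10": ["18", "18", "18"],
    "funcstat10": ["S", "S", "S"],
    "countyfp10": ["145", "079", "125"],
})

print(constant_columns(sample))
```

On the real data it is probably safest to drop the `geometry` column before calling a helper like this, as is done for `summarize`.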

In [11]:
# dropping columns
gdf10.columns

cols_to_drop = ['fid', 'uace10', 'statefp10', 'ur10','funcstat10','mtfcc10','intptlon10','intptlat10']

gdf10.drop(cols_to_drop, axis=1, inplace=True)

In [23]:
# replacing the uatype codes with readable labels; blanks become not_urban
gdf10['uatype'].value_counts()

gdf10.loc[:,'uatype'] = gdf10.loc[:,'uatype'].replace(' ', 'not_urban').replace('U', 'urbanized_area').replace('C', 'urban_cluster')


gdf10['uatype'].value_counts()


uatype
not_urban         135218
urbanized_area     94365
urban_cluster      37488
Name: count, dtype: int64
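
The chained `replace` calls could also be written as a single dictionary mapping, which keeps the codes and labels in one place. A sketch on a made-up series, assuming (as in the cell above) that non-urban blocks are stored as a single space; the constant name `UATYPE_LABELS` is hypothetical.

```python
import pandas as pd

# Labels copied from the cell above; the sample series is illustrative.
UATYPE_LABELS = {" ": "not_urban", "U": "urbanized_area", "C": "urban_cluster"}

uatype = pd.Series(["U", " ", "C", " ", "U"])
labeled = uatype.replace(UATYPE_LABELS)

print(labeled.value_counts())
```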

In [31]:
# change name10 to blockname10 and drop the 'Block ' prefix

gdf10.loc[:,'name10'] = gdf10.loc[:,'name10'].str.replace('Block ', '', regex=False)

gdf10.rename({'name10':'blockname10'}, axis=1, inplace=True)
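
A note on removing the `Block ` prefix: `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal prefix. That happens to be harmless for all-digit block names, but it would eat letters from other names. The sketch below contrasts it with an anchored replacement on made-up names (`Block B12` and `Block 20k` are hypothetical and chosen only to expose the difference).

```python
import pandas as pd

names = pd.Series(["Block 5006", "Block B12", "Block 20k"])

# str.strip removes any of the characters 'B', 'l', 'o', 'c', 'k', ' ' from both ends.
stripped = names.str.strip("Block ")

# An anchored regex removes only the literal leading 'Block '.
prefixed = names.str.replace(r"^Block ", "", regex=True)

print(stripped.tolist())  # ['5006', '12', '20']
print(prefixed.tolist())  # ['5006', 'B12', '20k']
```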

In [32]:
gdf10

Unnamed: 0,geoid10,countyfp10,blockce10,tractce10,uatype,blockname10,aland10,awater10,SHAPE__Length,SHAPE__Area,geometry
0,181457108005006,145,5006,710800,not_urban,5006,4317949,1247,8751.984794,4.316672e+06,"POLYGON ((-85.92794 39.52354, -85.92791 39.523..."
1,180799606002051,079,2051,960600,not_urban,2051,1277980,31175,4977.961559,1.308558e+06,"POLYGON ((-85.64117 38.92621, -85.64121 38.926..."
2,181259541002131,125,2131,954100,not_urban,2131,1214448,2492,5833.011878,1.215994e+06,"POLYGON ((-87.35258 38.41400, -87.35256 38.413..."
3,181259539002041,125,2041,953900,not_urban,2041,4716036,68216,10184.519790,4.780449e+06,"POLYGON ((-87.17438 38.44355, -87.17440 38.443..."
4,181259542004154,125,4154,954200,not_urban,4154,7321864,52968,11325.534680,7.369043e+06,"POLYGON ((-87.29837 38.23219, -87.29838 38.231..."
...,...,...,...,...,...,...,...,...,...,...,...
267066,180071002001095,007,1095,100200,not_urban,1095,1299048,0,5474.841915,1.298024e+06,"POLYGON ((-87.26664 40.57645, -87.26662 40.576..."
267067,181410107002032,141,2032,010700,urbanized_area,2032,9060,0,643.886880,9.053447e+03,"POLYGON ((-86.17341 41.66392, -86.17343 41.663..."
267068,181270505031019,127,1019,050503,urbanized_area,1019,42364,0,1056.159352,4.233003e+04,"POLYGON ((-87.13257 41.54125, -87.13259 41.540..."
267069,181410118024016,141,4016,011802,urbanized_area,4016,19183,0,595.456254,1.916962e+04,"POLYGON ((-86.24857 41.61560, -86.24856 41.614..."


In [9]:
def get_clean_gdf10():
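    '''
    Return the cleaned 2010 census block data, reading the cached CSV if it exists.
    Otherwise drop unnecessary columns, remove the 'Block ' prefix from name10,
    rename it to blockname10, tidy the column names, and cache the result.
    Modules:
        from pathlib import Path
        import pandas as pd
        from acquire import get_census_data
    '''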

    # create file path
    data_path = Path('./data')
    file_path = data_path.joinpath('Census_Block_Boundaries_2010_Clean.csv')

    # check for clean data file 
    if file_path.exists():

        # return the clean data
        return pd.read_csv(file_path, index_col=0)
    
    # getting data
    gdf10, gdf20 = get_census_data()

    # drop columns
    cols_to_drop = ['fid', 'uace10', 'uatype', 'statefp10', 'ur10','funcstat10','mtfcc10','intptlon10','intptlat10']
    gdf10.drop(cols_to_drop, axis=1, inplace=True)

    # remove the 'Block ' prefix and change the feature name
    gdf10.loc[:,'name10'] = gdf10.loc[:,'name10'].str.replace('Block ', '', regex=False)
    gdf10.rename({'name10':'blockname10'}, axis=1, inplace=True)

    # lowercase columns and remove extra '_'
    gdf10.columns = gdf10.columns.str.lower().str.replace('__', '_')

    gdf10.to_csv(file_path)
                                          
    return gdf10
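
One thing to watch when caching to CSV: `read_csv` infers integer dtypes, so zero-padded codes such as a `countyfp10` of `079` or a `tractce10` of `010700` come back as `79` and `10700`. A possible remedy is to read the code columns as strings. The sketch below uses an in-memory CSV with two rows shaped like the cleaned 2010 data; which columns count as codes is an assumption.

```python
import io
import pandas as pd

# Two illustrative rows; the codes are written with their leading zeros.
csv_text = (
    "geoid10,countyfp10,tractce10\n"
    "180799606002051,079,960600\n"
    "181410107002032,141,010700\n"
)

# Default type inference parses the codes as integers and drops the zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading the code columns as strings keeps them intact.
code_cols = ["geoid10", "countyfp10", "tractce10"]  # assumed set of code columns
as_text = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in code_cols})

print(inferred["countyfp10"].tolist())  # [79, 141]
print(as_text["countyfp10"].tolist())   # ['079', '141']
```

A CSV also stores the `geometry` column as plain text, so a geospatial file format may be the better cache if the shapes are needed later.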
    

In [41]:
gdf10.columns.str.lower().str.replace('__', '_')

gdf10.size

2937781

In [46]:
import geopandas
from pathlib import Path

data_path = Path('./data')

filename = 'Census_Block_Boundaries_2020_Clean.geojson'

file_path = data_path.joinpath(filename)
print(file_path)

if file_path.exists():
    
    gdf10 = geopandas.read_file(file_path)

else:

    gdf10, gdf20 = get_census_data()


data/Census_Block_Boundaries_2020_Clean.geojson


In [60]:
gdf10, gdf20 = get_census_data()

In [64]:
get_clean_gdf10(gdf10)

Unnamed: 0,geoid10,countyfp10,blockce10,tractce10,uatype,blockname10,aland10,awater10,shape_length,shape_area,geometry
0,181457108005006,145,5006,710800,not_urban,5006,4317949,1247,8751.984794,4.316672e+06,POLYGON ((-85.92794196348832 39.52354047703989...
1,180799606002051,79,2051,960600,not_urban,2051,1277980,31175,4977.961559,1.308558e+06,POLYGON ((-85.64116509600031 38.92620957534559...
2,181259541002131,125,2131,954100,not_urban,2131,1214448,2492,5833.011878,1.215994e+06,"POLYGON ((-87.35258063903558 38.4140047547761,..."
3,181259539002041,125,2041,953900,not_urban,2041,4716036,68216,10184.519790,4.780449e+06,POLYGON ((-87.17437968661865 38.44354974040075...
4,181259542004154,125,4154,954200,not_urban,4154,7321864,52968,11325.534680,7.369043e+06,POLYGON ((-87.29836866727443 38.23219378672903...
...,...,...,...,...,...,...,...,...,...,...,...
267066,180071002001095,7,1095,100200,not_urban,1095,1299048,0,5474.841915,1.298024e+06,"POLYGON ((-87.266644480409 40.5764463535952, -..."
267067,181410107002032,141,2032,10700,urbanized_area,2032,9060,0,643.886880,9.053447e+03,"POLYGON ((-86.17341470329092 41.6639191035303,..."
267068,181270505031019,127,1019,50503,urbanized_area,1019,42364,0,1056.159352,4.233003e+04,POLYGON ((-87.13257042972036 41.54125017650422...
267069,181410118024016,141,4016,11802,urbanized_area,4016,19183,0,595.456254,1.916962e+04,POLYGON ((-86.24856768443412 41.61559711692266...


## 2020 gdf cleaning

In [None]:
summarize(gdf20.drop('geometry', axis=1))

#### Takeaways:
* All the urban fields are blank; going to check the website viewer for the data to see whether that is an error
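
The share of blank values per column can also be quantified directly. A sketch, assuming blanks may be stored as empty strings, whitespace, or NaN; the helper name `blank_share` and the sample frame are hypothetical.

```python
import pandas as pd

def blank_share(df):
    """Fraction of values per column that are missing or whitespace-only."""
    # Cast everything to text so numeric columns can be compared as well.
    cleaned = df.apply(lambda col: col.astype(str).str.strip())
    blank = df.isna() | cleaned.eq("")
    return blank.mean()

# Illustrative values only.
sample = pd.DataFrame({
    "ur20": [" ", "", None, " "],
    "uace20": ["", "", "", ""],
    "aland20": [139907, 169692, 13103, 1187548],
})

print(blank_share(sample))
```

A share of 1.0 for every urban column would confirm that the fields are empty throughout the download rather than in a few rows.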

In [4]:
def get_clean_gdf20():
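    '''
    Return the cleaned 2020 census block data, reading the cached CSV if it exists.
    Otherwise drop unnecessary columns, remove the 'Block ' prefix from name20,
    rename it to blockname20, tidy the column names, and cache the result.
    Modules:
        from pathlib import Path
        import pandas as pd
        from acquire import get_census_data
    '''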

    # create file path
    data_path = Path('./data')
    file_path = data_path.joinpath('Census_Block_Boundaries_2020_Clean.csv')

    # check for clean data file 
    if file_path.exists():

        # return the clean data
        return pd.read_csv(file_path, index_col=0)
    
    # getting data
    gdf10, gdf20 = get_census_data()

    # drop columns
    cols_to_drop = ['fid', 'uace20', 'uatype20', 'statefp20', 'ur20','funcstat20','mtfcc20','intptlon20','intptlat20']
    gdf20.drop(cols_to_drop, axis=1, inplace=True)

    # remove the 'Block ' prefix and change the feature name
    gdf20.loc[:,'name20'] = gdf20.loc[:,'name20'].str.replace('Block ', '', regex=False)
    gdf20.rename({'name20':'blockname20'}, axis=1, inplace=True)

    # lowercase columns and remove extra '_'
    gdf20.columns = gdf20.columns.str.lower().str.replace('__', '_')

    gdf20.to_csv(file_path)
                                          
    return gdf20
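
`get_clean_gdf10` and `get_clean_gdf20` differ only in the year suffix, so the shared steps could live in one helper. This is a sketch of that idea, not the notebook's code: the name `clean_block_columns` is hypothetical, the download and caching steps are left out, and `errors="ignore"` is an addition so that a column missing from one vintage does not raise.

```python
import pandas as pd

# Column stems taken from the two drop lists; the year suffix is appended below.
BASE_COLS_TO_DROP = ["uace", "uatype", "statefp", "ur", "funcstat", "mtfcc", "intptlon", "intptlat"]

def clean_block_columns(df, suffix):
    """Apply the shared cleaning steps for one vintage, e.g. suffix '10' or '20'."""
    # 'fid' and the 2010 'uatype' column carry no suffix, so they are listed as-is.
    cols_to_drop = ["fid", "uatype"] + [f"{stem}{suffix}" for stem in BASE_COLS_TO_DROP]
    out = df.drop(columns=cols_to_drop, errors="ignore")

    # Remove the literal 'Block ' prefix, then rename the column.
    name_col = f"name{suffix}"
    out[name_col] = out[name_col].str.replace(r"^Block ", "", regex=True)
    out = out.rename(columns={name_col: f"blockname{suffix}"})

    # Lowercase the column names and collapse the double underscore.
    out.columns = out.columns.str.lower().str.replace("__", "_", regex=False)
    return out

# Illustrative two-row frame with a subset of the 2020 columns.
raw = pd.DataFrame({
    "fid": [1, 2],
    "statefp20": ["18", "18"],
    "name20": ["Block 3026", "Block 3021"],
    "aland20": [139907, 169692],
    "SHAPE__Area": [1.514178e-05, 1.833337e-05],
})

clean = clean_block_columns(raw, "20")
print(clean.columns.tolist())  # ['blockname20', 'aland20', 'shape_area']
```

The helper returns a new frame instead of modifying its input, which makes it easier to rerun cells out of order.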
    

## Testing the functions

In [6]:
# imports
from acquire import get_census_data
import pandas as pd
from pathlib import Path

# getting raw data
gdf10, gdf20 = get_census_data()

In [10]:
get_clean_gdf10()

Unnamed: 0,geoid10,countyfp10,blockce10,tractce10,blockname10,aland10,awater10,shape_length,shape_area
0,181457108005006,145,5006,710800,5006,4317949,1247,8751.984794,4.316672e+06
1,180799606002051,79,2051,960600,2051,1277980,31175,4977.961559,1.308558e+06
2,181259541002131,125,2131,954100,2131,1214448,2492,5833.011878,1.215994e+06
3,181259539002041,125,2041,953900,2041,4716036,68216,10184.519790,4.780449e+06
4,181259542004154,125,4154,954200,4154,7321864,52968,11325.534680,7.369043e+06
...,...,...,...,...,...,...,...,...,...
267066,180071002001095,7,1095,100200,1095,1299048,0,5474.841915,1.298024e+06
267067,181410107002032,141,2032,10700,2032,9060,0,643.886880,9.053447e+03
267068,181270505031019,127,1019,50503,1019,42364,0,1056.159352,4.233003e+04
267069,181410118024016,141,4016,11802,4016,19183,0,595.456254,1.916962e+04


In [7]:
get_clean_gdf20()

Unnamed: 0,countyfp20,tractce20,blockce20,geoid20,name20,aland20,awater20,shape_length,shape_area
0,39,1602,3026,180390016023026,3026,139907,0,0.020238,1.514178e-05
1,39,2002,3021,180390020023021,3021,169692,0,0.019385,1.833337e-05
2,39,2102,2020,180390021022020,2020,13103,0,0.004764,1.416532e-06
3,131,958900,1013,181319589001013,1013,1187548,0,0.080037,1.274152e-04
4,149,953700,2016,181499537002016,2016,1439638,0,0.085054,1.549878e-04
...,...,...,...,...,...,...,...,...,...
204563,167,900,2070,181670009002070,2070,4780,0,0.002825,5.004110e-07
204564,167,300,2034,181670003002034,2034,5722,0,0.003133,5.990546e-07
204565,167,10500,3027,181670105003027,3027,5783,0,0.003461,6.052680e-07
204566,167,300,2040,181670003002040,2040,5898,0,0.003267,6.174468e-07
