## Data Cleaning/Wrangling

Some initial data wrangling and cleaning was done before inserting the data. MariaDB accepts NULL values but not NaN, so our first attempts to insert rows containing NaN values raised errors; in the end we decided to remove all missing values beforehand.

As shown below, the data was loaded from 9 separate CSV files because the full dataset is very large (around 550,000 rows). It also contained a multipolygon coordinate column, which greatly increased the size of the files. We did not plan on using this column, so we removed it before inserting the data. This vastly cut down the insert time (an earlier attempt with the multipolygon column included was still running after 2 hours).

To load the data into the Jupyter notebook cloud environment without crashing it, and to insert it into the database much more quickly, we took a random sample of the property assessment dataset. After loading and cleaning the 9 CSV files, we merged them into one dataset, took a random sample of 200,000 rows, and saved it as a CSV so that we could always go back to it (from this point on the data stays the same and is never re-sampled). We chose 200,000 rows because we felt this was still enough data to answer our guiding questions accurately.
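We dropped the missing values, but the NaN versus NULL problem described above can also be handled by converting each missing value to Python's `None`, which the connector sends as SQL NULL. The snippet below is only a minimal sketch of that alternative, not what this notebook does; the helper name `rows_with_nulls` and the example values are made up for illustration.

```python
import numpy as np
import pandas as pd


def rows_with_nulls(df):
    """Return the rows of df as plain tuples, replacing every missing value with None."""
    return [
        tuple(None if pd.isna(value) else value for value in row)
        for row in df.itertuples(index=False, name=None)
    ]


# Tiny illustrative frame; these values are invented, not taken from the real dataset.
example = pd.DataFrame(
    {
        "ROLL_NUMBER": [1, 2],
        "LAND_SIZE_SM": [310.0, np.nan],
        "ADDRESS": ["52 EXAMPLE WY NE", None],
    }
)

example_rows = rows_with_nulls(example)
print(example_rows)
```

Tuples produced this way can be passed as the parameters of a parameterized `INSERT`, so rows with a few missing fields could be kept instead of dropped.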

In [30]:
#import libraries

import pandas as pd
import numpy as np

In [40]:
#Loaded in initial csv files separately so that the cloud kernel would not crash and restart

Property1 = pd.read_csv("Historical_Property_Assessments__Parcel_-1.csv")
Property2 = pd.read_csv("Historical_Property_Assessments__Parcel_-2.csv")
Property3 = pd.read_csv("Historical_Property_Assessments__Parcel_-3.csv")
Property4 = pd.read_csv("Historical_Property_Assessments__Parcel_-4.csv")
Property5 = pd.read_csv("Historical_Property_Assessments__Parcel_-5.csv")
Property6 = pd.read_csv("Historical_Property_Assessments__Parcel_-6.csv")
Property7 = pd.read_csv("Historical_Property_Assessments__Parcel_-7.csv")
Property8 = pd.read_csv("Historical_Property_Assessments__Parcel_-8.csv")
Property9 = pd.read_csv("Historical_Property_Assessments__Parcel_-9.csv")
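The nine `read_csv` calls above could also be written as a loop, and the multipolygon column could be skipped at load time so that it never occupies memory. This is a hedged sketch rather than the code we ran: `load_parts` is a hypothetical helper, and the file names are assumed to follow the numbering pattern used above.

```python
import io

import pandas as pd


def load_parts(sources, skip_columns=("MULTIPOLYGON",)):
    """Read each CSV source into a DataFrame, leaving out the columns in skip_columns."""
    skip = set(skip_columns)
    return [pd.read_csv(src, usecols=lambda name: name not in skip) for src in sources]


# Assumed file naming, following the pattern of the nine files loaded above.
part_paths = [f"Historical_Property_Assessments__Parcel_-{i}.csv" for i in range(1, 10)]
# frames = load_parts(part_paths)  # run this where the CSV files are available

# Small in-memory stand-in so the sketch can run without the real files.
demo_csv = io.StringIO("ROLL_NUMBER,ADDRESS,MULTIPOLYGON\n1,52 EXAMPLE WY NE,POLYGON\n")
demo_frames = load_parts([demo_csv])
print(demo_frames[0].columns.tolist())
```

Passing a callable to `usecols` keeps every column except the ones named, so the list of kept columns does not have to be written out.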


In [41]:
#checked which columns had NaN values present and how many were present 

display(Property1.isna().sum())
display(Property2.isna().sum())
display(Property3.isna().sum())
display(Property4.isna().sum())
display(Property5.isna().sum())
display(Property6.isna().sum())
display(Property7.isna().sum())
display(Property8.isna().sum())
display(Property9.isna().sum())

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  24
FL_ASSESSED_VALUE               72411
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            72411
LAND_USE_DESIGNATION                8
PROPERTY_TYPE                       0
LAND_SIZE_SM                        8
LAND_SIZE_SF                        8
LAND_SIZE_AC                        8
SUB_PROPERTY_USE                71054
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             3
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  80
FL_ASSESSED_VALUE               68926
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            68924
LAND_USE_DESIGNATION                5
PROPERTY_TYPE                       0
LAND_SIZE_SM                        5
LAND_SIZE_SF                        5
LAND_SIZE_AC                        5
SUB_PROPERTY_USE                61771
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  73
FL_ASSESSED_VALUE               71513
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            71511
LAND_USE_DESIGNATION                3
PROPERTY_TYPE                       0
LAND_SIZE_SM                        3
LAND_SIZE_SF                        3
LAND_SIZE_AC                        3
SUB_PROPERTY_USE                69714
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  40
FL_ASSESSED_VALUE               63193
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            63201
LAND_USE_DESIGNATION                0
PROPERTY_TYPE                       0
LAND_SIZE_SM                        0
LAND_SIZE_SF                        0
LAND_SIZE_AC                        0
SUB_PROPERTY_USE                57786
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  38
FL_ASSESSED_VALUE               54336
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            54357
LAND_USE_DESIGNATION                3
PROPERTY_TYPE                       0
LAND_SIZE_SM                        3
LAND_SIZE_SF                        3
LAND_SIZE_AC                        3
SUB_PROPERTY_USE                44331
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  23
FL_ASSESSED_VALUE               53976
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            54061
LAND_USE_DESIGNATION               11
PROPERTY_TYPE                       0
LAND_SIZE_SM                        0
LAND_SIZE_SF                        0
LAND_SIZE_AC                        0
SUB_PROPERTY_USE                41794
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  29
FL_ASSESSED_VALUE               41068
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            41070
LAND_USE_DESIGNATION                0
PROPERTY_TYPE                       0
LAND_SIZE_SM                        0
LAND_SIZE_SF                        0
LAND_SIZE_AC                        0
SUB_PROPERTY_USE                31564
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  45
FL_ASSESSED_VALUE               40387
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            40406
LAND_USE_DESIGNATION                4
PROPERTY_TYPE                       0
LAND_SIZE_SM                        4
LAND_SIZE_SF                        4
LAND_SIZE_AC                        4
SUB_PROPERTY_USE                29907
MULTIPOLYGON                        0
dtype: int64

ROLL_YEAR                           0
ROLL_NUMBER                         0
ADDRESS                             0
ASSESSED_VALUE                      0
ASSESSMENT_CLASS_DESCRIPTION        0
RE_ASSESSED_VALUE                  15
FL_ASSESSED_VALUE               53921
COMM_CODE                           0
COMM_NAME                           0
YEAR_OF_CONSTRUCTION            53961
LAND_USE_DESIGNATION                9
PROPERTY_TYPE                       0
LAND_SIZE_SM                        5
LAND_SIZE_SF                        5
LAND_SIZE_AC                        5
SUB_PROPERTY_USE                51517
MULTIPOLYGON                        0
dtype: int64

In [48]:
#Clean the dataframes to get rid of NaN values so we can load the data into the database

#make a list containing all our dataframes
frames = [Property1, Property2, Property3, Property4, Property5, Property6, Property7, Property8, Property9]

#columns we keep, where any row with a NaN value in one of them is dropped
na_rows_to_drop = ["LAND_USE_DESIGNATION", "LAND_SIZE_SM", "LAND_SIZE_SF", "LAND_SIZE_AC","ADDRESS"]

#columns we drop entirely, either because most of their values are NaN or because the column is not needed,
#such as the geo/multipolygon data (it is extremely large and crashes the load into the db, and we don't need it anyway)
columns_to_drop = ["FL_ASSESSED_VALUE","YEAR_OF_CONSTRUCTION", "SUB_PROPERTY_USE", "PROPERTY_TYPE", "RE_ASSESSED_VALUE", "MULTIPOLYGON"]

#loop through all 9 dataframes
for dataset in frames:
    #drop rows with a NaN value in any of the kept columns (these columns only have a small number of NaN values)
    dataset.dropna(subset=na_rows_to_drop, inplace=True)

    #drop the unnecessary columns (mostly NaN values, or not needed)
    dataset.drop(columns=columns_to_drop, inplace=True)


In [50]:
#check if there are any more remaining NaN values present in any of the 9 datasets
display(Property1.isna().sum())
display(Property2.isna().sum())
display(Property3.isna().sum())
display(Property4.isna().sum())
display(Property5.isna().sum())
display(Property6.isna().sum())
display(Property7.isna().sum())
display(Property8.isna().sum())
display(Property9.isna().sum())

#we can see that there are no NaN values present so we are ready to insert the data into a database

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

ROLL_YEAR                       0
ROLL_NUMBER                     0
ADDRESS                         0
ASSESSED_VALUE                  0
ASSESSMENT_CLASS_DESCRIPTION    0
COMM_CODE                       0
COMM_NAME                       0
LAND_USE_DESIGNATION            0
LAND_SIZE_SM                    0
LAND_SIZE_SF                    0
LAND_SIZE_AC                    0
dtype: int64

In [51]:
#merge all our separate cleaned/wrangled csvs into one so we can grab a sample from it and then insert it into the database 
frames = [Property1, Property2, Property3, Property4, Property5, Property6, Property7, Property8, Property9]

Property_Full = pd.concat(frames)

display(Property_Full)
                               

Unnamed: 0,ROLL_YEAR,ROLL_NUMBER,ADDRESS,ASSESSED_VALUE,ASSESSMENT_CLASS_DESCRIPTION,COMM_CODE,COMM_NAME,LAND_USE_DESIGNATION,LAND_SIZE_SM,LAND_SIZE_SF,LAND_SIZE_AC
1,2019,4032504,52 SADDLEBACK WY NE,360000,Residential,SAD,SADDLE RIDGE,R-1N,310.0,3337.0,0.08
2,2019,4032603,48 SADDLEBACK WY NE,366000,Residential,SAD,SADDLE RIDGE,R-1N,391.4,4214.0,0.10
3,2019,4032702,10 SADDLEBACK RD NE,404000,Residential,SAD,SADDLE RIDGE,R-1N,340.0,3659.0,0.08
4,2019,4032801,14 SADDLEBACK RD NE,385500,Residential,SAD,SADDLE RIDGE,R-1N,347.5,3741.0,0.09
5,2019,4032900,18 SADDLEBACK RD NE,395000,Residential,SAD,SADDLE RIDGE,R-1N,317.6,3419.0,0.08
...,...,...,...,...,...,...,...,...,...,...,...
53956,2019,792044703,130 CRANWELL CL SE,488000,Residential,CRA,CRANSTON,R-1,451.0,4855.0,0.11
53957,2019,792044802,126 CRANWELL CL SE,495000,Residential,CRA,CRANSTON,R-1,438.2,4717.0,0.11
53958,2019,792044901,122 CRANWELL CL SE,523000,Residential,CRA,CRANSTON,R-1,484.6,5216.0,0.12
53959,2019,814002002,20606 56 ST SE,1910000,Residential,12F,RESIDUAL WARD 12 - SUB AREA 12F,"S-FUD,S-SPR,C-C1,M-1,S-R,R-G,R-Gm",647184.6,6966465.0,159.93


In [55]:
#take a sample of the massive combined dataset so that the data can be inserted into the database at a quicker pace

#Note: I commented this entire cell out because I did not want to accidentally run it again, since we did not want to keep drawing different samples
#I tried setting the seed with random.seed, but the sample was still different every time (pandas does not use Python's random module when sampling; the random_state argument of sample() controls this instead).
#We also compared the number of unique community codes and found that the sample contained 249 of the 260 community codes. The missing codes turned out to be residual ward areas, so not even communities (we did not plan on using these anyway, so there was no issue with them missing)

#import random

#random.seed(2023)

#Property_Data = Property_Full.sample(200000)

#display(Property_Full["COMM_CODE"].nunique())
#display(Property_Data["COMM_CODE"].nunique())

#display(Property_Data)


260

249
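The sample changed on every run because `random.seed` only seeds Python's built-in `random` module, while `DataFrame.sample` draws from NumPy's random generator. Passing `random_state` to `sample` makes the draw repeatable. The frame below is a made-up stand-in for `Property_Full`, so this is only a sketch of the idea:

```python
import pandas as pd

# Made-up stand-in for the combined dataset.
population = pd.DataFrame(
    {"COMM_CODE": ["AAA", "BBB", "CCC", "DDD"] * 25, "ROW_ID": range(100)}
)

# The same random_state gives the same rows on every run.
sample_a = population.sample(n=20, random_state=2023)
sample_b = population.sample(n=20, random_state=2023)

print(sample_a.equals(sample_b))  # True
print(sample_a["COMM_CODE"].nunique(), "of", population["COMM_CODE"].nunique(), "codes present")
```

With a fixed `random_state`, the sampling cell could be re-run safely, although saving the sample to a CSV as we did also keeps the data fixed.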

In [56]:
#Take our sample dataset that was made and save it as a CSV file so that the data would not be changed and would forever be saved in case we had issues and needed to go back to the specific data we are using. 

#Property_Data.to_csv("PropertyData.csv", index=False)

In [7]:
import pandas as pd

Property_Assessment_Data = pd.read_csv("PropertyData.csv")

display(Property_Assessment_Data)

Unnamed: 0,ROLL_YEAR,ROLL_NUMBER,ADDRESS,ASSESSED_VALUE,ASSESSMENT_CLASS_DESCRIPTION,COMM_CODE,COMM_NAME,LAND_USE_DESIGNATION,LAND_SIZE_SM,LAND_SIZE_SF,LAND_SIZE_AC
0,2019,67946327,312 1040 15 AV SW,228500,Residential,BLN,BELTLINE,CC-MH,1010.0,10871.0,0.25
1,2019,202249132,302 1720 10 ST SW,254000,Residential,LMR,LOWER MOUNT ROYAL,M-C2,1615.8,17393.0,0.40
2,2019,79582243,2320 ERLTON PL SW,616000,Residential,ERL,ERLTON,M-CG d87,106.5,1147.0,0.03
3,2019,201219904,2406 11811 LAKE FRASER DR SE,186000,Residential,LKB,LAKE BONAVISTA,"DC (pre 1P2007),M-H1 d247",22509.6,242299.0,5.56
4,2019,200571818,169 EVERSYDE CM SW,302500,Residential,EVE,EVERGREEN,M-G d44,123.0,1324.0,0.03
...,...,...,...,...,...,...,...,...,...,...,...
199995,2019,104017405,6323 LOUISE RD SW,626000,Residential,LKV,LAKEVIEW,R-C1,611.5,6583.0,0.15
199996,2019,202508461,1 1616 15 AV SW,660000,Residential,SNA,SUNALTA,M-CG d111,603.7,6498.0,0.15
199997,2019,154012405,871 PARKRIDGE RD SE,569500,Residential,PKL,PARKLAND,R-C1,592.4,6376.0,0.15
199998,2019,128057205,152 BRAXTON PL SW,450000,Residential,BRA,BRAESIDE,R-C1,897.6,9662.0,0.22


## Inserting Data into our Database

After briefly cleaning and wrangling our data, we created the database table and inserted the data, as shown below.

In [4]:
import mysql.connector
from mysql.connector import errorcode

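#raw string so the backslashes in the Windows path are not read as escape sequences
filepath = r"C:\Users\chore\Documents\Data Science\DATA 604\mariadbpassword.txt"

with open(filepath) as f:
    #strip any trailing newline from the password file
    passw = f.read().strip()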

#attempt a connection
myconnection = mysql.connector.connect(user="connor_horemans", 
                                       password=passw,
                                       host="datasciencedb2.ucalgary.ca", 
                                       database="connor_horemans",
                                       allow_local_infile=True)
myconnection

<mysql.connector.connection_cext.CMySQLConnection at 0x7eff6c1eb550>

In [8]:
# CREATE TABLE STATEMENT
create_statement = '''create table connor_horemans.property_data (
    ROLL_YEAR int,
    ROLL_NUMBER int,
    ADDRESS varchar(1000), 
    ASSESSED_VALUE int,
    ASSESSMENT_CLASS_DESCRIPTION varchar(15),
    COMM_CODE varchar(3),
    COMM_NAME varchar(250),
    LAND_USE_DESIGNATION varchar(1000),
    LAND_SIZE_SM float,
    LAND_SIZE_SF float, 
    LAND_SIZE_AC float
    );'''

# now we'll create a cursor and run our create statement
create_cursor = myconnection.cursor()
try:
    create_cursor.execute(create_statement)
except mysql.connector.Error as err:
    if err.errno == errorcode.ER_TABLE_EXISTS_ERROR:
        print("Ooops! We already have that table")
    else:
        print(err.msg)
else:
    print("table created successfully!")

create_cursor.close()

Ooops! We already have that table


True

In [9]:
#insert our data into the table we made above

insertCursor = myconnection.cursor()

columnString = "`,`".join([str(currentColumn) for currentColumn in Property_Assessment_Data.columns.tolist()])
#print (columnString)

# build the parameterized INSERT statement once, then insert rows one by one from the DataFrame
insertCommand = "INSERT INTO `property_data` (`" + columnString + "`) VALUES (" + "%s,"*(len(Property_Assessment_Data.columns)-1) + "%s)"
for i, currentRow in Property_Assessment_Data.iterrows():
    insertCursor.execute(insertCommand, tuple(currentRow))
    
myconnection.commit()

insertCursor.close()

True
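Inserting 200,000 rows one `execute` call at a time is slow. A common alternative is to send rows in batches with the cursor's `executemany` method. The sketch below is an assumption about how that could look, not the code we ran: the helper names and the batch size of 1000 are hypothetical, and `connection` is expected to be an open `mysql.connector` connection such as `myconnection`.

```python
def build_insert_statement(table, columns):
    """Build a parameterized INSERT statement with one %s placeholder per column."""
    column_list = ", ".join(f"`{name}`" for name in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"


def chunked(rows, size):
    """Yield consecutive slices of rows, each holding at most size items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_batches(connection, table, columns, rows, batch_size=1000):
    """Insert rows in batches with executemany and commit once at the end."""
    statement = build_insert_statement(table, columns)
    cursor = connection.cursor()
    try:
        for batch in chunked(rows, batch_size):
            cursor.executemany(statement, batch)
        connection.commit()
    finally:
        cursor.close()


demo_statement = build_insert_statement("property_data", ["ROLL_YEAR", "ROLL_NUMBER"])
print(demo_statement)
```

With the DataFrame above, the rows could be prepared with `list(Property_Assessment_Data.itertuples(index=False, name=None))` and passed together with `list(Property_Assessment_Data.columns)`.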

In [2]:
#DELETE TABLE IF NEEDED. DO NOT RUN IF NOT REQUIRED.

deletecursor = myconnection.cursor()
sql = "DROP TABLE IF EXISTS property_data;"
deletecursor.execute(sql)
deletecursor.close()

True

## Queries 

The reasoning behind each query is given in the comment directly below that query's cell; each comment applies only to the query above it.

## Query 1

In [8]:
#Get unique Community Names and Codes

read_cursor = myconnection.cursor(buffered=True, dictionary=True)

query_string = ("SELECT DISTINCT COMM_CODE, COMM_NAME FROM property_data ORDER BY COMM_NAME ASC;")

read_cursor.execute(query_string)

for (prop_value) in read_cursor:
    print(prop_value)
    
read_cursor.close()

{'COMM_CODE': 'ABB', 'COMM_NAME': 'ABBEYDALE'}
{'COMM_CODE': 'ACA', 'COMM_NAME': 'ACADIA'}
{'COMM_CODE': 'ALB', 'COMM_NAME': 'ALBERT PARK/RADISSON HEIGHTS'}
{'COMM_CODE': 'ALT', 'COMM_NAME': 'ALTADORE'}
{'COMM_CODE': 'AYB', 'COMM_NAME': 'ALYTH/BONNYBROOK'}
{'COMM_CODE': 'APP', 'COMM_NAME': 'APPLEWOOD PARK'}
{'COMM_CODE': 'ARB', 'COMM_NAME': 'ARBOUR LAKE'}
{'COMM_CODE': 'ASP', 'COMM_NAME': 'ASPEN WOODS'}
{'COMM_CODE': 'AUB', 'COMM_NAME': 'AUBURN BAY'}
{'COMM_CODE': 'BNF', 'COMM_NAME': 'BANFF TRAIL'}
{'COMM_CODE': 'BNK', 'COMM_NAME': 'BANKVIEW'}
{'COMM_CODE': 'BYV', 'COMM_NAME': 'BAYVIEW'}
{'COMM_CODE': 'BED', 'COMM_NAME': 'BEDDINGTON HEIGHTS'}
{'COMM_CODE': 'BEL', 'COMM_NAME': 'BEL-AIRE'}
{'COMM_CODE': 'BLM', 'COMM_NAME': 'BELMONT'}
{'COMM_CODE': 'BLN', 'COMM_NAME': 'BELTLINE'}
{'COMM_CODE': 'BVD', 'COMM_NAME': 'BELVEDERE'}
{'COMM_CODE': 'BDO', 'COMM_NAME': 'BONAVISTA DOWNS'}
{'COMM_CODE': 'BOW', 'COMM_NAME': 'BOWNESS'}
{'COMM_CODE': 'BRA', 'COMM_NAME': 'BRAESIDE'}
{'COMM_CODE': 'BRE', 

True

The community code is the column we use to join our datasets together, so it is useful to know which community codes are present in each one and how well they align. This list also shows which community codes/names we can discard because we will not be using them. For instance, some entries are wards rather than communities (e.g. "Residual Ward 1"), and others are so specific that they would not normally be considered a community of Calgary. One example is Queens Park Village: when you look it up, it turns out to be a graveyard rather than a community. We do not know why it is listed, so this is one name we can disregard.

## Query 2

In [9]:
#Get the maximum and minimum property assessment values per community

read_cursor = myconnection.cursor(buffered=True, dictionary=True)

query_string = ("SELECT max(ASSESSED_VALUE), min(ASSESSED_VALUE), COMM_CODE, COMM_NAME FROM property_data GROUP BY COMM_NAME ORDER BY COMM_NAME ASC;")

read_cursor.execute(query_string)

for (prop_value) in read_cursor:
    print(prop_value)
    
read_cursor.close()

{'max(ASSESSED_VALUE)': 1850000, 'min(ASSESSED_VALUE)': 170000, 'COMM_CODE': 'ABB', 'COMM_NAME': 'ABBEYDALE'}
{'max(ASSESSED_VALUE)': 50920000, 'min(ASSESSED_VALUE)': 108500, 'COMM_CODE': 'ACA', 'COMM_NAME': 'ACADIA'}
{'max(ASSESSED_VALUE)': 25000000, 'min(ASSESSED_VALUE)': 10000, 'COMM_CODE': 'ALB', 'COMM_NAME': 'ALBERT PARK/RADISSON HEIGHTS'}
{'max(ASSESSED_VALUE)': 3980000, 'min(ASSESSED_VALUE)': 3050, 'COMM_CODE': 'ALT', 'COMM_NAME': 'ALTADORE'}
{'max(ASSESSED_VALUE)': 179000, 'min(ASSESSED_VALUE)': 179000, 'COMM_CODE': 'AYB', 'COMM_NAME': 'ALYTH/BONNYBROOK'}
{'max(ASSESSED_VALUE)': 27410000, 'min(ASSESSED_VALUE)': 3050, 'COMM_CODE': 'APP', 'COMM_NAME': 'APPLEWOOD PARK'}
{'max(ASSESSED_VALUE)': 2020000, 'min(ASSESSED_VALUE)': 10000, 'COMM_CODE': 'ARB', 'COMM_NAME': 'ARBOUR LAKE'}
{'max(ASSESSED_VALUE)': 28900000, 'min(ASSESSED_VALUE)': 10000, 'COMM_CODE': 'ASP', 'COMM_NAME': 'ASPEN WOODS'}
{'max(ASSESSED_VALUE)': 49730000, 'min(ASSESSED_VALUE)': 10000, 'COMM_CODE': 'AUB', 'COMM_NAM

True

Getting the maximum and minimum assessed value per community shows us the range of property assessment values and lets us spot entries, such as typos, that we can ignore. After looking at these values we decided to apply a threshold and only keep assessment values greater than 100,000. We did this because the assessed value column also included monthly rent costs, which would have interfered with later queries such as the average assessed value. For this project we wanted to focus solely on property values (the cost of buying an entire property) and exclude rent, and since both are stored in the same column it would be tricky to tell them apart any other way. The 100,000 threshold removes rows that hold monthly/yearly rent prices, and it also removes assessment values that were set to 0 because they were not properly recorded when the data was gathered. We also consider it a realistic cut-off: you would be winning the lottery if you found a property for less than 100,000 in Calgary. The threshold still keeps the rows for apartments that are bought outright rather than rented, which we treat the same as buying your own home.

## Query 3

In [10]:
#Get the total property assessment count per each community above the chosen property assessment value threshold

read_cursor = myconnection.cursor(buffered=True, dictionary=True)

query_string = ("SELECT count(COMM_CODE), COMM_CODE, COMM_NAME FROM property_data WHERE ASSESSED_VALUE > 100000 GROUP BY COMM_NAME;")

read_cursor.execute(query_string)

for (prop_value) in read_cursor:
    print(prop_value)
    
read_cursor.close()

{'count(COMM_CODE)': 735, 'COMM_CODE': 'ABB', 'COMM_NAME': 'ABBEYDALE'}
{'count(COMM_CODE)': 1419, 'COMM_CODE': 'ACA', 'COMM_NAME': 'ACADIA'}
{'count(COMM_CODE)': 707, 'COMM_CODE': 'ALB', 'COMM_NAME': 'ALBERT PARK/RADISSON HEIGHTS'}
{'count(COMM_CODE)': 1037, 'COMM_CODE': 'ALT', 'COMM_NAME': 'ALTADORE'}
{'count(COMM_CODE)': 1, 'COMM_CODE': 'AYB', 'COMM_NAME': 'ALYTH/BONNYBROOK'}
{'count(COMM_CODE)': 639, 'COMM_CODE': 'APP', 'COMM_NAME': 'APPLEWOOD PARK'}
{'count(COMM_CODE)': 1343, 'COMM_CODE': 'ARB', 'COMM_NAME': 'ARBOUR LAKE'}
{'count(COMM_CODE)': 1198, 'COMM_CODE': 'ASP', 'COMM_NAME': 'ASPEN WOODS'}
{'count(COMM_CODE)': 2389, 'COMM_CODE': 'AUB', 'COMM_NAME': 'AUBURN BAY'}
{'count(COMM_CODE)': 495, 'COMM_CODE': 'BNF', 'COMM_NAME': 'BANFF TRAIL'}
{'count(COMM_CODE)': 848, 'COMM_CODE': 'BNK', 'COMM_NAME': 'BANKVIEW'}
{'count(COMM_CODE)': 96, 'COMM_CODE': 'BYV', 'COMM_NAME': 'BAYVIEW'}
{'count(COMM_CODE)': 1681, 'COMM_CODE': 'BED', 'COMM_NAME': 'BEDDINGTON HEIGHTS'}
{'count(COMM_CODE)': 

True

Getting the total property count for each community gives us a general idea of the size of each community relative to the others. This will also be useful when we compare property assessment values/counts with factors from our other datasets, such as demographics and crime, since we need to account for the size differences between communities. (This helps with our guiding question on population density and property assessments/density.)

Along with the 100,000 threshold chosen after the previous query, we decided from this query to only consider communities with more than 10 property assessments, which is the `HAVING count(COMM_CODE) > 10` condition in the queries that follow. We do not really consider a community with so few assessed properties to be a community. A count may be this low because the community is extremely new, because it is one of the "residual ward" areas (mentioned earlier, and not used anyway), or because of a possible typo. Having these two restrictions in place (a minimum property count and assessed values above 100,000) gives us consistent data in the next few queries: we compute averages there, and the count restriction ensures that an average is never based on just a single property.
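The two restrictions can also be cross-checked in pandas. The sketch below mirrors the `WHERE ASSESSED_VALUE > 100000` and `HAVING count(COMM_CODE) > 10` conditions used in the queries that follow. It is illustrative only: `community_averages` is a hypothetical helper and the demo values are invented.

```python
import pandas as pd


def community_averages(df, value_threshold=100_000, min_count=10):
    """Average assessed value per community, applying both restrictions."""
    # Keep only rows above the assessed-value threshold (the WHERE clause).
    kept = df[df["ASSESSED_VALUE"] > value_threshold]
    summary = kept.groupby(["COMM_CODE", "COMM_NAME"])["ASSESSED_VALUE"].agg(["count", "mean"])
    # Keep only communities with more than min_count remaining rows (the HAVING clause).
    return summary[summary["count"] > min_count].reset_index()


# Invented data: ALPHA has 11 rows above the threshold, BETA only 3.
demo = pd.DataFrame(
    {
        "COMM_CODE": ["AAA"] * 12 + ["BBB"] * 3,
        "COMM_NAME": ["ALPHA"] * 12 + ["BETA"] * 3,
        "ASSESSED_VALUE": [300_000] * 11 + [50_000] + [400_000] * 3,
    }
)

result = community_averages(demo)
print(result)
```

Running the same function on `Property_Assessment_Data` would give a quick way to compare the pandas result against the SQL output.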

## Query 4

In [13]:
#Get the average property assessment values per each community above the chosen property assessment value threshold

read_cursor = myconnection.cursor(buffered=True, dictionary=True)

query_string = ("SELECT avg(ASSESSED_VALUE), COMM_CODE, COMM_NAME FROM property_data WHERE ASSESSED_VALUE > 100000 GROUP BY COMM_NAME HAVING count(COMM_CODE) > 10;")

read_cursor.execute(query_string)

for (prop_value) in read_cursor:
    print(prop_value)
    
read_cursor.close()

{'avg(ASSESSED_VALUE)': Decimal('297105.4422'), 'COMM_CODE': 'ABB', 'COMM_NAME': 'ABBEYDALE'}
{'avg(ASSESSED_VALUE)': Decimal('385040.1691'), 'COMM_CODE': 'ACA', 'COMM_NAME': 'ACADIA'}
{'avg(ASSESSED_VALUE)': Decimal('483790.6648'), 'COMM_CODE': 'ALB', 'COMM_NAME': 'ALBERT PARK/RADISSON HEIGHTS'}
{'avg(ASSESSED_VALUE)': Decimal('846348.0154'), 'COMM_CODE': 'ALT', 'COMM_NAME': 'ALTADORE'}
{'avg(ASSESSED_VALUE)': Decimal('380939.7496'), 'COMM_CODE': 'APP', 'COMM_NAME': 'APPLEWOOD PARK'}
{'avg(ASSESSED_VALUE)': Decimal('494914.7431'), 'COMM_CODE': 'ARB', 'COMM_NAME': 'ARBOUR LAKE'}
{'avg(ASSESSED_VALUE)': Decimal('921439.0651'), 'COMM_CODE': 'ASP', 'COMM_NAME': 'ASPEN WOODS'}
{'avg(ASSESSED_VALUE)': Decimal('503028.0452'), 'COMM_CODE': 'AUB', 'COMM_NAME': 'AUBURN BAY'}
{'avg(ASSESSED_VALUE)': Decimal('680679.7980'), 'COMM_CODE': 'BNF', 'COMM_NAME': 'BANFF TRAIL'}
{'avg(ASSESSED_VALUE)': Decimal('408836.6745'), 'COMM_CODE': 'BNK', 'COMM_NAME': 'BANKVIEW'}
{'avg(ASSESSED_VALUE)': Decimal('1

True

Getting the average assessed value for each community makes it easy to compare property assessment values between communities and to see which communities are, on average, lower end or higher end. For our final project this helps us check whether there is a relationship between a community's average assessed value and its crime counts and types. The same applies to demographics: we can check whether certain demographics are related to the average property assessment value of a community.

## Query 5

In [22]:
#Find the top 10 communities to look into based on property counts if you are interested in buying a home within a certain price range. Each price range increment is 250,000

#make a list of the upper range of the price range
prop_upper_lim = [350000, 600000, 850000, 1100000, 1350000]

read_cursor = myconnection.cursor(buffered=True, dictionary=True)

query_string = ("SELECT count(ASSESSED_VALUE), COMM_CODE, COMM_NAME FROM property_data WHERE ASSESSED_VALUE BETWEEN %s-250000 AND %s GROUP BY COMM_NAME ORDER BY count(ASSESSED_VALUE) DESC LIMIT 10;")

#loop through our price ranges and insert them into the %s placeholders
for prop_range in prop_upper_lim: 
    print("Top 10 Communities to look into if you are buying a home in the price range:", prop_range - 250000, "-", prop_range)
    read_cursor.execute(query_string, (prop_range,prop_range,))
    for (prop_value) in read_cursor:
        print(prop_value)
        
read_cursor.close()


Top 10 Communities to look into if you are buying a home in the price range: 100000 - 350000
{'count(ASSESSED_VALUE)': 3177, 'COMM_CODE': 'BLN', 'COMM_NAME': 'BELTLINE'}
{'count(ASSESSED_VALUE)': 1436, 'COMM_CODE': 'DOV', 'COMM_NAME': 'DOVER'}
{'count(ASSESSED_VALUE)': 1429, 'COMM_CODE': 'MCT', 'COMM_NAME': 'MCKENZIE TOWNE'}
{'count(ASSESSED_VALUE)': 1027, 'COMM_CODE': 'FAL', 'COMM_NAME': 'FALCONRIDGE'}
{'count(ASSESSED_VALUE)': 1008, 'COMM_CODE': 'PEN', 'COMM_NAME': 'PENBROOKE MEADOWS'}
{'count(ASSESSED_VALUE)': 954, 'COMM_CODE': 'PAN', 'COMM_NAME': 'PANORAMA HILLS'}
{'count(ASSESSED_VALUE)': 875, 'COMM_CODE': 'PIN', 'COMM_NAME': 'PINERIDGE'}
{'count(ASSESSED_VALUE)': 875, 'COMM_CODE': 'MRL', 'COMM_NAME': 'MARLBOROUGH'}
{'count(ASSESSED_VALUE)': 852, 'COMM_CODE': 'MPK', 'COMM_NAME': 'MARLBOROUGH PARK'}
{'count(ASSESSED_VALUE)': 828, 'COMM_CODE': 'MRT', 'COMM_NAME': 'MARTINDALE'}
Top 10 Communities to look into if you are buying a home in the price range: 350000 - 600000
{'count(ASSESS

True

One goal of our project was to gain insight into communities that people may be interested in when buying a new home, whether they are first-time buyers or simply want a higher end property. Finding the top 10 communities with the highest count of properties within a specific price range gives us this insight. Furthermore, we can use this information along with our other datasets to expand on our guiding questions (such as when we look at relationships between age demographics and property assessment values in communities).

We chose to go up in increments of 250,000 for each property assessment value range, starting from a lower bound of 100,000.
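The ranges used in the loop can be derived from the upper limits, as in the minimal sketch below. The helpers `price_ranges` and `matching_ranges` are hypothetical names introduced for illustration and are not part of the notebook. The sketch also shows one thing worth keeping in mind: SQL `BETWEEN` is inclusive at both ends, so a property assessed at exactly 350,000 would fall into both the first and the second range.

```python
# Hypothetical helpers (not in the original notebook) that mirror how
# Query 5 turns each upper limit into a 250,000-wide price range.
RANGE_WIDTH = 250000
prop_upper_lim = [350000, 600000, 850000, 1100000, 1350000]


def price_ranges(upper_limits, width=RANGE_WIDTH):
    """Return (lower, upper) pairs, one per upper limit."""
    return [(upper - width, upper) for upper in upper_limits]


def matching_ranges(value, ranges):
    """Return every range containing value, both ends inclusive like SQL BETWEEN."""
    return [(lower, upper) for lower, upper in ranges if lower <= value <= upper]


ranges = price_ranges(prop_upper_lim)
for lower, upper in ranges:
    print(lower, "-", upper)

# A value sitting exactly on a boundary matches two adjacent ranges.
print(matching_ranges(350000, ranges))
```

If the double counting at the boundaries matters for a later analysis, one option would be to make the lower bound exclusive in the query (for example `ASSESSED_VALUE > lower AND ASSESSED_VALUE <= upper`). That would change the counts slightly, so the output shown above reflects the inclusive version.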

## Query 6

In [27]:
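#Get the average land size (sq. ft.) per community, using only properties assessed above 100,000
#with a land size below the 15,000 sq. ft. threshold, and only communities with more than 10 such properties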

read_cursor = myconnection.cursor(buffered=True, dictionary=True)

query_string = ("SELECT avg(LAND_SIZE_SF), COMM_NAME FROM property_data WHERE ASSESSED_VALUE > 100000 AND LAND_SIZE_SF < 15000 GROUP BY COMM_NAME HAVING count(COMM_CODE) > 10;")

read_cursor.execute(query_string)

for (prop_value) in read_cursor:
    print(prop_value)
    
read_cursor.close()

{'avg(LAND_SIZE_SF)': 4099.422316384181, 'COMM_NAME': 'ABBEYDALE'}
{'avg(LAND_SIZE_SF)': 5727.267837541163, 'COMM_NAME': 'ACADIA'}
{'avg(LAND_SIZE_SF)': 5500.396226415094, 'COMM_NAME': 'ALBERT PARK/RADISSON HEIGHTS'}
{'avg(LAND_SIZE_SF)': 4892.798117154812, 'COMM_NAME': 'ALTADORE'}
{'avg(LAND_SIZE_SF)': 4184.833333333333, 'COMM_NAME': 'APPLEWOOD PARK'}
{'avg(LAND_SIZE_SF)': 4882.549158547387, 'COMM_NAME': 'ARBOUR LAKE'}
{'avg(LAND_SIZE_SF)': 5608.866537717601, 'COMM_NAME': 'ASPEN WOODS'}
{'avg(LAND_SIZE_SF)': 4365.851798561151, 'COMM_NAME': 'AUBURN BAY'}
{'avg(LAND_SIZE_SF)': 5827.481707317073, 'COMM_NAME': 'BANFF TRAIL'}
{'avg(LAND_SIZE_SF)': 7486.755520504732, 'COMM_NAME': 'BANKVIEW'}
{'avg(LAND_SIZE_SF)': 10119.556962025317, 'COMM_NAME': 'BAYVIEW'}
{'avg(LAND_SIZE_SF)': 4515.905109489051, 'COMM_NAME': 'BEDDINGTON HEIGHTS'}
{'avg(LAND_SIZE_SF)': 12198.415384615384, 'COMM_NAME': 'BEL-AIRE'}
{'avg(LAND_SIZE_SF)': 3453.4603174603176, 'COMM_NAME': 'BELMONT'}
{'avg(LAND_SIZE_SF)': 7722.23

True

I included this query as an option for future project work, in case we have time to expand on our guiding questions after answering them. Rather than looking purely at property assessment values and their relationship with demographics or crime, we can also look at average property sizes (in square feet) per community to see whether any correlations are present there as well. Many people look for homes with specific property sizes, so this could be an interesting direction to expand upon.

We used an upper limit of 15,000 sq. ft. in this query because, in this column, apartments sometimes report the square footage of the entire apartment complex rather than that of the individual unit. This restriction keeps rows that carry an entire complex's square footage from skewing the average square footage calculations. We also chose 15,000 because we felt that no property within Calgary's city limits (other than a massive apartment complex) would exceed it: while properties in Calgary can be quite spacious, they are not the size of farmland properties, so it would be quite difficult for one to exceed 15,000 square feet.
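The effect of the cap can be illustrated with a small sketch. The land sizes below are invented for illustration and do not come from the dataset; the last value stands in for an apartment row that reports the whole complex.

```python
# Illustrative values only (invented, not taken from property_data).
LAND_SIZE_CAP_SF = 15000

# Three typical lots plus one row carrying a whole-complex figure.
land_sizes_sf = [4200.0, 5100.0, 4800.0, 250000.0]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Same filter as the query's "LAND_SIZE_SF < 15000" condition.
capped = [size for size in land_sizes_sf if size < LAND_SIZE_CAP_SF]

print("mean without cap:", mean(land_sizes_sf))
print("mean with cap:", mean(capped))
```

In this made-up example a single oversized row pulls the average far above any individual lot, while the capped average stays close to the typical lot size.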
