In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [2]:
## We will use machine learning via scikit-learn to perform linear regression analysis.
## First, we'll import our cleaned data.
austin_change = 'https://raw.githubusercontent.com/dianahandler/Final_Module20_Group3/main/autinHousingData_cleaned_citynamechanged.csv'
austin_no_change = 'https://raw.githubusercontent.com/dianahandler/Final_Module20_Group3/main/autinHousingData_cleaned_nochangecity.csv'
df_ac = pd.read_csv(austin_change)
df_nc = pd.read_csv(austin_no_change)

In [3]:
df_ac.head()

Unnamed: 0,zpid,city,streetAddress,zipcode,description,latitude,longitude,propertyTaxRate,garageSpaces,hasAssociation,...,avgSchoolRating,avgSchoolSize,MedianStudentsPerTeacher,numOfBathrooms,numOfBedrooms,numOfStories,homeImage,zip_rank,median_zip,pr_sqft
0,111373431,austin,14424 Lake Victor Dr,78660,"14424 Lake Victor Dr, Pflugerville, TX 78660 i...",30.430632,-97.663078,1.98,2,1,...,2.666667,1063,14,3.0,4,2,111373431_ffce26843283d3365c11d81b8e6bdc6f-p_f...,8,289500.0,117.0
1,120900430,austin,1104 Strickling Dr,78660,Absolutely GORGEOUS 4 Bedroom home with 2 full...,30.432673,-97.661697,1.98,2,1,...,2.666667,1063,14,2.0,4,1,120900430_8255c127be8dcf0a1a18b7563d987088-p_f...,8,289500.0,167.0
2,2084491383,austin,1408 Fort Dessau Rd,78660,Under construction - estimated completion in A...,30.409748,-97.639771,1.98,0,1,...,3.0,1108,14,2.0,3,1,2084491383_a2ad649e1a7a098111dcea084a11c855-p_...,8,289500.0,173.0
3,120901374,austin,1025 Strickling Dr,78660,Absolutely darling one story home in charming ...,30.432112,-97.661659,1.98,2,1,...,2.666667,1063,14,2.0,3,1,120901374_b469367a619da85b1f5ceb69b675d88e-p_f...,8,289500.0,143.0
4,60134862,austin,15005 Donna Jane Loop,78660,Brimming with appeal & warm livability! Sleek ...,30.437368,-97.65686,1.98,0,1,...,4.0,1223,14,3.0,3,2,60134862_b1a48a3df3f111e005bb913873e98ce2-p_f.jpg,8,289500.0,113.0
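
The `Unnamed: 0` column in the preview above is typically a row index that was written into the CSV when the file was saved. The sketch below uses a small made-up in-memory CSV (not the project files) to show how passing `index_col=0` to `pd.read_csv` reads that column back as the index instead of as a feature. The column values are hypothetical and only illustrate the behavior.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics a file saved together with its index.
csv_text = ",zpid,latestPrice\n0,111,300000\n1,222,450000\n"

# Default read: the unnamed first column is loaded as 'Unnamed: 0'.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether to apply this to the Austin files is a judgment call; it mainly keeps a meaningless counter out of the later correlation table.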


## Pearson correlation analysis

As the dataframe preview above shows, there are more than 40 features to consider when predicting property prices. With correlation analysis, as mentioned in 15.7.1 (2021), we can generate Pearson correlation coefficients to quantify how strongly each feature is linearly related to those prices.
To do this with dataframe columns, we will use corr(), which computes Pearson correlation coefficients by default ('pandas.DataFrame.corr', 2021).

Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
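
For reference, the Pearson coefficient between two columns is their covariance divided by the product of their standard deviations. The sketch below computes it by hand on made-up numbers (the `sqft` and `price` values are hypothetical, not taken from the Austin data) and compares the result with what `corr()` returns, both with its default and with `method="pearson"` set explicitly.

```python
import numpy as np
import pandas as pd

# Made-up toy data, purely illustrative.
toy = pd.DataFrame({
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "price": [200000, 260000, 330000, 380000, 470000],
})

# Center both columns, then apply the Pearson formula directly.
x = toy["sqft"] - toy["sqft"].mean()
y = toy["price"] - toy["price"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# corr() with no arguments, and with the method named explicitly.
r_default = toy.corr().loc["sqft", "price"]
r_pearson = toy.corr(method="pearson").loc["sqft", "price"]

print(r_manual, r_default, r_pearson)
```

All three values should agree up to floating-point rounding.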

Note that this correlation analysis was already carried out elsewhere in the group repository, but only as a demonstration. We recreate it here because it is key to our later regression analysis.
Source: https://github.com/dianahandler/Final_Module20_Group3/commit/7c2f33ca7364dfbb40c21b7be7651fa2b5ad51de?branch=7c2f33ca7364dfbb40c21b7be7651fa2b5ad51de#diff-1c8bade9d6a619027f877dbf242cf157c76dbcd45194479d71bc5e565ffa1a09

In [13]:
## Let's look at the column names for future reference.
df_ac.columns

Index(['zpid', 'city', 'streetAddress', 'zipcode', 'description', 'latitude',
       'longitude', 'propertyTaxRate', 'garageSpaces', 'hasAssociation',
       'hasCooling', 'hasGarage', 'hasHeating', 'hasSpa', 'hasView',
       'homeType', 'parkingSpaces', 'yearBuilt', 'latestPrice',
       'numPriceChanges', 'latest_saledate', 'latest_salemonth',
       'latest_saleyear', 'latestPriceSource', 'numOfPhotos', 'accessibility',
       'numOfAppliances', 'numOfParkingFeatures', 'patioporch', 'security',
       'waterfront', 'windowfeatures', 'community', 'lotSizeSqFt',
       'livingAreaSqFt', 'numOfPrimarySchools', 'numOfElementarySchools',
       'numOfMiddleSchools', 'numOfHighSchools', 'avgSchoolDistance',
       'avgSchoolRating', 'avgSchoolSize', 'MedianStudentsPerTeacher',
       'numOfBathrooms', 'numOfBedrooms', 'numOfStories', 'homeImage',
       'zip_rank', 'median_zip', 'pr_sqft'],
      dtype='object')

In [14]:
## For this project, 'latestPrice' is the output we want to predict, so we only need
## the 'latestPrice' column of the correlation dataframe that corr() returns.
## numeric_only=True (pandas 1.5+) skips text columns such as 'city'; pandas 2.0+ raises an error without it.
correlations = df_ac.corr(numeric_only=True)['latestPrice']
correlations.head()

zpid               0.010168
zipcode           -0.238276
latitude           0.085285
longitude         -0.198968
propertyTaxRate   -0.059908
Name: latestPrice, dtype: float64

In [19]:
## Sorting this series of coefficients in descending order makes them easier to evaluate.
correlations.sort_values(ascending=False)

latestPrice                 1.000000
median_zip                  0.730931
zip_rank                    0.691525
pr_sqft                     0.576027
livingAreaSqFt              0.496778
numOfBathrooms              0.412107
avgSchoolRating             0.384555
lotSizeSqFt                 0.296135
MedianStudentsPerTeacher    0.285408
numOfBedrooms               0.270606
numOfStories                0.208965
numOfPhotos                 0.160217
garageSpaces                0.129184
parkingSpaces               0.121176
numOfParkingFeatures        0.111064
numOfElementarySchools      0.105525
hasView                     0.102876
hasSpa                      0.102659
hasGarage                   0.093460
latitude                    0.085285
numOfMiddleSchools          0.084401
patioporch                  0.080889
latest_saleyear             0.078021
windowfeatures              0.075416
avgSchoolSize               0.068801
security                    0.054198
numOfAppliances             0.041193
...
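
A strongly negative coefficient is as informative as a strongly positive one, so it can help to rank features by magnitude rather than by signed value. The sketch below uses a small hand-made Series in place of `correlations`, with a few coefficients copied from the outputs above. The `threshold` of 0.2 is a hypothetical cutoff chosen only for illustration.

```python
import pandas as pd

# A few coefficients copied from the outputs above; stands in for `correlations`.
sample = pd.Series({
    "median_zip": 0.730931,
    "livingAreaSqFt": 0.496778,
    "zipcode": -0.238276,
    "longitude": -0.198968,
    "propertyTaxRate": -0.059908,
})

# Order by magnitude but keep the signed values for display.
ranked = sample.reindex(sample.abs().sort_values(ascending=False).index)

# Hypothetical cutoff, chosen only for illustration.
threshold = 0.2
strong = ranked[ranked.abs() >= threshold]

print(strong)
```

Applied to the full series, the same two steps would surface negatively correlated features such as 'zipcode' that a plain descending sort pushes to the bottom.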

For Pearson correlation coefficients, we'll be looking for the absolute value