# What's the Problem? - An Analysis of New York Housing Complaints

## Introduction

As part of IBM's professional certificate for data science, I completed this capstone project, in which I analysed housing and building complaints in New York City. I was given the metadata for each complaint made about a house, along with details about the buildings themselves.


### Context of the Project

>The people of New York use the 311 system to report complaints about the non-emergency problems to local authorities. Of those complaints, the Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints that are related to housing and buildings.

>In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, the large volume of complaints and the sudden increase is impacting the overall efficiency of operations of the agency.

>Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year. 


By analysing existing data from the New York Department of Housing Preservation and Development, I attempted to answer these questions posed by the department:
- What type of complaint should be focused on first?
- Should the department focus on any particular geographical area (borough, zip codes, or street) for the determined complaint type?
- Is there any obvious relationship between the characteristics of a house or building and the complaint type?
- Can a predictive model be created for future complaints of the determined complaint type?

## Summary of Results

### Prediction Model Results
For binary target (complaint made or not):

- Logistic regression: Jaccard score of 0.989, F1 score of 0.99

For predicting the number of complaints:

1. K-Nearest Neighbors: accuracy score of 0.97
2. Multiple linear regression: residual sum of squares of 26923.79, R^2 (coefficient of determination) of 0.08

### Complaint Type Result
Analysing the data showed that heating complaints were the most common, with 2,149,424 complaints; the next most common type had only 711,141. I therefore focused on heating complaints when answering the department's questions.

!["Complaint Types"](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/Complaint%20Types.png)
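
A ranking like this can be produced with a simple frequency count. The snippet below is a minimal sketch on toy data, not the code used in the project; the column name `complaint_type` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the 311 complaint data.
# The column name "complaint_type" is an assumption, not the real schema.
complaints = pd.DataFrame({
    "complaint_type": [
        "HEATING", "PLUMBING", "HEATING", "PAINT", "HEATING", "PLUMBING",
    ],
})

# Count complaints per type, most frequent first.
type_counts = complaints["complaint_type"].value_counts()

# The type with the largest count is the one to focus on.
top_type = type_counts.idxmax()

print(type_counts)
print("Most frequent complaint type:", top_type)
```

The same `value_counts()` call, applied to a borough, zip code, or street column, would give the geographical rankings in the next section.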

### Geographical Results
Regarding where these complaints occurred, the Bronx had the most complaints, followed by Brooklyn, Manhattan, Queens, and Staten Island.

!["Geographical"](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/Boroughs%20Graph.png)

The 5 zip codes with the highest complaint numbers were: 

1. 11226: 69041 complaints
2. 10467: 66073 complaints
3. 10458: 65372 complaints
4. 10468: 58190 complaints
5. 10453: 57818 complaints

and the 5 streets with the most complaints were:

1. GRAND CONCOURSE: 37863 complaints
2. BROADWAY: 24484 complaints
3. OCEAN AVENUE: 18716 complaints
4. MORRIS AVENUE: 16409 complaints
5. ARDEN STREET: 15963 complaints

Based on the number of heating complaints for each borough, I would suggest that the department focus on **the Bronx**.
Looking at the number of complaints per zip code, the zip codes with the most complaints are **11226**, located in Brooklyn, and 10467, located in the Bronx.
Lastly, going by the streets with the most complaints, the **Grand Concourse** has the highest number of complaints, with Broadway second.

### Analysis of Correlations Between Building Characteristics and Complaints
As the Bronx has the highest number of heating complaints of all the boroughs, I focussed on the houses and buildings located in the Bronx. The data about the building characteristics contained information such as lot numbers, building area, number of floors, and the year the buildings were built. From the information provided, I decided to use two approaches to find a relationship between characteristics and complaints:

1. Use a binary approach where the features will either end in a complaint being made (represented by a 1) or not (represented by a 0)
2. Use the total number of complaints for each street to analyse correlation

To find out if there is any correlation, I calculated the Pearson correlation and p-values for the features of the buildings and the target. Lastly, I created a correlation matrix to help visualise the Pearson correlations.
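
As a rough illustration of this step, the sketch below computes the Pearson correlation and p-value of each feature against a binary target with `scipy.stats.pearsonr`, then builds a correlation matrix with pandas. The data is synthetic and the column names (`year_built`, `num_floors`, `complaint`) are hypothetical; it shows the technique rather than the project's actual code.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the building data; all column names are hypothetical.
year_built = rng.integers(1900, 2020, size=n)
num_floors = rng.integers(1, 30, size=n)
# By construction, older buildings are more likely to have a complaint (1).
complaint = (year_built + rng.normal(0, 20, size=n) < 1960).astype(int)

buildings = pd.DataFrame({
    "year_built": year_built,
    "num_floors": num_floors,
    "complaint": complaint,
})

# Pearson correlation and p-value of each feature against the target.
rows = {}
for feature in ["year_built", "num_floors"]:
    r, p = pearsonr(buildings[feature], buildings["complaint"])
    rows[feature] = {"pearson": r, "p_value": p}
corr_table = pd.DataFrame(rows)
print(corr_table)

# Full pairwise correlation matrix, suitable for plotting as a heatmap.
corr_matrix = buildings.corr(method="pearson")
print(corr_matrix)
```

A negative coefficient for `year_built` here means that lower (older) build years go with more complaints, which is how the negative values in the tables below should be read.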

#### Approach 1:
From the correlation data frame and the correlation matrix, we see that the feature most strongly correlated with a heating complaint being made is the year the building was built (Pearson=-0.7553), followed by the number of floors (Pearson=-0.7313). The p-values, however, are all zero or extremely close to zero, so they do not help to distinguish between features; with a data set this large, even very weak correlations (such as the 0.0040 for office area) come out as statistically significant. From this, we may infer that the binary approach may not be the best method for answering our questions.

|              | Lot Number  | Lot Area    | Building Area | Residential Area | Office Area | Retail Area | Num of Buildings | Num of Floors | Lot Depth    | Building Depth | Year Built | Year 1st Altered | Floor Area Ratio | Max Residential FAR | Max Commercial FAR | Max Facility FAR |
|--------------|-------------|-------------|---------------|------------------|-------------|-------------|------------------|---------------|--------------|----------------|------------|------------------|------------------|---------------------|--------------------|------------------|
| Pearson Corr | 0.0135295   | 0.0113194   | -0.079132     | -0.0981778       | 0.00396574  | -0.0695622  | -0.253903        | -0.731387     | 0.0277289    | -0.602128      | -0.755379  | -0.218871        | -0.421666        | 0.0663305           | 0.0513572          | 0.0679223        |
| P-Value      | 3.18555e-49 | 5.62946e-35 | 0             | 0                | 1.54197e-05 | 0           | 0                | 0             | 9.60683e-201 | 0              | 0          | 0                | 0                | 0                   | 0                  | 0                |


![Correlation Matrix](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/CorrMatrixBinary.png "Correlation Matrix 1")

#### Approach 2:
From the Pearson correlations, we see that the year the building was built once again has the strongest correlation with the number of complaints made (Pearson=-0.277962), followed by the building depth and the number of floors. The p-values of this approach also vary more from feature to feature than in the binary approach, ranging from 0 to about 0.71, which suggests that this may be a better method of judging which features are related to the complaints made.

|              | Lot Number | Lot Area   | Building Area | Residential Area | Office Area | Retail Area | Num of Buildings | Num of Floors | Lot Depth  | Building Depth | Year Built | Year 1st Altered | Floor Area Ratio | Max Residential FAR | Max Commercial FAR | Max Facility FAR |
|--------------|------------|------------|---------------|------------------|-------------|-------------|------------------|---------------|------------|----------------|------------|------------------|------------------|---------------------|--------------------|------------------|
| Pearson Corr | 0.00608823 | 0.00124213 | -0.00878389   | -0.00795726      | 0.00177427  | -0.00562281 | -0.0429668       | -0.117689     | 0.0142087  | -0.1181        | -0.277962  | -0.0233512       | -0.0476101       | 0.0282175           | 0.0231703          | 0.0265687        |
| P-Value      | 0.0685153  | 0.710158   | 0.00858445    | 0.0172735        | 0.595516    | 0.0925013   | 7.40908e-38      | 1.68606e-273  | 2.12414e-5 | 2.07135e-275   | 0          | 2.79631e-12      | 4.2979e-46       | 3.05826e-17         | 4.10718e-12        | 1.85436e-15      |

![Correlation Matrix](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/CorrMatrixAmounts.png "Correlation Matrix 2")

From both approaches, we see that some building features may be useful in predicting whether a complaint will be made or how many complaints will be made. Using the binary approach, three features correlate strongly with a complaint being made: the year the building was built (Pearson=-0.7553), the number of floors (Pearson=-0.7313), and the depth of the building (Pearson=-0.6021). However, the vast majority of the p-values are zero, and the rest are extremely close to zero. This means either that all the features are statistically significant or that the binary approach is not suitable for the dataset we have.

With the second approach, no feature is highly correlated with the number of complaints made for a building. The strongest correlation is once again with the year the building was built (Pearson=-0.277962), followed by the depth of the building and the number of floors. However, the p-values of the second approach are more promising: there do not seem to be any abnormalities, and most of the values fall below the conventional 0.05 significance threshold.

Overall, in both approaches, the year the building was built has the strongest correlation with complaints being made. Although it is outside the scope of the project, we could speculate about possible reasons for this correlation: older buildings were built with less advanced techniques and technology than more recent ones, or older buildings that have not been maintained have had more time to deteriorate.

### Model Building

Once again, I used the same 2 approaches to build machine learning models:

1. Use a binary approach where the features will either end in a complaint being made (represented by a 1) or not (represented by a 0)
2. Consider the total number of complaints for each street based on the building's features

**Note: all the models were trained and tested using a 70-30 split, so that each model was evaluated on data it had not seen during training, which helps to reveal overfitting.**
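
For reference, a 70-30 split can be made with scikit-learn's `train_test_split`. This is a minimal sketch on placeholder arrays; the `random_state` value is an arbitrary choice for reproducibility, not a setting taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (100 rows, 2 features) and binary target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Hold out 30% of the rows for testing; the remaining 70% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4  # random_state is an arbitrary seed
)

print("Train rows:", len(X_train), "Test rows:", len(X_test))
```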

#### Approach 1:

For the binary approach, I only created 1 predictive model. This was a logistic regression model to predict whether a complaint would be made for a building based on its features.
To evaluate the model, I used the Jaccard Index which helps gauge the similarity of data sets, a confusion matrix, and the model's F1 score.
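
The workflow described above can be sketched as follows. The data is synthetic and the model uses scikit-learn's default settings, which are assumptions for illustration; the hyperparameters used in the project are not stated here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, jaccard_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and a binary target (1 = complaint made, 0 = not made).
X = rng.normal(size=(600, 2))
noise = rng.normal(scale=0.3, size=600)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Fit with default hyperparameters (an assumption, not the project's settings).
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# The three evaluation tools named above, all computed on the test set.
jaccard = jaccard_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])

print("Jaccard:", round(jaccard, 3))
print("F1:", round(f1, 3))
print(cm)
```

Passing `labels=[1, 0]` puts the "complaint made" class in the first row and column of the confusion matrix.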

##### Jaccard Index

The Jaccard Index is a statistic used to measure the similarity between sample sets: the size of their intersection divided by the size of their union. Here, the two sets are the predicted labels and the true labels, so the index is 1.0 if every predicted label matches the true label and 0.0 if none of them match. Our model has a Jaccard index of about 0.9899, which is very good.

##### Confusion Matrix

![Confusion Matrix](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/ConfusionMatrixBinary.png "Confusion Matrix 1")

Looking at the confusion matrix, the model appears to be very accurate at predicting both when a complaint will be made and when it will not. Of the 330,189 cases where a complaint was made, the model correctly predicted 329,094, which is about 99.67%. On the other hand, of the 26,232 cases where no complaint was made, the model wrongly predicted a complaint for 2,257. This means that the model correctly identified about 91.40% of the buildings that did not have a complaint. Although this true negative rate is lower than the true positive rate, it is still reasonably accurate.
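
The percentages above, and the Jaccard index for the "complaint made" class, can be re-derived from the four confusion-matrix counts. This is a small worked check using only the numbers quoted in this section; the variable names are my own.

```python
# Counts taken from the confusion matrix discussed above.
true_positives = 329_094                     # complaints correctly predicted
false_negatives = 330_189 - true_positives   # complaints the model missed
false_positives = 2_257                      # predicted a complaint, none made
true_negatives = 26_232 - false_positives    # no-complaint cases predicted correctly

# Share of actual complaints that were predicted correctly.
true_positive_rate = true_positives / (true_positives + false_negatives)

# Share of no-complaint cases that were predicted correctly.
true_negative_rate = true_negatives / (true_negatives + false_positives)

# Jaccard index for the positive class: intersection over union of the
# predicted-positive and actually-positive sets.
jaccard = true_positives / (true_positives + false_positives + false_negatives)

print(f"True positive rate: {true_positive_rate:.4f}")
print(f"True negative rate: {true_negative_rate:.4f}")
print(f"Jaccard index:      {jaccard:.4f}")
```

The Jaccard value obtained this way agrees with the reported score of roughly 0.9899.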

##### F1 score
![F1 Score](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/F-1%20Score%20Logistic%20Reg.png "F1 Score")

From Wikipedia: *In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It is calculated from the precision and recall of the test, where the precision is the number of correctly identified positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive.* Ideally, we want the F1 score to be as high as possible, with 1 being the maximum. In this case, our model achieved an F1 score of 0.99.
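
To make the definition concrete, the sketch below computes precision, recall, and F1 for the "complaint made" class from the confusion-matrix counts quoted earlier. Whether the reported 0.99 is this positive-class score or an average over both classes is not stated, so treat this as an illustrative check rather than a reproduction.

```python
# Counts for the "complaint made" class, from the confusion matrix section.
true_positives = 329_094
false_positives = 2_257
false_negatives = 330_189 - true_positives

# Precision: of all predicted complaints, how many were real.
precision = true_positives / (true_positives + false_positives)

# Recall: of all real complaints, how many were predicted.
recall = true_positives / (true_positives + false_negatives)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Rounded to two decimal places, this positive-class F1 is consistent with the reported 0.99.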


#### Approach 2:
For the second approach of the model building, I created 2 predictive models, one a multiple linear regression and the other a k-nearest neighbours model.

##### Multiple Linear Regression
Using a Python package called scikit-learn, I created and fit the multiple linear regression model to the data set of housing characteristics and heating complaints. From this, the model's coefficients of the features and intercept were:
- Coefficients:  [  0.14,   0.36,   1.81,  -0.72,   0.22,   0.46,  -0.34,  -2.7,   -1.22,  -1.1,
 -39.97,  -0.23,  -1.42,   8.8,   -1.18,  -1.85]
- Intercept: 11.721307551851885

To evaluate the regression model, I calculated its residual sum of squares and variance score (R^2) on the testing data. The residual sum of squares was 26923.79, which is considerably high and indicates a large amount of error in the multiple linear regression. The model's R^2, or coefficient of determination, is 0.08, which is low (1 is the highest) and indicates that the model explains little of the variation in the data. As the multiple linear regression underperforms on both measures, we can assume that it may not be effective as a predictive tool for future heating complaints.
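
The two measures can be computed as in the sketch below, which fits a multiple linear regression to synthetic data. The data and feature count are placeholders, and the residual sum of squares here is the plain sum of squared errors; I am not assuming that the figure reported above was computed in exactly the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: three features, with the target depending on the first two.
X = rng.normal(size=(1000, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Residual sum of squares: total squared prediction error on the test set.
rss = float(np.sum((y_test - y_pred) ** 2))

# Total sum of squares: squared deviation of the target from its mean.
tss = float(np.sum((y_test - y_test.mean()) ** 2))

# R^2 is the share of the target's variation that the model explains.
r2 = r2_score(y_test, y_pred)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("RSS:", round(rss, 2), "R^2:", round(r2, 3))
```

Because the residual sum of squares grows with the number of test rows, R^2 (which equals `1 - rss / tss`) is the easier of the two to compare across data sets.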

##### K-Nearest Neighbors
For the K-Nearest Neighbors model, I first experimented with different values of K (from 1 to 9) to find a value with high accuracy. The graph below plots the values of K against the model accuracy:

![KNN](https://raw.githubusercontent.com/Zenoix/IBM-DS-Capstone-Project/master/Images/KNN.png "K Values")

After building, fitting, and testing the KNN Model with a K value of nine, the accuracy on the testing data was 0.9696913281453625 which indicates that this model may be a better fit than the multiple linear regression.
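
A search over K like the one described can be sketched as below. The two-cluster data set is synthetic and stands in for the real features and complaint counts, so the accuracies it prints say nothing about the project's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic two-class data: two clusters centred at (0, 0) and (3, 3).
X = np.vstack([
    rng.normal(0, 1, size=(150, 2)),
    rng.normal(3, 1, size=(150, 2)),
])
y = np.array([0] * 150 + [1] * 150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Fit one model per K from 1 to 9 and record its test accuracy.
accuracies = {}
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))

# Pick the K with the highest accuracy (the first one wins on ties).
best_k = max(accuracies, key=accuracies.get)

print(accuracies)
print("Best K:", best_k)
```

One caveat of this design: choosing K by test accuracy means the test set is no longer fully unseen, so a separate validation split or cross-validation would give a more cautious estimate.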

#### Model Building Conclusion

For the first approach of a binary target (1 or 0), I only developed one model which was the logistic regression. This model had a Jaccard score of 0.989 and an F1 score of 0.99. These are both very high scores which indicate that the model may be useful as a predictive model to predict if a complaint will be made for any building.

For the second approach of considering the total number of complaints per building, I developed two models: a multiple linear regression and a k-nearest neighbours model. The multiple linear regression performed very disappointingly. It had a residual sum of squares of 26923.79 (the lower the better) and an R^2 of 0.08 (the best score is 1). This indicates that it will most likely not be a good model for predicting complaints. On the other hand, the KNN model did much better in its evaluation. Using a K value of 9, the model had an accuracy score of roughly 0.97. This means that the KNN model would probably be the better model for predicting the number of complaints a building will receive.

Depending on the method the Department of Housing Preservation and Development wants to employ, it is possible to make a predictive model from the data they have. If they would like to use the binary approach, a logistic regression model will be a good fit for future predictions. On the other hand, if they would like to predict the total number of complaints for a building, then I would recommend the KNN model, as I believe that it is a better fit than the multiple linear regression model.