# What's the Problem? - An Analysis of New York Housing Complaints

## Introduction

As part of IBM's professional certificate in data science, I completed this capstone project, which analyses housing and building complaints in New York City. The data provided included metadata for each complaint and details about the buildings, such as their location and the year they were built.



### Context of the Project

>The people of New York use the 311 system to report complaints about the non-emergency problems to local authorities. Of those complaints, the Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints that are related to housing and buildings.

>In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, the large volume of complaints and the sudden increase is impacting the overall efficiency of operations of the agency.

>Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year. 


By analysing existing data from the New York City Department of Housing Preservation and Development, I attempted to answer these questions posed by the department:
- What type of complaint should be focused on first?
- Should the department focus on any particular geographical area (borough, zip codes, or street) for the determined complaint type?
- Is there any obvious relationship between the characteristics of a house or building and the complaint type?
- Can a predictive model be created for possible future complaints of the determined complaint type?

## Summary of Results

#### Prediction Model Results
For binary target (complaint made or not):

- Logistic regression: Jaccard score of 0.989, F1 score of 0.99

For predicting the number of complaints:

1. K-Nearest Neighbors: accuracy score of 0.97
2. Multiple linear regression: Residual sum of squares of 25923.79, variance score of 0.08
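
For readers unfamiliar with the two scores quoted for the binary target, the following is a minimal sketch of how a Jaccard score and an F1 score can be computed for the positive class (label 1). The labels are made-up toy values, and the function name `binary_scores` is hypothetical; the project itself may have used a library implementation with different settings.

```python
def binary_scores(y_true, y_pred):
    """Return (jaccard, f1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Jaccard: overlap of predicted and actual positives over their union.
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    # F1: harmonic mean of precision and recall, written in terms of counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return jaccard, f1


# Toy labels (assumed for illustration only): 1 = complaint made, 0 = no complaint.
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]

jaccard, f1 = binary_scores(y_true, y_pred)
print(f"Jaccard: {jaccard:.3f}, F1: {f1:.3f}")
```

Both scores ignore true negatives, so they focus on how well the positive class (a complaint being made) is identified.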

#### Complaint Type Result
The data showed that heating complaints were by far the most common, with 2,149,424 complaints compared with 711,141 for the next most common type. Heating complaints were therefore chosen as the focus for answering the department's questions.

#### Geographical Results
The Bronx had the most heating complaints, followed by Brooklyn, Manhattan, Queens, and Staten Island.
The 5 zip codes with the most complaints were:
1. 11226: 69041 complaints
2. 10467: 66073 complaints
3. 10458: 65372 complaints
4. 10468: 58190 complaints
5. 10453: 57818 complaints

The 5 streets with the most complaints were:
1. GRAND CONCOURSE: 37863 complaints
2. BROADWAY: 24484 complaints
3. OCEAN AVENUE: 18716 complaints
4. MORRIS AVENUE: 16409 complaints
5. ARDEN STREET: 15963 complaints

Based on the number of heating complaints in each borough, I suggest that the department focus on the **Bronx**.
Looking at complaints per zip code, the two highest are **11226** in Brooklyn and 10467 in the Bronx.
Lastly, of the streets, the **Grand Concourse** has the most complaints, with Broadway second.
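
The rankings above come down to filtering the complaints to one type and counting them per borough, zip code, or street. A minimal sketch of that idea is shown here using only the standard library. The field names, the helper `top_locations`, and the toy records are all assumptions for illustration and do not reflect the actual dataset schema.

```python
from collections import Counter

# Toy records standing in for the 311 complaint data (hypothetical field names and values).
complaints = [
    {"complaint_type": "HEATING", "borough": "BRONX", "zip": "10467", "street": "GRAND CONCOURSE"},
    {"complaint_type": "HEATING", "borough": "BRONX", "zip": "10458", "street": "GRAND CONCOURSE"},
    {"complaint_type": "HEATING", "borough": "BROOKLYN", "zip": "11226", "street": "OCEAN AVENUE"},
    {"complaint_type": "PLUMBING", "borough": "BROOKLYN", "zip": "11226", "street": "OCEAN AVENUE"},
    {"complaint_type": "HEATING", "borough": "MANHATTAN", "zip": "10040", "street": "ARDEN STREET"},
]


def top_locations(records, complaint_type, field, n=5):
    """Rank the values of `field` by the number of complaints of one type."""
    counts = Counter(
        r[field] for r in records if r["complaint_type"] == complaint_type
    )
    return counts.most_common(n)


by_borough = top_locations(complaints, "HEATING", "borough")
by_zip = top_locations(complaints, "HEATING", "zip")
by_street = top_locations(complaints, "HEATING", "street")

print(by_borough)
print(by_street)
```

The same function handles all three geographical levels, since only the grouping field changes.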

#### Analysis of Correlations Between Building Characteristics and Complaints
As the Bronx has the highest number of heating complaints of all the boroughs, I focussed on the houses and buildings located in the Bronx. The building data contained characteristics such as the lot number, building area, number of floors, and the year the building was built. I used two approaches to look for a relationship between these characteristics and complaints:

1. A binary approach, where the target records whether a heating complaint was made (represented by a 1) or not (represented by a 0)
2. A count approach, where the target is the total number of complaints for each street
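
As a rough illustration of the two targets, the sketch below builds both from toy data. It assumes buildings are matched to complaints by address and street name; the field names and records are hypothetical, and the actual project may have joined the datasets differently.

```python
from collections import Counter

# Hypothetical toy data: one record per building, one record per heating complaint.
buildings = [
    {"address": "100 GRAND CONCOURSE", "street": "GRAND CONCOURSE", "year_built": 1925},
    {"address": "250 GRAND CONCOURSE", "street": "GRAND CONCOURSE", "year_built": 1931},
    {"address": "12 MORRIS AVENUE", "street": "MORRIS AVENUE", "year_built": 1998},
]
heating_complaints = [
    {"address": "100 GRAND CONCOURSE", "street": "GRAND CONCOURSE"},
    {"address": "100 GRAND CONCOURSE", "street": "GRAND CONCOURSE"},
    {"address": "250 GRAND CONCOURSE", "street": "GRAND CONCOURSE"},
]

# Approach 1: binary target, 1 if the building has at least one complaint, else 0.
addresses_with_complaint = {c["address"] for c in heating_complaints}

# Approach 2: count target, total complaints on the building's street.
complaints_per_street = Counter(c["street"] for c in heating_complaints)

for b in buildings:
    b["has_complaint"] = 1 if b["address"] in addresses_with_complaint else 0
    b["street_complaints"] = complaints_per_street[b["street"]]  # 0 if street is absent

binary_target = [b["has_complaint"] for b in buildings]
count_target = [b["street_complaints"] for b in buildings]
print(binary_target, count_target)
```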

To find out whether there is any correlation, I calculated the Pearson correlation and p-value between each building feature and the target. I then created a correlation matrix to help visualise the Pearson correlations.
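
For reference, the Pearson correlation between one feature and the target can be computed as sketched below. The sketch also computes the t-statistic that underlies the usual p-value for a Pearson correlation, which helps show why very large samples can give p-values near zero even for weak correlations. The data and sample sizes are toy values assumed for illustration; turning the t-statistic into an actual p-value needs a t-distribution, which a statistics library such as SciPy (`scipy.stats.pearsonr`) provides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t-statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy data (assumed): older buildings receive more complaints.
year_built = [1920, 1931, 1945, 1960, 1975, 1990, 2005]
num_complaints = [14, 11, 12, 7, 5, 4, 1]

r = pearson_r(year_built, num_complaints)
t = t_statistic(r, len(year_built))
print(f"r = {r:.3f}, t = {t:.3f}")

# The same weak correlation gives a much larger t-statistic as the sample grows.
t_small = t_statistic(0.01, 100)
t_large = t_statistic(0.01, 1_000_000)
print(f"t (n=100) = {t_small:.3f}, t (n=1,000,000) = {t_large:.3f}")
```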

##### Approach 1:
From the correlation dataframe and the correlation matrix, we see that the feature most strongly correlated with a heating complaint being made is the year the building was built (Pearson = -0.7553), followed by the number of floors (Pearson = -0.7313). I consider the p-values inconclusive, as they are all zero or extremely close to zero. This suggests that the binary approach may not be the best method for answering our questions.

|              | Lot Number  | Lot Area    | Building Area | Residential Area | Office Area | Retail Area | Num of Buildings | Num of Floors | Lot Depth    | Building Depth | Year Built | Year 1st Altered | Floor Area Ratio | Max Residential FAR | Max Commercial FAR | Max Facility FAR |
|--------------|-------------|-------------|---------------|------------------|-------------|-------------|------------------|---------------|--------------|----------------|------------|------------------|------------------|---------------------|--------------------|------------------|
| Pearson Corr | 0.0135295   | 0.0113194   | -0.079132     | -0.0981778       | 0.00396574  | -0.0695622  | -0.253903        | -0.731387     | 0.0277289    | -0.602128      | -0.755379  | -0.218871        | -0.421666        | 0.0663305           | 0.0513572          | 0.0679223        |
| P-Value      | 3.18555e-49 | 5.62946e-35 | 0             | 0                | 1.54197e-05 | 0           | 0                | 0             | 9.60683e-201 | 0              | 0          | 0                | 0                | 0                   | 0                  | 0                |


![Correlation matrix for the binary target](Images/CorrMatrixBinary.png)

##### Approach 2:
From the Pearson correlations, we see that the year the building was built once again has the strongest correlation with the number of complaints made (Pearson = -0.277962), followed by the building depth and the number of floors. The p-values, which now vary across features rather than all being at or near zero, also suggest that this may be a better method of finding the correlation between features and complaints.

|              | Lot Number | Lot Area   | Building Area | Residential Area | Office Area | Retail Area | Num of Buildings | Num of Floors | Lot Depth  | Building Depth | Year Built | Year 1st Altered | Floor Area Ratio | Max Residential FAR | Max Commercial FAR | Max Facility FAR |
|--------------|------------|------------|---------------|------------------|-------------|-------------|------------------|---------------|------------|----------------|------------|------------------|------------------|---------------------|--------------------|------------------|
| Pearson Corr | 0.00608823 | 0.00124213 | -0.00878389   | -0.00795726      | 0.00177427  | -0.00562281 | -0.0429668       | -0.117689     | 0.0142087  | -0.1181        | -0.277962  | -0.0233512       | -0.0476101       | 0.0282175           | 0.0231703          | 0.0265687        |
| P-Value      | 0.0685153  | 0.710158   | 0.00858445    | 0.0172735        | 0.595516    | 0.0925013   | 7.40908e-38      | 1.68606e-273  | 2.12414e-5 | 2.07135e-275   | 0          | 2.79631e-12      | 4.2979e-46       | 3.05826e-17         | 4.10718e-12        | 1.85436e-15      |

![Correlation matrix for the number of complaints](Images/CorrMatrixAmounts.png)
