In [None]:
df_complaints.shape

In [None]:
df_complaints.head(2)

In [None]:
df_complaints.columns

##### New York City Census Data

In [None]:
df_pop.shape

In [None]:
df_pop.head(2)

In [None]:
df_pop.columns

##### New York City Infrastructure Data

In [None]:
df_infr.shape

In [None]:
df_infr.head(2)

In [None]:
df_infr.columns

##### New York City Community Districts Shapefile

Finally, we count the number of complaints by community district and remove any unwanted columns from our dataset.

In [None]:
df_complaints = df_complaints.groupby("Community Board").count()

In [None]:
df_complaints.reset_index(inplace = True)
df_complaints = df_complaints.iloc[:,:2]
df_complaints.rename(columns = {"Unique Key": "Number of Complaints"}, inplace = True)

We devise a function that converts community district labels such as `01 BROOKLYN` to the more widely used format `BK01`.

In [None]:
def convertCD(x):
    """
    Convert community district codes to appropriate format.
    """
    borough_codes = {"BRONX": "BX", "MANHATTAN": "MN", "BROOKLYN": "BK", "STATENISLAND": "SI", "QUEENS": "QN"}
    borough = re.sub(r"\d| ", "", x)  # strip digits and spaces, e.g. "01 BROOKLYN" -> "BROOKLYN"
    cd = borough_codes[borough] + x[:2]
    return cd

In [None]:
df_complaints["Community Board"] = np.vectorize(convertCD)(df_complaints["Community Board"])
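As a quick check of the conversion described above, here is a minimal, self-contained sketch. The helper name `convert_cd_sketch` and the sample labels are illustrative assumptions and are not part of the notebook's data.

```python
import re

# Borough names with digits and spaces removed, mapped to two-letter codes.
BOROUGH_CODES = {"BRONX": "BX", "MANHATTAN": "MN", "BROOKLYN": "BK",
                 "STATENISLAND": "SI", "QUEENS": "QN"}

def convert_cd_sketch(label):
    """Turn a label such as '01 BROOKLYN' into 'BK01' (illustrative helper)."""
    borough = re.sub(r"\d| ", "", label)      # '01 BROOKLYN' -> 'BROOKLYN'
    return BOROUGH_CODES[borough] + label[:2]  # prefix code + two-digit district

# Made-up sample labels in the 'NN BOROUGH' layout
examples = ["01 BROOKLYN", "12 MANHATTAN", "03 STATEN ISLAND"]
converted = [convert_cd_sketch(e) for e in examples]
print(converted)
```

Note that the function relies on the district number occupying the first two characters of the label, so it only works on labels that have already been filtered to the plausible values.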

We begin our data munging by checking the community district column of the census data. Everything seems to be correct except for the `MN1111` row, which ought to be just `MN11`.

In [None]:
# df_pop["cd_id"]
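Rather than scanning the column by eye, malformed codes can also be found with a regular expression. The sketch below runs on a small made-up series standing in for `df_pop["cd_id"]`; the pattern assumes the `BK01` convention used throughout this notebook.

```python
import pandas as pd

# Toy stand-in for df_pop["cd_id"]; the values are made up for illustration.
cd_ids = pd.Series(["BX01", "MN10", "MN1111", "QN14"])

# A valid code is a two-letter borough prefix followed by exactly two digits.
is_valid = cd_ids.str.match(r"^(?:BX|MN|BK|QN|SI)\d{2}$")
malformed = cd_ids[~is_valid]
print(malformed.tolist())
```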

In [None]:
df_pop.loc[15, "cd_id"] = 'MN11'

Since we are only interested in `Population Density (per sq. mile)`, we drop the rest of the columns. 

In [None]:
df_pop = df_pop[["cd_id", "Population Density (per sq. mile)"]]

The infrastructure data is a bit different since it has fewer than 59 community districts. Upon closer inspection, we notice that several districts are combined. The long string values in its community district column require that we use regular expressions.

In [None]:
df_infr["Qualifying Name"].head()

In [None]:
len(df_infr["Qualifying Name"].unique())

We devise the following function to first convert the long string format into our accepted standard. Then we separate the aggregated community districts.

In [None]:
def extractCD(x):
    """
    Extracts community district from `Qualifying Name`.
    """
    pattern = "--|-"
    extract = re.sub(pattern, " ", x).upper().split()
    
    if len(extract[4]) == 1:
        cd = " ".join(["0"+extract[4], extract[1]])
    elif extract[3] == "COMMUNITY":
        if len(extract[5]) == 1:
            cd = " ".join(["0"+extract[5], extract[1], extract[2]])
        else:
            cd = " ".join([extract[5], extract[1], extract[2]])
    else:
        cd = " ".join([extract[4], extract[1]])
    
    if extract[5] == "&":
        cd1 = " ".join(["0"+extract[4], extract[1]])
        cd2 = " ".join(["0"+extract[6], extract[1]])
        return(convertCD(cd1) + convertCD(cd2))
    else:
        return convertCD(cd)
extractCD = np.vectorize(extractCD)

In [None]:
df_infr["cd_id"] = extractCD(df_infr["Qualifying Name"])

In [None]:
tmp = np.where(df_infr["cd_id"].duplicated())[0].tolist()
indx = tmp.copy()
for i in range(len(indx)):
    indx.append(indx[i]-1)

In [None]:
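for i in indx:
    # Each combined code holds two districts, e.g. 'BX01BX02':
    # rows at even positions keep the first half, odd positions the second.
    if i%2 == 0:
        df_infr.loc[i,"cd_id"] = df_infr["cd_id"][i][:4]
    else:
        df_infr.loc[i,"cd_id"] = df_infr["cd_id"][i][4:]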
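The split above depends on whether a row sits at an even or odd position. A sketch of an alternative that does not depend on row position is shown below. It assumes, as the duplicate check suggests, that each combined district appears as two rows sharing the same concatenated code; the toy frame is made up.

```python
import pandas as pd

# Toy stand-in for df_infr after extractCD: a combined district shows up
# as two rows that share the same 8-character code (assumption).
toy = pd.DataFrame({"cd_id": ["BX01BX02", "BX01BX02", "BK05"]})

# 0 for the first row carrying a given code, 1 for the second, and so on.
occurrence = toy.groupby("cd_id").cumcount()

# Take the k-th 4-character chunk for the k-th occurrence of a code.
toy["cd_id"] = [cd[4 * k: 4 * k + 4] for cd, k in zip(toy["cd_id"], occurrence)]
print(toy["cd_id"].tolist())
```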

Finally, we are interested in the number of mobile subscribers per district. This can be found by summing up all the different types of household connections. We will also create a separate dataframe that aggregates internet type by high and low connections.
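The difference between the two aggregations can be sketched on a toy frame: summing across columns (`axis=1`) gives one total per district, while summing down the rows (`axis=0`) gives one total per connection type. The column names `Mobile A` and `Mobile B` are hypothetical placeholders, not columns of the ACS file.

```python
import pandas as pd

# Made-up household counts for two districts and two connection types.
toy = pd.DataFrame({
    "cd_id": ["BK01", "BK02"],
    "Mobile A": [100, 250],
    "Mobile B": [40, 60],
})
mobile_cols = ["Mobile A", "Mobile B"]

# One total per district (row-wise sum).
toy["Mobile Subscription"] = toy[mobile_cols].sum(axis=1)

# One total per connection type (column-wise sum).
per_type_totals = toy[mobile_cols].sum(axis=0)

print(toy[["cd_id", "Mobile Subscription"]])
print(per_type_totals)
```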

---

### Research Question

##### Which Community Districts in NYC show the highest number of complaints and how do factors such as population density per square mile and mobile subscription come into play? 

### Read Data

In this analysis, we will make use of the following four datasets: [311 Service Requests from 2010 to Present](https://datahub.cusp.nyu.edu/dataset/erm2-nwe9), New York City Census Data, New York City Infrastructure Data, and finally the New York City Community Districts shapefile.

In [None]:
df_complaints = pd.read_csv("https://data.cityofnewyork.us/resource/erm2-nwe9.csv")

In [None]:
df_pop = pd.read_csv("http://cosmo.nyu.edu/~fb55/PUI2016/data/Final_Demographics.csv")

In [None]:
df_infr = pd.read_csv("http://cosmo.nyu.edu/~fb55/PUI2016/data/ACS_Computer_Use_and_Internet_2014_1Year_Estimate.csv")

In [None]:
df_shp = gpd.GeoDataFrame.from_file("Community Districts/geo_export_dadb2ff5-7f11-4358-abbe-c37fb52b840b.shp")

Let's take a look at each of the datasets.

##### 311 Service Requests from 2010 to Present

In [None]:
# Zero-padded district labels per borough, e.g. '01 BROOKLYN'
SI = ['{:02d} STATEN ISLAND'.format(i) for i in range(1,4)]
BK = ['{:02d} BROOKLYN'.format(i) for i in range(1,19)]
MN = ['{:02d} MANHATTAN'.format(i) for i in range(1,13)]
BX = ['{:02d} BRONX'.format(i) for i in range(1,13)]
QN = ['{:02d} QUEENS'.format(i) for i in range(1,15)]

In [None]:
plausible = SI + MN + BK + BX + QN

In [None]:
# Positions of the rows whose community board is one of the plausible values
indx = [i for i in range(len(df_complaints)) if df_complaints.loc[i, "Community Board"] in plausible]

In [None]:
df_complaints = df_complaints.iloc[indx, :]

We are also interested in checking whether there are any missing values in the dataset.

In [None]:
df_complaints["Resolution Description"].isnull().sum()

There are 110 rows with missing `Resolution Description`s. Let's see if we can fill in these missing descriptions with other rows with the same `Complaint Type`. Only a few of the missing values can be filled.

In [None]:
ct = df_complaints[df_complaints["Resolution Description"].isnull()]["Complaint Type"].unique()
for i in range(len(ct)):
    print(ct[i], df_complaints[df_complaints["Complaint Type"] == ct[i]]["Resolution Description"].unique())
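One way to attempt the fill described above is to borrow, for each complaint type, the first non-missing description seen for that type. This is only a sketch on made-up rows, and it assumes that any description of the same complaint type is an acceptable substitute.

```python
import pandas as pd

# Made-up rows; only the two column names match the 311 data.
toy = pd.DataFrame({
    "Complaint Type": ["Noise", "Noise", "Heating", "Rodent"],
    "Resolution Description": ["Inspected", None, None, "Baited"],
})

# First non-null description for each complaint type, aligned to every row.
first_known = toy.groupby("Complaint Type")["Resolution Description"].transform("first")

# Fill gaps where the complaint type has at least one known description.
toy["Resolution Description"] = toy["Resolution Description"].fillna(first_known)

remaining_missing = int(toy["Resolution Description"].isnull().sum())
print(toy)
print(remaining_missing)
```

Complaint types that never have a description, like `Heating` in this toy frame, stay missing, which matches the observation that only some of the gaps can be filled.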

In [None]:
internet = df_infr[['Households: Dial-Up Alone', 
                    'Households: Dsl', 
                    'Households: Cable Modem',
                    'Households: Fiber-Optic', 'Households: Satellite Internet Service', 
                    'Households: Two or More Fixed Broadband Types, or Other', 
                    'Households: Mobile Broadband Alone or With Dialup']].sum(axis=0)
internet

In [None]:
df_infr['Mobile Subscription'] = df_infr[['Households: With Mobile Broadband',
                                          'Households: With Mobile Broadband.1',
                                          'Households: With Mobile Broadband.2', 
                                          'Households: With Mobile Broadband.3',
                                          'Households: With Mobile Broadband.4', 
                                          'Households: Mobile Broadband Alone or With Dialup']].sum(axis=1)

In [None]:
df_infr = df_infr[["cd_id", "Mobile Subscription"]]

The New York City Community Districts Shapefile is a GeoPandas dataframe. It has 71 community districts coded in community district numbers. For example, community district number 301 stands for `'BK01'`. We first get rid of the extra community districts, then standardize the community district codes.
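The numeric coding can be illustrated with integer division: the hundreds digit identifies the borough and the remainder is the district number. This is a sketch of the idea with a hypothetical helper `boro_cd_to_code`, not the rounding-based functions used in this notebook; the borough numbering follows the convention 1-Manhattan, 2-Bronx, 3-Brooklyn, 4-Queens, 5-Staten Island.

```python
# Borough digit -> two-letter prefix
BORO_PREFIX = {1: "MN", 2: "BX", 3: "BK", 4: "QN", 5: "SI"}

def boro_cd_to_code(boro_cd):
    """Convert a numeric district code such as 301 to 'BK01' (illustrative helper)."""
    boro, district = divmod(int(boro_cd), 100)  # 301 -> (3, 1)
    return "{}{:02d}".format(BORO_PREFIX[boro], district)

# Sample values; floats are included because shapefile columns are often floats.
codes = [boro_cd_to_code(v) for v in (301, 112.0, 503)]
print(codes)
```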

We begin by checking if all boroughs and community districts are represented in the data.

In [None]:
df_complaints["Borough"].unique()

In [None]:
print(sorted(list(df_complaints["Community Board"].unique())))

In [None]:
len(df_complaints["Community Board"].unique())

The extra community districts represent rows that have either an unspecified borough or an unspecified community district. There are also community district values that don't make any sense at all. We create a list of plausible community district values to remove the unlabeled data.
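A compact way to express this filter is to build the plausible labels from the number of districts per borough and keep matching rows with `isin`. The sketch below uses a made-up column of labels; the district counts (3, 18, 12, 12, 14) add up to the 59 community districts.

```python
import pandas as pd

# Number of community districts in each borough (59 in total).
district_counts = {"STATEN ISLAND": 3, "BROOKLYN": 18, "MANHATTAN": 12,
                   "BRONX": 12, "QUEENS": 14}

# Labels in the 'NN BOROUGH' layout, e.g. '01 BROOKLYN'.
plausible = ["{:02d} {}".format(n, boro)
             for boro, count in district_counts.items()
             for n in range(1, count + 1)]

# Made-up labels, including some implausible ones.
toy = pd.DataFrame({"Community Board": ["01 BROOKLYN", "0 Unspecified",
                                        "Unspecified QUEENS", "83 QUEENS",
                                        "12 BRONX"]})

kept = toy[toy["Community Board"].isin(plausible)]
print(len(plausible))
print(kept["Community Board"].tolist())
```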

<img src="Images/311.png">

### Import Packages

In [None]:
import os
import re
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
%load_ext watermark

In [None]:
%watermark -a 'Kevin Han' -u -d -v -p numpy,pandas,geopandas,matplotlib

In [None]:
df_shp.shape

In [None]:
df_shp.head(2)

In [None]:
df_shp.columns

### Data Wrangling

Each of these datasets has a common column, the community district identifier, that can be used to join the datasets together. However, New York City has only 59 community districts, yet both the 311 Service Requests from 2010 to Present data and the New York City Community Districts shapefile have more than that.

<img src="Images/community_districts.png">

Therefore we need to clean our data to achieve the following end results:
- Each dataset should have the same `Community District` column
- Extract `Population Density per Square Mile` from `df_pop`
- Extract `Mobile Subscription` from `df_infr`
- Extract `Number of Complaints` from `df_complaints`

In [None]:
df_shp["boro_cd"].head(2)

In [None]:
len(df_shp["boro_cd"].unique())

In [None]:
def cleanBoroCD(x):
    if (x - np.around(x, decimals = -2)) > 20:
        return 0
    elif (x - np.around(x, decimals = -2)) < 0:
        return 0
    else:
        return x
cleanBoroCD = np.vectorize(cleanBoroCD)

In [None]:
df_shp["boro_cd"] = cleanBoroCD(df_shp["boro_cd"])
df_shp = df_shp[df_shp["boro_cd"] > 0]

We devise the following function to standardize the community district codes. The following convention is used:
- 1-Manhattan
- 2-Bronx
- 3-Brooklyn
- 4-Queens
- 5-Staten Island

In [None]:
def convertBoroCD(x):
    """
    Convert community district numbers to the following format 'BK01'.
    """
    num2cd= {100: "MN", 200: "BX", 300: "BK", 400: "QN", 500: "SI"}
    num = np.around(x, decimals = -2)
    cd_code = str(int(x - num))
    if len(cd_code) == 1:
        cd = num2cd[num] + "0" + cd_code
    else:
        cd = num2cd[num] + cd_code
    return cd
convertBoroCD = np.vectorize(convertBoroCD)

In [None]:
df_shp["boro_cd"] = convertBoroCD(df_shp["boro_cd"])

### Final Data

We combine all datasets except for the shapefile dataset into one single dataframe for our analysis. Using `pd.merge`, we join these datasets by their community district column. Note that we don't yet merge the shapefiles because we want to make a plot of ordered data first.

In [None]:
tmp = pd.merge(df_complaints, df_pop, left_on="Community Board", right_on="cd_id")
final_data = pd.merge(tmp, df_infr, on="cd_id")
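Because `pd.merge` performs an inner join by default, any district whose code does not match exactly is silently dropped. A sketch of how such mismatches could be audited with an outer join and `indicator=True` is shown below, using made-up frames in which one code is deliberately malformed.

```python
import pandas as pd

# Made-up frames; 'MN1111' is deliberately malformed so it fails to match.
complaints = pd.DataFrame({"Community Board": ["BK01", "MN11", "QN03"],
                           "Number of Complaints": [120, 95, 60]})
pop = pd.DataFrame({"cd_id": ["BK01", "MN1111", "QN03"],
                    "density": [35000, 60000, 28000]})

# Default inner join: only codes present in both frames survive.
inner = pd.merge(complaints, pop, left_on="Community Board", right_on="cd_id")

# Outer join with an indicator column to list the rows that did not match.
audit = pd.merge(complaints, pop, left_on="Community Board", right_on="cd_id",
                 how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner))
print(unmatched[["Community Board", "cd_id", "_merge"]])
```

If the cleaning steps worked, the merged frame should have 59 rows and the audit should come back empty.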

In [None]:
# Rank districts by number of complaints (1 = most complaints)
final_data["Rank"] = final_data["Number of Complaints"].rank(ascending=False, method="first").astype(int)
final_data.sort_values(by = "Rank", inplace = True)

In [None]:
final_data.shape

In [None]:
final_data.head(2)

### Data Visualization

In [None]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(8,6))
plt.plot(final_data["Rank"], final_data["Mobile Subscription"], lw=2.5, label="Mobile Subscription")
plt.plot(final_data["Rank"], final_data["Population Density (per sq. mile)"], lw=2.5, label="Population Density (per sq. mile)")
plt.title("Population Density (per sq. mile) and Mobile Subscription vs. Rank", fontsize=12)
plt.xlabel("Rank", fontsize=10); plt.ylabel("Population", fontsize=10);
plt.xticks(fontsize=8); plt.yticks(fontsize=8)
plt.legend(prop={'size':10})
plt.show()

<center>*Figure 1: Surprisingly, community districts with high population density are not necessarily those with a high number of complaints. Notably, districts with more mobile subscriptions show a relatively higher number of complaints than districts with fewer mobile subscriptions.*</center>

We merge the **unordered** shapefile data with our final dataset to make maps. The merge is required for the maps to render correctly.

In [None]:
final_data = pd.merge(df_shp, final_data, left_on = "boro_cd", right_on = "Community Board")
final_data.drop(["Community Board", "cd_id"], axis=1, inplace=True)

In [None]:
plt.style.use('classic')
plt.style.use('seaborn-white')
final_data.plot("Number of Complaints", cmap="Reds", scheme="quantiles", legend=True)
plt.title("Number of Complaints by NYC Community Districts")
plt.xlabel("Longitude"); plt.ylabel("Latitude")
plt.show()

<center>*Figure 2: The upper parts of Manhattan and lower parts of Brooklyn seem to have the highest number of complaints.*</center>

In [None]:
final_data.plot("Mobile Subscription", cmap="Blues", scheme="quantiles", legend=True)
plt.title("Mobile Subscription by NYC Community Districts")
plt.xlabel("Longitude"); plt.ylabel("Latitude")
plt.show()

<center>*Figure 3: Interestingly enough, the districts with the highest number of complaints are also those with the highest number of mobile subscriptions.*</center>