# Capstone project notebook

## Problem 3

### What Is the Relationship between Housing Characteristics and Complaints?
The goal of this exercise is to answer Question 3 of the problem statement:

### Does the Complaint Type that you identified in response to Question 1 have an obvious relationship with any particular characteristic or characteristics of the houses?

In this exercise, use the 311 dataset.

You also need to read back the PLUTO dataset from Cloud Object Store that you saved previously in the course. Use the PLUTO dataset for the borough that you identified as the focus in the last exercise. Ensure that you use only a limited number of fields from the dataset so that you do not consume too much memory during your analysis.

The recommended fields are Address, BldgArea, BldgDepth, BuiltFAR, CommFAR, FacilFAR, Lot, LotArea, LotDepth, NumBldgs, NumFloors, OfficeArea, ResArea, ResidFAR, RetailArea, YearBuilt, YearAlter1, ZipCode, YCoord, and XCoord.

In [None]:
# The code was removed by Watson Studio for sharing.



### Read Bronx file

To avoid loading unnecessary data, we select only the columns we need.

In [None]:
body = client_cba83a820ee941cd921cc2bbfefd15eb.get_object(Bucket='edx1-donotdelete-pr-ffppmpbmudcobi',Key='bronxs.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
cols_to_read = [  'Address',
                  'BldgArea',
                  'BldgDepth',
                  'BuiltFAR',
                  'CommFAR',
                  'FacilFAR',
                  'Lot', 
                  'LotArea',
                  'LotDepth',
                  'NumBldgs',
                  'NumFloors',
                  'OfficeArea',
                  'ResArea',
                  'ResidFAR',
                  'RetailArea',
                  'YearBuilt',
                  'YearAlter1',
                  'ZipCode',
                  'YCoord',
                  'XCoord']
df_bronx_info = pd.read_csv(body, usecols=cols_to_read)

In [None]:
df_bronx_info.head()
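
To illustrate why restricting the columns matters, here is a minimal sketch on a tiny in-memory CSV. The rows and the `Extra` column are made up for illustration and are not part of PLUTO. It compares the memory footprint of a full read against a `usecols` read.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the PLUTO file.
csv_text = (
    "Address,LotArea,NumFloors,Extra\n"
    "1 MAIN STREET,2500,2,some long free-text value\n"
    "2 MAIN STREET,3000,5,another long free-text value\n"
)

# Read everything, then read only the columns we intend to analyse.
full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=["Address", "LotArea", "NumFloors"])

# deep=True also counts the bytes held by string columns.
full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(f"full: {full_bytes} bytes, slim: {slim_bytes} bytes")
```

The same `memory_usage(deep=True)` call could be run on `df_bronx_info` to check how much memory the selected fields actually take.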

### We create a subset pandas dataframe

The subset keeps only the Bronx incidents and the columns we study: complaint type, address, location, and unique key. After that we encode the complaint type, setting HEAT/HOT WATER to 1 and all other types to 0.


In [None]:
# Filter rows and columns in one .loc call and copy, so later column assignments do not act on a view
df_bronx_incidents = df_311.loc[df_311['borough'] == 'BRONX',
                                ['complaint_type', 'incident_address', 'latitude', 'longitude', 'unique_key']].copy()
print('Number of Bronx incidents', df_bronx_incidents['unique_key'].count())
df_bronx_incidents.head()

In [None]:
df_bronx_incidents['complaint_type'] = (df_bronx_incidents['complaint_type'] == 'HEAT/HOT WATER').astype(int)
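
The encoding above overwrites the original labels. As a small sketch of the same comparison on toy data, the flag can instead be written to a separate column so the labels stay available. The column name `is_heat` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy complaint labels; only the HEAT/HOT WATER label is taken from the notebook.
toy = pd.DataFrame(
    {"complaint_type": ["HEAT/HOT WATER", "PLUMBING", "HEAT/HOT WATER", "PAINT/PLASTER"]}
)

# The boolean comparison becomes 1 for the target complaint and 0 otherwise.
toy["is_heat"] = (toy["complaint_type"] == "HEAT/HOT WATER").astype(int)
print(toy["is_heat"].tolist())  # [1, 0, 1, 0]

# The mean of a 0/1 column is the share of rows in the positive class.
share = toy["is_heat"].mean()
```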

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# Count incidents per class (0 = other complaints, 1 = HEAT/HOT WATER) and plot the counts
df_bronx_incidents['complaint_type'].value_counts().sort_index().plot.bar()

### We need to join incidents with building information

A left join is what we need: it keeps every incident and attaches the building information wherever the address matches.

In [None]:
df_bronx = pd.merge(df_bronx_incidents, df_bronx_info, how='left', left_on=['incident_address'], right_on=['Address'])

In [None]:
print('Number of Bronx incidents and building information',df_bronx['unique_key'].count(),sep=' ')
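
To see how many incidents actually find a building, one option is the `indicator` parameter of `pd.merge`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses made-up addresses, not the real data.

```python
import pandas as pd

# Made-up incidents and buildings; the third address has no building record.
incidents = pd.DataFrame({
    "unique_key": [1, 2, 3],
    "incident_address": ["1 MAIN STREET", "2 MAIN STREET", "9 ELM AVENUE"],
})
buildings = pd.DataFrame({
    "Address": ["1 MAIN STREET", "2 MAIN STREET"],
    "NumFloors": [2, 5],
})

# indicator=True records where each merged row came from.
merged = pd.merge(incidents, buildings, how="left",
                  left_on="incident_address", right_on="Address",
                  indicator=True)

print(merged["_merge"].value_counts())
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

Rows marked `left_only` are the ones whose building columns are empty after the join.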

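### Some incident addresses are not in the PLUTO file, so we drop them

After the left join, these unmatched incidents have missing values in every building column. The two address sets do not match one-to-one, which is a cardinality problem. Note that `dropna` removes a row when any of its columns is missing, so incidents without coordinates are dropped as well.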

In [None]:
df_bronx.dropna(inplace=True)
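
Called without arguments, `dropna` discards a row if any column is missing. If only the unmatched buildings should go, a hedged alternative is to pass `subset` with the building columns. The toy frame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy merged frame: row 2 lacks a coordinate, row 3 lacks building data.
toy = pd.DataFrame({
    "unique_key": [1, 2, 3],
    "latitude": [40.85, np.nan, 40.83],
    "NumFloors": [2.0, 5.0, np.nan],
})

# Default behaviour: drop a row if ANY column is missing.
dropped_any = toy.dropna()

# Restricted behaviour: drop only rows that lack building information.
dropped_building = toy.dropna(subset=["NumFloors"])

print(len(dropped_any), len(dropped_building))
```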

### Let's get rid of the addresses

The two address columns were only needed as join keys, and the rows with null values are already gone, so we drop the address columns now. The index is set later, on `unique_key`.

In [None]:
df_bronx.drop(['Address', 'incident_address'], axis=1, inplace=True)

### Eliminate duplicates

There are many duplicate rows. Dropping them leaves *_1,211,609_* rows.

In [None]:
df_bronx.drop_duplicates(inplace=True)

In [None]:
print('Number of Bronx incidents with no duplicates',df_bronx['unique_key'].count(),sep=' ')
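
One way rows can multiply in a left join is an address that appears more than once on the building side. Whether this happens in the PLUTO file is an assumption here; the sketch only shows the mechanism on made-up data, and how `drop_duplicates` behaves with and without `subset`.

```python
import pandas as pd

incidents = pd.DataFrame({
    "unique_key": [1, 2],
    "incident_address": ["1 MAIN STREET", "2 MAIN STREET"],
})
# The first address is listed twice with different attributes (hypothetical).
buildings = pd.DataFrame({
    "Address": ["1 MAIN STREET", "1 MAIN STREET", "2 MAIN STREET"],
    "NumFloors": [2, 3, 5],
})

merged = incidents.merge(buildings, how="left",
                         left_on="incident_address", right_on="Address")

# Exact-row deduplication keeps both copies of incident 1, because NumFloors differs.
exact = merged.drop_duplicates()

# Deduplicating on the incident identifier keeps one row per incident.
per_incident = merged.drop_duplicates(subset="unique_key")

print(len(merged), len(exact), len(per_incident))  # 3 3 2
```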

### Let's define dependent and independent variables

In [None]:
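import numpy as np

# Target: 1 for HEAT/HOT WATER complaints, 0 for everything else
y = np.asarray(df_bronx['complaint_type'])
# Predictors: every remaining column, indexed by the incident's unique_key
predictors = df_bronx.columns.difference(['complaint_type'])
X = df_bronx[predictors].set_index('unique_key')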

In [None]:
X.head(10)

In [None]:
from sklearn.model_selection import train_test_split

# unique_key is an identifier, not a housing characteristic, so it is excluded from the predictors
y = np.asarray(df_bronx['complaint_type'])
predictors = df_bronx.columns.difference(['complaint_type', 'unique_key'])
X = df_bronx[predictors]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
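
If the two classes turn out to be imbalanced, a plain random split can leave the test set with a different class ratio than the training set. A minimal sketch of a stratified split on toy arrays follows; the data are made up, and `stratify` is an optional addition, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 15 in class 1 and 5 in class 0.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 15 + [0] * 5)

# stratify=y_toy keeps the 75/25 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives:", int(y_te.sum()), "of", len(y_te))
```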

In [None]:
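import xgboost as xgb

def xgb_model(train_data, train_label, test_data, test_label):
    """Fit a binary XGBoost classifier and return the fitted model."""
    # silent, seed, and the fit-time eval_metric / early_stopping_rounds arguments
    # follow the older XGBoost API; newer releases use verbosity, random_state,
    # and constructor arguments instead.
    clf = xgb.XGBClassifier(max_depth=7,
                           min_child_weight=1,
                           learning_rate=0.1,
                           n_estimators=500,
                           silent=True,
                           objective='binary:logistic',
                           gamma=0,
                           max_delta_step=0,
                           subsample=1,
                           colsample_bytree=1,
                           colsample_bylevel=1,
                           reg_alpha=0,
                           reg_lambda=0,
                           scale_pos_weight=1,
                           seed=1,
                           missing=None)
    # The test split doubles as the early-stopping validation set here:
    # training stops once AUC has not improved for 100 rounds.
    clf.fit(train_data, train_label, eval_metric='auc', verbose=True, eval_set=[(test_data, test_label)], early_stopping_rounds=100)
    # Class predictions and positive-class probabilities, computed for inspection but not returned
    y_pre = clf.predict(test_data)
    y_pro = clf.predict_proba(test_data)[:, 1]
    return clf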

In [None]:
model = xgb_model(X_train, y_train, X_test, y_test)
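
As a cheap cross-check on whatever the model reports, one could look at the Pearson correlation between each numeric characteristic and the 0/1 complaint flag. This is a sketch on made-up numbers, not a result from the Bronx data, and correlation only captures linear relationships.

```python
import pandas as pd

# Made-up rows: in this toy data, taller buildings carry the complaint flag.
toy = pd.DataFrame({
    "complaint_type": [1, 1, 1, 0, 0, 0],
    "NumFloors": [6, 5, 6, 2, 1, 2],
    "LotArea": [2500, 3000, 2500, 3000, 2500, 3000],
})

# Correlation of every characteristic with the flag, strongest positive first.
corr = (
    toy.corr()["complaint_type"]
    .drop("complaint_type")
    .sort_values(ascending=False)
)
print(corr)
```

On the real data, the same expression applied to `df_bronx` would rank the PLUTO fields by their linear association with the complaint flag.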