# IBM/Coursera Data Science Course - Capstone Project Report
Author: Pieter Joan van Voorst Vader MSc.  
Date: 16 January 2019
## University and College neighbourhood shops and crime climate
### 1. Introduction - Business Case
For prospective students who can afford to live in the San Francisco area and want to apply for a college,  
some preliminary research might be interesting.  
This research project aims to provide insight into the neighbourhoods of the largest Colleges and Universities in San Francisco.  
The assumption is that a student has sufficient means for buying coffee and visiting restaurants, besides the usual supermarket visits,
presumably sponsored by their parents, a scholarship or an unhealthy student loan.  
When the subjects of your choice are available at multiple locations, the crime rate for the more serious recorded offences in each neighbourhood  
might also be worth considering when selecting an appropriate College or University.  

Therefore this project serves two purposes:
- Selecting a College or University with nice neighbourhood shops and facilities (venues).
- For the personal-security conscious, insight into the crime rate that can be factored into your education choice.

### 1.1 Discussion of the business case
Wanting to know the nice shops is quite obvious; these can also be clustered with a machine learning algorithm.  
The crime rate might also be interesting: when your choice of education is only available at a limited number of locations,  
you know beforehand which areas to avoid and at what periods of the day.  
It might also be interesting to explore a few points within walking distance of each education location,  
to provide insight into the shop/venue mixture of that smaller area, so that you know where to go more specifically than the  
more generic Foursquare neighbourhood searches allow.

## 2. Discussion of data sources and usage
The folium package will be used to show OpenStreetMap data.  
The Foursquare API provides nearby venues with some additional data.  
The https://data.sfgov.org portal provides the crime records of 2018.  
The http://www.city-data.com/city/San-Francisco-California.html page provides the list of Universities and Colleges, which will be scraped  
with BeautifulSoup4 into a Pandas DataFrame. These locations provide the basis for the neighbourhood analysis of shops and crime rate.  

The crimes will be grouped in the vicinity of these aforementioned locations with a Latitude and Longitude bandwidth.  
Furthermore, there will be a selection of the more violent and invasive crimes that have a personal experience potential.  
This means that non-criminal incidents, fraud, found licence plates and so on will be excluded from further analysis.  

Then these crimes could even be clustered with the K-means unsupervised clustering algorithm to provide a 'to be labeled'  
crime profile. Distinctions could be made between working hours, evening and nighttime events to provide further differentiation  
between crimes.  

There might also be a correlation between the time of events and shop-venue clusters; therefore this will be researched.  
It might well be that this possible correlation is quite insignificant or non-existent, though the presumption is that in an area  
with many bars and cafes the assault rate would be higher.  



# 3. Methodology section
Pandas will be used for data wrangling, thus reading, cleaning, grouping and aggregating data in a tabular format.
The aforementioned data sources will be filtered on:
- the vicinity of the educational locations
- shops and venues around these locations
- crimes by neighbourhood, crime type and time of occurrence
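
The filtering steps listed above could be sketched with Pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the sample rows and the helper `filter_to_box` are assumptions made for this example.

```python
import pandas as pd

def filter_to_box(df, south, north, west, east,
                  lat_col="Latitude", lon_col="Longitude"):
    """Keep rows whose coordinates fall inside the box; rows without coordinates are dropped."""
    located = df.dropna(subset=[lat_col, lon_col])
    inside = (located[lat_col].between(south, north)
              & located[lon_col].between(west, east))
    return located[inside]

# Made-up sample rows; the column names are assumptions for this sketch.
incidents = pd.DataFrame({
    "Incident Category": ["Assault", "Robbery", "Assault", "Fraud"],
    "Latitude": [37.776, 37.900, None, 37.778],
    "Longitude": [-122.417, -122.417, -122.410, -122.415],
})

# Category selection first, then the geographic filter.
selected = incidents[incidents["Incident Category"].isin(["Assault", "Robbery"])]
nearby = filter_to_box(selected, 37.770, 37.780, -122.420, -122.410)
```

Only the first sample row survives both filters: the second lies outside the box, the third has no latitude, and the fourth is an excluded category.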

## 3.1 Exploratory analysis
- describe the Latitude and Longitude bandwidths of the selected locations for crime and shop-venue selections
- plot the crimes in total and per location with a histogram with time-period on the horizontal axis and count on the vertical axis.
- plot these crimes also per selected main category.

For visualisation purposes also provide a detailed map with aggregated crimes around each location, with markers that provide details.
Also plot the found venues in the areas around each location for generating an idea about the neighbourhoods.

## 3.2 Machine Learning: Clustering
To create insight into the neighbourhood climates, clustering is an appropriate technique that can group similarly 'themed'  
shops and venues in the vicinity. This clustering could be tweaked manually by performing different selections.  
Such a selection could theme the area quite significantly.  
The crimes will be clustered with 'Agglomerative Hierarchical Clustering', K-Means, Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM) _(or variant)_, Mean Shift Clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN).  
All these clustering algorithms are provided by the SciKit-Learn/SciPy packages and each algorithm has advantages and disadvantages.  
These crime clusters will be visualised in distribution plots and on the folium based maps.

## 3.3 Machine Learning: Inferential statistics
There might be a correlation between:
- size of the educational institution and:
  - shop-venue climate
  - crime climate
- shop-venue climate and:
  - kind of most occurring crimes
  - timing of specific kinds of crimes

These crime and shop-venue climates are approached with the found clusters (e.g. crimetype classification cluster),  
or with the shop-venue climate and specifically selected crimetypes in the area (e.g. assault in a cafe-bar typed area).

## 3.4 Further considerations
What might be interesting is to run the classification algorithms on the central-large education location area,  
the prespecified quadrant-sub-areas around each education location,  
and on the aggregated crime-statistics-set of the selected locations of the San Francisco area with different cluster sizes.


# 4. Results
This results section discusses the gathered Education Locations, FourSquare venue data, crime data, feature selection, applied algorithms, the unsuccessful search for correlations, and lastly the most successful combination of clustering techniques.
## 4.1 Education Locations
The education locations were scraped with BeautifulSoup from the aforementioned site: http://www.city-data.com/city/San-Francisco-California.html. This resulted in a dataframe with addresses. Combined with address-to-geolocation queries, the locations could be plotted on a Folium map. This was the starting point of the whole data gathering and clustering process. The result is shown below.
![](uni_loc_folium.jpg)
You can observe that there is an increased density of Education Locations in the _center_ of San Francisco.  
To create a _Neighbourhood_ around each location, a _bounding box_ was calculated. These bounding boxes provided the limits of each neighbourhood that should be included in the clustering efforts. The limits are about 1 km on each side of the education location. This is an approximate value, since it depends on how the curvature of the earth is modelled: different _ellipsoids_ are available, and each only approximates this curvature. The 1 km distance was chosen as the maximum walking distance one would travel in search of nearby venues.
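
A bounding box of this kind can be approximated in a few lines. The sketch below assumes a spherical Earth with a mean radius of 6371.0088 km, which is a simplification compared to an ellipsoid model. The function name and the example coordinates are chosen for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius of a spherical Earth model (an approximation)

def bounding_box(lat, lon, half_side_km=1.0):
    """Return (south, north, west, east) in degrees around a point."""
    dlat = math.degrees(half_side_km / EARTH_RADIUS_KM)
    # Meridians converge towards the poles, so one kilometre spans
    # more degrees of longitude than of latitude away from the equator.
    dlon = math.degrees(half_side_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

# Example point in central San Francisco (illustrative coordinates).
south, north, west, east = bounding_box(37.7749, -122.4194)
```

At this latitude the box spans roughly 0.018 degrees of latitude and a somewhat larger range of longitude, which is why a fixed degree offset in both directions would not produce a square neighbourhood.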

## 4.2 FourSquare data
With the geo-locations list, FourSquare was queried for each location, using multiple search variants that the FourSquare API provides. The query url was adjusted to retrieve venues at any time and any day: the searches were performed from Europe and, due to the time difference, the default queries produced too few and inconsistent results. In addition, the _sandbox_ account had the functional limitation of returning at most 100 results per query. Therefore, the bounding box was used to calculate quadrants, which produced a list of 5 geolocations per education location. 100 geolocations were queried in total, all within the _bounding boxes_ of the education locations.  
These results contained duplicates; after filtering these, a collection of 1442 unique locations remained. The results were saved to csv, since the _sandbox_ account also had a limitation of 950 queries per day.  
However, this result set still contained venues that were not in the bounding-boxed neighbourhoods. After filtering these out, 639 venues remained for all education locations.  
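
One way to derive 5 query points from a bounding box, and to de-duplicate and filter the returned venues afterwards, is sketched below. It assumes the 5 points are the box centre plus the centres of the four quadrants, which the text does not state explicitly. The venue dictionaries with `id`, `lat` and `lng` keys are a hypothetical simplification of the API response.

```python
def query_points(south, north, west, east):
    """Return the box centre plus the centres of its four quadrants as (lat, lon) pairs."""
    mid_lat, mid_lon = (south + north) / 2, (west + east) / 2
    points = [(mid_lat, mid_lon)]
    for lat_edge in (south, north):
        for lon_edge in (west, east):
            points.append(((lat_edge + mid_lat) / 2, (lon_edge + mid_lon) / 2))
    return points

def in_box(lat, lon, south, north, west, east):
    """True when the point lies inside (or on the edge of) the box."""
    return south <= lat <= north and west <= lon <= east

def unique_venues_in_box(venues, box):
    """De-duplicate venue dicts on 'id' and drop those outside the bounding box."""
    seen, kept = set(), []
    for venue in venues:
        if venue["id"] not in seen and in_box(venue["lat"], venue["lng"], *box):
            seen.add(venue["id"])
            kept.append(venue)
    return kept

box = (37.76, 37.78, -122.43, -122.41)
points = query_points(*box)
sample = [
    {"id": "a", "lat": 37.77, "lng": -122.42},
    {"id": "a", "lat": 37.77, "lng": -122.42},  # duplicate from an overlapping query
    {"id": "b", "lat": 37.80, "lng": -122.42},  # outside the box
]
kept = unique_venues_in_box(sample, box)
```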

To provide more insight into the found venues, which are labelled with sub-categories, a full category list was retrieved from FourSquare. With a recursive function, all subcategories were mapped to main categories for map-labelling purposes.  
Since the bounding boxes of the education locations overlap each other, the venues were labelled with the education locations that had the venue in their neighbourhood.  
The effect is depicted in the image below.  
![](venue_loc_folium.jpg)
Since the returned venues contain all kinds of categories, the main category is displayed first, followed by the venue name, the categorical short name and the education location the venue _belongs_ to.  
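
The recursive mapping of subcategories to main categories mentioned in this section might look like the following sketch. The nested structure with `name` and `categories` keys is an assumption about the shape of the category list, and the tree shown is a made-up, heavily shortened example.

```python
def map_to_main_category(categories, main=None, mapping=None):
    """Walk a nested category tree and map every category name to its top-level ancestor."""
    if mapping is None:
        mapping = {}
    for category in categories:
        top = main or category["name"]  # top-level entries map to themselves
        mapping[category["name"]] = top
        # Recurse into the children, carrying the top-level name along.
        map_to_main_category(category.get("categories", []), top, mapping)
    return mapping

# Hypothetical, shortened category tree for illustration only.
tree = [
    {"name": "Food", "categories": [
        {"name": "Cafe", "categories": []},
        {"name": "Asian Restaurant", "categories": [{"name": "Sushi Restaurant"}]},
    ]},
    {"name": "Nightlife Spot", "categories": [{"name": "Bar"}]},
]
main_of = map_to_main_category(tree)
```

With such a lookup table, a venue labelled with a deeply nested subcategory can be given the label of its main category on the map.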

## 4.3 Crime Data of San Francisco
The crime data from https://data.sfgov.org provided 146743 records for the year 2018 up until 18 December. This amount was reduced significantly by removing all records that did not involve personal harm or a potentially traumatising effect. Furthermore, the results were filtered on the bounding boxes of the education location neighbourhoods. After the category inclusion filtering only 14746 records remained, roughly a tenfold reduction of the dataset: 131997 records were purged and 10.05% remained. Of the remaining records, 72 did not contain geolocations; these records were also purged.  
A preliminary _Incident Subcategory_ exploration was performed by displaying counts of occurrences per _interesting_ main category, to possibly further _clean_ the dataset.  
After the bounding box filtering, 7827 crime related incidents remained in the crime dataset. Since the time and date were also recorded, day names, day numbers, and a time-of-day in 3 partitions were added. The time-of-day was partitioned into _Business Hours_ ranging from 08:00 to 18:00, _Evening Hours_ ranging from 18:00 to 23:00, and _Night Time_ covering the remaining hours from 23:00 to 08:00. This partitioning created options for daytime selections and possible crime-trend evaluations. 
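
The three time-of-day partitions and the added day features could be derived as in the sketch below. The function names and the returned dictionary keys are illustrative assumptions. The hour boundaries follow the partitioning described above, with the start of each period included and the end excluded.

```python
from datetime import datetime

def day_period(hour):
    """Map an hour (0-23) to one of the three time-of-day partitions."""
    if 8 <= hour < 18:
        return "Business Hours"
    if 18 <= hour < 23:
        return "Evening Hours"
    return "Night Time"  # 23:00 up to 08:00

def time_features(timestamp):
    """Derive day name, day number (Monday = 0) and time-of-day partition."""
    return {
        "dayname": timestamp.strftime("%A"),
        "daynumber": timestamp.weekday(),
        "period": day_period(timestamp.hour),
    }

features = time_features(datetime(2018, 12, 17, 23, 30))
```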

### 4.3.1 Crime Incident Plotting
The plotting of nearly 8000 crime incidents with labels ran into folium plotting issues. The built-in grouping did not help: the FastMarkerCluster class only provided points without labels and did not serve the goal of plotting all crimes with labels. Therefore the FastMarkerCluster class was used as the basis for a custom FasterMarkerCluster class that can handle custom labelling and custom marker colours. With this custom class, venues and crimes could be differentiated. However, the total size of the label array also had a limit, probably caused by the Python2JavaScript module implementation. When this limit was exceeded, the map would not render at all. After debugging this issue, some crime labelling was reduced, and all venues and crime incidents could be plotted with region-clustering that provides more detail when zooming in on map areas.  
An example plotted map is shown below that depicts the different coloured markers and high-density crime clustered _spider-plotted_ when selected.
![](crime_venue_clustered_markers.jpg)

### 4.3.2 Crime Trend Analysis
The crime trends were plotted in different kinds of representations to visually spot trends.  
The first plot depicted below, is the plot of categorical crimes, differentiated by the 3 time-of-day partitions in a bar-graph.
![](crime_cat_dayperiod.png)
The plot shown below has a limited y-axis range, to emphasise the less frequently occurring incidents in 2018.  
![](crime_cat_dayperiod_ylimit.png)
The plot below shows three bar graphs with the distribution of crimes in 2018 per weekday and period of day. The week starts on Monday with number 0 and ends on Sunday with number 6.
![](crime_cat_perweekday_dayperiod.jpg)
The main category totals are shown in the horizontal barchart in the figure below. 
![](crime_hor_bar_totals.jpg)
A _Kernel Density Estimation Plot_ (below) depicts the density of the filtered crimes in a form analogous to a heatmap, and displays the _density estimations_ in a more informative manner than a _binned_ histogram. In the upper and side plots, it is observable that the distributions around the city centre resemble Gaussian bell-shaped curves. The red crosses represent the education locations, which reside in this crime-centre of San Francisco.
![](crime_small_density_distr.png)
The following scatter plot depicts the same data in more detail. This plot also shows the _anonymisation practices_ of the law enforcement agencies, since the dataset is accompanied by the note that all crimes are anonymised to an intersection granularity. In practice this could also be a mid-street location. This has the effect that the street plan of San Francisco is visible. 
![](crime_scatter_frequency_large.png)
The next plot depicts, in scatter plot form, the crimes per day and per daytime period. This is to spot whether there are significant crime-free periods or concentrations based on day of the week and time of the day.
The trend seems visually quite constant.
![](crime_weekday_timeofday_scatter.png)
The line chart with a line per category, with weekdays on the X-axis, shows possible weekday trends in aggregated format. The Fridays and Saturdays show a slight peak in Assaults; however, the actual counts are not significantly higher than the aggregated count of a Wednesday. 
![](crime_weekday_timeofday_scatter.png)
Lastly, possible monthly trends are visualised in the graph below. This graph shows that Prostitution has 2 peculiar spikes, possibly related to temporary policing programs that are performed on a theme basis, possibly induced by political requests or media attention.  
The Robbery category seems quite stable throughout the year. Assault has an elevated occurrence per month from February until August, with a maximum of 500 per month. During Autumn and Winter this occurrence rate drops to around 400 incidents per month.  
This leads to the conclusion that there are no obvious seasonal or monthly trends observable; the largest difference between maximum and minimum monthly occurrences is 122. 
Notably, one should ignore the month of December, since only data up until 18 December is available in the dataset.
![](crime_cat_monthly_trends.png)

This leads to the conclusion that only a higher city-centre density crime rate could be observed without any further significant or peculiar trends that could lead to a statistical significance.

## 3.4 Feature Selection
The feature selection process can be described as a 2-step process. The first step is the already performed selection of applicable crime incidents and venues, by category selection and querying, and by bounding-boxed neighbourhood filtering. The second step is the binary encoding of categories and subcategories for algorithm processing.
These binary encoded features are used when the applied algorithm allows this. A mathematically possible inclusion does not always make sense with the exploratory project goal in mind. 

A distinction is made between including subcategories only and including main categories only. When both were included in the algorithm calculations the results were not desirable, because the main categories would occur more often and would be given extra weight by the algorithm.  

This binary encoding scheme is also called _one-hot-encoding_ and is implemented in Pandas by ```pd.get_dummies()```.
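
As a small illustration of this encoding, the sketch below one-hot encodes a made-up venue table. The column names `location` and `subcategory` and the sample rows are assumptions for this example and do not come from the project data.

```python
import pandas as pd

# Made-up venues; each row has exactly one subcategory.
venues = pd.DataFrame({
    "location": ["College A", "College A", "College B"],
    "subcategory": ["Cafe", "Bar", "Cafe"],
})

# One column per distinct subcategory, with 1 where the venue has that subcategory.
onehot = pd.get_dummies(venues["subcategory"], dtype=int)

# Keep the grouping column next to the encoded features.
encoded = pd.concat([venues[["location"]], onehot], axis=1)
```

Each row of `onehot` contains a single 1, so column sums are simply category counts and column means are category shares.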



## 3.5 Applied Clustering Algorithms
### 3.5.1 K-Means neighbourhood typification
For the purpose of neighbourhood typification, the venues and crimes were grouped per _education location_ and later also per 4x4 quadrants that were generated _per_ education location. The mean values of all binary encoded features were calculated and sorted in descending order, thus with the most frequently occurring venue type or crime category first.  
For the venues a representation percentage was calculated for the top5 and top10, to show how representative this ranked listing is for the goal of neighbourhood typification.  
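
The grouping, ranking and representation percentage could be computed as sketched below. How the report calculated the percentage is not stated, so this version assumes it is the share of a group's venues covered by its top-n categories. The function name and the sample data are hypothetical.

```python
import pandas as pd

def top_categories(encoded, group_col, n=5):
    """Return {group: (top-n category names, share of the group's venues they cover in %)}."""
    means = encoded.groupby(group_col).mean()
    result = {}
    for group, row in means.iterrows():
        ranked = row.sort_values(ascending=False).head(n)
        # Each venue has exactly one category, so the means of a group sum to 1.
        result[group] = (list(ranked.index), round(float(100 * ranked.sum()), 1))
    return result

# Made-up one-hot encoded venues for two locations.
encoded = pd.DataFrame({
    "location": ["A", "A", "A", "A", "B", "B"],
    "Bar":  [1, 0, 0, 0, 0, 1],
    "Cafe": [0, 1, 1, 0, 1, 0],
    "Gym":  [0, 0, 0, 1, 0, 0],
})
summary = top_categories(encoded, "location", n=1)
```

For location A the single top category is Cafe, covering half of its four venues.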

The _Elbow-method_ was applied for optimal cluster-count selection, the K-value central to the K-Means algorithm. This analysis produces a somewhat arbitrary result, with 10 or 12 clusters that could be _optimal_, for both crime and venue clusters.  
K-selection remains a quite arbitrary process, especially when no real significant elbow can be observed.  
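
For reference, a basic elbow analysis with SciKit-Learn could be set up as below. The data is synthetic, with three clearly separated groups, so the elbow is far more pronounced here than in the project data. The range of K values and the random seeds are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points around three well-separated centres (illustration only).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centres])

# Inertia is the sum of squared distances of points to their cluster centre.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
```

Plotting `inertias` against K shows a sharp bend at K=3 for this data. When the bend is gradual, as reported above, the choice of K stays a judgment call.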

The clustering on education neighbourhoods with a K of 10 produced the clustering depicted in the figure below.
![](kmeans_venue_eduloc_only.jpg)
The examination of the clustered venue climates with subcategories only showed that at least 49% of the venues were represented, based on a top10 mean calculation when clustering.

Surprisingly, the clustering with the inclusion of main categories produced exactly the same clusters; this is shown by a comparison of education locations with cluster labels in the jupyter-notebook.  

For the later applied quadrant neighbourhood clustering per education location, an arbitrary value of 20 clusters was selected for venues, while 12 was _more optimal_ for crime clusters. This value produced, quite logically, the lowest mean squared error values. Furthermore, no significant elbow could be observed, and the _granularity_ or detail of the cluster naming would be more informative.  
In the cluster examination, a quite homogeneous venue distribution could be observed. This observation is quite significant for the goal of exploratory analysis for prospective and current students. These homogeneous clusters help to select a direction in which to search for a nice venue on foot.  

### 3.5.2 Automatic cluster naming
Automatic cluster naming might be controversial, but it is quite practical. This feature was implemented by selecting the most frequently occurring venue or crime category in the top4, then concatenating these names in descending order. This coarse method also implies that a cluster could be named ```Cafe, Cafe, Coffee Place, Coffee Place```.  
However, this naming scheme _algorithm_ still quite distinctively typifies a clustered neighbourhood crime or venue climate. This method suits the formulated exploratory project goal.  
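
The report does not spell out the exact naming procedure, so the sketch below shows one possible reading that reproduces the duplicate-name effect: for each of the top-4 rank positions, take the category that occurs most often at that position among the cluster members, then join the four names. The function name and the sample rankings are hypothetical.

```python
from collections import Counter

def cluster_name(member_rankings, depth=4):
    """Join, per rank position, the most frequent category among the cluster members."""
    names = []
    for rank in range(depth):
        at_rank = Counter(r[rank] for r in member_rankings if len(r) > rank)
        if at_rank:
            names.append(at_rank.most_common(1)[0][0])
    return ", ".join(names)

# Made-up top-4 rankings of four neighbourhoods that share one cluster.
rankings = [
    ["Cafe", "Bar", "Coffee Place", "Gym"],
    ["Bar", "Cafe", "Gym", "Coffee Place"],
    ["Cafe", "Park", "Coffee Place", "Bar"],
    ["Park", "Cafe", "Bar", "Coffee Place"],
]
name = cluster_name(rankings)
```

Because each rank position is evaluated independently, the same category can win at several positions, which is how a name with repeated entries can arise.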

### 3.5.3 Quadranted education location clustering
The result of the quadranted crime and venue clustering is plotted in a Folium map, depicted in the figure below.  
![](quad_clustering.jpg)
The labels show the cluster number and the number of venues/crime-incidents of that cluster, combined with the automatically generated cluster name. Furthermore, it can be observed that not all quadrants are occupied by venue or crime clusters. It should also be noted that venues are represented by _CircleMarkers_ and crime incidents by _MapMarkers_, each with a _per-cluster-appointed colour_.  







### 3.5.4 DBSCAN Clustering
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) produces quite interpretable results, with a _high enough_ 'K'-hyperparameter setting. Another advantage of this density-based clustering was that a suitable result could be obtained without quadrantising the crimes into pre-defined neighbourhoods.  

Important to mention is that the venues are not clustered with other algorithms than K-Means, since the total count of unique venues is quite low in comparison to the crime related incidents.  
Also to further improve the _usefulness_ of the clustered crimes, only _Day Time_ or _Business Hours_ events are clustered. This decision is motivated by the fact that most students will be in the neighbourhoods of education locations during _Day Time_, therefore a crime-climate typification that depicts daytime-crime-incidents suits the exploratory goal of education location selection or venue search direction best.  

The DBSCAN clustering produced clusters of arbitrary form, based on density of occurrences, exactly as the algorithm name implies. However, to plot these clusters without cluttering the map too much, a plottable shape has to be determined. A suitable algorithm is the _Concave-Hull_ algorithm, which creates properly ordered polygons from a de-duplicated geo-location list of each cluster. Interestingly enough, this algorithm relies on a K-nearest-neighbours search with a configurable hyperparameter 'K', the number of nearby points to include. Tuning this hyperparameter affects the final polygon shape, and the value is selected by visual inspection of the produced cluster-polygons. One limitation of the concave hull algorithm, or of polygons in general, is that 3 or more points are needed to create one; therefore clusters with fewer than 3 points are represented by red-outlined circles with a yellow translucent centre. This follows the general theme for crime clusters: a red outline with a yellow inner area. This creates a clear distinction with the other cluster representations, and allows the venue clusters to be plotted distinguishably from crime clusters.

#### 3.5.4.1 DBSCAN noise considerations
The DBSCAN algorithm has two configurable hyperparameters: _Epsilon_, which represents the radius within which the algorithm should search for _nearby_ points, and the minimum number of samples needed to create a cluster. Another algorithm-specific issue is the exclusion of points that could be named _outliers-with-cluster_ or are considered _noise_. This algorithm is designed for noisy applications, and might therefore not be the ideal algorithm for crime-climate exploratory goals. However, when the selection of the number of clusters is based on a minimum _noise/outlier_-count, this algorithm could serve the purpose of _generalising_ enough for exploration purposes.  
This follows the reasoning that _noisy_ incidents occur with such a low frequency and density in that part of the city that you could ignore them. Ignoring this noise relates to the realistic statistical chance that a similar event would happen there at a future point in time. This is of course the basis for crime-climate cluster analysis.  
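
The effect of the two hyperparameters and the noise label can be seen in a small sketch with SciKit-Learn's `DBSCAN`. The points are made up: two dense groups and one isolated point. The helper `cluster_summary` and the parameter values are assumptions for this example only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of five points each, plus one isolated point.
blob = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([blob, blob + 5.0, [[10.0, 10.0]]])

def cluster_summary(points, eps, min_samples):
    """Run DBSCAN and report the labels, the cluster count and the noise count."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))  # DBSCAN marks noise with label -1
    return labels, n_clusters, n_noise

labels, n_clusters, n_noise = cluster_summary(X, eps=0.3, min_samples=3)
```

Repeating this summary over a range of `eps` and `min_samples` values and comparing `n_noise` is one way to carry out a minimum-noise selection as described above.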

The cluster naming is again performed by concatenating the 4 most frequently occurring sub-categories per cluster.

#### 3.5.4.2 DBSCAN cluster selection
The amount of clusters is based on the selection of the minimum noise/outlier-count. This minimum noise/outlier-count was reached consistently with 151 clusters. All points with label ```-1``` are considered noise by the DBSCAN algorithm and will therefore not be plotted on the resulting Folium-map. This version of crimes-clusters with irregular shapes combined with the previously calculated K-Means-quadrant-venue clusters is depicted below.
![](dbscan_clust_folium.jpg)

It should also be noted that the DBSCAN algorithm prefers _normalised_ values for clustering; therefore the axes of the scatter plot below do not show geolocation values. The dark grey points are the crime incident points that were evaluated as _noise_. 
![](dbscan_scatter.png)

### 3.5.5 Automatic neighbourhood creation with K-Means
Given the exploratory nature of this project, K-Means was also applied in a slightly different setting. This algorithm was applied to define crime-neighbourhoods autonomously, based on the arbitrary value of ```k=151```, a value that was found with the DBSCAN algorithm. This neighbourhood creation was intended to visually show differences between K-Means and DBSCAN for neighbourhood creation. Furthermore, this method of applying K-Means could then be followed by the previously described _mean-occurrence-clustering with K-Means_, thus applying the same algorithm twice for a different purpose. The resulting K-Means neighbourhood-detection result is shown below, and does show differences with DBSCAN due to the way K-Means creates clusters. The more spherical cluster shapes can be observed. 
![](kmeans_neigh_detect.png)


### 3.5.6 Agglomerative Hierarchical Clustering (SciPy & SciKit-Learn implementations)
Agglomerative Hierarchical Clustering, or AHC for short, was applied with both the SciPy and SciKit-Learn implementations, to check whether the implementations would differ in the resulting clusters. They do not differ, at least when ```euclidean``` is selected with the SciKit-Learn implementation in combination with the ```ward``` algorithm, which tries to minimise the variation within the clusters. 

Furthermore, only geolocation data is used, since only that data makes sense when _categorical-distance-determination_ is an issue.   

The resulting clusters, visualised with Folium and Concave Hull with K=7, are nicely formed. Notably, the clusters are also formed at street level. This could be quite informative if you would like to know with more precision which streets are 'dangerous'.   
The cluster naming is performed with the previously described method of top4-most-occurring-concatenation.  

The dendrogram plots are not very informative and will not be shown in this report.  
The Cophenetic Correlation Coefficient, which indicates how well the produced model represents the data, has a value of 0.821. A value of 0.98 would be impressive; however, 0.821 is well below the 'usual' Social Sciences threshold of 0.95, or 0.99 in more stringent cases. In this case the metric is a nice-to-have, and indicates that the model is not all that informative. 
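
A comparison of the two implementations, together with the Cophenetic Correlation Coefficient, could be sketched as follows. The data is synthetic and clearly separated, so agreement is expected here. The cluster count of 3 and the random seed are arbitrary choices for this sketch and unrelated to the 144 clusters used in the project.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Synthetic points around three well-separated centres (illustration only).
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.4, size=(30, 2)) for c in centres])

# SciPy: build the Ward linkage tree, then cut it into 3 flat clusters.
Z = linkage(X, method="ward")
scipy_labels = fcluster(Z, t=3, criterion="maxclust")

# SciKit-Learn: Ward linkage uses Euclidean distances.
sklearn_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# An adjusted Rand index of 1.0 means identical partitions, whatever the label numbers.
agreement = adjusted_rand_score(scipy_labels, sklearn_labels)

# Correlation between the original pairwise distances and the tree's merge heights.
ccc, _ = cophenet(Z, pdist(X))
```

The adjusted Rand index is used because the two libraries number their clusters differently, so a direct comparison of the label arrays would be misleading.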

The amount of clusters also has to be chosen arbitrarily, and is often measured by _distance-cut-off_ or domain knowledge about the desired amount of clusters. In this case a _reasonably_ high amount is chosen and set to 144 clusters, to provide enough detail in total and differentiation between clusters.

The resulting cluster plot is depicted below. 
![](ahc_clust_folium.jpg)

### 3.5.7 Mean Shift Clustering
Mean Shift Clustering has the potential to automatically determine the _right_ number of clusters. However, on the geolocation data, with or without the inclusion of binary encoded (Sub-)Categories, it produced a very small number of clusters. With geolocation only, 4 clusters were produced with the pre-set _bandwidth-detection_ and only 3 without it. With main categories binary encoded, the algorithm produced 9 clusters, and with only subcategories 17 clusters were determined _ideal_. While evaluating the automatically named clusters, one observation could be made: all clusters contained only one sub or main category, which seems strongly related to frequency of occurrence. In other words, a Bayesian-like frequency count would determine the cluster.  
Due to these _simple_ cluster contents and the very few clusters produced by geolocation only, this algorithm is the least applicable for the goal of crime-climate determination. Also, the clusters had their cluster centres in the city centre, and the individual _categorically-labelled_ incidents were scattered throughout the plotted results.  

The decision was made to not apply this method on a quadrantised neighbourhood, since frequency was leading in the naming of the clusters.  

The resulting plots are shown below: first the geolocation-only plot with bandwidth estimation, followed by the subcategory-encoded plot, where bandwidth estimation had no effect on the tuneable hyperparameters. 
![](ms_latlong_w_bndwth.png)
![](ms_subcat_w_bandwth.png)

### 3.5.8 Gaussian Mixture Model Clustering
First of all, one of the features of this model is that it can produce partially overlapping clusters. This effect does not always appear on the final plots, since this algorithm also suffers from the _random-initialisation_ problem.  

Two separate runs with the same hyperparameters do not produce the same results; subtle differences can be observed in the two plots below, with 144 components.  
![](gmm144.run1.png)
![](gmm144.run2.png)

The suggested metrics also have a few issues. Due to the inconsistent algorithm results, the metrics need several re-runs of the algorithm to provide an _error-bandwidth_. Observable _elbows_ were absent in the metric calculations, and moreover, none of the applied metrics showed a clearly interpretable trendline. Another serious problem is that with the intended _higher_ optimal cluster count, these calculations are _very_ computationally expensive. The suggested number of metric iterations is 20, with 2 gmm-inits, per number of clusters.  
The calculation for a cluster range between 4 and 240 had a duration of about ```22.22 hours```.
With non-observable elbows, no clear ceiling/bottom-values, and non-deterministic results, the metrics provided very little insight into the _right_ number of clusters to be _chosen_ for this algorithm. It is also observable that each metric produces a different optimal value for the number of clusters.

The GMM algorithm is applied in this project with ```covariance_type='full'``` for more flexible cluster shapes and sizes, and only on geolocation data. It is mathematically possible to run the algorithm with binary encoded categorical values, but this would not make sense for the application, because of the previously mentioned _categorical distance measure issue_. 
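
A minimal sketch of fitting such models over a range of component counts and collecting a score per count is shown below, using SciKit-Learn's `GaussianMixture` and its BIC score. The synthetic data, the small range of 1 to 6 components and the fixed `random_state` are assumptions for this example. Without a fixed seed, repeated runs can give slightly different results.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic points around three well-separated centres (illustration only).
rng = np.random.default_rng(2)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centres])

bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="full",  # each component gets its own full covariance matrix
        n_init=2,                # two initialisations per fit; the best one is kept
        random_state=0,          # fixed seed to make this sketch repeatable
    ).fit(X)
    bic[k] = gmm.bic(X)          # lower BIC indicates a better trade-off

best_k = min(bic, key=bic.get)
```

The cost of this loop grows with both the number of components and the number of re-runs, which is consistent with the long run times reported above for ranges of several hundred components.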

In the Folium plot of the GMM-clusters below, the effect of the _overlapping_ cluster shapes can be observed. This effect is inherent to Gaussian modelling, which is based on distribution shapes.  
Just below the label pointer, two overlapping clusters are displayed, visualised clearly by the translucency of the yellow cluster filling. This plot is generated with the quite arbitrary value of 195 components. This value was obtained from a one-time _high-value_ in the Silhouette Score plots. A re-run of the notebook produced a local optimum of 192, and the value differs on each metric-suite run. Lastly, the erratic cluster shapes are quite noticeable, even with K=7 for the Concave Hull algorithm, which produced these polygonal shapes.
![](gmm_195_overlap_clusters.jpg)

The main reason not to pursue results with GMM any further is that the overlapping clusters have the potential to confuse users.  
The most relevant parts of the metrics are displayed below.  
To further elaborate on the metrics, which should help to prevent _overfitting_: the Silhouette Score should be high.  
The error bar represents the reproducibility of the best 5 of 20 runs, with the deviation between those best 5 runs represented by the length of the error bar at any cluster size.
The Silhouette Score, cut-out graph is depicted in the figure below and shows the area of the highest scores in the tested range of 4-200. 
![](gmm_silhouette_score.jpg)
The test-train split represents the amount of similarity between 2 randomly chosen sub-sets of GMM models at any preset cluster size. This comparison is based on the Jensen-Shannon (JS) metric: the smaller this distance is, the better the model should be. Ideally, a cluster size with a very small error bar in combination with a small JS-distance should be chosen.  
The best area of the 4-200 range is depicted below.  
![](gmm_test_train_split.jpg)
The next metric, the Bayesian Information Criterion (BIC), represents the prediction performance of the produced GMM model with respect to the currently available data. Lower BIC scores represent better performing models, considering that the true distribution is unknown. To prevent _overfitting_, high cluster counts are penalised. More clusters should produce a better fitting model; however, this metric does not prevent overfitting in this form.  
The best performing area of the BIC metric is depicted below.  
![](gmm_bic_score.jpg)
This quite smooth, monotone curve bottoms out around 21-23 clusters.  
Another way to evaluate this curve is to plot its gradient, which should show changes in slope more explicitly. In this representation, depicted in the figure below, the elbow to look for is the area where the gradient becomes roughly constant, preceded by a larger gradient and thus a steeper curve.
Around 128 clusters the gradient becomes very stable and stays almost flat up to 239 clusters. However, this is not the area of interest indicated by the BIC score graph above, which points to an area around 21-23 clusters. In the figure below, one could argue that the elbow is at 24 clusters or at 17 clusters. The gradients at those cluster counts differ quite extremely, which illustrates the instability of the BIC gradient at cluster counts below 128.  
![](gmm_bic_gradient.jpg)
In conclusion, the BIC gradient does not help to determine a well-performing GMM cluster size for this particular dataset.
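
The gradient analysis described above can be sketched with `numpy.gradient`. The rule for "the gradient becomes roughly constant" is an assumption made for illustration: `first_stable_n` (a hypothetical helper) returns the first cluster count after which the gradient never changes by `tol` or more between neighbouring points. The data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


def bic_gradient(ns, bic_values):
    """Numerical gradient of the BIC curve with respect to the cluster count."""
    return np.gradient(
        np.asarray(bic_values, dtype=float), np.asarray(ns, dtype=float)
    )


def first_stable_n(ns, gradient, tol):
    """First cluster count after which the gradient changes by less than tol."""
    change = np.abs(np.diff(gradient))
    unstable = np.nonzero(change >= tol)[0]
    if unstable.size == 0:
        return ns[0]
    return ns[unstable[-1] + 1]


# Synthetic stand-in for the real coordinates (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
ns = list(range(1, 11))
bic_values = [
    GaussianMixture(n_components=n, random_state=0).fit(X).bic(X) for n in ns
]

grad = bic_gradient(ns, bic_values)
# Tolerance relative to the steepest part of the curve (arbitrary choice).
elbow_n = first_stable_n(ns, grad, tol=0.05 * np.abs(grad).max())
print(elbow_n)
```

On a noisy gradient, such a rule is sensitive to the chosen tolerance, which matches the instability observed in the figure.
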

### 3.5.9 Inferential Statistics
A correlation was expected between the crime and venue neighbourhoods and/or climates. No combination produced a noteworthy correlation score, that is, one well above 0.5.  
The conclusion for this part of the project has to be that there are no significant correlations, so the null hypothesis that there is no correlation cannot be rejected.
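
A correlation check of this kind could look like the sketch below. It assumes the Pearson correlation from `scipy.stats.pearsonr` and a significance level of 0.05; the helper name `correlation_test`, the significance level and the randomly generated counts are illustrative assumptions, not the project's data or method.

```python
import numpy as np
from scipy.stats import pearsonr


def correlation_test(x, y, alpha=0.05):
    """Return (r, p_value, reject_null) for the null hypothesis of no correlation."""
    r, p_value = pearsonr(x, y)
    return float(r), float(p_value), bool(p_value < alpha)


# Independent synthetic counts per neighbourhood (assumption).
rng = np.random.default_rng(0)
venue_counts = rng.poisson(lam=20, size=40)
crime_counts = rng.poisson(lam=50, size=40)

r, p_value, reject_null = correlation_test(venue_counts, crime_counts)
print(r, p_value, reject_null)
```

Reporting the p-value next to the correlation score makes explicit whether the null hypothesis can be rejected at the chosen significance level.
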

## 4. Discussion 
Although the expected correlations were not found, this project produced quite insightful perspectives on its exploratory goal: neighbourhood climate exploration based on two factors, the crime-incident climate and the venue climate.  
Most interestingly, a few of the algorithms applied to daytime-only crime incidents produced quite detailed results when the incidents were not quadrantised, thus without predetermined neighbourhood sizes.  
Both the DBSCAN and the AHC algorithm produced results that are easy to interpret for individuals without specific domain knowledge. The cluster naming produced a quantifiable impression of the neighbourhood climate, displaying the number of points per cluster together with the cluster name.  
These points allow an individual to make sense of the density of each cluster and, especially in the case of crime, to relate this to the _actual_ chance of being confronted with an event that could have a traumatising effect, either emotionally or physically.  
The venue clusters appear to clutter the view; this is a known effect of the quadrant neighbourhood determination. This intentionally cluttered view disappears when the Folium Layer Controls are used. Each education location can be removed from the display individually, in all versions of the Folium crime-venue compound maps. The crimes can only be removed all at once. This is also intentional, since the large number of unsupervised, non-quadrantised clusters have little or no locational relation with any of the education locations.  

These deliberate choices make it possible to present multiple Folium versions, even all in one presentation: the DBSCAN, AHC and quadrant crime-climate clusters together, with the user switching between the clustering methods. An individual can then choose the most representative or informative clustering method, with the granularity or detail that the user prefers.  

The cluster sizes were chosen quite arbitrarily; however, they are loosely based on the total number of education locations, the amount of overlap, and the _suggested or preferred_ amount of detail represented by the plotted clusters, since the main goal of this project is to provide insight into the venue and crime neighbourhood climates for prospective or current students.  

One of the main recommendations for an individual exploring any version of the map is to keep a few important things in mind at all times:  
- The counts in the clusters are **aggregated per year**, so the average number of incidents per day is _that_ amount divided by 365 (or 366 in a leap year).  
- All crime-related incidents are anonymised; many could have happened indoors and would not be noticeable to pedestrians on the sidewalk.  
- **Very Important:** Incidents that happened in the past are not guaranteed to occur again at any time in the future.  
  The crime-related data is _historic_ data and gives an impression of last year's incidents.  
  This reasoning is also one of the _better_ arguments for choosing DBSCAN, since that algorithm is designed to cluster inherently noisy data. Combined with the choice to minimise the hyperparameters in order to minimise the noise count, the effect could be that sporadic occurrences, like homicide, are simply not visually clustered.  
  This has the accompanying effect of _not_ over-frightening an individual inspecting these clusters.  
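
The noise handling mentioned above can be illustrated with a small sketch. It assumes scikit-learn's `DBSCAN` on synthetic planar coordinates; the `eps` and `min_samples` values are arbitrary and are not the project's tuned hyperparameters. Points that do not belong to any dense region receive the label `-1` and can simply be left off the map.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense synthetic hotspots plus two isolated incidents (assumption).
rng = np.random.default_rng(0)
hotspot_a = rng.normal(loc=[0.0, 0.0], scale=0.05, size=(50, 2))
hotspot_b = rng.normal(loc=[1.0, 1.0], scale=0.05, size=(50, 2))
isolated = np.array([[5.0, 5.0], [-4.0, 3.0]])
points = np.vstack([hotspot_a, hotspot_b, isolated])

# eps and min_samples are illustrative values only.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(points)

n_clusters = len(set(labels) - {-1})   # label -1 marks noise
n_noise = int(np.sum(labels == -1))
clustered_points = points[labels != -1]  # only these would be drawn
print(n_clusters, n_noise)
```
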
  


## 5. Conclusion
Carrying out the project was quite educational: observing the actual effects of the applied algorithms and of tuning hyperparameters, both numerically and visually.
It also meant exploring different ways of tackling programmatic challenges and observed malfunctions in the packages used, and finding satisfactory solutions to those.  
Gaining more practical experience in the complete Data Science process, from formulating the research question through gathering and processing data to presenting the results within the limits of the _supplied_ tools, such as _free-version API limitations_, was quite an interesting process.

It is notable, however, that an _ideal_ number of clusters is quite arbitrary with semi-steered unsupervised clustering, which means that any decision can be disputed. The intention is that the presented choices are well informed, with the main project goal in mind: exploratory research into the neighbourhood climates around the 20 biggest educational locations in San Francisco.

In conclusion, the research question has been answered successfully: a presentable solution has been developed that provides insight into the venue and crime climates in the neighbourhoods of educational locations for future or current students in San Francisco, who can explore the neighbourhoods visually and interpret the provided labelled clusters with quantified venue and crime incident counts. This should help a security-conscious individual to make an educated choice, or possibly to avoid some streets.  

In the current representations, a user-selectable crime layer is optimal: the clustering methods have different characteristics, and the user can decide which one is personally preferable. DBSCAN is suggested as the default clustering method, since this algorithm ignores noise in the visualisation and produces the least unintended overestimation of the crime climate.

### 5.1 Future Research
It would be interesting to present the three different crime-climate representations to a group of potential users and to perform a field survey among the participants. This could result in 'better' or better-informed choices about the finally selected algorithm or hyperparameter tuning. This includes grids of 5x5 or more quadrants per education location.

Furthermore, the already suggested unsupervised K-Means neighbourhood determination could be explored further, followed by the neighbourhood climate classification already applied per determined neighbourhood, which is based on the mean-calculated most frequent occurrence.

Another approach for Mean Shift clustering could be to run the algorithm iteratively. This would produce clusters within clusters, until a satisfactory number of clusters is reached and the automatic naming process can be initiated.

Lastly, a more elegant automatic naming algorithm could be developed, one that does not repeat names or that produces statistically more significant names than the current implementation.

