
merge new contributions, license texts, and booktools
ljwolf committed Jun 24, 2019
2 parents 308434e + 413b653 commit 4cb141c
Showing 3 changed files with 49 additions and 14 deletions.
50 changes: 37 additions & 13 deletions notebooks/05_choropleth.md
@@ -160,7 +160,7 @@ range of the attribute values. Given $w$ the number of bins ($k$) is:
$$k = (max - min)/w.$$
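To make the relationship concrete, here is a minimal sketch (the array `y` and the width `w` are made-up values, not data from the text) that computes the implied number of bins:

```python
import numpy as np

# Made-up sample and a chosen bin width w (e.g., from the Freedman-Diaconis rule)
y = np.array([3, 7, 12, 18, 25, 31, 44])
w = 10
# k = (max - min) / w, rounded up so the bins cover the full range
k = int(np.ceil((y.max() - y.min()) / w))
print(k)  # 5
```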

Below we present several approaches to create these break points that follow criteria that can be of interest in different contexts, as they focus on different priorities.

#### Equal Intervals

@@ -169,9 +169,9 @@ the width and, in turn, the number of bins for the classification. This is a
special case of a more general classifier known as "equal intervals", where each
of the bins has the same width in the value space.
For a given value of $k$, equal intervals
classification splits the range of the attribute space into $k$ equal length
intervals, with each interval having a width
$w = \frac{x_{n-1} - x_0}{k}$.
Thus the maximum class is $(x_{n-1}-w, x_{n-1}]$ and the first class is
$(-\infty, x_{n-1} - (k-1)w]$.
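A minimal sketch of this rule in plain NumPy follows (the array `y` is made up for illustration; in the text the classification would be produced with `mapclassify`):

```python
import numpy as np

y = np.array([2, 5, 9, 14, 22, 31, 47, 60])  # made-up attribute values
k = 5
w = (y.max() - y.min()) / k                  # every class has the same width
breaks = y.min() + w * np.arange(1, k + 1)   # upper bound of each class
print(breaks)  # the final break coincides with the maximum value
```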

@@ -215,7 +215,7 @@ While quantiles do avoid the pitfall of sparse classes, this classification is
not problem free. The intervals can vary markedly in width,
which can lead to problems of interpretation. A second challenge facing quantiles
arises when there are a large number of duplicate values in the distribution
such that the limits for one or more classes become ambiguous. For example, if one had a variable with $n=20$ but 10 of the observations took on the same value, which was the minimum observed, then for values of $k>2$ the class boundaries become ill-defined, since a simple rule of splitting at the $n/k$-th ranked observed value would depend upon how ties are treated when ranking.
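To see the tie problem concretely, the sketch below builds a made-up series matching the $n=20$ example and asks `mapclassify` for quartiles; the tied minima make it impossible to place roughly $n/k$ observations in each class:

```python
import numpy as np
import mapclassify

# n = 20, with 10 observations tied at the minimum value
y = np.array([2] * 10 + [11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
q4 = mapclassify.Quantiles(y, k=4)
print(q4.counts)  # far from n/k = 5 per class: the tied minima flood the first class
```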

#### Mean-standard deviation

@@ -236,20 +236,32 @@ msd

This classifier is best used when data is normally distributed or, at least, when the sample mean is a meaningful measure to anchor the classification around. Clearly this is
not the case for our income data as the positive skew results in a loss of
information when we use the standard deviation. The lack of symmetry leads to
an inadmissible upper bound for the first class as well as a concentration of the
vast majority of values in the middle class.
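The failure mode is easy to reproduce in a sketch (synthetic, positively skewed values standing in for the income data; the numbers are made up):

```python
import numpy as np

np.random.seed(42)
y = np.random.lognormal(mean=9, sigma=1, size=1000)  # synthetic, positively skewed
mu, s = y.mean(), y.std()
breaks = [mu - 2 * s, mu - s, mu + s, mu + 2 * s]
# The upper bound of the first class falls below the observed minimum
# (indeed below zero), so that class can never be populated.
print(breaks[0], y.min())
```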

#### Maximum Breaks

The maximum breaks classifier decides where to set the break points between
classes by considering the difference between sorted values. That is, rather
than considering a value of the dataset in itself, it looks at how far apart each
value is from the next one in the sorted sequence. The classifier then places
the $k-1$ break points between the pairs of values most stretched apart from
each other in the entire sequence, proceeding in descending order relative to
the size of the breaks:

```python
mb5 = mapclassify.Maximum_Breaks(mx['PCGDP1940'], k=5)
mb5
```

Maximum breaks is an appropriate approach when we are interested in making sure
observations in each class are separated from those in neighboring classes. As
such, it works well in cases where the distribution of values is not unimodal.
In addition, the algorithm is relatively fast to compute. However, its
simplicity can sometimes cause unexpected results. Because it only
considers the top $k-1$ differences between consecutive values, other more nuanced
within-group differences and dissimilarities can be ignored.
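To make the mechanics concrete, here is a sketch of the rule in plain NumPy (the array `y` is made up; `mapclassify.Maximum_Breaks`, used above, is the canonical implementation):

```python
import numpy as np

def maximum_breaks(y, k=5):
    """Cut at the midpoints of the k-1 widest gaps between sorted values."""
    ys = np.sort(y)
    gaps = np.diff(ys)                           # distance between consecutive values
    idx = np.argsort(gaps)[-(k - 1):]            # positions of the k-1 largest gaps
    cuts = np.sort((ys[idx] + ys[idx + 1]) / 2)  # break point at each gap's midpoint
    return np.append(cuts, ys[-1])               # close the last class at the maximum

y = np.array([1, 2, 3, 10, 11, 12, 50, 51, 52, 100])
print(maximum_breaks(y, k=4))  # [  6.5  31.   76.  100. ]
```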

#### Box-Plot

@@ -283,7 +295,7 @@ neighboring internal classes.

#### Head Tail Breaks

The head tail algorithm, introduced by Jiang (2013), is based on a recursive partitioning of the data using splits around
iterative means. The splitting process continues until the distributions within each of
the classes no longer display a heavy-tailed distribution in the sense that
there is a balance between the number of smaller and larger values assigned to
each class.
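A sketch of the recursion follows. The stopping rule below, requiring the head to remain a minority share of the current subset, is one common reading of Jiang's heavy-tail condition; both it and the 40% threshold are assumptions here, not taken from the text:

```python
import numpy as np

def head_tail_breaks(y, head_share=0.40):
    """Split at the mean and recurse on the head (values above the mean)
    while the head stays a small share of the current subset."""
    subset = np.asarray(y, dtype=float)
    breaks = []
    while True:
        mu = subset.mean()
        breaks.append(mu)
        head = subset[subset > mu]
        if len(head) == 0 or len(head) / len(subset) >= head_share:
            break
        subset = head
    return breaks

y = np.array([1, 1, 2, 2, 3, 5, 8, 13, 21, 55])  # made-up heavy-tailed values
print(head_tail_breaks(y))  # roughly [11.1, 29.67, 55.0]
```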
@@ -349,12 +361,12 @@ As a special case of clustering,
the number of classes and the class boundaries pose a problem to the map
designer. Recall that the Freedman-Diaconis rule was said to be optimal;
however, the optimality necessitates the specification of an objective function.
In the case of Freedman-Diaconis, the objective function is to minimize the
difference between the area under estimated kernel density based on the sample
and the area under the theoretical population distribution that generated the
sample.

This notion of statistical fit is an important one. However, it is not the
only consideration when evaluating classifiers for the purpose of choropleth
mapping. Also relevant is the spatial distribution of the attribute values and
the ability of the classifier to convey a sense of that spatial distribution. As
@@ -632,11 +644,23 @@ choropleth maps. Readers interested in pursuing this literature are encouraged
to see the references cited.

At the same time, given the philosophy underlying PySAL, the methods we cover
here are sufficient for exploratory data analysis where the rapid and flexible
generation of views is critical to the workflow. Once the analysis is complete,
and the final presentation quality maps are to be generated, there are excellent
packages in the data stack that the user can turn to.

## Questions

1. A variable such as population density measured for census tracts in a metropolitan area can display a high degree of skewness. What is an appropriate choice for a choropleth classification for such a variable?
2. Provide two solutions to the problem of ties when applying quantile classification to the following series: $y=[2,2,2,2,2,2,4,7,8,9,20,21]$ and $k=4$. Discuss the merits of each approach.
3. Which classifiers are appropriate for data that displays a high degree of multi-modality in its statistical distribution?
4. Contrast and compare classed choropleth maps with class-less choropleth maps. What are the strengths and limitations of each type of visualization for spatial data?
5. In what ways do choropleth classifiers treat intra-class and inter-class heterogeneity differently? What are the implications of these choices?
6. To what extent do most commonly employed choropleth classification methods take the geographical distribution of the variable into consideration? Can you think of ways to incorporate the spatial features of a variable into a classification?
7. Discuss the similarities between the choice of the number of classes in choropleth mapping, on the one hand, and the determination of the number of clusters in a data set on the other. What aspects of choropleth mapping differentiate the former from the latter?
8. The Fisher-Jenks classifier will always dominate other k-classifiers for a given data set, with respect to statistical fit. Given this, why might one choose a different k-classifier for a particular data set?


## References

Duque, J.C., L. Anselin, and S.J. Rey. (2012) "The max-p regions problem." *Journal of Regional Science*, 52:397-419.
11 changes: 11 additions & 0 deletions notebooks/10_clustering_and_regionalization.md
@@ -877,6 +877,17 @@ since the spatial structure and covariation in multivariate spatial data is what
determines the spatial structure and data profile of discovered clusters or regions.
Thus, clustering and regionalization are essential tools for the spatial data scientist.

## Questions

1. What disciplines employ regionalization? Cite concrete examples for each discipline you list.
2. Contrast and compare the concepts of *clusters* and *regions*.
3. In evaluating the quality of the solution to a regionalization problem, how might traditional measures of cluster evaluation be used? In what ways might those measures be limited and need expansion to consider the geographical dimensions of the problem?
4. Discuss the implications for the processes of regionalization that follow from the number of *connected components* in the spatial weights matrix that would be used.
5. True or false: The average silhouette score for a spatially constrained solution will be no larger than the average silhouette score for an unconstrained solution. Why, or why not? (add reference and/or explain silhouette)
6. Consider two possible weights matrices for use in a spatially constrained clustering problem. Both form a single connected component for all the areal units. However, they differ in the sparsity of their adjacency graphs (think rook versus queen contiguity). How might this sparsity affect the quality of the clustering solution?
7. What are the challenges and opportunities that spatial dependence pose for spatial cluster formation?
8. In other areas of spatial analysis, the concept of multilevel modeling (cites) exploits the hierarchical nesting of spatial units at different levels of aggregation. How might such nesting be exploited in the implementation of regionalization algorithms? What are some possible limitations or challenges that such nesting imposes on obtaining a regionalization solution?
9. Using a spatial weights object obtained as `w = libpysal.weights.lat2W(20,20)`, how many unique ways are there to partition the graph into 20 clusters of 20 units each, subject to each cluster being a connected component? How many unique possibilities are there for `w = libpysal.weights.lat2W(20,20, rook=False)`?

---

2 changes: 1 addition & 1 deletion notebooks/booktools.py
@@ -54,4 +54,4 @@ def choropleth(df, column, scheme='Quantiles', k=5, cmap='BluGrn', legend=False,
        edgecolor=edgecolor, linewidth=linewidth, \
        alpha=alpha, ax=ax)
    return ax

