<div style="float:left">
    <h1 style="width:500px">Live Coding 9: Grouping Data</h1>
    <h2 style="width:450px">Dealing with similarity and difference</h2>
</div>
<div style="float:right"><img width="100" src="https://github.com/jreades/i2p/raw/master/img/casa_logo.jpg" /></div>

## Week 9: Thoughts

- First, you should be taking away the *fundamental* links between what we do when we make maps and what we do when we make plots in order to *understand* our data or models. Classifying observations based on the distribution is conceptually very closely aligned with classifying observations based on a label.
- Second, you should be taking away the *difference* between classification and clustering (see [Problem Domains](#Problem-Domains)); why one is called 'supervised' and the other 'unsupervised'.
- Third, you should be bearing in mind the *difference* between what we might call top-down and bottom-up clustering algorithms.
- Fourth, you should *understand* that how you present and understand your data is *inseparable* from how you organise the analysis in terms of classification, clustering, and so forth.
- Learning to code:
  - As I have said before: doing data science is about judgement.
  - Sometimes the practicals show techniques because I want you to see how they work, not because they are always the best choice on the data we're looking at.
  - For instance: 
    - We've seen many ways of potentially cleaning the InsideAirbnb data (outliers based on price per night, descriptive text or its absence), but I've never put it all together in one place nor have I ever attempted to reconcile all the different steps.
    - I've also performed *both* PCA *and* UMAP dimensionality reduction and this week you'll see two clustering algorithms as well as a classifier, but I've not (and won't) attempt to tell you which is 'correct' -- there are simply choices to make and justify.
    - In the practical this week there is a question about using the *mean* and the *median* price by MSOA as part of the clustering process: I ask why it *might* be useful to feed both of these into the clustering algorithm. Note the emphasis on *might*. There are also some very compelling reasons *not* to do this: we're effectively giving pricing information *two votes* in the cluster assignment process (first the mean, then the median), and this risks outweighing everything else we know from the UMAP and PCA dimensionality reduction process. The impact is particularly strong relative to the UMAP output because that has only two dimensions: i.e. two votes. *But* you also have to take into account the magnitude of each dimension! This is *hard* and it requires *effort* to think through the implications of your choices around normalisation/standardisation together with what dimensions *matter* to the clustering process!
    - In the practicals we often use these techniques without justifying them analytically or theoretically: this is why I spend so much time asking you to *be* critical in your approach to data! Without that critical perspective, you simply apply stuff 'because it's what we did in class', not because 'it's the right choice given the analytical need and the data we're looking at'.
  - So do not take it as 'given' that if you repeat exactly something that was in a practical then that is the 'right thing to do' -- you absolutely *should* feel free to make use of the code provided in the practicals, but you should think of it as remixing, not copying+pasting. Take what you've learned and work out how to apply it.
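
To make the 'two votes' point concrete, here is a minimal sketch on *synthetic* data. The variable names and distributions are made up for illustration and are not taken from the practical. It assumes a clustering algorithm that relies on Euclidean distance (k-means, for example), in which case each column's share of the total variance is also its share of the average squared distance between observations.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # hypothetical number of MSOAs

# Synthetic stand-ins: a two-dimensional embedding (UMAP-like) plus two price
# summaries that are strongly related to one another.
embedding = rng.normal(0, 1, size=(n, 2))
mean_price = rng.lognormal(mean=4.5, sigma=0.5, size=n)
median_price = mean_price * rng.uniform(0.7, 1.0, size=n)

X = np.column_stack([embedding, mean_price, median_price])

def price_share(X, price_cols):
    """Share of total variance (and so of average squared Euclidean
    distance) contributed by the columns listed in price_cols."""
    variances = X.var(axis=0)
    return variances[price_cols].sum() / variances.sum()

def standardise(X):
    """Rescale every column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

raw_share = price_share(X, [2, 3])                          # both prices, raw units
std_share = price_share(standardise(X), [2, 3])             # both prices, standardised
one_price_share = price_share(standardise(X[:, :3]), [2])   # mean price only

print(f"Raw units, mean + median:    {raw_share:.3f}")
print(f"Standardised, mean + median: {std_share:.3f}")
print(f"Standardised, mean only:     {one_price_share:.3f}")
```

In raw units the two price columns account for nearly all of the distance between observations; after standardisation they account for exactly half (two columns out of four), and dropping the median brings that down to a third. None of these is 'correct': each is a choice that you would need to justify.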

## Assessments 2 & 3

- BibTeX
  - Start with the [one in my repo](https://github.com/jreades/fsds/blob/master/bib/Readings.bib).
  - Can open in BibDesk (Mac) or JabRef (x-platform)
  - Adding:
    - Google Scholar, then `Import into BibTeX`, then select and copy the record from the browser pop-up.
    - With BibDesk: `Publication` > `New Publication from Clipboard`
    - With JabRef: `New Article` (`+`) > `BibTex source` > `Paste`
    - **Do not assume that the Google Scholar entry is 100% correct, and you may want to change the Citation Key!**
- External py and data files
  - The textual library download offers a model for doing this with code. 
  - Use OneDrive or Dropbox for additional data unless you are *sure* that direct downloads will work.
    - In the description/comments indicate the original source of your data.
    - Link to documentation/DTD where appropriate.
    - Do not put in Git!
  - Be considerate: if you use external data don't repeatedly download large data sets
    - Either save to your own Dropbox/OneDrive (see above)
    - Or only download *once* and then check for its existence locally (again, you have code to do this already!)
- Adding 'new' libraries
  - Can use a try/except block to be user-friendly
  - May want to check what version (`<module>.__version__` usually works)
- Adding others to a repo
  - You will need to set up a project repo for the group submission.
- Conflicts in Git
  - Easy to deal with in text and PY files, hard in Notebooks, impossible in Word or other binary formats.
  - See some examples for how to find what's 'wrong' [here](https://stackoverflow.com/questions/1800783/how-to-compare-a-local-git-branch-with-its-remote-branch) (*hint*: `git fetch` then `git diff master origin/master` [or main if that's how you set it up]).
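
The 'download once, then check locally' advice above can be sketched as follows. This is not the function from the practicals: `cache_data` and its parameters are hypothetical names, and the sketch assumes that the file name can be taken from the last part of the URL's path.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def cache_data(url: str, dest_dir: str = "data") -> Path:
    """Download url into dest_dir unless a non-empty local copy already exists.

    Hypothetical helper for illustration; returns the local path either way.
    """
    dest = Path(dest_dir) / Path(urlparse(url).path).name
    if dest.exists() and dest.stat().st_size > 0:
        print(f"Found {dest} locally, skipping download.")
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    print(f"Downloading {url} to {dest}")
    urlretrieve(url, dest)
    return dest

# Usage (placeholder URL, replace with your own source):
# path = cache_data("https://example.com/path/to/listings.csv.gz", "data")
```

The second and later calls with the same URL return the local copy without touching the network, which is the considerate behaviour for large data sets. Note that this also means a changed remote file will *not* be picked up unless you delete the local copy.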

## Other Admin

- Initial feedback on audit
- Readings:
  - D’Ignazio and Klein (2020), chap. 3, On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints <[URL](https://ucl.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma9931206723604761&context=L&vid=44UCL_INST:UCL_VU2&lang=en&search_scope=UCLLibraryCatalogue&adaptor=Local%20Search%20Engine&isFrbr=true&tab=UCLLibraryCatalogue&query=any,contains,D%5C%27Ignazio%20Data%20Feminism&sortby=date_d&facet=frbrgroupid,include,9041340239229546206&offset=0)>
  - Badger, Bui, and Gebeloff (2019) <[URL](https://www.nytimes.com/interactive/2019/04/27/upshot/diversity-housing-maps-raleigh-gentrification.html)>
  - Massey (1996) <[URL](https://www.tandfonline.com/doi/abs/10.1080/14702549608554458)>

## Problem Domains

|       | Continuous | Categorical |
| :---- | :--------- | :---------- |
| **Supervised** | Regression | Classification |
| **Unsupervised** | Dimensionality Reduction | Clustering |


## Installing a Module

In [3]:
try:
    import foo
except ModuleNotFoundError:
    ! pip install foo

ERROR: Could not find a version that satisfies the requirement foo (from versions: none)
ERROR: No matching distribution found for foo
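
The cell above fails because `foo` is not a real package, but the pattern is the useful part. A slightly more user-friendly variant, which also reports the version, might look like the sketch below. `check_module` is a hypothetical helper name; it falls back to `importlib.metadata` when a module has no `__version__` attribute, and it deliberately prints a hint rather than installing anything on the user's behalf.

```python
import importlib
from importlib import metadata

def check_module(name: str):
    """Return (module, version) if name can be imported, else (None, None).

    version is None when the module exposes no __version__ and no installed
    distribution of the same name can be found.
    """
    try:
        mod = importlib.import_module(name)
    except ModuleNotFoundError:
        print(f"'{name}' is not installed; try `pip install {name}` "
              "(the package name on PyPI may differ from the module name).")
        return None, None
    version = getattr(mod, "__version__", None)
    if version is None:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = None
    return mod, version

mod, version = check_module("json")
print(f"json imported: {mod is not None}, version: {version}")
```

In a notebook you could then decide what to do when the module is missing or the version is too old, instead of letting a later cell fail with a less helpful error.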