# Geoclustering
A Command Line tool to cluster geolocated events and enable pattern analysis.

Read [more about the tool's features and goals](https://github.com/bellingcat/geoclustering).

### Step 1 - install the python package
You can do so with the command `!pip install geoclustering[full]`. In the cell below we escape the brackets (`\[` and `\]`) with a backslash, because of how Jupyter notebooks interpret the command.

In [None]:
!pip install geoclustering\[full\]

### Step 2 - Get a full command line description by calling `help` 
This shows all the available command line options and what each of them does.

In [5]:
!geoclustering --help

Usage: geoclustering [OPTIONS] FILENAME

  Tool to cluster geolocations. A cluster is created when a certain number of
  points (defined with --size) each are within a given distance (defined with
  --distance) of at least one other point in the cluster. Input is supplied as
  a csv file. At a minimum, each row needs to have a 'lat' and a 'lon' column.
  Other rows are reflected to the output.

Options:
  -d, --distance FLOAT            (in km) Max. distance between two points in
                                  a cluster.  [required]
  -s, --size INTEGER              Min. number of points in a cluster.
                                  [required]
  -o, --output PATH               Output directory for results. Default:
                                  ./output
  -a, --algorithm [dbscan|optics]
                                  Clustering algorithm to be used. `optics`
                                  produces tighter clusters but is slower.
                                  Default:

### Step 3 - fetch our data in CSV format

As an example we're using Bellingcat's Civilian Harm in Ukraine dataset.

You can download a ready copy of the data from 2024-01-17 from the [repository](https://github.com/bellingcat/open-source-research-notebooks/blob/main/notebooks/bellingcat/civharm.csv).

Alternatively, you can download it from [ukraine.bellingcat.com](https://ukraine.bellingcat.com/), as shown in this image:

![image](https://github.com/bellingcat/geoclustering/assets/19508417/d7d59372-cf26-4692-ba3c-fc29f9152aae)

Save the file to your computer as `civharm.csv`, and change the column names "latitude" to "lat" and "longitude" to "lon".
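
If you'd rather not rename the columns by hand, the step above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not part of the tool: the function name `rename_coordinate_columns` is made up for this example, and it assumes the downloaded file has header cells named exactly `latitude` and `longitude`.

```python
import csv


def rename_coordinate_columns(src_path, dst_path):
    """Copy a CSV, renaming the 'latitude'/'longitude' headers to 'lat'/'lon'.

    Hypothetical helper for this notebook; all other columns are copied as-is.
    """
    renames = {"latitude": "lat", "longitude": "lon"}
    # utf-8-sig transparently drops a byte-order mark if the download has one.
    with open(src_path, newline="", encoding="utf-8-sig") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        header = next(reader)
        # Only the header row changes; unknown names are kept unchanged.
        writer.writerow([renames.get(name, name) for name in header])
        writer.writerows(reader)
```

For example, if you saved the download as `civharm_download.csv` (a placeholder name), calling `rename_coordinate_columns("civharm_download.csv", "civharm.csv")` would write the renamed copy next to it.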


Once your `civharm.csv` file is ready, upload it to the notebook folder. The steps differ between Google Colab, Binder, etc., but it should always be straightforward: look for the folder/upload button and select your CSV file.

Run the cell below to check that the file is available. The `head` command prints the first lines of a file, so you can confirm that your CSV looks good.

In [7]:
!head civharm.csv

"id","date","lat","lon","location","description","sources","associations"
"CIV0098","02/24/2022","49.212119","38.905921","","Individual injured by shelling, ambulance responds.","https://www.facebook.com/story.php?story_fbid=317408247086919&id=100064532388988","Type of area affected=Residential,Weapon System=Unknown"
"CIV0013","02/24/2022","48.055395","37.7783","","Apparent strike on hospital in separatist held area.","https://twitter.com/City_Donetsk/status/1496877801689038859","Type of area affected=Healthcare,Weapon System=Unknown"
"CIV0004","02/24/2022","47.775609","37.239673","","Explosion in central Kyiv, nothing further yet.","https://twitter.com/N_Waters89/status/1496856695896969222,https://twitter.com/IBN_MOHAMMAD/status/1496778855863963650,https://twitter.com/amnestypress/status/1496872501447700481,https://twitter.com/TiranaHassan/status/1497348792038924290,https://www.hrw.org/news/2022/02/25/ukraine-russian-cluster-munition-hits-hospital","Type of area affected=Healthcare,We

### Step 4 - call the tool
You can look at the output of the `help` command to understand the tool's options. In the example below we're looking for clusters of at least 2 incidents (`--size 2`) within a distance of 750 m of each other (`--distance 0.75`, in km).

Adding the `--open` flag opens the map visualization of your clusters as soon as the tool finishes.
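
To get a feel for what a `--distance` of 0.75 km means for your data, you can compute the distance between two coordinate pairs yourself. The sketch below uses the haversine (great-circle) formula. That is an assumption made for illustration: the help text does not say which distance metric the tool uses internally, and `haversine_km` is a name invented for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates of the first two incidents in the sample CSV (CIV0098 and CIV0013).
d = haversine_km(49.212119, 38.905921, 48.055395, 37.7783)
print(f"{d:.1f} km apart; within 0.75 km: {d <= 0.75}")
```

These two incidents are well over 100 km apart, so on their own they would not be linked at this distance setting.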

In [9]:
!geoclustering --size 2 --distance 0.75 civharm.csv

Removed 1 empty coordinate pairs.
Output files saved to /home/m/dev/bcat/open-source-research-notebooks/notebooks/bellingcat/output
Clustering completed.


You should see a `Clustering completed.` message and the path to the output folder, by default `./output` inside the folder you're in now.

This folder contains:
- a `result.csv` with the cluster information
- a `result.geojson` with the same information in a mapping-friendly format (see [GeoJSON](https://geojson.org/))
- a `result.json`
- a `result.txt`
- a `result.html`

All formats except the `.html` can easily be further analysed or transformed. 
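
As one example of further analysis, the sketch below counts how many incidents fall into each cluster in `result.csv`. It is only a starting point under a stated assumption: the column that holds the cluster label is taken to be `cluster_id`, which is not confirmed by anything shown in this notebook, so check the header of your own `result.csv` and pass the right name. The function name `cluster_sizes` is also made up for this example.

```python
import csv
from collections import Counter


def cluster_sizes(csv_path, cluster_column="cluster_id"):
    """Count rows per cluster label in a results CSV.

    `cluster_column` is an assumed column name; adjust it to match your file.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        if cluster_column not in fieldnames:
            raise KeyError(
                f"no {cluster_column!r} column; columns found: {fieldnames}"
            )
        return Counter(row[cluster_column] for row in reader)
```

Calling `cluster_sizes("output/result.csv").most_common(5)` would then list the five largest clusters, assuming the default output folder.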

You can download the `.html` (or open it directly if you're running this notebook on your computer) and see an interactive Kepler.gl map where you can visually explore the existing clusters, like this one:

![image](https://github.com/bellingcat/geoclustering/assets/19508417/20d95122-8b14-4795-baed-2c4b556f2196)


That's it, you can now explore different datasets and configurations for the tool. 

For instance, you can change the clustering algorithm with `--algorithm optics`, or play around with the cluster `--size` or the maximum `--distance` between the incidents that are grouped into a cluster.
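
If you want to compare several configurations side by side, one option is to run the tool in a loop and write each result to its own folder. The sketch below is one way to do that from Python, using only the flags listed in the `--help` output above. The helper names `build_command` and `run_sweep` and the output folder naming scheme are made up for this example.

```python
import subprocess


def build_command(filename, size, distance, algorithm, output_dir):
    """Build the geoclustering argument list for one configuration."""
    return [
        "geoclustering",
        "--size", str(size),
        "--distance", str(distance),
        "--algorithm", algorithm,
        "--output", output_dir,
        filename,
    ]


def run_sweep(filename, sizes, distances, algorithm="dbscan"):
    """Run the tool once per (size, distance) pair, each into its own folder."""
    for size in sizes:
        for distance in distances:
            # Hypothetical naming scheme so results do not overwrite each other.
            output_dir = f"output_s{size}_d{distance}"
            command = build_command(filename, size, distance, algorithm, output_dir)
            # check=True raises CalledProcessError if the tool exits with an error.
            subprocess.run(command, check=True)
```

For example, `run_sweep("civharm.csv", sizes=[2, 5], distances=[0.5, 0.75])` would produce four output folders, one per combination.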