## Expedition Clustering Full Dataset Labeling

To verify that our clustering works, let's identify and hand-label a set of expeditions: 10 to start, and more if we deem it necessary.

While we're at it, we can get a sense of the spatiotemporal separations (epsilon, or ϵ) and the Levenshtein distances to expect within an expedition, so we can use them as clustering parameters later.
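Since Levenshtein distance comes up again when comparing locality names, here is a minimal, dependency-free sketch of it. The helper names `levenshtein` and `normalized_levenshtein` are illustrative assumptions and not part of the notebook; a dedicated string-distance library would likely be faster on the full dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Scale the distance to [0, 1] by the longer string's length (hypothetical helper)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Two locality names that appear in the data for the same coordinates
name_a = "Parker Creek."
name_b = "Parker Creek, Warner Mts."
print(levenshtein(name_a, name_b), round(normalized_levenshtein(name_a, name_b), 2))
```

The normalized form is one way to compare short and long locality names on the same scale; whether it suits this dataset better than the raw distance is an open question.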

In [146]:
import numpy as np
import pandas as pd

In [147]:
df = pd.read_csv("../data/full_df.csv", on_bad_lines = "skip", index_col=None)
df.columns = map(str.lower, df.columns)

In [148]:
# Drop the first two columns (the leftover `unnamed: 0` and `index` columns)
df = df.drop(df.columns[0], axis=1)
df = df.drop(df.columns[0], axis=1)

In [149]:
df

Unnamed: 0,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,enddate,remarks,localityid,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid
0,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,2004-10-29,,Tsuga dumosa forest mixed with elements of sub...,66157.0,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0
1,9,Perennial herb.,2.0,31111.0,,,,2003-08-27,,Remnant disturbed secondary forest on steep sl...,31111.0,1630.0,,,24.882559,98.713638,"Tenglan Cun, Lanniba He. W side of Gaoligong S...","Tenglan Cun, Lanniba He. W side of Gaoligong S...",33255.0
2,11,"arbre de 8m, fruits jaunes, rouge au sommet, p...",1.0,58265.0,,,,2004-12-10,,"foret dense humide de moyenne altitude, vegeta...",58265.0,623.0,,,-14.435100,49.767502,"Parc National de Marojejy, commune rurale de M...","Parc National de Marojejy, commune rurale de M...",27772.0
3,20,Tree ca. 8 m tall. Young fruits green.,1.0,105773.0,153980.0,153981.0,2606ab29-8b8c-46df-b3bb-03587d23a2aa.jpg,2002-07-24,2002-07-24,Evergreen broad-leaved forest.,105773.0,2300.0,2300.0,0.0,27.877222,98.335556,"Kongdang, W side of Gaoligong Shan, along the ...","Kongdang, W side of Gaoligong Shan, along the ...",33272.0
4,24,,1.0,19002.0,365549.0,365652.0,8cd186c3-3176-48ac-858e-fe82ddb98af9.jpg,1935-06-08,1935-06-08,,19002.0,,,,41.932200,-123.831000,Old road from Patrick's Creek into Oregon.,,9217.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44800,924772,Perenne; tallos y base del cáliz rojo; pétalos...,,761041.0,195782.0,195868.0,7eea1486-713e-43ac-a3b5-0963ac271e78.jpg,1993-04-08,,Matorral de Baccharis vaccinioides; asociado c...,74797.0,2430.0,,,16.670000,-92.532500,Mitziton,,28289.0
44801,936416,,,772288.0,,,,1997-06-01,,"Crevice of vertical rock surface, exposed. Rip...",100664.0,6000.0,,,41.491695,-119.499161,"Upper High Rock Canyon, 0.25 mi south of Steve...",,26961.0
44802,941113,,1.0,776832.0,411761.0,411865.0,2d4cdc16-d02c-4101-b46d-57df96182d32.jpg,1940-06-13,,,100244.0,,,,41.733299,-120.370796,Near Davis Creek.,,17158.0
44803,941114,,1.0,776833.0,411762.0,411866.0,97952295-350f-441e-a3e0-7b604e423e68.jpg,1940-06-13,,,100244.0,,,,41.733299,-120.370796,Near Davis Creek.,,17158.0


In [150]:

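# Parse the date columns into datetimes (missing values become NaT)
df['startdate'] = pd.to_datetime(df['startdate'])
df['enddate'] = pd.to_datetime(df['enddate'])

# Convert to seconds since the Unix epoch; NaT comes out as a large
# negative sentinel (about -9.22e9), so it must not be read as a real date
df['startdate_num'] = df['startdate'].view(int)//1e9
df['enddate_num'] = df['enddate'].view(int)//1e9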



In [151]:
df[['latitude1', 'longitude1', 'localityname', 'startdate_num', 'enddate_num']].head(5)

Unnamed: 0,latitude1,longitude1,localityname,startdate_num,enddate_num
0,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,1099008000.0,-9223372000.0
1,24.882559,98.713638,"Tenglan Cun, Lanniba He. W side of Gaoligong S...",1061942000.0,-9223372000.0
2,-14.4351,49.767502,"Parc National de Marojejy, commune rurale de M...",1102637000.0,-9223372000.0
3,27.877222,98.335556,"Kongdang, W side of Gaoligong Shan, along the ...",1027469000.0,1027469000.0
4,41.9322,-123.831,Old road from Patrick's Creek into Oregon.,-1090886000.0,-1090886000.0
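Note the `-9223372000.0` values in `enddate_num`: they belong to rows with no end date, and they are consistent with NaT's underlying integer (the int64 minimum) divided by 1e9. Fed into a distance-based clusterer, that sentinel would look like an enormous temporal separation. Here is a minimal plain-Python sketch of the same conversion that keeps missing dates as `None`; the helper name `to_epoch_seconds` is an assumption made for illustration.

```python
from datetime import date
from typing import Optional

EPOCH = date(1970, 1, 1)


def to_epoch_seconds(iso_date: Optional[str]) -> Optional[int]:
    """Seconds from the Unix epoch to midnight of an ISO date; None if the date is missing."""
    if not iso_date:
        return None
    days = (date.fromisoformat(iso_date) - EPOCH).days
    return days * 86_400  # dates before 1970 come out negative


print(to_epoch_seconds("2004-10-29"))  # a start date from the table above
print(to_epoch_seconds(None))          # a missing end date stays missing
```

On the dataframe itself, one option would be to mask the numeric column wherever the datetime column is null before using it for clustering.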


In [152]:
import geopy.distance

def get_dist_from_latlon(ll1, ll2):
    # Geodesic distance in kilometers between two (latitude, longitude) pairs
    return geopy.distance.geodesic(ll1, ll2).km


## Let's do some manual expedition cluster labeling!

First, let's look at the distances between records that appear to belong to a single cluster, and find the maximum separation among a few records that seem to belong _together_.

Let's start with a small radius, then increase it.
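To get a feel for what these radii mean on the ground, here is a rough sketch that converts a latitude difference in degrees to a north-south distance. It assumes a spherical Earth with the mean radius given below, which is an approximation and not a value taken from this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def lat_delta_to_km(delta_deg: float) -> float:
    """Approximate north-south distance covered by a latitude difference in degrees."""
    return math.radians(delta_deg) * EARTH_RADIUS_KM


for delta in (0.00001, 0.0001, 0.001):
    print(f"dlat = {delta:g} deg is roughly {lat_delta_to_km(delta) * 1000:.1f} m")
```

Under this approximation the three thresholds correspond to bands of roughly 1 m, 11 m, and 111 m. A latitude band says nothing about longitude, so records that pass the filter can still be far apart east to west.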

## ∆lat = 0.00001

In [188]:
samp_collobjid_l = list(df.collectionobjectid.sample(10, random_state=123))
samp_collobjid_l

[325334, 203161, 275379, 189769, 306553, 305052, 53269, 124799, 104780, 177087]


In [189]:
samp_record = df[df.collectionobjectid == 325334]
samp_record

Unnamed: 0,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,enddate,remarks,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
37784,325334,,1.0,152848.0,470470.0,470574.0,c5d21bad-9129-4b37-9df6-e8a9dc84cc05.jpg,1934-06-11,1934-06-11,,...,5200.0,5200.0,0.0,41.4506,-120.3595,Parker Creek.,,17158.0,-1122163000.0,-1122163000.0


In [190]:
samp_record_lat = samp_record.latitude1.values[0]
samp_record_lat

np.float64(41.4506)

In [191]:
test_df = df[abs(df.latitude1 - samp_record_lat) < 0.000001].reset_index(drop=True)
test_df

Unnamed: 0,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,enddate,remarks,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
0,720,,1.0,138443.0,435603.0,435707.0,a894352c-bb8d-4f35-be78-3dab5ad7a6cf.jpg,1934-06-16,1934-06-16,,...,5200.0,5200.0,0.0,41.4506,-120.359500,"Parker Creek, Warner Mountains",,17158.0,-1.121731e+09,-1.121731e+09
1,1692,,1.0,103525.0,394164.0,394268.0,9e63a2de-e631-44c2-858e-d5f7ded2e684.jpg,1919-06-13,1919-06-13,,...,,,,41.4506,-120.359500,Parker Creek near Modoc National Forest Boundary.,,17158.0,-1.595376e+09,-1.595376e+09
2,1961,,1.0,24078.0,370082.0,370185.0,464afdd4-2153-4e6f-84c5-b169704ff57a.jpg,1932-06-10,1932-06-10,Under junipers.,...,5200.0,5200.0,0.0,41.4506,-120.359500,Parker Creek.,,17158.0,-1.185322e+09,-1.185322e+09
3,4612,,1.0,10156.0,288351.0,288454.0,fb10ee38-e556-436b-90a0-a3251d8d4ad3.jpg,1934-06-11,1934-06-11,,...,5200.0,5200.0,0.0,41.4506,-120.359500,Parker Creek.,,17158.0,-1.122163e+09,-1.122163e+09
4,4999,,1.0,91016.0,524941.0,525045.0,9f9c3a53-7d74-485d-9656-f13351091bf9.jpg,1934-06-11,1934-06-11,,...,5200.0,5200.0,0.0,41.4506,-120.359500,Parker Creek.,,17158.0,-1.122163e+09,-1.122163e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,377637,,1.0,118781.0,436271.0,436375.0,9a926f62-5b58-474d-a6fe-166336facd2f.jpg,1934-06-11,1934-06-11,,...,5200.0,5200.0,0.0,41.4506,-120.359500,"Parker Creek, Warner Mountains",,17158.0,-1.122163e+09,-1.122163e+09
223,380502,,1.0,157854.0,457933.0,458037.0,b3465598-1faf-4518-8392-d9e3612dea8d.jpg,1929-06-28,NaT,,...,5200.0,,,41.4506,-120.359497,Parker Creek.,,17158.0,-1.278461e+09,-9.223372e+09
224,381494,,1.0,123011.0,152320.0,152321.0,0a7fee08-6e81-437a-bfc2-621dd0ac7777.jpg,1931-06-11,1931-06-11,,...,5200.0,5200.0,0.0,41.4506,-120.359500,"Parker Creek, Warner Mts.",,17158.0,-1.216858e+09,-1.216858e+09
225,381540,,1.0,24845.0,524195.0,524299.0,52e3db33-adff-4c7c-b7da-97aa006d6c36.jpg,1934-06-11,1934-06-11,,...,5200.0,5200.0,0.0,41.4506,-120.359500,"Parker Creek, Warner Mts.",,17158.0,-1.122163e+09,-1.122163e+09


In [194]:
test_df.startdate.value_counts()

startdate
1934-06-11    60
1919-06-14    33
1919-06-13    20
1919-06-15    13
1932-06-10    11
1934-06-16     8
1988-05-13     8
1931-06-17     6
1940-06-01     5
1932-06-22     3
1988-05-24     3
1919-06-03     3
1931-06-11     3
1932-06-24     2
1931-06-10     2
1931-06-20     2
1932-05-27     2
1932-07-22     2
1932-07-11     2
1919-06-16     2
1931-06-12     2
1932-04-22     2
1932-08-01     2
1935-07-19     2
1932-06-27     2
1932-06-26     2
1989-06-14     1
1931-06-01     1
1931-09-18     1
1988-06-11     1
1919-06-20     1
1932-07-14     1
1919-06-17     1
1929-07-01     1
1935-07-05     1
1935-08-23     1
1989-07-14     1
1935-06-11     1
1935-08-25     1
1931-06-18     1
1929-06-12     1
1932-07-08     1
1929-06-29     1
1931-06-08     1
1929-06-23     1
1981-05-22     1
1932-07-03     1
1934-06-14     1
1929-07-20     1
1932-07-18     1
1929-06-28     1
Name: count, dtype: int64
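The start dates above bunch into short runs (for example several consecutive days in June 1919) separated by gaps of years. One way to see that structure is to sort the distinct dates and split wherever the gap exceeds a threshold. This is a sketch only: the function name `split_by_gap` and the 7-day default are assumptions for illustration, not parameters chosen by this analysis.

```python
from datetime import date


def split_by_gap(iso_dates, max_gap_days=7):
    """Group sorted, distinct dates into runs whose consecutive gaps are <= max_gap_days."""
    days = sorted({date.fromisoformat(d) for d in iso_dates})
    runs = []
    for d in days:
        if runs and (d - runs[-1][-1]).days <= max_gap_days:
            runs[-1].append(d)   # close enough to the previous date: same run
        else:
            runs.append([d])     # gap too large: start a new run
    return runs


# A handful of start dates taken from the value counts above
sample = ["1934-06-11", "1919-06-14", "1919-06-13", "1919-06-15",
          "1932-06-10", "1934-06-16", "1988-05-13", "1919-06-16"]
for run in split_by_gap(sample):
    print(run[0], "to", run[-1], f"({len(run)} distinct dates)")
```

The gap at which two runs stop being one expedition is exactly the temporal part of ϵ that this labeling exercise is meant to inform.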

In [185]:
test_df[test_df.localityid == 55313.0]

Unnamed: 0,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,enddate,remarks,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
91,164152,,1.0,55313.0,208128.0,208231.0,69c51c3e-80ba-46bc-82b8-e1262a7f3509.jpg,1931-06-17,1931-06-17,,...,5200.0,5200.0,0.0,41.4506,-120.3595,Parker Creek.,,17158.0,-1216339000.0,-1216339000.0
92,164152,,1.0,55313.0,208129.0,208232.0,22be2f5f-84ce-44de-b4bc-450378f4d8f8.jpg,1931-06-17,1931-06-17,,...,5200.0,5200.0,0.0,41.4506,-120.3595,Parker Creek.,,17158.0,-1216339000.0,-1216339000.0
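The two rows above share `collectionobjectid` 164152 and differ only in their attachment columns, which suggests that a specimen with several images appears once per attachment. If that holds, counts per expedition would be inflated unless rows are collapsed first. A minimal plain-Python sketch follows, with a hypothetical helper `one_row_per_specimen`; on the dataframe, `drop_duplicates(subset="collectionobjectid")` would be the analogous call.

```python
# Three rows reduced to the columns that matter here; values are from the tables above
rows = [
    {"collectionobjectid": 164152, "attachmentid": 208231.0, "startdate": "1931-06-17"},
    {"collectionobjectid": 164152, "attachmentid": 208232.0, "startdate": "1931-06-17"},
    {"collectionobjectid": 4612, "attachmentid": 288454.0, "startdate": "1934-06-11"},
]


def one_row_per_specimen(records, key="collectionobjectid"):
    """Keep the first row seen for each specimen id, preserving input order."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[key], rec)
    return list(seen.values())


unique_rows = one_row_per_specimen(rows)
print(len(rows), "rows ->", len(unique_rows), "specimens")
```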


In [183]:
test_df.localityid.value_counts()

localityid
55313.0     2
138443.0    1
59156.0     1
131059.0    1
29933.0     1
           ..
162528.0    1
117987.0    1
19796.0     1
77711.0     1
162061.0    1
Name: count, Length: 226, dtype: int64

In [174]:
df.latitude1

0        27.717865
1        24.882559
2       -14.435100
3        27.877222
4        41.932200
           ...    
44800    16.670000
44801    41.491695
44802    41.733299
44803    41.733299
44804    26.060278
Name: latitude1, Length: 44805, dtype: float64

In [153]:
df[df.collectionobjectid == 5]

Unnamed: 0,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,enddate,remarks,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
0,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,2004-10-29,NaT,Tsuga dumosa forest mixed with elements of sub...,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0


In [81]:
test_df = df[abs(df.latitude1 - 27.717865) < 0.00001].reset_index(drop=True)
test_df

Unnamed: 0,unnamed: 0,index,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
0,0,2,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
1,10638,52375,91014,Shrub ca. 3 mm tall. Young fruit green.,8.0,55602.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
2,11705,57552,100050,Tree ca. 7 m tall. Young fruit green.,8.0,86909.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
3,11718,57621,100178,Twining vine. Sepals greenish cream colored.,8.0,100437.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
4,22104,109001,190209,Tree ca. 10 m tall. In bud.,8.0,138740.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
5,26172,128903,225221,Shrub ca. 2.5 mm tall. Fruit green.,8.0,59796.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
6,32043,158119,276235,Shrub ca. 2.5 m tall. Fruit turning black.,8.0,115954.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
7,37487,184907,322679,Terrestrial fern with dimorphic fronds forming...,8.0,41984.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
8,39501,194746,339738,Twining vine. Fruit purple.,2.0,13668.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0


In [82]:
test_df.latitude1

0    27.717865
1    27.717865
2    27.717865
3    27.717865
4    27.717865
5    27.717865
6    27.717865
7    27.717865
8    27.717865
Name: latitude1, dtype: float64

All records share the same start date, elevation, localityname, and geographyid.
### Verdict: same expedition

## ∆lat = 0.0001

In [99]:
test_df = df[abs(df.latitude1 - 27.717865) < 0.0001].reset_index(drop=True)
len(test_df)

19

In [100]:
test_df.head(4)

Unnamed: 0,unnamed: 0,index,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
0,0,2,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
1,8873,43798,76009,Epiphytic fern.,8.0,148643.0,,,,2004-10-29,...,2500.0,2500.0,0.0,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,1099008000.0
2,10638,52375,91014,Shrub ca. 3 mm tall. Young fruit green.,8.0,55602.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
3,11485,56358,97955,Arching shrub. Stems ca. 4 m long. Flowers cre...,8.0,24056.0,,,,2004-10-29,...,2500.0,2500.0,0.0,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,1099008000.0


In [101]:
test_df[['latitude1', 'longitude1', 'localityname', 'startdate', 'enddate']]

Unnamed: 0,latitude1,longitude1,localityname,startdate,enddate
0,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,NaT
1,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,2004-10-29
2,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,NaT
3,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,2004-10-29
4,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,NaT
5,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,NaT
6,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,2004-10-29
7,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,2004-10-29
8,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,NaT
9,27.717778,98.421667,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,2004-10-29


All records share the same start date, elevation, localityname, and geographyid.
### Verdict: same expedition
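The records in this group sit at two slightly different coordinate pairs, so their separation gives a first data point for the spatial part of ϵ. Here is a dependency-free sketch using the haversine formula. It is a spherical approximation and is swapped in for illustration, in place of the geodesic calculation that `geopy` provides; the radius is an assumed mean value.

```python
import math


def haversine_km(ll1, ll2, radius_km=6371.0088):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, ll1)
    lat2, lon2 = map(math.radians, ll2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# The two coordinate pairs that appear in the table above
site_a = (27.717865, 98.421631)
site_b = (27.717778, 98.421667)
print(f"{haversine_km(site_a, site_b) * 1000:.1f} m apart")
```

Repeating this over every pair in a hand-labeled expedition and taking the maximum would give a lower bound on the radius that expedition needs.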

## ∆lat = 0.001

In [108]:
test_df = df[abs(df.latitude1 - 27.717865) < 0.001].reset_index(drop=True)
len(test_df)

174

In [109]:
test_df[['latitude1', 'longitude1', 'localityname', 'startdate', 'enddate']]

Unnamed: 0,latitude1,longitude1,localityname,startdate,enddate
0,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,2004-10-29,NaT
1,27.717421,98.737190,"Yimaluo, E side of Salween river, along the tr...",2002-04-16,NaT
2,27.718170,98.569084,"E side of Gaoligong Shan, W of Gongshan, along...",2002-04-29,NaT
3,27.717392,98.692688,Vicinity of Shigu SW of Gongshan on the road t...,2002-09-26,NaT
4,27.716944,98.690556,Vicinity of Shigu SW of Gongshan along the roa...,2002-10-09,2002-10-09
...,...,...,...,...,...
169,27.717058,98.690521,Vicinity of Shigu SW of Gongshan along the roa...,2002-10-09,NaT
170,27.717421,98.737190,"Yimaluo, E side of Salween river, along the tr...",2002-04-16,NaT
171,27.717392,98.692688,Vicinity of Shigu SW of Gongshan on the road t...,2002-09-26,NaT
172,27.717058,98.690521,Vicinity of Shigu SW of Gongshan along the roa...,2002-10-09,NaT


In [110]:
test_df.head(4)

Unnamed: 0,unnamed: 0,index,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
0,0,2,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,2004-10-29,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,1099008000.0,-9223372000.0
1,59,252,447,Shrub ca. 2 m tall. Bud green.,1.0,94938.0,,,,2002-04-16,...,2020.0,,,27.717421,98.73719,"Yimaluo, E side of Salween river, along the tr...",,33269.0,1018915000.0,-9223372000.0
2,116,571,976,Shrub ca. 1.5 m tall. Fruits black.,1.0,100757.0,,,,2002-04-29,...,2020.0,,,27.71817,98.569084,"E side of Gaoligong Shan, W of Gongshan, along...",,33269.0,1020038000.0,-9223372000.0
3,303,1473,2556,Perennial herb. ca. 1.5 m tall. Flowers white.,4.0,65770.0,,,,2002-09-26,...,1460.0,,,27.717392,98.692688,Vicinity of Shigu SW of Gongshan on the road t...,Vicinity of Shigu SW of Gongshan on the road t...,33269.0,1032998000.0,-9223372000.0


In [112]:
test_df['startdate'].unique()

<DatetimeArray>
['2004-10-29 00:00:00', '2002-04-16 00:00:00', '2002-04-29 00:00:00',
 '2002-09-26 00:00:00', '2002-10-09 00:00:00', '2000-07-10 00:00:00',
 '2000-07-12 00:00:00', '1990-12-30 00:00:00', '2000-07-17 00:00:00',
 '1991-03-22 00:00:00', '1990-11-22 00:00:00']
Length: 11, dtype: datetime64[ns]

In [113]:
test_df['localityname'].unique()

array(['Vicinity of Sandui campsite between Shigong Qiao and Xixiaofang on trail from Bapo to Gongshan via Qiqi on the W side',
       'Yimaluo, E side of Salween river, along the trail to wild ox valley.',
       'E side of Gaoligong Shan, W of Gongshan, along the Pula He on the trail from Gongshan to Qiqi and Dulong Jiang valley.',
       'Vicinity of Shigu SW of Gongshan on the road to Danzhu along the W bank of the Nu Jiang. E side of Gaoligong Shan. SW',
       'Vicinity of Shigu SW of Gongshan along the road to Danzhu, W side of the Nu Jiang. E side of Gaoligong Shan. SW facing',
       'E side of Gaoligong Shan, W of Gongshan, along the Pula He, between Qiqi bridge and Qiqi, on the trail from Gongshan to',
       'E side of Gaoligong Shan, W of Gongshan, in the vicinity of Qiqi above the Pula He.',
       'Along the Gamolai He, on the trail from Bapo to Gongshan on the E side of the Dulong Jiang.',
       'W side of Gaoligong Shan, W of Gongshan, on the trail from Qiqi to Bapo i

In [114]:
test_df['geographyid'].unique()

array([33272., 33269.])

Start dates differ: they fall around 1990, 1991, 2000, 2002, and 2004.

Locality names differ, though some appear similar enough to be clustered together.

There are 2 different geographyids.

## Verdict: a couple of different clusters!

## Okay, so for this example, it looks like ∆lat = 0.001 was not granular enough.
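Part of the problem is that the ∆lat = 0.001 filter constrains latitude only: the matching records span longitudes from about 98.42 to 98.74, which is roughly 30 km east to west at this latitude. A minimal sketch of a filter that bounds both coordinates follows; the helper `within_box` and the tuple layout are assumptions for illustration.

```python
# (latitude1, longitude1, startdate) for the first five rows returned at ∆lat = 0.001
records = [
    (27.717865, 98.421631, "2004-10-29"),
    (27.717421, 98.737190, "2002-04-16"),
    (27.718170, 98.569084, "2002-04-29"),
    (27.717392, 98.692688, "2002-09-26"),
    (27.716944, 98.690556, "2002-10-09"),
]


def within_box(rows, center, half_width_deg):
    """Keep rows whose latitude AND longitude are both within half_width_deg of center."""
    lat0, lon0 = center
    return [
        r for r in rows
        if abs(r[0] - lat0) < half_width_deg and abs(r[1] - lon0) < half_width_deg
    ]


center = (27.717865, 98.421631)
lat_only = [r for r in records if abs(r[0] - center[0]) < 0.001]
print(len(lat_only), "rows pass the latitude-only filter")
print(len(within_box(records, center, 0.001)), "rows pass the latitude+longitude filter")
```

A box in degrees is still not a true radius, since a degree of longitude shrinks with latitude, so a distance-based filter is probably the better long-term choice.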

In [14]:
df[['latitude1', 'longitude1']]

Unnamed: 0,latitude1,longitude1
0,27.717865,98.421631
1,24.882559,98.713638
2,-14.435100,49.767502
3,27.877222,98.335556
4,41.932200,-123.831000
...,...,...
44800,16.670000,-92.532500
44801,41.491695,-119.499161
44802,41.733299,-120.370796
44803,41.733299,-120.370796


In [None]:
import geopy.distance

coords_1 = (52.2296756, 21.0122287)
coords_2 = (52.406374, 16.9251681)

print(geopy.distance.geodesic(coords_1, coords_2).km)

In [13]:
df[abs(df.latitude1 - 27.717865) < 0.00001]

Unnamed: 0,unnamed: 0,index,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,...,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid,startdate_num,enddate_num
0,0,2,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
10638,10638,52375,91014,Shrub ca. 3 mm tall. Young fruit green.,8.0,55602.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
11705,11705,57552,100050,Tree ca. 7 m tall. Young fruit green.,8.0,86909.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
11718,11718,57621,100178,Twining vine. Sepals greenish cream colored.,8.0,100437.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
22104,22104,109001,190209,Tree ca. 10 m tall. In bud.,8.0,138740.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
26172,26172,128903,225221,Shrub ca. 2.5 mm tall. Fruit green.,8.0,59796.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
32043,32043,158119,276235,Shrub ca. 2.5 m tall. Fruit turning black.,8.0,115954.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
37487,37487,184907,322679,Terrestrial fern with dimorphic fronds forming...,8.0,41984.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0
39501,39501,194746,339738,Twining vine. Fruit purple.,2.0,13668.0,,,,NaT,...,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0,-9223372000.0,-9223372000.0


In [6]:
df.head(20)

Unnamed: 0,unnamed: 0,index,collectionobjectid,text1,countamt,collectingeventid,collectionobjectattachmentid,attachmentid,attachmentlocation,startdate,...,remarks,localityid,minelevation,maxelevation,elevationaccuracy,latitude1,longitude1,localityname,namedplace,geographyid
0,0,2,5,Tree ca. 4 m tall. Fruit purplish black.,8.0,66157.0,,,,2004-10-29,...,Tsuga dumosa forest mixed with elements of sub...,66157.0,2500.0,,,27.717865,98.421631,Vicinity of Sandui campsite between Shigong Qi...,Vicinity of Sandui campsite between Shigong Qi...,33272.0
1,1,5,9,Perennial herb.,2.0,31111.0,,,,2003-08-27,...,Remnant disturbed secondary forest on steep sl...,31111.0,1630.0,,,24.882559,98.713638,"Tenglan Cun, Lanniba He. W side of Gaoligong S...","Tenglan Cun, Lanniba He. W side of Gaoligong S...",33255.0
2,2,6,11,"arbre de 8m, fruits jaunes, rouge au sommet, p...",1.0,58265.0,,,,2004-12-10,...,"foret dense humide de moyenne altitude, vegeta...",58265.0,623.0,,,-14.4351,49.767502,"Parc National de Marojejy, commune rurale de M...","Parc National de Marojejy, commune rurale de M...",27772.0
3,3,10,20,Tree ca. 8 m tall. Young fruits green.,1.0,105773.0,153980.0,153981.0,2606ab29-8b8c-46df-b3bb-03587d23a2aa.jpg,2002-07-24,...,Evergreen broad-leaved forest.,105773.0,2300.0,2300.0,0.0,27.877222,98.335556,"Kongdang, W side of Gaoligong Shan, along the ...","Kongdang, W side of Gaoligong Shan, along the ...",33272.0
4,4,14,24,,1.0,19002.0,365549.0,365652.0,8cd186c3-3176-48ac-858e-fe82ddb98af9.jpg,1935-06-08,...,,19002.0,,,,41.9322,-123.831,Old road from Patrick's Creek into Oregon.,,9217.0
5,5,15,25,Shrub ca. 2 m tall. Fruit red.,8.0,88358.0,,,,2002-09-28,...,With scattered Abies and Larix.,88358.0,2980.0,2980.0,0.0,27.798056,98.503333,Vicinity of Dabadi along the Sikeluo river on ...,Vicinity of Dabadi along the Sikeluo river on ...,33269.0
6,6,18,28,Annual.,1.0,103138.0,603448.0,603552.0,bb747f67-fa35-402c-960f-b46797bd6a6b.jpg,1990-07-20,...,Moist meadow in Lodgepole Pine forest.,103138.0,2150.0,2150.0,0.0,41.8866,-120.2181,"Tamarack Flat, E side of Warner Mts.",,17158.0
7,7,23,39,"Ray flowers white, disk flowers yellow.",1.0,103989.0,465741.0,465845.0,a3e43330-d49d-4974-a2ba-501998c26d27.jpg,1950-05-14,...,,103989.0,500.0,500.0,0.0,37.083333,-122.066667,"Ben Lomond San Hills, near summit of Quail Hol...",,23212.0
8,8,24,40,Growing on dry slopes; flowers dark purple-blu...,2.0,134787.0,,,,1997-06-01,...,,134787.0,6000.0,6000.0,0.0,41.491667,-119.499167,"Upper High Rock Canyon, 0.25 mi south of Steve...",,26961.0
9,9,38,70,Perennial herb.,1.0,144235.0,429736.0,429840.0,9c4c5b86-3616-43bd-b584-f75f941b5be6.jpg,1991-10-13,...,Dry lake bottom of Middle Alkali Lake.,144235.0,1320.0,1320.0,0.0,41.5364,-120.0789,"Just N of Surprise Valley Mineral Hot Springs,...",,17158.0


In [5]:
df['localityname']

0        Vicinity of Sandui campsite between Shigong Qi...
1        Tenglan Cun, Lanniba He. W side of Gaoligong S...
2        Parc National de Marojejy, commune rurale de M...
3        Kongdang, W side of Gaoligong Shan, along the ...
4               Old road from Patrick's Creek into Oregon.
                               ...                        
44800                                             Mitziton
44801    Upper High Rock Canyon, 0.25 mi south of Steve...
44802                                    Near Davis Creek.
44803                                    Near Davis Creek.
44804    Ca. 12.9 km N of Pianma on the road to Gangfan...
Name: localityname, Length: 44805, dtype: object