## Problem 3: How far did people travel? (8 points)

In this task, the aim is to calculate the straight-line (air-line) distance in meters that each social media user in the data set prepared in *Problem 2* has travelled between their posts. We are interested in the Euclidean distance between consecutive points posted by the same user.

For this, we will need to use the `userid` column of the data set `kruger_points.shp` that we created in *Problem 2*.
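To make "distance between consecutive points" concrete, here is a minimal sketch in plain Python. The coordinates are made up for illustration and are assumed to be in a projected CRS with meters as units; the helper name `travelled_distance` is hypothetical.

```python
import math

# Hypothetical projected coordinates (meters) of one user's posts, oldest first
posts = [(0.0, 0.0), (300.0, 400.0), (300.0, 1000.0)]


def travelled_distance(points):
    """Sum the Euclidean distances between consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# First leg is 500 m, second leg is 600 m
total = travelled_distance(posts)
print(total)
```

The rest of this problem does the same thing with `LineString` geometries, whose length is the sum of their segment lengths.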

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a `geopandas.GeoDataFrame` called `kruger_points`
- Transform the data from WGS84 to an `EPSG:32735` projection (UTM Zone 35S, suitable for South Africa). This CRS has *metres* as units.
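As a rough illustration of what the re-projection does, the sketch below transforms a single point. The longitude and latitude are copied from the first row of the `kruger_points.head()` output in this notebook; the variable names are only for illustration.

```python
import geopandas
from shapely.geometry import Point

# One point in WGS84 (x = longitude, y = latitude, in degrees)
point_wgs84 = geopandas.GeoSeries([Point(31.484633, -24.980792)], crs="EPSG:4326")

# Re-project to UTM zone 35S: the coordinates are now in meters
point_utm = point_wgs84.to_crs("EPSG:32735")

print(point_utm.iloc[0])
print(point_utm.crs.axis_info[0].unit_name)
```

The projected coordinates should agree closely with the geometry listed for that row.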

In [2]:
# ADD YOUR OWN CODE HERE
import pathlib
import geopandas

# Input data lives in the `data` directory next to this notebook
DATA_DIRECTORY = pathlib.Path().resolve() / "data"

# Read the points (stored in WGS84) and re-project them to UTM zone 35S
kruger_points_4326 = geopandas.read_file(DATA_DIRECTORY / "kruger_points.shp")
kruger_points = kruger_points_4326.to_crs("EPSG:32735")

In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02,66487960,POINT (952912.890 7229683.258)
1,-25.499225,31.508906,2015-07-07 03:18,65281761,POINT (953433.223 7172080.632)
2,-24.342578,30.930866,2015-03-07 03:38,90916112,POINT (898955.144 7302197.408)
3,-24.854614,31.519718,2015-10-07 05:04,37959089,POINT (956927.218 7243564.942)
4,-24.921069,31.520836,2015-10-07 05:19,27793716,POINT (956794.955 7236187.926)


In [4]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`
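A small sketch of how `groupby` behaves, using a made-up data frame in place of `kruger_points` (the user ids and timestamps here are hypothetical):

```python
import pandas

# Hypothetical toy data standing in for kruger_points
posts = pandas.DataFrame(
    {
        "userid": [1, 2, 1, 3, 2, 1],
        "timestamp": [
            "2015-07-07 03:02",
            "2015-07-07 03:18",
            "2015-07-07 03:38",
            "2015-07-07 05:04",
            "2015-07-07 05:19",
            "2015-07-07 06:00",
        ],
    }
)

grouped = posts.groupby("userid")

# Iterating yields (key, sub-data-frame) pairs, one per user
for userid, rows in grouped:
    print(userid, len(rows))

# .groups maps each key to the index labels of its rows
print(grouped.groups)
```

Note that `.groups` stores index labels, not positions, so rows are best looked up with `.loc` rather than `.iloc`.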

In [5]:
# ADD YOUR OWN CODE HERE

In [6]:
#kruger_points["userid"].nunique() #14990
grouped_by_users = kruger_points.groupby("userid")
grouped_by_users.groups

{16301: [30512, 30535, 30545, 30770, 38232, 38235, 38909, 38911, 38913], 26589: [61781], 29322: [78280], 42181: [8081], 45136: [80613, 81278], 48971: [71512], 50136: [42402, 42439, 42453, 42478, 42526, 42566, 42620, 42670, 42751, 42880], 50530: [79157], 66129: [60285], 74329: [4003], 75914: [60235], 76069: [2388], 88775: [15288, 15289], 88918: [37496, 37879], 90156: [74848, 74852, 75080, 75081, 75083, 75085, 75089, 75091], 120615: [81361], 133296: [66934], 141256: [47421], 156058: [4775], 161653: [59189], 174181: [35734, 35735], 177106: [47034], 177600: [7616, 9609, 9610, 9611, 9678, 9679], 180146: [70740], 181216: [52587, 52688, 52779, 52838, 52965], 184404: [4006, 4203, 4974, 74796, 76816], 186335: [47691], 193414: [78179, 78912, 79513], 195149: [56739, 56771, 56775, 56777, 56779], 198845: [51067, 51726], 209862: [75654], 214933: [10933, 10935], 217091: [72658, 73097, 73099], 222264: [36677, 48629, 48631], 228231: [16713], 231302: [13437], 232626: [3610, 3611, 3612, 3613, 3614, 3615,

In [7]:
print(len(grouped_by_users.groups))
print(kruger_points["userid"].nunique())

14990
14990


In [8]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html)). You can use, for instance, a dictionary or an empty GeoDataFrame to collect data that is generated using the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have fewer than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.
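The steps above could be sketched as follows on toy data. This is one possible approach, not the only one; the points, timestamps, and the names `toy_points` and `toy_movements` are hypothetical.

```python
import geopandas
from shapely.geometry import LineString, Point

# Hypothetical toy data: user 1 has three posts (out of order), user 2 only one
toy_points = geopandas.GeoDataFrame(
    {
        "userid": [1, 2, 1, 1],
        "timestamp": [
            "2015-07-07 03:20",
            "2015-07-07 03:05",
            "2015-07-07 03:00",
            "2015-07-07 03:10",
        ],
    },
    geometry=[Point(0, 7), Point(5, 5), Point(0, 0), Point(3, 4)],
    crs="EPSG:32735",
)

records = []
for userid, rows in toy_points.groupby("userid"):
    # A LineString needs at least two points, so skip single-post users
    if len(rows) < 2:
        continue
    # Connect the points from oldest to most recent
    rows = rows.sort_values("timestamp")
    records.append({"userid": userid, "geometry": LineString(rows.geometry.tolist())})

toy_movements = geopandas.GeoDataFrame(records, geometry="geometry", crs="EPSG:32735")
print(toy_movements)
```

Sorting works on the timestamp strings here only because they are assumed to follow a year-month-day pattern; converting them with `pandas.to_datetime` first would be the safer choice in general.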

In [10]:
# ADD YOUR OWN CODE HERE
from shapely.geometry import LineString
import geopandas

# Keep only users with at least two posts (a LineString needs two or more points),
# then sort each user's posts from oldest to most recent before grouping
multiple_posts = kruger_points.loc[kruger_points.duplicated(subset=["userid"], keep=False)]
sorted_groups = multiple_posts.sort_values(by=["userid", "timestamp"]).groupby("userid")

# Collect one LineString per user
movements_data = {"userid": [], "geometry": []}
for userid, user_posts in sorted_groups:
    movements_data["userid"].append(str(userid))
    # Rows keep their timestamp order inside each group because we sorted before grouping
    movements_data["geometry"].append(LineString(user_posts.geometry.tolist()))

# Create a GeoDataFrame and assign the CRS (EPSG:32735 has metres as units)
movements = geopandas.GeoDataFrame(movements_data, crs="EPSG:32735")

# Length of each line in metres (this is repeated in part d)
movements["distance"] = movements.length
print(movements)

        userid                                           geometry  \
0        16301  LINESTRING (942231.630 7254606.868, 938934.725...   
1        45136  LINESTRING (905394.500 7193375.148, 905394.500...   
2        50136  LINESTRING (944551.607 7253384.183, 963788.403...   
3        88775  LINESTRING (902800.817 7192546.975, 902800.839...   
4        88918  LINESTRING (959332.961 7219877.715, 963788.403...   
...        ...                                                ...   
9021  99921781  LINESTRING (902885.190 7196931.096, 904027.710...   
9022  99936874  LINESTRING (963782.211 7228000.079, 963754.402...   
9023  99964140  LINESTRING (938876.653 7305143.369, 938876.943...   
9024  99986933  LINESTRING (935937.029 7305973.536, 936598.681...   
9025  99990870  LINESTRING (899089.377 7180296.561, 899089.377...   

           distance  
0     328455.115430  
1          0.000000  
2     159189.081019  
3          0.080245  
4       9277.252211  
...             ...  
9021  211162.6959

In [13]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


Unnamed: 0,userid,geometry,distance
0,16301,"LINESTRING (942231.630 7254606.868, 938934.725...",328455.115430
1,45136,"LINESTRING (905394.500 7193375.148, 905394.500...",0.000000
2,50136,"LINESTRING (944551.607 7253384.183, 963788.403...",159189.081019
3,88775,"LINESTRING (902800.817 7192546.975, 902800.839...",0.080245
4,88918,"LINESTRING (959332.961 7219877.715, 963788.403...",9277.252211
...,...,...,...
9021,99921781,"LINESTRING (902885.190 7196931.096, 904027.710...",211162.695906
9022,99936874,"LINESTRING (963782.211 7228000.079, 963754.402...",29.097909
9023,99964140,"LINESTRING (938876.653 7305143.369, 938876.943...",2.478976
9024,99986933,"LINESTRING (935937.029 7305973.536, 936598.681...",2548.913592


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines, and store them in a new column called `distance`
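As a quick sanity check of what `length` measures, the sketch below uses two hand-made lines with hypothetical coordinates in meters. A line's length is the sum of its segment lengths, and a user who posted repeatedly from the same location gets a length of zero.

```python
from shapely.geometry import LineString

# Two segments: (0, 0) -> (3, 4) is 5 m, (3, 4) -> (3, 10) is 6 m
line = LineString([(0, 0), (3, 4), (3, 10)])
print(line.length)

# Two posts from exactly the same location give a zero-length line
stationary = LineString([(1, 1), (1, 1)])
print(stationary.length)
```

This is why a shortest distance of zero meters is plausible in the results of this problem.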

In [18]:
# ADD YOUR OWN CODE HERE
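# Check the CRS once more: lengths are only in metres in a metric, projected CRS
assert movements.crs.to_epsg() == 32735

movements["distance"] = movements.length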

In [19]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

Unnamed: 0,userid,geometry,distance
0,16301,"LINESTRING (942231.630 7254606.868, 938934.725...",328455.11543
1,45136,"LINESTRING (905394.500 7193375.148, 905394.500...",0.0
2,50136,"LINESTRING (944551.607 7253384.183, 963788.403...",159189.081019
3,88775,"LINESTRING (902800.817 7192546.975, 902800.839...",0.080245
4,88918,"LINESTRING (959332.961 7219877.715, 963788.403...",9277.252211


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)
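A minimal sketch of the summary statistics on a made-up column (the values below are hypothetical and not taken from the real data):

```python
import pandas

# Hypothetical per-user distances in meters
distances = pandas.Series([0.0, 120.5, 2548.9, 9277.3], name="distance")

shortest = distances.min()
mean = distances.mean()
longest = distances.max()

print(f"{shortest:.1f} / {mean:.1f} / {longest:.1f} meters")
```

Rounding only in the `print` call keeps the stored values at full precision.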

In [34]:
# ADD YOUR OWN CODE HERE
shortest_distance = movements["distance"].min()
print(f"{shortest_distance:.1f} meters")
mean_distance = movements["distance"].mean()
print(f"{mean_distance:.1f} meters")
longest_distance = movements["distance"].max()
print(f"{longest_distance:.1f} meters")

0.0 meters
107133.6 meters
6970666.7 meters


### f) Save the movements in a file

Save the `movements` into a new Shapefile called `movements.shp` inside the `data` directory.

In [36]:
# ADD YOUR OWN CODE HERE

movements.to_file(DATA_DIRECTORY / "movements.shp")


In [38]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 