## Problem 3: How long distance individuals have travelled? (8 points)

In this problem the aim is to calculate the distance in meters that the individuals have travelled according the social media posts (Euclidean distances between points). In this problem, we will need the `userid` -column an the points created in the previous problem. You will need the shapefile `Kruger_posts.shp` generated in Problem 2 as input file.

Our goal is to answer these questions based on the input data:

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules
 - Read in the shapefile as a geodataframe called `data`

In [1]:
# Import required modules
import os
from pyproj import CRS
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString
# Read in the shapefile as a geodataframe called data
fp = gpd.read_file("Kruger_posts.shp")
data = gpd.GeoDataFrame(fp)

 - Check the crs of the input data. If this information is missing, set it as epsg:4326 (WGS84).
 - Reproject the data from WGS84 to `EPSG:32735` -projection which stands for UTM Zone 35S (UTM zone for South Africa) to transform the data into metric system. (don't create a new variable, update the existing variable `data`!)

In [2]:
# Check the crs of the input data
data.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [3]:
# check that geometry column contains lat-lon values
data['geometry'].head()

0    POINT (-24.98079 31.48463)
1    POINT (-25.49922 31.50891)
2    POINT (-24.34258 30.93087)
3    POINT (-24.85461 31.51972)
4    POINT (-24.92107 31.52084)
Name: geometry, dtype: geometry

In [4]:
# Reproject the data from WGS84 to EPSG:32735
data = data.to_crs(epsg=32735)

In [5]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
print(data.head())

         lat        lon         timestamp    userid  \
0 -24.980792  31.484633  2015-07-07 03:02  66487960   
1 -25.499225  31.508906  2015-07-07 03:18  65281761   
2 -24.342578  30.930866  2015-03-07 03:38  90916112   
3 -24.854614  31.519718  2015-10-07 05:04  37959089   
4 -24.921069  31.520836  2015-10-07 05:19  27793716   

                            geometry  
0  POINT (-4695752.719 14973674.275)  
1  POINT (-4748939.258 15014098.837)  
2  POINT (-4672729.591 14859391.193)  
3  POINT (-4679391.656 14969037.444)  
4  POINT (-4686373.982 14973910.589)  


In [6]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid

In [7]:
# Sol1: Group the data by userid -- apply trivial filter to groupby obj to return entire df
grouped = data.groupby('userid').filter(lambda x: True)

In [8]:
grouped

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02,66487960,POINT (-4695752.719 14973674.275)
1,-25.499225,31.508906,2015-07-07 03:18,65281761,POINT (-4748939.258 15014098.837)
2,-24.342578,30.930866,2015-03-07 03:38,90916112,POINT (-4672729.591 14859391.193)
3,-24.854614,31.519718,2015-10-07 05:04,37959089,POINT (-4679391.656 14969037.444)
4,-24.921069,31.520836,2015-10-07 05:19,27793716,POINT (-4686373.982 14973910.589)
...,...,...,...,...,...
81374,-24.799541,31.354469,2015-09-05 02:23,90744213,POINT (-4687214.711 14944554.855)
81375,-25.467992,30.956033,2015-02-05 02:40,71109799,POINT (-4792423.345 14942761.977)
81376,-25.332223,30.997409,2015-08-05 02:40,54796261,POINT (-4774258.492 14938101.158)
81377,-25.508851,31.005536,2015-08-05 02:43,78762204,POINT (-4792657.700 14951948.528)


In [9]:
# Sol2: Groupby userid to create grouped obj
grouped = data.groupby('userid')

In [10]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Then:**
- Create an empty GeoDataFrame called `movements`
- Create a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - add the geometry and the userid into the `movements` dataframe (one userid per row). You can achieve this either by using the `.at` indexer, or the `append` method. See hints for more help.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 

In [11]:
# Create an empty GeoDataFrame called movements
movements = gpd.GeoDataFrame()

In [12]:
# Sol3 : group values by sorting first userid, then timestamp
data.sort_values(['userid', 'timestamp'])

Unnamed: 0,lat,lon,timestamp,userid,geometry
30535,-24.759508,31.371200,2015-02-08 06:18,16301,POINT (-4681550.088 14943799.279)
30770,-24.749845,31.338317,2015-02-09 08:09,16301,POINT (-4683233.102 14939015.568)
38235,-24.995803,31.592000,2015-03-13 10:59,16301,POINT (-4688386.821 14988087.394)
38232,-24.791483,31.865172,2015-05-13 10:51,16301,POINT (-4643987.777 15007357.316)
30512,-24.760170,31.339430,2015-06-08 04:34,16301,POINT (-4684246.015 14939886.378)
...,...,...,...,...,...
72163,-25.030678,31.123574,2015-05-10 17:41,99988918,POINT (-4731181.206 14932198.635)
60600,-25.440831,30.967180,2015-01-19 00:44,99990870,POINT (-4788544.778 14942186.180)
61457,-25.440831,30.967180,2015-02-23 07:08,99990870,POINT (-4788544.778 14942186.180)
62250,-25.440831,30.967180,2015-09-27 06:36,99990870,POINT (-4788544.778 14942186.180)


In [13]:
data

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02,66487960,POINT (-4695752.719 14973674.275)
1,-25.499225,31.508906,2015-07-07 03:18,65281761,POINT (-4748939.258 15014098.837)
2,-24.342578,30.930866,2015-03-07 03:38,90916112,POINT (-4672729.591 14859391.193)
3,-24.854614,31.519718,2015-10-07 05:04,37959089,POINT (-4679391.656 14969037.444)
4,-24.921069,31.520836,2015-10-07 05:19,27793716,POINT (-4686373.982 14973910.589)
...,...,...,...,...,...
81374,-24.799541,31.354469,2015-09-05 02:23,90744213,POINT (-4687214.711 14944554.855)
81375,-25.467992,30.956033,2015-02-05 02:40,71109799,POINT (-4792423.345 14942761.977)
81376,-25.332223,30.997409,2015-08-05 02:40,54796261,POINT (-4774258.492 14938101.158)
81377,-25.508851,31.005536,2015-08-05 02:43,78762204,POINT (-4792657.700 14951948.528)


In [14]:
# convert userid groups of points to a line, accounting for userid groups of one point
obj = data.groupby(['userid'])['geometry'].apply(lambda x: LineString(x.tolist()) if x.size > 1 else x.tolist())

In [15]:
# convert to df and reset_index
df = pd.DataFrame(obj)
df = df.reset_index()
df

Unnamed: 0,userid,geometry
0,16301,LINESTRING (-4684246.015113611 14939886.377917...
1,26589,[POINT (-4792681.728864173 14948384.78874119)]
2,29322,[POINT (-4729750.563136787 15040717.25706361)]
3,42181,[POINT (-4764400.981879417 14946675.91338033)]
4,45136,LINESTRING (-4770692.229766149 14940874.449218...
...,...,...
14985,99966397,[POINT (-4612489.515026943 14858007.34187377)]
14986,99986933,LINESTRING (-4635750.601466788 14900847.234702...
14987,99988918,[POINT (-4731181.206194319 14932198.63483956)]
14988,99990870,LINESTRING (-4788544.77792941 14942186.1800538...


In [16]:
# get first element of rows of type list, else get whole row 
clean_df = gpd.GeoDataFrame(df[['userid']], geometry=df['geometry'].apply(lambda x: x[0] if type(x) == list else x))

In [17]:
movements = gpd.GeoDataFrame(clean_df, geometry='geometry', crs=CRS.from_epsg(32735).to_wkt())

In [18]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
movements.head()

Unnamed: 0,userid,geometry
0,16301,"LINESTRING (-4684246.015 14939886.378, -468155..."
1,26589,POINT (-4792681.729 14948384.789)
2,29322,POINT (-4729750.563 15040717.257)
3,42181,POINT (-4764400.982 14946675.913)
4,45136,"LINESTRING (-4770692.230 14940874.449, -477069..."


**Finally:**
- Check once the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)
- Calculate the lenghts of the lines into a new column called ``distance`` in ``movements`` GeoDataFrame.

In [19]:
# YOUR CODE HERE
movements.crs

<Projected CRS: EPSG:32735>
Name: WGS 84 / UTM zone 35S
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 24°E and 30°E, southern hemisphere between 80°S and equator, onshore and offshore. Botswana. Burundi. Democratic Republic of the Congo (Zaire). Rwanda. South Africa. Tanzania. Uganda. Zambia. Zimbabwe.
- bounds: (24.0, -80.0, 30.0, 0.0)
Coordinate Operation:
- name: UTM zone 35S
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [20]:
# if geom_type is a Linestring, calculate length, else return point
movements['distance'] = movements['geometry'].apply(lambda x: x.length if type(x) == LineString else "Did not move")

In [21]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
movements.head()

Unnamed: 0,userid,geometry,distance
0,16301,"LINESTRING (-4684246.015 14939886.378, -468155...",277733
1,26589,POINT (-4792681.729 14948384.789),Did not move
2,29322,POINT (-4729750.563 15040717.257),Did not move
3,42181,POINT (-4764400.982 14946675.913),Did not move
4,45136,"LINESTRING (-4770692.230 14940874.449, -477069...",0


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

#### **What was the shortest distance travelled in meters?**
movements.head() reveals the shortest distance travelled in meters to be equal to 0.

In [22]:
# if geom_type is not a point, add distance to total & increment count by 1
total = 0
count = 0
for idx, row in movements['distance'].items():
    if type(row) != str:
        total += row
        count += 1
print("{:.2f}".format(total/count, 2))

90256.51


#### **What was the mean distance travelled in meters?**
The mean distance traveled in meters was 138872.37 meters.

In [23]:
max_val = []

for idx, row in movements['distance'].items():
    if type(row) != str:
        max_val.append(row)
        
max(max_val)

5486426.016001173

#### **What was the maximum distance travelled in meters?**
The maximum distance traveled from 41.44 meters.

- Finally, save the movements of into a Shapefile called ``some_movements.shp``

In [25]:
# Save output to file: some_movements.shp
movements.to_file(driver = 'ESRI Shapefile', filename= "some_movements.shp")

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
import os
assert os.path.isfile(fp), "output shapefile does not exits"

That's all for this week!