## Problem 3: How long distance individuals have travelled? (8 points)

In this problem the aim is to calculate the distance in meters that the individuals have travelled according to the social media posts (Euclidean distances between points). In this problem, we will need the `userid` -column and the points created in the previous problem. You will need the shapefile `Kruger_posts.shp` generated in Problem 2 as input file.

Our goal is to answer these questions based on the input data:

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules
 - Read in the shapefile as a geodataframe called `data`

In [1]:
# Import necessary modules
import pandas as pd
import geopandas as gpd
from pyproj import CRS
from shapely.geometry import Point, LineString
import matplotlib.pyplot as plt

fp = r'Kruger_posts.shp'
data = gpd.read_file(fp)

# 1221 records have errors in timestamp format. 
# thus, an intelligent convertion of timestamps from type(str) to type(datetime) is problematic.

 - Check the crs of the input data. If this information is missing, set it as epsg:4326 (WGS84).
 - Reproject the data from WGS84 to `EPSG:32735` -projection which stands for UTM Zone 35S (UTM zone for South Africa) to transform the data into metric system. (don't create a new variable, update the existing variable `data`!)

In [2]:
# original crs is based on epsg=4326 or WGS 84 ellipsoid with decimal degrees
# reproject data to a cartesian coord system epsg=32735
data = data.to_crs(epsg=32735)

In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
print(data.head())

         lat        lon         timestamp    userid  \
0 -24.980792  31.484633  2015-07-07 03:02  66487960   
1 -25.499225  31.508906  2015-07-07 03:18  65281761   
2 -24.342578  30.930866  2015-03-07 03:38  90916112   
3 -24.854614  31.519718  2015-10-07 05:04  37959089   
4 -24.921069  31.520836  2015-10-07 05:19  27793716   

                         geometry  
0  POINT (952912.890 7229683.258)  
1  POINT (953433.223 7172080.632)  
2  POINT (898955.144 7302197.408)  
3  POINT (956927.218 7243564.942)  
4  POINT (956794.955 7236187.926)  


In [4]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid

In [5]:
# group data
grouped = data.groupby('userid')

# the number of unique groups is 14990
len(list(grouped))

14990

In [6]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Then:**
- Create an empty GeoDataFrame called `movements`
- Create a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - add the geometry and the userid into the `movements` dataframe (one userid per row). You can achieve this either by using the `.at` indexer, or the `append` method. See hints for more help.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 

In [7]:
# create an empty geo df
movements = gpd.GeoDataFrame(geometry='geometry', columns=['geometry', 'userid'], crs=CRS.from_epsg(32735).to_wkt())

# set index to zero for placement of line objects
idx = 0

# iterate over groups
for key, group in grouped:    

    # sort timestamp value
    group.sort_values(by=['timestamp'], ascending=True)

    # put points from geometry col to a list of tuples
    pnts_sequence = [(point) for point in group['geometry']]

    # line object needs more than 1 point
    if len(pnts_sequence) > 1:

        # create a line object from point objects
        line = LineString(pnts_sequence)

        # save results to geo df
        movements.at[idx, 'userid'] = key
        movements.at[idx, 'geometry'] = line

        # increment index
        idx += 1

        # reset line obj
        line = None


In [8]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
movements.head()

Unnamed: 0,geometry,userid
0,"LINESTRING (939011.113 7254636.121, 942231.630...",16301
1,"LINESTRING (905394.500 7193375.148, 905394.500...",45136
2,"LINESTRING (963788.403 7228015.063, 944551.607...",50136
3,"LINESTRING (902800.817 7192546.975, 902800.839...",88775
4,"LINESTRING (959332.961 7219877.715, 963788.403...",88918


**Finally:**
- Check once the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)
- Calculate the lenghts of the lines into a new column called ``distance`` in ``movements`` GeoDataFrame.

In [9]:
movements['distance'] = (movements['geometry'].length / 1000)

In [10]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
movements.head()

Unnamed: 0,geometry,userid,distance
0,"LINESTRING (939011.113 7254636.121, 942231.630...",16301,195.251396
1,"LINESTRING (905394.500 7193375.148, 905394.500...",45136,0.0
2,"LINESTRING (963788.403 7228015.063, 944551.607...",50136,254.70253
3,"LINESTRING (902800.817 7192546.975, 902800.839...",88775,8e-05
4,"LINESTRING (959332.961 7219877.715, 963788.403...",88918,9.277252


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

In [11]:
max_len = movements['distance'].max()
min_len = movements['distance'].min()
mean_len = movements['distance'].mean()

print('The maximum length is %0.2f km' % (max_len))
print('The minimum length is %0.2f km' % (min_len))
print('The mean length is %0.2f km' % (mean_len))

The maximum length is 4535.32 km
The minimum length is 0.00 km
The mean length is 69.09 km


- Finally, save the movements of into a Shapefile called ``some_movements.shp``

In [12]:
fp = r'some_movements.shp'
movements.to_file(fp)

In [13]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
import os
assert os.path.isfile(fp), "output shapefile does not exits"

That's all for this week!

In [14]:
len(movements)

9026

We started from 14990 groups and ended up with 9026 because 5964 had less than 2 points to construct a line