## Problem 3: How far have individuals travelled? (8 points)

In this problem the aim is to calculate the distance in meters that individuals have travelled according to their social media posts (Euclidean distances between consecutive points). We will need the `userid` column and the points created in the previous problem. Use the shapefile `Kruger_posts.shp` generated in Problem 2 as the input file.
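
As a minimal sketch of what a Euclidean travel distance means here: the length of a path is the sum of the straight-line distances between consecutive points. The helper name `path_length` and the coordinates below are made up for illustration and are not part of the exercise data.

```python
import math


def path_length(points):
    """Sum of straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Hypothetical projected coordinates in meters
toy_path = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
print(path_length(toy_path))  # 5 m + 6 m = 11.0
```

This only works as a distance in meters when the coordinates are in a metric projection, which is why the data is reprojected before any lengths are calculated.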

Our goal is to answer these questions based on the input data:
- What was the shortest distance travelled in meters?
- What was the mean distance travelled in meters?
- What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules.
 - Read in the shapefile as a geodataframe called `data`
 - Reproject the data from WGS84 into `EPSG:32735`, which stands for UTM Zone 35S (the UTM zone for South Africa), so that the coordinates are expressed in meters.
 
*Store the result in a variable called `data`*!

In [1]:
import os
import geopandas as gpd
from shapely.geometry import Point, Polygon, LineString
import pandas as pd
import matplotlib.pyplot as plt
from pyproj import CRS

fp = "data/Kruger_posts.shp"
data = gpd.read_file(fp)

- Check the CRS of the input data. If this information is missing, set it to EPSG:4326 (WGS84).
- Reproject the data from WGS84 to `EPSG:32735`, which stands for UTM Zone 35S (the UTM zone for South Africa), to transform the data into a metric system. (Don't create a new variable, update the existing variable `data`!)
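
The CRS check and reprojection described above could look roughly like the following sketch. It uses a tiny GeoDataFrame named `sample` (a hypothetical stand-in for `data`) built from two lon/lat pairs of the input data and deliberately created without CRS information, so that the "missing CRS" branch is exercised.

```python
import geopandas as gpd
from shapely.geometry import Point

# Stand-in for `data`: two lon/lat points, deliberately created without a CRS
sample = gpd.GeoDataFrame(
    {"userid": [66487960, 65281761]},
    geometry=[Point(31.484633, -24.980792), Point(31.508906, -25.499225)],
)

# If the CRS is missing, declare it as WGS84 (this does not move any coordinates)
if sample.crs is None:
    sample = sample.set_crs(epsg=4326)

# Reproject to UTM Zone 35S so that coordinates are expressed in meters
sample = sample.to_crs(epsg=32735)
print(sample.crs.to_epsg())
```

Note the difference between the two calls: `set_crs` only attaches a definition to coordinates that are already in that system, while `to_crs` actually transforms the coordinates.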

In [2]:
data = gpd.GeoDataFrame(data, geometry = "geometry")

In [3]:
data.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [4]:
data.head()

| | lat | lon | timestamp | userid | geometry |
|---|---|---|---|---|---|
| 0 | -24.980792 | 31.484633 | 2015-07-07 03:02 | 66487960 | POINT (31.48463 -24.98079) |
| 1 | -25.499225 | 31.508906 | 2015-07-07 03:18 | 65281761 | POINT (31.50891 -25.49922) |
| 2 | -24.342578 | 30.930866 | 2015-03-07 03:38 | 90916112 | POINT (30.93087 -24.34258) |
| 3 | -24.854614 | 31.519718 | 2015-10-07 05:04 | 37959089 | POINT (31.51972 -24.85461) |
| 4 | -24.921069 | 31.520836 | 2015-10-07 05:19 | 27793716 | POINT (31.52084 -24.92107) |


In [5]:
# converting to epsg 32735
data = data.to_crs(32735)

In [6]:
data.head()

| | lat | lon | timestamp | userid | geometry |
|---|---|---|---|---|---|
| 0 | -24.980792 | 31.484633 | 2015-07-07 03:02 | 66487960 | POINT (952912.890 7229683.258) |
| 1 | -25.499225 | 31.508906 | 2015-07-07 03:18 | 65281761 | POINT (953433.223 7172080.632) |
| 2 | -24.342578 | 30.930866 | 2015-03-07 03:38 | 90916112 | POINT (898955.144 7302197.408) |
| 3 | -24.854614 | 31.519718 | 2015-10-07 05:04 | 37959089 | POINT (956927.218 7243564.942) |
| 4 | -24.921069 | 31.520836 | 2015-10-07 05:19 | 27793716 | POINT (956794.955 7236187.926) |


In [7]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
print(data.head())

         lat        lon         timestamp    userid  \
0 -24.980792  31.484633  2015-07-07 03:02  66487960   
1 -25.499225  31.508906  2015-07-07 03:18  65281761   
2 -24.342578  30.930866  2015-03-07 03:38  90916112   
3 -24.854614  31.519718  2015-10-07 05:04  37959089   
4 -24.921069  31.520836  2015-10-07 05:19  27793716   

                         geometry  
0  POINT (952912.890 7229683.258)  
1  POINT (953433.223 7172080.632)  
2  POINT (898955.144 7302197.408)  
3  POINT (956927.218 7243564.942)  
4  POINT (956794.955 7236187.926)  


In [8]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid

In [9]:
grouped = data.groupby('userid')

In [10]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the number of groups:
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Create LineString objects for each user connecting the points from oldest to latest:**

*Suggested steps:*
- Create an empty DataFrame called `movements`. 
- Create an empty column "geometry"
- Use a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - Add the LineString to the geometry column of the `movements` dataframe. You can also add the `userid` in a separate column (or use the userid as index).
- Convert `movements` into a `GeoDataFrame` (you can replace the DataFrame created in the previous steps with the GeoDataFrame). Remember to set the `geometry` column.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 
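
The suggested steps can be sketched on toy data as follows. This is one possible variant under stated assumptions, not the required solution: the names `toy`, `records` and `toy_movements` and all coordinates are hypothetical, and the rows are collected in a list instead of being written into an empty DataFrame. The posts of user 1 are listed out of order on purpose, and user 2 has a single post.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Hypothetical posts, already in a metric CRS
toy = gpd.GeoDataFrame(
    {
        "userid": [1, 1, 1, 2],
        "timestamp": [
            "2015-07-07 03:18",
            "2015-07-07 03:02",
            "2015-07-07 05:04",
            "2015-07-07 03:38",
        ],
    },
    geometry=[Point(3, 4), Point(0, 0), Point(3, 10), Point(1, 1)],
    crs="EPSG:32735",
)

records = []
for userid, group in toy.groupby("userid"):
    # Order the user's posts from oldest to latest
    group = group.sort_values(by="timestamp")
    # A LineString needs at least two points, so skip single-post users
    if len(group) < 2:
        continue
    coords = [(point.x, point.y) for point in group["geometry"]]
    records.append({"userid": userid, "geometry": LineString(coords)})

toy_movements = gpd.GeoDataFrame(records, geometry="geometry", crs="EPSG:32735")
toy_movements["distance"] = toy_movements.length
print(toy_movements)
```

Sorting the timestamps as text works here because they follow the `YYYY-MM-DD HH:MM` pattern, where alphabetical order matches chronological order; converting them with `pd.to_datetime` first would be the safer choice for other formats.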

In [12]:
# I had some problems with this part. I decided to proceed only with users that have more than one observation,
# since a LineString cannot be created from a single point. I also got a Shapely deprecation warning, which I decided to ignore.

import warnings
from shapely.errors import ShapelyDeprecationWarning
warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning)

movements = pd.DataFrame()
movements['geometry'] = None
movements['userid'] = None
for key, group in grouped:
    # Order each user's posts from oldest to latest
    group = group.sort_values(by='timestamp')
    points = list(group['geometry'])
    if len(points) > 1:
        movements.loc[key, 'geometry'] = LineString(points)
        movements.loc[key, 'userid'] = key



In [13]:
movements = gpd.GeoDataFrame(movements, geometry = 'geometry')
movements.crs

In [14]:
movements = movements.set_crs("EPSG:32735")


In [15]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the result
print(type(movements))
print(movements.crs)
print(movements["geometry"].head())

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735
16301    LINESTRING (942231.630 7254606.868, 938934.725...
45136    LINESTRING (905394.500 7193375.148, 905394.500...
50136    LINESTRING (944551.607 7253384.183, 963788.403...
88775    LINESTRING (902800.817 7192546.975, 902800.839...
88918    LINESTRING (959332.961 7219877.715, 963788.403...
Name: geometry, dtype: geometry


**Finally:**
- Check the CRS definition of your dataframe once more (it should be EPSG:32735; define the correct CRS if this information is missing).
- Calculate the lengths of the lines into a new column called ``distance`` in the ``movements`` GeoDataFrame.

In [16]:
movements['distance'] = None
    

In [17]:
movements['distance'] = movements['geometry'].length

In [18]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

| | geometry | userid | distance |
|---|---|---|---|
| 16301 | LINESTRING (942231.630 7254606.868, 938934.725... | 16301 | 328455.11543 |
| 45136 | LINESTRING (905394.500 7193375.148, 905394.500... | 45136 | 0.0 |
| 50136 | LINESTRING (944551.607 7253384.183, 963788.403... | 50136 | 159189.081019 |
| 88775 | LINESTRING (902800.817 7192546.975, 902800.839... | 88775 | 0.080245 |
| 88918 | LINESTRING (959332.961 7219877.715, 963788.403... | 88918 | 9277.252211 |


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?
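
A hedged sketch of how these three statistics can be computed with pandas is shown below. It uses only the five distances visible in the `movements.head()` output above, stored in a hypothetical Series called `sample_distances`, so the numbers it prints are not the answers for the full dataset.

```python
import pandas as pd

# The five distances (in meters) shown in the `movements.head()` output
sample_distances = pd.Series(
    [328455.11543, 0.0, 159189.081019, 0.080245, 9277.252211],
    name="distance",
)

# Minimum, mean and maximum in one call
stats = sample_distances.agg(["min", "mean", "max"])
print(stats)

# Same statistics after excluding users who did not move at all
moved = sample_distances[sample_distances > 0]
print(moved.agg(["min", "mean", "max"]))
```

Whether zero-length movements are kept or dropped changes both the shortest and the mean distance, so it is worth stating which choice was made when reporting the answers.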

In [19]:
# Drop zero-length movements (users whose posts all share the same location)
movements = movements.loc[movements['distance'] != 0]

In [21]:
# The results seem confusing, so there might be an error in how the LineStrings were created.
# However, since this is practice data, the values might simply be random.

min_distance = movements['distance'].min()
max_distance = movements['distance'].max() / 1000    # convert meters to km
mean_distance = movements['distance'].mean() / 1000  # convert meters to km
print("Shortest distance is",min_distance,"m, max distance is", max_distance, "km and mean distance is", mean_distance, "km")

Shortest distance is 0.00010161846437100987 m, max distance is 6970.668816343964 km and mean distance is 117.43831425525391 km


- Finally, save the movements into a Shapefile called ``some_movements.shp``

In [22]:
outfp = "data/some_movements.shp"
movements.to_file(outfp)

In [23]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

import os

#Check if output file exists
assert os.path.isfile(outfp), "Output file does not exist."

That's all for this week!