Problem 3: 
How far have individuals travelled? (8 points)
In this problem, the aim is to calculate the distance in meters that individuals have travelled according to their social media posts (Euclidean distances between points). We will need the userid column and the points created in the previous problem. You will need the shapefile Kruger_posts.shp generated in Problem 2 as the input file.
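
As a small illustration of what "distance travelled" means here, the sketch below sums the straight-line distances between consecutive points of a path. The function name path_length and the toy coordinates are hypothetical and not part of the exercise data; the coordinates are assumed to be already in a metric (projected) system.

```python
import math

def path_length(points):
    """Sum of straight-line (Euclidean) distances between consecutive points."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Hypothetical projected coordinates in meters, ordered from oldest to latest
toy_path = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]

# Two segments: 5 m and 6 m
total = path_length(toy_path)
print(total)
```

A path with a single point has no segments, so its length under this definition is zero.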

Our goal is to answer these questions based on the input data:

- What was the shortest distance travelled in meters?
- What was the mean distance travelled in meters?
- What was the maximum distance travelled in meters?

In your code, you should first:

- Import required modules.

- Read in the shapefile as a geodataframe called data

- Reproject the data from WGS84 into the EPSG:32735 projection, which stands for UTM Zone 35S (UTM zone for South Africa), to transform the data into the metric system.

- Store the result in a variable called data!
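
To see what the reprojection does to a single coordinate pair, here is a minimal sketch using pyproj's Transformer directly. The point is hypothetical (it sits on the central meridian of UTM zone 35, 27°E, and is not taken from the Kruger data); the sketch only illustrates that degrees go in and meters come out.

```python
from pyproj import Transformer

# always_xy=True makes the input order (longitude, latitude)
to_utm35s = Transformer.from_crs("EPSG:4326", "EPSG:32735", always_xy=True)

# Hypothetical point on the central meridian of UTM zone 35 (27 degrees east)
easting, northing = to_utm35s.transform(27.0, -24.0)

# Both values are now in meters
print(round(easting), round(northing))
```

In the exercise itself, data.to_crs(epsg=32735) applies the same kind of transformation to every geometry in the GeoDataFrame.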

In [None]:
#Import required modules.
import geopandas as gpd
from shapely.geometry import Point, LineString
import pandas as pd
from pyproj import CRS

In [None]:
#Read in the shapefile as a geodataframe called data
fp = 'Kruger_posts.shp'
data = gpd.read_file(fp)

print(data.head())

In [None]:
#Let’s check the data type:
type(data)

In [None]:
#Reproject the data from WGS84 projection into EPSG:32735 
data = data.to_crs(epsg=32735)

In [None]:
data.crs

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
print(data.head())

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

Group the data by userid

In [None]:
#group the data by userid
grouped = data.groupby('userid')

#check group keys
#grouped.groups.keys()

print(grouped)
#print("Groups: ", grouped.groups) #grab dictionary of userid and associated indices
#print("Indices: ", grouped.indices) #grab dictionary of userid and associated indices in array format

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the number of groups:
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

Create LineString objects for each user connecting the points from oldest to latest:

Suggested steps:

- Create an empty DataFrame called movements.
- Create an empty column "geometry"
- Use a for-loop where you iterate over the grouped object. For each user's data:
    - sort the rows by timestamp
    - create a LineString object based on the user's points
    - Add the LineString to the geometry column of the movements dataframe. You can also add the userid in a separate column (or use the userid as index).
- Convert movements into a GeoDataFrame (you can replace the DataFrame created in the previous steps with the GeoDataFrame). Remember to set the geometry column.
- Set the CRS of the movements GeoDataFrame as EPSG:32735
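
The suggested steps can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the expected solution: the DataFrame toy and its values are hypothetical, the geometries are assumed to be already projected, and the final conversion to a GeoDataFrame with a CRS is left out to keep the example small.

```python
import pandas as pd
from shapely.geometry import LineString, Point

# Hypothetical toy data: projected point geometries with unsorted timestamps
toy = pd.DataFrame({
    "userid": [1, 1, 1, 2],
    "timestamp": ["2015-01-03", "2015-01-01", "2015-01-02", "2015-01-01"],
    "geometry": [Point(0, 10), Point(0, 0), Point(0, 4), Point(5, 5)],
})

rows = []
for userid, group in toy.groupby("userid"):
    # Order the user's posts from oldest to latest
    ordered = group.sort_values("timestamp")
    points = list(ordered["geometry"])
    # A LineString needs at least two points, so repeat a lone point
    if len(points) == 1:
        points = points * 2
    rows.append({"userid": userid, "geometry": LineString(points)})

toy_movements = pd.DataFrame(rows)
print(toy_movements)
```

Sorting matters: user 1's line runs (0, 0), (0, 4), (0, 10) after sorting, whereas the unsorted order would double back and produce a longer line.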

Test code: sorting rows by timestamp in place for a subset of 10 userid values

In [None]:
#extract the first 10 userid unique values 
subset_id = data["userid"].unique()[0:10]
print(subset_id)

#example of extracting the first unique value from our subset of unique values
print(subset_id[0])

print("Before Sort....\n")
#build a dictionary for the subset: userid -> copy of the rows related to that userid
subset_groupby = {}
for user_id in subset_id:
    #get copy of geodataframe with userid
    subset_groupby[user_id] = grouped.get_group(user_id).copy()
    print(subset_groupby[user_id].head())

print("-------------")
print("After Sort....\n")

#sort userid rows by timestamp value inplace in ascending order
for user_id in subset_id:
    subset_groupby[user_id].sort_values('timestamp', ascending=True, inplace=True)
    print(subset_groupby[user_id].head())

    

More test code: sorting a GeoDataFrame in place

In [None]:
#example of extracting a GeoDataFrame from the grouped object:
#return the rows with the following userid
test = grouped.get_group(65281761).copy()
print(test)
print(type(test))
print("--------------------------")

#sort dataframe by timestamp in ascending order
test.sort_values('timestamp', ascending=True, inplace=True)
print(test)

type(grouped)

In [None]:
movements = gpd.GeoDataFrame()
movements['geometry'] = None
movements['userid'] = None

def return_lines(user_id_list, groupby_geo_data_frame):
    '''
    Takes a groupby geodataframe and a list of unique user ids.
    Sorts each user's rows by timestamp and generates one LineString per userid
    from the reprojected point geometries (coordinates in meters).
    '''
    
    lines = []

    for user_id in user_id_list:
        
        #get copy of geodataframe with userid
        temp = groupby_geo_data_frame.get_group(user_id).copy()

        #sort the timestamps of the copied rows with userid
        temp.sort_values('timestamp', ascending=True, inplace=True)

        #store all points for a given userid, ordered from oldest to latest
        #NOTE: use the reprojected geometry column; the 'lon' and 'lat' columns still
        #hold WGS84 degrees, so lines built from them would not be measured in meters
        points = list(temp['geometry'])

        #print("userid: ",user_id , ", # of Points:", len(points))
        
        #a LineString needs at least two points
        if (len(points)> 1):
            lines.append(LineString(points))
        else:
            #userid has a single point: make a zero-length LineString by repeating it
            lines.append(LineString( [ points[0],points[0] ] ))
            
    return lines


#Iterate over the grouped dictionary : userid : [rows]
#print(grouped.get_group(65281761))

#get a sample of unique user ids
#all_unique_id = data["userid"].unique()[0:5]

#get all unique user_ids
all_unique_id = data["userid"].unique()

#append userid to geodataframe
movements['userid'] = all_unique_id

#append geometry to geodataframe
movements['geometry'] = return_lines(all_unique_id,grouped)

#define the crs of movements as epsg 32735 (this labels the coordinates, it does not reproject them)
movements.crs= CRS.from_epsg(32735).to_wkt()

print(movements.head())

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the result
print(type(movements))
print(movements.crs)
print(movements["geometry"].head())

Finally:

- Check once more the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)

- Calculate the lengths of the lines into a new column called distance in the movements GeoDataFrame.
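
The length of a line is reported in the units of its coordinates, which is why the metric CRS matters. A minimal sketch with two hypothetical lines (not taken from the exercise data) shows the two cases that occur here: a user with several posts, and a user with a single post represented by a repeated point.

```python
from shapely.geometry import LineString

# Hypothetical user with two posts, coordinates assumed to be in meters
two_posts = LineString([(0, 0), (3, 4)])

# Hypothetical user with a single post: the same point repeated
single_post = LineString([(7, 7), (7, 7)])

print(two_posts.length, single_post.length)
```

If any user has only one post and is stored as a repeated point, that user contributes a distance of zero, which then becomes the minimum distance and pulls the mean down.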

In [None]:
print(movements.crs)

In [None]:
#get the distance of each linestring of the geopandas dataframe
movements['distance'] = movements.length

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

You should now be able to print answers to the following questions:

- What was the shortest distance travelled in meters?
- What was the mean distance travelled in meters?
- What was the maximum distance travelled in meters?
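
The three statistics can be read off a column in one step. The sketch below uses a hypothetical Series of distances (the values are made up for illustration) and also shows how users who did not move could be left out, which is an optional variation and not something the exercise asks for.

```python
import pandas as pd

# Hypothetical distances in meters, one value per user
distances = pd.Series([0.0, 1200.0, 4800.0], name="distance")

# Minimum, mean and maximum in a single call
summary = distances.agg(["min", "mean", "max"])
print(summary)

# Optional: ignore users with zero movement (for example single-post users)
moved = distances[distances > 0]
print(moved.min())
```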

In [None]:
#get the min distance travelled
print("min: ", movements['distance'].min() )

#get the mean distance travelled
print("mean: ", movements['distance'].mean() )

#get the max distance travelled
print("max: ", movements['distance'].max() )

Finally, save the movements into a Shapefile called some_movements.shp

In [None]:
fp = "some_movements.shp"

#each shapefile can only contain one geometry type; point and line geometries cannot be mixed together
movements.to_file(fp)

In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

import os

#Check if output file exists
assert os.path.isfile(fp), "Output file does not exist."