## Problem 3: How far have individuals travelled? (8 points)

In this problem the aim is to calculate the distance in meters that individuals have travelled according to their social media posts (Euclidean distances between points). We will need the `userid` column and the points created in the previous problem. You will need the shapefile `Kruger_posts.shp` generated in Problem 2 as the input file.
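
As a minimal sketch of what "distance travelled" means here: the length of a path is the sum of the straight-line (Euclidean) distances between consecutive points. The function name `path_length` and the coordinates are made up for illustration, and the coordinates are assumed to be in a metric (projected) system already.

```python
import math

def path_length(points):
    """Sum the Euclidean distances between consecutive (x, y) points."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Hypothetical projected coordinates in meters
posts = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
print(path_length(posts))
```

A path with a single point has no segments, so its length is zero.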

Our goal is to answer these questions based on the input data:
- What was the shortest distance travelled in meters?
- What was the mean distance travelled in meters?
- What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules.
 - Read in the shapefile as a geodataframe called `data`
 - Reproject the data from WGS84 to `EPSG:32735`, which stands for UTM Zone 35S (the UTM zone for South Africa), to transform the data into a metric system.
 
*Store the result in a variable called `data`*!

In [None]:
import pandas as pd
import geopandas as gpd 
from shapely.geometry import Point,LineString,Polygon
from pyproj import CRS

fp = r"C:\Users\aradi\Desktop\Curricullum\GISBox\GIS_DataFactory\AutomatingGIS_Python\source\Excercises2020\exercise-2\Kruger_posts.shp"
data = gpd.read_file(fp)
# Data cleanup: split the timestamp string into its parts.
# Characters 5-6 are stored as 'day' and 8-9 as 'month', i.e. the raw
# string is treated as YYYY-DD-MM.
data['year'] = data['timestamp'].str.slice(start=0,stop=4)
data['day'] = data['timestamp'].str.slice(start=5,stop=7)
data['month'] = data['timestamp'].str.slice(start=8,stop=10)
data['hour'] = data['timestamp'].str.slice(start=11,stop=13)
data['mins'] = data['timestamp'].str.slice(start=14,stop=16)
# 'sec' is not used when the timestamp is rebuilt below
data['sec'] = data['timestamp'].str.slice(start=18,stop=20)
# Where the parsed month is not a valid month number (> 12),
# replace it with the value stored in 'day'
invalid = data['month'].astype(int) > 12
data.loc[invalid,'month'] = data.loc[invalid,'day']
# Rebuild the timestamp as a string and parse it to datetime (minute precision)
data['TS'] = data['year']+'-'+data['month']+'-'+data['day']+' '+data['hour']+':'+data['mins']
data['timestamp'] = pd.to_datetime(data['TS'],format='%Y-%m-%d %H:%M')
data = data[['lat','lon','timestamp','userid','geometry']]
data.head(100)

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02:00,66487960,POINT (-24.98079 31.48463)
1,-25.499225,31.508906,2015-07-07 03:18:00,65281761,POINT (-25.49922 31.50891)
2,-24.342578,30.930866,2015-07-03 03:38:00,90916112,POINT (-24.34258 30.93087)
3,-24.854614,31.519718,2015-07-10 05:04:00,37959089,POINT (-24.85461 31.51972)
4,-24.921069,31.520836,2015-07-10 05:19:00,27793716,POINT (-24.92107 31.52084)
...,...,...,...,...,...
95,-24.517654,31.165026,2015-08-01 10:36:00,75048721,POINT (-24.51765 31.16503)
96,-24.803247,31.422454,2015-08-03 10:39:00,61381975,POINT (-24.80325 31.42245)
97,-25.486701,30.984525,2015-08-06 10:49:00,41418953,POINT (-25.48670 30.98452)
98,-24.527366,31.116080,2015-08-01 10:55:00,5007076,POINT (-24.52737 31.11608)


- Check the crs of the input data. If this information is missing, set it as epsg:4326 (WGS84).
- Reproject the data from WGS84 to `EPSG:32735`, which stands for UTM Zone 35S (the UTM zone for South Africa), to transform the data into a metric system. (Don't create a new variable, update the existing variable `data`!)
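
The two steps above can be sketched on a tiny synthetic GeoDataFrame. This is an illustration under stated assumptions, not the exercise solution: the variable name `pts` is made up, and the two coordinate pairs are copied from the table printed earlier. Note that shapely's `Point` takes `(x, y)`, which for geographic coordinates means `(lon, lat)`.

```python
import geopandas as gpd
from shapely.geometry import Point

# Two coordinate pairs from the table above, built as Point(lon, lat)
lons = [31.484633, 31.508906]
lats = [-24.980792, -25.499225]
pts = gpd.GeoDataFrame(geometry=[Point(x, y) for x, y in zip(lons, lats)])

# Define the CRS only if it is missing, then reproject to UTM zone 35S
if pts.crs is None:
    pts = pts.set_crs(epsg=4326)
pts = pts.to_crs(epsg=32735)

print(pts.crs.to_epsg())
print(pts.geometry.iloc[0])
```

After reprojecting, a quick sanity check is that the y values (northings) fall in the millions of meters and the x values (eastings) are positive, as expected for a southern-hemisphere UTM zone.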

In [None]:
print(data.crs)
crs = CRS.from_epsg(32735)
data = data.to_crs(crs)
print(data.crs)

epsg:4326
epsg:32735


In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
print(data.head())
print(data.dtypes)

         lat        lon           timestamp    userid  \
0 -24.980792  31.484633 2015-07-07 03:02:00  66487960   
1 -25.499225  31.508906 2015-07-07 03:18:00  65281761   
2 -24.342578  30.930866 2015-07-03 03:38:00  90916112   
3 -24.854614  31.519718 2015-07-10 05:04:00  37959089   
4 -24.921069  31.520836 2015-07-10 05:19:00  27793716   

                            geometry  
0  POINT (-4695752.719 14973674.275)  
1  POINT (-4748939.258 15014098.837)  
2  POINT (-4672729.591 14859391.193)  
3  POINT (-4679391.656 14969037.444)  
4  POINT (-4686373.982 14973910.589)  
lat                 float64
lon                 float64
timestamp    datetime64[ns]
userid                int64
geometry           geometry
dtype: object


In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid
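
A minimal sketch of grouping, using a small made-up table (the `posts` DataFrame and its values are hypothetical). Iterating over a pandas GroupBy object yields one `(key, sub-DataFrame)` pair per user.

```python
import pandas as pd

# Hypothetical posts: three from user 1, one from user 2
posts = pd.DataFrame({
    "userid": [1, 2, 1, 1],
    "timestamp": pd.to_datetime([
        "2015-07-07 03:02", "2015-07-07 03:18",
        "2015-07-03 03:38", "2015-07-10 05:04",
    ]),
})

grouped = posts.groupby("userid")

# Each iteration gives the group key and the rows belonging to it
for userid, group in grouped:
    print(userid, len(group))

# Number of groups, i.e. number of unique users
print(grouped.ngroups)
```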

In [None]:
grouped = data.groupby('userid')
print(len(data))
print(grouped.ngroups)

81379
14990


In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the number of groups:
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Create LineString objects for each user connecting the points from oldest to latest:**

*Suggested steps:*
- Create an empty DataFrame called `movements`. 
- Create an empty column "geometry"
- Use a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - Add the LineString to the geometry column of the `movements` dataframe. You can also add the `userid` in a separate column (or use the userid as index).
- Convert `movements` into a `GeoDataFrame` (you can replace the DataFrame created in the previous steps with the GeoDataFrame). Remember to set the `geometry` column.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 
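
The suggested steps could be sketched roughly as follows on synthetic data. Several things here are assumptions for illustration: the `posts` table and its values are made up, the points are assumed to be in EPSG:32735 already, and collecting records in a list before building the GeoDataFrame is swapped in for filling an empty DataFrame row by row.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, LineString

# Hypothetical points, assumed to be already projected to EPSG:32735 (meters)
posts = gpd.GeoDataFrame(
    {
        "userid": [1, 1, 1, 2, 2, 3],
        "timestamp": pd.to_datetime([
            "2015-07-03", "2015-07-01", "2015-07-02",
            "2015-07-01", "2015-07-02", "2015-07-01",
        ]),
    },
    geometry=[Point(0, 10), Point(0, 0), Point(0, 4),
              Point(5, 5), Point(8, 9), Point(1, 1)],
    crs="EPSG:32735",
)

records = []
for userid, group in posts.groupby("userid"):
    # Order each user's posts from oldest to latest
    group = group.sort_values("timestamp")
    # A LineString needs at least two points, so single-post users are skipped
    if len(group) < 2:
        continue
    coords = [(p.x, p.y) for p in group["geometry"]]
    records.append({"userid": userid, "geometry": LineString(coords)})

movements = gpd.GeoDataFrame(records, geometry="geometry", crs="EPSG:32735")
print(movements)
```

Because the lines are built from the projected point geometries, their coordinates stay in the same units (meters) as the CRS assigned to `movements`.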

In [7]:
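movements = pd.DataFrame()
movements['geometry'] = None
i = 0 
for name,group in grouped:
    # Order this user's posts from oldest to latest
    df = group.sort_values('timestamp')
    # Points are rebuilt here from the 'lat' and 'lon' columns (decimal degrees),
    # so the line coordinates are in degrees, not in the projected CRS
    points = [Point(lat,lon) for lat,lon in zip(df['lat'],df['lon'])]
    # A LineString needs at least two points; users with a single post are skipped
    if  len(points) > 1:
        movements.at[i,'geometry'] = LineString(points)
    i = i+1   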



In [8]:
movements = gpd.GeoDataFrame(movements,geometry='geometry')
movements.crs = CRS(32735)
movements.tail(100)
#print(movements.crs)
#type(movements)

Unnamed: 0,geometry
14811,"LINESTRING (-25.035 31.114, -25.035 31.114)"
14814,"LINESTRING (-25.442 31.995, -25.449 31.992, -2..."
14815,"LINESTRING (-24.972 31.531, -24.972 31.531)"
14817,"LINESTRING (-24.521 31.112, -24.350 30.967)"
14818,"LINESTRING (-25.082 31.098, -25.082 31.098, -2..."
...,...
14978,"LINESTRING (-25.285 30.990, -25.295 31.011, -2..."
14980,"LINESTRING (-24.993 31.593, -24.993 31.592, -2..."
14984,"LINESTRING (-24.305 31.322, -24.305 31.322)"
14986,"LINESTRING (-24.299 31.293, -24.276 31.299)"


**Finally:**
- Check once more the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)
- Calculate the lengths of the lines into a new column called ``distance`` in the ``movements`` GeoDataFrame.
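
As a hedged sketch of the length calculation: the `length` attribute of a GeoSeries returns one float per geometry, measured in the units of the coordinates (meters only if the lines are in a metric CRS). The `lines` GeoDataFrame below is synthetic and its values are made up.

```python
import geopandas as gpd
from shapely.geometry import LineString

# Hypothetical lines in a metric CRS (coordinates in meters)
lines = gpd.GeoDataFrame(
    geometry=[LineString([(0, 0), (3, 4)]),
              LineString([(0, 0), (0, 1.5), (2, 1.5)])],
    crs="EPSG:32735",
)

# Assigning the whole column at once keeps the float dtype of the lengths
lines["distance"] = lines.geometry.length
print(lines["distance"].tolist())
```

Assigning the column in one step also avoids creating it first with an integer placeholder, which would give the column an integer dtype.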

In [15]:
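# The column is initialised with the integer 0, so it gets an integer dtype
# (the stored values shown below are whole numbers)
movements['distance']=0
for i in movements.index:
    # .length is measured in the units of the line coordinates
    movements.at[i,'distance'] = (movements.at[i,'geometry']).length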

In [16]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

Unnamed: 0,geometry,distance
0,"LINESTRING (-24.996 31.592, -24.791 31.865, -2...",2
4,"LINESTRING (-25.321 31.026, -25.321 31.026)",0
6,"LINESTRING (-24.993 31.593, -24.770 31.394, -2...",1
12,"LINESTRING (-25.329 31.000, -25.329 31.000)",0
13,"LINESTRING (-25.067 31.551, -24.993 31.593)",0


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?
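
One possible way to print the three answers, sketched on a made-up Series (the values are hypothetical and not results from the Kruger data):

```python
import pandas as pd

# Hypothetical distances in meters
distance = pd.Series([0.0, 120.5, 3400.0, 57.5], name="distance")

print(f"Shortest: {distance.min():.1f} m")
print(f"Mean:     {distance.mean():.1f} m")
print(f"Longest:  {distance.max():.1f} m")
```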

In [17]:
movements.describe()

Unnamed: 0,distance
count,9026.0
mean,0.699424
std,2.80003
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,64.0


- Finally, save the movements into a Shapefile called ``some_movements.shp``

In [21]:
fp = r'some_movements.shp'
movements.to_file(fp)

In [23]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

import os

#Check if output file exists
assert os.path.isfile(fp), "Output file does not exist."

That's all for this week!