## In September 2019, Metro Nashville came to our class at Nashville Software School and proposed that we help them analyze scooter usage data for Nashville. The objective was to recommend how the city should craft legislation to best monitor and coordinate scooter usage across the city.

## Among other things, we were tasked with addressing scooter density, which companies to work with, and how to increase utilization of the scooters in the "Promise Zone", an area of Nashville with low income and high unemployment.

## The team that I led was tasked with developing a density solution under which each scooter would theoretically be used for at least three rides per day across all utilization areas. After working through different approaches, my team landed on a zip code density solution that we presented to Metro. It was warmly received as an approach that they had not considered.

# <font color = red>PART 1: Working from the original data</font>

## Initially we were given three months' worth of data, covering May, June, and July of 2019. These data sets were very large, at over 60 million lines of data combined. My goal was to work through the data and remove as many inaccurate and incomplete records as possible.

## The majority of the content removed was scooter trips that were not trips at all (stationary location pings by the companies), and errant trips that may have been associated with transport, charging, etc. 

## Because the files are so large, I have included a screenshot below to showcase the size of the data being worked with for only one month.

<img src="images/one_month_data_read.png">

## To start, I removed specific content that was not helpful for the analysis.

#### <font color = "blue">Removing the rows that contain the word "bicycle"</font>

In [1]:
# may_scooters_df = may_scooters_df[~may_scooters_df.sumdgroup.str.contains("bicycle")]
# june_scooters_df = june_scooters_df[~june_scooters_df.sumdgroup.str.contains("bicycle")]
# july_scooters_df = july_scooters_df[~july_scooters_df.sumdgroup.str.contains("bicycle")]
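## As a hedged aside, the same filter can be written so that it tolerates missing values and capitalization differences. This is a minimal sketch, not the notebook's code: the sample frame and the `drop_bicycles` helper are hypothetical, and only the `sumdgroup` and `sumdid` column names come from the cells above. Whether the real data contains missing or capitalized group labels is an assumption.

```python
import pandas as pd

# Hypothetical miniature of the trip data, for illustration only.
sample_df = pd.DataFrame({
    "sumdid": ["PoweredA", "PoweredB", "StandardC", "PoweredD"],
    "sumdgroup": ["scooter", "bicycle", "Bicycle", None],
})

def drop_bicycles(df):
    """Return a copy of df without rows whose sumdgroup mentions 'bicycle'."""
    # na=False treats missing group labels as "not a bicycle" so the mask
    # stays boolean; case=False also catches "Bicycle".
    is_bicycle = df["sumdgroup"].str.contains("bicycle", case=False, na=False)
    return df[~is_bicycle].copy()

scooters_only = drop_bicycles(sample_df)
print(scooters_only)
```

## Returning a copy keeps later column assignments from triggering pandas' chained-assignment warnings.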

## After determining that the data set was too large to analyze meaningfully as a whole, I decided to look at a single scooter ID instead, so that I could find trends or other filtering parameters that could then be applied to the entire data set.

In [3]:
# singlescooter = may_scooters_df.loc[may_scooters_df.sumdid == 'PoweredLIRL1']
# singlescooter.head()

<img src="images/one_scooter_info.png">

## Once I had the smaller data set, I wanted to filter the scooter trips by distance traveled and time elapsed so that only actual trips were being analyzed.

In [8]:
# singlescooter['pubdatetime'] = pd.to_datetime(singlescooter['pubdatetime'])
# singlescooter['date'] = singlescooter['pubdatetime'].dt.date
# singlescooter['time'] = singlescooter['pubdatetime'].dt.time
# singlescooter['date'] = pd.to_datetime(singlescooter['date'])
# singlescooter['elapsed_time'] = singlescooter.pubdatetime.diff()

<img src="images/elapsed_time.png">

<img src="images/adding_geometry.png">

<img src="images/geo_dataframe.png">

<img src="images/adding_dist.png">
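## The geometry and distance steps appear only as screenshots, so here is a hedged sketch of one way to get a per-ping distance in meters. It uses the haversine formula on raw coordinates, which may differ from the method in the screenshots. The `latitude` and `longitude` column names, the sample coordinates, and the `haversine_m` helper are assumptions for illustration; `pubdatetime` and `dist` follow the names used in this notebook.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Hypothetical pings for one scooter (coordinates are made up).
pings = pd.DataFrame({
    "pubdatetime": pd.to_datetime(
        ["2019-05-01 08:00", "2019-05-01 08:05", "2019-05-01 08:20"]),
    "latitude": [36.1627, 36.1627, 36.1717],
    "longitude": [-86.7816, -86.7816, -86.7816],
})

# Order the pings in time, then measure each ping against the previous one.
pings = pings.sort_values("pubdatetime").reset_index(drop=True)
pings["dist"] = haversine_m(
    pings["latitude"].shift(), pings["longitude"].shift(),
    pings["latitude"], pings["longitude"],
)
print(pings)
```

## The first ping has no predecessor, so its distance is missing; a stationary ping gets a distance of zero, which is what the minimum-distance filter is meant to catch.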

## I then started filtering the data with a minimum and maximum distance guideline, while also creating new columns that calculate the time between two "trips" to determine whether a trip is valid.

In [10]:
# singlescooter_filter = singlescooter_geo[~(singlescooter_geo['dist'] < 10)].copy()
# # singlescooter_geo[(singlescooter_geo['dist'] >= 10) & (singlescooter_geo['dist'] <= 10000)]
# singlescooter_filter['time_between'] = pd.to_timedelta(singlescooter_filter['elapsed_time'].astype(str))
# singlescooter_filter['mins_elapsed']= singlescooter_filter['time_between'].dt.total_seconds()/60
# singlescooter_filter.head(2)

<img src="images/mins_elapsed.png">

## Once these new columns were established, I ran my final filter over the data, removing any trip that traveled less than 10 meters, lasted less than six minutes, or lasted more than 180 minutes.

In [11]:
# singlescooter_filter.loc[(singlescooter_filter["dist"] >= 10) & 
#                          (singlescooter_filter["mins_elapsed"] >= 6) & 
#                          (singlescooter_filter["mins_elapsed"] <= 180)]
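## The same thresholds can be collected into a small reusable function so they are easy to adjust. This is a sketch under assumptions rather than the notebook's code: the `keep_valid_trips` helper, the constant names, and the toy rows are hypothetical, while the `dist` and `mins_elapsed` columns and the 10 meter, 6 minute, and 180 minute limits come from the text above. The sketch treats "at least 10 meters" as inclusive.

```python
import pandas as pd

# Thresholds described in the text above.
MIN_DIST_M = 10
MIN_MINUTES = 6
MAX_MINUTES = 180

def keep_valid_trips(df, min_dist=MIN_DIST_M,
                     min_minutes=MIN_MINUTES, max_minutes=MAX_MINUTES):
    """Return only the rows that look like real trips."""
    mask = (
        (df["dist"] >= min_dist)
        # between() is inclusive on both ends by default.
        & df["mins_elapsed"].between(min_minutes, max_minutes)
    )
    return df[mask].copy()

# Hypothetical rows: stationary ping, tiny move, too short, valid, too long.
toy = pd.DataFrame({
    "dist": [0.0, 5.0, 250.0, 1200.0, 900.0],
    "mins_elapsed": [5.0, 12.0, 3.0, 14.0, 400.0],
})
valid = keep_valid_trips(toy)
print(valid)
```

## Only the 1200 meter, 14 minute row passes all three conditions in this toy example.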

## This final filtering removed over 97% of the data, giving a better idea of what data is actually useful for the end analysis.

<img src="images/single_final.png">
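## To carry the single-scooter approach over to a full month, the elapsed time has to be computed per scooter so that one scooter's pings are not compared with another's. The sketch below shows one possible way to do that and to report the share of rows removed. The helper names and sample pings are hypothetical; only `sumdid`, `pubdatetime`, `mins_elapsed`, and the 6 and 180 minute limits come from this notebook, and the distance condition is left out for brevity.

```python
import pandas as pd

def add_elapsed_minutes(df):
    """Add minutes since the previous ping of the same scooter."""
    out = df.sort_values(["sumdid", "pubdatetime"]).copy()
    # diff() within each sumdid group never spans two different scooters.
    elapsed = out.groupby("sumdid")["pubdatetime"].diff()
    out["mins_elapsed"] = elapsed.dt.total_seconds() / 60
    return out

def share_removed(before, after):
    """Fraction of rows dropped by a filter."""
    return 1 - len(after) / len(before)

# Hypothetical pings from two scooters, deliberately out of order.
pings = pd.DataFrame({
    "sumdid": ["PoweredA", "PoweredB", "PoweredA", "PoweredB", "PoweredA"],
    "pubdatetime": pd.to_datetime([
        "2019-05-01 08:00", "2019-05-01 08:01", "2019-05-01 08:10",
        "2019-05-01 08:03", "2019-05-01 12:30",
    ]),
})

timed = add_elapsed_minutes(pings)
kept = timed[timed["mins_elapsed"].between(6, 180)]
print(f"Removed {share_removed(timed, kept):.0%} of rows")
```

## Each scooter's first ping has no elapsed time and is dropped by the filter, which is one reason the removed share can be high.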