## About the Data

Filling in null values (missing station names) in the Divvy bikes dataset using station coordinates.

Download the data from [Google Drive](https://drive.google.com/file/d/1Hiz34DeaEZUPocs8pe9zabngCHAZQKNZ/view?usp=sharing)

In [1]:
# load libraries
import pandas as pd
from glob import glob
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Merge csv files: trips table

The CSV files are in the `data/divvy/` directory, next to the Python notebook.

In [None]:
data = sorted(glob('data/divvy/*.csv'))

In [None]:
# read every trips file and stack them into a single DataFrame
df = pd.concat(
    (pd.read_csv(file) for file in data),
    ignore_index=True,
)
df
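
As a small illustration of the merge step, the sketch below concatenates two in-memory CSVs in the same way and tags each row with the file it came from. The `source_file` column, the file names, and the sample rows are assumptions made up for the example; they are not part of the original notebook.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs standing in for the trip files (hypothetical data)
files = {
    "trips_a.csv": "ride_id,member_casual\nA1,member\nA2,casual\n",
    "trips_b.csv": "ride_id,member_casual\nB1,member\n",
}

merged = pd.concat(
    (pd.read_csv(io.StringIO(text)).assign(source_file=name)
     for name, text in sorted(files.items())),
    ignore_index=True,  # rebuild a single 0..n-1 index across all files
)
print(merged)
```

Keeping the origin of each row can make it easier to trace odd values back to a specific file later on.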

## Import Stations table 

In [97]:
# Stations table
stations = pd.read_csv('data/Stations.csv')

## Preparing the data

### Trips table

In [91]:
df.isna().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    785465
start_station_id      786088
end_station_name      850051
end_station_id        850512
start_lat                  0
start_lng                  0
end_lat                 9026
end_lng                 9026
member_casual              0
dtype: int64

In [109]:
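# Keep rows where the start station name is missing, or where the
# end station name is missing while the end coordinates are available
mask = df['start_station_name'].isna() | (df['end_station_name'].isna() & df['end_lat'].notna())

df_filt = df[mask]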

Xs = df_filt[ ['start_lat', 'start_lng'] ]
Xe = df_filt[ ['end_lat', 'end_lng'] ]
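
To make the filter condition concrete, here is a minimal sketch on a toy frame. The four rows are invented for the example; only the column names match the trips table.

```python
import numpy as np
import pandas as pd

# Toy trips table (hypothetical values)
toy = pd.DataFrame({
    "start_station_name": ["A", None, "C", "D"],
    "end_station_name":   ["X", "Y", None, None],
    "end_lat":            [41.9, 41.8, 41.7, np.nan],
})

# Same condition as the filter cell above
mask = toy["start_station_name"].isna() | (
    toy["end_station_name"].isna() & toy["end_lat"].notna()
)
print(toy[mask].index.tolist())  # [1, 2]
```

Row 3 is dropped because its end station name is missing and there are no end coordinates to predict from. Note that a row with a missing start name would pass the filter even if its end coordinates were null, since the `notna()` check applies only to the second half of the condition.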

In [110]:
df_filt.isna().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    785465
start_station_id      785564
end_station_name      841025
end_station_id        841082
start_lat                  0
start_lng                  0
end_lat                    0
end_lng                    0
member_casual              0
dtype: int64

### Stations table

In [98]:
stations.isnull().sum()

id             6
name           0
docks         15
in_service    30
latitude       5
longitude      5
coordinate     5
dtype: int64

In [99]:
# drop rows where coordinate column is null
stations.dropna(subset = ['coordinate'], inplace = True)

# For input
X = stations[ ['latitude', 'longitude'] ]
# For output
y = stations['name']

In [102]:
stations.isnull().sum()

id             6
name           0
docks         10
in_service    25
latitude       0
longitude      0
coordinate     0
dtype: int64

## Learning and Predicting

### Learning

In [111]:
X = stations[ ['latitude', 'longitude'] ]
y = stations['name']

Xs = df_filt[ ['start_lat', 'start_lng'] ]
Xe = df_filt[ ['end_lat', 'end_lng'] ]

model = DecisionTreeClassifier()
model.fit(X.values, y)

### Predicting

In [112]:
# start station (pass plain arrays, matching how the model was fitted)
s = model.predict(Xs.values)
# end station
e = model.predict(Xe.values)



In [114]:
df_filt['start_station_name'] = s
df_filt['end_station_name'] = e

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filt['start_station_name'] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filt['end_station_name'] = e
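
The warning appears because `df_filt` is a slice of `df`. The assignment above also replaces every name in `df_filt`, including names that were already present. A minimal sketch of one way around both issues, assuming an explicit `.copy()` is acceptable and that only missing names should be filled; the toy frame and the `predicted` array are stand-ins, not output from the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the trips table and for the output of model.predict
trips = pd.DataFrame({
    "start_station_name": ["Known St", None, None],
    "start_lat": [41.88, 41.90, 41.94],
})
predicted = np.array(["Pred 0", "Pred 1", "Pred 2"])

# An explicit copy makes the subset independent of the original frame,
# so assigning to it no longer triggers the warning
subset = trips[trips["start_lat"] > 0].copy()

# Fill only the rows where the name is actually missing
missing = subset["start_station_name"].isna()
subset.loc[missing, "start_station_name"] = predicted[missing.to_numpy()]

print(subset["start_station_name"].tolist())  # ['Known St', 'Pred 1', 'Pred 2']
```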


In [115]:
df_filt.isnull().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name         0
start_station_id      785564
end_station_name           0
end_station_id        841082
start_lat                  0
start_lng                  0
end_lat                    0
end_lng                    0
member_casual              0
dtype: int64

In [117]:
df_filt

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
630367,60742256DFFFCA29,electric_bike,2020-07-31 08:30:34,2020-07-31 08:57:54,Rockwell St & Archer Ave,,Wallace St & 35th St,367.0,41.900000,-87.690000,41.830704,-87.656085,member
642029,EBBD4FE9C8A95116,electric_bike,2020-07-29 19:02:25,2020-07-29 19:22:40,Rockwell St & Archer Ave,,Western Ave & Walton St,374.0,41.900000,-87.690000,41.898404,-87.686592,member
653657,976336C6499A7189,electric_bike,2020-07-30 22:02:45,2020-07-30 22:17:54,Clarendon Ave & Leland Ave,,Broadway & Sheridan Rd,251.0,41.940000,-87.650000,41.967841,-87.649991,member
653658,9A2F60AEB9CABA6A,electric_bike,2020-07-30 21:46:34,2020-07-30 21:55:33,Wilton Ave & Belmont Ave,117.0,Clarendon Ave & Leland Ave,,41.940119,-87.653015,41.940000,-87.650000,member
670853,EC549CDABDE45F98,electric_bike,2020-07-31 15:54:41,2020-07-31 16:00:35,California Ave & Francis Pl Temp,,Lincoln Ave & Belle Plaine Ave,,41.920000,-87.700000,41.910000,-87.680000,member
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8709850,08E5EC2EC583D230,electric_bike,2021-12-17 07:55:47,2021-12-17 08:03:45,Canal St & Madison St,13341,Fairbanks Ct & Grand Ave,,41.882002,-87.639457,41.890000,-87.620000,casual
8709852,DFE48801A70DFEA7,electric_bike,2021-12-23 21:28:41,2021-12-23 21:36:27,Clinton St & Madison St,13341,Franklin St & Monroe St,,41.882197,-87.639226,41.880000,-87.650000,casual
8709853,92BBAB97D1683D69,electric_bike,2021-12-24 15:42:09,2021-12-24 19:29:35,Green St & Madison St,13341,Franklin St & Monroe St,,41.881800,-87.639970,41.880000,-87.640000,casual
8709854,847431F3D5353AB7,electric_bike,2021-12-12 13:36:55,2021-12-12 13:56:08,Clinton St & Madison St,13341,Lake Park Ave & 35th St,,41.882289,-87.639752,41.890000,-87.610000,casual


## Confirming accuracy

Checking the latitude and longitude of a trip labelled `Rockwell St & Archer Ave` against the Stations table, and looking both up in `Google Maps`, shows that they point to different locations.

This approach failed. The next step is to dig deeper and explore alternative ways of matching coordinates to station names.
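
One alternative worth exploring is a nearest-station lookup by great-circle distance. A decision tree fitted on station coordinates splits the map on latitude and longitude thresholds, so a point can land in a leaf whose station is not the closest one. The sketch below is an assumption about what a different approach could look like, not the notebook's method; the function names `haversine_km` and `nearest_station` and the three stations are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def nearest_station(lat, lng, stations):
    """Return (name, distance_km) of the closest station.

    `stations` is a list of (name, latitude, longitude) tuples.
    """
    return min(
        ((name, haversine_km(lat, lng, s_lat, s_lng)) for name, s_lat, s_lng in stations),
        key=lambda pair: pair[1],
    )

# Hypothetical stations, not taken from the Stations table
stations = [
    ("Station A", 41.90, -87.69),
    ("Station B", 41.82, -87.69),
    ("Station C", 41.94, -87.65),
]

name, dist = nearest_station(41.901, -87.688, stations)
print(name, round(dist, 2))
```

Returning the distance alongside the name makes it possible to leave a row unfilled when the closest station is too far away. This matters here because several of the sample rows above have coordinates such as 41.90, -87.69, which appear to be rounded to two decimals, so even the nearest match may be off by several hundred metres.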