In [None]:
%run './feature_engineering.ipynb'


## [Unsupervised Learning] Predictive Analysis (K-means Clustering)

Using the added features, I'll apply K-means clustering to find the most popular source stations by time of day.

The steps involve:
#### Data Preparation:

* Convert the categorical features to numeric indices (`StringIndexer`).
* Potentially use `OneHotEncoder`; since we already added the categorical fields for time of day and day of week, we may skip this step.
* Vectorize the features (`VectorAssembler`).
* Apply PCA to inspect the relationships between the chosen features, for correlation and variance.
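
To make the indexing step concrete, here is a plain-Python sketch of frequency-ordered label indexing, which is the default ordering `StringIndexer` uses (most frequent label gets index 0). It is an illustration of the idea only, not Spark's implementation; the helper `fit_label_index` and the sample values are hypothetical.

```python
from collections import Counter


def fit_label_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties in frequency are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of the day-period feature
day_periods = ["Morning", "Evening", "Morning", "Night", "Evening", "Morning"]

day_period_index = fit_label_index(day_periods)
indexed = [day_period_index[p] for p in day_periods]

print(day_period_index)
print(indexed)
```

The resulting indices are arbitrary codes rather than quantities, which is worth keeping in mind when deciding whether to one-hot encode them before a distance-based method.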

#### Clustering:

Run an initial K-means clustering with k=5 clusters as a starting point, then use the silhouette score to look for the optimal number of clusters.
Compare and plot the results.
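
As a rough illustration of how the silhouette score ranks candidate values of k, the sketch below computes the mean silhouette in plain Python using squared Euclidean distance (the default distance measure of Spark's `ClusteringEvaluator`). The points and the hand-written labelings are hypothetical stand-ins for the cluster assignments that fitting K-means for each k would produce.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_silhouette(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Convention: a point alone in its cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(sq_dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]

# Hypothetical labelings, keyed by the number of clusters they use
labelings = {
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
    4: [0, 1, 2, 3],
}

scores = {k: mean_silhouette(points, labels) for k, labels in labelings.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Here the two-cluster labeling scores highest, matching the structure of the toy data. In the notebook, the same comparison would loop over k, fit a model for each value, and evaluate the predictions.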

Proceed with k-means for some specific questions:

* Apply K-means clustering to find the most popular source/destination stations - Plot
* Apply K-means for most popular stations by time of day/day of week - Plot
* Most popular routes by day of week - Plot
* Most popular routes by day of week/time of day - Plot

In a real-world scenario, the period of the day (Morning, Afternoon, Evening, and Night) is not granular enough. To make predictions accurate enough to confidently allocate employees to relocate bikes, we should consider shorter time windows, such as 15-minute intervals.
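
A minimal sketch of the finer-grained bucketing, assuming trip start times are available as Python `datetime` values. The helper names `floor_to_interval` and `interval_index` are hypothetical, and the sketch assumes the interval length divides 60 evenly.

```python
from datetime import datetime


def floor_to_interval(ts, minutes=15):
    """Round a timestamp down to the start of its interval (minutes must divide 60)."""
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def interval_index(ts, minutes=15):
    """Zero-based index of the interval within the day (0..95 for 15-minute windows)."""
    return (ts.hour * 60 + ts.minute) // minutes


# Hypothetical trip start time
started_at = datetime(2023, 5, 1, 8, 47, 30)

print(floor_to_interval(started_at))
print(interval_index(started_at))
```

The interval index could replace the four-valued day period as a feature, at the cost of 96 possible values per day instead of 4.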

## [Supervised Learning] Predictive Analysis (Regression)

Explore a predictive analysis using regression models to improve the user experience by refining the restocking strategy: consider the bike type per route to predict the number of bikes needed at each station at different times of the day.

* Apply regression models to predict the number of bikes needed at each station at different times of the day.
* Factor in the bike type per route when predicting the number of bikes needed at each station at different times of the day.
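
Before fitting a regression model, the target has to be built from the trip records. The sketch below is one possible way to do that in plain Python: it counts departures per day, station, and day period, then averages across days as a naive baseline to compare a regression model against. Treating departures as a proxy for "bikes needed" is an assumption, and the sample trips are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical trips: (date, start_station_id, day_period)
trips = [
    ("2023-05-01", "A", "Morning"),
    ("2023-05-01", "A", "Morning"),
    ("2023-05-01", "B", "Evening"),
    ("2023-05-02", "A", "Morning"),
    ("2023-05-02", "A", "Morning"),
    ("2023-05-02", "A", "Morning"),
    ("2023-05-02", "A", "Morning"),
]

# Regression target: number of departures per (date, station, period)
daily_counts = Counter(trips)

# Naive baseline: mean daily departures per (station, period).
# Caveat: days with no trips for a group are absent from the counts,
# so this mean only covers days on which the group had at least one trip.
grouped = defaultdict(list)
for (date, station, period), n in daily_counts.items():
    grouped[(station, period)].append(n)

baseline = {key: sum(ns) / len(ns) for key, ns in grouped.items()}
print(baseline)
```

Adding the bike type to the grouping key would give the per-type targets mentioned above.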


In [None]:
# Select the indexed features to use for PCA and clustering
features = ['start_station_id_index', 'end_station_id_index', 'day_period_index', 'week_day_index']

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler

# Assemble the selected columns into a single feature vector
assembler = VectorAssembler(inputCols=features, outputCol="features")

# Scale each feature to unit standard deviation (without centering)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=False)

# Combine the VectorAssembler and StandardScaler into a single Pipeline
pipeline = Pipeline(stages=[assembler, scaler])

# Fit the pipeline and transform the DataFrame
sampled_df_scaled = pipeline.fit(sampled_df_with_added_features_indexed).transform(
    sampled_df_with_added_features_indexed)

In [None]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

adf_kmeans = sampled_df_scaled

# Initialize KMeans with the number of clusters (k) and a seed for reproducibility.
# Note: this clusters on the unscaled "features" column; "scaled_features" is not used here.
k = 5
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")

# Fit the model to the data
model = kmeans.fit(adf_kmeans)

# Transform the dataset to include cluster predictions
predictions = model.transform(adf_kmeans)

# Initialize the evaluator (its default metric is the silhouette score)
evaluator = ClusteringEvaluator()

# Evaluate the model
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette Score with k={k}: {silhouette}")
# Silhouette Score with k=5: 0.5733339010164845

In [None]:
# TODO - Apply K-means clustering to find the most popular source stations - Plot
# TODO - Can I apply K-means for most popular stations by time of day? - Plot
# TODO - Most popular routes - Plot
