# Ultimate Data Science Challenge - Jupyter Notebook

This notebook contains the Python code and analysis for the Ultimate Data Science Challenge, divided into three parts: Exploratory Data Analysis of User Logins, Experiment and Metrics Design, and Predictive Modeling for Rider Retention.

## Part 1: Exploratory Data Analysis of User Logins

This section focuses on analyzing the `logins.json` file to understand user login patterns. The code aggregates login counts into 15-minute intervals and visualizes daily and weekly cycles.
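
As a quick illustration of the 15-minute aggregation, the sketch below runs the same `resample("15min").size()` call on a handful of made-up timestamps. The toy data and the names `toy` and `toy_counts` are assumptions for illustration only and are not taken from `logins.json`.

```python
import pandas as pd

# Hypothetical toy timestamps standing in for the contents of logins.json
toy = pd.DataFrame({"login_time": pd.to_datetime([
    "1970-01-01 20:01:00",
    "1970-01-01 20:14:59",
    "1970-01-01 20:15:00",
    "1970-01-01 20:50:00",
])})

# Bin the timestamps into 15-minute windows (left-closed, labelled by the
# window start) and count the rows that fall into each window
toy_counts = toy.set_index("login_time").resample("15min").size()
print(toy_counts)
```

Note that windows containing no logins still appear in the result with a count of 0, which matters when averaging counts by hour or by weekday.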

In [None]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns

# Load the logins.json file
with open("/home/ubuntu/upload/logins.json", "r") as f:
    logins_data = json.load(f)

logins_df = pd.DataFrame(logins_data)
logins_df["login_time"] = pd.to_datetime(logins_df["login_time"])
logins_df = logins_df.set_index("login_time")

# Aggregate login counts based on 15-minute time intervals
login_counts = logins_df.resample('15min').size().reset_index(name='count')
login_counts.columns = ["login_time", "count"]

# Save the aggregated data to a CSV file (for later use in visualization)
login_counts.to_csv("/home/ubuntu/aggregated_login_counts.csv", index=False)

print('Aggregated login counts saved to aggregated_login_counts.csv')

# Plotting the time series of login counts
plt.figure(figsize=(15, 7))
# Use the timestamp column for the x-axis; after reset_index() the index is just 0..N-1
plt.plot(login_counts["login_time"], login_counts["count"])
plt.title("Login Counts Over Time (15-minute intervals)")
plt.xlabel("Time")
plt.ylabel("Login Count")
plt.grid(True)
plt.tight_layout()
plt.savefig("/home/ubuntu/login_counts_over_time.png")
plt.close()

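# Plotting daily cycles (average logins per 15-minute interval, grouped by hour of day)
# login_time is a regular column after reset_index(), so use the .dt accessor
login_counts["hour"] = login_counts["login_time"].dt.hour
daily_cycle = login_counts.groupby("hour")["count"].mean()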

plt.figure(figsize=(10, 6))
sns.barplot(x=daily_cycle.index, y=daily_cycle.values, palette="viridis")
plt.title("Average Login Counts by Hour of Day")
plt.xlabel("Hour of Day")
plt.ylabel("Average Login Count")
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("/home/ubuntu/average_login_counts_by_hour.png")
plt.close()

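# Plotting weekly cycles (average logins per 15-minute interval, grouped by day of week)
# login_time is a regular column after reset_index(), so use the .dt accessor
login_counts["day_of_week"] = login_counts["login_time"].dt.dayofweek  # Monday=0, Sunday=6
weekly_cycle = login_counts.groupby("day_of_week")["count"].mean()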

plt.figure(figsize=(10, 6))
sns.barplot(x=weekly_cycle.index, y=weekly_cycle.values, palette="magma")
plt.title("Average Login Counts by Day of Week")
plt.xlabel("Day of Week (0=Monday, 6=Sunday)")
plt.ylabel("Average Login Count")
plt.grid(axis="y")
plt.tight_layout()
plt.savefig("/home/ubuntu/average_login_counts_by_day_of_week.png")
plt.close()

print("Visualizations saved as PNG files.")


## Part 2: Experiment and Metrics Design

This section is conceptual and does not involve code execution. It outlines the design of an experiment to encourage driver partners to serve both Gotham and Metropolis by reimbursing toll costs. It defines a key measure of success, designs a practical experiment, and outlines the statistical tests and interpretation of results.
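
This part does not name a specific test, so the following is only a sketch of one possible analysis: a one-sided permutation test on the difference in means between a control group and a toll-reimbursed group. The metric (the share of each driver's trips served in the other city), the function name `permutation_test`, and all of the numbers are hypothetical and chosen purely for illustration.

```python
import random

def permutation_test(control, treatment, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(treatment) > mean(control)."""
    rng = random.Random(seed)
    n_control = len(control)
    n_treatment = len(treatment)
    observed = sum(treatment) / n_treatment - sum(control) / n_control

    pooled = list(control) + list(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels by shuffling the pooled values
        rng.shuffle(pooled)
        diff = (sum(pooled[n_control:]) / n_treatment
                - sum(pooled[:n_control]) / n_control)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up per-driver values of the hypothetical metric
control = [0.05, 0.10, 0.00, 0.08, 0.12, 0.03, 0.07, 0.06]
treatment = [0.30, 0.25, 0.40, 0.18, 0.35, 0.28, 0.22, 0.33]

observed_diff, p_value = permutation_test(control, treatment)
print(f"Observed difference in means: {observed_diff:.3f}")
print(f"Estimated one-sided p-value: {p_value:.4f}")
```

A permutation test is used here because it makes no distributional assumption about the metric; with larger samples, a t-test or a two-proportion z-test would be reasonable alternatives.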

## Part 3: Predictive Modeling for Rider Retention

This section focuses on predicting rider retention using the `ultimate_data_challenge.json` dataset. It covers data cleaning, exploratory analysis, feature engineering, model building (Random Forest Classifier), and evaluation.
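
The retention label used in this part marks a rider as retained when their last trip falls within 30 days of the latest `last_trip_date` in the data. A minimal sketch of that rule on four made-up riders (the dates and the `days_inactive` name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical last-trip dates for four riders
toy = pd.DataFrame({"last_trip_date": pd.to_datetime(
    ["2014-07-01", "2014-06-01", "2014-05-31", "2014-01-15"]
)})

# The reference point is the most recent trip date seen in the data
latest_date = toy["last_trip_date"].max()

# Whole days between each rider's last trip and the reference point
days_inactive = (latest_date - toy["last_trip_date"]).dt.days

# Retained means a trip within the preceding 30 days (boundary included)
toy["retained"] = days_inactive <= 30
print(toy)
```

The `<=` comparison makes the boundary inclusive, so a rider whose last trip was exactly 30 days before the latest date still counts as retained.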

In [None]:
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load the ultimate_data_challenge.json file
with open("/home/ubuntu/upload/ultimate_data_challenge.json", "r") as f:
    challenge_data = json.load(f)

challenge_df = pd.DataFrame(challenge_data)

# Convert date columns to datetime objects
challenge_df["signup_date"] = pd.to_datetime(challenge_df["signup_date"])
challenge_df["last_trip_date"] = pd.to_datetime(challenge_df["last_trip_date"])

# Define retention: a user counts as retained if their last trip falls within the
# 30 days preceding the latest last_trip_date in the dataset
latest_date = challenge_df["last_trip_date"].max()
challenge_df["retained"] = (latest_date - challenge_df["last_trip_date"]).dt.days <= 30

# Handle missing values
# For avg_rating_of_driver and avg_rating_by_driver, fill NaN with the column mean.
# Assign the result back rather than calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may leave the DataFrame unchanged.
for col in ["avg_rating_of_driver", "avg_rating_by_driver"]:
    challenge_df[col] = challenge_df[col].fillna(challenge_df[col].mean())

# For phone, fill any missing values with the most frequent value (the mode)
if challenge_df["phone"].isnull().any():
    challenge_df["phone"] = challenge_df["phone"].fillna(challenge_df["phone"].mode()[0])

# Feature Engineering (example: days since signup, days since last trip)
challenge_df["days_since_signup"] = (latest_date - challenge_df["signup_date"]).dt.days
challenge_df["days_since_last_trip"] = (latest_date - challenge_df["last_trip_date"]).dt.days

# Convert categorical features to numerical using one-hot encoding
challenge_df = pd.get_dummies(challenge_df, columns=["city", "phone"], drop_first=True)

# Save the preprocessed data (for later use in modeling)
challenge_df.to_csv("/home/ubuntu/preprocessed_challenge_data.csv", index=False)

print("Data preprocessing complete. Preprocessed data saved to preprocessed_challenge_data.csv")

# Define features (X) and target (y)
# IMPORTANT: 'retained' is computed directly from 'days_since_last_trip', so keeping that
# column leaks the target. A truly predictive model should exclude it. It is included here
# only to demonstrate the impact of data leakage, as discussed in the report.
X = challenge_df.drop(["signup_date", "last_trip_date", "retained"], axis=1)
y = challenge_df["retained"]

# Align columns after one-hot encoding if some categories are missing in test set
# This step is crucial if the test set might have different categorical values than the training set
# X = X.reindex(columns=X.columns, fill_value=0) # This line is not needed if all data is loaded at once

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and train the RandomForestClassifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)
print("\nConfusion Matrix:")
print(conf_matrix)

# Feature Importance
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x=feature_importances.values, y=feature_importances.index, palette="viridis")
plt.title("Feature Importances")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.savefig("/home/ubuntu/feature_importances.png")
plt.close()

print("Feature importances plot saved to feature_importances.png")

# Calculate retention fraction
retention_fraction = challenge_df["retained"].mean()
print(f"\nFraction of observed users retained: {retention_fraction:.4f}")
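
The comment in the cell above notes that `days_since_last_trip` should be excluded for a truly predictive model. One way this could look is sketched below: a small helper that drops the target, the raw date columns, and any columns flagged as leaky before modeling. The helper name `split_features_target`, its parameters, and the `example_feature` column are hypothetical and are not part of the original analysis.

```python
import pandas as pd

def split_features_target(df, target="retained",
                          leaky=("days_since_last_trip",),
                          raw_dates=("signup_date", "last_trip_date")):
    """Return (X, y) with the target, raw date columns, and leaky columns removed."""
    # Only drop columns that are actually present in the frame
    to_drop = [c for c in (target, *leaky, *raw_dates) if c in df.columns]
    return df.drop(columns=to_drop), df[target]

# Tiny made-up frame mimicking the column roles of the preprocessed data
toy = pd.DataFrame({
    "example_feature": [4, 0, 2],          # hypothetical predictor
    "days_since_signup": [170, 160, 181],
    "days_since_last_trip": [3, 150, 40],  # leaks the target
    "retained": [True, False, False],
})

X_clean, y = split_features_target(toy)
print(list(X_clean.columns))
```

Training the same classifier on `X_clean` instead of `X` would give a more honest estimate of predictive performance, and the accuracy would be expected to drop compared with the leaky version.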
