# Task
Build a simple model to predict the date when I will finish 2,525.25 km, based on "/content/vietnam-vr-ultra-records.txt". Create a cumulative plot with 200 km milestones and the date I will finish each.

## Load and inspect data

### Subtask:
Load the data from the text file into a pandas DataFrame and inspect its structure and content.


**Reasoning**:
The first step is to load the data from the text file into a pandas DataFrame and then inspect it as requested in the instructions.



In [None]:
import pandas as pd

# Load the data from the text file into a pandas DataFrame
df = pd.read_csv('/content/vietnam-vr-ultra-records.txt', sep=' ')

# Display the first 5 rows of the DataFrame
display(df.head())

# Print a concise summary of the DataFrame (info() prints directly and returns None)
df.info()

# Generate descriptive statistics of the numerical columns
display(df.describe())

**Reasoning**:
The data is not space-separated as assumed; the fields are tab-separated. I need to reload the data using a tab as the separator and then repeat the inspection steps.



In [None]:
# Load the data from the text file into a pandas DataFrame using tab separator
df = pd.read_csv('/content/vietnam-vr-ultra-records.txt', sep='\t')

# Display the first 5 rows of the DataFrame
display(df.head())

# Print a concise summary of the DataFrame (info() prints directly and returns None)
df.info()

# Generate descriptive statistics of the numerical columns
display(df.describe())
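
The separator could also be checked before loading rather than found by trial and error. The following is a minimal sketch that counts candidate delimiters in the header line. The sample text and the helper `guess_separator` are hypothetical; the sample only mimics the 'Date', 'Distance', 'Link' layout described in this notebook and is not the real file.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the file layout (not the real records).
sample = (
    "Date\tDistance\tLink\n"
    "2025-01-01\t12.5km\thttps://example.com/a\n"
    "2025-01-02\t8km\thttps://example.com/b\n"
)


def guess_separator(header_line, candidates=("\t", ",", ";", " ")):
    """Return the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


sep = guess_separator(sample.splitlines()[0])
sample_df = pd.read_csv(io.StringIO(sample), sep=sep)
print(repr(sep), list(sample_df.columns))
```

For the real file, the first line could be read with `open(...).readline()` and passed to the same helper.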

## Preprocess data

### Subtask:
Clean and format the data as needed for analysis and modeling, including converting relevant columns to numeric and datetime types.


**Reasoning**:
Convert 'Date' to datetime and 'Distance' to numeric by removing 'km' and casting. Then display the data types to verify the changes.



In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Distance'] = df['Distance'].str.replace('km', '').astype(float)
display(df.dtypes)
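
If some rows have stray whitespace or unparseable values, the direct `astype(float)` cast would raise an error. A more defensive variant is sketched below on a small hypothetical frame (the values are invented for illustration): `errors='coerce'` turns bad entries into missing values so they can be inspected instead of stopping the run.

```python
import pandas as pd

# Hypothetical records with inconsistent formatting.
demo = pd.DataFrame({
    "Date": ["2025-01-03", "2025-01-01", "2025-01-02"],
    "Distance": ["10km", " 12.5 km", "n/a"],
})

# Parse dates; anything unparseable becomes NaT instead of raising.
demo["Date"] = pd.to_datetime(demo["Date"], errors="coerce")

# Drop the literal 'km' suffix, trim whitespace, then convert to float.
demo["Distance"] = pd.to_numeric(
    demo["Distance"].str.replace("km", "", regex=False).str.strip(),
    errors="coerce",
)

# Rows that failed either conversion can be reviewed before modeling.
bad_rows = demo[demo["Date"].isna() | demo["Distance"].isna()]
print(demo.dtypes)
print(f"{len(bad_rows)} row(s) failed to parse")
```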

## Calculate cumulative distance

### Subtask:
Calculate the cumulative distance covered over time.


**Reasoning**:
Sort the DataFrame by date and calculate the cumulative distance.



In [None]:
df = df.sort_values(by='Date', ascending=True)
df['Cumulative_Distance'] = df['Distance'].cumsum()
display(df.head())
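
The running total can be illustrated on a tiny hypothetical frame. This sketch also assumes there may be several records on the same date (not confirmed for the real file) and sums them per day first, so each date appears once in the cumulative series.

```python
import pandas as pd

# Hypothetical records, out of order and with two entries on 2025-01-01.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-03", "2025-01-01", "2025-01-01", "2025-01-02"]),
    "Distance": [10.0, 5.0, 3.0, 7.0],
})

# Sum distances per day, order chronologically, then take the running total.
daily = demo.groupby("Date", as_index=False)["Distance"].sum().sort_values("Date")
daily["Cumulative_Distance"] = daily["Distance"].cumsum()
print(daily)
```

Without the sort, `cumsum()` would accumulate in file order and the curve would not be monotonic in time.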

## Build prediction model

### Subtask:
Build a simple model (e.g., linear regression) to predict the date based on the cumulative distance.


**Reasoning**:
Prepare the data for linear regression by creating the feature and target variables and then fit the model.



In [None]:
from sklearn.linear_model import LinearRegression

# Prepare the feature variable X
X = df['Cumulative_Distance'].values.reshape(-1, 1)

# Prepare the target variable y by converting dates to numerical representation (days since the first date)
y = (df['Date'] - df['Date'].min()).dt.days

# Instantiate a LinearRegression model
model = LinearRegression()

# Fit the linear regression model to the data
model.fit(X, y)
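
Because the model regresses days on cumulative distance, its slope is in days per km, so the implied average pace is its reciprocal. The sketch below shows this on hypothetical data with a steady 8 km per day. It uses `np.polyfit` in place of `LinearRegression` purely as a dependency-free cross-check; both fit the same least-squares line.

```python
import numpy as np

# Hypothetical data: a steady 8 km per day for 30 days.
days = np.arange(30)                # days since the first record
cumulative_km = 8.0 * (days + 1)    # 8, 16, ..., 240

# Fit days = intercept + slope * cumulative_km (degree-1 least squares).
slope, intercept = np.polyfit(cumulative_km, days, 1)
print(f"slope = {slope:.3f} days/km, implied pace = {1 / slope:.1f} km/day")
```

On the fitted notebook model, the corresponding quantities would be `model.coef_[0]` and `model.intercept_`.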

## Predict finish date

### Subtask:
Use the model to predict the date when the cumulative distance reaches 2525.25 km.


**Reasoning**:
Use the trained model to predict the date for the target cumulative distance and print the predicted date.



In [None]:
# Define the target cumulative distance
target_distance = 2525.25

# Use the trained model to predict the number of days to reach the target distance
# The model expects a 2D array, so reshape the target_distance
predicted_days = model.predict([[target_distance]])[0]

# Convert the predicted number of days back into a date
# Add the predicted number of days to the first date in the dataset
predicted_finish_date = df['Date'].min() + pd.Timedelta(days=predicted_days)

# Print the predicted finish date
print(f"Predicted date to reach {target_distance} km: {predicted_finish_date.strftime('%Y-%m-%d')}")
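
The conversion from predicted days to a calendar date is plain arithmetic, shown here with hypothetical coefficients (a slope of 0.125 days per km, i.e. 8 km per day, and a zero intercept). These numbers are assumptions for illustration and are not the fitted values from the real data.

```python
import pandas as pd

# Hypothetical model coefficients and start date (not the fitted values).
first_date = pd.Timestamp("2025-01-01")
slope_days_per_km = 0.125
intercept_days = 0.0
target_km = 2525.25

# days = intercept + slope * distance, then offset from the first date.
predicted_days = intercept_days + slope_days_per_km * target_km
finish = first_date + pd.Timedelta(days=predicted_days)
print(predicted_days, finish.strftime("%Y-%m-%d"))
```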

## Generate milestone dates

### Subtask:
Calculate the predicted dates for each 200 km milestone.


**Reasoning**:
Calculate the predicted dates for each 200 km milestone as requested by the subtask.



In [None]:
import numpy as np

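# Create the 200 km milestones below the target, then add the target itself as the final milestone
# (np.arange up to target_distance + 200 would include 2600 km, which overshoots the 2525.25 km finish)
milestone_distances = np.append(np.arange(200, target_distance, 200), target_distance)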

# Use the trained model to predict the number of days for each milestone
# The model expects a 2D array, so reshape the milestone_distances
predicted_days_milestones = model.predict(milestone_distances.reshape(-1, 1))

# Convert the predicted number of days back into dates
first_date = df['Date'].min()
predicted_dates_milestones = first_date + pd.to_timedelta(predicted_days_milestones, unit='D')

# Store the milestone distances and their corresponding predicted dates in a DataFrame
milestone_predictions_df = pd.DataFrame({
    'Milestone_Distance': milestone_distances,
    'Predicted_Date': predicted_dates_milestones
})

# Display the resulting DataFrame
display(milestone_predictions_df)
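
One consequence of a linear model is that equally spaced milestones get equally spaced predicted dates: every 200 km adds the same number of days. The sketch below checks this with a hypothetical slope of 0.125 days per km and a zero intercept (assumed values, not the fitted ones), with the milestone list ending at the target distance.

```python
import numpy as np
import pandas as pd

# Hypothetical start date and slope (not the fitted values).
first_date = pd.Timestamp("2025-01-01")
slope_days_per_km = 0.125
target_km = 2525.25

# 200, 400, ..., 2400 km, followed by the target itself.
km = np.append(np.arange(200, target_km, 200), target_km)
dates = first_date + pd.to_timedelta(slope_days_per_km * km, unit="D")

table = pd.DataFrame({"Milestone_Distance": km, "Predicted_Date": dates})

# Gap between consecutive milestone dates; constant except for the final, shorter leg.
gaps = table["Predicted_Date"].diff().dropna()
print(table)
print(gaps.unique())
```

If the real progress curve bends (a changing pace), this even spacing is a sign that the linear projection is too simple.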

## Visualize cumulative progress

### Subtask:
Create a cumulative plot showing the distance covered over time, with markers for the 200 km milestones and their predicted dates.


**Reasoning**:
Generate a cumulative plot showing the actual distance covered over time and the predicted milestone dates and distances.



In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot of the actual cumulative distance over time
plt.figure(figsize=(12, 6))
plt.scatter(df['Date'], df['Cumulative_Distance'], label='Actual Progress', color='blue', s=10)

# Add a scatter plot of the predicted milestone dates and distances
plt.scatter(milestone_predictions_df['Predicted_Date'], milestone_predictions_df['Milestone_Distance'], label='Predicted Milestones', color='red', s=50)

# Add labels and title
plt.xlabel('Date')
plt.ylabel('Cumulative Distance (km)')
plt.title('Cumulative Distance Progress with Milestone Predictions')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Add a legend
plt.legend()

# Display the plot
plt.tight_layout()
plt.show()
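
A possible variant of the plot draws the actual progress as a step line and labels each milestone with its distance and date, which makes the milestone dates readable directly from the figure. The data below is hypothetical and only illustrates the annotation technique.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical progress and milestone data for illustration only.
dates = pd.date_range("2025-01-01", periods=5, freq="D")
cumulative = [8.0, 15.0, 25.0, 31.0, 40.0]
milestones = pd.DataFrame({
    "Milestone_Distance": [20.0, 40.0],
    "Predicted_Date": pd.to_datetime(["2025-01-03", "2025-01-05"]),
})

fig, ax = plt.subplots(figsize=(8, 4))

# A step line shows the total jumping at each recorded activity.
ax.step(dates, cumulative, where="post", label="Actual progress")
ax.scatter(milestones["Predicted_Date"], milestones["Milestone_Distance"],
           color="red", label="Predicted milestones")

# Label each milestone with its distance and predicted date.
for d, km in zip(milestones["Predicted_Date"], milestones["Milestone_Distance"]):
    ax.annotate(f"{km:.0f} km\n{d:%Y-%m-%d}", (d, km),
                textcoords="offset points", xytext=(5, 5), fontsize=8)

ax.set_xlabel("Date")
ax.set_ylabel("Cumulative Distance (km)")
ax.legend()
fig.autofmt_xdate()
```

In a notebook the figure is rendered at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.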

## Present results

### Subtask:
Display the predicted finish date and the cumulative plot.


**Reasoning**:
Print the predicted finish date and ensure the previously generated plot is displayed.



In [None]:
# Print the previously calculated predicted_finish_date
print(f"Predicted date to reach {target_distance} km: {predicted_finish_date.strftime('%Y-%m-%d')}")

# The plot was generated in the previous step and plt.show() was called, so it should be displayed.
# No new code is needed to display the plot again.

## Summary:

### Q&A

*   **What is the predicted date to finish 2525.25km?**
    The predicted date to reach 2525.25 km is 2026-04-24.

### Data Analysis Key Findings

*   The raw data was tab-separated and contained 'Date', 'Distance', and 'Link' columns.
*   After cleaning, the 'Date' column was converted to datetime objects and the 'Distance' column to float, with the 'km' suffix removed.
*   The cumulative distance was calculated and added as a new column, 'Cumulative_Distance'.
*   A linear regression model was successfully trained to predict the number of days from the start date based on the cumulative distance.
*   The predicted dates for 200 km milestones were calculated and stored in a DataFrame.

### Insights or Next Steps

*   The linear model assumes a roughly constant pace, so it provides only a simple projection; future analysis could explore more complex models or factors affecting progress.
*   Adding confidence intervals to the predictions could provide a better understanding of the uncertainty in the estimated finish date and milestone dates.
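
One way to attach a rough uncertainty range to the finish date is a bootstrap: refit the line on resampled records and look at the spread of the predicted finish. The sketch below uses hypothetical, randomly generated daily distances and `np.polyfit` rather than the notebook's fitted model. Resampling rows independently ignores the serial dependence in a cumulative series, so the resulting interval is likely too narrow and should be read as indicative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical history: 60 days of noisy daily distances around 8 km.
n_days = 60
days = np.arange(n_days)
daily_km = rng.normal(loc=8.0, scale=3.0, size=n_days).clip(min=0)
cumulative = daily_km.cumsum()

target_km = 2525.25
start = pd.Timestamp("2025-01-01")


def predict_days(x, y, target):
    """Fit days = intercept + slope * km and evaluate at the target distance."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope * target + intercept


point = predict_days(cumulative, days, target_km)

# Pairs bootstrap: resample (distance, day) records with replacement and refit.
boot = []
for _ in range(500):
    idx = rng.integers(0, n_days, size=n_days)
    boot.append(predict_days(cumulative[idx], days[idx], target_km))

low, high = np.percentile(boot, [2.5, 97.5])
for name, value in [("low", low), ("point", point), ("high", high)]:
    date = start + pd.Timedelta(days=float(value))
    print(f"{name}: {date:%Y-%m-%d}")
```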
