## PHASE 1: Data Cleaning, Analysis, and Visualization

### Step 1: Load the Data and Inspect It

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv(r"C:\Users\boluw\PYTHON\GOMYCODE\PROJECT\london_merged.csv")

# View data types and check for nulls
df.info()  # info() prints its own report and returns None, so print() is not needed
print(df.head())
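
As an alternative sketch, `pd.read_csv` can parse the timestamp column while loading through its `parse_dates` parameter. The example below uses a tiny in-memory CSV with made-up values in place of `london_merged.csv`; the column names follow the ones used in this notebook, and `demo` and `csv_text` are hypothetical names.

```python
import io
import pandas as pd

# Made-up rows standing in for london_merged.csv (illustration only)
csv_text = """timestamp,cnt,t1,hum,wind_speed
2015-01-04 00:00:00,182,3.0,93.0,6.0
2015-01-04 01:00:00,138,3.0,93.0,5.0
"""

# parse_dates converts the column to datetime64 during loading
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

demo.info()          # prints dtypes and non-null counts; returns None
print(demo.head())
```

If the column is parsed at load time, a later `pd.to_datetime` call on it is harmless but no longer necessary.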


### Step 2: Check for Missing Values

In [None]:
# Count missing values
print(df.isnull().sum())

# Optionally drop or fill missing values if any are found
df = df.dropna()  # or use df = df.ffill() to forward-fill instead
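
To see what the two options do, here is a minimal sketch on a small synthetic frame. The values are made up for illustration, and `demo`, `dropped`, and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (values are made up)
demo = pd.DataFrame({
    "cnt": [120, 95, np.nan, 210],
    "t1": [3.0, np.nan, 2.5, 2.0],
})

# Number of missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

dropped = demo.dropna()  # keeps only rows with no missing values
filled = demo.ffill()    # carries the last observed value forward
```

Dropping rows shrinks the dataset, while forward filling keeps every row but assumes the previous reading is still a fair estimate, which is more plausible for hourly weather readings than for ride counts.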


### Step 3: Convert timestamp to DateTime and Extract Features

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Extract useful time-based features
df['hour'] = df['timestamp'].dt.hour
df['day'] = df['timestamp'].dt.day
df['month'] = df['timestamp'].dt.month
df['weekday'] = df['timestamp'].dt.weekday  # Monday = 0
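
A quick way to sanity-check the extracted features is to run the same steps on a couple of known dates. The sketch below uses two hand-picked timestamps (4 January 2015 was a Sunday); `demo` is a hypothetical name, and the `day_name` column is an extra added here only for readability.

```python
import pandas as pd

demo = pd.DataFrame({"timestamp": ["2015-01-04 00:00:00", "2015-01-05 08:00:00"]})
demo["timestamp"] = pd.to_datetime(demo["timestamp"])

demo["hour"] = demo["timestamp"].dt.hour
demo["weekday"] = demo["timestamp"].dt.weekday      # Monday = 0, Sunday = 6
demo["day_name"] = demo["timestamp"].dt.day_name()  # human-readable check

print(demo)
```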


### Step 4: Detect and Handle Outliers

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot to detect outliers
sns.boxplot(data=df[['cnt', 't1', 'hum', 'wind_speed']])
plt.title("Boxplots of Numerical Columns")
plt.show()

# Optionally, remove outliers using IQR
Q1 = df['cnt'].quantile(0.25)
Q3 = df['cnt'].quantile(0.75)
IQR = Q3 - Q1

df = df[(df['cnt'] >= Q1 - 1.5 * IQR) & (df['cnt'] <= Q3 + 1.5 * IQR)]
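
The IQR rule above can also be wrapped in a small helper so it is reusable for other columns. This is a sketch under assumptions: `iqr_bounds` is a hypothetical helper, not part of pandas, and the data is synthetic.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Synthetic counts with one obvious extreme value
demo = pd.DataFrame({"cnt": [10, 12, 11, 13, 12, 500]})
lower, upper = iqr_bounds(demo["cnt"])

# Flag first, so the rows can be inspected before anything is dropped
demo["is_outlier"] = ~demo["cnt"].between(lower, upper)
cleaned = demo[~demo["is_outlier"]]
print(lower, upper, len(cleaned))
```

Flagging before filtering is a deliberate choice here: in demand data, very high counts may be genuine peaks rather than errors, so it can be worth looking at the flagged rows before removing them.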


### Step 5: Encode Categorical Features (if any)

The dataset already stores `season`, `is_holiday`, and `is_weekend` as numeric values. If they were stored as strings, you could convert them to integer codes like this:

In [None]:
# Example only if you had categorical strings
df['season'] = df['season'].astype('category').cat.codes
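
One thing to keep in mind with `.cat.codes` is that pandas sorts string categories alphabetically before assigning codes, so the numbers do not follow calendar order. The sketch below contrasts that default with an explicit ordering; the season labels and `season_order` are assumptions made for illustration, not the coding used in the dataset.

```python
import pandas as pd

seasons = pd.Series(["spring", "summer", "autumn", "winter"])

# Default: categories are sorted alphabetically before codes are assigned
alphabetical_codes = seasons.astype("category").cat.codes

# Hypothetical calendar order; adjust to whatever coding the data should follow
season_order = ["spring", "summer", "autumn", "winter"]
ordered_codes = pd.Categorical(seasons, categories=season_order, ordered=True).codes

print(alphabetical_codes.tolist())
print(ordered_codes.tolist())
```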


### Step 6: Exploratory Data Analysis (EDA)

In [None]:
# Demand by hour
sns.lineplot(x='hour', y='cnt', data=df.groupby('hour')['cnt'].mean().reset_index())
plt.title('Average Demand by Hour of Day')
plt.show()

# Demand by weekday
sns.barplot(x='weekday', y='cnt', data=df.groupby('weekday')['cnt'].mean().reset_index())
plt.title('Average Demand by Weekday')
plt.show()

# Demand by temperature
sns.scatterplot(x='t1', y='cnt', data=df, alpha=0.3)
plt.title('Demand vs Temperature')
plt.show()

# Correlation heatmap
sns.heatmap(df[['cnt', 't1', 'hum', 'wind_speed', 'hour', 'weekday']].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
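
The hourly average pools all days together, which can hide different weekday and weekend patterns. One way to separate them is a pivot table of mean demand by `hour` and `is_weekend`. The sketch below uses made-up numbers; `demo` and `profile` are hypothetical names.

```python
import pandas as pd

# Made-up observations: two hours, weekday (0) and weekend (1) rows
demo = pd.DataFrame({
    "hour":       [8, 8, 8, 8, 13, 13],
    "is_weekend": [0, 0, 1, 1, 0, 1],
    "cnt":        [100, 200, 20, 40, 50, 60],
})

# Rows are hours, columns are the is_weekend values, cells are mean demand
profile = demo.pivot_table(index="hour", columns="is_weekend", values="cnt", aggfunc="mean")
print(profile)
```

On the real data, the same table could be plotted with one line per column to compare the two daily profiles.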


## Summary of Insights (to document):
- Demand peaks during the morning and evening hours
- Demand differs noticeably between weekdays and weekends
- Temperature and humidity have a moderate effect on ride count
- Demand (`cnt`) depends strongly on `hour`
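
When documenting the link between `hour` and `cnt`, keep in mind that the heatmap reports Pearson correlation, which only measures linear association. A demand curve with separate peaks can depend heavily on the hour and still show a small coefficient. The toy data below is made up, with two symmetric peaks chosen so the effect is easy to see; it is a sketch of the caveat, not a result from the London dataset.

```python
import numpy as np
import pandas as pd

hours = np.arange(24)
toy = pd.DataFrame({
    "hour": hours,
    # Flat demand of 100 with two made-up peaks of 1000 at hours 8 and 15
    "cnt": np.where(np.isin(hours, [8, 15]), 1000, 100),
})

# Linear (Pearson) correlation between hour and demand
pearson_r = toy["hour"].corr(toy["cnt"])

# Spread of the hourly means: how much demand varies across the day
hourly_mean = toy.groupby("hour")["cnt"].mean()
spread = hourly_mean.max() - hourly_mean.min()

print(round(pearson_r, 6), spread)
```

For this reason, the hourly line plot is often a better basis for the insight than the single number in the heatmap.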