## Introduction

The goal of this competition is to predict `y`, the rented bike count (target): the number of bikes rented at each hour.

This is the getting started notebook. Things are kept simple so that it's easier to understand the steps and modify it.

Feel free to `Fork` this notebook and share it with your modifications **OR** use it to create your submissions.

*You can submit up to 2 submissions per day. You can select only one of your submissions to be considered in the final ranking.*


Data fields
- ID - an ID for this instance
- Date - year-month-day
- Hour - Hour of the day
- Temperature - Temperature in Celsius
- Humidity - %
- Windspeed - m/s
- Visibility - 10m
- Dew point temperature - Celsius
- Solar radiation - MJ/m2
- Rainfall - mm
- Snowfall - cm
- Seasons - Winter, Spring, Summer, Autumn
- Holiday - Holiday/No holiday
- Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)
- y - Rented Bike count (Target), Count of bikes rented at each hour

## Import the libraries

We'll use `pandas` to load and manipulate the data. Other libraries will be imported in the relevant sections.

In [None]:
import pandas as pd
import os
import numpy as np

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import preprocessing

#Encoding
from sklearn.preprocessing import LabelEncoder


In [None]:
# machine learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

## Exploratory Data Analysis

Let's load the data using `pandas` and have a look at the generated `DataFrame`.

In [None]:
dataset_path = '/kaggle/input/seoul-bike-rental-ai-pro-iti/'

df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))

print("The shape of the dataset is {}.\n\n".format(df.shape))

df.head(5)

In [None]:
df.rename(columns={'Temperature(�C)':'Temperature(C)','Dew point temperature(�C)':'Dew point temperature(C)'},inplace=True)

We've got 5760 examples in the dataset with 15 columns (including `ID` and the target `y`).

Looking at the features and a sample of the data, the features appear to be of numerical and categorical types.
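
One way to check this split programmatically is `select_dtypes`. The sketch below uses a small hypothetical sample (the `sample` frame and its values are made up for illustration, not taken from `train.csv`):

```python
import pandas as pd

# Hypothetical three-row sample with a few of the columns described above
sample = pd.DataFrame({
    "Hour": [0, 1, 2],
    "Temperature(C)": [-5.2, -5.5, -6.0],
    "Seasons": ["Winter", "Winter", "Winter"],
    "Holiday": ["No Holiday", "No Holiday", "Holiday"],
})

# Numeric dtypes go in one list, everything else in the other
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Running the same two `select_dtypes` calls on `df` should give a comparable split for the real data.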

In [None]:
print(df.columns.values)

## Categorical: 
- Seasons
- Holiday
- Functioning Day


## Continuous:
- y (target)
- Hour
- Temperature(°C)
- Humidity(%)
- Wind speed (m/s)
- Visibility (10m)
- Dew point temperature(°C)
- Solar Radiation (MJ/m2)
- Rainfall(mm)
- Snowfall (cm)

In [None]:
df.info()

In [None]:
print(df.shape)
df.describe(include=['O'])

In [None]:
df.drop(columns='ID').describe()

In [None]:
for col in df.columns:
    print(col)
    print("------------------------")
    print(df[col].unique())
    print("------------------------")

In [None]:
# Number of NaNs in each row
print(df.isnull().sum(axis=1).unique())
df.isnull().sum(axis=1).head(15)

In [None]:
# Number of NaNs in each Column 
print(df.isnull().sum(axis=0).unique())
df.isnull().sum(axis=0).head(15)

----------------

Let's get to know the data.

In [None]:
df.drop(columns='ID').hist(bins=50, figsize=(20,15))
plt.show()

**split and make the test set**

In [None]:
df["Seasons"].value_counts()

In [None]:
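# Correlations are only defined for numeric columns, so exclude Date and the text columns
corr_matrix = df.select_dtypes(include='number').corr()
corr_matrix["y"].sort_values(ascending=False)[1:]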

In [None]:
from pandas.plotting import scatter_matrix

attributes = ['y',"Temperature(C)", "Hour", "Dew point temperature(C)",
              "Solar Radiation (MJ/m2)",'Visibility (10m)']
scatter_matrix(df[attributes], figsize=(16, 12));

In [None]:
df.plot(kind="scatter", x="Temperature(C)", y="y",
             alpha=0.4)
plt.show()

In [None]:
# df_num = df_new.drop(['Seasons', 'Holiday', 'Functioning Day','ID','Date' ], axis=1)
# df_cat = df_new[['Seasons', 'Holiday', 'Functioning Day']]

----------------

In [None]:
def encode_and_bind(original_dataframe, feature_to_encode):
    # One-hot encode a single column and append the dummy columns to the frame
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return res

df = encode_and_bind(df, 'Seasons')
df = encode_and_bind(df, 'Holiday')
df = encode_and_bind(df, 'Functioning Day')


In [None]:
df

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# def preprocess(df):
#     df['Date'] = pd.to_datetime(df.Date)
#     df['year']= df['Date'].apply(lambda x:x.year)
#     df['month']= df['Date'].apply(lambda x:x.month)
#     df['Week Days'] = df['Date'].apply(lambda x:x.dayofweek)
#     df.drop("Date",inplace=True,axis=1)
        
#     # standrization
#     min_max_scaler = preprocessing.StandardScaler()
#     x_scaled = min_max_scaler.fit_transform(df)
#     df = pd.DataFrame(x_scaled, columns=df.columns)
    
#     return df

In [None]:
#df.drop("ID",axis=1,inplace=True)
# Map the categorical columns to integer codes (assign back rather than use a chained inplace replace)
df['Holiday'] = df['Holiday'].replace({"Holiday": 0, "No Holiday": 1})
df['Functioning Day'] = df['Functioning Day'].replace({"Yes": 0, "No": 1})
df['Seasons'] = df['Seasons'].replace({"Autumn": 1, "Spring": 2, "Summer": 3, "Winter": 4})
# Derive calendar features from the date, then drop the raw column
df['Date'] = pd.to_datetime(df.Date)
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['Week Days'] = df['Date'].dt.dayofweek
df.drop("Date",inplace=True,axis=1)

# # Normalization
# min_max_scaler2 = preprocessing.StandardScaler()
# x_scaled = min_max_scaler2.fit_transform(df.drop('y',axis=1))
# df2 = pd.DataFrame(x_scaled, columns=df.drop('y',axis=1).columns)
# x_scaled.shape

In [None]:
# Standardize the target; note that min_max_scaler3 is a StandardScaler despite its name
min_max_scaler3 = preprocessing.StandardScaler()
x_scaled = min_max_scaler3.fit_transform(df['y'].to_numpy().reshape(-1, 1))
x_scaled

In [None]:
df_new2 = df.copy()

In [None]:
df_new2['y'] = x_scaled

----------------

In [None]:
df_new2

In [None]:
corr = df_new2.corr()
corr_mask = np.ones_like(corr)
corr_mask[np.tril_indices_from(corr_mask)] = False

plt.subplots(figsize=(10,10))
sns.heatmap(corr, mask=corr_mask, 
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, annot=True)
plt.show()

In [None]:
corr_matrix = df_new2.drop('ID',axis=1).corr()
corr_matrix["y"].sort_values(ascending=False)[1:]

## No outliers!!!!!

As expected, there is a rise in the demand for bikes in the morning (around 8-9 am) and in the evening (around 5-7 pm). If this hypothesis holds, we can divide bike demand into 3 main categories:

- High : 7-9 and 17-19 hours

- Average : 10-16 hours

- Low : 0-6 and 20-24 hours

Here we have analyzed the distribution of total bike demand.
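
The three demand categories above could be turned into a feature along these lines. This is a minimal sketch: the `hours` frame and the `demand_level` column name are hypothetical, and the hour boundaries are taken directly from the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding one row per hour of the day (0-23)
hours = pd.DataFrame({"Hour": range(24)})

# Boolean masks for the High and Average ranges listed above
high = hours["Hour"].between(7, 9) | hours["Hour"].between(17, 19)
average = hours["Hour"].between(10, 16)

# Everything that is neither High nor Average falls back to Low
hours["demand_level"] = np.select([high, average], ["High", "Average"], default="Low")

print(hours["demand_level"].value_counts())
```

The same masks could be applied to the `Hour` column of `df` if this feature turns out to help the model.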

As expected, on average there is high demand on non-holidays, but let's also check whether the hours of high demand are different on holidays.

We can divide the amount of rain into 3 categories:

- No rain
- Slightly raining (from >0 to 4 mm)
- Heavily raining (>4 mm)

Similarly, we can divide the amount of snow into 3 categories:

- No snow
- Slightly snowing (from >0 to 1.6 cm)
- Heavily snowing (>1.6 cm)

We can then combine these weather conditions to create a new categorical feature describing the weather during each day.
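
A possible way to build these categories is `pd.cut` with the thresholds listed above. The sketch below runs on a hypothetical `weather` frame with made-up values; the `rain_level`, `snow_level` and `combined_weather` names are illustrative, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; column names follow the training data
weather = pd.DataFrame({
    "Rainfall(mm)": [0.0, 0.5, 4.0, 9.5, 0.0],
    "Snowfall (cm)": [0.0, 0.0, 0.0, 0.0, 2.1],
})

# Bins are right-closed: (-inf, 0] -> none, (0, 4] -> slight, (4, inf) -> heavy
weather["rain_level"] = pd.cut(
    weather["Rainfall(mm)"], bins=[-np.inf, 0, 4, np.inf],
    labels=["none", "slight", "heavy"],
)
weather["snow_level"] = pd.cut(
    weather["Snowfall (cm)"], bins=[-np.inf, 0, 1.6, np.inf],
    labels=["none", "slight", "heavy"],
)

# Combine both levels into a single categorical description
weather["combined_weather"] = (
    "rain_" + weather["rain_level"].astype(str)
    + "/snow_" + weather["snow_level"].astype(str)
)
print(weather)
```

The combined column would still need to be encoded (for example with `encode_and_bind`) before being passed to a model.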

In [None]:
df_new2.info()

In [None]:
# for index, row in df.iterrows():
#     if 0 < row['Rainfall(mm)'] < 4:
#         if 0 < row['Snowfall (cm)'] < 1.6:
#             df.loc[index, 'compined_weather'] = 2 # rain and snowfall
#         if row['Snowfall (cm)'] > 1.6:
#             df.loc[index, 'compined_weather'] = 3 # rain and havily snowfall
#         else:
#             df.loc[index, 'compined_weather'] = 1 #rain only
            
#     if row['Rainfall(mm)'] > 4:
#         if 0 < row['Snowfall (cm)'] < 1.6:
#             df.loc[index, 'compined_weather'] = 2 # rain and snowfall
#         else:
#             df.loc[index, 'compined_weather'] = 1 #rain only
    
        

In [None]:
# def preprocess(df):
#     df = df.drop(columns=['ID','y'])
#     df['Holiday'].replace({"Holiday": 0, "No Holiday": 1}, inplace=True)
#     df['Functioning Day'].replace({"Yes": 0, "No": 1}, inplace=True)
#     df['Seasons'].replace({"Autumn": 1, "Spring": 2, "Summer": 3, "Winter": 4}, inplace=True)
#     df['Date'] = pd.to_datetime(df.Date)
#     df['year']= df['Date'].apply(lambda x:x.year)
#     df['month']= df['Date'].apply(lambda x:x.month)
#     df['Week Days'] = df['Date'].apply(lambda x:x.dayofweek)
#     df.drop("Date",inplace=True,axis=1)
        
#     # Normalization
#     min_max_scaler = preprocessing.MinMaxScaler()
#     x_scaled = min_max_scaler.fit_transform(df)
#     df = pd.DataFrame(x_scaled, columns=df.columns)
    
#     return df

In [None]:
# train_df2 = preprocess(df)

In [None]:
# train_df2['ID'] = df['ID']
# train_df2['y'] = df['y']
# train_df2

Now let's check the linear fit with the most correlated features before starting feature engineering.

In [None]:
# figure, axes = plt.subplots(nrows=2, ncols=2) 
# plt.tight_layout()
# figure.set_size_inches(7, 6)


# sns.regplot(x='Temperature(C)', y='y', data=df, ax=axes[0, 0], scatter_kws={'alpha': 0.2}, line_kws={'color': 'red'})
# sns.regplot(x='Visibility (10m)', y='y', data=df, ax=axes[0, 1], scatter_kws={'alpha': 0.2}, line_kws={'color': 'red'})
# sns.regplot(x='Hour', y='y', data=df, ax=axes[1, 0], scatter_kws={'alpha': 0.2}, line_kws={'color': 'red'})
# sns.regplot(x='Humidity(%)', y='y', data=df, ax=axes[1, 1], scatter_kws={'alpha': 0.2}, line_kws={'color': 'red'});

As we can see, no single variable gives a proper model with single-variable linear regression, so we need to think about which model would be better:

- Polynomial regression? (Which features should go in? How do we check whether it is valid to approximate the real model with a linear one?)
- Kmeans?

Let's see what the data is hiding from us by creating the suggested variable above.
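
To make the polynomial regression question concrete, here is a small sketch on synthetic data. The single feature `x` and the target `y_demo` are made up, with demand peaking around midday; they are not the competition data. It compares a plain linear fit with a degree-2 polynomial fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, hypothetical data: a curved relationship that a straight line cannot capture
rng = np.random.default_rng(0)
x = rng.uniform(0, 23, size=200).reshape(-1, 1)
y_demo = 500 - 3 * (x[:, 0] - 12) ** 2 + rng.normal(0, 20, size=200)

# Plain linear regression on the raw feature
linear = LinearRegression().fit(x, y_demo)

# Degree-2 polynomial features followed by linear regression
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y_demo)

linear_r2 = linear.score(x, y_demo)
poly_r2 = poly.score(x, y_demo)
print(f"linear R^2: {linear_r2:.3f}, polynomial R^2: {poly_r2:.3f}")
```

On the real data, the comparison is more meaningful when scored on the validation split rather than on the training rows.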

# Encoding

In [None]:
# cols_categ_encoding = ['Seasons', 'Holiday', 'Functioning Day']
# df_new = df.copy()
# my_encoder = LabelEncoder()

# for col in cols_categ_encoding:
#     df_new[col] = my_encoder.fit_transform(df_new[col])
    

# Data Splitting


Now it's time to split the dataset for the training step. Typically the dataset is split into 3 subsets, namely, the training, validation and test sets. In our case, the test set is already predefined. So we'll split the "training" set into training and validation sets with 0.8:0.2 ratio.

Note: a good way to generate reproducible results is to set the seed of the algorithms that depend on randomization. This is done with the argument `random_state` in the following command.
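
As a quick illustration of the effect of `random_state`, the sketch below splits a small hypothetical frame (`demo`) twice with the same seed and gets the same 0.8:0.2 partition both times:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the training data
demo = pd.DataFrame({"ID": range(10), "y": range(10, 20)})

# Two splits with the same seed select exactly the same rows
train_a, val_a = train_test_split(demo, test_size=0.2, random_state=0)
train_b, val_b = train_test_split(demo, test_size=0.2, random_state=0)

print(len(train_a), len(val_a))
print(val_a.index.tolist() == val_b.index.tolist())
```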

In [None]:

# from sklearn.preprocessing import MinMaxScaler

# def normalize_column(df,column):
#     return MinMaxScaler().fit_transform(np.array(df[column]).reshape(-1,1))

# names=['Hour',
#     'Temperature(C)',
#     'Dew point temperature(C)',
#     'Solar Radiation (MJ/m2)',  #---------
#     'Rainfall(mm)',
#     'Snowfall (cm)',
#     'Seasons',
#     'Functioning Day',
    
#     'Humidity(%)',
#     'Wind speed (m/s)',
#     'Visibility (10m)',
#     'Holiday']

# for i in names:
#     df_new[i]=normalize_column(df_new,i)

In [None]:
# df_new['combined_weather']=df_new['Rainfall(mm)'].astype(float)+df_new['Snowfall (cm)'].astype(float)

In [None]:
# colormap = plt.cm.RdBu
# plt.figure(figsize=(22,11))
# plt.title('Pearson Correlation of Features', y=1.05, size=20)

# sns.heatmap(df_new.drop('ID',axis=1).corr(),linewidths=0.1,vmax=1.0,cmap=colormap, linecolor='white', annot=True)

In [None]:
# df_new.drop(['ID','Date'],axis=1).corr()['y'][1:]

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

train_df, val_df = train_test_split(df_new2, test_size=0.2, random_state=0) # Try adding `stratify` here

X_train = train_df.drop(columns=['ID', 'y'])
y_train = train_df['y']

X_total = df_new2.drop(columns=['ID', 'y'])
y_total = df_new2['y']

X_val = val_df.drop(columns=['ID', 'y'])
y_val = val_df['y']

In [None]:
train_df.columns


In [None]:
corr_matrix = df_new2.drop('ID', axis=1).corr()
corr_matrix["y"].sort_values(ascending=False)[1:]

In [None]:
features = [
    'Hour',
    'Temperature(C)',
    'Dew point temperature(C)',
    # 'Solar Radiation (MJ/m2)',
    'Rainfall(mm)',
    'Snowfall (cm)',
    'Seasons',
    'Functioning Day',
    # 'Humidity(%)',
    # 'Wind speed (m/s)',
    'Visibility (10m)',
    'Holiday',
    'Seasons_Spring',
    # 'month',
    # 'Week Days',
    'Seasons_Summer',
    # 'Functioning Day_Yes',
    # 'Functioning Day_No',
    # 'Seasons_Winter',
    'Holiday_Holiday',
]

new features 

In [None]:
# features = [
#     'Hour',
#     'Temperature(C)',
#   #  'Dew point temperature(C)',
#     'Solar Radiation (MJ/m2)',  #---------
#    # 'Rainfall(mm)',
#     #'Snowfall (cm)',
#   #  'Seasons',
#    # 'Functioning Day',
    
#  #   'Humidity(%)',
# #     'Wind speed (m/s)',
# #     'Visibility (10m)',
#  #    'Holiday',
    
    
#     #'day_of_week',
    
#    #'combined_weather',
    
#     'Seasons_Summer',
#     'Seasons_Winter',
#     'Functioning Day_Yes',
#     'Functioning Day_No',
#     'Seasons_Winter'
    






# ]

In [None]:
# This cell is used to select the numerical features only. IT SHOULD BE REMOVED AS YOU DO YOUR WORK.
X_train = X_train[features]
X_val = X_val[features]

X_total = X_total[features]

# Model Training
Let's train a model with the data! We'll train a Random Forest regressor to demonstrate the process of making submissions.

In [None]:
# for x in range(1,101):
#     # Create an instance of the classifier
#     Regressor1 = RandomForestRegressor(max_depth=17, random_state=0, n_estimators=x)

#     # Train the classifier
#     Regressor1.fit(X_train, y_train)
#     y_pred = Regressor1.predict(X_val).astype(int)


#     acc_RandomForestscore = round(Regressor1.score(X_train, y_train) * 100, 2)
#     acc_RandomForestMSLE = round((mean_squared_log_error(y_val, y_pred)), 4)
    
#     print(x)
#     print(acc_RandomForestMSLE)

In [None]:
# Create an instance of the regressor
Regressor1 = RandomForestRegressor(max_depth=20, random_state=42, n_estimators=7)

# Train the regressor
Regressor1.fit(X_train, y_train)

y_pred = Regressor1.predict(X_val).round().astype(int)


acc_RandomForestscore = round(Regressor1.score(X_train, y_train) * 100, 2)
#acc_RandomForestMSLE = round(np.sqrt(mean_squared_log_error(y_val, y_pred)), 4)

In [None]:
# Create an instance of the regressor
Regressor2 = GradientBoostingRegressor(random_state=0)

# Train the regressor
Regressor2.fit(X_train, y_train)

y_pred = Regressor2.predict(X_val).round()
y_pred[y_pred<0] = 1

acc_GradientBoostingscore = round(Regressor2.score(X_train, y_train) * 100, 2)

# mean_squared_log_error raises ValueError when the targets contain negative values
try:
    acc_GradientBoostingMSLE = round(np.sqrt(mean_squared_log_error(y_val, y_pred)), 4)
except ValueError:
    acc_GradientBoostingMSLE = -1000
    
y_pred

In [None]:
# Create an instance of the regressor
Regressor3 = LinearRegression()

# Train the regressor
Regressor3.fit(X_train, y_train)

y_pred = Regressor3.predict(X_val).round()

y_pred[y_pred<0] = 1

acc_LinearRegressionscore = round(Regressor3.score(X_train, y_train) * 100, 2)

# mean_squared_log_error raises ValueError when the targets contain negative values
try:
    acc_LinearRegressionMSLE= round(np.sqrt(mean_squared_log_error(y_val, y_pred)), 4)
except ValueError:
    acc_LinearRegressionMSLE= -1000
    
y_pred

In [None]:
knn = KNeighborsRegressor(n_neighbors = 4, weights='distance',algorithm='auto', p=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_val).round()

acc_knnscore = round(knn.score(X_train, y_train) * 100, 2)
#acc_knnMSLE = round(np.sqrt(mean_squared_log_error(y_val, y_pred)), 4)

# machine learning
- RandomForestRegressor
- KNeighborsRegressor
- SVR
- GradientBoostingRegressor
- LinearRegression
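
Because `y` was standardized before the split, the validation targets contain negative values, and `mean_squared_log_error` cannot be computed on them directly. One possible approach is to map predictions back to bike counts before scoring. The sketch below uses synthetic data; `X_demo`, `counts` and `target_scaler` are hypothetical names, and the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data: one feature and non-negative counts
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 23, size=(200, 1))
counts = np.round(50 + 20 * X_demo[:, 0] + rng.uniform(0, 10, size=200))

X_tr, X_va = X_demo[:160], X_demo[160:]
y_tr, y_va = counts[:160], counts[160:]

# Standardize the target using the training rows only
target_scaler = StandardScaler()
y_tr_scaled = target_scaler.fit_transform(y_tr.reshape(-1, 1)).ravel()

model = RandomForestRegressor(n_estimators=7, max_depth=20, random_state=42)
model.fit(X_tr, y_tr_scaled)

# Undo the scaling, then clip at zero because counts cannot be negative
pred_scaled = model.predict(X_va)
pred_counts = target_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
pred_counts = np.clip(pred_counts, 0, None)

# RMSLE computed on the original count scale
rmsle = np.sqrt(mean_squared_log_error(y_va, pred_counts))
print(f"RMSLE: {rmsle:.4f}")
```

Scoring on the original scale also makes the validation number comparable to a leaderboard metric computed on raw counts, if that is how the competition is evaluated.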

In [None]:
# Create an instance of the regressor
Regressorfull = RandomForestRegressor(max_depth=20, random_state=42, n_estimators=7)

# Train the regressor on the full training data
Regressorfull.fit(X_total, y_total)



acc_RandomForestscore = round(Regressorfull.score(X_train, y_train) * 100, 2)
acc_RandomForestscore

In [None]:
# models = pd.DataFrame({
#     'Model': ['Random Forest', 'KNeighbors', 'SVR', 
#               'Gradient Boosting', 'Linear Regression'],
#     'Score': [acc_RandomForestscore, acc_knnscore, acc_SVRscore, 
#               acc_GradientBoostingscore, acc_LinearRegressionscore],
#     'MSLE': [acc_RandomForestMSLE, acc_knnMSLE, acc_SVRMSLE, 
#               acc_GradientBoostingMSLE, acc_LinearRegressionMSLE]})
# models.sort_values(by='MSLE', ascending=True)

# Submission File Generation

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the bike counts and save the submission file.

First, we'll load the data.

In [None]:
test_df = pd.read_csv(os.path.join(dataset_path, 'test.csv'))

print(test_df.shape)


Note that the test set has the same features and doesn't have the `y` column.
At this stage one must **NOT** forget to apply to the test set features the same processing that was done on the training set.

**Note** y is `Rented Bike count (Target), Count of bikes rented at each hour` .

Now we'll add a `y` column to the test `DataFrame` and fill it with the predicted values.

**I'll select the numerical features here as I did in the training set. DO NOT forget to change this step as you change the preprocessing of the training data.**
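
One thing to watch when repeating the one-hot encoding on the test set: `pd.get_dummies` only creates columns for categories that are present, so a test set missing a season would lack that dummy column. A minimal sketch of aligning the columns, using hypothetical `train_demo` and `test_demo` frames:

```python
import pandas as pd

# Hypothetical frames: the test sample only contains one of the four seasons
train_demo = pd.DataFrame({"Seasons": ["Winter", "Spring", "Summer", "Autumn"]})
test_demo = pd.DataFrame({"Seasons": ["Winter", "Winter"]})

train_dummies = pd.get_dummies(train_demo[["Seasons"]])

# Reindex to the training columns; categories absent from the test set become all-zero columns
test_dummies = pd.get_dummies(test_demo[["Seasons"]]).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies.columns.tolist())
```

With the columns aligned, selecting `features` from the test frame will not fail on a missing dummy column.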

In [None]:
test_df.rename(columns={'Temperature(�C)':'Temperature(C)','Dew point temperature(�C)':'Dew point temperature(C)'},inplace=True)

In [None]:
test_df = encode_and_bind(test_df, 'Seasons')
test_df = encode_and_bind(test_df, 'Holiday')
test_df = encode_and_bind(test_df, 'Functioning Day')

In [None]:
# cols_categ_encoding = ['Seasons', 'Holiday', 'Functioning Day']
# test_df_new = test_df.copy()
# my_encoder = LabelEncoder()

# for col in cols_categ_encoding:
#     test_df_new[col] = my_encoder.fit_transform(test_df_new[col])
    

In [None]:
# names=['Hour',
#     'Temperature(C)',
#     'Dew point temperature(C)',
#     'Solar Radiation (MJ/m2)',  #---------
#     'Rainfall(mm)',
#     'Snowfall (cm)',
#     'Seasons',
#     'Functioning Day',
    
#     'Humidity(%)',
#     'Wind speed (m/s)',
#     'Visibility (10m)',
#     'Holiday']

# for i in names:
#     test_df_new[i]=normalize_column(test_df_new,i)

In [None]:
# test_df_new['combined_weather']=test_df_new['Rainfall(mm)'].astype(float)+test_df_new['Snowfall (cm)'].astype(float)

In [None]:
# test_df_new['Date'] = pd.to_datetime(test_df_new.Date)
# test_df_new['day_of_week'] = test_df_new['Date'].dt.dayofweek

In [None]:
df = test_df.copy()
# Apply the same integer codes and calendar features as for the training set
df['Holiday'] = df['Holiday'].replace({"Holiday": 0, "No Holiday": 1})
df['Functioning Day'] = df['Functioning Day'].replace({"Yes": 0, "No": 1})
df['Seasons'] = df['Seasons'].replace({"Autumn": 1, "Spring": 2, "Summer": 3, "Winter": 4})
df['Date'] = pd.to_datetime(df.Date)
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['Week Days'] = df['Date'].dt.dayofweek
df.drop("Date",inplace=True,axis=1)

# # Normalization
# min_max_scaler = preprocessing.StandardScaler()
# x_scaled = min_max_scaler.fit_transform(df)
# df = pd.DataFrame(x_scaled, columns=df.columns)
# df.head()

In [None]:
test_df_new = df.copy()

In [None]:
X_test = test_df_new.drop(columns=['ID'])

# You should update/remove the next line once you change the features used for training
X_test = X_test[features]

y_test_predicted = Regressorfull.predict(X_test)

# Map the standardized predictions back to the original bike-count scale
y_test_counts = min_max_scaler3.inverse_transform(y_test_predicted.reshape(-1, 1))

test_df_new['y'] = y_test_counts

test_df_new.head()


In [None]:
test_df_new['y'] = test_df_new['y'].astype(int)
test_df_new[['ID', 'y']]

Now we're ready to generate the submission file. The submission file needs the columns `ID` and `y` only.
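
Before writing the file, a few quick checks can catch a malformed submission. The sketch below uses a hypothetical `submission` frame with made-up predictions and writes to an in-memory buffer instead of `/kaggle/working/`:

```python
import io
import pandas as pd

# Hypothetical predictions standing in for test_df_new[['ID', 'y']]
submission = pd.DataFrame({"ID": [0, 1, 2], "y": [254, 204, 173]})

# Sanity checks: expected columns, one row per ID, no missing predictions
assert list(submission.columns) == ["ID", "y"]
assert submission["ID"].is_unique
assert submission["y"].notna().all()

# Write without the index so the file contains only the two columns
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```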

In [None]:
test_df_new[['ID', 'y']].to_csv('/kaggle/working/submission.csv', index=False)

The remaining steps are to submit the generated file, as follows:

- Press Save Version on the upper right corner of this notebook.
- Write a Version Name of your choice and choose Save & Run All (Commit), then click Save.
- Wait for the saved notebook to finish running, then go to the saved notebook.
- Scroll down until you see the output files, then select the submission.csv file and click Submit.

Now your submission will be evaluated and your score will be updated on the leaderboard! CONGRATULATIONS!!

# Conclusion
In this notebook, we have demonstrated the essential steps that one should take in order to get "slightly" familiar with the data and the submission process. We chose not to go into details in each step to keep the welcoming notebook simple and leave room for improvement.

You're encouraged to `Fork` the notebook, edit it, add your insights and use it to create your submission.

