## Instructions {-}

1. Please answer the following questions as part of your project proposal.

2. Write your answers in the *Markdown* cells of the Jupyter notebook. You don't need to write any code, but if you want to, you may use the *Code* cells.

3. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

4. The project proposal is worth 8 points, and is due on **18th April 2023 at 11:59 pm**. 

5. You must make one submission as a group, and not individually.

6. Maintaining a GitHub repository is optional, though encouraged for the project.

7. Share the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0) (optional).

# 1) Team name
Mention your team name.

*(0 points)*

Team name: 404

# 2) Member names
Mention the names of your team members.

*(0 points)*

Arush Iyer, Anjali Patel, Erica Zhou, Yida Hao 

# 3) Link to the GitHub repository (optional)
Share the link of the team's project repository on GitHub.

Also, put the link of your project's GitHub repository [here](https://docs.google.com/spreadsheets/d/1khao3unpj_vsx4kOSg_Zzo77YK1UWL2w73Oa0aAirOo/edit#gid=0).

We believe there is no harm in having other teams view your GitHub repository. However, if you don't want anyone to see your team's work, you may make the repository *Private* and add your instructor and graduate TA as *Collaborators* in it.

*(0 points)*

Link to our team's project repository on GitHub: https://github.com/ericaez/STAT-303-3-Project.git

# 4) Topic
Mention the topic of your course project.

*(0 points)*

Rental bike sharing system

# 5) Problem statement

*(4 points)*

Explain the problem statement. The problem statement must include:

## 5a) The problem

We wish to predict the number of bikes rented in a given hour from environmental and seasonal conditions.

## 5b) Type of response
Is it about predicting a continuous response or a binary response or a combination of both?

We are predicting a continuous response.

## 5c) Performance metric
How will you assess model accuracy?

  - If it is a classification problem, then which measure(s) will you optimize for your model – precision, recall, false negative rate (FNR), accuracy, ROC-AUC etc., and why?
  - If it is a regression problem, then which measure(s) will you optimize for your model – RMSE (Root mean squared error), MAE (mean absolute error), maximum absolute error etc., and why?

For our regression problem, we will optimize RMSE. Because RMSE squares each error before averaging, it penalizes large misses on individual data points more heavily than MAE does. This suits our goal of staying accurate across days with different environmental conditions, rather than being accurate only on average.
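As a rough illustration of the difference between RMSE and MAE, the sketch below compares two sets of predictions with the same MAE. The counts are made up for the example and are not taken from `hour.csv`; the helper names `rmse` and `mae` are our own.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Made-up hourly counts, for illustration only.
actual = [100, 120, 80, 200]
even_errors = [110, 110, 90, 190]    # every prediction is off by 10
one_big_miss = [100, 120, 80, 160]   # three exact predictions, one off by 40

print(mae(actual, even_errors), rmse(actual, even_errors))    # 10.0 10.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 10.0 20.0
```

Both prediction sets have an MAE of 10, but the one with a single large miss has twice the RMSE, which is the behavior we want the metric to flag.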

## 5d) Naive model accuracy
What is the accuracy of the naive model (Standard deviation of response in case of continuous response / proportion of the majority class in case of classification model)

In [17]:
import pandas as pd
import numpy as np

data = pd.read_csv("hour.csv")

In [18]:
sd = np.std(data["cnt"])
sd

181.38238043116962
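The standard deviation serves as the naive baseline because a model that always predicts the mean of the response has an RMSE equal to the population standard deviation. `np.std` uses `ddof=0` by default, so it computes exactly this quantity. The sketch below checks the identity with the standard library only, using the five `cnt` values that appear in the `data.head()` output in section 6c; the variable names are our own.

```python
import math
import statistics

# cnt values from the first five rows of the dataset (see data.head()).
counts = [16, 40, 32, 13, 1]

# Naive model: predict the mean count for every observation.
mean_count = statistics.fmean(counts)
naive_rmse = math.sqrt(sum((c - mean_count) ** 2 for c in counts) / len(counts))

# Population standard deviation (divides by n, like np.std with ddof=0).
population_sd = statistics.pstdev(counts)
# Sample standard deviation (divides by n - 1, like pandas Series.std()).
sample_sd = statistics.stdev(counts)

print(math.isclose(naive_rmse, population_sd))  # True
print(sample_sd > population_sd)                # True
```

With 17,379 observations the gap between the two standard deviations is negligible, so either one is a reasonable benchmark for our models to beat.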

# 6) Data

## 6a) Source
What data sources will you use, and how will the data help solve the problem? Explain.
If the data is open source, share the link of the data.

*(0.5 point)*

https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

We are using the Bike Sharing Dataset from the UCI Machine Learning Repository. This dataset contains hourly bike rental counts for the Capital Bikeshare system in Washington, D.C. from 2011 to 2012. The dataset will help us understand how rental demand varies with seasonal and environmental factors, so that the bike sharing company can be more strategic about allocating and distributing bikes in the area.

## 6b) Response & predictors
What is the response, and mention some of the predictors.

*(0.5 point)*

The response is the total rental bike count ("cnt"). We have many environmental variables such as humidity, temperature, windspeed, etc. We also have seasonal variables such as holiday, working day, month, weekday, etc.

## 6c) Size
What is the number of continuous predictors, categorical predictors, and observations in your dataset(s). If you are using multiple datasets, please provide the information for each dataset. When counting predictors, count only those that have sufficient non-missing values, and will be useful.

*(1 point)*

In [16]:
data.head()

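|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |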


Categorical variables (4): season, month, weekday, weathersit (type of weather)
<br> Continuous variables (6): hum (humidity), temp, windspeed, atemp 
<br> Binary variables (3): workingday, holiday, yr

<br> The size of the dataset is 17,379 observations with 13 predictors.
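When deciding which columns count as useful predictors, it is worth checking whether any column is just a component of the response. In the five rows shown by `data.head()`, `casual + registered` equals `cnt`. The sketch below checks only those five rows; that the identity holds for the full dataset is an assumption we have not verified here.

```python
# (casual, registered, cnt) copied from the five rows shown by data.head().
rows = [
    (3, 13, 16),
    (8, 32, 40),
    (5, 27, 32),
    (3, 10, 13),
    (0, 1, 1),
]

# True only if the two user-type counts add up to the total in every row.
components_sum_to_total = all(
    casual + registered == cnt for casual, registered, cnt in rows
)
print(components_sum_to_total)  # True
```

If the same relationship holds across all 17,379 observations, using `casual` or `registered` as predictors of `cnt` would leak the response into the model.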

# 7) Existing solutions
Are there existing solutions of your problem? Almost all Kaggle datasets have existing solutions. If yes, then how do you plan to build upon those solutions? **What is the highest model accuracy / performance achieved in the existing solutions?**

*(1 point)*

Yes, there are existing solutions to our problem on Kaggle. The highest model accuracy is about 85%. By applying the methods we learn in this class, we plan to improve on that result by combining the predictions of all of our individual models.

# 8) Stakeholders

Who are the stakeholders, and how will your project benefit them? Explain.

*(1 point)*

1. Bike rental companies (like Divvy). This project will benefit them by helping them better understand their target consumers and the demand for their bikes.
2. Tourists and local community members who use these bikes (like NU students). They will have a better sense of bike rental availability depending on the type of day. Our model could also help them understand how demand differs between casual and registered users of the rental bikes.
3. Cities that partner with bike rental companies for environmental and transportation benefits.

# 9) Work-split
*(This question is answered for you)*

How do you plan to split the project work amongst individual team members?

We will learn to develop and tune the following models in the STAT303 sequence:

1. MARS

2. Decision trees with cost-complexity pruning

3. Bagging (Bagging MARS / decision trees)

4. Random Forests

5. AdaBoost

6. Gradient boosting

7. XGBoost

8. Lasso / Ridge / Stepwise selection 

Each team member is required to develop and tune at least one of the above models. In the end, the team will combine all the developed models to create a model more accurate than each of the individual models.

*(0 points)*

Yida: MARS

Anjali: Decision trees

Arush: Random forests

Erica: Gradient Boosting
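One simple way to combine the individual models at the end is to average their predictions with equal weights. The sketch below is only an illustration of that idea, not the method the course prescribes: the counts and the two sets of predictions are made up, and the names `model_a` and `model_b` are hypothetical placeholders for any two of our tuned models.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up hourly counts and predictions, for illustration only.
actual = [100, 120, 80, 200]
predictions = {
    "model_a": [110, 115, 90, 190],  # hypothetical model
    "model_b": [95, 130, 70, 205],   # hypothetical model
}

# Equal-weight ensemble: average the models' predictions for each observation.
ensemble = [sum(values) / len(values) for values in zip(*predictions.values())]

for name, predicted in predictions.items():
    print(name, round(rmse(actual, predicted), 3))
print("ensemble", round(rmse(actual, ensemble), 3))
```

In this example the ensemble has a lower RMSE than either model because their errors partly cancel. That is not guaranteed in general, so the combined model should be evaluated against each individual model on held-out data.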