# Data Science Challenge

In [22]:
# If additional packages are needed that are not installed by default, uncomment the last two lines of this 
# cell and replace <package list> with a list of additional packages.
# This will ensure the notebook has all the dependencies and works everywhere

#import sys
#!{sys.executable} -m pip install <package list>

In [23]:
#Libraries
import pandas as pd
pd.set_option("display.max_columns", 101)

## Data Description

Column | Description
:---|:---
`id` | Record index
`timestamp` | Datetime (YYYY-MM-DD HH:MM:SS)
`season` | Season (spring, summer, fall, winter)
`holiday` | Whether day is a holiday or not (Yes or No)
`workingday` | Whether day is a working day or not (Yes or No)
`weather` | Weather condition (Clear or partly cloudy, Mist, Light snow or rain, Heavy rain/ice pellets/snow + fog)
`temp` | Average temperature recorded for the hour (in degrees Celsius)
`temp_feel` | Average feels-like temperature recorded for the hour (in degrees Celsius)
`humidity` | Average humidity recorded for the hour (in %)
`windspeed` | Average wind speed recorded for the hour (in miles/hour)
`demand` | Hourly count of bikes rented

## Data Wrangling & Visualization

In [24]:
# The dataset is already loaded below
data = pd.read_csv("train.csv")

In [25]:
data.head()

| | id | timestamp | season | holiday | workingday | weather | temp | temp_feel | humidity | windspeed | demand |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 1 | 2017-01-01 00:00:00 | spring | No | No | Clear or partly cloudy | 9.84 | 14.395 | 81.0 | 0.0 | 2.772589 |
| 1 | 2 | 2017-01-01 01:00:00 | spring | No | No | Clear or partly cloudy | 9.02 | 13.635 | 80.0 | 0.0 | 3.688879 |
| 2 | 3 | 2017-01-01 02:00:00 | spring | No | No | Clear or partly cloudy | 9.02 | 13.635 | 80.0 | 0.0 | 3.465736 |
| 3 | 4 | 2017-01-01 03:00:00 | spring | No | No | Clear or partly cloudy | 9.84 | 14.395 | 75.0 | 0.0 | 2.564949 |
| 4 | 5 | 2017-01-01 04:00:00 | spring | No | No | Clear or partly cloudy | 9.84 | 14.395 | 75.0 | 0.0 | 0.0 |


In [26]:
#Explore columns
data.columns

Index(['id', 'timestamp', 'season', 'holiday', 'workingday', 'weather', 'temp',
       'temp_feel', 'humidity', 'windspeed', 'demand'],
      dtype='object')

In [27]:
# Starting from no data science experience, so here is my thought process:
# 1. Note each feature and how it might drive differences in demand.
#    A. Timestamp: most people like to bike in the morning and during the day. Almost no one bikes at night.
#        a. areas of concern: 8:00AM - 7:00PM
#        b. overall importance: moderate
#    B. Season: not sure when people like to ride; my guess is any season that isn't winter.
#        a. areas of concern: !(winter)
#        b. overall importance: low
#    C. Holiday: holidays deserve special consideration, since people love to bike when they have free time.
#        a. area of concern: any row where the csv file says Yes
#        b. overall importance: high
#    D. Working day: if people aren't working, they have free time.
#         a. Huge potential EXCEPTION: are some people using the service while they work (e.g. food delivery riders)?
#                                         If they are not: working days can be filtered out when looking for increased demand
#                                         If they are: working day becomes only a small factor
#         b. area of concern: look at the Yes rows for increased demand.
#         c. overall importance: either high or moderate, depending on the exception above
#    E. Weather: no one goes biking in bad weather, so demand should be very low then.
#         a. area of concern: the clear-weather rows; the other conditions are not of interest.
#         b. overall importance: high
#    F. Temperature: people generally don't go outside if it is too hot or too cold.
#         a. area of concern: maybe 20-80 Fahrenheit to start with (about -7 to 27 Celsius, the unit used in the data).
#         b. overall importance: maybe high or moderate
#    G. temp_feel: almost the same as temperature
#         a. area of concern: maybe 20-80 Fahrenheit to start with (about -7 to 27 Celsius).
#         b. overall importance: maybe high or moderate (depending on what I do with the previous column, temp)
#    H. Humidity:
#    I. Windspeed:
#    J. Demand:
# 2. Need to figure out how to build the overall model and finish one submission.
# 3. Need to see how to feed my observations into the machine learning analysis.
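
The hypotheses in the notes above can be checked against the data before any modelling. The following is a minimal sketch, assuming `timestamp` parses with `pd.to_datetime`; the frame `sample` and its values are hypothetical stand-ins for `train.csv`, and the `day`/`night` labels are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the values are illustrative only.
sample = pd.DataFrame({
    "timestamp": ["2017-01-01 03:00:00", "2017-01-01 09:00:00",
                  "2017-01-02 09:00:00", "2017-01-02 22:00:00"],
    "workingday": ["No", "No", "Yes", "Yes"],
    "demand": [1.0, 4.0, 5.0, 2.0],
})

# Derive the hour of day so that demand can be compared across hours.
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
sample["hour"] = sample["timestamp"].dt.hour

# Label the 8:00AM - 7:00PM window from the notes (hours 8 through 18).
in_window = sample["hour"].between(8, 18)
sample["period"] = in_window.map({True: "day", False: "night"})

# Mean demand per group hints at whether a feature separates high and low demand.
by_period = sample.groupby("period")["demand"].mean()
by_workingday = sample.groupby("workingday")["demand"].mean()
print(by_period)
print(by_workingday)
```

On the real data, the same `groupby` calls applied to `data` would show whether the guessed importance of each feature holds up.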

In [28]:
#Description
data.describe()

| | id | temp | temp_feel | humidity | windspeed | demand |
|:---|:---|:---|:---|:---|:---|:---|
| count | 8708.0 | 7506.0 | 8606.0 | 8669.0 | 8508.0 | 8708.0 |
| mean | 4354.5 | 20.089454 | 23.531261 | 60.99354 | 13.048589 | 4.452725 |
| std | 2513.927405 | 8.023304 | 8.737997 | 19.67989 | 8.311058 | 1.493963 |
| min | 1.0 | 0.82 | 0.76 | 0.0 | 0.0 | 0.0 |
| 25% | 2177.75 | 13.94 | 15.91 | 46.0 | 7.0015 | 3.637586 |
| 50% | 4354.5 | 20.5 | 24.24 | 60.0 | 12.998 | 4.867534 |
| 75% | 6531.25 | 26.24 | 31.06 | 77.0 | 19.0012 | 5.556828 |
| max | 8708.0 | 41.0 | 45.455 | 100.0 | 56.9969 | 6.792344 |
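
The `count` row above is below 8708 for `temp`, `temp_feel`, `humidity` and `windspeed`, so those columns contain missing values. Below is a minimal sketch of one way to count and fill the gaps. Median imputation is an assumption here, not something the challenge prescribes, and `frame` is a hypothetical stand-in for `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few numeric columns of train.csv.
frame = pd.DataFrame({
    "temp": [9.84, np.nan, 9.02, 9.84],
    "humidity": [81.0, 80.0, np.nan, 75.0],
    "windspeed": [0.0, 0.0, 0.0, np.nan],
})

# Count missing values per column.
missing = frame.isna().sum()
print(missing)

# Fill each numeric column with its own median (assumed strategy).
medians = frame.median(numeric_only=True)
filled = frame.fillna(medians)
print(filled)
```

If this approach is used in the notebook, the medians would normally be computed on the training data and reused for `test_data`, so that no statistics leak from the test set.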


## Visualization, Modeling, Machine Learning

Build a model that can predict hourly demand and identify how different features influence the decision. Please explain the findings effectively to technical and non-technical audiences using comments and visualizations, if appropriate.
- **Build an optimized model that effectively solves the business problem.**
- **The model will be evaluated on the basis of mean absolute error.**
- **Read the test.csv file and prepare features for testing.**
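
Mean absolute error is the average of the absolute differences between actual and predicted values. A small sketch of the computation with NumPy follows; the function name and the example numbers are illustrative, not part of the challenge.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all records."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative values: the absolute errors are 0.5, 1.0 and 0.0.
actual = [2.0, 4.0, 6.0]
predicted = [2.5, 3.0, 6.0]
mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because the metric is linear in the size of each error, a single large miss costs no more than several small misses of the same total size, unlike squared-error metrics.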

In [29]:
#Loading Test data
test_data=pd.read_csv('test.csv')
test_data.head()

| | id | timestamp | season | holiday | workingday | weather | temp | temp_feel | humidity | windspeed |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | 8709 | 2018-08-05 05:00:00 | fall | No | No | Clear or partly cloudy | 29.52 | 34.85 | 74.0 | 16.9979 |
| 1 | 8710 | 2018-08-05 06:00:00 | fall | No | No | Clear or partly cloudy | 29.52 | 34.85 | 79.0 | 16.9979 |
| 2 | 8712 | 2018-08-05 08:00:00 | fall | No | No | Clear or partly cloudy | 31.16 | 36.365 | 66.0 | 22.0028 |
| 3 | 8713 | 2018-08-05 09:00:00 | fall | No | No | Clear or partly cloudy | 32.8 | 38.635 | 59.0 | 23.9994 |
| 4 | 8714 | 2018-08-05 10:00:00 | fall | No | No | Clear or partly cloudy | 32.8 | 38.635 | 59.0 | 27.9993 |




**Identify the most important features of the model for management.**

> #### Task:
- **Visualize the top 20 features and their feature importance.**
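
One possible way to approach this task is sketched below. As a stand-in for whatever model is eventually chosen, it ranks features by the absolute value of standardized linear-regression coefficients, fitted with NumPy least squares. The helper names `linear_importance` and `plot_top_features` and the toy data are hypothetical. A tree-based model would expose its own importances instead.

```python
import numpy as np
import pandas as pd

def linear_importance(features: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Rank features by the absolute value of standardized linear coefficients."""
    # One-hot encode categoricals; drop_first avoids perfectly collinear dummies.
    X = pd.get_dummies(features, drop_first=True).astype(float)
    std = X.std(ddof=0).replace(0.0, 1.0)  # guard against constant columns
    Z = (X - X.mean()) / std
    # Add an intercept column and solve ordinary least squares.
    A = np.column_stack([np.ones(len(Z)), Z.to_numpy()])
    coef, *_ = np.linalg.lstsq(A, target.to_numpy(dtype=float), rcond=None)
    return pd.Series(np.abs(coef[1:]), index=X.columns).sort_values(ascending=False)

def plot_top_features(importance: pd.Series, top_n: int = 20):
    """Horizontal bar chart of the top_n features (requires matplotlib)."""
    import matplotlib.pyplot as plt
    top = importance.head(top_n).iloc[::-1]  # reverse so the largest bar is on top
    ax = top.plot.barh(figsize=(8, 6))
    ax.set_xlabel("|standardized coefficient|")
    ax.set_title(f"Top {len(top)} features")
    plt.tight_layout()
    return ax

# Hypothetical toy data: demand depends strongly on temp, weakly on humidity.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "temp": rng.normal(20, 8, 200),
    "humidity": rng.normal(60, 20, 200),
    "holiday": rng.choice(["Yes", "No"], 200),
})
toy_demand = 0.2 * toy["temp"] + 0.001 * toy["humidity"] + rng.normal(0, 0.1, 200)
importance = linear_importance(toy, toy_demand)
print(importance)
```

In the notebook, calling `plot_top_features(importance)` on importances computed from the prepared training features would draw the chart. With fewer than 20 encoded features, it simply shows all of them.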


> #### Task:
- **Submit the predictions on the test dataset using the optimized model** <br/>
    For each record in the test set (`test.csv`), predict the value of the `demand` variable. Submit a CSV file with a header row and one row per test entry.
    
The file (`submissions.csv`) should have exactly 2 columns:
   - **id**
   - **demand**

In [30]:
#Submission
submission_df.to_csv('submissions.csv',index=False)

NameError: name 'submission_df' is not defined
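
The `NameError` above occurs because `submission_df` is never created before it is written out. Below is a minimal sketch of how it could be built, assuming a simple baseline that predicts the median training demand for each hour of day. The median is used because it minimizes absolute error within a group. This is a placeholder rather than the optimized model the task asks for, and `build_submission` and the toy frames are hypothetical.

```python
import pandas as pd

def build_submission(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Predict demand for each test row as the training median for its hour."""
    train_hour = pd.to_datetime(train["timestamp"]).dt.hour
    test_hour = pd.to_datetime(test["timestamp"]).dt.hour
    hourly_median = train["demand"].groupby(train_hour).median()
    # Hours unseen in training fall back to the overall training median.
    preds = test_hour.map(hourly_median).fillna(train["demand"].median())
    return pd.DataFrame({"id": test["id"].to_numpy(), "demand": preds.to_numpy()})

# Hypothetical toy frames with the same column names as train.csv / test.csv.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "timestamp": ["2017-01-01 05:00:00", "2017-01-02 05:00:00",
                  "2017-01-03 05:00:00", "2017-01-01 06:00:00"],
    "demand": [1.0, 2.0, 6.0, 4.0],
})
test = pd.DataFrame({
    "id": [8709, 8710, 8712],
    "timestamp": ["2018-08-05 05:00:00", "2018-08-05 06:00:00",
                  "2018-08-05 08:00:00"],
})
submission_df = build_submission(train, test)
print(submission_df)
```

In the notebook this would become `submission_df = build_submission(data, test_data)`, run before the `to_csv` call, which then writes exactly the two required columns.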

---