## Multiple Linear Regression

#### The Goal
To apply linear regression in Python to determine the predicted demand for the London cycle hire scheme for the next 12-month period. 

#### About the datasets


There are three datasets in total. These are:  

bike_data_2021_part1.xlsx  
bike_data_2021_part2.xlsx  
bike_data_new.xlsx  

The datasets track specific weather conditions that may affect the dependent variable. There are 14 variables included in the datasets; these are listed below. The bikecount variable is the dependent variable.   

The three datasets answer the business question:  
‘What is the predicted demand for the London cycle hire scheme for the next 12-month period?’ 

Variables

    1.	timestamp – a unique date and time stamp
    2.	year - categorical field showing the year 2021
    3.	season - categorical field showing meteorological seasons: 
            •	Spring (March-May)
            •	Summer (June-August)
            •	Autumn (September-November)
            •	Winter (December-February)
    4.	month - categorical field showing the months: January-December
            •	1 = January 
            •	2 = February 
            •	3 = March
            •	4 = April 
            •	5 = May
            •	6 = June 
            •	7 = July 
            •	8 = August 
            •	9 = September 
            •	10 = October
            •	11 = November 
            •	12 = December
    5.	day – categorical field showing the day of the week: Monday-Sunday
            •	1 = Monday
            •	2 = Tuesday
            •	3 = Wednesday
            •	4 = Thursday
            •	5 = Friday
            •	6 = Saturday
            •	7 = Sunday
    6.	hour - categorical field showing the hours from 0-23
    7.	isholiday - categorical field that shows whether or not the day is a public holiday
            •	1 = holiday
            •	0 = non holiday
    8.	isweekend - categorical field that shows whether the day is a weekend or weekday
            •	1 = weekend
            •	0 = weekday  
    9.	weathercode - categorical field showing the day's weather status
            •	1 = mostly clear but have some areas with patches of fog/haze
            •	2 = scattered clouds or few clouds
            •	3 = Broken clouds
            •	4 = Cloudy
            •	7 = Rain/ light Rain shower
            •	10 = Rain with thunderstorm
            •	26 = Snowfall
            •	94 = Freezing Fog
    10.	t1 – numerical field showing the real temperature in degrees Celsius  
    11.	t2 – numerical field showing the "feels like" temperature in degrees Celsius   
    12.	humidity - numerical field showing the humidity in percentage  
    13.	windspeed - numerical field showing the wind speed in km/h  
    14.	bikecount - numerical field showing the count of new bike shares  

#### Process map
The process followed in this lab is outlined below.

 
 
TASK 2 - INVENTORY ANALYSIS
    
SECTION A - IMPORT & COMBINE DATA
    
    A.1 Access to Jupyter Notebook (COMPLETED)
    A.2 Set default folder (COMPLETED)
    A.3 Open file "Module 5 Fusion Day.ipynb" (COMPLETED)
    A.4 Install Python Libraries
    A.5 Import data 
    A.6 Combine datasets 

SECTION B - DATA PREPARATION
    
    B.1 Data Quality Checks
        B.1.1 View sample of data 
        B.1.2 Shape of data
        B.1.3 Duplicate rows
        B.1.4 Missing data
    B.2 Data Cleansing
        B.2.1 Convert variables to appropriate data types  
        B.2.2 Remove duplicate rows
        B.2.3 Interpolate using median 
 
SECTION C - EXPLORATORY DATA ANALYSIS & VISUALISATION
    
    C.1 Aggregation - Calculate total & avg bike shares
    C.2 Distribution - Calculate total & avg bike shares by: 
        C.2.1 Season 
        C.2.2 Month
        C.2.3 Day of week
        C.2.4 Day of week for most popular month
        C.2.5 Calculate All descriptive stats (for numerical variables only)
        C.2.6 Is bike shares distribution normal or skewed?
    C.3 Correlation
        C.3.1 Correlation Matrix
        C.3.2 Note correlated variables 
    C.4 Visualisations
        C.4.1 Create a jointplot
        C.4.2 Create a lineplot
        C.4.3 Create a pairgrid
        C.4.4 Note findings

SECTION D - PREDICTIVE MODELLING (REGRESSION)   
    
    D.1 Pre-processing
        D.1.1 Remove variables "timestamp" & "year"
        D.1.2 Encode categorical data 
    D.2 Train/Test Split (70%/30%) 
        D.2.1 Split df dataset to X & Y datasets 
        D.2.2 Perform 70:30 random split (use "random_state=1")
    D.3 Build Linear Regression
        D.3.1 Save the regression function "LinearRegression()" into a container called "model" 
        D.3.2 Fit the regression into the training data 
    D.4 Evaluate (training dataset)
        D.4.1 Identify regression intercept
            D.4.1a Reformat intercept into a dataframe
        D.4.2 Identify regression coefficients
            D.4.2a Reformat coefficients into a dataframe
    D.5 Run Model on Testing Dataset
    D.6 Calculate R-squared and RMSE from D.5

SECTION E - INTERPRET RESULTS 


TASK 3 - MODEL PREDICTIONS 

    1. IMPORT NEW DATA 'bike_data_new.xlsx'
    2. DATA QUALITY CHECKS
    3. DATA CLEANSING
        3.1 Remove duplicates
        3.2 Remove missing data
        3.3 Make a copy of dataset 
        3.4 Convert variables to appropriate data types
    4. PRE-PROCESSING
        4.1 Remove variables "timestamp" & "year"
        4.2 Encode categorical data
    5. PREDICTION
        5.1 Use model to predict bike shares
        5.2 Combine predicted bikeshares from 5.1 with the new data ('bike_data_new.xlsx')
            5.2.1 Reformat predictions into a dataframe called 'dfr' 
            5.2.2 Create a new dataframe that joins 'dfr' to the 'dfp_copy'  
    6. SAVE PREDICTIONS (into csv)

### SECTION A - IMPORT & COMBINE DATA

In [None]:
#   A.1 Access to Jupyter Notebook (COMPLETED)
#   A.2 Set default folder (COMPLETED)
#   A.3 Open file "Module 5 Fusion Day.ipynb" (COMPLETED)

#### A.4 Install Python Libraries

In [1]:
# RUN/EXECUTE THE ENTIRE CELL

# INSTALL LIBRARIES
#!pip install pandas
#!pip install numpy
#!pip install scikit-learn
#!pip install scipy
#!pip install seaborn
#!pip install matplotlib


# IMPORT LIBRARIES
# These libraries are commonly used in data science
import numpy as np
import pandas as pd
import math
import scipy


# Import different modules from the sklearn library to build and evaluate the linear regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


# Import matplotlib and seaborn libraries for data visualisation 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Switch off unnecessary warning messages 
import warnings
warnings.filterwarnings('ignore')

#### A.5 IMPORT DATA

In [2]:
# Read data from the "bike_data_2021_part1.xlsx" file and save that data into a dataframe called "df1"

# Answer:


In [1]:
# Read data from the "bike_data_2021_part2.xlsx" file and save that data into a dataframe called "df2"

# Answer:


#### A.6 COMBINE DATASETS
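
As a rough illustration of a left join on a shared key, here is a minimal sketch on toy data. The frames `df1_demo` and `df2_demo` and their values are made up for this example and are not the lab datasets; only `pd.merge` and its `on` and `how` parameters are real pandas API.

```python
import pandas as pd

# Toy stand-ins for df1 and df2 (hypothetical values, not the lab data).
df1_demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00"]
    ),
    "t1": [3.0, 3.5, 4.0],
})
df2_demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"]),
    "bikecount": [120, 95],
})

# how="left" keeps every row of the left frame; rows with no match in the
# right frame get NaN in the right frame's columns.
df_demo = pd.merge(df1_demo, df2_demo, on="timestamp", how="left")
print(df_demo)
```

In this toy case the third timestamp has no partner in `df2_demo`, so its `bikecount` is missing after the join. Unmatched rows like this are one source of the missing data checked in Section B.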

In [4]:
# Left Join data from df1 and df2, and save that data into a dataframe called "df" - use variable "timestamp" as the join key

# Answer:


### SECTION B - DATA PREPARATION

#### B.1 Data Quality Checks

        B.1.1 View sample of data 
        B.1.2 Shape of data
        B.1.3 Duplicate rows
        B.1.4 Missing data
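
The four checks above can be sketched on a small made-up dataframe. The name `demo` and its values are assumptions for illustration only; the pandas methods themselves are standard.

```python
import numpy as np
import pandas as pd

# Toy dataframe with one duplicated row and one missing value (hypothetical data).
demo = pd.DataFrame({
    "t1": [3.0, 3.0, np.nan, 5.5],
    "bikecount": [120, 120, 95, 210],
})

print(demo.head(3))   # first 3 rows; head() with no argument returns 5
print(demo.tail(3))   # last 3 rows
print(demo.shape)     # tuple of (number of rows, number of columns)

n_duplicates = demo.duplicated().sum()   # rows identical to an earlier row
n_missing = demo.isnull().sum()          # missing values per column
print(n_duplicates)
print(n_missing)

demo.info()           # dtypes and non-null counts in a single summary
```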

In [5]:
#B.1.1 VIEW SAMPLE OF DATA

# Question 1: Use the right function to view top 5 records

#Answer:



# Question 2: View last 5 records

#Answer:



# Question 3: View top 3 records

#Answer:



# Question 4: View last 3 records

#Answer:


In [6]:
#B.1.2 SHAPE OF DATA

# What function can we use to extract the structure (# columns and # of rows) of the dataframe 'df'?

#Answer:


In [7]:
#B.1.3 DUPLICATE ROWS

# Use a function to identify how many duplicate records there are in the 'df' dataset

#Answer:


In [8]:
#B.1.4 MISSING DATA

# Use the right function on 'df' to identify the number of missing values.

# This method prints out information about a dataFrame including the index, dtype, columns, non-null values and memory usage
# This method is also useful for finding out missing values in a dataset

#Answer:


#### B.2 Data Cleansing

        B.2.1 Convert variables to appropriate data types 
        B.2.2 Remove duplicate rows
        B.2.3 Interpolate using median
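
A minimal sketch of the three cleansing steps on toy data follows. The dataframe `demo` is hypothetical, and filling with the column median is shown only as one possible way to handle the gap.

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 1 are duplicates, row 2 has a missing temperature.
demo = pd.DataFrame({
    "season": ["Winter", "Winter", "Spring", "Spring"],
    "t1": [3.0, 3.0, np.nan, 9.0],
    "bikecount": [120.0, 120.0, 95.0, 210.0],
})

# Convert each column to a dtype that matches its meaning.
demo["season"] = demo["season"].astype("category")
demo["bikecount"] = demo["bikecount"].astype("int64")

# Keep the first occurrence of each duplicated row.
demo = demo.drop_duplicates()

# Replace missing temperatures with the median of the observed values.
demo["t1"] = demo["t1"].fillna(demo["t1"].median())

print(demo.dtypes)   # confirms the conversions
print(demo)
```

Removing duplicates before computing the median means repeated rows do not weight the fill value twice.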

In [9]:
# Run the script below to convert the variables to appropriate data types using the astype() method

df["timestamp"] = df.timestamp.astype("datetime64[ns]")  # recent pandas versions require the unit, e.g. [ns]
df["year"] = df.year.astype("category")
df["season"] = df.season.astype("category")
df["month"] = df.month.astype("category")
df["day"] = df.day.astype("category")
df["hour"] = df.hour.astype("category")
df["isholiday"] = df.isholiday.astype("category")
df["isweekend"] = df.isweekend.astype("category")
df["weathercode"] = df.weathercode.astype("category")
df["t1"] = df.t1.astype("float64")
df["humidity"] = df.humidity.astype("float64")
df["windspeed"] = df.windspeed.astype("float64")

In [9]:
# B.2.1 - CONVERT VARIABLES TO APPROPRIATE DATA TYPES


#Question 1: Use the appropriate script to convert variable 't2' into float

#Answer:



#Question 2: Use the appropriate script to convert variable 'bikecount' into int64

#Answer:



#Question 3: What function can we use to check if you were successful?

#Answer:



In [10]:
# B.2.2 - REMOVE DUPLICATE ROWS

# Please remove all the duplicates from the dataset 'df' 

#Answer:


In [11]:
# B.2.3 - INTERPOLATE USING MEDIAN

# For variable 't1', use appropriate function to fill in missing values with the median

#Answer:


### SECTION C - EXPLORATORY DATA ANALYSIS & VISUALISATION

#### C.1 Aggregation

In [12]:
# Calculate total and average bike shares for the year 2021

#Answer:


#### C.2 Distribution

Calculate total and average bike shares for the following:

        C.2.1 Season 
        C.2.2 Month
        C.2.3 Day of week
        C.2.4 Day of week for the most popular month
        C.2.5 Calculate all descriptive stats (for numerical variables only)
        C.2.6 Is bike shares distribution normal or skewed?
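
The grouped reports and the skewness check can be sketched as below. The data in `demo` is invented for illustration, so the numbers say nothing about the real bike data.

```python
import pandas as pd

# Toy data (hypothetical values).
demo = pd.DataFrame({
    "season": ["Winter", "Winter", "Summer", "Summer", "Summer"],
    "bikecount": [100, 200, 400, 500, 1500],
})

# Total and average bike shares per group.
by_season = demo.groupby("season")["bikecount"].agg(["sum", "mean"])
print(by_season)

# Group label with the highest average.
top_season = by_season["mean"].idxmax()
print(top_season)

# Descriptive statistics; describe() covers numerical columns by default.
stats = demo.describe()
print(stats)

# Sample skewness: near 0 suggests symmetry, a positive value a long right tail.
skewness = demo["bikecount"].skew()
print(skewness)
```

The same pattern applies to `month` or `day` by changing the grouping column. Filtering the dataframe to one month before grouping gives the report asked for in C.2.4.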

In [13]:
# C.2.1 - SEASON
# Create a report to showcase the total and average bike shares for each season. 
# And identify the season with the highest average value.

#Answer: 


In [14]:
# C.2.2 - MONTH
# Create a report to showcase the total and daily average bike shares for each month.
# And identify the month with the highest average value.

#Answer:


In [15]:
# C.2.3 - DAY OF WEEK
# Create a report to showcase the total and average bike shares for each day of the week. 
# And identify the day with the highest average value.

#Answer:


In [16]:
# C.2.4 - DAY OF WEEK FOR MOST POPULAR MONTH
# Create a report to showcase the total and average bike shares each day of week for the month with the highest average value. 
# And identify the day with the highest average value.

#Answer:


In [17]:
# C.2.5 - CALCULATE ALL DESCRIPTIVE STATS (for numerical variables only)
# Use a function to showcase all the descriptive statistics for all numerical variables. 

#Answer:


In [19]:
# C.2.6 Is bike shares distribution normal or skewed? 

#Answer: 


#### C.3 Correlation
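
One way to restrict a correlation matrix to numerical columns is to select them by dtype first. The sketch below uses a hypothetical dataframe `demo`; in it, `t2` is simply `t1` shifted by one degree, so the two correlate perfectly by construction.

```python
import pandas as pd

# Toy data (hypothetical values); 'season' is non-numeric on purpose.
demo = pd.DataFrame({
    "season": ["Winter", "Winter", "Summer", "Summer"],
    "t1": [1.0, 2.0, 3.0, 4.0],
    "t2": [0.0, 1.0, 2.0, 3.0],
    "bikecount": [10, 30, 20, 40],
})

# Keep only numerical columns, then compute pairwise Pearson correlations.
corr = demo.select_dtypes(include="number").corr()
print(corr)
```

Values close to 1 or -1 indicate a strong linear relationship. Two predictors that are strongly correlated with each other carry overlapping information, which is worth noting before building the regression.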

In [18]:
# C.3.1 - CORRELATION MATRIX

# As part of your exploratory analysis use the right function to calculate the correlation for the numerical variables. 

#Answer:


In [19]:
# C.3.2 - Note down which variables are strongly correlated.

#Answer:


#### C.4 Visualisations

        C.4.1 Create a jointplot
        C.4.2 Create a lineplot
        C.4.3 Create a pairgrid
        C.4.4 Note findings

In [20]:
# C.4.1 - CREATE A JOINTPLOT

# Use seaborn to create a jointplot that shows the datapoints between the variables 'bikecount' and 't1' 
# Please label the axes accordingly

#Answer:


#CHALLENGE: Add hue to the script above to break down the data by the variable 'season':

#Answer:


In [21]:
# C.4.2 - CREATE A LINEPLOT

# Use seaborn to create a line plot that shows 'month' on the x-axis and 'bikecount' on the y-axis 
# Please label the axes accordingly

#Answer:



#CHALLENGE: Add hue to the script above to break down the data by the variable 'isweekend':

#Answer:


In [22]:
# C.4.3 - CREATE A PAIRGRID
# Use seaborn to plot a PairGrid with histograms and scatterplots for variables ["t1","t2","humidity","windspeed","bikecount"]
# Please label the axes accordingly

#Answer:



#CHALLENGE: Add hue to the script above to break down the data by the variable 'season':

#Answer:


In [23]:
#C.4.4 - NOTE FINDINGS BELOW

#Answer: 

### SECTION D - PREDICTIVE MODELLING (REGRESSION)

#### D.1 PRE-PROCESSING

    D.1.1	Remove “timestamp” and “year” variables from the dataset
    D.1.2	Encode categorical data
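
Dropping columns and one-hot encoding can be sketched on toy data as below. The dataframe `demo` and its values are hypothetical, and only two categorical columns are encoded to keep the output small.

```python
import pandas as pd

# Toy data (hypothetical values).
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-02", "2021-04-05", "2021-07-06"]),
    "year": [2021, 2021, 2021],
    "season": ["Winter", "Spring", "Summer"],
    "isweekend": [1, 0, 0],
    "t1": [3.0, 11.0, 22.0],
    "bikecount": [120, 480, 950],
})

# Remove columns that should not be used as predictors.
demo = demo.drop(columns=["timestamp", "year"])

# One new 0/1 column per category level; the original columns are replaced.
encoded = pd.get_dummies(demo, columns=["season", "isweekend"])
print(encoded.columns.tolist())
```

Passing `drop_first=True` to `pd.get_dummies` would omit one level per variable, which avoids perfectly collinear dummy columns. This is a design choice rather than something the lab requires.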


In [24]:
# D.1.1 - REMOVE "TIMESTAMP" & "YEAR" VARIABLES

# Use the right function to drop "timestamp" and "year" variables

#Answer:


In [25]:
# D.1.2 - ENCODE CATEGORICAL DATA

# Create dummy variables (one-hot encoding) for the following categorical variables ["season","month","day","hour","isholiday","isweekend","weathercode"]

#Answer:


#### D.2 TRAIN/TEST SPLIT

    D.2.1 Split df dataset into X & Y datasets
    D.2.2 Perform 70:30 random split (use "random_state=1")
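
A minimal sketch of both steps on a ten-row toy dataframe follows. The names with a `_demo` suffix and all values are made up for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 10 rows (hypothetical values).
demo = pd.DataFrame({
    "t1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "humidity": [90.0, 85.0, 80.0, 75.0, 70.0, 65.0, 60.0, 55.0, 50.0, 45.0],
    "bikecount": [50, 60, 80, 90, 120, 150, 160, 200, 210, 260],
})

# X holds the predictors, y the dependent variable.
x_demo = demo.drop(columns=["bikecount"])
y_demo = demo["bikecount"]

# test_size=0.3 reserves 30% of rows for testing; random_state fixes the shuffle
# so the same split is produced on every run.
x_train_demo, x_test_demo, y_train_demo, y_test_demo = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=1
)
print(x_train_demo.shape, x_test_demo.shape)
```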

In [26]:
#D.2.1 - Split df dataset into X & Y datasets

# Filter the dataset in order to split the 'df' dataset into X & Y datasets

#Answer:


In [27]:
#D.2.2 - Perform 70:30 random split (use "random_state=1")

# Split the 'x' and 'y' datasets further into train and test datasets. 
# Your test size should be 30% (0.3) and please use "random_state=1"
# There should be 4 resulting datasets in the output - x_train, x_test, y_train, y_test

#Answer:


#### D.3 BUILD LINEAR REGRESSION
    D.3.1 - Save the regression function "LinearRegression()" into a container called 'model'
    D.3.2 - Fit the regression into the training data 
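
To show what fitting does, the sketch below trains on toy data constructed so that the target is an exact linear function of the predictors. The `_demo` names and the numbers are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data built so that y = 5 + 2*t1 + 3*humidity exactly.
x_train_demo = pd.DataFrame({
    "t1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "humidity": [1.0, 0.0, 2.0, 1.0, 3.0, 0.0],
})
y_train_demo = 5 + 2 * x_train_demo["t1"] + 3 * x_train_demo["humidity"]

# Create the unfitted estimator, then fit it; fit() estimates the intercept
# and coefficients and stores them on the object.
model_demo = LinearRegression()
model_demo.fit(x_train_demo, y_train_demo)

# R-squared on the training data; it is 1.0 here only because the toy data has no noise.
train_r2 = model_demo.score(x_train_demo, y_train_demo)
print(train_r2)
```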

In [28]:
#D.3.1 - Save the regression function "LinearRegression()" into a container called 'model'

#Answer:


In [29]:
#D.3.2 - Fit the regression into the training data 

# use x_train and y_train to fit the regression line

#Answer:


#### D.4 EVALUATE (TRAINING DATASET)

    D.4.1 Identify regression intercept
        D.4.1a Reformat intercept into a dataframe	
    D.4.2 Identify regression coefficients
        D.4.2a Reformat coefficients into a dataframe
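
The fitted values live in the `intercept_` and `coef_` attributes of a scikit-learn linear model. A sketch of reading them and wrapping them in dataframes, on the same kind of noise-free toy data (all names and values hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data built so that y = 5 + 2*t1 + 3*humidity exactly.
x_demo = pd.DataFrame({
    "t1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "humidity": [1.0, 0.0, 2.0, 1.0, 3.0, 0.0],
})
y_demo = 5 + 2 * x_demo["t1"] + 3 * x_demo["humidity"]
model_demo = LinearRegression().fit(x_demo, y_demo)

# intercept_ is a single number when y is one-dimensional.
intercept_df = pd.DataFrame({"intercept": [model_demo.intercept_]})

# coef_ follows the column order of the training data, so pairing it with
# the column names labels each coefficient.
coef_df = pd.DataFrame({
    "variable": x_demo.columns,
    "coefficient": model_demo.coef_,
})
print(intercept_df)
print(coef_df)
```

Each coefficient is the expected change in the target for a one-unit increase in that variable, holding the others fixed.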

In [30]:
#D.4.1 - Identify regression intercept 

# Use the right function to extract intercept of the model

#Answer:


In [31]:
#D.4.1a - Reformat intercept into a dataframe

# The output from D.4.1 is a bare value (a float, or an array if 'y' was passed as a dataframe)
# Please reformat the intercept output into a dataframe for a much cleaner view 

#Answer:


In [32]:
#D.4.2 - Identify regression coefficients

# Use the right function to extract the coefficients of the model

#Answer:


In [33]:
#D.4.2a - Reformat coefficients into a dataframe

# The output from D.4.2 is an array 
# Please reformat the coefficients output into a dataframe for a much cleaner view 

#Answer:


#### D.5 RUN MODEL ON TESTING DATASET

In [34]:
# Use the right function from regression 'model' to predict the test dataset 'x_test'
# Save the predictions into a container called 'y_pred'

#Answer:


#### D.6 CALCULATE R-SQUARED & RMSE
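
R-squared is the share of variance in the observed values that the predictions explain. RMSE is the square root of the mean squared error, expressed in the units of the target. A minimal sketch with made-up observed and predicted values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values.
y_test_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# R-squared = 1 - SS_res / SS_tot = 1 - 0.75 / 20
r2 = r2_score(y_test_demo, y_pred_demo)

# RMSE = sqrt(mean of squared errors) = sqrt(0.75 / 4)
rmse = np.sqrt(mean_squared_error(y_test_demo, y_pred_demo))

print(round(r2, 4))    # 0.9625
print(round(rmse, 4))  # 0.433
```

Note the argument order: the observed values come first and the predictions second.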

In [35]:
# Use the right functions to evaluate the overall model performance by reporting the R-squared and RMSE statistics

#Answer:


### SECTION E - INTERPRET RESULTS

In [36]:
#EDA Evaluation

#Answer: 



#Model Evaluation:

#Answer: 


# TASK 3 - MODEL PREDICTIONS
Now that we have a working model, prepare the new dataset, use the model to predict bike shares for it, and save the predictions to a CSV file.

#### 1 - IMPORT NEW DATA 'bike_data_new.xlsx'


In [37]:
# Import data "bike_data_new.xlsx" and save as a dataframe called 'dfp'

#Answer:


#### 2 - DATA QUALITY CHECKS

In [38]:
# Please use various functions to conduct quality checks on the data


# Answer:


#### 3 - DATA CLEANSING

In [39]:
# 3.1 - REMOVE DUPLICATES

# Remove all the duplicates from 'dfp' 

# Answer:


In [40]:
# 3.2 - REMOVE MISSING DATA

# Remove all the null values from 'dfp'

# Answer:


In [41]:
# 3.3 - MAKE A COPY OF DATASET

# Create a copy of the dataset 'dfp' and call it 'dfp_copy'. This will be used during the last stage.

# Answer:


In [46]:
# 3.4 - CONVERT VARIABLES TO APPROPRIATE DATA TYPES
# Run data conversions below as is:

dfp["timestamp"] = dfp.timestamp.astype("datetime64[ns]")  # recent pandas versions require the unit, e.g. [ns]
dfp["year"] = dfp.year.astype("category")
dfp["month"] = dfp.month.astype("category")
dfp["day"] = dfp.day.astype("category")
dfp["hour"] = dfp.hour.astype("category")
dfp["isholiday"] = dfp.isholiday.astype("category")
dfp["weathercode"] = dfp.weathercode.astype("category")
dfp["t1"] = dfp.t1.astype("float64")
dfp["t2"] = dfp.t2.astype("float64")
dfp["humidity"] = dfp.humidity.astype("float64")
dfp["windspeed"] = dfp.windspeed.astype("float64")


In [47]:
#Question 1: Use the appropriate script to convert variable 'season' into category

# Answer:



#Question 2: Use the appropriate script to convert variable 'isweekend' into category

# Answer:



#Question 3: What function can we use to check if you were successful?

# Answer:


#### 4 - PRE-PROCESSING


In [44]:
# 4.1 - REMOVE VARIABLES "TIMESTAMP" & "YEAR"

# drop "timestamp" and "year" columns from 'dfp'

# Answer:


Unnamed: 0,season,month,day,hour,isholiday,isweekend,weathercode,t1,t2,humidity,windspeed
0,Winter,1,6,6,0,1,1,3.0,0.0,87.0,10.0
1,Winter,1,6,7,0,1,1,3.0,0.0,87.0,12.0
2,Winter,1,6,8,0,1,1,3.0,0.0,87.0,12.0
3,Winter,1,6,9,0,1,1,4.5,1.5,84.0,14.0
4,Winter,1,6,10,0,1,1,5.5,2.5,76.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...
7485,Winter,12,5,18,0,0,4,4.0,2.0,100.0,9.0
7486,Winter,12,5,19,0,0,4,4.0,1.5,100.0,11.0
7487,Winter,12,5,20,0,0,4,4.0,1.5,100.0,10.0
7488,Winter,12,5,21,0,0,4,4.0,1.0,100.0,12.0


In [48]:
# 4.2 - ENCODE CATEGORICAL DATA

# Create dummy variables (one-hot encoding) from the following categorical variables: ["season","month","day","hour","isholiday","isweekend","weathercode"]
# Please save to 'dfp2'

# Answer:


#### 5 - PREDICTION

#### 5.1 USE MODEL TO PREDICT BIKE SHARES 
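
A fitted scikit-learn model expects the new data to have the same feature columns, in the same order, as the training data. One-hot encoding a new dataset can produce fewer columns if a category level never appears in it. The sketch below shows one possible way to align the columns, using `reindex`; all names and values are hypothetical, and the lab data may not need this step.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data with three weather codes (hypothetical values).
train_raw = pd.DataFrame({
    "t1": [3.0, 8.0, 12.0, 15.0, 20.0, 6.0],
    "weathercode": [1, 2, 7, 1, 2, 7],
})
train_y = pd.Series([100, 180, 90, 300, 420, 60])
train_x = pd.get_dummies(train_raw, columns=["weathercode"])
model_demo = LinearRegression().fit(train_x, train_y)

# New data in which weather code 7 never occurs, so its dummy column is absent.
new_raw = pd.DataFrame({"t1": [5.0, 18.0], "weathercode": [1, 2]})
new_x = pd.get_dummies(new_raw, columns=["weathercode"])

# Add any missing dummy columns as 0 and put columns in the training order.
new_aligned = new_x.reindex(columns=train_x.columns, fill_value=0)

newbikeshares_demo = model_demo.predict(new_aligned)
print(newbikeshares_demo)
```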

In [49]:
# Use the "model" to predict dfp2 and save results to 'newbikeshares'

# Answer:


#### 5.2 - COMBINE PREDICTED BIKE SHARE DATA FROM 5.1 WITH THE NEW DATA ('bike_data_new.xlsx')
    5.2.1 - Reformat predictions into a dataframe called 'dfr'
    5.2.2 - Create a new dataframe that joins dfr to the 'dfp_copy'
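
The two steps above, plus writing the result out as CSV, can be sketched as follows. The `_demo` objects are hypothetical stand-ins, and the CSV is written to an in-memory buffer here rather than to a file. The gaps in the toy index mimic rows that were dropped during cleansing.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for 'dfp_copy'; the index has a gap, as it would after dropping rows.
dfp_copy_demo = pd.DataFrame(
    {"hour": [6, 7, 8], "t1": [3.0, 3.0, 4.5]},
    index=[0, 2, 3],
)

# Stand-in for the array returned by model.predict().
newbikeshares_demo = np.array([110.2, 140.7, 310.0])

# Reusing the original index keeps each prediction on its own row; a fresh
# 0..n-1 index would misalign with the gapped index when joining.
dfr_demo = pd.DataFrame(
    {"newbikecount": newbikeshares_demo}, index=dfp_copy_demo.index
)

# Side-by-side join on the shared index.
newdf_demo = pd.concat([dfp_copy_demo, dfr_demo], axis=1)

# index=False leaves the row labels out of the CSV output.
buffer = io.StringIO()
newdf_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing a file name such as a `.csv` path to `to_csv` instead of the buffer writes the same content to disk.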


In [50]:
# 5.2.1 - Reformat predictions into a dataframe called 'dfr' and name the new column "newbikecount"

# Answer:


In [51]:
# 5.2.2 - Create a new dataframe 'newdf' that joins 'dfr' to the 'dfp_copy' 
#note: dataset 'dfp_copy' was saved in section 3.3 for later reference

# Answer:


#### 6 - SAVE PREDICTIONS TO CSV

In [52]:
# Save the above dataframe ("newdf") as a csv file

# Answer:
