# <center> Lending Club Case Study


## Table of Contents

1. [Introduction](#Introduction)
2. [Important Initial Settings](#Important-Initial-Settings)
3. [Data Understanding](#Data-Understanding)
4. [Data Preparation](#Data-Preparation)
    1. [Dropping Data](#Dropping-Data)
    2. [Formatting Data](#Formatting-Data)
    3. [Derived Data](#Derived-Data)
5. [Data Analysis](#Data-Analysis)
    1. [Univariable Analysis](#Univariable-Analysis)
        1. [Plot Loan Numeric Data](#Plot-Loan-Numeric-Data)
        2. [Plot Loan Categorical Data](#Plot-Loan-Categorical-Data)
        3. [Plot Customer Numeric Data](#Plot-Customer-Numeric-Data)
        4. [Plot Customer Categorical Data](#Plot-Customer-Categorical-Data)
    2. [Bivariable Analysis](#Bivariable-Analysis)
6. [Model Building](#Model-Building)
7. [Model Testing](#Model-Testing)
8. [Conclusions and Observations](#Conclusions-and-Observations)

# Introduction 

### Contact:
* Jheser Guzman (https://github.com/dicotips)

## Business Understanding

**Source:** UpGrad Assignment description

A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" which is usually computer-controlled wherein the user enters the payment information, and the system unlocks it. This bike can then be returned to another dock belonging to the same system.

A US bike-sharing provider **BoomBikes** has recently suffered considerable dips in their revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue as soon as the ongoing lockdown comes to an end, and the economy restores to a healthy state.

In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people after this ongoing quarantine situation ends across the nation due to Covid-19. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

* Which variables are significant in predicting the demand for shared bikes.
* How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

### Business Goal:

You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.

# Important Initial Settings

In [18]:
# Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys

# Importing ML Libraries
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Ignore version warnings
import warnings
warnings.filterwarnings('ignore')

# Display float numbers with two decimal places.
pd.options.display.float_format = '{:.2f}'.format 

In [19]:
# Reading CSV file with dtype object and saving it in raw_data dataframe. 
# All the processing in the data will be stored in new dataframes

# Set Dataset File Path
DATA_FILE_PATH = '_data/data.csv'

raw_data = pd.read_csv(DATA_FILE_PATH, dtype=object)

### Important Custom Functions

In [20]:
def univar_plot(dataframe, column, var_type, hue=None):
    '''
    univar_plot function plots a single column from a dataframe.
    dataframe  : dataframe variable
    column     : column name
    var_type   : variable type, to specify whether it is continuous or categorical
                Continuous=0  ==> Graph contains:  Distribution, Violin & Boxplot.
                Categorical=1 ==> Graph contains:  Countplot.
    hue        : Only for categorical data (coloring).
    '''
    sns.set(style="darkgrid")

    if var_type == 0:
        fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 8))

        ax[0].set_title("Distribution Plot")
        # histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
        sns.histplot(data=dataframe, x=column, kde=True, ax=ax[0])

        ax[1].set_title("Violin Plot")
        sns.violinplot(data=dataframe, x=column, ax=ax[1], inner="quartile")

        ax[2].set_title("Box Plot")
        # Passing the column as y draws the box vertically
        sns.boxplot(data=dataframe, y=column, ax=ax[2])

    elif var_type == 1:
        # Number of distinct hue values: 0 when hue is None, 1 when hue is a column name
        n_hue = len(pd.Series(data=hue, dtype=object).unique())
        fig, ax = plt.subplots()
        width = len(dataframe[column].unique()) + 6 + 4 * n_hue
        fig.set_size_inches(width, 7)
        ax = sns.countplot(data=dataframe, x=column,
                           order=dataframe[column].value_counts().index, hue=hue, ax=ax)
        for p in ax.patches:
            if n_hue > 0:
                # With a hue, label each bar with its share of all rows
                label = '{:1.1f}%'.format(p.get_height() * 100 / len(dataframe))
                offset = 0.05
            else:
                # Without a hue, label each bar with its raw count
                label = '{:.0f}'.format(p.get_height())
                offset = 0.32
            ax.annotate(label, (p.get_x() + offset, p.get_height() + 20))
    else:
        # Any other value is a caller error; fail loudly instead of silently plotting nothing
        raise ValueError("var_type must be 0 (continuous) or 1 (categorical)")

    plt.show()

# Data Understanding
* Ensure data quality:  identify issues and report them. 
* Interpret the meaning of the variables and describe the actions taken in comments.

In [21]:
# Getting the first 5 rows from the raw_data dataframe for data exploration
raw_data.head()

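| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01-01-2018 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 14.110847 | 18.18125 | 80.5833 | 10.749882 | 331 | 654 | 985 |
| 1 | 2 | 02-01-2018 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 14.902598 | 17.68695 | 69.6087 | 16.652113 | 131 | 670 | 801 |
| 2 | 3 | 03-01-2018 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 8.050924 | 9.47025 | 43.7273 | 16.636703 | 120 | 1229 | 1349 |
| 3 | 4 | 04-01-2018 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 8.2 | 10.6061 | 59.0435 | 10.739832 | 108 | 1454 | 1562 |
| 4 | 5 | 05-01-2018 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 9.305237 | 11.4635 | 43.6957 | 12.5223 | 82 | 1518 | 1600 |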


In [22]:
# Understanding the structure of the dataset
# Checking shape & datatype of raw_data dataframe
print(raw_data.shape)
print(raw_data.info())

(730, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   instant     730 non-null    object
 1   dteday      730 non-null    object
 2   season      730 non-null    object
 3   yr          730 non-null    object
 4   mnth        730 non-null    object
 5   holiday     730 non-null    object
 6   weekday     730 non-null    object
 7   workingday  730 non-null    object
 8   weathersit  730 non-null    object
 9   temp        730 non-null    object
 10  atemp       730 non-null    object
 11  hum         730 non-null    object
 12  windspeed   730 non-null    object
 13  casual      730 non-null    object
 14  registered  730 non-null    object
 15  cnt         730 non-null    object
dtypes: object(16)
memory usage: 91.4+ KB
None


In [23]:
print(raw_data.dtypes)

instant       object
dteday        object
season        object
yr            object
mnth          object
holiday       object
weekday       object
workingday    object
weathersit    object
temp          object
atemp         object
hum           object
windspeed     object
casual        object
registered    object
cnt           object
dtype: object


In [24]:
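# Verify whether there are any duplicated IDs in ['instant'].
# Note: .count() on the boolean mask returns the number of rows checked (730),
# not the number of duplicates; the sum over the 'instant' mask counts the duplicated IDs.

print(raw_data.duplicated(subset=None, keep='first').count())
sum(raw_data.duplicated(['instant']))

## Result: There are no duplicated IDs.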

730


0
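
The difference between counting the rows of a `duplicated()` mask and counting its `True` entries is easy to miss, so here is a minimal sketch on a toy frame. The `toy` dataframe and its values are hypothetical and are not taken from the dataset; only the column names mirror it.

```python
import pandas as pd

# Toy frame (hypothetical data): the ID 2 appears twice
toy = pd.DataFrame({"instant": [1, 2, 2, 3], "cnt": [10, 20, 20, 30]})

# Boolean mask: True for every repeated occurrence of an ID after the first
mask = toy.duplicated(subset=["instant"], keep="first")

n_rows_checked = int(mask.count())  # non-null entries in the mask = number of rows
n_duplicates = int(mask.sum())      # True entries in the mask = number of duplicated IDs

print(f"rows checked: {n_rows_checked}, duplicated IDs: {n_duplicates}")
```

Only the `.sum()` figure answers the question "are there duplicates?"; the `.count()` figure simply restates the row count.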

In [25]:
# Checking how many rows have all missing values
empty_rows = raw_data.isnull().all(axis=1).sum()
print(f'N Empty Rows: {empty_rows}')

## Result: There are no rows with all values missing.

N Empty Rows: 0


In [26]:
# Checking how many columns have all missing values
empty_columns = raw_data.isnull().all(axis=0).sum()
print(f'N Empty Columns: {empty_columns}')

## Result: There are no columns with all values missing.

N Empty Columns: 0


In [27]:
# Counting Nulls in each column
raw_data.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

## Data Preparation

* Address data quality issues:
  * Missing value imputation.
  * Outlier treatment.
  * Removal of data redundancies.
* If needed, convert data to a suitable and convenient format using the right methods.
* Manipulate strings and dates correctly wherever required.
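
Because the CSV was read with `dtype=object`, every column is still a string at this point. The following is a minimal sketch of the kind of conversion this step calls for, not the notebook's actual preparation code. It uses two rows copied from the `head()` output above. The day-month-year date format is an assumption inferred from those rows, and treating `season` as a categorical code is likewise an assumption.

```python
import pandas as pd

# Two sample rows copied from the head() output, kept as strings
# to mimic a CSV read with dtype=object.
sample = pd.DataFrame({
    "dteday": ["01-01-2018", "02-01-2018"],
    "season": ["1", "1"],
    "temp": ["14.110847", "14.902598"],
    "cnt": ["985", "801"],
}, dtype=object)

# Assumption: dteday is day-month-year, as the consecutive daily rows suggest
sample["dteday"] = pd.to_datetime(sample["dteday"], format="%d-%m-%Y")

# Measurements and counts become numeric dtypes
sample["temp"] = pd.to_numeric(sample["temp"])
sample["cnt"] = pd.to_numeric(sample["cnt"])

# Assumption: season is a coded category rather than a quantity
sample["season"] = sample["season"].astype("category")

print(sample.dtypes)
```

Results are written to a new dataframe here, in line with the earlier note that processed data is stored separately from `raw_data`.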

### Dropping Data

In [16]:
# Verifying how much memory is used by raw_data dataframe (1 MB = 1024**2 bytes)
raw_data_size_mb = format(sys.getsizeof(raw_data) / (1024**2), '.0f')

print(f'Raw_Data Memory Usage: {raw_data_size_mb}MB')

Raw_Data Memory Usage: 1MB
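
As an alternative to `sys.getsizeof`, pandas can report memory per column through `DataFrame.memory_usage`. The sketch below uses a small hypothetical stand-in (`demo`) rather than `raw_data`, and shows why `deep=True` matters when every value is stored as a Python string.

```python
import pandas as pd

# Hypothetical stand-in for raw_data: every value stored as a Python string object
demo = pd.DataFrame(
    {"instant": ["1", "2", "3"], "cnt": ["985", "801", "1349"]},
    dtype=object,
)

# deep=False counts only the object pointers held by each column
shallow_bytes = int(demo.memory_usage(deep=False).sum())

# deep=True also counts the string objects those pointers refer to
deep_bytes = int(demo.memory_usage(deep=True).sum())

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
print(f"deep size: {deep_bytes / 1024**2:.4f} MB")
```

For object columns the deep figure is the more realistic one, which is also why the `info()` output above reports memory usage with a trailing `+`.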


### Formatting Data

### Derived Data

## Data Analysis

* Univariate and segmented univariate analysis.
  * Describe assumptions.
  * The analyses successfully identify at least the 5 important driver variables.
* The most useful insights are explained correctly in the comments.
* Appropriate plots are created to present the results of the analysis.
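
Segmented univariate analysis can be sketched as summarizing the target within each level of one categorical variable. The example below uses only the five rows shown in the `head()` output, so its numbers illustrate the mechanics and say nothing about the full dataset. The choice of `workingday` as the segmenting variable is an assumption made for illustration.

```python
import pandas as pd

# First five rows from the head() output (values kept as strings, as in raw_data)
head_rows = pd.DataFrame({
    "workingday": ["0", "0", "1", "1", "1"],
    "cnt": ["985", "801", "1349", "1562", "1600"],
}, dtype=object)

# The target must be numeric before it can be aggregated
head_rows["cnt"] = pd.to_numeric(head_rows["cnt"])

# One summary row per segment of the categorical variable
segment_summary = head_rows.groupby("workingday")["cnt"].agg(["count", "mean", "median"])

print(segment_summary)
```

The same pattern applies to any other categorical column once the full dataframe has been converted to numeric types.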

### Univariable Analysis

#### Plot Loan Numeric Data

#### Plot Loan Categorical Data

#### Plot Customer Numeric Data

#### Plot Customer Categorical Data

### Bivariable Analysis

## Model Building

## Model Testing

## Conclusions and Observations