# Linear Regression and Ensemble learning

**Note:** Whatever work we do in sklearn, repeat all of it in Keras as homework, for everything starting from k-means clustering.

## Simple linear regression (SLR)
- SLR is a statistical technique used to find whether an association exists between a dependent variable (response variable) and an independent variable (feature).
- In SLR : y = f(x)
- Here, x and y are continuous variables.

## Multiple Linear Regression (MLR)
- For multiple linear regression, there are multiple independent variables like x1, x2, x3, ..., xn.
- However, there is still only one dependent variable, y.
- In MLR: y = f(x1, x2, x3, ..., xn)

## A few examples of simple and multiple regression
... (Refer slides)
2. Insurance companies would like to understand the association between healthcare costs and ageing. (SLR).
4. Restaurants would like to know the relationship between the customer waiting time and revenue...

## Steps to building Simple Linear Regression model:
1. Extract Data
2. Pre-process the data
... refer slides

## Steps in building Linear regression model.
1. Perform Descriptive Analytics
2. Build the Model
3. Perform Model Diagnostics
4. Validate the Model and Measure Model Accuracy
5. Decide on Model Deployment.

## SLR in more detail, with an example



- Assume that we have 8 samples of x and y.
- We use 5 of them for training.
- The output of training is the slope "m" and the intercept "c".
- We use these values to formulate the equation: y = m*x + c.
- Let's say m = 1.8 and c = 25.
- Then the fitted equation becomes y = 1.8*x + 25.

- Evaluate this model by applying it to the remaining 3 samples and calculating the error.
- Applying the model to these 3 samples gives y_out (also known as y cap or y hat, the predicted value).
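
The 5/3 split above can be sketched in a few lines. The x and y values below are made up for illustration (they are not from the slides): the training points lie exactly on y = 1.8*x + 25 so that the fit recovers m and c, and the three held-out points carry small hypothetical offsets so that the errors are non-zero.

```python
import numpy as np

# Hypothetical data: 8 samples. The first 5 lie exactly on y = 1.8*x + 25,
# the last 3 are shifted by small made-up offsets.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
offsets = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 1.0])
y = 1.8 * x + 25.0 + offsets

# 5 samples for training, the remaining 3 for evaluation.
train_x, train_y = x[:5], y[:5]
test_x, test_y = x[5:], y[5:]

# A degree-1 polynomial fit returns the slope m and the intercept c.
m, c = np.polyfit(train_x, train_y, deg=1)

# Apply the fitted line to the held-out samples and compute the errors.
y_out = m * test_x + c
errors = test_y - y_out

print(f"m = {m:.2f}, c = {c:.2f}")
print("errors on the 3 held-out samples:", np.round(errors, 2))
```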


- The instructor noted that this dataset is far too small.
- Real-world production models need much more data (Big Data).

## Why are negative errors undesirable?
- According to the instructor, positive and negative errors tend to cancel each other out when they are added up.
- Therefore, for a more accurate representation of the error, we consider only its magnitude.
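
A small numeric illustration of the cancellation, using made-up actual and predicted values: the plain mean of the errors is zero even though every prediction is off, while the absolute and squared versions are not.

```python
import numpy as np

# Hypothetical actual and predicted values (not from the slides).
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([12.0, 10.0, 15.0, 15.0])

errors = predicted - actual                 # [ 2., -2.,  1., -1.]

mean_error = errors.mean()                  # signs cancel out
mean_abs_error = np.abs(errors).mean()      # magnitude only
mean_sq_error = (errors ** 2).mean()        # squaring also removes the sign

print(mean_error, mean_abs_error, mean_sq_error)
```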


- There are various ways to measure the difference between predicted and actual values.
- Two such metrics are Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
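
For reference, the two metrics are usually defined as follows, where n is the number of samples, y_i is the actual value and ŷ_i is the predicted value (the symbols are notation chosen here, not taken from the slides):

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

RMSE is expressed in the same units as y, which often makes it easier to interpret than MSE.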

## Example:
- File name: MBA Salary.csv contains the salary of 50 graduating MBA students of a Business School in 2016 and their corresponding percentage marks in grade 10. Develop an SLR model to understand and predict salary based on the percentage of marks in Grade 10.

## Steps:
 (Refer slides) 

- Import the pandas and numpy packages
- Load the data set
- Print detailed information about the DataFrame
- Create the feature X and the label y
- Split the dataset into training and validation sets (80-20 split).
    - train_test_split() returns 4 values, in this order:
        1. train_X
        2. test_X
        3. train_y
        4. test_y
- Build the model
- Evaluate the trained model on the validation set
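
The steps above can be sketched as follows. Because the CSV file is not reproduced in these notes, the sketch generates a synthetic 50-row stand-in; the column names `percent_grade10` and `salary` and all values are assumptions made for illustration, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 50-row salary data (all values are made up).
rng = np.random.default_rng(0)
percent = rng.uniform(50, 95, size=50)
salary = 3000 * percent + 60000 + rng.normal(0, 20000, size=50)
df = pd.DataFrame({"percent_grade10": percent, "salary": salary})

# Print detailed information about the DataFrame.
df.info()

# Feature X must be 2-D (a DataFrame), label y is 1-D (a Series).
X = df[["percent_grade10"]]
y = df["salary"]

# 80-20 split; note the order of the four returned values.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.8, random_state=100
)

# Build the model and predict on the validation set.
model = LinearRegression().fit(train_X, train_y)
pred_y = model.predict(test_X)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```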


## Finding MSE and R2 Score

### R2 Score: 
- A relative metric in which the higher the value, the better the model's fit.
- In essence, this metric represents how much of the variance in the actual label values the model is able to explain.

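```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# test_y: actual labels of the validation set; pred_y: the model's predictions.
# Caution: np.abs hides the sign. A negative raw R2 would mean the model
# does worse than always predicting the mean.
np.abs(r2_score(test_y, pred_y))
# 0.1566
# So, the model explains only about 15.7% of the variance in the validation set.
```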
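
To make the definition concrete, the sketch below computes R2 by hand as 1 - SS_res / SS_tot and compares it with `r2_score`. The arrays are made-up values chosen so the arithmetic is easy to follow; the second prediction set shows that R2 can be negative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual values and two sets of predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 6.5, 9.5])   # close to the actual values
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # trend reversed

def r2_manual(actual, predicted):
    # Residual sum of squares: what the model fails to explain.
    ss_res = np.sum((actual - predicted) ** 2)
    # Total sum of squares: variation around the mean of the actual values.
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_good = r2_manual(y_true, y_good)   # 1 - 1/20
r2_bad = r2_manual(y_true, y_bad)     # 1 - 80/20

print(r2_good, r2_score(y_true, y_good))
print(r2_bad, r2_score(y_true, y_bad))
```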



## SLR with a decision tree


- A decision tree is a systematic, structured representation of a problem.
    - It is used for linearly inseparable problems.
    - A decision tree can have any number of leaves, so it can be used for classification as well as regression models.
    - It is important to note the relation between outcomes and leaves: each leaf corresponds to one outcome (a class label or a predicted value).
    - Decision trees can also be used for non-linear regression problems.
    - Important things:
        1. Select the attribute for the root node (do the information gain (IG) calculation)
        2. Build a systematic tree structure by examining the problem
        3. Derive logical constructs, such as if conditions, from the tree.
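
A minimal sketch of the IG calculation used to pick the root attribute, assuming entropy as the impurity measure. The function names and the toy arrays are hypothetical; the attribute with the highest information gain would be chosen as the root.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of the label distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    # Entropy before the split minus the weighted entropy after splitting
    # the samples by each distinct value of the feature.
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted

# Toy data: one attribute separates the labels perfectly, the other not at all.
labels = np.array(["yes", "yes", "no", "no"])
attr_perfect = np.array(["a", "a", "b", "b"])
attr_useless = np.array(["a", "b", "a", "b"])

ig_perfect = information_gain(attr_perfect, labels)
ig_useless = information_gain(attr_useless, labels)
print(ig_perfect, ig_useless)
```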


## SLR using a decision tree
- Linear regression works well if the relationship between the independent and dependent variables is linear; when it is not, a decision tree can fit the data better.
- (Refer slides)
- Also notice the improvement in the R2 score.
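
The effect can be reproduced on synthetic data. The sketch below assumes a deliberately non-linear relationship (a noisy sine curve, not the data from the slides) and compares the R2 score of a linear model with that of a shallow decision tree; `max_depth=4` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data: y follows a sine curve plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.8, random_state=100
)

# Fit both models on the same training set.
linear = LinearRegression().fit(train_X, train_y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(train_X, train_y)

# Compare R2 on the validation set.
r2_linear = r2_score(test_y, linear.predict(test_X))
r2_tree = r2_score(test_y, tree.predict(test_X))
print(f"linear R2 = {r2_linear:.3f}, tree R2 = {r2_tree:.3f}")
```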

# Ensemble Algorithm (ENA)

- An ENA constructs not just one decision tree but a large number of trees, allowing better predictions on more complex data.
- It works by combining multiple base estimators to produce an optimal model, by applying **an aggregate function to a collection of base models** (referred to as **bagging**).
- Alternatively, it works by combining multiple base estimators to produce an optimal model, by building **a sequence of models that build on one another to improve predictive performance** (referred to as **boosting**).

## SLR using Ensemble algorithm (ENA)


## Bagging
- It works by combining multiple base estimators to produce an optimal model, by applying **an aggregate function to a collection of base models** (referred to as **bagging**).

### The process of bagging
- Data 
<pre>
/     |    \
V     V     V
</pre>
- Bootstrap samples (B1, B2, B3)
<pre>
|     |     |
V     V     V
</pre>
- Models (M1, M2, M3)
<pre>
\     |     /
 V    V    V
</pre>
- Aggregation/Voting (voting by aggregate function) 
<pre>
    |
    V
</pre>
- Output

- Data size is 10k
- B1 => 3300 => 70:30 => M1
- B2 => 3300 => 70:30 => M2
- B3 => 3300 => 70:30 => M3

- eg: 
- M1 = 90 
- M2 = 85 
- M3 = 91 
- (M1, M2, M3) ---> Agg
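
The flow above (data, bootstrap samples, models, aggregation) can be sketched by hand. This is an illustrative sketch on synthetic data, assuming three decision trees as base models and a plain average as the aggregate function; here each bootstrap sample is drawn with replacement at the full data size, which differs from the 3300-row samples in the notes. scikit-learn also provides `BaggingRegressor` and `RandomForestRegressor` for the same idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 1))
y = 1.8 * X[:, 0] + 25 + rng.normal(0, 1.0, size=600)

# Train one model per bootstrap sample (rows drawn with replacement).
n_models = 3
models = []
for i in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=4, random_state=i)
    models.append(tree.fit(X[idx], y[idx]))

# Aggregate: average the predictions of M1, M2 and M3.
X_new = np.array([[2.0], [5.0], [8.0]])
per_model = np.stack([m.predict(X_new) for m in models])  # shape (3, 3)
bagged = per_model.mean(axis=0)
print(bagged)

# The toy outputs from the notes, aggregated by a plain average.
agg_example = np.mean([90, 85, 91])
print(round(agg_example, 2))
```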

## Boosting
- It works by combining multiple base estimators to produce an optimal model, **by building a sequence of models that build on one another to improve predictive performance** (referred to as **boosting**).
- Eg: Gradient Boosting, CatBoost, and XGBoost.


## The process of boosting:
- Training set
<pre>
    |    
    V     
</pre>
- Subset 1
<pre>
    | training
    V     
</pre>
- Weak Learner 1
<pre>
    | testing
    V     
</pre>
- False prediction
<pre>
    |    
    V     
</pre>
- Subset 2
<pre>
    | training
    V     
</pre>
- Weak Learner 2
<pre>
    | testing
    V     
</pre>
- False prediction
<pre>
    ........
    |    
    V     
</pre>
- Subset m
<pre>
    | training
    V     
</pre>
- Weak Learner m
<pre>
    | testing
    V     
</pre>
- Final prediction
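
A hand-rolled sketch of the sequential idea, under stated assumptions: the diagram passes the wrongly predicted samples on to the next learner, whereas this sketch uses the regression variant in which each weak learner (a depth-1 tree) is fitted to the residual errors left by the learners before it, as gradient boosting with squared error does. The data, `learning_rate` and `n_stages` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (made up for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

learning_rate = 0.5
n_stages = 20

# Start from a constant prediction (the mean of y).
prediction = np.full_like(y, y.mean())
mse_per_stage = [np.mean((y - prediction) ** 2)]

learners = []
for _ in range(n_stages):
    # What the current ensemble still gets wrong.
    residual = y - prediction
    # The next weak learner is trained on those errors.
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    learners.append(stump)
    # Add a damped correction to the running prediction.
    prediction = prediction + learning_rate * stump.predict(X)
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_per_stage[0]:.4f} -> {mse_per_stage[-1]:.4f}")
```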

- Eg: Gradient boosting:
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# X (features) and Y (label) are assumed to be defined already.
train_X, test_X, train_y, test_y = train_test_split(X, Y, train_size=0.8, random_state=100)

# Fit a GradientBoostingRegressor model on the training set.
# Try also: AdaBoost, XGBoost
model = GradientBoostingRegressor().fit(train_X, train_y)
print(model)

# Predict on the validation set.
pred_y = model.predict(test_X)
```

# Multiple Linear Regression (MLR):

### On IPL data, let us make the problem statement which may become our prediction model as:
### "Representing sold price (y) in terms of other feature vectors (eg: Playing Roles, Strike rate, sixers, etc.)"


- Categorical data:
    - columns: B, D, E, and F.
- Why do we need to convert a categorical variable to a numerical variable?
- A categorical variable must be converted into numbers; otherwise the model cannot use it as an input variable.
- In the DataFrame, variables can have any of these three data types:
    - Integer
    - Float
    - Object
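
One way to see which columns still need encoding is to inspect the data types. The sketch below uses a tiny hand-built DataFrame with a few values copied from the first rows of the IPL data shown in these notes; it is an illustration, not the full dataset.

```python
import pandas as pd

# A few columns and rows taken from the head() output in these notes.
df = pd.DataFrame({
    "AGE": [2, 2, 1],
    "COUNTRY": ["SA", "BAN", "IND"],
    "PLAYING ROLE": ["Allrounder", "Bowler", "Bowler"],
    "ODI-SR-B": [0.0, 71.41, 84.56],
})

# Show the data type of every column.
print(df.dtypes)

# Numeric columns can be used directly; the rest need encoding first.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_columns)
print(non_numeric_columns)
```

Note that a numerically coded column such as AGE can still be categorical in meaning, so the data type alone does not settle which columns to encode.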

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

In [31]:
ipl_auction_df = pd.read_csv("IPL2013.csv")
ipl_auction_df.head(5)

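| | Sl.NO. | PLAYER NAME | AGE | COUNTRY | TEAM | PLAYING ROLE | T-RUNS | T-WKTS | ODI-RUNS-S | ODI-SR-B | ... | SR-B | SIXERS | RUNS-C | WKTS | AVE-BL | ECON | SR-BL | AUCTION YEAR | BASE PRICE | SOLD PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Abdulla, YA | 2 | SA | KXIP | Allrounder | 0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 307 | 15 | 20.47 | 8.9 | 13.93 | 2009 | 50000 | 50000 |
| 1 | 2 | Abdur Razzak | 2 | BAN | RCB | Bowler | 214 | 18 | 657 | 71.41 | ... | 0.0 | 0 | 29 | 0 | 0.0 | 14.5 | 0.0 | 2008 | 50000 | 50000 |
| 2 | 3 | Agarkar, AB | 2 | IND | KKR | Bowler | 571 | 58 | 1269 | 80.62 | ... | 121.01 | 5 | 1059 | 29 | 36.52 | 8.81 | 24.9 | 2008 | 200000 | 350000 |
| 3 | 4 | Ashwin, R | 1 | IND | CSK | Bowler | 284 | 31 | 241 | 84.56 | ... | 76.32 | 0 | 1125 | 49 | 22.96 | 6.23 | 22.14 | 2011 | 100000 | 850000 |
| 4 | 5 | Badrinath, S | 2 | IND | CSK | Batsman | 63 | 0 | 79 | 45.93 | ... | 120.71 | 28 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2011 | 100000 | 800000 |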


In [32]:
ipl_auction_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Sl.NO.         130 non-null    int64  
 1   PLAYER NAME    130 non-null    object 
 2   AGE            130 non-null    int64  
 3   COUNTRY        130 non-null    object 
 4   TEAM           130 non-null    object 
 5   PLAYING ROLE   130 non-null    object 
 6   T-RUNS         130 non-null    int64  
 7   T-WKTS         130 non-null    int64  
 8   ODI-RUNS-S     130 non-null    int64  
 9   ODI-SR-B       130 non-null    float64
 10  ODI-WKTS       130 non-null    int64  
 11  ODI-SR-BL      130 non-null    float64
 12  CAPTAINCY EXP  130 non-null    int64  
 13  RUNS-S         130 non-null    int64  
 14  HS             130 non-null    int64  
 15  AVE            130 non-null    float64
 16  SR-B           130 non-null    float64
 17  SIXERS         130 non-null    int64  
 18  RUNS-C    

In [33]:
ipl_auction_df.iloc[0:5, 0:10]

| | Sl.NO. | PLAYER NAME | AGE | COUNTRY | TEAM | PLAYING ROLE | T-RUNS | T-WKTS | ODI-RUNS-S | ODI-SR-B |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Abdulla, YA | 2 | SA | KXIP | Allrounder | 0 | 0 | 0 | 0.0 |
| 1 | 2 | Abdur Razzak | 2 | BAN | RCB | Bowler | 214 | 18 | 657 | 71.41 |
| 2 | 3 | Agarkar, AB | 2 | IND | KKR | Bowler | 571 | 58 | 1269 | 80.62 |
| 3 | 4 | Ashwin, R | 1 | IND | CSK | Bowler | 284 | 31 | 241 | 84.56 |
| 4 | 5 | Badrinath, S | 2 | IND | CSK | Batsman | 63 | 0 | 79 | 45.93 |


In [34]:
# Note: T-RUNS = Test runs
X_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WTKS', 'ODI-SR-BL', 'CAPTAINCY EXP', 'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C', 'WKTS', 'AVE-BL', 'ECON', 'SR-BL']

- Let us now address how to convert a categorical variable into numeric values.

In [35]:
ipl_auction_df['PLAYING ROLE'].unique()

array(['Allrounder', 'Bowler', 'Batsman', 'W. Keeper'], dtype=object)

- Converting the distinct categories of a categorical variable into numeric values is called **encoding**.
- Example:
    - The 'PLAYING ROLE' feature takes the following categories:
        - Allrounder
        - Bowler
        - Batsman
        - W. Keeper
    - After the encoding operation, the playing roles are represented as follows:

In [36]:
pd.get_dummies(ipl_auction_df['PLAYING ROLE'])[0:5]

| | Allrounder | Batsman | Bowler | W. Keeper |
|---|---|---|---|---|
| 0 | True | False | False | False |
| 1 | False | False | True | False |
| 2 | False | False | True | False |
| 3 | False | False | True | False |
| 4 | False | True | False | False |


- `pd.get_dummies()` converts a categorical variable into numeric indicator (dummy) columns, one per category. This is also known as one-hot encoding.
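
A small sketch of what `drop_first` changes, using the first five playing roles from the data shown in these notes. Passing `dtype=int` is an optional choice made here so the output shows 0/1 instead of True/False.

```python
import pandas as pd

# The first five playing roles from the head() output in these notes.
roles = pd.DataFrame(
    {"PLAYING ROLE": ["Allrounder", "Bowler", "Bowler", "Bowler", "Batsman"]}
)

# One indicator column per category present in the data.
full = pd.get_dummies(roles, columns=["PLAYING ROLE"], dtype=int)

# drop_first=True drops the first category (alphabetically, Allrounder);
# a row of all zeros then stands for that dropped category.
reduced = pd.get_dummies(roles, columns=["PLAYING ROLE"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one category avoids a redundant column, since its value can always be inferred from the others.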

In [37]:
categorical_features = ['AGE', 'COUNTRY', 'PLAYING ROLE', 'CAPTAINCY EXP']

In [38]:
ipl_auction_encoded_df = pd.get_dummies(ipl_auction_df[X_features], columns = categorical_features, drop_first = True)

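KeyError: "['ODI-WTKS'] not in index"

- The KeyError is raised because `X_features` lists 'ODI-WTKS', whereas the column in the DataFrame is named 'ODI-WKTS' (see the `info()` output). Correcting the spelling in `X_features` resolves the error.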
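
A quick way to catch this kind of KeyError before selecting columns is to compare the requested names with the DataFrame's columns. The sketch below uses a tiny made-up DataFrame and a shortened feature list; only the misspelled and correct column names come from these notes.

```python
import pandas as pd

# Tiny stand-in DataFrame with the correctly spelled column name.
df = pd.DataFrame({"AGE": [2, 1], "ODI-WKTS": [0, 18], "SIXERS": [0, 5]})

# Shortened feature list containing the misspelling.
wanted = ["AGE", "ODI-WTKS", "SIXERS"]

# Names that are requested but not present in the DataFrame.
missing = [name for name in wanted if name not in df.columns]
print(missing)

# Correct the spelling, after which the selection works.
fixed = ["ODI-WKTS" if name == "ODI-WTKS" else name for name in wanted]
subset = df[fixed]
print(subset.columns.tolist())
```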

- For more code and notes, refer to the photos.

- For lab activity:
    - SLR with linear regression
    - SLR with decision tree
    - SLR with random forest
    - SLR with boosting
    - MLR with linear regression
    - MLR with decision tree
    - MLR with random forest
    - MLR with boosting
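
As a possible starting point for the lab, the sketch below runs the four model types on one synthetic MLR dataset and prints each R2 score. The data comes from `make_regression` rather than the course files, and the hyperparameters are arbitrary; setting `n_features=1` would give the SLR variant.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic MLR data with 5 features (a stand-in for the lab datasets).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.8, random_state=100
)

# The four model types from the lab list.
models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training set and score it on the validation set.
scores = {}
for name, model in models.items():
    model.fit(train_X, train_y)
    scores[name] = r2_score(test_y, model.predict(test_X))
    print(f"{name}: R2 = {scores[name]:.3f}")
```

Because this synthetic data is linear by construction, linear regression is expected to do well here; on real data the ranking may differ.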