## **Topic: Inferential statistics**
**Agenda:** Profit prediction for emerging startups

**Description:** A venture capital firm has hired us to build a model. Our goal is to prepare a model that predicts a company's profit from its spending pattern and its location.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('50_Startups.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [4]:
data.head(5)

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94


In [5]:
# Prerequisites for inferential stats:
# |---- data must be complete
# |---- data must be numeric in nature
#
# Preprocessing steps whenever/wherever required:
# |---- categorical columns: use dummy variables or use `One Hot Encoding` technique aka OHE
# |---- missing values: use an imputation technique chosen according to the approach and the nature of the data:
# |     |---- stat approach: replace with the mean/median for numerical data, or the mode for categorical data
# |     |---- domain based approach: replace with the default value
# |     |---- hybrid approach: based on requirements you can switch between the `stat` and `domain` approaches
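
The preprocessing checklist above can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only; the `50_Startups.csv` data has no missing values, so this step is not needed in the notebook itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (NOT the 50_Startups data)
toy = pd.DataFrame({
    "Spend": [100.0, np.nan, 300.0, 200.0],
    "State": ["Florida", "Florida", None, "New York"],
})

# Stat approach: median for the numeric column, mode for the categorical column
toy["Spend"] = toy["Spend"].fillna(toy["Spend"].median())
toy["State"] = toy["State"].fillna(toy["State"].mode()[0])

# Categorical column -> dummy variables (one-hot encoding), then append the numeric column
encoded = pd.concat([pd.get_dummies(toy["State"], dtype=int), toy[["Spend"]]], axis=1)
print(encoded)
```

After these two steps the frame is both complete and fully numeric, which are the two prerequisites listed above.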

In [6]:
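# Dealing with categorical data: one-hot encode `State`, then append the numeric columns
# (dtype=int keeps the dummy columns as 0/1 rather than True/False on newer pandas versions)
finalDataset = pd.concat([pd.get_dummies(data['State'], dtype=int), data.iloc[:, [0, 1, 2, 4]]], axis=1)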
finalDataset.head()

   California  Florida  New York  R&D Spend  Administration  Marketing Spend     Profit
0           0        0         1  165349.20       136897.80        471784.10  192261.83
1           1        0         0  162597.70       151377.59        443898.53  191792.06
2           0        1         0  153441.51       101145.55        407934.54  191050.39
3           0        0         1  144372.41       118671.85        383199.62  182901.99
4           0        1         0  142107.34        91391.77        366168.42  166187.94
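
One property of the encoding above is worth seeing directly: the three dummy columns always sum to 1 in every row, so they are collinear with the model's intercept. scikit-learn's `LinearRegression` still fits such data, but the individual dummy coefficients are then harder to interpret. A common alternative, shown here as a sketch on a made-up `states` series and not as the notebook's method, is `drop_first=True`, which keeps one state as the baseline.

```python
import pandas as pd

# Hypothetical sample of the `State` column
states = pd.Series(["New York", "California", "Florida", "New York"], name="State")

full = pd.get_dummies(states, dtype=int)                      # one column per state
reduced = pd.get_dummies(states, dtype=int, drop_first=True)  # first category becomes the baseline

print(full.sum(axis=1).tolist())  # each row of the full encoding sums to 1
print(list(reduced.columns))      # California is dropped and acts as the baseline
```

If the reduced encoding were used, the feature rows built later for prediction would need two dummy values instead of three.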


In [7]:
# Separate `data` into `features` and `label`
features = finalDataset.iloc[:,[0,1,2,3,4,5]].values
label = finalDataset.iloc[:,[6]].values

In [9]:
features[:10]

array([[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.6534920e+05,
        1.3689780e+05, 4.7178410e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.6259770e+05,
        1.5137759e+05, 4.4389853e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.5344151e+05,
        1.0114555e+05, 4.0793454e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.4437241e+05,
        1.1867185e+05, 3.8319962e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.4210734e+05,
        9.1391770e+04, 3.6616842e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.3187690e+05,
        9.9814710e+04, 3.6286136e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3461546e+05,
        1.4719887e+05, 1.2771682e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
        1.4553006e+05, 3.2387668e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.2054252e+05,
        1.4871895e+05, 3.1161329e+05],
       [1.0000000e+00, 0.0000000e+00,

In [10]:
label[:10]

array([[192261.83],
       [191792.06],
       [191050.39],
       [182901.99],
       [166187.94],
       [156991.12],
       [156122.51],
       [155752.6 ],
       [152211.77],
       [149759.96]])

#### **ML Coding Begins from here...**

In [None]:
# Steps:
# |---- 1. Create Train-Test split
# |---- 2. Build the model using `train` split
# |---- 3. Check the quality of the model
# |---- 4. Deploy the model (optional stage in ML engineering as deployment is usually handled by app/dev team)

In [13]:
# 1. Create train test split
# |---- Use the `train` split to perform model training
# |---- Use the `test` split to perform model evaluation
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(features,
                                                 label,
                                                 test_size=0.2,
                                                 random_state=10)
print("X_train dimensions:",X_train.ndim)

X_train dimensions: 2


In [14]:
# NOTE:
# `random_state` seeds the shuffle that is applied before the split,
# so the same seed reproduces the same train/test split on every run
# here we have chosen random state to be 10
# this concept is discussed in detail in the ML course (4)
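
A minimal sketch of what the seed buys us, using made-up arrays `X` and `y` in place of `features` and `label`: calling `train_test_split` twice with the same `random_state` returns the same rows in the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for `features` and `label`
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)

print(np.array_equal(a_test, b_test))  # same seed, same test rows
print(a_train.shape, a_test.shape)     # 80/20 split of 10 rows
```

Because the reported scores depend on which rows land in the test split, fixing the seed is what makes them repeatable.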

In [15]:
# 2. Train the model
from sklearn.linear_model import LinearRegression
modelProfitPredictor = LinearRegression()
modelProfitPredictor.fit(X_train,y_train)

In [17]:
# 3. Check the quality of the model
#
# Model Evaluation theory:
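# |---- SL (significance level) = 0.05
# |---- CL (confidence level) = 0.95 (1 - SL)
# |---- SL is decided during the inception phase of the project:
# |     |---- either from the standard alpha values
# |     |---- or it can be a practical value that is inferred from the data (discussed in detail in the ML course)
# |
# |---- In this notebook CL doubles as the acceptance threshold: the test score (R²) must be at least 0.95
# |     (R² is the share of variance in the label explained by the model, not a classification accuracy)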

# Score of the model with the `training` and `testing` data
# |---- Note: here the score is the `r2 value` i.e. `coefficient of determination`
trainScore = modelProfitPredictor.score(X_train,y_train)
testScore = modelProfitPredictor.score(X_test,y_test)

# Test for Generalization:
# |---- We are trying to derive a generalized model
# |---- A generalized model means that the model understands the population pattern to its optimum level
# |---- A generalized model is a model that:
# |     |---- not only performs well on the known (training) data
# |     |---- but also performs as well or better on unknown (test) data
# |
# |---- Basics of Data Drift:
# |     |---- The model might underperform or give wrong predictions with incoming new data or an outlier during the test/deploy/live phase
# |     |---- We need to ensure that there is NO DRIFT in the model when unknown data is introduced for prediction
# |
# |---- Condition/test for success:
# |     |---- testScore > trainScore (Ensures a generalized model based on the standard definition of generalization)
# |     |---- testScore >= CL (Ensures our SL criterion is met i.e. there is no `drift` detected)

SL = 0.05
CL = 1 - SL
print("TestScore is {} and TrainScore is {} ".format(testScore,trainScore))
if testScore > trainScore and testScore >= CL:
  print("Model is approved")
else:
  print("Model is rejected")

TestScore is 0.9901105113397705 and TrainScore is 0.9385918220043519 
Model is approved
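
The `score` method used above returns R², the coefficient of determination. As a sketch of how that number comes about, the snippet below computes it by hand on made-up values (`y_true` and `y_pred` are assumptions, not outputs of `modelProfitPredictor`) and compares it with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted labels
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

With only 10 rows in the test split, the test R² can shift noticeably with a different `random_state`, so a single approve/reject outcome is best read as indicative.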


In [18]:
# 4. Deploy the model (App Example)
#
# simple app shown below
# ** advanced version is/will be documented in the ML course

rdSpend = float(input("Enter RD Spend: "))
admSpend = float(input("Enter Admin Spend: "))
markSpend = float(input("Enter Marketing Spend: "))
state = input("Enter State: ") # **

approvedState = ['California','Florida', 'New York']

# Drift Check
if state in approvedState:
  if state == "California":
    finalFeatureSet = np.array([[1,0,0,rdSpend,admSpend,markSpend]])
  elif state == "Florida":
    finalFeatureSet = np.array([[0,1,0,rdSpend,admSpend,markSpend]])
  elif state == "New York":
    finalFeatureSet = np.array([[0,0,1,rdSpend,admSpend,markSpend]])

  profit = modelProfitPredictor.predict(finalFeatureSet)
  print(f"\nModel Predicted profit of ${profit[0][0]}")
else:
  print("\nModel doesn't identify {} state, thus can't do prediction !".format(state))


Enter RD Spend: 343443
Enter Admin Spend: 566566
Enter Marketing Spend: 2332321
Enter State: California

Model Predicted profit of $369883.7859982224
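
The `if`/`elif` chain in the app can be folded into a small helper. The sketch below is one possible shape, not part of the original app: `build_feature_row` and `APPROVED_STATES` are hypothetical names, and the helper assumes the dummy columns are ordered California, Florida, New York as in `finalDataset`.

```python
import numpy as np

# Same order as the dummy columns in the training data (assumed from finalDataset)
APPROVED_STATES = ["California", "Florida", "New York"]

def build_feature_row(state, rd_spend, adm_spend, mark_spend):
    """Hypothetical helper: return a (1, 6) feature row, or None for an unknown state."""
    if state not in APPROVED_STATES:
        return None
    one_hot = [1.0 if state == s else 0.0 for s in APPROVED_STATES]
    return np.array([one_hot + [rd_spend, adm_spend, mark_spend]])

row = build_feature_row("California", 343443.0, 566566.0, 2332321.0)
print(row)
print(build_feature_row("Texas", 1.0, 2.0, 3.0))  # unknown state
```

The returned row can be passed directly to `modelProfitPredictor.predict`, and a `None` result takes the place of the `else` branch.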


In [19]:
# Testing with multiple data entries
modelProfitPredictor.predict(np.array([
    [1,0,0,878787,9889,8989],
    [0,1,0,2323,2332,2332]
]))

array([[758292.0250614 ],
       [ 52580.49311604]])
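
The first test row uses an R&D spend of 878787, well above the values visible in `data.head()`. A linear model will extrapolate for such inputs without warning. One simple guard, sketched here under assumptions, is to flag columns where a new row falls outside the range seen in training. `out_of_range_columns` is a hypothetical helper and `toy_train` is a made-up stand-in for `X_train`.

```python
import numpy as np

def out_of_range_columns(train_features, new_rows):
    """Hypothetical guard: indices of columns where any new row leaves the training min/max range."""
    lo = train_features.min(axis=0)
    hi = train_features.max(axis=0)
    outside = (new_rows < lo) | (new_rows > hi)
    return np.flatnonzero(outside.any(axis=0)).tolist()

# Toy stand-in for X_train (NOT the real split);
# columns: [California, Florida, New York, R&D, Administration, Marketing]
toy_train = np.array([
    [1, 0, 0, 1000.0, 50000.0, 2000.0],
    [0, 1, 0, 90000.0, 120000.0, 300000.0],
    [0, 0, 1, 160000.0, 180000.0, 470000.0],
])
new_rows = np.array([[1, 0, 0, 878787.0, 9889.0, 8989.0]])

print(out_of_range_columns(toy_train, new_rows))  # columns that would be extrapolated
```

A non-empty result does not make the prediction wrong, but it marks the prediction as an extrapolation that deserves a closer look.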