<a href="https://colab.research.google.com/github/drpetros11111/IBM_ML_Coursera/blob/IBM-Supervised-Learning/Enterpreneurs_choices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Goal: Optimizing R&D and Marketing Spend for Startup Profitability
As a business analyst, my objective is to maximize the profitability of a startup company by optimizing the allocation of resources towards Research and Development (R&D) and Marketing expenditures.

The dataset (50_Startups.csv) records R&D Spend, Administration expenses, Marketing Spend, State, and the resulting Profit for 50 startups. Using multiple linear regression, I aim to identify the balance between R&D and Marketing investment that is associated with the highest profitability across different US states.


---


**Data Description**:
The dataset comprises five columns: four input features and the target, Profit.

**R&D Spend**

The amount invested in research and development initiatives.

**Administration**

The administrative expenses incurred in managing the business operations.

**Marketing Spend**

The funds dedicated to marketing and promotional activities.

**State**

The geographic location of the startup's operation.

**Profit**

The resulting profitability achieved by the startup.

Since the "State" column contains categorical data with different US states, I intend to use one-hot encoding to represent these states in a binary vector format.

## Key Success Criteria

As a business analyst, I am tasked with exploring the potential trade-off between R&D and Marketing spend for startups in the context of different US states to maximize profit.

To analyze this, I will utilize multiple linear regression, taking into account features such as R&D Spend, Administration, Marketing Spend, and State.

Since the State feature is categorical, with three unique values in this dataset (California, Florida, and New York), I will use one-hot encoding to transform it into binary indicator columns.

This approach will enable me to effectively include the state information in the regression analysis without imposing any ordinal relationships among the states.
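
As a minimal sketch of what one-hot encoding produces, the snippet below maps each state to a binary vector in plain Python. The helper name `one_hot` is hypothetical (it is not part of scikit-learn), and the categories are sorted alphabetically so that the result lines up with the encoded rows printed later in this notebook.

```python
# Illustrative only: a hand-rolled one-hot encoder for the State column.
# `one_hot` is a hypothetical helper, not a scikit-learn function.
states = ["New York", "California", "Florida", "New York"]

# Sort the unique categories so each state gets a stable column position.
categories = sorted(set(states))  # California, Florida, New York

def one_hot(value, categories):
    """Return a list with 1.0 in the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]

encoded = [one_hot(state, categories) for state in states]
for state, row in zip(states, encoded):
    print(state, row)
```

No column is larger or smaller than another in any meaningful sense, which is why this representation avoids implying an order among the states.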

By performing multiple linear regression with the one-hot encoded data, I aim to identify patterns and relationships between R&D, marketing spend, and profitability for startups in different US states.

This analysis will help me gain insights into how resource allocation impacts profits in various regions and how startups can optimize their strategies to achieve better financial results within the Lean Startup approach.

Ultimately, this information will help startup companies, and especially entrepreneurs, make data-driven decisions to maximize their profitability while considering the unique economic and market conditions of each US state.








---



## Methodological Bias
In this report, I, as a business analyst, approach the challenge of optimizing profitability for startups with a theoretical bias towards the **Lean Startup approach**.

As a strong advocate of this methodology, I emphasize the significance of rapid experimentation, continuous learning, and data-driven decision-making.

Through the lens of the Lean Startup philosophy, I explore the delicate trade-off between R&D (Research & Development) and marketing spend for startup companies.

By applying multiple linear regression to the dataset's columns (R&D Spend, Administration, Marketing Spend, State, and Profit), I aim to uncover how spending decisions relate to profit.

Embracing the Lean Startup principles, I strive to help entrepreneurs allocate their resources efficiently and navigate the complexities of different US states' markets.

By adopting a lean and agile approach, startups can identify the most promising paths to profitability, adapt quickly to changing circumstances, and position themselves for sustainable growth.

The theoretical underpinning of the Lean Startup philosophy fuels my analysis, empowering startups to make informed decisions and embark on a journey of innovation, resilience, and success.

As a business analyst, my commitment to the Lean Startup mindset drives me to support startups in their pursuit of maximum profitability and a flourishing future.




---

## Data Assumptions

I start with the assumption that there is no multicollinearity between the features in the dataset.

Multicollinearity occurs when two or more independent variables are highly correlated with each other, which can lead to misleading results in the multiple linear regression model.

By ensuring that there is no multicollinearity, I can confidently interpret the coefficients of the independent variables and understand their individual impact on the dependent variable (profit).

This allows me to accurately assess the trade-off between R&D spend and marketing expenses for startups in different US states.

To check this assumption, I examined the correlation matrix of the independent variables and did not find any significant multicollinearity.

By starting with this foundational assumption, I can build a robust and reliable multiple linear regression model that helps startups make data-driven decisions and achieve a balanced approach to allocating resources between R&D and marketing for maximizing their profitability in the dynamic business environment.
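
One way to run such a correlation check is sketched below with `np.corrcoef`. To stay self-contained it uses only the first ten rows of numeric features printed further down in this notebook, so its numbers are illustrative and do not reproduce the full-dataset analysis. The 0.8 cutoff is an assumed rule of thumb, not a value taken from the original analysis.

```python
import numpy as np

# First ten rows of the numeric features as printed by print(X) in this notebook.
# Columns: R&D Spend, Administration, Marketing Spend.
sample = np.array([
    [165349.20, 136897.80, 471784.10],
    [162597.70, 151377.59, 443898.53],
    [153441.51, 101145.55, 407934.54],
    [144372.41, 118671.85, 383199.62],
    [142107.34,  91391.77, 366168.42],
    [131876.90,  99814.71, 362861.36],
    [134615.46, 147198.87, 127716.82],
    [130298.13, 145530.06, 323876.68],
    [120542.52, 148718.95, 311613.29],
    [123334.88, 108679.17, 304981.62],
])
names = ["R&D Spend", "Administration", "Marketing Spend"]

# rowvar=False treats each column as a variable and each row as an observation.
corr = np.corrcoef(sample, rowvar=False)

threshold = 0.8  # assumed rule-of-thumb cutoff for "high" correlation
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        flag = "high" if abs(corr[i, j]) > threshold else "ok"
        print(f"{names[i]} vs {names[j]}: {corr[i, j]:.2f} ({flag})")
```

Running the same loop on all 50 rows (for example on `dataset.iloc[:, :3].values`) would give the figures relevant to the assumption stated above.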

---



## The Choice of the Algorithm: Multiple Linear Regression

I plan to employ multiple linear regression to analyze the impact of R&D and Marketing Spend on the startup's Profit in different US states.

By examining the coefficients of the regression model, we can estimate how much each additional dollar of R&D or Marketing spend is associated with a change in Profit, and use those estimates to guide the allocation that maximizes the startup's profitability.

**Analysis**

Through the regression analysis, we will identify the significance of R&D and Marketing Spend as predictors of Profitability.

By exploring the relationships between these variables and Profit, we will be able to recommend the ideal trade-off between R&D and Marketing investments in each state to maximize overall profitability.


---


**Expected Outcomes**

I anticipate that the multiple linear regression will provide the following insights:

- The significance of R&D and Marketing Spend in predicting Profitability for the startup.
- The optimal levels of R&D and Marketing investments in each state to achieve the highest Profit.
- States where a higher proportion of R&D investment leads to substantial returns.
- States where increased Marketing Spend results in a significant boost in profitability.

## Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
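dataset = pd.read_csv('/content/50_Startups.csv')
dataset.head()
# Features: every column except the last (R&D Spend, Administration, Marketing Spend, State)
X = dataset.iloc[:, :-1].values
# Target: the last column (Profit)
y = dataset.iloc[:, -1].values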

In [23]:
dataset.shape


(50, 5)

In [5]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

In [6]:
print(y)

[192261.83 191792.06 191050.39 182901.99 166187.94 156991.12 156122.51
 155752.6  152211.77 149759.96 146121.95 144259.4  141585.52 134307.35
 132602.65 129917.04 126992.93 125370.37 124266.9  122776.86 118474.03
 111313.02 110352.25 108733.99 108552.04 107404.34 105733.54 105008.31
 103282.38 101004.64  99937.59  97483.56  97427.84  96778.92  96712.8
  96479.51  90708.19  89949.14  81229.06  81005.76  78239.91  77798.83
  71498.49  69758.98  65200.33  64926.08  49490.75  42559.73  35673.41
  14681.4 ]


## Encoding categorical data

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

## Splitting the dataset into the Training set and Test set

In [9]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the 50 rows (10 startups) for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training the Multiple Linear Regression model on the Training set

In [10]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predicting the Test set results

In [11]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


## Calculating the performance of the model


In [16]:
from sklearn.metrics import r2_score
# Calculate the R-squared value
r_squared = r2_score(y_test, y_pred)

print("R-squared:", r_squared)

R-squared: 0.9347068473282546


## Evaluating the Performance of the Model
The R-squared value of 0.9347 indicates that, on the held-out test set, approximately 93.47% of the variability in profit is explained by the independent variables (R&D Spend, Administration, Marketing Spend, State). In other words, the multiple linear regression model accounts for a large proportion of the fluctuations in profit based on the provided features. Because the test set contains only 10 startups, this figure is best read as an estimate rather than a precise measure.

**A high R-squared value like 0.9347 suggests that the model is a good fit for the data and is relatively accurate in predicting the profit based on the given independent variables. It indicates that the chosen features (R&D Spend, Administration, Marketing Spend, State) collectively have a strong association with the profit.**

In summary, an R-squared value of 0.9347 indicates a strong relationship between the independent variables and the profit.
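
For readers who want to see where the number comes from, R-squared is defined as $1 - SS_{res} / SS_{tot}$. The sketch below recomputes it in plain Python from the ten predicted and actual profits printed in the test-set output above. Since those printed values are rounded to two decimals, the result should only be expected to agree with `r2_score` to about four decimal places.

```python
# Predicted and actual profits copied from the test-set output printed above.
y_pred = [103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
          116161.24, 67851.69, 98791.73, 113969.44, 167921.07]
y_test = [103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
          105008.31, 81229.06, 97483.56, 110352.25, 166187.94]

mean_actual = sum(y_test) / len(y_test)

# Residual sum of squares: variation the model fails to explain.
ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y_test, y_pred))
# Total sum of squares: variation of the actual profits around their mean.
ss_tot = sum((actual - mean_actual) ** 2 for actual in y_test)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.4f}")
```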


## Making a single prediction (for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = 'California')

In [12]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[181566.92]


Therefore, our model predicts that a Californian startup that spends $160,000 on R&D, $130,000 on Administration, and $300,000 on Marketing will earn a profit of $181,566.92.

**Important note 1:** Notice that the values of the features were all input in a double pair of square brackets. That's because the "predict" method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array. Simply put:

$1, 0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[1, 0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[1, 0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$
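
The difference between the 1D and 2D forms can be checked directly with NumPy's `shape` attribute. This is a small sketch; the variable names `one_d` and `two_d` are illustrative.

```python
import numpy as np

features = [1, 0, 0, 160000, 130000, 300000]

one_d = np.array(features)    # a single list of six values
two_d = np.array([features])  # one row with six feature columns

print(one_d.shape)  # (6,)
print(two_d.shape)  # (1, 6)

# reshape(1, -1) is an equivalent way to turn a 1D array into a single-row 2D array.
print(one_d.reshape(1, -1).shape)  # (1, 6)
```

The single row in the 2D form corresponds to one startup, and each of the six columns to one feature.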

**Important note 2:** Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because of course the predict method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0".

Be careful to put these values in the first three columns, not the last three: the `ColumnTransformer` places the columns it transforms (here, the one-hot dummies) first and appends the passthrough columns after them.
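
To reduce the chance of putting values in the wrong column, the input row can be assembled by a small helper. The sketch below is one possible approach; `make_row` and `STATE_ORDER` are hypothetical names, and the dummy order (California, Florida, New York) is taken from the encoded matrix printed earlier.

```python
# Column order of the one-hot dummies, matching the encoded matrix printed
# earlier in this notebook (California, Florida, New York).
STATE_ORDER = ["California", "Florida", "New York"]

def make_row(state, rd_spend, administration, marketing_spend):
    """Hypothetical helper: dummies first, then the three numeric features."""
    if state not in STATE_ORDER:
        raise ValueError(f"Unknown state: {state!r}")
    dummies = [1 if state == name else 0 for name in STATE_ORDER]
    return dummies + [rd_spend, administration, marketing_spend]

row = make_row("California", 160000, 130000, 300000)
print([row])  # wrapped in a list to get the 2D shape that predict expects
```

With the trained model from this notebook, the call would then read `regressor.predict([row])`.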

In [13]:
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.52924853278


## Determine the equation of the multiple linear regression model


---


**$$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1, California} - 873 \times \textrm{Dummy State 2, Florida} + 786 \times \textrm{Dummy State 3, New York} + 0.773 \times \textrm{R&D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$$**



---



**Important Note:**
To get these coefficients we called the "coef_" and "intercept_" attributes from our regressor object.
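
As a sanity check, the equation can be evaluated by hand for the Californian startup used in the single prediction. The sketch below uses the rounded coefficients printed above, so its result comes out near 181,491, roughly 76 below the 181,566.92 returned by `regressor.predict`. That gap is due to the coefficients being rounded to three significant figures.

```python
# Rounded coefficients and intercept as printed from coef_ and intercept_.
coefficients = {
    "California": 86.6,
    "Florida": -873.0,
    "New York": 786.0,
    "R&D Spend": 0.773,
    "Administration": 0.0329,
    "Marketing Spend": 0.0366,
}
intercept = 42467.53

# The startup from the single-prediction example.
startup = {
    "California": 1, "Florida": 0, "New York": 0,
    "R&D Spend": 160000, "Administration": 130000, "Marketing Spend": 300000,
}

# Profit = intercept + sum of (coefficient * feature value).
profit = intercept + sum(coefficients[name] * startup[name] for name in coefficients)
print(f"Predicted profit from rounded coefficients: {profit:.2f}")
```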


# Conclusion and Recommendation
Based on the coefficients in the multiple linear regression equation, it appears **that R&D Spend has a significantly larger impact on the profit compared to Marketing Spend. The weight of R&D Spend (0.773) is approximately 21 times larger than the weight of Marketing Spend (0.0366).**

**Additionally, the one-hot encoded states in the equation (Dummy State 1, Dummy State 2, Dummy State 3) indicate that being in certain states can also influence the profit.**

However, since these are binary variables (0 or 1), their coefficients are not per-dollar rates like those of the continuous variables (R&D Spend and Marketing Spend). Each one is a fixed offset added to the predicted profit when the startup is located in that state.
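
To make the coefficients concrete, the sketch below converts them into dollar terms. The extra budget of $1,000 is a hypothetical figure chosen for illustration, and the results describe associations within this fitted model rather than guaranteed returns.

```python
# Rounded coefficients taken from the regression equation in this notebook.
rd_coef = 0.773
marketing_coef = 0.0366
state_offsets = {"California": 86.6, "Florida": -873.0, "New York": 786.0}

extra_spend = 1000  # hypothetical additional budget in dollars

# For continuous features, the coefficient is a per-dollar rate.
rd_gain = rd_coef * extra_spend
marketing_gain = marketing_coef * extra_spend
ratio = rd_coef / marketing_coef

# For dummy features, the coefficient is a fixed offset, so two otherwise
# identical startups differ by the difference between their state offsets.
ny_vs_florida = state_offsets["New York"] - state_offsets["Florida"]

print(f"Predicted gain from ${extra_spend} more R&D: {rd_gain:.2f}")
print(f"Predicted gain from ${extra_spend} more Marketing: {marketing_gain:.2f}")
print(f"R&D to Marketing coefficient ratio: {ratio:.1f}")
print(f"New York vs Florida offset difference: {ny_vs_florida:.2f}")
```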



---


**Therefore, entrepreneurs looking to maximize profit may consider allocating a larger portion of their budget towards R&D initiatives.**

Alongside this, entrepreneurs should also take into account the potential impact of the geographic locations (states) in which they operate. The coefficients for the states in the equation can be used to identify which states are associated with higher or lower profits.


---


Ultimately, while the data suggests that investing more in R&D could potentially lead to higher profits, entrepreneurs should approach resource allocation decisions with a comprehensive understanding of their unique business context, including the influence of the states they operate in.

Conducting thorough market research, considering customer needs, and continuously evaluating the impact of investments, **as per the Lean Startup guidelines**, are crucial for making informed decisions that align with their business goals and objectives.


