# <em>Coursework 2</em>
# Business start-ups: A big data analysis and its impact on decision making

<p>As explained in the topic brief for this project, the aim is to identify the key factors in budget planning for a start-up business using big-data algorithms, and to predict the profitability of a start-up from its budget features. This is done by loading a sampled dataset and testing a model against its features and values. The same process can then be replicated on the full data using a cluster-computing framework (PySpark in this project) and a distributed file system (such as the Hadoop Distributed File System) for storage. The detailed breakdown of the procedure is given in the topic brief.</p>

<p><strong>The overall objective of this project is to analyze start-up budgets, with the following aims:</strong></p>

* 1.) - Initialize and set up the data in a form viable for meaningful analysis.

* 2.) - Ensure data scalability by using the PySpark context for the analysis.

* 3.) - Ensure the datasets are fit for cross-analysis.

* 4.) - Set up machine-learning algorithms in the PySpark context using the MLlib functions.

* 5.) - Explore the results further and discuss them with reference to the initial aim of the project.

* 6.) - Cross-analyse and conclude on the impact of business budgeting features on a start-up's profitability.


In [1]:
#initiating spark & other useful libs

import os
import pandas as pd
import numpy as np
import findspark
findspark.init()

In [2]:
#Importing all important functions and libraries.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col

from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.evaluation import RegressionEvaluator

In [3]:
#importing visualisation

import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# Visualization
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 400)

from matplotlib import rcParams
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (18,4)})
rcParams['figure.figsize'] = 18,4

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [5]:
#random seed for reproducibility
rnd_seed=23
np.random.seed(rnd_seed)  # call the function; assigning to it would overwrite it without seeding

In [6]:
spark = SparkSession.builder.master("local[2]").appName("Startup-profits").getOrCreate()

In [7]:
spark

In [8]:
sc = spark.sparkContext
sc

In [9]:
sqlContext = SQLContext(spark.sparkContext)
sqlContext



<pyspark.sql.context.SQLContext at 0x14f94f23f40>

In [10]:
startup_data = '../CW2/Startups.csv'

In [11]:
# define the schema, corresponding to a line in the csv data file.
schema = StructType([
    StructField("R&D Spend", DoubleType(), nullable=True),
    StructField("Administration", DoubleType(), nullable=True),
    StructField("Marketing Spend", DoubleType(), nullable=True),
    StructField("State", StringType(), nullable=True),
    StructField("Profit", DoubleType(), nullable=True),
    ])

In [12]:
startup_df = spark.read.option("header","True").csv(path=startup_data, schema=schema).cache()

In [13]:
startup_df.take(5)

[Row(R&D Spend=165349.2, Administration=136897.8, Marketing Spend=471784.1, State='New York', Profit=192261.83),
 Row(R&D Spend=162597.7, Administration=151377.59, Marketing Spend=443898.53, State='California', Profit=191792.06),
 Row(R&D Spend=153441.51, Administration=101145.55, Marketing Spend=407934.54, State='Florida', Profit=191050.39),
 Row(R&D Spend=144372.41, Administration=118671.85, Marketing Spend=383199.62, State='New York', Profit=182901.99),
 Row(R&D Spend=142107.34, Administration=91391.77, Marketing Spend=366168.42, State='Florida', Profit=166187.94)]

In [14]:
startup_df.columns

['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit']

In [15]:
startup_df.printSchema()

root
 |-- R&D Spend: double (nullable = true)
 |-- Administration: double (nullable = true)
 |-- Marketing Spend: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Profit: double (nullable = true)



### Data exploration

Before analysing a dataset, it is important to understand the dataset itself. By taking a closer look with Spark operations such as .select and .show, we can break the data into bite-sized pieces and better understand what we are dealing with.

In [16]:
startup_df.select('Marketing Spend','Profit').show(10)

+---------------+---------+
|Marketing Spend|   Profit|
+---------------+---------+
|       471784.1|192261.83|
|      443898.53|191792.06|
|      407934.54|191050.39|
|      383199.62|182901.99|
|      366168.42|166187.94|
|      362861.36|156991.12|
|      127716.82|156122.51|
|      323876.68| 155752.6|
|      311613.29|152211.77|
|      304981.62|149759.96|
+---------------+---------+
only showing top 10 rows



In [17]:
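#na.drop("all") returns a new DataFrame without the rows in which every column is NULL.
#The result is not assigned here, so startup_df itself is unchanged; the filters below confirm there are no NULLs in the feature columns.

startup_df.na.drop("all")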

DataFrame[R&D Spend: double, Administration: double, Marketing Spend: double, State: string, Profit: double]

In [18]:
startup_df.filter(col("Marketing Spend").isNull()).show()

+---------+--------------+---------------+-----+------+
|R&D Spend|Administration|Marketing Spend|State|Profit|
+---------+--------------+---------------+-----+------+
+---------+--------------+---------------+-----+------+



In [19]:
startup_df.filter(col("Administration").isNull()).show()

+---------+--------------+---------------+-----+------+
|R&D Spend|Administration|Marketing Spend|State|Profit|
+---------+--------------+---------------+-----+------+
+---------+--------------+---------------+-----+------+



In [20]:
startup_df.filter(col("R&D Spend").isNull()).show()

+---------+--------------+---------------+-----+------+
|R&D Spend|Administration|Marketing Spend|State|Profit|
+---------+--------------+---------------+-----+------+
+---------+--------------+---------------+-----+------+



In [51]:
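#Brief look at another field alongside the target variable.
#Note: this cell was executed after "Profit" was divided by 10,000 in the pre-processing step further down, hence the small Profit values in the output.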

group_df = startup_df.select("R&D Spend","Profit").sort("R&D Spend", ascending=False)

In [52]:
group_df.show(10)

+---------+------------------+
|R&D Spend|            Profit|
+---------+------------------+
| 165349.2|         19.226183|
| 162597.7|         19.179206|
|153441.51|         19.105039|
|144372.41|18.290198999999998|
|142107.34|         16.618794|
|134615.46|         15.612251|
| 131876.9|         15.699112|
|130298.13|          15.57526|
|123334.88|14.975995999999999|
|120542.52|15.221176999999999|
+---------+------------------+
only showing top 10 rows



#### Correlations of features.

The lines of code below measure how strongly each feature is related to the target variable, which is "Profit" in our case. The closer the correlation coefficient is to 1 (or -1), the stronger the linear relationship between the feature and the target. In statistical terms, the features with the highest correlation are the ones most likely to matter when predicting the target variable.

In this case, as we can see below, R&D spend appears to drive profit the most, as it has the highest correlation value (0.97), followed by marketing spend (0.75) and administration (0.20). We will keep this in mind as we proceed to train a machine-learning model to predict profit from these features.

In [23]:
startup_df.stat.corr("Marketing Spend","Profit")

0.7477657217414766

In [24]:
startup_df.stat.corr("Administration","Profit")

0.20071656826872125

In [25]:
startup_df.stat.corr("R&D Spend","Profit")

0.9729004656594831
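
To make explicit what `stat.corr` computes, here is a minimal plain-Python sketch of the Pearson correlation coefficient, applied only to the five rows printed by `startup_df.take(5)` earlier. The helper name `pearson` is hypothetical and not part of PySpark; because it sees just five rows, its result will differ from the full-dataset values reported above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the dataset, copied from the take(5) output.
rd_spend = [165349.2, 162597.7, 153441.51, 144372.41, 142107.34]
profit = [192261.83, 191792.06, 191050.39, 182901.99, 166187.94]

r = pearson(rd_spend, profit)
print(f"Pearson r on five rows: {r:.4f}")
```

The same function applied to a column and itself returns 1, which is a quick way to check the implementation.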

### Summary Stats

With the built-in statistical functions of Spark DataFrames, we can use the describe function to get an immediate overview of the dataset. This step is important because it highlights points to note about the data.
Some examples are as follows:

* The count, to understand the size of the sample data.

* Statistics such as the mean and standard deviation, to help identify what kind of data pre-processing is needed.

* The min and max, to see the scale of each column and judge whether standard scaling is required.


In [26]:
(startup_df.describe().select(
                    "summary",
                    F.round("R&D Spend",4).alias("R&D Spend"),
                    F.round("Administration",4).alias("Administration"),
                    F.round("Marketing Spend",4).alias("Marketing Spend"),
                    F.round("State",4).alias("State"),
                    F.round("Profit",4).alias("Profit"),
                    )
                    .show())


+-------+----------+--------------+---------------+-----+-----------+
|summary| R&D Spend|Administration|Marketing Spend|State|     Profit|
+-------+----------+--------------+---------------+-----+-----------+
|  count|      50.0|          50.0|           50.0| 50.0|       50.0|
|   mean|73721.6156|   121344.6396|    211025.0978| null|112012.6392|
| stddev|45902.2565|    28017.8028|    122290.3107| null| 40306.1803|
|    min|       0.0|      51283.14|            0.0| null|    14681.4|
|    max|  165349.2|     182645.56|       471784.1| null|  192261.83|
+-------+----------+--------------+---------------+-----+-----------+



### Data-Preprocessing

<p>With the data exploration above, we know enough about the data to move on to the next step of the analysis.
For example, the gap between the min and max values of some features (0 to 471,784.1 for marketing spend, against 51,283.14 to 182,645.56 for administration) shows that the features sit on quite different scales, so standardizing them with a scaler is worthwhile.</p>

In [27]:
startup_df.columns

['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit']

In [28]:
startup_df = startup_df.select(
    "R&D Spend",
    "Administration",
    "Marketing Spend",
    "Profit",
)

#### Scaling the profit values to a comparable level

Because the features will later be passed through a standard scaler, the "Profit" column is scaled down to a comparable order of magnitude as well. This is done with the "withColumn" operation below, which divides every profit value by 10,000.

In [29]:
#Factor scaling the column "Profit" downwards in preparation for the standard scaler.

startup_df = startup_df.withColumn("Profit", col("Profit")/10000)

In [30]:
startup_df.show(10)

+---------+--------------+---------------+------------------+
|R&D Spend|Administration|Marketing Spend|            Profit|
+---------+--------------+---------------+------------------+
| 165349.2|      136897.8|       471784.1|         19.226183|
| 162597.7|     151377.59|      443898.53|         19.179206|
|153441.51|     101145.55|      407934.54|         19.105039|
|144372.41|     118671.85|      383199.62|18.290198999999998|
|142107.34|      91391.77|      366168.42|         16.618794|
| 131876.9|      99814.71|      362861.36|         15.699112|
|134615.46|     147198.87|      127716.82|         15.612251|
|130298.13|     145530.06|      323876.68|          15.57526|
|120542.52|     148718.95|      311613.29|15.221176999999999|
|123334.88|     108679.17|      304981.62|14.975995999999999|
+---------+--------------+---------------+------------------+
only showing top 10 rows



#### Vector-Assembler function

<p>
    The VectorAssembler in PySpark ML is a transformer that combines a list of columns into a single vector column, which is the input format the algorithms expect. The lines below build a vector assembler for this purpose.
</p>

In [53]:
#Set-up features list for vector assembler.
features = ["R&D Spend", "Administration", "Marketing Spend"]

In [34]:
#Importing the vector assembler function
from pyspark.ml.feature import VectorAssembler

#Execution of the transformation.
assembler = VectorAssembler(inputCols=features, outputCol="features")
assemble_df = assembler.transform(startup_df)
assemble_df.show(10, truncate=False)

#### Standard scaling

<p>
As noted in the data pre-processing stage, the features differ starkly in scale. Scaling the features is therefore required to obtain more accurate results. It also helps at the evaluation stage, since the coefficients and metrics of the model become easier to interpret and compare.
</p>

In [37]:
# Initialize the `standardScaler`

standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")
scaled_df = standardScaler.fit(assemble_df).transform(assemble_df)
scaled_df.select("features", "features_scaled").show(10, truncate=False)

+-------------------------------+----------------------------------------------------------+
|features                       |features_scaled                                           |
+-------------------------------+----------------------------------------------------------+
|[165349.2,136897.8,471784.1]   |[3.6022019977107624,4.886100498126383,3.8579025370019875] |
|[162597.7,151377.59,443898.53] |[3.542259410769301,5.402907262966764,3.6298749047677807]  |
|[153441.51,101145.55,407934.54]|[3.3427879533360665,3.6100457585020873,3.33578790975944]  |
|[144372.41,118671.85,383199.62]|[3.1452137882512723,4.23558731695162,3.1335239703419373]  |
|[142107.34,91391.77,366168.42] |[3.095868283834228,3.261917816952879,2.9942553733540604]  |
|[131876.9,99814.71,362861.36]  |[2.87299383747791,3.5625459594773656,2.967212674873934]   |
|[134615.46,147198.87,127716.82]|[2.9326545210666466,5.2537620913604215,1.0443739920353954]|
|[130298.13,145530.06,323876.68]|[2.8385996677575496,5.194199604802725
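
As a sanity check on the output above, the following plain-Python sketch reproduces the first scaled row by hand. It assumes `StandardScaler` was left at its defaults (`withStd=True`, `withMean=False`), so each feature is divided by its sample standard deviation without being centred. The standard deviations are the rounded values from the `describe()` table, so agreement with Spark is approximate.

```python
# Sample standard deviations, rounded, from the describe() output.
stddev = {
    "R&D Spend": 45902.2565,
    "Administration": 28017.8028,
    "Marketing Spend": 122290.3107,
}

# First row of the dataset.
row = {
    "R&D Spend": 165349.2,
    "Administration": 136897.8,
    "Marketing Spend": 471784.1,
}

# Divide each feature by its standard deviation (no mean subtraction).
scaled = [row[name] / stddev[name] for name in row]
print([round(v, 4) for v in scaled])
```

The rounded results (3.6022, 4.8861 and 3.8579) agree with the first row of `features_scaled`.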

### Train-Test split

<p>
As with other machine-learning workflows, the data needs to be split into a training set and a test set.
This is done with the randomSplit operation.
</p>

In [38]:
train_set,test_set = scaled_df.randomSplit([.7,.3], seed=rnd_seed)

In [39]:
train_set.columns

['R&D Spend',
 'Administration',
 'Marketing Spend',
 'Profit',
 'features',
 'features_scaled']
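
One detail worth keeping in mind is that the weights passed to `randomSplit` act as sampling probabilities rather than exact proportions, which is why the test set shown later holds 17 of the 50 rows rather than exactly 15. The plain-Python sketch below imitates that idea with an independent random draw per row. It is an illustration under that assumption, not Spark's actual sampling code, and `bernoulli_split` is a hypothetical helper.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(50))  # stand-in for the 50 rows of the dataset
train, test = bernoulli_split(rows, train_fraction=0.7, seed=23)

# Sizes are usually close to 35/15 but are not guaranteed to match exactly.
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is the role `seed=rnd_seed` plays in the cell above.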

### Linear Regression

<p>
Using PySpark's ML library, we can implement a linear regression model by configuring the estimator, fitting it to the training set and then applying the fitted model to the test set.
</p>

In [40]:
#Initialize the algorithm, naming the features and target columns.

lr = LinearRegression(
    featuresCol="features_scaled", labelCol="Profit", predictionCol="Profit_predict",
    maxIter=10, regParam=0.3, elasticNetParam=0.8, standardization=False,
)

In [41]:
#Fitting the model to the training set

linearModel = lr.fit(train_set)

In [42]:
#Obtaining coefficients via the in-built MLLIB functions

linearModel.coefficients

DenseVector([3.188, 0.0, 0.3544])

In [43]:
linearModel.intercept

5.593056147001444

In [44]:
#Showcase of coefficients in a table format

coeff_df = pd.DataFrame({
    "Feature": ["Intercept"] + features,
    "Co-efficients": np.insert(linearModel.coefficients.toArray(), 0, linearModel.intercept),
})
coeff_df = coeff_df[["Feature", "Co-efficients"]]

In [45]:
coeff_df

           Feature  Co-efficients
0        Intercept       5.593056
1        R&D Spend       3.188018
2   Administration       0.000000
3  Marketing Spend       0.354440


In [46]:
predict = linearModel.transform(test_set)

In [47]:
predictlabels = predict.select("Profit_predict","Profit")

In [48]:
predictlabels.show()

+------------------+------------------+
|    Profit_predict|            Profit|
+------------------+------------------+
|  5.72398352627413|           1.46814|
|  6.54555985016247|          4.949075|
| 7.215475446183338|          6.520033|
| 7.663932565489306| 7.149849000000001|
| 8.090933357647366|          7.823991|
| 8.778243227723763|          8.100576|
|10.162846481756457| 9.993758999999999|
|11.166673275404722|10.873399000000001|
|11.754697527969716|         11.847403|
|11.351375964084175|10.855203999999999|
| 11.77741471480051|12.699292999999999|
| 13.33534411414415|14.612195000000002|
|14.666856035043338|13.260264999999999|
|14.868177624299463|15.221176999999999|
|15.803911420567498|         15.699112|
| 16.52402416087771|         16.618794|
| 16.73070090516566|18.290198999999998|
+------------------+------------------+
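
To show how the fitted model turns scaled features into a prediction, the sketch below recomputes one test-set row by hand as the intercept plus the dot product of coefficients and scaled features. The coefficients are the rounded values from the coefficient table and the scaled features are those printed for the row with R&D spend 144372.41, so the result should agree with the last prediction in the table (about 16.73) only up to rounding.

```python
# Rounded model parameters from the coefficient table.
intercept = 5.593056
coefficients = [3.188018, 0.0, 0.35444]  # R&D Spend, Administration, Marketing Spend

# Scaled features of the row [144372.41, 118671.85, 383199.62] from the scaler output.
features_scaled = [3.1452137882512723, 4.23558731695162, 3.1335239703419373]

# Linear regression prediction: intercept + sum(coefficient * feature).
prediction = intercept + sum(c * x for c, x in zip(coefficients, features_scaled))

print(f"Predicted profit (in units of 10,000): {prediction:.4f}")
print(f"Predicted profit (original units):     {prediction * 10000:,.0f}")
```

Because the administration coefficient is 0.0, that feature does not contribute to the prediction at all.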



### Evaluation of predictions

<p>
With the built-in regression evaluator from the PySpark ML package, we can score the model by comparing the predicted values with the actual values. The metrics used below are:
    
<b>* RMSE: Root-mean-square error. The standard deviation of the residuals, i.e. the typical size of the prediction errors. The smaller the value, the better the predictor.</b>

<b>* MAE: Mean absolute error. The mean of the absolute differences between predicted and actual values, regardless of direction.</b>

<b>* R2: The coefficient of determination. A statistical measure of how much of the variance in the actual data is explained by the regression model; the closer to 1, the better the fit.</b>
</p>

In [56]:
evaluator = RegressionEvaluator(predictionCol="Profit_predict", labelCol='Profit', metricName='rmse')
print("RMSE: {0}".format(evaluator.evaluate(predictlabels)))

RMSE: 1.3138196604670513


In [57]:
evaluator = RegressionEvaluator(predictionCol="Profit_predict", labelCol='Profit', metricName='mae')
print("MAE: {0}".format(evaluator.evaluate(predictlabels)))

MAE: 0.8691229472399623


In [58]:
evaluator = RegressionEvaluator(predictionCol="Profit_predict", labelCol='Profit', metricName='r2')
print("R2: {0}".format(evaluator.evaluate(predictlabels)))

R2: 0.910598061955934


###  Conclusions & findings

<p>As the evaluators above show, the model obtained fairly good results on the test set. The RMSE of about 1.31 is small relative to the range of the scaled profit values (roughly 1.5 to 19), indicating that the predictions deviate little from the actual data; because "Profit" was divided by 10,000, this corresponds to an error of roughly 13,000 in the original units. The MAE of about 0.87 is likewise small. It measures error in the same units as the RMSE but weights large errors less heavily, so the two metrics tell a consistent story. Finally, the R2 value indicates that the model gives strong predictions. R2 is at most 1 (a perfect fit), and our score of 0.91 means that the model explains about 91% of the variance of profit around its mean on the test set.</p>

* <strong>Feature budget analysis.</strong>

<p>From the correlation analysis above, R&D spend is the most impactful feature of the dataset. Both the correlation values computed with the DataFrame stat operation and the coefficients of the fitted regression model indicate that R&D spend is most strongly associated with a start-up's profit: it has the highest correlation with profit (0.97) and the largest coefficient (3.19). Marketing spend follows (correlation 0.75, coefficient 0.35), and administration comes last (correlation 0.20, coefficient 0.0). On this basis, and bearing in mind that correlation alone does not establish causation, a start-up aiming to maximize its projected profitability would be wise to concentrate its budget first on R&D and then on marketing.</p>

* <strong>Profitability predictions</strong>

<p>The evaluation metrics above suggest that the model gives strong and relevant results: its predictions explain most of the variation in the actual data and the error rates are low. On the evidence of this sample, the model can give a useful estimate of a start-up's profitability at the planning stage, providing a solid basis before the management team moves on to execution.</p>

<p>As shown above, we have achieved the aims of this project: identifying the most impactful features among the budget inputs of a start-up business, and obtaining a good predictor for the sampled dataset. This can be valuable at the planning stage of a start-up, since the analysis gives a yardstick for the profitability of a plan given the budgets allocated by the management team. It is also important to note that the project was done in a PySpark context, so the analysis can be scaled up to very large datasets. To make the analysis more reliable, we could collect up to millions of records from different start-ups to get a better picture of the population. That would be a problem if the code were executed with standard scikit-learn on a single machine. With PySpark paired with a distributed file system such as Hadoop, we can avoid the issue and scale up using the code showcased in this project.</p>
