Here, we're going to look at a data set of 50 startup companies and try to predict the profit of each.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('50_Startups.csv')
X = data.iloc[:, :-1].values #All columns except the last (Profit) form our matrix of independent variables
y = data.iloc[:, 4].values #Dependent variable: Profit

data.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


The first few rows above show that the State column is categorical, while every other column is numeric.
Linear regression needs numeric inputs, so we have to encode State as numbers. We do this before splitting the data, so that the training and test sets share the same encoding.

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#One-hot encode the State column (index 3); the other columns pass through unchanged
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X), dtype = float) #encoded columns come first, then the rest
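
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy State column. It uses `pd.get_dummies` rather than the encoder in the cell above, and the four rows are made up for illustration; they are not taken from `50_Startups.csv`.

```python
import pandas as pd

# Hypothetical toy column with the same three states as the data set
state = pd.Series(["New York", "California", "Florida", "New York"], name="State")

# Each distinct value becomes its own 0/1 column, ordered alphabetically
encoded = pd.get_dummies(state, dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column of the state it belongs to.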



Let's look at our encoded data now. The State column should now be split into three columns of zeros and ones, one per state.

In [5]:
df = pd.DataFrame(X)
df.head()

,0,1,2,3,4,5
0,0.0,0.0,1.0,165349.2,136897.8,471784.1
1,1.0,0.0,0.0,162597.7,151377.59,443898.53
2,0.0,1.0,0.0,153441.51,101145.55,407934.54
3,0.0,0.0,1.0,144372.41,118671.85,383199.62
4,0.0,1.0,0.0,142107.34,91391.77,366168.42


In the first observation (first row), the column marked '2' is set to 1. Our original data dataframe says that the first observation is from New York, so a 1 in column '2' of the encoded X dataframe means the observation is from New York. By the same reasoning, the second row (California) shows that column '0' is California, and the third row (Florida) shows that column '1' is Florida.
Using this mapping, let's rename the columns to make the X dataframe easier to read.

In [6]:
X_temp = df.rename(columns={0: "California", 1: "Florida", 2: "New York", 3: "R&D Spend",
                   4: "Administration", 5: "Marketing Spend"})
X_temp.head()

,California,Florida,New York,R&D Spend,Administration,Marketing Spend
0,0.0,0.0,1.0,165349.2,136897.8,471784.1
1,1.0,0.0,0.0,162597.7,151377.59,443898.53
2,0.0,1.0,0.0,153441.51,101145.55,407934.54
3,0.0,0.0,1.0,144372.41,118671.85,383199.62
4,0.0,1.0,0.0,142107.34,91391.77,366168.42
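
The column-to-state mapping was inferred above by comparing rows. As a hedged alternative, a fitted `OneHotEncoder` also exposes the order of its categories through its `categories_` attribute. The sketch below fits a fresh encoder on a toy array (hypothetical rows, not the startup file) to show the idea.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy State column (made-up rows, same three states as the data set)
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()

# categories_ holds one array per encoded input column, in output-column order
state_order = list(encoder.categories_[0])
print(state_order)
print(encoded[0])  # one-hot row for "New York"
```

The categories are sorted, which is why California, Florida, and New York land in columns 0, 1, and 2 respectively.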


Now we split the data into training and test sets, fit the model, and make predictions.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

#R^2 (note: scored on the full data set, training and test rows together)
r_sq = regressor.score(X, y)
print('R squared: ', r_sq)

#RMSE on the test set
mse = np.mean((y_test-y_pred)**2)
print("Final rmse value is =", np.sqrt(mse))

R squared:  0.948522354717154
Final rmse value is = 9137.990152795797
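
With 50 rows and `test_size = 0.2`, the split holds out 10 rows for testing and trains on the other 40. The sketch below illustrates this on synthetic stand-in data (the names `X_demo`, `y_demo`, and the coefficients are assumptions for illustration, not the startup data), and scores the model separately on the training and test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the startup file): 50 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # 40 training rows, 10 test rows

model = LinearRegression().fit(X_tr, y_tr)
r2_train = model.score(X_tr, y_tr)  # fit quality on rows the model has seen
r2_test = model.score(X_te, y_te)   # fit quality on held-out rows
print("train R^2:", r2_train)
print("test R^2: ", r2_test)
```

Scoring on held-out rows only is the usual way to estimate how the model behaves on unseen data.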


Let's compare our predicted Profit results (y_pred) with the actual results (y_test) to see how accurate our predictions were.

In [35]:
d = {'y_test': y_test, 'y_pred': y_pred}
dfnew = pd.DataFrame(data=d)
dfnew

,y_test,y_pred
0,103282.38,103015.201598
1,144259.4,132582.277608
2,146121.95,132447.738452
3,77798.83,71976.098513
4,191050.39,178537.482211
5,105008.31,116161.242302
6,81229.06,67851.692097
7,97483.56,98791.733747
8,110352.25,113969.43533
9,166187.94,167921.065696


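The predictions track the actual profits closely, especially at indexes 0, 7, and 9, where the error is under 2,000. The largest misses, at indexes 2 and 6, are a little over 13,000, which is consistent with the RMSE of about 9,138. This suggests that profit is well described by a linear combination of the independent variables.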
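
As a cross-check, both metrics can be recomputed by hand from the ten rows of the comparison table. The sketch below copies the `y_test` and `y_pred` values from that table and applies the standard definitions of RMSE and R^2. Note that this R^2 uses the test rows only, unlike the value printed earlier, which was scored on the full data set.

```python
import numpy as np

# Values copied from the y_test / y_pred comparison table
y_true = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_hat = np.array([103015.201598, 132582.277608, 132447.738452, 71976.098513,
                  178537.482211, 116161.242302, 67851.692097, 98791.733747,
                  113969.435330, 167921.065696])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_test_only = 1 - ss_res / ss_tot

print("RMSE:", rmse)
print("Test-set R^2:", r2_test_only)
```

The RMSE reproduces the 9137.99 printed earlier, up to rounding of the table values. The test-set R^2 comes out at about 0.93, somewhat lower than the 0.9485 obtained when the training rows are included.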