## <span style="color:lightblue"> **Problem Statement 4:** </span>
I have been provided with the '50_Startups' data. Using the given features, I must predict the profit of these startups.

**Dataset Description:**
- R&D Spend: Expenditures in Research and Development
- Administration: Expenditures in Administration
- Marketing Spend: Expenditures in Marketing
- State: The state in which the company is located
- Profit: The profit made by the company

I will write Python code to perform the following tasks:
1. Load the data, check its shape and check for null values
2. Convert categorical features to numerical values using Label Encoder
3. Split the dataset for training and testing
4. Train the model using sklearn (linear regression), also find the intercept and coefficient from the trained model
5. Predict the profits of test data and evaluate the model using r2 score and mean squared error
6. Regularize the model using Ridge Regression and find the score
7. Regularize the model using Lasso Regression and find the score

In [43]:
# Importing useful libraries

# DataFrame manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#Needed libraries
import datetime
import math

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

In [3]:
#Task1: Loading the data, checking its shape and checking for null values
startups_df = pd.read_csv("./../Assignment_files/50_Startups_ass4.csv")
startups_df.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [4]:
startups_df.isna().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

In [6]:
startups_df.duplicated().sum(), startups_df.shape

(0, (50, 5))

In [8]:
startups_df.dtypes

R&D Spend          float64
Administration     float64
Marketing Spend    float64
State               object
Profit             float64
dtype: object

In [31]:
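#Task 2: Converting the categorical features to numerical values using LabelEncoder
cat_column = "State"

# The output below also contains one-hot columns, so the working copy is assumed
# to come from pd.get_dummies; that step is not shown in the notebook
transformed_startups = pd.get_dummies(startups_df, columns=[cat_column], dtype=float)

encoder = LabelEncoder()

# LabelEncoder assigns integer codes in alphabetical order of the class names
transformed_startups[cat_column] = encoder.fit_transform(startups_df[cat_column])
transformed_startups.head()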

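|   | R&D Spend | Administration | Marketing Spend | Profit | State_California | State_Florida | State_New York | State |
|---|-----------|----------------|-----------------|--------|------------------|---------------|----------------|-------|
| 0 | 165349.2 | 136897.8 | 471784.1 | 192261.83 | 0.0 | 0.0 | 1.0 | 2 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 191792.06 | 1.0 | 0.0 | 0.0 | 0 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 191050.39 | 0.0 | 1.0 | 0.0 | 1 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 182901.99 | 0.0 | 0.0 | 1.0 | 2 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 166187.94 | 0.0 | 1.0 | 0.0 | 1 |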
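
To make the difference between the two encodings in the table above concrete, here is a minimal sketch on a toy column. The variable names are illustrative and not part of the notebook. Label encoding yields a single integer column whose values a linear model will treat as ordered, whereas one-hot encoding yields one indicator column per state with no implied order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column using the three state names that appear in the dataset
states = pd.Series(["New York", "California", "Florida", "New York"], name="State")

# Label encoding: one integer per class, assigned in sorted (alphabetical) order
encoder = LabelEncoder()
codes = encoder.fit_transform(states.tolist())
print(encoder.classes_.tolist())  # ['California', 'Florida', 'New York']
print(codes.tolist())             # [2, 0, 1, 2]

# One-hot encoding: one 0/1 column per class, no implied ordering
one_hot = pd.get_dummies(states, prefix="State", dtype=float)
print(one_hot.columns.tolist())
```

Keeping both the integer `State` column and the three indicator columns, as the table does, gives the model the same information twice.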


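After one-hot encoding the State column, I normalize the other columns so that their values lie between 0 and 1. This puts the three spend features on a comparable scale, which matters most for the regularized models in Tasks 6 and 7. Note that sklearn's `normalize` rescales each row to unit L2 norm; it does not scale each column independently.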

In [32]:
columns = ["R&D Spend", "Administration", "Marketing Spend",]

norm_data = normalize(transformed_startups[columns])
scaled_df = pd.DataFrame(norm_data, columns=columns)
# scaled_df.head()
scaled_startup = transformed_startups.drop(columns, axis=1)
scaled_startup = pd.concat([scaled_startup, scaled_df], axis=1)
scaled_startup.head()

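|   | Profit | State_California | State_Florida | State_New York | State | R&D Spend | Administration | Marketing Spend |
|---|--------|------------------|---------------|----------------|-------|-----------|----------------|-----------------|
| 0 | 192261.83 | 0.0 | 0.0 | 1.0 | 2 | 0.319006 | 0.264115 | 0.910208 |
| 1 | 191792.06 | 1.0 | 0.0 | 0.0 | 0 | 0.327563 | 0.304959 | 0.894261 |
| 2 | 191050.39 | 0.0 | 1.0 | 0.0 | 1 | 0.342947 | 0.226064 | 0.911747 |
| 3 | 182901.99 | 0.0 | 0.0 | 1.0 | 2 | 0.33863 | 0.278348 | 0.898806 |
| 4 | 166187.94 | 0.0 | 1.0 | 0.0 | 1 | 0.352388 | 0.226627 | 0.907999 |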
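
The values in the table above come from row-wise scaling: `normalize` divides each startup's three spend figures by the L2 norm of that row. The sketch below reproduces the first row and contrasts it with column-wise scaling. `MinMaxScaler` is shown only for comparison and is not what the notebook uses; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# First three rows of the raw spend columns from the dataset
# (R&D Spend, Administration, Marketing Spend)
spend = np.array([
    [165349.20, 136897.80, 471784.10],
    [162597.70, 151377.59, 443898.53],
    [153441.51, 101145.55, 407934.54],
])

# normalize() works per ROW: each row is divided by its own L2 norm,
# so every row ends up with length 1
row_normalized = normalize(spend)
row_norms = np.linalg.norm(row_normalized, axis=1)
print(row_normalized[0])  # approximately [0.319006, 0.264115, 0.910208]
print(row_norms)          # each entry is 1.0 up to rounding

# MinMaxScaler works per COLUMN: each feature is mapped onto [0, 1]
col_scaled = MinMaxScaler().fit_transform(spend)
print(col_scaled.min(axis=0), col_scaled.max(axis=0))
```

With row-wise scaling, the absolute size of a startup's spending is discarded and only the proportions between the three categories remain.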


In [35]:
#Task3: splitting the dataset for training and testing
X = scaled_startup.drop("Profit", axis=1)
y = scaled_startup["Profit"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
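
Because the split above is random, the scores in later cells change from run to run. A minimal sketch of a reproducible split follows, using stand-in arrays with the same 50 rows as the dataset. The seed value 42 and the `*_demo` names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the dataset (50)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# random_state fixes the shuffle so reruns produce the same split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)

# Same seed, same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))  # True
```

With only 10 test rows, a single split gives a noisy estimate, so test scores from different runs can differ noticeably.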

In [37]:
# Task4:  training the model using sklearn(linear regression)
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_train, y_train)

0.8111090083938026

In [38]:
#Finding the intercept and coefficient from the trained model
model.intercept_, model.coef_

(194768.70500908702,
 array([   -905.32791889,    -653.8377094 ,    1559.16562829,
           2464.49354718,  184865.52520949, -158716.93131074,
         -67060.74303457]))
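
The seven coefficients above follow the column order of `X`, which makes the raw array hard to read. One way to label them is sketched below on synthetic, noise-free data. The column names `rd_spend` and `admin` and the generating formula are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: profit = 1000 + 0.8 * rd_spend - 0.1 * admin
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "rd_spend": rng.uniform(0, 100, size=30),
    "admin": rng.uniform(0, 100, size=30),
})
y_toy = 1000 + 0.8 * X_toy["rd_spend"] - 0.1 * X_toy["admin"]

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ follows the column order of the training frame, so the column
# names can be attached directly
coef_table = pd.Series(toy_model.coef_, index=X_toy.columns)
print(coef_table)
print(toy_model.intercept_)
```

In the notebook, the same pattern would be `pd.Series(model.coef_, index=X_train.columns)`.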

In [40]:
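# Task5: predicting the profits of test data and evaluate the model using r2 score and mean squared error
y_preds = model.predict(X_test)

# sklearn metrics take (y_true, y_pred); R² is not symmetric, so the order matters.
# The R² in the recorded output below was computed with the arguments swapped.
r2score = r2_score(y_test, y_preds)
mae = mean_absolute_error(y_test, y_preds)
rmse = math.sqrt(mean_squared_error(y_test, y_preds))
r2score, mae, rmse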

(0.7203173364073493, 11985.309148849714, 16297.576557465418)
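
A short sketch of how the three metrics behave, using four made-up values rather than the notebook's data. sklearn's metric functions take `(y_true, y_pred)`. MAE and MSE depend only on the differences, so swapping the arguments leaves them unchanged, but R² divides by the variance of the first argument and does change.

```python
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# R² depends on which argument is treated as the ground truth
r2_correct = r2_score(y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)
print(r2_correct, r2_swapped)  # about 0.9486 and 0.9574

# MAE and MSE are symmetric in their arguments
mae_toy = mean_absolute_error(y_true, y_pred)
mse_toy = mean_squared_error(y_true, y_pred)
rmse_toy = math.sqrt(mse_toy)  # RMSE is the square root of MSE
print(mae_toy, mse_toy, rmse_toy)
```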

In [None]:
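#Task6: Regularizing the model using Ridge Regression
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
# R² on the training and test sets, mirroring the Lasso cell below
ridge.score(X_train, y_train), ridge.score(X_test, y_test)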

In [42]:
#Task7: Regularizing the model using Lasso Regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
lasso.score(X_train, y_train), lasso.score(X_test, y_test)

(0.8111090053092881, 0.7671702785477617)
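
With `alpha=0.1`, the Lasso training score above (0.8111090053) is almost identical to the plain linear regression score (0.8111090084), which suggests the penalty is too weak to change the fit much. The sketch below shows how `alpha` affects both regularizers. It uses synthetic data, and the alpha values and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first of three features drives the target
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
y_syn = 5.0 * X_syn[:, 0] + rng.normal(scale=0.5, size=50)

# Ridge: a larger alpha shrinks the coefficient vector toward zero
ridge_weak = Ridge(alpha=0.01).fit(X_syn, y_syn)
ridge_strong = Ridge(alpha=100.0).fit(X_syn, y_syn)
print(np.linalg.norm(ridge_weak.coef_), np.linalg.norm(ridge_strong.coef_))

# Lasso: a large enough alpha sets every coefficient exactly to zero
lasso_weak = Lasso(alpha=0.01).fit(X_syn, y_syn)
lasso_strong = Lasso(alpha=100.0).fit(X_syn, y_syn)
print(lasso_weak.coef_, lasso_strong.coef_)
```

Trying several values of `alpha` and comparing test scores is a simple way to see whether regularization helps on a given dataset.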