# Step 1: Data Preprocessing
### Importing the libraries

In [1]:
import pandas as pd
import numpy as np

### Importing the dataset

In [2]:
dataset = pd.read_csv('../datasets/50_Startups.csv')
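# Select all rows (:) and every column except the last (:-1) of dataset and store them in X.
# X therefore holds the feature data; the last column (Profit) is left out.
X = dataset.iloc[:, :-1].values
# Column 4 (Profit) is the target.
Y = dataset.iloc[:, 4].values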
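
The slicing above can be checked without the CSV file. The sketch below is only an illustration: it builds a hypothetical two-row stand-in for the dataset (same column layout, values taken from the sample rows) and applies the same `iloc` selection.

```python
import pandas as pd

# Hypothetical two-row stand-in for 50_Startups.csv (same column layout).
dataset = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7],
    "Administration": [136897.8, 151377.59],
    "Marketing Spend": [471784.1, 443898.53],
    "State": ["New York", "California"],
    "Profit": [192261.83, 191792.06],
})

X = dataset.iloc[:, :-1].values  # every column except the last: the features
Y = dataset.iloc[:, 4].values    # column 4 (Profit): the target

print(X.shape)  # (2, 4)
print(X.dtype)  # object, because the State column holds strings
print(Y)
```

Because X mixes numbers and strings, it is an object array, which is why the State column has to be encoded before fitting a model.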

### Encoding Categorical data

A sample of the dataset (header plus five rows):

R&D Spend,Administration,Marketing Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.83
162597.7,151377.59,443898.53,California,191792.06
153441.51,101145.55,407934.54,Florida,191050.39
144372.41,118671.85,383199.62,New York,182901.99
142107.34,91391.77,366168.42,Florida,166187.94

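For the sample rows above, label encoding followed by one-hot encoding changes the data as follows. After label encoding (Profit is the target Y and is shown only for reference; it is not part of X):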
+-----+--------------+----------------+------------------+-------+-----------------------+
|     | R&D Spend    | Administration | Marketing Spend  | State | Profit                |
+-----+--------------+----------------+------------------+-------+-----------------------+
| 0   | 165349.2     | 136897.8       | 471784.1         | 2     | 192261.83             |
| 1   | 162597.7     | 151377.59      | 443898.53        | 0     | 191792.06             |
| 2   | 153441.51    | 101145.55      | 407934.54        | 1     | 191050.39             |
| 3   | 144372.41    | 118671.85      | 383199.62        | 2     | 182901.99             |
| 4   | 142107.34    | 91391.77       | 366168.42        | 1     | 166187.94             |
+-----+--------------+----------------+------------------+-------+-----------------------+
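Here the "State" column has been label-encoded into integer labels (LabelEncoder assigns them in sorted order of the class names):

New York is converted to 2
California is converted to 0
Florida is converted to 1

One-hot encoding then replaces the "State" column with three new binary feature columns, ordered by label (California, Florida, New York). Each state maps to the following row:

"State_California": [1, 0, 0]
"State_Florida": [0, 1, 0]
"State_New York": [0, 0, 1]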
X = [[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
     [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
     [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
     [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
     [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42]]

In the end, X contains the three one-hot encoded State columns followed by the numeric feature columns. Profit is not included, because it is the target Y.
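
The two encoding steps described above can be reproduced on the five sample states alone. This is a minimal sketch; the variable names `states`, `codes`, and `dummies` are chosen here for illustration and do not appear in the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

states = np.array(["New York", "California", "Florida", "New York", "Florida"])

# Label encoding: classes are sorted, so California=0, Florida=1, New York=2.
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(states)
print(codes)  # [2 0 1 2 1]

# One-hot encoding expects 2-D input (one column per feature), hence the reshape.
onehot_encoder = OneHotEncoder()
dummies = onehot_encoder.fit_transform(codes.reshape(-1, 1)).toarray()
print(dummies)
```

Each row of `dummies` has a single 1 in the position of its label, so New York (label 2) becomes `[0, 0, 1]`.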

In [6]:
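from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
# Label-encode the fourth column (index 3, "State") and overwrite it in place.
# Label encoding maps each category to an integer that stands for that category.
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# One-hot encode only column 3 and pass the numeric columns through unchanged.
# A bare OneHotEncoder().fit_transform(X) would treat every column as categorical.
onehotencoder = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                  remainder='passthrough', sparse_threshold=0)
# sparse_threshold=0 forces a dense result; cast the object array to float.
X = onehotencoder.fit_transform(X).astype(float)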
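
One way to make sure that only the State column is one-hot encoded is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how the step could be written, not necessarily the notebook author's method. It uses three of the sample rows, and the name `X_encoded` is hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three sample rows: R&D Spend, Administration, Marketing Spend, State.
X = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

# Encode column 3 only; the remaining columns are passed through unchanged.
# sparse_threshold=0 asks for a dense array instead of a sparse matrix.
transformer = ColumnTransformer(
    [("state", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = transformer.fit_transform(X).astype(float)
print(X_encoded)
```

The encoded State columns come first (California, Florida, New York), followed by the three numeric columns, which matches the layout of X shown earlier.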

### Avoiding Dummy Variable Trap

In [8]:
# Drop the first dummy column; its value is implied by the other two.
X = X[:, 1:]
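
Why one dummy column is dropped can be seen from the rank of the design matrix. The following sketch assumes the model includes an intercept (as `LinearRegression` does by default) and uses the one-hot rows of the five sample states.

```python
import numpy as np

# One-hot columns for California, Florida, New York (five sample rows).
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((5, 1))

# With all three dummies, the dummy columns sum to the intercept column,
# so the design matrix is rank-deficient: 4 columns but rank 3.
full = np.hstack([intercept, dummies])
print(full.shape[1], np.linalg.matrix_rank(full))  # 4 3

# Dropping the first dummy removes the redundancy: 3 columns, rank 3.
reduced = np.hstack([intercept, dummies[:, 1:]])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```

With the redundant column removed, California becomes the baseline category and the remaining dummy coefficients are measured relative to it.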

### Splitting the dataset into the training set and test set

In [9]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state=0 makes the shuffle reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
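
The effect of `test_size=0.2` can be illustrated on synthetic data. The arrays below are placeholders with 50 rows, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features (synthetic)
Y = np.arange(50)                  # row i has target i

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)

# Features and targets are shuffled together, so rows stay aligned.
print(np.array_equal(X_train[:, 0], 2 * Y_train))  # True
```

With 50 rows, an 80/20 split leaves 40 rows for training and 10 for testing.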

# Step 2: Fitting Multiple Linear Regression to the training set

In [10]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)
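
After fitting, the learned parameters are available as `coef_` and `intercept_`. The sketch below uses synthetic, noise-free data with known coefficients (an assumption made purely for illustration) to show that the fit recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: Y = 5 + 2*x0 - 1*x1 + 0.5*x2, with no noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
true_coef = np.array([2.0, -1.0, 0.5])
Y_train = 5.0 + X_train @ true_coef

regressor = LinearRegression()
regressor.fit(X_train, Y_train)

print(regressor.intercept_)  # close to 5.0
print(regressor.coef_)       # close to [2.0, -1.0, 0.5]
```

On the startup data, the coefficients would instead describe how Profit changes with each feature while the others are held fixed.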

# Step 3: Predicting the Test set results

In [11]:
y_pred = regressor.predict(X_test)
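
A natural follow-up is to compare the predictions with the held-out targets. The notebook does not do this, so the sketch below is only a suggestion: it uses synthetic data and `r2_score` from scikit-learn as one possible metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, Y_train)
y_pred = regressor.predict(X_test)

# R^2 compares the predictions with the held-out targets (1.0 is a perfect fit).
score = r2_score(Y_test, y_pred)
print(y_pred.shape)  # (10,)
print(score)
```

Scores on real data are usually lower than on this synthetic example, since the true relationship is rarely exactly linear.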