### This code demonstrates a data preprocessing step in machine learning:
1. Split a dataset into training and testing sets for model evaluation.
2. Save the split data to separate CSV files for further analysis or modeling.

#### Explanation
- Import Libraries. Import pandas for data manipulation and train_test_split from scikit-learn for splitting the dataset.
- Define File Paths. Define variables containing the file paths for the dataset, training data, and new data.
- Load Dataset. Load the dataset from the specified file path.
- Define Features and Target. Specify the columns to be used as features (X_columns) and the target variable (y_column).
- Split Dataset. Use train_test_split to split the whole dataset (features and target together) into training and testing sets. Here, 70% of the rows are used for training (train_size=0.7) and the remaining 30% for testing.
- Output Information. Print the shapes of the training and testing sets to check the size of the splits.
- Write to CSV. Save the training set (all columns) and the testing set (feature columns only, without the target) to CSV files at the specified file paths.

In [1]:
# import the libraries needed for file I/O and for splitting the data
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
# define file paths
data_file_path = "D:/programming/information-technologies-of-smart-systems/term-paper/data/Clean_Dataset.csv"
train_data_file_path = "D:/programming/information-technologies-of-smart-systems/term-paper/data/final/train_data.csv"
new_data_file_path = "D:/programming/information-technologies-of-smart-systems/term-paper/data/final/new_data.csv"

In [4]:
# read the whole dataset from a CSV file
ds = pd.read_csv(data_file_path)

In [5]:
ds.columns

Index(['Unnamed: 0', 'airline', 'flight', 'source_city', 'departure_time',
       'stops', 'arrival_time', 'destination_city', 'class', 'duration',
       'days_left', 'price'],
      dtype='object')
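
The `Unnamed: 0` column in the output above typically appears when a CSV was saved together with its row index and then read back without declaring that index. A minimal sketch, using a small made-up CSV string rather than the real dataset, of how `index_col=0` could keep that column out of the data columns:

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking a CSV that was saved with its index.
csv_text = ",airline,price\n0,AirA,100\n1,AirB,120\n"

# Default read: the unlabeled first column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Alternative read: treat the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Whether the column should be dropped here depends on whether downstream code relies on it, so this is only one possible choice.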

In [6]:
y_column = ['price'] # target variable
X_columns = ['Unnamed: 0', 'airline', 'flight', 'source_city', 'departure_time',
       'stops', 'arrival_time', 'destination_city', 'class', 'duration',
       'days_left']   # define the feature columns
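
The cell above defines `X_columns` and `y_column`, while the split that follows is applied to the whole DataFrame. If separate feature and target frames were wanted, `train_test_split` accepts several arrays and splits them with the same row assignment. A minimal sketch on a made-up toy frame (the values, the `toy_*` names, and the seed are illustrative assumptions, not taken from the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "duration": [2.0, 2.5, 3.0, 1.5, 4.0, 2.2, 3.3, 5.1, 1.1, 2.8],
    "days_left": [1, 5, 10, 20, 30, 2, 7, 14, 21, 40],
    "price": [100, 90, 80, 70, 60, 95, 85, 75, 65, 55],
})
toy_X_columns = ["duration", "days_left"]
toy_y_column = ["price"]

# Passing two frames returns four pieces: features and target for each split.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[toy_X_columns], toy[toy_y_column], train_size=0.7, random_state=0
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

The feature and target pieces keep matching row indices, so they can be passed straight to a model's fit and predict steps.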

In [11]:
# split the dataset into training and testing sets
X_train, X_test = train_test_split(ds, train_size=0.7)
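
Because no `random_state` is passed above, each run of the cell shuffles the rows differently, so the saved files can change between runs. A small sketch, on a made-up frame and with an arbitrarily chosen seed of 42 (an assumption, not part of the original notebook), of how fixing the seed makes the split repeatable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame used only to demonstrate repeatability.
demo = pd.DataFrame({"row_id": range(20), "price": range(100, 120)})

# Two independent calls with the same seed select the same rows.
first_train, first_test = train_test_split(demo, train_size=0.7, random_state=42)
second_train, second_test = train_test_split(demo, train_size=0.7, random_state=42)

print(first_train.index.equals(second_train.index))
print(len(first_train), len(first_test))
```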

In [12]:
# output the shape of the training set
print("Training set shape:", X_train.shape)

Training set shape: (210107, 12)


In [13]:
# output the shape of the test set
print("Testing set shape:", X_test.shape)

Testing set shape: (90046, 12)


In [14]:
# save the training data to a CSV file
X_train.to_csv(train_data_file_path, index=False)

In [15]:
# output the shape of the testing set
print("Testing set shape:", X_test.shape)

Testing set shape: (90046, 12)


In [16]:
# save the testing data to a CSV file (with only selected feature columns)
X_test[X_columns].to_csv(new_data_file_path, index=False)
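
One way to sanity-check the two files after writing is to read them back and compare their columns. The sketch below uses a temporary directory and a made-up frame rather than the real paths; every name in it is hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for a split that contains features and the target.
sample = pd.DataFrame({
    "duration": [2.0, 3.5, 1.2],
    "days_left": [1, 10, 30],
    "price": [100, 80, 60],
})
feature_columns = ["duration", "days_left"]

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = Path(tmp_dir) / "train_data.csv"
    new_path = Path(tmp_dir) / "new_data.csv"

    # Same write pattern as the notebook: all columns vs. features only.
    sample.to_csv(train_path, index=False)
    sample[feature_columns].to_csv(new_path, index=False)

    # Read both files back while the temporary directory still exists.
    train_back = pd.read_csv(train_path)
    new_back = pd.read_csv(new_path)

print("price" in train_back.columns)
print("price" in new_back.columns)
```

Writing with `index=False` also means no extra unlabeled index column is added when the files are read back.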