In [2]:
# Install necessary libraries
!pip install striprtf scikit-learn pandas xgboost

Collecting striprtf
  Downloading striprtf-0.0.27-py3-none-any.whl.metadata (2.3 kB)
Downloading striprtf-0.0.27-py3-none-any.whl (7.6 kB)
Installing collected packages: striprtf
Successfully installed striprtf-0.0.27


# **1. Importing Required Libraries**

In [3]:
from striprtf.striprtf import rtf_to_text
import json
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.ensemble import (ExtraTreesRegressor, ExtraTreesClassifier,
                              RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor)
from sklearn.linear_model import (LogisticRegression, LinearRegression, Ridge, Lasso,
                                  ElasticNet, SGDClassifier, SGDRegressor)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score
from sklearn.preprocessing import LabelEncoder
import xgboost
from google.colab import files


The import statements bring in various libraries used throughout the script:

- **striprtf.striprtf**: To convert RTF content into plain text.  
- **json, pandas**: For handling JSON data and manipulating dataframes.  
- **Scikit-learn**: For dimensionality reduction (`PCA`), feature selection (`SelectKBest`), the classification and regression models, train/test splitting, grid search, and evaluation metrics.  
- **xgboost**: Provides the XGBoost models (`XGBClassifier`, `XGBRegressor`) for classification and regression tasks.  
- **google.colab.files**: To upload files directly in a Google Colab environment.


# **2. Upload the JSON Configuration File**

In [4]:
print("Please upload the JSON configuration file (RTF format):")
uploaded = files.upload()


Please upload the JSON configuration file (RTF format):


Saving algoparams_from_ui.json.rtf to algoparams_from_ui.json.rtf


The user is prompted to upload the configuration file, a JSON document saved in RTF format. It specifies the target column and how each feature should be handled, among other session settings.

# **3. Reading and Parsing the JSON Configuration**

In [5]:
# Read the uploaded RTF file and strip the RTF markup.
file_name = list(uploaded.keys())[0]
with open(file_name, 'r') as rtf_file:
    rtf_content = rtf_file.read()

plain_text = rtf_to_text(rtf_content)
try:
    # The JSON document is assumed to start at the first opening brace.
    json_start = plain_text.find("{")
    json_data = plain_text[json_start:]
    parsed_data = json.loads(json_data)
    print("Parsed JSON Data:")
    print(json.dumps(parsed_data, indent=4))
except json.JSONDecodeError as e:
    print("Error decoding JSON:", e)
    raise  # re-raise to stop the cell; exit() can terminate the notebook runtime


Parsed JSON Data:
{
    "session_name": "test",
    "session_description": "test",
    "design_state_data": {
        "session_info": {
            "project_id": "1",
            "experiment_id": "kkkk-11",
            "dataset": "iris_modified.csv",
            "session_name": "test",
            "session_description": "test"
        },
        "target": {
            "prediction_type": "Regression",
            "target": "petal_width",
            "type": "regression",
            "partitioning": true
        },
        "train": {
            "policy": "Split the dataset",
            "time_variable": "sepal_length",
            "sampling_method": "No sampling(whole data)",
            "split": "Randomly",
            "k_fold": false,
            "train_ratio": 0,
            "random_seed": 0
        },
        "metrics": {
            "optomize_model_hyperparameters_for": "AUC",
            "optimize_threshold_for": "F1 Score",
            "compute_lift_at": 0,
            "cost_mat



- **RTF File Reading**:  
  The uploaded RTF file is read, and the `rtf_to_text` function strips the RTF markup to leave plain text.

- **JSON Parsing**:  
  Everything from the first `{` onward is treated as the JSON document and parsed with Python's `json` library. A `json.JSONDecodeError` is reported and stops the run.

- **Validation**:  
  The script prints the parsed JSON so the extracted structure can be checked by eye (the output above is truncated).
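
Slicing from the first `{` to the end of the text assumes that nothing follows the JSON document. As a hedged alternative (not what the notebook does), `json.JSONDecoder.raw_decode` parses a single JSON value starting at a given index and ignores whatever comes after it. The helper name `extract_first_json` and the sample string are illustrative assumptions.

```python
import json


def extract_first_json(text):
    """Parse the first JSON object found in text, ignoring any trailing content."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in text")
    # raw_decode returns the parsed value and the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Illustrative text: a heading before the JSON and stray text after it.
sample = 'Config export\n{"target": {"target": "petal_width", "type": "regression"}}\ntrailing text'
config = extract_first_json(sample)
print(config["target"]["target"])
```

With plain `json.loads` on the slice, the trailing text in `sample` would raise a `JSONDecodeError`.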


# **4. Upload the CSV Dataset**

In [6]:
print("\nPlease upload the CSV file containing the dataset:")
uploaded_csv = files.upload()
csv_file_name = list(uploaded_csv.keys())[0]
df = pd.read_csv(csv_file_name)



Please upload the CSV file containing the dataset:


Saving iris.csv to iris.csv




- **File Upload**:  
  The user is asked to upload a CSV file containing the dataset.

- **Reading the File**:  
  The file is read using `pandas` (`pd.read_csv()`), and it is stored as a dataframe (`df`).


# **5. Handle Categorical Features**

In [8]:
# Step 4: Get target column and features
target = parsed_data['design_state_data']['target']['target']
features = parsed_data['design_state_data']['feature_handling']

# Handle categorical features
for feature, details in features.items():
    if details['is_selected'] and details['feature_variable_type'] == "text":
        print(f"Encoding categorical feature: {feature}")
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature].astype(str))

Encoding categorical feature: species




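- **Label Encoding**:  
  Each feature that is marked `is_selected` and has `feature_variable_type` equal to `"text"` in the configuration is encoded with `LabelEncoder`. In this run, only `species` qualifies.

- **Conversion to Numeric**:  
  The column is cast to strings and each distinct value is replaced by an integer code, enabling the models to process it.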
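
To make the encoding concrete, here is a small sketch of what `LabelEncoder` produces. The list of species values is illustrative rather than taken from the uploaded dataset.

```python
from sklearn.preprocessing import LabelEncoder

species = ["setosa", "virginica", "versicolor", "setosa"]  # illustrative values

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ holds the distinct labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original strings from the codes.
restored = le.inverse_transform(codes).tolist()
print(restored)
```

The integer codes follow the sorted order of the labels, so they carry no meaning beyond identity. Models that treat inputs as ordered quantities may read more into them than intended.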


# **6. Format the Target Column**

In [34]:
task_type = input("Enter task type (classification/regression): ").strip().lower()
if task_type == "classification":
    if df[target].dtype in ["float64", "int64"] and df[target].nunique() > 20:
        print(f"Warning: Converting numeric target with {df[target].nunique()} unique values into discrete bins.")
        df[target] = pd.cut(df[target], bins=3, labels=["class_1", "class_2", "class_3"])
    df[target] = df[target].astype("category").cat.codes
elif task_type == "regression":
    df[target] = pd.to_numeric(df[target], errors="coerce")
    df = df.dropna(subset=[target])
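else:
    # Raise instead of exit(), which can terminate the notebook runtime.
    raise ValueError("Invalid task type: expected 'classification' or 'regression'.")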

Enter task type (classification/regression): regression


Based on the task type (classification or regression), the target column is formatted:
- **Classification**:  
  If the target column is numeric with more than 20 unique values, it is cut into three equal-width bins (`class_1` to `class_3`) with `pd.cut`. The target is then converted to integer category codes.

- **Regression**:  
  The target column is converted to numeric values with `pd.to_numeric(errors="coerce")`, which turns unparseable entries into `NaN`, and rows with missing target values are dropped.
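
Both paths can be illustrated on a few made-up values. The series below are assumptions for demonstration, not the notebook's data.

```python
import pandas as pd

# Illustrative numeric target (not the notebook's data).
target_values = pd.Series([0.1, 0.2, 1.0, 1.3, 2.4, 2.5])

# Classification path: three equal-width bins over the observed range,
# followed by conversion to integer category codes.
binned = pd.cut(target_values, bins=3, labels=["class_1", "class_2", "class_3"])
codes = binned.astype("category").cat.codes
print(codes.tolist())

# Regression path: unparseable entries become NaN and are then dropped.
raw = pd.Series(["0.2", "n/a", "1.5"])
numeric = pd.to_numeric(raw, errors="coerce").dropna()
print(numeric.tolist())
```

Because `bins=3` produces equal-width intervals, the resulting classes can be unbalanced when the target is skewed.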


# **7. Feature Reduction**

In [35]:
# Step 6: User selects feature reduction method
print("\nFeature Reduction Methods:")
print("1. No Reduction")
print("2. PCA")
print("3. Correlation with Target")
print("4. Tree-based Selection")
reduction_choice = input("Select a feature reduction method (enter 1, 2, 3, or 4): ").strip()

reduced_features = None
selected_columns = None  # Track selected feature names or PCA components
if reduction_choice == "1":
    print("No Reduction selected. Using all original features.")
    reduced_features = df.drop(columns=[target])
    selected_columns = reduced_features.columns.tolist()
elif reduction_choice == "2":
    num_of_features_to_keep = int(input("Enter the number of components to keep for PCA: "))
    pca = PCA(n_components=num_of_features_to_keep)
    reduced_features = pd.DataFrame(pca.fit_transform(df.drop(columns=[target])),
                                    columns=[f"PC{i+1}" for i in range(num_of_features_to_keep)])
    selected_columns = reduced_features.columns.tolist()
elif reduction_choice == "3":
    num_of_features_to_keep = int(input("Enter the number of features to select based on correlation with the target: "))
    score_func = f_regression if task_type == "regression" else f_classif
    selector = SelectKBest(score_func=score_func, k=num_of_features_to_keep)
    reduced_features = selector.fit_transform(df.drop(columns=[target]), df[target])
    selected_columns = df.drop(columns=[target]).columns[selector.get_support()].tolist()
    reduced_features = pd.DataFrame(reduced_features, columns=selected_columns)
elif reduction_choice == "4":
    num_of_features_to_keep = int(input("Enter the number of features to select using Tree-based importance: "))
    tree_model = ExtraTreesRegressor() if task_type == "regression" else ExtraTreesClassifier()
    tree_model.fit(df.drop(columns=[target]), df[target])
    top_features = df.drop(columns=[target]).columns[tree_model.feature_importances_.argsort()[-num_of_features_to_keep:]]
    reduced_features = df[top_features]
    selected_columns = top_features.tolist()
else:
    # Raise instead of exit(), which can terminate the notebook runtime.
    raise ValueError("Invalid choice: expected 1, 2, 3, or 4.")

# Step 6b: Display the selected features or PCA components and their values
if selected_columns:
    print("\nSelected Features or Components:")
    print(reduced_features.head())
else:
    print("Feature reduction failed or no features selected.")



Feature Reduction Methods:
1. No Reduction
2. PCA
3. Correlation with Target
4. Tree-based Selection
Select a feature reduction method (enter 1, 2, 3, or 4): 2
Enter the number of components to keep for PCA: 3

Selected Features or Components:
        PC1       PC2       PC3
0 -2.685814  0.305194  0.051476
1 -2.715303 -0.178050 -0.172252
2 -2.888472 -0.161684  0.048363
3 -2.745779 -0.323027  0.026984
4 -2.729896  0.307330  0.160715




The user is prompted to select one of four feature reduction methods; in this run, PCA with 3 components was chosen:

- **No Reduction**:  
  No feature reduction; all original features are used.

- **PCA (Principal Component Analysis)**:  
  Projects the features onto the requested number of principal components (`PC1`, `PC2`, ...).

- **Correlation with Target**:  
  Uses `SelectKBest` to keep the `k` features with the highest univariate F-scores against the target (`f_regression` for regression, `f_classif` for classification).

- **Tree-based Selection**:  
  Fits an `ExtraTreesRegressor` or `ExtraTreesClassifier` and keeps the features with the highest `feature_importances_`.
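
As a minimal sketch of the score-based option, the example below runs `SelectKBest` with `f_regression` on synthetic data in which only one column drives the target. The column names and the data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x0", "x1", "x2"])
# Synthetic target that depends only on x0, plus a little noise.
y = 3.0 * X["x0"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

# get_support() is a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

The scores are univariate, so a feature that is only useful in combination with others can be ranked low by this method.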


# **8. Split the Data**

In [36]:
X_train, X_test, y_train, y_test = train_test_split(reduced_features, df[target], test_size=0.2, random_state=42,
                                                    stratify=df[target] if task_type == "classification" else None)



- **Train-Test Split**:  
  The dataset is split into training and testing sets, with **80% for training** and **20% for testing** (`test_size=0.2`); `random_state=42` makes the split reproducible.

- **Stratification**:  
  For classification tasks, the split is stratified on the target so that both sets keep the same class proportions. Regression targets are not stratified.
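
The notebook applies feature reduction to the full dataset before splitting, so the reducer also sees the test rows. One commonly used alternative, sketched here on synthetic data and not taken from the notebook, is to wrap the reducer and the model in a `Pipeline` that is fitted on the training set only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))  # synthetic features
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# PCA is fitted inside the pipeline, so it only ever sees the training rows.
pipe = Pipeline([("pca", PCA(n_components=3)), ("model", LinearRegression())])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(X_train.shape, X_test.shape, y_pred.shape)
```

Whether this matters in practice depends on the dataset; the sketch only shows where the reducer would be fitted.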


# **9. Model Training and Evaluation**



In [37]:
# Define available models based on task type
if task_type == "classification":
    available_models = {
        "RandomForestClassifier": {"model": RandomForestClassifier(), "param_grid": {'n_estimators': [10, 20], 'max_depth': [5, 10]}},
        "GradientBoostingClassifier": {"model": GradientBoostingClassifier(), "param_grid": {'n_estimators': [50], 'max_depth': [3]}},
        "LogisticRegression": {"model": LogisticRegression(), "param_grid": {'C': [0.1, 1]}},
        "DecisionTreeClassifier": {"model": DecisionTreeClassifier(), "param_grid": {'max_depth': [3, 5, 10]}},
        "SVC": {"model": SVC(), "param_grid": {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}},
        "KNeighborsClassifier": {"model": KNeighborsClassifier(), "param_grid": {'n_neighbors': [3, 5, 7]}},
        "SGDClassifier": {"model": SGDClassifier(), "param_grid": {'alpha': [0.0001, 0.001], 'penalty': ['l2', 'l1']}},
        "MLPClassifier": {"model": MLPClassifier(max_iter=1000), "param_grid": {'hidden_layer_sizes': [(50,), (100,)], 'activation': ['relu', 'tanh']}},
        "XGBoostClassifier": {"model": xgboost.XGBClassifier(), "param_grid": {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1]}}
    }
elif task_type == "regression":
    available_models = {
        "RandomForestRegressor": {"model": RandomForestRegressor(), "param_grid": {'n_estimators': [10, 20], 'max_depth': [5, 10]}},
        "GradientBoostingRegressor": {"model": GradientBoostingRegressor(), "param_grid": {'n_estimators': [50], 'max_depth': [3]}},
        "LinearRegression": {"model": LinearRegression(), "param_grid": None},
        "Ridge": {"model": Ridge(), "param_grid": {'alpha': [0.1, 1.0, 10.0]}},
        "Lasso": {"model": Lasso(), "param_grid": {'alpha': [0.1, 1.0, 10.0]}},
        "ElasticNet": {"model": ElasticNet(), "param_grid": {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.1, 0.5]}},
        "DecisionTreeRegressor": {"model": DecisionTreeRegressor(), "param_grid": {'max_depth': [3, 5, 10]}},
        "SVR": {"model": SVR(), "param_grid": {'C': [0.1, 1], 'kernel': ['linear', 'rbf']}},
        "KNeighborsRegressor": {"model": KNeighborsRegressor(), "param_grid": {'n_neighbors': [3, 5, 7]}},
        "SGDRegressor": {"model": SGDRegressor(), "param_grid": {'alpha': [0.0001, 0.001]}},
        "MLPRegressor": {"model": MLPRegressor(max_iter=1000), "param_grid": {'hidden_layer_sizes': [(50,), (100,)], 'activation': ['relu', 'tanh']}},
        "XGBoostRegressor": {"model": xgboost.XGBRegressor(), "param_grid": {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1]}}
    }
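else:
    # Raise instead of exit(), which can terminate the notebook runtime.
    raise ValueError("Invalid task type: expected 'classification' or 'regression'.")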

print("\nAvailable Models to Train:")
for model_name in available_models:
    print(f"- {model_name}")



Available Models to Train:
- RandomForestRegressor
- GradientBoostingRegressor
- LinearRegression
- Ridge
- Lasso
- ElasticNet
- DecisionTreeRegressor
- SVR
- KNeighborsRegressor
- SGDRegressor
- MLPRegressor
- XGBoostRegressor


**Model Availability:** Based on the task type (classification or regression), a dictionary of available models is defined. Each entry pairs an estimator with a small hyperparameter grid for `GridSearchCV`; `LinearRegression` has no grid here, so its `param_grid` is `None`.

# **10. Model Selection**

In [38]:
selected_models_input = input("\nEnter the model names you want to train (comma-separated): ").strip()
selected_models = [model.strip() for model in selected_models_input.split(",")]


Enter the model names you want to train (comma-separated): RandomForestRegressor, GradientBoostingRegressor, LinearRegression, Ridge, Lasso, ElasticNet




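- **User Input**:  
  The user is asked to input the models they want to train (comma-separated).

- **Model Processing**:  
  The input is split on commas and each name is stripped of surrounding whitespace. Names are matched against the available models exactly (case-sensitive) in the training loop, which skips any name it does not recognize.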
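
Because unknown names are only reported once training starts, it can help to check them right after input. A small sketch follows; the helper `split_model_names` and the stand-in set of names are hypothetical and not part of the notebook.

```python
def split_model_names(raw_input, available):
    """Return (known, unknown) model names from a comma-separated string."""
    # Drop empty entries such as the one left by a trailing comma.
    names = [name.strip() for name in raw_input.split(",") if name.strip()]
    known = [name for name in names if name in available]
    unknown = [name for name in names if name not in available]
    return known, unknown


available = {"Ridge", "Lasso", "ElasticNet"}  # stand-in for available_models.keys()
known, unknown = split_model_names("Ridge, lasso , ElasticNet,", available)
print(known)
print(unknown)
```

Here `lasso` lands in `unknown` because the comparison is case-sensitive, which mirrors the lookup in the training loop.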


# **11. Train and Evaluate Models**

In [39]:
for model_name in selected_models:
    if model_name not in available_models:
        print(f"Model {model_name} is not available. Skipping.")
        continue

    print(f"\nTraining model: {model_name}")
    model_info = available_models[model_name]
    model = model_info["model"]
    param_grid = model_info["param_grid"]

    try:
        # Grid search for hyperparameter tuning if param_grid exists
        if param_grid:
            grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                                       scoring='accuracy' if task_type == "classification" else 'neg_mean_squared_error', error_score='raise')
            grid_search.fit(X_train, y_train)
            best_model = grid_search.best_estimator_
            print(f"Best parameters for {model_name}: {grid_search.best_params_}")
        else:
            model.fit(X_train, y_train)
            best_model = model

        # Evaluate the model
        y_pred = best_model.predict(X_test)

        if task_type == "classification":
            accuracy = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')
            print(f"Accuracy for {model_name}: {accuracy}")
            print(f"F1 Score for {model_name}: {f1}")
        elif task_type == "regression":
            mse = mean_squared_error(y_test, y_pred)
            print(f"Mean Squared Error for {model_name}: {mse}")
    except Exception as e:
        print(f"Error training {model_name}: {e}")



Training model: RandomForestRegressor
Best parameters for RandomForestRegressor: {'max_depth': 10, 'n_estimators': 10}
Mean Squared Error for RandomForestRegressor: 0.02233333333333333

Training model: GradientBoostingRegressor
Best parameters for GradientBoostingRegressor: {'max_depth': 3, 'n_estimators': 50}
Mean Squared Error for GradientBoostingRegressor: 0.006915289514699861

Training model: LinearRegression
Mean Squared Error for LinearRegression: 0.018604841823369984

Training model: Ridge
Best parameters for Ridge: {'alpha': 0.1}
Mean Squared Error for Ridge: 0.018724832050328665

Training model: Lasso
Best parameters for Lasso: {'alpha': 0.1}
Mean Squared Error for Lasso: 0.05729888615591594

Training model: ElasticNet
Best parameters for ElasticNet: {'alpha': 0.1, 'l1_ratio': 0.1}
Mean Squared Error for ElasticNet: 0.035678364543306126



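- **Hyperparameter Tuning**:  
  For each selected model that has a parameter grid, `GridSearchCV` runs 5-fold cross-validation on the training set, scoring by accuracy for classification and by negative mean squared error for regression. Models without a grid (here, `LinearRegression`) are fitted directly.

- **Model Training**:  
  With its default `refit=True`, `GridSearchCV` refits the best parameter combination on the whole training set, and that estimator is used for evaluation. Any exception raised while training a model is caught and reported, and the loop moves on to the next model.

- **Performance Evaluation**:  
  Predictions are made on the held-out test set.
  - **Classification**: Evaluated using **Accuracy** and weighted **F1 Score**.  
  - **Regression**: Evaluated using **Mean Squared Error (MSE)**. In the run above, `GradientBoostingRegressor` had the lowest test MSE (about 0.0069) and `Lasso` the highest (about 0.057).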
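
The regression scorer `neg_mean_squared_error` is the MSE with its sign flipped, so that higher is always better inside `GridSearchCV`. The sketch below, on synthetic data chosen for illustration, shows how the cross-validated score can be turned back into an MSE or RMSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # synthetic features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

# best_score_ is the mean cross-validated score of the best candidate;
# with this scorer it is the negated MSE, so flip the sign to report it.
cv_mse = -grid.best_score_
cv_rmse = np.sqrt(cv_mse)
print(grid.best_params_, cv_mse, cv_rmse)
```

Reporting the cross-validated MSE next to the test-set MSE gives a rough check on whether the held-out result is in line with what tuning saw.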
