<a href="https://colab.research.google.com/github/drshahizan/Python_EDA/blob/main/assignment/ass5/hpdp/ANGKASA/Tool%201%20-%20TPOT/Assignment_5A_TPOT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 5 - TPOT

TPOT (Tree-based Pipeline Optimization Tool) is a Python library that automates the construction of machine learning pipelines. It uses genetic programming to search over candidate pipelines, so model selection and hyperparameter tuning no longer have to be done by hand, and users can cover a far larger space of pipelines than manual trial and error allows.

TPOT explores combinations of preprocessing steps, feature selection techniques, and machine learning models, scoring each candidate pipeline on the dataset at hand with cross-validation. Through genetic programming it evolves the pipelines over multiple generations, keeping and modifying the better-scoring ones so that the search moves toward the most effective configuration it can find. This speeds up model development and can surface high-performing pipelines that manual exploration might overlook.
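To make the idea of evolving candidates over generations concrete, here is a toy sketch in plain Python. It is not TPOT's implementation: the search space (a single integer `k`), the made-up fitness function, and the helper names `evolve`, `toy_fitness`, `toy_random`, and `toy_mutate` are all assumptions for illustration. TPOT's real search operates on whole pipelines and is considerably more elaborate.

```python
import random


def evolve(fitness, random_candidate, mutate, population_size=20, generations=5, seed=42):
    """Toy evolutionary search: keep the best half, refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Rank candidates from best to worst (higher fitness is better)
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # The best half survives unchanged, so the best score never gets worse
        survivors = population[: population_size // 2]
        # Refill the population with mutated copies of randomly chosen survivors
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical search space: one integer "hyperparameter" k in [1, 100].
# The made-up fitness peaks at k = 71.
def toy_fitness(k):
    return -abs(k - 71)


def toy_random(rng):
    return rng.randint(1, 100)


def toy_mutate(k, rng):
    # Nudge k by a small random step and clip it to the valid range
    return min(100, max(1, k + rng.randint(-5, 5)))


best_k, best_per_generation = evolve(toy_fitness, toy_random, toy_mutate)
print(best_k, best_per_generation)
```

Because the survivors are carried over unchanged, the best score recorded per generation can only stay flat or improve, which mirrors the plateau-then-improve pattern often seen in evolutionary search logs.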

TPOT supports both classification (`TPOTClassifier`) and regression (`TPOTRegressor`), so it applies to a wide range of predictive modeling tasks. This notebook uses `TPOTRegressor` to predict house prices.

## Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Install TPOT using the following command:

In [None]:
!pip install tpot

Collecting tpot
  Downloading TPOT-0.12.1-py3-none-any.whl (87 kB)
Collecting deap>=1.2 (from tpot)
  Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1 (from tpot)
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected 

In [None]:
!pip install --upgrade scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.3.2


## Importing any other necessary libraries

In [None]:
import pandas as pd
from tpot import TPOTClassifier
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

## Reading the file in chunks

In [None]:
file_path = '/content/drive/MyDrive/realtor-data.zip.csv'
chunk_size = 50000  # Adjust the chunk size based on your available RAM

# Collect the chunks in a list
chunks = []

for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    chunks.append(chunk)

# Concatenate chunks into a single DataFrame
data = pd.concat(chunks, axis=0, ignore_index=True)
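Reading in chunks only lowers peak memory if each chunk is reduced before it is stored; appending every chunk unchanged and concatenating ends up holding the full dataset anyway. The following is a minimal sketch of one way to do that, assuming it is acceptable to skip unused columns, store numbers as `float32`, and drop unusable rows per chunk. The inline CSV is a made-up stand-in for the real file, and `float32` represents integers exactly only up to about 16.7 million, so it may not suit every column.

```python
import io

import pandas as pd

# Small in-memory stand-in for the CSV file (hypothetical rows, real column names)
csv_text = """status,bed,bath,acre_lot,price
for_sale,3,2,0.12,105000
for_sale,4,2,,80000
for_sale,2,1,0.15,67000
for_sale,4,2,0.10,145000
for_sale,6,2,0.05,
"""

# Read only the columns that are needed, at half the default float width
dtypes = {"bed": "float32", "bath": "float32", "acre_lot": "float32", "price": "float32"}

kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes, chunksize=2):
    # Drop unusable rows before the chunk is stored
    kept.append(chunk.dropna(subset=["acre_lot", "price"]))

small = pd.concat(kept, ignore_index=True)
print(small.dtypes)
print(len(small))
```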

## Data Information

In [None]:
data.head()

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,,80000.0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,,67000.0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,,145000.0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,,,65000.0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 904966 entries, 0 to 904965
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   status          904966 non-null  object 
 1   bed             775126 non-null  float64
 2   bath            791082 non-null  float64
 3   acre_lot        638324 non-null  float64
 4   city            904894 non-null  object 
 5   state           904966 non-null  object 
 6   zip_code        904762 non-null  float64
 7   house_size      612080 non-null  float64
 8   prev_sold_date  445865 non-null  object 
 9   price           904895 non-null  float64
dtypes: float64(6), object(4)
memory usage: 69.0+ MB


## Data Cleaning and Pre-processing

In [None]:
columns_to_replace = ['bed', 'bath', 'house_size']
data[columns_to_replace] = data[columns_to_replace].fillna(0)

In [None]:
data = data.drop('prev_sold_date', axis=1)

In [None]:
data = data.drop('status', axis=1)

In [None]:
data = data.dropna(subset=['acre_lot', 'zip_code'])
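Before choosing between filling and dropping, it can help to see how much of each column is missing. The helper below is a sketch, not part of the original workflow: `missing_report` is a hypothetical name and the small `demo` frame is made-up data using column names from the dataset.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    report = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(2)}
    )
    # Worst-affected columns first
    return report.sort_values("missing", ascending=False)


demo = pd.DataFrame(
    {
        "bed": [3.0, np.nan, 2.0, 4.0],
        "acre_lot": [0.12, 0.08, np.nan, np.nan],
        "price": [105000.0, 80000.0, 67000.0, 145000.0],
    }
)
report = missing_report(demo)
print(report)
```

Calling `missing_report(data)` before and after the cleaning cells would show which columns were affected by each step.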

In [None]:
data.head(80)

Unnamed: 0,bed,bath,acre_lot,city,state,zip_code,house_size,price
0,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,105000.0
1,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,80000.0
2,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,67000.0
3,4.0,2.0,0.10,Ponce,Puerto Rico,731.0,1800.0,145000.0
4,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,0.0,65000.0
...,...,...,...,...,...,...,...,...
79,0.0,0.0,0.37,Aguada,Puerto Rico,602.0,0.0,95000.0
80,0.0,0.0,0.23,Rincon,Puerto Rico,602.0,0.0,329000.0
81,0.0,0.0,0.68,Aguada,Puerto Rico,602.0,0.0,130000.0
82,0.0,0.0,0.25,Aguada,Puerto Rico,602.0,0.0,94770.0


# 1. Hyperparameter Tuning with TPOT

## Splitting the Data into Features and Target

In [None]:
X = data.drop('price', axis=1)
y = data['price']

## Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Imputing missing values with the column mean

In [None]:
from sklearn.impute import SimpleImputer

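imputer = SimpleImputer(strategy='mean')
numeric_columns = X_train.select_dtypes(include=['number']).columns
X_train[numeric_columns] = imputer.fit_transform(X_train[numeric_columns])
# Apply the training-set means to the test set as well
X_test[numeric_columns] = imputer.transform(X_test[numeric_columns])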
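An imputer learns its fill values from the data it is fitted on, so the usual practice is to fit it on the training split and reuse those values for any held-out split. The sketch below shows this on two small made-up frames (`train` and `test` are hypothetical, using column names from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"bed": [2.0, 4.0, np.nan], "house_size": [800.0, np.nan, 1600.0]})
test = pd.DataFrame({"bed": [np.nan, 3.0], "house_size": [1000.0, np.nan]})

imputer = SimpleImputer(strategy="mean")
# fit_transform learns the column means from the training data only
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
# transform reuses those means; nothing is learned from the test data
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_)  # the learned per-column means
print(test_filled)
```

Here the missing `bed` value in `test` is filled with the training mean of `bed`, not with a statistic computed from `test` itself.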

## Dropping non-numeric columns

In [None]:
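# Drop the text columns from both splits so they keep the same features
non_numeric_columns = X_train.select_dtypes(exclude=['number']).columns
X_train = X_train.drop(columns=non_numeric_columns)
X_test = X_test.drop(columns=non_numeric_columns)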

## Running the TPOT

In [None]:
tpot = TPOTRegressor(verbosity=2, generations=5, population_size=20, random_state=42, n_jobs=-1)
tpot.fit(X_train, y_train)

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -380643391240.4049

Generation 2 - Current best internal CV score: -136255062103.94263

Generation 3 - Current best internal CV score: -76898490101.93436

Generation 4 - Current best internal CV score: -76898490101.93436

Generation 5 - Current best internal CV score: -75861861206.35892

Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=71, p=1, weights=distance)
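Two numbers in this log are easier to read with a little arithmetic. The sketch below assumes TPOT's documented defaults: `offspring_size` equal to `population_size`, and `neg_mean_squared_error` as the scoring function for `TPOTRegressor`. Under those assumptions, the total of 120 in the progress bar is the number of pipelines evaluated, and the negated CV score is a mean squared error whose square root gives an error in the same unit as `price`.

```python
import math

population_size = 20
generations = 5
# Assumption: offspring_size defaults to population_size
offspring_size = population_size

# Initial population plus one batch of offspring per generation
total_pipelines = population_size + generations * offspring_size

# Best internal CV score reported after generation 5 (negative MSE)
best_cv_score = -75861861206.35892
rmse = math.sqrt(-best_cv_score)

print(total_pipelines)
print(round(rmse))
```

The resulting RMSE is in the hundreds of thousands, which puts the very large negative scores into perspective: they are squared price errors, not a sign that the search is broken.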


## Evaluating and Exporting the model

In [None]:
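# Score on the same numeric columns the model was trained on
print(tpot.score(X_test[X_train.columns], y_test))
# Google Drive is mounted at /content/drive, not /content/gdrive
tpot.export('/content/drive/MyDrive/tpot_best_model.py')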

Imputing missing values in feature set


AttributeError: ignored

# 2. Feature selection using **TPOT**

In [None]:
tpot_fs = TPOTRegressor(verbosity=2, generations=5, population_size=20, random_state=42, n_jobs=-1,
                        config_dict='TPOT sparse', memory='auto', periodic_checkpoint_folder='/content/drive/MyDrive/tpot_checkpoint')

tpot_fs.fit(X_train, y_train)

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]



TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.


RuntimeError: ignored

## Getting the best pipeline

In [None]:
best_model_fs = tpot_fs.fitted_pipeline_

## Applying feature selection on the dataset


In [None]:
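# SelectFromModel needs a final estimator that exposes coef_ or feature_importances_
feature_selector = SelectFromModel(best_model_fs.steps[-1][1], prefit=True)
X_train_selected = feature_selector.transform(X_train)
# Use the same columns as the training set
X_test_selected = feature_selector.transform(X_test[X_train.columns])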

# 3. Conclusion

In conclusion, TPOT streamlines model development by automating the search for good pipelines. Genetic programming lets it explore a large space of preprocessing steps, feature selection methods, and machine learning algorithms, which helps users find high-performing models without extensive manual experimentation.

Automating feature preprocessing, model selection, and hyperparameter tuning is particularly helpful when manual experimentation would be slow. The search itself is computationally expensive, however, so on a large dataset such as this one the number of generations and the population size have to be kept small. With that caveat, TPOT makes advanced machine learning techniques more accessible and lets practitioners focus on interpreting and applying results rather than on algorithmic tuning.

In essence, TPOT offers an automated approach to model development that can improve productivity and help discover robust, high-performing models across many domains and applications.