<a href="https://colab.research.google.com/github/chhelomari/chatbot/blob/main/Untitled14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# I divided my work into the following parts:
# 1 - loaded the California housing dataset
# 2 - normalized the data
# 3 - split the data into training and testing sets
# 4 - trained a linear regression model to predict house prices
# 5 - evaluated the model using metrics such as mean squared error, ...
# 6 - used longitude and latitude to create new features to help the prediction
# 7 - used one-hot encoding for categorical features
# 8 - R^2 score
# 9 - genetic algorithm for model optimization
# 10 - model evaluation

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing(as_frame=True)
df = data.frame

print("Preview of the dataset:")
print(df.head())

print("\nBasic information:")
print(df.info())


Preview of the dataset:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422  

Basic information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrm

In [None]:
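# Separate the features from the target (median house value).
X = df.drop(columns=["MedHouseVal"])
y = df["MedHouseVal"]

# Min-max normalize every feature to the range [0, 1].
# Note: the minimum and maximum are computed on the full dataset before the
# split, so statistics from the test rows influence the scaling.
X = (X - X.min()) / (X.max() - X.min())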

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

print("\nData is split into training and testing sets:")
print(f"Training data size: {len(X_train)}")
print(f"Testing data size: {len(X_test)}")



Data is split into training and testing sets:
Training data size: 16512
Testing data size: 4128
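
The cell above normalizes with the minimum and maximum of the whole dataset before splitting. A common alternative is to learn the scaling from the training rows only and then reuse it on the test rows. The sketch below shows that idea in plain Python on a tiny made-up table; the helper names `fit_min_max` and `apply_min_max` and the sample numbers are assumptions for illustration, not part of the notebook. With scikit-learn, the same pattern would typically be `MinMaxScaler` fitted on `X_train` and applied to `X_test`.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and range from the training rows only."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # Guard against constant columns, whose range would be zero.
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    return mins, ranges


def apply_min_max(rows, mins, ranges):
    """Scale rows with statistics learned elsewhere (here: the training set)."""
    return [[(v - lo) / rng for v, lo, rng in zip(row, mins, ranges)]
            for row in rows]


# Made-up data: two features, three training rows, one test row.
train_rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test_rows = [[7.0, 15.0]]

mins, ranges = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, mins, ranges)
test_scaled = apply_min_max(test_rows, mins, ranges)

print(train_scaled)
print(test_scaled)
```

Test values can fall outside [0, 1] with this approach (the first test feature here does), which is expected: the test set is treated as unseen data.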


In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

print("\nModel training complete!")



Model training complete!


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
r2 = r2_score(y_test, y_pred)

print("\nEvaluation metrics:")
print(f"Mean squared error: {mse}")
print(f"Mean absolute error: {mae}")
print(f"Root mean squared error: {rmse}")
print(f"Mean absolute percentage error: {mape}%")
print(f"R² score: {r2}")



Evaluation metrics:
Mean squared error: 0.555891598695244
Mean absolute error: 0.533200130495656
Root mean squared error: 0.7455813830127761
Mean absolute percentage error: 31.952187413615075%
R² score: 0.5757877060324511
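
To make the metrics above concrete, here is a minimal sketch that computes the same five quantities by hand on four made-up values. The function name `regression_metrics` and the sample numbers are assumptions for illustration; the notebook itself uses the scikit-learn functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE, MAPE (in percent) and R² from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the true value, so it is undefined if any true value is 0.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100

    # R² compares the residual error with the error of always predicting the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Made-up targets and predictions.
metrics = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.5, 5.0, 4.0, 8.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```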


In [None]:
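# Restore the raw (unnormalized) longitude and latitude values.
# Note: this overwrites the normalized columns; no new feature is derived from them yet.
X["Longitude"] = df["Longitude"]
X["Latitude"] = df["Latitude"]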

print(X.head())


     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  0.539668  0.784314  0.043512   0.020469    0.008941  0.001499     37.88   
1  0.538027  0.392157  0.038224   0.018929    0.067210  0.001141     37.86   
2  0.466028  1.000000  0.052756   0.021940    0.013818  0.001698     37.85   
3  0.354699  1.000000  0.035241   0.021929    0.015555  0.001493     37.85   
4  0.230776  1.000000  0.038534   0.022166    0.015752  0.001198     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  
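
One way to turn longitude and latitude into a genuinely new feature is the distance from each block to a reference location. The sketch below uses the haversine formula for great-circle distance. The choice of San Francisco as the reference point, its approximate coordinates, and the function name `haversine_km` are assumptions for illustration and do not come from the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Assumed reference point (approximate city-centre coordinates).
SAN_FRANCISCO = (37.7749, -122.4194)

# First row of the dataset preview: Latitude 37.88, Longitude -122.23.
dist_to_sf = haversine_km(37.88, -122.23, *SAN_FRANCISCO)
print(f"Distance to San Francisco: {dist_to_sf:.1f} km")
```

On the full DataFrame, a column such as `dist_to_sf` could be built by applying this function to each pair of `Latitude` and `Longitude` values.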


In [None]:
# I first tried:
#     X = pd.get_dummies(X, columns=["ocean_proximity"])
#     print(X.head())
# but it raised an error, so I now check that the column exists before
# applying one-hot encoding.

print(df.columns)
# ocean_proximity would be a categorical feature, but this version of the
# dataset does not include it (see the printed columns), so the encoding is skipped.
if 'ocean_proximity' in X.columns:
    X = pd.get_dummies(X, columns=["ocean_proximity"])

print(X.head())


Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal'],
      dtype='object')
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  0.539668  0.784314  0.043512   0.020469    0.008941  0.001499     37.88   
1  0.538027  0.392157  0.038224   0.018929    0.067210  0.001141     37.86   
2  0.466028  1.000000  0.052756   0.021940    0.013818  0.001698     37.85   
3  0.354699  1.000000  0.035241   0.021929    0.015555  0.001493     37.85   
4  0.230776  1.000000  0.038534   0.022166    0.015752  0.001198     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  
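
Since this dataset has no categorical column, the encoding step above never runs. For reference, the sketch below shows what one-hot encoding does, using plain Python and a made-up `ocean_proximity` column. The category values and the helper name `one_hot` are hypothetical; `pd.get_dummies` would produce analogous `ocean_proximity_<category>` columns.

```python
def one_hot(values, prefix):
    """Return one dict per value, with a 0/1 indicator for every category seen."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical categorical column (not present in the dataset used here).
ocean_proximity = ["INLAND", "NEAR BAY", "INLAND"]

encoded = one_hot(ocean_proximity, prefix="ocean_proximity")
for row in encoded:
    print(row)
```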


In [None]:
# R^2 score: a value close to 1 means the model fits well; a value close to 0
# or below 0 means it fits poorly and the feature engineering needs improving.
r2 = r2_score(y_test, y_pred)

print(f"R² Score: {r2}")


R² Score: 0.5757877060324511


In [None]:
import random
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [None]:
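def fitness(individual, X_train, y_train, X_test, y_test):
    # Score an individual by the R² of a linear regression on the test set.
    # Note: `individual` is not used here. LinearRegression is fitted by
    # ordinary least squares and has no iterations or learning-rate setting,
    # so every individual receives the same score.
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    return r2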

In [None]:
def init_population(pop_size):
    # Each individual is a dict of two randomly drawn hyperparameters.
    return [{'iterations': random.randint(50, 200), 'learning_rate': random.uniform(0.01, 0.1)}
            for _ in range(pop_size)]

In [None]:
def select(population, X_train, y_train, X_test, y_test):
    # Rank the population by fitness (sorted in place) and keep the top half.
    population.sort(key=lambda x: fitness(x, X_train, y_train, X_test, y_test), reverse=True)
    return population[:len(population)//2]

In [None]:
def crossover(parent1, parent2):
    return {
        'iterations': random.choice([parent1['iterations'], parent2['iterations']]),
        'learning_rate': random.choice([parent1['learning_rate'], parent2['learning_rate']])
    }

In [None]:
def mutate(individual, mutation_rate=0.2):
    if random.random() < mutation_rate:
        individual['iterations'] = random.randint(50, 200)
        individual['learning_rate'] = random.uniform(0.01, 0.1)
    return individual

In [None]:
def genetic_algorithm(X_train, y_train, X_test, y_test, generations=10, pop_size=10, mutation_rate=0.2):
    population = init_population(pop_size)

    for generation in range(generations):
        selected = select(population, X_train, y_train, X_test, y_test)

        # Drop the last individual if the count is odd so that parents can be paired.
        if len(selected) % 2 != 0:
            selected.pop()

        next_generation = []

        for i in range(0, len(selected), 2):
            parent1, parent2 = selected[i], selected[i + 1]
            child = crossover(parent1, parent2)  # crossover
            mutated_child = mutate(child, mutation_rate)  # mutation
            next_generation.append(mutated_child)

        # Note: each pair of parents yields one child, so the population shrinks
        # every generation (10 -> 2 -> 0 with the defaults).
        population = next_generation

        # `best = selected[0]` raised "IndexError: list index out of range" here,
        # so I now check that `selected` is not empty before accessing it.
        if selected:
            best = selected[0]
            print(f"Generation {generation + 1}: Best R²: {fitness(best, X_train, y_train, X_test, y_test)}")
        else:
            print(f"Generation {generation + 1}: No valid individuals selected!")

    # Returns the best individual of the last generation, or None if it was empty.
    return selected[0] if selected else None


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

best_params = genetic_algorithm(X_train, y_train, X_test, y_test)
if best_params:
    print(f"Best parameters: {best_params}")
else:
    print("No valid parameters found.")

Generation 1: Best R²: 0.575787706032451
Generation 2: No valid individuals selected!
Generation 3: No valid individuals selected!
Generation 4: No valid individuals selected!
Generation 5: No valid individuals selected!
Generation 6: No valid individuals selected!
Generation 7: No valid individuals selected!
Generation 8: No valid individuals selected!
Generation 9: No valid individuals selected!
Generation 10: No valid individuals selected!
No valid parameters found.
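
The output above shows the population running empty after the first generation, and every individual scoring the same R². The sketch below is one possible way to address both points, under stated assumptions: it keeps the population size constant by refilling it with children (carrying the best individual over unchanged), and it scores each individual with a small gradient-descent regression in which `iterations` and `learning_rate` actually change the result. The synthetic data, the separate validation split, and helper names such as `make_data` and `fit_gd` are illustrative choices, not the notebook's method.

```python
import random


def make_data(n, rng):
    """Synthetic 1-D regression data: y = 3x + 1 plus small noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def fit_gd(xs, ys, iterations, learning_rate):
    """Fit y = w*x + b by batch gradient descent on the mean squared error."""
    w = b = 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def r2(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def fitness(ind, train, val):
    """R² on the validation data; depends on both genes of the individual."""
    w, b = fit_gd(train[0], train[1], ind["iterations"], ind["learning_rate"])
    return r2(val[1], [w * x + b for x in val[0]])


def random_individual(rng):
    return {"iterations": rng.randint(50, 200),
            "learning_rate": rng.uniform(0.01, 0.1)}


def crossover(p1, p2, rng):
    return {key: rng.choice([p1[key], p2[key]]) for key in p1}


def mutate(ind, rng, rate=0.2):
    # Return a fresh random individual with probability `rate`; never edit in place.
    return random_individual(rng) if rng.random() < rate else ind


def genetic_algorithm(train, val, generations=5, pop_size=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: fitness(ind, train, val),
                        reverse=True)
        parents = ranked[: pop_size // 2]
        history.append(fitness(parents[0], train, val))
        children = [parents[0]]  # elitism: keep the current best unchanged
        while len(children) < pop_size:  # refill to the original size
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = children
    best = max(population, key=lambda ind: fitness(ind, train, val))
    return best, history, len(population)


data_rng = random.Random(1)
train, val = make_data(40, data_rng), make_data(20, data_rng)
best, history, final_size = genetic_algorithm(train, val)
print("Best individual:", best)
print("Best R² per generation:", [round(h, 4) for h in history])
```

Scoring on a validation split rather than the test set keeps the test set untouched for the final evaluation.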


In [None]:
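if best_params is not None:
    # LinearRegression does not accept max_iter (or a learning rate), so the
    # tuned values cannot be passed to it; the model is refit with defaults.
    best_model = LinearRegression()
    best_model.fit(X_train, y_train)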

    y_pred = best_model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    print(f"Final R² score on the test set: {r2}")
else:
    print("Error: No valid parameters returned from the genetic algorithm")


Error: No valid parameters returned from the genetic algorithm
