Overview:

Definition: Extrapolation means estimating a value that lies outside the range of the data you already have, based on the patterns observed within that range.
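To make the definition concrete, here is a minimal sketch on made-up numbers (not the dataset used below). It fits a straight line to five points and evaluates it once inside and once outside the observed range of x. The data and variable names are assumptions for illustration only.

```python
import numpy as np

# Toy data (made up): x covers 0..4 only, and y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Fit a straight line to the observed points (highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, deg=1)

# Interpolation: x = 2.5 lies inside the observed range
inside = slope * 2.5 + intercept
# Extrapolation: x = 10 lies outside the observed range
outside = slope * 10.0 + intercept

print(f"inside the range:  {inside:.2f}")
print(f"outside the range: {outside:.2f}")
```

The second value relies entirely on the assumption that the pattern seen in 0..4 continues beyond it, which is what makes extrapolation riskier than interpolation.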

Why I used these models and libraries for this task

I used Pandas and NumPy for data manipulation, missing-value detection, and preprocessing tasks such as feature transformation and cleaning. To fill the missing values, I used Ridge regression for the continuous variable yob and RandomForestClassifier for the categorical variables gender and zipcode. Ridge regression is a regularized linear model that limits overfitting and handles multicollinearity well, which makes it a reasonable choice for predicting numerical values. Random Forest is an ensemble model that captures feature interactions, works with both categorical and numerical inputs once they are encoded, and is fairly robust to noisy or imbalanced data. I used scikit-learn's ColumnTransformer, StandardScaler, and OneHotEncoder to build the preprocessing pipeline for the mixed feature types in the regression model, and LabelEncoder to encode the categorical variables for the classification models.

Why I used RMSE, accuracy, precision, and F1 score

Model performance was evaluated with metrics suited to each task: RMSE for regression, and accuracy, precision, and F1 score for classification. RMSE is expressed in the same unit as the target (years for yob), which makes it easy to interpret.
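As a small worked example of the regression metric, the sketch below computes RMSE by hand and through scikit-learn on three made-up birth years. The numbers are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up birth years, for illustration only
y_true = np.array([1980, 1990, 2000])
y_pred = np.array([1985, 1990, 1990])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value through scikit-learn's mean_squared_error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"manual:  {rmse_manual:.2f}")
print(f"sklearn: {rmse_sklearn:.2f}")
```

Because the errors are squared before averaging, one large miss (10 years here) weighs more than several small ones.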

How is yob predicted and filled?


To predict yob I used a Ridge regression model because it gave the best results of the models I tried.
The known yob values in the given data range from 1900 to 1999. The purpose of extrapolation was to obtain values beyond that range.
After filling the missing values, yob ranges from 1900 to 2001, which means the model now predicts some years later than the latest year observed in the data.
I also tried GradientBoostingRegressor, but its results were not satisfactory: it extended the range only into the past (the filled dataset ranged from 1895 to 1999), and its RMSE was 12.96. With Ridge regression the RMSE is around 10, so the results are better than before.
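One general difference between the two model families may be relevant here, although I have not verified that it explains the result above. A linear model keeps following its fitted trend outside the training range, while a tree ensemble returns the same value for every input beyond the last split point. The sketch below shows this on a made-up one-feature problem. It is a toy illustration under that assumption and does not use the notebook's data, whose features are mostly categorical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Toy data (made up): one feature in 0..10, target y = 2x
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Query points far outside the training range of the feature
X_far = np.array([[20.0], [30.0]])
ridge_far = ridge.predict(X_far)
gbr_far = gbr.predict(X_far)

print("max target seen in training:", y_train.max())
print("Ridge predictions:", ridge_far)
print("Gradient boosting predictions:", gbr_far)
```

In this toy case the Ridge predictions keep growing with x, while the gradient boosting predictions are identical for both query points.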

How is gender predicted and filled?

With the RandomForestClassifier model.
Why I used Random Forest:
1. It handles non-linear relationships well.
2. It works with categorical variables once they are label-encoded.
3. It is robust against overfitting (with max_depth and balanced class weights).

How I handled the zipcode column

RandomForestClassifier: zipcode behaves like a categorical variable, and Random Forest can capture patterns in location, demographics, or other attributes to infer missing zipcodes.
Some zipcode values were 0, so I had to perform two tasks on the zipcode column:
1. Replace the 0 values with missing values, because I considered them missing, and then predict them.
2. Keep the zipcode format consistent when predicting the missing values, because some zipcodes start with 0.
After the prediction I checked that the zipcode format had not changed.
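The two zipcode tasks above can be sketched on a few made-up values as follows. The helper name format_zipcode is hypothetical and not part of the notebook. Unlike the notebook code further down, which writes '00000' for a missing zipcode, this sketch keeps missing values as NaN.

```python
import numpy as np
import pandas as pd

# Made-up zipcodes as pandas would load them: floats, with 0 as a placeholder
zipcodes = pd.Series([94450.0, 5100.0, 0.0, np.nan, 6600.0])

# Task 1: treat the placeholder 0 as missing
zipcodes = zipcodes.replace(0, np.nan)


def format_zipcode(value):
    """Return a 5-character zipcode string, or NaN if the value is missing."""
    if pd.isna(value):
        return np.nan
    return str(int(float(value))).zfill(5)


# Task 2: restore leading zeros that were lost when the column was read as float
formatted = zipcodes.apply(format_zipcode)
print(formatted.tolist())
```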

Importing the libraries

In [73]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, precision_score, f1_score
import warnings

warnings.filterwarnings("ignore")

Loading the dataset

In [74]:
file_path = r"C:\Users\musta\Desktop\Costory_Anees_Ahmad\dataset\user_exo_2M.csv"
df = pd.read_csv(file_path)

To show first 20 rows of dataset

In [75]:
df.head(20)

Unnamed: 0,yob,domain,firstname,zipcode,gender
0,1985,@wanadoo.fr,b08253b305fb5ec,94450.0,F
1,1961,@sfr.fr,7ff135854376850,78580.0,M
2,1977,@free.fr,172522ec1028ab7,62640.0,
3,-1,@hotmail.com,d3ca5dde60f88db,94100.0,M
4,-1,@gmail.com,bdaae16837dd576,78100.0,
5,1975,@wanadoo.fr,57c2877c1d84c4b,92600.0,M
6,1974,@gmail.com,d47de916cacd0b7,65200.0,M
7,1957,@free.fr,7ff135854376850,83250.0,
8,-1,@live.fr,a5410ee37744c57,5100.0,M
9,-1,@wanadoo.fr,60784186ea5b29f,68300.0,


To find how many values are missing in the columns

Printing last 20 rows of dataset

In [76]:
df.tail(20)

Unnamed: 0,yob,domain,firstname,zipcode,gender
1999980,1964,@hotmail.com,0d0de813c110549,62240.0,M
1999981,1957,@gmail.com,8a94bdfc825df46,74000.0,F
1999982,1958,@gmail.com,b67ef00bdcc7f86,31700.0,
1999983,-1,@orange.fr,dee484ff7366319,92000.0,F
1999984,-1,@gmail.com,11619bf6d82bf8f,94460.0,
1999985,1999,@gmail.com,083af24243207a8,59260.0,F
1999986,1973,@free.fr,84675f2baf71400,13013.0,M
1999987,1994,@hotmail.fr,b55050b2f605b7c,77550.0,
1999988,-1,@gmail.com,c3dae848d72c51c,59710.0,F
1999989,-1,@sfr.fr,e409f05a10574ad,85200.0,


To identify the data type and the number of unique values of each column

In [77]:
unique_summary = df.nunique()
dtype_summary = df.dtypes

summary_df = pd.DataFrame({
    "Unique_Values": unique_summary,
    "Data_Type": dtype_summary
})

summary_df.sort_values(by="Unique_Values", ascending=False)

Unnamed: 0,Unique_Values,Data_Type
firstname,15682,object
zipcode,2315,float64
yob,85,int64
domain,43,object
gender,2,object


Showing rows with a missing zipcode

In [78]:
empty_zipcode_rows = df[df['zipcode'].isna() | (df['zipcode'].astype(str).str.strip() == "")]
 
# Print the first 20 such rows
empty_zipcode_rows.head(20)

Unnamed: 0,yob,domain,firstname,zipcode,gender
402,1977,@free.fr,91ab7b369d48cd0,,
419,-1,@orange.fr,d41d8cd98f00b20,,
2901,1968,@free.fr,d41d8cd98f00b20,,
4844,-1,@wanadoo.fr,d41d8cd98f00b20,,
5030,1970,@netcourrier.com,d41d8cd98f00b20,,M
12906,1977,@hotmail.com,6cb528d1b005724,,M
27470,1975,@aliceadsl.fr,2a4ac4d8e4ebdf7,,F
27497,1981,@yahoo.fr,9b22e8ac450bf8d,,
27948,-1,@free.fr,8f8c4ba92dab870,,M
28014,-1,@hotmail.com,c57f431343f100b,,


Replacing -1 in the 'yob' column with NaN: the column has no NaN values of its own, so I treat -1 as the missing-value marker and convert it to NaN so that it can be imputed properly.


In [79]:
# Replace the -1 placeholder in 'yob' with NaN (vectorized; the column becomes float)
if 'yob' in df.columns:
    df['yob'] = df['yob'].replace(-1, np.nan)

In [80]:
df.isnull().sum()

yob          800000
domain            0
firstname         0
zipcode       17194
gender       804954
dtype: int64

The code below treats 0 values in zipcode as missing by replacing them with NaN.

In [82]:
df['zipcode'] = df['zipcode'].apply(lambda x: np.nan if x == 0 or str(x).strip() == '00000' else x)

In [90]:
df.isnull().sum()

yob          800000
domain            0
firstname         0
zipcode       18897
gender       804954
dtype: int64

After treating the 0 values as missing, the number of missing zipcode values increases from 17194 to 18897.

Finding the columns in the dataset that contain missing values.

In [83]:
# Identify columns with missing values
columns_with_missing = df.columns[df.isna().any()].tolist()
columns_with_missing

['yob', 'zipcode', 'gender']

These 3 columns have missing values in our dataset.

For model performance evaluation
Regression evaluation is used for the column with numeric values.
Classification evaluation is used for the columns with categorical values.

In [None]:
# Helper functions that report how well a model performs.
# Regression is scored with RMSE; classification with accuracy, precision, and F1.
def evaluate_regression(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f" RMSE (yob): {rmse:.2f}")

def evaluate_classification(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    print(f" Accuracy: {acc:.4f}")
    print(f" Precision: {precision:.4f}")
    print(f" F1 Score: {f1:.4f}")
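These helpers score whatever predictions they are given. Further down, the imputation function passes them predictions made on the same rows the model was trained on, so the printed scores may be optimistic. A held-out split gives a more honest estimate. The sketch below shows the idea on synthetic stand-in data. The data, the 80/20 split, and the variable names are assumptions and do not come from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the notebook's dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 1970 + 5 * X[:, 0] + rng.normal(scale=10, size=500)

# Hold out 20% of the rows that have a known target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# Score on rows the model was fitted on, and on rows it has never seen
rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"RMSE on training rows: {rmse_train:.2f}")
print(f"RMSE on held-out rows: {rmse_test:.2f}")
```

The same split could be applied to the known rows of yob, gender, or zipcode before calling evaluate_regression or evaluate_classification.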


Using machine learning predictions


This part of the code prepares the data so that a machine learning model can predict and fill in the missing values of a specific column. First, it creates a copy of the dataset to avoid changing the original data. Then it separates the rows into two groups: one where the target column has known (non-missing) values, and one where the values are missing. If either group is empty, the function simply returns the copy unchanged. To run faster and use less memory, it randomly selects up to 50,000 rows from the known data for training. After that, it splits the known data into features (all other columns) and labels (the column to predict), and prepares the features of the unknown rows so the model can later make predictions for them.

It also fills any missing values in the input features, using the mode for text columns and the median for numeric columns.

If the target column is yob, which is numeric, it uses a regression model (Ridge) and builds a preprocessing pipeline that scales the numeric features and one-hot encodes the categorical features before training. For any other column (assumed categorical), it uses a classification model (Random Forest) and encodes the string features and the target labels with LabelEncoder. After the model is trained, the function prints the evaluation metrics, which are computed on the rows the model was trained on, predicts the missing values, and fills them back into the copy of the DataFrame. Finally, the updated DataFrame is returned. This function automates the imputation process by choosing the model based on the target column.

In [None]:
def predict_and_fill_extrapolation(df, target_column, sample_size=50000):
    df = df.copy()
    known = df[df[target_column].notna()]
    unknown = df[df[target_column].isna()]
    if known.empty or unknown.empty:
        return df

    if len(known) > sample_size:
        known = known.sample(n=sample_size, random_state=42)
            
    X_known = known.drop(columns=[target_column]) # keeping all columns except the target column as input features for the model
    y_known = known[target_column]
    X_unknown = unknown.drop(columns=[target_column])

    # Filling the missing data in features
    for col in X_known.columns:
        if X_known[col].isna().any() or X_unknown[col].isna().any():
            fill_val = X_known[col].mode()[0] if X_known[col].dtype == 'object' else X_known[col].median()
            X_known[col] = X_known[col].fillna(fill_val)
            X_unknown[col] = X_unknown[col].fillna(fill_val)

    if target_column == 'yob':
        # Separating the features by the types of column data
        numeric_features = X_known.select_dtypes(include=[np.number]).columns.tolist()
        categorical_features = X_known.select_dtypes(exclude=[np.number]).columns.tolist()

        # Preprocessor using columntransformer which allows you to apply different transformations to different columns at the same time
        # defining a preprocessing pipeline to prepare your features before sending them to the model
        preprocessor = ColumnTransformer(transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
        ])

        # Ridge Regression
        model = make_pipeline(preprocessor, Ridge(alpha=1.0)) #alpha=1.0 controls the regularization strength (1.0 is a standard starting point)
        model.fit(X_known, y_known)
        y_pred = np.round(model.predict(X_unknown)).astype(int)

        print(f"\nEvaluating Ridge regression model for: {target_column}")
        evaluate_regression(y_known, model.predict(X_known))

    else:
        # Classification: label-encode the text features
        for col in X_known.columns:
            if X_known[col].dtype == 'object':
                le = LabelEncoder()
                X_known[col] = le.fit_transform(X_known[col].astype(str))
                # Map categories not seen during training to the first known class
                unknown_str = X_unknown[col].astype(str)
                unknown_str = unknown_str.where(unknown_str.isin(le.classes_), le.classes_[0])
                X_unknown[col] = le.transform(unknown_str)

        le_target = LabelEncoder()
        y_encoded = le_target.fit_transform(y_known.astype(str))

        model = RandomForestClassifier(
            n_estimators=30,                     # number of trees in the forest
            max_depth=15,                        # maximum depth of each tree
            class_weight='balanced_subsample',   # reweight classes within each bootstrap sample
            random_state=42                      # fixed seed for reproducible results
        )
        model.fit(X_known, y_encoded)
        y_pred_encoded = model.predict(X_unknown)

        print(f"\nEvaluating classification model for: {target_column}")
        evaluate_classification(y_encoded, model.predict(X_known))

        valid_classes = list(le_target.classes_)
        fallback = valid_classes[0]
        y_pred = [valid_classes[idx] if 0 <= idx < len(valid_classes) else fallback for idx in y_pred_encoded]

    # Fill in the missing values
    df.loc[df[target_column].isna(), target_column] = y_pred
    return df
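In the classification branch above, a feature value that was not seen during training is mapped to the first known class, so it becomes indistinguishable from a real category. One possible alternative, shown here only as a sketch, is scikit-learn's OrdinalEncoder with a dedicated code for unseen values. This is not what the notebook does, it assumes scikit-learn 0.24 or later, and the two small DataFrames are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up feature rows, for illustration only
X_known = pd.DataFrame({"domain": ["@free.fr", "@gmail.com", "@free.fr"]})
X_unknown = pd.DataFrame({"domain": ["@gmail.com", "@example.org"]})

# Unseen categories get the code -1 instead of being mapped to a real class
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
known_codes = encoder.fit_transform(X_known)
unknown_codes = encoder.transform(X_unknown)

print(unknown_codes.ravel())
```

OrdinalEncoder also encodes several columns in one call, which would replace the per-column loop.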

Looping through all columns with missing data and filling them using the prediction model

In this section of code I apply the prediction to all columns with missing values, and I make sure that the values of the yob column have the int data type.
The zipcode format has 5 digits, so after the prediction every zipcode must still have 5 digits. In France many zipcodes start with 0, so if the prediction produces a value like 6600, the code writes it as 06600, in the correct format.
After the prediction, the missing values in the columns are filled.

In [None]:
# Fill the missing values in each column with the predicted values
df_filled = df.copy()
for col in columns_with_missing:
    print(f"\nFilling missing values in: {col}")
    df_filled = predict_and_fill_extrapolation(df_filled, col)

# Restore the 5-digit zipcode format (leading zeros) after prediction
if 'zipcode' in df_filled.columns:
    df_filled['zipcode'] = df_filled['zipcode'].apply(
        lambda x: str(int(float(x))).zfill(5) if pd.notna(x) else '00000'
    )
# Make sure the yob values are integers after prediction
if 'yob' in df_filled.columns:
    df_filled['yob'] = df_filled['yob'].fillna(df_filled['yob'].median()).astype(int)

# Save the final output to extrapolation_final_.csv in the dataset folder
output_path = r"C:\\Users\\musta\\Desktop\\Costory_Anees_Ahmad\\dataset\\extrapolation_final_.csv"
df_filled.to_csv(output_path, index=False)
print(f"\nAll missing values filled. File saved to: {output_path}")


Filling missing values in: yob

Evaluating Ridge regression model for: yob
 RMSE (yob): 10.39

Filling missing values in: zipcode

Evaluating classification model for: zipcode
 Accuracy: 0.2056
 Precision: 0.5527
 F1 Score: 0.2284

Filling missing values in: gender

Evaluating classification model for: gender
 Accuracy: 0.8647
 Precision: 0.8701
 F1 Score: 0.8647

All missing values filled. File saved to: C:\\Users\\musta\\Desktop\\Costory_Anees_Ahmad\\dataset\\extrapolation_final_.csv


Results:

During the process, missing values in the columns yob, zipcode, and gender were filled using machine learning models. Below is a summary and interpretation of the model performance for each column. All scores were computed on the rows used for training.
For yob: the RMSE is 10.39, which means a typical error of about 10 years.
For zipcode: accuracy is about 21%, precision is 55%, and the F1 score is 23%. In my view these results are acceptable, because few zipcode values are missing relative to the size of the data: in a dataset of 2M rows only about 19 thousand zipcodes are missing.

For gender: the model predicted gender with about 86% accuracy (precision 0.87, F1 score 0.86), showing reliable and balanced performance.

I also tried to predict these values with other models, such as HistGradientBoostingClassifier, GradientBoostingRegressor, linear regression, LightGBM (LGBM), and Random Forest, but their results were not satisfactory, especially for the zipcode column, where most of them were unable to predict the missing values. For the yob column in particular, their predictions did not meet the requirements. With Ridge I get better results for yob, and with RandomForestClassifier I get better results for gender and zipcode.
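A low accuracy on a target with thousands of classes, such as zipcode, is easier to judge next to a trivial baseline. The sketch below uses scikit-learn's DummyClassifier, which always predicts the most frequent class, on synthetic labels. The labels and their frequencies are made up, and the notebook itself does not compute this baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels (not the notebook's data): five classes, one more common
rng = np.random.default_rng(0)
labels = np.array(["75001", "13013", "69001", "31000", "59000"])
y = rng.choice(labels, size=1000, p=[0.4, 0.15, 0.15, 0.15, 0.15])
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))

print(f"Most-frequent-class accuracy: {baseline_acc:.4f}")
```

If a model's held-out accuracy is clearly above this baseline, it has learned something from the features. If it is close to it, the imputed values are little better than filling in the mode.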