<a href="https://colab.research.google.com/github/charoo-rumsan/community_tool_R-D/blob/main/Citizenship_validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from sklearn.model_selection import train_test_split

In [None]:
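# Step 1: Helper functions
max_len = 8  # digit count after removing dashes (1 + 6 + 1 in the synthetic IDs)

def preprocess(cid):
    """Encode an ID as a fixed-length list of max_len digits."""
    cid = str(cid).replace("-", "").strip()
    cid = cid[:max_len].ljust(max_len, "0")  # truncate long IDs, zero-pad short ones
    return [int(d) if d.isdigit() else 0 for d in cid]  # non-digit characters map to 0

def corrupt_cid(cid):
    """Return a corrupted copy of a valid ID using one randomly chosen strategy."""
    cid = cid.replace("-", "")
    choices = ["short", "long", "wrong_province", "random_digit"]
    choice = np.random.choice(choices)
    if choice == "short":
        return cid[:-1]  # drop the last digit
    elif choice == "long":
        # randint's upper bound is exclusive, so use 10 to allow the digit 9
        return cid + str(np.random.randint(0, 10))
    elif choice == "wrong_province":
        # valid provinces are 1-7, so draw 8 or 9
        return str(np.random.randint(8, 10)) + cid[1:]
    elif choice == "random_digit":
        idx = np.random.randint(0, len(cid))
        # note: the replacement digit can equal the original one
        return cid[:idx] + str(np.random.randint(0, 10)) + cid[idx+1:]
    return cid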



In [None]:
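# Step 2: Generate synthetic dataset for training
# 7 provinces x 10 final digits x 100 samples = 7,000 valid IDs
# (randint's upper bound is exclusive, so 1000000 covers all 6-digit serials)
valid_cids = [f"{p}-{np.random.randint(100000, 1000000)}-{d}"
              for p in range(1, 8) for d in range(0, 10) for _ in range(100)]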

invalid_cids = [corrupt_cid(cid) for cid in valid_cids]

all_cids = valid_cids + invalid_cids
labels = [1]*len(valid_cids) + [0]*len(invalid_cids)  # 1=Valid, 0=Invalid

X = np.array([preprocess(cid) for cid in all_cids])
y = np.array(labels)
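
Because `preprocess` truncates and zero-pads every ID to a fixed length, some corrupted IDs can end up with exactly the same encoding as the valid ID they came from while carrying the opposite label. The sketch below is one way to count such cases. `encode` mirrors the notebook's `preprocess`, while `count_collisions` and the sample IDs are hypothetical names and values chosen for illustration.

```python
MAX_LEN = 8  # same fixed length the notebook's preprocess() uses

def encode(cid):
    """Mirror of preprocess(): strip dashes, then truncate or zero-pad to MAX_LEN."""
    cid = str(cid).replace("-", "").strip()
    cid = cid[:MAX_LEN].ljust(MAX_LEN, "0")
    return tuple(int(d) if d.isdigit() else 0 for d in cid)

def count_collisions(valid_ids, corrupted_ids):
    """Count corrupted IDs whose encoding equals that of their source valid ID."""
    return sum(encode(v) == encode(c) for v, c in zip(valid_ids, corrupted_ids))

# Hand-made pairs (illustrative values, not taken from the dataset)
valid = ["1-234567-8", "1-234567-8", "1-234567-0", "1-234567-8"]
corrupted = [
    "123456789",  # "long": the extra digit is truncated away
    "1234567",    # "short": zero-padded, so it differs from ...8
    "1234567",    # "short" on an ID ending in 0: padding restores it
    "82345678",   # "wrong_province": the first digit changes
]

collisions = count_collisions(valid, corrupted)
print(f"{collisions} of {len(valid)} corrupted IDs encode identically to their source")
```

In the notebook this could be called as `count_collisions(valid_cids, invalid_cids)`. A large count would mean many samples have identical features but opposite labels, which may be one reason accuracy levels off early.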

In [None]:
#Step 3: Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
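# Step 4: Build LSTM model
model = Sequential([
    # input_length is deprecated in Keras 3; the sequence length is inferred from the data
    Embedding(input_dim=10, output_dim=8),  # one 8-dim vector per digit 0-9
    LSTM(32),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid")  # probability that the ID is valid
])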

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2, verbose=1)

Epoch 1/5
280/280 ━━━━━━━━━━━━━━━━━━━━ 4s 5ms/step - accuracy: 0.5611 - loss: 0.6805 - val_accuracy: 0.7219 - val_loss: 0.5844
Epoch 2/5
280/280 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7271 - loss: 0.5506 - val_accuracy: 0.7295 - val_loss: 0.5421
Epoch 3/5
280/280 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7158 - loss: 0.5521 - val_accuracy: 0.7304 - val_loss: 0.5452
Epoch 4/5
280/280 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7276 - loss: 0.5408 - val_accuracy: 0.7304 - val_loss: 0.5408
Epoch 5/5
280/280 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7291 - loss: 0.5390 - val_accuracy: 0.7308 - val_loss: 0.5402

<keras.src.callbacks.history.History at 0x7bc7c72e8350>
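
Step 3 sets aside `X_test` and `y_test`, but the training cell only reports validation accuracy. The following is a minimal sketch of how held-out predictions could be summarized with NumPy alone. `summarize_predictions` is a hypothetical helper, and the toy arrays stand in for `model.predict(X_test)` and `y_test`; they are not results from this model.

```python
import numpy as np

def summarize_predictions(probs, y_true, threshold=0.5):
    """Return accuracy and confusion counts for binary predictions."""
    probs = np.asarray(probs).ravel()
    y_true = np.asarray(y_true).ravel()
    y_pred = (probs > threshold).astype(int)  # 1 = Valid, 0 = Invalid
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {"accuracy": (tp + tn) / len(y_true), "tp": tp, "tn": tn, "fp": fp, "fn": fn}

# Toy values for illustration only
probs = np.array([0.9, 0.8, 0.3, 0.6, 0.1])
y_true = np.array([1, 1, 1, 0, 0])

summary = summarize_predictions(probs, y_true)
print(summary)
```

With the trained model, the call would look like `summarize_predictions(model.predict(X_test, verbose=0), y_test)`. The false-positive count is the figure to watch here, since it shows how many corrupted IDs the model accepts as valid.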

In [None]:
#Step 5: Save model for reuse
model.save("nepali_cid_validator.h5")



In [None]:
#Step 6: Function to classify new citizenship numbers
def predict_cid(cid):
    x = np.array([preprocess(cid)])
    pred = model.predict(x, verbose=0)[0][0]
    return "Valid" if pred > 0.5 else "Invalid"

In [None]:

#Step 7: Apply to a CSV file

df = pd.read_csv("/content/first_100_rows (1) - first_100_rows (1).csv.csv")

df["label"] = df["Citizenship Number of House Owner (नागरिकता नं)"].apply(predict_cid)

In [None]:
# Save result
df[["Citizenship Number of House Owner (नागरिकता नं)", "label"]].to_csv("validated_citizenships.csv", index=False)

print("✅ Validation complete! Check 'validated_citizenships.csv'")

✅ Validation complete! Check 'validated_citizenships.csv'


METHOD 2

Validation Rules
The validation logic checks whether a citizenship number matches one of the formats listed below and then applies plausibility checks (valid district codes and years). The specific rules are:

1. Empty or Invalid Input Check:
If the input is missing (NaN), not a string, or empty after stripping whitespace, it is invalid.
Rationale: Empty or malformed entries (e.g., non-string data) cannot be valid citizenship numbers.
Example: Row 90 (Nankala Odd) has an empty citizenship number → Invalid.


2. Normalization:
Replace / with - to standardize formats for regex matching.
Rationale: The data uses / (e.g., 753029/563), but the standard format often uses -. Normalizing ensures consistency.
Example: 753029/563 becomes 753029-563.
Note: The normalized string is used only for the standard format check (rule 3). The slash format check (rule 4) runs on the original string.


3. Standard Format Check (\d{1,2}-\d{1,2}-\d{2,3}-\d{4,7}):
Matches: 1-2 digits (district), 1-2 digits (VDC/municipality), 2-3 digits (year), 4-7 digits (serial).
Additional checks:
District Code: Must be 01-77 (Nepal has 77 districts).
Year: Must be 01-82 (BS years 2001-2082, roughly 1944-2025 AD, as 2082 BS ≈ 2025 AD).
Rationale: This is a common format for Nepali citizenship numbers, though it does not appear in the sample. It is included for robustness.
Example: A hypothetical 12-34-56-12345 passes because district 12 and year 56 are both in range.


4. Slash Format Check (\d{4,8}/\d{3,5}):
Matches: 4-8 digits before /, 3-5 digits after /.
Additional check:
Year: Take the last 2 digits of the first part (e.g., 753020 → 20) as the year. Must be 01-82.
Rationale: This format is prevalent in the data (e.g., 753020/506, 1411/695). The year check keeps the number within a plausible issuance range.
Example: 753020/506 → Year 20 (2020 BS ≈ 1963 AD, plausible for older citizens) → Valid.


5. Plain Digits Check (\d{10,12}):
Matches: 10-12 digits with no separators.
Rationale: Some records might lack separators due to data entry variations. 10-12 digits is a reasonable length for a complete citizenship number.
Example: 123456789012 would pass, but 2862 (Row 89) fails (too short).


6. Invalid Cases:
Numbers that don't match any pattern.
Numbers with invalid district codes or years.
Short numbers (e.g., 189, 1574) that don't meet the minimum length for plain digits.
Rationale: These are likely errors or incomplete entries.
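
The 01-82 year window in the rules above rests on the approximate offset between BS and AD years. The sketch below works through that arithmetic under two assumptions: two-digit years belong to the 2000s BS, and the offset is a flat 57 years, which matches the text's 2082 BS ≈ 2025 AD (the real gap is 56 or 57 depending on the month). The function names are hypothetical.

```python
BS_CENTURY = 2000   # assumption: two-digit years are 20xx BS
BS_AD_OFFSET = 57   # assumption: flat offset; the real gap is 56 or 57 by month

def two_digit_bs_to_ad(yy):
    """Approximate AD year for a two-digit BS year, or None if outside 01-82."""
    if not 1 <= yy <= 82:
        return None
    return BS_CENTURY + yy - BS_AD_OFFSET

def slash_format_year(cid):
    """Two-digit year taken from the last two digits before the slash."""
    return int(cid.split("/")[0][-2:])

for cid in ["753020/506", "1411/695", "733908/925"]:
    yy = slash_format_year(cid)
    print(f"{cid}: year {yy:02d} -> about {two_digit_bs_to_ad(yy)} AD")
```

For 753020/506 this gives year 20 and about 1963 AD, in line with the example in the slash format rule.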

In [None]:
import pandas as pd
import re

# Load the CSV file (all columns as strings to avoid type issues)
df = pd.read_csv('/content/first_100_rows (1) - first_100_rows (1).csv.csv', dtype=str)

# Column that holds the citizenship number
citizenship_col = 'Citizenship Number of House Owner (नागरिकता नं)'

def is_valid_citizenship(num):
    """Return True if num matches an accepted format and passes its range checks."""
    # Missing, non-string, or blank input is invalid
    if pd.isna(num) or not isinstance(num, str) or num.strip() == '':
        return False

    # Normalize / to - (used only by the standard format check)
    num_normalized = num.replace('/', '-')

    # Standard format: district-VDC-year-serial, e.g. XX-XX-XX-XXXXX
    if re.match(r'^\d{1,2}-\d{1,2}-\d{2,3}-\d{4,7}$', num_normalized):
        parts = num_normalized.split('-')
        district = int(parts[0])
        year = int(parts[2])
        if 1 <= district <= 77 and 1 <= year <= 82:
            return True

    # Slash format common in the data: XXXX/XXX
    if re.match(r'^\d{4,8}/\d{3,5}$', num):
        # The regex guarantees at least 4 digits, so the last 2 are always available
        year = int(num.split('/')[0][-2:])
        if 1 <= year <= 82:
            return True

    # Plain digits: 10-12 long
    if re.match(r'^\d{10,12}$', num):
        return True

    return False

# Check that the column exists before validating
if citizenship_col not in df.columns:
    print(f"Error: Column '{citizenship_col}' not found in the dataset.")
else:
    # Apply validation
    df['valid_citizenship'] = df[citizenship_col].apply(is_valid_citizenship)

    # Create new dataframe with only the required columns
    output_df = df[[citizenship_col, 'valid_citizenship']]

    # Display summary
    print("\nValidation Summary:")
    print(output_df)

    # Count valid/invalid
    n_valid = df['valid_citizenship'].sum()
    print(f"\nTotal rows: {len(df)}")
    print(f"Valid citizenship numbers: {n_valid}")
    print(f"Invalid or empty: {len(df) - n_valid}")

    # Save new CSV with only citizenship and validity columns
    output_df.to_csv('citizenship_validity.csv', index=False)
    print("\nNew CSV saved as 'citizenship_validity.csv' with only citizenship and validity columns")


Validation Summary:
   Citizenship Number of House Owner (नागरिकता नं)  valid_citizenship
0                                          353/044              False
1                                        12-035-36              False
2                                       752002/321               True
3                                       733012/902               True
4                                       733908/925               True
..                                             ...                ...
95                                             NaN              False
96                                             NaN              False
97                                     75/69742601              False
98                                     752020/1799               True
99                                            1574              False

[100 rows x 2 columns]

Total rows: 100
Valid citizenship numbers: 22
Invalid or empty: 78

New CSV saved as 'citizenship_validity.csv' wi
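
The validator can also be spot-checked without the CSV. The sketch below restates the same three patterns as a standalone function and runs it on values that appear in the summary output above. `validate` is a hypothetical name, and a plain `isinstance` check stands in for `pd.isna`, so this is an approximation of the notebook's function and not a replacement for it.

```python
import re

def validate(num):
    """Standalone restatement of the checks in is_valid_citizenship."""
    # Missing, non-string, or blank input is invalid
    if not isinstance(num, str) or num.strip() == "":
        return False
    normalized = num.replace("/", "-")
    # Standard format: district-VDC-year-serial
    if re.match(r"^\d{1,2}-\d{1,2}-\d{2,3}-\d{4,7}$", normalized):
        parts = normalized.split("-")
        if 1 <= int(parts[0]) <= 77 and 1 <= int(parts[2]) <= 82:
            return True
    # Slash format: year is the last 2 digits before the slash
    if re.match(r"^\d{4,8}/\d{3,5}$", num):
        if 1 <= int(num.split("/")[0][-2:]) <= 82:
            return True
    # Plain digits: 10-12 long
    return bool(re.match(r"^\d{10,12}$", num))

samples = ["353/044", "12-035-36", "752002/321", "75/69742601",
           "752020/1799", "1574", None]
results = {s: validate(s) for s in samples}
for value, ok in results.items():
    print(f"{value!r}: {ok}")
```

Checks like these make it easy to see why a row was rejected. For example, 353/044 fails only because the part before the slash has three digits, one short of the four the slash pattern requires.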