# 🚀 HBase Crime Data Ingestion Notebook

## 📌 Objectives
- Create an HBase namespace (`practice`) and table (`practice:crimes`).
- Connect to HBase from Python using **HappyBase**.
- Preprocess crime data: clean & rename columns.
- Implement an efficient **RowKey strategy** (`YYYYMMDD_DR_NO`).
- Batch insert **500,000 crime records** into HBase.
---

## 1️⃣ Create Namespace & Table in HBase Shell
**Run these commands inside the HBase Shell before proceeding with Python.**

In [1]:
!echo "create_namespace 'practice'" | hbase shell
!echo "create 'practice:crimes', 'location', 'crime_info'" | hbase shell

'hbase' is not recognized as an internal or external command,
operable program or batch file.
'hbase' is not recognized as an internal or external command,
operable program or batch file.
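
The output above shows that the `hbase` executable was not found on the machine running the notebook. As a rough sketch, the same two commands can be sent from Python with a check for the executable first. The helper name `run_hbase_shell` and its `executable` parameter are illustrative assumptions, not part of HBase or the original notebook.

```python
import shutil
import subprocess

# The same shell commands used in the cell above
HBASE_COMMANDS = [
    "create_namespace 'practice'",
    "create 'practice:crimes', 'location', 'crime_info'",
]

def run_hbase_shell(commands, executable="hbase"):
    """Pipe commands to `hbase shell`; return False if the executable is missing."""
    path = shutil.which(executable)  # Looks the executable up on PATH
    if path is None:
        print(f"'{executable}' was not found on PATH; run the commands on the HBase host instead.")
        return False
    script = "\n".join(commands) + "\n"
    # Feed the commands on stdin, as the `echo ... | hbase shell` cell does
    subprocess.run([path, "shell"], input=script, text=True, check=True)
    return True
```

Calling `run_hbase_shell(HBASE_COMMANDS)` would then either create the namespace and table or report that the shell is unavailable, instead of failing silently.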


## 2️⃣ Connect to HBase using Python (HappyBase)
### **Install `happybase` if not installed:**
Run this in your terminal:
```
pip install happybase
```

In [2]:
import happybase

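# Connect to the HBase Thrift server (HappyBase talks to HBase over Thrift)
connection = happybase.Connection('localhost')  # Use 'hbase' if inside Docker
connection.open()  # No-op if the connection was already opened automatically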

# Access the practice:crimes table
table = connection.table('practice:crimes')

print("✅ Connected to HBase successfully!")

✅ Connected to HBase successfully!
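
`connection.table()` only builds a handle and does not check that the table exists, so the success message above does not confirm that `practice:crimes` was created. A possible check, sketched under the assumption that `Connection.tables()` reports namespaced tables as `namespace:table` (the helper name `table_exists` is hypothetical):

```python
def table_exists(connection, name):
    """Return True if `name` appears in the table list reported by HBase."""
    wanted = name.encode("utf-8") if isinstance(name, str) else name
    # Table names normally come back as bytes; normalise any str entries
    existing = [t if isinstance(t, bytes) else t.encode("utf-8")
                for t in connection.tables()]
    return wanted in existing
```

With the connection from the cell above, `table_exists(connection, 'practice:crimes')` should return `True` before any data is written.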


## 3️⃣ Data Preprocessing
- Load the dataset
- Rename columns (**convert to lowercase, replace spaces with `_`**)
- Convert date columns (`DATE OCC`, `Date Rptd`) to string format (`YYYYMMDD`).

In [3]:
import pandas as pd

# Load crime dataset
file_path = "Crime_Data_from_2020_to_Present.csv"
df = pd.read_csv(file_path)

# Rename columns for consistency
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Convert date columns to 'YYYYMMDD' format
df['date_occurred'] = pd.to_datetime(df['date_occ']).dt.strftime("%Y%m%d")
df['date_reported'] = pd.to_datetime(df['date_rptd']).dt.strftime("%Y%m%d")

print("✅ Data cleaned successfully!")

FileNotFoundError: [Errno 2] No such file or directory: 'Crime_Data_from_2020_to_Present.csv'
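
Because the CSV was not found, the cell above produced no cleaned data. The renaming and date conversion can still be tried on a tiny in-memory frame. The rows below are made-up illustrative values, and the timestamp layout in `date_format` is an assumption about how the file stores dates.

```python
import pandas as pd

# Two made-up rows that mimic the CSV's header style (illustrative values only)
sample = pd.DataFrame({
    "DR_NO": [190326475, 200106753],
    "Date Rptd": ["03/01/2020 12:00:00 AM", "02/09/2020 12:00:00 AM"],
    "DATE OCC": ["03/01/2020 12:00:00 AM", "02/08/2020 12:00:00 AM"],
    "Crm Cd Desc": ["VEHICLE - STOLEN", "BURGLARY FROM VEHICLE"],
})

# Same renaming rule as the notebook: trim, lowercase, spaces to underscores
sample.columns = (
    sample.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

date_format = "%m/%d/%Y %I:%M:%S %p"  # Assumed timestamp layout
sample["date_occurred"] = pd.to_datetime(
    sample["date_occ"], format=date_format
).dt.strftime("%Y%m%d")
sample["date_reported"] = pd.to_datetime(
    sample["date_rptd"], format=date_format
).dt.strftime("%Y%m%d")

print(list(sample.columns))
print(sample[["date_occurred", "date_reported"]])
```

Passing an explicit `format` is optional, but it avoids per-row format inference on a large file.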

## 4️⃣ Map Columns to Column Families
**Assign each column to its respective column family (`location`, `crime_info`).**

In [None]:
COLUMN_FAMILIES = {
    'location': ['location', 'cross_street', 'lat', 'lon'],
    'crime_info': ['dr_no', 'date_reported', 'date_occurred', 'area', 'crm_cd', 'crm_cd_desc', 'vict_age', 'vict_sex']
}

print("✅ Column mappings set!")
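
A mapped column that is missing from the DataFrame would raise a `KeyError` during insertion, so it can help to compare the mapping against the available columns first. This is a minimal sketch. The helper `missing_columns` and the example inputs are hypothetical.

```python
def missing_columns(available, mapping):
    """List (family, column) pairs from `mapping` that are absent from `available`."""
    available = set(available)
    return [
        (cf, col)
        for cf, cols in mapping.items()
        for col in cols
        if col not in available
    ]

# Hypothetical example: 'date_occurred' has not been derived yet
example_mapping = {"location": ["lat", "lon"], "crime_info": ["dr_no", "date_occurred"]}
example_columns = ["dr_no", "lat", "lon", "date_occ"]
print(missing_columns(example_columns, example_mapping))
# → [('crime_info', 'date_occurred')]
```

In the notebook this would be called as `missing_columns(df.columns, COLUMN_FAMILIES)`, and an empty list means every mapped column is present.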

## 5️⃣ Implement RowKey Strategy
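**Create a date-prefixed rowkey:** HBase stores rows sorted by rowkey, so a `YYYYMMDD` prefix keeps records from the same day adjacent and supports date-range scans, while the `DR_NO` suffix distinguishes records from the same day.
`YYYYMMDD_DR_NO` → Example: `20200301_190326475`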

In [None]:
# Create efficient rowkeys
df['rowkey'] = df['date_occurred'] + "_" + df['dr_no'].astype(str)
print("✅ RowKeys generated successfully!")
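
To illustrate why the date prefix matters, the sketch below builds keys in the same format and derives start and stop bounds for one month. Both helper names are hypothetical. The bounds could be passed, as bytes, to HappyBase's `Table.scan(row_start=..., row_stop=...)`.

```python
def make_rowkey(date_yyyymmdd, dr_no):
    """Build a rowkey in the YYYYMMDD_DR_NO format used above."""
    return f"{date_yyyymmdd}_{dr_no}"

def month_scan_bounds(year, month):
    """Return (start, stop) rowkey bounds for one calendar month; stop is exclusive."""
    start = f"{year:04d}{month:02d}"
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    stop = f"{next_year:04d}{next_month:02d}"
    return start, stop

key = make_rowkey("20200301", 190326475)
start, stop = month_scan_bounds(2020, 3)

# Rowkeys compare lexicographically, so every March 2020 key falls in [start, stop)
print(key, start <= key < stop)
```

One trade-off is that keys written in date order all land in the same key range, which can concentrate writes on a single region.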

## 6️⃣ Efficient Data Insertion into HBase
**Insert data with batching (`batch_size=1000`) & skip null values (`NA`).**

In [None]:
def push_to_hbase(table, df):
    # The context manager sends full batches as they fill and flushes the rest on exit
    with table.batch(batch_size=1000) as batch:
        for _, row in df.iterrows():
            hbase_data = {}

            for cf, cols in COLUMN_FAMILIES.items():
                for col in cols:
                    if pd.notna(row[col]):  # Only insert non-null values
                        # HBase stores raw bytes, so encode column names and values
                        column = f"{cf}:{col}".encode("utf-8")
                        hbase_data[column] = str(row[col]).encode("utf-8")

            batch.put(row['rowkey'].encode("utf-8"), hbase_data)

    print("✅ Data inserted into HBase successfully!")

# Push the first 500,000 rows to HBase
push_to_hbase(table, df.head(500000))
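
The per-row conversion can be checked without a running HBase instance by isolating it in a small function. This is a sketch over plain dictionaries. The names `is_missing` and `row_to_cells` and the sample row are hypothetical, and the check covers only `None` and float `NaN`, whereas `pd.notna` in the notebook also handles pandas-specific missing values.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def row_to_cells(row, column_families):
    """Convert one record into the {b'family:column': b'value'} dict passed to put()."""
    cells = {}
    for cf, cols in column_families.items():
        for col in cols:
            value = row.get(col)
            if not is_missing(value):  # Skip nulls, as the notebook does
                cells[f"{cf}:{col}".encode("utf-8")] = str(value).encode("utf-8")
    return cells

# Hypothetical record with one missing value
example_families = {"location": ["lat", "cross_street"], "crime_info": ["dr_no"]}
example_row = {"lat": 34.0375, "cross_street": float("nan"), "dr_no": 190326475}
print(row_to_cells(example_row, example_families))
# → {b'location:lat': b'34.0375', b'crime_info:dr_no': b'190326475'}
```

Skipped cells take no space in HBase, so sparse columns such as `cross_street` cost nothing for rows where they are empty.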

### Summary
- ✔ Created the HBase namespace and table, then connected from Python.
- ✔ Loaded the crime data and renamed columns for consistency.
- ✔ Built rowkeys in the `YYYYMMDD_DR_NO` format.
- ✔ Mapped columns to column families (`location`, `crime_info`).
- ✔ Inserted data in batches (`batch_size=1000`).

🚀 Once the `hbase` command and the CSV file are available, running this notebook end to end stores the first 500,000 crime records in HBase.
