# 🔄 ETL Raw → Silver
## Crime Data Pipeline

Transformation pipeline that moves raw data into the Silver layer.

**Goal**: Transform data from the Raw layer into the Silver layer through cleaning, validation, and feature engineering.

**Input**: `Data Layer/raw/data_raw.csv`  
**Output**: `Data Layer/silver/data_silver.csv`

In [None]:
# Initial setup
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime

# Locate the project root (two levels above the notebook's working directory)
PROJECT_ROOT = Path.cwd().parent.parent
RAW_PATH = PROJECT_ROOT / 'Data Layer' / 'raw' / 'data_raw.csv'
SILVER_PATH = PROJECT_ROOT / 'Data Layer' / 'silver' / 'data_silver.csv'

print(f"📁 Project: {PROJECT_ROOT}")
print(f"📥 Raw: {RAW_PATH}")
print(f"📤 Silver: {SILVER_PATH}")

📁 Project: c:\Users\David\Documents\UnB\SBD2\SBD2
📥 Raw: c:\Users\David\Documents\UnB\SBD2\SBD2\Data Layer\raw\data_raw.csv
📤 Silver: c:\Users\David\Documents\UnB\SBD2\SBD2\Data Layer\silver\data_silver.csv





In [2]:
# Load the Raw data
df = pd.read_csv(RAW_PATH)
print(f"✅ Raw data loaded: {len(df):,} records")
print(f"📋 Columns: {len(df.columns)}")
df.head(3)

✅ Raw data loaded: 50,000 records
📋 Columns: 28


Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,230608367,04/13/2023 12:00:00 AM,04/12/2023 12:00:00 AM,1900,6,Hollywood,678,1,510,VEHICLE - STOLEN,...,IC,Invest Cont,510.0,,,,800 N HOBART BL,,34.0852,-118.3051
1,200218154,12/03/2020 12:00:00 AM,12/03/2020 12:00:00 AM,850,2,Rampart,275,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",...,AO,Adult Other,230.0,,,,11TH ST,BURLINGTON AV,34.0488,-118.2775
2,201816346,08/27/2020 12:00:00 AM,05/22/2020 12:00:00 AM,1125,18,Southeast,1881,2,354,THEFT OF IDENTITY,...,IC,Invest Cont,354.0,,,,15600 BONSALLO AV,,33.8908,-118.2867
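Since the later cells index into many raw columns by name, a quick schema check right after loading can fail early with a readable message. The following is a minimal sketch, not part of the original pipeline: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the list simply collects the column names that the cleaning and transformation cells read.

```python
import pandas as pd

# Columns read by the cleaning and transformation cells (assumed list,
# gathered from the column names those cells reference).
REQUIRED_COLUMNS = [
    'DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'AREA', 'AREA NAME',
    'Rpt Dist No', 'Part 1-2', 'Crm Cd', 'Crm Cd Desc', 'Vict Age',
    'Vict Sex', 'Vict Descent', 'Premis Cd', 'Premis Desc',
    'Weapon Used Cd', 'Weapon Desc', 'Status', 'Status Desc',
    'LOCATION', 'LAT', 'LON',
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the raw file (hypothetical data).
toy = pd.DataFrame({'DR_NO': [1], 'DATE OCC': ['01/01/2020 12:00:00 AM']})
print(missing_columns(toy, required=['DR_NO', 'DATE OCC', 'Crm Cd']))
```

In the notebook, one could call `missing_columns(df)` after `pd.read_csv` and raise an error if the returned list is not empty.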


In [3]:
## Step 1: Data Cleaning

In [4]:
# Data cleaning
print("🧹 Applying cleaning...")

df_clean = df.copy()

# 1. Remove duplicate records (same report number)
df_clean = df_clean.drop_duplicates(subset=['DR_NO'])
print(f"   After removing duplicates: {len(df_clean):,}")

# 2. Remove rows with nulls in critical fields
df_clean = df_clean.dropna(subset=['DR_NO', 'DATE OCC', 'Crm Cd'])
print(f"   After removing critical nulls: {len(df_clean):,}")

# 3. Remove invalid coordinates (0,0)
df_clean = df_clean[(df_clean['LAT'] != 0) & (df_clean['LON'] != 0)]
print(f"   After removing invalid coordinates: {len(df_clean):,}")

# 4. Remove invalid ages (negative or above 120)
df_clean = df_clean[(df_clean['Vict Age'] >= 0) & (df_clean['Vict Age'] <= 120)]
print(f"   After removing invalid ages: {len(df_clean):,}")

# 5. Remove records missing essential fields
df_clean = df_clean.dropna(subset=['Crm Cd Desc', 'AREA NAME', 'Status'])
print(f"   After removing null essential fields: {len(df_clean):,}")

# 6. Keep only identified victims (known age, or sex recorded as M/F)
df_clean = df_clean[(df_clean['Vict Age'] > 0) | (df_clean['Vict Sex'].isin(['M', 'F']))]
print(f"   After identified-victim filter: {len(df_clean):,}")

# 7. Remove records without a location or premise description
df_clean = df_clean[~df_clean['LOCATION'].isna()]
df_clean = df_clean[~df_clean['Premis Desc'].isna()]
print(f"   After location filter: {len(df_clean):,}")

print(f"\n✅ Cleaning complete: {len(df):,} → {len(df_clean):,} ({100*len(df_clean)/len(df):.1f}%)")

🧹 Applying cleaning...
   After removing duplicates: 50,000
   After removing critical nulls: 50,000
   After removing invalid coordinates: 49,887
   After removing invalid ages: 49,882
   After removing null essential fields: 49,882
   After identified-victim filter: 38,424
   After location filter: 38,405

✅ Cleaning complete: 50,000 → 38,405 (76.8%)
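The cleaning cell repeats the same pattern for every rule: filter, then print the remaining row count. One way to keep that bookkeeping in a single place is sketched below. `apply_steps` is a hypothetical helper, not something the notebook defines, and the toy frame only borrows the raw file's column names.

```python
import pandas as pd

def apply_steps(frame, steps):
    """Apply (label, function) pairs in order, recording the row count after each."""
    log = []
    for label, step in steps:
        frame = step(frame)
        log.append((label, len(frame)))
    return frame, log

# Toy data (hypothetical), using the same column names as the raw file.
toy = pd.DataFrame({
    'DR_NO': [1, 1, 2, 3],
    'LAT': [34.0, 34.0, 0.0, 33.9],
    'LON': [-118.3, -118.3, 0.0, -118.2],
})

steps = [
    ('drop duplicates', lambda d: d.drop_duplicates(subset=['DR_NO'])),
    ('drop (0, 0) coordinates', lambda d: d[(d['LAT'] != 0) & (d['LON'] != 0)]),
]

cleaned, log = apply_steps(toy, steps)
for label, rows in log:
    print(f"{label}: {rows:,} rows")
```

Keeping the rules in a list also makes it easy to see which filter removes the most rows; in the run above that is the identified-victim filter, which takes the data from 49,882 to 38,424 records.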


In [5]:
## Step 2: Transformations and Feature Engineering

In [6]:
# Mappings from source codes to readable labels
descent_map = {
    'A': 'Other Asian', 'B': 'Black', 'C': 'Chinese', 'D': 'Cambodian', 
    'F': 'Filipino', 'G': 'Guamanian', 'H': 'Hispanic/Latino', 'I': 'American Indian', 
    'J': 'Japanese', 'K': 'Korean', 'L': 'Laotian', 'O': 'Other', 
    'P': 'Pacific Islander', 'S': 'Samoan', 'U': 'Hawaiian', 'V': 'Vietnamese', 
    'W': 'White', 'X': 'Unknown', 'Z': 'Asian Indian', '-': 'Unknown'
}

sex_map = {'M': 'Male', 'F': 'Female', 'X': 'Unknown', 'H': 'Unknown', '-': 'Unknown'}

def get_period(hour):
    # Bucket an hour (0-23) into a period of the day
    if 5 <= hour < 12: return 'Morning'
    elif 12 <= hour < 17: return 'Afternoon'
    elif 17 <= hour < 21: return 'Evening'
    else: return 'Night'

def get_crime_category(desc):
    # Keyword match on the description; the first matching group wins
    desc = str(desc).upper()
    if any(x in desc for x in ['HOMICIDE', 'RAPE', 'ROBBERY', 'ASSAULT', 'KIDNAP', 'BATTERY']): 
        return 'Violent Crime'
    elif any(x in desc for x in ['THEFT', 'BURGLARY', 'STOLEN', 'VEHICLE', 'SHOPLIFTING']): 
        return 'Property Crime'
    elif any(x in desc for x in ['VANDALISM', 'TRESPASS', 'DISTURBING']): 
        return 'Quality of Life'
    else: 
        return 'Other Crime'

def get_age_group(age):
    try:
        age = int(age)
        if age <= 0: return 'Unknown'
        elif age < 18: return '0-17'
        elif age < 26: return '18-25'
        elif age < 36: return '26-35'
        elif age < 51: return '36-50'
        elif age < 66: return '51-65'
        else: return '65+'  # label kept as-is; this branch covers ages 66 and up
    except (ValueError, TypeError):
        # Missing or non-numeric ages
        return 'Unknown'

def get_premise_category(desc):
    desc = str(desc).upper()
    if any(x in desc for x in ['DWELLING', 'RESIDENCE', 'HOUSE', 'APARTMENT', 'CONDOMINIUM']): 
        return 'Residential'
    elif any(x in desc for x in ['STREET', 'SIDEWALK', 'PARKING', 'ALLEY', 'PARK', 'BEACH']): 
        return 'Public'
    elif any(x in desc for x in ['STORE', 'SHOP', 'RESTAURANT', 'COMMERCIAL', 'OFFICE', 'BANK', 'MARKET']): 
        return 'Commercial'
    else: 
        return 'Other'

def get_weapon_category(desc):
    # A missing description is treated as "no weapon"
    desc = str(desc).upper() if pd.notna(desc) else ''
    if 'GUN' in desc or 'FIREARM' in desc or 'RIFLE' in desc or 'REVOLVER' in desc: 
        return 'Firearm'
    elif 'KNIFE' in desc or 'BLADE' in desc or 'CUTTING' in desc: 
        return 'Blade'
    elif 'BLUNT' in desc or 'CLUB' in desc or 'BAT' in desc: 
        return 'Blunt Object'
    elif 'STRONG-ARM' in desc or 'HANDS' in desc or 'FIST' in desc: 
        return 'Physical Force'
    elif desc == '' or desc == 'NAN': 
        return 'No Weapon'
    else: 
        return 'Other Weapon'

print("✅ Transformation functions defined")

✅ Transformation functions defined
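The bucketing functions above run once per row through `.apply`. For the numeric buckets, `pd.cut` can produce the same groups in one vectorized call. The sketch below mirrors the age thresholds of `get_age_group` under the assumption that ages are whole numbers; `AGE_BINS`, `AGE_LABELS`, and `age_groups` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Right-closed bins: (0, 17], (17, 25], (25, 35], (35, 50], (50, 65], (65, inf)
AGE_BINS = [0, 17, 25, 35, 50, 65, np.inf]
AGE_LABELS = ['0-17', '18-25', '26-35', '36-50', '51-65', '65+']

def age_groups(ages):
    """Vectorized age bucketing; non-numeric and non-positive ages become 'Unknown'."""
    numeric = pd.to_numeric(ages, errors='coerce')
    groups = pd.cut(numeric, bins=AGE_BINS, labels=AGE_LABELS, right=True)
    # Values outside every bin (<= 0 or missing) come back as NaN
    return groups.astype(object).fillna('Unknown')

sample = pd.Series([0, 5, 17, 18, 65, 66, -1])
print(age_groups(sample).tolist())
```

The two approaches can differ for fractional ages, because `get_age_group` truncates with `int()` while `pd.cut` compares the value as given.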


In [7]:
# Apply transformations
print("🔄 Applying transformations...")

# Reset the index so that Series assigned to `silver` (which gets a fresh
# 0..n-1 index) line up row by row; the cleaning filters leave gaps in the labels
df_clean = df_clean.reset_index(drop=True)

# Parse the occurrence date
df_clean['date_temp'] = pd.to_datetime(df_clean['DATE OCC'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')

# Create the Silver DataFrame
silver = pd.DataFrame()

# Identification
silver['crime_id'] = df_clean['DR_NO'].values

# Dates
silver['date_reported'] = pd.to_datetime(df_clean['Date Rptd'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce').dt.strftime('%Y-%m-%d')
silver['date_occurred'] = df_clean['date_temp'].dt.strftime('%Y-%m-%d')
silver['time_occurred'] = df_clean['TIME OCC'].astype(str).str.zfill(4).str[:2] + ':' + df_clean['TIME OCC'].astype(str).str.zfill(4).str[2:]

# Temporal features
silver['hour'] = df_clean['TIME OCC'].astype(str).str.zfill(4).str[:2].astype(int)
silver['day_of_week'] = df_clean['date_temp'].dt.dayofweek.values
silver['day_name'] = df_clean['date_temp'].dt.day_name().values
silver['period_of_day'] = silver['hour'].apply(get_period)

# Location
silver['area_code'] = df_clean['AREA'].values
silver['area_name'] = df_clean['AREA NAME'].values
silver['district_code'] = df_clean['Rpt Dist No'].values

# Crime
silver['crime_severity'] = df_clean['Part 1-2'].map({1: 'Serious', 2: 'Minor'}).values
silver['crime_code'] = df_clean['Crm Cd'].values
silver['crime_description'] = df_clean['Crm Cd Desc'].values
silver['crime_category'] = df_clean['Crm Cd Desc'].apply(get_crime_category).values

# Victim
silver['victim_age'] = df_clean['Vict Age'].values
silver['victim_age_group'] = df_clean['Vict Age'].apply(get_age_group).values
silver['victim_sex'] = df_clean['Vict Sex'].fillna('X').values
silver['victim_sex_desc'] = df_clean['Vict Sex'].map(sex_map).fillna('Unknown').values
silver['victim_descent'] = df_clean['Vict Descent'].fillna('X').values
silver['victim_descent_desc'] = df_clean['Vict Descent'].map(descent_map).fillna('Unknown').values

# Premise
silver['premise_code'] = df_clean['Premis Cd'].values
silver['premise_description'] = df_clean['Premis Desc'].values
silver['premise_category'] = df_clean['Premis Desc'].apply(get_premise_category).values

# Weapon
silver['weapon_code'] = df_clean['Weapon Used Cd'].values
silver['weapon_description'] = df_clean['Weapon Desc'].values
silver['weapon_category'] = df_clean['Weapon Desc'].apply(get_weapon_category).values

# Flags
silver['is_violent'] = (silver['crime_category'] == 'Violent Crime')
silver['has_weapon'] = (silver['weapon_category'] != 'No Weapon')

# Status
silver['status_code'] = df_clean['Status'].values
silver['status_description'] = df_clean['Status Desc'].values
silver['case_closed'] = df_clean['Status'].isin(['AA', 'JA']).values

# Coordinates
silver['latitude'] = df_clean['LAT'].values
silver['longitude'] = df_clean['LON'].values
silver['location'] = df_clean['LOCATION'].str.strip().values

# Temporal dimensions
silver['year'] = df_clean['date_temp'].dt.year.values
silver['month'] = df_clean['date_temp'].dt.month.values
silver['quarter'] = df_clean['date_temp'].dt.quarter.values

# Metadata
silver['collected_at'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

print(f"✅ Transformations applied: {len(silver.columns)} columns created")

🔄 Applying transformations...
✅ Transformations applied: 39 columns created
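Column assignment in pandas aligns a Series on index labels rather than on position, which matters whenever one DataFrame is built column by column from a filtered one, as in the cell above. The sketch below uses toy data and hypothetical names (`source`, `filtered`, `target`) to show the difference between assigning a Series and assigning its `.values`.

```python
import pandas as pd

# Toy stand-in for a filtered frame: the surviving rows keep labels 1 and 3.
source = pd.DataFrame({'id': [10, 20, 30, 40], 'hour': [1, 2, 3, 4]})
filtered = source[source['id'].isin([20, 40])]

target = pd.DataFrame()
target['id'] = filtered['id'].values                  # positional: index becomes 0, 1
target['hour_aligned'] = filtered['hour']             # matched on labels: 0, 1 vs 1, 3
target['hour_positional'] = filtered['hour'].values   # positional again

print(target)
```

In `hour_aligned`, row 0 is missing and row 1 holds the hour that belongs to id 20, even though that row's `id` is 40. Assigning `.values` everywhere, or calling `reset_index(drop=True)` on the filtered frame first, keeps the rows in step.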


In [8]:
# Save to the Silver layer (create the output folder if it does not exist)
SILVER_PATH.parent.mkdir(parents=True, exist_ok=True)
silver.to_csv(SILVER_PATH, index=False)

print("\n" + "="*50)
print("✅ ETL Raw → Silver complete!")
print("="*50)
print("\n📊 Summary:")
print(f"   Raw: {len(df):,} records")
print(f"   Silver: {len(silver):,} records")
print(f"   Reduction: {(1 - len(silver)/len(df))*100:.1f}%")
print(f"\n📁 File saved to: {SILVER_PATH}")


✅ ETL Raw → Silver complete!

📊 Summary:
   Raw: 50,000 records
   Silver: 38,405 records
   Reduction: 23.2%

📁 File saved to: c:\Users\David\Documents\UnB\SBD2\SBD2\Data Layer\silver\data_silver.csv
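Before downstream layers read the Silver file, a few sanity checks on the written columns can catch regressions in the cleaning rules. This is a minimal sketch under stated assumptions: `validate_silver` is a hypothetical helper, the checks are illustrative rather than the project's actual validation rules, and the toy frame only reuses a handful of the Silver column names.

```python
import pandas as pd

def validate_silver(frame):
    """Return a list of human-readable problems found in a Silver-style frame."""
    problems = []
    if frame['crime_id'].duplicated().any():
        problems.append('duplicate crime_id values')
    if frame['date_occurred'].isna().any():
        problems.append('missing date_occurred values')
    if not frame['hour'].between(0, 23).all():
        problems.append('hour outside 0-23')
    if ((frame['latitude'] == 0) | (frame['longitude'] == 0)).any():
        problems.append('zero coordinates')
    return problems

# Toy data (hypothetical) with one duplicate id and one missing date.
toy = pd.DataFrame({
    'crime_id': [1, 2, 2],
    'date_occurred': ['2020-01-01', None, '2020-01-03'],
    'hour': [0, 12, 23],
    'latitude': [34.0, 33.9, 34.1],
    'longitude': [-118.3, -118.2, -118.4],
})
print(validate_silver(toy))
```

An empty list means none of the checks fired. In the notebook, the same function could be applied to `silver` just before `to_csv`.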
