# Lights in the Sky: Decoding UFO Sightings Patterns
## By Clay Bowser, December 7th 2024

## [Data supplied by: National UFO Reporting Center (NUFORC) · Anna Wolak](https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data)
## [Video walkthrough of this Notebook](https://youtu.be/XoPRliDiQy4)
### Description: This notebook explores the NUFORC UFO sightings dataset through data cleaning, summary statistics, and visualization. It examines trends in sighting durations and shapes, then trains a Random Forest classifier to predict whether a sighting occurred at night, defined here as hour 18 or later, or hour 5 or earlier. Most sightings happen at night (about 79% of the test split), which may indicate either that these phenomena avoid detection or that satellites and aircraft are easier to misidentify in low visibility.

## Import Libraries

In [1]:
# !pip install setuptools
# !pip install ydata-profiling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')


## Load Data and Perform Basic Statistics

In [2]:
# Load the CSV export and compute basic statistics
df = pd.read_csv('ufo_kaggle_complete.csv',
                 quoting=1,  # csv.QUOTE_ALL
                 escapechar='\\',  # Treat backslash as the escape character
                 on_bad_lines='skip'  # Skip malformed rows (for example, too many fields)
                )

# Coerce numeric columns; values that cannot be parsed become NaN
df['duration (seconds)'] = pd.to_numeric(df['duration (seconds)'], errors='coerce')
# Strip any stray 'q' characters from the coordinate strings before converting
df['latitude'] = pd.to_numeric(df['latitude'].astype(str).str.replace('q', ''), errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'].astype(str).str.replace('q', ''), errors='coerce')

# Show statistics for all numeric columns
print("Basic Statistics:")
print(df[['duration (seconds)', 'latitude', 'longitude']].describe())

# Show null values for all columns
print("\nNull Values Count:")
print(df.isnull().sum())

# Additional data insights
print("\nShape Categories:")
print(df['shape'].value_counts().head())

print("\nTop Countries:")
print(df['country'].value_counts().head())

print("\nDate Range:")
print(f"First sighting: {df['datetime'].min()}")
print(f"Last sighting: {df['datetime'].max()}")

Basic Statistics:
       duration (seconds)      latitude     longitude
count        8.867400e+04  88679.000000  88679.000000
mean         8.391920e+03     37.453033    -85.021836
std          5.911567e+05     11.572439     41.421744
min          0.000000e+00    -82.862752   -176.658056
25%          1.500000e+01     34.035000   -112.073333
50%          1.200000e+02     39.233333    -87.650000
75%          6.000000e+02     42.717817    -77.769738
max          9.783600e+07     72.700000    178.441900

Null Values Count:
datetime                    0
city                        0
state                    7409
country                 12365
shape                    2922
duration (seconds)          5
duration (hours/min)     3019
comments                   35
date posted                 0
latitude                    0
longitude                   0
dtype: int64

Shape Categories:
shape
light       17872
triangle     8489
circle       8453
fireball     6562
unknown      6319
Name: count, dtype

## Generate YData Profiler Report

In [3]:
# Generate YData Profiler Report
profile = ProfileReport(df, title="UFO Sightings Profiling Report")
profile.to_file("ufo_profile_report.html")



## Visualizations

In [15]:
# Data cleaning steps
df['duration (seconds)'] = pd.to_numeric(df['duration (seconds)'], errors='coerce')
df['latitude'] = pd.to_numeric(df['latitude'].astype(str).str.replace('q', ''), errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'].astype(str).str.replace('q', ''), errors='coerce')
df['datetime'] = pd.to_datetime(df['datetime'], format='mixed', errors='coerce')

# 1. Distribution of Sighting Durations
plt.figure(figsize=(10, 6))
df['duration (seconds)'].dropna().hist(bins=50, range=(0, df['duration (seconds)'].quantile(0.95)))
plt.title('Distribution of Sighting Durations\n(values up to the 95th percentile)')
plt.xlabel('Duration (seconds)')
plt.ylabel('Frequency')

# Add caption
plt.text(0.5, -0.15, "Half of reported sightings last 2 minutes or less; three quarters last 10 minutes or less.", 
         horizontalalignment='center', verticalalignment='center', 
         transform=plt.gca().transAxes, fontsize=10, color='green', wrap=True)

plt.tight_layout()
plt.savefig('ufo_durations.png', bbox_inches='tight')
plt.show()
plt.close()

# 2. Top 10 UFO Shapes
plt.figure(figsize=(10, 6))
df['shape'].value_counts().head(10).plot(kind='bar')
plt.title('Top 10 UFO Shapes')
plt.xlabel('Shape')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Add caption
plt.text(0.5, -0.15, "The most frequently reported shape is 'light'; some of these may be satellites or aircraft.", 
         horizontalalignment='center', verticalalignment='center', 
         transform=plt.gca().transAxes, fontsize=10, color='green', wrap=True)

plt.tight_layout()
plt.savefig('ufo_shapes.png', bbox_inches='tight')
plt.show()
plt.close()

# 3. Sightings Over Time
plt.figure(figsize=(10, 6))
df.dropna(subset=['datetime']).set_index('datetime').resample('Y').size().plot()
plt.title('Sightings Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Sightings')

# Add caption
plt.text(0.5, -0.15, "1995 saw a large increase in sightings, two years before the 1997 release of 'Men in Black'.", 
         horizontalalignment='center', verticalalignment='center', 
         transform=plt.gca().transAxes, fontsize=10, color='green', wrap=True)

plt.tight_layout()
plt.savefig('ufo_sightings_over_time.png', bbox_inches='tight')
plt.show()
plt.close()

![Graph of UFO durations.](ufo_durations.png)
![Graph of most common UFO shapes.](ufo_shapes.png)
![Graph of UFO sightings over time.](ufo_sightings_over_time.png)
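
The summary statistics earlier in the notebook show a heavy right tail in the durations: the median is 120 seconds, the mean is about 8,392 seconds, and the maximum is about 9.8e7 seconds. A linear histogram cut at the 95th percentile hides that tail. The following is a minimal sketch of counting durations in log-spaced bins instead. It runs on synthetic log-normal values as a stand-in for `df['duration (seconds)']`, and the helper name `log_bin_counts` and its `n_bins` parameter are hypothetical, not part of the original notebook.

```python
import numpy as np

def log_bin_counts(durations, n_bins=8):
    """Count positive, finite durations in logarithmically spaced bins."""
    values = np.asarray(durations, dtype=float)
    # A log scale cannot represent zero, negative, or missing durations
    values = values[np.isfinite(values) & (values > 0)]
    # geomspace returns edges evenly spaced on a log scale, endpoints included
    edges = np.geomspace(values.min(), values.max(), n_bins + 1)
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges

rng = np.random.default_rng(42)
# Synthetic stand-in for the duration column; NOT the NUFORC data
sample = rng.lognormal(mean=np.log(120), sigma=2.0, size=1000)
sample = np.append(sample, [0.0, np.nan])  # these two values are filtered out

counts, edges = log_bin_counts(sample)
for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low:12.2f} to {high:12.2f} s: {count}")
```

To draw the same bins with matplotlib, one option is to pass `bins=edges` to `plt.hist` and call `plt.xscale('log')`.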

## Preprocessing Pipeline

In [5]:
# Separate features
numeric_features = ['latitude', 'longitude', 'duration (seconds)']
categorical_features = ['shape', 'state', 'country']

# Create preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(drop='first', 
                            sparse_output=False,
                            handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

## Feature Engineering

In [6]:
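# Derive calendar features from the parsed timestamp
df['hour'] = df['datetime'].dt.hour
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year
# Night is defined as hour >= 18 or hour <= 5. Rows whose datetime failed to
# parse have a missing hour, so both comparisons are False and they are
# labeled as not-night.
df['is_night'] = (df['hour'] >= 18) | (df['hour'] <= 5)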
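
One thing to watch with this flag: if a timestamp failed to parse, its hour is missing, and both comparisons evaluate to False, so the row is counted as a daytime sighting. The sketch below shows one possible way to keep those rows as missing instead. The helper `night_flag` and its `start` and `end` parameters are hypothetical; the thresholds 18 and 5 are the ones used in the notebook, and the sample timestamps are made up.

```python
import pandas as pd

def night_flag(hours, start=18, end=5):
    """Return a nullable boolean Series: True at night, <NA> where the hour is missing."""
    hours = pd.Series(hours, dtype='float64')
    flag = (hours >= start) | (hours <= end)
    # Switch to the nullable 'boolean' dtype and blank out rows with no hour
    return flag.astype('boolean').mask(hours.isna())

# Made-up timestamps; the last one cannot be parsed and becomes NaT
timestamps = pd.to_datetime(
    pd.Series(['2004-10-10 21:30', '2004-10-11 03:15',
               '2004-10-11 12:00', 'not a date']),
    errors='coerce',
)

flags = night_flag(timestamps.dt.hour)
print(flags)
print(f"Rows with an unknown hour: {flags.isna().sum()}")
```

Rows where the flag is missing could then be excluded before training, for example with `df.loc[flags.notna()]`, rather than being silently assigned to the daytime class.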

## Model Building and Evaluation

In [7]:
# Let's predict if a sighting occurred at night
X = df[numeric_features + categorical_features]
y = df['is_night']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print("\nModel Performance:")
print(classification_report(y_test, y_pred))


Model Performance:
              precision    recall  f1-score   support

       False       0.33      0.18      0.23      3724
        True       0.81      0.91      0.85     14012

    accuracy                           0.75     17736
   macro avg       0.57      0.54      0.54     17736
weighted avg       0.71      0.75      0.72     17736
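
One way to read this report is to compare it with a trivial baseline that always predicts the more common class. The sketch below derives that baseline from the support counts printed above. It uses only those reported numbers and does not retrain anything.

```python
# Support counts and accuracy copied from the classification report above
support = {False: 3724, True: 14012}
model_accuracy = 0.75

total = sum(support.values())
majority_label = max(support, key=support.get)
# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = support[majority_label] / total

print(f"Test rows: {total}")
print(f"Majority class: is_night={majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Reported model accuracy: {model_accuracy:.2f}")
```

Always predicting night would score about 0.79, which is above the reported 0.75, and the recall of 0.18 on the `False` class shows that daytime sightings are rarely identified. Possible next steps, offered as suggestions rather than tested fixes, include comparing against `sklearn.dummy.DummyClassifier(strategy='most_frequent')` on the same split and passing `class_weight='balanced'` to `RandomForestClassifier`.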

