<img src="Ufo-icon.png" height="256" width="256" style="float: right">

# A Needle in a Data Haystack - Final Project
## Matan Cohen, Nir Schipper & Ran Shaham
### Exploring UFO sightings data (or - ARE ALIENS REAL?)

#### Contents
<span id="the-top" />
We explore UFO sightings data by:
- [Shapes](#Shapes)
- [Duration](#Duration)
- [Location](#Locations)
- [Time of day](#Time)

If running interactively, start by opening the `Cell` menu and clicking `Run All`. This runs all the code in the notebook and renders the graphs and widgets.

### Initialization

In [None]:
# imports
# standard library
import re
import json
import html
import datetime as dt
# 3rd party
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import ipywidgets as widgets

In [None]:
# constants
UFO_FILE = 'scrubbed.csv'
DRONES_FILE = 'drones_google.csv'
POP_DENSITY_FILE = 'pop_density.csv'
POP_ESTIMATE_FILE = 'pop_estimate.csv'
NAME_TO_CODE_FILE = 'state_name_to_code.json'
# column names
DATETIME = 'datetime'
YEAR = 'year'
MONTH = 'month'
DAY = 'day'
HOUR = 'hour'
COUNTRY = 'country'
STATE = 'state'
SHAPE = 'shape'
COUNT = 'count'
DUR_HOURS = 'duration (hours/min)'
DUR_SECONDS = 'duration (seconds)'
DURATION = 'duration'
DESCRIPTION = 'comments'
LAT = 'latitude'
LON = 'longitude'
# etc
DATE_FORMAT = '%m/%d/%Y %H:%M'
FIX_TIME_REGEX = (r'\s24:(\d{2})\s*$', r' 00:\1')

In [None]:
# magic & settings
%matplotlib inline
# big, juicy plots
sns.set_context('talk')
LARGE_FIGSIZE = (12, 8)
sns.set_style('white')
plt.rcParams.update({
    'figure.figsize': LARGE_FIGSIZE,
})

### Read & Clean the dataset

Read the csv file:

In [None]:
data.info()

An important thing to note about this dataset is that it grows over time. That is, the number of sightings (per year) is increasing as time passes. Let's plot it:

In [None]:
# sightings per year, in chronological order
count = data[YEAR].value_counts().sort_index().rename_axis(YEAR).rename(COUNT)
count.reset_index().plot(x=YEAR, y=COUNT);
plt.xlabel(YEAR);
plt.title('Sightings count over the years');

interesting_years = list(range(1990, 2015))
count.loc[interesting_years].reset_index().plot(x=YEAR, y=COUNT);
plt.xlabel(YEAR);
plt.title('Zoom in to the range 1990-2014');

In our opinion, this growth is related to the spread of internet technologies such as search engines, which made the reporting mechanism more accessible and so allowed more people to file reports.

### Shapes

[[back to the top]](#the-top)

First, we extract the unique shape values from all sightings:

In [None]:
shapes = pd.unique(data[SHAPE])
shapes = [shape for shape in shapes if not pd.isnull(shape)]

... and for every year, get the proportion of sightings with a given shape:

In [None]:
# get the distribution of each shape for every year
def get_shape_distribution(shape, data):
    if len(data) > 0:
        return len(data.loc[data[SHAPE] == shape, SHAPE]) / len(data)
    else:
        return 0

shapes_dist = pd.DataFrame(columns=[YEAR, SHAPE, COUNT]).set_index([YEAR, SHAPE])
for year in years:
    year_data = data.loc[data[YEAR] == year, :]
    shapes_distributions = {shape: get_shape_distribution(shape, year_data)
                            for shape in shapes}
    for shape in shapes:
        shapes_dist.loc[(year, shape), COUNT] = shapes_distributions[shape]
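
The same per-year proportions can also be computed without an explicit loop. The following is only a sketch on a made-up toy table (the `toy` frame and the `shape_proportions` helper are illustrative assumptions, not part of the notebook); it divides by the number of all sightings in a year, including those with a missing shape, as the loop above does.

```python
import pandas as pd

# Toy stand-in for the cleaned sightings table (values are made up).
toy = pd.DataFrame({
    'year':  [1999, 1999, 1999, 1999, 2000, 2000],
    'shape': ['disk', 'light', 'light', None, 'triangle', 'disk'],
})

def shape_proportions(df, year_col='year', shape_col='shape'):
    """Return a year x shape table: share of that year's sightings per shape."""
    # Count sightings per (year, shape); rows with a missing shape are not counted.
    counts = pd.crosstab(df[year_col], df[shape_col])
    # Divide by ALL sightings of the year (missing shapes included).
    totals = df.groupby(year_col).size()
    return counts.div(totals, axis=0)

proportions = shape_proportions(toy)
print(proportions)
```

The result is a wide table (one column per shape), so a single shape can be plotted with `proportions['triangle'].plot()`.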

Use the following widget to explore the change in shapes in sightings over the years.

_Pro tip: The real action starts at ~1950..._

In [None]:
interesting_years = (1960, 2014)

select_years_widget = widgets.IntRangeSlider(
    min = years[0],
    max = years[-1],
    value = interesting_years,
    step = 1,
    description = 'Years range'
)

select_shapes_widget = widgets.SelectMultiple(
    options = shapes,
    value = ['disk', 'light', 'fireball'],
    description = 'Shapes to plot. Use Ctrl to select multiple values:',
    disabled = False
)

# an interactive plotting function
@widgets.interact(years_range=select_years_widget,
                  selected_shapes=select_shapes_widget)
def plot_shape_distributions(years_range, selected_shapes):
    # get all years in the selected range
    selected_years = years[(years >= years_range[0]) &
                           (years <= years_range[1])]
    # the widget returns a tuple of shapes; .loc needs a list per index level
    selected_data = (shapes_dist.loc[(selected_years, list(selected_shapes)), COUNT]
                     .astype(float).reset_index())
    # plot & format
    sns.lmplot(x=YEAR, y=COUNT, hue=SHAPE, data=selected_data, height=10)
    plt.title('Changes in shape frequency over time')
    plt.xlabel('Year')
    plt.ylabel('Proportion of sightings')

_We get a lot fewer disk-shaped UFOs these days. What a shame - I liked those._

But wait - there's more!

In [None]:
selected_years = years[(years >= interesting_years[0]) &
                       (years <= interesting_years[1])]
shapes_dist.loc[(selected_years, ['triangle']), COUNT].astype(float).reset_index().plot(x=YEAR, y=COUNT);
plt.ylabel('proportion of sightings');
plt.title('Proportion of triangular UFO sightings in modern times');
plt.legend(['triangle']);

___Fun fact:___
<img src="https://upload.wikimedia.org/wikipedia/commons/a/a1/F-117_Nighthawk_Front.jpg" height="300" width="200" style="float: right; margin-left: 10px">
The first operational aircraft to be designed around 'stealth technology' was the [Lockheed F-117 Nighthawk](https://en.wikipedia.org/wiki/Lockheed_F-117_Nighthawk). It was introduced in 1983, after its first flight in 1981, and was officially retired in 2008. Oh - and it looks like a big, black triangle in the sky.

Also, the [B-2 stealth bomber](https://en.wikipedia.org/wiki/Northrop_Grumman_B-2_Spirit) (which is triangular as well) was produced in the years 1987-2000, and its shape is even weirder. 

These could account for the dramatically higher proportion of triangular-shaped UFO sightings in those years (note that most of the sightings in our dataset are [from the US](#Locations), where these aircraft were present).

### Duration

[[back to the top]](#the-top)

In [None]:
SELECTED_COUNTRIES = ['AU', 'GB', 'US', 'CA']
MAX_DURATION_HOURS = 48  # sightings lasting two days or more count as outliers

def calcduration(duration):
    """Return (sum of valid durations in minutes, number of disqualified values)."""
    s = 0
    disqualified = 0
    for d in duration:
        try:
            seconds = float(d)
        except (TypeError, ValueError):
            # not a number - leave this sighting out of the average
            disqualified += 1
            continue
        # eliminate outliers (NaN fails the comparison and is dropped as well)
        if seconds / 3600 < MAX_DURATION_HOURS:
            s += seconds / 60
        else:
            disqualified += 1
    return s, disqualified

def averageDuration(duration):
    """Average duration in minutes over the valid sightings (NaN if there are none)."""
    sumOfDurations, disqualified = calcduration(duration)
    valid = len(duration) - disqualified
    return sumOfDurations / valid if valid > 0 else np.nan

def averageDurationPerYear():
    averages = {year: averageDuration(data.loc[data[YEAR] == year, DURATION])
                for year in years}
    # plot the years in chronological order
    x, y = zip(*sorted(averages.items()))
    plt.xlabel('Year')
    plt.ylabel('average time of sighting in minutes')
    plt.title('average time of sighting each year')
    plt.plot(x, y)

def averageDurationPerCountry():
    averages = {country: averageDuration(data.loc[data[COUNTRY] == country, DURATION])
                for country in SELECTED_COUNTRIES}
    x = list(averages.keys())
    y = list(averages.values())
    plt.figure()
    plt.xlabel('country')
    plt.ylabel('average time of sighting in minutes')
    plt.title('average time of sighting per country')
    y_pos = np.arange(len(x))
    plt.bar(y_pos, y, align='center')
    plt.xticks(y_pos, x)

In [None]:
averageDurationPerYear()

In the graph above, we can see two major spikes. The first is at the very beginning (1906), a year with only a single, long sighting; the second is in the mid-1940s. Throughout the past 30 years, however, there have been no major spikes (and even before that, the remaining spikes were much less extreme). This can be attributed to the development of better means of documentation (in the absence of which, inventing accounts of alien invasions would have been easier), and also to better dissemination of information about events that explain strange sightings which might otherwise have been perceived as alien.
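
To see why a year with a single long sighting stands out, here is a minimal sketch on made-up numbers (the `toy` table and the `mean_duration_minutes` helper are assumptions for illustration, not the notebook's data). It computes the per-year average with pandas, using the same two-day cutoff for outliers.

```python
import pandas as pd

MAX_SECONDS = 48 * 3600  # two-day cutoff for outliers

# Toy data (made up): a single long sighting in 1906, several in 2010.
toy = pd.DataFrame({
    'year':     [1906, 2010, 2010, 2010, 2010],
    'duration': ['10800', '60', '120', 'unknown', '604800'],
})

def mean_duration_minutes(df):
    """Mean sighting duration in minutes per year, skipping bad values and outliers."""
    # Unparseable durations become NaN
    seconds = pd.to_numeric(df['duration'], errors='coerce')
    # Values at or above the cutoff become NaN as well; mean() skips NaN
    seconds = seconds.where(seconds < MAX_SECONDS)
    return (seconds / 60).groupby(df['year']).mean()

per_year = mean_duration_minutes(toy)
print(per_year)
```

With only one sighting, the 1906 average is simply that sighting's duration (180 minutes here), while the 2010 average is pulled down by the short sightings that make up most reports.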

In [None]:
averageDurationPerCountry()

### Locations

[[back to the top]](#the-top)

_We had a hard time with maps, so this analysis is kind of awkward, but we did put a lot of thought into it._

In [None]:
data[COUNTRY].value_counts().plot.bar(logy=True);
plt.xlabel(COUNTRY);
plt.ylabel(COUNT);
plt.title('Number of sightings for each country in this dataset (log y scale)');

This dataset consists mainly of sightings from the US. It was scraped from [NUFORC](http://www.nuforc.org/), an American organization with an English-language website, so non-English-speaking UFO witnesses are probably missing from this dataset.

From now on, we'll focus on US sightings.
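
Restricting the table to US sightings is a one-line filter. The sketch below uses a made-up toy frame, and the `us_only` helper is a hypothetical name rather than a function from this notebook; it compares the country code case-insensitively and leaves out rows with a missing country.

```python
import pandas as pd

# Toy stand-in for the sightings table (values are made up).
toy = pd.DataFrame({
    'country': ['US', 'us', 'GB', None, 'US'],
    'state':   ['CA', 'tx', None, 'ON', 'NY'],
})

def us_only(df, country_col='country'):
    """Return a copy of the rows whose country code is 'US' (any letter case)."""
    # A missing country never equals 'US', so those rows are excluded
    is_us = df[country_col].str.upper() == 'US'
    return df.loc[is_us].copy()

us_data = us_only(toy)
print(us_data)
```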

In [None]:
data = pd.read_csv(UFO_FILE, low_memory=False)
data.head()

and clean it up:

In [None]:
# make comments readable (unescape html)
data.loc[:, DESCRIPTION] = data[DESCRIPTION].apply(lambda val: html.unescape(str(val)))

# parse datetime - fix 24h format first (24:xx --> 00:xx)
# and insert time columns (year, month and hour)
# (plain column assignment, so the parsed dtype replaces the original one)
data[DATETIME] = data[DATETIME].apply(lambda val: re.sub(*FIX_TIME_REGEX, str(val)))
data[DATETIME] = pd.to_datetime(data[DATETIME], format=DATE_FORMAT, errors='coerce')

# rows whose datetime failed to parse get 0 in the derived columns
data.insert(1, YEAR, data[DATETIME].dt.year.fillna(0).astype(int))
data.insert(2, MONTH, data[DATETIME].dt.month.fillna(0).astype(int))
data.insert(3, HOUR, data[DATETIME].dt.hour.fillna(0).astype(int))

# tidy up the rest of the textual columns
data['city'] = data['city'].str.title()
for col in [STATE, COUNTRY]:
    data[col] = data[col].str.upper()

# parse location data
data[LAT] = pd.to_numeric(data[LAT], errors='coerce')
data = data.rename(columns={'longitude ': LON})

# drop the duration in hours column and rename the one with the seconds
data = data.drop(DUR_HOURS, axis=1)
data = data.rename(columns={DUR_SECONDS: DURATION})

# get a list of all years with sightings
years = np.unique(data[YEAR])

# display the result
data.head()
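
The `24:xx` fix in the cleaning step can be checked in isolation. This sketch reuses the notebook's `DATE_FORMAT` and `FIX_TIME_REGEX` constants; the `parse_sighting_time` helper is a hypothetical name introduced here for illustration. Note that the substitution keeps the date as it is, so `24:00` becomes `00:00` on the same day rather than rolling over to the next one.

```python
import re
from datetime import datetime

DATE_FORMAT = '%m/%d/%Y %H:%M'
FIX_TIME_REGEX = (r'\s24:(\d{2})\s*$', r' 00:\1')

def parse_sighting_time(raw):
    """Parse a raw timestamp, mapping hour 24 to hour 00 of the same date."""
    fixed = re.sub(*FIX_TIME_REGEX, raw)
    return datetime.strptime(fixed, DATE_FORMAT)

parsed = parse_sighting_time('10/10/1949 24:00')
unchanged = parse_sighting_time('10/10/1949 20:30')
print(parsed, unchanged)
```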

Done! Now for some stats on the file: