## Sprint 4 Project: UFO/UAP Sighting Analysis

This project aims to create a web application exploring some basic trends in global UFO sightings from 1910 to 2014. The data set was obtained from https://mavenanalytics.io/

UFO (unidentified flying object) and UAP (unidentified aerial phenomenon), as they are now called, are terms used to describe "any apparent object in the sky that can’t be identified and classified as an object or phenomenon already known" (Petrescu, 1). While some of these can be attributed to things such as weather balloons or aircraft, others are harder to explain by conventional means. These sightings are often linked to extraterrestrials and alien life. While this topic is controversial and discussion is widespread, even within the United States government, data does exist to support the existence of UAPs and patterns in the reported sightings.

This study aims to identify potential patterns in UAP sightings, such as:
1) Places with high numbers of sightings
2) Years and times of the year when sightings are high
3) Patterns in sightings such as the types of craft and duration of the sightings

For the purposes of this study, we will limit our research to the United States (including Washington, D.C. and Puerto Rico). We will also use the term "UAP" moving forward, as that is the current terminology in government and media discussions.

While this information cannot prove that extraterrestrial life is behind these UAPs, understanding patterns behind them could help the public stay more vigilant and help our government know how best to monitor and study such phenomena in the future.



Sources:
https://mavenanalytics.io/

Petrescu, Relly Victoria, et al. "What is a UFO?." Journal of Aircraft and Spacecraft Technology 1.2 (2017), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3073997



In [1]:
import pandas as pd
from scipy import stats  # not aliased as `st`, which is used for streamlit below
import numpy as np
import plotly.express as px
import altair as alt
import streamlit as st
import matplotlib.pyplot as plt


In [2]:
#import data set 
uaps = pd.read_csv('/Users/corinnehultman/Desktop/TripleTen/Sprint_4_Project/ufo_sightings_scrubbed.csv', low_memory=False) 
display(uaps.head(10))

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803611
5,1961-10-10 19:00:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,2007-04-27,36.595,-82.188889
6,1965-10-10 21:00:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.18
7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,1999-10-02,41.1175,-73.408333
8,1966-10-10 20:00:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.5861111,-86.286111
9,1966-10-10 21:00:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,2005-05-11,30.2947222,-82.984167


## Clean Data Set

In [3]:
#general info
print(uaps.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  object 
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80317 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  object 
 10  longitude             80332 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.7+ MB
None


In [4]:
#make all the writing lower case to avoid confusion
def convert_to_lower(df):
    for column in df.columns:
        if df[column].dtype == 'object':
            if df[column].str.contains('[A-Z]').any():
                df[column] = df[column].str.lower()
    return df
uaps = convert_to_lower(uaps)
display(uaps)

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,this event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 lackland afb&#44 tx. lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,green/orange circular disc over chester&#44 en...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,my older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,as a marine 1st lt. flying an fj4b fighter/att...,2004-01-22,21.4180556,-157.803611
...,...,...,...,...,...,...,...,...,...,...,...
80327,2013-09-09 21:15:00,nashville,tn,us,light,600,10 minutes,round from the distance/slowly changing colors...,2013-09-30,36.1658333,-86.784444
80328,2013-09-09 22:00:00,boise,id,us,circle,1200,20 minutes,boise&#44 id&#44 spherical&#44 20 min&#44 10 r...,2013-09-30,43.6136111,-116.202500
80329,2013-09-09 22:00:00,napa,ca,us,other,1200,hour,napa ufo&#44,2013-09-30,38.2972222,-122.284444
80330,2013-09-09 22:20:00,vienna,va,us,circle,5,5 seconds,saw a five gold lit cicular craft moving fastl...,2013-09-30,38.9011111,-77.265556


In [5]:
#remove all rows that are not in the USA
print(uaps['country'].unique())
uaps_us = uaps[uaps['country'] == 'us']
print(uaps_us['country'].unique())
display(uaps_us.info())

['us' nan 'gb' 'ca' 'au' 'de']
['us']
<class 'pandas.core.frame.DataFrame'>
Index: 65114 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              65114 non-null  object 
 1   city                  65114 non-null  object 
 2   state                 65114 non-null  object 
 3   country               65114 non-null  object 
 4   shape                 63561 non-null  object 
 5   duration (seconds)    65114 non-null  object 
 6   duration (hours/min)  65114 non-null  object 
 7   comments              65101 non-null  object 
 8   date posted           65114 non-null  object 
 9   latitude              65114 non-null  object 
 10  longitude             65114 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.0+ MB


None

In [6]:
#check for duplicates and drop any if they exist
dup_uaps_us = uaps_us[uaps_us.duplicated()]
display(dup_uaps_us)
uaps_us = uaps_us.drop_duplicates().reset_index()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
2873,2013-10-18 22:34:00,norwood,oh,us,fireball,900,15 minutes,fireballs in sky making different formations.,2013-10-23,39.1555556,-84.459722


In [7]:
#fix data types
#change the date/time to date time
uaps_us['datetime'] = pd.to_datetime(uaps_us['datetime'], format='%Y-%m-%d %H:%M:%S')
# change the duration in seconds to a float type
uaps_us['duration (seconds)'] = pd.to_numeric(uaps_us['duration (seconds)'], errors='coerce')
display(uaps_us.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65113 entries, 0 to 65112
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   index                 65113 non-null  int64         
 1   datetime              65113 non-null  datetime64[ns]
 2   city                  65113 non-null  object        
 3   state                 65113 non-null  object        
 4   country               65113 non-null  object        
 5   shape                 63560 non-null  object        
 6   duration (seconds)    65111 non-null  float64       
 7   duration (hours/min)  65113 non-null  object        
 8   comments              65100 non-null  object        
 9   date posted           65113 non-null  object        
 10  latitude              65113 non-null  object        
 11  longitude             65113 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(8)
memory usage: 6.0+ M

None
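
The `errors='coerce'` argument explains why `duration (seconds)` shows 65111 non-null values while the other columns show 65113: entries that cannot be parsed as numbers become `NaN` instead of raising an error. A minimal sketch on made-up values (the sample strings are illustrative and not taken from the data set):

```python
import pandas as pd

# Hypothetical raw values; the last one imitates a malformed entry.
raw = pd.Series(['2700', '20', '0.5', '2`'])

# Unparseable strings become NaN rather than raising a ValueError.
seconds = pd.to_numeric(raw, errors='coerce')

# Rows that failed to convert, worth inspecting before they are dropped or ignored.
failed = raw[seconds.isna()]
print(seconds)
print(failed)
```

Inspecting the failed rows first makes it easier to decide whether they can be repaired or should be excluded.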

In [8]:
#check for missing values
#dropping missing values in 'shape' column. Only looking at sightings that have a shape category to simplify, as there already is an 'unknown' category. Do not want to lump those two together
uaps_us = uaps_us.dropna(subset=['shape'])
print(uaps_us['shape'].unique())
#missing values in 'comments' column
uaps_us['comments']= uaps_us['comments'].fillna('no_comment')

['cylinder' 'circle' 'light' 'sphere' 'disk' 'fireball' 'unknown' 'oval'
 'other' 'rectangle' 'chevron' 'formation' 'triangle' 'cigar' 'delta'
 'changing' 'diamond' 'flash' 'egg' 'teardrop' 'cone' 'cross' 'pyramid'
 'round' 'flare' 'hexagon' 'crescent' 'changed']


## Question 1: Which states in the US reported the most UAP sightings from 1910-2014?

In [9]:
#group the rows by state and count the number of rows
uaps_state_group = uaps_us.groupby('state').size().reset_index(name='count')
#sort results in descending order
uaps_state_group = uaps_state_group.sort_values(by='count', ascending=False)
display(uaps_state_group)
#percent of total sightings in CA
cali_percent = 8684/65113
print(f'The percent of total sightings in California is {cali_percent:.2%}')
#top five states percent of total sightings
top_five = ((8684+3754+3708+3399+2915)/65113)
print(f'The total percent of sightings found in the top five states are {top_five:.2%}')

Unnamed: 0,state,count
4,ca,8684
9,fl,3754
48,wa,3708
44,tx,3399
34,ny,2915
14,il,2447
3,az,2362
38,pa,2319
35,oh,2251
22,mi,1781


The percent of total sightings in California is 13.34%
The total percent of sightings found in the top five states are 34.49%
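
The percentages above are computed from hand-typed counts and a hand-typed total. A possible alternative is to derive both from the grouped frame, so the denominator always matches the rows that were actually counted after cleaning. This is only a sketch on invented numbers; `state_counts` is a hypothetical stand-in for `uaps_state_group`:

```python
import pandas as pd

# Toy stand-in for uaps_state_group (counts are made up).
state_counts = pd.DataFrame({
    'state': ['ca', 'fl', 'wa', 'tx'],
    'count': [50, 25, 15, 10],
})

# The denominator comes from the same counts that are being compared.
total = state_counts['count'].sum()
state_counts['share'] = state_counts['count'] / total

# Combined share of the two largest states.
top_two_share = state_counts.nlargest(2, 'count')['share'].sum()
print(f'Top two states: {top_two_share:.2%}')
```

Because the total is taken from the grouped counts, percentages computed this way can differ slightly from ones based on a row count recorded at an earlier cleaning step.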


In [29]:
#create a graphic describing the distribution of sightings across states --> bar graph: discrete data
#create bargraph via plotly.express
fig_bar1 = px.bar(uaps_state_group, x='state', y='count', title='UAP Sightings per U.S State from 1910-2014')
fig_bar1.update_traces(marker_color='green')
fig_bar1.update_layout(
    xaxis_title='State (abbreviation)',
    yaxis_title='UAP Sighting Count',
    title_font=dict(family="Arial", size=24, color="black"),
    font=dict(family="Arial", size=14, color="black") )
fig_bar1.show()

The states with the most reported UAP sightings are California, Florida, Washington, Texas, and New York. California has the highest number by a wide margin, with 13.34% of sightings found in CA alone. Together, these five states made up 34.49% of all sightings in this time period.

## Question 2: Which year had the highest number of sightings? What is the average number of sightings per year? Which months show the highest number of sightings on average? In the year with the highest number of sightings, what was the distribution over the months of that year?

In [11]:
#which year was the number of sightings highest?
#create year column
uaps_us['year'] = uaps_us['datetime'].dt.year
#groupby by year (own df) and count which year is the highest 
uaps_year_group = uaps_us.groupby('year').size().reset_index(name='count')
#sort results in descending order
uaps_year_group = uaps_year_group.sort_values(by='count', ascending=False)
display(uaps_year_group.head(10))

Unnamed: 0,year,count
80,2012,6252
81,2013,5993
79,2011,4333
76,2008,3970
77,2009,3613
78,2010,3507
75,2007,3445
72,2004,3204
73,2005,3192
71,2003,2916


In [30]:
# bar graph 
fig_bar2 = px.bar(uaps_year_group, x='year', y='count', title='UAP sightings per Year from 1910-2014')
fig_bar2.update_traces(marker_color='lightgreen')
fig_bar2.update_layout(
    xaxis_title='Year',
    yaxis_title='UAP Sighting Count',
    title_font=dict(family="Arial", size=24, color="black"),
    font=dict(family="Arial", size=14, color="black") )
fig_bar2.update_layout(
    xaxis=dict(
        tickmode='linear',
        dtick=10
    )
)
fig_bar2.update_layout(
    yaxis=dict(
        tickmode='linear',
        dtick=500
    )
)
fig_bar2.show()

In [13]:
#making a histogram of data
#make sure df is sorted chronologically
uaps_year_group['year'] = pd.to_numeric(uaps_year_group['year'])
uaps_year_group = uaps_year_group.sort_values(by='year')

#histogram
uaps_years = px.histogram(uaps_year_group, 
                   x='year', 
                   y='count', 
                   histfunc='sum',  
                   title='UAP Sightings by Year',
                   labels={'year': 'Year', 'count': 'Number of Sightings'})  

#clean up histogram
uaps_years.update_traces(marker=dict(color='lightgreen', line=dict(width=2, color='black')))
uaps_years.update_layout(
    bargap=0.1,  
    xaxis_type='category',  
    xaxis_tickangle=-45  
)
uaps_years.show()

In [14]:
#average number of sightings per year
print(f"There were an average of {uaps_year_group['count'].mean():.0f} UAP sightings per year in the United States from 1910-2014.")

There were an average of 766 UAP sightings per year in the United States from 1910-2014.
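
One caveat about this average: `groupby('year')` only produces rows for years that have at least one sighting, so the mean is taken over those years rather than over every calendar year in the range. If some years have no reports, the two averages differ. A small sketch with invented counts (the `year_counts` series is hypothetical):

```python
import pandas as pd

# Toy yearly counts with a gap: there are no rows for 2001.
year_counts = pd.Series({2000: 4, 2002: 8}, name='count')

# Mean over years that appear in the data (2 years).
mean_observed_years = year_counts.mean()

# Reindex over the full span so the missing year counts as zero (3 years).
full_span = range(int(year_counts.index.min()), int(year_counts.index.max()) + 1)
mean_all_years = year_counts.reindex(full_span, fill_value=0).mean()

print(mean_observed_years, mean_all_years)
```

Which of the two is more appropriate depends on whether a year without reports should be read as zero sightings or as missing data.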


The ten years with the most UAP sightings in this data set were 2003-2013 (excluding 2006), with the peak in 2012. It is interesting to note that 2014 is not on this list, indicating some sort of drop in recorded sightings at that point. How much of the overall increase is due to increased UAP activity versus better methods of tracking and recording sightings is unclear.

In [15]:
#which months show the highest number of sightings
#create month column
uaps_us['month'] = uaps_us['datetime'].dt.month
uaps_month_group = uaps_us.groupby('month').size().reset_index(name='count')
#sort results in descending order
uaps_month_group = uaps_month_group.sort_values(by='count', ascending=False)
display(uaps_month_group)
total_sightings_6 = (7538+6674+6278+6109+6050+5521)/(uaps_month_group['count'].sum())
print(f'The percent of total sightings happening from June to November is {total_sightings_6:.2%}')

Unnamed: 0,month,count
6,7,7538
7,8,6674
5,6,6278
9,10,6109
8,9,6050
10,11,5521
11,12,4508
0,1,4455
2,3,4315
3,4,4306


The percent of total sightings happening from June to November is 60.05%


In [31]:
#converting month numbers to names
uaps_month_group['month_name'] = uaps_month_group['month'].apply(lambda x: pd.to_datetime(f'{x}-01', format='%m-%d').strftime('%B'))
#bar graph showing the months with the highest number of sightings
fig_month = px.bar(uaps_month_group, x='month_name', y='count', title='UAP sightings per Month from 1910-2014')
fig_month.update_traces(marker_color='#006400')
fig_month.update_layout(
    xaxis_title='Month',
    yaxis_title='UAP Sighting Count',
    title_font=dict(family="Arial", size=24, color="black"),
    font=dict(family="Arial", size=14, color="black") )
fig_month.update_layout(
    xaxis=dict(
        tickmode='linear',
        dtick=1
    )
)
fig_month.update_xaxes(type='category', categoryorder='array', categoryarray=[
    'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
    'September', 'October', 'November', 'December'
])
fig_month.show()
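
As an alternative to round-tripping each month number through `pd.to_datetime`, the standard library's `calendar.month_name` can supply the names, and an ordered categorical keeps them in calendar order when sorting. This is a sketch on made-up counts; `toy_months` is a hypothetical stand-in for `uaps_month_group`, and the names follow the current locale (English by default):

```python
import calendar
import pandas as pd

# Toy stand-in for uaps_month_group; the counts are invented.
toy_months = pd.DataFrame({'month': [7, 2, 11], 'count': [30, 10, 20]})

# calendar.month_name[1] is 'January' ... [12] is 'December' (index 0 is an empty string).
month_order = list(calendar.month_name)[1:]
toy_months['month_name'] = pd.Categorical(
    toy_months['month'].map(lambda m: calendar.month_name[m]),
    categories=month_order,
    ordered=True,
)

# Sorting on the ordered categorical follows the calendar, not the alphabet.
toy_months = toy_months.sort_values('month_name')
print(toy_months)
```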

In [17]:
#In the year with the highest number of sightings, What was the distribution over the months of that year? (2012)
#create a data frame containing only sightings in 2012
uaps_us_2012 = uaps_us[uaps_us['year'] == 2012]
display(uaps_us_2012.info())
#count the number of sightings distributed across the months of that year
uaps_month_group_2012 = uaps_us_2012.groupby('month').size().reset_index(name='count')
#sort results in descending order
uaps_month_group_2012 = uaps_month_group_2012.sort_values(by='count', ascending=False)
display(uaps_month_group_2012)

<class 'pandas.core.frame.DataFrame'>
Index: 6252 entries, 202 to 65098
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   index                 6252 non-null   int64         
 1   datetime              6252 non-null   datetime64[ns]
 2   city                  6252 non-null   object        
 3   state                 6252 non-null   object        
 4   country               6252 non-null   object        
 5   shape                 6252 non-null   object        
 6   duration (seconds)    6252 non-null   float64       
 7   duration (hours/min)  6252 non-null   object        
 8   comments              6252 non-null   object        
 9   date posted           6252 non-null   object        
 10  latitude              6252 non-null   object        
 11  longitude             6252 non-null   float64       
 12  year                  6252 non-null   int32         
 13  month               

None

Unnamed: 0,month,count
6,7,755
7,8,687
10,11,629
5,6,592
8,9,569
11,12,537
9,10,523
0,1,442
2,3,422
3,4,403


In [34]:
#plot this distribution on the same axis as the total month data to see how it compares?
#converting month numbers to names for 2012
uaps_month_group_2012['month_name'] = uaps_month_group_2012['month'].apply(lambda x: pd.to_datetime(f'{x}-01', format='%m-%d').strftime('%B'))
#bar graph showing the months with the highest number of sightings
fig_2012 = px.bar(uaps_month_group_2012, x='month_name', y='count', title='UAP sightings per Month in 2012')
fig_2012.update_traces(marker_color='lightgreen')
fig_2012.update_layout(
    xaxis_title='Month',
    yaxis_title='UAP Sighting Count',
    title_font=dict(family="Arial", size=24, color="black"),
    font=dict(family="Arial", size=14, color="black") )
fig_2012.update_layout(
    xaxis=dict(
        tickmode='linear',
        dtick=1
    )
)
fig_2012.update_xaxes(type='category', categoryorder='array', categoryarray=[
    'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
    'September', 'October', 'November', 'December'
])
fig_2012.show()

Across the full time frame, the highest numbers of UAP sightings occurred in July and August, with the most occurring in July. June through November is the half of the year with the most sightings overall, with 60.05% of sightings occurring in that time. A UAP sighting is least likely in the month of February, and January-May were the lowest months for sightings in general. This trend was similar in 2012, except that November and December showed relatively higher levels of activity.

## Question 3: What was the most likely time of day to see a UAP?

In [19]:
#create 'hour' column using datetime data
uaps_us['hour'] = uaps_us['datetime'].dt.hour
#group sightings by hour and count which ones are most likely
uaps_hour_group = uaps_us.groupby('hour').size().reset_index(name='count')
#sort results in descending order
uaps_hour_group = uaps_hour_group.sort_values(by='count', ascending=False)
display(uaps_hour_group)

Unnamed: 0,hour,count
21,21,9432
22,22,8617
20,20,7110
23,23,5959
19,19,4977
0,0,3520
18,18,3253
1,1,2453
17,17,2074
2,2,1746


In [35]:
#create a plot to show the distribution bar
fig_day = px.bar(uaps_hour_group, x='hour', y='count', title='UAP sightings Hour of the Day: 1910-2014',color='hour', color_continuous_scale='greens')
fig_day.update_layout(showlegend=False)
fig_day.update_layout(
    xaxis_title='Hour of Day (military time)',
    yaxis_title='UAP Sighting Count',
    title_font=dict(family="Arial", size=24, color="black"),
    font=dict(family="Arial", size=14, color="black") )
fig_day.update_layout(
    xaxis=dict(
        tickmode='linear',
        dtick=1
    )
)
fig_day.show()

The most sightings occurred between hours 20 and 22 (8-10 pm), with the peak at hour 21 (9 pm). Sightings were least likely at hour 8 (8 am), with the later morning hours generally having fewer sightings.
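
Translating 24-hour values into 12-hour labels by hand is easy to get wrong, so a small helper can do it consistently. The function below is hypothetical and not part of the notebook:

```python
def to_12_hour(hour: int) -> str:
    """Convert an hour in 0-23 to a 12-hour label such as '9 pm'."""
    if not 0 <= hour <= 23:
        raise ValueError(f'hour must be between 0 and 23, got {hour}')
    suffix = 'am' if hour < 12 else 'pm'
    # hour % 12 is 0 at midnight and noon, which should read as 12.
    return f'{hour % 12 or 12} {suffix}'


# Labels for the three busiest hours in the table above.
peak_labels = [to_12_hour(h) for h in (20, 21, 22)]
print(peak_labels)
```

The same function could be mapped over the `hour` column to label the x-axis of the chart in 12-hour time.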

## Question 4: How long does a 'sighting' typically last? 

In [21]:
#remove rows without exact number of seconds for duration
print(uaps_us['duration (hours/min)'].unique())
#don't need values with "or less, or more,  -, so far, ~, over, +"
#While we are assuming that all of these are approximations due to the sources being eyewitnesses, ones with a range or very general timeline will be excluded from this analysis
import re  
#terms to exclude
terms_to_exclude = ["or less", "or more", "-", "so far", "~", "over", "+"]
#escape special characters in the terms using re.escape
escaped_terms = [re.escape(term) for term in terms_to_exclude]
#create a regular expression pattern to match any of the escaped terms
pattern = '|'.join(escaped_terms)
#filter the DataFrame by removing rows where the 'duration (hours/min)' contains any of the terms
uaps_us_cleaned = uaps_us[~uaps_us['duration (hours/min)'].str.contains(pattern, case=False, na=False)]

display(uaps_us.info())
display(uaps_us_cleaned.info())

['45 minutes' '1/2 hour' '15 minutes' ... '5-6 seconda' 'approx. 35 mins'
 '3 minutes or less']
<class 'pandas.core.frame.DataFrame'>
Index: 63560 entries, 0 to 65112
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   index                 63560 non-null  int64         
 1   datetime              63560 non-null  datetime64[ns]
 2   city                  63560 non-null  object        
 3   state                 63560 non-null  object        
 4   country               63560 non-null  object        
 5   shape                 63560 non-null  object        
 6   duration (seconds)    63560 non-null  float64       
 7   duration (hours/min)  63560 non-null  object        
 8   comments              63560 non-null  object        
 9   date posted           63560 non-null  object        
 10  latitude              63560 non-null  object        
 11  longitude             63560 non-null  flo

None

<class 'pandas.core.frame.DataFrame'>
Index: 54113 entries, 0 to 65112
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   index                 54113 non-null  int64         
 1   datetime              54113 non-null  datetime64[ns]
 2   city                  54113 non-null  object        
 3   state                 54113 non-null  object        
 4   country               54113 non-null  object        
 5   shape                 54113 non-null  object        
 6   duration (seconds)    54113 non-null  float64       
 7   duration (hours/min)  54113 non-null  object        
 8   comments              54113 non-null  object        
 9   date posted           54113 non-null  object        
 10  latitude              54113 non-null  object        
 11  longitude             54113 non-null  float64       
 12  year                  54113 non-null  int32         
 13  month                

None

In [22]:
#general statistics of sight duration
print(uaps_us_cleaned['duration (seconds)'].describe())
#calculate the average length of a sighting in seconds
sightings_mean = uaps_us_cleaned['duration (seconds)'].mean()
print(f"The average length of a UAP sighting is {sightings_mean:.2f} seconds or {(sightings_mean)/3600:.2f} hours.")
#calculate dispersion(variance)/standard deviation
sightings_variance = np.var(uaps_us_cleaned['duration (seconds)'])
print(f"The variance of a UAP sighting is {sightings_variance:.2f} seconds.")
sightings_std = np.std(uaps_us_cleaned['duration (seconds)'])
print(f"The standard deviation of a UAP sighting is {sightings_std:.2f}seconds or {(sightings_std)/3600:.2f} hours.")

count    5.411300e+04
mean     6.126647e+03
std      4.396868e+05
min      1.000000e-02
25%      3.000000e+01
50%      1.800000e+02
75%      6.000000e+02
max      6.627600e+07
Name: duration (seconds), dtype: float64
The average length of a UAP sighting is 6126.65 seconds or 1.70 hours.
The variance of a UAP sighting is 193320891599.29 seconds.
The standard deviation of a UAP sighting is 439682.72seconds or 122.13 hours.


While the sighting durations are all taken as approximations due to the nature of eyewitness testimony, sightings where the range of time was too broad were removed from this portion of the analysis. For example, a sighting listed as "an hour or less" would be excluded because that could mean anything from a few seconds to a full 60 minutes.
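
The exclusion rule can be seen on a handful of example strings. The sketch below reuses the same list of terms as the cell above, while the sample durations are invented for illustration:

```python
import re
import pandas as pd

# Invented free-text durations in the style of the 'duration (hours/min)' column.
durations = pd.Series([
    '45 minutes', '1-2 hrs', '3 minutes or less',
    '~10 sec', '20 seconds', '5+ minutes',
])

# Same terms as in the notebook; re.escape keeps '+' from being read as regex syntax.
terms_to_exclude = ["or less", "or more", "-", "so far", "~", "over", "+"]
pattern = '|'.join(re.escape(term) for term in terms_to_exclude)

# True where the text contains any of the vague terms.
is_vague = durations.str.contains(pattern, case=False, na=False)
kept = durations[~is_vague]
print(kept.tolist())
```

Because this is a plain substring match, a term like "over" would also match inside a longer word, so some precise entries may be dropped along with the vague ones.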

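The average length of a sighting is 1.70 hours, but the median is only 180 seconds (3 minutes), so the mean is pulled upward by a small number of very long sightings. The large standard deviation (122.13 hours) shows how much the length of sightings varies.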
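
The gap between the mean and the quartiles in the `describe()` output can be reproduced on a tiny example. The four values below are illustrative (borrowed from the quartiles and maximum shown above), not a sample of the real data:

```python
import numpy as np

# Three short sightings and one extreme value, in seconds.
durations = np.array([30, 180, 600, 66_276_000], dtype=float)

mean_duration = durations.mean()        # dominated by the single extreme value
median_duration = np.median(durations)  # midpoint of the two middle values

print(f'mean: {mean_duration:.1f} s, median: {median_duration:.1f} s')
```

For data this skewed, the median is arguably a better description of a typical sighting than the mean.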


In [36]:

# work on an explicit copy so adding a column does not trigger SettingWithCopyWarning
uaps_us_cleaned = uaps_us_cleaned.copy()
# log transformation to the duration column to reduce outlier impact
uaps_us_cleaned['log_duration'] = np.log10(uaps_us_cleaned['duration (seconds)'])

# box plot for the transformed data
fig_box = px.box(
    uaps_us_cleaned,
    y='log_duration',  
    title='Box Plot of Log-Transformed Sighting Durations',
    labels={'log_duration': 'Log10 of Duration (seconds)'}
)
fig_box.show()


The reported duration of a sighting ranges from a fraction of a second (0.01 seconds) to about two years (66,276,000 seconds). This extreme spread could be indicative of issues with data collection rather than genuinely long-term phenomena; however, given the nature of the data, it is hard to know for sure.
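
A base-10 logarithm is what makes a spread like this plottable: each unit on the log scale is a factor of ten in seconds. A short sketch using the minimum, quartiles, and maximum from the `describe()` output as example inputs:

```python
import numpy as np

# Minimum, quartiles, and maximum duration in seconds, taken from the summary above.
durations = np.array([0.01, 30, 180, 600, 66_276_000])

# log10 compresses values spanning many orders of magnitude into a short range.
log_durations = np.log10(durations)
orders_of_magnitude = log_durations.max() - log_durations.min()

print(np.round(log_durations, 2))
print(f'span: {orders_of_magnitude:.2f} orders of magnitude')
```

Note that `np.log10` returns negative infinity for a duration of exactly zero, so this approach assumes all durations are positive, as they are in the summary above (minimum 0.01).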

## Question 5: How does the length of sightings change over time?

In [24]:
#create a scatterplot for the data
display(uaps_us_cleaned.info())
display(uaps_us_cleaned.head())
#removing outlier values to visualize the main trends
Q1 = uaps_us_cleaned['duration (seconds)'].quantile(0.25)
Q3 = uaps_us_cleaned['duration (seconds)'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds (1.5 times the IQR above Q3 or below Q1)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out the outliers
filtered_data = uaps_us_cleaned[(uaps_us_cleaned['duration (seconds)'] >= lower_bound) & 
                           (uaps_us_cleaned['duration (seconds)'] <= upper_bound)]

# Create the scatterplot with filtered data
fig = px.scatter(filtered_data, x='year', y='duration (seconds)', 
                 title='UAP Sightings Duration Over Time (Outliers Removed)',
                 labels={'year': 'Year', 'duration (seconds)': 'Duration (seconds)'})


<class 'pandas.core.frame.DataFrame'>
Index: 54113 entries, 0 to 65112
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   index                 54113 non-null  int64         
 1   datetime              54113 non-null  datetime64[ns]
 2   city                  54113 non-null  object        
 3   state                 54113 non-null  object        
 4   country               54113 non-null  object        
 5   shape                 54113 non-null  object        
 6   duration (seconds)    54113 non-null  float64       
 7   duration (hours/min)  54113 non-null  object        
 8   comments              54113 non-null  object        
 9   date posted           54113 non-null  object        
 10  latitude              54113 non-null  object        
 11  longitude             54113 non-null  float64       
 12  year                  54113 non-null  int32         
 13  month                

None

Unnamed: 0,index,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,year,month,hour,log_duration
0,0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,this event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111,1949,10,20,3.431364
1,3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,1/2 hour,my older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833,1956,10,21,1.30103
2,4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,as a marine 1st lt. flying an fj4b fighter/att...,2004-01-22,21.4180556,-157.803611,1960,10,20,2.954243
3,5,1961-10-10 19:00:00,bristol,tn,us,sphere,300.0,5 minutes,my father is now 89 my brother 52 the girl wit...,2007-04-27,36.595,-82.188889,1961,10,19,2.477121
4,7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200.0,20 minutes,a bright orange color changing to reddish colo...,1999-10-02,41.1175,-73.408333,1965,10,23,3.079181
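
To make the outlier bounds concrete, the arithmetic can be worked through with the quartiles reported by `describe()` earlier (25% = 30 seconds, 75% = 600 seconds). This assumes `quantile()` on the same frame returns those same values:

```python
# Quartiles of 'duration (seconds)' as shown in the describe() output.
q1 = 30.0
q3 = 600.0
iqr = q3 - q1  # 570 seconds

# Standard 1.5 * IQR fences.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f'lower bound: {lower_bound} s, upper bound: {upper_bound} s')
print(f'upper bound in minutes: {upper_bound / 60:.2f}')
```

Under these values the lower fence is negative and has no effect, so the filter simply keeps sightings of up to 1455 seconds (about 24 minutes).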


In [37]:
# Create the scatterplot with log transformed data
fig_scatter = px.scatter(filtered_data, x='year', y='log_duration', 
                 title='UAP Sightings Duration Over Time (Log-Transformed)',
                 labels={'year': 'Year', 'log_duration': 'Log10 of Duration (seconds)'})

fig_scatter.show()


There is no visible trend in the duration of a UAP sighting through time. 
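
"No visible trend" is a judgment made from the scatterplot; a rank correlation is one way to put a number on it. The sketch below uses `scipy.stats.spearmanr` on invented values. The `toy` frame is hypothetical, and this is one possible check rather than the method used in this notebook:

```python
import pandas as pd
from scipy import stats

# Invented years and durations (seconds) with no consistent direction.
toy = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2004, 2005],
    'duration': [120, 30, 600, 45, 300, 60],
})

# Spearman's rho measures monotonic association and is robust to extreme values.
rho, p_value = stats.spearmanr(toy['year'], toy['duration'])
print(f'rho = {rho:.3f}, p = {p_value:.3f}')
```

A rho near zero would be consistent with the absence of a monotonic trend, while values near 1 or -1 would indicate durations rising or falling over time.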

## Conclusion

In the United States from 1910-2014, tens of thousands of UAP cases were documented across the country. California recorded more cases than any other state, and sightings were most frequent in the summer months of June-August. The most likely time to see a UAP is between 8-10 pm. A typical sighting lasts a few minutes (the median is 180 seconds), although reported durations range from less than a second to far longer.

I believe the largest potential issue in this data set is its reliance on eyewitness testimony. The data could be skewed by differences in recording methods, the number of witnesses, the time of recording, etc. However, it can still give valuable insight into where and when UAPs are likely to be reported in the future. Information from this study could help the U.S. government and other researchers track potential movements, focus areas of UAP studies, and develop methods for efficient response to unidentified aerial phenomena.