# UFO Sightings - Exploratory Data Analysis

## Introduction

Whether you believe in extraterrrials or not, UFO sightings are becoming harder to deny. With over 80,000 reported encounters since 1906, reports offered by the National UFO Reporting Center Online Database (NURCOB) offer key insights into the world of the unexplainable. 

Each one provides a clue as to the location and date they may be appearing. Reports also help identify what shape, form and color a UFO might be taking as well as how long the encounter may take. By loading and cleaning the data, an analysis can be formed.

## Process

Loading includes:
- Importing appropriate packages
- Reading in file
- Inspecting data

Cleaning includes:
- Selecting appropriate data
- Ensuring data is readable
- Handling duplicate and missing values
- Enriching data

Analysis includes:
- Timing. Whenare reports are made more likely to be made. Which season, month, week, day, time of day are reports occuring most? When are they occuring least?
- Report day lag. Does the gap between the date the report is made and the actual encounter lead to extreme discrepencies?


## 1 Load Relevant Data

This process will involve importing packages and reading the data. 

### 1.1 Import Relevant Packages

In [1]:
# Import appropriate packages to clean and analyse the data
import pandas as pd # To handle dataframes
import numpy as np # To handle arrays
import calendar # To handle dates
from geopy.geocoders import Nominatim # For finding locations
import pycountry_convert as pc # for grouping countries
import plotly.express as px # to visualize data

### 1.2 Read CSV File

In [2]:
# Read file
df = pd.read_csv('./data//ufo-sightings-transformed.csv', parse_dates=['Date_time','date_documented'])


### 1.3 Inspect Data

In [3]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,Date_time,date_documented,Year,Month,Hour,Season,Country_Code,Country,Region,Locale,latitude,longitude,UFO_shape,length_of_encounter_seconds,Encounter_Duration,Description
0,0,1949-10-10 20:30:00,2004-04-27,1949,10,20,Autumn,USA,United States,Texas,San Marcos,29.883056,-97.941111,Cylinder,2700.0,45 minutes,This event took place in early fall around 194...
1,1,1949-10-10 21:00:00,2005-12-16,1949,10,21,Autumn,USA,United States,Texas,Bexar County,29.38421,-98.581082,Light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...
2,2,1955-10-10 17:00:00,2008-01-21,1955,10,17,Autumn,GBR,United Kingdom,England,Chester,53.2,-2.916667,Circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...
3,3,1956-10-10 21:00:00,2004-01-17,1956,10,21,Autumn,USA,United States,Texas,Edna,28.978333,-96.645833,Circle,20.0,1/2 hour,My older brother and twin sister were leaving ...
4,4,1960-10-10 20:00:00,2004-01-22,1960,10,20,Autumn,USA,United States,Hawaii,Kaneohe,21.418056,-157.803611,Light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80328 entries, 0 to 80327
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   Unnamed: 0                   80328 non-null  int64         
 1   Date_time                    80328 non-null  datetime64[ns]
 2   date_documented              80328 non-null  datetime64[ns]
 3   Year                         80328 non-null  int64         
 4   Month                        80328 non-null  int64         
 5   Hour                         80328 non-null  int64         
 6   Season                       80328 non-null  object        
 7   Country_Code                 80069 non-null  object        
 8   Country                      80069 non-null  object        
 9   Region                       79762 non-null  object        
 10  Locale                       79871 non-null  object        
 11  latitude                     80328 non-nu

In [5]:
round(df.describe(),2)

Unnamed: 0.1,Unnamed: 0,Date_time,date_documented,Year,Month,Hour,latitude,longitude,length_of_encounter_seconds
count,80328.0,80328,80328,80328.0,80328.0,80328.0,80328.0,80328.0,80328.0
mean,40163.5,2004-05-17 07:19:24.235882880,2007-07-28 02:52:52.990737920,2003.85,6.84,15.53,38.12,-86.77,9017.34
min,0.0,1906-11-11 00:00:00,1998-03-07 00:00:00,1906.0,1.0,0.0,-82.86,-176.66,0.0
25%,20081.75,2001-08-02 22:25:00,2003-11-26 00:00:00,2001.0,4.0,10.0,34.13,-112.07,30.0
50%,40163.5,2006-11-22 05:57:00,2007-11-28 00:00:00,2006.0,7.0,19.0,39.41,-87.9,180.0
75%,60245.25,2011-06-21 03:30:00,2011-10-10 00:00:00,2011.0,9.0,21.0,42.79,-78.76,600.0
max,80327.0,2014-05-08 18:45:00,2014-05-08 00:00:00,2014.0,12.0,23.0,72.7,178.44,97836000.0
std,23188.84,,,10.43,3.23,7.75,10.47,39.7,620232.23


## 2 Clean Data

Cleaning will require an analysis into potential missing and duplicated values. Once this done, the data will be refined to give more context for an analysis. Drop unnecessary columns. Make values uniform and readable. Check for null and duplicate values.

### 2.1 Drop and Rename Columns

In [6]:

# Drop encounter_duration (we will use 'length_of_encounter_seconds' instead)
df = df.drop(columns='Encounter_Duration')

# Rename duration, unnamed, Date-time and documented_date to clearer titles
df = df.rename(columns= {'length_of_encounter_seconds':'duration_secs', 'Unnamed: 0':'id', 'Date_time':'encounter_date', 'date_documented':'reported_date'})

### 2.2 Make Values Uniform and Readable

In [7]:
# Make column names lowercase for easier referencing
df.columns = df.columns.str.lower()

# Make description lower case for easier searching 
df['description']= df['description'].str.lower()

# Change month number rows to month name
df['month'] = df['month'].apply(lambda x: calendar.month_name[x])

# Make description text more readable by providing appropriate replacement values
# Replace '&#44' with "'"
df['description'] = df['description'].str.replace('&#44', "")

# Replace '&#39' with "'"
df['description'] = df['description'].str.replace('&#39', "")

# Replace '&#33' with "."
df['description'] = df['description'].str.replace('&#33', '.')

# Replace '&amp;' with "&"
df['description'] = df['description'].str.replace('&amp;', '.')

# Replace '&quot;' with "'"
df['description'] = df['description'].str.replace('&quot;', "'")

### 2.3 Replace Missing Values

Missing values exist for countries, regions, locales, and ufo shapes. These only comprise of a small amount of our total dataset, and information without this data is still valuable. However, we will attempt to catch missing values using two process. The first will locate locations using the longitude and latitude data and the second will compare descriptions with words in these categories to see if missing data can be added.

### 2.3.1 Find Columns with Null Values

In [8]:
# Find a sum of null values
df.isnull().sum()

id                   0
encounter_date       0
reported_date        0
year                 0
month                0
hour                 0
season               0
country_code       259
country            259
region             566
locale             457
latitude             0
longitude            0
ufo_shape         1930
duration_secs        0
description         15
dtype: int64

### 2.3.2 Reverse Search with Co-ordinate Data

Missing values will be found following a process that will:

1. Create a new dataframe for null values within a given column.

2. Locate and save longitude and lattitude data for each row with missing data.

3. Save values.

4. Merge back to original dataframe.

Functions will be created to accomplish this task and then applied to the country, region, and locale columns.

#### 2.3.2.1 Create Functions

In [9]:
# Allow for geolocator by passing user agent details
geolocator = Nominatim(user_agent="Jupiter Notebook")

# Save longitude and latitudes for each row of locale_null
def reverse_coordinates(dataframe, column, location_type):

    # 1. Create subset of df where column is null
    col_null = dataframe[dataframe[column].isnull()]

    # 2. Locate and save longitude and lattitude data for each row with missing data
    for i in range(len(col_null)):
        lat = col_null.iloc[i,col_null.columns.get_loc('latitude')].astype(str) # Locate latitude, save as string to combine
        long = col_null.iloc[i,col_null.columns.get_loc('longitude')].astype(str)# Locate longitude, save as string to combine
        
    # 3. Find location given longitude and latitude using geolocator, set language to English
        location = geolocator.reverse(lat+","+long, language='en')
        
        # Where location is given, find 
        if location is not None: # Where a location is given
            address = location.raw['address']
            col_null.iloc[i,col_null.columns.get_loc(column)] = address.get(location_type)
            
        # Save col_null to skip time consuming cell execution
        col_null.to_csv('./data/'+column+'_null.csv', index=False)
    
    return col_null



In [10]:
# 4. Merge back to original dataframe 
def merge_locations(dataframe, col_null, column):
    # Print sum of missing values to compare before
    print('Missing values before merge = ', df[column].isnull().sum())
    
    # Replace null values in original dataframe with values from col_null
    for i in range(len(col_null)):
        id_to_match = col_null.iloc[i, 0]
        value_to_merge = col_null.iloc[i, dataframe.columns.get_loc(column)]
        dataframe.loc[dataframe['id'] == id_to_match, column] = value_to_merge
    
    # Print sum of missing values again to compare after merge
    print('Missing values after merge = ', df[column].isnull().sum())
    
    return dataframe

#### 2.3.2.2 Country

Run both of the newly created functions to fill in missing data

In [11]:
# Run reverse_coordinates for country data
# Note: running this cell can take upto 10 mins, output saved and imported in next cells
#country_null = reverse_coordinates(df, 'country', 'country')

In [12]:
# Read country_null to skip last cell execution
country_null = pd.read_csv('./data/country_null.csv')

In [13]:
# merge locations for country data
df = merge_locations(df, country_null, 'country')

Missing values before merge =  259
Missing values after merge =  69


#### 2.3.2.3 Region

Repeat the same process for region

In [14]:
# Run reverse_coordinates for region data
# Note: running this cell can take upto 10 mins, output saved and imported in next cell (can skip this one if needed)
#region_null = reverse_coordinates(df, 'region', 'state')

In [15]:
# Read region_null to skip last cell execution
region_null = pd.read_csv('./data/region_null.csv')

In [16]:
# 4. Merge back to original dataframe 
def merge_locations(dataframe, col_null, column):
    # Print sum of missing values to compare before
    print('Missing values before merge = ', dataframe[column].isnull().sum())
    
    # Replace null values in original dataframe with values from col_null
    for i in range(len(col_null)):
        id_to_match = col_null.iloc[i, 0]
        value_to_merge = col_null.iloc[i, dataframe.columns.get_loc(column)]
        dataframe.loc[dataframe['id'] == id_to_match, column] = value_to_merge
    
    # Print sum of missing values again to compare after merge
    print('Missing values after merge = ', dataframe[column].isnull().sum())
    
    return dataframe

In [17]:
# merge locations for region data
df = merge_locations(df, region_null, 'region')

Missing values before merge =  566
Missing values after merge =  372


#### 2.3.2.4 Locale

In [18]:
#Note: running this cell can take upto 10 mins, output saved and imported in next cell (can delete this one if needed)
#locale_null = reverse_coordinates(df, 'locale', 'city')

In [19]:
# Read country_null to skip last cell execution
locale_null = pd.read_csv('./data/locale_null.csv')

In [20]:
# run df through merge locations, should be 315
df = merge_locations(df, locale_null, 'locale')

Missing values before merge =  457
Missing values after merge =  315


### 2.3.3 Check Description for Insights

The second process will look to the descriptions to see if missing data can be added. This will be done for all location columns, and also the ufo shape column. Do this using a five step process:

1. Create a new dataframe with null values.

2. Create a list to compare the unique values to.

3. Compare the two sets using the set() method.

4. Determine which values should be integrated.

5. Merge appropriate values.

**Note:** If more time was granted, each description piece would be read to allow for values that might not yet exist in the unique values set.

#### 2.3.3.1 Create Functions

In [21]:
# Create description_match function that does the first 3 steps
def description_match(dataframe, column_name):
    # 1. Create null dataframe
    null_df= dataframe[dataframe[column_name].isna()]
    #drop where no description exists
    null_df= null_df.dropna(subset='description')
    
    # Make description lower case for easier searching 
    df['description']= df['description'].str.lower()
    
    # 2. Create unique list from column to compare to
    unique_list= np.array(dataframe[column_name].unique()).astype(str)
    #make np.array list lower case for easier searching
    unique_list= np.char.lower(unique_list)


    # 3. Compare both sets
    # Create set list from unique list to be able to compare intersection
    list_set = set(unique_list)
    
    # Create matches list that will contain matched data
    match = []
    
    # Create string set by splitting words from description
    for i in range(len(null_df)): 
        string_set = set(null_df.iloc[i,15].split())

        # Where one word matches, add index, match word and description to matches list
        if len(list_set.intersection(string_set))==1:
            match.append([null_df.iloc[i,0], list(list_set.intersection(string_set))[0],np.nan,null_df.iloc[i,15]])
        
        # Where two words match, add index, both match words, and description
        if len(list_set.intersection(string_set))== 2:
            match.append([null_df.iloc[i,0], list(list_set.intersection(string_set))[0], list(list_set.intersection(string_set))[1],null_df.iloc[i,15]])
            
    # Save list as dataframe
    potential_matches= pd.DataFrame(match)
    
    # Rename column names
    potential_matches= potential_matches.rename(columns={0:'id', 1:column_name, 2:column_name+'_2', 3:'description'})

    # Change column types to string
    potential_matches[column_name]= potential_matches[column_name].astype(str)
    potential_matches[column_name+'_2']= potential_matches[column_name+'_2'].astype(str)
    
    # Captialize string columns
    potential_matches[column_name] = potential_matches[column_name].apply(lambda x: x.capitalize())
    potential_matches[column_name+'_2'] = potential_matches[column_name+'_2'].apply(lambda x: x.capitalize())
    
    # Return potential_matches
    return potential_matches

In [22]:
#4. Determine which values should be integrated
def drop_edit_rows(potential_matches, rows_to_drop, rows_to_edit=[], new_string=[]):
    
    #edit rows
    for i in range(len(rows_to_edit)):
        potential_matches.iloc[rows_to_edit[i],1] = new_string[i]
    
    #drop rows
    matches = potential_matches.drop(labels= rows_to_drop)

    return matches

In [23]:
def merge_matches(dataframe, matches, column_name):
    # Print sum of missing values to compare before
    print('Number of NaN values before merge =', dataframe[column_name].isna().sum())
    
    for i in range(len(matches)):
        id_to_match = matches.iloc[i, 0]
        value_to_merge = matches.iloc[i, 1]
        dataframe.loc[dataframe['id'] == id_to_match, column_name] = value_to_merge
    
    # Print sum of missing values again to compare after merge
    print('Number of NaN values after merge =', dataframe[column_name].isna().sum())
    
    return dataframe

#### 2.3.3.2 Countries

In [24]:
# Steps 1-3
# Allow for larger width to read description 
pd.options.display.max_colwidth = 10000

# Run description match on country column
country_matches= description_match(df, 'country')

# Print country_matches
country_matches

Unnamed: 0,id,country,country_2,description
0,26926,Japan,Nan,i was sitting in seat 47k (a window seat on the right side of the jet airliner) of japan airlines flight jl 060 on feb 16 2006 on my
1,39914,Mexico,Nan,three objects in the sky in the pacific ocean off the coast of mexico or usa on a cruise ship.


**Step 4**:
As each description clearly relates to a country location, run merge_matches function.

In [25]:
# 5. Merge appropriate values
# Run merge_match on all matches from above
df= merge_matches(df, country_matches, 'country')

Number of NaN values before merge = 69
Number of NaN values after merge = 67


In [26]:
# As country data is now complete as possible, link missing country codes
# Save country_codes
country_codes=df[['country_code','country']]

# Group by 'country' and fill missing values in 'country_code' with the mode of each group
filled_country_codes = country_codes.copy()  # Create a copy of the dataFrame to avoid copyerror
filled_country_codes['country_code'] = country_codes.groupby('country')['country_code'].transform(lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else np.nan))

# Merge the filled_country_codes DataFrame back to the original DataFrame based on the index
df['country_code'] = filled_country_codes['country_code']

#### 2.3.3.3 Regions

In [27]:
# Steps 1-3
# Run description match on region column
potential_region=description_match(df, 'region')
potential_region

Unnamed: 0,id,region,region_2,description
0,12980,Bali,Nan,silent ufo hanging over the ocean bali
1,19056,Aegean,Nan,1/2/2000 aegean sea
2,25137,Southeast,Nan,two brothers witness 10 to 15 spherical objects traveling southeast toward tokyo.
3,53934,Centre,Nan,three distinct glowing orbs over the sony centre berlin. brilliant white in colour daylight other events night time. three people saw t
4,60108,Cebu,Nan,i am looking at the clear skies near near the airplane path from davao city to cebu city. i was at my verandah in my home when i saw t
5,63181,Florida,Nan,july 6 2007 aboard the carnival liberty atlantic ocean south of florida keys observed three round blue/green objects.
6,68274,Centre,Nan,three light spots converging to centre and disappearing at amazing speed


**Step 4:** Some rows are problematic and need to be deleted. For instance England is a country, not a region, the same goes for "centre" in row 15. Others need to be editted to reflect their true region such as rows 6, 7, 9, and 15. Information is saved in lists below.

In [28]:
# Save lists to edit potential_matches dataframe
# rows to drop
drop_regions= [6]

#regions to edit
edit_regions= [2, 3]

#strings to edit regions
new_string_regions = ['Kanto', 'Berlin']

In [29]:
# Pass through the drop_edit_rows function
region_matches= drop_edit_rows(potential_region, drop_regions, edit_regions, new_string_regions)

# print result
region_matches

Unnamed: 0,id,region,region_2,description
0,12980,Bali,Nan,silent ufo hanging over the ocean bali
1,19056,Aegean,Nan,1/2/2000 aegean sea
2,25137,Kanto,Nan,two brothers witness 10 to 15 spherical objects traveling southeast toward tokyo.
3,53934,Berlin,Nan,three distinct glowing orbs over the sony centre berlin. brilliant white in colour daylight other events night time. three people saw t
4,60108,Cebu,Nan,i am looking at the clear skies near near the airplane path from davao city to cebu city. i was at my verandah in my home when i saw t
5,63181,Florida,Nan,july 6 2007 aboard the carnival liberty atlantic ocean south of florida keys observed three round blue/green objects.


In [30]:
# 5. Merge appropriate values

# Run merge_match on all matches from above
df= merge_matches(df, region_matches, 'region')

Number of NaN values before merge = 372
Number of NaN values after merge = 366


#### 2.3.3.4 Locale

In [31]:
# Steps 1-3
# Run description match on locale column
locale_matches= description_match(df, 'locale')

# Show results
locale_matches

Unnamed: 0,id,locale,locale_2,description
0,515,Bright,Nan,bright object seemingly appeared out of nowhere in the indian ocean miles from any land.
1,1882,White,Nan,october 1990 - saudi arabia - large dark triangular object - 3 white lights - 20 seconds - no sound - no stars behind object
2,4013,Macedonia,Nan,report from macedonia
3,4212,Blue,Nan,blue colour sphere was obsereved from containershipdia abt 4mtrsfrom dist of abt 8mtr.moved after twds italian coast.
4,4274,Bright,Nan,bright flash lightens up night sky object fires across sky at unbelievable speed&#8230; ((nuforc note: possible meteor. pd))
...,...,...,...,...
118,74114,Trail,Nan,a flash movin slow at firstchanges into a spherical shapecirclesraces westleavin behind a hazy greenish smoke trail
119,74303,Orange,Nan,flashing object changes colour . second obect- orange sphere does nothing except sit there.
120,76278,Bow,Nan,the light clearly lit up the bow of the vessel where no light should have been in the middle of the atlantic.
121,76932,Center,Nan,ship object was landed at the center of tol road approx 60 km from jakarta to cikampek in dawuan place


**Step 4:** As ample locale names are supplied, many irrelevant matches arise. For instance, Where the light has been described as bright, a match exists. Filter these out.

In [32]:
# create lists to edit potential_matches dataframe
locale_matches= locale_matches[(locale_matches['locale'] != 'Bright') & (locale_matches['locale'] != 'Blue')
                               & (locale_matches['locale_2'].isna()==False) & (locale_matches['locale'] != 'White')
                               & (locale_matches['locale'] != 'Edge') & (locale_matches['locale'] != 'Center')
                               & (locale_matches['locale'] != 'Bow') & (locale_matches['locale'] != 'Oblong')
                               & (locale_matches['locale'] != 'Tiny') & (locale_matches['locale'] != 'West')
                               & (locale_matches['locale'] != 'Orange') & (locale_matches['locale'] != 'Star')]

In [33]:
# Reset index
locale_matches= locale_matches.reset_index()

In [34]:
# Drop index column
locale_matches= locale_matches.drop(columns='index')

##print all results using .to_string to view in text editor for deeper analysis
print(locale_matches.to_string())

       id      locale  locale_2                                                                                                                                  description
0    4013   Macedonia       Nan                                                                                                                        report from macedonia
1    4292        Hull       Nan                                                               3 red lights flying fast in loose triangular formation hull east yorkshire uk.
2   11333    Brisbane       Nan                                                                    3 disc like crafts sighted on the eve of 2013 in brisbane area australia.
3   14770        Lake    Bright                             at the approx time of 8.15pm a bright circular red light was spotted north east of lake macquarie nsw australia.
4   17450       Trail      Blue        starfox shaped craft with blue glow and smoke/debris trail behind with no sound within 2500m and

In [35]:
# Save lists to edit matches dataframe
# note: each time the jypter notebook is re-opened, the length varies. 
# Why?? I do not know... a list of id's to keep will be made instead
# Locales to keep (current index when matches_null is 48)
#remaining_locales= [0, 1, 2, 3, 5, 6, 7, 8, 9, 12, 14, 15, 19, 20, 22, 26, 28, 31, 39, 45, 47]

# create list of id's to keep
#keep_locales_id=[]

# create for loop that saves the id of each row to keep
#for i in range(len(remaining_locales)):
    #keep_locales_id.append(locale_matches.iloc[remaining_locales[i],0])
    
# Save ids to edit locales_matches (as each time the notebook is re-opened, the length of locales_matches varies)
keep_locales_id = [4013, 4292, 11333, 14770, 18592, 
                   20065, 20878, 22674, 22676, 23882, 
                   25671, 26842, 30233, 30355, 30823, 
                   40933, 43019,44275, 65926, 71325, 73408]

# Slice locales_matches to only include rows to keep
locale_matches= locale_matches[locale_matches['id'].isin(keep_locales_id)]

In [36]:
# Locales to edit (current index when matches_null is 48)
#edit_locales= [3, 7, 20, 22, 26, 28, 39, 45]

# Loop to find locales to edit id numbers
#edit_locales_id=[]
#for i in range(len(edit_locales)):
    #edit_locales_id.append(locale_matches.iloc[edit_locales[i],0])

# Save id's to edit locales_matches (as each time the notebook is re-opened, the length of locales_matches varies)
edit_locales_id= [14770, 22674, 25671, 30233, 30355, 30823, 65926, 73408]

#strings to edit locales
new_string_locales = ['Macquarie Park', 'Hull', 'Huntington', 'Essex', 'Dubai', 'Cozumel','Nassau', 'Surrey']

# Edit locale_matches by replacing associated locales strings and id's
for i in range(len(edit_locales_id)):
    locale_matches.loc[df['id'] == edit_locales_id[i], 'locale'] = new_string_locales[i]

In [37]:
# 5.Merge appropriate values
df= merge_matches(df, locale_matches, 'locale')

Number of NaN values before merge = 315
Number of NaN values after merge = 295


#### 2.3.3.5 UFO Shape

In [38]:
# Steps 1-3
# Run description match on ufo_shape column
ufo_matches= description_match(df, 'ufo_shape')

#print all results using .to_string to view in text editor for deeper analysis
print(ufo_matches.to_string())

        id  ufo_shape ufo_shape_2                                                                                                                                     description
0       62      Light       Cross                    man  on hwy 43 sw of milwaukee sees large bright blue light streak by his car descend turn cross road ahead strobe. bizarre.
1       63      Light         Nan                   woman repts.  bright light in nw sky suddenly approaches her flies slowly overhead.  swept wings 2 blurry lights either side.
2      285      Light         Nan                                                                                             being  of light reportedjesus or another messenger.
3      294      Round         Nan                  young man . grandfather see a 'large orange round or oval' obj. move along horizon very fast hover move erratically.  bizarre.
4      436  Formation       Light                                                                    orange li

**Step 4:** Taking a quick look, description words are not always representative. The biggest culptrit is those of shape "Other". Cross has also been used inaccurately. Drop these.

In [39]:
# Find ufo_shape values where other or cross has been selected
ufo_other= ufo_matches[ufo_matches['ufo_shape']=='Other']['id']
ufo_cross= ufo_matches[ufo_matches['ufo_shape']=='Cross']['id']

# Combine other and cross values, find index values
drop_ufo = pd.concat([ufo_cross, ufo_other]).index

In [40]:
# Pass through the drop_edit_rows function
drop_edit_rows(ufo_matches, drop_ufo)

Unnamed: 0,id,ufo_shape,ufo_shape_2,description
0,62,Light,Cross,man on hwy 43 sw of milwaukee sees large bright blue light streak by his car descend turn cross road ahead strobe. bizarre.
1,63,Light,Nan,woman repts. bright light in nw sky suddenly approaches her flies slowly overhead. swept wings 2 blurry lights either side.
2,285,Light,Nan,being of light reportedjesus or another messenger.
3,294,Round,Nan,young man . grandfather see a 'large orange round or oval' obj. move along horizon very fast hover move erratically. bizarre.
4,436,Formation,Light,orange light formation over monroe ct 10/11/11--hangs in sky then flys away.
...,...,...,...,...
667,79202,Light,Nan,i was outside on my back patio when i looked toward the eastern sky and saw a massively bright light rising in altitude. i asked my br
668,79324,Light,Nan,bright stationary light in clear blue sky fading and reappearing in different locations
669,79820,Dome,Light,blue light 'explosion' as if viewed within a dome or planetarium
670,79866,Light,Nan,strange bright flickering light in the w sky over lake michigan. flashing an moving both hor. and vert. extremley fast. ((arcturus?))


In [41]:
# 5.Merge appropriate values
# Run merge_match on all matches from above
df= merge_matches(df, ufo_matches, 'ufo_shape')

Number of NaN values before merge = 1930
Number of NaN values after merge = 1257


In [42]:
# Clean up ufo_shape column
# Replace NaN and 'Unknown' values with 'Other'
df['ufo_shape']= df['ufo_shape'].fillna('Other')
df['ufo_shape']= df['ufo_shape'].replace('Unknown', 'Other')

#### 2.3.3.6 Null Values Summary

Using the coordinate and description methods has lead to a decrease in null values of over 1000 values. Missing data for each column is less than 0.5%, except ufo_shape which has 1.6% of values missing.


In [43]:
#Percentage of values that are null
round(df.isnull().sum()/len(df),3)*100

id                0.0
encounter_date    0.0
reported_date     0.0
year              0.0
month             0.0
hour              0.0
season            0.0
country_code      0.1
country           0.1
region            0.5
locale            0.4
latitude          0.0
longitude         0.0
ufo_shape         0.0
duration_secs     0.0
description       0.0
dtype: float64

### 2.3.4 Duplicates

Each entry is treated with an individual ID, so duplicates on every metric will be zero. However, duplicates may be found if two of the same events are reported. The chances that the exact string description and the local area it is reported in is the exact same is low. Thus, duplicates will be dropped on these metrics.

In [44]:
# Show all duplicates
print('Duplicates on all metrics= ',df.duplicated().sum())

# Print the length of duplicated rows based on description and locale
print('Duplicates on description and locale= ',len(df[df.duplicated(['description','locale'])]))

# Drop these rows
df= df.drop_duplicates(['description','locale'])

Duplicates on all metrics=  0
Duplicates on description and locale=  71


### 2.3.5 Odd Values

Investigate values where the duration is extreme.

In [45]:
# show values with high duration (over 3 days)
high_duration = df[df['duration_secs'] > 259000].sort_values(by='duration_secs', ascending=False)

# show description and duration_days
high_duration[['description','duration_secs']][:20]

Unnamed: 0,description,duration_secs
559,firstly i was stunned and stared at the object for what seemed minutes but probably was only seconds. my first inclination was to bec,97836000.0
53381,((hoax??)) i was out in a field near mil base heard a sounds . it sounded like a motor starting up.,82800000.0
74656,orange or amber balls or orbs of light multiplying and maneuvering beyond known and current aircraft abilities,66276000.0
38259,hi i&#8217;m writing to you because i wanted to talk to someone professional about my experiences. never have i heard anyone talk about wha,52623200.0
69211,bright stars moving erratically over the gulf of mexico,52623200.0
64386,there have been several flying objects in a period of about two months that look like an orb of white light (resembling a star). they m,52623200.0
52706,first time it was a bright light and missing time frome 10.45 untill 4 in morning with no real idea why. second time parked up in same,25248000.0
30595,sun city / menifee ufo sightings in 1994,10526400.0
6991,bright flying orb.,10526400.0
71168,this object was very high up and emmited no sound..,10526400.0


After reading multiple descriptions, encounters do not seem to occur for as long as stated. In some cases, reports occurred multiple times over multiple days, rather than last days. Others seem to be a mistake. At the very least, these seem like mistakes. Duration will be limited to 3 days.

In [46]:
# convert extreme duration values to one day (86400 seconds)
df['duration_secs'] = np.where(df['duration_secs'] > 86400, 86400, df['duration_secs'])

### 2.3.6 Add Useful Data

Data which will be useful includes data on time, ufo color and continent in which the encounter is reported.

#### 2.3.6.1 Duration and Reported Difference
The duration column currently accounts for encounters based on seconds. Create new columns to see how duration works on a minutely, hourly and daily basis. Create an age category based on how long ago the encounter was reported. Also create a column which shows the difference between when the encounter was reported and when the encounter actually occurred.

In [47]:
# Create duration in minutes column
df['duration_mins'] = df.loc[:,'duration_secs'] / 60

# Create duration in hours
df['duration_hours'] = df['duration_mins'] / 60

# Create duration in days 
df['duration_days'] = df['duration_hours'] / 24

In [48]:
# Convert encounter_date to datetime
df['encounter_date'] = pd.to_datetime(df['encounter_date'])

# Encounter year column
df['encounter_year'] = df['encounter_date'].dt.year

In [49]:
# Create reported difference column which is the difference in years between the reported and encountered dates
df['reported_diff'] = round((df['reported_date'].dt.year + df['reported_date'].dt.month/12)
                            - (df['encounter_date'].dt.year + df['encounter_date'].dt.month/12), 1)

In [50]:
df['reported_diff']

0        54.5
1        56.2
2        52.2
3        47.2
4        43.2
         ... 
80323     0.0
80324     0.0
80325     0.0
80326     0.0
80327     0.0
Name: reported_diff, Length: 80257, dtype: float64

In [51]:
# Create age column which is the difference in years between the reported date and the encounter date
df['age'] = round((df['reported_date'].dt.year + df['reported_date'].dt.month/12)
                            - (df['encounter_date'].dt.year + df['encounter_date'].dt.month/12), 1)

#### 2.3.6.2 Color Data
Add data which assigns a color based on the encounter. 

In [52]:
# Create list of color
colors={'white','yellow','orange','red','green','blue','purple','brown','silver','gold','gray','grey','black', 'amber', 'aqua','indigo','pink'}

# Create ufo_color column
df['ufo_color'] = np.nan

In [53]:
df.columns[22]

'ufo_color'

In [54]:
# Create loop to determine if details column provides insight into ufo colors
for i in range(len(df)): 
    if df.iloc[i,15] is not np.nan: # Where description is not null
        string_set = set(df.iloc[i,15].split()) # Split description
        
        if len(colors.intersection(string_set))==1: # where there is only one color:
            matched_color = list(colors.intersection(string_set))[0] # Let the color be the first mentioned color
            if matched_color not in (df.iloc[i,8:11]).values: # Ensure that the matched color is not referring to the country, region, or locale instead
                df.iloc[i,22] = matched_color
                
        if len(colors.intersection(string_set)) > 1: # where there are multiple colors
            if matched_color not in (df.iloc[i,8:11]).values: 
                # Ensure that the matched color is not referring to the country, region, or locale instead
                df.iloc[i,22] = 'multicolor'

  df.iloc[i,22] = matched_color


In [55]:
# Consolidate gray colors
df['ufo_color'] = df['ufo_color'].replace('grey','gray')

In [56]:
# Capitalize color column
df['ufo_color']= df['ufo_color'].str.capitalize()

In [57]:
# Clean up ufo_color column
# Replace NaN values with 'Other'
df['ufo_color']= df['ufo_color'].fillna('Other')

In [58]:
# Percentage of each color
round(df['ufo_color'].value_counts(dropna=False)/len(df),4)*100

ufo_color
Other         66.73
Orange         7.61
White          6.11
Multicolor     5.37
Red            4.69
Green          2.43
Blue           1.81
Black          1.77
Silver         1.29
Yellow         0.67
Amber          0.60
Gray           0.47
Gold           0.18
Pink           0.10
Brown          0.08
Purple         0.07
Aqua           0.00
Indigo         0.00
Name: count, dtype: float64

**Note:** As much of this data is still missing, this data will only supplement our analysis. Also this assumes that if a color is mentioned it is in reference to the ufo color. However, these colors should not be referencing the location name which has been accounted for.

#### 2.3.6.3 Continent Data

Create list of continents for their respective countries


In [59]:
# Unique countries
countries = df['country'].unique()

# Find values that contain "The"
the_countries = df[df['country'].str.contains('The ', na=False)]['country'].unique()

# Replace "The " with ""
df['country'] = df['country'].str.replace('The  ','')

In [60]:
# Create function to group countries into continents
def country_to_continent(country_name):
    if country_name == 'Kosovo':
        return 'Europe'  # Assign 'Europe' as the continent for Kosovo
    elif isinstance(country_name, str):
        try:
            country_alpha2 = pc.country_name_to_country_alpha2(country_name)
            country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
            country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
            return country_continent_name
        except Exception as e:
            return None
    #else:
        #return 'Unknown'

# Apply the function to create a new 'continent' column in the DataFrame
df['continent'] = df['country'].apply(country_to_continent)

In [61]:
df['continent'].value_counts(dropna=False)

continent
North America    74822
Europe            3388
Asia               859
Oceania            740
South America      205
Africa             165
None                78
Name: count, dtype: int64

#### 2.3.6.4 Season, Time of Day and Day of Week
Create a season column which will assign each sighting with one of the four seasons

In [62]:
# Create a column for seasons
df['season'] = np.nan

# Fill column according to season
df.loc[df['month'].isin(['December','January','February']),'season'] = 'Winter'
df.loc[df['month'].isin(['March','April','May']),'season'] = 'Spring'
df.loc[df['month'].isin(['June','July','August']),'season'] = 'Summer'
df.loc[df['month'].isin(['September','October','November']),'season'] = 'Autumn'

# Create a column for time of day
df['time_of_day'] = np.nan

# Fill column according to time of day
df.loc[df['encounter_date'].dt.hour.isin(range(6,12)),'time_of_day'] = 'Morning'
df.loc[df['encounter_date'].dt.hour.isin(range(12,18)),'time_of_day'] = 'Afternoon'
df.loc[df['encounter_date'].dt.hour.isin(range(18,24)),'time_of_day'] = 'Evening'
df.loc[df['encounter_date'].dt.hour.isin(range(0,6)),'time_of_day'] = 'Night'

# Create a column for day of week
df['day_of_week'] = df['encounter_date'].dt.day_name()

  df.loc[df['month'].isin(['December','January','February']),'season'] = 'Winter'
  df.loc[df['encounter_date'].dt.hour.isin(range(6,12)),'time_of_day'] = 'Morning'


In [63]:
# Save cleaned data to csv
#df.to_csv('../ufo-sightings-cleaned.csv', index=False)

# load cleaned data
df = pd.read_csv('../ufo-sightings-cleaned.csv', parse_dates=['encounter_date','reported_date'])

## 3 Analysis

After the data is ordered and colorized, an analysis will visualize the data to answer three main questions. 

The first looks to the timing. When are these reports occurring? The second looks to the validity of the reports. Do the details of the story might change over time? Finally the last question will look to where these encounters happen and will look to see if any similiarities may be drawn looking to shape and color.

### 3.1 Ordering & Colorizing
Order and set columns to use as the legend argument in plotly


In [64]:
# Order months
months_ordered = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

# Define colors for each month
month_colors = {
    'January': 'blue',
    'February': 'red',
    'March': 'green',
    'April': 'orange',
    'May': 'purple',
    'June': 'brown',
    'July': 'cyan',
    'August': 'magenta',
    'September': 'lime',
    'October': 'hotpink',
    'November': 'teal',
    'December': 'yellow'
}

# Order season
season_ordered = ['Winter', 'Spring', 'Summer', 'Autumn']

# Order colors for each season
season_colors = {
    'Winter': 'blue',
    'Spring': 'green',
    'Summer': 'orange',
    'Autumn': 'red'
}

# Define ufo colors
ufo_color_ordered = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Purple', 'Brown', 'Gray', 'Black', 'White', 'Multicolor']

# Define colors for each ufo color
ufo_color_colors = {
    'Red': 'red',
    'Orange': 'orange',
    'Yellow': 'yellow',
    'Green': 'green',
    'Blue': 'blue',
    'Purple': 'purple',
    'Brown': 'brown',
    'Gray': 'gray',
    'Black': 'black',
    'White': 'white',
    'Multicolor': 'hotpink'
}

# Order time of day
time_of_day_ordered = ['Morning', 'Afternoon', 'Evening', 'Night']
    
# Define colors for time of day
time_of_day_colors = {
    'Morning': 'salmon',
    'Afternoon': 'peachpuff',
    'Evening': 'powderblue',
    'Night': 'slategrey'
}

# Order day of week
day_of_week_ordered = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Define colors for day of week
day_of_week_colors = {
    'Monday': 'blue',
    'Tuesday': 'red',
    'Wednesday': 'green',
    'Thursday': 'orange',
    'Friday': 'purple',
    'Saturday': 'brown',
    'Sunday': 'cyan'
}

# Order continents
continent_ordered = ['North America', 'Europe', 'Asia', 'South America', 'Africa', 'Oceania', 'Antarctica']

# Define colors for continents
continent_colors = {
    'North America': 'blue',
    'Europe': 'red',
    'Asia': 'green',
    'South America': 'orange',
    'Africa': 'purple',
    'Oceania': 'brown',
    'Antarctica': 'cyan'
}

### 3.2 Timing

What days of the week and months of the year and seasons have the most sightings? Each will be displayed as a line graph

#### 3.2.1 Days of the Week

In [65]:
# Create series for each time of day
morning = df[df['time_of_day']=='Morning']['day_of_week'].value_counts().reindex(day_of_week_ordered)
afternoon = df[df['time_of_day']=='Afternoon']['day_of_week'].value_counts().reindex(day_of_week_ordered)
evening = df[df['time_of_day']=='Evening']['day_of_week'].value_counts().reindex(day_of_week_ordered)
night = df[df['time_of_day']=='Night']['day_of_week'].value_counts().reindex(day_of_week_ordered)

# Create day of week dataframe
day_of_week = df['day_of_week'].value_counts().reindex(day_of_week_ordered)

# Rename count to total
day_of_week = day_of_week.rename('total')


In [66]:
# Create line graph of ufo sightings per time of day per day of week
fig = px.line(day_of_week, x=day_of_week.index, y='total', title='UFO Sightings per Day of the Week')

# Add line graph for morning sightings
fig.add_scatter(x=morning.index, 
                y=morning.values, 
                name='Morning', 
                line=dict(color='salmon', width=2, dash='dot'))

# Update line color
fig.update_traces(line_color='purple')

# Add line graph for afternoon sightings
fig.add_scatter(x=afternoon.index, 
                y=afternoon.values, 
                name='Afternoon', 
                line=dict(color='peachpuff', width=2, dash='dot'))

# Add line graph for evening sightings
fig.add_scatter(x=evening.index, 
                y=evening.values, 
                name='Evening', 
                line=dict(color='powderblue', width=2, dash='dot'))

# Add line graph for night sightings
fig.add_scatter(x=night.index, 
                y=night.values, 
                name='Night', 
                line=dict(color='slategrey', width=2, dash='dot'))

# Update layout
fig.update_layout(
    title={
        'text': "UFO Sightings per Day of the Week",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Day",
    yaxis_title="Number of Sightings",
    legend_title="Time of Day",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="RebeccaPurple"
    )
)

# Show graph
fig.show()


### 3.2.2 Months of the Year

In [67]:
# Create months dataframe
months = df['month'].value_counts().reindex(months_ordered)

# Rename count to total
months = months.rename('total')

# Create series for each time of day per month
morning = df[df['time_of_day']=='Morning']['month'].value_counts().reindex(months_ordered)
afternoon = df[df['time_of_day']=='Afternoon']['month'].value_counts().reindex(months_ordered)
evening = df[df['time_of_day']=='Evening']['month'].value_counts().reindex(months_ordered)
night = df[df['time_of_day']=='Night']['month'].value_counts().reindex(months_ordered)

In [68]:
# Create line graph of ufo sightings per month
fig = px.line(months, 
              title='UFO Sightings per Month', 
              labels={'index':'Month', 'value':'Number of Sightings'})

# Update line color
fig.update_traces(line_color='purple')

# Add line graph for morning sightings
fig.add_scatter(x=morning.index, 
                y=morning.values, 
                name='Morning', 
                line=dict(color='salmon', width=2, dash='dot'))

# Add line graph for afternoon sightings
fig.add_scatter(x=afternoon.index, 
                y=afternoon.values, 
                name='Afternoon', 
                line=dict(color='peachpuff', width=2, dash='dot'))

# Add line graph for evening sightings
fig.add_scatter(x=evening.index,
                y=evening.values, 
                name='Evening', 
                line=dict(color='powderblue', width=2, dash='dot'))

# Add line graph for night sightings
fig.add_scatter(x=night.index,
                y=night.values, 
                name='Night', 
                line=dict(color='slategrey', width=2, dash='dot'))

# Update layout
fig.update_layout(
    title={
        'text': "UFO Sightings per Month",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Month",
    yaxis_title="Number of Sightings",
    legend_title="Time of Day",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="RebeccaPurple"
    )
)

# Show figure
fig.show()

#### 3.2.3 UFO Sightings per season per day of week

In [69]:
# Create seasons dataframe
seasons = df['season'].value_counts().reindex(season_ordered)

# Rename count to total
seasons = seasons.rename('total')

# Create series for each time of day per season
morning = df[df['time_of_day']=='Morning']['season'].value_counts().reindex(season_ordered)
afternoon = df[df['time_of_day']=='Afternoon']['season'].value_counts().reindex(season_ordered)
evening = df[df['time_of_day']=='Evening']['season'].value_counts().reindex(season_ordered)
night = df[df['time_of_day']=='Night']['season'].value_counts().reindex(season_ordered)

In [70]:
# Create line graph of ufo sightings per season
fig = px.line(seasons, 
              title='UFO Sightings per Season', 
              labels={'index':'Season', 'value':'Number of Sightings'})

# Update line color
fig.update_traces(line_color='purple')

# Add line graph for morning sightings
fig.add_scatter(x=morning.index, 
                y=morning.values, 
                name='Morning', 
                line=dict(color='salmon', width=2, dash='dot'))

# Add line graph for afternoon sightings
fig.add_scatter(x=afternoon.index, 
                y=afternoon.values, 
                name='Afternoon', 
                line=dict(color='peachpuff', width=2, dash='dot'))

# Add line graph for evening sightings
fig.add_scatter(x=evening.index,
                y=evening.values, 
                name='Evening', 
                line=dict(color='powderblue', width=2, dash='dot'))

# Add line graph for night sightings
fig.add_scatter(x=night.index,
                y=night.values, 
                name='Night', 
                line=dict(color='slategrey', width=2, dash='dot'))

# Update layout
fig.update_layout(
    title={
        'text': "UFO Sightings per Season",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Season",
    yaxis_title="Number of Sightings",
    legend_title="Time of Day",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="RebeccaPurple"
    )
)

# Show figure
fig.show()


#### 3.2.4 Conclusion

Saturday has the most encounters per week, July has the most per month, and Summer has the most encounters per season. Most encounters happen in the evening, followed by night, afternoon and finally mornings.

### 3.3 Report Day Lag

Validity may be tested by comparing the relationship of the duration of an encounter with the encounter-report age lag (the difference between the encounter date and the date it was reported). In short, this answers whether the details of the story might change over time. Where there is a large encounter-report date gap, perhaps the duration is being reported longer than those with a smaller gap.

#### 3.3.1 Histogram of UFO Sightings Duration

In [71]:
# Create slice where encounter date is from the 1990's onwards (not many before that), disclude 2014 as it is incomplete
df_90s_onwards = df[(df['encounter_year'] >= 1980) & (df['encounter_year'] < 2014)]

# Create histogram of ufo sightings duration per continent
fig = px.histogram(df_90s_onwards, 
                   x='encounter_year', 
                   color='season', 
                   title='UFO Sightings Duration per Continent', 
                   labels={'duration_secs':'Duration (minutes)', 'continent':'Continent'},
                   color_discrete_map=season_colors,
                   category_orders={'season':season_ordered})

# Update layout
fig.update_layout(
    title={
        'text': "UFO Sightings Count and Duration per Continent",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Duration (minutes)",
    yaxis_title="Number of Sightings",
    legend_title="Continent",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="RebeccaPurple"
    )
)

# Show figure
fig.show()

#### 3.3.2 Linegraph

In [72]:
# Group data by reported difference and find the median
reported_diff_duration = df.groupby('reported_diff')['duration_secs'].agg(['median', 'count']).reset_index()

# Limit data to the first 50 years as the reliability of data after this point is questionable
first_50 = reported_diff_duration[reported_diff_duration['reported_diff'] < 50]

# Create line graph of reported difference and median duration
fig = px.line(first_50, 
              x="reported_diff", 
              y='median', 
              hover_data=['count'])

# Update titles
fig.update_layout(
    title={
        'text': "Report Date Lag and Median Encounter Duration",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Reported Difference (years)",
    yaxis_title="Median Duration (secs)",
    font=dict(
        family="Courier New, monospace",
        size=12,
        color="black")
)

# Show Figure
fig.show()

#### 3.3.3 Scatterplot

In [73]:
# Create df2 where duration_day is less than 1 to exclude outliers
df2 = df[df['duration_hours'] < 24]

# Create scatter plot with specific colors for each month
fig = px.scatter(df2, 
                 x='reported_diff', 
                 y='duration_hours', 
                 category_orders={'time_of_day': season_ordered},
                 color_discrete_map=season_colors, 
                 opacity=0.5
                 )

# Update titles
fig.update_layout(
    title={
        'text': "Report Date Lag and Encounter Duration",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Reported Difference (years)",
    yaxis_title="Duration (hours)",
    legend_title='Season',
    font=dict(
        family="Courier New, monospace",
        size=14,
        color="black")
)

# Show figure
fig.show()

#### 3.3.4 Conclusion

Encounters are increasing at a dramatic rate. More encounters are now occurring in Summer when Autumn used to have more. However, this could be due to a lack of data that spikes up this data.

Report day lag seems to relay that the higher the lag, the longer the duration. This may occur due to a number of reasons. While it is possible that the details of the story may change, reports with the same lag but smaller duration may not have been reported. Further analysis is needed.

In [75]:
# Save cleaned data to csv home directory
df.to_csv('../ufo-sightings-cleaned.csv', index=False)