# **Analysis of UFO sightings dataset** to learn data analysis using Python

**Project goal:**
I want to explore, clean, and analyze UFO sightings from the Kaggle dataset using Python. The aim is to learn the Python data analysis workflow and become comfortable using it.

**Overview of ufo_dataset.csv:**
*   datetime - date and time information about the sighting
*   city - city where the sighting occurred
*   state - state where the sighting occurred
*   country - country where the sighting occurred
*   shape - shape of the UFO
*   duration (seconds) - duration of the sighting in seconds
*   duration (hours/min) - duration of the sighting in hours/minutes
*   comments - comment describing the sighting
*   date posted - date the sighting report was made public
*   latitude - latitude information
*   longitude - longitude information

**Process steps:**
1.   Getting data
2.   Processing data







# **Step 1. Getting data**

Let's import the Python libraries we need to work with our data.

In [4]:
import pandas as pd
import numpy as np

Let's read data from our CSV file.

In [5]:
# Read the csv file using pandas read_csv
df = pd.read_csv('./drive/MyDrive/datasets/ufo_dataset.csv')



Let's look at the first 10 rows of our dataset.

In [6]:
# Get first 10 records of the dataset
df.head(10)

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
5,10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.595,-82.188889
6,10/10/1965 21:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2/14/2006,51.434722,-3.18
7,10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/02/1999,41.1175,-73.408333
8,10/10/1966 20:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,3/19/2009,33.5861111,-86.286111
9,10/10/1966 21:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,05/11/2005,30.2947222,-82.984167


Let's see general information about our dataset.

In [7]:
# General information about the dataset using df.info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  object 
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80317 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  object 
 10  longitude             80332 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.7+ MB


**What can we see?**
In total there are 80332 entries in this dataset, which contains records of UFO sightings around the world. Some of the columns have missing values: **country, state, shape and comments**. The longitude column is complete, with 80332 non-null values.
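
The df.info() output above also lists duration (seconds) and latitude as object, even though the values we saw in the first rows look numeric. A minimal sketch of one way to convert such columns, using pd.to_numeric with errors="coerce" so that unparseable entries become NaN. The small frame below uses made-up values for illustration and is not taken from the dataset.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; the real data comes from read_csv
sample = pd.DataFrame({
    "duration (seconds)": ["2700", "20", "abc"],
    "latitude": ["29.8830556", "53.2", "not a number"],
})

# errors="coerce" replaces anything that cannot be parsed with NaN
for col in ["duration (seconds)", "latitude"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
print(sample.isnull().sum())
```

Counting the NaN values after the conversion shows how many entries could not be parsed, which is worth checking before deciding whether to fix or drop them.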

# **Step 2. Processing Data**

In this step we will take a closer look at our data. Whether we are renaming columns, removing duplicates, changing types or dealing with missing values, our mission during this step is to ensure the data is in its best shape for us to work with.

First we will create a copy of our data frame and store it in a variable df_processing.

In [8]:
# Create a copy of the imported dataframe
df_processing = df.copy()

Let's look at our column names.

In [9]:
# Retrieve column names from the dataset
df_processing.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude '],
      dtype='object')

We want to **remove spaces** and **symbols** from our column names, as this will make them easier to work with. We will also remove the trailing space in the name of our longitude column.

In [10]:
# Use rename with a dictionary to rename multiple columns at the same time
df_processing.rename(columns={
    'duration (seconds)': 'duration_seconds',
    'duration (hours/min)': 'duration_hours_min',
    'date posted': 'date_posted',
    'longitude ': 'longitude'
}, inplace=True)

# Let's check the resulting column names
df_processing.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration_seconds',
       'duration_hours_min', 'comments', 'date_posted', 'latitude',
       'longitude'],
      dtype='object')
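
For datasets with many columns, writing the rename dictionary by hand gets tedious. As an alternative, here is a minimal sketch of a helper that applies the same idea to any column name. The function name normalize_column is hypothetical and not part of pandas.

```python
import re

def normalize_column(name: str) -> str:
    """Trim, lowercase, and collapse runs of non-alphanumeric characters into one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    # Drop underscores left over at the edges, e.g. from a closing bracket
    return cleaned.strip("_")

columns = ["duration (seconds)", "duration (hours/min)", "date posted", "longitude "]
normalized = [normalize_column(c) for c in columns]
print(normalized)
```

It could be applied to the whole data frame with df_processing.columns = [normalize_column(c) for c in df_processing.columns]. For the four columns above it produces the same names as the manual rename.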

Let's also check our dataset for any duplicates.

In [11]:
# Display number of duplicated records
df_processing.duplicated().sum()

np.int64(0)

Let's check our data frame for missing values.

In [12]:
# Display sum of missing values
df_processing.isnull().sum()

Unnamed: 0,0
datetime,0
city,0
state,5797
country,9670
shape,1932
duration_seconds,0
duration_hours_min,0
comments,15
date_posted,0
latitude,0


We can see that our **state** column is **missing 5797 values**, **country** is **missing 9670 values**, **shape** is **missing 1932 values** and **comments** is **missing 15 values**. The **longitude** column has no missing values, as df.info() reported 80332 non-null entries for it.
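
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch of a summary that shows both, keeping only the columns that have gaps. The helper name missing_summary and the demo frame are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = frame.isnull().sum()
    # The mean of a boolean mask is the fraction of True values
    percent = (frame.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Small made-up frame to show the shape of the result
demo = pd.DataFrame({
    "city": ["a", "b", "c", "d"],
    "state": ["tx", None, None, "ct"],
    "country": ["us", None, "gb", "us"],
})
summary = missing_summary(demo)
print(summary)
```

Calling missing_summary(df_processing) would give the same view for our data.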

Let's start by looking at the records that are missing country and see if we are able to fix our data by filling in the missing values.

In [13]:
# Select records that are missing country
df_processing[df_processing['country'].isnull()]

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,01/11/2002,32.364167,-64.678611
29,10/10/1979 22:00,saddle lake (canada),ab,,triangle,270,4.5 or more min.,Lights far above&#44 that glance; then flee f...,1/19/2005,53.970571,-111.689885
35,10/10/1982 07:00,gisborne (new zealand),,,disk,120,2min,gisborne nz 1982 wainui beach to sponge bay,01/11/2002,-38.662334,178.017649
40,10/10/1986 20:00,holmes/pawling,ny,,chevron,180,3 minutes,Football Field Sized Chevron with bright white...,10/08/2007,41.523427,-73.646795
...,...,...,...,...,...,...,...,...,...,...,...
80238,09/09/2009 14:15,broomfield?lafayette,co,,rectangle,120.0,2 min,Large&#44 rectangular object seen flying in br...,12/12/2009,39.993596,-105.089706
80244,09/09/2009 20:17,lyman,me,,light,600.0,10 mins,Two lights ran across the sky&#44 as bright as...,12/12/2009,43.505096,-70.637968
80319,09/09/2013 20:15,clifton,nj,,other,3600.0,~1hr+,Luminous line seen in New Jersey sky.,9/30/2013,40.858433,-74.163755
80322,09/09/2013 21:00,aleksandrow (poland),,,light,15.0,15 seconds,Two points of light following one another in a...,9/30/2013,50.465843,22.891814


In [14]:
# Select records that are missing state
df_processing[df_processing['state'].isnull()]

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
6,10/10/1965 21:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2/14/2006,51.434722,-3.180000
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,01/11/2002,32.364167,-64.678611
20,10/10/1974 21:30,cardiff (uk/wales),,gb,disk,1200,20 minutes,back in 1974 I was 19 at the time and lived i...,02/01/2007,51.5,-3.200000
24,10/10/1976 22:00,stoke mandeville (uk/england),,gb,cigar,3,3 seconds,White object over Buckinghamshire UK.,12/12/2009,51.783333,-0.783333
...,...,...,...,...,...,...,...,...,...,...,...
80217,09/09/2007 19:01,melbourne (australia),,au,circle,600.0,10 min,Hostile,10/08/2007,-37.813938,144.963425
80234,09/09/2009 03:14,aberdeen (uk/scotland),,gb,light,6.0,6 seconds,Bright light seen over Aberdeen&#44 Scotland&#...,12/12/2009,57.166667,-2.666667
80254,09/09/2009 21:15,nottinghamshire (uk/england),,gb,fireball,600.0,10 mins,resembled orange flame imagine a transparent h...,12/12/2009,53.166667,-1.000000
80255,09/09/2009 21:38,kaiserlautern (germany),,de,light,40.0,about 40 seconds,2 white lights over Kaiserslautern&#44 ramstei...,12/12/2009,49.45,7.750000


As we can see, there is a lot of missing data. Some records contain the country in the city column and some don't. We could possibly extract this data and fill in the missing values. However, we do have longitude and latitude data, and we should perhaps be able to use some kind of geolocation API to replace the missing data.
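
To illustrate the first idea, here is a minimal sketch that pulls the text in trailing parentheses out of the city column using a regular expression. The city values are copied from the output above. Note that the extracted hint (for example "uk/england") is free text rather than a country code like gb, so a mapping step would still be needed, which is not shown here.

```python
import pandas as pd

cities = pd.Series([
    "chester (uk/england)",
    "saddle lake (canada)",
    "gisborne (new zealand)",
    "lackland afb",
])

# Capture whatever sits inside a trailing pair of parentheses; NaN if there is none
country_hint = cities.str.extract(r"\(([^()]*)\)\s*$", expand=False)

# Remove the trailing parenthesised part to leave just the city name
city_clean = cities.str.replace(r"\s*\([^()]*\)\s*$", "", regex=True)

print(pd.DataFrame({"city": city_clean, "country_hint": country_hint}))
```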

After some digging, it looks like we could use the longitude and latitude data to fill in our missing country and state values with something like **geopy** and its **Nominatim** geocoder.
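
Before applying this to thousands of rows, it helps to see what a single reverse lookup might look like. The sketch below is an assumption about how this could be done, not necessarily the approach used in the rest of this notebook. The function names, the user_agent string, and the one second pause are illustrative choices, and the fake lookup at the bottom returns a made-up response so the example runs without network access.

```python
import time

def make_nominatim_reverse(user_agent="ufo-analysis-example"):
    """Build a reverse-lookup function backed by geopy's Nominatim geocoder."""
    # Imported here so the rest of the sketch runs even without geopy installed
    from geopy.geocoders import Nominatim
    geolocator = Nominatim(user_agent=user_agent)

    def reverse(lat, lon):
        return geolocator.reverse((lat, lon), language="en", exactly_one=True)

    return reverse

def lookup_country_state(lat, lon, reverse, pause=1.0):
    """Return (country_code, state) for a coordinate pair, or (None, None)."""
    location = reverse(lat, lon)
    time.sleep(pause)  # stay polite: the public Nominatim service is rate limited
    if location is None:
        return None, None
    address = location.raw.get("address", {})
    return address.get("country_code"), address.get("state")

# Offline stand-in with a made-up response, used only to demonstrate the flow
class _FakeLocation:
    raw = {"address": {"country_code": "us", "state": "Texas"}}

def fake_reverse(lat, lon):
    return _FakeLocation()

print(lookup_country_state(29.38421, -98.581082, fake_reverse, pause=0))
```

With geopy installed and network access, passing make_nominatim_reverse() instead of fake_reverse would query the real service. At roughly one request per second, looking up every record with a missing country would take a few hours, so caching results for repeated coordinates is worth considering.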

In [15]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError
import time
from tqdm import tqdm
