Import Python packages.

In [1]:
import numpy as np
import pandas as pd


In [2]:
ufos = pd.read_csv('./data/ufos.csv')
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611


In [3]:
ufos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  float64
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80317 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  float64
 10  longitude             80332 non-null  float64
dtypes: float64(3), object(8)
memory usage: 6.7+ MB


This project builds a model that predicts the country of a UFO sighting from its duration and its geographic coordinates, `latitude` and `longitude`.

We will need four columns:
* The country where the UFO was spotted (the label attribute).
* The duration for which the UFO was seen.
* The latitude and longitude of the location.

In [4]:
# Create a sub-DataFrame containing only the needed columns from the original

ufos = pd.DataFrame({
    'Country': ufos['country'],
    'Seconds': ufos['duration (seconds)'],
    'Latitude': ufos['latitude'],
    'Longitude': ufos['longitude']
})

In [5]:
ufos.Country.unique()  

array(['us', nan, 'gb', 'ca', 'au', 'de'], dtype=object)

In [6]:
ufos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    70662 non-null  object 
 1   Seconds    80332 non-null  float64
 2   Latitude   80332 non-null  float64
 3   Longitude  80332 non-null  float64
dtypes: float64(3), object(1)
memory usage: 2.5+ MB


Drop missing values. 

In [7]:
ufos.dropna(inplace=True)

Keep only sightings whose duration is between 1 and 60 seconds.

In [8]:
ufos = ufos[(ufos.Seconds >= 1) & (ufos.Seconds <= 60)]  # keep sightings lasting between 1 and 60 seconds
ufos.head(10)

Unnamed: 0,Country,Seconds,Latitude,Longitude
2,gb,20.0,53.2,-2.916667
3,us,20.0,28.978333,-96.645833
14,us,30.0,35.823889,-80.253611
23,us,60.0,45.582778,-122.352222
24,gb,3.0,51.783333,-0.783333
25,us,30.0,29.423889,-98.493333
26,us,30.0,38.254167,-85.759444
36,us,60.0,29.763056,-95.363056
38,us,20.0,41.033889,-73.763333
43,us,60.0,40.015,-105.27


In [9]:
# encode the country names as integers 
from sklearn.preprocessing import LabelEncoder

ufos['Country'] = LabelEncoder().fit_transform(ufos['Country'])

In [10]:
ufos.head()

Unnamed: 0,Country,Seconds,Latitude,Longitude
2,3,20.0,53.2,-2.916667
3,4,20.0,28.978333,-96.645833
14,4,30.0,35.823889,-80.253611
23,4,60.0,45.582778,-122.352222
24,3,3.0,51.783333,-0.783333
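
The integer codes in the `Country` column come from `LabelEncoder`, which numbers the classes in sorted order of their names. Because the encoder in the cell above is not kept in a variable, the mapping cannot be read back from it directly. The following is a minimal sketch of how the mapping could be recovered, using the five country codes returned by `ufos.Country.unique()`; the names `encoder`, `countries`, and `mapping` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Country codes as they appear in the dataset (NaN rows already dropped).
countries = ['us', 'gb', 'ca', 'au', 'de']

# Keep a reference to the fitted encoder so codes can be decoded later.
encoder = LabelEncoder()
encoder.fit(countries)

# classes_ is sorted, so the position of each class is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to country names.
decoded = [str(name) for name in encoder.inverse_transform([3, 4])]
print(decoded)
```

Under this mapping, code 3 corresponds to `gb` and code 4 to `us`, which agrees with the rows shown before and after encoding.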


In [11]:
ufos.shape

(25863, 4)

Build a Logistic Regression Model

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# select the features and the target
selected_features = ['Seconds', 'Latitude', 'Longitude']

X = ufos[selected_features]
y = ufos['Country']


# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [13]:
X_train.shape, X_test.shape

((20690, 3), (5173, 3))

In [14]:
y_train.shape, y_test.shape

((20690,), (5173,))

In [15]:
# Train the model
# Logistic Regression algorithm will be used to predict the country of the sighting based on the duration, latitude and longitude

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)


print(classification_report(y_test, predictions))
print('Predicted Labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        41
           1       0.83      0.24      0.37       250
           2       1.00      1.00      1.00         8
           3       1.00      1.00      1.00       131
           4       0.96      1.00      0.98      4743

    accuracy                           0.96      5173
   macro avg       0.96      0.85      0.87      5173
weighted avg       0.96      0.96      0.95      5173

Predicted Labels:  [4 4 4 ... 3 4 4]
Accuracy:  0.9609510922095496


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
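
The warning above indicates that the solver hit its iteration limit before converging, and it suggests either raising `max_iter` or scaling the features. The sketch below shows one way both suggestions might be combined, using a scikit-learn pipeline with `StandardScaler`. It runs on synthetic data standing in for the `Seconds`, `Latitude`, and `Longitude` columns, not on the real UFO data; the names `X_demo`, `y_demo`, and `scaled_model`, the two-class label, and the value `max_iter=1000` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (Seconds, Latitude, Longitude); not the real UFO data.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(1, 60, 500),      # duration in seconds
    rng.uniform(-40, 60, 500),    # latitude
    rng.uniform(-160, 150, 500),  # longitude
])
# Hypothetical two-class label: 1 for western longitudes, 0 otherwise.
y_demo = (X_demo[:, 2] < 0).astype(int)

# Standardize the features, then fit logistic regression with a higher iteration cap.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
n_iter = scaled_model.named_steps['logisticregression'].n_iter_
train_accuracy = scaled_model.score(X_demo, y_demo)
print(n_iter, train_accuracy)
```

In the notebook itself, the same pipeline could be fitted on `X_train` and `y_train` in place of the synthetic arrays; whether that removes the warning on the real data would need to be checked by running it.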


In [16]:
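# An accuracy of 96% should be read with caution: the test set is dominated by class 4 (4743 of 5173 samples),
# and recall for class 1 is only 0.24, so sightings from that country are mostly misclassified.
# The model predicts the country of the sighting from the duration, latitude and longitude.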
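
To put the reported accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline from the per-class support values in the classification report; the variable names are illustrative and not part of the original notebook.

```python
# Per-class test-set support, copied from the classification report.
support = {0: 41, 1: 250, 2: 8, 3: 131, 4: 4743}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a trivial model that always predicts the most frequent class.
baseline_accuracy = support[majority_class] / total
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

The baseline comes out at roughly 0.917, so the logistic regression model improves on always guessing class 4 by about four percentage points.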