# Checking column types

We take a look at the UFO dataset's column types using the info() method, which reports the same types as the dtypes attribute. Two columns jump out for transformation: a numeric column that is being read in as object (in the output below that is lat; seconds has already been read in as float64), and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.

In [3]:
# import libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
import re

In [4]:
ufo = pd.read_csv('ufo_sightings_large.csv')
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
date              4935 non-null object
city              4926 non-null object
state             4516 non-null object
country           4255 non-null object
type              4776 non-null object
seconds           4935 non-null float64
length_of_time    4792 non-null object
desc              4932 non-null object
recorded          4935 non-null object
lat               4935 non-null object
long              4935 non-null float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB


In [5]:
ufo.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333


In [6]:
# Change the date column to type datetime
ufo['date'] = pd.to_datetime(ufo['date'])

In [7]:
ufo.shape

(4935, 11)
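
The type conversions discussed above can be sketched on a toy frame. This is a minimal illustration rather than part of the original notebook: the sample values, the explicit format string, and the choice of errors="coerce" are all assumptions.

```python
import pandas as pd

# Toy frame mimicking two of the UFO columns (values are illustrative)
toy = pd.DataFrame({
    "date": ["11/3/2011 19:21", "10/3/2004 19:05"],
    "lat": ["44.9530556", "not a number"],
})

# Parse dates with an explicit format instead of letting pandas infer it
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y %H:%M")

# Convert a numeric column stored as strings; unparseable entries become NaN
toy["lat"] = pd.to_numeric(toy["lat"], errors="coerce")

print(toy.dtypes)
```

Coercing bad entries to NaN keeps the column numeric, at the cost of introducing missing values that have to be handled later.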

# Dropping missing data

Let's remove some of the rows where certain columns have missing values. We're going to look at the length_of_time column, the state column, and the type column. If any of the values in these columns are missing, we're going to drop the rows.

In [8]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[["length_of_time", "state", "type"]].isnull().sum())

length_of_time    143
state             419
type              159
dtype: int64


In [9]:
# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & ufo["state"].notnull() & ufo["type"].notnull()]

In [10]:
# Print out the shape of the new dataset
print(ufo_no_missing.shape)

(4283, 11)
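
The same filter can also be written with dropna. The sketch below uses a small made-up frame (the values are assumptions) to show that the boolean-mask version and the dropna version select the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each of two rows (illustrative)
toy = pd.DataFrame({
    "length_of_time": ["2 weeks", np.nan, "5 minutes", "30sec."],
    "state": ["wi", "oh", np.nan, "nc"],
    "type": ["unknown", "circle", "cigar", "triangle"],
})
cols = ["length_of_time", "state", "type"]

# Boolean-mask version, as in the cell above
masked = toy[toy["length_of_time"].notnull() & toy["state"].notnull() & toy["type"].notnull()]

# Equivalent call: drop rows with a missing value in any listed column
dropped = toy.dropna(subset=cols)

print(masked.equals(dropped), dropped.shape)
```

Note that the per-column missing counts above (143 + 419 + 159 = 721) exceed the number of rows actually dropped (4935 - 4283 = 652), because some rows are missing more than one of these values.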


In [11]:
# Select rows whose length_of_time mentions minutes;
# copy the result so new columns can be added to it safely
ufo2 = ufo_no_missing[ufo_no_missing['length_of_time'].str.contains("minute")].copy()
ufo2.shape

(1866, 11)

# Extracting numbers from strings

The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, we'll extract that number from the text field using a regular expression.

In [12]:
# Extracting numbers from strings

def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")
    
    # re.match only matches at the start of the string, so values
    # that begin with text (e.g. "about 5 minutes") return None
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))

In [19]:
# Apply the extraction to the length_of_time column
ufo2["minutes"] = ufo2["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo2[["minutes","length_of_time"]].head())

    minutes   length_of_time
3       NaN  about 5 minutes
5      10.0       10 minutes
8       2.0        2 minutes
9       2.0        2 minutes
10      5.0        5 minutes


As we can see, we end up with some NaNs in the DataFrame. That's okay for now; we'll take care of those before modeling.
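
The NaN for "about 5 minutes" comes from re.match, which only looks for digits at the very start of the string. A possible alternative, sketched here as an assumption rather than what the notebook does, is re.search, which scans the whole string. The helper name minutes_anywhere is hypothetical.

```python
import re

def minutes_anywhere(time_string):
    """Return the first run of digits found anywhere in the string, or None."""
    found = re.search(r"\d+", time_string)
    return int(found.group(0)) if found is not None else None

# re.match anchors at the start of the string, so this prints None
print(re.match(r"\d+", "about 5 minutes"))

# re.search scans the whole string
print(minutes_anywhere("about 5 minutes"))
print(minutes_anywhere("10 minutes"))
print(minutes_anywhere("a few minutes"))
```

This still takes only the first number it finds, so a range such as "1-2 minutes" would be recorded as 1, and strings with no digits still produce a missing value.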

In [14]:
ufo2.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long,minutes
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222,
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389,10.0
8,2013-06-09 00:00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,-79.666667,2.0
9,2013-04-26 23:27:00,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.0344444,-122.821944,2.0
10,2013-09-13 20:30:00,ben avon,pa,us,sphere,300.0,5 minutes,North-east moving south-west. First 7 or so li...,9/30/2013,40.5080556,-80.083333,5.0


# Identifying features for standardization

Let's investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log-normalize only the seconds column.

In [15]:
# Identifying features for standardization

# Check the variance of the seconds and minutes columns
print(ufo2[['seconds','minutes']].var())

# Log normalize the seconds column
ufo2["seconds_log"] = np.log(ufo2['seconds'])

# Print out the variance of just the seconds_log column
print(ufo2['seconds_log'].var())

seconds    424087.417474
minutes       117.546372
dtype: float64
1.1223923881183004
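
To see why the log transform shrinks the variance so sharply, here is a small sketch on made-up durations (the values are assumptions, chosen to include one extreme sighting and a zero). It uses np.log1p, a swapped-in variant of np.log that computes log(1 + x); the full dataset contains sightings with seconds equal to 0.0, for which np.log would return -inf.

```python
import numpy as np

# Illustrative durations in seconds, including one extreme value and a zero
seconds = np.array([0.0, 30.0, 120.0, 300.0, 600.0, 1209600.0])

# ddof=1 gives the sample variance, matching the pandas default
print(seconds.var(ddof=1))            # variance on the raw scale
print(np.log1p(seconds).var(ddof=1))  # variance after log(1 + x)

# np.log maps the zero to -inf, while np.log1p maps it to 0
with np.errstate(divide="ignore"):
    print(np.log(seconds[0]), np.log1p(seconds[0]))
```

Whether np.log or np.log1p is the better choice depends on whether zeros can appear in the column being transformed.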


Let's continue on by extracting some date parts.

# Features from dates

Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

In [16]:
# Features from dates

# Look at the first 5 rows of the date column
print(ufo2['date'].head())

# Extract the month from the date column
ufo2["month"] = ufo2["date"].apply(lambda row: row.month)

# Extract the year from the date column
ufo2["year"] = ufo2["date"].apply(lambda row: row.year)

# Take a look at the head of all three columns
print(ufo2[['date','month','year']].head())

3    2002-11-21 05:45:00
5    2012-06-16 23:00:00
8    2013-06-09 00:00:00
9    2013-04-26 23:27:00
10   2013-09-13 20:30:00
Name: date, dtype: datetime64[ns]
                  date  month  year
3  2002-11-21 05:45:00     11  2002
5  2012-06-16 23:00:00      6  2012
8  2013-06-09 00:00:00      6  2013
9  2013-04-26 23:27:00      4  2013
10 2013-09-13 20:30:00      9  2013


'apply' and 'lambda' are extremely useful for extraction tasks.
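
For datetime columns specifically, pandas also offers the .dt accessor, which extracts date parts without a row-by-row function call. A minimal sketch on two made-up timestamps:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2002-11-21 05:45:00", "2012-06-16 23:00:00"]))

# Row-by-row extraction, as in the cell above
month_apply = dates.apply(lambda row: row.month)

# Vectorized extraction through the .dt accessor
month_dt = dates.dt.month
year_dt = dates.dt.year

print(month_dt.tolist(), year_dt.tolist())
```

Both approaches give the same values; the accessor is usually the more concise option on large columns.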

# Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. We'll do that transformation here, using both binary and one-hot encoding methods.

In [17]:
ufo2.country.value_counts()

us    1643
ca      75
Name: country, dtype: int64

In [58]:
# Use Pandas to encode us values as 1 and others as 0
ufo2["country_enc"] = ufo2["country"].apply(lambda val: 1 if val=='us' else 0)

# Print the number of unique type values
print(len(ufo2['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo2['type'])

# Concatenate this set back to the ufo DataFrame
ufo2 = pd.concat([ufo2, type_set], axis=1)

21


In [59]:
type_set.head()

Unnamed: 0,changing,chevron,cigar,circle,cone,cross,cylinder,diamond,disk,egg,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
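
The two encodings can be sketched on a toy frame (the values are assumptions). One detail worth noticing: the value_counts output above covers 1643 + 75 = 1718 of the 1866 rows, so the remaining rows have a missing country, and the binary encoding turns those into 0 as well.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["us", "ca", np.nan, "us"],
    "type": ["light", "triangle", "light", "sphere"],
})

# Binary encoding: the comparison is False for NaN, so missing countries become 0
toy["country_enc"] = (toy["country"] == "us").astype(int)

# One-hot encoding: one indicator column per distinct type value
type_set = pd.get_dummies(toy["type"], dtype=int)

encoded = pd.concat([toy, type_set], axis=1)
print(encoded)
```

Passing dtype=int is a choice made here so that the indicator columns are 0/1 integers regardless of the pandas version's default.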


# Text vectorization

Let's transform the desc column in the UFO dataset into tf-idf vectors, since there's likely something we can learn from this field.

In [60]:
# Text vectorization

# Take a look at the head of the desc field
print(ufo2.desc.head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo2['desc'])

# Look at the number of columns this creates
print(desc_tfidf.shape)

3     It was a large&#44 triangular shaped flying ob...
5     Dancing lights that would fly around and then ...
8     Brilliant orange light or chinese lantern at o...
9     Bright red light moving north to north west fr...
10    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object
(1866, 3422)


In [61]:
print(vec.vocabulary_)

{'it': 1664, 'was': 3275, 'large': 1744, '44': 147, 'triangular': 3123, 'shaped': 2657, 'flying': 1320, 'object': 2134, 'dancing': 910, 'lights': 1794, 'that': 3002, 'would': 3379, 'fly': 1319, 'around': 395, 'and': 340, 'then': 3007, 'merge': 1923, 'into': 1645, 'one': 2173, 'light': 1787, 'brilliant': 604, 'orange': 2188, 'or': 2184, 'chinese': 718, 'lantern': 1738, 'at': 412, 'less': 1774, 'than': 3001, '1000': 15, 'ft': 1363, 'moving': 2021, 'east': 1102, 'to': 3050, 'west': 3298, 'across': 273, 'oakville': 2130, 'ontario': 2176, 'midnight': 1942, 'june': 1690, '9th': 251, '2013': 92, 'bright': 596, 'red': 2472, 'north': 2097, 'from': 1360, 'the': 3003, 'horizon': 1539, 'till': 3041, 'disapeared': 1003, 'behind': 502, 'clouds': 766, 'south': 2793, 'first': 1276, 'so': 2766, 'craft': 873, 'half': 1462, 'dozen': 1063, 'stragglers': 2899, 'they': 3015, 'were': 3296, 'surely': 2943, 'not': 2107, 'planes': 2330, 'nor': 2094, 'ball': 449, 'of': 2157, 'slowly': 2751, 'stationary': 2872, '

In [62]:
vocab = {v:k for k,v in vec.vocabulary_.items()}
print(vocab)

{1664: 'it', 3275: 'was', 1744: 'large', 147: '44', 3123: 'triangular', 2657: 'shaped', 1320: 'flying', 2134: 'object', 910: 'dancing', 1794: 'lights', 3002: 'that', 3379: 'would', 1319: 'fly', 395: 'around', 340: 'and', 3007: 'then', 1923: 'merge', 1645: 'into', 2173: 'one', 1787: 'light', 604: 'brilliant', 2188: 'orange', 2184: 'or', 718: 'chinese', 1738: 'lantern', 412: 'at', 1774: 'less', 3001: 'than', 15: '1000', 1363: 'ft', 2021: 'moving', 1102: 'east', 3050: 'to', 3298: 'west', 273: 'across', 2130: 'oakville', 2176: 'ontario', 1942: 'midnight', 1690: 'june', 251: '9th', 92: '2013', 596: 'bright', 2472: 'red', 2097: 'north', 1360: 'from', 3003: 'the', 1539: 'horizon', 3041: 'till', 1003: 'disapeared', 502: 'behind', 766: 'clouds', 2793: 'south', 1276: 'first', 2766: 'so', 873: 'craft', 1462: 'half', 1063: 'dozen', 2899: 'stragglers', 3015: 'they', 3296: 'were', 2943: 'surely', 2107: 'not', 2330: 'planes', 2094: 'nor', 449: 'ball', 2157: 'of', 2751: 'slowly', 2872: 'stationary', 2

The return_weights function below returns a list of vocabulary indices for the top-weighted words in a single document. We'll then write another function to collect those top words across all documents, and use that list to filter our desc_tfidf vector down to those columns.

In [63]:
# Return the vocabulary indices of the top_n highest-weighted words in one document
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, vec.vocabulary_, desc_tfidf, 8, 3))

[3085, 1326, 3077]
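
The output above is a list of column indices, which is hard to read on its own. The sketch below applies the same idea to a three-document toy corpus (an assumption, not UFO data) and returns the words themselves. The helper top_words is hypothetical and skips the pandas Series step used in return_weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative, not taken from the UFO data)
docs = ["red light moving north", "red light hovering", "red triangle craft"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# Invert the word -> column mapping so columns can be read back as words
index_to_word = {index: word for word, index in vec.vocabulary_.items()}

def top_words(matrix, row_index, top_n):
    """Return the top_n highest-weighted words in one row of a tf-idf matrix."""
    row = matrix[row_index]
    # indices holds the non-zero column numbers, data the matching weights
    ranked = sorted(zip(row.indices, row.data), key=lambda pair: pair[1], reverse=True)
    return [index_to_word[index] for index, _ in ranked[:top_n]]

print(top_words(tfidf, 0, 2))
```

Words that appear in every document, such as "red" here, receive the lowest weights, which is why they drop out of the top of the ranking.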


In [64]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call return_weights for each document and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]
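
Column filtering on a sparse matrix can be illustrated with a tiny hand-built example (the weights are made up). Because a set has no guaranteed order, this sketch sorts the indices first, an optional step added here so that the column order of the result is reproducible.

```python
import numpy as np
from scipy.sparse import csr_matrix

# 2 documents x 4 vocabulary columns (illustrative weights)
matrix = csr_matrix(np.array([[0.0, 0.7, 0.0, 0.7],
                              [0.5, 0.0, 0.8, 0.0]]))

keep = {3, 1}  # column indices gathered into a set, as words_to_filter does

# Sorting makes the column order of the result reproducible
columns = sorted(keep)
filtered = matrix[:, columns]

print(filtered.shape)
print(filtered.toarray())
```

The second document ends up with an all-zero row, which shows that filtering by top words can leave some documents with no remaining features.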

In [65]:
# Selecting the ideal dataset

# Because we have an encoded country column, country_enc, keep it and drop other columns related to location: 
# city, country, lat, long, state

ufo2.shape

(1866, 37)

# Selecting the ideal dataset

Let's get rid of some of the unnecessary features. Because we have an encoded country column, country_enc, we keep it and drop the other columns related to location: city, country, lat, long, state.

We have columns related to month and year, so we don't need the date or recorded columns.

We vectorized desc, so we don't need it anymore. For now we'll keep type.

We'll keep seconds_log and drop seconds and minutes.

Let's also get rid of the length_of_time column, which is unnecessary after extracting minutes.

In [66]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo2[['seconds','seconds_log','minutes']].corr())

              seconds  seconds_log   minutes
seconds      1.000000     0.853371  0.980341
seconds_log  0.853371     1.000000  0.824493
minutes      0.980341     0.824493  1.000000
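
The 0.98 correlation between seconds and minutes reflects that the two columns carry almost the same information, which is the reason for keeping only one duration feature. A small sketch with made-up values shows the limiting case where seconds is an exact rescaling of minutes:

```python
import numpy as np

minutes = np.array([2.0, 2.0, 5.0, 10.0])
seconds = minutes * 60.0          # an exact rescaling of minutes
seconds_log = np.log(seconds)

# np.corrcoef returns the matrix of pairwise Pearson correlations
corr = np.corrcoef([seconds, seconds_log, minutes])
print(corr.round(3))
```

In the real data the correlation falls short of 1 because minutes was parsed from free text while seconds was recorded separately.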


In [67]:
print(ufo2[['seconds','seconds_log','minutes']].head())

    seconds  seconds_log  minutes
3     300.0     5.703782      NaN
5     600.0     6.396930     10.0
8     120.0     4.787492      2.0
9     120.0     4.787492      2.0
10    300.0     5.703782      5.0


In [68]:
ufo2.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,...,0,0,0,0,0,0,0,0,1,0
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,...,0,0,1,0,0,0,0,0,0,0
8,2013-06-09 00:00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,...,0,0,1,0,0,0,0,0,0,0
9,2013-04-26 23:27:00,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.0344444,...,0,0,1,0,0,0,0,0,0,0
10,2013-09-13 20:30:00,ben avon,pa,us,sphere,300.0,5 minutes,North-east moving south-west. First 7 or so li...,9/30/2013,40.5080556,...,0,0,0,0,0,0,1,0,0,0


In [69]:
# Make a list of features to drop
to_drop = ['city','country','lat','long','state','date','recorded','desc','seconds','minutes','length_of_time']

# Drop those features
ufo_dropped = ufo2.drop(to_drop,axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

In [70]:
ufo_dropped.head()

Unnamed: 0,type,seconds_log,month,year,country_enc,changing,chevron,cigar,circle,cone,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
3,triangle,5.703782,11,2002,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,light,6.39693,6,2012,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
8,light,4.787492,6,2013,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9,light,4.787492,4,2013,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
10,sphere,5.703782,9,2013,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [71]:
ufo_dropped2 = ufo_dropped.drop('type',axis=1)

# Modeling the UFO_dropped2 dataset, part 1

We're going to build a k-nearest neighbors model to predict which country the UFO sighting took place in. Our X dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is us and 0 is everything else (ca, plus any rows where country is missing).

In [72]:
X = ufo_dropped2.drop('country_enc',axis=1)
y = ufo_dropped2['country_enc']

In [73]:
# Take a look at the features in the X set of data
print(X.columns)

knn = KNeighborsClassifier()

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X,y,stratify=y)

# Fit knn to the training sets
knn.fit(train_X,train_y)

# Print the score of knn on the test sets
print(knn.score(test_X,test_y))

Index(['seconds_log', 'month', 'year', 'changing', 'chevron', 'cigar',
       'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg',
       'fireball', 'flash', 'formation', 'light', 'other', 'oval', 'rectangle',
       'sphere', 'teardrop', 'triangle', 'unknown'],
      dtype='object')
0.8779443254817987
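
A useful reference point for this score is the accuracy of always predicting the majority class. The sketch below computes it from counts shown earlier in the notebook. It is an approximation: the exact share in the test split may differ slightly, although stratify=y keeps it close.

```python
# Counts taken from the shape and country value_counts outputs earlier in the notebook
n_rows = 1866   # rows in ufo2
n_us = 1643     # rows with country == 'us', encoded as 1

# Accuracy of always predicting the majority class (1)
baseline = n_us / n_rows
print(round(baseline, 4))
```

Since this baseline is about the same as the knn score above, the model is not yet doing better than always guessing us.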


# Modeling the UFO_dropped dataset, part 2

Finally, let's build a model using the text vector we created, desc_tfidf, using the filtered_words list to create a filtered text vector. Let's see if we can predict the type of the sighting based on the text. We'll use a Naive Bayes model for this.


In [75]:
y = ufo_dropped['type']

In [76]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

nb = GaussianNB()

# Split the X and y sets using train_test_split, setting stratify=y 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify = y)

# Fit nb to the training sets
nb.fit(train_X,train_y)

# Print the score of nb on the test sets
print(nb.score(test_X,test_y))

0.16916488222698073


As we can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and whether any of the other features are useful in predicting 'type'.
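
One possible iteration, offered as an assumption rather than something the original notebook tried, is to swap GaussianNB for MultinomialNB, which scikit-learn provides for count-like text features and which accepts sparse tf-idf input directly. The toy descriptions and labels below are made up, and the sketch makes no claim about how this would score on the UFO data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy descriptions and labels (illustrative, not taken from the UFO data)
docs = [
    "bright light in the sky",
    "light hovering over field",
    "bright light moving fast",
    "triangle craft with three lights",
    "large triangle shaped object",
    "black triangle flying low",
]
labels = ["light", "light", "light", "triangle", "triangle", "triangle"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # stays sparse

# MultinomialNB accepts the sparse matrix directly, so no .toarray() call is needed
nb = MultinomialNB()
nb.fit(X, labels)

pred = nb.predict(vec.transform(["bright light above the house"]))
print(pred[0])
```

Any such change would need to be evaluated on a held-out split, as in the cells above, before drawing conclusions.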