# Putting It All Together
  
Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings.

In [152]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## UFOs and preprocessing
  
Now it's time for you to apply the concepts you've learned throughout this course to a brand new dataset.
  
**Identifying areas for preprocessing**
  
The final chapter in this course will walk you through an entire preprocessing workflow on a dataset related to UFO sightings. Each row in this dataset contains information like the location, the type of the sighting, the number of seconds and minutes the sighting lasted, a description of the sighting, and the date the sighting was recorded. As you might imagine, there are a number of preprocessing tasks that need to be done prior to doing any modeling on this dataset.
  
**Important concepts to remember**
  
In the very first chapter of this course, we covered things like removing missing data, altering the type of columns in a DataFrame, and creating training and test sets based on class distribution. Some useful pandas methods to remember are `.dropna()` and `.isna()` for missing data and `.astype()` for type conversion, along with the `stratify=` parameter of scikit-learn's `train_test_split()` function.
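
As a quick refresher, here is a minimal sketch of the pandas calls mentioned above. The small DataFrame is made up for illustration and is not the UFO data.

```python
import pandas as pd

# Toy data (hypothetical): a numeric column stored as strings, plus one gap.
df = pd.DataFrame({
    "seconds": ["30", "300", "1200", "60"],
    "state": ["wi", None, "mn", "nc"],
    "country": ["us", "us", "ca", "us"],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Drop rows with a missing state, then convert seconds from object to float.
clean = df.dropna(subset=["state"]).copy()
clean["seconds"] = clean["seconds"].astype(float)

print(missing_counts)
print(clean.dtypes)
```

The `.copy()` call is only there so that the type conversion acts on an independent DataFrame rather than on a slice of `df`.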

### Checking column types
  
Take a look at the UFO dataset's column types using the `.info()` method. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as object, and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.
  
1. Call the `.info()` method on the ufo dataset.
  
2. Convert the type of the seconds column to the float data type.
  
3. Convert the type of the date column to the datetime data type.
  
4. Call `.info()` on ufo again to see if the changes worked.

In [153]:
# Loading dataset
ufo = pd.read_csv('../_datasets/ufo_sightings_large.csv')
ufo.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333


In [154]:
# Print the DataFrame info
print(ufo.info(), '\n')

# Change the type of seconds to float
ufo["seconds"] = ufo['seconds'].astype('float')

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo['date'])

# Check the column types
print(ufo[['seconds', 'date']].dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            4935 non-null   object 
 1   city            4926 non-null   object 
 2   state           4516 non-null   object 
 3   country         4255 non-null   object 
 4   type            4776 non-null   object 
 5   seconds         4935 non-null   float64
 6   length_of_time  4792 non-null   object 
 7   desc            4932 non-null   object 
 8   recorded        4935 non-null   object 
 9   lat             4935 non-null   object 
 10  long            4935 non-null   float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB
None 

seconds           float64
date       datetime64[ns]
dtype: object


Nice job on transforming the column types! This will make feature engineering and standardization much easier.

### Dropping missing data
  
In this exercise, you'll remove some of the rows where certain columns have missing values. You're going to look at the length_of_time column, the state column, and the type column. You'll drop any row that contains a missing value in at least one of these three columns.
  
1. Print out the number of missing values in the length_of_time, state, and type columns, in that order, using `.isna()` and `.sum()`.
  
2. Drop rows that have missing values in at least one of these columns.
  
3. Print out the shape of the new ufo_no_missing dataset.

In [155]:
# Dataframe shape
print(ufo.shape)

# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[['length_of_time', 'state', 'type']].isna().sum())

# Drop rows where length_of_time, state, or type are missing
ufo_no_missing = ufo.dropna(subset=['length_of_time', 'state', 'type'], axis=0)

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

(4935, 11)
length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)


We'll work with this set going forward.

## Categorical variables and standardization
  
The next tasks we're going to tackle are dealing with some of the categorical variables and standardization in the UFO dataset.
  
**Categorical variables**
  
Recall that there are a number of categorical variables in the UFO dataset, including location data and the type of the encounter. The following exercises will be about dealing with these categorical variables. There are a number of categorical variables that need to be one hot encoded. Remember that we can one hot encode variables with pandas' `get_dummies()` function.
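
To make the shape of a one-hot encoding concrete, here is a small sketch on a made-up Series (the values are illustrative, not taken from the dataset). The `.astype(int)` call is an addition so that the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy categorical column (hypothetical values).
shapes = pd.Series(["circle", "light", "circle", "triangle"], name="type")

# One column per distinct category; each row has exactly one 1.
encoded = pd.get_dummies(shapes).astype(int)
print(encoded)
```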
  
**Standardization**
  
In addition, we need to standardize the seconds column. Recall that we can check the variance of a column with the `.var()` method. After we've done that, we can log normalize the column using NumPy's `log()` function.

### Extracting numbers from strings
  
The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.
  
1. Search time_string for numbers using an appropriate RegEx pattern.
  
2. Use the `.apply()` method to call `return_minutes()` on every row of the length_of_time column.
  
3. Print out the `.head()` of both the length_of_time and minutes columns to compare.

In [156]:
# Loading the dataset
ufo = pd.read_csv('../_datasets/ufo_sample.csv')

# Changing the type of seconds to float
ufo['seconds'] = ufo['seconds'].astype(float)

# Change the date column to type datetime
ufo['date'] = pd.to_datetime(ufo['date'])

In [157]:
import re


# Creating a function to extract the time as an int from a string
def return_minutes(time_string):
    """
    Extracts the time as an integer from a string.

    Parameters:
    - time_string (str): A string containing time information.

    Returns:
    - int: The extracted time as an integer, or None if no numbers are found.

    Example:
    >>> return_minutes("10 minutes")
    10
    >>> return_minutes("1 hour 30 minutes")
    1
    """

    # Search for the first run of digits in time_string (the raw string avoids an invalid-escape warning)
    num = re.search(r'\d+', time_string)

    if num is not None:
        return int(num.group(0))
    else:
        return None


# Apply the extraction function to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[['minutes', 'length_of_time']].head())

   minutes   length_of_time
0      5.0  about 5 minutes
1     10.0       10 minutes
2      2.0        2 minutes
3      2.0        2 minutes
4      5.0        5 minutes


The minutes information is now in a numeric form that can be used as model input.
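
Two behaviours of this approach are worth seeing in isolation: `re.search()` returns only the first match, and it raises an error on non-string input such as a missing value. The sketch below uses a hypothetical variant, `safe_minutes()`, which is not part of the exercise, to illustrate both.

```python
import re


def safe_minutes(time_string):
    """Hypothetical variant of return_minutes() that tolerates non-string input."""
    # Missing values in pandas arrive as float NaN, which re.search() cannot handle.
    if not isinstance(time_string, str):
        return None
    match = re.search(r"\d+", time_string)
    return int(match.group(0)) if match else None


print(safe_minutes("about 5 minutes"))    # first number found
print(safe_minutes("1 hour 30 minutes"))  # only the first number is used
print(safe_minutes(float("nan")))         # non-string input
print(safe_minutes("a few seconds"))      # no digits at all
```

The second call shows the limitation noted in the docstring above: a string such as "1 hour 30 minutes" yields 1, not 90.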

### Identifying features for standardization
  
In this exercise, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes columns, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's `np.log()` normalize the seconds column.
  
1. Calculate the variance in the seconds and minutes columns and take a close look at the results.
  
2. Perform `np.log()` normalization on the seconds column, transforming it into a new column named seconds_log.
  
3. Print out the variance of the seconds_log column.

A range-scaled form of log normalization maps a value $x$ into the interval $[0, 1]$:
  
$\text{normalized}(x) = \frac{\log(x - \text{min\_value} + 1)}{\log(\text{max\_value} - \text{min\_value} + 1)}$
  
The exercise below uses the simpler form and applies `np.log()` directly to each value.
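
The range-scaled formula above can be checked on a few numbers. This is a minimal sketch in plain Python; the function name `log_normalize()` and the sample values are assumptions made for illustration.

```python
import math


def log_normalize(x, min_value, max_value):
    """Range-scaled log normalization, following the formula above."""
    return math.log(x - min_value + 1) / math.log(max_value - min_value + 1)


values = [1.0, 10.0, 100.0, 1000.0]  # hypothetical durations
lo, hi = min(values), max(values)

scaled = [log_normalize(v, lo, hi) for v in values]  # ends up in [0, 1]
plain = [math.log(v) for v in values]                # plain log, not rescaled

print(scaled)
print(plain)
```

The smallest value maps to 0 and the largest to 1, whereas the plain logarithm only compresses the scale without bounding it.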



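The (population) variance of a dataset is calculated using the formula below. Note that pandas' `.var()` divides by $N - 1$ by default (the sample variance, `ddof=1`), so its results differ slightly from this formula: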
  
$variance = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
  
Variance is a statistical measure that quantifies the spread or dispersion of a dataset. It provides information about how individual data points deviate from the mean. A high variance indicates that the data points are widely spread out from the mean, while a low variance indicates that the data points are clustered closely around the mean.
  
Components of the variance equation:
  
$x_i$: Represents each individual value in the dataset. The subscript $i$ denotes a specific data point in the dataset.  
$\mu$: Represents the mean (average) of the dataset. To calculate the mean, you sum up all the values and divide by the total number of data points.  
$N$: Represents the number of data points in the dataset.
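
As a worked example of the variance formula above, the sketch below compares dividing by $N$ with dividing by $N - 1$ using Python's `statistics` module. The data values are made up.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean is 5.0

# Population variance: sum of squared deviations (32) divided by N (8).
population_var = statistics.pvariance(data)

# Sample variance: the same sum divided by N - 1 (7).
sample_var = statistics.variance(data)

print(population_var)
print(sample_var)
```

The two results converge as the dataset grows, so the choice matters most for small samples.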

In [158]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo.seconds_log.var())

seconds    424087.417474
minutes       117.907176
dtype: float64
1.1223923881183004


Now it's time to engineer new features in the ufo dataset.

## Engineering new features
  
Now that we've taken care of some of the more straightforward preprocessing tasks, it's time to engineer new features.
  
**UFO feature engineering**
  
There are several fields in the UFO dataset that are great candidates for feature engineering. From the date field, we may want to know the month of the sighting. The number of minutes needs to be extracted from the length of time field. And finally, the description field contains a text description of the sighting. 
  
It would be interesting to vectorize that text and see what we can learn from it. Some important code to remember for date extraction is to use attributes like `.dt.month` and `.dt.hour` to get the pieces of the date you need. Regular expressions will help you extract numbers from text, and you can use the `.group()` method to return the results. And finally, scikit-learn and `TfidfVectorizer()` will vectorize text fields.
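
A short sketch of the date attributes mentioned above, using two made-up timestamps in place of the UFO data:

```python
import pandas as pd

# Toy timestamps (hypothetical), parsed into a datetime Series.
dates = pd.to_datetime(pd.Series(["2002-11-21 05:45", "2012-06-16 23:00"]))

months = dates.dt.month  # calendar month, 1 to 12
hours = dates.dt.hour    # hour of day, 0 to 23

print(months.tolist())
print(hours.tolist())
```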

### Encoding categorical variables
  
There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.
  
1. Using `.apply()`, write a conditional `lambda` function that returns 1 if the value is "us" and 0 otherwise.
  
2. Print out the number of `.unique()` values in the type column.
  
3. Using `pd.get_dummies()`, create a one-hot encoded set of the type column.
  
4. Finally, use `pd.concat()` to concatenate the type_set encoded variables to the ufo dataset.

In [159]:
ufo.head(1)

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long,minutes,seconds_log
0,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.021389,-80.382222,5.0,5.703782


In [160]:
# Use pandas to encode country (us) values as 1 and others as 0
ufo["country_enc"] = ufo.country.apply(lambda x: 1 if x == 'us' else 0)

# Print the number of unique type values
print(len(ufo.type.unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo.type)
print(type_set.columns)

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

21
Index(['changing', 'chevron', 'cigar', 'circle', 'cone', 'cross', 'cylinder',
       'diamond', 'disk', 'egg', 'fireball', 'flash', 'formation', 'light',
       'other', 'oval', 'rectangle', 'sphere', 'teardrop', 'triangle',
       'unknown'],
      dtype='object')


Let's continue on by extracting date components.

### Features from dates
  
Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.
  
1. Print out the `.head()` of the date column.
  
2. Retrieve the month attribute of the date column.
  
3. Retrieve the year attribute of the date column.
  
4. Take a look at the `.head()` of the date, month, and year columns.

In [161]:
# Look at the first 5 rows of the date column
print(ufo.date.head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the head of all three columns
print(ufo[['date', 'month', 'year']].head())

0   2002-11-21 05:45:00
1   2012-06-16 23:00:00
2   2013-06-09 00:00:00
3   2013-04-26 23:27:00
4   2013-09-13 20:30:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013


The pandas series attributes `.dt.month` and `.dt.year` are extremely useful for extraction tasks.

### TF-IDF (Term Frequency-Inverse Document Frequency)
  
Formula:  
  
$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$
  
Where:  

- $TF(t, d)$: Represents the Term Frequency of term $t$ in document $d$. It measures the frequency of a term within a document, often computed as the count of term $t$ divided by the total number of terms in document $d$.  
- $IDF(t)$: Represents the Inverse Document Frequency of term $t$. It measures the rarity of term $t$ across a collection of documents, often computed as the logarithm of the total number of documents divided by the count of documents that contain term $t$.  
  
TF-IDF combines both the local importance (TF) and the global importance (IDF) of a term to quantify its significance in a specific document within a collection of documents.
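
The textbook definition above can be written out in a few lines of plain Python. This is a sketch under simple assumptions: documents are already tokenized into lowercase word lists, the toy corpus is made up, and the helper names `tf()`, `idf()` and `tf_idf()` are chosen for illustration.

```python
import math

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["i", "love", "cats"],
    ["i", "hate", "dogs"],
    ["i", "have", "a", "cat", "and", "a", "dog"],
]


def tf(term, doc):
    """Count of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Log of (number of documents / documents containing the term).

    Raises ZeroDivisionError for a term that appears in no document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf_idf("i", docs[0], docs))     # appears in every document
print(tf_idf("love", docs[0], docs))  # appears in one document only
```

A term that occurs in every document, such as "i" here, gets an IDF of zero and therefore a TF-IDF of zero under this unsmoothed definition.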
    
---
  
### Scikit-learn's Implementation of TF-IDF
  
Imported as:  
  
`from sklearn.feature_extraction.text import TfidfVectorizer`
  
Formula (with the default settings `smooth_idf=True`, `sublinear_tf=False`, and `norm='l2'`):
  
$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \left(\ln\left(\frac{1 + N}{1 + \text{DF}(t)}\right) + 1\right)$
  
Where:
  
- $TF(t, d)$: Represents the Term Frequency of term $t$ in document $d$, by default the raw count of the term in the document.  
- $DF(t)$: Represents the Document Frequency of term $t$, which is the number of documents in the collection that contain term $t$.  
- $N$: Represents the total number of documents in the collection.  
- The $+1$ terms inside the logarithm are for smoothing: they act as if one extra document contained every term, which prevents division by zero. The $+1$ added to the logarithm ensures that terms appearing in every document are not ignored entirely.
  
After these scores are computed, each document's row is scaled to unit Euclidean (L2) length by default. Sublinear TF scaling, which replaces $TF$ with $1 + \ln(TF)$, is available through `sublinear_tf=True` but is off by default. You can use this equation to understand the underlying calculation performed by scikit-learn's `TfidfVectorizer` when transforming text data into TF-IDF features.
  
The `TfidfVectorizer()` class in scikit-learn allows you to preprocess text data and convert it into a matrix where each row represents a document and each column represents a term. The cell values represent the TF-IDF scores for each term in each document. This matrix can then be used as input for machine learning algorithms or other text analysis tasks.
  
Overall, `TfidfVectorizer()` is a useful tool for transforming text data into a numerical representation that can be used for various NLP tasks, such as document classification, clustering, information retrieval, and more.
  
**Suppose we have a collection of three documents**:
  
Document 1: "I love cats."  
Document 2: "I hate dogs."  
Document 3: "I have a cat and a dog."  
  
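Using `TfidfVectorizer()`, we can transform this collection of documents into a TF-IDF matrix. Here's an illustrative output, assuming the plurals were first reduced to "cat" and "dog" and that "I", "a", and "and" were dropped. With default settings `TfidfVectorizer` does no stemming, so the exact columns and values depend on the preprocessing used:  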
<table>
  <tr>
    <th></th>
    <th>cat</th>
    <th>dog</th>
    <th>hate</th>
    <th>have</th>
    <th>love</th>
  </tr>
  <tr>
    <td>Document 1</td>
    <td>0.594534</td>
    <td>0.000000</td>
    <td>0.000000</td>
    <td>0.000000</td>
    <td>0.80473</td>
  </tr>
  <tr>
    <td>Document 2</td>
    <td>0.000000</td>
    <td>0.594534</td>
    <td>0.80473</td>
    <td>0.000000</td>
    <td>0.00000</td>
  </tr>
  <tr>
    <td>Document 3</td>
    <td>0.425441</td>
    <td>0.425441</td>
    <td>0.000000</td>
    <td>0.594534</td>
    <td>0.00000</td>
  </tr>
</table>
  
In this matrix, each row represents a document, and each column represents a term. The cell values represent the TF-IDF scores. Higher values indicate that a term is more important within a particular document.
  
For example, in Document 1, the term "cat" has a TF-IDF score of 0.594534, while the term "love" has a score of 0.80473. The term "dog" has a score of 0.0 in Document 1 since it doesn't appear in that document.
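
To see where such scores come from, the sketch below recomputes TF-IDF for the three documents in plain Python. It is meant to mirror scikit-learn's documented defaults (lowercasing, tokens of two or more characters, smoothed IDF, L2 normalization) rather than reproduce its implementation, and it applies no stemming, so its columns differ from the illustrative table above.

```python
import math
import re
from collections import Counter

documents = ["I love cats.", "I hate dogs.", "I have a cat and a dog."]

# Lowercase, then keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Smoothed inverse document frequency: ln((1 + N) / (1 + DF)) + 1.
n_docs = len(documents)
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    # Scale each document's row to unit Euclidean length.
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

print(vocabulary)
for row in rows:
    print([round(w, 4) for w in row])
```

Without stemming, "cat" and "cats" stay separate columns, so the vocabulary has eight terms and every term appears in exactly one document.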

### Text vectorization
  
You'll now transform the desc column in the UFO dataset into TF-IDF vectors, since there's likely something we can learn from this field.
  
1. Print out the `.head()` of the desc column.
  
2. Instantiate a `TfidfVectorizer()` object.
  
3. Fit and transform the desc column using vec.
  
4. Print out the `.shape` of the desc_tfidf vector, to take a look at the number of columns this created.

In [162]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Take a look at the head of the desc field
print(ufo.desc.head())

# Instantiate the tfidf vectorizer object
vec = TfidfVectorizer()

# Fit and transform desc using vec
desc_tfidf = vec.fit_transform(ufo.desc)

# Look at the number of columns and rows
print(desc_tfidf.shape)

0    It was a large&#44 triangular shaped flying ob...
1    Dancing lights that would fly around and then ...
2    Brilliant orange light or chinese lantern at o...
3    Bright red light moving north to north west fr...
4    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object
(1866, 3422)


You'll notice that the text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.

## Feature selection and modeling
  
In this final section, you'll select which features to use for modeling and you'll model the processed UFO data in different ways.
  
**Feature selection and modeling**
  
We need to do a little bit of feature selection before we model this data. Keep in mind that you want to eliminate redundant features, and there are a couple of candidates for that in this dataset, both in its original form and due to feature engineering. We also have a text vector that we can inspect and eliminate words from. As far as modeling goes, you've had plenty of practice with it, and now you get to see the results of your preprocessing work.
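
One way to spot redundant numeric features is to look for pairs with a very high absolute correlation. The sketch below does this on a made-up DataFrame; the 0.95 cut-off is an arbitrary assumption, not a rule from this course.

```python
import pandas as pd

# Toy data (hypothetical): minutes is just seconds divided by 60.
toy = pd.DataFrame({
    "seconds": [60.0, 120.0, 300.0, 600.0, 1200.0],
    "minutes": [1.0, 2.0, 5.0, 10.0, 20.0],
    "month": [11, 6, 6, 4, 9],
})

corr = toy.corr().abs()
threshold = 0.95  # assumed cut-off for "redundant"

# Collect each pair of columns once (upper triangle only).
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(redundant_pairs)
```

For each flagged pair, keeping one column and dropping the other is usually enough.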
  
**Final thoughts**
  
And finally, remember that preprocessing and modeling are often iterative practices, and it might take a few tries to find the ideal feature configuration that improves your model's performance. It also helps to be extremely knowledgeable about the dataset that you're working with, as well as having a good understanding of the model you're trying to build.

### Selecting the ideal dataset
  
Now to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can select it and drop the other columns related to location: city, country, lat, long, and state.
  
You've engineered the month and year columns, so you no longer need the date or recorded columns. You also standardized the seconds column as seconds_log, so you can drop seconds and minutes.
  
You vectorized desc, so it can be removed. For now you'll keep type.
  
You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.
  
1. Make a list of all the columns to drop, to_drop.
  
2. Drop these columns from ufo.
  
3. Use the `words_to_filter()` function you created previously; pass in vocab, `vec.vocabulary_`, desc_tfidf, and keep the top 4 words as the last parameter.

In [163]:
# Reusing the two helper functions written in an earlier exercise

def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    """
    Returns the column indices of the top weighted words for one document.

    Parameters:
    - vocab (dict): Mapping of column index to word.
    - original_vocab (dict): Mapping of word to column index (e.g. vec.vocabulary_).
    - vector (scipy.sparse.csr_matrix): Vectorized representation of text.
    - vector_index (int): Row index of the document in the vector matrix.
    - top_n (int): Number of top weighted words to return.

    Returns:
    - list: Column indices of the top weighted words.

    """

    # Create a dictionary of word indices and their corresponding weights
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform the zipped dictionary into a pandas Series with words as indices
    zipped_series = pd.Series({vocab[i]: zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to retrieve the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    
    # Map each top word back to its column index in the vectorizer's vocabulary
    return [original_vocab[i] for i in zipped_index]


def words_to_filter(vocab, original_vocab, vector, top_n):
    """
    Returns a set of word indices to filter based on the top weighted words in a vectorized representation of text.

    Parameters:
    - vocab (dict): Mapping of column index to word.
    - original_vocab (dict): Mapping of word to column index (e.g. vec.vocabulary_).
    - vector (scipy.sparse.csr_matrix): Vectorized representation of text.
    - top_n (int): Number of top weighted words to consider for filtering.

    Returns:
    - set: Set of word indices to filter.

    """

    filter_list = []
    for i in range(0, vector.shape[0]):
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set to remove duplicate word indices
    return set(filter_list)

In [164]:
# Loading the required vocab
vocab_csv = pd.read_csv('../_datasets/vocab_ufo.csv', index_col=0).to_dict()
vocab = vocab_csv['0']

In [165]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds', 'seconds_log', 'minutes']].corr())

# Make a list of features to drop
to_drop = ['city', 'country', 'date', 'desc', 'lat', 
           'length_of_time', 'seconds', 'minutes', 'long', 'state', 'recorded']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, top_n=4)

              seconds  seconds_log   minutes
seconds      1.000000     0.853371  0.980944
seconds_log  0.853371     1.000000  0.825924
minutes      0.980944     0.825924  1.000000


You're almost done. In the next exercises, you'll model the UFO data in a couple of different ways.

### Modeling the UFO dataset, part 1
  
In this exercise, you're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. The X dataset contains the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is "us" and 0 is any other value ("ca" in this dataset).
  
1. Print out the `.columns` of the X set.
  
2. Split the X and y sets, ensuring that the class distribution of the labels is the same in the training and test sets, and using a `random_state=` of 42.
  
3. Fit knn to the training data.
  
4. Print the test set accuracy of the knn model.

In [166]:
# Show up to 50 columns so the wide DataFrame is not truncated
pd.set_option('display.max_columns', 50)

# Display
ufo_dropped.head()

Unnamed: 0,type,seconds_log,country_enc,changing,chevron,cigar,circle,cone,cross,cylinder,diamond,disk,egg,fireball,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown,month,year
0,triangle,5.703782,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,11,2002
1,light,6.39693,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,6,2012
2,light,4.787492,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,6,2013
3,light,4.787492,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,4,2013
4,sphere,5.703782,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,9,2013


In [167]:
# Split into features and target: drop 'type' (already one-hot encoded) and the target 'country_enc'
X = ufo_dropped.drop(['type', 'country_enc'], axis=1)
y = ufo_dropped['country_enc']

In [168]:
# Take a look at the features in the X set of data
print(X.columns)

Index(['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',
       'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')


Use `.fit_transform()` when scaling the training features.  
Use `.transform()` when scaling the test features.  
Use `.fit()` when fitting the model to the scaled training features.
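
The three rules above can be sketched as follows. The exercise itself does not scale its features, so this example uses made-up data and `StandardScaler` purely to illustrate the order of the calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales.
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(2000, 10, 200)])
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
print(knn.score(X_test_scaled, y_test))
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the preprocessing step.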

In [169]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


# Seeding
SEED = 42

# Instantiating the model
knn = KNeighborsClassifier()

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=SEED)

# Fit knn to the training sets
knn.fit(X_train, y_train)

# Print the accuracy score of knn on the test sets
print(knn.score(X_test, y_test))

0.867237687366167


This model performs pretty well (at 86.72% accuracy). It seems like you've made pretty good feature selection choices here.
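
An accuracy figure is easier to judge next to a baseline, such as always predicting the most common class. The sketch below computes that baseline on made-up labels, not on the actual test set, so the number it prints says nothing about the UFO data.

```python
from collections import Counter

# Hypothetical test labels: 1 = "us", 0 = other.
y_test_labels = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

counts = Counter(y_test_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = majority_count / len(y_test_labels)
print(majority_class, baseline_accuracy)
```

Running the same calculation on the real `y_test` would show how much of the model's accuracy comes from the features rather than from class imbalance.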

### Modeling the UFO dataset, part 2
  
Finally, you'll build a model on the text vector we created, desc_tfidf, filtered down to the columns in filtered_words. Let's see if you can predict the type of the sighting based on the text. You'll use a Naive Bayes model for this.
  
1. Filter the desc_tfidf vector by passing a list of filtered_words into the index.
  
2. Split the filtered_text features and y, ensuring an equal class distribution in the training and test sets; use a `random_state=` of 42.
  
3. Use the nb model's `.fit()` to fit X_train and y_train.
  
4. Print out the `.score()` of the nb model on the X_test and y_test sets.

In [170]:
# The exercise asks us to predict the type of sighting from the description text (ufo.desc)
y = ufo_dropped['type']

In [171]:
from sklearn.naive_bayes import GaussianNB


# Seeding
SEED = 42

# Instantiate the model
nb = GaussianNB()

# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
X_train, X_test, y_train, y_test = train_test_split(
    filtered_text.toarray(),
    y,
    stratify=y,
    random_state=SEED
)

# Fit nb to the training sets
nb.fit(X_train, y_train)

# Print the score of nb on the test sets
print(nb.score(X_test, y_test))

0.17987152034261242


As you can see, this model performs very poorly on this text data (at 17.99% accuracy). This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting type.

You've learned valuable skills for preparing your data for modeling. You now know how to deal with missing data and incorrect types, how to standardize numerical values and process categorical ones, how to engineer new features that will improve your dataset, and finally, how to select features for modeling.