# Feature engineering
  
In this chapter, we're going to talk about a very important part of the preprocessing workflow: feature engineering. 
  
You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.
  
**What is feature engineering?**
  
Real-world data is often not neat and tidy, and in addition to preprocessing steps like standardization, we'll likely have to extract and expand information from existing features. Feature engineering is the creation of new features based on existing features, and it adds information to the dataset that can improve prediction or clustering tasks, or adds insight into relationships between features. In this chapter, we'll just focus on the key components for preprocessing. 
  
There are automated ways to create new features, but for now, we're going to cover manual methods of feature engineering. Manual methods require us to already have an in-depth knowledge of the dataset that we're working with. Feature engineering is also something that is very dependent on the particular dataset you're analyzing. The goal for this chapter is to demonstrate some scenarios where feature engineering can be useful.
  
**Feature engineering scenarios**
  
There are a variety of scenarios where we might want to engineer features from existing data. An extremely common one is with text data. For example, if we're building some kind of natural language processing model, we'll have to create a vector of the words in our dataset. Another scenario might also be related to string data: maybe we have a column of people's favorite colors. In order to feed this information into a scikit-learn model, we'll have to encode this information numerically.
  
Another common example is with timestamps. We might see a full timestamp that includes the time down to the second or millisecond, which might be much too granular for a prediction task, so we can create a new column that contains the day or the month component. Some columns can also contain a list of some kind, such as test scores, or running times, and maybe it's more useful to use an average. These are all examples of situations where we'd want to generate new features from existing columns.

In [67]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Feature engineering knowledge test
  
Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?
  
Possible Answers
  
- [ ] A column of timestamps
  
- [ ] A column of newspaper headlines
  
- [ ] A column of weight measurements
  
- [x] Both 1 and 2
  
- [ ] None of the above
  
Correct! Timestamps can be broken into days or months, and headlines can be used for natural language processing.

### Identifying areas for feature engineering
  
Take an exploratory look at the volunteer dataset.
  
Which of the following columns would you want to perform a feature engineering task on? 
- [ ] vol_requests
  
- [ ] title
  
- [ ] created_date
  
- [ ] category_desc
  
- [x] 2, 3, and 4
  
Correct! All three of these columns will require some feature engineering before modeling.

In [68]:
# Load data
volunteer = pd.read_csv('../_datasets/volunteer_opportunities.csv')

# Changing the display to see more (up to 200) columns as it cuts off normally 
pd.set_option('display.max_columns', 200)

# Display
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,amsl,amsl_unit,org_title,org_content_id,addresses_count,locality,region,postalcode,primary_loc,display_url,recurrence_type,hours,created_date,last_modified_date,start_date_date,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,,,Center For NYC Neighborhoods,4426,1,,NY,,,/opportunities/4996,onetime,0,January 13 2011,June 23 2011,July 30 2011,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,,,Bpeace,37026,1,"5 22nd St\nNew York, NY 10010\n(40.74053152272...",NY,10010.0,,/opportunities/5008,onetime,0,January 14 2011,January 25 2011,February 01 2011,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,,,Street Project,3001,1,,NY,10026.0,,/opportunities/5016,onetime,0,January 19 2011,January 21 2011,January 29 2011,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,,,Oxfam America,2170,1,,NY,2114.0,,/opportunities/5022,ongoing,0,January 21 2011,January 25 2011,February 14 2011,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,,,Office of Recycling Outreach and Education,36773,1,,NY,10455.0,,/opportunities/5055,onetime,0,January 28 2011,February 01 2011,February 05 2011,February 05 2011,approved,,,,,,,,


## Encoding categorical variables
  
Because models in scikit-learn require numerical input, if the dataset contains categorical variables, we'll have to encode them. Let's take a look at how to do that.
  
**Categorical variables**
  
Often, real-world data contains categorical variables to store values that can only take a finite number of discrete values. For example, here's a set of some user data with categorical values. We have a subscribed column, with binary yes or no values, as well as a column with users' favorite colors, which has multiple categorical values.
  
**Encoding binary variables - pandas**
  
The first encoding we'll cover is encoding binary values, like in the column shown. This is actually quite simple, and can be done in both pandas and scikit-learn. In pandas, we can use the `.apply()` method to encode 1s and 0s in a DataFrame column. Using `.apply()`, we can write a conditional that returns a 1 if the value in subscribed is y, and a 0 if the value is n. Looking at a side by side comparison of the columns, we can see that the column is now numerically encoded. pandas could be a good choice if we've not finished preprocessing, or if we're interested in further exploratory work once we've encoded.
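As a minimal sketch of the `.apply()` approach described above, assuming a small made-up `users` DataFrame (the data and the `sub_enc` column name are illustrative, not taken from the course dataset):

```python
import pandas as pd

# Hypothetical user data; column names mirror the ones described in the text
users = pd.DataFrame({
    "subscribed": ["y", "n", "n", "y", "y"],
    "fav_color": ["blue", "green", "orange", "green", "blue"],
})

# Return 1 if the value is "y" and 0 otherwise
users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)

# Compare the original and encoded columns side by side
print(users[["subscribed", "sub_enc"]])
```

Note that the conditional maps anything other than "y" to 0, so missing or misspelled values would silently become 0. Checking the column's unique values first is a reasonable precaution.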
  
**Encoding binary variables - scikit-learn**
  
We can also encode binary variables in scikit-learn using `LabelEncoder()`. It's useful to know both methods if, for example, we're implementing encoding as part of scikit-learn's pipeline functionality, which allows us to string together different steps of the machine learning workflow. Creating a `LabelEncoder()` object also allows us to reuse this encoding on other data, such as on new data or a test set. To encode values in scikit-learn, we'll need to instantiate the `LabelEncoder()` transformer. We can use the `.fit_transform()` method to both fit the encoder to the data as well as transform the column. Printing out both the subscribed column and the new column, we can see that the y's and n's have been encoded to 1s and 0s.
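The scikit-learn version might look like the sketch below. The `users` and `new_users` data are made up for illustration; the point is that a fitted encoder can be reused through `.transform()` on data it has not seen before.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data, for illustration only
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y", "y"]})

# Fit the encoder and transform the column in one step
le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])

# The learned classes are sorted, so "n" maps to 0 and "y" maps to 1
print(le.classes_)
print(users[["subscribed", "sub_enc_le"]])

# Reuse the same fitted encoding on new data (for example, a test set)
new_users = pd.Series(["n", "y", "n"])
new_enc = le.transform(new_users)
print(new_enc)
```

Calling `.transform()` on a value the encoder never saw during fitting raises an error, which is one reason to fit on data that contains every expected category.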
  
**One-hot encoding**
  
One-hot encoding encodes categorical variables into 1s and 0s when there are more than two values to encode. It works by looking at the entire list of unique values in a column, transforming each value into an array, and designating a 1 in the appropriate position to encode that a particular value occurs. For example, in the fav_color column, we have three values: blue, green, and orange. If we were to encode these colors with 0s and 1s based on this list, we would get something like this: blue would have a 1 in the first position followed by two zeros, green would have a one in the second position, and orange would have a one in the last position. 
  
An encoded column would look something like this:
  
![Alt text](../_images/encoded-col-demo.png)  
  
We can use the pandas `get_dummies()` function to directly encode categorical values in this way.
  
![Alt text](../_images/encoded-col-pandas.pd.png)  
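A small sketch of one-hot encoding with `get_dummies()`, assuming a made-up `fav_color` column with the three colors mentioned above. The `dtype=int` argument is added here so the output is 0s and 1s, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical data with three distinct colors
users = pd.DataFrame({"fav_color": ["blue", "green", "orange", "green"]})

# One new column per unique value; dtype=int gives 0/1 instead of True/False
color_enc = pd.get_dummies(users["fav_color"], dtype=int)

print(color_enc)
```

Each row has a single 1 in the column matching its original value, with columns ordered alphabetically by category.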

> NOTE: `from sklearn.preprocessing import LabelEncoder`

### Encoding categorical variables - binary
  
Take a look at the hiking dataset. There are several columns here that need encoding before they can be modeled, one of which is the Accessible column. Accessible is a binary feature, so it has two values, Y or N, which need to be encoded into 1s and 0s. Use scikit-learn's `LabelEncoder()` to perform this transformation.
  
1. Store `LabelEncoder()` in a variable named enc.
  
2. Using the encoder's `.fit_transform()` method, encode the hiking dataset's "Accessible" column. Call the new column Accessible_enc.
  
3. Compare the two columns side-by-side to see the encoding.

In [69]:
hiking = pd.read_json('../_datasets/hiking.json')
hiking.head()

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,


In [70]:
from sklearn.preprocessing import LabelEncoder


# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


`.fit_transform()` is a good way to both fit an encoding and transform the data in a single step.

### Encoding categorical variables - one-hot
  
One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use the `pd.get_dummies()` function to do so.
  
1. Call `get_dummies()` on the volunteer.category_desc column to create the encoded columns and assign it to category_enc.
  
2. Print out the `.head()` of the category_enc variable to take a look at the encoded columns.

In [71]:
# Load data
volunteer = pd.read_csv('../_datasets/volunteer_opportunities.csv')

# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])

# Take a look at the encoded columns
category_enc.head(15)

Unnamed: 0,Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,0,0,0,0,0,0
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,1,0,0,0
5,0,0,1,0,0,0
6,0,0,0,0,0,1
7,0,0,0,0,1,0
8,0,0,0,0,0,0
9,0,0,0,1,0,0


`get_dummies()` is a simple and quick way to encode categorical variables.

## Engineering numerical features
  
Though we may have a dataset filled with numerical features, they may need a little bit of feature engineering to properly prepare for modeling. In this section, we'll talk about aggregate statistics as well as dates and how engineering numerical features can add value to our model's performance.

**Aggregate statistics**
  
If we have a collection of features related to a single feature, like temperatures on different days, we may want to take an average or median to use as a feature for modeling instead. A common method of feature engineering is to take an aggregate of a set of numbers to use in place of those features. This can be helpful in reducing the dimensionality of our feature space, or perhaps we simply don't need multiple values that are very close to each other.
  
Here we have a dataset of temperatures over the course of three days in four different cities. Rather than using all three days, let's take an average of the three. First, we can subset the columns we want to aggregate over using `.loc[]`. Then, we set `axis=1` so that the mean is calculated across the columns of each row, and save the results in the mean column.
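The aggregation steps can be sketched as follows. The `temps` DataFrame and its values are invented for illustration; only the `.loc[]` and `axis=1` usage comes from the description above.

```python
import pandas as pd

# Hypothetical temperatures for four cities over three days
temps = pd.DataFrame({
    "city": ["NYC", "SF", "LA", "Boston"],
    "day1": [68.3, 75.1, 80.3, 63.0],
    "day2": [67.9, 75.5, 84.0, 61.0],
    "day3": [67.8, 74.9, 81.3, 61.2],
})

# Select all rows and the three day columns (label slices in .loc are inclusive)
day_cols = temps.loc[:, "day1":"day3"]

# axis=1 computes the mean across columns, giving one value per row
temps["mean"] = day_cols.mean(axis=1)

print(temps)
```

Selecting the numeric columns explicitly keeps the non-numeric `city` column out of the calculation.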
  
**Dates**
  
Dates and timestamps are another area where we might want to reduce granularity in our dataset. If we're doing time series analysis, we will likely need to keep this granularity to capture underlying trends on different timescales, but if we're running a prediction task, we may need higher-level information like the month, year, or both. Here's a collection of purchase dates. The full date is too granular for the prediction task we want to do, so let's extract the month from each date.
  
The first thing to do is to convert this date column into a pandas datetime column using the `pd.to_datetime()` function. This makes extracting the components much easier. Once it's converted, we can use the `.dt.month` attribute to extract the month. There are lots of other attributes for extracting different components, like day and year, and I encourage you to try these out yourself. We can see that there is now a column of month values ready for modeling.
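As a rough sketch of the conversion and extraction, assuming a made-up `purchases` DataFrame whose date strings follow the same "Month DD YYYY" style used elsewhere in this chapter:

```python
import pandas as pd

# Hypothetical purchase data with dates stored as strings
purchases = pd.DataFrame({
    "date": ["July 30 2011", "February 01 2011", "January 29 2011"],
    "purchase": [10.5, 20.0, 7.25],
})

# Convert the string column to a datetime column
purchases["date_converted"] = pd.to_datetime(purchases["date"])

# The .dt accessor exposes datetime components for the whole column
purchases["month"] = purchases["date_converted"].dt.month

print(purchases[["date_converted", "month"]])
```

The same accessor offers `.dt.day` and `.dt.year` when a different level of granularity suits the task better.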

### Aggregating numerical features
  
A good use case for taking an aggregate statistic to create a new feature is when you have many features with similar, related values. Here, you have a DataFrame of running times named running_times_5k. For each name in the dataset, take the mean of their 5 run times.
  
1. Use the `.loc[]` indexer to select all rows and columns, then take the `.mean()` across the columns of each row.
  
2. Print the `.head()` of the DataFrame to see the mean column.

In [72]:
running_times_5k = pd.read_csv('../_datasets/running_times_5k.csv')
running_times_5k

Unnamed: 0,name,run1,run2,run3,run4,run5
0,Sue,20.1,18.5,19.6,20.3,18.3
1,Mark,16.5,17.1,16.9,17.6,17.3
2,Sean,23.5,25.1,25.2,24.6,23.9
3,Erin,21.7,21.1,20.9,22.1,22.2
4,Jenny,25.8,27.1,26.1,26.7,26.9
5,Russell,30.9,29.6,31.4,30.4,29.9


In [73]:
# Use .loc to create a mean column
running_times_5k["mean"] = running_times_5k.loc[:, :].mean(axis=1, numeric_only=True)

# Take a look at the results
print(running_times_5k.head())

    name  run1  run2  run3  run4  run5   mean
0    Sue  20.1  18.5  19.6  20.3  18.3  19.36
1   Mark  16.5  17.1  16.9  17.6  17.3  17.08
2   Sean  23.5  25.1  25.2  24.6  23.9  24.46
3   Erin  21.7  21.1  20.9  22.1  22.2  21.60
4  Jenny  25.8  27.1  26.1  26.7  26.9  26.52


`.loc[]` is especially helpful for operating across columns.

### Extracting datetime components
  
There are several columns in the volunteer dataset comprised of datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.
  
1. Convert the start_date_date column into a pandas datetime column and store it in a new column called start_date_converted.
  
2. Retrieve the month component of start_date_converted and store it in a new column called start_date_month.
  
3. Print the `.head()` of just the start_date_converted and start_date_month columns.

In [74]:
volunteer.start_date_date.head()

0        July 30 2011
1    February 01 2011
2     January 29 2011
3    February 14 2011
4    February 05 2011
Name: start_date_date, dtype: object

In [75]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer['start_date_date'])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer['start_date_converted'].apply(lambda row: row.month)

# Take a look at the converted and new month columns
volunteer[['start_date_converted', 'start_date_month']].head()


Unnamed: 0,start_date_converted,start_date_month
0,2011-07-30,7
1,2011-02-01,2
2,2011-01-29,1
3,2011-02-14,2
4,2011-02-05,2


You can also use the `.dt` accessor directly on a datetime column: `.dt.month` gets the month, `.dt.day` gets the day, and `.dt.year` gets the year.

## Engineering text features
  
Though text data is a little more complicated to work with, there's a lot of useful feature engineering we can do with it.
  
**Extraction**
  
One method is to extract the pieces of information that you need: maybe part of a string, or extracting a number, and transforming it into a feature. We can also transform the text itself into features, for use with natural language processing methods or prediction tasks. Let's learn how to extract data from text fields. 
  
We're going to use regular expressions to extract information from strings. Regular expressions are patterns that can be used to extract information from text data. You should already be familiar with regular expressions, but for the purposes of this course, we're going to focus only on extracting numbers from strings. To use Python's rich regular expression functionality, we'll first need to import the `re` module. Here we have a string
  
`my_string = 'temperature:75.6 F'`
  
and we want to extract the temperature value from it, so we can model using the numerical data. 
  
`temp = re.search(r'\d+\.\d+', my_string)`

We'll need to use a pattern to extract this float, so let's break down the pattern in `re.search()`. "`\d`" means that we want to grab digits, and the "`+`" means we want to grab one or more of them, as many as possible. So if there are two next to each other, we want both (like the 75). "`\.`" means we want to grab the decimal point, and then there's another "`\d+`" at the end to grab the digits on the right-hand side of the decimal. `re.search()` then searches the string for the first match of the pattern and returns a match object, from which we can extract the matched text with the `.group()` method.
  
`print(float(temp.group(0)))`
  
out: `75.6`
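Put together as a runnable sketch, the extraction looks like the code below. The `extract_number()` helper and its optional-decimal pattern are an assumed variation, not part of the original example. It is included because the pattern `\d+\.\d+` finds nothing in a string that holds a whole number such as "72".

```python
import re

my_string = "temperature:75.6 F"

# Raw string (r"...") so the backslashes reach the regex engine unchanged
temp = re.search(r"\d+\.\d+", my_string)
print(float(temp.group(0)))


def extract_number(text):
    """Hypothetical helper: return the first number in text as a float, or None."""
    # (?:\.\d+)? makes the decimal part optional, so "72" also matches
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group(0)) if match else None


print(extract_number("temperature:72 F"))
print(extract_number("no reading"))
```

Checking for `None` before calling `.group()` matters because `re.search()` returns `None` when nothing matches.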
  
**Vectorizing text**
  
If we're working with text, we might want to model it in some way. Maybe we want to use document text in a classification task, such as classifying emails as spam or not. In order to do that, we'll need to vectorize the text and transform it into a numerical input that scikit-learn can use. We're going to create a tf/idf vector. tf/idf is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs. It stands for term frequency-inverse document frequency: a word's weight grows with how often it appears in a document and shrinks with how many documents in the corpus contain it, so words that are distinctive to a document receive more weight than words that are common everywhere. We can create tf/idf vectors in scikit-learn by using `TfidfVectorizer()`.
    
`from sklearn.feature_extraction.text import TfidfVectorizer`  
  
Here we have a collection of text. In order to vectorize it, we can simply pass the column of text we want to vectorize into the `.fit_transform()` method, which is called on the `TfidfVectorizer()`.
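A minimal sketch of this step, using a tiny made-up collection of documents in place of a real text column:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical text column, for illustration only
documents = pd.Series([
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
])

# Learn the vocabulary and idf weights, then transform the text
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# The result is a sparse matrix: one row per document, one column per term
print(text_tfidf.shape)
print(sorted(tfidf_vec.vocabulary_))
```

The result is a sparse matrix rather than a regular array, which keeps memory use manageable when the vocabulary is large.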
  
**Text classification**
  
Now that we have a vectorized version of the text, we can use it for classification. We'll use a Naive Bayes classifier, which is based on Bayes' theorem of conditional probability, shown below, and performs well on text classification tasks. 
  
Naive Bayes treats each feature as independent from the others, which can be a naive assumption, but works out quite well on text data. Because each feature is treated independently, this classifier works well on high-dimensional data and is very efficient.

### Conditional Probability
  
**Defined as**:
  
$P(A | B) = \frac{P(A \cap B)}{P(B)}$
  
where:  
  
$P(A | B)$ represents the conditional probability of event $A$ given event $B$  
$P(A \cap B)$ denotes the probability of both events $A$ and $B$ occurring simultaneously  
$P(B)$ is the probability of event $B$  
  
Conditional probability is a concept that is used in Bayes' theorem. It is the probability of an event $A$ given that another event $B$ has occurred. It is a fundamental concept in probability theory that quantifies the likelihood of an event based on additional information or conditions. As such, it is widely used in statistical and machine learning applications. 

### Naive Bayes
  
**Naive Bayes is built on Bayes' theorem, written for a class $C_k$ and features $x_1, x_2, ..., x_n$ as**: 
  
$P(C_k | x_1, x_2, ..., x_n) = \frac{P(C_k) \cdot P(x_1, x_2, ..., x_n | C_k)}{P(x_1, x_2, ..., x_n)}$

where:

$P(C_k)$ is the prior probability of class $C_k$  
$P(x_1, x_2, ..., x_n | C_k)$ is the likelihood of observing features $x_1, x_2, ..., x_n$ given class $C_k$  
$P(x_1, x_2, ..., x_n)$ is the probability of observing features $x_1, x_2, ..., x_n$  
  
**Alternatively it can be written as**:  
  
$P(C | X) = \frac{P(X | C) \cdot P(C)}{P(X)}$
  
where:
  
$P(C | X)$ represents the posterior probability of class $C$ given evidence $X$  
$P(X | C)$ denotes the likelihood of evidence $X$ given class $C$  
$P(C)$ is the prior probability of class $C$  
$P(X)$ is the probability of evidence $X$  
  
**In both Naive Bayes formulas**:
  
$C_k$ represents a specific class or category.  
$X$ represents a set of features or evidence.  
$P(C_k | x_1, x_2, ..., x_n)$ and $P(C | X)$ denote the posterior probability of class $C_k$ or $C$ given the evidence $x_1, x_2, ..., x_n$ or $X$, respectively.  
$P(C_k)$ and $P(C)$ are the prior probabilities of class $C_k$ or $C$, respectively.  
$P(x_1, x_2, ..., x_n | C_k)$ and $P(X | C)$ represent the likelihood of observing the evidence $x_1, x_2, ..., x_n$ or $X$ given class $C_k$ or $C$, respectively.  
$P(x_1, x_2, ..., x_n)$ and $P(X)$ are the probabilities of observing the evidence $x_1, x_2, ..., x_n$ or $X$, respectively.  
  
The two Naive Bayes formulas are mathematically equivalent, but they may be written in slightly different notations or symbols. The first formula uses the subscript $k$ to denote different classes, while the second formula uses a single class $C$. Additionally, the first formula explicitly represents the evidence $x_1, x_2, ..., x_n$, while the second formula represents it as a single entity $X$. However, the underlying principle and calculation of conditional probability using Bayes' theorem are the same in both formulas.
  
Bayes' theorem allows us to update our beliefs or knowledge about the probability of an event based on new evidence. It is widely used in various fields, including statistics, machine learning, and data analysis, particularly in Bayesian inference.
  
In summary, conditional probability is a concept that describes the likelihood of an event given that another event has occurred, while Bayes' theorem is a mathematical formula that provides a framework for updating probabilities based on new evidence. Bayes' theorem involves conditional probabilities but extends beyond them by incorporating prior knowledge.
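What makes the classifier "naive" can be made explicit in the notation of the first formula above. Assuming, as described earlier, that the features are independent of one another given the class, the likelihood factorizes into a product of per-feature terms:

$P(x_1, x_2, ..., x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k)$

Because $P(x_1, x_2, ..., x_n)$ is the same for every class, it can be dropped when comparing classes, which gives:

$P(C_k | x_1, x_2, ..., x_n) \propto P(C_k) \cdot \prod_{i=1}^{n} P(x_i | C_k)$

The predicted class is the $C_k$ for which this product is largest. Each $P(x_i | C_k)$ only involves a single feature, which is why the method stays efficient on high-dimensional data such as text vectors.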

### Extracting string patterns
  
The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in pandas to apply the extraction to the DataFrame.
  
1. Search the text in the length argument for numbers and decimals using an appropriate pattern.
  
2. Extract the matched pattern and convert it to a float.
  
3. Apply the `return_mileage()` function to each row in the hiking.Length column.

In [76]:
import re


# Extract the first decimal number (e.g. "0.75") from a length string
def return_mileage(length):

    # Missing values have nothing to extract
    if length is None:
        return None
    
    # Search the text for digits, a decimal point, and more digits
    mile = re.search(r'\d+\.\d+', length)
    
    # If a match is found, group() returns the matched text
    if mile is not None:
        return float(mile.group())

    # No decimal number was found in the text
    return None
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking.Length.apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50


Regular expressions are a useful way to perform text extraction.

### Vectorizing text
  
You'll now transform the volunteer dataset's title column into a text vector, which you'll use in a prediction task in the next exercise.
  
1. Store the volunteer.title column in a variable named title_text.
  
2. Instantiate a `TfidfVectorizer()` as tfidf_vec.
  
3. Transform the text in title_text into a tf-idf vector using tfidf_vec.

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer


# Need to drop NaN observations in the 'category_desc' for train_test_split
volunteer = pd.read_csv('../_datasets/volunteer_opportunities.csv')
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)

# Taking the title text from the title column
title_text = volunteer.title

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

### Text classification using tf/idf vectors
  
Now that you've encoded the volunteer dataset's title column into tf/idf vectors, you'll use those vectors to predict the category_desc column.
  
1. Split the text_tfidf vector and y target variable into training and test sets, setting the `stratify=` parameter equal to y, since the class distribution is uneven. Notice that we have to run the `.toarray()` method on the tf/idf vector, because `GaussianNB` expects a dense array rather than the sparse matrix that `TfidfVectorizer()` returns.
  
2. Fit the X_train and y_train data to the Naive Bayes model, nb.
  
3. Print out the test set accuracy.

In [78]:
# CSR Matrix
print(text_tfidf)

  (0, 278)	0.6821380095940299
  (0, 1048)	0.7312234513930028
  (1, 832)	0.4089128467305852
  (1, 559)	0.4089128467305852
  (1, 90)	0.2211952015096988
  (1, 890)	0.3668183240931356
  (1, 490)	0.38428912950191935
  (1, 38)	0.4089128467305852
  (1, 1017)	0.4089128467305852
  (2, 680)	0.20380137329146378
  (2, 498)	0.17806074091111465
  (2, 240)	0.3097133180239295
  (2, 27)	0.3097133180239295
  (2, 708)	0.3097133180239295
  (2, 969)	0.16859216949793618
  (2, 535)	0.26756671781861974
  (2, 356)	0.3097133180239295
  (2, 1061)	0.27783064576035155
  (2, 944)	0.22729780579979794
  (2, 68)	0.1625840364724973
  (2, 487)	0.3097133180239295
  (2, 423)	0.3097133180239295
  (2, 368)	0.3097133180239295
  (3, 947)	0.7071067811865476
  (3, 922)	0.7071067811865476
  :	:
  (612, 681)	0.49600210945097983
  (612, 378)	0.49600210945097983
  (612, 773)	0.38733928023398656
  (612, 380)	0.20302987585336688
  (612, 1037)	0.19533322522098384
  (613, 937)	0.4699182426672902
  (613, 522)	0.4699182426672902
  (613, 

In [79]:
print(text_tfidf.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [80]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


# Set the random seed for a reproducible split
SEED = 42

# Instantiate the Naive Bayes model
nb = GaussianNB()

# Split the dataset according to the class distribution of category_desc
y = volunteer['category_desc']
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=SEED)

# Fitting the model to the training data
nb.fit(X_train, y_train)

# Display the model's accuracy on the test set
print(nb.score(X_test, y_test))

0.5161290322580645


Notice that the model doesn't score very well. We'll work on selecting the best features for modeling in the next chapter.