### DataCamp - Preprocessing Data for Machine Learning

Data preprocessing
- Beyond cleaning and exploratory data analysis
- prepping data for modeling
- python ML modeling requires numerical input

- Understand data with:
        df.head()
        df.columns
        df.dtypes
        df.describe()

- First step in preprocessing: remove missing data
        df.dropna() to drop all rows with a missing value
        df.drop([1,2,3]) to drop specific rows by index label
        df.drop('col label', axis=1) to drop columns
        df.dropna(axis=1, thresh=n) to drop columns with fewer than n non-null values (axis 0 = row, axis 1 = col)

- Filter dataframe based on values
        df[df['col'] == x]

- Get a count of null values in a column, then create a df with the rows removed where that column has a null value
        df['col'].isnull().sum()
        df[df['col'].notnull()]
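
The missing-data calls above can be tried on a tiny DataFrame. This is only a sketch: the data and the column names (`title`, `category`, `mostly_empty`) are made up for illustration. The point it shows is that `thresh` counts the non-null values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this example
df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "category": ["x", None, "y", "x"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# thresh is the minimum number of NON-null values a column needs to be kept
kept = df.dropna(axis=1, thresh=3)

# Count the nulls in one column, then keep only the rows where it is not null
n_missing = df["category"].isnull().sum()
subset = df[df["category"].notnull()]

print(list(kept.columns))
print(n_missing, subset.shape)
```

Note that `dropna` returns a new DataFrame, so the result has to be assigned if it should be kept.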


In [None]:
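# Keep only the features/columns that have at least 3 non-missing values
# (thresh is the minimum number of non-null values; the result is not assigned, so volunteer is unchanged)
volunteer.dropna(axis=1, thresh=3)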

# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset by indexing by where category_desc is notnull()
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

#### Working with Data Types

df.dtypes
- the most commonly used dtypes in pandas are:
- object - string/mixed
- int64 - integer
- float64 - float
- datetime64 (or timedelta) - datetime

- converting column types
        df['col'] = df['col'].astype("float")
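
A minimal sketch of type conversion, assuming a hypothetical `hits` column that was read in as strings. The second half uses `pd.to_numeric` with `errors="coerce"`, which is not part of the notes above but is a common fallback when some values cannot be parsed.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings
df = pd.DataFrame({"hits": ["737", "22", "62"]})

# astype raises an error if any value cannot be converted
df["hits"] = df["hits"].astype("int64")
print(df["hits"].dtype)

# to_numeric with errors="coerce" turns unparseable values into NaN instead
messy = pd.Series(["1.5", "n/a", "3"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```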

In [None]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer['hits'].astype('int')

# Look at the dtypes of the dataset
print(volunteer.dtypes)

#### Training and Test Sets

- train_test_split splits the data 75% into training and 25% into test sets by default
        from sklearn.model_selection import train_test_split
        x_train, x_test, y_train, y_test = train_test_split(x,y)

if you have imbalanced classes, use stratified sampling, which takes into account the distribution of classes in the dataset
ex: if the dataset contains 100 samples, 80 of class 1 and 20 of class 2
    - we want the training set to contain 75 samples: 60 class 1, 15 class 2
    - and the test set to contain 25 samples: 20 class 1, 5 class 2
    
can use the stratify parameter in train_test_split:
 - stratify = y
 - check value_counts
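
The 80/20 example above can be reproduced directly. This is a sketch on made-up data (`feature` and `label` are hypothetical names); `random_state` is only there to make the shuffle repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data matching the example above: 80 samples of class 1, 20 of class 2
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([1] * 80 + [2] * 20, name="label")

# Default split is 75% train / 25% test; stratify=y preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

print(y_train.value_counts())
print(y_test.value_counts())
```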
 
Code example below - We know that the distribution of values in the category_desc column in the volunteer dataset is uneven. If we wanted to train a model to predict category_desc, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [None]:
# Create a data with all columns except category_desc
volunteer_X = volunteer.drop("category_desc", axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[["category_desc"]]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train["category_desc"].value_counts())

### Chapter 2
#### Standardizing Data
standardization: preprocessing method that transforms continuous numerical data so that it looks normally distributed
- can be a prerequisite for many models (including scikit-learn models); skipping this step can bias the model
- two methods covered: log normalization and feature scaling

a model that operates in linear space needs data that is in linear space
- e.g. KNN, linear regression, K-means clustering

other models for nonlinear space

standardization is needed when:
- dataset features have high variance
- dataset features are continuous and on different scales
- the model makes linearity assumptions

In [None]:
#modeling without normalizing
#column = Proline, has an extremely high variance compared to the other columns.
#knn=KNeighborsClassifier(n_neighbors=k)

# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

#low accuracy because we did not standardize

#### Log Normalization
method for standardizing when you have a column with high variance
- applies a log transformation (makes the data approximately normal)
- uses the natural log: if value = 30, the log transformation is about 3.4, because e^3.4 ≈ 30 (e ≈ 2.718)
- captures relative changes and the magnitude of change; input values must be positive

- Can check variance of data before transforming
        df.var()
        df['log_2'] = np.log(df['col2'])
- check the variance again: the variances of col1 and log_2 should now be much closer together
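
A small sketch of the variance check before and after the log transform. The numbers are invented; `col1`, `col2` and `log_2` follow the placeholder names used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: col2 is on a much larger scale than col1
df = pd.DataFrame({
    "col1": [1.0, 1.2, 0.8, 1.5, 1.1],
    "col2": [100.0, 1500.0, 40.0, 9000.0, 600.0],
})

# Variance before transforming
print(df.var())

# np.log is the natural log, so np.log(30) is roughly 3.4
df["log_2"] = np.log(df["col2"])

# Variance after transforming: log_2 is far closer to col1 than col2 was
print(df[["col1", "log_2"]].var())
```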

In [None]:
#Proline has a large amount of variance, let's log normalize it
# Print out the variance of the Proline column
print(wine['Proline'].var())
#99166.71735542436

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())
#0.17231366191842012

#### Scaling Data
- useful when features are on different scales and you are using a linear model
- centers features around 0 and transforms them to unit variance (approximates a normal dist.)
- required for many models in scikit-learn

- a fitted StandardScaler object can apply the same transformation to new data without rescaling everything
        #create object
        scaler = StandardScaler()
        #convert array to df 
        df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
        #print the variance for each column, should be equal
        df_scaled.var()
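
To illustrate reusing one fitted scaler on new data, here is a sketch with two invented DataFrames (`train` and `new` are hypothetical names). `fit_transform` learns the statistics once, and `transform` applies them again without refitting.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and a later batch of new rows
train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
new = pd.DataFrame({"a": [2.5], "b": [25.0]})

scaler = StandardScaler()

# fit_transform learns each column's mean and standard deviation from train
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# transform reuses those learned statistics on new data, without refitting
new_scaled = pd.DataFrame(scaler.transform(new), columns=new.columns)

# ddof=0 matches the population variance that StandardScaler normalizes to 1
print(train_scaled.var(ddof=0))
print(new_scaled)
```

pandas' `var()` defaults to the sample variance (`ddof=1`), so on a small dataset the printed values are equal across columns but slightly above 1.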

In [None]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash', "Alcalinity of ash", 'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)

#### Standardized data and modeling
KNN - classifies a data point based on the class that the majority of its surrounding data points belong to

do the preprocessing and the train/test split first

knn.score(X_test, y_test) to evaluate model
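
A self-contained sketch of the comparison, using scikit-learn's bundled wine data as a stand-in (an assumption: it may not be identical to the course's `wine` DataFrame). It also swaps in a pipeline, which is not in the notes: the pipeline fits the scaler on the training split only, so nothing from the test split influences the scaling.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled wine data, assumed to be similar to the course dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# KNN on the raw, unscaled features
knn_raw = KNeighborsClassifier()
knn_raw.fit(X_train, y_train)
raw_score = knn_raw.score(X_test, y_test)

# The pipeline fits the scaler on the training data, then applies it to the test data
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_scaled.fit(X_train, y_train)
scaled_score = knn_scaled.score(X_test, y_test)

print(raw_score, scaled_score)
```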

In [None]:
#try KNN WITHOUT standardizing data
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

#OUTPUT = 0.6444444444444445
#low score without standardization


#KNN on scaled data
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))

#OUTPUT = 0.9555555555555556 
#much higher accuracy

### Chapter 3
#### Feature engineering

creation of new features based on existing ones
- useful for prediction/clustering
- insight into relationships between features

in addition to preprocessing, you will likely need to extract and expand data
will cover manual methods (automated ways exist as well)

dataset dependent, must have good knowledge of dataset
examples:
- text data for natural language processing
- string data, encode into numerical data
- timestamps, reducing from seconds to day or month

#### Encoding categorical variables

- Encoding binary variables - Pandas .apply()
        df['new_col'] = df['col'].apply(lambda val: 1 if val == 'yes' else 0)

- Encoding binary variables - scikit-learn
        from sklearn.preprocessing import LabelEncoder
        le = LabelEncoder()
        df['new_col'] = le.fit_transform(df['col'])

Multiple categories to encode - One-hot encoding
- transforms each unique value into its own column of 0s and 1s
        pd.get_dummies(df['col'])
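
The three encodings side by side, as a sketch on invented data (`accessible` and `category` are hypothetical column names).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data
df = pd.DataFrame({
    "accessible": ["yes", "no", "no", "yes"],
    "category": ["Health", "Education", "Health", "Environment"],
})

# Binary encoding with pandas
df["accessible_apply"] = df["accessible"].apply(lambda val: 1 if val == "yes" else 0)

# Binary encoding with scikit-learn; classes are numbered in sorted order
le = LabelEncoder()
df["accessible_le"] = le.fit_transform(df["accessible"])

# One-hot encoding: one column per unique value
category_enc = pd.get_dummies(df["category"])

print(df[["accessible", "accessible_apply", "accessible_le"]])
print(category_enc)
```

Recent pandas versions print `True`/`False` in the dummy columns; passing `dtype=int` to `get_dummies` gives 0/1 instead.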

In [None]:
#Encoding categorical variables - binary
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())


#Encoding categorical variables - one-hot
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])

# Take a look at the encoded columns
print(category_enc.head())

OUTPUT:
   Education  Emergency Preparedness  Environment  Health  Helping Neighbors in Need  Strengthening Communities
0          0                       0            0       0                          0                          0
1          0                       0            0       0                          0                          1
2          0                       0            0       0                          0                          1
3          0                       0            0       0                          0                          1
4          0                       0            1       0                          0                          0

#### Engineering numerical features
can take an aggregate of a set of numbers to replace many values that are close together in time

- EX: daily rainfall to get weekly average
        df['mean'] = df.apply(lambda row: row[columns].mean(), axis=1)

Dates
- be sure the column is converted to datetime
        df['date_converted'] = pd.to_datetime(df['date'])
- extract the month from the datetime column
        df['month'] = df['date_converted'].apply(lambda row: row.month)
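
The same two features can also be built without `apply`. This is a sketch of an alternative, not the course's method: it uses `mean(axis=1)` on the selected columns and the `.dt` accessor on the datetime column. The data is invented.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "date": ["2011-07-30", "2011-02-01"],
})

run_columns = ["run1", "run2", "run3"]

# Row-wise mean: select the columns, then average across axis=1
df["mean"] = df[run_columns].mean(axis=1)

# Convert to datetime, then read the month through the .dt accessor
df["date_converted"] = pd.to_datetime(df["date"])
df["month"] = df["date_converted"].dt.month

print(df[["name", "mean", "month"]])
```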

In [None]:
#Engineering numerical features - taking an average
#For each name in the dataset, take the mean of their 5 run times.

print(running_times_5k)
OUTPUT
      name  run1  run2  run3  run4  run5
0      Sue  20.1  18.5  19.6  20.3  18.3
1     Mark  16.5  17.1  16.9  17.6  17.3
2     Sean  23.5  25.1  25.2  24.6  23.9
3     Erin  21.7  21.1  20.9  22.1  22.2
4    Jenny  25.8  27.1  26.1  26.7  26.9
5  Russell  30.9  29.6  31.4  30.4  29.9

# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results (print the whole DataFrame to see the new column next to the run times)
print(running_times_5k)

OUTPUT
      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44

In [None]:
#Engineering numerical features - datetime
#look at the start_date_date column and extract just the month

# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer['start_date_date'])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer['start_date_converted'].apply(lambda row: row.month)

# Take a look at the converted and new month columns
print(volunteer[['start_date_converted', 'start_date_month']].head())

OUTPUT
  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2

#### Engineering features from text - text classification

extract pieces you need, a string or number

transform text itself into features - language processing, prediction

- extraction from text with re.compile()
        import re
        my_string = "temperature:75.6 F"
        pattern = re.compile(r"\d+\.\d+")
        # \d+ matches one or more digits before the decimal point
        # \.  matches the decimal point
        # \d+ matches one or more digits after the decimal point
        # search the whole string for the pattern (re.match only matches at the start, so it returns None here)
        temp = re.search(pattern, my_string)
        # extract the matched text with group(0)
        print(float(temp.group(0)))
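
One detail worth checking when extracting numbers this way is where the number sits in the string. The sketch below uses only the standard library and contrasts `match`, which is anchored at the start of the string, with `search`, which scans all of it.

```python
import re

my_string = "temperature:75.6 F"
pattern = re.compile(r"\d+\.\d+")

# match only looks at the start of the string, so it finds nothing here
at_start = pattern.match(my_string)

# search scans the whole string
anywhere = pattern.search(my_string)

# Guard against None before calling group()
temperature = float(anywhere.group(0)) if anywhere is not None else None
print(at_start, temperature)
```

`match` is fine when the number is the first thing in the string, as in values like "0.8 miles".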
    
vectorizing text - encode numerically with tfidf vector
- ex: using document text for vectorization
- reflects the importance of a term in a document, beyond just its raw frequency
- tf = term frequency
- idf = inverse document frequency

        from sklearn.feature_extraction.text import TfidfVectorizer
        tfidf_vec = TfidfVectorizer()
        text_tfidf = tfidf_vec.fit_transform(df['text_col'])  # pass the text column, not the whole DataFrame

Text classification - Naive Bayes Classifier
- treats each feature as independent of others, works well for text and high dimensional data
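
A minimal end-to-end sketch of tf-idf vectorization followed by a Naive Bayes classifier. The documents and labels are invented. `MultinomialNB` is an assumption: the notes do not say which Naive Bayes variant `nb` is, and this one accepts the sparse tf-idf matrix directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus and labels
docs = [
    "plant trees in the park",
    "clean up the river and park",
    "tutor students in math",
    "read books to students",
]
labels = ["Environment", "Environment", "Education", "Education"]

# Sparse matrix with one row per document and one column per vocabulary word
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# MultinomialNB is an assumed choice of Naive Bayes variant
nb = MultinomialNB()
nb.fit(text_tfidf, labels)

# New text must go through the already fitted vectorizer (transform, not fit_transform)
new_doc = tfidf_vec.transform(["plant flowers in the park"])
print(text_tfidf.shape)
print(nb.predict(new_doc))
```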


In [None]:
#Engineering features from strings - extraction
#The Length column in the hiking dataset is a column of strings, 
#but contained in the column is the mileage for the hike
print(hiking['Length'].head())
OUTPUT 
0     0.8 miles
1      1.0 mile
2    0.75 miles
3     0.5 miles
4     0.5 miles

# Write a pattern to extract numbers and decimals
def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")
    
    # Search the text for matches
    mile = re.match(pattern, length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

OUTPUT
           Length  Length_num
    0   0.8 miles        0.80
    1    1.0 mile        1.00
    2  0.75 miles        0.75
    3   0.5 miles        0.50
    4   0.5 miles        0.50

In [None]:
#Engineering features from strings - tf/idf
# Take the title text
title_text = volunteer['title']

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

#Use the tfidf_vec vectorizer's fit_transform() function on title_text 
# to transform the text into a tf-idf vector.
text_tfidf = tfidf_vec.fit_transform(title_text)


#Text classification using tf/idf vectors
#Using train_test_split, split the text_tfidf vector, 
#along with your y variable, into training and test sets. 
#Set the stratify parameter equal to y, since the class distribution is uneven. 
#Notice that we have to run the toarray() method on the tf/idf vector, 
#in order to get it into the proper format for scikit-learn.
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

#Use Naive Bayes' fit() method on the X_train and y_train variables.
# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

OUTPUT = 0.567741935483871
#more feature selection later to increase score

### Chapter 4
#### Feature selection

select features from existing feature set to improve model performance
- automated feature selection exists, covering manual selection
- Examples:
    - redundant features adding noise, e.g. city/state & lat/long
    - some features may be strongly correlated, breaking independent variable assumption of some models
    - text vectors, use tfidf 
    - large feature set, use dimensionality reduction to reduce overall variance
 
Removing redundant features
- remove noisy features
- remove highly correlated features
- remove duplicate features

Correlated features
- statistically correlated - features move together directionally
- linear models assume feature independence
- Pearson correlation coefficient - measure of this directionality, range is -1 to 1
    - df.corr()
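
Reading a correlation matrix by eye gets hard with many columns. As a sketch, the snippet below lists the feature pairs above a cut-off. The data is invented, and the 0.75 threshold is an assumption borrowed from the wine exercise further down.

```python
import pandas as pd

# Hypothetical data: b is an exact multiple of a, c is only weakly related
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Pearson correlation is the default method
corr = df.corr()
print(corr)

# List the feature pairs whose absolute correlation exceeds the threshold
threshold = 0.75  # assumed cut-off
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```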


In [None]:
#if you explore the volunteer dataset in the console, you'll see three features 
#which are related to location: locality, region, and postalcode
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
print(volunteer_subset.head())

In [None]:
#checking for correlated features
# Print out the column correlations of the wine dataset
print(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)

#### Selecting features using text vectors
don't necessarily need all words to train a model, could take the top 20% of weighted words to train

test subset of tfidf to find out what works
- how do we pull out words and weights to compare

after vectorizing text, vocabulary and weights are stored in the vectorizer
- pull the vocab dict (word -> column index) to look at it, underscore intentional
        print(tfidf_vec.vocabulary_)
- get the tf-idf weights of the row at index 3 (rows are 0-indexed, so this is the fourth document)
        print(text_tfidf[3].data)
- get the column indices of the words that have been weighted in that row
        print(text_tfidf[3].indices)
- reverse the key, value pairs so the dict maps column index -> word
        vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
- zip together the row indices and weights and turn them into a dictionary
        zipped_rows = dict(zip(text_tfidf[3].indices, text_tfidf[3].data))
- can do this all in a function
        def return_weights(vocab, vector, vector_index):
            zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
            return {vocab[i]:zipped[i] for i in vector[vector_index].indices}
- call function
        print(return_weights(vocab, text_tfidf, 3))
- sort by score or eliminate words below a certain threshold
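
The steps above, put together on a three-document corpus so the output is small enough to inspect. The documents are invented; the function follows the `return_weights` idea from the notes, with a final sort by weight added.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["web design help", "urban garden help", "garden cleanup day"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps word -> column index; reverse it to map column index -> word
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}

def return_weights(vocab, vector, vector_index):
    """Return {word: tf-idf weight} for one row of the tf-idf matrix."""
    row = vector[vector_index]
    zipped = dict(zip(row.indices, row.data))
    return {vocab[i]: zipped[i] for i in row.indices}

weights = return_weights(vocab, text_tfidf, 1)

# Sort by weight, highest first
for word, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(weight, 3))
```

Words that appear in fewer documents ("urban") get a higher weight than words shared across documents ("garden", "help").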

In [None]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))


#Using the function we wrote in the previous exercise, 
#we're going to extract the top words from each document in the text vector, 
#return a list of the word indices, 
#and use that list to filter the text vector down to those top words.

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

In [None]:
#Training Naive Bayes with feature selection
#Use train_test_split on the filtered_text text vector, the y labels (which is the category_desc labels), 
#and pass the y set to the stratify parameter, since we have an uneven class distribution.
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X,train_y)

# Print out the model's accuracy
print(nb.score(test_X,test_y))

#### Dimensionality reduction
 - another way to reduce size of feature set
 - unsupervised learning method
 - combines/decomposes a feature space
 - feature extraction - used here to reduce feature space
 - PCA - principal component analysis
 - linear transformation to uncorrelated space
 - captures as much variance as possible in each component
 - useful for large number of features and no easy way to eliminate
 
 PCA
        from sklearn.decomposition import PCA
        pca = PCA()
        df_pca = pca.fit_transform(df)
        print(df_pca)
        # percentage of variance explained by each component
        print(pca.explained_variance_ratio_)
- can be difficult to interpret components beyond the ones with highest explained variance
- black box method
- do PCA at the end of preprocessing, because of the way the data gets transformed and reshaped
- after PCA, mostly good for eliminating components that explain little variance
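
A sketch of using `explained_variance_ratio_` to decide how many components to keep. The data is random and hypothetical, and the 95% cut-off is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 rows, 3 features, the third nearly a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])

pca = PCA()
X_pca = pca.fit_transform(X)

# Ratios are sorted from largest to smallest and sum to 1 when all components are kept
ratios = pca.explained_variance_ratio_
print(ratios)

# Keep just enough components to explain 95% of the variance (assumed cut-off)
n_keep = int(np.searchsorted(np.cumsum(ratios), 0.95) + 1)
X_reduced = X_pca[:, :n_keep]
print(X_reduced.shape)
```

Because the third feature is almost redundant, two components carry nearly all of the variance here.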

In [None]:
#Using PCA
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)
OUTPUT:
 [9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]


#Training a model with PCA
# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)

OUTPUT: 0.7555555555555555

### Chapter 5
#### Putting it All Together

In [None]:
# Check the column types, look at the data
print(ufo.dtypes)

# Change the type of seconds to float
ufo["seconds"] = ufo["seconds"].astype(float)

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo["date"])

# Check the column types
print(ufo[["seconds", "date"]].dtypes)

In [None]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[["length_of_time", "state", "type"]].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo["length_of_time"].notnull() & 
          ufo["state"].notnull() & 
          ufo["type"].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

#### Categorical variables and Standardization
- one hot encode with:
        pd.get_dummies()
- check variance then standardize with:
        var()  
        np.log()

In [None]:
#Extracting numbers from strings
def return_minutes(time_string):
    
    #use \d+ to grab digits and match it to the column values
    pattern = re.compile(r"\d+")
        
    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[["length_of_time", "minutes"]].head())

In [None]:
#Identifying features for standardization
#investigate the variance of columns in the UFO dataset 
#to determine which features should be standardized
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo['seconds_log'].var())

#### Engineering new features
- extract month from date
        .month or .hour
- extract minutes value from string
        regex (regular expressions, \d)
        .group() to return results
- vectorize text in description
        tf-idf and TfidfVectorizer

In [None]:
#Encoding categorical variables
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x=='us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

In [None]:
#extracting features from dates
# Look at the first 5 rows of the date column
print(ufo['date'].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda row: row.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda row: row.year)

# Take a look at the head of all three columns
print(ufo[['date', 'month', 'year']].head())

OUTPUT 
                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013

In [None]:
#Text vectorization
# Take a look at the head of the desc field
print(ufo['desc'].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns this creates
print(desc_tfidf.shape)

OUTPUT: (1866, 3422)

#### Feature selection and modeling
- eliminate redundant features
        in original form
        due to feature engineering
- inspect text vector and eliminate words
- preprocessing is an iterative practice; play around to discover the best model for your needs

In [None]:
# selecting the ideal dataset

# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds', 'seconds_log', 'minutes']].corr())

# Make a list of features to drop
to_drop = ['city', 'country', 'date', 'desc', 'lat', 'length_of_time', 'long', 'minutes', 'recorded', 'seconds', 'state']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

In [None]:
#Modeling the UFO dataset
#build a k-nearest neighbor model to predict which country 
#the UFO sighting took place in.

# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X,y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

OUTPUT: 0.8758029978586723
    
#fair performance for predicting location

In [None]:
# build a model using the text vector we created, desc_tfidf, 
#using the filtered_words list to create a filtered text vector. 
#predict the type of the sighting based on the text.
#use a Naive Bayes model for this.

# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
nb.score(test_X, test_y)

OUTPUT: 0.16059957173447537
    #poor performance on text data
    #next step would be to iterate through text data to improve performance