# Preprocessing for Machine Learning in Python

1. Intro to Data Preprocessing 
2. Standardizing Data
3. Feature Engineering
4. Selecting Features for Modeling
5. Putting it All Together

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
import re
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA

In [24]:
volunteer = pd.read_csv("volunteer_opportunities.csv")

In [13]:
volunteer.head()

Unnamed: 0,opportunity_id,content_id,vol_requests,event_time,title,hits,summary,is_priority,category_id,category_desc,...,end_date_date,status,Latitude,Longitude,Community Board,Community Council,Census Tract,BIN,BBL,NTA
0,4996,37004,50,0,Volunteers Needed For Rise Up & Stay Put! Home...,737,Building on successful events last summer and ...,,,,...,July 30 2011,approved,,,,,,,,
1,5008,37036,2,0,Web designer,22,Build a website for an Afghan business,,1.0,Strengthening Communities,...,February 01 2011,approved,,,,,,,,
2,5016,37143,20,0,Urban Adventures - Ice Skating at Lasker Rink,62,Please join us and the students from Mott Hall...,,1.0,Strengthening Communities,...,January 29 2011,approved,,,,,,,,
3,5022,37237,500,0,Fight global hunger and support women farmers ...,14,The Oxfam Action Corps is a group of dedicated...,,1.0,Strengthening Communities,...,March 31 2012,approved,,,,,,,,
4,5055,37425,15,0,Stop 'N' Swap,31,Stop 'N' Swap reduces NYC's waste by finding n...,,4.0,Environment,...,February 05 2011,approved,,,,,,,,


## Data Preprocessing

In [15]:
# dropping columns and rows

volunteer_cols = volunteer.drop(["Latitude", "Longitude"], axis=1)

volunteer_subset = volunteer_cols.dropna(subset=["category_desc"])

volunteer_subset.shape

(617, 33)

In [16]:
volunteer.dtypes

opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL       

In [24]:
volunteer.shape

(665, 35)

In [25]:
volunteer[~volunteer["category_desc"].isnull()].shape

(617, 35)

In [32]:
volunteer['category_desc'].value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64

In [30]:
volunteer_target_filt = volunteer[~volunteer["category_desc"].isnull()]

In [31]:
X = volunteer_target_filt.drop("category_desc", axis=1)

y = volunteer_target_filt[["category_desc"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

y_train["category_desc"].value_counts()

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64

# Standardization

## What is Standardization?

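Standardization rescales continuous features so that each one has a mean of 0 and a variance of 1. It puts features on a common scale but does not change the shape of a feature's distribution, so a skewed feature stays skewed.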

## When to Standardize?

Standardization is recommended in the following scenarios:

- **Models in Linear Space:** Standardization helps models that operate in a linear space or rely on distance computations, such as k-Nearest Neighbors (kNN), Linear Regression (LR), and K-Means Clustering.

- **High Variance:** Standardization can be useful when dealing with features that have high variance. This helps to bring the features to a similar scale, preventing those with larger variances from dominating the model.

- **Features on Different Scales:** When your dataset contains features measured in different units or with varying magnitudes, standardization is important. For example, when predicting house prices based on the number of bedrooms and the last selling price, standardization ensures that both features contribute proportionally.
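To make the house-price example concrete, here is a minimal sketch with made-up numbers (the three houses and the `euclidean` helper are assumptions for illustration, not part of the datasets used in this notebook). It compares Euclidean distances before and after scaling to show how the price column swamps the bedroom column when left unscaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [bedrooms, last selling price in dollars]
houses = np.array([
    [2, 300_000.0],
    [5, 301_000.0],
    [2, 350_000.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: house 1 differs by 3 bedrooms and $1,000,
# house 2 differs by 0 bedrooms and $50,000
raw_d01 = euclidean(houses[0], houses[1])
raw_d02 = euclidean(houses[0], houses[2])

# After standardization both columns are measured in standard deviations
scaled = StandardScaler().fit_transform(houses)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d01={raw_d01:.1f}, d02={raw_d02:.1f}")
print(f"scaled: d01={scaled_d01:.3f}, d02={scaled_d02:.3f}")
```

On the raw data the bedroom difference is almost invisible next to the price difference. After scaling, the two distances are of similar size, so both features influence a distance-based model such as kNN.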

In [57]:
wine_df = pd.read_csv("wine_types.csv")

In [35]:
wine_df.head()

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [36]:
X = wine_df.drop("Type", axis=1)
y = wine_df[["Type"]]

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()

knn.fit(X_train, y_train)

knn.score(X_test, y_test)

0.7777777777777778

# Log Normalization

## When is Log Normalization Useful?

Log normalization is particularly useful in the following scenario:

- **For Features with High Variance:** Log normalization is beneficial when dealing with features that exhibit high variance. Taking the logarithm of these features can help mitigate the impact of extreme values, making the distribution more manageable and suitable for certain modeling techniques.
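As a small illustration of the effect, the sketch below applies `np.log` to a made-up, strongly right-skewed array (the values are an assumption chosen for illustration). It also shows `np.log1p`, which computes log(1 + x) and is one common way to handle zeros, since `np.log(0)` is `-inf`. The cells in this notebook use plain `np.log`; `np.log1p` is shown only as an alternative.

```python
import numpy as np

# Hypothetical right-skewed feature spanning several orders of magnitude
values = np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])

raw_var = values.var()
log_var = np.log(values).var()
print(f"variance before: {raw_var:.1f}, after log: {log_var:.2f}")

# np.log is undefined at zero; log1p maps 0 to 0 and stays finite
with_zero = np.array([0.0, 9.0, 99.0])
safe_log = np.log1p(with_zero)
print(safe_log)
```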

In [42]:
print(wine_df['Proline'].var())

wine_df['Proline_log'] = np.log(wine_df['Proline'])

print(wine_df['Proline_log'].var())

99166.71735542436
0.17231366191842012


# Feature Scaling

## When is Feature Scaling Useful?

Feature scaling is particularly useful in the following scenarios:

1. **For Features on Different Scales:** When the features of your dataset are measured in different units or have varying magnitudes, feature scaling becomes important to bring them to a comparable scale.

2. **For Models with Linear Characteristics:** Feature scaling is beneficial when working with models that rely on linear relationships, as it can help prevent certain features from dominating the others due to their scale.

## What Feature Scaling Does


1. **Centers Features around 0 and Transforms to a Variance of 1:** Each feature is shifted and rescaled so that it has a mean of 0 and a standard deviation of 1. This makes features measured in different units directly comparable.

2. **Leaves the Shape of the Distribution Unchanged:** Scaling is a linear transformation, so on its own it does not make a feature normally distributed. If a method assumes normality and a feature is skewed, apply a separate transformation, such as the log normalization above, before scaling.
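The centering and rescaling described in item 1 is the z-score, z = (x - mean) / std, applied column by column. The sketch below checks this by hand against `StandardScaler` on a small made-up array (the values are an assumption for illustration). Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# Manual z-score per column, using the population standard deviation
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```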

In [45]:
scaler = StandardScaler()

wine_subset = wine_df[["Ash", "Alcalinity of ash", "Magnesium"]]

wine_subset_scaled = scaler.fit_transform(wine_subset)

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)

knn.score(X_test_scaled, y_test)

0.9333333333333333

### Summary: 
Without standardization, kNN scored about 0.78 on the test set. After standardizing the features, the score rose to about 0.93. The unscaled data most likely let large-magnitude columns such as Proline dominate the distance calculation.
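The same comparison can be written more compactly with a scikit-learn pipeline, which fits the scaler on the training split only and reuses it on the test split. This is a sketch under one assumption: it uses scikit-learn's bundled `load_wine` dataset rather than the `wine_types.csv` file read earlier, so the exact scores may differ from the ones above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine dataset stands in for wine_types.csv
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: kNN on the raw features
raw_knn = KNeighborsClassifier().fit(X_train, y_train)

# Pipeline: the scaler is fit on X_train only, then applied before kNN
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)

raw_score = raw_knn.score(X_test, y_test)
scaled_score = scaled_knn.score(X_test, y_test)
print(f"raw: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```

Bundling the scaler with the model avoids accidentally fitting the scaler on test data, which would leak information from the test set into training.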

# Feature Engineering

## What is Feature Engineering?

Feature engineering involves the creation of new features from existing ones in a dataset.

## Why Use Feature Engineering?

Feature engineering is employed for various reasons:

- **Improve Performance:** Creating new features can enhance the performance of machine learning models by providing them with more relevant and discriminative information.

- **Insights into Relationships between Features:** Feature engineering allows for a deeper understanding of the relationships between different features, enabling the identification of patterns and dependencies in the data.

- **Requires Data Understanding:** The process of feature engineering necessitates a thorough understanding of the dataset, including domain-specific knowledge. This understanding is crucial for designing meaningful and effective new features.

- **Dataset-Dependent:** Feature engineering is tailored to the specific characteristics of the dataset. Different datasets may require different types of engineered features based on their unique properties.

In [4]:
hiking = pd.read_json("hiking.json")

In [5]:
hiking.head()

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,


### Encoding categorical variables

In [7]:
enc = LabelEncoder()

hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

hiking[["Accessible_enc", "Accessible"]].head()

Unnamed: 0,Accessible_enc,Accessible
0,1,Y
1,0,N
2,0,N
3,0,N
4,0,N


In [8]:
category_enc = pd.get_dummies(volunteer["category_desc"])

category_enc.head()

Unnamed: 0,Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,0,0,0,0,0,0
1,0,0,0,0,0,1
2,0,0,0,0,0,1
3,0,0,0,0,0,1
4,0,0,1,0,0,0


### Engineering numerical features

In [9]:
data = {
    'name': ['Sue', 'Mark', 'Sean', 'Erin', 'Jenny', 'Russell'],
    'run1': [20.1, 16.5, 23.5, 21.7, 25.8, 30.9],
    'run2': [18.5, 17.1, 25.1, 21.1, 27.1, 29.6],
    'run3': [19.6, 16.9, 25.2, 20.9, 26.1, 31.4],
    'run4': [20.3, 17.6, 24.6, 22.1, 26.7, 30.4],
    'run5': [18.3, 17.3, 23.9, 22.2, 26.9, 29.9]
}

running_times_5k = pd.DataFrame(data)

In [10]:
running_times_5k["mean"] = running_times_5k.loc[:, "run1":"run5"].mean(axis=1)

running_times_5k.head()

Unnamed: 0,name,run1,run2,run3,run4,run5,mean
0,Sue,20.1,18.5,19.6,20.3,18.3,19.36
1,Mark,16.5,17.1,16.9,17.6,17.3,17.08
2,Sean,23.5,25.1,25.2,24.6,23.9,24.46
3,Erin,21.7,21.1,20.9,22.1,22.2,21.6
4,Jenny,25.8,27.1,26.1,26.7,26.9,26.52


In [11]:
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

volunteer["start_date_month"] = volunteer["start_date_converted"].dt.month

volunteer[["start_date_converted", "start_date_month"]].head()

Unnamed: 0,start_date_converted,start_date_month
0,2011-07-30,7
1,2011-02-01,2
2,2011-01-29,1
3,2011-02-14,2
4,2011-02-05,2


### Engineering text features

In [15]:
hiking['Length'] = hiking["Length"].astype(str)

In [16]:
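def return_mileage(length):
    # Extract the first decimal number, e.g. "0.75" from "0.75 miles"
    mile = re.search(r"\d+\.\d+", length)

    # Rows without a decimal number fall through and return None
    if mile is not None:
        return float(mile.group(0))
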
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
hiking[["Length", "Length_num"]].head()

Unnamed: 0,Length,Length_num
0,0.8 miles,0.8
1,1.0 mile,1.0
2,0.75 miles,0.75
3,0.5 miles,0.5
4,0.5 miles,0.5


In [31]:
volunteer['category_desc'] = volunteer['category_desc'].fillna("Null")

In [32]:
title_text = volunteer["title"]

tfidf_vec = TfidfVectorizer()

text_tfidf = tfidf_vec.fit_transform(title_text)

In [35]:
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

nb = GaussianNB() 

nb.fit(X_train, y_train)

nb.score(X_test, y_test)

0.4251497005988024

# Feature selection

### Which features are redundant?
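1. A column that has gone through feature engineering, such as encoding, is redundant once the engineered version exists.
2. A column that has been normalized or scaled makes its unscaled original redundant.
3. Features that are strongly correlated carry overlapping information, and some models (Naive Bayes, for example) assume that features are independent.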
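One way to spot the third kind of redundancy is to threshold the absolute correlation matrix. The sketch below is an illustration on made-up data: the `correlated_columns` helper, the toy DataFrame, and the 0.9 threshold are all assumptions, not part of this notebook's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is almost exactly 2 * "a", "c" is unrelated
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

def correlated_columns(frame, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

to_drop = correlated_columns(df)
reduced = df.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

Of each highly correlated pair, this keeps the column that appears first and drops the later one. Which member of a pair to keep is a judgment call that depends on the dataset.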

### Removing redundant features

In [None]:
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

volunteer_subset = volunteer.drop(to_drop, axis=1)

volunteer_subset.head()

In [58]:
print(wine_df.corr())

wine_df = wine_df.drop(columns=["Flavanoids"])

wine_df.head()

                                  Type   Alcohol  Malic acid       Ash  \
Type                          1.000000 -0.328222    0.437776 -0.049643   
Alcohol                      -0.328222  1.000000    0.094397  0.211545   
Malic acid                    0.437776  0.094397    1.000000  0.164045   
Ash                          -0.049643  0.211545    0.164045  1.000000   
Alcalinity of ash             0.517859 -0.310235    0.288500  0.443367   
Magnesium                    -0.209179  0.270798   -0.054575  0.286587   
Total phenols                -0.719163  0.289101   -0.335167  0.128980   
Flavanoids                   -0.847498  0.236815   -0.411007  0.115077   
Nonflavanoid phenols          0.489109 -0.155929    0.292977  0.186230   
Proanthocyanins              -0.499130  0.136698   -0.220746  0.009652   
Color intensity               0.265668  0.546364    0.248985  0.258887   
Hue                          -0.617369 -0.071747   -0.561296 -0.074667   
OD280/OD315 of diluted wines -0.788230

Unnamed: 0,Type,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,0.39,1.82,4.32,1.04,2.93,735


### Selecting features using text vectors

In [None]:
vocab = {index: term for term, index in tfidf_vec.vocabulary_.items()}

In [46]:
text_tfidf

<665x1136 sparse matrix of type '<class 'numpy.float64'>'
	with 3397 stored elements in Compressed Sparse Row format>

In [24]:
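def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Map each column index in this row of the tf-idf matrix to its weight
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))

    # Re-key the weights by term so they can be sorted and looked up by word
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})

    # Keep the top_n highest-weighted terms and return their column indices
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]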

return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3)

In [19]:
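def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    # Collect the top_n term indices from every document (row) in the matrix
    for i in range(0, vector.shape[0]):
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)

    # A set removes indices that were selected for more than one document
    return set(filter_list)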

In [None]:
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

filtered_text = text_tfidf[:, list(filtered_words)]

In [50]:
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

nb.fit(X_train, y_train)

nb.score(X_test, y_test)

0.4311377245508982

### Dimensionality reduction

In [54]:
pca = PCA()

X = wine_df.drop(columns=["Type"])
y = wine_df["Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)

pca.explained_variance_ratio_

array([9.97802349e-01, 2.02071713e-03, 9.82348559e-05, 5.53994004e-05,
       1.10395648e-05, 5.87233448e-06, 3.13858204e-06, 1.54420449e-06,
       1.02927386e-06, 3.90521513e-07, 1.95535151e-07, 8.99659634e-08])

In [56]:
knn = KNeighborsClassifier()

knn.fit(pca_X_train, y_train)

knn.score(pca_X_test, y_test)

0.7777777777777778

# Putting it all together

In [2]:
ufo = pd.read_csv("ufo_sightings_large.csv")

### Checking column types

In [3]:
print(ufo.info())

ufo["seconds"] = ufo["seconds"].astype("float")

ufo["date"] = pd.to_datetime(ufo["date"])

print(ufo.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            4935 non-null   object 
 1   city            4926 non-null   object 
 2   state           4516 non-null   object 
 3   country         4255 non-null   object 
 4   type            4776 non-null   object 
 5   seconds         4935 non-null   float64
 6   length_of_time  4792 non-null   object 
 7   desc            4932 non-null   object 
 8   recorded        4935 non-null   object 
 9   lat             4935 non-null   object 
 10  long            4935 non-null   float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            4935 non-null   d

### Dropping missing data

In [5]:
print(ufo[["length_of_time", "state", "type"]].isna().sum())
print(ufo.shape)

ufo_no_missing = ufo.dropna(subset=["length_of_time", "state", "type"])

print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4935, 11)
(4283, 11)


### Categorical variables and standardization

In [7]:
ufo["length_of_time"] = ufo["length_of_time"].astype(str)

def return_minutes(time_string):
    # Extract the first run of digits; the unit (weeks, sec, minutes) is ignored
    num = re.search(r"\d+", time_string)
    if num is not None:
        return int(num.group(0))

ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

print(ufo[["minutes", "length_of_time"]].head())

   minutes   length_of_time
0      2.0          2 weeks
1     30.0           30sec.
2      NaN              nan
3      5.0  about 5 minutes
4      2.0                2


In [8]:
ufo.var(numeric_only=True)

seconds    3.156735e+10
long       1.824025e+03
minutes    8.425929e+02
dtype: float64

In [9]:
print(ufo[["seconds", "minutes"]].var())

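# seconds contains zeros, so np.log returns -inf for those rows and the
# variance printed below is nan; the infinities are replaced before modeling
ufo["seconds_log"] = np.log(ufo["seconds"])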

print(ufo['seconds_log'].var())

seconds    3.156735e+10
minutes    8.425929e+02
dtype: float64
nan


### Engineering features

In [12]:
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x == "us" else 0)

print(len(ufo["type"].unique()))

type_set = pd.get_dummies(ufo["type"])

ufo = pd.concat([ufo, type_set], axis=1)

22


In [13]:
ufo.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,...,0,0,0,0,0,0,0,0,0,1
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,...,0,0,0,0,0,0,0,0,0,0
2,2009-09-25 21:00:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,...,0,0,0,0,0,0,0,0,0,0
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,...,0,0,0,0,0,0,0,0,1,0
4,2010-08-19 12:55:00,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,...,0,0,0,0,1,0,0,0,0,0


In [14]:
print(ufo["date"].head())

ufo["month"] = ufo["date"].dt.month

ufo["year"] = ufo["date"].dt.year

print(ufo[["date", "month", "year"]].head())

0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
2   2009-09-25 21:00:00
3   2002-11-21 05:45:00
4   2010-08-19 12:55:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2011-11-03 19:21:00     11  2011
1 2004-10-03 19:05:00     10  2004
2 2009-09-25 21:00:00      9  2009
3 2002-11-21 05:45:00     11  2002
4 2010-08-19 12:55:00      8  2010


In [16]:
ufo["desc"] = ufo["desc"].fillna(value='Nan')

In [17]:
print(ufo['desc'].head())

vec = TfidfVectorizer()

desc_tfidf = vec.fit_transform(ufo["desc"])

print(desc_tfidf.shape)

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object
(4935, 6434)


### Feature selection and modeling

In [22]:
vocab = {index: term for term, index in vec.vocabulary_.items()}

In [25]:
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

ufo_dropped = ufo.drop(to_drop, axis=1)

filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

In [28]:
X = ufo_dropped.drop(columns=["type"])
y = ufo_dropped[["type"]]

In [41]:
X["seconds_log"] = X["seconds_log"].replace([np.inf, -np.inf], np.nan)
X["seconds_log"] = X["seconds_log"].fillna(value=X["seconds_log"].mean())

In [34]:
y = y.fillna(value='Unknown')

In [35]:
y.isna().sum()

type    0
dtype: int64

In [42]:
print(X.columns)

knn = KNeighborsClassifier()

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))

Index(['seconds_log', 'country_enc', 'changing', 'chevron', 'cigar', 'circle',
       'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball',
       'flash', 'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')


0.5915721231766613


In [44]:
filtered_text = desc_tfidf[:, list(filtered_words)]

X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

nb = GaussianNB()

nb.fit(X_train, y_train)

print(nb.score(X_test, y_test))

0.1239870340356564
