# Feature Selection

In this notebook we select the best features for training.
We rank the categorical features by their p-values from the chi-squared test (`chi2` from `sklearn.feature_selection`).

In [None]:
import pandas as pd 
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin 
from sklearn.preprocessing import LabelEncoder
#chi2 test 
from sklearn.feature_selection import chi2

df = pd.read_csv("Bicycle_Thefts.csv")

df.head()


label_encoder = LabelEncoder()
df['Status'] = label_encoder.fit_transform(df['Status'])

X = df.drop(columns=['Status'])
y = df['Status']

X.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25569 entries, 0 to 25568
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   X                      25569 non-null  float64
 1   Y                      25569 non-null  float64
 2   OBJECTID               25569 non-null  int64  
 3   event_unique_id        25569 non-null  object 
 4   Primary_Offence        25569 non-null  object 
 5   Occurrence_Date        25569 non-null  object 
 6   Occurrence_Year        25569 non-null  int64  
 7   Occurrence_Month       25569 non-null  object 
 8   Occurrence_DayOfWeek   25569 non-null  object 
 9   Occurrence_DayOfMonth  25569 non-null  int64  
 10  Occurrence_DayOfYear   25569 non-null  int64  
 11  Occurrence_Hour        25569 non-null  int64  
 12  Report_Date            25569 non-null  object 
 13  Report_Year            25569 non-null  int64  
 14  Report_Month           25569 non-null  object 
 15  Re



We will drop the following features:
1. X, because it is the same as Longitude
2. Y, because it is the same as Latitude
3. OBJECTID, because it is a unique ID
4. event_unique_id, because it is a unique ID
5. City, because almost every row has the same value: 25560 rows have 'Toronto' as City and only 9 rows have 'NSA'
6. Occurrence_Date, because we will use the date-part features instead
7. Occurrence_Year, because we assume the year will not influence the bike theft
8. Occurrence_DayOfMonth, because we assume the day of the month will not influence the bike theft
9. Occurrence_DayOfYear, because we assume the day of the year will not influence the bike theft
10. Report_Date, because we will use the date-part features instead
11. Report_Year, because we assume the year will not influence the bike theft
12. Report_DayOfWeek, because we will use Occurrence_DayOfWeek
13. ObjectId2, because it is a unique ID
14. Hood_ID, because it is an ID code (NeighbourhoodName is kept instead)
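
The reasons in the list above can be checked before dropping anything. The following is a minimal sketch on a made-up toy frame, not the real data: the values are invented, and the `Longitude` column is assumed to exist as the list implies. It flags ID-like columns, near-constant columns, and exact duplicates.

```python
import pandas as pd

# Toy frame standing in for the real data (all values are invented).
toy = pd.DataFrame({
    "OBJECTID": [1, 2, 3, 4, 5],
    "City": ["Toronto", "Toronto", "Toronto", "Toronto", "NSA"],
    "X": [-79.4, -79.3, -79.5, -79.4, -79.2],
    "Longitude": [-79.4, -79.3, -79.5, -79.4, -79.2],
    "Bike_Type": ["MT", "RG", "MT", "RC", "MT"],
})

n_rows = len(toy)

# A column with a distinct value in every row behaves like an identifier.
id_like = [c for c in toy.columns if toy[c].nunique() == n_rows]

# Share of rows taken by the most frequent value; close to 1.0 means near-constant.
top_share = pd.Series(
    {c: toy[c].value_counts(normalize=True).iloc[0] for c in toy.columns}
)

# Element-wise comparison confirms whether one column duplicates another.
x_is_duplicate = bool((toy["X"] == toy["Longitude"]).all())

print(id_like)
print(top_share)
print(x_is_duplicate)
```

On the real DataFrame the same three checks would be run against `df` instead of `toy`.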
                                

In [None]:
X = X.drop(columns=[
    'X', 'Y', 'OBJECTID', 'event_unique_id',
    'Occurrence_Date', 'Occurrence_Year',
    'Occurrence_DayOfMonth', 'Occurrence_DayOfYear',
    'Report_Date', 'Report_Year', 'Report_DayOfWeek',
    'ObjectId2', 'Hood_ID', 'City',
])

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25569 entries, 0 to 25568
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Primary_Offence       25569 non-null  object 
 1   Occurrence_Month      25569 non-null  object 
 2   Occurrence_DayOfWeek  25569 non-null  object 
 3   Occurrence_Hour       25569 non-null  int64  
 4   Report_Month          25569 non-null  object 
 5   Report_DayOfMonth     25569 non-null  int64  
 6   Report_DayOfYear      25569 non-null  int64  
 7   Report_Hour           25569 non-null  int64  
 8   Division              25569 non-null  object 
 9   NeighbourhoodName     25569 non-null  object 
 10  Location_Type         25569 non-null  object 
 11  Premises_Type         25569 non-null  object 
 12  Bike_Make             25448 non-null  object 
 13  Bike_Model            15923 non-null  object 
 14  Bike_Type             25569 non-null  object 
 15  Bike_Speed         

Let us make a list of categorical columns

In [None]:
# Categorical columns are the ones stored with the 'object' dtype.
cat_columns = X.select_dtypes(include='object').columns.tolist()

print(cat_columns)

X_cat = X[cat_columns]
X_cat.info()
X_cat.head()
X_cat.shape

['Primary_Offence', 'Occurrence_Month', 'Occurrence_DayOfWeek', 'Report_Month', 'Division', 'NeighbourhoodName', 'Location_Type', 'Premises_Type', 'Bike_Make', 'Bike_Model', 'Bike_Type', 'Bike_Colour']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25569 entries, 0 to 25568
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Primary_Offence       25569 non-null  object
 1   Occurrence_Month      25569 non-null  object
 2   Occurrence_DayOfWeek  25569 non-null  object
 3   Report_Month          25569 non-null  object
 4   Division              25569 non-null  object
 5   NeighbourhoodName     25569 non-null  object
 6   Location_Type         25569 non-null  object
 7   Premises_Type         25569 non-null  object
 8   Bike_Make             25448 non-null  object
 9   Bike_Model            15923 non-null  object
 10  Bike_Type             25569 non-null  object
 11  Bike_Colour           23508 non-nu

(25569, 12)
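
The non-null counts above show that Bike_Make, Bike_Model and Bike_Colour contain missing values. A small sketch of how the gaps could be quantified per column is given here; the toy frame and its values are invented for illustration.

```python
import pandas as pd

# Toy categorical frame with some missing entries (values are invented).
toy_cat = pd.DataFrame({
    "Bike_Make": ["TREK", "GIANT", None, "TREK"],
    "Bike_Model": [None, "ESCAPE", None, "FX"],
    "Bike_Type": ["MT", "RG", "MT", "RC"],
})

# Count and fraction of missing values per column.
summary = pd.DataFrame({
    "missing": toy_cat.isna().sum(),
    "share": toy_cat.isna().mean(),
})

# Keep only the columns that actually have gaps.
summary = summary[summary["missing"] > 0]
print(summary)
```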

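Let us build a custom transformer that keeps only the 5 most frequent categories in each categorical column and groups all remaining values under "Others" (at most 6 levels per column), and then encode the categorical columns.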

In [None]:
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    """Keep the most frequent categories of each column; relabel the rest as "Others"."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified
        # (this also avoids pandas' SettingWithCopyWarning).
        X = X[self.cols].copy()
        for col in self.cols:
            # The 5 most frequent values; everything else (including NaN) becomes "Others".
            top_values = X[col].value_counts().head(5).index
            X[col] = X[col].where(X[col].isin(top_values), "Others")
        return X

from sklearn.pipeline import Pipeline
cat_pipeline = Pipeline([
    ('cat_pipe', CategoricalTransformer(cols=cat_columns))
])

X_cat_trans = cat_pipeline.fit_transform(X_cat)
X_cat_trans.head()

# Encode each column separately; the encoder is refit for every column.
cat_encoder = LabelEncoder()
for i in X_cat_trans.columns:
    X_cat_trans[i] = cat_encoder.fit_transform(X_cat_trans[i])

X_cat_trans.head()


   Primary_Offence  Occurrence_Month  Occurrence_DayOfWeek  Report_Month  Division  NeighbourhoodName  Location_Type  Premises_Type  Bike_Make  Bike_Model  Bike_Type  Bike_Colour
0                4                 4                     4             4         5                  4              5              5          0           3          2            0
1                5                 4                     5             4         5                  4              4              2          3           3          3            0
2                5                 5                     0             5         5                  4              2              4          1           3          1            0
3                4                 3                     3             3         5                  4              2              4          3           3          3            3
4                4                 3                     2             3         5                  4              2              4          0           3          1            4
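
The integer codes above are arbitrary labels: `LabelEncoder` assigns them in sorted order of the category names. scikit-learn's `chi2` is documented for non-negative features such as counts or booleans, so its result on label codes depends on which integer each category happens to receive. As a hedged alternative (not the method used in this notebook), a chi-squared test of independence can be run on a contingency table of the raw categories. The sketch below uses invented toy data and `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for one categorical feature and the target (values are invented).
toy = pd.DataFrame({
    "Premises_Type": ["House", "House", "Outside", "Outside",
                      "Apartment", "Apartment", "House", "Outside"],
    "Status": ["STOLEN", "STOLEN", "RECOVERED", "STOLEN",
               "STOLEN", "RECOVERED", "STOLEN", "RECOVERED"],
})

# Contingency table: rows are feature categories, columns are target classes.
table = pd.crosstab(toy["Premises_Type"], toy["Status"])

# Test of independence; the result does not depend on any integer encoding.
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")
```

With so few rows the p-value itself is not meaningful; the point is only the shape of the computation.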


In [None]:
X_cat_trans.isnull().sum()
type(X_cat_trans)

# chi2 returns two arrays: the chi-squared statistics and the corresponding p-values.
chi2_stats, p_values = chi2(X_cat_trans, y)

chi2_stats = pd.Series(chi2_stats, index=X_cat_trans.columns)
p_values = pd.Series(p_values, index=X_cat_trans.columns)

p_values.sort_values(ascending=False)

Occurrence_DayOfWeek     9.671731e-01
Bike_Model               8.850014e-01
Bike_Colour              3.036290e-03
Bike_Make                1.563163e-03
NeighbourhoodName        5.018902e-04
Occurrence_Month         2.494427e-04
Report_Month             5.438804e-05
Bike_Type                5.379680e-10
Division                 3.516800e-10
Premises_Type            3.445315e-13
Location_Type            1.626134e-18
Primary_Offence         5.860375e-104
dtype: float64

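The code above lists the features in decreasing order of p-value. In a chi-squared test, a lower p-value indicates stronger evidence that the feature is associated with the target variable; a high p-value means the test found no evidence of a relationship. Judging by these p-values, Occurrence_DayOfWeek (p ≈ 0.97) and Bike_Model (p ≈ 0.89) are the weakest features and the natural candidates to drop, while Primary_Offence has by far the lowest p-value and therefore the strongest association with the target, so a low p-value alone is not a reason to drop it. Below is the list of features that will be used to train the models.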
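
As a cross-check on the ranking, the p-values printed above can be split at a significance threshold. This is a minimal sketch: the threshold `alpha = 0.05` is an assumed convention, not something the notebook specifies, and the p-values are copied from the output above. Under the standard reading of the test, features with p-values below the threshold show evidence of association with the target.

```python
import pandas as pd

# p-values copied from the chi2 output above.
p_values = pd.Series({
    "Occurrence_DayOfWeek": 9.671731e-01,
    "Bike_Model": 8.850014e-01,
    "Bike_Colour": 3.036290e-03,
    "Bike_Make": 1.563163e-03,
    "NeighbourhoodName": 5.018902e-04,
    "Occurrence_Month": 2.494427e-04,
    "Report_Month": 5.438804e-05,
    "Bike_Type": 5.379680e-10,
    "Division": 3.516800e-10,
    "Premises_Type": 3.445315e-13,
    "Location_Type": 1.626134e-18,
    "Primary_Offence": 5.860375e-104,
})

alpha = 0.05  # assumed significance level

# Features with evidence of association, strongest first.
significant = p_values[p_values < alpha].sort_values().index.tolist()

# Features for which the test found no evidence of association.
not_significant = p_values[p_values >= alpha].index.tolist()

print("keep:", significant)
print("candidates to drop:", not_significant)
```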