# Variable Selection With Orders

<a id='up'></a>

# INDEX

0. [Load libraries](#load-lib)
1. [Data import and aggregation](#data-import)

   1.1 [Train Without Orders](#without-orders)
   
2. [Some copy](#cp1)

3. [Encoder on <code>cluster</code>](#cluster-encoder)

4. [Encoder on <code>day_of_week</code>](#day-week-encoder)

5. [Encoder on <code>air_genre_name</code> and <code>air_area_name</code>](#genre-area-encoder)

6. [Drop some vars](#drop-vars1)

7. [Preparation and separation of data for Variable selection](#preparation)

8. [Drop Nulls from the train](#dropne)

9. [Remove unnecessary var X](#x)

10. [Variable Selection using LASSO](#lasso)

11. [Variable Selection using Random Forest](#random-forest)

12. [Variable Selection using Gradient Boosting classification](#gradient-boosting)

 

#### <a id='load-lib'>0. Load libraries</a>

In [39]:
#import sys
#!{sys.executable} -m pip install xgboost


In [40]:
import numpy as np 
import pandas as pd 
from subprocess import check_output
import matplotlib.pyplot as plt
import seaborn as sns
import glob, re

from sklearn import preprocessing
from datetime import datetime
from xgboost import XGBRegressor

In [41]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel




In [42]:
np.random.seed(10)

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

### <a id='data-import'>1. Data import and aggregation</a>

In [43]:
# Data Aggregation
train_wo = pd.read_csv('../input/train_without_orders.csv')
#train_io = pd.read_csv('../input/train_in_orders.csv')
#test = pd.read_csv('../input/test.csv')

<a id='without-orders'>1.1 Train Without Orders</a>

In [5]:
#train_io.info()

[To the header](#up)

In [6]:
#test.info()

[To the header](#up)

#### <a id='cp1'>2. Some copy</a>

In [7]:
des_width = 620
pd.set_option('display.width', des_width)
pd.set_option('display.max_columns', 200)
np.set_printoptions(linewidth=des_width)

In [8]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; } </style>"))

In [44]:
twoCopy = train_wo.copy() # two - train without orders copy

In [45]:
twoCopy.head(1)

Unnamed: 0.1,Unnamed: 0,visit_date,visitors,air_genre_name,air_area_name,air_store_id,latitude,longitude,cluster,prefecture,month,date,dw,dy,day_of_week,holiday_flg,sunday,saturday,sat/sun/hol,calendar_date,precipitation,avg_temperature,hours_sunlight,avg_wind_speed,avg_vapor_pressure,avg_humidity,avg_sea_pressure,avg_local_pressure,solar_radiation,cloud_cover,high_temperature,low_temperature,lon_plus_lat
0,0,2016-01-13,25,Dining bar,Tōkyō-to Minato-ku Shibakōen,air_ba937bf13d40fb24,35.658068,139.751599,1,Tōkyō-to,1,13,2,13,Wednesday,0,0,0,0.0,2016-01-13,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667


[To the header](#up)

#### <a id='cluster-encoder'>3. Encoder on <code>cluster</code></a>

In [46]:
lbl = preprocessing.LabelEncoder()
twoCopy['cluster'] = lbl.fit_transform(twoCopy['cluster'])

twoCopy['cluster'].unique()

array([1, 3, 2, 7, 0, 5, 4, 6], dtype=int64)

In [47]:
twoCopy.head(2)

Unnamed: 0.1,Unnamed: 0,visit_date,visitors,air_genre_name,air_area_name,air_store_id,latitude,longitude,cluster,prefecture,month,date,dw,dy,day_of_week,holiday_flg,sunday,saturday,sat/sun/hol,calendar_date,precipitation,avg_temperature,hours_sunlight,avg_wind_speed,avg_vapor_pressure,avg_humidity,avg_sea_pressure,avg_local_pressure,solar_radiation,cloud_cover,high_temperature,low_temperature,lon_plus_lat
0,0,2016-01-13,25,Dining bar,Tōkyō-to Minato-ku Shibakōen,air_ba937bf13d40fb24,35.658068,139.751599,1,Tōkyō-to,1,13,2,13,Wednesday,0,0,0,0.0,2016-01-13,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667
1,1,2016-01-13,21,Izakaya,Tōkyō-to Shinagawa-ku Higashigotanda,air_25e9888d30b386df,35.626568,139.725858,1,Tōkyō-to,1,13,2,13,Wednesday,0,0,0,0.0,2016-01-13,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.352426


In [48]:
twoCopy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247009 entries, 0 to 247008
Data columns (total 33 columns):
Unnamed: 0            247009 non-null int64
visit_date            247009 non-null object
visitors              247009 non-null int64
air_genre_name        247009 non-null object
air_area_name         247009 non-null object
air_store_id          247009 non-null object
latitude              247009 non-null float64
longitude             247009 non-null float64
cluster               247009 non-null int64
prefecture            247009 non-null object
month                 247009 non-null int64
date                  247009 non-null int64
dw                    247009 non-null int64
dy                    247009 non-null int64
day_of_week           247009 non-null object
holiday_flg           247009 non-null int64
sunday                247009 non-null int64
saturday              247009 non-null int64
sat/sun/hol           247009 non-null float64
calendar_date         247009 non-null obj

[To the header](#up)

#### <a id='day-week-encoder'>4. Encoder on <code>day_of_week</code></a>

In [49]:
lbl = preprocessing.LabelEncoder()
twoCopy['day_of_week'] = lbl.fit_transform(twoCopy['day_of_week'])

In [50]:
twoCopy.head(1)

Unnamed: 0.1,Unnamed: 0,visit_date,visitors,air_genre_name,air_area_name,air_store_id,latitude,longitude,cluster,prefecture,month,date,dw,dy,day_of_week,holiday_flg,sunday,saturday,sat/sun/hol,calendar_date,precipitation,avg_temperature,hours_sunlight,avg_wind_speed,avg_vapor_pressure,avg_humidity,avg_sea_pressure,avg_local_pressure,solar_radiation,cloud_cover,high_temperature,low_temperature,lon_plus_lat
0,0,2016-01-13,25,Dining bar,Tōkyō-to Minato-ku Shibakōen,air_ba937bf13d40fb24,35.658068,139.751599,1,Tōkyō-to,1,13,2,13,6,0,0,0,0.0,2016-01-13,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667
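`LabelEncoder` assigns integer codes in sorted order of the unique labels, which is why `Wednesday` becomes `6` in the row above: it is last alphabetically among the seven day names. The sketch below illustrates this on a few made-up values (the list `days` is hypothetical and not taken from the dataset). Note that the resulting codes carry no calendar order, so linear models such as LASSO will treat them as an arbitrary numeric scale.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values, not taken from the dataset.
days = ["Wednesday", "Friday", "Monday", "Friday"]

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
restored = list(encoder.inverse_transform(codes))
print(restored)
```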


In [51]:
twoCopy.columns

Index(['Unnamed: 0', 'visit_date', 'visitors', 'air_genre_name', 'air_area_name', 'air_store_id', 'latitude', 'longitude', 'cluster', 'prefecture', 'month', 'date', 'dw', 'dy', 'day_of_week', 'holiday_flg', 'sunday', 'saturday', 'sat/sun/hol', 'calendar_date', 'precipitation', 'avg_temperature', 'hours_sunlight', 'avg_wind_speed', 'avg_vapor_pressure', 'avg_humidity', 'avg_sea_pressure', 'avg_local_pressure', 'solar_radiation', 'cloud_cover', 'high_temperature', 'low_temperature', 'lon_plus_lat'], dtype='object')

[To the header](#up)

#### <a id='genre-area-encoder'>5. Encoder on <code>air_genre_name</code> and <code>air_area_name</code></a>

In [52]:
# Normalise separators so each name splits into space-delimited tokens
twoCopy['air_genre_name'] = twoCopy['air_genre_name'].map(lambda x: str(x).replace('/', ' '))
twoCopy['air_area_name'] = twoCopy['air_area_name'].map(lambda x: str(x).replace('-', ' '))

lbl = preprocessing.LabelEncoder()

def nth_token(name, i):
    # Return the i-th space-separated token of name, or '' if there are fewer tokens
    parts = str(name).split(' ')
    return parts[i] if len(parts) > i else ''

# Label-encode each of the first ten tokens of both names as its own column
for i in range(10):
    twoCopy['air_genre_name' + str(i)] = lbl.fit_transform(
        twoCopy['air_genre_name'].map(lambda x: nth_token(x, i)))
    twoCopy['air_area_name' + str(i)] = lbl.fit_transform(
        twoCopy['air_area_name'].map(lambda x: nth_token(x, i)))

# Finally, label-encode the full names themselves
twoCopy['air_genre_name'] = lbl.fit_transform(twoCopy['air_genre_name'])
twoCopy['air_area_name'] = lbl.fit_transform(twoCopy['air_area_name'])

In [53]:
twoCopy.head(1)

Unnamed: 0.1,Unnamed: 0,visit_date,visitors,air_genre_name,air_area_name,air_store_id,latitude,longitude,cluster,prefecture,month,date,dw,dy,day_of_week,holiday_flg,sunday,saturday,sat/sun/hol,calendar_date,precipitation,avg_temperature,hours_sunlight,avg_wind_speed,avg_vapor_pressure,avg_humidity,avg_sea_pressure,avg_local_pressure,solar_radiation,cloud_cover,high_temperature,low_temperature,lon_plus_lat,air_genre_name0,air_area_name0,air_genre_name1,air_area_name1,air_genre_name2,air_area_name2,air_genre_name3,air_area_name3,air_genre_name4,air_area_name4,air_genre_name5,air_area_name5,air_genre_name6,air_area_name6,air_genre_name7,air_area_name7,air_genre_name8,air_area_name8,air_genre_name9,air_area_name9
0,0,2016-01-13,25,4,59,air_ba937bf13d40fb24,35.658068,139.751599,1,Tōkyō-to,1,13,2,13,6,0,0,0,0.0,2016-01-13,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667,4,7,7,3,0,26,0,4,0,75,0,0,0,0,0,0,0,0,0,0
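To make the token columns easier to read, here is a small sketch of the splitting step on its own, before any label encoding. The series `names` and the helper `nth_token` are illustrative assumptions, not part of the notebook's data. Positions beyond the number of tokens in a name come out as empty strings, so a column for a high token index can end up constant.

```python
import pandas as pd

# Hypothetical genre-like strings, used only for illustration.
names = pd.Series(["Dining bar", "Izakaya", "Cafe Sweets"])

def nth_token(name, i):
    """Return the i-th space-separated token, or '' if the name is shorter."""
    parts = str(name).split(" ")
    return parts[i] if len(parts) > i else ""

# One column per token position, mirroring the air_genre_name0..9 columns.
tokens = pd.DataFrame(
    {f"token{i}": names.map(lambda s, i=i: nth_token(s, i)) for i in range(3)}
)
print(tokens)

# Columns with a single unique value carry no information for a model.
constant_cols = [c for c in tokens.columns if tokens[c].nunique() == 1]
print(constant_cols)
```

A check like `nunique() == 1` is one way to decide which of the generated columns are safe to drop.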


In [54]:
twoCopy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247009 entries, 0 to 247008
Data columns (total 53 columns):
Unnamed: 0            247009 non-null int64
visit_date            247009 non-null object
visitors              247009 non-null int64
air_genre_name        247009 non-null int32
air_area_name         247009 non-null int32
air_store_id          247009 non-null object
latitude              247009 non-null float64
longitude             247009 non-null float64
cluster               247009 non-null int64
prefecture            247009 non-null object
month                 247009 non-null int64
date                  247009 non-null int64
dw                    247009 non-null int64
dy                    247009 non-null int64
day_of_week           247009 non-null int32
holiday_flg           247009 non-null int64
sunday                247009 non-null int64
saturday              247009 non-null int64
sat/sun/hol           247009 non-null float64
calendar_date         247009 non-null object

[To the header](#up)

### <a id='drop-vars1'>6. Drop some vars</a>

In [55]:
twoCopy=twoCopy.drop(['calendar_date' , 'prefecture','air_genre_name6','air_area_name6',\
                      'air_genre_name7', 'air_area_name7', 'air_genre_name8', 'air_area_name8','air_genre_name9',\
                      'air_area_name9'], axis=1)

In [56]:
twoCopy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247009 entries, 0 to 247008
Data columns (total 43 columns):
Unnamed: 0            247009 non-null int64
visit_date            247009 non-null object
visitors              247009 non-null int64
air_genre_name        247009 non-null int32
air_area_name         247009 non-null int32
air_store_id          247009 non-null object
latitude              247009 non-null float64
longitude             247009 non-null float64
cluster               247009 non-null int64
month                 247009 non-null int64
date                  247009 non-null int64
dw                    247009 non-null int64
dy                    247009 non-null int64
day_of_week           247009 non-null int32
holiday_flg           247009 non-null int64
sunday                247009 non-null int64
saturday              247009 non-null int64
sat/sun/hol           247009 non-null float64
precipitation         247009 non-null float64
avg_temperature       247009 non-null float

[To the header](#up)

## <a id='preparation'>7. Preparation and separation of data for Variable selection</a>

In [57]:
cols = ['air_genre_name', 'air_area_name','latitude', 'longitude', 'cluster', 'month', 'date', 'dw', 'dy', 'day_of_week','holiday_flg', 'sunday', 'saturday', 'sat/sun/hol',\
        'precipitation', 'avg_temperature', 'hours_sunlight', 'avg_wind_speed','avg_vapor_pressure','avg_humidity', 'avg_sea_pressure', 'avg_local_pressure',\
        'solar_radiation','cloud_cover', 'high_temperature','low_temperature','lon_plus_lat', 'air_genre_name0', 'air_area_name0','air_genre_name1', 'air_area_name1', 'air_genre_name2',\
        'air_area_name2','air_genre_name3','air_area_name3','air_genre_name4', 'air_area_name4', 'air_genre_name5', 'air_area_name5' ]


varSel = pd.DataFrame({'Variable': cols})
varSel.head(50)

Unnamed: 0,Variable
0,air_genre_name
1,air_area_name
2,latitude
3,longitude
4,cluster
5,month
6,date
7,dw
8,dy
9,day_of_week


In [58]:
varSel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 1 columns):
Variable    39 non-null object
dtypes: object(1)
memory usage: 392.0+ bytes


[To the header](#up)

#### <a id='dropne'>8. Drop Nulls from the train</a>

In [59]:
twoCopy =twoCopy.dropna()

In [60]:
twoCopy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 206176 entries, 0 to 247008
Data columns (total 43 columns):
Unnamed: 0            206176 non-null int64
visit_date            206176 non-null object
visitors              206176 non-null int64
air_genre_name        206176 non-null int32
air_area_name         206176 non-null int32
air_store_id          206176 non-null object
latitude              206176 non-null float64
longitude             206176 non-null float64
cluster               206176 non-null int64
month                 206176 non-null int64
date                  206176 non-null int64
dw                    206176 non-null int64
dy                    206176 non-null int64
day_of_week           206176 non-null int32
holiday_flg           206176 non-null int64
sunday                206176 non-null int64
saturday              206176 non-null int64
sat/sun/hol           206176 non-null float64
precipitation         206176 non-null float64
avg_temperature       206176 non-null float
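`dropna()` removes every row that has a null in any column, which here takes the frame from 247009 to 206176 rows (40833 rows removed). Before dropping, it can help to see which columns are responsible. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for `twoCopy`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for twoCopy.
toy = pd.DataFrame({
    "visitors": [25, 21, 40, 5],
    "precipitation": [0.0, np.nan, 1.5, 0.0],
    "cloud_cover": [2.5, np.nan, np.nan, 7.0],
})

# Nulls per column, worst offenders first.
null_counts = toy.isna().sum().sort_values(ascending=False)
print(null_counts)

# Drop incomplete rows and report how many were lost.
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

If a single column accounts for most of the loss, dropping or imputing that column may preserve more training rows than dropping the rows.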

In [26]:
twoCopy.head(5)

Unnamed: 0.1,Unnamed: 0,visit_date,visitors,air_genre_name,air_area_name,air_store_id,latitude,longitude,cluster,month,date,dw,dy,day_of_week,holiday_flg,sunday,saturday,sat/sun/hol,precipitation,avg_temperature,hours_sunlight,avg_wind_speed,avg_vapor_pressure,avg_humidity,avg_sea_pressure,avg_local_pressure,solar_radiation,cloud_cover,high_temperature,low_temperature,lon_plus_lat,air_genre_name0,air_area_name0,air_genre_name1,air_area_name1,air_genre_name2,air_area_name2,air_genre_name3,air_area_name3,air_genre_name4,air_area_name4,air_genre_name5,air_area_name5
0,0,2016-01-13,25,4,59,air_ba937bf13d40fb24,35.658068,139.751599,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667,4,7,7,3,0,26,0,4,0,75,0,0
1,1,2016-01-13,21,7,72,air_25e9888d30b386df,35.626568,139.725858,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.352426,7,7,0,3,0,39,0,4,0,18,0,0
2,2,2016-01-13,40,7,59,air_fd6aac1043520e83,35.658068,139.751599,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667,7,7,0,3,0,26,0,4,0,75,0,0
3,3,2016-01-13,5,4,59,air_64d4491ad8cdb1c6,35.658068,139.751599,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667,4,7,7,3,0,26,0,4,0,75,0,0
4,4,2016-01-13,16,11,71,air_5c65468938c07fa5,35.661777,139.704051,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.365828,11,7,0,3,0,38,0,4,0,76,0,0


In [61]:
twoCopy.columns # positions: 'Unnamed: 0' = 0, 'visit_date' = 1, 'visitors' = 2

Index(['Unnamed: 0', 'visit_date', 'visitors', 'air_genre_name', 'air_area_name', 'air_store_id', 'latitude', 'longitude', 'cluster', 'month', 'date', 'dw', 'dy', 'day_of_week', 'holiday_flg', 'sunday', 'saturday', 'sat/sun/hol', 'precipitation', 'avg_temperature', 'hours_sunlight', 'avg_wind_speed', 'avg_vapor_pressure', 'avg_humidity', 'avg_sea_pressure', 'avg_local_pressure', 'solar_radiation', 'cloud_cover', 'high_temperature', 'low_temperature', 'lon_plus_lat', 'air_genre_name0', 'air_area_name0', 'air_genre_name1', 'air_area_name1', 'air_genre_name2', 'air_area_name2', 'air_genre_name3', 'air_area_name3',
       'air_genre_name4', 'air_area_name4', 'air_genre_name5', 'air_area_name5'],
      dtype='object')

[To the header](#up)

In [62]:
finalTrain = twoCopy.copy() # snapshot of the cleaned training data

## <a id='x'>9. Remove unnecessary var X </a> 

In [63]:
# Keep only the candidate predictor columns listed in cols
X = twoCopy.loc[:, cols]
X.head(2)

Unnamed: 0,air_genre_name,air_area_name,latitude,longitude,cluster,month,date,dw,dy,day_of_week,holiday_flg,sunday,saturday,sat/sun/hol,precipitation,avg_temperature,hours_sunlight,avg_wind_speed,avg_vapor_pressure,avg_humidity,avg_sea_pressure,avg_local_pressure,solar_radiation,cloud_cover,high_temperature,low_temperature,lon_plus_lat,air_genre_name0,air_area_name0,air_genre_name1,air_area_name1,air_genre_name2,air_area_name2,air_genre_name3,air_area_name3,air_genre_name4,air_area_name4,air_genre_name5,air_area_name5
0,4,59,35.658068,139.751599,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.409667,4,7,7,3,0,26,0,4,0,75,0,0
1,7,72,35.626568,139.725858,1,1,13,2,13,6,0,0,0,0.0,0.0,3.542857,8.833333,1.514286,4.9,60.0,1013.1,1010.1,10.86,2.5,9.171429,-2.028571,175.352426,7,7,0,3,0,39,0,4,0,18,0,0


In [30]:
X.columns

Index(['air_genre_name', 'air_area_name', 'latitude', 'longitude', 'cluster', 'month', 'date', 'dw', 'dy', 'day_of_week', 'holiday_flg', 'sunday', 'saturday', 'sat/sun/hol', 'precipitation', 'avg_temperature', 'hours_sunlight', 'avg_wind_speed', 'avg_vapor_pressure', 'avg_humidity', 'avg_sea_pressure', 'avg_local_pressure', 'solar_radiation', 'cloud_cover', 'high_temperature', 'low_temperature', 'lon_plus_lat', 'air_genre_name0', 'air_area_name0', 'air_genre_name1', 'air_area_name1', 'air_genre_name2', 'air_area_name2', 'air_genre_name3', 'air_area_name3', 'air_genre_name4', 'air_area_name4', 'air_genre_name5',
       'air_area_name5'],
      dtype='object')

In [31]:
y = twoCopy['visitors'] # target: column at position 2
print([X.shape,y.shape])

[(206176, 39), (206176,)]


[To the header](#up)

## <a id='lasso'>10. Variable Selection using LASSO</a>

In [32]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

lassomod = Lasso(alpha=0.1).fit(X, y)

In [33]:
model = SelectFromModel(lassomod, prefit=True)
model.get_support()

array([ True,  True, False, False,  True, False,  True,  True,  True,  True,  True, False,  True, False,  True, False,  True, False,  True,  True,  True, False,  True,  True,  True,  True, False,  True, False,  True, False, False,  True, False,  True, False,  True, False, False])

In [34]:
varSel['Lasso'] = model.get_support().astype('int64')
varSel

Unnamed: 0,Variable,Lasso
0,air_genre_name,1
1,air_area_name,1
2,latitude,0
3,longitude,0
4,cluster,1
5,month,0
6,date,1
7,dw,1
8,dy,1
9,day_of_week,1
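The L1 penalty in LASSO acts on coefficient magnitudes, so which coefficients shrink to zero depends on the scale of each feature. The notebook fits on the raw columns; a common variation, shown here only as a sketch on synthetic data, is to standardise the features first. Everything in this block (`X_toy`, `y_toy`, the scales, the coefficients) is an assumption made for illustration. For a Lasso estimator, `SelectFromModel` uses a default threshold of `1e-5` on the absolute coefficients, so it effectively keeps the non-zero ones.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(10)
n_rows = 1000

# Four synthetic features on very different scales (hypothetical data).
X_toy = rng.normal(size=(n_rows, 4)) * np.array([1.0, 1000.0, 0.01, 1.0])

# Only features 0 and 2 drive the target; features 1 and 3 are pure noise.
y_toy = 3.0 * X_toy[:, 0] + 200.0 * X_toy[:, 2] + rng.normal(scale=0.1, size=n_rows)

# Standardise so the penalty treats every feature on the same footing.
X_scaled = StandardScaler().fit_transform(X_toy)

lasso = Lasso(alpha=0.1).fit(X_scaled, y_toy)
selector = SelectFromModel(lasso, prefit=True)
support = selector.get_support()

print(dict(zip(["f0", "f1", "f2", "f3"], support)))
print(lasso.coef_)
```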


[To the header](#up)

## <a id='random-forest'>11. Variable Selection using Random Forest</a>

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rfmod = RandomForestClassifier().fit(X, y)



In [36]:
model1 = SelectFromModel(rfmod, prefit=True)
model1.get_support()

array([ True,  True,  True,  True, False,  True,  True, False,  True, False, False, False, False, False,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True, False,  True, False, False, False, False, False, False,  True, False, False])
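For tree ensembles, `SelectFromModel` with its default settings keeps the features whose importance is at least the mean importance. The sketch below checks this by hand on synthetic data. It uses `RandomForestRegressor` rather than the notebook's `RandomForestClassifier`, which is a deliberate substitution for the illustration since the toy target is continuous; the data, `n_estimators`, and `random_state` are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(10)

# Synthetic data: only the first of five features influences the target.
X_toy = rng.normal(size=(300, 5))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=20, random_state=10).fit(X_toy, y_toy)
importances = forest.feature_importances_

# Default threshold for non-L1 estimators is the mean importance.
selector = SelectFromModel(forest, prefit=True)
support = selector.get_support()

# Reproduce the mask by hand to show what the selector does.
manual_mask = importances >= importances.mean()

print(importances.round(3))
print(support)
```

Because importances sum to one, the mean threshold is `1 / n_features`, so a feature is kept only if it contributes more than an equal share.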

In [37]:
varSel['RandomForest'] = model1.get_support().astype('int64')
varSel

Unnamed: 0,Variable,Lasso,RandomForest
0,air_genre_name,1,1
1,air_area_name,1,1
2,latitude,0,1
3,longitude,0,1
4,cluster,1,0
5,month,0,1
6,date,1,1
7,dw,1,0
8,dy,1,1
9,day_of_week,1,0


[To the header](#up)

### <a id='gradient-boosting'>12. Variable Selection using Gradient Boosting classification</a>

In [None]:
gbmod = GradientBoostingClassifier().fit(X, y)

In [None]:
model2 = SelectFromModel(gbmod, prefit=True)
model2.get_support()

In [None]:
varSel['GradientBoost'] = model2.get_support().astype('int64')
varSel
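Once each method has contributed a 0/1 column, one simple way to combine them is a vote count per variable. The sketch below is an assumption about how the table might be used, not a step from the notebook. The flag values in `votes` are made up, in particular the `GradientBoost` column, since those cells have no recorded output. The cutoff of two votes is likewise arbitrary.

```python
import pandas as pd

# Hypothetical selection flags; real values would come from the varSel table.
votes = pd.DataFrame({
    "Variable": ["air_genre_name", "latitude", "cluster", "month"],
    "Lasso": [1, 0, 1, 0],
    "RandomForest": [1, 1, 0, 1],
    "GradientBoost": [1, 1, 0, 0],
})

method_cols = ["Lasso", "RandomForest", "GradientBoost"]

# Count how many methods selected each variable.
votes["Votes"] = votes[method_cols].sum(axis=1)

# Keep variables chosen by at least two of the three methods.
selected = votes.loc[votes["Votes"] >= 2, "Variable"].tolist()
print(votes)
print(selected)
```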

[To the header](#up)