## Feature Selection

In this lab, we will work on the carseat dataset (you can find it under the data_sets directory) and practice the feature selection techniques we have learned.

About the dataset: it contains 400 records with the following 11 variables.

    Sales
    Unit sales (in thousands) at each location

    CompPrice
    Price charged by competitor at each location

    Income
    Community income level (in thousands of dollars)

    Advertising
    Local advertising budget for company at each location (in thousands of dollars)

    Population
    Population size in region (in thousands)

    Price
    Price company charges for car seats at each site

    ShelveLoc
    A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

    Age
    Average age of the local population

    Education
    Education level at each location

    Urban
    A factor with levels No and Yes to indicate whether the store is in an urban or rural location

    US
    A factor with levels No and Yes to indicate whether the store is in the US or not


1. First mount the drive and load the carseat data set.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
CarseatData = pd.read_csv('/content/drive/MyDrive/ISCH 370 Labs/Carseat.csv')

Mounted at /content/drive


2. Check the null values in the dataset

In [None]:
CarseatData.isnull().sum()

Unnamed: 0,0
Sales,0
CompPrice,0
Income,0
Advertising,0
Population,0
Price,0
ShelveLoc,0
Age,0
Education,0
Urban,0


3. Check the data type of each column

In [None]:
CarseatData.dtypes

Unnamed: 0,0
Sales,float64
CompPrice,int64
Income,int64
Advertising,int64
Population,int64
Price,int64
ShelveLoc,object
Age,int64
Education,int64
Urban,object


4. In this dataset, "Sales" is the target variable and the other columns are features. Assign the values of "Sales" to a variable named "target" and the remaining attributes to a variable named "features".

In [None]:
target = CarseatData['Sales']
print(target)
features = CarseatData.drop('Sales',axis=1)
print(features)

0       9.50
1      11.22
2      10.06
3       7.40
4       4.15
       ...  
395    12.57
396     6.14
397     7.41
398     5.94
399     9.71
Name: Sales, Length: 400, dtype: float64
     CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  \
0          138      73           11         276    120       Bad   42   
1          111      48           16         260     83      Good   65   
2          113      35           10         269     80    Medium   59   
3          117     100            4         466     97    Medium   55   
4          141      64            3         340    128       Bad   38   
..         ...     ...          ...         ...    ...       ...  ...   
395        138     108           17         203    128      Good   33   
396        139      23            3          37    120    Medium   55   
397        162      26           12         368    159    Medium   40   
398        100      79            7         284     95       Bad   50   
399        13

5. Use sklearn.preprocessing.LabelEncoder to convert the categorical attributes (ShelveLoc, Urban, and US) to numerical values. Here is the sample code:


```
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Cabin']=le.fit_transform(df['Cabin'].astype(str))
df['Cabin']
```



In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
features['ShelveLoc'] = le.fit_transform(features['ShelveLoc'].astype(str))
features['US'] = le.fit_transform(features['US'].astype(str))
features['Urban'] = le.fit_transform(features['Urban'].astype(str))
features[['ShelveLoc','Urban','US']]

Unnamed: 0,ShelveLoc,Urban,US
0,0,1,1
1,1,1,1
2,2,1,1
3,2,1,1
4,0,1,0
...,...,...,...
395,1,1,1
396,2,0,1
397,2,1,1
398,0,1,1
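
To see which integer each category received, one can inspect the encoder's `classes_` attribute. The sketch below uses a small hand-made series (an assumption for illustration, not the lab data) and suggests that `LabelEncoder` orders the labels alphabetically, so Medium receives a higher code than Good. The codes are therefore identifiers rather than a quality ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only (not read from Carseat.csv)
shelf = pd.Series(['Bad', 'Good', 'Medium', 'Medium', 'Bad'])

le = LabelEncoder()
codes = le.fit_transform(shelf)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

Note that when one encoder object is refitted on several columns in turn, `classes_` only reflects the column fitted last, so a separate encoder per column is needed if the mappings must be kept.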


6. Check the variance of each attribute and identify those that are less than 0.01. You can use the method df.var() here.

In [None]:
features.var()

Unnamed: 0,0
CompPrice,235.147243
Income,783.218239
Advertising,44.227343
Population,21719.813935
Price,560.584436
ShelveLoc,0.69468
Age,262.449618
Education,6.867168
Urban,0.208496
US,0.229549
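
In the output above the smallest variance is 0.208496 (Urban), so no attribute falls below 0.01. The filtering step itself can be sketched as follows, on a toy frame whose column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy frame for illustration; 'almost_constant' is a hypothetical column
toy = pd.DataFrame({
    'spread_out': [1.0, 5.0, 9.0, 13.0],
    'almost_constant': [1.00, 1.01, 1.00, 1.01],
})

variances = toy.var()  # sample variance (ddof=1) per column
# Keep the names of the columns whose variance is below the 0.01 threshold
low_variance = variances[variances < 0.01].index.tolist()
print(low_variance)
```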


7.1 Check the correlation between different features. List those pairs where the absolute value of the correlation is above 0.5. (hint: you should be able to find 2 such pairs).

In [None]:
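corrmatrix = features.corr()
print(corrmatrix)
# Keep only the entries whose absolute correlation is above 0.5, excluding the diagonal (exactly 1)
strong = corrmatrix[(corrmatrix.abs() > 0.5) & (corrmatrix.abs() < 1)]
# stack() turns the matrix into (row, column) pairs; dropna() removes the masked entries
print(list(strong.stack().dropna().index))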

             CompPrice    Income  Advertising  Population     Price  \
CompPrice     1.000000 -0.080653    -0.024199   -0.094707  0.584848   
Income       -0.080653  1.000000     0.058995   -0.007877 -0.056698   
Advertising  -0.024199  0.058995     1.000000    0.265652  0.044537   
Population   -0.094707 -0.007877     0.265652    1.000000 -0.012144   
Price         0.584848 -0.056698     0.044537   -0.012144  1.000000   
ShelveLoc     0.023350 -0.067678     0.008544   -0.044772  0.014633   
Age          -0.100239 -0.004670    -0.004557   -0.042663 -0.102177   
Education     0.025197 -0.056855    -0.033594   -0.106378  0.011747   
Urban         0.066594  0.037967     0.042035   -0.052025  0.047016   
US            0.016869  0.089601     0.684460    0.060564  0.057861   

             ShelveLoc       Age  Education     Urban        US  
CompPrice     0.023350 -0.100239   0.025197  0.066594  0.016869  
Income       -0.067678 -0.004670  -0.056855  0.037967  0.089601  
Advertising   0.0085
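
Because a correlation matrix is symmetric, every pair appears twice in it. One possible way to list each pair once is to look only at the strict upper triangle. The helper below, `correlated_pairs`, is a hypothetical name introduced for this sketch, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return (col_a, col_b, corr) for each unique pair with |corr| > threshold."""
    corr = df.corr()
    # Boolean mask for the strict upper triangle: drops the diagonal and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in strong.items()]

# Toy data for illustration: 'b' roughly doubles 'a', 'c' is nearly unrelated to both
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 6, 8, 10, 13],
    'c': [1, -1, -1, 1, 1, -1],
})
pairs = correlated_pairs(toy)
print(pairs)
```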

7.2 From 7.1, we can see that US and Advertising are correlated. Drop US, since Advertising is more informative.

In [None]:
features = features.drop('US',axis = 1)
print(features)

     CompPrice  Income  Advertising  Population  Price  ShelveLoc  Age  \
0          138      73           11         276    120          0   42   
1          111      48           16         260     83          1   65   
2          113      35           10         269     80          2   59   
3          117     100            4         466     97          2   55   
4          141      64            3         340    128          0   38   
..         ...     ...          ...         ...    ...        ...  ...   
395        138     108           17         203    128          1   33   
396        139      23            3          37    120          2   55   
397        162      26           12         368    159          2   40   
398        100      79            7         284     95          0   50   
399        134      37            0          27    120          1   49   

     Education  Urban  
0           17      1  
1           10      1  
2           12      1  
3           14 

7.3 From 7.1, we can see that CompPrice and Price are correlated. However, they carry different information, so both should be kept despite their correlation. Instead, replace the values of CompPrice with the relative difference from Price, i.e., CompPrice = (CompPrice - Price) / Price.

In [None]:
features['CompPrice'] = (features['CompPrice']-features['Price'])/features['Price']
print(features)

     CompPrice  Income  Advertising  Population  Price  ShelveLoc  Age  \
0     0.150000      73           11         276    120          0   42   
1     0.337349      48           16         260     83          1   65   
2     0.412500      35           10         269     80          2   59   
3     0.206186     100            4         466     97          2   55   
4     0.101562      64            3         340    128          0   38   
..         ...     ...          ...         ...    ...        ...  ...   
395   0.078125     108           17         203    128          1   33   
396   0.158333      23            3          37    120          2   55   
397   0.018868      26           12         368    159          2   40   
398   0.052632      79            7         284     95          0   50   
399   0.116667      37            0          27    120          1   49   

     Education  Urban  
0           17      1  
1           10      1  
2           12      1  
3           14 
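
As a quick sanity check of the formula, the sketch below recomputes the first five rows by hand, using the CompPrice and Price values printed earlier in this lab. A positive result means the competitor charges more than the store.

```python
import pandas as pd

# First five rows of CompPrice and Price as printed in the earlier outputs
sample = pd.DataFrame({
    'CompPrice': [138, 111, 113, 117, 141],
    'Price':     [120,  83,  80,  97, 128],
})

# Relative difference: (competitor price - own price) / own price
sample['CompPrice'] = (sample['CompPrice'] - sample['Price']) / sample['Price']
print(sample)
```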

8. Let's move on to the next phase: identifying the attributes that are relevant to the target.

8.0 Use StandardScaler to scale all the attributes.

In [None]:
from sklearn.preprocessing import StandardScaler
sscaler = StandardScaler()
numerics = ['int16','int32','int64','float16','float32','float64']
new_features = features.select_dtypes(include = numerics)
ss_features = sscaler.fit_transform(new_features)
new_features = pd.DataFrame(ss_features,columns = new_features.columns)
print(new_features)

     CompPrice    Income  Advertising  Population     Price  ShelveLoc  \
0     0.149385  0.155361     0.657177    0.075819  0.177823  -1.570698   
1     0.967942 -0.739060     1.409957   -0.032882 -1.386854  -0.369399   
2     1.296286 -1.204159     0.506621    0.028262 -1.513719   0.831899   
3     0.394868  1.121336    -0.396715    1.366649 -0.794814   0.831899   
4    -0.062245 -0.166631    -0.547271    0.510625  0.516132  -1.570698   
..         ...       ...          ...         ...       ...        ...   
395  -0.164647  1.407551     1.560513   -0.420131  0.516132  -0.369399   
396   0.185795 -1.633482    -0.547271   -1.547909  0.177823   0.831899   
397  -0.423550 -1.526151     0.807733    0.700853  1.827078   0.831899   
398  -0.276031  0.370022     0.054953    0.130170 -0.879391  -1.570698   
399   0.003747 -1.132606    -0.998939   -1.615848  0.177823  -0.369399   

          Age  Education     Urban  
0   -0.699782   1.184449  0.646869  
1    0.721723  -1.490113  0.646869  
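
To make the effect of the scaling concrete, the sketch below compares `StandardScaler` with a manual z-score on a toy frame (values chosen for illustration only). It assumes the scaler's default settings, under which each column is centred on its mean and divided by its population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame for illustration only
toy = pd.DataFrame({'Income': [73, 48, 35, 100, 64],
                    'Age':    [42, 65, 59, 55, 38]})

scaled = StandardScaler().fit_transform(toy)
scaled_df = pd.DataFrame(scaled, columns=toy.columns)

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled_df.values, manual.values))
print(scaled_df.mean().round(6).tolist())
print(scaled_df.std(ddof=0).round(6).tolist())
```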


8.1 Use sklearn.feature_selection.SelectKBest to list the score of each attribute. Choose a meaningful score_func using the table in slide 21.

Here is the sample code


```
from sklearn.feature_selection import SelectKBest, mutual_info_regression  # import the score function (mutual_info_regression) as well
k = 5  # keep the top 5 features
# "feature" should be the feature DataFrame after scaling
# Pass k by keyword: recent scikit-learn versions reject it as a positional argument
fit = SelectKBest(mutual_info_regression, k=k).fit(feature, target)
result = pd.DataFrame({'Features': feature.columns, 'Score': fit.scores_})
result.sort_values(by='Score', ascending=False, inplace=True)
result
```




In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_regression
k = 5
fit = SelectKBest(mutual_info_regression,k=k).fit(new_features,target)
result = pd.DataFrame({'Features':new_features.columns,'Score':fit.scores_})
result.sort_values(by='Score',ascending=False, inplace=True)
result

Unnamed: 0,Features,Score
0,CompPrice,0.260302
5,ShelveLoc,0.201204
4,Price,0.08067
7,Education,0.054154
6,Age,0.042227
1,Income,0.038964
2,Advertising,0.014025
3,Population,0.0
8,Urban,0.0
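
The score table ranks the attributes, but the fitted selector can also report which columns it kept through `get_support()`. The sketch below uses synthetic data (the column names `signal`, `noise_1`, and `noise_2` are hypothetical) and fixes `random_state` with `functools.partial`, since the mutual information estimate is randomized and the scores can otherwise vary slightly between runs.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
# Synthetic data for illustration: only 'signal' drives the target
X = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise_1': rng.normal(size=300),
    'noise_2': rng.normal(size=300),
})
y = 3 * X['signal'] + rng.normal(scale=0.1, size=300)

# Fix random_state so the mutual information estimate is repeatable
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns
selected = X.columns[selector.get_support()].tolist()
print(selected)
```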


8.2 Use PCA to generate a new feature space with 2 dimensions. Display the result with the values of the two new features and the corresponding target variable, i.e., "Sales". The output should look like:

         Col_1      Col_2      target
    0    -0.162894   0.386054   9.50
    1     1.359854   1.493284  11.22
    2     1.773408   0.509947  10.06
    3     0.766607   0.951937   7.40
    4    -0.530513   0.323145   4.15
    ...   ...        ...        ...
    395  -0.753606   1.046455  12.57
    396   0.194831  -1.555259   6.14
    397  -1.860075  -0.372875   7.41
    398   0.324242   0.873643   5.94
    399   0.145237  -2.094606   9.71

400 rows × 3 columns

In [None]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
X = pca.fit_transform(new_features)
X_df = pd.DataFrame(X,columns =['Col_1','Col_2'])
X_df['Sales'] = target
X_df

Unnamed: 0,Col_1,Col_2,Sales
0,-0.162915,0.386046,9.50
1,1.359997,1.493304,11.22
2,1.773409,0.509954,10.06
3,0.766555,0.951933,7.40
4,-0.530646,0.323123,4.15
...,...,...,...
395,-0.753554,1.046454,12.57
396,0.194949,-1.555240,6.14
397,-1.860155,-0.372883,7.41
398,0.324319,0.873647,5.94
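
When reducing to two dimensions, it can help to check how much of the original variance the two components retain. A minimal sketch, assuming synthetic data in which the third column nearly duplicates the first, so two components should capture almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data for illustration: the third column nearly duplicates the first
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.01 * rng.normal(size=200)])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print(X_2d.shape)
print(ratios, ratios.sum())
```

On real data such as the scaled features in this lab, the retained fraction may be much lower, which is worth checking before relying on the two new columns.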

