# Exploratory Data Analysis Part 2:
## Feature Extraction

In the previous notebook, a few features were omitted from analysis due to their complexity. They are evaluated here.

Per the previous notebook:

The remaining features that have not yet been investigated are listed below, along with a brief plan to evaluate each:
* amenities
    * These need to be separated into individual amenities, counted, and one-hot encoded
* description
    * This will require some manual examination and NLP techniques to attempt to find useful features
* host_about
    * This will require some manual examination and NLP techniques to attempt to find useful features
* name
    * This will require some manual examination and NLP techniques to attempt to find useful features
* neighborhood_overview
    * This will require some manual examination and NLP techniques to attempt to find useful features
* host_location
    * This will need to be combined with host_neighbourhood and encoded accordingly
* host_neighbourhood
    * This will need to be combined with host_location and encoded accordingly


In [129]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For text:
import re

# For times:
import time

# Set a random seed for imputation
#  Source:  https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html
np.random.seed(42)

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Read Data and Examine Dataframe

In [8]:
# keep the same dataframe name as in the previous notebook
lstn = pd.read_csv('../data/listings_train_2.csv')

In [9]:
lstn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3477 entries, 0 to 3476
Data columns (total 56 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            3477 non-null   int64  
 1   name                                          3477 non-null   object 
 2   description                                   3464 non-null   object 
 3   neighborhood_overview                         2245 non-null   object 
 4   host_id                                       3477 non-null   int64  
 5   host_since                                    3477 non-null   float64
 6   host_location                                 2683 non-null   object 
 7   host_about                                    2463 non-null   object 
 8   host_response_time                            3477 non-null   object 
 9   host_response_rate                            3477 non-null   f

# Explore and Extract 'amenities'

In [11]:
# View the data
lstn.amenities

0       ["Shampoo", "Carbon monoxide alarm", "Wifi", "...
1       ["Wifi", "Carbon monoxide alarm", "Hot water",...
2       ["Shampoo", "Carbon monoxide alarm", "Wifi", "...
3       ["Wifi", "Kitchen", "First aid kit", "Refriger...
4       ["Wifi", "Window guards", "Kitchen", "Long ter...
                              ...                        
3472    ["Wifi", "Stove", "Kitchen", "Dishwasher", "Re...
3473    ["Wifi", "Stove", "Keypad", "Kitchen", "Dishwa...
3474    ["Wifi", "Stove", "Keypad", "Kitchen", "Dishwa...
3475                        ["Wifi", "Microwave", "Oven"]
3476    ["Clothing storage: wardrobe and closet", "Cit...
Name: amenities, Length: 3477, dtype: object

In [16]:
lstn.amenities[0]

'["Shampoo", "Carbon monoxide alarm", "Wifi", "Washer", "Hair dryer", "Kitchen", "Smoke alarm", "Indoor fireplace", "Breakfast", "Free parking on premises", "Private entrance", "Hangers", "First aid kit", "TV", "Dryer", "Heating", "Air conditioning", "Fire extinguisher", "Essentials", "Iron"]'

In [17]:
lstn.amenities[100]

'["Wifi", "Carbon monoxide alarm", "Washer", "Keypad", "Security cameras on property", "Dishes and silverware", "Kitchen", "Smoke alarm", "Self check-in", "TV", "Refrigerator", "Hot tub", "Heating", "Air conditioning"]'

In [12]:
type(lstn.amenities[0])

str

In [14]:
lstn.amenities.nunique()

2697

#### OBSERVATIONS:
* There are many different combinations of amenities (2,697 unique values across 3,477 listings)
* Individual amenities appear to be drawn from a smaller set of recurring terms
* Each element in this column is actually a list stored as a string

These individual terms will need to be extracted and eventually one-hot encoded.
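
A minimal sketch of one way to turn such a string back into a Python list, assuming every value is a valid JSON array (the samples above look like one, but this has not been checked across the whole column). The `sample` string copies row 3475, and `parsed` is an illustrative name that does not appear elsewhere in this notebook.

```python
import json

# Example value in the same format as the 'amenities' column (row 3475)
sample = '["Wifi", "Microwave", "Oven"]'

# json.loads converts the JSON array text into a real Python list
parsed = json.loads(sample)

print(type(parsed))
print(parsed)
```

Parsing the string this way keeps every item, including the last one in the list.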

In [73]:
# Use a regular expression to extract the amenities, which sit between double quotes.
#  Code adapted from this source: https://stackoverflow.com/questions/1454913/regular-expression-to-find-a-string-included-between-two-characters-while-exclud
# Also helpful:  https://regex101.com/
# NOTE: the lookahead (?=",) requires a trailing comma, so the last amenity
#  in each list (which is followed by "]) is not captured by this pattern.
regex_string = r'(?<=")[^"]+(?=",)'

amn_lst = []

for string_lists in lstn.amenities:
    # Collect every amenity matched in this listing's string
    amn_lst.extend(re.findall(regex_string, string_lists))

# Number of distinct amenities found
len(set(amn_lst))

1072
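
A quick check of the pattern on a short string (row 3475 of the data), offered as a sketch. It suggests the lookahead `(?=",)` skips the final amenity of each list, because that item is followed by `"]` rather than `",`. The alternative pattern `alt_pattern` is an assumption and not part of the original analysis. It would still break on amenities that contain escaped quotes.

```python
import re

sample = '["Wifi", "Microwave", "Oven"]'

# Pattern used above: each match must be followed by a quote and a comma
original_pattern = r'(?<=")[^"]+(?=",)'
print(re.findall(original_pattern, sample))

# Hypothetical alternative: consume both quotes and capture what is between them
alt_pattern = r'"([^"]+)"'
print(re.findall(alt_pattern, sample))
```

If the last items matter, the counts below would shift slightly after switching patterns, so the threshold would need to be rechecked.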

In [153]:
# Create a pandas Series of all amenities and their number of occurrences
amn_counts = pd.Series(amn_lst).value_counts(ascending=False)

# Filter the Series to keep only amenities that appear at least 35 times
#  (roughly 1% of the 3,477 listings)
print(amn_counts[amn_counts >= 35])

# Create a vocab variable by using the index attribute to get the list of amenities
amn_vocab = amn_counts[amn_counts >= 35].index

Wifi                                                3261
Smoke alarm                                         3207
Carbon monoxide alarm                               3051
Kitchen                                             2976
Essentials                                          2808
                                                    ... 
Heating - split type ductless system                  39
Free driveway parking on premises \u2013 1 space      37
Stainless steel electric stove                        37
Baby safety gates                                     36
HDTV                                                  36
Length: 152, dtype: int64
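
The cutoff of 35 corresponds to about 1% of the 3,477 listings (35 / 3477 ≈ 0.01). The sketch below shows one way to derive such a cutoff from a share instead of hard-coding it. The toy `listings` data and the names `min_share` and `min_count` are illustrative assumptions, and the example uses a 50% share only so the tiny sample produces a visible result.

```python
import math
from collections import Counter

# Toy data: each inner list is one listing's parsed amenities
listings = [["Wifi", "Kitchen"], ["Wifi"], ["Wifi", "HDTV"], ["Kitchen"]]

# Count each amenity at most once per listing
counts = Counter(a for row in listings for a in set(row))

# Convert a minimum share of listings into a minimum count
min_share = 0.5
min_count = math.ceil(min_share * len(listings))

# Keep amenities that meet the cutoff, most frequent first
vocab = [a for a, n in counts.most_common() if n >= min_count]
print(min_count, vocab)
```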


In [157]:
# Use CountVectorizer to one-hot encode the amenities
#  Use the vocab so that only the selected amenities are encoded
# NOTE: the default token pattern splits text into single words, so multi-word
#  vocabulary entries such as "Smoke alarm" are never matched and stay at 0
cvec = CountVectorizer(lowercase=False, vocabulary=amn_vocab)

# Create a new dataframe with the count vectorized data from the amenities column
amen_df = pd.DataFrame(cvec.fit_transform(lstn.amenities).toarray(),
             columns = cvec.get_feature_names_out())

amen_df

Unnamed: 0,Wifi,Smoke alarm,Carbon monoxide alarm,Kitchen,Essentials,Hangers,Hair dryer,Heating,Hot water,Refrigerator,...,Mosquito net,Park view,Paid parking lot off premises,Pool table,Smoking allowed,Heating - split type ductless system,Free driveway parking on premises \u2013 1 space,Stainless steel electric stove,Baby safety gates,HDTV
0,1,0,0,1,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,1,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3472,1,0,0,1,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3473,1,0,0,1,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3474,1,0,0,1,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3475,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [143]:
new_df


<3477x773 sparse matrix of type '<class 'numpy.int64'>'
	with 191353 stored elements in Compressed Sparse Row format>

In [136]:
new_df.todense()

matrix([[1]], dtype=int64)

# Appendix: Scratch Work

In [117]:
'Pack \\u2019n p'

'Pack \\u2019n p'

In [120]:
amn_counts[amn_counts >= 100].index[-1]

'Pack \\u2019n play/Travel crib - available upon request'

In [127]:
len([i for i in lstn.amenities if '\\u2019n' in i])

382
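
The cells above show that some amenities contain the literal escape sequence `\u2019` instead of the apostrophe character. As a sketch, and assuming the values are valid JSON arrays, `json.loads` decodes these escapes while parsing. The `raw` string below is a shortened example built from the amenity name shown above, and `decoded` is an illustrative name.

```python
import json

# One raw value in the column's format; the backslash is doubled here so the
# Python string holds the literal characters \u2019, as stored in the CSV
raw = '["Pack \\u2019n play/Travel crib - available upon request", "Wifi"]'

# The raw text contains the escape sequence itself, not the apostrophe
print('\\u2019' in raw)

# json.loads interprets the escape and returns the actual character
decoded = json.loads(raw)
print(decoded[0])
```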

In [140]:
df = lstn['amenities']
