In [1]:
In scikit-learn versions before 0.20, OneHotEncoder cannot process string values directly.
If your nominal features are strings, you first need to map them to integers. The following example raises a ValueError because
OneHotEncoder cannot convert a string to a float, for example 'Male':
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

ValueError: could not convert string to float: 'Male'
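One possible workaround is sketched below: map the string column to integers first, then hand the all-numeric matrix to OneHotEncoder. Using LabelEncoder for the mapping is an assumption made for this illustration, and the names gender_codes and X_int do not come from the original cell.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# Map the string column to integer codes. LabelEncoder sorts the classes,
# so 'Female' becomes 0 and 'Male' becomes 1.
gender_codes = LabelEncoder().fit_transform([row[0] for row in X])

# Rebuild an all-integer matrix: [gender code, original numeric value].
X_int = np.column_stack([gender_codes, [row[1] for row in X]])

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(X_int).toarray()  # the result is sparse by default
print(encoded)
```

The first two output columns encode the gender and the remaining three encode the values 1, 2 and 3 of the second column.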

In [2]:
# pandas.get_dummies behaves roughly the opposite way.
# By default it converts only string (object or category) columns to a one-hot representation, unless the columns argument is given.

import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
pd.get_dummies(df)

Unnamed: 0,C,A_a,A_b,B_a,B_b,B_c
0,1,1,0,0,1,0
1,2,0,1,1,0,0
2,3,1,0,0,0,1
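To illustrate the "unless columns are specified" part, here is a small sketch that asks get_dummies to encode the integer column C instead. The variable name dummies is introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})

# Name the columns to encode explicitly; here only the integer column C.
# A and B are left untouched because they are not listed.
dummies = pd.get_dummies(df, columns=['C'])
print(list(dummies.columns))
```

Depending on the pandas version, the indicator values may print as 0/1 or as False/True.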


In [3]:
s = pd.Series(list('abca'))
pd.get_dummies(s)

Unnamed: 0,a,b,c
0,1,0,0
1,0,1,0
2,0,0,1
3,1,0,0


In [5]:
# CountVectorizer: convert a collection of text documents to a matrix of token counts

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.','This document is the second document.','And this is the third one.','Is this the first document?']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
X


<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [6]:
print(X.toarray()) 

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
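The matrix is easier to read once each column is matched to its token. A minimal sketch, assuming the fitted vocabulary_ attribute is used for the lookup (the name tokens is introduced here for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each token to its column index; sort the tokens by that index.
tokens = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Show the counts of the second document next to the tokens they belong to.
for token, count in zip(tokens, X.toarray()[1]):
    print(token, count)
```

For the second document this shows that "document" is counted twice, which matches the 2 in the second row of the array.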


In [3]:
df = pd.read_csv('/Users/username/Desktop/housing_prices_kaggle_train.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
len(df)

1460

In [5]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [6]:
len(df.columns)  # this count includes the Id column

81

In [None]:
# The very first question I should probably have asked: what is my goal in doing feature engineering? Do I have to pick
# the significant features up front, choose a model, and fit it to the data,

# or select as many features as possible and then remove the insignificant ones along the way? That way I would not
# risk leaving out significant features.

For now I will select as many features as possible and eliminate them one by one according to their significance during the process



# First, let us figure out which features are important and which are not, and which ones can be transformed into more meaningful features
The Id column is not needed as a feature; keep it only if each row has to be associated with its Id later
SalePrice: the property's sale price in dollars. This is the target variable to predict.
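A minimal sketch of that split, using a three-row sample copied from the head() output above. The names sample, X, y and ids are introduced here for illustration only.

```python
import pandas as pd

# Three rows taken from the head() output, restricted to a few columns.
sample = pd.DataFrame({
    'Id': [1, 2, 3],
    'LotArea': [8450, 9600, 11250],
    'SalePrice': [208500, 181500, 223500],
})

y = sample['SalePrice']                       # target variable
X = sample.drop(columns=['Id', 'SalePrice'])  # candidate features
ids = sample['Id']                            # kept aside to label rows later
print(list(X.columns))
```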

MSSubClass: Identifies the type of dwelling involved in the sale, for example whether the house is one story or two story


MSZoning: Identifies the general zoning classification of the sale, for example agricultural or industrial
    
LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

# Street: Type of road access - I am not going to include this feature to begin with

Alley: Type of alley access
    
# LotShape: General shape of property - I am not going to include this feature to begin with, assuming LandContour will be
# more insightful than the lot shape

LandContour: Flatness of the property
    
Utilities: Type of utilities available
        
LotConfig: Lot configuration
    
LandSlope: Slope of property. Whether this matters depends on what kind of city Ames is: slope might be a significant
feature in cities like SF or Monterey.

Iowa in general turns out to be a state of extensive land with a gentle gradient, so Ames can be expected to be the same,
and a severe slope on any property could be rather undesirable.
    
Neighborhood: Physical locations within Ames city limits
    
# Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present). Otherwise, is it the same as Condition1? I have to check;
maybe I can plot this and see. If the two are redundant, include this one and leave Condition1 out.
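Before plotting, a quick tabular check can already show how often the two columns agree. The sketch below uses made-up rows, with 'Norm', 'Feedr' and 'Artery' as example category codes; on the real data, df would replace the hypothetical sample frame.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'Condition1': ['Norm', 'Feedr', 'Norm', 'Artery', 'Norm'],
    'Condition2': ['Norm', 'Norm', 'Norm', 'Feedr', 'Norm'],
})

# Share of rows where the second condition merely repeats the first.
same_share = (sample['Condition1'] == sample['Condition2']).mean()

# The cross-tabulation shows which combinations actually occur.
table = pd.crosstab(sample['Condition1'], sample['Condition2'])
print(same_share)
print(table)
```

A share close to 1 would suggest that one of the two columns adds little information.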
    
BldgType: Type of dwelling (how is this different from MSSubClass? Although a similar idea is captured by MSSubClass and LotConfig, I feel
this feature adds a little more information, for example by marking a house as single-family detached) - research
    
HouseStyle: Style of dwelling (including info like 2nd level unfinished can tell a story)
    
OverallQual: Overall material and finish quality
    
OverallCond: Overall condition rating (these two ratings might look redundant, but they tell different stories. They are
significant in that people might like a house in good condition even if it has a few cosmetic issues such as material quality or finishing touches.
At the same time, they might differ in some cases because the buyer expected a move-in-ready house and did not want to spend a dime.)
    
YearBuilt: Original construction date - will be used in finding the age of the building
    
YearRemodAdd: Remodel date - find the difference between this date and the date the property was sold
    
RoofStyle: Type of roof
    
RoofMatl: Roof material
    
For both Exterior1st and Exterior2nd: my idea is to combine these columns and encode single-material and two-material houses separately, so I will have only two categories overall
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)

Categorize them both under: Wood, Brick, Vinyl, Other (check how balanced the categories are)
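Both ideas can be sketched on a few made-up rows. The MATERIAL_GROUPS mapping and the new column names ExteriorGroup and TwoMaterials are hypothetical, and the sketch assumes that a repeated value in Exterior2nd means only one material is present.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'Exterior1st': ['VinylSd', 'Wd Sdng', 'BrkFace', 'HdBoard'],
    'Exterior2nd': ['VinylSd', 'Plywood', 'BrkFace', 'HdBoard'],
})

# Hypothetical grouping of raw codes into coarse material families.
MATERIAL_GROUPS = {
    'VinylSd': 'Vinyl',
    'Wd Sdng': 'Wood',
    'Plywood': 'Wood',
    'BrkFace': 'Brick',
}

def material_group(code):
    """Return the coarse family for a raw code, or 'Other' if unmapped."""
    return MATERIAL_GROUPS.get(code, 'Other')

sample['ExteriorGroup'] = sample['Exterior1st'].map(material_group)

# Assumption: equal values in both columns mean a single exterior material.
sample['TwoMaterials'] = (sample['Exterior1st'] != sample['Exterior2nd']).astype(int)

# value_counts gives a first look at how balanced the groups are.
print(sample['ExteriorGroup'].value_counts())
```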

From Iowa State University, regarding siding:
Siding has many attractive qualities, but it simply is not a good insulator. The Federal Trade Commission (FTC)
reports that no type of siding can insulate a home or lower fuel bills. Even siding sold with thin panels called
“backer board” or “drop-in panels” will provide only a very small energy-saving benefit. The FTC warns consumers
not to confuse these siding products with true insulation products, such as fiberglass, cellulose, and rigid plastic
sheet siding.
    

After some research, it turns out that masonry veneer earned sellers more than a 95% increase in returns
MasVnrType: Masonry veneer type (how different is this from Exterior2nd?)    
MasVnrArea: Masonry veneer area in square feet
    
# ExterQual: Exterior material quality  - research
    
ExterCond: Present condition of the material on the exterior (I feel like this feature covers everything that ExterQual does)
    
Foundation: Type of foundation
    
BsmtQual: Height of the basement (this feature has to be encoded differently from the following one, since it measures something different)
    
# BsmtCond: General condition of the basement
    
BsmtExposure: Walkout or garden-level basement walls (it also includes NA for no basement, which BsmtCond captures as well)
    
# BsmtFinType1: Quality of basement finished area
    
BsmtFinSF1: Type 1 finished square feet
    
# BsmtFinType2: Quality of second finished area (if present) - research
    
BsmtFinSF2: Type 2 finished square feet - (research BsmtFinSF1)
    

BsmtUnfSF: Unfinished square feet of basement area (should I encode this negatively to indicate that it is unfinished?)
    
    
# TotalBsmtSF: Total square feet of basement area. I can add type 1, type 2, and unfinished and see whether they equal the total square feet of the basement, but I feel that overall
# people really care about the square footage of the finished and unfinished basement areas
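The sum check mentioned for TotalBsmtSF can be sketched as follows. The rows are made up for illustration, and the names parts and mismatch are introduced here; on the real data, df would replace sample.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'BsmtFinSF1': [706, 0, 400],
    'BsmtFinSF2': [0, 0, 100],
    'BsmtUnfSF': [150, 920, 300],
    'TotalBsmtSF': [856, 920, 800],
})

# Sum the three component areas row by row.
parts = sample[['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF']].sum(axis=1)

# Rows where the components do not add up to the reported total.
mismatch = sample[parts != sample['TotalBsmtSF']]
print(len(mismatch))
```

If no rows mismatch, TotalBsmtSF carries no information beyond the three component columns.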
    
    
Re finishing the basement: I believe that it matters, as buyers tend to factor in the expense of finishing the basement when
they are determining how much to offer for the home.
Although it is not supposed to be counted as living space, most people do count it.


Heating: Type of heating
    
HeatingQC: Heating quality and condition
    
CentralAir: Central air conditioning
    
Electrical: Electrical system
    
1stFlrSF: First Floor square feet
    
2ndFlrSF: Second floor square feet
    
LowQualFinSF: Low quality finished square feet (all floors) - check whether this provides a strong signal 
    
GrLivArea: Above grade (ground) living area square feet
    
BsmtFullBath: Basement full bathrooms
    
BsmtHalfBath: Basement half bathrooms
    
FullBath: Full bathrooms above grade
    
HalfBath: Half baths above grade
    
BedroomAbvGr: Number of bedrooms above basement level
    
KitchenAbvGr: Number of kitchens
    
KitchenQual: Kitchen quality
    
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) - looks important 

Functional: Home functionality rating
    
# Fireplaces: Number of fireplaces - check
    
FireplaceQu: Fireplace quality - categorize into a few
    
GarageType: Garage location - check for signals - maybe categorize them under a few topics
    
# GarageYrBlt: Year garage was built 
    
# GarageFinish: Interior finish of the garage 

GarageCars: Size of garage in car capacity - compare this one with the next one to see which feature is more important
GarageArea: Size of garage in square feet
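One simple way to compare the two garage features is their correlation with SalePrice and with each other. This is only a sketch on made-up numbers, and correlation is an assumed stand-in for "importance"; the names corr_with_price and cars_vs_area are introduced here.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'GarageCars': [1, 2, 2, 3, 3],
    'GarageArea': [280, 480, 520, 640, 840],
    'SalePrice': [120000, 180000, 200000, 250000, 320000],
})

# Pearson correlation of each garage feature with the target.
corr_with_price = sample[['GarageCars', 'GarageArea']].corrwith(sample['SalePrice'])

# Correlation between the two features; a high value suggests redundancy.
cars_vs_area = sample['GarageCars'].corr(sample['GarageArea'])

print(corr_with_price)
print(cars_vs_area)
```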
    
# GarageQual: Garage quality

GarageCond: Garage condition

PavedDrive: Paved driveway

WoodDeckSF: Wood deck area in square feet 
OpenPorchSF: Open porch area in square feet 
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet 
ScreenPorch: Screen porch area in square feet

# PoolArea: Pool area in square feet

PoolQC: Pool quality

Fence: Fence quality - categorize into a few 

MiscFeature: Miscellaneous feature not covered in other categories - check for signal

MiscVal: Dollar value of miscellaneous feature - I could keep the feature as is or categorize it by whether it is above a certain threshold amount

MoSold: Month sold
Why does seasonality matter in the housing market? One review says to list once the weather gets nice where you live, typically spring. This gives
me an idea: instead of encoding all 12 months individually, I could group them into seasons: summer, winter, spring, fall.
This depends entirely on the location.
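The season idea could look like the sketch below. The month-to-season assignment assumes Northern Hemisphere meteorological seasons and would need adjusting for other locations; SEASON_BY_MONTH and season_sold are names introduced for this example.

```python
import pandas as pd

# Assumed Northern Hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'fall', 10: 'fall', 11: 'fall',
}

# MoSold values of the five rows shown in the head() output.
mo_sold = pd.Series([2, 5, 9, 2, 12], name='MoSold')

# Replace each month number with its season label.
season_sold = mo_sold.map(SEASON_BY_MONTH)
print(season_sold.tolist())
```

This reduces 12 possible values to 4, which also means fewer one-hot columns later.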
    

YrSold: Year sold - I think I can find the age of the building, from the time it was constructed to the time it was sold.
How can I include the remodel date of the building? I have to subtract the remodel date from the date it was sold - in order to include the
Is it possible to encode a difference of only a few years with a negative sign?
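The two differences described here can be sketched on made-up rows. HouseAge, YearsSinceRemodel and WasRemodeled are hypothetical column names, and the sketch assumes that YearRemodAdd equals YearBuilt when a house was never remodeled.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'YearBuilt': [2003, 1976, 1915],
    'YearRemodAdd': [2003, 1976, 1970],
    'YrSold': [2008, 2007, 2006],
})

# Age of the building at the time of sale.
sample['HouseAge'] = sample['YrSold'] - sample['YearBuilt']

# Years between the last remodel and the sale.
sample['YearsSinceRemodel'] = sample['YrSold'] - sample['YearRemodAdd']

# Assumption: a remodel year different from the build year marks a remodel.
sample['WasRemodeled'] = (sample['YearRemodAdd'] != sample['YearBuilt']).astype(int)

print(sample[['HouseAge', 'YearsSinceRemodel', 'WasRemodeled']])
```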


SaleType: Type of sale


SaleCondition: Condition of sale
    
    

    
    

    
    
    
What else I know about Ames:
It is a college town, home to Iowa State University.
Lots of students, with plenty of restaurants but not many malls or clothing stores.
Not a great night life in the city.

One of the nicer things about Ames is that it has the amenities of a city but still has the feel of a small town.
Lower cost of living.

