# Problem Statement

An investor has approached my construction company; she thinks there is an opportunity in Ames, Iowa to buy, renovate and sell houses, or to buy land and build from scratch. She'd like to know what the biggest predictors of higher-valued houses are, and whether location matters. If there is an opportunity, she'd like to work with my construction company to begin the work together.

Luckily, there is a data set that can help us answer these questions! The data dictionary used in this data analysis can be found here: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt 

Our success can be validated if we are able to answer the following: 
1. Does location matter? 
2. What features are correlated with higher selling prices?
3. Where should we invest our time and money? On square footage? On overall quality? 

# Familiarity with data and data cleaning

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV,Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

%matplotlib inline

In [None]:
pd.set_option('display.max_columns', 100)

In [None]:
filepath_train_data="./datasets/train.csv"
filepath_test_data="./datasets/test.csv"

df_train = pd.read_csv(filepath_train_data)
df_test = pd.read_csv(filepath_test_data) 

In [None]:
df_train.set_index("Id",inplace=True)
df_test.set_index("Id",inplace=True)

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
df_train.info()

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_test.info()

In [None]:
df_test.describe()

In [None]:
df_train.isnull().sum().sort_values(ascending=False)[:30]

In [None]:
df_test.isnull().sum().sort_values(ascending=False)[:30]

Several columns contain NaN where the value is not actually missing. According to the data dictionary, NA in the following columns means the house does not have that feature (for example, no alley access or no basement), so these should be the category "NA":

- Alley (Nominal)
- Bsmt Qual (Ordinal)
- Bsmt Cond (Ordinal)
- Bsmt Exposure (Ordinal)
- BsmtFin Type 1 (Ordinal)
- BsmtFin Type 2 (Ordinal)
- Fireplace Qu (Ordinal)
- Garage Type (Nominal)
- Garage Finish (Ordinal)
- Garage Qual (Ordinal)
- Garage Cond (Ordinal)
- Pool QC (Ordinal)
- Fence (Ordinal)
- Misc Feature (Nominal)

In [None]:
cols_na= ["Alley","Bsmt Qual","Bsmt Cond","Bsmt Exposure","BsmtFin Type 1","BsmtFin Type 2","Fireplace Qu","Garage Type","Garage Finish","Garage Qual","Garage Cond","Pool QC","Fence","Misc Feature"]

# Assign the filled columns back; calling fillna(inplace=True) on a selected column may not update the DataFrame
df_train[cols_na] = df_train[cols_na].fillna("NA")
df_test[cols_na] = df_test[cols_na].fillna("NA")

In [None]:
#Mas Vnr Type has both "None" and NaN; treat NaN as "None"
df_train["Mas Vnr Type"] = df_train["Mas Vnr Type"].fillna("None")
df_test["Mas Vnr Type"] = df_test["Mas Vnr Type"].fillna("None")

In [None]:
#replace NaN with 0 (no masonry veneer)
df_train["Mas Vnr Area"] = df_train["Mas Vnr Area"].fillna(0)
df_test["Mas Vnr Area"] = df_test["Mas Vnr Area"].fillna(0)

For the remaining numeric columns we can fill NaN values with either the median or the mean; there isn't a big difference between the two for these columns.

In [None]:
df_train["Lot Frontage"].median()

In [None]:
#hard-coded fill value; mean and median are similar here
lot_frontage_fill = 69.05520046484602
df_train["Lot Frontage"] = df_train["Lot Frontage"].fillna(lot_frontage_fill)
df_test["Lot Frontage"] = df_test["Lot Frontage"].fillna(lot_frontage_fill)
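
As a rough illustration of the mean-versus-median choice, the sketch below computes both on the training column and fills both frames with the training statistic instead of a typed-in constant. The frames `toy_train` and `toy_test` and their values are made up for this example; they are not the Ames data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (values are invented for illustration).
toy_train = pd.DataFrame({"Lot Frontage": [60.0, 70.0, np.nan, 80.0, 65.0]})
toy_test = pd.DataFrame({"Lot Frontage": [np.nan, 75.0, 50.0]})

# Compare the two candidate fill values using the training data only.
train_mean = toy_train["Lot Frontage"].mean()
train_median = toy_train["Lot Frontage"].median()
print(f"mean={train_mean:.2f}, median={train_median:.2f}")

# Fill both frames with the training median so the test set never informs the fill value.
toy_train["Lot Frontage"] = toy_train["Lot Frontage"].fillna(train_median)
toy_test["Lot Frontage"] = toy_test["Lot Frontage"].fillna(train_median)
```

Computing the statistic in code keeps the fill value in sync with the data if rows are later dropped or added.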

In [None]:
{final: df_train[final].isnull().sum() for final in df_train.columns if df_train[final].isnull().sum() > 0}

In [None]:
df_train["Garage Yr Blt"].median()

In [None]:
#hard-coded fill value; mean and median are similar here
garage_yr_blt_fill = 1978.7077955601446
df_train["Garage Yr Blt"] = df_train["Garage Yr Blt"].fillna(garage_yr_blt_fill)
df_test["Garage Yr Blt"] = df_test["Garage Yr Blt"].fillna(garage_yr_blt_fill)

In [None]:
{final: df_train[final].isnull().sum() for final in df_train.columns if df_train[final].isnull().sum() > 0}

In [None]:
#drop the rows that still have null values in these columns: only one or two of them
#Series.dropna(inplace=True) on a selected column does not remove rows from the DataFrame,
#so drop by subset on the frame itself
cols_drop_null = ["BsmtFin SF 1", "Bsmt Unf SF", "BsmtFin SF 2", "Total Bsmt SF",
                  "Bsmt Full Bath", "Bsmt Half Bath", "Garage Cars", "Garage Area"]
df_train.dropna(subset=cols_drop_null, inplace=True)
#dropping test rows also removes those houses from the predictions
df_test.dropna(subset=cols_drop_null, inplace=True)

In [None]:
{final: df_train[final].isnull().sum() for final in df_train.columns if df_train[final].isnull().sum() > 0}

In [None]:
{final: df_test[final].isnull().sum() for final in df_test.columns if df_test[final].isnull().sum() > 0}

In [None]:
#rename our columns: lowercase, with underscores instead of spaces
def clean(df):
    df.columns=df.columns.str.lower().str.replace("/ ","_").str.replace(" ","_")
    return df 

In [None]:
df_train=clean(df_train)

In [None]:
df_test=clean(df_test)

In [None]:
#I am converting ordinal columns to numbers here (1 = best, larger = worse) to check if they have an impact on saleprice
conv_dict={'Ex':1.0,'Gd':2.0,'TA':3.0,'Fa':4.0,"Po":5.0,"NA":6.0}
df_train["fireplace_qu"]=df_train["fireplace_qu"].apply(conv_dict.get)

In [None]:
conv_dict_garage={'Ex':1.0,'Gd':2.0,'TA':3.0,'Fa':4.0,"Po":5.0,"NA":6.0}
df_train["garage_qual"]=df_train["garage_qual"].apply(conv_dict_garage.get)

In [None]:
df_test["garage_qual"]=df_test["garage_qual"].apply(conv_dict_garage.get)

In [None]:
df_test["fireplace_qu"]=df_test["fireplace_qu"].apply(conv_dict.get)

In [None]:
conv_dict_exter={'Ex':1.0,'Gd':2.0,'TA':3.0,'Fa':4.0,"Po":5.0}
df_train["exter_qual"]=df_train["exter_qual"].apply(conv_dict_exter.get)

In [None]:
df_test["exter_qual"]=df_test["exter_qual"].apply(conv_dict_exter.get)

In [None]:
conv_dict_bsmt_qual={'Ex':1.0,'Gd':2.0,'TA':3.0,'Fa':4.0,"Po":5.0,"NA":6.0}
df_train["bsmt_qual"]=df_train["bsmt_qual"].apply(conv_dict_bsmt_qual.get)

In [None]:
df_test["bsmt_qual"]=df_test["bsmt_qual"].apply(conv_dict_bsmt_qual.get)

In [None]:
conv_dict_electrical={'SBrkr':1.0,'FuseA':2.0,'FuseF':3.0,'FuseP':4.0,"Mix":5.0}
df_train["electrical"]=df_train["electrical"].apply(conv_dict_electrical.get)

In [None]:
df_test["electrical"]=df_test["electrical"].apply(conv_dict_electrical.get)

In [None]:
conv_dict_functional={'Typ':1.0,'Min1':2.0,'Min2':3.0,'Mod':4.0,"Maj1":5.0,"Maj2":6.0,"Sev":7.0,"Sal":8.0}
df_train["functional"]=df_train["functional"].apply(conv_dict_functional.get)

In [None]:
df_test["functional"]=df_test["functional"].apply(conv_dict_functional.get)

In [None]:
conv_dict_kitchen={'Ex':1.0,'Gd':2.0,'TA':3.0,'Fa':4.0,"Po":5.0}
df_train["kitchen_qual"]=df_train["kitchen_qual"].apply(conv_dict_kitchen.get)

In [None]:
df_test["kitchen_qual"]=df_test["kitchen_qual"].apply(conv_dict_kitchen.get)
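
The conversions above repeat the same pattern for each column and each frame. A minimal sketch of one way to factor that out is shown below; `encode_ordinal`, `QUALITY_SCALE` and the toy frames are hypothetical names introduced only for this illustration. It uses `Series.map`, which leaves any value not present in the mapping as NaN.

```python
import pandas as pd

# Hypothetical shared scale, matching the dictionaries used above (1 = best).
QUALITY_SCALE = {"Ex": 1.0, "Gd": 2.0, "TA": 3.0, "Fa": 4.0, "Po": 5.0, "NA": 6.0}


def encode_ordinal(frames, columns, mapping):
    """Return copies of `frames` with each of `columns` mapped through `mapping`."""
    encoded = []
    for frame in frames:
        frame = frame.copy()  # leave the caller's frame untouched
        for col in columns:
            frame[col] = frame[col].map(mapping)
        encoded.append(frame)
    return encoded


# Toy frames standing in for df_train and df_test.
toy_train = pd.DataFrame({"kitchen_qual": ["Ex", "TA", "Gd"], "bsmt_qual": ["NA", "Gd", "Fa"]})
toy_test = pd.DataFrame({"kitchen_qual": ["Fa", "Gd"], "bsmt_qual": ["TA", "Ex"]})

toy_train_enc, toy_test_enc = encode_ordinal(
    [toy_train, toy_test], ["kitchen_qual", "bsmt_qual"], QUALITY_SCALE
)
print(toy_train_enc)
```

Applying one mapping to both frames in a single call makes it harder to encode train and test differently by accident.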

# EDA

In [None]:
plt.figure(figsize=(10,20))
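#correlate numeric columns only; the remaining text columns cannot be correlated
sns.heatmap(df_train.select_dtypes(include="number").corr()[["saleprice"]].sort_values(by="saleprice",ascending=False), annot=True)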

In [None]:
fig,ax=plt.subplots()
ax.scatter(x=df_train["overall_qual"], y=df_train["saleprice"])
plt.ylabel('saleprice', fontsize=13)
plt.xlabel('overall_qual', fontsize=13)
plt.show()

In [None]:
plt.figure(figsize=(130,40))
ax= sns.boxplot(x=df_train["overall_qual"], y=df_train["saleprice"])
# We see an increase in sale price related to overall quality

ax.set_title('Overall quality and saleprice', size = 80);
ax.set_ylabel('saleprice', size = 80);
ax.set_xlabel('overall quality', size = 80);

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = df_train['gr_liv_area'], y = df_train['saleprice'])
plt.ylabel('saleprice', fontsize=13)
plt.xlabel('gr_liv_area', fontsize=13)
plt.show()
# We see an increase in sale price related to above ground living area 

In [None]:
df_train[(df_train["gr_liv_area"]>4000)]

In [None]:
df_train=df_train.drop(df_train[(df_train["gr_liv_area"]>4000)].index)
#Get rid of outliers here 

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = df_train['gr_liv_area'], y = df_train['saleprice'])
plt.ylabel('saleprice', fontsize=13)
plt.xlabel('gr_liv_area', fontsize=13)
plt.show()
#Visualize what this looks like without outliers 

In [None]:
sns.distplot(df_train["saleprice"])

In [None]:
{final: df_train[final].isnull().mean()*100 for final in df_train.columns if df_train[final].isnull().mean() > 0}

In [None]:
plt.figure(figsize=(100,30))
ax= sns.barplot(x=df_train['neighborhood'],y=df_train['saleprice'])
ax.set_title('Neighborhood by Saleprice', size = 50);
ax.set_ylabel('Price', size = 50);
ax.set_xlabel('Neighborhood', size = 50);

In [None]:
df_train.groupby("neighborhood")["saleprice"].mean().sort_values(ascending=False)
#neighborhood has an impact on price
#let's dummy this!
#A true hypothesis test would be helpful here
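
One possible form for the hypothesis test mentioned above is a one-way ANOVA on sale price across neighborhoods, sketched below with `scipy.stats.f_oneway`. The data here is synthetic (three invented neighborhoods with different mean prices), so the numbers say nothing about Ames; the choice of ANOVA is also an assumption, not something the analysis above ran.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic prices for three made-up neighborhoods with clearly different means.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "neighborhood": np.repeat(["A", "B", "C"], 30),
    "saleprice": np.concatenate([
        rng.normal(150_000, 20_000, 30),
        rng.normal(200_000, 20_000, 30),
        rng.normal(300_000, 20_000, 30),
    ]),
})

# One array of prices per neighborhood, then a one-way ANOVA across the groups.
groups = [g["saleprice"].to_numpy() for _, g in toy.groupby("neighborhood")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F={f_stat:.1f}, p={p_value:.3g}")
```

A small p-value would suggest that mean sale price differs between at least two neighborhoods. ANOVA assumes roughly normal groups with similar variance, which would need checking on the real data.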

In [None]:
plt.figure(figsize=(100,30))
ax= sns.barplot(x=df_train['house_style'],y=df_train['saleprice'])
ax.set_ylabel('Average Price', size = 20);
ax.set_xlabel('House Style', size = 20);

In [None]:
sns.barplot(x=df_train["yr_sold"],y=df_train['saleprice'])
#year sold doesn't look like it has a huge impact

In [None]:
#check columns to see if there is a big difference in saleprice 
ax = df_train.groupby("house_style")["saleprice"].agg([np.mean]).sort_values(by="mean", ascending=False).plot(kind = 'bar')
ax.set_title('House Style by Saleprice', size = 20);
ax.set_ylabel('Average Price', size = 20);
ax.set_xlabel('House Style', size = 20);
ax.tick_params(labelsize = 'large')

In [None]:
#check columns to see if there is a big difference in saleprice 
df_train.groupby("house_style")["saleprice"].agg([np.mean,np.std]).sort_values(by="mean", ascending=False)

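The cells below check several categorical columns to see whether there is a big difference in mean saleprice between their categories. They also report the standard deviation, which shows how spread out prices are within each category.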

In [None]:
df_train.groupby("neighborhood")["saleprice"].agg([np.mean,np.std]).sort_values(by="mean", ascending=False)

In [None]:
df_train.groupby("neighborhood")["saleprice"].mean().sort_values(ascending=False)

In [None]:
df_train.groupby("kitchen_qual")["saleprice"].agg([np.mean,np.std]).sort_values(by="mean", ascending=False)

In [None]:
df_train.groupby("functional")["saleprice"].agg([np.mean,np.std]).sort_values(by="mean", ascending=False)

In [None]:
df_train.groupby("misc_feature")["saleprice"].agg([np.mean,np.std]).sort_values(by="mean", ascending=False)

In [None]:
df_train.groupby("fireplace_qu")["saleprice"].agg([np.mean,np.std,np.median]).sort_values(by="mean", ascending=False)

In [None]:
df_train.dropna(inplace=True)
#dropping the rest 

In [None]:
#overall_cond
fig, ax = plt.subplots()
ax.scatter(x = df_train['overall_cond'], y = df_train['saleprice'])
plt.ylabel('saleprice', fontsize=13)
plt.xlabel('overall_cond', fontsize=13)
plt.show()

# Dummy Variables and Feature Engineering

Using the data dictionary, I identified every column measured in square feet and tried adding them up into a single total to see if this made a difference, and it didn't:
df_train["total_square_feet_below_ground"]=  df_train["total_bsmt_sf"] + df_train["1st_flr_sf"] + df_train["2nd_flr_sf"] + df_train["pool_area"]+df_train["mas_vnr_area"]+df_train["garage_area"]+df_train["lot_area"]+df_train["wood_deck_sf"]+df_train["open_porch_sf"]+df_train["enclosed_porch"]+df_train["3ssn_porch"]+df_train["screen_porch"]

I also attempted feature engineering with the product of overall quality and condition, which did not have an impact either:
df_train["over_all_qual_and_cond"]= df_train["overall_qual"] * df_train["overall_cond"]

In [None]:
sns.distplot(df_train["year_built"])

In [None]:
df_train["year_built"].value_counts().sort_values(ascending=False)

In [None]:
df_train = pd.get_dummies(df_train, columns=['neighborhood'], drop_first=True)

The median and mean prices by neighborhood show that location is important, so we dummy neighborhood!

In [None]:
df_test = pd.get_dummies(df_test, columns=['neighborhood'], drop_first=True)

In [None]:
df_train = pd.get_dummies(df_train, columns=['house_style'], drop_first=True)

In [None]:
neighborhoods_test= [column for column in df_test.columns if "neighborhood" in column ]

In [None]:
neighborhoods= [column for column in df_train.columns if "neighborhood" in column ]

In [None]:
df_test = pd.get_dummies(df_test, columns=['house_style'], drop_first=True)
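
Because `get_dummies` is called separately on the train and test frames, the two can end up with different dummy columns, and `drop_first=True` can even drop a different baseline level in each. A minimal sketch of one way to guard against this is to keep every level on the test side and then reindex to the training columns. The toy frames and neighborhood lists below are invented for illustration.

```python
import pandas as pd

# Toy frames: the test set is missing two of the neighborhoods seen in training.
toy_train = pd.DataFrame({
    "neighborhood": ["Blmngtn", "NAmes", "OldTown", "Veenker"],
    "saleprice": [190_000, 145_000, 125_000, 250_000],
})
toy_test = pd.DataFrame({"neighborhood": ["NAmes", "OldTown", "OldTown"]})

# Training dummies define the feature set (first level dropped as the baseline).
train_dummies = pd.get_dummies(toy_train, columns=["neighborhood"], drop_first=True, dtype=int)
feature_cols = [c for c in train_dummies.columns if c != "saleprice"]

# Keep all levels on the test side, then align to the training columns:
# columns missing from test are added as 0, and extra columns are discarded.
test_dummies = pd.get_dummies(toy_test, columns=["neighborhood"], dtype=int)
test_aligned = test_dummies.reindex(columns=feature_cols, fill_value=0)
print(test_aligned)
```

After alignment, the test frame has exactly the training feature columns in the same order, which is what a fitted model expects at prediction time.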


In [None]:
df_test.head()

In [None]:
house_style_test= [column for column in df_test.columns if "house_style" in column ]
#dummy house style to see if this makes a difference

Initially I tried dummying central air, but did not think this told us much:
central_air_test= [column for column in df_train.columns if "central_air" in column ]

df_train = pd.get_dummies(df_train, columns=['central_air'], drop_first=True)

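Time to create squared and interaction terms with polynomial features! These new columns are strongly correlated with the originals, so they add multicollinearity rather than reduce it; that is one reason to try Lasso and Ridge in the modeling section.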

## Polynomial Features

In [None]:
# Instantiate PolynomialFeatures object to create all squared and two-way interaction terms.
features_to_poly= ["totrms_abvgrd","lot_frontage","lot_area","exter_qual","bsmtfin_sf_1","year_built","total_bsmt_sf","gr_liv_area","garage_cars","garage_area","overall_qual","overall_cond","garage_qual","kitchen_qual","kitchen_abvgr","full_bath","year_remod/add","fireplaces","bsmt_qual"] 
df_train2 = df_train[features_to_poly] 

polynomial_features = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit and transform our X data.
poly_train = polynomial_features.fit_transform(df_train2)

In [None]:
df_train.columns

In [None]:
# Transform into a dataframe
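# get_feature_names_out replaces get_feature_names, which was removed in newer scikit-learn releases
poly_train = pd.DataFrame(poly_train, columns = polynomial_features.get_feature_names_out(df_train2.columns), index=df_train2.index)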
poly_train.head()

In [None]:
# Instantiate PolynomialFeatures object to create all squared and two-way interaction terms for the test set.
features_to_poly_test= ["totrms_abvgrd","lot_frontage","lot_area","exter_qual","bsmtfin_sf_1","year_built","total_bsmt_sf","gr_liv_area","garage_cars","garage_area","overall_qual","overall_cond","garage_qual","kitchen_qual","kitchen_abvgr","full_bath","year_remod/add","fireplaces","bsmt_qual"]
df_test2 = df_test[features_to_poly_test] 

polynomial_features_test = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

# Fit and transform our X data.
poly_test = polynomial_features_test.fit_transform(df_test2)

In [None]:
# Transform into a dataframe
poly_test = pd.DataFrame(poly_test, columns = polynomial_features_test.get_feature_names_out(df_test2.columns), index=df_test2.index)

In [None]:
poly_test.head()

In [None]:
poly_train['saleprice'] = df_train['saleprice']

In [None]:
poly_train.corr()[["saleprice"]].sort_values(by="saleprice",ascending=False)[60:95]

In [None]:
poly_train.corr()[["saleprice"]].sort_values(by="saleprice",ascending=False)[-25:]

In [None]:
plt.figure(figsize=(35,55))
sns.heatmap(poly_train.corr()[["saleprice"]].sort_values(by="saleprice",ascending=False), annot=True)


In [None]:
# list comprehension to get all columns regarding neighborhood
neighborhoods= [col for col in df_train.columns if col.find('neighborhood') != -1]
# for each column in above list
for col in neighborhoods:
    # add those columns from df_train to poly_train
    poly_train[col] = df_train[col]
poly_train.columns

In [None]:
# list comprehension to get all columns regarding house style
house_style = [col for col in df_train.columns if col.find('house_style') != -1]
# for each column in above list
for col in house_style:
    # add those columns from df_train to poly_train
    poly_train[col] = df_train[col]
poly_train.columns

In [None]:
# list comprehension to get all columns regarding neighborhood in the test set
neighborhoods= [col for col in df_test.columns if col.find('neighborhood') != -1]
# for each column in above list
for col in neighborhoods:
    # add those columns from df_test to poly_test
    poly_test[col] = df_test[col]
poly_test.columns

In [None]:
# list comprehension to get all columns regarding house style in the test set
house_style = [col for col in df_test.columns if col.find('house_style') != -1]
# for each column in above list
for col in house_style:
    # add those columns from df_test to poly_test
    poly_test[col] = df_test[col]
poly_test.columns

In [None]:
new_features_corr=[ 
"gr_liv_area overall_qual",
"overall_qual^2",
"total_bsmt_sf overall_qual",
"garage_area overall_qual",
"garage_cars overall_qual",
"total_bsmt_sf gr_liv_area",
"year_built overall_qual",
"totrms_abvgrd overall_qual",
"overall_qual year_remod/add",
"gr_liv_area garage_area",
"gr_liv_area garage_cars",
"overall_qual",
"total_bsmt_sf garage_cars",
"total_bsmt_sf garage_area",
"overall_qual full_bath",
"totrms_abvgrd total_bsmt_sf",
"garage_area full_bath",
"year_built gr_liv_area",
"total_bsmt_sf full_bath",
"totrms_abvgrd garage_area",
"gr_liv_area year_remod/add",
"totrms_abvgrd garage_cars",
"gr_liv_area",
"lot_frontage overall_qual",
"garage_cars full_bath",
"gr_liv_area^2",
"gr_liv_area full_bath",
"garage_cars garage_area",
"total_bsmt_sf^2",
"year_built total_bsmt_sf",
"total_bsmt_sf year_remod/add",
"total_bsmt_sf",
"year_built garage_area",
"lot_frontage gr_liv_area",
"garage_cars^2",
"garage_area year_remod/add",
"lot_frontage total_bsmt_sf",
"year_built garage_cars",
"garage_cars year_remod/add",
"garage_area^2",
"garage_area",
"garage_area fireplaces",
"totrms_abvgrd gr_liv_area",
"garage_cars",
"lot_frontage garage_cars",
"lot_frontage garage_area",
"garage_cars fireplaces",
"total_bsmt_sf fireplaces",
"garage_area garage_qual",
"overall_qual kitchen_abvgr",
"year_built year_remod/add",
"overall_qual fireplaces",
"bsmtfin_sf_1 gr_liv_area",
"garage_cars garage_qual",
"gr_liv_area fireplaces",
"totrms_abvgrd full_bath",
"lot_frontage full_bath",
"total_bsmt_sf overall_cond",
"bsmtfin_sf_1 garage_area",
"bsmtfin_sf_1 garage_cars",
"bsmtfin_sf_1 overall_qual",
"full_bath fireplaces",
"year_built^2",
"gr_liv_area overall_cond",
"garage_cars overall_cond",
"year_built",
"bsmtfin_sf_1 total_bsmt_sf",
"garage_area overall_cond",
"overall_qual overall_cond",
"lot_area overall_qual",
"bsmtfin_sf_1 full_bath",
"totrms_abvgrd fireplaces",
"year_built full_bath",
"year_remod/add^2",
"year_remod/add",
"totrms_abvgrd bsmtfin_sf_1",
"full_bath year_remod/add",
"totrms_abvgrd year_built",
"garage_area kitchen_abvgr",
"full_bath",
"lot_area garage_area",
"lot_area garage_cars",
"full_bath^2",
"bsmtfin_sf_1 fireplaces",
"bsmtfin_sf_1^2",
"totrms_abvgrd year_remod/add",
"totrms_abvgrd lot_frontage",
"lot_frontage fireplaces",
"total_bsmt_sf kitchen_abvgr",
"overall_qual garage_qual",
"lot_area total_bsmt_sf",
"totrms_abvgrd",
"lot_area gr_liv_area",
"total_bsmt_sf garage_qual",
"bsmt_qual^2",
"overall_cond bsmt_qual",
"exter_qual overall_cond",
"exter_qual kitchen_abvgr",
"kitchen_qual kitchen_abvgr",
"overall_cond kitchen_qual",
"garage_qual bsmt_qual",
"exter_qual garage_qual",
"garage_qual kitchen_qual",
"year_built bsmt_qual",
"year_remod/add bsmt_qual",
"bsmt_qual",
"kitchen_qual^2",
"kitchen_qual bsmt_qual",
"exter_qual bsmt_qual",
"year_built kitchen_qual",
"exter_qual^2",
"kitchen_qual year_remod/add",
"kitchen_qual",
"exter_qual year_built",
"exter_qual year_remod/add",
"exter_qual",
"exter_qual kitchen_qual",'neighborhood_Blueste', 'neighborhood_BrDale', 'neighborhood_BrkSide',
       'neighborhood_ClearCr', 'neighborhood_CollgCr', 'neighborhood_Crawfor',
       'neighborhood_Edwards', 'neighborhood_Gilbert', 'neighborhood_Greens', 'neighborhood_IDOTRR',
       'neighborhood_MeadowV', 'neighborhood_Mitchel', 'neighborhood_NAmes',
       'neighborhood_NPkVill', 'neighborhood_NWAmes', 'neighborhood_NoRidge',
       'neighborhood_NridgHt', 'neighborhood_OldTown', 'neighborhood_SWISU',
       'neighborhood_Sawyer', 'neighborhood_SawyerW', 'neighborhood_Somerst',
       'neighborhood_StoneBr', 'neighborhood_Timber', 'neighborhood_Veenker',
       'house_style_1.5Unf', 'house_style_1Story', 'house_style_2.5Fin',
       'house_style_2.5Unf', 'house_style_2Story', 'house_style_SFoyer',
       'house_style_SLvl'
]

In [None]:
len(new_features_corr)

# Modeling

Linear Regression: I started with a linear regression model because this data (for the most part) follows the MLR assumptions, and because we are predicting a continuous outcome (saleprice) while also wanting to see the impact of each feature on saleprice. To guard against overfitting, I've also used Lasso and Ridge. There isn't a significant difference between the scores.

From the regression lab:

SLR AND MLR:

- Linearity: Y must have an approximately linear relationship with each independent X_i.
- Independence: Errors (residuals) e_i and e_j must be independent of one another for any i != j.
- Normality: The errors (residuals) follow a Normal distribution.
- Equality of Variances: The errors (residuals) should have a roughly consistent spread, regardless of the value of the X_i. (There should be no discernible relationship between any X_i and the residuals.)

MLR ONLY:
- Independence Part 2: The independent variables X_i and X_j must be independent of one another for any i != j

Potential downfalls of this model are that it could be too simplistic: there are many factors that go into the sale price of a house that this model might not take into account, such as the economy. We also risk overfitting, in which case the model fits noise in the training data instead of showing what is truly important to the saleprice of a house.
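
The "Independence Part 2" assumption can be checked numerically with variance inflation factors (VIF). The sketch below is one way to do this with scikit-learn alone: each feature is regressed on the others and VIF = 1 / (1 - R²). The helper name `vif_table` and the synthetic columns are assumptions made for this example; a common rule of thumb treats values well above 5 to 10 as a sign of strong multicollinearity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(features: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column: regress each column on all the others."""
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        vifs[col] = np.inf if r2 >= 1 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)


# Synthetic data: room count is built from living area, lot frontage is independent.
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, 200)
toy = pd.DataFrame({
    "gr_liv_area": area,
    "totrms_abvgrd": area / 250 + rng.normal(0, 0.3, 200),
    "lot_frontage": rng.normal(70, 10, 200),
})

vif = vif_table(toy)
print(vif)
```

On the real feature set, the squared and interaction terms would be expected to show very large values, since each is built from columns that are also in the model.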

In [None]:
X = poly_train[new_features_corr]
y = poly_train["saleprice"]
#Did not train-test-split because we want to train our model with the most data we can (advice from Matt during class); cross-validation gives the out-of-sample estimate instead.

In [None]:
lr=LinearRegression()

In [None]:
cross_val_score(lr,X,y,cv=5).mean()

In [None]:
lr.fit(X,y)

In [None]:
lr.score(X,y)

In [None]:
def r2_adj(X,y): 
    lr= LinearRegression()
    model=lr.fit(X,y)
    r_squared=model.score(X,y)
    adjusted_r2_formula= 1 - (1 - r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
    print(adjusted_r2_formula)

In [None]:
r2_adj(X,y)
#look at adjusted r2 as well to make sure we aren't including any "noise"
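
For reference, the quantity that `r2_adj` computes is the usual adjusted R², written here with n for the number of rows (`len(y)`) and p for the number of features (`X.shape[1]`); the symbols n and p are introduced only for this formula:

```latex
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the penalty grows with p, adding a feature raises the adjusted value only if it improves R² by more than would be expected by chance.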

In [None]:
df_train.shape

# Using StandardScaler and Lasso

In [None]:
#Use Lasso or Ridge to prevent data being overfit

In [None]:
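# Scale inside a pipeline so the same scaling is applied when fitting, scoring and predicting
from sklearn.pipeline import make_pipeline

r_alphas = np.logspace(0, 5, 100)

lasso_model = make_pipeline(StandardScaler(), LassoCV(alphas=r_alphas, cv=5, max_iter=5000))

In [None]:
X.shape

In [None]:
lasso_model = lasso_model.fit(X, y)

In [None]:
# the chosen alpha lives on the LassoCV step of the pipeline
lasso_model.named_steps["lassocv"].alpha_

In [None]:
print(lasso_model.score(X, y))
#not a significant difference in score between this and linear regression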

# Using Ridge

In [None]:
# Instantiate.
ridge_model = Ridge(alpha=10)

# Fit.
ridge_model.fit(X, y)

# Evaluate model using R2.
print(ridge_model.score(X, y))
#not a significant difference in score between this and linear regression
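
The Ridge model above uses a fixed `alpha=10`. As a hedged alternative, the sketch below lets `RidgeCV` (imported at the top of the notebook but not used) choose alpha by cross-validation, with scaling handled in a pipeline. The data comes from `make_regression` and the alpha grid is an arbitrary choice for illustration, so neither reflects the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for X and y.
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate penalties on a log scale; RidgeCV picks the best one with 5-fold CV.
alphas = np.logspace(-2, 3, 50)
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge_cv.fit(X_toy, y_toy)

best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
train_r2 = ridge_cv.score(X_toy, y_toy)
print(f"alpha={best_alpha:.3g}, R2={train_r2:.3f}")
```

Scaling matters for Ridge and Lasso because the penalty is applied to coefficient size, which depends on the units of each feature.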

# Kaggle Submission

In [None]:
# copy so that adding the prediction column below does not modify a view of poly_test
X_kaggle = poly_test[new_features_corr].copy()

In [None]:
X_kaggle['saleprice'] = lasso_model.predict(X_kaggle)
X_kaggle.head()

In [None]:
output = X_kaggle[['saleprice']]
output.head()

In [None]:
# Saving our predictions to our datasets folder
output.to_csv("./datasets/my_first_submission.csv")

# Conclusion

From the data, it seems like there is potential for us to make a profit from buying, renovating and selling houses in Ames, Iowa. I believe the investor was right in coming to me and my team with this opportunity.

The features that seemed to have the most impact on saleprice throughout the data are: 

- The living area in square feet, both above ground and below ground (I believe house style fits in with this as well, since it contributes to the size of the house)
- The overall quality and condition 
- Garage area and cars 
- The year remod/add 
- The neighborhood

Features that negatively affected prices:
- Low kitchen quality and basement quality

I believe we can be profitable if we focus our attention on: 

- Buying houses that are big (higher square footage) in neighborhoods where houses are higher valued in general
- Renovating each house so that it ranks high in quality and condition
- Ensuring garage quality is up to par and, if a house does not have a garage, building one that can fit at least two cars

For further investigation, we would need to do some hypothesis testing. I am going to continue feature engineering to see if we've missed any important features that, when put together, impact price. I'd also like to experiment with cleaning the data differently; perhaps filling missing values with the mean in some columns wasn't the best approach. I would also add the above ground square footage and below ground square footage to get the total square footage, and add this as a potential feature in my model.
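
The total square footage idea could be sketched as follows. The column names follow the cleaned names used earlier in the notebook, while `total_sf`, the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Three invented houses with basement area, above ground living area and price.
toy = pd.DataFrame({
    "total_bsmt_sf": [800.0, 0.0, 1200.0],
    "gr_liv_area": [1500.0, 900.0, 2100.0],
    "saleprice": [180_000, 95_000, 310_000],
})

# Hypothetical engineered feature: below ground plus above ground square footage.
toy["total_sf"] = toy["total_bsmt_sf"] + toy["gr_liv_area"]

# Compare how each size measure correlates with price.
corr_with_price = toy[["total_bsmt_sf", "gr_liv_area", "total_sf"]].corrwith(toy["saleprice"])
print(corr_with_price)
```

On the real data, the new column would only be worth keeping if it improves the cross-validated score beyond what its two components already provide.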

We were limited by missing information such as the average age of each neighborhood (Investopedia indicated that this is an important factor in sale price). It would also have been better to obtain a dataset with more concise information.