## Introduction

The goal of this notebook is to predict house prices and, along the way, to familiarize myself with data analysis in Python.

*Note: Some code may come from other kernels, used here for practice and as a workflow reference.*

## Modeling Framework

1. Framing the problem
    - What are we trying to solve?
    - Understand the problem and ask questions
2. Collecting Relevant Information & Data
    - What type of data do we have?
    - What other requirements are there?
    - What is considered a success for this problem?
3. Process for analysis (Preprocessing & Cleaning)
    - What does the data structure look like?
    - Is the data usable?
    - Can the data be plotted?
    - What changes do we need to make for the data to be usable?
    - What type of predictive problem are we trying to answer?
        - Models we can use
        - What are the model inputs? Figure out the data inputs needed for the model to work.
    - Check for common errors like missing values, corrupted values, dates
4. Explore the data (Exploratory Data Analysis)
    - What does the data look like?
    - Are there any patterns?
        - Identify using summary statistics, plotting, counting, etc.
    - Get familiar with the data
5. Feature Engineering (Applied Machine Learning)
    - Can we create more features that will be helpful to the model?
6. Statistical Analysis
    - Univariate, bivariate, multivariate analysis of features
7. Modeling & Scoring
    - Splitting the data into train/test
    - Standardizing data
    - What models are appropriate?
        - Regression
    - Pre-Tuning
    - Cross-Validating
    - Hyperparameter Tuning
    - Tuned Models Using Best Hyperparameters
    - Comparison: Pre-Tuning vs. CV vs. Tuned Model vs. CV Tuned 
8. Evaluation
    - How accurate are the models?
    - What evaluation metric is appropriate?
    - Is the model good enough?
    - Iteration needed?
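
Steps 7 and 8 of the framework can be sketched roughly as follows. This is a minimal illustration on synthetic data, assuming scikit-learn and NumPy are installed. The `Ridge` model, the `alpha` grid, the 80/20 split, and the 5-fold setting are assumptions chosen for the example, not choices made by this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a regression dataset (NOT the house price data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Splitting the data into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardizing inside a pipeline, so the scaler is fit on training folds only
model = make_pipeline(StandardScaler(), Ridge())

# Pre-tuning: cross-validated RMSE with default hyperparameters
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
pre_tuning_rmse = -cv_scores.mean()

# Hyperparameter tuning over an illustrative grid
grid = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
tuned_cv_rmse = -grid.best_score_

# Evaluation: tuned model scored once on the held-out test set
test_rmse = np.sqrt(mean_squared_error(y_test, grid.predict(X_test)))

print("Pre-tuning CV RMSE:", pre_tuning_rmse)
print("Tuned CV RMSE:", tuned_cv_rmse)
print("Best alpha:", grid.best_params_["ridge__alpha"])
print("Test RMSE:", test_rmse)
```

Keeping the scaler inside the pipeline means each cross-validation fold standardizes with statistics from its own training portion, which avoids leaking information from the validation fold.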

### Framing the problem

Predict house prices in Ames, Iowa.

#### Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
    
### Method

Since we are predicting price (a numerical target), this is a regression problem, so we will focus on algorithms such as linear regression, regularized linear models, random forests, and various boosting algorithms. We can also try ensemble methods such as stacking or blending, which combine the predictions of several base models, typically through a meta-learner.
- The evaluation metric will be Root-Mean-Squared-Error (RMSE)
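
To make the metric concrete, here is a small sketch of RMSE using only the standard library. The function name `rmse` and the prices in the example are made up for illustration; they are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up sale prices and predictions, for illustration only
actual = [200000.0, 150000.0, 300000.0]
predicted = [210000.0, 140000.0, 330000.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, one large miss on an expensive house weighs more than several small misses. If errors on cheap and expensive houses should count more equally, one option is to pass log-transformed prices to the same function.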

### Get relevant data

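# Set the data directory (a Windows path built from the current user's login name)
import os
path = os.path.join('C:\\Users', os.getlogin(), 'Documents', 'Programming',
                    'Python', 'MachineLearning', 'Data')

# Move to the directory with the data
os.chdir(path)

# Check the working directory and list its files
# (in a notebook, only the value of the last expression is displayed)
os.getcwd()
os.listdir()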

['01-ign.csv',
 '02-winequality-red.csv',
 '02-winequality-white.csv',
 '03-thanksgiving-2015-poll-data.csv',
 '05-ibm-sales-loss.csv',
 '07-test.csv',
 '07-train.csv',
 '09-house-train.csv']
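
The cell above relies on a Windows-specific path and on changing the working directory. As an alternative sketch, `pathlib` can build the path from the user's home directory and list the CSV files without calling `os.chdir`. The `data_dir` layout below mirrors the path used above and is an assumption about where the files live; `list_csv_files` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def list_csv_files(data_dir):
    """Return the sorted names of the CSV files directly inside data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv"))


# Assumed layout, mirroring the path used in the cell above
data_dir = Path.home() / "Documents" / "Programming" / "Python" / "MachineLearning" / "Data"

if data_dir.is_dir():
    print(list_csv_files(data_dir))
else:
    print(f"Data directory not found: {data_dir}")
```

Leaving the working directory unchanged keeps later relative paths predictable, at the cost of passing `data_dir` explicitly whenever a file is opened.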