# Software Notes - General Concepts for Model Building

Prepared for ISyE 4031 <br>
Brandon Kang <br>
brandonkang@gatech.edu

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
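# Optional: style matplotlib figures to match a dark Jupyter theme (requires the jupyterthemes package)
from jupyterthemes import jtplot
jtplot.style(theme='onedork')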

These software notes walk through several steps that outline a general pipeline you should try to follow when building models. Don't think of these as concrete steps but rather as a general guideline for analyzing your data and building your models.
1. Obtain and Clean Data
    * Unfortunately, cleaning data is the most tedious and least fun part of model building, but it is also one of the most important. We won't go through it here because it is highly domain- and data-dependent.
2. Exploratory Data Analysis
    * This includes understanding your data (mean, spread, outliers, missing values, correlations, etc.) through visualizations and analysis. If you don't understand your data, you won't be able to build a good model.
3. Feature Engineering
    * After understanding your data and gathering more domain expertise, you may be able to "engineer" new features that improve your model.
4. Model Training
    * This includes training, comparing models, hyperparameter tuning, cross validation, etc.
5. Validate Model
    * Assess how your model performs on the testing set using your performance metrics, and plot learning curves to check for signs of overfitting.
6. Prediction and Interpretation
    * Use your model to predict future observations.
    * Also, understand how you can interpret your model in the context of the problem.
        * Is your chosen model comprehensible? For example, you may be able to develop a highly sophisticated deep learning model that performs well, but perhaps a simple linear regression model performs fairly well and can be easily interpreted with p-values and coefficients. Which do you prefer/need and why?
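
Steps 4 and 5 above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the data are synthetic (not the poverty data set), the models are polynomial least-squares fits done with NumPy, and the names `design_matrix`, `rmse`, and `results` are made up for this sketch. It holds out a testing set, fits a simple and a more flexible model, and compares training and testing error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a linear signal plus noise.
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.5, size=n)

# Hold out 25% of the rows as a testing set.
shuffled = rng.permutation(n)
test_idx, train_idx = shuffled[: n // 4], shuffled[n // 4 :]


def design_matrix(values, degree):
    """Return the columns 1, x, x^2, ..., x^degree."""
    return np.vander(values, degree + 1, increasing=True)


def rmse(X, coef, target):
    """Root mean squared error of the predictions X @ coef."""
    return float(np.sqrt(np.mean((X @ coef - target) ** 2)))


# Fit a simple (degree 1) and a flexible (degree 8) model on the training rows only.
results = {}
for degree in (1, 8):
    X_train = design_matrix(x[train_idx], degree)
    X_test = design_matrix(x[test_idx], degree)
    coef, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
    results[degree] = {
        "coef": coef,
        "train_rmse": rmse(X_train, coef, y[train_idx]),
        "test_rmse": rmse(X_test, coef, y[test_idx]),
    }
    print(
        f"degree {degree}: "
        f"train RMSE = {results[degree]['train_rmse']:.3f}, "
        f"test RMSE = {results[degree]['test_rmse']:.3f}"
    )
```

Because the degree 1 model is contained in the degree 8 model, the flexible fit can never have a higher training error, so training error alone cannot tell you which model to prefer. A testing error that is noticeably worse than the training error is the sign of overfitting to look for. The degree 1 fit is also the easier one to interpret: `results[1]["coef"]` holds just an intercept and a slope.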

In [5]:
# Load the poverty data set and preview the first five rows
dfPoverty = pd.read_csv("poverty.csv")
dfPoverty.head()

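| Unnamed: 0 | Id | v2a1 | hacdor | rooms | hacapo | v14a | refrig | v18q | v18q1 | r4h1 | ... | SQBescolari | SQBage | SQBhogar_total | SQBedjefe | SQBhogar_nin | SQBovercrowding | SQBdependency | SQBmeaned | agesq | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_279628684 | 190000.0 | 0 | 3 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 100 | 1849 | 1 | 100 | 0 | 1.0 | 0.0 | 100.0 | 1849 | 4 |
| 1 | ID_f29eb3ddd | 135000.0 | 0 | 4 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 144 | 4489 | 1 | 144 | 0 | 1.0 | 64.0 | 144.0 | 4489 | 4 |
| 2 | ID_68de51c94 | NaN | 0 | 8 | 0 | 1 | 1 | 0 | NaN | 0 | ... | 121 | 8464 | 1 | 0 | 0 | 0.25 | 64.0 | 121.0 | 8464 | 4 |
| 3 | ID_d671db89c | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 81 | 289 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 289 | 4 |
| 4 | ID_d56d6f5f5 | 180000.0 | 0 | 5 | 0 | 1 | 1 | 1 | 1.0 | 0 | ... | 121 | 1369 | 16 | 121 | 4 | 1.777778 | 1.0 | 121.0 | 1369 | 4 |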
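
The preview already shows missing values, so a natural first step of exploratory data analysis is to count them. The sketch below is a minimal example, assuming only what the preview shows: it rebuilds a handful of the previewed columns by hand (the name `preview` is made up here) so that it runs without `poverty.csv`. On the real data you would call the same methods on `dfPoverty`.

```python
import numpy as np
import pandas as pd

# A few columns from the five previewed rows, typed in by hand for illustration.
preview = pd.DataFrame({
    "Id": ["ID_279628684", "ID_f29eb3ddd", "ID_68de51c94",
           "ID_d671db89c", "ID_d56d6f5f5"],
    "v2a1": [190000.0, 135000.0, np.nan, 180000.0, 180000.0],
    "rooms": [3, 4, 8, 5, 5],
    "v18q": [0, 1, 0, 1, 1],
    "v18q1": [np.nan, 1.0, np.nan, 1.0, 1.0],
    "Target": [4, 4, 4, 4, 4],
})

# Number and share of missing values per column.
missing_counts = preview.isna().sum()
missing_share = preview.isna().mean()
print(missing_counts[missing_counts > 0])

# Mean, spread, and quartiles of the numeric columns (NaN values are skipped).
summary = preview.describe()
print(summary.loc[["count", "mean", "std"]])

# Check whether v18q1 is missing in every row where v18q is 0.
v18q1_missing_when_v18q_zero = bool(
    preview.loc[preview["v18q"] == 0, "v18q1"].isna().all()
)
print(v18q1_missing_when_v18q_zero)
```

In these five rows, `v18q1` is missing exactly where `v18q` is 0. Whether that pattern holds across the full file is something to verify with `dfPoverty` before deciding how to handle the missing values.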
