---
# Process Outline & Objectives

    * OSEMN: OBTAIN > SCRUB > EXPLORE > MODEL > INTERPRET

---
### Column Names and Descriptions for King County Data Set

#### SET_INDEX
* **id** - unique identifier for a house

#### TARGET
* **price** - Price; the prediction target

#### DATETIME
* **date** - Date house was sold
* **yr_built** - Built Year

#### BOOL
* **yr_renovated** - Year when house was renovated
* **waterfront** - House which has a view to a waterfront

#### CONTINUOUS
* **sqft_living** - square footage of the home
(drop or combine with sqft_living if both == higher coef)
* **sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors
* **sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors
* **sqft_lot** - square footage of the lot
* **sqft_above** - square footage of house apart from basement
* **sqft_basement** - square footage of the basement


#### CATEGORICAL

* **grade** - overall grade given to the housing unit, based on King County grading system
(drop because of multicollinearity - or combine with grade if higher coef)
* **condition** - How good the condition is ( Overall )
* **bathrooms** - Number of bathrooms/bedrooms
* **bedrooms** - Number of bedrooms/house
* **floors** - Total floors (levels) in house
* **view** - Has been viewed

* **zipcode** - zip
(drop - redundant with zipcode)
* **lat** - Latitude coordinate
* **long** - Longitude coordinate
---


## 1 - OBTAIN

   ### 1.1 - Import Data, Libraries, Inspect Data Types
   Obtain data and review data types, etc.
       * Display header and info
           * df.head()        
           * df.info()
           
       
   functions:
       * def check_column(series, nlargest):
       * def log_z(col):
       * def rem_out_z(col_name):
       * def multiplot(df):
       * def plot_hist_scat(df,target,stats):
       * def plot_hist_scat_sns(df,target,stats):
       * def detect_outliers(df,n,features): (using IQRs)
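
The outline only names the helper functions, so the following is a minimal sketch of what `detect_outliers(df, n, features)` could look like under the IQR rule. The body, the 1.5 × IQR fence, and the toy values are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd


def detect_outliers(df, n, features):
    """Return index labels of rows that are IQR outliers in more than `n` features.

    Hypothetical body: the source only gives the signature and "(using IQRs)".
    """
    counts = pd.Series(0, index=df.index)
    for col in features:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        step = 1.5 * (q3 - q1)  # conventional 1.5 * IQR fence (assumed)
        is_outlier = (df[col] < q1 - step) | (df[col] > q3 + step)
        counts += is_outlier.astype(int)
    return df.index[counts > n].tolist()


# Made-up toy data, not taken from the King County file
toy = pd.DataFrame({
    "price": [200, 210, 220, 230, 240, 5000],
    "sqft_living": [1000, 1100, 1200, 1300, 1400, 9000],
})
outlier_rows = detect_outliers(toy, n=1, features=["price", "sqft_living"])
```

Requiring more than `n` outlying features per row keeps a single extreme value from discarding an otherwise ordinary house.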

## 2 - SCRUB 

##### Scrub 1 : categorizing / casting data types

**Q1: Which predictors should be analyzed as continuous data vs. binned/categorical data?**

+ preliminary analysis, data casting, and visualizations
+ check for linearity, normal distributions

### Review initial data summaries

       * Check and cast data types
           * categorical variables stored as integers
           * numbers stored as objects
           * odd values (lots of 0's, strings that can't be converted, etc)
               * df.info()
               * df[col].unique() / df.nunique()
               * df.isna().sum()
               * df.describe() - min/max, etc.
               * df.set_index('id')
               * df[col].value_counts()
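
The checks above can be sketched on a small frame. The sample rows are made up, and the `"?"` entry is only a stand-in for "strings that can't be converted"; the column names follow the data dictionary at the top of this outline.

```python
import pandas as pd

# Made-up sample rows for illustration only
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sqft_basement": ["0.0", "400.0", "?", "0.0"],  # numbers stored as objects
    "zipcode": [98178, 98125, 98178, 98028],        # categorical stored as integer
})

df = raw.set_index("id")

# Quick summaries: distinct values per column and most frequent values
n_unique = df.nunique()
top_values = df["sqft_basement"].value_counts()

# Cast: unconvertible strings become NaN instead of raising
df["sqft_basement"] = pd.to_numeric(df["sqft_basement"], errors="coerce")
df["zipcode"] = df["zipcode"].astype("category")

n_missing = df.isna().sum()
```

Using `errors="coerce"` turns odd values into NaN, so they show up in `df.isna().sum()` and can be handled with the other missing values in the next step.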

##### Scrub  2 : Null / Missing Values

**Q2: How do we deal with missing values?**
 
+ recast data types, remove null values

          
          * Identifying and removing **NULL VALUES**: 
              * df.isna().sum()
          * Drop null rows or columns as appropriate
              * df.drop() / df.drop(['col1','col2'],axis=1)
                   * drop sqft_basement (most values = 0.0)
                   * drop date
          * Coarse Binning NUMERICAL Data
              * replace with median or bin/convert to categorical
                   * bin yr_built
                   * bin sqft_above
          
          * CATEGORICAL data: 
              * make NaN own category OR replace with most common category
              * Fill in null values and recast variables for EDA
                   * zipcode --> coded
                   * View --> category
                   * Waterfront --> boolean
                   * yr_renovated --> is_reno (boolean)
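
A minimal sketch of the fills and recasts listed above. The sample values, the bin edges for `yr_built`, and the choice to treat a missing `waterfront` or `yr_renovated` as "no" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration only
df = pd.DataFrame({
    "waterfront": [0.0, np.nan, 1.0, 0.0],
    "view": [0.0, 2.0, np.nan, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
    "yr_built": [1955, 1951, 1933, 2009],
    "zipcode": [98178, 98125, 98028, 98178],
})

# Categorical: replace NaN with the most common category, then recast
df["view"] = df["view"].fillna(df["view"].mode()[0]).astype("category")

# Boolean flags: missing is assumed to mean "no" here
df["waterfront"] = df["waterfront"].fillna(0).astype(bool)
df["is_reno"] = df["yr_renovated"].fillna(0) > 0
df = df.drop(["yr_renovated"], axis=1)

# zipcode --> integer codes
df["zipcode"] = df["zipcode"].astype("category").cat.codes

# Coarse binning of a numerical column (bin edges are arbitrary examples)
df["yr_built_bin"] = pd.cut(df["yr_built"], bins=[1900, 1950, 2000, 2020])
```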


##### Scrub 3: Multicollinearity

**Q3: which predictors are closely related (and should be dropped)?**
    + multicollinearity: one-hot dummy variables, data dropping
    + remove variable having most corr with largest # of variables

        * Checking for Multicollinearity
        * use seaborn to make correlation matrix plot
        * threshold >= 0.5 corr (rank correlations -- build custom function?) 
        * one-hot dummy variables, and data dropping
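
One possible shape for the custom ranking function mentioned above, assuming the 0.5 threshold from this outline. The name `rank_correlations` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_correlations(df, threshold=0.5):
    """Return column pairs whose absolute correlation is >= threshold, strongest first.

    Hypothetical helper; the outline only suggests building one.
    """
    corr = df.corr().abs()
    # Keep the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Made-up toy data
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "sqft_above": [900, 1400, 1800, 2400, 2900],
    "condition": [3, 5, 3, 5, 3],
})
high = rank_correlations(toy, threshold=0.5)
```

The same `toy.corr()` matrix can be passed to seaborn's `heatmap` for the correlation plot; the ranked pairs then indicate which predictor to consider dropping first.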


## 3 - EXPLORE

##### EDA 1 : pre-normalization/transformation

**Q4: Which categorical variables show the greatest potential as predictors?**
Check distributions, outliers, etc
Check scales, ranges (df.describe())
Check histograms to get an idea of distributions and data transformations to perform
    df.hist() 
    
    Can also do kernel density estimates
       + Re-check for linearity, normal distributions
       + scatterplots to check for linearity and possible categorical variables 
            * df.plot(kind='scatter')
            * categoricals will look like vertical lines
            * pd.plotting.scatter_matrix to visualize possible relationships
            * Check for linearity

    **Q5: Does removal of outliers improve the distributions?**
       * Outlier removal >> visualization
            * Filling in df_norm
            * Examine basic descriptive stats
            * Visualizing numerical data
            * Visualizing categorical data
                * BOX PLOTS
                    IQR / Percentiles
                * VIOLIN PLOTS

       * NORMALIZING & TRANSFORMING
           * Normalize data (may want to do after some exploring)
               * Most popular is Z-scoring (but won't fix skew)
           * Can log-transform to fix skewed data
               * (RobustScaler)
           * CHECKING NORMALIZED DATASET
           * Recheck multicollinearity
           * CAT.CODES FOR BINNED DATA
           * Concatenate final df for modeling (df_run)
           * Saving/loading df_run after cleaning up
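
A short sketch of the transform steps above: a log transform to reduce skew, followed by z-scoring, plus `cat.codes` for a binned column. The body of `log_z` is a guess at what the helper named in section 1.1 does, and all values are made up.

```python
import numpy as np
import pandas as pd


def log_z(col):
    """Log-transform a strictly positive Series, then z-score it (assumed behavior)."""
    logged = np.log(col)
    return (logged - logged.mean()) / logged.std()


# Made-up, right-skewed prices
price = pd.Series([1e5, 2e5, 4e5, 8e5, 1.6e6], name="price")
price_z = log_z(price)

# Z-scoring alone rescales but leaves the skew unchanged
price_z_only = (price - price.mean()) / price.std()

# Integer codes for binned data (bin edges are arbitrary examples)
yr_built = pd.Series([1933, 1951, 1955, 2009], name="yr_built")
yr_codes = pd.cut(yr_built, bins=[1900, 1950, 2000, 2020]).cat.codes
```

Comparing `price.skew()`, `price_z_only.skew()`, and `price_z.skew()` shows why the outline notes that z-scoring "won't fix skew" while the log transform can.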








## 4 - MODEL
* FITTING AN INITIAL MODEL:
* Feature Selection: (Least number of features that gives you the best results)
    * DETERMINING IDEAL FEATURES TO USE
        * Using elbow plots to identify the best # of features to use
        * Choosing Features Based on Rankings
       
    * PRELIMINARY UNIVARIATE LINEAR REGRESSION MODELING
    Various forms, detail later...
    Assessing the model:
    Assess parameters (slope,intercept)
    Check if the model explains the variation in the data (RMSE, F, R_square)
    Are the coeffs, slopes, intercepts in appropriate units?
    What's the impact of collinearity? Can we ignore it?
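
For reference, the two fit metrics named above can be written as follows, using the standard definitions. The symbols are assumptions, since the outline does not define notation: $y_i$ is the observed price, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean observed price, and $n$ the number of observations.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is in the units of the target (dollars, or log-dollars if price was log-transformed), while $R^2$ is unitless.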
    
    * MULTIVARIATE REGRESSIONS
        * Cross-Validation with K-Fold Test-Train Splits:
            * Save df_run_ols to csv
            * FINAL REGRESSION RESULTS
        * K-Fold validation with OLS
        * Q-Q Plots
        * FINAL MODEL - New
        * Predictor Coefficients & Their Effect On Sales Price
        * Future Directions
        * Revise the fitted model
            * Multicollinearity is a big issue for linear regression and cannot be fully removed
        Use the predictive ability of model to test it (like R2 and RMSE)
        * Check for missed non-linearity
        Holdout validation / Train/test split
        * use sklearn train_test_split



## 5 - INTERPRET

* Observations
* Conclusions
* Recommendations