## Notes from Kaggle notebooks


### References:
- [House prices: Lasso, XGBoost, and a detailed EDA](https://www.kaggle.com/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda)


*EDA*
- Start with univariate analysis for SalePrice --> indicative price range if no information at all; skewness; main stats
- Look at the numeric variables most correlated with SalePrice and plot the correlation matrix (corrmat)
  -- Explain the meaning of each numeric variable
  -- For integer variables, e.g. Overall Quality, we can use box plots to describe the relationship with the target variable.
  -- For float variables, e.g. GrLivArea, we can use scatter plots with a regression line.
  -- Annotate potential outliers (e.g. with their Id) directly on the plot.
  -- Outliers: double-check the values of other important variables before removing them.
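
The EDA steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the reference notebook's code: the column names mirror the notes, but the values, the random seed, and the quantile thresholds used to flag outlier candidates are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are made up).
rng = np.random.default_rng(0)
n = 200
overall_qual = rng.integers(1, 11, size=n)
gr_liv_area = rng.normal(1500, 400, size=n).clip(min=400)
noise = rng.lognormal(mean=0.0, sigma=0.3, size=n)
train = pd.DataFrame({
    "OverallQual": overall_qual,
    "GrLivArea": gr_liv_area,
    "MoSold": rng.integers(1, 13, size=n),
    "SalePrice": (20000 * overall_qual + 60 * gr_liv_area) * noise,
})

# Univariate look at the target: main stats and skewness.
print(train["SalePrice"].describe())
print("skewness:", train["SalePrice"].skew())

# Rank numeric variables by absolute correlation with the target.
corr_with_target = (
    train.corr()["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)

# Median target per level of an integer variable
# (the quantity a box plot per level would show).
median_by_qual = train.groupby("OverallQual")["SalePrice"].median()
print(median_by_qual)

# Outlier candidates: very large living area but a below-median price.
# The 0.99 and 0.5 quantile thresholds are arbitrary illustrative choices.
suspects = train[
    (train["GrLivArea"] > train["GrLivArea"].quantile(0.99))
    & (train["SalePrice"] < train["SalePrice"].quantile(0.5))
]
# Inspect the other important variables of these rows before dropping them.
print(suspects)
```

The index of `suspects` plays the role of the Id that would be annotated on the scatter plot.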

*Missing data, label encoding, factorizing*
- Pool variables into related groups
- Important to understand whether a missing value means "None" (the feature is absent) or a sampling mistake
- Label encode if there is clear ordinality in a categorical variable
- Find a correlated variable to impute a missing value
- Check whether NAs in different variables come from the same observations
- Use a bar plot (with counts) to assess ordinality.
- If the number of categories is too large, consider binning.
- Convert some numeric variables into factors (e.g. year, month)
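
A small sketch of these cleaning steps, assuming a toy frame whose column names follow the house-prices data. The rows, the quality ordering, and the choice of the neighborhood median for imputation are illustrative assumptions rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the house-prices data.
df = pd.DataFrame({
    "PoolQC": ["Ex", np.nan, np.nan, "Gd", np.nan],
    "PoolArea": [500, 0, 0, 400, 0],
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 90.0, np.nan],
    "MoSold": [1, 6, 6, 12, 3],
})

# Which variables have NAs, and do the NAs fall on the same observations?
na_counts = df.isna().sum()
print(na_counts[na_counts > 0])
same_rows = df[["PoolQC", "LotFrontage"]].isna().all(axis=1).sum()

# Does NA in PoolQC mean "no pool"? Check the related variable first.
na_means_none = (df.loc[df["PoolQC"].isna(), "PoolArea"] == 0).all()
if na_means_none:
    df["PoolQC"] = df["PoolQC"].fillna("None")

# Clear ordinality, so label encode with an explicit ordered mapping
# (assumed ordering of the quality levels).
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["PoolQC"] = df["PoolQC"].map(quality_order).astype(int)

# Impute a genuinely missing value from a related variable:
# here, the median LotFrontage within the same neighborhood.
df["LotFrontage"] = df["LotFrontage"].fillna(
    df.groupby("Neighborhood")["LotFrontage"].transform("median")
)

# Month is stored as a number but behaves like a category.
df["MoSold"] = df["MoSold"].astype(str).astype("category")
print(df)
```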

*Look for important variables after data cleaning*  
- Redo the corrmat with the clean data (for numeric variables)
- Use a quick random forest to see feature importance (use a horizontal bar plot) --> pay attention to categorical variables --> should not aggregate importance across encoded categorical variables
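
A quick random forest for feature importance might look like the sketch below. It uses scikit-learn's `RandomForestRegressor` on synthetic data; the feature names, coefficients, and hyperparameters are assumptions chosen so that the ranking is easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two informative features and one pure-noise feature.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "Noise": rng.normal(size=n),
})
y = (
    20000 * X["OverallQual"]
    + 50 * X["GrLivArea"]
    + rng.normal(0, 5000, size=n)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, sorted so the largest ends up at the top
# of a horizontal bar plot.
importance = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values()
print(importance)
# importance.plot.barh() would draw the bar plot (requires matplotlib).
```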

*Feature engineering*
- Group features in the same category. Sum-of-parts can be a new feature.  
- Check the correlation between the new feature and the target variable. Or use a bar/count plot for a categorical variable.
- Bin categorical values, especially the extreme categories.
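
As a rough illustration of the sum-of-parts and binning ideas, the sketch below builds a total-bathrooms feature and a three-level neighborhood bin. The rows are made up, and the new column names (`TotBathrooms`, `NeighRich`), the 0.5 weight for half baths, and the bin assignments are assumptions for this example.

```python
import pandas as pd

# Made-up rows; column names follow the house-prices data.
df = pd.DataFrame({
    "FullBath": [1, 2, 2, 3],
    "HalfBath": [0, 1, 0, 1],
    "BsmtFullBath": [0, 0, 1, 1],
    "BsmtHalfBath": [0, 0, 0, 1],
    "Neighborhood": ["MeadowV", "NAmes", "NAmes", "NridgHt"],
    "SalePrice": [90000, 150000, 180000, 320000],
})

# Sum-of-parts feature: half baths count as 0.5.
df["TotBathrooms"] = (
    df["FullBath"]
    + 0.5 * df["HalfBath"]
    + df["BsmtFullBath"]
    + 0.5 * df["BsmtHalfBath"]
)

# Check the correlation between the new feature and the target.
corr_new = df["TotBathrooms"].corr(df["SalePrice"])
print("corr(TotBathrooms, SalePrice):", corr_new)

# Bin only the extreme categories: 0 = cheapest, 2 = most expensive,
# everything else falls into the middle bin 1.
extreme_bins = {"MeadowV": 0, "NridgHt": 2}
df["NeighRich"] = df["Neighborhood"].map(extreme_bins).fillna(1).astype(int)
print(df[["Neighborhood", "NeighRich"]])
```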

*Data preparation for modeling* 
- Drop highly correlated variables (keep only one of each highly correlated pair)
- Remove outliers
- Pre-processing
  -- Check normality of the 'true numeric' features; use a skewness cut-off to decide which ones to transform
  -- Scale the 'true numeric' features
  -- One-hot encode the categorical variables
  -- Remove levels (categories) with few or no observations in the train or test data set (use a cut-off ratio)
- Deal with the skewness of the target variable
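
The pre-processing sub-steps can be sketched with plain pandas and NumPy as below. The toy rows, the skewness cut-off of 0.8, the 25% rarity ratio, and the use of `log1p` for both features and target are assumptions for illustration; only the train set is shown, whereas the notes apply the rare-level check to train and test.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the house-prices data.
train = pd.DataFrame({
    "GrLivArea": [800.0, 1000.0, 1200.0, 1500.0, 6000.0],
    "YearBuilt": [1960.0, 1970.0, 1980.0, 1990.0, 2000.0],
    "RoofMatl": ["CompShg", "CompShg", "CompShg", "CompShg", "Metal"],
    "SalePrice": [100000.0, 130000.0, 160000.0, 200000.0, 700000.0],
})
numeric_cols = ["GrLivArea", "YearBuilt"]

# Log-transform numeric features whose |skewness| exceeds the cut-off.
SKEW_CUTOFF = 0.8  # assumed threshold
skewness = train[numeric_cols].skew()
skewed_cols = skewness[skewness.abs() > SKEW_CUTOFF].index.tolist()
train[skewed_cols] = np.log1p(train[skewed_cols])

# Scale the numeric features to zero mean and unit (sample) std.
train[numeric_cols] = (
    train[numeric_cols] - train[numeric_cols].mean()
) / train[numeric_cols].std()

# One-hot encode, then drop levels seen in too few rows.
MIN_RATIO = 0.25  # assumed cut-off ratio
dummies = pd.get_dummies(train[["RoofMatl"]], dtype=int)
rare_levels = dummies.columns[dummies.mean() < MIN_RATIO]
dummies = dummies.drop(columns=rare_levels)

# Reduce the skewness of the target; invert with np.expm1 after predicting.
y = np.log1p(train["SalePrice"])
X = pd.concat([train[numeric_cols], dummies], axis=1)
print(X)
```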


*Modeling*
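- Lasso: tune alpha with a grid search and check coef_ (coefficients shrunk to exactly zero show which features were dropped -> works even without normalization)
- XGBoost
- Combine the predictions of the two models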
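
A hedged sketch of the modeling steps on synthetic data follows. The alpha grid, the 5-fold CV, and the equal 50/50 weights are assumptions, and scikit-learn's `GradientBoostingRegressor` is swapped in as a stand-in for XGBoost so the example needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data: only the first three of ten features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (
    3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2]
    + rng.normal(0, 0.1, size=200)
)
X_new = rng.normal(size=(5, 10))  # unseen rows to predict

# Grid search over the Lasso regularization strength.
search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": [0.001, 0.01, 0.05, 0.1, 0.5]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
lasso = search.best_estimator_
print("best alpha:", search.best_params_["alpha"])

# Zero coefficients mark the features Lasso dropped.
kept = np.flatnonzero(lasso.coef_)
print("features kept:", kept)

# Stand-in for the XGBoost model (not XGBoost itself).
booster = GradientBoostingRegressor(random_state=0)
booster.fit(X, y)

# Combine: a simple weighted average of the two models' predictions.
combined = 0.5 * lasso.predict(X_new) + 0.5 * booster.predict(X_new)
print(combined)
```

If the target was log-transformed before fitting, the averaged predictions need to be transformed back (e.g. with `np.expm1`) before use.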




In [None]:
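# Simple clustering: t-SNE for a 2-D view, PCA + KMeans for cluster labels.
# `train`, `quantitative` and `qual_encoded` must already be defined.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Feature matrix: numeric features plus encoded categoricals, NAs set to 0
features = quantitative + qual_encoded
X = train[features].fillna(0.).values

# 2-D t-SNE embedding, used only for plotting
model = TSNE(n_components=2, random_state=0, perplexity=50)
tsne = model.fit_transform(X)

# Standardize, reduce to 30 principal components, then cluster with KMeans
std = StandardScaler()
s = std.fit_transform(X)
pca = PCA(n_components=30)
pc = pca.fit_transform(s)
kmeans = KMeans(n_clusters=5)
kmeans.fit(pc)

# Plot the t-SNE embedding colored by KMeans cluster
fr = pd.DataFrame({'tsne1': tsne[:, 0], 'tsne2': tsne[:, 1], 'cluster': kmeans.labels_})
sns.lmplot(data=fr, x='tsne1', y='tsne2', hue='cluster', fit_reg=False)

# Share of total variance retained by the 30 components
print(np.sum(pca.explained_variance_ratio_))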