# Project - Parameters with Highest Impact on House Price Class

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- The real estate dealer from the last assignment calls back to clarify his objective.
- He is less interested in what matters most for the exact house price and more interested in which price range a house falls into.
- There are 3 classes: the 33% cheapest, the 33% mid-range, and the 33% most expensive houses.
- He needs to find which 10 parameters matter most for determining the class.

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
- NOTE: You might need to install mlxtend. If so, run the following in a cell:
```
!pip install mlxtend
```

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/house_sales.parquet`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` to the data to check that everything looks as expected

### Step 1.c: Inspect the data
- Check the number of rows and columns
    - HINT: `.shape`

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Step 2.a: Check the data types
- This step tells you if some numeric column is not represented as a numeric type.
- Get the data types by ```.dtypes```
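
A minimal sketch of this check, assuming a hypothetical toy frame (`toy`) in which `LotArea` was read in as text. The column names and values are illustrative only and are not taken from the project file.

```python
import pandas as pd

# Hypothetical toy frame; 'LotArea' holds numbers stored as strings
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000],
    "LotArea": ["8450", "9600", "11250"],
})

# Inspect the type of every column
print(toy.dtypes)

# Columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Convert them; errors="coerce" turns unparseable entries into NaN
for col in non_numeric:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")
```

If every column already shows a numeric dtype, no conversion is needed.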

### Step 2.b: Check for null (missing) values
- Let's check if any features are not valuable
- Use ```.info()```
- Should we remove any?
    - You can remove features (columns) with `drop`, which returns a new DataFrame, so assign the result:
```Python
data = data.drop([<column_name>, ..., <column_name>], axis=1)
```
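
One possible way to decide which columns to drop is to look at the share of missing values per column. The sketch below uses a hypothetical toy frame, and the 50% cutoff is an assumption for illustration, not a rule from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'PoolArea' is mostly missing
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 275000],
    "LotArea": [8450, 9600, np.nan, 9550],
    "PoolArea": [np.nan, np.nan, np.nan, 512.0],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Assumed cutoff: flag columns where more than half the values are missing
sparse_cols = [col for col in toy.columns if missing_share[col] > 0.5]

# drop returns a new DataFrame, so assign the result
toy = toy.drop(sparse_cols, axis=1)
```

Columns with only a few gaps, like `LotArea` here, are kept and can be filled or filtered later.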

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Quasi constant features
- Let's see if there are any quasi-constant features
- Create a `VarianceThreshold(threshold=0.01)`, assign it to `sel`, and fit it on the data
- The features that are not quasi-constant are given by `sel.get_feature_names_out()`
- Get all the quasi-constant features with a list comprehension

### Step 3.b: Correlated features
- Calculate the correlation matrix `corr_matrix` and inspect it
    - HINT: use `.corr()`
- Get all the correlated features
    - HINT: A feature is correlated to a feature before it if
```Python
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
```
    - HINT: Use list comprehension to get a list of the correlated features
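
A small sketch of the list comprehension, assuming a hypothetical toy frame in which `GarageArea` is constructed to follow `GarageCars` closely. All column names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy frame with three unrelated columns
toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 300, size=200),
    "GarageCars": rng.integers(0, 4, size=200).astype(float),
    "YearBuilt": rng.integers(1900, 2010, size=200).astype(float),
})
# 'GarageArea' is built to track 'GarageCars' almost perfectly
toy["GarageArea"] = toy["GarageCars"] * 250 + rng.normal(0, 20, size=200)

corr_matrix = toy.corr()

# A feature is flagged if it correlates above 0.8 with any column before it
corr_features = [
    feature for feature in corr_matrix.columns
    if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
]
print(corr_features)
```

Only the later column of a correlated pair is flagged, so one of the two stays in the data. Note that this test looks at positive correlation only; applying `.abs()` to the slice before comparing would also catch strong negative correlation.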

### Step 3.c: Prepare training and test set
- Create 3 categorical price ranges using `qcut` and store them in a new column `Target`
    - HINT: `data['Target'] = pd.qcut(data['SalePrice'], q=3, labels=[1, 2, 3])`
- Assign all features in `X`
    - HINT: Use `.drop(['SalePrice', 'Target'] + quasi_features + corr_features, axis=1)`
        - (assuming the same naming)
- Assign the target to `y`
    - HINT: The target is column `Target`
- Split into train and test using `train_test_split`
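
Put together, this step might look like the sketch below. It runs on a hypothetical toy frame, and `quasi_features` and `corr_features` are filled in by hand to stand in for the results of Steps 3.a and 3.b. The `test_size` and `random_state` values are assumptions, not prescribed by the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy frame standing in for the house sales data
toy = pd.DataFrame({
    "SalePrice": rng.integers(50_000, 500_000, size=300),
    "LotArea": rng.integers(2_000, 20_000, size=300),
    "OverallQual": rng.integers(1, 11, size=300),
    "Street": 1,
})
quasi_features = ["Street"]  # assumed result of Step 3.a
corr_features = []           # assumed result of Step 3.b

# 1 = cheapest third, 2 = mid-range third, 3 = most expensive third
toy["Target"] = pd.qcut(toy["SalePrice"], q=3, labels=[1, 2, 3])

# Features: everything except the price, the target, and the flagged columns
X = toy.drop(["SalePrice", "Target"] + quasi_features + corr_features, axis=1)
y = toy["Target"]

# test_size and random_state are assumed values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

`SalePrice` has to be dropped from `X` along with `Target`, since the class is derived directly from it.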

### Step 3.d: 10 best features for KNeighborsClassifier model
- Use `SFS` (mlxtend's `SequentialFeatureSelector`) to find the 10 best features for a `KNeighborsClassifier` model
    - HINT: `SFS(KNeighborsClassifier(), k_features=10, verbose=2)`
    - HINT: When fitting, either fill the missing values or remove them
        - Notice: Ideally we would investigate them further to find appropriate values
- You get the indices of the best features from `.k_feature_idx_`

### Step 3.e: Explore the features
- Let's try to explore the features
    - HINT: The features can be accessed by `sfs.k_feature_idx_`
    - HINT: Get the feature names by: `X_train.columns[list(sfs.k_feature_idx_)]`
- Try to list them according to correlation score
    - HINT: This is a bit more advanced Python

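```Python
# Rank all columns by their correlation with SalePrice (position 0 is SalePrice itself)
ranking = corr_matrix['SalePrice'].sort_values(ascending=False).index
for item in X_train.columns[list(sfs.k_feature_idx_)]:
    # Position of the selected feature in the correlation ranking
    print(item, ranking.get_loc(item))
```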
- Does the result surprise you?
- Does it change your recommendations?
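
To list the selected features in order of their correlation with `SalePrice`, one option is to sort the relevant slice of the correlation matrix. The sketch below uses a hypothetical toy frame, and `selected_idx` is a hand-written stand-in for `sfs.k_feature_idx_`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
price = rng.normal(200_000, 50_000, size=n)

# Hypothetical toy frame; more noise means a weaker link to the price
toy = pd.DataFrame({
    "SalePrice": price,
    "Weak": price + rng.normal(0, 200_000, size=n),
    "Strong": price + rng.normal(0, 10_000, size=n),
    "Medium": price + rng.normal(0, 50_000, size=n),
})
corr_matrix = toy.corr()
X_train = toy.drop("SalePrice", axis=1)

# Stand-in for sfs.k_feature_idx_ (hypothetical indices into X_train.columns)
selected_idx = (0, 1, 2)
selected = X_train.columns[list(selected_idx)]

# Correlation of each selected feature with SalePrice, strongest first
ranked = corr_matrix.loc[selected, "SalePrice"].sort_values(ascending=False)
print(ranked)
```

Comparing this ordering with the features picked by `SFS` shows whether the classifier favors the same parameters that correlate most with the price.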

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: Present findings
- Use the analysis from Step 3 to figure out how to present your findings
- Think about how the real estate dealer can use these findings

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Step 5.a: Measure impact
- Can we help the dealer to use these insights?