# Project - Parameters with Highest Impact on House Prices

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- A real estate dealer wants to figure out what matters most when selling a house
- They provide various sales data
- Your job is to figure out which 10 parameters (features) matter the most and present the findings

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
- NOTE: You might need to install mlxtend. If so, run the following in a cell:
```
!pip install mlxtend
```

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/house_sales.parquet`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` to the data to check that everything looks as expected
    - The target is `SalePrice`

### Step 1.c: Inspect the data
- Check the number of rows and columns
    - HINT: `.shape`

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Step 2.a: Check the data types
- This step tells you whether any numeric column is not stored as a numeric type.
- Get the data types by ```.dtypes```

### Step 2.b: Check for null (missing) values
- Let's check whether any features have too many missing values to be valuable
- Use ```.info()```
- Should we remove any?
    - You can remove features (columns):
```Python
data = data.drop([<column_name>, ..., <column_name>], axis=1)
```
- If you keep some columns with missing values, you can fill them with -1 using `.fillna(-1)`
    - Notice: This is not a validated or good approach - but for this purpose it will do
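
As a minimal sketch of Steps 2.a and 2.b, the cell below uses a tiny stand-in DataFrame, since the real file is not loaded here. The column names, the values, and the 50% cutoff for dropping a column are all assumptions made for illustration. It counts missing values with `.isna().sum()`, which gives the same information as `.info()` in a form that is easier to sort.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the house sales data; names and values are made up.
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, 1.0],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Step 2.a: data type of each column.
print(data.dtypes)

# Step 2.b: number of missing values per column, largest first.
missing = data.isna().sum().sort_values(ascending=False)
print(missing)

# Drop columns where more than half of the rows are missing (assumed cutoff).
mostly_missing = [col for col in data.columns if data[col].isna().mean() > 0.5]
data = data.drop(mostly_missing, axis=1)

# Fill the remaining gaps with -1 (a shortcut, not a validated approach).
data = data.fillna(-1)
```

Note that `drop` and `fillna` return new DataFrames, so the result has to be assigned back to `data`.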

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Quasi constant features
- Let's see if there are any quasi constant features
- Create a `VarianceThreshold(threshold=0.01)` and fit it on the data
- The features that are not quasi constant are given by `sel.get_feature_names_out()`
- Get all the quasi constant features with a list comprehension

### Step 3.b: Correlated features
- Calculate the correlation matrix `corr_matrix` and inspect it
    - HINT: use `.corr()`
- Get all the correlated features
    - HINT: A feature is correlated with an earlier feature if this expression is `True`:
```Python
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
```
    - HINT: Use list comprehension to get a list of the correlated features
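
The hints above can be combined as in the following sketch. The data is synthetic and the column names are assumptions: `GarageArea` is generated as a noisy multiple of `GarageCars`, so the two are strongly correlated, while `YearBuilt` is independent of both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
garage_cars = rng.integers(0, 4, n).astype(float)

# Hypothetical features for illustration only.
demo = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 20, n),
    "YearBuilt": rng.integers(1900, 2010, n).astype(float),
})

corr_matrix = demo.corr()

# Keep the first feature of each correlated pair and flag the later one.
corr_features = [
    feature
    for feature in corr_matrix.columns
    if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()
]
print(corr_features)
```

This check only catches strong positive correlation. Applying `.abs()` to the slice before comparing would also flag strongly negative pairs, if that is wanted.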

### Step 3.c: Prepare training and test set
- Assign all features to `X`
    - HINT: Use `.drop(['SalePrice'] + quasi_features + corr_features, axis=1)`
        - (assuming the same naming)
- Assign the target to `y`
    - HINT: The target is column `SalePrice`
- Split into train and test using `train_test_split`
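
A minimal sketch of this step, assuming `quasi_features` and `corr_features` are lists of column names from Steps 3.a and 3.b. The data here is synthetic, and the values for `test_size` and `random_state` are illustrative choices rather than part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 100

# Hypothetical stand-in for the prepared data.
data = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "GarageCars": rng.integers(0, 4, n),
    "GarageArea": rng.normal(450, 100, n),
    "Street": np.ones(n),
})
data["SalePrice"] = 100 * data["GrLivArea"] + rng.normal(0, 10000, n)

# Assumed results of Steps 3.a and 3.b.
quasi_features = ["Street"]
corr_features = ["GarageArea"]

X = data.drop(["SalePrice"] + quasi_features + corr_features, axis=1)
y = data["SalePrice"]

# Hold out 20% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```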

### Step 3.d: 10 best features for LinearRegression model
- Use the `SFS` to find 10 best features for a `LinearRegression` model
    - HINT: `SFS(LinearRegression(), k_features=10, verbose=2)`
    - HINT: When fitting, fill missing values or remove them
        - Notice: ideally we would investigate them further to find appropriate values
- You get the best feature index from `.k_feature_idx_`

### Step 3.e: Test the result
- Create a plain `LinearRegression` model, fit it on all features, and calculate the `r2_score` on the test set
- Then do the same for only the 10 best features
- Did the score surprise you?
    - Notice that the test score is far below the score reported by `SFS`
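
A small helper makes the two scores easy to compare. This is a sketch on synthetic data: the feature names, the helper `score_on_test`, and the list `best_features` are assumptions standing in for the results of the earlier steps.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Hypothetical data; only f0 and f3 drive the target.
X = pd.DataFrame(rng.normal(size=(n, 8)), columns=[f"f{i}" for i in range(8)])
y = 5 * X["f0"] - 3 * X["f3"] + rng.normal(0, 1.0, n)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def score_on_test(columns):
    """Fit LinearRegression on the given columns and return the test r2_score."""
    model = LinearRegression().fit(X_train[columns], y_train)
    return r2_score(y_test, model.predict(X_test[columns]))


best_features = ["f0", "f3"]  # assumed output of Step 3.d
r2_full = score_on_test(list(X.columns))
r2_best = score_on_test(best_features)
print(f"all features: {r2_full:.3f}, selected features: {r2_best:.3f}")
```

On this clean toy data both scores are high. On the real data the gap between them, and the gap to the `SFS` score, is the interesting part.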

### Step 3.f: Test with 10 highest correlated features
- Find the 10 highest correlated features
    - HINT: `corr_matrix['SalePrice'].sort_values(ascending=False)`
- Then calculate the `r2_score` for them.
- Does the score surprise you?
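
When sorting the correlations, `SalePrice` itself comes out on top with a correlation of 1, so it has to be excluded. The sketch below does that on synthetic data with hypothetical column names `a` to `e`, and picks the top 2 features instead of 10 because the toy frame is small.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# Hypothetical data; a and c drive SalePrice, the rest is noise.
data = pd.DataFrame(rng.normal(size=(n, 5)), columns=["a", "b", "c", "d", "e"])
data["SalePrice"] = 4 * data["a"] + 2 * data["c"] + rng.normal(0, 1.0, n)

corr_matrix = data.corr()

k = 2  # the exercise uses 10
top_corr = (
    corr_matrix["SalePrice"]
    .drop("SalePrice")  # remove the target's correlation with itself
    .sort_values(ascending=False)
    .head(k)
)
top_features = list(top_corr.index)
print(top_features)
```

The resulting list can be used to select columns from the training and test sets before fitting a `LinearRegression` and computing the `r2_score`.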

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: Present findings
- Use the analysis from Step 3 to figure out how to present your findings
- Try to think how the real estate dealer can use these findings

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Step 5.a: Measure impact
- Can we help the dealer to use these insights?