# Project - Classification with Hidden Features

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- You are hired by a company
- They have classified a dataset
- The features are hidden (you do not know what they are)
- They ask you to create a model to predict classes
- How accurately can you predict the classes?
- Are some features more important than others?

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
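
The import cell itself is not reproduced here. A plausible version, assuming only the libraries that the later steps name (`pd`, `train_test_split`, `SVC`, `KNeighborsClassifier`, `accuracy_score`, `permutation_importance`) plus matplotlib for the plots, might look like this:

```python
# Data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt

# Model selection, models, metrics and feature inspection from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
```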

### Step 1.b: Read the data
- Use ```pd.read_csv()``` to read the file `files/classified_data.csv`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` on the data to check that everything looks as expected
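
A minimal sketch of this step follows. The helper `load_data` and the constant `DATA_PATH` are hypothetical names introduced only for illustration; in a notebook, a direct `data = pd.read_csv(...)` call is equally fine.

```python
from pathlib import Path

import pandas as pd

DATA_PATH = Path("files/classified_data.csv")  # path given in the step above


def load_data(path=DATA_PATH):
    """Read the classified dataset from a CSV file into a DataFrame."""
    return pd.read_csv(path)


# Guarded so the cell does not fail if the file is not in place yet
if DATA_PATH.exists():
    data = load_data()
    print(data.head())  # first five rows, to check the data looks as expected
```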

### Step 1.c: Inspect the data
- How big is the dataset?
    - HINT: Use `len(.)`
- How many classes are there?
    - HINT: Use `.value_counts()` on the column containing the classes
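
The two hints can be sketched as follows. The small DataFrame is a made-up stand-in, and the column name `label` is an assumption: replace it with the actual class column of the dataset.

```python
import pandas as pd

# Made-up stand-in; in the project, `data` comes from pd.read_csv(...)
# The column name "label" is hypothetical.
data = pd.DataFrame({
    "f1": [0.1, 0.4, 0.35, 0.8],
    "f2": [1.2, 0.7, 0.9, 0.3],
    "label": [0, 1, 1, 0],
})

n_rows = len(data)                           # number of rows (samples)
class_counts = data["label"].value_counts()  # number of samples per class
n_classes = class_counts.size                # number of distinct classes

print(n_rows, n_classes)
print(class_counts)
```

Besides the number of classes, `value_counts()` also shows whether the classes are balanced, which affects how an accuracy score should be read.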

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Step 2.a: Check the data types
- This step tells you whether any numeric column is stored as a non-numeric type.
- Get the data types by ```.info()```
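
As an illustration, the sketch below uses a made-up DataFrame in which one column (`f2`) is deliberately stored as strings. The `select_dtypes` call is an optional extra, not something the step requires.

```python
import pandas as pd

# Made-up stand-in; "f2" holds numbers stored as strings on purpose
data = pd.DataFrame({"f1": [0.1, 0.4], "f2": ["1.2", "0.7"], "label": [0, 1]})

data.info()  # prints column names, non-null counts and dtypes

# Optional: list the columns that are not stored as a numeric type
non_numeric = data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```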

### Step 2.b: Check for null (missing) values
- Data is often missing entries; there can be many reasons for this
- We need to deal with that (covered later in the course)
- Use ```.isnull().any()```
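
A short sketch on a made-up DataFrame with one missing entry shows what the check returns. The additional `.isnull().sum()` is an optional extra that counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with one missing value in "f1"
data = pd.DataFrame({"f1": [0.1, np.nan, 0.35], "f2": [1.2, 0.7, 0.9]})

has_null = data.isnull().any()     # per column: True if any value is missing
null_counts = data.isnull().sum()  # per column: number of missing values

print(has_null)
print(null_counts)
```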

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Dependent and independent features
- Assign independent features (those predicting) to `X`
- Assign classes (labels/dependent features) to `y`
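
One way to make the split, assuming the classes live in a single column (here given the hypothetical name `label`), is to drop that column for `X` and select it for `y`:

```python
import pandas as pd

# Made-up stand-in; the column name "label" is hypothetical
data = pd.DataFrame({
    "f1": [0.1, 0.4, 0.35, 0.8],
    "f2": [1.2, 0.7, 0.9, 0.3],
    "label": [0, 1, 1, 0],
})

X = data.drop(columns=["label"])  # independent features (everything but the classes)
y = data["label"]                 # dependent feature (the classes)
```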

### Step 3.b: Divide into training and test set
- Divide into training and test set
    - HINT: `train_test_split`
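
A sketch of the split on synthetic data from `make_classification`, which stands in for the project's `X` and `y`. The values `test_size=0.2` and `random_state=42` are assumptions chosen for illustration; the step does not prescribe them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's X and y
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```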

### Step 3.c: Train, fit, score an SVC model
- Create the model
```Python
svc = SVC()
```
- Fit the model
```Python
svc.fit(X_train, y_train)
```
- Predict with the model
```Python
y_pred = svc.predict(X_test)
```
- Test the accuracy
```Python
accuracy_score(y_test, y_pred)
```

### Step 3.d: Find most important features
- To find the most important features use [`permutation_importance`](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html)

```Python
perm_importance = permutation_importance(svc, X_test, y_test)
```
- The results will be found in `perm_importance.importances_mean`

### Step 3.e: Visualize the results
- To visualize the result we want the most important features sorted
- This can be done with `perm_importance.importances_mean.argsort()`
    - HINT: assign it to `sorted_idx`
- Then to visualize it we will create a DataFrame
```Python
pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
```
- Then make a `barh` plot (use `figsize`)
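
Put together, the visualization might look like the sketch below. It runs on synthetic data from `make_classification` with made-up column names `f0` to `f4`; the `figsize` value, the `random_state` arguments and the axis label are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the project data, with made-up column names
X_arr, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC().fit(X_train, y_train)
perm_importance = permutation_importance(svc, X_test, y_test, random_state=0)

# Indices that sort the features from least to most important
sorted_idx = perm_importance.importances_mean.argsort()
importances = pd.DataFrame(
    perm_importance.importances_mean[sorted_idx],
    X_test.columns[sorted_idx],
    columns=["Value"],
)

# Horizontal bars; with ascending order the most important feature ends up on top
ax = importances.plot.barh(figsize=(8, 4))
ax.set_xlabel("Mean decrease in accuracy")
```

Because `permutation_importance` falls back to the estimator's own `score` method, the values here are mean drops in accuracy when a feature's values are shuffled.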

### Step 3.f: Train, fit, score a KNeighborsClassifier
- Do the same as above for `KNeighborsClassifier`
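
A sketch of the same sequence for `KNeighborsClassifier`, again on synthetic data from `make_classification`. The default hyperparameters and the `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the project data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier()  # default settings (5 neighbours)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

knn_accuracy = accuracy_score(y_test, y_pred)
knn_importance = permutation_importance(knn, X_test, y_test, random_state=0)

print(knn_accuracy)
print(knn_importance.importances_mean)
```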

### Step 3.g: Conclusion
- Do the two models rely on the same features?

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: What are the findings?
- Write down your findings

### Step 4.b: How to present the findings?
- How should we present the findings?

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Step 5.a: How to follow up?
- This is potentially a long-term relationship with a company
- How can we follow up and improve on the model after more data is available?