# Project - Feature Scaling

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- A sports magazine is writing an article on soccer players
- They have a special interest in left-footed players
- A question is whether their playing style can predict if a player is left-footed
- The questions they want to answer:
    - Can you predict from a feature set on players whether a player is left-footed?
    - If so, which features matter the most?

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
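
The notebook's import cell is not reproduced in this text. Judging from the hints in the later steps, it presumably covers something like the following. This list is an assumption, so your cell may differ.

```python
# Assumed imports, inferred from the names used in the hints of later steps
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
```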

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/soccer.parquet`
    - The data is from [Kaggle European Soccer Database](https://www.kaggle.com/hugomathien/soccer)
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` on the data to check that everything looks as expected
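
A minimal sketch of this step. The existence check is an addition for illustration only, and `pd.read_parquet` needs a parquet engine (such as `pyarrow`) to be installed.

```python
from pathlib import Path

import pandas as pd

path = Path("files/soccer.parquet")

if path.exists():
    # Read the file and keep the result in a variable for the later steps
    data = pd.read_parquet(path)
    print(data.head())
else:
    # Illustrative guard: the path is relative to the notebook's folder
    print(f"{path} not found; adjust the path to where the file lives")
```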

## Step 2: Prepare
- Explore data
- Visualize ideas
- Clean data

### Step 2.a: Check the data types
- This step tells you if a numeric column is not stored as a numeric type.
- Apply `info()` to get an idea of the data
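
As an illustration of what to look for, here is a toy DataFrame (made up, not the soccer data) where one numeric column is stored as text. The column names and the `pd.to_numeric` fix are assumptions for the example.

```python
import pandas as pd

# Toy data: "overall_rating" holds numbers stored as strings
toy = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "overall_rating": ["67", "71", "80"],
    "crossing": [49.0, 80.0, 65.0],
})

# info() lists each column with its dtype and non-null count;
# "overall_rating" shows up with a non-numeric dtype
toy.info()

# One possible fix: convert the text column to a numeric dtype
toy["overall_rating"] = pd.to_numeric(toy["overall_rating"])
print(toy.dtypes)
```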

### Step 2.b: Check for null (missing) values
- Data often has missing entries - there can be many reasons for this
- We need to deal with that (we will do so later in the course)
- Use ```.isnull().any()``` and `.isnull().sum()`
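
The difference between the two calls can be seen on a small made-up DataFrame (the values are illustrative only): `.any()` answers whether a column has any missing value, `.sum()` counts them.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries
toy = pd.DataFrame({
    "crossing": [49.0, np.nan, 65.0, 70.0],
    "finishing": [44.0, 50.0, np.nan, np.nan],
    "preferred_foot": ["right", "left", "right", "left"],
})

# True/False per column: does it contain at least one missing value?
has_missing = toy.isnull().any()
print(has_missing)

# Number of missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```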

### Step 2.c: Drop missing data
- A great idea is to investigate missing data and outliers
- But for this project we ignore it
- Apply `dropna()`

### Step 2.d: Limit dataset size
- This project is only for demonstration
- Limit the dataset to the first 2000 rows
    - HINT: `iloc[:2000]`
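
Steps 2.c and 2.d can be sketched together on toy data. The sketch keeps the first 3 rows instead of 2000 so the effect is visible. Note that `iloc` is positional, so it takes the first rows that remain after `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy data: one row has a missing value
toy = pd.DataFrame({
    "crossing": [49.0, np.nan, 65.0, 70.0, 55.0],
    "finishing": [44.0, 50.0, 61.0, 58.0, 47.0],
})

# Drop every row that contains at least one missing value
cleaned = toy.dropna()

# Keep only the first rows (3 here, 2000 in the project)
limited = cleaned.iloc[:3]
print(limited)
```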

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Feature Selection
- The class we want to predict is `preferred_foot` (the dependent feature, also called the target)
- For now we keep the other numeric features as independent features
    - HINT: Use `.info()` to see numeric columns
    - HINT: Use `.drop([...], axis=1)`
- Assign the independent features to `X` and the dependent feature to `y`
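
A minimal sketch on toy data. Apart from `preferred_foot`, the column names are illustrative. The hint above lists the columns to drop by hand; `select_dtypes` is used here as an alternative way to keep only numeric columns.

```python
import pandas as pd

# Toy data with one text column, two numeric columns, and the target
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "crossing": [49.0, 80.0, 65.0, 70.0],
    "finishing": [44.0, 50.0, 61.0, 58.0],
    "preferred_foot": ["right", "left", "right", "left"],
})

# Target: the column we want to predict
y = toy["preferred_foot"]

# Features: everything except the target, restricted to numeric columns
X = toy.drop(["preferred_foot"], axis=1).select_dtypes(include="number")
print(X.columns.tolist())
```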

### Step 3.b: Split into train and test
- Use `train_test_split` to divide into train and test data.
- Set `random_state` to make the split reproducible while experimenting
```Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```

### Step 3.c: Normalize data
- Create a `MinMaxScaler()`
- Fit it on the `X_train` dataset
- Then transform `X_train` and `X_test`
- Remember to assign the results to unique variables
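
A small sketch with made-up numbers. The names `norm`, `X_train_norm`, and `X_test_norm` are suggestions, not required. The scaler learns the minimum and maximum from the training data only, so test values can fall outside the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy train and test sets with two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Fit on the training data only, then transform both sets
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

print(X_train_norm)
print(X_test_norm)  # second value exceeds 1: it lies above the training max
```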

### Step 3.c: Standardize data
- Create a `StandardScaler()`
- Fit it on the `X_train` dataset
- Then transform `X_train` and `X_test`
- Remember to assign the results to unique variables
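
The same pattern with `StandardScaler`, again on made-up numbers. The names `X_train_stand` and `X_test_stand` follow the naming used in the hint of Step 3.e. After the transform, each training column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train and test sets with two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Fit on the training data only, then transform both sets
scale = StandardScaler().fit(X_train)
X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)

print(X_train_stand.mean(axis=0), X_train_stand.std(axis=0))
```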

### Step 3.d: Compare sets
- For the Original, Normalized, and Standardized datasets
    - Create an SVM model (`SVC`) and fit it
    - Predict on the test set and calculate an accuracy score
- HINT: For each dataset, take inspiration from this
```Python
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)
```
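
One way to run the comparison is to loop over the three versions of the data. The sketch below uses synthetic data from `make_classification` as a stand-in for the soccer data, so the scores it prints say nothing about the real dataset. The dictionary layout is just one possible design.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the soccer dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit both scalers on the training data only
norm = MinMaxScaler().fit(X_train)
stand = StandardScaler().fit(X_train)

datasets = {
    "Original": (X_train, X_test),
    "Normalized": (norm.transform(X_train), norm.transform(X_test)),
    "Standardized": (stand.transform(X_train), stand.transform(X_test)),
}

# Fit a fresh SVC per dataset and record its test accuracy
scores = {}
for name, (train, test) in datasets.items():
    svc = SVC()
    svc.fit(train, y_train)
    scores[name] = accuracy_score(y_test, svc.predict(test))
    print(f"{name}: {scores[name]:.3f}")
```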

### Step 3.e: Finding most important feature
- We now know that the features can predict if a player is left-footed
- Now we need to find the most important features
- [`permutation_importance`](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html) measures how much the model's score drops when the values of a single feature are randomly shuffled.
- We will use the standardized data and fit a new `SVC` model
- Then use `permutation_importance` to calculate the importance of each feature.
```Python
perm_importance = permutation_importance(svc, X_test_stand, y_test)
```
- The results will be found in `perm_importance.importances_mean`
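
A self-contained sketch of this step on synthetic stand-in data (not the soccer dataset). Passing `n_repeats` and `random_state` is optional and is done here only to make the run reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Standardize using statistics from the training data only
stand = StandardScaler().fit(X_train)
X_train_stand = stand.transform(X_train)
X_test_stand = stand.transform(X_test)

# Fit a new model on the standardized data
svc = SVC()
svc.fit(X_train_stand, y_train)

# Shuffle each feature in turn and measure the drop in accuracy
perm_importance = permutation_importance(
    svc, X_test_stand, y_test, n_repeats=10, random_state=42
)
print(perm_importance.importances_mean)  # one value per feature
```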

### Step 3.f: Visualize the results
- To visualize the result we want the most important features sorted
- This can be done with `perm_importance.importances_mean.argsort()`
    - HINT: assign it to `sorted_idx`
- Then to visualize it we will create a DataFrame
```Python
pd.DataFrame(perm_importance.importances_mean[sorted_idx], X_test.columns[sorted_idx], columns=['Value'])
```
- Then make a `barh` plot (use `figsize`)
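
The sorting and plotting can be sketched as follows. The importance values and column names are placeholders standing in for `perm_importance.importances_mean` and `X_test.columns`; they are not results from the soccer data.

```python
import numpy as np
import pandas as pd

# Placeholder values (hypothetical, for illustration only)
importances_mean = np.array([0.02, 0.10, 0.05])
columns = pd.Index(["crossing", "finishing", "dribbling"])

# argsort returns the positions that sort the values in ascending order
sorted_idx = importances_mean.argsort()

# Build a DataFrame indexed by feature name, sorted by importance
result = pd.DataFrame(
    importances_mean[sorted_idx], columns[sorted_idx], columns=["Value"]
)

# barh draws rows from the bottom up, so the largest value ends on top
ax = result.plot.barh(figsize=(10, 6))
```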

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: Present findings
- There are many ways to present the findings.
- Be creative
- Ideas
    - Explore how the features are related to the value
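
One possible starting point for this idea is to compare the average of a feature between left-footed and right-footed players. The sketch uses made-up numbers, and `crossing` is only an example feature. In the project you would use the features ranked highest in Step 3.f.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "preferred_foot": ["left", "left", "right", "right"],
    "crossing": [70.0, 60.0, 50.0, 40.0],
})

# Average feature value per class
means = toy.groupby("preferred_foot")["crossing"].mean()
print(means)
```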

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Step 5.a: Reflection
- There might not be any actions?