# Predict Soccer Players with Regression

![Data Science Workflow](img/ds-workflow.png)

## Goal of Project
- Build a model that predicts a player's overall rating from the player's metrics
- The data is a subset of the Kaggle dataset [European Soccer Database](https://www.kaggle.com/hugomathien/soccer)
    - A bigger project would be to predict the outcomes of games

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
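
The import cell itself is not reproduced in this text. A minimal sketch of the imports the later steps rely on, assuming pandas and scikit-learn are installed (the original cell may import more, for example a plotting library):

```python
# Data handling
import pandas as pd

# Modelling and evaluation helpers used in Step 3
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
```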

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/soccer.parquet`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Call ```.head()``` on the data to check that it looks as expected

### Step 1.c: Data size
- Check how many rows the dataset contains
- HINT: `len(data)`

## Step 2: Prepare
- Explore data
- Visualize ideas
- Cleaning data

### Step 2.a: Inspect the data
- There are many metrics
- To keep it simple, keep only the numeric columns
    - HINT: find them with `.dtypes`
- You can select all columns with numeric data types using `.select_dtypes(include='number')`
    - HINT: assign the result back to your variable (the model needs this, as it cannot handle non-numeric features).
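
As a rough illustration of the selection step, here is a sketch on a tiny stand-in DataFrame. Only `overall_rating` comes from this exercise; the other column names and all values are made up for the example.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical columns and values)
data = pd.DataFrame({
    "overall_rating": [67.0, 74.0, 81.0],
    "potential": [71.0, 76.0, 85.0],
    "preferred_foot": ["right", "left", "right"],  # non-numeric column
})

# Inspect the data type of each column
print(data.dtypes)

# Keep only the numeric columns and assign the result back to the variable
data = data.select_dtypes(include="number")
print(list(data.columns))
```

Note that `.select_dtypes()` returns a new DataFrame, so the result only sticks if it is assigned back.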

### Step 2.b: Check for null (missing) values
- Data is often missing entries, and there can be many reasons for this
- We need to deal with that (covered later in the course)
- Use ```.isnull().any()``` to see which columns contain null values
- See what percentage of each column is null (assuming `data` contains your data)
```Python
data.isnull().sum()/len(data)*100
```

### Step 2.c: Drop missing data
- Remove rows with missing data
- HINT: `.dropna()`
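
A small sketch of what dropping rows does, using made-up values (the `potential` column is an assumption for illustration). It also counts how many rows are lost, which is worth checking before discarding data:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (values are made up)
data = pd.DataFrame({
    "overall_rating": [67.0, np.nan, 81.0, 70.0],
    "potential": [71.0, 76.0, np.nan, 75.0],
})

# Percentage of missing values per column, as in Step 2.b
missing_pct = data.isnull().sum() / len(data) * 100

rows_before = len(data)
# dropna() returns a new DataFrame, so assign the result back
data = data.dropna()
rows_dropped = rows_before - len(data)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```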

### Step 2.d: Visualize data
- Make a histogram of the `overall_rating`
- This gives you an understanding of the data
- What does it tell you?
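
One possible way to draw the histogram is the pandas plotting interface, which requires matplotlib to be installed. The sketch below uses synthetic ratings, so its shape says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic ratings for illustration only (not the real distribution)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "overall_rating": rng.normal(68, 7, size=500).clip(33, 94),
})

# Histogram with 20 bins; pandas delegates the drawing to matplotlib
ax = data["overall_rating"].plot.hist(bins=20, title="Distribution of overall_rating")
ax.set_xlabel("overall_rating")
```

In a notebook the figure is rendered below the cell. Things to look for are the center, the spread, and whether the distribution is skewed.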

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Feature and target selection
- The target data is given by `overall_rating`
- As we do not have a description of the data, let's learn a bit about it
    - HINT: Use `data.corr()['overall_rating'].sort_values(ascending=False)`
- For simplicity, deselect the features you do not think should be part of the analysis
- Create DataFrames `X` and `y` containing the features and target, respectively.
    - HINT: To get all columns except one use `.drop(['overall_rating', <insert other here>], axis=1)`
    - HINT: Keep `y` as a DataFrame for simplicity later
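
The selection above can be sketched on synthetic data as follows. The columns `potential`, `reactions`, and `player_id` are hypothetical stand-ins, and the relationship between them is invented for the example:

```python
import numpy as np
import pandas as pd

# Synthetic data: the rating depends on two features, player_id is just an identifier
rng = np.random.default_rng(0)
n = 200
potential = rng.normal(73, 6, n)
reactions = rng.normal(66, 9, n)
data = pd.DataFrame({
    "player_id": np.arange(n),
    "potential": potential,
    "reactions": reactions,
    "overall_rating": 0.6 * potential + 0.3 * reactions + rng.normal(0, 2, n),
})

# Correlation of every column with the target, strongest first
corr = data.corr()["overall_rating"].sort_values(ascending=False)
print(corr)

# Features: everything except the target and the identifier column
X = data.drop(["overall_rating", "player_id"], axis=1)
# Double brackets keep y as a one-column DataFrame instead of a Series
y = data[["overall_rating"]]
```

The target always correlates perfectly with itself, so it appears at the top of the sorted list; the interesting entries are the ones below it.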

### Step 3.b: Divide into test and train
- We do this to test how well our model performs on data it has not seen
- The idea is: we train on one dataset, then test on another to see how the model performs
- To split the dataset, use
```Python
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
- The `random_state=42` is used for reproducibility
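
To see what the split produces, here is a sketch on 100 toy rows (the `potential` column and all values are made up). When no size is given, `train_test_split` holds out 25% of the rows for testing:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 toy rows (values are placeholders)
X = pd.DataFrame({"potential": np.arange(100.0)})
y = pd.DataFrame({"overall_rating": np.arange(100.0)})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Default split: 75% train, 25% test
print(X_train.shape, X_test.shape)  # (75, 1) (25, 1)
```

Rows are shuffled before splitting, and `X` and `y` are split with the same row order, so features and targets stay aligned.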

### Step 3.c: Train the model
- Create a `LinearRegression` instance and fit it.
- HINT: Do this on train data (`X_train` and `y_train`)

### Step 3.d: Predict on test data
- Here we make predictions
- HINT: Call `.predict(X_test)` on your model and assign the result to `y_pred`

### Step 3.e: Evaluate the model
- Compute the r-squared score of the predicted results against the real results
- HINT: Use `r2_score(y_test, y_pred)`; the true values go first, then the predictions
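
Steps 3.c to 3.e can be sketched together on synthetic data. The feature names and the relationship to the rating are invented for illustration, so the score printed here says nothing about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target (hypothetical columns, invented relationship)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "potential": rng.normal(73, 6, n),
    "reactions": rng.normal(66, 9, n),
})
y = pd.DataFrame({
    "overall_rating": 0.6 * X["potential"] + 0.3 * X["reactions"] + rng.normal(0, 2, n),
})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 3.c: fit on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3.d: predict on the held-out test data
y_pred = model.predict(X_test)

# Step 3.e: compare true values (first argument) with predictions (second)
score = r2_score(y_test, y_pred)
print(f"R^2 on test data: {score:.3f}")
```

An r-squared of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.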

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: Present findings
- This exercise is mostly practice in creating a model
- But feel free to be creative
- An option could be to investigate the best indicator of a player

## Step 5: Actions
- Use insights
- Measure impact
- Main goal