# Intro to ML
## Part 1: Reading Data

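The pandas read_csv() function loads a CSV file into a DataFrame, and describe() gives summary statistics (count, mean, standard deviation, min, quartiles and max) for each numeric column.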

In [None]:
import pandas as pd
dataset = pd.read_csv("C:/Users/ali49/Downloads/winequality-red.csv")
print(dataset.describe())
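
Some copies of this dataset (for example, the one distributed by the UCI repository) separate fields with semicolons rather than commas, in which case read_csv() needs a sep argument. Below is a minimal sketch on a tiny inline sample. The three rows are made up for illustration and are not taken from the file.

```python
import io

import pandas as pd

# Made-up, semicolon-separated sample standing in for the real file.
raw = io.StringIO(
    "fixed acidity;pH;alcohol;quality\n"
    "7.4;3.51;9.4;5\n"
    "7.8;3.20;9.8;5\n"
    "11.2;3.16;9.8;6\n"
)

# sep=";" tells pandas which delimiter to split on; the default is ",".
sample = pd.read_csv(raw, sep=";")

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = sample.describe()
print(summary)
```

If describe() shows only a single oddly named column, the delimiter is the first thing to check.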

## Part 2: Your First Machine Learning Model

We can list all of the columns in the dataset by using the columns attribute.

In [None]:
dataset.columns

### Selecting Prediction Target

You can pull out a single column with dot notation; the result is stored in a pandas Series. The prediction target, conventionally called y, is the column we want to predict.

In [None]:
y = dataset.quality
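
Dot notation only works when the column name is a valid Python identifier, so a column such as 'fixed acidity' has to be selected with brackets instead. A small sketch of both forms, using a made-up DataFrame for illustration:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "quality": [5, 5, 6],
})

# Dot notation works because "quality" is a valid Python identifier.
y_dot = df.quality

# Bracket notation works for any column name, including ones with spaces.
y_bracket = df["quality"]
acidity = df["fixed acidity"]

print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```

Both forms return the same Series, so bracket notation is the safer habit when column names are not under your control.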

### Choosing Features
Select features by passing a list of column names. We'll use describe() and head() to inspect this subset of the data. head() shows the first few rows.

In [None]:
features = ['fixed acidity', 'density', 'pH', 'sulphates', 'alcohol']
X = dataset[features]
print(X.describe())
print(X.head())

### Building Your Model
We'll be using a decision tree to predict wine quality. To learn about decision trees: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

These are the steps to building a model:
1. Define: What type of model?
2. Fit: Capture patterns from provided data.
3. Predict: Use the fitted model to generate predictions.
4. Evaluate: Determine accuracy of predictions.
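
The four steps above can be sketched end to end on synthetic data. The data and the names X_demo, y_demo and demo_model are assumptions made for illustration, not part of the wine example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y depends on x plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)

# 1. Define: choose the model type and its settings.
demo_model = DecisionTreeRegressor(random_state=1)

# 2. Fit: learn patterns from the provided data.
demo_model.fit(X_demo, y_demo)

# 3. Predict: generate predictions from the fitted model.
demo_predictions = demo_model.predict(X_demo)

# 4. Evaluate: score the predictions (here on the training data itself,
#    which gives an optimistic score).
demo_mae = mean_absolute_error(y_demo, demo_predictions)
print(demo_mae)
```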


In [None]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state = 1)
model.fit(X,y)

### Prediction
In practice, you'll want to make predictions for new wines rather than wines already in the training data. But for this example, we'll be making predictions for the first few rows of the training data.

In [None]:
print("Making predictions for the following 5 wines")
print(X.head())
print("The predictions are")
print(model.predict(X.head()))

## Part 3: Model Validation

A common evaluation metric used in practice is mean absolute error (MAE). For each example, take the absolute difference between the actual and predicted values; the MAE is the average of those absolute differences over all the examples being scored.

In [None]:
from sklearn.metrics import mean_absolute_error

# In-sample score: predictions are made on the same data the model was fit on.
predicted_quality = model.predict(X)
mean_absolute_error(y, predicted_quality)
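
Mean absolute error is simple enough to check by hand: take the absolute difference between each actual and predicted value, then average. The sketch below does this with NumPy and compares it to scikit-learn's result. The four scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted quality scores, for illustration only.
actual = np.array([5, 6, 5, 7])
predicted = np.array([5.0, 5.0, 6.5, 7.0])

# Absolute error per example: 0, 1, 1.5, 0
abs_errors = np.abs(actual - predicted)

# MAE is the mean of those absolute errors: (0 + 1 + 1.5 + 0) / 4
mae_by_hand = abs_errors.mean()
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)  # both 0.625
```

Because the errors are not squared, one large miss counts in proportion to its size rather than dominating the score.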

### The Problem with "In-Sample" Scores
A model scored on its own training data can look accurate even when the patterns it learned don't hold in new data, in which case it will be very inaccurate in practice. We will use the train_test_split function to separate our data into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X,y, random_state = 1)
# Fix random_state so the validation score is reproducible between runs.
model = DecisionTreeRegressor(random_state = 1)
model.fit(train_X, train_y)
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
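
To see why the in-sample score is misleading, it helps to compute both scores for the same model. The sketch below uses synthetic noisy data (an assumption for illustration, not the wine data) and hypothetical variable names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration, not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = X_demo[:, 0] + rng.normal(0, 1.0, size=400)

# By default train_test_split holds out 25% of the rows for validation.
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

tree = DecisionTreeRegressor(random_state=1)
tree.fit(tr_X, tr_y)

# Score the same fitted tree on the data it saw and on the data it did not.
in_sample_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
validation_mae = mean_absolute_error(va_y, tree.predict(va_X))
print(in_sample_mae, validation_mae)
```

An unrestricted tree can memorize the noise in its training rows, so the in-sample error comes out far lower than the error on the held-out rows.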

## Part 4: Underfitting and Overfitting

Overfitting is when the model matches the training data almost perfectly but does poorly when given new data. This happens when a decision tree has many leaves, because each leaf then contains very few examples.

Underfitting is when the model performs poorly even on the training data because it fails to capture important patterns, for example when the tree has too few leaves to distinguish between examples.

We can use the max_leaf_nodes argument to control the trade-off between overfitting and underfitting by comparing MAE scores for different values of max_leaf_nodes.
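
Underfitting and overfitting are easier to tell apart when the training and validation MAE are printed side by side. The sketch below does this for a very small tree, a medium tree and an unlimited tree. The data is synthetic (a sine curve plus noise, an assumption for illustration), and the three tree sizes are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
# None means no limit on the number of leaves.
for leaves in [2, 20, None]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"max_leaf_nodes={leaves}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")
```

A tree that underfits has a high error on both sets, while a tree that overfits has a training error near zero and a noticeably higher validation error.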

In [None]:
def get_mae(max_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_nodes, random_state = 0)
    model.fit(train_X, train_y)
    predictions = model.predict(val_X)
    mae = mean_absolute_error(val_y, predictions)
    return mae

In [None]:
for max_leaf_nodes in [5,50,500,5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f'Max leaf nodes: {max_leaf_nodes} \t \t Mean Absolute Error: {my_mae}')

From the output above, a max_leaf_nodes value of 500 is sufficient to get the best validation MAE among the values tried.
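
Rather than reading the best value off the printed output, the tree size can also be selected programmatically. The sketch below runs on synthetic stand-in data so that it is self-contained. The helper validation_mae and the names scores, best_size and final_model are hypothetical. Refitting on all of the data at the end is one common choice, not the only one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data so the sketch runs on its own (not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)


def validation_mae(max_nodes):
    """Fit a tree with the given leaf limit and return its validation MAE."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_nodes, random_state=0)
    tree.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, tree.predict(va_X))


candidates = [5, 50, 500, 5000]
scores = {n: validation_mae(n) for n in candidates}

# Pick the candidate with the lowest validation MAE.
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])

# Refit on all of the data with the chosen size, now that validation is done.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=0)
final_model.fit(X_demo, y_demo)
```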