# Intro to ML
## Part 1: Reading Data

The read_csv() function reads a CSV file into a DataFrame, and describe() summarizes each numeric column (count, mean, standard deviation, minimum, quartiles, and maximum).

In [2]:
import pandas as pd
dataset = pd.read_csv("C:/Users/ali49/Downloads/winequality-red.csv")
print(dataset.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         
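As a minimal sketch of the same two calls without the local file path, the snippet below reads a small in-memory CSV. The three rows and the names `csv_text`, `sample`, and `summary` are illustrative assumptions, not part of the wine dataset file.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real wine CSV.
csv_text = """fixed acidity,pH,alcohol,quality
7.4,3.51,9.4,5
7.8,3.20,9.8,5
11.2,3.16,9.8,6
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# describe() returns a DataFrame indexed by statistic name.
summary = sample.describe()
print(summary.loc[["count", "mean", "max"]])
```

Because describe() returns a DataFrame, individual statistics can be looked up with `summary.loc["mean", "pH"]` rather than read off the printed table.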

## Part 2: Your First Machine Learning Model

We can list every column name, the candidate features plus the quality target, by using the columns attribute.

In [3]:
dataset.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

### Selecting Prediction Target

You can pull out a single column with dot notation; the result is stored in a Series. The prediction target is the column we want to predict, which here is quality. By convention it is called y.

In [4]:
y = dataset.quality
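Dot notation only works when the column name is a valid Python identifier, so a name like 'fixed acidity' needs bracket notation instead. The sketch below uses a hypothetical two-row frame (`wine`) to show both forms.

```python
import pandas as pd

# Hypothetical two-row frame using column names from the wine dataset.
wine = pd.DataFrame({"fixed acidity": [7.4, 7.8], "quality": [5, 5]})

# Dot notation works for names that are valid Python identifiers.
y_dot = wine.quality

# Bracket notation works for every column, including names with spaces.
y_bracket = wine["quality"]
acidity = wine["fixed acidity"]

print(type(y_dot).__name__)     # Series
print(y_dot.equals(y_bracket))  # True
```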

### Choosing Features
Select features by passing a list of column names. We'll use describe() and head() to look at this subset of the data; head() shows the first few rows.

In [5]:
features = ['fixed acidity', 'density', 'pH', 'sulphates', 'alcohol']
X = dataset[features]
print(X.describe())
print(X.head())

       fixed acidity      density           pH    sulphates      alcohol
count    1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean        8.319637     0.996747     3.311113     0.658149    10.422983
std         1.741096     0.001887     0.154386     0.169507     1.065668
min         4.600000     0.990070     2.740000     0.330000     8.400000
25%         7.100000     0.995600     3.210000     0.550000     9.500000
50%         7.900000     0.996750     3.310000     0.620000    10.200000
75%         9.200000     0.997835     3.400000     0.730000    11.100000
max        15.900000     1.003690     4.010000     2.000000    14.900000
   fixed acidity  density    pH  sulphates  alcohol
0            7.4   0.9978  3.51       0.56      9.4
1            7.8   0.9968  3.20       0.68      9.8
2            7.8   0.9970  3.26       0.65      9.8
3           11.2   0.9980  3.16       0.58      9.8
4            7.4   0.9978  3.51       0.56      9.4
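Indexing with a list of names returns a DataFrame, while indexing with a single name returns a Series. A small sketch, assuming a hypothetical four-row frame named `wine`:

```python
import pandas as pd

# Hypothetical four-row frame; values are for illustration only.
wine = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "alcohol": [9.4, 9.8, 9.8, 9.8],
    "quality": [5, 5, 5, 6],
})

feature_names = ["pH", "alcohol"]
X_small = wine[feature_names]  # list of names -> DataFrame
one_col = wine["pH"]           # single name -> Series

print(X_small.shape)    # (4, 2)
print(X_small.head(2))  # head(n) limits the preview to the first n rows
```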


### Building Your Model
We'll be using a decision tree to make our predictions. To learn about decision trees: https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

These are the steps to building a model:
1. Define: What type of model?
2. Fit: Capture patterns from provided data.
3. Predict: Use the fitted model to estimate the target for a set of rows.
4. Evaluate: Determine accuracy of predictions.


In [6]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state = 1)
model.fit(X,y)
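The four steps can be sketched end to end on toy data. Everything below (`X_toy`, `y_toy`, `toy_model`, and the values themselves) is a hypothetical stand-in for the wine data, and mean absolute error is just one possible choice for the evaluation step.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature column and a made-up quality score.
X_toy = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0, 12.0]})
y_toy = pd.Series([5, 5, 6, 7])

# 1. Define: choose the model type and fix the seed for reproducibility.
toy_model = DecisionTreeRegressor(random_state=1)

# 2. Fit: learn patterns from the provided data.
toy_model.fit(X_toy, y_toy)

# 3. Predict: estimate the target for a set of rows.
toy_predictions = toy_model.predict(X_toy)

# 4. Evaluate: compare predictions against the known answers.
toy_mae = mean_absolute_error(y_toy, toy_predictions)
print(toy_mae)  # 0.0, because an unrestricted tree memorizes four distinct rows
```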

### Prediction
In practice, you'll want to make predictions for new wines rather than wines already in the training data. For this example, though, we'll make predictions for the first few rows of the training data.

In [7]:
print("Making predictions for the following 5 wines")
print(X.head())
print("The predictions are")
print(model.predict(X.head()))

Making predictions for the following 5 wines
   fixed acidity  density    pH  sulphates  alcohol
0            7.4   0.9978  3.51       0.56      9.4
1            7.8   0.9968  3.20       0.68      9.8
2            7.8   0.9970  3.26       0.65      9.8
3           11.2   0.9980  3.16       0.58      9.8
4            7.4   0.9978  3.51       0.56      9.4
The predictions are
[5. 5. 5. 6. 5.]


## Part 3: Model Validation

A common evaluation metric used in practice is mean absolute error (MAE). The MAE is the average of the absolute differences between the actual and predicted values, taken over every example being scored.
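To make the arithmetic concrete, here is a hand computation of MAE in plain Python. The four `actual` and `predicted` scores are hypothetical values chosen only for illustration.

```python
# Hypothetical quality scores, chosen only to illustrate the arithmetic.
actual = [5, 6, 5, 7]
predicted = [5, 5, 6, 7]

# Absolute error per example: |actual - predicted|
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE is the mean of those absolute errors.
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # [0, 1, 1, 0]
print(mae)         # 0.5
```

An MAE of 0.5 here means the predictions are off by half a quality point on average.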

In [8]:
from sklearn.metrics import mean_absolute_error

# Score the model on the same rows it was trained on (an in-sample score).
predicted_quality = model.predict(X)
mean_absolute_error(y, predicted_quality)

0.0

### The problem with "In-Sample" Scores
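An in-sample score is computed on the same data the model was trained on, so it can look far better than the model deserves, as the MAE of 0.0 above shows. If the patterns learned from the training data don't hold in new data, the model will be very inaccurate in practice. We will use the train_test_split function to separate the data into training and validation sets.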

In [9]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X,y, random_state = 1)
model = DecisionTreeRegressor()
model.fit(train_X, train_y)
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

0.4975
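When no test_size is given, train_test_split holds out 25% of the rows for validation. The sketch below shows the resulting shapes on a hypothetical 8-row array; `X_demo`, `y_demo`, and the `tr_`/`va_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 rows, 2 feature columns.
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

# With no test_size given, scikit-learn holds out 25% of the rows.
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

print(tr_X.shape, va_X.shape)  # (6, 2) (2, 2)
```

Applied to the 1599-row wine data, the same default should leave 1199 rows for training and 400 for validation.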


## Part 4: Underfitting and Overfitting

Overfitting is when the model matches the training data almost perfectly but does poorly when given new data. This happens when a decision tree has many leaves, because very few examples fall under each leaf.

Underfitting is when the model fails to capture important patterns and so performs poorly even on the training data, for example because the tree has too few leaves or the features carry too little information.

We can use the max_leaf_nodes argument to trade off overfitting against underfitting: train a tree for several values of max_leaf_nodes and compare the validation MAE of each.
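One way to see the trade-off is to print the training MAE next to the validation MAE for trees of different sizes. This is a rough sketch on synthetic data, not the wine dataset: the noisy sine target, the leaf counts, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data: the target depends on one feature plus random noise.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.5, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, random_state=0)

results = {}
for leaves in [2, 20, 200]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    # Score the same tree on the rows it saw and on the held-out rows.
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"leaves={leaves:>3}  train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

A large gap between the two columns points toward overfitting, while high error in both points toward underfitting.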

In [10]:
def get_mae(max_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_nodes, random_state = 0)
    model.fit(train_X, train_y)
    predictions = model.predict(val_X)
    mae = mean_absolute_error(val_y, predictions)
    return mae

In [11]:
for max_leaf_nodes in [5,50,500,5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f'Max leaf nodes: {max_leaf_nodes} \t \t Mean Absolute Error: {my_mae}')

Max leaf nodes: 5 	 	 Mean Absolute Error: 0.51739165618039
Max leaf nodes: 50 	 	 Mean Absolute Error: 0.49670970957388244
Max leaf nodes: 500 	 	 Mean Absolute Error: 0.4825
Max leaf nodes: 5000 	 	 Mean Absolute Error: 0.4825


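From the output above, the validation MAE falls as max_leaf_nodes grows from 5 to 500 and is unchanged at 5000. Of the values tested, 500 is therefore the smallest setting that reaches the lowest MAE (0.4825). The identical score at 5000 suggests that the cap is no longer limiting the tree at that size.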
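The same choice can be made programmatically instead of by reading the printout. A minimal sketch, with the MAE values copied (rounded) from the output above into a hypothetical dictionary named `mae_by_leaves`:

```python
# MAE values copied (rounded to four decimals) from the printed output.
mae_by_leaves = {5: 0.5174, 50: 0.4967, 500: 0.4825, 5000: 0.4825}

# min() returns the first key with the lowest MAE. The dict is ordered from
# smallest to largest tree, so a tie goes to the smaller tree.
best_leaves = min(mae_by_leaves, key=mae_by_leaves.get)

print(best_leaves)  # 500
```

Preferring the smaller tree on a tie is a design choice: it gives the same validation score with a simpler model.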