Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable choice. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
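
For the regression case in the checklist above, checking for right skew and log-transforming the target might look roughly like this. It is a minimal sketch with made-up target values (`y` is hypothetical); comparing the mean to the median is only a rough heuristic for skew, and `log1p`/`expm1` assume the target is non-negative.

```python
import math
import statistics

# Hypothetical right-skewed target values (made up for illustration).
y = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 40.0, 120.0]

# Rough heuristic: in a right-skewed distribution the mean exceeds the median.
is_right_skewed = statistics.mean(y) > statistics.median(y)

# log1p computes log(1 + y), which is defined at y == 0.
y_log = [math.log1p(v) for v in y]

# expm1 inverts log1p, so predictions made on the log scale
# can be converted back to the original units.
y_back = [math.expm1(v) for v in y_log]

print(is_right_skewed)
print([round(v, 2) for v in y_log])
```

If you train on the transformed target, remember to invert the transform before reporting errors in the original units.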

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
import pandas as pd

In [2]:
DATA_PATH = '../data/'

In [3]:
# DATA_PATH already ends with '/', so no extra slash is needed
df = pd.read_csv(DATA_PATH + 'winequality-red.csv', sep=';')

In [4]:
df.shape

(1599, 12)

In [5]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [6]:
df.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [7]:
#define target variable
df['quality'].value_counts()

5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

In [8]:
df['quality'].describe()


count    1599.000000
mean        5.636023
std         0.807569
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         8.000000
Name: quality, dtype: float64

In [9]:
import seaborn as sns
y = df['quality']
# distplot is deprecated in recent seaborn releases; histplot is the replacement
sns.histplot(y);

ModuleNotFoundError: No module named 'seaborn'

In [10]:
# binary target: wines rated 7 or higher count as "great"
df['great'] = df['quality'] >= 7

In [11]:
df['great'].value_counts()

False    1382
True      217
Name: great, dtype: int64

In [12]:
#confirm that you are only passing true and false values
y = df['great']
y.unique()

array([False,  True])

In [13]:
#baseline prediction
y.value_counts(normalize=True)

False    0.86429
True     0.13571
Name: great, dtype: float64
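
The normalized counts above are the accuracy a model would get by always predicting the majority class. A small sketch of that arithmetic, using the class counts from the `value_counts()` output (the variable names are illustrative):

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({False: 1382, True: 217})

majority_class, majority_count = counts.most_common(1)[0]
total = sum(counts.values())

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = majority_count / total

print(majority_class, round(baseline_accuracy, 5))  # prints: False 0.86429
```

Any model you train should beat this number before its accuracy means much.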

In [14]:
df = df.drop(columns='quality')


In [15]:
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | great |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | False |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | False |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | False |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | False |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | False |


In [16]:
#do train_test split
from sklearn.model_selection import train_test_split

ModuleNotFoundError: No module named 'sklearn'

In [17]:
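# hold out 20% of the rows for validation, stratified on the target
train, val = train_test_split(df, train_size=0.80, test_size=0.20,
                              stratify=df['great'], random_state=42)
# hold out 20% of the remaining rows for testing
train, test = train_test_split(train, train_size=0.80, test_size=0.20,
                               stratify=train['great'], random_state=42)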
train.shape, val.shape, test.shape


NameError: name 'train_test_split' is not defined
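
Because the second split is applied to what remains after the first, the three sets are not 80/20/20 of the original data. The sketch below works out the resulting proportions; the row counts it prints are approximations, since the exact numbers depend on how the splitter rounds.

```python
# Fractions used in the two successive splits above.
first_train_fraction = 0.80   # train vs. val
second_train_fraction = 0.80  # train vs. test, applied to the remaining 80%

val_fraction = 1 - first_train_fraction
train_fraction = first_train_fraction * second_train_fraction
test_fraction = first_train_fraction * (1 - second_train_fraction)

n_rows = 1599  # from df.shape above
for name, fraction in [("train", train_fraction),
                       ("val", val_fraction),
                       ("test", test_fraction)]:
    # Approximate counts; the exact numbers depend on how the splitter rounds.
    print(f"{name}: {fraction:.0%} of rows, about {round(n_rows * fraction)}")
```

So the final training set is roughly 64% of the data, with about 20% for validation and 16% for testing.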

In [18]:
target = 'great'
features = train.columns.drop([target])
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
# the test matrices must come from the held-out test split, not from train
X_test = test[features]
y_test = test[target]

NameError: name 'train' is not defined

In [19]:
features

NameError: name 'features' is not defined

In [20]:
X_train.head()

NameError: name 'X_train' is not defined

In [21]:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt



ModuleNotFoundError: No module named 'sklearn'

In [22]:
#create pipeline
pipeline = make_pipeline(StandardScaler(),
                        DecisionTreeClassifier(max_depth=3))

NameError: name 'make_pipeline' is not defined

In [23]:
pipeline.fit(X_train, y_train)
print(f'Val score: {pipeline.score(X_val, y_val)}')

NameError: name 'pipeline' is not defined

In [24]:
# the majority class frequency (about 86%) is above 70%, so accuracy alone could be misleading


In [25]:
#ROC AUC
from sklearn.metrics import roc_auc_score
y_pred_proba = pipeline.predict_proba(X_val)[:,1]
roc_auc_score(y_val, y_pred_proba)

ModuleNotFoundError: No module named 'sklearn'
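
One way to read ROC AUC is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. The function name `roc_auc_pairwise` and the toy labels and scores are made up for illustration; this is not how a library would implement it at scale.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [False, False, True, False, True, True]
y_score = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

print(roc_auc_pairwise(y_true, y_score))
```

Here 6 of the 9 positive/negative pairs are ordered correctly, so the score is about 0.67. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.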

In [26]:
# "The ROC curve is created by plotting the true positive rate (TPR) 
# against the false positive rate (FPR) 
# at various threshold settings."

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)

ModuleNotFoundError: No module named 'sklearn'
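
To make the definition quoted above concrete, here is a plain-Python sketch that computes one (FPR, TPR) point per threshold. The helper `roc_points` and the toy data are hypothetical, and the sketch assumes both classes are present; a library routine may choose its thresholds differently.

```python
def roc_points(y_true, y_score):
    """Return (threshold, FPR, TPR) for each distinct score, highest first."""
    n_pos = sum(1 for t in y_true if t)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in sorted(set(y_score), reverse=True):
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and not t)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points

# Made-up labels and predicted probabilities, for illustration only.
toy_true = [False, False, True, False, True, True]
toy_score = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

for threshold, toy_fpr, toy_tpr in roc_points(toy_true, toy_score):
    print(f"threshold={threshold:.2f}  FPR={toy_fpr:.2f}  TPR={toy_tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates rise together until they reach (1, 1).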

In [27]:
#see the results in a table
pd.DataFrame({'False Positive Rate': fpr,
             'True Positive Rate': tpr,
             'Thresholds': thresholds})

NameError: name 'fpr' is not defined

In [28]:
# See the results on a plot. 
# This is the "Receiver Operating Characteristic" curve

plt.scatter(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

NameError: name 'plt' is not defined

In [29]:
# Precision and Recall
# recall is also referred to as sensitivity


In [30]:
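# plot_confusion_matrix did not load in this environment
# (newer scikit-learn releases provide ConfusionMatrixDisplay.from_estimator instead)
#from sklearn.metrics import plot_confusion_matrix

#plot_confusion_matrix(pipeline, X_val, y_val, xticks_rotation='vertical',
#                     values_format='.0f', cmap='Blues');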

In [31]:
from sklearn.metrics import confusion_matrix
y_pred = pipeline.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
cm

ModuleNotFoundError: No module named 'sklearn'

In [32]:
normalize_cm = cm/cm.sum(axis=1)[:, np.newaxis]
normalize_cm

NameError: name 'cm' is not defined
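
The normalization cell relies on NumPy broadcasting: the row totals are reshaped into a column so each row is divided by its own total. The sketch below shows this on a small matrix. The FP, FN and TP values follow the counts used in the precision and recall cell of this notebook, while TN = 274 is an assumed value for illustration.

```python
import numpy as np

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
# TN = 274 is an assumed value; FP, FN and TP follow the notebook's counts.
cm_example = np.array([[274, 3],
                       [27, 16]])

row_totals = cm_example.sum(axis=1)          # shape (2,)
row_totals_col = row_totals[:, np.newaxis]   # shape (2, 1), broadcasts across columns

normalized = cm_example / row_totals_col

# keepdims=True is an equivalent way to get the (2, 1) shape.
assert np.allclose(normalized, cm_example / cm_example.sum(axis=1, keepdims=True))

print(row_totals_col.shape)
print(normalized.round(3))
```

After row normalization each row sums to 1, and the bottom-right entry is the recall of the positive class.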

In [33]:
cm.sum(axis=1)[:, np.newaxis].shape

NameError: name 'cm' is not defined

In [34]:
import seaborn as sns
from sklearn.utils.multiclass import unique_labels

cols = unique_labels(y_val)

df_cm = pd.DataFrame(cm, columns=cols, index=cols)
plt.figure(figsize=(10,7))
sns.heatmap(df_cm, annot=True, cmap='Blues', fmt='.0f');

ModuleNotFoundError: No module named 'seaborn'

In [35]:
from sklearn.metrics import classification_report

print(classification_report(y_val, y_pred))

ModuleNotFoundError: No module named 'sklearn'

In [36]:
# precision of the positive class
# precision is the ratio of true positives to all predicted positives: TP / (TP + FP)
16/(16+3)

0.8421052631578947

# recall of the positive class
# recall is the number of true positives divided by the number of true positives
# plus the number of false negatives: TP / (TP + FN)
16/(16+27)
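
The two hand calculations above can be wrapped in a small helper. This is a sketch using the same counts (16 true positives, 3 false positives, 27 false negatives); the function name `precision_recall_f1` is made up here, F1 is added as the harmonic mean of the two, and zero denominators are not handled.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=16, fp=3, fn=27)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

High precision with low recall, as here, means the model is rarely wrong when it predicts the positive class but misses most of the actual positives.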

Run the pipeline with random forest 

In [37]:
# Run the pipeline with random forest
# (make_pipeline and StandardScaler were imported in an earlier cell)
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(StandardScaler(),
                        RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
pipeline.fit(X_train, y_train)
print(f'Validation Acc: {pipeline.score(X_val, y_val)}')