<a href="https://colab.research.google.com/github/dewangulbuddin/machine-learning-iitm/blob/main/Week_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Week: 1

##Step 1: Look at the big picture

1. Frame the problem
2. Select performance measure
3. List and check assumptions

  1.1 Frame the problem

      * What is input and output?
      * Business objectives from the model
      * Supervised, unsupervised or RL problem
      * Classification, regression or some other task
      * Single or multiple outputs
      * Continuous learning or periodic updates
      * Batch or online learning
  
  1.2 Selection of performance measure

      * Regression
        * Mean squared error (MSE), or
        * Mean absolute error (MAE)

      * Classification metric
        * Precision
        * Recall
        * F1-Score
        * Accuracy
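
The measures listed above can be sketched in plain Python to show what each one computes. The sample values and the helper names (`mse`, `mae`, `classification_metrics`) are made up for illustration; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
# Hand-written versions of the metrics, applied to made-up values.

def mse(y_true, y_pred):
    """Mean squared error: large errors are penalised more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Regression example (illustrative quality scores)
reg_true = [5, 6, 7, 5]
reg_pred = [5.5, 6.0, 6.0, 5.5]
reg_mse = mse(reg_true, reg_pred)
reg_mae = mae(reg_true, reg_pred)

# Classification example (illustrative binary labels)
cls_true = [1, 0, 1, 1, 0, 0]
cls_pred = [1, 0, 0, 1, 1, 0]
cls_metrics = classification_metrics(cls_true, cls_pred)

print(reg_mse, reg_mae, cls_metrics)
```

MSE reacts more strongly than MAE to the single error of 1.0 in the regression example, which is why MAE is often preferred when the data contains many outliers.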


##Step 2: Get the data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt     #for graphing purpose
import seaborn as sns               #for plotting histogram
import numpy as np

In [None]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
#we use pandas to read the data from the csv file
data = pd.read_csv(data_url, sep = ";") #specify the separator by identifying it in the csv file
#Here since we use pd.read_csv, the results stored in "data" will already be
#in dataframe format

In [None]:
#Get the first 5 rows
data.head()

In [None]:
#separate features and labels
#apart from the last column [:-1], all are features
feature_list = data.columns[:-1].values
#print("Feature List: ",feature_list)
#last column is the label
label = [data.columns[-1]]
#print("Label: ",label)

###Data Statistics

In [None]:
#basic info on data
data.info()

In [None]:
#learn about the numeric attributes of the data
data.describe()

In [None]:
#count the number of samples under a given quality index
data['quality'].value_counts()

In [None]:
#visualise the above count
sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')

###Make Test and Training Set

In [None]:
#make a function to split the training and testing data
def split_train_test(data,test_ratio):
  np.random.seed(42)                                        #fixing the seed makes the shuffle, and hence the split, reproducible across runs
  shuffled_indices = np.random.permutation(len(data))       #random.permutation(x) returns the numbers 0 to (x-1) in random order
  test_set_size = int(len(data) * test_ratio)
  test_indices = shuffled_indices[:test_set_size]               #generate test set indices
  train_indices = shuffled_indices[test_set_size:]              #generate training dataset indices
  return data.iloc[train_indices], data.iloc[test_indices]
#call the above function & specify the splitting ratio
train_set_m, test_set_m = split_train_test(data,0.2)          #split as 80-20, m for manual

In [None]:
#or make the test set using scikit-learn
#here we use Random Sampling
#Random Sampling randomly selects k% of the points for the test set
from sklearn.model_selection import train_test_split
train_set_r, test_set_r = train_test_split(data, test_size = 0.2, random_state = 42)
#specify the random_state so that we get the same sets every time we run this piece
#of code, which gives consistent results for study purposes
#in actual use, we can leave random_state at its default (None)

In [None]:
#here we use Stratified Sampling
#StratifiedShuffleSplit (SSS) divides the samples such that the test set is
#representative of the overall distribution of the label
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
#n_splits is the number of re-shuffling and splitting iterations
#.split() is a method of sss, which generates positional indices to split the data
#into training and test sets
for train_index, test_index in sss.split(data, data["quality"]):
  #the indices are positions, so select rows with iloc (loc gives the same result
  #here only because data has the default 0..n-1 index)
  train_set_s = data.iloc[train_index]
  test_set_s = data.iloc[test_index]
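
To make the idea of stratified sampling concrete, here is a minimal plain-Python sketch. It is a simplified illustration, not how `StratifiedShuffleSplit` is implemented; the helper name `stratified_split` and the label values are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio, seed=42):
    """Return (train_indices, test_indices), sampling each label group separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                        # shuffle within the group
        n_test = round(len(indices) * test_ratio)   # same share from every group
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Made-up quality labels: 50% are 5, 40% are 6, 10% are 7
labels = [5] * 50 + [6] * 40 + [7] * 10
train_idx, test_idx = stratified_split(labels, test_ratio=0.2)
test_labels = [labels[i] for i in test_idx]

# Each quality value keeps its share in the test set
for quality in (5, 6, 7):
    print(quality, test_labels.count(quality) / len(test_labels))
```

Because every label group contributes the same fraction of its samples, the rare quality values cannot be over- or under-represented in the test set by chance.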

####Sampling Bias Comparison

In [None]:
#preparing it for comparison in terms of proportions:
overall_dist = data["quality"].value_counts()/len(data)
random_dist = test_set_r["quality"].value_counts()/len(test_set_r)
strat_dist = test_set_s["quality"].value_counts()/len(test_set_s)
#the smaller the difference from the overall distribution, the more representative the test set

In [None]:
dist_comp = pd.DataFrame({'overall':overall_dist, 'stratified': strat_dist, 'random': random_dist})
dist_comp['s-o'] = dist_comp['stratified'] - dist_comp['overall']
dist_comp['(s-o)%'] = 100*dist_comp['s-o']/dist_comp['overall']
dist_comp['r-o'] = dist_comp['random'] - dist_comp['overall']
dist_comp['(r-o)%'] = 100*dist_comp['r-o']/dist_comp['overall']
dist_comp

##Step 3: Data Visualisation

In [None]:
#explore using the stratified training set
exploration_set = train_set_s.copy()
#copy the training set to an exploration set, to prevent changes to the original data
#in case of modification during exploration. Here we use the complete set because
#the training data is small

#using seaborn library
sns.scatterplot(x = 'fixed acidity', y = 'density', hue = 'quality', data = exploration_set)

In [None]:
#using the pandas plot method (built on matplotlib)
exploration_set.plot(kind = 'scatter', x = 'fixed acidity', y = 'density', 
                     alpha = 0.5, c = 'quality', cmap = plt.get_cmap('jet'))

In [None]:
#calculate correlation between features
corr_matrix = exploration_set.corr()

#let's check which features are correlated with the label, here 'quality'
corr_matrix['quality']
#+1 = strong positive correlation
#-1 = strong negative correlation
#0  = no linear correlation
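
By default `DataFrame.corr()` computes the Pearson correlation coefficient. The sketch below computes it by hand on made-up values to show where the +1 and -1 extremes come from; `pearson_r` is a helper name introduced here for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values, not taken from the wine dataset
alcohol = [9.0, 10.0, 11.0, 12.0]
rising = [4.0, 5.0, 6.0, 7.0]      # increases in step with alcohol
falling = [7.0, 6.0, 5.0, 4.0]     # decreases in step with alcohol

r_pos = pearson_r(alcohol, rising)
r_neg = pearson_r(alcohol, falling)
print(r_pos, r_neg)
```

The coefficient only captures linear relationships, so a value near 0 does not rule out a non-linear dependence between two features.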

In [None]:
#visualize correlation matrix with heatmap
plt.figure(figsize = (14,7))
sns.heatmap(corr_matrix, annot = True); #the trailing ';' suppresses the text representation of the returned Axes object in the notebook output

In [None]:
#we can even correlate between specific feature sets using scatter matrix
from pandas.plotting import scatter_matrix
attribute_list = ['citric acid', 'pH', 'alcohol', 'sulphates', 'quality']
scatter_matrix(exploration_set[attribute_list]);

##Step 4: Prepare data for ML algorithm

###Separate Features from Labels

In [None]:
#separate features and labels from the training set
#copy all features leaving aside the labels
wine_features = train_set_s.drop('quality', axis = 1)
#copy the label list
wine_labels = train_set_s['quality'].copy()
#Data Cleaning
#check for missing values in feature set
wine_features.isna().sum()
#if the count of NaN values for every feature is 0, there is no missing data
#if there is missing data, we can either drop the affected rows using the dropna()
#method or fill in (impute) the missing values

###Impute Missing Values

In [None]:
#impute missing values using median
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')
#if there are non-numerical attributes, they need to be dropped or encoded before median imputation
imputer.fit(wine_features)
imputer.statistics_
#statistics learnt by the imputer. the resulting array is a collection of median
#values for each feature.

In [None]:
#we can cross check by directly calculating median
wine_features.median()

In [None]:
#then we use the trained imputer to transform the training set such that missing
#values are replaced by the medians
tr_features = imputer.transform(wine_features)
tr_features.shape
#type(tr_features)

In [None]:
#now tr_features is an array which has just the numerical values in it.
#we need to form it into a dataframe object again with proper column headings
wine_features_tr = pd.DataFrame(tr_features, columns = wine_features.columns)
wine_features_tr
#type(wine_features_tr)
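
To see the effect of median imputation on data that actually contains gaps, here is a small sketch on a made-up array. The array `X` and its values are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with two missing entries (np.nan)
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 40.0]])

toy_imputer = SimpleImputer(strategy="median")
X_filled = toy_imputer.fit_transform(X)

# Medians are computed per column, ignoring the missing entries
print(toy_imputer.statistics_)
print(X_filled)
```

Each missing entry is replaced by the median of its own column, and the observed values are left untouched.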

###Handling Text and Categorical Attributes

In [None]:
#converting categories to numbers
#Method 1: Using OrdinalEncoder (for ordered (ordinal) data)
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
#then call the fit_transform() method on the ordinal_encoder object to convert text to
#numbers
#the list of encoded categories can then be viewed using the categories_ instance variable

#Note: One issue with this representation is that the ML algorithm would assume
#that two nearby values are more similar than two distant ones
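
Since the wine data has no text columns, the following sketch applies `OrdinalEncoder` to a made-up `sweetness` attribute. The category names and their order are assumptions chosen for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered attribute: dry < medium < sweet
sweetness = [["dry"], ["sweet"], ["medium"], ["dry"]]

# Passing the categories explicitly fixes the order; without it the
# encoder would sort the category names alphabetically.
toy_ordinal_encoder = OrdinalEncoder(categories=[["dry", "medium", "sweet"]])
sweetness_encoded = toy_ordinal_encoder.fit_transform(sweetness)

print(sweetness_encoded.ravel())
print(toy_ordinal_encoder.categories_)
```

The encoded values 0, 1 and 2 carry the order dry < medium < sweet, which is only meaningful because the attribute really is ordered.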

In [None]:
#Method 2: Using OneHotEncoder (for unordered (nominal) data)
#here we create one binary feature per category: 1-present(hot) 0-absent(cold)
#the new features are referred to as dummy features

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
#then call fit_transform() on the OneHotEncoder object
#output is a SciPy sparse matrix rather than a numpy array. This saves space when
#there is a huge number of categories
#in case we want the dense array format, convert it using the toarray() method
#the list of categories can be obtained via the categories_ instance variable

#if the number of categories is huge, OHE will result in a large number of features.
#address it by: 1) replacing the categorical input with numerical features related
#to the categories or, 2) converting it to low dimensional learnable vectors called embeddings
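
A short sketch of one-hot encoding on a made-up `regions` attribute follows. The column and its values are hypothetical, since the wine data contains no nominal features.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical unordered attribute
regions = [["north"], ["south"], ["east"], ["south"]]

toy_cat_encoder = OneHotEncoder()
regions_sparse = toy_cat_encoder.fit_transform(regions)   # SciPy sparse matrix
regions_dense = regions_sparse.toarray()                  # dense numpy array

# One column per category, in the order given by categories_
print(toy_cat_encoder.categories_)
print(regions_dense)
```

Each row contains exactly one 1, in the column of its category, so no artificial ordering between the regions is introduced.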

###Feature Scaling and Transformation Pipeline

In [None]:
#most ML algos don't perform well when input features are on very different scales
#scaling of target labels is generally not required

#1) Min-max scaling or Normalization:
#(Current value - Min value)/(Max value - Min value)
#this way, all values fall in [0,1]
#Scikit-Learn provides the MinMaxScaler transformer for this
#its feature_range hyperparameter sets the target range if [0,1] is not wanted

#2) Standardization:
#(Current value - Mean value)/(Standard Deviation), to give a feature of unit variance
#No bounds on data unlike Normalization
#Less affected by outliers compared to Normalization
#Scikit-Learn provides StandardScaler

#Note: ALWAYS learn these transformers on training data and never on full data
#only then apply them to training and test set to transform them

#now we use pipeline to line up transformations in an intended order
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
transform_pipeline = Pipeline([('imputer', SimpleImputer(strategy = 'median')), 
                               ('std_scaler', StandardScaler()),])
wine_features_tr = transform_pipeline.fit_transform(wine_features)
#Missing value imputation followed by standardization
#a ('name', estimator) pair is defined for each step
#__ (double underscore) is not allowed in a step name
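
The two scaling formulas described in the cell above can be written out in plain Python. The helper names `min_max_scale` and `standardize` and the sample values are introduced here for illustration; the standard deviation is the population one (dividing by n), which is also what `StandardScaler` uses.

```python
from math import sqrt

def min_max_scale(values, feature_range=(0.0, 1.0)):
    """(value - min) / (max - min), stretched to the requested range."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]

def standardize(values):
    """(value - mean) / standard deviation, giving zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Made-up alcohol values with one unusually large entry
alcohol = [9.0, 10.0, 11.0, 14.0]

normalized = min_max_scale(alcohol)
standardized = standardize(alcohol)
print(normalized)
print(standardized)
```

The large value 14.0 pins the top of the min-max range and squeezes the other values towards 0, whereas standardization has no fixed bounds.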

#####Transform Mixed features

In [None]:
#real world datasets have both categorical and numerical features and hence we
#need to apply different transformations to them
#Our dataset doesn't have mixed features, but if it did, ColumnTransformer from
#Scikit-Learn would be the tool to use

#For illustration purpose, consider the example below:
'''from sklearn.compose import ColumnTransformer
num_attribs = list(wine_features)
cat_attribs = ["place_of_manufacturing"]
full_pipeline = ColumnTransformer([('num', num_pipeline, num_attribs), 
                                   ('cat', OneHotEncoder(), cat_attribs)])
wine_features_tr = full_pipeline.fit_transform(wine_features)'''

#where num_pipeline is a pipeline which needs to be created to handle numerical
#values, while OHE handles categorical
#ColumnTransformer applies each transformation to the appropriate columns and then
#concatenates the outputs along the columns
#here both transformations must return the same number of rows
#the numerical transformation returns a dense matrix while the categorical one
#returns a sparse matrix. ColumnTransformer automatically determines the type of
#output based on the density of the resulting matrix (see its sparse_threshold parameter)
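
A runnable variant of the illustration above, on a small made-up DataFrame, might look as follows. The frame `toy`, its values and the `place_of_manufacturing` column are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with two numerical columns and one hypothetical categorical column
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "place_of_manufacturing": ["north", "south", "north", "east"],
})
num_attribs = ["alcohol", "pH"]
cat_attribs = ["place_of_manufacturing"]

# Numerical columns: impute, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each transformer is applied only to the columns listed for it
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
toy_tr = full_pipeline.fit_transform(toy)
print(toy_tr.shape)
```

The result has one row per sample and five columns: two scaled numerical features followed by three one-hot columns, one per place of manufacturing.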

##Step 5: Select and train ML Models

  * It is a good practice to build a quick baseline model on the preprocessed data and get an idea about model performance

####a) Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(wine_features_tr, wine_labels)
#The model is now trained. We can evaluate its performance on the training/test
#sets. For regression we use mean squared error (MSE) as the evaluation metric
from sklearn.metrics import mean_squared_error
quality_predictions = lin_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)

In [None]:
#Let's evaluate performance on the test set
#copy all features apart from label
wine_features_test = test_set_s.drop('quality', axis = 1)
#copy label list
wine_labels_test = test_set_s['quality'].copy()
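#apply the transformation learnt on the training set (transform, not fit_transform,
#so that no statistics are learnt from the test data)
wine_features_test_tr = transform_pipeline.transform(wine_features_test)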
#call predict function and perform MSE
quality_test_predictions = lin_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

In [None]:
#visualising the error
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual Quality')
plt.ylabel('Predicted Quality');

####b) Decision Tree Regressor

In [None]:
#the linear model seems to be making errors in the low/high quality regions
#so we try Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(wine_features_tr, wine_labels)
#Evaluate performance of model on training set
quality_predictions = tree_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)

In [None]:
#training set has 0 error. Let's check the test set
quality_test_predictions = tree_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)
#test error is much higher (0.66 in the original run), i.e. the model has overfitted

In [None]:
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual Quality')
plt.ylabel('Predicted Quality')

In [None]:
#we can use cross validation (CV) for robust evaluation of model performance

from sklearn.model_selection import cross_val_score
#this provides a separate MSE for each validation set, which we can use to get a
#mean estimation of MSE as well as the standard deviation, which helps us determine
#how precise the estimate is
#the additional cost for this step is additional training runs

def display_scores(scores):
  print("Scores: ", scores)
  print("Mean: ", scores.mean())
  print("Standard Deviation: ", scores.std())
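
The mechanics of cross validation can be sketched in plain Python. The helpers `k_fold_indices` and `mean_predictor_mse`, the labels and the trivial "predict the training mean" model are all made up for illustration; `cross_val_score` handles the splitting and refitting internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def mean_predictor_mse(y, train_idx, val_idx):
    """Fit a trivial model (the training mean) and return its validation MSE."""
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return sum((y[i] - mean) ** 2 for i in val_idx) / len(val_idx)

# Made-up quality labels
y = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6]

# One training run and one MSE per fold
fold_mse = [mean_predictor_mse(y, tr, va)
            for tr, va in k_fold_indices(len(y), k=5)]
cv_mean = sum(fold_mse) / len(fold_mse)
print(fold_mse, cv_mean)
```

Every sample is used for validation exactly once. Note that scikit-learn scorers follow a "higher is better" convention, so the `neg_mean_squared_error` scoring returns negated MSE values that have to be negated again to obtain the MSE.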

####Linear Regression CV

In [None]:
scores = cross_val_score(lin_reg, wine_features_tr, wine_labels, 
                         scoring = "neg_mean_squared_error", cv = 10)
lin_reg_mse_scores = -scores
display_scores(lin_reg_mse_scores)

####Decision Tree CV

In [None]:
scores = cross_val_score(tree_reg, wine_features_tr, wine_labels, 
                         scoring = "neg_mean_squared_error", cv = 10)
tree_mse_scores = -scores
display_scores(tree_mse_scores)

####Random forest CV

* It builds multiple decision trees on randomly selected subsets of the samples and features, and then averages their predictions
* Ensemble learning, i.e. building a model on top of other models, often improves the performance of ML models

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(wine_features_tr, wine_labels)

scores = cross_val_score(forest_reg, wine_features_tr, wine_labels, 
                         scoring = "neg_mean_squared_error", cv = 10)
forest_mse_scores = -scores
display_scores(forest_mse_scores)

In [None]:
quality_test_predictions = forest_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

In [None]:
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

#####Random forest looks more promising than the previous two
Note: It's a good practice to quickly build a few such models without tuning their hyperparameters, shortlist the promising ones among them, and save those models to disk in Python pickle format
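
Saving a shortlisted model with `pickle` could look like the sketch below. It trains on synthetic data, and the file name `lin_reg.pkl` and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the prepared training set
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

model = LinearRegression().fit(X, y)

# Save the fitted model to disk (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "lin_reg.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that it predicts the same values
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that created it, and pickle files from untrusted sources should never be loaded.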

##Step 6: Finetune the model

In [None]:
#Tuning hyperparameters can lead to better performance of ML models
#Scikit-Learn provides GridSearchCV for this purpose

from sklearn.model_selection import GridSearchCV
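
A minimal sketch of a grid search on a random forest follows. It uses synthetic data, and the parameter grid is an arbitrary small example chosen to run quickly, not a recommended setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared wine features and labels
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Every combination in the grid is evaluated with cross validation
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

best_params = grid_search.best_params_
best_mse = -grid_search.best_score_   # scores are negated MSE values
print(best_params, best_mse)
```

With 2 x 2 combinations and 3 folds this trains 12 models, plus one final refit of the best combination on the whole training data, which is then available as `grid_search.best_estimator_`.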