# Supervised Machine Learning Systems - (Classification)

In [1]:
# Helper functions to display a video or an image
from IPython.display import HTML

def display_video(src):
    # Build the embed URL once so the printed source matches the iframe source
    url = src + '?autoplay=1;modestbranding=1;rel=0'
    print('Source : ' + url)
    return HTML('<iframe width="800" height="400" src="' + url + '" frameborder="0" allowfullscreen></iframe>')

def display_image(src):
    print('Source : ' + src)
    # Quote the src attribute (the URL may contain '=') and drop the invalid </img> closing tag
    return HTML('<img width="600" height="300" src="' + src + '">')

## What is a Classification Problem?

Dependent vs Independent variables:

1. **Independent Variables for classification** - Also called the features of the dataset. These are the variables that, when varied, can affect the target class we want to predict.
2. **Dependent Variable for classification** - When your target variable takes one of a fixed set of class labels, it's a classification problem. Examples include classifying pictures as dogs or cats, or a tumour as cancerous or non-cancerous. You are predicting discrete classes here, not a continuous quantity.

Let's take an example to understand it clearly:

<b> [Breast Cancer Diagnostic] </b>

There are two main classifications of tumors. One is known as benign and the other as malignant. A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body. A malignant tumor is a tumor that may invade its surrounding tissue or spread around the body.

In [2]:
display_image('https://www.verywellhealth.com/thmb/xnYC1DVmfPtwjWCEdO0HjSZbcBo=/1787x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/514240-article-img-malignant-vs-benign-tumor2111891f-54cc-47aa-8967-4cd5411fdb2f-5a2848f122fa3a0037c544be.png')

Source : https://www.verywellhealth.com/thmb/xnYC1DVmfPtwjWCEdO0HjSZbcBo=/1787x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/514240-article-img-malignant-vs-benign-tumor2111891f-54cc-47aa-8967-4cd5411fdb2f-5a2848f122fa3a0037c544be.png


Our target is to train a Logistic Regression model that can predict whether the cancer is benign (B) or malignant (M).

Attribute Information:
<br>1) ID number 
<br>2) Diagnosis (M = malignant, B = benign) 
<br>3-32) Ten real-valued features are computed for each cell nucleus: 
<br>a) radius (mean of distances from center to points on the perimeter) 
<br>b) texture (standard deviation of gray-scale values) 
<br>c) perimeter 
<br>d) area 
<br>e) smoothness (local variation in radius lengths) 
<br>f) compactness (perimeter^2 / area - 1.0) 
<br>g) concavity (severity of concave portions of the contour) 
<br>h) concave points (number of concave portions of the contour) 
<br>i) symmetry 
<br>j) fractal dimension ("coastline approximation" - 1)

**`'Diagnosis'`** column is the **Dependent Variable or target column** because we want our algorithm to predict this class.

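**`'3-32'`** are your **Features or Independent Variables**, which will help you predict the Benign/Malignant class. Vary any one of them and it may change the predicted diagnosis. Column 1, the ID number, only identifies the record and carries no diagnostic information, so it is not used as a feature.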
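
As a minimal sketch of this feature/target split, assuming scikit-learn is installed: its bundled copy of the Wisconsin Diagnostic Breast Cancer data (`load_breast_cancer`) ships without the ID column and encodes the diagnosis as 0 = malignant, 1 = benign instead of 'M'/'B'. The variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X = dataset.data      # independent variables: 30 real-valued features per tumor
y = dataset.target    # dependent variable: 0 = malignant, 1 = benign

print(X.shape)                           # (number of samples, number of features)
print(list(dataset.target_names))        # class names in the order of their codes
print(list(dataset.feature_names[:3]))   # first few feature names
```

Every row of `X` describes one tumor, and the matching entry of `y` is the class the model should learn to predict.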

## Building a Machine Learning classifier model

Now we will discuss the Logistic Regression algorithm. Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.

Instead of our output vector y being a continuous range of values, it will only be 'M' or 'B'.
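
To make this concrete, here is a minimal sketch of how a logistic regression model turns a weighted sum of features into a class label: the sigmoid function squashes the sum into a probability, and a threshold (0.5 here) picks 'M' or 'B'. The weights, bias, and feature values are made up for illustration; a real model learns them from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_diagnosis(features, weights, bias, threshold=0.5):
    """Return 'M' if the modelled probability of malignancy reaches the threshold, else 'B'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability_malignant = sigmoid(z)
    return 'M' if probability_malignant >= threshold else 'B'

# Hypothetical weights for two made-up features.
weights = [1.5, -0.5]
bias = -2.0

print(predict_diagnosis([3.0, 1.0], weights, bias))  # z = 2.0, probability above 0.5
print(predict_diagnosis([0.5, 1.0], weights, bias))  # z = -1.75, probability below 0.5
```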

# A Famous Classification Task (Hands-On !)

It's time for you to build your first classification model and run it on the Titanic survival prediction problem.

Load the train and test sets and inspect the features yourself using pandas.

Once you have built your model and made your predictions, save them to a CSV file and upload it to Kaggle to see how you score.

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load the Titanic training set '../Data/train.csv' into a dataframe and view its head (the test set '../Data/test.csv' is loaded further below)
df = pd.read_csv('../Data/train.csv')
df.head()

In [None]:
# printing missing values in dataset
print(df.isnull().sum())

In [None]:
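# Replacing missing values in numeric columns with the column mean
# (numeric_only=True skips text columns such as 'Name', which have no mean)
df.fillna(df.mean(numeric_only=True), inplace=True)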

In [None]:
# printing missing values in dataset
print(df.isnull().sum())

In [None]:
# Load the features to a variable X. Make sure that you remove 'Age', 'Embarked', 'Fare' from X.
X = df.drop(['Survived','Age','Embarked','Fare'],axis=1)
# Load the dependent variable to y
y = df['Survived']

In [None]:
# View the head of X and y
print(X.head())
print(y.head())

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

data=pd.read_csv("../Data/trainoriginal.csv")
X_test = pd.read_csv('../Data/testOriginal.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels
data_df = pd.concat([data, X_test])
# Extract the title (the word before the first period) from each name
data_df['Title'] = data_df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}
data_df.replace({'Title': mapping}, inplace=True)
titles = ['Dr', 'Master', 'Miss', 'Mr', 'Mrs', 'Rev']
# Median age per title, looked up by label rather than by position
median_age_by_title = data_df.groupby('Title')['Age'].median()
for title in titles:
    age_to_impute = median_age_by_title[title]
    data_df.loc[(data_df['Age'].isnull()) & (data_df['Title'] == title), 'Age'] = age_to_impute
data['Age'] = data_df['Age'][:891]
X_test['Age'] = data_df['Age'][891:]
data_df.drop('Title', axis = 1, inplace = True)
data_df['Family_Size'] = data_df['Parch'] + data_df['SibSp']
data['Family_Size'] = data_df['Family_Size'][:891]
X_test['Family_Size'] = data_df['Family_Size'][891:]
# Encode sex as 0/1; assigning the result avoids in-place edits on a column selection
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
X_test['Sex'] = X_test['Sex'].map({'male': 0, 'female': 1})
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].median())
data_df['Farebin'] = pd.qcut(data_df['Fare'], 5)
label = LabelEncoder()
data_df['FareBin_Code'] = label.fit_transform(data_df['Farebin'])
data['FareBin_Code'] = data_df['FareBin_Code'][:891]
X_test['FareBin_Code'] = data_df['FareBin_Code'][891:]
data_df['Agebin'] = pd.qcut(data_df['Age'], 4)
label = LabelEncoder()
data_df['AgeBin_Code'] = label.fit_transform(data_df['Agebin'])
data['AgeBin_Code'] = data_df['AgeBin_Code'][:891]
X_test['AgeBin_Code'] = data_df['AgeBin_Code'][891:]
data_df['Last_Name'] = data_df['Name'].apply(lambda x: str.split(x, ",")[0])
data_df['Fare'].fillna(data_df['Fare'].mean(), inplace=True)
DEFAULT_SURVIVAL_VALUE = 0.5
data_df['Family_Survival'] = DEFAULT_SURVIVAL_VALUE
for grp, grp_df in data_df[['Survived','Name', 'Last_Name', 'Fare', 'Ticket', 'PassengerId',
                           'SibSp', 'Parch', 'Age', 'Cabin']].groupby(['Last_Name', 'Fare']):
    
    if (len(grp_df) != 1):
        # A Family group is found.
        for ind, row in grp_df.iterrows():
            smax = grp_df.drop(ind)['Survived'].max()
            smin = grp_df.drop(ind)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
            elif (smin==0.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0
                
for _, grp_df in data_df.groupby('Ticket'):
    if (len(grp_df) != 1):
        for ind, row in grp_df.iterrows():
            if (row['Family_Survival'] == 0) | (row['Family_Survival']== 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
                elif (smin==0.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0
data['Family_Survival'] = data_df['Family_Survival'][:891]
X_test['Family_Survival'] = data_df['Family_Survival'][891:]
X_test = X_test.drop(['Name','PassengerId','Age','SibSp','Parch','Ticket','Cabin','Embarked','Fare'],axis=1)
X=data.drop(['Name','PassengerId','SibSp','Age','Parch','Ticket','Cabin','Embarked','Fare','Survived'],axis=1)
y=data['Survived']
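
The title-based age imputation in the cell above can be illustrated on a toy example. This is a sketch using only the standard library, with made-up passenger records; it mirrors the idea (take the title from the name, then fill each missing age with the median age for that title), not the exact pandas code.

```python
import re
from statistics import median

# Toy passenger records for illustration; None marks a missing age.
passengers = [
    {"Name": "Braund, Mr. Owen Harris", "Age": 22.0},
    {"Name": "Smith, Mr. John", "Age": 40.0},
    {"Name": "Doe, Mr. James", "Age": None},
    {"Name": "Heikkinen, Miss. Laina", "Age": 26.0},
    {"Name": "Brown, Miss. Anna", "Age": None},
]

def extract_title(name):
    """Return the word that precedes the first period, e.g. 'Mr' or 'Miss'."""
    match = re.search(r"([A-Za-z]+)\.", name)
    return match.group(1) if match else None

# Collect the known ages for each title.
ages_by_title = {}
for p in passengers:
    if p["Age"] is not None:
        ages_by_title.setdefault(extract_title(p["Name"]), []).append(p["Age"])

# Median age per title, then fill each missing age from its title's median.
median_by_title = {title: median(ages) for title, ages in ages_by_title.items()}
for p in passengers:
    if p["Age"] is None:
        p["Age"] = median_by_title[extract_title(p["Name"])]

print([p["Age"] for p in passengers])
```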

In [4]:
from sklearn.ensemble import VotingClassifier #82.25 SVC 81.25 RFC 81 LR 82.25 GBC
from sklearn.model_selection import GridSearchCV

std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
X_test = std_scaler.transform(X_test)
clf1 = SVC(probability=True)
clf2 = GradientBoostingClassifier()
clf3 = RandomForestClassifier()
clf4 = KNeighborsClassifier()
clf5 = LogisticRegression(solver="lbfgs")
clf = VotingClassifier(estimators=[('SVC',clf1),('GBC',clf2),('RFC',clf3),('KNN',clf4),('LR',clf5)],n_jobs=-1,voting="soft")
params = {'SVC__C': range(1,10),'GBC__n_estimators':range(100,200),'RFC__n_estimators':range(150, 250),'KNN__n_neighbors':range(3,20),'LR__C': range(1,10)}
hyperparam = {'C':range(1,20), 'kernel':['rbf','poly','sigmoid'], 'gamma':['auto','scale']}
grid = GridSearchCV(estimator=clf1, param_grid=hyperparam, cv=10, n_jobs=-1, scoring="neg_log_loss")
grid = grid.fit(X, y)
predictions = grid.predict(X_test)
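
The `voting="soft"` option passed to `VotingClassifier` above means the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below shows that idea in plain Python for a single passenger, with made-up probabilities and no per-model weights.

```python
def soft_vote(probabilities_per_model):
    """Average class probabilities across models; return (class index, averaged probabilities)."""
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    averaged = [
        sum(model_probs[c] for model_probs in probabilities_per_model) / n_models
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda c: averaged[c])
    return best_class, averaged

# Made-up [P(not survived), P(survived)] outputs of three models for one passenger.
model_outputs = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.4, 0.6],
]
predicted_class, averaged = soft_vote(model_outputs)
print(predicted_class, averaged)
```

In this example two of the three models lean towards class 1, so a majority ("hard") vote would pick class 1, but the one confident model pulls the averaged probability towards class 0. That difference is why soft voting needs probability outputs, hence `SVC(probability=True)`.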



In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

classifiers = [
    KNeighborsClassifier(n_neighbors=16),
    SVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=200),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    LogisticRegression(solver="lbfgs"),
    GridSearchCV(estimator=clf, param_grid=params, refit=True, cv=3, n_jobs=-1, scoring="neg_log_loss")]

sss = StratifiedKFold(n_splits=3)
x1 = np.array(X)
y1 = np.array(y)
for train_index, test_index in sss.split(x1, y1):
    X_train, X_t = x1[train_index], x1[test_index]
    y_train, y_t = y1[train_index], y1[test_index]
    
    for model in classifiers:
        name = model.__class__.__name__
        model.fit(X_train, y_train)
        # Predictions on the held-out fold, not on the training data
        fold_predictions = model.predict(X_t)
        acc = accuracy_score(y_t, fold_predictions)
        print(name, acc)

In [6]:
passengerId = pd.read_csv('../Data/testOriginal.csv')['PassengerId']
results = pd.DataFrame({
    "PassengerId": passengerId,
    "Survived": predictions})
results.to_csv('Your_submission.csv', index=False)
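# A second submission needs its own predictions: `y_pred` is not defined in any cell above,
# so assign it (for example, the test-set predictions of another fitted model) before uncommenting.
# results = pd.DataFrame({
#     "PassengerId": passengerId,
#     "Survived": y_pred})
# results.to_csv('Your_submission2.csv', index=False)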

### Import a Classifier of your own choice from the list below !
1. LinearSVC()
2. MLPClassifier()
3. KNeighborsClassifier()
4. SVC()
5. DecisionTreeClassifier()
6. RandomForestClassifier()
7. ExtraTreeClassifier()
8. LogisticRegression()
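
As a hedged sketch of the three steps that follow (create an instance, train it, print the accuracy), here is one way it could look with `LogisticRegression`. The data is synthetic (`make_classification`) so the snippet runs on its own; the names `X_demo`, `y_demo`, `X_val`, and `model` are placeholders, and you would use your own Titanic features and target instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold out 20% of the rows to estimate accuracy on data the model has not seen.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = LogisticRegression(solver="lbfgs", max_iter=1000)  # create an instance
model.fit(X_tr, y_tr)                                      # train the model
val_accuracy = accuracy_score(y_val, model.predict(X_val)) # score on held-out rows
print(val_accuracy)
```

Any classifier from the list can be swapped in for `LogisticRegression`, since they all expose the same `fit` and `predict` methods.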

In [None]:
# Create an instance for the classifier
    


In [None]:
# Train the model on our X-train dataframe


In [None]:
# Print the accuracy score for your model; don't forget to import metrics from the sklearn library
    

# Submit results on Kaggle

In [None]:
# loading test file for Kaggle Submissions
X_test_Kaggle = pd.read_csv('../Data/test.csv')
X_test_Kaggle.head()

In [None]:
# printing missing values in dataset
print(X_test_Kaggle.isnull().sum())

In [None]:
# Fill missing values in numeric columns with the column mean
X_test_Kaggle.fillna(X_test_Kaggle.mean(numeric_only=True), inplace=True)

In [None]:
# printing missing values in dataset
print(X_test_Kaggle.isnull().sum())

In [None]:
kaggle_predictions = clf.predict(X_test_Kaggle.drop(['Age','Embarked','Fare'],axis=1))

In [None]:
X_test_ids = pd.read_csv('../Data/testOriginal.csv')

results = pd.DataFrame({
    "PassengerId": X_test_ids['PassengerId'],
    "Survived": kaggle_predictions})

In [None]:
results.to_csv('Your_submission.csv', index=False)

## Congratulations on successfully building your first Classification Model !!!