# Supervised Machine Learning Systems - (Classification)

In [1]:
# Helper functions to display a video or an image
from IPython.display import HTML

def display_video(src):
    # Append the player options to the embed URL
    url = src + '?autoplay=1;modestbranding=1;rel=0'
    print('Source : ' + url)
    # Quote the src attribute so the URL cannot run into the next attribute
    return HTML('<iframe width="800" height="400" src="' + url + '" frameborder="0" allowfullscreen></iframe>')

def display_image(src):
    print('Source : ' + src)
    return HTML('<img width="600" height="300" src="' + src + '">')

## What is a Classification Problem?

Dependent vs Independent variables:

1. **Independent Variables for classification** - These are also called the features of the dataset. They are the variables whose values, when changed, can change the target class we want to predict.
2. **Dependent Variable for classification** - When the target variable takes one of a fixed set of class labels, it's a classification problem. Examples include classifying pictures as dogs or cats, or a tumour as cancerous or non-cancerous. You are predicting a class here, not a continuous quantity.
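
A minimal sketch of that split in plain Python. The records, field names, and values below are made up for illustration and are not taken from any dataset used in this notebook.

```python
# Hypothetical records: two measurements (features) and one class label (target).
records = [
    {"radius": 17.9, "texture": 10.4, "diagnosis": "M"},
    {"radius": 11.4, "texture": 14.2, "diagnosis": "B"},
    {"radius": 20.6, "texture": 17.8, "diagnosis": "M"},
]

feature_names = ["radius", "texture"]  # independent variables
target_name = "diagnosis"              # dependent variable

# X holds one feature vector per sample; y holds the matching class labels.
X = [[row[name] for name in feature_names] for row in records]
y = [row[target_name] for row in records]

print(X)
print(y)
```

The same convention (`X` for features, `y` for the target) is used later in this notebook.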

Let's take an example to understand it clearly:

<b> [Breast Cancer Diagnostic] </b>

There are two main classifications of tumors. One is known as benign and the other as malignant. A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body. A malignant tumor is a tumor that may invade its surrounding tissue or spread around the body.

In [2]:
display_image('https://www.verywellhealth.com/thmb/xnYC1DVmfPtwjWCEdO0HjSZbcBo=/1787x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/514240-article-img-malignant-vs-benign-tumor2111891f-54cc-47aa-8967-4cd5411fdb2f-5a2848f122fa3a0037c544be.png')

Source : https://www.verywellhealth.com/thmb/xnYC1DVmfPtwjWCEdO0HjSZbcBo=/1787x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/514240-article-img-malignant-vs-benign-tumor2111891f-54cc-47aa-8967-4cd5411fdb2f-5a2848f122fa3a0037c544be.png


Our goal is to train a Logistic Regression model that can predict whether the tumor is benign (B) or malignant (M).

Attribute Information:
<br>1) ID number 
<br>2) Diagnosis (M = malignant, B = benign) 
<br>3-32) Ten real-valued features are computed for each cell nucleus: 
<br>a) radius (mean of distances from center to points on the perimeter) 
<br>b) texture (standard deviation of gray-scale values) 
<br>c) perimeter 
<br>d) area 
<br>e) smoothness (local variation in radius lengths) 
<br>f) compactness (perimeter^2 / area - 1.0) 
<br>g) concavity (severity of concave portions of the contour) 
<br>h) concave points (number of concave portions of the contour) 
<br>i) symmetry 
<br>j) fractal dimension ("coastline approximation" - 1)

The **`'Diagnosis'`** column is the **Dependent Variable or target column** because we want our algorithm to predict this class.

Columns **`'3-32'`** are your **Features or Independent Variables**, which will help you predict the Benign/Malignant class. Vary any one of them and it may change the predicted diagnosis. Column 1 (the ID number) only identifies the sample and carries no diagnostic information, so it is normally left out of the features.

## Building a Machine Learning classifier model

Now we will discuss the Logistic Regression algorithm. Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.

Instead of our output vector y being a continuous range of values, it will only be 'M' or 'B'.
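
As a rough sketch of how a logistic regression model turns a continuous score into one of the two labels: the model computes a weighted sum of the features, squashes it into (0, 1) with the logistic (sigmoid) function, and compares the result with a threshold. The weights, bias, and feature values below are hypothetical; a trained model would learn them from data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 'M' if the estimated probability reaches the threshold, else 'B'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return "M" if sigmoid(z) >= threshold else "B"

# Hypothetical parameters for a two-feature model (assumed, not fitted).
weights, bias = [0.8, -0.3], -1.0

print(predict_label([3.0, 1.0], weights, bias))  # z = 1.1  -> 'M'
print(predict_label([0.5, 2.0], weights, bias))  # z = -1.2 -> 'B'
```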

# A Famous Classification Task (Hands-On!)

It's time for you to build your first classification model and run it on the Titanic survival prediction problem.

Load the train and test sets and inspect the features yourself using pandas.

Once you have built your model and have your predictions, save them to a CSV file and upload it to Kaggle to see your score.

In [4]:
import pandas as pd

In [5]:
data=pd.read_csv("./Data/train.csv")
X_test = pd.read_csv('./Data/test.csv')
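# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
# sort=True orders the columns alphabetically, matching the null-count output further down.
data_df = pd.concat([data, X_test], sort=True)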

In [6]:
data.shape, X_test.shape, data_df.shape

((891, 12), (418, 11), (1309, 12))

In [7]:
data_df['Title'] = data_df['Name']

In [8]:
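# Capture the run of letters that precedes a period, e.g. "Mr" in "Braund, Mr. Owen Harris".
# A raw string avoids the invalid-escape warning for "\."; expand=False returns a Series.
data_df['Title'] = data_df['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)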

In [9]:
data_df['Title'].head()

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object
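
To see what the pattern captures without pandas, here is a small sketch with the standard `re` module. The helper name `extract_title` is hypothetical; the two sample strings follow the `Name` format of the Kaggle file.

```python
import re

# One or more letters immediately followed by a literal period.
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')

def extract_title(name):
    """Return the first run of letters that is followed by a period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_title("Heikkinen, Miss. Laina"))   # Miss
```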

In [10]:
mapping = {'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Miss',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}

In [11]:
data_df.replace({'Title': mapping}, inplace=True)

In [12]:
data_df.groupby('Title')['Age'].median()

Title
Dr        49.0
Master     4.0
Miss      22.0
Mr        30.0
Mrs       36.0
Rev       41.5
Name: Age, dtype: float64

In [13]:
# Compute the median age per title once and look each one up by its label,
# rather than by position in a hand-written list.
median_age_by_title = data_df.groupby('Title')['Age'].median()
for title, age_to_impute in median_age_by_title.items():
    data_df.loc[(data_df['Age'].isnull()) & (data_df['Title'] == title), 'Age'] = age_to_impute
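
The imputation step above can be sketched in plain Python to make the logic explicit: group the known ages by title, take each group's median, and fill only the missing entries. The function name and the sample passengers are hypothetical.

```python
from statistics import median

def impute_by_group(rows, group_key, value_key):
    """Fill missing (None) values with the median of the row's group."""
    known = {}
    for row in rows:
        if row[value_key] is not None:
            known.setdefault(row[group_key], []).append(row[value_key])
    medians = {group: median(values) for group, values in known.items()}
    # Rows whose group has no known value keep None.
    return [
        {**row, value_key: medians.get(row[group_key])} if row[value_key] is None else dict(row)
        for row in rows
    ]

passengers = [
    {"Title": "Mr", "Age": 30.0},
    {"Title": "Mr", "Age": None},
    {"Title": "Mr", "Age": 40.0},
    {"Title": "Miss", "Age": 22.0},
    {"Title": "Miss", "Age": None},
]
filled = impute_by_group(passengers, "Title", "Age")
print([p["Age"] for p in filled])
```

In pandas, the same idea is often written without a loop by combining `groupby(...).transform('median')` with `fillna`.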

In [14]:
print(data_df.isnull().sum())

Age               0
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
Title             0
dtype: int64


In [15]:
data['Age'] = data_df['Age'][:891]
X_test['Age'] = data_df['Age'][891:]

In [16]:
data_df.drop('Title', axis = 1, inplace = True)
data_df['Family_Size'] = data_df['Parch'] + data_df['SibSp']

In [17]:
data['Family_Size'] = data_df['Family_Size'][:891]
X_test['Family_Size'] = data_df['Family_Size'][891:]

In [18]:
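# Encode Sex as 0/1. Assigning the result back is reliable, whereas calling
# replace(..., inplace=True) on a selected column may not update the DataFrame in recent pandas.
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
X_test['Sex'] = X_test['Sex'].map({'male': 0, 'female': 1})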

In [19]:
# Fill the single missing fare with the median, assigning the result back to the column
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].median())
data_df['Farebin'] = pd.qcut(data_df['Fare'], 5)

In [20]:
data_df['Farebin'].head()

0      (-0.001, 7.854]
1    (41.579, 512.329]
2        (7.854, 10.5]
3    (41.579, 512.329]
4        (7.854, 10.5]
Name: Farebin, dtype: category
Categories (5, interval[float64]): [(-0.001, 7.854] < (7.854, 10.5] < (10.5, 21.558] < (21.558, 41.579] < (41.579, 512.329]]
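
Quantile binning splits the values so that each bin holds roughly the same number of samples. The sketch below reproduces the idea with the standard library only. It is an approximation of what `pd.qcut` does, not its implementation; pandas may treat ties and duplicate bin edges differently. The function name and the toy fares are assumptions.

```python
from bisect import bisect_left
from statistics import quantiles

def quantile_bin_codes(values, n_bins):
    """Return (codes, cuts): a bin index per value and the interior cut points."""
    # 'inclusive' interpolates linearly between the sorted sample values.
    cuts = quantiles(values, n=n_bins, method="inclusive")
    # bisect_left makes each bin closed on the right: a value equal to a cut
    # point falls in the lower bin, like the (a, b] intervals shown above.
    codes = [bisect_left(cuts, v) for v in values]
    return codes, cuts

fares = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy data
codes, cuts = quantile_bin_codes(fares, 5)
print(cuts)
print(codes)
```

With ten evenly spread values and five bins, each bin receives two values, which is the equal-frequency property that distinguishes quantile bins from equal-width bins.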

In [23]:
from sklearn.preprocessing import LabelEncoder

In [24]:
label = LabelEncoder()
data_df['FareBin_Code'] = label.fit_transform(data_df['Farebin'])

In [25]:
data_df['FareBin_Code'].head()

0    0
1    4
2    1
3    4
4    1
Name: FareBin_Code, dtype: int64

In [26]:
data['FareBin_Code'] = data_df['FareBin_Code'][:891]
X_test['FareBin_Code'] = data_df['FareBin_Code'][891:]

In [27]:
data_df['Agebin'] = pd.qcut(data_df['Age'], 4)
label = LabelEncoder()
data_df['AgeBin_Code'] = label.fit_transform(data_df['Agebin'])
data['AgeBin_Code'] = data_df['AgeBin_Code'][:891]
X_test['AgeBin_Code'] = data_df['AgeBin_Code'][891:]

In [28]:
data_df['Last_Name'] = data_df['Name'].apply(lambda x: str.split(x, ",")[0])
# Fare was already filled with the median in an earlier cell, so this is a safeguard only
data_df['Fare'] = data_df['Fare'].fillna(data_df['Fare'].mean())

In [29]:
DEFAULT_SURVIVAL_VALUE = 0.5
data_df['Family_Survival'] = DEFAULT_SURVIVAL_VALUE
for grp, grp_df in data_df[['Survived','Name', 'Last_Name', 'Fare', 'Ticket', 'PassengerId',
                           'SibSp', 'Parch', 'Age', 'Cabin']].groupby(['Last_Name', 'Fare']):
    
    if (len(grp_df) != 1):
        # A Family group is found.
        for ind, row in grp_df.iterrows():
            smax = grp_df.drop(ind)['Survived'].max()
            smin = grp_df.drop(ind)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
            elif (smin==0.0):
                data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0

In [30]:
for _, grp_df in data_df.groupby('Ticket'):
    if (len(grp_df) != 1):
        for ind, row in grp_df.iterrows():
            if (row['Family_Survival'] == 0) | (row['Family_Survival']== 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 1
                elif (smin==0.0):
                    data_df.loc[data_df['PassengerId'] == passID, 'Family_Survival'] = 0
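
The rule applied to each passenger in the two grouping loops above can be isolated as a small function: look only at the *other* members of the group, ignore unknown outcomes (the test-set rows, whose `Survived` is missing), and prefer evidence of survival over evidence of death. The function name `survival_hint` is hypothetical; this is a sketch of the rule, not a replacement for the loops.

```python
import math

def survival_hint(others_survived, default=0.5):
    """Summarise the known outcomes of the other group members.

    Returns 1 if any other member is known to have survived, 0 if none
    survived but at least one is known to have died, else the default.
    """
    known = [s for s in others_survived if s is not None and not math.isnan(s)]
    if any(s == 1 for s in known):
        return 1
    if any(s == 0 for s in known):
        return 0
    return default

nan = float("nan")
print(survival_hint([1.0, nan]))   # a known survivor in the group
print(survival_hint([0.0, nan]))   # only a known death
print(survival_hint([nan, nan]))   # nothing known, keep the default
```

Because the passenger's own row is excluded, the feature never copies that passenger's own label into their features.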

In [31]:
data['Family_Survival'] = data_df['Family_Survival'][:891]
X_test['Family_Survival'] = data_df['Family_Survival'][891:]

In [32]:
X_test = X_test.drop(['Name','PassengerId','Age','SibSp','Parch','Ticket','Cabin','Embarked','Fare'],axis=1)
X=data.drop(['Name','PassengerId','SibSp','Age','Parch','Ticket','Cabin','Embarked','Fare','Survived'],axis=1)
y=data['Survived']

In [33]:
X.head()

   Pclass  Sex  Family_Size  FareBin_Code  AgeBin_Code  Family_Survival
0       3    0            1             0            0              0.5
1       1    1            1             4            3              0.5
2       3    1            0             1            1              0.5
3       1    1            1             4            2              0.0
4       3    0            0             1            2              0.5


In [48]:
import numpy as np  # needed for np.array below
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

classifiers = [
    KNeighborsClassifier(n_neighbors=16),
    SVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=200),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(),
    LogisticRegression(solver="lbfgs")]

sss = StratifiedKFold(n_splits=3)
x1 = np.array(X)
y1 = np.array(y)
for train_index, test_index in sss.split(x1, y1):
    X_train, X_t = x1[train_index], x1[test_index]
    y_train, y_t = y1[train_index], y1[test_index]
    
    for clf1 in classifiers:
        name = clf1.__class__.__name__
        clf1.fit(X_train, y_train)
        # Predictions on the held-out fold, not on the training rows
        fold_predictions = clf1.predict(X_t)
        acc = accuracy_score(y_t, fold_predictions)
        print(name, acc)

KNeighborsClassifier 0.8282828282828283
SVC 0.8383838383838383
DecisionTreeClassifier 0.7777777777777778
RandomForestClassifier 0.8080808080808081
AdaBoostClassifier 0.7912457912457912
GradientBoostingClassifier 0.835016835016835
GaussianNB 0.7272727272727273
LinearDiscriminantAnalysis 0.8114478114478114
QuadraticDiscriminantAnalysis 0.7946127946127947
LogisticRegression 0.8114478114478114
KNeighborsClassifier 0.8585858585858586
SVC 0.8552188552188552
DecisionTreeClassifier 0.835016835016835
RandomForestClassifier 0.8552188552188552
AdaBoostClassifier 0.8316498316498316
GradientBoostingClassifier 0.8552188552188552
GaussianNB 0.7912457912457912
LinearDiscriminantAnalysis 0.835016835016835
QuadraticDiscriminantAnalysis 0.8148148148148148
LogisticRegression 0.8417508417508418
KNeighborsClassifier 0.8417508417508418
SVC 0.835016835016835
DecisionTreeClassifier 0.835016835016835
RandomForestClassifier 0.8383838383838383
AdaBoostClassifier 0.8148148148148148
GradientBoostingClassifier 0.838
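
The loop prints one accuracy per classifier per fold, which makes the models hard to compare at a glance. A small sketch of averaging the per-fold scores is shown below. The helper name is hypothetical, and the numbers are rounded copies of two classifiers from the listing above (the third fold is cut off in the listing, so `GaussianNB` has only two values here).

```python
from collections import defaultdict
from statistics import mean

def mean_accuracy(fold_results):
    """Average the accuracies of each model over all folds it appears in."""
    per_model = defaultdict(list)
    for name, acc in fold_results:
        per_model[name].append(acc)
    return {name: mean(scores) for name, scores in per_model.items()}

# (model name, accuracy) pairs in the order the loop printed them, rounded
fold_results = [
    ("KNeighborsClassifier", 0.8283),
    ("GaussianNB", 0.7273),
    ("KNeighborsClassifier", 0.8586),
    ("GaussianNB", 0.7912),
    ("KNeighborsClassifier", 0.8418),
]

averages = mean_accuracy(fold_results)
for name, score in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```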

In [35]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [36]:
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

In [37]:
from sklearn.ensemble import VotingClassifier #82.25 SVC 81.25 RFC 81 LR 82.25 GBC
from sklearn.model_selection import RandomizedSearchCV

In [38]:
std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
X_test = std_scaler.transform(X_test)
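
Note the asymmetry in the cell above: the scaler is fitted on the training features only, and the test features are transformed with the training mean and standard deviation. A plain-Python sketch of that idea for a single column follows. The function names and numbers are illustrative, and unlike a production scaler it does not guard against a zero standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardise values with statistics learned elsewhere: z = (x - mu) / sigma."""
    return [(x - mu) / sigma for x in column]

train_col = [1.0, 2.0, 3.0]  # toy training column
test_col = [4.0]             # toy test column

mu, sigma = fit_scaler(train_col)             # fit on train only
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)  # reuse the train statistics

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the test set as well would leak information about the test distribution into preprocessing.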

In [39]:
clf1 = SVC(probability=True)
clf2 = GradientBoostingClassifier()
clf3 = RandomForestClassifier()
clf4 = KNeighborsClassifier()
clf5 = LogisticRegression(solver="lbfgs")

In [40]:
clf = VotingClassifier(estimators=[('SVC',clf1),('GBC',clf2),('RFC',clf3),('KNN',clf4),('LR',clf5)],n_jobs=-1,voting="soft")
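
With `voting="soft"`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average, instead of counting one vote per model. The sketch below contrasts the two schemes on made-up probabilities from three hypothetical models; it ignores per-model weights.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across models; return (class index, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=averaged.__getitem__), averaged

def hard_vote(probabilities):
    """Each model votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return max(set(votes), key=votes.count)

# [P(class 0), P(class 1)] from three hypothetical models for one passenger
model_probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]

soft_class, averaged = soft_vote(model_probs)
print(soft_class, averaged)    # the confident first model tips the average to class 0
print(hard_vote(model_probs))  # two of three models prefer class 1
```

Soft voting needs probability estimates from every member, which is why `SVC` is created with `probability=True` in this notebook.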

In [42]:
params = {'SVC__C': range(1,10),'GBC__n_estimators':range(100,200),'RFC__n_estimators':range(150, 250),'KNN__n_neighbors':range(3,20),'LR__C': range(1,10)}
hyperparam = {'C':range(1,20), 'kernel':['rbf','poly','sigmoid'], 'gamma':['auto','scale']}
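
The `params` dictionary describes a very large grid, which is why a randomized search is a reasonable choice here. The sketch below counts the combinations and draws a few random candidates. It is a simplified illustration: the helper name is hypothetical and it samples with replacement, which may differ from how `RandomizedSearchCV` draws its candidates.

```python
import math
import random

params = {
    'SVC__C': range(1, 10),
    'GBC__n_estimators': range(100, 200),
    'RFC__n_estimators': range(150, 250),
    'KNN__n_neighbors': range(3, 20),
    'LR__C': range(1, 10),
}

# Number of distinct combinations an exhaustive grid search would have to try
grid_size = math.prod(len(values) for values in params.values())
print(grid_size)  # 13770000

def sample_candidates(param_ranges, n_iter, seed=0):
    """Draw n_iter random settings, one value per parameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in param_ranges.items()}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(params, 5)
for candidate in candidates:
    print(candidate)
```

The key names follow the `<estimator name>__<parameter>` convention, where the estimator name is the label given to each model in the `VotingClassifier`.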

In [44]:
grid = RandomizedSearchCV(estimator=clf, param_distributions=params, cv=5, n_jobs=-1, scoring="neg_log_loss", n_iter=50)
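
The search is scored with `neg_log_loss`, the negated log loss, so that a higher score is better. Log loss penalises confident wrong probabilities much more than hesitant ones. A minimal sketch of the binary case is given below; the function name and the example probabilities are assumptions, and the clipping constant is an arbitrary small value used to avoid `log(0)`.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p = predicted P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1]
print(binary_log_loss(y_true, [0.9, 0.1, 0.8]))  # confident and right: small loss
print(binary_log_loss(y_true, [0.5, 0.5, 0.5]))  # uninformative: log(2)
print(binary_log_loss(y_true, [0.1, 0.9, 0.2]))  # confident and wrong: large loss
```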

In [45]:
grid = grid.fit(X, y)

In [61]:
model = grid.best_estimator_.fit(X,y)

In [46]:
predictions = grid.predict(X_test)

In [62]:
model_preds = model.predict(X_test)

In [64]:
model_preds

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [47]:
predictions

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [52]:
def count_survived(preds):
    # Count the predictions equal to 1 (survived)
    return sum(1 for p in preds if p == 1)

In [58]:
print( count_survived(predictions), "survived out of 418")

145 survived out of 418


In [63]:
print( count_survived(model_preds), "survived out of 418")

145 survived out of 418


In [59]:
passengerId = pd.read_csv('./Data/test.csv')['PassengerId']
results = pd.DataFrame({
    "PassengerId": passengerId,
    "Survived": predictions})
results.to_csv('softEnsemble_with_family_association.csv', index=False)

### Import a Classifier of your own choice from the list below!
1. LinearSVC()
2. MLPClassifier()
3. KNeighborsClassifier()
4. SVC()
5. DecisionTreeClassifier()
6. RandomForestClassifier()
7. ExtraTreeClassifier()
8. LogisticRegression()

In [None]:
# Create an instance for the classifier
    


In [None]:
# Train the model on our X-train dataframe


In [None]:
# Print the accuracy score for your model; don't forget to import metrics from the sklearn library
    

# Submit results on Kaggle

In [None]:
# loading test file for Kaggle Submissions
X_test_Kaggle = pd.read_csv('../Data/test.csv')
X_test_Kaggle.head()

In [None]:
# printing missing values in dataset
print(X_test_Kaggle.isnull().sum())

In [None]:
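# Fill missing numeric values with the column mean; numeric_only skips text columns such as Name
X_test_Kaggle.fillna(X_test_Kaggle.mean(numeric_only=True), inplace=True)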

In [None]:
# printing missing values in dataset
print(X_test_Kaggle.isnull().sum())

In [None]:
kaggle_predictions = clf.predict(X_test_Kaggle.drop(['Age','Embarked','Fare'],axis=1))

In [None]:
X_test_ids = pd.read_csv('../Data/testOriginal.csv')

results = pd.DataFrame({
    "PassengerId": X_test_ids['PassengerId'],
    "Survived": kaggle_predictions})

In [None]:
results.to_csv('Your_submission.csv', index=False)

## Congratulations on successfully building your first Classification Model!