
In [1]:
# importing all the necessary libraries
import pandas as pd
import numpy as np

In [2]:

#we need to read the data
data = pd.read_csv("https://raw.githubusercontent.com/naveenjoshii/Intro-to-MachineLearning/master/Titanic/titanic.csv")
#print top 5 rows 
print(data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [3]:
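# Compute lower and upper bounds at mean ± threshold * std,
# clipped to the observed minimum and maximum of the column
def detect_outliers(data, threshold):
  mean = np.mean(data)
  std = np.std(data)
  # Series.min()/max() skip NaN; the built-in min()/max() are unreliable when NaN is present
  lb = max(mean - (threshold * std), data.min())
  ub = min(mean + (threshold * std), data.max())
  return lb, ub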

In [4]:
# work on a copy so the original data stays untouched
df = data.copy()
# bounds for Fare at 4 standard deviations from the mean
lb,ub = detect_outliers(data["Fare"],4)
# removing the rows whose Fare is above the upper bound
df.drop(df[df.Fare > ub].index, inplace=True)
# removing the rows whose Fare is below the lower bound
df.drop(df[df.Fare < lb ].index, inplace=True)


# bounds for Age at 5 standard deviations from the mean
lb,ub = detect_outliers(data["Age"],5)
# removing the rows whose Age is above the upper bound
df.drop(df[df.Age > ub].index, inplace=True)
# removing the rows whose Age is below the lower bound
df.drop(df[df.Age < lb].index, inplace=True)
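As a rough illustration of how the mean ± threshold * std rule behaves, the sketch below applies it to a made-up list of fares. The helper name `sigma_bounds` and the toy values are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def sigma_bounds(values, threshold):
    """Return (lower, upper) at mean ± threshold * std, clipped to the observed range."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # ignore missing values
    mean, std = arr.mean(), arr.std()
    lower = max(mean - threshold * std, arr.min())
    upper = min(mean + threshold * std, arr.max())
    return lower, upper

# Hypothetical fares: five ordinary values and one extreme value
toy_fares = [7.0, 8.0, 8.5, 9.0, 10.0, 500.0]
lb_toy, ub_toy = sigma_bounds(toy_fares, 2)

# Keep only the values that fall inside the bounds
kept = [f for f in toy_fares if lb_toy <= f <= ub_toy]
print(kept)
```

Because the lower bound is clipped to the observed minimum, only the upper side can remove rows in a sample like this one. In a very small sample a large threshold removes nothing, which is why the toy example uses a threshold of 2.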

In [5]:
#printing the missing value percentage for every column
df.isnull().mean() * 100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            20.113636
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.954545
Embarked        0.227273
dtype: float64

In [6]:
# get all the column names in our dataset
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [7]:
# The Cabin column is missing about 78% of its values, well above the 30% threshold, so we drop it
df.drop(['Cabin'],inplace=True,axis=1)

In [8]:
# printing the columns again after the drop; Cabin no longer appears in the output
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')

If a column is missing less than 30% of its values, keep it and fill in the missing entries as follows.

i. Missing values in numeric columns are filled with the mean of the corresponding column.

In [9]:
#printing the percentage of missing values in Age before handling
df['Age'].isnull().mean() * 100

20.113636363636363

In [10]:
# Filling the missing values with the mean of respective column
df['Age']=df['Age'].fillna(df['Age'].mean())

In [11]:
#printing the percentage of missing values in Age after handling
df['Age'].isnull().mean() * 100

0.0
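A minimal sketch of mean imputation on a toy Series may make the step above easier to follow. The values and the names `ages` and `filled` are made up for illustration.

```python
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20.0, None, 40.0, None, 30.0])

# Series.mean() skips NaN, so the mean is taken over the known values only
filled = ages.fillna(ages.mean())
print(filled.tolist())
```

Every gap receives the same value, so the column mean is unchanged but its variance shrinks. That trade-off is worth keeping in mind when a column has as many gaps as Age does here.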

ii. Missing values in categorical columns are filled with the most frequently occurring value.

In [12]:
#printing the percentage of missing values in Embarked before handling
df['Embarked'].isnull().mean() * 100

0.22727272727272727

In [13]:
# filling the missing values with the most frequently occurring value (the mode)
df["Embarked"] = df["Embarked"].fillna(df['Embarked'].mode()[0])

In [14]:
#printing the percentage of missing values in Embarked after handling
df['Embarked'].isnull().mean() * 100

0.0
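The same idea for a categorical column can be sketched on a toy Series. The port codes below are made up for illustration and `ports` is a hypothetical name.

```python
import pandas as pd

# Hypothetical embarkation codes with one missing entry
ports = pd.Series(["S", "C", None, "S", "Q"])

# mode() ignores missing values by default and returns a Series,
# so [0] picks the first (most frequent) value
most_common = ports.mode()[0]
filled_ports = ports.fillna(most_common)
print(filled_ports.tolist())
```

If two values tie for most frequent, `mode()` returns both in sorted order and `[0]` takes the first, which is an arbitrary but deterministic choice.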

Determine the categorical columns in the Titanic dataset. Convert columns with string data type to numerical data using encoding techniques.

In [15]:
#information about data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 880 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  880 non-null    int64  
 1   Survived     880 non-null    int64  
 2   Pclass       880 non-null    int64  
 3   Name         880 non-null    object 
 4   Sex          880 non-null    object 
 5   Age          880 non-null    float64
 6   SibSp        880 non-null    int64  
 7   Parch        880 non-null    int64  
 8   Ticket       880 non-null    object 
 9   Fare         880 non-null    float64
 10  Embarked     880 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 114.8+ KB


In [16]:
print("each unique value and respective counts in Sex column\n",df['Sex'].value_counts())
# creating dummy (one-hot) columns for Sex; drop_first=True keeps only the 'male' column
sex_df = pd.get_dummies(df['Sex'],drop_first=True)
sex_df.head()

each unique value and respective counts in Sex column
 male      572
female    308
Name: Sex, dtype: int64


   male
0     1
1     0
2     0
3     0
4     1


In [17]:
print("each unique value and respective counts in Embarked column\n",df['Embarked'].value_counts())
# creating dummies for Embarked
embark_df = pd.get_dummies(df['Embarked'],drop_first=True)
embark_df.head()

each unique value and respective counts in Embarked column
 S    642
C    161
Q     77
Name: Embarked, dtype: int64


   Q  S
0  0  1
1  0  0
2  0  1
3  0  1
4  0  1
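To show what `drop_first=True` does, here is a small sketch on a made-up Series. The name `toy_ports` is hypothetical, and `dtype=int` is passed only so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical embarkation codes
toy_ports = pd.Series(["S", "C", "Q", "S"])

# Full one-hot encoding: one column per category (C, Q, S)
full = pd.get_dummies(toy_ports, dtype=int)

# drop_first=True removes the first category (C); a row of all zeros then means C
reduced = pd.get_dummies(toy_ports, drop_first=True, dtype=int)
print(reduced)
```

Dropping one column loses no information, since the dropped category is implied when all remaining columns are 0. It also avoids perfectly collinear features, which matters for linear models such as logistic regression.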


In [18]:
old_data = df.copy()
# we need to drop the Sex and Embarked columns and replace them with the newly created dummy data frames
# PassengerId, Name and Ticket are treated as having no impact on the output label, so we drop them too
df.drop(['Sex','PassengerId','Embarked','Name','Ticket'],axis=1,inplace=True)
df.head()

   Survived  Pclass   Age  SibSp  Parch     Fare
0         0       3  22.0      1      0   7.2500
1         1       1  38.0      1      0  71.2833
2         1       3  26.0      0      0   7.9250
3         1       1  35.0      1      0  53.1000
4         0       3  35.0      0      0   8.0500


In [19]:
# After dropping the Sex and Embarked columns, we replace them with our new dummy data frames
data = pd.concat([df,sex_df,embark_df],axis=1)

Convert data in each numerical column so that it lies in the range [0,1]

In [20]:
# before scaling the data
data.head()

   Survived  Pclass   Age  SibSp  Parch     Fare  male  Q  S
0         0       3  22.0      1      0   7.2500     1  0  1
1         1       1  38.0      1      0  71.2833     0  0  0
2         1       3  26.0      0      0   7.9250     0  0  1
3         1       1  35.0      1      0  53.1000     0  0  1
4         0       3  35.0      0      0   8.0500     1  0  1


In [21]:
# Scaling the data with MinMaxScaler so that every value lies in the range [0,1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data[['Age','Pclass','Survived','SibSp','Parch','Fare','male','Q','S']] = scaler.fit_transform(data[['Age','Pclass','Survived','SibSp','Parch','Fare','male','Q','S']])

In [22]:
# after scaling the data
data.head()

   Survived  Pclass       Age  SibSp  Parch      Fare  male    Q    S
0       0.0     1.0  0.271174  0.125    0.0  0.031865   1.0  0.0  1.0
1       1.0     0.0  0.472229  0.125    0.0  0.313299   0.0  0.0  0.0
2       1.0     1.0  0.321438  0.000    0.0  0.034831   0.0  0.0  1.0
3       1.0     0.0  0.434531  0.125    0.0  0.233381   0.0  0.0  1.0
4       0.0     1.0  0.434531  0.000    0.0  0.035381   1.0  0.0  1.0
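Min-max scaling maps each value x to (x - min) / (max - min) within its column. The sketch below writes that formula out by hand. The helper `min_max_scale` is a hypothetical name, and the Age range of 0.42 to 80 is an assumption (the notebook never prints it), chosen because it reproduces the scaled Age shown for the first row.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())

# Evenly spaced toy values end up evenly spaced in [0, 1]
toy = min_max_scale([10.0, 15.0, 20.0])
print(toy)

# Assumed Age range (not printed in the notebook)
age_min, age_max = 0.42, 80.0
age_22_scaled = (22.0 - age_min) / (age_max - age_min)
print(round(age_22_scaled, 6))
```

Under that assumed range, an Age of 22 scales to roughly 0.2712, in line with the first row of the table above.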


Implement the following models on Titanic Dataset and determine the values of accuracy, precision, recall, f1 score and confusion matrix for the test data.

In [23]:
#  split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.drop('Survived',axis=1), 
                                                    data['Survived'], test_size=0.30, 
                                                    random_state=101)
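The notebook fits the scaler on the full data set before splitting. A common alternative is to split first and fit the scaler on the training rows only, so that no information from the test rows reaches the model. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo` and `demo_scaler` are hypothetical names and the data is random, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix and binary labels, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = rng.integers(0, 2, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101
)

demo_scaler = MinMaxScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those min/max values on the test rows

print(X_tr_scaled.min(), X_tr_scaled.max())
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected. The label column is also left unscaled, so it stays an integer class.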

In [24]:
from sklearn.linear_model import LogisticRegression

# Build the Model.
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression()

In [25]:
print("Predicting the model on the test set")
predicted =  logmodel.predict(X_test)

Predicting the model on the test set


In [26]:
print("predicted result !")
predicted

predicted result !


array([1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
       0., 1., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
       0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0.,
       0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0.,
       1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 1.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0.,
       0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
       1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0.,
       0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.,
       1., 0., 0., 1., 0.

In [27]:
#confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predicted))

[[144  24]
 [ 28  68]]


In [28]:
# Precision Score
from sklearn.metrics import precision_score
print("Precision Score",precision_score(y_test,predicted))

Precision Score 0.7391304347826086


In [29]:
# Recall Score
from sklearn.metrics import recall_score
print("recall score",recall_score(y_test,predicted))

recall score 0.7083333333333334


In [30]:
# F1 Score
from sklearn.metrics import f1_score
print("f1 score",f1_score(y_test,predicted))

f1 score 0.723404255319149


In [31]:
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,predicted))

              precision    recall  f1-score   support

         0.0       0.84      0.86      0.85       168
         1.0       0.74      0.71      0.72        96

    accuracy                           0.80       264
   macro avg       0.79      0.78      0.79       264
weighted avg       0.80      0.80      0.80       264



In [32]:
# metrics are used to find accuracy or error
from sklearn import metrics
# using metrics module for accuracy calculation
print("ACCURACY of Logistic Regression Model: ", metrics.accuracy_score(y_test, predicted))


ACCURACY of Logistic Regression Model:  0.803030303030303
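All four scores reported for the logistic regression model can be recomputed by hand from its confusion matrix. The sketch below does that in plain Python, using scikit-learn's layout where rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed for the logistic regression model:
# [[144  24]
#  [ 28  68]]
tn, fp = 144, 24   # true class 0: correctly predicted 0, wrongly predicted 1
fn, tp = 28, 68    # true class 1: wrongly predicted 0, correctly predicted 1

precision = tp / (tp + fp)                          # share of predicted survivors that survived
recall = tp / (tp + fn)                             # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all predictions that were correct

print(precision, recall, f1, accuracy)
```

These agree with the values printed by `precision_score`, `recall_score`, `f1_score` and `accuracy_score` above.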


 Random Forest Classifier

In [33]:
# importing random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier

In [34]:
# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)

# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)

# performing predictions on the test dataset
y_pred = clf.predict(X_test)
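A random forest draws random bootstrap samples and feature subsets, so without a fixed seed its predictions can vary a little from run to run. The sketch below shows, on synthetic data, that passing `random_state` makes two fits agree. `X_syn`, `y_syn` and `fit_and_predict` are hypothetical names, and the seed value 42 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, for illustration only
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Train a forest with the given seed and return its predictions on the same data."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_syn, y_syn)
    return model.predict(X_syn)

# Two independent fits with the same seed give identical predictions
same_seed_match = (fit_and_predict(42) == fit_and_predict(42)).all()
print(same_seed_match)
```

Fixing the seed in the notebook's own classifier would make the reported scores reproducible, although the exact numbers would probably differ from the ones shown here.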

In [35]:
#confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))

[[143  25]
 [ 20  76]]


In [36]:
# Precision Score
from sklearn.metrics import precision_score
print("Precision Score",precision_score(y_test,y_pred))

Precision Score 0.7524752475247525


In [37]:
# Recall Score
from sklearn.metrics import recall_score
print("recall score",recall_score(y_test,y_pred))

recall score 0.7916666666666666


In [38]:
# F1 Score
from sklearn.metrics import f1_score
print("f1 score",f1_score(y_test,y_pred))

f1 score 0.7715736040609136


In [39]:
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.88      0.85      0.86       168
         1.0       0.75      0.79      0.77        96

    accuracy                           0.83       264
   macro avg       0.81      0.82      0.82       264
weighted avg       0.83      0.83      0.83       264



In [40]:
# metrics are used to find accuracy or error
from sklearn import metrics
# using metrics module for accuracy calculation
print("ACCURACY of Random Forest Classifier Model: ", metrics.accuracy_score(y_test, y_pred))


ACCURACY of Random Forest Classifier Model:  0.8295454545454546
