# ***Logistic Regression***

## ***Titanic***

In this assignment, you will use your model to predict survival in the Titanic disaster.
Download the Titanic data from [Kaggle](https://www.kaggle.com/c/titanic/data). The data in the ```train.csv``` file there is sufficient for this task.

- Split your data into training and test sets. Build your model and predict survival for the rows in the test set you held out.
- Is your model's performance satisfactory? Explain.
- Try to improve the model's predictive performance by adding or removing variables.
- Research the advantages and disadvantages of logistic regression and discuss them with your mentor.
---

In [605]:
import pandas as pd
import numpy as np
# plotting tools
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
sns.set(style="whitegrid")
# for logistic regression
from sklearn.linear_model import LogisticRegression
# for linear regression
from sklearn.linear_model import LinearRegression
# for splitting the data into training and test sets
from sklearn.model_selection import train_test_split

import statsmodels.api as sm
# for polynomial features
from sklearn.preprocessing import PolynomialFeatures
# for prediction performance
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
# for regularization
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# by default pandas truncates frames that have many rows or columns,
# so show up to 100 rows and 100 columns
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

# show three digits after the decimal point
pd.options.display.float_format = '{:,.3f}'.format
# show the description column in full
pd.options.display.max_colwidth = 100

# font definitions
title_font = {'family': 'times new roman', 'color': 'darkred', 
              'weight': 'bold', 'size': 14}
axis_font  = {'family': 'times new roman', 'color': 'darkred', 
              'weight': 'bold', 'size': 14}

# load the dataset into a DataFrame
titanic = pd.read_csv('data/train.csv')

- Function definitions.

In [606]:
# DataFrame that stores the accuracy scores of each model
accuracy_df = pd.DataFrame(columns=['description',
                                    'train_accuracy', 'test_accuracy'])

accuracy_df.index.name = 'model'

# save the accuracy scores of a model into accuracy_df
def save_accuracy(model_nu, description, train_accuracy, test_accuracy):
    global accuracy_df
    # drop the model's row if it already exists
    if (accuracy_df.index == model_nu).any():
        accuracy_df.drop(index=model_nu, inplace=True)
    # build a one-row frame indexed by the model number
    df = pd.DataFrame({'description': [description],
                       'train_accuracy': [train_accuracy],
                       'test_accuracy': [test_accuracy]},
                      index=pd.Index([model_nu], name='model'))
    accuracy_df = pd.concat([accuracy_df, df])

# DataFrame that stores the accuracy scores for each C value
cval_df = pd.DataFrame(columns=['model', 'description', 'c_val',
                                'train_accuracy', 'test_accuracy'])


# save the accuracy scores for a given C value into cval_df
def save_cval_accuracy(model_nu, description, cval, train_accuracy, test_accuracy):
    global cval_df
    # optionally drop the model's existing rows before saving
    #if (cval_df.model==model_nu).any():
    #   cval_df = cval_df.drop(cval_df[cval_df.model == model_nu].index)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    row = pd.DataFrame([{'model': model_nu, 'description': description, 'c_val': cval,
                         'train_accuracy': train_accuracy,
                         'test_accuracy': test_accuracy}])
    cval_df = pd.concat([cval_df, row], ignore_index=True)

In [607]:
display(accuracy_df, cval_df)

| model | description | train_accuracy | test_accuracy |
|---|---|---|---|

| | model | description | c_val | train_accuracy | test_accuracy |
|---|---|---|---|---|---|


### Data Exploration
- Inspecting the dataset.

In [608]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [609]:
for val in titanic['Pclass'].unique():
    print(val, '{:.3f}'.format(titanic.loc[titanic['Pclass']==val, 'Cabin'].isnull().mean()))

3 0.976
1 0.185
2 0.913


- Cabin is missing for most Pclass 2 and 3 passengers (91.3% and 97.6%, respectively), whereas only 18.5% of first-class passengers lack a Cabin value.
- The NaN values can be filled with the string 'None'.
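
The per-class loop can also be written as a single grouped aggregation. The sketch below is a minimal illustration; the helper name `missing_rate_by_group` and the toy frame are assumptions made for this example and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, group_col, value_col):
    """Share of missing `value_col` entries for each value of `group_col`."""
    # isnull() gives booleans; their mean within each group is the missing rate
    return df[value_col].isnull().groupby(df[group_col]).mean()

# Made-up rows that only show the call shape (not Titanic data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Cabin': ['C85', np.nan, np.nan, np.nan, np.nan, 'G6'],
})
rates = missing_rate_by_group(toy, 'Pclass', 'Cabin')
print(rates)
```

In the notebook, the equivalent call would be `missing_rate_by_group(titanic, 'Pclass', 'Cabin')`.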

In [610]:
# assign the result back; in-place fillna on a selected column is unreliable in recent pandas
titanic['Cabin'] = titanic['Cabin'].fillna('None')

In [611]:
titanic['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [612]:
for val in titanic['Embarked'].unique()[0:3]:    
    print(val, titanic.loc[titanic['Embarked']==val, 'PassengerId'].count())

S 644
C 168
Q 77


- Embarked takes three distinct values; the missing entries can be filled with the most frequent value ('S').

In [613]:
titanic['Embarked'] = titanic['Embarked'].fillna('S')

- Finally, the missing Age values are filled with the mean age, rounded down.
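
The fill values ('S' and the floored mean) can be derived from the data instead of being typed in by hand. The following is a minimal sketch under that assumption; `fill_missing` and the toy frame are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def fill_missing(df, mode_cols=(), mean_cols=()):
    """Return a copy where mode_cols get their most frequent value
    and mean_cols get the floor of their mean."""
    out = df.copy()
    for col in mode_cols:
        # mode() ignores NaN; take the first value if there are ties
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in mean_cols:
        out[col] = out[col].fillna(np.floor(out[col].mean()))
    return out

# Made-up rows, not Titanic data
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', np.nan],
    'Age': [20.0, 31.0, np.nan, 40.0],
})
filled = fill_missing(toy, mode_cols=['Embarked'], mean_cols=['Age'])
print(filled)
```

Returning a copy leaves the input frame untouched, which makes it easier to rerun a cell without reloading the CSV.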

In [614]:
np.floor(titanic['Age'].mean())

29.0

In [615]:
titanic['Age'] = titanic['Age'].fillna(np.floor(titanic['Age'].mean()))

In [616]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### ***Model 1: Numeric variables only***

- Target (dependent) variable: Survived
- Explanatory (independent) variables: all other numeric variables

In [617]:
# the explanatory variables are the numeric columns
expl_vars = [column for column in titanic.columns 
                if titanic.dtypes[column] != 'object']
# remove the target variable and the id variable from the explanatory variables
expl_vars.remove('Survived')
expl_vars.remove('PassengerId')
expl_vars

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

In [618]:
# target and explanatory variables
Y = titanic['Survived']
X = titanic[expl_vars]

# split into training and test sets
X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.20, random_state=111)

# create the logistic regression model object
log_reg = LogisticRegression()

# fit the model
log_reg.fit(X_train, Y_train)

# accuracy scores
train_accuracy = log_reg.score(X_train, Y_train)
test_accuracy = log_reg.score(X_test, Y_test)

# save the accuracy scores
save_accuracy(1, 'numerik degiskenler', train_accuracy, test_accuracy)
accuracy_df

| model | description | train_accuracy | test_accuracy |
|---|---|---|---|
| 1 | numerik degiskenler | 0.705 | 0.698 |
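
Whether an accuracy of about 0.70 is satisfactory depends on what a trivial predictor would achieve. A common reference point is the accuracy of always predicting the most frequent class. The sketch below computes it; `majority_baseline_accuracy` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()

# Made-up labels: 6 of the 10 belong to class 0
toy_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
baseline = majority_baseline_accuracy(toy_y)
print(round(baseline, 3))
```

In the notebook this would be called as `majority_baseline_accuracy(Y_test)` and compared with `test_accuracy`; a model that barely beats this value has learned little from its features.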


### ***Model 1: Numeric variables only (Regularization)***
- The C value is varied to observe its effect on the model. C defaults to 1.
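
In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` penalizes large coefficients more heavily. The sketch below illustrates this on synthetic data generated with `make_classification` (an assumption made so the example is self-contained; it is not the Titanic set) by comparing the coefficient norm across `C` values with the default L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

coef_norms = {}
for c in [0.001, 1, 1000]:
    # L2 is the default penalty; max_iter is raised to help convergence
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_demo, y_demo)
    coef_norms[c] = float(np.linalg.norm(model.coef_))

for c, norm in coef_norms.items():
    print(c, round(norm, 3))
```

With very small `C` the coefficients are pulled toward zero, which matches the lower accuracies at `C = 0.001` in the tables of this notebook.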

In [619]:
c_vals = [0.001,0.01,0.1,1,10,100, 1000]

for c in c_vals:
    
    # Apply logistic regression model to training data
    lr = LogisticRegression(penalty = 'l2', C = c, random_state = 111)
    lr.fit(X_train,Y_train)
    save_cval_accuracy(1, 'numerik degiskenler', c, 
                       lr.score(X_train, Y_train), lr.score(X_test, Y_test))

cval_df

| | model | description | c_val | train_accuracy | test_accuracy |
|---|---|---|---|---|---|
| 0 | 1 | numerik degiskenler | 0.001 | 0.687 | 0.665 |
| 1 | 1 | numerik degiskenler | 0.01 | 0.709 | 0.698 |
| 2 | 1 | numerik degiskenler | 0.1 | 0.711 | 0.709 |
| 3 | 1 | numerik degiskenler | 1.0 | 0.705 | 0.698 |
| 4 | 1 | numerik degiskenler | 10.0 | 0.706 | 0.698 |
| 5 | 1 | numerik degiskenler | 100.0 | 0.706 | 0.698 |
| 6 | 1 | numerik degiskenler | 1000.0 | 0.706 | 0.698 |


### ***Model 2: Including sex and port of embarkation***
- Because women had a higher survival rate, female is encoded as 1 and male as 0.
- The port of embarkation is likewise encoded in order of its survival rate.
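
The ordering by survival rate can be computed from the data rather than written out by hand. The sketch below is one way to do that; `rank_encode_by_target` and the toy rows are assumptions for illustration only. Note that deriving the mapping from all rows, as done here, uses target information from rows that later land in the test set, so in practice it may be preferable to derive it from the training split.

```python
import pandas as pd

def rank_encode_by_target(df, cat_col, target_col):
    """Map each category to 1..k in order of increasing mean target value."""
    means = df.groupby(cat_col)[target_col].mean().sort_values()
    mapping = {cat: rank for rank, cat in enumerate(means.index, start=1)}
    return df[cat_col].map(mapping), mapping

# Made-up rows, not Titanic data
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'Q', 'Q', 'C', 'C'],
    'Survived': [0, 0, 1, 0, 1, 1, 1],
})
encoded, mapping = rank_encode_by_target(toy, 'Embarked', 'Survived')
print(mapping)
```

An integer ranking imposes an order and equal spacing between ports; one-hot encoding (for example with `pd.get_dummies`) is an alternative that avoids that assumption.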

In [620]:
titanic[['Sex', 'Survived']].groupby('Sex').mean()

| Sex | Survived |
|---|---|
| female | 0.742 |
| male | 0.189 |


In [621]:
titanic['Sex'] = titanic['Sex'].map({'female': 1, 'male': 0})

In [622]:
titanic[['Embarked', 'Survived']].groupby('Embarked').mean()

| Embarked | Survived |
|---|---|
| C | 0.554 |
| Q | 0.39 |
| S | 0.339 |


In [623]:
titanic['Embarked'] = titanic['Embarked'].map({'C': 3, 'Q': 2, 'S': 1})

In [624]:
# the explanatory variables are the numeric columns
expl_vars = [column for column in titanic.columns 
                if titanic.dtypes[column] != 'object']
# remove the target variable and the id variable from the explanatory variables
expl_vars.remove('Survived')
expl_vars.remove('PassengerId')
expl_vars

['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

In [625]:
# target and explanatory variables
Y = titanic['Survived']
X = titanic[expl_vars]

# split into training and test sets
X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.20, random_state=111)

# create the logistic regression model object
log_reg = LogisticRegression()

# fit the model
log_reg.fit(X_train, Y_train)

# accuracy scores
train_accuracy = log_reg.score(X_train, Y_train)
test_accuracy = log_reg.score(X_test, Y_test)

# save the accuracy scores
save_accuracy(2, 'cinsiyet ve liman bilgisi dahil', train_accuracy, test_accuracy)
accuracy_df

| model | description | train_accuracy | test_accuracy |
|---|---|---|---|
| 1 | numerik degiskenler | 0.705 | 0.698 |
| 2 | cinsiyet ve liman bilgisi dahil | 0.799 | 0.771 |


### ***Model 2: Including sex and port of embarkation (Regularization)***
- The C value is varied to observe its effect on the model. C defaults to 1.

In [626]:
c_vals = [0.001,0.01,0.1,1,10,100, 1000]


for c in c_vals:
    
    # Apply logistic regression model to training data
    lr = LogisticRegression(penalty = 'l2', C = c, random_state = 111)
    lr.fit(X_train,Y_train)
    save_cval_accuracy(2, 'cinsiyet ve liman bilgisi dahil', c, 
                       lr.score(X_train, Y_train), lr.score(X_test, Y_test))

cval_df

| | model | description | c_val | train_accuracy | test_accuracy |
|---|---|---|---|---|---|
| 0 | 1 | numerik degiskenler | 0.001 | 0.687 | 0.665 |
| 1 | 1 | numerik degiskenler | 0.01 | 0.709 | 0.698 |
| 2 | 1 | numerik degiskenler | 0.1 | 0.711 | 0.709 |
| 3 | 1 | numerik degiskenler | 1.0 | 0.705 | 0.698 |
| 4 | 1 | numerik degiskenler | 10.0 | 0.706 | 0.698 |
| 5 | 1 | numerik degiskenler | 100.0 | 0.706 | 0.698 |
| 6 | 1 | numerik degiskenler | 1000.0 | 0.706 | 0.698 |
| 7 | 2 | cinsiyet ve liman bilgisi dahil | 0.001 | 0.688 | 0.659 |
| 8 | 2 | cinsiyet ve liman bilgisi dahil | 0.01 | 0.735 | 0.76 |
| 9 | 2 | cinsiyet ve liman bilgisi dahil | 0.1 | 0.806 | 0.782 |


### ***Model 3: Filling Age with predictions***
- The missing values of the Age variable will be filled using linear regression.

In [627]:
titanic = pd.read_csv('data/train.csv')

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


- Missing values, except for Age, are filled as before.

In [628]:
titanic['Cabin'] = titanic['Cabin'].fillna('None')

titanic['Embarked'] = titanic['Embarked'].fillna('S')

- Sex and port of embarkation are encoded as before.

In [629]:
titanic['Sex'] = titanic['Sex'].map({'female': 1, 'male': 0})
titanic['Embarked'] = titanic['Embarked'].map({'C': 3, 'Q': 2, 'S': 1})

In [630]:
titanic['Age'].isnull().mean()

0.19865319865319866

- About 20% of the Age values are missing, so the rows with and without Age happen to split roughly 80/20.

- The data is split in two: rows with an Age value and rows without one.

In [633]:
titanic_no_age = titanic[titanic['Age'].isnull()].copy()

In [634]:
titanic_w_age = titanic[titanic['Age'].notnull()].copy()

- Target (dependent) variable: Age
- Explanatory (independent) variables: all other numeric variables

In [635]:
# the explanatory variables are the numeric columns
expl_vars = [column for column in titanic_w_age.columns 
                if titanic_w_age.dtypes[column] != 'object']
# remove the target variable (Age) and the id variable from the explanatory variables
expl_vars.remove('Age')
expl_vars.remove('PassengerId')
expl_vars

['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked']

In [636]:
# target and explanatory variables
Y = titanic_w_age['Age']
X = titanic_w_age[expl_vars]

# split into training and test sets
X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.20, random_state=111)

# create the linear regression model object
lrm = LinearRegression()
# fit the model
lrm.fit(X_train, Y_train)

# R-squared scores
train_rsquare = lrm.score(X_train, Y_train)
test_rsquare = lrm.score(X_test, Y_test)

# predictions on the test data
Y_test_pred = lrm.predict(X_test)
# predictions on the training data
Y_train_pred = lrm.predict(X_train)

print('Train R-squared: ', train_rsquare, 
      'Test R-squared:', test_rsquare, 
      'RMSE: ', rmse(Y_test, Y_test_pred) , sep='\n')


Train R-squared: 
0.27711273581074836
Test R-squared:
0.2501980404839499
RMSE: 
13.036706369815096


- We build the model once more. This time it is fitted on all rows with a known Age, without a train/test split, so that more data is used. It then predicts Age for the rows where it is missing, and those predictions fill the gaps.
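
The fit-on-known, predict-on-missing procedure can be wrapped in a small function. The sketch below is a hedged illustration rather than the notebook's exact method: `impute_with_regression` and the toy frame are hypothetical, negative predictions are clipped to zero instead of taking their absolute value, and the feature list deliberately omits `Survived` (the notebook's `expl_vars` includes it), since imputing a feature from the classifier's target can leak label information into that feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_regression(df, target, features):
    """Return a copy of df where missing `target` values are replaced by
    floored, non-negative linear-regression predictions from `features`."""
    out = df.copy()
    known = out[target].notnull()
    if known.all():
        return out  # nothing to fill
    model = LinearRegression()
    model.fit(out.loc[known, features], out.loc[known, target])
    pred = model.predict(out.loc[~known, features])
    # clip at zero (a swap for the notebook's abs), then round down
    out.loc[~known, target] = np.floor(np.clip(pred, 0, None))
    return out

# Made-up rows, not Titanic data
toy = pd.DataFrame({
    'Pclass': [1, 2, 3, 1, 2, 3],
    'Fare': [80.0, 20.0, 8.0, 90.0, 25.0, 7.0],
    'Age': [45.0, 30.0, 22.0, np.nan, 28.0, np.nan],
})
filled = impute_with_regression(toy, 'Age', ['Pclass', 'Fare'])
print(filled)
```

Working on a copy and writing with `.loc` avoids assigning into a slice of the original frame.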

In [639]:
Y = titanic_w_age['Age']
X = titanic_w_age[expl_vars]

# create the model object
lrm = LinearRegression()
# fit the model
lrm.fit(X, Y)

# predict Age for the rows where it is missing
# floor(abs(...)) keeps the filled ages non-negative whole numbers
titanic_no_age = titanic_no_age.assign(
    Age=np.floor(np.abs(lrm.predict(titanic_no_age[expl_vars]))))

In [651]:
# combine the two subsets
titanic = pd.concat([titanic_no_age, titanic_w_age])
# restore the original order
titanic = titanic.sort_values(by='PassengerId')

In [653]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    int64  
dtypes: float64(2), int64(7), object(3)
memory usage: 90.5+ KB


- Let us build the final logistic regression model.

In [654]:
# the explanatory variables are the numeric columns
expl_vars = [column for column in titanic.columns 
                if titanic.dtypes[column] != 'object']
# remove the target variable and the id variable from the explanatory variables
expl_vars.remove('Survived')
expl_vars.remove('PassengerId')
expl_vars

['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

In [658]:
# target and explanatory variables
Y = titanic['Survived']
X = titanic[expl_vars]

# split into training and test sets
X_train, X_test, Y_train, Y_test =  train_test_split(X, Y, test_size=0.20, random_state=111)

# create the logistic regression model object
log_reg = LogisticRegression()

# fit the model
log_reg.fit(X_train, Y_train)

# accuracy scores
train_accuracy = log_reg.score(X_train, Y_train)
test_accuracy = log_reg.score(X_test, Y_test)

# save the accuracy scores
save_accuracy(3, 'yas tahminle dolduruldu', train_accuracy, test_accuracy)
accuracy_df

| model | description | train_accuracy | test_accuracy |
|---|---|---|---|
| 1 | numerik degiskenler | 0.705 | 0.698 |
| 2 | cinsiyet ve liman bilgisi dahil | 0.799 | 0.771 |
| 3 | yas tahminle dolduruldu | 0.823 | 0.799 |
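
All three test accuracies come from a single 80/20 split with `random_state=111`, so differences of a few points between models correspond to only a handful of test rows. k-fold cross-validation is one way to check whether such differences hold up across splits. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that the example is self-contained); in the notebook, `X` and `Y` would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 7 features, not the Titanic set
X_demo, y_demo = make_classification(n_samples=400, n_features=7,
                                     n_informative=4, n_redundant=0,
                                     random_state=111)

# 5-fold cross-validated accuracy of a logistic regression model
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=5, scoring='accuracy')

print('fold accuracies:', scores.round(3))
print('mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))
```

If the gap between two models is smaller than the spread across folds, the single-split ranking should be read with caution.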
