# Gender classification model using logistic regression

## When to use logistic regression
    - To solve **binary** classification problems. (It can also be used for multiclass problems, but that takes more work.)
## Assumptions
    - The dataset should not contain outliers. If any exist, remove them. A boxplot can be used to identify outliers.
    - The correlation **among independent variables** should not be high (ideally below 0.9). A correlation matrix can be used to identify highly correlated independent variables.
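
Both checks can be sketched in a few lines. This is a minimal illustration on made-up numeric data, not part of the original notebook: the `features` frame and the helper names `highly_correlated_pairs` and `iqr_outlier_mask` are assumptions, and the 1.5 × IQR rule is simply the whisker rule a boxplot draws.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features; "b" is built to be almost a copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_1, col_2, |r|) for every column pair with |r| above threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

def iqr_outlier_mask(values, k=1.5):
    """Flag points outside the boxplot whiskers (Q1 - k*IQR, Q3 + k*IQR)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

pairs = highly_correlated_pairs(features)
mask = iqr_outlier_mask([1, 2, 3, 4, 100])
print(pairs)
print(mask)
```

Only the pair ("a", "b") should be reported, and only the value 100 should be flagged as an outlier.
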
## How to test goodness of fit?
    - R-squared value (its use with logistic regression is controversial; it may or may not be informative)
    - HL test (Hosmer-Lemeshow; works for binary logistic regression only)
    - Confusion matrix
    
### HL-test
    - The HL test reports three values:
        1. Chi-square
        2. DF (degrees of freedom)
        3. P-value
    - For a good fit, the chi-square statistic should be small and the p-value large; a p-value close to 1 indicates a well-fitting logistic regression model.
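
The three values can be computed as in the sketch below. It is not part of the original notebook and makes several assumptions: the function name `hosmer_lemeshow` is hypothetical, observations are sorted by predicted probability and split into 10 roughly equal groups, and the degrees of freedom are taken as groups minus 2, which is the usual convention. The data is simulated.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Return (chi-square, degrees of freedom, p-value) for a binary model."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    order = np.argsort(y_prob)              # sort observations by predicted risk
    stat = 0.0
    for idx in np.array_split(order, groups):
        n = len(idx)
        observed = y_true[idx].sum()        # observed positives in the group
        expected = y_prob[idx].sum()        # expected positives in the group
        mean_p = expected / n
        # Positive and negative cells combined: (O - E)^2 / (n * p * (1 - p))
        stat += (observed - expected) ** 2 / (n * mean_p * (1 - mean_p))
    dof = groups - 2
    p_value = chi2.sf(stat, dof)            # upper-tail probability
    return stat, dof, p_value

# Simulated, well-calibrated predictions (hypothetical data)
rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=500)
labels = rng.binomial(1, probs)
stat, dof, p_value = hosmer_lemeshow(labels, probs)
print(stat, dof, p_value)
```

With a test set as small as the 13 rows used later in this notebook, 10 groups would hold one or two observations each, so the statistic would not be reliable there.
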
    

#### Importing libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

#### Reading dataset

In [2]:
df = pd.read_csv('./dataset/dataset.csv')
df.head()

Unnamed: 0,Favorite Color,Favorite Music Genre,Favorite Beverage,Favorite Soft Drink,Gender
0,Cool,Rock,Vodka,7UP/Sprite,F
1,Neutral,Hip hop,Vodka,Coca Cola/Pepsi,F
2,Warm,Rock,Wine,Coca Cola/Pepsi,F
3,Warm,Folk/Traditional,Whiskey,Fanta,F
4,Cool,Rock,Vodka,Coca Cola/Pepsi,F


#### Category encoding

In [3]:
mapping_dict = {}
for each_column in df.columns:
    df[each_column] = df[each_column].astype('category')
    mapping_dict[each_column] = dict(enumerate(df[each_column].cat.categories))
    df[each_column] = df[each_column].cat.codes
print(mapping_dict)
df.head()

{'Favorite Color': {0: 'Cool', 1: 'Neutral', 2: 'Warm'}, 'Favorite Music Genre': {0: 'Electronic', 1: 'Folk/Traditional', 2: 'Hip hop', 3: 'Jazz/Blues', 4: 'Pop', 5: 'R&B and soul', 6: 'Rock'}, 'Favorite Beverage': {0: 'Beer', 1: "Doesn't drink", 2: 'Other', 3: 'Vodka', 4: 'Whiskey', 5: 'Wine'}, 'Favorite Soft Drink': {0: '7UP/Sprite', 1: 'Coca Cola/Pepsi', 2: 'Fanta', 3: 'Other'}, 'Gender': {0: 'F', 1: 'M'}}


Unnamed: 0,Favorite Color,Favorite Music Genre,Favorite Beverage,Favorite Soft Drink,Gender
0,0,6,3,0,0
1,1,2,3,1,0
2,2,6,5,1,0
3,2,1,4,2,0
4,0,6,3,1,0
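
The mapping dictionary makes the encoding reversible. The sketch below repeats the same encoding on a tiny made-up frame (the `frame` data and the `decoded` name are assumptions, not from the notebook) and maps the integer codes back to their labels.

```python
import pandas as pd

# Hypothetical miniature version of the dataset
frame = pd.DataFrame({
    "Favorite Color": ["Cool", "Warm", "Neutral"],
    "Gender": ["F", "M", "F"],
})

mapping = {}
for col in frame.columns:
    cat = frame[col].astype("category")
    mapping[col] = dict(enumerate(cat.cat.categories))  # code -> label
    frame[col] = cat.cat.codes                          # label -> code

# Decode each column with its own code -> label dictionary
decoded = frame.apply(lambda s: s.map(mapping[s.name]))
print(frame)
print(decoded)
```

Categories are sorted alphabetically before codes are assigned, which is why 'F' becomes 0 and 'M' becomes 1 in the output above.
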


#### Splitting data

In [4]:
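# Prepend a column of ones so that theta[0] acts as the bias (intercept) term
series = pd.Series(np.ones(df.shape[0]))
df = pd.concat([series, df], axis=1)

# Random 80/20 split; no random_state is set, so the rows differ between runs
train_x = df.sample(frac=0.8)
test_x = df.drop(train_x.index)

# Gender (the last column) is the target; all other columns are features
train_y = train_x.Gender
train_x = train_x.iloc[:, :-1]
train_x = train_x.to_numpy()
train_y = train_y.to_numpy()

test_y = test_x.Gender
test_x = test_x.iloc[:, :-1]
test_x = test_x.to_numpy()
test_y = test_y.to_numpy()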

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(53, 5) (53,)
(13, 5) (13,)
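
Because the split is random, the reported metrics change from run to run. One way to make it repeatable is to pass a seed to `DataFrame.sample`. The sketch below uses a made-up 66-row frame (the size is inferred from the shapes printed above) and a hypothetical helper `split`; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the encoded dataset: 66 rows, 4 features, 1 target
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 3, size=(66, 4)),
                    columns=["f1", "f2", "f3", "f4"])
data["Gender"] = rng.integers(0, 2, size=66)

def split(frame, frac=0.8, seed=42):
    """Sample `frac` of the rows for training; the rest become the test set."""
    train = frame.sample(frac=frac, random_state=seed)
    test = frame.drop(train.index)
    return train, test

train_a, test_a = split(data)
train_b, test_b = split(data)
print(train_a.shape, test_a.shape)
print(train_a.index.equals(train_b.index))  # same rows on every call
```
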


#### Initializing theta

In [5]:
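# theta = np.random.rand(5)
# Fixed initial weights, one per column of train_x; stored as a NumPy array so vector arithmetic applies
theta = np.array([0.55808994, 0.19248718, -0.22750871, -0.2472149, 0.63575857])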

#### Sigmoid

In [6]:
def sigmoid(z):
    return (1 / (1 + np.exp(-z)))

#### Hypothesis

In [7]:
def hypothesis(x, theta):
    return sigmoid(np.matmul(x, theta))

#### Gradient descent

In [8]:
def gradient_descent(alpha, m, prediction, actual, X):
    # One update step: the gradient X^T (prediction - actual) / m, scaled by the learning rate alpha
    return (alpha / m) * np.matmul(X.T, prediction - actual)
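
The update above matches the gradient of the usual cross-entropy (log-loss) cost. The notebook does not state its cost function, so the cost $J(\theta)$ below is an assumption; $X$ is the feature matrix with the leading column of ones, $y$ the label vector, $m$ the number of rows, and $\alpha$ the learning rate.

```latex
\begin{aligned}
h_\theta(X) &= \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \Big] \\
\nabla_\theta J(\theta) &= \frac{1}{m} X^{\top} \big( h_\theta(X) - y \big) \\
\theta &\leftarrow \theta - \frac{\alpha}{m} X^{\top} \big( h_\theta(X) - y \big)
\end{aligned}
```

The last line is what `theta -= gradient_descent(...)` applies on each epoch.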

#### Train

In [9]:
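def train(alpha, epoch, x, y, theta):
    # Work on a float array copy so the update also works when theta is passed as a list
    theta = np.array(theta, dtype=float)
    for _ in range(epoch):
        prediction = hypothesis(x, theta)
        theta -= gradient_descent(alpha, x.shape[0], prediction, y, x)
    # Return the learned weights; the caller's theta is not modified
    return theta

#### Execution

In [10]:
theta = train(0.001, 1000, train_x, train_y, theta)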

#### Evaluation: Confusion matrix

In [11]:
# Threshold the predicted probabilities at 0.5 to get class labels (0 = F, 1 = M)
y_pred = (hypothesis(test_x, theta) >= 0.5).astype(int)
print(classification_report(test_y, y_pred))
print('accuracy: {}'.format(accuracy_score(test_y, y_pred)))
# Note (earlier run): Acc: 85% | theta = [ 0.55808994  0.19248718 -0.22750871 -0.2472149   0.63575857]

              precision    recall  f1-score   support

           0       0.71      0.56      0.63         9
           1       0.33      0.50      0.40         4

    accuracy                           0.54        13
   macro avg       0.52      0.53      0.51        13
weighted avg       0.60      0.54      0.56        13

accuracy: 0.5384615384615384
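
The report above summarizes a confusion matrix without printing it. The sketch below rebuilds one from the per-class counts implied by the report (support 9 and 4, recall 0.56 and 0.50). The label arrays are illustrative reconstructions, not the notebook's actual test vectors, and the helper `confusion` is hypothetical; `sklearn.metrics.confusion_matrix` uses the same row and column layout.

```python
import numpy as np

# Illustrative labels rebuilt from the counts in the report;
# the order of samples is an assumption, only the counts matter.
y_true = np.array([0] * 9 + [1] * 4)
y_hat = np.array([0] * 5 + [1] * 4 + [0] * 2 + [1] * 2)

def confusion(y_true, y_hat, n_classes=2):
    """Rows are actual classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_hat), 1)   # count each (actual, predicted) pair
    return matrix

cm = confusion(y_true, y_hat)
accuracy = np.trace(cm) / cm.sum()          # correct predictions lie on the diagonal
print(cm)
print('accuracy: {}'.format(accuracy))
```

The diagonal holds 5 + 2 = 7 correct predictions out of 13, which agrees with the accuracy of 0.54 reported above.
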


#### Evaluation: R-squared

In [12]:
def r2_score(y_true, y_pred):
    # rss: residual sum of squares; residual (error) = y_true - y_pred
    rss = np.sum((y_true - y_pred) ** 2)
    # tss: total sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - (rss / tss)
print('R-squared score: {}'.format(r2_score(test_y, y_pred)))

R-squared score: -1.1666666666666665
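
The negative score follows directly from the counts in the classification report. With hard 0/1 predictions, every misclassified sample adds 1 to the residual sum of squares, and the report implies 6 errors among 13 test samples, 4 of which belong to class 1:

```latex
\begin{aligned}
\mathrm{RSS} &= \sum_i (y_i - \hat{y}_i)^2 = 6 \\
\bar{y} &= \tfrac{4}{13}, \qquad
\mathrm{TSS} = 9\left(\tfrac{4}{13}\right)^2 + 4\left(\tfrac{9}{13}\right)^2 = \tfrac{36}{13} \approx 2.77 \\
R^2 &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{6}{36/13} = -\tfrac{7}{6} \approx -1.1667
\end{aligned}
```

A negative value means the thresholded predictions have a larger squared error than always predicting the mean label, which is one reason R-squared is a questionable fit measure for a classifier.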
