<a href="https://colab.research.google.com/github/amirFirdaus39/Time-to-Tie-the-Knot/blob/main/Time_To_Tie_The_Knot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time to Tie the Knot
## Predicting how long it will take you to marry your loved one!

By: Amir Firdaus, Imran, Zariff, Xuhuipin, XueBing

In [None]:
# used for manipulating directory paths
import os
 
# Scientific and vector computation for python
import numpy as np
 
# Statistic library
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import statsmodels.api as sm


### **9 Features to be used for the model:**

x1-How many years did you know each other before becoming a couple? (years_knowing)

x2-How many partners did you have before in total? (total_partners)

x3-How much do you care about money? (money)

x4-How ready are you in having children at that time? (ready_children) 

x5-How similar are your hobbies/interest with your partner? (hobbies)

x6-Do you communicate well with your partner during couple? (communicate)
 
x7-How is your financial status during couple relationship? (financial)

x8-What is the age gap (in years) between you and your partner? (age_gap)

x9-How well is you and your partner relationship with respective in-laws? (inlaw)



# Testing Different Kind of Models with Different Features

Model 1: Using the same data set for training and testing (70 samples)

Model 2: Using 45 samples for training and 25 samples for testing

Model 3: Removing feature hobbies

Model 4: Removing feature total_partners

Model 5: Removing feature communicate

Model 6: Removing feature age_gap

Model 7: Removing feature money


### Model 1. Using the same dataset for training and testing (70 samples).

In [None]:
# Load data

data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
X = data[:, :9]
y = data[:, 9]
m = y.size

In [None]:
def featureNormalize(X):
    # Standardise every column to zero mean and unit standard deviation.
    # mu and sigma are returned so that new inputs can be scaled the same way.
    mu = np.mean(X, axis = 0)
    sigma = np.std(X, axis = 0)
    X_norm = (X - mu) / sigma
    
    return X_norm, mu, sigma

In [None]:
# call featureNormalize on the loaded data
X_norm, mu, sigma = featureNormalize(X)
# Add intercept term to X
X = np.concatenate([np.ones((m, 1)), X_norm], axis=1)

In [None]:
#calculate the cost for the model
def computeCostMulti(X, y, theta):

    m = y.shape[0] # number of training examples
    # cost = half of the mean squared residual
    J = (1/(2 * m)) * np.sum(np.square(np.dot(X, theta) - y))
    return J

#performing gradient descent on the multi regression model
def gradientDescentMulti(X, y, theta, alpha, num_iters):

    # Initialize some useful values
    m = y.shape[0] # number of training examples
    # make a copy of theta, which will be updated by gradient descent
    theta = theta.copy()
    J_history = []
    for i in range(num_iters):
        theta = theta - (alpha / m) * (np.dot(X, theta) - y).dot(X)

        # save the cost J in every iteration
        J_history.append(computeCostMulti(X, y, theta))
    return theta, J_history

#predict data
def predictAll(X_norm,size,theta):
    X_norm = np.concatenate([np.ones((size,1)),X_norm], axis = 1)
    pred = X_norm.dot(theta)
    
    return pred
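
A quick way to check that 200 iterations at alpha = 0.1 are enough is to compare gradient descent with a least-squares solve. The sketch below does this on synthetic data, not on the survey: the data, the coefficients in `true_theta` and the helper name `gradient_descent` are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=200):
    """Batch gradient descent for linear regression, recording the cost each iteration."""
    m = y.shape[0]
    theta = np.zeros(X.shape[1])
    cost_history = []
    for _ in range(num_iters):
        residual = X.dot(theta) - y
        theta = theta - (alpha / m) * residual.dot(X)
        cost_history.append(np.sum(np.square(X.dot(theta) - y)) / (2 * m))
    return theta, cost_history

# Synthetic stand-in for the survey (hypothetical values, roughly standardised features).
rng = np.random.default_rng(0)
features = rng.normal(size=(70, 3))
true_theta = np.array([4.0, -0.5, 0.3, -0.2])
X_demo = np.concatenate([np.ones((70, 1)), features], axis=1)
y_demo = X_demo.dot(true_theta) + rng.normal(scale=0.1, size=70)

theta_gd, cost_history = gradient_descent(X_demo, y_demo)
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print("iterations recorded:", len(cost_history))
print("largest gap to least squares:", np.max(np.abs(theta_gd - theta_ls)))
```

If the gap is tiny and the recorded cost keeps falling, the learning rate and iteration count are probably adequate for data on this scale.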

In [None]:
# Settings
# Choose some alpha value
alpha = 0.1
num_iters = 200
# init theta and run gradient descent
theta = np.zeros(10)
theta, J_history = gradientDescentMulti(X, y, theta, alpha, num_iters)

In [None]:
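#evaluating the accuracy of the model using RMSE, RMSPE, median absolute percentage error and the "almost correct" rate
def evaluate(y_test,y_pred):
  m_test = y_test.size
  #calculating rmse as the square root of the mean squared error
  #(the squared=False argument is deprecated in recent scikit-learn releases)
  rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  print("RMSE: {:.2f}".format(rmse))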

  #finding error percentage using the percentage error rmspe
  eps = 1e-10
  rmspe = (np.sqrt(np.mean(np.square((y_test - y_pred)/(y_test+eps))))) 
  print("RMSPE: {:.2f}%".format((rmspe)*100)) 
  print("Accuracy using RMSPE: {:.2f}%\n".format((1- rmspe)*100))

  #finding error rate using the median absolute percentage error (printed under the label MAPE)
  mape = np.median(abs((y_test - y_pred)/y_test))
  print("MAPE: {:.2f}%".format((mape) * 100))
  print("Accuracy using MAPE: {:.2f}%\n".format((1- mape)*100))

  #finding accuracy using almost correct method (we take it as correct if the difference between true and predicted less than 25%)
  almost = np.zeros(m_test)
  for i in range(m_test):
      if abs((y_test[i] - y_pred[i])/y_test[i]) < 0.25:
          almost[i] = 1
      else:
          almost[i] = 0
      
  print("Accuracy using almost correct: {:.2f}%\n".format(np.mean(almost == 1) * 100))
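
To make the four metrics concrete, here is a small worked example on made-up numbers (the arrays `y_true` and `y_hat` are hypothetical, not survey values). Note that the "almost correct" test uses a strict `< 0.25`, so a prediction that is off by exactly 25% counts as wrong.

```python
import numpy as np

# Hypothetical true and predicted years, chosen so the arithmetic is easy to follow.
y_true = np.array([2.0, 4.0, 5.0, 10.0])
y_hat = np.array([2.5, 3.0, 5.0, 8.0])

# Absolute percentage error per sample: 0.25, 0.25, 0.0, 0.2
pct_err = np.abs((y_true - y_hat) / y_true)

rmse = np.sqrt(np.mean(np.square(y_true - y_hat)))   # error in years
rmspe = np.sqrt(np.mean(np.square(pct_err)))         # error as a fraction of the true value
mdape = np.median(pct_err)                           # median absolute percentage error
almost_correct = np.mean(pct_err < 0.25)             # share of samples within 25%

print("RMSE: {:.2f}".format(rmse))
print("RMSPE: {:.2f}%".format(rmspe * 100))
print("Median APE: {:.2f}%".format(mdape * 100))
print("Almost correct: {:.2f}%".format(almost_correct * 100))
```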

In [None]:
#evaluation metrics

#predicted value
y_pred = predictAll(X_norm,m,theta)
evaluate(y,y_pred)


RMSE: 0.80
RMSPE: 19.94%
Accuracy using RMSPE: 80.06%

MAPE: 14.25%
Accuracy using MAPE: 85.75%

Accuracy using almost correct: 78.57%



In [None]:
#display the ols summary
est = sm.OLS(y, X)
est2 = est.fit()
print(est2.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.537
Model:                            OLS   Adj. R-squared:                  0.467
Method:                 Least Squares   F-statistic:                     7.730
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           1.91e-07
Time:                        12:59:28   Log-Likelihood:                -83.742
No. Observations:                  70   AIC:                             187.5
Df Residuals:                      60   BIC:                             210.0
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.8229      0.103     36.996      0.0

Only x7(financial) and x9(inlaw) have a p value < 0.05.

highest p value: x3(money) -> x2(total_partners) -> x3(money) -> x6(communicate) -> x1(years_knowing) -> x8(age_gap) -> x4(ready_children)


# Model 2. Using 45 samples for training, 25 for testing

In [None]:
# Load data
data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45
X = data[:split, :9]
y = data[:split, 9]
X_test =  data[split:, :9]
y_test = data[split:, 9]

In [None]:
#Creating a function model that runs all the relevant code to train and test a model
def model(X,y,X_test,y_test,numtheta):
  m = y.size
  m_test = y_test.size
  
  X_norm, mu, sigma = featureNormalize(X)
  X = np.concatenate([np.ones((m, 1)), X_norm], axis=1)

  # Settings
  # Choose some alpha value
  alpha = 0.1
  num_iters = 200
  
  # init theta and run gradient descent
  theta = np.zeros(numtheta)
  theta, J_history = gradientDescentMulti(X, y, theta, alpha, num_iters)

  X_norm_test, mu, sigma = featureNormalize(X_test)
  #predicted value
  y_pred = predictAll(X_norm_test,m_test,theta)
  evaluate(y_test,y_pred)

  #display the ols summary
  est = sm.OLS(y, X)
  est2 = est.fit()
  print(est2.summary())
  
  return theta
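
Note that `model` standardises `X_test` with statistics computed from `X_test` itself, while `theta` was fitted on features scaled with the training statistics. A common alternative is to reuse the training `mu` and `sigma` for the test set. The sketch below shows that variant on synthetic data; the helper names `fit_scaler` and `apply_scaler` are hypothetical, and the results recorded in this notebook were not produced this way.

```python
import numpy as np

def fit_scaler(X_train):
    """Compute per-column mean and standard deviation from the training data only."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_scaler(X, mu, sigma):
    """Standardise any data set with previously fitted statistics."""
    return (X - mu) / sigma

# Synthetic stand-in with the same 45/25 split sizes (hypothetical values).
rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(45, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(25, 3))

mu_train, sigma_train = fit_scaler(X_train_demo)
X_train_scaled = apply_scaler(X_train_demo, mu_train, sigma_train)
X_test_scaled = apply_scaler(X_test_demo, mu_train, sigma_train)

# The training columns are exactly centred; the test columns are only approximately so.
print(X_train_scaled.mean(axis=0))
print(X_test_scaled.mean(axis=0))
```

With only 25 test rows, the test mean and standard deviation can differ noticeably from the training ones, so the two approaches may give different error figures.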

In [None]:
theta2 = model(X,y,X_test,y_test,10)

RMSE: 0.98
RMSPE: 28.86%
Accuracy using RMSPE: 71.14%

MAPE: 23.74%
Accuracy using MAPE: 76.26%

Accuracy using almost correct: 60.00%

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.633
Model:                            OLS   Adj. R-squared:                  0.539
Method:                 Least Squares   F-statistic:                     6.717
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           1.57e-05
Time:                        13:03:24   Log-Likelihood:                -54.056
No. Observations:                  45   AIC:                             128.1
Df Residuals:                      35   BIC:                             146.2
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.

Only x7(financial) has a p value < 0.05.

high p value:
x5(hobbies) --> x2(total_partners) --> x6(communicate) --> x8(age_gap) --> x3(money) --> x1(years_knowing) --> x4(ready_children) --> x9(inlaw)


# Removing features from Model 2 to get a better model
## Model 3. Removing Hobbies

### Features used:

x1-years_knowing

x2-total_partners

x3-money

x4-ready_children

x5-communicate

x6-financial

x7-age_gap

x8-inlaw

~x5-hobbies~

In [None]:
# Load data

data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45

X = data[:split, :4]
X_rest =data[:split, 5:9]
X = np.concatenate([X, X_rest], axis=1)

X_test =  data[split:, :4]
X_test_rest = data[split:, 5:9]
X_test = np.concatenate([X_test, X_test_rest], axis=1)

y = data[:split, 9]
y_test = data[split:, 9]

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)

(45, 8)
(25, 8)
(45,)
(25,)
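
The slice-and-concatenate pattern above has to be re-derived by hand for every model, because the column indices shift after each removal. One possible alternative is to drop columns by name with `np.delete`. This is only a sketch: the `FEATURES` list follows the short names used in this notebook, the helper `drop_features` is hypothetical, and the matrix is a placeholder rather than the survey data.

```python
import numpy as np

# Column order of the first nine columns of survey.txt, as listed at the top of the notebook.
FEATURES = ["years_knowing", "total_partners", "money", "ready_children",
            "hobbies", "communicate", "financial", "age_gap", "inlaw"]

def drop_features(X, names, to_drop):
    """Return X without the named columns, together with the remaining names."""
    drop_idx = [names.index(name) for name in to_drop]
    kept = [name for name in names if name not in to_drop]
    return np.delete(X, drop_idx, axis=1), kept

# Placeholder matrix with the same shape as the 45-row training split.
X_demo = np.arange(45 * 9, dtype=float).reshape(45, 9)

X_reduced, kept_names = drop_features(X_demo, FEATURES, ["hobbies"])
print(X_reduced.shape)
print(kept_names)
```

Calling it again with `["hobbies", "total_partners"]` on the full matrix would give the Model 4 feature set without tracking shifted indices.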


In [None]:
theta3 = model(X,y,X_test,y_test,9)

RMSE: 0.99
RMSPE: 29.06%
Accuracy using RMSPE: 70.94%

MAPE: 23.96%
Accuracy using MAPE: 76.04%

Accuracy using almost correct: 60.00%

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.633
Model:                            OLS   Adj. R-squared:                  0.551
Method:                 Least Squares   F-statistic:                     7.756
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           5.44e-06
Time:                        13:05:19   Log-Likelihood:                -54.085
No. Observations:                  45   AIC:                             126.2
Df Residuals:                      36   BIC:                             142.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.

**Notes:**

adjusted r2:

model 1: 0.467

model 2: 0.539

model 3: 0.551


p value < 0.05:
x6-financial

highest p value: 
x2-total_partners

## Model 4. Removing total_partners

x1-years_knowing

x2-money

x3-ready_children


x4-communicate

x5-financial

x6-age_gap

x7-inlaw

~hobbies~

~total_partners~

In [None]:
# Load data

data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45

Xnew = X[:, :1]
X_rest1 = X[:, 2:8]
Xnew = np.concatenate([Xnew, X_rest1], axis=1)

X_testnew =  X_test[:, :1]
X_test_rest1 = X_test[:, 2:8]
X_testnew = np.concatenate([X_testnew, X_test_rest1], axis=1)

X = Xnew
X_test = X_testnew
y = data[:split, 9]
y_test = data[split:, 9]

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)

(45, 7)
(25, 7)
(45,)
(25,)


In [None]:
theta4 = model(X,y,X_test,y_test,8)

RMSE: 1.00
RMSPE: 29.27%
Accuracy using RMSPE: 70.73%

MAPE: 24.04%
Accuracy using MAPE: 75.96%

Accuracy using almost correct: 60.00%

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.632
Model:                            OLS   Adj. R-squared:                  0.562
Method:                 Least Squares   F-statistic:                     9.066
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           1.79e-06
Time:                        13:07:36   Log-Likelihood:                -54.155
No. Observations:                  45   AIC:                             124.3
Df Residuals:                      37   BIC:                             138.8
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.

**Notes:**

adjusted r2:

model 1: 0.467

model 2: 0.539

model 3 (remove hobbies): 0.551

model 4 (remove total_partners): 0.562

p value < 0.05: financial 

highest p value: communicate

## Model 5. Removing communicate

x1-years_knowing

x2-money

x3-ready_children

x4-financial

x5-age_gap

x6-inlaw

~hobbies~

~total_partners~

~communicate~

In [None]:
# Load data

data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45

Xnew = X[:, :3]
X_rest1 = X[:, 4:7]
Xnew = np.concatenate([Xnew, X_rest1], axis=1)

X_testnew =  X_test[:, :3]
X_test_rest1 = X_test[:, 4:7]
X_testnew = np.concatenate([X_testnew, X_test_rest1], axis=1)

X = Xnew
X_test = X_testnew
y = data[:split, 9]
y_test = data[split:, 9]

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)


(45, 6)
(25, 6)
(45,)
(25,)


In [None]:
theta5 = model(X,y,X_test,y_test,7)

RMSE: 1.00
RMSPE: 29.77%
Accuracy using RMSPE: 70.23%

MAPE: 25.57%
Accuracy using MAPE: 74.43%

Accuracy using almost correct: 48.00%

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.624
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     10.52
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           7.29e-07
Time:                        13:12:42   Log-Likelihood:                -54.607
No. Observations:                  45   AIC:                             123.2
Df Residuals:                      38   BIC:                             135.9
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.

**Notes:**

adjusted r2:

model 1: 0.467

model 2: 0.539

model 3 (remove hobbies): 0.551

model 4 (remove total_partners): 0.562

model 5 (remove communicate): 0.565 (adjusted r2 still increases, but the model has 4 negative coefficients and only 2 positive ones, which can easily lead to a negative prediction, so it is not good)

p value < 0.05: financial

highest p value: age_gap

## Model 6. Removing age_gap

x1-years_knowing

x2-money

x3-ready_children

x4-financial

x5-inlaw

~hobbies~

~total_partners~

~communicate~

~age_gap~

In [None]:
# Load data

data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45

Xnew = X[:, :4]
X_rest1 = X[:, 5:6]
Xnew = np.concatenate([Xnew, X_rest1], axis=1)

X_testnew =  X_test[:, :4]
X_test_rest1 = X_test[:, 5:6]
X_testnew = np.concatenate([X_testnew, X_test_rest1], axis=1)

X = Xnew
X_test = X_testnew
y = data[:split, 9]
y_test = data[split:, 9]

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)

(45, 5)
(25, 5)
(45,)
(25,)


In [None]:
model(X,y,X_test,y_test,6)

RMSE: 0.99
RMSPE: 30.14%
Accuracy using RMSPE: 69.86%

MAPE: 24.08%
Accuracy using MAPE: 75.92%

Accuracy using almost correct: 56.00%

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.619
Model:                            OLS   Adj. R-squared:                  0.571
Method:                 Least Squares   F-statistic:                     12.70
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           2.39e-07
Time:                        13:13:27   Log-Likelihood:                -54.893
No. Observations:                  45   AIC:                             121.8
Df Residuals:                      39   BIC:                             132.6
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.

array([ 4.03555555, -0.335058  ,  0.21619282, -0.36343106, -0.459909  ,
       -0.31950202])

**Notes:**

adjusted r2:

model 1: 0.467

model 2: 0.539

model 3 (remove hobbies): 0.551

model 4 (remove total_partners): 0.562 (4 negative, 3 positive coefficients)

model 5 (remove communicate): 0.565 (adjusted r2 still increases, but the model has 4 negative coefficients and only 2 positive ones, which can easily lead to a negative prediction, so it is not good)

model 6 (remove age_gap): 0.571 (adjusted r2 still increases, but the model has 4 negative coefficients and only 1 positive one, which can easily lead to a negative prediction, so it is not good)

p value < 0.05: financial, ready_children

highest p value: money

## Model 7. Removing money

x1-years_knowing

x2-ready_children

x3-financial

x4-inlaw

~hobbies~

~total_partners~

~communicate~

~age_gap~

~money~

In [None]:
# Load data

data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45

Xnew = X[:, :1]
X_rest1 = X[:, 2:5]
Xnew = np.concatenate([Xnew, X_rest1], axis=1)

X_testnew =  X_test[:, :1]
X_test_rest1 = X_test[:, 2:5]
X_testnew = np.concatenate([X_testnew, X_test_rest1], axis=1)

X = Xnew
X_test = X_testnew
y = data[:split, 9]
y_test = data[split:, 9]

print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)

(45, 4)
(25, 4)
(45,)
(25,)


In [None]:
model(X,y,X_test,y_test,5)

RMSE: 0.92
RMSPE: 27.05%
Accuracy using RMSPE: 72.95%

MAPE: 19.97%
Accuracy using MAPE: 80.03%

Accuracy using almost correct: 56.00%

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.603
Model:                            OLS   Adj. R-squared:                  0.563
Method:                 Least Squares   F-statistic:                     15.17
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           1.25e-07
Time:                        13:14:20   Log-Likelihood:                -55.859
No. Observations:                  45   AIC:                             121.7
Df Residuals:                      40   BIC:                             130.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.

array([ 4.03555555, -0.31163076, -0.28973307, -0.38491554, -0.3667786 ])

**Notes**:

adjusted r2:

model 1: 0.467

model 2: 0.539

model 3 (remove hobbies): 0.551

model 4 (remove total_partners): 0.562 (4 negative, 3 positive coefficients)

model 5 (remove communicate): 0.565 (adjusted r2 still increases, but the model has 4 negative coefficients and only 2 positive ones, which can easily lead to a negative prediction, so it is not good)

model 6 (remove age_gap): 0.571 (adjusted r2 still increases, but the model has 4 negative coefficients and only 1 positive one, which can easily lead to a negative prediction, so it is not good)

model 7 (remove money): 0.563 (adjusted r2 decreases, so it is not good)

p value < 0.05: financial, inlaw

highest p value: years_knowing
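
The adjusted r2 values compared in these notes follow from the plain R-squared, the number of observations n and the number of features p as 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below recomputes them from the figures printed in the OLS summaries. Because the summaries show R-squared rounded to three decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p features (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# (model, R-squared, observations, features, reported adjusted R-squared),
# all copied from the OLS summaries in this notebook.
reported = [
    ("model 1", 0.537, 70, 9, 0.467),
    ("model 2", 0.633, 45, 9, 0.539),
    ("model 3", 0.633, 45, 8, 0.551),
    ("model 4", 0.632, 45, 7, 0.562),
    ("model 5", 0.624, 45, 6, 0.565),
    ("model 6", 0.619, 45, 5, 0.571),
    ("model 7", 0.603, 45, 4, 0.563),
]

recomputed = [adjusted_r2(r2, n, p) for _, r2, n, p, _ in reported]
for (name, _, _, _, adj), value in zip(reported, recomputed):
    print("{}: recomputed {:.3f}, reported {:.3f}".format(name, value, adj))
```

The formula also shows why adjusted r2 can rise while R-squared falls: removing a weak feature lowers p, which shrinks the penalty factor more than the fit worsens.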

## Based on the 7 models we tested, we decided to go with Model 4, which uses 7 features.

# Finalised Model:

**7 Features to be used for the model**:

x1 - years knowing each other

x2 - how much you care about money

x3 - readiness to have children

x4 - how well you communicate with your partner

x5 - your financial status

x6 - age gap

x7 - relationship with in-law

In [None]:
data = np.loadtxt(os.path.join('survey.txt'), delimiter=',')
split = 45

X = data[:split, :1]
X_rest = data[:split, 2:4]
X = np.concatenate([X, X_rest], axis=1)

X_rest2 = data[:split, 5:9]
X = np.concatenate([X, X_rest2], axis=1)
y = data[:split, 9]
m = y.size

print(X.shape)
print(y.shape)


(45, 7)
(45,)


In [None]:
#to predict 1 new input
def predict(x1,x2,x3,x4,x5,x6,x7):
    x1_norm = ((x1 - mu[0]) / sigma[0]) 
    x2_norm = ((x2 - mu[1]) / sigma[1])
    x3_norm = ((x3 - mu[2]) / sigma[2])
    x4_norm = ((x4 - mu[3]) / sigma[3])
    x5_norm = ((x5 - mu[4]) / sigma[4])
    x6_norm = ((x6 - mu[5]) / sigma[5])
    x7_norm = ((x7 - mu[6]) / sigma[6])
    
    #time taken to get married
    time = np.array([1.0, x1_norm , x2_norm, x3_norm , x4_norm, x5_norm , 
                      x6_norm, x7_norm]).dot(theta)
    #value for each input given
    value = np.array([x1,x2,x3,x4,x5,x6,x7])
    
    return time,value
  

In [None]:
# call featureNormalize on the loaded data
X_norm, mu, sigma = featureNormalize(X)
# Add intercept term to X
X = np.concatenate([np.ones((m, 1)), X_norm], axis=1)

# Settings
# Choose some alpha value
alpha = 0.1
num_iters = 200
# init theta and run gradient descent
theta = np.zeros(8)
theta, J_history = gradientDescentMulti(X, y, theta, alpha, num_iters)
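
The `predict` function normalises each of the seven answers on its own line. The same calculation can be written for any number of features with array arithmetic. This is a sketch only: the helper name `predict_years` and the values in `mu_demo`, `sigma_demo` and `theta_demo` are hypothetical, not the fitted parameters of the final model.

```python
import numpy as np

def predict_years(x, mu, sigma, theta):
    """Scale one raw answer vector with the training statistics and apply theta."""
    x_norm = (np.asarray(x, dtype=float) - mu) / sigma
    # Prepend 1.0 for the intercept term, then take the dot product with theta.
    return float(np.concatenate([[1.0], x_norm]).dot(theta))

# Hypothetical training statistics and coefficients for seven features.
mu_demo = np.array([2.0, 5.0, 6.0, 7.0, 6.0, 2.0, 7.0])
sigma_demo = np.array([1.0, 2.0, 2.0, 1.0, 2.0, 1.0, 2.0])
theta_demo = np.array([4.0, -0.3, 0.2, -0.4, -0.1, -0.5, 0.1, -0.3])

# An input equal to the training mean normalises to zeros, so only the intercept remains.
years_at_mean = predict_years(mu_demo, mu_demo, sigma_demo, theta_demo)
# An input one standard deviation above the mean in every feature sums all coefficients.
years_one_sigma = predict_years(mu_demo + sigma_demo, mu_demo, sigma_demo, theta_demo)

print(years_at_mean)
print(years_one_sigma)
```

This also makes the role of the intercept visible: it is the predicted time for a respondent whose answers all equal the training averages.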

In [None]:
  #displaying time taken to get married according to new input 

def result(prediction, value):
  print('How many years did you know each other before becoming a couple? {:.0f}'.format(value[0]))
  print('How much do you care about money? {:.0f}'.format(value[1]))
  print('How ready are you in having children at that time? {:.0f} '.format(value[2]))
  print('How well do you communicate with your partner? {:.0f}'.format(value[3]))
  print('How is your financial status during couple relationship? {:.0f}'.format(value[4]))
  print('What is the age gap (in years) between you and your partner? {:.0f}'.format(value[5]))
  print('How well is you and your partner relationship with respective in-laws? {:.0f}'.format(value[6]))

  print('===============================================================================')
  if prediction < 1:
    print('Predicted time taken to get married: < 1 year!!\n')
    print('You have good mindset and preparation for marriage! \nFind a suitable moment and propose to your love one! Good Luck!')
  elif prediction < 4:
    print('Predicted time taken to get married: {:.1f} years!!\n'.format(prediction))
    print('You still have time before getting into marriage. \nPrepare yourself in becoming a better partner during this time and \nstrike when the time is ripe!')
  else:
    print('Predicted time taken to get married: {:.1f} years!!\n'.format(prediction))
    print('Hmmm, that is long time to go. \nAll I can say is to get to know your partner more,\nhave a stable job and income and get close your partner\'s family! \nYou can do it!')
  print('===============================================================================')


In [None]:
prediction1, value1 = predict(7,6,9,7,9,0,9)  #fastest example
result(prediction1,value1)

How many years did you know each other before becoming a couple? 7
How much do you care about money? 6
How ready are you in having children at that time? 9 
How well do you communicate with your partner? 7
How is your financial status during couple relationship? 9
What is the age gap (in years) between you and your partner? 0
How well is you and your partner relationship with respective in-laws? 9
Predicted time taken to get married: < 1 year!!

You have good mindset and preparation for marriage! 
Find a suitable moment and propose to your love one! Good Luck!


In [None]:
prediction2, value2 = predict(2,7,6,7,8,1,8) #medium example
result(prediction2,value2)

How many years did you know each other before becoming a couple? 2
How much do you care about money? 7
How ready are you in having children at that time? 6 
How well do you communicate with your partner? 7
How is your financial status during couple relationship? 8
What is the age gap (in years) between you and your partner? 1
How well is you and your partner relationship with respective in-laws? 8
Predicted time taken to get married: 2.5 years!!

You still have time before getting into marriage. 
Prepare yourself in becoming a better partner during this time and 
strike when the time is ripe!


In [None]:
prediction3, value3 = predict(1,7,3,4,4,1,1) #long example
result(prediction3,value3)

How many years did you know each other before becoming a couple? 1
How much do you care about money? 7
How ready are you in having children at that time? 3 
How well do you communicate with your partner? 4
How is your financial status during couple relationship? 4
What is the age gap (in years) between you and your partner? 1
How well is you and your partner relationship with respective in-laws? 1
Predicted time taken to get married: 6.4 years!!

Hmmm, that is long time to go. 
All I can say is to get to know your partner more,
have a stable job and income and get close your partner's family! 
You can do it!
