### I. Explore and prepare the data

#### Data Exploration

In [1]:
import pandas as pd
import numpy as np

wvs = pd.read_csv("wvs.csv", sep="\t")
wvs.head(10)

Unnamed: 0,V2,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_228S8,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1
0,12,1,1,1,-2,1,1,2,1,1,...,3,-3,-3,-3,-3,1,1,0.0,1.0,0.0
1,12,1,2,3,4,2,2,2,2,2,...,3,-3,-3,-3,-3,2,-1,0.0,1.0,0.66
2,12,1,3,2,4,2,1,2,2,2,...,4,1,1,2,-3,1,1,0.0,1.0,0.33
3,12,1,1,3,4,3,1,2,1,2,...,2,2,1,2,-3,1,2,0.0,1.0,0.0
4,12,1,1,1,2,1,1,1,3,2,...,2,2,1,2,-3,1,2,0.0,1.0,0.66
5,12,1,2,2,2,4,1,2,1,2,...,3,2,1,1,-3,1,2,0.0,1.0,0.0
6,12,1,1,1,1,1,1,2,2,1,...,3,2,2,2,-3,1,1,0.0,1.0,0.66
7,12,1,1,1,1,2,2,2,1,2,...,3,1,1,2,-3,2,2,0.0,1.0,0.0
8,12,1,1,1,2,2,2,2,2,2,...,3,2,1,1,-3,-3,-3,0.0,1.0,0.33
9,12,1,1,1,2,1,1,1,1,2,...,3,-3,-3,-3,0,-3,-3,0.0,1.0,0.66


In [2]:
wvs.shape

(90350, 328)

In [3]:
wvs["V204"].value_counts().sort_index()

-5        23
-4      1523
-2      1045
-1      2017
 1     40227
 2      7896
 3      6294
 4      4497
 5      9580
 6      4395
 7      3493
 8      3397
 9      1896
 10     4067
Name: V204, dtype: int64

- 85,742 responses, roughly 95% of the 90,350 rows, are non-missing answers (codes 1 to 10); the negative codes mark missing answers.
- A majority of the responses lean towards abortion being unjustifiable. About 80% of the non-missing responses (68,494) fall in the 1-5 range, and 40,227 of them (about 47%) are a 1, so most respondents hold or tend towards an anti-abortion view.
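
The shares quoted above can be checked directly from the value counts. This is a minimal sketch that re-enters the printed counts by hand; the `counts` Series is an illustrative copy of the output, not a fresh computation on `wvs`.

```python
import pandas as pd

# Counts copied from the V204 value_counts() output above
counts = pd.Series(
    {-5: 23, -4: 1523, -2: 1045, -1: 2017,
     1: 40227, 2: 7896, 3: 6294, 4: 4497, 5: 9580,
     6: 4395, 7: 3493, 8: 3397, 9: 1896, 10: 4067}
)

# Negative codes are treated as missing answers
valid = counts[counts.index > 0]
valid_share = valid.sum() / counts.sum()

# Share of non-missing answers in the 1-5 range (label slice is inclusive)
low_share = valid.loc[1:5].sum() / valid.sum()

print(f"non-missing: {valid.sum()} ({valid_share:.1%})")
print(f"answers 1-5 among non-missing: {low_share:.1%}")
```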

#### Data Cleaning: Deal with missing values

In [4]:
#drop rows based on condition
#Get indexes for which V204 and V2 are not positive
notpositive_ab = wvs[wvs["V204"]<0].index
notpositive_country = wvs[wvs["V2"]<0].index

In [5]:
#drop all rows where V204 is less than 0
df_modified = wvs.drop(notpositive_ab)
df_modified.shape

(85742, 328)

In [6]:
#Drop all rows where V2 is less than zero
df_modified.drop(notpositive_country,inplace=True)
df_modified.shape

(85742, 328)

In [7]:
#drop missing rows
df_modified.dropna(inplace=True)
df_modified.shape

(79267, 328)
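
The three cleaning steps above can also be written as a single boolean filter followed by `dropna`. The sketch below runs on a small invented frame (`toy` and its values are assumptions for illustration), not on the survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wvs; the values are invented for illustration
toy = pd.DataFrame({
    "V204": [1, -1, 5, 10, 3],
    "V2":   [12, 12, 31, 31, -4],
    "V4":   [1.0, 2.0, np.nan, 1.0, 3.0],
})

# Keep rows where V204 and V2 are not negative, then drop rows with NaN
mask = (toy["V204"] >= 0) & (toy["V2"] >= 0)
cleaned = toy[mask].dropna()
print(cleaned)
```

Only rows 0 and 3 survive here: row 1 has a negative V204, row 2 contains a NaN, and row 4 has a negative V2.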

In [8]:
#create binary variable 
df_modified['abortion'] = np.where(df_modified.V204 > 3, 1, 0)
df_modified.sample(10)

Unnamed: 0,V2,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1,abortion
35536,400,1,1,2,1,1,1,1,2,1,...,-4,1,2,-4,2,1,0.0,0.0,0.33,0
61821,643,1,3,3,3,1,3,1,2,2,...,-4,-4,-4,-4,-4,-4,1.0,1.0,0.0,0
37230,398,1,3,1,3,3,3,2,2,1,...,-4,-4,-4,-4,-4,-4,1.0,1.0,0.33,1
85234,840,1,2,1,2,1,1,2,3,2,...,-4,-4,-4,-4,-4,-4,0.0,1.0,0.33,1
59741,642,2,2,1,1,3,3,2,1,1,...,-4,-4,-4,-4,-4,-4,1.0,1.0,0.66,1
36361,400,1,1,3,2,1,1,1,1,1,...,-4,1,3,-4,1,1,0.0,0.0,0.0,0
4996,31,1,1,2,3,1,3,2,1,1,...,-4,-4,-4,-4,-4,-4,1.0,0.0,0.0,1
82386,804,1,1,2,2,1,1,2,2,1,...,-4,-4,-4,-4,-4,-4,0.0,0.0,0.0,1
89094,716,1,3,3,2,1,2,2,2,1,...,-4,-4,-4,-4,-4,-4,0.0,1.0,0.0,0
27317,356,1,1,1,1,1,2,2,2,1,...,-4,-4,-4,-4,-4,-4,0.0,0.0,0.33,0


In [33]:
#pearson for abortion
corr_df = df_modified.corr(method="pearson").loc["abortion"]
#store index of sorted values
index_list = corr_df.abs().sort_values(ascending=False).index

In [34]:
#display based on index
print(corr_df.loc[index_list].reset_index())

        index  abortion
0    abortion  1.000000
1        V204  0.881048
2        V205  0.548653
3        V203  0.485419
4        V206  0.446394
..        ...       ...
324        V7  0.003503
325      V103 -0.003465
326       V17 -0.001757
327   V125_06  0.000761
328      V113 -0.000752

[329 rows x 2 columns]


V205          0.548653   Justifiable: Divorce  
V203          0.485419   Justifiable: Prostitution  
V206          0.446394   Justifiable: Sex before marriage   
    
After V204 itself (from which abortion is derived), the variables above have the strongest correlations with abortion, between 0.45 and 0.55. They are the ratings participants gave for the justifiability of divorce, prostitution and sex before marriage.
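
When only the correlations with one column are needed, `DataFrame.corrwith` avoids building the full correlation matrix. A minimal sketch on invented data follows; the column names are borrowed from the survey, but the values and the way the toy target is generated are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V203", "V205", "V7"])
# Invented binary target that depends on V205 and V203 but not on V7
signal = toy["V205"] + 0.5 * toy["V203"] + rng.normal(size=200)
toy["abortion"] = (signal > 0).astype(int)

# Pearson correlation of every feature column with the target
corr = toy.drop(columns="abortion").corrwith(toy["abortion"])

# Order by absolute value so strong negative correlations are not buried
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```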

In [11]:
#Rename V2 to country
df_modified.rename({"V2":"country"},axis="columns",inplace=True)
df_modified.head(6)

Unnamed: 0,country,V4,V5,V6,V7,V8,V9,V10,V11,V12,...,MN_229A,MN_230A,MN_233A,MN_237B1,MN_249A1,MN_249A3,I_RELIGBEL,I_NORM1,I_VOICE1,abortion
0,12,1,1,1,-2,1,1,2,1,1,...,-3,-3,-3,-3,1,1,0.0,1.0,0.0,0
1,12,1,2,3,4,2,2,2,2,2,...,-3,-3,-3,-3,2,-1,0.0,1.0,0.66,0
2,12,1,3,2,4,2,1,2,2,2,...,1,1,2,-3,1,1,0.0,1.0,0.33,0
3,12,1,1,3,4,3,1,2,1,2,...,2,1,2,-3,1,2,0.0,1.0,0.0,0
4,12,1,1,1,2,1,1,1,3,2,...,2,1,2,-3,1,2,0.0,1.0,0.66,0
5,12,1,2,2,2,4,1,2,1,2,...,2,1,1,-3,1,2,0.0,1.0,0.0,0


In [12]:
#convert country to dummy variable
df_final = pd.get_dummies(df_modified, columns = ['country'])

In [13]:
df_final.shape

(79267, 386)

In [14]:
#list the country dummy columns
country_cols = [col for col in df_final.columns if 'country' in col]
print(country_cols)
print(len(country_cols))

['country_12', 'country_31', 'country_32', 'country_36', 'country_48', 'country_51', 'country_76', 'country_112', 'country_152', 'country_156', 'country_158', 'country_170', 'country_196', 'country_218', 'country_233', 'country_268', 'country_275', 'country_276', 'country_288', 'country_344', 'country_356', 'country_368', 'country_392', 'country_398', 'country_400', 'country_410', 'country_417', 'country_422', 'country_434', 'country_458', 'country_484', 'country_504', 'country_528', 'country_554', 'country_566', 'country_586', 'country_604', 'country_608', 'country_616', 'country_634', 'country_642', 'country_643', 'country_646', 'country_702', 'country_705', 'country_710', 'country_716', 'country_724', 'country_752', 'country_764', 'country_780', 'country_788', 'country_792', 'country_804', 'country_840', 'country_858', 'country_860', 'country_887']
58


In [15]:
#delete one country dummy column
df_final.drop("country_12",axis=1,inplace=True)
#delete V204
df_final.drop("V204",axis=1,inplace=True)
df_final.shape

(79267, 384)

The dataset now has 79,267 rows and 384 columns. Of these, 57 are country dummies: 58 countries, with country_12 dropped as the reference category.
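
Dropping one dummy by hand, as above, can also be done in one step with the `drop_first` argument of `pd.get_dummies`. A minimal sketch on an invented frame (`toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"country": [12, 31, 31, 840], "V4": [1, 2, 1, 3]})

# drop_first=True omits the first category (12 here), which becomes the
# reference level, so the dummies are not perfectly collinear
dummies = pd.get_dummies(toy, columns=["country"], drop_first=True)
print(list(dummies.columns))  # ['V4', 'country_31', 'country_840']
```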

### II. Implement cross validation

In [25]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
import statistics
import numpy as np

def k_fold_cross_validation(k,model,X,Y):
    """
    k = number of folds
    model = unfitted model
    X = features
    Y = target, indexed with the same labels as X
    Perform k fold cross validation of the model with the above parameters
    and return the mean F1 score, accuracy, precision and recall
    """
    #shuffle the row labels once and split them into k disjoint folds
    indices = np.array(X.index.values)
    np.random.shuffle(indices)
    folds = np.array_split(indices, k)
    
    #initialize empty lists to hold the metrics for each fold
    f_metrics = list()
    precision_metrics = list()
    recall_metrics = list()
    accuracy_metrics = list()
    
    for i in range(0,k):
        #use the i-th fold as the validation set
        validation_indices = folds[i]
        
        #use the remaining k-1 folds as the training set
        training_indices = np.concatenate([folds[j] for j in range(0,k) if j != i])
        
        #create training set for X and Y
        train_X = X.loc[training_indices,:]
        train_Y = Y.loc[training_indices]
        
        #create validation set for X and Y
        validation_X = X.loc[validation_indices,:]
        validation_Y = Y.loc[validation_indices]
        
        #fit model 
        model.fit(train_X,train_Y)
    
        #predict outcome on validation set
        predicted = model.predict(validation_X)
        
        #generate metrics
        f_k = f1_score(validation_Y, predicted)
        a_k = accuracy_score(validation_Y, predicted)
        p_k = precision_score(validation_Y, predicted)
        r_k = recall_score(validation_Y, predicted)
        
        #add the metrics for the i-th fold to the metrics lists
        f_metrics.append(f_k)
        accuracy_metrics.append(a_k)
        precision_metrics.append(p_k)
        recall_metrics.append(r_k)
        
    #return the mean of each metric over the k folds
    return statistics.mean(f_metrics), statistics.mean(accuracy_metrics), statistics.mean(precision_metrics), statistics.mean(recall_metrics)
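
In standard k-fold cross-validation the rows are shuffled once and split into k disjoint folds, so every row is used for validation exactly once. The sketch below isolates just that partitioning step on a toy index; `make_folds` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def make_folds(index_values, k, seed=0):
    """Shuffle the labels once and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(index_values))
    return np.array_split(shuffled, k)

folds = make_folds(range(10), k=5)
for i, validation in enumerate(folds):
    # Training labels are all folds except the i-th one
    training = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, sorted(validation.tolist()), len(training))
```

With 10 labels and k = 5, each fold holds 2 validation labels and the other 8 are used for training.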



### III. Find best model

In this section your task is to find which model: k-NN, logistic regression, or SVM works best. 
You will evaluate the model performance using 5-fold cross-validation with accuracy and F-score as the metric using your own CV implementation.

k-NN and SVM are sensitive to the distance metric, so you may also compare normalized and non-normalized features. Check out sklearn.preprocessing.normalize. 
Logistic regression is agnostic with respect to the metric, but may benefit from more similar variable values for numerical reasons.

Some of the methods (k-NN, SVM) are slow to compute, so you may start with a subset of data (say, 5000 random lines only). If everything turns out fine, you increase the data size as far as your computer can go.
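
The cells below take the first 10,000 rows, which, judging by the outputs above, appear to be grouped by country. A random subset, as the task suggests, could be drawn along these lines. This is a sketch on invented data; `X_toy` and `y_toy` are hypothetical stand-ins for the feature matrix and target.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the feature matrix and target (invented values)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_toy = pd.Series(rng.integers(0, 2, size=100))

# Draw a random subset of rows; random_state makes the draw repeatable
X_sub = X_toy.sample(n=20, random_state=1)
# Select the matching targets by index label so X and y stay aligned
y_sub = y_toy.loc[X_sub.index]
print(X_sub.shape, y_sub.shape)
```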

#### Normalize the features

In [17]:
from sklearn import preprocessing

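#separate the features from the target
X = df_final.drop('abortion', axis=1)
#reset the index so y lines up with X_norm, which gets a fresh 0..n-1 index
y = df_final['abortion'].reset_index(drop=True)
#normalize() rescales each row (sample) to unit L2 norm
norm = preprocessing.normalize(X, norm='l2')
X_norm = pd.DataFrame(norm)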
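
Note that `preprocessing.normalize` with `norm='l2'` rescales each row (sample) to unit length; it does not rescale each column. The NumPy-only sketch below reproduces that arithmetic on an invented 2 x 2 matrix.

```python
import numpy as np

# Toy feature matrix: 2 samples (rows) x 2 features (columns)
X_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0]])

# Row-wise L2 normalization: divide each row by its Euclidean length
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)
X_rows = X_toy / row_norms
print(X_rows)  # the first row becomes [0.6, 0.8]
```

Column-wise scaling, for example with `sklearn.preprocessing.StandardScaler`, is a different operation and may be worth comparing.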

#### Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression

#set k fold variable 
k_fold = 5

#create logistic regression model 
logr = LogisticRegression(n_jobs = 10)

#perform cross validation
result,accuracy,precision,recall = k_fold_cross_validation(k_fold, logr, X_norm.iloc[0:10000], y.iloc[0:10000])
print("F1_Score",result)
print("Accuracy",accuracy)
print("Precision",precision)
print("Recall",recall)


F1_Score 0.7213636269246321
Accuracy 0.8106
Precision 0.8017225351383326
Recall 0.6559594228605768


#### KNN 

In [44]:
from sklearn.neighbors import KNeighborsClassifier

knn = range(3,7)

for k in knn:
    knn_model = KNeighborsClassifier(n_neighbors=k,n_jobs=10)
    knn_result,accuracy,precision,recall = k_fold_cross_validation(k_fold, knn_model, X_norm.iloc[0:10000], y.iloc[0:10000])
    print('When data is normalized and k =',k)
    print("F1_Score",knn_result)
    print("Accuracy",accuracy)
    print("Precision",precision)
    print("Recall",recall)
    print("***********")

When data is normalized and k = 3
F1_Score 0.6907527623509588
Accuracy 0.7718999999999999
Precision 0.6944950143703211
Recall 0.6874564736235153
***********
When data is normalized and k = 4
F1_Score 0.6496512592475462
Accuracy 0.7703
Precision 0.7471354001622996
Recall 0.5748808935662406
***********
When data is normalized and k = 5
F1_Score 0.6901111679060885
Accuracy 0.7725
Precision 0.7087285388623378
Recall 0.6724969847370159
***********
When data is normalized and k = 6
F1_Score 0.6761045978594153
Accuracy 0.7787000000000001
Precision 0.7429608443799995
Recall 0.6204172824206102
***********


#### SVM

In [43]:
from sklearn.svm import SVC
kernels = [ 'linear','poly', 'rbf', 'sigmoid']

for kernel in kernels:
    svm_model = SVC(kernel=kernel,max_iter = 1000)
    svm_result,accuracy,precision,recall = k_fold_cross_validation(k_fold, svm_model, X_norm.iloc[0:10000], y.iloc[0:10000])
    print("For kernel=",kernel)
    print("F1_Score",svm_result)
    print("Accuracy",accuracy)
    print("Precision",precision)
    print("Recall",recall)

For kernel= linear
F1_Score 0.6095973975614218
Accuracy 0.5519
Precision 0.4527482459206154
Recall 0.9355988253477325

For kernel= poly
F1_Score 0.6057259558062371
Accuracy 0.5462
Precision 0.44881924357041325
Recall 0.9313231090357944

For kernel= rbf
F1_Score 0.589778550234841
Accuracy 0.491
Precision 0.42126896109398065
Recall 0.9833561878023829

For kernel= sigmoid
F1_Score 0.5995805436731869
Accuracy 0.5021
Precision 0.4309943336169513
Recall 0.9849255540805925

The SVM results are poor: accuracy is only 0.49 to 0.55, precision is low (0.42 to 0.45) and recall is very high (0.93 to 0.98). This pattern suggests that the models predict class = 1 for almost every observation, even though class = 0 is the more common class. I believe a likely contributor is the max_iter = 1000 cap, which can stop the solver before it has converged.

Precision = $TP / (TP + FP)$
Since most observations belong to class = 0, labelling nearly everything as class = 1 produces many false positives. This increases the denominator and decreases precision.

Recall = Sensitivity = $TP / (TP + FN)$
Since very few observations are predicted as class = 0, there are few false negatives, which pushes recall towards 1.
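
To see how a recall near 1 can coexist with low precision, consider a classifier that labels every observation as class 1. The counts below are invented for illustration and are not taken from the survey data.

```python
# Hypothetical validation set: 60 negatives (class 0) and 40 positives (class 1)
y_true = [0] * 60 + [1] * 40
# A degenerate classifier that predicts class 1 for every observation
y_pred = [1] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / len(y_true)
print(precision, recall, accuracy)  # 0.4 1.0 0.4
```

Precision collapses to the share of class 1 in the data, recall is perfect, and accuracy is no better than that same share.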

#### Compare the models

1. Which ones performed the best in terms of accuracy? Which ones in terms of F-score? Did you encounter other kinds of issues with certain models? Which models were fast and which ones slow?
    - Accuracy and F-Score: The logistic regression model had the highest accuracy (0.81) and F-score (0.72), followed by the k-NN models (accuracy 0.77 to 0.78, F-score 0.65 to 0.69) and lastly the SVM models (accuracy 0.49 to 0.55, F-score 0.59 to 0.61).
    - Speed: The logistic regression model was the fastest. The SVM models were faster than the k-NN models. 
    

2. If you have to repeat the exercise with a single model which one will you pick?
    - If I have to repeat the exercise with a single model I would choose the logistic regression model.

### IV. How large a role does country play? 

Does the fact that we include country code in data help us to substantially improve the predictions?

#### Run on all variables and the entire dataset using the best model: Logistic Regression

In [45]:
#create logistic regression model 
logr = LogisticRegression(n_jobs = 10)

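#perform cross validation on the normalized features
#(df_final still contains the target column 'abortion', so it must not be passed as X)
result,accuracy,precision,recall = k_fold_cross_validation(k_fold, logr, X_norm, y)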
print("F1_Score",result)
print("Accuracy",accuracy)
print("Precision",precision)
print("Recall",recall)


F1_Score 0.7536717170654494
Accuracy 0.829443673520878
Precision 0.7927571994404582
Recall 0.7182637373782923


#### Run without country variables and the entire dataset using the best model: Logistic Regression

In [49]:
#find all columns apart from country columns
cols = [c for c in df_final.columns if c.lower()[:7] != 'country']

#create dataset without country columns
df_wc= df_final[cols]

#normalize the data 
X = df_wc.drop('abortion', axis=1)
y = df_wc['abortion'].reset_index(drop=True)
norm = preprocessing.normalize(X, norm='l2')
X_wc = pd.DataFrame(norm)


#perform cross validation
result,accuracy,precision,recall = k_fold_cross_validation(k_fold, logr, X_wc, y)
print("F1_Score",result)
print("Accuracy",accuracy)
print("Precision",precision)
print("Recall",recall)


F1_Score 0.7567499509496556
Accuracy 0.8295193641983096
Precision 0.7958502249991868
Recall 0.721343817512693


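Based on the above results, country information does not noticeably improve the predictions: accuracy is 0.829 both with and without the country dummies, and the F-score is 0.754 with them versus 0.757 without.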